We don't want to introduce pointless delays in throttle_vm_writeout()
when the writeback limits are not yet exceeded, do we?
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
mm/page-writeback.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
--- linux-2.6.23-rc8-mm1.orig/mm/page-writeback.c
+++ linux-2.6.23-rc8-mm1/mm/page-writeback.c
@@ -507,16 +507,6 @@ void throttle_vm_writeout(gfp_t gfp_mask
long background_thresh;
long dirty_thresh;
- if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
- /*
- * The caller might hold locks which can prevent IO completion
- * or progress in the filesystem. So we cannot just sit here
- * waiting for IO to complete.
- */
- congestion_wait(WRITE, HZ/10);
- return;
- }
-
for ( ; ; ) {
get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
@@ -530,6 +520,14 @@ void throttle_vm_writeout(gfp_t gfp_mask
global_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
congestion_wait(WRITE, HZ/10);
+
+ /*
+ * The caller might hold locks which can prevent IO completion
+ * or progress in the filesystem. So we cannot just sit here
+ * waiting for IO to complete.
+ */
+ if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
+ break;
}
}
-
On Thu, 27 Sep 2007 09:50:16 +0800 Reviewed-by: Rik van Riel <riel@redhat.com> -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan -
It could be a good fix for 2.6.22/23. But for -mm, I'm not sure if throttle_vm_writeout() will be eventually removed ;-) -
On Thu, 27 Sep 2007 09:50:16 +0800 This is a pretty major bugfix. GFP_NOIO and GFP_NOFS callers should have been spending really large amounts of time stuck in that sleep. I wonder why nobody noticed this happening. Either a) it turns out that kswapd is doing a good job and such callers don't do direct reclaim much or b) nobody is doing any in-depth kernel instrumentation. Now, how _would_ one notice this problem? We don't have very good tools, really. Booting with "profile=sleep" and looking at the profile data would be one way. Repeatedly doing sysrq-T is another. Perhaps the new lockstat-via-lockdep code would allow this to be observed in some fashion, dunno. Anyway, this patch has the potential to significantly alter the dynamics of the VM behaviour under particular workloads. It might turn up other stuff... -
[ Oh, it's Friday already, so soapbox time i guess. The easily offended please skip this mail ;-) ] People _have_ noticed, and we often ignored them. I can see four fundamental, structural problems: 1) A certain lack of competitive pressure. An MM is too complex and there is no "better Linux MM" to compare against objectively. The BSDs are way too different and it's easy to dismiss even objective comparisons due to the real complexity of the differences. Heck, 2.6.9 is "way too different" and we routinely reject bugreports from such old kernels and lose vital feedback. 2) There is a wide-spread mentality of "you prove that there is a problem" in the MM and elsewhere in the Linux kernel too. While of course objective proof is paramount, we often "hide" behind our self-created complexity of the system (without malice and without realising it!). We've seen that happen in the updatedb discussions and the swap-prefetch discussions. The correct approach would be for the MM folks to be able to tell for just about any workload "this is not our problem", and to have the benefit of the doubt _on the tester's side_. We must not ignore people who tell us that "there is something wrong going on here", just because they are unable to analyze it themselves. Very often where we end up saying "we dont know what's going on here" it's likely _our_ fault. We also must not hide behind "please do these 10 easy steps and 2 kernel recompiles and 10 reboots, only takes half a day, and come back to us once you have the detailed debug data" requests. Instrumentation must be _on by default_ (like SCHED_DEBUG is on by default), which brings us to: 3) Instrumentation and tools. Instrumentation (for example MM delay statistics - like the scheduler delay statistics) give an objective measure to compare kernels against each other. _Smart_ and _easy to use_ and _default enabled_ instrumentation is a must. Not ...
This is a problem with the testers. The barrier for *reporting* problems is really low. And if they're regressions, then it is almost 100% chance of being solved. If we're simply missing the bug reports, I'll again ask for a linux-bugs list which is clear of other crap. Or are people scared of reporting regressions from previous experience? I haven't seen any reason for this and linux-mm is probably one of the most polite lists I have read (when it comes to interacting with non I don't see this as a big problem. The "better Linux MM" is implemented by the patch one is proposing -- whether it be a one liner bugfix, or a complete rewrite. I think it _would_ be really nice to run all interesting workloads on various Linux 26 and 24 kernels, various patchsets, and other open source OSes for the regression testing and cross polination aspects... however this brings us to the main problem: The alternative is completely unscientific chaos. OK, for performance heuristics, it is actually a much grayer "you show that your patch improves something / doesn't harm others" -- this is pretty tough for mm patches at the moment, maybe it could be better but at the end of the day, if something is worth merging then we should be able to actually prove (or have a pretty high confidence) that it is good. If not, then we don't want to merge it, by definition. In the case of the vm, this comes right back to the difficulty of getting a range of meaningful tests. We can definitely improve this situation a great deal, but like the scheduler, the *real* "tests" simply have to be done by users. And unfortunately, a lot of them don't actually test VM changes for a long time after they're merged. This would actually be one area where it might make a lot of sense to coordinate more significant The recent swap prefetch debate showed exactly how the process _should_ work. We discovered a serious issue with metadata caching behaviour, and also the active-inactive list balancing change that makes ...
I totally agree with Ingo here. Having a basic instrumentation that is enabled by default will help to identify code paths causing unexpected delays in the kernel. It will not only identify kernel bugs, but also unexpected behaviors that would be qualified as "quiet bugs" (e.g. long delays). The key aspect that seems to be inherent to this proposal is the need for an extensible instrumentation mechanism that would allow developers to add new instrumentation when it is needed (such as the Linux Kernel Markers on which I have been working for the last year). It will enable them, and testers, to test and benchmark kernel subsystems to detect regressions as well as erratic behaviors. When a problem such as this VM delay is found, we should really be able to ask ourselves : "what instrumentation (and tests) can we add to detect this class of problem automatically in the future ?". Otherwise, we will always be back to the starting point after future changes. When I talk about tests, I mostly refer to tests performed by an out-of-kernel tool analyzing exported traces, since the proliferation of various live tests for each and every subsystems in the default kernel configuration is not something one would sanely consider. Besides automatic testing, extracting traces taken from an instrumented kernel would help users to provide a better feedback to kernel developers by telling us what led to a particular problematic situation. Since the burden of the proof that a particular subsystem is acting in funny ways is left to the users/testers, we cannot expect a useful feedback from them if we don't provide a mechanism that will let them _prove_ that something is really going wrong. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
It is. See: CONFIG_VM_EVENT_COUNTERS and all the other vm specific crap littered in /proc/ (buddyinfo, zoneinfo, meminfo, etc). There is always an issue of sometimes not instrumenting enough basic things... but we fundamentally have always tried to do improve this. We can and do (Andrew did, in this case). -
Isn't the actual instrumentation present in the VM subsystem consisting mostly of event counters ? This kind of profiling provides limited help in following specific delays in the kernel. Martin Bligh's paper "Linux Kernel Debugging on Google-sized clusters", presented at OLS2007, discusses instrumentation that had to be added to the VM subsystem in order to find a race condition in the OOM killer by gathering a trace of the problematic behavior. Gathering a full trace (timestamps and events) seems to be better suited to this kind of timing-related problem. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
These existing VM events/counters could be a good target for a second worked-out example (after blktrace) of reimplementation using generic markers. - FChE -
AFAIK, nobody is nacking patches that add more generally useful instrumentation to the vm. You'll obviously always get problems that you need specialised information to solve. Those kinds of corner cases are probably a good idea for dynamic instrumentation I guess. -
I don't think so (ie. I agree with you). -
