We don't want to introduce pointless delays in throttle_vm_writeout()
when the writeback limits are not yet exceeded, do we?Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
mm/page-writeback.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)--- linux-2.6.23-rc8-mm1.orig/mm/page-writeback.c
+++ linux-2.6.23-rc8-mm1/mm/page-writeback.c
@@ -507,16 +507,6 @@ void throttle_vm_writeout(gfp_t gfp_mask
long background_thresh;
long dirty_thresh;- if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
- /*
- * The caller might hold locks which can prevent IO completion
- * or progress in the filesystem. So we cannot just sit here
- * waiting for IO to complete.
- */
- congestion_wait(WRITE, HZ/10);
- return;
- }
-
for ( ; ; ) {
get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);@@ -530,6 +520,14 @@ void throttle_vm_writeout(gfp_t gfp_mask
global_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
congestion_wait(WRITE, HZ/10);
+
+ /*
+ * The caller might hold locks which can prevent IO completion
+ * or progress in the filesystem. So we cannot just sit here
+ * waiting for IO to complete.
+ */
+ if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
+ break;
}
}-
I don't think so (ie. I agree with you).
-
On Thu, 27 Sep 2007 09:50:16 +0800
This is a pretty major bugfix.
GFP_NOIO and GFP_NOFS callers should have been spending really large
amounts of time stuck in that sleep.I wonder why nobody noticed this happening. Either a) it turns out that
kswapd is doing a good job and such callers don't do direct reclaim much or
b) nobody is doing any in-depth kernel instrumentation.Now, how _would_ one notice this problem? We don't have very good tools,
really. Booting with "profile=sleep" and looking at the profile data would
be one way. Repeatedly doing sysrq-T is another. Perhaps the new
lockstat-via-lockdep code would allow this to be observed in some fashion,
dunno.Anyway, this patch has the potential to significantly alter the dynamics of
the VM behaviour under particular workloads. It might turn up other
stuff...-
[ Oh, it's Friday already, so soapbox time i guess. The easily offended
please skip this mail ;-) ]People _have_ noticed, and we often ignored them. I can see four
fundamental, structural problems:1) A certain lack of competitive pressure. An MM is too complex and
there is no "better Linux MM" to compare against objectively. The
BSDs are way too different and it's easy to dismiss even objective
comparisons due to the real complexity of the differences. Heck,
2.6.9 is "way too different" and we routinely reject bugreports from
such old kernels and lose vital feedback.2) There is a wide-spread mentality of "you prove that there is a
problem" in the MM and elsewhere in the Linux kernel too. While of
course objective proof is paramount, we often "hide" behind our
self-created complexity of the system (without malice and without
realising it!). We've seen that happen in the updatedb discussions
and the swap-prefetch discussions. The correct approach would be for
the MM folks to be able to tell for just about any workload "this is
not our problem", and to have the benefit of the doubt _on the
tester's side_. We must not ignore people who tell us that "there is
something wrong going on here", just because they are unable to
analyze it themselves. Very often where we end up saying "we dont
know what's going on here" it's likely _our_ fault. We also must not
hide behind "please do these 10 easy steps and 2 kernel recompiles
and 10 reboots, only takes half a day, and come back to us once you
have the detailed debug data" requests. Instrumentation must be _on
by default_ (like SCHED_DEBUG is on by default), which brings us to:3) Instrumentation and tools. Instrumentation (for example MM delay
statistics - like the scheduler delay statistics) give an objective
measure to compare kernels against each other. _Smart_ and _easy to
use_ and _default enabled_ instrumentation is a must. Not "turn on...
I totally agree with Ingo here. Having a basic instrumentation that is
enabled by default will help to identify code paths causing unexpected
delays in the kernel. It will not only identify kernel bugs, but also
unexpected behaviors that would be qualified as "quiet bugs" (e.g. long
delays).The key aspect that seems to be inherent to this proposal is the need
for an extensible instrumentation mechanism that would allow developers
to add new instrumentation when it is needed (such as the Linux Kernel
Markers on which I have been working for the last year). It will enable
them, and testers, to test and benchmark kernel subsystems to detect
regressions as well as erratic behaviors.When a problem such as this VM delay is found, we should really be able
to ask ourselves : "what instrumentation (and tests) can we add to
detect this class of problem automatically in the future ?". Otherwise,
we will always be back to the starting point after future changes. When
I talk about tests, I mostly refer to tests performed by an
out-of-kernel tool analyzing exported traces, since the proliferation
of various live tests for each and every subsystems in the default
kernel configuration is not something one would sanely consider.Besides automatic testing, extracting traces taken from an instrumented
kernel would help users to provide a better feedback to kernel
developers by telling us what led to a particular problematic situation.
Since the burden of the proof that a particular subsystem is acting in
funny ways is left to the users/testers, we cannot expect a useful
feedback from them if we don't provide a mechanism that will let them
_prove_ that something is really going wrong.--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
It is. See: CONFIG_VM_EVENT_COUNTERS and all the other vm specific
crap littered in /proc/ (buddyinfo, zoneinfo, meminfo, etc).There is always an issue of sometimes not instrumenting enough basic
things... but we fundamentally have always tried to do improve this.We can and do (Andrew did, in this case).
-
Isn't the actual instrumentation present in the VM subsystem consisting
mostly of event counters ? This kind of profiling provides limited help in
following specific delays in the kernel. Martin Bligh's paper "Linux
Kernel Debugging on Google-sized clusters", presented at OLS2007,
discusses instrumentation that had to be added to the VM subsystem in
order to find a race condition in the OOM killer by gathering a trace of
the problematic behavior. Gathering a full trace (timestamps and events)
seems to be better suited to this kind of timing-related problem.Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
AFAIK, nobody is nacking patches that add more generally useful
instrumentation to the vm.You'll obviously always get problems that you need specialised
information to solve. Those kinds of corner cases are probably a
good idea for dynamic instrumentation I guess.
-
These existing VM events/counters could be a good target for a second
worked-out example (after blktrace) of reimplementation using generic
markers.- FChE
-
This is a problem with the testers. The barrier for *reporting* problems
is really low. And if they're regressions, then it is almost 100% chance of
being solved. If we're simply missing the bug reports, I'll again ask for a
linux-bugs list which is clear of other crap.Or are people scared of reporting regressions from previous experience?
I haven't seen any reason for this and linux-mm is probably one of the
most polite lists I have read (when it comes to interacting with nonI don't see this as a big problem. The "better Linux MM" is implemented
by the patch one is proposing -- whether it be a one liner bugfix, or a
complete rewrite. I think it _would_ be really nice to run all interesting
workloads on various Linux 26 and 24 kernels, various patchsets, and
other open source OSes for the regression testing and cross polination
aspects... however this brings us to the main problem:The alternative is completely unscientific chaos. OK, for performance
heuristics, it is actually a much grayer "you show that your patch
improves something / doesn't harm others" -- this is pretty tough for
mm patches at the moment, maybe it could be better but at the end
of the day, if something is worth merging then we should be able to
actually prove (or have a pretty high confidence) that it is good. If not,
then we don't want to merge it, by definition.In the case of the vm, this comes right back to the difficulty of getting a
range of meaningful tests. We can definitely improve this situation a
great deal, but like the scheduler, the *real* "tests" simply have to be
done by users. And unfortunately, a lot of them don't actually test VM
changes for a long time after they're merged. This would actually be one
area where it might make a lot of sense to coordinate more significantThe recent swap prefetch debate showed exactly how the process _should_
work. We discovered a serious issue with metadata caching behaviour, and
also the active-inactive list balancing change that makes use-onc...
On Thu, 27 Sep 2007 09:50:16 +0800
Reviewed-by: Rik van Riel <riel@redhat.com>
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
-
It could be a good fix for 2.6.22/23. But for -mm, I'm not
sure if throttle_vm_writeout() will be eventually removed ;-)-
| Alexandre Oliva | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Eric W. Biederman | Re: [net-2.6.24][patch 2/2] Dynamically allocate the loopback device |
| Ingo Molnar | Re: containers (was Re: -mm merge plans for 2.6.23) |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | [GIT]: Networking |
| Michael Riepe | Re: 2.6.27.19 + 28.7: network timeouts for r8169 and 8139too |
