Good afternoon, According to me, each line of code removed from the kernel is worth ten lines added. If lines can be removed while at the same time improving performance, that is worth, ah, about 1,000 times more than lines added, correct? Or maybe 1,000,000 times as much, if removing lines removes a deadlock at the same time, something like that. Today's patch is a step towards removing many lines of code from mainline and also aims to improve write cache performance, eliminate a troublesome class of deadlocks, save kernel memory and make the kernel easier to read. Background We seriously began to tackle the issue of block writeout vm deadlock more than three years ago, and by now we have acreted a creaky agglomeration of dirty page limits, dirty page balancing between block devices, and miscellaneous other hacks attempting to solve these deadlocks. These bandaids as of 2.6.23 do not in themselves address the underlying problem, but they do add a lot of code to core kernel and they do impair write Linux's write cache performance. This is a classic case of fixing symptoms instead of the cause of a problem. The worst part of it? The dirty page limit idea did not actually fix the deadlock it was supposed to, but instead generated a new class of deadlocks that are easier to trigger. The basis of the writeout deadlock scenario is easy to see: a task requests a page of memory, but all pages are currently in use for caching among other things. To recover memory, some disk-backed pages must be evicted. Dirty pages must be written to disk before being evicted, so they are passed to the block layer for writeout. If any code in the block writeout path needs to allocate memory to do its work, then we can deadlock because the shortage of memory prevents the block layer from making progress to recover memory. So far, I have just summarized what everybody knows. This deadlock scenario has been present in Linux since day one, and has been solved f...
Hey, Daniel, I'm feeling like I must be really dumb, but...how can that possibly work? You're zeroing >bi_throttle before adding it back into q->available, so the latter will never increase... jon --
Hi Jon,
Don't you know? These days we optimize all our code for modern
processors with tunnelling instructions and metaphysical cache.
On such processors, setting a register to zero does not entirely
destroy all the data that used to be in the register, so subsequent
instructions can make further use of the overwritten data by
reconstructing it from remnants of bits left attached to the edges of
the register.
Um, yeah, that's it.
Actually, I fat-fingered it in the merge to -mm. Thanks for the catch,
corrected patch attached.
The offending line isn't even a functional part of the algorithm, it is
just supposed to defend against the possibility that, somehow,
->bi_endio gets called multiple times. Probably it should really be
something like:
BUG_ON(bio->bi_throttle == -1);
if (bio->bi_throttle) {
...
bio->bi_throttle = -1;
Or perhaps we should just rely on nobody ever making that mistake
and let somebody else catch it if it does.
Regards,
DanielHi,
Heh, well, that's ok as long as bio->bi_vcnt is set to zero and I think we
have some md raid drivers do just that... ;-)
Pekka
--Ehm, this patch is so broken it's not even funny - did you even compile? You would have noticed the warning on request_queue_t, surely. The big problem is the last hunk here though, how would that work on stacked devices? Clue: ->bi_bdev is not const, it can change after a call to ->make_request_fn(). -- Jens Axboe --
There you go, Jens, service with a smile. Regards, Daniel
Such paranoia. Yes, the patch was compiled. Yes, the warning was
slipped through. No, it is not substantive, and in fact was removed
from another branch of our tree already.
Ignoring the rhetoric, apparently you missed the line:
+ if (q && q->metric && !bio->bi_queue) {
The prevents any reference ti bi_bdev after the intial call to
generic_make_request. Thanks to Evgeniy for pointing out the need for
this measure on the last go round.
"So broken" is a gross exaggeration. Substantive comments welcome.
Daniel
--Which saves the initial target, for ease of accounting at end io time - that's not the point. What happens when ->make_request_fn() changes bio->bi_bdev and returns 1, causing another iteration of the __generic_make_request() loop? 'q' is no longer the valid target, Or you could try and make an effort to understand the comment instead of just glancing over it. -- Jens Axboe --
What happens on the second iteration of a recursive submission loop is exactly nothing, as is right and proper. The throttling has already been done, and all the state necessary to perform the unthrottle was recorded in the bio. Everything seems to be in order there, and the algorithm does indeed perform its function as designed, though to be sure we have not tested it on -mm branch, only on mainline. Regards, Daniel --
OK, let me get the neon out then. This has nothing to do with throttling, I thought I made it clear that I get why you store the origin queue in ->bi_queue. I'm concerned with the workings of redirecting a bio. Previously we looked up the queue associated with bio->bi_bdev inside the loop in __generic_make_request(), as is REQUIRED to correctly locate a DIFFERENT queue if bio->bi_bdev has been changed to point somewhere else. Clear? -- Jens Axboe --
Rhetoric aside, again. We are only interested in throttling against the bio->bi_bdev that was stored in the bio at the time of the call to generic_make_request, why should we care about the redirected value? Regards, Daniel --
Let me repeat - this has nothing to do with throttling! You are breaking the bio redirection by killing that bdev_get_queue() in the __generic_make_request(). I honestly don't know how to make this any clearer than I already did above. Sleep on it. -- Jens Axboe --
Sure you do, you could cut out the rhetoric and save lots of bandwidth thereby. Yes, the q = bdev_get_queue(bio->bi_bdev) needs to be repeated inside the submission loop, that was a flaw, thanks for the catch. Regards, Daniel
I spent 3 mail explaining it as clearly as I could. So you're welcome for the review and the reminder of why it's impossible to have a normal Precisely. So forgive me for thinking this patch hasn't seen very varied testing, that's 2 errors (one simple, one bad - broken was NOT a gross I still wish there was a way around this, you are bloating the bio by about 15% (yeah I know you rambled on about this, but still). Better placement would help, so there's still low hanging fruit available. -- Jens Axboe --
See the [RFC]? If I had meant Request For Flaming, I would have written that. Thankyou for the catch. Regards, Daniel --
The same mail that contained this part, copied verbatim: "Let me close with perhaps the most relevant remarks: the attached code has been in heavy testing and in production for months now. Thus there is nothing theoretical when I say it works, and the patch speaks for itself in terms of obvious correctness." We must have differing opinions on what obvious correctness is. Future replies to /dev/null, please. -- Jens Axboe --
That is quite correct, even without the redirect the code passed all our tests. Remember, we were testing for deadlock, not every possible Yes we do. You appear to have missed the plot entirely. I suppose I should remind you: this is about deadlock in _your_ subsystem that has been creating bug reports for years. Block writeout deadlock. Caused by a deficiency in _your_ subsystem. Got a plan? Or does endless, pointless flaming feel more like progress to you? --
And I may remind you that I have participated in this discussion before and made my points clear there. I suppose I should remind you how the development process works? Just because I happen to maintain some piece of code does not mean I'm under some sort of contractual obligation to fix and write new code for users. I'll happily review patches and integrate stuff I agree with, as I have been doing for years. This bug may seem really important to you - guess what, that's the normal nature Please, I'm not flaming you. I reviewed your code and pointed out errors, which was followed by lots of hand waving on your side instead of just sitting down and reading/fixing the bug. -- Jens Axboe --
Well that is indeed more civil language, if somewhat revisionist, since I distinctly remember being flamed by you from the word go. Once again, thankyou for the catch. A fairly trivial oversight that would have been caught sooner of later. As for the typedef thing, that was just a spelling flame, admit it. Truly, the way you were yelling I thought you had picked up a fundamental flaw instead of a simple misplaced line of code. Now about that block writeout deadlock... it doesn't just affect my code, it basically breaks Linux as a storage platform, among other things. Regards, Daniel --
As written in other similar threads in the past in which you also participated, I still of the opinion that this is a vm issue and should be solved as such. As to the patch in question "fixing" it in the block layer, it's a fairly simple work around and I'm not totally against it. If you get rid of the ->bi_throttle stuff and just do sanity checks on the count, then we could look at getting some testing done. -- Jens Axboe --
Oh, sorry I missed that olive branch on first read. Getting rid of those 8 bytes that bother you requires an extensive rethink of bio handling in order to make some fields that are not now constant, constant or at least restored on change. Which would be a good thing in itself. There are lots of good improvements that can be made to this subsystem along those lines. But that is properly a separate project. Quite some time will be needed to get it right, and should I mention it, everybody needs to be on the same page or the work will never start. It is therefore a theoretical solution. We have a practical, tested solution, here and now, and it is short enough to be understood, unlike any of the previous attempts. Your argument seems to be that adding 8 bytes to struct bio turns this beautiful swan into an ugly duck. Actually, because the throttling reduces the number of bios in flight in a busy system, total memory use is reduced. When the system is not busy, there are few bios hanging around so that is not a problem either. Nice, hmm? Daniel --
The problem is solved. The main cornerstone of the solution is bio throttling, simply because the resources in question are consumed by Testing is already progressing fine without you, thankyou. If you do want to participate, welcome, otherwise it is not a problem. Thanks for picking up that bug yesterday. Daniel --
... because too much is pushed out. This isn't a mathematica problem, there's more than one solution to this problem. Throttling the bio count Here we go again, thanks for picking up your jerky attitude again. I'm trying to suggest a way to get the patch in a state to be included, but apparently you are not interested. With 3 bugs so far exposed in your really short patch, seems you should take all the testing help you can get. For what it's worth, your behind the doors testing is worth basically nothing. It's already been shown that your test coverage wasn't very wide, if this patch/idea is to have any hope of proceeding further you need user testing. Period. Stop cc'ing or replying, what little interest I had is totally gone. -- Jens Axboe --
And nobody has been able to find another. Funny that. In fact, every solution proposed so far has implicitly required the writeout traffic to be throttled, even if that throttling was not part of the patch. Without throttling, deadlock. Simple as that. --
Rather than asking the stack "how much memory will this request consume" you could instead ask "how much memory are you currently using". ie: on entry to the stack, do current->account_block_allocations = 1; make_request(...); rq->used_memory += current->pages_used_for_block_allocations; and in the page allocator do if (!in_interrupt() && current->account_block_allocations) current->pages_used_for_block_allocations++; and then somehow handle deallocation too ;) The basic idea being to know in real time how much memory a particular block stack is presently using. Then, on entry to that stack, if the stack's current usage is too high, wait for it to subside. otoh we already have mechanisms for limiting the number of requests in flight. This is approximately proportional to the amount of memory which This will fall straight through if signal_pending() and (I assume) bad stuff will happen. uninterruptible sleep, methinks. --
Ah, and how do you ensure that you do not deadlock while making this inquiry? Perhaps send a dummy transaction down the pipe? Even so, deadlock is possible, quite evidently so in the real life example I have at hand. Yours is essentially one of the strategies I had in mind, the other major one being simply to examine the whole stack, which presupposes some as-yet-nonexistant kernel wide method of representing block device We do not wait for high block device resource usage to subside before submitting more requests. The improvement you suggest is aimed at automatically determining resource requirements by sampling a running system, rather than requiring a programmer to determine them arduously by hand. Something like automatically determining a workable locking strategy by analyzing running code, wouldn't that be a treat? I will hope for one of those under my tree at Christmas. More practically, I can see a debug mode implemented along the lines you describe where we automatically detect that a writeout path has Two reasons. The minor one is that device mapper bypasses that mechanism (no elevator) and the major one is that number of requests does not map well to the amount of resources consumed. In ddsnap for example, the amount of memory used by the userspace ddsnapd is roughly linear vs the number of pages transferred, not the number of Yes, as a first order repair. To be done properly I need to express this in terms of the guts of wait_event_*, and get rid of that race, maybe that changes the equation. It would be nice if threads didn't get stuck in D state here, though _interruptible is probably the wrong idea, we should instead ensure that whatever is being waited on must respond to, e.g., SIGKILL. This at the limits of my scheduler knowledge, l would appreciate a better suggestion. I do detest hang in D state with SIGKILL immunity, which behavior unfortunately does not seem all that rare. Daniel --
The problem is that you (a) may or may not know just how bad a worst case can be, and (b) may block unnecessarily by being pessimistic. The dummy transaction would be nice, but it would be perfect if you could send the real transaction down with a max memory limit and a flag, have each level check and decrement the max by what's actually needed, and then return some pass/fail status for that particular transaction. Clearly every level in the stack would have to know how to do that. It would seem that once excess memory use was detected the transaction could be failed without deadlock. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot --
True, but after a quick introspect I realized that that issue (it's really a single issue) is not any worse than the way I planned to wave my hands at the issue of programmers constructing their metrics wrongly and thereby breaking the throttling assumptions. Which is to say that I am now entirely convince by Andrew's argument and am prepardc to reroll the patch along the lines he suggests. The result will be somewhat bigger. Only a minor change is required to the main mechanism: we will now account things entirely in units of pages instead of abstract units, eliminating a whole class of things to go wrong. I like that. Accounting variables get shifted to a new home, maybe. Must try a few ideas and see what works. Anyway, the key idea is that task struct will gain a field pointing at a handle for the "block device stack", whatever that is (this is sure to evolve over time) and alloc_pages will know how to account pages to that object. The submit_bio and bio->endio bits change hardly at all. The runner up key idea is that we will gain a notion of "block device stack" (or block stack for short, so that we may implement block stackers) which for the time being will simply be Device Mapper's notion of device stack, however many warts that may have. It's there now and we use it for ddsnap. The other player in this is Peterz's swap over network use case, which does not involve a device mapper device. Maybe it should? Otherwise we will need a variant notion of block device stack, and the two threads of work should merge eventually. There is little harm in starting this effort in two different places, quite the contrary. In the meantime we do have a strategy that works, posted at the head of The function of the dummy transaction will be to establish roughly what kind of footprint for a single transaction we see on that block IO path. Then we will make the reservation _hugely_ greater than that, to accommodate 1000 or so of those. A transact...
On Thu, 6 Dec 2007 16:04:41 -0800
Perhaps all we need to track is the outermost point?
submit_bio(...)
{
bool remove_the_rq = false;
...
if (current->the_rq == NULL) {
current->the_rq = rq;
remove_the_rq = true;
}
...
if (remove_the_rq)
current->the_rq = NULL;
}
?
--The parent patch already has that crucial property in a simple say, see
if (q && q->metric && !bio->bi_queue) {
bio->bi_queue = q;
Regards,
Daniel
--Hi Andrew, Unfortunately, I agreed with your suggestion too hastily. Not only would it be complex to implement, It does not work. It took me several days to put my finger on exactly why. Here it is in a nutshell: resources may be consumed _after_ the gatekeeper runs the "go, no go" throttling decision. To illustrate, throw 10,000 bios simultaneously at a block stack that is supposed to allow only about 1,000 in flight at a time. If the block stack allocates memory somewhat late in its servicing scheme (for example, when it sends a network message) then it is possible that no actual resource consumption will have taken place before all 10,000 bios are allowed past the gate keeper, and deadlock is sure to occur sooner or later. In general, we must throttle against the maximum requirement of inflight bios rather than against the measured consumption. This achieves the invariant I have touted, namely that memory consumption on the block writeout path must be bounded. We could therefore possibly use your suggestion or something resembling it to implement a debug check that the programmer did in fact do their bounds arithmetic correctly, but it is not useful for enforcing the bound itself. In case that coffin needs more nails in it, consider that we would not only need to account page allocations, but frees as well. So what tells us that a page has returned to the reserve pool? Oops, tough one. The page may have been returned to a slab and thus not actually freed, though it remains available for satisfying new bio transactions. Because of such caching, your algorithm would quickly lose track of available resources and grind to a halt. Never mind that keeping track of page frees is a nasty problem in itself. They can occur in interrupt context, so forget the current-> idea. Even keeping track of page allocations for bio transactions in normal context will be a mess, and that is the easy part. I can just imagine the code attempting to imp...
It isn't an inquiry - it's a plain old submit_bio() and it runs to completion in the usual fashion. Thing is, we wouldn't have called it at all if this queue was already over its allocation limit. IOW, we know that it's below its allocation limit, so we know it won't deadlock. Given, of course, reasonably pessimistc error margins. Which margins can even be observed at runtime: keep a running "max" of this stack's most-ever memory consumption (for a single call), and only submit a We already have that, I think: blk_run_backing_dev(). One could envisage a similar thing which runs up and down the stack accumulating "how much memory do you need for this request" data, but I think that would be hard to Yeah, one would need to be pretty pessimal. Perhaps unacceptably I don't think so. If you're going to sleep in state TASK_INTERRUPTIBLE then you *have* to bale out and return to userspace (or whatever) when Well.. See add-lock_page_killable.patch kernel-add-mutex_lock_killable.patch vfs-use-mutex_lock_killable-in-vfs_readdir.patch in 2.6.24-rc4-mm1. But for now, TASK_UNINTERRUPTIBLE is the honest solution. --
OK, I see what you are suggesting. Yes, one could set the inflight limit very low and the reserve very high, and run a bio through the stack (what I meant by "inquiry") to discover the actual usage, then shrink the reserve accordingly. By also running a real bio through the stack we can discover something about the latency. So we would then know roughly how high the inflight limit should be set and how much the memalloc reserve should be increased to handle that particular driver instance. The big fly in this ointment is that we cannot possibly know that our bio followed the worst case resource consumption path, whereas it is fairly Actually, your mechanism would always have to be operable at runtime, since inserting a new driver while the system is under heavy memory load is a perfectly valid operation and has to be reliable. Anyway, even if you run a bio through the stack lots of times (insert definition of "lots" here) you still cannot be sure that it has explored the worst case path. To put this in perspective, some of the deadlocks we have hunted down recently have taken days to manifest under artificially high load. It just takes that long to randomly explore a sufficient number I don't think I quite communicated there. We don't actually have any generic notion of "the block device stack". Device mapper has its own model, md has another model, and other stacking devices may have no model at all, just some through-coded hack. It would be worth fixing this problem as part of an effort to generalize the block IO model and make block devices in general look more like device mapper devices. But that would be a pretty big project, the need for which is not generally Orders of magnitude more reserve would need to be allocated in the case of ddsnap, since bio payload can vary through a big range, which is expected to get bigger as time goes by. So the few lines of extra code and the extra bio field needed to get a better fit is well worth the Thanks for clearing...
nonono... Consider an example. - We a-priori decide to limit a particular stack's peak memory usage to 1MB - We empirically discover that the maximum amount of memory which is allocated by that stack on behalf of a single BIO is 16kb. (ie: that's the most it has ever used for a single BIO). - Now, we refuse to feed any more BIOs into the stack when its instantaneous memory usage exceeds (1MB - 16kb). Of course, the _average_ memory-per-BIO is much less than 16kb. So there are a lot of BIOs in flight - probably hundreds, but a minimum of 63. There is a teeny so-small-it-doesn't-matter chance that the stack will exceed the 1MB limit. If it happens to be at its (1MB-16kb) limit and all the memory in the machine is AWOL and then someone throws a never-seen-before twirly BIO at it. Not worth worrying about, surely. --
And printk a warning, because the programmer's static analysis was OK, I see where you are going. The programmer statically determines a total reservation for the stack, which is big enough that progress is guaranteed. We then throttle based on actual memory consumption, and essentially use a heuristic to decide when we are near the upper limit. Workable I think, but... The main idea, current->pages_used_for_block_allocations++ is valid only in direct call context. If a daemon needs to allocate memory on behalf of the IO transfer (not unusual) it won't get accounted, which is actually the central issue in this whole class of deadlocks. Any idea how to extend the accounting idea to all tasks involved in a particular block device stack? Regards, Daniel --
On Thu, 6 Dec 2007 12:04:14 -0800 SMOP, I'd have thought. As long as each piece of code which handles data for this stack knows that it's handling data for that stack it should be able to account its memory allocations. The tricky part will be networking allocations because a NIC can of course handle data for all sorts of consumers. But I expect this can be greatly simplified with a few heuristics - work out how much memory your typical networking stack will allocate for a frame and tack that onto the total. Couple of pages worst case.. --
Agreed, which I realized as soon as the post was one minute old. Sure, each helper for the device registers as a helper which puts a pointer in the task struct, which points to the accounting info so only one new Don't forget that we do not actually have a usable notion of "block device stack" yet. Perhaps you are just assuming that is Actually, the same pattern that Peter and I developed for handling network deadlock extends to this accounting concept. As you say, it's a SMOP. Daniel --
On Thu, 6 Dec 2007 03:55:11 -0800 There is only one problem I can see with this. With network block IO, some memory will be consumed upon IO completion. We need to make sure we reserve (number of in flight BIOs * maximum amount of memory consumed upon IO completion) memory, in addition to the memory you're accounting in your example above. -- All Rights Reversed --
hm, yeah, drat. What we could do is - in do_IRQ(): set up a per-cpu pointer to some counter which corresponds to this IRQ. - in the page allocator, if in_irq(), retrieve that per-cpu pointer and increment the counter. - in the network block-io stack we can now look at the number of interrupts, number of packets, size of packets and amount of memory allocated and work out the max amount of memory which needs to be allocated for each frame. That's all rather handwavy and misses a lot of details and might be inaccurate too. Probably sufficient to just work out by hand the amount of memory which the network stack will need to allocate. I expect it'll be two pages.. --
On Thu, 6 Dec 2007 09:34:32 -0800 Doesn't Peter Zijlstra's patch series take care of all those nasty details already? -- All Rights Reversed --
| Theodore Tso | Re: -mm merge plans for 2.6.23 -- sys_fallocate |
| Jeff Garzik | Re: [RFC] Heads up on sys_fallocate() |
| Erez Zadok | [UNIONFS] 00/42 Unionfs and related patches review |
| Roland Dreier | Re: Integration of SCST in the mainstream Linux kernel |
git: | |
| Jon Smirl | Re: VCS comparison table |
| Andy Parkins | svn:externals using git submodules |
| Daniel Berlin | Re: Git and GCC |
| Sam Vilain | [PATCH] git-mergetool: add support for ediff |
| Richard Stallman | Real men don't attack straw men |
| Paul de Weerd | Re: Porting OpenBSD to OLPC XO laptops. |
| sonjaya | openvpn on openbsd 4.1 |
| Adliger Martinez von der Unterschicht | linux kills laptop hard drive... how does obsd behave? |
| Gerrit Renker | [PATCH 0/37] dccp: Feature negotiation - last call for comments |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Andrew Morton | Re: [Bugme-new] [Bug 11144] New: dhcp doesn't work with iwl4965 |
| Arjan van de Ven | Re: [GIT]: Networking |
