(Reposted for completeness. Previously rejected by vger due to accidental send as HTML mail. CCs except for Mike and vger deleted.)

On Monday 17 September 2007 20:27, Mike Snitzer wrote:

The dread blk_congestion_wait is biting you hard. We're very familiar with the feeling. Congestion_wait is basically the traffic cop that implements the dirty page limit. I believe it was conceived as a method of fixing writeout deadlocks, but in our experience it does not help; in fact it introduces a new kind of deadlock (blk_congestion_wait) that is much easier to trigger. One of the things we do to get ddsnap running reliably is disable congestion_wait via the PF_LESS_THROTTLE hack that was introduced to stop local NFS clients from deadlocking. NBD will need a similar treatment.

Actually, I hope to show quite soon that dirty page limiting is not needed at all in order to prevent writeout deadlock. In which case we can just get rid of the dirty limits and go back to being able to use all of non-reserve memory as a write cache, the way things used to be in the days of yore.

It has been pointed out to me that congestion_wait not only enforces the dirty limit, it also controls the balancing of memory resources between slow and fast block devices. The Peterz/Phillips approach to deadlock prevention does not provide any such balancing, and so it seems to me that congestion_wait is ideally situated in the kernel to provide that missing functionality.

As I see it, blk_congestion_wait can easily be modified to balance the _rate_ at which cache memory is dirtied for various block devices of different speeds. This should turn out to be less finicky than balancing the absolute ratios. After all, you can make a lot of mistakes in rate limiting and still not deadlock, so long as the dirty rate doesn't drop to zero and stay there for any block device. Gotta be easy, hmm?
Please note: this plan is firmly in the category of speculation until we have actually tried it and have patches to show, but I thought that now is about the right time to say something about where we think this storage robustness work is headed.

Yes, and also inspect the code to ensure it doesn't violate mlockall by execing programs (no shell scripts!), dynamically loading libraries, etc. Avoiding glib is a good start. Look at your library dependencies and prune them mercilessly. Just don't use any libraries that you can code up yourself in a few hundred bytes of program text for the functionality you need.

See PF_LESS_THROTTLE. Also notice that this mechanism is somewhat less than general. In mainline it only has one user, NFS, and it can only have one user before you have to fiddle that code to create things like PF_EVEN_LESS_THROTTLE. As far as I can see, not having any dirty page limit for normal allocations is the way to go; it avoids this mess nicely. Now we just need to prove that this works ;-)

No, it's a patch I wrote based on Evgeniy's original, that appeared quietly later in the thread. At the time we hadn't tested it and now we have. It works fine; it's short, general, efficient and easy to understand. So it will get a post of its own pretty soon.

Yes. Ddsnap includes a bit of code almost identical to that, which we wrote independently. Seems wild and crazy at first blush, doesn't it? But this approach has proved robust in practice, and is, to my mind, obviously correct.

You do need the block IO throttling, and you need to bypass the dirty page limiting. Without throttling, your block driver will quickly consume any amount of reserve memory you have, and you are dead. Without an exemption from dirty page limiting, the number of pages your user space daemon can allocate without deadlocking is zero, which makes life very difficult. I will post our in-production version of the throttling patch in a day or two.

Yes.
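To make the throttling idea concrete, here is a minimal user space sketch of the accounting involved. It is not the ddsnap or device mapper patch. Every name in it (io_throttle, throttle_try_charge, throttle_uncharge) is hypothetical, and a real driver would sleep on a wait queue until completions free up room, where this sketch just returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; illustration only, not kernel code. */
struct io_throttle {
    unsigned limit;      /* maximum bvecs allowed in flight */
    unsigned in_flight;  /* bvecs submitted but not yet completed */
};

static void throttle_init(struct io_throttle *t, unsigned limit)
{
    t->limit = limit;
    t->in_flight = 0;
}

/*
 * Charge a request of nvecs bvecs against the limit.
 * Returns true if the request fits and has been charged, false if the
 * caller must wait for completions first. Note that in this simple form
 * a request larger than the whole limit is never admitted.
 */
static bool throttle_try_charge(struct io_throttle *t, unsigned nvecs)
{
    /* in_flight never exceeds limit, so the subtraction cannot wrap */
    if (nvecs > t->limit - t->in_flight)
        return false;
    t->in_flight += nvecs;
    return true;
}

/* Called on IO completion to release what the request charged. */
static void throttle_uncharge(struct io_throttle *t, unsigned nvecs)
{
    assert(nvecs <= t->in_flight);
    t->in_flight -= nvecs;
}
```

The point of the sketch is only that the amount of memory tied up in in-flight IO is bounded by the limit, so the driver cannot eat the whole reserve no matter how fast requests arrive.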
Yes, at least for device mapper devices. In our production device mapper throttling patch, which I will post pretty soon, we provide an arbitrary limit by default, and the device mapper device may change it in its constructor method. Something similar should work for NBD. As far as sub-optimal throughput goes, we run with a limit of 1,000 bvecs in flight (about 4 MB) and that does not seem to restrict throughput measurably.

Though you also need this throttling, it is apparent from the traceback you linked above that you ran aground on blk_congestion_wait. Try setting your user space daemon into PF_LESS_THROTTLE mode and see what happens.

A vm dagwood sandwich, I hope it tastes good :-)

Well, pretty soon we will join you in the NBD rehabilitation effort because we require it for the next round of storage work, which centers around the ddraid distributed block device. This requires an NBD that functions reliably, even when accessing an exported block device locally.

I thought Peter was swapping over NBD? Anyway, we have not moved into the NBD problem yet because we are still busy chasing non-deadlock-related ddsnap bugs. These require increasingly creative efforts to trigger, by the way, but we haven't quite run out of new bugs, so we don't get to play with distributed storage just yet.

Seeing as we have a virtually identical target configuration in mind, you can expect quite a lot of help from our direction in the near future, and in the meantime we can provide encouragement, information and perhaps a few useful lines of code.

Regards,

Daniel
