Hi y'all,

This is a request for comments on the rewrite of the e2fsck IO parallelization patches I sent out a few months ago. The mechanism is totally different. Previously IO was parallelized by issuing IOs from multiple threads; now a single thread issues fadvise(WILLNEED) and then uses read() to complete the IO.

Single disk performance doesn't change, but elapsed time drops by about 50% on a big RAID-5 box. Passes 1 and 2 are parallelized. Pass 5 is left as an exercise for the reader.

Many thanks to the Lustre folks for their fadvise readahead patch which this patch uses and for comments and help in general. Our good friends at EMC Centera funded this work.

Here are the top things I'd like feedback on:

How to split up the patch? My take:
* Indirect block only flag for iterate
* IO manager readahead/release functions
* Readahead infrastructure (readahead.c and related)
* Readahead calls for pass 1
* Readahead calls for pass 2

Killing readahead properly is hard. I implemented it several ways and didn't like any of them. The current solution is still racy and completely untested.

The whole thing needs to be autoconfed correctly. Bah.

The user interface kinda sucks. It should at least take arguments of the form "128KB" or "52m" instead of number of file system blocks. Guessing the right amount of buffer cache to use and IO requests to issue would also be good.

ext2fs_get_next_inode_ptr() - With readahead, copying the inode in ext2fs_get_next_inode_full() costs about 2-3% of elapsed time. This is a hacked up version that just returns a pointer to the inode.

The patch is against e2fsprogs 1.40.4 and is attached. Future patches will be split up and sent via quilt.

Thanks!

-VAL

 e2fsck/Makefile.in | 6
 e2fsck/e2fsck.h | 5
 e2fsck/pass1.c | 36
 e2fsck/pass2.c | 12
 e2fsck/unix.c | 18
 lib/ext2fs/readahead.c ...
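As an illustration of the user-interface wish above (arguments such as "128KB" or "52m" rather than a count of file system blocks), here is a minimal sketch of a size parser. The names parse_size() and bytes_to_blocks() are hypothetical and are not part of the patch or of e2fsprogs; the accepted suffixes (k, m, g, with an optional trailing b) are an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the patch): parse "128KB", "52m" or "4096"
 * into a byte count. Returns 0 on success, -1 on malformed input or overflow.
 */
int parse_size(const char *arg, unsigned long long *bytes)
{
    char *end;
    unsigned long long value, mult = 1;

    assert(arg != NULL && bytes != NULL);
    if (!isdigit((unsigned char)arg[0]))
        return -1;                      /* rejects "", "-5" and " 12" */
    errno = 0;
    value = strtoull(arg, &end, 10);
    if (errno == ERANGE)
        return -1;

    switch (tolower((unsigned char)*end)) {
    case 'k': mult = 1ULL << 10; end++; break;
    case 'm': mult = 1ULL << 20; end++; break;
    case 'g': mult = 1ULL << 30; end++; break;
    case '\0': break;                   /* plain number: bytes */
    default: return -1;
    }
    /* Allow an optional "b"/"B" after a unit letter, as in "KB". */
    if (mult != 1 && tolower((unsigned char)*end) == 'b')
        end++;
    if (*end != '\0')
        return -1;
    if (value > ~0ULL / mult)
        return -1;                      /* multiplication would overflow */
    *bytes = value * mult;
    return 0;
}

/* Convert a byte budget to whole file system blocks, rounding down. */
unsigned long long bytes_to_blocks(unsigned long long bytes,
                                   unsigned int block_size)
{
    assert(block_size != 0);
    return bytes / block_size;
}
```

With a 4096-byte block size, "128KB" would come out as 32 blocks, so the existing block-count plumbing could stay as it is behind such a front end.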
Interesting. We ultimately rejected a similar patch to xfs_repair (pre-populating the kernel block device cache) mainly because of low memory performance issues and because it doesn't really enable you to do anything particularly smart with optimising I/O patterns for larger, high performance RAID arrays.

The low memory problems were particularly bad; the readahead thrashing caused a slowdown of 2-3x compared to the baseline, and often it was due to the repair process requiring all of memory to cache stuff it would need later. IIRC, multi-terabyte ext3 filesystems have similar memory usage problems to XFS, so there's

Promising results, though....

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
That was one of my first concerns - how to avoid overflowing memory? Whenever I screw it up on e2fsck, it does go, oh, 2 times slower due

I have a partial solution that sort of blindly manages the buffer cache. First, the user passes e2fsck a parameter saying how much memory is available as buffer cache. The readahead thread reads things in and immediately throws them away so they are only in buffer cache (no double-caching). Then readahead and e2fsck work together so that readahead only reads in new blocks when the main thread is done with earlier blocks. The already-used blocks get kicked out of buffer cache to make room for the new ones.

What would be nice is to take into account the current total memory usage of the whole fsck process and factor that in. I don't think it would be hard to add to the existing cache management framework.

Thanks! It's solving a rather simpler problem than XFS check/repair. :)

-VAL
-
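The windowing scheme described above (prefetch only as far as the user's buffer cache budget allows, and drop blocks once the main thread is done with them) can be sketched roughly as follows. This is not the code from the patch: struct ra_window and the ra_*() functions are hypothetical, the sketch tracks byte offsets on a single descriptor, and it assumes POSIX_FADV_DONTNEED is an acceptable way to evict used data.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical bookkeeping for a sliding readahead window on one fd. */
struct ra_window {
    int   fd;
    off_t budget;      /* bytes of buffer cache the user allowed */
    off_t consumed;    /* main thread is finished with everything below this */
    off_t prefetched;  /* readahead has been requested up to this offset */
};

/* Bytes readahead may still request without exceeding the budget. */
off_t ra_room(const struct ra_window *w)
{
    return w->budget - (w->prefetched - w->consumed);
}

/* Request up to len more bytes of readahead; returns the bytes requested. */
off_t ra_advance(struct ra_window *w, off_t len)
{
    off_t room = ra_room(w);

    if (len > room)
        len = room;
    if (len <= 0)
        return 0;
    /* posix_fadvise() returns an error number, not -1, on failure. */
    if (posix_fadvise(w->fd, w->prefetched, len, POSIX_FADV_WILLNEED) != 0)
        return 0;
    w->prefetched += len;
    return len;
}

/* Main thread is done with data below upto: let the kernel drop it. */
void ra_release(struct ra_window *w, off_t upto)
{
    assert(upto >= w->consumed && upto <= w->prefetched);
    (void)posix_fadvise(w->fd, w->consumed, upto - w->consumed,
                        POSIX_FADV_DONTNEED);
    w->consumed = upto;
}
```

The main thread would call ra_release() as it finishes each range, which in turn opens up room for the readahead side's next ra_advance() call.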
I discussed this with Ted at one point also. This is a generic problem,
not just for readahead, because "fsck" can run multiple e2fsck in parallel
and in case of many large filesystems on a single node this can cause
memory usage problems also.
What I was proposing is that "fsck.{fstype}" be modified to return an
estimated minimum amount of memory needed, and some "desired" amount of
memory (i.e. readahead) to fsck the filesystem, using some parameter like
"fsck.{fstype} --report-memory-needed /dev/XXX". If this does not
return the output in the expected format, or returns an error then fsck
will assume some amount of memory based on the device size and continue
as it does today.
If the fsck.{fstype} does understand this parameter, then fsck makes a
decision based on devices, parallelism, and total RAM (less some amount
to avoid thrashing), and calls the individual fsck commands with
"--maximum-memory MMM /dev/XXX" so each knows how much cache it can
allocate. This parameter can also be specified by the user if running
e2fsck directly.
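To make the proposal concrete, here is one way the fsck wrapper could turn the reported "minimum" and "desired" figures into per-device --maximum-memory values. This is only a sketch of one possible policy (grant every checker its minimum, then share what is left in proportion to the extra each asked for); struct mem_report and split_memory() are hypothetical names, and all quantities are assumed to be in KB.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of what one "--report-memory-needed" run returned. */
struct mem_report {
    unsigned long long min_kb;      /* estimated minimum to finish the check */
    unsigned long long desired_kb;  /* minimum plus cache/readahead it could use */
    unsigned long long granted_kb;  /* output: value for --maximum-memory */
};

/*
 * Split (total_kb - reserve_kb) among n checkers that run in parallel.
 * Returns 0 on success, or -1 if even the minimums do not fit, in which
 * case the caller would have to reduce parallelism.
 * Assumes KB quantities small enough that extra * spare fits in 64 bits.
 */
int split_memory(struct mem_report *r, size_t n,
                 unsigned long long total_kb, unsigned long long reserve_kb)
{
    unsigned long long avail, need = 0, want = 0, spare;
    size_t i;

    assert(r != NULL || n == 0);
    if (total_kb <= reserve_kb)
        return -1;
    avail = total_kb - reserve_kb;

    for (i = 0; i < n; i++) {
        if (r[i].desired_kb < r[i].min_kb)
            r[i].desired_kb = r[i].min_kb;
        need += r[i].min_kb;
        want += r[i].desired_kb - r[i].min_kb;
    }
    if (need > avail)
        return -1;

    spare = avail - need;
    for (i = 0; i < n; i++) {
        unsigned long long extra = r[i].desired_kb - r[i].min_kb;

        /* Scale the extras down proportionally when they do not all fit. */
        if (want > spare)
            extra = extra * spare / want;
        r[i].granted_kb = r[i].min_kb + extra;
    }
    return 0;
}
```

A checker that does not understand the reporting option would simply be given an entry estimated from its device size, as described above.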
I haven't looked through your patch yet, but I hope to get to it soon.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
-
And while fsck is running, some other program runs that uses memory and blows your carefully calculated parameters to smithereens?

I think there is a clear need for applications to be able to register a callback from the kernel to indicate that the machine as a whole is running out of memory and that the application should trim its caches to reduce memory utilisation.

Perhaps instead of swapping immediately, a SIGLOWMEM could be sent to processes that aren't masking the signal, followed by a short grace period to allow the processes to free up some memory before swapping out pages from that process?

With this sort of feedback, the fsck process can scale back its readahead and remove cached info that is not critical to what it is currently doing, and thereby prevent readahead thrashing as memory usage of the fsck process itself grows.

Another example where this could be useful is to tell browsers to release some of their cache rather than having the VM swap it out.

IMO, a scheme like this will be far more reliable than trying to guess what the optimal settings are going to be over the whole lifetime of a process....

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
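From the application's side, reacting to such a signal might look like the sketch below. SIGLOWMEM is only a proposal in this thread and does not exist, so SIGUSR1 stands in for it here; the handler name, adjust_readahead() and the halving policy are all assumptions made for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* SIGLOWMEM is hypothetical; SIGUSR1 is used as a stand-in. */
#define SIGLOWMEM_STANDIN SIGUSR1

static volatile sig_atomic_t low_memory_seen;

static void on_low_memory(int signo)
{
    (void)signo;
    low_memory_seen = 1;    /* only set a flag: async-signal-safe */
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_low_memory_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_low_memory;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGLOWMEM_STANDIN, &sa, NULL);
}

/*
 * Called from the main loop between units of work. If a notification
 * arrived since the last call, halve the readahead budget, but never
 * go below floor_blocks.
 */
size_t adjust_readahead(size_t budget_blocks, size_t floor_blocks)
{
    assert(floor_blocks <= budget_blocks);
    if (!low_memory_seen)
        return budget_blocks;
    low_memory_seen = 0;
    budget_blocks /= 2;
    return budget_blocks < floor_blocks ? floor_blocks : budget_blocks;
}
```

Doing the actual trimming in the main loop rather than in the handler keeps the handler trivial and avoids touching cache data structures from signal context.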
AIX had SIGDANGER some 15 years ago. Admittedly, that was sent when the system was about to hit OOM, not when it was about to start swapping. I suspect both approaches have their merits...
I'd tried to advocate SIGDANGER some years ago as well, but none of the kernel maintainers were interested. It definitely makes sense to have some sort of mechanism like this. At the time I first brought it up it was in conjunction with Netscape using too much cache on some system, but it would be just as useful for all kinds of other memory-hungry applications.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
-
There is an early thread for a /proc file which you can add to your poll() set and it will wake people when memory is low. Very elegant and if async support is added it will also give you the signal variant for free. Alan -
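A consumer of such a pollable file might look roughly like this. The path of the /proc file is not given in this thread, so the sketch takes it as a parameter; open_notifier(), memory_is_low() and close_notifier() are hypothetical names, and the assumption that "memory is low" shows up as POLLIN is mine.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Open the notification file; the path is an assumption left to the caller. */
int open_notifier(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_NONBLOCK);
}

/*
 * Wait up to timeout_ms for the descriptor to become readable.
 * Returns 1 if low memory was signalled, 0 on timeout, -1 on error.
 */
int memory_is_low(int notify_fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(notify_fd >= 0);
    pfd.fd = notify_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Release the descriptor when the checker is finished. */
void close_notifier(int notify_fd)
{
    if (notify_fd >= 0)
        (void)close(notify_fd);
}
```

A program that already has a poll() loop would just add the descriptor to its existing set; a timeout of 0 turns memory_is_low() into a non-blocking check between units of work.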
It's been discussed before, but I suspect the main reason why it was
never done is no one submitted a patch. Also, the problem is actually
a pretty complex one. There are a couple of different stages where
you might want to send an alert to processes:
* Data is starting to get ejected from page/buffer cache
* System is starting to swap
* System is starting to really struggle to find memory
* System is starting the out-of-memory killer
AIX's SIGDANGER really did the last two, where the OOM killer would
tend to avoid processes that had a SIGDANGER handler in favor of
processes that were SIGDANGER unaware.
Then there is the additional complexity in Linux that you have
multiple zones of memory, which at least on the historically more
popular x86 was highly, highly important. You could say that whenever
there is sufficient memory pressure in any zone that you start
ejecting data from caches or start to swap that you start sending the
signals --- but on x86 systems with lowmem, that could happen quite
frequently, and since a user process has no idea whether its resources
are in lowmem or highmem, there's not much you can do about this.
Hopefully this is less of an issue today, since the 2.6 VM is much
better behaved, and people are gradually moving over to x86_64
anyway. (Sorry SGI and Intel, unfortunately they're not moving over
to the Itanic :-). So maybe this would be better received now.
Bringing us back to the main topic at hand, one of the tradeoffs in
Val's current approach is that by relying on the kernel's buffer
cache, we don't have to worry about locking and coherency at the
userspace level. OTOH, we give up low-level control about when memory
gets thrown out, and it also means that simply getting notified when
the system starts to swap isn't good enough. We need to know much
earlier, when the system starts ejecting data from the buffer and page
caches.
Does this matter? Well, there are a couple of use cases:
* The ...

Aren't Marcelo, Riel and some other people working on memory notifications?

- Arnaldo
-
As user pages are always in highmem, this should be easy to decide: only send SIGDANGER when highmem is full. (Yes, there are inodes/dentries/file descriptors in lowmem, but I doubt apps will respond to SIGDANGER by closing files). -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
Good point; for a system with at least (say) 2GB of memory, that definitely makes sense. For a system with less than 768 megs of memory (how quaint, but it wasn't that long ago this was a lot of memory :-), there wouldn't *be* any memory in highmem at all.... - Ted -
Ok, so it is 'send SIGDANGER when all zones are low', because user allocations can go from all zones (unless you have something really exotic, I'm not sure if that is true on huge NUMA machines & similar). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
Thank you for pointing that out. To be honest, the zone awareness of the current mem_notify is immature. I think we need to extend the RSS statistics to per-zone RSS, but that is not implemented yet ;-) Unfortunately, I have no highmem machine, so mem_notify has not been tested much on highmem machines. If you can help test it, I would be very happy! Thanks. -
Well, fsck has a rather restricted working environment, because it is run before most other processes start (i.e. single-user mode). For fsck initiated by an admin in other runlevels the admin would need to specify the upper limit of memory usage. My proposal was only for the single-user fsck at boot time. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -
The simple case. ;)

Because XFS has shutdown features, it's not uncommon to hear about people running xfs_repair on an otherwise live system. E.g. XFS detects a corrupted block and shuts down the filesystem; the admin unmounts it, runs xfs_repair, and puts it back online. Meanwhile, all the other filesystems and users continue unaffected. In this use case, getting feedback about memory usage is, IMO, very worthwhile.

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
The problem with that approach is that the fsck process doesn't know how its need for memory compares with other processes' needs for memory. How much memory should it give up? Maybe it should just quit altogether if other processes are in danger of deadlocking. Or maybe it's best for it to keep all its memory and let some other frivolous process give up its memory instead. It's the OS's job to have a view of the entire system and make resource allocation decisions.

If it's just a matter of the application choosing a better page frame to vacate than what the kernel would have taken (which is more a matter of self-interest than resource allocation), then fsck can do that more directly by just monitoring its own page fault rate. If it's high, then it's using more real memory than the kernel thinks it's entitled to, and it can reduce its memory footprint to improve its speed. It can even check whether an access to readahead data caused a page fault; if so, it knows reading ahead is actually making things worse and can therefore reduce readahead until the page faults stop happening.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
-
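The self-monitoring idea above could be approximated with getrusage(), which reports the number of major page faults (those that needed I/O) the process has taken. The sketch below samples that counter and feeds the difference into a simple policy; tune_readahead(), its thresholds and the grow/shrink factors are assumptions for illustration. Note also that ru_majflt counts faults on the process's own mapped memory, so whether it is a good proxy for readahead data being evicted from the buffer cache is an open question.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Major page faults taken by this process so far, or -1 on error. */
long major_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}

/*
 * Hypothetical policy: faults_delta is the change in major_faults()
 * across one unit of work. More than `tolerated` faults halves the
 * readahead (never below min_blocks); zero faults grows it by one
 * eighth plus one block (never above max_blocks); otherwise hold.
 */
size_t tune_readahead(size_t cur, long faults_delta, long tolerated,
                      size_t min_blocks, size_t max_blocks)
{
    assert(min_blocks <= max_blocks);
    if (faults_delta > tolerated) {
        cur /= 2;
        return cur < min_blocks ? min_blocks : cur;
    }
    if (faults_delta == 0) {
        cur += cur / 8 + 1;
        return cur > max_blocks ? max_blocks : cur;
    }
    return cur;
}
```

The caller would sample major_faults() before and after each batch of blocks and pass the difference in, which needs no new kernel interface at all.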
