Re: [RFC] Parallelize IO for e2fsck

Previous thread: [ANNOUNCE] util-linux-ng 2.13.1 (stable) by Karel Zak on Wednesday, January 16, 2008 - 6:19 am. (2 messages)

Next thread: [patch] VFS: extend /proc/mounts by Miklos Szeredi on Wednesday, January 16, 2008 - 3:12 pm. (11 messages)
From: Valerie Henson
Date: Wednesday, January 16, 2008 - 2:30 pm

Hi y'all,

This is a request for comments on the rewrite of the e2fsck IO
parallelization patches I sent out a few months ago.  The mechanism is
totally different.  Previously IO was parallelized by issuing IOs from
multiple threads; now a single thread issues fadvise(WILLNEED) and
then uses read() to complete the IO.
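
A minimal sketch of the hint-then-read pattern described above, assuming
the hint is issued through posix_fadvise() with POSIX_FADV_WILLNEED.  The
helper names prefetch_range() and read_range() are hypothetical and are
not taken from the patch.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: tell the kernel that [offset, offset + len) will
 * be needed soon.  The call only queues the readahead; it does not wait
 * for the data.  Returns 0 or an errno value, as posix_fadvise() does.
 */
int prefetch_range(int fd, off_t offset, off_t len)
{
	return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}

/*
 * Hypothetical helper: complete the IO with an ordinary blocking read.
 * If the earlier hint has finished, this is served from the page cache.
 * Returns the number of bytes read (short only at end of file), or -1.
 */
ssize_t read_range(int fd, void *buf, size_t len, off_t offset)
{
	char *p = buf;
	size_t done = 0;

	while (done < len) {
		ssize_t n = pread(fd, p + done, len - done,
				  offset + (off_t)done);
		if (n < 0)
			return -1;
		if (n == 0)	/* end of file */
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

The parallelism comes from issuing many prefetch_range() calls for
different regions before the first read_range(), so a RAID array can
service them concurrently.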

Single disk performance doesn't change, but elapsed time drops by
about 50% on a big RAID-5 box.  Passes 1 and 2 are parallelized.  Pass
5 is left as an exercise for the reader.

Many thanks to the Lustre folks for their fadvise readahead patch
which this patch uses and for comments and help in general.  Our good
friends at EMC Centera funded this work.

Here are the top things I'd like feedback on:

How to split up the patch?  My take:

* Indirect block only flag for iterate
* IO manager readahead/release functions
* Readahead infrastructure (readahead.c and related)
* Readahead calls for pass 1
* Readahead calls for pass 2

Killing readahead properly is hard.  I implemented it several ways and
didn't like any of them.  The current solution is still racy and
completely untested.

The whole thing needs to be autoconfed correctly.  Bah.

The user interface kinda sucks.  It should at least take arguments of
the form "128KB" or "52m" instead of number of file system blocks.
Guessing the right amount of buffer cache to use and io requests to
issue would also be good.
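
One way such size arguments could be parsed, as a sketch only.  The
function name parse_size() and the accepted suffix set (k, m, g, t, with
an optional trailing "b", case-insensitive, powers of 1024) are
assumptions, not part of the patch.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical parser: accepts "4096", "128KB", "52m", "1G" and stores
 * the byte count in *out.  Returns 0 on success, -1 on malformed or
 * overflowing input.
 */
int parse_size(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long n;
	unsigned shift = 0;

	if (!isdigit((unsigned char)*s))	/* rejects "", "-1", " 5" */
		return -1;
	errno = 0;
	n = strtoull(s, &end, 10);
	if (errno == ERANGE)
		return -1;

	switch (tolower((unsigned char)*end)) {
	case 'k': shift = 10; end++; break;
	case 'm': shift = 20; end++; break;
	case 'g': shift = 30; end++; break;
	case 't': shift = 40; end++; break;
	default: break;
	}
	if (shift && tolower((unsigned char)*end) == 'b')
		end++;			/* allow "KB", "mb", ... */
	if (*end != '\0')
		return -1;		/* trailing junk */
	if (n > (UINT64_MAX >> shift))	/* would overflow after scaling */
		return -1;
	*out = (uint64_t)n << shift;
	return 0;
}
```

Dividing the result by the file system block size would give the block
count that the current interface expects.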

ext2fs_get_next_inode_ptr() - With readahead, copying the inode in
ext2fs_get_next_inode_full() costs about 2-3% of elapsed time.  This
is a hacked up version that just returns a pointer to the inode.

The patch is against e2fsprogs 1.40.4 and is attached.  Future patches
will be split up and sent via quilt.

Thanks!

-VAL

 e2fsck/Makefile.in                      |    6
 e2fsck/e2fsck.h                         |    5
 e2fsck/pass1.c                          |   36
 e2fsck/pass2.c                          |   12
 e2fsck/unix.c                           |   18
 lib/ext2fs/readahead.c ...
From: David Chinner
Date: Thursday, January 17, 2008 - 6:15 pm

Interesting.

We ultimately rejected a similar patch to xfs_repair (pre-populating
the kernel block device cache) mainly because of low memory
performance issues and because it doesn't really enable you to do
anything particularly smart with optimising I/O patterns for larger,
high performance RAID arrays.

The low memory problems were particularly bad; the readahead
thrashing caused a slowdown of 2-3x compared to the baseline and
often it was due to the repair process requiring all of memory
to cache stuff it would need later. IIRC, multi-terabyte ext3
filesystems have similar memory usage problems to XFS, so there's

Promising results, though....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Valerie Henson
Date: Thursday, January 17, 2008 - 6:43 pm

That was one of my first concerns - how to avoid overflowing memory?
Whenever I screw it up on e2fsck, it does go, oh, 2 times slower due

I have a partial solution that sort of blindly manages the buffer
cache.  First, the user passes e2fsck a parameter saying how much
memory is available as buffer cache.  The readahead thread reads
things in and immediately throws them away so they are only in buffer
cache (no double-caching).  Then readahead and e2fsck work together so
that readahead only reads in new blocks when the main thread is done
with earlier blocks.  The already-used blocks get kicked out of buffer
cache to make room for the new ones.
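
The bookkeeping behind that scheme might look roughly like the sketch
below.  It is a simplification under stated assumptions: blocks are
treated as one contiguous run (e2fsck's real access pattern is
scattered), there is no locking between the two threads, and the struct
and function names are hypothetical.  Used blocks are dropped with
POSIX_FADV_DONTNEED, which is one plausible way to kick them out of
buffer cache.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical bookkeeping for the readahead window. */
struct ra_window {
	int      fd;		/* device or image being checked */
	uint32_t block_size;	/* bytes per file system block */
	uint64_t budget;	/* max blocks allowed in buffer cache */
	uint64_t issued;	/* blocks prefetched so far */
	uint64_t consumed;	/* blocks the checker has finished with */
};

/* How many more blocks readahead may issue without exceeding budget. */
uint64_t ra_room(const struct ra_window *w)
{
	uint64_t in_cache = w->issued - w->consumed;

	return in_cache >= w->budget ? 0 : w->budget - in_cache;
}

/* Prefetch up to `want` more blocks; returns how many were issued. */
uint64_t ra_advance(struct ra_window *w, uint64_t want)
{
	uint64_t n = ra_room(w);

	if (n > want)
		n = want;
	if (n == 0)		/* a zero length would mean "to EOF" */
		return 0;
	(void)posix_fadvise(w->fd, (off_t)(w->issued * w->block_size),
			    (off_t)(n * w->block_size), POSIX_FADV_WILLNEED);
	w->issued += n;
	return n;
}

/* The checker is done with n blocks: drop them to make room. */
void ra_release(struct ra_window *w, uint64_t n)
{
	if (n > w->issued - w->consumed)
		n = w->issued - w->consumed;
	if (n == 0)
		return;
	(void)posix_fadvise(w->fd, (off_t)(w->consumed * w->block_size),
			    (off_t)(n * w->block_size), POSIX_FADV_DONTNEED);
	w->consumed += n;
}
```

The hints are advisory, so their return values are ignored here; the
budget accounting is what keeps readahead from running ahead of the
main thread.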

What would be nice is to take into account the current total memory
usage of the whole fsck process and factor that in.  I don't think it
would be hard to add to the existing cache management framework.

Thanks!  It's solving a rather simpler problem than XFS check/repair. :)

-VAL
-

From: Andreas Dilger
Date: Monday, January 21, 2008 - 4:00 pm

I discussed this with Ted at one point also.  This is a generic problem,
not just for readahead, because "fsck" can run multiple e2fsck in parallel
and in case of many large filesystems on a single node this can cause
memory usage problems also.

What I was proposing is that "fsck.{fstype}" be modified to return an
estimated minimum amount of memory needed, and some "desired" amount of
memory (i.e. readahead) to fsck the filesystem, using some parameter like
"fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
return the output in the expected format, or returns an error then fsck
will assume some amount of memory based on the device size and continue
as it does today.

If the fsck.{fstype} does understand this parameter, then fsck makes a
decision based on devices, parallelism, total RAM (less some amount to
avoid thrashing), then it can call the individual fsck commands with
"--maximum-memory MMM /dev/XXX" so each knows how much cache it can
allocate.  This parameter can also be specified by the user if running
e2fsck directly.
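
A sketch of the decision the fsck wrapper would have to make under this
proposal.  The policy shown (grant every checker its minimum, then share
the remainder in proportion to what each one additionally asked for) is
an assumption for illustration; the struct and function names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per file system to be checked in parallel; sizes in MiB. */
struct fsck_mem {
	uint64_t min_mb;	/* reported minimum */
	uint64_t want_mb;	/* reported "desired" amount */
	uint64_t grant_mb;	/* output: value to pass as the maximum */
};

/*
 * avail_mb is total RAM less the reserve kept back to avoid thrashing.
 * Returns 0 on success, or -1 if even the minimums do not fit, in which
 * case the caller would have to reduce parallelism.
 */
int divide_memory(struct fsck_mem *fs, size_t n, uint64_t avail_mb)
{
	uint64_t min_sum = 0, extra_sum = 0, spare;
	size_t i;

	for (i = 0; i < n; i++) {
		if (fs[i].want_mb < fs[i].min_mb)
			fs[i].want_mb = fs[i].min_mb;
		min_sum += fs[i].min_mb;
		extra_sum += fs[i].want_mb - fs[i].min_mb;
	}
	if (min_sum > avail_mb)
		return -1;

	spare = avail_mb - min_sum;
	for (i = 0; i < n; i++) {
		uint64_t extra = fs[i].want_mb - fs[i].min_mb;

		/* Scale down only when the requests exceed what is spare.
		 * Values are in MiB, so the product fits in 64 bits for
		 * any realistic memory size. */
		if (extra_sum > spare)
			extra = extra * spare / extra_sum;
		fs[i].grant_mb = fs[i].min_mb + extra;
	}
	return 0;
}
```

For example, two checkers reporting (min 100, desired 400) and
(min 100, desired 200) with 400 MiB available would be granted 250 and
150 MiB respectively.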

I haven't looked through your patch yet, but I hope to get to it soon.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-

From: David Chinner
Date: Monday, January 21, 2008 - 8:38 pm

And while fsck is running, some other program runs that uses
memory and blows your carefully calculated parameters to smithereens?

I think there is a clear need for applications to be able to
register a callback from the kernel to indicate that the machine as
a whole is running out of memory and that the application should
trim its caches to reduce memory utilisation.

Perhaps instead of swapping immediately, a SIGLOWMEM could be sent
to processes that aren't masking the signal, followed by a short
grace period to allow the processes to free up some memory before
swapping out pages from that process?

With this sort of feedback, the fsck process can scale back its
readahead and remove cached info that is not critical to what it
is currently doing and thereby prevent readahead thrashing as
memory usage of the fsck process itself grows.
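
From the application side, handling such a signal could look like the
sketch below.  SIGLOWMEM does not exist; SIGUSR1 stands in for it here
so the code compiles, and the halving policy and function names are
assumptions.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signal: SIGUSR1 is only a stand-in for illustration. */
#define SIGLOWMEM SIGUSR1

static volatile sig_atomic_t lowmem_pending;

/* Handler only sets a flag, which is all that is async-signal-safe. */
static void on_lowmem(int sig)
{
	(void)sig;
	lowmem_pending = 1;
}

/* Install the handler; returns 0 on success, -1 on error. */
int lowmem_install(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_lowmem;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	return sigaction(SIGLOWMEM, &sa, NULL);
}

/*
 * Called from the main loop between units of work.  If a low-memory
 * notification arrived, halve the readahead window (never going below
 * floor_blocks); otherwise return the window unchanged.
 */
size_t lowmem_adjust(size_t window_blocks, size_t floor_blocks)
{
	if (!lowmem_pending)
		return window_blocks;
	lowmem_pending = 0;
	window_blocks /= 2;
	return window_blocks < floor_blocks ? floor_blocks : window_blocks;
}
```

The same hook is where non-critical cached info would be dropped during
the grace period.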

Another example where this could be useful is to tell browsers to
release some of their cache rather than having the VM swap it out.

IMO, a scheme like this will be far more reliable than trying to
guess what the optimal settings are going to be over the whole
lifetime of a process....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Valdis.Kletnieks
Date: Monday, January 21, 2008 - 9:17 pm

AIX had SIGDANGER some 15 years ago.  Admittedly, that was sent when
the system was about to hit OOM, not when it was about to start swapping.

I suspect both approaches have their merits...
From: Andreas Dilger
Date: Tuesday, January 22, 2008 - 12:00 am

I'd tried to advocate SIGDANGER some years ago as well, but none of
the kernel maintainers were interested.  It definitely makes sense
to have some sort of mechanism like this.  At the time I first brought
it up it was in conjunction with Netscape using too much cache on some
system, but it would be just as useful for all kinds of other memory-
hungry applications.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-

From: Alan Cox
Date: Tuesday, January 22, 2008 - 6:05 am

There is an earlier thread for a /proc file which you can add to your
poll() set and it will wake people when memory is low. Very elegant and
if async support is added it will also give you the signal variant for
free.
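
A sketch of what waiting on such a descriptor could look like.  The
thread does not name the file, so the caller is assumed to have opened
it already, and it is an assumption that low memory is reported as
POLLIN.  The function name lowmem_wait() is hypothetical.

```c
#include <poll.h>

/*
 * Wait up to timeout_ms for a low-memory notification on notify_fd.
 * Returns 1 if the notification fired, 0 on timeout, -1 on error.
 */
int lowmem_wait(int notify_fd, int timeout_ms)
{
	struct pollfd pfd;
	int rc;

	pfd.fd = notify_fd;
	pfd.events = POLLIN;	/* assumed event type for "memory is low" */
	pfd.revents = 0;

	rc = poll(&pfd, 1, timeout_ms);
	if (rc < 0)
		return -1;
	if (rc == 0)
		return 0;
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

In a real event loop the descriptor would simply be one more entry in
the existing poll() set; a timeout of 0 turns this into a cheap
non-blocking check between units of work.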

Alan
-

From: Theodore Tso
Date: Tuesday, January 22, 2008 - 7:40 am

It's been discussed before, but I suspect the main reason why it was
never done is no one submitted a patch.  Also, the problem is actually
a pretty complex one.  There are a couple of different stages where
you might want to send an alert to processes:

    * Data is starting to get ejected from page/buffer cache
    * System is starting to swap
    * System is starting to really struggle to find memory
    * System is starting an out-of-memory killer

AIX's SIGDANGER really did the last two, where the OOM killer would
tend to avoid processes that had a SIGDANGER handler in favor of
processes that were SIGDANGER unaware.

Then there is the additional complexity in Linux that you have
multiple zones of memory, which at least on the historically more
popular x86 was highly, highly important.  You could say that whenever
there is sufficient memory pressure in any zone that you start
ejecting data from caches or start to swap that you start sending the
signals --- but on x86 systems with lowmem, that could happen quite
frequently, and since a user process has no idea whether its resources
are in lowmem or highmem, there's not much you can do about this.

Hopefully this is less of an issue today, since the 2.6 VM is much
better behaved, and people are gradually moving over to x86_64
anyway.  (Sorry SGI and Intel, unfortunately they're not moving over
to the Itanic :-).   So maybe this would be better received now.

Bringing us back to the main topic at hand, one of the tradeoffs in
Val's current approach is that by relying on the kernel's buffer
cache, we don't have to worry about locking and coherency at the
userspace level.  OTOH, we give up low-level control about when memory
gets thrown out, and it also means that simply getting notified when
the system starts to swap isn't good enough.  We need to know much
earlier, when the system starts ejecting data from the buffer and page
caches.

Does this matter?  Well, there are a couple of use cases:

     * The ...
From: Arnaldo Carvalho de Melo
Date: Tuesday, January 22, 2008 - 7:57 am

Aren't Marcelo, Riel and some other people working on memory
notifications?

- Arnaldo
-

From: Pavel Machek
Date: Monday, January 28, 2008 - 12:30 pm

As user pages are always in highmem, this should be easy to decide:
only send SIGDANGER when highmem is full. (Yes, there are
inodes/dentries/file descriptors in lowmem, but I doubt apps will
respond to SIGDANGER by closing files).
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Theodore Tso
Date: Monday, January 28, 2008 - 12:56 pm

Good point; for a system with at least (say) 2GB of memory, that
definitely makes sense.  For a system with less than 768 megs of
memory (how quaint, but it wasn't that long ago this was a lot of
memory :-), there wouldn't *be* any memory in highmem at all....

	       		     	 	- Ted
-

From: Pavel Machek
Date: Monday, January 28, 2008 - 1:01 pm

Ok, so it is 'send SIGDANGER when all zones are low', because user
allocations can come from all zones (unless you have something really
exotic; I'm not sure if that is true on huge NUMA machines & similar).

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: KOSAKI Motohiro
Date: Sunday, February 3, 2008 - 6:51 am

Thank you, that is a good point.

To be honest, the zone awareness of the current mem_notify is premature.
I think we need to enhance the RSS statistics to track per-zone RSS,
but that is not implemented yet ;-)

Unfortunately, I have no highmem machine, so
mem_notify has not been tested much on highmem machines.

If you can help with testing, I would be very happy!
Thanks.
-

From: david
Date: Tuesday, January 29, 2008 - 1:29 am

not to mention machines with 1G of ram (900M lowmem, 128M highmem)

David Lang
-

From: Andreas Dilger
Date: Tuesday, January 22, 2008 - 12:05 am

Well, fsck has a rather restricted working environment, because it is
run before most other processes start (i.e. single-user mode).  For fsck
initiated by an admin in other runlevels the admin would need to specify
the upper limit of memory usage.  My proposal was only for the single-user
fsck at boot time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-

From: David Chinner
Date: Tuesday, January 22, 2008 - 1:16 am

The simple case. ;)

Because XFS has shutdown features, it's not uncommon to hear about
people running xfs_repair on an otherwise live system. e.g. XFS
detects a corrupted block, shuts down the filesystem, the admin
unmounts it, runs xfs_repair, puts it back online. meanwhile, all
the other filesystems and users continue unaffected. In this use
case, getting feedback about memory usage is, IMO, very worthwhile.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Bryan Henderson
Date: Tuesday, January 22, 2008 - 10:42 am

The problem with that approach is that the Fsck process doesn't know how
its need for memory compares with other processes' need for memory.  How
much memory should it give up?  Maybe it should just quit altogether if 
other processes are in danger of deadlocking.  Or maybe it's best for it 
to keep all its memory and let some other frivolous process give up its 
memory instead.

It's the OS's job to have a view of the entire system and make resource 
allocation decisions.

If it's just a matter of the application choosing a better page frame to 
vacate than what the kernel would have taken, (which is more a matter of 
self-interest than resource allocation), then Fsck can do that more 
directly by just monitoring its own page fault rate.  If it's high, then 
it's using more real memory than the kernel thinks it's entitled to and it 
can reduce its memory footprint to improve its speed.  It can even check 
whether an access to readahead data caused a page fault; if so, it knows 
reading ahead is actually making things worse and therefore reduce 
readahead until the page faults stop happening.
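
A sketch of that self-monitoring idea, assuming the major page fault
count from getrusage() is an acceptable proxy for "using more real
memory than the kernel thinks it's entitled to".  It does not attribute
faults to readahead data specifically.  The threshold, the halving
policy and all names are hypothetical.

```c
#include <stddef.h>
#include <sys/resource.h>

/* Remembers the fault count seen at the previous sample. */
struct fault_monitor {
	long last_majflt;
};

/* Major (IO-requiring) page faults of this process so far, or -1. */
long major_faults(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_majflt;
}

/*
 * Compare the current fault count with the previous sample.  If more
 * than `threshold` new major faults occurred, halve the readahead
 * window (never going below floor_blocks).  Returns the new window.
 */
size_t fault_adjust(struct fault_monitor *m, long now_majflt,
		    size_t window_blocks, size_t floor_blocks,
		    long threshold)
{
	long delta = now_majflt - m->last_majflt;

	m->last_majflt = now_majflt;
	if (delta > threshold) {
		window_blocks /= 2;
		if (window_blocks < floor_blocks)
			window_blocks = floor_blocks;
	}
	return window_blocks;
}
```

The checker would call fault_adjust(&m, major_faults(), ...) at regular
points, for example once per block group, and pass the result to
whatever controls the readahead depth.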

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems

-
