On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote:
Well, we return after writing 128MB because of the magic
s_max_writeback_mb_bump. The fact that nr_to_write limits the number
of pages written is intentional in the writeback code. I've disagreed
with it, but I don't think it would be
legit to completely ignore nr_to_write in WB_SYNC_ALL mode --- is that
what you are saying we should do? (If it were indeed legit to ignore
nr_to_write, I would have done it a long time ago; I introduced
s_max_writeback_mb_bump instead as a workaround for what I consider to
be a serious misfeature in the writeback code.)

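To make the arithmetic concrete, here is a toy userspace model of the
behaviour described above. It is a sketch, not the kernel code: the
toy_* names and the 4KB page size are assumptions for illustration,
and raising the budget straight to the cap is a simplification of
what ext4 actually does.

```c
/* Assumption: 4 KiB pages, as on x86. */
#define TOY_PAGE_SHIFT 12

/* Hypothetical helper: convert a cap given in megabytes to pages. */
static long toy_mb_to_pages(long mb)
{
    return mb << (20 - TOY_PAGE_SHIFT);
}

/*
 * Hypothetical model of the bump: if the caller's page budget is
 * smaller than the cap, raise it to the cap; otherwise leave it alone.
 */
static long toy_bump_budget(long nr_to_write, long max_mb_bump)
{
    long cap = toy_mb_to_pages(max_mb_bump);

    return nr_to_write < cap ? cap : nr_to_write;
}

/*
 * One writeback pass: returns the number of pages written, which is
 * bounded by the budget even if more pages are still dirty.
 */
static long toy_writeback_pass(long dirty_pages, long nr_to_write)
{
    return dirty_pages < nr_to_write ? dirty_pages : nr_to_write;
}
```

With 1GB of dirty data (262144 pages) and a caller budget of 1024
pages, toy_bump_budget(1024, 128) yields 32768 pages, so a single
pass writes 128MB and returns with 229376 pages still dirty.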
Hmm, does this happen with XFS, too? If not, I wonder how they handle
it? And whether we need to push a solution into the generic layers.

Yeah, I've noticed this. What it means is that if we have massive
memory pressure in a particular zone, pages which are subject to
delayed allocation won't get written out by mm/vmscan.c. Anonymous
pages will be written out to swap, and data pages which are re-written
via random access mmap() (and so we know where they will be written on
disk) will get written, and that's not a problem. So with relatively
large zones, it happens, but most of the time I don't think it's a
major problem.

I am worried about this issue in certain configurations where pseudo
NUMA zones have been created and are artificially really tiny (128MB)
for container support, but that's not a standard upstream thing.

This is done to avoid a lock inversion, and so this is an
ext4-specific thing (at least I don't think XFS's delayed allocation
has this misfeature). It would be interesting to see documented
evidence that this is easily triggered under normal situations. If
so, we should look into figuring out how to fix this...

- Ted