Re: [patch][rfc] rewrite ramdisk

Previous thread: Re: 2.6.23-mm1 by Neil Brown on Monday, October 15, 2007 - 3:31 am. (2 messages)

Next thread: [PATCH] PS3 system bus add_uevent_var() fallout by Geert Uytterhoeven on Monday, October 15, 2007 - 5:51 am. (1 message)
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 4:28 am

Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit unmaintained,
so decided to sent the patch to you :-).
I have CCed Ted, who did work on the code in the 90s. I found no current
email address of Chad Page.

We have seen ramdisk based install systems, where some pages of mapped
libraries and programs were suddendly zeroed under memory pressure. This
should not happen, as the ramdisk avoids freeing its pages by keeping them
dirty all the time.

It turns out that there is a case, where the VM makes a ramdisk page clean,
without telling the ramdisk driver.
On memory pressure shrink_zone runs and it starts to run shrink_active_list.
There is a check for buffer_heads_over_limit, and if true, pagevec_strip is
called. pagevec_strip calls try_to_release_page. If the mapping has no
releasepage callback, try_to_free_buffers is called. try_to_free_buffers has
now a special logic for some file systems to make a dirty page clean, if all
buffers are clean. Thats what happened in our test case.

The solution is to provide a noop-releasepage callback for the ramdisk driver.
This avoids try_to_free_buffers for ramdisk pages.

To trigger the problem, you have to make buffer_heads_over_limit true, which
means:
- lower max_buffer_heads
or
- have a system with lots of high memory

Andrew, if there are no objections - please apply. The patch applies against
todays git.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

---
drivers/block/rd.c | 13 +++++++++++++
1 files changed, 13 insertions(+)

Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -189,6 +189,18 @@ static int ramdisk_set_page_dirty(struct
return 0;
}

+/*
+ * releasepage is called by pagevec_strip/try_to_release_page if
+ * buffers_heads_over_limit is true. Without a releasepage function
+ * try_to_free_buffers is called instead. That can unset the ...

To: Christian Borntraeger <borntraeger@...>
Cc: Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 10:06 am

This really needs to be fixed...

I can't make up my mind between the approaches to fixing it.

On one hand, I would actually prefer to really mark the buffers
dirty (as in: Eric's fix for this problem[*]) than this patch,
and this seems a bit like a bandaid...

On the other hand, the wound being covered by the bandaid is
actually the code in the buffer layer that does this latent
"cleaning" of the page because it sadly doesn't really keep
track of the pagecache state. But it *still* feels like we
should be marking the rd page's buffers dirty which should
avoid this problem anyway.

[*] However, hmm, with Eric's patch I guess we'd still have a hole
where filesystems that write their buffers by hand think they are
"cleaning" these things and we're back to square one. That could
be fixed by marking the buffers dirty again?

Why were Eric's patches dropped, BTW? I don't remember.

-

To: Nick Piggin <nickpiggin@...>
Cc: Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 5:16 am

runtime problems, iirc.
-

To: Andrew Morton <akpm@...>
Cc: Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 11:23 am

Why do you say that? I guess it is _different_, by necessity(?)
Is there anything that is really bad? I guess it's not nice
for operating on the pagecache from its request_fn, but the
alternative is to duplicate pages for backing store and buffer
cache (actually that might not be a bad alternative really).
-

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 11:14 pm

make_page_uptodate() is most hideous part I have run into.
It has to know details about other layers to now what not
to stomp. I think my incorrect simplification of this is what messed

Cool. Triple buffering :) Although I guess that would only
apply to metadata these days. Having a separate store would
solve some of the problems, and probably remove the need
for carefully specifying the ramdisk block size. We would
still need the magic restictions on page allocations though
and it we would use them more often as the initial write to the
ramdisk would not populate the pages we need.

A very ugly bit seems to be the fact that we assume we can
dereference bh->b_data without any special magic which
means the ramdisk must live in low memory on 32bit machines.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 2:45 am

Not really, it's just named funny. That's just a minor utility
function that more or less does what it says it should do.

The main problem is really that it's implementing a block device

Double buffering. You no longer serve data out of your buffer
cache. All filesystem data was already double buffered anyway,
so we'd be just losing out on one layer of savings for metadata.
I think it's worthwhile, given that we'd have a "real" looking

What magic restrictions on page allocations? Actually we have
fewer restrictions on page allocations because we can use
highmem! And the lowmem buffercache pages that we currently pin
(unsuccessfully, in the case of this bug) are now completely
reclaimable. And all your buffer heads are now reclaimable.

If you mean GFP_NOIO... I don't see any problem. Block device
drivers have to allocate memory with GFP_NOIO; this may have
been considered magic or deep badness back when the code was

Yeah but that's not rd.c. You need to rewrite the buffer layer
to fix that (see fsblock ;)).
-

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 12:57 am

Well to put it another way, mark_page_uptodate() is the only
place where we really need to know about the upper layers.
Given that you can kill ramdisks by coding it as:

static void make_page_uptodate(struct page *page)
{
clear_highpage(page);
flush_dcache_page(page);
SetPageUptodate(page);
}

Something is seriously non-intuitive about that function if
you understand the usual rules for how to use the page cache.

The problem is that we support a case in the buffer cache
where pages are partially uptodate and only the buffer_heads
remember which parts are valid. Assuming we are using them
correctly.

Having to walk through all of the buffer heads in make_page_uptodate

Hmm. Good point. So in net it should save memory even if

Well I always figured it was a bit rude allocating large amounts

I'm not certain which way we should go. Take fsblock and run it
in parallel until everything is converted or use fsblock as a
prototype and once we have figured out which way we should go
convert struct buffer_head into struct fsblock one patch at a time.

I'm inclined to think we should evolve the buffer_head.

Eric

-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 4:08 am

You're overwriting some buffers that were uptodate and dirty.

Sure, but it's not just about the buffers. It's the pagecache
in general. It is supposed to be invisible to the device driver
and sitting above it, and yet it is taking the buffercache and

Highmem systems would definitely like it. For others, yes, all
the duplicated pages should be able to get reclaimed if memory
gets tight, along with the buffer heads, so yeah footprint may

You'd rather not, of course, but with dirty data limits now,
it doesn't matter much. (and I doubt anybody outside testing
is going to be hammering like crazy on rd).

Note that the buffercache based ramdisk driver is going to
also be allocating with GFP_NOFS if you're talking about a
filesystem writing to its metadata. In most systems, GFP_NOFS
isn't much different to GFP_NOIO.

We could introduce a mode which allocates pages up front
quite easily if it were a problem (which I doubt it ever would
be).

-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 3:47 am

Here's a quick first hack...

Comments?

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Wednesday, October 17, 2007 - 6:30 am

I have beaten my version of this into working shape, and things
seem ok.

However I'm beginning to think that the real solution is to remove
the dependence on buffer heads for caching the disk mapping for
data pages, and move the metadata buffer heads off of the block
device page cache pages. Although I am just a touch concerned
there may be an issue with using filesystem tools while the
filesystem is mounted if I move the metadata buffer heads.

If we were to move the metadata buffer heads (assuming I haven't
missed some weird dependency) then I think there a bunch of
weird corner cases that would be simplified.

I guess that is where I look next.

Oh for what it is worth I took a quick look at fsblock and I don't think
struct fsblock makes much sense as a block mapping translation layer for
the data path where the current page caches works well. For less
then the cost of 1 fsblock I can cache all of the translations for a
1K filesystem on a 4K page.

I haven't looked to see if fsblock makes sense to use as a buffer head
replacement yet.

Anyway off to bed with me.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Wednesday, October 17, 2007 - 8:49 am

I'd prefer just to go along the lines of what I posted. It's
a pure block device driver which knows absolutely nothing about
vm or vfs.

What are you guys using rd for, and is it really important to

Except the page cache doesn't store any such translation. fsblock
works very nicely as a buffer_head and nobh-mode replacement there,
but we could probably go one better (for many filesystems) by using

I'm not exactly sure what you mean, unless you're talking about an
extent based data structure. fsblock is fairly slim as far as a
generic solution goes. You could probably save 4 or 8 bytes in it,

Well I don't want to get too far off topic on the subject of fsblock,
and block mappings (because I think the ramdisk driver simply should
have nothing to do with any of that)... but fsblock is exactly a
buffer head replacement (so if it doesn't make sense, then I've
screwed something up badly! ;))
-

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Wednesday, October 17, 2007 - 2:45 pm

Well what brought this up for me was old user space code using
an initial ramdisk. The actual failure that I saw occurred on
the read path. And fixing init_page_buffers was the real world
fix.

At the moment I'm messing with it because it has become the
itch I've decided to scratch. So at the moment I'm having fun,
learning the block layer, refreshing my VM knowledge and getting my
head around this wreck that we call buffer_heads. The high level
concept of buffer_heads may be sane but the implementation seems to
export a lot of nasty state.

At this point my concern is what makes a clean code change in the
kernel. Because user space can currently play with buffer_heads
by way of the block device and cause lots of havoc (see the recent
resierfs bug in this thread) that is why I increasingly think
metadata buffer_heads should not share storage with the block
device page cache.

If that change is made then it happens that the current ramdisk
would not need to worry about buffer heads and all of that
nastiness and could just lock pages in the page cache. It would not
be quite as good for testing filesystems but retaining the existing
characteristics would be simple.

After having looked a bit deeper the buffer_heads and the block
devices don't look as intricately tied up as I had first thought.
We still have the nasty case of:
if (buffer_new(bh))
unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
That I don't know how it got merged. But otherwise the caches
are fully separate.

So currently it looks to me like there are two big things that will
clean up that part of the code a lot:
- moving the metadata buffer_heads to a magic filesystem inode.
- Using a simpler non-buffer_head returning version of get_block

As a meta_data cache manager perhaps, for a translation cache we need
8 bytes per page max.

However all we need for a generic translation cache (assuming we still
want one) is an array of sector_t per page.

So what we would want is:
int blkbits...

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Wednesday, October 17, 2007 - 9:06 pm

Well if userspace is writing to the filesystem metadata via the
blockdevice while it is running... that's the definition of havoc,
isn't it? ;) Whether or not the writes are going via a unified
metadata/blockdev cache or separate ones.

You really just have to not do that.

The actual reiserfs problem being seen is not because of userspace

No, it wouldn't. Because if you're proposing to split up the buffer
cache and the metadata cache, then you're back to a 2 cache
solution which is basically has the memory characteristics of my

Well its needed because some filesystems forget about their old
metadata. It's not really there to solve aliasing with the blockdev

Although this is going off the track of the ramdisk problem. For
that we should just do the rewrite.
-

To: Nick Piggin <nickpiggin@...>
Cc: Eric W. Biederman <ebiederm@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>
Date: Tuesday, October 16, 2007 - 5:28 pm

We won't be able to fix completely this for a while time, but the fact
that BLKFLSBUF has special semantics has always been a major wart.
Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a
deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt
there are many tools that actually take advantage of this wierd aspect
of ramdisks, so hopefully it's something we could remove in a 18
months or so...

- Ted
-

To: Theodore Tso <tytso@...>
Cc: Eric W. Biederman <ebiederm@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>
Date: Tuesday, October 16, 2007 - 6:08 pm

It would be nice to be able to do that, I agree. The new ramdisk
code will be able to flush the buffer cache and destroy its data
separately, so it can actually be implemented.
-

To: Nick Piggin <nickpiggin@...>
Cc: Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>
Date: Tuesday, October 16, 2007 - 7:48 pm

So the practical problem are peoples legacy boot setups but those
are quickly going away.

The sane thing is probably something that can be taken as a low
level format command for the block device.

Say: dd if=/dev/zero of=/dev/ramX

I know rewriting the drive with all zeroes can cause a modern
disk to redo it's low level format. And that is something
we can definitely implement without any backwards compatibility
problems.

Hmm. Do we have anything special for punching holes in files?
That would be another sane route to take to remove the special
case for clearing the memory.

Eric

-

To: Eric W. Biederman <ebiederm@...>
Cc: Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>
Date: Tuesday, October 16, 2007 - 8:28 pm

We have 2 problems. First is that, for testing/consistency, we
don't want BLKFLSBUF to throw out the data. Maybe hardly anything
uses BLKFLSBUF now, so it could be just a minor problem, but still
one to fix.

Second is actually throwing out the ramdisk data. dd from /dev/null
isn't trivial because it isn't a "command" from the kernel's POV.
rd could examine the writes to see if they are zero and page aligned,
I suppose... but if you're transitioning everyone over to a new

truncate_range, I suppose. A file descriptor syscall based
alternative for madvise would be nice though (like fallocate).

We could always put something in /sys/block/ram*/xxx
-

To: Nick Piggin <nickpiggin@...>
Cc: Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>
Date: Tuesday, October 16, 2007 - 9:13 pm

Hmm. This is interesting because we won't be doing anything that
effects correctness if we don't special case BLKFLSBUF just something
that effects efficiency. So I think we can get away with just
changing blkflsbuf as long as there is a way to get rid of

Well I was thinking you can examine the page you just wrote to
and if it is all zero's you don't need to cache that page anymore.
Call it intelligent compression.

Further it does make forwards and backwards compatibility simple
because all you would have to do to reliably free a ramdisk is:

dd if=/dev/zero of=/dev/ramX

I guess when I look at this it looks like an operation that
is unique to a ramdisk. Real hard drives have a low level
format operations and the like. If we can find something
standard there that is guaranteed to trash your data we
can use that, and have gone from less consistency to more.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Theodore Tso <tytso@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>
Date: Tuesday, October 16, 2007 - 9:47 pm

Technically, it does change correctness: after BLKFLSBUF, the
ramdisk should contain zeroes.

I'm assuming it would also cause problems in tight embedded
environments if ramdisk ram is supposed to be thrown away but
isn't. So maybe not technically a correctness problem, but could

Sure, you could do that, but you still presumably need to support
the old behaviour.

As a test vehicle for filesystems, I'd much rather it didn't do this
of course, because subsequent writes would need to reallocate the
page again.
-

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 5:08 am

Well on that same tune. But with a slightly different implementation.
It compiles but I need to head to bed so I haven't had a chance
to test it yet.

Nick it is very similar to yours with the big difference being that
I embedded a struct address_space instead of rolled rerolled it by
hand, which saves a lot of lines of code.

Eric

---
drivers/block/rd.c
/*
* ramdisk.c - Multiple RAM disk driver - gzip-loading version - v. 0.8 beta.
*
* (C) Chad Page, Theodore Ts'o, et. al, 1995.
*
* This RAM disk is designed to have filesystems created on it and mounted
* just like a regular floppy disk.
*
* It also does something suggested by Linus: use the buffer cache as the
* RAM disk data. This makes it possible to dynamically allocate the RAM disk
* buffer - with some consequences I have to deal with as I write this.
*
* This code is based on the original ramdisk.c, written mostly by
* Theodore Ts'o (TYT) in 1991. The code was largely rewritten by
* Chad Page to use the buffer cache to store the RAM disk data in
* 1995; Theodore then took over the driver again, and cleaned it up
* for inclusion in the mainline kernel.
*
* The original CRAMDISK code was written by Richard Lyons, and
* adapted by Chad Page to use the new RAM disk interface. Theodore
* Ts'o rewrote it so that both the compressed RAM disk loader and the
* kernel decompressor uses the same inflate.c codebase. The RAM disk
* loader now also loads into a dynamic (buffer cache based) RAM disk,
* not the old static RAM disk. Support for the old static RAM disk has
* been completely removed.
*
* Loadable module support added by Tom Dyas.
*
* Further cleanups by Chad Page (page0588@sundance.sjsu.edu):
* Cosmetic changes in #ifdef MODULE, code movement, etc.
* When the RAM disk module is removed, free the protected buffers
* Default RAM disk size changed to 2.88 MB
*
* Added initrd: Werner Almesberger & Hans Lermen, Feb '96
*
* 4/25/96 : Made RAM disk size a parameter (de...

To: Nick Piggin <nickpiggin@...>
Cc: Eric W. Biederman <ebiederm@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 3:52 am

Perhaps unsigned?

Err! Overlapping symbols! The rd_nr() function collides with the rd_nr
variable. It also does not seem needed, since it did not exist before.
It should go, you can set the variable with brd.rd_nr=XXX (same

-

To: Jan Engelhardt <jengelh@...>
Cc: Eric W. Biederman <ebiederm@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 4:07 am

Changed. But it will hopefully just completely replace rd.c,
so I will probably just rename it to rd.c at some point (and
change .config options to stay compatible). Unless someone

I've taken most of that stuff out of rd.c in an effort to stay

Back compat. When rewriting the internals, I want to try avoid
changing any externals if possible. Whether this is the Right

Didn't try it, but the rd.c code does the same thing so I guess it
doesn't (or didn't, when it was written).

-

To: Nick Piggin <nickpiggin@...>
Cc: Eric W. Biederman <ebiederm@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 4:17 am

Attributes always work. Try vt.default_red=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
and you will see.
-

To: Jan Engelhardt <jengelh@...>
Cc: Eric W. Biederman <ebiederm@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 4:26 am

Just doesn't seem to be any point in making it a new and different
module, assuming it can support exactly the same semantics. I'm
only doing so in these first diffs so that they are easier to read
and also easier to do a side by side comparison / builds with the

Ah, nice. (I don't use them much!). Still, backward compat I
think is needed if we are to replace rd.c.
-

To: Nick Piggin <nickpiggin@...>
Cc: Eric W. Biederman <ebiederm@...>, Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 4:53 am

Like I said, I did not see rd_nr in Documentation/kernel-parameters.txt
so I thought there was no compat.
-

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 5:05 am

I obviously agree ;-)

Yes, that would solve the problem as well. As long as we fix
the problem, I am happy. On the other hand, do you see any
obvious problem with this "bandaid"?

Christian
-

To: Christian Borntraeger <borntraeger@...>, Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 10:38 am

I don't think so -- in fact, it could be the best candidate for
a minimal fix for stable kernels (anyone disagree? if not, maybe
you could also send this to the stable maintainers?).

But I do want to have this fixed in a "nice" way. eg. I'd like
it to mark the buffers dirty because that actually results in
more reuse of generic kernel code, and also should make rd
behave more naturally (I like using it to test filesystems
because it can expose a lot more concurrency than something like
loop on tmpfs). It should also be possible to actually have
rd's buffer heads get reclaimed as well, preferably while
exercising the common buffer paths and without writing much new
code.

All of that is secondary to fixing the data corruption problem
of course! But the fact that those alternate patches do exist now
means I want to just bring them into the discussion again before
merging one or the other.
-

To: Nick Piggin <nickpiggin@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 2:38 pm

A minor one. It still leaves us with buffer heads out of sync with

We actually allow that currently for clean pages which is part

The core of my original fix was to modify init_page_buffers so that
when we added buffers to a dirty page the buffers became dirty.

Modifying the generic code is a bit spooky because it requires us
to audit the kernel to make certain nothing else depends on the
current behavior in odd ways. Although since init_page_buffers
is only called when we are adding buffer heads to an existing
page I still think that was the proper change.

The historical reason for my patches not getting merged the first
time is there was some weird issue with reiserfs ramdisks and so
Andrew disabled the code, and then dropped it when he had discovered
he had the patch disabled for several releases. I don't think
any causal relationship was ever established. But I didn't
hear enough about the reiserfs ramdisk issue, to make a guess
what was going on.

So it looks to me like the important invariant we need to maintain
is that when a ramdisk page is dirty it always has buffers and those
buffers are dirty as well. With a little care we can ensure this
happens with just modifications to rd.c

Eric

-

To: Nick Piggin <nickpiggin@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 6:37 pm

Hah. I looked over my last round of patches again and I have been able
to verify by review the parts I was a little iffy about and I have
found where in my cleanups I had missed a layering violation in the
ramdisk code, and removed some needed code. Which probably accounts
for the reiserfs ramdisk problems. Updated patches in a minute.

Eric
-

To: Andrew Morton <akpm@...>
Cc: Nick Piggin <nickpiggin@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 6:40 pm

The problem: When we are trying to free buffers try_to_free_buffers()
will look at ramdisk pages with clean buffer heads and remove the
dirty bit from the page. Resulting in ramdisk pages with data that get
removed from the page cache. Ouch!

Buffer heads appear on ramdisk pages when a filesystem calls
__getblk() which through a series of function calls eventually calls
init_page_buffers().

So to fix the mismatch between buffer head and page state this patch
modifies init_page_buffers() to transfer the dirty bit from the page to
the buffer heads like we currently do for the uptodate bit.

This patch is safe as only __getblk calls init_page_buffers, and
there are only two implementations of block devices cached in the
page cache. The generic implementation in block_dev.c and the
implementation in rd.c.

The generic implementation of block devices always does everything
in terms of buffer heads so it always has buffer heads allocated
before a page is marked dirty so this change does not affect it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/buffer.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 75b51df..8b87beb 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -972,6 +972,7 @@ init_page_buffers(struct page *page, struct block_device *bdev,
struct buffer_head *head = page_buffers(page);
struct buffer_head *bh = head;
int uptodate = PageUptodate(page);
+ int dirty = PageDirty(page);

do {
if (!buffer_mapped(bh)) {
@@ -980,6 +981,8 @@ init_page_buffers(struct page *page, struct block_device *bdev,
bh->b_blocknr = block;
if (uptodate)
set_buffer_uptodate(bh);
+ if (dirty)
+ set_buffer_dirty(bh);
set_buffer_mapped(bh);
}
block++;
--
1.5.3.rc6.17.g1911

-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 4:12 am

This is probably a good idea. Was this causing the reiserfs problems?
If so, I think we should be concentrating on what the real problem
is with reiserfs... (or at least why this so obviously correct
looking patch is wrong).

-

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 5:35 am

I think it was my cleanup patch that was sitting on top of this,
That caused the problems.

Eric
-

To: Andrew Morton <akpm@...>
Cc: Nick Piggin <nickpiggin@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Monday, October 15, 2007 - 6:42 pm

I have not observed this case but it is possible to get a dirty page
cache with clean buffer heads if we get a clean ramdisk page with
buffer heads generated by a filesystem calling __getblk and then write
to that page from user space through the block device. Then we just
need to hit the proper window and try_to_free_buffers() will mark that
page clean and eventually drop it. Ouch!

To fix this use the generic __set_page_dirty_buffers in the ramdisk
code so that when we mark a page dirty we also mark it's buffer heads
dirty.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
drivers/block/rd.c | 13 +------------
1 files changed, 1 insertions(+), 12 deletions(-)

diff --git a/drivers/block/rd.c b/drivers/block/rd.c
index 701ea77..84163da 100644
--- a/drivers/block/rd.c
+++ b/drivers/block/rd.c
@@ -178,23 +178,12 @@ static int ramdisk_writepages(struct address_space *mapping,
return 0;
}

-/*
- * ramdisk blockdev pages have their own ->set_page_dirty() because we don't
- * want them to contribute to dirty memory accounting.
- */
-static int ramdisk_set_page_dirty(struct page *page)
-{
- if (!TestSetPageDirty(page))
- return 1;
- return 0;
-}
-
static const struct address_space_operations ramdisk_aops = {
.readpage = ramdisk_readpage,
.prepare_write = ramdisk_prepare_write,
.commit_write = ramdisk_commit_write,
.writepage = ramdisk_writepage,
- .set_page_dirty = ramdisk_set_page_dirty,
+ .set_page_dirty = __set_page_dirty_buffers,
.writepages = ramdisk_writepages,
};

--
1.5.3.rc6.17.g1911

-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 4:19 am

Hmm, so we can also have some filesystems writing their own buffers
out by hand (clear_buffer_dirty, submit buffer for IO). Other places
will do similarly dodgy things with filesystem metadata
(fsync_buffers_list, for example).

So your buffers get cleaned again, then your pages get cleaned.

While I said it was a good fix when I saw the patch earlier, I think
it's not closing the entire hole, and as such, Christian's patch is
probably the way to go for stable.

For mainline, *if* we want to keep the old rd.c around at all, I
don't see any harm in this patch so long as Christian's is merged
-

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 3:06 pm

So I just took a little bit of time to look at and think about
the path you are referring to, and I don't see a problem.

The rule with the buffer dirty bit is that you first clear it
and then you submit the write. When the write request finally
makes it's way to rd.c we copy the data if necessary and call
set_page_dirty. Which will then mark the page and the buffer
dirty again.

In essence the ramdisk code just attempts to lock buffers in
ram by setting their dirty bit, just like we do for pages
in ramfs.

The only case where I see the dirty bit getting cleared without
submitting I/O is when dirty state doesn't matter, in which
case if we get a page full buffers all of whose data we don't care
about it is legitimate to drop the page.

As for ramdisk_writepage it probably makes sense to remove it,
as the generic code paths seem to work as well or better the
writepage method is NULL. However if we do keep it we should call

I thought through the logic in try_to_free_buffers and it actually
makes sense to me now. mark_buffer_dirty sets the page dirty bit
so dirty buffers reside on dirty pages. When we submit I/O we
aren't guaranteed to submit all of the dirty buffers for a page
at once, so we don't clear the page dirty bit. With the result
that we can end up with pages with the dirty bit set but all of
their buffers are clean.

Since we rather deliberately allow truly clean pages to be dropped
from the ramdisk overriding try_to_free_buffer_pages looks wrong
because then except for invalidate we can not remove buffers

I do agree that with the amount of code duplication in the buffer
cache that locking pages into the buffer cache seems very error prone,
and difficult to maintain. So rewriting rd.c to store it's pages
elsewhere sounds very promising.

While I can see Christian's patch as fixing the symptom. I have
a very hard time see it as correct. If we did something more
elaborate to replace try_to_free_buffer_pages such that
we could drop buffers from clean buf...

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 6:06 pm

Yeah, which is half the reason why its so complicated. Logically
it should just hold another reference on the pages rather than
interfere with pagecache state, but it can't do that because it

Yeah, if your fix works I guess it is better to use it and converge
code rather than diverge it even more.
-

To: Nick Piggin <nickpiggin@...>
Cc: Eric W. Biederman <ebiederm@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>
Date: Tuesday, October 16, 2007 - 4:48 am

That would be my preferred method. Merge Erics and my fixup for 2.6.24-rc.
The only open questions is, what was the reiserfs problem? Is it still

While the merge window is still open, to me the new ramdisk code seems
more like a 2.6.25-rc thing. We actually should test the behaviour in
low memory scenarios. What do you think?

Christian
-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Tuesday, October 16, 2007 - 3:56 am

Your patches look sane so far. I have applied both patches, and the problem
seems gone. I will try to get these patches to our testers.

As long as they dont find new problems:

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>

Nick, Eric. What is the best patch for the stable series? Both patches from
Eric or mine? I CCed stable, so they know that something is coming.

Christian
-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 12:14 pm

Eric,

Our testers did only a short test, and then they were stopped by problems with
reiserfs. At the moment I cannot say for sure if your patch caused this, but
we got the following BUG

ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr storage.
------------[ cut here ]------------
kernel BUG at /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117!
illegal operation: 0001 [#1]
Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur
CPU: 3 Not tainted
Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88)
Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs])
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0
Krnl GPRS: 00000002 7411b5c8 0000002b 00000000
7b04d000 00000001 00000000 76d1de00
7513eec0 00000003 00000012 77f77680
7411b608 fb343b7e fb34404a 7513ee50
Krnl Code: fb344374: a7210002 tmll %r2,2
fb344378: a7840004 brc 8,fb344380
fb34437c: a7f40001 brc 15,fb34437e
>fb344380: 5810d8c2 l %r1,2242(%r13)
fb344384: 5820b03c l %r2,60(%r11)
fb344388: 0de1 basr %r14,%r1
fb34438a: 5810d90e l %r1,2318(%r13)
fb34438e: 5820b03c l %r2,60(%r11)

Looking at the code, this really seems related to dirty buffers, so your patch
is the main suspect at the moment.

if (!barrier) {
/* If there was a write error in the journal - we can't commit
* this transaction - it will be invalid and, if successful,
* will just end up propagating the write error out to
* the file system. */
if (likely(!retval && !reiserfs_is_journal_aborted (journal))) {
if (buffer_dirty(jl->j_commit_bh))
1117----> BUG();
...

To: Christian Borntraeger <borntraeger@...>
Cc: Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 1:57 pm

Grr. I'm not certain how to call that.

Given that I should also be able to trigger this case by writing to
the block device through the buffer cache (to the write spot at the
write moment) this feels like a reiserfs bug. Although I feel
screaming about filesystems that go BUG instead of remount read-only....

At the same time I increasingly don't think we should allow user space
to dirty or update our filesystem metadata buffer heads. That seems
like asking for trouble.

Did you have both of my changes applied?
To init_page_buffer() and to the ramdisk_set_dirty_page?

I'm guessing it was the change to ramdisk_set_dirty_page....

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 5:48 pm

Yes, I removed my patch and applied both patches from you.

Christian
-

To: Christian Borntraeger <borntraeger@...>
Cc: Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 6:22 pm

Thanks.

Grr. Inconsistent rules on a core piece of infrastructure.
It looks like that if there is any trivial/minimal fix it
is based on your patch suppressing try_to_free_buffers. Ugh.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Thursday, October 18, 2007 - 5:26 am

Ok. What do you think about having my patch for 2.6.23 stable, for 2.6.24
and doing a nicer fix (rd rewrite for example for post 2.6.24)?

Christian
-

To: Christian Borntraeger <borntraeger@...>
Cc: Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Friday, October 19, 2007 - 6:46 pm

Looking at it. If we don't get carried away using our own private
inode is barely more difficult then stomping on release_page and
in a number of ways a whole lot more subtle. At least for 2.6.24 I
think it makes a sane fix, and quite possibly as a back port as well.

Eric
-

To: Christian Borntraeger <borntraeger@...>
Cc: Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Friday, October 19, 2007 - 6:51 pm

Currently the ramdisk tries to keep the block device page cache pages
from being marked clean and dropped from memory. That fails for
filesystems that use the buffer cache because the buffer cache is not
an ordinary buffer cache user and depends on the generic block device
address space operations being used.

To fix all of those associated problems this patch allocates a private
inode to store the ramdisk pages in.

The result is slightly more memory used for metadata, an extra copying
when reading or writing directly to the block device, and changing the
software block size does not loose the contents of the ramdisk. Most
of all this ensures we don't loose data during normal use of the
ramdisk.

I deliberately avoid the cleanup that is now possible because this
patch is intended to be a bug fix.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
drivers/block/rd.c | 19 ++++++++++++++++---
1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rd.c b/drivers/block/rd.c
index 08176d2..a52f153 100644
--- a/drivers/block/rd.c
+++ b/drivers/block/rd.c
@@ -62,6 +62,7 @@
/* Various static variables go here. Most are used only in the RAM disk code.
*/

+static struct inode *rd_inode[CONFIG_BLK_DEV_RAM_COUNT];
static struct gendisk *rd_disks[CONFIG_BLK_DEV_RAM_COUNT];
static struct block_device *rd_bdev[CONFIG_BLK_DEV_RAM_COUNT];/* Protected device data */
static struct request_queue *rd_queue[CONFIG_BLK_DEV_RAM_COUNT];
@@ -267,7 +268,7 @@ static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
static int rd_make_request(struct request_queue *q, struct bio *bio)
{
struct block_device *bdev = bio->bi_bdev;
- struct address_space * mapping = bdev->bd_inode->i_mapping;
+ struct address_space * mapping = rd_inode[MINOR(bdev->bd_dev)]->i_mapping;
sector_t sector = bio->bi_sector;
unsigned long len = bio->bi_size >> 9;
int rw = bio_data_dir(bio);
@@ -312,6 +313,7 @@ ...

To: Eric W. Biederman <ebiederm@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 12:28 am

This just breaks coherency again like the last patch. That's a
really bad idea especially for stable (even if nothing actually
was to break, we'd likely never know about it anyway).

Christian's patch should go upstream and into stable. For 2.6.25-6,
my rewrite should just replace what's there. Using address spaces
to hold the ramdisk pages just confuses the issue even if they
*aren't* actually wired up to the vfs at all. Saving 20 lines is
not a good reason to use them.

-

To: Nick Piggin <nickpiggin@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 1:10 am

Not a chance. The only way we make it to that inode is through block
device I/O so it lives at exactly the same level in the hierarchy as
a real block device. My patch is the considered rewrite boiled down
to it's essentials and made a trivial patch.

It fundamentally fixes the problem, and doesn't attempt to reconcile

Well is more like saving 100 lines. Not having to reexamine complicated
infrastructure code and doing things the same way ramfs is. I think
that combination is a good reason. Especially since I can do with a
16 line patch as I just demonstrated. It is a solid and simple
incremental change.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 1:24 am

Yes it does. It is exactly breaking the coherency between block
device and filesystem metadata coherency that Andrew cared about.
Whether or not that matters, that is a much bigger conceptual
change than simply using slightly more (reclaimable) memory in
some situations that my patch does.

If you want to try convincing people to break that coherency,
fine, but it has to be done consistently and everywhere rather than

No, it doesn't. A real block device driver does have its own
buffer cache as it's backing store. It doesn't know about

What's the considered rewrite here? The rewrite I posted is the
only one so far that's come up that I would consider [worthy],

It fixes the bug in question, but not because it makes any
fundamental improvement to the conceptual ickyness of rd.c. I
don't know what you mean about attempting to reconcile the

Using radix_tree_insert instead of add_to_page_cache is hardly
complicated. If anything it is simpler because it isn't actually
confusing the issues and it is much better fit with real block
device drivers, which is actually the thing that is familiar to

My opinion is that it is a much bigger behavioural change because
it results in incoherent buffer cache / blockdev cache, and it
results in code which is still ugly and conceptually wrong.
-

To: Nick Piggin <nickpiggin@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 2:48 am

Nick. Reread the patch. The only thing your arguments have
established for me is that this patch is not obviously correct. Which
makes it ineligible for a back port. Frankly I suspect the whole
issue is to subtle and rare to make any backport make any sense. My

Well those pages are only accessed through rd_blkdev_pagecache_IO
and rd_ioctl.

The address space operations can (after my patch) be deleted or
be replaced by their generic versions. I just didn't take that
step because it was an unnecessary change and I wanted the minimal

Well it looks like you were blind when you read the patch.

Because the semantics between the two are almost identical,
except I managed to implement BLKFLSBUF in a backwards compatible
way by flushing both the buffer cache and my private cache. You
failed to flush the buffer cache in your implementation.

Yes. I use an inode 99% for it's mapping and the mapping 99% for it's
radix_tree. But having truncate_inode_pages and grab_cache_page
continue to work sure is convenient. I certainly think it makes it a
lot simpler to audit the code to change just one thing at a time (the
backing store) then to rip out and replace everything and then try and
prove that the two patches are equivalent.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 5:39 am

OK, I missed that you set the new inode's aops to the ramdisk_aops
rather than the bd_inode. Which doesn't make a lot of sense because

Wrong. It will be via the LRU, will get ->writepage() called,
block_invalidate_page, etc. And I guess also via sb->s_inodes, where
drop_pagecache_sb might do stuff to it (although it probably escapes
harm). But you're right that it isn't the obviously correct fix for

If you think it is a nice way to go, then I think you are

Obviously a simple typo that can be fixed by adding one line

It's horrible. And using truncate_inode_pages / grab_cache_page and
new_inode is an incredible argument to save a few lines of code. You
obviously didn't realise your so called private pages would get
accessed via the LRU, for example. This is making a relatively
larger logical change than my patch, because now as well as having
a separate buffer cache and backing store, you are also making the

I think it's a bad idea just to stir the shit. We should take the
simple fix for the problem, and then fix it properly.
-

To: Nick Piggin <nickpiggin@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 1:56 pm

Not totally useless as you have mentioned they are accessed by

Of course it should be fixed. I just don't know if a backport
makes sense. The problem once understood is hard to trigger

Yes. We will be accessed via the LRU. Yes I knew that. The truth is
whatever we do we will be visible to the LRU. My preferences run
towards having something that is less of a special case then a new
potentially huge cache that is completely unaccounted for, that we
have to build and maintain all of the infrastructure for
independently. The ramdisk code doesn't seem interesting enough
long term to get people to do independent maintenance.

With my patch we are on the road to being just like the ramdisk
for maintenance purposes code except having a different GFP mask.

I think I might have to send in a patch that renames
block_invalidatepage to block_invalidate_page which is how everyone
seems to be spelling it.

Anyway nothing in the ramdisk code sets PagePrivate o
block_invalidagepage should never be called, nor try_to_free_buffers

Well we each have different tastes. I think my patch
is a sane sensible small incremental step that does just what
is needed to fix the problem. It doesn't drag a lot of other
stuff into the problem like a rewrite would so we can easily verify

Quite true. I noticed the breakage in mine because the patch
was so simple, and I only had to worry about one aspect. With
a rewrite I missed it because there was so much other noise I

I did but but that is relatively minor. Using the pagecache this
way for this purpose is a well established idiom in the kernel

I am continuing to have the backing store pages visible to the VM,
and from that perspective it is a smaller change then where we are

The code in rd.c isn't terrible, and it isn't shit. There is only one
fundamental problem with it. rd.c is fundamentally incompatible with
the buffer cache. Currently rd.c is a legitimate if a bit ugly
user of the page cache. The ugliness comes from working a...

To: Eric W. Biederman <ebiederm@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 8:29 pm

Well, the ones that are unused are totally useless ;) I would

I mean backported. That's just me though, I don't know the nuances
of -stable releases. It could be that they rather not risk introducing

OK it just didn't sound like it, seeing as you said that's the only

No. With my patch, nothing in the ramdisk code is visible to the LRU.

It's not a cache, and it's not unaccounted for. It's specified exactly
with the rd sizing parameters. I don't know why you would say your
patch is better in this regard. Your ramdisk backing store will be

You can be using highmem, BTW. And technically it probably isn't

The small incremental step that fixes the problem is Christian's

It is smaller lines of code. It is a larger change. Because what you
are doing is 2 things. You are firstly discontinuing the use of the
buffer cache for the backing store, and secondly you are introducing
a new backing store which registers an inode with the vfs and pages
with the pagecache.

My patch does the same thing without those two last questionable
steps.

You now have to treat those backing store pages as pagecache pages,

Any page you allocate is your private page. The problem is you're
just sending them back to the VM again.
-

To: Eric W. Biederman <ebiederm@...>
Cc: Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 3:28 am

About being rare, when I force the VM to be more aggressive reclaiming buffer
by using the following patch:
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -3225,7 +3225,7 @@ void __init buffer_init(void)
* Limit the bh occupancy to 10% of ZONE_NORMAL
*/
nrpages = (nr_free_buffer_pages() * 10) / 100;
- max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
+ max_buffer_heads = 0;
hotcpu_notifier(buffer_cpu_notify, 0);
}

I can actually cause data corruption within some seconds. So I think the
problem is real enough to be worth fixing.

I still dont fully understand what issues you have with my patch.
- it obviously fixes the problem
- I am not aware of any regression it introduces
- its small

One concern you had, was the fact that buffer heads are out of sync with
struct pages. Testing your first patch revealed that this is actually needed
by reiserfs - and maybe others.
I can also see, that my patch looks a bit like a bandaid that cobbles the rd
pieces together.
Is there anything else, that makes my patch unmergeable in your opinion?

Christian

-

To: Christian Borntraeger <borntraeger@...>
Cc: Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 4:23 am

Let me put it another way. Looking at /proc/slabinfo I can get
37 buffer_heads per page. I can allocate 10% of memory in
buffer_heads before we start to reclaim them. So it requires just
over 3.7 buffer_heads on very page of low memory to even trigger
this case. That is a large 1k filesystem or a weird sized partition,
that we have written to directly.

That makes this condition very rare in practice without your patch.

Especially since even after we reach the above condition we have
to have enough vm pressure to find a page with clean buffer heads
that is dirty in the ramdisk.

While it can be done deterministically usually it is pretty hard
to trigger and pretty easy to work around by simply using partition

My primary issue with your patch is that it continues the saga the
trying to use buffer cache to store the data which is a serious

For linus's tree the consensus is that to fix rd.c that we
need to have a backing store that is stored somewhere besides
in the page cache/buffer cache for /dev/ram0. Doing that prevents
all of the weird issues.

Now we have the question of which patch gets us there. I contend
I have implemented it with my last little patch that this thread
is a reply to. Nick hasn't seen that just yet.

So if we have a small patch that can implement the proper long
term fix I contend we are in better shape.

As for backports we can worry about that after we get something
sane merged upstream.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 5:56 am

You don't want to change that for a stable patch, however.

Or ever will. It wasn't that my whole argument against it is
based on that I mistakenly thought your patch served the bdev

I just don't think what you have is the proper fix. Calling
into the core vfs and vm because right now it does something
that works for you but is completely unrelated to what you
are conceptually doing is not the right fix.

Also, the patch I posted is big because it did other stuff
with dynamically allocated ramdisks from loop (ie. a modern
rewrite). As it is applied to rd.c and split into chunks, the
actual patch to switch to the new backing store isn't actually
that big. I'll submit it to -mm after things stabilise after
the merge window too.
-

To: Nick Piggin <nickpiggin@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 2:39 pm

Possibly. But the same proportions still hold. 1k filesystems
are not the default these days and ramdisks are relatively uncommon.
The memory quantities involved are all low mem.

This is an old problem. If it was common I suspect we would have
noticed and fixed it long ago. As far as I can tell this problem dates
back to 2.5.13 in the commit below (which starts clearing the dirty
bit in try_to_free_buffers). The rd.c had earlier made the transition
from using BH_protected to using the dirty bit to lock it into the

No it avoids the bug which is something slightly different.
Further I contend that it is not obviously correct that there
are no other side effects (because it doesn't actually fix the
bug), and that makes it of dubious value for a backport.

If I had to slap a patch on there at this point just implementing
an empty try_to_release_page (which disables try_to_free_buffers)
would be my choice. I just think something that has existed
at least potentially for the entire 2.6 series, and is easy

I think there is a strong conceptual relation and other code
doing largely the same thing is already in the kernel (ramfs). Plus
my gut feel says shared code will make maintenance easier.

You do have a point that the reuse may not be perfect and if that
is the case we need to watch out for the potential to mess things
up.

Sounds like a plan.

Eric

-

To: Eric W. Biederman <ebiederm@...>
Cc: Nick Piggin <nickpiggin@...>, Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Monday, October 22, 2007 - 9:11 am

On Sun, 21 Oct 2007 12:39:30 -0600

It is definitely common during run time. It was seen in practice enough
to be reproducible and get fixed for the non-ramdisk case.

The big underlying question is how which ramdisk usage case are we
shooting for. Keeping the ram disk pages off the LRU can certainly help
the VM if larger ramdisks used at runtime are very common.

Otherwise, I'd say to keep it as simple as possible and use Eric's
patch. By simple I'm not counting lines of code, I'm counting overall
readability between something everyone knows (page cache usage) and
something specific to ramdisks (Nick's patch).

-chris
-

To: Eric W. Biederman <ebiederm@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 9:56 pm

You don't need 1K filesystems to have buffers attached though,
of course. You can hit the limit with a 4K filesystem with less

The bug in question isn't exactly that it uses buffercache for its
backing store (although that's obviously rather hairy as well). It's

How is that better? Now you're making the exact same change for
all filesystems that you didn't think was obviously correct for

ramfs is rather a different case. Filesystems intimately know

It's just wrong. I guess if you don't see that by now, then we
have to just disagree.
-

To: Eric W. Biederman <ebiederm@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 3:14 pm

In this case, the commit block isn't allowed to be dirty before reiserfs
decides it is safe to write it. The journal code expects it is the only
spot in the kernel setting buffer heads dirty, and it only does so after
the rest of the log blocks are safely on disk.

Given this is a ramdisk, the check can be ignored, but I'd rather not

Demanding trouble ;)

-chris

-

To: Chris Mason <chris.mason@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 4:29 pm

Ok. So the journal code here fundamentally depends on being able to
control the order of the writes, and something else being able to set

Looks like it. There are even comments in jbd about the same class
of problems. Apparently dump and tune2fs on mounted filesystems have
triggered some of these issues. The practical question is any of this
trouble worth handling.

Thinking about it. I don't believe anyone has ever intentionally built
a filesystem tool that depends on being able to modify a file systems
metadata buffer heads while the filesystem is running, and doing that
would seem to be fragile as it would require a lot of cooperation
between the tool and the filesystem about how the filesystem uses and
implement things.

Now I guess I need to see how difficult a patch would be to give
filesystems magic inodes to keep their metadata buffer heads in.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 4:54 pm

That's right. For example, ext2 is doing directories in the page cache
of the directory inode, so there's a cache alias between the block

Not hard, the block device inode is already a magic inode for metadata
buffer heads. You could just make another one attached to the bdev.

But, I don't think I fully understand the problem you're trying to
solve?

-chris

-

To: Chris Mason <chris.mason@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 5:30 pm

So the start:
When we write buffers from the buffer cache we clear buffer_dirty
but not PageDirty

So try_to_free_buffers() will mark any page with clean buffer_heads
that is not clean itself clean.

The ramdisk set pages dirty to keep them from being removed from the
page cache, just like ramfs.

Unfortunately when those dirty ramdisk pages get buffers on them and
those buffers all go clean and we are trying to reclaim buffer_heads
we drop those pages from the page cache. Ouch!

We can fix the ramdisk by setting making certain that buffer_heads
on ramdisk pages stay dirty as well. The problem is this breaks
filesystems like reiserfs and ext3 that expect to be able to make
buffer_heads clean sometimes.

There are other ways to solve this for ramdisks, such as changing
where ramdisks are stored. However fixing the ramdisks this way
still leaves the general problem that there are other paths to the
filesystem metadata buffers, and those other paths cause the code
to be complicated and buggy.

So I'm trying to see if we can untangle this Gordian knot, so the
code because more easily maintainable.

To make the buffer cache a helper library instead of require
infrastructure, it looks like two things need to happen.
- Move metadata buffer heads off block devices page cache entries,
- Communicate the mappings of data pages to block device sectors
in a generic way without buffer heads.

How we ultimately fix the ramdisk tends to depend on how we untangle
the buffer head problem. Right now the only simple solution is to
suppress try_to_free_buffers, which is a bit ugly. We can also come
up with a completely separate store for the pages in the buffer cache
but if we wind up moving the metadata buffer heads anyway then that
should not be necessary.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 6:58 pm

So, the problem is using the Dirty bit to indicate pinned. You're
completely right that our current setup of buffer heads and pages and
filesystem metadata is complex and difficult.

But, moving the buffer heads off of the page cache pages isn't going to
make it any easier to use dirty as pinned, especially in the face of
buffer_head users for file data pages.

You've already seen Nick fsblock code, but you can see my general
approach to replacing buffer heads here:

http://oss.oracle.com/mercurial/mason/btrfs-unstable/file/f89e7971692f/e...

(alpha quality implementation in extent_map.c and users in inode.c) The
basic idea is to do extent based record keeping for mapping and state of
things in the filesystem, and to avoid attaching these things to the

Don't get me wrong, I'd love to see a simple and coherent fix for what
reiserfs and ext3 do with buffer head state, but I think for the short
term you're best off pinning the ramdisk pages via some other means.

-chris

-

To: Chris Mason <chris.mason@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 7:28 pm

Let me specific. Not moving buffer_heads off of page cache pages,
moving buffer_heads off of the block devices page cache pages.

My problem is the coupling of how block devices are cached and the
implementation of buffer heads, and by removing that coupling
we can generally make things better. Currently that coupling
means silly things like all block devices are cached in low memory.
Which probably isn't what you want if you actually have a use
for block devices.

For the ramdisk case in particular what this means is that there
are no more users that create buffer_head mappings on the block
device cache so using the dirty bit will be safe.

Further it removes the nasty possibility of user space messing with
metadata buffer head state. So the only way those cases can happen is
a code bug, or a hardware bug.

So I think by removing these unnecessary code paths things will

Yes. And the problem is hard enough to trigger that a short term fix
is actually of debatable value. The reason this hasn't shown up more
frequently is that it only ever triggers if you are in the buffer head
reclaim state, which on a 64bit box means you have to use < 4K buffers
and have your ram cache another block device. That plus most people
use initramfs these days.

For the short term we have Christian's other patch which simply
disables calling try_to_free_buffers. Although that really feels
like a hack to me.

For 2.6.25 I think I have a shot at fixing these things cleanly.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 8:03 pm

Ok, we move the buffer heads off to a different inode, and that indoe
has pages. The pages on the inode still need to get pinned, how does
that pinning happen?

The problem you described where someone cleans a page because the buffer
heads are clean happens already without help from userland. So, keeping
the pages away from userland won't save them from cleaning.

Sorry if I'm reading your suggestion wrong...

-chris

-

To: Chris Mason <chris.mason@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 11:59 pm

If filesystems care at all they want absolute control over the buffer
cache. Controlling which buffers are dirty and when. Because we
keep the buffer cache in the page cache for the block device we have
not quite been giving filesystems that control leading to really weird
bugs.

In addition this tieing of the implemetation of block device caching
and the buffer cache has resulted in a much more complicated and
limited implementation then necessary. Block devices for example
don't need buffer_heads, and it is perfectly reasonable to cache
block devices in high memory.

To start untangling the worst of this mess this patch introduces a
second block device inode for the buffer cache. All buffer cache
operations are diverted to that use the new bd_metadata_inode, which
keeps the weirdness of the metadata requirements isolated in their
own little world.

This should enable future cleanups to diverge and simplify the
address_space_operations of the buffer cache and block device
page cache.

As a side effect of this cleanup the current ramdisk code should
be safe from dropping pages because we never place any buffer heads
on ramdisk pages.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/block_dev.c | 45 ++++++++++++++++++++++++++++++++-------------
fs/buffer.c | 17 ++++++++++++-----
fs/ext3/dir.c | 2 +-
fs/ext4/dir.c | 2 +-
fs/fat/inode.c | 2 +-
include/linux/fs.h | 1 +
6 files changed, 48 insertions(+), 21 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 379a446..87a5760 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -59,10 +59,12 @@ static sector_t max_block(struct block_device *bdev)
/* Kill _all_ buffers and pagecache , dirty or not.. */
static void kill_bdev(struct block_device *bdev)
{
- if (bdev->bd_inode->i_mapping->nrpages == 0)
- return;
- invalidate_bh_lrus();
- truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+ if (bdev->bd_inode->i_map...

To: Eric W. Biederman <ebiederm@...>
Cc: Chris Mason <chris.mason@...>, Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Thursday, October 18, 2007 - 1:10 am

Mmm. Like I said, when a live filesystem is mounted on a bdev,
it isn't like you want userspace to go dancing around on it without
knowing exactly what it is doing.

The kernel more or less does the right thing here with respect to
the *state* of the data[*] (that is, buffer heads and pagecache).
It's when you actually start changing the data itself around is when
you'll blow up the filesystem.

[*] The ramdisk code is simply buggy, right? (and not the buffer
cache)

The idea of your patch in theory is OK, but Andrew raises valid
points about potential coherency problems, I think.

-

To: Nick Piggin <nickpiggin@...>
Cc: Chris Mason <chris.mason@...>, Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Friday, October 19, 2007 - 5:35 pm

From the perspective of the ramdisk it expects the buffer cache to
simply be a user of the page cache, and thus the buffer cache
is horribly buggy.

From the perspective of the buffer cache it still the block device
cache in the kernel and it the way it works are the rules for how
caching should be done in the kernel, and it doesn't give any

There are certainly implementation issues in various filesystems
to overcome before remounting read-write after doing a fsck
on a read-only filesystem will work as it does today. So my patch
is incomplete.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Chris Mason <chris.mason@...>, Christian Borntraeger <borntraeger@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Thursday, October 18, 2007 - 12:32 am

I don't think we little angels want to tread here. There are so many
weirdo things out there which will break if we bust the coherence between
the fs and /dev/hda1. Online resize, online fs checkers, various local
tools which people have hacked up to look at metadata in a live fs,
direct-io access to the underlying fs, heaven knows how many boot loader
installers, etc. Cerainly I couldn't enumerate tham all.

The mere thought of all this scares the crap out of me.

I don't actually see what the conceptual problem is with the existing
implementation. The buffer_head is a finer-grained view onto the
blockdev's pagecache: it provides additional states and additional locking
against a finer-grained section of the page. It works well.

Yeah, the highmem thing is a bit of a problem (but waning in importance).
But we can fix that by teaching individual filesystems about kmap and then
tweak the blockdev's caching policy with mapping_set_gfp_mask() at mount
time. If anyone cares, which they don't.
-

To: Andrew Morton <akpm@...>
Cc: Chris Mason <chris.mason@...>, Christian Borntraeger <borntraeger@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Friday, October 19, 2007 - 5:27 pm

We broke coherence between the fs and /dev/hda1 when we introduced
the page cache years ago, and weird hacky cases like
unmap_underlying_metadata don't change that. Currently only

Well I took a look at ext3. For online resize all of the writes are
done by the fs not by the user space tool. For e2fsck of a read-only
filesystem currently we do cache the buffers for the super block and
reexamine those blocks when we mount read-only.

Which makes my patch by itself unsafe. If however ext3 and anyone
else who does things like that were to reread the data and not
to merely reexamine the data we should be fine.

Fundamentally doing anything like this requires some form of
synchronization, and if that synchronization does not exist
today there will be bugs. Further decoupling things only makes that
requirement clearer.

Unfortunately because of things like the ext3 handling of remounting

The buffer_head itself seems to be a reasonable entity.

The buffer cache is a monster. It does not follow the ordinary rules
of the page cache, making it extremely hard to reason about.

Currently in the buffer cache there are buffer_heads we are not
allowed to make dirty which hold dirty data. Some filesystems
panic the kernel when they notice this. Others like ext3 use a
different bit to remember that the buffer is dirty.

Because of ordering considerations the buffer cache does not hold a
consistent view of what has been scheduled for being written to disk.
It instead holds partially complete pages.

The only place we should ever clear the dirty bit is just before
calling write_page but try_to_free_buffers clears the dirty bit!

We have buffers on pages without a mapping!

In general the buffer cache violates a primary rule for comprehensible
programming having. The buffer cache does not have a clear enough
definition that it is clear what things are bugs and what things
are features.

99% of the weird strange behavior in rd.c is because of the buffer

This presumes I want to...

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Chris Mason <chris.mason@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 12:24 am

Not for metadata. And I wouldn't expect many filesystem analysis

unmap_underlying_metadata isn't about raw block device access at
all, though (if you write to the filesystem via the blockdevice

It either is or it isn't, right? And it is, isn't it? (at least
for the common filesystems).
-

To: Nick Piggin <nickpiggin@...>
Cc: Eric W. Biederman <ebiederm@...>, Andrew Morton <akpm@...>, Chris Mason <chris.mason@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 8:15 pm

It is not true for XFS - it's metadata is not in sync with /dev/<block>
at all as all the cached metadata is kept in a different address space
to the raw block device.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, Chris Mason <chris.mason@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 12:53 am

Well my goal with separating things is so that we could decouple two
pieces of code that have different usage scenarios, and where
supporting both scenarios simultaneously appears to me to needlessly
complicate the code.

Added to that we could then tune the two pieces of code for their

ext2 doesn't store directories in the buffer cache.

Journaling filesystems and filesystems that do ordered writes
game the buffer cache. Putting in data that should not yet
be written to disk. That gaming is where reiserfs goes BUG
and where JBD moves the dirty bit to a different dirty bit.

So as far as I can tell what is in the buffer cache is not really
in sync with what should be on disk at any given movement except
when everything is clean.

My suspicion is that actually reading from disk is likely to
give a more coherent view of things. Because there at least
we have the writes as they are expected to be seen by fsck
to recover the data, and a snapshot there should at least
be recoverable. Whereas a snapshot provides not such guarantees.

Eric
-

To: Eric W. Biederman <ebiederm@...>
Cc: Andrew Morton <akpm@...>, Chris Mason <chris.mason@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 1:36 am

Doesn't that give you any suspicion that other tools mightn't

I don't see too much complication from it. If we can actually
simplify things or make useful tuning, maybe it will be worth

Oh that's what you mean. OK, agreed there. But for the filesystems
and types of metadata that can now expect to have coherency, doing
this will break that expectation.

Again, I have no opinions either way on whether we should do that
in the long run. But doing it as a kneejerk response to braindead
rd.c code is wrong because of what *might* go wrong and we don't

Filesystems really want better control of writeback, I think.
This isn't really a consequence of the unified blockdev pagecache
/ metadata buffer cache, it is just that most of the important
things they do are with metadata.

If they have their own metadata inode, then they'll need to game
the cache for it, or the writeback code for that inode somehow

ext3 fsck I don't think is supposed to be run under a read/write
filesystem, so it's going to explode if you do that regardless.
-

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, Chris Mason <chris.mason@...>, Christian Borntraeger <borntraeger@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Sunday, October 21, 2007 - 3:09 am

I read a representative sample of the relevant tools before replying

That was my feeling that we could simplify things. The block layer
page cache operations certainly.

I know in the filesystems that use the buffer cache like reiser and
JBD they could stop worrying about the buffers becoming mysteriously
dirty. Beyond that I think there is a lot of opportunity I just

The rd.c code is perfectly valid if someone wasn't forcing buffer
heads on it's pages. It is a conflict of expectations.

Regardless I didn't do it as a kneejerk and I don't think that
patch should be merged at this time. I proposed it because as I
see it that starts untangling the mess that is the buffer cache.
rd.c was just my entry point into understanding how all of those
pieces work. I was doing my best to completely explore my options

Yes. Although they will at least get the guarantee that no one

Not that so much as the order in which things go into the cache

Yes. I was thinking of dump or something like that here. Where
we simply read out the data and try to make some coherent sense
of it. If we see a version of the metadata that points to things
that have not been finished yet or is in the process of being
written to that could be a problem.

When going through the buffer cache as far as I can tell people
don't use little things like page lock when writing data so
the page cache reads can potentially race with what should
be atomic writes.

Eric
-

To: Chris Mason <chris.mason@...>
Cc: Christian Borntraeger <borntraeger@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Wednesday, October 17, 2007 - 11:27 pm

Yes. I was suggesting to continue to pin the pages for the page
cache pages block device inode, and having the buffer cache live
on some other inode. Thus not causing me problems by adding clean
buffer_heads to my dirty pages.

Eric

-

To: Christian Borntraeger <borntraeger@...>
Cc: Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, <linux-mm@...>, <linux-kernel@...>, Martin Schwidefsky <schwidefsky@...>, Theodore Ts'o <tytso@...>, <stable@...>
Date: Tuesday, October 16, 2007 - 5:22 am

Sounds good. Please make certain to test reiserfs as well as ext2+ as

My gut feel says my patches assuming everything tests out ok, as
they actually fix the problem instead of papering over it, and there
isn't really a size difference.

In addition the change to init_page_buffers is interesting all by
itself. With that patch we now have the option of running block
devices in mode where we don't bother to cache the buffer heads
which should reduce memory pressure a bit. Not that an enhancement
like that will show up in stable, but...

Eric
-

To: <linux-kernel@...>
Date: Tuesday, October 16, 2007 - 4:05 am

hi!to everybody i need sex video

--
This message was sent on behalf of rubenjr_22@yahoo.com.ph at openSubscriber.com
http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/77881...
-

To: <linux-kernel@...>
Date: Tuesday, October 16, 2007 - 4:13 am

hi! i need sex video

--
This message was sent on behalf of rubenjr_22@yahoo.com.ph at openSubscriber.com
http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/77881...
-

Previous thread: Re: 2.6.23-mm1 by Neil Brown on Monday, October 15, 2007 - 3:31 am. (2 messages)

Next thread: [PATCH] PS3 system bus add_uevent_var() fallout by Geert Uytterhoeven on Monday, October 15, 2007 - 5:51 am. (1 message)