Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit unmaintained, so decided to sent the patch to you :-). I have CCed Ted, who did work on the code in the 90s. I found no current email address of Chad Page. We have seen ramdisk based install systems, where some pages of mapped libraries and programs were suddendly zeroed under memory pressure. This should not happen, as the ramdisk avoids freeing its pages by keeping them dirty all the time. It turns out that there is a case, where the VM makes a ramdisk page clean, without telling the ramdisk driver. On memory pressure shrink_zone runs and it starts to run shrink_active_list. There is a check for buffer_heads_over_limit, and if true, pagevec_strip is called. pagevec_strip calls try_to_release_page. If the mapping has no releasepage callback, try_to_free_buffers is called. try_to_free_buffers has now a special logic for some file systems to make a dirty page clean, if all buffers are clean. Thats what happened in our test case. The solution is to provide a noop-releasepage callback for the ramdisk driver. This avoids try_to_free_buffers for ramdisk pages. To trigger the problem, you have to make buffer_heads_over_limit true, which means: - lower max_buffer_heads or - have a system with lots of high memory Andrew, if there are no objections - please apply. The patch applies against todays git. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> --- drivers/block/rd.c | 13 +++++++++++++ 1 files changed, 13 insertions(+) Index: linux-2.6/drivers/block/rd.c =================================================================== --- linux-2.6.orig/drivers/block/rd.c +++ linux-2.6/drivers/block/rd.c @@ -189,6 +189,18 @@ static int ramdisk_set_page_dirty(struct return 0; } +/* + * releasepage is called by pagevec_strip/try_to_release_page if + * buffers_heads_over_limit is true. Without a releasepage function + * try_to_free_buffers is called instead. That can unset the ...
This really needs to be fixed... I can't make up my mind between the approaches to fixing it. On one hand, I would actually prefer to really mark the buffers dirty (as in: Eric's fix for this problem[*]) than this patch, and this seems a bit like a bandaid... On the other hand, the wound being covered by the bandaid is actually the code in the buffer layer that does this latent "cleaning" of the page because it sadly doesn't really keep track of the pagecache state. But it *still* feels like we should be marking the rd page's buffers dirty which should avoid this problem anyway. [*] However, hmm, with Eric's patch I guess we'd still have a hole where filesystems that write their buffers by hand think they are "cleaning" these things and we're back to square one. That could be fixed by marking the buffers dirty again? Why were Eric's patches dropped, BTW? I don't remember. -
I obviously agree ;-) Yes, that would solve the problem as well. As long as we fix the problem, I am happy. On the other hand, do you see any obvious problem with this "bandaid"? Christian -
I don't think so -- in fact, it could be the best candidate for a minimal fix for stable kernels (anyone disagree? if not, maybe you could also send this to the stable maintainers?). But I do want to have this fixed in a "nice" way. eg. I'd like it to mark the buffers dirty because that actually results in more reuse of generic kernel code, and also should make rd behave more naturally (I like using it to test filesystems because it can expose a lot more concurrency than something like loop on tmpfs). It should also be possible to actually have rd's buffer heads get reclaimed as well, preferably while exercising the common buffer paths and without writing much new code. All of that is secondary to fixing the data corruption problem of course! But the fact that those alternate patches do exist now means I want to just bring them into the discussion again before merging one or the other. -
A minor one. It still leaves us with buffer heads out of sync with We actually allow that currently for clean pages which is part The core of my original fix was to modify init_page_buffers so that when we added buffers to a dirty page the buffers became dirty. Modifying the generic code is a bit spooky because it requires us to audit the kernel to make certain nothing else depends on the current behavior in odd ways. Although since init_page_buffers is only called when we are adding buffer heads to an existing page I still think that was the proper change. The historical reason for my patches not getting merged the first time is there was some weird issue with reiserfs ramdisks and so Andrew disabled the code, and then dropped it when he had discovered he had the patch disabled for several releases. I don't think any causal relationship was ever established. But I didn't hear enough about the reiserfs ramdisk issue, to make a guess what was going on. So it looks to me like the important invariant we need to maintain is that when a ramdisk page is dirty it always has buffers and those buffers are dirty as well. With a little care we can ensure this happens with just modifications to rd.c Eric -
Hah. I looked over my last round of patches again and I have been able to verify by review the parts I was a little iffy about and I have found where in my cleanups I had missed a layering violation in the ramdisk code, and removed some needed code. Which probably accounts for the reiserfs ramdisk problems. Updated patches in a minute. Eric -
The problem: When we are trying to free buffers try_to_free_buffers()
will look at ramdisk pages with clean buffer heads and remove the
dirty bit from the page. Resulting in ramdisk pages with data that get
removed from the page cache. Ouch!
Buffer heads appear on ramdisk pages when a filesystem calls
__getblk() which through a series of function calls eventually calls
init_page_buffers().
So to fix the mismatch between buffer head and page state this patch
modifies init_page_buffers() to transfer the dirty bit from the page to
the buffer heads like we currently do for the uptodate bit.
This patch is safe as only __getblk calls init_page_buffers, and
there are only two implementations of block devices cached in the
page cache. The generic implementation in block_dev.c and the
implementation in rd.c.
The generic implementation of block devices always does everything
in terms of buffer heads so it always has buffer heads allocated
before a page is marked dirty so this change does not affect it.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/buffer.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 75b51df..8b87beb 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -972,6 +972,7 @@ init_page_buffers(struct page *page, struct block_device *bdev,
struct buffer_head *head = page_buffers(page);
struct buffer_head *bh = head;
int uptodate = PageUptodate(page);
+ int dirty = PageDirty(page);
do {
if (!buffer_mapped(bh)) {
@@ -980,6 +981,8 @@ init_page_buffers(struct page *page, struct block_device *bdev,
bh->b_blocknr = block;
if (uptodate)
set_buffer_uptodate(bh);
+ if (dirty)
+ set_buffer_dirty(bh);
set_buffer_mapped(bh);
}
block++;
--
1.5.3.rc6.17.g1911
-
I have not observed this case but it is possible to get a dirty page
cache with clean buffer heads if we get a clean ramdisk page with
buffer heads generated by a filesystem calling __getblk and then write
to that page from user space through the block device. Then we just
need to hit the proper window and try_to_free_buffers() will mark that
page clean and eventually drop it. Ouch!
To fix this use the generic __set_page_dirty_buffers in the ramdisk
code so that when we mark a page dirty we also mark it's buffer heads
dirty.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
drivers/block/rd.c | 13 +------------
1 files changed, 1 insertions(+), 12 deletions(-)
diff --git a/drivers/block/rd.c b/drivers/block/rd.c
index 701ea77..84163da 100644
--- a/drivers/block/rd.c
+++ b/drivers/block/rd.c
@@ -178,23 +178,12 @@ static int ramdisk_writepages(struct address_space *mapping,
return 0;
}
-/*
- * ramdisk blockdev pages have their own ->set_page_dirty() because we don't
- * want them to contribute to dirty memory accounting.
- */
-static int ramdisk_set_page_dirty(struct page *page)
-{
- if (!TestSetPageDirty(page))
- return 1;
- return 0;
-}
-
static const struct address_space_operations ramdisk_aops = {
.readpage = ramdisk_readpage,
.prepare_write = ramdisk_prepare_write,
.commit_write = ramdisk_commit_write,
.writepage = ramdisk_writepage,
- .set_page_dirty = ramdisk_set_page_dirty,
+ .set_page_dirty = __set_page_dirty_buffers,
.writepages = ramdisk_writepages,
};
--
1.5.3.rc6.17.g1911
-
Your patches look sane so far. I have applied both patches, and the problem seems gone. I will try to get these patches to our testers. As long as they dont find new problems: Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Nick, Eric. What is the best patch for the stable series? Both patches from Eric or mine? I CCed stable, so they know that something is coming. Christian -
hi! i need sex video -- This message was sent on behalf of rubenjr_22@yahoo.com.ph at openSubscriber.com http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/7788114.html -
hi!to everybody i need sex video -- This message was sent on behalf of rubenjr_22@yahoo.com.ph at openSubscriber.com http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/7788114.html -
Sounds good. Please make certain to test reiserfs as well as ext2+ as My gut feel says my patches assuming everything tests out ok, as they actually fix the problem instead of papering over it, and there isn't really a size difference. In addition the change to init_page_buffers is interesting all by itself. With that patch we now have the option of running block devices in mode where we don't bother to cache the buffer heads which should reduce memory pressure a bit. Not that an enhancement like that will show up in stable, but... Eric -
Eric,
Our testers did only a short test, and then they were stopped by problems with
reiserfs. At the moment I cannot say for sure if your patch caused this, but
we got the following BUG
ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr storage.
------------[ cut here ]------------
kernel BUG at /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117!
illegal operation: 0001 [#1]
Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur
CPU: 3 Not tainted
Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88)
Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs])
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0
Krnl GPRS: 00000002 7411b5c8 0000002b 00000000
7b04d000 00000001 00000000 76d1de00
7513eec0 00000003 00000012 77f77680
7411b608 fb343b7e fb34404a 7513ee50
Krnl Code: fb344374: a7210002 tmll %r2,2
fb344378: a7840004 brc 8,fb344380
fb34437c: a7f40001 brc 15,fb34437e
>fb344380: 5810d8c2 l %r1,2242(%r13)
fb344384: 5820b03c l %r2,60(%r11)
fb344388: 0de1 basr %r14,%r1
fb34438a: 5810d90e l %r1,2318(%r13)
fb34438e: 5820b03c l %r2,60(%r11)
Looking at the code, this really seems related to dirty buffers, so your patch
is the main suspect at the moment.
if (!barrier) {
/* If there was a write error in the journal - we can't commit
* this transaction - it will be invalid and, if successful,
* will just end up propagating the write error out to
* the file system. */
if (likely(!retval && !reiserfs_is_journal_aborted (journal))) {
if (buffer_dirty(jl->j_commit_bh))
1117----> BUG();
...Grr. I'm not certain how to call that. Given that I should also be able to trigger this case by writing to the block device through the buffer cache (to the write spot at the write moment) this feels like a reiserfs bug. Although I feel screaming about filesystems that go BUG instead of remount read-only.... At the same time I increasingly don't think we should allow user space to dirty or update our filesystem metadata buffer heads. That seems like asking for trouble. Did you have both of my changes applied? To init_page_buffer() and to the ramdisk_set_dirty_page? I'm guessing it was the change to ramdisk_set_dirty_page.... Eric -
In this case, the commit block isn't allowed to be dirty before reiserfs decides it is safe to write it. The journal code expects it is the only spot in the kernel setting buffer heads dirty, and it only does so after the rest of the log blocks are safely on disk. Given this is a ramdisk, the check can be ignored, but I'd rather not Demanding trouble ;) -chris -
Ok. So the journal code here fundamentally depends on being able to control the order of the writes, and something else being able to set Looks like it. There are even comments in jbd about the same class of problems. Apparently dump and tune2fs on mounted filesystems have triggered some of these issues. The practical question is any of this trouble worth handling. Thinking about it. I don't believe anyone has ever intentionally built a filesystem tool that depends on being able to modify a file systems metadata buffer heads while the filesystem is running, and doing that would seem to be fragile as it would require a lot of cooperation between the tool and the filesystem about how the filesystem uses and implement things. Now I guess I need to see how difficult a patch would be to give filesystems magic inodes to keep their metadata buffer heads in. Eric -
That's right. For example, ext2 is doing directories in the page cache of the directory inode, so there's a cache alias between the block Not hard, the block device inode is already a magic inode for metadata buffer heads. You could just make another one attached to the bdev. But, I don't think I fully understand the problem you're trying to solve? -chris -
So the start: When we write buffers from the buffer cache we clear buffer_dirty but not PageDirty So try_to_free_buffers() will mark any page with clean buffer_heads that is not clean itself clean. The ramdisk set pages dirty to keep them from being removed from the page cache, just like ramfs. Unfortunately when those dirty ramdisk pages get buffers on them and those buffers all go clean and we are trying to reclaim buffer_heads we drop those pages from the page cache. Ouch! We can fix the ramdisk by setting making certain that buffer_heads on ramdisk pages stay dirty as well. The problem is this breaks filesystems like reiserfs and ext3 that expect to be able to make buffer_heads clean sometimes. There are other ways to solve this for ramdisks, such as changing where ramdisks are stored. However fixing the ramdisks this way still leaves the general problem that there are other paths to the filesystem metadata buffers, and those other paths cause the code to be complicated and buggy. So I'm trying to see if we can untangle this Gordian knot, so the code because more easily maintainable. To make the buffer cache a helper library instead of require infrastructure, it looks like two things need to happen. - Move metadata buffer heads off block devices page cache entries, - Communicate the mappings of data pages to block device sectors in a generic way without buffer heads. How we ultimately fix the ramdisk tends to depend on how we untangle the buffer head problem. Right now the only simple solution is to suppress try_to_free_buffers, which is a bit ugly. We can also come up with a completely separate store for the pages in the buffer cache but if we wind up moving the metadata buffer heads anyway then that should not be necessary. Eric -
So, the problem is using the Dirty bit to indicate pinned. You're completely right that our current setup of buffer heads and pages and filesystem metadata is complex and difficult. But, moving the buffer heads off of the page cache pages isn't going to make it any easier to use dirty as pinned, especially in the face of buffer_head users for file data pages. You've already seen Nick fsblock code, but you can see my general approach to replacing buffer heads here: http://oss.oracle.com/mercurial/mason/btrfs-unstable/file/f89e7971692f/extent_map.h (alpha quality implementation in extent_map.c and users in inode.c) The basic idea is to do extent based record keeping for mapping and state of things in the filesystem, and to avoid attaching these things to the Don't get me wrong, I'd love to see a simple and coherent fix for what reiserfs and ext3 do with buffer head state, but I think for the short term you're best off pinning the ramdisk pages via some other means. -chris -
Let me specific. Not moving buffer_heads off of page cache pages, moving buffer_heads off of the block devices page cache pages. My problem is the coupling of how block devices are cached and the implementation of buffer heads, and by removing that coupling we can generally make things better. Currently that coupling means silly things like all block devices are cached in low memory. Which probably isn't what you want if you actually have a use for block devices. For the ramdisk case in particular what this means is that there are no more users that create buffer_head mappings on the block device cache so using the dirty bit will be safe. Further it removes the nasty possibility of user space messing with metadata buffer head state. So the only way those cases can happen is a code bug, or a hardware bug. So I think by removing these unnecessary code paths things will Yes. And the problem is hard enough to trigger that a short term fix is actually of debatable value. The reason this hasn't shown up more frequently is that it only ever triggers if you are in the buffer head reclaim state, which on a 64bit box means you have to use < 4K buffers and have your ram cache another block device. That plus most people use initramfs these days. For the short term we have Christian's other patch which simply disables calling try_to_free_buffers. Although that really feels like a hack to me. For 2.6.25 I think I have a shot at fixing these things cleanly. Eric -
Ok, we move the buffer heads off to a different inode, and that indoe has pages. The pages on the inode still need to get pinned, how does that pinning happen? The problem you described where someone cleans a page because the buffer heads are clean happens already without help from userland. So, keeping the pages away from userland won't save them from cleaning. Sorry if I'm reading your suggestion wrong... -chris -
Yes. I was suggesting to continue to pin the pages for the page cache pages block device inode, and having the buffer cache live on some other inode. Thus not causing me problems by adding clean buffer_heads to my dirty pages. Eric -
If filesystems care at all they want absolute control over the buffer
cache. Controlling which buffers are dirty and when. Because we
keep the buffer cache in the page cache for the block device we have
not quite been giving filesystems that control leading to really weird
bugs.
In addition this tieing of the implemetation of block device caching
and the buffer cache has resulted in a much more complicated and
limited implementation then necessary. Block devices for example
don't need buffer_heads, and it is perfectly reasonable to cache
block devices in high memory.
To start untangling the worst of this mess this patch introduces a
second block device inode for the buffer cache. All buffer cache
operations are diverted to that use the new bd_metadata_inode, which
keeps the weirdness of the metadata requirements isolated in their
own little world.
This should enable future cleanups to diverge and simplify the
address_space_operations of the buffer cache and block device
page cache.
As a side effect of this cleanup the current ramdisk code should
be safe from dropping pages because we never place any buffer heads
on ramdisk pages.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/block_dev.c | 45 ++++++++++++++++++++++++++++++++-------------
fs/buffer.c | 17 ++++++++++++-----
fs/ext3/dir.c | 2 +-
fs/ext4/dir.c | 2 +-
fs/fat/inode.c | 2 +-
include/linux/fs.h | 1 +
6 files changed, 48 insertions(+), 21 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 379a446..87a5760 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -59,10 +59,12 @@ static sector_t max_block(struct block_device *bdev)
/* Kill _all_ buffers and pagecache , dirty or not.. */
static void kill_bdev(struct block_device *bdev)
{
- if (bdev->bd_inode->i_mapping->nrpages == 0)
- return;
- invalidate_bh_lrus();
- truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+ if (bdev->bd_inode->i_mapping->nrpages) ...I don't think we little angels want to tread here. There are so many weirdo things out there which will break if we bust the coherence between the fs and /dev/hda1. Online resize, online fs checkers, various local tools which people have hacked up to look at metadata in a live fs, direct-io access to the underlying fs, heaven knows how many boot loader installers, etc. Cerainly I couldn't enumerate tham all. The mere thought of all this scares the crap out of me. I don't actually see what the conceptual problem is with the existing implementation. The buffer_head is a finer-grained view onto the blockdev's pagecache: it provides additional states and additional locking against a finer-grained section of the page. It works well. Yeah, the highmem thing is a bit of a problem (but waning in importance). But we can fix that by teaching individual filesystems about kmap and then tweak the blockdev's caching policy with mapping_set_gfp_mask() at mount time. If anyone cares, which they don't. -
We broke coherence between the fs and /dev/hda1 when we introduced the page cache years ago, and weird hacky cases like unmap_underlying_metadata don't change that. Currently only Well I took a look at ext3. For online resize all of the writes are done by the fs not by the user space tool. For e2fsck of a read-only filesystem currently we do cache the buffers for the super block and reexamine those blocks when we mount read-only. Which makes my patch by itself unsafe. If however ext3 and anyone else who does things like that were to reread the data and not to merely reexamine the data we should be fine. Fundamentally doing anything like this requires some form of synchronization, and if that synchronization does not exist today there will be bugs. Further decoupling things only makes that requirement clearer. Unfortunately because of things like the ext3 handling of remounting The buffer_head itself seems to be a reasonable entity. The buffer cache is a monster. It does not follow the ordinary rules of the page cache, making it extremely hard to reason about. Currently in the buffer cache there are buffer_heads we are not allowed to make dirty which hold dirty data. Some filesystems panic the kernel when they notice this. Others like ext3 use a different bit to remember that the buffer is dirty. Because of ordering considerations the buffer cache does not hold a consistent view of what has been scheduled for being written to disk. It instead holds partially complete pages. The only place we should ever clear the dirty bit is just before calling write_page but try_to_free_buffers clears the dirty bit! We have buffers on pages without a mapping! In general the buffer cache violates a primary rule for comprehensible programming having. The buffer cache does not have a clear enough definition that it is clear what things are bugs and what things are features. 99% of the weird strange behavior in rd.c is because of the buffer This presumes I ...
Not for metadata. And I wouldn't expect many filesystem analysis unmap_underlying_metadata isn't about raw block device access at all, though (if you write to the filesystem via the blockdevice It either is or it isn't, right? And it is, isn't it? (at least for the common filesystems). -
Well my goal with separating things is so that we could decouple two pieces of code that have different usage scenarios, and where supporting both scenarios simultaneously appears to me to needlessly complicate the code. Added to that we could then tune the two pieces of code for their ext2 doesn't store directories in the buffer cache. Journaling filesystems and filesystems that do ordered writes game the buffer cache. Putting in data that should not yet be written to disk. That gaming is where reiserfs goes BUG and where JBD moves the dirty bit to a different dirty bit. So as far as I can tell what is in the buffer cache is not really in sync with what should be on disk at any given movement except when everything is clean. My suspicion is that actually reading from disk is likely to give a more coherent view of things. Because there at least we have the writes as they are expected to be seen by fsck to recover the data, and a snapshot there should at least be recoverable. Whereas a snapshot provides not such guarantees. Eric -
Doesn't that give you any suspicion that other tools mightn't I don't see too much complication from it. If we can actually simplify things or make useful tuning, maybe it will be worth Oh that's what you mean. OK, agreed there. But for the filesystems and types of metadata that can now expect to have coherency, doing this will break that expectation. Again, I have no opinions either way on whether we should do that in the long run. But doing it as a kneejerk response to braindead rd.c code is wrong because of what *might* go wrong and we don't Filesystems really want better control of writeback, I think. This isn't really a consequence of the unified blockdev pagecache / metadata buffer cache, it is just that most of the important things they do are with metadata. If they have their own metadata inode, then they'll need to game the cache for it, or the writeback code for that inode somehow ext3 fsck I don't think is supposed to be run under a read/write filesystem, so it's going to explode if you do that regardless. -
I read a representative sample of the relevant tools before replying That was my feeling that we could simplify things. The block layer page cache operations certainly. I know in the filesystems that use the buffer cache like reiser and JBD they could stop worrying about the buffers becoming mysteriously dirty. Beyond that I think there is a lot of opportunity I just The rd.c code is perfectly valid if someone wasn't forcing buffer heads on it's pages. It is a conflict of expectations. Regardless I didn't do it as a kneejerk and I don't think that patch should be merged at this time. I proposed it because as I see it that starts untangling the mess that is the buffer cache. rd.c was just my entry point into understanding how all of those pieces work. I was doing my best to completely explore my options Yes. Although they will at least get the guarantee that no one Not that so much as the order in which things go into the cache Yes. I was thinking of dump or something like that here. Where we simply read out the data and try to make some coherent sense of it. If we see a version of the metadata that points to things that have not been finished yet or is in the process of being written to that could be a problem. When going through the buffer cache as far as I can tell people don't use little things like page lock when writing data so the page cache reads can potentially race with what should be atomic writes. Eric -
It is not true for XFS - it's metadata is not in sync with /dev/<block> at all as all the cached metadata is kept in a different address space to the raw block device. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Mmm. Like I said, when a live filesystem is mounted on a bdev,
it isn't like you want userspace to go dancing around on it without
knowing exactly what it is doing.
The kernel more or less does the right thing here with respect to
the *state* of the data[*] (that is, buffer heads and pagecache).
It's when you actually start changing the data itself around is when
you'll blow up the filesystem.
[*] The ramdisk code is simply buggy, right? (and not the buffer
cache)
The idea of your patch in theory is OK, but Andrew raises valid
points about potential coherency problems, I think.
-
From the perspective of the ramdisk it expects the buffer cache to simply be a user of the page cache, and thus the buffer cache is horribly buggy. From the perspective of the buffer cache it still the block device cache in the kernel and it the way it works are the rules for how caching should be done in the kernel, and it doesn't give any There are certainly implementation issues in various filesystems to overcome before remounting read-write after doing a fsck on a read-only filesystem will work as it does today. So my patch is incomplete. Eric -
Yes, I removed my patch and applied both patches from you. Christian -
Thanks. Grr. Inconsistent rules on a core piece of infrastructure. It looks like that if there is any trivial/minimal fix it is based on your patch suppressing try_to_free_buffers. Ugh. Eric -
Ok. What do you think about having my patch for 2.6.23 stable, for 2.6.24 and doing a nicer fix (rd rewrite for example for post 2.6.24)? Christian -
Currently the ramdisk tries to keep the block device page cache pages
from being marked clean and dropped from memory. That fails for
filesystems that use the buffer cache because the buffer cache is not
an ordinary buffer cache user and depends on the generic block device
address space operations being used.
To fix all of those associated problems this patch allocates a private
inode to store the ramdisk pages in.
The result is slightly more memory used for metadata, an extra copying
when reading or writing directly to the block device, and changing the
software block size does not loose the contents of the ramdisk. Most
of all this ensures we don't loose data during normal use of the
ramdisk.
I deliberately avoid the cleanup that is now possible because this
patch is intended to be a bug fix.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
drivers/block/rd.c | 19 ++++++++++++++++---
1 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/drivers/block/rd.c b/drivers/block/rd.c
index 08176d2..a52f153 100644
--- a/drivers/block/rd.c
+++ b/drivers/block/rd.c
@@ -62,6 +62,7 @@
/* Various static variables go here. Most are used only in the RAM disk code.
*/
+static struct inode *rd_inode[CONFIG_BLK_DEV_RAM_COUNT];
static struct gendisk *rd_disks[CONFIG_BLK_DEV_RAM_COUNT];
static struct block_device *rd_bdev[CONFIG_BLK_DEV_RAM_COUNT];/* Protected device data */
static struct request_queue *rd_queue[CONFIG_BLK_DEV_RAM_COUNT];
@@ -267,7 +268,7 @@ static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
static int rd_make_request(struct request_queue *q, struct bio *bio)
{
struct block_device *bdev = bio->bi_bdev;
- struct address_space * mapping = bdev->bd_inode->i_mapping;
+ struct address_space * mapping = rd_inode[MINOR(bdev->bd_dev)]->i_mapping;
sector_t sector = bio->bi_sector;
unsigned long len = bio->bi_size >> 9;
int rw = bio_data_dir(bio);
@@ -312,6 +313,7 @@ static int rd_ioctl(struct ...This just breaks coherency again like the last patch. That's a really bad idea especially for stable (even if nothing actually was to break, we'd likely never know about it anyway). Christian's patch should go upstream and into stable. For 2.6.25-6, my rewrite should just replace what's there. Using address spaces to hold the ramdisk pages just confuses the issue even if they *aren't* actually wired up to the vfs at all. Saving 20 lines is not a good reason to use them. -
Not a chance. The only way we make it to that inode is through block device I/O so it lives at exactly the same level in the hierarchy as a real block device. My patch is the considered rewrite boiled down to it's essentials and made a trivial patch. It fundamentally fixes the problem, and doesn't attempt to reconcile Well is more like saving 100 lines. Not having to reexamine complicated infrastructure code and doing things the same way ramfs is. I think that combination is a good reason. Especially since I can do with a 16 line patch as I just demonstrated. It is a solid and simple incremental change. Eric -
Yes it does. It is exactly breaking the coherency between block device and filesystem metadata coherency that Andrew cared about. Whether or not that matters, that is a much bigger conceptual change than simply using slightly more (reclaimable) memory in some situations that my patch does. If you want to try convincing people to break that coherency, fine, but it has to be done consistently and everywhere rather than No, it doesn't. A real block device driver does have its own buffer cache as it's backing store. It doesn't know about What's the considered rewrite here? The rewrite I posted is the only one so far that's come up that I would consider [worthy], It fixes the bug in question, but not because it makes any fundamental improvement to the conceptual ickyness of rd.c. I don't know what you mean about attempting to reconcile the Using radix_tree_insert instead of add_to_page_cache is hardly complicated. If anything it is simpler because it isn't actually confusing the issues and it is much better fit with real block device drivers, which is actually the thing that is familiar to My opinion is that it is a much bigger behavioural change because it results in incoherent buffer cache / blockdev cache, and it results in code which is still ugly and conceptually wrong. -
Nick. Reread the patch. The only thing your arguments have established for me is that this patch is not obviously correct. Which makes it ineligible for a back port. Frankly I suspect the whole issue is to subtle and rare to make any backport make any sense. My Well those pages are only accessed through rd_blkdev_pagecache_IO and rd_ioctl. The address space operations can (after my patch) be deleted or be replaced by their generic versions. I just didn't take that step because it was an unnecessary change and I wanted the minimal Well it looks like you were blind when you read the patch. Because the semantics between the two are almost identical, except I managed to implement BLKFLSBUF in a backwards compatible way by flushing both the buffer cache and my private cache. You failed to flush the buffer cache in your implementation. Yes. I use an inode 99% for it's mapping and the mapping 99% for it's radix_tree. But having truncate_inode_pages and grab_cache_page continue to work sure is convenient. I certainly think it makes it a lot simpler to audit the code to change just one thing at a time (the backing store) then to rip out and replace everything and then try and prove that the two patches are equivalent. Eric -
About being rare, when I force the VM to be more aggressive reclaiming buffer by using the following patch: --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -3225,7 +3225,7 @@ void __init buffer_init(void) * Limit the bh occupancy to 10% of ZONE_NORMAL */ nrpages = (nr_free_buffer_pages() * 10) / 100; - max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head)); + max_buffer_heads = 0; hotcpu_notifier(buffer_cpu_notify, 0); } I can actually cause data corruption within some seconds. So I think the problem is real enough to be worth fixing. I still dont fully understand what issues you have with my patch. - it obviously fixes the problem - I am not aware of any regression it introduces - its small One concern you had, was the fact that buffer heads are out of sync with struct pages. Testing your first patch revealed that this is actually needed by reiserfs - and maybe others. I can also see, that my patch looks a bit like a bandaid that cobbles the rd pieces together. Is there anything else, that makes my patch unmergeable in your opinion? Christian -
Let me put it another way. Looking at /proc/slabinfo I can get 37 buffer_heads per page. I can allocate 10% of memory in buffer_heads before we start to reclaim them. So it requires just over 3.7 buffer_heads on very page of low memory to even trigger this case. That is a large 1k filesystem or a weird sized partition, that we have written to directly. That makes this condition very rare in practice without your patch. Especially since even after we reach the above condition we have to have enough vm pressure to find a page with clean buffer heads that is dirty in the ramdisk. While it can be done deterministically usually it is pretty hard to trigger and pretty easy to work around by simply using partition My primary issue with your patch is that it continues the saga the trying to use buffer cache to store the data which is a serious For linus's tree the consensus is that to fix rd.c that we need to have a backing store that is stored somewhere besides in the page cache/buffer cache for /dev/ram0. Doing that prevents all of the weird issues. Now we have the question of which patch gets us there. I contend I have implemented it with my last little patch that this thread is a reply to. Nick hasn't seen that just yet. So if we have a small patch that can implement the proper long term fix I contend we are in better shape. As for backports we can worry about that after we get something sane merged upstream. Eric -
You don't want to change that for a stable patch, however. Or ever will. It wasn't that my whole argument against it is based on that I mistakenly thought your patch served the bdev I just don't think what you have is the proper fix. Calling into the core vfs and vm because right now it does something that works for you but is completely unrelated to what you are conceptually doing is not the right fix. Also, the patch I posted is big because it did other stuff with dynamically allocated ramdisks from loop (ie. a modern rewrite). As it is applied to rd.c and split into chunks, the actual patch to switch to the new backing store isn't actually that big. I'll submit it to -mm after things stabilise after the merge window too. -
Possibly. But the same proportions still hold. 1k filesystems are not the default these days and ramdisks are relatively uncommon. The memory quantities involved are all low mem. This is an old problem. If it was common I suspect we would have noticed and fixed it long ago. As far as I can tell this problem dates back to 2.5.13 in the commit below (which starts clearing the dirty bit in try_to_free_buffers). The rd.c had earlier made the transition from using BH_protected to using the dirty bit to lock it into the No it avoids the bug which is something slightly different. Further I contend that it is not obviously correct that there are no other side effects (because it doesn't actually fix the bug), and that makes it of dubious value for a backport. If I had to slap a patch on there at this point just implementing an empty try_to_release_page (which disables try_to_free_buffers) would be my choice. I just think something that has existed at least potentially for the entire 2.6 series, and is easy I think there is a strong conceptual relation and other code doing largely the same thing is already in the kernel (ramfs). Plus my gut feel says shared code will make maintenance easier. You do have a point that the reuse may not be perfect and if that is the case we need to watch out for the potential to mess things up. Sounds like a plan. Eric -
You don't need 1K filesystems to have buffers attached though, of course. You can hit the limit with a 4K filesystem with less The bug in question isn't exactly that it uses buffercache for its backing store (although that's obviously rather hairy as well). It's How is that better? Now you're making the exact same change for all filesystems that you didn't think was obviously correct for ramfs is rather a different case. Filesystems intimately know It's just wrong. I guess if you don't see that by now, then we have to just disagree. -
On Sun, 21 Oct 2007 12:39:30 -0600 It is definitely common during run time. It was seen in practice enough to be reproducible and get fixed for the non-ramdisk case. The big underlying question is how which ramdisk usage case are we shooting for. Keeping the ram disk pages off the LRU can certainly help the VM if larger ramdisks used at runtime are very common. Otherwise, I'd say to keep it as simple as possible and use Eric's patch. By simple I'm not counting lines of code, I'm counting overall readability between something everyone knows (page cache usage) and something specific to ramdisks (Nick's patch). -chris -
OK, I missed that you set the new inode's aops to the ramdisk_aops rather than the bd_inode. Which doesn't make a lot of sense because Wrong. It will be via the LRU, will get ->writepage() called, block_invalidate_page, etc. And I guess also via sb->s_inodes, where drop_pagecache_sb might do stuff to it (although it probably escapes harm). But you're right that it isn't the obviously correct fix for If you think it is a nice way to go, then I think you are Obviously a simple typo that can be fixed by adding one line It's horrible. And using truncate_inode_pages / grab_cache_page and new_inode is an incredible argument to save a few lines of code. You obviously didn't realise your so called private pages would get accessed via the LRU, for example. This is making a relatively larger logical change than my patch, because now as well as having a separate buffer cache and backing store, you are also making the I think it's a bad idea just to stir the shit. We should take the simple fix for the problem, and then fix it properly. -
Not totally useless as you have mentioned they are accessed by Of course it should be fixed. I just don't know if a backport makes sense. The problem once understood is hard to trigger Yes. We will be accessed via the LRU. Yes I knew that. The truth is whatever we do we will be visible to the LRU. My preferences run towards having something that is less of a special case then a new potentially huge cache that is completely unaccounted for, that we have to build and maintain all of the infrastructure for independently. The ramdisk code doesn't seem interesting enough long term to get people to do independent maintenance. With my patch we are on the road to being just like the ramdisk for maintenance purposes code except having a different GFP mask. I think I might have to send in a patch that renames block_invalidatepage to block_invalidate_page which is how everyone seems to be spelling it. Anyway nothing in the ramdisk code sets PagePrivate o block_invalidagepage should never be called, nor try_to_free_buffers Well we each have different tastes. I think my patch is a sane sensible small incremental step that does just what is needed to fix the problem. It doesn't drag a lot of other stuff into the problem like a rewrite would so we can easily verify Quite true. I noticed the breakage in mine because the patch was so simple, and I only had to worry about one aspect. With a rewrite I missed it because there was so much other noise I I did but but that is relatively minor. Using the pagecache this way for this purpose is a well established idiom in the kernel I am continuing to have the backing store pages visible to the VM, and from that perspective it is a smaller change then where we are The code in rd.c isn't terrible, and it isn't shit. There is only one fundamental problem with it. rd.c is fundamentally incompatible with the buffer cache. Currently rd.c is a legitimate if a bit ugly user of the page cache. The ugliness comes from ...
Well, the ones that are unused are totally useless ;) I would I mean backported. That's just me though, I don't know the nuances of -stable releases. It could be that they rather not risk introducing OK it just didn't sound like it, seeing as you said that's the only No. With my patch, nothing in the ramdisk code is visible to the LRU. It's not a cache, and it's not unaccounted for. It's specified exactly with the rd sizing parameters. I don't know why you would say your patch is better in this regard. Your ramdisk backing store will be You can be using highmem, BTW. And technically it probably isn't The small incremental step that fixes the problem is Christian's It is smaller lines of code. It is a larger change. Because what you are doing is 2 things. You are firstly discontinuing the use of the buffer cache for the backing store, and secondly you are introducing a new backing store which registers an inode with the vfs and pages with the pagecache. My patch does the same thing without those two last questionable steps. You now have to treat those backing store pages as pagecache pages, Any page you allocate is your private page. The problem is you're just sending them back to the VM again. -
Looking at it. If we don't get carried away using our own private inode is barely more difficult then stomping on release_page and in a number of ways a whole lot more subtle. At least for 2.6.24 I think it makes a sane fix, and quite possibly as a back port as well. Eric -
Hmm, so we can also have some filesystems writing their own buffers out by hand (clear_buffer_dirty, submit buffer for IO). Other places will do similarly dodgy things with filesystem metadata (fsync_buffers_list, for example). So your buffers get cleaned again, then your pages get cleaned. While I said it was a good fix when I saw the patch earlier, I think it's not closing the entire hole, and as such, Christian's patch is probably the way to go for stable. For mainline, *if* we want to keep the old rd.c around at all, I don't see any harm in this patch so long as Christian's is merged -
That would be my preferred method. Merge Erics and my fixup for 2.6.24-rc. The only open questions is, what was the reiserfs problem? Is it still While the merge window is still open, to me the new ramdisk code seems more like a 2.6.25-rc thing. We actually should test the behaviour in low memory scenarios. What do you think? Christian -
So I just took a little bit of time to look at and think about the path you are referring to, and I don't see a problem. The rule with the buffer dirty bit is that you first clear it and then you submit the write. When the write request finally makes it's way to rd.c we copy the data if necessary and call set_page_dirty. Which will then mark the page and the buffer dirty again. In essence the ramdisk code just attempts to lock buffers in ram by setting their dirty bit, just like we do for pages in ramfs. The only case where I see the dirty bit getting cleared without submitting I/O is when dirty state doesn't matter, in which case if we get a page full buffers all of whose data we don't care about it is legitimate to drop the page. As for ramdisk_writepage it probably makes sense to remove it, as the generic code paths seem to work as well or better the writepage method is NULL. However if we do keep it we should call I thought through the logic in try_to_free_buffers and it actually makes sense to me now. mark_buffer_dirty sets the page dirty bit so dirty buffers reside on dirty pages. When we submit I/O we aren't guaranteed to submit all of the dirty buffers for a page at once, so we don't clear the page dirty bit. With the result that we can end up with pages with the dirty bit set but all of their buffers are clean. Since we rather deliberately allow truly clean pages to be dropped from the ramdisk overriding try_to_free_buffer_pages looks wrong because then except for invalidate we can not remove buffers I do agree that with the amount of code duplication in the buffer cache that locking pages into the buffer cache seems very error prone, and difficult to maintain. So rewriting rd.c to store it's pages elsewhere sounds very promising. While I can see Christian's patch as fixing the symptom. I have a very hard time see it as correct. If we did something more elaborate to replace try_to_free_buffer_pages such that we could drop buffers from clean ...
Yeah, which is half the reason why its so complicated. Logically it should just hold another reference on the pages rather than interfere with pagecache state, but it can't do that because it Yeah, if your fix works I guess it is better to use it and converge code rather than diverge it even more. -
This is probably a good idea. Was this causing the reiserfs problems? If so, I think we should be concentrating on what the real problem is with reiserfs... (or at least why this so obviously correct looking patch is wrong). -
I think it was my cleanup patch that was sitting on top of this, That caused the problems. Eric -
Why do you say that? I guess it is _different_, by necessity(?) Is there anything that is really bad? I guess it's not nice for operating on the pagecache from its request_fn, but the alternative is to duplicate pages for backing store and buffer cache (actually that might not be a bad alternative really). -
make_page_uptodate() is most hideous part I have run into. It has to know details about other layers to now what not to stomp. I think my incorrect simplification of this is what messed Cool. Triple buffering :) Although I guess that would only apply to metadata these days. Having a separate store would solve some of the problems, and probably remove the need for carefully specifying the ramdisk block size. We would still need the magic restictions on page allocations though and it we would use them more often as the initial write to the ramdisk would not populate the pages we need. A very ugly bit seems to be the fact that we assume we can dereference bh->b_data without any special magic which means the ramdisk must live in low memory on 32bit machines. Eric -
Not really, it's just named funny. That's just a minor utility function that more or less does what it says it should do. The main problem is really that it's implementing a block device Double buffering. You no longer serve data out of your buffer cache. All filesystem data was already double buffered anyway, so we'd be just losing out on one layer of savings for metadata. I think it's worthwhile, given that we'd have a "real" looking What magic restrictions on page allocations? Actually we have fewer restrictions on page allocations because we can use highmem! And the lowmem buffercache pages that we currently pin (unsuccessfully, in the case of this bug) are now completely reclaimable. And all your buffer heads are now reclaimable. If you mean GFP_NOIO... I don't see any problem. Block device drivers have to allocate memory with GFP_NOIO; this may have been considered magic or deep badness back when the code was Yeah but that's not rd.c. You need to rewrite the buffer layer to fix that (see fsblock ;)). -
Well to put it another way, mark_page_uptodate() is the only
place where we really need to know about the upper layers.
Given that you can kill ramdisks by coding it as:
static void make_page_uptodate(struct page *page)
{
clear_highpage(page);
flush_dcache_page(page);
SetPageUptodate(page);
}
Something is seriously non-intuitive about that function if
you understand the usual rules for how to use the page cache.
The problem is that we support a case in the buffer cache
where pages are partially uptodate and only the buffer_heads
remember which parts are valid. Assuming we are using them
correctly.
Having to walk through all of the buffer heads in make_page_uptodate
Hmm. Good point. So in net it should save memory even if
Well I always figured it was a bit rude allocating large amounts
I'm not certain which way we should go. Take fsblock and run it
in parallel until everything is converted or use fsblock as a
prototype and once we have figured out which way we should go
convert struct buffer_head into struct fsblock one patch at a time.
I'm inclined to think we should evolve the buffer_head.
Eric
-
You're overwriting some buffers that were uptodate and dirty. Sure, but it's not just about the buffers. It's the pagecache in general. It is supposed to be invisible to the device driver and sitting above it, and yet it is taking the buffercache and Highmem systems would definitely like it. For others, yes, all the duplicated pages should be able to get reclaimed if memory gets tight, along with the buffer heads, so yeah footprint may You'd rather not, of course, but with dirty data limits now, it doesn't matter much. (and I doubt anybody outside testing is going to be hammering like crazy on rd). Note that the buffercache based ramdisk driver is going to also be allocating with GFP_NOFS if you're talking about a filesystem writing to its metadata. In most systems, GFP_NOFS isn't much different to GFP_NOIO. We could introduce a mode which allocates pages up front quite easily if it were a problem (which I doubt it ever would be). -
Perhaps unsigned? Err! Overlapping symbols! The rd_nr() function collides with the rd_nr variable. It also does not seem needed, since it did not exist before. It should go, you can set the variable with brd.rd_nr=XXX (same -
Changed. But it will hopefully just completely replace rd.c, so I will probably just rename it to rd.c at some point (and change .config options to stay compatible). Unless someone I've taken most of that stuff out of rd.c in an effort to stay Back compat. When rewriting the internals, I want to try avoid changing any externals if possible. Whether this is the Right Didn't try it, but the rd.c code does the same thing so I guess it doesn't (or didn't, when it was written). -
Attributes always work. Try vt.default_red=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 and you will see. -
Just doesn't seem to be any point in making it a new and different module, assuming it can support exactly the same semantics. I'm only doing so in these first diffs so that they are easier to read and also easier to do a side by side comparison / builds with the Ah, nice. (I don't use them much!). Still, backward compat I think is needed if we are to replace rd.c. -
Like I said, I did not see rd_nr in Documentation/kernel-parameters.txt so I thought there was no compat. -
Well on that same tune. But with a slightly different implementation. It compiles but I need to head to bed so I haven't had a chance to test it yet. Nick it is very similar to yours with the big difference being that I embedded a struct address_space instead of rolled rerolled it by hand, which saves a lot of lines of code. Eric --- drivers/block/rd.c /* * ramdisk.c - Multiple RAM disk driver - gzip-loading version - v. 0.8 beta. * * (C) Chad Page, Theodore Ts'o, et. al, 1995. * * This RAM disk is designed to have filesystems created on it and mounted * just like a regular floppy disk. * * It also does something suggested by Linus: use the buffer cache as the * RAM disk data. This makes it possible to dynamically allocate the RAM disk * buffer - with some consequences I have to deal with as I write this. * * This code is based on the original ramdisk.c, written mostly by * Theodore Ts'o (TYT) in 1991. The code was largely rewritten by * Chad Page to use the buffer cache to store the RAM disk data in * 1995; Theodore then took over the driver again, and cleaned it up * for inclusion in the mainline kernel. * * The original CRAMDISK code was written by Richard Lyons, and * adapted by Chad Page to use the new RAM disk interface. Theodore * Ts'o rewrote it so that both the compressed RAM disk loader and the * kernel decompressor uses the same inflate.c codebase. The RAM disk * loader now also loads into a dynamic (buffer cache based) RAM disk, * not the old static RAM disk. Support for the old static RAM disk has * been completely removed. * * Loadable module support added by Tom Dyas. * * Further cleanups by Chad Page (page0588@sundance.sjsu.edu): * Cosmetic changes in #ifdef MODULE, code movement, etc. * When the RAM disk module is removed, free the protected buffers * Default RAM disk size changed to 2.88 MB * * Added initrd: Werner Almesberger & Hans Lermen, Feb '96 * * 4/25/96 : Made RAM disk size a parameter ...
We won't be able to fix completely this for a while time, but the fact that BLKFLSBUF has special semantics has always been a major wart. Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt there are many tools that actually take advantage of this wierd aspect of ramdisks, so hopefully it's something we could remove in a 18 months or so... - Ted -
It would be nice to be able to do that, I agree. The new ramdisk code will be able to flush the buffer cache and destroy its data separately, so it can actually be implemented. -
So the practical problem are peoples legacy boot setups but those are quickly going away. The sane thing is probably something that can be taken as a low level format command for the block device. Say: dd if=/dev/zero of=/dev/ramX I know rewriting the drive with all zeroes can cause a modern disk to redo it's low level format. And that is something we can definitely implement without any backwards compatibility problems. Hmm. Do we have anything special for punching holes in files? That would be another sane route to take to remove the special case for clearing the memory. Eric -
We have 2 problems. First is that, for testing/consistency, we don't want BLKFLSBUF to throw out the data. Maybe hardly anything uses BLKFLSBUF now, so it could be just a minor problem, but still one to fix. Second is actually throwing out the ramdisk data. dd from /dev/null isn't trivial because it isn't a "command" from the kernel's POV. rd could examine the writes to see if they are zero and page aligned, I suppose... but if you're transitioning everyone over to a new truncate_range, I suppose. A file descriptor syscall based alternative for madvise would be nice though (like fallocate). We could always put something in /sys/block/ram*/xxx -
Hmm. This is interesting because we won't be doing anything that effects correctness if we don't special case BLKFLSBUF just something that effects efficiency. So I think we can get away with just changing blkflsbuf as long as there is a way to get rid of Well I was thinking you can examine the page you just wrote to and if it is all zero's you don't need to cache that page anymore. Call it intelligent compression. Further it does make forwards and backwards compatibility simple because all you would have to do to reliably free a ramdisk is: dd if=/dev/zero of=/dev/ramX I guess when I look at this it looks like an operation that is unique to a ramdisk. Real hard drives have a low level format operations and the like. If we can find something standard there that is guaranteed to trash your data we can use that, and have gone from less consistency to more. Eric -
Technically, it does change correctness: after BLKFLSBUF, the ramdisk should contain zeroes. I'm assuming it would also cause problems in tight embedded environments if ramdisk ram is supposed to be thrown away but isn't. So maybe not technically a correctness problem, but could Sure, you could do that, but you still presumably need to support the old behaviour. As a test vehicle for filesystems, I'd much rather it didn't do this of course, because subsequent writes would need to reallocate the page again. -
I have beaten my version of this into working shape, and things seem ok. However I'm beginning to think that the real solution is to remove the dependence on buffer heads for caching the disk mapping for data pages, and move the metadata buffer heads off of the block device page cache pages. Although I am just a touch concerned there may be an issue with using filesystem tools while the filesystem is mounted if I move the metadata buffer heads. If we were to move the metadata buffer heads (assuming I haven't missed some weird dependency) then I think there a bunch of weird corner cases that would be simplified. I guess that is where I look next. Oh for what it is worth I took a quick look at fsblock and I don't think struct fsblock makes much sense as a block mapping translation layer for the data path where the current page caches works well. For less then the cost of 1 fsblock I can cache all of the translations for a 1K filesystem on a 4K page. I haven't looked to see if fsblock makes sense to use as a buffer head replacement yet. Anyway off to bed with me. Eric -
I'd prefer just to go along the lines of what I posted. It's a pure block device driver which knows absolutely nothing about vm or vfs. What are you guys using rd for, and is it really important to Except the page cache doesn't store any such translation. fsblock works very nicely as a buffer_head and nobh-mode replacement there, but we could probably go one better (for many filesystems) by using I'm not exactly sure what you mean, unless you're talking about an extent based data structure. fsblock is fairly slim as far as a generic solution goes. You could probably save 4 or 8 bytes in it, Well I don't want to get too far off topic on the subject of fsblock, and block mappings (because I think the ramdisk driver simply should have nothing to do with any of that)... but fsblock is exactly a buffer head replacement (so if it doesn't make sense, then I've screwed something up badly! ;)) -
Well what brought this up for me was old user space code using an initial ramdisk. The actual failure that I saw occurred on the read path. And fixing init_page_buffers was the real world fix. At the moment I'm messing with it because it has become the itch I've decided to scratch. So at the moment I'm having fun, learning the block layer, refreshing my VM knowledge and getting my head around this wreck that we call buffer_heads. The high level concept of buffer_heads may be sane but the implementation seems to export a lot of nasty state. At this point my concern is what makes a clean code change in the kernel. Because user space can currently play with buffer_heads by way of the block device and cause lots of havoc (see the recent resierfs bug in this thread) that is why I increasingly think metadata buffer_heads should not share storage with the block device page cache. If that change is made then it happens that the current ramdisk would not need to worry about buffer heads and all of that nastiness and could just lock pages in the page cache. It would not be quite as good for testing filesystems but retaining the existing characteristics would be simple. After having looked a bit deeper the buffer_heads and the block devices don't look as intricately tied up as I had first thought. We still have the nasty case of: if (buffer_new(bh)) unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); That I don't know how it got merged. But otherwise the caches are fully separate. So currently it looks to me like there are two big things that will clean up that part of the code a lot: - moving the metadata buffer_heads to a magic filesystem inode. - Using a simpler non-buffer_head returning version of get_block As a meta_data cache manager perhaps, for a translation cache we need 8 bytes per page max. However all we need for a generic translation cache (assuming we still want one) is an array of sector_t per page. So what we would want is: int ...
Well if userspace is writing to the filesystem metadata via the blockdevice while it is running... that's the definition of havoc, isn't it? ;) Whether or not the writes are going via a unified metadata/blockdev cache or separate ones. You really just have to not do that. The actual reiserfs problem being seen is not because of userspace No, it wouldn't. Because if you're proposing to split up the buffer cache and the metadata cache, then you're back to a 2 cache solution which is basically has the memory characteristics of my Well its needed because some filesystems forget about their old metadata. It's not really there to solve aliasing with the blockdev Although this is going off the track of the ramdisk problem. For that we should just do the rewrite. -
| Jesse Barnes | Re: [stable] [BUG][PATCH] cpqphp: fix kernel NULL pointer dereference |
| Greg KH | [003/136] p54usb: add Zcomax XG-705A usbid |
| Magnus Damm | [PATCH 03/07] ARM: Use shared GIC entry macros on Realview |
| Oliver Neukum | Re: [Bug #13682] The webcam stopped working when upgrading from 2.6.29 to 2.6.30 |
| Martin Schwidefsky | Re: [PATCH] optimized ktime_get[_ts] for GENERIC_TIME=y |
git: | |
| Junio C Hamano | Re: Some advanced index playing |
| Jeff King | Re: confusion over the new branch and merge config |
| Robin Rosenberg | Re: cvs2svn conversion directly to git ready for experimentation |
| Linus Torvalds | git binary size... |
| Ævar Arnfjörð Bjarmason |
