login
Header Space

 
 

Linux: Data Corruption Bug Fixed

December 29, 2006 - 7:25am
Submitted by Jeremy on December 29, 2006 - 7:25am.
Linux news

After several days of effort, Linus Torvalds tracked down and posted a patch for a low level data corruption bug [story]. In a series of emails, Linus thoroughly explained the thought process involved in isolating the exact problem. He described the bug as being a difference of expectations between the filesystem and the VM:

"Both filesystem and VM actually _think_ they do the right thing, because they simply have totally different expectations. The filesystem thinks that it should care about dirty buffers (that got marked dirty _after_ they were dirtied), while the filesystem thinks that it cares about dirty _pages_ (that got dirted at any time _before_ 'writepage()' was called). Neither is really 'wrong', per se, it's just that the two parts have different expectations, and the _combination_ just doesn't work. 'set_page_dirty()' at some point meant 'the writes have been done', but these days it really means something else."

Linus posted the fix in a followup email, "putting on the thinking cap, there's actually a fairly simple an nonintrusive patch. It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it." A comment in the patch explains, "we use this sequence to make sure that (a) we account for dirty stats properly (b) we tell the low-level filesystem to mark the whole page dirty if it was dirty in a pagetable. Only to then (c) clean the page again and return 1 to cause the writeback. This way we avoid all nasty races with the dirty bit in multiple places and clearing them concurrently from different threads." Another comments explains the remaining race, "if somebody adds the page back to the page tables in between the 'page_mkclean()' and the 'TestClearPageDirty()', we might have it mapped without the dirty bit set."


From: Linus Torvalds [email blocked]
To: Segher Boessenkool [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date:	Thu, 28 Dec 2006 22:48:51 -0800 (PST)



On Fri, 29 Dec 2006, Segher Boessenkool wrote:
>
> > I think what might be happening is that pdflush writes them out fine,
> > however we don't trap writes by the application _during_ that writeout.
> 
> Yeah.  I believe that more exactly it happens if the very last
> write to the page causes a writeback (due to dirty balancing)
> while another writeback for the page is already happening.
> 
> As usual in these cases, I have zero proof.

I actually have proof to the contrary, ie I have traces that say "the 
write was started" after the last write.

And the VM layer in this area is actually fairly sane and civilized. It 
has a bit that says "writeback in progress", and if that bit is set, it 
simply _will_not_ start a new write. It even has various BUG_ON()'s to 
that effect.

So everything I have ever seen says that the VM layer is actually doing 
everything right.

> It's the do_wp_page -> balance_dirty_pages -> generic_writepages
> path for sure.  Maybe it's enough to change
> 
>                         if (wbc->sync_mode != WB_SYNC_NONE)
>                                 wait_on_page_writeback(page);
> 
>                         if (PageWriteback(page) ||
>                             !clear_page_dirty_for_io(page)) {
>                                 unlock_page(page);
>                                 continue;
>                         }

Notice how this one basically says:

 - if it's under writeback, don't even clear the page dirty flag.

Your suggested change:

>                         if (wbc->sync_mode != WB_SYNC_NONE)
>                                 wait_on_page_writeback(page);
> 
>                         if (PageWriteback(page)) {
>                         	    redirty_page_for_writepage(wbc, page);

makes no sense, because we simply never _did_ the "clear_page_dirty()" if 
the thing was under writeback in the first place. That's how C 
conditionals work.  So there's no reason to "redirty" it, because it 
wasn't cleaned in the first place.

I've double- and triple-checked the dirty bits, including having traces 
that actually say that the IO was started (from a VM perspective) _after_ 
the last write was done. The IO just didn't hit the disk.

I'm personally fairly convinced that it's not a VM issue, but a "IO 
issue". Either in a low-level filesystem or in some of the fs/buffer.c 
helper routines.

But I'd love to be proven wrong. 

I do have a few interesting details from the trace I haven't really 
analyzed yet. Here's the trace for events on one of the pages that was 
corrupted. Note how the events are numbered (there were 171640 events 
total), so the thing you see is just a small set of events from the whole 
big trace, but it's the ones that talk about _that_ particular page.

I've grouped them so hat "consecutive" events group together. That just 
means that no events on any other pages happened in between those events, 
and it is usually a sign that it's really one single call-chain that 
causes all the events.

For example, for the first group of three events (44366-44368), it's the 
page fault that brings in the page, and since it's a write-fault, it will 
not only map the page, it will mark the page itself dirty and then also 
set the TAG_DIRTY on the mapping. So the "group" is just really a result 
of one single event happening, which causes several things to happen to 
that page. That's exactly what you'd expect.

Anyway, here is the list of events that page went through:

   44366  PG 00000f6d: mm/memory.c:2254 mapping at b789fc54 (write)
   44367  PG 00000f6d: mm/page-writeback.c:817 setting dirty
   44368  PG 00000f6d: fs/buffer.c:738 setting TAG_DIRTY

   64231  PG 00000f6d: mm/page-writeback.c:872 clean_for_io
   64232  PG 00000f6d: mm/rmap.c:451 cleaning PTE b789f000
   64233  PG 00000f6d: mm/page-writeback.c:914 set writeback
   64234  PG 00000f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
   64235  PG 00000f6d: mm/page-writeback.c:922 clearing TAG_DIRTY

   67570  PG 00000f6d: mm/page-writeback.c:891 end writeback
   67571  PG 00000f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

   76705  PG 00000f6d: mm/page-writeback.c:817 setting dirty
   76706  PG 00000f6d: fs/buffer.c:725 dirtied buffers
   76707  PG 00000f6d: fs/buffer.c:738 setting TAG_DIRTY

  105267  PG 00000f6d: mm/page-writeback.c:872 clean_for_io
  105268  PG 00000f6d: mm/rmap.c:451 cleaning PTE b789f000
  105269  PG 00000f6d: mm/page-writeback.c:914 set writeback
  105270  PG 00000f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
  105271  PG 00000f6d: mm/page-writeback.c:922 clearing TAG_DIRTY
  105272  PG 00000f6d: mm/page-writeback.c:891 end writeback
  105273  PG 00000f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

  128032  PG 00000f6d: mm/memory.c:670 unmapped at b789f000

  132662  PG 00000f6d: mm/filemap.c:119 Removing page cache

  148278  PG 00000f6d: mm/memory.c:2254 mapping at b789f000 (read)

  166326  PG 00000f6d: mm/memory.c:670 unmapped at b789f000

  171958  PG 00000f6d: mm/filemap.c:119 Removing page cache

And notice that big grouping of seven events (105267-105273). The five 
first events really _do_ make sense together: it's our page cleaning that 
happens. But notice how the "end writeback" happens _immediately_.

Here's another page cleaning event for the page that preceded that page, 
and did _not_ get corrupted:

  105262  PG 00000f6c: mm/page-writeback.c:872 clean_for_io
  105263  PG 00000f6c: mm/rmap.c:451 cleaning PTE b789e000
  105264  PG 00000f6c: mm/page-writeback.c:914 set writeback
  105265  PG 00000f6c: mm/page-writeback.c:916 setting TAG_WRITEBACK
  105266  PG 00000f6c: mm/page-writeback.c:922 clearing TAG_DIRTY

  108437  PG 00000f6c: mm/page-writeback.c:891 end writeback
  108438  PG 00000f6c: mm/page-writeback.c:893 clearing TAG_WRITEBACK

and this looks a lot more like what you'd expect: other thngs happened in 
between the "clear dirty, set writeback" stage and the "end writeback" 
stage. That's what you'd expect to see if there was actually overlapping 
IO and/or work. 

(And notice that that _was_ what you saw even for the corrupted page for 
the _first_ writeback: you saw the group-of-five that indicated a page 
cleaning event had started, and then a group-of-two to indicate that the 
writeback finished).

So I find this kind of pattern really suspicious. We have a missing 
writeout, and my traces show (I didn't analyze this _particular_ one 
closely, but I did the previous trace for another page that I posted) that 
the writeback was actually started after the write that went missing was 
done. AND I have this trace that seems to show that the writeback 
basically completed immediately, with no other work in between.

That to me says: "somebody didn't actually write out out". The VM layer 
asked the filesystem to do the write, but the filesystem just didn't do 
it. I personally think it's because some buffer-head BH_dirty bit got 
scrogged, but it could be some event that makes the filesystem simply not 
do the IO because it thinks the "disk queues are too full", so it just 
says "IO completed", without actually doing anything at all.

Now, the fact that it apparently happens for all of ext2, ext3 
and reiserfs (but NOT apparently with "data=writeback"), makes me suspect 
that there is some common interaction, and that it's somehow BH-related 
(they all share much of the buffer head infrastructure). So it doesn't 
look like it's just a bug in one random filesystem, I think it's a bug in 
some buffer-head infrastructure/helper function.

So I don't think it's "core VM". I don't think it's the "page cache". I 
think we handle the dirty state correctly at that level.

It looks more like "buffer cache" or "filesystem" to me by now.

(Btw, don't get me wrong - the above sequence numbers are in no way 
*proof* of anything. You could get big groups for one page just because 
something ended up being synchronous. I'll add some timestamps to my 
traces to make it easier to see where there was real IO going on and where 
there wasn't).

		Linus


From: Linus Torvalds [email blocked] Subject: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) Date: Fri, 29 Dec 2006 00:58:12 -0800 (PST) On Thu, 28 Dec 2006, Linus Torvalds wrote: > > So everything I have ever seen says that the VM layer is actually doing > everything right. That was true, but at the same time, it's not. Let me explain. > That to me says: "somebody didn't actually write out out". The VM layer > asked the filesystem to do the write, but the filesystem just didn't do > it. I personally think it's because some buffer-head BH_dirty bit got > scrogged Ok, I have proof of this now. Here's a trace (with cycle counts), and with a new trace event added: this is for another corrupted page. I have: 49105 PG 000015d8 (14800): mm/page-writeback.c:872 clean_for_io 49106 PG 000015d8 (6900): mm/rmap.c:451 cleaning PTE b7fa6000 49107 PG 000015d8 (9900): mm/page-writeback.c:914 set writeback 49108 PG 000015d8 (6480): mm/page-writeback.c:916 setting TAG_WRITEBACK 49109 PG 000015d8 (7110): mm/page-writeback.c:922 clearing TAG_DIRTY 49110 PG 000015d8 (7190): fs/buffer.c:1713 no IO underway 49111 PG 000015d8 (6180): mm/page-writeback.c:891 end writeback 49112 PG 000015d8 (6460): mm/page-writeback.c:893 clearing TAG_WRITEBACK where that first column is the trace event number again, and the "PG 000015d8" is that corrupted page. The thing in the parenthesis is "CPU cycles since last event), and the important part to note is that this is indeed all one single thing with no actual IO anywhere (~7000 CPU cycles may sound like a lot, but (a) it's not that many cache misses and (b) a lot of it is the logging overhead - back-to-back log events will take about 3500 cycles) just because it does the actual ASCII printk() etc. Also, the new event is: fs/buffer.c:1713 no IO underway which is just the if (nr_underway == 0) case in fs/buffer.c And I now finally really believe that I fully understand the corruption, but I don't have a simple solution, much less a patch. What the problem basically boils down to is that "set_page_dirty()" is supposed to be a mark for dirtying THE WHOLE PAGE, but it really is not "the whole page when the 'set_page_dirty()' itself happens", but more of a "the next writepage() needs to write back the whole page" thing. And that's not what "__set_page_dirty_buffers()" really does. Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE "set_page_dirty()" IS CALLED, it will mark all the buffers on that page as dirty. That may _sound_ like what we want, but it really isn't. Because by the time "writepage()" is actually called (which can be MUCH MUCH later), some internal filesystem activity may actually have cleaned one or more of those buffers in the meantime, and now we call "writepage()" (which really wants to write them _all_), and it will write only part of them, or none at all. So the VM thought that since it did a "writepage()", all the dirty state at that point got written back. But it didn't - the filesystem could have written back part or all of the page much earlier, and the writepage() actually does nothing at all. Both filesystem and VM actually _think_ they do the right thing, because they simply have totally different expectations. The filesystem thinks that it should care about dirty buffers (that got marked dirty _after_ they were dirtied), while the filesystem thinks that it cares about dirty _pages_ (that got dirted at any time _before_ "writepage()" was called). Neither is really "wrong", per se, it's just that the two parts have different expectations, and the _combination_ just doesn't work. "set_page_dirty()" at some point meant "the writes have been done", but these days it really means something else. Now, the reason there is no trivial patch is not that this is conceptually really hard to fix. I can see several different approaches to fixing it, but they all really boil down to two alternatives: (a) splitting the one "PG_dirty" bit up into two bits: the "PG_writescheduled" bit and the "PG_alldirty" bit. The "PG_write_scheduled" bit would be the bit that the filesystem would set when it has pending dirty data that it wrote itself (and that may not cover the whole page), and is the part of PG_dirty that sets the PAGECACHE_TAG_DIRTY. It's also what forces "writepage()" to be called. The "PG_alldirty" bit is just an additional "somebody else dirtied random parts of this page, and we don't know what" flag, which is set by "set_page_dirty()" in addition to doing the PG_write_scheduled stuff. We would test-and-clear it at "writepage()" time, and pass it in to "writepages()" to tell the writepage() function that it can't just write out its own small limited notion of what is dirty. (There are various variations on this whole theme: instead of having a flag to "writepage()", we could split the "whole page" case out as a separate callback or similar) (b) making sure that all "set_page_dirty()" calls are _after_ the page has been marked dirty (which in the case of memory mapped pages would mean that we would _not_ call it when we mark the page writable at all, we would call it when we _remove_ the dirty bit and mark it unwritable). That would have the nice fearture that it wouldn't require any FS-level changes, which would be a nice thing - it would basically make the VM dirty accounting work the way the FS layer now already expects it to. I think (b) is conceptually simpler, and I think I'll try it tomorrow after I've slept on it. The reason I mention (a) at all is that I like the conceptual notion of telling he filesystem ahead of time that "you're going to get a full dirty page", because what (b) will inevitably lead to is that the filesystem will maintain its own partial state, and then effectively just before it gets the writepage() notification, it will be told it was all pointless, because we just dirtied the whole thing. IOW, the advantage of (a) is also it's disadvantage: it tells the filesystem more. The disadvantage is that it would require VFS interface changes exactly to do that (ie the "mapping->set_page_dirty()" semantics would also be split up into two, and it would now be a "prepare to write the whole page during the next 'writepage()'" thing). So to recap: the VM layer really expected "writepage()" to act as if it wrote out the whole page. It doesn't. Not in the presense of the buffer layer and the filesystem having written out some buffers independently of the VM layer earlier. I think this also explains why "data=ordered" broke, and "data=writeback" didn't. When ext3 does "ordered" writebacks, it will do file data writebacks on its own, in _its_ order. In contrast, when it does "data=writeback", it will do the writebacks exactly as the VM presents them, and won't write any buffers on its own - which makes the bug go away, because now VM and FS end up agreeing about the semantics of "writepage()". Andrew, do you see anything wrong in my thinking? Peter - on a VM level, the fix would be: - remove the "set_page_dirty()" from the page fault path, and just set the PAGECACHE_TAG_DIRTY instead. - clear_page_dirty_for_io() would now need to check the mappings of the page even if it wasn't marked PG_dirty (or we'd have another page flag for the "page is dirty in page tables"), which is kind of a mixture of (a) and (b) cases above, except we don't expose it to the FS. - if it was dirty in the page tables, we do a "set_page_dirty()" after cleaning the page tables, and then the rest of "clear_page_dirty_for_io()" really boils down to a simple "TestAndClearDirty(page)" Hmm? I'd love it if somebody else wrote the patch and tested it, because I'm getting sick and tired of this bug ;) Linus
From: Linus Torvalds [email blocked] Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) Date: Fri, 29 Dec 2006 02:48:35 -0800 (PST) On Fri, 29 Dec 2006, Linus Torvalds wrote: > > Hmm? I'd love it if somebody else wrote the patch and tested it, because > I'm getting sick and tired of this bug ;) Who the hell am I kidding? I haven't been able to sleep right for the last few days over this bug. It was really getting to me. And putting on the thinking cap, there's actually a fairly simple an nonintrusive patch. It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it. The patch is mostly a comment. The "real" meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. Linus ---- diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b3a198c..ec01da1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); - if (!mapping) - return TestClearPageDirty(page); - - if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + if (mapping && mapping_cap_account_dirty(mapping)) { + /* + * Yes, Virginia, this is indeed insane. + * + * We use this sequence to make sure that + * (a) we account for dirty stats properly + * (b) we tell the low-level filesystem to + * mark the whole page dirty if it was + * dirty in a pagetable. Only to then + * (c) clean the page again and return 1 to + * cause the writeback. + * + * This way we avoid all nasty races with the + * dirty bit in multiple places and clearing + * them concurrently from different threads. + * + * Note! Normally the "set_page_dirty(page)" + * has no effect on the actual dirty bit - since + * that will already usually be set. But we + * need the side effects, and it can help us + * avoid races. + * + * We basically use the page "master dirty bit" + * as a serialization point for all the different + * threds doing their things. + * + * FIXME! We still have a race here: if somebody + * adds the page back to the page tables in + * between the "page_mkclean()" and the "TestClearPageDirty()", + * we might have it mapped without the dirty bit set. + */ + if (page_mkclean(page)) + set_page_dirty(page); + if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); + return 1; } - return 1; + return 0; } - return 0; + return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io);
From: Andrei Popa <andrei.popa@i-neo.ro> To: Linus Torvalds [email blocked] Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) Date: Fri, 29 Dec 2006 13:16:46 +0200 On Fri, 2006-12-29 at 02:48 -0800, Linus Torvalds wrote: > > On Fri, 29 Dec 2006, Linus Torvalds wrote: > > > > Hmm? I'd love it if somebody else wrote the patch and tested it, because > > I'm getting sick and tired of this bug ;) > > Who the hell am I kidding? I haven't been able to sleep right for the last > few days over this bug. It was really getting to me. > > And putting on the thinking cap, there's actually a fairly simple an > nonintrusive patch. It still has a tiny tiny race (see the comment), but I > bet nobody can really hit it in real life anyway, and I know several ways > to fix it, so I'm not really _that_ worried about it. > > The patch is mostly a comment. The "real" meat of it is actually just a > few lines. > > Can anybody get corruption with this thing applied? It goes on top of > plain v2.6.20-rc2. Tested with rtorrent and there is no corruption.



Related Links:

Interesting bug story this is

December 30, 2006 - 11:09am
Anonymous (not verified)

I think this proves that people like Torvalds are still needed. Sometimes people say they don't think he has done anything relevant in the past years and such. However, this proves that the kernel development process needs in fact people controlling what other people is doing, and playing the sweeper position so they can concentrate on deciding what to add and what not to add, and miscelaneous tasks, like fixing this bug. People who still have a more or less global vision of the kernel and how the different subsystems interact. In other words: Torvaldses and Mortons.

Congrats to the kernel developpers and to Torvalds for being the one who solved this nasty bug.

Not yet fixed in 2.6.19

December 30, 2006 - 5:41pm

Its been more than a day, and they haven't released a fixed 2.6.19 release. Ok, it's not that long, but I had been expecting them to move faster on a bug such as this one which can cause ext3 corruption.

There also haven't been a fix for the unlikely race condition Linus mentions.

Not yet fixed

December 30, 2006 - 9:50pm
Anonymous (not verified)

Well, you CAN apply the patch yourself.

It's pretty clear that Linus wants to get at least a minimal amount of testing in before making this part of an official release. Also, as scary as it is, this bug isn't a showstopper for most people. (Since it's been around for quite some time)

Not really a filesystem corruption

December 31, 2006 - 10:48am

As I understand it, only the downloaded file(s) gets corrupted.
Not such a big issue when you can simply switch to another (not libtorrent based) BT client, have it re-check your file(s) and re-download the corrupted chunks.

I really don't think it's such a big issue from the end user's viewpoint.

It is not only affecting rtorrent

January 1, 2007 - 3:14pm

"Arjan" has posting a comment at http://lwn.net/Articles/215868/#Comments :

people noticed.. in hindsight. Suddenly a series of db4 reports show up with people saying they see this regularly and it's now gone away with the fix...

Remember that the bugs has been there in production kernels for some time. It just became easier to trigger in 2.6.19. It does not only affect rtorrent.

You really don't want to be running a kernel which randomly corrupts data written to disks.

But you are right, the file system structure does not seem to be corrupted.

Is this an issue in 2.4 or

January 2, 2007 - 9:31am
Anonymous (not verified)

Is this an issue in 2.4 or just 2.6?

Just 2.6.

January 2, 2007 - 9:43am
Anonymous (not verified)

Just 2.6.

patch for 2.6.19

January 4, 2007 - 2:02am
Sten (not verified)

A patch against 2.6.19[.1] is available at
http://lkml.org/lkml/diff/2006/12/29/89/1.

Sten

Has this bug 2.6.17?

January 4, 2007 - 7:48am
Alexander (not verified)

Any patches for 2.6.17?

I think we should thank the

January 4, 2007 - 7:18am
Anonymous (not verified)

I think we should thank the developers of rTorrent. rTorrent's aggressive use of mmap() stresses the OS subsystems so much that a deeply hidden bug was exposed.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
speck-geostationary