Linux: Chasing Down Data Corruption

Submitted by Jeremy on December 28, 2006 - 10:28pm.
Linux news

In a couple of fascinating threads on the lkml, Linus Torvalds has been working with several other kernel developers to try and track down a difficult data corruption bug [story]. Linus posted a test-program that's capable of consistently triggering the data corruption, so it's a matter of time before the bug is found and fixed. "I think the page-writeout is implicated," Linus explains, "because I do seem to need it, but the page-cache flush does seem to make corruption _easier_ to see. I now seem about to trigger it with a 100MB file on a 256MB machine in a minute or so, with this slight modification. I still don't see _why_, though. But maybe smarter people than me can see it." Earlier it was thought that new page balancing code added in the 2.6.19 kernel was to blame, but using Linus' test-program the data corruption has been reported as far back as the 2.6.5 kernel. "It's not actually a new bug at all," suggested Linus, "it's just that the dirty page balancing causes writeback to happen _earlier_, and thus is better able to _show_ a bug that we've likely had for a long long time." Before heading out to dinner to celebrate his birthday, Linus sent out a patch for tracing the areas of the kernel where the corruption bug is happening, "in the hope that somebody else is working on this corruption issue and is interested." He went on to summarize the current status of the debugging effort:

"What we need now is actually looking at the source code, and people who understand the VM, I'm afraid. I'm gathering traces now that I have a good test-case. I'll post my trace tools once I've tested that they work, in case others want to help.

"(And hey, you don't have to be a VM expert to help: this could be a learning experience. However, I'll warn you: this is _the_ most grotty part of the whole kernel. It's not even ugly, it's just damn hard and complex)."


From: Tobias Diedrich [email blocked]
To: Linus Torvalds [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
Date:	Tue, 26 Dec 2006 17:17:00 +0100

Linus Torvalds wrote:
> I don't think it's a page table issue any more, it just doesn't look 
> likely with the ARM UP corruption. It's also not apparently even on a 
> cacheline boundary, so it probably is really a dirty bit that got cleared 
> wrong due to some race with IO.

So, until now it's only been reported for SMP on i386?
I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over
here, UP kernel, no preempt.

I've first seen it with 2.6.20-rc1, but am running 2.6.20-rc2 now.
The corruption pattern looks like the one already reported, rtorrent
hash check fails (for some files it succeeds at first, but
fails after "echo 1 > /proc/sys/vm/drop_caches"), the corruption is
zeroes at the end of page instead of data.

ii  rtorrent       0.6.4-1        ncurses BitTorrent client based on LibTorren
ii  libtorrent9    0.10.4-1       a C++ BitTorrent library

-- 
Tobias						PGP: http://9ac7e0bc.uguu.de


From: David Miller [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date: Tue, 26 Dec 2006 20:55:18 -0800 (PST)

From: Tobias Diedrich [email blocked]
Date: Tue, 26 Dec 2006 17:17:00 +0100

> Linus Torvalds wrote:
> > I don't think it's a page table issue any more, it just doesn't look
> > likely with the ARM UP corruption. It's also not apparently even on a
> > cacheline boundary, so it probably is really a dirty bit that got cleared
> > wrong due to some race with IO.
>
> So, until now it's only been reported for SMP on i386?
> I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over
> here, UP kernel, no preempt.

I've seen it on sparc64, UP kernel, no preempt.

From: Linus Torvalds [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date: Wed, 27 Dec 2006 16:16:12 -0800 (PST)

On Tue, 26 Dec 2006, David Miller wrote:
>
> I've seen it on sparc64, UP kernel, no preempt.

Ok, I still don't have a clue, but I think I at least have a new
test-case.

It can probably be improved upon, but this would _seem_ to trigger the
problem. Can people check?

You'd want to make sure you get page-put activity, by making TARGETSIZE be
big enough to cause memory pressure (and rather than making it bigger, you
might want to make your memory smaller instead, to make it run more
quickly. Either using "mem=128M" or big compiles or something...).

If it finds corruption, you'll see something like

	Writing chunk 183858/183859 (99%)
	Chunk ..
	Chunk 120887 corrupted
	Chunk 122372 corrupted
	Chunk ...
	Checking chunk 183858/183859 (99%)

otherwise it will just say

	Writing chunk 183858/183859 (99%)
	Checking chunk 183858/183859 (99%)

and exit.

I didn't spend a lot of time verifying this, but I _was_ able to cause
those "Chunk xxx corrupted" messages with this. There's probably a more
efficient better way to do it, but this is better than trying to use
rtorrent, and also makes any worries about what rtorrent does go away.

Of course, maybe it's this test-program that is buggy now, although it
looks trivial enough that I don't think it is.

I think my earlier stress-tester may not have triggered this, because it
just did all its writing in a linear order, so any LRU logic will happen
to write back old pages that we are no longer touching. The randomization
(and using a chunksize that isn't a multiple of a page-size) makes sure
that we're actually going to have lots of rewriting going on.

I think the test-case could probably be improved by having a munmap() and
page-cache flush in between the writing and the checking, to see whether
that shows the corruption easier (and possibly without having to start
paging in order to throw the pages out, which would simplify testing a
lot). But I haven't tested. I decided to post this asap, now that I've
recreated the corruption with something else, and something that is
possibly easier to analyze..

		Linus

----
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (256 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

static void checkmem(void *start, int nr)
{
	unsigned char c = nr, *p = start;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		if (*p++ != c) {
			printf("Chunk %d corrupted \n", nr);
			return;
		}
	}
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;
	static int chunkorder[NRCHUNKS];

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");
	return 0;
}
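
Editor's note: the point about the chunk size not being a multiple of the page size can be made concrete with a little arithmetic. The sketch below is not part of Linus' program. It assumes 4096-byte pages, and the helper names (`nr_chunks`, `straddles`, `count_straddlers`) are made up for illustration. It shows that many 1460-byte chunks begin in one page and end in the next, so most pages receive several writes at unrelated times once the order is randomized.

```c
#include <assert.h>

#define CHUNKSIZE 1460ul	/* same chunk size as the test program */
#define PAGESIZE  4096ul	/* assumption: 4 KiB pages */

/* Number of whole chunks that fit in targetsize bytes (NRCHUNKS above). */
static unsigned long nr_chunks(unsigned long targetsize)
{
	return targetsize / CHUNKSIZE;
}

/* Non-zero when chunk nr starts in one page and ends in the next. */
static int straddles(unsigned long nr)
{
	unsigned long first = nr * CHUNKSIZE;
	unsigned long last = first + CHUNKSIZE - 1;

	/* A chunk smaller than a page crosses at most one page boundary. */
	assert(CHUNKSIZE < PAGESIZE);
	return first / PAGESIZE != last / PAGESIZE;
}

/* How many of the first n chunks cross a page boundary. */
static unsigned long count_straddlers(unsigned long n)
{
	unsigned long nr, count = 0;

	for (nr = 0; nr < n; nr++)
		count += straddles(nr);
	return count;
}
```

With TARGETSIZE at 256 << 20, `nr_chunks()` gives 183859, which matches the "Writing chunk 183858/183859" progress line in the mail above.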

From: Linus Torvalds [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date: Wed, 27 Dec 2006 16:39:43 -0800 (PST)

On Wed, 27 Dec 2006, Linus Torvalds wrote:
>
> I think the test-case could probably be improved by having a munmap() and
> page-cache flush in between the writing and the checking, to see whether
> that shows the corruption easier (and possibly without having to start
> paging in order to throw the pages out, which would simplify testing a
> lot).

I think the page-writeout is implicated, because I do seem to need it, but
the page-cache flush does seem to make corruption _easier_ to see. I now
seem about to trigger it with a 100MB file on a 256MB machine in a minute
or so, with this slight modification.

I still don't see _why_, though. But maybe smarter people than me can see
it..

		Linus

---
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

static void checkmem(void *start, int nr)
{
	unsigned char c = nr, *p = start;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		if (*p++ != c) {
			printf("Chunk %d corrupted \n", nr);
			return;
		}
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;
	static int chunkorder[NRCHUNKS];

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");
	return 0;
}

From: David Miller [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date: Wed, 27 Dec 2006 16:52:46 -0800 (PST)

From: Linus Torvalds [email blocked]
Date: Wed, 27 Dec 2006 16:39:43 -0800 (PST)

>
>
> On Wed, 27 Dec 2006, Linus Torvalds wrote:
> >
> > I think the test-case could probably be improved by having a munmap() and
> > page-cache flush in between the writing and the checking, to see whether
> > that shows the corruption easier (and possibly without having to start
> > paging in order to throw the pages out, which would simplify testing a
> > lot).
>
> I think the page-writeout is implicated, because I do seem to need it, but
> the page-cache flush does seem to make corruption _easier_ to see. I now
> seem about to trigger it with a 100MB file on a 256MB machine in a minute
> or so, with this slight modification.
>
> I still don't see _why_, though. But maybe smarter people than me can see
> it..

FWIW this program definitely triggers the bug for me.

From: Linus Torvalds [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date: Wed, 27 Dec 2006 19:04:34 -0800 (PST)

On Wed, 27 Dec 2006, David Miller wrote:
> >
> > I still don't see _why_, though. But maybe smarter people than me can see
> > it..
>
> FWIW this program definitely triggers the bug for me.

Ok, now that I have something simple to do repeatable stuff with, I can
say what the pattern is.. It's not all that surprising, but it's still
worth just stating for the record.

What happens is that when I do the "packetized writes" in random order,
the _last_ write to a page occasionally just goes missing. It's not always
at the end of a page, as shown by for example:

 - A whole chunk got dropped:

	Chunk 2094 corrupted (0-1459) (1624-3083)
	Expected 46, got 0
	Written as (30912)55414(10000)

   That "Written as (x)y(z)" line means that the corrupted chunk was
   written as chunk #y, and the preceding and following chunks (that were
   _not_ corrupt) on the page was written as #x and #z respectively.

   In other words, the missing chunk (which is still zero) was written
   much later than the ones that were ok, and never hit the disk. It's a
   contiguous chunk in the middle of the page (chunks are 1460 bytes in
   size)

   The first line means that all bytes of the chunk (0-1459) were
   corrupted, and the values in parenthesis are the offsets within a page.
   In other words, this was a chunk in the _middle_ of a page.

 - The missing data can also be at the beginning or ends of pages:

   Beginning of the chunk missing, it was at the end of a page (page
   offsets 3288-4095) and the _next_ page got written out fine:

	Chunk 2126 corrupted (0-807) (3288-4095)
	Expected 78, got 0
	Written as (32713)55573(14301)

   End of a chunk missing, it was the beginning of a page (and the
   _previous_ page that contained the beginning of the chunk was written
   out fine)

	Chunk 2179 corrupted (1252-1459) (0-207)
	Expected 131, got 0
	Written as (45189)55489(15515)

Now, the reason I say this isn't surprising is that this is entirely
consistent with the dirty bit being dropped on the floor somewhere, and
likely through some interaction with the previous changes being in the
process of being written out.

Something (incorrectly) ends up deciding that it doesn't need to write the
page, since it's already written, or alternatively clears the dirty bit
too late (clears it because an _earlier_ write finished, never mind that
the new dirty data didn't make it).

I also figured out that it's not the low-memory situation that does it, it
really must be the "page_mkclean()" triggering. Because I can do

	echo 5 > /proc/sys/vm/dirty_ratio
	echo 3 > /proc/sys/vm/dirty_background_ratio

to make it clean the pages much more aggressively than the default, and I
can see corruption on my 256MB machine with just a 40MB shared file, and
70MB of memory consistently free.

So this thing is definitely giving some answers. It's NOT about low
memory, and it very much seems to be about the whole "balance_dirty_ratio"
thing. I don't think I triggered the actual low-memory stuff once in that
situation..

So I have some more data on the behaviour, but I _still_ don't see the
reason behind it. It's probably something really obvious once it's pointed
out..

[ Modified test-program that tells you where the corruption happens (and
  when the missing parts were supposed to be written out) appended, in
  case people care. ]

		Linus

---
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (0xfff & ((unsigned)(unsigned long)(buf)+(off)))

static int chunkorder[NRCHUNKS];

static int order(int nr)
{
	int i;
	if (nr < 0 || nr >= NRCHUNKS)
		return -1;
	for (i = 0; i < NRCHUNKS; i++)
		if (chunkorder[i] == nr)
			return i;
	return -2;
}

static void checkmem(void *buf, int nr)
{
	unsigned int start = ~0u, end = 0;
	unsigned char c = nr, *p = buf, differs = 0;
	int i;

	for (i = 0; i < CHUNKSIZE; i++) {
		unsigned char got = *p++;
		if (got != c) {
			if (i < start)
				start = i;
			if (i > end)
				end = i;
			differs = got;
		}
	}
	if (start < end) {
		printf("Chunk %d corrupted (%u-%u) (%u-%u) \n", nr, start, end,
			page_offset(buf, start), page_offset(buf, end));
		printf("Expected %u, got %u\n", c, differs);
		printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");
	return 0;
}
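
Editor's note: the numbers in the three sample reports above can be checked by hand. Chunk nr occupies file bytes nr*1460 through nr*1460+1459, the page offset is that position modulo the page size, and the expected fill byte is nr truncated to 8 bits (that is what memset() does with it). The sketch below is only an illustration, assuming 4096-byte pages and a page-aligned mapping; the helper names `page_off`, `expected_byte` and `describe` do not come from the thread.

```c
#include <assert.h>
#include <stdio.h>

#define CHUNKSIZE 1460u	/* chunk size used by the test program */
#define PAGESIZE  4096u	/* assumption: 4 KiB pages, matching the 0xfff mask */

/*
 * Offset within its page of byte `off` of chunk `nr`, assuming the
 * mapping starts on a page boundary (which mmap() guarantees).
 */
static unsigned page_off(unsigned nr, unsigned off)
{
	return (unsigned)(((unsigned long)nr * CHUNKSIZE + off) % PAGESIZE);
}

/* Fill byte the test program expects: memset() keeps only the low 8 bits. */
static unsigned expected_byte(unsigned nr)
{
	return nr & 0xffu;
}

/* Print a line laid out like the "Chunk ... corrupted" report. */
static void describe(unsigned nr, unsigned first, unsigned last)
{
	assert(first <= last && last < CHUNKSIZE);
	printf("Chunk %u (%u-%u) (%u-%u), expected %u\n", nr, first, last,
	       page_off(nr, first), page_off(nr, last), expected_byte(nr));
}
```

Calling `describe(2094, 0, 1459)`, `describe(2126, 0, 807)` and `describe(2179, 1252, 1459)` reproduces the page offsets and expected values quoted in the mail, which confirms that the second and third examples sit exactly on a page boundary.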

From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Subject: RE: [PATCH] mm: fix page_mkclean_one
Date: Wed, 27 Dec 2006 21:55:21 -0800

Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> On Wed, 27 Dec 2006, David Miller wrote:
> > >
> > > I still don't see _why_, though. But maybe smarter people than me can see
> > > it..
> >
> > FWIW this program definitely triggers the bug for me.
>
> Ok, now that I have something simple to do repeatable stuff with, I can
> say what the pattern is.. It's not all that surprising, but it's still
> worth just stating for the record.

Running the test code, git bisect points its finger at this commit. Reverting
this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.

edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Mon Sep 25 23:30:58 2006 -0700

    [PATCH] mm: balance dirty pages

    Now that we can detect writers of shared mappings, throttle them. Avoids OOM
    by surprise.

    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Hugh Dickins [email blocked]
    Signed-off-by: Andrew Morton [email blocked]
    Signed-off-by: Linus Torvalds [email blocked]

From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Subject: RE: [PATCH] mm: fix page_mkclean_one
Date: Wed, 27 Dec 2006 22:10:52 -0800

Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > On Wed, 27 Dec 2006, David Miller wrote:
> > > >
> > > > I still don't see _why_, though. But maybe smarter people than me can see
> > > > it..
> > >
> > > FWIW this program definitely triggers the bug for me.
> >
> > Ok, now that I have something simple to do repeatable stuff with, I can
> > say what the pattern is.. It's not all that surprising, but it's still
> > worth just stating for the record.
>
>
> Running the test code, git bisect points its finger at this commit. Reverting
> this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
>
> edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date:   Mon Sep 25 23:30:58 2006 -0700
>
>     [PATCH] mm: balance dirty pages
>
>     Now that we can detect writers of shared mappings, throttle them. Avoids OOM
>     by surprise.

Oh, never mind :-( I just didn't create enough write out pressure when
test this. I just saw bug got triggered on a kernel I previously thought
was OK.

From: Linus Torvalds [email blocked]
Subject: RE: [PATCH] mm: fix page_mkclean_one
Date: Thu, 28 Dec 2006 09:10:42 -0800 (PST)

On Wed, 27 Dec 2006, Chen, Kenneth W wrote:
> >
> > Running the test code, git bisect points its finger at this commit. Reverting
> > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> >
> > [PATCH] mm: balance dirty pages
> >
> > Now that we can detect writers of shared mappings, throttle them. Avoids OOM
> > by surprise.
>
> Oh, never mind :-( I just didn't create enough write out pressure when
> test this. I just saw bug got triggered on a kernel I previously thought
> was OK.

Btw, this is an important point - people have long felt that the new page
balancing in 2.6.19 was to blame, but you've just confirmed the long-held
suspicion (at least by me) that it's not actually a new bug at all, it's
just that the dirty page balancing causes writeback to happen _earlier_,
and thus is better able to _show_ a bug that we've likely had for a long
long time.

		Linus

From: Guillaume Chazarain [email blocked]
Subject: Re: Re: [PATCH] mm: fix page_mkclean_one
Date: Thu, 28 Dec 2006 16:09:37 +0100

I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/

I have corruption with every Fedora release kernel except the first, that is
2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit some
corruption.

Command line to test:
qemu root_fs -snapshot -kernel FC-kernels/FC2-vmlinuz-2.6.5-1.358 -append 'rw root=/dev/hda'

I get this kind of corruption:
http://guichaz.free.fr/linux-bug/corruption.png

-- 
Guillaume

From: Guillaume Chazarain [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date: Thu, 28 Dec 2006 20:19:46 +0100

Guillaume Chazarain wrote:
> I get this kind of corruption:
> http://guichaz.free.fr/linux-bug/corruption.png

Actually in qemu, I get three different behaviours:
- no corruption at all : with linux-2.4
- corruption only on the first chunks: before [PATCH] mm: balance dirty pages
  as identified by Kenneth
- corruption of all chunks: after the balance dirty pages patch

Bisecting in linux-2.5 land I found
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.66/2.5.66-mm3/broken-out/fadvise-flush-data.patch
to cause the corruption for me.

The attached patch fixes the corruption for me.

-- 
Guillaume

[fadvise-dontneed.patch text/x-patch (493B)]
diff -r 3859b1144d3a mm/fadvise.c
--- a/mm/fadvise.c	Sun Dec 24 05:00:03 2006 +0000
+++ b/mm/fadvise.c	Thu Dec 28 19:53:40 2006 +0100
@@ -96,9 +96,6 @@ asmlinkage long sys_fadvise64_64(int fd,
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!bdi_write_congested(mapping->backing_dev_info))
-			filemap_flush(mapping);
-
 		/* First and last FULL page! */
 		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
 		end_index = (endbyte >> PAGE_CACHE_SHIFT);

From: Linus Torvalds [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date: Thu, 28 Dec 2006 11:28:52 -0800 (PST)

On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
>
> The attached patch fixes the corruption for me.

Well, that's a good hint, but it's really just a symptom. You effectively
just made the test-program not even try to flush the data to disk, so the
page cache would stay in memory, and you'd not see the corruption as well.

So you basically disabled the code that tried to trigger the bug more
easily.

But the reason I say it's interesting is that "WB_SYNC_NONE" is very much
implicated in mm/page-writeback.c, and if there is a bug triggered by
WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also
fails..

		Linus

From: Andrew Morton [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date: Thu, 28 Dec 2006 11:45:17 -0800

On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
Linus Torvalds [email blocked] wrote:

>
>
> On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> >
> > The attached patch fixes the corruption for me.
>
> Well, that's a good hint, but it's really just a symptom. You effectively
> just made the test-program not even try to flush the data to disk, so the
> page cache would stay in memory, and you'd not see the corruption as well.
>
> So you basically disabled the code that tried to trigger the bug more
> easily.
>
> But the reason I say it's interesting is that "WB_SYNC_NONE" is very much
> implicated in mm/page-writeback.c, and if there is a bug triggered by
> WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also
> fails..
>

It would be interesting to convert your app to do fsync() before
FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well
(apart from pdflush activity).

From: Linus Torvalds [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date: Thu, 28 Dec 2006 12:14:31 -0800 (PST)

On Thu, 28 Dec 2006, Andrew Morton wrote:
>
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).

I get corruption - but the whole point is that it's very much pdflush that
should be writing these pages out.

Andrew - give my test-program a try. It can run in about 1 minute if you
have a 256MB machine (I didn't, but "mem=256M" is my friend), and it seems
to very consistently cause corruption.

What I do is:

	# Make sure we write back aggressively
	echo 5 > /proc/sys/vm/dirty_ratio

as root, and then just run the thing. Tons of corruption. But the
corruption goes away if I just leave the default dirty ratio alone (but
then I can increase the file size to trigger it, of course - but that also
makes the test run a lot slower).

Now, with a pre-2.6.19 kernel, I bet you won't get the corruption as
easily (at least with the "fsync()"), but that's less to do with anything
new, and probably just because then you simply won't have any pdflushing
going on - since the kernel won't even notice that you have tons of dirty
pages ;)

It might also depend on the speed of your disk drive - the machine I test
this on has a slow 4200 rpm laptop drive in it, and that probably makes
things go south more easily. That's _especially_ true if this is related
to any "bdi_write_congested()" logic.

Now, it could also be related to various code snippets like

	...
	if (wbc->sync_mode != WB_SYNC_NONE)
		wait_on_page_writeback(page);

	if (PageWriteback(page) ||
			!clear_page_dirty_for_io(page)) {
		unlock_page(page);
		continue;
	}
	...

where the WB_SYNC_NONE case will hit the "PageWriteback()" and just not do
the writeback at all (but it also won't clear the dirty bit, so it's
certainly not an *OBVIOUS* bug).

We also have code like this ("pageout()"):

	if (clear_page_dirty_for_io(page)) {
		int res;
		struct writeback_control wbc = {
			.sync_mode = WB_SYNC_NONE,
			..
		}
		...
		res = mapping->a_ops->writepage(page, &wbc);

and in this case, if the "WB_SYNC_NONE" means that the "writepage()" call
won't do anything at all because of congestion, then that would be a _bad_
thing, and would certainly explain how something didn't get written out.

But that particular path should only trigger for the "shrink_page_list()"
case, and it's not the case I seem to be testing with my "low dirty_ratio"
testing.

		Linus

[test.c TEXT/PLAIN (2.8KB)]
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (22 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping))

static int chunkorder[NRCHUNKS];
static char *mapping;

static int order(int nr)
{
	int i;
	if (nr < 0 || nr >= NRCHUNKS)
		return -1;
	for (i = 0; i < NRCHUNKS; i++)
		if (chunkorder[i] == nr)
			return i;
	return -2;
}

static void checkmem(void *buf, int nr)
{
	unsigned int start = ~0u, end = 0;
	unsigned char c = nr, *p = buf, differs = 0;
	int i;

	for (i = 0; i < CHUNKSIZE; i++) {
		unsigned char got = *p++;
		if (got != c) {
			if (i < start)
				start = i;
			if (i > end)
				end = i;
			differs = got;
		}
	}
	if (start < end) {
		printf("Chunk %d corrupted (%u-%u) (%x-%x) \n", nr, start, end,
			page_offset(buf, start), page_offset(buf, end));
		printf("Expected %u, got %u\n", c, differs);
		printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
//		fsync(fd);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	int fd, i;

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Clean up for next time */
	sleep(5);
	sync();
	sleep(5);
	munmap(mapping, SIZE);
	close(fd);
	unlink("mapfile");
	return 0;
}
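
Editor's note: Andrew Morton's suggestion (fsync() before FADV_DONTNEED) appears in test.c only as a commented-out line. A minimal sketch of that variant follows, assuming a POSIX system. The names `remap_synced` and `roundtrip_mismatches` are hypothetical and do not come from the thread. On a kernel without the bug the round trip should report zero mismatches. Linus says above that he still gets corruption with the fsync() in place, so this is not a workaround, only a way to take the non-blocking flush inside fadvise out of the picture.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600	/* for posix_fadvise(), fsync(), ftruncate() */
#endif
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>

/*
 * Hypothetical variant of remap() from test.c: write the file back
 * synchronously before asking the kernel to drop its cached pages.
 */
static char *remap_synced(int fd, char *mapping, size_t size)
{
	if (mapping) {
		munmap(mapping, size);
		/* fsync() returns only after the file's dirty data is written. */
		if (fsync(fd) < 0)
			return MAP_FAILED;
		/* Advisory: the kernel may drop the now-clean cached pages. */
		posix_fadvise(fd, 0, (off_t)size, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/*
 * Fill a scratch file through a shared mapping, sync, drop, remap and
 * count the bytes that do not read back as val. Returns -1 on setup errors.
 */
static long roundtrip_mismatches(const char *path, size_t size, unsigned char val)
{
	long bad = -1;
	char *map;
	size_t i;
	int fd;

	assert(size > 0);
	fd = open(path, O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)size) == 0) {
		map = remap_synced(fd, NULL, size);
		if (map != MAP_FAILED) {
			memset(map, val, size);
			map = remap_synced(fd, map, size);
		}
		if (map != MAP_FAILED) {
			for (bad = 0, i = 0; i < size; i++)
				bad += ((unsigned char)map[i] != val);
			munmap(map, size);
		}
	}
	close(fd);
	unlink(path);
	return bad;
}
```

A caller would pass a scratch path and a size, for example `roundtrip_mismatches("/tmp/scratch.map", 3 * 4096, 0x5a)`. Unlike test.c, this sketch writes linearly and is far too small to create writeback pressure, so it only demonstrates the call sequence.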

From: David Miller [email blocked]
Subject: Re: [PATCH] mm: fix page_mkclean_one
Date: Thu, 28 Dec 2006 14:38:15 -0800 (PST)

From: Linus Torvalds [email blocked]
Date: Thu, 28 Dec 2006 12:14:31 -0800 (PST)

> I get corruption - but the whole point is that it's very much pdflush that
> should be writing these pages out.

I think what might be happening is that pdflush writes them out fine,
however we don't trap writes by the application _during_ that writeout.

These corruptions look exactly as if:

1) pdflush begins writeback of page X
2) page goes to disk
3) application writes a chunk to the page
4) pdflush et al. think the page is clean, so it gets tossed, losing
   the writes done in #3

So there's a missing PTE change in there, so that we never get proper
re-dirtying of the page if the application tries to write to the
page during the writeback.

It's something that will only occur with writeback and MAP_SHARED
writable access to the file pages.

That's why we never see this with normal filesystem writes, since
those explicitly manage the page dirty state.

I think the dirty balancing logic etc. isn't where the problems are,
to me it's a PTE state update issue for sure.
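
Editor's note: the four steps above are a hypothesis about ordering, and a toy model makes the consequence easy to see. The sketch below is not kernel code and makes no claim about where the real bug lives. `struct toy_page`, the 8-byte "page" and every function name are invented for illustration. The single `dirty` flag stands in for the whole page/PTE dirty state, and the `trapped` argument models whether a write through the mapping gets noticed.

```c
#include <string.h>
#include <stdbool.h>

#define TOY_PAGE 8	/* hypothetical tiny "page" so the state is easy to inspect */

struct toy_page {
	char mem[TOY_PAGE];	/* page-cache copy the application writes to */
	char disk[TOY_PAGE];	/* what has actually reached the disk */
	bool dirty;		/* stands in for the page/PTE dirty state */
};

/* Store through the mapping; an untrapped store leaves the page "clean". */
static void app_write(struct toy_page *pg, int off, char val, bool trapped)
{
	pg->mem[off] = val;
	if (trapped)
		pg->dirty = true;
}

/* Steps 1 and 2: snapshot the page to disk and mark it clean. */
static void writeback(struct toy_page *pg)
{
	memcpy(pg->disk, pg->mem, TOY_PAGE);
	pg->dirty = false;
}

/* Step 4: a clean page is tossed and later re-read from disk. */
static bool drop_if_clean(struct toy_page *pg)
{
	if (pg->dirty)
		return false;
	memcpy(pg->mem, pg->disk, TOY_PAGE);
	return true;
}

/* Run the sequence and return what the application reads back at offset 1. */
static char run_sequence(bool trap_write_during_writeback)
{
	struct toy_page pg;

	memset(&pg, 0, sizeof(pg));
	app_write(&pg, 0, 'A', true);		/* earlier chunk, dirtied properly */
	writeback(&pg);				/* steps 1 and 2 */
	app_write(&pg, 1, 'B', trap_write_during_writeback);	/* step 3 */
	if (!drop_if_clean(&pg)) {		/* step 4 */
		writeback(&pg);			/* still dirty: write it again first */
		drop_if_clean(&pg);
	}
	return pg.mem[1];
}
```

In this model `run_sequence(false)` reads back 0 where 'B' was written, the same "expected N, got 0" shape as in the reports earlier in the thread, while `run_sequence(true)` reads back 'B' because the page is written out a second time before it is dropped.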

From: Marc Haber <mh+linux-kernel@zugschlus.de>
Subject: Re: 2.6.19 file content corruption on ext3
Date: Thu, 28 Dec 2006 19:05:36 +0100

On Tue, Dec 19, 2006 at 09:51:49AM +0100, Marc Haber wrote:
> On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote:
> > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > blocksize ext3, mainline. Zero failures. It's unlikely that this testing
> > would pass, yet people running normal workloads are able to easily trigger
> > failures. I suspect we're looking in the wrong place.
>
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking somewhat 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.
>
> I have tidied my inbox in the mean time and mutt's memory requirement
> has been reduced to somewhat 30 MB, which might be the cause that I
> don't see the issue that often any more.

After being up for ten days, I have now encountered the file
corruption of pkgcache.bin for the first time again. The 256 MB i386
box is like 26M in swap, is under very moderate load.

I am running plain vanilla 2.6.19.1. Is there a patch that I should
apply against 2.6.19.1 that would help in debugging?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835
From: Linus Torvalds [email blocked]
Subject: Re: 2.6.19 file content corruption on ext3
Date: Thu, 28 Dec 2006 11:00:46 -0800 (PST)

On Thu, 28 Dec 2006, Marc Haber wrote:
>
> After being up for ten days, I have now encountered the file
> corruption of pkgcache.bin for the first time again. The 256 MB i386
> box is like 26M in swap, is under very moderate load.
>
> I am running plain vanilla 2.6.19.1. Is there a patch that I should
> apply against 2.6.19.1 that would help in debugging?

Not right now. And I have a test-program that shows the corruption _much_
easier (at least according to my own testing, and that of several reporters
that back me up), and that seems to show the corruption going way way back
(ie going back to Linux-2.6.5 at least, according to one tester).

So it just got a lot _easier_ to trigger in 2.6.19, but it's not a new
bug.

What we need now is actually looking at the source code, and people who
understand the VM, I'm afraid. I'm gathering traces now that I have a good
test-case. I'll post my trace tools once I've tested that they work, in
case others want to help.

(And hey, you don't have to be a VM expert to help: this could be a
learning experience. However, I'll warn you: this is _the_ most grotty
part of the whole kernel. It's not even ugly, it's just damn hard and
complex).

		Linus
From: Petri Kaukasoina [email blocked]
Subject: Re: 2.6.19 file content corruption on ext3
Date: Thu, 28 Dec 2006 21:05:41 +0200

On Thu, Dec 28, 2006 at 11:00:46AM -0800, Linus Torvalds wrote:
> And I have a test-program that shows the corruption _much_ easier (at
> least according to my own testing, and that of several reporters that back
> me up), and that seems to show the corruption going way way back (ie going
> back to Linux-2.6.5 at least, according to one tester).

That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
(or older)?
From: Linus Torvalds [email blocked]
Subject: Re: 2.6.19 file content corruption on ext3
Date: Thu, 28 Dec 2006 11:21:21 -0800 (PST)

On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > me up), and that seems to show the corruption going way way back (ie going
> > back to Linux-2.6.5 at least, according to one tester).
>
> That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> (or older)?

Well, that was a really _old_ fedora kernel. I guarantee you it didn't
have the page throttling patches in it, those were written this summer. So
it would either have to be Fedora carrying around another patch that just
happens to result in the same corruption for _years_, or it's the same
bug.

I bet it's the same bug, and it's been around for ages.

		Linus
From: Linus Torvalds [email blocked]
Subject: Re: 2.6.19 file content corruption on ext3
Date: Thu, 28 Dec 2006 13:24:30 -0800 (PST)

On Thu, 28 Dec 2006, Linus Torvalds wrote:
>
> What we need now is actually looking at the source code, and people who
> understand the VM, I'm afraid. I'm gathering traces now that I have a good
> test-case. I'll post my trace tools once I've tested that they work, in
> case others want to help.

Ok, I've got the traces, but quite frankly, I doubt anybody is crazy
enough to want to trawl through them. It's a bit painful, since we're
talking thousands of pages to trigger this problem.

Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably
ARM, but is used for other things on ia64, powerpc and sparc64.

But here's the patch in case anybody cares. It wants a _big_ kernel buffer
to capture all the crud into (which is why I made the thing accept a
bigger log buffer), and quite frankly, I'm not at all sure that all the
locking is ok (ie I could imagine that the dcache-locking thing there in
"is_interesting()" could deadlock, what do I know..)

But I've captured some real data with this, which I'll describe
separately.

		Linus

----
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..967dd80 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */

+#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)	test_bit(PG_arch_1, &(page)->flags)

 #if (BITS_PER_LONG > 32)
 /*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5c26818..7735b83 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL

 config LOG_BUF_SHIFT
 	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
-	range 12 21
+	range 12 24
 	default 17 if S390 || LOCKDEP
 	default 16 if X86_NUMAQ || IA64
 	default 15 if SMP
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..d6a0f56 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;

+if (PageInteresting(page)) printk("Removing index %08x from page cache\n", page->index);
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
@@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 	return err;
 }

+static noinline int is_interesting(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct dentry *dentry;
+	int retval = 0;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+		if (strcmp(dentry->d_name.name, "mapfile"))
+			continue;
+		retval = 1;
+		break;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+}
+
 /**
  * add_to_page_cache - add newly allocated pagecache pages
  * @page:	page to add
@@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);

+	if (is_interesting(mapping))
+		SetPageInteresting(page);
+
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..14c9815 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -667,6 +667,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+if (PageInteresting(page))
+	printk("Unmapped index %08x at %08x\n", page->index, addr);
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
@@ -1605,6 +1607,7 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
+if (PageInteresting(new_page)) printk("do_wp_page: mapping index %08x at %08lx\n", new_page->index, address);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
@@ -2249,6 +2252,7 @@ retry:
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+if (PageInteresting(new_page)) printk("do_no_page: mapping index %08x at %08lx (%s)\n", new_page->index, address, write_access ? "write" : "read");
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..0466601 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -813,6 +813,7 @@ int fastcall set_page_dirty(struct page *page)
 		if (!spd)
 			spd = __set_page_dirty_buffers;
 #endif
+if (PageInteresting(page)) printk("Setting page %08x dirty\n", page->index);
 		return (*spd)(page);
 	}
 	if (!PageDirty(page)) {
@@ -867,6 +868,7 @@ int clear_page_dirty_for_io(struct page *page)

 	if (TestClearPageDirty(page)) {
 		if (mapping_cap_account_dirty(mapping)) {
+if (PageInteresting(page)) printk("cpd_for_io: index %08x\n", page->index);
 			page_mkclean(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..e98e84c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,6 +448,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (pte_dirty(*pte) || pte_write(*pte)) {
 		pte_t entry;

+if (PageInteresting(page)) printk("cleaning index %08x at %08x\n", page->index, address);
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
@@ -637,6 +638,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		goto out_unmap;
 	}

+if (PageInteresting(page)) printk("unmapping index %08x from %08lx\n", page->index, address);
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
@@ -767,6 +769,7 @@ static void try_to_unmap_cluster(unsigned long cursor,
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;

+if (PageInteresting(page)) printk("Cluster-unmapping %08x from %08lx\n", page->index, address);
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);
From: Linus Torvalds [email blocked]
Subject: Re: 2.6.19 file content corruption on ext3
Date: Thu, 28 Dec 2006 14:37:37 -0800 (PST)

Ok, with the ugly trace capture patch, I've actually captured this
corruption in action, I think.

I did a full trace of all pages involved in one run, and picked one
corruption at random:

	Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
	Expected 129, got 0
	Written as (5126)9509(15017)

That's the first 76 bytes of a chunk missing, and it's the last 76 bytes
on a page. It's page index 01423 in the mapped file, and bytes fb4-fff
within that page.

There were four chunks written to that page:

	Writing chunk 14463/15800 (15%) (0142344c)	(1)
	Writing chunk 14462/15800 (30%) (01422e98)	(2) (overflows into 00001423)
	Writing chunk 14464/15800 (32%) (01423a00)	(3)
	Writing chunk 14465/15800 (60%) (01423fb4)	(4) <--- LOST!

and the other three chunks checked out all right.

And here's the annotated trace as it concerns that page:

 - here we write the first chunk to the page:
	** (1) do_no_page: mapping index 00001423 at b7d1f44c (write)
	** Setting page 00001423 dirty

 - something flushes it out to disk:
	** cpd_for_io: index 00001423
	** cleaning index 00001423 at b7d1f000

 - here we write the second chunk (which was split over the previous
   page and the interesting one):
	** (2) Setting page 00001422 dirty
	** (2) Setting page 00001423 dirty

 - and here we do a cleaning event
	** cpd_for_io: index 00001423
	** cleaning index 00001423 at b7d1f000

 - here we write the third chunk:
	** (3) Setting page 00001423 dirty

 - here we write the fourth chunk:
	** (4) NO DIRTY EVENT

 - and a third flush to disk:
	** cpd_for_io: index 00001423
	** cleaning index 00001423 at b7d1f000

 - here we unmap and flush:
	** Unmapped index 00001423 at b7d1f000
	** Removing index 00001423 from page cache

 - here we remap to check:
	** do_no_page: mapping index 00001423 at b7d1f000 (read)
	** Unmapped index 00001423 at b7d1f000

 - and finally, here I remove the file after the run:
	** Removing index 00001423 from page cache

Now, the important thing to see here is:

 - the missing write did not have a "Setting page 00001423 dirty" event
   associated with it.

 - but I can _see_ where the actual dirty event would be happening in the
   logs, because I can see the dirty events of the other chunk writes
   around it, so I know exactly where that fourth write happens. And
   indeed, it _shouldn't_ get a dirty event, because the page is still
   dirty from the write of chunk #3 to that page, which _did_ get a dirty
   event.

   I can see that, because the testing app writes the log of the pages it
   writes, and this is the log around the fourth and final write:

	...
	Writing chunk 5338/15800 (60%) (0076eb48)	PFN: 76e/76f
	Writing chunk 960/15800 (60%) (00156300)	PFN: 156
	Writing chunk 14465/15800 (60%) (01423fb4)	<----
	Writing chunk 8594/15800 (60%) (00bf74a8)	PFN: bf7
	Writing chunk 556/15800 (60%) (000c62f0)	PFN: c6
	Writing chunk 15190/15800 (60%) (01526678)	PFN: 1526
	...

   and I can match this up with the full log from the kernel, which looks
   like this:

	Setting page 0000076e dirty
	Setting page 0000076f dirty
	Setting page 00000156 dirty
	Setting page 000000c6 dirty
	Setting page 00001526 dirty

   so I know exactly where the missing writes (to our page at pfn 1423,
   and the pfn-bf7 page) happened.

 - and the thing is, I can see a "cpd_for_io()" happening AFTER that fourth
   write. Quite a long while after, in fact. So all of this looks very
   fine indeed. We are not losing any dirty bits.

 - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses
   the SAME dirty bit as write 4 did (which didn't make it out to disk!).
   The event that clears the dirty bit that write 3 did happens AFTER
   write 4 has happened!

So if we're not losing any dirty bits, what's going on?

I think we have some nasty interaction with the buffer heads. In
particular, I don't think it's the dirty page bits that are broken (I
_see_ that the PageDirty bit was set after write 4 was done to memory in
the kernel traces).

So I think that a real writeback just doesn't happen, because somebody has
marked the buffer heads clean _after_ it started IO on them.

I think "__mpage_writepage()" is buggy in this regard, for example. It
even has a comment about its crapola behaviour:

	/*
	 * Must try to add the page before marking the buffer clean or
	 * the confused fail path above (OOM) will be very confused when
	 * it finds all bh marked clean (i.e. it will not write anything)
	 */

however, I don't think that particular thing explains it, because I don't
think we use that function for the cases I'm looking at.

Anyway, I'll add tracing for page-writeback setting/cleaning too, in case
I might see anything new there..

		Linus
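
The offsets in the trace above follow from two numbers: the 1460-byte chunk size used by the test program and the page size. The sketch below redoes that arithmetic in C. It assumes 4 KiB pages (an assumption here, though consistent with the page indexes in the trace), and `struct chunk_span` and `span_of_chunk` are hypothetical names invented for the illustration.

```c
#include <stdint.h>

#define CHUNKSIZE	1460u	/* chunk size used by the test program */
#define TOY_PAGE_SHIFT	12u	/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE	(1u << TOY_PAGE_SHIFT)

struct chunk_span {
	uint32_t start;		/* file offset of the chunk's first byte */
	uint32_t first_page;	/* page index holding the first byte */
	uint32_t last_page;	/* page index holding the last byte */
	uint32_t bytes_on_first;/* bytes of the chunk that sit on first_page */
};

/* Map a chunk number to the file offset and page indexes it covers. */
static struct chunk_span span_of_chunk(uint32_t chunk)
{
	struct chunk_span s;
	uint32_t end = chunk * CHUNKSIZE + CHUNKSIZE - 1;
	uint32_t room;

	s.start = chunk * CHUNKSIZE;
	s.first_page = s.start >> TOY_PAGE_SHIFT;
	s.last_page = end >> TOY_PAGE_SHIFT;
	/* Space left on the first page from the chunk's start offset. */
	room = TOY_PAGE_SIZE - (s.start & (TOY_PAGE_SIZE - 1));
	s.bytes_on_first = room < CHUNKSIZE ? room : CHUNKSIZE;
	return s;
}
```

For chunk 14465 this yields a start offset of 0x01423fb4 on page index 0x1423 with 76 bytes on that page, which is exactly the lost range (0-75) in the report; the remainder of the chunk falls on page 0x1424. Chunk 14462 starts on page 0x1422 and ends on 0x1423, matching the "overflows into 00001423" note.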
From: David Miller [email blocked]
Subject: Re: 2.6.19 file content corruption on ext3
Date: Thu, 28 Dec 2006 14:50:38 -0800 (PST)

From: Linus Torvalds [email blocked]
Date: Thu, 28 Dec 2006 14:37:37 -0800 (PST)

> So if we're not losing any dirty bits, what's going on?

What happens when we writeback, to the PTEs?

page_mkclean_file() iterates the VMAs and when it finds a shared
one it goes:

	entry = ptep_clear_flush(vma, address, pte);
	entry = pte_wrprotect(entry);
	entry = pte_mkclean(entry);

and that's fine, but that PTE is still marked writable, and
I think that's key.

What does the fault path do in this situation?

	if (write_access) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					pte, pmd, ptl, entry);
		entry = pte_mkdirty(entry);
	}

It does nothing to update the page dirty state, because it's
writable, it just sets the PTE dirty bit and that's it. Should
it be setting the page dirty here for SHARED cases?

So until vmscan actually unmaps the PTE completely, we have this
window in which the application can write to the PTE and the
page dirty state doesn't get updated.

Perhaps something later cleans up after this, f.e. by rechecking the
PTE dirty bit at the end of I/O or when vmscan unmaps the page.
I guess that should handle things, but the above logic definitely
stood out to me.
From: Linus Torvalds [email blocked]
Subject: Re: 2.6.19 file content corruption on ext3
Date: Thu, 28 Dec 2006 15:01:29 -0800 (PST)

On Thu, 28 Dec 2006, David Miller wrote:
>
> What happens when we writeback, to the PTEs?

Not a damn thing.

We clear the PTE's _before_ we even start the write. The writeback does
nothing to them. If the user dirties the page while writeback is in
progress, we'll take the page fault and re-dirty it _again_.

> page_mkclean_file() iterates the VMAs and when it finds a shared
> one it goes:
>
> 	entry = ptep_clear_flush(vma, address, pte);
> 	entry = pte_wrprotect(entry);
> 	entry = pte_mkclean(entry);
>
> and that's fine, but that PTE is still marked writable, and
> I think that's key.

No it's not. It's right there. "pte_wrprotect(entry)". You even copied it
yourself.

> What does the fault path do in this situation?
>
> 	if (write_access) {
> 		if (!pte_write(entry))
> 			return do_wp_page(mm, vma, address,
> 					pte, pmd, ptl, entry);

So we call "do_wp_page()", and that does everything right.

		Linus
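
The point of disagreement in this exchange comes down to which bits survive the three-call sequence quoted above. A minimal userspace sketch, with a made-up two-bit PTE layout (the `toy_` helpers and bit positions are assumptions for illustration and do not reflect any real architecture), shows that after write-protecting and cleaning, a write access takes the `do_wp_page()` branch of the quoted fault-path snippet rather than the "just set the PTE dirty bit" branch.

```c
#include <stdbool.h>

/* Toy PTE: the bit layout is illustrative only. */
typedef unsigned long toy_pte_t;
#define TOY_PTE_WRITE	(1ul << 0)
#define TOY_PTE_DIRTY	(1ul << 1)

static toy_pte_t toy_wrprotect(toy_pte_t e) { return e & ~TOY_PTE_WRITE; }
static toy_pte_t toy_mkclean(toy_pte_t e)   { return e & ~TOY_PTE_DIRTY; }
static bool toy_pte_write(toy_pte_t e)      { return (e & TOY_PTE_WRITE) != 0; }

/* Same order of operations as the quoted page_mkclean sequence. */
static toy_pte_t toy_mkclean_sequence(toy_pte_t e)
{
	e = toy_wrprotect(e);
	e = toy_mkclean(e);
	return e;
}

enum fault_path {
	FAULT_WP_PAGE,		/* the do_wp_page() branch */
	FAULT_MKDIRTY_ONLY	/* only the PTE dirty bit gets set */
};

/* Mirrors the shape of the quoted fault-path check for a write access. */
static enum fault_path write_fault_path(toy_pte_t e)
{
	if (!toy_pte_write(e))
		return FAULT_WP_PAGE;
	return FAULT_MKDIRTY_ONLY;
}
```

In this toy, a PTE that starts out writable and dirty comes out of `toy_mkclean_sequence()` with both bits cleared, so the next write is routed to `FAULT_WP_PAGE`, which is the path Linus says handles re-dirtying correctly.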
From: Linus Torvalds [email blocked]
Subject: Re: 2.6.19 file content corruption on ext3
Date: Thu, 28 Dec 2006 17:38:38 -0800 (PST)

Btw,
 much cleaned-up page tracing patch here, in case anybody cares (and
"test.c" attached, although I don't think it changed since last time).

The test.c output is a bit hard to read at times, since it will give
offsets in bytes as hex (ie "00a77664" means page frame 00000a77, and byte
664h within that page), while the kernel output is obviously the page
indexes (but the page fault _addresses_ can contain information about the
exact byte in a page, so you can match them up when some kernel event is
related to a page fault).

So both forms are necessary/logical, but it means that to match things up,
you often need to ignore the last three hex digits of the address that
"test.c" outputs.

This one also adds traces for the tags and the writeback activity, but
since I'm going out for birthday dinner, I won't have time to try to
actually analyse the trace I have..

Which is why I'm sending it out, in the hope that somebody else is working
on this corruption issue and is interested..

		Linus

----
diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..f5e132a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -722,6 +722,7 @@ int __set_page_dirty_buffers(struct page *page)
 			set_buffer_dirty(bh);
 			bh = bh->b_this_page;
 		} while (bh != head);
+		PAGE_TRACE(page, "dirtied buffers");
 	}
 	spin_unlock(&mapping->private_lock);

@@ -734,6 +735,7 @@ int __set_page_dirty_buffers(struct page *page)
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
+			PAGE_TRACE(page, "setting TAG_DIRTY");
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..0cf3dce 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,14 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */

+#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)	test_bit(PG_arch_1, &(page)->flags)
+
+#define PAGE_TRACE(page, msg, arg...) do { \
+	if (PageInteresting(page)) \
+		printk(KERN_DEBUG "PG %08lx: %s:%d " msg "\n", \
+			(page)->index, __FILE__, __LINE__ ,##arg ); \
+} while (0)

 #if (BITS_PER_LONG > 32)
 /*
@@ -183,32 +191,38 @@ static inline void SetPageUptodate(struct page *page)
 #define PageWriteback(page)	test_bit(PG_writeback, &(page)->flags)
 #define SetPageWriteback(page) \
 	do { \
-		if (!test_and_set_bit(PG_writeback, \
-				&(page)->flags)) \
+		if (!test_and_set_bit(PG_writeback, &(page)->flags)) { \
+			PAGE_TRACE(page, "set writeback"); \
 			inc_zone_page_state(page, NR_WRITEBACK); \
+		} \
 	} while (0)
 #define TestSetPageWriteback(page) \
 	({ \
 		int ret; \
 		ret = test_and_set_bit(PG_writeback, \
 					&(page)->flags); \
-		if (!ret) \
+		if (!ret) { \
+			PAGE_TRACE(page, "set writeback"); \
 			inc_zone_page_state(page, NR_WRITEBACK); \
+		} \
 		ret; \
 	})
 #define ClearPageWriteback(page) \
 	do { \
-		if (test_and_clear_bit(PG_writeback, \
-				&(page)->flags)) \
+		if (test_and_clear_bit(PG_writeback, &(page)->flags)) { \
+			PAGE_TRACE(page, "end writeback"); \
 			dec_zone_page_state(page, NR_WRITEBACK); \
+		} \
 	} while (0)
 #define TestClearPageWriteback(page) \
 	({ \
 		int ret; \
 		ret = test_and_clear_bit(PG_writeback, \
 				&(page)->flags); \
-		if (ret) \
+		if (ret) { \
+			PAGE_TRACE(page, "end writeback"); \
 			dec_zone_page_state(page, NR_WRITEBACK); \
+		} \
 		ret; \
 	})

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5c26818..7735b83 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL

 config LOG_BUF_SHIFT
 	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
-	range 12 21
+	range 12 24
 	default 17 if S390 || LOCKDEP
 	default 16 if X86_NUMAQ || IA64
 	default 15 if SMP
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..4df7d35 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;

+	PAGE_TRACE(page, "Removing page cache");
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
@@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 	return err;
 }

+static noinline int is_interesting(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct dentry *dentry;
+	int retval = 0;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+		if (strcmp(dentry->d_name.name, "mapfile"))
+			continue;
+		retval = 1;
+		break;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+}
+
 /**
  * add_to_page_cache - add newly allocated pagecache pages
  * @page:	page to add
@@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);

+	if (is_interesting(mapping))
+		SetPageInteresting(page);
+
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..20af32f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -667,6 +667,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+			PAGE_TRACE(page, "unmapped at %08lx", addr);
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
@@ -1605,6 +1606,7 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
+		PAGE_TRACE(new_page, "write fault at %08lx", address);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
@@ -2249,6 +2251,7 @@ retry:
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		PAGE_TRACE(new_page, "mapping at %08lx (%s)", address, write_access ? "write" : "read");
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..15f3aaf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -773,6 +773,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
+			PAGE_TRACE(page, "setting TAG_DIRTY");
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
@@ -813,6 +814,7 @@ int fastcall set_page_dirty(struct page *page)
 		if (!spd)
 			spd = __set_page_dirty_buffers;
 #endif
+		PAGE_TRACE(page, "setting dirty");
 		return (*spd)(page);
 	}
 	if (!PageDirty(page)) {
@@ -867,6 +869,7 @@ int clear_page_dirty_for_io(struct page *page)

 	if (TestClearPageDirty(page)) {
 		if (mapping_cap_account_dirty(mapping)) {
+			PAGE_TRACE(page, "clean_for_io");
 			page_mkclean(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
@@ -886,10 +889,12 @@ int test_clear_page_writeback(struct page *page)

 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
+			PAGE_TRACE(page, "clearing TAG_WRITEBACK");
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -907,14 +912,18 @@ int test_set_page_writeback(struct page *page)

 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
+			PAGE_TRACE(page, "setting TAG_WRITEBACK");
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		if (!PageDirty(page))
+		}
+		if (!PageDirty(page)) {
+			PAGE_TRACE(page, "clearing TAG_DIRTY");
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..e6b4676 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,6 +448,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (pte_dirty(*pte) || pte_write(*pte)) {
 		pte_t entry;

+		PAGE_TRACE(page, "cleaning PTE %08lx", address);
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
@@ -637,6 +638,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		goto out_unmap;
 	}

+	PAGE_TRACE(page, "unmapping from %08lx", address);
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
@@ -767,6 +769,7 @@ static void try_to_unmap_cluster(unsigned long cursor,
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;

+		PAGE_TRACE(page, "unmapping from %08lx", address);
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);

[test.c TEXT/PLAIN (2.9KB)]
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (22 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping))

static int chunkorder[NRCHUNKS];
static char *mapping;

static int order(int nr)
{
	int i;
	if (nr < 0 || nr >= NRCHUNKS)
		return -1;
	for (i = 0; i < NRCHUNKS; i++)
		if (chunkorder[i] == nr)
			return i;
	return -2;
}

static void checkmem(void *buf, int nr)
{
	unsigned int start = ~0u, end = 0;
	unsigned char c = nr, *p = buf, differs = 0;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		unsigned char got = *p++;
		if (got != c) {
			if (i < start)
				start = i;
			if (i > end)
				end = i;
			differs = got;
		}
	}
	if (start < end) {
		printf("Chunk %d corrupted (%u-%u) (%x-%x) \n", nr, start, end,
			page_offset(buf, start), page_offset(buf, end));
		printf("Expected %u, got %u\n", c, differs);
		printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	int fd, i;

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) (%08x) \r", chunk, NRCHUNKS, 100*i/NRCHUNKS, page_offset(mapping, chunk * CHUNKSIZE));
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) (%08x) \r", i, NRCHUNKS, 100*i/NRCHUNKS, page_offset(mapping, i * CHUNKSIZE));
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Clean up for next time */
	sleep(5);
	sync();
	sleep(5);
	munmap(mapping, SIZE);
	close(fd);
	unlink("mapfile");
	return 0;
}
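
The PAGE_TRACE macro in the patch above boils down to "print a tagged line, with file and line number, only for pages that carry a marker bit". A userspace analog might look like the sketch below. It is an illustration under stated assumptions, not the kernel code: `struct toy_page`, `toy_trace`, `TOY_PAGE_TRACE`, `trace_buf` and `trace_count` are hypothetical names, the marker is a plain `bool` rather than `PG_arch_1`, and it uses C99 `__VA_ARGS__` where the patch uses the GNU named-variadic form (`arg...` with `##arg`).

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_page {
	unsigned long index;	/* page index within the file */
	bool interesting;	/* stands in for the marker page flag */
};

static char trace_buf[256];	/* most recent formatted trace line */
static int trace_count;		/* number of trace lines emitted so far */

/* Format "PG <index>: <file>:<line> <message>" for marked pages only. */
static void toy_trace(const struct toy_page *pg, const char *file, int line,
		      const char *fmt, ...)
{
	va_list ap;
	int n;

	if (!pg->interesting)
		return;
	n = snprintf(trace_buf, sizeof(trace_buf), "PG %08lx: %s:%d ",
		     pg->index, file, line);
	if (n < 0 || (size_t)n >= sizeof(trace_buf))
		return;
	va_start(ap, fmt);
	vsnprintf(trace_buf + n, sizeof(trace_buf) - (size_t)n, fmt, ap);
	va_end(ap);
	trace_count++;
	fprintf(stderr, "%s\n", trace_buf);
}

/* The format string is the first variadic argument, so plain
 * __VA_ARGS__ is enough and no comma-swallowing extension is needed. */
#define TOY_PAGE_TRACE(pg, ...) \
	toy_trace((pg), __FILE__, __LINE__, __VA_ARGS__)
```

A call such as `TOY_PAGE_TRACE(&pg, "unmapped at %08lx", addr)` is silently skipped for unmarked pages, which is what keeps the volume of output manageable when only one file's pages are of interest.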
57306fa..e6b4676 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -448,6 +448,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) if (pte_dirty(*pte) || pte_write(*pte)) { pte_t entry; + PAGE_TRACE(page, "cleaning PTE %08lx", address); flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); @@ -637,6 +638,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, goto out_unmap; } + PAGE_TRACE(page, "unmapping from %08lx", address); /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); @@ -767,6 +769,7 @@ static void try_to_unmap_cluster(unsigned long cursor, if (ptep_clear_flush_young(vma, address, pte)) continue; + PAGE_TRACE(page, "unmapping from %08lx", address); /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); [test.c TEXT/PLAIN (2.9KB)] #include <sys/mman.h> #include <sys/fcntl.h> #include <unistd.h> #include <stdlib.h> #include <string.h> #include <stdio.h> #include <time.h> #define TARGETSIZE (22 << 20) #define CHUNKSIZE (1460) #define NRCHUNKS (TARGETSIZE / CHUNKSIZE) #define SIZE (NRCHUNKS * CHUNKSIZE) static void fillmem(void *start, int nr) { memset(start, nr, CHUNKSIZE); } #define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping)) static int chunkorder[NRCHUNKS]; static char *mapping; static int order(int nr) { int i; if (nr < 0 || nr >= NRCHUNKS) return -1; for (i = 0; i < NRCHUNKS; i++) if (chunkorder[i] == nr) return i; return -2; } static void checkmem(void *buf, int nr) { unsigned int start = ~0u, end = 0; unsigned char c = nr, *p = buf, differs = 0; int i; for (i = 0; i < CHUNKSIZE; i++) { unsigned char got = *p++; if (got != c) { if (i < start) start = i; if (i > end) end = i; differs = got; } } if (start < end) { printf("Chunk %d corrupted (%u-%u) 
(%x-%x) \n", nr, start, end, page_offset(buf, start), page_offset(buf, end)); printf("Expected %u, got %u\n", c, differs); printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1)); } } static char *remap(int fd, char *mapping) { if (mapping) { munmap(mapping, SIZE); posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED); } return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); } int main(int argc, char **argv) { int fd, i; /* * Make some random ordering of writing the chunks to the * memory map.. * * Start with fully ordered.. */ for (i = 0; i < NRCHUNKS; i++) chunkorder[i] = i; /* ..and then mix it up randomly */ srandom(time(NULL)); for (i = 0; i < NRCHUNKS; i++) { int index = (unsigned int) random() % NRCHUNKS; int nr = chunkorder[index]; chunkorder[index] = chunkorder[i]; chunkorder[i] = nr; } fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666); if (fd < 0) return -1; if (ftruncate(fd, SIZE) < 0) return -1; mapping = remap(fd, NULL); if (-1 == (int)(long)mapping) return -1; for (i = 0; i < NRCHUNKS; i++) { int chunk = chunkorder[i]; printf("Writing chunk %d/%d (%d%%) (%08x) \r", chunk, NRCHUNKS, 100*i/NRCHUNKS, page_offset(mapping, chunk * CHUNKSIZE)); fillmem(mapping + chunk * CHUNKSIZE, chunk); } printf("\n"); /* Unmap, drop, and remap.. */ mapping = remap(fd, mapping); /* .. and check */ for (i = 0; i < NRCHUNKS; i++) { int chunk = i; printf("Checking chunk %d/%d (%d%%) (%08x) \r", i, NRCHUNKS, 100*i/NRCHUNKS, page_offset(mapping, i * CHUNKSIZE)); checkmem(mapping + chunk * CHUNKSIZE, chunk); } printf("\n"); /* Clean up for next time */ sleep(5); sync(); sleep(5); munmap(mapping, SIZE); close(fd); unlink("mapfile"); return 0; }
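One detail of checkmem() in the test program is worth noting: it reports a chunk only when start < end, so a chunk with exactly one wrong byte (where start == end) would pass silently, and differs keeps only the last mismatching value. The following is a minimal sketch of a stricter check. It is not part of Linus' test program, and the names struct mismatch and find_mismatch() are hypothetical, chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result record; not part of the original test.c. */
struct mismatch {
    size_t count;       /* number of bytes that differ from the expected value */
    size_t first;       /* offset of the first differing byte (valid if count > 0) */
    size_t last;        /* offset of the last differing byte (valid if count > 0) */
    unsigned char got;  /* value found at the first differing byte */
};

/*
 * Scan len bytes at buf and compare each one against the expected fill byte.
 * Counting mismatches replaces the start < end test, so a single bad byte
 * is still reported.
 */
static struct mismatch find_mismatch(const void *buf, size_t len,
                                     unsigned char expected)
{
    const unsigned char *p = buf;
    struct mismatch m = { 0, 0, 0, 0 };
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        if (p[i] == expected)
            continue;
        if (m.count == 0) {
            /* Remember where the damage starts and what was found there. */
            m.first = i;
            m.got = p[i];
        }
        m.last = i;
        m.count++;
    }
    return m;
}
```

In checkmem() this could be called as find_mismatch(buf, CHUNKSIZE, (unsigned char)nr), reporting the chunk whenever count is non-zero.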
From: Andrew Morton [email blocked]
Subject: Re: 2.6.19 file content corruption on ext3
Date: Thu, 28 Dec 2006 17:59:54 -0800

On Thu, 28 Dec 2006 17:38:38 -0800 (PST)
Linus Torvalds [email blocked] wrote:

> in
> the hope that somebody else is working on this corruption issue and is
> interested..

What corruption issue? ;)

I'm finding that the corruption happens trivially with your test app, but apparently doesn't happen at all with ext2 or ext3, data=writeback. Maybe it will happen with increased rarity, but the difference is quite stark.

Removing the

    err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
                NULL, journal_dirty_data_fn);

from ext3_ordered_writepage() fixes things up.

The things which journal_submit_data_buffers() does after dropping all the locks are ... disturbing - I don't think we have sufficient tests in there to ensure that the buffer is still where we think it is after we retake locks (they're slippery little buggers). But that wouldn't explain it anyway.

It's inefficient that journal_dirty_data() will put these locked, clean buffers onto BJ_SyncData instead of BJ_Locked, but journal_submit_data_buffers() seems to dtrt with them.

So no theory yet. Maybe ext3 is just altering timing. But the difference is really large..

Disabling all the WB_SYNC_NONE stuff and making everything go synchronous everywhere has no effect. Disabling bdi_write_congested() has no effect.
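Andrew's observation ties the corruption to the ext3 data journaling mode, so anyone reproducing it may want to record which mode the filesystem holding mapfile is mounted with. The following is a minimal sketch, assuming the mount options are available as a comma-separated string (for example the fourth field of a line in /proc/mounts). data_mode_from_opts() is a hypothetical helper, not something from the thread. The sketch does not assume that a kernel lists the default mode there, so a missing data= option is reported as "unspecified"; at the time of this thread ext3 used ordered mode when no data= option was given.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: pick the ext3 data mode out of a comma-separated
 * mount option string. Returns "journal", "ordered", "writeback", or
 * "unspecified" when no recognized data= option is present.
 */
static const char *data_mode_from_opts(const char *opts)
{
    static const char *const modes[] = { "journal", "ordered", "writeback" };
    const char *p = opts;

    assert(opts != NULL);
    while (*p) {
        /* Find the end of the current option (next comma or end of string). */
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        if (len > 5 && strncmp(p, "data=", 5) == 0) {
            size_t i;
            for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
                /* Require an exact match of the value, not just a prefix. */
                if (strlen(modes[i]) == len - 5 &&
                    strncmp(p + 5, modes[i], len - 5) == 0)
                    return modes[i];
            }
        }
        if (!end)
            break;
        p = end + 1;
    }
    return "unspecified";
}
```

Passing the option string of the mount that holds mapfile would then show whether a run used the data=writeback case Andrew describes as unaffected.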



Explained

December 29, 2006 - 6:09am
Anonymous (not verified)

and the patch verified htt

December 29, 2006 - 7:35am
Anonymous (not verified)

and the patch verified

http://lkml.org/lkml/2006/12/29/47
