While the data corruption bug fixed in 2.6.20-rc3 [story [1]] was still being tracked down [story [2]], it was thought that the bug, a race in shared mmap'ed page writeback, might have been in the 2.6 kernel for a very long time. It has since been determined that the bug was introduced much more recently. Nick Piggin [interview [3]] explained, "this bug was only introduced in 2.6.19, due to a change that caused pte dirty bits to be discarded without a subsequent set_page_dirty() (nowhere else in the kernel should have done this)." Linus Torvalds suggested that earlier kernels could have been affected by a less serious variation of the bug:
"Actually, I think 2.6.18 may have a subtle variation on it. But that much older race would only trigger on SMP (or possibly UP with preempt). And I haven't actually thought about it that much, so I could be full of crap. But I don't see anything that protects against it: we may hold the page lock, but since the code that marks things _dirty_ doesn't necessarily always hold it, that doesn't help us. And we may hold the 'private_lock', but we drop it before we do the dirty bit clearing, and in fact on UP+PREEMPT that very dropping could cause an active preemption to take place, so.. I dunno. For older kernels? If there is a race there, it must be pretty damn hard to hit in practice (and it must have been there for a looong time), so trying to fix it is possibly as likely to cause problems as it migh to fix them."
David Miller pointed out that some of the confusion over when the bug was introduced stems from the fact that the original bug report was filed against a 2.6.18 Debian kernel. Andrew Morton [interview [4]] explained, "that was 2.6.18+debian-added-dirty-page-tracking-patches," then went on to caution that the fix does not explain a newly reported and still unconfirmed BerkeleyDB corruption bug: "I'll assert (and emphasise) that the cause of the alleged BerkeleyDB corruption is not known at this time. The post-2.6.19 'fix' might make it go away. But if it does, we do not know why, and it might still be there, only harder to hit."
From: Andrea Gelmini [email blocked]
To: Linux Kernel Mailing List [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Fri, 29 Dec 2006 23:43:09 +0100

On Fri, Dec 29, 2006 at 06:59:02PM +0000, Linux Kernel Mailing List wrote:
> Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7658cc289288b8ae7dd2c2224549a048431222b3
> Commit: 7658cc289288b8ae7dd2c2224549a048431222b3
> Parent: 3bf8ba38f38d3647368e4edcf7d019f9f8d9184a
> Author: Linus Torvalds [email blocked]
> AuthorDate: Fri Dec 29 10:00:58 2006 -0800
> Committer: Linus Torvalds [email blocked]
> CommitDate: Fri Dec 29 10:00:58 2006 -0800
>
> VM: Fix nasty and subtle race in shared mmap'ed page writeback

With 2.6.20-rc2-git1, which contain this patch, I have no more Berkeley
DB corruption with Klibido (1).
I'm afraid a lot of software project switched to Sqlite (2), from BDB (3),
because the bug this patch fix (ie. http://bogofilter.sourceforge.net/).
I've also thought, since years, it was an userland problem.

Ciao,
gelma

-------------------
(1) http://klibido.sourceforge.net/
(2) http://www.sqlite.org/
(3) http://www.oracle.com/database/berkeley-db/index.html
From: Nick Piggin [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Sun, 31 Dec 2006 14:55:58 +1100

Andrea Gelmini wrote:
> On Fri, Dec 29, 2006 at 06:59:02PM +0000, Linux Kernel Mailing List wrote:
>
>>Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7658cc289288b8ae7dd2c2224549a048431222b3
>>Commit: 7658cc289288b8ae7dd2c2224549a048431222b3
>>Parent: 3bf8ba38f38d3647368e4edcf7d019f9f8d9184a
>>Author: Linus Torvalds [email blocked]
>>AuthorDate: Fri Dec 29 10:00:58 2006 -0800
>>Committer: Linus Torvalds [email blocked]
>>CommitDate: Fri Dec 29 10:00:58 2006 -0800
>>
>> VM: Fix nasty and subtle race in shared mmap'ed page writeback
>
>
> With 2.6.20-rc2-git1, which contain this patch, I have no more Berkeley
> DB corruption with Klibido.
> I'm afraid a lot of software project switched to Sqlite, from BDB,
> because the bug this patch fix (ie. http://bogofilter.sourceforge.net/).
> I've also thought, since years, it was an userland problem.

This bug was only introduced in 2.6.19, due to a change that caused pte
dirty bits to be discarded without a subsequent set_page_dirty() (nowhere
else in the kernel should have done this).

So if your corruption is years old, then it must be something else.
Maybe it is hidden by a timing change, or BDB isn't using msync properly.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
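For readers unfamiliar with the msync usage Nick Piggin alludes to, the following is a minimal userspace sketch of the usual pattern: store through a MAP_SHARED mapping, flush with msync(MS_SYNC), then read the file back through the descriptor and compare. It is not BerkeleyDB's code and not a reproducer for the kernel race; the function name write_and_verify, the MAP_LEN size, and the file path passed in are all hypothetical, chosen only for illustration.

```c
#define _XOPEN_SOURCE 700   /* expose pread(), ftruncate(), msync() under strict -std */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define MAP_LEN 4096  /* hypothetical mapping size, kept small for the sketch */

/*
 * Store a byte pattern through a shared file mapping, flush it with
 * msync(MS_SYNC), then read the file back with pread() and compare.
 * Returns 0 if the contents match, 1 on mismatch, -1 if a system call fails.
 */
int write_and_verify(const char *path, unsigned char fill)
{
    unsigned char readback[MAP_LEN];
    unsigned char *map;
    int result = -1;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* The file must be at least as long as the region being mapped. */
    if (ftruncate(fd, MAP_LEN) != 0)
        goto out_close;

    map = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;

    /* Dirty the page through the mapping, not through write(). */
    memset(map, fill, MAP_LEN);

    /* Ask the kernel to write the dirty mapped pages back and wait for it. */
    if (msync(map, MAP_LEN, MS_SYNC) != 0)
        goto out_unmap;

    /* Read the same range back through the file descriptor. */
    if (pread(fd, readback, MAP_LEN, 0) != (ssize_t)MAP_LEN)
        goto out_unmap;

    result = (memcmp(map, readback, MAP_LEN) == 0) ? 0 : 1;

out_unmap:
    munmap(map, MAP_LEN);
out_close:
    close(fd);
    return result;
}
```

Note that the read-back here is normally served from the page cache, so a match does not prove the data reached the disk. A real test case for the corruption discussed in this thread would need memory pressure or a remount between the write and the check.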
From: Andrea Gelmini [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Sun, 31 Dec 2006 14:50:31 +0100

On Sun, Dec 31, 2006 at 02:55:58PM +1100, Nick Piggin wrote:
> This bug was only introduced in 2.6.19, due to a change that caused pte

no, Linus said that with 2.6.19 it's easier to trigger this bug...

> So if your corruption is years old, then it must be something else.
> Maybe it is hidden by a timing change, or BDB isn't using msync properly.

I can give you a complete image where just changing kernel (everything
is same, of course) corruptions goes away.
we spent a lot, I mean a *lot*, of time looking for our code mistake,
and so on.
I don't want to seem rude, but I am sure that Berkeley DB corruption we
have seen (not just Klibido, but I also think about postgrey, and so on)
depends on this bug.
I repeat, if you have time/interest I can give you a complete machine
to see the problem.

thanks a lot for your time,
gelma
From: Nick Piggin [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Thu, 04 Jan 2007 14:57:24 +1100

Andrea Gelmini wrote:
> On Sun, Dec 31, 2006 at 02:55:58PM +1100, Nick Piggin wrote:
>
>>This bug was only introduced in 2.6.19, due to a change that caused pte
>
> no, Linus said that with 2.6.19 it's easier to trigger this bug...

Yhat's when the bug was introduced -- 2.6.19. 2.6.18 does not have
this bug, so it cannot be years old.

>>So if your corruption is years old, then it must be something else.
>>Maybe it is hidden by a timing change, or BDB isn't using msync properly.
>
> I can give you a complete image where just changing kernel (everything
> is same, of course) corruptions goes away.
> we spent a lot, I mean a *lot*, of time looking for our code mistake,
> and so on.
> I don't want to seem rude, but I am sure that Berkeley DB corruption we
> have seen (not just Klibido, but I also think about postgrey, and so on)
> depends on this bug.
> I repeat, if you have time/interest I can give you a complete machine
> to see the problem.

You're not being rude, but I just wanted to point out that this patch
(nor the dirty page accounting also in 2.6.19) doesn't fix anything that
was in 2.6.18, AFAIKS.

I wouldn't discount a kernel bug, but it will be hard to track down
unless you can find an earlier kernel that did not cause the corruptions
and/or provide source for a minimal test case to reproduce.

--
SUSE Labs, Novell Inc.
From: Linus Torvalds [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Wed, 3 Jan 2007 20:44:36 -0800 (PST)

On Thu, 4 Jan 2007, Nick Piggin wrote:
>
> Yhat's when the bug was introduced -- 2.6.19. 2.6.18 does not have
> this bug, so it cannot be years old.

Actually, I think 2.6.18 may have a subtle variation on it.

In particular, I look back at the try_to_free_buffers() thing that I hated
so much, and it makes me wonder.. It used to do:

    spin_lock(&mapping->private_lock);
    ret = drop_buffers(page, &buffers_to_free);
    spin_unlock(&mapping->private_lock);
    if (ret) {
        .. crappy comment ..
        if (test_clear_page_dirty(page))
            task_io_account_cancelled_write(PAGE_CACHE_SIZE);
    }

and I think that at least on SMP, we had a race with another CPU doing the
"mark page dirty if it was dirty in the PTE" at the same time. Because the
marking dirty would come in, find no buffers (they just got dropped), and
then mark the page dirty (ignoring the lack of any buffers), but then the
above would do the "test_clear_page_dirty()" thing on it.

Ie the race, I think, existed where that crappy comment was.

But that much older race would only trigger on SMP (or possibly UP with
preempt). And I haven't actually thought about it that much, so I could be
full of crap. But I don't see anything that protects against it: we may
hold the page lock, but since the code that marks things _dirty_ doesn't
necessarily always hold it, that doesn't help us. And we may hold the
"private_lock", but we drop it before we do the dirty bit clearing, and in
fact on UP+PREEMPT that very dropping could cause an active preemption to
take place, so..

I dunno. I would certainly suggest the whole series of my "dirty cleanup"
be added on top of 2.6.18.3 (which apparently has the dirty mapped page
tracking).

For older kernels? If there is a race there, it must be pretty damn hard
to hit in practice (and it must have been there for a looong time), so
trying to fix it is possibly as likely to cause problems as it migh to
fix them.

Linus
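The ordering Linus describes can be illustrated with a toy, single-threaded userspace model. This is not kernel code: struct toy_page, the racer callback, and both function names are hypothetical, and the "other CPU" is simulated by calling the callback at the point where the lock would already have been released. The second function models the assumption that the dirty bit is cleared while the lock is still held and that the dirtying path must take the same lock, so it can only run afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the page state discussed in the thread. */
struct toy_page {
    bool dirty;        /* models the page dirty flag */
    bool has_buffers;  /* models attached buffer heads */
};

/* Simulated "other CPU" that marks the page dirty from a dirty PTE. */
typedef void (*racer_fn)(struct toy_page *page);

void other_cpu_marks_dirty(struct toy_page *page)
{
    page->dirty = true;
}

/*
 * Ordering from the quoted snippet: drop buffers under the lock, release
 * the lock, then clear the dirty bit. The racer runs in the unlocked
 * window. Returns whether the page is still dirty at the end.
 */
bool free_buffers_unlocked_clear(struct toy_page *page, racer_fn racer)
{
    assert(page != NULL);

    /* lock held */
    page->has_buffers = false;
    /* lock released: the window opens here */
    if (racer)
        racer(page);            /* other CPU sets dirty ... */
    page->dirty = false;        /* ... and this clear throws it away */
    return page->dirty;
}

/*
 * Assumed alternative: the dirty bit is cleared while the lock is held,
 * so a racer needing the same lock can only run after the clear.
 */
bool free_buffers_locked_clear(struct toy_page *page, racer_fn racer)
{
    assert(page != NULL);

    /* lock held */
    page->has_buffers = false;
    page->dirty = false;
    /* lock released */
    if (racer)
        racer(page);            /* dirtying happens after the clear */
    return page->dirty;
}
```

With other_cpu_marks_dirty as the racer, the first function ends with the page clean (the dirtying is lost) and the second ends with it dirty. Whether real kernels before 2.6.19 had the first ordering is exactly what the replies that follow go on to dispute.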
From: Nick Piggin [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Thu, 04 Jan 2007 16:07:18 +1100

Linus Torvalds wrote:
>
> On Thu, 4 Jan 2007, Nick Piggin wrote:
>
>>Yhat's when the bug was introduced -- 2.6.19. 2.6.18 does not have
>>this bug, so it cannot be years old.
>
>
> Actually, I think 2.6.18 may have a subtle variation on it.
>
> In particular, I look back at the try_to_free_buffers() thing that I hated
> so much, and it makes me wonder.. It used to do:
>
>     spin_lock(&mapping->private_lock);
>     ret = drop_buffers(page, &buffers_to_free);
>     spin_unlock(&mapping->private_lock);
>     if (ret) {
>         .. crappy comment ..
>         if (test_clear_page_dirty(page))
>             task_io_account_cancelled_write(PAGE_CACHE_SIZE);
>     }
>
> and I think that at least on SMP, we had a race with another CPU doing the
> "mark page dirty if it was dirty in the PTE" at the same time. Because the
> marking dirty would come in, find no buffers (they just got dropped), and
> then mark the page dirty (ignoring the lack of any buffers), but then the
> above would do the "test_clear_page_dirty()" thing on it.
>
> Ie the race, I think, existed where that crappy comment was.
>
> But that much older race would only trigger on SMP (or possibly UP with
> preempt).

Oh yes the try_to_free_buffers race, I think, does exist in older kernels.
Yes according to our earlier analysis it would trigger with UP+preempt
and SMP.

But the patch that Andrea was pointing to was your last patch (The Fix),
which stopped page_mkclean caller throwing out dirty bits. You probably
didn't see that in the mail I cc'ed you on.

So yes it would be interesting to see whether fixing try_to_free_buffers
fixes Andrea's problem on older kernels.

--
SUSE Labs, Novell Inc.
From: Andrew Morton [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Wed, 3 Jan 2007 21:41:21 -0800

On Wed, 3 Jan 2007 20:44:36 -0800 (PST)
Linus Torvalds [email blocked] wrote:

> Actually, I think 2.6.18 may have a subtle variation on it.
>
> In particular, I look back at the try_to_free_buffers() thing that I hated
> so much, and it makes me wonder.. It used to do:
>
>     spin_lock(&mapping->private_lock);
>     ret = drop_buffers(page, &buffers_to_free);
>     spin_unlock(&mapping->private_lock);
>     if (ret) {
>         .. crappy comment ..
>         if (test_clear_page_dirty(page))
>             task_io_account_cancelled_write(PAGE_CACHE_SIZE);
>     }
>
> and I think that at least on SMP, we had a race with another CPU doing the
> "mark page dirty if it was dirty in the PTE" at the same time. Because the
> marking dirty would come in, find no buffers (they just got dropped), and
> then mark the page dirty (ignoring the lack of any buffers), but then the
> above would do the "test_clear_page_dirty()" thing on it.
>

That bug was introduced in 2.6.19, with the dirty page tracking patches.

2.6.18 and earlier used ->private_lock coverage in try_to_free_buffers() to
prevent it.

> Ie the race, I think, existed where that crappy comment was.

The comment was complete, accurate and needed.
From: Nick Piggin [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Thu, 04 Jan 2007 17:03:43 +1100

Andrew Morton wrote:
> On Wed, 3 Jan 2007 20:44:36 -0800 (PST)
> Linus Torvalds [email blocked] wrote:
>
>
>>Actually, I think 2.6.18 may have a subtle variation on it.
>>
>>In particular, I look back at the try_to_free_buffers() thing that I hated
>>so much, and it makes me wonder.. It used to do:
>>
>>    spin_lock(&mapping->private_lock);
>>    ret = drop_buffers(page, &buffers_to_free);
>>    spin_unlock(&mapping->private_lock);
>>    if (ret) {
>>        .. crappy comment ..
>>        if (test_clear_page_dirty(page))
>>            task_io_account_cancelled_write(PAGE_CACHE_SIZE);
>>    }
>>
>>and I think that at least on SMP, we had a race with another CPU doing the
>>"mark page dirty if it was dirty in the PTE" at the same time. Because the
>>marking dirty would come in, find no buffers (they just got dropped), and
>>then mark the page dirty (ignoring the lack of any buffers), but then the
>>above would do the "test_clear_page_dirty()" thing on it.
>>
>
>
> That bug was introduced in 2.6.19, with the dirty page tracking patches.
>
> 2.6.18 and earlier used ->private_lock coverage in try_to_free_buffers() to
> prevent it.

Ohh, right you are, I was looking at 2.6.19 sources. The comments above
ttfb match that as well. Curious that the dirty page patches were allowed
to mess with this...

Anyway that leaves us with the question of why Andrea's database is getting
corrupted. Hopefully he can give us a minimal test-case.

>>Ie the race, I think, existed where that crappy comment was.
>
>
> The comment was complete, accurate and needed.

Indeed ;)

--
SUSE Labs, Novell Inc.
From: Andrew Morton [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Wed, 3 Jan 2007 22:12:20 -0800

On Thu, 04 Jan 2007 17:03:43 +1100
Nick Piggin [email blocked] wrote:

> > That bug was introduced in 2.6.19, with the dirty page tracking patches.
> >
> > 2.6.18 and earlier used ->private_lock coverage in try_to_free_buffers() to
> > prevent it.
>
> Ohh, right you are, I was looking at 2.6.19 sources. The comments above
> ttfb match that as well. Curious that the dirty page patches were allowed
> to mess with this...

Frankly, those patches scared the crap out of me, specifically because of
the delicacy and complexity of the various dirtiness state coherencies.
But I just didn't have the bandwidth to go through them with a sufficiently
fine toothcomb, sorry.

> Anyway that leaves us with the question of why Andrea's database is getting
> corrupted. Hopefully he can give us a minimal test-case.

It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
out of the woodwork. It's the first I've ever heard of them.
From: David Miller [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Wed, 03 Jan 2007 22:56:07 -0800 (PST)

From: Andrew Morton [email blocked]
Date: Wed, 3 Jan 2007 22:12:20 -0800

> On Thu, 04 Jan 2007 17:03:43 +1100
> Nick Piggin [email blocked] wrote:
>
> > > That bug was introduced in 2.6.19, with the dirty page tracking patches.
> > >
> > > 2.6.18 and earlier used ->private_lock coverage in try_to_free_buffers() to
> > > prevent it.
> >
> > Ohh, right you are, I was looking at 2.6.19 sources. The comments above
> > ttfb match that as well. Curious that the dirty page patches were allowed
> > to mess with this...
>
> Frankly, those patches scared the crap out of me, specifically because of
> the delicacy and complexity of the various dirtiness state coherencies.
> But I just didn't have the bandwidth to go through them with a sufficiently
> fine toothcomb, sorry.
>
> > Anyway that leaves us with the question of why Andrea's database is getting
> > corrupted. Hopefully he can give us a minimal test-case.
>
> It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
> out of the woodwork. It's the first I've ever heard of them.

Note that the original rtorrent debian bug report was against 2.6.18
From: Andrew Morton [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Wed, 3 Jan 2007 23:06:29 -0800

On Wed, 03 Jan 2007 22:56:07 -0800 (PST)
David Miller [email blocked] wrote:

> From: Andrew Morton [email blocked]
> Date: Wed, 3 Jan 2007 22:12:20 -0800
>
> > On Thu, 04 Jan 2007 17:03:43 +1100
> > Nick Piggin [email blocked] wrote:
> >
> > > > That bug was introduced in 2.6.19, with the dirty page tracking patches.
> > > >
> > > > 2.6.18 and earlier used ->private_lock coverage in try_to_free_buffers() to
> > > > prevent it.
> > >
> > > Ohh, right you are, I was looking at 2.6.19 sources. The comments above
> > > ttfb match that as well. Curious that the dirty page patches were allowed
> > > to mess with this...
> >
> > Frankly, those patches scared the crap out of me, specifically because of
> > the delicacy and complexity of the various dirtiness state coherencies.
> > But I just didn't have the bandwidth to go through them with a sufficiently
> > fine toothcomb, sorry.
> >
> > > Anyway that leaves us with the question of why Andrea's database is getting
> > > corrupted. Hopefully he can give us a minimal test-case.
> >
> > It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
> > out of the woodwork. It's the first I've ever heard of them.
>
> Note that the original rtorrent debian bug report was against 2.6.18

I think that was 2.6.18+debian-added-dirty-page-tracking-patches.

If that memory is correct, I'll assert (and emphasise) that the cause of the
alleged BerkeleyDB corruption is not known at this time.

The post-2.6.19 "fix" might make it go away. But if it does, we do not know
why, and it might still be there, only harder to hit.
From: Nick Piggin [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Thu, 04 Jan 2007 18:16:37 +1100

Andrew Morton wrote:
> On Wed, 03 Jan 2007 22:56:07 -0800 (PST)
> David Miller [email blocked] wrote:
>>>>Anyway that leaves us with the question of why Andrea's database is getting
>>>>corrupted. Hopefully he can give us a minimal test-case.
>>>
>>>It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
>>>out of the woodwork. It's the first I've ever heard of them.
>>
>>Note that the original rtorrent debian bug report was against 2.6.18
>
>
> I think that was 2.6.18+debian-added-dirty-page-tracking-patches.
>
> If that memory is correct, I'll assert (and emphasise) that the cause of the
> alleged BerkeleyDB corruption is not known at this time.

I think that's right. Even if it were plain 2.6.18 that had rtorrent
corruption, then it would be more evidence we still have an unidentified
bug, because none of the patches fixed anything we have found to be buggy
in 2.6.18.

> The post-2.6.19 "fix" might make it go away. But if it does, we do not know
> why, and it might still be there, only harder to hit.

Likely. I think it is only hiding the bug (maybe the writeout patterns from
shared dirty accounting are changing timings or codepaths).

Of course, this means that we still can't confirm whether or not it is a
kernel bug. It could be a BDB bug that's being hidden.

--
SUSE Labs, Novell Inc.
From: Hugh Dickins [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Thu, 4 Jan 2007 13:25:55 +0000 (GMT)

On Wed, 3 Jan 2007, Andrew Morton wrote:
> On Wed, 03 Jan 2007 22:56:07 -0800 (PST)
> David Miller [email blocked] wrote:
> > From: Andrew Morton [email blocked]
> > >
> > > It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
> > > out of the woodwork. It's the first I've ever heard of them.
> >
> > Note that the original rtorrent debian bug report was against 2.6.18
>
> I think that was 2.6.18+debian-added-dirty-page-tracking-patches.

That's right. Debian's 2.6.18-3, not -stable's 2.6.18.3 as Linus feared.

I'll be sending 2.6.18-stable the fix to the msync ENOMEM-on-unmapped
issue later today (that little buglet being what led them to integrate
the much more interesting dirty page tracking patches, which happened
to fix it in passing).

Hugh
From: Andrea Gelmini [email blocked]
Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback
Date: Thu, 4 Jan 2007 14:08:23 +0100

On Wed, Jan 03, 2007 at 10:12:20PM -0800, Andrew Morton wrote:
> > Anyway that leaves us with the question of why Andrea's database is getting
> > corrupted. Hopefully he can give us a minimal test-case.
>
> It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
> out of the woodwork. It's the first I've ever heard of them.

of course, because nobody had never thought it could be a kernel bug.

since first release of klibido we had db corruption. so Bauno, main
author/programmer, introduced various check in it. so we had found db
corruption. he had talked with sleepycat (it happens long before Oracle
buys them), they were very kind, but in the meanwhile lot of code was
changed, so he had decided to wait for other tests.

anyway, in the klibido mailing list people started to complain about
corruption db. we spended a lot of time trying to find clues.

anyway, to make this part very short, after months we got clear that
with Red Hat/Suse kernel we got no crash, and with
vanilla/Debian/Ubuntu/Slackware we can reproduce it with simple action.
so, we started checking klibido code, g++ versions, qt/kde versions, and
so on. but nothing changed.

well, all this story could mean nothing, but then I said "let's look at
other projects using bdb", and I subscribed to different mailing lists.
you can change klibido with postgrey/bogofilter, and you've same story.
if you have time to look at their mailing list archive, you'll see
people complain about db corruption, people telling them "it happens",
and, usually, switching to sqlite.

nobody wrote/write to kernel mailing list because nobody thought/think
it could be a kernel bug.

anyway, the important thing is that I can give you an image of my debian
machine where you can have a lot/not at all corruption just switching
bitween debian kernel (linux-image-2.6.18-3-686) and vanilla
2.6.20-rc2-git1.

by the way, klibido works also under BSD, and we have no bug report
about db corruption (I know, we dunno how many user, which DB size, and
so on).

I repeat, all these things are not so important because klibido, but
because in common with other projects.

I put Bauno, klibido author, in Cc (it also speak english better thank
me...)

thanks a lot for your time,
gelma
Related Links:
- Archive of above thread [28]
- KernelTrap interview with Nick Piggin [29]
- KernelTrap interview with Andrew Morton [30]