Re: [PATCH RESEND] Minimal fix for private_list handling races

Previous thread: [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview by Ryo Tsuruta on Wednesday, January 23, 2008 - 8:53 am. (18 messages)

Next thread: Okey'ciler Artık Yep Yeni Bir Okey Salonunuz Var Üstelik 10 ytl Bonusla İlk Üyelik Şansı by Duygu TUFANOGLU on Wednesday, January 23, 2008 - 10:11 am. (1 message)
To: Nick Piggin <nickpiggin@...>
Cc: <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Wednesday, January 23, 2008 - 9:30 am

I thought about this solution as well. But main issue I had with this
solution is that currently, you nicely submit all the metadata buffers at
once, so that block layer can sort them and write them in nice order. With
the array you submit buffers by 16 (or any other fixed amount) and in
The problem here is that mark_buffer_dirty_inode() can move the buffer
from 'tmp' list back to private_list while we are waiting for another
I think this get_bh() should stop try_to_free_buffers() from removing the
buffer. brelse() before taking the private_lock is fine, because the loop
actually checks for while (!list_empty(tmp)) so we really don't care what
happens with the buffer after we are done with it. So I think that logic is
Well, I wrote a patch which cleaned up this logic and got rid of that tmp
list because the handling is really messy but Andrew didn't like it and

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

To: Jan Kara <jack@...>
Cc: <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Wednesday, January 23, 2008 - 11:05 am

Oh, I don't know the filesystems very well... which ones would

Hmm, no not while we're waiting for another buffer because b_assoc_buffers
will not be empty. However it is possible between removing from the inode
list and insertion onto the temp list I think, because

if (list_empty(&bh->b_assoc_buffers)) {

check in mark_buffer_dirty_inode is done without private_lock held. Nice.

With that in mind, doesn't your first patch suffer from a race due to
exactly this unlocked list_empty check when you are removing clean buffers
from the queue?

if (!buffer_dirty(bh) && !buffer_locked(bh))
mark_buffer_dirty()
if (list_empty(&bh->b_assoc_buffers))
/* add */
__remove_assoc_queue(bh);

Which results in the buffer being dirty but not on the ->private_list,
doesn't it?

But let's see... there must be a memory ordering problem here in existing
code anyway, because I don't see any barriers. Between b_assoc_buffers and
b_state (via buffer_dirty); fsync_buffers_list vs mark_buffer_dirty_inode,

Oh, of course. I overlooked the important fact that the tmp list is
also actually subject to modification via other threads exactly the
same as private_list...
--

To: Nick Piggin <nickpiggin@...>
Cc: <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Wednesday, January 23, 2008 - 11:48 am

This logic is actually used only by a few filesystems - ext2 and UDF are
probably the most common ones. For example for ext2, the indirect blocks
are on the list if the file is freshly written, so that is roughly around
1MB of metadata per 1GB of data (for 4KB blocks, with 1KB blocks it is 4MB
per 1GB). Because seeks are expensive, you could really end up with the
Hmm, I'm not sure about which patch you speak. Logic with removing clean
buffers has been in the first version (but there mark_buffer_dirty_inode()
was written differently). In the current version, we readd buffer to
private_list if it is found dirty in the second while loop of
I'm not sure. What exactly to you mean? BTW: spin_lock is a memory barrier,
isn't it?

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

To: Jan Kara <jack@...>
Cc: <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Friday, January 25, 2008 - 4:34 am

Yeah, that's fair enough I suppose. I wasn't thinking you'd have a
huge newly dirtied file, but it could happen. I don't want to cause

Sure, I think there is still a data race though, but if there is one
it's already been there for a long time and nobody cares too much about

In existing code:

mark_buffer_dirty_inode(): fsync_buffers_list():
test_set_buffer_dirty(bh); list_del_init(&bh->b_assoc_buffers);
if (list_empty(&bh->b_assoc_buffers)) if (buffer_dirty(bh)) {
... list_add(&bh->b_assoc_buffers, );

These two code sequences can run concurrently because only fsync_buffers_list
takes the lock.

So fsync_buffers_list can speculatively load bh->b_state before
its stores to clear b_assoc_buffers propagate to the CPU running
mark_buffer_dirty_inode.

So if there is a !dirty buffer on the list, then fsync_buffers_list will
remove it from the list, but mark_buffer_dirty_inode won't see it has been
removed from the list and won't re-add it. I think.

This is actually even possible to hit on x86 because they reorder loads
past stores. It needs a smp_mb() before if (buffer_dirty(bh) {}.

Actually I very much dislike testing list entries locklessly, because they
are not trivially atomic operations like single stores... which is another
reason why I like your first patch.
--

To: Nick Piggin <nickpiggin@...>
Cc: <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, January 28, 2008 - 6:16 pm

OK, Nick, how do you like the patch below?

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---

There are two possible races in handling of private_list in buffer cache.
1) When fsync_buffers_list() processes a private_list, it clears
b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
sees a buffer is on list so it calls __remove_assoc_queue() which complains
about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
This race has been actually observed in the wild.
2) When fsync_buffers_list() processes a private_list,
mark_buffer_dirty_inode() can be called on bh which is already on the private
list of fsync_buffers_list(). As buffer is on some list (note that the check is
performed without private_lock), it is not readded to the mapping's
private_list and after fsync_buffers_list() finishes, we have a dirty buffer
which should be on private_list but it isn't. This race has not been reported,
probably because most (but not all) callers of mark_buffer_dirty_inode() hold
i_mutex and thus are serialized with fsync().

Fix these issues by not clearing b_assoc_map when fsync_buffers_list() moves
buffer to a dedicated list and by reinserting buffer in private_list when
it is found dirty after we have submitted buffer for IO.

Signed-off-by: Jan Kara <jack@suse.cz>

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..76e1ab4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -678,7 +678,7 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
} else {
BUG_ON(mapping->assoc_mapping != buffer_mapping);
}
- if (list_empty(&bh->b_assoc_buffers)) {
+ if (!bh->b_assoc_map) {
spin_lock(&buffer_mapping->private_lock);
list_move_tail(&bh->b_assoc_buffers,
&mapping->private_list);
@@ -794,6 +794,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
{
struct buffer_head *bh;
struct list_head tmp;
+ struct address_...

To: Nick Piggin <nickpiggin@...>
Cc: <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Friday, January 25, 2008 - 2:16 pm

OK, I'll believe you ;) I was never completely sure what all can happen
Yes, that was actually the reason why I changed the checks from
list_empty() to b_assoc_map testing. Well, so I'll add the barrier and
maybe also change these list_empty() checks to b_assoc_map tests...

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

Previous thread: [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview by Ryo Tsuruta on Wednesday, January 23, 2008 - 8:53 am. (18 messages)

Next thread: Okey'ciler Artık Yep Yeni Bir Okey Salonunuz Var Üstelik 10 ytl Bonusla İlk Üyelik Şansı by Duygu TUFANOGLU on Wednesday, January 23, 2008 - 10:11 am. (1 message)