login
Header Space

 
 

Re: [PATCH] jbd_commit_transaction() races with journal_try_to_drop_buffers() causing DIO failures

Previous thread: [PATCH] drivers/net: convert & to && by Julia Lawall on Thursday, March 6, 2008 - 1:41 pm. (9 messages)

Next thread: 2.6.25-rc4 OOMs itself dead on bootup by Frank Sorenson on Thursday, March 6, 2008 - 1:26 pm. (12 messages)
To: <linux-ext4@...>
Cc: <linux-kernel@...>
Date: Thursday, March 6, 2008 - 1:42 pm

Hi,

  Below is my rewrite of ordered mode in JBD. Now we don't have a list of
data buffers that need syncing on transaction commit but a list of inodes
that need writeout during commit. This brings all sorts of advantages such
as possibility to get rid of journal heads and buffer heads for data
buffers in ordered mode, better ordering of writes on transaction commit,
simplification of some JBD code, no more anonymous pages when truncate of
data being committed happens. The patch has survived some light testing
but it still has some potential of eating your data so beware :) I've run
dbench to see whether we didn't decrease performance by different handling
of truncate and the throughput I'm getting on my machine is the same (OK,
is lower by 0.5%) if I disable the code in truncate waiting for commit to
finish... Also the throughput of dbench is about 2% better with my patch
than with current JBD.
  Any comments or testing most welcome.

									Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE Labs, CR
---

Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;

diff --git a/fs/buffer.c b/fs/buffer.c
index 897cd74..bd6aefd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1675,7 +1675,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 			 */
 			clear_buffer_dirty(bh);
 			set_buffer_uptodate(bh);
-		} else if (!buffer_mapped(bh) &amp;&amp; buffer_dirty(bh)) {
+		} else if (!buffer_mapped(bh) &amp;&amp; buffer_dirty(bh)
+			   &amp;&amp; !wbc-&gt;skip_unmapped) {
 			WARN_ON(bh-&gt;b_size != blocksize);
 			err = get_block(inode, block, bh, 1);
 			if (err)
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4f4020c..8530e5d 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -588,6 +588,7 @@ got:
 	ei-&gt;i_extra_isize =
 		(EXT3_INODE_SIZE(inode-&gt;i_sb) &gt; EXT3_GOOD_OLD_INODE_SIZE) ?
 		sizeof(struct ext3_inode) - EXT3_GOOD_OLD_INODE_SIZE : 0;
+	journal_init_jbd_inode(&amp;ei-&gt;jinode, inode);
 
 	ret = inode;
 	if(DQUOT_ALLOC_INODE(i...
To: <akpm@...>
Cc: <pbadari@...>, Jan Kara <jack@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, April 25, 2008 - 7:38 pm

Hi,

While looking at a bug related to direct IO returns to EIO, after
looking at the code, I found there is a window that
try_to_free_buffers() from direct IO could race with JBD, which holds
the reference to the data buffers before journal_commit_transaction()
ensures the data buffers has reached to the disk.

A little more detail: to prepare for direct IO, generic_file_direct_IO()
calls invalidate_inode_pages2_range() to invalidate the pages in the
cache before performaning direct IO.  invalidate_inode_pages2_range()
tries to free the buffers via try_to free_buffers(), but sometimes it
can't, due to the buffers is possible still on some transaction's
t_sync_datalist or t_locked_list waiting for
journal_commit_transaction() to process it. 

Currently Direct IO simply returns EIO if try_to_free_buffers() finds
the buffer is busy, as it has no clue that JBD is referencing it.

Is this a known issue and expected behavior? Any thoughts?


Mingming

--
To: Mingming Cao <cmm@...>
Cc: <akpm@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 8:26 am

Hi,

  Are you seeing this in data=ordered mode? As Andrew pointed out we do
filemap_write_and_wait() so all the relevant data buffers of the inode
should be already on disk. In __journal_try_to_free_buffer() we check
whether the buffer is already-written-out data buffer and unfile and free
it in that case. It shouldn't happen that a data buffer has
b_next_transaction set so really the only idea why try_to_free_buffers()
could fail is that somebody manages to write to a page via mmap before
invalidate_inode_pages2_range() gets to it. Under which kind of load do you
observe the problem? Do you know exactly because of which condition does
journal_try_to_free_buffers() fail?

								Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE Labs, CR
--
To: Jan Kara <jack@...>
Cc: Mingming Cao <cmm@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 1:11 pm

Thank you for your reply.

What we are noticing is invalidate_inode_pages2_range() fails with -EIO
(from try_to_free_buffers() since b_count &gt; 0).

I don't think the file is being updated through mmap(). Previous
writepage() added these buffers to t_sync_data list (data=ordered).
filemap_write_and_wait() waits for pagewrite back to be cleared.
So, buffers are no longer dirty, but still on the t_sync_data and
kjournald didn't get chance to process them yet :(

Since we have elevated b_count on these buffers, try_to_free_buffers()
fails. How can we make filemap_write_and_wait() to wait for kjournald
to unfile these buffers ?

Does this makes sense ? Am I missing something here ?

Thanks,
Badari

--
To: Badari Pulavarty <pbadari@...>
Cc: Mingming Cao <cmm@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 2:09 pm

Hmm, I don't get one thing:
  The call chain is invalidate_inode_pages2_range() -&gt;
invalidate_complete_page2() -&gt; try_to_release_page() -&gt; ext3_releasepage()
-&gt; journal_try_to_free_buffers() -&gt; __journal_try_to_free_buffer() and this
function should remove the buffer from the committing transaction. So who's
holding the reference to those buffers? Or is it that
__journal_try_to_free_buffer() fails to remove the buffer from the
committing transaction? Why?
  Hmm, maybe I have one idea - in theory we could call
__journal_try_to_free_buffer() exactly at the moment commit code inspects
the buffer. Then we'd release the buffer from the transaction but
try_to_free_buffers() would fail because of elevated b_count exactly as you
described. Could you maybe verify this? Not that I'd know how to easily fix
this ;)...

									Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE Labs, CR
--
To: Jan Kara <jack@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 3:09 pm

Thanks, yes I noticed that after you pointing this out.

But __journal_try_to_free_buffer() only unfile the buffer from
t_sync_datalist or t_locked list, the journal head is not removed in
journal_remove_journal_head() there, at that time,
journal_remove_journal_head() just check if counter b_jcount is 0. But
before calling __journal_try_to_free_buffer(), since
journal_try_to_free_buffers() already increase the b_jcount in
journal_grab_journal_head(), so the journal head is not removed in

Looking at the code, it seems the it's the journal_put_journal_head(jh)
who remove the journal head and decrease the bh

journal_try_to_free_buffers() 
{
...

jh = journal_grab_journal_head(bh);
                if (!jh)
                        continue;

                jbd_lock_bh_state(bh);
                __journal_try_to_free_buffer(journal, bh);
                journal_put_journal_head(jh);
                jbd_unlock_bh_state(bh);

...

}
 so when journal_put_journal_head()-&gt; __journal_remove_journal_head(),
now the b_jcount is zero, but is  
jh-&gt;b_transaction is NULL? So it seems possible that bh ref count is non
zero when exit from journal_put_journal_head() if jh_b_transaction is
not cleared.

I miss where jh-&gt;b_transaction is clear to NULL? 

void journal_put_journal_head(struct journal_head *jh)
{
        struct buffer_head *bh = jh2bh(jh);

        jbd_lock_bh_journal_head(bh);
        J_ASSERT_JH(jh, jh-&gt;b_jcount &gt; 0);
        --jh-&gt;b_jcount;
        if (!jh-&gt;b_jcount &amp;&amp; !jh-&gt;b_transaction) {
                __journal_remove_journal_head(bh);
                __brelse(bh);
        }
        jbd_unlock_bh_journal_head(bh);
}



--
To: Mingming Cao <cmm@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Tuesday, April 29, 2008 - 8:43 am

__journal_unfile_buffer() called from __journal_try_to_free_buffer() sets
jh-&gt;b_transaction to NULL. So as soon as journal_put_journal_head() is
called, it results in freeing of journal head and releasing buffer
reference. So really the only possible race I see is what I describe

									Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE Labs, CR
--
To: Jan Kara <jack@...>
Cc: Mingming Cao <cmm@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Thursday, May 1, 2008 - 11:16 am

Hi Andrew &amp; Jan,

I was able to reproduce the customer problem involving DIO
(invalidate_inode_pages2) problem by writing simple testcase
to keep writing to a file using buffered writes and DIO writes
forever in a loop. I see DIO writes fail with -EIO.

After a long debug, found 2 cases how this could happen.
These are race conditions with journal_try_to_free_buffers()
and journal_commit_transaction().

1) journal_submit_data_buffers() tries to get bh_state lock. If
try lock fails, it drops the j_list_lock and sleeps for
bh_state lock, while holding a reference on the buffer.
In the meanwhile, journal_try_to_free_buffers() can clean up the
journal head could call try_to_free_buffers(). try_to_free_buffers()
would fail due to the reference held by journal_submit_data_buffers()
- which in turn causes failues for DIO (invalidate_inode_pages2()).

2) When the buffer is on t_locked_list waiting for IO to finish,
we hold a reference and give up the cpu, if we can't get
bh_state lock. This causes try_to_free_buffers() to fail.

Fix is to drop the reference on the buffer if we can't get
bh_state lock, give up the cpu and re-try the whole operation -
instead of waiting for the vh_state lock.

Does this look like a resonable fix ?

Thanks,
Badari

1) journal_submit_data_buffers() tries to get bh_state lock. If
try lock fails, it drops the j_list_lock and sleeps for
bh_state lock, while holding a reference on the buffer head.
In the meanwhile, journal_try_to_free_buffers() can clean up the
journal head could call try_to_free_buffers(). try_to_free_buffers()
would fail due to the reference held by journal_submit_data_buffers()
- which inturn causes failues for DIO (invalidate_inode_pages2()).

2) When the buffer is on t_locked_list waiting for IO to finish,
we hold a reference and give up the cpu, if we can't get 
bh_state lock. This causes try_to_free_buffers() to fail.

Fix is to drop the reference on the buffer, give up the cpu
and re-try the whole operation.

Signed-off-...
To: Badari Pulavarty <pbadari@...>
Cc: Mingming Cao <cmm@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 1:06 pm

As Mingming pointed out there are few other places where we could hold
the bh reference. Note also that we accumulate references to buffers in the
wbuf[] list and we need that for submit_bh() which consumes one bh
reference. Generally, it seems to me as a too fragile and impractical
rule "nobody can hold bh reference when not holding page lock" which is
basically what it comes down to if you really want to be sure that
journal_try_to_free_buffers() succeeds. And also note that in principle
there are other places which hold references to buffers without holding the
page lock - for example writepage() in ordered mode (although this one is
in practice hardly triggerable). So how we could fix at least the races
with commit code is to implement launder_page() callback for ext3/4 which
would wait for the previous transaction commit in case the page has buffers
that are part of that commit (I don't want this logic in
journal_try_to_free_buffers() as that is called also on memory-reclaim
path, but journal_launder_page() is fine with me). This would be correct
but could considerably slow down O_DIRECT writes in cases they're mixed
with buffered writes so I'm not sure if this is acceptable.
  OTOH with the ordered mode rewrite patch, the problem with commit code
also goes away since there we don't need extra references to data buffers

							Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE Labs, CR
--
To: Jan Kara <jack@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 9, 2008 - 6:27 pm

I am not sure how we are going to gurantee that by the time
journal_try_to_free_buffers() get called, the page has buffers are not
as part of the current transaction commit(which could be different than
the one we waited in ext3_launder_page())?  

It seems more realistic to fix the races one by one to me.

There is still a window that journal_submit_data_buffers() already
removed the jh from the bh (when found the buffers are already being
synced), but still keep a reference to the buffer head.
journal_try_to_free_buffers() could be called. In that case
try_to_free_buffers() will be called since there is no jh related to
this buffer, and failed due to journal_submit_data_buffers() hasn't
finish the cleanup business yet. 

For this new race, we could just grab the j_list_lock when re-try
try_to_free_buffers() to force waiting for journal_commit_transaction()
to finish it flush work. But not sure if this is acceptable approach?

Patch like this? Comments?

Mingming

---------------------------------------------------------------------
There are a few cases direct IO could race with kjournal flushing
data buffers which could result direct IO return EIO error.

1) journal_submit_data_buffers() tries to get bh_state lock. If
try lock fails, it drops the j_list_lock and sleeps for
bh_state lock, while holding a reference on the buffer.
In the meanwhile, journal_try_to_free_buffers() can clean up the
journal head could call try_to_free_buffers(). try_to_free_buffers()
would fail due to the reference held by journal_submit_data_buffers()
- which in turn causes failues for DIO (invalidate_inode_pages2()).

2) When the buffer is on t_locked_list waiting for IO to finish,
we hold a reference and give up the cpu, if we can't get
bh_state lock. This causes try_to_free_buffers() to fail.

3) when journal_submit_data_buffers() saw the buffer is dirty but failed
to lock the buffer bh1, journal_submit_data_buffers() released the
j_list_lock and submit other buffers collected from previo...
To: Mingming Cao <cmm@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, May 12, 2008 - 11:54 am

Hello,

  Hmm, you are right. It is not enough to just wait in ext3_launder_page()
because we don't have a transaction for direct_IO started yet. But if we
  Not to me, really. The scheme for buffer references you are trying to
impose is awkward to say the least. First, it is completely
counter-intuitive (at least to me ;), second, it is impractical as well.
For example in your scheme, you have no sensible way of locking ordered
data mode buffer - you cannot just do: get the reference and do
lock_buffer() because that violates your requirements.  The only reasonable
way you could do that is to lock the page to make sure buffer won't go away
from you - but you cannot currently do that in journal commit code because
of lock ordering. So the only way I can see which is left is: get some jbd
spin lock to serialize with journal_try_to_free_buffers(), get the buffer
reference, try to lock buffer, if it fails, drop everything and restart.
And this is IMO no-go...
  And BTW even if you fix such races, I think you'll still have races like:
CPU1:                                      CPU2:
  filemap_write_and_wait()
                                           dirty a page
                                           msync() (dirties buffers)
  invalidate_inode_page2_range() -&gt; -EIO

  The code could historically always return EIO when mixing buffered and
unbuffered accesses and the question is, under which circumstances is this
acceptable?  I agree that the current state when if you do "buffered write,
DIO write" in sequence and you'll possibly get EIO is bad and we should fix
it. But I'm not sure we should fix the EIO return under all possible
  ^^^ Here you can see what I wrote above. Basically you just busy-loop
wait for buffer lock. You should at least put schedule() there so that you
  ^^^ And here you add a place where we are not guaranteed to make any
progress... If someone intensively spins on that buffer, commit code could

									Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE ...
To: Jan Kara <jack@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, May 12, 2008 - 8:39 pm

Does this match what you are thinking? It certainly slow down the DIO
path, but the positive side is it doesn't disturb the other code path...
thanks for your feedback!

--------------------------------------------

An unexpected EIO error gets returned when writing to a file
using buffered writes and DIO writes at the same time.

We found there are a number of places where journal_try_to_free_buffers()
could race with journal_commit_transaction(), the later still
helds the reference to the buffers on the t_syncdata_list or t_locked_list
, while journal_try_to_free_buffers() tries to free them, which resulting an EIO
error returns back to the dio caller. 

The logic fix is to retry freeing if journal_try_to_free_buffers() to failed
to free those data buffers while journal_commit_transaction() is still
reference those buffers.
This is done via implement ext3 launder_page() callback, instead of inside
journal_try_to_free_buffers() itself, so that it doesn't affecting other code
path calling journal_try_to_free_buffers and only dio path get affected.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Index: linux-2.6.26-rc1/fs/ext3/inode.c
===================================================================
--- linux-2.6.26-rc1.orig/fs/ext3/inode.c	2008-05-03 11:59:44.000000000 -0700
+++ linux-2.6.26-rc1/fs/ext3/inode.c	2008-05-12 12:41:27.000000000 -0700
@@ -1766,6 +1766,23 @@ static int ext3_journalled_set_page_dirt
 	return __set_page_dirty_nobuffers(page);
 }
 
+static int ext3_launder_page(struct page *page)
+{
+        int ret;
+	int retry = 5;
+
+	while (retry --) {
+		ret = ext3_releasepage(page, GFP_KERNEL);
+		if (ret == 1)
+			break;
+		else
+			schedule();
+	}
+
+        return ret;
+}
+
+
 static const struct address_space_operations ext3_ordered_aops = {
 	.readpage	= ext3_readpage,
 	.readpages	= ext3_readpages,
@@ -1778,6 +1795,7 @@ static const struct address_space_operat
 	.releasepage	= ext3_releasepage,
 	.direct_IO	= ext3_direct_IO,
 	.migr...
To: Mingming Cao <cmm@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Tuesday, May 13, 2008 - 10:54 am

Yes, I meant something like this. We could be more clever and do:

	head = bh = page_buffers(page);
	do {
		wait_on_buffer(bh);
		bh = bh-&gt;b_this_page;
	} while (bh != head);
	/*
	 * Now commit code should have been able to proceed and release
         * those buffers
	 */
        schedule();

or we could do simple:
	log_wait_commit(...);

That would impose larger perf. penalty but on the other hand you shouldn't
hit this path too often. But maybe the code above would be fine and would
handle most cases. Also please add a big comment to that function to explain
  Actually, we need .launder_page callback only in data=order mode.
data=writeback mode doesn't need it at all (journal code doesn't touch data
buffers there) and for data=journal mode DIO could have never worked
reasonably when mixed with buffered IO and it would have to do a different
and much more expensive trickery (like flushing the journal, or at least
forcing current transaction to commit).

								Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE Labs, CR
--
To: Jan Kara <jack@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Tuesday, May 13, 2008 - 6:23 pm

Bummer, we can't free buffers in ext3_launder_page() before calling
try_to_free_page, as later
invalidate_complete_page2()-&gt;try_to_free_page() expecting the page
buffers are still here, and will return EIO if it launder_page() has
already freed those buffers.:(

Doing wait_on_buffer() alone in launder_page() is not enough as it

Guess this is the last it worth give a try to see how bad it is, here is
a draft patch. Do you see any obvious problems?

Mingming


Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
 fs/ext3/inode.c |   68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

Index: linux-2.6.26-rc1/fs/ext3/inode.c
===================================================================
--- linux-2.6.26-rc1.orig/fs/ext3/inode.c	2008-05-13 13:35:27.000000000 -0700
+++ linux-2.6.26-rc1/fs/ext3/inode.c	2008-05-13 14:33:55.000000000 -0700
@@ -1766,6 +1766,53 @@ static int ext3_journalled_set_page_dirt
 	return __set_page_dirty_nobuffers(page);
 }
 
+/*
+ * There are a number of places where journal_try_to_free_buffers()
+ * could race with journal_commit_transaction(), the later still
+ * helds the reference to the buffers on the t_syncdata_list or t_locked_list
+ * while journal_try_to_free_buffers() tries to free them,
+ * which resulting an EIO error returns back to the generic_file_direct_IO()
+ *
+ * It is also possible that mapped IO could re-dirty the page and buffers
+ * after direct IO has waited dirty pages have been flush to disk.
+ * When the journal_try_to_free_buffers() is called from the direct IO
+ * path, it may failed to free the buffers as the buffer may be locked again
+ *
+ * To fix this problem, we add a ext3_launder_page() callback which
+ * is only called on direct IO path, to wait for the buffer unlocked
+ * and waiting for the jbd finish commit those data buffers.
+ *
+ * This will impact the direct IO perf, but this alows direct IO and
+ * buffered IO could be performed concurrently.
+ *
+ * returns...
To: Mingming Cao <cmm@...>
Cc: Jan Kara <jack@...>, Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Wednesday, May 14, 2008 - 1:08 pm

Are you sure? Because if bufferes are released in ext3_launder_page(),
PagePrivate() has been set to 0 and we should directly fall through to
releasing the page without ever calling try_to_release_page()... So I'd
want to find out why PagePrivate is still set in
  Yes, this would not be enough.

								Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE Labs, CR
--
To: Jan Kara <jack@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Wednesday, May 14, 2008 - 1:41 pm

You are right. PagePrivate() is being set to 0 in drop_buffers(). 

The problem is do_launder_page() returns successfully if the page is not
dirty (our case), so ext3_launder_page() is not even get called. This
also explains why the log_wait_commit() approach doesn't work for me:(

Have to think other ways...could we pass some flag to
journal_try_to_free_buffers(), and ask journal_try_to_free_buffers()
wait for jbd commit to finish flushing the data, if the request is from
directo IO?


Mingming

--
To: Mingming Cao <cmm@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Wednesday, May 14, 2008 - 2:14 pm

I didn't realize PageDirty() is going to be already cleared by previous
  Well, we could do that but we'd have to change try_to_release_page() call
to accept an extra argument which would consequently mean changing all the
filesystems... But yes, it probably makes sence because it is really
different whether we should just release the page because of memory
pressure or because direct IO needs to write to that area of the file.
So adding the parameter to releasepage() callback is probably a reasonable
thing to do.

								Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE Labs, CR
--
To: Jan Kara <jack@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 16, 2008 - 10:14 am

This patch fixed a few races between direct IO and kjournld commit transaction.
An unexpected EIO error gets returned to direct IO caller when it failed to
free those data buffers. This could be reproduced easily with parallel 
direct write and buffered write to the same file

More specific, those races could cause journal_try_to_free_buffers()
fail to free the data buffers, when jbd is committing the transaction that has
those data buffers on its t_syncdata_list or t_locked_list. 
journal_commit_transaction() still holds the reference to those buffers
 before data reach to disk and buffers are removed from the 
t_syncdata_list of t_locked_list. This prevent the concurrent 
journal_try_to_free_buffers() to free those buffers at the same time, and cause
EIO error returns back to direct IO.

With this patch, in case of direct IO and when try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish
flushing the current committing transaction's data buffers to disk, 
then try to free those buffers again.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt; 
---
 fs/jbd/commit.c      |    1 +
 fs/jbd/journal.c     |    1 +
 fs/jbd/transaction.c |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/jbd.h  |    3 +++
 4 files changed, 51 insertions(+)

Index: linux-2.6.26-rc1/include/linux/jbd.h
===================================================================
--- linux-2.6.26-rc1.orig/include/linux/jbd.h	2008-05-14 16:36:41.000000000 -0700
+++ linux-2.6.26-rc1/include/linux/jbd.h	2008-05-15 14:12:10.000000000 -0700
@@ -667,6 +667,9 @@ struct journal_s
 	 */
 	wait_queue_head_t	j_wait_transaction_locked;
 
+	/* Wait queu for waiting for data buffers to flushed to disk*/
+	wait_queue_head_t	j_wait_data_flushed;
+
 	/* Wait queue for waiting for checkpointing to complete */
 	wait_queue_head_t	j_wait_logspace;
 
Index: linux-2.6.26-rc1/fs/jbd/commit.c
===============================...
To: <cmm@...>
Cc: Jan Kara <jack@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 16, 2008 - 1:12 pm

Small nits..


Fix typo - "queue" and also this line is wrapping. Could you spilt it



fix grammar - "wait for the current transaction to finish syncing data





--
To: Jan Kara <jack@...>, <akpm@...>, Badari Pulavarty <pbadari@...>
Cc: <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 16, 2008 - 5:01 pm

JBD2: fix DIO error caused by race with DIO free_buffers and jbd2 commit transaction

From: Mingming Cao &lt;cmm@us.ibm.com&gt;

This patch fixed a few races between direct IO and kjournlad commit transaction.
An unexpected EIO error gets returned to direct IO caller when it failed to
free those data buffers. This could be reproduced easily with parallel 
direct write and buffered write to the same file

More specific, those races could cause jbd2_journal_try_to_free_buffers()
fail to free the data buffers, when jbd is committing the transaction that has
those data buffers on its t_syncdata_list or t_locked_list. 
jbd2_journal_commit_transaction() still holds the reference to those buffers
 before data reach to disk and buffers are removed from the 
t_syncdata_list of t_locked_list. This prevent the concurrent 
jbd2_journal_try_to_free_buffers() to free those buffers at the same time, but cause
EIO error returns back to direct IO.

With this patch, in case of direct IO and when try_to_free_buffers() failed,
let's waiting for jbd2_journal_commit_transaction() to finish
flushing the current committing transaction's data buffers to disk, 
then try to free those buffers again.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt; 
---
 fs/jbd2/transaction.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd2/transaction.c	2008-05-16 11:16:56.000000000 -0700
+++ linux-2.6.26-rc2/fs/jbd2/transaction.c	2008-05-16 13:52:04.000000000 -0700
@@ -1656,12 +1656,39 @@ out:
 	return;
 }
 
+/*
+ * jbd2_journal_try_to_free_buffers() could race with jbd2_journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * jbd2_journal_try_t...
To: Jan Kara <jack@...>, <akpm@...>, Badari Pulavarty <pbadari@...>
Cc: <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 16, 2008 - 5:01 pm

This patch fixed a few races between direct IO and kjournald commit transaction.
An unexpected EIO error gets returned to direct IO caller when it failed to
free those data buffers. This could be reproduced easily with parallel 
direct write and buffered write to the same file

More specific, those races could cause journal_try_to_free_buffers()
fail to free the data buffers, when jbd is committing the transaction that has
those data buffers on its t_syncdata_list or t_locked_list. 
journal_commit_transaction() still holds the reference to those buffers
 before data reach to disk and buffers are removed from the 
t_syncdata_list of t_locked_list. This prevent the concurrent 
journal_try_to_free_buffers() to free those buffers at the same time, but cause
EIO error returns back to direct IO.

With this patch, in case of direct IO and when try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish
flushing the current committing transaction's data buffers to disk, 
then try to free those buffers again.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt; 
---
 fs/jbd/transaction.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/truncate.c        |    3 +-
 2 files changed, 57 insertions(+), 3 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/transaction.c	2008-05-16 11:51:02.000000000 -0700
+++ linux-2.6.26-rc2/fs/jbd/transaction.c	2008-05-16 13:43:02.000000000 -0700
@@ -1648,12 +1648,39 @@ out:
 	return;
 }
 
+/*
+ * journal_try_to_free_buffers() could race with journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * Journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * try to free...
To: Mingming Cao <cmm@...>
Cc: <akpm@...>, Badari Pulavarty <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Sunday, May 18, 2008 - 6:37 pm

If Andrew or Christoph wouldn't beat you for "inventive use" of
gfp_mask, I'm fine with the patch as well ;). You can add

								Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SuSE CR Labs
--
To: Jan Kara <jack@...>
Cc: <akpm@...>, Badari Pulavarty <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, May 19, 2008 - 3:59 pm

This is less intrusive way to fix this problem. The gfp_mask was marked
as unused in try_to_free_page(). I looked at filesystems in the kernel,
there is only a few defined releasepage() callback, and only xfs checks
the flag(but not used). btrfs is actually using it though. I thought
about the way you have suggested, i.e.clean up this gfp_mask and and
replace with a flag.  I am not entirely sure if it we need to change the
address_space_operations and fix all the filesystems for this matter.

Andrew, what do you think? Is this approach acceptable? 


Thanks and regards,

--
To: <cmm@...>
Cc: <jack@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>, Jens Axboe <jens.axboe@...>
Date: Monday, May 19, 2008 - 4:25 pm

On Mon, 19 May 2008 12:59:18 -0700

&lt;wakes up&gt;

Please ensure that the final patch is sufficiently well changelogged to
permit me to remain asleep ;)

The -&gt;releasepage semantics are fairly ad-hoc and have grown over time.
It'd be nice to prevent them from becoming vaguer than they are.

It has been (approximately?) the case that code paths which really care
about having the page released will set __GFP_WAIT (via GFP_KERNEL)
whereas code paths which are happy with best-effort will clear
__GFP_WAIT (with a "0').  And that's reasonsable - __GFP_WAIT here
means "be synchronous" whereas !__GFP_WAIT means "be non-blocking".

Is that old convention not sufficient here as well?  Two problem areas
I see are mm/vmscan.c and fs/splice.c (there may be others).

In mm/vmscan.c we probably don't want your new synchronous behaviour
and it might well be deadlockable anyway.  No probs, that's what
__GFP_FS is for.

In fs/splice.c, reading the comment there I have a feeling that you've
found another bug, and that splice _does_ want your new synchronous
behaviour?
--
To: Andrew Morton <akpm@...>
Cc: <jack@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>, Jens Axboe <jens.axboe@...>
Date: Monday, May 19, 2008 - 6:07 pm

Yes, it looks like page_cache_pipe_buf_steal() expects page is free
before removeing it by passing the GFP_KERNEL flag, but currently ext3
could fails to releasepage when it called. In fact try_to_release_page()
return value is ignored in page_cache_pipe_buf_steal(), should probably
checked the failure case.


The other caller of try_to_release_page() in mm/splice.c is
fallback_migrate_page(), which does want the synchronous behaviour to
make sure buffers are dropped.


I will reuse the GFP_WAIT and GFP_FS flag in the updated patch.

Thanks for your feedback.


--
To: Andrew Morton <akpm@...>
Cc: <jack@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>, Jens Axboe <jens.axboe@...>
Date: Tuesday, May 20, 2008 - 2:03 pm

JBD2: fix race between jbd2_journal_try_to_free_buffers() and jbd2 commit transaction

From: Mingming Cao &lt;cmm@us.ibm.com&gt;

journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data buffer
to flush to disk. If the caller of journal_try_to_free_buffers() request
tries hard to release the buffers, it will treat the failure as error and return
back to the caller. We have seen the directo IO failed due to this race. 
Some of the caller of releasepage() also expecting the buffer to be dropped
when passed with GFP_KERNEL mask to the releasepage()-&gt;journal_try_to_free_buffers().

With this patch, if the caller is passing the GFP_KERNEL to indicating this
call could wait, in case of try_to_free_buffers() failed, let's waiting for
journal_commit_transaction() to finish commit the current committing transaction
, then try to free those buffers again with journal locked.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt; 
---
 fs/jbd2/transaction.c |   56 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 54 insertions(+), 2 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd2/transaction.c	2008-05-11 17:09:41.000000000 -0700
+++ linux-2.6.26-rc2/fs/jbd2/transaction.c	2008-05-19 17:20:01.000000000 -0700
@@ -1656,12 +1656,39 @@ out:
 	return;
 }
 
+/*
+ * jbd2_journal_try_to_free_buffers() could race with jbd2_journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * jbd2_journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * try to free that buffer.
+ *
+ * Called with journal-&gt;j_state_lock hold.
+ */
+static void jbd2_jo...
To: Andrew Morton <akpm@...>
Cc: <jack@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>, Jens Axboe <jens.axboe@...>
Date: Tuesday, May 20, 2008 - 2:02 pm

Updated patch to use existing GFP_KERNEL mask to indicating synchornous
wait in journal_try_to_free_buffers() to resolve the race with
journal_commit_transaction().

JBD: fix race between journal_try_to_free_buffers() and jbd commit transaction

From: Mingming Cao &lt;cmm@us.ibm.com&gt;

journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data buffer
to flush to disk. If the caller of journal_try_to_free_buffers() request
tries hard to release the buffers, it will treat the failure as error and return
back to the caller. We have seen the directo IO failed due to this race. 
Some of the caller of releasepage() also expecting the buffer to be dropped
when passed with GFP_KERNEL mask to the releasepage()-&gt;journal_try_to_free_buffers().

With this patch, if the caller is passing the GFP_KERNEL to indicating this
call could wait, in case of try_to_free_buffers() failed, let's waiting for
journal_commit_transaction() to finish commit the current committing transaction
, then try to free those buffers again with journal locked.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt; 
---
 fs/jbd/transaction.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/filemap.c         |    3 --
 2 files changed, 54 insertions(+), 4 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/transaction.c	2008-05-11 17:09:41.000000000 -0700
+++ linux-2.6.26-rc2/fs/jbd/transaction.c	2008-05-19 16:16:41.000000000 -0700
@@ -1648,12 +1648,39 @@ out:
 	return;
 }
 
+/*
+ * journal_try_to_free_buffers() could race with journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * Journal_try_to_free_buffers() will call this function to
+ * wai...
To: Andrew Morton <akpm@...>
Cc: <jack@...>, <pbadari@...>, <linux-ext4@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, May 21, 2008 - 7:39 pm

journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data buffer
to flush to disk. If the caller of journal_try_to_free_buffers() request
tries hard to release the buffers, it will treat the failure as error and return
back to the caller. We have seen the directo IO failed due to this race. 
Some of the caller of releasepage() also expecting the buffer to be dropped
when passed with GFP_KERNEL mask to the releasepage()-&gt;journal_try_to_free_buffers().

With this patch, if the caller is passing the GFP_KERNEL to indicating this
call could wait, in case of try_to_free_buffers() failed, let's waiting for
journal_commit_transaction() to finish commit the current committing transaction
, then try to free those buffers again with journal locked.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt; 
---
 fs/jbd2/transaction.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

Index: linux-2.6.26-rc3/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.26-rc3.orig/fs/jbd2/transaction.c	2008-05-21 16:17:51.000000000 -0700
+++ linux-2.6.26-rc3/fs/jbd2/transaction.c	2008-05-21 16:22:03.000000000 -0700
@@ -1656,12 +1656,40 @@ out:
 	return;
 }
 
+/*
+ * jbd2_journal_try_to_free_buffers() could race with jbd2_journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * jbd2_journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * try to free that buffer.
+ *
+ * Called with journal-&gt;j_state_lock hold.
+ */
+static void jbd2_journal_wait_for_transaction_sync_data(journal_t *journal)
+{
+	transaction_t *transaction = NULL;
+	tid_t tid;
+
+	transaction = j...
To: Andrew Morton <akpm@...>
Cc: <jack@...>, <pbadari@...>, <linux-ext4@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, May 21, 2008 - 7:38 pm

Changes since take 2:
- fix a bug pointed by Jan, and updated the comments


journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data buffer
to flush to disk. If the caller of journal_try_to_free_buffers() request
tries hard to release the buffers, it will treat the failure as error and return
back to the caller. We have seen the directo IO failed due to this race. 
Some of the caller of releasepage() also expecting the buffer to be dropped
when passed with GFP_KERNEL mask to the releasepage()-&gt;journal_try_to_free_buffers().

With this patch, if the caller is passing the GFP_KERNEL to indicating this
call could wait, in case of try_to_free_buffers() failed, let's waiting for
journal_commit_transaction() to finish commit the current committing transaction
, then try to free those buffers again with journal locked.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt; 
---
 fs/jbd/transaction.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/filemap.c         |    3 --
 2 files changed, 56 insertions(+), 4 deletions(-)

Index: linux-2.6.26-rc3/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc3.orig/fs/jbd/transaction.c	2008-05-21 16:17:51.000000000 -0700
+++ linux-2.6.26-rc3/fs/jbd/transaction.c	2008-05-21 16:20:11.000000000 -0700
@@ -1648,12 +1648,40 @@ out:
 	return;
 }
 
+/*
+ * journal_try_to_free_buffers() could race with journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * Journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * try to free that buffer.
+ *
+ * Called with journal-&gt;j_state_lock hold.
+ */
+static void journal_wait_for_transaction_sync_data(journ...
To: Mingming <cmm@...>
Cc: <jack@...>, <pbadari@...>, <linux-ext4@...>, linux-kernel <linux-kernel@...>
Date: Thursday, May 22, 2008 - 1:57 am

"latter"




Unneeded initialisation.  Could just do



"callers"


Sorry about all the spelling flames ;) I'd normally just fix them
myself rather than typing them all into an email and having you type
them in again, etc.  But I think the patch needs to be respun anyway.

The mask-and-compare with GFP_KERNEL does appear to be correct, but it
is quite unusual.  Generally in a situation like this we will test for
the specific __GFP_foo flags which we're interested in.  For
documentation reasons if nothing else.

So the preferred form here would be

	if (ret == 0 &amp;&amp;
		(gfp_mask &amp; (__GFP_WAIT|__GFP_FS)) == (__GFP_WAIT|__GFP_FS)) {

which really tells the reader what we're trying to do here.  And I
don't think this code cares about __GFP_IO, even though it would be
mighty peculirr (probably buggy) for someone to do

Did we actually need to hold j_state_lock across the
try_to_free_buffers() call here?  Because it'll increase hold times and
will introduce a lock-ranking dependency which we might not otherwise

--
To: Mingming Cao <cmm@...>
Cc: Andrew Morton <akpm@...>, <jack@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>, Jens Axboe <jens.axboe@...>
Date: Tuesday, May 20, 2008 - 7:53 pm

What is actually the point of entering the function with j_state_lock
held and also keeping it after return? It should be enough to take it
  This comment seems a bit misleading to me - I'd rather write there:

@gfp_mask: we use the mask to detect how hard should we try to release
buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
  I think this test is wrong - it should rather be something like
(ret == 0 &amp;&amp; (gfp_mask &amp; GFP_KERNEL == GFP_KERNEL)) - or even expand the
test to gfp_mask &amp; __GFP_WAIT &amp;&amp; gfp_mask &amp; __GFP_FS &amp;&amp; gfp_mask &amp;
  Probably __GFP_WAIT | __GFP_IO here... But I'm not sure why do we

								Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SuSE CR Labs
--
To: Jan Kara <jack@...>
Cc: Andrew Morton <akpm@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>, Jens Axboe <jens.axboe@...>
Date: Wednesday, May 21, 2008 - 1:14 pm

I was worried about the case when we call try_to_free_buffers() again,
it races with the current transaction commit again. Is it possible? I
guess the question is whether it is possible to have buffers on the same
page attached to different transaction.  If so, I think we need to keep
the journal state lock while retry try_to_free_buffers(), so that the



For try_to_release_page(),we should wait only when (__GFP_WAIT &amp;

Thanks, patch v3 to follow.

Mingming

--
To: Mingming <cmm@...>
Cc: Andrew Morton <akpm@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>, Jens Axboe <jens.axboe@...>
Date: Saturday, May 24, 2008 - 6:44 pm

Well, but by the time log_wait_commit() finishes, it may well
happen that a new transaction is already started so your lock doesn't
help you much. And the page you are called on is actually locked, so
noone can really mess with it until you unlock it... So I think you can
just use the lock for obtaining tid and then drop it.

								Honza

PS: For JBD2 you'd need to be a bit more careful because you cannot call
log_wait_commit() while holding page lock (we have reversed locking
order for ext4) - but ordered-mode rewrite patch actually fixes this
problem and I'm going to submit the splitted patches on Monday or
Tuesday (I only need to test them that I didn't do something stupid
while porting them to ext4)...

-- 
Jan Kara &lt;jack@suse.cz&gt;
SuSE CR Labs
--
To: Jan Kara <jack@...>
Cc: Andrew Morton <akpm@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>, Jens Axboe <jens.axboe@...>
Date: Wednesday, May 28, 2008 - 2:18 pm

You are right that the page was locked during the process we are trying
Thanks for pointing this out. I think when we put back the reversed
locking order and new ordered mode the jbd2 patch could go away...

Updated patch for JBD (take 4) below.
Mingming

JBD: fix race between journal_try_to_free_buffers() and jbd commit transaction

From: Mingming Cao &lt;cmm@us.ibm.com&gt;

journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data buffer
to flush to disk. If the caller of journal_try_to_free_buffers() request
tries hard to release the buffers, it will treat the failure as error and return
back to the caller. We have seen the directo IO failed due to this race. 
Some of the caller of releasepage() also expecting the buffer to be dropped
when passed with GFP_KERNEL mask to the releasepage()-&gt;journal_try_to_free_buffers().

With this patch, if the caller is passing the GFP_KERNEL to indicating this
call could wait, in case of try_to_free_buffers() failed, let's waiting for
journal_commit_transaction() to finish commit the current committing transaction
, then try to free those buffers again with journal locked.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt; 
---
 fs/jbd/transaction.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/filemap.c         |    3 --
 2 files changed, 56 insertions(+), 4 deletions(-)

Index: linux-2.6.26-rc3/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc3.orig/fs/jbd/transaction.c	2008-05-28 10:55:37.000000000 -0700
+++ linux-2.6.26-rc3/fs/jbd/transaction.c	2008-05-28 10:57:32.000000000 -0700
@@ -1648,12 +1648,42 @@ out:
 	return;
 }
 
+/*
+ * journal_try_to_free_buffers() could race with journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_lis...
To: Mingming Cao <cmm@...>
Cc: Andrew Morton <akpm@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>, Jens Axboe <jens.axboe@...>
Date: Wednesday, May 28, 2008 - 2:55 pm

I think Andrew prefered this test to be expanded but otherwise the patch
is fine now. You can add:
  Acked-by: Jan Kara &lt;jack@suse.cz&gt;


								Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE Labs, CR
--
To: Andrew Morton <akpm@...>
Cc: Jan Kara <jack@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Wednesday, May 28, 2008 - 8:16 pm

Updated patch after Jan acked.

journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data buffer
to flush to disk. If the caller of journal_try_to_free_buffers() request
tries hard to release the buffers, it will treat the failure as error and return
back to the caller. We have seen the directo IO failed due to this race. 
Some of the caller of releasepage() also expecting the buffer to be dropped
when passed with GFP_KERNEL mask to the releasepage()-&gt;journal_try_to_free_buffers().

With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS
to indicating this call could wait, in case of try_to_free_buffers() failed, 
let's waiting for journal_commit_transaction() to finish commit the 
current committing transaction, then try to free those buffers again.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;

---
 fs/jbd/transaction.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/filemap.c         |    3 --
 2 files changed, 56 insertions(+), 4 deletions(-)

Index: linux-2.6.26-rc3/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc3.orig/fs/jbd/transaction.c	2008-05-28 14:16:41.000000000 -0700
+++ linux-2.6.26-rc3/fs/jbd/transaction.c	2008-05-28 16:12:15.000000000 -0700
@@ -1648,12 +1648,42 @@ out:
 	return;
 }
 
+/*
+ * journal_try_to_free_buffers() could race with journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * Journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * try to free that buffer.
+ *
+ * Called with journal-&gt;j_state_lock hold.
+ */
+static void journal_wait_for_transaction_sync_data(journal_t *journ...
To: Andrew Morton <akpm@...>
Cc: Jan Kara <jack@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Wednesday, May 28, 2008 - 8:18 pm

journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data buffer
to flush to disk. If the caller of journal_try_to_free_buffers() request
tries hard to release the buffers, it will treat the failure as error and return
back to the caller. We have seen the directo IO failed due to this race. 
Some of the caller of releasepage() also expecting the buffer to be dropped
when passed with GFP_KERNEL mask to the releasepage()-&gt;journal_try_to_free_buffers().

With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS 
to indicating this call could wait, in case of try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish commit the 
current committing transaction, then try to free those buffers again.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
---
 fs/jbd2/transaction.c |   59 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 57 insertions(+), 2 deletions(-)

Index: linux-2.6.26-rc3/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.26-rc3.orig/fs/jbd2/transaction.c	2008-05-28 16:10:41.000000000 -0700
+++ linux-2.6.26-rc3/fs/jbd2/transaction.c	2008-05-28 16:13:16.000000000 -0700
@@ -1656,12 +1656,42 @@ out:
 	return;
 }
 
+/*
+ * jbd2_journal_try_to_free_buffers() could race with jbd2_journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * jbd2_journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * try to free that buffer.
+ *
+ * Called with journal-&gt;j_state_lock hold.
+ */
+static void jbd2_journal_wait_for_transaction_sync_data(journal_t *journal)
+{
+	transaction_t *transaction = NULL;
...
To: Mingming Cao <cmm@...>
Cc: Andrew Morton <akpm@...>, Jan Kara <jack@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 30, 2008 - 2:24 am

-aneesh
--
To: Aneesh Kumar K.V <aneesh.kumar@...>
Cc: Andrew Morton <akpm@...>, Jan Kara <jack@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 30, 2008 - 11:17 am

Thanks.  I noticed this yesterday and have sent Andrew updated patch to 
replace the one he just added to mm tree but forget to copy to the list. Here is the updated patch,

Mingming

---
 fs/jbd2/transaction.c |   61 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 58 insertions(+), 3 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd2/transaction.c	2008-05-29 12:21:40.000000000 -0700
+++ linux-2.6.26-rc4/fs/jbd2/transaction.c	2008-05-29 12:38:30.000000000 -0700
@@ -1656,12 +1656,43 @@ out:
 	return;
 }
 
+/*
+ * jbd2_journal_try_to_free_buffers() could race with
+ * jbd2_journal_commit_transaction(). The later might still hold the
+ * reference count to the buffers when inspecting them on
+ * t_syncdata_list or t_locked_list.
+ *
+ * jbd2_journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * try to free that buffer.
+ *
+ */
+static void jbd2_journal_wait_for_transaction_sync_data(journal_t *journal)
+{
+	transaction_t *transaction = NULL;
+	tid_t tid;
+
+	spin_lock(&amp;journal-&gt;j_state_lock);
+	transaction = journal-&gt;j_committing_transaction;
+
+	if (!transaction) {
+		spin_unlock(&amp;journal-&gt;j_state_lock);
+		return;
+	}
+
+	tid = transaction-&gt;t_tid;
+	spin_unlock(&amp;journal-&gt;j_state_lock);
+	jbd2_log_wait_commit(journal, tid);
+}
 
 /**
  * int jbd2_journal_try_to_free_buffers() - try to free page buffers.
  * @journal: journal for operation
  * @page: to try and free
- * @unused_gfp_mask: unused
+ * @gfp_mask: we use the mask to detect how hard should we try to release
+ * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
+ * release the buffers.
  *
  *
  * For all the buffers on this page,
@@ -1690,9 +1721,11 @@ out:
  * journal_try_to_free_buffer() is changing its state.  But that
  * cannot happ...
To: Jan Kara <jack@...>
Cc: Andrew Morton <akpm@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>, Jens Axboe <jens.axboe@...>
Date: Wednesday, May 28, 2008 - 8:15 pm

Okay, I will update the patch and cleanup the history, sent it to
Andrew. Thanks.

Mingming

--
To: Mingming Cao <cmm@...>
Cc: Andrew Morton <akpm@...>, <jack@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Tuesday, May 20, 2008 - 5:30 am

So something like this, then?

diff --git a/fs/splice.c b/fs/splice.c
index 7815003..e08a2f5 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -58,8 +58,8 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 		 */
 		wait_on_page_writeback(page);
 
-		if (PagePrivate(page))
-			try_to_release_page(page, GFP_KERNEL);
+		if (PagePrivate(page) &amp;&amp; !try_to_release_page(page, GFP_KERNEL))
+			goto out_unlock;
 
 		/*
 		 * If we succeeded in removing the mapping, set LRU flag
@@ -75,6 +75,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 	 * Raced with truncate or failed to remove page from current
 	 * address space, unlock and return failure.
 	 */
+out_unlock:
 	unlock_page(page);
 	return 1;
 }

-- 
Jens Axboe

--
To: Jens Axboe <jens.axboe@...>
Cc: Andrew Morton <akpm@...>, <jack@...>, <pbadari@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Tuesday, May 20, 2008 - 1:47 pm

--
To: Mingming Cao <cmm@...>
Cc: Jan Kara <jack@...>, Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 16, 2008 - 11:01 am

Got a couple of whitespace problems above it looks like.  Thanks,

Josef

--
To: Josef Bacik <jbacik@...>
Cc: Jan Kara <jack@...>, Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 16, 2008 - 1:11 pm

Thanks for catching this, below is updated patch, fixed the whitespace
and comments.


---------------------------------------------------
JBD: fix journal_try_to_free_buffers race with
journal_commit_transaction

From: Mingming Cao &lt;cmm@us.ibm.com&gt;

This patch fixed a few races between direct IO and kjournld commit
transaction.
An unexpected EIO error gets returned to direct IO caller when it failed
to
free those data buffers. This could be reproduced easily with parallel 
direct write and buffered write to the same file

More specificly, those races could cause journal_try_to_free_buffers()
fail to free the data buffers, when jbd is committing the transaction
that has
those data buffers on its t_syncdata_list or t_locked_list. 
journal_commit_transaction() still holds the reference to those buffers
 before data reach to disk and buffers are removed from the 
t_syncdata_list of t_locked_list. This prevent the concurrent 
journal_try_to_free_buffers() to free those buffers at the same time,
but cause
EIO error returns back to direct IO.

With this patch, in case of direct IO and when try_to_free_buffers()
failed,
let's waiting for journal_commit_transaction() to finish
flushing the current committing transaction's data buffers to disk, 
then try to free those buffers again.

Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Reviewed-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt; 
---
 fs/jbd/commit.c      |    1 +
 fs/jbd/journal.c     |    1 +
 fs/jbd/transaction.c |   46
++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/jbd.h  |    3 +++
 4 files changed, 51 insertions(+)

Index: linux-2.6.26-rc1/include/linux/jbd.h
===================================================================
--- linux-2.6.26-rc1.orig/include/linux/jbd.h	2008-05-14
16:36:41.000000000 -0700
+++ linux-2.6.26-rc1/include/linux/jbd.h	2008-05-15 14:12:10.000000000
-0700
@@ -667,6 +667,9 @@ struct journal_s
 	 */
 	wait_queue_head_t	j_wait_transaction_locked;
 
+	/* Wa...
To: <cmm@...>
Cc: Josef Bacik <jbacik@...>, Jan Kara <jack@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 16, 2008 - 1:17 pm

drop "dio" variable and compare here, like 

Thanks,
Badari

--
To: Badari Pulavarty <pbadari@...>
Cc: Josef Bacik <jbacik@...>, Jan Kara <jack@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 16, 2008 - 1:30 pm

Okay, will do. Also will update the patch with other format changes you


Also, This patch changed the struct journal_s to add a new wait queue,
so that journal_try_to_free_buffers() could only need to wait for data
to be commited, rather than using j_wait_done_commit queue and wait for
the whole transaction being committed. It might break other fs  that
uses journal_s, thus not worth it. Will update the patch.

--
To: Jan Kara <jack@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Friday, May 16, 2008 - 10:13 am

Actually there is an argument gfp_mask passed to try_to_release_page()
which we could pass a special flag from direct IO that could be parsed
as direct IO request.  This would avoid changing all the filesystems and
the address space operation interface.  In fact, I don't see in-kernel
tree fs releasepage() cal back functions is using this gfp_mask, but
Will send a patch shortly, with that patch the test fine for about 18
hours. 


--
To: Jan Kara <jack@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Tuesday, May 13, 2008 - 12:37 pm

Thanks.
We could recheck if buffer_busy() before calling wait_on_buffer(bh) to
wait for buffer unlocked. This will handles the mapped IO re-dirty race
case, but still need the schedule() and retry to handle the buffered IO

My concern with doing log_wait_commit() here is the perf penalty. In the
case the buffers is at the end of the queue to commit, we have to wait
for all other previous transactions to finish committing before we could

You are right, thanks for pointing this out.
Will post an updated patch.

Mingming

--
To: Jan Kara <jack@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, May 12, 2008 - 3:23 pm

Do you mean calling journal_try_to_free_buffers() inside
ext3_launder_page()? I think we still need some lock to serialize
launder_page() with kjournald commit code(not sure if is that Okay?),
otherwise there is always a window that by the time
try_to_free_buffers() get called, the current transaction has be

Sigh...I am not very happy with the solution either, but I could not see
a decent solution that could fix this problem.  Currently we constantly
hit EIO error only in 10 minutes with the simple parallel buffered IO

I could see this is possible with mapped IO. But for buffered IO, since
Yup. 

The conflict is that if we still held the bh refrence after released the
j_list_lock, journal_try_to_free_buffers() could came in returns EIO to
direct IO since buffer is busy(); but if we release the bh reference
after released the j_list_lock, it made possible that
journal_try_to_free_buffers() to free that buffer, so we can't do
lock_buffer() here anymore so we have to loop here. This is a trade
off ...

On the other hand, the journal_submit_data_bufferes() continue process
this buffer after regrab the j_list_lock even if it's has been removed
from the t_syncdata_list by __journal_try_to_free_buffers(). IMO this is

--
To: Mingming Cao <cmm@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Tuesday, May 13, 2008 - 10:20 am

Once we succeed with removing journal heads from data buffers in
journal_try_to_free_buffers(), they can be added only if someone dirties
the page via mmap and writepage is called (which I'd currently neglect as
too rare and not common usage). So after that moment, once commit code
  Well, you are going to hit this path for example when the buffer is under
IO just now. So it happens quite often e.g. when the machine is under
memory pressure and writes-out dirty data more aggressively.

									Honza
-- 
Jan Kara &lt;jack@suse.cz&gt;
SUSE Labs, CR
--
To: Jan Kara <jack@...>
Cc: Mingming Cao <cmm@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 8:10 pm

Yes. I have been discussing all of these with Mingming. I agree with you
that it looks silly to update all the places to not hold a ref on the
buffer while waiting for spinlocks. 

Adding a launder_page() seems to be the right approach. I wouldn't worry
about making O_DIRECT slower. Currently, its failing with -EIO in this
case anyway. 

Unfortunately, this happens at a customer site and I can't effort to
give them a complete re-write type patch - I need to get this into 
older distro release :( Thats why I am trying to fix the cases.
I am finding more and more of these ..

Thanks,
Badari

--
To: Jan Kara <jack@...>
Cc: Badari Pulavarty <pbadari@...>, <akpm@...>, <linux-ext4@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 1:53 pm

Actually there is one more place that journal_try_to_free_buffers
(calling from DIO path) could race with journal_submit_data_buffers(),
the DIO and buffered IO test still returns EIO with the fix which should
fixed the first 3 race cases.

I could not figure out how this could happen with Badari's fix to
inverted_lock().

This time the debug shows that the one who release put_bh() after the
journal_try_to_free_buffers() failed is from this code path:

journal_submit_data_buffers() {

...

 } else if (!locked &amp;&amp; buffer_locked(bh)) {
                        __journal_file_buffer(jh, commit_transaction,
                                                BJ_Locked);
                        jbd_unlock_bh_state(bh);
                        put_bh(bh);

}

But when we get here we should already making sure the bh is on the

It seems to me it's safe in this case. When
journal_try_to_free_buffers() called from DIO path,
filemap_fdatawrite_and_wait() already making sure that the IO submitted

--