Hi,
Below is my rewrite of ordered mode in JBD. Now we don't have a list of
data buffers that need syncing on transaction commit but a list of inodes
that need writeout during commit. This brings all sorts of advantages such
as possibility to get rid of journal heads and buffer heads for data
buffers in ordered mode, better ordering of writes on transaction commit,
simplification of some JBD code, no more anonymous pages when truncate of
data being committed happens. The patch has survived some light testing
but it still has some potential of eating your data so beware :) I've run
dbench to see whether we didn't decrease performance by different handling
of truncate and the throughput I'm getting on my machine is the same (OK,
is lower by 0.5%) if I disable the code in truncate waiting for commit to
finish... Also the throughput of dbench is about 2% better with my patch
than with current JBD.
Any comments or testing most welcome.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---
Signed-off-by: Jan Kara <jack@suse.cz>
diff --git a/fs/buffer.c b/fs/buffer.c
index 897cd74..bd6aefd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1675,7 +1675,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
*/
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
- } else if (!buffer_mapped(bh) && buffer_dirty(bh)) {
+ } else if (!buffer_mapped(bh) && buffer_dirty(bh)
+ && !wbc->skip_unmapped) {
WARN_ON(bh->b_size != blocksize);
err = get_block(inode, block, bh, 1);
if (err)
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4f4020c..8530e5d 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -588,6 +588,7 @@ got:
ei->i_extra_isize =
(EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ?
sizeof(struct ext3_inode) - EXT3_GOOD_OLD_INODE_SIZE : 0;
+ journal_init_jbd_inode(&ei->jinode, inode);
ret = inode;
if(DQUOT_ALLOC_INODE(i...Hi, While looking at a bug related to direct IO returns to EIO, after looking at the code, I found there is a window that try_to_free_buffers() from direct IO could race with JBD, which holds the reference to the data buffers before journal_commit_transaction() ensures the data buffers has reached to the disk. A little more detail: to prepare for direct IO, generic_file_direct_IO() calls invalidate_inode_pages2_range() to invalidate the pages in the cache before performaning direct IO. invalidate_inode_pages2_range() tries to free the buffers via try_to free_buffers(), but sometimes it can't, due to the buffers is possible still on some transaction's t_sync_datalist or t_locked_list waiting for journal_commit_transaction() to process it. Currently Direct IO simply returns EIO if try_to_free_buffers() finds the buffer is busy, as it has no clue that JBD is referencing it. Is this a known issue and expected behavior? Any thoughts? Mingming --
Hi, Are you seeing this in data=ordered mode? As Andrew pointed out we do filemap_write_and_wait() so all the relevant data buffers of the inode should be already on disk. In __journal_try_to_free_buffer() we check whether the buffer is already-written-out data buffer and unfile and free it in that case. It shouldn't happen that a data buffer has b_next_transaction set so really the only idea why try_to_free_buffers() could fail is that somebody manages to write to a page via mmap before invalidate_inode_pages2_range() gets to it. Under which kind of load do you observe the problem? Do you know exactly because of which condition does journal_try_to_free_buffers() fail? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Thank you for your reply. What we are noticing is invalidate_inode_pages2_range() fails with -EIO (from try_to_free_buffers() since b_count > 0). I don't think the file is being updated through mmap(). Previous writepage() added these buffers to t_sync_data list (data=ordered). filemap_write_and_wait() waits for pagewrite back to be cleared. So, buffers are no longer dirty, but still on the t_sync_data and kjournald didn't get chance to process them yet :( Since we have elevated b_count on these buffers, try_to_free_buffers() fails. How can we make filemap_write_and_wait() to wait for kjournald to unfile these buffers ? Does this makes sense ? Am I missing something here ? Thanks, Badari --
Hmm, I don't get one thing: The call chain is invalidate_inode_pages2_range() -> invalidate_complete_page2() -> try_to_release_page() -> ext3_releasepage() -> journal_try_to_free_buffers() -> __journal_try_to_free_buffer() and this function should remove the buffer from the committing transaction. So who's holding the reference to those buffers? Or is it that __journal_try_to_free_buffer() fails to remove the buffer from the committing transaction? Why? Hmm, maybe I have one idea - in theory we could call __journal_try_to_free_buffer() exactly at the moment commit code inspects the buffer. Then we'd release the buffer from the transaction but try_to_free_buffers() would fail because of elevated b_count exactly as you described. Could you maybe verify this? Not that I'd know how to easily fix this ;)... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Thanks, yes I noticed that after you pointing this out.
But __journal_try_to_free_buffer() only unfile the buffer from
t_sync_datalist or t_locked list, the journal head is not removed in
journal_remove_journal_head() there, at that time,
journal_remove_journal_head() just check if counter b_jcount is 0. But
before calling __journal_try_to_free_buffer(), since
journal_try_to_free_buffers() already increase the b_jcount in
journal_grab_journal_head(), so the journal head is not removed in
Looking at the code, it seems the it's the journal_put_journal_head(jh)
who remove the journal head and decrease the bh
journal_try_to_free_buffers()
{
...
jh = journal_grab_journal_head(bh);
if (!jh)
continue;
jbd_lock_bh_state(bh);
__journal_try_to_free_buffer(journal, bh);
journal_put_journal_head(jh);
jbd_unlock_bh_state(bh);
...
}
so when journal_put_journal_head()-> __journal_remove_journal_head(),
now the b_jcount is zero, but is
jh->b_transaction is NULL? So it seems possible that bh ref count is non
zero when exit from journal_put_journal_head() if jh_b_transaction is
not cleared.
I miss where jh->b_transaction is clear to NULL?
void journal_put_journal_head(struct journal_head *jh)
{
struct buffer_head *bh = jh2bh(jh);
jbd_lock_bh_journal_head(bh);
J_ASSERT_JH(jh, jh->b_jcount > 0);
--jh->b_jcount;
if (!jh->b_jcount && !jh->b_transaction) {
__journal_remove_journal_head(bh);
__brelse(bh);
}
jbd_unlock_bh_journal_head(bh);
}
--__journal_unfile_buffer() called from __journal_try_to_free_buffer() sets jh->b_transaction to NULL. So as soon as journal_put_journal_head() is called, it results in freeing of journal head and releasing buffer reference. So really the only possible race I see is what I describe Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Hi Andrew & Jan, I was able to reproduce the customer problem involving DIO (invalidate_inode_pages2) problem by writing simple testcase to keep writing to a file using buffered writes and DIO writes forever in a loop. I see DIO writes fail with -EIO. After a long debug, found 2 cases how this could happen. These are race conditions with journal_try_to_free_buffers() and journal_commit_transaction(). 1) journal_submit_data_buffers() tries to get bh_state lock. If try lock fails, it drops the j_list_lock and sleeps for bh_state lock, while holding a reference on the buffer. In the meanwhile, journal_try_to_free_buffers() can clean up the journal head could call try_to_free_buffers(). try_to_free_buffers() would fail due to the reference held by journal_submit_data_buffers() - which in turn causes failues for DIO (invalidate_inode_pages2()). 2) When the buffer is on t_locked_list waiting for IO to finish, we hold a reference and give up the cpu, if we can't get bh_state lock. This causes try_to_free_buffers() to fail. Fix is to drop the reference on the buffer if we can't get bh_state lock, give up the cpu and re-try the whole operation - instead of waiting for the vh_state lock. Does this look like a resonable fix ? Thanks, Badari 1) journal_submit_data_buffers() tries to get bh_state lock. If try lock fails, it drops the j_list_lock and sleeps for bh_state lock, while holding a reference on the buffer head. In the meanwhile, journal_try_to_free_buffers() can clean up the journal head could call try_to_free_buffers(). try_to_free_buffers() would fail due to the reference held by journal_submit_data_buffers() - which inturn causes failues for DIO (invalidate_inode_pages2()). 2) When the buffer is on t_locked_list waiting for IO to finish, we hold a reference and give up the cpu, if we can't get bh_state lock. This causes try_to_free_buffers() to fail. Fix is to drop the reference on the buffer, give up the cpu and re-try the whole operation. Signed-off-...
As Mingming pointed out there are few other places where we could hold the bh reference. Note also that we accumulate references to buffers in the wbuf[] list and we need that for submit_bh() which consumes one bh reference. Generally, it seems to me as a too fragile and impractical rule "nobody can hold bh reference when not holding page lock" which is basically what it comes down to if you really want to be sure that journal_try_to_free_buffers() succeeds. And also note that in principle there are other places which hold references to buffers without holding the page lock - for example writepage() in ordered mode (although this one is in practice hardly triggerable). So how we could fix at least the races with commit code is to implement launder_page() callback for ext3/4 which would wait for the previous transaction commit in case the page has buffers that are part of that commit (I don't want this logic in journal_try_to_free_buffers() as that is called also on memory-reclaim path, but journal_launder_page() is fine with me). This would be correct but could considerably slow down O_DIRECT writes in cases they're mixed with buffered writes so I'm not sure if this is acceptable. OTOH with the ordered mode rewrite patch, the problem with commit code also goes away since there we don't need extra references to data buffers Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
I am not sure how we are going to gurantee that by the time journal_try_to_free_buffers() get called, the page has buffers are not as part of the current transaction commit(which could be different than the one we waited in ext3_launder_page())? It seems more realistic to fix the races one by one to me. There is still a window that journal_submit_data_buffers() already removed the jh from the bh (when found the buffers are already being synced), but still keep a reference to the buffer head. journal_try_to_free_buffers() could be called. In that case try_to_free_buffers() will be called since there is no jh related to this buffer, and failed due to journal_submit_data_buffers() hasn't finish the cleanup business yet. For this new race, we could just grab the j_list_lock when re-try try_to_free_buffers() to force waiting for journal_commit_transaction() to finish it flush work. But not sure if this is acceptable approach? Patch like this? Comments? Mingming --------------------------------------------------------------------- There are a few cases direct IO could race with kjournal flushing data buffers which could result direct IO return EIO error. 1) journal_submit_data_buffers() tries to get bh_state lock. If try lock fails, it drops the j_list_lock and sleeps for bh_state lock, while holding a reference on the buffer. In the meanwhile, journal_try_to_free_buffers() can clean up the journal head could call try_to_free_buffers(). try_to_free_buffers() would fail due to the reference held by journal_submit_data_buffers() - which in turn causes failues for DIO (invalidate_inode_pages2()). 2) When the buffer is on t_locked_list waiting for IO to finish, we hold a reference and give up the cpu, if we can't get bh_state lock. This causes try_to_free_buffers() to fail. 3) when journal_submit_data_buffers() saw the buffer is dirty but failed to lock the buffer bh1, journal_submit_data_buffers() released the j_list_lock and submit other buffers collected from previo...
Hello,
Hmm, you are right. It is not enough to just wait in ext3_launder_page()
because we don't have a transaction for direct_IO started yet. But if we
Not to me, really. The scheme for buffer references you are trying to
impose is awkward to say the least. First, it is completely
counter-intuitive (at least to me ;), second, it is impractical as well.
For example in your scheme, you have no sensible way of locking ordered
data mode buffer - you cannot just do: get the reference and do
lock_buffer() because that violates your requirements. The only reasonable
way you could do that is to lock the page to make sure buffer won't go away
from you - but you cannot currently do that in journal commit code because
of lock ordering. So the only way I can see which is left is: get some jbd
spin lock to serialize with journal_try_to_free_buffers(), get the buffer
reference, try to lock buffer, if it fails, drop everything and restart.
And this is IMO no-go...
And BTW even if you fix such races, I think you'll still have races like:
CPU1: CPU2:
filemap_write_and_wait()
dirty a page
msync() (dirties buffers)
invalidate_inode_page2_range() -> -EIO
The code could historically always return EIO when mixing buffered and
unbuffered accesses and the question is, under which circumstances is this
acceptable? I agree that the current state when if you do "buffered write,
DIO write" in sequence and you'll possibly get EIO is bad and we should fix
it. But I'm not sure we should fix the EIO return under all possible
^^^ Here you can see what I wrote above. Basically you just busy-loop
wait for buffer lock. You should at least put schedule() there so that you
^^^ And here you add a place where we are not guaranteed to make any
progress... If someone intensively spins on that buffer, commit code could
Honza
--
Jan Kara <jack@suse.cz>
SUSE ...Does this match what you are thinking? It certainly slow down the DIO
path, but the positive side is it doesn't disturb the other code path...
thanks for your feedback!
--------------------------------------------
An unexpected EIO error gets returned when writing to a file
using buffered writes and DIO writes at the same time.
We found there are a number of places where journal_try_to_free_buffers()
could race with journal_commit_transaction(), the later still
helds the reference to the buffers on the t_syncdata_list or t_locked_list
, while journal_try_to_free_buffers() tries to free them, which resulting an EIO
error returns back to the dio caller.
The logic fix is to retry freeing if journal_try_to_free_buffers() to failed
to free those data buffers while journal_commit_transaction() is still
reference those buffers.
This is done via implement ext3 launder_page() callback, instead of inside
journal_try_to_free_buffers() itself, so that it doesn't affecting other code
path calling journal_try_to_free_buffers and only dio path get affected.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Index: linux-2.6.26-rc1/fs/ext3/inode.c
===================================================================
--- linux-2.6.26-rc1.orig/fs/ext3/inode.c 2008-05-03 11:59:44.000000000 -0700
+++ linux-2.6.26-rc1/fs/ext3/inode.c 2008-05-12 12:41:27.000000000 -0700
@@ -1766,6 +1766,23 @@ static int ext3_journalled_set_page_dirt
return __set_page_dirty_nobuffers(page);
}
+static int ext3_launder_page(struct page *page)
+{
+ int ret;
+ int retry = 5;
+
+ while (retry --) {
+ ret = ext3_releasepage(page, GFP_KERNEL);
+ if (ret == 1)
+ break;
+ else
+ schedule();
+ }
+
+ return ret;
+}
+
+
static const struct address_space_operations ext3_ordered_aops = {
.readpage = ext3_readpage,
.readpages = ext3_readpages,
@@ -1778,6 +1795,7 @@ static const struct address_space_operat
.releasepage = ext3_releasepage,
.direct_IO = ext3_direct_IO,
.migr...Yes, I meant something like this. We could be more clever and do:
head = bh = page_buffers(page);
do {
wait_on_buffer(bh);
bh = bh->b_this_page;
} while (bh != head);
/*
* Now commit code should have been able to proceed and release
* those buffers
*/
schedule();
or we could do simple:
log_wait_commit(...);
That would impose larger perf. penalty but on the other hand you shouldn't
hit this path too often. But maybe the code above would be fine and would
handle most cases. Also please add a big comment to that function to explain
Actually, we need .launder_page callback only in data=order mode.
data=writeback mode doesn't need it at all (journal code doesn't touch data
buffers there) and for data=journal mode DIO could have never worked
reasonably when mixed with buffered IO and it would have to do a different
and much more expensive trickery (like flushing the journal, or at least
forcing current transaction to commit).
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--Bummer, we can't free buffers in ext3_launder_page() before calling try_to_free_page, as later invalidate_complete_page2()->try_to_free_page() expecting the page buffers are still here, and will return EIO if it launder_page() has already freed those buffers.:( Doing wait_on_buffer() alone in launder_page() is not enough as it Guess this is the last it worth give a try to see how bad it is, here is a draft patch. Do you see any obvious problems? Mingming Signed-off-by: Mingming Cao <cmm@us.ibm.com> fs/ext3/inode.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 68 insertions(+) Index: linux-2.6.26-rc1/fs/ext3/inode.c =================================================================== --- linux-2.6.26-rc1.orig/fs/ext3/inode.c 2008-05-13 13:35:27.000000000 -0700 +++ linux-2.6.26-rc1/fs/ext3/inode.c 2008-05-13 14:33:55.000000000 -0700 @@ -1766,6 +1766,53 @@ static int ext3_journalled_set_page_dirt return __set_page_dirty_nobuffers(page); } +/* + * There are a number of places where journal_try_to_free_buffers() + * could race with journal_commit_transaction(), the later still + * helds the reference to the buffers on the t_syncdata_list or t_locked_list + * while journal_try_to_free_buffers() tries to free them, + * which resulting an EIO error returns back to the generic_file_direct_IO() + * + * It is also possible that mapped IO could re-dirty the page and buffers + * after direct IO has waited dirty pages have been flush to disk. + * When the journal_try_to_free_buffers() is called from the direct IO + * path, it may failed to free the buffers as the buffer may be locked again + * + * To fix this problem, we add a ext3_launder_page() callback which + * is only called on direct IO path, to wait for the buffer unlocked + * and waiting for the jbd finish commit those data buffers. + * + * This will impact the direct IO perf, but this alows direct IO and + * buffered IO could be performed concurrently. + * + * returns...
Are you sure? Because if bufferes are released in ext3_launder_page(), PagePrivate() has been set to 0 and we should directly fall through to releasing the page without ever calling try_to_release_page()... So I'd want to find out why PagePrivate is still set in Yes, this would not be enough. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
You are right. PagePrivate() is being set to 0 in drop_buffers(). The problem is do_launder_page() returns successfully if the page is not dirty (our case), so ext3_launder_page() is not even get called. This also explains why the log_wait_commit() approach doesn't work for me:( Have to think other ways...could we pass some flag to journal_try_to_free_buffers(), and ask journal_try_to_free_buffers() wait for jbd commit to finish flushing the data, if the request is from directo IO? Mingming --
I didn't realize PageDirty() is going to be already cleared by previous Well, we could do that but we'd have to change try_to_release_page() call to accept an extra argument which would consequently mean changing all the filesystems... But yes, it probably makes sence because it is really different whether we should just release the page because of memory pressure or because direct IO needs to write to that area of the file. So adding the parameter to releasepage() callback is probably a reasonable thing to do. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
This patch fixed a few races between direct IO and kjournld commit transaction. An unexpected EIO error gets returned to direct IO caller when it failed to free those data buffers. This could be reproduced easily with parallel direct write and buffered write to the same file More specific, those races could cause journal_try_to_free_buffers() fail to free the data buffers, when jbd is committing the transaction that has those data buffers on its t_syncdata_list or t_locked_list. journal_commit_transaction() still holds the reference to those buffers before data reach to disk and buffers are removed from the t_syncdata_list of t_locked_list. This prevent the concurrent journal_try_to_free_buffers() to free those buffers at the same time, and cause EIO error returns back to direct IO. With this patch, in case of direct IO and when try_to_free_buffers() failed, let's waiting for journal_commit_transaction() to finish flushing the current committing transaction's data buffers to disk, then try to free those buffers again. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> --- fs/jbd/commit.c | 1 + fs/jbd/journal.c | 1 + fs/jbd/transaction.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ include/linux/jbd.h | 3 +++ 4 files changed, 51 insertions(+) Index: linux-2.6.26-rc1/include/linux/jbd.h =================================================================== --- linux-2.6.26-rc1.orig/include/linux/jbd.h 2008-05-14 16:36:41.000000000 -0700 +++ linux-2.6.26-rc1/include/linux/jbd.h 2008-05-15 14:12:10.000000000 -0700 @@ -667,6 +667,9 @@ struct journal_s */ wait_queue_head_t j_wait_transaction_locked; + /* Wait queu for waiting for data buffers to flushed to disk*/ + wait_queue_head_t j_wait_data_flushed; + /* Wait queue for waiting for checkpointing to complete */ wait_queue_head_t j_wait_logspace; Index: linux-2.6.26-rc1/fs/jbd/commit.c ===============================...
Small nits.. Fix typo - "queue" and also this line is wrapping. Could you spilt it fix grammar - "wait for the current transaction to finish syncing data --
JBD2: fix DIO error caused by race with DIO free_buffers and jbd2 commit transaction From: Mingming Cao <cmm@us.ibm.com> This patch fixed a few races between direct IO and kjournlad commit transaction. An unexpected EIO error gets returned to direct IO caller when it failed to free those data buffers. This could be reproduced easily with parallel direct write and buffered write to the same file More specific, those races could cause jbd2_journal_try_to_free_buffers() fail to free the data buffers, when jbd is committing the transaction that has those data buffers on its t_syncdata_list or t_locked_list. jbd2_journal_commit_transaction() still holds the reference to those buffers before data reach to disk and buffers are removed from the t_syncdata_list of t_locked_list. This prevent the concurrent jbd2_journal_try_to_free_buffers() to free those buffers at the same time, but cause EIO error returns back to direct IO. With this patch, in case of direct IO and when try_to_free_buffers() failed, let's waiting for jbd2_journal_commit_transaction() to finish flushing the current committing transaction's data buffers to disk, then try to free those buffers again. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> --- fs/jbd2/transaction.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 55 insertions(+), 2 deletions(-) Index: linux-2.6.26-rc2/fs/jbd2/transaction.c =================================================================== --- linux-2.6.26-rc2.orig/fs/jbd2/transaction.c 2008-05-16 11:16:56.000000000 -0700 +++ linux-2.6.26-rc2/fs/jbd2/transaction.c 2008-05-16 13:52:04.000000000 -0700 @@ -1656,12 +1656,39 @@ out: return; } +/* + * jbd2_journal_try_to_free_buffers() could race with jbd2_journal_commit_transaction() + * The later might still hold the reference count to the buffers when inspecting + * them on t_syncdata_list or t_locked_list. + * + * jbd2_journal_try_t...
This patch fixed a few races between direct IO and kjournald commit transaction. An unexpected EIO error gets returned to direct IO caller when it failed to free those data buffers. This could be reproduced easily with parallel direct write and buffered write to the same file More specific, those races could cause journal_try_to_free_buffers() fail to free the data buffers, when jbd is committing the transaction that has those data buffers on its t_syncdata_list or t_locked_list. journal_commit_transaction() still holds the reference to those buffers before data reach to disk and buffers are removed from the t_syncdata_list of t_locked_list. This prevent the concurrent journal_try_to_free_buffers() to free those buffers at the same time, but cause EIO error returns back to direct IO. With this patch, in case of direct IO and when try_to_free_buffers() failed, let's waiting for journal_commit_transaction() to finish flushing the current committing transaction's data buffers to disk, then try to free those buffers again. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> --- fs/jbd/transaction.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-- mm/truncate.c | 3 +- 2 files changed, 57 insertions(+), 3 deletions(-) Index: linux-2.6.26-rc2/fs/jbd/transaction.c =================================================================== --- linux-2.6.26-rc2.orig/fs/jbd/transaction.c 2008-05-16 11:51:02.000000000 -0700 +++ linux-2.6.26-rc2/fs/jbd/transaction.c 2008-05-16 13:43:02.000000000 -0700 @@ -1648,12 +1648,39 @@ out: return; } +/* + * journal_try_to_free_buffers() could race with journal_commit_transaction() + * The later might still hold the reference count to the buffers when inspecting + * them on t_syncdata_list or t_locked_list. + * + * Journal_try_to_free_buffers() will call this function to + * wait for the current transaction to finish syncing data buffers, before + * try to free...
If Andrew or Christoph wouldn't beat you for "inventive use" of gfp_mask, I'm fine with the patch as well ;). You can add Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs --
This is less intrusive way to fix this problem. The gfp_mask was marked as unused in try_to_free_page(). I looked at filesystems in the kernel, there is only a few defined releasepage() callback, and only xfs checks the flag(but not used). btrfs is actually using it though. I thought about the way you have suggested, i.e.clean up this gfp_mask and and replace with a flag. I am not entirely sure if it we need to change the address_space_operations and fix all the filesystems for this matter. Andrew, what do you think? Is this approach acceptable? Thanks and regards, --
On Mon, 19 May 2008 12:59:18 -0700 <wakes up> Please ensure that the final patch is sufficiently well changelogged to permit me to remain asleep ;) The ->releasepage semantics are fairly ad-hoc and have grown over time. It'd be nice to prevent them from becoming vaguer than they are. It has been (approximately?) the case that code paths which really care about having the page released will set __GFP_WAIT (via GFP_KERNEL) whereas code paths which are happy with best-effort will clear __GFP_WAIT (with a "0'). And that's reasonsable - __GFP_WAIT here means "be synchronous" whereas !__GFP_WAIT means "be non-blocking". Is that old convention not sufficient here as well? Two problem areas I see are mm/vmscan.c and fs/splice.c (there may be others). In mm/vmscan.c we probably don't want your new synchronous behaviour and it might well be deadlockable anyway. No probs, that's what __GFP_FS is for. In fs/splice.c, reading the comment there I have a feeling that you've found another bug, and that splice _does_ want your new synchronous behaviour? --
Yes, it looks like page_cache_pipe_buf_steal() expects page is free before removeing it by passing the GFP_KERNEL flag, but currently ext3 could fails to releasepage when it called. In fact try_to_release_page() return value is ignored in page_cache_pipe_buf_steal(), should probably checked the failure case. The other caller of try_to_release_page() in mm/splice.c is fallback_migrate_page(), which does want the synchronous behaviour to make sure buffers are dropped. I will reuse the GFP_WAIT and GFP_FS flag in the updated patch. Thanks for your feedback. --
JBD2: fix race between jbd2_journal_try_to_free_buffers() and jbd2 commit transaction From: Mingming Cao <cmm@us.ibm.com> journal_try_to_free_buffers() could race with jbd commit transaction when the later is holding the buffer reference while waiting for the data buffer to flush to disk. If the caller of journal_try_to_free_buffers() request tries hard to release the buffers, it will treat the failure as error and return back to the caller. We have seen the directo IO failed due to this race. Some of the caller of releasepage() also expecting the buffer to be dropped when passed with GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers(). With this patch, if the caller is passing the GFP_KERNEL to indicating this call could wait, in case of try_to_free_buffers() failed, let's waiting for journal_commit_transaction() to finish commit the current committing transaction , then try to free those buffers again with journal locked. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> --- fs/jbd2/transaction.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 54 insertions(+), 2 deletions(-) Index: linux-2.6.26-rc2/fs/jbd2/transaction.c =================================================================== --- linux-2.6.26-rc2.orig/fs/jbd2/transaction.c 2008-05-11 17:09:41.000000000 -0700 +++ linux-2.6.26-rc2/fs/jbd2/transaction.c 2008-05-19 17:20:01.000000000 -0700 @@ -1656,12 +1656,39 @@ out: return; } +/* + * jbd2_journal_try_to_free_buffers() could race with jbd2_journal_commit_transaction() + * The later might still hold the reference count to the buffers when inspecting + * them on t_syncdata_list or t_locked_list. + * + * jbd2_journal_try_to_free_buffers() will call this function to + * wait for the current transaction to finish syncing data buffers, before + * try to free that buffer. + * + * Called with journal->j_state_lock hold. + */ +static void jbd2_jo...
Updated patch to use existing GFP_KERNEL mask to indicating synchornous wait in journal_try_to_free_buffers() to resolve the race with journal_commit_transaction(). JBD: fix race between journal_try_to_free_buffers() and jbd commit transaction From: Mingming Cao <cmm@us.ibm.com> journal_try_to_free_buffers() could race with jbd commit transaction when the later is holding the buffer reference while waiting for the data buffer to flush to disk. If the caller of journal_try_to_free_buffers() request tries hard to release the buffers, it will treat the failure as error and return back to the caller. We have seen the directo IO failed due to this race. Some of the caller of releasepage() also expecting the buffer to be dropped when passed with GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers(). With this patch, if the caller is passing the GFP_KERNEL to indicating this call could wait, in case of try_to_free_buffers() failed, let's waiting for journal_commit_transaction() to finish commit the current committing transaction , then try to free those buffers again with journal locked. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> --- fs/jbd/transaction.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++-- mm/filemap.c | 3 -- 2 files changed, 54 insertions(+), 4 deletions(-) Index: linux-2.6.26-rc2/fs/jbd/transaction.c =================================================================== --- linux-2.6.26-rc2.orig/fs/jbd/transaction.c 2008-05-11 17:09:41.000000000 -0700 +++ linux-2.6.26-rc2/fs/jbd/transaction.c 2008-05-19 16:16:41.000000000 -0700 @@ -1648,12 +1648,39 @@ out: return; } +/* + * journal_try_to_free_buffers() could race with journal_commit_transaction() + * The later might still hold the reference count to the buffers when inspecting + * them on t_syncdata_list or t_locked_list. + * + * Journal_try_to_free_buffers() will call this function to + * wai...
journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data buffer
to flush to disk. If the caller of journal_try_to_free_buffers() request
tries hard to release the buffers, it will treat the failure as error and return
back to the caller. We have seen the directo IO failed due to this race.
Some of the caller of releasepage() also expecting the buffer to be dropped
when passed with GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers().
With this patch, if the caller is passing the GFP_KERNEL to indicating this
call could wait, in case of try_to_free_buffers() failed, let's waiting for
journal_commit_transaction() to finish commit the current committing transaction
, then try to free those buffers again with journal locked.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com>
---
fs/jbd2/transaction.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 55 insertions(+), 2 deletions(-)
Index: linux-2.6.26-rc3/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.26-rc3.orig/fs/jbd2/transaction.c 2008-05-21 16:17:51.000000000 -0700
+++ linux-2.6.26-rc3/fs/jbd2/transaction.c 2008-05-21 16:22:03.000000000 -0700
@@ -1656,12 +1656,40 @@ out:
return;
}
+/*
+ * jbd2_journal_try_to_free_buffers() could race with jbd2_journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * jbd2_journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * try to free that buffer.
+ *
+ * Called with journal->j_state_lock hold.
+ */
+static void jbd2_journal_wait_for_transaction_sync_data(journal_t *journal)
+{
+ transaction_t *transaction = NULL;
+ tid_t tid;
+
+ transaction = j...Changes since take 2: - fix a bug pointed by Jan, and updated the comments journal_try_to_free_buffers() could race with jbd commit transaction when the later is holding the buffer reference while waiting for the data buffer to flush to disk. If the caller of journal_try_to_free_buffers() request tries hard to release the buffers, it will treat the failure as error and return back to the caller. We have seen the directo IO failed due to this race. Some of the caller of releasepage() also expecting the buffer to be dropped when passed with GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers(). With this patch, if the caller is passing the GFP_KERNEL to indicating this call could wait, in case of try_to_free_buffers() failed, let's waiting for journal_commit_transaction() to finish commit the current committing transaction , then try to free those buffers again with journal locked. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> --- fs/jbd/transaction.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-- mm/filemap.c | 3 -- 2 files changed, 56 insertions(+), 4 deletions(-) Index: linux-2.6.26-rc3/fs/jbd/transaction.c =================================================================== --- linux-2.6.26-rc3.orig/fs/jbd/transaction.c 2008-05-21 16:17:51.000000000 -0700 +++ linux-2.6.26-rc3/fs/jbd/transaction.c 2008-05-21 16:20:11.000000000 -0700 @@ -1648,12 +1648,40 @@ out: return; } +/* + * journal_try_to_free_buffers() could race with journal_commit_transaction() + * The later might still hold the reference count to the buffers when inspecting + * them on t_syncdata_list or t_locked_list. + * + * Journal_try_to_free_buffers() will call this function to + * wait for the current transaction to finish syncing data buffers, before + * try to free that buffer. + * + * Called with journal->j_state_lock hold. + */ +static void journal_wait_for_transaction_sync_data(journ...
"latter"
Unneeded initialisation. Could just do
"callers"
Sorry about all the spelling flames ;) I'd normally just fix them
myself rather than typing them all into an email and having you type
them in again, etc. But I think the patch needs to be respun anyway.
The mask-and-compare with GFP_KERNEL does appear to be correct, but it
is quite unusual. Generally in a situation like this we will test for
the specific __GFP_foo flags which we're interested in. For
documentation reasons if nothing else.
So the preferred form here would be
if (ret == 0 &&
(gfp_mask & (__GFP_WAIT|__GFP_FS)) == (__GFP_WAIT|__GFP_FS)) {
which really tells the reader what we're trying to do here. And I
don't think this code cares about __GFP_IO, even though it would be
mighty peculirr (probably buggy) for someone to do
Did we actually need to hold j_state_lock across the
try_to_free_buffers() call here? Because it'll increase hold times and
will introduce a lock-ranking dependency which we might not otherwise
--What is actually the point of entering the function with j_state_lock held and also keeping it after return? It should be enough to take it This comment seems a bit misleading to me - I'd rather write there: @gfp_mask: we use the mask to detect how hard should we try to release buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to I think this test is wrong - it should rather be something like (ret == 0 && (gfp_mask & GFP_KERNEL == GFP_KERNEL)) - or even expand the test to gfp_mask & __GFP_WAIT && gfp_mask & __GFP_FS && gfp_mask & Probably __GFP_WAIT | __GFP_IO here... But I'm not sure why do we Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs --
I was worried about the case when we call try_to_free_buffers() again, it races with the current transaction commit again. Is it possible? I guess the question is whether it is possible to have buffers on the same page attached to different transaction. If so, I think we need to keep the journal state lock while retry try_to_free_buffers(), so that the For try_to_release_page(),we should wait only when (__GFP_WAIT & Thanks, patch v3 to follow. Mingming --
Well, but by the time log_wait_commit() finishes, it may well happen that a new transaction is already started so your lock doesn't help you much. And the page you are called on is actually locked, so noone can really mess with it until you unlock it... So I think you can just use the lock for obtaining tid and then drop it. Honza PS: For JBD2 you'd need to be a bit more careful because you cannot call log_wait_commit() while holding page lock (we have reversed locking order for ext4) - but ordered-mode rewrite patch actually fixes this problem and I'm going to submit the splitted patches on Monday or Tuesday (I only need to test them that I didn't do something stupid while porting them to ext4)... -- Jan Kara <jack@suse.cz> SuSE CR Labs --
You are right that the page was locked during the process we are trying Thanks for pointing this out. I think when we put back the reversed locking order and new ordered mode the jbd2 patch could go away... Updated patch for JBD (take 4) below. Mingming JBD: fix race between journal_try_to_free_buffers() and jbd commit transaction From: Mingming Cao <cmm@us.ibm.com> journal_try_to_free_buffers() could race with jbd commit transaction when the later is holding the buffer reference while waiting for the data buffer to flush to disk. If the caller of journal_try_to_free_buffers() request tries hard to release the buffers, it will treat the failure as error and return back to the caller. We have seen the directo IO failed due to this race. Some of the caller of releasepage() also expecting the buffer to be dropped when passed with GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers(). With this patch, if the caller is passing the GFP_KERNEL to indicating this call could wait, in case of try_to_free_buffers() failed, let's waiting for journal_commit_transaction() to finish commit the current committing transaction , then try to free those buffers again with journal locked. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> --- fs/jbd/transaction.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-- mm/filemap.c | 3 -- 2 files changed, 56 insertions(+), 4 deletions(-) Index: linux-2.6.26-rc3/fs/jbd/transaction.c =================================================================== --- linux-2.6.26-rc3.orig/fs/jbd/transaction.c 2008-05-28 10:55:37.000000000 -0700 +++ linux-2.6.26-rc3/fs/jbd/transaction.c 2008-05-28 10:57:32.000000000 -0700 @@ -1648,12 +1648,42 @@ out: return; } +/* + * journal_try_to_free_buffers() could race with journal_commit_transaction() + * The later might still hold the reference count to the buffers when inspecting + * them on t_syncdata_lis...
I think Andrew prefered this test to be expanded but otherwise the patch is fine now. You can add: Acked-by: Jan Kara <jack@suse.cz> Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Updated patch after Jan acked. journal_try_to_free_buffers() could race with jbd commit transaction when the later is holding the buffer reference while waiting for the data buffer to flush to disk. If the caller of journal_try_to_free_buffers() request tries hard to release the buffers, it will treat the failure as error and return back to the caller. We have seen the directo IO failed due to this race. Some of the caller of releasepage() also expecting the buffer to be dropped when passed with GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers(). With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to indicating this call could wait, in case of try_to_free_buffers() failed, let's waiting for journal_commit_transaction() to finish commit the current committing transaction, then try to free those buffers again. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Jan Kara <jack@suse.cz> --- fs/jbd/transaction.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-- mm/filemap.c | 3 -- 2 files changed, 56 insertions(+), 4 deletions(-) Index: linux-2.6.26-rc3/fs/jbd/transaction.c =================================================================== --- linux-2.6.26-rc3.orig/fs/jbd/transaction.c 2008-05-28 14:16:41.000000000 -0700 +++ linux-2.6.26-rc3/fs/jbd/transaction.c 2008-05-28 16:12:15.000000000 -0700 @@ -1648,12 +1648,42 @@ out: return; } +/* + * journal_try_to_free_buffers() could race with journal_commit_transaction() + * The later might still hold the reference count to the buffers when inspecting + * them on t_syncdata_list or t_locked_list. + * + * Journal_try_to_free_buffers() will call this function to + * wait for the current transaction to finish syncing data buffers, before + * try to free that buffer. + * + * Called with journal->j_state_lock hold. + */ +static void journal_wait_for_transaction_sync_data(journal_t *journ...
journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data buffer
to flush to disk. If the caller of journal_try_to_free_buffers() request
tries hard to release the buffers, it will treat the failure as error and return
back to the caller. We have seen the directo IO failed due to this race.
Some of the caller of releasepage() also expecting the buffer to be dropped
when passed with GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers().
With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS
to indicating this call could wait, in case of try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish commit the
current committing transaction, then try to free those buffers again.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
---
fs/jbd2/transaction.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 57 insertions(+), 2 deletions(-)
Index: linux-2.6.26-rc3/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.26-rc3.orig/fs/jbd2/transaction.c 2008-05-28 16:10:41.000000000 -0700
+++ linux-2.6.26-rc3/fs/jbd2/transaction.c 2008-05-28 16:13:16.000000000 -0700
@@ -1656,12 +1656,42 @@ out:
return;
}
+/*
+ * jbd2_journal_try_to_free_buffers() could race with jbd2_journal_commit_transaction()
+ * The later might still hold the reference count to the buffers when inspecting
+ * them on t_syncdata_list or t_locked_list.
+ *
+ * jbd2_journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * try to free that buffer.
+ *
+ * Called with journal->j_state_lock hold.
+ */
+static void jbd2_journal_wait_for_transaction_sync_data(journal_t *journal)
+{
+ transaction_t *transaction = NULL;
...-aneesh --
Thanks. I noticed this yesterday and have sent Andrew updated patch to
replace the one he just added to mm tree but forget to copy to the list. Here is the updated patch,
Mingming
---
fs/jbd2/transaction.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 58 insertions(+), 3 deletions(-)
Index: linux-2.6.26-rc4/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd2/transaction.c 2008-05-29 12:21:40.000000000 -0700
+++ linux-2.6.26-rc4/fs/jbd2/transaction.c 2008-05-29 12:38:30.000000000 -0700
@@ -1656,12 +1656,43 @@ out:
return;
}
+/*
+ * jbd2_journal_try_to_free_buffers() could race with
+ * jbd2_journal_commit_transaction(). The later might still hold the
+ * reference count to the buffers when inspecting them on
+ * t_syncdata_list or t_locked_list.
+ *
+ * jbd2_journal_try_to_free_buffers() will call this function to
+ * wait for the current transaction to finish syncing data buffers, before
+ * try to free that buffer.
+ *
+ */
+static void jbd2_journal_wait_for_transaction_sync_data(journal_t *journal)
+{
+ transaction_t *transaction = NULL;
+ tid_t tid;
+
+ spin_lock(&journal->j_state_lock);
+ transaction = journal->j_committing_transaction;
+
+ if (!transaction) {
+ spin_unlock(&journal->j_state_lock);
+ return;
+ }
+
+ tid = transaction->t_tid;
+ spin_unlock(&journal->j_state_lock);
+ jbd2_log_wait_commit(journal, tid);
+}
/**
* int jbd2_journal_try_to_free_buffers() - try to free page buffers.
* @journal: journal for operation
* @page: to try and free
- * @unused_gfp_mask: unused
+ * @gfp_mask: we use the mask to detect how hard should we try to release
+ * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
+ * release the buffers.
*
*
* For all the buffers on this page,
@@ -1690,9 +1721,11 @@ out:
* journal_try_to_free_buffer() is changing its state. But that
* cannot happ...Okay, I will update the patch and cleanup the history, sent it to Andrew. Thanks. Mingming --
So something like this, then? diff --git a/fs/splice.c b/fs/splice.c index 7815003..e08a2f5 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -58,8 +58,8 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe, */ wait_on_page_writeback(page); - if (PagePrivate(page)) - try_to_release_page(page, GFP_KERNEL); + if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL)) + goto out_unlock; /* * If we succeeded in removing the mapping, set LRU flag @@ -75,6 +75,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe, * Raced with truncate or failed to remove page from current * address space, unlock and return failure. */ +out_unlock: unlock_page(page); return 1; } -- Jens Axboe --
--
Got a couple of whitespace problems above it looks like. Thanks, Josef --
Thanks for catching this, below is updated patch, fixed the whitespace and comments. --------------------------------------------------- JBD: fix journal_try_to_free_buffers race with journal_commit_transaction From: Mingming Cao <cmm@us.ibm.com> This patch fixed a few races between direct IO and kjournld commit transaction. An unexpected EIO error gets returned to direct IO caller when it failed to free those data buffers. This could be reproduced easily with parallel direct write and buffered write to the same file More specificly, those races could cause journal_try_to_free_buffers() fail to free the data buffers, when jbd is committing the transaction that has those data buffers on its t_syncdata_list or t_locked_list. journal_commit_transaction() still holds the reference to those buffers before data reach to disk and buffers are removed from the t_syncdata_list of t_locked_list. This prevent the concurrent journal_try_to_free_buffers() to free those buffers at the same time, but cause EIO error returns back to direct IO. With this patch, in case of direct IO and when try_to_free_buffers() failed, let's waiting for journal_commit_transaction() to finish flushing the current committing transaction's data buffers to disk, then try to free those buffers again. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> --- fs/jbd/commit.c | 1 + fs/jbd/journal.c | 1 + fs/jbd/transaction.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ include/linux/jbd.h | 3 +++ 4 files changed, 51 insertions(+) Index: linux-2.6.26-rc1/include/linux/jbd.h =================================================================== --- linux-2.6.26-rc1.orig/include/linux/jbd.h 2008-05-14 16:36:41.000000000 -0700 +++ linux-2.6.26-rc1/include/linux/jbd.h 2008-05-15 14:12:10.000000000 -0700 @@ -667,6 +667,9 @@ struct journal_s */ wait_queue_head_t j_wait_transaction_locked; + /* Wa...
drop "dio" variable and compare here, like Thanks, Badari --
Okay, will do. Also will update the patch with other format changes you Also, This patch changed the struct journal_s to add a new wait queue, so that journal_try_to_free_buffers() could only need to wait for data to be commited, rather than using j_wait_done_commit queue and wait for the whole transaction being committed. It might break other fs that uses journal_s, thus not worth it. Will update the patch. --
Actually there is an argument gfp_mask passed to try_to_release_page() which we could pass a special flag from direct IO that could be parsed as direct IO request. This would avoid changing all the filesystems and the address space operation interface. In fact, I don't see in-kernel tree fs releasepage() cal back functions is using this gfp_mask, but Will send a patch shortly, with that patch the test fine for about 18 hours. --
Thanks. We could recheck if buffer_busy() before calling wait_on_buffer(bh) to wait for buffer unlocked. This will handles the mapped IO re-dirty race case, but still need the schedule() and retry to handle the buffered IO My concern with doing log_wait_commit() here is the perf penalty. In the case the buffers is at the end of the queue to commit, we have to wait for all other previous transactions to finish committing before we could You are right, thanks for pointing this out. Will post an updated patch. Mingming --
Do you mean calling journal_try_to_free_buffers() inside ext3_launder_page()? I think we still need some lock to serialize launder_page() with kjournald commit code(not sure if is that Okay?), otherwise there is always a window that by the time try_to_free_buffers() get called, the current transaction has be Sigh...I am not very happy with the solution either, but I could not see a decent solution that could fix this problem. Currently we constantly hit EIO error only in 10 minutes with the simple parallel buffered IO I could see this is possible with mapped IO. But for buffered IO, since Yup. The conflict is that if we still held the bh refrence after released the j_list_lock, journal_try_to_free_buffers() could came in returns EIO to direct IO since buffer is busy(); but if we release the bh reference after released the j_list_lock, it made possible that journal_try_to_free_buffers() to free that buffer, so we can't do lock_buffer() here anymore so we have to loop here. This is a trade off ... On the other hand, the journal_submit_data_bufferes() continue process this buffer after regrab the j_list_lock even if it's has been removed from the t_syncdata_list by __journal_try_to_free_buffers(). IMO this is --
Once we succeed with removing journal heads from data buffers in journal_try_to_free_buffers(), they can be added only if someone dirties the page via mmap and writepage is called (which I'd currently neglect as too rare and not common usage). So after that moment, once commit code Well, you are going to hit this path for example when the buffer is under IO just now. So it happens quite often e.g. when the machine is under memory pressure and writes-out dirty data more aggressively. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Yes. I have been discussing all of these with Mingming. I agree with you that it looks silly to update all the places to not hold a ref on the buffer while waiting for spinlocks. Adding a launder_page() seems to be the right approach. I wouldn't worry about making O_DIRECT slower. Currently, its failing with -EIO in this case anyway. Unfortunately, this happens at a customer site and I can't effort to give them a complete re-write type patch - I need to get this into older distro release :( Thats why I am trying to fix the cases. I am finding more and more of these .. Thanks, Badari --
Actually there is one more place that journal_try_to_free_buffers
(calling from DIO path) could race with journal_submit_data_buffers(),
the DIO and buffered IO test still returns EIO with the fix which should
fixed the first 3 race cases.
I could not figure out how this could happen with Badari's fix to
inverted_lock().
This time the debug shows that the one who release put_bh() after the
journal_try_to_free_buffers() failed is from this code path:
journal_submit_data_buffers() {
...
} else if (!locked && buffer_locked(bh)) {
__journal_file_buffer(jh, commit_transaction,
BJ_Locked);
jbd_unlock_bh_state(bh);
put_bh(bh);
}
But when we get here we should already making sure the bh is on the
It seems to me it's safe in this case. When
journal_try_to_free_buffers() called from DIO path,
filemap_fdatawrite_and_wait() already making sure that the IO submitted
--