Re: [RFC] ext3: per-process soft-syncing data=ordered mode

Previous thread: [patch 17/26] mount options: fix hugetlbfs by Miklos Szeredi on Thursday, January 24, 2008 - 12:33 pm. (1 message)

Next thread: [RFC] Add vfsmount to vfs helper functions. by Kentaro Takeda on Friday, January 25, 2008 - 3:20 am. (7 messages)
From: Al Boldi
Date: Thursday, January 24, 2008 - 1:36 pm

Greetings!

data=ordered mode has proven reliable over the years, and it does this by
ordering filedata flushes before metadata flushes.  But this sometimes
causes contention on the order of a 10x slowdown for certain apps, either
due to the misuse of fsync or due to inherent behaviour like db's, as well
as inherent starvation issues exposed by the data=ordered mode.

data=writeback mode alleviates data=ordered mode slowdowns, but only works
per-mount and is too dangerous to run as a default mode.

This RFC proposes to introduce a tunable which allows disabling fsync and
changing ordered into writeback writeout on a per-process basis like this:

      echo 1 > /proc/`pidof process`/softsync


Your comments are much welcome!


Thanks!

--
Al

-

From: Diego Calleja
Date: Thursday, January 24, 2008 - 2:50 pm

There's a related bug in bugzilla: http://bugzilla.kernel.org/show_bug.cgi?id=9546

Jan Kara's diagnosis is different, but I think it may be the same
problem...

"One process does data-intensive load. Thus in the ordered mode the
transaction is tiny but has tons of data buffers attached. If commit
happens, it takes a long time to sync all the data before the commit
can proceed... In the writeback mode, we don't wait for data buffers, in
the journal mode amount of data to be written is really limited by the
maximum size of a transaction and so we write by much smaller chunks
and better latency is thus ensured."


I'm hitting this bug too... it's surprising that not many people are
reporting bugs about this, because it's really annoying.


There's a patch by Jan Kara (which I'm including here because bugzilla didn't
include it and it took me a while to find it). I don't know if it's supposed to
fix the problem, but it'd be interesting to try:




Don't allow too many data buffers in a transaction.

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 08ff6c7..e6f9dd6 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -163,7 +163,7 @@ repeat_locked:
 	spin_lock(&transaction->t_handle_lock);
 	needed = transaction->t_outstanding_credits + nblocks;
 
-	if (needed > journal->j_max_transaction_buffers) {
+	if (needed > journal->j_max_transaction_buffers || atomic_read(&transaction->t_data_buf_count) > 32768) {
 		/*
 		 * If the current transaction is already too large, then start
 		 * to commit it: we can then go back and attach this handle to
@@ -1528,6 +1528,7 @@ static void __journal_temp_unlink_buffer(struct journal_head *jh)
 		return;
 	case BJ_SyncData:
 		list = &transaction->t_sync_datalist;
+		atomic_dec(&transaction->t_data_buf_count);
 		break;
 	case BJ_Metadata:
 		transaction->t_nr_buffers--;
@@ -1989,6 +1990,7 @@ void __journal_file_buffer(struct journal_head *jh,
 		return;
 	case BJ_SyncData:
 ...
From: Al Boldi
Date: Friday, January 25, 2008 - 10:27 pm

Thanks a lot, but it doesn't fix it.

--
Al

-

From: Jan Kara
Date: Monday, January 28, 2008 - 10:34 am

Hmm, if you're willing to test patches, then you could try a debug patch:
http://bugzilla.kernel.org/attachment.cgi?id=14574
  and send me the output. What kind of load do you observe problems with
and which problems exactly?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
-

From: Valdis.Kletnieks
Date: Thursday, January 24, 2008 - 2:58 pm

If they're misusing it, they should be fixed.  There should be a limit to

Well-written programs only call fsync() when they really do need the semantics
of fsync.  Disabling that is just *asking* for trouble.

From rfc2821:

6.1 Reliable Delivery and Replies by Email

   When the receiver-SMTP accepts a piece of mail (by sending a "250 OK"
   message in response to DATA), it is accepting responsibility for
   delivering or relaying the message.  It must take this responsibility
   seriously.  It MUST NOT lose the message for frivolous reasons, such
   as because the host later crashes or because of a predictable
   resource shortage.

Some people really *do* think "the CPU took a machine check and after replacing
the motherboard, the resulting fsck ate the file" is a "frivolous" reason to
lose data.

But if you want to give them enough rope to shoot themselves in the foot with,
I'd suggest abusing LD_PRELOAD to replace the fsync() glibc code instead.  No
need to clutter the kernel with rope that can be (and has been) done in userspace.
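
As a rough illustration of the LD_PRELOAD approach mentioned above, a shim
library could look like the sketch below. The file name nofsync.c is made up
for this example, and the sketch only covers fsync() and fdatasync(); it does
nothing about sync(), msync(), O_SYNC writes, statically linked programs, or
programs that issue the system call directly. A process run with it loses the
durability guarantee that fsync() normally provides.

```c
/* nofsync.c (hypothetical name): LD_PRELOAD shim that turns fsync() and
 * fdatasync() into no-ops for the processes it is preloaded into.
 * Data is NOT guaranteed to be on disk when these calls return. */
#include <unistd.h>

/* Same prototype as the libc function, so the dynamic linker resolves the
 * application's fsync() calls to this definition first. */
int fsync(int fd)
{
    (void)fd;   /* the descriptor is ignored: nothing is flushed */
    return 0;   /* always report success */
}

int fdatasync(int fd)
{
    (void)fd;
    return 0;
}
```

Assuming gcc and a dynamically linked application, it could be built and used
roughly like this (./app stands for whatever program is being tested):

      gcc -shared -fPIC -o nofsync.so nofsync.c
      LD_PRELOAD=./nofsync.so ./app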
From: Al Boldi
Date: Friday, January 25, 2008 - 10:27 pm

Ok that's possible, but as you cannot use LD_PRELOAD to deal with changing 
ordered into writeback mode, we might as well allow them to disable fsync 
here, because it is in the same use-case.


Thanks!

--
Al

-

From: Chris Snook
Date: Thursday, January 24, 2008 - 6:19 pm

This is basically a kernel workaround for stupid app behavior.  It wouldn't be 
the first time we've provided such an option, but we shouldn't do it without a 
very good justification.  At the very least, we need a test case that 
demonstrates the problem and benchmark results that prove that this approach 
actually fixes it.  I suspect we can find a cleaner fix for the problem.

	-- Chris
-

From: Al Boldi
Date: Friday, January 25, 2008 - 10:28 pm

Exactly right to some extent, but don't forget the underlying data=ordered 
starvation problem, which looks like a genuinely deep problem maybe related 


8M-record insert into indexed db-table:
         ordered  writeback
sqlite3:  75m22s    8m45s

I hope so, but even with a fix available addressing the data=ordered 
starvation issue, this tunable could remain useful for those apps that 
misbehave.


Thanks!

--
Al

-

From: Jan Kara
Date: Tuesday, January 29, 2008 - 10:22 am

It is a problem with the way ext3 does fsync (at least that's what
we ended up with in that konqueror problem)... It has to flush the
current transaction, which means that an app doing fsync() has to wait until
all dirty data of all files on the filesystem is written (if we are in
ordered mode). And that takes quite some time... There are ways
to avoid that, but especially with freshly created files it's tough,
and I don't see how to do it without some fundamental changes to
JBD.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
-

From: Al Boldi
Date: Tuesday, January 29, 2008 - 11:04 pm

Ok, but keep in mind that this starvation occurs even in the absence of 
fsync, as the benchmarks show.

And, a quick test of successive 1sec delayed syncs shows no hangs until about 
1 minute (~180mb) of db-writeout activity, when the sync abruptly hangs for 
minutes on end, and io-wait shows almost 100%.

Now it turns out that 'echo 3 > /proc/.../drop_caches' has no effect, but
doing it a few more times makes the hangs go away for a while, only to come
back again and again.


Thanks!

--
Al

-

From: Chris Mason
Date: Wednesday, January 30, 2008 - 7:29 am

Do you see this on older kernels as well?  The first thing we need to 
understand is if this particular stall is new.

-chris
-

From: Al Boldi
Date: Wednesday, January 30, 2008 - 11:39 am

2.6.24, 2.6.22, 2.6.19 and 2.4.32 show the same problem.


Thanks!

--
Al

-

From: Andreas Dilger
Date: Wednesday, January 30, 2008 - 5:32 pm

How large is the journal in this filesystem?  You can check via
"debugfs -R 'stat <8>' /dev/XXX".  Is this affected by increasing
the journal size?  You can set the journal size via "mke2fs -J size=400" 
at format time, or on an unmounted filesystem by running
"tune2fs -O ^has_journal /dev/XXX" then "tune2fs -J size=400 /dev/XXX".

I suspect that the stall is caused by the journal filling up, and then
waiting while the entire journal is checkpointed back to the filesystem
before the next transaction can start.

It is possible to improve this behaviour in JBD by reducing the amount
of space that is cleared if the journal becomes "full", and also doing
journal checkpointing before it becomes full.  While that may reduce
performance a small amount, it would help avoid such huge latency problems.
I believe we have such a patch in one of the Lustre branches already,
and while I'm not sure what kernel it is for, the JBD code rarely changes
much....

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-

From: Al Boldi
Date: Wednesday, January 30, 2008 - 11:20 pm

The big difference between ordered and writeback is that once the slowdown 
starts, ordered goes into ~100% iowait, whereas writeback continues 100% 
user.


Thanks!

--
Al

-

From: Chris Mason
Date: Thursday, January 31, 2008 - 9:56 am

Does data=ordered write buffers in the order they were dirtied?  This might 
explain the extreme problems in transactional workloads.

-chris
-

From: Jan Kara
Date: Thursday, January 31, 2008 - 10:10 am

Well, it does but we submit them to block layer all at once so elevator
should sort the requests for us...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
-

From: Chris Mason
Date: Thursday, January 31, 2008 - 10:14 am

nr_requests is fairly small, so a long stream of random requests should still 
end up being random IO.

Al, could you please compare the write throughput from vmstat for the 
data=ordered vs data=writeback runs?  I would guess the data=ordered one has 
a lower overall write throughput.

-chris
-

From: Al Boldi
Date: Friday, February 1, 2008 - 2:26 pm

That's what I would have guessed, but it's actually going up 4-fold for
mysql, from 559 MB to 2135 MB, while the db-size ends up at 549 MB.

This may mean that data=ordered isn't buffering redundant writes; or worse.


Thanks!

--
Al

-

From: Jan Kara
Date: Monday, February 4, 2008 - 10:54 am

So you say we write 4-times as much data in ordered mode as in writeback
mode. Hmm, probably possible because we force all the dirty data to disk
when committing a transaction in ordered mode (and don't do this in
writeback mode). So if the workload repeatedly dirties the whole DB, we are
going to write the whole DB several times in ordered mode but in writeback
mode we just keep the data in memory all the time. But this is what you
ask for if you mount in ordered mode so I wouldn't consider it a bug.
  I still don't like your hack with per-process journal mode setting but we
could easily do per-file journal mode setting (we already have a flag to do
data journaling for a file) and that would help at least your DB
workload...
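
The repeated-writeout effect described above can be illustrated with a toy
model. This is only a sketch under stated assumptions: the page count, the
round-robin update pattern and the commit interval are invented for the
example, and the "writeback" case is idealised as a single flush at the end
(a real kernel also writes dirty pages back periodically). It is not a model
of JBD, and the numbers are not derived from the measurements in this thread.

```c
#include <stddef.h>
#include <string.h>

#define NPAGES 4   /* toy "database" size in pages (assumed) */

/* Write out every dirty page, clear the dirty flags, return pages written. */
static unsigned long flush_dirty(unsigned char *dirty)
{
    unsigned long written = 0;
    for (size_t i = 0; i < NPAGES; i++) {
        if (dirty[i]) {
            dirty[i] = 0;
            written++;
        }
    }
    return written;
}

/* Count page writes for `steps` single-page updates applied round-robin.
 * If flush_on_commit is non-zero (ordered-like), all dirty pages are
 * written every `commit_interval` updates; otherwise (idealised writeback)
 * they are written only once, after the last update. */
unsigned long count_page_writes(unsigned steps, unsigned commit_interval,
                                int flush_on_commit)
{
    unsigned char dirty[NPAGES];
    unsigned long writes = 0;

    memset(dirty, 0, sizeof dirty);
    for (unsigned s = 1; s <= steps; s++) {
        dirty[(s - 1) % NPAGES] = 1;            /* update one page */
        if (flush_on_commit && commit_interval != 0 &&
            s % commit_interval == 0)
            writes += flush_dirty(dirty);       /* commit forces data out */
    }
    writes += flush_dirty(dirty);               /* final writeout */
    return writes;
}
```

With 16 updates and a commit every 4 updates, the ordered-like case writes
16 pages and the idealised writeback case writes 4, i.e. the whole toy
database is written once per commit instead of once overall.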

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
-

From: Al Boldi
Date: Tuesday, February 5, 2008 - 12:07 am

Ok, maybe not a bug, but a bit inefficient.  Check out this workload:

sync;

while :; do
  dd < /dev/full > /mnt/sda2/x.dmp bs=1M count=20
  rm -f /mnt/sda2/x.dmp
  usleep 10000
done

vmstat 1 ( with mount /dev/sda2 /mnt/sda2 -o data=writeback) << note io-bo >>

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  0      0 293008   5232  57436    0    0     0     0   18   206  4 80 16  0
 1  0      0 282840   5232  67620    0    0     0     0   18   238  3 81 16  0
 1  0      0 297032   5244  53364    0    0     0   152   21   211  4 79 17  0
 1  0      0 285236   5244  65224    0    0     0     0   18   232  4 80 16  0
 1  0      0 299464   5244  50880    0    0     0     0   18   222  4 80 16  0
 1  0      0 290156   5244  60176    0    0     0     0   18   236  3 80 17  0
 0  0      0 302124   5256  47788    0    0     0   152   21   213  4 80 16  0
 1  0      0 292180   5256  58248    0    0     0     0   18   239  3 81 16  0
 1  0      0 287452   5256  62444    0    0     0     0   18   202  3 80 17  0
 1  0      0 293016   5256  57392    0    0     0     0   18   250  4 80 16  0
 0  0      0 302052   5256  47788    0    0     0     0   19   194  3 81 16  0
 1  0      0 297536   5268  52928    0    0     0   152   20   233  4 79 17  0
 1  0      0 286468   5268  63872    0    0     0     0   18   212  3 81 16  0
 1  0      0 301572   5268  48812    0    0     0     0   18   267  4 79 17  0
 1  0      0 292636   5268  57776    0    0     0     0   18   208  4 80 16  0
 1  0      0 302124   5280  47788    0    0     0   152   21   237  4 80 16  0
 1  0      0 291436   5280  58976    0    0     0     0   18   205  3 81 16  0
 1  0      0 302068   5280  47788    0    0     0     0   18   234  3 81 16  0
 1  0      0 293008   5280  57388    0    0     0     0   18   221  4 79 17  0
 1  0      0 297288   5292  52532    0    0     0   156   22   233  2 81 16  1
 1  0  ...
From: Jan Kara
Date: Tuesday, February 5, 2008 - 8:07 am

No, I don't think so. At least when I run it, number of blocks written
out varies which confirms that these 12mb are just data blocks which happen
to be in the file when transaction commits (which is every 5 seconds). And
to satisfy journaling guarantees in ordered mode you must write them so you
really have no choice...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
-

From: Al Boldi
Date: Tuesday, February 5, 2008 - 12:20 pm

Making this RFC rather useful.

What we need now is an implementation, which should be easy.

Maybe something along these lines:

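<< in ext3_ordered_write_end >>
  /* soft_sync is the proposed per-task field; bit 1: writeback write_end */
  if (current->soft_sync & 1)
    return ext3_writeback_write_end(file, mapping, pos, len, copied,
                                    page, fsdata);

<< in ext3_ordered_writepage >>
  /* bit 2: writeback writepage */
  if (current->soft_sync & 2)
    return ext3_writeback_writepage(page, wbc);

<< in ext3_sync_file >>
  /* bit 4: return early from fsync */
  if (current->soft_sync & 4)
    return ret;

<< in ext3_file_write >>
  /* bit 8: return early from the write path */
  if (current->soft_sync & 8)
    return ret;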

As you can see soft_sync is masked and bits are ordered by importance.
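
To make the masking concrete, here is a small user-space sketch of how such
a mask could be decoded. It is not kernel code; the SOFTSYNC_* names and the
helper function are hypothetical and only mirror the four bit values used in
the outline above.

```c
#include <stdbool.h>

/* Hypothetical names for the four bits of the proposed soft_sync mask. */
enum softsync_bits {
    SOFTSYNC_WRITE_END = 1,  /* checked in ext3_ordered_write_end */
    SOFTSYNC_WRITEPAGE = 2,  /* checked in ext3_ordered_writepage */
    SOFTSYNC_SYNC_FILE = 4,  /* checked in ext3_sync_file */
    SOFTSYNC_FILE_WRITE = 8, /* checked in ext3_file_write */
};

/* Return true if `bit` is set in the per-process mask. */
bool softsync_enabled(unsigned int mask, enum softsync_bits bit)
{
    return (mask & (unsigned int)bit) != 0;
}
```

For example, a mask value of 5 would enable the write_end and sync_file
shortcuts (1 + 4) while leaving writepage and file_write untouched.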

It would be neat if somebody interested could cook up a patch.


Thanks!

--
Al

-

From: Jan Kara
Date: Friday, January 25, 2008 - 8:36 am

I guess disabling fsync() was already commented on enough. Regarding
switching to writeback mode on a per-process basis - not easily possible,
because sometimes data is not written out by the process which stored
it (think of an mmapped file). And in the case of DBs, they use direct-io
most of the time, so they don't care about the journaling mode anyway.
  But as Diego wrote, there is definitely some room for improvement in
the current data=ordered mode, so the difference shouldn't be as big in the
end.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
-

From: Al Boldi
Date: Friday, January 25, 2008 - 10:27 pm

Testing with sqlite3 and mysql4 shows that performance drastically improves 

Yes, it would be nice to get to the bottom of this starvation problem, but 
even then, the proposed tunable remains useful for misbehaving apps.


Thanks!

--
Al

-

From: Jan Kara
Date: Monday, January 28, 2008 - 10:27 am

No, but if you write to an mmaped file, then we can find out only later
we have dirty data in pages and we call writepage() on behalf of e.g.
  And do you have the databases configured to use direct IO or not?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
-

From: Al Boldi
Date: Monday, January 28, 2008 - 1:17 pm

Ok, that's a special case, which we could code for, but doesn't seem 

I don't think so, but these tests are only meant to expose the underlying 
problem which needs to be fixed, while this RFC proposes a useful 
workaround.


8M-record insert into indexed db-table:
         ordered  writeback
sqlite3:  75m22s    8m45s
mysql4 :  23m35s    5m29s

Also, see the 'konqueror deadlocks in 2.6.22' thread.


Thanks!

--
Al

-

From: Andreas Dilger
Date: Wednesday, February 6, 2008 - 5:00 pm

Al, can you try a patch posted to linux-fsdevel and linux-ext4 from
Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> to see if this improves
your situation?  Dated Mon, 04 Feb 2008 19:15:25 +0900.

    [PATCH] ext3,4:fdatasync should skip metadata writeout when overwriting

It may be that we already have a solution in that patch for database
workloads where the pages are already allocated by avoiding the need
for ordered mode journal flushing in that case.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-

From: Al Boldi
Date: Sunday, February 10, 2008 - 7:54 am

Well, it seems that it does have a positive effect for the 'konqueror hangs' 
case, but doesn't improve the db case.

This shouldn't be surprising, as the db redundant writeout problem is 
localized not in fsync but rather in ext3_ordered_write_end.

Maybe some form of a staged merged commit could help.


Thanks!

--
Al

-

From: Andreas Dilger
Date: Thursday, January 24, 2008 - 11:47 pm

If fsync performance is an issue for you, run the filesystem in data=journal
mode, put the journal on a separate disk and make it big enough that you
don't block on it to flush the data to the filesystem (but not so big that
it is consuming all of your RAM).

That keeps your data guarantees without hurting performance.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-

From: david
Date: Friday, January 25, 2008 - 2:57 pm

My understanding is that the journal is limited to 128M or so. This
prevents you from making it big enough to avoid all problems.

-
