Re: writeout stalls in current -git

Previous thread: Re: writeout stalls in current -git by Fengguang Wu on Thursday, November 1, 2007 - 7:21 pm. (2 messages)

Next thread: [PATCH 3/10] 9p: socket and fd transport changes by Latchesar Ionkov on Friday, November 2, 2007 - 9:59 am. (1 message)
From: Peter Zijlstra
Date: Friday, November 2, 2007 - 3:15 am

Yeah, the background threshold is not (yet) scaled. So it can happen
that the bdi_dirty limit is below the background limit.

I'm curious though as to these stalls, though, I can't seem to think of
what goes wrong.. esp since most writeback seems to happen from pdflush.

(or I'm totally misreading it - quite a possible as I'm still recovering
from a serious cold and not all the green stuff has yet figured out its
proper place wrt brain cells 'n stuff)


I still have this patch floating around:

---
Subject: mm: speed up writeback ramp-up on clean systems

We allow violation of bdi limits if there is a lot of room on the
system. Once we hit half the total limit we start enforcing bdi limits
and bdi ramp-up should happen. Doing it this way avoids many small
writeouts on an otherwise idle system and should also speed up the
ramp-up.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page-writeback.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-09-28 10:08:33.937415368 +0200
+++ linux-2.6/mm/page-writeback.c	2007-09-28 10:54:26.018247516 +0200
@@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long bdi_nr_reclaimable;
-	long bdi_nr_writeback;
+	long nr_reclaimable, bdi_nr_reclaimable;
+	long nr_writeback, bdi_nr_writeback;
 	long background_thresh;
 	long dirty_thresh;
 	long bdi_thresh;
@@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
 
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
+
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+					global_page_state(NR_UNSTABLE_NFS);
+		nr_writeback = global_page_state(NR_WRITEBACK);
+
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
 		bdi_nr_writeback = bdi_stat(bdi, ...
From: Fengguang Wu
Date: Friday, November 2, 2007 - 3:33 am

Me confused too. The new debug patch will confirm whether emerge is


I think this patch is OK for 2.6.24 :-)


-

From: Andrew Morton
Date: Monday, November 5, 2007 - 4:57 pm

On Fri, 2 Nov 2007 18:33:29 +0800


Given the problems we're having in there I'm a bit reluctant to go tossing
hastily put together and inadequately tested stuff onto the fire.  And
that's what this patch looks like to me.

Wanna convince me otherwise?
-

From: Peter Zijlstra
Date: Tuesday, November 6, 2007 - 3:20 am

Not really hastily, I think it was written before the stuff hit
mainline. Inadequately tested, perhaps, its been in my and probably Wu's
kernels for a while. Granted that's not a lot of testing in the face of

I'm perfectly happy with this patch earning its credits in -mm for a
while and maybe going in around -rc4 or something like that (hoping that
by then we've fixed these nagging issues).

Another patch I did come up with yesterday - not driven by any problems
in that area - could perhaps join this one on that path:

---
Subject: mm: bdi: tweak task dirty penalty

Penalizing heavy dirtiers with 1/8-th the total dirty limit might be rather
excessive on large memory machines. Use sqrt to scale it sub-linearly.

Update the comment while we're there.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/page-writeback.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6-2/mm/page-writeback.c
===================================================================
--- linux-2.6-2.orig/mm/page-writeback.c
+++ linux-2.6-2/mm/page-writeback.c
@@ -213,17 +213,21 @@ static inline void task_dirties_fraction
 }
 
 /*
- * scale the dirty limit
+ * Task specific dirty limit:
  *
- * task specific dirty limit:
+ *   dirty -= 8 * sqrt(dirty) * p_{t}
  *
- *   dirty -= (dirty/8) * p_{t}
+ * Penalize tasks that dirty a lot of pages by lowering their dirty limit. This
+ * avoids infrequent dirtiers from getting stuck in this other guys dirty
+ * pages.
+ *
+ * Use a sub-linear function to scale the penalty, we only need a little room.
  */
 void task_dirty_limit(struct task_struct *tsk, long *pdirty)
 {
 	long numerator, denominator;
 	long dirty = *pdirty;
-	u64 inv = dirty >> 3;
+	u64 inv = 8*int_sqrt(dirty);
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;


-

From: Torsten Kaiser
Date: Friday, November 2, 2007 - 12:22 pm

I still have not seen a trigger of the "throttle_vm_writeout".
This time installing 2.6.24-rc1 again it not even triggerd any other
debugs apart from the one in wb_kupdate.
But 300Mb of new files might still not trigger this with 4Gb of RAM.

I'm currently testing 2.6.23-mm1 with this patch and the second

I also don't know. But looking at the time the system takes to write
out 8kb, I'm starting to suspect that something is writing this out,


applied, but did not fix the stalls.

Here the complete log from vmstat 10 and the syslog from an install of
vanilla 2.6.24-rc1.
(Please note: I installed the source of vanilla 2.6.24-rc1, but I am
still using 2.6.23-mm1!)
All lines with [note] are my comments about what the system was doing,
both logs are from the same run, so the notes should be more or less
in sync. I used SysRq+L to insert the SysRq-Helptext into the syslog
as marker...

The visible effects are similar to the unmerge run, but the stalls
during the moving did only start later. But the same effect after
emerge finished and the almost all dirty data was written, was
visible: I can still hear the disks and see the hdd light flickering
(mostly on) for much, much longer than it should take to write 8kb.

vmstat 10:
[note]emerge start
 1  0      0 3668496    332 187748    0    0     0    29   39  491  3  0 96  0
 1  0      0 3623940    332 188880    0    0    83    17 1724 3893 15  2 81  1
 0  0      0 3559488    332 252432    0    0  1021    48 11719 4536  9  4 74 13
 2  0      0 3482220    332 311916    0    0    70    60   93 3818 11  3 86  0
 1  0      0 3289352    332 486932    0    0     2    35   33 11997 25  3 72  0
 1  0      0 3174036    332 596412    0    0    10    33   35 3937 21  4 75  0
 2  0      0 3215756    332 555292    0    0     6    28   85  742 12 12 76  0
 2  0      0 3202128    332 559792    0    0    32     9   34 1566 31  1 68  0
 2  0      0 3192804    332 568072    0    0    60    46  172 4206 30  2 67  1
 3  0      0 3202424    332 ...
From: David Chinner
Date: Friday, November 2, 2007 - 1:43 pm

That's stalled waiting on the inode cluster buffer lock. That implies
that the inode lcuser is already being written out and the inode has
been redirtied during writeout.

Does the kernel you are testing have the "flush inodes in ascending
inode number order" patches applied? If so, can you remove that
patch and see if the problem goes away?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Torsten Kaiser
Date: Friday, November 2, 2007 - 2:02 pm

It's 2.6.23-mm1 with only some small fixes.

In it's broken-out directory I see:
git-xfs.patch

and

writeback-fix-periodic-superblock-dirty-inode-flushing.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
writeback-fix-time-ordering-of-the-per-superblock-inode-lists-8.patch
writeback-introduce-writeback_controlmore_io-to-indicate-more-io.patch

I don't know if the patch you mentioned is part of that version of the
mm-patchset.

Torsten
-

From: Torsten Kaiser
Date: Sunday, November 4, 2007 - 4:19 am

I can now confirm, that I see this also with the current mainline-git-version
I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
plus the fix for the sg changes in ieee1394.
Bisecting would be troublesome, as the sg changes prevent mainline to
boot with my normal config / kill my network.

treogen ~ # vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
-> starting emerge
 1  0      0 3627072    332 157724    0    0    97    13   41  189  2  2 94  2
 0  0      0 3607240    332 163736    0    0   599    10  332  951  2  1 93  4
 0  0      0 3601920    332 167592    0    0   380     2  218  870  1  1 98  0
 0  0      0 3596356    332 171648    0    0   404    21  182  818  0  0 99  0
 0  0      0 3579328    332 180436    0    0   878    12  147  912  1  1 97  2
 0  0      0 3575376    332 182776    0    0   236     4  244  953  1  1 95  3
 2  1      0 3571792    332 185084    0    0   232     7  256 1003  2  1 95  2
 0  0      0 3564844    332 187364    0    0   228   605  246 1167  2  1 93  4
 0  0      0 3562128    332 189784    0    0   230     4  527 1238  2  1 93  4
 0  1      0 3558764    332 191964    0    0   216    24  438 1059  1  1 93  6
 0  0      0 3555120    332 193868    0    0   199    36  406  959  0  0 92  8
 0  0      0 3552008    332 195928    0    0   197    11  458 1023  1  1 90  8
 0  0      0 3548728    332 197660    0    0   183     7  496 1086  1  1 90  8
 0  0      0 3545560    332 199372    0    0   170     8  483 1017  1  1 90  9
 0  1      0 3542124    332 201256    0    0   190     1  544 1137  1  1 88 10
 1  0      0 3536924    332 203296    0    0   195     7  637 1209  2  1 89  8
 1  1      0 3485096    332 249184    0    0   101    16 10372 4537 13  3 76  8
 2  0      0 3442004    332 279728    0    0  1086    40  219 1349  7  3 87  4
-> emerge is done reading its package database
 1  0      0 3254796    332 448636    0    ...
From: David Chinner
Date: Sunday, November 4, 2007 - 6:45 pm

Ok, so it's probably a side effect of the writeback changes.

Attached are two patches (two because one was in a separate patchset as
a standalone change) that should prevent async writeback from blocking
on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
Can you see if this fixes the problem?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


From: Torsten Kaiser
Date: Monday, November 5, 2007 - 12:01 am

Applied both patches against the kernel mentioned above.
This blows up at boot:
[   80.807589] Filesystem "dm-0": Disabling barriers, not supported by
the underlying device
[   80.820241] XFS mounting filesystem dm-0
[   80.913144] ------------[ cut here ]------------
[   80.914932] kernel BUG at drivers/md/raid5.c:143!
[   80.916751] invalid opcode: 0000 [1] SMP
[   80.918338] CPU 3
[   80.919142] Modules linked in:
[   80.920345] Pid: 974, comm: md1_raid5 Not tainted 2.6.24-rc1 #3
[   80.922628] RIP: 0010:[<ffffffff804b6ee4>]  [<ffffffff804b6ee4>]
__release_stripe+0x164/0x170
[   80.925935] RSP: 0018:ffff8100060e7dd0  EFLAGS: 00010002
[   80.927987] RAX: 0000000000000000 RBX: ffff81010141c288 RCX: 0000000000000000
[   80.930738] RDX: 0000000000000000 RSI: ffff81010141c288 RDI: ffff810004fb3200
[   80.933488] RBP: ffff810004fb3200 R08: 0000000000000000 R09: 0000000000000005
[   80.936240] R10: 0000000000000e00 R11: ffffe200038465e8 R12: ffff81010141c298
[   80.938990] R13: 0000000000000286 R14: ffff810004fb3330 R15: 0000000000000000
[   80.941741] FS:  000000000060c870(0000) GS:ffff810100313700(0000)
knlGS:0000000000000000
[   80.944861] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[   80.947080] CR2: 00007fff7b295000 CR3: 0000000101842000 CR4: 00000000000006e0
[   80.949830] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   80.952580] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   80.955332] Process md1_raid5 (pid: 974, threadinfo
ffff8100060e6000, task ffff81000645c730)
[   80.958584] Stack:  ffff81010141c288 00000000000001f4
ffff810004fb3200 ffffffff804b6f2d
[   80.961761]  00000000000001f4 ffff81010141c288 ffffffff804c8bd0
0000000000000000
[   80.964681]  ffff8100060e7ee8 ffffffff804bd094 ffff81000645c730
ffff8100060e7e70
[   80.967518] Call Trace:
[   80.968558]  [<ffffffff804b6f2d>] release_stripe+0x3d/0x60
[   80.970677]  [<ffffffff804c8bd0>] md_thread+0x0/0x100
[   80.972629]  [<ffffffff804bd094>] ...
From: Torsten Kaiser
Date: Monday, November 5, 2007 - 11:27 am

Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the missapplied raid5-patch
Applying your two patches ontop of that does not fix the stalls.

vmstat 10 output from unmerging (uninstalling) a kernel:
 1  0      0 3512188    332 192644    0    0   185    12  368  735 10  3 85  1
-> emerge starts to remove the kernel source files
 3  0      0 3506624    332 192836    0    0    15  9825 2458 8307  7 12 81  0
 0  0      0 3507212    332 192836    0    0     0   554  630 1233  0  1 99  0
 0  0      0 3507292    332 192836    0    0     0   537  580 1328  0  1 99  0
 0  0      0 3507168    332 192836    0    0     0   633  626 1380  0  1 99  0
 0  0      0 3507116    332 192836    0    0     0  1510  768 2030  1  2 97  0
 0  0      0 3507596    332 192836    0    0     0   524  540 1544  0  0 99  0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 3507540    332 192836    0    0     0   489  551 1293  0  0 99  0
 0  0      0 3507528    332 192836    0    0     0   527  510 1432  1  1 99  0
 0  0      0 3508052    332 192840    0    0     0  2088  910 2964  2  3 95  0
 0  0      0 3507888    332 192840    0    0     0   442  565 1383  1  1 99  0
 0  0      0 3508704    332 192840    0    0     0   497  529 1479  0  0 99  0
 0  0      0 3508704    332 192840    0    0     0   594  595 1458  0  0 99  0
 0  0      0 3511492    332 192840    0    0     0  2381 1028 2941  2  3 95  0
 0  0      0 3510684    332 192840    0    0     0   699  600 1390  0  0 99  0
 0  0      0 3511636    332 192840    0    0     0   741  661 1641  0  0 100  0
 0  0      0 3524020    332 192840    0    0     0  2452 1080 3910  2  3 95  0
 0  0      0 3524040    332 192844    0    0     0   530  617 1297  0  0 99  0
 0  0      0 3524128    332 192844    0    0     0   812  674 1667  0  1 99  0
 0  0      0 3527000    332 193672    0    0   339   721  754 1681  3  2 93  1
-> emerge is finished, no ...
From: David Chinner
Date: Monday, November 5, 2007 - 9:25 pm

So you are having RAID5 problems as well?

I'm struggling to understand what possible changed in XFS or writeback that
would lead to stalls like this, esp. as you appear to be removing files when
the stalls occur. Rather than vmstat, can you use something like iostat to
show how busy your disks are?  i.e. are we seeing RMW cycles in the raid5 or
some such issue.


At this point, can you run a "sync" and see how long that takes to
complete? The only thing I can think that woul dbe written out after
this point is inodes, but even then it seems to go on for a long,
long time and it really doesn't seem like XFS is holding up the
inode writes.

Another option is to use blktrace/blkparse to determine which process is

Well, the patches I posted should prevent blocking in the places that it
was seen, so if that does not stop the slowdowns then either the writeback
code is not feeding us inodes fast enough or the block device below is
having some kind of problem....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Torsten Kaiser
Date: Tuesday, November 6, 2007 - 12:10 am

The first 2.6.24-rc1-git-kernel that I patched with your patches did
not boot for me. (Oops send in one of my previous mails) But given
that the stacktrace was not xfs related and I had seen this patch on
the lkml, I tried to fix this Oops this way.


meta-data=/dev/mapper/root       isize=256    agcount=32, agsize=4731132 blks
         =                       sectsz=512   attr=1
data     =                       bsize=4096   blocks=151396224, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0

Already tried that: http://lkml.org/lkml/2007/11/2/178
See the logs from the second unmerge in the second half of the mail.


Yes, I completly agree that this is much to long. Thats why I included
the after-emerge-finished parts of the logs. But I still partly

I don't think its the block device, because reading/writing larger
files do not seem to be troubled. It looks much more like an inode
problem. For example both installing and uninstalling kernel source
trees show these stalls, but during uninstalling this is much more
noticeable.

But I agree that this might not be xfs specific, as this showed up at
the same time as other people started reporting about the 100% iowait
bug. Could be that this is the same bug and the differences between
reiserfs and xfs might explain the iowait vs. idle. Or that I don't
see the 100% iowait is something else on my system...

Torsten
-

From: Peter Zijlstra
Date: Tuesday, November 6, 2007 - 12:01 pm

Just a crazy idea,..

Could there be a set_page_dirty() that doesn't have
balance_dirty_pages() call near? For example modifying meta data in
unlink?

Such a situation could lead to an excess of dirty pages and the next
call to balance_dirty_pages() would appear to stall, as it would
desperately try to get below the limit again.

-

From: Torsten Kaiser
Date: Tuesday, November 6, 2007 - 1:26 pm

Only if accounting of the dirty pages is also broken.
In the unmerge testcase I see most of the time only <200kb of dirty
data in /proc/meminfo.

The system has 4Gb of RAM so I'm not sure if it should ever be valid
to stall even the emerge/install testcase.

Torsten

Now building a kernel with the skipped-pages-accounting-patch reverted...
-

From: Fengguang Wu
Date: Tuesday, November 6, 2007 - 2:17 am

The above messages and the below 'D' state pdflush indicate that one
single writeback_inodes(4MB) call takes a long time(up to 10s!) to complete.

Let's try reverting the below patch with `patch -R`? It looks like

------------------------------------------------------
Subject: writeback: remove pages_skipped accounting in __block_write_full_page()
From: Fengguang Wu <wfg@mail.ustc.edu.cn>


It can be produced by the following test scheme:

# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done

# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023

% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x

% dmesg|grep 847824  # generated by a debug printk
[  529.263184] redirtied inode 847824 line 548
[  564.250872] redirtied inode 847824 line 548
[  594.272797] redirtied inode 847824 line 548
[  629.231330] redirtied inode 847824 ...
From: Torsten Kaiser
Date: Tuesday, November 6, 2007 - 2:53 pm

I have now testet v2.6.24-rc1-748-g2655e2c with above patch reverted.
This does still stall.


Both "vmstat 10" and "iostat -x 10" output from this test:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0 3700592      0  85424    0    0    31    83  108  244  2  1 95  1
-> emerge reads something, don't knwo for sure what...
 1  0      0 3665352      0  87940    0    0   239     2  343  585  2  1 97  0
 0  0      0 3657728      0  91228    0    0   322    35  445  833  0  0 99  0
 1  0      0 3653136      0  94692    0    0   330    33  455  844  1  1 98  0
 0  0      0 3646836      0  97720    0    0   289     3  422  751  1  1 98  0
 0  0      0 3616468      0  99692    0    0   185    33  399  614  9  3 87  1
-> starts to remove the kernel tree
 0  0      0 3610452      0 102592    0    0   138  3598 1398 3945  3  6 90  1
 0  0      0 3607136      0 104548    0    0     2  5962 1919 6070  4  9 87  0
 0  0      0 3606636      0 105080    0    0     0  1539  810 2200  1  2 97  0
-> first stall 28 sec.
 0  0      0 3606592      0 105292    0    0     0   698  679 1390  0  1 99  0
 0  0      0 3606440      0 105532    0    0     0   658  690 1457  0  0 99  0
 0  0      0 3606068      0 106128    0    0     1  1780  947 1982  1  3 96  0
-> second stall 24 sec.
 0  0      0 3606036      0 106464    0    0     4   858  758 1457  0  1 98  0
 0  0      0 3605380      0 106872    0    0     0  1173  807 1880  1  2 97  0
 0  0      0 3605000      0 107748    0    0     1  2413 1103 2996  2  4 94  0
-> third stall 38 sec.
 0  0      0 3604488      0 108472    0    0    45   897  748 1577  0  1 98  0
 0  0      0 3604176      0 108764    0    0     0   824  752 1700  0  1 98  0
 0  0      0 3604012      0 108988    0    0     0   660  643 1237  0  1 99  0
 0  0      0 3608936      0 110120    0    0     1  3490 1232 3455  3  5 91  0
-> fourth stall 64 sec.
 1  0      0 ...
From: David Chinner
Date: Tuesday, November 6, 2007 - 4:31 pm

^^
Apparently we are doing synchronous writes. That would explain why
it is slow. We shouldn't be doing synchronous writes here. I'll see if
I can reproduce this.

<goes off and looks>

Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
look into this further and send a patch when I have something concrete.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: David Chinner
Date: Tuesday, November 6, 2007 - 7:13 pm

Ok, so it's not synchronous writes that we are doing - we're just
submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
The "synchronous" nature appears to be coming from higher level
locking when reclaiming inodes (on the flush lock). It appears that
inode write clustering is failing completely so we are writing the
same block multiple times i.e. once for each inode in the cluster we
have to write.

This must be a side effect of some other change as we haven't
changed anything in the reclaim code recently.....

/me scurries off to run some tests 

Indeed it is. The patch below should fix the problem - the inode
clusters weren't getting set up properly when inodes were being
read in or allocated. This is a regression, introduced by this
mod:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=da353b0d64...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/xfs_iget.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-11-02 13:44:46.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c	2007-11-07 13:08:42.534440675 +1100
@@ -248,7 +248,7 @@ finish_inode:
 	icl = NULL;
 	if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
 							first_index, 1)) {
-		if ((iq->i_ino & mask) == first_index)
+		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
 			icl = iq->i_cluster;
 	}
 
-

From: Torsten Kaiser
Date: Wednesday, November 7, 2007 - 12:15 am

Works for me. The only remaining stalls are sub second and look
completely valid, considering the amount of files being removed.

iostat 10 from this test:
 3  0      0 3500192    332 204956    0    0   105  8512 1809 6473  6 10 83  1
 0  0      0 3500200    332 204576    0    0     0  4367 1355 3712  2  6 92  0
 2  0      0 3504264    332 203528    0    0     0  6805 1912 4967  4  8 88  0
 0  0      0 3511632    332 203528    0    0     0  2843  805 1791  2  4 94  0
 0  0      0 3516852    332 203516    0    0     0  3375  879 2712  3  5 93  0
 0  0      0 3530544    332 202668    0    0   186   776  488 1152  4  2 89  4
 0  0      0 3574788    332 204960    0    0   226   326  358  787  0  1 98  0
 0  0      0 3576820    332 204960    0    0     0   376  332  737  0  0 99  0
 0  0      0 3578432    332 204960    0    0     0   356  293  606  1  1 99  0
 0  0      0 3580192    332 204960    0    0     0   101  104  384  0  0 99  0

I'm pleased to note that this is now much faster again.
Thanks!

Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>

CC's please note: It looks like this was really a different problem
then the 100% iowait that was seen with reiserfs.
Also the one complete stall I have seen is probably something else.
But I have not been able to reproduce this again with -mm and have
never seen this on mainline, so I will just ignore that single event
until I see it again.

-

From: David Chinner
Date: Wednesday, November 7, 2007 - 5:38 pm

Great - thanks for reporting the problem and testing the fix.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Damien Wyart
Date: Tuesday, November 20, 2007 - 6:16 am

This patch has not yet made its way into 2.6.24 (rc3). Is it intended?
Maybe the fix can wait for 2.6.25, but wanted to make sure...

-- 
Damien Wyart
-

From: David Chinner
Date: Tuesday, November 20, 2007 - 2:09 pm

The patch is in the XFS dev tree being QA'd, and we will push it
to 2.6.24-rcX in the next few days.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

Previous thread: Re: writeout stalls in current -git by Fengguang Wu on Thursday, November 1, 2007 - 7:21 pm. (2 messages)

Next thread: [PATCH 3/10] 9p: socket and fd transport changes by Latchesar Ionkov on Friday, November 2, 2007 - 9:59 am. (1 message)