Re: 2.6.23.1: mdadm/raid5 hung/d-state

Previous thread: [PATCH] 82596: free nonexistent resource fix by Evgeniy Dushistov on Sunday, November 4, 2007 - 1:22 pm. (1 message)

Next thread: Re: [PATCH] fix i486 boot failure due to stale %ds by Mikael Pettersson on Sunday, November 4, 2007 - 2:58 pm. (9 messages)
From: Neil Brown
Date: Sunday, November 4, 2007 - 2:49 pm

At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2


NeilBrown

Fix misapplied patch in raid5.c

commit 4ae3f847e49e3787eca91bced31f8fd328d50496 did not get applied
correctly, presumably due to substantial similarities between
handle_stripe5 and handle_stripe6.

This patch (with lots of context) moves the chunk of new code from
handle_stripe6 (where it isn't needed (yet)) to handle_stripe5.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2007-11-02 12:10:49.000000000 +1100
+++ ./drivers/md/raid5.c	2007-11-02 12:25:31.000000000 +1100
@@ -2607,40 +2607,47 @@ static void handle_stripe5(struct stripe
 	struct bio *return_bi = NULL;
 	struct stripe_head_state s;
 	struct r5dev *dev;
 	unsigned long pending = 0;
 
 	memset(&s, 0, sizeof(s));
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
 		"ops=%lx:%lx:%lx\n", (unsigned long long)sh->sector, sh->state,
 		atomic_read(&sh->count), sh->pd_idx,
 		sh->ops.pending, sh->ops.ack, sh->ops.complete);
 
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
+	/* clean-up completed biofill operations */
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIOFILL, ...
From: Justin Piszcz
Date: Sunday, November 4, 2007 - 2:51 pm

Ah, thanks Neil, will be updating as soon as it is released, thanks.

Justin.

-

From: Dan Williams
Date: Monday, November 5, 2007 - 11:35 am

Are you seeing the same "md thread takes 100% of the CPU" that Joël is
reporting?
-

From: Justin Piszcz
Date: Monday, November 5, 2007 - 11:35 am

Yes, in another e-mail I posted the top output with md3_raid5 at 100%.

Justin.
From: Dan Williams
Date: Monday, November 5, 2007 - 5:19 pm

On 11/5/07, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:

This seems too similar to Jo=EBl's situation for them not to be
correlated, and it shows that iscsi is not a necessary component of
the failure.

The attached patch allows the debug statements in MD to be enabled via
sysfs.  Jo=EBl, since it is easier for you to reproduce can you capture
the kernel log output after the raid thread goes into the spin?  It
will help if you have CONFIG_PRINTK_TIME=3Dy set in your kernel
configuration.

After the failure run:

echo 1 > /sys/block/md_d0/md/debug_print_enable; sleep 5; echo 0 >
/sys/block/md_d0/md/debug_print_enable

...to enable the print messages for a few seconds.  Please send the
output in a private message if it proves too big for the mailing list.
From: BERTRAND Joël
Date: Tuesday, November 6, 2007 - 3:19 am

Done. Here is obtained ouput :

[ 1260.967796] for sector 7629696, rmw=0 rcw=0
[ 1260.969314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1260.980606] check 5: state 0x6 toread 0000000000000000 read 
0000000000000000 write fffff800ffcffcc0 written 0000000000000000
[ 1260.994808] check 4: state 0x6 toread 0000000000000000 read 
0000000000000000 write fffff800fdd4e360 written 0000000000000000
[ 1261.009325] check 3: state 0x1 toread 0000000000000000 read 
0000000000000000 write 0000000000000000 written 0000000000000000
[ 1261.244478] check 2: state 0x1 toread 0000000000000000 read 
0000000000000000 write 0000000000000000 written 0000000000000000
[ 1261.270821] check 1: state 0x6 toread 0000000000000000 read 
0000000000000000 write fffff800ff517e40 written 0000000000000000
[ 1261.312320] check 0: state 0x6 toread 0000000000000000 read 
0000000000000000 write fffff800fd4cae60 written 0000000000000000
[ 1261.361030] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 
failed_num=0
[ 1261.443120] for sector 7629696, rmw=0 rcw=0
[ 1261.453348] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1261.491538] check 5: state 0x6 toread 0000000000000000 read 
0000000000000000 write fffff800ffcffcc0 written 0000000000000000
[ 1261.529120] check 4: state 0x6 toread 0000000000000000 read 
0000000000000000 write fffff800fdd4e360 written 0000000000000000
[ 1261.560151] check 3: state 0x1 toread 0000000000000000 read 
0000000000000000 write 0000000000000000 written 0000000000000000
[ 1261.599180] check 2: state 0x1 toread 0000000000000000 read 
0000000000000000 write 0000000000000000 written 0000000000000000
[ 1261.637138] check 1: state 0x6 toread 0000000000000000 read 
0000000000000000 write fffff800ff517e40 written 0000000000000000
[ 1261.674502] check 0: state 0x6 toread 0000000000000000 read 
0000000000000000 write fffff800fd4cae60 written 0000000000000000
[ 1261.712589] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 
failed_num=0
[ 1261.864338] for ...
From: Justin Piszcz
Date: Tuesday, November 6, 2007 - 4:29 am

After the crash it is not 'resyncing' ?

Justin.
From: BERTRAND Joël
Date: Tuesday, November 6, 2007 - 4:39 am

No, it isn't...

	JKB
-

From: Justin Piszcz
Date: Tuesday, November 6, 2007 - 4:42 am

After any crash/unclean shutdown the RAID should resync, if it doesn't,=20
that's not good, I'd suggest running a raid check.

The 'repair' is supposed to clean it, in some cases (md0=3Dswap) it gets=20
dirty again.

Tue May  8 09:19:54 EDT 2007: Executing RAID health check for /dev/md0...
Tue May  8 09:19:55 EDT 2007: Executing RAID health check for /dev/md1...
Tue May  8 09:19:56 EDT 2007: Executing RAID health check for /dev/md2...
Tue May  8 09:19:57 EDT 2007: Executing RAID health check for /dev/md3...
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 2176
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: The meta-device /dev/md0 has 2176 mismatched=
=20
sectors.
Tue May  8 10:09:58 EDT 2007: Executing repair on /dev/md0
Tue May  8 10:09:59 EDT 2007: The meta-device /dev/md1 has no mismatched=20
sectors.
Tue May  8 10:10:00 EDT 2007: The meta-device /dev/md2 has no mismatched=20
sectors.
Tue May  8 10:10:01 EDT 2007: The meta-device /dev/md3 has no mismatched=20
sectors.
Tue May  8 10:20:02 EDT 2007: All devices are clean...
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 2176
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
From: BERTRAND Joël
Date: Tuesday, November 6, 2007 - 5:20 am

I cannot repair this raid volume. I cannot reboot server without 
sending stop+A. init 6 stops at "INIT:". After reboot, md0 is 
resynchronized.

	Regards,

	JKB
-

From: Dan Williams
Date: Tuesday, November 6, 2007 - 6:25 pm

On Tue, 2007-11-06 at 03:19 -0700, BERTRAND Jo
From: Jeff Lessem
Date: Tuesday, November 6, 2007 - 10:00 pm

Dan Williams wrote:
 > The following patch, also attached, cleans up cases where the code looks
 > at sh->ops.pending when it should be looking at the consistent
 > stack-based snapshot of the operations flags.

I tried this patch (against a stock 2.6.23), and it did not work for
me.  Not only did I/O to the effected RAID5 & XFS partition stop, but
also I/O to all other disks.  I was not able to capture any debugging
information, but I should be able to do that tomorrow when I can hook
a serial console to the machine.

I'm not sure if my problem is identical to these others, as mine only
seems to manifest with RAID5+XFS.  The RAID rebuilds with no problem,
and I've not had any problems with RAID5+ext3.

 >
 >
 > ---
 >
 >  drivers/md/raid5.c |   16 +++++++++-------
 >  1 files changed, 9 insertions(+), 7 deletions(-)
 >
 > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
 > index 496b9a3..e1a3942 100644
 > --- a/drivers/md/raid5.c
 > +++ b/drivers/md/raid5.c
 > @@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct 
dma_async_tx_descriptor *tx)
 >  }
 >
 >  static struct dma_async_tx_descriptor *
 > -ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 > +ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
 > +		 unsigned long pending)
 >  {
 >  	int disks = sh->disks;
 >  	int pd_idx = sh->pd_idx, i;
 > @@ -701,7 +702,7 @@ ops_run_biodrain(struct stripe_head *sh, struct 
dma_async_tx_descriptor *tx)
 >  	/* check if prexor is active which means only process blocks
 >  	 * that are part of a read-modify-write (Wantprexor)
 >  	 */
 > -	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
 > +	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 >
 >  	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 >  		(unsigned long long)sh->sector);
 > @@ -778,7 +779,8 @@ static void ops_complete_write(void *stripe_head_ref)
 >  }
 >
 >  static void
 > -ops_run_postxor(struct stripe_head *sh, struct ...
From: Bill Davidsen
Date: Thursday, November 8, 2007 - 10:45 am

That can't be good! This is worrisome because Joel is giddy with joy 
because it fixes his iSCSI problems. I was going to try it with nbd, but 
perhaps I'll wait a week or so and see if others have more information. 
Applying patches before a holiday weekend is a good way to avoid time 

Hopefully it's not the raid which is the issue.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

-

From: Dan Williams
Date: Thursday, November 8, 2007 - 11:02 am

We need to see more information on the failure that Jeff is seeing,
and whether it goes away with the two known patches applied.  He
applied this most recent patch against stock 2.6.23 which means that
the platform was still open to the first biofill flags issue.

--
Dan
-

From: Jeff Lessem
Date: Friday, November 9, 2007 - 1:36 pm

Dan Williams wrote:
 > On 11/8/07, Bill Davidsen <davidsen@tmr.com> wrote:
 >> Jeff Lessem wrote:
 >>> Dan Williams wrote:
 >>>> The following patch, also attached, cleans up cases where the code
 >>> looks
 >>>> at sh->ops.pending when it should be looking at the consistent
 >>>> stack-based snapshot of the operations flags.
 >>> I tried this patch (against a stock 2.6.23), and it did not work for
 >>> me.  Not only did I/O to the effected RAID5 & XFS partition stop, but
 >>> also I/O to all other disks.  I was not able to capture any debugging
 >>> information, but I should be able to do that tomorrow when I can hook
 >>> a serial console to the machine.
 >> That can't be good! This is worrisome because Joel is giddy with joy
 >> because it fixes his iSCSI problems. I was going to try it with nbd, but
 >> perhaps I'll wait a week or so and see if others have more information.
 >> Applying patches before a holiday weekend is a good way to avoid time
 >> off. :-(
 >
 > We need to see more information on the failure that Jeff is seeing,
 > and whether it goes away with the two known patches applied.  He
 > applied this most recent patch against stock 2.6.23 which means that
 > the platform was still open to the first biofill flags issue.

I applied both of the patches.  The biofill one did not apply cleanly,
as it was adding biofill to one section, and removing it from another,
but it appears that biofill does not need to be removed from a stock
2.6.23 kernel.  The second patch applies with a slight offset, but no
errors.

I can report success so far with both patches applied.  I created an
1100GB RAID5, formated it XFS, and successfully "tar c | tar x" 895GB
of data onto it.  I'm also in the process of rsync-ing the 895GB of
data from the (slightly changed) original.  In the past, I would
always get a hang within 0-50GB of data transfer.

For each drive in the RAID I also:

echo 128 > /sys/block/"$i"/queue/max_sectors_kb
echo 512 > /sys/block/"$i"/queue/nr_requests
echo ...
From: Carlos Carvalho
Date: Thursday, November 8, 2007 - 2:40 pm

Jeff Lessem (Jeff@Lessem.org) wrote on 6 November 2007 22:00:
 >Dan Williams wrote:
 > > The following patch, also attached, cleans up cases where the code looks
 > > at sh->ops.pending when it should be looking at the consistent
 > > stack-based snapshot of the operations flags.
 >
 >I tried this patch (against a stock 2.6.23), and it did not work for
 >me.  Not only did I/O to the effected RAID5 & XFS partition stop, but
 >also I/O to all other disks.  I was not able to capture any debugging
 >information, but I should be able to do that tomorrow when I can hook
 >a serial console to the machine.
 >
 >I'm not sure if my problem is identical to these others, as mine only
 >seems to manifest with RAID5+XFS.  The RAID rebuilds with no problem,
 >and I've not had any problems with RAID5+ext3.

Us too! We're stuck trying to build a disk server with several disks
in a raid5 array, and the rsync from the old machine stops writing to
the new filesystem. It only happens under heavy IO. We can make it
lock without rsync, using 8 simultaneous dd's to the array. All IO
stops, including the resync after a newly created raid or after an
unclean reboot.

We could not trigger the problem with ext3 or reiser3; it only happens
with xfs.
-

From: Justin Piszcz
Date: Friday, November 9, 2007 - 2:14 am

Including XFS mailing list as well can you provide more information to 
them?
-

From: Fabiano Silva
Date: Friday, November 9, 2007 - 7:09 am

In our case all process using md4, including md4_resync, stay in D state.
Call Trace:
  [<ffffffff803615ac>] __generic_unplug_device+0x13/0x24
  [<ffffffff803622cf>] generic_unplug_device+0x18/0x28
  [<ffffffff803f2cf7>] get_active_stripe+0x22b/0x472
...
see dmesg (sysrq t) attached.

We can reproduce this problem in two machines with the same configuration:
  - 2 x Dual-Core Opteron 2.8GHz
  - 8GB memory
  - 3ware 9000 with 10 x 750GB sata disks
  - Debian Etch x86_64
  - raid5 + xfs (/dev/md4)
in all these stock kernel's:
  - 2.6.22.11, 2.6.22.12, 2.6.23.1, 2.6.24-rc2
running:
  - for i in f{0..7}; do (dd bs=1M count=100000 if=/dev/zero of=$i &); done

If we increase /sys/block/md4/md/stripe_cache_size the device and process
back to work.
From: BERTRAND Joël
Date: Wednesday, November 7, 2007 - 4:20 am

Thanks for this patch. I'm testing it for three hours. I'm rebuilding a 
1.5 TB raid1 array over iSCSI without any trouble.

gershwin:[/usr/scripts] > cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
       1464725632 blocks [2/1] [U_]
       [=>...................]  recovery =  6.7% (99484736/1464725632) 
finish=1450.9min speed=15679K/sec

Without your patch, I never reached 1%... I hope it fix this bug and I 
shall come back when my raid1 volume shall be resynchronized.

	Regards,

	JKB
-

From: Jeff Lessem
Date: Tuesday, November 6, 2007 - 4:18 pm

Dan Williams wrote:
 > On 11/4/07, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
 >>
 >> On Mon, 5 Nov 2007, Neil Brown wrote:
 >>
 >>> On Sunday November 4, jpiszcz@lucidpixels.com wrote:
 >>>> # ps auxww | grep D
 >>>> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME 
COMMAND
 >>>> root       273  0.0  0.0      0     0 ?        D    Oct21  14:40 
[pdflush]
 >>>> root       274  0.0  0.0      0     0 ?        D    Oct21  13:00 
[pdflush]
 >>>>
 >>>> After several days/weeks, this is the second time this has 
happened, while
 >>>> doing regular file I/O (decompressing a file), everything on the 
device
 >>>> went into D-state.
 >>> At a guess (I haven't looked closely) I'd say it is the bug that was
 >>> meant to be fixed by
 >>>
 >>> commit 4ae3f847e49e3787eca91bced31f8fd328d50496
 >>>
 >>> except that patch applied badly and needed to be fixed with
 >>> the following patch (not in git yet).
 >>> These have been sent to stable@ and should be in the queue for 2.6.23.2
 >>>
 >> Ah, thanks Neil, will be updating as soon as it is released, thanks.
 >>
 >
 > Are you seeing the same "md thread takes 100% of the CPU" that Joël is
 > reporting?

I'm also seeing something similar, but it only seems to cause a
problem if the file system is xfs.  Once I observed the md thread at
100% cpu, but usually the machine is just idle, with processes stuck
in the D state.

The system:

Quad Xeon with 4GB of ram running stock 2.6.23, x86_64.  Drives are
attached to an Adaptec AIC-9410W with aic94xx driver 1.0.3 and
firmware 1.1 (V17/10c6).  Unfortunately I can't try earlier kernels,
because the aic94xx driver didn't support SATA disks until 2.6.23.

I have 4 750GB drives in a RAID5 with LVM and an LV formated ext3
that works without problem.  I can't do extensive testing on those
drives because they contain important data.

I also have 4 400GB drives attached to the same controller.  I created
10GB partition on the 4 drives and a 30GB RAID5 across the ...
From: BERTRAND Joël
Date: Monday, November 5, 2007 - 1:36 am

My linux-2.6.23/drivers/md/raid5.c contains your patch for a long time :

...
         spin_lock(&sh->lock);
         clear_bit(STRIPE_HANDLE, &sh->state);
         clear_bit(STRIPE_DELAYED, &sh->state);

         s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
         s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
         s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
         /* Now to look around and see what can be done */

         /* clean-up completed biofill operations */
         if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
                 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
                 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
                 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
         }

         rcu_read_lock();
         for (i=disks; i--; ) {
                 mdk_rdev_t *rdev;
                 struct r5dev *dev = &sh->dev[i];
...

but it doesn't fix this bug.

	Regards,

	JKB
-

From: Chuck Ebbert
Date: Wednesday, November 7, 2007 - 9:39 am

Did that chunk starting with "clean-up completed biofill operations" end
up where it belongs? The patch with the big context moves it to a different
place from where the original one puts it when applied to 2.6.23...

Lately I've seen several problems where the context isn't enough to make
a patch apply properly when some offsets have changed. In some cases a
patch won't apply at all because two nearly-identical areas are being
changed and the first chunk gets applied where the second one should,
leaving nowhere for the second chunk to apply.

-

From: BERTRAND Joël
Date: Wednesday, November 7, 2007 - 9:48 am

I always apply this kind of patches by hands, and no by patch command. 
Last patch sent here seems to fix this bug :

gershwin:[/usr/scripts] > cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
       1464725632 blocks [2/1] [U_]
       [=====>...............]  recovery = 27.1% (396992504/1464725632) 
finish=1040.3min speed=17104K/sec

	Regards,

	JKB
-

From: BERTRAND Joël
Date: Thursday, November 8, 2007 - 4:42 am

Resync done. Patch fix this bug.

	Regards,

	JKB
-

From: Justin Piszcz
Date: Thursday, November 8, 2007 - 5:44 am

Excellent!

I cannot easily re-produce the bug on my system so I will wait for the=20
next stable patch set to include it and let everyone know if it happens=20
again, thanks.
Previous thread: [PATCH] 82596: free nonexistent resource fix by Evgeniy Dushistov on Sunday, November 4, 2007 - 1:22 pm. (1 message)

Next thread: Re: [PATCH] fix i486 boot failure due to stale %ds by Mikael Pettersson on Sunday, November 4, 2007 - 2:58 pm. (9 messages)