At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by
commit 4ae3f847e49e3787eca91bced31f8fd328d50496
except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2
NeilBrown
Fix misapplied patch in raid5.c
commit 4ae3f847e49e3787eca91bced31f8fd328d50496 did not get applied
correctly, presumably due to substantial similarities between
handle_stripe5 and handle_stripe6.
This patch (with lots of context) moves the chunk of new code from
handle_stripe6 (where it isn't needed (yet)) to handle_stripe5.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2007-11-02 12:10:49.000000000 +1100
+++ ./drivers/md/raid5.c 2007-11-02 12:25:31.000000000 +1100
@@ -2607,40 +2607,47 @@ static void handle_stripe5(struct stripe
struct bio *return_bi = NULL;
struct stripe_head_state s;
struct r5dev *dev;
unsigned long pending = 0;
memset(&s, 0, sizeof(s));
pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
"ops=%lx:%lx:%lx\n", (unsigned long long)sh->sector, sh->state,
atomic_read(&sh->count), sh->pd_idx,
sh->ops.pending, sh->ops.ack, sh->ops.complete);
spin_lock(&sh->lock);
clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */
+ /* clean-up completed biofill operations */
+ if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+ clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+ clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+ clear_bit(STRIPE_OP_BIOFILL, ...Ah, thanks Neil, will be updating as soon as it is released, thanks. Justin. -
Are you seeing the same "md thread takes 100% of the CPU" that Joël is reporting? -
Yes, in another e-mail I posted the top output with md3_raid5 at 100%. Justin.
On 11/5/07, Justin Piszcz <jpiszcz@lucidpixels.com> wrote: This seems too similar to Jo=EBl's situation for them not to be correlated, and it shows that iscsi is not a necessary component of the failure. The attached patch allows the debug statements in MD to be enabled via sysfs. Jo=EBl, since it is easier for you to reproduce can you capture the kernel log output after the raid thread goes into the spin? It will help if you have CONFIG_PRINTK_TIME=3Dy set in your kernel configuration. After the failure run: echo 1 > /sys/block/md_d0/md/debug_print_enable; sleep 5; echo 0 > /sys/block/md_d0/md/debug_print_enable ...to enable the print messages for a few seconds. Please send the output in a private message if it proves too big for the mailing list.
Done. Here is obtained ouput : [ 1260.967796] for sector 7629696, rmw=0 rcw=0 [ 1260.969314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0 [ 1260.980606] check 5: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ffcffcc0 written 0000000000000000 [ 1260.994808] check 4: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fdd4e360 written 0000000000000000 [ 1261.009325] check 3: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000 [ 1261.244478] check 2: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000 [ 1261.270821] check 1: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ff517e40 written 0000000000000000 [ 1261.312320] check 0: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fd4cae60 written 0000000000000000 [ 1261.361030] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0 [ 1261.443120] for sector 7629696, rmw=0 rcw=0 [ 1261.453348] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0 [ 1261.491538] check 5: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ffcffcc0 written 0000000000000000 [ 1261.529120] check 4: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fdd4e360 written 0000000000000000 [ 1261.560151] check 3: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000 [ 1261.599180] check 2: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000 [ 1261.637138] check 1: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ff517e40 written 0000000000000000 [ 1261.674502] check 0: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fd4cae60 written 0000000000000000 [ 1261.712589] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0 [ 1261.864338] for ...
After the crash it is not 'resyncing' ? Justin.
After any crash/unclean shutdown the RAID should resync, if it doesn't,=20 that's not good, I'd suggest running a raid check. The 'repair' is supposed to clean it, in some cases (md0=3Dswap) it gets=20 dirty again. Tue May 8 09:19:54 EDT 2007: Executing RAID health check for /dev/md0... Tue May 8 09:19:55 EDT 2007: Executing RAID health check for /dev/md1... Tue May 8 09:19:56 EDT 2007: Executing RAID health check for /dev/md2... Tue May 8 09:19:57 EDT 2007: Executing RAID health check for /dev/md3... Tue May 8 10:09:58 EDT 2007: cat /sys/block/md0/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 2176 Tue May 8 10:09:58 EDT 2007: cat /sys/block/md1/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 0 Tue May 8 10:09:58 EDT 2007: cat /sys/block/md2/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 0 Tue May 8 10:09:58 EDT 2007: cat /sys/block/md3/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 0 Tue May 8 10:09:58 EDT 2007: The meta-device /dev/md0 has 2176 mismatched= =20 sectors. Tue May 8 10:09:58 EDT 2007: Executing repair on /dev/md0 Tue May 8 10:09:59 EDT 2007: The meta-device /dev/md1 has no mismatched=20 sectors. Tue May 8 10:10:00 EDT 2007: The meta-device /dev/md2 has no mismatched=20 sectors. Tue May 8 10:10:01 EDT 2007: The meta-device /dev/md3 has no mismatched=20 sectors. Tue May 8 10:20:02 EDT 2007: All devices are clean... Tue May 8 10:20:02 EDT 2007: cat /sys/block/md0/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 2176 Tue May 8 10:20:02 EDT 2007: cat /sys/block/md1/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 0 Tue May 8 10:20:02 EDT 2007: cat /sys/block/md2/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 0 Tue May 8 10:20:02 EDT 2007: cat /sys/block/md3/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 0
I cannot repair this raid volume. I cannot reboot server without sending stop+A. init 6 stops at "INIT:". After reboot, md0 is resynchronized. Regards, JKB -
On Tue, 2007-11-06 at 03:19 -0700, BERTRAND Jo
Dan Williams wrote:
> The following patch, also attached, cleans up cases where the code looks
> at sh->ops.pending when it should be looking at the consistent
> stack-based snapshot of the operations flags.
I tried this patch (against a stock 2.6.23), and it did not work for
me. Not only did I/O to the effected RAID5 & XFS partition stop, but
also I/O to all other disks. I was not able to capture any debugging
information, but I should be able to do that tomorrow when I can hook
a serial console to the machine.
I'm not sure if my problem is identical to these others, as mine only
seems to manifest with RAID5+XFS. The RAID rebuilds with no problem,
and I've not had any problems with RAID5+ext3.
>
>
> ---
>
> drivers/md/raid5.c | 16 +++++++++-------
> 1 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 496b9a3..e1a3942 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct
dma_async_tx_descriptor *tx)
> }
>
> static struct dma_async_tx_descriptor *
> -ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
> +ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
> + unsigned long pending)
> {
> int disks = sh->disks;
> int pd_idx = sh->pd_idx, i;
> @@ -701,7 +702,7 @@ ops_run_biodrain(struct stripe_head *sh, struct
dma_async_tx_descriptor *tx)
> /* check if prexor is active which means only process blocks
> * that are part of a read-modify-write (Wantprexor)
> */
> - int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
> + int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
>
> pr_debug("%s: stripe %llu\n", __FUNCTION__,
> (unsigned long long)sh->sector);
> @@ -778,7 +779,8 @@ static void ops_complete_write(void *stripe_head_ref)
> }
>
> static void
> -ops_run_postxor(struct stripe_head *sh, struct ...That can't be good! This is worrisome because Joel is giddy with joy because it fixes his iSCSI problems. I was going to try it with nbd, but perhaps I'll wait a week or so and see if others have more information. Applying patches before a holiday weekend is a good way to avoid time Hopefully it's not the raid which is the issue. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 -
We need to see more information on the failure that Jeff is seeing, and whether it goes away with the two known patches applied. He applied this most recent patch against stock 2.6.23 which means that the platform was still open to the first biofill flags issue. -- Dan -
Dan Williams wrote: > On 11/8/07, Bill Davidsen <davidsen@tmr.com> wrote: >> Jeff Lessem wrote: >>> Dan Williams wrote: >>>> The following patch, also attached, cleans up cases where the code >>> looks >>>> at sh->ops.pending when it should be looking at the consistent >>>> stack-based snapshot of the operations flags. >>> I tried this patch (against a stock 2.6.23), and it did not work for >>> me. Not only did I/O to the effected RAID5 & XFS partition stop, but >>> also I/O to all other disks. I was not able to capture any debugging >>> information, but I should be able to do that tomorrow when I can hook >>> a serial console to the machine. >> That can't be good! This is worrisome because Joel is giddy with joy >> because it fixes his iSCSI problems. I was going to try it with nbd, but >> perhaps I'll wait a week or so and see if others have more information. >> Applying patches before a holiday weekend is a good way to avoid time >> off. :-( > > We need to see more information on the failure that Jeff is seeing, > and whether it goes away with the two known patches applied. He > applied this most recent patch against stock 2.6.23 which means that > the platform was still open to the first biofill flags issue. I applied both of the patches. The biofill one did not apply cleanly, as it was adding biofill to one section, and removing it from another, but it appears that biofill does not need to be removed from a stock 2.6.23 kernel. The second patch applies with a slight offset, but no errors. I can report success so far with both patches applied. I created an 1100GB RAID5, formated it XFS, and successfully "tar c | tar x" 895GB of data onto it. I'm also in the process of rsync-ing the 895GB of data from the (slightly changed) original. In the past, I would always get a hang within 0-50GB of data transfer. For each drive in the RAID I also: echo 128 > /sys/block/"$i"/queue/max_sectors_kb echo 512 > /sys/block/"$i"/queue/nr_requests echo ...
Jeff Lessem (Jeff@Lessem.org) wrote on 6 November 2007 22:00: >Dan Williams wrote: > > The following patch, also attached, cleans up cases where the code looks > > at sh->ops.pending when it should be looking at the consistent > > stack-based snapshot of the operations flags. > >I tried this patch (against a stock 2.6.23), and it did not work for >me. Not only did I/O to the effected RAID5 & XFS partition stop, but >also I/O to all other disks. I was not able to capture any debugging >information, but I should be able to do that tomorrow when I can hook >a serial console to the machine. > >I'm not sure if my problem is identical to these others, as mine only >seems to manifest with RAID5+XFS. The RAID rebuilds with no problem, >and I've not had any problems with RAID5+ext3. Us too! We're stuck trying to build a disk server with several disks in a raid5 array, and the rsync from the old machine stops writing to the new filesystem. It only happens under heavy IO. We can make it lock without rsync, using 8 simultaneous dd's to the array. All IO stops, including the resync after a newly created raid or after an unclean reboot. We could not trigger the problem with ext3 or reiser3; it only happens with xfs. -
Including XFS mailing list as well can you provide more information to them? -
In our case all process using md4, including md4_resync, stay in D state.
Call Trace:
[<ffffffff803615ac>] __generic_unplug_device+0x13/0x24
[<ffffffff803622cf>] generic_unplug_device+0x18/0x28
[<ffffffff803f2cf7>] get_active_stripe+0x22b/0x472
...
see dmesg (sysrq t) attached.
We can reproduce this problem in two machines with the same configuration:
- 2 x Dual-Core Opteron 2.8GHz
- 8GB memory
- 3ware 9000 with 10 x 750GB sata disks
- Debian Etch x86_64
- raid5 + xfs (/dev/md4)
in all these stock kernel's:
- 2.6.22.11, 2.6.22.12, 2.6.23.1, 2.6.24-rc2
running:
- for i in f{0..7}; do (dd bs=1M count=100000 if=/dev/zero of=$i &); done
If we increase /sys/block/md4/md/stripe_cache_size the device and process
back to work.
Thanks for this patch. I'm testing it for three hours. I'm rebuilding a
1.5 TB raid1 array over iSCSI without any trouble.
gershwin:[/usr/scripts] > cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
1464725632 blocks [2/1] [U_]
[=>...................] recovery = 6.7% (99484736/1464725632)
finish=1450.9min speed=15679K/sec
Without your patch, I never reached 1%... I hope it fix this bug and I
shall come back when my raid1 volume shall be resynchronized.
Regards,
JKB
-
Dan Williams wrote: > On 11/4/07, Justin Piszcz <jpiszcz@lucidpixels.com> wrote: >> >> On Mon, 5 Nov 2007, Neil Brown wrote: >> >>> On Sunday November 4, jpiszcz@lucidpixels.com wrote: >>>> # ps auxww | grep D >>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND >>>> root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush] >>>> root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush] >>>> >>>> After several days/weeks, this is the second time this has happened, while >>>> doing regular file I/O (decompressing a file), everything on the device >>>> went into D-state. >>> At a guess (I haven't looked closely) I'd say it is the bug that was >>> meant to be fixed by >>> >>> commit 4ae3f847e49e3787eca91bced31f8fd328d50496 >>> >>> except that patch applied badly and needed to be fixed with >>> the following patch (not in git yet). >>> These have been sent to stable@ and should be in the queue for 2.6.23.2 >>> >> Ah, thanks Neil, will be updating as soon as it is released, thanks. >> > > Are you seeing the same "md thread takes 100% of the CPU" that Joël is > reporting? I'm also seeing something similar, but it only seems to cause a problem if the file system is xfs. Once I observed the md thread at 100% cpu, but usually the machine is just idle, with processes stuck in the D state. The system: Quad Xeon with 4GB of ram running stock 2.6.23, x86_64. Drives are attached to an Adaptec AIC-9410W with aic94xx driver 1.0.3 and firmware 1.1 (V17/10c6). Unfortunately I can't try earlier kernels, because the aic94xx driver didn't support SATA disks until 2.6.23. I have 4 750GB drives in a RAID5 with LVM and an LV formated ext3 that works without problem. I can't do extensive testing on those drives because they contain important data. I also have 4 400GB drives attached to the same controller. I created 10GB partition on the 4 drives and a 30GB RAID5 across the ...
My linux-2.6.23/drivers/md/raid5.c contains your patch for a long time :
...
spin_lock(&sh->lock);
clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */
/* clean-up completed biofill operations */
if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
}
rcu_read_lock();
for (i=disks; i--; ) {
mdk_rdev_t *rdev;
struct r5dev *dev = &sh->dev[i];
...
but it doesn't fix this bug.
Regards,
JKB
-
Did that chunk starting with "clean-up completed biofill operations" end up where it belongs? The patch with the big context moves it to a different place from where the original one puts it when applied to 2.6.23... Lately I've seen several problems where the context isn't enough to make a patch apply properly when some offsets have changed. In some cases a patch won't apply at all because two nearly-identical areas are being changed and the first chunk gets applied where the second one should, leaving nowhere for the second chunk to apply. -
I always apply this kind of patches by hands, and no by patch command.
Last patch sent here seems to fix this bug :
gershwin:[/usr/scripts] > cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
1464725632 blocks [2/1] [U_]
[=====>...............] recovery = 27.1% (396992504/1464725632)
finish=1040.3min speed=17104K/sec
Regards,
JKB
-
Resync done. Patch fix this bug. Regards, JKB -
Excellent! I cannot easily re-produce the bug on my system so I will wait for the=20 next stable patch set to include it and let everyone know if it happens=20 again, thanks.
