Yeah... weird.
RAID0 is so light-weight and so different from RAID1 or RAID5 that I
feel fairly safe concluding that the problem isn't in or near md.
But that doesn't help you.This really feels like a locking problem.
The problem occurs when ->next_rq is NULL, but ->sort_list.rb_node is
not NULL. That happens plenty of times in the code (particularly as
the first request is inserted) but always under ->queue_lock so it
should never be visible to cfq_dispatch_insert..Except that drivers/scsi/ide-scsi.c:idescsi_eh_reset calls
elv_next_request which could ultimately call __cfq_dispatch_requests
without taking ->queue_lock (that I can see). But you probably aren't
using ide-scsi (does anyone?).Given that interrupts are always disabled when queue_lock is taken, it
might be useful to add
WARN_ON(!irqs_disabled());
every time ->next_rq is set.
Something like the following.It might show something useful.... if we are lucky.
NeilBrown
diff .prev/block/cfq-iosched.c ./block/cfq-iosched.c
--- .prev/block/cfq-iosched.c 2007-04-17 15:01:36.000000000 +1000
+++ ./block/cfq-iosched.c 2007-04-17 15:02:25.000000000 +1000
@@ -628,6 +628,7 @@ static void cfq_remove_request(struct re
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);+ BUG_ON(!irqs_disabled());
if (cfqq->next_rq == rq)
cfqq->next_rq = cfq_find_next_rq(cfqq->cfqd, cfqq, rq);@@ -1810,6 +1811,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, s
/*
* check if this request is a better next-serve candidate)) {
*/
+ BUG_ON(!irqs_disabled());
cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
BUG_ON(!cfqq->next_rq);-
Well the fact is that we have 3-4 distinct reports of this oops, and all
of them are using md. No reports have been filed where md isn't managing
the disks. While this doesn't conclusively put the finger of blame on
md, it is still likely. It could bug a CFQ bug too of course, or (moreI had something similar for generic_unplug_request() as well, but didn't
see/hear any reports of it being tried out. Here's a complete debugging
patch for this and other potential dangers.diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b6491c0..8f749aa 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -587,8 +587,11 @@ static void cfq_remove_request(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);- if (cfqq->next_rq == rq)
+ BUG_ON(!irqs_disabled());
+ if (cfqq->next_rq == rq) {
cfqq->next_rq = cfq_find_next_rq(cfqq->cfqd, cfqq, rq);
+ BUG_ON(!cfqq->next_rq && !RB_EMPTY_ROOT(&cfqq->sort_list));
+ }list_del_init(&rq->queuelist);
cfq_del_rq_rb(rq);
@@ -1637,6 +1640,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
/*
* check if this request is a better next-serve candidate)) {
*/
+ BUG_ON(!irqs_disabled());
cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
BUG_ON(!cfqq->next_rq);diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 3de0695..c16863e 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1602,6 +1602,8 @@ EXPORT_SYMBOL(__generic_unplug_device);
**/
void generic_unplug_device(request_queue_t *q)
{
+ BUG_ON(irqs_disabled());
+
spin_lock_irq(q->queue_lock);
__generic_unplug_device(q);
spin_unlock_irq(q->queue_lock);--
Jens Axboe-
Which had a bug (do the check _after_ deleting from the rbtree,
otherwise it'll go rightfully BUG() the first time).diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6853fb7..9aa7bf7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -644,12 +644,15 @@ static void cfq_remove_request(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);+ BUG_ON(!irqs_disabled());
if (cfqq->next_rq == rq)
cfqq->next_rq = cfq_find_next_rq(cfqq->cfqd, cfqq, rq);list_del_init(&rq->queuelist);
cfq_del_rq_rb(rq);+ BUG_ON(!cfqq->next_rq && !RB_EMPTY_ROOT(&cfqq->sort_list));
+
if (rq_is_meta(rq)) {
WARN_ON(!cfqq->meta_pending);
cfqq->meta_pending--;
@@ -1231,6 +1234,8 @@ static void cfq_exit_single_io_context(struct cfq_io_context *cic)
{
struct cfq_data *cfqd = cic->key;+ BUG_ON(irqs_disabled());
+
if (cfqd) {
request_queue_t *q = cfqd->queue;@@ -1699,6 +1704,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
/*
* check if this request is a better next-serve candidate)) {
*/
+ BUG_ON(!irqs_disabled());
cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
BUG_ON(!cfqq->next_rq);@@ -2014,6 +2020,7 @@ static void cfq_exit_queue(elevator_t *e)
cfq_shutdown_timer_wq(cfqd);
+ BUG_ON(irqs_disabled());
spin_lock_irq(q->queue_lock);if (cfqd->active_queue)
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 3de0695..c16863e 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1602,6 +1602,8 @@ EXPORT_SYMBOL(__generic_unplug_device);
**/
void generic_unplug_device(request_queue_t *q)
{
+ BUG_ON(irqs_disabled());
+
spin_lock_irq(q->queue_lock);
__generic_unplug_device(q);
spin_unlock_irq(q->queue_lock);--
Jens Axboe-
I had a clean 2.6.21-rc7 that I forgot to change the default sched on take down my main server last
night. Another oops of the same variety. Unfortunately only a digital photo to go by though.I've removed Neil's patch, applied yours and fired it up on my 4 drive test box (the one that was
causing the trouble before). I've finally scrounged another machine for a serial logger and I'll
beat it until it breaks. Perhaps if I can get this to reliably trigger again we can track it down.Brad
--
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams
-
Thanks for testing Brad, be sure to use the next patch I sent instead.
The one from this mail shouldn't even get you booted. So double check
that you are still using CFQ :-)--
Jens Axboe-
[184901.576773] BUG: unable to handle kernel NULL pointer dereference at virtual address 0000005c
[184901.602612] printing eip:
[184901.610990] c0205399
[184901.617796] *pde = 00000000
[184901.626421] Oops: 0000 [#1]
[184901.635044] Modules linked in:
[184901.644500] CPU: 0
[184901.644501] EIP: 0060:[<c0205399>] Not tainted VLI
[184901.644503] EFLAGS: 00010082 (2.6.21-rc7 #7)
[184901.681294] EIP is at cfq_dispatch_insert+0x19/0x70
[184901.696168] eax: f7f078e0 ebx: f7ca2794 ecx: 00000004 edx: 00000000
[184901.716743] esi: c1acaa1c edi: f7c9c6c0 ebp: 00000000 esp: dbaefde0
[184901.737316] ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
[184901.755032] Process md5sum (pid: 4268, ti=dbaee000 task=f794a5a0 task.ti=dbaee000)
[184901.777422] Stack: 00000000 c1acaa1c f7c9c6c0 00000000 c0205509 e6b61bd8 c0133451 00001000
[184901.803121] 00000008 00000000 00000004 0e713800 c1acaa1c f7c9c6c0 c1acaa1c 00000000
[184901.828837] c0205749 f7ca2794 f7ca2794 f79bc000 00000282 c01fb829 00000000 c016ea8d
[184901.854552] Call Trace:
[184901.862723] [<c0205509>] __cfq_dispatch_requests+0x79/0x170
[184901.879971] [<c0133451>] do_generic_mapping_read+0x281/0x470
[184901.897478] [<c0205749>] cfq_dispatch_requests+0x69/0x90
[184901.913946] [<c01fb829>] elv_next_request+0x39/0x130
[184901.929375] [<c016ea8d>] bio_endio+0x5d/0x90
[184901.942725] [<c0270375>] scsi_request_fn+0x45/0x280
[184901.957896] [<c01fde92>] blk_run_queue+0x32/0x70
[184901.972286] [<c026f8e0>] scsi_next_command+0x30/0x50
[184901.987716] [<c026f9cb>] scsi_end_request+0x9b/0xc0
[184902.002886] [<c026fb31>] scsi_io_completion+0x81/0x330
[184902.018835] [<c026daeb>] scsi_delete_timer+0xb/0x20
[184902.034006] [<c027ee35>] ata_scsi_qc_complete+0x65/0xd0
[184902.050214] [<c02751bb>] sd_rw_intr+0x8b/0x220
[184902.064085] [<c0280c0c>] ata_altstatus+0x1c/0x20
[184902.078475] [<c027b68d>] ata_hsm...
Ok, can you try and reproduce with this one applied? It'll keep the
system running (unless there are other corruptions going on), so it
should help you a bit as well. It will dump some cfq state info when the
condition triggers that can perhaps help diagnose this. So if you can
apply this patch and reproduce + send the output, I'd much appreciate
it!diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b6491c0..2aba928 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -947,6 +947,36 @@ keep_queue:
return cfqq;
}+static void cfq_dump_queue(struct cfq_queue *cfqq)
+{
+ printk(" %d: sort=%d,next=%p,q=%d/%d,a=%d/%d,d=%d/%d,f=%x\n", cfqq->key, RB_EMPTY_ROOT(&cfqq->sort_list), cfqq->next_rq, cfqq->queued[0], cfqq->queued[1], cfqq->allocated[0], cfqq->allocated[1], cfqq->on_dispatch[0], cfqq->on_dispatch[1], cfqq->flags);
+}
+
+static void cfq_dump_state(struct cfq_data *cfqd)
+{
+ struct cfq_queue *cfqq;
+ int i;
+
+ printk("cfq: busy=%d,drv=%d,timer=%d\n", cfqd->busy_queues, cfqd->rq_in_driver, timer_pending(&cfqd->idle_slice_timer));
+
+ printk("cfq rr_list:\n");
+ for (i = 0; i < CFQ_PRIO_LISTS; i++)
+ list_for_each_entry(cfqq, &cfqd->rr_list[i], cfq_list)
+ cfq_dump_queue(cfqq);
+
+ printk("cfq busy_list:\n");
+ list_for_each_entry(cfqq, &cfqd->busy_rr, cfq_list)
+ cfq_dump_queue(cfqq);
+
+ printk("cfq idle_list:\n");
+ list_for_each_entry(cfqq, &cfqd->idle_rr, cfq_list)
+ cfq_dump_queue(cfqq);
+
+ printk("cfq cur_rr:\n");
+ list_for_each_entry(cfqq, &cfqd->cur_rr, cfq_list)
+ cfq_dump_queue(cfqq);
+}
+
static int
__cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
int max_dispatch)
@@ -964,6 +994,30 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if ((rq = cfq_check_fifo(cfqq)) == NULL)
rq = cfqq->next_rq;+ if (unlikely(!rq)) {
+ /*
+ * fixup that weird condition that happens with...
I think I'm wearing holes in my platters. This is being a swine to hit, but I finally got some..
It seems to respond to a combination of high cpu usage and relatively high disk utilisation.I tried all sorts of acrobatics with multiple readers and hammering the array while reading from
individual drives..The only reliable way I can reproduce this seems to be on a degraded array running a "while true ;
do md5sum -c md5 ; done" on about a 180GB directory of files. It is taking anywhere from 4 to 96
hours to hit it though.. but at least it's reproducible.[105449.653682] cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to
lkml@vger.kernel.org
[105449.683646] cfq: busy=1,drv=0,timer=0
[105449.694871] cfq rr_list:
[105449.702715] 3108: sort=0,next=00000000,q=0/1,a=1/0,d=0/0,f=69
[105449.720693] cfq busy_list:
[105449.729054] cfq idle_list:
[105449.737418] cfq cur_rr:
[115435.022192] cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to
lkml@vger.kernel.org
[115435.052160] cfq: busy=1,drv=0,timer=0
[115435.063383] cfq rr_list:
[115435.071227] 3196: sort=0,next=00000000,q=0/1,a=1/0,d=0/0,f=69
[115435.089205] cfq busy_list:
[115435.097566] cfq idle_list:
[115435.105930] cfq cur_rr:
[115616.651883] cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to
lkml@vger.kernel.org
[115616.681848] cfq: busy=1,drv=0,timer=0
[115616.693071] cfq rr_list:
[115616.700916] 3196: sort=0,next=00000000,q=0/1,a=1/0,d=0/0,f=61
[115616.718893] cfq busy_list:
[115616.727253] cfq idle_list:
[115616.735617] cfq cur_rr:
[119679.564753] cfq: rbroot not empty, but ->next_rq == NULL! Fixing up, report the issue to
lkml@vger.kernel.org
[119679.594732] cfq: busy=1,drv=0,timer=0
[119679.605955] cfq rr_list:
[119679.613799] 3241: sort=0,next=00000000,q=0/1,a=1/0,d=0/0,f=69
[119679.631778] cfq busy_list:
[119679.640136] cfq idle_list:
[119679.648502] cfq cur_rr:Brad
--
"Human beings, who are almost uniqu...
Hi,
ide-scsi is holding ide_lock while calling elv_next_request()
(for ide ide_lock == ->queue_lock)Also from the original report:
and you can even see libata functions in the OOPS...
Bart
-
Just a quick thought. Could this issue in any way cause silent data corruption on write to the md?
I've been chasing a bug for months now on this box which resulted in a small bit of seemingly random
corruption in large copies to /dev/md0. Problem is, I changed the PSU in the box (which appeared to
help a little) but then about the same time I discovered the oops was in cfq so changed the
scheduler also. I've not been able to oops or reproduce the data corruption since.I've now got to go back to cfq and see if I can get it to panic reproducibly again, but I'm
wondering now if there may have been another issue there.Brad
--
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams
-
It's raid-5 here and it was degraded. I've just added the extra disk now and I've been running with
the deadline scheduler to work around the issue. I'll add Neil's patch and try some tests with cfq
tonight if I get the chance.Brad
--
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams
-
| Bart Van Assche | Integration of SCST in the mainstream Linux kernel |
| Rafael J. Wysocki | 2.6.27-rc4-git1: Reported regressions from 2.6.26 |
| Greg Kroah-Hartman | [PATCH 004/196] Chinese: add translation of SubmittingPatches |
| Stephen Rothwell | Announce: Linux-next (Or Andrew's dream :-)) |
git: | |
| David Miller | Re: [PATCH 3/3] Convert the UDP hash lock to RCU |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Linus Torvalds | Re: [GIT]: Networking |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
