Re: 2.6.26-rc4: RIP __call_for_each_cic+0x20/0x50

Previous thread: ld: <...> internal error, aborting at arange-set.c line 202 in arange_set_new by Jesper Juhl on Sunday, April 27, 2008 - 2:35 pm. (6 messages)

Next thread: [PATCH] eventfd, signalfd, timerfd, epoll_create w/flags by Ulrich Drepper on Sunday, April 27, 2008 - 3:17 pm. (8 messages)
From: Alexey Dobriyan
Date: Sunday, April 27, 2008 - 3:55 pm

This happened while ~90 cross-compile jobs were running in parallel on
ext2/noatime partition (slowly -- much debugging was on)


general protection fault: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
CPU 0 
Modules linked in: ext2 nf_conntrack_irc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables usblp uhci_hcd ehci_hcd usbcore sr_mod cdrom
Pid: 16483, comm: as Not tainted 2.6.25-c3bf9bc243092c53946fd6d8ebd6dc2f4e572d48 #1
RIP: 0010:[&lt;ffffffff80307525&gt;]  [&lt;ffffffff80307525&gt;] call_for_each_cic+0x25/0x50
RSP: 0018:ffff810170811e58  EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff81010ff92000
RBP: ffff810170811e78 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: ffff8100010069d8 R12: ffff810138ada300
R13: ffffffff803075b0 R14: ffff81017fcd2000 R15: ffff81010ff92168
FS:  00002ac3462426f0(0000) GS:ffffffff805d0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002ab602550000 CR3: 000000013609d000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process as (pid: 16483, threadinfo ffff810170810000, task ffff81010ff92000)
Stack:  ffff810170811e88 ffff810138ada300 0000000000000010 ffff81010ff92100
 ffff810170811e88 ffffffff80307580 ffff810170811ea8 ffffffff80302a55
 ffff81010ff92100 ffff810138ada300 ffff810170811ec8 ffffffff80302b1f
Call Trace:
 [&lt;ffffffff80307580&gt;] cfq_free_io_context+0x10/0x20
 [&lt;ffffffff80302a55&gt;] put_io_context+0x85/0x90
 [&lt;ffffffff80302b1f&gt;] exit_io_context+0x8f/0xb0
 [&lt;ffffffff80235d19&gt;] do_exit+0x549/0x780
 [&lt;ffffffff80235f8e&gt;] do_group_exit+0x3e/0xb0
 [&lt;ffffffff80236012&gt;] sys_exit_group+0x12/0x20
 [&lt;ffffffff8020b6db&gt;] system_call_after_swapgs+0x7b/0x80


Code: 84 00 00 00 00 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 fc 53 48 83 ec 08 e8 18 e1 f5 ff 49 8b 44 24 ...
From: Andrew Morton
Date: Monday, April 28, 2008 - 5:01 am

cfq-iosched.c hasn't been altered (yet) so it might not be a regression.

--

From: Jens Axboe
Date: Monday, April 28, 2008 - 5:04 am

It's not a regression, it's definitely in 2.6.25 as well. So that's a
bit scary, I've been looking over this stuff this morning but haven't
pin pointed anything yet.

Alexey, is this something that reproduces for you?

-- 
Jens Axboe

--

From: Alexey Dobriyan
Date: Monday, April 28, 2008 - 12:55 pm

Not yet, second run of same workload went fine and I've never seen such
oopses before.

FWIW, the only unusual thing in .config is preemptible RCU. Before
2.6.25 it was turned off here. 

--

From: Alexey Dobriyan
Date: Monday, April 28, 2008 - 11:21 pm

And it oopses the very same way on the third run. as(1) again.
So if there are any debugging patches, let me know.

--

From: Jens Axboe
Date: Tuesday, April 29, 2008 - 2:06 am

There seems to be a small race in the destructor path, can you see if
this makes a difference?

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index e34df7c..012f065 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -41,8 +41,8 @@ int put_io_context(struct io_context *ioc)
 		rcu_read_lock();
 		if (ioc-&gt;aic &amp;&amp; ioc-&gt;aic-&gt;dtor)
 			ioc-&gt;aic-&gt;dtor(ioc-&gt;aic);
-		rcu_read_unlock();
 		cfq_dtor(ioc);
+		rcu_read_unlock();
 
 		kmem_cache_free(iocontext_cachep, ioc);
 		return 1;

-- 
Jens Axboe

--

From: Alexey Dobriyan
Date: Wednesday, April 30, 2008 - 3:12 pm

This helps in sense that 3 times bulk cross-compiles finish to the end.
You'll hear me if another such oops will resurface.

--

From: Jens Axboe
Date: Sunday, May 4, 2008 - 12:08 pm

Still looking good?

-- 
Jens Axboe

--

From: Alexey Dobriyan
Date: Sunday, May 4, 2008 - 1:15 pm

Yup!

--

From: Jens Axboe
Date: Sunday, May 4, 2008 - 12:25 pm

Great! I'll clean the path up and submit a patch for 2.6.26-rc1 as well
as 2.6.25.stable. Thanks a lot for reporting and testing.

-- 
Jens Axboe

--

From: Alexey Dobriyan
Date: Sunday, May 4, 2008 - 2:17 pm

And please include side-by-side diagram in changelog :^!

	Alexey, who still doesn't quite got it.

--

From: Alexey Dobriyan
Date: Saturday, May 10, 2008 - 3:37 am

And this with patch in mainline, again with PREEMPT_RCU.

general protection fault: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
CPU 0 
Modules linked in: ext2 nls_utf8 cifs nf_conntrack_irc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables usblp ehci_hcd uhci_hcd usbcore sr_mod cdrom
Pid: 26971, comm: fixdep Not tainted 2.6.26-rc1-28a4acb48586dc21d2d14a75a7aab7be78b7c83b #4
RIP: 0010:[&lt;ffffffff80304b30&gt;]  [&lt;ffffffff80304b30&gt;] __call_for_each_cic+0x20/0x50
RSP: 0018:ffff810104335e58  EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
RDX: 0000000000002827 RSI: 0000000000000000 RDI: ffff810105a62280
RBP: ffff810104335e78 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8101715bde40
R13: ffffffff80304ba0 R14: ffff81017fcc0000 R15: ffff810105a623e8
FS:  00002b50c7e566f0(0000) GS:ffffffff805c6000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b4d7f8b9b80 CR3: 0000000000201000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process fixdep (pid: 26971, threadinfo ffff810104334000, task ffff810105a62280)
Stack:  ffff810104335e88 ffff8101715bde40 0000000000000010 ffff810105a62380
 ffff810104335e88 ffffffff80304b70 ffff810104335ea8 ffffffff8030000b
 ffff810105a62380 ffff8101715bde40 ffff810104335ec8 ffffffff803000df
Call Trace:
 [&lt;ffffffff80304b70&gt;] cfq_free_io_context+0x10/0x20
 [&lt;ffffffff8030000b&gt;] put_io_context+0x7b/0x90
 [&lt;ffffffff803000df&gt;] exit_io_context+0x8f/0xb0
 [&lt;ffffffff80236533&gt;] do_exit+0x543/0x770
 [&lt;ffffffff802367a1&gt;] do_group_exit+0x41/0xb0
 [&lt;ffffffff80236822&gt;] sys_exit_group+0x12/0x20
 [&lt;ffffffff8020b60b&gt;] system_call_after_swapgs+0x7b/0x80


Code: 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 fc 53 48 83 ec 08 48 8b 47 68 48 85 c0 74 1f 48 89 c3 90 &lt;48&gt; 8b 03 48 8d ...
From: Alexey Dobriyan
Date: Monday, May 26, 2008 - 10:27 pm

From: Jens Axboe
Date: Tuesday, May 27, 2008 - 6:35 am

Worrisome... Paul, would you mind taking a quick look at cfq
and see if you can detect why breaks with preempt rcu? It's
clearly a use-after-free symptom, but I don't see how it can
happen.

-- 
Jens Axboe

--

From: Paul E. McKenney
Date: Tuesday, May 27, 2008 - 8:18 am

Some quick and probably off-the-mark questions...

o	What is the purpose of __call_for_each_cic()?  When called
	from call_for_each_cic(), it is under rcu_read_lock(), as
	required, but it is also called from cfq_free_io_context(),
	which is assigned to the -&gt;dtor and -&gt;exit members of the
	cfq_io_context struct.  What protects calls through these
	members?

	(This is for the -&gt;cic_list field of the cfq_io_context structure.
	One possibility is that the io_context's -&gt;lock member is held,
	but I don't see this.  Not that I looked all that hard...)

	My suggestion would be to simply change all invocations of
	__call_for_each_cic() to instead invoke call_for_each_cic().
	The rcu_read_lock()/rcu_read_unlock() pair is pretty
	lightweight, even in CONFIG_PREEMPT_RCU.

o	When calling cfq_slab_kill(), for example from cfq_exit(),
	what ensures that all previous RCU callbacks have completed?
	
	I suspect that you need an rcu_barrier() at the beginning
	of cfq_slab_kill(), but I could be missing something.

o	Updates to the -&gt;ioc_data field of the cfq_io_context
	seem to be protected by the request_queue -&gt;queue_lock
	field.  This seems very strange to me.  It is OK if every
	cfq_io_context is associated with only one request_queue
	structure -- is this the case?

o	What protects the first rcu_dereference() in cfq_cic_lookup()?
	There needs to be either an enclose rcu_read_lock() on the
	one hand or the -&gt;queue_lock needs to be held.

	(My guess is the latter, given the later rcu_assign_pointer()
	in this same function, but I don't see a lock acquisition
	in the immediate vicinity -- might be further up the function
	call stack, though.)

o	Why is there no grace period associated with the ioc_data?
	For example, what happens to the old value of -&gt;ioc_data
	after the rcu_assign_pointer() in cfq_cic_lookup()?  Readers
	might still be referencing the old version, right?  If so,
	how do we avoid messing them up?

	Or are we somehow leveraging the call_rcu() in ...
From: Jens Axboe
Date: Wednesday, May 28, 2008 - 3:07 am

__call_for_each_cic() is always called under rcu_read_lock(), it merely
exists to avoid a double rcu_read_lock(). Even if it is cheap. The
convention follows the usual __lock_is_already_held() double under
score, but I guess it could do with a comment! There are only two
callers of the function, call_for_each_cic() which does the
rcu_read_lock(), and cfq_free_io_context() which is called from -&gt;dtor
(and holds the rcu_read_lock() and -&gt;trim which actually does not. That
looks like it could be problematic, but it's only called when an io
scheduler module is removed so not really critical. I'll add it, though!
Actually, the task_lock() should be enough there. So no bug, but (again)

So we have two callers of that, one is from the error path at init time

-&gt;ioc_data is part of the io_context, not cfq_io_context. And it can be
shared now, so the correct locking for that would be ioc-&gt;lock and not
the queue lock. __cfq_exit_single_io_context() is serialized in the

There's no locking going into that function when coming from
cfq_get_io_context(), the other paths (happen) to hold the queue lock

The data belonging to -&gt;ioc_data (the cic, or per-process per-queue

Very, perhaps with a few more rounds we can find some more bugs :-). I'm
attaching a patch below, how does that look?

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4df3f05..75db529 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1142,6 +1142,9 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+/*
+ * Must always be called with the rcu_read_lock() held
+ */
 static void
 __call_for_each_cic(struct io_context *ioc,
 		    void (*func)(struct io_context *, struct cfq_io_context *))
@@ -1197,6 +1200,11 @@ static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
 	cfq_cic_free(cic);
 }
 
+/*
+ * Must be called with rcu_read_lock() held or preemption otherwise disabled.
+ * Only two callers of this - -&gt;dtor() which is ...
From: Paul E. McKenney
Date: Wednesday, May 28, 2008 - 3:30 am

So the call from cfq_get_io_context() needs an rcu_read_lock()?
Not seeing this in the patch below, but maybe you have it up a


Looks much improved!  Very interested  to hear how it does with  the testing.

--

From: Jens Axboe
Date: Wednesday, May 28, 2008 - 5:44 am

So that one was also OK, as Fabio pointed out. If you follow the
ioc_gone and user tracking, the:

        if (elv_ioc_count_read(ioc_count))
                wait_for_completion(ioc_gone);

also has the side effect of waiting for RCU callbacks calling

It's in there, it now does:

        rcu_read_lock();
        cic = rcu_dereference(ioc-&gt;ioc_data);
        if (cic &amp;&amp; cic-&gt;key == cfqd) {
                rcu_read_unlock();
                return cic;
        }
        ...

OK? Which is basically what remains of the patch now, except for the
comment additions. Oh, and the ioc-&gt;lock protecting setting of
-&gt;ioc_data as well. New version below. Alexey, care to give this a
spin? Seems your box is very well suited for finding RCU preempt
problems :-)


diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4df3f05..d01b411 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1142,6 +1142,9 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+/*
+ * Must always be called with the rcu_read_lock() held
+ */
 static void
 __call_for_each_cic(struct io_context *ioc,
 		    void (*func)(struct io_context *, struct cfq_io_context *))
@@ -1197,6 +1200,11 @@ static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
 	cfq_cic_free(cic);
 }
 
+/*
+ * Must be called with rcu_read_lock() held or preemption otherwise disabled.
+ * Only two callers of this - -&gt;dtor() which is called with the rcu_read_lock(),
+ * and -&gt;trim() which is called with the task lock held
+ */
 static void cfq_free_io_context(struct io_context *ioc)
 {
 	/*
@@ -1502,20 +1510,24 @@ static struct cfq_io_context *
 cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 {
 	struct cfq_io_context *cic;
+	unsigned long flags;
 	void *k;
 
 	if (unlikely(!ioc))
 		return NULL;
 
+	rcu_read_lock();
+
 	/*
 	 * we maintain a last-hit cache, to avoid browsing over the tree
 	 */
 	cic = ...
From: Paul E. McKenney
Date: Wednesday, May 28, 2008 - 6:20 am

OK, looks good.

--

From: Paul E. McKenney
Date: Wednesday, May 28, 2008 - 9:38 pm

But one additional question...

	static void cfq_cic_free_rcu(struct rcu_head *head)
	{
		struct cfq_io_context *cic;

		cic = container_of(head, struct cfq_io_context, rcu_head);

		kmem_cache_free(cfq_ioc_pool, cic);
		elv_ioc_count_dec(ioc_count);

		if (ioc_gone &amp;&amp; !elv_ioc_count_read(ioc_count))
			complete(ioc_gone);
	}

Suppose that a pair of tasks both execute the elv_ioc_count_dec()
at the same time, so that all counters are now zero.  Both then
find that there is still an ioc_gone, and that the count is
now zero.  One of the tasks invokes complete(ioc_gone).  This
awakens the corresponding cfq_exit(), which now returns, getting
rid of its stack frame -- and corrupting the all_gone auto variable
that ioc_gone references.

Now the second task gets a big surprise when it tries to invoke
complete(ioc_gone).

Or is there something else that I am missing here?

--

From: Jens Axboe
Date: Wednesday, May 28, 2008 - 11:26 pm

No, I think that's a problem spot as well. To my knowledge, nobody has
ever hit that. The anticipatory scheduler has the same code.

What we want to avoid here is making cfq_cic_free_rcu() a lot more
expensive, which is why the elv_ioc_count_read() is behind that
ioc_gone check. I'll need to think a bit on how to handle that
better :-)

-- 
Jens Axboe

--

From: Jens Axboe
Date: Wednesday, May 28, 2008 - 11:42 pm

So how about this? Add a spinlock for checking and clearing ioc_gone
back to NULL. It doesn't matter if we make the ioc_gone != NULL
case a little more expensive, as it will only happen on cfq-iosched
module unload. And it seems the clearest way of making this safe.
The last hunk should really not be necessary, as ioc_gone wont be
set back to NULL before wait_for_completion() is entered.

An identical patch is needed in AS as well.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d01b411..32aa367 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -48,6 +48,7 @@ static struct kmem_cache *cfq_ioc_pool;
 
 static DEFINE_PER_CPU(unsigned long, ioc_count);
 static struct completion *ioc_gone;
+static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
 #define cfq_class_idle(cfqq)	((cfqq)-&gt;ioprio_class == IOPRIO_CLASS_IDLE)
@@ -1177,8 +1178,19 @@ static void cfq_cic_free_rcu(struct rcu_head *head)
 	kmem_cache_free(cfq_ioc_pool, cic);
 	elv_ioc_count_dec(ioc_count);
 
-	if (ioc_gone &amp;&amp; !elv_ioc_count_read(ioc_count))
-		complete(ioc_gone);
+	if (ioc_gone) {
+		/*
+		 * CFQ scheduler is exiting, grab exit lock and check
+		 * the pending io context count. If it hits zero,
+		 * complete ioc_gone and set it back to NULL
+		 */
+		spin_lock(&amp;ioc_gone_lock);
+		if (ioc_gone &amp;&amp; !elv_ioc_count_read(ioc_count)) {
+			complete(ioc_gone);
+			ioc_gone = NULL;
+		}
+		spin_unlock(&amp;ioc_gone_lock);
+	}
 }
 
 static void cfq_cic_free(struct cfq_io_context *cic)
@@ -2317,7 +2329,7 @@ static void __exit cfq_exit(void)
 	 * pending RCU callbacks
 	 */
 	if (elv_ioc_count_read(ioc_count))
-		wait_for_completion(ioc_gone);
+		wait_for_completion(&amp;all_gone);
 	cfq_slab_kill();
 }
 

-- 
Jens Axboe

--

From: Paul E. McKenney
Date: Thursday, May 29, 2008 - 2:17 am

Looks better!  I do have one scenario that seems troublesome, but
it should be easy to fix, see below.  (Assuming it really is a
problem, that is...)


Suppose that at this point some other CPU does the last complete().
They have set ioc_gone to NULL, so everything is fine.  But suppose
that in the meantime, some other CPU sets up a cfq and then starts
tearing it down.  Then ioc_gone would be non-NULL, and we would cause
this new teardown to end prematurely.

If this is a real problem, one way to get around it is to have a
generation number.  We capture this before doing the elv_ioc_count_dec()
(alas, with a memory barrier between the capture and the elv_ioc_count_dec()),
and then check it under the lock.  If it has changed, we know someone else
has already done the awakening for us.  Increment the generation number
in the same place that ioc_gone is set to NULL.

--

From: Jens Axboe
Date: Thursday, May 29, 2008 - 3:13 am

This isn't a problem, since cfq_exit() cannot be called before
all block queues in the system have been detached from CFQ.
cfq_exit() calls elv_unregister() before setting ioc_gone, so
when elv_unregister() has returned, CFQ is in its own little world.
Do we need an smp_wmb() between elv_unregister() and the ioc_gone
assignment to ensure this ordering as well? IIRC, the spin_lock
and spin_unlock in elv_unregister() isn't enough to guarentee this.
We are really down to splitting hairs now, but better safe than
sorry :-)

-- 
Jens Axboe

--

From: Paul E. McKenney
Date: Thursday, May 29, 2008 - 4:25 am

And once all block queues have been detached, no future block queues
can ever be attached again?  Or perhaps a better way of putting it,
once CFQ has been shut down, can it be restarted without rebooting
the system?  If it can be restarted without reboot, then I do not

I believe that the spinlock takes care of that ordering issue.  I am
instead worried about a &quot;Rip Van Winkle&quot; effect where a given task
is delayed at a crucial point.  By the time it wakes back up, CFQ
has been not only restarted, but is now in the process of being torn
down again.  (Assuming that it can in fact be restarted without
a reboot.)

						Thanx, Paul
--

From: Jens Axboe
Date: Thursday, May 29, 2008 - 4:44 am

Certainly, you can modprobe cfq-iosched and switch any device to cfq
again. But I really don't see the problem in this case. Once CFQ
has exited, all tasks are detached from CFQ. They have to be, or

Why would we have done the ioc dec and complete() before that task
had finished?

-- 
Jens Axboe

--

From: Paul E. McKenney
Date: Thursday, May 29, 2008 - 5:11 am

Ah, OK -- so we cannot reload the module until the earlier instantiation
has been unloaded, which cannot happen until everyone has returned from
any code in the CFQ module.  If it is linked directly into the kernel,
it can only be &quot;unloaded&quot; by bringing the system down.

Sorry for my confusion!

This approach looks good to me.

--

From: Jens Axboe
Date: Thursday, May 29, 2008 - 5:13 am

Pheew good, I was starting to have a brain melt down as well :-)

-- 
Jens Axboe

--

From: Paul E. McKenney
Date: Friday, May 30, 2008 - 4:04 am

This needs to be upgraded to an smp_mb() because we have a write followed

In absence of these two changes, either compiler or CPU could reorder
the accesses to ioc_gone on the one hand and to ioc_count on the other,
which could result in a hang, for example, cfq_exit() might see the
reader's old count, but the reader might fail to see the assignment
to ioc_gone.

						Thanx, Paul
--

From: Paul E. McKenney
Date: Friday, May 30, 2008 - 6:16 am

OK, make that two more things to consider...

Why the smp_read_barrier_depends() in cfq_get_io_context()?  It appears
to be guarding the dereferencing of ioc.  But we would have obtained
ioc from get_io_context(), which already dereferenced this pointer when
it bumped the -&gt;refcount.  So any barriers would seem to need to be
inside of get_io_context().

So what am I missing this time?

							Thanx, Paul
--

From: Alexey Dobriyan
Date: Friday, May 30, 2008 - 11:34 am

[test patch]

Ha-ha! Posting full dmesg just in case something important is missed.

I checked that .version and number in uname match, timestamps between
vmlinux and vmlinuz are within 20 seconds, there is call to
__rcu_read_lock at the beginning of cfq_cic_lookup(), MD5 of vmlinuz and
vmliniz after reinstalling match so I'm pretty sure I booted right kernel.

Kernel is 2.6.24 with the following patch (one fix for my NIC, one
debugging for Oleg, one for you:


diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b399c62..8cdb821 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1142,6 +1142,9 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+/*
+ * Must always be called with the rcu_read_lock() held
+ */
 static void
 __call_for_each_cic(struct io_context *ioc,
 		    void (*func)(struct io_context *, struct cfq_io_context *))
@@ -1197,6 +1200,11 @@ static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
 	cfq_cic_free(cic);
 }
 
+/*
+ * Must be called with rcu_read_lock() held or preemption otherwise disabled.
+ * Only two callers of this - -&gt;dtor() which is called with the rcu_read_lock(),
+ * and -&gt;trim() which is called with the task lock held
+ */
 static void cfq_free_io_context(struct io_context *ioc)
 {
 	/*
@@ -1502,20 +1510,24 @@ static struct cfq_io_context *
 cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
 {
 	struct cfq_io_context *cic;
+	unsigned long flags;
 	void *k;
 
 	if (unlikely(!ioc))
 		return NULL;
 
+	rcu_read_lock();
+
 	/*
 	 * we maintain a last-hit cache, to avoid browsing over the tree
 	 */
 	cic = rcu_dereference(ioc-&gt;ioc_data);
-	if (cic &amp;&amp; cic-&gt;key == cfqd)
+	if (cic &amp;&amp; cic-&gt;key == cfqd) {
+		rcu_read_unlock();
 		return cic;
+	}
 
 	do {
-		rcu_read_lock();
 		cic = radix_tree_lookup(&amp;ioc-&gt;radix_root, (unsigned long) cfqd);
 		rcu_read_unlock();
 		if (!cic)
@@ -1524,10 +1536,13 @@ cfq_cic_lookup(struct ...
From: Paul E. McKenney
Date: Tuesday, June 3, 2008 - 8:31 pm

The fact that put_io_context() was called from exit_io_context() means
that this is the last thread of a process exiting.  The fact that
cfq_free_io_context() was called (via cfq_dtor()) from put_io_context()
means that this was the last reference to the io_context.  Yet when
we traverse the cic_list, part of it is corrupted -- ascii &quot;k&quot;s through
RAX and RBX.

In other news, I have been hammering preemptable RCU with rcutorture, and
will be making rcutorture more aggressive.  When I get back home, I will
try reproducing this test workload.

--

From: Linus Torvalds
Date: Wednesday, June 4, 2008 - 11:32 am

The &quot;ascii 'k's&quot; are just the slab POISON_FREE byte (0x6b).

IOW, something simply kfree'd the memory too early, causing the list 
traversal to then break.

		Linus
--

From: Paul E. McKenney
Date: Wednesday, June 4, 2008 - 9:23 pm

Hence the CFQ code perhaps just an innocent bystander in the wrong place
at the wrong time.  Fair enough!

							Thanx, Paul
--

From: Paul E. McKenney
Date: Friday, June 6, 2008 - 7:49 am

I have been running the attached scripts, one for LTP and one for
the 170 parallel randconfig kernel builds.  I have been using the
&quot;ltp-full-20080131&quot; version of LTP.  Thus far, I have seen &quot;task X
blocked for more than 120 seconds&quot; messages and OOM killer messages.
I disabled the former with:

	echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs

I have not yet seen the types of failures you posted.  So, are my scripts
doing the right thing?  If so, how long should I expect to run before
seeing a failure?  I see loads in excess of 10,000, for whatever that
is worth.

From: Fabio Checconi
Date: Wednesday, May 28, 2008 - 4:52 am

But isn't the ioc_gone completion (notified only when there are no more
cic allocated) assuring that cfq_slab_kill() is called only after all
the rcu callbacks are completed?  This should avoid the need for the
rcu_barrier().
--

From: Jens Axboe
Date: Wednesday, May 28, 2008 - 4:58 am

Good point, I was thinking we decremented the mod count on call_rcu(),
but we don't actually do it before the rcu callback has completed. So
that part is actually OK already.

-- 
Jens Axboe

--

Previous thread: ld: <...> internal error, aborting at arange-set.c line 202 in arange_set_new by Jesper Juhl on Sunday, April 27, 2008 - 2:35 pm. (6 messages)

Next thread: [PATCH] eventfd, signalfd, timerfd, epoll_create w/flags by Ulrich Drepper on Sunday, April 27, 2008 - 3:17 pm. (8 messages)