This happened while ~90 cross-compile jobs were running in parallel on ext2/noatime partition (slowly -- much debugging was on) general protection fault: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: ext2 nf_conntrack_irc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables usblp uhci_hcd ehci_hcd usbcore sr_mod cdrom Pid: 16483, comm: as Not tainted 2.6.25-c3bf9bc243092c53946fd6d8ebd6dc2f4e572d48 #1 RIP: 0010:[<ffffffff80307525>] [<ffffffff80307525>] call_for_each_cic+0x25/0x50 RSP: 0018:ffff810170811e58 EFLAGS: 00010202 RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff81010ff92000 RBP: ffff810170811e78 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: ffff8100010069d8 R12: ffff810138ada300 R13: ffffffff803075b0 R14: ffff81017fcd2000 R15: ffff81010ff92168 FS: 00002ac3462426f0(0000) GS:ffffffff805d0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00002ab602550000 CR3: 000000013609d000 CR4: 0000000000000660 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process as (pid: 16483, threadinfo ffff810170810000, task ffff81010ff92000) Stack: ffff810170811e88 ffff810138ada300 0000000000000010 ffff81010ff92100 ffff810170811e88 ffffffff80307580 ffff810170811ea8 ffffffff80302a55 ffff81010ff92100 ffff810138ada300 ffff810170811ec8 ffffffff80302b1f Call Trace: [<ffffffff80307580>] cfq_free_io_context+0x10/0x20 [<ffffffff80302a55>] put_io_context+0x85/0x90 [<ffffffff80302b1f>] exit_io_context+0x8f/0xb0 [<ffffffff80235d19>] do_exit+0x549/0x780 [<ffffffff80235f8e>] do_group_exit+0x3e/0xb0 [<ffffffff80236012>] sys_exit_group+0x12/0x20 [<ffffffff8020b6db>] system_call_after_swapgs+0x7b/0x80 Code: 84 00 00 00 00 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 fc 53 48 83 ec 08 e8 18 e1 f5 ff 49 8b 44 24 ...
cfq-iosched.c hasn't been altered (yet) so it might not be a regression. --
It's not a regression, it's definitely in 2.6.25 as well. So that's a bit scary, I've been looking over this stuff this morning but haven't pin pointed anything yet. Alexey, is this something that reproduces for you? -- Jens Axboe --
Not yet, second run of same workload went fine and I've never seen such oopses before. FWIW, the only unusual thing in .config is preemptible RCU. Before 2.6.25 it was turned off here. --
And it oopses the very same way on the third run. as(1) again. So if there are any debugging patches, let me know. --
There seems to be a small race in the destructor path, can you see if this makes a difference? diff --git a/block/blk-ioc.c b/block/blk-ioc.c index e34df7c..012f065 100644 --- a/block/blk-ioc.c +++ b/block/blk-ioc.c @@ -41,8 +41,8 @@ int put_io_context(struct io_context *ioc) rcu_read_lock(); if (ioc->aic && ioc->aic->dtor) ioc->aic->dtor(ioc->aic); - rcu_read_unlock(); cfq_dtor(ioc); + rcu_read_unlock(); kmem_cache_free(iocontext_cachep, ioc); return 1; -- Jens Axboe --
This helps in sense that 3 times bulk cross-compiles finish to the end. You'll hear me if another such oops will resurface. --
Great! I'll clean the path up and submit a patch for 2.6.26-rc1 as well as 2.6.25.stable. Thanks a lot for reporting and testing. -- Jens Axboe --
And please include side-by-side diagram in changelog :^! Alexey, who still doesn't quite got it. --
And this with patch in mainline, again with PREEMPT_RCU. general protection fault: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: ext2 nls_utf8 cifs nf_conntrack_irc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables usblp ehci_hcd uhci_hcd usbcore sr_mod cdrom Pid: 26971, comm: fixdep Not tainted 2.6.26-rc1-28a4acb48586dc21d2d14a75a7aab7be78b7c83b #4 RIP: 0010:[<ffffffff80304b30>] [<ffffffff80304b30>] __call_for_each_cic+0x20/0x50 RSP: 0018:ffff810104335e58 EFLAGS: 00010202 RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000 RDX: 0000000000002827 RSI: 0000000000000000 RDI: ffff810105a62280 RBP: ffff810104335e78 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff8101715bde40 R13: ffffffff80304ba0 R14: ffff81017fcc0000 R15: ffff810105a623e8 FS: 00002b50c7e566f0(0000) GS:ffffffff805c6000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00002b4d7f8b9b80 CR3: 0000000000201000 CR4: 0000000000000660 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process fixdep (pid: 26971, threadinfo ffff810104334000, task ffff810105a62280) Stack: ffff810104335e88 ffff8101715bde40 0000000000000010 ffff810105a62380 ffff810104335e88 ffffffff80304b70 ffff810104335ea8 ffffffff8030000b ffff810105a62380 ffff8101715bde40 ffff810104335ec8 ffffffff803000df Call Trace: [<ffffffff80304b70>] cfq_free_io_context+0x10/0x20 [<ffffffff8030000b>] put_io_context+0x7b/0x90 [<ffffffff803000df>] exit_io_context+0x8f/0xb0 [<ffffffff80236533>] do_exit+0x543/0x770 [<ffffffff802367a1>] do_group_exit+0x41/0xb0 [<ffffffff80236822>] sys_exit_group+0x12/0x20 [<ffffffff8020b60b>] system_call_after_swapgs+0x7b/0x80 Code: 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 fc 53 48 83 ec 08 48 8b 47 68 48 85 c0 74 1f 48 89 c3 90 <48> 8b 03 48 8d ...
Worrisome... Paul, would you mind taking a quick look at cfq and see if you can detect why breaks with preempt rcu? It's clearly a use-after-free symptom, but I don't see how it can happen. -- Jens Axboe --
Some quick and probably off-the-mark questions... o What is the purpose of __call_for_each_cic()? When called from call_for_each_cic(), it is under rcu_read_lock(), as required, but it is also called from cfq_free_io_context(), which is assigned to the ->dtor and ->exit members of the cfq_io_context struct. What protects calls through these members? (This is for the ->cic_list field of the cfq_io_context structure. One possibility is that the io_context's ->lock member is held, but I don't see this. Not that I looked all that hard...) My suggestion would be to simply change all invocations of __call_for_each_cic() to instead invoke call_for_each_cic(). The rcu_read_lock()/rcu_read_unlock() pair is pretty lightweight, even in CONFIG_PREEMPT_RCU. o When calling cfq_slab_kill(), for example from cfq_exit(), what ensures that all previous RCU callbacks have completed? I suspect that you need an rcu_barrier() at the beginning of cfq_slab_kill(), but I could be missing something. o Updates to the ->ioc_data field of the cfq_io_context seem to be protected by the request_queue ->queue_lock field. This seems very strange to me. It is OK if every cfq_io_context is associated with only one request_queue structure -- is this the case? o What protects the first rcu_dereference() in cfq_cic_lookup()? There needs to be either an enclose rcu_read_lock() on the one hand or the ->queue_lock needs to be held. (My guess is the latter, given the later rcu_assign_pointer() in this same function, but I don't see a lock acquisition in the immediate vicinity -- might be further up the function call stack, though.) o Why is there no grace period associated with the ioc_data? For example, what happens to the old value of ->ioc_data after the rcu_assign_pointer() in cfq_cic_lookup()? Readers might still be referencing the old version, right? If so, how do we avoid messing them up? Or are we somehow leveraging the call_rcu() in ...
__call_for_each_cic() is always called under rcu_read_lock(), it merely exists to avoid a double rcu_read_lock(). Even if it is cheap. The convention follows the usual __lock_is_already_held() double under score, but I guess it could do with a comment! There are only two callers of the function, call_for_each_cic() which does the rcu_read_lock(), and cfq_free_io_context() which is called from ->dtor (and holds the rcu_read_lock() and ->trim which actually does not. That looks like it could be problematic, but it's only called when an io scheduler module is removed so not really critical. I'll add it, though! Actually, the task_lock() should be enough there. So no bug, but (again) So we have two callers of that, one is from the error path at init time ->ioc_data is part of the io_context, not cfq_io_context. And it can be shared now, so the correct locking for that would be ioc->lock and not the queue lock. __cfq_exit_single_io_context() is serialized in the There's no locking going into that function when coming from cfq_get_io_context(), the other paths (happen) to hold the queue lock The data belonging to ->ioc_data (the cic, or per-process per-queue Very, perhaps with a few more rounds we can find some more bugs :-). I'm attaching a patch below, how does that look? diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 4df3f05..75db529 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -1142,6 +1142,9 @@ static void cfq_put_queue(struct cfq_queue *cfqq) kmem_cache_free(cfq_pool, cfqq); } +/* + * Must always be called with the rcu_read_lock() held + */ static void __call_for_each_cic(struct io_context *ioc, void (*func)(struct io_context *, struct cfq_io_context *)) @@ -1197,6 +1200,11 @@ static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic) cfq_cic_free(cic); } +/* + * Must be called with rcu_read_lock() held or preemption otherwise disabled. + * Only two callers of this - ->dtor() which is ...
So the call from cfq_get_io_context() needs an rcu_read_lock()? Not seeing this in the patch below, but maybe you have it up a Looks much improved! Very interested to hear how it does with the testing. --
So that one was also OK, as Fabio pointed out. If you follow the
ioc_gone and user tracking, the:
if (elv_ioc_count_read(ioc_count))
wait_for_completion(ioc_gone);
also has the side effect of waiting for RCU callbacks calling
It's in there, it now does:
rcu_read_lock();
cic = rcu_dereference(ioc->ioc_data);
if (cic && cic->key == cfqd) {
rcu_read_unlock();
return cic;
}
...
OK? Which is basically what remains of the patch now, except for the
comment additions. Oh, and the ioc->lock protecting setting of
->ioc_data as well. New version below. Alexey, care to give this a
spin? Seems your box is very well suited for finding RCU preempt
problems :-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4df3f05..d01b411 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1142,6 +1142,9 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
kmem_cache_free(cfq_pool, cfqq);
}
+/*
+ * Must always be called with the rcu_read_lock() held
+ */
static void
__call_for_each_cic(struct io_context *ioc,
void (*func)(struct io_context *, struct cfq_io_context *))
@@ -1197,6 +1200,11 @@ static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
cfq_cic_free(cic);
}
+/*
+ * Must be called with rcu_read_lock() held or preemption otherwise disabled.
+ * Only two callers of this - ->dtor() which is called with the rcu_read_lock(),
+ * and ->trim() which is called with the task lock held
+ */
static void cfq_free_io_context(struct io_context *ioc)
{
/*
@@ -1502,20 +1510,24 @@ static struct cfq_io_context *
cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
{
struct cfq_io_context *cic;
+ unsigned long flags;
void *k;
if (unlikely(!ioc))
return NULL;
+ rcu_read_lock();
+
/*
* we maintain a last-hit cache, to avoid browsing over the tree
*/
cic = ...But one additional question...
static void cfq_cic_free_rcu(struct rcu_head *head)
{
struct cfq_io_context *cic;
cic = container_of(head, struct cfq_io_context, rcu_head);
kmem_cache_free(cfq_ioc_pool, cic);
elv_ioc_count_dec(ioc_count);
if (ioc_gone && !elv_ioc_count_read(ioc_count))
complete(ioc_gone);
}
Suppose that a pair of tasks both execute the elv_ioc_count_dec()
at the same time, so that all counters are now zero. Both then
find that there is still an ioc_gone, and that the count is
now zero. One of the tasks invokes complete(ioc_gone). This
awakens the corresponding cfq_exit(), which now returns, getting
rid of its stack frame -- and corrupting the all_gone auto variable
that ioc_gone references.
Now the second task gets a big surprise when it tries to invoke
complete(ioc_gone).
Or is there something else that I am missing here?
--
No, I think that's a problem spot as well. To my knowledge, nobody has ever hit that. The anticipatory scheduler has the same code. What we want to avoid here is making cfq_cic_free_rcu() a lot more expensive, which is why the elv_ioc_count_read() is behind that ioc_gone check. I'll need to think a bit on how to handle that better :-) -- Jens Axboe --
So how about this? Add a spinlock for checking and clearing ioc_gone
back to NULL. It doesn't matter if we make the ioc_gone != NULL
case a little more expensive, as it will only happen on cfq-iosched
module unload. And it seems the clearest way of making this safe.
The last hunk should really not be necessary, as ioc_gone wont be
set back to NULL before wait_for_completion() is entered.
An identical patch is needed in AS as well.
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d01b411..32aa367 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -48,6 +48,7 @@ static struct kmem_cache *cfq_ioc_pool;
static DEFINE_PER_CPU(unsigned long, ioc_count);
static struct completion *ioc_gone;
+static DEFINE_SPINLOCK(ioc_gone_lock);
#define CFQ_PRIO_LISTS IOPRIO_BE_NR
#define cfq_class_idle(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
@@ -1177,8 +1178,19 @@ static void cfq_cic_free_rcu(struct rcu_head *head)
kmem_cache_free(cfq_ioc_pool, cic);
elv_ioc_count_dec(ioc_count);
- if (ioc_gone && !elv_ioc_count_read(ioc_count))
- complete(ioc_gone);
+ if (ioc_gone) {
+ /*
+ * CFQ scheduler is exiting, grab exit lock and check
+ * the pending io context count. If it hits zero,
+ * complete ioc_gone and set it back to NULL
+ */
+ spin_lock(&ioc_gone_lock);
+ if (ioc_gone && !elv_ioc_count_read(ioc_count)) {
+ complete(ioc_gone);
+ ioc_gone = NULL;
+ }
+ spin_unlock(&ioc_gone_lock);
+ }
}
static void cfq_cic_free(struct cfq_io_context *cic)
@@ -2317,7 +2329,7 @@ static void __exit cfq_exit(void)
* pending RCU callbacks
*/
if (elv_ioc_count_read(ioc_count))
- wait_for_completion(ioc_gone);
+ wait_for_completion(&all_gone);
cfq_slab_kill();
}
--
Jens Axboe
--
Looks better! I do have one scenario that seems troublesome, but it should be easy to fix, see below. (Assuming it really is a problem, that is...) Suppose that at this point some other CPU does the last complete(). They have set ioc_gone to NULL, so everything is fine. But suppose that in the meantime, some other CPU sets up a cfq and then starts tearing it down. Then ioc_gone would be non-NULL, and we would cause this new teardown to end prematurely. If this is a real problem, one way to get around it is to have a generation number. We capture this before doing the elv_ioc_count_dec() (alas, with a memory barrier between the capture and the elv_ioc_count_dec()), and then check it under the lock. If it has changed, we know someone else has already done the awakening for us. Increment the generation number in the same place that ioc_gone is set to NULL. --
This isn't a problem, since cfq_exit() cannot be called before all block queues in the system have been detached from CFQ. cfq_exit() calls elv_unregister() before setting ioc_gone, so when elv_unregister() has returned, CFQ is in its own little world. Do we need an smp_wmb() between elv_unregister() and the ioc_gone assignment to ensure this ordering as well? IIRC, the spin_lock and spin_unlock in elv_unregister() isn't enough to guarentee this. We are really down to splitting hairs now, but better safe than sorry :-) -- Jens Axboe --
And once all block queues have been detached, no future block queues can ever be attached again? Or perhaps a better way of putting it, once CFQ has been shut down, can it be restarted without rebooting the system? If it can be restarted without reboot, then I do not I believe that the spinlock takes care of that ordering issue. I am instead worried about a "Rip Van Winkle" effect where a given task is delayed at a crucial point. By the time it wakes back up, CFQ has been not only restarted, but is now in the process of being torn down again. (Assuming that it can in fact be restarted without a reboot.) Thanx, Paul --
Certainly, you can modprobe cfq-iosched and switch any device to cfq again. But I really don't see the problem in this case. Once CFQ has exited, all tasks are detached from CFQ. They have to be, or Why would we have done the ioc dec and complete() before that task had finished? -- Jens Axboe --
Ah, OK -- so we cannot reload the module until the earlier instantiation has been unloaded, which cannot happen until everyone has returned from any code in the CFQ module. If it is linked directly into the kernel, it can only be "unloaded" by bringing the system down. Sorry for my confusion! This approach looks good to me. --
Pheew good, I was starting to have a brain melt down as well :-) -- Jens Axboe --
This needs to be upgraded to an smp_mb() because we have a write followed In absence of these two changes, either compiler or CPU could reorder the accesses to ioc_gone on the one hand and to ioc_count on the other, which could result in a hang, for example, cfq_exit() might see the reader's old count, but the reader might fail to see the assignment to ioc_gone. Thanx, Paul --
OK, make that two more things to consider... Why the smp_read_barrier_depends() in cfq_get_io_context()? It appears to be guarding the dereferencing of ioc. But we would have obtained ioc from get_io_context(), which already dereferenced this pointer when it bumped the ->refcount. So any barriers would seem to need to be inside of get_io_context(). So what am I missing this time? Thanx, Paul --
[test patch]
Ha-ha! Posting full dmesg just in case something important is missed.
I checked that .version and number in uname match, timestamps between
vmlinux and vmlinuz are within 20 seconds, there is call to
__rcu_read_lock at the beginning of cfq_cic_lookup(), MD5 of vmlinuz and
vmliniz after reinstalling match so I'm pretty sure I booted right kernel.
Kernel is 2.6.24 with the following patch (one fix for my NIC, one
debugging for Oleg, one for you:
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b399c62..8cdb821 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1142,6 +1142,9 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
kmem_cache_free(cfq_pool, cfqq);
}
+/*
+ * Must always be called with the rcu_read_lock() held
+ */
static void
__call_for_each_cic(struct io_context *ioc,
void (*func)(struct io_context *, struct cfq_io_context *))
@@ -1197,6 +1200,11 @@ static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
cfq_cic_free(cic);
}
+/*
+ * Must be called with rcu_read_lock() held or preemption otherwise disabled.
+ * Only two callers of this - ->dtor() which is called with the rcu_read_lock(),
+ * and ->trim() which is called with the task lock held
+ */
static void cfq_free_io_context(struct io_context *ioc)
{
/*
@@ -1502,20 +1510,24 @@ static struct cfq_io_context *
cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
{
struct cfq_io_context *cic;
+ unsigned long flags;
void *k;
if (unlikely(!ioc))
return NULL;
+ rcu_read_lock();
+
/*
* we maintain a last-hit cache, to avoid browsing over the tree
*/
cic = rcu_dereference(ioc->ioc_data);
- if (cic && cic->key == cfqd)
+ if (cic && cic->key == cfqd) {
+ rcu_read_unlock();
return cic;
+ }
do {
- rcu_read_lock();
cic = radix_tree_lookup(&ioc->radix_root, (unsigned long) cfqd);
rcu_read_unlock();
if (!cic)
@@ -1524,10 +1536,13 @@ cfq_cic_lookup(struct ...The fact that put_io_context() was called from exit_io_context() means that this is the last thread of a process exiting. The fact that cfq_free_io_context() was called (via cfq_dtor()) from put_io_context() means that this was the last reference to the io_context. Yet when we traverse the cic_list, part of it is corrupted -- ascii "k"s through RAX and RBX. In other news, I have been hammering preemptable RCU with rcutorture, and will be making rcutorture more aggressive. When I get back home, I will try reproducing this test workload. --
The "ascii 'k's" are just the slab POISON_FREE byte (0x6b). IOW, something simply kfree'd the memory too early, causing the list traversal to then break. Linus --
Hence the CFQ code perhaps just an innocent bystander in the wrong place at the wrong time. Fair enough! Thanx, Paul --
I have been running the attached scripts, one for LTP and one for the 170 parallel randconfig kernel builds. I have been using the "ltp-full-20080131" version of LTP. Thus far, I have seen "task X blocked for more than 120 seconds" messages and OOM killer messages. I disabled the former with: echo 0 > /proc/sys/kernel/hung_task_timeout_secs I have not yet seen the types of failures you posted. So, are my scripts doing the right thing? If so, how long should I expect to run before seeing a failure? I see loads in excess of 10,000, for whatever that is worth.
But isn't the ioc_gone completion (notified only when there are no more cic allocated) assuring that cfq_slab_kill() is called only after all the rcu callbacks are completed? This should avoid the need for the rcu_barrier(). --
Good point, I was thinking we decremented the mod count on call_rcu(), but we don't actually do it before the rcu callback has completed. So that part is actually OK already. -- Jens Axboe --
