Hi,
This appeared in my logs:
kmemcheck: Caught 32-bit read from freed memory (f7042348)
Pid: 1374, comm: bash Not tainted (2.6.25-rc7 #92)
EIP: 0060:[<c0502f0d>] EFLAGS: 00210202 CPU: 0
EIP is at call_for_each_cic+0x2d/0x44
EAX: 00200286 EBX: 00000001 ECX: c200e908 EDX: f7042348
ESI: f6c26c60 EDI: c0503310 EBP: f70fff38 ESP: c082ec88
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: f7826904 CR3: 36cd7000 CR4: 000006c0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
[<c041cff8>] kmemcheck_read+0xa8/0xe0
[<c041d1d5>] kmemcheck_access+0x1a5/0x244
[<c0668252>] do_page_fault+0x622/0x6fc
[<c06666aa>] error_code+0x72/0x78
[<c050323f>] cfq_free_io_context+0xf/0x70
[<c04fc4d7>] put_io_context+0x4f/0x58
[<c04fc568>] exit_io_context+0x60/0x6c
[<c042f871>] do_exit+0x4d9/0x6f0
[<c042fab1>] do_group_exit+0x29/0x88
[<c042fb1f>] sys_exit_group+0xf/0x14
[<c0406105>] sysenter_past_esp+0x6d/0xa4
[<ffffffff>] 0xffffffff
The error occurs in cfq_free_io_context()'s call to
call_for_each_cic() which looks like this:
rcu_read_lock();
hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list) {
func(ioc, cic);
called++;
}
rcu_read_unlock();
The function that is called is cic_free_func(). It is postulated that
hlist_for_each_entry_rcu() will dereference the previously freed list
element to get the ->next pointer.
After a short discussion with Pekka Enberg and Peter Zijlstra, it
seemed evident that this list traversal should use
hlist_for_each_entry_safe_rcu() instead, which would buffer the next
pointer before the object is freed.
Does this report seem to be valid?
The kernel is 2.6.25-rc7.
Kind regards,
Vegard Nossum
--
The missing hlist for loop would look something like so:
#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) \
for (pos = (head)->first; \
rcu_dereference(pos) && ({ n = pos->next; 1; }) && \
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
pos = n)
--
Why the heck is cic_free_func() immediately doing a kmem_cache_free() on the cfq_io_context structure??? Shouldn't we have a call_rcu() or a synchronize_rcu() in there somewhere??? Given the way this is written, wouldn't readers on other code paths get dumped onto the freelist? This would not be a good thing... Thanx, Paul --
I was told, but didn't check for myself it is using SLAB_RCU. Of course that requires an object validation pass, and I didn't check it does that either. --
Yes, it's using SLAB_DESTROY_BY_RCU. If you find faults with that, please let me know. -- Jens Axboe --
But doesn't SLAB_RCU only wait for a grace period when returning a slab to the system, rather than on each kmem_cache_free()? Thanx, Paul --
Good catch, I wonder why it didn't complain in my testing. I've added a patch to fix that, please see it here: http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=51998b2da4e5db65cb2431732905904... -- Jens Axboe --
You probably don't have kmemcheck in your kernel ;-) --
Ehm no, you are right :) -- Jens Axboe --
... and you can get kmemcheck by testing on x86.git/latest: http://people.redhat.com/mingo/x86.git/README ;-) Ingo --
Thanks Ingo, I'll give it a spin to verify this fix! -- Jens Axboe --
I will check this when I get back to some bandwidth -- but in the meantime, does kmemcheck special-case SLAB_DESTROY_BY_RCU? It is legal to access newly-freed items in that case, as long as you did rcu_read_lock() before gaining a reference to them and don't hold the reference past the matching rcu_read_unlock(). Thanx, Paul --
I don't think it does. It would have to register an call_rcu callback itself in order to mark it freed - and handle the race with the object being handed out again. --
I had the same problem while debugging a cfq-derived i/o scheduler,
and I found nothing preventing the reuse of the freed memory.
The patch below seemed to fix the logic.
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
---
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 0f962ec..f26da2b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1143,24 +1143,37 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
}
/*
- * Call func for each cic attached to this ioc. Returns number of cic's seen.
+ * Call func for each cic attached to this ioc.
*/
-static unsigned int
+static void
call_for_each_cic(struct io_context *ioc,
void (*func)(struct io_context *, struct cfq_io_context *))
{
struct cfq_io_context *cic;
struct hlist_node *n;
- int called = 0;
rcu_read_lock();
- hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list) {
+ hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list)
func(ioc, cic);
- called++;
- }
rcu_read_unlock();
+}
+
+static void cfq_cic_free_rcu(struct rcu_head *head)
+{
+ struct cfq_io_context *cic;
+
+ cic = container_of(head, struct cfq_io_context, rcu_head);
+
+ kmem_cache_free(cfq_ioc_pool, cic);
+ elv_ioc_count_dec(ioc_count);
+
+ if (ioc_gone && !elv_ioc_count_read(ioc_count))
+ complete(ioc_gone);
+}
- return called;
+static void cfq_cic_free(struct cfq_io_context *cic)
+{
+ call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
}
static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
@@ -1174,24 +1187,18 @@ static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
hlist_del_rcu(&cic->cic_list);
spin_unlock_irqrestore(&ioc->lock, flags);
- kmem_cache_free(cfq_ioc_pool, cic);
+ cfq_cic_free(cic);
}
static void cfq_free_io_context(struct io_context *ioc)
{
- int freed;
-
/*
- * ioc->refcount is zero here, so no more cic's are allowed to be
- * linked into this ioc. So it should be ok to iterate over the ...Thanks, from a first look this looks like it'll fix this bad rcu slab usage. I'll give it some closer scrutiny and testing. -- Jens Axboe --
(CC reinstated, sometimes mutt is really annoying and drops the person you are replying too :-( Looks good and tests fine as well. I've applied it, on top of the hlist_for_each_entry_safe_rcu() fix. http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=436151ad32b4a59e0d36165a7d63125... -- Jens Axboe --
ok, thanks. anyway I don't think the hlist_for_each_entry_safe_rcu() is needed at this point, since the pos->next pointer is still valid (at least) until the next rcu_read_unlock(). am I wrong? --
it isn't, but it's still clearer. so either that or a nice comment, I just stuck with what I had already committed. -- Jens Axboe --
Christ, mutt is still dropping you. Sorry about that. -- Jens Axboe --
ok I agree on that. the only remaining concern I have is that when I first looked at it it seemed to me that hlist_for_each_entry_safe_rcu() was missing by purpose from the list interface, since hlist_del_rcu() can be called anyway during the traversal from a concurrent context, so the semantics of *_safe_* have to be carried out by other means (i.e., call_rcu()). --
Given Peter's comment, would you be willing to drop the hlist_for_each_entry_safe_rcu()? People tend to read too much into the "safe" sometimes. ;-) Thanx, Paul --
Sure, let me rebase it... I already sent a pull request to Linus, and he just loves when I rebase and change the diffstats behind his back :) -- Jens Axboe --
Would it help if I submitted the patch to you to back it out later? ;-) Thanx, Paul --
Hehe, I already wrote to Linus and explained why it changed, so what is done is done :) -- Jens Axboe --
Looks good to me from a strictly RCU viewpoint -- I must confess great ignorance of the CFQ code. :-/ --
Hey, I am not complaining about 2007 being gone already, because I am still wondering what the heck happened to 2005 and 2006!!! ;-) --
Hi Paul,
On Wed, Apr 2, 2008 at 1:55 PM, Paul E. McKenney
No, kmemcheck is work in progress and does not know about
SLAB_DESTROY_BY_RCU yet. The reason I asked Vegard to post the warning
was because Peter, Vegard, and myself identified this particular
warning as a real problem. But yeah, kmemcheck can cause false
positives for RCU for now.
Pekka
--
Makes sense, and to me Pauls analysis of the code looks totally correct - there's no bug there, at least related to hlist traversal and kmem_cache_free(), since we are under rcu_read_lock() and thus hold off the grace for freeing. -- Jens Axboe --
but what holds off the slab allocator re-issueing that same object and someone else writing other stuff into it? --
Nothing. So you cannot access the object at all after you've called kmem_cache_free(). SLAB_RCU or no SLAB_RCU. --
Well, you can, but you have to validate you get the object you were looking for. --
You can of course access the object. And if you establish a definite state of locks on free (and through a ctor) then its even okay to take locks. --
Nothing, that's how rcu destry works here. But for the validation to be WRONG radix_tree_lookup(..., old_key) must return cic for new_key, not NULL. -- Jens Axboe --
A B C cfq_cic_lookup(cfqd_1, ioc) rcu_read_lock() cic = radix_tree_lookup(, cfqd_q); cfq_cic_free() cfq_cic_link(cfqd_2, ioc,) rcu_read_unlock() and now we have that: cic->key == cfqd_2 I'm not seeing anything stopping this from happening. Which is also why we need hlist_for_each_safe_rcu() because as soon as we kfree()d the thing, someone else might get the object and start poking at the hlist pointers, wrecking out iteration. --
Or worse, when C doesn't happen and B free's the very last object and the slab does get returned, any usage of cic after rcu_read_unlock() might poke into free memory. --
I don't follow your A-B-C here, what do they refer to? -- Jens Axboe --
A does a radix_tree_lookup() of cfqd_1 (darn typos) B does a kfree of the same cic found by A C does an alloc and gets the same cic as freed by B and inserts it in a different location. So that when we return to A, cic->key == cfqd_2 even though we did a lookup for cfqd_1. --
That I follow, my question was if A, B, and C refer to different processes but with a shared io context? I'm assuming that is correct... And it does look buggy. It looks my assumption of what slab rcu destroy did is WRONG, it should be replaced by a manual call_rcu() freeing instead. -- Jens Axboe --
Yeah, SLAB_DESTROY_BY_RCU should have a _HUGE_ comment explaining it, I'm sure this is not the first (nor the last) time people get that wrong. This would be one of those things that score very low on Rusty's API list. --
The only place where you'll have multiple processes involved with this at all is if they share io contexts. That is also why the bug isn't that critical, since it's not possible right now (CLONE_IO flag must be It should, SLAB_DESTROY_BY_RCU is definitely useful, but it is expected to be an 'easier' way of doing the call_rcu() manually. So it definitely needs more documentation. -- Jens Axboe --
Ok I gave it a go, how bad is this text?
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/include/linux/slab.h b/include/linux/slab.h
index f950a89..e049ddc 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -25,6 +25,32 @@
#define SLAB_CACHE_DMA 0x00004000UL /* Use GFP_DMA memory */
#define SLAB_STORE_USER 0x00010000UL /* DEBUG: Store the last owner for bug hunting */
#define SLAB_PANIC 0x00040000UL /* Panic if kmem_cache_create() fails */
+/*
+ * SLAB_DESTROY_BY_RCU - **WARNING** READ THIS!
+ *
+ * This delays freeing the SLAB page by a grace period, it does _NOT_
+ * delay object freeing. This means that if you do kmem_cache_free()
+ * that memory location is free to be reused at any time. Thus it may
+ * be possible to see another object there in the same RCU grace period.
+ *
+ * This feature only ensures the memory location backing the object
+ * stays valid, the trick to using this is relying on an independent
+ * object validation pass. Something like:
+ *
+ * rcu_read_lock()
+ * again:
+ * obj = lockless_lookup(key);
+ * if (obj) {
+ * if (!try_get_ref(obj)) // might fail for free objects
+ * goto again;
+ *
+ * if (obj->key != key) { // not the object we expected
+ * put_ref(obj);
+ * goto again;
+ * }
+ * }
+ * rcu_read_unlock();
+ */
#define SLAB_DESTROY_BY_RCU 0x00080000UL /* Defer freeing slabs to RCU */
#define SLAB_MEM_SPREAD 0x00100000UL /* Spread some memory over cpuset */
#define SLAB_TRACE 0x00200000UL /* Trace allocations and frees */
--
I think it looks good. The key point is this: "This delays freeing the SLAB page by a grace period, it does _NOT_ delay object freeing." which is right in the front of the text and with sample validation below. So you can add my acked-by to that, if you want. -- Jens Axboe --
Should be also in the kernel doc description of kmem_cache_create(), so it appears in manpages etc. --
There are 3 races here:
1) A continues with another object than intended
(requires CLONE_IO)
2) A does hlist_for_each_rcu() and races with B,C so that
we continue the iteration on a possibly unrelated list.
3) cic is freed after the !cic->key check.
I'm not familiar enough with the code yet to see if 3 really is an
possibility. But from what I can see there is nothing guarding its
existence.
--
All 3 require CLONE_IO, because if that is not set, there's a 1:1 mapping between a process and io context (no sharing occurs). -- Jens Axboe --
Would the following be an appropriate fix? It seems to me to be in
the same spirit as the existing check for s->ctor.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
slub_kmemcheck.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slub_kmemcheck.c b/mm/slub_kmemcheck.c
index 8620a8b..e07f62a 100644
--- a/mm/slub_kmemcheck.c
+++ b/mm/slub_kmemcheck.c
@@ -93,6 +93,6 @@ kmemcheck_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, void *object)
void
kmemcheck_slab_free(struct kmem_cache *s, void *object)
{
- if (!s->ctor)
+ if (!s->ctor && !(s->flags & SLAB_DESTROY_BY_RCU))
kmemcheck_mark_freed(object, s->objsize);
}
--
On Wed, Apr 2, 2008 at 6:08 PM, Paul E. McKenney In my opinion, no. It would fix the false positives, but would in fact also hide cases such as this one with cfq, e.g. the real cases of mis-use. I will try to look into this -- for now, I need to understand RCU first (I've seen your LWN articles -- great work! :-)) Kind regards, Vegard Nossum --
Hi Vegard, Yes, but we might as well put Paul's patch in now and remove that later when we have a proper fix, no? Well, maybe we can add two new states: RCU_FREED and RCU_VALIDATED? The object is flagged with the first one as soon as an object is handed over to kmem_cache_free() and the latter needs to hook to the validation phase of RCU (how is that done btw?). Then kmemcheck could even give a better error message: "RCU-freed object used without validation." And with delayed free for kmemcheck we discussed before, we'd hold on to the objects long enough to actually see these error conditions. Pekka --
Well, one approach would be to add an rcu_head to the kmem_cache
structure, along with a flag stating that the rcu_head is in use. I hope
that there is a better approach, as this introduces a lock roundtrip
into kmemcheck_slab_free(). Is there a better place to put the rcu_head?
Perhaps into the per-CPU allocator? But then we have to track which
CPU has which mark pending, and there are only so many bits in a byte,
as the SGI guys would be quick to point out
Which is why I chickened out and submitted the earlier crude patch.
Anyway, here is a -very- rough sketch of the stupid lock-based approach.
Thanx, Paul
struct kmem_cache {
. . . /* existing fields */
struct rcu_head rcu;
int rcu_available; /* rcu_head above is available for use. */
spinlock_t rcu_lock; /* which of course must be initialized. */
};
Then we need to add a couple of values to the enum shadow:
enum shadow {
... /* existing values */
SHADOW_RCU_FREED,
SHADOW_RCU_FREED_PENDING,
};
Then we have:
void
kmemcheck_slab_free(struct kmem_cache *s, void *object)
{
unsigned long flags;
if (s->ctor)
return;
if (likely(!(s->flags & SLAB_DESTROY_BY_RCU)))
kmemcheck_mark_freed(object, s->objsize);
spin_lock_irqsave(&s->rcu_lock, flags);
if (s->rcu_available) {
kmemcheck_mark_rcu_freed(object, s->objsize);
/* record the address somewhere... */
call_rcu(&s->rcu, kmemcheck_slab_free_rcu);
} else {
kmemcheck_mark_rcu_pending(object, s->objsize);
/* record the address somewhere... */
}
spin_unlock_irqrestore(&s->rcu_lock, flags);
}
void kmemcheck_slab_free_rcu(struct rcu_head *rcu)
{
unsigned long flags;
struct kmem_cache *s = container_of(rcu, struct kmem_cache, rcu);
void *shadow;
spin_lock_irqsave(&s->rcu_lock, flags);
/* recover the previously recorded object address. somehow */
kmemcheck_mark_freed(object, s->objsize);
if (/* there are pending requests */) {
/* get the previously recorded object addresses, somehow ...Hi Paul, On Wed, Apr 2, 2008 at 9:23 PM, Paul E. McKenney I suppose you haven't actually run kmemcheck on your machine? We're taking a page fault for _every_ memory access so a lock round-trip in the SLAB_RCU case is probably not that bad performance-wise :-). --
Coward that I am, no I have not. ;-) The thing that worries me even more than the lock is the need to keep track of the addresses. Then again, if you are taking a page fault on every access, perhaps not such a big deal to allocate the memory and link it into a list... But yikes!!! ;-) Thanx, Paul --
OK, so another approach would be to use a larger shadow block for SLAB_DESTROY_BY_RCU slabs, so that each shadow location would have enough room for an rcu_head and a size in addition to the flag. That would trivialize tracking, or, more accurately, delegate such tracking to the RCU infrastructure. Of course, the case where the block gets reallocated before the RCU grace period ends would also need to be handled (which my rough sketch yesterday did -not- handle, by the way...). There are a couple of ways of doing this. Probably the easiest approach is to add more state to the flag, so that the RCU callback would check to see if reallocation had already happened. If so, it would update the state to indicate that the rcu_head was again available, and would need to repost itself if the block had been freed again after being reallocated. The other approach would be to defer actually adding the block to the freelist until the grace period expired. This would be more accurate, but also quite a bit more intrusive. Thoughts? Thanx, Paul --
Hi Paul, Yeah, or just allocate some extra spaces for the RCU case and not overload the current shadow pages. But sounds good to me. We already talked about deferring the actual freeing in kmemcheck to better detect these use-after-free conditions with Vegard. So it's something that we probably want to do regardless of RCU. Pekka --
As long as we have an rcu_head for each memory block in the slab, I am Then it is especially important that the rcu_head be pre-allocated. Otherwise we could get into out-of-memory deadlocks where a free operation is blocked by an allocation operation. ;-) Thanx, Paul --
Though this case apparently does not qualify as misuse until such time as CLONE_IO is implemented. Glad you liked them! And Peter's suggested approach would indeed be more accurate. But I will still put my patch forward as a stopgap. ;-) Thanx, Paul --
On Wed, Apr 2, 2008 at 6:59 PM, Paul E. McKenney Hm, no. Objects with a ctor should retain its "initializedness" between allocations of the same object. This is one of the features of Yes, I realize that it will be needed as a temporary fix. It _is_ better to hide some real warnings in favour of the showing false ones. (But a TODO comment is in order, I believe.) Thanks. Vegard --
I am still confused. o The hlist_for_each_entry_safe_rcu() is under rcu_read_lock(). o The kmem_cache has SLAB_DESTROY_BY_RCU. o This means that a given slab should not be returned to the system until a grace period elapses. o So the bugginess (or not) of this code should not be affected by adding hlist_for_each_entry_safe_rcu() here. (I am not seeing the checks that would be needed to avoid something being kmem_cache_free()ed while being accessed, but might be missing something.) Thanx, Paul --
Hi Paul,
On Wed, Apr 2, 2008 at 1:40 PM, Paul E. McKenney
Yeah, that's what I thought too, that this is a SLUB bug but Peter
convinced me otherwise. SLUB keeps the _page_ around so the pointer
will be _valid_, although it might not be _your_ pointer so the caller
needs to do some validation step. Or at least that's how I understood
what Peter was saying.
Pekka
--
Correct, that is always how i understood SLAB_DESTROY_BY_RCU to work. Does SLAB (as opposed to SLUB) do it differently? --
No, SLAB works this way as well. Pekka --
Agreed, when looking at this code its not making sense. cfq_cic_lookup() is also mightily confusing. Only the actual radix_tree_lookup() call is protected by RCU, I'm not seeing what guarantees the existance of cic after rcu_read_unlock(). Nor does it do a validation check to see if cic->key == cfqd, something that would be needed when using SLAB_DESTROY_BY_RCU. This is most fishy code. --
It checks cic->key != NULL, it's set to NULL when it's invalid. Not sure if it could transition to some other cfqd and radix_tree_lookup() still returning the cic for the old key, if so it would need a check for ->key == cfqd there as well (like in the one-hit cache above, it checks ->key Well, it's definitely not straight forward ;-) -- Jens Axboe --
