Re: [patch] radix-tree: fix small lockless radix-tree bug

Previous thread: [patch 2.6.26-rc5] gpio: sysfs interface (updated) by David Brownell on Thursday, June 12, 2008 - 2:53 pm. (9 messages)

Next thread: [patch 2.6.26-git] pcmcia: simplify rsrc_nonstatic attributes by David Brownell on Thursday, June 12, 2008 - 3:13 pm. (2 messages)
To: <akpm@...>, Peter Zijlstra <peterz@...>, <linux-kernel@...>, <paulmck@...>
Date: Thursday, June 12, 2008 - 3:03 pm

Hi guys,

Although this doesn't seem like cause for alarm (as per the analysis),
it may still be a good 2.6.26 candidate as we should have a few more
weeks of testing left.

It should definitely go in -mm with the lockless pagecache patch.

Thanks,
Nick

To: Nick Piggin <nickpiggin@...>
Cc: <akpm@...>, Peter Zijlstra <peterz@...>, <linux-kernel@...>
Date: Friday, June 13, 2008 - 3:53 am

Good catch!!!

--

To: Nick Piggin <nickpiggin@...>
Cc: <peterz@...>, <linux-kernel@...>, <paulmck@...>
Date: Thursday, June 12, 2008 - 3:34 pm

On Fri, 13 Jun 2008 05:03:45 +1000

oic, that stuff got moved from the synchronous case into the RCU callback
case.
--

To: Andrew Morton <akpm@...>
Cc: <peterz@...>, <linux-kernel@...>, <paulmck@...>
Date: Thursday, June 12, 2008 - 3:47 pm

Yeah, it was annoying because the real change moved the first references
of a couple of those up. Then we have to move all of them to keep them
in one place.

The changelog I submitted looks like utter gibberish, btw. Rewritten one
here which should be a bit better.

We shrink a radix tree when its root node has only one child, in the left
most slot. The child becomes the new root node. To perform this operation
in a manner compatible with concurrent lockless lookups, we atomically
switch the root pointer from the parent to its child.

However a concurrent lockless lookup may now have loaded a pointer to the
parent (and is presently deciding what to do next). For this reason, we
also have to keep the parent node in a valid state after shrinking the
tree, until the next RCU grace period -- otherwise this lookup with the
parent pointer may not do the right thing. Notably, we need to keep the
child in the left most slot there in case that is requested by the lookup.

This is all pretty standard RCU stuff. It is worth repeating because
in my eagerness to obey the radix tree node constructor scheme, I had
broken it by zeroing the radix tree node before the grace period.

What could happen is that a lookup can load the parent pointer, then
decide it wants to follow the left most child slot, only to find the
slot contained NULL due to the concurrent shrinker having zeroed the
parent node before waiting for a grace period. The lookup would return
a false negative as a result.

Fix it by doing that clearing in the RCU callback. I would normally want
to rip out the constructor entirely, but radix tree nodes are one of those
places where they make sense (only few cachelines will be touched soon
after allocation).

This was never actually found in any lockless pagecache testing or by
the test harness, but by seeing the odd problem with my scalable vmap
rewrite. I have not tickled the test harness into reproducing it yet,
but I'll keep working at it.

Fortunately, it is not a probl...

To: Nick Piggin <nickpiggin@...>
Cc: <peterz@...>, <linux-kernel@...>, <paulmck@...>
Date: Thursday, June 12, 2008 - 3:31 pm

On Fri, 13 Jun 2008 05:03:45 +1000

OK, I give up. A cannot spot what you actually changed amongst all the
code motion?

--

To: Andrew Morton <akpm@...>
Cc: Nick Piggin <nickpiggin@...>, <linux-kernel@...>, <paulmck@...>
Date: Thursday, June 12, 2008 - 3:38 pm

The two relevant hunks:

@@ -124,6 +175,17 @@ static void radix_tree_node_rcu_free(str
{
struct radix_tree_node *node =
container_of(head, struct radix_tree_node, rcu_head);
+
+ /*
+ * must only free zeroed nodes into the slab. radix_tree_shrink
+ * can leave us with a non-NULL entry in the first slot, so clear
+ * that here to make sure.
+ */
+ tag_clear(node, 0, 0);
+ tag_clear(node, 1, 0);
+ node->slots[0] = NULL;
+ node->count = 0;
+
kmem_cache_free(radix_tree_node_cachep, node);
}

@@ -930,11 +939,6 @@ static inline void radix_tree_shrink(str
newptr = radix_tree_ptr_to_indirect(newptr);
root->rnode = newptr;
root->height--;
- /* must only free zeroed nodes into the slab */
- tag_clear(to_free, 0, 0);
- tag_clear(to_free, 1, 0);
- to_free->slots[0] = NULL;
- to_free->count = 0;
radix_tree_node_free(to_free);
}
}

So instead of clearing the first slot on free, we delay it one grace
period and clear it later. So that anybody already having a pointer to
the node can correctly resolve the item.

--

To: Nick Piggin <nickpiggin@...>
Cc: <akpm@...>, <linux-kernel@...>, <paulmck@...>
Date: Thursday, June 12, 2008 - 3:15 pm

Ouch - good one, I'll back-port it to -rt.

This reminds me, I should get back to my radix-tree path compression

--

Previous thread: [patch 2.6.26-rc5] gpio: sysfs interface (updated) by David Brownell on Thursday, June 12, 2008 - 2:53 pm. (9 messages)

Next thread: [patch 2.6.26-git] pcmcia: simplify rsrc_nonstatic attributes by David Brownell on Thursday, June 12, 2008 - 3:13 pm. (2 messages)