Hi guys,
Although this doesn't seem like cause for alarm (as per the analysis),
it may still be a good 2.6.26 candidate as we should have a few more
weeks of testing left.It should definitely go in -mm with the lockless pagecache patch.
Thanks,
Nick
Good catch!!!
--
On Fri, 13 Jun 2008 05:03:45 +1000
oic, that stuff got moved from the synchronous case into the RCU callback
case.
--
Yeah, it was annoying because the real change moved the first references
of a couple of those up. Then we have to move all of them to keep them
in one place.The changelog I submitted looks like utter gibberish, btw. Rewritten one
here which should be a bit better.We shrink a radix tree when its root node has only one child, in the left
most slot. The child becomes the new root node. To perform this operation
in a manner compatible with concurrent lockless lookups, we atomically
switch the root pointer from the parent to its child.However a concurrent lockless lookup may now have loaded a pointer to the
parent (and is presently deciding what to do next). For this reason, we
also have to keep the parent node in a valid state after shrinking the
tree, until the next RCU grace period -- otherwise this lookup with the
parent pointer may not do the right thing. Notably, we need to keep the
child in the left most slot there in case that is requested by the lookup.This is all pretty standard RCU stuff. It is worth repeating because
in my eagerness to obey the radix tree node constructor scheme, I had
broken it by zeroing the radix tree node before the grace period.What could happen is that a lookup can load the parent pointer, then
decide it wants to follow the left most child slot, only to find the
slot contained NULL due to the concurrent shrinker having zeroed the
parent node before waiting for a grace period. The lookup would return
a false negative as a result.Fix it by doing that clearing in the RCU callback. I would normally want
to rip out the constructor entirely, but radix tree nodes are one of those
places where they make sense (only few cachelines will be touched soon
after allocation).This was never actually found in any lockless pagecache testing or by
the test harness, but by seeing the odd problem with my scalable vmap
rewrite. I have not tickled the test harness into reproducing it yet,
but I'll keep working at it.Fortunately, it is not a probl...
On Fri, 13 Jun 2008 05:03:45 +1000
OK, I give up. A cannot spot what you actually changed amongst all the
code motion?--
The two relevant hunks:
@@ -124,6 +175,17 @@ static void radix_tree_node_rcu_free(str
{
struct radix_tree_node *node =
container_of(head, struct radix_tree_node, rcu_head);
+
+ /*
+ * must only free zeroed nodes into the slab. radix_tree_shrink
+ * can leave us with a non-NULL entry in the first slot, so clear
+ * that here to make sure.
+ */
+ tag_clear(node, 0, 0);
+ tag_clear(node, 1, 0);
+ node->slots[0] = NULL;
+ node->count = 0;
+
kmem_cache_free(radix_tree_node_cachep, node);
}@@ -930,11 +939,6 @@ static inline void radix_tree_shrink(str
newptr = radix_tree_ptr_to_indirect(newptr);
root->rnode = newptr;
root->height--;
- /* must only free zeroed nodes into the slab */
- tag_clear(to_free, 0, 0);
- tag_clear(to_free, 1, 0);
- to_free->slots[0] = NULL;
- to_free->count = 0;
radix_tree_node_free(to_free);
}
}So instead of clearing the first slot on free, we delay it one grace
period and clear it later. So that anybody already having a pointer to
the node can correctly resolve the item.--
Ouch - good one, I'll back-port it to -rt.
This reminds me, I should get back to my radix-tree path compression
--
| Greg Smith | PostgreSQL pgbench performance regression in 2.6.23+ |
| Greg KH | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Jan Engelhardt | intel iommu (Re: -mm merge plans for 2.6.23) |
| Pavel Roskin | ndiswrapper and GPL-only symbols redux |
git: | |
| Tom Herbert | SO_REUSEPORT? |
| Gerrit Renker | [PATCH 15/37] dccp: Set per-connection CCIDs via socket options |
| David Miller | [GIT]: Networking |
| Frans Pop | svc: failed to register lockdv1 RPC service (errno 97). |
| Miklos Szeredi | [patch 01/10] vfs: add path_create() and path_mknod() |
| Erez Zadok | [PATCH 02/14] VFS: rename do_splice_to/from to vfs_splice_* and export symbols |
| Miklos Szeredi | Re: [PATCH 2/2] SELinux: display SELinux mount options in /proc/mounts |
| Barry Naujok | [PATCH 1/7] XFS: Name operation vector for hash and compare |
