This is not yet intended for commit (see below), but rather a snapshot of what I'm working with currently. And to get some feedback on the direction where I'm going. The first patch changes flow cache reference counting fundamentally. It no longer assumes that the cached objects are GC'd later, but instead calls the virtualized delete method to delete (or drop it's reference). This does now mean that ->delete() call can be slowish. The plan is to add ->check() which is called when ever the cache is flushed, add all deleted entries to GC list, and have the main flush routine delete the GC list entries. I wanted to suggest this because: - xfrm_policy_put() is currently never guaranteed to be fast, instead it can always result to slow path. Only the flow cache wanted to have fast free when bh is disabled, and this was made possible with the policy GC ensuring that cache is flushed before policies are freed. - now that we can cache bundles or policies, it makes sense to have more selective flushing; otherwise we lose some of the speed ups. This also means that flushing gets faster, and is needed very rarely (sensible points are GC'ing bundles and when interface goes down) - now we don't have to do two periodic/delayed GC's: one for bundles, and one for policies. instead we can have central code for that in the flow cache. this also means that the flow cache hash is the owner of bundle, and when the flow cache entry is expired the bundle is dst_free()'d (which we would want to do anyway). no need to keep bundles in separate global (or policy specific) list Does this sound acceptable approach? What the current patch set is missing: - delayed deletion of flow cache objects - doing check() on flush for each object - removing the policy GC - some of the other flow cache improvements from my original patch Also, we might want to cache dummy bundles in xfrm_check_policy(). The reason is that if we matched sub policy originally, we always ...
This allows to validate the cached object before returning it.
It also allows to destruct object properly, if the last reference
was held in flow cache. This is also a prepartion for caching
bundles in the flow cache.
In return for virtualizing the methods, we save on:
- not having to regenerate the whole flow cache on policy removal:
each flow matching a killed policy gets refreshed as the getter
function notices it smartly.
- we do not have to call flow_cache_flush from policy gc, since the
flow cache now properly deletes the object if it had any references
This also means the flow cache entry deletion does more work. If
it's too slow now, may have to implement delayed deletion of flow
cache entries. But this is a save because this enables immediate
deletion of policies and bundles.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
include/net/flow.h | 17 ++++++--
include/net/xfrm.h | 2 +
net/core/flow.c | 102 ++++++++++++++++++++++++----------------------
net/xfrm/xfrm_policy.c | 105 +++++++++++++++++++++++++++++++-----------------
4 files changed, 136 insertions(+), 90 deletions(-)
diff --git a/include/net/flow.h b/include/net/flow.h
index 809970b..68fea54 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -86,11 +86,20 @@ struct flowi {
struct net;
struct sock;
-typedef int (*flow_resolve_t)(struct net *net, struct flowi *key, u16 family,
- u8 dir, void **objp, atomic_t **obj_refp);
-extern void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family,
- u8 dir, flow_resolve_t resolver);
+struct flow_cache_entry_ops {
+ struct flow_cache_entry_ops ** (*get)(struct flow_cache_entry_ops **);
+ void (*delete)(struct flow_cache_entry_ops **);
+};
+
+typedef struct flow_cache_entry_ops **(*flow_resolve_t)(
+ struct net *net, struct flowi *key, u16 family,
+ u8 dir, struct flow_cache_entry_ops **old_ops);
+
+extern struct flow_cache_entry_ops **flow_cache_lookup(
+ struct net *net, ...From: Timo Teras <timo.teras@iki.fi> I'm concerned about the new costs being added here. We have to now take the policy lock as a reader every time the flow cache wants to grab a reference. So we now have this plus the indirect function call new overhead. Maybe we can make the dead state check safe to do asynchronously somehow? I wonder if the policy layer is overdue for an RCU conversion or similar. Anyways, something to think about. Otherwise I don't mind these changes. --
If we want to have the flow cache generic, we pretty much need indirect calls. But considering that it might make sense to cache bundles, or "xfrm cache entries" on all flow directions (so we can I looked at the code and the policy lock is not needed much anymore. I think it was most heavily used to protected ->bundles which is now removed. But yes, I also previously said that ->walk.dead should probably be converted to atomic_t. It is only written once when the policy is killed. So we can make it accessible without the lock. Considering that the whole cache was broken previously, and we needed to take write lock on policy for each forwarded packet, it does not sound that bad. Apparently locally originating traffic directly to xfrm destination (not via gre) would get cached on the socket dst cache and avoids the xfrm_lookup on fast path entirely(?). We can get away from the per-cpu design with RCU hash. But I think we still need to track the hash entries similar to this. Though, there's probably some other tricks doable with RCU which I'm not all familiar with. I will take a quick look on the rcu thingy Ok, I'll add "convert walk.dead to atomic_t" so we can access it without a lock. I did also notice that the policy locking is not right exactly. E.g. migration can touch templates, and we read templates currently without locks. So I think bundle creation should be protected with policy read lock. But even this can probably be avoided by RCU type bundle creation. We just take bundle genid before starting to create it, create bundle, and check if genid was changed while doing this we retry. We might even get away from policy->lock all together. In most places it's only used to protect walk.dead. And bundle creation can be synchronizes as above. The only remaining place seems to be the timer function. I think it's safe to remove locking there too, and synchronize using timer deletion. All this is because any changes to policy will result in xfrm_policy ...
In fact, now that I read this again, we don't even need to grab the lock to perform the deadness check. This is because the existing never did it anyway. The liveliness is guaranteed by the policy destruction code doing a synchronous cache flush. Timo, what was the reason for getting rid of the synchronous flush again? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
No, just having the flush call is not enough to guarantee liveliness. The flushing happens in delayed work, and the flows might be in use before the flush has been finished or even started. I was also hoping to move the "delayed" deletion part to flow cache core so the code would be shared with policies and bundles. As stated before, we don't really need lock for the 'dead' check. It's only written once, and actually read without lock in some other places too. And all the other places that do take the lock, release it right away making the resulting dead check result "old" anyway. Looks to me that the whole policy locking is not up-to-date anymore. Apparently the only place that actually needs it is the migration code (which just happens to be broke anyway since bundle creation does not take read lock). But it could be relatively easily changed to replace policy with new templates... and the whole policy stuff converted to RCU. However, now that I've almost done the code ready. I'm thinking if this is such a good idea or not. I was hoping to have bundles always in the flow cache, but it's not sufficient. In case we have socket bound policy that results in a bundle, we might need create bundles outside the flow cache. It turns out the generic flow cache might not be so easily done that could host bundles and policies. Or at least the shared code would not be as much as hoped for. Given that it makes also sense to store other objects for outgoing stuff (we might need reference to multiple policies if matching a sub and main policy), it might be a consideration to make the flow cache specialized to contain those objects. Or do we have other possible users for the flow cache? - Timo --
No that's not the point. The lock is not there to protect reading ->dead, which is atomic anyway. It's there to guarantee that you don't increment the ref count on a dead policy. For the flow cache we didn't need this because the policy code would flush the cache synchronously so we can always grab a ref count safely as long as BH is still off. So if you leave the flow cache flushing as is, then we should still be able to do the it without holding the lock, or checking I wouldn't be surprised that the migration code is buggy. But that's orthogonal to your patch. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
No it's pretty much for reading ->dead what I can tell. That's how ->dead is accessed in multiple other places too. That's the only reason I added the locking in my new code. But yes, it's pointless. ->dead access is atomic, except where it's read/written to in xfrm_policy_kill. It's trivially Previous code did not do any locking before adding a The old code could return policy object that was killed. It relies solely on the fact that policy gc will flush the flow cache. Between the time of 'policy killed' and 'policy gc ran' the old code would return policy object that is marked dead. The change is an improvement in this regard as the flow cache objects get refreshed immediately after marking a policy dead. The reason for policy GC calling flush was there for two reasons: - purging the stale entries - making sure that refcount of policy won't go to zero after releasing flow cache's reference (because the flow cache did only atomic_dec but did not call destructor) Both issues are handled otherwise in the patch. By refreshing stale entries immediately. Or calling virtual destructor when the object gets deleted. Thus the slow flushing is not needed We can still drop the locking, as ->dead can be made atomic_t. Checking ->dead improves cache's speed to react to policy object changes. And the virtual ->get is especially needed for bundles Yeah. Just a note that the road map to RCU policies is trivial and fixes some races in locking currently. - Timo --
Which is fine. I'd rather have that than this new behaviour which adds a lock. We don't delete policies all the time, so Let's step back one second. It's best to not accumulate unrelated changes in one patch. So is there a reason why you must remove the synchronous flow cache flushing from the policy destruction I really see no point to optimising for such an unlikely case Please do one change at a time. Let's just focus on the original issue of the bundle linked list for now. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Yes, I'm splitting up the patch to more fine grained pieces. The synchronous flow cache flushing does not have to be removed, The only reason why it needs to be atomic is because of xfrm_policy_kill() which writes '1' and checks if it was zero previously. Since the idea is to get rid of the policy lock, we can turn ->dead flag to atomic_t and use atomic_xchg for that. Agreed. But as the lockless check is cheap, why not to have it. And some system do get policy changes quite a bit. IKE daemon sometimes is configured to create policies on-demand so this does Yes. The way to do that is just a bit long. I've already added some other stuff and split existing stuff in the patch. It's currently seven patches. I'm still tracking one reference leak, but I'll send in the new set soonish. - Timo --
I don't see the point. As long as the data paths do not take the lock changing this doesn't buy us much. You're still making that cacheline exclusive. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
To my understanding declaring an atomic_t, or reading it with atomic_read does not make cache line exclusive. Only the atomic_* writing to it take the cache line. And since this is done exactly once for policy (or it's a bug/warn thingy) it does not impose significant performance issue. But looking at the code more. The check should not be needed. xfrm_policy_kill() is only called if the entry is removed from the hash list, which can happen only once. Do you think we can just change it to unconditionally writing to "policy->walk.dead = 1;" and be done with that? Alternatively, we can move the ->dead check to be done while holding the hash lock to guarantee no one else is writing simultaneously. --
I was talking about the lock vs. atomic_xchg in xfrm_policy_kill. There is practically no difference for that case. Yes, on the read side the lock is a completely different beast compared to atomic_read, but I don't see how you can safely replace lock if (!dead) take ref unlock See above. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Because the lock is not needed to take ref. You can take ref as long as someone else is also holding a valid reference. The flow cache keeps a reference to each object it is holding, thus we can always take a reference if we find an object there. This is what is being done in the current code, and can be still done in the new code. The only reason my patch had the lock, was for the dead check. Since it can be checked without locks, the locking can be just removed. The dead check is still an improvement: we find outdated cache entries without doing a full flush and we find them faster than using a full flush. The old code would use policy entries with dead flag set, and happily return and use them until the policy gc had an opportunity to run. --
I'm not talking about the flow cache. The current flow cache code doesn't even take the lock. I'm talking about the other places that you have to convert in order to make this into an atomic_t. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Did you check the other places?
All other places do:
fox x policies:
lock(x)
pol_dead |= x->walk.dead;
unlock(x)
if pol_dead
abort
or similar.
And some cases don't even bother to lock the policy currently
when reading walk.dead.
All of the code treats the walk.dead as a hint. It does not need
strong synchronization with a lock.
--
Well then converting it to an atomic_t is completely pointless. -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Yes, I came to same conclusion. The only thing I thought justifying it, was the xfrm_policy_kill() doing the check of the old value. But as noted few mails ago, it's not necessary. So I'll just go ahead and remove all locking from the read side, and move the xfrm_policy_kill to use plain write. --
No you can't make it a plain write in xfrm_policy_kill. The same policy may be killed simultaneously, by the timer and user action. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
So we fix up all the callers of xfrm_policy_kill to check properly result of __xfrm_policy_unlink(). Since the policy can be only once deleted from the hashes (it's protected by xfrm_policy_lock) return value of __xfrm_policy_unlink() can be used to give responsibility of calling xfrm_policy_kill() exactly once. I thought this was already being done, but apparently it's not the case. --
Actually you're right. This should be the case as otherwise we'd be triggering that WARN_ON. Since it hasn't triggered in the five years that it's been around, I suppose we can now remove it along with the lock. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
__xfrm_lookup() is called for each packet transmitted out of
system. The xfrm_find_bundle() does a linear search which can
kill system performance depending on how many bundles are
required per policy.
This modifies __xfrm_lookup() to store bundles directly in
the flow cache. If we did not get a hit, we just create a new
bundle instead of doing slow search. This means that we can now
get multiple xfrm_dst's for same flow (on per-cpu basis).
Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
include/net/flow.h | 6 +-
include/net/xfrm.h | 10 +-
net/core/flow.c | 4 +-
net/ipv4/xfrm4_policy.c | 22 --
net/ipv6/xfrm6_policy.c | 31 ---
net/xfrm/xfrm_policy.c | 690 ++++++++++++++++++++++++-----------------------
6 files changed, 360 insertions(+), 403 deletions(-)
diff --git a/include/net/flow.h b/include/net/flow.h
index 68fea54..9264c82 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -93,12 +93,12 @@ struct flow_cache_entry_ops {
};
typedef struct flow_cache_entry_ops **(*flow_resolve_t)(
- struct net *net, struct flowi *key, u16 family,
- u8 dir, struct flow_cache_entry_ops **old_ops);
+ struct net *net, struct flowi *key, u16 family, u8 dir,
+ struct flow_cache_entry_ops **old_ops, void *ctx);
extern struct flow_cache_entry_ops **flow_cache_lookup(
struct net *net, struct flowi *key, u16 family,
- u8 dir, flow_resolve_t resolver);
+ u8 dir, flow_resolve_t resolver, void *ctx);
extern void flow_cache_flush(void);
extern atomic_t flow_cache_genid;
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index cb8934b..6ce593f 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -267,7 +267,6 @@ struct xfrm_policy_afinfo {
xfrm_address_t *saddr,
xfrm_address_t *daddr);
int (*get_saddr)(struct net *net, xfrm_address_t *saddr, xfrm_address_t *daddr);
- struct dst_entry *(*find_bundle)(struct flowi *fl, struct xfrm_policy *policy);
void (*decode_session)(struct ...