Instead of doing O(n) xfrm_find_bundle() call per-packet, cache the previous lookup results in flow cache. The flow cache is updated to be per-netns and more generic. The flow cache no longer holds reference (which was not really used in the first place, as it depended on garbage collection). Now this is more explicit. The cache validity is maintained as follows: - On policy insert, the whole cache is invalideted by incrementing generation id. No synchronization required as genid checks make sure no old objects are dereferenced. - On policy removal from lists the object is marked deleted, and this invalidated the policy pointer. - Policy object deletion requires explicit synchronization to remove stale pointers before can actually free the policy objects. xfrm_policy_gc_task() synchronizes the cache. - Bundle creation and expiry is reflected in xfrm_bundle_ok() check before any bundle from cache is used. - Bundle deletion is done by incrementing policy->bundles_genid and synchronizing with other cpu's so there is no stale bundle pointers left. After this the bundle objects can be safely deleted. Basic testing done on 2.6.32 based kernel. This gives a boost of several magnitudes on transmit path. Signed-off-by: Timo Teras <timo.teras@iki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> --- include/net/flow.h | 39 ++++- include/net/netns/xfrm.h | 4 + include/net/xfrm.h | 1 + net/core/flow.c | 342 ++++++++++++++++++-------------------- net/ipv6/inet6_connection_sock.c | 6 +- net/xfrm/xfrm_policy.c | 271 +++++++++++++++++++++--------- 6 files changed, 394 insertions(+), 269 deletions(-) diff --git a/include/net/flow.h b/include/net/flow.h index 809970b..814a9d2 100644 --- a/include/net/flow.h +++ b/include/net/flow.h @@ -8,6 +8,9 @@ #define _NET_FLOW_H #include <linux/in6.h> +#include <linux/notifier.h> +#include <linux/timer.h> +#include <linux/slab.h> #include ...
This only works well if the traffic doesn't switch bundles much. But if that were the case then the number of bundles is likely to be small anyway. IOW I think if we're doing this then we should go the whole distance and directly cache bundles instead of policies in the flow cache. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
The problem is if I have multipoint gre1 and policy that says "encrypt all gre in transport mode". Thus for each public address, I get one bundle. But the xfrm_lookup() is called for each packet because ipgre_tunnel_xmit() calls ip_route_output_key() on per-packet basis. Then we cannot maintain policy use time. But if it's not a requirement, we could drop the policy from cache. Also. With this and your recent flowi patch, I'm seeing pmtu issues. Seems like xfrm_bundle_ok uses the original dst which resulted in the creation of the bundle. Somehow that dst does not get updated with pmtu... but the new dst used in next xfrm_lookup for same target does have proper mtu. I'm debugging right now why this is happening. Any ideas? - Timo --
But if your traffic switches between those tunnels on each packet, I don't see why we can't maintain the policy use time if we did The dynamic MTU is always maintained in a normal dst object in the IPv4 routing cache. Each xfrm_dst points to such a dst through xdst->route. If you were looking at the xfrm_dst's own MTU then that may well cause problems. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
I figured the root cause. The original dst gets expired rt_genid goes old. But xfrm_dst does not notice that so it won't create a new bundle. xfrm_bundle_ok calls dst_check, but dst->obsolete = 0, and ipv4_dst_check is a no-op anyway. Somehow the rtable object should be able to tell back to xfrm that the dst is not good anymore. Any ideas? - Timo --
Checked ipv6, it does like xfrm: sets obsolote to -1 and on dst_check checks the genid. We need to do same in for ipv4. I just wrote an hack, and tested it. It solves the pmtu issues. I will post a proper patch soon. - Timo --
Here's how things go to my understanding. When we are forwarding packets, each packet goes through __xfrm_route_forward(). Or if we are sending from ip_gre (like in my case), the packets go through ip_route_output_key(). Both of these call xfrm_lookup() to get the xfrm_dst instead of the real rtable dst. This is done on per-packet basis. Basically, the xfrm_dst is never kept referenced directly on either code path. Instead it needs to be xfrm'ed per-packet. Now, these created xfrm_dst gets cached in policy->bundles with ref count zero and not deleted until garbage collection is required. That's how xfrm_find_bundle tries to reuse them (but it does O(n) lookup). On the gre+esp case (should apply to any forward path too) the caching of bundle speeds up the output path considerably as there can be a lot of xfrm_dst's. Especially if it's a wildcard transport mode policy which basically get's bundles for each destination. So there can be a lot of xfrm_dst's all valid, since they refer to unique xfrm_state on per-destination basis. Now, even more, since xfrm_dst needs to be regenerated if the underlying ipv4 rtable entry has expired there can be a lot of bundles. So the linear search is a major performance killer. Btw. it looks like the xfrm_dst garbage collection is broke. It's only garbage collected if a network device goes down, or dst_alloc calls it after gc threshold is exceeded. Since gc threshold got dynamic not long ago, it can be very big. This causes stale xfrm_dst's to be kept alive, and what is worse their inner rtable dst is kept referenced. But as they were expired and dst_free'd the dst core "free still referenced" dst's goes through that list over and over again and will kill the whole system performance when the list grows long. I think as a minimum we should add 'do stale_bundle' check Actually no. As the pmtu case showed, it's more likely that xfrm_dst needs to be regenerated, but the policy stays the same since policy db isn't touched that ...
Yes but the way you're caching it in the policy means that it only helps if two consecutive packets happen to match the same bundle. If your traffic mix was such that each packet required a different bundle, then we're back to where we started. That's why I was asking for you to directly cache the xfrm_dst objects in the flow cache. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
But it always matches. The caching happens using the inner flow. Inner flow always matches with the same bundle unless the bundle expires or goes stale. What happens is that I get multiple cache entries per-inner flow each referencing to the same bundle. And this is even more useful with the gre+esp. No matter what I send inside gre (with private IPs), it ends up being routed to more limited set of outer public IP parties. The bundle lookup happens with the public-IP flow, so the speed up works True. But if we go and prune a bundle due to it being bad or needing garbage collection we need to invalidate all bundles pointers, and we cannot access the back-pointer. Alternatively we need to keep xfrm_dst references again in the flow cache requiring an expensive iteration of all flow cache entries whenever a xfrm_dst needs to be deleted (which happens often). - Timo --
Sorry for being slow, but if it always matches, doesn't that mean you'll only have a single bundle in the policy bundle list? IOW why do we need this at all? Or have I misread your patch? You *are* proposing to cache the last Why can't you access the back-pointer? You should always have So does the IPv4 routing cache. I think what this reflects is just that the IPsec garbage collection mechanism is broken. There is no point in doing a GC on every dst_alloc if we know that it isn't going to go below the threshold. It should gain a minimum GC interval like IPv4. Or perhaps we can move the minimum GC interval check into the dst core. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
No. The bundle created for specific flow, matches always later that flow. With transport mode wildcard policy, e.g. single policy saying encrypt traffic with protocol X to all IP-address, you get a separate bundle per-public IP destination. Bundle matches only that specific IP since it gets a separate xfrm_state. But you can talk to all the hosts in internet using same policy, so you can end up with a whole lot of valid bundles in the same policy. I'm not sure how this works in tunnel mode. It might be that single bundle can be reused for all packets. But I think the same applies to tunnel mode. Since afinfo->fill_dst() puts the inner flow to bundle xfrm_dst->u.rt.fl, which is later compared against the inner flow by afinfo->find_bundle(). I think this implies that for each flow traveling inside tunnel, it gets it's separate xfrm_dst, so again you end up with a whole lot of Yes and no. The bundle used is cached on per-flow basis. The flow cache can have lot of entries each referring to Yes, I reported xfrm_dst GC being broke in the earlier mail. But keeping policy and bundle in cache is still a win. If we kill the xfrm_dst due to GC, we will also lose the policy the flow matched. We might need to kill xfrm_dst due to the inside dst going old, but the flow cache would still give hit with policy info (but no bundle) the next a packet comes in using the same flow. - Timo --
OK so I did misread your patch. In fact it is already doing exactly what I was suggesting, I'll We were in complete agreement all along :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Just figured that's it's easier to explain with an example. We have SPD: 10.1.0.0/16 - 10.2.0.0/16 tunnel 1.2.3.4 - 4.3.2.1 Now we get n+1 clients to connect to server in 10.2.0.1. They each get separate bundle, since the xfrm_dst will be created and search using flow id's like: src 10.1.x.x dst 10.2.0.1 So there's one xfrm_policy and xfrm_state, but n+1 xfrm_dst's. Since the flow cache caches the result of lookups on the inner flow "10.1.x.x->10.2.0.1" basis, it always returns matching valid bundle in O(1) time unless the xfrm_dst expired. Currently it's looked up with O(n) search in find_bundle. Same thing happens with wildcard transport mode SPD's. E.g. SPD: 0.0.0.0/0 - 0.0.0.0/0 proto gre, transport We are talking with gre to n+1 tunnel destinations. We get n+1 xfrm_dst's in that xfrm_policy. Flow cache works on inner flow using flows like: src 1.2.3.4 dst 4.3.2.1 proto gre And can keep in cache the right policy always, and the bundle to use as long as it stays valid. Hopefully this explains why I think the patch is useful. - Timo --
A back-pointer does not require an O(n) lookup. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
This doesn't work. The flow cache operates without locking as it is a per-cpu cache. To make this work you must ensure that you stay on the same CPU or use some other form of synchronoisation if you write to the object returned. AFAICS there is no synchronisation here and you're writing to fce. So you'll need to disable preemption around the bit that touches fce. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
But flow_cache_lookup disables pre-emption until _put is called. So it should work. Would there be a cleaner way? However, now I figured that we need to make walk.dead atomic because it's read some times without taking the policy lock. - Timo --
Previously the flow cache returned a policy directly which works because whenever we modify that policy we'd take the appropriate lock. Your patch changes it so that it now returns an fce. But nothing is guarding the code that modifies fce. So two CPUs may end up modifying the same fce. However, it would appear that this race could be harmless, provided that you are careful about dereferencing fce->policy and fce->dst. IOW, this is not OK if (fce->policy) use fce->policy; and this should work policy = fce->policy; if (policy) use policy; Actually on second tought, even this isn't totally safe. Who is taking a reference count on the policy and dst? I see a ref count on the fce, but nothing on fce->dst and fce->policy. Do you have an implicit reference on them? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
But I changed that. the flow cache now does *not* call local_bh_enable if it returns something. This is deferred until corresponding _put call. So bh's are disable while we are touching the lookup results. It'd probably make sense to remove that. And require _lookup to be called with bh disabled so it's more obvious that bh's are Not a race. We need to keep bh's disabled while touching fce Noone. When policy and dst is on cache there's no reference. The cache generation id's ensure that the object exists when they are in cache. It might make sense to add references to both objects and do a BUG_ON if the flow cache flusher would need to delete an object. I guess this would be the proper way, since that's how the dst stuff works too. - Timo --
I'm sorry but making a function like flow_cache_lookup return with That would be better but it's still hacky. Proper reference The cache genid is not enough: CPU1 CPU2 check genid == OK update genid kill policy kfree on policy use policy == BOOM Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Well, the cache entry is still referenced only very shortly, I don't see why keeping bh disabled why doing it is considered a hack. Refcounting the cache entries is trickier. Though, it could be used to optimize the update process: we could safely This. And that the cache is synchronized by flow_cache_flush executing stuff on other cpu's, ensuring that it's not running The sequence goes currently. CPU1 CPU2 disable_bh check genid == OK update genid call cache_flush blocks use policy == OK and take refcount enable_bh cache_flush smpcall executes and ublocks cpu2 returns from cache_flush kill policy kfree on policy - Timo --
Well we had a nicely type-agnostic cache which is self-contained, but your patch is bleeding generic code into xfrm_policy.c, that's why I felt it to be hacky :) Anyway I see how your scheme works now as far as object life is concerned, and I agree that it is safe. However, I wonder if we could do it while still leaving all the object life-cycle management stuff (and the BH disabling bits) in flow.c The crux of the issue is that you now have two objects to track instead of one. As the direction is a key in the lookup, we're really only worried about the outbound case here. So how about going back to what I suggested earlier, and keeping a back-pointer from xfrm_dst to the policy? Of course xfrm_dst would also hold a ref count on the policy. You'd only have to do it for the top-level xfrm_dst. It does mean that you'll need to write a different resolver for outbound vs. inbound/forward, but that makes sense because we only use bundles for outbound policies. What do you think? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
When I first started reading the code, I got confused slightly on how the garbage collection is happening. What I did not like in current is the atomic_dec() in flow.c that does not check if it was turned to zero. Because on policy objects it means you need to delete it (which would a bug if it happened in flow.c; the policy gc calls flush before releasing it's own reference), but on xfrm_dst it's perfectly ok to do atomic_dec() and the dst core will garbage collect items. But now, thinking more, it would probably make more sense to just cache xfrm_dst's and keep ref to the policy on them. So in general I agree with your recommendation. The only immediate problem I can think now is that the resolved would need to atomically check if xfrm_dst is valid, if not, resolve new one. But creating new xfrm_dst involves taking locks and can sleep so it cannot be inside the main resolver. Alternatively, we'd need to: a) still expose flow cache entry structs and do locking or refcounting on them b) have two version of flow lookup: one that calls resolver with bh disabled and can atomically lookup and update entry and a second version that lookups, calls validation, if not-valid it calls resolver with bh enabled and does a new flow cache lookup to update cache Also, relatedly. Is there way to release xfrm_dst's child and route refs when xfrm_bundle_ok fails? This would improve GC collection of the ipv4 rtable entries they referenced. - Timo --
OK this brings out my favourite topic :) The reason we have to sleep is to resolve the template. So if we had queueing for pending xfrm_dst objects, we wouldn't have to sleep at all when creating the top-level xfrm_dst. Packets using that xfrm_dst can wait in the queue until it is fully resolved. Now obviously this code doesn't exist so this is all just a wet dream. Setting my favourite topic aside, I have to come to the conclusion that your patch still doesn't fully resolve the problem you set out to fix. The crux of the issue is the linked list of all bundles in a policy and the obvious problems stemming from walking a linked list that is unbounded. The reason I think it doesn't fully resolve this is because of the flow cache. Being a per-cpu cache, when you create the xfrm dst the first time around, you'll at most put it in one CPU's cache. The next CPU that comes along will still have to walk that same bundle linked list. So we're back to square one. Now Dave, my impression is that we picked the per-cpu design because it was the best data structure we had back in 2002, right? If so I'd like us to think about the possibility of switching over to a different design, in particular, an RCU-based hash table, similar to the one I just used for bridge multicasting. This would eliminate the need for walking the bundle list apart from the case when we're destroying the policy, which can be done in process context. Actually I just realised that the other way we can fix this is to make xfrm_dst objects per-cpu just like IPv4 routes. That is, when you fail to find an xfrm_dst object in the per-cpu cache, you dont' bother calling xfrm_find_bundle but just make a new bundle. This is probably much easier than replacing the whole flow cache. Can any one think of any problems with duplicate xfrm_dst objects? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: ...
Not exactly, each CPU does one slow lookup after which it finds it fast. But yes, it's not perfect solution. Especially, if cpu happens to get switched between the initial lookup and Right. This would speed the bundle lookup in all cases. Except... we can have override policy on per-socket basis. We should include the per-socket override in the flow lookups so that those sockets get also boost from the cache. Though usual use case is to disable all policies (so e.g. IKE can Sounds like a very good idea. If we instantiate new xfrm_dst, all that it shares with others is xfrm_state and xfrm_policy (inner objects will be unique). Since that's what happens anyway I don't see any problem with this. So should go ahead and: 1. modify flow cache to be more generic (have virtual put and get for each object; and remove the atomic_t pointer) 2. modify flow cache to have slow and fast resolvers so we can copy with the current sleeping requirement 3. cache bundles instead of policies for outgoing stuff 4. kill find_bundle and just instantiate new ones if we get cache miss 5. put all bundles to global hlist (since only place that walks through them is gc, and stale bundle can be dst_free'd right away); use genid's for policy to flush old bundles 6. dst_free and unlink bundle immediately if it's found to be stale - Timo --
I don't think we need either of these. To support the sleep requirement, just return -EAGAIN from the resolver when the template can't be resolved. Then the caller of flow_cache_lookup can sleep as it does now. It simply has to repeat the flow Sounds good. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Ok, we can do that to skip 2. But I think 1 would be still useful. It'd probably be good to actually have flow_cache_ops pointer in each entry instead of the atomic_t pointer. The reasoning: - we can then have type-based checks that the reference count is valid (e.g. policy's refcount must not go to zero, it's bug, and we can call dst_release which warns if refcount goes to negative); imho it's hack to call atomic_dec instead of the real type's xxx_put - the flow cache needs to somehow know if the entry is stale so it'll try to refresh it atomically; e.g. if there's no check for 'stale', the lookup returns stale xfrm_dst. we'd then need new api to update the stale entry, or flush it out and repeat the lookup. the virtual get could check for it being stale (if so release the entry) and then return null for the generic code to call the resolver atomically - for paranoia we can actually check the type of the object in Okay. Sounds like a plan. I'll work on this next week. I'll also try to make it a series of patches instead of the big hunk I sent initially. - Timo --
- could cache bundle OR policy for outgoing stuff. it's useful to cache the policy in case we need to sleep, or if it's a policy forbidding traffic. in those cases there's no bundle to cache at all. alternatively we can make dummy bundles that are marked invalid and are just used to keep a reference to the policy. Oh, this also implies that the resolver function should be changed to get the old stale object so it can re-use it to get the policy object instead of searching it all over again. - Timo --
The reason I'd prefer to keep the current scheme is to avoid
an additional indirect function call on each packet.
The way it would work is (we need flow_cache_lookup to return
fle instead of the object):
fle = flow_cache_lookup
xdst = fle->object
if (xdst is stale) {
flow_cache_mark_obsolete(fle)
fle = flow_cache_lookup
xdst = fle->object
if (xdst is stale)
return error
}
Where flow_cache_mark_obsolete would set a flag in the fle that's
checked by flow_cache_lookup. To prevent the very rare case
where we mark an entry obsolete incorrectly, the resolver function
should double-check that the existing entry is indeed obsolete
before making a new one.
This way we give the overhead over to the slow path where the
bundle is stale.
You were saying that our bundles are going stale very frequently,
that would sound like a bug that we should look into. The whole
caching scheme is pointless if the bundle is going stale every
My instinct is to go with dummy bundles. That way given the
direction we know exactly what object type it is. Having mixed
That should be easy to implement. Just prefill the obj argument
to the resolver with either NULL or the stale object.
For the bundle resolver, it should also remove the stale bundle
from the policy bundle list and drop its reference.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Well, yes. The fast past would be slightly faster. However, I still find the indirect call based thingy more elegant. We would also get more common code, as flow_cache_lookup could then figure out from the virtual call if the entry needs refreshing or not. And doing just atomic_dec instead of the type based thingy feels slightly kludgy. I don't think the speed difference between direct/indirect call is that significant. Also the fle would just need "struct flow_cache_ops *". And have wrappers that use container_of to figure out the real address of the cached struct. This would allow real type agnostic cache. So we'd just need the 'ops' pointer instead of the current object pointer and atomic_t pointer, saving in fle size. But yes, it does impose the small speed penalty of indirect call. I prefer the 'ops' thingy, but have no strong feelings either way. I mean frequently as in 'minutes' not as in 'milliseconds'. The bundles goes stale only when the policy (mostly by user action) or Yup. Cheers, Timo --
I've been thinking more about doing this. And I think this approach is fundamentally racy. Underlying fle->object can be changed underneath as since bh are not disabled. This means the refcounting needs to be done on fle level, and additional measures are needed to ensure that fle->object is as expected. Additionally, the second flow_cache_lookup can happen on another cpu, and could return a stale object that indeed would need refreshing instead of generating an error. Options: 1. return fle and fle->object separately, refcount both 2. call flow_cache_lookup with bh disabled 3. use flow_cache_entry_ops and virtualize put/get Since 2. was previously said to leak generic code, it does not sound good. On 1 vs 3, I'd still choose 3 since: - adding one more refcount means more atomic calls which have same (or more) overhead as indirect call - two refcounts sounds more complicated than one - exposing fle and touching it without bh's disabled requires a whole lot of more extra care; doing 3 is simpler So, should I go ahead and do the virtualization of the getters and putters? - Timo --
If we're going to have indirect calls per packet then let's at least minimise them. I think one should be enough, i.e., just add a ->stale() function pointer to be used in flow_cache_lookup. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Normally, only the getter is called per-packet. So I'm thinking to have get() that would check for entry being stale or not, and bump up the refcount. The resolver can just swap the old entry and call appropriate _put/_get, so we can avoid virtual calls there. Thinking more about the flushing of the flow cache. It's basically only needed to flush before the policies are garbage collected. This is strictly needed so we can do the atomic_dec() without turning policy's refcount to zero and causing a leak. I'm now thinking if it'd be worth doing virtual _put(). This way we would not need flushing at all for in policy garbage collector (we could kill flow_cache_flush). We could also call dst_release immediately in the flow cache _put, which would release the memory faster. This way we would not need at all any periodic xfrm_dst checker as flow cache is regenerated regularly. The code paths that would require calling the virtual put are: - randomization of flow cache (every 10 mins currently) from flow_cache_lookup() with bh disabled - flow cache going too full, from flow_cache_lookup() with bh disabled - cpu notifier Do you think the virtual _put doing more work would be too slow? In that case the plain atomic_dec sounds ok, but we'd then need to periodically check through the global bundle list for garbage collection. Cheers, Timo --
OK if it's just one call per packet it sounds good to me. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> Basically. It was envisioned that flows at that level of detail would be spread between different cpus, and that individual flows wouldn't propagate onto multiple cpus much, if at all. And if they did, no big deal we have an entry in the cache of those cpus. Do we know cases where it happens often? In any event, RCU would certainly fit the bill just as easily and I have no qualms against going in that direction. Timo mentioned the socket overrides, we handle them at the top level right before we look into the flow cache and I think it should stay that way and we shouldn't bother tossing those into the flow cache at all. Just my humble opinion on this :-) --
From: Herbert Xu <herbert@gondor.apana.org.au> How are ipv4 routing cache entries per-cpu? That would screw up route metrics for TCP sockets quite a lot if they were. --
You're right of course, s/just like IPv4 routes// :) -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> And as a consequence, making the xfrm_dst's be per-cpu would mess with route metrics for TCP. If we do something like that, then there is simply no reason any longer to have such fine-grained routing metrics if the one thing that would use it heavily (ipsec) stops doing so completely. At that point we can go to a host cache for metrics just like BSD, and pull all of the metrics out of struct dst (enormous win as it makes all routes significantly smaller). I'm willing to consider this seriously, to be honest. --
Actually xfrm_dst currently relies on IPv4 rt objects to maintain the metrics. So as long as IPv4 routes are still global, then the metrics won't be affected as far as can I see. I would consider that too. But right now I'm just looking for the bare minimum to solve the problem at hand :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> Good point, I was misunderstanding how things work now and how that would change with your proposal. Having multiple xfrm_dsts exist for an IPSEC route seems fine to me. --
Thanks for the confirmation. Timo, let's roll along with the per-cpu xfrm_dst approach. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
