Re: [PATCH] xfrm: cache bundle lookup results in flow cache

Previous thread: net-2.6 [Bug-Fix][dccp]: fix oops caused after failed initialisation by Gerrit Renker on Sunday, March 14, 2010 - 11:13 pm. (2 messages)

Next thread: tcp_reordering as 0 possible? by raj ravi on Monday, March 15, 2010 - 7:25 am. (3 messages)
From: Timo Teras
Date: Monday, March 15, 2010 - 5:20 am

Instead of doing O(n) xfrm_find_bundle() call per-packet, cache
the previous lookup results in flow cache. The flow cache is
updated to be per-netns and more generic.

The flow cache no longer holds reference (which was not really
used in the first place, as it depended on garbage collection).
Now this is more explicit. The cache validity is maintained as follows:
- On policy insert, the whole cache is invalideted by incrementing
  generation id. No synchronization required as genid checks make
  sure no old objects are dereferenced.
- On policy removal from lists the object is marked deleted, and
  this invalidated the policy pointer.
- Policy object deletion requires explicit synchronization to remove
  stale pointers before can actually free the policy objects.
  xfrm_policy_gc_task() synchronizes the cache.
- Bundle creation and expiry is reflected in xfrm_bundle_ok() check
  before any bundle from cache is used.
- Bundle deletion is done by incrementing policy->bundles_genid and
  synchronizing with other cpu's so there is no stale bundle pointers
  left. After this the bundle objects can be safely deleted.

Basic testing done on 2.6.32 based kernel. This gives a boost of
several magnitudes on transmit path.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
 include/net/flow.h               |   39 ++++-
 include/net/netns/xfrm.h         |    4 +
 include/net/xfrm.h               |    1 +
 net/core/flow.c                  |  342 ++++++++++++++++++--------------------
 net/ipv6/inet6_connection_sock.c |    6 +-
 net/xfrm/xfrm_policy.c           |  271 +++++++++++++++++++++---------
 6 files changed, 394 insertions(+), 269 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 809970b..814a9d2 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -8,6 +8,9 @@
 #define _NET_FLOW_H
 
 #include <linux/in6.h>
+#include <linux/notifier.h>
+#include <linux/timer.h>
+#include <linux/slab.h>
 #include ...
From: Herbert Xu
Date: Wednesday, March 17, 2010 - 6:07 am

This only works well if the traffic doesn't switch bundles much.
But if that were the case then the number of bundles is likely
to be small anyway.

IOW I think if we're doing this then we should go the whole
distance and directly cache bundles instead of policies in the
flow cache.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Wednesday, March 17, 2010 - 7:16 am

The problem is if I have multipoint gre1 and policy that says
"encrypt all gre in transport mode".

Thus for each public address, I get one bundle. But the
xfrm_lookup() is called for each packet because ipgre_tunnel_xmit()
calls ip_route_output_key() on per-packet basis.


Then we cannot maintain policy use time. But if it's not a
requirement, we could drop the policy from cache.

Also. With this and your recent flowi patch, I'm seeing pmtu
issues. Seems like xfrm_bundle_ok uses the original dst which
resulted in the creation of the bundle. Somehow that dst
does not get updated with pmtu... but the new dst used in
next xfrm_lookup for same target does have proper mtu.
I'm debugging right now why this is happening. Any ideas?

- Timo
--

From: Herbert Xu
Date: Wednesday, March 17, 2010 - 7:58 am

But if your traffic switches between those tunnels on each packet,

I don't see why we can't maintain the policy use time if we did

The dynamic MTU is always maintained in a normal dst object in
the IPv4 routing cache.  Each xfrm_dst points to such a dst
through xdst->route.

If you were looking at the xfrm_dst's own MTU then that may well
cause problems.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Wednesday, March 17, 2010 - 8:56 am

I figured the root cause. The original dst gets expired
rt_genid goes old. But xfrm_dst does not notice that so it
won't create a new bundle. xfrm_bundle_ok calls dst_check,
but dst->obsolete = 0, and ipv4_dst_check is a no-op anyway.

Somehow the rtable object should be able to tell back to
xfrm that the dst is not good anymore. Any ideas?

- Timo


--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Wednesday, March 17, 2010 - 9:32 am

Checked ipv6, it does like xfrm: sets obsolote to -1 and
on dst_check checks the genid. We need to do same in for
ipv4. I just wrote an hack, and tested it. It solves the
pmtu issues.

I will post a proper patch soon.

- Timo
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Thursday, March 18, 2010 - 12:30 pm

Here's how things go to my understanding.

When we are forwarding packets, each packet goes through
__xfrm_route_forward(). Or if we are sending from ip_gre (like in
my case), the packets go through ip_route_output_key().

Both of these call xfrm_lookup() to get the xfrm_dst instead
of the real rtable dst. This is done on per-packet basis.
Basically, the xfrm_dst is never kept referenced directly
on either code path. Instead it needs to be xfrm'ed per-packet.

Now, these created xfrm_dst gets cached in policy->bundles
with ref count zero and not deleted until garbage collection
is required. That's how xfrm_find_bundle tries to reuse them
(but it does O(n) lookup).

On the gre+esp case (should apply to any forward path too)
the caching of bundle speeds up the output path considerably
as there can be a lot of xfrm_dst's. Especially if it's a
wildcard transport mode policy which basically get's bundles
for each destination. So there can be a lot of xfrm_dst's
all valid, since they refer to unique xfrm_state on
per-destination basis.

Now, even more, since xfrm_dst needs to be regenerated if
the underlying ipv4 rtable entry has expired there can be
a lot of bundles. So the linear search is a major performance
killer.

Btw. it looks like the xfrm_dst garbage collection is broke.
It's only garbage collected if a network device goes down,
or dst_alloc calls it after gc threshold is exceeded. Since
gc threshold got dynamic not long ago, it can be very big.
This causes stale xfrm_dst's to be kept alive, and what is
worse their inner rtable dst is kept referenced. But as they
were expired and dst_free'd the dst core "free still referenced"
dst's goes through that list over and over again and will
kill the whole system performance when the list grows long.

I think as a minimum we should add 'do stale_bundle' check

Actually no. As the pmtu case showed, it's more likely that
xfrm_dst needs to be regenerated, but the policy stays the
same since policy db isn't touched that ...
From: Herbert Xu
Date: Thursday, March 18, 2010 - 5:31 pm

Yes but the way you're caching it in the policy means that it
only helps if two consecutive packets happen to match the same
bundle.  If your traffic mix was such that each packet required
a different bundle, then we're back to where we started.

That's why I was asking for you to directly cache the xfrm_dst
objects in the flow cache.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Thursday, March 18, 2010 - 10:48 pm

But it always matches. The caching happens using the inner
flow. Inner flow always matches with the same bundle unless
the bundle expires or goes stale. What happens is that I get
multiple cache entries per-inner flow each referencing to the
same bundle.

And this is even more useful with the gre+esp. No matter what
I send inside gre (with private IPs), it ends up being routed
to more limited set of outer public IP parties. The bundle
lookup happens with the public-IP flow, so the speed up works

True. But if we go and prune a bundle due to it being bad or
needing garbage collection we need to invalidate all bundles
pointers, and we cannot access the back-pointer. Alternatively
we need to keep xfrm_dst references again in the flow cache
requiring an expensive iteration of all flow cache entries
whenever a xfrm_dst needs to be deleted (which happens often).

- Timo
--

From: Herbert Xu
Date: Thursday, March 18, 2010 - 11:03 pm

Sorry for being slow, but if it always matches, doesn't that mean
you'll only have a single bundle in the policy bundle list? IOW
why do we need this at all?

Or have I misread your patch? You *are* proposing to cache the last

Why can't you access the back-pointer? You should always have

So does the IPv4 routing cache.  I think what this reflects is
just that the IPsec garbage collection mechanism is broken.

There is no point in doing a GC on every dst_alloc if we know
that it isn't going to go below the threshold.  It should gain
a minimum GC interval like IPv4.  Or perhaps we can move the
minimum GC interval check into the dst core.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Thursday, March 18, 2010 - 11:21 pm

No. The bundle created for specific flow, matches always later
that flow.

With transport mode wildcard policy, e.g. single policy saying
encrypt traffic with protocol X to all IP-address, you get a
separate bundle per-public IP destination. Bundle matches only
that specific IP since it gets a separate xfrm_state. But you
can talk to all the hosts in internet using same policy, so
you can end up with a whole lot of valid bundles in the same
policy.

I'm not sure how this works in tunnel mode. It might be that
single bundle can be reused for all packets. But I think the same
applies to tunnel mode. Since afinfo->fill_dst() puts the
inner flow to bundle xfrm_dst->u.rt.fl, which is later compared
against the inner flow by afinfo->find_bundle(). I think this
implies that for each flow traveling inside tunnel, it gets it's
separate xfrm_dst, so again you end up with a whole lot of

Yes and no. The bundle used is cached on per-flow basis.
The flow cache can have lot of entries each referring to

Yes, I reported xfrm_dst GC being broke in the earlier mail.

But keeping policy and bundle in cache is still a win. If we
kill the xfrm_dst due to GC, we will also lose the policy
the flow matched. We might need to kill xfrm_dst due to the
inside dst going old, but the flow cache would still give
hit with policy info (but no bundle) the next a packet comes
in using the same flow.

- Timo

--

From: Herbert Xu
Date: Friday, March 19, 2010 - 12:17 am

OK so I did misread your patch.

In fact it is already doing exactly what I was suggesting, I'll

We were in complete agreement all along :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Friday, March 19, 2010 - 12:27 am

Just figured that's it's easier to explain with an example.

We have SPD:
	10.1.0.0/16 - 10.2.0.0/16 tunnel
		1.2.3.4 - 4.3.2.1

Now we get n+1 clients to connect to server in 10.2.0.1.
They each get separate bundle, since the xfrm_dst will be
created and search using flow id's like:
	src 10.1.x.x dst 10.2.0.1

So there's one xfrm_policy and xfrm_state, but n+1
xfrm_dst's.

Since the flow cache caches the result of lookups on the
inner flow "10.1.x.x->10.2.0.1" basis, it always returns
matching valid bundle in O(1) time unless the xfrm_dst
expired.

Currently it's looked up with O(n) search in find_bundle.

Same thing happens with wildcard transport mode SPD's.

E.g. SPD:
	0.0.0.0/0 - 0.0.0.0/0 proto gre, transport

We are talking with gre to n+1 tunnel destinations.
We get n+1 xfrm_dst's in that xfrm_policy. Flow cache
works on inner flow using flows like:
	src 1.2.3.4 dst 4.3.2.1 proto gre
And can keep in cache the right policy always, and
the bundle to use as long as it stays valid.

Hopefully this explains why I think the patch is
useful.

- Timo

--

From: Herbert Xu
Date: Thursday, March 18, 2010 - 5:32 pm

A back-pointer does not require an O(n) lookup.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Friday, March 19, 2010 - 12:20 am

This doesn't work.

The flow cache operates without locking as it is a per-cpu cache.
To make this work you must ensure that you stay on the same CPU
or use some other form of synchronoisation if you write to the
object returned.

AFAICS there is no synchronisation here and you're writing to fce.

So you'll need to disable preemption around the bit that touches
fce.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Friday, March 19, 2010 - 12:48 am

But flow_cache_lookup disables pre-emption until _put is called.
So it should work. Would there be a cleaner way?

However, now I figured that we need to make walk.dead atomic
because it's read some times without taking the policy lock.

- Timo
--

From: Herbert Xu
Date: Friday, March 19, 2010 - 1:29 am

Previously the flow cache returned a policy directly which works
because whenever we modify that policy we'd take the appropriate
lock.

Your patch changes it so that it now returns an fce.  But nothing
is guarding the code that modifies fce.  So two CPUs may end up
modifying the same fce.

However, it would appear that this race could be harmless, provided
that you are careful about dereferencing fce->policy and fce->dst.

IOW, this is not OK

	if (fce->policy)
		use fce->policy;

and this should work

	policy = fce->policy;
	if (policy)
		use policy;

Actually on second tought, even this isn't totally safe.  Who
is taking a reference count on the policy and dst? I see a ref
count on the fce, but nothing on fce->dst and fce->policy.  Do
you have an implicit reference on them?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Friday, March 19, 2010 - 1:37 am

But I changed that. the flow cache now does *not* call local_bh_enable
if it returns something. This is deferred until corresponding _put
call. So bh's are disable while we are touching the lookup results.

It'd probably make sense to remove that. And require _lookup to
be called with bh disabled so it's more obvious that bh's are

Not a race. We need to keep bh's disabled while touching fce

Noone. When policy and dst is on cache there's no reference.
The cache generation id's ensure that the object exists when
they are in cache. It might make sense to add references to
both objects and do a BUG_ON if the flow cache flusher would
need to delete an object. I guess this would be the proper
way, since that's how the dst stuff works too.

- Timo
--

From: Herbert Xu
Date: Friday, March 19, 2010 - 1:47 am

I'm sorry but making a function like flow_cache_lookup return with

That would be better but it's still hacky.  Proper reference


The cache genid is not enough:

CPU1			CPU2
check genid == OK
			update genid
			kill policy
			kfree on policy
use policy == BOOM

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Friday, March 19, 2010 - 2:12 am

Well, the cache entry is still referenced only very shortly,
I don't see why keeping bh disabled why doing it is considered
a hack. Refcounting the cache entries is trickier. Though,
it could be used to optimize the update process: we could safely

This. And that the cache is synchronized by flow_cache_flush
executing stuff on other cpu's, ensuring that it's not running

The sequence goes currently.

CPU1			CPU2
disable_bh
check genid == OK
			update genid
			call cache_flush
			blocks
use policy == OK
and take refcount
enable_bh
cache_flush smpcall executes and ublocks cpu2
			returns from cache_flush
			kill policy
			kfree on policy

- Timo
--

From: Herbert Xu
Date: Friday, March 19, 2010 - 2:32 am

Well we had a nicely type-agnostic cache which is self-contained,
but your patch is bleeding generic code into xfrm_policy.c, that's
why I felt it to be hacky :)

Anyway I see how your scheme works now as far as object life
is concerned, and I agree that it is safe.

However, I wonder if we could do it while still leaving all the
object life-cycle management stuff (and the BH disabling bits)
in flow.c

The crux of the issue is that you now have two objects to track
instead of one.  As the direction is a key in the lookup, we're
really only worried about the outbound case here.

So how about going back to what I suggested earlier, and keeping
a back-pointer from xfrm_dst to the policy? Of course xfrm_dst
would also hold a ref count on the policy.  You'd only have to
do it for the top-level xfrm_dst.

It does mean that you'll need to write a different resolver for
outbound vs. inbound/forward, but that makes sense because we
only use bundles for outbound policies.

What do you think?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Friday, March 19, 2010 - 2:53 am

When I first started reading the code, I got confused slightly
on how the garbage collection is happening. What I did not like
in current is the atomic_dec() in flow.c that does not check if
it was turned to zero. Because on policy objects it means you
need to delete it (which would a bug if it happened in flow.c;
the policy gc calls flush before releasing it's own reference),
but on xfrm_dst it's perfectly ok to do atomic_dec() and the
dst core will garbage collect items.

But now, thinking more, it would probably make more sense to
just cache xfrm_dst's and keep ref to the policy on them. So
in general I agree with your recommendation. The only immediate
problem I can think now is that the resolved would need to
atomically check if xfrm_dst is valid, if not, resolve new one.
But creating new xfrm_dst involves taking locks and can sleep
so it cannot be inside the main resolver.

Alternatively, we'd need to:
a) still expose flow cache entry structs and do locking or
   refcounting on them
b) have two version of flow lookup: one that calls resolver
   with bh disabled and can atomically lookup and update entry
   and a second version that lookups, calls validation, if
   not-valid it calls resolver with bh enabled and does a new
   flow cache lookup to update cache

Also, relatedly. Is there way to release xfrm_dst's child and
route refs when xfrm_bundle_ok fails? This would improve GC
collection of the ipv4 rtable entries they referenced.

- Timo
--

From: Herbert Xu
Date: Saturday, March 20, 2010 - 8:17 am

OK this brings out my favourite topic :)

The reason we have to sleep is to resolve the template.  So if
we had queueing for pending xfrm_dst objects, we wouldn't have
to sleep at all when creating the top-level xfrm_dst.

Packets using that xfrm_dst can wait in the queue until it is
fully resolved.  Now obviously this code doesn't exist so this
is all just a wet dream.

Setting my favourite topic aside, I have to come to the conclusion
that your patch still doesn't fully resolve the problem you set out
to fix.

The crux of the issue is the linked list of all bundles in a
policy and the obvious problems stemming from walking a linked
list that is unbounded.

The reason I think it doesn't fully resolve this is because of
the flow cache.  Being a per-cpu cache, when you create the xfrm
dst the first time around, you'll at most put it in one CPU's
cache.

The next CPU that comes along will still have to walk that same
bundle linked list.  So we're back to square one.

Now Dave, my impression is that we picked the per-cpu design
because it was the best data structure we had back in 2002,
right?

If so I'd like us to think about the possibility of switching
over to a different design, in particular, an RCU-based hash
table, similar to the one I just used for bridge multicasting.

This would eliminate the need for walking the bundle list apart
from the case when we're destroying the policy, which can be
done in process context.

Actually I just realised that the other way we can fix this is
to make xfrm_dst objects per-cpu just like IPv4 routes.  That
is, when you fail to find an xfrm_dst object in the per-cpu
cache, you dont' bother calling xfrm_find_bundle but just make
a new bundle.

This is probably much easier than replacing the whole flow cache.
Can any one think of any problems with duplicate xfrm_dst objects?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: ...
From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Saturday, March 20, 2010 - 9:26 am

Not exactly, each CPU does one slow lookup after which it
finds it fast. But yes, it's not perfect solution. Especially,
if cpu happens to get switched between the initial lookup and

Right. This would speed the bundle lookup in all cases.

Except... we can have override policy on per-socket basis.
We should include the per-socket override in the flow lookups
so that those sockets get also boost from the cache. Though
usual use case is to disable all policies (so e.g. IKE can

Sounds like a very good idea. If we instantiate new xfrm_dst,
all that it shares with others is xfrm_state and xfrm_policy
(inner objects will be unique). Since that's what happens anyway
I don't see any problem with this.

So should go ahead and:
1. modify flow cache to be more generic (have virtual put and get
   for each object; and remove the atomic_t pointer)
2. modify flow cache to have slow and fast resolvers so we can
   copy with the current sleeping requirement
3. cache bundles instead of policies for outgoing stuff
4. kill find_bundle and just instantiate new ones if we get cache
   miss
5. put all bundles to global hlist (since only place that walks
   through them is gc, and stale bundle can be dst_free'd right
   away); use genid's for policy to flush old bundles
6. dst_free and unlink bundle immediately if it's found to be stale

- Timo

--

From: Herbert Xu
Date: Saturday, March 20, 2010 - 5:46 pm

I don't think we need either of these.  To support the sleep
requirement, just return -EAGAIN from the resolver when the
template can't be resolved.  Then the caller of flow_cache_lookup
can sleep as it does now.  It simply has to repeat the flow

Sounds good.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Sunday, March 21, 2010 - 12:34 am

Ok, we can do that to skip 2. But I think 1 would be still useful.
It'd probably be good to actually have flow_cache_ops pointer in
each entry instead of the atomic_t pointer.

The reasoning:
- we can then have type-based checks that the reference count
  is valid (e.g. policy's refcount must not go to zero, it's bug,
  and we can call dst_release which warns if refcount goes to
  negative); imho it's hack to call atomic_dec instead of the
  real type's xxx_put
- the flow cache needs to somehow know if the entry is stale so
  it'll try to refresh it atomically; e.g. if there's no
  check for 'stale', the lookup returns stale xfrm_dst. we'd
  then need new api to update the stale entry, or flush it out
  and repeat the lookup. the virtual get could check for it being
  stale (if so release the entry) and then return null for the
  generic code to call the resolver atomically
- for paranoia we can actually check the type of the object in

Okay. Sounds like a plan. I'll work on this next week.
I'll also try to make it a series of patches instead of the big
hunk I sent initially.

- Timo

--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Sunday, March 21, 2010 - 1:31 am

- could cache bundle OR policy for outgoing stuff. it's useful
  to cache the policy in case we need to sleep, or if it's a
  policy forbidding traffic. in those cases there's no bundle
  to cache at all. alternatively we can make dummy bundles that
  are marked invalid and are just used to keep a reference to
  the policy.

Oh, this also implies that the resolver function should be
changed to get the old stale object so it can re-use it to
get the policy object instead of searching it all over again.

- Timo

--

From: Herbert Xu
Date: Sunday, March 21, 2010 - 8:52 pm

The reason I'd prefer to keep the current scheme is to avoid
an additional indirect function call on each packet.

The way it would work is (we need flow_cache_lookup to return
fle instead of the object):

	fle = flow_cache_lookup
	xdst = fle->object
	if (xdst is stale) {
		flow_cache_mark_obsolete(fle)
		fle = flow_cache_lookup
		xdst = fle->object
		if (xdst is stale)
			return error
	}

Where flow_cache_mark_obsolete would set a flag in the fle that's
checked by flow_cache_lookup.  To prevent the very rare case
where we mark an entry obsolete incorrectly, the resolver function
should double-check that the existing entry is indeed obsolete
before making a new one.

This way we give the overhead over to the slow path where the
bundle is stale.

You were saying that our bundles are going stale very frequently,
that would sound like a bug that we should look into.  The whole
caching scheme is pointless if the bundle is going stale every

My instinct is to go with dummy bundles.  That way given the
direction we know exactly what object type it is.  Having mixed

That should be easy to implement.  Just prefill the obj argument
to the resolver with either NULL or the stale object.

For the bundle resolver, it should also remove the stale bundle
from the policy bundle list and drop its reference.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Monday, March 22, 2010 - 11:03 am

Well, yes. The fast past would be slightly faster.

However, I still find the indirect call based thingy more elegant. 
We would also get more common code, as flow_cache_lookup could then
figure out from the virtual call if the entry needs refreshing or
not. And doing just atomic_dec instead of the type based thingy
feels slightly kludgy. I don't think the speed difference between
direct/indirect call is that significant.

Also the fle would just need "struct flow_cache_ops *". And have
wrappers that use container_of to figure out the real address of
the cached struct. This would allow real type agnostic cache.
So we'd just need the 'ops' pointer instead of the current
object pointer and atomic_t pointer, saving in fle size.
But yes, it does impose the small speed penalty of indirect call.

I prefer the 'ops' thingy, but have no strong feelings either way.

I mean frequently as in 'minutes' not as in 'milliseconds'. The 
bundles goes stale only when the policy (mostly by user action) or


Yup.

Cheers,
 Timo
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Tuesday, March 23, 2010 - 12:28 am

I've been thinking more about doing this. And I think this
approach is fundamentally racy.

Underlying fle->object can be changed underneath as since
bh are not disabled. This means the refcounting needs to
be done on fle level, and additional measures are needed
to ensure that fle->object is as expected.

Additionally, the second flow_cache_lookup can happen
on another cpu, and could return a stale object that
indeed would need refreshing instead of generating an
error.

Options:
1. return fle and fle->object separately, refcount both
2. call flow_cache_lookup with bh disabled
3. use flow_cache_entry_ops and virtualize put/get

Since 2. was previously said to leak generic code,
it does not sound good.

On 1 vs 3, I'd still choose 3 since:
- adding one more refcount means more atomic calls
  which have same (or more) overhead as indirect call
- two refcounts sounds more complicated than one
- exposing fle and touching it without bh's disabled
  requires a whole lot of more extra care; doing
  3 is simpler

So, should I go ahead and do the virtualization of the
getters and putters?

- Timo
--

From: Herbert Xu
Date: Tuesday, March 23, 2010 - 12:42 am

If we're going to have indirect calls per packet then let's at
least minimise them.  I think one should be enough, i.e., just
add a ->stale() function pointer to be used in flow_cache_lookup.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Tuesday, March 23, 2010 - 2:19 am

Normally, only the getter is called per-packet. So I'm thinking
to have get() that would check for entry being stale or not,
and bump up the refcount.

The resolver can just swap the old entry and call appropriate
_put/_get, so we can avoid virtual calls there.

Thinking more about the flushing of the flow cache. It's basically
only needed to flush before the policies are garbage collected.
This is strictly needed so we can do the atomic_dec() without
turning policy's refcount to zero and causing a leak.

I'm now thinking if it'd be worth doing virtual _put(). This
way we would not need flushing at all for in policy garbage
collector (we could kill flow_cache_flush). We could also
call dst_release immediately in the flow cache _put, which
would release the memory faster. This way we would not need
at all any periodic xfrm_dst checker as flow cache is
regenerated regularly.

The code paths that would require calling the virtual put
are:
- randomization of flow cache (every 10 mins currently)
  from flow_cache_lookup() with bh disabled
- flow cache going too full, from flow_cache_lookup()
  with bh disabled
- cpu notifier

Do you think the virtual _put doing more work would be
too slow?

In that case the plain atomic_dec sounds ok, but we'd then
need to periodically check through the global bundle list
for garbage collection.

Cheers,
  Timo
--

From: Herbert Xu
Date: Tuesday, March 23, 2010 - 2:41 am

OK if it's just one call per packet it sounds good to me.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Sunday, March 21, 2010 - 6:26 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

Basically.

It was envisioned that flows at that level of detail would be spread
between different cpus, and that individual flows wouldn't propagate
onto multiple cpus much, if at all.  And if they did, no big deal we
have an entry in the cache of those cpus.

Do we know cases where it happens often?

In any event, RCU would certainly fit the bill just as easily and I
have no qualms against going in that direction.

Timo mentioned the socket overrides, we handle them at the top level
right before we look into the flow cache and I think it should stay
that way and we shouldn't bother tossing those into the flow cache at
all.  Just my humble opinion on this :-)
--

From: David Miller
Date: Sunday, March 21, 2010 - 6:28 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

How are ipv4 routing cache entries per-cpu?  That would screw up route
metrics for TCP sockets quite a lot if they were.
--

From: Herbert Xu
Date: Sunday, March 21, 2010 - 6:32 pm

You're right of course, s/just like IPv4 routes// :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Sunday, March 21, 2010 - 6:36 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

And as a consequence, making the xfrm_dst's be per-cpu would mess with
route metrics for TCP.

If we do something like that, then there is simply no reason any
longer to have such fine-grained routing metrics if the one thing that
would use it heavily (ipsec) stops doing so completely.

At that point we can go to a host cache for metrics just like BSD, and
pull all of the metrics out of struct dst (enormous win as it makes
all routes significantly smaller).

I'm willing to consider this seriously, to be honest.
--

From: Herbert Xu
Date: Sunday, March 21, 2010 - 6:40 pm

Actually xfrm_dst currently relies on IPv4 rt objects to maintain
the metrics.  So as long as IPv4 routes are still global, then the
metrics won't be affected as far as can I see.


I would consider that too.  But right now I'm just looking for
the bare minimum to solve the problem at hand :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Sunday, March 21, 2010 - 8:12 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

Good point, I was misunderstanding how things work now and how
that would change with your proposal.

Having multiple xfrm_dsts exist for an IPSEC route seems fine
to me.
--

From: Herbert Xu
Date: Sunday, March 21, 2010 - 8:52 pm

Thanks for the confirmation.

Timo, let's roll along with the per-cpu xfrm_dst approach.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: =?ISO-8859-1?Q?Timo_Ter=E4s?=
Date: Monday, March 22, 2010 - 11:31 am

Okay.

- Timo

--

Previous thread: net-2.6 [Bug-Fix][dccp]: fix oops caused after failed initialisation by Gerrit Renker on Sunday, March 14, 2010 - 11:13 pm. (2 messages)

Next thread: tcp_reordering as 0 possible? by raj ravi on Monday, March 15, 2010 - 7:25 am. (3 messages)