Hi David This is an RFC, based on net-2.6 for convenience only. Thank you [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx() Loopback transmit function loopback_xmit() actually calls netif_rx() to queue a skb to the softnet queue, and arms a softirq so that this skb can be handled later. This has a cost on SMP, because we need to hold a reference on the device, and free this reference when softirq dequeues packet. Following patch directly calls netif_receive_skb() and avoids lot of atomic operations. (atomic_inc(&dev->refcnt), set_and_set_bit(NAPI_STATE_SCHED, &n->state), ... atomic_dec(&dev->refcnt)...), cache line ping-pongs on device refcnt, but also softirq overhead. This gives a nice boost on tbench for example (5 % on my machine) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
I understand this is interesting for the loopback when there is no multiple instances of it and it can't be unregistered. But now with the network namespaces, we can have multiple instances of the loopback and it can to be unregistered. Shouldn't we still use netif_rx ? Perhaps we can do something like: if (dev->nd_net == &init_net) netif_receive_skb(skb); else netif_rx(skb); Or we create: init_loopback_xmit() calling netif_receive_skb(skb); and setup this function when creating the loopback for init_net, otherwise we setup the usual loopback_xmit. We are still safe for multiple network namespaces and we have the improvement for init_net loopback. --
or #ifdef CONFIG_NET_NS if (dev->nd_net != &init_net) netif_rx(skb); else #endif I dont understand how my patch could degrade loopbackdev unregister logic. It should only help it, by avoiding a queue of 'pending packets' per cpu. When we want to unregister a network device, stack makes sure that no more calls to dev->hard_start_xmit() can occur. If no more loopback_xmit() calls are done on this device, it doesnt matter if it internally uses netif_rx() or netif_receive_skb(skb) loopback device has no queue, its really unfortunate to use the 'softirq' internal queue. --
From: Eric Dumazet <dada1@cosmosbay.com> My only concern is stack usage. Note that packet reception can elicit a response and go all the way back into this driver and all the way down into netif_receive_skb() again. And so on and so forth. If there is some bug in the stack (ACK'ing ACKs, stuff like that) we could get into a loop and overrun the kernel stack in no time at all. So, if anything, this change could make inconvenient errors become catastrophic and hard to diagnose. --
... I wonder why overloading with net processing is no concern here? There would be no napi control around this netif_receive_skb(). Another concern might be a code which depends on softirq context here (unless it was checked already)? Regards, Jarek P. --
From: Jarek Poplawski <jarkao2@gmail.com> Good point, but we're talking about loopback wherein only Hmmm.... yes this could be a problem. --
Then e.g. the lack of responsiveness should concern us, I guess. Jarek P. --
From: Eric Dumazet <dada1@cosmosbay.com> I'm willing to seriously entertain this change and stick it into net-2.6.26 if you will perform a reasonable deep stack test. For example, create an XFS filesystem, and mount it NFS over loopback. Then stress it like crazy. See if this generates stack overflows or weird crashes. --
On Mon, 03 Mar 2008 20:55:58 -0800 (PST) Also (unrealistic) benchmarks often test loopback performance, so you should also check for performance gains/losses in things like netbench, netperf, tbench, etc. --
Fair enough :) I'll do my best to stress it on various situations, with 4K stacks on i386. Thank you --
From: Eric Dumazet <dada1@cosmosbay.com> Eric, did you get a chance to kernel stack usage stress this thing out like I asked? Thanks. --
thread_struct significantly smaller than task_struct That said I agree that 4K stack is too tight for many things and in general dangerous. -Andi --
From: Eric Dumazet <dada1@cosmosbay.com>
Hmmm...
+static int enough_stack_space(void)
+{
+#ifdef CONFIG_STACK_GROWSUP
+ return 0;
+#else
+ unsigned long free = (unsigned long)&free -
+ (unsigned long)end_of_stack(current);
+ return free >= THREAD_SIZE/3 ;
+#endif
+}
+
This will always fail when we are on an interrupt stack,
I think you'd want it to succeed in such a case.
Can you agree that, at least to a point, this is getting a bit
convoluted and perhaps adding more complexity than this optimization
deserves? :-)
--
Yes, I do agree, mixing 'network' and 'mainline' in the same patch is garanted to be problematic. We shall wait for 32 cpus machines before thinking about that :) BTW, can loopback_xmit() be called on an interrupt stack ? --
From: Eric Dumazet <dada1@cosmosbay.com> Absolutely, softirq processes TCP data, ACK goes out in softirq context. Softirqs run on interrupt stacks just as hardirqs do. --
Well, it depends. Not on x86_32 with 8K stacks, and some other arches. do_softirq() can use the underlying stack too. Thank you --
that's just wrong - 4K stacks on x86 are 4K-sizeof(thread_info) - the task struct is allocated elsewhere. The patch below runs just fine on 4K-stack x86. Ingo -------------> Subject: net: loopback speedup From: Ingo Molnar <mingo@elte.hu> Date: Mon Mar 31 11:23:21 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- drivers/net/loopback.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux/drivers/net/loopback.c =================================================================== --- linux.orig/drivers/net/loopback.c +++ linux/drivers/net/loopback.c @@ -158,7 +158,7 @@ static int loopback_xmit(struct sk_buff lb_stats->bytes += skb->len; lb_stats->packets++; - netif_rx(skb); + netif_receive_skb(skb); return 0; } --
Yes, this error was corrected by Andi already :) Thank you Ingo but this patch was already suggested by me previously ( http://marc.info/?l=linux-netdev&m=120361996713007&w=2 ) and was rejected, since we can very easily consume all stack space, especially with 4K stacks. (try with NFS mounts and XFS for example) Only safe way is to check available free stack space, since we can nest loopback_xmit() several time. In case of protocol errors (like in TCP, if we answer to an ACK by another ACK, or ICMP loops), we would exhaust stack instead of delaying packets for next softirq run. Problem is to check available space : It depends on stack growing UP or DOWN, and depends on caller running on --
ok - i wish such threads were on lkml so that everyone not just the netdev kabal can read it. It's quite ugly, but if we want to check stack free space i'd suggest for you to put a stack_can_recurse() call into arch/x86/kernel/process.c and offer a default __weak implementation in kernel/fork.c that always returns 0. the rule on x86 should be something like this: on 4K stacks and 64-bit [which have irqstacks] free stack space can go as low as 25%. On 8K stacks [which doesnt have irqstacks but nests irqs] it should not go below 50% before falling back to the explicitly queued packet branch. this way other pieces of kernel code code can choose between on-stack fast recursion and explicit iterators. Although i'm not sure i like the whole concept to begin with ... Ingo --
Hi Ingo
I took the time to prepare a patch to implement =20
arch_stack_can_recurse() as you suggested.
Thank you
[PATCH] x86 : arch_stack_can_recurse() introduction
Some paths in kernel would like to chose between on-stack fast recursion =
and explicit iterators.
One identified spot is in net loopback driver, where we can avoid=20
netif_rx() and its slowdown if
sufficient stack space is available.
We introduce a generic arch_stack_can_recurse() which default to a weak=20
function returning 0.
On x86 arch, we implement following logic :
32 bits and 4K stacks (separate irq stacks) : can use up to 25% of sta=
ck
64 bits, 8K stacks (separate irq stacks) : can use up to 25% of sta=
ck
32 bits and 8K stacks (no irq stacks) : can use up to 50% of sta=
ck
Example of use in drivers/net/loopback.c, function loopback_xmit()
if (arch_stack_can_recurse())
netif_receive_skb(skb); /* immediate delivery to stack */
else
netif_rx(skb); /* defer to softirq handling */
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
I think ingo meant 'up to 75% used'. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Patch is OK, my english might be a litle bit unsual :) --
From: Ingo Molnar <mingo@elte.hu> I don't think it's safe. Every packet you receive can result in a sent packet, which in turn can result in a full packet receive path being taken, and yet again another sent packet. And so on and so forth. Some cases like this would be stack bugs, but wouldn't you like that bug to be a very busy cpu instead of a crash from overrunning the current stack? --
sure. But the core problem remains: our loopback networking scalability is poor. For plain localhost<->localhost connected sockets we hit the loopback device lock for every packet, and this very much shows up on real workloads on a quad already: the lock instruction in netif_rx is the most expensive instruction in a sysbench DB workload. and it's not just about scalability, the plain algorithmic overhead is way too high as well: $ taskset 1 ./bw_tcp -s $ taskset 1 ./bw_tcp localhost Socket bandwidth using localhost: 2607.09 MB/sec $ taskset 1 ./bw_pipe Pipe bandwidth: 3680.44 MB/sec i dont think this is acceptable. Either we should fix loopback TCP performance or we should transparently switch to VFS pipes as a transport method when an app establishes a plain loopback connection (as long as there are no frills like content-modifying component in the delivery path of packets after a connection has been established - which covers 99.9% of the real-life loopback cases). I'm not suggesting we shouldnt use TCP for connection establishing - but if the TCP loopback packet transport is too slow we should use the VFS transport which is both more scalable, less cache-intense and has lower straight overhead as well. Ingo --
From: Ingo Molnar <mingo@elte.hu> Set your loopback MTU to some larger value if this result and the locking overhead upsets you. Also, woe be to the application that wants fast local interprocess communication and doesn't use IPC_SHM, MAP_SHARED, pipes, or AF_UNIX sockets. (there's not just one better facility, there are _four_!) From this perspective, people way-overemphasize loopback performance, and 999 times out of 1000 they prove their points using synthetic benchmarks. And don't give me this garbage about the application wanting to be generic and therefore use IP sockets for everything. Either they want to be generic, or they want the absolute best performance. Trying to get an "or" and have both at the same time will result in ludicrious hacks ending up in the kernel. --
yes, of course it "upsets me" - it shows up in macrobenchmarks as well (not just lmbench) - wouldnt (and shouldnt) that upset you? And even with a ridiculously high MTU of 1048576 there's only a 13% improvement: # ifconfig lo mtu 1048576 # taskset 1 ./bw_tcp -s # taskset 1 ./bw_tcp localhost Socket bandwidth using localhost: 2951.51 MB/sec pipes are still another ~25% faster: # taskset 1 ./bw_pipe i talked about the localhost data transport only (in the portion you dropped from your quotes), not about the connection API or the overall management of such sockets. There's absolutely no good technical reason i can see why plain loopback sockets should be forced to go over a global lock, or why apps should be forced to change to another API when the real problem is that kernel developers are lazy or incompetent to fix their code. And i'm still trying to establish whether we have common ground for discussion: do you accept my numbers that TCP loopback transport performs badly when compared to pipes (i think you accepted that implicitly, but i dont want to put anything into your mouth). Having agreed on that, do you share my view that it should be and could be fixed? Or do you claim that it cannot be fixed and wont ever be fixed? Ingo --
Chiming in late here, but 1048576 can't possibly work with IP which uses a 16-bit quantity as the length header. In fact a quick test seems to indicate that an 1048576 mtu doesn't generate anything bigger than the default 16K mtu. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> Right. To move things forward, we should look into doing something similar to what Al Viro suggested, which would be to return an SKB pointer from the transmit path and call back into netif_receive_skb() using that. --
yep, basically the sk_peer trick that AF_UNIX is already using. it just seems rather more tricky in the 'real skb' localhost case because there's no real established trust path we can pass this coupling of the two sockets over. Netfilter might affect it and deny a localhost connection. Lifetime rules seem rather tricky as well: either end of the localhost connection can go away independently so a refcount to the socket has to be kept. skb->sk might be something to use, but it looks like a dangerous complication and it would burden the fastpath with an extra sk reference inc/dec. ... so i'm not implying that any of this is an easy topic to solve (to me at least :). But fact is that database connections over localhost are very common on web apps and it is very convenient as well. I use it myself - AF_UNIX transport is often non-existing in apps and libraries or is often just an afterthought with limitations - apps tend to gravitate towards a single API. So i dont think "use AF_UNIX" is an acceptable answer in this case. I believe we should try to make localhost transport comparably fast to AF_UNIX. Ingo --
From: Ingo Molnar <mingo@elte.hu> Please read again, that isn't the suggestion being discussed. What's being discussed is having the top of the transmit call path getting a socket "buffer" pointer, that it can feed back into the packet input path directly. Loopback would return buffer pointers from ->hard_start_xmit() instead of passing them netif_rx(). The top of the transmit call path, upon getting a non-NULL buffer returned, would pass it to netif_receive_skb(). We're not talking about sockets, although that is another idea (which I'm working on a patch for, and I have a mechanism for what you refer to as "path validation"). --
Yes this will definitely reduce the per-packet cost. The other low-hanging fruit is to raise the loopback MTU to just below 64K. I belive the current value is a legacy from the days when we didn't support skb page frags so everything had to be physically contiguous. Longer term we could look at generating packets > 64K on lo, for IPv6 anyway. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> It's legacy from when my top-of-the-line UltraSPARC-I 130Mhz cpus got the best loopback results using that value :-) --
