From: Tom Herbert <therbert@google.com> If the hash is good it will distribute the load properly. If the NIC is sophisticated enough (Sun's Neptune chipset is) you can even group interrupt distribution by traffic type and even bind specific ports to interrupt groups. I really detest all of these software hacks that add overhead to solve problems the hardware can solve for us. --
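As a rough illustration of what "the hash distributes the load" means, here is a minimal sketch that maps flows to receive queues by hashing the 4-tuple. Everything in it is an assumption made for the example: `flow_key`, `pick_queue` and `queue_histogram` are hypothetical names, and `zlib.crc32` is only a stand-in for whatever hash a given NIC or stack actually computes.

```python
import ipaddress
import struct
import zlib
from collections import Counter


def flow_key(src_ip, dst_ip, src_port, dst_port):
    """Pack an IPv4 4-tuple into bytes (hypothetical layout for this sketch)."""
    return (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )


def pick_queue(key, num_queues):
    """Map a flow key to a queue index; crc32 is a stand-in hash only."""
    return zlib.crc32(key) % num_queues


def queue_histogram(flows, num_queues):
    """Count how many flows land on each queue."""
    counts = Counter(pick_queue(flow_key(*f), num_queues) for f in flows)
    return [counts.get(q, 0) for q in range(num_queues)]


if __name__ == "__main__":
    # 4000 synthetic flows that differ only in source port.
    flows = [("10.0.0.1", "10.0.0.2", 1024 + i, 80) for i in range(4000)]
    print(queue_histogram(flows, 8))
```

Because the queue is a pure function of the flow key, every packet of a flow lands on the same queue, so packets within a flow are not reordered. How even the histogram comes out depends entirely on the hash and on the traffic mix.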
I appreciate this philosophy, but unfortunately I don't have the luxury of working with a NIC that solves these problems. The reality may be that we're trying to squeeze performance out of crappy hardware to scale on multi-core. Left alone we couldn't get the stack to scale, but with these "detestable hacks" we've gotten 3X or so improvement in packets per second across both our dumb 1G and 10G NICs. These gains have translated into tangible application performance gains, so we'll probably continue to have interest in this area of development at least for the foreseeable future. --
From: Tom Herbert <therbert@google.com>
Do these NICs at least support multiqueue?
--
I don't think they do. See the last paragraph in Tom's first email. I think we all agree that hacks such as these are only useful for NICs that either don't support mq or have too few rx queues. The question is how much do we love these NICs :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Yes, we are using a 10G NIC that supports multi-queue. The number of RX queues supported is half the number of cores on our platform, so that is going to limit the parallelism. With multi-queue turned on we do see about 4X improvement in pps over just using a single queue; this is about the same improvement we see using a single queue with our software steering techniques (this particular device provides the Toeplitz hash). Enabling HW multi-queue does come with somewhat higher CPU utilization, though; the extra device interrupt load is not free. We actually use the HW multi-queue in conjunction with our software steering to get maximum pps (about 20% more). --
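For readers unfamiliar with the Toeplitz hash mentioned above, here is a minimal sketch of the 32-bit variant used for receive-side scaling, followed by a toy steering step. It is not the patch under discussion. `KEY` is an arbitrary 40-byte value invented for the example, `steer` is a hypothetical helper, and picking a CPU with a plain modulo is a simplification (hardware typically indexes an indirection table with the low bits of the hash).

```python
import ipaddress
import struct

# Hypothetical 40-byte secret key, chosen only for this sketch.
KEY = bytes(range(1, 41))


def toeplitz_hash(key, data):
    """32-bit Toeplitz hash.

    For every set bit i of data (MSB first), XOR in the 32-bit window
    of the key that starts at bit i.
    """
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least len(data) + 4 bytes")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            shift = key_bits - 32 - i  # window covers key bits i .. i+31
            result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def steer(src_ip, dst_ip, src_port, dst_port, num_cpus, key=KEY):
    """Pick a target CPU for an IPv4 flow (modulo is a simplification)."""
    data = (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )
    return toeplitz_hash(key, data) % num_cpus


if __name__ == "__main__":
    print(steer("10.0.0.1", "10.0.0.2", 40000, 80, num_cpus=16))
```

When the device already delivers this hash with each packet, as ours does, the software steering step only has to do the final lookup, which is part of why it can be cheap.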
The standard wisdom is that you don't necessarily need to transmit to each core, but rather to each shared mid or last level cache. Once the data is cache hot (or cache near), distributing it further in software is comparably cheap. So this means you don't necessarily need as many queues as cores, but rather as many as there are big caches. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
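One way to picture the "one queue per shared cache" idea is a two-stage choice: the first stage picks a cache domain (which would correspond to a hardware queue), the second spreads flows across the cores behind that cache in software. The sketch below assumes a made-up topology; `TOPOLOGY`, `group_by_cache` and `two_stage_target` are hypothetical names, and using the upper hash bits for the second stage is just one possible choice.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical topology: CPU id -> id of the shared cache it sits behind.
TOPOLOGY = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}


def group_by_cache(cpu_to_cache: Dict[int, int]) -> Dict[int, List[int]]:
    """Group CPU ids by the shared cache they belong to."""
    groups = defaultdict(list)
    for cpu in sorted(cpu_to_cache):
        groups[cpu_to_cache[cpu]].append(cpu)
    return dict(groups)


def two_stage_target(flow_hash: int, groups: Dict[int, List[int]]) -> Tuple[int, int]:
    """Return (cache_id, cpu) for a 32-bit flow hash.

    Stage 1 (what a hardware queue would do): pick the cache domain.
    Stage 2 (software): pick a core within that domain, using the
    upper hash bits so the two choices are not tied to the same bits.
    """
    cache_ids = sorted(groups)
    cache_id = cache_ids[flow_hash % len(cache_ids)]
    cpus = groups[cache_id]
    cpu = cpus[(flow_hash >> 16) % len(cpus)]
    return cache_id, cpu


if __name__ == "__main__":
    groups = group_by_cache(TOPOLOGY)
    print(len(groups), "queues would cover", len(TOPOLOGY), "cores")
    print(two_stage_target(0x00030001, groups))
```

With this topology two queues would be enough for eight cores, which is the point being made: the queue count follows the number of large shared caches, not the number of cores.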
From: Tom Herbert <therbert@google.com> This is a non-intuitive observation. Using HW multiqueue should be cheaper than doing it in software, right? --
Shared caches can play games with the numbers, we need to look at this a bit more. Cheers, -- Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au> --
I suppose it may be counter-intuitive, but I am not making a general claim. I would only suggest that these software hacks could be a very good approximation or substitute for hardware functionality. This is a generic way to get more performance out of deficient or lower end NICs. --
From: Tom Herbert <therbert@google.com> They certainly could. Why don't you post the current version of your patches so we have something concrete to discuss? --
