Re: [RFC PATCH 0/3] Simplified 16 bit Toeplitz hash algorithm

Previous thread: [RFC PATCH 2/3] ixgbe: example of how to update ixgbe to make use of in-kernel Toeplitz hash by Alexander Duyck on Friday, December 17, 2010 - 6:00 pm. (1 message)

Next thread: [PATCH 1/1] TCP: increase default initial receive window. by Nandita Dukkipati on Friday, December 17, 2010 - 8:20 pm. (14 messages)
From: Alexander Duyck
Date: Friday, December 17, 2010 - 6:00 pm

This patch series is meant to be a proof of concept for simplifying the cost
of Toeplitz hashing by reducing the complexity of the key to a 16 bit
repeating value.  The resultant advantages are that the hash computation
performance is significantly increased, and that the resultant hash is the
same for flows in either direction.

The idea for this occurred to me while working on the ATR hashing algorithms
and improving their performance.  ATR implements a 32 bit repeating key which
results in us being able to XOR everything down to a 32 bit value.  By using a
16 bit key we are able to cut down the 12 to 36 byte input value to only 2
bytes via XOR operations.  This reduces the resultant hash to 16 bits, however
since queue selection only requires 7 bits for RSS that still leaves us with a
large enough resultant key.

I'm currently not planning to do any more work on this in the near future as I
have several other projects in which I am currently engaged.  However I just
wanted to put this code out there in case anyone had a use for it.

Thanks,

Alex

---

Alexander Duyck (3):
      igb: example of how to update igb to make use of in-kernel Toeplitz hashing
      ixgbe: example of how to update ixgbe to make use of in-kernel Toeplitz hash
      net: add simplified 16 bit Toeplitz hash function for transmit side hashing


 drivers/net/igb/igb_main.c     |   22 ++++------
 drivers/net/ixgbe/ixgbe_main.c |   47 ++++++++++++---------
 include/linux/netdevice.h      |    2 +
 include/linux/toeplitz.h       |   89 ++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c                 |   68 +++++++++++++++++++++++++++++++
 5 files changed, 195 insertions(+), 33 deletions(-)
 create mode 100644 include/linux/toeplitz.h

-- 
--

From: Tom Herbert
Date: Monday, January 3, 2011 - 11:47 am

I'm not sure why this would be needed.  What is the a advantage in
making the TX and RX queues match?

On Fri, Dec 17, 2010 at 5:00 PM, Alexander Duyck
--

From: Alexander Duyck
Date: Monday, January 3, 2011 - 12:00 pm

If the application is affinitized and you are working with RX/TX pairs 
as we have in ixgbe then you can be certain that your buffers are 
staying in the same NUMA node or CPU as the application.  Having them on 
different NUMA nodes can hurt performance for either TX or RX.

The other advantage was that I didn't have to bother with trying to 
reorder the source and destination values when computing an RX hash or a 
TX hash.  I can just call the same function and regardless of direction 
I would get the same hash.  That way I could be guaranteed in a routing 
test that if I was using the RX hash to determine the TX queue that the 
queue number shouldn't change.

I believe the same thing is being accomplished in RPS/TPS via a test for 
the values and swapping them if source is greater than destination.

Thanks,

Alex
--

From: David Miller
Date: Monday, January 3, 2011 - 12:02 pm

From: Tom Herbert <therbert@google.com>

That's how their hardware based RFS essentially works.

Instead of watching for "I/O system calls" like we do in software, the
chip watches for which TX queue a flow ends up on and matches things
up on the receive side with the same numbered RX queue to match.
--

From: Ben Hutchings
Date: Monday, January 3, 2011 - 12:30 pm

ixgbe also implements IRQ affinity setting (or rather hinting) and TX
queue selection by CPU, the inverse of IRQ affinity setting.  Together
with the hardware/firmware Flow Director feature, this should indeed
result in hardware RFS.  (However, irqbalanced does not yet follow the
affinity hints AFAIK, so this requires some manual intervention.  Maybe
the OOT driver is different?)

The proposed change to make TX queue selection hash-based seems to be a
step backwards.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Alexander Duyck
Date: Monday, January 3, 2011 - 12:52 pm

Actually this code would only be applied in the case where Flow Director 
didn't apply such as non-TCP frames.  It would essentially guarantee 
that we end up with TX/RX on the same CPU for all cases instead of just 
when Flow Director matches a given flow.

The general idea is to at least keep the traffic local to one TX/RX 
queue pair so that if we cannot match the queue pair to the application, 
perhaps the application can be affinitized to match up with the queue 
pair.  Otherwise we end up with traffic getting routed to one TX queue 
on one CPU, and the RX being routed to another queue on perhaps a 
different CPU and it becomes quite difficult to match up the queues and 
the applications.

Since the approach is based on Toeplitz it can be applied to all 
hardware capable of generating a Toeplitz based hash and as a result it 
would likely also work in a much more vendor neutral kind of way than 
Flow Director currently does.

Thanks,

Alex
--

From: David Miller
Date: Monday, January 3, 2011 - 12:54 pm

From: Alexander Duyck <alexander.h.duyck@intel.com>

Immediately this excludes NIU, for one, since it does not use Topelitz.
--

From: Ben Hutchings
Date: Monday, January 3, 2011 - 1:15 pm

Right.  That certainly seems like a Good Thing, though I believe it can
be implemented generically by recording the RX queue number on the
socket:


Which I appreciate, but I'm not convinced that weakening Toeplitz is a
good way to do it.

I understand that Robert Watson (FreeBSD hacker) has been doing some
research on the security and performance implications of flow hashing
algorithms, though I haven't seen any results of that yet.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Alexander Duyck
Date: Monday, January 3, 2011 - 2:45 pm

Actually it does, it only takes effect in the case that flow director 
isn't enabled.  I just implemented it as a ndo_select_queue and then in 
the case of the igb example I applied it directly, and in the case of 
the ixgbe example I just added it to the end of the ndo_select_queue 

That was one of the reasons why I put this chunk of code out there as an 
RFC as I didn't see anywhere where it really fit in.  I wasn't sure if 
anyone had a use for it or not, but I didn't see much point in keeping 

I wasn't really sure about it either, but from what I can tell Toeplitz 
is pretty weak in the first place, especially if using a static key, but 
really hard to do efficiently in software with a full 40 byte key.

The advantages of the 16 bit key were that I could do the hash 
computation with little CPU overhead and then I also was able to 
generate the symmetric hash result so I didn't have to mess with source 
and destination field ordering to generate the TX hash.  Since most of 
the hardware I am familiar with doesn't support more than 128 queues 
anyway the 16 bit hash input and result generated via this approach 
should be more than enough to handle the queue selection and 
distribution needs of the hardware which was my only real concern.

Thanks for the input,

Alex




--

From: Tom Herbert
Date: Monday, January 3, 2011 - 8:25 pm

I still don't see the value in doing this RX/TX queue pairing (unless
you're considering the possibility of explicitly binding sockets to
queue pairs).  XPS should be sufficient mechanism to get affinity on
sending side.  Also, don't know how the queue paring model will be
maintained when using priority queues on transmit-- transmit is likely
to be asymmetric to receive side.  The ability to seamlessly decouple
The device hash should already be available in sk_rxhash, so maybe
that could be used for this purpose.  I think it is a good property to
keeping treat the device hashes as opaque values, any reasonable
32-bit 4-tuple hash should work equally well in the stack.
--

From: Ben Hutchings
Date: Tuesday, January 4, 2011 - 8:43 am

Sure, the real value is in getting TX completions to line up with TX
[...]

At least if it's configured properly... or if this is automated using my
irq_cpu_rmap.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

Previous thread: [RFC PATCH 2/3] ixgbe: example of how to update ixgbe to make use of in-kernel Toeplitz hash by Alexander Duyck on Friday, December 17, 2010 - 6:00 pm. (1 message)

Next thread: [PATCH 1/1] TCP: increase default initial receive window. by Nandita Dukkipati on Friday, December 17, 2010 - 8:20 pm. (14 messages)