Re: SO_REUSEPORT?

Previous thread: [PATCH] ne: Use CONFIG_MACH_TX49XX by Atsushi Nemoto on Thursday, August 7, 2008 - 11:55 am. (2 messages)

Next thread: Re: OOPS, ip -f inet6 route get fec0::1, linux-2.6.26, ip6_route_output, rt6_fill_node+0x175 by Alexey Dobriyan on Thursday, August 7, 2008 - 4:37 pm. (12 messages)
To: <netdev@...>
Subject: SO_REUSEPORT?
Date: Thursday, August 7, 2008 - 12:57 pm

Hello,

We are looking at ways to scale TCP listeners. I think what we would like
is the ability to listen on a port from multiple threads (sockets bound to
the same port, INADDR_ANY, and no interface binding), which is what
SO_REUSEPORT would seem to allow. Has this ever been implemented for
Linux, or is there a good reason not to have it?

Thanks,
Tom
--
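
The arrangement Tom asks about (several sockets bound to one port, each of which could be owned by a different thread) can be sketched as follows. This is only an illustration, assuming a kernel and Python build that expose socket.SO_REUSEPORT, which the thread does not establish for Linux at the time of writing. The helper names are hypothetical, and the sketch binds to loopback rather than INADDR_ANY to stay self-contained.

```python
import socket


def make_reuseport_listener(port, host="127.0.0.1", backlog=128):
    """Create a TCP listener that is willing to share `port` with others.

    Assumption: the platform defines SO_REUSEPORT; OSError is raised if not.
    """
    if not hasattr(socket, "SO_REUSEPORT"):
        raise OSError("SO_REUSEPORT is not available on this platform")
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # The option must be set before bind() on every socket sharing the port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen(backlog)
    except OSError:
        sock.close()
        raise
    return sock


def open_listeners(count, host="127.0.0.1"):
    """Open `count` listeners that all share one kernel-chosen port."""
    first = make_reuseport_listener(0, host)  # port 0: kernel picks a free port
    port = first.getsockname()[1]
    rest = [make_reuseport_listener(port, host) for _ in range(count - 1)]
    return port, [first] + rest
```

Each returned socket has its own accept queue, so one thread per socket can call accept() without contending on a shared queue. How new connections are spread across those sockets is up to the kernel.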

To: Tom Herbert <therbert@...>
Cc: <netdev@...>
Date: Thursday, August 7, 2008 - 1:09 pm

On Linux, SO_REUSEADDR provides most of what SO_REUSEPORT provides on BSD.

In any case, there is absolutely no point in creating multiple TCP listeners.
Multiple threads can accept() on the same listener - at the same time.

--
Rémi Denis-Courmont
http://www.remlab.net/
--
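
The alternative Rémi describes, several threads calling accept() on one listener, might look like the sketch below. It is a minimal loopback demonstration, not a server design: the function names are hypothetical, and the short socket timeout is there only so the workers can be shut down cleanly.

```python
import collections
import socket
import threading


def accept_worker(listener, name, stop, tally, lock):
    """Accept connections on the shared listener until `stop` is set."""
    while not stop.is_set():
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            continue  # nothing for this thread yet; re-check `stop`
        with lock:
            tally[name] += 1  # record which thread won this connection
        with conn:
            conn.sendall(name.encode())  # tell the client who served it


def run_shared_listener(num_threads=4, num_clients=8):
    """Serve `num_clients` loopback connections from `num_threads` acceptors."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(128)
    listener.settimeout(0.2)  # lets workers notice `stop` periodically
    port = listener.getsockname()[1]

    stop, lock = threading.Event(), threading.Lock()
    tally = collections.Counter()
    workers = [
        threading.Thread(
            target=accept_worker,
            args=(listener, f"thread-{i}", stop, tally, lock),
        )
        for i in range(num_threads)
    ]
    for w in workers:
        w.start()
    for _ in range(num_clients):
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.recv(64)  # returns once some worker has served us
    stop.set()
    for w in workers:
        w.join()
    listener.close()
    return tally
```

The returned tally shows how many connections each thread accepted. Nothing in this scheme controls that split, which is the concern raised in the next message.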

To: <netdev@...>
Date: Thursday, August 7, 2008 - 1:58 pm

We've been doing that, but then on wakeup it would seem that we're at
the mercy of scheduling: basically, whichever thread wakes up first
will get to process the accept queue first. This seems to bias towards
threads running on the same CPU where the wakeup is called, and so this
method doesn't give us the even distribution of new connections across
the threads that we'd like.

Tom
--

To: Tom Herbert <therbert@...>
Cc: <netdev@...>
Date: Thursday, August 7, 2008 - 2:17 pm

How would the presence of multiple TCP LISTEN endpoints change that?
You'd then be at the mercy of whatever "scheduling" there was inside the
stack.

If you want to balance the threads, perhaps use a dispatch thread, or a
virtual one: each thread knows how many connections it is servicing, so
let it know how many the other threads are servicing, and if a thread
has N more connections than the other threads, have it not go into
accept() that time around. It might need some tweaking to handle
pathological starvation cases, like all the other threads being hung, I
suppose, but the basic idea is there.

--
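
The "virtual dispatcher" idea in the message above can be sketched as a small admission rule. The message does not say what "N more connections than the other threads" is measured against; this sketch assumes the least-loaded other thread. The names should_accept, max_lead, and simulate are hypothetical, and the simulation deliberately lets the lowest-numbered thread wake first every time to mimic the scheduling bias Tom reported.

```python
def should_accept(my_count, other_counts, max_lead):
    """Return True if this thread may go into accept() this time around.

    my_count:     connections this thread is currently servicing
    other_counts: connections each of the other threads is servicing
    max_lead:     the "N" from the message (assumed tunable, >= 1)
    """
    if not other_counts:
        return True  # a lone thread always accepts
    # Assumption: compare against the least-loaded other thread.
    return my_count - min(other_counts) < max_lead


def simulate(num_threads, num_connections, max_lead):
    """Hand out connections when thread 0 always gets first chance."""
    counts = [0] * num_threads
    for _ in range(num_connections):
        for i in range(num_threads):  # lower index == wakes up first
            others = counts[:i] + counts[i + 1:]
            if should_accept(counts[i], others, max_lead):
                counts[i] += 1
                break
    return counts
```

Under this rule the least-loaded thread is always eligible, so a connection is never left unaccepted, and no thread gets more than max_lead connections ahead of the least-loaded one. The starvation case mentioned above (other threads hung while still holding low counts) is not handled here.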

To: Rick Jones <rick.jones2@...>
Cc: Tom Herbert <therbert@...>, <netdev@...>
Date: Thursday, August 7, 2008 - 3:03 pm

On Thu, 07 Aug 2008 11:17:55 -0700

I suspect thread balancing would actually hurt performance!
You would be better off to have a couple of "hot" threads that are doing
all the work and stay in cache. If you push the work around to all the
threads, you have worst case cache behaviour.
--

To: Stephen Hemminger <stephen.hemminger@...>
Cc: Rick Jones <rick.jones2@...>, <netdev@...>
Date: Thursday, August 7, 2008 - 3:43 pm

On Thu, Aug 7, 2008 at 12:03 PM, Stephen Hemminger

I'm not sure that's applicable for us since the server application and
networking will max out all the CPUs on the host anyway; one way or
another we need to dispatch the work of incoming connections to
threads on different CPUs. If we do this in user space and do all
accepts in one thread, the CPU of that thread becomes the bottleneck
(we're accepting about 40,000 connections per second). If we have
multiple accept threads running on different CPUs, this helps some,
but the load is spread unevenly across the CPUs and we still can't get
the highest connection rate. So it seems we're looking for a method
that distributes the incoming connection load across CPUs pretty
evenly.

Tom

But we need to spread the load across multiple threads on different CPUs
--

To: Tom Herbert <therbert@...>
Cc: Stephen Hemminger <stephen.hemminger@...>, <netdev@...>
Date: Thursday, August 7, 2008 - 4:14 pm

Well, if you _really_ want the load spread, you may need to use a
multiqueue (at least inbound if not also later outbound) interface,
"know" how the NIC will hash and then have N distinct port numbers each
assigned to a LISTEN endpoint. The old song and dance about making an N
CPU system look as much like N single-CPU systems and all that...

Unless there are NICs you can "tell" where to send the interrupts, which
IMO is preferable - I have a preference for the application/scheduler
telling "networking" where to work rather than networking (or the NIC)
telling the scheduler where to run a thread - the archives of either
here or netnews will probably pull up stuff where I've talked about
Inbound Packet Scheduling (IPS) vs Thread Optimized Packet Scheduling
(TOPS) and limitations of simplistic address hashing to pick a
queue/processor/whatnot :)

rick jones
--

To: Rick Jones <rick.jones2@...>
Cc: Stephen Hemminger <stephen.hemminger@...>, <netdev@...>
Date: Thursday, August 7, 2008 - 7:05 pm

Yep that's what I really want, except for the fact that I can only use
a single port for the server-- all flows could be nicely distributed
by the NIC multiqueue, but I still have the problem of how to ensure
that the accepting thread for a connection is run on the same CPU as

NICs are already doing steering based on tuple hash (RSS), and I think
some will allow specifying the CPU for interrupt based on RX flow.
Maybe this would address the issues of Inbound Packet Scheduling?

Thanks for the pointers on IPS and TOPS. Out of curiosity has there
been an effort to do TOPS on Linux? We are doing something very
similar in software RSS with a fair amount of success (I posted
patches for this a while back).

Tom
--

To: Tom Herbert <therbert@...>
Cc: Stephen Hemminger <stephen.hemminger@...>, <netdev@...>
Date: Thursday, August 7, 2008 - 7:28 pm

All IPS in HP-UX 10.20 did was hash the IP/port numbers and queue based
on that - this at the handoff between driver and netisr. The problem
was if you had a thread of execution servicing more than one connection,
you would start whipsawing across the processors based on the remote
addressing.

There are IIRC indeed some NICs where you can give them a finite number
of tuples and say where each tuple should go. I'm sure those vendors if
watching can speak-up :) That sort of functionality can be useful and
would address the limitations of ISS/plain NIC header address hashing.
At least for long-lived connections. Or perhaps even long-lived LISTEN
endpoints :)

While you say you are constrained to a single port number, are you

I'm not sure. Anything is possible. The nice thing about TOPS in UX
11.X was/is the lookup was essentially free and didn't involve things
going across I/O busses. Start to have to update those tuple mappings
on the NIC with any frequency and that's the end of that.

rick
--
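
Rick's description of IPS (hash the IP/port numbers and queue on the result) and the whipsawing it causes can be illustrated with a toy model. CRC32 is a stand-in chosen for this sketch; it is not the hash HP-UX used, nor the one NICs use for RSS. The function names and the addresses in the usage note are made up.

```python
import zlib


def pick_cpu(src_ip, src_port, dst_ip, dst_port, num_cpus):
    """Map a connection 4-tuple to a CPU index by hashing the addressing.

    Assumption: CRC32 stands in for whatever hash the stack or NIC applies.
    """
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_cpus


def cpus_for_thread(remote_peers, local, num_cpus):
    """Return the set of CPUs that would process one thread's connections.

    remote_peers: iterable of (ip, port) pairs for the thread's connections
    local:        (ip, port) of the server endpoint
    """
    local_ip, local_port = local
    return {
        pick_cpu(ip, port, local_ip, local_port, num_cpus)
        for ip, port in remote_peers
    }
```

For example, cpus_for_thread([("192.0.2.1", 40000), ("192.0.2.2", 40001), ("192.0.2.3", 40002)], ("198.51.100.1", 80), 4) gives the CPUs that would handle inbound packets for a single thread serving three clients. Whenever that set has more than one member, the thread's work is pulled across processors by the remote addressing alone, which is the limitation of plain address hashing described above.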
