Version 5 of RFS:
- Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a static function.
- Apply limits to rps_sock_flow_entries sysctl and rps_flow_count sysfs variable.
---
This patch implements receive flow steering (RFS). RFS steers received packets for layer 3 and 4 processing to the CPU where the application for the corresponding flow is running. RFS is an extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg (or sendmsg) the application's running CPU is stored in a hash table that is indexed by the connection's rxhash which is stored in the socket structure. The rxhash is passed in skb's received on the connection from netif_receive_skb. For each received packet, the associated rxhash is used to look up the CPU in the hash table; if a valid CPU is set, the packet is steered to that CPU using the RPS mechanisms.

The complication of the simple approach is that it would potentially allow OOO packets. If threads are thrashing around CPUs or multiple threads are trying to read from the same sockets, a quickly changing CPU value in the hash table could cause rampant OOO packets-- we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_flow_table is a global hash table. Each entry is just a CPU number and it is populated in recvmsg and sendmsg as described above. This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue. Each entry contains a CPU and a tail queue counter. The CPU is the "current" CPU for a matching flow. The tail queue counter holds the value of a tail queue counter for the associated CPU's backlog queue at the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented on dequeue, and so a queue tail counter is computed as queue head count + queue length. When a packet is ...
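The counter bookkeeping described above can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: the names backlog_queue, dev_flow_entry, flow_enqueue and queue_dequeue are illustrative and not taken from the patch, and the rule that decides when a flow may move to a new CPU is deliberately left out because the description is cut off before it.

```c
#include <assert.h>
#include <stdint.h>

/* Per-CPU backlog queue; only the counters matter for this sketch. */
struct backlog_queue {
    unsigned int head;  /* incremented on every dequeue */
    unsigned int len;   /* packets currently queued */
};

/* One entry of a per-device-queue flow table (hypothetical layout). */
struct dev_flow_entry {
    uint16_t cpu;            /* "current" CPU for the matching flow */
    unsigned int last_tail;  /* tail counter seen at the last enqueue */
};

/* Queue tail counter = queue head count + queue length. */
static unsigned int queue_tail(const struct backlog_queue *q)
{
    return q->head + q->len;
}

/* Enqueue one packet of a flow and remember the resulting tail counter. */
static void flow_enqueue(struct dev_flow_entry *e, uint16_t cpu,
                         struct backlog_queue *q)
{
    q->len++;
    e->cpu = cpu;
    e->last_tail = queue_tail(q);
}

/* Dequeue one packet: the head counter advances, the length shrinks. */
static void queue_dequeue(struct backlog_queue *q)
{
    assert(q->len > 0);
    q->len--;
    q->head++;
}
```

Note that a dequeue leaves queue_tail() unchanged while the head counter advances, so the value stored in last_tail stays comparable with later head values.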
From: Tom Herbert <therbert@google.com> I've read this over a few times and I think it's ready to go into net-next-2.6, we can tweak things as-needed from here on out. Eric, what do you think? --
I read the patch and found no error.
I booted a test machine and performed some tests
I am a bit worried about a tbench regression I am looking at right now.
if RFS disabled , tbench 16 -> 4408.63 MB/sec
# grep . /sys/class/net/lo/queues/rx-0/*
/sys/class/net/lo/queues/rx-0/rps_cpus:00000000
/sys/class/net/lo/queues/rx-0/rps_flow_cnt:8192
# cat /proc/sys/net/core/rps_sock_flow_entries
8192
echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus
tbench 16 -> 2336.32 MB/sec
-----------------------------------------------------------------------------------------------------------------------------------------------------
PerfTop: 14561 irqs/sec kernel:86.3% [1000Hz cycles], (all, 16 CPUs)
-----------------------------------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ______________________________ __________________________________________________________
2664.00 5.1% copy_user_generic_string /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
2323.00 4.4% acpi_os_read_port /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
1641.00 3.1% _raw_spin_lock_irqsave /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
1260.00 2.4% schedule /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
1159.00 2.2% _raw_spin_lock /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
1051.00 2.0% tcp_ack /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
991.00 1.9% tcp_sendmsg /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
922.00 1.8% tcp_recvmsg /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
821.00 1.6% ...

Hmm, I wonder if it's not an artifact of net-next-2.6 being a bit old (versus linux-2.6). I know scheduler guys did some tweaks. Because apparently, some cpus are idle part of their time (30% ???) Or a new bug on cpu accounting, reporting idle time while cpus are busy....

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b swpd    free  buff cache si so bi  bo     in      cs us sy id wa
16  0    0 5670264 13280 63392  0  0  2   1   1512     227 12 47 41  0
18  0    0 5669396 13280 63392  0  0  0   0 657952 1606102 14 58 28  0
17  0    0 5668776 13288 63392  0  0  0  12 656701 1606369 14 58 28  0
18  0    0 5669644 13288 63392  0  0  0   0 657636 1603960 15 57 28  0
17  0    0 5670900 13288 63392  0  0  0   0 666425 1584847 15 56 29  0
15  0    0 5669164 13288 63392  0  0  0   0 682578 1472616 14 56 30  0
16  0    0 5669412 13288 63392  0  0  0   0 695767 1506302 14 54 32  0
14  0    0 5668916 13296 63396  0  0  4 148 685286 1482897 14 56 30  0
17  0    0 5669784 13296 63396  0  0  0   0 683910 1477994 14 56 30  0
18  0    0 5670032 13296 63396  0  0  0   0 692023 1497195 14 55 31  0
16  0    0 5669040 13296 63396  0  0  0   0 677477 1468157 14 56 30  0
16  0    0 5668916 13312 63396  0  0  0  32 489358 1048553 14 57 30  0
18  0    0 5667924 13320 63396  0  0  0  12 424787  897145 15 55 29  0

RFS off :
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b swpd    free  buff cache si so bi  bo     in      cs us sy id wa
24  0    0 5669624 13632 63476  0  0  2   1    261      82 12 48 40  0
26  0    0 5669492 13632 63476  0  0  0   0   4223 1740651 21 71  7  0
23  0    0 5669864 13640 63476  0  0  0  12   4205 1731882 21 71  8  0
23  0    0 5670484 13640 63476  0  0  0   0 ...
From: Eric Dumazet <eric.dumazet@gmail.com> I synced net-next-2.6 up with Linus's current tree just a day or two ago when I pulled net-2.6 into net-next-2.6. --
OK thanks :)
Tom, please add a read_mostly to rps_sock_flow_table
struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;
I'll spend some hours today to track the problem. --
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index d7107ac..7abf959 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2205,7 +2205,7 @@ DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
#ifdef CONFIG_RPS
/* One global table that all flow-based protocols share. */
-struct rps_sock_flow_table *rps_sock_flow_table;
+struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;
EXPORT_SYMBOL(rps_sock_flow_table);
/*
--
From: Eric Dumazet <eric.dumazet@gmail.com> Applied, thanks Eric. --
Eric, thanks for testing that. Admittedly, we have looked at enabling RFS/RPS over loopback. I'll look at that today also. --
Hi Tom I am sorry, but I could not work on this today. I hope I can find some --
Results with "tbench 16" on an 8 core Intel machine.
No RPS/RFS: 2155 MB/sec
RPS (0ff mask): 1700 MB/sec
RFS: 1097 MB/sec
I am not particularly surprised by the results, using loopback interface already provides good parallelism and RPS/RFS really would only add overhead and more trips between CPUs (last part is why RPS < RFS I suspect)-- I guess this is why we've never enabled RPS on loopback :-)
Eric, do you have a particular concern that this could affect a real workload?
Tom --
I was expecting RFS to be better than RPS at least, for this particular workload (tcp over loopback) With RPS, the hash function of (127.0.0.1, port1, 127.0.0.1, port2) is different than (127.0.0.1, port2, 127.0.0.1, port1), so basically we force the server to run on different processor than client However, I was expecting that with RFS, client and server would run on same cpu. Maybe we could change (for a test) hash function to use (sport ^ dport) instead of (sport << 16) + dport --
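The difference between the two port mixes mentioned above can be checked directly. The following is a minimal sketch, not kernel code; the function names are hypothetical and the port numbers in the usage note are arbitrary.

```c
#include <stdint.h>

/* Order-sensitive mix, as described above: (sport << 16) + dport.
 * Swapping the two ports gives a different value. */
static uint32_t ports_ordered(uint16_t sport, uint16_t dport)
{
    return ((uint32_t)sport << 16) + dport;
}

/* Order-insensitive mix suggested for the test: sport ^ dport.
 * Both directions of a connection give the same value. */
static uint32_t ports_symmetric(uint16_t sport, uint16_t dport)
{
    return (uint32_t)(sport ^ dport);
}
```

For a loopback connection between ports 445 and 50000, ports_ordered() differs between the client-to-server and server-to-client directions, while ports_symmetric() is identical for both, so both ends of the connection hash to the same value.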
Blah, I mistakenly reported that... should have been:
No RPS/RFS: 2155 MB/sec
RPS (0ff mask): 1097 MB/sec
RFS: 1700 MB/sec
This was my expectation too, and what my "corrected" numbers show :-) But, I take it this is different in your results?
Tom --
My results are on a "tbench 16" on a dual X5570 @ 2.93GHz. (16 logical cpus)
No RPS , no RFS : 4448.14 MB/sec
RPS : 2298.00 MB/sec (but lot of variation)
RFS : 2600 MB/sec
Maybe my RFS setup is bad ? (8192 flows) --
Very strange, a second tbench-16 RFS=y run gave me 2134.08 MB/sec A third run gave me 1813.21 MB/sec A fourth run gave me 2472.91 MB/sec Hmm... --
With attached patch, I reached
Throughput 4465.13 MB/sec 16 procs
RFS better than no RPS/RFS :)
So, the old idea to make rxhash consistent (same value in both
directions) is a win for some workloads (Consider connection tracking /
firewalling)
port1 = ...
port2 = ...
addr1 = ...
addr2 = ...
if (addr1 > addr2)
exchange(addr1, addr2)
if (port1 > port2)
exchange(port1, port2)
hash = jhash(addr1, addr2, (port1<<16)+port2, ...)
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..6b757ff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2280,8 +2280,10 @@ static int get_rps_cpu(struct net_device *dev,
struct sk_buff *skb,
case IPPROTO_AH:
case IPPROTO_SCTP:
case IPPROTO_UDPLITE:
- if (pskb_may_pull(skb, (ihl * 4) + 4))
- ports = *((u32 *) (skb->data + (ihl * 4)));
+ if (pskb_may_pull(skb, (ihl * 4) + 4)) {
+ u16 *_ports = (u16 *)(skb->data + (ihl * 4));
+ ports = _ports[0] ^ _ports[1];
+ }
break;
default:
--
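The pseudocode for a consistent rxhash (order the two addresses, order the two ports, then hash) can be written out as a small user-space sketch. Assumptions: mix3() is a hypothetical stand-in for jhash, which is not reproduced here, and consistent_flow_hash() is an illustrative name, not a kernel function.

```c
#include <stdint.h>

/* Stand-in mixer (hypothetical); the kernel code uses jhash instead. */
static uint32_t mix3(uint32_t a, uint32_t b, uint32_t c, uint32_t seed)
{
    uint32_t h = seed;

    h = (h ^ a) * 0x9e3779b1u;
    h = (h ^ b) * 0x85ebca6bu;
    h = (h ^ c) * 0xc2b2ae35u;
    return h ^ (h >> 16);
}

/* Same value for both directions of a flow: addresses and ports are
 * each put into a canonical (smaller first) order before hashing. */
static uint32_t consistent_flow_hash(uint32_t saddr, uint32_t daddr,
                                     uint16_t sport, uint16_t dport,
                                     uint32_t seed)
{
    uint32_t addr1 = saddr, addr2 = daddr;
    uint16_t port1 = sport, port2 = dport;

    if (addr1 > addr2) {
        uint32_t tmp = addr1;
        addr1 = addr2;
        addr2 = tmp;
    }
    if (port1 > port2) {
        uint16_t tmp = port1;
        port1 = port2;
        port2 = tmp;
    }
    return mix3(addr1, addr2, ((uint32_t)port1 << 16) + port2, seed);
}
```

Because addresses and ports are ordered independently, the flows A:p1 -> B:p2 and A:p2 -> B:p1 also produce the same value; that is one source of extra collisions with this kind of scheme.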
That's cool!, but I still like the idea that this hash is treated as an opaque value, getting the hash from the device to avoid the jhash or cache misses on the packet can also be a win... Maybe connection tracking/firewall could use the skb->rxhash which provides the --
consistent rxhash only adds the risk of the hash collision, and I don't think it is a big problem. For connection tracking/firewall use, I am afraid that we have to recompute this value after defrag. So we have to export the hash function we used in RPS. As NIC's hash function can be changed dynamically, the rxhash isn't consistent, so the rxhash can't be used by connection tracking, socket lookup and others come later. -- Regards, Changli Gao(xiaosuo@gmail.com) --
I have to agree with Eric and Changli here. It's especially true if you're passively tracking via one NIC, where all traffic is just forwarded. In this scenario, you need to compute consistent hashes. rxhashes by NIC will be different for "incoming" and "outgoing" traffic... Where rxhash by NIC can be used (note: didn't say _useful_) are scenarios with different net ports for incoming and outgoing traffic (in active but also passive traffic scenarios). Here, rxhashes could be used on a per-port basis, but associating two seemingly separate rxhashes with one another to match CPUs is a really annoying task. This would involve computing the corresponding "txhash" and looking it up, which is what we'd be doing with the jhash anyway. For proper flow tracking Eric's suggestion is the way to go. And if there are worries about collisions, why not add IPPROTO_* to the mix. Franco --
From: Eric Dumazet <eric.dumazet@gmail.com> Fun :-) I toyed around with this on my 128 cpu machine (2 NUMA nodes, 64 cpus each NUMA node).

Vanilla net-next-2.6, no configuration changes:
tbench 64: Throughput 1843.43 MB/sec 64 procs
tbench 128: Throughput 1889.67 MB/sec 128 procs

Vanilla net-next-2.6, rps_cpus="ffffffff,ffffffff,ffffffff,ffffffff"
tbench 64: Throughput 1455.89 MB/sec 64 procs
tbench 128: Throughput 2009.91 MB/sec 128 procs

net-next-2.6 + Eric's port hashing patch, rps_cpus="ffffffff,ffffffff,ffffffff,ffffffff"
tbench 64: Throughput 1593.13 MB/sec 64 procs
tbench 128: Throughput 2367.27 MB/sec 128 procs --
From: David Miller <davem@davemloft.net> Eric, I think there is agreement that your patch is not a bad idea. Your original posting had whitespace damage in the patch plus I want to see a proper commit message and signoff, so could you please submit this formally? Thanks! --
Hmm, this was not a formal patch, just an information. Problem is if hardware provides rxhash, will it be "consistent" too ? --
From: Eric Dumazet <eric.dumazet@gmail.com> Yes, it is an issue. I am not aware of whether the Toeplitz hash computed by cards is impervious to the order of the input bits or not, probably it is. I was thinking also about how we could compute rxhash in the loopback driver :-) --
This would be easy if rxhash was not a "struct inet_sock" field but a "struct sock" one
sock_alloc_send_pskb() (or skb_set_owner_w())
skb->rxhash = sk->rxhash; --
From: Eric Dumazet <eric.dumazet@gmail.com>
Agreed. I'll commit the following to net-next-2.6 after some build
testing.
net: Make RFS socket operations not be inet specific.
Idea from Eric Dumazet.
As for placement inside of struct sock, I tried to choose a place
that otherwise has a 32-bit hole on 64-bit systems.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index c1d4295..1653de5 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -102,7 +102,6 @@ struct rtable;
* @uc_ttl - Unicast TTL
* @inet_sport - Source port
* @inet_id - ID counter for DF pkts
- * @rxhash - flow hash received from netif layer
* @tos - TOS
* @mc_ttl - Multicasting TTL
* @is_icsk - is this an inet_connection_sock?
@@ -126,9 +125,6 @@ struct inet_sock {
__u16 cmsg_flags;
__be16 inet_sport;
__u16 inet_id;
-#ifdef CONFIG_RPS
- __u32 rxhash;
-#endif
struct ip_options *opt;
__u8 tos;
@@ -224,37 +220,4 @@ static inline __u8 inet_sk_flowi_flags(const struct sock *sk)
return inet_sk(sk)->transparent ? FLOWI_FLAG_ANYSRC : 0;
}
-static inline void inet_rps_record_flow(const struct sock *sk)
-{
-#ifdef CONFIG_RPS
- struct rps_sock_flow_table *sock_flow_table;
-
- rcu_read_lock();
- sock_flow_table = rcu_dereference(rps_sock_flow_table);
- rps_record_sock_flow(sock_flow_table, inet_sk(sk)->rxhash);
- rcu_read_unlock();
-#endif
-}
-
-static inline void inet_rps_reset_flow(const struct sock *sk)
-{
-#ifdef CONFIG_RPS
- struct rps_sock_flow_table *sock_flow_table;
-
- rcu_read_lock();
- sock_flow_table = rcu_dereference(rps_sock_flow_table);
- rps_reset_sock_flow(sock_flow_table, inet_sk(sk)->rxhash);
- rcu_read_unlock();
-#endif
-}
-
-static inline void inet_rps_save_rxhash(struct sock *sk, u32 rxhash)
-{
-#ifdef CONFIG_RPS
- if (unlikely(inet_sk(sk)->rxhash != rxhash)) {
- inet_rps_reset_flow(sk);
- inet_sk(sk)->rxhash = rxhash;
- }
-#endif
-}
...
Acked-by: Eric Dumazet <eric.dumazet@gmail.com> I tested same patch today (plus the skb->rxhash = sk->sk_rxhash) and got a very small speedup on my Nehalem machine, where get_rps_cpu() was using 1 % of cpu, now 0.25 %, on a tbench. --
From: Eric Dumazet <eric.dumazet@gmail.com> Great, I've added your ACK. Thanks! --
Does this problem have any relationship with your patch? No. If the rxhash isn't provided by hardware, we can get more throughput from your patch, and on the other side, we don't lose anything but potentially more hash collisions. -- Regards, Changli Gao(xiaosuo@gmail.com) --
I am not sure what you call hash collision. There is no hash chain here. This 32bit hash is a jhash one, and we only need 1 to 12 bits in it, I am pretty sure it's OK. --
In case we compute a software skb->rxhash, we can generate a consistent
hash : Its value will be the same in both flow directions.
This helps some workloads, like conntracking, since the same state needs
to be accessed in both directions.
tbench + RFS + this patch gives better results than tbench with default
kernel configuration (no RPS, no RFS)
Also fixed some sparse warnings.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/core/dev.c | 25 ++++++++++++++++++-------
1 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 05a2b29..cb150ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1974,7 +1974,7 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
if (skb->sk && skb->sk->sk_hash)
hash = skb->sk->sk_hash;
else
- hash = skb->protocol;
+ hash = (__force u16) skb->protocol;
hash = jhash_1word(hash, hashrnd);
@@ -2253,8 +2253,8 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
ip = (struct iphdr *) skb->data;
ip_proto = ip->protocol;
- addr1 = ip->saddr;
- addr2 = ip->daddr;
+ addr1 = (__force u32) ip->saddr;
+ addr2 = (__force u32) ip->daddr;
ihl = ip->ihl;
break;
case __constant_htons(ETH_P_IPV6):
@@ -2263,8 +2263,8 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
ip6 = (struct ipv6hdr *) skb->data;
ip_proto = ip6->nexthdr;
- addr1 = ip6->saddr.s6_addr32[3];
- addr2 = ip6->daddr.s6_addr32[3];
+ addr1 = (__force u32) ip6->saddr.s6_addr32[3];
+ addr2 = (__force u32) ip6->daddr.s6_addr32[3];
ihl = (40 >> 2);
break;
default:
@@ -2279,14 +2279,25 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
case IPPROTO_AH:
case IPPROTO_SCTP:
case IPPROTO_UDPLITE:
- if (pskb_may_pull(skb, (ihl * 4) + 4))
- ports = *((u32 *) (skb->data + (ihl * 4)));
+ if (pskb_may_pull(skb, (ihl * 4) + 4)) {
+ __be16 *hports = (__be16 *) (skb->data + (ihl * ...
From: Eric Dumazet <eric.dumazet@gmail.com> Applied. --
I thought about this for some time... Do we really need the port numbers here at all? A simple addr1^addr2 can provide a good enough pointer for distribution amongst CPUs. The real connection tracking is better done locally at the corresponding CPU. That way a potential cache miss can be avoided and the still needed hash calculation for connection tracking will be offloaded. Franco --
Yes, doing the port test/swap is useful in the loopback case (addr1 == addr2). This is probably a bit convoluted, but David (and me) found this funny ;) --
It is funny, but I fail to see the big picture of the firewall / conntrack application here. It looks like this is needed for local netperf tests to impress, but it's a quite special use case, isn't it? --
I know many applications using TCP on loopback, they are real :) What I find 'funny' are not the tbench results, but the fact that RFS can give pretty good hints to process scheduler, something that might be good to investigate by scheduler specialists. In the meantime, if some admin finds that setting RFS on loopback can boost its application by 10%, why not ? --
From: Eric Dumazet <eric.dumazet@gmail.com> This is all true and I support your hashing patch and all of that. But if we really want TCP over loopback to go fast, there are much better ways to do this. Eric, do you remember that "TCP friends" rough patch I sent you last year that essentially made TCP sockets over loopback behave like AF_UNIX ones and just queue the SKBs directly to the destination socket without doing any protocol work? If we ever got that working, tbench performance would become impressive :) --
I think it will break some benchmark tools. The loopback device is for testing networking protocol stacks, so we shouldn't bypass the protocol processing. And anyone who has a performance problem of loopback device should turn to UNIX domain socket. For routers, how about letting users choose whether RPS mixes layer 4 info in? -- Regards, Changli Gao(xiaosuo@gmail.com) --
From: Changli Gao <xiaosuo@gmail.com> Other systems already do this optimization, so if things break, this breakage is already pervasive. We should be able to tell people that they can use TCP solely in their applications and it will perform optimally regardless of transport. People already code their applications this way, and ignoring this issue would just make us stupid. --
From: Tom Herbert <therbert@google.com> I'll see if I can find it, I sent it to Eric more than a year ago... The basic scheme was pretty simple:
1) Add "struct sock *friend" to struct sk_buff
2) TCP initial handshake SYN and SYN+ACK transmits set "skb->friend = sk" and TCP receive path notices this and stores this 'friend' socket pointer locally in the newly created connection socket. The purpose of skb->friend is to let the receiving socket on loopback see that the other end is on the local system and can be directly communicated to.
3) TCP sendmsg queues data directly to sk->friend's receive queue instead of sending TCP protocol packets.
The only complications come from making sendmsg and recvmsg not try to do all of the sequence handling and checking, stuff like that. Also, URG would need to be dealt with somehow too. I'm sure someone suitably motivated could get a working patch going in no time :-) --
From: Tom Herbert <therbert@google.com>
I was finally able to unearth a copy, it's completely raw, it's at least
a year old, and it's not fully implemented at all.
But you asked for it :-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 299ec4b..7f855d3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -206,6 +206,7 @@ typedef unsigned char *sk_buff_data_t;
* @mac_header: Link layer header
* @dst: destination entry
* @sp: the security path, used for xfrm
+ * @friend: loopback friend socket
* @cb: Control buffer. Free for use by every layer. Put private vars here
* @len: Length of actual data
* @data_len: Data length
@@ -262,6 +263,7 @@ struct sk_buff {
struct rtable *rtable;
};
struct sec_path *sp;
+ struct sock *friend;
/*
* This is the control buffer. It is free to use for every
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index b220b5f..52b2f7a 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -53,6 +53,7 @@ struct request_sock {
unsigned long expires;
const struct request_sock_ops *rsk_ops;
struct sock *sk;
+ struct sock *friend;
u32 secid;
u32 peer_secid;
};
diff --git a/include/net/sock.h b/include/net/sock.h
index dc42b44..3e86190 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -137,6 +137,7 @@ struct sock_common {
* @sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
* @sk_lock: synchronizer
* @sk_rcvbuf: size of receive buffer in bytes
+ * @sk_friend: loopback friend socket
* @sk_sleep: sock wait queue
* @sk_dst_cache: destination cache
* @sk_dst_lock: destination cache lock
@@ -227,6 +228,7 @@ struct sock {
struct sk_buff *head;
struct sk_buff *tail;
} sk_backlog;
+ struct sock *sk_friend;
wait_queue_head_t *sk_sleep;
struct dst_entry *sk_dst_cache;
struct xfrm_policy *sk_policy[2];
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4fe605f..0eef90a ...
Thanks! We'll take a look... I've always thought sockets should have --
What about a server behind a TCP proxy? Also, need to minimize collisions for RPS to be effective. --
What about routers? What about loopback? This all boils down to the same issue of obscuring IP data by "magical" means and then reattaching functionality by reaching for upper layer information. It is necessary in some cases, but it can cripple performance for other cases. The interesting thing is you don't need to deal with collisions while distributing amongst cpus at all. You just need to make sure the distribution algorithm keeps every single flow attached to the correct cpu. All of the actual flow hashing, tracking and whatever else the traffic needs to go through can be done locally by cpu x which helps a lot with load distribution and cache issues in mind. It also helps locking because there is no global flow lookup table. Oh, and it also reduces collisions with every cpu you add for receiving. I work with a lot of plain office and ISP traffic in mind daily, so please don't misunderstand my motivation here. I'd hate to see poor performance in scenarios in which there is a lot of potential improvement. Franco --
I am a bit lost by this conversation. Are you saying something is wrong with the current scheme ? What exactly are your suggestions ? Tom replied to you that a hash derived from (addr1 ^ addr2) would not work in situations where all flows go from machine A to machine B (all hashes would be the same) Current hash is probably more than enough to cover all situations. --
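The objection to an address-only hash can be made concrete with a small experiment. This is a sketch under stated assumptions: the key functions, the multiplicative mix in key_to_cpu(), the 8-CPU spread and the address and port values are all illustrative and are not what the kernel computes.

```c
#include <stdint.h>

/* Key built from the addresses only; the ports are ignored. */
static uint32_t key_addrs_only(uint32_t saddr, uint32_t daddr,
                               uint16_t sport, uint16_t dport)
{
    (void)sport;
    (void)dport;
    return saddr ^ daddr;
}

/* Key built from the addresses and the ports. */
static uint32_t key_addrs_ports(uint32_t saddr, uint32_t daddr,
                                uint16_t sport, uint16_t dport)
{
    return saddr ^ daddr ^ (((uint32_t)sport << 16) + dport);
}

/* Map a key to one of 8 CPUs using the top three bits of a
 * multiplicative mix (hypothetical stand-in for the real hash). */
static unsigned int key_to_cpu(uint32_t key)
{
    return (key * 0x9e3779b1u) >> 29;
}

typedef uint32_t (*key_fn)(uint32_t, uint32_t, uint16_t, uint16_t);

/* Count the distinct CPUs used by nflows connections from machine A
 * to port 80 on machine B, with consecutive source ports. */
static unsigned int distinct_cpus(key_fn key, unsigned int nflows)
{
    const uint32_t a = 0x0a000001u;  /* 10.0.0.1 */
    const uint32_t b = 0x0a000002u;  /* 10.0.0.2 */
    unsigned int seen = 0, count = 0;

    for (unsigned int i = 0; i < nflows; i++) {
        unsigned int cpu = key_to_cpu(key(a, b, (uint16_t)(40000u + i), 80));

        if (!(seen & (1u << cpu))) {
            seen |= 1u << cpu;
            count++;
        }
    }
    return count;
}
```

With key_addrs_only every flow between A and B lands on a single CPU regardless of how many connections exist, whereas key_addrs_ports spreads them over more than one.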
Hashing for cpu distribution should be as minimal as it could possibly be with the least number of operations needed to compute a hash, which normally involves touching one cold cache line (ip header). If you add the ports to your mix you have the luxury of solving static ip mappings, but only for protocols that support it. Usage of the destination port may also prove to be more or less pointless with a lot of http traffic, because it's most likely static. And you add another potential cold cache line access. For a lot of traffic scenarios, we'll have a bunch of internal ips and the internet on the other side, so having a simple hash based on a flavor of internal/external ip is more than enough to work with for distribution. If the network card can provide a complete hash all the better. Then this part of my point is void. But then, hashing for cpu distribution should have nothing to do with real flow tracking with lookup tables for let's say a firewall or dpi application, because that data is only needed by local cpu and can be gathered after distribution. Simply put, the lookup for the flow, if it is needed, does not belong to distribution. It can be outsourced to the destination cpu or just simply be ignored, if the application

Yes, I see the point. And I'm just asking if it's wise to optimize for this particular scenario. If you spin this idea further beyond flow tracking, maybe an application also needs to do some kind of user tracking by ip. Wouldn't it make sense to have user based flows on a more local basis, not a global one

I agree with this, but would like to point out the phrasing "probably more than enough". :) Franco --
But we already have to bring into our cpu cache one cache line, needed in eth_type_trans() : (12+2 bytes of ethernet header) TCP/UDP tuples are included into this cache line (64 bytes on current popular arches) Cost of rxhash is absolute noise into the picture. A device provided hash, to be effective, would also make eth_type_trans() call not done. --
Maybe for the purposes of RPS, but hash collisions could definitely be an issue in RFS. If two active connections hit the same rps_flow entry this may cause thrashing of those connections between CPUs. I --
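The thrashing concern can be illustrated with a toy flow table. This is a sketch under assumptions: the table is indexed by the low bits of the rxhash, the size of 8192 entries is borrowed from the test setup earlier in the thread, and record_flow(), lookup_flow() and flows_collide() are hypothetical names rather than kernel functions.

```c
#include <stdint.h>

#define FLOW_ENTRIES 8192u              /* assumed size; a power of two */
#define FLOW_MASK (FLOW_ENTRIES - 1u)

/* "Desired" CPU per table slot. */
static uint16_t desired_cpu[FLOW_ENTRIES];

/* Called on behalf of recvmsg/sendmsg: store the application's CPU. */
static void record_flow(uint32_t rxhash, uint16_t cpu)
{
    desired_cpu[rxhash & FLOW_MASK] = cpu;
}

/* Called for each received packet: which CPU should handle it? */
static uint16_t lookup_flow(uint32_t rxhash)
{
    return desired_cpu[rxhash & FLOW_MASK];
}

/* Two flows share a slot when the low bits of their hashes match. */
static int flows_collide(uint32_t h1, uint32_t h2)
{
    return (h1 & FLOW_MASK) == (h2 & FLOW_MASK);
}
```

If two active connections with colliding hashes run on different CPUs, each record_flow() call overwrites the other's entry, so packets of both connections follow whichever application touched the slot last.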
Good point. I'll make a gathering of tcp tuples on a busy server over a day and try to compute number of clashes we can get with and without the addr/port swapping. --
I think I can give my Sob, and we have time to fully test it and tweak it if necessary. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Thanks Tom ! --
From: Eric Dumazet <eric.dumazet@gmail.com> Great, I'll add this to net-next-2.6 right now. Thanks! --
From: David Miller <davem@davemloft.net> I had to add an include of linux/vmalloc.h to net/core/sysctl_net_core.c to fix the build while committing this. --
From: David Miller <davem@davemloft.net> net/core/net-sysfs.c needed it too :-/ --
Ugh, vmalloc.h must be sneaking in through some other header file for me :-( Sorry about that. Do you need me to respin the patch? Tom --
From: Tom Herbert <therbert@google.com> No, I took care of it and am about to push things out to net-next-2.6 on kernel.org --
One thing I've been wondering while reading is if this should be made socket or SMT aware. If you're on a hyperthreaded system and sending an IPI to your core sibling, which has a completely shared cache hierarchy, might not be the best use of cycles. The same could potentially be true for shared L2 or shared L3 cache (e.g. only redirect flows between different sockets) Have you ever considered that? This is of course something that could be addressed post-merge, not a blocker. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
How are you going to schedule the net softirq on an empty queue if you do this? BTW, in my tests sending an IPI to an SMT sibling or to another core didnt make any difference in terms of latency - still 5 microsecs. I dont have dual Nehalem where we have to cross QPI - there i suspect it will be longer than 5 microsecs. cheers, jamal --
Sorry don't understand the question? I meant an IPI to a sibling is not useful. You send the IPI to get cache locality in the target, but if the target has the same cache locality as you, you can just as well avoid the cost of the IPI and process directly. For thread sibling I'm pretty sure it's useless. Not fully sure about socket sibling. Maybe. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Isnt the purpose of the IPI to signal remote side that theres something

Agreed, the SMT threads share L2. All the cores share L3. And it is inclusive, so if it is missing it is in L1 of one thread it must be present in L2 of shared cache as well as L3. Across the QPI i dont think that is true. But if you special case this - arent you being specific to Nehalem? cheers, jamal --
You handle the packet like if rps wasn't enabled. softirq on current The current CPU can queue on that socket as well. The whole point of the IPI is to do it with cache locality. But if cache locality is already there on the current CPU you don't Other CPUs have SMT too (Niagara, POWER 6/7, mips, ...). It should be the same there. Assuming L3 affinity helps it might need to be a CPU specific tunable yes. The scheduler has some information about this. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
