Re: Multicast packet loss

Previous thread: Is 64k bind patch making bad assumption? by Stephen Hemminger on Monday, February 2, 2009 - 11:35 am. (2 messages)

Next thread: Remaining bits for basic support of LDP by Russell King - ARM Linux on Monday, February 2, 2009 - 2:39 pm. (16 messages)
From: Wes Chow
Date: Monday, February 2, 2009 - 12:51 pm

(I'm Kenny's colleague, and I've been doing the kernel builds)

First I'd like to note that there were a lot of bnx2 NAPI changes between 
2.6.21 and 2.6.22. As a reminder, 2.6.21 shows tiny amounts of packet loss,
whereas loss in 2.6.22 is significant.

Second, some CPU affinity info: if I do like Eric and pin all of the
apps onto a single CPU, I see no packet loss. Also, I do *not* see
ksoftirqd show up on top at all!

If I pin half the processes on one CPU and the other half on another CPU, one 
ksoftirqd processes shows up in top and completely pegs one CPU. My packet loss
in that case is significant (25%).

Now, the strange case: if I pin 3 processes to one CPU and 1 process to 
another, I get about 25% packet loss and ksoftirqd pins one CPU. However, one
of the apps takes significantly less CPU than the others, and all apps lose the
*exact same number of packets*. In all other situations where we see packet
loss, the actual number lost per application instance appears random.

We're about to plug in an Intel ethernet card into this machine to collect more 
rigorous testing data. Please note, though, that we have seen packet loss with
a tg3 chipset as well. For now, though, I'm assuming that this is purely a bnx2
problem.

If I understand correctly, when the nic signals a hardware interrupt, the 
kernel grabs it and defers the meaty work to the softirq handler -- how does it
decide which ksoftirqd gets the interrupts? Is this something determined by how
the driver implements the NAPI?


Wes


--

From: Eric Dumazet
Date: Monday, February 2, 2009 - 1:29 pm

You see same number of packet lost because they are lost at NIC level

(check ifconfig eth0 for droped packets)

if softirq is too busy to process packets, we are not able to get them


Normaly, softirq runs on same cpu (the one handling hard irq)

--

From: Wes Chow
Date: Monday, February 2, 2009 - 2:09 pm

Understood.

I have a new observation: if I pin processes to just CPUs 0 and 1, I see 
no packet loss. Pinning to 0 and 2, I do see packet loss. Pinning 2 and 
3, no packet loss. 4 & 5 - no packet loss, 6 & 7 - no packet loss. Any 
other combination appears to produce loss (though I have not tried all 
28 combinations, this seems to be the case).

At first I thought maybe it had to do with processes pinned to the same 
CPU, but different cores. The machine is a dual quad core, which means 
that CPUs 0-3 should be a physical CPU, correct? Pinning to 0/2 and 0/3 
produce packet loss.

I've also noticed that it does not matter which of the working pairs I 
pin to. For example, pinning 5 processes in any combination on either 
0/1 produce no packet loss, pinning all 5 to just CPU 0 also produces no 
packet loss.

The failures are also sudden. In all of the working cases mentioned 
above, I don't see ksoftirqd on top at all. But when I run 6 processes 
on a single CPU, ksoftirqd shoots up to 100% and I lose a huge number of 

What determines which CPU the hard irq occurs on?


Wes

--

From: Eric Dumazet
Date: Monday, February 2, 2009 - 2:31 pm

a quad core is really a 2 x 2 core

L2 cache is splited on two blocks, one block used by CPU0/1, other by CPU2/3 

You are at the limit of the machine with such workload, so as soon as your

Check /proc/irq/{irqnumber}/smp_affinity

If you want IRQ16 only served by CPU0 :

echo 1 >/proc/irq/16/smp_affinity

--

From: Kenny Chang
Date: Tuesday, February 3, 2009 - 10:34 am

Eric Dumazet wrote:
> Wes Chow a 
From: Neil Horman
Date: Tuesday, February 3, 2009 - 6:21 pm

I don't think its sofirq ineffeciencies (oprofile would have shown that).  I
know I keep harping on this, but I still think irq affininty is your problem.
I'd be interested in knowning what your /proc/interrupts file looked like on
each of the above kenrels.  Perhaps its not that the bnx2 card you have can't
handle the setting of MSI interrupt affinities, but rather that something
changeed to break irq affinity on this card.


--

From: Kenny Chang
Date: Thursday, February 26, 2009 - 10:15 am

It's been a while since I updated this thread.  We've been running 
through the different suggestions and tabulating their effects, as well 
as trying out an Intel card.  The short story is that setting affinity 
and MSI works to some extent, and the Intel card doesn't seem to change 
things significantly.  The results don't seem consistent enough for us 
to be able to point to a smoking gun.

It does look like the 2.6.29-rc4 kernel performs okay with the Intel 
card, but this is not a real-time build and it's not likely to be in a 
supported Ubuntu distribution real soon.  We've reached the point where 
we'd like to look for an expert dedicated to work on this problem for a 
period of time.  The final result being some sort of solution to produce 
a realtime configuration with a reasonably "aged" kernel (.24~.28) that 
has multicast performance greater than or equal to that of 2.6.15.

If anybody is interested in devoting some compensated time to this 
issue, we're offering up a bounty: 
http://www.athenacr.com/bounties/multicast-performance/

For completeness, here's the table of our experiment results:

====================== ================== ========= ========== 
=============== ============== ============== =================
Kernel                 flavor             IRQ       affinity   *4x 
mcasttest*  *5x mcasttest* *6x mcasttest*  *Mtools2* [4]_ 
====================== ================== ========= ========== 
=============== ============== ============== =================
Intel 
e1000e                                                                                                                  

-----------------------------------------+---------+----------+---------------+--------------+--------------+-----------------
2.6.24.19              rt                |          any       | 
OK              Maybe          X                             
2.6.24.19              rt                |          CPU0      | 
OK              OK             X                        ...
From: Eric Dumazet
Date: Saturday, February 28, 2009 - 1:51 am

Hi Kenny

I am investigating how to reduce contention (and schedule() calls) on this workload.

Following patch already gave me less packet drops (but not yet *perfect*)
(10% packet loss instead of 30%, if 8 receivers on my 8 cpus machine)


David, this is a preliminary work, not meant for inclusion as is,
comments are welcome.

Thank you

[PATCH] net: sk_forward_alloc becomes an atomic_t

Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
(UDP: Add memory accounting) introduced a regression for high rate UDP flows,
because of extra lock_sock() in udp_recvmsg()

In order to reduce need for lock_sock() in UDP receive path, we might need
to declare sk_forward_alloc as an atomic_t.

udp_recvmsg() can avoid a lock_sock()/release_sock() pair.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/sock.h   |   14 +++++++-------
 net/core/sock.c      |   31 +++++++++++++++++++------------
 net/core/stream.c    |    2 +-
 net/ipv4/af_inet.c   |    2 +-
 net/ipv4/inet_diag.c |    2 +-
 net/ipv4/tcp_input.c |    2 +-
 net/ipv4/udp.c       |    2 --
 net/ipv6/udp.c       |    2 --
 net/sched/em_meta.c  |    2 +-
 9 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..c4befb9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -250,7 +250,7 @@ struct sock {
 	struct sk_buff_head	sk_async_wait_queue;
 #endif
 	int			sk_wmem_queued;
-	int			sk_forward_alloc;
+	atomic_t		sk_forward_alloc;
 	gfp_t			sk_allocation;
 	int			sk_route_caps;
 	int			sk_gso_type;
@@ -823,7 +823,7 @@ static inline int sk_wmem_schedule(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
+	return size <= atomic_read(&sk->sk_forward_alloc) ||
 		__sk_mem_schedule(sk, size, SK_MEM_SEND);
 }
 
@@ -831,7 +831,7 @@ static inline int sk_rmem_schedule(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= ...
From: Eric Dumazet
Date: Sunday, March 1, 2009 - 10:03 am

I bound NIC (gigabit BNX2) irq to cpu 0, so that oprofile results on this cpu can show us
where ksoftirqd is spending its time.

We can see scheduler at work :)

Also, one thing to note is __copy_skb_header() : 9.49 % of cpu0 time.
The problem comes from dst_clone() (6.05 % total, so 2/3 of __copy_skb_header()),
touching a highly contended cache line. (other cpus are doing the decrement of
dst refcounter)

CPU: Core 2, speed 3000.05 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) 
with a unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
(samples for other cpus 1..7 omitted)
samples  cum. samples  %        cum. %     symbol name
23750    23750          9.8159   9.8159    try_to_wake_up
22972    46722          9.4944  19.3103    __copy_skb_header
20217    66939          8.3557  27.6660    enqueue_task_fair
14565    81504          6.0197  33.6857    sock_def_readable
13454    94958          5.5606  39.2463    task_rq_lock
13381    108339         5.5304  44.7767    resched_task
13090    121429         5.4101  50.1868    udp_queue_rcv_skb
11441    132870         4.7286  54.9154    skb_queue_tail
10109    142979         4.1781  59.0935    sock_queue_rcv_skb
10024    153003         4.1429  63.2364    __wake_up_sync
9952     162955         4.1132  67.3496    update_curr
8761     171716         3.6209  70.9705    sched_clock_cpu
7414     179130         3.0642  74.0347    rb_insert_color
7381     186511         3.0506  77.0853    select_task_rq_fair
6749     193260         2.7894  79.8747    __slab_alloc
5881     199141         2.4306  82.3053    __wake_up_common
5432     204573         2.2451  84.5504    __skb_clone
4306     208879         1.7797  86.3300    kmem_cache_alloc
3524     212403         1.4565  87.7865    place_entity
2783     215186         1.1502  88.9367    skb_clone
2576     217762         1.0647  90.0014    __udp4_lib_rcv
2430     220192         1.0043  91.0057    bnx2_poll_work
2184     222376         0.9027  ...
From: David Miller
Date: Wednesday, March 4, 2009 - 1:16 am

From: Eric Dumazet <dada1@cosmosbay.com>

This adds new overhead for TCP which has to hold the socket
lock for other reasons in these paths.

I don't get how an atomic_t operation is cheaper than a
lock_sock/release_sock.  Is it the case that in many
executions of these paths only atomic_read()'s are necessary?

I actually think this scheme is racy.  There is a reason we
have to hold the socket lock when doing memory scheduling.
Two threads can get in there and say "hey I have enough space
already" even though only enough space is allocated for one
of their requests.

What did I miss? :)

--

From: Eric Dumazet
Date: Wednesday, March 4, 2009 - 1:36 am

I believe you are right, and in fact was about to post a "dont look at this patch"
since it doesnt help the multicast reception at all, I redone tests more carefuly 
and got nothing but noise.

We have a cache line ping pong mess here, and need more thinking.

I rewrote Kenny prog to use non blocking sockets.

Receivers are doing :

        int delay = 50;
	fcntl(s, F_SETFL, O_NDELAY);
        while(1)
        {
            struct sockaddr_in from;
            socklen_t fromlen = sizeof(from);
            res = recvfrom(s, buf, 1000, 0, (struct sockaddr*)&from, &fromlen);
            if (res == -1) {
                      delay++;
                      usleep(delay);
                      continue;
            }
            if (delay > 40)
                delay--;
            ++npackets;

With this litle user space change and 8 receivers on my dual quad core, softirqd
only takes 8% of one cpu and no drops at all (instead of 100% cpu and 30% drops)

So this is definitly a problem mixing scheduler cache line ping pongs with network
stack cache line ping pongs.

We could reorder fields so that fewer cache lines are touched by the softirq processing,
I tried this but still got packet drops.


--

From: Eric Dumazet
Date: Saturday, March 7, 2009 - 12:46 am

I have more questions :

What is the maximum latency you can afford on the delivery of the packet(s) ?

Are user apps using real time scheduling ?

I had an idea, that keep cpu handling NIC interrupts only delivering packets to
socket queues, and not messing with scheduler : fast queueing, and wakeing up
a workqueue (on another cpu) to perform the scheduler work. But that means
some extra latency (in the order of 2 or 3 us I guess)

We could enter in this mode automatically, if the NIC rx handler *see* more than
N packets are waiting in NIC queue : In case of moderate or light trafic, no
extra latency would be necessary. This would mean some changes in NIC driver.

Hum, then, if NIC rx handler is run beside the ksoftirqd, we already know
we are in a stress situation, so maybe no driver changes are necessary :
Just test if we run ksoftirqd...


--

From: Eric Dumazet
Date: Sunday, March 8, 2009 - 9:46 am

Here is a patch that helps. It's still an RFC of course, since its somewhat ugly :)

I am now able to have 8 receivers on my 8 cpus machine, with one multicast packet every 10 us,
without any loss. (standard setup, no affinity games)

oprofile results see that scheduler overhead vanished, we get back to pure network profile :)

(First offender being __copy_skb_header because of the atomic_inc on dst refcount)

CPU: Core 2, speed 3000.43 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
126329   126329        20.4296  20.4296    __copy_skb_header
31395    157724         5.0771  25.5067    udp_queue_rcv_skb
29191    186915         4.7207  30.2274    sock_def_readable
26284    213199         4.2506  34.4780    sock_queue_rcv_skb
26010    239209         4.2063  38.6842    kmem_cache_alloc
20040    259249         3.2408  41.9251    __udp4_lib_rcv
19570    278819         3.1648  45.0899    skb_queue_tail
17799    296618         2.8784  47.9683    bnx2_poll_work
17267    313885         2.7924  50.7606    skb_release_data
14663    328548         2.3713  53.1319    __skb_recv_datagram
14443    342991         2.3357  55.4676    __slab_alloc
13248    356239         2.1424  57.6100    copy_to_user
13138    369377         2.1246  59.7347    __sk_mem_schedule
12004    381381         1.9413  61.6759    __skb_clone
11924    393305         1.9283  63.6042    skb_clone
11077    404382         1.7913  65.3956    lock_sock_nested
10320    414702         1.6689  67.0645    ip_route_input
9622     424324         1.5560  68.6205    dst_release
8344     432668         1.3494  69.9699    __slab_free
8124     440792         1.3138  71.2837    mwait_idle
7066     447858         1.1427  72.4264    udp_recvmsg
6652     454510         1.0757  73.5021    netif_receive_skb
6386     460896         1.0327  74.5349    ipt_do_table
6010     466906         0.9719  ...
From: David Miller
Date: Sunday, March 8, 2009 - 7:49 pm

From: Eric Dumazet <dada1@cosmosbay.com>

This is interesting.

I think you should make softirq_del() more flexible.  Make it the
socket's job to make sure it doesn't try to defer different
functions, and put the onus on locking there as well.

The cmpxchg() and all of this checking is just wasted work.

I'd really like to get rid of that callback lock too, then we'd
really be in business. :-)

--

From: Eric Dumazet
Date: Sunday, March 8, 2009 - 11:36 pm

First thanks for your review David.

I chose cmpxchg() because I needed some form of exclusion here.
I first added a spinlock inside "struct softirq_del" then I realize
I could use cmpxchg() instead and keep the structure small. As the
synchronization is only needed at queueing time, we could pass
the address of a spinlock XXX to sofirq_del() call.

Also, when an event was queued for later invocation, I also needed to keep
a reference on "struct socket" to make sure it doesnt disappear before
the invocation. Not all sockets are RCU guarded (we added RCU only for 
some protocols (TCP, UDP ...). So I found keeping a read_lock
on callback was the easyest thing to do. I now realize we might
overflow preempt_count, so special care is needed.

About your first point, maybe we should make sofirq_del() (poor name I confess)
only have one argument (pointer to struct softirq_del), and initialize
the function pointer at socket init time. That would insure "struct softirq_del"
is associated to one callback only. cmpxchg() test would have to be
done on "next" field then (or use the spinlock XXX)

I am not sure output path needs such tricks, since threads are rarely
blocking on output : We dont trigger 400.000 wakeups per second ?

Another point : I did a tbench test and got 2517 MB/s with the patch,
instead of 2538 MB/s (using Linus 2.6 git tree), thats ~ 0.8 % regression
for this workload.

--

From: David Miller
Date: Friday, March 13, 2009 - 2:51 pm

From: Eric Dumazet <dada1@cosmosbay.com>

I don't understand why you need the mutual exclusion in the
first place.  The function pointer always has the same value.
And this locking isn't protecting the list insertion either,
as that isn't even necessary.


You're using this in UDP so... make the rule that you can't use

Why?  You run this from softirq safe context, nothing can run other
softirqs on this cpu and corrupt the list, therefore.
--

From: Eric Dumazet
Date: Friday, March 13, 2009 - 3:30 pm

I was lazy to check all callers (all protocols) had a lock on sock,
and prefered safety.

I was fooled by the read_lock(), and though several cpus could call

UDP/TCP only ? I though many other protocols (not all using RCU) were
using sock_def_readable() too...


--

From: David Miller
Date: Friday, March 13, 2009 - 3:38 pm

From: Eric Dumazet <dada1@cosmosbay.com>

Maybe create a inet_def_readable() just for this purpose :-)
--

From: Eric Dumazet
Date: Friday, March 13, 2009 - 3:45 pm

I must be tired, I should had this idea before you :)

I post a new patch after some rest, I definitly should not be still awake !


--

From: Eric Dumazet
Date: Saturday, March 14, 2009 - 2:03 am

On x86_64, its rather unfortunate that "wait_queue_head_t wait"
field of "struct socket" spans two cache lines (assuming a 64
bytes cache line in current cpus)

offsetof(struct socket, wait)=0x30
sizeof(wait_queue_head_t)=0x18

This might explain why Kenny Chang noticed that his multicast workload
was performing bad with 64 bit kernels, since more cache lines ping pongs
were involved.

This litle patch moves "wait" field next "fasync_list" so that both
fields share a single cache line, to speedup sock_def_readable()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/include/linux/net.h b/include/linux/net.h
index 4515efa..4fc2ffd 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -129,11 +129,15 @@ struct socket {
 	socket_state		state;
 	short			type;
 	unsigned long		flags;
-	const struct proto_ops	*ops;
+	/*
+	 * Please keep fasync_list & wait fields in the same cache line
+	 */
 	struct fasync_struct	*fasync_list;
+	wait_queue_head_t	wait;
+
 	struct file		*file;
 	struct sock		*sk;
-	wait_queue_head_t	wait;
+	const struct proto_ops	*ops;
 };
 
 struct vm_area_struct;

--

From: David Miller
Date: Sunday, March 15, 2009 - 7:59 pm

From: Eric Dumazet <dada1@cosmosbay.com>

Applied, thanks a lot Eric.
--

From: Eric Dumazet
Date: Monday, March 16, 2009 - 3:22 pm

Here is the last incantation of the patch, that of course should be
split in two parts and better Changelog for further discussion on lkml.

We need to take a reference on sock when queued on a softirq delay
list. RCU wont help here because of SLAB_DESTROY_BY_RCU thing :
Another cpu could free/reuse the socket before we have a chance to
call softirq_delay_exec()

UDP & UDPLite use this delayed wakeup feature.

Thank you

[PATCH] softirq: Introduce mechanism to defer wakeups

Some network workloads need to call scheduler too many times. For example,
each received multicast frame can wakeup many threads. ksoftirqd is then
not able to drain NIC RX queues in time and we get frame losses and high
latencies.

This patch adds an infrastructure to delay work done in
sock_def_readable() at end of do_softirq(). This needs to
make available current->softirq_context even if !CONFIG_TRACE_IRQFLAGS


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/interrupt.h |   18 +++++++++++++++
 include/linux/irqflags.h  |   11 ++++-----
 include/linux/sched.h     |    2 -
 include/net/sock.h        |    2 +
 include/net/udplite.h     |    1
 kernel/lockdep.c          |    2 -
 kernel/softirq.c          |   42 ++++++++++++++++++++++++++++++++++--
 lib/locking-selftest.c    |    4 +--
 net/core/sock.c           |   41 +++++++++++++++++++++++++++++++++++
 net/ipv4/udp.c            |    7 ++++++
 net/ipv6/udp.c            |    7 ++++++
 11 files changed, 125 insertions(+), 12 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9127f6b..a773d0c 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -295,6 +295,24 @@ extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softir
 extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
 				  int this_cpu, int softirq);
 
+/*
+ * softirq delayed works : should be delayed at do_softirq() end
+ */
+struct softirq_delay {
+	struct ...
From: Peter Zijlstra
Date: Tuesday, March 17, 2009 - 3:11 am

I read the entire thread up to now, and I still don't really understand



Shouldn't we call it __softirq_delay_queue() if the caller needs to
disabled preemption?

Futhermore, don't we always require the caller to take care of lifetime

Why can't we write:

  struct softirq_delay *sdel, *next;

  sdel = __get_cpu_var(softirq_delay_head);
  __get_cpu_var(softirq_delay_head) = NULL;

  while (sdel) {
    next = sdel->next;
    sdel->func(sdel);
    sdel = next;
  }

Why does it matter what happens to sdel->next? we've done the callback.

Aah, the crux is in the re-use policy.. that most certainly does deserve
a comment.

How about we make sdel->next point to itself in the init case?

Then we can write:

  while (sdel) {
    next = sdel->next;
    sdel->next = sdel;
    sdel->func(sdel);
    sdel = next;
  }

and have the enqueue bit look like:

int __softirq_delay_queue(struct softirq_delay *sdel)
{
  struct softirq_delay **head;

  if (sdel->next != sdel)
    return 0;

  head = &__get_cpu_var(softirq_delay_head);
  sdel->next = *head;
  *head = sdel;
  return 1;


OK, so the idea is to handle a bunch of packets and instead of waking N
threads for each packet, only wake them once at the end of the batch?

Sounds like a sensible idea.. 

--

From: Eric Dumazet
Date: Tuesday, March 17, 2009 - 4:08 am

Sure, I should have taken more time, will repost this in a couple of hours,

Apparently, on SMP machines this actually helps a lot, in case of multicast
trafic handled by many subscribers. skb_cloning involves atomic ops on
route cache entries, and if we wakeup threads as we currently do, they
start to consume skb while the feeder is still doing skb clones for
other sockets. Many cache line ping pongs are slowing down the softirq.



I was wondering if some BUG_ON() can be added to crash if preemption is enabled
at this point. Could not find an existing check,
doing again the 'if (running_from_softirq())'" test might be overkill,
should I document caller should do :

skeleton :

    lock_my_data(data); /* barrier here */
    sdel = &data->sdel;
    if (running_from_softirq()) {
	if (softirq_delay_queue(sdel)) {
		hold a refcount on data;
	} else {
		/* already queued, nothing to do */
	}
    } else {
	/* cannot queue the work , must do it right now */
	do_work(data);
    }
    release_my_data(data);




Check softirq_delay_exec(void) comment, where I stated synchronization had
to be done by the subsystem.

In this socket case, caller of softirq_delay_exec() has a lock on socket.

Problem is I dont want to get this lock again in sock_readable_defer() callback

if sdel->next is not committed, another cpu could call _softirq_delay_queue() and
find sdel->next being not null (or != sdel with your suggestion). Then next->func()
wont be called as it should (or called litle bit too soon)


Idea is to batch wakeups() yes, and if we receive several packets for
the same socket(s), we reduce number of wakeups to one. In the multicast stress
situation of Athena CR, it really helps, no packets dropped instead of
30%

Thanks Peter

--

From: Peter Zijlstra
Date: Tuesday, March 17, 2009 - 4:57 am

__get_cpu_var() has a preemption check and will generate BUGs when

Small nit: I don't particularly like the running_from_softirq() name,
but in_softirq() is already taken, and sadly means something slightly



afaiu the memory barrier semantics you cannot pair a wmb with a lock

Right, what we can do is put the wmb in the callback and the rmb right

Yes I can see that helping tremendously.

--

From: Brian Bloniarz
Date: Tuesday, March 17, 2009 - 8:00 am

One small thing: with CONFIG_IPV6=m, inet_def_readable needs to be exported,
right?

Thanks,
Brian Bloniarz
--

From: Eric Dumazet
Date: Tuesday, March 17, 2009 - 8:16 am

Absolutely, thank you !

--

From: David Stevens
Date: Tuesday, March 17, 2009 - 12:39 pm

I did some testing with this and see at least a 20% improvement
without drop.

I agree with Peter's recommended changes (esp. sentinel vs null),
and also the trivial brace indentation  in softirq_delay_exec(),
but otherwise looks  good to me. Nice work.

                                        +-DLS

--

From: Eric Dumazet
Date: Tuesday, March 17, 2009 - 2:19 pm

Still I dont like very much all softirq.c changes. I feel very
uncomfortable to justify one extra call in do_softirq(), and
not very clean interface (stuff about locking, barriers...)

Easy way could be to add a SOFTIRQ but its not very wise.

I was wondering if we could use the infrastructure added in commit
54514a70adefe356afe854e2d3912d46668068e6
(softirq: Add support for triggering softirq work on softirqs.)
But I dont understand how it can works...
(softirq_work_list is feeded, but never processed)

Alternatively, we could use a framework dedicated to
network use, with well defined semantic :

Calling softirq_delay_exec() from net_rx_action(),
from this function, we know if time_squeeze was incremented,
or all netdev_budget consumed, and in this stress case 
try to give the wakeups job to another cpu.



--

From: Brian Bloniarz
Date: Friday, April 3, 2009 - 12:28 pm

Hi Eric,

We've been experimenting with this softirq-delay patch in production, and
have seen some hard-to-reproduce crashes. We finally managed to capture a
kexec crashdump this morning.

This is the dmesg:

[53417.592868] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[53417.598377]  [<ffffffff80243643>] __do_softirq+0xc3/0x150
[53417.606300] PGD 32abb8067 PUD 32faf5067 PMD 0
[53417.610829] Oops: 0000 [1] SMP
[53417.614032] CPU 2
[53417.616083] Modules linked in: nfs lockd nfs_acl sunrpc openafs(P) autofs4 ipv6 ac sbs sbshc video output dock battery container iptable_filter ip_tables x_tables parport_pc lp parport loop joydev iTCO_wdt iTCO_vendor_support evdev button i5000_edac psmouse serio_raw pcspkr shpchp pci_hotplug edac_core ext3 jbd mbcache sr_mod cdrom ata_generic usbhid hid ata_piix sg sd_mod ehci_hcd pata_acpi uhci_hcd libata bnx2 aacraid usbcore scsi_mod thermal processor fan fbcon tileblit font bitblit softcursor fuse
[53417.662067] Pid: 13039, comm: gball Tainted: P        2.6.24-19acr2-generic #1
[53417.669219] RIP: 0010:[<ffffffff80243643>]  [<ffffffff80243643>] __do_softirq+0xc3/0x150
[53417.677368] RSP: 0018:ffff8103314f3f20  EFLAGS: 00010297
[53417.682697] RAX: ffff810084a1b000 RBX: ffffffff805ba530 RCX: 0000000000000000
[53417.689843] RDX: ffff8103305811e0 RSI: 0000000000000282 RDI: ffff810332ada580
[53417.696993] RBP: 0000000000000000 R08: ffff81032fad9f08 R09: ffff810332382000
[53417.704144] R10: 0000000000000000 R11: ffffffff80316ec0 R12: ffffffff8062b3d8
[53417.711294] R13: ffffffff8062b480 R14: 0000000000000002 R15: 000000000000000a
[53417.718447] FS:  00007fab0d7b8750(0000) GS:ffff810334401b80(0000) knlGS:0000000000000000
[53417.726568] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[53417.732332] CR2: 0000000000000000 CR3: 0000000329e2d000 CR4: 00000000000006e0
[53417.739476] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[53417.746637] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: ...
From: Eric Dumazet
Date: Sunday, April 5, 2009 - 6:49 am

Hi Brian

2.6.24-19 kernel... hmm...

Could you please send me the diff of your backport against this kernel ?

I take you use Ubuntu Hardys 8.04 LTS server edition ?

Pointer being null might tell us that we managed to call inet_def_readable()
without socket lock hold...

--

From: Brian Bloniarz
Date: Monday, April 6, 2009 - 2:53 pm

Trying to track this down: I added:
	BUG_ON(!spin_is_locked(&sk->sk_lock.slock));
to the top of inet_def_readable. This gives me the following panic:

[ 2528.745311] kernel BUG at net/core/sock.c:1674!
[ 2528.745311] invalid opcode: 0000 [#1] PREEMPT SMP
[ 2528.745311] last sysfs file: /sys/devices/system/cpu/cpu7/crash_notes
[ 2528.745311] CPU 6
[ 2528.745311] Modules linked in: iptable_filter ip_tables x_tables parport_pc lp parport loop iTCO_wdt iTCO_vendor_support serio_raw psmouse pcspkr i5k_amb shpchp i5000_edac pci_hotplug button edac_core ipv6 ibmpex joydev ipmi_msghandler evdev ext3 jbd mbcache usbhid hid sr_mod cdrom pata_acpi ata_generic sg sd_mod ata_piix ehci_hcd uhci_hcd libata aacraid usbcore scsi_mod bnx2 thermal processor fan thermal_sys fuse
[ 2528.745311] Pid: 14507, comm: signalgen Not tainted 2.6.29.1-eric2-lowlat-lockdep #3 IBM System x3550 -[7978AC1]-
[ 2528.745311] RIP: 0010:[<ffffffff80444ec2>]  [<ffffffff80444ec2>] inet_def_readable+0x52/0x60
[ 2528.745311] RSP: 0018:ffff88043b985b58  EFLAGS: 00010246
[ 2528.745311] RAX: 0000000000000019 RBX: ffff88043b90c280 RCX: 0000000000000000
[ 2528.745311] RDX: 0000000000001919 RSI: 0000000000000068 RDI: ffff88043b90c280
[ 2528.745311] RBP: ffff88043b985b68 R08: 0000000000000000 R09: 0000000000000000
[ 2528.745311] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88043b811400
[ 2528.745311] R13: 0000000000000000 R14: 0000000000000068 R15: 0000000000000000
[ 2528.745311] FS:  00007f82f0742750(0000) GS:ffff88043dbc8280(0000) knlGS:0000000000000000
[ 2528.745311] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2528.745311] CR2: 000000000057f1a0 CR3: 000000043915e000 CR4: 00000000000406e0
[ 2528.745311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2528.745311] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2528.745311] Process signalgen (pid: 14507, threadinfo ffff88043b984000, task ffff8804309a9ef0)
[ 2528.745311] Stack:
[ 2528.745311]  ffff88043b811400 ffff88043b90c280 ...
From: Brian Bloniarz
Date: Monday, April 6, 2009 - 3:12 pm

Oh, sorry, I think I'm just misunderstanding how the socket
lock works. This doesn't actually check that the socket is locked,
right?

-Brian
--

From: Brian Bloniarz
Date: Tuesday, April 7, 2009 - 1:08 pm

Eric Dumazet wrote:
 > Brian Bloniarz a écrit :
 >> We've been experimenting with this softirq-delay patch in production, and
 >> have seen some hard-to-reproduce crashes. We finally managed to capture a
 >> kexec crashdump this morning.
 >
 > Pointer being null might tell us that we managed to call inet_def_readable()
 > without socket lock hold...

False alarm -- I think I did the backport to 2.6.24 incorrectly. 2.6.24 was
before the UDP receive path started taking the socket lock, so
inet_def_readable's assumption doesn't hold.

Sorry to waste everyone's time.

Thanks,
Brian Bloniarz
--

From: Eric Dumazet
Date: Wednesday, April 8, 2009 - 1:12 am

Thanks for doing this discovery work and analysis. 

I am currently off-computers and could not do this until next week.

So, if you want to use 2.6.24, we need to back port other patches as well ?

--

From: Brian Bloniarz
Date: Monday, March 9, 2009 - 3:56 pm

Hi Eric,

I did some experimenting with this patch today -- we're users, not kernel hackers,
but the performance looks great. We see no loss with mcasttest, and no loss with
our internal test programs (which do much more user-space work). We're very
encouraged :)

One thing I'm curious about: previously, setting /proc/irq/<eth0>/smp_affinity
to one CPU made things perform better, but with this patch, performance is better
with smp_affinity == ff than with smp_affinity == 1. Do you know why that
is? Our tests are all with bnx2 msi_disable=1. I can investigate with oprofile
tomorrow.

Thank you for your continued help, we all deeply appreciate having someone
looking at this workload.

Thanks,
Brian Bloniarz
--

From: Eric Dumazet
Date: Monday, March 9, 2009 - 10:28 pm

Well, smp_affinity could help in my opininon if you dedicate
one cpu for the NIC, and others for user apps, if the average
work done per packet is large. If load is light, its better
to use the same cpu to perform all the work, since no expensive
bus trafic is needed between cpu to exchange memory lines.

If you only change /proc/irq/<eth0>/smp_affinity, and let scheduler
chose any cpu for your user-space work that can have long latencies,
I would not expect better performances.

Try to cpu affine your taks to 0xFE to get better determinism.



--

From: Brian Bloniarz
Date: Tuesday, March 10, 2009 - 4:22 pm

Hi Eric,

FYI: with your patch applied and lockdep enabled, I see:
[   39.114628] ================================================
[   39.121964] [ BUG: lock held when returning to user space! ]
[   39.127704] ------------------------------------------------
[   39.133461] msgtest/5242 is leaving the kernel with locks still held!
[   39.140132] 1 lock held by msgtest/5242:
[   39.144287]  #0:  (clock-AF_INET){-.-?}, at: [<ffffffff8041f5b9>] sock_def_readable+0x19/0xb0

I can't reproduced this with the mcasttest program yet, it
was with an internal test program which does some userspace
processing on the messages. I'll let you know if I find a way
to reproduce it with a simple program I can share.

 > Well, smp_affinity could help in my opininon if you dedicate
 > one cpu for the NIC, and others for user apps, if the average
 > work done per packet is large. If load is light, its better
 > to use the same cpu to perform all the work, since no expensive
 > bus trafic is needed between cpu to exchange memory lines.

I tried this setup as well: an 8-core box with 4 userspace
processes, each affined to an individual CPU1-4. The IRQ was on
CPU0. On most kernels, this setup loses fewer packets than the default
affinity (though they both lose some). With your patch enabled, the
default affinity loses 0 packets, and this setup loses some.

Thanks,
Brian Bloniarz

--

From: Eric Dumazet
Date: Tuesday, March 10, 2009 - 8:00 pm

I reproduced it as well here quite easily with a tcpdump of a tcp session,
thanks for the report.

It seems  "if (in_softirq())" doesnt do what I thought.

I wanted to test if we were called from __do_softirq() handler,
since only this function is calling softirq_delay_exec()
to dequeue events.

It appears I have to make current->softirq_context available
even if !CONFIG_TRACE_IRQFLAGS

Here is an updated version of the patch.

I also made the call to softirq_delay_exec() is performed
with interrupts enabled, and that preempt count wont
overflow if many events are queued.

[PATCH] softirq: Introduce mechanism to defer wakeups

Some network workloads need to call scheduler too many times. For example,
each received multicast frame can wakeup many threads. ksoftirqd is then
not able to drain NIC RX queues and we get frame losses and high latencies.

This patch adds an infrastructure to delay part of work done in
sock_def_readable() at end of do_softirq(). This need to
make available current->softirq_context even if !CONFIG_TRACE_IRQFLAGS


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/interrupt.h |   18 +++++++++++++++++
 include/linux/irqflags.h  |    7 ++----
 include/linux/sched.h     |    2 -
 include/net/sock.h        |    1
 kernel/softirq.c          |   34 +++++++++++++++++++++++++++++++++
 net/core/sock.c           |   37 ++++++++++++++++++++++++++++++++++--
 6 files changed, 92 insertions(+), 7 deletions(-)


diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9127f6b..a773d0c 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -295,6 +295,24 @@ extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softir
 extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
 				  int this_cpu, int softirq);
 
+/*
+ * softirq delayed works : should be delayed at do_softirq() end
+ */
+struct softirq_delay {
+	struct softirq_delay	*next;
+	void ...
From: Brian Bloniarz
Date: Thursday, March 12, 2009 - 8:47 am

This works great in all my tests so far.

Thanks,
Brian Bloniarz
--

From: Eric Dumazet
Date: Thursday, March 12, 2009 - 9:34 am

Cool

I am wondering if we should extend the mechanism and change
softirq_delay_exec() to wakeup a workqueue instead of
doing the loop from softirq handler, in case a given
level of stress / load is hit.

This could help machines with several cpus, and one NIC (without
multi RX queues) flooded by messages (not necessarly multicast trafic).
Imagine a media/chat server receiving XXX.000 frames / second

One cpu could be dedicated to pure softirq/network handling,
and other cpus could participate and handle the scheduler part if any.

Condition could be : 

- We run __do_softirq() from ksoftirqd and 
- We queued more than N 'struct softirq_delay' in softirq_delay_head
- We have more than one cpu online

--

Previous thread: Is 64k bind patch making bad assumption? by Stephen Hemminger on Monday, February 2, 2009 - 11:35 am. (2 messages)

Next thread: Remaining bits for basic support of LDP by Russell King - ARM Linux on Monday, February 2, 2009 - 2:39 pm. (16 messages)