2.6.24 BUG: soft lockup - CPU#X

Previous thread: [PATCH net-2.6.26][IPV6]: Fix potential net leak and oops in ipv6 routing code. by Pavel Emelyanov on Wednesday, March 26, 2008 - 11:55 am. (3 messages)

Next thread: [PATCH] e1000e: test MSI interrupts by Auke Kok on Wednesday, March 26, 2008 - 2:36 pm. (16 messages)
To: <netdev@...>
Date: Wednesday, March 26, 2008 - 12:46 pm

All,

While running iperf (transmit) on NIU driver I get several soft lockup
errors. I have seen similar soft lockup errors on kernels 2.6.18 and
later (with other 10G NIC drivers with LSO disabled) but it is more
frequent/pronounced on 2.6.24 kernel. oprofile is not giving much clue
as where the cpus are spending time (atleast doesn't indicate any thing
abnormal in the driver(s)). The lockup location as indicated by the
error message is not consistent and varies from time time and could be
outside the driver as well. I have attached several lockup error
traces and corresponding profile data. Any clues?

Regards
matheos

Test system is 8xdual core AMD opteron, with 8 cpus booted.

[root@nsn57-110 ~]# more /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 885
stepping : 2
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow rep_good pni lahf_lm cmp_legacy
bogomips : 2019.84
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

nsn57-110 login: BUG: soft lockup - CPU#3 stuck for 11s! [iperf:9064]
CPU 3:
Modules linked in: oprofile niu nfs lockd nfs_acl autofs4 hidp rfcomm
l2cap bluetooth sunrpc ipv6 cpufreq_ondemand rdma_ucm ib_ucm rdma_cm
iw_cm ib_addr ib_srp scsi_transport_srp ib_cm ib_ipoib ib_sa ib_uverbs
ib_umad ib_mad ib_core dm_multipath battery ac parport_pc lp parport
joydev sr_mod sg e1000 button i2c_nforce2 pcspkr shpchp i2c_core
dm_snapshotdm_zero dm_mirror dm_mod usb_storage ...

To: Matheos Worku <Matheos.Worku@...>
Cc: <netdev@...>
Date: Wednesday, March 26, 2008 - 4:14 pm

Matheos Worku wrote, On 03/26/2008 05:46 PM:

Are network cards' irqs balanced? If so, could you reproduce this
with affinity set?

Regards,
Jarek P.
--

To: Jarek Poplawski <jarkao2@...>
Cc: <netdev@...>
Date: Wednesday, March 26, 2008 - 4:26 pm

Jarek,

Reproduced the lockup with irqbalance disabled and with single src of
interrupt (TX interrupt, UDP transmit). Lockup appears in different
location though.

Regards
matheos

irq of interest: 454 (TX interrupt)

454: 19249 93234 907186 2691 0
188 0 160 PCI-MSI-edge eth6
455: 22607 15083 5 13104 25569
161519 62514 25637 PCI-MSI-edge eth6
456: 22390 14921 5 24605 37438
110453 251315 66 PCI-MSI-edge eth6
457: 11109 26849 2 58895 251720
84 0 67420 PCI-MSI-edge eth6
458: 22348 15859 1 21978 27839
10231 0 267743 PCI-MSI-edge eth6
459: 19922 15331 2 59275 0
149788 12394 82549 PCI-MSI-edge eth6
460: 22928 19058 4 1268 49775
183189 160901 25150 PCI-MSI-edge eth6
461: 497 32134 1 31428 0
69182 68889 45407 PCI-MSI-edge eth6
462: 11932 23212 10 11355 120509
47588 1 118637 PCI-MSI-edge eth6
463: 0 0 0 0 0
0 0 0 PCI-MSI-edge eth6
464: 0 0 0 0 0
0 0 0 PCI-MSI-edge eth6
465: 0 0 0 0 0
0 0 0 PCI-MSI-edge eth6

.......

454: 19249 126519 907186 2691 0
188 0 160 PCI-MSI-edge eth6
455: 22609 15083 5 13104 25569
161519 62514 25637 PCI-MSI-edge eth6
456: 22390 14923 5 24605 37438
110453 251315 66 PCI-MS...

To: Matheos Worku <Matheos.Worku@...>
Cc: <netdev@...>
Date: Thursday, March 27, 2008 - 6:33 am

On Wed, Mar 26, 2008 at 01:26:00PM -0700, Matheos Worku wrote:
...

Maybe I'm wrong with this again, but I wonder about this gart_map_single
on almost all traces, and probably not supposed to be seen here. Did you
try with some memory re-config/debugging?

Regards,
Jarek P.
--

To: Jarek Poplawski <jarkao2@...>, Matheos Worku <Matheos.Worku@...>
Cc: <netdev@...>
Date: Thursday, March 27, 2008 - 7:18 pm

I have some more examples of this but with the ixgbe driver. We are
running heavy bidirectional stress with multiple rx (non-napi, yeah I
know) interrupts by default (and userspace irqbalance is probably on,
I'll have the lab try it without)

Most of our traces are from redhat 2.6.18 kernel, or 2.6.21.5, I'll try
to get a more recent kernel tested.

here is some related data if that helps at all.

2.6.21.5
=====================================

[ 180.487937] BUG: soft lockup detected on CPU#0!
[ 180.492461]
[ 180.492461] Call Trace:
[ 180.496426] <IRQ> [<ffffffff8025d788>] softlockup_tick+0xd4/0xe9
[ 180.502650] [<ffffffff8023bc57>] run_local_timers+0x13/0x15
[ 180.508317] [<ffffffff8023bf37>] update_process_times+0x4c/0x78
[ 180.514323] [<ffffffff802181d0>] smp_local_timer_interrupt+0x34/0x51
[ 180.520761] [<ffffffff80218966>] smp_apic_timer_interrupt+0x43/0x5d
[ 180.527104] [<ffffffff8020a31b>] apic_timer_interrupt+0x6b/0x70
[ 180.533113] [<ffffffff80533e2e>] _spin_unlock_irqrestore+0x3f/0x69
[ 180.539372] [<ffffffff80533e31>] _spin_unlock_irqrestore+0x42/0x69
[ 180.545648] [<ffffffff804ba219>] skb_queue_tail+0x41/0x49
[ 180.551133] [<ffffffff804b990e>] sock_queue_rcv_skb+0xe2/0x11c
[ 180.557055] [<ffffffff804f440e>] udp_queue_rcv_skb+0xb6/0x180
[ 180.562877] [<ffffffff804f492e>] __udp4_lib_rcv+0x456/0x594
[ 180.568538] [<ffffffff804f4a7e>] udp_rcv+0x12/0x14
[ 180.573439] [<ffffffff804d5572>] ip_local_deliver+0xca/0x171
[ 180.579191] [<ffffffff804d5b28>] ip_rcv+0x424/0x464
[ 180.584171] [<ffffffff804bf912>] netif_receive_skb+0x25a/0x30f
[ 180.590091] [<ffffffff804c182c>] process_backlog+0xab/0x10f
[ 180.595741] [<ffffffff804c1acc>] net_rx_action+0xbb/0x254
[ 180.601220] [<ffffffff80237cb5>] __do_softirq+0x5e/0xe2
[ 180.606544] [<ffffffff8020a86c>] call_softirq+0x1c/0x28
[ 180.611850] [<ffffffff8020bc...

To: Brandeburg, Jesse <jesse.brandeburg@...>
Cc: Jarek Poplawski <jarkao2@...>, <netdev@...>
Date: Thursday, March 27, 2008 - 7:45 pm

I have seen the lockup on kernels 2.6.18 and newer mostly on TX traffic.
I have seen it on another 10G driver (off the tree niu driver sibling,
nxge). The nxge driver doesn't use any TX interrupts and I have seen it
with UDP TX, irqbalance disabled, with no irq activity at all. some
example traces included.
Regards
Matheos

nsn57-110 login: BUG: soft lockup - CPU#4 stuck for 11s! [uperf.x86_64:6697]
CPU 4:
Modules linked in: nxge nfs lockd nfs_acl autofs4 hidp rfcomm l2cap
bluetooth sunrpc ipv6 cpufreq_ondemand rdma_ucm ib_ucm rdma_cm iw_cm
ib_addr ib_srp scsi_transport_srp ib_cm ib_ipoib ib_sa ib_uverbs ib_umad
ib_mad ib_core dm_multipath battery ac parport_pc lp parport joydev
sr_mod sg e1000 i2c_nforce2 button i2c_core shpchp pcspkr dm_snapshot
dm_zerodm_mirror dm_mod usb_storage mptsas mptscsih mptbase
scsi_transport_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 6697, comm: uperf.x86_64 Not tainted 2.6.24-mati #3
RIP: 0010:[<ffffffff80316c64>] [<ffffffff80316c64>]
_raw_spin_unlock+0x37/0x7f
RSP: 0018:ffff8101e00c3af8 EFLAGS: 00000246
RAX: ffff8101e084e000 RBX: ffff8101f0908000 RCX: ffff8101f0908780
RDX: 0000000000000000 RSI: 0000000000000116 RDI: ffff8101f0908300
RBP: 00000000000003d5 R08: 0000000000000004 R09: 0000000000000115
R10: ffff8101e00c3968 R11: 0000000000000000 R12: ffff8101e00c3968
R13: 0000000000000000 R14: ffff8101f0908000 R15: ffffffff882e77ef
FS: 0000000041001940(0063) GS:ffff8102fba13580(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000004f81cf48 CR3: 00000001e05b4000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
[<ffffffff80406dca>] __qdisc_run+0x96/0x174
[<ffffffff803f8139>] dev_queue_xmit+0x141/0x270
[<ffffffff80417faf>] ip_push_pending_frames+0x32c/0x3a0
[<ffffffff80419676>] ip_generic_getfrag+0x0/0x8b
[<ffffffff8043...

To: <Matheos.Worku@...>
Cc: <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>
Date: Thursday, March 27, 2008 - 8:02 pm

From: Matheos Worku <Matheos.Worku@Sun.COM>

Interesting.

Are you running uperf in a way such that there are multiple
processors doing TX's in parallel? That might be a clue.
--

To: David Miller <davem@...>
Cc: <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>
Date: Thursday, March 27, 2008 - 8:19 pm

Dave,
Actually I am running a version of the nxge driver which uses only one
TX ring, no LLTX enabled so the driver does single threaded TX. On the
other hand, uperf (or iperf, netperf ) is running multiple TX
connections in parallel and the connections are bound on multiple
processors, hence they are running in parallel.

Regards
Matheos

--

To: <Matheos.Worku@...>
Cc: <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <herbert@...>, <hadi@...>
Date: Thursday, March 27, 2008 - 8:34 pm

From: Matheos Worku <Matheos.Worku@Sun.COM>

Yes, this is what I was interested in.

I think I know what's wrong.

If one cpu gets into the "qdisc remove, give to device" loop, other
cpus will simply add to the qdisc and that first cpu will do the all
of the TX processing.

This helps performance, but in your case it is clear that if the
device is fast enough and there are enough other cpus generating TX
traffic, it is quite trivial to get a cpu wedged there and never exit.

The code in question is net/sched/sch_generic.c:__qdisc_run(), it just
loops there until the device TX fills up or there are no more packets
in the qdisc queue.

qdisc_run() (in include/linux/pkt_sched.h) sets
__LINK_STATE_QDISC_RUNNING to tell other cpus that there is a cpu
processing the queue inside of __qdisc_run().

net/core/dev.c:dev_queue_xmit() then goes:

if (q->enqueue) {
/* Grab device queue */
spin_lock(&dev->queue_lock);
q = dev->qdisc;
if (q->enqueue) {
/* reset queue_mapping to zero */
skb_set_queue_mapping(skb, 0);
rc = q->enqueue(skb, q);
qdisc_run(dev);
spin_unlock(&dev->queue_lock);

rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
goto out;
}
spin_unlock(&dev->queue_lock);
}

The first cpu will get into __qdisc_run(), but the other
ones will just q->enqueue() and exit since the first cpu
has indicated it is processing the qdisc.

I'm not sure how we should fix this at the moment, we want
to keep the behavior but on the other hand we need to
break out of this so we don't get stuck here for too long.
--

To: David Miller <davem@...>
Cc: <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <hadi@...>
Date: Thursday, March 27, 2008 - 9:22 pm

Indeed. Since I created the problem it's only fair that I get
to fix it too :)

[NET]: Add preemption point in qdisc_run

The qdisc_run loop is currently unbounded and runs entirely
in a softirq. This is bad as it may create an unbounded softirq
run.

This patch fixes this by calling need_resched and breaking out
if necessary.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 10b5c08..80d53dd 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -187,6 +187,10 @@ void __qdisc_run(struct net_device *dev)
do {
if (!qdisc_restart(dev))
break;
+ if (need_resched()) {
+ netif_schedule(dev);
+ break;
+ }
} while (!netif_queue_stopped(dev));

clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
--

To: Herbert Xu <herbert@...>
Cc: David Miller <davem@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <hadi@...>
Date: Thursday, March 27, 2008 - 9:58 pm

Great.

In general, while the TX serialization improves performance in terms to
lock contention, wouldn't it reduce throughput since only one guy is
doing the actual TX at any given time. Wondering if it would be
worthwhile to have an enable/disable option specially for multi queue TX.

Regards
Matheos

--

To: Matheos Worku <Matheos.Worku@...>
Cc: David Miller <davem@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 6:38 am

Unless you have a truly lockless driver where two CPUs can transmit
at the same time then regardless of what we do only one CPU can
transmit at a time.

Given that premise, we might as well let one CPU transmit as much
as possible since moving to another CPU after each packet is going
to bounce a lot more than just the spin lock and that is going to
be expensive.

Of course getting stuck in that loop forever when there is other
work to be done (such as running the watchdog :) is bad. But
hopefully that's what this patch will fix.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

To: Herbert Xu <herbert@...>
Cc: Matheos Worku <Matheos.Worku@...>, David Miller <davem@...>, <jesse.brandeburg@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 9:38 am

On Fri, Mar 28, 2008 at 06:38:09PM +0800, Herbert Xu wrote:

Considering this, I wonder why using this __LINK_STATE_QDISC_RUNNING
flag to control enqueuing as well would be a wrong idea? Wouldn't this
enforce pseudo affinity?

Regards,
Jarek P.
--

To: Jarek Poplawski <jarkao2@...>
Cc: Matheos Worku <Matheos.Worku@...>, David Miller <davem@...>, <jesse.brandeburg@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 9:53 am

Could you elaborate on how you intend to use it to control the
act of enqueueing?

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

To: Herbert Xu <herbert@...>
Cc: Matheos Worku <Matheos.Worku@...>, David Miller <davem@...>, <jesse.brandeburg@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 10:39 am

Probably some backlog would be necessary, similarly to RX. The main
question is how much cache problem is such enqueuing from other CPUs.

Jarek P.
--

To: Jarek Poplawski <jarkao2@...>
Cc: Matheos Worku <Matheos.Worku@...>, David Miller <davem@...>, <jesse.brandeburg@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 10:56 am

Sorry but I still don't understand how this would differ from
what we do now. As it stands the packet would be queued for

It's pretty bad actually. In fact multiple TX queues isn't
going to be useful until we address this bottle-neck.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

To: Herbert Xu <herbert@...>
Cc: Matheos Worku <Matheos.Worku@...>, David Miller <davem@...>, <jesse.brandeburg@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 11:29 am

But during this, now limited, time of qdisc_run() there is a contention
for queue_lock and probably some additional cache updating because of
this other enqueuing, which could be delayed especially if queue length
is above some level.

Cheers,
Jarek P.
--

To: Jarek Poplawski <jarkao2@...>
Cc: Matheos Worku <Matheos.Worku@...>, David Miller <davem@...>, <jesse.brandeburg@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 9:06 pm

You mean delaying into a per-cpu queue? That sounds interesting.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

To: Herbert Xu <herbert@...>
Cc: Matheos Worku <Matheos.Worku@...>, David Miller <davem@...>, <jesse.brandeburg@...>, <netdev@...>, <hadi@...>
Date: Saturday, March 29, 2008 - 5:11 am

I mean any delaying could be necessary here. After rethinking it seems
to me this solution with the flag could be wrong even after current fix.

The owner of the flag has to give up queue_lock for some time, and
because of this its chances for regaining the lock are worse: other CPUs
could take it in a loop, winning the cache, and adding packets, which
are imediately dumped (or requeued).

So, it would make a kind of reverse lockup situation. Then, even normal
contention for both locks seems safer against such races: throughput
could be worse, but probably no such "(soft)lockup" risk.

Regards,
Jarek P.
--

To: Herbert Xu <herbert@...>
Cc: Matheos Worku <Matheos.Worku@...>, David Miller <davem@...>, <jesse.brandeburg@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 11:47 am

On Fri, Mar 28, 2008 at 04:29:53PM +0100, Jarek Poplawski wrote:

...Hm... this would need some special treatment for RT traffic yet...

Jarek P.
--

To: Matheos Worku <Matheos.Worku@...>
Cc: Herbert Xu <herbert@...>, David Miller <davem@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>
Date: Friday, March 28, 2008 - 6:33 am

Empirical evidence so far says at some point the bottleneck is going to
be the wire i.e modern CPUs are "fast enough" that sooner than later
they will fill up the DMA ring of transmitting driver and go back to
doing other things.
It is hard to create the condition you seem to have come across. I had
access to a dual core opteron but found it very hard with parallel UDP
sessions to keep the TX CPU locked in that region (while the other 3
were busy pumping packets). My folly could have been that i had a Gige
wire and maybe a 10G would have recreated the condition.
If you can reproduce this at will, can you try to reduce the number of
sending TX u/iperfs and see when it begins to happen?
Are all the iperfs destined out of the same netdevice?

[Typically the TX path on the driver side is inefficient either because
of coding (ex: unnecessary locks) or expensive IO. But this has not
mattered much thus far (given fast enough CPUs).
It all could be improved by reducing the per packet operations the
driver incurs - as an example, the CPU (to the driver) could batch a
set of packet to the device then kick the device DMA once for the batch
etc.]

cheers,
jamal

--

To: <hadi@...>
Cc: Herbert Xu <herbert@...>, David Miller <davem@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>
Date: Friday, March 28, 2008 - 1:00 pm

I am using 10G nic at this time. With the same driver, I haven't come
across the lockup on 1G nic though I haven't really tried to reproduce
it. Regarding the number of connection it takes to create the
situation, I have noticed the lockup at 3 or more udp connections.
That could be true though oprofile is not providing obvious clues,
Regards

--

To: <herbert@...>
Cc: <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <hadi@...>
Date: Thursday, March 27, 2008 - 9:38 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

This runs from softirqs, the local thread's scheduling
state is updated from timers which also run from softirqs,
so this need_resched() test won't work.

Probably you'll need something similar to the ->poll()
NAPI breakout logic, which uses 2 jiffies as the breakout
point. (it uses 2, because we could be very close to
transitioning to the next jiffie and thus don't want
to breakout prematurely in that case)
--

To: David Miller <davem@...>
Cc: <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <hadi@...>, Ingo Molnar <mingo@...>
Date: Friday, March 28, 2008 - 6:29 am

I had a trawl through the scheduler/timer code and it appears
that even with softirqs disabled we should able to set the flag
through this call chain (on x86-32):

timer_interrupt => do_timer_interrupt_hook => tick_handle_periodic =>
tick_periodic => update_process_times => scheduler_tick

Ingo, could you confirm that the scheduler is capable of setting
need_resched even with BH disabled?

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

To: Herbert Xu <herbert@...>
Cc: David Miller <davem@...>, <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 6:56 am

hm, what's the context of this discussion? The call chain looks ok,
that's how we preempt tasks from the timer tick. But other code besides
the scheduler shouldnt do this.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: David Miller <davem@...>, <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 7:06 am

The code under discussion is __qdisc_run from net/sched/sch_generic.c.

It runs with BH off from either process context or softirq context.
As it is it can keep running forever. We were discussing adding
a need_resched check in there. So the question is would need_resched
ever get updated while BH is disabled?

Anyway, I've just realised that even if it does get updated, we still
need a better bound to avoid starving other local softirq events so
this is probably moot.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

To: Herbert Xu <herbert@...>
Cc: David Miller <davem@...>, <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 10:09 am

yes, certainly - as long as the timer irq is allowed. (i.e. hardirqs are
not disabled) The scheduler tick runs from hardirq context, not from
softirq context. (that's how we can preempt softirq threads on the -rt
kernel for example)

Ingo
--

To: David S. Miller <davem@...>
Cc: Ingo Molnar <mingo@...>, <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 7:29 am

OK, since we don't really have any good ways of balancing softirq
events with each other, I've taken Dave's suggestion of checking
the jiffies as is done with NAPI. I've kept the need_resched to
minimise the scheduling latency.

[NET]: Add preemption point in qdisc_run

The qdisc_run loop is currently unbounded and runs entirely
in a softirq. This is bad as it may create an unbounded softirq
run.

This patch fixes this by calling need_resched and breaking out
if necessary.

It also adds a break out if the jiffies value changes since that
would indicate we've been transmitting for too long which starves
other softirqs.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 10b5c08..b741618 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -184,10 +184,22 @@ static inline int qdisc_restart(struct net_device *dev)

void __qdisc_run(struct net_device *dev)
{
- do {
- if (!qdisc_restart(dev))
+ unsigned long start_time = jiffies;
+
+ while (qdisc_restart(dev)) {
+ if (netif_queue_stopped(dev))
+ break;
+
+ /*
+ * Postpone processing if
+ * 1. another process needs the CPU;
+ * 2. we've been doing it for too long.
+ */
+ if (need_resched() || jiffies != start_time) {
+ netif_schedule(dev);
break;
- } while (!netif_queue_stopped(dev));
+ }
+ }

clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
}
--

To: <herbert@...>
Cc: <mingo@...>, <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>, <hadi@...>
Date: Friday, March 28, 2008 - 7:25 pm

From: Herbert Xu <herbert@gondor.apana.org.au>

Applied, and I'll queue this up for -stable, thanks!
--

To: Herbert Xu <herbert@...>
Cc: David S. Miller <davem@...>, Ingo Molnar <mingo@...>, <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>
Date: Friday, March 28, 2008 - 8:19 am

i think the need_resched would be effective.

**** heres something that has not yet resolved in my mind (even for the
NAPI case which was inherited from code that has been there forever).
Theres probably a very simple answer ;->
say we have two machines:
one faster and we decide HZ = 1000 meaning 1ms for a jiffy; the other
slower and we make HZ 100 i.e 10ms jiffy.
Logically, shouldnt the slower machine allocate less time running the
loop since it has less resources? in the above case it would spend 10
times more.

cheers,
jamal

--

To: jamal <hadi@...>
Cc: David S. Miller <davem@...>, Ingo Molnar <mingo@...>, <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>
Date: Friday, March 28, 2008 - 9:26 am

I agree that using jiffies is a pretty coarse approximation of
proper scheduling. However, in the absence of a better solution
we have to live with it.

Perhaps running these out of process context is the correct
approach.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

To: Herbert Xu <herbert@...>
Cc: jamal <hadi@...>, David S. Miller <davem@...>, <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>
Date: Friday, March 28, 2008 - 10:12 am

yes. Such anonymous work loops inside softirq context are a disaster to
TCP determinism and a disaster to scheduling in general (the wrong guy
gets credited with the overhead). Softirqs were a neat hack 10 years
ago, now if we know the target task for some workload we should execute
as much of the workload in that task's context as possible. (and even
for stuff where we dont have a 'target task' - routing, filtering, etc.
- it might be better to use kernel threads.)

Ingo
--

To: Herbert Xu <herbert@...>
Cc: David S. Miller <davem@...>, Ingo Molnar <mingo@...>, <Matheos.Worku@...>, <jesse.brandeburg@...>, <jarkao2@...>, <netdev@...>
Date: Friday, March 28, 2008 - 10:07 am

I know it works well enough and has for years; sure you could do better
but i wasnt questioning the solution - more than anything on my part

I saw something from Max Krasnyansky along those lines but havent quiet
followed it up. Should be interesting to see the effect over a large
types of workloads.

cheers,
jamal

--

To: Matheos Worku <Matheos.Worku@...>
Cc: <netdev@...>
Date: Wednesday, March 26, 2008 - 5:46 pm

On Wed, Mar 26, 2008 at 01:26:00PM -0700, Matheos Worku wrote:

Hmm... Is this before disabling or I miss something?

Of course, my hint could be wrong here, but there are some issues
around napi vs. irq balancing, so it would be interesting to check
this with all network cards irqs affinity set (and respected).

This could be interesting:
http://marc.info/?l=linux-netdev&m=120349568512423&w=2
(but I suspect there are possible other reasons yet)

Jarek P.
--

To: Matheos Worku <Matheos.Worku@...>
Cc: <netdev@...>
Date: Wednesday, March 26, 2008 - 5:53 pm

OOPS! I see - only not cleared. So it's OK!

Sorry,
Jarek P.
--

To: Matheos Worku <Matheos.Worku@...>
Cc: <netdev@...>
Date: Wednesday, March 26, 2008 - 1:31 pm

Is oprofile running on an AMD processor able to profile where/when
interrupts are disabled? If not there could be a fair bit missing from
the profile.

rick jones
--

Previous thread: [PATCH net-2.6.26][IPV6]: Fix potential net leak and oops in ipv6 routing code. by Pavel Emelyanov on Wednesday, March 26, 2008 - 11:55 am. (3 messages)

Next thread: [PATCH] e1000e: test MSI interrupts by Auke Kok on Wednesday, March 26, 2008 - 2:36 pm. (16 messages)