Re: ksoftirqd high cpu load on kernels 2.6.24 to 2.6.27-rc1-mm1

Previous thread: pull request: wireless-2.6 2008-08-04 by John W. Linville on Monday, August 4, 2008 - 3:37 pm. (16 messages)

Next thread: Re: [PATCH] sky2: fix PM related regressions by Rafael J. Wysocki on Monday, August 4, 2008 - 5:12 pm. (1 message)
To: Robert Hancock <hancockr@...>
Cc: LKML <linux-kernel@...>, <romieu@...>, <netdev@...>
Date: Monday, August 4, 2008 - 4:37 pm

Hello

I googled a little and found out that oprofile is a little above my head.
So as I thought that some driver or HW might be responsible for this I
tried to disable various onboard HW and found out that if I disable
onboard ethernet problem dissapears, so now I've added netdev and
maintainer of R8169 driver to cc.

Thanks
Dushan
--

To: Dushan Tcholich <dusanc@...>
Cc: Robert Hancock <hancockr@...>, <netdev@...>
Date: Thursday, August 7, 2008 - 2:58 pm

Dushan Tcholich <dusanc@gmail.com> :

Can you try 2.6.27-rc2 and send the content of /proc/interrupts, dmesg,
ifconfig as well as a capture of the strange output from top ?

It seems rather benign though.

--
Ueimor
--

To: Francois Romieu <romieu@...>
Cc: Robert Hancock <hancockr@...>, <netdev@...>
Date: Sunday, August 10, 2008 - 3:00 pm

Hi
Sorry for answering this late, but I was short on time and couldn't
get reiser4 to work with 2.6.27-rc2

I've copied my root to ext3 partition and with vanilla 2.6.27-rc2 I got:
-With my .config problem is still here
Well I wouldn't agree from power managment standpoint :). This nic is
in a lot of laptops.
8% of 2.13GHz Core2Duo CPU is a lot :)

Btw. should LKML be removed from cc?
If you need any more help please ask.
Have a nice day
Dushan

Requested data:

top output:

top - 20:50:27 up 8 min, 3 users, load average: 0.74, 0.45, 0.20
Tasks: 114 total, 1 running, 113 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 3.0%sy, 0.0%ni, 93.7%id, 0.0%wa, 0.2%hi, 0.3%si, 0.0%st
Mem: 1033720k total, 287288k used, 746432k free, 9860k buffers
Swap: 610460k total, 0k used, 610460k free, 129228k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7 root 15 -5 0 0 0 S 8 0.0 0:37.75 ksoftirqd/1

Note the 8min uptime and ksoftirqd is already on 37min

cat /proc/interrupts
CPU0 CPU1
0: 126 0 IO-APIC-edge timer
1: 1042 0 IO-APIC-edge i8042
4: 1676 0 IO-APIC-edge serial
9: 0 0 IO-APIC-fasteoi acpi
16: 46224 0 IO-APIC-fasteoi uhci_hcd:usb3,
radeon@pci:0000:01:00.0
18: 21 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb7
19: 15440 0 IO-APIC-fasteoi ata_piix, ata_piix, uhci_hcd:usb6
21: 1269 0 IO-APIC-fasteoi uhci_hcd:usb4, eth0
22: 202 0 IO-APIC-fasteoi HDA Intel
23: 61 0 IO-APIC-fasteoi ehci_hcd:usb2, uhci_hcd:usb5
NMI: 0 0 Non-maskable interrupts
LOC: 177598 165568 Local timer interrupts
RES: 1224 1729 Rescheduling interrupts
CAL: 1247 1548 function call interrupts
TLB: 1577 1848 TLB shootdowns
T...

To: Francois Romieu <romieu@...>
Cc: Robert Hancock <hancockr@...>, <netdev@...>
Date: Monday, August 11, 2008 - 3:53 am

Hi

I tried some more kernels and I had the same problem with 2.6.23.17
and 2.6.27-rc1-mm1, but I couldn't reproduce it with kernel from
sysresccd 1.0.1 http://www.sysresccd.org/ which is a patched version
of 2.6.24 i think .7 when I booted it to change fs.
...
--

To: Francois Romieu <romieu@...>
Cc: Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>
Date: Friday, August 29, 2008 - 9:48 pm

Hello again

I dug a little more and found out some new info.
Unsolved bugreport with same symptoms:
http://marc.info/?l=linux-kernel&m=119613299024398&w=2
Problems appear if I start br0 interface, as context switch rate
increases 200 times. If I start eth0 instead everything looks ok.
The bugreport above had bridging enabled too.

When using eth0 I get:
vmstat -n 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 17720 162288 25024 576852 0 0 21 22 6 9 10 4 85 0
0 0 17720 162196 25024 576880 0 0 0 0 13 473 0 0 99 0
0 0 17720 162196 25028 576880 0 0 0 4 36 1122 1 0 99 0
0 0 17720 162196 25028 576880 0 0 0 0 83 844 0 0 99 0
0 0 17720 162196 25028 576880 0 0 0 0 55 691 1 0 99 0
0 0 17720 162556 25028 576880 0 0 0 0 13 490 1 0 100 0
0 0 17720 162100 25028 576880 0 0 0 0 39 561 6 0 94 0
1 0 17720 162028 25028 576880 0 0 0 0 16 1030 4 0 96 0
0 0 17720 162028 25028 576880 0 0 0 0 40 597 1 0 99 0
0 0 17720 162028 25028 576880 0 0 0 0 12 512 2 0 97 0

top

top - 03:30:07 up 6 days, 5:40, 4 users, load average: 0.02, 0.11, 0.28
Tasks: 149 total, 2 running, 147 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.1%us, 3.3%sy, 0.0%ni, 84.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1033388k total, 871336k used, 162052k free, 25016k buffers
Swap: 610460k total, 17720k used, 592740k free, 576852k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

4 root 15 -5 0 0 0 S 0 0.0 554:23.44 ksoftirqd/0

If I use br0 I get:
vmstat -n 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi ...

To: Francois Romieu <romieu@...>
Cc: Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>
Date: Sunday, August 31, 2008 - 4:51 am

Hello
I found the culprit.

When using powertop I get:
Top causes for wakeups:
35,2% (251,0) ip : br_stp_enable_bridge (br_hello_timer_expired

So I tried to turn them off with:
brctl sethello br0 0
but the problem persisted.
If I do
brctl sethello br0 5
context switch rate drops 200 times and problem is gone.
I think that the command brctl sethello br0 0 doesn't turn off hello
messages, but sends them 250 times per second.

Thanks for your time
Dushan
--

To: Dushan Tcholich <dusanc@...>
Cc: Francois Romieu <romieu@...>, Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>
Date: Sunday, August 31, 2008 - 1:05 pm

On Sun, 31 Aug 2008 10:51:46 +0200

You can't turn off the hello timer, it is needed for Spanning Tree to
work. The kernel should reject requests to set hello timer < 1sec.
Most routers allow 1 - 10sec.

I am going to do a new patch to add tighter range checking for STP timer
settings and another to default fowarding delay of zero if STP is disabled.
--

To: Stephen Hemminger <shemminger@...>
Cc: Francois Romieu <romieu@...>, Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>
Date: Sunday, August 31, 2008 - 3:14 pm

Hi

On Sun, Aug 31, 2008 at 7:05 PM, Stephen Hemminger
Well I try to turn stp off but it doesn't want to :)
This is in my /etc/conf.d/net
brctl_br0=( "setfd 0" "sethello 10" "stp off" )
I had problems with sethello 0 so now I'm using 10.
Tried with
brctl stp br0 off
but still had same troubles.

Have a nice day
Dushan
--

To: David Miller <davem@...>, Dushan Tcholich <dusanc@...>
Cc: Francois Romieu <romieu@...>, Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>, <bridge@...>
Date: Sunday, August 31, 2008 - 1:43 pm

The Spanning Tree Protocol timers need to be set within certain boundaries
to keep the internal protocol engine working, and to be interoperable.
This patch restricts changes to those timers to the values defined in IEEE 802.1D
specification.

The only exception to the standards are:
* if STP is disabled allow forwarding delay to be turned off
* allow wider range of ageing timer since this isn't directly part of
STP, and setting it to zero allows for non-remembering bridge.

Warning: this may cause user backlash since apparently working but standards
conforming configurations will get configuration errors that they didn't
see before.

--- a/net/bridge/br_ioctl.c 2008-08-31 10:00:44.000000000 -0700
+++ b/net/bridge/br_ioctl.c 2008-08-31 10:34:00.000000000 -0700
@@ -177,38 +177,63 @@ static int old_dev_ioctl(struct net_devi
}

case BRCTL_SET_BRIDGE_FORWARD_DELAY:
+ {
+ unsigned long t = clock_t_to_jiffies(args[1]);
if (!capable(CAP_NET_ADMIN))
return -EPERM;

+ /* enforce range checking per IEEE 802.1D 17.14 */
+ if (br->stp_enabled != BR_NO_STP &&
+ (t < 4*HZ || t > 30 * HZ))
+ return -EINVAL;
+
spin_lock_bh(&br->lock);
- br->bridge_forward_delay = clock_t_to_jiffies(args[1]);
+ br->bridge_forward_delay = t;
if (br_is_root_bridge(br))
br->forward_delay = br->bridge_forward_delay;
spin_unlock_bh(&br->lock);
return 0;
-
+ }
case BRCTL_SET_BRIDGE_HELLO_TIME:
+ {
+ unsigned long t = clock_t_to_jiffies(args[1]);
+
if (!capable(CAP_NET_ADMIN))
return -EPERM;

+ if (t < HZ || t > 15 * HZ)
+ return -EINVAL;
+
spin_lock_bh(&br->lock);
- br->bridge_hello_time = clock_t_to_jiffies(args[1]);
+ br->bridge_hello_time = t;
if (br_is_root_bridge(br))
br->hello_time = br->bridge_hello_time;
spin_unlock_bh(&br->lock);
return 0;
-
+ }
case BRCTL_SET_BRIDGE_MAX_AGE:
+ {
+ unsigned long t = clock_t_to_jiffies(args[1...

To: David Miller <davem@...>
Cc: Dushan Tcholich <dusanc@...>, Francois Romieu <romieu@...>, Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>, <bridge@...>
Date: Thursday, September 4, 2008 - 6:47 pm

The bridge hello time can't be safely set to values less than 1 second,
otherwise it is possible to end up with a runaway timer.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/bridge/br_ioctl.c 2008-09-04 15:25:41.000000000 -0700
+++ b/net/bridge/br_ioctl.c 2008-09-04 15:44:33.000000000 -0700
@@ -188,15 +188,21 @@ static int old_dev_ioctl(struct net_devi
return 0;

case BRCTL_SET_BRIDGE_HELLO_TIME:
+ {
+ unsigned long t = clock_t_to_jiffies(args[1]);
if (!capable(CAP_NET_ADMIN))
return -EPERM;

+ if (t < HZ)
+ return -EINVAL;
+
spin_lock_bh(&br->lock);
- br->bridge_hello_time = clock_t_to_jiffies(args[1]);
+ br->bridge_hello_time = t;
if (br_is_root_bridge(br))
br->hello_time = br->bridge_hello_time;
spin_unlock_bh(&br->lock);
return 0;
+ }

case BRCTL_SET_BRIDGE_MAX_AGE:
if (!capable(CAP_NET_ADMIN))
--- a/net/bridge/br_sysfs_br.c 2008-09-04 15:27:20.000000000 -0700
+++ b/net/bridge/br_sysfs_br.c 2008-09-04 15:33:31.000000000 -0700
@@ -29,11 +29,12 @@
*/
static ssize_t store_bridge_parm(struct device *d,
const char *buf, size_t len,
- void (*set)(struct net_bridge *, unsigned long))
+ int (*set)(struct net_bridge *, unsigned long))
{
struct net_bridge *br = to_bridge(d);
char *endp;
unsigned long val;
+ int err;

if (!capable(CAP_NET_ADMIN))
return -EPERM;
@@ -43,9 +44,9 @@ static ssize_t store_bridge_parm(struct
return -EINVAL;

spin_lock_bh(&br->lock);
- (*set)(br, val);
+ err = (*set)(br, val);
spin_unlock_bh(&br->lock);
- return len;
+ return err ? err : len;
}

@@ -56,12 +57,13 @@ static ssize_t show_forward_delay(struct
return sprintf(buf, "%lu\n", jiffies_to_clock_t(br->forward_delay));
}

-static void set_forward_delay(struct net_bridge *br, unsigned long val)
+static int set_forward_delay(struct net_bridge *br, unsigned long val)
{
unsigned long delay = clock_t_to_jiffies(val);...

To: <shemminger@...>
Cc: <dusanc@...>, <romieu@...>, <hancockr@...>, <netdev@...>, <linux-kernel@...>, <bridge@...>
Date: Monday, September 8, 2008 - 4:46 pm

From: Stephen Hemminger <shemminger@vyatta.com>

Applied, thanks Stephen.

I added more information to the commit message so that Dushan's
incredibly contribution to this bug getting fixed are mentioned.
I don't see how we would have figured out Bridging as even the
cause without his detective work. So it's definitely wrong not
to give him at least some mention in the commit message :-/

bridge: don't allow setting hello time to zero

Dushan Tcholich reports that on his system ksoftirqd can consume
between %6 to %10 of cpu time, and cause ~200 context switches per
second.

He then correlated this with a report by bdupree@techfinesse.com:

http://marc.info/?l=linux-kernel&m=119613299024398&w=2

and the culprit cause seems to be starting the bridge interface.
In particular, when starting the bridge interface, his scripts
are specifying a hello timer interval of "0".

The bridge hello time can't be safely set to values less than 1
second, otherwise it is possible to end up with a runaway timer.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/bridge/br_ioctl.c | 8 +++++++-
net/bridge/br_sysfs_br.c | 26 ++++++++++++++++++--------
2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index eeee218..5bbf073 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -188,15 +188,21 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
return 0;

case BRCTL_SET_BRIDGE_HELLO_TIME:
+ {
+ unsigned long t = clock_t_to_jiffies(args[1]);
if (!capable(CAP_NET_ADMIN))
return -EPERM;

+ if (t < HZ)
+ return -EINVAL;
+
spin_lock_bh(&br->lock);
- br->bridge_hello_time = clock_t_to_jiffies(args[1]);
+ br->bridge_hello_time = t;
if (br_is_root_bridge(br))
br->hello_time = br->bridge_hello_time;
spin_unlock_bh(&br->lock);
...

To: David Miller <davem@...>
Cc: <shemminger@...>, <romieu@...>, <hancockr@...>, <netdev@...>, <linux-kernel@...>, <bridge@...>
Date: Monday, September 8, 2008 - 5:35 pm

I don't know what to say :)

A little nitpick: 200 times greater context switch rate :), like

Btw. is there a way to make the command to turn STP off work too?
brctl stp br0 off
Because AFAIK if I shut down STP the hello timer should shut down too,
but it still continues to work.

Thank you for your time and effort

--

To: Dushan Tcholich <dusanc@...>, David Miller <davem@...>
Cc: <romieu@...>, <hancockr@...>, <netdev@...>, <linux-kernel@...>, <bridge@...>
Date: Monday, September 8, 2008 - 6:33 pm

On Mon, 8 Sep 2008 23:35:19 +0200

The basics:
* Hello timer is always enabled
* STP defaults to off unless you turn it on
* Turn STP on/off with brctl.

In the existing design, the hello timer always runs, even when STP
is not turned on. If STP is not enabled, the packet is just never
created. Fixing it would not be hard (or gain much), but would have
to deal with complex lock ordering and timer problems, so it isn't
worth fixing for current releases.

--

To: <shemminger@...>
Cc: <dusanc@...>, <romieu@...>, <hancockr@...>, <netdev@...>, <linux-kernel@...>, <bridge@...>
Date: Tuesday, September 2, 2008 - 8:28 pm

From: Stephen Hemminger <shemminger@vyatta.com>

I don't think we can really add these kinds of restrictions wholesale
like this.

And the user is reporting that using brctl to turn off STP doesn't
appear to actually turn off STP and thus fix all of the crazy
ksoftirqd high cpu load problems.

So what we need to do is resolve the user configuration issue that is
causing this problem to begin with.
--

To: Stephen Hemminger <shemminger@...>
Cc: David Miller <davem@...>, Dushan Tcholich <dusanc@...>, Francois Romieu <romieu@...>, Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>, <bridge@...>
Date: Sunday, August 31, 2008 - 10:25 pm

Did you mean "apparently working but *non*-standards conforming"?

Other than that, seems to be a sane application of "Be conservative in what you
send". Our network is some 30K cat-5 ports, 1100 switches, 1300 wireless
access points, and we appreciate it every time somebody makes things more
bulletproof. And yes, we prefer things to out-and-out *fail* rather than
run in a wonky configuration - hard failures usually get fixed in a few
minutes, wonkiness can drag on for months of mystifying symptoms...

To: Stephen Hemminger <shemminger@...>
Cc: David Miller <davem@...>, Dushan Tcholich <dusanc@...>, Francois Romieu <romieu@...>, Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>, <bridge@...>
Date: Sunday, August 31, 2008 - 6:02 pm

On Sun, 31 Aug 2008 10:43:09 -0700

Why do we care ? You have to be the network administrator to set values,
there are cases you may want to be out of the spec and you are
privileged. The kernel does need to stop things being done which are
fatal but running around restricting privileged administrators who have
the ability to bring the network down anyway isn't its job.

Seems bogus extra code to me - stops things working that should be
allowed too.
--

To: Alan Cox <alan@...>
Cc: Stephen Hemminger <shemminger@...>, David Miller <davem@...>, Dushan Tcholich <dusanc@...>, Francois Romieu <romieu@...>, Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>, <bridge@...>
Date: Sunday, August 31, 2008 - 7:29 pm

The timer configuration is propagated in network protocol, so
misconfigured Linux box
could survive but effect other devices on the network that are less
robust. Maybe the
small values would cause some other bridge to crash, go infinite loop, ...
More likely robust devices might ignore our packets (because values out
of range), leading to
routing loops and other disasters.

The kernel does need to stop administrative settings from taking out a
network. If someone
has a custom device or other non-standard usage, they can always rebuild
the kernel and
remove the range check.

--

To: Stephen Hemminger <stephen.hemminger@...>
Cc: Stephen Hemminger <shemminger@...>, David Miller <davem@...>, Dushan Tcholich <dusanc@...>, Francois Romieu <romieu@...>, Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>, <bridge@...>
Date: Monday, September 1, 2008 - 4:38 am

That would be irrelevant. CAP_NET_ADMIN lets you make that size mess

If you have CAP_NET_ADMIN you can trivially take out the network unless
it is properly switched.

Now you might want your pretty little GUI and/or config tools to warn
people that their configuration is outside 802 specs but that is a
different matter altogether

Alan
--

To: Alan Cox <alan@...>
Cc: Stephen Hemminger <stephen.hemminger@...>, Stephen Hemminger <shemminger@...>, David Miller <davem@...>, Dushan Tcholich <dusanc@...>, Francois Romieu <romieu@...>, Robert Hancock <hancockr@...>, <netdev@...>, LKML <linux-kernel@...>, <bridge@...>
Date: Tuesday, September 2, 2008 - 12:40 pm

Can one change the TCP maximum RTO to be smaller than specified in the
specs?

rick jones
--

To: <rick.jones2@...>
Cc: <alan@...>, <stephen.hemminger@...>, <shemminger@...>, <dusanc@...>, <romieu@...>, <hancockr@...>, <netdev@...>, <linux-kernel@...>, <bridge@...>
Date: Tuesday, September 2, 2008 - 7:41 pm

From: Rick Jones <rick.jones2@hp.com>

We always min-clamp the RTO at RTO calculation time in order to be
compatible with BSD's coarse grained times.
--

To: David Miller <davem@...>
Cc: <alan@...>, <stephen.hemminger@...>, <shemminger@...>, <dusanc@...>, <romieu@...>, <hancockr@...>, <netdev@...>, <linux-kernel@...>, <bridge@...>
Date: Tuesday, September 2, 2008 - 8:00 pm

But tuning TCP_RTO_MAX isn't permitted right? I'm drawing (perhaps
flawed) parallels/distinctions between what is/isn't permitted to tweak
for timers for one protocol versus another and wondering which may be a
case of sauce for the goose/gander.

rick jones
--

Previous thread: pull request: wireless-2.6 2008-08-04 by John W. Linville on Monday, August 4, 2008 - 3:37 pm. (16 messages)

Next thread: Re: [PATCH] sky2: fix PM related regressions by Rafael J. Wysocki on Monday, August 4, 2008 - 5:12 pm. (1 message)