Re: [PATCH -v2] x86: enable preemption in delay

Previous thread: [PLEASE REVERT] "Add a WARN_ON_ONCE" at net/sched/sch_generic.c: by Komuro on Saturday, May 24, 2008 - 10:23 pm. (13 messages)

Next thread: kobject_cleanup should set kobject.k_name to NULL? by sasin on Saturday, May 24, 2008 - 10:28 pm. (1 message)
To: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>
Cc: <akpm@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Saturday, May 24, 2008 - 11:11 pm

The RT team has been searching for a nasty latency. This latency shows
up out of the blue and has been seen to be as big as 5ms!

Using ftrace I found the cause of the latency.

pcscd-2995 3dNh1 52360300us : irq_exit (smp_apic_timer_interrupt)
pcscd-2995 3dN.2 52360301us : idle_cpu (irq_exit)
pcscd-2995 3dN.2 52360301us : rcu_irq_exit (irq_exit)
pcscd-2995 3dN.1 52360771us : smp_apic_timer_interrupt (apic_timer_interrupt
)
pcscd-2995 3dN.1 52360771us : exit_idle (smp_apic_timer_interrupt)

Here's an example of a 400 us latency. pcscd took a timer interrupt and
returned with "need resched" enabled, but did not reschedule until after
the next interrupt came in at 52360771us 400us later!

At first I thought we somehow missed a preemption check in entry.S. But
I also noticed that this always seemed to happen during a __delay call.

pcscd-2995 3dN.2 52360836us : rcu_irq_exit (irq_exit)
pcscd-2995 3.N.. 52361265us : preempt_schedule (__delay)

Looking at the x86 delay, I found my problem.

In git commit 35d5d08a085c56f153458c3f5d8ce24123617faf, Andrew Morton
placed preempt_disable around the entire delay due to TSC's not working
nicely on SMP. Unfortunately for those that care about latencies this
is devastating! Especially when we have callers to mdelay(8).

Here I enable preemption during the loop and account for anytime the task
migrates to a new CPU. The delay asked for may be extended a bit by
the migration, but delay only guarantees that it will delay for that minimum
time. Delaying longer should not be an issue.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
arch/x86/lib/delay_32.c | 23 ++++++++++++++++++++++-
arch/x86/lib/delay_64.c | 26 +++++++++++++++++++++++---
2 files changed, 45 insertions(+), 4 deletions(-)

Index: linux-compile.git/arch/x86/lib/delay_32.c
===================================================================
--- linux-compile.git.orig/arch/x86/lib/delay_32.c 2008-05-24 22:30:49.000000000 -0400
+...

To: Steven Rostedt <rostedt@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 2:53 pm

On Sat, 24 May 2008 23:11:20 -0400 (EDT)

we used to have a WARN_ON if mdelay was called while preemptable..
maybe we should put that back in?
--

To: Arjan van de Ven <arjan@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 3:32 pm

That would be a good idea, but that wouldn't solve this issue. The issue
here is that the mdelay itself disables preemption. The caller had
preemption enabled.

The problem was caused by a quick fix by Andrew to handle out of sync
TSC's for a delay that used TSC's as a counter.

-- Steve

--

To: Arjan van de Ven <arjan@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 3:01 pm

We'd better have one which warns, when mdelay is called with
preemption disabled.

Thanks,
tglx

--

To: Thomas Gleixner <tglx@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 3:28 pm

On Sun, 25 May 2008 21:01:32 +0200 (CEST)

argument for my variant was "should have used msleep instead"...

I don't disagree that mdelay() is harmful in general; at some point I
stuck a WARN_ON in mdelay when called from irq context...
--

To: Steven Rostedt <rostedt@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 8:44 am

Eeek, once you migrated you do this all the time. you need to update

Also this is really dangerous with unsynchronized TSCs. You might get
migrated and return immediately because the TSC on the other CPU is
far ahead.

What you really want is something like the patch below, but we should
reuse the sched_clock_cpu() thingy to make that simpler. Looking into
that right now.

Thanks,
tglx

diff --git a/arch/x86/lib/delay_32.c b/arch/x86/lib/delay_32.c
index 4535e6d..66a3c32 100644
--- a/arch/x86/lib/delay_32.c
+++ b/arch/x86/lib/delay_32.c
@@ -40,17 +40,51 @@ static void delay_loop(unsigned long loops)
:"0" (loops));
}

+/*
+ * 5 usec on a 1GHZ machine. Not necessarily correct, but not too long
+ * either.
+ */
+#define TSC_MIGRATE_COUNT 5000
+
/* TSC based delay: */
static void delay_tsc(unsigned long loops)
{
unsigned long bclock, now;
+ int cpu;

- preempt_disable(); /* TSC's are per-cpu */
+ preempt_disable();
+ cpu = smp_processor_id();
rdtscl(bclock);
do {
rep_nop();
- rdtscl(now);
- } while ((now-bclock) < loops);
+
+ /* Allow RT tasks to run */
+ preempt_enable();
+ preempt_disable();
+
+ /*
+ * It is possible that we moved to another CPU, and
+ * since TSC's are per-cpu we need to calculate
+ * that. The delay must guarantee that we wait "at
+ * least" the amount of time. Being moved to another
+ * CPU could make the wait longer but we just need to
+ * make sure we waited long enough. Rebalance the
+ * counter for this CPU.
+ */
+ if (unlikely(cpu != smp_processor_id())) {
+ if (loops <= TSC_MIGRATE_COUNT)
+ break;
+ cpu = smp_processor_id();
+ rdtscl(bclock);
+ loops -= TSC_MIGRATE_COUNT;
+ } else {
+ rdtscl(now);
+ if ((now - bclock) >= loops)
+ break;
+ loops -= (now - bclock);
+ bclock = now;
+ }
+ } while (loops > 0);
preempt_enable();
}
--

To: Thomas Gleixner <tglx@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 2:01 pm

What happens with different cpus running on different frequencies...?
Cpufreq?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: Pavel Machek <pavel@...>
Cc: Thomas Gleixner <tglx@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Wednesday, May 28, 2008 - 9:01 am

It's not even protected with the old code.

inline void __const_udelay(unsigned long xloops)
{
__delay(((xloops * HZ *
cpu_data(raw_smp_processor_id()).loops_per_jiffy) >> 32) + 1);
}

Here it calculates the loops_per_jiffy for the CPU and calls into __delay.
But it can easily be preempted here and the delay could run on another
CPU.

-- Steve

--

To: Steven Rostedt <rostedt@...>
Cc: Thomas Gleixner <tglx@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Saturday, May 31, 2008 - 3:18 am

Maybe, but it is simple to fix as long as preemption is disabled. When
you enable it, it becomes much harder.

Lets get that fixed.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: Thomas Gleixner <tglx@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi-suse@...>
Date: Sunday, May 25, 2008 - 9:35 am

From: Steven Rostedt <srostedt@redhat.com>
Subject: x86: enable preemption in delay

The RT team has been searching for a nasty latency. This latency shows
up out of the blue and has been seen to be as big as 5ms!

Using ftrace I found the cause of the latency.

pcscd-2995 3dNh1 52360300us : irq_exit (smp_apic_timer_interrupt)
pcscd-2995 3dN.2 52360301us : idle_cpu (irq_exit)
pcscd-2995 3dN.2 52360301us : rcu_irq_exit (irq_exit)
pcscd-2995 3dN.1 52360771us : smp_apic_timer_interrupt (apic_timer_interrupt
)
pcscd-2995 3dN.1 52360771us : exit_idle (smp_apic_timer_interrupt)

Here's an example of a 400 us latency. pcscd took a timer interrupt and
returned with "need resched" enabled, but did not reschedule until after
the next interrupt came in at 52360771us 400us later!

At first I thought we somehow missed a preemption check in entry.S. But
I also noticed that this always seemed to happen during a __delay call.

pcscd-2995 3dN.2 52360836us : rcu_irq_exit (irq_exit)
pcscd-2995 3.N.. 52361265us : preempt_schedule (__delay)

Looking at the x86 delay, I found my problem.

In git commit 35d5d08a085c56f153458c3f5d8ce24123617faf, Andrew Morton
placed preempt_disable around the entire delay due to TSC's not working
nicely on SMP. Unfortunately for those that care about latencies this
is devastating! Especially when we have callers to mdelay(8).

Here I enable preemption during the loop and account for anytime the task
migrates to a new CPU. The delay asked for may be extended a bit by
the migration, but delay only guarantees that it will delay for that minimum
time. Delaying longer should not be an issue.

[
Thanks to Thomas Gleixner for spotting that cpu wasn't updated
and his idea of counting down loop instead.
]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>

---
arch/x86/lib/delay_32.c | 31 ++++++++++++++++++++++++++++---
arch/x86/lib/delay_64.c | 31 ++++++++++++++++++++++++++++---
2 files changed, 56 insertions(+)...

To: Thomas Gleixner <tglx@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi-suse@...>
Date: Sunday, May 25, 2008 - 9:44 am

BAH! expect version 3. :-p

This was compiled tested.

--

To: Thomas Gleixner <tglx@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi-suse@...>
Date: Sunday, May 25, 2008 - 9:51 am

The RT team has been searching for a nasty latency. This latency shows
up out of the blue and has been seen to be as big as 5ms!

Using ftrace I found the cause of the latency.

pcscd-2995 3dNh1 52360300us : irq_exit (smp_apic_timer_interrupt)
pcscd-2995 3dN.2 52360301us : idle_cpu (irq_exit)
pcscd-2995 3dN.2 52360301us : rcu_irq_exit (irq_exit)
pcscd-2995 3dN.1 52360771us : smp_apic_timer_interrupt (apic_timer_interrupt
)
pcscd-2995 3dN.1 52360771us : exit_idle (smp_apic_timer_interrupt)

Here's an example of a 400 us latency. pcscd took a timer interrupt and
returned with "need resched" enabled, but did not reschedule until after
the next interrupt came in at 52360771us 400us later!

At first I thought we somehow missed a preemption check in entry.S. But
I also noticed that this always seemed to happen during a __delay call.

pcscd-2995 3dN.2 52360836us : rcu_irq_exit (irq_exit)
pcscd-2995 3.N.. 52361265us : preempt_schedule (__delay)

Looking at the x86 delay, I found my problem.

In git commit 35d5d08a085c56f153458c3f5d8ce24123617faf, Andrew Morton
placed preempt_disable around the entire delay due to TSC's not working
nicely on SMP. Unfortunately for those that care about latencies this
is devastating! Especially when we have callers to mdelay(8).

Here I enable preemption during the loop and account for anytime the task
migrates to a new CPU. The delay asked for may be extended a bit by
the migration, but delay only guarantees that it will delay for that minimum
time. Delaying longer should not be an issue.

[
Thanks to Thomas Gleixner for spotting that cpu wasn't updated,
suggesting the loop count down, and to place the rep_nop between
preempt_enabled/disable.
]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>

---
arch/x86/lib/delay_32.c | 32 ++++++++++++++++++++++++++++----
arch/x86/lib/delay_64.c | 31 +++++++++++++++++++++++++++----
2 files changed, 55 insertions(+), 8 deletions(-)

Index: linux-tip....

To: Thomas Gleixner <tglx@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi-suse@...>
Date: Sunday, May 25, 2008 - 11:13 am

The RT team has been searching for a nasty latency. This latency shows
up out of the blue and has been seen to be as big as 5ms!

Using ftrace I found the cause of the latency.

pcscd-2995 3dNh1 52360300us : irq_exit (smp_apic_timer_interrupt)
pcscd-2995 3dN.2 52360301us : idle_cpu (irq_exit)
pcscd-2995 3dN.2 52360301us : rcu_irq_exit (irq_exit)
pcscd-2995 3dN.1 52360771us : smp_apic_timer_interrupt (apic_timer_interrupt
)
pcscd-2995 3dN.1 52360771us : exit_idle (smp_apic_timer_interrupt)

Here's an example of a 400 us latency. pcscd took a timer interrupt and
returned with "need resched" enabled, but did not reschedule until after
the next interrupt came in at 52360771us 400us later!

At first I thought we somehow missed a preemption check in entry.S. But
I also noticed that this always seemed to happen during a __delay call.

pcscd-2995 3dN.2 52360836us : rcu_irq_exit (irq_exit)
pcscd-2995 3.N.. 52361265us : preempt_schedule (__delay)

Looking at the x86 delay, I found my problem.

In git commit 35d5d08a085c56f153458c3f5d8ce24123617faf, Andrew Morton
placed preempt_disable around the entire delay due to TSC's not working
nicely on SMP. Unfortunately for those that care about latencies this
is devastating! Especially when we have callers to mdelay(8).

Here I enable preemption during the loop and account for anytime the task
migrates to a new CPU. The delay asked for may be extended a bit by
the migration, but delay only guarantees that it will delay for that minimum
time. Delaying longer should not be an issue.

[
Thanks to Thomas Gleixner for spotting that cpu wasn't updated,
and to place the rep_nop between preempt_enabled/disable.
]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
arch/x86/lib/delay_32.c | 31 +++++++++++++++++++++++++++----
arch/x86/lib/delay_64.c | 30 ++++++++++++++++++++++++++----
2 files changed, 53 insertions(+), 8 deletions(-)

Index: linux-tip.git/arch/x86/lib/delay_32.c
============...

To: Thomas Gleixner <tglx@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi-suse@...>
Date: Sunday, May 25, 2008 - 11:04 am

Bah, this is mathematically incorrect. Going for -v4!

A simple patch like this shouldn't take so much. I must need more sleep.

--

To: Thomas Gleixner <tglx@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 9:17 am

And what happens when we have 10GHz boxes that can do migration in 1us,
and the delay that is asked for is 2us. We can return early. I don't like
to place assumptions of this kind that can hurt with future hardware
enhancements.

--

To: Steven Rostedt <rostedt@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 4:21 pm

Do the math. It's 500ns on a 10GHz machine then. So nothing to worry
about and future proof.

Also when those 10GHZ machines arrive the synchronized TSC should have
finally become reality.

Thanks,

tglx
--

To: Steven Rostedt <rostedt@...>
Cc: Thomas Gleixner <tglx@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 9:19 am

Then in the hypothetical future you fix it, and for now its clearly
documented. Th preempt botch in the current tree breaks all sorts of
existing real world setups so needs to go. (and we have people who do
stuff like mdelay(150);

Alan
--

To: Alan Cox <alan@...>
Cc: Thomas Gleixner <tglx@...>, LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 9:40 am

No, I'm saying that we don't need to account for the migrate. It will
only make the delay longer, and if the code was preemptible, that delay

My updated patch takes Thomas's patch into consideration. There's no need
to put on limited timeouts due to migration.

-- Steve

--

To: Thomas Gleixner <tglx@...>
Cc: LKML <linux-kernel@...>, linux-rt-users <linux-rt-users@...>, <akpm@...>, Ingo Molnar <mingo@...>, Clark Williams <clark.williams@...>, Peter Zijlstra <peterz@...>, Luis Claudio R. Goncalves <lclaudio@...>, Gregory Haskins <ghaskins@...>, Andi Kleen <ak@...>, Linus Torvalds <torvalds@...>
Date: Sunday, May 25, 2008 - 9:07 am

No it isn't ;-)

The now and bclock are both from before the migration. The cpus were the
same becaues we were under preempt disbled at the time. I recalculate
after the change has been noticed.

Sure, but this should be simple enough.

-- Steve

--

Previous thread: [PLEASE REVERT] "Add a WARN_ON_ONCE" at net/sched/sch_generic.c: by Komuro on Saturday, May 24, 2008 - 10:23 pm. (13 messages)

Next thread: kobject_cleanup should set kobject.k_name to NULL? by sasin on Saturday, May 24, 2008 - 10:28 pm. (1 message)