The RT team has been searching for a nasty latency. This latency shows
up out of the blue and has been seen to be as big as 5ms!Using ftrace I found the cause of the latency.
pcscd-2995 3dNh1 52360300us : irq_exit (smp_apic_timer_interrupt)
pcscd-2995 3dN.2 52360301us : idle_cpu (irq_exit)
pcscd-2995 3dN.2 52360301us : rcu_irq_exit (irq_exit)
pcscd-2995 3dN.1 52360771us : smp_apic_timer_interrupt (apic_timer_interrupt
)
pcscd-2995 3dN.1 52360771us : exit_idle (smp_apic_timer_interrupt)Here's an example of a 400 us latency. pcscd took a timer interrupt and
returned with "need resched" enabled, but did not reschedule until after
the next interrupt came in at 52360771us 400us later!At first I thought we somehow missed a preemption check in entry.S. But
I also noticed that this always seemed to happen during a __delay call.pcscd-2995 3dN.2 52360836us : rcu_irq_exit (irq_exit)
pcscd-2995 3.N.. 52361265us : preempt_schedule (__delay)Looking at the x86 delay, I found my problem.
In git commit 35d5d08a085c56f153458c3f5d8ce24123617faf, Andrew Morton
placed preempt_disable around the entire delay due to TSC's not working
nicely on SMP. Unfortunately for those that care about latencies this
is devastating! Especially when we have callers to mdelay(8).Here I enable preemption during the loop and account for anytime the task
migrates to a new CPU. The delay asked for may be extended a bit by
the migration, but delay only guarantees that it will delay for that minimum
time. Delaying longer should not be an issue.Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
arch/x86/lib/delay_32.c | 23 ++++++++++++++++++++++-
arch/x86/lib/delay_64.c | 26 +++++++++++++++++++++++---
2 files changed, 45 insertions(+), 4 deletions(-)Index: linux-compile.git/arch/x86/lib/delay_32.c
===================================================================
--- linux-compile.git.orig/arch/x86/lib/delay_32.c 2008-05-24 22:30:49.000000000 -0400
+...
On Sat, 24 May 2008 23:11:20 -0400 (EDT)
we used to have a WARN_ON if mdelay was called while preemptable..
maybe we should put that back in?
--
That would be a good idea, but that wouldn't solve this issue. The issue
here is that the mdelay itself disables preemption. The caller had
preemption enabled.The problem was caused by a quick fix by Andrew to handle out of sync
TSC's for a delay that used TSC's as a counter.-- Steve
--
We'd better have one which warns, when mdelay is called with
preemption disabled.Thanks,
tglx--
On Sun, 25 May 2008 21:01:32 +0200 (CEST)
argument for my variant was "should have used msleep instead"...
I don't disagree that mdelay() is harmful in general; at some point I
stuck a WARN_ON in mdelay when called from irq context...
--
Eeek, once you migrated you do this all the time. you need to update
Also this is really dangerous with unsynchronized TSCs. You might get
migrated and return immediately because the TSC on the other CPU is
far ahead.What you really want is something like the patch below, but we should
reuse the sched_clock_cpu() thingy to make that simpler. Looking into
that right now.Thanks,
tglxdiff --git a/arch/x86/lib/delay_32.c b/arch/x86/lib/delay_32.c
index 4535e6d..66a3c32 100644
--- a/arch/x86/lib/delay_32.c
+++ b/arch/x86/lib/delay_32.c
@@ -40,17 +40,51 @@ static void delay_loop(unsigned long loops)
:"0" (loops));
}+/*
+ * 5 usec on a 1GHZ machine. Not necessarily correct, but not too long
+ * either.
+ */
+#define TSC_MIGRATE_COUNT 5000
+
/* TSC based delay: */
static void delay_tsc(unsigned long loops)
{
unsigned long bclock, now;
+ int cpu;- preempt_disable(); /* TSC's are per-cpu */
+ preempt_disable();
+ cpu = smp_processor_id();
rdtscl(bclock);
do {
rep_nop();
- rdtscl(now);
- } while ((now-bclock) < loops);
+
+ /* Allow RT tasks to run */
+ preempt_enable();
+ preempt_disable();
+
+ /*
+ * It is possible that we moved to another CPU, and
+ * since TSC's are per-cpu we need to calculate
+ * that. The delay must guarantee that we wait "at
+ * least" the amount of time. Being moved to another
+ * CPU could make the wait longer but we just need to
+ * make sure we waited long enough. Rebalance the
+ * counter for this CPU.
+ */
+ if (unlikely(cpu != smp_processor_id())) {
+ if (loops <= TSC_MIGRATE_COUNT)
+ break;
+ cpu = smp_processor_id();
+ rdtscl(bclock);
+ loops -= TSC_MIGRATE_COUNT;
+ } else {
+ rdtscl(now);
+ if ((now - bclock) >= loops)
+ break;
+ loops -= (now - bclock);
+ bclock = now;
+ }
+ } while (loops > 0);
preempt_enable();
}
--
What happens with different cpus running on different frequencies...?
Cpufreq?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
It's not even protected with the old code.
inline void __const_udelay(unsigned long xloops)
{
__delay(((xloops * HZ *
cpu_data(raw_smp_processor_id()).loops_per_jiffy) >> 32) + 1);
}Here it calculates the loops_per_jiffy for the CPU and calls into __delay.
But it can easily be preempted here and the delay could run on another
CPU.-- Steve
--
Maybe, but it is simple to fix as long as preemption is disabled. When
you enable it, it becomes much harder.Lets get that fixed.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
From: Steven Rostedt <srostedt@redhat.com>
Subject: x86: enable preemption in delayThe RT team has been searching for a nasty latency. This latency shows
up out of the blue and has been seen to be as big as 5ms!Using ftrace I found the cause of the latency.
pcscd-2995 3dNh1 52360300us : irq_exit (smp_apic_timer_interrupt)
pcscd-2995 3dN.2 52360301us : idle_cpu (irq_exit)
pcscd-2995 3dN.2 52360301us : rcu_irq_exit (irq_exit)
pcscd-2995 3dN.1 52360771us : smp_apic_timer_interrupt (apic_timer_interrupt
)
pcscd-2995 3dN.1 52360771us : exit_idle (smp_apic_timer_interrupt)Here's an example of a 400 us latency. pcscd took a timer interrupt and
returned with "need resched" enabled, but did not reschedule until after
the next interrupt came in at 52360771us 400us later!At first I thought we somehow missed a preemption check in entry.S. But
I also noticed that this always seemed to happen during a __delay call.pcscd-2995 3dN.2 52360836us : rcu_irq_exit (irq_exit)
pcscd-2995 3.N.. 52361265us : preempt_schedule (__delay)Looking at the x86 delay, I found my problem.
In git commit 35d5d08a085c56f153458c3f5d8ce24123617faf, Andrew Morton
placed preempt_disable around the entire delay due to TSC's not working
nicely on SMP. Unfortunately for those that care about latencies this
is devastating! Especially when we have callers to mdelay(8).Here I enable preemption during the loop and account for anytime the task
migrates to a new CPU. The delay asked for may be extended a bit by
the migration, but delay only guarantees that it will delay for that minimum
time. Delaying longer should not be an issue.[
Thanks to Thomas Gleixner for spotting that cpu wasn't updated
and his idea of counting down loop instead.
]Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
arch/x86/lib/delay_32.c | 31 ++++++++++++++++++++++++++++---
arch/x86/lib/delay_64.c | 31 ++++++++++++++++++++++++++++---
2 files changed, 56 insertions(+)...
BAH! expect version 3. :-p
This was compiled tested.
--
The RT team has been searching for a nasty latency. This latency shows
up out of the blue and has been seen to be as big as 5ms!Using ftrace I found the cause of the latency.
pcscd-2995 3dNh1 52360300us : irq_exit (smp_apic_timer_interrupt)
pcscd-2995 3dN.2 52360301us : idle_cpu (irq_exit)
pcscd-2995 3dN.2 52360301us : rcu_irq_exit (irq_exit)
pcscd-2995 3dN.1 52360771us : smp_apic_timer_interrupt (apic_timer_interrupt
)
pcscd-2995 3dN.1 52360771us : exit_idle (smp_apic_timer_interrupt)Here's an example of a 400 us latency. pcscd took a timer interrupt and
returned with "need resched" enabled, but did not reschedule until after
the next interrupt came in at 52360771us 400us later!At first I thought we somehow missed a preemption check in entry.S. But
I also noticed that this always seemed to happen during a __delay call.pcscd-2995 3dN.2 52360836us : rcu_irq_exit (irq_exit)
pcscd-2995 3.N.. 52361265us : preempt_schedule (__delay)Looking at the x86 delay, I found my problem.
In git commit 35d5d08a085c56f153458c3f5d8ce24123617faf, Andrew Morton
placed preempt_disable around the entire delay due to TSC's not working
nicely on SMP. Unfortunately for those that care about latencies this
is devastating! Especially when we have callers to mdelay(8).Here I enable preemption during the loop and account for anytime the task
migrates to a new CPU. The delay asked for may be extended a bit by
the migration, but delay only guarantees that it will delay for that minimum
time. Delaying longer should not be an issue.[
Thanks to Thomas Gleixner for spotting that cpu wasn't updated,
suggesting the loop count down, and to place the rep_nop between
preempt_enabled/disable.
]Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
arch/x86/lib/delay_32.c | 32 ++++++++++++++++++++++++++++----
arch/x86/lib/delay_64.c | 31 +++++++++++++++++++++++++++----
2 files changed, 55 insertions(+), 8 deletions(-)Index: linux-tip....
The RT team has been searching for a nasty latency. This latency shows
up out of the blue and has been seen to be as big as 5ms!Using ftrace I found the cause of the latency.
pcscd-2995 3dNh1 52360300us : irq_exit (smp_apic_timer_interrupt)
pcscd-2995 3dN.2 52360301us : idle_cpu (irq_exit)
pcscd-2995 3dN.2 52360301us : rcu_irq_exit (irq_exit)
pcscd-2995 3dN.1 52360771us : smp_apic_timer_interrupt (apic_timer_interrupt
)
pcscd-2995 3dN.1 52360771us : exit_idle (smp_apic_timer_interrupt)Here's an example of a 400 us latency. pcscd took a timer interrupt and
returned with "need resched" enabled, but did not reschedule until after
the next interrupt came in at 52360771us 400us later!At first I thought we somehow missed a preemption check in entry.S. But
I also noticed that this always seemed to happen during a __delay call.pcscd-2995 3dN.2 52360836us : rcu_irq_exit (irq_exit)
pcscd-2995 3.N.. 52361265us : preempt_schedule (__delay)Looking at the x86 delay, I found my problem.
In git commit 35d5d08a085c56f153458c3f5d8ce24123617faf, Andrew Morton
placed preempt_disable around the entire delay due to TSC's not working
nicely on SMP. Unfortunately for those that care about latencies this
is devastating! Especially when we have callers to mdelay(8).Here I enable preemption during the loop and account for anytime the task
migrates to a new CPU. The delay asked for may be extended a bit by
the migration, but delay only guarantees that it will delay for that minimum
time. Delaying longer should not be an issue.[
Thanks to Thomas Gleixner for spotting that cpu wasn't updated,
and to place the rep_nop between preempt_enabled/disable.
]Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
arch/x86/lib/delay_32.c | 31 +++++++++++++++++++++++++++----
arch/x86/lib/delay_64.c | 30 ++++++++++++++++++++++++++----
2 files changed, 53 insertions(+), 8 deletions(-)Index: linux-tip.git/arch/x86/lib/delay_32.c
============...
Bah, this is mathematically incorrect. Going for -v4!
A simple patch like this shouldn't take so much. I must need more sleep.
--
And what happens when we have 10GHz boxes that can do migration in 1us,
and the delay that is asked for is 2us. We can return early. I don't like
to place assumptions of this kind that can hurt with future hardware
enhancements.--
Do the math. It's 500ns on a 10GHz machine then. So nothing to worry
about and future proof.Also when those 10GHZ machines arrive the synchronized TSC should have
finally become reality.Thanks,
tglx
--
Then in the hypothetical future you fix it, and for now its clearly
documented. Th preempt botch in the current tree breaks all sorts of
existing real world setups so needs to go. (and we have people who do
stuff like mdelay(150);Alan
--
No, I'm saying that we don't need to account for the migrate. It will
only make the delay longer, and if the code was preemptible, that delayMy updated patch takes Thomas's patch into consideration. There's no need
to put on limited timeouts due to migration.-- Steve
--
No it isn't ;-)
The now and bclock are both from before the migration. The cpus were the
same becaues we were under preempt disbled at the time. I recalculate
after the change has been noticed.Sure, but this should be simple enough.
-- Steve
--
| Greg Kroah-Hartman | [PATCH 001/196] Chinese: Add the known_regression URI to the HOWTO |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Amit K. Arora | [RFC] Heads up on sys_fallocate() |
| Bart Van Assche | Integration of SCST in the mainstream Linux kernel |
git: | |
| David Miller | [GIT]: Networking |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Frans Pop | svc: failed to register lockdv1 RPC service (errno 97). |
