Re: 'global' rq->clock

Previous thread: [PATCH] use unsigned long for zoran_driver by Steven Rostedt on Thursday, May 1, 2008 - 8:04 pm. (1 message)

Next thread: Re: huge gcc 4.1.{0,1} __weak problem by Chris Knadle on Thursday, May 1, 2008 - 7:55 pm. (10 messages)
To: Ingo Molnar <mingo@...>, <a.p.zijlstra@...>
Cc: LKML <linux-kernel@...>
Date: Thursday, May 1, 2008 - 8:14 pm

Hi Ingo, Peter

Under kernel compilation load (make -j8) audio is very much unusable
on current linux-2.6.git - The stutter is so bad I can hardly tell
what song I am listening to :)
Even with make -j4 it still skips but not that badly.

I am running Ubuntu Hardy (Pulseaudio) on a quad core machine and
switching back to Ubuntu shipped kernel (2.6.24-16) resolves the issue
totally.

.config below. Revert of e22ecef1d2658ba54ed7d3fdb5d60829fb434c23
fixed similar issue.

Parag

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25
# Thu May 1 19:47:33 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_DEFCONFIG_LIST="arch/x86/configs/x86_64_defconfig"
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE...

To: Parag Warudkar <parag.warudkar@...>
Cc: Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Frans Pop <elendil@...>
Date: Friday, May 2, 2008 - 4:34 am

So you're saying reverting e22ecef1 fixes the issue? - as have others.

The initial rationale for the change was: vruntime and walltime are
related by a factor of rq->weight, hence compensate for that.

However that in itself has a flaw, we're comparing vruntime to process
runtime time, _NOT_ walltime. And vruntime and process runtime do not
have that factor so the compensation is unneeded.

However that still leaves me puzzled on why it would cause such bad
regression. The current code reads like:

/*
* delta *= w / rw
*/
static inline unsigned long
calc_delta_weight(unsigned long delta, struct sched_entity *se);

/* sleeps upto a single latency don't count. */
if (sched_feat(NEW_FAIR_SLEEPERS)) {
if (sched_feat(NORMALIZED_SLEEPER))
vruntime -= calc_delta_weight(sysctl_sched_latency, se);
else
vruntime -= sysctl_sched_latency;
}

So what we do is _shrink_ the bonus the busier we get (the more tasks we
get, the higher the total runqueue weight (rw) hence delta will get
smaller)

The thing is, I once asked Frans to test !NEW_FAIR_SLEEPERS and he
reported that that also solves the problem. So no bonus is good, and a
fixed bonus is good, but a variable bonus that is between these two
values is not.

Parag, would you also test with !NEW_FAIR_SLEEPERS to see if that solves
your problem?

The easiest way to disable it is (assumes you have debugfs mounted
at /debug):

# echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features

You can also disable NORMALIZED_SLEEPER that way (of course you would
first have to enable NEW_FAIR_SLEEPERS again):

# echo NO_NORMALIZED_SLEEPER > /debug/sched_features

You can get a current status of all the flags using:

# cat /debug/sched_features

So by default we have both enabled; could you report if either

NO_NEW_FAIR_SLEEPERS

NEW_FAIR_SLEEPERS + NO_NORMALIZED_SLEEPERS

works for you?

Also, could you apply this patch, and report the bonus_max value for
your music...

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Mike Galbraith <efault@...>
Date: Wednesday, May 7, 2008 - 4:26 am

What's the current status here?
With mainline v2.6.26-rc1-110-ga153063 (which has latest sched patches), I
still get very bad audio skips with GROUP_SCHED enabled.

Cheers,
FJP
--

To: Frans Pop <elendil@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Parag Warudkar <parag.warudkar@...>, LKML <linux-kernel@...>, Mike Galbraith <efault@...>
Date: Wednesday, May 7, 2008 - 4:32 am

there's no solution for now, the latency targets are still not working
under the group scheduler. Upstream we've made it non-default and marked
it EXPERIMENTAL.

Peter has a new updated group scheduler queue that i'm testing right
now, perhaps it can be pushed out to sched-devel/latest later today. But
that is v2.6.27 material.

Ingo
--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Frans Pop <elendil@...>, Mike Galbraith <efault@...>
Date: Friday, May 2, 2008 - 7:10 am

I would say no - the audio still is unusable - slightly better with
NO_NEW_FAIR_SLEEPERS, no difference with the other.

But per Mike's suggestion if I disable CONFIG_GROUP_SCHED ,
CONFIG_FAIR_GROUP_SCHED and CONFIG_USER_SCHED, even with
NEW_FAIR_SLEEPERS audio is good again, which does not change with
NO_NEW_FAIR_SLEEPERS. (No skips for long time under -j8 - still skips
a _very_ tiny bit - noticeable if I listen too carefully - but we can
ignore that for the time being.)

Mike - the GROUP_SCHED stuff is set to y in the Ubuntu kernel and I
have no audio skips - may be a regression introduced after 2.6.24?

I see Frans already reported that and there was some conclusion - let
me know if more data will help.

Thanks!

Parag
--

To: Parag Warudkar <parag.warudkar@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Frans Pop <elendil@...>
Date: Friday, May 2, 2008 - 8:09 am

Hm. I've stumbled across a regression that may be related. Can you
please try the attached with group scheduling enabled in current git?

-Mike

To: Mike Galbraith <efault@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Frans Pop <elendil@...>
Date: Friday, May 2, 2008 - 8:21 am

In file included from kernel/sched.c:1902:
kernel/sched_fair.c: In function 'print_cfs_rq_tasks':
kernel/sched_fair.c:1602: error: implicit declaration of function
'calc_delta_weight'
kernel/sched_fair.c:1603: warning: format '%lu' expects type 'long
unsigned int', but argument 5 has type 'int'
make[1]: *** [kernel/sched.o] Error 1
make: *** [kernel] Error 2

'calc_delta_weight' or a replacement not found easily to fix it up!

Thanks

Parag
--

To: Parag Warudkar <parag.warudkar@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Frans Pop <elendil@...>
Date: Friday, May 2, 2008 - 8:37 am

Oops, I have an unrelated patch in my tree which you also need.

-Mike

To: Mike Galbraith <efault@...>
Cc: Parag Warudkar <parag.warudkar@...>, Peter Zijlstra <a.p.zijlstra@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Saturday, May 3, 2008 - 3:13 am

Based on Mike's mails I have quickly tried reverting the following two
commits to see if that would fix the GROUP_SCHED related regression, but I
still get a merge error with the second one in sched_fair.c:
$ git revert 7ba2e74ab5
$ git revert 8f1bc385cf

I don't feel comfortable solving the conflict myself as I'd have absolutely
no idea what I'd be doing and it all seems to require a lot of precision.

As other suggested patches don't seem ready for testing yet, I'll just wait
for further instructions :-)

Cheers,
FJP
--

To: Frans Pop <elendil@...>
Cc: Parag Warudkar <parag.warudkar@...>, Peter Zijlstra <a.p.zijlstra@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Saturday, May 3, 2008 - 3:39 am

You probably shouldn't bother. GROUP_SCHED needs Peter's EEVDF help
regardless of whether 8f1bc385cf is playing a significant role or not.

-Mike

--

To: Parag Warudkar <parag.warudkar@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Frans Pop <elendil@...>
Date: Friday, May 2, 2008 - 11:02 am

I see that Peter has a fix, so don't bother.

-Mike

--

To: Mike Galbraith <efault@...>
Cc: Parag Warudkar <parag.warudkar@...>, Peter Zijlstra <a.p.zijlstra@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 11:49 am

Maybe not. It looks like Parag may have been seeing a combination of two
different issues: the one that Peter fixed and one related to group
scheduling.

As I had group scheduling disabled, I only saw the first one. But my audio
skips were a lot less severe than Parag described. And disabling group
scheduling helped solve the problem for him.

So it may well still be worth following up on this.

Cheers,
FJP
--

To: Frans Pop <elendil@...>
Cc: Parag Warudkar <parag.warudkar@...>, Peter Zijlstra <a.p.zijlstra@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 3:38 pm

There are definitely issues with that commit, but I'm not at all sure
they manifest themselves with this workload in any heavy-weight way. My
thought was only that it's negative effects could be compounded by group
scheduling.

-Mike

--

To: Mike Galbraith <efault@...>
Cc: Parag Warudkar <parag.warudkar@...>, Peter Zijlstra <a.p.zijlstra@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 2:53 pm

I have just tested a kernel with GROUP_SCHED enabled (with Peter's patch
still included) and can confirm that enabling group scheduling causes
serious latencies.
Skips are much more frequent than with the issue that was just solved.

latencytop shows (different samples)
Scheduler: waiting for cpu 537.7 msec
Scheduler: waiting for cpu 172.0 msec
Scheduler: waiting for cpu 414.6 msec
Scheduler: waiting for cpu 455.7 msec
Scheduler: waiting for cpu 446.5 msec

I think I also see why....

# grep bonus_max `grep -l amarokapp /proc/*/task/*/sched`
/proc/4725/task/20645/sched:se.bonus_max : 122640.203776
/proc/4725/task/4725/sched:se.bonus_max : 122640.203776
/proc/4725/task/4793/sched:se.bonus_max : 41902109.884416
/proc/4725/task/4797/sched:se.bonus_max : 41902109.884416
/proc/4725/task/4798/sched:se.bonus_max : 41902109.884416
/proc/4725/task/4799/sched:se.bonus_max : 40880.046080
/proc/4725/task/4800/sched:se.bonus_max : 19.970683
--

To: Frans Pop <elendil@...>
Cc: Parag Warudkar <parag.warudkar@...>, Peter Zijlstra <a.p.zijlstra@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 3:27 pm

Hm. I enabled group scheduling, with Peter's patch + revert of the
commit that is giving me grief + the below, and it seems to work ok
here. Definitely no sound skips even under quite hefty load.

The below _seems_ to make a difference stand-alone, but subjective
things like 'lurches' require _very_ much testing... you tend to see
what you want to see with such things. As you can see, I thought about
submitting it, but I have way too much subjective poison in my system
from testing this and that while reading source.

Fix undesirable rq.clock update noops.

Signed-off-by: Mike Galbraith <efault@gmx.de>

Index: linux-2.6.26.git/kernel/sched.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched.c
+++ linux-2.6.26.git/kernel/sched.c
@@ -668,9 +668,6 @@ static void __update_rq_clock(struct rq
s64 delta = now - prev_raw;
u64 clock = rq->clock;

-#ifdef CONFIG_SCHED_DEBUG
- WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
-#endif
/*
* Protect against sched_clock() occasionally going backwards:
*/
@@ -3009,8 +3006,8 @@ static void double_rq_lock(struct rq *rq
spin_lock(&rq1->lock);
}
}
- update_rq_clock(rq1);
- update_rq_clock(rq2);
+ __update_rq_clock(rq1);
+ __update_rq_clock(rq2);
}

/*
@@ -3860,7 +3857,7 @@ redo:
/* Attempt to move tasks */
double_lock_balance(this_rq, busiest);
/* this_rq->clock is already updated */
- update_rq_clock(busiest);
+ __update_rq_clock(busiest);
ld_moved = move_tasks(this_rq, this_cpu, busiest,
imbalance, sd, CPU_NEWLY_IDLE,
&all_pinned);
@@ -3959,8 +3956,8 @@ static void active_load_balance(struct r

/* move a task from busiest_rq to target_rq */
double_lock_balance(busiest_rq, target_rq);
- update_rq_clock(busiest_rq);
- update_rq_clock(target_rq);
+ __update_rq_clock(busiest_rq);
+ __update_rq_clock(target_rq);

/* Search for an sd spanning us and the target CPU. */
for_each_domain(ta...

To: Mike Galbraith <efault@...>
Cc: Frans Pop <elendil@...>, Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Guillaume Chazarain <guichaz@...>, Andi Kleen <andi@...>
Date: Friday, May 2, 2008 - 3:56 pm

On Fri, 2008-05-02 at 21:27 +0200, Mike Galbraith wrote:

That's a quite horrible patch for some of us, but I think I know why you
did that, and I think your fancy Q6600 has a stable enough TSC to pull
it off.

Ok, the the below would need something that relates tick_timestamp's to
one another.. probably sucks without it..

OTOH, Andi said he was working on a fastish global sched_clock() thingy,
Andi got a link to that code?

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -560,6 +560,7 @@ struct rq {
unsigned long next_balance;
struct mm_struct *prev_mm;

+ spinlock_t clock_lock;
u64 clock, prev_clock_raw;
s64 clock_max_delta;

@@ -657,20 +658,25 @@ static inline void update_last_tick_seen
}
#endif

+#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
+#define this_rq() (&__get_cpu_var(runqueues))
+#define task_rq(p) cpu_rq(task_cpu(p))
+#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
+
/*
* Update the per-runqueue clock, as finegrained as the platform can give
* us, but without assuming monotonicity, etc.:
*/
-static void __update_rq_clock(struct rq *rq)
+static void ___update_rq_clock(struct rq *rq, u64 now)
{
u64 prev_raw = rq->prev_clock_raw;
- u64 now = sched_clock();
s64 delta = now - prev_raw;
u64 clock = rq->clock;

#ifdef CONFIG_SCHED_DEBUG
WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
#endif
+ spin_lock(&rq->clock_lock);
/*
* Protect against sched_clock() occasionally going backwards:
*/
@@ -699,12 +705,31 @@ static void __update_rq_clock(struct rq

rq->prev_clock_raw = now;
rq->clock = clock;
+ spin_unlock(&rq->clock_lock);
+}
+
+static void __update_rq_clock(struct rq *rq)
+{
+ ___update_rq_clock(rq, sched_clock());
+}
+
+static void __update_remote_rq_clock(struct rq *rq)
+{
+ u64 now = sched_clock();
+ struct rq *this_rq = this_rq...

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Frans Pop <elendil@...>, Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Guillaume Chazarain <guichaz@...>, Andi Kleen <andi@...>
Date: Saturday, May 3, 2008 - 2:20 am

Oh, yeah, that's right. Bah, let them eat cake ;-)

-Mike

--

To: <a.p.zijlstra@...>
Cc: <efault@...>, <elendil@...>, <parag.warudkar@...>, <mingo@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Friday, May 2, 2008 - 5:48 pm

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

While I'm fine with this kind of stuff being added to constantly cope
with x86's joke of a TSC register implementation, it's starting to
become an enormous burdon for platforms where the TICK source actually
works properly.

Heck, on my Niagara2 chips, all 64 cpus use the same physical register
for the TICK source, it physically can't get desynchronized :-)

So, a way to turn all of this muck off would be much appreciated.
I'm happy to test anything on sparc64, and I'm sure the powerpc
folks are as well.

And I also heard a rumor that Peter has access to a machine with a
stable tick source for testing :-)
--

To: David Miller <davem@...>
Cc: <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <mingo@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Friday, May 2, 2008 - 6:09 am

On Fri, 02 May 2008 14:48:27 -0700 (PDT)

it's a sad affair indeed. On some systems it counts cycles, on other
systems it counts time. On most systems it stops while idle, on others
it keeps running. On most systems its not very synchronized between
packages, and on some systems it's not even synchronized between cores.

I'm not convinced TSC is the right thing for the scheduler in the first
place; on current x86 systems TSC counts "time" not "work done". Now of
course "time" is an approximation for "work done", but not a very good
one given the presence of what effectively is variable cpu speeds
(software CPU frequency control is only part of that; there's also the
finegrained hardware level frequency control as done by what marketing
people call "Intel Dynamic Acceleration technology"). [*]

I and others have talked to Peter about this already, and I'm sure we'll
talk more about it in the future as well.. at some point this part of
CFS needs to fundamentally be cleaned up. Since this gets into a debate
about what fairness means ;(

[*] The converse is also true; cycles aren't a good representation of
time either; this makes cycle based profilers a bit iffy if you're
interested in where the system spends time rather than where it spends
cycles.
--

To: Arjan van de Ven <arjan@...>
Cc: David Miller <davem@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <mingo@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Sunday, May 4, 2008 - 8:12 am

My current view on this is that per cpu scheduling should use a time
based clock, whereas smp load balancing can take the cycle (ie work)
counter to balance different work/time loads and estimate core power.

As for using the TSC, I'm afraid we just don't have any choice. Although
I guess we could dynamically detect some counter on new cpus and use
that when available. But the TSC is the only thing available on a lot of
existing machines.

--

To: David Miller <davem@...>
Cc: <efault@...>, <elendil@...>, <parag.warudkar@...>, <mingo@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Friday, May 2, 2008 - 6:07 pm

Yeah, I was thinking of a way for architectures to signify how much help
they need building the various clocks. Sparc, power and s390 would not
need any help.

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: David Miller <davem@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 4:28 am

there's already such a mechanism in sched-devel.git (and has been there
for a week or two): an architecture can set time_sync_thresh to -1LL
during early bootup and essentially disable all the synchronization
logic.

Ingo
--

To: <mingo@...>
Cc: <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 5:05 am

From: Ingo Molnar <mingo@elte.hu>

Does it remove all of the code too? :-)

Please give us a config boolean. The only platform for which
a run-time knob is even necessary is x86.

Thanks.
--

To: David Miller <davem@...>
Cc: <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 6:10 am

yeah, will try something like that too.

the thing is, core kernel folks have to resist such pressures
_somewhat_.

Architecture maintainers will put pressure on us from two directions: if
a particular generic feature only helps _another_ architecture, they
find it a nuisance and want it to be gone as much as possible.

If a particular feature helps them, they want it supported and
default-enabled as much as possible. In fact, complaints come if a
generic-looking feature shows up in one architecture only. (recent
example: inlining optimizations ;-)

And you are totally right about sched_clock() being dead on accurate an
globally synchronous on sparc64 - and you are right to find _any_ issue
about it a nuisance. I totally envy you that sparc64's sched_clock() is
so simple - it should have been like that on x86 years ago, if hw
designers werent so impotent about it.

( although please note that the growing generalization that goes on
_did_ find a subtle nohz problem on sparc64 early in the merge window,
so it's not like these changes are totally useless to you. )

but it all goes in the other direction as well: many folks find
endianness problems a nuisance on x86, many folks find TLB and explicit
cache coherence complications a nuisance on x86 as well. The bus-to-phys
complication which is an identity on x86. Etc. etc.

But the core kernel is a conscious and intelligent union of all
architecture's needs, and often we maintain complications even if they
make no sense on the most popular platform. I think it makes strategic
sense because it keep the kernel truly generic and truly clean. It also
keeps architectures honest: even today the x86 architecture is still not
as clean as sparc64 for example.

so please be patient, we are working on it :)

Ingo
--

To: <mingo@...>
Cc: <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 3:28 pm

From: Ingo Molnar <mingo@elte.hu>

And you guys have IOMMUs now, aren't you glad all of this
nice DMA API infrastructure exists already and has been tested
for years? :-))))
--

To: David Miller <davem@...>
Cc: <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 3:39 pm

yes :) But i hope we'll never have to flush caches by hand or change
endianness ;-)

Ingo
--

To: <mingo@...>
Cc: <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 3:27 pm

From: Ingo Molnar <mingo@elte.hu>

Note that this bug theoretically exists on every platform,
even x86, and only sparc64 is fixed at the moment by adding
the irq_{entry,exit}() guards around all of it's IPI handlers.

Just a reminder.
--

To: David Miller <davem@...>
Cc: <mingo@...>, <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 6:30 pm

What is the bug ?

IPIs on power are normal interrupts, so they do happen in irq_entry/exit
blocks but I'm curious to know what the root bug is :-)

Ben.

--

To: Benjamin Herrenschmidt <benh@...>
Cc: David Miller <davem@...>, <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 6:38 pm

on nohz we still keep jiffies uptodate - despite there not being an
explicit 'keep jiffies uptodate' tick interrupt anymore. So on every
irq_enter() we roll jiffies forward - if needed - and thus emulate
jiffies behavior to drivers and core kernel code, etc.

if an IPI on Power does not do an irq_enter() then you might miss out on
updated jiffies. That might not matter for most jiffies, but you might
also miss out on the 'touch the softlockup watchdog because we just woke
from idle' action. This is what triggered the false positive warnings on
Sparc64.

the same bug existed on x86 too: that too does a few IPIs without
irq_enter/irq_exit. We now removed the softlockup dependency so it
should not be required to do an irq_enter()/exit anymore - unless the
code that the IPI uses accesses jiffies. (but that would be unusual)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: David Miller <davem@...>, <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 10:22 pm

Allright, so it looks like we don't have a problem today, though I'll be
careful if some of our CPUs come up with more fancy ways to do IPIs.
Currently they are just normal interrupts, so we do irq_enter/exit and
we process softirqs on the way out.

Ben.

--

To: <mingo@...>
Cc: <benh@...>, <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 7:04 pm

From: Ingo Molnar <mingo@elte.hu>

That move doesn't solve the problem, I'm strongly certain
arch's still need to add the irq_{enter,exit}().

I'll revert the irq_{enter,exit}() changes on one of my
sparc64 boxes to validate this later today.
--

To: <mingo@...>
Cc: <benh@...>, <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 7:36 pm

From: David Miller <davem@davemloft.net>

False alarm, which is good news, thanks Ingo. I can revert
the following it seems:

commit 2664ef44cf5053d2b7dff01cecac70bc601a5f68
Author: David S. Miller <davem@davemloft.net>
Date: Fri Apr 25 03:11:37 2008 -0700

[SPARC64]: Wrap SMP IPIs with irq_enter()/irq_exit().

Otherwise all sorts of bad things can happen, including
spurious softlockup reports.

Other platforms have this same bug, in one form or
another, just don't see the issue because they
don't sleep as long as sparc64 can in NOHZ.

Thanks to some brilliant debugging by Peter Zijlstra.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/arch/sparc64/kernel/smp.c b/arch/sparc64/kernel/smp.c
index 524b889..409dd71 100644
--- a/arch/sparc64/kernel/smp.c
+++ b/arch/sparc64/kernel/smp.c
@@ -866,14 +866,21 @@ void smp_call_function_client(int irq, struct pt_regs *regs)
void *info = call_data->info;

clear_softint(1 << irq);
+
+ irq_enter();
+
+ if (!call_data->wait) {
+ /* let initiator proceed after getting data */
+ atomic_inc(&call_data->finished);
+ }
+
+ func(info);
+
+ irq_exit();
+
if (call_data->wait) {
/* let initiator proceed only after completion */
- func(info);
atomic_inc(&call_data->finished);
- } else {
- /* let initiator proceed after getting data */
- atomic_inc(&call_data->finished);
- func(info);
}
}

@@ -1032,7 +1039,9 @@ void smp_receive_signal(int cpu)

void smp_receive_signal_client(int irq, struct pt_regs *regs)
{
+ irq_enter();
clear_softint(1 << irq);
+ irq_exit();
}

void smp_new_mmu_context_version_client(int irq, struct pt_regs *regs)
@@ -1040,6 +1049,8 @@ void smp_new_mmu_context_version_client(int irq, struct pt_regs *regs)
struct mm_struct *mm;
unsigned long flags;

+ irq_enter();
+
clear_softint(1 << irq);

/* See if we need to allocate a new ...

To: David Miller <davem@...>
Cc: <benh@...>, <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 7:38 pm

ah, good! It really looks like unnecessary overhead for the "simple
IPIs". But ... i'm wondering ... what about softirq processing? Do these
IPIs process softirqs on the way out? In that case the non-processed
jiffies might be a problem.

it's all a bit messy. I wish we could start turning jiffies into a
function (which would just read GTOD and estimate jiffies from there),
but i fear we are not there yet ...

Ingo
--

To: <mingo@...>
Cc: <benh@...>, <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 7:40 pm

From: Ingo Molnar <mingo@elte.hu>

I think Peter Z. made a good point that we probably need to
keep it there for smp_call_function() receivers, so that
the called functions can test the context they are in accurately.
--

To: David Miller <davem@...>
Cc: <benh@...>, <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 7:47 pm

hm, good point, i missed that aspect.

Ingo
--

To: David Miller <davem@...>
Cc: <a.p.zijlstra@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <linux-kernel@...>, <guichaz@...>, <andi@...>
Date: Saturday, May 3, 2008 - 3:37 pm

yeah, indeed - it's masked on x86 by the suckiness of most clockevents
hw timers. We dont get unlimited sleep time like sparc64 does :-/

Ingo
--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Mike Galbraith <efault@...>, Frans Pop <elendil@...>, Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Andi Kleen <andi@...>
Date: Friday, May 2, 2008 - 4:38 pm

Does this obsolete Ingo's "sched: make cpu_clock() globally synchronous"?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...

My laptop is UP, so I cannot test this :-(

--
Guillaume
--

To: Guillaume Chazarain <guichaz@...>
Cc: Mike Galbraith <efault@...>, Frans Pop <elendil@...>, Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Andi Kleen <andi@...>
Date: Friday, May 2, 2008 - 4:46 pm

It doesn't, I'm just throwing around code/ideas.

We should probably come up with some clock lib that builds the various
clocks: rq->clock and cpu_clock() from sched_clock() depending on some
architecture supplied hints.

We can toss most of this code on archs like power64, sparc64 and s390x
which have stable clocks. It just seems wrong to do all this on these
archs, but don't have much choice on x86.

--

To: <a.p.zijlstra@...>
Cc: <guichaz@...>, <efault@...>, <elendil@...>, <parag.warudkar@...>, <mingo@...>, <linux-kernel@...>, <andi@...>
Date: Friday, May 2, 2008 - 6:00 pm

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

All of this scheduler runqueue clock business is probably solving
the x86 TSC issue on the wrong level.

I suggest having a char with Andi Kleen about these issues, he's
had some good ideas in the past which were ignored for whatever
reasons...
--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 6:32 am

I've restarted amarok and the glibc build each time after change to
sched_features.

# cat sched_features
NEW_FAIR_SLEEPERS WAKEUP_PREEMPT START_DEBIT AFFINE_WAKEUPS CACHE_HOT_BUDDY
SYNC_WAKEUPS HRTICK NO_DOUBLE_TICK NORMALIZED_SLEEPER DEADLINE

=> skips in music (amarok)

/proc/20734/task/20734/sched:se.bonus_max : 40960.000000
/proc/20734/task/20775/sched:se.bonus_max : 40960.000000
/proc/20734/task/20779/sched:se.bonus_max : 40960.000000
/proc/20734/task/20780/sched:se.bonus_max : 40960.000000
/proc/20734/task/20781/sched:se.bonus_max : 40960.000000
/proc/20734/task/20782/sched:se.bonus_max : 39.960966
/proc/20734/task/20799/sched:se.bonus_max : 40960.000000

# echo NO_NEW_FAIR_SLEEPERS >sched_features

=> no skips in music

/proc/26357/task/26357/sched:se.bonus_max : 0.000000
/proc/26357/task/26362/sched:se.bonus_max : 0.000000
/proc/26357/task/26366/sched:se.bonus_max : 0.000000
/proc/26357/task/26367/sched:se.bonus_max : 0.000000
/proc/26357/task/26368/sched:se.bonus_max : 0.000000
/proc/26357/task/26369/sched:se.bonus_max : 0.000000
/proc/26357/task/9705/sched:se.bonus_max : 0.000000

# echo NEW_FAIR_SLEEPERS >sched_features
# echo NO_NORMALIZED_SLEEPER >sched_features

=> no skips in music

/proc/13352/task/13352/sched:se.bonus_max : 0.000000
/proc/13352/task/13357/sched:se.bonus_max : 0.000000
/proc/13352/task/13361/sched:se.bonus_max : 0.000000
/proc/13352/task/13362/sched:se.bonus_max : 0.000000
/proc/13352/task/13363/sched:se.bonus_max : 0.000000
/proc/13352/task/13364/sched:se.bonus_max : 0.000000
/proc/13352/task/27595/sched:se.bonus_max : 0.000000

Cheers,
FJP
--

To: Frans Pop <elendil@...>
Cc: Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 6:35 am

Ok, that's _WAY_ too big, something went seriously wrong. Thanks!

--

To: Frans Pop <elendil@...>
Cc: Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 7:08 am

In fact, that is exactly 1024 time too large.

How does this work for you:

Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -685,6 +685,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
+ account_entity_enqueue(cfs_rq, se);

if (wakeup) {
place_entity(cfs_rq, se, 0);
@@ -695,7 +696,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
check_spread(cfs_rq, se);
if (se != cfs_rq->curr)
__enqueue_entity(cfs_rq, se);
- account_entity_enqueue(cfs_rq, se);
}

static void update_avg(u64 *avg, u64 sample)

--

To: Peter Zijlstra <peterz@...>
Cc: Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 7:37 am

Changes nothing. Values of bonus_max are unchanged and still skips.
--

To: Frans Pop <elendil@...>
Cc: Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 7:39 am

Humm, I saw some bonus_max spikes on my laptop which were fixed by that
patchlet, ah well, the hunt goes on.

Thanks!

--

To: Peter Zijlstra <peterz@...>
Cc: Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 8:06 am

Looks like you've got it after all! With the patch in the correct place
there are no more skips and I now get:

/proc/5182/task/5182/sched:se.bonus_max : 39.960966
/proc/5182/task/5263/sched:se.bonus_max : 39.960966
/proc/5182/task/5267/sched:se.bonus_max : 39.960966
/proc/5182/task/5268/sched:se.bonus_max : 39.960966
/proc/5182/task/5269/sched:se.bonus_max : 39.960966
/proc/5182/task/5270/sched:se.bonus_max : 19.990234
/proc/5182/task/5661/sched:se.bonus_max : 39.960966

Note (mostly for Parag) that I had GROUP_SCHED disabled for these tests, but
for me that did make not any difference with my 2.6.25-rc tests for this
issue, nor with my tests using sched-devel earlier in the .26 merge window.

Feel free to add my
Tested-by: Frans Pop <elendil@planet.nl>

Cheers,
FJP
--

To: Frans Pop <elendil@...>
Cc: Peter Zijlstra <peterz@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 8:22 am

Strange - without the _enqueue patch from Peter, stutter is reduced to
an almost acceptable level for me if I just disable GROUP_SCHED.
With Peter's patch - I have tested multiple times and the stutter is
completely eliminated. So +1 for Peter's patch.

Mike recalls a regression that may be related - so it is possible you
were running w/o the affecting change.

Anyway - all is well in sched-land for now. Rejoice!

Parag
--

To: Parag Warudkar <parag.warudkar@...>
Cc: Frans Pop <elendil@...>, Peter Zijlstra <peterz@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 9:21 am

Not really, we can do better. I still see stutters and noticable delays
in keypresses making it to the console even with Peter's patches.

/me tries to fix this one :-/

Peter, any more patches to try out?

thanks,
--
regards,
Dhaval
--

To: Peter Zijlstra <peterz@...>
Cc: Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 7:45 am

Oh! Hang on. I just checked and saw that I made the change in
dequeue_entity() instead of enqueue_entity(). Let me try again.

For code symmetry I guess both should probably be changed?

Sorry for the mistake.
--

To: Frans Pop <elendil@...>
Cc: Parag Warudkar <parag.warudkar@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>
Date: Friday, May 2, 2008 - 7:51 am

Symetry would actually make it something like:

enqueue()
A
B
C

dequeu()
C
B
A

But I just quickly checked the code and I can't see any ill effect of
mis-placing that code in dequeue() too.

--

Previous thread: [PATCH] use unsigned long for zoran_driver by Steven Rostedt on Thursday, May 1, 2008 - 8:04 pm. (1 message)

Next thread: Re: huge gcc 4.1.{0,1} __weak problem by Chris Knadle on Thursday, May 1, 2008 - 7:55 pm. (10 messages)