[RFC][PATCH] sched_clock_cpu()

Previous thread: congratulations winner, by James Johnson on Saturday, May 3, 2008 - 9:27 am. (1 message)

Next thread: [patch] video: build fix for drivers/media/video/mt9v022.c by Ingo Molnar on Saturday, May 3, 2008 - 9:30 am. (6 messages)
From: Peter Zijlstra
Date: Saturday, May 3, 2008 - 9:29 am

Hi,

This is my current proposal to replace the rq->clock stuff (and possibly
cpu_clock()).

it _DOESN'T_ boot ;-/ and I seem to have caught a flu that makes my
whole body hurt like hell, so I'm not getting anything done.

Brain dump before I crash:

 - architectures that have a 'perfect' hardware clock can set
     CONFIG_HAVE_STABLE_CLOCK

 - the 'jiffie' window might be superfulous when we update tick_gtod
   before the __update_sched_clock() call in sched_clock_tick()

 - cpu_clock() might be implemented as:

     sched_clock_cpu(smp_processor_id())

   if the accuracy proves good enough - how far can TSC drift in a
   single jiffie when considering the filtering and idle hooks?

 - what other architectures besides x86 would need this?

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |   29 ++++++
 init/main.c           |    1 
 kernel/Makefile       |    2 
 kernel/sched.c        |  163 ++----------------------------------
 kernel/sched_clock.c  |  221 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_debug.c  |    7 -
 kernel/sched_fair.c   |    2 
 7 files changed, 266 insertions(+), 159 deletions(-)

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -75,16 +75,6 @@
 #include <asm/irq_regs.h>
 
 /*
- * Scheduler clock - returns current time in nanosec units.
- * This is default implementation.
- * Architectures and sub-architectures can override this.
- */
-unsigned long long __attribute__((weak)) sched_clock(void)
-{
-	return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
-}
-
-/*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
  * and back.
@@ -560,13 +550,7 @@ struct rq {
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
 
-	u64 clock, prev_clock_raw;
-	s64 clock_max_delta;
-
-	unsigned int clock_warps, ...
From: Ingo Molnar
Date: Saturday, May 3, 2008 - 9:37 am

it 'can' be very bad - so we have to assume it's random and fall back to 
jiffies quality in that case. In practice on most x86 CPUs it wont drift 
that far - even drifting TSCs drift minimally (there are AMD 
Athlon64/Opteron CPUs where a CPU in HLT will cause the clock and the 
TSC to drift a bit) - and stopping TSCs will just stop.

But in terms of BIOSes trying to fix things up and in terms of cpufreq 
artifacts (the CPU's clock itself going through transients) anything can 

we rarely get any interactivity reports from anything non-x86 so i doubt 
it truly matters on anything but x86. If it _breaks_ in terms of 
crashing or locking we do hear from other architectures ;-)

	Ingo
--

From: Ingo Molnar
Date: Saturday, May 3, 2008 - 9:55 am

FYI, the merged version against sched-devel.git does boot fine here and 
the clock does seem to advance as expected.

	Ingo
--

From: Dhaval Giani
Date: Saturday, May 3, 2008 - 10:30 am

-- 
regards,
Dhaval
--

From: David Miller
Date: Saturday, May 3, 2008 - 9:16 pm

From: Dhaval Giani <dhaval@linux.vnet.ibm.com>

If you tested on powerpc (just guessing), did you set
HAVE_STABLE_CLOCK in arch/powerpc/Kconfig?  That's
what you'll need to do to take advantage of the
new code.
--

From: Dhaval Giani
Date: Saturday, May 3, 2008 - 9:30 pm

I was testing on x86. Let me boot it on a powerpc box, and give feedback
(I was more interested in getting it boot the first time and it was
already late :) )

-- 
regards,
Dhaval
--

From: Dhaval Giani
Date: Saturday, May 3, 2008 - 9:38 pm

Unrelated, it hits this as well,


   21.964266] ------------[ cut here ]------------
[   21.978124] WARNING: at include/linux/blkdev.h:431 blk_queue_init_tags+0x114/0x158()
[   22.013319] Modules linked in:
[   22.022518] Pid: 1, comm: swapper Not tainted 2.6.25 #1
[   22.042741]  [<c012601f>] warn_on_slowpath+0x41/0x5f
[   22.057667]  [<c012e2dc>] ? run_timer_softirq+0x158/0x160
[   22.073948]  [<c017ade7>] ? cache_alloc_refill+0x17f/0x1de
[   22.090485]  [<c014119a>] ? __lock_release+0x1e/0x51
[   22.105463]  [<c0179c47>] ? check_poison_obj+0x2a/0x17b
[   22.121219]  [<c017ade7>] ? cache_alloc_refill+0x17f/0x1de
[   22.137760]  [<c0179aa4>] ? poison_obj+0x1e/0x3b
[   22.151698]  [<c0179650>] ? dbg_redzone1+0x15/0x1c
[   22.166157]  [<c017af76>] ? cache_alloc_debugcheck_after+0x130/0x150
[   22.187398]  [<c017b3c5>] ? __kmalloc+0x100/0x122
[   22.201595]  [<c023b103>] ? init_tag_map+0x64/0x8b
[   22.216055]  [<c023b103>] ? init_tag_map+0x64/0x8b
[   22.230510]  [<c023b154>] ? __blk_queue_init_tags+0x2a/0x4e
[   22.247311]  [<c023b29a>] blk_queue_init_tags+0x114/0x158
[   22.263537]  [<c032d6eb>] ahc_platform_set_tags+0x130/0x17e
[   22.280284]  [<c032d82d>] ahc_linux_device_queue_depth+0x7f/0xdf
[   22.300205]  [<c0315d41>] ? scsi_add_lun+0x1f8/0x326
[   22.316370]  [<c032cb2b>] ahc_linux_slave_configure+0x36/0x53
[   22.335450]  [<c0315e24>] scsi_add_lun+0x2db/0x326
[   22.349856]  [<c0315fe1>] scsi_probe_and_add_lun+0x172/0x204
[   22.367268]  [<c03166e0>] __scsi_scan_target+0x86/0xcc
[   22.382713]  [<c03167f9>] scsi_scan_channel+0x3f/0x5e
[   22.397898]  [<c0316890>] scsi_scan_host_selected+0x78/0xa9
[   22.414645]  [<c0316b5c>] do_scsi_scan_host+0x5a/0x63
[   22.429830]  [<c0316bb2>] scsi_scan_host+0x33/0x75
[   22.444235]  [<c032d298>] ahc_linux_register_host+0x190/0x19a
[   22.461503]  [<c0253982>] ? pci_get_slot+0x61/0x68
[   22.475961]  [<c0245a56>] ? kobject_put+0x3c/0x41
[   22.490158]  [<c02c388f>] ? put_device+0x11/0x13
[   22.504095]  [<c025377d>] ? ...
From: Ingo Molnar
Date: Saturday, May 3, 2008 - 10:01 am

that wont work very well when sched_clock() is called from within 
CONFIG_LOCK_STAT instrumentation. Does the patch below solve the boot 
problems for you?

	Ingo

-------------------->
Subject: sched: sched_clock() fix
From: Ingo Molnar <mingo@elte.hu>
Date: Sat May 03 18:41:11 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c       |    2 --
 kernel/sched_clock.c |   42 ++++++++++++++++++++++++------------------
 2 files changed, 24 insertions(+), 20 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1074,8 +1074,6 @@ static struct rq *this_rq_lock(void)
 	return rq;
 }
 
-	WARN_ON(!irqs_disabled());
-	WARN_ON(!irqs_disabled());
 static void __resched_task(struct task_struct *p, int tif_bit);
 
 static inline void resched_task(struct task_struct *p)
Index: linux/kernel/sched_clock.c
===================================================================
--- linux.orig/kernel/sched_clock.c
+++ linux/kernel/sched_clock.c
@@ -33,12 +33,18 @@
 #ifndef CONFIG_HAVE_STABLE_CLOCK
 
 struct sched_clock_data {
-	spinlock_t lock;
-	unsigned long prev_jiffies;
-	u64 prev_raw;
-	u64 tick_raw;
-	u64 tick_gtod;
-	u64 clock;
+	/*
+	 * Raw spinlock - this is a special case: this might be called
+	 * from within instrumentation code so we dont want to do any
+	 * instrumentation ourselves.
+	 */
+	raw_spinlock_t		lock;
+
+	unsigned long		prev_jiffies;
+	u64			prev_raw;
+	u64			tick_raw;
+	u64			tick_gtod;
+	u64			clock;
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_clock_data, sched_clock_data);
@@ -62,7 +68,7 @@ void sched_clock_init(void)
 	for_each_possible_cpu(cpu) {
 		struct sched_clock_data *scd = cpu_sdc(cpu);
 
-		spin_lock_init(&scd->lock);
+		scd->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
 		scd->prev_jiffies = jiffies;
 		scd->prev_raw = now;
 		scd->tick_raw = now;
@@ -116,11 +122,11 @@ ...
From: David Miller
Date: Saturday, May 3, 2008 - 9:26 pm

From: Ingo Molnar <mingo@elte.hu>

Also, no platform can set HAVE_STABLE_CLOCK until we instantiate it in
a Kconfig somewhere.  I've choosen to do it in kernel/Kconfig.hz and
here are the sparc/sparc64 bits as well, I've booted this up with
Peter's patch on my 64-cpu niagara2 box and done some basic testing.

It would be nice if a powerpc person could test the trivial
powerpc Kconfig patch.

Possibly this should be HAVE_UNSTABLE_CLOCK, then only one platform
needs to set it :-)

sparc: Instantiate and set HAVE_STABLE_CLOCK

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index d211fdb..c60f5d4 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -69,6 +69,7 @@ config SPARC
 	select HAVE_IDE
 	select HAVE_OPROFILE
 	select HAVE_ARCH_KGDB if !SMP
+	select HAVE_STABLE_CLOCK
 
 # Identify this as a Sparc32 build
 config SPARC32
diff --git a/arch/sparc64/Kconfig b/arch/sparc64/Kconfig
index eb36f3b..711d4b1 100644
--- a/arch/sparc64/Kconfig
+++ b/arch/sparc64/Kconfig
@@ -14,6 +14,7 @@ config SPARC64
 	select HAVE_IDE
 	select HAVE_LMB
 	select HAVE_ARCH_KGDB
+	select HAVE_STABLE_CLOCK
 
 config GENERIC_TIME
 	bool
diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 526128a..b88c82a 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -56,3 +56,6 @@ config HZ
 
 config SCHED_HRTICK
 	def_bool HIGH_RES_TIMERS && X86
+
+config HAVE_STABLE_CLOCK
+	boolean
--

From: Dhaval Giani
Date: Saturday, May 3, 2008 - 10:09 pm

On it. Let me get hold of a powerpc box here, and try it out.
-- 
regards,
Dhaval
--

From: Ingo Molnar
Date: Sunday, May 4, 2008 - 2:00 am

applied, thanks David.

right now this topic looks good in review and in testing but it is 
stalled on a bug: in overnight testing it triggered an ftrace self-test 
hang that i bisected down to that patch. While that doesnt affect 
mainline it's something that shows that the new sched_clock() code is 
not as widely usable as the old code - have to investigate that some 

heh, indeed :)

Initially i thought that it's better to first be safe, but this really 
will only affect x86 in practice, so ... i think we'll switch around the 
flag, turning this into a no-effort thing for everything but x86.

	Ingo
--

From: Ingo Molnar
Date: Monday, May 5, 2008 - 4:32 am

FYI, figured that out today and fixed it, so the topic has green lights 
again.

	Ingo
--

From: Tony Breeds
Date: Monday, May 5, 2008 - 2:47 pm

FWIW I tested the following trivial patch for powerpc.  While testing
this I discovered a i(related) ppc32 regressesion that I'm trying to
bisect/debug.


Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
---
Obviously on top of the merged patch + Dave's sparc patch

 arch/powerpc/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3934e26..0642dff 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -110,6 +110,7 @@ config PPC
 	select HAVE_KPROBES
 	select HAVE_KRETPROBES
 	select HAVE_LMB
+	select HAVE_STABLE_CLOCK
 
 config EARLY_PRINTK
 	bool

Yours Tony

  linux.conf.au    http://www.marchsouth.org/
  Jan 19 - 24 2009 The Australian Linux Technical Conference!

--

From: Andi Kleen
Date: Monday, May 5, 2008 - 5:06 am

Sorry Peter I think the whole concept is wrong. You're still
at the wrong level.

If you don't fix the underlying sched_clock the results will be always
GIGO (Garbage-In-Garbage-Out). And the best "filtering" will not help
with that if the incoming time is just wrong. And if you fix the
underlying sched_clocks you don't need all that filtering.

That is also why I don't like David's proposal of a Kconfig for this.
It's the wrong way too. Or rather that Kconfig should be always set.

Also I believe x86 sched_clock can be fixed that it will work
properly with some constraints: 
- monotonous per CPU only
If you think about it the concept of "jiffies only error" doesn't make sense because it means you can as well use jiffies if you really need inter CPU
stamps.

- not over deep sleep states on some CPUs. 
While modern x86 designs fixed that older ones still have that problem 
and it will be always there with us in the embedded world.

I believe both constraints are ok for scheduler purposes. You don't
need to measure idle tasks (and those are the only ones who
go into sleep states) and you can always do all calculations on
the run queue of a specific CPU only.

This means it is not suitable as a printk clock for once, but for
that jiffies is fine anyways.

The rewritten sched_clock I posted the URL to a few times fixes
some of the problems, not all (there was on non monotonity case
left with non p state invariant TSC, i believe it was fixable
though). I unfortunately cannot work on this, but maybe someone is 
interested in taking that code over.

[for old kernels]
ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt-current/patches/sched-clock
ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt-current/patches/sched-clock64

-Andi
--

Previous thread: congratulations winner, by James Johnson on Saturday, May 3, 2008 - 9:27 am. (1 message)

Next thread: [patch] video: build fix for drivers/media/video/mt9v022.c by Ingo Molnar on Saturday, May 3, 2008 - 9:30 am. (6 messages)