Hi,
This is my current proposal to replace the rq->clock stuff (and possibly
cpu_clock()).
it _DOESN'T_ boot ;-/ and I seem to have caught a flu that makes my
whole body hurt like hell, so I'm not getting anything done.
Brain dump before I crash:
- architectures that have a 'perfect' hardware clock can set
CONFIG_HAVE_STABLE_CLOCK
- the 'jiffie' window might be superfulous when we update tick_gtod
before the __update_sched_clock() call in sched_clock_tick()
- cpu_clock() might be implemented as:
sched_clock_cpu(smp_processor_id())
if the accuracy proves good enough - how far can TSC drift in a
single jiffie when considering the filtering and idle hooks?
- what other architectures besides x86 would need this?
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 29 ++++++
init/main.c | 1
kernel/Makefile | 2
kernel/sched.c | 163 ++----------------------------------
kernel/sched_clock.c | 221 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_debug.c | 7 -
kernel/sched_fair.c | 2
7 files changed, 266 insertions(+), 159 deletions(-)
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -75,16 +75,6 @@
#include <asm/irq_regs.h>
/*
- * Scheduler clock - returns current time in nanosec units.
- * This is default implementation.
- * Architectures and sub-architectures can override this.
- */
-unsigned long long __attribute__((weak)) sched_clock(void)
-{
- return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
-}
-
-/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
* and back.
@@ -560,13 +550,7 @@ struct rq {
unsigned long next_balance;
struct mm_struct *prev_mm;
- u64 clock, prev_clock_raw;
- s64 clock_max_delta;
-
- unsigned int clock_warps, ...it 'can' be very bad - so we have to assume it's random and fall back to jiffies quality in that case. In practice on most x86 CPUs it wont drift that far - even drifting TSCs drift minimally (there are AMD Athlon64/Opteron CPUs where a CPU in HLT will cause the clock and the TSC to drift a bit) - and stopping TSCs will just stop. But in terms of BIOSes trying to fix things up and in terms of cpufreq artifacts (the CPU's clock itself going through transients) anything can we rarely get any interactivity reports from anything non-x86 so i doubt it truly matters on anything but x86. If it _breaks_ in terms of crashing or locking we do hear from other architectures ;-) Ingo --
FYI, the merged version against sched-devel.git does boot fine here and the clock does seem to advance as expected. Ingo --
-- regards, Dhaval --
From: Dhaval Giani <dhaval@linux.vnet.ibm.com> If you tested on powerpc (just guessing), did you set HAVE_STABLE_CLOCK in arch/powerpc/Kconfig? That's what you'll need to do to take advantage of the new code. --
I was testing on x86. Let me boot it on a powerpc box, and give feedback (I was more interested in getting it boot the first time and it was already late :) ) -- regards, Dhaval --
Unrelated, it hits this as well, 21.964266] ------------[ cut here ]------------ [ 21.978124] WARNING: at include/linux/blkdev.h:431 blk_queue_init_tags+0x114/0x158() [ 22.013319] Modules linked in: [ 22.022518] Pid: 1, comm: swapper Not tainted 2.6.25 #1 [ 22.042741] [<c012601f>] warn_on_slowpath+0x41/0x5f [ 22.057667] [<c012e2dc>] ? run_timer_softirq+0x158/0x160 [ 22.073948] [<c017ade7>] ? cache_alloc_refill+0x17f/0x1de [ 22.090485] [<c014119a>] ? __lock_release+0x1e/0x51 [ 22.105463] [<c0179c47>] ? check_poison_obj+0x2a/0x17b [ 22.121219] [<c017ade7>] ? cache_alloc_refill+0x17f/0x1de [ 22.137760] [<c0179aa4>] ? poison_obj+0x1e/0x3b [ 22.151698] [<c0179650>] ? dbg_redzone1+0x15/0x1c [ 22.166157] [<c017af76>] ? cache_alloc_debugcheck_after+0x130/0x150 [ 22.187398] [<c017b3c5>] ? __kmalloc+0x100/0x122 [ 22.201595] [<c023b103>] ? init_tag_map+0x64/0x8b [ 22.216055] [<c023b103>] ? init_tag_map+0x64/0x8b [ 22.230510] [<c023b154>] ? __blk_queue_init_tags+0x2a/0x4e [ 22.247311] [<c023b29a>] blk_queue_init_tags+0x114/0x158 [ 22.263537] [<c032d6eb>] ahc_platform_set_tags+0x130/0x17e [ 22.280284] [<c032d82d>] ahc_linux_device_queue_depth+0x7f/0xdf [ 22.300205] [<c0315d41>] ? scsi_add_lun+0x1f8/0x326 [ 22.316370] [<c032cb2b>] ahc_linux_slave_configure+0x36/0x53 [ 22.335450] [<c0315e24>] scsi_add_lun+0x2db/0x326 [ 22.349856] [<c0315fe1>] scsi_probe_and_add_lun+0x172/0x204 [ 22.367268] [<c03166e0>] __scsi_scan_target+0x86/0xcc [ 22.382713] [<c03167f9>] scsi_scan_channel+0x3f/0x5e [ 22.397898] [<c0316890>] scsi_scan_host_selected+0x78/0xa9 [ 22.414645] [<c0316b5c>] do_scsi_scan_host+0x5a/0x63 [ 22.429830] [<c0316bb2>] scsi_scan_host+0x33/0x75 [ 22.444235] [<c032d298>] ahc_linux_register_host+0x190/0x19a [ 22.461503] [<c0253982>] ? pci_get_slot+0x61/0x68 [ 22.475961] [<c0245a56>] ? kobject_put+0x3c/0x41 [ 22.490158] [<c02c388f>] ? put_device+0x11/0x13 [ 22.504095] [<c025377d>] ? ...
that wont work very well when sched_clock() is called from within
CONFIG_LOCK_STAT instrumentation. Does the patch below solve the boot
problems for you?
Ingo
-------------------->
Subject: sched: sched_clock() fix
From: Ingo Molnar <mingo@elte.hu>
Date: Sat May 03 18:41:11 CEST 2008
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 2 --
kernel/sched_clock.c | 42 ++++++++++++++++++++++++------------------
2 files changed, 24 insertions(+), 20 deletions(-)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1074,8 +1074,6 @@ static struct rq *this_rq_lock(void)
return rq;
}
- WARN_ON(!irqs_disabled());
- WARN_ON(!irqs_disabled());
static void __resched_task(struct task_struct *p, int tif_bit);
static inline void resched_task(struct task_struct *p)
Index: linux/kernel/sched_clock.c
===================================================================
--- linux.orig/kernel/sched_clock.c
+++ linux/kernel/sched_clock.c
@@ -33,12 +33,18 @@
#ifndef CONFIG_HAVE_STABLE_CLOCK
struct sched_clock_data {
- spinlock_t lock;
- unsigned long prev_jiffies;
- u64 prev_raw;
- u64 tick_raw;
- u64 tick_gtod;
- u64 clock;
+ /*
+ * Raw spinlock - this is a special case: this might be called
+ * from within instrumentation code so we dont want to do any
+ * instrumentation ourselves.
+ */
+ raw_spinlock_t lock;
+
+ unsigned long prev_jiffies;
+ u64 prev_raw;
+ u64 tick_raw;
+ u64 tick_gtod;
+ u64 clock;
};
static DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_clock_data, sched_clock_data);
@@ -62,7 +68,7 @@ void sched_clock_init(void)
for_each_possible_cpu(cpu) {
struct sched_clock_data *scd = cpu_sdc(cpu);
- spin_lock_init(&scd->lock);
+ scd->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
scd->prev_jiffies = jiffies;
scd->prev_raw = now;
scd->tick_raw = now;
@@ -116,11 +122,11 @@ ...From: Ingo Molnar <mingo@elte.hu> Also, no platform can set HAVE_STABLE_CLOCK until we instantiate it in a Kconfig somewhere. I've choosen to do it in kernel/Kconfig.hz and here are the sparc/sparc64 bits as well, I've booted this up with Peter's patch on my 64-cpu niagara2 box and done some basic testing. It would be nice if a powerpc person could test the trivial powerpc Kconfig patch. Possibly this should be HAVE_UNSTABLE_CLOCK, then only one platform needs to set it :-) sparc: Instantiate and set HAVE_STABLE_CLOCK Signed-off-by: David S. Miller <davem@davemloft.net> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index d211fdb..c60f5d4 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -69,6 +69,7 @@ config SPARC select HAVE_IDE select HAVE_OPROFILE select HAVE_ARCH_KGDB if !SMP + select HAVE_STABLE_CLOCK # Identify this as a Sparc32 build config SPARC32 diff --git a/arch/sparc64/Kconfig b/arch/sparc64/Kconfig index eb36f3b..711d4b1 100644 --- a/arch/sparc64/Kconfig +++ b/arch/sparc64/Kconfig @@ -14,6 +14,7 @@ config SPARC64 select HAVE_IDE select HAVE_LMB select HAVE_ARCH_KGDB + select HAVE_STABLE_CLOCK config GENERIC_TIME bool diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz index 526128a..b88c82a 100644 --- a/kernel/Kconfig.hz +++ b/kernel/Kconfig.hz @@ -56,3 +56,6 @@ config HZ config SCHED_HRTICK def_bool HIGH_RES_TIMERS && X86 + +config HAVE_STABLE_CLOCK + boolean --
On it. Let me get hold of a powerpc box here, and try it out. -- regards, Dhaval --
applied, thanks David. right now this topic looks good in review and in testing but it is stalled on a bug: in overnight testing it triggered an ftrace self-test hang that i bisected down to that patch. While that doesnt affect mainline it's something that shows that the new sched_clock() code is not as widely usable as the old code - have to investigate that some heh, indeed :) Initially i thought that it's better to first be safe, but this really will only affect x86 in practice, so ... i think we'll switch around the flag, turning this into a no-effort thing for everything but x86. Ingo --
FWIW I tested the following trivial patch for powerpc. While testing this I discovered a i(related) ppc32 regressesion that I'm trying to bisect/debug. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> --- Obviously on top of the merged patch + Dave's sparc patch arch/powerpc/Kconfig | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 3934e26..0642dff 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -110,6 +110,7 @@ config PPC select HAVE_KPROBES select HAVE_KRETPROBES select HAVE_LMB + select HAVE_STABLE_CLOCK config EARLY_PRINTK bool Yours Tony linux.conf.au http://www.marchsouth.org/ Jan 19 - 24 2009 The Australian Linux Technical Conference! --
Sorry Peter I think the whole concept is wrong. You're still at the wrong level. If you don't fix the underlying sched_clock the results will be always GIGO (Garbage-In-Garbage-Out). And the best "filtering" will not help with that if the incoming time is just wrong. And if you fix the underlying sched_clocks you don't need all that filtering. That is also why I don't like David's proposal of a Kconfig for this. It's the wrong way too. Or rather that Kconfig should be always set. Also I believe x86 sched_clock can be fixed that it will work properly with some constraints: - monotonous per CPU only If you think about it the concept of "jiffies only error" doesn't make sense because it means you can as well use jiffies if you really need inter CPU stamps. - not over deep sleep states on some CPUs. While modern x86 designs fixed that older ones still have that problem and it will be always there with us in the embedded world. I believe both constraints are ok for scheduler purposes. You don't need to measure idle tasks (and those are the only ones who go into sleep states) and you can always do all calculations on the run queue of a specific CPU only. This means it is not suitable as a printk clock for once, but for that jiffies is fine anyways. The rewritten sched_clock I posted the URL to a few times fixes some of the problems, not all (there was on non monotonity case left with non p state invariant TSC, i believe it was fixable though). I unfortunately cannot work on this, but maybe someone is interested in taking that code over. [for old kernels] ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt-current/patches/sched-clock ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt-current/patches/sched-clock64 -Andi --
