Re: [PATCH] virtual sched_clock() for s390

Previous thread: kmalloc zero size changes break i386 by Andi Kleen on Thursday, July 19, 2007 - 3:01 am. (12 messages)

Next thread: [PATCH] NCP: Delete test of long-deceased CONFIG_NCPFS_DEBUGDENTRY. by Robert P. J. Day on Thursday, July 19, 2007 - 4:09 am. (2 messages)
From: Jan Glauber
Date: Thursday, July 19, 2007 - 3:57 am

This patch introduces a cpu time clock for s390 (only ticking
if the virtual cpu is running) and bases the s390 implementation
of sched_clock() on it.

The times lice length on a virtual cpu can be anything
between the calculated time slice and zero. In reality
this doesn't seem to be problem, since the scheduler is fair
enough to not let a single process starve but the current
implementation can lead to inefficient short time slices.

By providing a 'virtual' sched_clock() we guarantee that a
process can get its time slice regardless of scheduling
decisions from the hypervisor.

Patch applies to 2.6.22 git and works fine with CFS.

Jan

--
 arch/s390/kernel/time.c  |   18 ++++++++++++------
 arch/s390/kernel/vtime.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 include/asm-s390/timer.h |    2 ++
 3 files changed, 59 insertions(+), 6 deletions(-)

--- ./include/asm-s390/timer.h.cpu_clock	2007-07-18 13:43:53.000000000 +0200
+++ ./include/asm-s390/timer.h	2007-07-18 20:41:13.000000000 +0200
@@ -48,6 +48,8 @@
 extern void init_cpu_vtimer(void);
 extern void vtime_init(void);
 
+extern unsigned long long cpu_clock(void);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_S390_TIMER_H */
--- ./arch/s390/kernel/time.c.cpu_clock	2007-07-18 13:43:35.000000000 +0200
+++ ./arch/s390/kernel/time.c	2007-07-18 21:01:07.000000000 +0200
@@ -62,21 +62,27 @@
 static u64 xtime_cc;
 
 /*
- * Scheduler clock - returns current time in nanosec units.
+ * Monotonic_clock - returns # of nanoseconds passed since time_init()
  */
-unsigned long long sched_clock(void)
+unsigned long long monotonic_clock(void)
 {
 	return ((get_clock() - jiffies_timer_cc) * 125) >> 9;
 }
+EXPORT_SYMBOL(monotonic_clock);
 
 /*
- * Monotonic_clock - returns # of nanoseconds passed since time_init()
+ * Scheduler clock - returns current time in nanosec units.
+ * Now based on virtual cpu time to only account time the guest
+ * was actually running.
  */
-unsigned long long ...
From: Jeremy Fitzhardinge
Date: Thursday, July 19, 2007 - 8:29 am

The Xen sched_clock implementation is very similar, and it seems to work

Runn*ing*?  Does it include time the VCPU spends idle/blocked?  If not,
then the scheduler won't be able to tell how long a process has been
asleep.  Maybe this doesn't matter (I had this problem in a version of
Xen's sched_clock, and I can't say I saw an ill effects from it).

    J
-

From: Srivatsa Vaddagiri
Date: Thursday, July 19, 2007 - 8:48 am

Good point ..

I think we need a measure of both virtual and real time here - 
virtual for accounting task-execution time and real for

I guess it will show up as some corner case behaviour, which people are
yet to discover on virtual env.

-- 
Regards,
vatsa
-

From: Ingo Molnar
Date: Thursday, July 19, 2007 - 9:00 am

CFS does measure time elapsed across task-sleep periods (and does 
something similar to what the old scheduler's 'sleep average' 
interactivity mechanism did), but that mechanism measures "time spent 
running during sleep", not "time spent idling".

still, CFS needs time measurement across idle periods as well, for 
another purpose: to be able to do precise task statistics for /proc. 
(for top, ps, etc.) So it's still true that sched_clock() should include 
idle periods too.

	Ingo
-

From: Jan Glauber
Date: Thursday, July 19, 2007 - 12:20 pm

No, it does not include idle time, if we're going idle the cpu timer

I'm not sure, s390 already has an implemetation for precise accounting
in the architecture code, does CFS also improve accounting data? 

Jan

-

From: Ingo Molnar
Date: Thursday, July 19, 2007 - 12:38 pm

what kind of precise accounting does s390 have in the architecture code? 
CFS changes task (and load) accounting to be sched_clock() driven in 
essence.

	Ingo
-

From: Jan Glauber
Date: Thursday, July 19, 2007 - 2:07 pm

s390 has per-process accounting that is aware of virtual cpu time, implemented in
arch/s390/kernel/time.c: account_ticks() and arch/s390/kernel/vtime.c:
account_*_vtime(). Timestamps are taken in entry.S for system calls, interrupts
and other system entries and are accounted later, we don't call update_process_times().


-

From: Paul Mackerras
Date: Thursday, July 19, 2007 - 6:01 pm

PowerPC's sched_clock() currently measures real time.  On POWER5 and
POWER6 machines we could change it to use a register called the "PURR"
(for Processor Utilization of Resources Register), which only measures
time spent while the partition is running.  But the PURR has another
function as well: it measures the distribution of dispatch cycles
between the two hardware threads on each core when running in SMT
mode.  That is, the cpu dispatches instructions from one thread or
the other (not both) on each CPU cycle, and each thread's PURR only
gets incremented on cycles where the cpu dispatches instructions for
that thread.  So the sum of the two threads' PURRs adds up to real
time.

Do you think this makes the PURR more useful for CFS, or less?  To me
it looks like this would mean that CFS can make a more equitable
distribution of CPU time if, for example, you had 3 runnable tasks on
a 2-core x dual-threaded machine (4 virtual CPUs).

BTW, what does "time spent running during sleep" mean?  Does it mean

As with s390, 64-bit PowerPC also uses CONFIG_VIRT_CPU_ACCOUNTING.
That affects how tsk->utime and tsk->stime are accumulated (we call
account_user_time and account_system_time directly rather than calling
update_process_times) as well as the system hardirq/softirq time, idle
time, and stolen time.

When you say "precise task statistics for /proc", where are they
accumulated?  I don't see any changes to the way that tsk->utime and
ctime are computed.

Paul.
-

From: Jeremy Fitzhardinge
Date: Thursday, July 19, 2007 - 11:03 pm

Sounds reasonable to me.  I've proposed in the past that sched_clock
should be scaled by the cpufreq frequency to achieve the same effect
(ie, measure the actual number of cpu cycles that are really available
to tasks).

But more specifically, what you've described is exactly analogous to

That's how I interpreted it.  You're only credited for sleeping if
someone else wanted the CPU in the meantime.

    J
-

From: Ingo Molnar
Date: Friday, July 20, 2007 - 12:22 am

there's one complication: sched_clock() still needs to increase while 
the CPU (or thread) is idle, so that we can have a correct measurement 
of the CPU's utilization, for SMP load-balancing. CFS constructs another 
clock from sched_clock() [the rq->fair_clock] which does stop while the 
CPU is idle.

So perhaps a combination of the PURR and real-time might work as 
sched_clock(): when a hardware thread is in cpu_idle(), it should 
advance its sched clock with _half_ the rate of real-time [so that the 
sum of advance of all threads if they are all idle is equal to real 
time], and use the PURR if they are not idle. This would still correctly 
keep a meaningful load-average if the physical CPU is under-utilized.

If you do such a change you'll immediately see whether the approach is 
right: monitor the cpu_load[] values in /proc/sched_debug, they should 
match the intuitive 'load average' of that CPU (if divided by 1024), and 

yeah. It's "the amount of fair runtime i missed out on while others were 

tsk->utime and tsk->stime is only used for a single purpose: to 
determine the 'split' factor of how to split up the precise total time 

we now use p->se.sum_exec_runtime that measures (in nanoseconds) the 
precise amount of time spent executing (sum of system and user time) - 
and ->stime and ->utime is used to determine the 'split'. [this allows 
us to gather ->stime and ->utime via low-resolution sampling, while 
keeping the 'total' precise. Accounting at every system entry point 
would be quite expensive on most platforms.]

	Ingo
-

From: Jan Glauber
Date: Monday, July 23, 2007 - 2:15 am

Using se.sum_exec_runtime to generate ->utime and ->stime breaks
the process accounting we have on s390 (and probably on PowerPC too).
With CONFIG_VIRT_CPU_ACCOUNTING we already have precise values in
->utime and ->stime. Can we make the calculation of the CFS-based time
values conditional by CONFIG_VIRT_CPU_ACCOUNTING?


-

From: Martin Schwidefsky
Date: Monday, July 23, 2007 - 6:24 am

At least for s390 and powerpc the utime and stime already contain a very
precise value how much time was spent in the user and system context.
For s390 the granularity is a microsecond. The other values nice, idle,

With the exact accounting of utime and stime that would mean that
p->se.sum_exec_runtime is utime + stime, no?
Precise Accounting at every cpu context switch has some cost, but for
s390 it is not as bad as it sounds. We do 2 store-cpu-timer (STPT)
instructions, 2 64 bit adds and 2 64 bit subtracts. In terms of cycles

Imho, we just have to update utime and stime when the process accounting
values are requested and set se.sum_exec_runtime to the sum of utime and
stime for CONFIG_VIRT_CPU_ACCOUNTING=y.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



-

Previous thread: kmalloc zero size changes break i386 by Andi Kleen on Thursday, July 19, 2007 - 3:01 am. (12 messages)

Next thread: [PATCH] NCP: Delete test of long-deceased CONFIG_NCPFS_DEBUGDENTRY. by Robert P. J. Day on Thursday, July 19, 2007 - 4:09 am. (2 messages)