Re: [PATCH] sched: high-res preemption tick

Previous thread: [PATCH RFC 2/2] paravirt: clean up lazy mode handling by Jeremy Fitzhardinge on Friday, October 12, 2007 - 4:40 pm. (3 messages)

Next thread: Build error for 2.6.23-git1 by Larry Finger on Friday, October 12, 2007 - 5:22 pm. (1 message)
To: linux-kernel <linux-kernel@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>
Date: Friday, October 12, 2007 - 4:51 pm

Subject: sched: high-res preemption tick

Use HR-timers (when available) to deliver an accurate preemption tick.

The regular scheduler tick that runs at 1/HZ can be too coarse when nice
level are used. The fairness system will still keep the cpu utilisation 'fair'
by then delaying the task that got an excessive amount of CPU time but try to
minimize this by delivering preemption points spot-on.

The average frequency of this extra interrupt is sched_latency / nr_latency.
Which need not be higher than 1/HZ, its just that the distribution within the
sched_latency period is important.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

For now HRTICK and PREEMPT_RESTRICT are not compatible.

arch/x86/kernel/entry_64.S | 6 -
arch/x86/kernel/signal_32.c | 3
arch/x86/kernel/signal_64.c | 3
include/asm-x86/thread_info_32.h | 2
include/asm-x86/thread_info_64.h | 5
include/linux/hrtimer.h | 9 +
include/linux/sched.h | 3
kernel/Kconfig.hz | 4
kernel/sched.c | 205 ++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 67 +++++++++++-
kernel/sched_idletask.c | 2
kernel/sched_rt.c | 2
12 files changed, 290 insertions(+), 21 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -62,6 +62,7 @@
#include <linux/reciprocal_div.h>
#include <linux/unistd.h>
#include <linux/pagemap.h>
+#include <linux/hrtimer.h>

#include <asm/tlb.h>

@@ -323,6 +324,12 @@ struct rq {
struct list_head migration_queue;
#endif

+#ifdef CONFIG_SCHED_HRTICK
+ unsigned long hrtick_flags;
+ ktime_t hrtick_expire;
+ struct hrtimer hrtick_timer;
+#endif
+
#ifdef CONFIG_SCHEDSTATS
/* latency stats */
struct sched_info rq_sched_info;
@@ -446,6 +453,8 @@ enum...

To: linux-kernel <linux-kernel@...>
Cc: Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Lennart Poettering <mztabzr@...>
Date: Saturday, October 13, 2007 - 6:51 pm

The below patch is an idea proposed by tglx and depends on sched-devel +
the hrtick patch previously posted.

The current watchdog action is to demote the task to SCHED_NORMAL,
however it might be wanted to deliver a signal instead (or have more per
task configuration state). Which is why I added Lennart to the CC list
as I gathered he would like something like this for PulseAudio.

---
Subject: sched: SCHED_FIFO watchdog timer

Set a per task (rlimit based) limit on SCHED_FIFO runtime. When the
limit is exceeded the task is demoted back to SCHED_NORMAL.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/asm-generic/resource.h | 5 +++--
kernel/sched.c | 5 +++++
kernel/sched_rt.c | 36 ++++++++++++++++++++++++++++++++++++
3 files changed, 44 insertions(+), 2 deletions(-)

Index: linux-2.6/include/asm-generic/resource.h
===================================================================
--- linux-2.6.orig/include/asm-generic/resource.h
+++ linux-2.6/include/asm-generic/resource.h
@@ -44,8 +44,8 @@
#define RLIMIT_NICE 13 /* max nice prio allowed to raise to
0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
-
-#define RLIM_NLIMITS 15
+#define RLIMIT_FIFOTIME 15 /* timeout for fifo slices in us */
+#define RLIM_NLIMITS 16

/*
* SuS says limits have to be unsigned.
@@ -86,6 +86,7 @@
[RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
[RLIMIT_NICE] = { 0, 0 }, \
[RLIMIT_RTPRIO] = { 0, 0 }, \
+ [RLIMIT_FIFOTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
}

#endif /* __KERNEL__ */
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -89,6 +89,14 @@ static struct task_struct *pick_next_tas

next->se.exec_start = rq->clock;

+ if (next->policy == SCHED_FIFO) {
+ unsigned long fifotime;
+
+ fifo...

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Lennart Poettering <mztabzr@...>
Date: Monday, October 15, 2007 - 5:32 pm

Great, this looks very promising. Thanks for doing this.

Kay
-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>
Date: Monday, October 15, 2007 - 10:25 am

Indeed! Having this in the kernel would allow us to enable RT
scheduling for PulseAudio by default without bad effects. I was thinking about
adding some kind of babysitting process to userspace -- but doing this as
an RLIMIT in the kernel strikes me a much better idea!

I think it would make a lot of sense to make the API very similar to
RLIMIT_CPU, i.e. also send out SIGXCPU and SIGKILL, with the single
difference that RLIMIT_CPU sends out a signal depending on the total
CPU time used for the process and the new RLIMIT based on the time the
process spent without sleeping. That would be a very reasonable
extension to the current RLIMIT_CPU model.

Thank you very much for doing this patch!

Lennart

--
Lennart Poettering Red Hat, Inc.
lennart [at] poettering [dot] net ICQ# 11060553
http://0pointer.net/lennart/ GnuPG 0x1A015CC4
-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Lennart Poettering <mztabzr@...>
Date: Monday, October 15, 2007 - 9:26 am

Why only SHCED_FIFO and not SCHED_RR?

I guess, put_prev_task() / set_curr_task() should be called (for the
case of task_running(p)) to make it group-scheduler-friendly (as it's
done e.g. in sched_setscheduler()).

(normilize_task() should probably do the same)

--
Best regards,
Dmitry Adamushko
-

To: Dmitry Adamushko <dmitry.adamushko@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Lennart Poettering <mztabzr@...>
Date: Monday, October 15, 2007 - 9:57 am

Because SCHED_FIFO is easier, _RR is for later. It was mostly an RFC to

Right, that is where I copied from, I'll pull the functionality into a
single function and make this and the sysrq stuff use it.

Thanks!

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>
Date: Saturday, October 13, 2007 - 3:18 am

This patch further reduced iperf context switching, and boosted
throughput.

iperf -c localhost -P 10 -t 300

Previously reported numbers

2.6.23-smp
[SUM] 0.0-300.0 sec 153 GBytes 4.39 Gbits/sec
[SUM] 0.0-300.1 sec 148 GBytes 4.23 Gbits/sec
[SUM] 0.0-300.0 sec 152 GBytes 4.36 Gbits/sec

2.6.23-smp-d (sched-devel)
[SUM] 0.0-300.0 sec 173 GBytes 4.96 Gbits/sec
[SUM] 0.0-300.1 sec 173 GBytes 4.96 Gbits/sec
[SUM] 0.0-300.0 sec 172 GBytes 4.93 Gbits/sec

Numbers from fresh pull today

2.6.23-smp-d-hrt
(re-enable PREEMPT_RESTRICT)
[SUM] 0.0-300.1 sec 181 GBytes 5.19 Gbits/sec
[SUM] 0.0-300.0 sec 182 GBytes 5.22 Gbits/sec
[SUM] 0.0-300.1 sec 182 GBytes 5.22 Gbits/sec

2.6.23-smp-d
[SUM] 0.0-300.1 sec 174 GBytes 4.97 Gbits/sec
[SUM] 0.0-300.1 sec 173 GBytes 4.95 Gbits/sec
[SUM] 0.0-300.1 sec 173 GBytes 4.96 Gbits/sec

-

To: Mike Galbraith <efault@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>
Date: Saturday, October 13, 2007 - 4:55 am

Ah, but HRTICK is not compatible with PREEMPT_RESTRICT, it will be

To: Mike Galbraith <efault@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>
Date: Saturday, October 13, 2007 - 5:17 am

(I do plan to fix that eventually, just need to do it)

-

To: Mike Galbraith <efault@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>
Date: Saturday, October 13, 2007 - 7:13 pm

I guess something like this ought to do, but its a tad late so I'm quite
sure :-)

---
kernel/sched_fair.c | 44 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 37 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -737,6 +737,24 @@ static inline struct sched_entity *paren

#endif /* CONFIG_FAIR_GROUP_SCHED */

+/*
+ * does pse (newly woken task) preempt se (current task)
+ */
+static int wakeup_preempt(struct sched_entity *se, struct sched_entity *pse)
+{
+ s64 delta, gran;
+
+ delta = se->vruntime - pse->vruntime;
+ gran = sysctl_sched_wakeup_granularity;
+ if (unlikely(se->load.weight != NICE_0_LOAD))
+ gran = calc_delta_fair(gran, &se->load);
+
+ if (delta > gran)
+ return 1;
+
+ return 0;
+}
+
#ifdef CONFIG_SCHED_HRTICK
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
@@ -764,6 +782,24 @@ static void hrtick_start_fair(struct rq
if (!requeue)
delta = max(10000LL, delta);

+ /*
+ * if we delayed wakeup preemption, shorten the slice to at most
+ * 1 jiffy (does this call for yet another sysctl_sched_?)
+ */
+ if (sched_feat(PREEMPT_RESTRICT) && first_fair(cfs_rq)) {
+ struct sched_entity *next = __pick_next_entity(cfs_rq);
+
+ if (wakeup_preempt(se, next)) {
+ u64 wakeup = NSEC_PER_SEC / HZ;
+ s64 delta2 = wakeup - ran;
+
+ if (delta2 < 0)
+ resched_task(rq->curr);
+ else
+ delta = min(delta, delta2);
+ }
+ }
+
hrtick_start(rq, delta, requeue);
}
}
@@ -866,7 +902,6 @@ static void check_preempt_wakeup(struct
struct task_struct *curr = rq->curr;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
struct sched_entity *se = &curr->se, *pse = &p->se;
- s64 delta, gran;

if (unlikely(rt_prio(p->prio))) {
update_rq_clock(rq);
@@ -887,12 +9...

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>
Date: Sunday, October 14, 2007 - 2:34 am

2.6.23-smp-d-hrt + restrict fix patch
[SUM] 0.0-300.1 sec 176 GBytes 5.03 Gbits/sec
[SUM] 0.0-300.1 sec 175 GBytes 5.02 Gbits/sec
[SUM] 0.0-300.1 sec 176 GBytes 5.05 Gbits/sec

Context switches are further reduced (across the board) over
PREEMPT_RESTRICT, dropping from ~7-8k to ~2.5k with this test, vs ~950
for SCHED_BATCH and ~50k with this tree and no restriction. Throughput
is ~96% of SCHED_BATCH, vs ~55% with no restriction. I see no
interactivity regressions.

-Mike

-

To: Mike Galbraith <efault@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>
Date: Saturday, October 13, 2007 - 7:16 pm

I guess that proves it.... that should have read:
... so I'm _not_ quite sure.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: linux-kernel <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>
Date: Saturday, October 13, 2007 - 6:11 am

Yes. Throughput falls as preemption climbs.

-Mike

-

Previous thread: [PATCH RFC 2/2] paravirt: clean up lazy mode handling by Jeremy Fitzhardinge on Friday, October 12, 2007 - 4:40 pm. (3 messages)

Next thread: Build error for 2.6.23-git1 by Larry Finger on Friday, October 12, 2007 - 5:22 pm. (1 message)