Subject: sched: high-res preemption tick
Use HR-timers (when available) to deliver an accurate preemption tick.
The regular scheduler tick that runs at 1/HZ can be too coarse when nice
level are used. The fairness system will still keep the cpu utilisation 'fair'
by then delaying the task that got an excessive amount of CPU time but try to
minimize this by delivering preemption points spot-on.The average frequency of this extra interrupt is sched_latency / nr_latency.
Which need not be higher than 1/HZ, its just that the distribution within the
sched_latency period is important.Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---For now HRTICK and PREEMPT_RESTRICT are not compatible.
arch/x86/kernel/entry_64.S | 6 -
arch/x86/kernel/signal_32.c | 3
arch/x86/kernel/signal_64.c | 3
include/asm-x86/thread_info_32.h | 2
include/asm-x86/thread_info_64.h | 5
include/linux/hrtimer.h | 9 +
include/linux/sched.h | 3
kernel/Kconfig.hz | 4
kernel/sched.c | 205 ++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 67 +++++++++++-
kernel/sched_idletask.c | 2
kernel/sched_rt.c | 2
12 files changed, 290 insertions(+), 21 deletions(-)Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -62,6 +62,7 @@
#include <linux/reciprocal_div.h>
#include <linux/unistd.h>
#include <linux/pagemap.h>
+#include <linux/hrtimer.h>#include <asm/tlb.h>
@@ -323,6 +324,12 @@ struct rq {
struct list_head migration_queue;
#endif+#ifdef CONFIG_SCHED_HRTICK
+ unsigned long hrtick_flags;
+ ktime_t hrtick_expire;
+ struct hrtimer hrtick_timer;
+#endif
+
#ifdef CONFIG_SCHEDSTATS
/* latency stats */
struct sched_info rq_sched_info;
@@ -446,6 +453,8 @@ enum...
The below patch is an idea proposed by tglx and depends on sched-devel +
the hrtick patch previously posted.The current watchdog action is to demote the task to SCHED_NORMAL,
however it might be wanted to deliver a signal instead (or have more per
task configuration state). Which is why I added Lennart to the CC list
as I gathered he would like something like this for PulseAudio.---
Subject: sched: SCHED_FIFO watchdog timerSet a per task (rlimit based) limit on SCHED_FIFO runtime. When the
limit is exceeded the task is demoted back to SCHED_NORMAL.Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/asm-generic/resource.h | 5 +++--
kernel/sched.c | 5 +++++
kernel/sched_rt.c | 36 ++++++++++++++++++++++++++++++++++++
3 files changed, 44 insertions(+), 2 deletions(-)Index: linux-2.6/include/asm-generic/resource.h
===================================================================
--- linux-2.6.orig/include/asm-generic/resource.h
+++ linux-2.6/include/asm-generic/resource.h
@@ -44,8 +44,8 @@
#define RLIMIT_NICE 13 /* max nice prio allowed to raise to
0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
-
-#define RLIM_NLIMITS 15
+#define RLIMIT_FIFOTIME 15 /* timeout for fifo slices in us */
+#define RLIM_NLIMITS 16/*
* SuS says limits have to be unsigned.
@@ -86,6 +86,7 @@
[RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
[RLIMIT_NICE] = { 0, 0 }, \
[RLIMIT_RTPRIO] = { 0, 0 }, \
+ [RLIMIT_FIFOTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
}#endif /* __KERNEL__ */
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -89,6 +89,14 @@ static struct task_struct *pick_next_tasnext->se.exec_start = rq->clock;
+ if (next->policy == SCHED_FIFO) {
+ unsigned long fifotime;
+
+ fifo...
Great, this looks very promising. Thanks for doing this.
Kay
-
Indeed! Having this in the kernel would allow us to enable RT
scheduling for PulseAudio by default without bad effects. I was thinking about
adding some kind of babysitting process to userspace -- but doing this as
an RLIMIT in the kernel strikes me a much better idea!I think it would make a lot of sense to make the API very similar to
RLIMIT_CPU, i.e. also send out SIGXCPU and SIGKILL, with the single
difference that RLIMIT_CPU sends out a signal depending on the total
CPU time used for the process and the new RLIMIT based on the time the
process spent without sleeping. That would be a very reasonable
extension to the current RLIMIT_CPU model.Thank you very much for doing this patch!
Lennart
--
Lennart Poettering Red Hat, Inc.
lennart [at] poettering [dot] net ICQ# 11060553
http://0pointer.net/lennart/ GnuPG 0x1A015CC4
-
Why only SHCED_FIFO and not SCHED_RR?
I guess, put_prev_task() / set_curr_task() should be called (for the
case of task_running(p)) to make it group-scheduler-friendly (as it's
done e.g. in sched_setscheduler()).(normilize_task() should probably do the same)
--
Best regards,
Dmitry Adamushko
-
Because SCHED_FIFO is easier, _RR is for later. It was mostly an RFC to
Right, that is where I copied from, I'll pull the functionality into a
single function and make this and the sysrq stuff use it.Thanks!
This patch further reduced iperf context switching, and boosted
throughput.iperf -c localhost -P 10 -t 300
Previously reported numbers
2.6.23-smp
[SUM] 0.0-300.0 sec 153 GBytes 4.39 Gbits/sec
[SUM] 0.0-300.1 sec 148 GBytes 4.23 Gbits/sec
[SUM] 0.0-300.0 sec 152 GBytes 4.36 Gbits/sec2.6.23-smp-d (sched-devel)
[SUM] 0.0-300.0 sec 173 GBytes 4.96 Gbits/sec
[SUM] 0.0-300.1 sec 173 GBytes 4.96 Gbits/sec
[SUM] 0.0-300.0 sec 172 GBytes 4.93 Gbits/secNumbers from fresh pull today
2.6.23-smp-d-hrt
(re-enable PREEMPT_RESTRICT)
[SUM] 0.0-300.1 sec 181 GBytes 5.19 Gbits/sec
[SUM] 0.0-300.0 sec 182 GBytes 5.22 Gbits/sec
[SUM] 0.0-300.1 sec 182 GBytes 5.22 Gbits/sec2.6.23-smp-d
[SUM] 0.0-300.1 sec 174 GBytes 4.97 Gbits/sec
[SUM] 0.0-300.1 sec 173 GBytes 4.95 Gbits/sec
[SUM] 0.0-300.1 sec 173 GBytes 4.96 Gbits/sec-
Ah, but HRTICK is not compatible with PREEMPT_RESTRICT, it will be
(I do plan to fix that eventually, just need to do it)
-
I guess something like this ought to do, but its a tad late so I'm quite
sure :-)---
kernel/sched_fair.c | 44 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 37 insertions(+), 7 deletions(-)Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -737,6 +737,24 @@ static inline struct sched_entity *paren#endif /* CONFIG_FAIR_GROUP_SCHED */
+/*
+ * does pse (newly woken task) preempt se (current task)
+ */
+static int wakeup_preempt(struct sched_entity *se, struct sched_entity *pse)
+{
+ s64 delta, gran;
+
+ delta = se->vruntime - pse->vruntime;
+ gran = sysctl_sched_wakeup_granularity;
+ if (unlikely(se->load.weight != NICE_0_LOAD))
+ gran = calc_delta_fair(gran, &se->load);
+
+ if (delta > gran)
+ return 1;
+
+ return 0;
+}
+
#ifdef CONFIG_SCHED_HRTICK
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
@@ -764,6 +782,24 @@ static void hrtick_start_fair(struct rq
if (!requeue)
delta = max(10000LL, delta);+ /*
+ * if we delayed wakeup preemption, shorten the slice to at most
+ * 1 jiffy (does this call for yet another sysctl_sched_?)
+ */
+ if (sched_feat(PREEMPT_RESTRICT) && first_fair(cfs_rq)) {
+ struct sched_entity *next = __pick_next_entity(cfs_rq);
+
+ if (wakeup_preempt(se, next)) {
+ u64 wakeup = NSEC_PER_SEC / HZ;
+ s64 delta2 = wakeup - ran;
+
+ if (delta2 < 0)
+ resched_task(rq->curr);
+ else
+ delta = min(delta, delta2);
+ }
+ }
+
hrtick_start(rq, delta, requeue);
}
}
@@ -866,7 +902,6 @@ static void check_preempt_wakeup(struct
struct task_struct *curr = rq->curr;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
struct sched_entity *se = &curr->se, *pse = &p->se;
- s64 delta, gran;if (unlikely(rt_prio(p->prio))) {
update_rq_clock(rq);
@@ -887,12 +9...
2.6.23-smp-d-hrt + restrict fix patch
[SUM] 0.0-300.1 sec 176 GBytes 5.03 Gbits/sec
[SUM] 0.0-300.1 sec 175 GBytes 5.02 Gbits/sec
[SUM] 0.0-300.1 sec 176 GBytes 5.05 Gbits/secContext switches are further reduced (across the board) over
PREEMPT_RESTRICT, dropping from ~7-8k to ~2.5k with this test, vs ~950
for SCHED_BATCH and ~50k with this tree and no restriction. Throughput
is ~96% of SCHED_BATCH, vs ~55% with no restriction. I see no
interactivity regressions.-Mike
-
I guess that proves it.... that should have read:
... so I'm _not_ quite sure.-
Yes. Throughput falls as preemption climbs.
-Mike
-
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg Kroah-Hartman | [PATCH 004/196] Chinese: add translation of SubmittingPatches |
| Artem Bityutskiy | [PATCH 18/44 take 2] [UBI] build unit implementation |
| James Morris | Re: LSM conversion to static interface |
git: | |
| Paul Jackson | [PATCH] cpuset sched_load_balance kmalloc fix |
| Gerrit Renker | [PATCH 15/37] dccp: Set per-connection CCIDs via socket options |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Linus Torvalds | Re: [GIT]: Networking |
