Today in Linux, select() and poll() operate in jiffies resolution (granularity), meaning an effective resolution of 1 millisecond (HZ=1000) to 10 milliseconds (HZ=100). Delays shorter than this are not possible, and all delays are in multiples of this granularity. The effect is that applications that want (on average) to specify more accurate delays (for example multimedia or other interactive apps) just cannot do that; this creates more than needed jitter. With this patch series, the internals of select() and poll() interfaces are changed such that they work on the nanosecond level (using hrtimers). The userspace interface for select() is in microseconds, for pselect() and ppoll() this is in nanoseconds. [actual behavior obviously depends on what resolution the hardware timers work, on modern PCs this is pretty good though] To show this effect I made a test application to measure the error made in the select() timing. For example, a userspace application asking for a 1200 microsecond delay, on a HZ=1000 kernel, will in practice get a 1997 microsecond delay, a delta of almost 800 microseconds (which is of course a high percentage of 1200). The extreme case is asking for 1 microsecond, and getting 998 microseconds delay... with the patch we get a 250 times improvement in behavior (!). A graph of various inputs with the jitter can be seen at http://www.tglx.de/~arjan/select_benefits.png One thing to note is that on my machine, the current select() implementation will return after 1997 microseconds when asked for 1999 microseconds; this can be seen in a zoom in of the graph above: http://www.tglx.de/~arjan/zoom.png E.g. select() is returning too early in current Linux kernels; and this is also fixed (by nature) by this patch series. In the graph there's a 4 microsecond delta for most data points, this is basically the measurement overhead (C-state exit, a few system calls, a loop and some math). About the patches: Patch 1: Introduces infrastructure where ...
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Thu, 28 Aug 2008 13:40:29 -0700
Subject: [PATCH] select: return accurate remainer in select() and ppoll()
On Linux, select() and ppoll() write back the time remaining into the timeout
value. Now that we have a nanosecond level end time, we can also give a very
accurate remainder back; this patch adds the logic to do this.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
fs/compat.c | 39 ++++++++++++++++++++++++++++++---------
fs/select.c | 47 ++++++++++++++++++++++++++++++++++++-----------
2 files changed, 66 insertions(+), 20 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 7c9b2a1..b7c68f1 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1615,11 +1615,22 @@ asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
if (tvp) {
struct compat_timeval rtv;
+ struct timespec now;
if (current->personality & STICKY_TIMEOUTS)
goto sticky;
rtv.tv_usec = jiffies_to_usecs(do_div((*(u64*)&timeout), HZ));
rtv.tv_sec = timeout;
+
+ ktime_get_ts(&now);
+ now = timespec_sub(end_time, now);
+ if (now.tv_sec < 0 || now.tv_nsec < 0) {
+ memset(&rtv, 0, sizeof(rtv));
+ } else {
+ rtv.tv_sec = now.tv_sec;
+ rtv.tv_usec = now.tv_nsec / NSEC_PER_USEC;
+ }
+
if (compat_timeval_compare(&rtv, &tv) >= 0)
rtv = tv;
if (copy_to_user(tvp, &rtv, sizeof(rtv))) {
@@ -1697,16 +1708,20 @@ asmlinkage long compat_sys_pselect7(int n, compat_ulong_t __user *inp,
if (tsp) {
struct compat_timespec rts;
+ struct timespec now;
if (current->personality & STICKY_TIMEOUTS)
goto sticky;
- rts.tv_sec = timeout / HZ;
- rts.tv_nsec = (timeout % HZ) * (NSEC_PER_SEC/HZ);
- if (rts.tv_nsec >= NSEC_PER_SEC) {
- rts.tv_sec++;
- rts.tv_nsec -= NSEC_PER_SEC;
+ ktime_get_ts(&now);
+ now = timespec_sub(end_time, now);
+ if (now.tv_sec < 0 || now.tv_nsec < 0) {
+ memset(&rts, 0, sizeof(rts));
+ } else {
+ rts.tv_nsec = ...From: Arjan van de Ven <arjan@linux.intel.com> Date: Thu, 28 Aug 2008 13:44:03 -0700 Subject: [PATCH] select: introduce a schedule_hrtimeout() function This patch adds a schedule_hrtimeout() function, to be used by select() and poll() in a later patch. This function works similar to schedule_timeout() in most ways, but takes a timespec rather than jiffies. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> --- include/linux/hrtimer.h | 3 ++ kernel/hrtimer.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 60 insertions(+), 0 deletions(-) diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h index 6d93dce..526b62f 100644 --- a/include/linux/hrtimer.h +++ b/include/linux/hrtimer.h @@ -346,6 +346,9 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block); extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, struct task_struct *tsk); +extern signed long schedule_hrtimeout(struct timespec *time, int mode); + + /* Soft interrupt function to run the hrtimer queues: */ extern void hrtimer_run_queues(void); extern void hrtimer_run_pending(void); diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index b8e4dce..0ab7b04 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -1678,3 +1678,60 @@ void __init hrtimers_init(void) #endif } + +/** + * schedule_hrtimeout - sleep until timeout + * @timeout: timeout value (timespec) + * @mode: timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL + * + * Make the current task sleep until @timeout time has + * elapsed. The routine will return immediately unless + * the current task state has been set (see set_current_state()). + * + * You can set the task state as follows - + * + * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to + * pass before the routine returns. The routine will return 0 + * + * %TASK_INTERRUPTIBLE - the routine may return early if a signal is + * delivered to the current task. In this case a positivee ...
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Thu, 28 Aug 2008 13:45:38 -0700
Subject: [PATCH] select: make select() use schedule_hrtimeout()
now that we have schedule_hrtimeout(), make select() use it.
But only for short delays; really long delays are assumed to not
need the highres level of accuracy but rather want the regular
timer behavior for now.
This is only done for delays shorter than 1 second; currently
the assumption is that longer delays have no desire to get the
higher accuracy.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
fs/select.c | 41 +++++++++++++++++++++++++++++------------
1 files changed, 29 insertions(+), 12 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index 3ce5943..62b9767 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -24,6 +24,7 @@
#include <linux/fdtable.h>
#include <linux/fs.h>
#include <linux/rcupdate.h>
+#include <linux/hrtimer.h>
#include <asm/uaccess.h>
@@ -187,6 +188,7 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time)
struct poll_wqueues table;
poll_table *wait;
int retval, i;
+ int use_hr = 0;
rcu_read_lock();
retval = max_select_fd(n, fds);
@@ -196,6 +198,10 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time)
return retval;
n = retval;
+ /* Use hrtimers only for short timeouts for now */
+ if (*timeout < HZ && end_time->tv_sec != 0)
+ use_hr = 1;
+
poll_initwait(&table);
wait = &table.pt;
if (!*timeout)
@@ -266,20 +272,31 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time)
break;
}
- if (*timeout < 0) {
- /* Wait indefinitely */
- __timeout = MAX_SCHEDULE_TIMEOUT;
- } else if (unlikely(*timeout >= (s64)MAX_SCHEDULE_TIMEOUT - 1)) {
- /* Wait for longer than MAX_SCHEDULE_TIMEOUT. Do it in a loop */
- __timeout = MAX_SCHEDULE_TIMEOUT - 1;
- *timeout -= __timeout;
+ if (use_hr) {
+ struct timespec now;
+ ktime_t ...What is the tradeoff here? Where is the advantage in using the traditional timer interface for longer timeouts? Arnd <>< --
This is _really_ ugly. Can't you just do this relaxation in "schedule_hrtimeout()" instead, and just document the fact that the "high resolution" of hrtimeout is relative to the length of the timeout. It's not that select() doesn't care, it's that *NOBODY* cares. If somebody asks for a timeout of one second and 2 microseconds, the two microseconds simply don't matter. Ever. But if somebody asks for a timeout of 12 microseconds, individual microseconds probably _do_ matter. So if you want high-resolution select/poll, then get rid of the "use_hr" logic entirely, and just do it unconditionally. Then, relax the scheduler timeouts in the scheduler. (But, that's probably _generally_ true. Even now, when people do "schedule_timeout()", there's a big difference between asking for two ticks and asking for two seconds. The latter should probably try to round to a nice timer tick basis for power reasons). Linus --
I disagree - that is fixing the problem in the wrong place. The timer structure needs an accuracy field of some form that the existing timer functions initialise to 0. On a heavily loaded system with things like network events the behaviour of the overall system is too complex to do the job well except at the timer level which seems all. At the virtualisation level the hypervisor needs to be doing the work to merge timer events between guests for power management. Once the timers have an accuracy representation there doesn't need to be any real difference anywhere in the stack. Alan --
I do agree that we could do that too, but you miss one big issue: even if
we were to add an accuracy field inside the kernel, there is no such field
in the user interfaces.
We just pass timevals (and sometimes timespecs) around, and no, they don't
have any way to specify accuracy.
Yeah, we could use the high bits in the usec/nsec words, but then older
kernels would basically do random things, so that would be a horrible
interface.
The other thing to do would be to just add totally new system calls with
totally new interfaces, but (a) nobody would use them anyway and (b) it's
simply not worth it.
So given that reality, and _if_ we want to support nice high-resolution
sleeping by select/poll, the only reasonable thing to do is to estimate
some kind of expected accuracy from the existing timeval/timespec.
And the only reasonable way to do that is to just look at the range. You
can probably do something fairly trivial with
/* Estimate expected accuracy in ns from a timeval */
unsigned long estimate_accuracy(struct timeval *tv)
{
/*
* Tens of ms if we're looking at seconds, even
* more for 10s+ sleeping
*/
if (tv->tv_sec) {
/* Tenths of seconds for long sleeps */
if (tv->tv_sec > 10)
return 100000000;
/*
* Tens of ms for second-granularity sleeps. This,
* btw, is the historical Linux 100Hz timer range.
*/
return 10000000;
}
/* Single msecs if we're looking at milliseconds */
if (tv->tv_usec > 1000)
return 1000000;
/* Aim for tenths of msecs otherwise */
return 100000;
}
and yes, it's just a heuristic, but it's probably not a horribly stupid
one or a very unreasonable one.
Linus
--
On Fri, 29 Aug 2008 10:26:02 -0700 (PDT) that works one of the things we were thinking about was to also have a field in the task struct (so inherited over fork/exec) that is the default "slack", which can be set via a prctl, so that an admin can do a "nice" like thing and run certain known silly apps at different granularity (and a bunch of smart tricks in some key apps that we'll then do) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Most of the timers we want to aggregate with are kernel originated and those also need accuracy fields (tcp retransmit is fairly precision at high speed while tcp keepalive is 'yeah whatever' sort of urgency. At the point we are merging timers we want to consider user kernel and in Agreed except that I'd assume real time tasks implied high accuracy, otherwise one assumes they'd have not have been real time - and probably putting those constants into sysfs somewhere. Alan --
Actually, I disagree. They are _both_ of the exact same type: the precision depends on how long the timeout is. TCP retransmit timers are a perfect example. If the timeout is long (which is quite common if you end up having multiple retransmits), you _really_ don't care about how precise it is. And keepalives are 'yeah whatever' exactly because they are so long, not because they are inherently uninteresting per se. I suspect you could find some kernel-generated timer that doesn't fit that And yes, I do agree that the heuristic could well involve other characteristics of the process in question. And probably characteristics of the machine itself (ie some general kind of "power mode" where timers are simply not considered critical if you want to be in low-power mode). Linus --
HAL and dbus already deal with this sort of stuff, including now having a notion of 'on battery' and the like so if its in sysfs it ceases to be the kernels problem ;) --
On Fri, 29 Aug 2008 19:33:09 +0100 (side note: "on battery" is the worst possible measure for doing 'do I need to be power sensitive', but you knew that already) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
On Fri, 29 Aug 2008 09:20:18 -0700 (PDT) yes we can ok I'll do it this way; I wanted to avoid changing behavior for long delays (some people think there's too much overhead in hrtimers or something) but i'm totally fine with not doing this bifurbicated yes that's a separate story; this is what "range timers" are about and David Woodhouse and I are working on that as well (but this select stuff is basically a prerequisite to use that) --
Do you have any basis to make this "short delays" assumption? I would think if regular timers can be used for longer than a second, people will naturally assume anything else that's high res will also .. The futex_wait() has code which looks a lot like your schedule_hrtimeout() , (you might want to replace your code there instead) , and it doesn't seem to have the short delays restriction .. Daniel --
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Thu, 28 Aug 2008 13:46:47 -0700
Subject: [PATCH] select: make poll() use schedule_hrtimeout() as well
similar to the select() patch, use schedule_hrtimeout() in the
poll() codepath.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
fs/select.c | 46 ++++++++++++++++++++++++++++++----------------
1 files changed, 30 insertions(+), 16 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index 62b9767..e881b97 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -631,6 +631,10 @@ static int do_poll(unsigned int nfds, struct poll_list *list,
{
int count = 0;
poll_table* pt = &wait->pt;
+ int use_hr = 0;
+
+ if (*timeout < HZ && *timeout > 0)
+ use_hr = 1;
/* Optimise the no-wait case */
if (!(*timeout))
@@ -673,24 +677,34 @@ static int do_poll(unsigned int nfds, struct poll_list *list,
if (count || !*timeout)
break;
- if (*timeout < 0) {
- /* Wait indefinitely */
- __timeout = MAX_SCHEDULE_TIMEOUT;
- } else if (unlikely(*timeout >= (s64)MAX_SCHEDULE_TIMEOUT-1)) {
- /*
- * Wait for longer than MAX_SCHEDULE_TIMEOUT. Do it in
- * a loop
- */
- __timeout = MAX_SCHEDULE_TIMEOUT - 1;
- *timeout -= __timeout;
+ if (use_hr) {
+ struct timespec now;
+ ktime_t expire;
+ schedule_hrtimeout(end_time, HRTIMER_MODE_ABS);
+ getnstimeofday(&now);
+ expire = ktime_sub(timespec_to_ktime(*end_time), timespec_to_ktime(now));
+ if (ktime_to_ns(expire) <= 0)
+ break;
} else {
- __timeout = *timeout;
- *timeout = 0;
- }
+ if (*timeout < 0) {
+ /* Wait indefinitely */
+ __timeout = MAX_SCHEDULE_TIMEOUT;
+ } else if (unlikely(*timeout >= (s64)MAX_SCHEDULE_TIMEOUT-1)) {
+ /*
+ * Wait for longer than MAX_SCHEDULE_TIMEOUT. Do it in
+ * a loop
+ */
+ __timeout = MAX_SCHEDULE_TIMEOUT - 1;
+ *timeout -= __timeout;
+ } else {
+ __timeout = *timeout;
+ *timeout = 0;
+ }
- __timeout = ...From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 27 Aug 2008 11:24:52 -0700
Subject: [PATCH] select: add a timespec version of the timeout to select/poll
This patch adds the end_time parameter to the various pieces in the select()
and poll() paths; but does not use it yet (that is for later patches in
the series). This end_time is the timespec accuracy of the end time (eg
nanoseconds) rather than the current jiffies-juggling that happens all
over the poll/select code. It's also an absolute time, so we can always
see how long there is to go just by looking at the delta between the
end time and the current time.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
fs/compat.c | 28 ++++++++++++++++++++----
fs/select.c | 56 ++++++++++++++++++++++++++++++++++++++++---------
include/linux/poll.h | 6 ++--
3 files changed, 71 insertions(+), 19 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index c9d1472..7c9b2a1 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1513,7 +1513,8 @@ int compat_set_fd_set(unsigned long nr, compat_ulong_t __user *ufdset,
((unsigned long) (MAX_SCHEDULE_TIMEOUT / HZ)-1)
int compat_core_sys_select(int n, compat_ulong_t __user *inp,
- compat_ulong_t __user *outp, compat_ulong_t __user *exp, s64 *timeout)
+ compat_ulong_t __user *outp, compat_ulong_t __user *exp, s64 *timeout,
+ struct timespec *end_time)
{
fd_set_bits fds;
void *bits;
@@ -1560,7 +1561,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp,
zero_fd_set(n, fds.res_out);
zero_fd_set(n, fds.res_ex);
- ret = do_select(n, &fds, timeout);
+ ret = do_select(n, &fds, timeout, end_time);
if (ret < 0)
goto out;
@@ -1589,6 +1590,7 @@ asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
s64 timeout = -1;
struct compat_timeval tv;
int ret;
+ struct timespec end_time;
if (tvp) {
if (copy_from_user(&tv, tvp, sizeof(tv)))
@@ -1604,9 +1606,12 @@ asmlinkage long compat_sys_select(int ...It's a bit dorkward using memset to clear out a structure which zillions of code sites already know contains only two members. It could be that two plain old writes is more efficient. (and it could be that this is how the compiler implements the memset anyway?) But I'd suggest that adding a new inlined timespec_zero(timespec*) or timespec_set(timespec*,int,int) would be nicer. --
gcc indeed knows about memset(), and will generate just two stores. No worries, Linus --
Nice work, seems a lot better thought through than my previous attempt [1], You mean entirely entirely getting rid of jiffies in the kernel, or just with user facing interfaces like epoll_wait, rt_sigtimedwait etc? I think if we want to replace more code with hr timers, we will need something like dwmw2's range timers based on that in order to keep the wakeup rate low. These are still jiffies based, which doesn't help your case, but they should be easy to change to ktime or timespec. Arnd <>< [1] http://www.ussg.iu.edu/hypermail/linux/kernel/0703.0/1319.html [2] http://lwn.net/Articles/291159/ --
On Fri, 29 Aug 2008 17:54:27 +0200 I meant actually just in the select() and poll() code; right now it is sort of dual-brained, I'd like to eventually be able to get rid of that and just use hrtimers for select() and poll() always -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Would a similar change also be useful for epoll_wait()? It's unfortunate that there's no way to use epoll with more than millisecond granularity, but it seems like the same arguments could be made as with poll(). Brian --
