Re: [PATCH 4/5] select: make select() use schedule_hrtimeout()

Previous thread: endianness and sparse warnings by Geert Uytterhoeven on Friday, August 29, 2008 - 7:54 am. (3 messages)

Next thread: Re: 2.6.27 mtrr fixes do not work by Alexander Huemer on Friday, August 29, 2008 - 8:09 am. (5 messages)
From: Arjan van de Ven
Date: Friday, August 29, 2008 - 8:05 am

Today in Linux, select() and poll() operate in jiffies resolution
(granularity), meaning an effective resolution of 1 millisecond (HZ=1000) to
10 milliseconds (HZ=100). Delays shorter than this are not possible, and all
delays are in multiples of this granularity.

The effect is that applications that want (on average) to specify more
accurate delays (for example multimedia or other interactive apps) just
cannot do that; this creates more than needed jitter.

With this patch series, the internals of select() and poll() interfaces are
changed such that they work on the nanosecond level (using hrtimers). The
userspace interface for select() is in microseconds, for pselect() and
ppoll() this is in nanoseconds.

[actual behavior obviously depends on what resolution the hardware timers work, on
modern PCs this is pretty good though]

To show this effect I made a test application to measure the error made
in the select() timing.

For example, a userspace application asking for a 1200 microsecond delay, on
a HZ=1000 kernel, will in practice get a 1997 microsecond delay, a delta of
almost 800 microseconds (which is of course a high percentage of 1200). The
extreme case is asking for 1 microsecond, and getting 998 microseconds
delay... with the patch we get a 250 times improvement in behavior (!).

A graph of various inputs with the jitter can be seen at
http://www.tglx.de/~arjan/select_benefits.png

One thing to note is that on my machine, the current select() implementation
will return after 1997 microseconds when asked for 1999 microseconds; this
can be seen in a zoom in of the graph above:
http://www.tglx.de/~arjan/zoom.png
E.g. select() is returning too early in current Linux kernels; and this is
also fixed (by nature) by this patch series.
In the graph there's a 4 microsecond delta for most data points, this is
basically the measurement overhead (C-state exit, a few system calls, a 
loop and some math).


About the patches:

Patch 1: Introduces infrastructure where ...
From: Arjan van de Ven
Date: Friday, August 29, 2008 - 8:07 am

From: Arjan van de Ven <arjan@linux.intel.com>
Date: Thu, 28 Aug 2008 13:40:29 -0700
Subject: [PATCH] select: return accurate remainer in select() and ppoll()

On Linux, select() and ppoll() write back the time remaining into the timeout
value. Now that we have a nanosecond level end time, we can also give a very
accurate remainder back; this patch adds the logic to do this.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/compat.c |   39 ++++++++++++++++++++++++++++++---------
 fs/select.c |   47 ++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 7c9b2a1..b7c68f1 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1615,11 +1615,22 @@ asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
 
 	if (tvp) {
 		struct compat_timeval rtv;
+		struct timespec now;
 
 		if (current->personality & STICKY_TIMEOUTS)
 			goto sticky;
 		rtv.tv_usec = jiffies_to_usecs(do_div((*(u64*)&timeout), HZ));
 		rtv.tv_sec = timeout;
+
+		ktime_get_ts(&now);
+		now = timespec_sub(end_time, now);
+		if (now.tv_sec < 0 || now.tv_nsec < 0) {
+			memset(&rtv, 0, sizeof(rtv));
+		} else {
+			rtv.tv_sec = now.tv_sec;
+			rtv.tv_usec = now.tv_nsec / NSEC_PER_USEC;
+		}
+
 		if (compat_timeval_compare(&rtv, &tv) >= 0)
 			rtv = tv;
 		if (copy_to_user(tvp, &rtv, sizeof(rtv))) {
@@ -1697,16 +1708,20 @@ asmlinkage long compat_sys_pselect7(int n, compat_ulong_t __user *inp,
 
 	if (tsp) {
 		struct compat_timespec rts;
+		struct timespec now;
 
 		if (current->personality & STICKY_TIMEOUTS)
 			goto sticky;
 
-		rts.tv_sec = timeout / HZ;
-		rts.tv_nsec = (timeout % HZ) * (NSEC_PER_SEC/HZ);
-		if (rts.tv_nsec >= NSEC_PER_SEC) {
-			rts.tv_sec++;
-			rts.tv_nsec -= NSEC_PER_SEC;
+		ktime_get_ts(&now);
+		now = timespec_sub(end_time, now);
+		if (now.tv_sec < 0 || now.tv_nsec < 0) {
+			memset(&rts, 0, sizeof(rts));
+		} else {
+			rts.tv_nsec = ...
From: Arjan van de Ven
Date: Friday, August 29, 2008 - 8:07 am

From: Arjan van de Ven <arjan@linux.intel.com>
Date: Thu, 28 Aug 2008 13:44:03 -0700
Subject: [PATCH] select: introduce a schedule_hrtimeout() function

This patch adds a schedule_hrtimeout() function, to be used by select() and poll()
in a later patch. This function works similar to schedule_timeout() in most ways,
but takes a timespec rather than jiffies.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/hrtimer.h |    3 ++
 kernel/hrtimer.c        |   57 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 6d93dce..526b62f 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -346,6 +346,9 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block);
 extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
 				 struct task_struct *tsk);
 
+extern signed long schedule_hrtimeout(struct timespec *time, int mode);
+
+
 /* Soft interrupt function to run the hrtimer queues: */
 extern void hrtimer_run_queues(void);
 extern void hrtimer_run_pending(void);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index b8e4dce..0ab7b04 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1678,3 +1678,60 @@ void __init hrtimers_init(void)
 #endif
 }
 
+
+/**
+ * schedule_hrtimeout - sleep until timeout
+ * @timeout: timeout value (timespec)
+ * @mode: timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL
+ *
+ * Make the current task sleep until @timeout time has
+ * elapsed. The routine will return immediately unless
+ * the current task state has been set (see set_current_state()).
+ *
+ * You can set the task state as follows -
+ *
+ * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
+ * pass before the routine returns. The routine will return 0
+ *
+ * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
+ * delivered to the current task. In this case a positivee ...
From: Arjan van de Ven
Date: Friday, August 29, 2008 - 8:08 am

From: Arjan van de Ven <arjan@linux.intel.com>
Date: Thu, 28 Aug 2008 13:45:38 -0700
Subject: [PATCH] select: make select() use schedule_hrtimeout()

now that we have schedule_hrtimeout(), make select() use it.
But only for short delays; really long delays are assumed to not
need the highres level of accuracy but rather want the regular
timer behavior for now.

This is only done for delays shorter than 1 second; currently
the assumption is that longer delays have no desire to get the
higher accuracy.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/select.c |   41 +++++++++++++++++++++++++++++------------
 1 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 3ce5943..62b9767 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -24,6 +24,7 @@
 #include <linux/fdtable.h>
 #include <linux/fs.h>
 #include <linux/rcupdate.h>
+#include <linux/hrtimer.h>
 
 #include <asm/uaccess.h>
 
@@ -187,6 +188,7 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time)
 	struct poll_wqueues table;
 	poll_table *wait;
 	int retval, i;
+	int use_hr = 0;
 
 	rcu_read_lock();
 	retval = max_select_fd(n, fds);
@@ -196,6 +198,10 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time)
 		return retval;
 	n = retval;
 
+	/* Use hrtimers only for short timeouts for now */
+	if (*timeout < HZ && end_time->tv_sec != 0)
+		use_hr = 1;
+
 	poll_initwait(&table);
 	wait = &table.pt;
 	if (!*timeout)
@@ -266,20 +272,31 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time)
 			break;
 		}
 
-		if (*timeout < 0) {
-			/* Wait indefinitely */
-			__timeout = MAX_SCHEDULE_TIMEOUT;
-		} else if (unlikely(*timeout >= (s64)MAX_SCHEDULE_TIMEOUT - 1)) {
-			/* Wait for longer than MAX_SCHEDULE_TIMEOUT. Do it in a loop */
-			__timeout = MAX_SCHEDULE_TIMEOUT - 1;
-			*timeout -= __timeout;
+		if (use_hr) {
+			struct timespec now;
+			ktime_t ...
From: Arnd Bergmann
Date: Friday, August 29, 2008 - 8:36 am

What is the tradeoff here? Where is the advantage in using the
traditional timer interface for longer timeouts?

	Arnd <><
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 9:20 am

This is _really_ ugly.

Can't you just do this relaxation in "schedule_hrtimeout()" instead, and 
just document the fact that the "high resolution" of hrtimeout is relative 
to the length of the timeout.

It's not that select() doesn't care, it's that *NOBODY* cares. If somebody 
asks for a timeout of one second and 2 microseconds, the two microseconds 
simply don't matter. Ever. But if somebody asks for a timeout of 12 
microseconds, individual microseconds probably _do_ matter.

So if you want high-resolution select/poll, then get rid of the "use_hr" 
logic entirely, and just do it unconditionally. Then, relax the scheduler 
timeouts in the scheduler.

(But, that's probably _generally_ true. Even now, when people do 
"schedule_timeout()", there's a big difference between asking for two 
ticks and asking for two seconds. The latter should probably try to round 
to a nice timer tick basis for power reasons).

		Linus
--

From: Alan Cox
Date: Friday, August 29, 2008 - 9:11 am

I disagree - that is fixing the problem in the wrong place. The timer
structure needs an accuracy field of some form that the existing timer
functions initialise to 0.

On a heavily loaded system with things like network events the behaviour
of the overall system is too complex to do the job well except at the
timer level which seems all. At the virtualisation level the hypervisor
needs to be doing the work to merge timer events between guests for power
management.

Once the timers have an accuracy representation there doesn't need to be
any real difference anywhere in the stack.

Alan
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 10:26 am

I do agree that we could do that too, but you miss one big issue: even if 
we were to add an accuracy field inside the kernel, there is no such field 
in the user interfaces.

We just pass timevals (and sometimes timespecs) around, and no, they don't 
have any way to specify accuracy.

Yeah, we could use the high bits in the usec/nsec words, but then older 
kernels would basically do random things, so that would be a horrible 
interface.

The other thing to do would be to just add totally new system calls with 
totally new interfaces, but (a) nobody would use them anyway and (b) it's 
simply not worth it.

So given that reality, and _if_ we want to support nice high-resolution 
sleeping by select/poll, the only reasonable thing to do is to estimate 
some kind of expected accuracy from the existing timeval/timespec.

And the only reasonable way to do that is to just look at the range. You 
can probably do something fairly trivial with

	/* Estimate expected accuracy in ns from a timeval */
	unsigned long estimate_accuracy(struct timeval *tv)
	{
		/*
		 * Tens of ms if we're looking at seconds, even
		 * more for 10s+ sleeping
		 */
		if (tv->tv_sec) {
			/* Tenths of seconds for long sleeps */
			if (tv->tv_sec > 10)
				return 100000000;
			/*
			 * Tens of ms for second-granularity sleeps. This,
			 * btw, is the historical Linux 100Hz timer range.
			 */
			return 10000000;
		}

		/* Single msecs if we're looking at milliseconds */
		if (tv->tv_usec > 1000)
			return 1000000;

		/* Aim for tenths of msecs otherwise */
		return 100000;
	}

and yes, it's just a heuristic, but it's probably not a horribly stupid 
one or a very unreasonable one. 

			Linus
--

From: Arjan van de Ven
Date: Friday, August 29, 2008 - 10:42 am

On Fri, 29 Aug 2008 10:26:02 -0700 (PDT)


that works

one of the things we were thinking about was to also have a field in
the task struct (so inherited over fork/exec) that is the default
"slack", which can be set via a prctl, so that an admin can do a "nice"
like thing and run certain known silly apps at different granularity
(and a bunch of smart tricks in some key apps that we'll then do)



-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Alan Cox
Date: Friday, August 29, 2008 - 11:18 am

Most of the timers we want to aggregate with are kernel originated and
those also need accuracy fields (tcp retransmit is fairly precision at
high speed while tcp keepalive is 'yeah whatever' sort of urgency.

At the point we are merging timers we want to consider user kernel and in



Agreed except that I'd assume real time tasks implied high accuracy,
otherwise one assumes they'd have not have been real time - and probably
putting those constants into sysfs somewhere.


Alan
--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 11:46 am

Actually, I disagree.

They are _both_ of the exact same type: the precision depends on how long 
the timeout is.

TCP retransmit timers are a perfect example. If the timeout is long (which 
is quite common if you end up having multiple retransmits), you _really_ 
don't care about how precise it is.

And keepalives are 'yeah whatever' exactly because they are so long, not 
because they are inherently uninteresting per se.

I suspect you could find some kernel-generated timer that doesn't fit that 

And yes, I do agree that the heuristic could well involve other 
characteristics of the process in question. And probably characteristics 
of the machine itself (ie some general kind of "power mode" where timers 
are simply not considered critical if you want to be in low-power mode).

		Linus
--

From: Alan Cox
Date: Friday, August 29, 2008 - 11:33 am

HAL and dbus already deal with this sort of stuff, including now having a
notion of 'on battery' and the like so if its in sysfs it ceases to be
the kernels problem ;)
--

From: Arjan van de Ven
Date: Saturday, August 30, 2008 - 8:25 am

On Fri, 29 Aug 2008 19:33:09 +0100

(side note: "on battery" is the worst possible measure for doing 'do I
need to be power sensitive', but you knew that already)


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Arjan van de Ven
Date: Friday, August 29, 2008 - 9:30 am

On Fri, 29 Aug 2008 09:20:18 -0700 (PDT)

yes we can

ok I'll do it this way; I wanted to avoid changing behavior for long
delays (some people think there's too much overhead in hrtimers or
something) but i'm totally fine with not doing this bifurbicated


yes that's a separate story; this is what "range timers" are about and
David Woodhouse and I are working on that as well
(but this select stuff is basically a prerequisite to use that)

--

From: Daniel Walker
Date: Friday, August 29, 2008 - 8:59 am

Do you have any basis to make this "short delays" assumption? I would
think if regular timers can be used for longer than a second, people
will naturally assume anything else that's high res will also ..

The futex_wait() has code which looks a lot like your
schedule_hrtimeout() , (you might want to replace your code there
instead) , and it doesn't seem to have the short delays restriction ..

Daniel

--

From: Arjan van de Ven
Date: Friday, August 29, 2008 - 8:08 am

From: Arjan van de Ven <arjan@linux.intel.com>
Date: Thu, 28 Aug 2008 13:46:47 -0700
Subject: [PATCH] select: make poll() use schedule_hrtimeout() as well

similar to the select() patch, use schedule_hrtimeout() in the
poll() codepath.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/select.c |   46 ++++++++++++++++++++++++++++++----------------
 1 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 62b9767..e881b97 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -631,6 +631,10 @@ static int do_poll(unsigned int nfds,  struct poll_list *list,
 {
 	int count = 0;
 	poll_table* pt = &wait->pt;
+	int use_hr = 0;
+
+	if (*timeout < HZ && *timeout > 0)
+		use_hr = 1;
 
 	/* Optimise the no-wait case */
 	if (!(*timeout))
@@ -673,24 +677,34 @@ static int do_poll(unsigned int nfds,  struct poll_list *list,
 		if (count || !*timeout)
 			break;
 
-		if (*timeout < 0) {
-			/* Wait indefinitely */
-			__timeout = MAX_SCHEDULE_TIMEOUT;
-		} else if (unlikely(*timeout >= (s64)MAX_SCHEDULE_TIMEOUT-1)) {
-			/*
-			 * Wait for longer than MAX_SCHEDULE_TIMEOUT. Do it in
-			 * a loop
-			 */
-			__timeout = MAX_SCHEDULE_TIMEOUT - 1;
-			*timeout -= __timeout;
+		if (use_hr) {
+			struct timespec now;
+			ktime_t expire;
+			schedule_hrtimeout(end_time, HRTIMER_MODE_ABS);
+			getnstimeofday(&now);
+			expire = ktime_sub(timespec_to_ktime(*end_time), timespec_to_ktime(now));
+			if (ktime_to_ns(expire) <= 0)
+				break;
 		} else {
-			__timeout = *timeout;
-			*timeout = 0;
-		}
+			if (*timeout < 0) {
+				/* Wait indefinitely */
+				__timeout = MAX_SCHEDULE_TIMEOUT;
+			} else if (unlikely(*timeout >= (s64)MAX_SCHEDULE_TIMEOUT-1)) {
+				/*
+				 * Wait for longer than MAX_SCHEDULE_TIMEOUT. Do it in
+				 * a loop
+				 */
+				__timeout = MAX_SCHEDULE_TIMEOUT - 1;
+				*timeout -= __timeout;
+			} else {
+				__timeout = *timeout;
+				*timeout = 0;
+			}
 
-		__timeout = ...
From: Arjan van de Ven
Date: Friday, August 29, 2008 - 8:06 am

From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 27 Aug 2008 11:24:52 -0700
Subject: [PATCH] select: add a timespec version of the timeout to select/poll

This patch adds the end_time parameter to the various pieces in the select()
and poll() paths; but does not use it yet (that is for later patches in
the series). This end_time is the timespec accuracy of the end time (eg
nanoseconds) rather than the current jiffies-juggling that happens all
over the poll/select code. It's also an absolute time, so we can always
see how long there is to go just by looking at the delta between the
end time and the current time.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/compat.c          |   28 ++++++++++++++++++++----
 fs/select.c          |   56 ++++++++++++++++++++++++++++++++++++++++---------
 include/linux/poll.h |    6 ++--
 3 files changed, 71 insertions(+), 19 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index c9d1472..7c9b2a1 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1513,7 +1513,8 @@ int compat_set_fd_set(unsigned long nr, compat_ulong_t __user *ufdset,
 	((unsigned long) (MAX_SCHEDULE_TIMEOUT / HZ)-1)
 
 int compat_core_sys_select(int n, compat_ulong_t __user *inp,
-	compat_ulong_t __user *outp, compat_ulong_t __user *exp, s64 *timeout)
+	compat_ulong_t __user *outp, compat_ulong_t __user *exp, s64 *timeout,
+	struct timespec *end_time)
 {
 	fd_set_bits fds;
 	void *bits;
@@ -1560,7 +1561,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp,
 	zero_fd_set(n, fds.res_out);
 	zero_fd_set(n, fds.res_ex);
 
-	ret = do_select(n, &fds, timeout);
+	ret = do_select(n, &fds, timeout, end_time);
 
 	if (ret < 0)
 		goto out;
@@ -1589,6 +1590,7 @@ asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
 	s64 timeout = -1;
 	struct compat_timeval tv;
 	int ret;
+	struct timespec end_time;
 
 	if (tvp) {
 		if (copy_from_user(&tv, tvp, sizeof(tv)))
@@ -1604,9 +1606,12 @@ asmlinkage long compat_sys_select(int ...
From: Andrew Morton
Date: Friday, August 29, 2008 - 7:10 pm

It's a bit dorkward using memset to clear out a structure which
zillions of code sites already know contains only two members.

It could be that two plain old writes is more efficient.  (and it could
be that this is how the compiler implements the memset anyway?)

But I'd suggest that adding a new inlined timespec_zero(timespec*) or
timespec_set(timespec*,int,int) would be nicer.

--

From: Linus Torvalds
Date: Friday, August 29, 2008 - 7:54 pm

gcc indeed knows about memset(), and will generate just two stores. No 
worries,

		Linus
--

From: Arnd Bergmann
Date: Friday, August 29, 2008 - 8:54 am

Nice work, seems a lot better thought through than my previous attempt [1],


You mean entirely entirely getting rid of jiffies in the kernel, or just
with user facing interfaces like epoll_wait, rt_sigtimedwait etc?

I think if we want to replace more code with hr timers, we will need
something like dwmw2's range timers based on that in order to keep the
wakeup rate low. These are still jiffies based, which doesn't help
your case, but they should be easy to change to ktime or timespec.

	Arnd <><

[1] http://www.ussg.iu.edu/hypermail/linux/kernel/0703.0/1319.html
[2] http://lwn.net/Articles/291159/
--

From: Arjan van de Ven
Date: Friday, August 29, 2008 - 9:12 am

On Fri, 29 Aug 2008 17:54:27 +0200

I meant actually just in the select() and poll() code; right now it is
sort of dual-brained, I'd like to eventually be able to get rid of that
and just use hrtimers for select() and poll() always

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Brian Wellington
Date: Friday, August 29, 2008 - 3:15 pm

Would a similar change also be useful for epoll_wait()?  It's unfortunate
that there's no way to use epoll with more than millisecond granularity,
but it seems like the same arguments could be made as with poll().

Brian
--

Previous thread: endianness and sparse warnings by Geert Uytterhoeven on Friday, August 29, 2008 - 7:54 am. (3 messages)

Next thread: Re: 2.6.27 mtrr fixes do not work by Alexander Huemer on Friday, August 29, 2008 - 8:09 am. (5 messages)