New in this version: * With lots of help from Thomas Gleixner, select() and poll() now exclusively use hrtimers * Several key cleanups from Thomas actually simplify and clean the code up so that it's an overall improvement in code quality * Various interesting bugs were encountered during the switchover on this level. The Fedora/Red Hat "nash" program deserves a special mention for both asking for a 1 nanosecond ppoll() timeout AND depending on the implementation to set this to 0 nanoseconds in userspace memory at the end of the first iteration. fs/compat.c | 187 +++++++++-------------- fs/select.c | 346 +++++++++++++++++++++----------------------- include/linux/hrtimer.h | 2 include/linux/poll.h | 8 - include/linux/thread_info.h | 8 + include/linux/time.h | 4 kernel/hrtimer.c | 65 ++++++++ kernel/time.c | 18 ++ 8 files changed, 344 insertions(+), 294 deletions(-) (the bulk of actual linecount growth is just newly added comments) ---- Today in Linux, select() and poll() operate in jiffies resolution (granularity), meaning an effective resolution of 1 millisecond (HZ=1000) to 10 milliseconds (HZ=100). Delays shorter than this are not possible, and all delays are in multiples of this granularity. The effect is that applications that want (on average) to specify more accurate delays (for example multimedia or other interactive apps) just cannot do that; this creates more than needed jitter. With this patch series, the internals of select() and poll() interfaces are changed such that they work on the nanosecond level (using hrtimers). The userspace interface for select() is in microseconds, for pselect() and ppoll() this is in nanoseconds. [actual behavior obviously on what resolution the hardware timers work, on modern PCs this is pretty good though] To show this effect I made a test application to measure the error made in the select() ...
From: Arjan van de Ven <arjan@linux.intel.com> Subject: [PATCH] select: Introduce a hrtimeout function This patch adds a schedule_hrtimeout() function, to be used by select() and poll() in a later patch. This function works similar to schedule_timeout() in most ways, but takes a timespec rather than jiffies. With a lot of contributions/fixes from Thomas Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- include/linux/hrtimer.h | 2 + kernel/hrtimer.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 67 insertions(+), 0 deletions(-) diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h index 6d93dce..becd17d 100644 --- a/include/linux/hrtimer.h +++ b/include/linux/hrtimer.h @@ -346,6 +346,8 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block); extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, struct task_struct *tsk); +extern int schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode); + /* Soft interrupt function to run the hrtimer queues: */ extern void hrtimer_run_queues(void); extern void hrtimer_run_pending(void); diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index b8e4dce..782137d 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -1678,3 +1678,68 @@ void __init hrtimers_init(void) #endif } +/** + * schedule_hrtimeout - sleep until timeout + * @expires: timeout value (ktime_t) + * @mode: timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL + * + * Make the current task sleep until the given expiry time has + * elapsed. The routine will return immediately unless + * the current task state has been set (see set_current_state()). + * + * You can set the task state as follows - + * + * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to + * pass before the routine returns. + * + * %TASK_INTERRUPTIBLE - the routine may return early if a signal is + * delivered to the ...
From: Thomas Gleixner <tglx@linutronix.de>
Subject: [PATCH] select: add a timespec_add_safe() function
For the select() rework, it's important to be able to add timespec
structures in an overflow-safe manner.
This patch adds a timespec_add_safe() function for this which is similar in
operation to ktime_add_safe(), but works on a struct timespec.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
include/linux/time.h | 4 ++++
kernel/time.c | 18 ++++++++++++++++++
2 files changed, 22 insertions(+), 0 deletions(-)
diff --git a/include/linux/time.h b/include/linux/time.h
index e15206a..7269764 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -38,6 +38,8 @@ struct timezone {
#define NSEC_PER_SEC 1000000000L
#define FSEC_PER_SEC 1000000000000000L
+#define TIME_T_MAX (time_t)((1UL << ((sizeof(time_t) << 3) - 1)) - 1)
+
static inline int timespec_equal(const struct timespec *a,
const struct timespec *b)
{
@@ -72,6 +74,8 @@ extern unsigned long mktime(const unsigned int year, const unsigned int mon,
const unsigned int min, const unsigned int sec);
extern void set_normalized_timespec(struct timespec *ts, time_t sec, long nsec);
+extern struct timespec timespec_add_safe(const struct timespec lhs,
+ const struct timespec rhs);
/*
* sub = lhs - rhs, in normalized form
diff --git a/kernel/time.c b/kernel/time.c
index 6a08660..d63a433 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -669,3 +669,21 @@ EXPORT_SYMBOL(get_jiffies_64);
#endif
EXPORT_SYMBOL(jiffies);
+
+/*
+ * Add two timespec values and do a safety check for overflow.
+ * It's assumed that both values are valid (>= 0)
+ */
+struct timespec timespec_add_safe(const struct timespec lhs,
+ const struct timespec rhs)
+{
+ struct timespec res;
+
+ set_normalized_timespec(&res, lhs.tv_sec + rhs.tv_sec,
+ lhs.tv_nsec + rhs.tv_nsec);
+
+ if ...From: Thomas Gleixner <tglx@linutronix.de>
Date: Sun, 31 Aug 2008 08:16:57 -0700
Subject: [PATCH] select: add poll_select_set_timeout() and
poll_select_copy_remaining() helpers
This patch adds 2 helpers that will be used for the hrtimer based
select/poll:
poll_select_set_timeout() is a helper that takes a timeout (as a
second, nanosecond pair) and turns that into a "struct timespec" that
represents the absolute end time. This is a common operation in the
many select() and poll() variants and needs various, common, sanity
checks.
poll_select_copy_remaining() is a helper that takes care of copying the
remaining time to userspace, as select(), pselect() and ppoll() do.
This function comes in both a natural and a compat implementation (due
to datastructure differences).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
fs/compat.c | 51 ++++++++++++++++++++++++++++++++++
fs/select.c | 75
++++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/poll.h
| 2 + 3 files changed, 128 insertions(+), 0 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 075d050..424767c 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1436,6 +1436,57 @@ out_ret:
#define __COMPAT_NFDBITS (8 * sizeof(compat_ulong_t))
+static int poll_select_copy_remaining(struct timespec *end_time, void
__user *p,
+ int timeval, int ret)
+{
+ struct timespec ts;
+
+ if (!p)
+ return ret;
+
+ if (current->personality & STICKY_TIMEOUTS)
+ goto sticky;
+
+ /* No update for zero timeout */
+ if (!end_time->tv_sec && !end_time->tv_nsec)
+ return ret;
+
+ ktime_get_ts(&ts);
+ ts = timespec_sub(*end_time, ts);
+ if (ts.tv_sec < 0)
+ ts.tv_sec = ts.tv_nsec = 0;
+
+ if (timeval) {
+ struct compat_timeval rtv;
+
+ rtv.tv_sec = ts.tv_sec;
+ rtv.tv_usec = ts.tv_nsec / NSEC_PER_USEC;
+
+ if (!copy_to_user(p, &rtv, sizeof(rtv)))
+ return ret;
+ } else {
+ struct ...From: Thomas Gleixner <tglx@linutronix.de>
Date: Sun, 31 Aug 2008 08:19:15 -0700
Subject: [PATCH] select: add a poll specific struct to the restart_block union
with hrtimer poll/select, the signal restart data no longer is a single
long representing a jiffies count, but it becomes a second/nanosecond pair
that also needs to encode if there was a timeout at all or not.
This patch adds a struct to the restart_block union for this purpose
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
include/linux/thread_info.h | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 38a5647..e6b820f 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -38,6 +38,14 @@ struct restart_block {
#endif
u64 expires;
} nanosleep;
+ /* For poll */
+ struct {
+ struct pollfd __user *ufds;
+ int nfds;
+ int has_timeout;
+ unsigned long tv_sec;
+ unsigned long tv_nsec;
+ } poll;
};
};
--
1.5.5.1
--
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Sun, 31 Aug 2008 08:26:40 -0700
Subject: [PATCH] select: switch select() and poll() over to hrtimers
With lots of help, input and cleanups from Thomas Gleixner
This patch switches select() and poll() over to hrtimers.
The core of the patch is replacing the "s64 timeout" with a
"struct timespec end_time" in all the plumbing.
But most of the diffstat comes from using the just introduced helpers:
poll_select_set_timeout
poll_select_copy_remaining
timespec_add_safe
which make manipulating the timespec easier and less error-prone.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
fs/compat.c | 136 ++++----------------------
fs/select.c | 263 +++++++++++++++++---------------------------------
include/linux/poll.h | 6 +-
3 files changed, 111 insertions(+), 294 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 424767c..133ed7f 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1568,7 +1568,8 @@ int compat_set_fd_set(unsigned long nr, compat_ulong_t __user *ufdset,
((unsigned long) (MAX_SCHEDULE_TIMEOUT / HZ)-1)
int compat_core_sys_select(int n, compat_ulong_t __user *inp,
- compat_ulong_t __user *outp, compat_ulong_t __user *exp, s64 *timeout)
+ compat_ulong_t __user *outp, compat_ulong_t __user *exp,
+ struct timespec *end_time)
{
fd_set_bits fds;
void *bits;
@@ -1615,7 +1616,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp,
zero_fd_set(n, fds.res_out);
zero_fd_set(n, fds.res_ex);
- ret = do_select(n, &fds, timeout);
+ ret = do_select(n, &fds, end_time);
if (ret < 0)
goto out;
@@ -1641,7 +1642,7 @@ asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
compat_ulong_t __user *outp, compat_ulong_t __user *exp,
struct compat_timeval __user *tvp)
{
- s64 timeout = -1;
+ struct timespec end_time, *to = NULL;
struct compat_timeval tv;
int ret;
@@ ...Hi Arjan, This is a good news. I've been bothered by this problem for a long time and got used to add 99999 us to any delay fed to select() (that started when HZ=100). I remember having noticed performance improvements in my applications using this trick even on other systems, so I don't think that Linux is the only one with this anomaly. I think it's simply caused by the rounding methods which make it hard to compute an accurate deadline when a value is converted before being added to jiffies. I do not remember noticing this on epoll though. Maybe it's just because I did not look close enough :-/ Cheers, Willy --
Another way to get down in the timeout resolution for newer kernels, and still supporting the syscalls with millisec timeouts, is to use timerfds. - Davide --
On Sun, 31 Aug 2008 14:13:45 -0700 (PDT) perhaps, but once you have to change the app you can also make it use ppoll() rather than poll().. that's a VERY minute change. -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
