Re: Problems with timerfd()

Previous thread: keyboard not found in 2.6.22.1 by Alan Curry on Sunday, July 22, 2007 - 11:16 pm. (22 messages)

Next thread: [RFC 0/8]KVM: swap out guest pages by Shaohua Li on Sunday, July 22, 2007 - 11:51 pm. (20 messages)
From: Michael Kerrisk
Date: Sunday, July 22, 2007 - 11:32 pm

Andrew,

The timerfd() syscall went into 2.6.22.  While writing the man page for
this syscall I've found some notable limitations of the interface, and I am
wondering whether you and Linus would consider having this interface fixed
for 2.6.23.

On the one hand, these fixes would be an ABI change, which is of course
bad.  (However, as noted below, you have already accepted one of the ABI
changes that I suggested into -mm, after Davide submitted a patch.)

On the other hand, the interface has not yet made its way into a glibc
release, and the change will not break applications.  (The 2.6.22 version
of the interface would just be "broken".)

Details of my suggested changes are below.  A complication in all of this
is that on Friday, while I was part way though discussing this with Davide,
he went on vacation for a month and is likely to have only limited email
access during that time.  (See my further thoughts about what to do while
Davide is away at the end of this mail message.)  Our last communication,
after Davide had expressed reluctance about making some of the interface
changes, was a more extensive note from me describing the problems of the
interface.

The problems of the 2.6.22 timerfd() interface are as follows:

Problem 1
---------

The value returned by read(2)ing from a timerfd file descriptor is the
number of timer overruns.  In 2.6.22, this value is 4 bytes, limiting the
overrun count  to 2^32.  Consider an application where the timer frequency
was 100 kHz (feasible in the not-too-distant future, I would guess), then
the overrun counter would cycle after ~40000 seconds (~11 hours).
Furthermore returning 4 bytes from the read() is inconsistent with eventfd
file descriptors, which return 8 byte integers from a read().

Davide has already submitted a patch to you to make read() from a timerfd
file descriptor return an 8 byte integer, and I understand it to have been
accepted into -mm.

Problem 2
---------
Existing timer APIs (Unix interval timers -- ...
From: Andrew Morton
Date: Sunday, July 22, 2007 - 11:38 pm

I think if the need is sufficient we can do this: fix it in 2.6.23 and in
2.6.22.x.  That means that there will be a few broken-on-new-glibc kernels

argh.  Nobody told me it was an ABI change!  We'll need to consider merging
make-timerfd-return-a-u64-and-fix-the-__put_user.patch into 2.6.22.x as
well.

-

From: Andrew Morton
Date: Sunday, July 22, 2007 - 11:42 pm

So I'm trying to write a halfway respectable description of that patch and
I'm stuck when it comes to describing what will happen if someone tries
to run a future timerfd-enabled glibc on 2.2.22 base.   In what manner
will it misbehave?  What are the consequences of this decision?
-

From: Michael Kerrisk
Date: Monday, July 23, 2007 - 1:02 am

The difference would be that read() will only return 4 bytes, while the
application will expect 8.  If the application is checking the size of
returned value, as it should, then it will be able to detect the problem
(it could even be sophisticated enough to know that if this is a 4-byte
return, then it is running on an old 2.6.22 kernel).  If the application is
not checking the return from read(), then its 8-byte buffer will not be
filled -- the contents of the last 4 bytes will be undefined, so the u64
value as a whole will be junk.

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
-

From: Michael Kerrisk
Date: Wednesday, July 25, 2007 - 11:18 am

Andrew,


So I'm still not quite clear.  Can I take it from your statement above that
the proposed ABI changes would be admissible, as long as Davide is okay
with them?

Cheers,

Michael


-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
-

From: Andrew Morton
Date: Wednesday, July 25, 2007 - 3:12 pm

On Wed, 25 Jul 2007 20:18:51 +0200

yup, I'll send that diff into Linus and -stable and see what happens.

-

From: Michael Kerrisk
Date: Monday, August 6, 2007 - 11:55 pm

Andrew,

I'm working on the changes to timerfd(), but must admit I am struggling to
understand some of the kernel code for working with userspace timers (e.g.,
in kernel/posix-timers.c).  Can you suggest anyone who could provide
assistance?

Cheers,

Michael


-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
-

From: Andrew Morton
Date: Tuesday, August 7, 2007 - 12:36 am

The code was originally contributed by George Anzinger.  He has moved on
and Thomas Gleixner now does most work in there.  I'd expect that Thomas is
your guy, but he is nearing the end of a two-week vacation and will
presumably return to an akpm-sized inbox, so I'd give him a little time ;)

-

From: Michael Kerrisk
Date: Tuesday, August 7, 2007 - 2:14 am

Yes, I pinged George a week or so back, but got no response.

Thanks -- Thomas CCed on this reply.

Unfortunately I myself go on holiday on Saturday for two weeks.
All of these holidays (I guess that Davide is only back in another
1.5 weeks or so) start to make the chances of getting change
in during the .23-rc window look tight...

Cheers,

Michael
-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  
Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages , 
read the HOWTOHELP file and grep the source 
files for 'FIXME'.

-

From: Michael Kerrisk
Date: Thursday, August 9, 2007 - 2:11 pm

Andrew,

Here's my first shot at changing the timerfd() interface; patch
is against 2.6.23-rc1-mm2.

I think I got to the bottom of understanding the timer code in
the end, but I may still have missed some things...

This is my first kernel patch of any substance.  Perhaps Randy
or Christoph can give my toes a light toasting for the various
boobs I've likely made.

Thomas, would you please look over this patch to verify my
timer code?

Davide: would you please bless this interface change ;-).
It makes it possible to:

1) Fetch the previous timer settings (i.e., interval, and time
   remaining until next expiration) while installing new settings

2) Retrieve the settings of an existing timer without changing
   the timer.

Both of these features are supported by the older Unix timer APIs
(setitimer/getitimer and timer_create/timer_settime/timer_gettime).

The changes are as follows:

a) A further argument, outmr, can be used to retrieve the interval
   and time remaining before the expiration of a previously created
   timer.  Thus the call signature is now:

   timerfd(int utmr, int clockid, int flags,
           struct itimerspec *utmr,
           struct itimerspec *outmr)

   The outmr argument:

   1) must be NULL for a new timer (ufd == -1).
   2) can be NULL if we are updating an existing
      timer (i.e., utmr != NULL), but we are not
      interested in the previous timer value.
   3) can be used to retrieve the time until next timer
      expiration, without changing the timer.
      (In this case, utmr == NULL and ufd >= 0.)

b) Make the 'clockid' immutable: it can only be set
   if 'ufd' is -1 -- that is, on the timerfd() call that
   first creates a timer.  (This is effectively
   the limitation that is imposed by POSIX timers: the
   clockid is specified when the timer is created with
   timer_create(), and can't be changed.)

   [In the 2.6.22 interface, the clockid of an existing
   timer can be changed with a further call to timerfd()
   ...
From: Randy Dunlap
Date: Monday, August 13, 2007 - 4:34 pm

Hi,

OK, I'll make a few comments.

1.  Provide a full changelog, including reasons why the patch is
needed.  Don't assume that we have read the previous email threads.




		    put_compat_itimerspec(old_utmr, &t))

IOW, I think that the parameters are reversed (at least this way



---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

From: Jonathan Corbet
Date: Wednesday, August 15, 2007 - 7:40 am

Sorry for the late commentary...as I looked this over, one thing popped

timerfd() is looking increasingly like a multiplexor system call in
disguise.  There is no form of invocation which uses all of the
arguments.  Are we sure we wouldn't rather have something like:

	timerfd_open(clock);
	timerfd_adjust(fd, new_time, old_time);

?

jon
-

From: Ray Lee
Date: Monday, July 23, 2007 - 9:55 am

Hey there Michael, all,


I'm feeling slow, and think I'm missing something. Why is this an
issue? Wouldn't userspace just keep track of the last overrun count,
and subtract the two to get the overruns-since-last-read? That makes
it oblivious to rollovers, unless it can't manage to do a read once
every 11 hours.

(This is the same sort of thing we already have to deal with in
certain situations, such as network stat counters on 32 bit
platforms.)

Ray
-

From: Michael Kerrisk
Date: Tuesday, July 24, 2007 - 12:40 am

The value returned by read() is the number of overruns since


But userspace can't deal with the condition accurately, so why
require userspace to worry about this when we could just use
a 64-bit value instead?

Cheers,

Michael
-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  
Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages , 
read the HOWTOHELP file and grep the source 
files for 'FIXME'.

-

From: Ray Lee
Date: Tuesday, July 24, 2007 - 8:22 am

Once every 11 hours isn't a terrible price to pay. Once every 11
seconds would be another matter. It doesn't seem unreasonable to
expect an application that is already actively dealing with timers to
go and take a look at things every hour to see if there have been more

Okay, perhaps this is where I'm missing something? If userspace wakes
up once every hour, checks the overrun counter against the previous

<shrug> I don't have strong feelings either way. It just seemed like
something that could already be taken care of with today's interface.
Given that the discussion was about an API change between 2.6.22 and
2.6.23, I was looking for options to avoid that.
-

From: Michael Kerrisk
Date: Tuesday, July 24, 2007 - 8:56 am

In fact I don't have strong feelings on this part of the interface
design either.  I pointed out the limitation to Davide, and 
pointed out that the related eventfd interface read()s an 8-byte 
integer, and Davide then just fired off a patch to Andrew which 
went into --mm.

It's the other problems with the interface that bother me more
(inability to retrieve previous setting when changing
the timer; inability to retrieve time until next expiration
without changing the current setting).

Cheers,

Michael
-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  
Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages , 
read the HOWTOHELP file and grep the source 
files for 'FIXME'.

-

Previous thread: keyboard not found in 2.6.22.1 by Alan Curry on Sunday, July 22, 2007 - 11:16 pm. (22 messages)

Next thread: [RFC 0/8]KVM: swap out guest pages by Shaohua Li on Sunday, July 22, 2007 - 11:51 pm. (20 messages)