Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

From: Alexander Shishkin
Date: Tuesday, November 23, 2010 - 10:22 am

Certain userspace applications (like "clock" desktop applets or cron or
systemd) might want to be notified when some other application changes
the system time. There are several reasons for this that I know of:
 - avoiding periodic wakeups to poll time changes;
 - rearming CLOCK_REALTIME timers when said changes happen;
 - changing system timekeeping policy for system-wide time management
   programs;
 - keeping guest applications/operating systems running in emulators
   up to date.

This is another attempt to approach notifying userspace about system
clock changes. The other one is using an eventfd and a syscall [1]. In
the course of discussing the necessity of a syscall for this kind of
notification, it was suggested that this functionality can be achieved
via timers [2] (and timerfd in particular [3]). This idea got quite
some support [4], [5], [6] and some vague criticism [7], so I decided
to try and go a bit further with it.

[1] http://marc.info/?l=linux-kernel&m=128950389423614&w=2
[2] http://marc.info/?l=linux-kernel&m=128951020831573&w=2
[3] http://marc.info/?l=linux-kernel&m=128951588006157&w=2
[4] http://marc.info/?l=linux-kernel&m=128951503205111&w=2
[5] http://marc.info/?l=linux-kernel&m=128955890118477&w=2
[6] http://marc.info/?l=linux-kernel&m=129002967031104&w=2
[7] http://marc.info/?l=linux-kernel&m=129002672227263&w=2
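
For readers who want to experiment with the idea: the TFD_NOTIFY_CLOCK_SET
flag proposed in this RFC is not assumed to exist in any released header.
A minimal sketch of the same concept, using the TFD_TIMER_CANCEL_ON_SET
flag that later mainline kernels (3.0 and newer) provide instead, might look
like the following. The helper names clock_watch_open() and
clock_watch_changed() are hypothetical and are not part of the patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/timerfd.h>

#ifndef TFD_TIMER_CANCEL_ON_SET
#define TFD_TIMER_CANCEL_ON_SET (1 << 1)  /* value used by linux/timerfd.h */
#endif

/* Hypothetical helper: returns a timerfd armed about one year ahead on
 * CLOCK_REALTIME, or -1 on error. The expiry itself is not interesting;
 * the timer only exists so that a clock step cancels it. */
int clock_watch_open(void)
{
    struct itimerspec its = { 0 };
    struct timespec now;
    int fd = timerfd_create(CLOCK_REALTIME, TFD_NONBLOCK | TFD_CLOEXEC);

    if (fd < 0)
        return -1;
    if (clock_gettime(CLOCK_REALTIME, &now) < 0) {
        close(fd);
        return -1;
    }
    its.it_value.tv_sec = now.tv_sec + 365L * 24 * 3600;
    if (timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
                        &its, NULL) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: waits up to timeout_ms. Returns 1 if the realtime
 * clock was set (read() fails with ECANCELED), 0 if nothing happened or
 * the timer simply expired, -1 on any other error. */
int clock_watch_changed(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    uint64_t expirations;
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    if (read(fd, &expirations, sizeof(expirations)) < 0) {
        if (errno == ECANCELED)
            return 1;
        return errno == EAGAIN ? 0 : -1;
    }
    return 0;
}
```

A caller would put the descriptor into its poll/epoll loop and re-arm the
timer after each cancellation, since a cancelled timer stays disarmed.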

Signed-off-by: Alexander Shishkin <virtuoso@slind.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Greg Kroah-Hartman <gregkh@suse.de>
CC: Feng Tang <feng.tang@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Michael Tokarev <mjt@tls.msk.ru>
CC: Marcelo Tosatti <mtosatti@redhat.com>
CC: John Stultz <johnstul@us.ibm.com>
CC: Chris Friesen <chris.friesen@genband.com>
CC: Kay Sievers <kay.sievers@vrfy.org>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: Artem Bityutskiy <dedekind1@gmail.com>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: linux-fsdevel@vger.kernel.org
CC: ...
From: Lennart Poettering
Date: Tuesday, November 23, 2010 - 3:43 pm

I agree with Kay, this is pretty much exactly what we want for
systemd. (Assuming that the time jump due to system suspend is
propagated to userspace like any other time jump with this patch).

So yeah, I'd be very happy if this could be merged.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.
--

From: Artem Bityutskiy
Date: Wednesday, November 24, 2010 - 1:05 am

A "Tested-by: Lennart Poettering <mzxreary@0pointer.de>" would be

Hmm, and the question about why exactly the timerfd interface is a bad
way to go was ignored.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

--

From: Alexander Shishkin
Date: Tuesday, November 30, 2010 - 5:33 pm

Good to hear.

Regards,
--
Alex
--

From: Jamie Lokier
Date: Wednesday, December 1, 2010 - 3:43 am

I hope the time jump due to suspend is *not* propagated in the same
way to userspace :-)

What I'd like to see:

 1. Time jump due to the system clock being stepped: Notification.

    This is *not* a change in real time.  It means the clock was
    corrected/changed.  No physical time passed.

 2. Time jump due to suspend/resume: Different notification.

    This *is* a change in real time.  Physical time passed.

 3. Time drift corrections: As now, no notification, it's just
    the clock being regulated.

To signal the difference between 1 and 2, there ought to be some way
for userspace to determine how much of the clock delta corresponds
with physical time, by reading some sort of "monotonic" clock :-)
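
The bookkeeping described above could be sketched roughly as follows:
sample CLOCK_REALTIME together with a clock that keeps counting across
suspend, then compare the two deltas. This assumes such a suspend-aware
clock exists; the sketch uses CLOCK_BOOTTIME, which was added to Linux
only after this thread. The struct and function names are hypothetical.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

#ifndef CLOCK_BOOTTIME
#define CLOCK_BOOTTIME 7  /* Linux-specific clock ID, 2.6.39 and later */
#endif

/* One paired reading of wall-clock time and physically elapsed time. */
struct clock_sample {
    int64_t realtime_ns;  /* CLOCK_REALTIME */
    int64_t elapsed_ns;   /* suspend-aware monotonic clock */
};

static int64_t ts_to_ns(struct timespec ts)
{
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Fills *s with a fresh sample. Returns 0 on success, -1 on error. */
int take_sample(struct clock_sample *s)
{
    struct timespec rt, bt;

    if (clock_gettime(CLOCK_REALTIME, &rt) < 0 ||
        clock_gettime(CLOCK_BOOTTIME, &bt) < 0)
        return -1;
    s->realtime_ns = ts_to_ns(rt);
    s->elapsed_ns = ts_to_ns(bt);
    return 0;
}

/* Part of the wall-clock change that is NOT explained by physical time:
 * roughly zero after a plain suspend/resume (case 2), non-zero after the
 * clock was stepped (case 1). */
int64_t clock_step_ns(const struct clock_sample *before,
                      const struct clock_sample *after)
{
    int64_t wall = after->realtime_ns - before->realtime_ns;
    int64_t physical = after->elapsed_ns - before->elapsed_ns;

    return wall - physical;
}
```

An application would take one sample when it arms its timers and another
when it is notified, and treat a clock_step_ns() result beyond some
tolerance as a step rather than elapsed time.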

CLOCK_MONOTONIC is unsuitable because it stops at suspend.  Maybe it
should stay that way.  But maybe not - programs using CLOCK_MONOTONIC
usually want to trigger timeouts etc. based on real elapsed time, and
after suspend/resume, it's quite reasonable to want to trigger all of
a program's short timeouts immediately.  Indeed some network protocol
userspace may currently behave *incorrectly* over suspend/resume,
especially those using clock times to validate their caches,
*because* CLOCK_MONOTONIC doesn't count it.

So maybe CLOCK_MONOTONIC should be changed to include elapsed time
during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
for programs that want that?

That, plus this proposed patch, would signal the difference between 1
and 2 above nicely.

-- Jamie
--

From: Valdis.Kletnieks
Date: Wednesday, December 1, 2010 - 3:46 pm

Wouldn't that be an API break for programs that are expecting the current
behavior of CLOCK_MONOTONIC?  Yes, there should be a way to request either of
them - but if there's only one way now, it should continue to act the current
way, and the added way is the second option.

From: john stultz
Date: Wednesday, December 1, 2010 - 4:46 pm

We do keep track of the amount of time in suspend (total_sleep_time), so
creating a new clockid to provide CLOCK_MONOTONIC + total_sleep_time
wouldn't be hard. We just haven't had a clear articulation of why it
would be useful to expose to userland (nor a clear name to describe
exactly what it represents).
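
As a rough illustration of the clock ID being discussed: on kernels that
later gained CLOCK_BOOTTIME (2.6.39 and newer, i.e. after this thread),
userspace can approximate the accumulated suspend time as the difference
between that clock and CLOCK_MONOTONIC. The function name below is
hypothetical, and the two reads are not atomic, so the result is only
accurate to within the time between the two calls.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

#ifndef CLOCK_BOOTTIME
#define CLOCK_BOOTTIME 7  /* Linux-specific clock ID, 2.6.39 and later */
#endif

/* Hypothetical helper: approximate time spent suspended since boot, in
 * nanoseconds, or -1 on error. CLOCK_MONOTONIC is read first so the
 * difference cannot come out negative merely because of the read order. */
int64_t suspended_ns(void)
{
    struct timespec mono, boot;

    if (clock_gettime(CLOCK_MONOTONIC, &mono) < 0 ||
        clock_gettime(CLOCK_BOOTTIME, &boot) < 0)
        return -1;
    return ((int64_t)boot.tv_sec - mono.tv_sec) * 1000000000LL +
           (boot.tv_nsec - mono.tv_nsec);
}
```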

thanks
-john


--

From: Jamie Lokier
Date: Wednesday, December 1, 2010 - 6:18 pm

I don't know.  Can you think of any program which would break if
suspend/resume's clocks behaved like ordinary task scheduling - when a
task doesn't run for a long time because of scheduling decisions?
Hmm, I guess some realtime apps might like to know.

Currently CLOCK_MONOTONIC jumps forwards by 4 seconds on
suspend/resume anyway (as seen by userspace), on my x86 laptop running
2.6.37-rc3.  So it does already jump a bit...

But see my other reply; maybe there's no need to change it.  A
reliable, immediate notification that CLOCK_MONOTONIC's relationship
to real time has been disrupted by an unknown amount would be
sufficient for the problems I have in mind.

-- Jamie

--

From: john stultz
Date: Wednesday, December 1, 2010 - 6:55 pm

Like I mentioned earlier, CLOCK_MONOTONIC_RAW and CLOCK_MONOTONIC are
tightly tied, so anything using CLOCK_MONOTONIC_RAW would break.

It might be possible to change both, but I still think such a change

So just to clarify here, by this do you mean that there's a ~4 second
delay between the resume event and when userland apps start to run (or
possibly some of that accumulating between the app freeze and the
timekeeping suspend)?

Or are you seeing CLOCK_MONOTONIC jump 4 seconds out of sync with
CLOCK_REALTIME? 

The delta between CLOCK_MONOTONIC and CLOCK_REALTIME after resume should
be the delta prior to suspend + suspend time. If that's not the case,
something may be broken.
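
One way to check this expectation from userspace is to record the offset
between the two clocks before suspending and again after resuming. The
sketch below is only an illustration; the helper name is hypothetical.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: stores CLOCK_REALTIME - CLOCK_MONOTONIC, in
 * nanoseconds, in *out. Returns 0 on success, -1 on error. The two
 * clocks are read back to back, not atomically. */
int realtime_monotonic_offset_ns(int64_t *out)
{
    struct timespec mono, real;

    if (clock_gettime(CLOCK_MONOTONIC, &mono) < 0 ||
        clock_gettime(CLOCK_REALTIME, &real) < 0)
        return -1;
    *out = ((int64_t)real.tv_sec - mono.tv_sec) * 1000000000LL +
           (real.tv_nsec - mono.tv_nsec);
    return 0;
}
```

If the offset measured after resume minus the offset measured before
suspend roughly equals the time spent suspended, the clocks behave as
described; a result that is off by seconds would point at a problem.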

thanks
-john

--

From: john stultz
Date: Friday, December 3, 2010 - 5:57 pm

So actually, as I think more about this, I'm starting to come around to
the side that maybe CLOCK_MONOTONIC should be changed to increment
during suspend (CLOCK_MONOTONIC_RAW could also be moved forward by the
same amount, which isn't really ideal, but maybe not problematic).

There are still quite a number of problems that might be caused by such
a change. So it may still be impractical to actually do, but more and
more it does seem like it might be the better approach.

I'll keep thinking about it.

thanks
-john


--

From: john stultz
Date: Wednesday, December 1, 2010 - 5:10 pm

Sadly this behavior depends on architecture and rtc configuration.

For x86 and a number of other architectures, read_persistent_clock()
functions and we inject the time in suspend into CLOCK_REALTIME on
resume. No notification would be seen.

For architectures where read_persistent_clock does not function (usually
due to RTC not being accessible when irqs are off), we rely on the RTC
code to set the time when it resumes and irqs are enabled. This happens
via do_settimeofday, so a notification would be seen.

A hook could be added so the non-read_persistent_clock supporting arches
can inject time into CLOCK_REALTIME without going through settimeofday()
and triggering the notification. But there may still be odd races around
other stuff running and getting the wrong time before the suspend time
is injected.

This ignores any userland resume scripts that may do something like call


This is the case for read_persistent_clock() supported architectures.




Could you further expand on the needs for distinguishing between the


No. Let's not change it. CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW's
relationship is tightly coupled, and applications that are tracking the
amount of clock adjustment being done to the system require they keep
their semantics.

As I said earlier, adding a new clockid to represent the MONOTONIC
+SUSPEND time wouldn't be difficult, we just need to be clear about why
it should be exposed, and have it also be easy to describe to developers
which clockid would suit their needs best.

thanks
-john


--

From: Jamie Lokier
Date: Wednesday, December 1, 2010 - 6:12 pm

Yes, it's a correctness issue in network protocols using
lease/oplock/MESI-style cache coherency.  (E.g. NFSv4, CIFS, whatever
you like in userspace.)

By this, I mean anything with this sort of pattern:

   1. Receive message "you may cache thing X for up to 20 seconds *without
      checking if it changed* during that time; afterwards, check".

      (If the other end needs to change X within the 20 second
      interval, the other end will send a request to break the lease;
      if the other end doesn't get a response, then it waits until the
      20 seconds expire, and then it's safe to assume the lease expired.)

   2. Local request for value of X.

      => If less than 20 seconds has passed, the local cache responds
         with X *without any network confirmation*.  I.e. it's instant.
      => If more than 20 seconds has passed, it has to talk to the
         other end.  I.e. a network round trip.

The algorithm is coherent even if the network is unreliable and goes
down sometimes.  When that happens, local requests are stalled, rather
than returning values incoherent with other machines.

This algorithm breaks if the local application depends on
CLOCK_MONOTONIC to confirm that less than 20 seconds has passed
and CLOCK_MONOTONIC is lying.

CLOCK_MONOTONIC lies when you've done suspend+resume while this
program was running, so its 20-second test gives the wrong result.
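
The 20-second test in the pattern above might be sketched as follows.
The clock ID is left as a parameter so the same check can be run against
CLOCK_MONOTONIC or against a clock that also counts time in suspend
(assumed to be available; none exists as of this thread). Function names
are hypothetical, and the clock tolerances mentioned further down are
ignored.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

/* Current value of the given clock in nanoseconds, or -1 on error. */
int64_t clock_now_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) < 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Returns 1 if the cached value may still be served without asking the
 * other end, 0 if the lease must be treated as expired. All arguments
 * are in nanoseconds and must come from the same clock. */
int lease_valid(int64_t granted_ns, int64_t now_ns, int64_t lease_ns)
{
    return now_ns - granted_ns < lease_ns;
}
```

For example, with a lease granted at t=100s and a 60 second suspend, a
clock that stops during suspend may read 105s on the next request, so
lease_valid() still says yes, while a suspend-aware clock reads 165s and
the lease is correctly treated as expired.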

You can imagine there are quite a few applications that use this
technique because it's quite fundamental to efficient coherency
protocols.  (Although I'm unable to name any off the top of my head!).

There are generalisations for more interesting distributed systems.
The thing they all have in common is the ability to locally
query "has time T elapsed, in terms that would be recognised as T by
the remote machines".  In reality clocks have tolerances etc. so you
fudge by some percentage, and you are more careful about the order of
events than I have shown (it's more like "you may assume Y ...
From: john stultz
Date: Wednesday, December 1, 2010 - 8:07 pm

Ok. Just curious, as similar cases I was thinking about (like AFS)
require clients to have a reasonably synced CLOCK_REALTIME to the server

Yea, the case seems reasonable. I guess I'm just surprised they use

I'm not as familiar with the pm code, but if you just need
suspend/resume event notification, we should already have that via the
userland suspend/resume hooks.

It just seems to me that the notification you suggest is sufficient, but
is only minimally useful. So, an application gets a notification that we
suspended, and so CLOCK_MONOTONIC based timers may have been delayed,
but without knowing how much, it's unclear what to do. For the cache
cases, sure, you can just drop everything, but I'm sure for other cases
we'd be pushing the userland app to keep its own sense of the
CLOCK_MONOTONIC/REALTIME delta and try to track those changes.

So providing a new CLOCK_BOOTTIME or something would seem pretty
reasonable to me, allowing things like timers to be set that would
expire immediately after a resume if they were to expire while the

Well, unless there is no persistent/RTC device to figure out the suspend
time from, I think we could do a decent job. There are limitations (ie:
RTC hardware only providing second resolution time), but the bar for

Maybe I'm missing something, but it seems like such a notification is
going to be difficult to provide with the current interfaces. And I'm
not sure it resolves any races you'd have with the suspend hitting you
right after the time read but before an action is taken.

For such strict semantics, it almost seems like some way to inhibit
suspend would be needed around the time checks and actions.

thanks
-john






--
