Certain userspace applications (like "clock" desktop applets or cron or systemd) might want to be notified when some other application changes the system time. I know of several reasons for this:
 - avoiding periodic wakeups to poll time changes;
 - rearming CLOCK_REALTIME timers when said changes happen;
 - changing system timekeeping policy for system-wide time management programs;
 - keeping guest applications/operating systems running in emulators up to date.

This is another attempt to approach notifying userspace about system clock changes. The other one is using an eventfd and a syscall [1]. In the course of discussing the necessity of a syscall for this kind of notification, it was suggested that this functionality can be achieved via timers [2] (and timerfd in particular [3]). This idea got quite some support [4], [5], [6] and some vague criticism [7], so I decided to try and go a bit further with it.

[1] http://marc.info/?l=linux-kernel&m=128950389423614&w=2
[2] http://marc.info/?l=linux-kernel&m=128951020831573&w=2
[3] http://marc.info/?l=linux-kernel&m=128951588006157&w=2
[4] http://marc.info/?l=linux-kernel&m=128951503205111&w=2
[5] http://marc.info/?l=linux-kernel&m=128955890118477&w=2
[6] http://marc.info/?l=linux-kernel&m=129002967031104&w=2
[7] http://marc.info/?l=linux-kernel&m=129002672227263&w=2

Signed-off-by: Alexander Shishkin <virtuoso@slind.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Greg Kroah-Hartman <gregkh@suse.de>
CC: Feng Tang <feng.tang@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Michael Tokarev <mjt@tls.msk.ru>
CC: Marcelo Tosatti <mtosatti@redhat.com>
CC: John Stultz <johnstul@us.ibm.com>
CC: Chris Friesen <chris.friesen@genband.com>
CC: Kay Sievers <kay.sievers@vrfy.org>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: Artem Bityutskiy <dedekind1@gmail.com>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: linux-fsdevel@vger.kernel.org
CC: ...
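
To make the timerfd idea concrete, here is a minimal userspace sketch. It does not use the interface proposed in this patch; it assumes the TFD_TIMER_CANCEL_ON_SET flag that later mainline kernels (Linux 3.0 and newer) provide for this purpose. The helper names clock_change_fd() and wait_for_clock_change() are hypothetical, and the one-year expiry is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Fallback for older libc headers; the value matches the mainline UAPI. */
#ifndef TFD_TIMER_CANCEL_ON_SET
#define TFD_TIMER_CANCEL_ON_SET (1 << 1)
#endif

/*
 * Hypothetical helper: returns a timerfd armed about one year ahead on
 * CLOCK_REALTIME, or -1 on error (errno is preserved). The expiry is
 * only a vehicle; the interesting event is the cancellation.
 */
int clock_change_fd(void)
{
    struct itimerspec its = { 0 };
    struct timespec now;
    int fd, saved;

    if (clock_gettime(CLOCK_REALTIME, &now) != 0)
        return -1;

    fd = timerfd_create(CLOCK_REALTIME, TFD_CLOEXEC);
    if (fd < 0)
        return -1;

    /* Absolute expiry far in the future (arbitrary: roughly one year). */
    its.it_value.tv_sec = now.tv_sec + 365L * 24 * 3600;

    /* CANCEL_ON_SET is only meaningful for absolute CLOCK_REALTIME timers. */
    if (timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
                        &its, NULL) != 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/*
 * Blocks in read(). Returns 1 if the realtime clock was changed
 * discontinuously, 0 if the timer simply expired, -1 on other errors.
 */
int wait_for_clock_change(int fd)
{
    uint64_t expirations;
    ssize_t n = read(fd, &expirations, sizeof(expirations));

    if (n == (ssize_t)sizeof(expirations))
        return 0;
    if (n < 0 && errno == ECANCELED)
        return 1;
    return -1;
}
```

A caller would typically add the fd to its poll/epoll loop instead of blocking in read(); this sketch leaves re-arming after a notification to the caller.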
I agree with Kay, this is pretty much exactly what we want for systemd. (Assuming that the time jump due to system suspend is propagated to userspace like any other time jump with this path). So yeah, I'd be very happy if this could be merged. Lennart -- Lennart Poettering - Red Hat, Inc. --
A "Tested-by: Lennart Poettering <mzxreary@0pointer.de>" would be

Hmm, and the question about why exactly the timerfd interface is a bad way to go was ignored.

-- Best Regards, Artem Bityutskiy (Артём Битюцкий) --
Good to hear. Regards, -- Alex --
I hope the time jump due to suspend is *not* propagated in the same
way to userspace :-)
What I'd like to see:
1. Time jump due to the system clock being stepped: Notification.
This is *not* a change in real time. It means the clock was
corrected/changed. No physical time passed.
2. Time jump due to suspend/resume: Different notification.
This *is* a change in real time. Physical time passed.
3. Time drift corrections: As now, no notification, it's just
the clock being regulated.
To signal the difference between 1 and 2, there ought to be some way
for userspace to determine how much of the clock delta corresponds
with physical time, by reading some sort of "monotonic" clock :-)
CLOCK_MONOTONIC is unsuitable because it stops at suspend. Maybe it
should stay that way. But maybe not - programs using CLOCK_MONOTONIC
usually want to trigger timeouts etc. based on real elapsed time, and
after suspend/resume, it's quite reasonable to want to trigger all of
a program's short timeouts immediately. Indeed some network protocol
userspace may currently behave *incorrectly* over suspend/resume,
especially those using clock times to validate their caches,
*because* CLOCK_MONOTONIC doesn't count it.
So maybe CLOCK_MONOTONIC should be changed to include elapsed time
during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
for programs that want that?
That, plus this proposed patch, would signal the difference between 1
and 2 above nicely.
-- Jamie
--
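
The three cases listed above can be told apart by comparing how far CLOCK_REALTIME moved with how much physical time passed over the same interval. The following is only a sketch of that comparison: the names jump_kind and classify_jump(), the tolerance parameter, and the assumption that d_elapsed_ns comes from a clock that keeps counting across suspend are all illustrative and not part of the patch.

```c
#include <stdint.h>

enum jump_kind {
    JUMP_NONE,          /* case 3: ordinary drift regulation */
    JUMP_CLOCK_STEPPED, /* case 1: clock corrected, no physical time passed */
    JUMP_TIME_PASSED    /* case 2: physical time passed, e.g. suspend/resume */
};

static int64_t abs_ns(int64_t v)
{
    return v < 0 ? -v : v;
}

/*
 * d_wall_ns:    change in CLOCK_REALTIME between two samples
 * d_elapsed_ns: change, over the same interval, in a clock assumed to
 *               track physical time including suspend
 * expected_ns:  interval the program expected to sleep between samples
 * tolerance_ns: slack allowed for scheduling noise and slewing
 */
enum jump_kind classify_jump(int64_t d_wall_ns, int64_t d_elapsed_ns,
                             int64_t expected_ns, int64_t tolerance_ns)
{
    /* Wall clock moved differently from physical time: it was stepped. */
    if (abs_ns(d_wall_ns - d_elapsed_ns) > tolerance_ns)
        return JUMP_CLOCK_STEPPED;

    /* Clocks agree, but far more time passed than expected. */
    if (d_elapsed_ns - expected_ns > tolerance_ns)
        return JUMP_TIME_PASSED;

    return JUMP_NONE;
}
```

If a step and a suspend happen in the same interval, this sketch reports only the step; a fuller version would return the stepped amount as well.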
Wouldn't that be an API break for programs that are expecting the current behavior of CLOCK_MONOTONIC? Yes, there should be a way to request either of them - but if there's only one way now, it should continue to act the current way, and the added way is the second option.
We do keep track of the amount of time in suspend (total_sleep_time), so creating a new clockid to provide CLOCK_MONOTONIC + total_sleep_time wouldn't be hard. We just haven't had a clear articulation of why it would be useful to expose to userland (nor a clear name to describe exactly what it represents). thanks -john --
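
For illustration only: later mainline kernels (2.6.39 and newer) expose such a clock as CLOCK_BOOTTIME, which is CLOCK_MONOTONIC plus time spent suspended. Assuming that clockid is available, userspace could estimate the accumulated suspend time roughly as below. The helper names are hypothetical, and the result is only an estimate because the two clocks are read at slightly different instants.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Linux value, in case the libc headers predate CLOCK_BOOTTIME. */
#ifndef CLOCK_BOOTTIME
#define CLOCK_BOOTTIME 7
#endif

static int64_t ts_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/*
 * Estimate of the time spent suspended since boot, in nanoseconds.
 * Returns -1 if either clock cannot be read.
 */
int64_t suspended_ns_estimate(void)
{
    struct timespec mono, boot;
    int64_t d;

    /* Read MONOTONIC first so the difference is not biased negative. */
    if (clock_gettime(CLOCK_MONOTONIC, &mono) != 0 ||
        clock_gettime(CLOCK_BOOTTIME, &boot) != 0)
        return -1;

    d = ts_to_ns(&boot) - ts_to_ns(&mono);
    return d < 0 ? 0 : d;
}
```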
I don't know. Can you think of any program which would break if suspend/resume's clocks behaved like ordinary task scheduling - when a task doesn't run for a long time because of scheduling decisions? Hmm, I guess some realtime apps might like to know. Currently CLOCK_MONOTONIC jumps forwards by 4 seconds on suspend/resume anyway (as seen by userspace), on my x86 laptop running 2.6.37-rc3. So it does already jump a bit... But see my other reply; maybe there's no need to change it. A reliable, immediate notification that CLOCK_MONOTONIC's relationship to real time has been disrupted by an unknown amount would be sufficient for the problems I have in mind. -- Jamie --
Like I mentioned earlier, CLOCK_MONOTONIC_RAW and CLOCK_MONOTONIC are tightly tied, so anything using CLOCK_MONOTONIC_RAW would break. It might be possible to change both, but I still think such a change

So just to clarify here, by this do you mean that there's a ~4 second delay between the resume event and when userland apps start to run (or possibly some of that accumulating between the app freeze and the timekeeping suspend)? Or are you seeing CLOCK_MONOTONIC jump 4 seconds out of sync with CLOCK_REALTIME?

The delta between CLOCK_MONOTONIC and CLOCK_REALTIME after resume should equal the delta prior to suspend plus the suspend time. If that's not the case, something may be broken.

thanks -john --
So actually, as I think more about this, I'm starting to come around to the side that maybe CLOCK_MONOTONIC should be changed to increment during suspend (CLOCK_MONOTONIC_RAW could also be moved forward by the same amount, which isn't really ideal, but maybe not problematic). There are still quite a number of problems that might be caused by such a change. So it may still be impractical to actually do, but more and more it does seem like it might be the better approach. I keep thinking about it. thanks -john --
Sadly this behavior depends on architecture and RTC configuration.

For x86 and a number of other architectures, read_persistent_clock() functions and we inject the time in suspend into CLOCK_REALTIME on resume. No notification would be seen.

For architectures where read_persistent_clock() does not function (usually due to the RTC not being accessible while irqs are off), we rely on the RTC code to set the time when it resumes and irqs are enabled. This happens via do_settimeofday(), so a notification would be seen.

A hook could be added so the architectures that do not support read_persistent_clock() can inject time into CLOCK_REALTIME without going through settimeofday() and triggering the notification. But there may still be odd races around other stuff running and getting the wrong time before the suspend time is injected.

This ignores any userland resume scripts that may do something like call

This is the case for read_persistent_clock() supported architectures.

Could you further expand on the needs for distinguishing between the

No. Let's not change it. CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW's relationship is tightly coupled, and applications that are tracking the amount of clock adjustment being done to the system require that they keep their semantics.

As I said earlier, adding a new clockid to represent the MONOTONIC + SUSPEND time wouldn't be difficult, we just need to be clear about why it should be exposed, and have it also be easy to describe to developers which clockid would suit their needs best.

thanks -john --
Yes, it's a correctness issue in network protocols using
lease/oplock/MESI-style cache coherency. (E.g. NFSv4, CIFS, whatever
you like in userspace.)
By this, I mean anything with this sort of pattern:
1. Receive message "you may cache thing X for up to 20 seconds *without
checking if it changed* during that time; afterwards, check".
(If the other end needs to change X within the 20 second
interval, the other end will send a request to break the lease;
if the other end doesn't get a response, then it waits until the
20 seconds expire, and then it's safe to assume the lease expired.)
2. Local request for value of X.
=> If less than 20 seconds has passed, the local cache responds
with X *without any network confirmation*. I.e. it's instant.
=> If more than 20 seconds has passed, it has to talk to the
other end. I.e. a network round trip.
The algorithm is coherent even if the network is unreliable and goes
down sometimes. When that happens, local requests are stalled, rather
than returning values incoherent with other machines.
This algorithm breaks if the local application depends on
CLOCK_MONOTONIC to confirm that less than 20 seconds has passed
and CLOCK_MONOTONIC is lying.
CLOCK_MONOTONIC lies when you've done suspend+resume while this
program was running, so its 20-second test gives the wrong result.
You can imagine there are quite a few applications that use this
technique because it's quite fundamental to efficient coherency
protocols. (Although I'm unable to name any off the top of my head!).
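
The lease check described above can be sketched as follows. The struct and function names are hypothetical, and the timestamps are plain nanosecond counts so the effect of the clock choice is easy to see: the same check gives different answers depending on whether the clock feeding it counted the time spent suspended.

```c
#include <stdbool.h>
#include <stdint.h>

/* A lease on cached value X, as in step 1 above. */
struct lease {
    int64_t granted_ns;   /* local clock reading when the lease arrived */
    int64_t duration_ns;  /* e.g. 20 seconds */
};

/*
 * Step 2: true only while the cached value may be served without a
 * network round trip. now_ns must come from the same clock that
 * produced granted_ns.
 */
bool lease_valid(const struct lease *l, int64_t now_ns)
{
    return now_ns - l->granted_ns < l->duration_ns;
}
```

Suppose a 20 second lease is granted at t = 100 s, and the machine suspends for 60 seconds after 5 seconds of running. A clock that stops during suspend reads 105 s on resume, so lease_valid() still returns true and stale data is served. A clock that includes the suspend reads 165 s, and the lease is correctly treated as expired.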
There are generalisations for more interesting distributed systems.
The thing they all have in common is the ability to locally
query "has time T elapsed, in terms that would be recognised as T by
the remote machines". In reality clocks have tolerances etc. so you
fudge by some percentage, and you are more careful about the order of
events than I have shown (it's more like "you may assume Y ...

Ok. Just curious, as similar cases I was thinking about (like AFS) require clients to have a reasonably synced CLOCK_REALTIME to the server

Yea, the case seems reasonable. I guess I'm just surprised they use

I'm not as familiar with the pm code, but if you just need suspend/resume event notification, we should already have that via the userland suspend/resume hooks.

It just seems to me that the notification you suggest is sufficient, but is only minimally useful. So, an application gets a notification that we suspended, and so CLOCK_MONOTONIC based timers may have been delayed, but without knowing by how much, it's unclear what to do. For the cache cases, sure, you can just drop everything, but I'm sure for other cases we'd be pushing the userland app to keep its own sense of the CLOCK_MONOTONIC/REALTIME delta and try to track those changes.

So providing a new CLOCK_BOOTTIME or something would seem pretty reasonable to me, allowing things like timers to be set that would expire immediately after a resume if they were to expire while the

Well, unless there is no persistent/RTC device to figure out the suspend time from, I think we could do a decent job. There are limitations (i.e. RTC hardware only providing second resolution time), but the bar for

Maybe I'm missing something, but it seems like such a notification is going to be difficult to provide with the current interfaces. And I'm not sure it resolves any races you'd have with the suspend hitting you right after the time read but before an action is taken. For such strict semantics, it almost seems like some way to inhibit suspend would be needed around the time checks and actions.

thanks -john --
