Another possibility: Optionally trust the stamped rate for the part? I understand that on Nehalem this value is available in MSR_PLATFORM_INFO[15:8] (google for MSR_PLATFORM_INFO 15 8), but I don't know if this MSR is available on older (or AMD) processors. Just wondering: If one were to put an ultra-precise scope on a processor, how far off would the calibrated value be? I'd imagine the process of calibrating one unknown crystal against a second crystal which has a known-but-not-highly-precise frequency, though good enough for most purposes, is not particularly accurate. In other words, maybe the stamped rate is more accurate than the calibrated rate anyway? --
No. Not even close. A spread-spectrum clock is inaccurate by entire percentage points. A non-spread clock is typically ±50 ppm with typical consumer PC oscillators, ±1 ppm with non-crappy but still cheap oscillators (e.g. used in cell phones.) -hpa --
Hmmm. That could be an option for newer cpus that I wouldn't oppose. While Peter is correct that the stamped value is probably not very accurate, atleast it would be constant from boot to boot, and NTP's calculated drift value would be correct. We'd need a check to make sure its not way off, since NTP will give up if its outside 500ppm. So as long as its close to the calibrated value, we probably could use it. thanks -john --
Is that still the case? I thought newer versions of NTP could deal with large values. Inaccuracies of way more than 500 ppm are everyday. -hpa --
That's scary. Yea, in the kernel the ntp freq correction tops out at 500ppm. Almost all the systems I see tend to fall in the +/- 200ppm range (if there's not something terribly wrong with the hardware). So maybe things aren't so bad out there? Or is that wishful thinking? thanks -john --
In the kernel, yes; I thought the ntp daemon itself now handled the exceptions (basically it detects if the PLL consistently veers off the rails and adjusts the timing constants.) However, you're comparing apples to oranges: you're talking about current kernels, which means a calibrated TSC, which means you're comparing to the non-spread 14.31818 MHz clock (which feeds into the HPET, PMTMR and 8254 on a standard PC platform.) In most PCs this is a separate oscillator from the bus clock which is spread spectrum. As a result, it should be in the ±50 ppm range in theory; in practice as you observe the range is wider. -hpa --
Since Brian's concern is at boot-time at which point there is no network or ntp, and assuming that it would be unwise to vary tsc_khz dynamically on a clocksource==tsc machine (is it?), would optionally lengthening the TSC<->PIT calibration beyond 25ms result in a more consistent tsc_khz between boots? Or is the relative instability an unavoidable result of skew between the PIT and the fixed constant PIT_TICK_RATE combined with algorithmic/arithmetic error? Or is the jitter of the (spread-spectrum) TSC too extreme? Or ??? If better more consistent calibration is possible, offering that as an optional kernel parameter seems better than specifying a fixed tsc_khz (stamped or user-specified) which may or may not be ignored due to "too different from measured tsc_khz". Even an (*optional*) extra second or two of boot time might be perfectly OK if it resulted in an additional five or six bits of tsc_khz precision. Thoughts, Brian? --
Making the calibration time longer should give a more precise result, but of course at the expense of longer boot time. A longer sample would make sense if the goal is to freeze it into a kernel command line variable, but the real question is how many people would actually do that (and how many people would then suffer problems because they upgraded their CPU/mobo and got massive failures on post-boot.) -hpa --
I'll admit its a feature for a minority of users. Probably why its not included. And the upgraded system issue was something I tried to address by using the calibrated value if it was off by some unreasonable amount, however folks protested that, figuring since if its explicitly stated kernel should not override it (ie: for the use case of where the calibration is broken and folks want to force the value). Also, you don't really need extra accuracy, you just need it to be the same from boot to boot. NTP keeps the correction factor persistent from boot to boot via the drift file. The boot argument is just trying to save the time (possibly hours depending on ntp config) after a reboot for NTP to correct for the new error introduced by calibration. thanks -john --
Sounds good to me. If it's non-obvious what value to choose for the new calibration, maybe specifying it in MS (per MAX_QUICK_PIT_MS On a quick sample of two machines looking at the TSC calibration done by Xen (which exposes the equivalent of tsc_khz), it appears that the stamped value is different from the calibration by about 1000ppm. YMMV. --
So pretty seriously different. Less than I would have expected, but not out of the ballpark. -hpa --
I still sorta think the clearest thing is to keep the kernel's quick, dynamic calibration at boot, and expose the tsc_khz it decides on. People who don't care about clocksync get their quick boot, NTP knows enough to correct its drift estimate when it starts, and users don't need to learn all these details to get stable clocksync (*). If we're being strict, something like NTP needs to know exactly what's driving gettimeofday(). If the clocksource changes, it needs to know that so it could correct or trash its drift estimate. If there's one-time calibration, it needs to know what the result was. If there's continuous calibration, it either needs to be notified, or have the ability to disable it. Right? So I think exporting tsc_khz in some form is a step in the right direction. So what's wrong with just adding a /sys/devices/system/clocksource/clocksource0/tsc_khz? Maybe Thomas Gleixner's suggestion of a vget_tsc_raw() would also suffice, I'm not sure I understand the details enough. Any of the other fixes people have discussed (tsc_khz= bootopt, tsc_long_calibration=1) would be enough to make me happy though :) (*) Though they still need to learn enough to coax the kernel into giving them a fast gettimeofday(). That's a price you gotta pay if you care enough :) --
As an RFC:
Add clocksource.sys_register & sys_unregister so the
current clocksource can add supplemental information to
/sys/devices/system/clocksource/clocksource0/
Export tsc_khz when current_clocksource==tsc so that
daemons like NTP can account for the variability of
calibration results.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 9faf91a..9c99965 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -76,6 +76,11 @@ unsigned long long
sched_clock(void) __attribute__((alias("native_sched_clock")));
#endif
+int sysfs_tsc_register(struct sys_device *clocksource_dev,
+ struct clocksource *cs);
+void sysfs_tsc_unregister(struct sys_device *clocksource_dev,
+ struct clocksource *cs);
+
int check_tsc_unstable(void)
{
return tsc_unstable;
@@ -757,6 +762,8 @@ static struct clocksource clocksource_tsc = {
#ifdef CONFIG_X86_64
.vread = vread_tsc,
#endif
+ .sys_register = sysfs_tsc_register,
+ .sys_unregister = sysfs_tsc_unregister,
};
void mark_tsc_unstable(char *reason)
@@ -967,3 +974,22 @@ void __init tsc_init(void)
init_tsc_clocksource();
}
+static ssize_t show_tsc_khz(
+ struct sys_device *dev, struct sysdev_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%u\n", tsc_khz);
+}
+
+static SYSDEV_ATTR(tsc_khz, 0444, show_tsc_khz, NULL);
+
+int sysfs_tsc_register(struct sys_device *clocksource_dev,
+ struct clocksource *cs)
+{
+ return sysdev_create_file(clocksource_dev, &attr_tsc_khz);
+}
+
+void sysfs_tsc_unregister(struct sys_device *clocksource_dev,
+ struct clocksource *cs)
+{
+ sysdev_remove_file(clocksource_dev, &attr_tsc_khz);
+}
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 5ea3c60..d9f6f13 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -15,6 +15,7 @@
#include <linux/cache.h>
#include <linux/timer.h>
#include ...I think this is a bad idea, as it creates an ABI that is arch AND machine specific, which will cause portability problems in applications that expect the interface to be there. thanks -john --
It's an arch-independent ABI that returns ENOENT on unsupported platforms ;) Could you please explain what you envision as an arch-independent solution to this problem? I guess the tsc_long_calibration=1 alternative is one. --
Arch independent solution is to provide information about the current clock source in general. This is _NOT_ a TSC specific problem, you have the same trouble with any other clocksource which gets calibrated and does not take it's frequency as a constant value from boot loader, configuration or some CPU/chipset register. The only missing piece is a frequency member in struct clocksource which needs to be filled in by the arch/machine specific code. Thanks, tglx --
Right but having applications add "Linux on x86 where the TSC is being used" logic is pretty poor solution. Its an issue that should be addressed from the kernel side. And really, if apps really wanted this info, they can fish it out ...and the tsc_khz= patch I posted earlier. thanks -john --
Really? I was under the impression that tsc_khz can differ from cpu_mhz (invariant tsc?), and cpu_mhz can differ from what shows up in /proc/cpuinfo cpuMHz due to cpufreq scaling. I was also under the impression that knowing or controlling tsc_khz is what NTP needs to ensure stability (assuming the TSC is otherwise stable, i.e. no halts-in-idle, NMI etc etc weirdness). Another possibility: $ cd /sys/devices/system/clocksource/clocksource0/ $ ls -lR available_clocksource current_clocksource current_clocksource_ln -> tsc tsc/ tsc/calibration tsc/calibrated_master -> ../hpet tsc/khz hpet/ hpet/calibration hpet/khz $ cat tsc/calibration slave # there has been a one-time calibration against a reference at boot time, # the source clock is in calibrated_master and and the khz is calculated # from that $ cat hpet/calibration constant # takes its value from constant value from boot loader, configuration # or some CPU/chipset register Would this be workable? I need to look deeper at how the other clocksources work, for example the virtualized ones. I'm also wondering if NICs with their own clocks & IEEE-1588 support are going to become part of the clocksource infrastructure (see e.g. http://patchwork.ozlabs.org/patch/52626/) Thanks everyone for the guidance. --
Bah. You're right. I shouldn't be emailing this early :) Even so, I'm still not a fan of the "expose raw details so userland apps While I'm not a huge fan of it, Thomas' way would be a bit more palatable. NTP can check the initial freq the clocksource was registered and if its different from the last boot decide if it can recalculate that into a new correction factor, or just throw out the drift file value. Brian: is this something the NTPd folks actually want? Has anyone checked with them before we hand down the solution from high upon on lkml mountain? Personally I think NTPd should be a little more savvy about how far it trusts the drift file when it starts up. Since I believe its fast-startup mode can quickly estimate the drift well within 100ppm, which is about the maximum variance I've seen from the calibration code. thanks -john --
Engaging with them is probably a good idea. In the past, the NTP core folks have been extremely anti-Linux and pro-BSD and therefore unwilling to talk, but that has at least in part been due to what they perceive as unilateral actions on our part. -hpa --
I haven't checked, it's been a while since I dealt with this problem. The NTP maintainers definitely complain about the quick TSC calibration code like it's a bug: (e.g. http://www.mail-archive.com/questions@lists.ntp.org/msg02079.html). Anyway I'll reach out before I spend any time investing in The workaround we went with was to remove the drift file on every reboot. But in our experience, even with iburst, converging takes a long time. I don't have hard numbers since it's been a long time since I investigated the problem, but we defined failure as >1ms offset syncing to a server in our LAN, and a cold NTP boot takes 10-20 hours to get there. I was hoping that being able to reuse the drift information across boots would shorten convergence time. I think that in principle it's a nice thing to be able to do. Though as far as I'm aware, neither chrony nor PTPd (IEEE 1588) attempts to do this. --
Yes, Prof. Mills in particular (for those who don't know, he's "Mr. NTP") really gets upset about the way Linux does timekeeping. Unfortunately it's not clear to me that he's willing to work with us as opposed to wanting things to work exactly like BSD, and swive the non-NTP users. -hpa --
Although I suspect his dislike for Linux is historical, as Roman reworked the ntp code to follow the NTPv4 reference implementation back in the 2.6.19 timeframe. However I'd be more then happy to try to address any specific deficiencies with Linux's NTP implementation if someone can better express Prof Mills' critiques are. thanks -john --
That's the $10M question... I think it starts with asking the NTP community for advice. -hpa --
Ok. If its been awhile, you may find recent kernels (2.6.31+) are much faster to converge due to adjustments made to the SHIFT_PLL constant. This was done explicitly to address issues similar to what you describe above. thanks -john --
My tests were pre-2.6.31, this is really good to know. I'll take another look on recent kernels. --
Actually there is already a frequency in struct clocksource except it's represented by the two components: mult and shift. Maybe it would be best to expose these instead of khz (for all clocksources) so as to limit abuse by naive users. So, Thomas and John, if Brian's patch is modified to provide: /sys/devices/system/clocksource/clocksource0/current_mult /sys/devices/system/clocksource/clocksource0/current_shift and/or /sys/devices/system/clocksource/clocksource0/current_khz is that an acceptable arch-independent patch? (And which do you prefer?) --
I'd rather prefer the frequency interface for a simple reason. It allows to add a commandline option which provides the NTP folks with a sensible solution to their calibration problem as I don't see that a longer calibration time will reliably fix it. So we'd get a "clocksource_freq=XXX" option which would be applied to the clocksource which is selected on the command line with "clocksource=NNN". John ? Thanks, tglx --
I'd rather see a generic solution which provides the information of the current (and possibly those of the available) clock source(s). This x86 centric TSC world view is horrible. Thanks, tglx --
I was assuming that extra accuracy would decrease the ntp convergence time by about the same factor (5-6 bits of extra accuracy would decrease ntp convergence time by 32-64x). Not sure why upgraded mobo's would fail due to a longer sample? As more and more systems become dependent on clocksource==tsc and more and more people assume nanosecond-class measurements are relatively accurate, I'd expect the accuracy of tsc_khz to become more important. While desktop users might bristle at an extra second of boot delay, I'll bet many server farm administrators would gladly pay that upfront cost if they know an option exists. --
Not really. The delta measurements aren't the issue here, but rather walltime convergence. -hpa --
Sorry, this is sort of mixing points. I was saying you don't need more accuracy (as opposed to what H. Peter mentioned below) when setting the tsc_khz= option I proposed. Since it will be constant from boot to boot, and thus will reduce the ntp convergence time. However, without such a boot option, more accuracy from an increased calibration time would help. However, the tradeoff of a longer boot time Again, this is mixing the discussion. The concern was users of a tsc_khz= boot option might have problems when they upgrade, as the Maybe something like a tsc_long_calibration=1 option would allow for this? However, I really do like the idea of pulling the stamped value from the MSR and if its close to what we quickly calibrated, use that. thanks -john --
On a system with synchronized TSC and multiple cores you could also simply do a longer calibration on another core in the background after a quick "fast calibration" -Andi --
