Re: [PATCH] x86: Export tsc related information in sysfs

Previous thread: [RFC] User Guide for Sysfs and libudev by Alan Ott on Monday, May 24, 2010 - 11:08 am. (4 messages)

Next thread: [PATCH 01/11] Add support for hvm_op by Stefano Stabellini on Monday, May 24, 2010 - 11:27 am. (1 message)
From: Dan Magenheimer
Date: Monday, May 24, 2010 - 11:13 am

Another possibility: Optionally trust the stamped rate for the part?

I understand that on Nehalem this value is available in
MSR_PLATFORM_INFO[15:8] (google for MSR_PLATFORM_INFO 15 8),
but I don't know if this MSR is available on older (or AMD)
processors.

Just wondering:  If one were to put an ultra-precise scope on
a processor, how far off would the calibrated value be?  I'd
imagine the process of calibrating one unknown crystal against
a second crystal which has a known-but-not-highly-precise
frequency, though good enough for most purposes, is not particularly
accurate. In other words, maybe the stamped rate is more accurate
than the calibrated rate anyway?
--

From: H. Peter Anvin
Date: Monday, May 24, 2010 - 11:19 am

No.  Not even close.

A spread-spectrum clock is inaccurate by entire percentage points.
A non-spread clock is typically ±50 ppm with typical consumer PC
oscillators, ±1 ppm with non-crappy but still cheap oscillators (e.g.
used in cell phones.)

	-hpa
--

From: john stultz
Date: Monday, May 24, 2010 - 11:51 am

Hmmm. That could be an option for newer cpus that I wouldn't oppose.

While Peter is correct that the stamped value is probably not very
accurate, atleast it would be constant from boot to boot, and NTP's
calculated drift value would be correct. 

We'd need a check to make sure its not way off, since NTP will give up
if its outside 500ppm.  So as long as its close to the calibrated value,
we probably could use it.

thanks
-john


--

From: H. Peter Anvin
Date: Monday, May 24, 2010 - 1:20 pm

Is that still the case?  I thought newer versions of NTP could deal with
large values.  Inaccuracies of way more than 500 ppm are everyday.

	-hpa
--

From: john stultz
Date: Monday, May 24, 2010 - 1:39 pm

That's scary. 

Yea, in the kernel the ntp freq correction tops out at 500ppm. Almost
all the systems I see tend to fall in the +/- 200ppm range (if there's
not something terribly wrong with the hardware).

So maybe things aren't so bad out there? Or is that wishful thinking?

thanks
-john


--

From: H. Peter Anvin
Date: Monday, May 24, 2010 - 2:26 pm

In the kernel, yes; I thought the ntp daemon itself now handled the
exceptions (basically it detects if the PLL consistently veers off the
rails and adjusts the timing constants.)

However, you're comparing apples to oranges: you're talking about
current kernels, which means a calibrated TSC, which means you're
comparing to the non-spread 14.31818 MHz clock (which feeds into the
HPET, PMTMR and 8254 on a standard PC platform.)  In most PCs this is a
separate oscillator from the bus clock which is spread spectrum.  As a
result, it should be in the ±50 ppm range in theory; in practice as you
observe the range is wider.

	-hpa
--

From: Dan Magenheimer
Date: Monday, May 24, 2010 - 3:04 pm

Since Brian's concern is at boot-time at which point there is no
network or ntp, and assuming that it would be unwise to vary tsc_khz
dynamically on a clocksource==tsc machine (is it?), would optionally
lengthening the TSC<->PIT calibration beyond 25ms result in a more
consistent tsc_khz between boots?  Or is the relative instability
an unavoidable result of skew between the PIT and the fixed constant
PIT_TICK_RATE combined with algorithmic/arithmetic error?  Or is
the jitter of the (spread-spectrum) TSC too extreme?  Or ???

If better more consistent calibration is possible, offering
that as an optional kernel parameter seems better than specifying
a fixed tsc_khz (stamped or user-specified) which may or may
not be ignored due to "too different from measured tsc_khz".
Even an (*optional*) extra second or two of boot time might
be perfectly OK if it resulted in an additional five or six
bits of tsc_khz precision.

Thoughts, Brian?
--

From: H. Peter Anvin
Date: Monday, May 24, 2010 - 3:30 pm

Making the calibration time longer should give a more precise result,
but of course at the expense of longer boot time.

A longer sample would make sense if the goal is to freeze it into a
kernel command line variable, but the real question is how many people
would actually do that (and how many people would then suffer problems
because they upgraded their CPU/mobo and got massive failures on post-boot.)

	-hpa
--

From: john stultz
Date: Monday, May 24, 2010 - 3:49 pm

I'll admit its a feature for a minority of users. Probably why its not
included. 

And the upgraded system issue was something I tried to address by using
the calibrated value if it was off by some unreasonable amount, however
folks protested that, figuring since if its explicitly stated kernel
should not override it (ie: for the use case of where the calibration is
broken and folks want to force the value).

Also, you don't really need extra accuracy, you just need it to be the
same from boot to boot. NTP keeps the correction factor persistent from
boot to boot via the drift file. The boot argument is just trying to
save the time (possibly hours depending on ntp config) after a reboot
for NTP to correct for the new error introduced by calibration.

thanks
-john




--

From: Dan Magenheimer
Date: Monday, May 24, 2010 - 5:01 pm

Sounds good to me.  If it's non-obvious what value to choose for
the new calibration, maybe specifying it in MS (per MAX_QUICK_PIT_MS

On a quick sample of two machines looking at the TSC calibration
done by Xen (which exposes the equivalent of tsc_khz), it appears
that the stamped value is different from the calibration by about
1000ppm.  YMMV.
--

From: H. Peter Anvin
Date: Monday, May 24, 2010 - 5:07 pm

So pretty seriously different.  Less than I would have expected, but not
out of the ballpark.

	-hpa
--

From: Brian Bloniarz
Date: Monday, May 24, 2010 - 6:33 pm

I still sorta think the clearest thing is to keep the kernel's
quick, dynamic calibration at boot, and expose the tsc_khz it
decides on. People who don't care about clocksync get their
quick boot, NTP knows enough to correct its drift estimate
when it starts, and users don't need to learn all these
details to get stable clocksync (*).

If we're being strict, something like NTP needs to know exactly
what's driving gettimeofday(). If the clocksource changes, it
needs to know that so it could correct or trash its drift
estimate. If there's one-time calibration, it needs to know
what the result was. If there's continuous calibration, it
either needs to be notified, or have the ability to disable
it. Right? So I think exporting tsc_khz in some form is a
step in the right direction.

So what's wrong with just adding a
/sys/devices/system/clocksource/clocksource0/tsc_khz?
Maybe Thomas Gleixner's suggestion of a vget_tsc_raw()
would also suffice, I'm not sure I understand the details
enough.

Any of the other fixes people have discussed (tsc_khz=
bootopt, tsc_long_calibration=1) would be enough to make
me happy though :)

(*) Though they still need to learn enough to coax the
kernel into giving them a fast gettimeofday(). That's a
price you gotta pay if you care enough :)

--

From: Brian Bloniarz
Date: Tuesday, May 25, 2010 - 5:16 pm

As an RFC:

Add clocksource.sys_register & sys_unregister so the
current clocksource can add supplemental information to
/sys/devices/system/clocksource/clocksource0/

Export tsc_khz when current_clocksource==tsc so that
daemons like NTP can account for the variability of
calibration results.

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 9faf91a..9c99965 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -76,6 +76,11 @@ unsigned long long
 sched_clock(void) __attribute__((alias("native_sched_clock")));
 #endif
 
+int sysfs_tsc_register(struct sys_device *clocksource_dev,
+		       struct clocksource *cs);
+void sysfs_tsc_unregister(struct sys_device *clocksource_dev,
+			  struct clocksource *cs);
+
 int check_tsc_unstable(void)
 {
 	return tsc_unstable;
@@ -757,6 +762,8 @@ static struct clocksource clocksource_tsc = {
 #ifdef CONFIG_X86_64
 	.vread                  = vread_tsc,
 #endif
+	.sys_register           = sysfs_tsc_register,
+	.sys_unregister         = sysfs_tsc_unregister,
 };
 
 void mark_tsc_unstable(char *reason)
@@ -967,3 +974,22 @@ void __init tsc_init(void)
 	init_tsc_clocksource();
 }
 
+static ssize_t show_tsc_khz(
+	struct sys_device *dev, struct sysdev_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", tsc_khz);
+}
+
+static SYSDEV_ATTR(tsc_khz, 0444, show_tsc_khz, NULL);
+
+int sysfs_tsc_register(struct sys_device *clocksource_dev,
+                         struct clocksource *cs)
+{
+    return sysdev_create_file(clocksource_dev, &attr_tsc_khz);
+}
+
+void sysfs_tsc_unregister(struct sys_device *clocksource_dev,
+                          struct clocksource *cs)
+{
+    sysdev_remove_file(clocksource_dev, &attr_tsc_khz);
+}
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 5ea3c60..d9f6f13 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -15,6 +15,7 @@
 #include <linux/cache.h>
 #include <linux/timer.h>
 #include ...
From: john stultz
Date: Tuesday, May 25, 2010 - 5:48 pm

I think this is a bad idea, as it creates an ABI that is arch AND
machine specific, which will cause portability problems in applications
that expect the interface to be there.

thanks
-john


--

From: Brian Bloniarz
Date: Tuesday, May 25, 2010 - 7:50 pm

It's an arch-independent ABI that returns ENOENT on
unsupported platforms ;)

Could you please explain what you envision as an
arch-independent solution to this problem?
I guess the tsc_long_calibration=1 alternative is
one.
--

From: Thomas Gleixner
Date: Wednesday, May 26, 2010 - 5:35 am

Arch independent solution is to provide information about the current
clock source in general. This is _NOT_ a TSC specific problem, you
have the same trouble with any other clocksource which gets calibrated
and does not take it's frequency as a constant value from boot loader,
configuration or some CPU/chipset register. The only missing piece is
a frequency member in struct clocksource which needs to be filled in
by the arch/machine specific code.

Thanks,

	tglx
--

From: john stultz
Date: Wednesday, May 26, 2010 - 8:04 am

Right but having applications add "Linux on x86 where the TSC is being
used" logic is pretty poor solution. Its an issue that should be
addressed from the kernel side.

And really, if apps really wanted this info, they can fish it out

...and the tsc_khz= patch I posted earlier.


thanks
-john


--

From: Brian Bloniarz
Date: Wednesday, May 26, 2010 - 9:02 am

Really? I was under the impression that tsc_khz can differ
from cpu_mhz (invariant tsc?), and cpu_mhz can differ from what
shows up in /proc/cpuinfo cpuMHz due to cpufreq scaling. I was
also under the impression that knowing or controlling tsc_khz
is what NTP needs to ensure stability (assuming the TSC is
otherwise stable, i.e. no halts-in-idle, NMI etc etc weirdness).



Another possibility:

$ cd /sys/devices/system/clocksource/clocksource0/

$ ls -lR
available_clocksource
current_clocksource
current_clocksource_ln -> tsc
tsc/
tsc/calibration
tsc/calibrated_master -> ../hpet
tsc/khz
hpet/
hpet/calibration
hpet/khz

$ cat tsc/calibration
slave
# there has been a one-time calibration against a reference at boot time,
# the source clock is in calibrated_master and and the khz is calculated
# from that

$ cat hpet/calibration
constant
# takes its value from constant value from boot loader, configuration
# or some CPU/chipset register

Would this be workable? I need to look deeper at how the other clocksources
work, for example the virtualized ones. I'm also wondering if NICs with their
own clocks & IEEE-1588 support are going to become part of the clocksource
infrastructure (see e.g. http://patchwork.ozlabs.org/patch/52626/)

Thanks everyone for the guidance.
--

From: john stultz
Date: Wednesday, May 26, 2010 - 9:25 am

Bah. You're right. I shouldn't be emailing this early :)

Even so, I'm still not a fan of the "expose raw details so userland apps

While I'm not a huge fan of it, Thomas' way would be a bit more
palatable. 

NTP can check the initial freq the clocksource was registered and if its
different from the last boot decide if it can recalculate that into a
new correction factor, or just throw out the drift file value.

Brian: is this something the NTPd folks actually want? Has anyone
checked with them before we hand down the solution from high upon on
lkml mountain?

Personally I think NTPd should be a little more savvy about how far it
trusts the drift file when it starts up. Since I believe its
fast-startup mode can quickly estimate the drift well within 100ppm,
which is about the maximum variance I've seen from the calibration code.

thanks
-john

--

From: H. Peter Anvin
Date: Wednesday, May 26, 2010 - 11:24 am

Engaging with them is probably a good idea.  In the past, the NTP core
folks have been extremely anti-Linux and pro-BSD and therefore unwilling
to talk, but that has at least in part been due to what they perceive as
unilateral actions on our part.

	-hpa

--

From: Brian Bloniarz
Date: Wednesday, May 26, 2010 - 11:44 am

I haven't checked, it's been a while since I dealt with
this problem. The NTP maintainers definitely complain about the
quick TSC calibration code like it's a bug:
(e.g. http://www.mail-archive.com/questions@lists.ntp.org/msg02079.html).
Anyway I'll reach out before I spend any time investing in

The workaround we went with was to remove the drift file on 
every reboot. But in our experience, even with iburst, converging takes
a long time. I don't have hard numbers since it's been a long time since
I investigated the problem, but we defined failure as >1ms offset syncing
to a server in our LAN, and a cold NTP boot takes 10-20 hours to get
there.

I was hoping that being able to reuse the drift information
across boots would shorten convergence time. I think that in principle
it's a nice thing to be able to do. Though as far as I'm aware, neither
chrony nor PTPd (IEEE 1588) attempts to do this.
--

From: H. Peter Anvin
Date: Wednesday, May 26, 2010 - 11:51 am

Yes, Prof. Mills in particular (for those who don't know, he's "Mr.
NTP") really gets upset about the way Linux does timekeeping.
Unfortunately it's not clear to me that he's willing to work with us as
opposed to wanting things to work exactly like BSD, and swive the
non-NTP users.

	-hpa
--

From: john stultz
Date: Wednesday, May 26, 2010 - 1:19 pm

Although I suspect his dislike for Linux is historical, as Roman
reworked the ntp code to follow the NTPv4 reference implementation back
in the 2.6.19 timeframe.

However I'd be more then happy to try to address any specific
deficiencies with Linux's NTP implementation if someone can better
express Prof Mills' critiques are.

thanks
-john




--

From: H. Peter Anvin
Date: Wednesday, May 26, 2010 - 2:06 pm

That's the $10M question... I think it starts with asking the NTP
community for advice.

	-hpa
--

From: john stultz
Date: Wednesday, May 26, 2010 - 12:49 pm

Ok. If its been awhile, you may find recent kernels (2.6.31+) are much
faster to converge due to adjustments made to the SHIFT_PLL constant.
This was done explicitly to address issues similar to what you describe
above.

thanks
-john

--

From: Brian Bloniarz
Date: Wednesday, May 26, 2010 - 1:22 pm

My tests were pre-2.6.31, this is really good to know. I'll take
another look on recent kernels.
--

From: Dan Magenheimer
Date: Wednesday, May 26, 2010 - 7:26 am

Actually there is already a frequency in struct clocksource except
it's represented by the two components: mult and shift.  Maybe
it would be best to expose these instead of khz (for all clocksources)
so as to limit abuse by naive users.

So, Thomas and John, if Brian's patch is modified to provide:

/sys/devices/system/clocksource/clocksource0/current_mult
/sys/devices/system/clocksource/clocksource0/current_shift
and/or
/sys/devices/system/clocksource/clocksource0/current_khz

is that an acceptable arch-independent patch? (And which do
you prefer?)

--

From: Thomas Gleixner
Date: Wednesday, May 26, 2010 - 7:41 am

I'd rather prefer the frequency interface for a simple reason. It
allows to add a commandline option which provides the NTP folks with a
sensible solution to their calibration problem as I don't see that a
longer calibration time will reliably fix it.

So we'd get a "clocksource_freq=XXX" option which would be applied to
the clocksource which is selected on the command line with
"clocksource=NNN".

John ?

Thanks,

	tglx
--

From: Thomas Gleixner
Date: Wednesday, May 26, 2010 - 5:30 am

I'd rather see a generic solution which provides the information of
the current (and possibly those of the available) clock source(s). 

This x86 centric TSC world view is horrible.

Thanks,

	tglx
--

From: Dan Magenheimer
Date: Monday, May 24, 2010 - 4:16 pm

I was assuming that extra accuracy would decrease the ntp
convergence time by about the same factor (5-6 bits of extra
accuracy would decrease ntp convergence time by 32-64x).

Not sure why upgraded mobo's would fail due to a longer sample?

As more and more systems become dependent on clocksource==tsc
and more and more people assume nanosecond-class measurements
are relatively accurate, I'd expect the accuracy of tsc_khz
to become more important.  While desktop users might bristle
at an extra second of boot delay, I'll bet many server
farm administrators would gladly pay that upfront cost
if they know an option exists.
--

From: H. Peter Anvin
Date: Monday, May 24, 2010 - 4:19 pm

Not really.  The delta measurements aren't the issue here, but rather
walltime convergence.

	-hpa
--

From: john stultz
Date: Monday, May 24, 2010 - 4:30 pm

Sorry, this is sort of mixing points. I was saying you don't need more
accuracy (as opposed to what H. Peter mentioned below) when setting the
tsc_khz= option I proposed. Since it will be constant from boot to boot,
and thus will reduce the ntp convergence time.

However, without such a boot option, more accuracy from an increased
calibration time would help. However, the tradeoff of a longer boot time

Again, this is mixing the discussion. The concern was users of a
tsc_khz= boot option might have problems when they upgrade, as the

Maybe something like a tsc_long_calibration=1 option would allow for
this?  

However, I really do like the idea of pulling the stamped value from the
MSR and if its close to what we quickly calibrated, use that.

thanks
-john

--

From: Andi Kleen
Date: Monday, May 24, 2010 - 4:42 pm

On a system with synchronized TSC and multiple cores you could also simply do 
a longer calibration on another core in the background after a quick "fast
calibration"

-Andi
--

Previous thread: [RFC] User Guide for Sysfs and libudev by Alan Ott on Monday, May 24, 2010 - 11:08 am. (4 messages)

Next thread: [PATCH 01/11] Add support for hvm_op by Stefano Stabellini on Monday, May 24, 2010 - 11:27 am. (1 message)