An ancient laptop of mine started throwing errors from b43legacy when
I started using 2.6.27 on it. This has been bisected to commit bfc0f59
"x86: merge tsc calibration". Incidentally, I was appalled at the
number of build errors that were found while bisecting in this region.
Every commit from 2.6.26-rc9-00715 to at least -00719 had include
errors that had to be fixed before it would compile. Those fixes are
the only reason that the builds below are "dirty".
The CPU in this computer is an AMD-K6 at stepping 0c and running at
450 MHz.
The critical differences in the dmesg output between the "good" and
"bad" results indicate a factor of 2 difference in the clock speed,
and are shown below:
============================================================
finger@larrylap:/ide1/mtech/wireless-testing> diff -u /ide1/dmesg.good
/ide1/dmesg.bad
--- /ide1/dmesg.good 2008-08-31 16:23:32.000000000 -0500
+++ /ide1/dmesg.bad 2008-08-31 16:26:25.000000000 -0500
@@ -1,4 +1,4 @@
-Linux version 2.6.26-rc9-test-00715-g0ef9553-dirty (finger@larrylap)
(gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision
135036] (SUSE Linux) ) #29 Sun Aug 31 15:37:05 ACT 2008
+Linux version 2.6.26-rc9-test-00716-gbfc0f59-dirty (finger@larrylap)
(gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision
135036] (SUSE Linux) ) #31 Sun Aug 31 16:52:23 ACT 2008
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
@@ -54,7 +54,8 @@
Kernel command line: root=/dev/sda3 vga=0x314 resume=/dev/sda2
splash=silent
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 4096 bytes)
-Detected 428.846 MHz processor.
+TSC calibrated against PM_TIMER
+Detected 214.401 MHz processor.
Console: colour dummy device 80x25
console [tty0] enabled
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
@@ -69,7 +70,7 @@
.text : 0xc0100000 - 0xc02638ac ...In both cases the TSC is ahead of the pm_timer, which looks like the pm_timer is behaving strange. Can you please disable the pm_timer (in the kernel config, unfortunately there is no command line option for that) for a test and provide the relevant output of demsg ? Thanks, tglx --
I manually deleted CONFIG_X86_PM_TIMER from .config, but it returns silently. The only way I see to turn off the pm_timer is to disable power management. Is that what I need to do? Larry --
The switch depends on CONFIG_EMBEDDED. Just turn that on in "General Setup"
[*] Configure standard kernel features (for small systems) --->
Then you get an extra switch in:
"Powermanagement Options"
ACPI (Advanced Configuration and Power Interface) Support --->
[*] Power Management Timer Support
Turn that one off.
Thanks,
tglx
--
It took a while to figure out how to kill the pm_timer. I finally did
it by changing the default to no rather than yes. I also reset the
bisection and compiled a full -rc4 kernel.
What I hope is the relevant output of dmesg is below. The clock rate
is correctly determined, and the b43legacy errors are gone.
Thanks - Larry
Linux version 2.6.27-rc4-wl-15747-g2e3bbe3-dirty (finger@larrylap)
(gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision
135036] (SUSE Linux) ) #36 Mon Sep 1 12:19:13 ACT 2008
--snip--
Kernel command line: root=/dev/sda3 vga=0x314 resume=/dev/sda2
splash=silent
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 4096 bytes)
TSC calibrated against PIT
Detected 428.823 MHz processor.
Console: colour dummy device 80x25
console [tty0] enabled
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 252692k/262080k available (1442k kernel code, 8836k reserved,
613k data, 188k init, 0k highmem)
virtual kernel memory layout:
fixmap : 0xffffa000 - 0xfffff000 ( 20 kB)
vmalloc : 0xd0800000 - 0xffff8000 ( 759 MB)
lowmem : 0xc0000000 - 0xcfff0000 ( 255 MB)
.init : 0xc0306000 - 0xc0335000 ( 188 kB)
.data : 0xc0268b54 - 0xc0302208 ( 613 kB)
.text : 0xc0100000 - 0xc0268b54 (1442 kB)
Checking if this processor honours the WP bit even in supervisor
mode...Ok.
CPA: page pool initialized 1 of 1 pages preallocated
Calibrating delay loop (skipped), value calculated using timer
frequency.. 857.64 BogoMIPS (lpj=1715292)
--
Hmm. Haven't seen that before, but if confirms what I guessed from your previous dmesg information. I wonder why you did not observe strange behaviour with older kernel versions. I don't mean the b43legacy errors, which might be caused by the wrong calibrated TSC, but those even should show strange behaviour vs. time. Can you please provide the output of an older "working" kernel version from: # cat /sys/devices/system/clocksource/clocksource0/current_clocksource after the TSC was set to unstable. It should say acpi_pm. If that's the case please run # time sleep 60 on a shell and provide the output and verify it against a knwn to be halfways correct stopwatch. Then do the same on the current mainline with pm_timer disabled. current clocksource should be either jiffies or tsc. Thanks, tglx --
x86-32 never used the PM_TIMER for frequency estimation, it only ever used the PIT. See the old "native_calculate_cpu_khz()" in tsc_32.c that you deleted in favor of the (imho inferior) x86-64 version. How about: - taking the old 32-bit code, and using it to initially _just_ estimate the TSC speed. That code was stable and pretty much guaranteed to work reasonably well on all machines. It retries the timings three times, and picks the best one. - Then, _after_ you already have a pretty good estimation for TSC, you can use _that_ to then get the HPET and/or PM_TIMER version (and not use the PIT at all for those calibrations) - and if the PM_TIMER one is too far off, just throw it away. We know the PIT is a lot more trustworthy than the PM_TIMER. Hmm? Linus --
Far off in which direction ?
If the PIT interrupts are delayed by SMM code, then I see That's on a
max. three years old 32bit Core Duo things like:
[ 0.000000] Detected 8340.258 MHz processor.
[ 13.782091] APIC calibration not consistent with PM Timer: 228ms instead of 100ms
This one is way off, while the next one is in a reasonable range
[ 0.000000] Detected 3240.001 MHz processor.
[ 13.792122] APIC calibration not consistent with PM Timer: 178ms instead of 100ms
while in reality the machine is @2GHZ and current mainline says:
[ 0.000000] Detected 2000.065 MHz processor.
The CPU calibration of < 2.6.27 is against PIT and does _NOT_ give me a
pretty good estimation for TSC.
I was pretty happy when Alok beat me to unify the TSC calibration code
as it solved one of my long standing todo items, which also filled my
buglist on a regular base.
I did debugged this thorougly using the tracer from preempt-rt to
check, what the box does during that time, and it definitely vanishes
for >100ms in a row in the black hole of the stupid BIOS.
So either way. Relying on PIT on newer machines is _BAD_, relying on
PM_TIMER on older machine is _BAD_ as well.
There is no given good estimate, when the TSC/PIT calibration is off
by factor 1.5 to 4. The consequence would be that I throw away a
perfect fine pmtimer and run a machine which advertises itself as the
fastest box on the planet. With your method I would disable nohz and I
would be back to 50% battery time.
I'm happy to discard the PIT on the 32bit machines again and then file
a bugreport for a regression between 2.6.27-rc1 and tomorrows git :)
This one is the first complaints, I've seen vs. a non working pmtimer
since quite a time. That's why I obviously forgot about the rate check
issue.
I just looked at drivers/clocksource/acpi_pm.c history and saw, that
John explicitely mentions AMD K6 in commit
562f9c574e0707f9159a729ea41faf53b221cd30
This patch re-adds the verify_pmtmr_rate ...ooops ---------------------^^^ wanted to say PMTIMER of course :)
tglx
--
Umm. Which one. The 32-bit or the 64-bit? I already pointed out that the 64-bit PIT calibration is pure and utter The problem is that he unified it around the _worse_ code. Linus --
So then the PIT isn't going to work on that machine. Do you have HPET? Because if so, the most obvious choice would be to say: - on old machines, the PIT is likely more reliable than PM_TIMER - on new machines, the HPET is likely more reliable than PIT and there's actually a fairly obvious place to distinguish between old and new: if ti has a HPET, consider it a new one. But that just means that there's never any reason to use PM_TIMER anyway, and the proper patch would probably be to just ignore it. No? Linus --
There is a HPET timer on that machine, but it is only available by enforcing the address via black magic because the BIOS does advertise it on the wrong location and the newest BIOS version does not advertise it at all. I have not seen a single bug report against PM timer aside of those K6 area systems, but we have - HPETs which are not advertised at all - HPETs which are advertised at the wrong address - HPETs which do not count at all - HPETs which tell us bogus operating frequencies - HPETs which have bogus interrupt routings So why should we rely on the HPET ? The whole ACPI code relies pretty much on PM_TIMER and as I said before there are no PM-TIMER related bug reports aside of those documented K6 ones, which Alok did probably not know about and I forgot them as well. Yeah, that's my fault. Thanks, tglx --
The Linux ACPI code does not rely on PM_TIMER in any special ways AFAIK (except that it occasionally uses standard linux timer functions) It merely discovers it for use by the other pmtimer users. -Andi -- ak@linux.intel.com --
# grep -rc acpi_gbl_FADT.xpm_timer_block.address drivers/acpi/ | grep -v ":0" drivers/acpi/processor_idle.c:12 I need to correct myself. It's only the processor_idle code :) Thanks, tglx --
Btw, this sentence of yours just doesn't seem to make much sense.
The thing is, the calibration code doesn't even use interrupts. It just
reads the PIT timer value.
And the thing is, the 64-bit code really does things that the 32-bit code
does _not_ do.
Just as an example, the old 32-bit code (the thing that was deleted) tried
to actually round things correctly, while the 64-bit code does not. Look
at what the 64-bit code does:
outb((inb(0x61) & ~0x02) | 0x01, 0x61);
outb(0xb0, 0x43);
outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
tr1 = get_cycles();
while ((inb(0x61) & 0x20) == 0);
tr2 = get_cycles();
and notice three things:
- no comments (can you actually see what it does?)
- no rounding of the inevitable errors of trying to wait 1/50th of a
second
- not a single try to even account for the fact that there might be
things going on that perturb the value.
Now, look at what the 32-bit code _used_ to do. The good code. The code
that was _deleted_.
Really. I don't think you really even looked. It did:
/* run 3 times to ensure the cache is warm and to get an accurate reading */
for (i = 0; i < 3; i++) {
mach_prepare_counter();
rdtscll(start);
mach_countup(&count);
rdtscll(end);
.. ignore bad values ..
/*
* We want the minimum time of all runs in case one of them
* is inaccurate due to SMI or other delay
*/
delta64 = min(delta64, (end - start));
}
and if you actually look at those counter things, you'll see:
#define CALIBRATE_TIME_MSEC 30 /* 30 msecs */
#define CALIBRATE_LATCH \
((CLOCK_TICK_RATE * CALIBRATE_TIME_MSEC + 1000/2)/1000)
static inline void mach_prepare_counter(void)
{
/* Set the Gate high, disable speaker ...The _good_ code which results in a 8GhZ TSC calibration on that very
_32_ bit box I have here. The CPU is 32bit only, so it never even
None.
start_pit_documented_magic()
read_tsc()
wait_until_pit_has_wrapped_documented_magic()
read_tsc()
is error prone versus SMI/SMM code simply due to the fact, that at any
given point between those functions the SMM/SMI can happen. Doing it
three times in a row and select the lowest one does not change much. I
tried it 10 times in a row with varying bogus results.
So at every boot I get significant different calibration values. See
The following is from a 32bit boot on that very 32bit Intel Core Duo
Laptop running 2.6.26:
[ 0.000000] Detected 8340.258 MHz processor.
next boot
[ 0.000000] Detected 3240.001 MHz processor.
next boot
[ 0.000000] Detected 2211.134 MHz processor.
I can print you the value for 100 loops if you want, but I bet that
the correctness rate will be pretty small.
Current mainline calibrated against pmtimer gives me:
[ 0.000000] Detected 2000.065 MHz processor.
next boot
[ 0.000000] Detected 2000.129 MHz processor.
next boot
[ 0.000000] Detected 1999.988 MHz processor.
which is about accurate:
[ 13.408342] CPU0: Intel Genuine Intel(R) CPU T2500 @ 2.00GHz stepping 08
We had the same problem versus the local APIC timer calibration, which
had basically the same algorithm as the TSC one and we changed it to
look at the PMTimer as well in the days where we debugged the initial
wreckage caused by the nohz/highres changes. I can dig up the archives
of LAPIC timers with 200Mhz clock frequency, which results in a 10GHz
bus frequency, if you want.
How do you prevent the SMM brain damage, when it hits 3 times in a row ?
You can not prevent it for a very simple reason: The PIT is not
necessary a PIT. It can be a fake SMM code replacement. We actually
have no idea anymore what's hardware and what's just emulated crapola
under ...Hmm. So then how would you discover when it's reliable and when it's not? Just hardcode it for certain machines? One alternative might be to do the same "detect if it's SMM code by seeing how long the read takes" for the PIT reads themselves. Right now the code does it for the HPET timer read and for the PM_TIMER reads, but _not_ for Well, the biggest problem is actually _detection_. We have three different timers, and they all have their own problems. How do you reliably detect which one to use? The PM_TIMER clearly is _not_ always the answer here, but the code just assumes it is! Linus --
On the machine you have trouble with the PIT on, does this thing trigger? If it does, that could be a simple way to say whether you prefer PM_TIMER over PIT. For me, even on a modern machine, I get a pit_count of 46321, which matches the "about one microsecond for an ISA/LPC read" timing pretty well. What do you get? Linus --- arch/x86/kernel/tsc.c | 17 ++++++++++++++++- 1 files changed, 16 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 8e786b0..7cf7543 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -131,6 +131,7 @@ unsigned long native_calibrate_tsc(void) u64 tsc1, tsc2, tr1, tr2, delta, pm1, pm2, hpet1, hpet2; int hpet = is_hpet_enabled(); unsigned int tsc_khz_val = 0; + unsigned int pit_count; local_irq_save(flags); @@ -142,7 +143,9 @@ unsigned long native_calibrate_tsc(void) outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42); outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42); tr1 = get_cycles(); - while ((inb(0x61) & 0x20) == 0); + pit_count = 0; + while ((inb(0x61) & 0x20) == 0) + pit_count++; tr2 = get_cycles(); tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL); @@ -150,6 +153,18 @@ unsigned long native_calibrate_tsc(void) local_irq_restore(flags); /* + * We should have waited for about 50ms, and a real hardware + * PIT access _should_ take about a microsecond even if it's + * over an ISA bus (or something that tries to emulate timings). + * + * So pit_count should be 50,000+. If it's a lot lower, that + * implies it might be some emulated device rather than real + * hardware.. + */ + if (pit_count < 25000) + printk("Slow PIT device (%u) - emulated using SMI?\n", pit_count); + + /* * Preset the result with the raw and inaccurate PIT * calibration value */ --
On the problem AMD K6, the count is 29411. On my almost new AMD Turion 64 X2 that runs at 2.0 GHz, the value was 31401. A cutoff value of 25000 may be a bit close. That machine says that it calibrates against the PM_TIMER, but cat /sys/devices/...../current_clocksource returns hpet. Larry --
You'll get funny answers on the Geode doing that. It will also randomly mistrigger on some laptops - they don't use SMI for PIC but their power handling SMIs will trigger that randomly. PMTimer may also be an SMI too btw. Alan --
About anything between 0 and useful, but still I had cases where the
pit_count was way above the 25000 but the TSC was off by factor 2.
Went down the road and instrumented the code:
unsinged long tsc_deltas[50000];
start_pit();
tsc1 = tsc = read_tsc_start();
while (!pit_ready()) {
tsc2 = read_tsc();
tsc_deltas[pit_count++] = tsc2 - tsc;
tsc = tsc2;
}
Analysing the tsc_deltas gave interesting insight. On the affected
laptop I had several entries where the delta between two reads was
from 1msec up to 120msec maximum.
As the code does nothing else and has interrupts disabled there is
only one explanation: the SMM/SMI black hole.
If this high delta hits after the pit_count reached 25000 we still
believe that our calibration against pit is fine :(
So what I'm working on is an algorithm, which is similar to the checks
in the tsc_read_refs() function. That should allow us to detect
whether one of the reads is way off by doing a min/max detection. In
such a case we can either repeat the calibration or try to figure out
whether the pmtimer / hpet can provide us with some useful reference.
Thanks,
tglx
--
Ok, that sounds like a good approach to find if it's done by some kind of emulation or not. Of course, any machine with SMM (even if it doesn't emulate the PIT per se - maybe it just gets some event related to overheating or other 'maintenance' stuff) can have occasional hickups, but the '120msec' thing is, I think, the real clincher. Why? Because we only try to wait for 50ms in the first place! Even if emulation is 100% exact (or even none at all, and the PIT accesses are in hardware), if we have a 120ms hickup while waiting for 50ms, then the end result will obviously be total crap, and yes, that sure explains how you I think the most trivial approach would be to - just keep track of the max TSC difference for each loop iteration. - if the max TSC is bigger than 1% of the total TSC, then something is already seriously wrong (either we had very few loops indeed, or some of them were very expensive) - perhaps loop over the calibration, and make the TSC calibration loop increase the delay. Because even if there is a 120ms hickup, if we had used a longer calibration delay, we'd probably not have noticed (well, ok, 120ms is pretty damning and is probably just unfixable, but smaller hickups are probably harmless) Additionally doing a min/max comparison to see that the loop is very _stable_ is of course also a way to validate things, but expecting _too_ much stability may be wrong too. As mentioned, SMM events can happen for other reasons than emulation. Linus --
I went for summing up the deltas and build an average at the end. That's from a loop of 10 consecutive runs: [ 0.000000] TSC min 2160 max 3732 avg 3266 pitcnt 30614 [ 0.000000] TSC min 2160 max 1036164 avg 3299 pitcnt 30310 [ 0.000000] TSC min 2160 max 1032360 avg 3303 pitcnt 30277 [ 0.000000] TSC min 2160 max 210453018 avg 69509 pitcnt 30260 Hit very late in the loop, as pitcnt is close to the others [ 0.000000] TSC min 2160 max 3708 avg 3265 pitcnt 30624 [ 0.000000] TSC min 2160 max 3720 avg 3265 pitcnt 30622 [ 0.000000] TSC min 2160 max 1062252 avg 3301 pitcnt 30287 [ 0.000000] TSC min 2160 max 3756 avg 3267 pitcnt 30605 [ 0.000000] TSC min 2160 max 3732 avg 3267 pitcnt 30605 [ 0.000000] TSC min 2136 max 989292 avg 3297 pitcnt 30324 [ 0.000000] TSC min 2136 max 3744 avg 3266 pitcnt 30612 [ 0.000000] TSC min 2160 max 78042006 avg 78045 pitcnt 1001 This one hit early in the loop as pitcnt is pretty low. The min value is pretty constant. The max value for sane loops is in the range of 3708 - 3756, the average is between 3266 and 3267. For those which have a ~500us maximum the average is still in a sane range. That seems to be a single glitch, which pushs the maximum, but does not really influence the average result. The outstanding one is the 100ms (210 453 018 ticks), where the average is also off by factor 20. I think that information is enough to give us a pretty precice idea when to discard the result. I'm currently looking at the hpet/pmtimer values for comparison and I should have a patch for testing ready Increasing the delay is probably not a good idea as we just make the Yeah, I know. One of the oddballs is the USB->PS2 keyboard emulator which is active during early boot. We do the USB handoff definitely after the TSC calibration. Found a box with similar (not that bad) hickups which go away when I disable that in the BIOS settings. Can't ...
Sorry for joining the party this late...am still going through all my mails. Ok, so from what I understand until now, we will calibrate TSC against PIT as was done in 32bit code and use that as default. If that fails to give any sane results we will fall back to calibrating against PM_timer or HPET ? Thomas has already explained the problem with 32bit calibration ( i.e. just against PIT and no checks for SMI's and all) but would like to point that this problem is lot more worse in virtualized environment, because we may fail to get sane values even from multiple loops of calibrating against PIT. If we have a fall back mechanism to detect this SMI event, and then try calibrating against PM timer or HPET we should be good. Anyways I will wait to see the patch. Thanks, Alok --
I still keep the fallback against pmtimer/hpet alive. Thanks, tglx --
Dang. We actually _used_ to do this fairly early for UHCI, since it was simple and very useful to do, and not doing it caused serious problems with the USB stack. But then it got moved to the USB stack itself, and for some reason the DECLARE_PCI_FIXUP_HEADER got turned into a DECLARE_PCI_FIXUP_FINAL instead (probably because the USB layer expected things to be fairly set up), which means that it now happens much much later. But I guess even DECLARE_PCI_FIXUP_HEADER was is too late for TSC, so it doesn't much matter. Even with the old setup, it would be too late. You're definitely right that this could easily be the _real_ problem. Especially as your TSC min value of 2160 is (a) pretty close to the expected time of a microsecond and (b) so stable that I actually do not believe that the PIT itself is at all emulated or the problem. Btw - as to caring about the average value: that's pointless. If you only look at the average time the PIT read takes place, then it is going to approximate that "pit_count" thing in the end that I already did. Why? Because the average value should essentially end up being "(end_tsc - start_tsc) / pit_count". And if you just compare that to "min_tsc", then that should always be about a microsecond (on normal machines where the PIT is essentially on the old emulated internal "ISA" bus on the southbridge). So you end up with what I already posted, and you already dismissed. So average TSC is not any more interesting than "pit_count". Linus --
On that box, the PIT is probably real hardware or a damned good emulation. When you look at the 10 loop values you see that it does 50% perfectly fine calibration loops. The others are just SMI Yeah, you're right. Math is hard :) Thanks, tglx --
Ok, so I actually think I know how to resolve the problem once and for all. The solution is actually fairly simple: we use the HPET algorithm. The reason the HPET algorithm is so robust is that - we can actually read the frequency from the HPET itself - we also simply just read the counter values from the HPET, and so it doesn't really matter how much time has passed between the two reads, it only matters that _some_ time has passed, and that we pick _one_ stable read that we can associate with a TSC value. But the thing is, the exact same thing is actually true of the old PIT timer too - except we simply don't take advantage of it. The PIT timer has a very well known frequency value (PIT_TICK_RATE: 1193180 Hz), and we can trivially read the counter value too. But the thing is, that for some forgotten reason, that's not actually what we do. Instead of reading the counter value, we wait until it counts down to zero, and read the output value instead. So instead of having a nice and dependable counter that ticks down (16 bits of precision), we actually use a _single_ bit of result, and depend on reading the TSC at the same time. That's kind of sad. I'll try to whip up a test-patch to do this in a smarter way. Linus --
Except for the couple of exceptions, where the readout of the old PIT timer is broken. See arch/x86/kernel/i8253.c:pit_read() Thought about that already and discarded it as it is basically the same problem as we have versus _ONE_ AMD K6 family pmtimer I'm fine with either solution as long it works on _ALL_ kind of broken hardware. Believe me or not, but since I work on the whole timer issue, I have not seen anything really reliable in the x86 world. That's the really sad part, that the hardware dudes did not learn anything about the importance of timing and timekeeping within 20 years. I'm tearing my hair on a regular base when I try to get down to the root cause of timer related wreckage in several (including todays) generations of CPU technology in x86 land. Other architectures have their odds and ends as well, but the vast majority of implementations is pretty straight forward and usable without restrictions. x86 is definitely the ultimate winner of the all time "timer ignorance award". Thanks, tglx --
Well, the VIA problem, for example, is not an issue if you just make it do the maximum timeout (ie count down from zero), which is what you want _anyway_ in order to get the maximum range. So that one is simply avoided by just programming the timer to count the maximum possible range. However, the counter latching itself does seem to stop the counting until it is read, which is rather irritating. It means that you can't just latch and read in a tight loop. An while I actually have a patch that works very well for me, it depends on not latching, which is not strictly correct either. So yeah, I suspect using the current code and trying to just see when it isn't reliable (and the "max time" seems good for that) is probably the best option. Linus --
Larry Finger reported at http://lkml.org/lkml/2008/9/1/90: An ancient laptop of mine started throwing errors from b43legacy when I started using 2.6.27 on it. This has been bisected to commit bfc0f59 "x86: merge tsc calibration". The unification of the TSC code adopted mostly the 64bit code, which prefers PMTIMER/HPET over the PIT calibration. Larrys system has an AMD K6 CPU. Such systems are known to have PMTIMER incarnations which run at double speed. This results in a miscalibration of the TSC by factor 0.5. So the resulting calibrated CPU/TSC speed is half of the real CPU speed, which means that the TSC based delay loop will run half the time it should run. That might explain why the b43legacy driver went berserk. On the other hand we know about systems, where the PIT based calibration results in random crap due to heavy SMI/SMM disturbance. On those systems the PMTIMER/HPET based calibration logic with SMI detection shows better results. According to Alok also virtualized systems suffer from the PIT calibration method. The solution is to use a more wreckage aware aproach than the current either/or decision. 1) reimplement the retry loop which was dropped from the 32bit code during the merge. It repeats the calibration and selects the lowest frequency value as this is probably the closest estimate to the real frequency 2) Monitor the delta of the TSC values in the delay loop which waits for the PIT counter to reach zero. If the maximum value is significantly different from the minimum, then we have a pretty safe indicator that the loop was disturbed by an SMI. 3) keep the pmtimer/hpet reference as a backup solution for systems where the SMI disturbance is a permanent point of failure for PIT based calibration 4) do the loop iteration for both methods, record the lowest value and decide after all iterations finished. 5) Set a clear preference to PIT based calibration when the result makes sense. The implementation does the reference calibration based ...
This is "wrongish".
You really should do the
tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
...
tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
around the whole loop, because they get more exact with more time inside,
and they don't improve from looping around.
Also, that code is already _too_ unreadable. How about starting by just
moving the PIT calibration into a function of its own, like the appended
patch. And then as a separate patch, improving the heuristics for just the
PIT calibration.
The others are independent of this issue..
Linus
---
arch/x86/kernel/tsc.c | 37 ++++++++++++++++++++-----------------
1 files changed, 20 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 8e786b0..bf7ff5a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -122,19 +122,9 @@ static u64 tsc_read_refs(u64 *pm, u64 *hpet)
return ULLONG_MAX;
}
-/**
- * native_calibrate_tsc - calibrate the tsc on boot
- */
-unsigned long native_calibrate_tsc(void)
+static unsigned long PIT_calibrate_tsc(void)
{
- unsigned long flags;
- u64 tsc1, tsc2, tr1, tr2, delta, pm1, pm2, hpet1, hpet2;
- int hpet = is_hpet_enabled();
- unsigned int tsc_khz_val = 0;
-
- local_irq_save(flags);
-
- tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
+ u64 tr1, tr2, delta;
outb((inb(0x61) & ~0x02) | 0x01, 0x61);
@@ -145,17 +135,30 @@ unsigned long native_calibrate_tsc(void)
while ((inb(0x61) & 0x20) == 0);
tr2 = get_cycles();
- tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
-
- local_irq_restore(flags);
-
/*
* Preset the result with the raw and inaccurate PIT
* calibration value
*/
delta = (tr2 - tr1);
do_div(delta, 50);
- tsc_khz_val = delta;
+ return delta;
+}
+
+/**
+ * native_calibrate_tsc - calibrate the tsc on boot
+ */
+unsigned long native_calibrate_tsc(void)
+{
+ unsigned long flags;
+ u64 tsc1, tsc2, pm1, pm2, hpet1, hpet2;
+ int hpet = ...Yeah, should have done that. But I was too tired to touch anything in the code more complex than adding a few comments. Thanks, tglx --
Hi Thomas,
I agree with Linus that we should move the tsc_read_refs call outside of
the loop. I did those changes and ran some boot-halt tests at my end.
The frequency calibration against the pmtimer/hpet was surely more fine
tuned with this change, the variance that i see now in repeated reboots
is very minimal.
Please have a look at the patch below.
--
x86: Fine tune TSC calibration.
From: Alok N Kataria <akataria@vmware.com>
As Linus suggested, we should be moving the tsc_read_refs outside of the
loop, this gives us more accurate TSC calibration when calibrating against
hpet/pmtimer, since we are now calibrating over a period of 250ms.
With SMI_THRESHOLD equals to 50000, in the worst case, tsc values read by
tsc_read_refs, could be off by 50000 ticks. On a 2Ghz processor this would
mean an error of 25us. The tsc frequency is calibrated over a period of 250ms
with this patch, hence the worst case error would be around 100ppm down from
500ppm previously without the patch.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
---
arch/x86/kernel/tsc.c | 54 ++++++++++++++++++++++++++-----------------------
1 files changed, 29 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 346cae5..bc2c1a2 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -188,7 +188,7 @@ static unsigned long pit_calibrate_tsc(void)
unsigned long native_calibrate_tsc(void)
{
u64 tsc1, tsc2, delta, pm1, pm2, hpet1, hpet2;
- unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
+ unsigned long tsc_pit_min = ULONG_MAX, tsc_ref = ULONG_MAX;
unsigned long flags;
int hpet = is_hpet_enabled(), i;
@@ -216,31 +216,35 @@ unsigned long native_calibrate_tsc(void)
* calibration delay loop as we have to wait for a certain
* amount of time anyway.
*/
+
+ local_irq_save(flags);
+
+ /*
+ * Read the start value and the reference count of
+ * hpet/pmtimer when available. Then do the PIT
+ * ...Side note: I'd like to change that. 250ms is a _loong_ time. It's a really really long time. It's human-noticeable. A quarter of a second at boot is just too long. We have a couple of options: - make the PIT calibration shorter I suspect we could easily make the PIT calibration be just 5ms rather than 50ms. Yes, yes, it will be less precise, but the PIT counts at 1.1MHz or something, so we're still talking about 5000+ ticks of the clock. So we'll get within a fraction of a percent. We could then decide that _if_ the HPET/PMTIMER matches the PIT, we'll decide to use that as the reference time, because it would possibly have higher accuracy. - do other things while calibrating. Especially for HPET/PM_TIMER, we could actually do some other stuff while it is calibrating, since those calibrations only depend on the end/beginning being "close enough". This seems like it would be pretty hard to actually do sanely, though (there's still the issue of the HPET/PM_TIMER overflowing, although with 24 bits that should be on the order of several seconds, I dunno) Anyway, 250ms really is too long. Linus --
On Wed, 3 Sep 2008 19:56:10 -0700 (PDT) yeah it's insane ;-( (since we boot a kernel in 1 second and a full OS including the kernel one option for that is the fastboot patchkit in -next ;-) but at some point, even doing things in parallel/asynchronous isn't helping, "parallel shit is still shit" :) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Well, the thing is, you can't call ti "shit" when the fact is that we don't have any other options than to wait. The only frequency we can trust on 99% of all machines is the PIT, and it's a very uncomfortable programming model due to all the history (it is one of the few truly 8-bit things left in a modern PC). The other options are just not reliably there, or are known to not have a stable frequency. So how would you suggest we do it? Lowering the wait to 5ms (times 5, so it's really 25ms, although we can probably stop early if the first iterations are very consistent) will work, but it _will_ reduce precision. And it's still real time. But we simply don't have alternatives. That 'shit' is originally from the company you work for, btw, and while it was good for its time, the replacement (HPET) was horribly misdesigned by the same company, and is deficient in many ways (not the least of which is the idiotic enumeration: another ACPI braindamage), and it often isn't even exposed. As a result, the PIT remains to this day the most reliable source of a reference timer. That includes even on really modern machines (ie the one I have from Intel that contains hardware not even released yet!). Linus --
On Wed, 3 Sep 2008 20:59:05 -0700 (PDT) pmtimer is also quite ok, with the exception of some K6 based boxes. (I'm surprised the K6 boxes even have enough modern stuff to have one of the options we have is to start with an initial rough-but-conservative estimate, and refine it over time as the system is running.... sort of like ntp but for the calibration. another option for calibrating the tsc rate is to read it from the msr's/cpuid/aperf of what the hardware says it should be, and then all we need is to verify it is that; that we could do over timer or quickly. (of course that only works for systems with constant tsc) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Quite frankly, pmtimer isn't all that much better than PIT. It has a slightly bigger range, but it has a much more limited format, and it doesn't have a reliable frequency. It was designed for something else. At least HPET is clearly better than PIT as a _timer_. All the big HPET problems are with its idiotic interface. Of course, in any _sane_ situation, the timer really would have been in the local APIC instead, with a fixed and architected frequency, and it should run in all power states. But noo, that obviously won't ever work, I do agree that we could aim for something like that. But even to get the I don't think it's reliable even for systems with a constant TSC. Because the msr/cpuid thing isn't going to actualyl give the right frequency. It might be the frequency the thing is _rated_ at, but it will be off when people over- or under-clock the front-side bus etc. This is why it's so important that the clock input be a _known_ frequency. The thing that makes the PIT still so useful is not that it's a good timer, but that we *know* the frequency it runs at. Linus --
On Wed, 3 Sep 2008 21:20:23 -0700 (PDT) for 99%+ of the systems it'll be extremely close. We do need to validate it to detect the overclocking etc (and if the calibration says "it smells funny" we can do the slow 250msec thing) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
15 years ago when I only knew DOS, I used the PIT a lot for precise
delay calculations. I can attest that it can be a very precise timer
for delays when you run busy loops. You even need very few ticks because
you detect the falling edge with a high accuracy. Basically, I would
do this :
pit1 = readpit();
while (readpit() == pit1);
t1 = rdtsc(); // precise beginning of tick 0
while (readpit() != pit1 - 5000);
t2 = rdtsc(); // precise beginning of tick 5000
(t2 - t1) will be exactly 5000 PIT ticks long, or 4.1904767 ms.
Additional sanity checks are needed of course, such as rollover
detection, and a max loop counter in case we boot on a machine
with a broken PIT.
If someone wants to test this, I'd be interested in the number of
ticks required to get a good accuracy, I bet that even with a few
hundred ones it's already precise by a few ppm (about the precision
of the input clock in fact).
Willy
--
There's a few caveats here:
- the "readpit()" has to read without actually latching the value
latching the PIT value will stop counting.
- and all the docs say that you have to be careful about reading the PIT
without latching it because the two 8-bit accesses aren't atomic.
so the above will work in practice, but there are dangers.
The best way to fix most of the dangers is probably to only care about the
*high* byte, so that it doesn't matter if the low byte doesn't match the
high byte.
So you could probably change your version to wait for 4096 cycles (a
change of 16 in the high byte):
static unsigned char read_pit_msb(void)
{
/* Read but throw away the LSB */
inb(0x42);
return inb(0x42);
}
..
/* PIT ch2: square wave, full 16-bit count */
outb(0xb6, 0x43);
outb(0, 0x42);
outb(0, 0x42);
..
unsigned char pit = read_pit_msb();
/* Wait until the MSB changes */
while (read_pit_msb() == pit1);
t1 = rdtsc();
while ((unsigned char) (pit - read_pit_msb()) < 9);
t2 = rdtsc();
and it might work out ok without explicit latching, and without having to
I actually tested a patch with a counter value of just 1024, and I got the
right answer.
But if the busy loops aren't busy (due to MSI or virtualization), then all
those things fly out the window.
Linus
--
Ah yes you're right, I remember having been doing crappy stuff like re-reading 100% agreed, though the problem is already the same with any calibration code, with more or less sensitivity. Willy --
x86: Change warning message in TSC calibration.
From: Alok N Kataria <akataria@vmware.com>
When calibration against PIT fails, the warning that we print is misleading.
In a virtualized environment the VM may get descheduled while calibration
or, the check in PIT calibration may fail due to other virtualization
overheads.
The warning message explicitly assumes that calibration failed due to SMI's
which may not be the case. Change that to something proper.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
---
arch/x86/kernel/tsc.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index d74b3ef..9d4aeec 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -271,8 +271,7 @@ unsigned long native_calibrate_tsc(void)
*/
if (tsc_pit_min == ULONG_MAX) {
/* PIT gave no useful value */
- printk(KERN_WARNING "TSC: PIT calibration failed due to "
- "SMI disturbance.\n");
+ printk(KERN_WARNING "TSC: Unable to calibrate against PIT\n");
/* We don't have an alternative source, disable TSC */
if (!hpet && !pm1 && !pm2) {
--
I know that Linus has some problems with this patch, but FWIW it worked on my K6. The dmesg output is TSC: PIT calibration deviates from PMTIMER: 428809 214401. TSC: Using PIT calibration value Detected 428.809 MHz processor. Larry --
I don't have a problem with what the patch _does_ (I think all the added
sanity checking is good).
I just don't like how it makes the already rather complex and confusing
function about ten times _more_ complex and confusing. It needs splitting
up.
But we can split it up after-the-fact as well as before-the-fact.
Does this cleanup-patch (on _top_ of Thomas' patch) also work for you? It
should do exactly the same thing, except it splits up the TSC PIT
calibration into a function of its own.
But I may obviously have introduced some silly bug when cleaning it up,
so testing necessary..
Linus
---
arch/x86/kernel/tsc.c | 132 ++++++++++++++++++++++++++----------------------
1 files changed, 71 insertions(+), 61 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ac79bd1..346cae5 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -122,15 +122,75 @@ static u64 tsc_read_refs(u64 *pm, u64 *hpet)
return ULLONG_MAX;
}
+/*
+ * Try to calibrate the TSC against the Programmable
+ * Interrupt Timer and return the frequency of the TSC
+ * in kHz.
+ *
+ * Return ULONG_MAX on failure to calibrate.
+ */
+static unsigned long pit_calibrate_tsc(void)
+{
+ u64 tsc, t1, t2, delta;
+ unsigned long tscmin, tscmax;
+ int pitcnt;
+
+ /* Set the Gate high, disable speaker */
+ outb((inb(0x61) & ~0x02) | 0x01, 0x61);
+
+ /*
+ * Setup CTC channel 2* for mode 0, (interrupt on terminal
+ * count mode), binary count. Set the latch register to 50ms
+ * (LSB then MSB) to begin countdown.
+ */
+ outb(0xb0, 0x43);
+ outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
+ outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
+
+ tsc = t1 = t2 = get_cycles();
+
+ pitcnt = 0;
+ tscmax = 0;
+ tscmin = ULONG_MAX;
+ while ((inb(0x61) & 0x20) == 0) {
+ t2 = get_cycles();
+ delta = t2 - tsc;
+ tsc = t2;
+ if ((unsigned long) delta < tscmin)
+ tscmin = (unsigned int) delta;
+ if ((unsigned long) delta > ...Same result as Thomas' version. Larry --
.. Not to mention that we still have the unsolved regression from when HPET support first went in (2.6.2[01] timeframe), where machines randomly lock up at "Switching to HPET" time during boot. Depending upon kernel .config, phase of the moon, and apparently SMM events. Hopefully this will finally get fixed with this work? Cheers --
Looking at values for old K6 machines, I would suspect that doing the test three times and checking the deviation would be enough. If the timer is emulated the value will jump around and if it is stable it could be used. Considering that this is one use code you could increase the number of trials to five or so, keeping the high and low. If -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot --
Both the openSUSE 2.6.22 kernel and the one with the pm_timer disabled return "pit". I don't think pm_timer had ever been used until the commit in question. The timed sleep is as accurate as I can measure. I put in some test prints. The value of pm2 is zero when the else branch of the "if (hpet)" is entered; however, pm1 is 15768471. When we reach the do_div(tsc2, tsc1) statement, tsc2 is zero, which I think means that the two calls to tsc_read_refs() are returning the same junk value. Larry --
Ok, so the pmtimer is probably detected later as unusable and disabled. Please check your logs for: "PM-Timer had inconsistent results:" Thanks, tglx --
Booting 2.6.26, the dmesg output has a line that says: PM-Timer running at invalid rate: 200% of normal - aborting. Amazing that it should be exactly 200%. Why is the CPU running at half speed when the PM-Timer rate is measured? Larry --
The kernel assumes that the PM timer frequency is normal, so it does:
read pm-timer start value, read TSC start value
wait for a some time
read pm-timer end value, read TSC end value
And the TSC frequency is calculated via:
TSC-End - TSC-Start
TSC-Frequency = -------------------- * PM-Frequency
PM-End - PM-Start
So if your PM-Timer runs at the double frequency for reasons only
known to the Chip Manufacturer the kernel miscalculates the TSC
frequency by factor 0.5.
Simple rule of three.
Thanks,
tglx
--
Ok, Thomas, that means that the PIT is reliable (not surprising), and the PM_TIMER isn't (again, I'm not horribly surprised). And HPET isn't available, of course. The old x86-32 code never even bothered with the PM_TIMER for calibration. I don't understand why the x86-64 code bothers with it either. Why not just drop that whole broken thing, and just depend on the PIT if there is no HPET? I would also like to point out that the 32-bit code actually had a much nicer PIT setup, using the much better documented mach_prepare_counter() and mach_countup() helper functions. I'm unhappy to note that the new "common" code uses what appears to be the inferior code. Also, note that this is _not_ a new issue. See "verify_pmtmr_rate()" in drivers/clocksource/acpi_pm.c, along with all the code to check that the reads are stable in "init_acpi_pm_clocksource()". IOW, the PM_TIMER has been found to be broken before. Depending on it for calibration is broken. Linus --
We had horrible results on some machines with the PIT especially in the local APIC timer calibration code. One of my own laptops behaved stupid until I verified the APIC timer versus the PM timer. Until recently it also had occasional stupid TSC calibration values. This was reported by others as well and is definitely caused by SMM While on older machines, which might have pmtimer wreckage (I just googled that some AMD K6 are known to have such issues), the SMM/SMI wreckage is likely to be a non issue, while on modern systems the SMM/SMI wreckage will bite us. So we have the choice between evil and evil. One way out would be to check on which CPU type we run and ignore the pmtimer for older systems. Thanks, tglx --
