Re: [PATCH] Fix TSC calibration issues

Previous thread: [ANNOUNCE] Git User's Survey 2008 by Jakub Narebski on Sunday, August 31, 2008 - 2:29 pm. (5 messages)

Next thread: [git pull] drm debug message downgrade. by Dave Airlie on Sunday, August 31, 2008 - 4:00 pm. (1 message)
From: Larry Finger
Date: Sunday, August 31, 2008 - 3:54 pm

An ancient laptop of mine started throwing errors from b43legacy when 
I started using 2.6.27 on it. This has been bisected to commit bfc0f59 
"x86: merge tsc calibration". Incidentally, I was appalled at the 
number of build errors that were found while bisecting in this region. 
Every commit from 2.6.26-rc9-00715 to at least -00719 had include 
errors that had to be fixed before it would compile. Those fixes are 
the only reason that the builds below are "dirty".

The CPU in this computer is an AMD-K6 at stepping 0c and running at 
450 MHz.

The critical differences in the dmesg output between the "good" and 
"bad" results indicate a factor of 2 difference in the clock speed, 
and are shown below:

============================================================
finger@larrylap:/ide1/mtech/wireless-testing> diff -u /ide1/dmesg.good 
/ide1/dmesg.bad
--- /ide1/dmesg.good    2008-08-31 16:23:32.000000000 -0500
+++ /ide1/dmesg.bad     2008-08-31 16:26:25.000000000 -0500
@@ -1,4 +1,4 @@
-Linux version 2.6.26-rc9-test-00715-g0ef9553-dirty (finger@larrylap) 
(gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 
135036] (SUSE Linux) ) #29 Sun Aug 31 15:37:05 ACT 2008
+Linux version 2.6.26-rc9-test-00716-gbfc0f59-dirty (finger@larrylap) 
(gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 
135036] (SUSE Linux) ) #31 Sun Aug 31 16:52:23 ACT 2008
  BIOS-provided physical RAM map:
   BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
   BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
@@ -54,7 +54,8 @@
  Kernel command line: root=/dev/sda3 vga=0x314 resume=/dev/sda2 
splash=silent
  Initializing CPU#0
  PID hash table entries: 1024 (order: 10, 4096 bytes)
-Detected 428.846 MHz processor.
+TSC calibrated against PM_TIMER
+Detected 214.401 MHz processor.
  Console: colour dummy device 80x25
  console [tty0] enabled
  Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
@@ -69,7 +70,7 @@
        .text : 0xc0100000 - 0xc02638ac   ...
From: Thomas Gleixner
Date: Monday, September 1, 2008 - 4:14 am

In both cases the TSC is ahead of the pm_timer, which looks like the
pm_timer is behaving strange.

Can you please disable the pm_timer (in the kernel config,
unfortunately there is no command line option for that) for a test and
provide the relevant output of demsg ?

Thanks,

	tglx

--

From: Larry Finger
Date: Monday, September 1, 2008 - 8:37 am

I manually deleted CONFIG_X86_PM_TIMER from .config, but it returns 
silently. The only way I see to turn off the pm_timer is to disable 
power management. Is that what I need to do?

Larry
--

From: Thomas Gleixner
Date: Monday, September 1, 2008 - 10:49 am

The switch depends on CONFIG_EMBEDDED. Just turn that on in "General Setup"

       [*] Configure standard kernel features (for small systems)  --->

Then you get an extra switch in:

 "Powermanagement Options"
 	ACPI (Advanced Configuration and Power Interface) Support  ---> 
		[*]   Power Management Timer Support 

Turn that one off.

Thanks,

	tglx
--

From: Larry Finger
Date: Monday, September 1, 2008 - 10:44 am

It took a while to figure out how to kill the pm_timer. I finally did 
it by changing the default to no rather than yes. I also reset the 
bisection and compiled a full -rc4 kernel.

What I hope is the relevant output of dmesg is below. The clock rate 
is correctly determined, and the b43legacy errors are gone.

Thanks - Larry

Linux version 2.6.27-rc4-wl-15747-g2e3bbe3-dirty (finger@larrylap) 
(gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 
135036] (SUSE Linux) ) #36 Mon Sep 1 12:19:13 ACT 2008

--snip--

Kernel command line: root=/dev/sda3 vga=0x314 resume=/dev/sda2 
splash=silent
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 4096 bytes)
TSC calibrated against PIT
Detected 428.823 MHz processor.
Console: colour dummy device 80x25
console [tty0] enabled
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 252692k/262080k available (1442k kernel code, 8836k reserved, 
613k data, 188k init, 0k highmem)
virtual kernel memory layout:
     fixmap  : 0xffffa000 - 0xfffff000   (  20 kB)
     vmalloc : 0xd0800000 - 0xffff8000   ( 759 MB)
     lowmem  : 0xc0000000 - 0xcfff0000   ( 255 MB)
       .init : 0xc0306000 - 0xc0335000   ( 188 kB)
       .data : 0xc0268b54 - 0xc0302208   ( 613 kB)
       .text : 0xc0100000 - 0xc0268b54   (1442 kB)
Checking if this processor honours the WP bit even in supervisor 
mode...Ok.
CPA: page pool initialized 1 of 1 pages preallocated
Calibrating delay loop (skipped), value calculated using timer 
frequency.. 857.64 BogoMIPS (lpj=1715292)
--

From: Thomas Gleixner
Date: Monday, September 1, 2008 - 11:31 am

Hmm. Haven't seen that before, but if confirms what I guessed from
your previous dmesg information. I wonder why you did not observe
strange behaviour with older kernel versions. I don't mean the
b43legacy errors, which might be caused by the wrong calibrated TSC,
but those even should show strange behaviour vs. time.

Can you please provide the output of an older "working" kernel version
from:

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource

after the TSC was set to unstable. It should say acpi_pm.

If that's the case please run
 
# time sleep 60

on a shell and provide the output and verify it against a knwn to be
halfways correct stopwatch.

Then do the same on the current mainline with pm_timer
disabled. current clocksource should be either jiffies or tsc.

Thanks,

	tglx
--

From: Linus Torvalds
Date: Monday, September 1, 2008 - 12:10 pm

x86-32 never used the PM_TIMER for frequency estimation, it only ever used 
the PIT. See the old "native_calculate_cpu_khz()" in tsc_32.c that you 
deleted in favor of the (imho inferior) x86-64 version.

How about:

 - taking the old 32-bit code, and using it to initially _just_ estimate 
   the TSC speed. That code was stable and pretty much guaranteed to work 
   reasonably well on all machines. It retries the timings three times, 
   and picks the best one.

 - Then, _after_ you already have a pretty good estimation for TSC, you 
   can use _that_ to then get the HPET and/or PM_TIMER version (and not 
   use the PIT at all for those calibrations)

 - and if the PM_TIMER one is too far off, just throw it away. We know the 
   PIT is a lot more trustworthy than the PM_TIMER.

Hmm?

			Linus
--

From: Thomas Gleixner
Date: Monday, September 1, 2008 - 1:07 pm

Far off in which direction ?

If the PIT interrupts are delayed by SMM code, then I see That's on a
max. three years old 32bit Core Duo things like:

[    0.000000] Detected 8340.258 MHz processor.
[   13.782091] APIC calibration not consistent with PM Timer: 228ms instead of 100ms

This one is way off, while the next one is in a reasonable range

[    0.000000] Detected 3240.001 MHz processor.
[   13.792122] APIC calibration not consistent with PM Timer: 178ms instead of 100ms

while in reality the machine is @2GHZ and current mainline says:

[    0.000000] Detected 2000.065 MHz processor.

The CPU calibration of < 2.6.27 is against PIT and does _NOT_ give me a
pretty good estimation for TSC.

I was pretty happy when Alok beat me to unify the TSC calibration code
as it solved one of my long standing todo items, which also filled my
buglist on a regular base.

I did debugged this thorougly using the tracer from preempt-rt to
check, what the box does during that time, and it definitely vanishes
for >100ms in a row in the black hole of the stupid BIOS.


So either way. Relying on PIT on newer machines is _BAD_, relying on
PM_TIMER on older machine is _BAD_ as well.

There is no given good estimate, when the TSC/PIT calibration is off
by factor 1.5 to 4. The consequence would be that I throw away a
perfect fine pmtimer and run a machine which advertises itself as the
fastest box on the planet. With your method I would disable nohz and I
would be back to 50% battery time.

I'm happy to discard the PIT on the 32bit machines again and then file
a bugreport for a regression between 2.6.27-rc1 and tomorrows git :)

This one is the first complaints, I've seen vs. a non working pmtimer
since quite a time. That's why I obviously forgot about the rate check
issue.

I just looked at drivers/clocksource/acpi_pm.c history and saw, that
John explicitely mentions AMD K6 in commit
562f9c574e0707f9159a729ea41faf53b221cd30

    This patch re-adds the verify_pmtmr_rate ...
From: Thomas Gleixner
Date: Monday, September 1, 2008 - 2:30 pm

ooops ---------------------^^^ wanted to say PMTIMER of course :)

      tglx
--

From: Linus Torvalds
Date: Monday, September 1, 2008 - 3:02 pm

Umm. Which one. The 32-bit or the 64-bit?

I already pointed out that the 64-bit PIT calibration is pure and utter 

The problem is that he unified it around the _worse_ code.

			Linus
--

From: Thomas Gleixner
Date: Monday, September 1, 2008 - 3:33 pm

The 32 bit version.
 
	tglx
--

From: Linus Torvalds
Date: Monday, September 1, 2008 - 3:56 pm

So then the PIT isn't going to work on that machine. Do you have HPET?

Because if so, the most obvious choice would be to say:

 - on old machines, the PIT is likely more reliable than PM_TIMER

 - on new machines, the HPET is likely more reliable than PIT

and there's actually a fairly obvious place to distinguish between old and 
new: if ti has a HPET, consider it a new one.

But that just means that there's never any reason to use PM_TIMER anyway, 
and the proper patch would probably be to just ignore it. No?

		Linus
--

From: Thomas Gleixner
Date: Monday, September 1, 2008 - 4:24 pm

There is a HPET timer on that machine, but it is only available by
enforcing the address via black magic because the BIOS does advertise
it on the wrong location and the newest BIOS version does not
advertise it at all.

I have not seen a single bug report against PM timer aside of those K6
area systems, but we have

- HPETs which are not advertised at all
- HPETs which are advertised at the wrong address
- HPETs which do not count at all
- HPETs which tell us bogus operating frequencies
- HPETs which have bogus interrupt routings

So why should we rely on the HPET ?

The whole ACPI code relies pretty much on PM_TIMER and as I said
before there are no PM-TIMER related bug reports aside of those
documented K6 ones, which Alok did probably not know about and I
forgot them as well. Yeah, that's my fault.

Thanks,

	tglx
--

From: Andi Kleen
Date: Monday, September 1, 2008 - 11:37 pm

The Linux ACPI code does not rely on PM_TIMER in any special ways AFAIK 
(except that it occasionally uses standard linux timer functions)
It merely discovers it for use by the other pmtimer users.

-Andi
-- 
ak@linux.intel.com
--

From: Thomas Gleixner
Date: Tuesday, September 2, 2008 - 5:21 am

# grep -rc acpi_gbl_FADT.xpm_timer_block.address drivers/acpi/ | grep -v ":0"
drivers/acpi/processor_idle.c:12

I need to correct myself. It's only the processor_idle code :)

Thanks,

	tglx
--

From: Linus Torvalds
Date: Monday, September 1, 2008 - 3:16 pm

Btw, this sentence of yours just doesn't seem to make much sense.

The thing is, the calibration code doesn't even use interrupts. It just 
reads the PIT timer value. 

And the thing is, the 64-bit code really does things that the 32-bit code 
does _not_ do. 

Just as an example, the old 32-bit code (the thing that was deleted) tried 
to actually round things correctly, while the 64-bit code does not. Look 
at what the 64-bit code does:

        outb((inb(0x61) & ~0x02) | 0x01, 0x61);

        outb(0xb0, 0x43);
        outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
        outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
        tr1 = get_cycles();  
        while ((inb(0x61) & 0x20) == 0);
        tr2 = get_cycles();

and notice three things:

 - no comments (can you actually see what it does?)

 - no rounding of the inevitable errors of trying to wait 1/50th of a 
   second

 - not a single try to even account for the fact that there might be 
   things going on that perturb the value.

Now, look at what the 32-bit code _used_ to do. The good code. The code 
that was _deleted_.

Really. I don't think you really even looked. It did:

        /* run 3 times to ensure the cache is warm and to get an accurate reading */
        for (i = 0; i < 3; i++) {
                mach_prepare_counter();
                rdtscll(start);
                mach_countup(&count);
                rdtscll(end);

		.. ignore bad values ..

                /*
                 * We want the minimum time of all runs in case one of them
                 * is inaccurate due to SMI or other delay
                 */
                delta64 = min(delta64, (end - start));
	}

and if you actually look at those counter things, you'll see:

	#define CALIBRATE_TIME_MSEC 30 /* 30 msecs */
	#define CALIBRATE_LATCH \
	        ((CLOCK_TICK_RATE * CALIBRATE_TIME_MSEC + 1000/2)/1000)
	
	static inline void mach_prepare_counter(void)
	{
	       /* Set the Gate high, disable speaker ...
From: Thomas Gleixner
Date: Monday, September 1, 2008 - 4:16 pm

The _good_ code which results in a 8GhZ TSC calibration on that very
_32_ bit box I have here. The CPU is 32bit only, so it never even


None. 

      start_pit_documented_magic()
      read_tsc()
      wait_until_pit_has_wrapped_documented_magic()
      read_tsc()

is error prone versus SMI/SMM code simply due to the fact, that at any
given point between those functions the SMM/SMI can happen. Doing it
three times in a row and select the lowest one does not change much. I
tried it 10 times in a row with varying bogus results.

So at every boot I get significant different calibration values. See

The following is from a 32bit boot on that very 32bit Intel Core Duo
Laptop running 2.6.26:

[    0.000000] Detected 8340.258 MHz processor.

next boot

[    0.000000] Detected 3240.001 MHz processor.

next boot

[    0.000000] Detected 2211.134 MHz processor.

I can print you the value for 100 loops if you want, but I bet that
the correctness rate will be pretty small.

Current mainline calibrated against pmtimer gives me:

[    0.000000] Detected 2000.065 MHz processor.

next boot

[    0.000000] Detected 2000.129 MHz processor.

next boot

[    0.000000] Detected 1999.988 MHz processor.

which is about accurate:

[   13.408342] CPU0: Intel Genuine Intel(R) CPU           T2500  @ 2.00GHz stepping 08

We had the same problem versus the local APIC timer calibration, which
had basically the same algorithm as the TSC one and we changed it to
look at the PMTimer as well in the days where we debugged the initial
wreckage caused by the nohz/highres changes. I can dig up the archives
of LAPIC timers with 200Mhz clock frequency, which results in a 10GHz
bus frequency, if you want.

How do you prevent the SMM brain damage, when it hits 3 times in a row ? 

You can not prevent it for a very simple reason: The PIT is not
necessary a PIT. It can be a fake SMM code replacement. We actually
have no idea anymore what's hardware and what's just emulated crapola
under ...
From: Linus Torvalds
Date: Monday, September 1, 2008 - 8:18 pm

Hmm.

So then how would you discover when it's reliable and when it's not? Just 
hardcode it for certain machines?

One alternative might be to do the same "detect if it's SMM code by seeing 
how long the read takes" for the PIT reads themselves. Right now the code 
does it for the HPET timer read and for the PM_TIMER reads, but _not_ for 

Well, the biggest problem is actually _detection_.

We have three different timers, and they all have their own problems. How 
do you reliably detect which one to use? The PM_TIMER clearly is _not_ 
always the answer here, but the code just assumes it is!

			Linus
--

From: Linus Torvalds
Date: Monday, September 1, 2008 - 8:35 pm

On the machine you have trouble with the PIT on, does this thing trigger?

If it does, that could be a simple way to say whether you prefer PM_TIMER 
over PIT.

For me, even on a modern machine, I get a pit_count of 46321, which 
matches the "about one microsecond for an ISA/LPC read" timing pretty 
well. What do you get? 

		Linus

---
 arch/x86/kernel/tsc.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 8e786b0..7cf7543 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -131,6 +131,7 @@ unsigned long native_calibrate_tsc(void)
 	u64 tsc1, tsc2, tr1, tr2, delta, pm1, pm2, hpet1, hpet2;
 	int hpet = is_hpet_enabled();
 	unsigned int tsc_khz_val = 0;
+	unsigned int pit_count;
 
 	local_irq_save(flags);
 
@@ -142,7 +143,9 @@ unsigned long native_calibrate_tsc(void)
 	outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
 	outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
 	tr1 = get_cycles();
-	while ((inb(0x61) & 0x20) == 0);
+	pit_count = 0;
+	while ((inb(0x61) & 0x20) == 0)
+		pit_count++;
 	tr2 = get_cycles();
 
 	tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
@@ -150,6 +153,18 @@ unsigned long native_calibrate_tsc(void)
 	local_irq_restore(flags);
 
 	/*
+	 * We should have waited for about 50ms, and a real hardware
+	 * PIT access _should_ take about a microsecond even if it's
+	 * over an ISA bus (or something that tries to emulate timings).
+	 *
+	 * So pit_count should be 50,000+. If it's a lot lower, that
+	 * implies it might be some emulated device rather than real
+	 * hardware..
+	 */
+	if (pit_count < 25000)
+		printk("Slow PIT device (%u) - emulated using SMI?\n", pit_count);
+
+	/*
 	 * Preset the result with the raw and inaccurate PIT
 	 * calibration value
 	 */
--

From: Larry Finger
Date: Monday, September 1, 2008 - 9:54 pm

On the problem AMD K6, the count is 29411. On my almost new AMD Turion 
64 X2 that runs at 2.0 GHz, the value was 31401. A cutoff value of 
25000 may be a bit close. That machine says that it calibrates against 
the PM_TIMER, but cat /sys/devices/...../current_clocksource returns 
hpet.

Larry
--

From: Alan Cox
Date: Tuesday, September 2, 2008 - 2:17 am

You'll get funny answers on the Geode doing that. It will also randomly
mistrigger on some laptops - they don't use SMI for PIC but their power
handling SMIs will trigger that randomly.

PMTimer may also be an SMI too btw.

Alan
--

From: Thomas Gleixner
Date: Tuesday, September 2, 2008 - 5:15 am

About anything between 0 and useful, but still I had cases where the
pit_count was way above the 25000 but the TSC was off by factor 2.

Went down the road and instrumented the code:

   unsinged long tsc_deltas[50000];

   start_pit();
   tsc1 = tsc = read_tsc_start();
   while (!pit_ready()) {
   	 tsc2 = read_tsc();
 	 tsc_deltas[pit_count++] = tsc2 - tsc;
	 tsc = tsc2;
   }

Analysing the tsc_deltas gave interesting insight. On the affected
laptop I had several entries where the delta between two reads was
from 1msec up to 120msec maximum. 

As the code does nothing else and has interrupts disabled there is
only one explanation: the SMM/SMI black hole.

If this high delta hits after the pit_count reached 25000 we still
believe that our calibration against pit is fine :(

So what I'm working on is an algorithm, which is similar to the checks
in the tsc_read_refs() function. That should allow us to detect
whether one of the reads is way off by doing a min/max detection. In
such a case we can either repeat the calibration or try to figure out
whether the pmtimer / hpet can provide us with some useful reference.

Thanks,

	tglx






--

From: Linus Torvalds
Date: Tuesday, September 2, 2008 - 8:09 am

Ok, that sounds like a good approach to find if it's done by some 
kind of emulation or not. Of course, any machine with SMM (even if it 
doesn't emulate the PIT per se - maybe it just gets some event related to 
overheating or other 'maintenance' stuff) can have occasional hickups, but 
the '120msec' thing is, I think, the real clincher. 

Why? Because we only try to wait for 50ms in the first place! Even if 
emulation is 100% exact (or even none at all, and the PIT accesses are in 
hardware), if we have a 120ms hickup while waiting for 50ms, then the end 
result will obviously be total crap, and yes, that sure explains how you 

I think the most trivial approach would be to

 - just keep track of the max TSC difference for each loop iteration.

 - if the max TSC is bigger than 1% of the total TSC, then something is 
   already seriously wrong (either we had very few loops indeed, or some 
   of them were very expensive)

 - perhaps loop over the calibration, and make the TSC calibration loop 
   increase the delay. Because even if there is a 120ms hickup, if we had 
   used a longer calibration delay, we'd probably not have noticed (well, 
   ok, 120ms is pretty damning and is probably just unfixable, but smaller 
   hickups are probably harmless)

Additionally doing a min/max comparison to see that the loop is very 
_stable_ is of course also a way to validate things, but expecting _too_ 
much stability may be wrong too. As mentioned, SMM events can happen for 
other reasons than emulation.

		Linus
--

From: Thomas Gleixner
Date: Tuesday, September 2, 2008 - 11:14 am

I went for summing up the deltas and build an average at the
end. That's from a loop of 10 consecutive runs:

[    0.000000] TSC min 2160 max       3732 avg  3266 pitcnt 30614
[    0.000000] TSC min 2160 max    1036164 avg  3299 pitcnt 30310
[    0.000000] TSC min 2160 max    1032360 avg  3303 pitcnt 30277

[    0.000000] TSC min 2160 max  210453018 avg 69509 pitcnt 30260

Hit very late in the loop, as pitcnt is close to the others

[    0.000000] TSC min 2160 max       3708 avg  3265 pitcnt 30624
[    0.000000] TSC min 2160 max       3720 avg  3265 pitcnt 30622
[    0.000000] TSC min 2160 max    1062252 avg  3301 pitcnt 30287
[    0.000000] TSC min 2160 max       3756 avg  3267 pitcnt 30605
[    0.000000] TSC min 2160 max       3732 avg  3267 pitcnt 30605
[    0.000000] TSC min 2136 max     989292 avg  3297 pitcnt 30324
[    0.000000] TSC min 2136 max       3744 avg  3266 pitcnt 30612

[    0.000000] TSC min 2160 max   78042006 avg 78045 pitcnt  1001

This one hit early in the loop as pitcnt is pretty low.

The min value is pretty constant.

The max value for sane loops is in the range of 3708 - 3756, the
average is between 3266 and 3267.

For those which have a ~500us maximum the average is still in a sane
range. That seems to be a single glitch, which pushs the maximum, but
does not really influence the average result.

The outstanding one is the 100ms (210 453 018 ticks), where the average
is also off by factor 20. 

I think that information is enough to give us a pretty precice idea
when to discard the result. I'm currently looking at the hpet/pmtimer
values for comparison and I should have a patch for testing ready

Increasing the delay is probably not a good idea as we just make the

Yeah, I know. One of the oddballs is the USB->PS2 keyboard emulator
which is active during early boot. We do the USB handoff definitely
after the TSC calibration. Found a box with similar (not that bad)
hickups which go away when I disable that in the BIOS settings. 

Can't ...
From: Alok Kataria
Date: Tuesday, September 2, 2008 - 11:41 am

Sorry for joining the party this late...am still going through all my
mails.

Ok, so from what I understand until now, we will calibrate TSC against
PIT as was done in 32bit code and use that as default. If that fails to
give any sane results we will fall back to calibrating against PM_timer
or HPET ?
Thomas has already explained the problem with 32bit calibration ( i.e.
just against PIT and no checks for SMI's and all) but would like to
point that this problem is lot more worse in virtualized environment,
because we may fail to get sane values even from multiple loops of
calibrating against PIT. 
If we have a fall back mechanism to detect this SMI event, and then try
calibrating against PM timer or HPET we should be good. 

Anyways I will wait to see the patch.

Thanks,
Alok



--

From: Thomas Gleixner
Date: Tuesday, September 2, 2008 - 2:16 pm

I still keep the fallback against pmtimer/hpet alive.

Thanks,

	tglx
--

From: Linus Torvalds
Date: Tuesday, September 2, 2008 - 11:42 am

Dang. 

We actually _used_ to do this fairly early for UHCI, since it was simple 
and very useful to do, and not doing it caused serious problems with the 
USB stack.

But then it got moved to the USB stack itself, and for some reason the 
DECLARE_PCI_FIXUP_HEADER got turned into a DECLARE_PCI_FIXUP_FINAL instead 
(probably because the USB layer expected things to be fairly set up), 
which means that it now happens much much later.

But I guess even DECLARE_PCI_FIXUP_HEADER was is too late for TSC, so it 
doesn't much matter. Even with the old setup, it would be too late.

You're definitely right that this could easily be the _real_ problem. 
Especially as your TSC min value of 2160 is (a) pretty close to the 
expected time of a microsecond and (b) so stable that I actually do not 
believe that the PIT itself is at all emulated or the problem.

Btw - as to caring about the average value: that's pointless. If you only 
look at the average time the PIT read takes place, then it is going to 
approximate that "pit_count" thing in the end that I already did. 

Why? Because the average value should essentially end up being "(end_tsc - 
start_tsc) / pit_count". And if you just compare that to "min_tsc", then 
that should always be about a microsecond (on normal machines where the 
PIT is essentially on the old emulated internal "ISA" bus on the 
southbridge). So you end up with what I already posted, and you already 
dismissed.

So average TSC is not any more interesting than "pit_count". 

			Linus
--

From: Thomas Gleixner
Date: Tuesday, September 2, 2008 - 2:13 pm

On that box, the PIT is probably real hardware or a damned good
emulation. When you look at the 10 loop values you see that it does
50% perfectly fine calibration loops. The others are just SMI

Yeah, you're right. Math is hard :)

Thanks,

	tglx
--

From: Linus Torvalds
Date: Tuesday, September 2, 2008 - 3:21 pm

Ok, so I actually think I know how to resolve the problem once and for 
all.

The solution is actually fairly simple: we use the HPET algorithm. The 
reason the HPET algorithm is so robust is that

 - we can actually read the frequency from the HPET itself

 - we also simply just read the counter values from the HPET, and so it 
   doesn't really matter how much time has passed between the two reads, 
   it only matters that _some_ time has passed, and that we pick _one_ 
   stable read that we can associate with a TSC value.

But the thing is, the exact same thing is actually true of the old PIT 
timer too - except we simply don't take advantage of it. The PIT timer has 
a very well known frequency value (PIT_TICK_RATE: 1193180 Hz), and we can 
trivially read the counter value too.

But the thing is, that for some forgotten reason, that's not actually what 
we do. Instead of reading the counter value, we wait until it counts down 
to zero, and read the output value instead. So instead of having a nice 
and dependable counter that ticks down (16 bits of precision), we actually 
use a _single_ bit of result, and depend on reading the TSC at the same 
time.

That's kind of sad. 

I'll try to whip up a test-patch to do this in a smarter way.

		Linus
--

From: Thomas Gleixner
Date: Tuesday, September 2, 2008 - 4:10 pm

Except for the couple of exceptions, where the readout of the old PIT
timer is broken. See arch/x86/kernel/i8253.c:pit_read()

Thought about that already and discarded it as it is basically the
same problem as we have versus _ONE_ AMD K6 family pmtimer


I'm fine with either solution as long it works on _ALL_ kind of broken
hardware. Believe me or not, but since I work on the whole timer
issue, I have not seen anything really reliable in the x86 world.

That's the really sad part, that the hardware dudes did not learn
anything about the importance of timing and timekeeping within 20
years.

I'm tearing my hair on a regular base when I try to get down to the
root cause of timer related wreckage in several (including todays)
generations of CPU technology in x86 land. Other architectures have
their odds and ends as well, but the vast majority of implementations
is pretty straight forward and usable without restrictions.

x86 is definitely the ultimate winner of the all time "timer ignorance
award".

Thanks,

	tglx
--

From: Linus Torvalds
Date: Tuesday, September 2, 2008 - 6:49 pm

Well, the VIA problem, for example, is not an issue if you just make it do 
the maximum timeout (ie count down from zero), which is what you want 
_anyway_ in order to get the maximum range.

So that one is simply avoided by just programming the timer to count the 
maximum possible range.

However, the counter latching itself does seem to stop the counting until 
it is read, which is rather irritating. It means that you can't just latch 
and read in a tight loop. An while I actually have a patch that works very 
well for me, it depends on not latching, which is not strictly correct 
either.

So yeah, I suspect using the current code and trying to just see when it 
isn't reliable (and the "max time" seems good for that) is probably the 
best option.

		Linus
--

From: Thomas Gleixner
Date: Tuesday, September 2, 2008 - 3:54 pm

Larry Finger reported at http://lkml.org/lkml/2008/9/1/90:
An ancient laptop of mine started throwing errors from b43legacy when
I started using 2.6.27 on it. This has been bisected to commit bfc0f59
"x86: merge tsc calibration".

The unification of the TSC code adopted mostly the 64bit code, which
prefers PMTIMER/HPET over the PIT calibration.

Larrys system has an AMD K6 CPU. Such systems are known to have
PMTIMER incarnations which run at double speed. This results in a
miscalibration of the TSC by factor 0.5. So the resulting calibrated
CPU/TSC speed is half of the real CPU speed, which means that the TSC
based delay loop will run half the time it should run. That might
explain why the b43legacy driver went berserk.

On the other hand we know about systems, where the PIT based
calibration results in random crap due to heavy SMI/SMM
disturbance. On those systems the PMTIMER/HPET based calibration logic
with SMI detection shows better results.

According to Alok also virtualized systems suffer from the PIT
calibration method.

The solution is to use a more wreckage aware aproach than the current
either/or decision.

1) reimplement the retry loop which was dropped from the 32bit code
during the merge. It repeats the calibration and selects the lowest
frequency value as this is probably the closest estimate to the real
frequency

2) Monitor the delta of the TSC values in the delay loop which waits
for the PIT counter to reach zero. If the maximum value is
significantly different from the minimum, then we have a pretty safe
indicator that the loop was disturbed by an SMI.

3) keep the pmtimer/hpet reference as a backup solution for systems
where the SMI disturbance is a permanent point of failure for PIT
based calibration

4) do the loop iteration for both methods, record the lowest value and
decide after all iterations finished.

5) Set a clear preference to PIT based calibration when the result
makes sense.

The implementation does the reference calibration based ...
From: Linus Torvalds
Date: Tuesday, September 2, 2008 - 7:14 pm

This is "wrongish".

You really should do the

	tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);   
	...
	tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);

around the whole loop, because they get more exact with more time inside, 
and they don't improve from looping around.

Also, that code is already _too_ unreadable. How about starting by just 
moving the PIT calibration into a function of its own, like the appended 
patch. And then as a separate patch, improving the heuristics for just the 
PIT calibration.

The others are independent of this issue..

		Linus

---
 arch/x86/kernel/tsc.c |   37 ++++++++++++++++++++-----------------
 1 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 8e786b0..bf7ff5a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -122,19 +122,9 @@ static u64 tsc_read_refs(u64 *pm, u64 *hpet)
 	return ULLONG_MAX;
 }
 
-/**
- * native_calibrate_tsc - calibrate the tsc on boot
- */
-unsigned long native_calibrate_tsc(void)
+static unsigned long PIT_calibrate_tsc(void)
 {
-	unsigned long flags;
-	u64 tsc1, tsc2, tr1, tr2, delta, pm1, pm2, hpet1, hpet2;
-	int hpet = is_hpet_enabled();
-	unsigned int tsc_khz_val = 0;
-
-	local_irq_save(flags);
-
-	tsc1 = tsc_read_refs(&pm1, hpet ? &hpet1 : NULL);
+	u64 tr1, tr2, delta;
 
 	outb((inb(0x61) & ~0x02) | 0x01, 0x61);
 
@@ -145,17 +135,30 @@ unsigned long native_calibrate_tsc(void)
 	while ((inb(0x61) & 0x20) == 0);
 	tr2 = get_cycles();
 
-	tsc2 = tsc_read_refs(&pm2, hpet ? &hpet2 : NULL);
-
-	local_irq_restore(flags);
-
 	/*
 	 * Preset the result with the raw and inaccurate PIT
 	 * calibration value
 	 */
 	delta = (tr2 - tr1);
 	do_div(delta, 50);
-	tsc_khz_val = delta;
+	return delta;
+}
+
+/**
+ * native_calibrate_tsc - calibrate the tsc on boot
+ */
+unsigned long native_calibrate_tsc(void)
+{
+	unsigned long flags;
+	u64 tsc1, tsc2, pm1, pm2, hpet1, hpet2;
+	int hpet = ...
From: Thomas Gleixner
Date: Wednesday, September 3, 2008 - 2:11 am

Yeah, should have done that. But I was too tired to touch anything in
the code more complex than adding a few comments.
 
Thanks,

	tglx
--

From: Alok Kataria
Date: Wednesday, September 3, 2008 - 6:14 pm

Hi Thomas, 

I agree with Linus that we should move the tsc_read_refs call outside of
the loop. I did those changes and ran some boot-halt tests at my end.
The frequency calibration against the pmtimer/hpet was surely more fine
tuned with this change, the variance that i see now in repeated reboots
is very minimal.

Please have a look at the patch below. 

--
x86: Fine tune TSC calibration.

From: Alok N Kataria <akataria@vmware.com>

As Linus suggested, we should be moving the tsc_read_refs outside of the
loop, this gives us more accurate TSC calibration when calibrating against
hpet/pmtimer, since we are now calibrating over a period of 250ms.


With SMI_THRESHOLD equals to 50000, in the worst case, tsc values read by
tsc_read_refs, could be off by 50000 ticks. On a 2Ghz processor this would
mean an error of 25us. The tsc frequency is calibrated over a period of 250ms
with this patch, hence the worst case error would be around 100ppm down from
500ppm previously without the patch.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
---

 arch/x86/kernel/tsc.c |   54 ++++++++++++++++++++++++++-----------------------
 1 files changed, 29 insertions(+), 25 deletions(-)


diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 346cae5..bc2c1a2 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -188,7 +188,7 @@ static unsigned long pit_calibrate_tsc(void)
 unsigned long native_calibrate_tsc(void)
 {
 	u64 tsc1, tsc2, delta, pm1, pm2, hpet1, hpet2;
-	unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
+	unsigned long tsc_pit_min = ULONG_MAX, tsc_ref = ULONG_MAX;
 	unsigned long flags;
 	int hpet = is_hpet_enabled(), i;
 
@@ -216,31 +216,35 @@ unsigned long native_calibrate_tsc(void)
 	 * calibration delay loop as we have to wait for a certain
 	 * amount of time anyway.
 	 */
+
+	local_irq_save(flags);
+
+	/*
+	 * Read the start value and the reference count of
+	 * hpet/pmtimer when available. Then do the PIT
+	 * ...
From: Linus Torvalds
Date: Wednesday, September 3, 2008 - 7:56 pm

Side note: I'd like to change that.

250ms is a _loong_ time. It's a really really long time. It's 
human-noticeable. A quarter of a second at boot is just too long.

We have a couple of options:

 - make the PIT calibration shorter

   I suspect we could easily make the PIT calibration be just 5ms rather
   than 50ms. Yes, yes, it will be less precise, but the PIT counts at
   1.1MHz or something, so we're still talking about 5000+ ticks of the 
   clock.

   So we'll get within a fraction of a percent. We could then decide that 
   _if_ the HPET/PMTIMER matches the PIT, we'll decide to use that as the 
   reference time, because it would possibly have higher accuracy.

 - do other things while calibrating.

   Especially for HPET/PM_TIMER, we could actually do some other stuff 
   while it is calibrating, since those calibrations only depend on the 
   end/beginning being "close enough". This seems like it would be pretty 
   hard to actually do sanely, though (there's still the issue of the 
   HPET/PM_TIMER overflowing, although with 24 bits that should be on the 
   order of several seconds, I dunno)

Anyway, 250ms really is too long.

		Linus
--

From: Arjan van de Ven
Date: Wednesday, September 3, 2008 - 8:16 pm

On Wed, 3 Sep 2008 19:56:10 -0700 (PDT)

yeah it's insane ;-(
(since we boot a kernel in 1 second and a full OS including the kernel

one option for that is the fastboot patchkit in -next ;-)

but at some point, even doing things in parallel/asynchronous isn't
helping, "parallel shit is still shit" :)


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Linus Torvalds
Date: Wednesday, September 3, 2008 - 8:59 pm

Well, the thing is, you can't call ti "shit" when the fact is that we 
don't have any other options than to wait.

The only frequency we can trust on 99% of all machines is the PIT, and 
it's a very uncomfortable programming model due to all the history (it is 
one of the few truly 8-bit things left in a modern PC). The other options 
are just not reliably there, or are known to not have a stable frequency.

So how would you suggest we do it? Lowering the wait to 5ms (times 5, so 
it's really 25ms, although we can probably stop early if the first 
iterations are very consistent) will work, but it _will_ reduce precision. 
And it's still real time.

But we simply don't have alternatives. That 'shit' is originally from the 
company you work for, btw, and while it was good for its time, the 
replacement (HPET) was horribly misdesigned by the same company, and is 
deficient in many ways (not the least of which is the idiotic enumeration: 
another ACPI braindamage), and it often isn't even exposed.

As a result, the PIT remains to this day the most reliable source of a 
reference timer. That includes even on really modern machines (ie the one 
I have from Intel that contains hardware not even released yet!).

			Linus
--

From: Arjan van de Ven
Date: Wednesday, September 3, 2008 - 9:10 pm

On Wed, 3 Sep 2008 20:59:05 -0700 (PDT)

pmtimer is also quite ok, with the exception of some K6 based boxes.
(I'm surprised the K6 boxes even have enough modern stuff to have

one of the options we have is to start with an initial
rough-but-conservative estimate, and refine it over time as the system
is running.... sort of like ntp but for the calibration.

another option for calibrating the tsc rate is to read it from the
msr's/cpuid/aperf of what the hardware says it should be, and then all
we need is to verify it is that; that we could do over timer or quickly.
(of course that only works for systems with constant tsc)


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Linus Torvalds
Date: Wednesday, September 3, 2008 - 9:20 pm

Quite frankly, pmtimer isn't all that much better than PIT. It has a 
slightly bigger range, but it has a much more limited format, and it 
doesn't have a reliable frequency. It was designed for something else.

At least HPET is clearly better than PIT as a _timer_.

All the big HPET problems are with its idiotic interface. 

Of course, in any _sane_ situation, the timer really would have been in 
the local APIC instead, with a fixed and architected frequency, and it 
should run in all power states. But noo, that obviously won't ever work, 

I do agree that we could aim for something like that. But even to get the 

I don't think it's reliable even for systems with a constant TSC. Because 
the msr/cpuid thing isn't going to actualyl give the right frequency. It 
might be the frequency the thing is _rated_ at, but it will be off when 
people over- or under-clock the front-side bus etc.

This is why it's so important that the clock input be a _known_ frequency. 
The thing that makes the PIT still so useful is not that it's a good 
timer, but that we *know* the frequency it runs at.

			Linus
--

From: Arjan van de Ven
Date: Wednesday, September 3, 2008 - 9:27 pm

On Wed, 3 Sep 2008 21:20:23 -0700 (PDT)


for 99%+ of the systems it'll be extremely close. We do need to
validate it to detect the overclocking etc (and if the calibration says
"it smells funny" we can do the slow 250msec thing)

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Willy Tarreau
Date: Wednesday, September 3, 2008 - 9:25 pm

15 years ago when I only knew DOS, I used the PIT a lot for precise
delay calculations. I can attest that it can be a very precise timer
for delays when you run busy loops. You even need very few ticks because
you detect the falling edge with a high accuracy. Basically, I would
do this :

    pit1 = readpit();
    while (readpit() == pit1);
    t1 = rdtsc(); // precise beginning of tick 0
    while (readpit() != pit1 - 5000);
    t2 = rdtsc(); // precise beginning of tick 5000

(t2 - t1) will be exactly 5000 PIT ticks long, or 4.1904767 ms.

Additional sanity checks are needed of course, such as rollover
detection, and a max loop counter in case we boot on a machine
with a broken PIT.

If someone wants to test this, I'd be interested in the number of
ticks required to get a good accuracy, I bet that even with a few
hundred ones it's already precise by a few ppm (about the precision
of the input clock in fact).

Willy

--

From: Linus Torvalds
Date: Wednesday, September 3, 2008 - 9:53 pm

There's a few caveats here:

 - the "readpit()" has to read without actually latching the value

   latching the PIT value will stop counting.

 - and all the docs say that you have to be careful about reading the PIT 
   without latching it because the two 8-bit accesses aren't atomic.

so the above will work in practice, but there are dangers.

The best way to fix most of the dangers is probably to only care about the 
*high* byte, so that it doesn't matter if the low byte doesn't match the 
high byte.

So you could probably change your version to wait for 4096 cycles (a 
change of 16 in the high byte):

	static unsigned char read_pit_msb(void)
	{
		/* Read but throw away the LSB */
		inb(0x42);
		return inb(0x42);
	}

	..
	/* PIT ch2: square wave, full 16-bit count */
	outb(0xb6, 0x43);
	outb(0, 0x42);
	outb(0, 0x42);
	..

	unsigned char pit = read_pit_msb();
	/* Wait until the MSB changes */
	while (read_pit_msb() == pit1);
	t1 = rdtsc();
	while ((unsigned char) (pit - read_pit_msb()) < 9);
	t2 = rdtsc();

and it might work out ok without explicit latching, and without having to 

I actually tested a patch with a counter value of just 1024, and I got the 
right answer. 

But if the busy loops aren't busy (due to MSI or virtualization), then all 
those things fly out the window.

			Linus
--

From: Willy Tarreau
Date: Wednesday, September 3, 2008 - 10:09 pm

Ah yes you're right, I remember having been doing crappy stuff like re-reading


100% agreed, though the problem is already the same with any calibration code,
with more or less sensitivity.

Willy

--

From: Alok Kataria
Date: Wednesday, September 3, 2008 - 6:18 pm

x86: Change warning message in TSC calibration.

From: Alok N Kataria <akataria@vmware.com>

When calibration against PIT fails, the warning that we print is misleading.
In a virtualized environment the VM may get descheduled while calibration
or, the check in PIT calibration may fail due to other virtualization
overheads.

The warning message explicitly assumes that calibration failed due to SMI's
which may not be the case. Change that to something proper.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
---

 arch/x86/kernel/tsc.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)


diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index d74b3ef..9d4aeec 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -271,8 +271,7 @@ unsigned long native_calibrate_tsc(void)
 	 */
 	if (tsc_pit_min == ULONG_MAX) {
 		/* PIT gave no useful value */
-		printk(KERN_WARNING "TSC: PIT calibration failed due to "
-		       "SMI disturbance.\n");
+		printk(KERN_WARNING "TSC: Unable to calibrate against PIT\n");
 
 		/* We don't have an alternative source, disable TSC */
 		if (!hpet && !pm1 && !pm2) {


--

From: Larry Finger
Date: Tuesday, September 2, 2008 - 7:51 pm

I know that Linus has some problems with this patch, but FWIW it 
worked on my K6. The dmesg output is

TSC: PIT calibration deviates from PMTIMER: 428809 214401.
TSC: Using PIT calibration value
Detected 428.809 MHz processor.

Larry
--

From: Linus Torvalds
Date: Tuesday, September 2, 2008 - 9:00 pm

I don't have a problem with what the patch _does_ (I think all the added 
sanity checking is good).

I just don't like how it makes the already rather complex and confusing 
function about ten times _more_ complex and confusing. It needs splitting 
up.

But we can split it up after-the-fact as well as before-the-fact.

Does this cleanup-patch (on _top_ of Thomas' patch) also work for you? It 
should do exactly the same thing, except it splits up the TSC PIT 
calibration into a function of its own.

But I may obviously have introduced some silly bug when cleaning it up, 
so testing necessary..

		Linus

---
 arch/x86/kernel/tsc.c |  132 ++++++++++++++++++++++++++----------------------
 1 files changed, 71 insertions(+), 61 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ac79bd1..346cae5 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -122,15 +122,75 @@ static u64 tsc_read_refs(u64 *pm, u64 *hpet)
 	return ULLONG_MAX;
 }
 
+/*
+ * Try to calibrate the TSC against the Programmable
+ * Interrupt Timer and return the frequency of the TSC
+ * in kHz.
+ *
+ * Return ULONG_MAX on failure to calibrate.
+ */
+static unsigned long pit_calibrate_tsc(void)
+{
+	u64 tsc, t1, t2, delta;
+	unsigned long tscmin, tscmax;
+	int pitcnt;
+
+	/* Set the Gate high, disable speaker */
+	outb((inb(0x61) & ~0x02) | 0x01, 0x61);
+
+	/*
+	 * Setup CTC channel 2* for mode 0, (interrupt on terminal
+	 * count mode), binary count. Set the latch register to 50ms
+	 * (LSB then MSB) to begin countdown.
+	 */
+	outb(0xb0, 0x43);
+	outb((CLOCK_TICK_RATE / (1000 / 50)) & 0xff, 0x42);
+	outb((CLOCK_TICK_RATE / (1000 / 50)) >> 8, 0x42);
+
+	tsc = t1 = t2 = get_cycles();
+
+	pitcnt = 0;
+	tscmax = 0;
+	tscmin = ULONG_MAX;
+	while ((inb(0x61) & 0x20) == 0) {
+		t2 = get_cycles();
+		delta = t2 - tsc;
+		tsc = t2;
+		if ((unsigned long) delta < tscmin)
+			tscmin = (unsigned int) delta;
+		if ((unsigned long) delta > ...
From: Larry Finger
Date: Tuesday, September 2, 2008 - 9:34 pm

Same result as Thomas' version.

Larry

--

From: Mark Lord
Date: Friday, September 5, 2008 - 6:45 am

..

Not to mention that we still have the unsolved regression from when HPET
support first went in (2.6.2[01] timeframe), where machines randomly lock up
at "Switching to HPET" time during boot.  Depending upon kernel .config,
phase of the moon, and apparently SMM events.

Hopefully this will finally get fixed with this work?

Cheers

--

From: Bill Davidsen
Date: Tuesday, September 2, 2008 - 10:17 am

Looking at values for old K6 machines, I would suspect that doing the 
test three times and checking the deviation would be enough. If the 
timer is emulated the value will jump around and if it is stable it 
could be used. Considering that this is one use code you could increase 
the number of trials to five or so, keeping the high and low. If 


-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

--

From: Larry Finger
Date: Monday, September 1, 2008 - 12:36 pm

Both the openSUSE 2.6.22 kernel and the one with the pm_timer disabled 
return "pit". I don't think pm_timer had ever been used until the 
commit in question.

The timed sleep is as accurate as I can measure.

I put in some test prints. The value of pm2 is zero when the else 
branch of the "if (hpet)" is entered; however, pm1 is 15768471. When 
we reach the do_div(tsc2, tsc1) statement, tsc2 is zero, which I think 
means that the two calls to tsc_read_refs() are returning the same 
junk value.

Larry
--

From: Thomas Gleixner
Date: Monday, September 1, 2008 - 1:09 pm

Ok, so the pmtimer is probably detected later as unusable and disabled.
Please check your logs for: 
"PM-Timer had inconsistent results:"

Thanks,

	tglx
--

From: Larry Finger
Date: Monday, September 1, 2008 - 1:23 pm

Booting 2.6.26, the dmesg output has a line that says:

PM-Timer running at invalid rate: 200% of normal - aborting.

Amazing that it should be exactly 200%. Why is the CPU running at half 
speed when the PM-Timer rate is measured?

Larry
--

From: Thomas Gleixner
Date: Monday, September 1, 2008 - 1:45 pm

The kernel assumes that the PM timer frequency is normal, so it does:

    read pm-timer start value, read TSC start value
    wait for a some time
    read pm-timer end value, read TSC end value

And the TSC frequency is calculated via:

                  TSC-End - TSC-Start
TSC-Frequency =  -------------------- * PM-Frequency
                   PM-End - PM-Start

So if your PM-Timer runs at the double frequency for reasons only
known to the Chip Manufacturer the kernel miscalculates the TSC
frequency by factor 0.5.

Simple rule of three.

Thanks,

	tglx
--

From: Linus Torvalds
Date: Monday, September 1, 2008 - 11:42 am

Ok, Thomas, that means that the PIT is reliable (not surprising), and the 
PM_TIMER isn't (again, I'm not horribly surprised). And HPET isn't 
available, of course.

The old x86-32 code never even bothered with the PM_TIMER for calibration. 
I don't understand why the x86-64 code bothers with it either. Why not 
just drop that whole broken thing, and just depend on the PIT if there is 
no HPET?

I would also like to point out that the 32-bit code actually had a much 
nicer PIT setup, using the much better documented mach_prepare_counter() 
and mach_countup() helper functions. I'm unhappy to note that the new 
"common" code uses what appears to be the inferior code.

Also, note that this is _not_ a new issue. See "verify_pmtmr_rate()" in 
drivers/clocksource/acpi_pm.c, along with all the code to check that the 
reads are stable in "init_acpi_pm_clocksource()".

IOW, the PM_TIMER has been found to be broken before. Depending on it for 
calibration is broken.

			Linus
--

From: Thomas Gleixner
Date: Monday, September 1, 2008 - 12:08 pm

We had horrible results on some machines with the PIT especially in
the local APIC timer calibration code. One of my own laptops behaved
stupid until I verified the APIC timer versus the PM timer. Until
recently it also had occasional stupid TSC calibration values.

This was reported by others as well and is definitely caused by SMM


While on older machines, which might have pmtimer wreckage (I just
googled that some AMD K6 are known to have such issues), the SMM/SMI
wreckage is likely to be a non issue, while on modern systems the
SMM/SMI wreckage will bite us. So we have the choice between evil and
evil.

One way out would be to check on which CPU type we run and ignore the
pmtimer for older systems.

Thanks,

	tglx
--

Previous thread: [ANNOUNCE] Git User's Survey 2008 by Jakub Narebski on Sunday, August 31, 2008 - 2:29 pm. (5 messages)

Next thread: [git pull] drm debug message downgrade. by Dave Airlie on Sunday, August 31, 2008 - 4:00 pm. (1 message)