This patch series addresses following:
- reduce the calibration time to a useful value
- make decision smarter, when a reference (HPET/PMTIMER) is around
The first patches are cleanups to prepare for the smarter loop
handling.
The main change is to reduce the PIT delay value to 10ms, which gives
reasonable results on very slow machines as well. To avoid looping
several times when the machine has a working reference counter
(HPET/pmtimer), we compare the results of the PIT and the reference and
break out of the loop when both match inside of a 10% window.
For virtualized environments the PIT calibration fails often and the
reference calibration is not reproducible with 10ms. To address this
we check whether the PIT failed two times in a row and make the PIT
loop longer (50ms) for the last try to get a better result for the
reference.
Most of the machines I tested break out of the loop after the first
try with a stable reproducible result.
Thanks,
tglx
--
--
i've added them to tip/x86/tsc and merged it into tip/master - if there's test success we can merge it into x86/urgent as well and push it into v2.6.27. Any objections to that merge route? Ingo --
I don't think it's quite that urgent, and wonder what the downside is of just changing the timeout to 10ms. On 32-bit x86, it was 30ms (I think) before the merge, so it sounds like 50ms was a bit excessive even before the whole "loop five times"/ So _short_ term, I'd really prefer (a) looping just three times and (b) looping with a smaller timeout. Long-term, I actually think even 10ms is actually a total waste. I'll post my trial "quick calibration" code that is more likely to fail under virtualization or SMM (or, indeed, perhaps even on things like TMTA CPU's that can have longer latencies due to translation), but that is really fast and knows very intimately when it succeeds. I just need to do slightly more testing. Linus --
ok - sounds very good to me! It's quite late in the cycle so the more ah - perhaps a dynamic statistical approach with an estimation of worst-case calibration error (~= standard deviation) and a quality threshold to reach? That could dramatically increase the number of samples while also making it much faster in practice. Nifty! Ingo --
Oh, no, I'm _much_ more nifty than that! Instead of being very clever, I have a very _stupid_ algorithm, one that has very hardcoded expectations of exactly what it will see. And if it doesn't see exactly that, it just fails early. I'd post the patch, but I really need to actually _test_ it first, and I haven't rebooted yet. Linus --
ok, will wait with patience :) i've been using adaptive calibration with great success in user-space, to run benchmarks on multiple boxes with a variable number of lmbench iterations. Once the observable statistical properties of the series of measurements looks valid it stops the test iterations and emits a result. This both makes things faster (more predictable hw/kernel executes certain lmbench tests much faster) and it makes the results more reliable (less noise, better cross-hardware, cross-kernel and cross-test comparisons). I've got a quality threshold (and a max iterations threshold) hardcoded as well. Ingo --
Just as well. There were various stupid small details, like the fact that i8253 timer mode 2 (square wave) decrements by two, which confused me for a while until I realized it. Anyway, here's a suggested diff. The comments are quite extensive, and should explain it all. The code should be _very_ robust, in that if anything doesn't match expectations, it will fail and fall back on the old code. But it should also be very fast, and quite precise. It only uses 2048 PIT timer ticks to calibrate the TSC, plus 256 ticks on each side to make sure the TSC values were very close to the tick, so the whole calibration takes less than 2.5ms. Yet, despite only takign 2.5ms, we can actually give pretty stringent guarantees of accuracy: - the code requires that we hit each 256-counter block at least 35 times, so the TSC error is basically at *MOST* just a few PIT cycles off in any direction. In practice, it's going to be about three microseconds off (which is how long it takes to read the counter) - so over 2048 PIT cycles, we can pretty much guarantee that the calibration error is less than one half of a percent. My testing bears this out: on my machine, the quick-calibration reports 2934.085kHz, while the slow one reports 2933.415. Yes, the slower calibration is still more precise. For me, the slow calibration is stable to within about one hundreth of a percent, so it's (at a guess) roughly an order-and-a-half of magnitude more precise. The longer you wait, the more precise you can be. However, the nice thing about the fast TSC PIT synchronization is that it's pretty much _guaranteed_ to give that 0.5% precision, and fail gracefully (and very quickly) if it doesn't get it. And it really is fairly simple (even if there's a lot of _details_ there, and I didn't get all of those right ont he first try or even the second ;) The patch says "110 insertions", but 63 of those new lines are actually comments. (And yes, I do the latching - it's ...
Good job you don't. Various Cyrix/Geode chipsets have as errata #2 "Counter latch command is non-operational in the 8254 timer" Alan --
Yeah, I had some memory of latch issues. I wrote the thing originally without the latching, which is why the whole thing is designed to igore the low cycle count. I just decided that doing the latching shouldn't hurt that much, even if it ends up being just a 1us no-op. It does mean that on any normal hardware, the expected error is roughly 3us over 2048 PIT ticks, which if I do the math right (nominal PIT frequency: 1193182 Hz) is just under 0.2%. Or put another way, ~1750 ppm. Not doing the latching should make the expected error go down to 2us. Of course, the 2048 PIT ticks is just a random choice. It could be any multiple of 256 ticks, so that error can be made smaller. Maybe it's worth spending 10ms on this, and get it down by a factor of five (at which point the error on the PIT frequency is probably in the same order of magnitude). Linus --
FWIW, typical error on the 14.31818 MHz clock (used as the PIT, PMTMR and HPET timebase in most systems) is usually ±50 ppm. High-quality motherboards which use a TCXO for the 14.31818 MHz clock would have around ±1 ppm. -hpa --
Thinking more about it, that was the wrong decision. After all, the whole point of the "quick calibration" is to take care of the good case of reasonable hardware - and fall back on a much slower version if there is something odd going on. So latching things is the wrong thing to do, since it just slows things down. No real hardware should need it, and if some odd hardware does exist, all the other sanity checks will catch it anyway. And yeah, I shouldn't have tried to go from 250ms down to 2.5ms. Aim for something like a 15ms calibration instead, which should give plenty of precision, while still being much faster than we used to be. So how about this? (Stage #2 would then be to simplify the main calibration loop now that we know that it's there really as a fall-back when the PIT isn't stable for some reason, but that's a separate issue). Thomas, I assume that this one catches your SMI-laptop and falls back to the slow case, the same way Alok already said it catches his VM setup? Linus --- arch/x86/kernel/tsc.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 118 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 8f98e9d..4589ae4 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -181,6 +181,117 @@ static unsigned long pit_calibrate_tsc(void) return delta; } +/* + * This reads the current MSB of the PIT counter, and + * checks if we are running on sufficiently fast and + * non-virtualized hardware. + * + * Our expectations are: + * + * - the PIT is running at roughly 1.19MHz + * + * - each IO is going to take about 1us on real hardware, + * but we allow it to be much faster (by a factor of 10) or + * _slightly_ slower (ie we allow up to a 2us read+counter + * update - anything else implies a unacceptably slow CPU + * or PIT for the fast calibration to work. + * + * - with 256 PIT ticks to read the value, we have 214us to + * ...
i suspect you didnt want the second initialization of 'expect' to 0xfe? I fixed that up by removing the second one. ( although i think it would all be even clearer if we initialized 'expect' to 0xff before the first call of pit_expect_msb() higher up, and then doing an expect-- before doing this loop - but that is really these bits collided with Thomas's textually but not conceptually: Thomas improved the slowpath calibration while you introduced a new fastpath. I kept both and applied your patch to tip/x86/tsc. if you are happy with the commit below, could you please give your Signed-off-by for it? Ingo --------------> From b6cb513801373e10af6f2df287a8089e3c340e83 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu, 4 Sep 2008 10:41:22 -0700 Subject: [PATCH] x86: quick TSC calibration Introduce a fast TSC-calibration method on sane hardware. It only uses 17920 PIT timer ticks to calibrate the TSC, plus 256 ticks on each side to make sure the TSC values were very close to the tick, so the whole calibration takes 15ms. Yet, despite only takign 15ms, we can actually give pretty stringent guarantees of accuracy: - the code requires that we hit each 256-counter block at least 50 times, so the TSC error is basically at *MOST* just a few PIT cycles off in any direction. In practice, it's going to be about one microseconds off (which is how long it takes to read the counter) - so over 17920 PIT cycles, we can pretty much guarantee that the calibration error is less than one half of a percent. My testing bears this out: on my machine, the quick-calibration reports 2934.085kHz, while the slow one reports 2933.415. Yes, the slower calibration is still more precise. For me, the slow calibration is stable to within about one hundreth of a percent, so it's (at a guess) roughly an order-and-a-half of magnitude more precise. The longer you wait, the more precise you can be. However, the nice thing about ...
hm, unless i'm missing something i think here we still have a small
window for an SMI or some virtualization delay to slip in and cause
massive inaccuracy: if the delay happens _after_ the last
pit_expect_msb() and _before_ the external get_cycles() call. Right?
i fixed that by adding one more pit_expect_msb() call.
plus i think QUICK_PIT_ITERATIONS is quite close to overflowing 255
which is built into the u32 'expect' variable (the MSB will only
overflow to 10 bits or so) - so i've added a BUILD_BUG_ON() to make sure
anyone tuning QUICK_PIT_MS above 60msec or so would get a build error
instead of some hard(er) to track down calibration error.
but it's getting late here so please double-check me ... The commit is
below.
Ingo
------------>
From 40d2650256289d3ba59c4fd146b86b972db6ec40 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 4 Sep 2008 22:47:47 +0200
Subject: [PATCH] x86: quick TSC calibration, improve
- make sure the final TSC timestamp is reliable too
- make sure nobody increases QUICK_PIT_MS so that
QUICK_PIT_ITERATIONS can get larger than 0xff, breaking the iteration.
(It would take about 60 msecs to reach that limit.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/tsc.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 839070b..4832a40 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -304,6 +304,11 @@ static unsigned long quick_pit_calibrate(void)
outb(0xff, 0x42);
outb(0xff, 0x42);
+ /*
+ * The iteration assumes that expect never goes below zero:
+ */
+ BUILD_BUG_ON(QUICK_PIT_ITERATIONS >= 0xff);
+
if (pit_expect_msb(0xff)) {
int i;
u64 t1, t2, delta;
@@ -317,6 +322,12 @@ static unsigned long quick_pit_calibrate(void)
t2 = get_cycles();
/*
+ * Make sure we can rely on the second TSC timestamp:
+ */
+ if (!pit_expect_msb(--expect))
+ goto ...Yes. I had the extra pit_expect_msb() originally, but decided that basically a single-instruction race for somethign that ran without any MSI for 15ms was a bit pointless. But adding another pit_expect_msb() is certainly not wrong. No it doesn't. "expect" is unsigned char and will happily wrap, as will the PIT timer. The fact that it is in "single shot" mode doesn't actually mean that the timer stops, it just affects what happens when it goes down to zero. So that BUILD_BUG_ON() is misleading and incorrect. Of course, it is true that in _practice_ you would never actually want to delay that long, but the code as written should be perfectly happy to iterate arbitrarily many times. Of course, the actual final TSC multiply/divide calculations would overflow at some point, but that's much further down the line. Linus --
ah, indeed, that bit of mine is wrong - and the period is programmed to 0xffff so it should all work out just fine. You code in a way too tricky manner ;) I zapped that portion. In fact ... shouldnt we intentionally include a 'wraparound' event in the test? Some of the erratums/instabilities regarding PITs happened around wraparounds [of the lsb] - maybe the wraparound of the MSB matters too. (Maybe some boards freeze the counter readout until the host OS ACKs the PIT irq or something - which we dont do in this calibration run so if there's some weirdness there we'd detect it.) So maybe we should start with an expect value of QUICK_PIT_ITERATIONS/2, with a wraparound right in the middle of the measurement? Ingo --
No. Face it, if somebody tries to make QUICK_PIT_MS be so large as to that be an issue, then the whole point of the function goes away. Linus --
Btw, the same is true of adding any random "sanity checking". The point of that thing was to simply only work when the PIT works as advertized, and fail immediately if it doesn't. Even *if* you were to pick a big calibration delay *and* if you happened to have a PIT that is broken and doesn't wrap correctly, the design of the thing would mean that it would then fail the calibration already. Exactly because it would _see_ that it's not wrapping. So then it returns zero, and the slow and complicated case can run. Take a look at the generated assembly language. I literally wrote it so that you could imagine that it's an old-time asm hacker that wrote the asm. Don't screw it up. Linus --
yeah - pit_expect_msb() would return after 50,000 iterations with a failure. So i was wondering whether for such PITs we _want_ the slow and complicated case to run. Probably not, because the PIT could quite likely still be counting very precisely as long as no wraparound was involved. Ingo --
the race is wider than that i think: all it takes an SMI at the last PIO access, so the window should be 1 usec, against a 15000 usecs period. That's 1 out of 15,000 boxes coming up with totally incorrect calibration. we also might have a very theoretical race of an SMI taking exactly 65 msecs so that the whole PIT wraps around and fools the fastpath - the chance for that would be around 1:300 - assuming we only have to hit the right MSB with a ~200 usecs precision). That assumes equal distribution of SMI costs which they certainly dont have - most of them are much less than 60 msecs. So i dont think it's an issue in practice - on real hw. But it's still a possibility unless i'm missing something. We could protect against that case by reading the IRQ0-pending bit and making ok, i kept that bit. Ingo --
Hi, I ran the current tree with these patches on my VM setup for both 32 & 64bit around 200 reboots each. The system entered the FAST calibration mode more often this time, around 25% of time. And i had an interesting case where in the frequency that was calibrated was 1875Mhz compared to actual ~1866Mhz, leaving an error of 0.5%. Now, looking at the code. Even with this last pit_expect_msb check, i think there can be a case when a error spanning 114usec can slip in the TSC calculation. This can happen if, in the pit_expect_msb (the one just before the second read_tsc), we hit an SMI/virtualization event *after* doing the 50 iterations of PIT read loop, this allows the pit_expect_msb to succeed when the SMI returns. If this SMI/Virtualization event spans across the next PIT MSB increment interval leaving sufficient time (100us) for the last pit_expect_msb to succeed. We can have a error of 1MSB tick increment - time taken for the last pit_expect_msb to succeed, in the read TSC value. i.e. a error of (214us - 100us) in the 15msec period, i.e. error of 7600PPM ?? And, in order for the TSC clocksource to keep correct time (on systems where the TSC clocksource is usable), the TSC frequency estimate must be within 500 ppm of its true frequency, otherwise NTP will not be able to correct it. So, IMHO we should not use this algorithm. I don't know if increasing the count threshold will help too, since that threshold value may fail for some system which perform better than our assumption of "we take 2us to do the 2 PIT reads". Atleast in virtualized environment I can make no such guarantees. Thanks, --
So theoretically, on real hardware, the minimum of 50 reads will take 100us. The 256 PTI cycles will take 214us, so in the absolute worst case, you can have two consecutive successful cycles despite having a 228us SMI (or other event) if it happens just in the middle. Of course, then the actual _error_ on the TSC read will be just half that, but since there are two TSC reads - one at the beginning and one at the end - and if the errors of the two reads go in opposite directions, they can add up to 228us. So I agree - in theory you can have a fairly big error if you hit everything just right. In practice, of course, even that *maximal* error is actually perfectly fine for TSC calibration. So I just don't care one whit. The fact is, fast bootup is more important than some crazy and totally unrealistic VM situation. The 50ms thing was already too long, the 250ms one is unbearable. The thing is, you _can_ calibrate the thing more carefully _later_. Use a tiemr to do two events one second apart (without slowing down the boot) if you want to get a really good value, along with the HPET/PMTIMER fine-tuning. That way you should actually be able to get a _really_ precise thing, because you do need a long time to get precision. But that long time should not be in a critical path on the bootup. Linus --
I just ran your patch in a loop on the SMI mephitic laptop with random
delays between the loops.
20 out of 10000 results are between 2200 and 8400Mhz while the CPU
still runs @2GHz.
And it also happened in an automated boot/reboot cycle (no extra loop)
once out of 50.
So the SMI hit exactly:
+ t1 = get_cycles();
+ for (expect = 0xfe, i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
+ if (!pit_expect_msb(expect))
+ goto failed;
+ }
-----> HERE
+ t2 = get_cycles();
The SMIs on that machine take between 500us and 100+ms according to my
instrumentation experiments: http://lkml.org/lkml/2008/9/2/202
Adding another check after the second get_cycles() makes it reliable:
+ t1 = get_cycles();
+ for (expect = 0xfe, i = 0; i < QUICK_PIT_ITERATIONS; i++) {
+ if (!pit_expect_msb(expect--))
+ goto failed;
+ }
+ t2 = get_cycles();
+
+ if (!pit_expect_msb(expect))
+ goto failed;
That solves the problem on that box and will detect any virtualization
delay as well. I just had to move out the "expect--" from the for() as
True. You applied a first draft of my patch right away from mail. It
was the accumulated findings of my detective work on various wreckaged
hardware.
The follow up patches I sent (http://lkml.org/lkml/2008/9/4/254),
bring it down to 10ms in the good and 30ms in the worst case with the
option for 50ms in the last round to make the virtualized/emulated
stuff happy. With that I have not seen any wrong result on any of my
jinxed boxen so far and it works very reliable on virtualized
environments as well.
Combining them with your fast calibration should be a solid and
reasonable fast solution in all corner cases.
Ingo put them into the -tip tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git x86/tsc
Thanks,
tglx
--
My original patch actually had that, and Ingo added it to the -tip tree too (I think), so that should be the one we're discussing anyway. I just didn't think it would ever trigger in practice, which is why I removed it. Mea culpa. Linus --
My bad. Did not test that against -tip, just applied it from mail :) Just checked. The -tip version still has the expect-- in the for() which might lead to stupid results depending on the gcc madness level. Thanks, tglx --
Umm. What? You're on some odd drugs. Linus --
Oh, you mean te "--expect" in the last pit_expect_msb(). Yeah, that one looks bogus, but I don't understand what it has to do with gcc at all. "expect" is an unsigned char. There are absolutely _zero_ issues with overflow, underflow, random phases of the moon, madness levels or anything else. But yes, it does look like Ingo screwed up when adding that final check, since expect was already decremented at the end of the loop. Ingo? Did you actually test it? Linus --
One gcc does:
i++;
if (i >= QUICK_PIT_ITERATIONS)
goto out;
expect--;
The other one does:
i++;
expect--;
if (i >= QUICK_PIT_ITERATIONS)
goto out;
Don't ask me which one is correct. It's just reality :(
/me goes back to consume legal german drugs
tglx
--
Show me. Because I simply don't believe you. The first code is simply _wrong_ - except if "expect" isn't even _used_ afterwards (in which case gcc can optimize away the last unused write). And I strongly suspect that that is what you've seen. Because quite frankly, if what you describe is real, then your gcc is incredibly buggy. So buggy that it sounds unlikely to be able to compile the kernel in many other places. This is very simple and very fundamental C, not something subtle or even half-way undefined. Linus --
Just checked, yes you are right. I messed that up in my tiredness to see
the obvious bug in Ingo's fixup patch.
/me resorts to bed
tglx
--
hm, yes, that's my brown paper bag fault, sorry. I did that addition in tip/x86/tsc and posted it to you and i did test it immediately - and i noticed that i never saw the fast-calibration message i expected to see. I even pasted the boot log over irc yesterday and i still have it: *> [ 0.000] TSC: PIT calibration deviates from PMTIMER: 738839 846296. *> [ 0.000] TSC: Using PIT calibration value *> [ 0.000] Detected 738.839 MHz processor. *> does not seem to trigger anywhere i wanted to debug that problem straight after i worked down my 800+ mails post-vacation mbox :-/ Which state i reached about 2 hours ago so i'm now free - the fix is below. i _think_ that the quality of calibration should now be pretty OK with latest -git. The clever fast calibration stuff could be .28 material. And/or we could change the 5x 50msec calibration to 3x 30msec right now, the precision is still plenty and the 90 msec is then replaced with your fast-calibrate method anyway on proper boxes. Hm? Ingo -------------------> From 5df45515512436a808d3476a90e83f2efb022422 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Sat, 6 Sep 2008 23:55:40 +0200 Subject: [PATCH] x86, tsc calibration: fix my brown paperbag day ... Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/tsc.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 6dab90f..4847a92 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -319,7 +319,7 @@ static unsigned long quick_pit_calibrate(void) /* * Make sure we can rely on the second TSC timestamp: */ - if (!pit_expect_msb(--expect)) + if (!pit_expect_msb(expect)) goto failed; /* --
Just straight forward german beer :)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6dab90f..3bfe083 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -310,8 +310,8 @@ static unsigned long quick_pit_calibrate(void)
unsigned char expect = 0xfe;
t1 = get_cycles();
for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
if (!pit_expect_msb(expect))
goto failed;
}
t2 = get_cycles();
/*
* Make sure we can rely on the second TSC timestamp:
*/
if (!pit_expect_msb(--expect))
goto failed;
Where is a guarantee, that excpect is not decremented before we break
out of the loop ?
the "expect--" can be done _BEFORE_ the i < QUICK_PIT_ITERATIONS
evaluation. Not likely, but ...
This version works always
t1 = get_cycles();
for (i = 0; i < QUICK_PIT_ITERATIONS; i++) {
if (!pit_expect_msb(expect--))
goto failed;
}
t2 = get_cycles();
/*
* Make sure we can rely on the second TSC timestamp:
*/
if (!pit_expect_msb(expect))
goto failed;
Thanks,
tglx
--
Quite the reverse. We have a guarantee that it _is_ decremented.
Adn that guarantee is very much about the C language.
for (a ; b ; c) {
..
}
translates as
a;
while (b) {
..
continue:
c;
}
And this has absolutely _nothing_ to do with any gcc oddity or anything
else.
The fact is, the code that Ingo added was totally bogus. The real bug was
that he did a totally bogus "--expect" in the argument to that last call.
Because 'c' *will* have been done after the last iteration of the loop.
Linus
--
BTW, I hate to see state-changing instructions inside an if condition.
I've been bitten several times while debugging. You try to temporarily
comment out the if statement for a test and you end up with different
code. Same for printf. Examples of dangerous usages :
i = 0;
for (x = 0; x < 100; x++) {
update_var(&i);
if (debug && i--)
printf("Hey I'm here\n");
}
return i;
You can bet that the if will go away before production. Variant with
similar effects :
i = 0;
for (x = 0; x < 100; x++) {
update_var(&i);
printf("Hey I'm here : %d\n", --i);
}
return i;
Since it costs nothing (except one tab and one LF) to put the instruction
out of the condition, I prefer to see them extracted :
i = 0;
for (x = 0; x < 100; x++) {
update_var(&i);
i--;
if (debug)
printf("Hey I'm here\n");
}
Willy
--
If Alok has the second check in place and is actually worried about
that 288us impact, then we can add the following (untested), which
does not impact the speed of the check.
Against -tip x86/tsc
Thanks,
tglx
---
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6dab90f..3bfe083 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -310,8 +310,8 @@ static unsigned long quick_pit_calibrate(void)
unsigned char expect = 0xfe;
t1 = get_cycles();
- for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
- if (!pit_expect_msb(expect))
+ for (i = 0; i < QUICK_PIT_ITERATIONS; i++) {
+ if (!pit_expect_msb(expect--))
goto failed;
}
t2 = get_cycles();
@@ -319,7 +319,7 @@ static unsigned long quick_pit_calibrate(void)
/*
* Make sure we can rely on the second TSC timestamp:
*/
- if (!pit_expect_msb(--expect))
+ if (!pit_expect_msb(expect))
goto failed;
/*
@@ -338,7 +338,6 @@ static unsigned long quick_pit_calibrate(void)
*/
delta = (t2 - t1)*PIT_TICK_RATE;
do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
- printk("Fast TSC calibration using PIT\n");
return delta;
}
failed:
@@ -356,10 +355,42 @@ unsigned long native_calibrate_tsc(void)
int hpet = is_hpet_enabled(), i, loopmin;
local_irq_save(flags);
+ tsc1 = tsc_read_refs(&ref1, hpet);
fast_calibrate = quick_pit_calibrate();
+ tsc2 = tsc_read_refs(&ref1, hpet);
local_irq_restore(flags);
- if (fast_calibrate)
- return fast_calibrate;
+
+ if (fast_calibrate) {
+ /*
+ * Return the fast_calibrate value when neither hpet
+ * nor pmtimer are available.
+ */
+ if (!hpet && !ref1 && !ref2) {
+ printk("Fast TSC calibration using PIT\n");
+ return fast_calibrate;
+ }
+
+ /* Check, whether the sampling was disturbed by an SMI */
+ if (tsc1 == ULLONG_MAX || tsc2 == ULLONG_MAX)
+ goto slowpath;
+
+ tsc2 = (tsc2 - tsc1) * 1000000LL;
+ if (hpet)
+ tsc2 = calc_hpet_ref(tsc2, ref1, ...Guys, please. Show some _taste_. Dammit, stop adding random crap to "native_calibrate_tsc()" and make it look like total and utter SHIT. If you want to do that tsc1 = tsc_read_refs(&ref1, hpet); .. tsc2 = tsc_read_refs(&ref1, hpet); around calibration and comparing it, then do it *once*. Do it over the whole thing. Do it in a function of its own, instead of making this horrible and unreadable mess. This patch may be fine as a "let's check if it works" thing, but please don't send out total SH*T to public lists. Some _tasted_ in programming, please! Linus --
Over which _whole_ thing ? You want to have the very very fast thing, which is not reliable under all circumstances as Alok pointed out and Why not ? We want to figure out if it solves the problem and sending What we apply finally is a totally different thing. Thanks, tglx --
You can move that thing _out_ into a function of its own.
Look at this piece fo CRAP, and tell me, HOW MANY TIMES do you want to
repeat it?
+ /*
+ * Return the fast_calibrate value when neither hpet
+ * nor pmtimer are available.
+ */
+ if (!hpet && !ref1 && !ref2) {
+ printk("Fast TSC calibration using PIT\n");
+ return fast_calibrate;
+ }
+
+ /* Check, whether the sampling was disturbed by an SMI */
+ if (tsc1 == ULLONG_MAX || tsc2 == ULLONG_MAX)
+ goto slowpath;
+
+ tsc2 = (tsc2 - tsc1) * 1000000LL;
+ if (hpet)
+ tsc2 = calc_hpet_ref(tsc2, ref1, ref2);
+ else
+ tsc2 = calc_pmtimer_ref(tsc2, ref1, ref2);
+
+ /* Check the reference deviation */
+ delta = ((u64) fast_calibrate) * 100;
+ do_div(delta, tsc2);
+
+ if (delta >= 90 && delta <= 110) {
+ printk("Fast TSC calibration using PIT\n");
+ return fast_calibrate;
+ }
+ }
Here's a hint: we don't do cut-and-paste programming. And we don't get
extra points for bloating a single function with the same unreadable code
over and over and over again.
How many copies do you want? And here's a hint: the answer is _one_. If
you get any other answer, your patch is SHIT.
Linus
--
I didn't know that sending a test patch which is admittetly not pretty
is a capital crime nowadays.
In future I'll restrict myself to look at such stuff only on Monday to
Friday between 9AM and 5PM and send test/RFC patches only when they
got approved by the nonshitapproval committee, which holds a meeting
once a month.
Yours grumpy,
tglx
--
The thing is, there's a pattern to this. And it has nothing to do with "test patch". See commit fbb16e243887332dd5754e48ffe5b963378f3cd2, and then see what I had to do in ec0c15afb41fd9ad45b53468b60db50170e22346. That wasn't a test patch, was it? I don't want to continually see these patches that are simply adding more and more crap. I want to not have to clean up the end result afterwards. Linus --
Sure. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Linus --
Just for the record, I ran this reboot loop for 50 times, and in 1 of the run the kernel did use fast pit calibration method. These runs were on vanilla 2.6.27-rc5 with this patch (take 2 from Linus) and Ingo's fix. The frequency that was calibrated by this run was 1866.291 Mhz, usually it's about 1866.692 Mhz. Slow method calibrates in the same range. In all the other 49 odd runs it did calibrate using the slow method. I printed the count values to see how much are we missing the fast method by. The maximum count value that I see is 84. In one single reboot run, on an average in about 70 iterations the val returned from pit_expect_msb is > 50, and eventually we hit a condition where the value is < 50 and we bail out of the fast method. So just to be on safer side can we be a little less generous and increase the threshold to somewhere around 75 from 50 ? Or is there a good reason not to ? Thanks, --
Why would you? The reason the single run completed successfully was apparently that no actual virtualization event triggered, so it actually accessed the hardware successfully and without any real slowdown. As shown also by the fact that the actual frequency was correct at the end. The ones that failed presumably all had interrupts that happened in the VM, which then immediately triggered the "uhhuh, there was a bump" thing. IOW, the code worked correctly as designed. It's not a "anti-virtualization" feature per se, it's a "detect when virtualization screws up timing". When virtualization (or SMI etc) does _not_ screw up timing, it all works fine. Linus --
Maybe i was just being too paranoid, the slow calibration method has been known to work correctly even under extreme stress conditions for us and wanted to make sure that we always take that path. Anyways I understand your point so lets stick with this value, if there are any false positives that I see with this i will get back. Thanks, --
Looping for a smaller timeout is really going to strain things for Virtualization. Even on native hardware if you reduce the timeout to less than 10ms it may result in errors in the range of 2500ppm on a 2GHz systems when calibrating against pmtimer/hpet, this is far worse than what NTP can correct, afaik NTP can handle errors only upto 500ppm. And IMHO that is the reason why we had a timeout of 50ms before (since it limits the maximum theoretical error to 500ppm) In addition to that, the pmtimer hardware itself has some drift, typically in the range of 20 to 100ppm. If this drift happens to be in the same direction as the error in measuring the tsc against the pmtimer, we could have a total error of more than 500ppm in the tsc frequency calibration code with the 50ms timeout. So anything less than 50msec of timeout is risky for virtualized environment. If people think that new hardware is good enough to handle these errors with a smaller timeout value thats okay, but for virtualized environment we need to keep these timeouts as they are or increase them. If still there is a pressing need to reduce the timeout, then alternatively, as Arjan suggested, we can ask the hardware (hypervisor) for tsc frequency, and do this only if we are running under a hypervisor ( can't do this for h/w with constant TSC too since its not reliable as mentioned by Linus). This tsc freq from hypervisor, can be implemented using cpuid's if the backend hypervisor supports that (VMware does). Let me know what people think about this and we can work towards a standard interface. Thanks, --
Can you check the patch I just sent out? It loops for a _very_ short timeout, but on the other hand it should also absolutely immediately notice that it's getting the wrong expected values under virtualization, and the fast case will then fail early. It then falls back on the slow case, but I don't think you can avoid that I would not mind at all having the more precise thing happen _later_, especially if we can do it incrementally. One of the problems with the TSC calibration is that we need it fairly early (for things like usleep()), and it needs to be in the right ballpark. It definitely does not need to be in the parts-per-million range, it needs to be in the "within a few percent" range. (To make matters worse, the TSC isn't then even used in practice for real-time clocks, because of variable frequency and/or halting in idle states. So the actual real-time clock will actually be based on HPET or PM_TIMER anyway most of the time). Linus --
Yep tried that, the fast calibration failed in 4-5 reboots that i did That's fine i think the slow calibration works well for us and we can live with the added 200msec delay while booting. I am assuming the patch you just sent was for 2.6.27. And that we won't be changing the timeout's after this patch. So, can we also apply the 2 patches that I sent yesterday on top of this one ? the one that calibrated against pmtimer/hpet over the full loop and the one that fixed the warning message. Both of these will affect the slower case, which is now not the common case when running on native hardware. In the long run though, I think to get rid of all this complexity, we can think of getting the frequency from the hypervisor. Lets see, i will try making a patch maybe early next week and send it for the x86 tree. Thanks, --
That should go along with a paravirtualized (virtio) timesource. -hpa --
Timekeeping has always been a snarly issue fraught with peril when testing on new hardware. Are we confident enough that this will DTRT on almost every platform, including all the busticated ones? Seems dangerous to be merging it for .27 when we're already past -rc5. But if Ingo and Thomas want to merge it, it's OK with me as long as they take the blame for it. :)
