On the x86 platform we can use the value of cpu_khz computed during tsc calibration
to calculate the loops_per_jiffy value. It's very important to keep the error in
lpj values to a minimum, as any error there may result in a kernel panic in
check_timer.
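
The calculation described here is a unit conversion: kHz times 1000 gives ticks per second, and dividing by HZ gives ticks per jiffy. A minimal sketch follows. The helper name lpj_from_khz is hypothetical, HZ is passed as a parameter purely for illustration, and the sketch assumes one delay-loop iteration per counter tick.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: derive a
 * loops-per-jiffy value from a calibrated frequency in kHz.
 * Assumption: the delay loop advances once per counter tick.
 */
static inline unsigned long lpj_from_khz(unsigned int khz, unsigned int hz)
{
	/*
	 * khz * 1000 is ticks per second; dividing by hz yields ticks
	 * per jiffy. The 64-bit intermediate keeps the multiplication
	 * from overflowing where unsigned long is only 32 bits wide.
	 */
	return (unsigned long)(((uint64_t)khz * 1000u) / hz);
}
```

For instance, a 2,000,000 kHz clock with HZ=1000 gives 2,000,000 loops per jiffy.
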
In a virtualization environment on a highly overloaded host, the guest delay
calibration may sometimes result in errors beyond the ~50% that timer_irq_works
can handle, resulting in the guest panicking.

This change could also help large MP systems reduce their boot time.

The patch also makes some formatting changes to the lpj_setup code so that a
single printk prints the calculated bogomips value.

On top of current git.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Index: linux-2.6/arch/x86/kernel/time_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/time_64.c 2008-05-23 16:56:24.000000000 -0700
+++ linux-2.6/arch/x86/kernel/time_64.c 2008-06-03 17:28:46.000000000 -0700
@@ -123,6 +123,8 @@
(boot_cpu_data.x86_vendor == X86_VENDOR_AMD))
cpu_khz = calculate_cpu_khz();

+ lpj_tsc = ((unsigned long)cpu_khz * 1000)/HZ;
+
if (unsynchronized_tsc())
mark_tsc_unstable("TSCs unsynchronized");

Index: linux-2.6/include/linux/delay.h
===================================================================
--- linux-2.6.orig/include/linux/delay.h 2008-05-23 16:56:24.000000000 -0700
+++ linux-2.6/include/linux/delay.h 2008-06-03 17:28:46.000000000 -0700
@@ -41,6 +41,7 @@
#define ndelay(x) ndelay(x)
#endif

+extern unsigned long lpj_tsc;
void calibrate_delay(void);
void msleep(unsigned int msecs);
unsigned long msleep_interruptible(unsigned int msecs);
Index: linux-2.6/init/calibrate.c
===================================================================
--- linux-2.6.orig/init/calibrate.c 2008-05-23 16:56:24.000000000 -0700
+++ linux-2.6/init/calibrate.c 2008-06-03 17:28:46.000000000 -0700
@@ -9,6 +9,7 @@
#include <linux/init.h>
#include <...
that won't work very well as lpj_tsc is not declared.
Ingo
--
lpj_tsc is declared in include/linux/delay.h, and that declaration is
available in tsc_32.c.

I compile tested against current git too and it seems fine to me.
Also, this patch is not to be applied to any tree yet; i will make a
different patch which calculates lpj_tsc from tsc_khz instead and uses it
only for the boot processor.
There were some concerns from Arjan and Pavel regarding cpu frequency
being different on different cpus. I was waiting to hear back from them
before sending a new patch.

Let me know your comments on my questions/suggestions from the earlier
post
http://marc.info/?l=linux-kernel&m=121269125317375&w=2

Thanks,
--
No. Some cpus do one loop per tick, some do two loops per tick, and
there are probably weird cpus, too.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
On Tue, 03 Jun 2008 17:41:09 -0700
can you guarantee that the rate tsc ticks at is the same as the current
CPU frequency? Answer: You can't....

sadly we do need to calibrate this...
In addition, clearly you can have different cpus in a system run at a
different rate (both in terms of cpu_khz and, independently, in terms
of tsc rate)

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
I think at boot time at least we can assume that, no?
If you are referring to the cpu frequency changing at run time (aka
dynamic frequency scaling),
in that case the time_cpufreq_notifier should take care of updating
the loops_per_jiffy.

Again, yes, at run time frequencies may change, but they shouldn't at boot time.
AFAIK, there are no x86 MP systems with asymmetric cpus,
i.e. systems with different
base frequencies. If that's not true then there sure is a problem.

Thanks,
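
The run-time case mentioned above amounts to a proportional rescale of a reference lpj value when the frequency changes. A minimal sketch follows, assuming delay-loop speed scales linearly with the core clock. scale_lpj is a hypothetical name for illustration and is not the kernel's notifier code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: rescale a loops-per-jiffy value that was
 * measured at ref_khz so that it applies at new_khz.
 * Assumption: loop speed is proportional to the clock frequency.
 */
static inline unsigned long scale_lpj(unsigned long ref_lpj,
				      unsigned int ref_khz,
				      unsigned int new_khz)
{
	/* Multiply first in 64 bits to keep precision and avoid overflow. */
	return (unsigned long)(((uint64_t)ref_lpj * new_khz) / ref_khz);
}
```

For example, an lpj of 7315312 measured at 1828828 kHz rescales to 4000000 at 1000000 kHz.
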
--
On Tue, 3 Jun 2008 21:01:54 -0700
Nope, absolutely not.
1) The rate TSC ticks may or may not be the maximum frequency (usually
is, but no guarantee)
2) The BIOS might not boot your system at the maximum frequency (think
there's nothing that guarantees this. (Well maybe Dell's website does
because they want to sell you 2 expensive cpus rather than 1 cheap 1
--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
Hi Arjan,
I have modified the patch to use the pre-calculated value only for the
boot processor.

Please have a look.
Patch is on top of today's git.

--
On the x86 platform we can use the value of tsc_khz computed during tsc calibration
to calculate the loops_per_jiffy value. It's very important to keep the error in
lpj values to a minimum, as any error there may result in a kernel panic in
check_timer.
In a virtualization environment, on a highly overloaded host, the guest delay
calibration may sometimes result in errors beyond the ~50% that timer_irq_works
can handle, resulting in the guest panicking.

Also makes some formatting changes to the lpj_setup code so that a single printk
prints the bogomips value.

We do this only for the boot processor because the APs can have different
base frequencies or the BIOS might boot an AP at a different frequency.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Index: linux-2.6/arch/x86/kernel/time_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/time_64.c 2008-06-09 10:19:20.000000000 -0700
+++ linux-2.6/arch/x86/kernel/time_64.c 2008-06-19 17:22:53.000000000 -0700
@@ -123,6 +123,8 @@
(boot_cpu_data.x86_vendor == X86_VENDOR_AMD))
cpu_khz = calculate_cpu_khz();

+ lpj_tsc = ((unsigned long)tsc_khz * 1000)/HZ;
+
if (unsynchronized_tsc())
mark_tsc_unstable("TSCs unsynchronized");

Index: linux-2.6/include/linux/delay.h
===================================================================
--- linux-2.6.orig/include/linux/delay.h 2008-06-09 10:19:20.000000000 -0700
+++ linux-2.6/include/linux/delay.h 2008-06-19 17:22:28.000000000 -0700
@@ -41,6 +41,7 @@
#define ndelay(x) ndelay(x)
#endif

+extern unsigned long lpj_tsc;
void calibrate_delay(void);
void msleep(unsigned int msecs);
unsigned long msleep_interruptible(unsigned int msecs);
Index: linux-2.6/init/calibrate.c
===================================================================
--- linux-2.6.ori...
i don't think the message should be eliminated from default logs via
making it KERN_DEBUG - this message is a common pattern people watch out
for and it does give an indication about various sorts of
timing/performance trouble.

so that the string remains on a single line.
Ingo
--
Ok, I have changed the printks to KERN_INFO.
--
On the x86 platform we can use the value of tsc_khz computed during tsc calibration
to calculate the loops_per_jiffy value. It's very important to keep the error in
lpj values to a minimum, as any error there may result in a kernel panic in
check_timer.
In a virtualization environment, on a highly overloaded host, the guest delay
calibration may sometimes result in errors beyond the ~50% that timer_irq_works
can handle, resulting in the guest panicking.

Also makes some formatting changes to the lpj_setup code so that a single printk
prints the bogomips value.

We do this only for the boot processor because the APs can have different
base frequencies or the BIOS might boot an AP at a different frequency.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Index: linux-2.6/arch/x86/kernel/time_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/time_64.c 2008-06-09 10:19:20.000000000 -0700
+++ linux-2.6/arch/x86/kernel/time_64.c 2008-06-19 17:22:53.000000000 -0700
@@ -123,6 +123,8 @@
(boot_cpu_data.x86_vendor == X86_VENDOR_AMD))
cpu_khz = calculate_cpu_khz();

+ lpj_tsc = ((unsigned long)tsc_khz * 1000)/HZ;
+
if (unsynchronized_tsc())
mark_tsc_unstable("TSCs unsynchronized");

Index: linux-2.6/include/linux/delay.h
===================================================================
--- linux-2.6.orig/include/linux/delay.h 2008-06-09 10:19:20.000000000 -0700
+++ linux-2.6/include/linux/delay.h 2008-06-19 17:22:28.000000000 -0700
@@ -41,6 +41,7 @@
#define ndelay(x) ndelay(x)
#endif

+extern unsigned long lpj_tsc;
void calibrate_delay(void);
void msleep(unsigned int msecs);
unsigned long msleep_interruptible(unsigned int msecs);
Index: linux-2.6/init/calibrate.c
===================================================================
--- linux-2.6.orig/init/calibrate.c 2008-06-09 10:19:20.000000000 -0700
+++ linux-2.6/init/calibrate.c 2008-06-20 14:16:27.00000...
How did you address the 'khz has nothing to do with loops per jiffy'
comment?

Some cpus can do a loop in one cycle, some need two cycles, some need half.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Hi Pavel,
When you say loops per jiffy has nothing to do with khz, by khz you
mean the cpu frequency, right?

AFAIU in calibrate_delay_direct too we measure the amount by which the timer
has ticked until DELAY_CALIBRATION_TICKS jiffies have passed.
So IMO the code there too assumes that there is one loop per timer
cycle?

If that's not the case then i fail to understand how the current
code figures out how many loops occur in a cycle.

Thanks,
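
The measurement described above can be illustrated with a small user-space simulation. This is a sketch under stated assumptions, not the kernel's calibrate_delay_direct: the counter, the jiffy source and every SIM_* constant are made up, and the real routine's retries and consistency checks are left out.

```c
/* All values below are assumptions chosen for the simulation. */
#define SIM_TIMER_HZ      1000000UL  /* simulated counter frequency */
#define SIM_HZ            250UL      /* simulated jiffy rate */
#define SIM_STEP          7UL        /* counter advance per read */
#define CALIBRATION_TICKS 10UL       /* jiffies to measure across */

static unsigned long sim_counter;

/* Stand-in for reading a free-running hardware timer. */
static unsigned long sim_read_timer(void)
{
	sim_counter += SIM_STEP;
	return sim_counter;
}

/* Stand-in for jiffies, derived from the same simulated clock. */
static unsigned long sim_jiffies(void)
{
	return sim_counter / (SIM_TIMER_HZ / SIM_HZ);
}

/* Count timer ticks across CALIBRATION_TICKS jiffies and average. */
static unsigned long measure_timer_ticks_per_jiffy(void)
{
	unsigned long start, end, edge;

	/* Spin until jiffies changes so the window starts on a jiffy edge. */
	edge = sim_jiffies();
	do {
		start = sim_read_timer();
	} while (sim_jiffies() == edge);

	/* Keep reading until CALIBRATION_TICKS more jiffies have passed. */
	edge = sim_jiffies();
	do {
		end = sim_read_timer();
	} while (sim_jiffies() < edge + CALIBRATION_TICKS);

	return (end - start) / CALIBRATION_TICKS;
}
```

With these constants the result lands within one tick of SIM_TIMER_HZ / SIM_HZ = 4000. Nothing in the sketch counts delay-loop iterations: the result is in timer ticks, so it equals loops per jiffy only if the delay routine itself waits on the same timer.
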
--
On my machine, it reports:
delay using timer specific routine.. 3661.98 BogoMIPS (lpj=7323971)
...
Detected 1828.828 MHz processor.

(/proc/cpuinfo)
model name : Genuine Intel(R) CPU T2400 @ 1.83GHz
...
cpu MHz : 1000.000
...
bogomips : 3657.54

So you'd break it by setting lpj (aka bogomips) to cpu_khz, right?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
We are not setting it to cpu_khz but to tsc_khz, i am assuming that in
this case tsc_khz will be different than cpu_khz. Can you please mail me
the full dmesg log.

Thanks,
--
Yes, but neither cpu_khz nor tsc_khz will be 3657 bogoMips, right?
Anyway, here are my boot messages.
Pavel

Linux version 2.6.26-rc8 (pavel@amd) (gcc version 4.1.3 20071209 (prerelease) (Debian 4.1.2-18)) #310 SMP Thu Jun 26 00:09:08 CEST 2008
PAT disabled. Not yet verified on this CPU type.
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007f6d0000 (usable)
BIOS-e820: 000000007f6d0000 - 000000007f6df000 (ACPI data)
BIOS-e820: 000000007f6df000 - 000000007f700000 (ACPI NVS)
BIOS-e820: 000000007f700000 - 0000000080000000 (reserved)
BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved)
1142MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at [c00f67f0] 000f67f0
Entering add_active_range(0, 0, 521936) 0 entries of 256 used
Zone PFN ranges:
DMA 0 -> 4096
Normal 4096 -> 229376
HighMem 229376 -> 521936
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
0: 0 -> 521936
On node 0 totalpages: 521936
DMA zone: 32 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 4064 pages, LIFO batch:0
Normal zone: 1760 pages used for memmap
Normal zone: 223520 pages, LIFO batch:31
HighMem zone: 2286 pages used for memmap
HighMem zone: 290274 pages, LIFO batch:31
Movable zone: 0 pages used for memmap
DMI present.
ACPI: RSDP 000F67C0, 0024 (r2 LENOVO)
ACPI: X...
Hi Pavel,
Thanks for the dmesg. The HZ value that you are using on this system is
250, right?

We are dividing by HZ over here. So you are right in saying that tsc_khz
won't be equal to bogoMips, but lpj_fine will surely be computed correctly
since we do consider the HZ value.

Please let me know if you still have any doubts.
Or can i safely assume that you will ACK the patch ;-)

Thanks,
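
The numbers in this exchange can be checked by hand. The sketch below uses the integer arithmetic that the boot-time BogoMIPS message is understood to use. The helper names are hypothetical, HZ=250 is the value assumed in the message above, and taking tsc_khz to be the "Detected 1828.828 MHz" figure from Pavel's log is also an assumption.

```c
/* Whole and two-digit fractional part of a BogoMIPS figure. */
struct bogomips {
	unsigned long whole;
	unsigned long frac;
};

/* Hypothetical helper: convert loops per jiffy to BogoMIPS for a given HZ. */
static struct bogomips lpj_to_bogomips(unsigned long lpj, unsigned int hz)
{
	struct bogomips b;

	b.whole = lpj / (500000 / hz);        /* integer part */
	b.frac = (lpj / (5000 / hz)) % 100;   /* two decimal digits */
	return b;
}

/* Hypothetical helper: loops per jiffy from a TSC frequency in kHz. */
static unsigned long lpj_from_tsc_khz(unsigned int tsc_khz, unsigned int hz)
{
	return (unsigned long)(((unsigned long long)tsc_khz * 1000ULL) / hz);
}
```

With HZ=250, the reported lpj=7323971 gives 3661.98, matching Pavel's boot line. A tsc_khz of 1828828 gives lpj 7315312, which is 3657.65 BogoMIPS, roughly 0.1% away from the calibrated value.
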
--
Well, I'm not expert-enough in this subsystem (nor comfortable enough
with your code) to ACK it, sorry.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
this needed the fix below.
but there's another problem as well: why are generic files
(init/calibrate.c and include/linux/delay.h) using something that is
named in an x86-specific way - lpj_tsc ? (TSC is an x86 concept)

Ingo
------------>
commit 5cd5a41ea6f4363b03156e2208dd0d2266f0d67d
Author: Ingo Molnar <mingo@elte.hu>
Date: Tue Jun 24 01:19:49 2008 +0200

x86: fix "x86: use cpu_khz for loops_per_jiffy calculation"

fix:
arch/x86/kernel/tsc_32.c: In function ‘tsc_init’:
arch/x86/kernel/tsc_32.c:421: error: ‘lpj_tsc’ undeclared (first use in this function)
arch/x86/kernel/tsc_32.c:421: error: (Each undeclared identifier is reported only once
arch/x86/kernel/tsc_32.c:421: error: for each function it appears in.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
index 4adac0d..bfb9193 100644
--- a/arch/x86/kernel/tsc_32.c
+++ b/arch/x86/kernel/tsc_32.c
@@ -1,6 +1,7 @@
#include <linux/sched.h>
#include <linux/clocksource.h>
#include <linux/workqueue.h>
+#include <linux/delay.h>
#include <linux/cpufreq.h>
#include <linux/jiffies.h>
#include <linux/init.h>
--
calibrate_delay_direct was using some variables with "tsc" as the prefix
(tsc_rate_min/max), so i thought of using lpj_tsc. And also IMO,
lpj_tsc explains how this variable is initialized. But thinking about
it, maybe we should rename it to "lpj_timer"?

Also, i still haven't got to testing today's tip tree in my environment,
will let you know as soon as i have done it.

Thanks,
--
ok. But instead of 'lpj_timer' i'd suggest to use something like
'lpj_fine' - as this really is about finegrained measurements.

I'd suggest a delta patch against tip/master that renames all those
tsc_* variables to fine_*. So tsc_rate_min would become fine_rate_min,
etc.

Ingo
--
Ingo, tsc_rate_min etc are related to the timer rate, so calling it
fine_rate_min would be confusing IMO. Instead i call it
timer_rate_min.
Below is the patch on tip/master.
Tested in both 32 and 64 bit environments, tree works fine for me.

--
As suggested by Ingo, remove all references to tsc from init/calibrate.c

TSC is x86 specific, and using tsc in variable names in a generic file should
be avoided. lpj_tsc is now called lpj_fine, since it is related to fine tuning
of lpj value. Also tsc_rate_* is called timer_rate_*

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Index: linux-x86-tree.git/arch/x86/kernel/time_64.c
===================================================================
--- linux-x86-tree.git.orig/arch/x86/kernel/time_64.c 2008-06-23 17:24:07.000000000 -0700
+++ linux-x86-tree.git/arch/x86/kernel/time_64.c 2008-06-23 17:50:51.000000000 -0700
@@ -123,7 +123,7 @@
(boot_cpu_data.x86_vendor == X86_VENDOR_AMD))
cpu_khz = calculate_cpu_khz();

- lpj_tsc = ((unsigned long)tsc_khz * 1000)/HZ;
+ lpj_fine = ((unsigned long)tsc_khz * 1000)/HZ;

if (unsynchronized_tsc())
mark_tsc_unstable("TSCs unsynchronized");
Index: linux-x86-tree.git/arch/x86/kernel/tsc_32.c
===================================================================
--- linux-x86-tree.git.orig/arch/x86/kernel/tsc_32.c 2008-06-23 17:24:07.000000000 -0700
+++ linux-x86-tree.git/arch/x86/kernel/tsc_32.c 2008-06-23 17:50:58.000000000 -0700
@@ -419,7 +419,7 @@

lpj = ((u64)tsc_khz * 1000);
do_div(lpj, HZ);
- lpj_tsc = lpj;
+ lpj_fine = lpj;

/* now allow native_sched_clock() to use rdtsc */
tsc_disabled = 0;
Index: linux-x86-tree.git/include/linux/delay.h
===================================================================
--- linux-x86-tree.git.orig/include/linux/delay.h 2008-06-23 17:24:26.000000000 -0700
+++ linux-x86-tree.git/include/linux/delay.h 2008-06-23 17:51:08.000000000 -0700
@@ -41,7 +41,7 @@
#define ndelay(x) ndelay(x)
#endif

-extern unsigned ...
applied to tip/x86/delay - thanks Alok.
could you check whether tip/master (which now includes your changes as
well) works as expected in your test environment? I had to do a conflict
resolution in tsc_32.c, i hope i got it right. You can pick it up via:

http://people.redhat.com/mingo/tip.git/README

Ingo
--
So then, how about using the tsc_khz (tsc frequency) for this
calculation? At least for the boot processor the lpj value can be derived
from the tsc frequency.
So instead of using lpj_tsc for all cpus we can use it just for the boot

Agreed, so we might not be able to use it for other cpus.
Is there a way to get the cpu frequency of the processor that we are
bringing up? I see that there is cpufreq_quick_get, but this would be
initialized very late in the boot process.

If there is a way we can check if the cpu being brought up is the same as
the last one, then we can skip the delay calibration, something like
what ia64 does.

Thanks,
Alok
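
The boot-processor-only idea in this message can be sketched as a simple selection. This is an illustration only: choose_lpj and stub_calibrate are hypothetical names, CPU 0 is assumed to be the boot processor, and the stub stands in for whatever measured calibration the kernel would otherwise run.

```c
/* Type of a routine that measures loops per jiffy on the current CPU. */
typedef unsigned long (*calibrate_fn)(void);

/* Stand-in for a real measured calibration; the value is made up. */
static unsigned long stub_calibrate(void)
{
	return 4000000UL;
}

/*
 * Hypothetical selection logic: reuse the TSC-derived value only on
 * the boot processor (assumed to be CPU 0) and only if one was
 * computed; every other CPU falls back to a measured calibration.
 */
static unsigned long choose_lpj(int cpu, unsigned long lpj_tsc,
				calibrate_fn calibrate)
{
	if (cpu == 0 && lpj_tsc != 0)
		return lpj_tsc;

	return calibrate();
}
```

Calling choose_lpj(0, 7315312UL, stub_calibrate) returns the precomputed value, while the same call for CPU 1 returns whatever the calibration routine measures.
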
