getcpu() returns the caller's current CPU number. On 2.6.26 running on
x86_64, there are two VDSO implementations: store it in the AUX
register read by RDTSCP; or, if the processor does not support RDTSCP,
store it in a GDT segment limit. Dean Gaudet and Nathan Laredo also
suggest using the IDT's limit. Call these GDT, TSCP, and SIDT.
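
For reference, here is a minimal sketch of the interface being timed,
going through the raw system call rather than the VDSO. The helper
names raw_getcpu() and cpu_only() are made up for illustration, and
passing NULL for the third (cache) argument is an assumption of this
sketch, not something the measurements depend on.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>  /* SYS_getcpu */
#include <unistd.h>       /* syscall() */

/* Hypothetical helper: kernel-style call that fills in CPU and node.
 * This takes the slow system-call path, not the VDSO fast path.
 * node may be NULL if the caller does not need it. */
static int raw_getcpu(unsigned *cpu, unsigned *node)
{
    assert(cpu != NULL);
    return (int)syscall(SYS_getcpu, cpu, node, NULL);
}

/* Hypothetical helper: sched_getcpu()-style shape, returning only the
 * CPU number, or -1 on failure. */
static int cpu_only(void)
{
    unsigned cpu = 0;

    if (raw_getcpu(&cpu, NULL) != 0)
        return -1;
    return (int)cpu;
}
```

The second helper shows why a specialized CPU-only entry point can skip
work: it never has to produce the node number.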

The cost of reading the CPU number can be reduced significantly across
a variety of platforms. Suggestions: eliminate the per-call
architecture check; use SIDT to hold the CPU and node number; cache the
result; split the VDSO into red-zone and no-red-zone areas; streamline
cache checks in the getcpu() code; and provide a specialized
sched_getcpu().
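
To make "cache the result" concrete, here is a rough sketch of a cached
read. It assumes the cache is valid only while a tick counter (a
jiffies-style timestamp) is unchanged; struct cpu_cache, cached_cpu(),
and the fake_read() stand-in are all hypothetical names, not the
kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread cache: the stored CPU number is trusted only
 * while `tick` equals the current tick. */
struct cpu_cache {
    unsigned long tick;
    unsigned cpu;
    int valid;
};

typedef unsigned (*cpu_reader)(void);

/* Stand-in for the expensive read (GDT, TSCP, or SIDT in the post);
 * it counts how often the slow path runs. */
static unsigned slow_reads;
static unsigned fake_read(void)
{
    slow_reads++;
    return 3;
}

static unsigned cached_cpu(struct cpu_cache *c, unsigned long now,
                           cpu_reader read_slow)
{
    unsigned cpu;

    assert(read_slow != NULL);
    if (c != NULL && c->valid && c->tick == now)
        return c->cpu;            /* fast path: cache hit */

    cpu = read_slow();            /* slow path: read the register */
    if (c != NULL) {
        c->tick = now;
        c->cpu = cpu;
        c->valid = 1;
    }
    return cpu;
}
```

Two calls within the same tick run the slow read once; a call in a
later tick refills the cache.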

Result: on various x86_64 platforms, reading the CPU number drops from
about 30-100 cycles to 4-21 cycles.

I do not yet have a patch. I would like folks to (a) comment; and (b)
try the attached microbenchmark on various machines to see if there
are any machines where something is faster than SIDT.

TESTS AND DATA

I ran timing tests that "fake" the user-space instruction sequence for
various VDSO-based getcpu() and sched_getcpu() implementations. I ran
the tests on seven kinds of Intel and AMD platforms. Each sequence
was measured individually (rather than averaging N runs). Best and
median costs of 1000 runs were recorded. An empty sequence was also
measured and that cost subtracted from each of the other runs, so a
reported "20 cycles" is "20 cycles more than the empty sequence."
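
The reporting step described above can be sketched roughly as follows.
This is not the attached microbenchmark: summarize() and struct cost
are hypothetical names, and the sketch assumes a single empty-sequence
cost is subtracted from both the best and the median sample.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Ascending comparison for qsort() over cycle counts. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/* Signed, because a noisy machine can report less than the empty
 * sequence. */
struct cost {
    int64_t best;
    int64_t median;
};

/* Sorts `samples` in place, then reports best and median cost relative
 * to the empty sequence. For even n this takes the upper median. */
static struct cost summarize(uint64_t *samples, size_t n,
                             uint64_t empty_cost)
{
    struct cost c;

    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    c.best = (int64_t)samples[0] - (int64_t)empty_cost;
    c.median = (int64_t)samples[n / 2] - (int64_t)empty_cost;
    return c;
}
```

With samples {50, 42, 47, 90, 44} and an empty-sequence cost of 40,
this reports a best of 2 and a median of 7.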

A first test is the "raw" cost of just the machine instructions to
read the special register. For SIDT, the IDT limit holds the value
offset by 0x1000 and the machine instruction saves it to memory, so
reading it back takes a load and a subtract. The SIDT cost reported
here is conservative in that it includes the load and subtract, which
are sometimes eliminated in getcpu()/sched_getcpu(). Note machine E is
based on a P4-microarchitecture processor, which is typically hard to
measure accurately, hence some reported costs for E are as low as 0
cycles.
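
The load and subtract mentioned above can be sketched as follows. The
sketch decodes the 10 bytes that SIDT stores in 64-bit mode (a 2-byte
limit followed by an 8-byte base) and does not execute SIDT itself. It
assumes the limit is simply the CPU number plus 0x1000; how the node
number is packed is not shown. cpu_from_sidt_image() is a hypothetical
name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IDT_LIMIT_OFFSET 0x1000u   /* offset stated in the post */

/* `buf` is the memory image written by SIDT in 64-bit mode: bytes 0-1
 * are the limit (little-endian), bytes 2-9 are the base address.
 * Assumption: limit == CPU number + 0x1000. */
static unsigned cpu_from_sidt_image(const uint8_t buf[10])
{
    unsigned limit;

    assert(buf != NULL);
    limit = (unsigned)buf[0] | ((unsigned)buf[1] << 8);  /* the load */
    return limit - IDT_LIMIT_OFFSET;                     /* the subtract */
}
```

For example, an image whose first two bytes are 0x03 0x10 has a limit
of 0x1003 and decodes to CPU 3.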
--- BEST --- -- MEDIAN --
GDT TSCP SIDT GDT ...

[Re-post as the test program attached before was an old one with a bad
field size.]

Without a vsyscall the cache probably doesn't make too much sense
because once you're in the kernel reading the real CPU number is
really cheap.

I agree with you that the cache should be enabled on all vDSO
implementations (that is what my original code did).

Also the TSCP version could probably go.

I'm still not sure why you say no redzone is that expensive. Do you
have numbers? I know it's a few instructions, but it shouldn't

Yes, unfortunately glibc didn't choose the same interface as the kernel
for this. I still don't know why. But now since we're in this mess
specializing for the glibc implementation is probably a good idea. Or
just add getcpu() to glibc :)

-Andi
--
ak@linux.intel.com

it depends on the processor involved and the kernel config options --
i.e. if frame pointers are enabled then the stack frame guarantees a
store operation (push rbp) and on processors which do memops in-order
this delays the other memops in the vsyscall (i.e. testing the cache or
executing SIDT). it was 2 or 3 cycles difference in most cases iirc.

-dean

Ok, frame pointers are always a performance disaster on some CPUs.
Perhaps they should just be unconditionally disabled for vsyscall.c
and the vdso.

-Andi
--
ak@linux.intel.com

>[Maybe disable frame pointers for vsyscall.c and the vdso?]

IIRC, some vsyscall.c code needs them enabled, so Dean's earlier patch
split vsyscall.c, creating a vsyscall_user.c for code which can run
without them. Seem reasonable?
