Re: Faster getcpu() and sched_getcpu()

From: Pardo
Date: Tuesday, September 23, 2008 - 12:09 pm

getcpu() returns a caller's current core number.  On 2.6.26 running on
x86_64, there are two VDSO implementations: store it in TSCP's AUX
register; or, if the processor does not support TSCP, store it in
GDT's limit.  Dean Gaudet and Nathan Laredo also suggest using IDT's
limit.  Call these GDT, TSCP, and SIDT.
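
To make the "CPU and node number in one register" idea concrete, here is a minimal sketch of splitting a packed value once it has been read. It assumes the layout I believe the 2.6.26 x86_64 vgetcpu code uses (CPU number in the low 12 bits, node number in the bits above). The helper names pack_cpu_node() and unpack_cpu_node() are made up for illustration and are not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: CPU number in bits 0-11, node number in bits 12 and up. */
#define CPU_BITS 12u
#define CPU_MASK ((1u << CPU_BITS) - 1u)

/* Hypothetical helper: build the value the kernel would store. */
static inline uint32_t pack_cpu_node(uint32_t cpu, uint32_t node)
{
    return (node << CPU_BITS) | (cpu & CPU_MASK);
}

/* Hypothetical helper: split a value read back in user space.
 * Either output pointer may be NULL, as with getcpu(). */
static inline void unpack_cpu_node(uint32_t p, unsigned *cpu, unsigned *node)
{
    if (cpu)
        *cpu = p & CPU_MASK;
    if (node)
        *node = p >> CPU_BITS;
}
```

The decode step is the same whichever register supplied the value, so only the read itself differs between GDT, TSCP, and SIDT.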

The cost of reading the CPU number can be reduced significantly across
a variety of platforms.  Suggestions: eliminate the per-call architecture
check; use SIDT to hold the CPU and node number; cache the result;
split the VDSO into red-zone and no-red-zone areas; streamline cache
checks in getcpu() code; and provide a specialized sched_getcpu().
Result: on various x86_64 platforms, reading the CPU number drops from
about 30-100 cycles to 4-21 cycles.

I do not yet have a patch.  I would like folks to (a) comment; and (b)
try the attached microbenchmark on various machines to see if there
are any machines where something is faster than SIDT.

TESTS AND DATA

I ran timing tests that "fake" the user-space instruction sequence for
various VDSO-based getcpu() and sched_getcpu() implementations.  I ran
the tests on seven kinds of Intel and AMD platforms.  Each sequence
was measured individually (rather than averaging N runs).  Best and
median costs of 1000 runs were recorded.  An empty sequence was also
measured and that cost subtracted from each of the other runs, so a
reported "20 cycles" is "20 cycles more than the empty sequence."
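
The reduction described above can be sketched roughly as follows. This is not the attached microbenchmark; it only shows the bookkeeping, under two assumptions of mine: the samples are raw cycle counts already collected one per run, and the empty sequence's best and median are subtracted from the measured sequence's best and median respectively. The names struct cost, summarize(), and net_cost() are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

struct cost {
    int64_t best;    /* smallest sample */
    int64_t median;  /* middle sample after sorting */
};

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and reports best and median.
 * n must be greater than 0; for even n the upper middle element is used. */
static struct cost summarize(uint64_t *samples, size_t n)
{
    struct cost c;

    qsort(samples, n, sizeof(samples[0]), cmp_u64);
    c.best = (int64_t)samples[0];
    c.median = (int64_t)samples[n / 2];
    return c;
}

/* Cost of a sequence over and above the empty sequence. */
static struct cost net_cost(struct cost seq, struct cost empty)
{
    struct cost c = { seq.best - empty.best, seq.median - empty.median };
    return c;
}
```

With 1000 samples per sequence, summarize() would be called once for the empty sequence and once for each measured sequence, and net_cost() gives the figures that go in the table.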

A first test is the "raw" cost of just the machine instructions to
read the special register.  SIDT holds the value offset by 0x1000 and
the machine instruction saves it to memory.  The SIDT cost reported
here is conservative in that it includes a load and subtract which are
sometimes eliminated in getcpu()/sched_getcpu().  Note machine E is
based on a P4-microarchitecture processor, which is typically hard to
measure accurately, hence some reported costs for E are as low as 0
cycles.

    --- BEST ---    -- MEDIAN --
    GDT TSCP SIDT   GDT ...
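
For readers unfamiliar with SIDT, here is a rough sketch of the "store to memory, then load and subtract" step. It assumes the proposed scheme, in which the kernel would set the IDT limit to 0x1000 plus the value; a stock kernel does not do this, so the decoded number is only meaningful with such a change. The names desc_ptr_image, read_idtr(), and value_from_idt_image() are illustrative. Note that on newer CPUs with UMIP enabled, user-mode SIDT may fault or return a placeholder.

```c
#include <stdint.h>

#define IDT_LIMIT_OFFSET 0x1000u  /* offset named in the text above */

/* Memory image written by SIDT in 64-bit mode: 2-byte limit, 8-byte base. */
struct __attribute__((packed)) desc_ptr_image {
    uint16_t limit;
    uint64_t base;
};

#if defined(__x86_64__) && defined(__GNUC__)
/* SIDT has no register form; it always stores its result to memory. */
static inline void read_idtr(struct desc_ptr_image *d)
{
    __asm__ volatile("sidt %0" : "=m"(*d));
}
#endif

/* The load and subtract: recover the stored value from the image. */
static inline unsigned value_from_idt_image(const struct desc_ptr_image *d)
{
    return (unsigned)d->limit - IDT_LIMIT_OFFSET;
}
```

The store followed by a dependent load is why the SIDT figures are described as conservative: a caller that only needs to compare against a cached limit can skip the subtract.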
From: Pardo
Date: Tuesday, September 23, 2008 - 12:48 pm

[Re-post as the test program attached before was an old one with a bad
field size.]

From: Andi Kleen
Date: Sunday, September 28, 2008 - 9:42 am

Without a vsyscall the cache probably doesn't make too much sense
because once you're in the kernel reading the real CPU number is really
cheap.

I agree with you that the cache should be enabled on all vDSO implementations
(that is what my original code did)
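
A minimal sketch of the kind of cache being discussed, assuming it is keyed on a coarse tick counter (something like jiffies) so the slow register read happens at most once per tick per thread. The struct layout and all names here are hypothetical, and fake_slow_read() is only a stand-in for the real GDT/TSCP/SIDT read so the sketch can be exercised.

```c
#include <stdint.h>

/* Hypothetical per-thread cache: the packed value and the tick it was read at. */
struct cpu_cache {
    uint64_t tick;
    uint32_t packed;
    int valid;
};

typedef uint32_t (*slow_read_fn)(void);

/* Stand-in for the expensive read; counts how often it is called. */
static unsigned slow_reads;
static uint32_t fake_slow_read(void)
{
    slow_reads++;
    return 0x1003;  /* arbitrary example value */
}

/* Return the cached value if it was filled during the current tick,
 * otherwise do the slow read and refill. A NULL cache disables caching. */
static inline uint32_t cached_read(struct cpu_cache *c, uint64_t now_tick,
                                   slow_read_fn slow_read)
{
    uint32_t p;

    if (c && c->valid && c->tick == now_tick)
        return c->packed;

    p = slow_read();
    if (c) {
        c->tick = now_tick;
        c->packed = p;
        c->valid = 1;
    }
    return p;
}
```

The trade-off is staleness: a thread that migrates within a tick sees the old CPU number until the tick changes, which is acceptable only because the result is advisory in the first place.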

Also the TSCP version could probably go.

I'm still not sure why you say no redzone is that expensive? Do you
have numbers?  I know it's a few instructions, but it shouldn't 

Yes, unfortunately glibc didn't choose the same interface as the kernel
for this. I still don't know why. But now since we're in this mess
specializing for the glibc implementation is probably a good idea.
Or just add getcpu() to glibc :)

-Andi

-- 
ak@linux.intel.com
--

From: dean gaudet
Date: Monday, September 29, 2008 - 12:27 am

it depends on the processor involved and the kernel config options --
i.e. if frame pointers are enabled then the stack frame guarantees a
store operation (push rbp) and on processors which do memops in-order
this delays the other memops in the vsyscall (i.e. testing the cache or
executing SIDT).  it was 2 or 3 cycles difference in most cases iirc.

-dean

--

From: Andi Kleen
Date: Monday, September 29, 2008 - 7:54 am

OK, frame pointers are always a performance disaster on some CPUs.
Perhaps they should just be unconditionally disabled for vsyscall.c
and the vdso.

-Andi
-- 
ak@linux.intel.com
--

From: Pardo
Date: Monday, September 29, 2008 - 11:02 am

>[Maybe disable frame pointers for vsyscall.c and the vdso?]

IIRC, some vsyscall.c code needs them enabled, so Dean's earlier patch
split vsyscall.c, creating a vsyscall_user.c for code which can run
without them.  Seem reasonable?
--
