Re: [PATCH] percpu: add optimized generic percpu accessors

Previous thread: [PATCH 03/21] RDS: Congestion-handling code by Andy Grover on Monday, January 26, 2009 - 7:17 pm. (64 messages)

Next thread: [PATCH] iproute2: 0 out memory for tc_skbedit struct prior to using it. by Jeff Kirsher on Monday, January 26, 2009 - 8:20 pm. (3 messages)
From: Tejun Heo
Date: Monday, January 26, 2009 - 7:24 pm

Hello, Rusty.


There actually were quite some places where atomic add ops would be
useful, especially the places where statistics are collected.  For

If atomic rw/add/sub are implementible on most archs (and judging from
local_t, I suppose it is), I think it should.  So that it can replace

Isn't something like the following possible?

#define pcpu_read(ptr)						\
({								\
	if (__builtin_constant_p(ptr) &&			\
	    ptr >= PCPU_STATIC_START && ptr < PCPU_STATIC_END)	\
		do 64k TLB trick for static pcpu;		\
	else							\
		do generic stuff;				\

If Ingo is okay with it, I'm fine with it too.  Unless Ingo objects,
I'll do it tomorrow-ish (still big holiday here).

Thanks.

-- 
tejun
--

From: Ingo Molnar
Date: Tuesday, January 27, 2009 - 6:13 am

Sure - but please do it as a clean rebase ontop of latest, not as a 
conflict-merge, as i'd expect there to be quite many clashes.

	Ingo
--

From: Tejun Heo
Date: Tuesday, January 27, 2009 - 4:07 pm

Alrighty, re-basing.

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Tuesday, January 27, 2009 - 8:36 pm

Okay, rebased on top of the current master[1].

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu

The rebased commit ID is e7f538f3d2a6b3a0a632190a7d736730ee10909c.

There were several changes which didn't fit the rebased tree as they
really belong to other branches in the x86 tree.  I'll re-submit them
as patches against the appropriate branch.

Thanks.

-- 
tejun

[1] 5ee810072175042775e39bdd3eaaa68884c27805
--

From: Tejun Heo
Date: Wednesday, January 28, 2009 - 1:12 am

Two patches against #stackprotector and one against #cpus4096 sent
out.  With these three patches, most of stuff has been preserved but
there are still few missing pieces which only make sense after trees
are merged.  Will be happy to drive it if the current branch looks
fine.

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Tuesday, January 27, 2009 - 1:08 pm

The TLB trick is just to access the percpu data at a fixed base. I.e.
	value = SHIFT_PERCPU_PTR(percpu_var, FIXED_ADDRESS);

--

From: David Miller
Date: Tuesday, January 27, 2009 - 2:47 pm

From: Christoph Lameter <cl@linux-foundation.org>

The IA64 per-cpu TLB entry only covers 64k which makes use of it for
dynamic per-cpu stuff out of the question.  That's why it "doesn't
scale"
--

From: Rick Jones
Date: Tuesday, January 27, 2009 - 3:47 pm

I was asking around, and was told that on IA64 *harware* at least, in addition to 
supporting multiple page sizes (up to a GB IIRC), one can pin up to 8 or perhaps 
1/2 the TLB entries.

So, in theory if one were so inclined the special pinned per-CPU entry could 
either be more than one 64K entry, or a single, rather larger entry.

As a sanity check, I've cc'd linux-ia64.

rick jones
--

From: Luck, Tony
Date: Tuesday, January 27, 2009 - 5:17 pm

The number of TLB entries that can be pinned is model specific (but the
minimum allowed is 8 each for code & data).  Page sizes supported also
vary by model, recent models go up to 4GB.

BUT ... we stopped pinning this entry in the TLB when benchmarks showed
that it was better to just insert this as a regular TLB entry which might
can be dropped to map something else.  So now there is code in the Alt-DTLB

Managing a larger space could be done ... but at the expense of making
the Alt-DTLB miss handler do a memory lookup to find the physical address
of the per-cpu page needed (assuming that we allocate a bunch of random
physical pages for use as per-cpu space rather than a single contiguous
block of physical memory).

When do we know the total amount of per-cpu memory needed?
1) CONFIG time?
2) Boot time?
3) Arbitrary run time?

-Tony
--

From: Christoph Lameter
Date: Wednesday, January 28, 2009 - 9:48 am

We could reserve a larger virtual space and switch TLB entries as needed?
We would need do get larger order pages to do this.

--

From: Luck, Tony
Date: Wednesday, January 28, 2009 - 10:15 am

Yes we could ... the catch is that the supported TLB page sizes go
up by multiples of 4.  So if 64K is not enough the next stop is
at 256K, then 1M, then 4M, and so on.

That's why I asked when we know what total amount of per-cpu memory
is needed.  CONFIG time or boot time is good, because allocating higher
order pages is easy at boot time.  Arbitrary run time is bad because
we might have difficulty allocating a high order page for every cpu.

-Tony
--

From: Christoph Lameter
Date: Wednesday, January 28, 2009 - 9:45 am

IA64 supports varying page sizes. You can use a 128k TLB entry etc.

--

From: David Miller
Date: Wednesday, January 28, 2009 - 1:47 pm

From: Christoph Lameter <cl@linux-foundation.org>

Good luck moving to a larger size dynamically at run time.

It really isn't a tenable solution.
--

From: Rusty Russell
Date: Wednesday, January 28, 2009 - 3:38 am

This is more like Christoph's CPU_OPS: they were special operators on normal per-cpu vars/ptrs.  Generic version was irqsave+op+irqrestore.

I actually like this idea, but Mathieu insists that the ops be NMI-safe, for ftrace.  Hence local_t needing to be atomic_t for generic code.

AFAICT we'll need a hybrid: HAVE_NMISAFE_CPUOPS, and if not, use atomic_t
in ftrace (which isn't NMI safe on parisc or sparc/32 anyway, but I don't think we care).

Other than the shouting, I liked Christoph's system:
- CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore)
- _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu)
- __CPU_INC = not safe against anything (eg. per_cpu(i)++)



Ah, I did not realize that you celebrated Australia day :)

Cheers!
Rusty.
--

From: Tejun Heo
Date: Wednesday, January 28, 2009 - 3:56 am

Hello,


Yes, it is.  I was hoping it to be not more expensive on most archs.
It isn't on x86 at the very least but I don't know much about other

Requiring NMI-safeness is quite an exception, I suppose.  I don't
think we should design around it.  If it can be worked around one way


Yes, right.  Got confused there.  Hmmm... looks like what would work
there is "is it a lvalue?" test.  Well, anyways, if it isn't

Hey, didn't know Australia was founded on lunar New Year's day.
Nice. :-)

Thanks.

-- 
tejun
--

From: Rusty Russell
Date: Wednesday, January 28, 2009 - 7:06 pm

Hmm, you can garner this from the local_t stats which were flying around.
(see Re: local_add_return from me), or look in the preamble to
http://ozlabs.org/~rusty/kernel/rr-latest/misc:test-local_t.patch ).

Of course, if you want to be my hero, you could implement "soft" irq


That would have been cool, but no; first time in 76 years they matched tho.

Thanks,
Rusty.
--

From: Tejun Heo
Date: Friday, January 30, 2009 - 11:11 pm

Hello, Rusty.



I suppose you mean deferred execution of interrupt handlers for quick

It was a joke.  :-)

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Wednesday, January 28, 2009 - 9:50 am

The term cpu is meaning multiple things at this point. So yes it may be
better to go with glibc naming of thread local space.

--

From: Mathieu Desnoyers
Date: Wednesday, January 28, 2009 - 11:07 am

Ideally this should be done transparently so we don't have to #ifdef
code around HAVE_NMISAFE_CPUOPS everywhere in the tracer. We might
consider declaring an intermediate type with this kind of #ifdef in the

However using "local" for "per-cpu" could be confusing with the glibc
naming of thread local space, because "per-thread" and "per-cpu"
variables are different from a synchronization POV and we can end up
needing both (e.g. a thread local variable can never be accessed by
another thread, but a cpu local variable could be accessed by a
different CPU due to scheduling).

I'm currently thinking about implementing user-space per-cpu tracing buffers,
and the last thing I'd like is to have a "local" semantic clash between
the kernel and glibc. Ideally, we could have something like :

Atomic safe primitives (emulated with irq disable if the architecture
does not have atomic primitives) :
- atomic_thread_inc()
  * current mainline local_t local_inc().
- atomic_cpu_inc()
  * Your proposed CPU_INC.

Non-safe against interrupts, but safe against preemption :
- thread_inc()
  * no preempt_disable involved, because this deals with per-thread
    variables :
    * Simple var++
- cpu_inc()
  * disables preemption, per_cpu(i)++, enables preemption

Not safe against preemption nor interrupts :
- _thread_inc()
  * maps to thread_inc()
- _cpu_inc()
  * simple per_cpu(i)++

So maybe we don't really need thread_inc(), _thread_inc() and _cpu_inc,
because they map to standard C operations, but we may find that in some
architectures that the atomic_cpu_inc() is faster than the per_cpu(i)++,
and in those cases it would make sense to map e.g. _cpu_inc() to
atomic_cpu_inc().

Also note that don't think adding _ and __ prefixes to the operations
makes it clear for the programmer and reviewer what is safe against
what. OMHO, it will just make the code more obscure. One level of
underscore is about the limit I think people can "know" what this
"unsafe" version of the primitive ...
From: Christoph Lameter
Date: Thursday, January 29, 2009 - 11:33 am

gcc/glibc support a __thread attribute to variables. As far as I can tell
this automatically makes gcc perform the relocation to the current
context using a segment register.

But its a weird ABI http://people.redhat.com/drepper/tls.pdf. After
reading that I agree that we should stay with the cpu ops and forget about
the local and thread stuff in gcc/glibc.

--

From: H. Peter Anvin
Date: Thursday, January 29, 2009 - 11:48 am

We have discussed this a number of times.  There are several issues with
it; one being that per-cpu != per-thread (we can switch CPU whereever
we're preempted), and another that many versions of gcc uses hardcoded
registers, in particular %fs on 64 bits, but the kernel *has* to use %gs
since there is no swapfs instruction.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

Previous thread: [PATCH 03/21] RDS: Congestion-handling code by Andy Grover on Monday, January 26, 2009 - 7:17 pm. (64 messages)

Next thread: [PATCH] iproute2: 0 out memory for tc_skbedit struct prior to using it. by Jeff Kirsher on Monday, January 26, 2009 - 8:20 pm. (3 messages)