Hello, Rusty.
There actually were quite some places where atomic add ops would be
useful, especially the places where statistics are collected. For
If atomic rw/add/sub are implementible on most archs (and judging from
local_t, I suppose it is), I think it should. So that it can replace
Isn't something like the following possible?
#define pcpu_read(ptr) \
({ \
if (__builtin_constant_p(ptr) && \
ptr >= PCPU_STATIC_START && ptr < PCPU_STATIC_END) \
do 64k TLB trick for static pcpu; \
else \
do generic stuff; \
If Ingo is okay with it, I'm fine with it too. Unless Ingo objects,
I'll do it tomorrow-ish (still big holiday here).
Thanks.
--
tejun
--
Sure - but please do it as a clean rebase ontop of latest, not as a conflict-merge, as i'd expect there to be quite many clashes. Ingo --
Okay, rebased on top of the current master[1]. git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu The rebased commit ID is e7f538f3d2a6b3a0a632190a7d736730ee10909c. There were several changes which didn't fit the rebased tree as they really belong to other branches in the x86 tree. I'll re-submit them as patches against the appropriate branch. Thanks. -- tejun [1] 5ee810072175042775e39bdd3eaaa68884c27805 --
Two patches against #stackprotector and one against #cpus4096 sent out. With these three patches, most of stuff has been preserved but there are still few missing pieces which only make sense after trees are merged. Will be happy to drive it if the current branch looks fine. Thanks. -- tejun --
The TLB trick is just to access the percpu data at a fixed base. I.e. value = SHIFT_PERCPU_PTR(percpu_var, FIXED_ADDRESS); --
From: Christoph Lameter <cl@linux-foundation.org> The IA64 per-cpu TLB entry only covers 64k which makes use of it for dynamic per-cpu stuff out of the question. That's why it "doesn't scale" --
I was asking around, and was told that on IA64 *harware* at least, in addition to supporting multiple page sizes (up to a GB IIRC), one can pin up to 8 or perhaps 1/2 the TLB entries. So, in theory if one were so inclined the special pinned per-CPU entry could either be more than one 64K entry, or a single, rather larger entry. As a sanity check, I've cc'd linux-ia64. rick jones --
The number of TLB entries that can be pinned is model specific (but the minimum allowed is 8 each for code & data). Page sizes supported also vary by model, recent models go up to 4GB. BUT ... we stopped pinning this entry in the TLB when benchmarks showed that it was better to just insert this as a regular TLB entry which might can be dropped to map something else. So now there is code in the Alt-DTLB Managing a larger space could be done ... but at the expense of making the Alt-DTLB miss handler do a memory lookup to find the physical address of the per-cpu page needed (assuming that we allocate a bunch of random physical pages for use as per-cpu space rather than a single contiguous block of physical memory). When do we know the total amount of per-cpu memory needed? 1) CONFIG time? 2) Boot time? 3) Arbitrary run time? -Tony --
We could reserve a larger virtual space and switch TLB entries as needed? We would need do get larger order pages to do this. --
Yes we could ... the catch is that the supported TLB page sizes go up by multiples of 4. So if 64K is not enough the next stop is at 256K, then 1M, then 4M, and so on. That's why I asked when we know what total amount of per-cpu memory is needed. CONFIG time or boot time is good, because allocating higher order pages is easy at boot time. Arbitrary run time is bad because we might have difficulty allocating a high order page for every cpu. -Tony --
IA64 supports varying page sizes. You can use a 128k TLB entry etc. --
From: Christoph Lameter <cl@linux-foundation.org> Good luck moving to a larger size dynamically at run time. It really isn't a tenable solution. --
This is more like Christoph's CPU_OPS: they were special operators on normal per-cpu vars/ptrs. Generic version was irqsave+op+irqrestore. I actually like this idea, but Mathieu insists that the ops be NMI-safe, for ftrace. Hence local_t needing to be atomic_t for generic code. AFAICT we'll need a hybrid: HAVE_NMISAFE_CPUOPS, and if not, use atomic_t in ftrace (which isn't NMI safe on parisc or sparc/32 anyway, but I don't think we care). Other than the shouting, I liked Christoph's system: - CPU_INC = always safe (eg. local_irq_save/per_cpu(i)++/local_irq_restore) - _CPU_INC = not safe against interrupts (eg. get_cpu/per_cpu(i)++/put_cpu) - __CPU_INC = not safe against anything (eg. per_cpu(i)++) Ah, I did not realize that you celebrated Australia day :) Cheers! Rusty. --
Hello, Yes, it is. I was hoping it to be not more expensive on most archs. It isn't on x86 at the very least but I don't know much about other Requiring NMI-safeness is quite an exception, I suppose. I don't think we should design around it. If it can be worked around one way Yes, right. Got confused there. Hmmm... looks like what would work there is "is it a lvalue?" test. Well, anyways, if it isn't Hey, didn't know Australia was founded on lunar New Year's day. Nice. :-) Thanks. -- tejun --
Hmm, you can garner this from the local_t stats which were flying around. (see Re: local_add_return from me), or look in the preamble to http://ozlabs.org/~rusty/kernel/rr-latest/misc:test-local_t.patch ). Of course, if you want to be my hero, you could implement "soft" irq That would have been cool, but no; first time in 76 years they matched tho. Thanks, Rusty. --
Hello, Rusty. I suppose you mean deferred execution of interrupt handlers for quick It was a joke. :-) Thanks. -- tejun --
The term cpu is meaning multiple things at this point. So yes it may be better to go with glibc naming of thread local space. --
Ideally this should be done transparently so we don't have to #ifdef
code around HAVE_NMISAFE_CPUOPS everywhere in the tracer. We might
consider declaring an intermediate type with this kind of #ifdef in the
However using "local" for "per-cpu" could be confusing with the glibc
naming of thread local space, because "per-thread" and "per-cpu"
variables are different from a synchronization POV and we can end up
needing both (e.g. a thread local variable can never be accessed by
another thread, but a cpu local variable could be accessed by a
different CPU due to scheduling).
I'm currently thinking about implementing user-space per-cpu tracing buffers,
and the last thing I'd like is to have a "local" semantic clash between
the kernel and glibc. Ideally, we could have something like :
Atomic safe primitives (emulated with irq disable if the architecture
does not have atomic primitives) :
- atomic_thread_inc()
* current mainline local_t local_inc().
- atomic_cpu_inc()
* Your proposed CPU_INC.
Non-safe against interrupts, but safe against preemption :
- thread_inc()
* no preempt_disable involved, because this deals with per-thread
variables :
* Simple var++
- cpu_inc()
* disables preemption, per_cpu(i)++, enables preemption
Not safe against preemption nor interrupts :
- _thread_inc()
* maps to thread_inc()
- _cpu_inc()
* simple per_cpu(i)++
So maybe we don't really need thread_inc(), _thread_inc() and _cpu_inc,
because they map to standard C operations, but we may find that in some
architectures that the atomic_cpu_inc() is faster than the per_cpu(i)++,
and in those cases it would make sense to map e.g. _cpu_inc() to
atomic_cpu_inc().
Also note that don't think adding _ and __ prefixes to the operations
makes it clear for the programmer and reviewer what is safe against
what. OMHO, it will just make the code more obscure. One level of
underscore is about the limit I think people can "know" what this
"unsafe" version of the primitive ...gcc/glibc support a __thread attribute to variables. As far as I can tell this automatically makes gcc perform the relocation to the current context using a segment register. But its a weird ABI http://people.redhat.com/drepper/tls.pdf. After reading that I agree that we should stay with the cpu ops and forget about the local and thread stuff in gcc/glibc. --
We have discussed this a number of times. There are several issues with it; one being that per-cpu != per-thread (we can switch CPU whereever we're preempted), and another that many versions of gcc uses hardcoded registers, in particular %fs on 64 bits, but the kernel *has* to use %gs since there is no swapfs instruction. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
