Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access

Previous thread: [patch 06/41] cpu alloc: crash_notes conversion by Christoph Lameter on Thursday, May 29, 2008 - 8:56 pm. (1 message)

Next thread: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations by Christoph Lameter on Thursday, May 29, 2008 - 8:56 pm. (36 messages)
From: Christoph Lameter
Date: Thursday, May 29, 2008 - 8:56 pm

In various places the kernel maintains arrays of pointers indexed by
processor numbers. These are used to locate objects that need to be used
when executing on a specific processor. Both the slab allocator
and the page allocator use these arrays and there the arrays are used in
performance critical code. The allocpercpu functionality is a simple
allocator to provide these arrays. However, there are certain drawbacks
in using such arrays:

1. The arrays become huge for large systems and may be very sparsely
   populated (if they are dimensioned for NR_CPUS) on an architecture
   like IA64 that allows up to 4k cpus if a kernel is then booted on a
   machine that only supports 8 processors. We could use nr_cpu_ids there
   but we would still have to allocate all possible processors up to
   the number of processor ids. cpu_alloc can deal with sparse cpu_maps.

2. The arrays cause surrounding variables to no longer fit into a single
   cacheline. The layout of core data structures is typically optimized so
   that variables frequently used together are placed in the same cacheline.
   Arrays of pointers move these variables far apart and destroy this effect.

3. A processor frequently follows only one pointer for its own use. Thus
   that cacheline with that pointer has to be kept in memory. The neighboring
   pointers are all to other processors that are rarely used. So a whole
   cacheline of 128 bytes may be consumed but only 8 bytes of information
   is in constant use. It would be better to be able to place more information
   in this cacheline.

4. The lookup of the per cpu object is expensive and requires multiple
   memory accesses to:

   A) smp_processor_id()
   B) pointer to the base of the per cpu pointer array
   C) pointer to the per cpu object in the pointer array
   D) the per cpu object itself.

5. Each use of allocpercpu requires its own per cpu array. On large
   systems large arrays have to be allocated again and again.

6. Processor hotplug cannot effectively track ...
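
A minimal userspace sketch of the lookup cost described in item 4, and of
the alternative where each processor owns one contiguous area and an object
is named by a single offset. All names and sizes here (SKETCH_NR_CPUS,
SKETCH_AREA_SIZE, sketch_array_lookup, sketch_offset_lookup) are
hypothetical and are not taken from the patchset:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS   4     /* hypothetical; tiny for illustration */
#define SKETCH_AREA_SIZE 256   /* hypothetical per cpu area size in bytes */

struct counter { long hits; };

/* Pointer-array scheme: the array base (B), the slot for this cpu (C)
 * and the object itself (D) are separate memory accesses. */
struct counter *sketch_array_lookup(struct counter **percpu_ptrs, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	return percpu_ptrs[cpu];
}

/* Offset scheme: every cpu owns one contiguous area, and an object is
 * named by a single offset that is valid in all areas. */
static union {
	unsigned char bytes[SKETCH_AREA_SIZE];
	long align;                /* keeps each area suitably aligned for long */
} sketch_areas[SKETCH_NR_CPUS];

void *sketch_offset_lookup(int cpu, size_t offset)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	assert(offset < SKETCH_AREA_SIZE);
	return &sketch_areas[cpu].bytes[offset];
}
```

In the offset scheme the same offset finds the object in every area, so no
per-object pointer array is needed; only the base of the current
processor's area has to be known.
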
From: Andrew Morton
Date: Thursday, May 29, 2008 - 9:58 pm

All seems reasonable to me.  The obvious question is "how do we size
the arena".  We either waste memory or, much worse, run out.

And running out is a real possibility, I think.  Most people will only
mount a handful of XFS filesystems.  But some customer will come along
who wants to mount 5,000, and distributors will need to cater for that,
but how can they?

I wonder if we can arrange for the default to be overridden via a
kernel boot option?


Another obvious question is "how much of a problem will we have with
internal fragmentation"?  This might be a drop-dead showstopper.

--

From: Christoph Lameter
Date: Thursday, May 29, 2008 - 10:03 pm

The per cpu memory use by subsystems is typically quite small. We already 
have an 8k limitation for percpu space for modules. And that does not seem 



But then per cpu data is not frequently allocated and freed.

Going away from allocpercpu saves a lot of memory. We could make this 
128k or so to be safe?


--

From: Andrew Morton
Date: Thursday, May 29, 2008 - 10:21 pm

It was just an example.  There will be others.

	tcp_v4_md5_do_add
	->tcp_alloc_md5sig_pool
	  ->__tcp_alloc_md5sig_pool

does an alloc_percpu for each md5-capable TCP connection.  I think - it
doesn't matter really, because something _could_.  And if something


I think it is, in the TCP case.  And that's the only one I looked at.


("alloc_percpu" - please be careful about getting this stuff right)

I don't think there is presently any upper limit on alloc_percpu()?  It
uses kmalloc() and kmalloc_node()?

Even if there is some limit, is it an unfixable one?
--

From: Eric Dumazet
Date: Thursday, May 29, 2008 - 11:01 pm

Last time I took a look at this stuff, this was a percpu allocation for 
all connections, not for each TCP session.
(It should be static, instead of dynamic )

Really, percpu allocations are currently not frequent at all.

vmalloc()/vfree() are way more frequent and still use a list.





--

From: Andrew Morton
Date: Thursday, May 29, 2008 - 11:16 pm

Sure it's hard to conceive how anyone could go and do a per-cpu
allocation on a fastpath.

But this has nothing to do with the frequency!  The problems surround
the _amount_ of allocated memory and the allocation/freeing patterns.

Here's another example.  And it's only an example!  Generalise!

ext3 maintains three percpu_counters per mount.  Each percpu_counter
does one percpu_alloc.  People can mount an arbitrary number of ext3
filesystems!


Another: there are two percpu_counters (and hence two percpu_alloc()s)
per backing_dev_info.  One backing_dev_info per disk and people have
been known to have thousands (iirc ~10,000) disks online.

And those examples were plucked only from today's kernel.  Who knows
what other problems will be in 2.6.45?
--

From: Christoph Lameter
Date: Thursday, May 29, 2008 - 11:22 pm

We can always increase the sizes.
 
--

From: Andrew Morton
Date: Thursday, May 29, 2008 - 11:37 pm

It could be 4000.  The present alloc_percpu() would support that.

And struct nfs_iostats is 264 bytes and nfs does an alloc_percpu() of
one of those per server and mounting thousands of servers per client
is, I believe, a real-world operation.

Plus for the umpteenth time: saying that this code will probably work
acceptably for most people in 2.6.26 is not sufficient!
--

From: Matthew Wilcox
Date: Friday, May 30, 2008 - 4:32 am

Another example, not as extreme, there's an alloc_percpu(struct
disk_stats) [80 bytes on 64-bit machines] for every disk and every
partition in the machine.  The TPC system has 3000 disks, each with 14
partitions on it.  That's 15 * 80 * 3000 = 3,600,000 bytes.

Even if you're only putting a pointer to each allocation in the percpu
area, that's still 360,000 bytes, 12x as much as you think is sufficient
for the entire system.
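
The arithmetic above can be checked with a short sketch. The helper names
are hypothetical; the figures (3000 disks, 14 partitions each, 80 bytes per
struct disk_stats, 8 bytes per pointer on a 64-bit machine) are the ones
quoted in this message:

```c
#include <assert.h>
#include <limits.h>

/* One disk_stats allocation per disk plus one per partition. */
unsigned long sketch_disk_stats_objects(unsigned long disks,
					unsigned long partitions_per_disk)
{
	return disks * (partitions_per_disk + 1);
}

/* Bytes of per cpu space consumed, per processor, by `objects`
 * allocations of `object_size` bytes each. */
unsigned long sketch_percpu_bytes(unsigned long objects,
				  unsigned long object_size)
{
	/* Guard against overflow of the multiplication. */
	assert(object_size == 0 || objects <= ULONG_MAX / object_size);
	return objects * object_size;
}
```

With 3000 disks and 14 partitions this gives 45,000 objects, 3,600,000
bytes for the structures themselves and 360,000 bytes if only an 8-byte
pointer per object lives in the per cpu area.
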

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Christoph Lameter
Date: Thursday, May 29, 2008 - 10:27 pm

No. The module subsystem has its own alloc_percpu subsystem that the 

But it's going to be even more complicated and I have a hard time managing 



No there is no limit. It just wastes lots of space (pointer arrays, 
alignment etc) that we could use to configure sufficiently large per cpu 
areas.

--

From: Andrew Morton
Date: Thursday, May 29, 2008 - 10:49 pm

Christoph, please.  An allocator which is of fixed size and which is
vulnerable to internal fragmentation is a huge problem!  The kernel is
subject to wildly varying workloads both between different users and in
the hands of a single user.

If we were to merge all this code and then run into the problems which
I fear then we are tremendously screwed.  We must examine this
exhaustively, in the most paranoid fashion.

--

From: Christoph Lameter
Date: Thursday, May 29, 2008 - 11:16 pm

Right but it needs to have its own section of the percpu space from which 
it allocates the percpu segments for the modules. So it effectively 

Well V2 virtually mapped the cpu alloc area which allowed extending it 
arbitrarily. But that made things very complicated.

The number of per cpu resources needed is mostly fixed. The number of 
zones, nodes, slab caches, network interfaces etc etc does not change much 
during typical operations.
 
--

From: KAMEZAWA Hiroyuki
Date: Thursday, May 29, 2008 - 11:51 pm

On Thu, 29 May 2008 23:16:11 -0700 (PDT)

Could you add a text to explain "This interface is for wise use of
pre-allocated limited area (see Documentation/xxxx). please use this only
when you need very fast access to per-cpu object and you can estimate the amount
which you finally need. If unsure, please use generic allocator."

for the moment ?

At first look, I thought of using this in memory-resource-controller but it seems
I shouldn't do so because thousands of cgroups can be used in theory...

Thanks,
-Kame


--

From: Mike Travis
Date: Friday, May 30, 2008 - 7:38 am

Is there any reason why the per_cpu area couldn't be made extensible?  Maybe
a simple linked list of available areas?  (And use a config variable and/or
boot param for initial size and increment size?)  [Ignoring the problem of
reclaiming the space...]
--

From: Christoph Lameter
Date: Friday, May 30, 2008 - 10:50 am

cpu alloc v2 had an extendable per cpu space. You have the patches. We 
could put this on top of this patchset if necessary. But then it is not so 
nice and simple anymore. Maybe we can restrict the use of cpu alloc 
instead to users with objects < cache_line_size() or so?

--

From: Matthew Wilcox
Date: Friday, May 30, 2008 - 11:00 am

Restricting the use of cpu_alloc based on size of object is no good when
you're trying to allocate 45,000 objects.  Extending the per CPU space
is the only option.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Christoph Lameter
Date: Friday, May 30, 2008 - 11:12 am

Ok guess we need to bring the virtually mapped per cpu patches forward.

--

From: Mike Travis
Date: Wednesday, June 4, 2008 - 8:07 am

One problem with variable sized cpu_alloc area is this comment in bitmap.h:

 * Note that nbits should be always a compile time evaluable constant.
 * Otherwise many inlines will generate horrible code.

I'm guessing since this will be of low use and not performance critical,
then we can ignore the "horrible code"?  ;-)
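
A rough sketch of why a compile-time constant nbits matters for inline
bitmap helpers. It is modeled loosely on the single-word fast path that
such helpers use; the names (sketch_bitmap_empty, SKETCH_BITS_PER_LONG) are
hypothetical and this is not the kernel's implementation:

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Returns 1 if none of the first nbits bits of map are set. */
static inline int sketch_bitmap_empty(const unsigned long *map,
				      unsigned int nbits)
{
	assert(nbits > 0);

	/* With a constant nbits the compiler can pick one branch at
	 * compile time; with a variable nbits both branches are emitted
	 * at every inlined call site. */
	if (nbits <= SKETCH_BITS_PER_LONG)
		return (map[0] & (~0UL >> (SKETCH_BITS_PER_LONG - nbits))) == 0;

	unsigned int full = nbits / SKETCH_BITS_PER_LONG;
	unsigned int rest = nbits % SKETCH_BITS_PER_LONG;

	for (unsigned int i = 0; i < full; i++)
		if (map[i])
			return 0;
	/* Check the valid bits of the trailing partial word, if any. */
	if (rest && (map[full] & (~0UL >> (SKETCH_BITS_PER_LONG - rest))))
		return 0;
	return 1;
}
```

For a bitmap whose size is only known at boot, every caller pays for the
general path, which is acceptable if the allocator is not on a hot path.
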

Thanks,
Mike

--

From: Eric Dumazet
Date: Thursday, June 5, 2008 - 10:33 pm

Christoph & Mike,

Please forgive me if I beat a dead horse, but this percpu stuff
should find its way.

I wonder why you absolutely want to have only one chunk holding
all percpu variables, static(vmlinux) & static(modules)
& dynamically allocated.

It's *not* possible to put an arbitrary limit to this global zone.
You'll always find somebody to break this limit. This is the point
we must solve, before coding anything.

Have you considered using a list of fixed size chunks, each chunk
having its own bitmap ?

We only want fixed offsets between CPU locations. For a given variable,
we MUST find addresses for all CPUS looking at the same offset table.
(Then we can optimize things on x86, using %gs/%fs register, instead
of a table lookup)

We could choose chunk size at compile time, depending on various
parameters (32/64 bit arches, or hugepage sizes on NUMA),
and a minimum value (ABI guarantee)

On x86_64 && NUMA we could use 2 Mbytes chunks, while
on x86_32 or non NUMA we should probably use 64 Kbytes.

At boot time, we setup the first chunk (chunk 0) and copy 
.data.percpu on this chunk, for each possible cpu, and we
build the bitmap for future dynamic/module percpu allocations.
So we still have the restriction that sizeofsection(.data.percpu)
should fit in the chunk 0. Not a problem in practice.

Then if we need to expand percpu zone for heavy duty machine,
and chunk 0 is already filled, we can add as many 2M/64K
chunks as we need.

This would limit the dynamic percpu allocation to 64 kbytes for
a given variable, so huge users should probably still use a
different allocator (like oprofile alloc_cpu_buffers() function)
But at least we no longer limit the total size of percpu area.
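
A minimal userspace sketch of the "list of fixed size chunks, each chunk
having its own bitmap" idea. Everything here is an assumption made for
illustration: the names, the 8-byte allocation unit, the first-fit search,
and the use of ordinary heap memory with a per-chunk table of cpu bases in
place of real per cpu mappings. There is no free path:

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_NR_CPUS  4                /* hypothetical */
#define CHUNK_SIZE      (64 * 1024)      /* the 64 Kbytes case from the text */
#define UNIT            8                /* assumed allocation granularity */
#define UNITS_PER_CHUNK (CHUNK_SIZE / UNIT)

struct chunk {
	struct chunk *next;
	unsigned char used[UNITS_PER_CHUNK / 8];      /* one bit per UNIT */
	unsigned char *cpu_base[SKETCH_NR_CPUS];      /* same layout per cpu */
};

struct percpu_ref { struct chunk *chunk; size_t offset; };

static struct chunk *chunks;                          /* chunk 0 is the head */

static struct chunk *chunk_new(void)
{
	struct chunk *c = calloc(1, sizeof(*c));
	if (!c)
		return NULL;
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		c->cpu_base[cpu] = calloc(1, CHUNK_SIZE);
		if (!c->cpu_base[cpu]) {
			while (cpu-- > 0)
				free(c->cpu_base[cpu]);
			free(c);
			return NULL;
		}
	}
	return c;
}

/* First-fit search for `units` consecutive free units; -1 if none. */
static long chunk_find(const struct chunk *c, size_t units)
{
	size_t run = 0;
	for (size_t i = 0; i < UNITS_PER_CHUNK; i++) {
		run = ((c->used[i / 8] >> (i % 8)) & 1) ? 0 : run + 1;
		if (run == units)
			return (long)(i + 1 - units);
	}
	return -1;
}

/* Allocate `size` bytes on every cpu; adds a chunk when all are full. */
int sketch_percpu_alloc(size_t size, struct percpu_ref *ref)
{
	size_t units = (size + UNIT - 1) / UNIT;
	if (size == 0 || units > UNITS_PER_CHUNK)
		return -1;                 /* larger than one chunk: refuse */
	for (struct chunk **link = &chunks; ; link = &(*link)->next) {
		if (!*link && !(*link = chunk_new()))
			return -1;
		long first = chunk_find(*link, units);
		if (first < 0)
			continue;
		for (size_t i = (size_t)first; i < (size_t)first + units; i++)
			(*link)->used[i / 8] |= (unsigned char)(1u << (i % 8));
		ref->chunk = *link;
		ref->offset = (size_t)first * UNIT;
		return 0;
	}
}

/* Address of the object for one cpu: chunk base for that cpu + offset. */
void *sketch_percpu_ptr(const struct percpu_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	return ref->chunk->cpu_base[cpu] + ref->offset;
}
```

As in the description, a single allocation cannot exceed one chunk, but the
total amount of per cpu space is bounded only by available memory.
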

I understand you want to offset percpu data to 0, but for
static percpu data. (pda being included in, to share %gs)

For dynamically allocated percpu variables (including modules
".data.percpu"), nothing forces you to have low offsets,
relative to %gs/%fs register. Access to these ...
From: Mike Travis
Date: Friday, June 6, 2008 - 6:08 am

Wow!  Thanks for the detail!  It's extremely useful (to me at least)
to see it spelled out.

Since Christoph is still on vacation I'll try to summarize where we're
at at the moment.  (Besides being stuck on a boot up problem with the
%gs based percpu variables that is. ;-)

Yes, the problem is we need to use virtual addresses to expand the
percpu areas since each cpu needs the same fixed offset to the newly
allocated variables.  This was in the prior (v2) version of cpu_alloc
so I'm looking at pulling that forward.  And I also figured that the
size of the expansion allocations should be based on the system size
to minimize the effect on small systems (seems to be my life the
past 6 months... ;-)

I'm also looking at integrating more into the already present
infrastructure (thanks Rusty!) so there are less "diffs" (and less
new testing needed.)  And of course, there's the complexities
of submitting patches to many architectures simultaneously.

Hopefully, I'll have something for review soon.

Thanks again,
Mike

--

From: Rusty Russell
Date: Saturday, June 7, 2008 - 11:00 pm

If you're prepared to have mappings for chunk 0, you can simply make it 
virtually linear and creating a new chunk is simple.  If not, you need to 
reserve the virtual address space(s) for future mappings.  Otherwise you're 
unlikely to get the same layout for allocations.

This is not a show-stopper: we've lived with limited vmalloc room since 
forever.  It just has to be sufficient.

Otherwise, your analysis is correct, if a little verbose :)

Cheers,
Rusty.
--

From: Christoph Lameter
Date: Monday, June 9, 2008 - 11:44 am

The problem is that offsets relative to %gs or %fs are limited by the 
small memory model that is chosen. We cannot have an offset larger than 
2GB. So we must have a linear address range and cannot use separate chunks 
of memory. If we do not use the segment register then we cannot do atomic 


Right that is what cpu_alloc v2 did. It created a virtual mapping and 

The relative to 0 stuff comes in at the x86_64 level because we want to 
unify pda and percpu accesses. pda accesses have been relative to 0 and in 
particular the stack canary in glibc directly accesses the pda at a 
certain offset. So we must be zero based in order to preserve 

Normal memory uses 2MB tlbs. There is no overhead therefore by mapping the 
percpu areas using 2MB tlbs. So we do not need to be that complicated.

What v2 did was allocate an area n * MAX_VIRT_PER_CPU_SIZE in vmalloc 
space and then it dynamically populated 2MB segments as needed. The MAX 
size was 128MB or so.
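
The layout described above can be sketched as plain address arithmetic. The
2MB segment size and the "128MB or so" maximum come from this message; the
exact constant value, the function names and the base address are
assumptions for illustration, and the sketch assumes a 64-bit address
space:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SEGMENT_SIZE  (2UL << 20)     /* 2MB mapping granularity */
#define SKETCH_MAX_VIRT_SIZE (128UL << 20)   /* assumed value of the per cpu maximum */

/* Virtual address of a per cpu object: every cpu gets its own
 * SKETCH_MAX_VIRT_SIZE window, and the offset is the same in each. */
uintptr_t sketch_percpu_addr(uintptr_t base, unsigned int cpu,
			     unsigned long offset)
{
	assert(offset < SKETCH_MAX_VIRT_SIZE);
	return base + (uintptr_t)cpu * SKETCH_MAX_VIRT_SIZE + offset;
}

/* Number of 2MB segments that must be populated in each cpu's window
 * to back `used` bytes of per cpu data. */
unsigned long sketch_segments_needed(unsigned long used)
{
	assert(used <= SKETCH_MAX_VIRT_SIZE);
	return (used + SKETCH_SEGMENT_SIZE - 1) / SKETCH_SEGMENT_SIZE;
}
```

Only the virtual range is reserved up front; physical memory is consumed
one 2MB segment per cpu at a time as usage grows.
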

We could either do the same on i386 or use 4kb mappings (then we can 
directly use the vmalloc functionality). But then there would be 
additional TLB overhead.

We have similar 2MB virtual mapping tricks for the virtual memmap. 
Basically we can copy the functions and customize them for the virtual per 
cpu areas (Mike is hopefully listening and reading the V2 patch ....)

--

From: Andi Kleen
Date: Monday, June 9, 2008 - 12:11 pm

Actually they are not. If you really want you can do 
movabs $64bit,%reg ; op ...,%gs:(%reg) 
It's just not very efficient compared to small (or rather kernel) model
and also older binutils didn't support large model.

-Andi
--

From: Eric Dumazet
Date: Monday, June 9, 2008 - 1:15 pm

I am not sure Christoph was referring to actual instructions.

I was suggesting using for static percpu (vmlinux or modules) :

vmlinux : (offset31 computed by linker at vmlinux link edit time)
incl  %gs:offset31

modules : (offset31 computed at module load time by module loader)
incl %gs:offset31

(If we make sure all this stuff is allocated in first chunk)

And for dynamic percpu :

movq   field(%rdi),%rax
incl    %gs:(%rax)   /* full 64bits 'offsets' */

I understood (but might be wrong again) that %gs itself could not be used
with an offset > 2GB, because of the way the %gs segment is set up. So in
the 'dynamic percpu' case, %rax should not exceed 2^31





--
