In various places the kernel maintains arrays of pointers indexed by processor numbers. These are used to locate objects that need to be used when executing on a specific processor. Both the slab allocator and the page allocator use these arrays, and there the arrays are used in performance critical code. The allocpercpu functionality is a simple allocator to provide these arrays. However, there are certain drawbacks in using such arrays:

1. The arrays become huge for large systems and may be very sparsely populated (if they are dimensioned for NR_CPUS), for example when a kernel for an architecture like IA64, which allows up to 4k cpus, is booted on a machine that only supports 8 processors. We could use nr_cpu_ids there but we would still have to allocate all possible processors up to the number of processor ids. cpu_alloc can deal with sparse cpu_maps.

2. The arrays cause surrounding variables to no longer fit into a single cacheline. The layout of core data structures is typically optimized so that variables frequently used together are placed in the same cacheline. Arrays of pointers move these variables far apart and destroy this effect.

3. A processor frequently follows only one pointer for its own use. Thus that cacheline with that pointer has to be kept in memory. The neighboring pointers all belong to other processors and are rarely used. So a whole cacheline of 128 bytes may be consumed but only 8 bytes of information is in constant use. It would be better to be able to place more information in this cacheline.

4. The lookup of the per cpu object is expensive and requires multiple memory accesses to:
   A) smp_processor_id()
   B) pointer to the base of the per cpu pointer array
   C) pointer to the per cpu object in the pointer array
   D) the per cpu object itself.

5. Each use of allocper requires its own per cpu array. On large systems large arrays have to be allocated again and again.

6. Processor hotplug cannot effectively track ...
All seems reasonable to me. The obvious question is "how do we size the arena". We either waste memory or, much worse, run out. And running out is a real possibility, I think. Most people will only mount a handful of XFS filesystems. But some customer will come along who wants to mount 5,000, and distributors will need to cater for that, but how can they? I wonder if we can arrange for the default to be overridden via a kernel boot option? Another obvious question is "how much of a problem will we have with internal fragmentation"? This might be a drop-dead showstopper. --
The per cpu memory use by subsystems is typically quite small. We already have an 8k limitation for percpu space for modules. And that does not seem But then per cpu data is not frequently allocated and freed. Going away from allocpercpu saves a lot of memory. We could make this 128k or so to be safe? --
It was just an example. There will be others. tcp_v4_md5_do_add ->tcp_alloc_md5sig_pool ->__tcp_alloc_md5sig_pool does an alloc_percpu for each md5-capable TCP connection. I think - it doesn't matter really, because something _could_. And if something I think it is, in the TCP case. And that's the only one I looked at. ("alloc_percpu" - please be careful about getting this stuff right) I don't think there is presently any upper limit on alloc_percpu()? It uses kmalloc() and kmalloc_node()? Even if there is some limit, is it an unfixable one? --
Last time I took a look at this stuff, this was a percpu allocation for all connections, not for each TCP session. (It should be static, instead of dynamic ) Really, percpu allocations are currently not frequent at all. vmalloc()/vfree() are way more frequent and still use a list. --
Sure it's hard to conceive how anyone could go and do a per-cpu allocation on a fastpath. But this has nothing to do with the frequency! The problems surround the _amount_ of allocated memory and the allocation/freeing patterns. Here's another example. And it's only an example! Generalise! ext3 maintains three percpu_counters per mount. Each percpu_counter does one percpu_alloc. People can mount an arbitrary number of ext3 filesystems! Another: there are two percpu_counters (and hence two percpu_alloc()s) per backing_dev_info. One backing_dev_info per disk and people have been known to have thousands (iirc ~10,000) disks online. And those examples were plucked only from today's kernel. Who knows what other problems will be in 2.6.45? --
We can always increase the sizes. --
It could be 4000. The present alloc_percpu() would support that. And struct nfs_iostats is 264 bytes and nfs does an alloc_percpu() of one of those per server and mounting thousands of servers per client is, I believe, a real-world operation. Plus for the umpteenth time: saying that this code will probably work acceptably for most people in 2.6.26 is not sufficient! --
Another example, not as extreme, there's an alloc_percpu(struct disk_stats) [80 bytes on 64-bit machines] for every disk and every partition in the machine. The TPC system has 3000 disks, each with 14 partitions on it. That's 15 * 80 * 3000 = 3,600,000 bytes. Even if you're only putting a pointer to each allocation in the percpu area, that's still 360,000 bytes, 12x as much as you think is sufficient for the entire system. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
No. The module subsystem has its own alloc_percpu subsystem that the But it's going to be even more complicated and I have a hard time managing No there is no limit. It just wastes lots of space (pointer arrays, alignment etc) that we could use to configure sufficiently large per cpu areas. --
Christoph, please. An allocator which is of fixed size and which is vulnerable to internal fragmentation is a huge problem! The kernel is subject to wildly varying workloads both between different users and in the hands of a single user. If we were to merge all this code and then run into the problems which I fear then we are tremendously screwed. We must examine this exhaustively, in the most paranoid fashion. --
Right but it needs to have its own section of the percpu space from which it allocates the percpu segments for the modules. So it effectively Well V2 virtually mapped the cpu alloc area which allowed extending it arbitrarily. But that made things very complicated. The number of per cpu resources needed is mostly fixed. The number of zones, nodes, slab caches, network interfaces etc etc does not change much during typical operations. --
On Thu, 29 May 2008 23:16:11 -0700 (PDT) Could you add some text to explain "This interface is for wise use of pre-allocated limited area (see Documentation/xxxx). please use this only when you need very fast access to per-cpu object and you can estimate the amount which you finally need. If unsure, please use generic allocator." for the moment ? At first look, I thought of using this in memory-resource-controller but it seems I shouldn't do so because thousands of cgroups can be used in theory... Thanks, -Kame --
Is there any reason why the per_cpu area couldn't be made extensible? Maybe a simple linked list of available areas? (And use a config variable and/or boot param for initial size and increment size?) [Ignoring the problem of reclaiming the space...] --
cpu alloc v2 had an extendable per cpu space. You have the patches. We could put this on top of this patchset if necessary. But then it is not so nice and simple anymore. Maybe we can restrict the use of cpu alloc instead to users with objects < cache_line_size() or so? --
Restricting the use of cpu_alloc based on size of object is no good when you're trying to allocate 45,000 objects. Extending the per CPU space is the only option. --
Ok guess we need to bring the virtually mapped per cpu patches forward. --
One problem with variable sized cpu_alloc area is this comment in bitmap.h:

 * Note that nbits should be always a compile time evaluable constant.
 * Otherwise many inlines will generate horrible code.

I'm guessing since this will be of low use and not performance critical, then we can ignore the "horrible code"? ;-) Thanks, Mike --
Christoph & Mike,

Please forgive me if I beat a dead horse, but this percpu stuff should find its way. I wonder why you absolutely want to have only one chunk holding all percpu variables, static(vmlinux) & static(modules) & dynamically allocated. It's *not* possible to put an arbitrary limit on this global zone. You'll always find somebody to break this limit. This is the point we must solve, before coding anything.

Have you considered using a list of fixed size chunks, each chunk having its own bitmap ? We only want fixed offsets between CPU locations. For a given variable, we MUST find addresses for all CPUS looking at the same offset table. (Then we can optimize things on x86, using %gs/%fs register, instead of a table lookup)

We could choose the chunk size at compile time, depending on various parameters (32/64 bit arches, or hugepage sizes on NUMA), and a minimum value (ABI guarantee). On x86_64 && NUMA we could use 2 Mbytes chunks, while on x86_32 or non NUMA we should probably use 64 Kbytes.

At boot time, we set up the first chunk (chunk 0) and copy .data.percpu on this chunk, for each possible cpu, and we build the bitmap for future dynamic/module percpu allocations. So we still have the restriction that sizeofsection(.data.percpu) should fit in chunk 0. Not a problem in practice.

Then if we need to expand the percpu zone for a heavy duty machine, and chunk 0 is already filled, we can add as many 2 M / 64K chunks as we need. This would limit the dynamic percpu allocation to 64 kbytes for a given variable, so huge users should probably still use a different allocator (like oprofile alloc_cpu_buffers() function). But at least we no longer limit the total size of the percpu area.

I understand you want to offset percpu data to 0, but for static percpu data. (pda being included in, to share %gs) For dynamically allocated percpu variables (including modules ".data.percpu"), nothing forces you to have low offsets, relative to %gs/%fs register. Access to these ...
Wow! Thanks for the detail! It's extremely useful (to me at least) to see it spelled out. Since Christoph is still on vacation I'll try to summarize where we're at at the moment. (Besides being stuck on a boot up problem with the %gs based percpu variables that is. ;-) Yes, the problem is we need to use virtual addresses to expand the percpu areas since each cpu needs the same fixed offset to the newly allocated variables. This was in the prior (v2) version of cpu_alloc so I'm looking at pulling that forward. And I also figured that the size of the expansion allocations should be based on the system size to minimize the effect on small systems (seems to be my life the past 6 months... ;-) I'm also looking at integrating more into the already present infrastructure (thanks Rusty!) so there are less "diffs" (and less new testing needed.) And of course, there's the complexities of submitting patches to many architectures simultaneously. Hopefully, I'll have something for review soon. Thanks again, Mike --
If you're prepared to have mappings for chunk 0, you can simply make it virtually linear and creating a new chunk is simple. If not, you need to reserve the virtual address space(s) for future mappings. Otherwise you're unlikely to get the same layout for allocations. This is not a show-stopper: we've lived with limited vmalloc room since forever. It just has to be sufficient. Otherwise, your analysis is correct, if a little verbose :) Cheers, Rusty. --
The problem is that offsets relative to %gs or %fs are limited by the small memory model that is chosen. We cannot have an offset larger than 2GB. So we must have a linear address range and cannot use separate chunks of memory. If we do not use the segment register then we cannot do atomic Right that is what cpu_alloc v2 did. It created a virtual mapping and The relative to 0 stuff comes in at the x86_64 level because we want to unify pda and percpu accesses. pda accesses have been relative to 0 and in particular the stack canary in glibc directly accesses the pda at a certain offset. So we must be zero based in order to preserve Normal memory uses 2MB tlbs. There is no overhead therefore by mapping the percpu areas using 2MB tlbs. So we do not need to be that complicated. What v2 did was allocate an area n * MAX_VIRT_PER_CPU_SIZE in vmalloc space and then it dynamically populated 2MB segments as needed. The MAX size was 128MB or so. We could either do the same on i386 or use 4kb mappings (then we can directly use the vmalloc functionality). But then there would be additional TLB overhead. We have similar 2MB virtual mapping tricks for the virtual memmap. Basically we can copy the functions and customize them for the virtual per cpu areas (Mike is hopefully listening and reading the V2 patch ....) --
Actually they are not. If you really want you can do movabs $64bit,%reg ; op ...,%gs:(%reg) It's just not very efficient compared to small (or rather kernel) model and also older binutils didn't support large model. -Andi --
I am not sure Christoph was referring to actual instructions. I was suggesting using, for static percpu (vmlinux or modules) :

vmlinux : (offset31 computed by linker at vmlinux link edit time)
    incl %gs:offset31

modules : (offset31 computed at module load time by module loader)
    incl %gs:offset31

(If we make sure all this stuff is allocated in first chunk)

And for dynamic percpu :
    movq field(%rdi),%rax
    incl %gs:(%rax) /* full 64bits 'offsets' */

I understood (but might be wrong again) that %gs itself could not be used with an offset > 2GB, because of the way the %gs segment is set up. So in the 'dynamic percpu' case, %rax should not exceed 2^31
