CC linux-kernel and Tejun Heo

How many possible cpus do you have?

    head -1 /proc/interrupts

and please post:

    cat /proc/vmallocinfo

Thanks
--
root@lanforge-ubuntu:/home/lanforge# head -1 /proc/interrupts
root@lanforge-ubuntu:/home/lanforge# cat /proc/vmallocinfo
0xf7ffe000-0xf8000000 8192 hpet_enable+0x2d/0x1b8 phys=fed00000 ioremap
0xf8002000-0xf8004000 8192 acpi_os_map_memory+0x16/0x1f phys=df79e000 ioremap
0xf8004000-0xf8007000 12288 acpi_os_map_memory+0x16/0x1f phys=df7a0000 ioremap
0xf8008000-0xf800a000 8192 acpi_os_map_memory+0x16/0x1f phys=df790000 ioremap
0xf800b000-0xf8010000 20480 module_alloc+0x72/0x80 pages=4 vmalloc
0xf8010000-0xf8019000 36864 acpi_os_map_memory+0x16/0x1f phys=df790000 ioremap
0xf801a000-0xf801c000 8192 module_alloc+0x72/0x80 pages=1 vmalloc
0xf801d000-0xf8020000 12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf8023000-0xf8026000 12288 module_alloc+0x72/0x80 pages=2 vmalloc
0xf8026000-0xf8028000 8192 msix_capability_init+0xae/0x2b0 phys=fa4fe000 ioremap
0xf8028000-0xf802a000 8192 module_alloc+0x72/0x80 pages=1 vmalloc
0xf802b000-0xf802e000 12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf802f000-0xf8032000 12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf8033000-0xf8036000 12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf8037000-0xf803a000 12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf803b000-0xf803e000 12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf803f000-0xf8048000 36864 module_alloc+0x72/0x80 pages=8 vmalloc
0xf804b000-0xf8055000 40960 module_alloc+0x72/0x80 pages=9 vmalloc
0xf8056000-0xf8059000 12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf805b000-0xf8063000 32768 module_alloc+0x72/0x80 pages=7 vmalloc
0xf8066000-0xf8068000 8192 msix_capability_init+0xae/0x2b0 phys=fa4fa000 ioremap
0xf8068000-0xf8070000 32768 module_alloc+0x72/0x80 pages=7 vmalloc
0xf8072000-0xf8075000 12288 module_alloc+0x72/0x80 pages=2 vmalloc
0xf8076000-0xf8083000 53248 module_alloc+0x72/0x80 pages=12 vmalloc
0xf8084000-0xf8087000 12288 ...
Thanks

Your vmalloc space is very fragmented.

pcpu_get_vm_areas() wants hugepages (4MB on your machine, 2MB on mine because I have CONFIG_HIGHMEM64G=y)

You could:

1) Use a 64 bit kernel ( :) )

or

2) Use the boot parameter vmalloc=256M to get more room (default is 128 Mbytes), and eventually select a 2G/2G User/Kernel split (CONFIG_VMSPLIT_2G=y) to get more LOWMEM, because a big vmalloc window shrinks the LOWMEM zone.
--
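To put a number on "very fragmented", the holes between consecutive regions in a /proc/vmallocinfo dump can be measured. The following is a minimal sketch, not a tool from the thread: the function name is hypothetical, it assumes the entries are listed in ascending address order (as in the dump above), and it only sees holes between listed regions, not free space before the first or after the last one.

```shell
# Print the largest hole, in bytes, between consecutive regions of
# /proc/vmallocinfo-style input read from stdin.
# Assumption: lines start with "0xSTART-0xEND" and are sorted by address.
largest_vmalloc_gap() {
    prev_end=""
    max=0
    while read -r range rest; do
        # Ignore anything that does not look like an address range.
        case $range in
            0x*-0x*) ;;
            *) continue ;;
        esac
        start=$(( ${range%-*} ))   # text before the dash
        end=$(( ${range#*-} ))     # text after the dash
        if [ -n "$prev_end" ]; then
            gap=$(( start - prev_end ))
            [ "$gap" -gt "$max" ] && max=$gap
        fi
        prev_end=$end
    done
    echo "$max"
}
```

Run as `largest_vmalloc_gap < /proc/vmallocinfo` (reading that file may require root). A result well below the 4MB mentioned in the message would be consistent with the allocation failures discussed here.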
That sounds promising as well.

I was also wondering if it would make sense to allow one to disable the snmp stats for ipv6? I don't think I have any use for those stats anyway.

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
--
I agree. IPv6 has per-device SNMP fields, percpu... that's probably not needed.

We have many SNMP fields that could avoid being percpu, even for ipv4.
--
Well, this is wrong. We use normal (4KB) pages, unfortunately.

I have a NUMA machine, with two nodes, so pcpu_get_vm_areas() allocates two zones, one for each node, with a 'known' offset between them. Then, 4KB pages are allocated to populate the zone when needed.

# grep pcpu_get_vm_areas /proc/vmallocinfo
0xffffe8ffa0400000-0xffffe8ffa0600000 2097152 pcpu_get_vm_areas+0x0/0x740 vmalloc
0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x740 vmalloc

BTW, we don't have the number of pages currently allocated in each 'vmalloc' zone, and/or node information.

Tejun, do you have plans to use hugepages eventually? (and fall back to 4KB pages, but most percpu data are allocated right after boot)

Thanks
--
We just tried creating 1000 macvlans with IPv6 addrs on a 64-bit machine with 12GB RAM. Only around 520 interfaces properly set their IPs, and again there are out-of-memory errors from 'ip', but no obvious splats in dmesg.

'top' shows 10G or so free.

It will take some time to figure out what exactly is returning the ENOMEM....

Thanks,

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
--
At least, nothing to do with percpu stuff?

On my 4GB machine, 16 'cpus' (but 32 possible cpus), I was able to allocate 8192 percpu structures of 8192 bytes each (total: 32 * 8192 * 8192 = 2 Gbytes)

setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:2
PERCPU: Embedded 26 pages/cpu @ffff88007fc00000 s76032 r8192 d22272 u131072
pcpu-alloc: s76032 r8192 d22272 u131072 alloc=1*2097152
pcpu-alloc: [0] 00 02 04 06 08 10 12 14 17 19 21 23 25 27 29 31
pcpu-alloc: [1] 01 03 05 07 09 11 13 15 16 18 20 22 24 26 28 30

grep Vmalloc /proc/meminfo
VmallocTotal: 34359738367 kB
VmallocUsed: 2202592 kB
VmallocChunk: 34356996456 kB

Make sure udev / hotplug is not the problem, if you create your devices very fast. (modprobe dummy numdummies=2000) can be very slow because of that. All tasks are fighting for RTNL or sysfs mutex.
--
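The total quoted in the message can be checked with shell arithmetic. This is only a worked version of the figures given there; the function name is hypothetical.

```shell
# Total bytes consumed by percpu allocations.
# $1 = possible cpus, $2 = number of allocations, $3 = bytes per allocation
percpu_total_bytes() {
    echo $(( $1 * $2 * $3 ))
}

# Figures from the message: 32 possible cpus, 8192 structures, 8192 bytes each.
percpu_total_bytes 32 8192 8192   # 2147483648 bytes, i.e. 2 GiB
```

That matches the reported VmallocUsed of 2202592 kB, which is slightly above 2 GiB.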
At least I don't see any percpu dumps in dmesg. I vaguely remember someone posting some ipv6 address scalability patches some time back. I think they had to hack on /proc fs as well. I'll see if I can

We can create the macvlans w/out problem, though I'm sure that could be sped up. The problem is when we try to add IPv6 addresses to them.

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
--
I see.

Did you check the /proc/sys/net/ipv6/ tunables?

For example, I bet you need to make route/max_size a bigger value than the default (4096).

The following is working for me:

    echo 16384 >/proc/sys/net/ipv6/route/max_size
    modprobe dummy numdummies=2000
    for a in $(seq 1 1999)
    do
        ip -6 addr add 4444::444:$a/24 dev dummy$a
    done

    ip -6 ro | wc -l
    6008
--
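When scripting this, it can be safer to raise the limit only if it is currently too low, so a larger value set elsewhere is not overwritten. A minimal sketch under that assumption; the helper name is hypothetical and it simply reads and writes the file it is given.

```shell
# Raise a numeric tunable to at least a minimum, leaving larger values alone.
# $1 = path of the tunable file, $2 = minimum wanted value
ensure_min_tunable() {
    current=$(cat "$1") || return 1
    if [ "$current" -lt "$2" ]; then
        echo "$2" > "$1"
    fi
}
```

As root, `ensure_min_tunable /proc/sys/net/ipv6/route/max_size 16384` would have the same effect as the echo in the message when the default of 4096 is in place.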
That helps. I'm getting all of the IP addrs set now, but having trouble with some of the default gateways (I have one routing table per interface).

./local/sbin/ip -6 route replace default via 2002:9:8::1 dev eth7#458 table 726
RTNETLINK answers: No buffer space available

dmesg is full of this:

[247106.294743] ipv6: Neighbour table overflow.

A quick look in /proc didn't show a tunable for this, but I'll go grub through the code.

As for the route/max_size, it would be nice to see some useful kernel message in dmesg when this is hit. Just telling the user '-ENOMEM' is not at all sufficient to help them figure out the problem.

For that matter, why is there such a limit anyway? IPv4 doesn't appear to have any such limit?

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
--
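One place that may be worth checking, offered as an assumption rather than something established in the thread: the neighbour cache garbage-collection thresholds gc_thresh1, gc_thresh2 and gc_thresh3 under /proc/sys/net/ipv6/neigh/default/. The sketch below only prints them; the function name is hypothetical, and whether raising them clears this particular overflow is not confirmed here.

```shell
# Print the neighbour-cache GC thresholds found in a tunables directory.
# $1 = directory to read; defaults to the IPv6 default neighbour table.
show_neigh_thresholds() {
    dir=${1:-/proc/sys/net/ipv6/neigh/default}
    for name in gc_thresh1 gc_thresh2 gc_thresh3; do
        # Skip silently if a file is missing or unreadable.
        if [ -r "$dir/$name" ]; then
            echo "$name=$(cat "$dir/$name")"
        fi
    done
}
```

Calling `show_neigh_thresholds` with no argument reads the live values.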
Sure, patches are welcome. Apparently nobody yet used ipv6 with so many

There are limits for ipv4, much bigger, you probably never noticed.

/proc/sys/net/ipv4/route/gc_elasticity:8
/proc/sys/net/ipv4/route/gc_interval:60
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:131072
/proc/sys/net/ipv4/route/gc_timeout:300
/proc/sys/net/ipv4/route/max_size:2097152 <<< HERE
/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600
/proc/sys/net/ipv4/route/redirect_load:2
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:2048

I suggest follow-up discussion can go to netdev only, now that per-cpu is no longer the problem?
--
Hello,

Well, it's rather complicated. Till now, the percpu usage hasn't justified allocating hugepages but it might someday, but more importantly the reason why those big chunks of address space are used is to keep the first chunk embedded in the regular linear kernel address space to avoid extra TLB pressure.

On configurations where vmalloc area is a scarce resource, percpu_alloc=page can be specified to use page-mapped allocation. This will use much smaller chunks in vmalloc area at the cost of additional 4k page TLB pressure for percpu memory in the first chunk (all the static percpu variables and then some).

pcpu_embed_first_chunk() contains a heuristic which makes it yield to the page allocator, but the parameter is pretty generous (maximum distance between chunks > 75% of vmalloc area). It's there just to avoid completely crazy cases. Also, x86 setup_per_cpu_areas() chooses the page allocator on 32bit NUMAs. This case didn't trigger either. We probably need to add another condition.

That large machine on 32bit is bound to be flaky. To be short on virtual address space is a pretty silly and stupid thing. Anyways, any good idea on what criteria we could test?

Thanks.

--
tejun
--
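The 75% condition described in the message can be written as integer arithmetic. This is only an illustration of the comparison as worded there, not the kernel's code; the function name and its arguments are hypothetical.

```shell
# Succeeds (exit status 0) when the largest distance between percpu chunks
# is more than 75% of the vmalloc area, i.e. the case where the embedding
# allocator is described as yielding to the page allocator.
# $1 = maximum distance between chunks in bytes
# $2 = size of the vmalloc area in bytes
exceeds_vmalloc_threshold() {
    # distance > 3/4 * area, rewritten without fractions
    [ $(( $1 * 4 )) -gt $(( $2 * 3 )) ]
}
```

With the default 128 Mbytes of vmalloc space mentioned earlier in the thread, the cutoff works out to 96 Mbytes, which illustrates why the check is called generous: a fragmented but not absurd layout stays well under it.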
