Re: OOM when adding ipv6 route: How to make available more per-cpu memory?

From: Eric Dumazet
Date: Friday, November 5, 2010 - 11:06 am

CC linux-kernel and Tejun Heo

How many possible cpus do you have ?
head -1 /proc/interrupts

and please post :
cat /proc/vmallocinfo

Thanks
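
Another way to answer the possible-CPUs question, assuming the kernel exposes /sys/devices/system/cpu/possible (a list such as "0-31"), is to count the entries in that list. The helper name count_cpu_list is made up for this sketch and is not part of any tool.

```shell
# count_cpu_list LIST: count CPUs in a kernel cpu-list string such as "0-3,8-11".
# Hypothetical helper, written for illustration only.
count_cpu_list() {
    total=0
    old_ifs=$IFS
    IFS=,
    for part in $1; do
        case $part in
            *-*)
                lo=${part%-*}
                hi=${part#*-}
                total=$(( total + hi - lo + 1 ))   # inclusive range
                ;;
            *)
                total=$(( total + 1 ))             # single CPU number
                ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}
```

Typical use would be: count_cpu_list "$(cat /sys/devices/system/cpu/possible)"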


--

From: Ben Greear
Date: Friday, November 5, 2010 - 11:15 am

root@lanforge-ubuntu:/home/lanforge# head -1 /proc/interrupts

root@lanforge-ubuntu:/home/lanforge# cat /proc/vmallocinfo
0xf7ffe000-0xf8000000    8192 hpet_enable+0x2d/0x1b8 phys=fed00000 ioremap
0xf8002000-0xf8004000    8192 acpi_os_map_memory+0x16/0x1f phys=df79e000 ioremap
0xf8004000-0xf8007000   12288 acpi_os_map_memory+0x16/0x1f phys=df7a0000 ioremap
0xf8008000-0xf800a000    8192 acpi_os_map_memory+0x16/0x1f phys=df790000 ioremap
0xf800b000-0xf8010000   20480 module_alloc+0x72/0x80 pages=4 vmalloc
0xf8010000-0xf8019000   36864 acpi_os_map_memory+0x16/0x1f phys=df790000 ioremap
0xf801a000-0xf801c000    8192 module_alloc+0x72/0x80 pages=1 vmalloc
0xf801d000-0xf8020000   12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf8023000-0xf8026000   12288 module_alloc+0x72/0x80 pages=2 vmalloc
0xf8026000-0xf8028000    8192 msix_capability_init+0xae/0x2b0 phys=fa4fe000 ioremap
0xf8028000-0xf802a000    8192 module_alloc+0x72/0x80 pages=1 vmalloc
0xf802b000-0xf802e000   12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf802f000-0xf8032000   12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf8033000-0xf8036000   12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf8037000-0xf803a000   12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf803b000-0xf803e000   12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf803f000-0xf8048000   36864 module_alloc+0x72/0x80 pages=8 vmalloc
0xf804b000-0xf8055000   40960 module_alloc+0x72/0x80 pages=9 vmalloc
0xf8056000-0xf8059000   12288 igb_setup_tx_resources+0x27/0x140 [igb] pages=2 vmalloc
0xf805b000-0xf8063000   32768 module_alloc+0x72/0x80 pages=7 vmalloc
0xf8066000-0xf8068000    8192 msix_capability_init+0xae/0x2b0 phys=fa4fa000 ioremap
0xf8068000-0xf8070000   32768 module_alloc+0x72/0x80 pages=7 vmalloc
0xf8072000-0xf8075000   12288 module_alloc+0x72/0x80 pages=2 vmalloc
0xf8076000-0xf8083000   53248 module_alloc+0x72/0x80 pages=12 vmalloc
0xf8084000-0xf8087000   12288 ...
From: Eric Dumazet
Date: Friday, November 5, 2010 - 1:20 pm

Thanks

Your vmalloc space is very fragmented. pcpu_get_vm_areas() wants
hugepages (4MB on your machine, 2MB on mine because I have
CONFIG_HIGHMEM64G=y).
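
One rough way to put a number on that fragmentation is to look for the largest hole between consecutive ranges in /proc/vmallocinfo. The sketch below assumes the listing is sorted by address (as in the dump above) and that the addresses fit comfortably in shell arithmetic, which holds for a 32-bit kernel; the function name is invented for the example.

```shell
# largest_vmalloc_gap: read /proc/vmallocinfo-style lines on stdin and print
# the largest hole, in bytes, between the end of one range and the start of
# the next. Hypothetical helper; assumes address-sorted input.
largest_vmalloc_gap() {
    prev_end=""
    max=0
    while read -r range rest; do
        case $range in
            0x*-0x*) ;;          # looks like START-END, keep it
            *) continue ;;       # skip anything else (e.g. a truncated line)
        esac
        start=${range%%-*}
        end=${range##*-}
        if [ -n "$prev_end" ]; then
            gap=$(( $start - $prev_end ))
            if [ "$gap" -gt "$max" ]; then
                max=$gap
            fi
        fi
        prev_end=$end
    done
    echo "$max"
}
```

Run as root: largest_vmalloc_gap < /proc/vmallocinfo. This only measures holes between existing mappings, not the free space after the last one.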

You could :

1) Use a 64 bit kernel ( :) )

or

2) boot parameter vmalloc=256M   to get more room
   (default is 128 Mbytes)

and eventually

select a 2G/2G User/Kernel split to get more LOWMEM, because a big vmalloc
window shrinks the LOWMEM zone. (CONFIG_VMSPLIT_2G=y)
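
The trade-off can be sketched with back-of-the-envelope arithmetic: on 32-bit x86, LOWMEM is roughly the kernel's share of the address space minus the vmalloc window. The helper below is a deliberately crude estimate that ignores fixmap and other small reservations; its name and the exact figures it prints are illustrative assumptions, not values reported in this thread.

```shell
# lowmem_estimate_mb KERNEL_SPACE_MB VMALLOC_MB
# Very rough LOWMEM estimate for a 32-bit kernel: kernel virtual space minus
# the vmalloc window. Ignores fixmap and other reservations (assumption).
lowmem_estimate_mb() {
    echo $(( $1 - $2 ))
}

# 3G/1G split, default vmalloc=128M
lowmem_estimate_mb 1024 128
# 3G/1G split, vmalloc=256M
lowmem_estimate_mb 1024 256
# 2G/2G split (CONFIG_VMSPLIT_2G=y), vmalloc=256M
lowmem_estimate_mb 2048 256
```

The second case shows why a larger vmalloc window on its own costs LOWMEM, and the third why the 2G/2G split is suggested alongside it.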



--

From: Ben Greear
Date: Friday, November 5, 2010 - 1:26 pm

That sounds promising as well.


I was also wondering if it would make sense to allow one to disable
the snmp stats for ipv6?  I don't think I have any use for those
stats anyway..

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

--

From: Eric Dumazet
Date: Friday, November 5, 2010 - 1:53 pm

I agree. IPv6 has per-device SNMP fields, percpu... that's probably not
needed.

We have many SNMP fields that could avoid being percpu, even for ipv4.



--

From: Eric Dumazet
Date: Friday, November 5, 2010 - 3:11 pm

Well, what I wrote about hugepages is wrong. We use normal (4KB) pages, unfortunately.

I have a NUMA machine, with two nodes, so pcpu_get_vm_areas() allocates
two zones, one for each node, with a 'known' offset between them.
Then, 4KB pages are allocated to populate the zone when needed.

# grep pcpu_get_vm_areas /proc/vmallocinfo 
0xffffe8ffa0400000-0xffffe8ffa0600000 2097152 pcpu_get_vm_areas+0x0/0x740 vmalloc
0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x740 vmalloc

BTW, we don't have the number of pages currently allocated in each
'vmalloc' zone, and/or node information.

Tejun, do you have plans to use hugepages eventually ?
(and fallback to 4KB pages, but most percpu data are allocated right
after boot)

Thanks


--

From: Ben Greear
Date: Friday, November 5, 2010 - 5:07 pm

We just tried creating 1000 macvlans with IPv6 addrs on a 64-bit machine
with 12GB RAM.  Only around 520 interfaces properly set their IPs, and
again there are errors about out-of-memory from 'ip', but no obvious
splats in dmesg.

'top' shows 10G or so free.

It will take some time to figure out what exactly is returning
the ENOMEM....

Thanks,


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

--

From: Eric Dumazet
Date: Saturday, November 6, 2010 - 12:26 am

At least, nothing to do with percpu stuff ?

On my 4GB machine, 16 'cpus' (but 32 possible cpus), I was able to
allocate 8192 percpu structures of 8192 bytes each.

(total : 32 * 8192 * 8192 = 2 Gbytes)
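
The total above follows from the fact that each percpu object is replicated once per possible CPU. A minimal sketch of the same arithmetic, with a made-up helper name:

```shell
# percpu_footprint_bytes POSSIBLE_CPUS OBJECTS OBJECT_SIZE
# Each percpu object exists once per possible CPU, so the footprint is the
# product of the three numbers. Hypothetical helper for illustration.
percpu_footprint_bytes() {
    echo $(( $1 * $2 * $3 ))
}

bytes=$(percpu_footprint_bytes 32 8192 8192)
echo "$bytes bytes = $(( bytes / 1024 / 1024 )) MB"
```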

setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:2
PERCPU: Embedded 26 pages/cpu @ffff88007fc00000 s76032 r8192 d22272 u131072
pcpu-alloc: s76032 r8192 d22272 u131072 alloc=1*2097152
pcpu-alloc: [0] 00 02 04 06 08 10 12 14 17 19 21 23 25 27 29 31 
pcpu-alloc: [1] 01 03 05 07 09 11 13 15 16 18 20 22 24 26 28 30 

grep Vmalloc /proc/meminfo 
VmallocTotal:   34359738367 kB
VmallocUsed:     2202592 kB
VmallocChunk:   34356996456 kB

Make sure udev / hotplug is not the problem, if you create your devices
very fast.

(modprobe dummy numdummies=2000) can be very slow because of that.
All tasks are fighting for RTNL or sysfs mutex.



--

From: Ben Greear
Date: Saturday, November 6, 2010 - 10:08 am

At least I don't see any percpu dumps in dmesg.  I vaguely remember
someone posting some ipv6 address scalability patches some time back.
I think they had to hack on /proc fs as well.  I'll see if I can

We can create the macvlans w/out problem, though I'm sure that could
be sped up.  The problem is when we try to add IPv6 addresses to
them.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
--

From: Eric Dumazet
Date: Monday, November 8, 2010 - 4:02 am

I see. Did you check /proc/sys/net/ipv6/ tunables ?

For example, I bet you need to make route/max_size a bigger value than
default (4096)

Following is working for me

echo 16384 >/proc/sys/net/ipv6/route/max_size
modprobe dummy numdummies=2000
for a in $(seq 1 1999)
do
 ip -6 addr add 4444::444:$a/24 dev dummy$a
done

ip -6 ro | wc -l
6008
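
If the limit is raised from a script, it may be safer to write the new value only when the current one is lower, so a larger setting made elsewhere is not overwritten. A small sketch, assuming a helper called raise_if_lower (the name is invented here):

```shell
# raise_if_lower FILE MIN: write MIN to FILE only if the current value is
# smaller. Hypothetical helper; FILE is expected to hold a single integer.
raise_if_lower() {
    file=$1
    min=$2
    cur=$(cat "$file") || return 1
    if [ "$cur" -lt "$min" ]; then
        echo "$min" > "$file"
    fi
}
```

For the case above, as root: raise_if_lower /proc/sys/net/ipv6/route/max_size 16384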



--

From: Ben Greear
Date: Monday, November 8, 2010 - 10:45 am

That helps.  I'm getting all of the IP addrs set now, but
having trouble with some of the default gateways (I have one
routing table per interface).

./local/sbin/ip -6 route replace default via 2002:9:8::1 dev eth7#458 table 726
RTNETLINK answers: No buffer space available

dmesg is full of this:

[247106.294743] ipv6: Neighbour table overflow.


A quick look in /proc didn't show a tunable for this, but I'll
go grub through the code.
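
For what it is worth, Linux exposes neighbour garbage-collection thresholds as gc_thresh1, gc_thresh2 and gc_thresh3 under /proc/sys/net/ipv6/neigh/default/. Whether raising them resolves the overflow reported here is not confirmed anywhere in this thread, so treat the following only as a way to inspect the current values. The function name is made up.

```shell
# show_neigh_thresholds [DIR]: print the neighbour GC thresholds found in DIR
# (default: the IPv6 "default" neighbour table directory). Hypothetical helper.
show_neigh_thresholds() {
    dir=${1:-/proc/sys/net/ipv6/neigh/default}
    for name in gc_thresh1 gc_thresh2 gc_thresh3; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}
```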

As for the route/max_size, it would be nice to see some useful kernel
message in dmesg when this limit is hit.  Just telling the user '-ENOMEM'
is not at all sufficient to help them figure out the problem.

For that matter, why is there such a limit anyway?  IPv4 doesn't appear
to have any such limit?

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

--

From: Eric Dumazet
Date: Monday, November 8, 2010 - 10:55 am

Sure, patches are welcomed. Apparently nobody yet used ipv6 with so many

There are limits for ipv4, much bigger, you probably never noticed.


/proc/sys/net/ipv4/route/gc_elasticity:8
/proc/sys/net/ipv4/route/gc_interval:60
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:131072
/proc/sys/net/ipv4/route/gc_timeout:300
/proc/sys/net/ipv4/route/max_size:2097152    <<< HERE
/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600
/proc/sys/net/ipv4/route/redirect_load:2
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:2048

I suggest follow-up discussion can go to netdev only, now that per-cpu is
no longer the problem ?




--

From: Tejun Heo
Date: Saturday, November 6, 2010 - 2:11 am

Hello,


Well, it's rather complicated.  Till now, the percpu usage hasn't
justified allocating hugepages but it might someday, but more
importantly the reason why those big chunks of address space are used
is to keep the first chunk embedded in the regular linear kernel
address space to avoid extra TLB pressure.

On configurations where vmalloc area is a scarce resource,
percpu_alloc=page can be specified to use page-mapped allocation.
This will use much smaller chunks in vmalloc area at the cost of
additional 4k page TLB pressure for percpu memory in the first chunk
(all the static percpu variables and then some).
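
To confirm which of these boot parameters a running system was actually started with, the kernel command line can be inspected. The sketch below pulls one name=value parameter out of /proc/cmdline-style text on stdin; cmdline_param is a name invented for the example.

```shell
# cmdline_param NAME: read a kernel command line on stdin and print the value
# of the first NAME=value token, or nothing if it is absent.
# Hypothetical helper; NAME is assumed to contain no regex metacharacters.
cmdline_param() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}
```

For example: cmdline_param percpu_alloc < /proc/cmdline, or cmdline_param vmalloc < /proc/cmdline for the vmalloc= setting suggested earlier in the thread.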

pcpu_embed_first_chunk() contains a heuristic which makes it yield to
page allocator but the parameter is pretty generous (maximum distance
between chunks > 75% of vmalloc area).  It's there just to avoid
completely crazy cases.  Also, x86 setup_per_cpu_areas() chooses page
allocator on 32bit NUMAs.  This case didn't trigger either.  We
probably need to add another condition.
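
The 75% rule described above can be sketched as a plain integer comparison. This is only an illustration of the stated criterion, not the kernel's code; the function name and the sample numbers are assumptions.

```shell
# exceeds_embed_limit MAX_DISTANCE VMALLOC_SIZE
# Succeeds (exit 0) when the maximum distance between chunks is more than
# 75% of the vmalloc area, i.e. when the embedding allocator would yield to
# the page allocator under the criterion described above. Both arguments
# must be in the same unit. Hypothetical helper.
exceeds_embed_limit() {
    [ $(( $1 * 4 )) -gt $(( $2 * 3 )) ]
}

# Example with made-up sizes in MB: 100 of 128 is above 75%, 96 of 128 is not.
if exceeds_embed_limit 100 128; then
    echo "would fall back to page allocator"
fi
```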

That large machine on 32bit is bound to be flaky.  To be short on
virtual address space is a pretty silly and stupid thing.  Anyways,
any good idea on what criteria we could test?

Thanks.

-- 
tejun
--
