Hello ALL Some time ago i report this: http://bugzilla.kernel.org/show_bug.cgi?id=6648 and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back dmesg output: oprofile: using NMI interrupt. Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits Fix inflate_threshold_root. Now=15 size=11 bits cat /proc/net/fib_triestat Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes. Main: Aver depth: 2.28 Max depth: 6 Leaves: 276539 Prefixes: 289922 Internal nodes: 66762 1: 35046 2: 13824 3: 9508 4: 4897 5: 2331 6: 1149 7: 5 9: 1 18: 1 Pointers: 691228 Null ptrs: 347928 Total size: 35709 kB Counters: --------- gets = 26276593 backtracks = 547306 semantic match passed = 26188746 semantic match miss = 1117 null node hit= 27285055 skipped node resize = 0 Local: Aver depth: 3.33 Max depth: 4 Leaves: 9 Prefixes: 10 Internal nodes: 8 1: 8 Pointers: 16 Null ptrs: 0 Total size: 2 kB Counters: --------- gets = 26642350 backtracks = 1282818 semantic match passed = 18166 semantic match miss = 0 null node hit= 0 skipped node resize = 0 This machine is running bgpd with two bgp peers / full route table cat /proc/meminfo MemTotal: 12279032 kB MemFree: ...
Curious, you seem to hit an old alloc_pages limit()... (MAX_ORDER allocation)
Your root node has 2^18 = 262144 pointers of 8 bytes -> 2097152 bytes (+ header -> 4194304 bytes)
But since following commit, we should use vmalloc() so this PAGE_SIZE<<10) limit
should not anymore be applied.
Could you do a "cat /proc/vmallocinfo" just to check your big tnodes are vmalloced() ?
commit 15be75cdb5db442d0e33d37b20832b88f3ccd383
Author: Stephen Hemminger <shemminger@vyatta.com>
Date: Thu Apr 10 02:56:38 2008 -0700
IPV4: fib_trie use vmalloc for large tnodes
Use vmalloc rather than alloc_pages to avoid wasting memory.
The problem is that tnode structure has a power of 2 sized array,
plus a header. So the current code wastes almost half the memory
allocated because it always needs the next bigger size to hold
that small header.
This is similar to an earlier patch by Eric, but instead of a list
and lock, I used a workqueue to handle the fact that vfree can't
be done in interrupt context.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--
cat /proc/vmallocinfo 0xf7ffe000-0xf8000000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfe6a000 ioremap 0xf8000000-0xf8007000 28672 acpi_tb_verify_table+0x1d/0x46 phys=dfef5000 ioremap 0xf8008000-0xf800a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef2000 ioremap 0xf800c000-0xf800e000 8192 acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap 0xf8010000-0xf8012000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfefb000 ioremap 0xf8014000-0xf8016000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef4000 ioremap 0xf8018000-0xf801a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef3000 ioremap 0xf801c000-0xf801e000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef1000 ioremap 0xf8020000-0xf8022000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef0000 ioremap 0xf8024000-0xf8026000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeef000 ioremap 0xf8028000-0xf802a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeee000 ioremap 0xf802c000-0xf802e000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeed000 ioremap 0xf8030000-0xf8032000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeec000 ioremap 0xf8038000-0xf803d000 20480 ich_force_enable_hpet+0x69/0x15a phys=fed1c000 ioremap 0xf803e000-0xf8040000 8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap 0xf8040000-0xf8046000 24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap 0xf8048000-0xf804a000 8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap 0xf804c000-0xf804e000 8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap 0xf804f000-0xf8051000 8192 reiserfs_init_bitmap_cache+0x32/0x65 pages=1 vmalloc 0xf8052000-0xf8064000 73728 journal_init+0x30/0x82a pages=17 vmalloc 0xf8065000-0xf8067000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc 0xf8068000-0xf806a000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc 0xf806b000-0xf806d000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc 0xf806e000-0xf8070000 8192 ...
This is from a 32 bit kernel. This doesnt match your previous /proc/meminfo (from a 64bit kernel on a 12 GB machine) Of course, I would like /proc/vmallocinfo on your loaded router, not from --
Yes sorry for no info about it. I test the same kernel configurations on one 32bit machine and second 64bit here is meminfo from this 32bit machine working on kernel 2.6.30 cat /proc/meminfo MemTotal: 3625444 kB MemFree: 3043648 kB Buffers: 133968 kB Cached: 36316 kB SwapCached: 0 kB Active: 256868 kB Inactive: 76252 kB Active(anon): 163064 kB Inactive(anon): 0 kB Active(file): 93804 kB Inactive(file): 76252 kB Unevictable: 0 kB Mlocked: 0 kB HighTotal: 2758160 kB HighFree: 2556136 kB LowTotal: 867284 kB LowFree: 487512 kB SwapTotal: 995896 kB SwapFree: 995896 kB Dirty: 3624 kB Writeback: 0 kB AnonPages: 162912 kB Mapped: 3612 kB Slab: 235888 kB SReclaimable: 46408 kB SUnreclaim: 189480 kB PageTables: 384 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 2808616 kB Committed_AS: 170648 kB VmallocTotal: 122880 kB VmallocUsed: 2876 kB VmallocChunk: 109824 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 4096 kB DirectMap4k: 8184 kB DirectMap4M: 901120 kB and vmallocinfo cat /proc/vmallocinfo 0xf7ffe000-0xf8000000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfe6a000 ioremap 0xf8000000-0xf8007000 28672 acpi_tb_verify_table+0x1d/0x46 phys=dfef5000 ioremap 0xf8008000-0xf800a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef2000 ioremap 0xf800c000-0xf800e000 8192 acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap 0xf8010000-0xf8012000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfefb000 ioremap 0xf8014000-0xf8016000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef4000 ioremap 0xf8018000-0xf801a000 8192 acpi_tb_verify_table+0x1d/0x46 ...
Yes, I was a fool to ask you to try 2.6.31-rc1, sorry. Even 2.6.30 is too young for a production machine. 2.6.29.5 contains the fixes, Pawel, did you tried this version ? --
No problem with this test i lost only one test failover and no traffic I alvays make like this - i have iBGP mesh with main access path of machines on stable 2.6.28.9 kernels and second failover path on machines that use newest kernel for testing in this case 2.6.29 but I will try 2.6.29.5 today Thanks Paweł Staszewski --
OK thanks Please report (while machine has enough load) output of rtstat -c20 -i1 (rtstat is a symbolic link to lnstat, if not provided by your distro) --
here you have
rtstat -i 1 -c 20
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
entries| in_hit|in_slow_|in_slow_|in_no_ro| in_brd|in_marti|in_marti|
out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
| | tot| mc| ute| | an_dst|
an_src| | _tot| _mc| | ed| miss| verflow|
_search|t_search|
93362|22930850| 1671866| 0| 1369| 2| 0|
0| 53432| 1324| 0| 0| 0| 0| 0|
4783896| 11985|
92067| 101426| 5315| 0| 2| 0| 0|
0| 258| 2| 0| 0| 0| 0| 0|
21893| 6|
90561| 100094| 4666| 0| 6| 0| 0|
0| 267| 1| 0| 0| 0| 0| 0|
23433| 30|
90101| 98672| 5630| 0| 2| 0| 0|
0| 253| 0| 0| 0| 0| 0| 0|
24386| 34|
89994| 99962| 5654| 0| 6| 0| 0|
0| 266| 2| 0| 0| 0| 0| 0|
26251| 38|
95209| 91974| 14860| 0| 9| 0| 0|
0| 236| 31| 0| 0| 0| 0| 0|
14238| 35|
95323| 101714| 10126| 0| 14| 0| 0|
0| 255| 9| 0| 0| 0| 0| 0|
8532| 21|
94814| 99918| 8539| 0| 5| 0| 0|
0| 258| 4| 0| 0| 0| 0| 0|
11069| 24|
98510| 93929| 12672| 0| 13| 0| 0|
0| 238| 31| 0| 0| 0| 0| 0|
12704| 34|
98983| 96131| 11128| 0| 12| 0| 0|
0| 252| 10| 0| ...On the other hand, even if there is no problem with memory, it seems
because of hitting max_resize the threshold should be changed, e.g.
by reverting the patch below.
Jarek P.
commit 965ffea43d4ebe8cd7b9fee78d651268dd7d23c5
Author: Robert Olsson <robert.olsson@its.uu.se>
Date: Mon Mar 19 16:29:58 2007 -0700
[IPV4]: fib_trie root node settings
The threshold for root node can be more aggressive set to get
better tree compression. The new setting mekes the root grow
from 16 to 19 bits and substansial improvemnt in Aver depth
this with the current table of 214393 prefixes
But really the dynamic resize should need more investigation
both in terms convergence and performance and maybe it should
be possible to change...
Maybe just for the brave to start with or we may have to back
this out.
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 5d2b43d..9be7da7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -292,8 +292,8 @@ static inline void check_tnode(const struct tnode *tn)
static int halve_threshold = 25;
static int inflate_threshold = 50;
-static int halve_threshold_root = 15;
-static int inflate_threshold_root = 25;
+static int halve_threshold_root = 8;
+static int inflate_threshold_root = 15;
static void __alias_free_mem(struct rcu_head *head)
--
Jarek Poplawski writes: > >> oprofile: using NMI interrupt. > >> Fix inflate_threshold_root. Now=15 size=11 bits > >> Fix inflate_threshold_root. Now=15 size=11 bits > >> Fix inflate_threshold_root. Now=15 size=11 bits > >> Fix inflate_threshold_root. Now=15 size=11 bits > >> Fix inflate_threshold_root. Now=15 size=11 bits > >> Fix inflate_threshold_root. Now=15 size=11 bits > On the other hand, even if there is no problem with memory, it seems > because of hitting max_resize the threshold should be changed, e.g. > by reverting the patch below. You seem to have some temporary memory problem. So the printout might be a bit misleading in this case. We really like to keep the root node as big as we can to keep the tree as flat as possible for performance reasons. (We're even more motivated now when we can disable the route cache) So I'll guess the next insert/delete inflates the root node to be within the interval. So I'll assume this just a temporary failure? I would be nice to have *threshholds* settable by /proc or /sys. I would use this in the other direction to trade memory for even faster lookups. But maybe experts memory allocation has some good suggestions. Cheers. --ro --
Pawel has reported these problems for a long time: http://bugzilla.kernel.org/show_bug.cgi?id=6648 So, until it's fully investigated, it seems some 'fast' fix is needed here. Cheers, Jarek P. --
I have never reported these problems but I am definitely seeing the same message on kernel 2.6.29.5, usually, when one of my BGP peers goes down. So, just a "me too". Regards, Jorge ----------------- [ 1198.333854] Fix inflate_threshold_root. Now=15 size=11 bits [ 1198.437028] Fix inflate_threshold_root. Now=15 size=11 bits [ 1198.460848] Fix inflate_threshold_root. Now=15 size=11 bits [ 1199.240223] Fix inflate_threshold_root. Now=15 size=11 bits [ 1199.279723] Fix inflate_threshold_root. Now=15 size=11 bits [ 1199.383081] Fix inflate_threshold_root. Now=15 size=11 bits [ 1200.154893] Fix inflate_threshold_root. Now=15 size=11 bits [ 1200.191711] Fix inflate_threshold_root. Now=15 size=11 bits [ 1200.223242] Fix inflate_threshold_root. Now=15 size=11 bits [ 1200.270299] Fix inflate_threshold_root. Now=15 size=11 bits [ 1200.355795] Fix inflate_threshold_root. Now=15 size=11 bits [ 1206.239254] Fix inflate_threshold_root. Now=15 size=11 bits [ 1206.271995] Fix inflate_threshold_root. Now=15 size=11 bits [ 1206.349351] Fix inflate_threshold_root. Now=15 size=11 bits [ 1206.384676] Fix inflate_threshold_root. Now=15 size=11 bits [ 1206.428801] Fix inflate_threshold_root. Now=15 size=11 bits [ 1206.457315] Fix inflate_threshold_root. Now=15 size=11 bits [ 1206.485710] Fix inflate_threshold_root. Now=15 size=11 bits [ 1206.513691] Fix inflate_threshold_root. Now=15 size=11 bits [ 1209.039681] Fix inflate_threshold_root. Now=15 size=11 bits [ 1209.069224] Fix inflate_threshold_root. Now=15 size=11 bits [ 1209.108840] Fix inflate_threshold_root. Now=15 size=11 bits [ 1209.141450] Fix inflate_threshold_root. Now=15 size=11 bits [ 1209.172317] Fix inflate_threshold_root. Now=15 size=11 bits [ 1209.197824] Fix inflate_threshold_root. Now=15 size=11 bits [ 1209.224711] Fix inflate_threshold_root. Now=15 size=11 bits [ 1209.251566] Fix inflate_threshold_root. Now=15 size=11 bits [ 1209.289603] Fix inflate_threshold_root. Now=15 size=11 bits [ 1211.561178] Fix ...
Jarek Poplawski writes: > > But maybe memory allocation experts has some good suggestions. > > Pawel has reported these problems for a long time: > http://bugzilla.kernel.org/show_bug.cgi?id=6648 > > So, until it's fully investigated, it seems some 'fast' fix is needed > here. We talked about having a fixed pre-allocated root-node long ago but it's only optimisation for routers w. full BGP. Best if memory problems got solved. Cheers --ro --
I think the current process of rebalancing can allocate and hold
unnecessarily long a lot of 'temp' memory, so probably something
like the patch below could be useful. It should be applied to the
2.6.30 after two patches below (from 2.6.31-rc). (Alas I can't even
compile-test it now).
Cheers,
Jarek P.
--- (for testing)
net/ipv4/fib_trie.c | 24 ++++++++++++++++++------
1 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..c2fc862 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg)
vfree(tn);
}
+static void __tnode_free(struct tnode *tn)
+{
+ if (size <= PAGE_SIZE)
+ kfree(tn);
+ else
+ vfree(tn);
+}
+
static void __tnode_free_rcu(struct rcu_head *head)
{
struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +410,7 @@ static void tnode_free_flush(void)
while ((tn = tnode_free_head)) {
tnode_free_head = tn->tnode_free;
tn->tnode_free = NULL;
- tnode_free(tn);
+ __tnode_free(tn);
}
}
@@ -1020,19 +1028,23 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
tnode_put_child_reorg((struct tnode *)tp, cindex,
(struct node *)tn, wasfull);
- tp = node_parent((struct node *) tn);
+ synchronize_rcu();
tnode_free_flush();
+ tp = node_parent((struct node *) tn);
if (!tp)
break;
tn = tp;
}
/* Handle last (top) tnode */
- if (IS_TNODE(tn))
+ if (IS_TNODE(tn)) {
tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
- rcu_assign_pointer(t->trie, (struct node *)tn);
- tnode_free_flush();
+ rcu_assign_pointer(t->trie, (struct node *)tn);
+ synchronize_rcu();
+ tnode_free_flush();
+ } else {
+ rcu_assign_pointer(t->trie, (struct node *)tn);
+ }
return;
}
---
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon Jun 15 02:31:29 2009 -0700
ipv4: Fix ...Alternatively here is a faster version with less synchronize_rcu().
Jarek P.
--- (take 2 - for testing)
net/ipv4/fib_trie.c | 27 +++++++++++++++++++++------
1 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..2936b2e 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg)
vfree(tn);
}
+static void __tnode_free(struct tnode *tn)
+{
+ if (size <= PAGE_SIZE)
+ kfree(tn);
+ else
+ vfree(tn);
+}
+
static void __tnode_free_rcu(struct rcu_head *head)
{
struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +410,7 @@ static void tnode_free_flush(void)
while ((tn = tnode_free_head)) {
tnode_free_head = tn->tnode_free;
tn->tnode_free = NULL;
- tnode_free(tn);
+ __tnode_free(tn);
}
}
@@ -1021,18 +1029,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
(struct node *)tn, wasfull);
tp = node_parent((struct node *) tn);
- tnode_free_flush();
if (!tp)
break;
tn = tp;
}
+ if (tnode_free_head) {
+ synchronize_rcu();
+ tnode_free_flush();
+ }
+
/* Handle last (top) tnode */
- if (IS_TNODE(tn))
+ if (IS_TNODE(tn)) {
tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
- rcu_assign_pointer(t->trie, (struct node *)tn);
- tnode_free_flush();
+ rcu_assign_pointer(t->trie, (struct node *)tn);
+ synchronize_rcu();
+ tnode_free_flush();
+ } else {
+ rcu_assign_pointer(t->trie, (struct node *)tn);
+ }
return;
}
---
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon Jun 15 02:31:29 2009 -0700
ipv4: Fix fib_trie rebalancing
While doing trie_rebalance(): resize(), inflate(), halve() RCU free
tnodes before updating their parents. It depends on RCU delaying the
real destruction, but if RCU readers start after call_rcu() and before
...Jarek Poplawski writes:
Thanks,
Should be worth testing so we synchronize_rcu instead of doing call_rcu's
Cheers
--ro
> Alternatively here is a faster version with less synchronize_rcu().
>
> Jarek P.
>
> --- (take 2 - for testing)
>
> net/ipv4/fib_trie.c | 27 +++++++++++++++++++++------
> 1 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..2936b2e 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg)
> vfree(tn);
> }
>
> +static void __tnode_free(struct tnode *tn)
> +{
> + if (size <= PAGE_SIZE)
> + kfree(tn);
> + else
> + vfree(tn);
> +}
> +
> static void __tnode_free_rcu(struct rcu_head *head)
> {
> struct tnode *tn = container_of(head, struct tnode, rcu);
> @@ -402,7 +410,7 @@ static void tnode_free_flush(void)
> while ((tn = tnode_free_head)) {
> tnode_free_head = tn->tnode_free;
> tn->tnode_free = NULL;
> - tnode_free(tn);
> + __tnode_free(tn);
> }
> }
>
> @@ -1021,18 +1029,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
> (struct node *)tn, wasfull);
>
> tp = node_parent((struct node *) tn);
> - tnode_free_flush();
> if (!tp)
> break;
> tn = tp;
> }
>
> + if (tnode_free_head) {
> + synchronize_rcu();
> + tnode_free_flush();
> + }
> +
> /* Handle last (top) tnode */
> - if (IS_TNODE(tn))
> + if (IS_TNODE(tn)) {
> tn = (struct tnode *)resize(t, (struct tnode *)tn);
> -
> - rcu_assign_pointer(t->trie, (struct node *)tn);
> - tnode_free_flush();
> + rcu_assign_pointer(t->trie, (struct node *)tn);
> + synchronize_rcu();
> + tnode_free_flush();
> + } else {
> + rcu_assign_pointer(t->trie, (struct node *)tn);
> + }
>
> return;
> }
>
>
>
> ---
> commit ...Alas take 2 (nor 1) doesn't compile, so here it is again.
Thanks,
Jarek P.
--- (take 3 - for testing)
net/ipv4/fib_trie.c | 30 ++++++++++++++++++++++++------
1 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1a4c4b7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
vfree(tn);
}
+static void __tnode_free(struct tnode *tn)
+{
+ size_t size = sizeof(struct tnode) +
+ (sizeof(struct node *) << tn->bits);
+
+ if (size <= PAGE_SIZE)
+ kfree(tn);
+ else
+ vfree(tn);
+}
+
static void __tnode_free_rcu(struct rcu_head *head)
{
struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +413,7 @@ static void tnode_free_flush(void)
while ((tn = tnode_free_head)) {
tnode_free_head = tn->tnode_free;
tn->tnode_free = NULL;
- tnode_free(tn);
+ __tnode_free(tn);
}
}
@@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
(struct node *)tn, wasfull);
tp = node_parent((struct node *) tn);
- tnode_free_flush();
if (!tp)
break;
tn = tp;
}
+ if (tnode_free_head) {
+ synchronize_rcu();
+ tnode_free_flush();
+ }
+
/* Handle last (top) tnode */
- if (IS_TNODE(tn))
+ if (IS_TNODE(tn)) {
tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
- rcu_assign_pointer(t->trie, (struct node *)tn);
- tnode_free_flush();
+ rcu_assign_pointer(t->trie, (struct node *)tn);
+ synchronize_rcu();
+ tnode_free_flush();
+ } else {
+ rcu_assign_pointer(t->trie, (struct node *)tn);
+ }
return;
}
--
So the idea is to balance memory and latency, so that large changes (those affecting the root node) get at least one synchronize_rcu(), while smaller changes just use call_rcu(), correct? This means that the amount of memory awaiting an RCU grace period is limited, but the algorithm avoids per-node synchronize_rcu() overhead. If I understand the goal correctly, looks good! (Give or take my limited understanding of fib_trie and is usage, of course.) --
The goal is practically to replace all call_rcu() during trie_rebalance() with synchronize_rcu() (except some freeing after ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this memory after trie_rebalance() has finished, that's why there were problems with enabled preemption. So this patch tries to do/force this a bit earlier - at least before the top/largest node is rebalanced. Thanks, --
On the other hand, we could probably stay with call_rcu() plus only one synchronize_rcu() before the top node's resize() if you think it's enough here? Thanks, --
Well, my first task is to understand the problem/goal. ;-) My guess from what you said above is that use of call_rcu(), when combined with changes to the trie in rapid succession, is resulting in excessive memory awaiting a grace period. Is this the case, or am I confused? --
;-) In that case, simply invoking synchronize_rcu() every once and awhile should take care of things. This could be at the end of every large trie operation, or you could even count the call_rcu() invocations and do a synchronize_rcu() every 100th, 1,000th, or whatever, based on the amount of memory available. Thanx, Paul --
On Fri, Jun 26, 2009 at 10:05:38AM -0700, Paul E. McKenney wrote:
OK, for now the minimal change for testing (2.6.30 needs previously
mentioned two commits from 2.6.31-rc). (I guess I'll send it with a
changelog after net-next is opened.)
Thanks,
Jarek P.
--- (take 4 - for testing)
net/ipv4/fib_trie.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..98b31a1 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
{
int wasfull;
t_key cindex, key;
- struct tnode *tp;
+ struct tnode *tp, *oldtnode = tn;
key = tn->key;
@@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
}
/* Handle last (top) tnode */
- if (IS_TNODE(tn))
+ if (IS_TNODE(tn)) {
+ /* force memory freeing after last changes */
+ if (oldtnode != tn)
+ synchronize_rcu();
tn = (struct tnode *)resize(t, (struct tnode *)tn);
+ }
rcu_assign_pointer(t->trie, (struct node *)tn);
tnode_free_flush();
--
Alas, after rethinking, there is one detail which bothers me. Those largest allocs here are done with vmalloc and freed with RCU by schedule_work(). So, I wonder if just because of this, the previous version doing it directly isn't more reliable anyway. Of course, it's my bad I didn't point it while describing the problem earlier. (I knew I missed something...;-) Thanks, Jarek P. --
Yes looks like a good solution but maybe it safest to synchronize unconditionally?
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..9cb8623 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1028,8 +1028,11 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
}
/* Handle last (top) tnode */
- if (IS_TNODE(tn))
+ if (IS_TNODE(tn)) {
+ /* force memory freeing after last changes */
+ synchronize_rcu();
tn = (struct tnode *)resize(t, (struct tnode *)tn);
+ }
rcu_assign_pointer(t->trie, (struct node *)tn);
tnode_free_flush();
Cheers
--ro
Jarek Poplawski writes:
> net/ipv4/fib_trie.c | 8 ++++++--
> 1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..98b31a1 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
> {
> int wasfull;
> t_key cindex, key;
> - struct tnode *tp;
> + struct tnode *tp, *oldtnode = tn;
>
> key = tn->key;
>
> @@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
> }
>
> /* Handle last (top) tnode */
> - if (IS_TNODE(tn))
> + if (IS_TNODE(tn)) {
> + /* force memory freeing after last changes */
> + if (oldtnode != tn)
> + synchronize_rcu();
> tn = (struct tnode *)resize(t, (struct tnode *)tn);
> + }
>
> rcu_assign_pointer(t->trie, (struct node *)tn);
> tnode_free_flush();
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Hmm... I lost around half an hour for this doubt... Sure! (Unless there are some strange cases which very often create and destroy very small tables?) Thanks, --
...or maybe even only updating such small tables very often? Btw., Robert, I wondered about some design details lately, especially about pointer to a parent. I didn't see it in the basic docs, so it seems it could be avoided. It seems to be a problem with RCU, unless I miss something: if there were no going back from children to parents it seems we could free those "temporary" (created by inflate() and halve() and destroyed before resize() has finished) earlier. Another problem with this, it seems, are possibly false lookups (if we go back to the new parent which doesn't have it's parent or other nodes updated). So, was there so much performance gain to introduce this? Thanks, Jarek P. --
Robert, you and Eric pointed at memory problems, so I thought I missed
something. But after the second look I see "skipped node resize" should
show this, but it's always zero on these reports. So, isn't it possible
the current inflate_threshold_root is simply unreachable with some
conditions, at least within 10 loops?
Then these settable thresholds might be more useful here than memory
fixes, but here is some idea to try handle this automatically within
some limits. The patch below increases inflate_threshold_root (only)
up to ~50% of its initial value if needed, and should be able to go
back sometimes.
Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
some offsets.)
Thanks,
Jarek P.
---
net/ipv4/fib_trie.c | 23 ++++++++++++++++-------
1 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1dc1bb4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -318,6 +318,7 @@ static const int halve_threshold = 25;
static const int inflate_threshold = 50;
static const int halve_threshold_root = 8;
static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
static void __alias_free_mem(struct rcu_head *head)
@@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
/* Keep root node larger */
if (!tn->parent)
- inflate_threshold_use = inflate_threshold_root;
+ inflate_threshold_use = inflate_threshold_root +
+ inflate_threshold_root_fix;
else
inflate_threshold_use = inflate_threshold;
@@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
}
if (max_resize < 0) {
- if (!tn->parent)
- pr_warning("Fix inflate_threshold_root."
- " Now=%d size=%d bits\n",
- inflate_threshold_root, tn->bits);
- else
+ if (!tn->parent) {
+ if (inflate_threshold_root_fix * 2 <
+ ...On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
A tiny adjustment in the last if...
Jarek P.
--- (take 2)
net/ipv4/fib_trie.c | 23 ++++++++++++++++-------
1 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1dc1bb4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -318,6 +318,7 @@ static const int halve_threshold = 25;
static const int inflate_threshold = 50;
static const int halve_threshold_root = 8;
static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
static void __alias_free_mem(struct rcu_head *head)
@@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
/* Keep root node larger */
if (!tn->parent)
- inflate_threshold_use = inflate_threshold_root;
+ inflate_threshold_use = inflate_threshold_root +
+ inflate_threshold_root_fix;
else
inflate_threshold_use = inflate_threshold;
@@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
}
if (max_resize < 0) {
- if (!tn->parent)
- pr_warning("Fix inflate_threshold_root."
- " Now=%d size=%d bits\n",
- inflate_threshold_root, tn->bits);
- else
+ if (!tn->parent) {
+ if (inflate_threshold_root_fix * 2 <
+ inflate_threshold_root)
+ inflate_threshold_root_fix++;
+ else
+ pr_warning("Fix inflate_threshold_root."
+ " Now=%d size=%d bits fix=%d\n",
+ inflate_threshold_root, tn->bits,
+ inflate_threshold_root_fix);
+ } else {
pr_warning("Fix inflate_threshold."
" Now=%d size=%d bits\n",
inflate_threshold, tn->bits);
- }
+ }
+ } else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix)
+ inflate_threshold_root_fix--;
check_tnode(tn);
--
Thanks Jarek I apply this patch to 2.6.29.5 For some results we must wait to "rush hours" when there will be more traffic / routes :) Regards Paweł Staszewski --
When testing please monitor size of root node and and aver depth
Cheers
--ro
Jarek Poplawski writes:
> On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
> ...
> > Then these settable thresholds might be more useful here than memory
> > fixes, but here is some idea to try handle this automatically within
> > some limits. The patch below increases inflate_threshold_root (only)
> > up to ~50% of its initial value if needed, and should be able to go
> > back sometimes.
> >
> > Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
> > some offsets.)
>
> A tiny adjustment in the last if...
>
> Jarek P.
> --- (take 2)
>
> net/ipv4/fib_trie.c | 23 ++++++++++++++++-------
> 1 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..1dc1bb4 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -318,6 +318,7 @@ static const int halve_threshold = 25;
> static const int inflate_threshold = 50;
> static const int halve_threshold_root = 8;
> static const int inflate_threshold_root = 15;
> +static int inflate_threshold_root_fix;
>
>
> static void __alias_free_mem(struct rcu_head *head)
> @@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
> /* Keep root node larger */
>
> if (!tn->parent)
> - inflate_threshold_use = inflate_threshold_root;
> + inflate_threshold_use = inflate_threshold_root +
> + inflate_threshold_root_fix;
> else
> inflate_threshold_use = inflate_threshold;
>
> @@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
> }
>
> if (max_resize < 0) {
> - if (!tn->parent)
> - pr_warning("Fix inflate_threshold_root."
> - " Now=%d size=%d bits\n",
> - inflate_threshold_root, tn->bits);
> - else
> + if (!tn->parent) {
> + if ...Some fib_triestats - kernel.2.6.29.5 with first Jarek patch.
Dump every 10sec:
Mon Jun 29 11:54:31 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276978
Prefixes: 290448
Internal nodes: 66813
1: 34703 2: 13944 3: 9921 4: 4807 5: 2273 6: 1158 7: 5
9: 1 18: 1
Pointers: 691606
Null ptrs: 347816
Total size: 18403 kB
Counters:
---------
gets = 390981859
backtracks = 5332465
semantic match passed = 390452936
semantic match miss = 30198
null node hit= 375522207
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391017445
backtracks = 121012874
semantic match passed = 37565
semantic match miss = 0
null node hit= 261583
skipped node resize = 0
Mon Jun 29 11:54:41 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276976
Prefixes: 290446
Internal nodes: 66813
1: 34703 2: 13944 3: 9921 4: 4807 5: 2273 6: 1158 7: 5
9: 1 18: 1
Pointers: 691606
Null ptrs: 347818
Total size: 18403 kB
Counters:
---------
gets = 391061852
backtracks = 5334173
semantic match passed = 390532664
semantic match miss = 30199
null node hit= 375595706
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391097445
backtracks = 121039213
semantic match passed = 37570
semantic match miss = 0
null node hit= 261589
skipped node resize = ...Jarek Poplawski writes: > Robert, you and Eric pointed at memory problems, so I thought I missed > something. But after the second look I see "skipped node resize" should > show this, but it's always zero on these reports. So, isn't it possible > the current inflate_threshold_root is simply unreachable with some > conditions, at least within 10 loops? > > Then these settable thresholds might be more useful here than memory > fixes, but here is some idea to try handle this automatically within > some limits. The patch below increases inflate_threshold_root (only) > up to ~50% of its initial value if needed, and should be able to go > back sometimes. Yes we keep the old tnode size and the convergence interval was some of the concerns. That why this checks was added. Still we want to inflate the root node to a very max. So this approach with halving or doubling tnodes towards the root node was the suggest by Dyntree paper. I asked Stefan (one of the authors) if we could get safe and very offensive settings. But from what I understood there was no easy way to calculate this. So any bright ideas in this area are very welcome. But we should also monitor size of root and average tree depth so we don't take an to defensive approach just to solve this case. The memory patches and "manual RCU" are also interesting to address the case with PREEMTP's. Inserts and deletes are also very fast due to the flat tree so I think we can "slow down" this a bit if need to be safe with all PREEMPT's. Thanks for giving this area energy. Cheers --ro > Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with > some offsets.) > > Thanks, > Jarek P. > --- > > net/ipv4/fib_trie.c | 23 ++++++++++++++++------- > 1 files changed, 16 insertions(+), 7 deletions(-) > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index 012cf5a..1dc1bb4 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ ...
On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote: Yes, but with this offensive approach it seems the current level of warnings could be too alarming. Btw., because of a design flaw in my current patch this _fix variable, which should be logically per trie/ table, can be reset by changes of other tables now, but I think it all could be fine tuned more in the future. Of course if there are people interested in testing/reporting this more. Thanks, Jarek P. --
On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote:
Since 2.6.29 looks like prefered here, and there were a lot of takes
in this thread, I attach below a combined all-in-one patch including:
- 2.6.29 -> 2.6.30 preemption disable patch
- 2 RCU vs. preemption fixes from 2.6.31-rc
- "manual RCU" patch to force vfree/kfree before root's resize (take 3)
- "automatic" inflate_threshold_root fix (take 2)
Thanks,
Jarek P.
--- (for 2.6.29.x or even .28 or .27; any testing appreciated)
diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c 2009-06-27 20:25:06.000000000 +0200
+++ b/net/ipv4/fib_trie.c 2009-06-28 15:45:02.000000000 +0200
@@ -123,6 +123,7 @@ struct tnode {
union {
struct rcu_head rcu;
struct work_struct work;
+ struct tnode *tnode_free;
};
struct node *child[0];
};
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
static struct node *resize(struct trie *t, struct tnode *tn);
static struct tnode *inflate(struct trie *t, struct tnode *tn);
static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
static struct kmem_cache *fn_alias_kmem __read_mostly;
static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -315,6 +318,7 @@ static const int halve_threshold = 25;
static const int inflate_threshold = 50;
static const int halve_threshold_root = 8;
static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
static void __alias_free_mem(struct rcu_head *head)
@@ -363,6 +367,17 @@ static void __tnode_vfree(struct work_st
vfree(tn);
}
+static void __tnode_free(struct tnode *tn)
+{
+ size_t size = sizeof(struct tnode) +
+ (sizeof(struct node *) << tn->bits);
+
+ if (size <= PAGE_SIZE)
+ kfree(tn);
+ else
+ vfree(tn);
+}
+
static void __tnode_free_rcu(struct rcu_head *head)
{
struct tnode *tn = container_of(head, struct tnode, rcu);
@@ ...After 18 hours from apply first Jarek patch i have no info about Fix inflate_threshold_root even if i make: "clear ip bgp *" on router So i change Jarek patch from previous to this new one for test and we will see ... Regards Pawel Staszewski --
After apply this patch something is wrong Traffic is not forwarded no info in dmesg / no info from bgp and also i can't connect to bgpd process I revert kernel to past version with first Jarek patch --
Thank you very much, Pawel, for trying this. I'm starting to look for the reason. In the meantime try to get some fib_trie stats for Robert. ... --
To David Miller:
since among patches tested negatively by Pawel are current 2 fixes
from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise,
please withdraw them until they are tested alone. Thanks.
To Pawel:
Since checking this can take time I attach here a patch with only
changes which are currently in 2.6.31-rc. Of course, this part can be
broken as well, so it's up to you: if you could try it with caution
somewhere it would be very helpful; otherwise don't bother.
It could be applied to 2.6.29 with or without this currently working
patch.
Thanks,
Jarek P.
--- (for 2.6.29.x, .28 or .27)
diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c 2009-06-27 20:25:06.000000000 +0200
+++ b/net/ipv4/fib_trie.c 2009-06-28 23:06:02.000000000 +0200
@@ -123,6 +123,7 @@ struct tnode {
union {
struct rcu_head rcu;
struct work_struct work;
+ struct tnode *tnode_free;
};
struct node *child[0];
};
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
static struct node *resize(struct trie *t, struct tnode *tn);
static struct tnode *inflate(struct trie *t, struct tnode *tn);
static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
static struct kmem_cache *fn_alias_kmem __read_mostly;
static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
call_rcu(&tn->rcu, __tnode_free_rcu);
}
+static void tnode_free_safe(struct tnode *tn)
+{
+ BUG_ON(IS_LEAF(tn));
+ tn->tnode_free = tnode_free_head;
+ tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+ struct tnode *tn;
+
+ while ((tn = tnode_free_head)) {
+ tnode_free_head = tn->tnode_free;
+ tn->tnode_free = NULL;
+ tnode_free(tn);
+ }
+}
+
static struct leaf *leaf_new(void)
{
struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +516,7 @@ static ...Ok.
I applied this patch 15mins ago to 2.6.29.5 and now it's working -
traffic is forwarded.
Some fib_triestats
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.29
Max depth: 6
Leaves: 277015
Prefixes: 290493
Internal nodes: 67115
1: 35733 2: 13635 3: 9544 4: 4832 5: 2239 6: 1125 7: 5
9: 1 18: 1
Pointers: 686614
Null ptrs: 342485
Total size: 18396 kB
Counters:
---------
gets = 3956301
backtracks = 192497
semantic match passed = 3895955
semantic match miss = 133
null node hit= 4306948
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 3960981
backtracks = 2152441
semantic match passed = 4757
semantic match miss = 0
null node hit= 194997
--
But With all this patches i have the same problem with CPU load Every time when route cache entries are purged cpu load is increasing from 1% to 40 / 80% it depends I see that on 64bit machine when route cache entries are going down i have almost 80% load on each cpu where ethernet card is binded by smp_affinity But on 32bit machine cpu load reported by mpstat is half that on 64bit machine here is example from 32bit machine ( mpstat + rtstat -k entries ) Linux 2.6.29.5 (TM_02_C1) 06/29/09 _i686_ (2 CPU) 12:36:54 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle RT CACHE ENTRIES (from rtstat) 12:36:57 all 0.00 0.00 0.00 0.00 1.51 15.08 0.00 0.00 83.42 83346 12:36:58 all 0.00 0.00 0.00 0.00 1.01 7.58 0.00 0.00 91.41 85988 12:36:59 all 0.00 0.00 0.00 0.00 0.50 1.01 0.00 0.00 98.49 89979 12:37:00 all 0.00 0.00 0.50 0.00 0.00 1.51 0.00 0.00 97.99 93652 12:37:01 all 0.00 0.00 0.00 0.00 0.00 2.01 0.00 0.00 97.99 96533 12:37:02 all 0.00 0.00 0.00 0.00 0.51 1.01 0.00 0.00 98.48 99451 12:37:03 all 0.00 0.00 0.00 0.00 0.00 2.49 0.00 0.00 97.51 102018 12:37:04 all 0.00 0.00 0.00 0.00 0.00 1.52 0.00 0.00 98.48 104153 12:37:05 all 0.00 0.00 0.00 0.00 0.00 1.01 0.00 0.00 98.99 105979 12:37:06 all 0.00 0.00 0.00 0.00 0.00 1.01 0.00 0.00 98.99 107684 12:37:07 all 0.00 0.00 0.00 0.00 0.00 1.53 0.00 0.00 98.47 109070 12:37:08 all 0.00 0.00 0.00 0.00 0.00 1.51 0.00 0.00 98.49 110462 12:37:09 all 0.00 0.00 0.00 0.00 0.00 1.52 0.00 0.00 98.48 ...
I guess Eric is thinking about this. Btw., two little suggestions: it should be easier to track if these route cache reports stay in its starting thread ("weird problem"?), and if you could send these stats/logs as attachements or turn off line wrapping, please? ;-) Thanks, Jarek P. --
Sorry Jarek for combining problems :) And yes i will apply next stats in attachements :) --
David, IMHO this fix is needed in net-2.6 even if it doesn't fix the problem reported by Pawel (there could be still something more). Pawel, I see you decided to test my previous patch, but try to add this one on top. Thanks, Jarek P. -------------------> ipv4: Fix fib_trie rebalancing, part 3 Alas current delaying of freeing old tnodes by RCU in trie_rebalance is still not enough because we can free a top tnode before updating a t->trie pointer. Reported-by: Pawel Staszewski <pstaszewski@itcare.pl> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> --- net/ipv4/fib_trie.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 012cf5a..00a54b2 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -1021,6 +1021,9 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) (struct node *)tn, wasfull); tp = node_parent((struct node *) tn); + if (!tp) + rcu_assign_pointer(t->trie, (struct node *)tn); + tnode_free_flush(); if (!tp) break; --
I apply this patch fib_triestats in attached file :)
Great! But it would be nice to check if this (accidentally ;-) might
fix the previous problem, so I attach below the patch with "manual
RCU", which btw. (or even more important) should verify RCU use here.
It should be applied on top of this last "Fix..., part3". And
again: it's quite probable it can fail, so with caution, no hurry
(it can wait for quiet time)...
Many thanks,
Jarek P.
--------------------> (synchronize_rcu take 4)
diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c 2009-06-29 10:00:14.000000000 +0000
+++ b/net/ipv4/fib_trie.c 2009-06-29 10:04:22.000000000 +0000
@@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_st
vfree(tn);
}
+static void __tnode_free(struct tnode *tn)
+{
+ size_t size = sizeof(struct tnode) +
+ (sizeof(struct node *) << tn->bits);
+
+ if (size <= PAGE_SIZE)
+ kfree(tn);
+ else
+ vfree(tn);
+}
+
static void __tnode_free_rcu(struct rcu_head *head)
{
struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +413,7 @@ static void tnode_free_flush(void)
while ((tn = tnode_free_head)) {
tnode_free_head = tn->tnode_free;
tn->tnode_free = NULL;
- tnode_free(tn);
+ __tnode_free(tn);
}
}
@@ -1021,21 +1032,27 @@ static void trie_rebalance(struct trie *
(struct node *)tn, wasfull);
tp = node_parent((struct node *) tn);
- if (!tp)
+ if (!tp) {
rcu_assign_pointer(t->trie, (struct node *)tn);
-
- tnode_free_flush();
- if (!tp)
break;
+ }
tn = tp;
}
+ if (tnode_free_head) {
+ synchronize_rcu();
+ tnode_free_flush();
+ }
+
/* Handle last (top) tnode */
- if (IS_TNODE(tn))
+ if (IS_TNODE(tn)) {
tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
- rcu_assign_pointer(t->trie, (struct node *)tn);
- tnode_free_flush();
+ rcu_assign_pointer(t->trie, (struct node *)tn);
+ synchronize_rcu();
+ tnode_free_flush();
+ } else {
+ rcu_assign_pointer(t->trie, (struct node *)tn);
+ }
return;
...Jarek Poplawski pisze: > On Mon, Jun 29, 2009 at 11:51:52AM +0200, Pawe
OK, I'll look at it again. Thanks for testing! Jarek P. --
Pawel, here is another try to check what's going on here, so just like before, but this one on top of these 2 last working patches, plus quite time... (Stats aren't necessary; if these are some doubts let me know.) Thanks, Jarek P. --------------------> (synchronize_rcu take 5) diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c --- a/net/ipv4/fib_trie.c 2009-06-29 10:00:14.000000000 +0000 +++ b/net/ipv4/fib_trie.c 2009-06-30 06:50:35.000000000 +0000 @@ -1036,6 +1036,7 @@ static void trie_rebalance(struct trie * rcu_assign_pointer(t->trie, (struct node *)tn); tnode_free_flush(); + synchronize_rcu(); return; } --
A little comment: these last 2 patches weren't exactly to fix the problem you reported, which should be mostly fixed by the earlier patch. There is some other bug, which you omit with CONFIG_PREEMPT_NONE (but it's not for sure there is no by effects). So, I'd like to be sure you are willing and can (without too much risk) to do more such tests. Alas I've no way to generate similar conditions so it would simply have to wait for somebody else. Many thanks again, Jarek P. --
Yes i can make tests like this. My network is splited to test clients and other normal clients so it's really no problem to make testing. - if testing clients working then traffic from normal clients is also switched to this router (but if traffic is not forwarded "like in this case" for testing clients then failover switching them to working router ) and other point to make this tests - is that - it is good to have all in linux kernel networking working well :) Regards --
On Wed, Jul 01, 2009 at 01:31:09AM +0200, Paweł Staszewski wrote:
It's extremely nice of you! On the other hand, this type of change
was planned to the net-next to fix possible memory problems, which
might have happened to you as well. So you'd probably experience this
problem in the future (2.6.32) anyway.
So here is the first of 2 patches (the second in a separate message),
which should be tested separately, each one applied on top of the
2.6.29.x (vanilla - at least fib_trie.c), after reverting the previous
one. So, they are again all-in-one, to eclude any misunderstanding.
Btw., I assume there were no oopses, warnings or lockups after those
previous non-working patches - only no routing/forwarding.
Thanks,
Jarek P.
----------> (synchronize take 6 all-in-one for 2.6.29x, .28, or .27)
diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c 2009-06-29 05:30:50.000000000 +0000
+++ b/net/ipv4/fib_trie.c 2009-07-01 05:15:37.000000000 +0000
@@ -123,6 +123,7 @@ struct tnode {
union {
struct rcu_head rcu;
struct work_struct work;
+ struct tnode *tnode_free;
};
struct node *child[0];
};
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
static struct node *resize(struct trie *t, struct tnode *tn);
static struct tnode *inflate(struct trie *t, struct tnode *tn);
static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
static struct kmem_cache *fn_alias_kmem __read_mostly;
static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
call_rcu(&tn->rcu, __tnode_free_rcu);
}
+static void tnode_free_safe(struct tnode *tn)
+{
+ BUG_ON(IS_LEAF(tn));
+ tn->tnode_free = tnode_free_head;
+ tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+ struct tnode *tn;
+
+ while ((tn = tnode_free_head)) {
+ tnode_free_head = ...On Wed, Jul 01, 2009 at 06:36:51AM +0000, Jarek Poplawski wrote:
The second patch.
Jarek P.
----------> (synchronize take 7 all-in-one for 2.6.29x, .28, or .27)
diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c 2009-07-01 06:17:08.000000000 +0000
+++ b/net/ipv4/fib_trie.c 2009-07-01 05:27:02.000000000 +0000
@@ -123,6 +123,7 @@ struct tnode {
union {
struct rcu_head rcu;
struct work_struct work;
+ struct tnode *tnode_free;
};
struct node *child[0];
};
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
static struct node *resize(struct trie *t, struct tnode *tn);
static struct tnode *inflate(struct trie *t, struct tnode *tn);
static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
static struct kmem_cache *fn_alias_kmem __read_mostly;
static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
call_rcu(&tn->rcu, __tnode_free_rcu);
}
+static void tnode_free_safe(struct tnode *tn)
+{
+ BUG_ON(IS_LEAF(tn));
+ tn->tnode_free = tnode_free_head;
+ tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+ struct tnode *tn;
+
+ while ((tn = tnode_free_head)) {
+ tnode_free_head = tn->tnode_free;
+ tn->tnode_free = NULL;
+ tnode_free(tn);
+ }
+}
+
static struct leaf *leaf_new(void)
{
struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +516,7 @@ static struct node *resize(struct trie *
/* No children */
if (tn->empty_children == tnode_child_length(tn)) {
- tnode_free(tn);
+ tnode_free_safe(tn);
return NULL;
}
/* One child */
@@ -509,7 +530,7 @@ static struct node *resize(struct trie *
/* compress one level */
node_set_parent(n, NULL);
- tnode_free(tn);
+ tnode_free_safe(tn);
return n;
}
/*
@@ -670,7 +691,7 @@ static struct node *resize(struct trie *
...It looks like Cc was shortened BTW, but I guess at least Robert is interested in this testing, so I add him back. Cheers, --
Yes on on previous patches there was / no warnings / no oopses or lockups But now i apply this patch and i make more testing. First boot with start of bgpd and - traffic is not forwarded So i start to search and make only some routes (static without bgpd) thru this host And all is working for this host when i make all by static routes. So i change a little my bgp configuration and make default route to only one of my iBGP peers and start bgpd process All is working and what is weird is number of routes in kernel table. Kernel is learning routes from bgpd but very slowly - really very slowly. In attached file there are some fib_triestats after 5min of traffic. Without this patch (normally) total size: reported by fib_triestats in less that 1sec is: "Total size: 35769 kB" But with this patch Total size is growing up and in 5 min of traffic it grow to only: "Total size: 1005 kB" Regards --
On Wed, Jul 01, 2009 at 11:43:04AM +0200, Paweł Staszewski wrote: Pawel, this is really very helpful! So, this is (probably) only about timing, not wrong memory freeing. On the other hand this test was only for inserts. Btw., if you didn't start the second test, you can skip it. I have to rethink this. Many thanks, Jarek P. --
So, after your findings I'm about to recommend sending to -stable
3 patches from net-2.6, with additional lowering of threshold_root
settings, but it would be nice if you could give it a try with
CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
your other apps!) It is expected to work this time...;-) Maybe a
bit slower.
Thanks,
Jarek P.
--------> (all-in-one preempt fixes to apply with vanilla 2.6.29.x)
diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c 2009-07-01 06:17:08.000000000 +0000
+++ b/net/ipv4/fib_trie.c 2009-07-01 10:43:44.000000000 +0000
@@ -123,6 +123,7 @@ struct tnode {
union {
struct rcu_head rcu;
struct work_struct work;
+ struct tnode *tnode_free;
};
struct node *child[0];
};
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
static struct node *resize(struct trie *t, struct tnode *tn);
static struct tnode *inflate(struct trie *t, struct tnode *tn);
static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
static struct kmem_cache *fn_alias_kmem __read_mostly;
static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -313,8 +316,8 @@ static inline void check_tnode(const str
static const int halve_threshold = 25;
static const int inflate_threshold = 50;
-static const int halve_threshold_root = 8;
-static const int inflate_threshold_root = 15;
+static const int halve_threshold_root = 15;
+static const int inflate_threshold_root = 25;
static void __alias_free_mem(struct rcu_head *head)
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
call_rcu(&tn->rcu, __tnode_free_rcu);
}
+static void tnode_free_safe(struct tnode *tn)
+{
+ BUG_ON(IS_LEAF(tn));
+ tn->tnode_free = tnode_free_head;
+ tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+ struct tnode *tn;
+
+ while ((tn = tnode_free_head)) {
+ tnode_free_head = ...>> On Wed, Jul 01, 2009 at 11:43:04AM +0200, Pawe
Hmm... It should, because you tested very similar patch already;-) It could probably matter only if you're using some broken out-of-tree patches. Otherwise the kernel is expected to work OK. Btw., it would be also interesting to check if there is any difference wrt. these route cache problems while PREEMPT is enabled. Thanks, Jarek P. --
Yes i know there was almost identical one. Im a little confused about using of PREEMPT kernel because of past there was many oopses / lockups :) but yes that was a little long time ago. --
Yes, it looks like we can't free memory so simple because of such huge And you're very right! The place we're fixing is the best example. On the other hand, I hope there is not many such places yet. But if we test/fix it there will be one less... Jarek P. --
Jarek Poplawski writes: > Yes, it looks like we can't free memory so simple because of such huge > latencies. Controlling RCU seems crucial. Insertion of the full BGP table increased from 2 seconds to > 20 min with one synchronize_rcu patches. And fib_trie "worst case" wrt memory is the root node. So maybe we should monitor changes in root node and use this to control synchronize_rcu. Didn't Paul suggest something like this? And with don't find any decent solution we have to add an option for a fixed and pre-allocated root-nod typically for BGP-routers. Cheers --ro --
I wish I knew this a few days before. I could imagine a slow down, but it looked like it was stuck. Since these last changes weren't tested on SMP + PREEMPT I thought there is still something broken. (I was mainly interested in this synchronize_rcu at the moment as Sure, and it needs testing, but we should send some safe preemption Probably you're right; I'd prefer to see the test results showing a difference vs. simply less aggressive root thresholds. But of course, even if not convinced, I'll respect your choice as the author and maintainer, so feel free to NAK my proposals - I won't get it personally.;-) Cheers, Jarek P. --
Jarek Poplawski writes: > > Controlling RCU seems crucial. Insertion of the full BGP table increased > > from 2 seconds to > 20 min with one synchronize_rcu patches. > > I wish I knew this a few days before. I could imagine a slow down, > but it looked like it was stuck. Since these last changes weren't > tested on SMP + PREEMPT I thought there is still something broken. > (I was mainly interested in this synchronize_rcu at the moment as > a preemption test.) Honestly this huge slowdown was surprise for me too. I think I sent you a script so you could insert the full table yourself. > > And fib_trie "worst case" wrt memory is the root node. So maybe we should > > monitor changes in root node and use this to control synchronize_rcu. > > > > Didn't Paul suggest something like this? > > Sure, and it needs testing, but we should send some safe preemption > fix for -stable first, don't we? Yes my hope was that we could combine them... personally I'll need to understand who we can preeemted better in the different configs and most of that this can be handled by "standard" RCU. > > And with don't find any decent solution we have to add an option for > > a fixed and pre-allocated root-nod typically for BGP-routers. > > Probably you're right; I'd prefer to see the test results showing > a difference vs. simply less aggressive root thresholds. But of > course, even if not convinced, I'll respect your choice as the author > and maintainer, so feel free to NAK my proposals - I won't get it > personally.;-) Thresholds we can change no problem... but very soon I'll people will start routing without the route cache this at least in close to Internet core ,we will need all fib_look performance we can get. fib_trie was designed for classical RCU and no preempt you see the names i file... so this new and very challenging work to all of us. First week of vacation and have to fix the roof of the house... it's hot and ...
I can't remember this script, but I guess my hardware should be I mean changing thresholds as a temporary solution, until we can control memory freeing; and it seems to me, even excluding the root node, there could be a lot of temporary allocations during all those Have a nice time, Jarek P. --
Ok kernel configured with CONFIG_PREEMPT and all this day work without any problems (with Jarek last patch). So in attached file trere is fib_tirestats I dont see any big change of (cpu load or faster/slower routing/propagating routes from bgpd or something else) - in avg there is from 2% to 3% more of CPU load i dont know why but it is - i change from "preempt" to "no preempt" 3 times and check this my "mpstat -P ALL 1 30" always avg cpu load was from 2 to 3% more compared to "no preempt" Regards --
Oh I forgot - please Jarek give me patch with sync rcu and i will make test on preempt kernel Thanks Paweł Staszewski --
Probably non-preempt kernel might need something like this more, but
comparing is always interesting. This patch is based on Paul's
suggestion (I hope).
Thanks,
Jarek P.
---> (synchronize take 7; apply on top of the 2.6.29.x with the last
all-in-one patch, or net-2.6)
net/ipv4/fib_trie.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..fce8238 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -164,6 +164,7 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
static struct tnode *halve(struct trie *t, struct tnode *tn);
/* tnodes to free after resize(); protected by RTNL */
static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
static struct kmem_cache *fn_alias_kmem __read_mostly;
static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -393,6 +394,8 @@ static void tnode_free_safe(struct tnode *tn)
BUG_ON(IS_LEAF(tn));
tn->tnode_free = tnode_free_head;
tnode_free_head = tn;
+ tnode_free_size += sizeof(struct tnode) +
+ (sizeof(struct node *) << tn->bits);
}
static void tnode_free_flush(void)
@@ -404,6 +407,11 @@ static void tnode_free_flush(void)
tn->tnode_free = NULL;
tnode_free(tn);
}
+
+ if (tnode_free_size >= PAGE_SIZE * 128) {
+ tnode_free_size = 0;
+ synchronize_rcu();
+ }
}
static struct leaf *leaf_new(void)
--
Hold on ;-) Here is something even better... Syncing after 128 pages
might be still too slow, so here is a higher initial value, 1000, plus
you can change this while testing in:
/sys/module/fib_trie/parameters/sync_pages
It would be interesting to find the lowest acceptable value.
Jarek P.
---> (synchronize take 8; apply on top of the 2.6.29.x with the last
all-in-one patch, or net-2.6)
net/ipv4/fib_trie.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..decc8d0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -71,6 +71,7 @@
#include <linux/netlink.h>
#include <linux/init.h>
#include <linux/list.h>
+#include <linux/moduleparam.h>
#include <net/net_namespace.h>
#include <net/ip.h>
#include <net/protocol.h>
@@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
static struct tnode *halve(struct trie *t, struct tnode *tn);
/* tnodes to free after resize(); protected by RTNL */
static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
+
+static int sync_pages __read_mostly = 1000;
+module_param(sync_pages, int, 0640);
static struct kmem_cache *fn_alias_kmem __read_mostly;
static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn)
BUG_ON(IS_LEAF(tn));
tn->tnode_free = tnode_free_head;
tnode_free_head = tn;
+ tnode_free_size += sizeof(struct tnode) +
+ (sizeof(struct node *) << tn->bits);
}
static void tnode_free_flush(void)
@@ -404,6 +411,11 @@ static void tnode_free_flush(void)
tn->tnode_free = NULL;
tnode_free(tn);
}
+
+ if (tnode_free_size >= PAGE_SIZE * sync_pages) {
+ tnode_free_size = 0;
+ synchronize_rcu();
+ }
}
static struct leaf *leaf_new(void)
--
Looks like a promising approach to me! --
Hmm... As a matter of fact, I'm a bit sceptical now: I'm worrying this synchronize_rcu done at the lowest acceptable rate could be actually mostly idle or on the contrary too late. Probably some more complex (per cpu?) accounting would be necessary to really matter here, but on the other hand these problems weren't reported often enough. Thanks, --
kernel 2.6.29.5 preempt
bgp starts normal and kernel know routes normaly like without patch
Here are some fib_triestats
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.44
Max depth: 6
Leaves: 277888
Prefixes: 291399
Internal nodes: 66818
1: 33080 2: 14584 3: 10788 4: 4911 5: 2185 6: 900 7:
366 8: 3 17: 1
Pointers: 595584
Null ptrs: 250879
Total size: 18072 kB
Counters:
---------
gets = 1052940
backtracks = 55985
semantic match passed = 1034114
semantic match miss = 5
null node hit= 534415
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 1057636
backtracks = 1101307
semantic match passed = 4751
semantic match miss = 0
null node hit= 195605
skipped node resize = 0
kernel 2.6.29.5 no-preempt
All is ok like with preempt kernel (andl all working in normal time
"routes propagation")
cat /sys/module/fib_trie/parameters/sync_pages
1000
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.45
Max depth: 6
Leaves: 277905
Prefixes: 291416
Internal nodes: 66863
1: 33119 2: 14594 3: 10782 4: 4911 5: 2187 6: 901 7:
365 8: 3 17: 1
Pointers: 595654
Null ptrs: 250887
Total size: 18074 kB
Counters:
---------
gets = 1060650
backtracks = 53161
semantic match passed = 1041008
semantic match miss = 12
null node hit= 504478
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: ...On Mon, Jul 06, 2009 at 01:53:49AM +0200, Paweł Staszewski wrote: 14 == 10!? ;-) Hmm... So, it's better than I expected; syncing after 128 or 256 pages could be quite reasonable. But then it would be interesting to find out if with such a safety we could go back to more aggressive values for possibly better performance. So here is 'the same' patch (so the previous, take 8, should be reverted), but with additional possibility to change: /sys/module/fib_trie/parameters/inflate_threshold_root I guess, you could try e.g. if: sync_pages 256, inflate_threshold_root 15 can give faster lookups (or lower cpu loads); with this these inflate warnings could be back btw.; or maybe you'll find something in between like inflate_threshold_root 20 is optimal for you. (I think it should be enough to try this only for PREEMPT_NONE unless you have spare time ;-) Thanks, Jarek P. ---> (synchronize take 9; apply on top of the 2.6.29.x with the last all-in-one patch, or net-2.6) net/ipv4/fib_trie.c | 18 ++++++++++++++++-- 1 files changed, 16 insertions(+), 2 deletions(-) diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 00a54b2..e8fca11 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -71,6 +71,7 @@ #include <linux/netlink.h> #include <linux/init.h> #include <linux/list.h> +#include <linux/moduleparam.h> #include <net/net_namespace.h> #include <net/ip.h> #include <net/protocol.h> @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn); static struct tnode *halve(struct trie *t, struct tnode *tn); /* tnodes to free after resize(); protected by RTNL */ static struct tnode *tnode_free_head; +static size_t tnode_free_size; + +static int sync_pages __read_mostly = 1000; +module_param(sync_pages, int, 0640); static struct kmem_cache *fn_alias_kmem __read_mostly; static struct kmem_cache *trie_leaf_kmem __read_mostly; @@ -316,9 +321,11 @@ static inline void check_tnode(const struct tnode *tn) static const int ...
Applied to 2.6.29.5 preempt/no-preempt and tested: - with preempt i make only one test with sync_pages = 256 to check that is working :) So here are some tests for different sync_pages size. echo 1 > /sys/module/fib_trie/parameters/sync_pages I stop count after 1minute - total size still rising :) echo 2 > /sys/module/fib_trie/parameters/sync_pages Total size in fib_triestats reach maximum in 33sec echo 3 > /sys/module/fib_trie/parameters/sync_pages Total size reach max in 31sec echo 4 > /sys/module/fib_trie/parameters/sync_pages Total size reach max in 23sec echo 8 > /sys/module/fib_trie/parameters/sync_pages Total size reach max in 17sec echo 16 > /sys/module/fib_trie/parameters/sync_pages Total size reach max in 14 sec echo 32 > /sys/module/fib_trie/parameters/sync_pages Total size reach max in 14 sec So i see in prev tests i make something wrong in time counting So i modify test script and make tests again: echo 64 > /sys/module/fib_trie/parameters/sync_pages Total size reach max in 13 sec echo 128 > /sys/module/fib_trie/parameters/sync_pages Total size reach max in 10 sec echo 256 > /sys/module/fib_trie/parameters/sync_pages Total size reach max in 10 sec And for sync_paqges >256 time for propagating routes is always 10sec. Also today i have many messages in dmesg like this: And after tune : /sys/module/fib_trie/parameters/inflate_threshold_root no more info :) Regards --
This is something new and a bit surprising to me: the same threshold in previous tests didn't generate this? Do you mean more than: With what value? Pawel, let's say that current defaults are: inflate_threshold_root 25 sync_pages 256 I'd like you to try to check if e.g.: inflate_threshold_root 15 sync_pages 256 can give you any visible or subjective difference worth tweaking it at all? (These stats from the next messages don't show this enough.) You don't need to hurry with this... Thanks, Jarek P. --
Yes. Sorry for that - this info was not all the day but only 5 minutes when i was making tests. When i set 35 as inflate_threshold_root there was no info even if all iBGP peers was down/up. But i start to search when i have info about "Fix inflate_threshold_root" And i see that the best is set this to 20 for me i have no info then in normal router operation / without down/up bgp peers many times in short I will try to make more accurate tests in weekend. Regards --
So it looks like the patch tested earlier could be still useful; after
changing the inflate_threshold_root it seems these warnings should be
very rare but there is no reason to alarm users with something they
can't fix optimally, anyway.
Thanks,
Jarek P.
--------------------->
ipv4: Fix inflate_threshold_root automatically
During large updates there could be triggered warnings like: "Fix
inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
node isn't finished in 10 loops. It should be much rarer now, after
changing the threshold from 15 to 25, and a temporary problem, so
this patch tries to handle it automatically using a fix variable to
increase by one inflate threshold for next root resizes (up to the 35
limit, max fix = 10). The fix variable is decreased when root's
inflate() finishes below 7 loops (even if some other, smaller table/
trie is updated -- for simplicity the fix variable is global for now).
Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c 2009-07-13 13:32:53.000000000 +0200
+++ b/net/ipv4/fib_trie.c 2009-07-13 15:16:18.000000000 +0200
@@ -327,6 +327,8 @@ static const int inflate_threshold = 50;
static const int halve_threshold_root = 15;
static const int inflate_threshold_root = 25;
+static int inflate_threshold_root_fix;
+#define INFLATE_FIX_MAX 10 /* a comment in resize() */
static void __alias_free_mem(struct rcu_head *head)
{
@@ -617,7 +619,8 @@ static struct node *resize(struct trie *
/* Keep root node larger */
if (!tn->parent)
- inflate_threshold_use = inflate_threshold_root;
+ inflate_threshold_use = inflate_threshold_root +
+ inflate_threshold_root_fix;
else
inflate_threshold_use = inflate_threshold;
@@ -641,15 +644,27 @@ static struct node ...Jarek Poplawski writes: Looks good. Maybe we're getting close to some generic solution to take a very optimistic approach wrt thresholds for root node and adjust to settings without the warning. Or maybe now even remove warning totally with stata counter? Can we even consider some other different strategy for bumping up the root node. We need all lookup performance we can get when we now try to route without the route cache. And we probably need to evaluate the cost for the multiple lookups again at least for LOCAL and MAIN when we talking routing well at least straight-forward simple routing. (Semantic change) I think I've got ~6.2 Gbit/s for simplex forwarding using traffic patterns we see in/close to Internet core. This w/o route cache on our hi-end opterons with 8 CPU cores using niu and ixgbe. I'll test again and your patches when I'm back from vacation. Cheers --ro > So it looks like the patch tested earlier could be still useful; after > changing the inflate_threshold_root it seems these warnings should be > very rare but there is no reason to alarm users with something they > can't fix optimally, anyway. > > Thanks, > Jarek P. > ---------------------> > ipv4: Fix inflate_threshold_root automatically > > During large updates there could be triggered warnings like: "Fix > inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root > node isn't finished in 10 loops. It should be much rarer now, after > changing the threshold from 15 to 25, and a temporary problem, so > this patch tries to handle it automatically using a fix variable to > increase by one inflate threshold for next root resizes (up to the 35 > limit, max fix = 10). The fix variable is decreased when root's > inflate() finishes below 7 loops (even if some other, smaller table/ > trie is updated -- for simplicity the fix variable is global for now). > > Reported-by: Pawel Staszewski <pstaszewski@itcare.pl> > Reported-by: Jorge Boncompte ...
I guess, we could, but maybe let's wait a bit to make sure there is Sure, I was mainly aiming at safe defaults (wrt. memory usage), but if tests show there is a better strategy we should go for it. Thanks, Jarek P. --
Jarek Poplawski writes: > On Wed, Jul 15, 2009 at 09:43:11AM +0200, Robert Olsson wrote: > > a very optimistic approach wrt thresholds for root node and adjust to > > settings without the warning. Or maybe now even remove warning totally > > with stata counter? > > I guess, we could, but maybe let's wait a bit to make sure there is > nothing surprising? Yes if Pawel is running it we we'll get reports. I've no chance to upgrade any of our routers now. I've seen this printout in one our routers but we don't do "clear ip bgp *" to often and besides we try to use soft re- configuration inbound. > > I think I've got ~6.2 Gbit/s for simplex forwarding using traffic patterns > > we see in/close to Internet core. This w/o route cache on our hi-end opterons > > with 8 CPU cores using niu and ixgbe. I'll test again and your patches when > > I'm back from vacation. > > > Sure, I was mainly aiming at safe defaults (wrt. memory usage), but if > tests show there is a better strategy we should go for it. Routing without route cache is "new" area probably for minority of systems were caching is not possible. Read BGP routers in core. Yes we should have safe defults. Thanks for all your work. Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Cheers --ro --
From: Jarek Poplawski <jarkao2@gmail.com> Applied. --
Jarek Poplawski pisze: > On Mon, Jul 06, 2009 at 01:53:49AM +0200, Pawe
i forgot to add: Traffic when i make test was +/- 10Mbit/s in next tests: eth0: RX: 231.21 Mb/s TX: 287.40 Mb/s --
Below is a simpler version of this patch, without the sysfs parameter. ------------------------> ipv4: Use synchronize_rcu() during trie_rebalance() During trie_rebalance() we free memory after resizing with call_rcu(), but large updates, especially with PREEMPT_NONE configs, can cause memory stresses, so this patch calls synchronize_rcu() in tnode_free_flush() after each sync_pages to guarantee such freeing (especially before resizing the root node). The value of sync_pages = 128 is based on Pawel Staszewski's tests as the lowest which doesn't hinder updating times. (For testing purposes there was a sysfs module parameter to change it on demand, but it's removed until we're sure it could be really useful.) The patch is based on suggestions by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> --- net/ipv4/fib_trie.c | 15 +++++++++++++++ 1 files changed, 15 insertions(+), 0 deletions(-) diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 63c2fa7..58ba9f4 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -164,6 +164,14 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn); static struct tnode *halve(struct trie *t, struct tnode *tn); /* tnodes to free after resize(); protected by RTNL */ static struct tnode *tnode_free_head; +static size_t tnode_free_size; + +/* + * synchronize_rcu after call_rcu for that many pages; it should be especially + * useful before resizing the root node with PREEMPT_NONE configs; the value was + * obtained experimentally, aiming to avoid visible slowdown. + */ +static const int sync_pages = 128; static struct kmem_cache *fn_alias_kmem __read_mostly; static struct kmem_cache *trie_leaf_kmem __read_mostly; @@ -393,6 +401,8 @@ static void tnode_free_safe(struct tnode *tn) BUG_ON(IS_LEAF(tn)); tn->tnode_free = tnode_free_head; ...
From: Jarek Poplawski <jarkao2@gmail.com> Applied. --
While looking for other fib_trie problems reported by Pawel Staszewski
I noticed there are a few uses of tnode_get_child() and node_parent()
in lookups instead of their rcu versions.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
(this patch was prepared on top of my 2 today's fib_trie patches)
diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c 2009-07-14 20:40:39.000000000 +0200
+++ b/net/ipv4/fib_trie.c 2009-07-14 22:41:26.000000000 +0200
@@ -1465,7 +1465,7 @@ static int fn_trie_lookup(struct fib_tab
cindex = tkey_extract_bits(mask_pfx(key, current_prefix_length),
pos, bits);
- n = tnode_get_child(pn, cindex);
+ n = tnode_get_child_rcu(pn, cindex);
if (n == NULL) {
#ifdef CONFIG_IP_FIB_TRIE_STATS
@@ -1600,7 +1600,7 @@ backtrace:
if (chopped_off <= pn->bits) {
cindex &= ~(1 << (chopped_off-1));
} else {
- struct tnode *parent = node_parent((struct node *) pn);
+ struct tnode *parent = node_parent_rcu((struct node *) pn);
if (!parent)
goto failed;
@@ -1813,7 +1813,7 @@ static struct leaf *trie_firstleaf(struc
static struct leaf *trie_nextleaf(struct leaf *l)
{
struct node *c = (struct node *) l;
- struct tnode *p = node_parent(c);
+ struct tnode *p = node_parent_rcu(c);
if (!p)
return NULL; /* trie with just one leaf */
--
From: Jarek Poplawski <jarkao2@gmail.com> Applied. --
David & Robert,
below are my recommendations for -stable plus one more patch:
On Sun, Jul 05, 2009 at 02:26:54AM +0200, Paweł Staszewski wrote:
So after these patches from net-2.6 are tested both for PREEMPT and
PREEMPT_NONE I think they should go to -stable:
2.6.30 needs:
-------------
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon Jun 15 02:31:29 2009 -0700
ipv4: Fix fib_trie rebalancing
commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu Jun 18 00:28:51 2009 -0700
ipv4: Fix fib_trie rebalancing, part 2
commit 008440e3ad4b72f5048d1b1f6f5ed894fdc5ad08
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue Jun 30 12:47:19 2009 -0700
ipv4: Fix fib_trie rebalancing, part 3
plus the new patch below
ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)
2.6.29 needs:
-------------
this patch from 2.6.30:
commit 3ed18d76d959e5cbfa5d70c8f7ba95476582a556
Author: Robert Olsson <robert.olsson@its.uu.se>
Date: Thu May 21 15:20:59 2009 -0700
ipv4: Fix oops with FIB_TRIE
plus above mentionned patches for 2.6.30 (part 1 - 4)
-----------------
David, if possible, please add to all these "Fix... part 1 - 4":
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
This new patch below is intended only for -stable (and later for
net-next), because it doesn't meet rules of the current -rc. Anyway,
it's not critical (but it actually fixes a regression from 2.6.22).
Thanks,
Jarek P.
---------------->
ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)
Pawel Staszewski wrote:
<blockquote>
Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648
and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
...
Fix inflate_threshold_root. Now=15 size=11 bits
cat /proc/net/fib_triestat
Basic ...(Take 2: Changelog spelling fixes, sorry.)
David & Robert,
below are my recommendations for -stable plus one more patch:
On Sun, Jul 05, 2009 at 02:26:54AM +0200, Paweł Staszewski wrote:
So after these patches from net-2.6 are tested both for PREEMPT and
PREEMPT_NONE I think they should go to -stable:
2.6.30 needs:
-------------
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon Jun 15 02:31:29 2009 -0700
ipv4: Fix fib_trie rebalancing
commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu Jun 18 00:28:51 2009 -0700
ipv4: Fix fib_trie rebalancing, part 2
commit 008440e3ad4b72f5048d1b1f6f5ed894fdc5ad08
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue Jun 30 12:47:19 2009 -0700
ipv4: Fix fib_trie rebalancing, part 3
plus the new patch below
ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)
2.6.29 needs:
-------------
this patch from 2.6.30:
commit 3ed18d76d959e5cbfa5d70c8f7ba95476582a556
Author: Robert Olsson <robert.olsson@its.uu.se>
Date: Thu May 21 15:20:59 2009 -0700
ipv4: Fix oops with FIB_TRIE
plus above mentionned patches for 2.6.30 (part 1 - 4)
-----------------
David, if possible, please add to all these "Fix... part 1 - 4":
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
This new patch below is intended only for -stable (and later for
net-next), because it doesn't meet rules of the current -rc. Anyway,
it's not critical (but it actually fixes a regression from 2.6.22).
Thanks,
Jarek P.
---------------->
ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)
Pawel Staszewski wrote:
<blockquote>
Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648
and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
...
Fix inflate_threshold_root. Now=15 size=11 ...From: Jarek Poplawski <jarkao2@gmail.com> I think if we' re going to toss this into -stable, we should put it into net-2.6 too, and that's what I'm going to do. Once this makes it's way to Linus I'll work on the -stable submissions. And I'll make sure to add the tested-by tags, as you mentioned. Thanks! --
It's your decision: I don't think this patch is worth any arguing about (de)stabilizing. Btw., since -stable rules are less strict it seems natural such patches with bug fixes should rather go net-next -> -stable way, unless I miss something? Thanks, Jarek P. --
David, I guess you could add: Tested-by: Pawel Staszewski <pstaszewski@itcare.pl> Thanks, Jarek P. --
From: Jarek Poplawski <jarkao2@gmail.com> Done, and applied, thanks Jarek. --
Btw., a little comment: there are still some issues while trying to reclaim memory after synchronize_rcu, which means the algorithm is buggy, or RCU use is still buggy, or maybe some timing because of synchronize_rcu. Anyway, fib_trie still seems to be safe only with CONFIG_PREEMPT_NONE, so I have no idea how this should be fixed in -stables (or why people don't report more this BUG in 2.6.30)... Thanks, Jarek P. --
On Tue, 30 Jun 2009 12:48:49 -0700 (PDT) This is probably in kernel bugzilla as well, so someone should update: http://bugzilla.kernel.org/show_bug.cgi?id=6648 -- --
