Hey all-
We currently have the ability to disable our route cache secret interval
rebuild timer (by setting it to zero), but if we do that its possible for an
attacker (if they guess our route cache hash secret, to fill our system with
routes that all hash to the same bucket, destroying our performance. This patch
provides a backstop for that issues. In the event that our rebuild interval is
disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
an emergency hash rebuild. During the hash rebuild we:
1) warn the user of the emergency
2) disable the rebuild timer
3) invalidate the route caches
4) re-enable the rebuild timer with its old value
Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
route.c | 36 +++++++++++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..b95e02a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -145,7 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
-
+static void rt_emergency_hash_rebuild(void);
static struct dst_ops ipv4_dst_ops = {
.family = AF_INET,
@@ -1056,6 +1056,15 @@ restart:
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
+ } else if (chain_length > ip_rt_gc_elasticity) {
+ /*
+ * We didn't find a route entry we could safely free,
+ * yet our chain length is over our elasticity value.
+ * Someone may have guessed our hash secret and is artifically
+ * filling up our route cache. Lets do an emergency rebuild
+ * to be safe
+ */
+ rt_emergency_hash_rebuild();
}
/* Try to bind route to arp only if it is output
@@ -2946,6 +2955,31 @@ static void rt_secret_reschedule(int old)
rtnl_unlock();
}
+static void rt_secret_rebuild_oneshot(void) ...This sounds not good at all to me.
1) Dont set ip_rt_secret_interval to zero, this is plain silly, since
you give attackers infinite time to break your machine.
To quote Herbert (who allowed to set this interval to 0)
"Let me first state that disabling the route cache hash rebuild
should not be done without extensive analysis on the risk profile
and careful deliberation.
However, there are times when this can be done safely or for
testing. For example, when you have mechanisms for ensuring
that offending parties do not exist in your network."
2) Many machines have ip_rt_gc_elasticity set to 2,
--
Thats really rather the motivation behind this. The patch that Herbert submitted with that commit explicitly lets one disable their rebuild timer. I agree its stupid to do that, but we added code to allow it. This provides a patch to help people who are victimized because they've done exactly this Ok, that seem reasonable, and this isn't going to disallow that. By the same resoning, people who have huge hash tables, and low chain depths won't want their low chain length being violated, would they? This patch will warn them if their assumptions are being violated. Neil -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
Strange... many kernel parameters can be set to hazardous values that make machine unusable...
Warn only ? If I read your patch, you not only warn in this case.
(you invalidate cache for each struct net, potentially wraping rt_genid)
When you have 2^20 slots in route cache hash table, you dont care if few slots have 3 or 4 elements.
And chance is very high that more than one slot has 3 or even 4 elements, no need for an attacker.
Now if you change your code to something like
if (unlikely(chain_length > some_quite_big_number &&
ip_rt_secret_interval == 0)) {
do_something();
}
some_quite_big_number could be >= 30 or something...
then it might be OK (at least it wont break common setups)
--
If you want to talk philosophy, then I accept your premise: we can tune the kernel in ways that make it unusable. Does that mean we should avoid doing things to prevent that and maintain usibility. Ithink this patch accomplishes If you overflow your elasticity 2^16 times, yes (I think rt_genid is a 16 bit value, but I don't have the kernel in front of me). I would hope that would be Ok, then I would assert if your ok with a few chains getting to be X entries in I'd be ok with that, if some others chimed in with consensus on the matter. I felt that we had a value definining what chain length should be (ip_rt_gc_elasticity is already used in comparison on chain length in rt_intern_hash, so I was keeping with that). But if some others speak up and I don't think it will break many setups, the default value is 8 for this sysctl. If someone has set it lower, and sees a warning, they won't loose their system altogether, they'll just see a momentary reduction in throughput, and a warning to increase the value they have set. Its going to far to say anything will 'break'. Thanks & Regards -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
Nope. We set elasticity to 2 when we want to trigger code starting at line 1048
if (cand) {
if (chain_length > ip_rt_gc_elasticity) {
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
}
That is, freeing one entry if chain length is > 2 AND one entry is freeable.
This means that *some* chains eventually can have live entries and length > 2
without any problem. This is not an emergency case. I have seen on real
machines some chains hitting length of 13 (elasticity=2), that was normal traffic,
and rt cache was on equilibrium.
Your patch adds to the "if (chain_length > elasticity)" case :
If no entry is freeable in this slot, invalidate *all* cache and put a warning.
Invalidating live entries is puting a high pressure on dst_garbage.list.
Having 1.000.000 entries in this list is not very cool, and should be done
only if really necessary.
When you know you have about 1.000.000 live entries in rt cache, you
can safely make your hash table sized to 2^20 slots, and elasticity set to 2
so that average chain length is <= 2, reducing cache misses at lookup time.
I said 30, but could have said 100. No need for a sysctl.
If only one chain is really that long (and attacked), all its entries are hot.
I spent many days so that route cache doesnt crash some big machines I have around,
I have feelings your patch will make them crawl again, unless sysadmin is smart
enough to change once again rt sysctls.
So my replies are not just random things from me.
When a machine is targeted by a DDOS attack, about all slots of the hash table
are fully loaded (ie chain length >= elasticity). We dont need to invalidate
the cache, but find an equilibrium, with small adjustements.
Thank you
--
Not quite, the above is in en else clause to if (cand), that is to say if there is no freeable entry, and our chain length is greater than our elasticity, which I agree, but this should not affect that ability (although I don't have the equipment to check such high throughput. If you do, I would appreciate the Again, if you have the ability to run such a high volume test and demonstrate that this will happen, I'll gladly recind the patch and go back to the drawing board, but I think you're wrong. Regards Neil -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
From: Eric Dumazet <dada1@cosmosbay.com> Sure, but it is possible to determine that some hash chains are unevenly growing out of control compared to others, and that is the algorithm that Neil is trying to discover. --
No problem, but my suggestion to use a separate threshold than elasticity
was apparently not taken into consideration.
I ran an experiment on a big stable machine with 2^19 rtcache slots,
scanning all chains and found *many* of them having length > elasticity,
maximum being 13. I am sure its allowed by statistics laws.
(average chain length : 3.55)
In order to avoid unecessary cache invalidation, we need some
calculation from a statistics expert.
Given rt_hash_size and elasticity (or rt_max_size), compute the "maximum reasonable"
chain length, ie some X number where probability(chain_length < X) > 0.9999
(CCed Evgeniy Polyakov :) )
MemTotal: 32963064 kB
8 CPUS
/proc/sys/net/ipv4/route/max_size:8388608 (default at boot time)
/proc/sys/net/ipv4/route/gc_thresh:2000000
/proc/sys/net/ipv4/route/gc_elasticity:4
/proc/sys/net/ipv4/route/gc_interval:1
Linux version 2.6.24.5
cat /proc/net/sockstat
sockets: used 957514
TCP: inuse 963998 orphan 7100 tw 10393 alloc 964646 mem 376389
rtstat -c5 -i5 (minus first sample)
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
entries| in_hit|in_slow_|in_slow_|in_no_ro| in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
| | tot| mc| ute| | an_dst| an_src| | _tot| _mc| | ed| miss| verflow| _search|t_search|
1862477| 23758| 4400| 0| 0| 0| 0| 0| 4142| 1134| 0| 0| 0| 0| 0| 45754| 11785|
1863310| 24794| 4504| 0| 0| 0| 0| 0| 4089| 1183| 0| 0| 0| 0| 0| 47558| 11640|
1863604| 24183| 4383| 0| 0| 0| 0| 0| 3910| 1061| 0| 0| 0| 0| 0| 46002| 10819|
1864473| 23899| 4510| 0| ...I think what you're looking for here is a simple standard deviation, isn't it? Compute the mean chain legnth, sum the squares of the deviations of each chain and take the square root. Any individual chain longer than the mean chain length + 1 standard deviation can be considered an 'outlier' and therefore trigger a rebuild of the table for that net namespace. I full well realize that thats easier said than done, but does that seem about right? If so, I can start working on trying to build something to accomplish that. Regards Neil -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
Hi. You are right, but in kernel it may not that simple to get a square root... We can probably play with logarithms though. -- Evgeniy Polyakov --
I don't in truth, need a quare root operation (at least I don't think I do). I'm abel to compute the variance rather easily, with a walk of the hash table, then I approximate the square root by use of a for loop in which I square the iterator and compare it to the variance. The iterator when squared that is closest to the variance is taken as my standard deviation. As mentioned previously 4*sd should cover almost all of my sample universe, identifying any remaining outliers as the trigger for a rebuild. Neil -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
Hey all- Since Eric mentioned the use of statistical analysis to determine if hash chains were growing improperly, I thought I would take a stab at such an approach. I'm no statistics expert, but it would seem to me that simply computing the standard deviation of all the non-zero chain lengths would give a good guide point to determine if we needed to invalidate our hash table. I've written the below implementation. I've not tested it (I'll be doing that here shortly for the next few days), but I wanted to post it to get feedback from you all on the subject. The highlights are: 1) We add a counter to rt_hash_bucket structure to track the length of each individual chain. I realize this is sub-optimal, as it adds potentially lots of memory to hash table as a whole (4 bytes * number of hash buckets). I'm afraid I've not come up with a better way to track that yet. We also track the total number of route entries and the number of non-zero length chains. Lastly we also maintain a global maximum chain length which defines the longest chain we will tolerate in the route table. This patch defines it as the mean chain length plus one standard deviation. 2) The above new counters are updated as we add/delete entries from the hash table 3) in rt_intern_hash if we don't find an entry to delete while walking a given chain that is over its elasticity, we check the chain length against the maximum length, which we defined in 1. If it exceeds the max length, we recompute the mean and standard deviation value, and check the length again (see rt_hash_sd_update). If the length is still longer than the new max length, we execute an emergency hash rebuild. Again, I've not started testing it, but if this works, we should be able to eliminate the need for periodic cache rebuilds entirely, replacing it with on demand detection. I could see a usefulness for add a control to tweak the multiple count of standard deviations allowed from the mean, for environments where route entries just ...
I believe the general rule of thumb for something like this is at least two standard deviations. For a normal distribution, one standard deviation covers about 68 % of the sample universe, while two standard deviations covers about 95 % (three standard deviations covers 99.73 %). See the Wikipedia entry: http://en.wikipedia.org/wiki/Standard_deviation -Bill --
Thanks Bill for the pointer, this is the trick. I believe we should target "4σ 99.993666% " case. But we dont need to really compute Standard deviation at runtime, only find an (upper) approximation of it. For elasticity=4 and 512*1024 samples (mean < 4), I guess 4σ can be approximated by 20 or something. Thank you --
Good estimation of Standard Deviation could be computed for free in rt_check_expire(). (each runs scans 20% of hash table with default tunables timeout & ip_rt_gc_interval) We could update 4σ estimation in this function, every minute (ip_rt_gc_interval) At softirq time we then can detect a particular hash chain is longer than 4σ estimation and trigger an appropriate action. This action is to : flush table, and while we do that, expand hash table if its current size is under ip_rt_max_size/elasticity... --
I ran again my litle dirty program that reads /proc/kcore to explore rt hash table on a busy server and found : hptr=0xffff81082ec00000 hsize=524288 total=2299242 dst entries 4306 chains of length 0 23807 chains of length 1 60119 chains of length 2 95112 chains of length 3 106364 chains of length 4 92307 chains of length 5 65743 chains of length 6 39710 chains of length 7 20997 chains of length 8 15775 chains of length 9 39 chains of length 10 7 chains of length 11 2 chains of length 12 avg = 4.38546 sd = 1.94614 X = avg + 4*sd = 12.17 I tried various elasticity settings and found litle variations of X This is because lot of entries are in use (refcnt>1) and can not be freed, regardless of elasticity. --
That is a good idea. I'm still testing the last patch I posted, and Its going pretty well (I did find a few bugs, so I'll be reposting when I'm done). Doing the update on demand when seem to be pretty quick, but I'll rewrite to gather stats on how long it take specifically, and then rewrite/compare to your suggested implementation above. It seems like the above would save space in the hash table, as I could eliminate the chain_length counter per hash bucket. On the down side though, I don't know if a 1 minute garbage collection interval would allow too much of a window to grow a chain in (i.e. we could check chain lengths against a too small standard deviation value, and trigger a cache rebuild unnecessecarily). Just thinking out loud, I've not got any evidence to support or contradict that Best --
Hey all-
So, I've been doing some testing here with this patch, and am
comfortable that the sd estimation is working reasonably well. For a hash table
with an average chain length of 1, it computes the stadard deviation to be 2,
which gives us a max chain length of 9 (4*sd + avg), and it manages to do that
in about 7 jiffies over about 524000 hash buckets. I'm reasonably pleased with
that speed I think, and after thinking about it, I like this implementation
somewhat better, as it doesn't create a window in which chains can be
artifically overrun (until the nect gc round) (although I'm happy to hear
arguments against my implementation). Anywho here it is, comments and thoughts
welcome!
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
route.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 118 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..4f8c5b5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -145,6 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
+static void rt_emergency_hash_rebuild(struct net *net);
static struct dst_ops ipv4_dst_ops = {
@@ -200,7 +201,14 @@ const __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
+ atomic_t chain_length;
};
+
+atomic_t rt_hash_total_count;
+atomic_t rt_hash_nz_count;
+
+static int rt_chain_length_max;
+
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
defined(CONFIG_PROVE_LOCKING)
/*
@@ -601,6 +609,68 @@ static inline int ip_rt_proc_init(void)
}
#endif /* CONFIG_PROC_FS */
+static void rt_hash_sd_update(void)
+{
+ int temp, i;
+ unsigned long long sd;
+ int average = atomic_read(&rt_hash_total_count);
+ int nzcount = ...Empty chains should be accounted for, or average and standard Out of curiosity, what do you expect to do here ? (temp-average) XOR 2 or=20 (temp-average) * (temp-average)=20 Also, your computations use integer arithmetic. If avg =3D 2.5 and sd =3D 1.9, avg+4*sd you'll find 6 instead of 10=20 Anyway, we wont add so many atomic operations and double hash table size in order to be able to compute sd. If we really want to be smart, we can have a pretty good estimate of average and sd for free in rt_check_expire() Something like this untested patch. (We should make sure we dont overflow sum2 for example)
Fair enough, although for such a large hash table, I would assume that this would unfairly bias our average to zero, although as I think about that, our Doh! Show off how stupid I can be apparently. I was doing too much at once, Yes, but I think we're going to have to tolerate some error, since we're using That I have to agree with. The growth in size at the very least doesn't look Oh! Thats pretty nifty, I hadn't considered breaking the interger up like that to avoid round off error. Although Like computing sd during garbage collection, this method leaves a hole during which the addition of several entries to one chain may provide a false positive before the next run of rt_check_expire. Or is the likelyhood of that low enough that its not particularly relevant? Regards Neil -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
So, I've been playing with this patch, and I've not figured out eactly whats bothering me yet, since the math seems right, but something doesn't seem right about the outcome of this algorithm. I've tested with my local system, and all works well, because the route cache is well behaved, and the sd value always works out to be very small, so ip_rt_gc_elasticity is used. So I've been working through some scenarios by hand to see what this looks like using larger numbers. If i assume ip_rt_gc_interval is 60, and rt_hash_log is 17, my sample count here is 7864320 samples per run. If within that sample 393216 (about 4%) of the buckets have one entry on the chain, and all the rest are zeros, my hand calculations result in a standard deviation of approximately 140 and an average of .4. That imples that in that sample set any one chain could be almost 500 entires long before it triggered a cache rebuld. Does that seem reasonable? Best Neil -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
if rt_hash_log is 17, and interval is 60, then you should scan (60 << 17)/300 slots. That's 26214 slots. (ie 20% of the 2^17 slots) I have no idea how you can have sd = 140, even if scaled by (1 << 3) with slots being empty or with one entry only... If 4% of your slots have one element, then average length is 0.04 :) --
Yes, and the average worked out properly, which is why I was concerned. If you take an even simpler case, like you state above (I admit I miseed the /300 part of the sample, but no matter). samples = 26214 Assume each sample has a chain length of 1 sum = 26214 * (1 << 3) = 209712 sum2 = sum * sum = s09712 * 209712 = 43979122944 avg = sum / samples = 209712 / 26214 = 8 (correct) sd = sqrt(sum2 / samples - avg*avg) = sqrt(43979122944/26214 - 64) = 1295 sd >> 3 = 1295.23 >> 3 = 161 Clearly, given the assumption that every chain in the sample set has 1 entry, giving us an average of one, the traditional method of computing standard deviation should have yielded an sd of 0 exactly, since every sample was precisely the average. However, the math above gives us something significantly larger. I'm hoping I missed something, but I don't think I have. -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
Famous last sentence ;) You made some errors in your hand calculations. sum2 is the sum of squares. Its not sum * sum. If all slots have one entry, all "lengths" are (1<<3), and their 'square scaled by 6' is (1 << 6) . So sum2 = 26214 * (1 << 6) = 1677696 avg = sum / samples = 209712 / 26214 = (1 << 3) sd = sqrt(sum2 / samples - avg*avg) = sqrt(64 - 64) = 0 (this is what we expected) Now if you have 4 % of slots with one entry, and 96 % that are empty, you should have /* real maths */ avg = 0.04 sd = sqrt(0.04 - 0.04*0.04) = sqrt(0.0384) = 0.195959 avg + 4*sd = 0.82 /* fixed point math */ sum = 0.04 * 26214 * (1<<3) = 1048 * (1<<3) = 8384 sum2 = 1048 * (1 << 6) = 67072 avg << 3 = 8384/26214 = 0 (with 3 bits for fractional part, we do have rounding error) sd << 3 = sqrt(67072/26214 - 0) = 1 (avg + 4*sd) << 3 = 4 -> final result is 4>>3 = 0 (expected) Now if 50% of slots have one entry, we get : /* real maths */ avg = 0.5 sd = sqrt(0.5 - 0.5*0.5) = sqrt(0.25) = 0.5 avg + 4*sd = 2.5 /* fixed point math */ sum = 0.5 * 26214 * (1<<3) = 104856 sum2 = 13107 * (1<<6) = 838848 avg << 3 = 104856/26214 = 4 sd << 3 = sqrt(838848/26214 - 4*4) = sqrt(32 - 16) = 4 (avg + 4*sd) << 3 = 20 -> final result is 20>>3 = 2 (expected) Hope this helps --
Gaaa! Yes, embarrasing as it may be, that clarifies it for me. Sorry, I missed the the fact that sum2 is the sum of squares rather than the sqare of the sum. Stupid of me. Ok, so this actually works rather well then. I'll keep testing. Thanks! -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
On Tue, Oct 07, 2008 at 07:13:29AM +0200, Eric Dumazet wrote: <lost of prior commentary snipped> Ok, after my previous bone-headedness, I've ssen the light and concluded that the standard deviation computation that Eric devised works quite well. I've spent the past week adding to and enhancing his code to come up with the patch below. Its Erics at the core, to which I've added the following: 1) The actual test in rt_intern_hash to detect chains with outlier lengths, which will trgger an emergency cache rebuild, should an over-long chain be found 2) I've added some bits to organize chain in such a way that entries on the same chains which differ only by values that do not affect the hash computation are grouped together in contiguous segments. This lets us count hash input value transitions rather than simple lengths. This addresses Herberts concern that we might have a valid, non-malignant route cache with only one or two long chains (the case in which several QOS types were used on the same route would be one such example). So in this patch routes in the cache with the same hash inputs are counted only once 3) This patch keeps 2 extra values in the ipv4 net namespace structure. One is a sysctl that defines the maximum allowable number of route cache emergency rebuilds that we will allow, and the other is a counter of the number of times we have rebuilt under duress. If we cross that sysctl boundary for a given namespace, we discontinue using the route cache for that namespace entirely until such time as the sysadmin resets the count via a reboot or ups the maximum rebuild count via the sysctl. This addresses Herberts other concern that might find ourselves constantly rebuilding should we come under an especially sophisticated attack. I've done some rudimentary testing on it here, so more agressive testing, review comments, etc would be greatly appreciated. If this meets everyones approval I think we can follow up with a patch to remove the secret interval ...
From: Neil Horman <nhorman@tuxdriver.com> This patch looks pretty good to me. Likewise. --
While browsing Neil patch, I found one missing rcu_assign_pointer() in current code. I added a comment similar to other rcu_assign_pointer() in rt_intern_hash() function. [PATCH] ip: adds a missing rcu_assign_pointer() rt_intern_hash() is doing an update of a RCU guarded hash chain without using rcu_assign_pointer() or equivalent barrier. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
From: Eric Dumazet <dada1@cosmosbay.com> Applied, thanks Eric. --
Thanks Dave, new patch, with those nits fixed up. I also cleaned up a few
checkpatch errors (all trailing whitespace and 80 col errors)
Best
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/linux/sysctl.h | 1
include/net/netns/ipv4.h | 2
kernel/sysctl_check.c | 1
net/ipv4/route.c | 117 ++++++++++++++++++++++++++++++++++++++++++++-
net/ipv4/sysctl_net_ipv4.c | 12 ++++
5 files changed, 131 insertions(+), 2 deletions(-)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index d0437f3..481aa44 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -435,6 +435,7 @@ enum
NET_TCP_ALLOWED_CONG_CONTROL=123,
NET_TCP_MAX_SSTHRESH=124,
NET_TCP_FRTO_RESPONSE=125,
+ NET_IPV4_RT_CACHE_REBUILD_COUNT=126,
};
enum {
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a6ed838..4fef762 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -46,6 +46,8 @@ struct netns_ipv4 {
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_errors_use_inbound_ifaddr;
+ int sysctl_rt_cache_rebuild_count;
+ int current_rt_cache_rebuild_count;
struct timer_list rt_secret_timer;
atomic_t rt_genid;
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index c35da23..eb9fb57 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -389,6 +389,7 @@ static const struct trans_ctl_table trans_net_ipv4_table[] = {
{ NET_TCP_ALLOWED_CONG_CONTROL, "tcp_allowed_congestion_control" },
{ NET_TCP_MAX_SSTHRESH, "tcp_max_ssthresh" },
{ NET_TCP_FRTO_RESPONSE, "tcp_frto_response" },
+ { NET_IPV4_RT_CACHE_REBUILD_COUNT, "rt_cache_rebuild_count" },
{ 2088 /* NET_IPQ_QMAX */, "ip_queue_maxlen" },
{}
};
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..623b633 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,6 +129,7 @@ static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
static int ...Incomplete patch ? You added a 'length' variable, and update it but nowhere initialize and/or read it ? Some way to change rt_chain_length_max is needed, sysctl or dynamically... --
Yeah, that was quite stupid of me. I rescind this, and I'll post a patch with the I don't really think so, since thats computed every run through rt_check_expire anyway. Thanks! --
Ok, heres a new patch, same as before, but added in proper initalization for
stack variables, as well as the missing chunk that actually computed the
standard deviation and maximum chain length. Built/tested by me successfully.
Sorry for the prior noise. Please review/ack.
Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/linux/sysctl.h | 1
include/net/netns/ipv4.h | 2
kernel/sysctl_check.c | 1
net/ipv4/route.c | 130 ++++++++++++++++++++++++++++++++++++++++++++-
net/ipv4/sysctl_net_ipv4.c | 12 ++++
5 files changed, 144 insertions(+), 2 deletions(-)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index d0437f3..481aa44 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -435,6 +435,7 @@ enum
NET_TCP_ALLOWED_CONG_CONTROL=123,
NET_TCP_MAX_SSTHRESH=124,
NET_TCP_FRTO_RESPONSE=125,
+ NET_IPV4_RT_CACHE_REBUILD_COUNT=126,
};
enum {
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a6ed838..4fef762 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -46,6 +46,8 @@ struct netns_ipv4 {
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_errors_use_inbound_ifaddr;
+ int sysctl_rt_cache_rebuild_count;
+ int current_rt_cache_rebuild_count;
struct timer_list rt_secret_timer;
atomic_t rt_genid;
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index c35da23..eb9fb57 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -389,6 +389,7 @@ static const struct trans_ctl_table trans_net_ipv4_table[] = {
{ NET_TCP_ALLOWED_CONG_CONTROL, "tcp_allowed_congestion_control" },
{ NET_TCP_MAX_SSTHRESH, "tcp_max_ssthresh" },
{ NET_TCP_FRTO_RESPONSE, "tcp_frto_response" },
+ { NET_IPV4_RT_CACHE_REBUILD_COUNT, "rt_cache_rebuild_count" },
{ 2088 /* NET_IPQ_QMAX */, "ip_queue_maxlen" },
{}
};
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..0ff28c4 ...OK First, please include the description, everytime you submit a patch. An extensive description is probably the most important part of a patch. Some machine come to life on hostile environments and receive lots of UDP messages from many hosts and may hit this limit before rt_check_expire() has started its first run. Please consider default route cache settings, that allow 16 elements per chain Here you ignore chains with one elements. This introduce a bias in the average and sd computation. A correct test would be : if ((*rthp == NULL) || !compare_hash_inputs(&(*rthp)->fl,&rth->fl)) Would be nice to add a new counter that one can check in /proc/net/stat/rt_cache Please respin your patch to last tree, this misses my last patch about memory barrier... Thank you --
From: Eric Dumazet <dada1@cosmosbay.com> It's there in net-2.6 --
On Thu, 16 Oct 2008 19:35:17 -0400 Don't add new sysctl numbers use CTL_UNNUMBERED --
Why not? static inline bool compare_hash_inputs(const struct flowi *fl1, struct flowi *fl2) ... --
I assume that you're asking why we use these values to determine if this namespace should use the route cache? The answer is this thread. Because ideally we should never rebuild the route cache unless something is giving the cache an uneven distribution. The consensus is that if we rebuild too many times, its an indicator that someone knows enough about our hashing algorithm to maliginantly force the creation of routes in the cache on the same chain regardless of how many times we rebuild it. In that case the best thing to do is skip the use of the cache entirely. Neil -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
Hi. If it is a hash table with fixed size, and hash is being produced like some function result modulo size of the table, then there will be always fluctuations with longer chain lenghts than 'usual', since, for example, hash function produces result in 2^32 field, which hash table size is only 2^20 or something smaller than maxiumum hash result. When I analyzed Jenkin's hash I mistakenly concluded that such pikes are result of the hash distribution itself and not final modulo operation. Plus, it is always a gaussian distribution, so there will be (although depending on parameters amount may be small enough) entries with longer and smaller than average chains. I tested list walking speed compared to rb-tree and trie on p4 machines without prefetch and lists with upto 10 element are processed faster than anything else, so if your average chain lengths are less than 10 elements, you are in a good condition. Also I have a tricky algorithm to automatically scale RCU protected hash table (with additional overhead at scale time though, but I believe it is not a problem), which just waits to be implemented in netchannels, so long chain length problem can be fixed, if algorithm works. Ok, I've finished my advertisement rants :) -- Evgeniy Polyakov --
From: Eric Dumazet <dada1@cosmosbay.com> It makes a ton of sense, even by default, if we set things up so that we turn it back on when necessary. And that's the final intended idea, to do exactly that. --
From: Neil Horman <nhorman@tuxdriver.com> I just want to clarify what my intentions were when I spoke with Neil about this stuff last week. The idea is that we can by default not rebuild the secret at all. And only when we notice that chains are growing larger than "(NUM_RTCACHE_ENTRIES / NUM_HASH_CHAINS) * N", only then do we do this secret rebuild and flush. Where N is some constant of configurable value, the GC elasticity is some example. Normally this whole hash secret business is totally unnecessary and there is zero reason to do it until we notice there is actually some kind of deep hash chain growth problem. It's expensive, we flush the whole routing cache, so doing it every so often by default makes no sense and it is causing performance problems for people. --
Intentions are very good, thanks for clarifying and letting us know. --
Actually Andrew Dickson <whydna@whydna.net> came up with this idea quite a while ago: Keep the rehash interval but do nothing until some chain hits a specified length. This is quite similar to what is being discussed here. Andrew, could you post the patch please? In addition to this, we should probably enforce that limit as well by simply not adding the newly created entry or deleting one forcibly. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Here's the patch that Herbert's referring to. The basic idea is that
we have a flag which indicates whether or not we need to invalidate
the route cache. If any chain exceeds gc_elasticity, we set the flag
and reschedule the timer. In the worst-case, we'll invalidate the
route cache once every secret_interval; in the best-case, we never
invalidate the cache.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a6ed838..82baf68 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -48,6 +48,7 @@ struct netns_ipv4 {
int sysctl_icmp_errors_use_inbound_ifaddr;
struct timer_list rt_secret_timer;
+ int rt_secret_flag;
atomic_t rt_genid;
};
#endif
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e91bafe..83a1b43 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -837,13 +837,49 @@ void rt_cache_flush(struct net *net, int delay)
}
/*
- * We change rt_genid and let gc do the cleanup
+ * We set rt_secret_flag indicating that we can invalidate the cache if needed.
*/
static void rt_secret_rebuild(unsigned long __net)
{
struct net *net = (struct net *)__net;
- rt_cache_invalidate(net);
- mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
+ net->ipv4.rt_secret_flag = 1;
+}
+
+static void rt_secret_reschedule(int old)
+{
+ struct net *net;
+ int new = ip_rt_secret_interval;
+ int diff = new - old;
+
+ if (!diff)
+ return;
+
+ rtnl_lock();
+ for_each_net(net) {
+ int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
+
+ if (!new) {
+ net->ipv4.rt_secret_flag = 0;
+ continue;
+ }
+
+ if(net->ipv4.rt_secret_flag)
+ continue;
+
+ if (old && deleted) {
+ long time = net->ipv4.rt_secret_timer.expires - jiffies;
+
+ ...From: "Andrew Dickinson" <whydna@whydna.net> This is a very interesting patch and idea, but... Eric showed clearly that on a completely normal well loaded system, the chain lengths exceed the elasticity all the time and it's not like these are entries we can get rid of because their refcounts are all > 1 --
I've got another patch that takes a different approach... Instead of disabling the secret_interval timer or trying to heuristically guess when we're under attack, we continue to invalidate the cache; we just invalidate it with kid-gloves instead of a sledge hammer. Like we do today, we continue to update the genid every time the secret_interval timer expires. Instead of simply creating a new value (and thus invalidating the entire cache), we keep a short history of genid values (I'm thinking on the order of 2-4 previous values). In rt_intern_hash(), when we do the check to see if we already have an existing hash entry, we'll check each of the previous genid versions (hence the desire to keep the history short) before declaring it as not there. If we do find the entry in the hash with an older genid value, we'll re-bucket it into the correct location for the latest genid. Basically, we're allowing entries to continue to exist in the hash after the route cache has been invalidated (they can still be pruned by GC). Happy to send the patch along if you'd like, although I'm not as confident that this approach is really desirable. -A --
I think there are two orthogonal issues here. 1) The way we count the chain length is wrong. There are keys which do not form part of the hash computation. Entries that only differ by them will always end up in the same bucket. We should count all entries that only differ by those keys as a single entry for the purposes of detecting an attack. FWIW we could even reorganise the storage inside a bucket such that it is a 2-level list where the first level only contained entries that differ by saddr/daddr. 2) What do we do when we get a long chain just after a rehash. This is an indication that the attacker has more knowledge about us than we expected. Continuing to rehash is probably no going to help. We need to decide whether we care about this scenario. If yes, then we'll need to come up with a way to bypass the route cache, or at least act as if it was bypassed. If not, then we can just kill the rehash interval altogether. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
I'm not sure I follow what your saying here. I understand that some keys will wind up hashing to the same bucket, but from what I see a change to the saddr and daddr parameters to rt_hash, will change what bucket you hash too. What am Seems like it might be ambiguous to me. perhaps we just got a series of collisions in the firs few entries after a rebuild? I dont know, Im just Why don't we just add a count to the number of times we call rt_emergency_hash_rebuild? If we cross a threshold on that count (or perhaps a rate determined by jiffies since the last emergency rebuild), we can set a flag to not always return a failed lookup in the cache, so as to force routing into the slow path. Does that seem reasonable to you? Best Neil -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
We've got keys other than saddr/daddr, for example, all route entries that differ only by TOS hash to the same bucket. There are quite a few possible TOS values. Have you computed the probability of that? If the limit is sensible this should be extremely unlikely. Note that I'm also assuming that you've got 1) solved as otherwise all bets are off. Assuming that we are counting it correctly, and that the length limit is set to a sensible value, then the probability of the attacker reaching that limit soon after a rehash is very small. Therefore we can deduce that if it does happen then chances are that our RNG has been compromised and as such rehashing again Really, if we're hitting this when nobody is attacking us then something is screwed up, notably 1) as it stands. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Ok, that makes a bit more sense. Thank you. I agree that in that case we probably do need to bias our chain length accounting such that entries that differ only by TOS or other fields that do not affect the hash value only get Like I said, I'm just playing devils advocate on this. Clearly the hash function should be such that an early data set hashes to the same chain is very Thats rather the point though. If we're not being attacked, then we'll never hit this (ideally). I'll post a new patch when I figure out how best to handle entries which differ by a field irrelevant to the hash function. -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
The patch is wrong in respect to for_each_net usage. It should be used under rtnl. It is not held at the moment :( Regards, Den --
Thank you, I'll be sure to take that into consideration when we determine what the proper algorithm to implement here is. Although, I used the rt_secret_rebuild timer function as my guide here, and it doesn't hold the rtnl lock either, or is that because its running in a softirq context? -- /**************************************************** * Neil Horman <nhorman@tuxdriver.com> * Software Engineer, Red Hat ****************************************************/ --
no. There is no lock as there is no iteration over net namespace list. Each namespace has its own timer. The simplest approach I see right now is to re-iterate over current dst cache chain and mark only those namespaces. This is at least fair as only one namespace can be attacked (not entire node). All locking for this is in place. Regards, --
This is completely bogus. We never allow any chain to grow beyond the elasticity. So in the worst case we just bypass the route cache. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
OK we don't actually enforce that as it stands, but perhaps we should have a maximum chain length that is enforced. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Thank you :). I was just about to respond with a note asking you for a reference to your previous assertion. We definately don't enforce that now. I agree we should have a maximum chain legth that we enforce. Unfortunately according to eric, the gc_elasticity value can't be it, because we very routinely in nominal systems have a limited number of chains that violate the elasticity, and that should be expected. Thats why my latest patch tries to cover that by computing the standard deviation of the set of chains and allowing chains within avg+4*SD to exist. With that allowance, we should be able to remove the periodic rehash entirely. Best --
