Hi!
I've been using HTB in a Linux bridge and recently I noticed that, at high speeds, the configured rate/ceil is not respected as well as it is at lower speeds.
I'm using a packet generator/analyser to inject over 950Mbps and see what comes back to it on the other side of my bridge. Generated packets have 800 bytes. I noticed that, for several tc HTB rate/ceil configurations, the amount of traffic received by the analyser stays the same. See these values:

HTB conf       Analyser reception
476000Kbit     544.260.329
500000Kbit     545.880.017
510000Kbit     544.489.469
512000Kbit     546.890.972
-------------------------
513000Kbit     596.061.383
520000Kbit     596.791.866
550000Kbit     596.543.271
554000Kbit     596.193.545
-------------------------
555000Kbit     654.773.221
570000Kbit     654.996.381
590000Kbit     655.363.253
605000Kbit     654.112.017
-------------------------
606000Kbit     728.262.237
665000Kbit     727.014.365
-------------------------

There are these steps, and it looks like it doesn't matter if I configure HTB to 555Mbit or to 605Mbit - the result is the same: 654Mbit. This is 18% more traffic than the configured value. I also realise that for smaller packets it gets worse, reaching 30% more traffic than what I configured. For packets of 1514 bytes the accuracy is quite good.
I'm using kernel 2.6.25

My 'tc -s -d class ls dev eth1' output:

class htb 1:10 parent 1:2 rate 1000Mbit ceil 1000Mbit burst 126375b/8 mpu 0b overhead 0b cburst 126375b/8 mpu 0b overhead 0b level 5
Sent 51888579644 bytes 62067679 pkt (dropped 0, overlimits 0 requeues 0)
rate 653124Kbit 97656pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 113 ctokens: 113
class htb 1:1 root rate 1000Mbit ceil 1000Mbit burst 126375b/8 mpu 0b overhead 0b cburst 126375b/8 mpu 0b overhead 0b level 7
Sent 51888579644 bytes 62067679 pkt (dropped 0, overlimits 0 requeues 0)
rate 653123Kbit 97656pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 113 ctokens: ...
On Fri, 15 May 2009 15:49:31 +0100
You are probably hitting the limit of the timer resolution. So it matters
what the clock source is.
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
Also, is HFSC any better than HTB?
--
--
Hi!
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
returns "jiffies"

With HFSC the accuracy is good. Also with packets of 800 bytes I got these values:

received     configured   error (%)
904596519    900000000    0,51
804293658    800000000    0,54
703662853    700000000    0,52
603354059    600000000    0,56
502805411    500000000    0,56
402527055    400000000    0,63
301484904    300000000    0,49
201074301    200000000    0,54
100546656    100000000    0,55

Thanks
Antonio Almeida

On Fri, May 15, 2009 at 7:12 PM, Stephen Hemminger
--
> Looks great! But, since HFSC uses rates directly (without rate tables) This matter about the use of rate tables is not very familiar to me. In fact I keep wondering a lot of things about what kernel does with packets. Is there any documentation explaining how queue disciplines work and how it interacts with netfilter and tc_core? What about packets dispatching? Thanks Antonio Almeida --
On Mon, 18 May 2009 11:01:21 +0100
That is the slowest of the choices. Better ones are hpet and tsc, but your hardware doesn't support them. You should compile your kernel with HZ=1000 and the resolution will be better (but with some loss of performance).
--
I have my kernel's timer frequency set to 1000Hz since the beginning. I've got all these results with HZ_1000. (I'm working on clocksource) Thanks Antonio Almeida On Mon, May 18, 2009 at 5:13 PM, Stephen Hemminger --
On Mon, 18 May 2009 09:13:14 -0700
Are you using one of the AMD dual core machines? That processor has the bad design flaw that the TSC counter is not synced between cores, so the kernel can't use it. You might even be better off running a non-SMP kernel on that box.
--
My machine has two dual-core AMD Opteron 280 processors. Do I have that TSC problem?

# dmesg | grep AMD
OEM ID: AMD Product ID: HAMMER APIC at: 0xFEE00000
CPU0: AMD Dual Core AMD Opteron(tm) Processor 280 stepping 02
CPU1: AMD Dual Core AMD Opteron(tm) Processor 280 stepping 02
CPU2: AMD Dual Core AMD Opteron(tm) Processor 280 stepping 02
CPU3: AMD Dual Core AMD Opteron(tm) Processor 280 stepping 02

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 280
stepping : 2
cpu MHz : 2394.039
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy ts fid vid ttp
bogomips : 4790.36
clflush size : 64

Antonio Almeida
--
Do I have that TSC problem? --
Is it for sure there is no gso/tso enabled on this dev (with up to date ethtool -k)? It would be nice to see also more details like .config, ifconfigs before and after the test, tc -s qdisc and bytes/ packet number seen by this analyser, plus maybe some proof you can obtain such flows with something simpler like tbf. Of course using the current kernel, even if no difference, would give us more valuable perspective. Thanks, Jarek P. --
Hi!
Here the information you asked:
# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
The bridge is between eth0 and eth1
---------------------------
Before traffic starts:
---------------------------
Analyser sent bytes: 0
Analyser sent packets: 0
Analyser received bytes: 0
Analyser received packets: 0
# tc -s -d class ls dev eth1
class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 990 ctokens: 990
class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 990 ctokens: 990
class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 990 ctokens: 990
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 999 ctokens: 999
# ifconfig
br0 Link encap:Ethernet HWaddr 00:E0:ED:10:7C:6C
UP ...
Very nice, but there are some questions:
- if this analyser uses tcp we definitely need tso off as well during these tests,
- it would be nice to use two patches I've sent to exclude known (now) reasons.
With the above I expect accuracy should be better, but definitely not like hfsc (plus no higher than 1000Mbit rate reported after stopping effect).
Thanks, ...
--
The analyser traffic is tcp. With tso set to off, the accuracy stays the same.

# ethtool -K eth0 tso off
# ethtool -K eth1 tso off
# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
# tc -s -d class ls dev eth1 | head -24
class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 5
Sent 164938012460 bytes 206824215 pkt (dropped 0, overlimits 0 requeues 0)
rate 652715Kbit 97655pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 402 ctokens: 402
class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
Sent 164938012460 bytes 206824215 pkt (dropped 0, overlimits 0 requeues 0)
rate 652715Kbit 97655pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 402 ctokens: 402
class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 6
Sent 164938012460 bytes 206824215 pkt (dropped 0, overlimits 0 requeues 0)
rate 652715Kbit 97655pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 402 ctokens: 402
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
Sent 164938040048 bytes 206824248 pkt (dropped 25827911, overlimits 0 requeues 0)
rate 652715Kbit 97655pps backlog 0b 33p requeues 0
lended: 206824215 borrowed: 0 giants: 0
tokens: -6 ctokens: -6

I'm applying the patches now. I'll get back to you.
Antonio ...
On Fri, May 15, 2009 at 03:49:31PM +0100, Antonio Almeida wrote:
This looks like a regular bug. I guess it's an overflow in gen_estimator(), but I'm not sure there is nothing more. Could you try the patch below? (An offset warning when patching 2.6.25 is OK)
Thanks,
Jarek P.
---
net/core/gen_estimator.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..87f0ced 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -127,7 +127,11 @@ static void est_timer(unsigned long arg)
 	npackets = e->bstats->packets;
 	rate = (nbytes - e->last_bytes)<<(7 - idx);
 	e->last_bytes = nbytes;
-	e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
+	if (rate > e->avbps)
+		e->avbps += (rate - e->avbps) >> e->ewma_log;
+	else
+		e->avbps -= (e->avbps - rate) >> e->ewma_log;
+
 	e->rate_est->bps = (e->avbps+0xF)>>5;

 	rate = (npackets - e->last_packets)<<(12 - idx);
--
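To illustrate why the signed update can misbehave, here is a small standalone C sketch (not kernel code). It assumes `long` is 32 bits wide, as on a 32-bit kernel build, and that signed arithmetic wraps in practice; the arithmetic right shift is emulated portably. The function names `asr32`, `ewma_old32` and `ewma_branch` are made up for this illustration.

```c
#include <stdint.h>

/* Portable emulation of an arithmetic right shift on a 32-bit value. */
static inline uint32_t asr32(uint32_t v, int n)
{
	return (v & 0x80000000u) ? ~(~v >> n) : v >> n;
}

/* Old update, assuming a 32-bit long that wraps modulo 2^32:
 *   avbps += ((long)rate - (long)avbps) >> ewma_log;
 * The difference is interpreted as a signed 32-bit number. */
static inline uint32_t ewma_old32(uint32_t avbps, uint32_t rate, int ewma_log)
{
	return avbps + asr32(rate - avbps, ewma_log);
}

/* Update in the style of the patch: unsigned arithmetic only. */
static inline uint32_t ewma_branch(uint32_t avbps, uint32_t rate, int ewma_log)
{
	if (rate > avbps)
		avbps += (rate - avbps) >> ewma_log;
	else
		avbps -= (avbps - rate) >> ewma_log;
	return avbps;
}
```

With a scaled average of 0x90000000 (about 604Mbit/s after dividing by 2^5 and converting bytes to bits) and a measured rate of 0, `ewma_old32` moves the average up to 0x9E000000 while `ewma_branch` moves it down to 0x7E000000. Below 2^31 (about 537Mbit/s) both forms agree, which would be consistent with the symptom appearing only above roughly 500Mbit/s.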
This patch works perfectly!
The rate (bits/s) now decreases along with pps when I stop the traffic (it doesn't grow as it used to for rates over 500Mbits/s).

# tc -s -d class ls dev eth1 | head -21 | tail -1
rate 651960Kbit 97482pps backlog 0b 0p requeues 0
rate 541134Kbit 80911pps backlog 0b 0p requeues 0
rate 405850Kbit 60683pps backlog 0b 0p requeues 0
rate 304388Kbit 45512pps backlog 0b 0p requeues 0
rate 304388Kbit 45512pps backlog 0b 0p requeues 0
rate 228291Kbit 34134pps backlog 0b 0p requeues 0
rate 171218Kbit 25601pps backlog 0b 0p requeues 0
rate 171218Kbit 25601pps backlog 0b 0p requeues 0
rate 128414Kbit 19201pps backlog 0b 0p requeues 0
rate 96310Kbit 14400pps backlog 0b 0p requeues 0
rate 96310Kbit 14400pps backlog 0b 0p requeues 0
rate 72233Kbit 10800pps backlog 0b 0p requeues 0
rate 54174Kbit 8100pps backlog 0b 0p requeues 0

Thanks to you!
Antonio Almeida
--
I'm not able to reach full speed with bond + HTB + sfq on 2.6.29.1, both with and without these patches. I seem to get a lot of drops on sfq qdiscs, whatever quantum I set. Playing with IRQ affinity doesn't help. I didn't check without bond.

With bond + HFSC + sfq, I'm able to reach the speed. It doesn't seem to overspill with 580 mbps load. Jarek, would your patches help with HFSC overspill? I will check tomorrow under 750 mbps load.

# ethtool -k eth0
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
large receive offload: off
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

--
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211
--
Please disregard my comment about HFSC. It still overspills heavily. On a 400 mbps limit, I'm getting 520 mbps actual throughput. -- Best Regards, Vladimir Ivashchenko Chief Technology Officer PrimeTel PLC, Cyprus - www.prime-tel.com Tel: +357 25 100100 Fax: +357 2210 2211 --
The gen_estimator patch should fix only the effect of the rate rising after the flow stops, and maybe similar overflows while reporting rates around 1Gbit. It would show in tc stats of HFSC or HTB, but doesn't affect actual scheduling rates.

The iproute2 tc_core patch can matter for HTB scheduling rates if there are a lot of small packets (e.g. 100 byte for rate 500Mbit), possibly mixed with bigger ones. It doesn't matter for HFSC or

I guess you should send some logs. Your previous report seems to show the sum of sc rates of children could be too high. You seem to expect the parent's sc and ul should limit this, but actually children's rates decide and parent's rates are mainly for lending/borrowing (at least in HTB). So, it would be nice to try with one leaf class first (similarly to Antonio) to see how well high rates are respected.

High drop should be OK if the flow is much faster than the scheduling/hardware send rate. It could be a bit higher than in older kernels because of limited requeuing, but this could be corrected with longer queue lengths (sfq has a very short queue: max 127).

Jarek P.
--
Unfortunately it's difficult for me to play with classes as it's real traffic.

I don't think it's sfq, since I have the same sfq qdiscs with HFSC. Also I'm comparing this to my production HTB box, which has 2.6.21.5 with esfq and no bond (just eth); esfq also has a 127p limit.

I tried to get rid of bond on the outbound traffic: I balanced traffic via eth0 and eth2 manually by splitting routes going through them. I still had the same issue with HTB not reaching the full speed.

I'm going to try testing exactly the same configuration on 2.6.29 as I have on 2.6.21.5 tonight. The only difference would be that I use sfq(dst) instead of esfq(dst), which is not available on 2.6.29.

--
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com
--
Similarly to Antonio's: ifconfigs and tc -s for qdiscs and classes at I'm a bit lost about your configs/results and not reaching vs. overspilled, so please send some new data to compare (gzipped?). Jarek P. --
Ok, it seems that I finally found what is causing my HTB on 2.6.29 not to reach full throughput: dst hashing on sfq with high divisor value.

2.6.21 esfq divisor 13 depth 4096 hash dst - 680 mbps
2.6.29 sfq WITHOUT "flow hash keys dst ... " (default sfq) - 680 mbps
2.6.29 sfq + "flow hash keys dst divisor 64" filter - 680 mbps
2.6.29 sfq + "flow hash keys dst divisor 256" filter - 660 mbps
2.6.29 sfq + "flow hash keys dst divisor 2048" filters - 460 mbps

I'm using a high sfq hash divisor in order to decrease the number of collisions; there are several thousand hosts behind each of the classes.

Any ideas why increasing the sfq divisor size results in a drop of throughput?

Attached are diagnostics gathered in case of divisor 2048.

--
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211
But... it appears sfq currently supports a fixed divisor of 1024

net/sched/sch_sfq.c

IMPLEMENTATION:
This implementation limits maximal queue length to 128;
maximal mtu to 2^15-1; number of hash buckets to 1024.
The only goal of this restrictions was that all data
fit into one 4K page :-). Struct sfq_sched_data is
organized in anti-cache manner: all the data for a bucket
are scattered over different locations. This is not good,
but it allowed me to put it into 4K.

It is easy to increase these values, but not in flight. */

#define SFQ_DEPTH 128
#define SFQ_HASH_DIVISOR 1024

Apparently Corey Hickey's 2007 work on SFQ was not merged.
http://kerneltrap.org/mailarchive/linux-netdev/2007/9/28/325048
--
Yes, sfq has its design limits, and as a matter of fact, because of max length (127) it should be treated as a toy or "personal" qdisc. I don't know why more of esfq wasn't merged; anyway, similar functionality could be achieved in current kernels with sch_drr + cls_flow, alas not documented well enough. Here is some hint:
http://markmail.org/message/h24627xkrxyqxn4k
Jarek P.
PS: I guess you weren't very consistent about whether your main problem was exceeding or not reaching the htb rate, and there is quite a difference. ... ...
--
Can I balance only by destination IP using this approach? Normal IP flow-based balancing is not good for me, I need

Yes indeed :( I'm trying to migrate from 2.6.21 eth/htb/esfq to 2.6.29 bond/htb/sfq, and that introduces a lot of changes. Apparently at some point I changed the sfq divisor from 1024 to 2048 and forgot about it.

Now I realize that the problems I reported were as follows:
1) HTB exceeds target when I use HTB + sfq + divisor 1024
2) HFSC exceeds target when I use HFSC + sfq + divisor 1024
3) HTB does not reach target when I use HTB + sfq + divisor 2048

I will check again scenario 1) with the latest patches from

--
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com
--
Yes, you need to use flow "dst" key, I guess. (tc filter add flow help)

Generally, the most common reasons are:
- too short (or zero) tx queue length or/plus some disturbances in maintaining the flow - for not reaching the rate
- gso/tso or other non standard packets sizes - for exceeding the rate.
--
Could we detect this at runtime and emit a warning (once) ? Or should we assume guys using this stuff should be smart enough ? I confess I made this error once and this was not so easy to spot... --
On Thu, May 21, 2009 at 11:07:24AM +0200, Eric Dumazet wrote: I guess, it's a rhetorical question... Jarek P. --
What is the number of DRR classes I need to create, a separate class for each host? I have around 20000 hosts. I figured out that WRR does what I want and it's documented, so I'm using a 2.6.27 kernel with WRR now.

I was still hitting a wall with bonding. I played with a lot of combinations and could not find a way to make it scale to multiple cores. Cores which handle incoming traffic would get hit to 0-20% idle. So, I got rid of bonding completely and instead configured PBR on Cisco + Linux routing in such a way that a packet gets received and transmitted using NICs connected to the same pair of cores with common cache. 65-70% idle on all cores now, compared to 0-30% idle in worst

Just FYI, kernel 2.6.29.1, sub-classes with sfq divisor 1024, tso & gso off, netdevice.h and tc_core.c patches applied:

class htb 1:2 root rate 775000Kbit ceil 775000Kbit burst 98328b cburst 98328b
Sent 64883444467 bytes 72261124 pkt (dropped 0, overlimits 0 requeues 0)
rate 821332Kbit 112572pps backlog 0b 0p requeues 0
lended: 21736738 borrowed: 0 giants: 0

In any case, exceeding the rate is not a big problem for me. Thanks a lot to everyone for their help.

--
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211
--
As a matter of fact I don't understand this bonding idea vs. smp: I guess Eric Dumazet wrote why it's wrong wrt. locking. I'm not an smp expert but I think the most efficient use is with separate NICs per cpu (so with separate HTB qdiscs if possible), or multiqueue NICs - but they would currently need a common HTB etc., so again a common Anyway, I'd be interested with the full tc -s class & qdisc report. Thanks, Jarek P. --
I tried the following scenario: 2 NICs used for receive + another 2 NICs used for transmit having HTB. Each NIC on a separate core. No bonding, just manual load balancing using IP routing. The result was that RX cores would be 20% and 40% idle respectively, even though the amount of traffic they were receiving was roughly the same. The TX cores were idling at around 90%. I found this strange personally, but I'm completely ignorant in internals of kernel operation. -- Best Regards Vladimir Ivashchenko Chief Technology Officer PrimeTel, Cyprus - www.prime-tel.com --
There is not enough data to analyse this, but generally you should aim at maintaining one flow (RX + TX) on the same cpu cache. Jarek P. --
Yep, that's what I did in the end (as per the top paragraph). -- Best Regards Vladimir Ivashchenko Chief Technology Officer PrimeTel, Cyprus - www.prime-tel.com --
So, stop writing: "I'm completely ignorant in internals of kernel operation" because you're smp expert now! ;-) Jarek P. --
With a typical estimator "1sec 8sec", ewma_log value is 3
At gigabit speeds, we are very close to overflow yes, since
we only have 27 bits available, so 134217728 bytes per second
or 1073741824 bits per second.
So formula :
e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
is going to overflow.
One way to avoid the overflow would be to use a smaller estimator, like "500ms 4sec"
Or use a 64-bit rate & avbps; this is needed for 10Gb speeds, I suppose...
diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..150e2f5 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -86,9 +86,9 @@ struct gen_estimator
spinlock_t *stats_lock;
int ewma_log;
u64 last_bytes;
+ u64 avbps;
u32 last_packets;
u32 avpps;
- u32 avbps;
struct rcu_head e_rcu;
struct rb_node node;
};
@@ -115,6 +115,7 @@ static void est_timer(unsigned long arg)
rcu_read_lock();
list_for_each_entry_rcu(e, &elist[idx].list, list) {
u64 nbytes;
+ u64 brate;
u32 npackets;
u32 rate;
@@ -125,9 +126,9 @@ static void est_timer(unsigned long arg)
nbytes = e->bstats->bytes;
npackets = e->bstats->packets;
- rate = (nbytes - e->last_bytes)<<(7 - idx);
+ brate = (nbytes - e->last_bytes)<<(7 - idx);
e->last_bytes = nbytes;
- e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
+ e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
e->rate_est->bps = (e->avbps+0xF)>>5;
rate = (npackets - e->last_packets)<<(12 - idx);
--
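As a quick check of the limits quoted in this message, the following standalone C sketch (not kernel code) computes the largest rate a 32-bit value can hold when it is stored scaled by 2^5, and the largest rate an unscaled 32-bit bytes-per-second field can report. The macro and function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

#define EST_BPS_SCALE_SHIFT 5	/* avbps is stored scaled by 2^5 */

/* Largest byte rate that fits in a 32-bit avbps scaled by 2^5. */
static inline uint64_t max_bytes_per_sec_u32_avbps(void)
{
	return (uint64_t)UINT32_MAX >> EST_BPS_SCALE_SHIFT;
}

/* Largest byte rate an unscaled 32-bit bps field can report. */
static inline uint64_t max_bytes_per_sec_u32_report(void)
{
	return (uint64_t)UINT32_MAX;
}

/* Convert bytes per second to bits per second. */
static inline uint64_t bytes_to_bits(uint64_t bytes_per_sec)
{
	return bytes_per_sec * 8;
}
```

The first limit works out to just under 2^27 bytes per second, i.e. about 1073Mbit/s, which is why a gigabit link sits right at the edge. The second is just under 2^32 bytes per second, roughly 34360Mbit/s, the ceiling that remains once only the intermediate counters are widened to 64 bits.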
Yes, I considered this too, but because of an overhead I decided to fix as designed (according to the comment) for now. But probably you are right, and we should go further, so I'm OK with your patch. --
From: Jarek Poplawski <jarkao2@gmail.com> I like this patch too, Eric can you submit this formally with proper signoffs etc.? Thanks! --
Sure, here it is. We might need a similar patch to get a correct pps value
too, since we currently are limited to ~ 2^21 packets per second.
[PATCH] pkt_sched: gen_estimator: use 64 bit intermediate counters for bps
gen_estimator can overflow bps (bytes per second) with Gb links, while
it was designed with a u32 API, with a theorical limit of 34360Mbit (2^32 bytes)
Using 64 bit intermediate avbps/brate counters can allow us to reach this
theorical limit.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..ea28659 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -66,9 +66,9 @@
NOTES.
- * The stored value for avbps is scaled by 2^5, so that maximal
- rate is ~1Gbit, avpps is scaled by 2^10.
-
+ * avbps is scaled by 2^5, avpps is scaled by 2^10.
+ * both values are reported as 32 bit unsigned values. bps can
+ overflow for fast links : max speed being 34360Mbit/sec
* Minimal interval is HZ/4=250msec (it is the greatest common divisor
for HZ=100 and HZ=1024 8)), maximal interval
is (HZ*2^EST_MAX_INTERVAL)/4 = 8sec. Shorter intervals
@@ -86,9 +86,9 @@ struct gen_estimator
spinlock_t *stats_lock;
int ewma_log;
u64 last_bytes;
+ u64 avbps;
u32 last_packets;
u32 avpps;
- u32 avbps;
struct rcu_head e_rcu;
struct rb_node node;
};
@@ -115,6 +115,7 @@ static void est_timer(unsigned long arg)
rcu_read_lock();
list_for_each_entry_rcu(e, &elist[idx].list, list) {
u64 nbytes;
+ u64 brate;
u32 npackets;
u32 rate;
@@ -125,9 +126,9 @@ static void est_timer(unsigned long arg)
nbytes = e->bstats->bytes;
npackets = e->bstats->packets;
- rate = (nbytes - e->last_bytes)<<(7 - idx);
+ brate = (nbytes - e->last_bytes)<<(7 - idx);
e->last_bytes = nbytes;
- e->avbps += ((long)rate - (long)e->avbps) >> ...
From: Eric Dumazet <dada1@cosmosbay.com>
True, but it is a less urgent issue than bps overflow.
--
On Tue, May 19, 2009 at 01:59:55AM +0200, Eric Dumazet wrote:
Btw., I'm a bit concerned about the syntax here: isn't such shifting of signed ints implementation dependent?
Jarek P.
--
You are right Jarek, I very often forget to never ever use signed quantities at all! (But also note the original code has the same undefined behavior)

Quoting wikipedia: (http://en.wikipedia.org/wiki/Arithmetic_shift)

The (1999) ISO standard for the C programming language defines the C language's right shift operator in terms of divisions by powers of 2. Because of the aforementioned non-equivalence, the standard explicitly excludes from that definition the right shifts of signed numbers that have negative values. It doesn't specify the behaviour of the right shift operator in such circumstances, but instead requires each individual C compiler to specify the behaviour of shifting negative values right.

Apparently gcc does the *right* thing on x86_32, but we probably want something stronger here. I could not find a gcc documentation statement on right shifts of negative values.

436: 8b 4b 14 mov 0x14(%ebx),%ecx
439: 89 73 18 mov %esi,0x18(%ebx)
43c: 89 7b 1c mov %edi,0x1c(%ebx)
43f: 8b 73 20 mov 0x20(%ebx),%esi
442: 8b 7b 24 mov 0x24(%ebx),%edi
445: 29 f0 sub %esi,%eax
447: 19 fa sbb %edi,%edx
449: 0f ad d0 shrd %cl,%edx,%eax
44c: d3 fa sar %cl,%edx << good >>
44e: f6 c1 20 test $0x20,%cl
451: 74 05 je 458 <est_timer+0xb8>
453: 89 d0 mov %edx,%eax
455: c1 fa 1f sar $0x1f,%edx
458: 01 f0 add %esi,%eax
45a: 8b 4b 0c mov 0xc(%ebx),%ecx
45d: 89 43 20 mov %eax,0x20(%ebx)
460: 11 fa adc %edi,%edx
462: 83 c0 0f add $0xf,%eax
465: 89 53 24 mov %edx,0x24(%ebx)
468: 83 d2 00 adc $0x0,%edx
46b: 0f ac d0 05 shrd ...
I guess gcc and most of others do this "right"; but it looks "unkosher" anyway. Jarek P. --
I might have missed your point here, but would it be so costly to do these shifts separately here? Jarek P. --
You replied to yourself Jarek :)
As I said earlier, I found your concern right, so please submit a patch ?
I found many occurrences of a right shift on a signed int/long in kernel.
One example being :
arch/x86/mm/init_64.c
int kern_addr_valid(unsigned long addr)
{
unsigned long above = ((long)addr) >> __VIRTUAL_MASK_SHIFT;
and another rate estimator in drivers/atm/idt77252.c
static void
idt77252_est_timer(unsigned long data)
We could also check net/netfilter/ipvs/ip_vs_est.c (estimation_timer())
--
On Tue, May 19, 2009 at 08:03:24PM +0200, Eric Dumazet wrote:
OK, thanks,
Jarek P.
----------------->
pkt_sched: gen_estimator: Fix signed integers right-shifts.

Right-shifts of signed integers are implementation-defined so unportable.

With feedback from: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
diff -Nurp a/net/core/gen_estimator.c b/net/core/gen_estimator.c
--- a/net/core/gen_estimator.c 2009-05-19 20:33:47.000000000 +0200
+++ b/net/core/gen_estimator.c 2009-05-19 20:40:58.000000000 +0200
@@ -128,12 +128,12 @@ static void est_timer(unsigned long arg)
 	npackets = e->bstats->packets;
 	brate = (nbytes - e->last_bytes)<<(7 - idx);
 	e->last_bytes = nbytes;
-	e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
+	e->avbps += (brate >> e->ewma_log) - (e->avbps >> e->ewma_log);
 	e->rate_est->bps = (e->avbps+0xF)>>5;

 	rate = (npackets - e->last_packets)<<(12 - idx);
 	e->last_packets = npackets;
-	e->avpps += ((long)rate - (long)e->avpps) >> e->ewma_log;
+	e->avpps += (rate >> e->ewma_log) - (e->avpps >> e->ewma_log);
 	e->rate_est->pps = (e->avpps+0x1FF)>>10;
 skip:
 	read_unlock(&est_lock);
--
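The split-shift form in this patch can be tried in isolation. Below is a minimal standalone C sketch (not kernel code) of the same expression on 64-bit unsigned values; the function name `ewma_split_shift` is hypothetical. It relies only on unsigned arithmetic, where a subtraction that wraps below zero is undone by the following addition, so no signed shift is needed.

```c
#include <stdint.h>

/* avbps += (brate >> ewma_log) - (avbps >> ewma_log);
 * The unsigned difference may wrap, but adding it to avbps wraps back,
 * so the result is correct modulo 2^64. */
static inline uint64_t ewma_split_shift(uint64_t avbps, uint64_t brate, int ewma_log)
{
	avbps += (brate >> ewma_log) - (avbps >> ewma_log);
	return avbps;
}
```

For example, with `ewma_log` 3 an average of 800 decays to 700 when the new sample is 0, and an average of 0 rises to 100 when the sample is 800. One property of shifting each operand separately is that an average smaller than 2^ewma_log no longer decays (7 stays 7 for a sample of 0), because its shifted value is already 0; since the stored values are scaled, this residue is small.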
From: Jarek Poplawski <jarkao2@gmail.com> Applied to net-next-2.6, thanks! --
From: Eric Dumazet <dada1@cosmosbay.com> It emits an "arithmetic shift right" for every CPU I've ever checked. --
Here is some additional explanation. It looks like these rates above 500Mbit hit the design limits of packet scheduling. Currently used internal resolution PSCHED_TICKS_PER_SEC is 1,000,000. 550Mbit rate with 800byte packets means 550M/8/800 = 85938 packets/s, so on average 1000000/85938 = 11.6 ticks per packet. Accounting only 11 ticks means we leave 0.6*85938 = 51563 ticks per second, letting for additional sending of 51563/11 = 4687 packets/s or 4687*800*8 = 30Mbit. Of course it could be worse (0.9 tick/packet lost) depending on packet sizes vs. rates, and the effect rises for higher rates. Jarek P. --
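The back-of-the-envelope estimate above can be reproduced with a few lines of C. This is only a sketch of that arithmetic, assuming each packet is charged a whole number of ticks with the fraction dropped; it does not model the kernel's rate tables, and the function names are made up for this illustration.

```c
#include <stdint.h>

/* Whole scheduler ticks charged per packet (fraction truncated). */
static inline uint64_t ticks_per_packet(uint64_t rate_bps, uint64_t pkt_bytes,
					uint64_t ticks_per_sec)
{
	return ticks_per_sec * pkt_bytes * 8 / rate_bps;
}

/* Rate that results if every packet costs only the truncated tick count.
 * Returns 0 when a packet would cost zero ticks (effectively unlimited). */
static inline uint64_t effective_rate_bps(uint64_t rate_bps, uint64_t pkt_bytes,
					  uint64_t ticks_per_sec)
{
	uint64_t ticks = ticks_per_packet(rate_bps, pkt_bytes, ticks_per_sec);

	if (ticks == 0)
		return 0;
	return ticks_per_sec / ticks * pkt_bytes * 8;
}
```

With 1,000,000 ticks per second, a 550Mbit rate and 800 byte packets, this gives 11 ticks per packet and an effective rate of about 581.8Mbit, in line with the roughly 30Mbit excess estimated above. For 1514 byte packets at the same rate the truncation loses far less (22 of 22.02 ticks), giving about 550.5Mbit.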
Return non-zero tc_calc_xmittime() for rate tables
While looking at the problem of HTB accuracy for high speed (~500Mbit
rates) I've found that rate tables have cells filled with zeros for
the smallest sizes. It means such packets aren't accounted at all.
Apart from the correctness of such configs, let's make it safe with
rather overaccounting than living it unlimited.
Reported-by: Antonio Almeida <vexwek@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
tc/tc_core.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/tc/tc_core.c b/tc/tc_core.c
index 9a0ff39..14f25bc 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -58,7 +58,9 @@ unsigned tc_core_ktime2time(unsigned ktime)
unsigned tc_calc_xmittime(unsigned rate, unsigned size)
{
- return tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+ unsigned t;
+ t = tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+ return t ? : 1;
}
unsigned tc_calc_xmitsize(unsigned rate, unsigned ticks)
--
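To see when a rate table cell can end up as zero, here is a standalone C sketch modelled loosely on the patched function. It is not the iproute2 code: the `ticks_per_usec` parameter stands in for tc's internal tick scaling, and the `clamp` flag is added only so both behaviours can be compared; both are assumptions made for this illustration. The rate is in bytes per second.

```c
#define TIME_UNITS_PER_SEC 1000000

/* Transmission time of `size` bytes at `rate_Bps` bytes/s, in ticks.
 * ticks_per_usec is a hypothetical stand-in for tc's tick scaling.
 * With clamp != 0 a zero result is raised to 1, as in the patch. */
static inline unsigned xmittime_ticks(unsigned rate_Bps, unsigned size,
				      double ticks_per_usec, int clamp)
{
	unsigned t = (unsigned)(ticks_per_usec * TIME_UNITS_PER_SEC *
				((double)size / rate_Bps));

	return (clamp && t == 0) ? 1 : t;
}
```

At 500Mbit (62,500,000 bytes/s) and one tick per microsecond, an 8 byte cell takes 0.128 ticks and truncates to 0 without the clamp, so such sizes would not be charged at all; with the clamp it is charged 1 tick, which overaccounts rather than leaving it unlimited. An 800 byte packet comes out as 12 ticks either way.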
I'm not sure if I'm able to test this patch. What do you mean by "smallest sizes"? Are you talking about packet size? What kind of sizes?

When I feed my bridge with 950Mbits/s of packets with 800 bytes, that is close to 150.000pps and CPUs start to get busy. For packets 100 bytes long, 150.000pps would be close to 125Mbits/s and CPUs start to get busy already, so I'm not able to get close to 500Mbits/s. For rates near 125Mbits/s the bad accuracy is not so pronounced. For packets of 100 bytes, when increasing the traffic sent by the analyser, at some point it is not HTB shaping but the CPU that can't process so many packets. I might have misunderstood your point.

I applied this tc_core.c patch and for packets of 800 bytes it had no effect on HTB accuracy with rates over 500Mbit. Anyway, I also tested it with packets of 100 bytes, generating 200Mbits, and the result is the same as without this patch:

With the patch:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 100000Kbit ceil 100000Kbit burst 14087b/8 mpu 0b overhead 0b cburst 14087b/8 mpu 0b overhead 0b level 0
Sent 2187884640 bytes 22790465 pkt (dropped 8624566, overlimits 0 requeues 0)
rate 124946Kbit 162691pps backlog 0b 0p requeues 0
lended: 22790465 borrowed: 0 giants: 0
tokens: 180 ctokens: 180

Without the patch:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 100000Kbit ceil 100000Kbit burst 14087b/8 mpu 0b overhead 0b cburst 14087b/8 mpu 0b overhead 0b level 0
Sent 1260235680 bytes 13127455 pkt (dropped 4531299, overlimits 0 requeues 0)
rate 124575Kbit 162207pps backlog 0b 0p requeues 0
lended: 13127455 borrowed: 0 giants: 0
tokens: 123 ctokens: 123

Thanks
Antonio Almeida
--
I forgot to tell you that I used tc source code from iproute2-2.6.16. I couldn't use the newest version because I got errors when compiling. Antonio Almeida --
I still have no clue about the reason, but it would be really nice to do some short test with a more current kernel (>= 2.6.27) and iproute2 (to exclude the possibility of some incompatibility in configs, e.g. rate tables passed to htb). Thanks, Jarek P. --
I installed kernel 2.6.29 (finally! It wasn't easy... I couldn't set memory split 2G/2G), but the results are the same. I've already applied the gen_estimator.c patches (works fine).
# tc -s -d class ls dev eth1 | head -24
class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
 Sent 119955303928 bytes 150697618 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621844Kbit 97651pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 402 ctokens: 402

class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 5
 Sent 119955303928 bytes 150697618 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621844Kbit 97651pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 402 ctokens: 402

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 119955366812 bytes 150697697 pkt (dropped 76696483, overlimits 0 requeues 0)
 rate 621847Kbit 97652pps backlog 0b 79p requeues 0
 lended: 150697618 borrowed: 0 giants: 0
 tokens: -5 ctokens: -5

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 6
 Sent 119955303928 bytes 150697618 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621844Kbit 97651pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 402 ctokens: 402

# cat /sys/module/sch_htb/parameters/htb_hysteresis
0
# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic ...
Actually, of these two I was more interested in an iproute2 that better fits the kernel version. :-( (It should be enough to have at least tc compiled properly, I guess.) Btw.: if at any point you think this testing is too disturbing for you etc., feel free to stop it or postpone it as you like. Thanks, Jarek P. --
I installed iproute2-ss090115 with the new patch but the results are the same for my test scenario. HTB keeps sending 620Mbit/s when I
I'm working on this, don't worry. Since I have a traffic generator/analyser, I can test any modification you make. Feel free to ask. I've been looking inside the htb source code. The granularity problem could be in the use of qdisc_rate_table or near that.
Antonio Almeida
--
Yes, but according to my assessment there should be "only" 50Mbit difference for this rate/packet size. Anyway, could you try a testing patch below, which should add some granularity to this rate table?
Thanks,
Jarek P.
---
include/net/pkt_sched.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index e37fe31..f0faf03 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -42,8 +42,8 @@ typedef u64 psched_time_t;
typedef long psched_tdiff_t;
/* Avoid doing 64 bit divide by 1000 */
-#define PSCHED_US2NS(x) ((s64)(x) << 10)
-#define PSCHED_NS2US(x) ((x) >> 10)
+#define PSCHED_US2NS(x) ((s64)(x) << 6)
+#define PSCHED_NS2US(x) ((x) >> 6)
#define PSCHED_TICKS_PER_SEC PSCHED_NS2US(NSEC_PER_SEC)
#define PSCHED_PASTPERFECT 0
--
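To see why the shift matters, here is a minimal sketch (not kernel code) of the arithmetic behind the PSCHED_US2NS/PSCHED_NS2US change. It assumes one scheduler tick is 2^shift nanoseconds, as PSCHED_NS2US() implies; the helper names are made up for illustration.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Scheduler ticks per second when one tick is 2^shift nanoseconds
 * (mirrors PSCHED_NS2US(NSEC_PER_SEC) for the given shift). */
static uint64_t psched_ticks_per_sec(unsigned shift)
{
	return NSEC_PER_SEC >> shift;
}

/* Rate in bit/s that results when every packet of `bytes` bytes is
 * charged exactly `ticks` ticks of transmission time. */
static double effective_rate_bps(unsigned bytes, uint64_t ticks, unsigned shift)
{
	double seconds = (double)ticks * (double)(1ULL << shift) / 1e9;

	return bytes * 8.0 / seconds;
}
```

With a shift of 10, charging an 800-byte packet 10 ticks instead of 11 moves the resulting rate from roughly 568Mbit/s to 625Mbit/s; with a shift of 6 a one-tick difference near the same rate is only a few Mbit/s.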
It's better! This patch gives more accuracy to HTB. Here are some values. Note that these are boundary values, so, e.g., any HTB configuration between 377000Kbit and 400000Kbit would fall in the same step - close to 397977Kbit. This test was made under the same conditions: generating 950Mbit/s of unidirectional tcp traffic of 800-byte packets.
leaf class ceil    leaf class sent rate (tc -s values)
376000Kbit         375379Kbit
--
377000Kbit         397977Kbit
400000Kbit         397973Kbit
--
401000Kbit         425199Kbit
426000Kbit         425199Kbit
--
427000Kbit         456389Kbit
457000Kbit         456409Kbit
--
458000Kbit         490111Kbit
492000Kbit         490138Kbit
--
493000Kbit         531957Kbit
533000Kbit         532078Kbit
--
534000Kbit         581835Kbit
581000Kbit         581820Kbit
--
582000Kbit         637809Kbit
640000Kbit         637709Kbit
--
641000Kbit         710526Kbit
711000Kbit         710553Kbit
--
712000Kbit         795921Kbit
800000Kbit         795901Kbit
--
801000Kbit         912706Kbit
914000Kbit         912782Kbit
--
915000Kbit
--
Here are more values for an HTB ceil configuration of 555Mbit/s, changing packet size:
800 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 18731000768 bytes 23531408 pkt (dropped 15715520, overlimits 0 requeues 0)
 rate 581832Kbit 91368pps backlog 0b 110p requeues 0
 lended: 23531298 borrowed: 0 giants: 0
 tokens: -16091 ctokens: -16091
850 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 30556163150 bytes 37645600 pkt (dropped 25746491, overlimits 0 requeues 0)
 rate 565509Kbit 83556pps backlog 0b 15p requeues 0
 lended: 37645585 borrowed: 0 giants: 0
 tokens: -16010 ctokens: -16010
950 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 51363059854 bytes ...
On Fri, 29 May 2009 18:02:39 +0100 You really need to get a better box than the dual core AMD. There is only millisecond (or worse with HZ=100) resolution possible because there is no working TSC on that hardware. -- --
I think this could cause problems with peak rates but IMHO there is no reason for htb to miss per second (4s) estimations against the same clock. Plus it mostly confirms theoretical limits of currently used rate tables vs. usecond time/ticket accounting. Jarek P. --
Good news! So it seems there are no other reasons for this inaccuracy than too coarse granularity, but I have yet to check this. Alas, something more than this patch is needed, because it probably breaks other things like hfsc.
Thanks,
--
On Fri, 29 May 2009 21:46:43 +0200 Why would it break hfsc, if it isn't already broken? --
I might be wrong but e.g. these usecs could be one reason:
/* convert d (us) into dx (psched us) */
static u64
d2dx(u32 d)
{
	u64 dx;

	dx = ((u64)d * PSCHED_TICKS_PER_SEC);
	dx += USEC_PER_SEC - 1;
	do_div(dx, USEC_PER_SEC);
	return dx;
}
And maybe these shifts need some adjustment:
m = (sm * PSCHED_TICKS_PER_SEC) >> SM_SHIFT;
Jarek P.
--
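For readers following the HFSC concern, the d2dx() conversion quoted from sch_hfsc.c can be restated in userspace as a rough sketch. The ticks-per-second value is passed in as a parameter here (an assumption made for illustration; in the kernel it is the PSCHED_TICKS_PER_SEC macro), and do_div() is replaced by a plain 64-bit division.

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* Convert a duration in microseconds to scheduler ticks, rounding up.
 * ticks_per_sec stands in for PSCHED_TICKS_PER_SEC. */
static uint64_t d2dx_sketch(uint32_t d_usec, uint64_t ticks_per_sec)
{
	uint64_t dx = (uint64_t)d_usec * ticks_per_sec;

	dx += USEC_PER_SEC - 1;		/* round up instead of truncating */
	return dx / USEC_PER_SEC;
}
```

Because the result scales with the number of ticks per second, any code that still treats the returned value as "roughly microseconds" would be off once the tick length changes, which seems to be the kind of place worth checking.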
Here is a tc patch, which should minimize these boundaries, so please,
repeat this test with previous patches/conditions plus this one.
Thanks,
Jarek P.
---
tc/tc_core.c | 10 +++++-----
tc/tc_core.h | 4 ++--
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/tc/tc_core.c b/tc/tc_core.c
index 9a0ff39..6d74287 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -27,18 +27,18 @@
static double tick_in_usec = 1;
static double clock_factor = 1;
-int tc_core_time2big(unsigned time)
+int tc_core_time2big(double time)
{
- __u64 t = time;
+ __u64 t;
- t *= tick_in_usec;
+ t = time * tick_in_usec + 0.5;
return (t >> 32) != 0;
}
-unsigned tc_core_time2tick(unsigned time)
+unsigned tc_core_time2tick(double time)
{
- return time*tick_in_usec;
+ return time * tick_in_usec + 0.5;
}
unsigned tc_core_tick2time(unsigned tick)
diff --git a/tc/tc_core.h b/tc/tc_core.h
index 5a693ba..0ac65aa 100644
--- a/tc/tc_core.h
+++ b/tc/tc_core.h
@@ -13,8 +13,8 @@ enum link_layer {
};
-int tc_core_time2big(unsigned time);
-unsigned tc_core_time2tick(unsigned time);
+int tc_core_time2big(double time);
+unsigned tc_core_time2tick(double time);
unsigned tc_core_tick2time(unsigned tick);
unsigned tc_core_time2ktime(unsigned time);
unsigned tc_core_ktime2time(unsigned ktime);
--
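The effect of the tc_core.c change above can be sketched in isolation. The old prototype took an `unsigned`, so the microsecond value computed by the caller was truncated once on the way in and again after scaling; the patched version keeps the fraction and rounds to nearest. The function names and the tick_in_usec value of 15.625 (a 64ns tick) below are assumptions for illustration only.

```c
/* Old behaviour: the double computed by the caller is converted to
 * unsigned at the call, then the scaled result is truncated again. */
static unsigned time2tick_old(double usec, double tick_in_usec)
{
	unsigned time = (unsigned)usec;	/* implicit in the old prototype */

	return (unsigned)(time * tick_in_usec);
}

/* Patched behaviour: keep the fractional microseconds, round to nearest. */
static unsigned time2tick_new(double usec, double tick_in_usec)
{
	return (unsigned)(usec * tick_in_usec + 0.5);
}
```

For a transmission time of about 11.53us (800 bytes at 555Mbit/s), the old path yields 171 ticks and the new one 180 ticks, so the fraction of a microsecond that used to be dropped is now accounted for.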
I'm getting great values with this patch!
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70970b/8 mpu 0b overhead 0b cburst 70970b/8 mpu 0b overhead 0b level 0
 Sent 14270693572 bytes 17928007 pkt (dropped 12579262, overlimits 0 requeues 0)
 rate 552755Kbit 86802pps backlog 0b 127p requeues 0
 lended: 17927880 borrowed: 0 giants: 0
 tokens: -16095 ctokens: -16095
(for packets of 800 bytes)
I'll get back to you with more values.
Antonio Almeida
--
The steps are much smaller and the error stays below 1%. Injecting over 950Mbps of tcp packets of 800 bytes I get these values:
Configuration   Sent rate     error (%)
498000Kbit      495023Kbit    0,60
499000Kbit      497456Kbit    0,31
500000Kbit      497498Kbit    0,50
501000Kbit      497496Kbit    0,70
502000Kbit      499986Kbit    0,40
503000Kbit      499978Kbit    0,60
504000Kbit      502520Kbit    0,29
696000Kbit      690964Kbit    0,72
697000Kbit      695782Kbit    0,17
698000Kbit      695783Kbit    0,32
699000Kbit      695783Kbit    0,46
700000Kbit      695795Kbit    0,60
701000Kbit      695786Kbit    0,74
702000Kbit      700703Kbit    0,18
896000Kbit      888383Kbit    0,85
897000Kbit      896289Kbit    0,08
904000Kbit      896389Kbit    0,84
905000Kbit      904542Kbit    0,05
Antonio Almeida
--
Nice values - should be acceptable, I guess. Alas this is not all, and I'll ask you soon for re-testing HFSC (after another patch) or maybe even some simple CBQ setup ;-) Thank you very much for testing, --
I didn't follow the full discussion, so I'm not sure which kind of arithmetic error you're attempting to cure. For the HFSC scaling factors, please just keep in mind that it's also supposed to be very accurate at low bandwidths. --
It's all here: http://permalink.gmane.org/gmane.linux.network/129301 Of course, I'd appreciate any suggestions. Thanks, Jarek P. --
I've read through the mails where you suggested to change the scaling factors. I wasn't able to find the reasoning (IOW: where does it The HFSC shifts would indeed need adjustments if the US<->NS conversion factor were to change. --
I described the reasoning here: http://permalink.gmane.org/gmane.linux.network/128189 Of course, we could try some other solution than changing the scaling. I considered a possibility to do it internally in htb, even with skipping rate tables, but the change of the scaling seems to be the most generic way (alas there are some odd compatibility issues in iproute/tc like TIME_UNITS_PER_SEC or "if (nom == 1000000)" to make it really consistent/readable). Jarek P. --
Jarek Poplawski wrote, On 06/02/2009 11:37 PM: The link is stuck now, so here is a quote: Jarek P. --
I see. Unfortunately changing the scaling factors is pushing the lower end towards overflowing. For example, Denys Fedoryshchenko reported some breakage a few years ago when I changed the iproute-internal factors, triggered by this command:
.. tbf buffer 1024kb latency 500ms rate 128kbit peakrate 256kbit minburst 16384
The burst size calculated by TBF with the current parameters is 64000000. Increasing it by a factor of 16 as in your patch results in 1024000000. That means we're getting dangerously close to overflowing: a buffer size increase or a rate decrease by a factor slightly bigger than 4 will already overflow.
Mid-term we really need to move to 64 bit values and ns resolution, otherwise this problem is just going to reappear as soon as someone tries 10gbit. I'm not sure what the best short-term fix is; I feel a bit uneasy about changing the current factors given how close this brings us to overflowing.
--
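The figures in the message above can be reproduced with a small sketch. This is not how tc computes the value (tc works in floating-point microseconds); it is the same arithmetic in integers, assuming tc's unit convention that 1024kb means 1024*1024 bytes and 128kbit means 128000 bit/s, and assuming a tick of 2^shift nanoseconds. The function names are hypothetical.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Time needed to send `bytes` at `rate_bps`, expressed in ticks of
 * 2^shift ns. The 64-bit intermediate is fine for the sizes used here. */
static uint64_t burst_ticks(uint64_t bytes, uint64_t rate_bps, unsigned shift)
{
	uint64_t ns = bytes * 8 * NSEC_PER_SEC / rate_bps;

	return ns >> shift;
}

/* Does the tick count still fit into the 32-bit fields used today? */
static int fits_u32(uint64_t ticks)
{
	return ticks <= UINT32_MAX;
}
```

With a 1.024us tick (shift 10) the 1024kb buffer at 128kbit comes out at 64000000 ticks, with a 64ns tick (shift 6) at 1024000000. Dropping the rate to 32kbit still fits into 32 bits at shift 6, while 16kbit no longer does.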
I completely agree it's on the verge of overflow, and actually would overflow for some insanely low (by today's standards) rates. So I treat it as a temporary solution, until people start asking about more than 1 or 2Gbit. And of course we will have to move to 64 bit anyway. Or we can do it now... Btw., I have some doubts about HFSC; it's really different from the others wrt. rate tables/time accounting, and these PSCHED_TICKS look only like an unnecessary compatibility; it works OK with usecs and doesn't need this change now, unless I'm missing something. So maybe we would simply stop using the common psched_get_time() for it, and only do a conversion for qdisc_watchdog_schedule() etc.? Thanks, Jarek P. --
That (now) would certainly be the best solution, but it's a non-trivial
Yes, it would work perfectly fine with usecs, which is actually (and unfortunately) the unit it uses in its ABI. But I think it's better to convert the values once during initialization, instead of again and again when scheduling the watchdog. The necessary changes are really trivial: all you need to do when changing the scaling factors is to increase SM_MASK and decrease ISM_MASK accordingly.
--
On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote: Right! (On the other hand we could consider a separate watchdog too...) Jarek P. --
We could :) But I don't see any benefit doing that, especially given that eventually everything should be using ns resolution anyways. --
The main benefit would be readability... I guess it's no problem for
you, but I'm currently trying to make sure things like this are/will
be OK :-)
dx = ((u64)d * PSCHED_TICKS_PER_SEC);
dx += USEC_PER_SEC - 1;
Jarek P.
--
On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:
OK, looks like it's really enough and I was confused with some rounding, thanks Patrick.
Antonio, could you give this patch a try (with all the previous) and repeat those HFSC tests you did before (plus maybe a few tries with lower rates)?
Thanks,
Jarek P.
---
net/sched/sch_hfsc.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 5022f9c..7c53a36 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -384,8 +384,9 @@ cftree_update(struct hfsc_class *cl)
*
* 1.024us/byte 78.125 7.8125 0.78125 0.078125 0.0078125
*/
-#define SM_SHIFT 20
-#define ISM_SHIFT 18
+#define PSCHED_SHIFT 6 /* TODO: move to pkt_sched.h */
+#define SM_SHIFT (30 - PSCHED_SHIFT)
+#define ISM_SHIFT (8 + PSCHED_SHIFT)
#define SM_MASK ((1ULL << SM_SHIFT) - 1)
#define ISM_MASK ((1ULL << ISM_SHIFT) - 1)
--
Looks fine in principle, but considering your change to the generic --
Actually I'm confused, why the additional change of 10? --
If you wanted to console me after my hfsc confusions, you did it! Thanks again, Jarek P. --
For me, HTB values are just perfect! I would say that they're better than HFSC, since the sent rate stays below the configured ceil (but that's for me). After applying the patch you sent (to sch_hfsc.c) I got these values for HFSC:
configuration   analyser RX   error (%)
10000000        10062688      0,63
20000000        20096961      0,48
30000000        30135028      0,45
40000000        40186190      0,47
50000000        50294890      0,59
60000000        60294553      0,49
70000000        70284220      0,41
80000000        80414272      0,52
90000000        90354675      0,39
100000000       100453024     0,45
200000000       200962041     0,48
250000000       251467886     0,59
300000000       301422613     0,47
400000000       402123479     0,53
500000000       502356820     0,47
550000000       552988253     0,54
600000000       602956905     0,49
700000000       703405632     0,49
750000000       753949085     0,53
800000000       804315169     0,54
900000000       904584208     0,51
As usual, generating 970Mbit/s of tcp traffic of 800-byte packets. Here's the setup picture:
# tc -s -d class ls dev eth1
class hfsc 1: root
 Sent 253924 bytes 319 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 period 0 level 4

class hfsc 1:1 parent 1: sc m1 0bit d 0us m2 1000Mbit ul m1 0bit d 0us m2 1000Mbit
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 period 2 work 299437688 bytes level 3

class hfsc 1:10 parent 1:2 sc m1 0bit d 0us m2 1000Mbit ul m1 0bit d 0us m2 1000Mbit
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 period 2 work 299437688 bytes level 1

class hfsc 1:2 parent 1:1 sc m1 0bit d 0us m2 1000Mbit ul m1 0bit d 0us m2 1000Mbit
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 period 2 work 299437688 bytes level 2

class hfsc 1:108 parent 1:10 sc m1 0bit d 50.0ms m2 500000Kbit ul m1 0bit d 0us m2 500000Kbit
 Sent 300178764 bytes 377109 pkt (dropped 349464, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 931p requeues 0
 period 2 work 299437688 ...
Very nice, it looks like HFSC precision isn't affected by these changes. OK, I'll browse other schedulers, and if there is nothing suspicious I'll submit these patches. Thank you very much for cooperation! Jarek P. --
Please give me a day to have another look at this, I didn't find any time today. In most areas the overflows are only occurring when crossing IMO unreasonable boundaries (but I've been wrong about that before), but tc_cbq_calc_maxidle() is still making me nervous. --
Sure, I planned similar time for browsing it yet, as well. Thanks, Jarek P. --
Hello! Is there any progress on applying this patch set? I'm very interested in seeing these patches in the mainline kernel tree. We would like to use HTB for speeds above 1G (converting 10 servers x 1G to a few with 10G Intel multi-queue network devices). Thanks for your work! --
Hi, I'll try to send patches today, but they are expected to work with 1G or maybe a little more. I'm not sure higher rates make sense without tso/gso, which isn't properly handled by packet schedulers anyway, so more time/feedback/testing will be needed to go further. Regards, Jarek P. --
From: Patrick McHardy <kaber@trash.net> We could pass in a new attribute which provides the upper-32bits of the value. I'm not sure if that works in this case but it's an idea. --
I'm not sure it could be so simple: I guess Patrick is concerned with
a new tc talking to an old kernel (otherwise a kernel should recognize
an old format). Then it would need something reasonable in 32bits.
But, I'm not even sure we need 64bit rate tables. We could
alternatively use (after checking a kernel can handle this)
simply a log to shift these values in kernel to u64:
- static inline u32 qdisc_l2t(struct qdisc_rate_table* rtab, unsigned int pktlen)
+ static inline u64 qdisc_l2t(struct qdisc_rate_table* rtab, unsigned int pktlen)
{
...
- return rtab->data[slot];
+ return (u64)rtab->data[slot] << rtab->rate.rate_log;
}
Since these overflows are for low rates, this rounding of lower bits
shouldn't matter here. So, IMHO, it's more about adding this overhead
of u64 to the kernel now.
Jarek P.
--
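The shift idea proposed for qdisc_l2t() can be illustrated with a userspace sketch. Everything here is hypothetical: the struct is a stand-in rather than the kernel's struct qdisc_rate_table, and the rate_log field follows the name used in the proposal, not an existing kernel field. The table keeps 32-bit cells, the writer drops the low bits, and the reader widens to 64 bits before shifting back.

```c
#include <stdint.h>

/* Hypothetical stand-in for a rate table with 32-bit cells. */
struct rtab_sketch {
	uint32_t data[256];
	unsigned rate_log;	/* hypothetical: log2 scale applied to each cell */
};

/* Writer side: scale a 64-bit tick count down so it fits a cell. */
static uint32_t rtab_store(uint64_t ticks, unsigned rate_log)
{
	return (uint32_t)(ticks >> rate_log);
}

/* Reader side: widen first, then shift, so nothing is lost to
 * 32-bit overflow during the shift itself. */
static uint64_t rtab_load(const struct rtab_sketch *r, unsigned slot)
{
	return (uint64_t)r->data[slot] << r->rate_log;
}
```

The only information lost is the low rate_log bits of each cell, which matters little for the large values (low rates) where overflow is a concern.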
You're right: if there were only 800 byte packets this patch shouldn't matter. It should matter e.g. if these 800 byte packets were mixed with 100 byte packets, rate 550Mbit, and HZ 1000. Btw., could you send your .config (gzipped)? I guess I have to look for some other reason yet. Thanks, --
Hmm... And if it's not a big problem I'd also ask you to try this test with 555000Kbit rate for 850 and 900 byte packets. (It can wait.) Thanks again, Jarek P. --
Precise measurements:
800 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 46793626324 bytes 57771194 pkt (dropped 29920019, overlimits 0 requeues 0)
 rate 621714Kbit 97631pps backlog 0b 126p requeues 0
 lended: 57771068 borrowed: 0 giants: 0
 tokens: -8 ctokens: -8
850 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 63422144616 bytes 77714246 pkt (dropped 41012275, overlimits 0 requeues 0)
 rate 600699Kbit 88756pps backlog 0b 127p requeues 0
 lended: 77714119 borrowed: 0 giants: 0
 tokens: -11 ctokens: -11
900 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 76868403562 bytes 92835297 pkt (dropped 48565133, overlimits 0 requeues 0)
 rate 636195Kbit 88755pps backlog 0b 126p requeues 0
 lended: 92835171 borrowed: 0 giants: 0
 tokens: -7 ctokens: -7
If you need more values, feel free to ask.
Antonio Almeida
--
Since you're so kind... :-) There is a line in net/sched/sch_htb.c:
#define HTB_HYSTERESIS 1 /* whether to use mode hysteresis for speedup */
Could you change 1 to 0, and repeat these tests above after recompiling?
More thanks,
Jarek P.
--
Setting HTB_HYSTERESIS to 0 doesn't seem to make any difference. Here are the values using #define HTB_HYSTERESIS 0
800 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 9773257752 bytes 12277962 pkt (dropped 6292541, overlimits 0 requeues 0)
 rate 621796Kbit 97644pps backlog 0b 127p requeues 0
 lended: 12277835 borrowed: 0 giants: 0
 tokens: -7 ctokens: -7
850 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 18225005732 bytes 22409017 pkt (dropped 11937269, overlimits 0 requeues 0)
 rate 600890Kbit 88796pps backlog 0b 43p requeues 0
 lended: 22408974 borrowed: 0 giants: 0
 tokens: -2 ctokens: -2
900 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 29790867368 bytes 35400708 pkt (dropped 18399726, overlimits 0 requeues 0)
 rate 636361Kbit 88779pps backlog 0b 127p requeues 0
 lended: 35400581 borrowed: 0 giants: 0
 tokens: -2 ctokens: -2
Antonio Almeida
--
6292541 dropped from 12277962 pkt means 51% dropped. Maybe something fishy here? Can you try BFIFO instead of SFQ? For a 100ms buffer at 550Mbit/s it will be a ~6875000 byte bfifo. The current queue is, by the way, too short for this bandwidth IMHO; 127 packets is not enough. 127 packets of 800 bytes can buffer 1 second only at 812Kbit/s, and at 550Mbit/s they will buffer data for only ~2ms. --
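The buffer arithmetic in the message above is easy to check with a short sketch. The helper names are made up for illustration; rates are in bit/s and the queue is assumed to hold fixed-size packets.

```c
#include <stdint.h>

/* Bytes needed to hold `ms` milliseconds of traffic at `rate_bps`. */
static uint64_t bfifo_limit_bytes(uint64_t rate_bps, unsigned ms)
{
	return rate_bps / 8 * ms / 1000;
}

/* Bits held by a queue of `packets` packets of `bytes` bytes each;
 * this is also the rate in bit/s that the queue buffers for one second. */
static uint64_t queue_bits(unsigned packets, unsigned bytes)
{
	return (uint64_t)packets * bytes * 8;
}

/* How many milliseconds of traffic at `rate_bps` such a queue holds. */
static double queue_depth_ms(unsigned packets, unsigned bytes, uint64_t rate_bps)
{
	return (double)queue_bits(packets, bytes) * 1000.0 / (double)rate_bps;
}
```

For 550Mbit/s and 100ms this gives the 6875000 bytes suggested as the bfifo limit, and a queue of 127 packets of 800 bytes holds 812800 bits, which is about 1.5ms at 550Mbit/s.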
Sure, if the queue is too short we could have a problem with reaching the expected rate; but here it's all backwards - it could actually "help" with the stats. ;-) Jarek P. --
Well, I had real experience with HTB: when I set too-short buffers on my QoS qdiscs, the incoming rate jumped much higher than the overall limit. When I set larger buffers (and, by the way, dropped sfq and used bfifo), it went down. No idea why: a bug, or something specific in the protocols' congestion control. Maybe worth a try... --
Very strange. Anyway, "overlimits 0" suggests HTB always got packets when it needed... Jarek P. --
I tested it with BFIFO using limit 6875000. (The analyser keeps sending 950Mbit/s of 800-byte tcp packets - lots of drops for sure.) The backlog is now huge but the throughput stays much higher than the configured ceil.
# tc -s -d class ls dev eth1
class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 5
 Sent 9542831672 bytes 11988482 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621765Kbit 97639pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: -186 ctokens: -186

class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
 Sent 9542831672 bytes 11988482 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621765Kbit 97639pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: -186 ctokens: -186

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 6
 Sent 9542831672 bytes 11988482 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621765Kbit 97639pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: -186 ctokens: -186

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst 70901b/8 mpu 0b overhead 0b level 0
 Sent 9549705928 bytes 11997118 pkt (dropped 6092846, overlimits 0 requeues 0)
 rate 621764Kbit 97639pps backlog 0b 8636p requeues 0
 lended: 11988482 borrowed: 0 giants: 0
 tokens: -1008 ctokens: -1008

# tc -s -d qdisc ls dev eth1
qdisc htb 1: root r2q 10 default 0 direct_packets_stat 11955 ver 3.17
 Sent 9608660872 bytes 12071182 pkt (dropped 6124502, overlimits 18190041 requeues 0)
 rate 0bit 0pps backlog 0b 8636p requeues 0
qdisc bfifo 108: parent 1:108 limit 6875000b
 Sent 9599144692 bytes 12059227 pkt (dropped 6124502, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 6874256b 8636p requeues 0
Antonio ...
OK, so it looks like some hidden bug yet. Many thanks for now, --
Note that it's runtime-adjustable via:
/sys/module/sch_htb/parameters/htb_hysteresis
since kernel version v2.6.26.
Cheers,
Jesper Brouer
--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--
Yes, this should convince Antonio to try something newer. (Alas it didn't seem to make much difference to his case ;-) Cheers, Jarek P. --
-----------> (One misspelling fixed.)
Return non-zero tc_calc_xmittime() for rate tables
While looking at the problem of HTB accuracy for high speed (~500Mbit
rates) I've found that rate tables have cells filled with zeros for
the smallest sizes. It means such packets aren't accounted at all.
Apart from the correctness of such configs, let's make it safe with
rather overaccounting than leaving it unlimited.
Reported-by: Antonio Almeida <vexwek@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
tc/tc_core.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/tc/tc_core.c b/tc/tc_core.c
index 9a0ff39..14f25bc 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -58,7 +58,9 @@ unsigned tc_core_ktime2time(unsigned ktime)
unsigned tc_calc_xmittime(unsigned rate, unsigned size)
{
- return tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+ unsigned t;
+ t = tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+ return t ? : 1;
}
unsigned tc_calc_xmitsize(unsigned rate, unsigned ticks)
--
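The zero-cell problem described in the commit message above can be reproduced with a rough sketch. It models the unpatched call chain under stated assumptions: the rate is in bytes per second (as tc passes it), the microsecond value is truncated to `unsigned` before scaling, and tick_in_usec is 0.9765625 for a 1.024us tick. The function names are hypothetical, and the portable `t ? t : 1` is used instead of the GNU `?:` shorthand.

```c
/* Transmission time in ticks as computed before the fix: zero for
 * packets whose transmission time is below one microsecond. */
static unsigned xmittime_ticks(unsigned rate_Bps, unsigned size, double tick_in_usec)
{
	unsigned usec = (unsigned)(1000000.0 * ((double)size / rate_Bps));

	return (unsigned)(usec * tick_in_usec);
}

/* Same computation, but never returning zero, so that even the
 * smallest packets are charged at least one tick. */
static unsigned xmittime_ticks_nonzero(unsigned rate_Bps, unsigned size, double tick_in_usec)
{
	unsigned t = xmittime_ticks(rate_Bps, size, tick_in_usec);

	return t ? t : 1;
}
```

At 555Mbit/s (69375000 bytes/s) an 8-byte cell takes about 0.12us, so the unpatched computation yields 0 ticks and such a packet would pass unaccounted, while the patched variant charges one tick.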
Hi Antonio, FYI, these are exactly the same problems I get in real life. Check the later posts in "bond + tc regression" thread. -- Best Regards Vladimir Ivashchenko Chief Technology Officer PrimeTel, Cyprus - www.prime-tel.com --
