Re: HTB accuracy for high speed

From: Antonio Almeida
Date: Friday, May 15, 2009 - 7:49 am

Hi!
I've been using HTB in a Linux bridge and recently I noticed that, at
high speeds, the configured rate/ceil is not respected the way it is at
lower speeds.
I'm using a packet generator/analyser to inject over 950Mbps and see
what comes back to it on the other side of my bridge. Generated
packets are 800 bytes long. I noticed that, for several tc HTB rate/ceil
configurations, the amount of traffic received by the analyser stays
the same. See these values:

HTB conf      Analyser reception
476000Kbit    544.260.329
500000Kbit    545.880.017
510000Kbit    544.489.469
512000Kbit    546.890.972
-------------------------
513000Kbit    596.061.383
520000Kbit    596.791.866
550000Kbit    596.543.271
554000Kbit    596.193.545
-------------------------
555000Kbit    654.773.221
570000Kbit    654.996.381
590000Kbit    655.363.253
605000Kbit    654.112.017
-------------------------
606000Kbit    728.262.237
665000Kbit    727.014.365
-------------------------

There are these steps, and it looks like it doesn't matter whether I
configure HTB to 555Mbit or to 605Mbit - the result is the same: 654Mbit.
This is 18% more traffic than the configured value. I also noticed that
for smaller packets it gets worse, reaching 30% more traffic than what I
configured. For packets of 1514 bytes the accuracy is quite good.
I'm using kernel 2.6.25
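
The overshoot figures quoted in this message can be recomputed from the table. The sketch below is only an illustration in Python, not part of the original report. It assumes the analyser column is in bit/s and that tc's Kbit means 1000 bit/s; the names `samples` and `overshoot_percent` are made up for the example.

```python
# Illustrative only: recompute the overshoot from the table in this message.
# Assumptions: analyser values are bit/s; tc "Kbit" = 1000 bit/s.

def overshoot_percent(configured_kbit, received_bps):
    """Return how far the received rate exceeds the configured rate, in percent."""
    configured_bps = configured_kbit * 1000
    return (received_bps - configured_bps) / configured_bps * 100.0

# (configured Kbit, received bit/s): first and last row of each plateau
samples = [
    (476_000, 544_260_329), (512_000, 546_890_972),
    (513_000, 596_061_383), (554_000, 596_193_545),
    (555_000, 654_773_221), (605_000, 654_112_017),
    (606_000, 728_262_237), (665_000, 727_014_365),
]

if __name__ == "__main__":
    for kbit, received in samples:
        print(f"{kbit:>7} Kbit -> {received / 1e6:7.1f} Mbit/s "
              f"({overshoot_percent(kbit, received):+5.1f}%)")
```

The 555000Kbit row comes out at roughly 18%, matching the figure in the text. Within each plateau the overshoot shrinks as the configured rate gets closer to the rate actually delivered.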

My 'tc -s -d class ls dev eth1' output:

class htb 1:10 parent 1:2 rate 1000Mbit ceil 1000Mbit burst 126375b/8
mpu 0b overhead 0b cburst 126375b/8 mpu 0b overhead 0b level 5
 Sent 51888579644 bytes 62067679 pkt (dropped 0, overlimits 0 requeues 0)
 rate 653124Kbit 97656pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 113 ctokens: 113

class htb 1:1 root rate 1000Mbit ceil 1000Mbit burst 126375b/8 mpu 0b
overhead 0b cburst 126375b/8 mpu 0b overhead 0b level 7
 Sent 51888579644 bytes 62067679 pkt (dropped 0, overlimits 0 requeues 0)
 rate 653123Kbit 97656pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 113 ctokens: ...
From: Stephen Hemminger
Date: Friday, May 15, 2009 - 11:12 am

On Fri, 15 May 2009 15:49:31 +0100

You are probably hitting the limit of the timer resolution. So it matters
what the clock source is.  
    cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Also, is HFSC any better than HTB?

-- 
--

From: Antonio Almeida
Date: Monday, May 18, 2009 - 3:01 am

Hi!

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
returns "jiffies"

With HFSC the accuracy is good. Also with packets of 800 bytes I got
these values:
received     configured    error (%)
904596519    900000000     0.51
804293658    800000000     0.54
703662853    700000000     0.52
603354059    600000000     0.56
502805411    500000000     0.56
402527055    400000000     0.63
301484904    300000000     0.49
201074301    200000000     0.54
100546656    100000000     0.55


Thanks
  Antonio Almeida



On Fri, May 15, 2009 at 7:12 PM, Stephen Hemminger
--

From: Jarek Poplawski
Date: Monday, May 18, 2009 - 3:45 am

Looks great! But, since HFSC uses rates directly (without rate tables)

Thanks,
Jarek P.
--

From: Antonio Almeida
Date: Monday, May 18, 2009 - 5:27 am

> Looks great! But, since HFSC uses rates directly (without rate tables)

I'm not very familiar with how rate tables are used.
In fact I keep wondering about a lot of things the kernel does with
packets. Is there any documentation explaining how queue disciplines
work and how they interact with netfilter and tc_core? What about
packet dispatching?

Thanks
  Antonio Almeida
--

From: Jarek Poplawski
Date: Monday, May 18, 2009 - 5:32 am

Here are a few links:
http://yesican.chsoft.biz/lartc/index.html

Jarek P.
--

From: Stephen Hemminger
Date: Monday, May 18, 2009 - 9:13 am

On Mon, 18 May 2009 11:01:21 +0100

That is the slowest of the choices. Better ones are hpet and tsc, but your
hardware doesn't support them.

You should compile your kernel with HZ=1000 and the resolution will be better
(but with some loss of performance).
--

From: Antonio Almeida
Date: Monday, May 18, 2009 - 11:03 am

I have my kernel's timer frequency set to 1000Hz since the beginning.
I've got all these results with HZ_1000.

(I'm working on clocksource)

Thanks
  Antonio Almeida


On Mon, May 18, 2009 at 5:13 PM, Stephen Hemminger
--

From: Stephen Hemminger
Date: Monday, May 18, 2009 - 3:02 pm

On Mon, 18 May 2009 09:13:14 -0700

Are you using one of the AMD dual core machines?  That processor has the
design flaw that the TSC counter is not synced between cores, so the kernel can't
use it. You might even be better off running a non-SMP kernel on that box.
--

From: Antonio Almeida
Date: Tuesday, May 19, 2009 - 4:48 am

My machine has two dual-core AMD Opteron 280 processors. Do I have
that TSC problem?


# dmesg | grep AMD
OEM ID: AMD      Product ID: HAMMER       APIC at: 0xFEE00000
CPU0: AMD Dual Core AMD Opteron(tm) Processor 280 stepping 02
CPU1: AMD Dual Core AMD Opteron(tm) Processor 280 stepping 02
CPU2: AMD Dual Core AMD Opteron(tm) Processor 280 stepping 02
CPU3: AMD Dual Core AMD Opteron(tm) Processor 280 stepping 02


processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 280
stepping        : 2
cpu MHz         : 2394.039
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy ts fid vid ttp
bogomips        : 4790.36
clflush size    : 64


  Antonio Almeida
--

From: Antonio Almeida
Date: Tuesday, May 19, 2009 - 6:08 am

Do I have that TSC problem?

--

From: Jarek Poplawski
Date: Saturday, May 16, 2009 - 1:31 am

Are you sure there is no gso/tso enabled on this dev (checked with an
up-to-date ethtool -k)? It would also be nice to see more details, like
.config, ifconfigs before and after the test, tc -s qdisc, and the
bytes/packets numbers seen by this analyser, plus maybe some proof that
you can obtain such flows with something simpler like tbf. Of course,
using the current kernel, even if it makes no difference, would give us
a more valuable perspective.

Thanks,
Jarek P.
--

From: Antonio Almeida
Date: Monday, May 18, 2009 - 3:39 am

Hi!

Here is the information you asked for:

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

The bridge is between eth0 and eth1

---------------------------
Before traffic starts:
---------------------------
Analyser sent bytes: 0
Analyser sent packets: 0
Analyser received bytes: 0
Analyser received packets: 0


# tc -s -d class ls dev eth1
class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 990 ctokens: 990

class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 990 ctokens: 990

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 990 ctokens: 990

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 999 ctokens: 999


# ifconfig
br0       Link encap:Ethernet  HWaddr 00:E0:ED:10:7C:6C
          UP ...
From: Jarek Poplawski
Date: Monday, May 18, 2009 - 4:14 am

Very nice, but there are some questions:
- if this analyser uses tcp we definitely need tso off as well during
  these tests,
- it would be nice to use two patches I've sent to exclude known (now)
  reasons.

With the above I expect the accuracy to be better, but definitely not
like hfsc (plus the "higher than 1000Mbit rate reported after stopping"
effect should be gone).

Thanks,
...
--

From: Antonio Almeida
Date: Monday, May 18, 2009 - 5:05 am

The analyser traffic is tcp. With tso set to off, the accuracy stays the same.

# ethtool -K eth0 tso off
# ethtool -K eth1 tso off

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off


# tc -s -d class ls dev eth1 | head -24
class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5
 Sent 164938012460 bytes 206824215 pkt (dropped 0, overlimits 0 requeues 0)
 rate 652715Kbit 97655pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 402 ctokens: 402

class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
 Sent 164938012460 bytes 206824215 pkt (dropped 0, overlimits 0 requeues 0)
 rate 652715Kbit 97655pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 402 ctokens: 402

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6
 Sent 164938012460 bytes 206824215 pkt (dropped 0, overlimits 0 requeues 0)
 rate 652715Kbit 97655pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 402 ctokens: 402

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 164938040048 bytes 206824248 pkt (dropped 25827911, overlimits 0
requeues 0)
 rate 652715Kbit 97655pps backlog 0b 33p requeues 0
 lended: 206824215 borrowed: 0 giants: 0
 tokens: -6 ctokens: -6


I'm applying the patches now. I'll get back to you.

  Antonio ...
From: Jarek Poplawski
Date: Saturday, May 16, 2009 - 7:14 am

On Fri, May 15, 2009 at 03:49:31PM +0100, Antonio Almeida wrote:

This looks like a regular bug. I guess it's an overflow in
gen_estimator(), but I'm not sure there is nothing more. Could you
try the patch below? (An offset warning when patching 2.6.25 is OK)

Thanks,
Jarek P.
---

 net/core/gen_estimator.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..87f0ced 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -127,7 +127,11 @@ static void est_timer(unsigned long arg)
 		npackets = e->bstats->packets;
 		rate = (nbytes - e->last_bytes)<<(7 - idx);
 		e->last_bytes = nbytes;
-		e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
+		if (rate > e->avbps)
+			e->avbps += (rate - e->avbps) >> e->ewma_log;
+		else
+			e->avbps -= (e->avbps - rate) >> e->ewma_log;
+
 		e->rate_est->bps = (e->avbps+0xF)>>5;
 
 		rate = (npackets - e->last_packets)<<(12 - idx);
--
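
Why the reported rate could keep growing after the flow stopped can be illustrated with a small simulation of the two update formulas in the patch above. This is a sketch, not kernel code. It assumes a 32-bit `long` (as on a 32-bit x86 kernel), a scaled value of bytes/s << 5 held in a u32, and an `ewma_log` of 3; the function names are hypothetical.

```python
# Illustrative model of the avbps update in est_timer(), before and after the patch.
# Assumptions: 32-bit long, u32 avbps/rate, values scaled as bytes/s << 5.

MASK32 = 0xFFFFFFFF

def to_s32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def step_original(avbps, rate, ewma_log):
    """Model: e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;"""
    diff = to_s32(to_s32(rate) - to_s32(avbps))
    return (avbps + (diff >> ewma_log)) & MASK32

def step_patched(avbps, rate, ewma_log):
    """Model of the patched update: compare first, then shift an unsigned difference."""
    if rate > avbps:
        return (avbps + ((rate - avbps) >> ewma_log)) & MASK32
    return (avbps - ((avbps - rate) >> ewma_log)) & MASK32

def reported_bytes_per_sec(avbps):
    """Model: e->rate_est->bps = (e->avbps + 0xF) >> 5;"""
    return ((avbps + 0xF) & MASK32) >> 5

if __name__ == "__main__":
    start = 81_640_000 << 5          # about 653Mbit, above the 2^31 sign boundary
    old = new = start
    for tick in range(5):            # traffic has stopped: measured rate is 0
        old = step_original(old, 0, 3)
        new = step_patched(new, 0, 3)
        print(tick, reported_bytes_per_sec(old) * 8, reported_bytes_per_sec(new) * 8)
```

Under these assumptions, any scaled value above 2^31 (about 537Mbit) turns negative when cast to `long`, so with the original formula a measured rate of zero pushes the average up instead of down. With the patched formula it decays by one eighth per step.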

From: Antonio Almeida
Date: Monday, May 18, 2009 - 7:36 am

This patch works perfectly!
rate (bits/s) is now decreasing along with pps when I stop the traffic
(it doesn't grow as it used to for rates over 500Mbits/s).

# tc -s -d class ls dev eth1 | head -21 | tail -1
 rate 651960Kbit 97482pps backlog 0b 0p requeues 0
 rate 541134Kbit 80911pps backlog 0b 0p requeues 0
 rate 405850Kbit 60683pps backlog 0b 0p requeues 0
 rate 304388Kbit 45512pps backlog 0b 0p requeues 0
 rate 304388Kbit 45512pps backlog 0b 0p requeues 0
 rate 228291Kbit 34134pps backlog 0b 0p requeues 0
 rate 171218Kbit 25601pps backlog 0b 0p requeues 0
 rate 171218Kbit 25601pps backlog 0b 0p requeues 0
 rate 128414Kbit 19201pps backlog 0b 0p requeues 0
 rate 96310Kbit 14400pps backlog 0b 0p requeues 0
 rate 96310Kbit 14400pps backlog 0b 0p requeues 0
 rate 72233Kbit 10800pps backlog 0b 0p requeues 0
 rate 54174Kbit 8100pps backlog 0b 0p requeues 0


Thanks to you!
  Antonio Almeida




--

From: Vladimir Ivashchenko
Date: Monday, May 18, 2009 - 4:14 pm

I'm not able to reach full speed with bond + HTB + sfq on 2.6.29.1, both
with and without these patches. I seem to get a lot of drops on sfq
qdiscs, whatever quantum I set. Playing with IRQ affinity doesn't help.
I didn't check without bond.

With bond + HFSC + sfq, I'm able to reach the speed. It doesn't seem to
overspill with 580 mbps load. Jarek, would your patches help with HFSC
overspill? I will check tomorrow under 750 mbps load.

# ethtool -k eth0
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
large receive offload: off

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

-- 
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211


--

From: Vladimir Ivashchenko
Date: Monday, May 18, 2009 - 4:27 pm

Please disregard my comment about HFSC. It still overspills heavily.

On a 400 mbps limit, I'm getting 520 mbps actual throughput.

-- 
Best Regards,
Vladimir Ivashchenko


--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 4:03 am

The gen_estimator patch should fix only the effect of rising rate
after flow stop, and maybe similar overflows while reporting rates
around 1Gbit. It would show on tc stats of HFSC or HTB, but doesn't
affect actual scheduling rates.

The iproute2 tc_core patch can matter for HTB scheduling rates if
there are a lot of small packets (e.g. 100 byte for rate 500Mbit)
possibly mixed with bigger ones. It doesn't matter for HFSC or

I guess you should send some logs. Your previous report seems to show
that the sum of the sc rates of the children could be too high. You seem
to expect that the parent's sc and ul should limit this, but actually the
children's rates decide, and the parent's rates are mainly for
lending/borrowing (at least in HTB). So it would be nice to try with one
leaf class first (similarly to Antonio) to see how well high rates are
respected.

A high drop rate should be OK if the flow is much faster than the
scheduling/hardware send rate. It could be a bit higher than in older
kernels because of limited requeuing, but this could be corrected with
longer queue lengths (sfq has a very short queue: max 127).

Jarek P.
--

From: Vladimir Ivashchenko
Date: Tuesday, May 19, 2009 - 7:04 am

Unfortunately it's difficult for me to play with classes as it's real traffic.

I don't think it's sfq, since I have the same sfq qdiscs with HFSC.

Also I'm comparing this to my production HTB box, which has 2.6.21.5 with esfq
and no bond (just eth); esfq also has a 127p limit.

I tried to get rid of bond on the outbound traffic, I balanced traffic
via eth0 and eth2 manually by splitting routes going through them.

I still had the same issue with HTB not reaching the full speed.

I'm going to try testing exactly the same configuration on 2.6.29 as I have
on 2.6.21.5 tonight. The only difference would be that I use sfq(dst) instead of
esfq(dst) which is not available on 2.6.29.

-- 
Best Regards
Vladimir Ivashchenko
--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 1:10 pm

Similarly to Antonio's: ifconfigs and tc -s for qdiscs and classes at



I'm a bit lost about your configs/results and not reaching vs.
overspilled, so please send some new data to compare (gzipped?).

Jarek P. 
--

From: Vladimir Ivashchenko
Date: Wednesday, May 20, 2009 - 3:07 pm

Ok, it seems that I finally found what is causing my HTB on 2.6.29 not
to reach full throughput: dst hashing on sfq with high divisor value.

2.6.21 esfq divisor 13 depth 4096 hash dst - 680 mbps
2.6.29 sfq WITHOUT "flow hash keys dst ... " (default sfq) - 680 mbps
2.6.29 sfq + "flow hash keys dst divisor 64" filter - 680 mbps
2.6.29 sfq + "flow hash keys dst divisor 256" filter - 660 mbps
2.6.29 sfq + "flow hash keys dst divisor 2048" filters - 460 mbps

I'm using high sfq hash divisor in order to decrease the number of
collisions, there are several thousands of hosts behind each of the
classes. 

Any ideas why increasing the sfq divisor size results in a drop in
throughput?

Attached are diagnostics gathered in case of divisor 2048.

-- 
Best Regards,
Vladimir Ivashchenko

From: Eric Dumazet
Date: Wednesday, May 20, 2009 - 3:46 pm

But... it appears sfq currently supports a fixed divisor of 1024

net/sched/sch_sfq.c

 IMPLEMENTATION:
 This implementation limits maximal queue length to 128;
 maximal mtu to 2^15-1; number of hash buckets to 1024.
 The only goal of this restrictions was that all data
 fit into one 4K page :-). Struct sfq_sched_data is
 organized in anti-cache manner: all the data for a bucket
 are scattered over different locations. This is not good,
 but it allowed me to put it into 4K.

 It is easy to increase these values, but not in flight.  */

#define SFQ_DEPTH   128
#define SFQ_HASH_DIVISOR    1024


Apparently Corey Hickey's 2007 work on SFQ was not merged.

http://kerneltrap.org/mailarchive/linux-netdev/2007/9/28/325048


--

From: Jarek Poplawski
Date: Thursday, May 21, 2009 - 12:20 am

Yes, sfq has its design limits, and as a matter of fact, because of
max length (127) it should be treated as a toy or "personal" qdisc.

I don't know why more of esfq wasn't merged; anyway, similar
functionality can be achieved in current kernels with sch_drr +
cls_flow, which alas are not documented well enough. Here is a hint:
http://markmail.org/message/h24627xkrxyqxn4k

Jarek P.

PS: I guess you weren't very consistent about whether your main problem
was exceeding or not reaching the htb rate, and there is quite a difference.

...

...
--

From: Vladimir Ivashchenko
Date: Thursday, May 21, 2009 - 12:44 am

Can I balance only by destination IP using this approach? 
Normal IP flow-based balancing is not good for me, I need 

Yes indeed :(

I'm trying to migrate from 2.6.21 eth/htb/esfq to 2.6.29 
bond/htb/sfq, and that introduces a lot of changes.

Apparently at some point I changed the sfq divisor from 1024
to 2048 and forgot about it.

Now I realize that the problems I reported were as follows:

1) HTB exceeds target when I use HTB + sfq + divisor 1024
2) HFSC exceeds target when I use HFSC + sfq + divisor 1024
3) HTB does not reach target when I use HTB + sfq + divisor 2048

I will check again scenario 1) with the latest patches from

-- 
Best Regards
Vladimir Ivashchenko
--

From: Jarek Poplawski
Date: Thursday, May 21, 2009 - 1:28 am

Yes, you need to use flow "dst" key, I guess. (tc filter add flow help)


Generally, the most common reasons are:
- too short (or zero) tx queue length or/plus some disturbances in
  maintaining the flow - for not reaching the rate
- gso/tso or other non standard packets sizes - for exceeding the
  rate.
--

From: Eric Dumazet
Date: Thursday, May 21, 2009 - 2:07 am

Could we detect this at runtime and emit a warning (once) ?

Or should we assume guys using this stuff should be smart enough ?
I confess I made this error once and this was not so easy to spot...
	

--

From: Jarek Poplawski
Date: Thursday, May 21, 2009 - 2:22 am

On Thu, May 21, 2009 at 11:07:24AM +0200, Eric Dumazet wrote:

I guess, it's a rhetorical question...

Jarek P.
--

From: Vladimir Ivashchenko
Date: Saturday, May 23, 2009 - 3:37 am

What is the number of DRR classes I need to create, a separate class for
each host? I have around 20000 hosts.

I figured out that WRR does what I want and it's documented, so I'm using
a 2.6.27 kernel with WRR now.

I was still hitting a wall with bonding. I played with a lot of
combinations and could not find a way to make it scale to multiple
cores. Cores which handle incoming traffic would get hit to 0-20% idle.

So, I got rid of bonding completely and instead configured PBR on Cisco
+ Linux routing in such a way so that packet gets received and
transmitted using NICs connected to the same pair of cores with common
cache. 65-70% idle on all cores now, compared to 0-30% idle in worst

Just FYI, kernel 2.6.29.1, sub-classes with sfq divisor 1024, tso & gso
off, netdevice.h and tc_core.c patches applied:

class htb 1:2 root rate 775000Kbit ceil 775000Kbit burst 98328b cburst
98328b
Sent 64883444467 bytes 72261124 pkt (dropped 0, overlimits 0 requeues 0)
rate 821332Kbit 112572pps backlog 0b 0p requeues 0
lended: 21736738 borrowed: 0 giants: 0

In any case, exceeding the rate is not a big problem for me.

Thanks a lot to everyone for their help.

-- 
Best Regards,
Vladimir Ivashchenko


--

From: Jarek Poplawski
Date: Saturday, May 23, 2009 - 7:34 am

As a matter of fact I don't understand this bonding idea vs. smp: I
guess Eric Dumazet wrote why it's wrong wrt. locking. I'm not an smp
expert but I think the most efficient use is with separate NICs per
cpu (so with separate HTB qdiscs if possible), or multiqueue NICs -
but they would currently need a common HTB etc., so again a common

Anyway, I'd be interested in the full tc -s class & qdisc report.

Thanks,
Jarek P.
--

From: Vladimir Ivashchenko
Date: Saturday, May 23, 2009 - 8:06 am

I tried the following scenario: 2 NICs used for receive + another 2 NICs 
used for transmit having HTB. Each NIC on a separate core. No bonding, 
just manual load balancing using IP routing.

The result was that RX cores would be 20% and 40% idle respectively, even 
though the amount of traffic they were receiving was roughly the same. 
The TX cores were idling at around 90%. 

I found this strange personally, but I'm completely ignorant in internals of
kernel operation.

-- 
Best Regards
Vladimir Ivashchenko
--

From: Jarek Poplawski
Date: Saturday, May 23, 2009 - 8:35 am

There is not enough data to analyse this, but generally you should aim
at maintaining one flow (RX + TX) on the same cpu cache.

Jarek P.
--

From: Vladimir Ivashchenko
Date: Saturday, May 23, 2009 - 8:53 am

Yep, that's what I did in the end (as per the top paragraph).

-- 
Best Regards
Vladimir Ivashchenko
--

From: Jarek Poplawski
Date: Saturday, May 23, 2009 - 9:02 am

So, stop writing: "I'm completely ignorant in internals of kernel
operation" because you're smp expert now! ;-)

Jarek P.
--

From: Eric Dumazet
Date: Monday, May 18, 2009 - 9:40 am

With a typical estimator "1sec 8sec", ewma_log value is 3

At gigabit speeds, we are very close to overflow yes, since
we only have 27 bits available, so 134217728 bytes per second
or 1073741824 bits per second.

So formula :
e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
is going to overflow.

One way to avoid the overflow would be to use a smaller estimator, like "500ms 4sec" 

Or use 64-bit rate & avbps; this is needed for 10Gb speeds I suppose...
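
The 27-bit figure in this message follows from the 2^5 scaling of avbps inside a 32-bit counter. The sketch below only redoes that arithmetic in Python; `SCALE_SHIFT` and `max_bytes_per_sec` are names invented for the illustration.

```python
# Illustrative arithmetic for the estimator range; not kernel code.

SCALE_SHIFT = 5  # avbps is kept as bytes/s << 5

def max_bytes_per_sec(storage_bits):
    """Smallest bytes/s value that no longer fits once scaled by 2^SCALE_SHIFT."""
    return 1 << (storage_bits - SCALE_SHIFT)

if __name__ == "__main__":
    limit = max_bytes_per_sec(32)
    print("u32 avbps:", limit, "bytes/s =", limit * 8, "bit/s")
    # With a u64 avbps the scaled value stops being the bottleneck; the
    # reported u32 bytes/s value (2^32 bytes/s) then becomes the limit.
    print("u32 reported bps:", (1 << 32) * 8, "bit/s")
```

With a 32-bit counter this gives 134217728 bytes/s, or 1073741824 bit/s, the numbers quoted in this message.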

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..150e2f5 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -86,9 +86,9 @@ struct gen_estimator
 	spinlock_t		*stats_lock;
 	int			ewma_log;
 	u64			last_bytes;
+	u64			avbps;
 	u32			last_packets;
 	u32			avpps;
-	u32			avbps;
 	struct rcu_head		e_rcu;
 	struct rb_node		node;
 };
@@ -115,6 +115,7 @@ static void est_timer(unsigned long arg)
 	rcu_read_lock();
 	list_for_each_entry_rcu(e, &elist[idx].list, list) {
 		u64 nbytes;
+		u64 brate;
 		u32 npackets;
 		u32 rate;
 
@@ -125,9 +126,9 @@ static void est_timer(unsigned long arg)
 
 		nbytes = e->bstats->bytes;
 		npackets = e->bstats->packets;
-		rate = (nbytes - e->last_bytes)<<(7 - idx);
+		brate = (nbytes - e->last_bytes)<<(7 - idx);
 		e->last_bytes = nbytes;
-		e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
+		e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
 		e->rate_est->bps = (e->avbps+0xF)>>5;
 
 		rate = (npackets - e->last_packets)<<(12 - idx);

--

From: Jarek Poplawski
Date: Monday, May 18, 2009 - 10:23 am

Yes, I considered this too, but because of an overhead I decided to
fix as designed (according to the comment) for now. But probably you
are right, and we should go further, so I'm OK with your patch.

--

From: David Miller
Date: Monday, May 18, 2009 - 2:52 pm

From: Jarek Poplawski <jarkao2@gmail.com>

I like this patch too, Eric can you submit this formally with
proper signoffs etc.?

Thanks!
--

From: Eric Dumazet
Date: Monday, May 18, 2009 - 4:59 pm

Sure, here it is. We might need a similar patch to get a correct pps value
too, since we currently are limited to ~ 2^21 packets per second.

[PATCH] pkt_sched: gen_estimator: use 64 bit intermediate counters for bps

gen_estimator can overflow bps (bytes per second) with Gb links, while
it was designed with a u32 API, with a theorical limit of 34360Mbit (2^32 bytes)

Using 64 bit intermediate avbps/brate counters can allow us to reach this
theorical limit.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..ea28659 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -66,9 +66,9 @@
 
    NOTES.
 
-   * The stored value for avbps is scaled by 2^5, so that maximal
-     rate is ~1Gbit, avpps is scaled by 2^10.
-
+   * avbps is scaled by 2^5, avpps is scaled by 2^10.
+   * both values are reported as 32 bit unsigned values. bps can
+     overflow for fast links : max speed being 34360Mbit/sec
    * Minimal interval is HZ/4=250msec (it is the greatest common divisor
      for HZ=100 and HZ=1024 8)), maximal interval
      is (HZ*2^EST_MAX_INTERVAL)/4 = 8sec. Shorter intervals
@@ -86,9 +86,9 @@ struct gen_estimator
 	spinlock_t		*stats_lock;
 	int			ewma_log;
 	u64			last_bytes;
+	u64			avbps;
 	u32			last_packets;
 	u32			avpps;
-	u32			avbps;
 	struct rcu_head		e_rcu;
 	struct rb_node		node;
 };
@@ -115,6 +115,7 @@ static void est_timer(unsigned long arg)
 	rcu_read_lock();
 	list_for_each_entry_rcu(e, &elist[idx].list, list) {
 		u64 nbytes;
+		u64 brate;
 		u32 npackets;
 		u32 rate;
 
@@ -125,9 +126,9 @@ static void est_timer(unsigned long arg)
 
 		nbytes = e->bstats->bytes;
 		npackets = e->bstats->packets;
-		rate = (nbytes - e->last_bytes)<<(7 - idx);
+		brate = (nbytes - e->last_bytes)<<(7 - idx);
 		e->last_bytes = nbytes;
-		e->avbps += ((long)rate - (long)e->avbps) >> ...
From: David Miller
Date: Monday, May 18, 2009 - 7:27 pm

From: Eric Dumazet <dada1@cosmosbay.com>


True, but it is a less urgent issue than bps overflow.
--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 12:02 am

On Tue, May 19, 2009 at 01:59:55AM +0200, Eric Dumazet wrote:

Btw., I'm a bit concerned about the syntax here: isn't such shifting
of signed ints implementation-dependent?

Jarek P.
--

From: Eric Dumazet
Date: Tuesday, May 19, 2009 - 12:31 am

You are right Jarek, I very often forget to never ever use signed quantities
at all ! (But also note original code has same undefined behavior)


Quoting wikipedia : (http://en.wikipedia.org/wiki/Arithmetic_shift)

The (1999) ISO standard for the C programming language defines the C language's
right shift operator in terms of divisions by powers of 2. Because of the
aforementioned non-equivalence, the standard explicitly excludes from that
definition the right shifts of signed numbers that have negative values.
It doesn't specify the behaviour of the right shift operator in such circumstances,
but instead requires each individual C compiler to specify the behaviour of shifting
negative values right.

Apparently gcc does the *right* thing on x86_32, but we probably want something
stronger here. I could not find gcc documentation statement on right shifts of 
negative values.


 436:   8b 4b 14                mov    0x14(%ebx),%ecx
 439:   89 73 18                mov    %esi,0x18(%ebx)
 43c:   89 7b 1c                mov    %edi,0x1c(%ebx)
 43f:   8b 73 20                mov    0x20(%ebx),%esi
 442:   8b 7b 24                mov    0x24(%ebx),%edi
 445:   29 f0                   sub    %esi,%eax
 447:   19 fa                   sbb    %edi,%edx
 449:   0f ad d0                shrd   %cl,%edx,%eax
 44c:   d3 fa                   sar    %cl,%edx         << good >>
 44e:   f6 c1 20                test   $0x20,%cl
 451:   74 05                   je     458 <est_timer+0xb8>
 453:   89 d0                   mov    %edx,%eax
 455:   c1 fa 1f                sar    $0x1f,%edx       
 458:   01 f0                   add    %esi,%eax
 45a:   8b 4b 0c                mov    0xc(%ebx),%ecx
 45d:   89 43 20                mov    %eax,0x20(%ebx)
 460:   11 fa                   adc    %edi,%edx
 462:   83 c0 0f                add    $0xf,%eax
 465:   89 53 24                mov    %edx,0x24(%ebx)
 468:   83 d2 00                adc    $0x0,%edx
 46b:   0f ac d0 05             shrd   ...
From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 12:42 am

I guess gcc and most of others do this "right"; but it looks
"unkosher" anyway.

Jarek P.
--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 12:57 am

I might have missed your point here, but would it be so costly to do
these shifts separately here?

Jarek P.
--

From: Eric Dumazet
Date: Tuesday, May 19, 2009 - 11:03 am

You replied to yourself Jarek :)

As I said earlier, I found your concern right, so please submit a patch ?

I found many occurrences of a right shift on a signed int/long in kernel.
One example being :

arch/x86/mm/init_64.c

int kern_addr_valid(unsigned long addr)
{
	unsigned long above = ((long)addr) >> __VIRTUAL_MASK_SHIFT;


and another rate estimator in drivers/atm/idt77252.c

static void
idt77252_est_timer(unsigned long data)


We could also check net/netfilter/ipvs/ip_vs_est.c (estimation_timer())

--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 12:09 pm

On Tue, May 19, 2009 at 08:03:24PM +0200, Eric Dumazet wrote:

OK, thanks,
Jarek P.
----------------->
pkt_sched: gen_estimator: Fix signed integers right-shifts.

Right-shifts of signed integers are implementation-defined so unportable.

With feedback from: Eric Dumazet <dada1@cosmosbay.com>

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

diff -Nurp a/net/core/gen_estimator.c b/net/core/gen_estimator.c
--- a/net/core/gen_estimator.c	2009-05-19 20:33:47.000000000 +0200
+++ b/net/core/gen_estimator.c	2009-05-19 20:40:58.000000000 +0200
@@ -128,12 +128,12 @@ static void est_timer(unsigned long arg)
 		npackets = e->bstats->packets;
 		brate = (nbytes - e->last_bytes)<<(7 - idx);
 		e->last_bytes = nbytes;
-		e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
+		e->avbps += (brate >> e->ewma_log) - (e->avbps >> e->ewma_log);
 		e->rate_est->bps = (e->avbps+0xF)>>5;
 
 		rate = (npackets - e->last_packets)<<(12 - idx);
 		e->last_packets = npackets;
-		e->avpps += ((long)rate - (long)e->avpps) >> e->ewma_log;
+		e->avpps += (rate >> e->ewma_log) - (e->avpps >> e->ewma_log);
 		e->rate_est->pps = (e->avpps+0x1FF)>>10;
 skip:
 		read_unlock(&est_lock);
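
A minimal sketch of the idea behind this change, assuming 32-bit unsigned
averages; ewma_step() and its parameter names are made up for illustration
and are not part of gen_estimator.c. Each operand is shifted as an unsigned
value, and unsigned wraparound keeps the result correct even when the new
sample is below the current average:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from gen_estimator.c: one EWMA update step
 * that never right-shifts a possibly negative value. */
static uint32_t ewma_step(uint32_t avg, uint32_t sample, unsigned int ewma_log)
{
	assert(ewma_log < 32);	/* shifting by >= the type width is undefined */
	/* Both operands are shifted as unsigned values; the arithmetic is
	 * done modulo 2^32, so sample < avg still gives the right result. */
	return avg + (sample >> ewma_log) - (avg >> ewma_log);
}
```

Note that shifting the two terms separately can round slightly differently
from shifting their difference once.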
--

From: David Miller
Date: Monday, May 25, 2009 - 10:47 pm

From: Jarek Poplawski <jarkao2@gmail.com>

Applied to net-next-2.6, thanks!
--

From: David Miller
Date: Tuesday, May 19, 2009 - 1:18 am

From: Eric Dumazet <dada1@cosmosbay.com>

It emits an "arithmetic shift right" for every CPU I've ever checked.
--

From: Jarek Poplawski
Date: Sunday, May 17, 2009 - 1:15 pm

Here is some additional explanation. It looks like these rates above
500Mbit hit the design limits of packet scheduling. The internal
resolution currently used, PSCHED_TICKS_PER_SEC, is 1,000,000. A 550Mbit
rate with 800-byte packets means 550M/8/800 = 85938 packets/s, so on
average 1000000/85938 = 11.6 ticks per packet. Accounting for only 11
ticks means we leave 0.6*85938 = 51563 ticks per second unaccounted,
allowing an additional 51563/11 = 4687 packets/s or 4687*800*8 = 30Mbit
to be sent. Of course it could be worse (0.9 tick/packet lost) depending
on packet sizes vs. rates, and the effect grows for higher rates.
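
The arithmetic above can be sketched as follows. This is only an
illustration: the helper names are hypothetical, and it ignores the cell
granularity of real rate tables, assuming simply that the fractional part
of the per-packet tick count is dropped.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: whole ticks charged for one packet when the
 * fractional part is dropped. */
static uint64_t ticks_per_packet(uint64_t rate_bps, uint64_t pkt_bytes,
				 uint64_t ticks_per_sec)
{
	assert(rate_bps > 0);
	return pkt_bytes * 8 * ticks_per_sec / rate_bps;	/* truncates */
}

/* Hypothetical helper: the rate actually allowed when every packet is
 * charged only the truncated tick count. */
static uint64_t effective_rate_bps(uint64_t rate_bps, uint64_t pkt_bytes,
				   uint64_t ticks_per_sec)
{
	uint64_t ticks = ticks_per_packet(rate_bps, pkt_bytes, ticks_per_sec);

	assert(ticks > 0);	/* a zero-tick packet would not be limited */
	return pkt_bytes * 8 * ticks_per_sec / ticks;
}
```

For 550Mbit, 800-byte packets and 1,000,000 ticks per second this charges
11 ticks per packet and allows roughly 30Mbit more than configured.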

Jarek P.
--

From: Jarek Poplawski
Date: Sunday, May 17, 2009 - 11:56 pm

Return non-zero tc_calc_xmittime() for rate tables

While looking at the problem of HTB accuracy for high speed (~500Mbit
rates) I've found that rate tables have cells filled with zeros for
the smallest sizes. It means such packets aren't accounted at all.
Apart from the correctness of such configs, let's make it safe by
overaccounting rather than leaving it unlimited.

Reported-by: Antonio Almeida <vexwek@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 tc/tc_core.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/tc/tc_core.c b/tc/tc_core.c
index 9a0ff39..14f25bc 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -58,7 +58,9 @@ unsigned tc_core_ktime2time(unsigned ktime)
 
 unsigned tc_calc_xmittime(unsigned rate, unsigned size)
 {
-	return tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+	unsigned t;
+	t = tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+	return t ? : 1;
 }
 
 unsigned tc_calc_xmitsize(unsigned rate, unsigned ticks)
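
A small sketch of the effect, assuming one tick per microsecond and a rate
given in bytes per second; xmit_ticks() is a hypothetical stand-in, not the
tc function itself. Without the final check, a 40-byte packet at 550Mbit
(68750000 bytes/s) would be charged 0 ticks:

```c
#include <assert.h>

/* Hypothetical stand-in for the patched function. Assumes one tick per
 * microsecond and a rate in bytes per second. */
static unsigned int xmit_ticks(unsigned int rate_Bps, unsigned int size)
{
	unsigned int t;

	assert(rate_Bps > 0);
	t = (unsigned int)(1000000.0 * ((double)size / rate_Bps));
	return t ? t : 1;	/* standard C spelling of the GNU "t ? : 1" */
}
```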
--

From: Antonio Almeida
Date: Monday, May 18, 2009 - 9:54 am

I'm not sure if I'm able to test this patch. What do you mean by
"smallest sizes"? Are you talking about packet size? What kind of
sizes?
When I feed my bridge with 950Mbits/s of 800-byte packets, that
is close to 150.000pps and the CPUs start to get busy. For packets 100
bytes long, 150.000pps would be close to 125Mbits/s and the CPUs already
start to get busy, so I'm not able to get close to 500Mbits/s. For
rates near 125Mbits/s the inaccuracy is not so pronounced. With
100-byte packets, as the analyser sends more traffic, at some point
it is not HTB shaping but the CPU that can't process so many packets. I
might have misunderstood your point.

I applied this tc_core.c patch and for packets of 800 bytes it had no
effect on HTB accuracy with rates over 500Mbit.
Anyway, I also tested it with packets of 100 bytes, generating 200Mbits,
and the result is the same as without this patch:

With the patch:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
100000Kbit ceil 100000Kbit burst 14087b/8 mpu 0b overhead 0b cburst
14087b/8 mpu 0b overhead 0b level 0
 Sent 2187884640 bytes 22790465 pkt (dropped 8624566, overlimits 0 requeues 0)
 rate 124946Kbit 162691pps backlog 0b 0p requeues 0
 lended: 22790465 borrowed: 0 giants: 0
 tokens: 180 ctokens: 180


Without the patch:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
100000Kbit ceil 100000Kbit burst 14087b/8 mpu 0b overhead 0b cburst
14087b/8 mpu 0b overhead 0b level 0
 Sent 1260235680 bytes 13127455 pkt (dropped 4531299, overlimits 0 requeues 0)
 rate 124575Kbit 162207pps backlog 0b 0p requeues 0
 lended: 13127455 borrowed: 0 giants: 0
 tokens: 123 ctokens: 123


Thanks
  Antonio Almeida


--

From: Antonio Almeida
Date: Monday, May 18, 2009 - 10:16 am

I forgot to tell you that I used tc source code from iproute2-2.6.16.
I couldn't use the newest version because I got errors when compiling.

  Antonio Almeida


--

From: Jarek Poplawski
Date: Thursday, May 21, 2009 - 1:51 am

I still have no clue about the reason, but it would be really nice to
do a short test with a more current kernel (>= 2.6.27) and iproute2
(to exclude the possibility of some incompatibility in configs, e.g.
rate tables passed to htb).

Thanks,
Jarek P.
--

From: Antonio Almeida
Date: Friday, May 22, 2009 - 10:42 am

I installed kernel 2.6.29 (finally! It wasn't easy... I couldn't set
memory split 2G/2G),
but the results are the same. I've already applied the gen_estimator.c
patches (they work fine).

# tc -s -d class ls dev eth1 | head -24
class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
 Sent 119955303928 bytes 150697618 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621844Kbit 97651pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 402 ctokens: 402

class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5
 Sent 119955303928 bytes 150697618 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621844Kbit 97651pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 402 ctokens: 402

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 119955366812 bytes 150697697 pkt (dropped 76696483, overlimits 0
requeues 0)
 rate 621847Kbit 97652pps backlog 0b 79p requeues 0
 lended: 150697618 borrowed: 0 giants: 0
 tokens: -5 ctokens: -5

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6
 Sent 119955303928 bytes 150697618 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621844Kbit 97651pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 402 ctokens: 402


# cat /sys/module/sch_htb/parameters/htb_hysteresis
0

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic ...
From: Jarek Poplawski
Date: Saturday, May 23, 2009 - 12:32 am

Actually, of these two I was more interested in an iproute2 that better
fits the kernel version. :-( (It should be enough to have at least
tc compiled properly, I guess.)

Btw.: if at any point you think this testing is too disturbing to you
etc., feel free to stop it or delay it as you like.

Thanks,
Jarek P.
--

From: Antonio Almeida
Date: Thursday, May 28, 2009 - 11:13 am

I installed iproute2-ss090115 with the new patch but the results are
the same for my test scenario. HTB keeps sending 620Mbit/s when I
I'm working on this, don't worry. Since I have a traffic
generator/analyser, any modification you would make I can test it.
You're free to ask.

I've been looking inside the htb source code. The granularity problem
could be in the use of qdisc_rate_table or near that.


  Antonio Almeida
--

From: Jarek Poplawski
Date: Thursday, May 28, 2009 - 2:12 pm

Yes, but according to my assessment there should be "only" a 50Mbit
difference for this rate/packet size. Anyway, could you try the test
patch below, which should add some granularity to this rate table?

Thanks,
Jarek P.
---

 include/net/pkt_sched.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index e37fe31..f0faf03 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -42,8 +42,8 @@ typedef u64	psched_time_t;
 typedef long	psched_tdiff_t;
 
 /* Avoid doing 64 bit divide by 1000 */
-#define PSCHED_US2NS(x)			((s64)(x) << 10)
-#define PSCHED_NS2US(x)			((x) >> 10)
+#define PSCHED_US2NS(x)			((s64)(x) << 6)
+#define PSCHED_NS2US(x)			((x) >> 6)
 
 #define PSCHED_TICKS_PER_SEC		PSCHED_NS2US(NSEC_PER_SEC)
 #define PSCHED_PASTPERFECT		0
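
To make the effect of the new shift concrete, here is a small sketch; the
helper names and NSEC_PER_SEC_SKETCH are hypothetical and not part of
pkt_sched.h. A shift of 10 gives a 1024ns tick, a shift of 6 gives a 64ns
tick, i.e. 16 times finer:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000ULL

/* Hypothetical helper: scheduler ticks per second for a given
 * nanosecond-to-tick shift (10 before the patch, 6 after it). */
static uint64_t ticks_per_sec(unsigned int shift)
{
	assert(shift < 64);
	return NSEC_PER_SEC_SKETCH >> shift;
}

/* Tick length in nanoseconds for the same shift. */
static uint64_t tick_ns(unsigned int shift)
{
	assert(shift < 64);
	return 1ULL << shift;
}
```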
--

From: Antonio Almeida
Date: Friday, May 29, 2009 - 10:02 am

It's better! This patch gives more accuracy to HTB. Here are some values.
Note that these are boundary values, so, e.g., any HTB configuration
between 377000Kbit and 400000Kbit would fall in the same step, close
to 397977Kbit.
This test was made under the same conditions: generating 950Mbit/s of
unidirectional tcp traffic with 800-byte packets.

leaf class ceil	leaf class sent rate (tc -s values)
376000Kbit	375379Kbit
--
377000Kbit	397977Kbit
400000Kbit	397973Kbit
--
401000Kbit	425199Kbit
426000Kbit	425199Kbit
--
427000Kbit	456389Kbit
457000Kbit	456409Kbit
--
458000Kbit	490111Kbit
492000Kbit	490138Kbit
--
493000Kbit	531957Kbit
533000Kbit	532078Kbit
--
534000Kbit	581835Kbit
581000Kbit	581820Kbit
--
582000Kbit	637809Kbit
640000Kbit	637709Kbit
--
641000Kbit	710526Kbit
711000Kbit	710553Kbit
--
712000Kbit	795921Kbit
800000Kbit	795901Kbit
--
801000Kbit	912706Kbit
914000Kbit	912782Kbit
--
915000Kbit	--


Here are more values for an HTB ceil configuration of 555Mbit/s with varying packet size:

800 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 18731000768 bytes 23531408 pkt (dropped 15715520, overlimits 0 requeues 0)
 rate 581832Kbit 91368pps backlog 0b 110p requeues 0
 lended: 23531298 borrowed: 0 giants: 0
 tokens: -16091 ctokens: -16091


850 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 30556163150 bytes 37645600 pkt (dropped 25746491, overlimits 0 requeues 0)
 rate 565509Kbit 83556pps backlog 0b 15p requeues 0
 lended: 37645585 borrowed: 0 giants: 0
 tokens: -16010 ctokens: -16010


950 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 51363059854 bytes ...
From: Stephen Hemminger
Date: Friday, May 29, 2009 - 10:28 am

On Fri, 29 May 2009 18:02:39 +0100

You really need to get a better box than the dual core AMD.
There is only millisecond (or worse with HZ=100) resolution possible because
there is no working TSC on that hardware.


-- 
--

From: Jarek Poplawski
Date: Friday, May 29, 2009 - 12:58 pm

I think this could cause problems with peak rates but IMHO there is
no reason for htb to miss per second (4s) estimations against the same
clock. Plus it mostly confirms theoretical limits of currently used
rate tables vs. usecond time/ticket accounting.

Jarek P.
--

From: Jarek Poplawski
Date: Friday, May 29, 2009 - 12:46 pm

Good news! So it seems there are no other reasons for this inaccuracy
than too coarse granularity, but I have yet to check this. Alas,
something more than this patch is needed, because it probably breaks
other things like hfsc.

Thanks,
--

From: Stephen Hemminger
Date: Friday, May 29, 2009 - 1:49 pm

On Fri, 29 May 2009 21:46:43 +0200

Why would it break hfsc, if it isn't already broken?
--

From: Jarek Poplawski
Date: Friday, May 29, 2009 - 1:59 pm

I might be wrong but e.g. these usecs could be one reason:

/* convert d (us) into dx (psched us) */
static u64
d2dx(u32 d)
{
        u64 dx;

        dx = ((u64)d * PSCHED_TICKS_PER_SEC);
        dx += USEC_PER_SEC - 1;
        do_div(dx, USEC_PER_SEC);
        return dx;
}

And maybe these shifts need some adjustment:
m = (sm * PSCHED_TICKS_PER_SEC) >> SM_SHIFT;

Jarek P.
--

From: Jarek Poplawski
Date: Saturday, May 30, 2009 - 1:07 pm

Here is a tc patch, which should minimize these boundaries, so please,
repeat this test with previous patches/conditions plus this one.

Thanks,
Jarek P.
---

 tc/tc_core.c |   10 +++++-----
 tc/tc_core.h |    4 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/tc/tc_core.c b/tc/tc_core.c
index 9a0ff39..6d74287 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -27,18 +27,18 @@
 static double tick_in_usec = 1;
 static double clock_factor = 1;
 
-int tc_core_time2big(unsigned time)
+int tc_core_time2big(double time)
 {
-	__u64 t = time;
+	__u64 t;
 
-	t *= tick_in_usec;
+	t = time * tick_in_usec + 0.5;
 	return (t >> 32) != 0;
 }
 
 
-unsigned tc_core_time2tick(unsigned time)
+unsigned tc_core_time2tick(double time)
 {
-	return time*tick_in_usec;
+	return time * tick_in_usec + 0.5;
 }
 
 unsigned tc_core_tick2time(unsigned tick)
diff --git a/tc/tc_core.h b/tc/tc_core.h
index 5a693ba..0ac65aa 100644
--- a/tc/tc_core.h
+++ b/tc/tc_core.h
@@ -13,8 +13,8 @@ enum link_layer {
 };
 
 
-int  tc_core_time2big(unsigned time);
-unsigned tc_core_time2tick(unsigned time);
+int  tc_core_time2big(double time);
+unsigned tc_core_time2tick(double time);
 unsigned tc_core_tick2time(unsigned tick);
 unsigned tc_core_time2ktime(unsigned time);
 unsigned tc_core_ktime2time(unsigned ktime);
--

From: Antonio Almeida
Date: Tuesday, June 2, 2009 - 3:12 am

I'm getting great values with this patch!

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70970b/8 mpu 0b overhead 0b cburst
70970b/8 mpu 0b overhead 0b level 0
 Sent 14270693572 bytes 17928007 pkt (dropped 12579262, overlimits 0 requeues 0)
 rate 552755Kbit 86802pps backlog 0b 127p requeues 0
 lended: 17927880 borrowed: 0 giants: 0
 tokens: -16095 ctokens: -16095

(for packets of 800 bytes)
I'll get back to you with more values.

  Antonio Almeida
--

From: Antonio Almeida
Date: Tuesday, June 2, 2009 - 4:45 am

The steps are much smaller and the error stays below 1%.
Injecting over 950Mbps of tcp packets of 800 bytes I get these values:

Configuration	Sent rate		error (%)
498000Kbit	495023Kbit	0,60
499000Kbit	497456Kbit	0,31
500000Kbit	497498Kbit	0,50
501000Kbit	497496Kbit	0,70
502000Kbit	499986Kbit	0,40
503000Kbit	499978Kbit	0,60
504000Kbit	502520Kbit	0,29
		
696000Kbit	690964Kbit	0,72
697000Kbit	695782Kbit	0,17
698000Kbit	695783Kbit	0,32
699000Kbit	695783Kbit	0,46
700000Kbit	695795Kbit	0,60
701000Kbit	695786Kbit	0,74
702000Kbit	700703Kbit	0,18
		
896000Kbit	888383Kbit	0,85
897000Kbit	896289Kbit	0,08
904000Kbit	896389Kbit	0,84
905000Kbit	904542Kbit	0,05

  Antonio Almeida
--

From: Jarek Poplawski
Date: Tuesday, June 2, 2009 - 5:36 am

Nice values - should be acceptable, I guess. Alas this is not all, and
I'll ask you soon for re-testing HFSC (after another patch) or maybe
even some simple CBQ setup ;-)

Thank you very much for testing,
--

From: Patrick McHardy
Date: Tuesday, June 2, 2009 - 5:45 am

I didn't follow the full discussion, so I'm not sure which kind of
arithmetic error you're attempting to cure. For the HFSC scaling
factors, please just keep in mind that it's also supposed to be
very accurate at low bandwidths.

--

From: Jarek Poplawski
Date: Tuesday, June 2, 2009 - 6:08 am

It's all here:

http://permalink.gmane.org/gmane.linux.network/129301

Of course, I'd appreciate any suggestions.

Thanks,
Jarek P.
--

From: Patrick McHardy
Date: Tuesday, June 2, 2009 - 6:20 am

I've read through the mails where you suggested to change the scaling
factors. I wasn't able to find the reasoning (IOW: where does it

The HFSC shifts would indeed need adjustments if the US<->NS conversion
factor were to change.

--

From: Jarek Poplawski
Date: Tuesday, June 2, 2009 - 2:37 pm

I described the reasoning here:
http://permalink.gmane.org/gmane.linux.network/128189

Of course, we could try some other solution than changing the scaling.
I considered a possibility to do it internally in htb, even with
skipping rate tables, but the change of the scaling seems to be the
most generic way (alas there are some odd compatibility issues in
iproute/tc like TIME_UNITS_PER_SEC or "if (nom == 1000000)" to make
it really consistent/readable).

Jarek P.
--

From: Jarek Poplawski
Date: Tuesday, June 2, 2009 - 2:50 pm

Jarek Poplawski wrote, On 06/02/2009 11:37 PM:

The link is stuck now, so here is a quote:



Jarek P.
--

From: Patrick McHardy
Date: Wednesday, June 3, 2009 - 12:06 am

I see. Unfortunately changing the scaling factors is pushing the lower
end towards overflowing. For example Denys Fedoryshchenko reported some
breakage a few years ago when I changed the iproute-internal factors
triggered by this command:

.. tbf buffer 1024kb latency 500ms rate 128kbit peakrate 256kbit 
minburst 16384

The burst size calculated by TBF with the current parameters is
64000000. Increasing it by a factor of 16 as in your patch results
in 1024000000. Which means we're getting dangerously close to
overflowing, a buffer size increase or a rate decrease of slightly
bigger than factor 4 will already overflow.

Mid-term we really need to move to 64 bit values and ns resolution,
otherwise this problem is just going to reappear as soon as someone
tries 10gbit. Not sure what the best short term fix is, I feel a bit
uneasy about changing the current factors given how close this brings
us towards overflowing.
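
The headroom estimate above can be checked with a short sketch. The helper
names are hypothetical, and the inputs in the note below (1024000 bytes,
16000 bytes/s for 128kbit) are assumptions chosen to reproduce the figure
of 64000000:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: time, in scheduler ticks, needed to send
 * `bytes` at `rate_Bps` bytes per second. Computed in 64 bits so the
 * result can be compared against the 32-bit limit. */
static uint64_t buffer_ticks(uint64_t bytes, uint64_t rate_Bps,
			     uint64_t ticks_per_sec)
{
	assert(rate_Bps > 0);
	return bytes * ticks_per_sec / rate_Bps;
}

/* Nonzero if the tick count still fits an unsigned 32-bit field. */
static int fits_u32(uint64_t ticks)
{
	return ticks <= UINT32_MAX;
}
```

With 1,000,000 ticks per second the buffer is 64000000 ticks; with 16 times
as many ticks it is 1024000000, and a buffer 5 times larger no longer fits
in 32 bits.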
--

From: Jarek Poplawski
Date: Wednesday, June 3, 2009 - 12:40 am

I completely agree it's on the verge of overflow, and actually would
overflow for some insanely low (by today's standards) rates. So I
treat it as a temporary solution, until people start asking about
more than 1 or 2Gbit. And of course we will have to move to 64 bit
anyway. Or we can do it now...

Btw., I've some doubts about HFSC; it's really different from the others
wrt. rate tables/time accounting, and these PSCHED_TICKS look only
like unnecessary compatibility; it works OK with usecs and doesn't
need this change now, unless I miss something. So maybe we could
simply stop using the common psched_get_time() for it, and only do a
conversion for qdisc_watchdog_schedule() etc.?

Thanks,
Jarek P.
--

From: Patrick McHardy
Date: Wednesday, June 3, 2009 - 12:53 am

That (now) would certainly be the best solution, but it's a non-trivial

Yes, it would work perfectly fine with usecs, which is actually (and
unfortunately) the unit it uses in its ABI. But I think it's better
to convert the values once during initialization, instead of again
and again when scheduling the watchdog. The necessary changes are
really trivial, all you need to do when changing the scaling factors
is to increase SM_MASK and decrease ISM_MASK accordingly.
--

From: Jarek Poplawski
Date: Wednesday, June 3, 2009 - 1:01 am

On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:

Right! (On the other hand we could consider a separate watchdog too...)

Jarek P.
--

From: Patrick McHardy
Date: Wednesday, June 3, 2009 - 1:29 am

We could :) But I don't see any benefit doing that, especially given
that eventually everything should be using ns resolution anyways.
--

From: Jarek Poplawski
Date: Wednesday, June 3, 2009 - 1:45 am

The main benefit would be readability... I guess it's no problem for
you, but I'm currently trying to make sure things like this are/will
be OK :-)

        dx = ((u64)d * PSCHED_TICKS_PER_SEC);
        dx += USEC_PER_SEC - 1;
 
Jarek P.
--

From: Jarek Poplawski
Date: Wednesday, June 3, 2009 - 2:54 am

On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:

OK, looks like it's really enough and I was confused with some
rounding, thanks Patrick.

Antonio, could you give this patch a try (with all the previous) and
repeat those HFSC tests you did before (plus maybe a few tries with
lower rates)?

Thanks,
Jarek P.
---

 net/sched/sch_hfsc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 5022f9c..7c53a36 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -384,8 +384,9 @@ cftree_update(struct hfsc_class *cl)
  *
  *  1.024us/byte  78.125     7.8125     0.78125    0.078125   0.0078125
  */
-#define	SM_SHIFT	20
-#define	ISM_SHIFT	18
+#define	PSCHED_SHIFT	6	/* TODO: move to pkt_sched.h */
+#define	SM_SHIFT	(30 - PSCHED_SHIFT)
+#define	ISM_SHIFT	(8 + PSCHED_SHIFT)
 
 #define	SM_MASK		((1ULL << SM_SHIFT) - 1)
 #define	ISM_MASK	((1ULL << ISM_SHIFT) - 1)
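
As a quick check of the new definitions (a sketch with hypothetical helper
names, not code from sch_hfsc.c): with the old ns-to-tick shift of 10 they
reproduce the previous constants 20 and 18, and with a shift of 6 they move
by 10 - 6 = 4 in opposite directions.

```c
#include <assert.h>

/* Hypothetical helpers mirroring the macros in the patch: the two
 * HFSC shifts expressed in terms of the ns-to-tick shift. */
static unsigned int sm_shift(unsigned int psched_shift)
{
	assert(psched_shift <= 30);
	return 30 - psched_shift;
}

static unsigned int ism_shift(unsigned int psched_shift)
{
	assert(psched_shift <= 30);
	return 8 + psched_shift;
}
```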
--

From: Patrick McHardy
Date: Wednesday, June 3, 2009 - 3:01 am

Looks fine in principle, but considering your change to the generic

--

From: Patrick McHardy
Date: Wednesday, June 3, 2009 - 3:05 am

Actually I'm confused, why the additional change of 10?
--

From: Patrick McHardy
Date: Wednesday, June 3, 2009 - 3:06 am

OK, 10 - 6 = 4, got it :)

--

From: Jarek Poplawski
Date: Wednesday, June 3, 2009 - 3:27 am

If you wanted to console me after my hfsc confusions, you did it!

Thanks again,
Jarek P.
--

From: Antonio Almeida
Date: Thursday, June 4, 2009 - 6:50 am

For me, HTB values are just perfect! I would say that they're better
than HFSC, since sent rate stays below the configured ceil (but that's
for me)
After applying the patch you sent (to sch_hfsc.c) I got these values for HFSC:

configuration	analyser RX	error (%)
  10000000	10062688		0,63
  20000000	20096961		0,48
  30000000	30135028		0,45
  40000000	40186190		0,47
  50000000	50294890		0,59
  60000000	60294553		0,49
  70000000	70284220		0,41
  80000000	80414272		0,52
  90000000	90354675		0,39
100000000	100453024		0,45
200000000	200962041		0,48
250000000	251467886		0,59
300000000	301422613		0,47
400000000	402123479		0,53
500000000	502356820		0,47
550000000	552988253		0,54
600000000	602956905		0,49
700000000	703405632		0,49
750000000	753949085		0,53
800000000	804315169		0,54
900000000	904584208		0,51

As usual, generating 970Mbit/s of tcp traffic with 800-byte packets.

Here's the setup picture:
# tc -s -d class ls dev eth1
class hfsc 1: root
 Sent 253924 bytes 319 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 period 0 level 4

class hfsc 1:1 parent 1: sc m1 0bit d 0us m2 1000Mbit ul m1 0bit d 0us
m2 1000Mbit
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 period 2 work 299437688 bytes level 3

class hfsc 1:10 parent 1:2 sc m1 0bit d 0us m2 1000Mbit ul m1 0bit d
0us m2 1000Mbit
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 period 2 work 299437688 bytes level 1

class hfsc 1:2 parent 1:1 sc m1 0bit d 0us m2 1000Mbit ul m1 0bit d
0us m2 1000Mbit
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 period 2 work 299437688 bytes level 2

class hfsc 1:108 parent 1:10 sc m1 0bit d 50.0ms m2 500000Kbit ul m1
0bit d 0us m2 500000Kbit
 Sent 300178764 bytes 377109 pkt (dropped 349464, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 931p requeues 0
 period 2 work 299437688 ...
From: Jarek Poplawski
Date: Thursday, June 4, 2009 - 12:30 pm

Very nice, it looks like HFSC precision isn't affected by these changes.

OK, I'll browse other schedulers, and if there is nothing suspicious
I'll submit these patches.

Thank you very much for cooperation!
Jarek P.
--

From: Patrick McHardy
Date: Thursday, June 4, 2009 - 12:35 pm

Please give me a day to have another look at this, I didn't find
any time today.

In most areas the overflows are only occurring when crossing
IMO unreasonable boundaries (but I've been wrong about that
before), but tc_cbq_calc_maxidle() is still making me nervous.
--

From: Jarek Poplawski
Date: Thursday, June 4, 2009 - 12:42 pm

Sure, I planned similar time for browsing it yet, as well.

Thanks,
Jarek P.
--

From: Badalian Vyacheslav
Date: Monday, June 8, 2009 - 10:25 pm

Hello!
Do you have any progress on applying this patch set?
I'm very interested in seeing these patches in the mainline kernel tree. We
would like to use HTB for speeds above 1G (converting 10 servers x 1G
to a few with 10G Intel multi-queue network devices).

Thanks for your work!

--

From: Jarek Poplawski
Date: Monday, June 8, 2009 - 10:49 pm

Hi,

I'll try to send patches today, but they are expected to work with 1G
or maybe a little more. I'm not sure higher rates make sense without
tso/gso, which isn't properly handled by packet schedulers anyway, so
more time/feedback/testing will be needed to go further.

Regards,
Jarek P.
--

From: David Miller
Date: Wednesday, June 3, 2009 - 9:53 pm

From: Patrick McHardy <kaber@trash.net>

We could pass in a new attribute which provides the upper-32bits
of the value.  I'm not sure if that works in this case but it's
an idea.

--

From: Jarek Poplawski
Date: Thursday, June 4, 2009 - 12:50 am

I'm not sure it could be so simple: I guess Patrick is concerned with
a new tc talking to an old kernel (otherwise a kernel should recognize
an old format). Then it would need something reasonable in 32bits.

But, I'm not even sure we need 64bit rate tables. We could
alternatively use (after checking a kernel can handle this)
simply a log to shift these values in kernel to u64:

- static inline u32 qdisc_l2t(struct qdisc_rate_table* rtab, unsigned int pktlen)
+ static inline u64 qdisc_l2t(struct qdisc_rate_table* rtab, unsigned int pktlen)
  {
	...
-        return rtab->data[slot];
+        return rtab->data[slot] << rtab->rate.rate_log;
  }

Since these overflows are for low rates, this rounding of lower bits
shouldn't matter here. So, IMHO, it's more about adding this overhead
of u64 to the kernel now.
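
A simplified, self-contained sketch of this idea follows. The struct and
function names are hypothetical, rate_log is the proposed field rather than
an existing one, and overhead/alignment handling is left out. One detail
worth noting: the 32-bit table entry has to be widened before the shift,
otherwise the high bits are lost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified rate table; rate_log is the proposed field,
 * not an existing one. */
struct rtab_sketch {
	uint32_t data[256];
	uint8_t cell_log;
	uint8_t rate_log;
};

static uint64_t l2t_sketch(const struct rtab_sketch *rtab, unsigned int pktlen)
{
	unsigned int slot;

	assert(rtab != NULL && rtab->cell_log < 32 && rtab->rate_log <= 32);
	slot = pktlen >> rtab->cell_log;
	if (slot > 255)
		slot = 255;	/* simplification: clamp oversized packets */
	/* Widen to 64 bits first, then shift. */
	return (uint64_t)rtab->data[slot] << rtab->rate_log;
}
```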

Jarek P.
--

From: Jarek Poplawski
Date: Monday, May 18, 2009 - 10:53 am

You're right: if there were only 800 byte packets this patch shouldn't
matter. It should matter e.g. if these 800 byte were mixed with 100
byte packets, rate 550Mbit, and HZ 1000. Btw., could you send your
.config (gzipped)? I guess, I've to look for some other reason yet.

Thanks,
--

From: Antonio Almeida
Date: Monday, May 18, 2009 - 11:23 am

Here's my .config

  Antonio Almeida


From: Jarek Poplawski
Date: Monday, May 18, 2009 - 11:32 am

Hmm... And if it's not a big problem I'd also ask you to try this test
with 555000Kbit rate for 850 and 900 byte packets. (It can wait.)

Thanks again,
Jarek P.
--

From: Antonio Almeida
Date: Monday, May 18, 2009 - 11:56 am

Precise measurements:

800 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 46793626324 bytes 57771194 pkt (dropped 29920019, overlimits 0 requeues 0)
 rate 621714Kbit 97631pps backlog 0b 126p requeues 0
 lended: 57771068 borrowed: 0 giants: 0
 tokens: -8 ctokens: -8


850 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 63422144616 bytes 77714246 pkt (dropped 41012275, overlimits 0 requeues 0)
 rate 600699Kbit 88756pps backlog 0b 127p requeues 0
 lended: 77714119 borrowed: 0 giants: 0
 tokens: -11 ctokens: -11


900 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 76868403562 bytes 92835297 pkt (dropped 48565133, overlimits 0 requeues 0)
 rate 636195Kbit 88755pps backlog 0b 126p requeues 0
 lended: 92835171 borrowed: 0 giants: 0
 tokens: -7 ctokens: -7


If you need more values you're free to ask.

  Antonio Almeida


--

From: Jarek Poplawski
Date: Monday, May 18, 2009 - 12:05 pm

Since you're so kind... :-) There is a line in net/sched/sch_htb.c:

#define HTB_HYSTERESIS 1        /* whether to use mode hysteresis for speedup */

Could you change 1 to 0, and repeat these tests above after recompiling?

More thanks,
Jarek P.
--

From: Antonio Almeida
Date: Tuesday, May 19, 2009 - 3:55 am

Setting HTB_HYSTERESIS to 0 doesn't seem to make any difference. Here are
the values using #define HTB_HYSTERESIS 0

800 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 9773257752 bytes 12277962 pkt (dropped 6292541, overlimits 0 requeues 0)
 rate 621796Kbit 97644pps backlog 0b 127p requeues 0
 lended: 12277835 borrowed: 0 giants: 0
 tokens: -7 ctokens: -7

850 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 18225005732 bytes 22409017 pkt (dropped 11937269, overlimits 0 requeues 0)
 rate 600890Kbit 88796pps backlog 0b 43p requeues 0
 lended: 22408974 borrowed: 0 giants: 0
 tokens: -2 ctokens: -2

900 bytes:
class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 29790867368 bytes 35400708 pkt (dropped 18399726, overlimits 0 requeues 0)
 rate 636361Kbit 88779pps backlog 0b 127p requeues 0
 lended: 35400581 borrowed: 0 giants: 0
 tokens: -2 ctokens: -2


  Antonio Almeida



--

From: Denys Fedoryschenko
Date: Tuesday, May 19, 2009 - 4:04 am

6292541 dropped from 12277962 pkt, means 51% dropped. Maybe something fishy 
here?

Can you try instead of SFQ - BFIFO? For 100ms buffer, 550Mbit/s it will be 
~6875000 bytes bfifo.

It is by the way too short, IMHO, for this bandwidth, 127 packets is not 
enough. 127 packets with 800 bytes can buffer 1 second for 812Kbit/s only, 
and for 550Mbit/s it will buffer data for ~2ms only.
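
These buffer figures can be reproduced with a short sketch (helper names
are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte limit for a FIFO that should hold
 * `buffer_ms` milliseconds of traffic at `rate_bps` bits per second. */
static uint64_t bfifo_limit_bytes(uint64_t rate_bps, uint64_t buffer_ms)
{
	return rate_bps / 8 * buffer_ms / 1000;
}

/* Time in microseconds to drain `pkts` packets of `pkt_bytes` each. */
static uint64_t drain_time_us(uint64_t pkts, uint64_t pkt_bytes,
			      uint64_t rate_bps)
{
	assert(rate_bps > 0);
	return pkts * pkt_bytes * 8 * 1000000 / rate_bps;
}
```

For 127 packets of 800 bytes at 550Mbit/s the drain time comes out just
under 1.5ms.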


--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 4:18 am

Sure, if the queue is too short we could have a problem with reaching
the expected rate; but here it's all backwards - it could actually
"help" with the stats. ;-)

Jarek P.
--

From: Denys Fedoryschenko
Date: Tuesday, May 19, 2009 - 4:21 am

Well, I had real experience with HTB: when I set buffers that were too short
on my QoS qdiscs, the incoming rate jumped higher than the overall limit.
When I set larger buffers (and, by the way, dropped sfq and used bfifo), it
dropped. No idea why: a bug, or something specific in the protocols'
congestion control. Maybe worth a try...

--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 4:28 am

Very strange. Anyway, "overlimits 0" suggests HTB always got packets
when it needed...

Jarek P.
--

From: Antonio Almeida
Date: Tuesday, May 19, 2009 - 7:31 am

I tested it with BFIFO using limit 6875000. (The analyser keeps sending
950Mbits/s of 800-byte tcp packets, so lots of drops for sure.)
The backlog is now huge but the throughput stays much higher than the
configured ceil.

# tc -s -d class ls dev eth1
class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5
 Sent 9542831672 bytes 11988482 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621765Kbit 97639pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: -186 ctokens: -186

class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
 Sent 9542831672 bytes 11988482 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621765Kbit 97639pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: -186 ctokens: -186

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6
 Sent 9542831672 bytes 11988482 pkt (dropped 0, overlimits 0 requeues 0)
 rate 621765Kbit 97639pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: -186 ctokens: -186

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
 Sent 9549705928 bytes 11997118 pkt (dropped 6092846, overlimits 0 requeues 0)
 rate 621764Kbit 97639pps backlog 0b 8636p requeues 0
 lended: 11988482 borrowed: 0 giants: 0
 tokens: -1008 ctokens: -1008



# tc -s -d qdisc ls dev eth1
qdisc htb 1: root r2q 10 default 0 direct_packets_stat 11955 ver 3.17
 Sent 9608660872 bytes 12071182 pkt (dropped 6124502, overlimits
18190041 requeues 0)
 rate 0bit 0pps backlog 0b 8636p requeues 0
qdisc bfifo 108: parent 1:108 limit 6875000b
 Sent 9599144692 bytes 12059227 pkt (dropped 6124502, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 6874256b 8636p requeues 0
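
The counters of class 1:108 above can be cross-checked with a little arithmetic. The C sketch below is only an illustration: the function names are invented for the example, and Kbit is assumed to mean 1000 bit/s.

```c
#include <assert.h>

/* Average packet size in bytes from the "Sent <bytes> bytes <pkts> pkt" counters. */
static double avg_packet_bytes(unsigned long long sent_bytes,
			       unsigned long long sent_pkts)
{
	assert(sent_pkts > 0);
	return (double)sent_bytes / (double)sent_pkts;
}

/* Rate in Kbit/s implied by a packet rate and an average packet size. */
static double implied_rate_kbit(double pps, double pkt_bytes)
{
	return pps * pkt_bytes * 8.0 / 1000.0;
}

/* How far the measured rate exceeds the configured one, in percent. */
static double overshoot_percent(double measured_kbit, double configured_kbit)
{
	assert(configured_kbit > 0.0);
	return (measured_kbit - configured_kbit) / configured_kbit * 100.0;
}
```

With the values reported for class 1:108 (9549705928 bytes in 11997118 packets, 97639pps, 621764Kbit against a 555000Kbit ceil), this gives 796 bytes per packet, an implied rate of about 621765Kbit, and an overshoot of about 12%.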


  Antonio ...
--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 4:09 am

OK, so it looks like there is still some hidden bug.

Many thanks for now,
--

From: Jesper Dangaard Brouer
Date: Tuesday, May 19, 2009 - 6:18 am

Note that it's runtime adjustable via:
  /sys/module/sch_htb/parameters/htb_hysteresis

Since kernel version v2.6.26.


Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 12:35 pm

Yes, this should convince Antonio to try something newer.
(Alas it didn't seem to make much difference to his case ;-)

Cheers,
Jarek P.
--

From: Jarek Poplawski
Date: Monday, May 18, 2009 - 12:01 am

-----------> (One misspelling fixed.)
Return non-zero tc_calc_xmittime() for rate tables

While looking at the problem of HTB accuracy at high speed (~500Mbit
rates) I've found that rate tables have cells filled with zeros for
the smallest sizes. It means such packets aren't accounted for at all.
Whether or not such configs are correct, let's make it safe by
overaccounting rather than leaving such packets unlimited.

Reported-by: Antonio Almeida <vexwek@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 tc/tc_core.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/tc/tc_core.c b/tc/tc_core.c
index 9a0ff39..14f25bc 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -58,7 +58,9 @@ unsigned tc_core_ktime2time(unsigned ktime)
 
 unsigned tc_calc_xmittime(unsigned rate, unsigned size)
 {
-	return tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+	unsigned t;
+	t = tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+	return t ? : 1;
 }
 
 unsigned tc_calc_xmitsize(unsigned rate, unsigned ticks)
--
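
The effect of the patch above can be illustrated with a small standalone C sketch. It is not the iproute2 code: time2tick_sketch() is a hypothetical stand-in for tc_core_time2tick() that assumes one tick per time unit, TIME_UNITS_PER_SEC is assumed to be 1000000 (microseconds), and rate is assumed to be in bytes per second.

```c
#include <assert.h>

#define TIME_UNITS_PER_SEC 1000000	/* assumed: time units are microseconds */

/* Hypothetical stand-in for tc_core_time2tick(); assumes one tick per time unit. */
static unsigned time2tick_sketch(double time)
{
	return (unsigned)time;	/* fractional ticks are truncated toward zero */
}

/* Behaviour before the patch: sub-tick transmit times become 0. */
static unsigned xmittime_before(unsigned rate, unsigned size)
{
	assert(rate > 0);
	return time2tick_sketch(TIME_UNITS_PER_SEC * ((double)size / rate));
}

/* Behaviour after the patch: never 0, so every size costs at least one tick. */
static unsigned xmittime_after(unsigned rate, unsigned size)
{
	unsigned t = xmittime_before(rate, size);

	return t ? t : 1;	/* portable spelling of the patch's GNU "t ? : 1" */
}
```

Under these assumptions, at 62500000 bytes/s (500Mbit) a 32-byte size yields 0 before the patch and 1 after it, while an 800-byte size yields 12 in both cases. The real tick scaling depends on the kernel clock parameters, so the exact sizes that truncate to zero may differ.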

From: Vladimir Ivashchenko
Date: Sunday, May 17, 2009 - 1:29 pm

Hi Antonio,

FYI, these are exactly the same problems I get in real life.
Check the later posts in "bond + tc regression" thread.


-- 
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com
--
