Hi Stephen and all,
I have observed an HTB accuracy problem on Linux kernel 2.6.30 with the Myri-10G 10 GbE NIC.
HTB can control the transmission rate at Gigabit speeds, but it does not work well at 10 Gigabit speeds.
I asked Stephen about this problem at the Japan Linux Symposium. He mentioned an HTB bug related to timer granularity.
I would like to know what is happening and what should be done to fix it.
Any comments and suggestions are welcome.
For more detail, please see the following page:
http://code.google.com/p/pspacer/wiki/HTBon10GbE
Best regards,
Ryousei
--
Hello.
We are also planning to consolidate 5-10 servers with 1 Gigabit connections into one big server with a 10G network card (Intel multiqueue); the peak traffic rate is about 6 Gigabit.
I can test any patches that fix problems with 10 Gigabit connections!
Thanks!
--
This is not an easy problem to fix. Userspace, the kernel and the netlink API use 32 bits for timing-related values, which is too small for anything finer than microsecond resolution. All of them need to be converted to use bigger types; in addition, some kind of compatibility handling is required to deal with old iproute versions that still use microsecond resolution.
--
On Mon, 02 Nov 2009 16:43:42 +0100
The existing API is a legacy mish-mash. The field is limited to 32 bits, but it might be possible to use a finer scale. Maybe if the kernel advertised a finer resolution through /proc/net/psched, then the table could be finer grained. This would maintain compatibility between kernel and user space: you would need a new kernel and a new iproute to get nanosecond resolution, but older combinations would still work.
The downside is that with nanosecond resolution the transmission time is upper bounded at about 4.2 seconds per packet.
--
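To put a number on the 32-bit limit discussed above, here is a minimal C sketch. The helper names max_interval_ns and min_rate_bytes_per_sec are hypothetical and not part of the kernel or iproute; the sketch only assumes that a time value is stored as a 32-bit count of fixed-size ticks.

```c
#include <assert.h>
#include <stdint.h>

/* Longest interval, in nanoseconds, that a 32-bit tick counter can hold.
 * tick_ns is the assumed length of one tick (1 for nanosecond resolution,
 * 1000 for microsecond resolution). */
static uint64_t max_interval_ns(uint64_t tick_ns)
{
	return (uint64_t)UINT32_MAX * tick_ns;
}

/* Approximate slowest rate (bytes per second, rounded down) at which the
 * time to send one packet of pkt_bytes still fits in that counter. */
static uint64_t min_rate_bytes_per_sec(uint64_t pkt_bytes, uint64_t tick_ns)
{
	uint64_t limit = max_interval_ns(tick_ns);

	assert(limit > 0);
	return pkt_bytes * 1000000000ULL / limit;
}
```

With tick_ns = 1, max_interval_ns returns 4294967295, i.e. roughly 4.29 seconds, which is the per-packet bound mentioned above.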
Hello dear netdev team!
Linux always moves with the times :) Networks around the world are moving to 10G technologies.
I can stress-test any patches on a production system for the Linux developers :)
I am leaving our company on 1 December and have 1 month for all tests.
I believe the Linux netdev team can do it easily. We only need to begin :) Let's do it together :)
Jarek, you have helped us many times to fix small problems in HTB, thanks for this! Everything works great!
Now the netdev team has "crazy" Eric, who writes great code and is not afraid of big code changes.
Maybe together you can think about the changes and create a mega patch for my testing? :)
I always read all the code changes on the netdev mailing list. It's my coffee time in the morning :)
I am also interested in what David has to say: does Linux networking plan to fully support 10G technologies?
We bought 4x4 Xeon Quad + IntelCX4 x 2 network cards to test 10G shaper support in our network :)
Let's begin testing! :)
Best regards,
Slavon.
Tech Director Assistant.
JSC BIG Telecom
--
Hi Patrick and Stephen,
Thanks for your comments.
I retried with a newer kernel and iproute2, and added the experimental results
to my page. Please see 'Experimental result 2':
http://code.google.com/p/pspacer/wiki/HTBon10GbE
The accuracy improved compared with the previous experiment.
The difference was reduced from +810 Mbps to +430 Mbps.
This is because the timer resolution improved from 1 usec to 1/64 usec.
But it is still not perfect.
Best regards,
Ryousei Takano
--
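One way to see why a finer tick helps is to truncate the per-packet transmission time to whole ticks and compute the rate that results. The sketch below is only an illustration: the function name effective_rate_bps, the 9000-byte packet size used in the note and the truncation model are assumptions, not taken from the HTB code.

```c
#include <assert.h>
#include <stdint.h>

/* Rate (bits per second) that results when the time to send one packet is
 * truncated to a whole number of ticks. ticks_per_sec is 1000000 for 1 usec
 * ticks and 64000000 for 1/64 usec ticks. */
static uint64_t effective_rate_bps(uint64_t rate_bps, uint64_t pkt_bytes,
				   uint64_t ticks_per_sec)
{
	uint64_t bits = pkt_bytes * 8;
	uint64_t ticks;

	assert(rate_bps > 0);
	ticks = bits * ticks_per_sec / rate_bps;	/* integer division truncates */
	if (ticks == 0)
		return UINT64_MAX;	/* packet is shorter than one tick */
	return bits * ticks_per_sec / ticks;
}
```

For a 9000-byte packet at a configured 10 Gbit/s, 1 tick per microsecond gives 7 ticks instead of 7.2 and an effective rate of about 10.29 Gbit/s, while 64 ticks per microsecond gives 460 ticks instead of 460.8 and about 10.02 Gbit/s. These figures illustrate the trend only and are not meant to reproduce the measurements above.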
Hmm, do you know that part of the error comes from the user tool itself?
If you check the iperf results at sender and receiver you'll see different values; the sender lies a bit.
I tried here on a Gbit link (I don't have 10GbE yet):
$ ./iperf.bench.sh
.100 104
.200 206
.300 307
.400 413
.500 515
.600 610
.700 715
.800 822
.900 913
1.000 945
while on the receiver:
[ 4] 0.0- 5.3 sec 62.8 MBytes 100 Mbits/sec
[ 5] 0.0- 5.1 sec 123 MBytes 202 Mbits/sec
[ 4] 0.0- 5.1 sec 183 MBytes 303 Mbits/sec
[ 5] 0.0- 5.1 sec 246 MBytes 409 Mbits/sec
[ 4] 0.0- 5.0 sec 307 MBytes 511 Mbits/sec
[ 5] 0.0- 5.0 sec 364 MBytes 607 Mbits/sec
[ 4] 0.0- 5.0 sec 427 MBytes 711 Mbits/sec
[ 5] 0.0- 5.0 sec 490 MBytes 818 Mbits/sec
[ 4] 0.0- 5.0 sec 545 MBytes 909 Mbits/sec
[ 5] 0.0- 5.0 sec 565 MBytes 941 Mbits/sec
You might use longer intervals to reduce this error (10 secs instead of 5 secs):
$ ./iperf.bench.sh
.100 102
.200 204
.300 305
.400 410
.500 513
.600 608
.700 713
.800 820
.900 911
1.000 943
--
(that was with the standard 1500 MTU)
Now, with 9000 MTU and 50-second samples (instead of 5 s) I get:
$ ./iperf.bench.sh
.100 101
.200 200
.300 301
.400 401
.500 500
.600 601
.700 700
.800 803
.900 903
1.000 991
Not too bad :)
--
Hi Eric,
I tried iperf with 60-second samples and got almost the same result.
Here is the result (all values in Gbps):
target  sender  receiver
1.000   1.00    1.00
2.000   2.01    2.01
3.000   3.03    3.02
4.000   4.07    4.07
5.000   5.05    5.05
6.000   6.16    6.16
7.000   7.22    7.22
8.000   8.15    8.15
9.000   9.23    9.23
9.900   9.69    9.69
Best regards,
Ryousei Takano
--
One thing to consider is the estimation error in qdisc_l2t(): the rate table has only 256 slots.
static inline u32 qdisc_l2t(struct qdisc_rate_table* rtab, unsigned int pktlen)
{
	int slot = pktlen + rtab->rate.cell_align + rtab->rate.overhead;
	if (slot < 0)
		slot = 0;
	slot >>= rtab->rate.cell_log;
	if (slot > 255)
		return (rtab->data[255]*(slot >> 8) + rtab->data[slot & 0xFF]);
	return rtab->data[slot];
}
Maybe you can try changing the class mtu to 40000 instead of 9000, and the quantum to 60000 too:
tc class add dev $DEV parent 1: classid 1:1 htb rate ${rate}mbit mtu 40000 quantum 60000
(because your TCP stack sends large buffers (~60000 bytes), as your NIC can offload TCP segmentation)
--
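To make the estimation error concrete, the lookup in qdisc_l2t() can be replayed in userspace. This is a rough sketch under stated assumptions: struct rtab_sketch, rtab_fill and l2t_sketch are hypothetical names; the table is assumed to hold the truncated transmission time of (i + 1) << cell_log bytes in slot i, with cell_align set to -1 and no overhead; and the cell_log values 6 and 8 used in the note are picked by hand rather than derived the way tc derives them from the class mtu.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified userspace stand-in for the kernel's rate table. */
struct rtab_sketch {
	uint32_t data[256];	/* transmission time in ticks, per size slot */
	int cell_log;		/* log2 of the slot width in bytes */
	int cell_align;		/* assumed to be -1 here */
};

/* Assumption: slot i stores the truncated time for (i + 1) << cell_log bytes. */
static void rtab_fill(struct rtab_sketch *t, int cell_log,
		      uint64_t rate_bytes_per_sec, uint64_t ticks_per_sec)
{
	assert(cell_log >= 0 && cell_log < 16 && rate_bytes_per_sec > 0);
	t->cell_log = cell_log;
	t->cell_align = -1;
	for (int i = 0; i < 256; i++) {
		uint64_t bytes = (uint64_t)(i + 1) << cell_log;

		t->data[i] = (uint32_t)(bytes * ticks_per_sec / rate_bytes_per_sec);
	}
}

/* Same lookup logic as the quoted qdisc_l2t(), without the overhead term. */
static uint32_t l2t_sketch(const struct rtab_sketch *t, unsigned int pktlen)
{
	int slot = (int)pktlen + t->cell_align;

	if (slot < 0)
		slot = 0;
	slot >>= t->cell_log;
	if (slot > 255)	/* packet larger than the table covers: extrapolate */
		return t->data[255] * (uint32_t)(slot >> 8) + t->data[slot & 0xFF];
	return t->data[slot];
}
```

Under these assumptions, at 1250000000 bytes per second with 1000000 ticks per second, a 60000-byte packet takes 48 usec on the wire. With cell_log 6 the lookup falls into the slot > 255 branch and returns 47 ticks, because two truncated entries are combined; with cell_log 8 it stays inside the table and returns 48 ticks.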
Hi Eric,
Thanks for your suggestion. You are right! I am using TSO.
The myri10ge driver is passing 64KB packets to the NIC.
I changed the class mtu parameter to 64000 instead of 9000.
Here is the result:
1.000 1.00
2.000 2.01
3.000 2.99
4.000 4.01
5.000 5.01
6.000 6.04
7.000 7.06
8.000 8.09
9.000 9.11
9.900 9.64
It's not so bad!
For more information, I have updated the results on my page.
Best regards,
Ryousei
--
In fact, I gave you 40000 because the rtab will then contain 256 elements from 0 to 65280.
If you use 64000, you lose some precision (for small packets, for example).
--
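The trade-off between table coverage and precision can be sketched with a little arithmetic. The helpers below are hypothetical, and they take cell_align and overhead as 0 for simplicity; whether a given class mtu actually selects a larger cell_log depends on how tc derives it, which this sketch does not model.

```c
#include <assert.h>
#include <stdint.h>

/* First packet length (bytes) that maps to a given slot. */
static uint32_t slot_start_bytes(unsigned int slot, unsigned int cell_log)
{
	assert(slot < 256 && cell_log < 24);
	return (uint32_t)slot << cell_log;
}

/* Slot a packet of pktlen bytes maps to (cell_align and overhead taken as 0). */
static unsigned int slot_index(unsigned int pktlen, unsigned int cell_log)
{
	assert(cell_log < 24);
	return pktlen >> cell_log;
}
```

With cell_log 8 the last slot starts at 255 << 8 = 65280 bytes, matching the range given above, but every slot is 256 bytes wide, so a 64-byte and a 200-byte packet share slot 0 and get the same time estimate. With cell_log 6 they land in slots 1 and 3.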
Hi Eric,
I see. In my experiment, it is not a very big problem, because I do not send short packets.
I got almost the same result in both cases, "mtu 64000" and "mtu 40000 quantum 60000".
Anyway, setting an mtu larger than the physical MTU does not quite make sense.
Best regards,
Ryousei
--
The tc class mtu is a hint given to the stack about the average packet size, i.e. it is not related to the physical MTU (because of TSO).
You could use the same mtu, but disable TSO on the device.
--
