Re: HTB accuracy on 10GbE

Previous thread: [PATCH net-next-2.6] pppoe: RCU locking in get_item_by_addr() by Eric Dumazet on Sunday, November 1, 2009 - 10:51 pm. (2 messages)

Next thread: Re: [net-next-2.6] net: Introduce dev_get_by_name_rcu() by David Miller on Monday, November 2, 2009 - 12:56 am. (1 message)
From: Ryousei Takano
Date: Monday, November 2, 2009 - 12:22 am

Hi Stephen and all,

I have observed an HTB accuracy problem on Linux kernel 2.6.30 with
the Myri-10G 10 GbE NIC.
HTB can control the transmission rate at Gigabit speeds, but it does
not work well at 10 Gigabit speeds.

I asked Stephen about this problem at the Japan Linux Symposium.  He
mentioned an HTB bug related to the timer granularity.
I want to know what is happening, and what should be done to fix it.

Any comments and suggestions will be welcome.

For more detail, please see the following page:
http://code.google.com/p/pspacer/wiki/HTBon10GbE

Best regards,
Ryousei
--

From: Badalian Vyacheslav
Date: Monday, November 2, 2009 - 1:17 am

Hello.

We also plan to convert 5-10 servers with 1 Gigabit connections into one BIG server with a 10G network card (Intel multiqueue); the peak traffic rate is about 6 Gigabit.

We can test any patches that fix problems on 10 Gigabit connections!

Thanks!


--

From: Patrick McHardy
Date: Monday, November 2, 2009 - 8:43 am

This is not an easy problem to fix. Userspace, the kernel and the
netlink API use 32 bit for timing related values, which is too small
to use more than microsecond resolution. All of them need to be
converted to use bigger types, additionally some kind of compatibility
handling to deal with old iproute versions still using microsecond
resolution is required.
--

From: Stephen Hemminger
Date: Monday, November 2, 2009 - 1:53 pm

On Mon, 02 Nov 2009 16:43:42 +0100

The existing API is a legacy mish-mash. The field is limited to 32 bits,
but it might be possible to use a finer scale.

Maybe if kernel advertised finer resolution through /proc/net/psched
then table could be finer grained. This would maintain compatibility
between kernel and user space. You would need to have new kernel and
new iproute to get nanosecond resolution but older combinations would
still work.

The downside is that with nanosecond resolution a 32-bit field can hold
at most about 4.29 seconds per packet, which puts a lower bound on the
rates that can be expressed.
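
To make the trade-off concrete, here is a minimal sketch of the arithmetic.
It assumes the per-packet time is stored as a count of ticks in an unsigned
32-bit field; the helper names max_interval_sec() and min_rate_bps() are
hypothetical and do not come from the kernel or iproute2.

```c
#include <assert.h>
#include <stdint.h>

/* Longest interval, in seconds, that a 32-bit tick count can hold
 * when one second is divided into ticks_per_sec ticks. */
static double max_interval_sec(double ticks_per_sec)
{
	assert(ticks_per_sec > 0.0);
	return (double)UINT32_MAX / ticks_per_sec;
}

/* Lowest rate, in bits per second, at which a packet of pkt_bytes
 * still fits into that interval. Slower rates would need more ticks
 * than the 32-bit field can represent. */
static double min_rate_bps(double ticks_per_sec, unsigned int pkt_bytes)
{
	return 8.0 * pkt_bytes / max_interval_sec(ticks_per_sec);
}
```

With 1e9 ticks per second (nanoseconds) the field covers roughly 4.29
seconds, against roughly 4295 seconds at microsecond resolution. Under these
assumptions a 1500-byte packet could not be shaped much below 2.8 kbit/s at
nanosecond resolution, which is unlikely to matter on a 10 GbE link.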

--

From: Badalian Vyacheslav
Date: Tuesday, November 3, 2009 - 12:43 am

Hello dear netdev team!

Linux always keeps up with the times :)
Networks around the world are moving to 10G technologies. I can stress-test any patches on a production system for the Linux developers :)
I leave our company on 1 December and have one month for all the tests.
I believe the Linux netdev team can do it easily. We only need to begin :) Let's do it together :)

Jarek, you have helped us fix small problems in HTB many times, thanks for this! Everything works great! Now the netdev team has "crazy" Eric, who writes great code and is not afraid to make big code changes. Maybe together you could think about the changes and create a mega patch for me to test? :)
I always read all the code changes on the netdev mailing list. It's my morning coffee time :)

It would also be interesting to hear what David says. Does Linux networking plan to fully support 10G technologies?

We bought 4x4 Xeon Quad + IntelCX4 x 2 network cards to test 10G shaper support in our network :) Let's begin testing! :)

Best regards, Slavon.
Tech Director Assistant.
JSC BIG Telecom

--

From: Ryousei Takano
Date: Tuesday, November 3, 2009 - 8:13 pm

Hi Patrick and Stephen,

Thanks for your comments.

I retried on the newer kernel and iproute2, and added the experimental result
on my page.  Please see 'Experimental result 2':
    http://code.google.com/p/pspacer/wiki/HTBon10GbE

The accuracy improved compared with the previous experiment.
The difference was reduced from +810 Mbps to +430 Mbps.
This is because the timer resolution improved from 1 usec to 1/64 usec.
But it is still not perfect.
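
As a rough illustration of why a finer tick helps, the sketch below estimates
the rate a shaper would enforce if it charged each packet its transmit time
truncated to whole ticks. Both the truncation and the helper names
xmit_ticks() and effective_rate_bps() are assumptions made for this example;
the real HTB path goes through a rate table and is more involved.

```c
#include <assert.h>
#include <stdint.h>

/* Transmit time of pkt_bytes at rate_bps, truncated to whole ticks.
 * Truncation (rather than rounding) is an assumption of this sketch. */
static uint32_t xmit_ticks(double rate_bps, unsigned int pkt_bytes,
			   double ticks_per_usec)
{
	double usec;

	assert(rate_bps > 0.0 && ticks_per_usec > 0.0);
	usec = 8.0 * pkt_bytes / rate_bps * 1e6;
	return (uint32_t)(usec * ticks_per_usec);
}

/* Rate that results when every packet is charged exactly
 * xmit_ticks() ticks instead of its true transmit time. */
static double effective_rate_bps(double rate_bps, unsigned int pkt_bytes,
				 double ticks_per_usec)
{
	uint32_t ticks = xmit_ticks(rate_bps, pkt_bytes, ticks_per_usec);

	assert(ticks > 0);
	return 8.0 * pkt_bytes * ticks_per_usec * 1e6 / ticks;
}
```

For an illustrative 9000-byte packet at a configured 10 Gbit/s, this gives
about 10.29 Gbit/s with 1 tick per usec and about 10.02 Gbit/s with 64 ticks
per usec. These numbers only show the direction of the effect and are not
meant to reproduce the measurements on the page.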

Best regards,
Ryousei Takano


--

From: Ryousei Takano
Date: Tuesday, November 3, 2009 - 8:45 pm

From: Eric Dumazet
Date: Tuesday, November 3, 2009 - 10:03 pm

Hmm, did you know that part of the error comes from the user tool itself?

If you check the iperf results at the sender and the receiver you'll see
different values; the sender lies a bit.

I tried here on a Gbit link (I don't have 10GbE yet)

$ ./iperf.bench.sh
.100 104
.200 206
.300 307
.400 413
.500 515
.600 610
.700 715
.800 822
.900 913
1.000 945

while on receiver :
[  4]  0.0- 5.3 sec  62.8 MBytes    100 Mbits/sec
[  5]  0.0- 5.1 sec    123 MBytes    202 Mbits/sec
[  4]  0.0- 5.1 sec    183 MBytes    303 Mbits/sec
[  5]  0.0- 5.1 sec    246 MBytes    409 Mbits/sec
[  4]  0.0- 5.0 sec    307 MBytes    511 Mbits/sec
[  5]  0.0- 5.0 sec    364 MBytes    607 Mbits/sec
[  4]  0.0- 5.0 sec    427 MBytes    711 Mbits/sec
[  5]  0.0- 5.0 sec    490 MBytes    818 Mbits/sec
[  4]  0.0- 5.0 sec    545 MBytes    909 Mbits/sec
[  5]  0.0- 5.0 sec    565 MBytes    941 Mbits/sec


You might use longer intervals to reduce this error (10 secs instead of 5 secs)

$./iperf.bench.sh
.100 102
.200 204
.300 305
.400 410
.500 513
.600 608
.700 713
.800 820
.900 911
1.000 943
--

From: Eric Dumazet
Date: Tuesday, November 3, 2009 - 10:27 pm

(that was with standard 1500 MTU)

Now, with 9000 MTU and 50 seconds samples (instead of 5 s) I get :

$ ./iperf.bench.sh
.100 101
.200 200
.300 301
.400 401
.500 500
.600 601
.700 700
.800 803
.900 903
1.000 991

Not too bad :)
--

From: Ryousei Takano
Date: Wednesday, November 4, 2009 - 1:19 am

Hi Eric,


I tried iperf with 60-second samples and got almost the same result.

Here is the result:
      sender	receiver
1.000 1.00	1.00
2.000 2.01	2.01
3.000 3.03	3.02
4.000 4.07	4.07
5.000 5.05	5.05
6.000 6.16	6.16
7.000 7.22	7.22
8.000 8.15	8.15
9.000 9.23	9.23
9.900 9.69	9.69

Best regards,
Ryousei Takano
--

From: Eric Dumazet
Date: Wednesday, November 4, 2009 - 4:31 am

One thing to consider is the estimation error in qdisc_l2t(): the rate table has only 256 slots.

static inline u32 qdisc_l2t(struct qdisc_rate_table* rtab, unsigned int pktlen)
{
	int slot = pktlen + rtab->rate.cell_align + rtab->rate.overhead;
	if (slot < 0)
		slot = 0;
	slot >>= rtab->rate.cell_log;
	if (slot > 255)
		return (rtab->data[255]*(slot >> 8) + rtab->data[slot & 0xFF]);
	return rtab->data[slot];
}


Maybe you can try changing class mtu to 40000 instead of 9000, and quantum to 60000 too

tc class add dev $DEV parent 1: classid 1:1 htb rate ${rate}mbit mtu 40000 quantum 60000

(because your tcp stack sends large buffers ( ~ 60000 bytes) as your NIC can offload tcp segmentation)
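
A small sketch of why the class mtu matters for the lookup above. It assumes
that the table's cell size is chosen as the smallest power of two for which
the class mtu fits into 256 slots, and it ignores cell_align and overhead.
The helper names cell_log_for_mtu(), slot_for_len() and needs_extrapolation()
are hypothetical.

```c
#include <assert.h>

/* Smallest shift such that mtu maps into slots 0..255.
 * How the shift is chosen is an assumption of this sketch. */
static int cell_log_for_mtu(unsigned int mtu)
{
	int cell_log = 0;

	assert(mtu > 0);
	while ((mtu >> cell_log) > 255)
		cell_log++;
	return cell_log;
}

/* Slot index as computed in qdisc_l2t(), with cell_align and
 * overhead left out for simplicity. */
static unsigned int slot_for_len(unsigned int pktlen, int cell_log)
{
	return pktlen >> cell_log;
}

/* Nonzero if a packet of pktlen bytes would take the "slot > 255"
 * branch of qdisc_l2t(), where the time is extrapolated from
 * data[255] rather than read directly from the table. */
static int needs_extrapolation(unsigned int pktlen, unsigned int mtu)
{
	return slot_for_len(pktlen, cell_log_for_mtu(mtu)) > 255;
}
```

Under these assumptions, mtu 9000 gives 64-byte cells, so a 60000-byte TSO
buffer lands far beyond slot 255 and its transmit time is extrapolated. With
mtu 40000 the cells are 256 bytes wide and the same buffer is looked up
directly in the table.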

--

From: Ryousei Takano
Date: Wednesday, November 4, 2009 - 9:31 am

Hi Eric,

Thanks for your suggestion.

You are right!
I am using TSO. The myri10ge driver is passing 64KB packets to the NIC.
I changed the class mtu parameter to 64000 instead of 9000.

Here is the result:
1.000 1.00
2.000 2.01
3.000 2.99
4.000 4.01
5.000 5.01
6.000 6.04
7.000 7.06
8.000 8.09
9.000 9.11
9.900 9.64

It's not so bad!
For more information, I updated the results on my page.

Best regards,
Ryousei
--

From: Eric Dumazet
Date: Wednesday, November 4, 2009 - 10:03 am

In fact, I gave you 40000 because rtab will contain 256 elements from 0 to 65280

If you use 64000, you lose some precision (for small packets for example)


--

From: Ryousei Takano
Date: Thursday, November 5, 2009 - 12:08 am

Hi Eric,

I see.

In my experiment, it is not a very big problem.  I do not send short packets.
I got almost the same result in both cases, "mtu 64000" and "mtu
40000 quantum 60000".

Anyway, setting an mtu larger than the physical MTU does not quite make sense.

Best regards,
Ryousei
--

From: Eric Dumazet
Date: Thursday, November 5, 2009 - 12:10 am

The tc class mtu is a hint given to the stack about the average packet
size, i.e. it is not related to the physical MTU (because of TSO)

You could use the same mtu, but disable TSO on the device
--

From: Ryousei Takano
Date: Thursday, November 5, 2009 - 3:15 am

Hi Eric,

I got it.
Thanks for your explanation.

Best regards,
Ryousei
--
