Re: 1gbit LAN/NIC performance, queue speed bug?

From: Robert Lewandowski
Date: Tuesday, November 16, 2010 - 4:52 am

Hello,

PROBLEM: transfer speed is ONLY HALF if queue is defined in pf.conf 
although queue is 950Mbit (1000Mbit-5%)
pf disabled: 768 Mbits/sec
pf enabled, queue 950Mbit: 337 Mbits/sec

ANALYSIS:

- OpenBSD 4.8 default installation.
- Test made between OpenBSD 4.8 and Debian Linux.
(between two Debian systems speed is more than 900Mbit/s)

*********************************************************
LAN interface: Intel PRO/1000 PT Desktop Adapter (PCIe, model: EXPI9300PTBLK)
DMESG: em0 at pci1 dev 0 function 0 "Intel PRO/1000 PT (82572EI)" rev 0x06: apic 1 int 16 (irq 5), address 00:1b:21:05:1f:39
*********************************************************
Default settings of TCP window size:
net.inet.tcp.recvspace=16384
net.inet.tcp.sendspace=16384
*********************************************************

1a) pf disabled

root@router-test (/root)# iperf -i 1 -t 3 -c 10.0.0.6
------------------------------------------------------------
Client connecting to 10.0.0.6, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 10.0.0.8 port 27600 connected with 10.0.0.6 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  54.7 MBytes    459 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  1.0- 2.0 sec  54.7 MBytes    458 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  2.0- 3.0 sec  54.7 MBytes    459 Mbits/sec

1b) pf enabled, no queue

root@router-test (/root)# iperf -i 1 -t 3 -c 10.0.0.6
------------------------------------------------------------
Client connecting to 10.0.0.6, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 10.0.0.8 port 46912 connected with 10.0.0.6 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  53.9 MBytes    452 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  1.0- 2.0 sec  52.6 MBytes    441 Mbits/sec
[ ID] ...
From: Joel Sing
Date: Tuesday, November 16, 2010 - 8:14 am

The default length for a queue is 50 packets - this only allows you to queue 
around 75,000 bytes and the burstiness of TCP slow-start is likely to well 
exceed this in your configuration (due to the BDP). I'd suggest increasing 
the queue length - also run 'pfctl -vvs queue' or 'systat queue' and see 
what's happening with regard to packet drops.
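
A rough sketch of the arithmetic behind this estimate, in Python. The 1500-byte packet size and the 1 ms round-trip time are assumptions chosen for illustration (neither is stated in the thread); only the 50-packet default and the 950Mbit queue rate come from the messages above.

```python
# Compare what a packet-count queue limit can hold against the
# bandwidth-delay product (BDP) of the path.

DEFAULT_QLIMIT = 50      # packets, the default quoted above
PKT_BYTES = 1500         # assumed full-size packet


def queue_capacity_bytes(qlimit, pkt_bytes=PKT_BYTES):
    """Bytes that fit in a queue limited to `qlimit` full-size packets."""
    return qlimit * pkt_bytes


def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return rate_bps * rtt_s / 8


if __name__ == "__main__":
    assumed_rtt_s = 0.001  # hypothetical 1 ms round trip, not measured here
    print("queue capacity:", queue_capacity_bytes(DEFAULT_QLIMIT), "bytes")
    print("BDP at 950 Mbit/s:", bdp_bytes(950e6, assumed_rtt_s), "bytes")
```

With these assumed values the BDP (roughly 119 KB) is larger than the 75,000-byte queue, which is the situation described above. With a much shorter round-trip time the BDP would fit inside the queue.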
-- 

   "Stop assuming that systems are secure unless demonstrated insecure;
    start assuming that systems are insecure unless designed securely."
          - Bruce Schneier

From: RLW
Date: Tuesday, November 16, 2010 - 9:04 am

root@router-test (/root)# systat queue

QUEUE          BW SCH PRIO     PKTS    BYTES DROP_P DROP_B QLEN BORROW SUSPEN    P/S    B/S
root_em0    1000M cbq    0  1947967 2879364K      0      0    0      0      0  29412 44525K
  q_lan      950M cbq       1947967 2879364K      0      0    0      0      0  29412 44525K



root@router-test (/root)# pfctl -vvs queue

queue root_em0 on em0 bandwidth 1Gb priority 0 cbq( wrr root ) {q_lan}
   [ pkts:    4793481  bytes: 7256036778  dropped pkts:      0 bytes:      0 ]
   [ qlength:   0/ 50  borrows:      0  suspends:      0 ]
   [ measured: 29385.4 packets/s, 355.86Mb/s ]
queue  q_lan on em0 bandwidth 950Mb cbq( default )
   [ pkts:    4793481  bytes: 7256036778  dropped pkts:      0 bytes:      0 ]
   [ qlength:   0/ 50  borrows:      0  suspends:      0 ]
   [ measured: 29385.4 packets/s, 355.86Mb/s ]
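
A quick consistency check on the "measured" line above, sketched in Python: dividing the byte rate by the packet rate gives the average packet size. The sketch assumes that "Mb/s" in the pfctl output means 10^6 bits per second; that is an assumption, not something stated in the thread.

```python
# Average packet size implied by a packets/s figure and a Mb/s figure.

def avg_packet_bytes(packets_per_s, mbits_per_s):
    """Average bytes per packet, assuming 1 Mb/s = 10**6 bits per second."""
    bytes_per_s = mbits_per_s * 1e6 / 8
    return bytes_per_s / packets_per_s


if __name__ == "__main__":
    # Figures copied from the pfctl -vvs queue output above.
    print(round(avg_packet_bytes(29385.4, 355.86), 1), "bytes per packet")
```

The result is close to a 1500-byte packet plus a 14-byte Ethernet header, so under that assumption the queue appears to be carrying full-size packets rather than a flood of small ones.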


best regards,
Robert Lewandowski

From: RLW
Date: Wednesday, November 17, 2010 - 3:01 am

If I am reading it right, no packets are dropped.

Changing values like:
kern.somaxconn
net.inet.ip.maxqueue
net.bpf.bufsize
net.bpf.maxbufsize
net.inet.ipcomp.enable
net.inet.tcp.ackonpush
net.inet.tcp.ecn

does not help either. It only has some influence on network speed with 
PF disabled. With PF enabled, speed is always around 350mbit/s :((

So, any new ideas about debugging the problem, or a possible solution?


best regards,
Robert Lewandowski

From: RLW
Date: Wednesday, November 17, 2010 - 4:21 am

OK, I set qlimit to 200 and then 500; no difference.

queue root_em0 on em0 bandwidth 1Gb priority 0 qlimit 500 cbq( wrr root ) {q_lan}
   [ pkts:     858820  bytes: 1300055838  dropped pkts:      0 bytes:      0 ]
   [ qlength:   0/500  borrows:      0  suspends:      0 ]
   [ measured: 29222.9 packets/s, 353.90Mb/s ]
queue  q_lan on em0 bandwidth 950Mb qlimit 500 cbq( borrow default )
   [ pkts:     858820  bytes: 1300055838  dropped pkts:      0 bytes:      0 ]
   [ qlength:   0/500  borrows:      0  suspends:      0 ]
   [ measured: 29222.9 packets/s, 353.90Mb/s ]


best regards,
RLW

From: James Records
Date: Wednesday, November 17, 2010 - 10:54 am

What does CPU usage look like when this is happening?  Are there any other
resources that appear to be constrained?

J


From: RLW
Date: Wednesday, November 17, 2010 - 1:39 pm

Thanks for all the answers, but the problem still exists.

To sum up:

OpenBSD 4.8 default install

cpu0: Intel(R) Celeron(R) CPU 2.80GHz ("GenuineIntel" 686-class) 2.80 GHz
npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
em0 at pci1 dev 0 function 0 "Intel PRO/1000 PT (82572EI)" rev 0x06: apic 1 int 16 (irq 5), address XX:XX:XX:XX:XX:XX

1.
- tcp send and receive window @ default
- pf disabled
- transfer speed on em0 tested by iperf: 458 Mbits/sec
- top shows:

load averages:  0.90,  0.42,  0.18
25 processes:  1 running, 23 idle, 1 on processor
CPU states:  0.6% user,  0.0% nice, 51.5% system, 33.5% interrupt, 14.4% idle
Memory: Real: 9164K/43M act/tot  Free: 442M  Swap: 0K/759M used/tot


2.
- tcp send and receive window: 131072
- pf disabled
- transfer speed on em0 tested by iperf: 767 Mbits/sec
- top shows:

load averages:  0.85,  0.56,  0.29
25 processes:  1 running, 23 idle, 1 on processor
CPU states:  1.6% user,  0.0% nice, 71.5% system, 26.9% interrupt,  0.0% idle
Memory: Real: 9172K/43M act/tot  Free: 442M  Swap: 0K/759M used/tot


3.
- tcp send and receive window: 131072
- pf enabled, no queue
- transfer speed on em0 tested by iperf: 677 Mbits/sec
- top shows:

load averages:  0.84,  0.58,  0.38
25 processes:  1 running, 23 idle, 1 on processor
CPU states:  1.4% user,  0.0% nice, 70.1% system, 28.5% interrupt,  0.0% idle
Memory: Real: 9184K/44M act/tot  Free: 441M  Swap: 0K/759M used/tot


4.
- tcp send and receive window: 131072
- pf enabled
- added to the default pf.conf (qlimit raised from 50 to 500, as Joel Sing 
suggested):
altq on em0 cbq bandwidth 1Gb qlimit 500 queue { q_lan }
queue q_lan bandwidth 950Mb qlimit 500 cbq (default)
- transfer speed on em0 tested by iperf: 337 Mbits/sec
- top shows:

load averages:  0.94,  0.68,  0.45
25 processes:  1 running, 23 idle, 1 on processor
CPU states:  0.6% user,  0.0% nice, 79.4% system, 20.0% interrupt,  0.0% idle
Memory: Real: 9184K/44M act/tot  Free: 441M  Swap: ...
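
The jump from case 1 to case 2 in the summary above is what one would expect if the 16 KB default window was the limit. A minimal sketch in Python, assuming the simple model "at most one window per round trip" and assuming case 1 was purely window-limited (the thread does not confirm either; no round-trip time was measured):

```python
# Window-limited TCP throughput: at most one window per round trip.

def implied_rtt_s(window_bytes, throughput_bps):
    """Round-trip time that would make `window_bytes` the bottleneck."""
    return window_bytes * 8 / throughput_bps


def window_ceiling_bps(window_bytes, rtt_s):
    """Upper bound on throughput for a given window and round-trip time."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    # Case 1: 16384-byte window, 458 Mbits/sec measured by iperf.
    rtt = implied_rtt_s(16384, 458e6)
    print("implied RTT: %.3f ms" % (rtt * 1e3))
    # Cases 2-4: 131072-byte window at the same implied RTT.
    print("ceiling with 128 KB window: %.0f Mbit/s"
          % (window_ceiling_bps(131072, rtt) / 1e6))
```

Under these assumptions the 131072-byte window allows far more than the link can carry, so the window size would no longer be the limiting factor in cases 2 to 4.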
From: RLW
Date: Wednesday, November 17, 2010 - 2:46 pm

Hello,

I see that while I am testing network speed with iperf, 100% of the CPU is 
being used. Is that normal for a default install of OpenBSD 4.8 with the 
default pf.conf?

I have a second computer exactly like the one on which I am running these 
tests (IBM ThinkCentre A51P), but with a P4 3 GHz CPU with 2 MB cache (not 
a Celeron 2.8), and the same thing happens on it (100% CPU).

The LAN interface is an Intel PRO/1000 PT Desktop Adapter (PCIe, model: 
EXPI9300PTBLK), and it is the only PCIe adapter in the computer (the 
integrated Broadcom NIC may also be PCIe, but it is not used).

So the conclusion might be:
- there is a problem with my Intel NIC model/chipset
- there is a problem with the em driver
- there is a problem with my hardware (I need a server motherboard with 
PCIe and 64-bit 66 MHz PCI)
- I need a faster CPU than a P4 3 GHz?


------------
best regards,
RLW

From: RLW
Date: Thursday, November 18, 2010 - 2:45 am

root@router-test (/root)# vmstat -i
interrupt                       total     rate
irq0/clock                    4836013      100
irq83/em0                     2771337       57
irq83/bge0                          1        0
irq81/pciide0                    4790        0
irq85/ichiic0                  172044        3
Total                         7784185      160


http://erydium.pl/upload/vmstat.gif

http://erydium.pl/upload/systat.gif


----
best regards,
RLW

From: Schöberle Dániel
Date: Thursday, November 18, 2010 - 5:42 am

or

- there is a problem with iperf  :)

How about measuring with something else? Did you try tcpbench? Or something
even simpler, like scp-ing from /dev/null to /dev/null? With pf and queues
enabled you can monitor the B/S rate.

From: RLW
Date: Thursday, November 18, 2010 - 8:41 am

There is no tcpbench in packages for 4.8 or for Debian Linux.

1.
transferring file by scp from router-test to linux machine:
transfer speed: 16.1MB/s ~ 128.8 Mbits/s

root@router-test (/root)# top

load averages:  1.59,  1.05,  0.67
28 processes:  2 running, 25 idle, 1 on processor
CPU states: 33.3% user,  0.0% nice, 47.9% system, 14.2% interrupt,  4.6% idle
Memory: Real: 11M/80M act/tot  Free: 405M  Swap: 0K/759M used/tot


root@router-test (/root)# systat queue

2 users    Load 1.49 0.99 0.74                      Thu Nov 18 16:27:25 2010

QUEUE               BW SCH  PR  PKTS BYTES DROP_P DROP_B QLEN BORR SUSP P/S  B/S
root_em0         1000M cbq   0   10M   14G      0      0    0    0    0 122  17M
  q_lan           950M cbq       10M   14G      0      0    0    0    0 122  17M


2.
transferring file back from linux machine to router-test:
transfer speed:  19.9MB/s ~ 159.2 Mbits/s

root@router-test (/root)# top

load averages:  1.13,  0.95,  0.69
25 processes:  1 running, 23 idle, 1 on processor
CPU states: 40.1% user,  0.0% nice, 33.5% system, 26.3% interrupt,  0.0% idle
Memory: Real: 11M/80M act/tot  Free: 405M  Swap: 0K/759M used/tot


3.
for comparison, transfer speed between two Debian boxes:
- tested by iperf: 940 Mbits/sec
- transferring a file by scp: 42.6MB/s ~ 340.8 Mbits/s

top:

Tasks:  81 total,   1 running,  80 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.8%us, 13.8%sy,  0.0%ni, 79.0%id,  0.0%wa,  1.0%hi,  4.3%si,  0.0%st
Mem:   1028836k total,   924868k used,   103968k free,    23800k buffers
Swap:  1951856k total,    51524k used,  1900332k free,   617788k cached


----
best regards
RLW

From: David Coppa
Date: Thursday, November 18, 2010 - 8:52 am

Because it's in base: /usr/bin/tcpbench

ciao,
david

From: RLW
Date: Thursday, November 18, 2010 - 9:29 am

I removed the Intel NIC and ran the test on the integrated Broadcom Gbit NIC 
to see whether the problem is in the em driver.

bge0 at pci1 dev 11 function 0 "Broadcom BCM5705K" rev 0x03, BCM5705 A3 (0x3003): apic 1 int 16 (irq 5), address XX:XX:XX:XX:XX:XX
brgphy0 at bge0 phy 1: BCM5705 10/100/1000baseT PHY, rev. 2


1.
pf enabled, queue 950mbit, qlimit 500
iperf test: 410 Mbits/sec

root@router-test (/root)# top

load averages:  0.95,  0.53,  0.26
23 processes:  1 running, 21 idle, 1 on processor
CPU states:  1.2% user,  0.0% nice, 84.4% system, 14.4% interrupt,  0.0% idle
Memory: Real: 8972K/42M act/tot  Free: 443M  Swap: 0K/759M used/tot


2. test made between two OpenBSD 4.8 boxes (there is no tcpbench for debian)


transfers by tcpbench:

Conn:   1 Mbps:      399.972 Peak Mbps:      406.093 Avg Mbps:      399.972
       133996       45932008      370.419  100.00%
Conn:   1 Mbps:      370.419 Peak Mbps:      406.093 Avg Mbps:      370.419
       134999       46833528      373.920  100.00%
Conn:   1 Mbps:      373.920 Peak Mbps:      406.093 Avg Mbps:      373.920
       136074       43531224      323.953  100.00%
Conn:   1 Mbps:      323.953 Peak Mbps:      406.093 Avg Mbps:      323.953
       137002       41013960      353.950  100.00%
Conn:   1 Mbps:      353.950 Peak Mbps:      406.093 Avg Mbps:      353.950
       137996       50500448      406.442  100.00%
Conn:   1 Mbps:      406.442 Peak Mbps:      406.442 Avg Mbps:      406.442


root@router-test (/root)# top (while running tcpbench)

load averages:  1.26,  0.80,  0.49
22 processes:  1 running, 20 idle, 1 on processor
CPU states:  0.0% user,  0.0% nice, 77.2% system, 15.6% interrupt,  7.2% idle
Memory: Real: 8752K/43M act/tot  Free: 442M  Swap: 0K/759M used/tot


root@router-test (/root)# systat queue (while running tcpbench)

2 users    Load 0.82 0.69 0.51                      Thu Nov 18 17:13:10 2010

QUEUE               BW SCH  PR  PKTS BYTES DROP_P DROP_B QLEN BORR SUSP 
P/S  B/S
root_bge0        ...
From: Claudio Jeker
Date: Thursday, November 18, 2010 - 9:41 am

No, the problem is altq. altq(4) was written when 100Mbps was common and
people shaped traffic in the low megabit range. It seems to hit a wall when
doing hundreds of megabits. I guess someone needs to run a profiling kernel,
see where all that time is spent, and then optimize altq(4).


From: RLW
Date: Thursday, November 18, 2010 - 9:54 am

It's nice to hear from an OpenBSD developer on this matter.

I am wondering who is gonna be that "someone" ;) and when it could happen.

Claudio, can you add this problem as a bug to fix, maybe in the next release?


----
best regards,
RLW

From: Stuart Henderson
Date: Thursday, November 18, 2010 - 4:24 pm

The "someone" running a profiling kernel to identify the hot spots could be you.

cd /sys/arch/<arch>/config
config -p <kernelname>
build a kernel from the ../compile/<kernelname>.PROF directory in the usual way
kgmon -b to start profiling
(generate some traffic)
kgmon -h to stop profiling
kgmon -p to dump stats
gprof /bsd gmon.out to read stats...

Assuming you're interested in routed traffic (rather than queuing traffic
generated on a box itself), make sure you run the traffic source and sink
on other machines routing through the altq box, don't source/sink traffic
on the altq box itself.

From: RLW
Date: Friday, November 19, 2010 - 2:26 am

Sorry to the other developers for not recognizing them.

I am interested in getting network traffic on a gbit LAN to around 
800-900mbit/s with altq, not 350mbit/s, so it's rather the option you 
mentioned in brackets ;)

I am a rather ordinary OpenBSD user, and to be honest I think it would be 
faster and simpler if I just gave some OpenBSD developer access to this box.

----
best regards,
RLW

From: Amit Kulkarni
Date: Sunday, November 21, 2010 - 9:34 pm

Hi,

Could somebody include this in the FAQ? I found Daniel Hartmeier's
personal page, which shows how to get a stack trace and line numbers. I
know that the stack trace info is included somewhere on openbsd.org,
but the way Daniel presented it was much better. That particular
presentation, or a link to it, should be in the FAQ too. It would be really
useful.


From: Tomas Bodzar
Date: Sunday, November 21, 2010 - 10:04 pm

Like this one http://www.openbsd.org/faq/faq2.html#Bugs ? There is an example at the end.

From: RLW
Date: Thursday, December 30, 2010 - 5:51 am

Hello again ;)

I finally had time to do kernel profiling.

So we have:
- default OpenBSD 4.8 install
- em0 nic (at pci express slot)
- default sysctl
- definition of queue in pf.conf:
altq on em0 cbq bandwidth 1Gb queue { q_lan }
queue q_lan bandwidth 950Mb cbq (default)

- low speed between Linux Debian box (as iperf server) and OpenBSD box 
(as iperf client):

[ ID] Interval       Transfer     Bandwidth
[  3] 41.0-42.0 sec  17.1 MBytes    144 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 42.0-43.0 sec  17.2 MBytes    144 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 43.0-44.0 sec  17.1 MBytes    144 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 44.0-45.0 sec  17.2 MBytes    144 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 45.0-46.0 sec  17.2 MBytes    144 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 46.0-47.0 sec  17.1 MBytes    143 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 47.0-48.0 sec  17.2 MBytes    144 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 48.0-49.0 sec  17.1 MBytes    144 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 49.0-50.0 sec  17.1 MBytes    144 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-50.0 sec    858 MBytes    144 Mbits/sec
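
The summary line above can be cross-checked by hand. The Python sketch below assumes that iperf reports "MBytes" in units of 2^20 bytes and "Mbits" in units of 10^6 bits. That convention is not stated in the thread, but it reproduces both this result and the 54.7 MBytes / 459 Mbits/sec line from the first message.

```python
# Convert an iperf "Transfer" figure and interval into Mbits/sec.

MBYTE = 2 ** 20   # assumed: iperf MBytes are binary megabytes
MBIT = 10 ** 6    # assumed: iperf Mbits are decimal megabits


def iperf_mbits_per_s(transfer_mbytes, seconds):
    """Bandwidth in Mbits/sec for `transfer_mbytes` moved in `seconds`."""
    return transfer_mbytes * MBYTE * 8 / seconds / MBIT


if __name__ == "__main__":
    print(round(iperf_mbits_per_s(858, 50)))    # 50-second summary line
    print(round(iperf_mbits_per_s(54.7, 1)))    # 1-second line, pf disabled
```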

- stats from kernel profiling at:
http://erydium.pl/upload/20101230_profiling.txt


best regards,
RLW

From: Claudio Jeker
Date: Thursday, December 30, 2010 - 6:36 am

From the profile output:

index  %time    self descendents  called+self    name    	index
[1]     83.9    0.00      359.49                 sched_idle [1]
You spent > 80% in idle. So while forwarding all that traffic the box was
mostly idle. 

Interesting are:
[6]      3.3    0.02       14.26 4028087         acpi_get_timecount [6]
[7]      3.2    0.14       13.66 3854116         binuptime [7]
I guess these are that high up in the profile because of altq.

These seem to be altq related:
[10]     2.9    0.05       12.22 3234667         cbq_pfattach [10]
[13]     2.2    0.03        9.18 2386574         tbr_dequeue [13]
[14]     2.1    0.38        8.71 2386574         rmc_dequeue_next [14]

Now this profile shows one thing: the problem is not CPU bound. Instead, it
seems that the TBR (token bucket regulator) is the problem. What seems to
happen is that the TBR runs low, the dequeue returns NULL, and a timeout
has to fire before the packets move.
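
To illustrate the kind of effect described above, here is a deliberately simplified token-bucket model in Python. It is not altq's code. It assumes that tokens are only topped up on a periodic timer tick and that the bucket depth caps how many bytes can go out per tick. The 100 Hz tick matches the clock interrupt rate in the vmstat -i output earlier in the thread; the bucket depths are hypothetical values picked for the example, and the function name is made up here.

```python
# Toy model: a token bucket refilled once per timer tick.
# If the bucket is shallower than one tick's worth of traffic,
# throughput is capped by depth * hz, not by the configured rate.

def tbr_throughput_bps(rate_bps, bucket_bytes, hz, pkt_bytes=1500, seconds=1.0):
    """Achieved bits/s when packets can only be sent against tokens
    that are refilled at each of `hz` ticks per second."""
    tokens_per_tick = rate_bps / 8 / hz
    tokens = 0.0
    sent_bytes = 0
    for _ in range(int(seconds * hz)):
        # Refill, but never beyond the bucket depth.
        tokens = min(bucket_bytes, tokens + tokens_per_tick)
        # Send as many whole packets as the tokens allow.
        n = int(tokens // pkt_bytes)
        sent_bytes += n * pkt_bytes
        tokens -= n * pkt_bytes
    return sent_bytes * 8 / seconds


if __name__ == "__main__":
    for depth in (1_200_000, 450_000):  # hypothetical bucket depths in bytes
        mbps = tbr_throughput_bps(950e6, depth, hz=100) / 1e6
        print("depth %7d bytes -> %.0f Mbit/s" % (depth, mbps))
```

In this model a bucket deep enough to hold one tick's worth of traffic (about 1.19 MB at 950 Mbit/s and 100 Hz) reaches nearly the configured rate, while a shallower one is limited to depth times tick rate regardless of CPU load. Whether this is what happens inside altq on the box in question would need to be confirmed against the actual code.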


From: RLW
Date: Thursday, December 30, 2010 - 6:47 am

Although the kernel profiling stats show that the system spends 80% of its 
time in idle, top during the iperf test shows:

load averages:  0.40,  0.16,  0.11 15:43:02
27 processes:  1 running, 25 idle, 1 on processor
CPU states:  0.6% user,  0.0% nice, 77.6% system, 21.8% interrupt,  0.0% idle
Memory: Real: 10M/83M act/tot  Free: 402M  Swap: 0K/759M used/tot

   PID USERNAME PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND

I don't know what TBR is, but can I do something about it?


best regards,
RLW

From: Ted Unangst
Date: Thursday, December 30, 2010 - 10:47 am

I forget if the profile counts interrupts as coming from idle.  If
you're mostly forwarding traffic, the idle loop will be the top
function on the stack just about always.

From: RLW
Date: Tuesday, January 4, 2011 - 9:53 am

any new ideas about where the problem is and how to fix it?


best regards,
RLW

From: Daniel Gracia
Date: Tuesday, November 16, 2010 - 5:36 am

dmesg missing!

Your computer's horsepower will definitely affect the maximum bandwidth pf 
will be able to manage.

