Hi All,
I'm currently working on an embedded project (based on the Linux kernel)
that needs high throughput from a gigabit Ethernet controller and a
"small" CPU.
I've run a lot of tests, playing with jumbo frames, raw sockets, ...
and I've never exceeded ~25 Mbytes/s. So I decided to take a close look
at the packet socket transmission path.
The main bottleneck was the memcpy_fromiovec() function that is
called from packet_sendmsg() in af_packet.c.
It was consuming all my CPU resources to copy data from user space to
the socket buffer.
So I started to work on a hack that makes this transfer possible
without any memcpy.
Mainly, the hack is the implementation of two "features":
* Sending packets through a circular buffer shared between user and
kernel space, which minimizes the number of system calls. (This feature
currently exists only for the capture path, as used by libpcap.)
To sum up the user process :
- initialize a raw socket
- allocate N buffers into kernel space through a setsockopt() (TX ring),
- mmap() the allocated memory,
- fill M buffers with custom data, and update status of filled
buffers to ready (header of buffer: struct tpacket_hdr contains a
status field: TP_STATUS_KERNEL means free, TP_STATUS_USER means ready
to be sent, TP_STATUS_COPY means transmission ongoing)
- call send() procedure. The kernel will then send all buffers
set with TP_STATUS_USER. Status is set to TP_STATUS_COPY during
transfer and TP_STATUS_KERNEL when done.
* Zero copy mode. The CONFIG_PACKET_MMAP_ZERO_COPY feature flag
skips the CPU copy between the circular buffer and the socket buffer
allocated during send.
To send a packet without zero copy, if my understanding is
correct, we first allocate a socket buffer with sock_alloc_send_skb(),
then copy the data into the socket buffer, and finally hand
this sk_buff to the network card. With zero copy, the trick is to
bypass the data copy by substituting data pointers of allocated
sk_buff for ...

Hi Johann.
Did you try vmsplice and splice? It is the preferred way to do zero-copy.
-- Evgeniy Polyakov --
Not yet. I will run some tests using splice and let you know the performance.
Many thanks,
Johann
-- Johann Baudy johaahn@gmail.com --
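
For readers who have not used the pair Evgeniy suggests: vmsplice() makes a pipe reference user pages, and splice() then moves those pages from the pipe to another file descriptor. A minimal sketch, assuming Linux with glibc; the helper name splice_buffer_to_fd() is hypothetical and is not taken from Johann's test program:

```c
#define _GNU_SOURCE          /* exposes vmsplice() and splice() in <fcntl.h> */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: push len bytes from buf to out_fd through a pipe.
 * Returns the number of bytes moved, or -1 on error. */
ssize_t splice_buffer_to_fd(void *buf, size_t len, int out_fd)
{
    int pipefd[2];
    size_t done = 0;
    int failed = 0;

    if (pipe(pipefd) < 0)
        return -1;

    while (done < len && !failed) {
        struct iovec iov = {
            .iov_base = (char *)buf + done,
            .iov_len  = len - done,
        };

        /* Step 1: the pipe takes references to the user pages.
         * A default pipe holds a limited number of pages, so this may
         * queue less than requested. */
        ssize_t queued = vmsplice(pipefd[1], &iov, 1, 0);
        if (queued <= 0) {
            failed = 1;
            break;
        }

        /* Step 2: drain what was queued into the destination descriptor. */
        ssize_t left = queued;
        while (left > 0) {
            ssize_t moved = splice(pipefd[0], NULL, out_fd, NULL,
                                   (size_t)left, SPLICE_F_MOVE);
            if (moved <= 0) {
                failed = 1;
                break;
            }
            left -= moved;
        }
        done += (size_t)queued;
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return failed ? -1 : (ssize_t)done;
}
```

Each chunk costs two system calls, and the buffer must not be modified until the pipe has been drained. Whether the data is copied after the pipe depends on what out_fd is and on what its protocol implements.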
Hi Evgeniy,
I'm not able to exceed 15 Mbytes/s even with the vmsplice/splice duo,
due to some issues:
- I didn't manage to adjust the size of packets sent over the network (it
seems to be aligned to the page size). And the maximum packet size seems
to be the page size (4096).
- I need approximately two system calls (vmsplice and splice) for
~4096*8 bytes maximum, which is maybe a pipe limit.
- I'm still going through packet_sendmsg() (packet socket), which
allocates a sk_buff and copies all the data into it.
As a reference, with my "patch" I need to send more than 32 packets of
7200 bytes (PC network card limit) in one system call (send()),
without any sk_buff data copy, to reach 85 Mbytes/s.
Please find below my test program for vmsplice/splice:
Best regards,
Johann
#define _GNU_SOURCE /* must be defined before the first #include */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <sys/ioctl.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
int main (void)
{
    struct tpacket_req s_packet_req;
    uint32_t size, opt_len;
    int fd, i, ec, i_sz_packet = 7150;
    struct pollfd s_pfd;
    struct sockaddr_ll my_addr, peer_addr;
    struct ifreq s_ifr; /* points to one interface returned from ioctl */
    int len;
    int fd_socket;
    int i_nb_buffer = 64;
    int i_buffer_size = 8192;
    int i_index;
    int i_updated_cnt;
    int i_ifindex;
    int i_header_size;
    struct tpacket_hdr * ps_header_start;
    struct tpacket_hdr * ps_header;
    char buffer[8000];

    /* reset index */
    i_index = 0;

    fd_socket = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if(fd_socket == ...

From: "Johann Baudy" <johaahn@gmail.com>

I think you misunderstood what Evgeniy was asking of you. He was asking
how fast you can transfer data over this interface using a normal TCP
socket to a remote host, via sendfile() or splice().
--
Sorry for the misunderstanding.
TCP socket, transferring a 20 Mbytes file (located in initramfs) in a loop with sendfile(): 5.7 Mbytes/s
Best regards,
Johann
-- Johann Baudy johaahn@gmail.com --
Hi Johann.
And _THIS_ is a serious problem. Let's assume that sendfile is broken or the driver/hardware does not support scatter/gather and checksumming (does it?).
Can you saturate the link with pktgen (1) and with a usual TCP socket (2)? Assuming the second case fails, is it also broken because of the very slow copy from userspace?
-- Evgeniy Polyakov --
Could we see the code that was used to get these numbers? The problem may just be in the way that the calls to sendfile() have been coded. The TX code looks intriguing. Seems that some vendors are tinkering with VNIC ideas in order to bypass context switches and data copies. Maybe this is a cheap way to attain the same goals? --
Hi Evgeniy,
The driver and the hardware support DMA scatter/gather and checksum offloading.
With pktgen and the config below, I reached 85 Mbytes/s ~ link saturation (I've reached the same bitrate with raw socket + TX RING ZeroCopy patch):
#!/bin/sh
echo rem_device_all > /proc/net/pktgen/kpktgend_0
echo add_device eth0 > /proc/net/pktgen/kpktgend_0
echo max_before_softirq 10000 > /proc/net/pktgen/kpktgend_0
sleep 1
echo count 10000000 > /proc/net/pktgen/eth0
echo clone_skb 0 > /proc/net/pktgen/eth0
echo pkt_size 7200 > /proc/net/pktgen/eth0
echo delay 0 > /proc/net/pktgen/eth0
echo dst 192.168.0.1 > /proc/net/pktgen/eth0
echo dst_mac ff:ff:ff:ff:ff:ff > /proc/net/pktgen/eth0
echo start > /proc/net/pktgen/pgctrl
I can't saturate the link from user space with UDP, TCP or RAW sockets, due to copies and multiple system calls.
If the system does just one copy of the packet, it falls under 25 Mbytes/s. This is a simple memory bus which only runs at 100 MHz for data and instructions.
I think I've understood why my bitrate is so bad from userspace using a normal TCP, UDP or RAW socket.
That's why I'm working on this zero-copy solution (without a copy between user and kernel or between kernel buffer and socket buffer, and with a minimum of system calls). A kind of full zero-copy sending capability, where the HW accesses the same buffers as the user.
In fact, I'm just suggesting the symmetric of the packet mmap IO used for the capture process, with zero-copy capability, and I need to know what you think about it.
Thanks in advance,
Johann
-- Johann Baudy johaahn@gmail.com --
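
The handshake Johann proposes (each shared buffer is free, ready to send, or in flight, and one send() flushes every ready buffer) can be modelled in plain user-space C. This is only a sketch of the state machine, not the kernel patch: the names tx_ring, ring_reserve(), ring_commit() and ring_flush() are hypothetical, the frame sizes are arbitrary, and ring_flush() stands in for the work the kernel would do inside send().

```c
#include <stddef.h>

/* Hypothetical stand-ins for the three buffer states described in the thread. */
enum frame_status { FRAME_FREE = 0, FRAME_READY, FRAME_SENDING };

#define RING_FRAMES 8   /* arbitrary ring depth for the sketch */
#define FRAME_SIZE  64  /* arbitrary payload capacity per frame */

struct frame {
    enum frame_status status;
    size_t len;
    unsigned char data[FRAME_SIZE];
};

struct tx_ring {
    struct frame frames[RING_FRAMES];
    size_t user_next;    /* next frame the producer will fill */
    size_t kernel_next;  /* next frame the consumer will send */
};

/* Producer: get a pointer into the next free frame so the payload can be
 * written in place. Returns NULL when the ring is full. */
unsigned char *ring_reserve(struct tx_ring *r)
{
    struct frame *f = &r->frames[r->user_next];
    return f->status == FRAME_FREE ? f->data : NULL;
}

/* Producer: publish the reserved frame. The status is written last so the
 * consumer never sees a half-filled frame. Returns 0 on success. */
int ring_commit(struct tx_ring *r, size_t len)
{
    struct frame *f = &r->frames[r->user_next];
    if (f->status != FRAME_FREE || len > FRAME_SIZE)
        return -1;
    f->len = len;
    f->status = FRAME_READY;
    r->user_next = (r->user_next + 1) % RING_FRAMES;
    return 0;
}

/* Consumer: one call sends every consecutive ready frame.
 * xmit may be NULL; returns the number of frames sent. */
size_t ring_flush(struct tx_ring *r,
                  void (*xmit)(const unsigned char *data, size_t len))
{
    size_t sent = 0;
    while (r->frames[r->kernel_next].status == FRAME_READY) {
        struct frame *f = &r->frames[r->kernel_next];
        f->status = FRAME_SENDING;      /* transmission ongoing */
        if (xmit)
            xmit(f->data, f->len);
        f->status = FRAME_FREE;         /* buffer returns to the producer */
        r->kernel_next = (r->kernel_next + 1) % RING_FRAMES;
        sent++;
    }
    return sent;
}
```

The point of the design is visible in ring_flush(): its cost in system calls is one per batch, not one per packet, and the producer writes each payload directly into the shared frame.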
Hi Johann.
What is the bus width and is there burst mode support?
Not to point to the error in the speed calculation, just out of curiosity :)
But why does sendfile/splice not work the same? It is (supposed to be) a zero-copy sending interface, which should be even more optimal than your ring buffer approach, since it uses just a single syscall and no initialization of the data (well, there is page population and so on, but if the file is in the ramdisk, it is effectively zero overhead).
Can you run oprofile during sendfile() data transfer or
Well, I'm not against this patch, but you pointed to a bug (or wrong initialization in your code) in sendfile, which has higher priority imho :)
Actually if it is indeed a bug in the splice code then (if fixed) it can allow a simpler zero-copy solution for your problem.
-- Evgeniy Polyakov --
32 bits with burst support. This is a PPC 405 embedded into a Xilinx V4.
I've never used oprofile before. I will get more logs and let you know.
Just a question: I don't want to use TCP for the final application. Is it expected that the kernel executes packet_sendmsg() when using a packet socket with splice()? (This function does a memcpy from a buffer to a socket buffer.)
Or is there a dedicated path for splicing? Or maybe only in TCP read (I can see that the splice_read operator is redefined with tcp_splice_read())?
I've also faced some issues with the size of packets (it seems to be limited to the page size). It is really important for me to send large packets. I've just decreased the packet size of the pktgen script from 7200 to 4096 and the bitrate has fallen from 85 Mbytes/s to 50 Mbytes/s.
I understand that this is not a problem with TCP: when sending a file, we don't really care about the exact packet size. Do you know if there is a way to adjust the size?
And again, many thanks for your fast replies ;)
Johann Baudy
-- Johann Baudy johaahn@gmail.com --
Hi Johann.
So small PLB? Not OPB? Weird hardware :)
But nevertheless at most 400 MB/s with 100 MHz, so it looks like either there is no burst mode or weird NIC hardware (or something else :)
I used to easily saturate a 100 Mbit channel with 405gp(r) and the emac driver, which are better numbers than what you have with gige and sockets... Actually even the 405gp had a much wider PLB, so this could be an issue.
Likely your project will just DMA data from some sensor to a preallocated buffer, you will add headers and send the data, so the very low memory bus speed will not allow you to use sockets and thus TCP.
Having splice-friendly setup is possible, but I think raw socket
No, it will use sendpage() if hardware and driver support scatter/gather and checksum offloading. Since you say they do, then there should be no
What do you mean by packet size? MTU/MSS? In pktgen it means the size of the allocated skb, so it will eventually be split into smaller chunks, and the bigger the size you have, the fewer allocations will be performed.
Actually the fact that 7200 works at all is a bit surprising: your small machine has lots of RAM and is effectively unused during tests (i.e. no other allocations). Changing it to 4k should not decrease performance at all...
Do you have jumbo frames enabled?
-- Evgeniy Polyakov --
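
The 400 MB/s figure follows from the bus width and clock quoted in the thread (32 bits at 100 MHz). The sketch below reproduces it and adds a rough ceiling for a stream that crosses the bus several times. The crossing counts are an assumption made for illustration (one DMA read for a zero-copy send; CPU read, CPU write and DMA read for a send with one copy), and the model ignores caches, instruction fetches and burst inefficiency.

```c
/* Peak bus bandwidth in Mbytes/s (1 Mbyte = 10^6 bytes) for a bus of
 * width_bits clocked at clock_mhz, one transfer per cycle. */
double bus_peak_mbytes_per_s(unsigned width_bits, double clock_mhz)
{
    return (width_bits / 8.0) * clock_mhz;
}

/* Upper bound on payload throughput when every payload byte crosses the
 * bus `crossings` times (assumed model, see the text above). */
double stream_ceiling_mbytes_per_s(double peak_mbytes_per_s, unsigned crossings)
{
    return peak_mbytes_per_s / crossings;
}
```

bus_peak_mbytes_per_s(32, 100.0) gives 400. With three crossings per byte the ceiling is about 133 Mbytes/s before any other bus traffic is counted, which is why a single extra copy matters so much on this hardware.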
Yes, this is custom hardware (FPGA :)). There is no combo IPLB / DPLB,
Indeed, I've double checked, but pipe_to_sendpage() will end up in packet_sendmsg():
.splice_write = generic_splice_sendpage,
generic_splice_sendpage()
splice_from_pipe();
pipe_to_sendpage() from err = actor(pipe, buf, sd);
sock_sendpage() from file->f_op->sendpage()
sock_no_sendpage() from sock->ops->sendpage()
kernel_sendmsg()
sock_sendmsg();
packet_sendmsg() from sock->ops->sendmsg();
memcpy() :'(
I think a non-generic splice_write function should do the job.
I mean the transfer unit size (Ethernet frame length), which must be <= MTU.
Jumbo frames are enabled in the driver and the MTU size is set to 7200.
I'm currently using wireshark on a remote PC to check bitrate and format.
I think performance can decrease because the CPU will spend the same time to send 7200 or 4096 bytes, but the DMA will not (~50µs for 7200, ~30µs for 4096).
Thanks,
Johann
-- Johann Baudy johaahn@gmail.com --
Looks like you are trying to sendfile() over a packet socket. Both tcp and udp sockets have a sendpage method. Or your hardware or driver does not support the needed functionality, so tcp_sendpage() falls back to sock_no_sendpage(). From your dump I think it is the first case above.
Well, after I read it again, I found the word packet_sendmsg(), which explains everything. Please use tcp or udp
If you use jumbo frames, then yes, the bigger the allocation unit is (assuming allocation succeeded), the higher the speed will be, so this result is expected.
-- Evgeniy Polyakov --
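
Johann's explanation of the pktgen drop (roughly the same time spent per packet whatever its size) can be checked with a little arithmetic. A sketch, assuming 1 Mbyte = 10^6 bytes and a strictly constant per-packet cost; the function names are made up for illustration:

```c
/* Time spent per packet, in microseconds, given the packet size in bytes
 * and the measured throughput in Mbytes/s. */
double per_packet_us(double packet_bytes, double mbytes_per_s)
{
    return packet_bytes / mbytes_per_s;
}

/* Throughput in Mbytes/s predicted for another packet size if the
 * per-packet time stays the same. */
double predicted_mbytes_per_s(double packet_bytes, double packet_time_us)
{
    return packet_bytes / packet_time_us;
}
```

With the figures from the thread, per_packet_us(7200, 85) is about 84.7 µs, and predicted_mbytes_per_s(4096, 84.7) is about 48 Mbytes/s, close to the 50 Mbytes/s measured with pkt_size 4096.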
I'm finally able to run a full zero-copy mechanism with a UDP socket, as you said.
Unfortunately, I need at least one vmsplice() system call per UDP
packet.
A mere vmsplice() (mem to pipe) costs a lot (80µs of CPU), and the
splice() (pipe to socket) call is worse...
80µs is approximately the time needed to send 10 Kbytes at 1 Gbps. As I
need to send packets of 7200 bytes (with no frag)...
Unfortunately I can't use this mechanism. I've only reached 20 Mbytes/s.
You can find below an ftrace of vmsplice(), in case you find something
abnormal ... :) :
(The 80µs result is an average of vmsplice() durations measured with
gettimeofday(), WITHOUT FTRACE IN KERNEL CONFIG)
main-849 [00] .. 1 4154502892.139088: sys_gettimeofday <-ret_from_syscall
main-849 [00] .. 1 4154502892.139090: do_gettimeofday <-sys_gettimeofday
main-849 [00] .. 1 4154502892.139092: getnstimeofday <-do_gettimeofday
main-849 [00] .. 1 4154502892.139100: sys_vmsplice <-ret_from_syscall
main-849 [00] .. 1 4154502892.139107: fget_light <-sys_vmsplice
main-849 [00] .. 1 4154502892.139118: rt_down_read <-sys_vmsplice
main-849 [00] .. 1 4154502892.139120: __rt_down_read <-rt_down_read
main-849 [00] .. 1 4154502892.139124: rt_mutex_down_read <-__rt_down_read
main-849 [00] .. 1 4154502892.139132: pagefault_disable <-sys_vmsplice
main-849 [00] .. 1 4154502892.139136: pagefault_enable <-sys_vmsplice
main-849 [00] .. 1 4154502892.139141: get_user_pages <-sys_vmsplice
main-849 [00] .. 1 4154502892.139147: find_extend_vma <-get_user_pages
main-849 [00] .. 1 4154502892.139150: find_vma <-find_extend_vma
main-849 [00] .. 1 4154502892.139158: _cond_resched <-get_user_pages
main-849 [00] .. 1 4154502892.139161: follow_page <-get_user_pages
main-849 [00] .. 1 4154502892.139165: rt_spin_lock <-follow_page
...

Hi Johann.
vmsplice() can be slow, try to inject header via usual send() call, or
The amount of gettimeofday() and friends is excessive, but it can be the trace tool itself.
kill_fasync() also took too much time (top CPU user is at bottom I suppose?), do you use SIGIO?
Also vma traversal and page checking is not what will be done in network code and your project, so it also adds overhead.
Please try without vmsplice() at all, usual splice()/sendfile() _has_ to saturate the link, otherwise we have a
Not to distract you from the project, but you still can do the same with existing methods and a smaller amount of work. But I should be the last to say that creating tricky hacks to implement an idea should be abandoned in favour of the standard (even slow) methods :)
-- Evgeniy Polyakov --
Hi Johann,
Something like this has been done in the PF_RING socket, which is part of the ntop project infrastructure.
Take care.
Truly,
Robert Iakobashvili
--
Thanks Robert,
The architecture of PF_RING seems to be really similar to packet mmap IO, which optimizes the capture process. Is it planned to replace it?
I'll try it to measure performance.
Best regards,
Johann
On Fri, Sep 5, 2008 at 12:28 PM, Robert Iakobashvili
-- Johann Baudy johaahn@gmail.com --
