Re: Fwd: Packet mmap: TX RING and zero copy

Previous thread: Channel bonding with e1000 by Carsten Aulbert on Friday, September 5, 2008 - 1:36 am. (6 messages)

Next thread: [PATCH] hso.c against 2.6.27-rc5 throttle/unthrottle to prevent loss of serial data by Denis Joseph Barrow on Friday, September 5, 2008 - 2:35 am. (5 messages)
From: Johann Baudy
Date: Friday, September 5, 2008 - 2:17 am

vmsplice() is short in comparison to splice()  ~ 200us !
This was just to show you that even this vmsplice() duration of 80us,
which is needed for each packet, is too long for sending only 1 packet.
I really need a mechanism that allows sending ~40 packets of 7200 bytes
in one system call, to keep some CPU resources for other things.
I've only observed a small performance difference with and without
gettimeofday() (< 1MB/s). I've used it to do a light FTRACE and to get

Between kill_fasync() and sys_gettimeofday(), I thought that we had
returned to user space.

I've already tried sendfile(), only with a standard TCP/UDP socket. I
have not saturated the link.
I understand your point that common solutions are always better than
multiple hacks.
But I think that I have the same motivation as the packet mmap I/O developers.
This feature was introduced to make the capture process of raw sockets
efficient. I just want to reach the same goal for transmission using
the same mechanism.
We use those features only if we need performance at the driver level.

Thanks,
Johann




-- 
Johann Baudy
johaahn@gmail.com
--

From: Evgeniy Polyakov
Date: Friday, September 5, 2008 - 4:31 am

Hi Johann.


Hmmm... splice()/sendfile() should be able to send the whole file in


This worries me a lot: sendfile should be a single syscall which very
optimally creates network packets, taking into account MTU and hardware
capabilities. I do believe it is a problem with userspace code.

-- 
	Evgeniy Polyakov
--

From: Johann Baudy
Date: Friday, September 5, 2008 - 5:44 am

I was talking about vmsplice()/splice() which seems to me the only way
I don't understand your point, how can I check?

Yes, and this is what it does (only one single syscall).
No printf, only one sendfile of 10MB file over TCP socket

To summarize the ongoing test status:

With vmsplice()/splice():
  - I need multiple calls of vmsplice() for one call of splice(); the
    ratio seems to be limited by the pipe capacity (16 pages: 64K).
  - Each vmsplice() call specifies the size of the UDP packet, which
    means 1 syscall per packet :(
  - In UDP, bitrate is < 20MB/s.

With sendfile():
  - Only one system call for 10MB in TCP (in UDP I have to split it
    into chunks of 61440 bytes).
  - In TCP, bitrate is limited due to remote
  - In UDP, this 61440-byte limit is really inferior to my 7200*40
    packets, which allow me to saturate the link (during 2ms only...)
    with my circular buffer.
  - Bitrate is < 20MB/s.
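
The vmsplice()/splice() pattern summarized above can be sketched in user
space as follows. This is a minimal illustration, not the test program
used in this thread: send_packets_via_pipe() is a hypothetical helper, and
out_fd stands for any descriptor that splice() accepts (a connected UDP
socket in the scenario discussed here). For simplicity it drains the pipe
after every vmsplice() instead of batching up to the pipe capacity, so it
shows the lower bound of one vmsplice() call per packet.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: move npkts packets of pkt_size bytes each from buf
 * to out_fd through a pipe. Returns the number of bytes moved, or -1.
 */
static ssize_t send_packets_via_pipe(int out_fd, char *buf,
                                     size_t pkt_size, size_t npkts)
{
    int pfd[2];
    ssize_t total = 0;

    assert(pkt_size > 0);
    if (pipe(pfd) < 0)
        return -1;

    for (size_t i = 0; i < npkts; i++) {
        struct iovec iov = { .iov_base = buf + i * pkt_size,
                             .iov_len = pkt_size };

        while (iov.iov_len > 0) {
            /* Map the user pages of (the rest of) this packet into the pipe. */
            ssize_t queued = vmsplice(pfd[1], &iov, 1, 0);
            if (queued <= 0)
                goto fail;
            iov.iov_base = (char *)iov.iov_base + queued;
            iov.iov_len -= (size_t)queued;

            /* Drain exactly what was queued so the pipe never fills up. */
            while (queued > 0) {
                ssize_t n = splice(pfd[0], NULL, out_fd, NULL,
                                   (size_t)queued, 0);
                if (n <= 0)
                    goto fail;
                queued -= n;
                total += n;
            }
        }
    }
    close(pfd[0]);
    close(pfd[1]);
    return total;

fail:
    close(pfd[0]);
    close(pfd[1]);
    return -1;
}
```

Because vmsplice() references the user pages rather than copying them,
the caller must not modify buf until the data has been consumed from the
pipe.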





Thanks,
Johann

-- 
Johann Baudy
johaahn@gmail.com
--

From: Evgeniy Polyakov
Date: Friday, September 5, 2008 - 6:16 am

[Empty message]
From: Johann Baudy
Date: Friday, September 5, 2008 - 6:29 am

So, it seems that there is something wrong in UDP here, because TCP
works properly with the same code.
If the size argument of sendfile() exceeds 61440, sendfile() returns 61440
and sends no data to the device ...
I will try to investigate it, but for my app sendfile() is not
workable. Only vmsplice() could be, but it is too slow.
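
The workaround mentioned earlier in the thread, splitting a UDP transfer
into sendfile() calls of at most 61440 bytes, could look like the sketch
below. sendfile_chunked() is a hypothetical helper name, and the chunk
size is a parameter rather than a constant because the 61440 figure comes
from the observations reported in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send the first `total` bytes of in_fd to out_fd,
 * never asking sendfile() for more than `chunk` bytes per call.
 * Returns the number of bytes sent, or -1 on error.
 */
static ssize_t sendfile_chunked(int out_fd, int in_fd,
                                size_t total, size_t chunk)
{
    off_t off = 0;      /* sendfile() advances this, not the fd position */
    size_t sent = 0;

    assert(chunk > 0);
    while (sent < total) {
        size_t want = total - sent < chunk ? total - sent : chunk;
        ssize_t n = sendfile(out_fd, in_fd, &off, want);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)     /* end of input file reached early */
            break;
        sent += (size_t)n;
    }
    return (ssize_t)sent;
}
```

A call such as sendfile_chunked(sock, file_fd, 10 * 1024 * 1024, 61440)
would then issue one system call per chunk, which is the overhead the
TX ring approach is trying to avoid.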

Thanks,
Johann

-- 
Johann Baudy
johaahn@gmail.com
--

From: Evgeniy Polyakov
Date: Friday, September 5, 2008 - 6:37 am

That's a bug. Likely no one sends file content via udp, so it was not

Well, you can mmap an empty file into RAM, lock it there, DMA data from
the sensor to the mmapped area, add headers (just in the mapped area)
and then sendfile that file. Single syscall, zero-copy, standard
interfaces :)

-- 
	Evgeniy Polyakov
--

From: Johann Baudy
Date: Friday, September 5, 2008 - 6:55 am

OK, but it seems that there is no way to control the packet format
(beginning of packet and size of packet).
Each packet must start with a specific header (in my app). This is a
kind of streaming.

Thanks,
Johann





-- 
Johann Baudy
johaahn@gmail.com
--

From: Evgeniy Polyakov
Date: Friday, September 5, 2008 - 7:19 am

Hi Johann.


No need to run FTRACE, the code should be audited and probably some debug
prints added to determine why sendfile() decides to exit early with
UDP. I will try to do it if time permits this weekend, although I'm

You can always provide a global offset where to put the next packet.
You can adjust it to put a header before each data frame, and then DMA
the frame content according to that offset.

A transmitting packet socket is needed for those who want to implement
their own low-level protocol unsupported by the kernel, so to transfer data
over UDP or TCP over IP at the highest speeds, one should use
existing methods. This does not of course mean that anyone _has_ to do
it, it is always very fun to find new ways like your patch.

-- 
	Evgeniy Polyakov
--

From: Johann Baudy
Date: Friday, September 5, 2008 - 7:45 am

I've finally made the test:
Packet is not going through device due to this test:
	if (inet->cork.length + size > 0xFFFF - fragheaderlen) {
		ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
		return -EMSGSIZE;
	}
in ip_append_page()

inet->cork.length reaches 61448, then this failure occurs
(size = 4096).
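
The numbers reported above are consistent with simple arithmetic on that
check. The sketch below replays it in user space, assuming 4096-byte
pages, an 8-byte UDP header already counted in cork.length, and a 20-byte
IP header without options (fragheaderlen = 20). These values are
assumptions chosen to match the report, and pages_until_emsgsize() is a
hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

#define IP_MAX_DATAGRAM 0xFFFFu   /* largest IPv4 datagram, in bytes */

struct cork_result {
    unsigned pages;    /* pages appended before the check fails */
    unsigned length;   /* modelled cork.length at that point */
};

/*
 * Hypothetical model of the ip_append_page() size check: keep appending
 * pages while (length + page_size) does not exceed 0xFFFF - fragheaderlen.
 */
static struct cork_result pages_until_emsgsize(unsigned page_size,
                                               unsigned udp_hdr_len,
                                               unsigned fragheaderlen)
{
    struct cork_result r = { 0, udp_hdr_len };

    assert(page_size > 0);
    assert(fragheaderlen < IP_MAX_DATAGRAM);
    while (r.length + page_size <= IP_MAX_DATAGRAM - fragheaderlen) {
        r.length += page_size;
        r.pages++;
    }
    return r;
}
```

With pages_until_emsgsize(4096, 8, 20) the loop stops after 15 pages at a
length of 61448 (8 + 15 * 4096), because a 16th page would give
65544 > 65515. The 61440 bytes returned by sendfile() are those 15 pages.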

What do you mean by global offset?

Thanks,
Johann
-- 
Johann Baudy
johaahn@gmail.com
--

From: Evgeniy Polyakov
Date: Friday, September 5, 2008 - 7:59 am

Hi.

 
Well, udp_sendpage() needs to be extended to only append a page when there
is enough free space there, otherwise push the given frame and create the next

I meant you get a pointer by mapping some file in tmpfs (for example)
and then use some offset variable to store where you put your last data
(either packet header, or data itself), so that any subsequent write to
that area (either new packet header or dma data placement) would put
data just after the previous chunk. Thus after you have put a number of
headers and the appropriate data chunks, you could call sendfile() and reset
the offset to the beginning of the mapped area.
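
A minimal user-space sketch of this offset-tracking idea follows. All
names (struct stage, stage_open(), stage_put(), stage_flush()) are
hypothetical, the file path is supplied by the caller (a tmpfs path in
the scenario above), and memcpy() stands in for the DMA transfer that
would place sensor data directly into the mapped area.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

struct stage {
    int fd;        /* backing file */
    char *base;    /* start of the shared mapping */
    size_t cap;    /* size of the mapping */
    size_t off;    /* where the next header or data chunk goes */
};

static int stage_open(struct stage *s, const char *path, size_t cap)
{
    assert(cap > 0);
    s->fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (s->fd < 0)
        return -1;
    if (ftruncate(s->fd, (off_t)cap) < 0)
        goto fail;
    s->base = mmap(NULL, cap, PROT_READ | PROT_WRITE, MAP_SHARED, s->fd, 0);
    if (s->base == MAP_FAILED)
        goto fail;
    (void)mlock(s->base, cap);  /* best effort; may fail on RLIMIT_MEMLOCK */
    s->cap = cap;
    s->off = 0;
    return 0;
fail:
    close(s->fd);
    return -1;
}

/* Append a header or a data chunk right after the previous one. */
static int stage_put(struct stage *s, const void *src, size_t len)
{
    if (len > s->cap - s->off)
        return -1;
    memcpy(s->base + s->off, src, len);
    s->off += len;
    return 0;
}

/* Send everything staged so far, then reset the offset to the start. */
static ssize_t stage_flush(struct stage *s, int out_fd)
{
    off_t pos = 0;
    size_t left = s->off;

    while (left > 0) {
        ssize_t n = sendfile(out_fd, s->fd, &pos, left);
        if (n <= 0)
            return -1;
        left -= (size_t)n;
    }
    ssize_t sent = (ssize_t)s->off;
    s->off = 0;
    return sent;
}
```

The caller alternates stage_put(header) and stage_put(data) for each
packet and calls stage_flush() once per batch. Note that nothing in this
sketch ties a header to the start of a datagram on the wire.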

-- 
	Evgeniy Polyakov
--

From: Johann Baudy
Date: Friday, September 5, 2008 - 8:30 am

If I understand correctly, there is no link between the start of an
Ethernet frame and a packet header?
App protocol must support packet loss ^^

Thanks,
Johann

-- 
Johann Baudy
johaahn@gmail.com
--

From: Evgeniy Polyakov
Date: Friday, September 5, 2008 - 8:38 am

Hi.


Great, thank you. But it should take into account UDP nature: data is

The Ethernet header is appended by the network core itself; likely the core
will just allocate an skb with a small data area, put the Ethernet and UDP/IP
headers there, and attach pages from the file. If hardware does not support

I think the relevance of packet loss here is just the same as with
any other sending method.

-- 
	Evgeniy Polyakov
--

From: Johann Baudy
Date: Friday, September 5, 2008 - 9:01 am

I'm leaving office right now.

Ok, I see. I just need to check if UDP is fine for my application or
if I need to do my own L3/L4.

Many thanks again,
Have a nice weekend,

Johann

-- 
Johann Baudy
johaahn@gmail.com
--

From: Evgeniy Polyakov
Date: Friday, September 5, 2008 - 9:34 am

Hi Johann.



This may be a bad sign. Or an extremely necessary step, as with satellite
links and their huge RTTs. For Ethernet, TCP/UDP over IP is likely
the way to go.

-- 
	Evgeniy Polyakov
--

From: Johann Baudy
Date: Monday, September 8, 2008 - 3:21 am

Hi Evgeniy,

I've made a test with the patch below (with and without UDP fragmentation):

Without UDP fragmentation, packet sizes are almost always equal to
PAGE_SIZE due to my MTU limit (2*PACKET_SIZE > mtu).
With UDP fragmentation, the kernel sends multiple fragmented packets
of 61448 bytes.

Unfortunately, in both cases, bitrate is still 15-20 MB/s :(
According to wireshark, the kernel sends 60KB over 9 packets, then nothing
for ~5ms, then 60KB, and so on. Strange... The kernel seems to spend its
time in push(). Is there a blocking call somewhere?

Thanks in advance,
Johann

--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c

@@ -743,7 +743,28 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
                 size_t size, int flags)
 {
        struct udp_sock *up = udp_sk(sk);
+       struct inet_sock *inet = inet_sk(sk);
        int ret;
+       int mtu = inet->cork.fragsize;
+       int fragheaderlen;
+       struct ip_options *opt = NULL;
+
+       if (inet->cork.flags & IPCORK_OPT)
+               opt = inet->cork.opt;
+
+       fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
+
+       // With UDP fragmentation
+       if (inet->cork.length + size >= 0xFFFF - fragheaderlen) {
+       // Without UDP fragmentation
+       //  if( (inet->cork.length + size) > mtu) {
+               lock_sock(sk);
+               ret = udp_push_pending_frames(sk);
+               release_sock(sk);
+               if (ret) {
+                       return 0;
+               }
+       }
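
To compare the two thresholds in the patch above, the following
user-space model counts how many times pending data would be pushed for
a given number of pages. It is a simplified model, not kernel code:
count_datagrams() is a hypothetical name, the 8-byte UDP header is
assumed to be counted in cork.length, and the 7200-byte MTU used in the
note after the code is only an assumption suggested by the packet sizes
mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define UDP_HDR_LEN 8u

/*
 * Hypothetical model of the patched udp_sendpage(): before appending a
 * page, push what is pending if the page would cross `limit`.
 * `inclusive` selects ">=" (fragmentation variant) or ">" (MTU variant).
 * Returns the number of pushes, including the final one.
 */
static unsigned long count_datagrams(unsigned long pages,
                                     unsigned long page_size,
                                     unsigned long limit, int inclusive)
{
    unsigned long cork = 0;       /* modelled cork.length */
    unsigned long datagrams = 0;

    assert(page_size > 0);
    for (unsigned long i = 0; i < pages; i++) {
        int over = inclusive ? (cork + page_size >= limit)
                             : (cork + page_size > limit);

        if (cork > 0 && over) {   /* udp_push_pending_frames() */
            datagrams++;
            cork = 0;
        }
        if (cork == 0)            /* a new datagram starts with the header */
            cork = UDP_HDR_LEN;
        cork += page_size;
    }
    if (cork > 0)                 /* final push at the end of the transfer */
        datagrams++;
    return datagrams;
}
```

For a 10MB file (2560 pages of 4096 bytes), the 0xFFFF - 20 = 65515 limit
with ">=" gives 171 pushes of up to 15 pages each, while an assumed MTU
of 7200 with ">" gives 2560 pushes of one page each. The latter matches
the observation that packets are almost always PAGE_SIZE without
fragmentation.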





-- 
Johann Baudy
johaahn@gmail.com
--

From: Evgeniy Polyakov
Date: Monday, September 8, 2008 - 4:26 am

Hi Johann.


Are you sure that it is udp_push_pending_frames() and not some splice


This also should be protected. Two threads can simultaneously check

-- 
	Evgeniy Polyakov
--

From: Johann Baudy
Date: Monday, September 8, 2008 - 6:01 am

No, I'm not sure.
Are there any queue or allocator limits that can slow the bitrate
through this function?
I mean something that needs the end of one transfer before starting a new one.

--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -743,7 +743,29 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
                 size_t size, int flags)
 {
        struct udp_sock *up = udp_sk(sk);
+       struct inet_sock *inet = inet_sk(sk);
        int ret;
+       int mtu, fragheaderlen;
+       struct ip_options *opt = NULL;
+
+       lock_sock(sk);
+       mtu = inet->cork.fragsize;
+
+       if (inet->cork.flags & IPCORK_OPT)
+               opt = inet->cork.opt;
+
+       fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
+
+       // With UDP fragmentation
+       if (inet->cork.length + size >= 0xFFFF - fragheaderlen) {
+       // Without UDP fragmentation
+       //  if( (inet->cork.length + size) > mtu) {
+               ret = udp_push_pending_frames(sk);
+               if (ret) {
+                       return 0;
+               }
+       }
+       release_sock(sk);

        if (!up->pending) {
                struct msghdr msg = {   .msg_flags = flags|MSG_MORE };

Please find above the patch with your corrections.
Must we use the MTU limit instead of the offset limit, since you said
that UDP splitting over IP must be avoided?
If yes, can I force the size value passed to ip_append_page() in order
to fill the whole packet? Or will this not be handled properly by all
callers?

Thanks in advance,
Johann


-- 
Johann Baudy
johaahn@gmail.com
--

From: Evgeniy Polyakov
Date: Monday, September 8, 2008 - 8:28 am

Hi Johann.



No, there should not be any such code path.

What is the CPU usage on the sender when it sends data via UDP sendfile()?

-- 
	Evgeniy Polyakov
--

From: Evgeniy Polyakov
Date: Monday, September 8, 2008 - 8:38 am

Actually we can determine the culprit by putting a loop into
udp_sendpage() which sends the same data. If the receiver sees the
same delays, the problem is in the UDP sending path, otherwise in the splice code.
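
On the receiving side, "seeing the same delays" could be checked by
timestamping each received datagram and looking for the largest gap. The
helper below is a hypothetical sketch of that measurement only (it does
not receive anything itself); the timestamps are assumed to come from
clock_gettime(CLOCK_MONOTONIC, ...) taken after each receive call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Microseconds elapsed between two timestamps. */
static int64_t elapsed_us(struct timespec from, struct timespec to)
{
    return (int64_t)(to.tv_sec - from.tv_sec) * 1000000
         + (to.tv_nsec - from.tv_nsec) / 1000;
}

/*
 * Hypothetical helper: return the largest gap, in microseconds, between
 * consecutive timestamps. If `where` is not NULL, it receives the index
 * of the packet that arrived after that gap (0 if there is no gap).
 */
static int64_t max_gap_us(const struct timespec *ts, size_t n, size_t *where)
{
    int64_t worst = 0;

    assert(ts != NULL);
    if (where)
        *where = 0;
    for (size_t i = 1; i < n; i++) {
        int64_t gap = elapsed_us(ts[i - 1], ts[i]);

        if (gap > worst) {
            worst = gap;
            if (where)
                *where = i;
        }
    }
    return worst;
}
```

A stall like the ~5ms pause seen in wireshark would show up as a gap of
about 5000us between two bursts.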

-- 
	Evgeniy Polyakov
--

From: Johann Baudy
Date: Tuesday, September 9, 2008 - 4:11 pm

Hi Evgeniy,

We've performed more tests with the additional, mandatory processes of
our system. Bitrate and CPU performance were very low. This
result leads us to brainstorm on a new system design that meets
the specification targets. I'll continue these tests once the new design
is ready.

Concerning udp_sendpage() patch, if you agree, I will suggest this
patch to the community in a new thread.

Many thanks for your help,
Johann






-- 
Johann Baudy
johaahn@gmail.com
--

From: Evgeniy Polyakov
Date: Tuesday, September 9, 2008 - 11:09 pm

Hi Johann.


If CPU usage was small, something slept somewhere, but so far we do not
know what and where. It would be great to understand why sendfile() is
so slow with UDP, especially when memory bandwidth is very small, and

Sure, but please remove commented code lines instead of commenting them.

-- 
	Evgeniy Polyakov
--
