vmsplice() is short in comparison to splice() (~ 200us)! This was just to show you that even this vmsplice() duration of 80us, which is needed for each packet, is too long to send only 1 packet. I really need a mechanism that allows sending ~ 40 packets of 7200K in one system call, to keep some CPU resources for other things. I've only observed a small performance difference with and without gettimeofday() (< 1MB/s). I've used it to do a light FTRACE and to get Between kill_fasync() sys_gettimeofday() , I thought that we returned to user space. I've already tried sendfile, only with standard TCP/UDP sockets. I've not saturated the link. I understand your point that common solutions are always better than multiple hacks. But I think that I have the same motivation as the packet mmap IO developers. This feature was introduced to make the capture process of raw sockets efficient. I just want to reach the same goal for transmission using the same mechanism. We use those features only if we need performance at the driver level. Thanks, Johann -- Johann Baudy johaahn@gmail.com --
Hi Johann. Hmmm... splice()/sendfile() should be able to send the whole file in This worries me a lot: sendfile should be a single syscall which very optimally creates network packets, taking into account MTU and hardware capabilities. I do believe it is a problem with userspace code. -- Evgeniy Polyakov --
I was talking about vmsplice()/splice() which seems to me the only way
I don't understand your point, how can I check?
Yes, and this is what it does (only one single syscall).
No printf, only one sendfile of 10MB file over TCP socket
To summarize the ongoing test status:
with vmsplice()/splice(), I need to do multiple calls of vmsplice() and one call of splice()
- the ratio seems to be limited by the pipe capacity (16 pages: 64K)
- each vmsplice() call specifies the size of the UDP packet, which means 1 syscall per packet :(
- in UDP, bitrate is < 20MB/s
with sendfile(), only one system call for 10MB in TCP (in UDP I have to split into chunks of 61440 bytes)
- in TCP, bitrate is limited due to remote
- in UDP, this 61440-byte limit is really inferior to my 7200*40 packets, which allow me to saturate the link (during 2ms only...) with my circular buffer
- bitrate is < 20MB/s
Thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
--
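A back-of-the-envelope sketch of the syscall budget described above, using the figures quoted in this thread (64K pipe, 80us per vmsplice(), bursts of 40 packets) and assuming 7200-byte packets. The function names are illustrative only. The packets-per-fill figure counts bytes only, so it is an upper bound: pipe buffers are page-granular, and a packet that straddles pages may use more than its share.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on packets per pipe fill, counting bytes only.
 * Pipe buffers are page-granular, so the real figure can be lower. */
size_t packets_per_pipe_fill(size_t pipe_bytes, size_t packet_bytes)
{
    assert(packet_bytes > 0);
    return pipe_bytes / packet_bytes;
}

/* Lower bound on system calls for one burst: one vmsplice() per packet
 * plus one splice() per pipe fill. */
size_t syscalls_for_burst(size_t packets, size_t per_fill)
{
    assert(per_fill > 0);
    size_t splices = (packets + per_fill - 1) / per_fill; /* ceiling division */
    return packets + splices;
}

/* Time spent in vmsplice() alone for one burst, in microseconds. */
unsigned long vmsplice_time_us(size_t packets, unsigned long per_call_us)
{
    return (unsigned long)packets * per_call_us;
}
```

With these inputs the sketch gives 9 packets per pipe fill, at least 45 system calls per burst, and 3200us spent in vmsplice() alone, which is more than the roughly 2ms during which the burst saturates the link.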
So, it seems that there is something wrong in UDP here, because TCP works properly with the same code. If the size argument of sendfile() exceeds 61440, sendfile() returns 61440 and sends no data to the device ... I will try to investigate it, but for my app sendfile() is not conceivable. Only vmsplice() could work, but it is too slow. Thanks, Johann -- Johann Baudy johaahn@gmail.com --
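The workaround mentioned earlier in the thread (splitting the UDP transfer into pieces of at most 61440 bytes) could look roughly like the sketch below. The helper name and the constant are illustrative, not taken from any posted code. On a UDP socket each sendfile() call would presumably end up as its own datagram, so the receiver has to cope with the chunk boundaries.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

#define UDP_SENDFILE_CHUNK 61440 /* per-call limit reported in this thread */

/* Send `total` bytes from in_fd to out_fd, never asking sendfile() for
 * more than `chunk` bytes per call. Returns the number of bytes sent,
 * or -1 with errno set. */
ssize_t sendfile_chunked(int out_fd, int in_fd, size_t total, size_t chunk)
{
    off_t offset = 0;    /* sendfile() advances this for us */
    size_t left = total;

    assert(chunk > 0);
    while (left > 0) {
        size_t want = left < chunk ? left : chunk;
        ssize_t n = sendfile(out_fd, in_fd, &offset, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;    /* interrupted, retry the same chunk */
            return -1;
        }
        if (n == 0)          /* input exhausted early */
            break;
        left -= (size_t)n;
    }
    return (ssize_t)(total - left);
}
```

This still costs one system call per 61440 bytes, which is the overhead the thread is trying to avoid.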
That's a bug. Likely no one sends file content via udp, so it was not Well, you can mmap an empty file into the RAM, lock it there, DMA data from the sensor to the mmapped area, add headers (just in the mapped area) and then sendfile that file. Single syscall, zero-copy, standard interfaces :) -- Evgeniy Polyakov --
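A minimal userspace sketch of the suggestion above: map a file, try to lock it in RAM, fill the mapping, then hand it to sendfile(). The function names are hypothetical, the DMA step cannot be shown from userspace (plain stores into the mapping stand in for it), and locking is treated as best effort because it may fail under RLIMIT_MEMLOCK.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) a file of `size` bytes and map it shared, so that
 * stores into the mapping become file content without any write() call.
 * Returns the mapping, or NULL on failure; *fd_out receives the descriptor. */
void *staging_map(const char *path, size_t size, int *fd_out)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    /* Best effort: pin the pages. If this fails the mapping is still
     * usable, only not locked in RAM. */
    (void)mlock(p, size);
    *fd_out = fd;
    return p;
}

/* Hand the first `len` bytes of the staged file to out_fd in one call. */
ssize_t staging_send(int out_fd, int in_fd, size_t len)
{
    off_t offset = 0;
    return sendfile(out_fd, in_fd, &offset, len);
}
```

Placing the file on tmpfs, as suggested later in the thread, keeps the staged data out of any disk I/O path.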
OK, but it seems that there is no way to control the packet format (beginning of packet and size of packet). Each packet must start with a specific header (in my app). This is a kind of streaming. Thanks, Johann -- Johann Baudy johaahn@gmail.com --
Hi Johann. No need to run FTRACE, code should be audited and probably some debug prints added to determine why sendfile() decides to exit early with UDP. I will try to do it if time permits this weekend, although I'm You can always provide a global offset where to put the next packet. You can adjust it to put a header before each data frame, and then DMA frame content according to that offset. Transmitting packet socket is needed for those who want to implement their own low-level protocol unsupported by the kernel, so to transfer data over UDP or TCP over IP with the highest speeds, one should use existing methods. This does not of course mean that anyone _has_ to do it, it is always very fun to find new ways like your patch. -- Evgeniy Polyakov --
I've finally made the test:
Packet is not going through the device due to this test in ip_append_page():
	if (inet->cork.length + size > 0xFFFF - fragheaderlen) {
		ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
		return -EMSGSIZE;
	}
inet->cork.length reaches 61448 (with size = 4096), then this failure occurs.
What do you mean by global offset?
Thanks,
Johann
--
Johann Baudy
johaahn@gmail.com
--
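The arithmetic behind the 61440-byte limit can be reproduced with a small sketch of the check quoted above. It assumes a 20-byte IP header without options, 4096-byte pages, and that cork.length starts at the 8-byte UDP header (inferred from the reported value 61448 = 61440 + 8). The names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define IP_MAX_TOTAL 0xFFFF /* largest IPv4 total length */
#define IP_HDR_LEN   20     /* assumed: sizeof(struct iphdr), no IP options */
#define UDP_HDR_LEN  8      /* assumed starting value of cork.length */
#define PAGE_BYTES   4096   /* assumed from the reported size = 4096 */

/* Mirrors the quoted test: nonzero when appending `size` bytes to a corked
 * datagram of `cork_len` bytes would be rejected with EMSGSIZE. */
int append_would_overflow(size_t cork_len, size_t size, size_t fragheaderlen)
{
    assert(fragheaderlen < IP_MAX_TOTAL);
    return cork_len + size > IP_MAX_TOTAL - fragheaderlen;
}

/* Count how many whole pages can be appended before the check trips. */
size_t pages_before_overflow(size_t fragheaderlen)
{
    size_t cork_len = UDP_HDR_LEN;
    size_t pages = 0;

    while (!append_would_overflow(cork_len, PAGE_BYTES, fragheaderlen)) {
        cork_len += PAGE_BYTES;
        pages++;
    }
    return pages;
}
```

Under these assumptions 15 pages fit (15 * 4096 = 61440 bytes of payload, cork.length = 61448), and the 16th page would bring the total to 65544, above the 65515 allowed, which matches the value sendfile() returns in UDP.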
Hi. Well, udp_sendpage() needs to be extended to only append a page when there is enough free space there, otherwise push given frame and create next I meant you get a pointer by mapping some file in tmpfs (for example) and then use some offset variable to store where you put your last data (either packet header, or data itself), so that any subsequent write to that area (either new packet header or dma data placement) would put data just after the previous chunk. Thus after you have put a number of headers and appropriate data chunks, you could call sendfile() and reset the offset to the beginning of the mapped area. -- Evgeniy Polyakov --
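The offset bookkeeping described above might be sketched as follows. The header layout (struct app_hdr) is purely hypothetical, since the thread does not describe the application header, and memcpy() stands in for the DMA placement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical application header; the real format is not given here. */
struct app_hdr {
    uint32_t seq;
    uint32_t len;
};

/* Write cursor over the mapped area. */
struct stage_cursor {
    unsigned char *base; /* start of the mapped area */
    size_t cap;          /* size of the mapped area */
    size_t off;          /* the "global offset": where the next chunk goes */
};

/* Reserve `len` bytes at the cursor; returns NULL when the area is full. */
void *cursor_reserve(struct stage_cursor *c, size_t len)
{
    assert(c->off <= c->cap);
    if (len > c->cap - c->off)
        return NULL;
    void *p = c->base + c->off;
    c->off += len;
    return p;
}

/* Append one header followed by its payload.
 * Returns 0 on success, -1 (cursor unchanged) when the frame does not fit. */
int cursor_put_frame(struct stage_cursor *c, uint32_t seq,
                     const void *data, uint32_t len)
{
    size_t start = c->off;
    struct app_hdr h = { seq, len };
    void *hp = cursor_reserve(c, sizeof h);
    void *dp = hp ? cursor_reserve(c, len) : NULL;

    if (!dp) {
        c->off = start; /* roll back a partially reserved frame */
        return -1;
    }
    memcpy(hp, &h, sizeof h);
    memcpy(dp, data, len); /* stands in for the DMA of the frame content */
    return 0;
}

/* After sendfile(): report how many bytes were staged and rewind. */
size_t cursor_reset(struct stage_cursor *c)
{
    size_t used = c->off;
    c->off = 0;
    return used;
}
```

The value returned by cursor_reset() is the length that would be passed to sendfile() for the staged area.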
If I understand well, there is no link between start of ethernet frame and packet header ? App protocol must support packet loss ^^ Thanks, Johann -- Johann Baudy johaahn@gmail.com --
Hi. Great, thank you. But it should take into account UDP nature: data is Ethernet header is appended by the network core itself, likely core will just allocate skb with small data area, put there an ethernet and udp/ip headers and attach pages from the file. If hardware does not support I think matter of packet loss relevance here is just the same like with any other sending method. -- Evgeniy Polyakov --
I'm leaving the office right now. Ok, I see. I just need to check if UDP is fine for my application or if I need to do my own L3/L4. Many thanks again, Have a nice weekend, Johann -- Johann Baudy johaahn@gmail.com --
Hi Johann. This may be a bad sign. Or extremely needed step like in satellite links with its huge rtts. Likely in case of ethernet usage tcp/udp over IP is the way to go. -- Evgeniy Polyakov --
Hi Evgeniy,
I've made a test with the patch below (with and without UDP fragmentation):
without UDP fragmentation, packet sizes are almost always equal to
PAGE_SIZE due to my MTU limit (2*PACKET_SIZE > mtu).
with UDP fragmentation, the kernel is sending multiple fragmented packets
of 61448 bytes.
Unfortunately, in both cases, bitrate is still 15-20 MB/s :(
According to wireshark, the kernel sends 60KB over 9 packets, nothing
during ~5ms, 60KB and so on. Strange... the kernel seems to spend its
time during push(). Is there a blocking call somewhere?
Thanks in advance,
Johann
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -743,7 +743,28 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 		 size_t size, int flags)
 {
 	struct udp_sock *up = udp_sk(sk);
+	struct inet_sock *inet = inet_sk(sk);
 	int ret;
+	int mtu = inet->cork.fragsize;
+	int fragheaderlen;
+	struct ip_options *opt = NULL;
+
+	if (inet->cork.flags & IPCORK_OPT)
+		opt = inet->cork.opt;
+
+	fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
+
+	// With UDP fragmentation
+	if (inet->cork.length + size >= 0xFFFF - fragheaderlen) {
+	// Without UDP fragmentation
+	// if( (inet->cork.length + size) > mtu) {
+		lock_sock(sk);
+		ret = udp_push_pending_frames(sk);
+		release_sock(sk);
+		if (ret) {
+			return 0;
+		}
+	}
--
Johann Baudy
johaahn@gmail.com
--
Hi Johann. Are you sure that it is udp_push_pending_frames() and not some splice This also should be protected. Two threads can simultaneously check -- Evgeniy Polyakov --
No, I'm not sure.
Are there any queue or allocator limits that can slow the bitrate
through this function?
I mean something that will need end of transfer to start a new one.
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -743,7 +743,29 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 		 size_t size, int flags)
 {
 	struct udp_sock *up = udp_sk(sk);
+	struct inet_sock *inet = inet_sk(sk);
 	int ret;
+	int mtu, fragheaderlen;
+	struct ip_options *opt = NULL;
+
+	lock_sock(sk);
+	mtu = inet->cork.fragsize;
+
+	if (inet->cork.flags & IPCORK_OPT)
+		opt = inet->cork.opt;
+
+	fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
+
+	// With UDP fragmentation
+	if (inet->cork.length + size >= 0xFFFF - fragheaderlen) {
+	// Without UDP fragmentation
+	// if( (inet->cork.length + size) > mtu) {
+		ret = udp_push_pending_frames(sk);
+		if (ret) {
+			return 0;
+		}
+	}
+	release_sock(sk);
 	if (!up->pending) {
 		struct msghdr msg = { .msg_flags = flags|MSG_MORE };
Please find above the patch with your rectifications.
Must we use the MTU limit instead of the offset limit? You said that UDP
splitting over IP must be avoided.
If yes, can I force the size value forwarded to ip_append_page() in order
to fill the whole packet? Or will this not be handled properly by all
callers?
Thanks in advance,
Johann
--
Johann Baudy
johaahn@gmail.com
--
Hi Johann. No, there should not be any such code path. What is the CPU usage on the sender when it sends data via UDP sendfile()? -- Evgeniy Polyakov --
Actually we can determine the culprit by putting a loop into udp_sendpage() which will send the same data. If the receiver sees the same delays, the problem is in the udp sending path, otherwise in the splice code. -- Evgeniy Polyakov --
Hi Evgeniy, We've performed more tests with the additional and mandatory processes of our system. Bitrate and CPU performance were very, very low. This result leads us to brainstorm on a new system design that meets the specification targets. I'll continue these tests once the new design is ready. Concerning the udp_sendpage() patch, if you agree, I will suggest this patch to the community in a new thread. Many thanks for your help, Johann -- Johann Baudy johaahn@gmail.com --
Hi Johann. If CPU usage was small, something slept somewhere, but so far we do not know what and where. It would be great to understand why sendfile() is so slow with UDP, especially when memory bandwidth is very small, and Sure, but please remove commented code lines instead of commenting them. -- Evgeniy Polyakov --
