Hello,

I am using RHEL5 with 1 Gigabit NICs, sending 128 KB blocks of data over TCP in a loop. I am using SystemTap to debug the performance and finding that:

90% of the send calls take about 100 microseconds, and
10% of the send calls take about 10 milliseconds.

The average send time is about 1 millisecond.

The 10% of the calls taking about 10 milliseconds seem to be correlated with "sk_stream_wait_memory" calls in the kernel. sk_stream_wait_memory seems to be called when the send buffer is full, and the next send call does not complete until the send buffer utilization goes down from 4,194,304 bytes to 2,814,968 bytes. This implies that a send that blocks on a full send buffer will not complete until there is 1 meg of free space in the send buffer, even though the send could be accepted into the OS with only 128 KB of free space.

Do you think I am misinterpreting this data, or is there a way to even out the send calls so that they are more even in duration: approx 1 millisecond per call? Is there a parameter to reduce how much space needs to be free in the send buffer before a blocking send call can complete from user space?

Cheers, Ivan
--
static void sock_def_write_space(struct sock *sk)
{
...
if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
...
Quick answer is: No, this is not tunable (independently of SNDBUF).
SO_SNDLOWAT is not implemented on Linux yet (its value is: 1).
Why would you want to wake up your thread more than necessary?
--
Cool. This helps me understand what is happening.

My user thread wants to wake up as soon as the OS can accept my data, so that it can continue doing work and interact with other components in the system. This is an application issue; I can work around it now that I have a better understanding of what the kernel is doing.

Cheers, Ivan
--
From my tests, select will not return until the same free-space threshold is met:

if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {

I got that from SystemTap output.

Cheers, Ivan
--
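To make the arithmetic behind the condition quoted above concrete, here is a minimal sketch in plain user-space C. It models only that one comparison; the helper names are made up for illustration and are not kernel functions. Note that half of 4,194,304 is 2,097,152, which is not the 2,814,968 reported from SystemTap, so treat this as an illustration of the quoted check rather than an exact model of what the TCP path does.

```c
#include <stdbool.h>

/* Hypothetical helper mirroring the quoted check:
 * the writer may be woken once twice the allocated
 * bytes fit within the send buffer size. */
static bool writer_wakeable(unsigned int wmem_alloc, unsigned int sndbuf)
{
    return (wmem_alloc << 1) <= sndbuf;
}

/* Largest amount of allocated send memory at which the
 * quoted condition still holds, i.e. half of sndbuf. */
static unsigned int wake_threshold(unsigned int sndbuf)
{
    return sndbuf >> 1;
}
```

Under this condition, a 4,194,304-byte buffer has to drain to 2,097,152 bytes before the writer is woken, while a 262,144-byte buffer only has to drain to 131,072 bytes.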
Unless you think your application will run over 10G, or over a WAN, you
shouldn't need anywhere near the size of socket buffer you are getting via
autotuning to be able to achieve "link-rate" - link rate with a 1GbE LAN
connection can be achieved quite easily with a 256KB socket buffer.
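As a rough sanity check on that 256KB figure, the amount of data that has to be in flight to keep a link busy is the bandwidth-delay product. The sketch below computes it; the function name and the round-trip times used in the note after it are assumptions for illustration, not measurements from this thread.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes:
 * (link rate in bits/s / 8) * round-trip time in seconds.
 * rtt_usec is the round-trip time in microseconds. */
static uint64_t bdp_bytes(uint64_t link_bits_per_sec, uint64_t rtt_usec)
{
    return link_bits_per_sec / 8 * rtt_usec / 1000000;
}
```

With an assumed 1 ms round trip, 1GbE needs about 125,000 bytes in flight, comfortably under 256KB. The same assumed round trip at 10G gives 1,250,000 bytes, which is why 10G or WAN paths are the cases where a larger buffer starts to matter.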
The first test here is with autotuning enabled. Disregard what netperf reports
for the socket buffer sizes here: it is calling getsockopt() before connect()
and before the end of the connection:
raj@spec-ptd2:~/netperf2_trunk$ src/netperf -H s9 -v 2 -l 30 -- -m 128K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to s9.cup.hp.com
(16.89.132.29) port 0 AF_INET : histogram
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384 131072    30.01     911.50

Alignment      Offset         Bytes    Bytes       Sends   Bytes    Recvs
Local  Remote  Local  Remote  Xfered   Per                 Per
Send   Recv    Send   Recv             Send (avg)          Recv (avg)
    8       8      0       0 3.42e+09  131074.49     26090   11624.79 294176

Maximum
Segment
Size (bytes)
  1448
Histogram of time spent in send() call.
UNIT_USEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
TEN_USEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
HUNDRED_USEC : 0: 3: 21578: 378: 94: 20: 3: 2: 0: 4
UNIT_MSEC : 0: 4: 2: 0: 0: 780: 3215: 6: 0: 1
TEN_MSEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
HUNDRED_MSEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
UNIT_SEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
TEN_SEC : 0: 0: 0: 0: 0: 0: 0: 0: 0: 0
>100_SECS: 0
HIST_TOTAL: 26090
Next, we have netperf make an explicit setsockopt() call for 128KB socket
buffers, which will get us 256K. Notice that the ...

I am not sure I understand your histogram output. But what I am getting from your message is that my buffer may be too big. If I reduce the buffer, as you are saying, down to a 256K send buffer, then the code that checks whether select or send should block:

if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {

would only block waiting for 128 KB of free space, as compared to 1 Meg free in my example, therefore reducing the max time for send calls (in theory). Is this what you are getting at?

Cheers, Ivan
--
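For reference, asking for a fixed send buffer from the application side might look like the sketch below. The helper name is hypothetical. It requests a size with setsockopt() and reads back what the kernel actually set; on Linux the value read back is normally double the request (which is how a 128KB request becomes 256K), capped by net.core.wmem_max. Setting SO_SNDBUF explicitly also takes the send buffer for that socket out of autotuning.

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: request a send buffer size and return the size
 * the kernel reports afterwards, or -1 on error. On Linux the reported
 * value is typically twice the request, subject to net.core.wmem_max. */
static int set_and_read_sndbuf(int fd, int requested)
{
    int actual = 0;
    socklen_t len = sizeof(actual);

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len) < 0)
        return -1;
    return actual;
}
```

It would be called on the socket before connect(), for example set_and_read_sndbuf(fd, 128 * 1024), and the returned value compared against what was expected.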
For example, 21811 of the send() calls were 1 <= time < 2 milliseconds. 2672 of ...

Yes.

As for the select/poll stuff, if you have a thread that wants to get to something else, I would suggest marking the socket non-blocking and trying the send(). If it completes, cool; if not, remember what didn't get sent, do the other thing(s), and come back. If you find you have time to sit and wait, go ahead and call select/poll/epoll/whatever. Or, if you want to make sure you wait in poll/select/whatnot no more than N units of time, and that length of time is within the abilities of the call, use the timeout parameter present in those.

rick jones
--
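A minimal sketch of that pattern follows, assuming a connected stream socket. The three helper names are hypothetical: one marks the socket non-blocking, one sends as much as the kernel will take and remembers the offset of what is left, and one waits for writability with a bounded timeout.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Mark the socket non-blocking. Returns 0 on success, -1 on error. */
static int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Try to send buf[*off .. len) without blocking. *off is advanced by
 * whatever the kernel accepted, so the caller can resume later.
 * Returns 1 when everything was sent, 0 when the socket would block
 * (go do other work and come back), -1 on error. */
static int try_send(int fd, const char *buf, size_t len, size_t *off)
{
    while (*off < len) {
        ssize_t n = send(fd, buf + *off, len - *off, MSG_NOSIGNAL);
        if (n > 0) {
            *off += (size_t)n;
            continue;
        }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;
        if (n < 0 && errno == EINTR)
            continue;
        return -1;
    }
    return 1;
}

/* Wait at most timeout_ms for the socket to become writable.
 * Returns >0 if an event is pending, 0 on timeout, -1 on error. */
static int wait_writable(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLOUT, .revents = 0 };
    return poll(&p, 1, timeout_ms);
}
```

The calling loop would call try_send(), do its other work whenever that returns 0, and only fall back to wait_writable() when there is nothing else to do. Note that poll() reporting the socket writable is still governed by the kernel's own free-space threshold, so this bounds how long the thread waits rather than making the socket writable sooner.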
