Lachlan Andrew observed that my TCP-Illinois implementation uses the
beta value incorrectly:
The parameter beta in the paper specifies the amount to decrease
*by*: that is, on loss,
W <- W - beta*W
but in tcp_illinois_ssthresh() uses beta as the amount
to decrease *to*: W <- beta*W
This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network,
hurting performance. Note: since the base beta value is .5, it has no
impact on a congested network.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
--- a/net/ipv4/tcp_illinois.c 2007-08-18 07:50:15.000000000 -0700
+++ b/net/ipv4/tcp_illinois.c 2007-11-28 15:39:04.000000000 -0800
@@ -298,7 +298,7 @@ static u32 tcp_illinois_ssthresh(struct
struct illinois *ca = inet_csk_ca(sk);
/* Multiplicative decrease */
- return max((tp->snd_cwnd * ca->beta) >> BETA_SHIFT, 2U);
+ return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->beta) >> BETA_SHIFT), 2U);
}
-
Applied to net-2.6. Thanks Stephen! -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -
Thanks Stephen. A related problem (largely due to the published algorithm itself) is that Illinois is very aggressive when it over-estimates the maximum RTT. At high load (say 200Mbps and 200ms RTT), a backlog of packets builds up just after a loss, causing the RTT estimate to become large. This makes Illinois think that *all* losses are due to corruption not congestion, and so only back off by 1/8 instead of 1/2. I can't think how to fix this except by better RTT estimation, or changes to Illinois itself. Currently, I ignore RTT measurements when sacked_out != 0 and have a heuristic "RTT aging" mechanism, but that's pretty ugly. Cheers, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan -
Ageing the RTT estimates needs to be done anyway. Maybe something can be reused from H-TCP. The two are closely related. -
Hi Stephen and Lachlan, Thanks for pointing out and fixing this bug. For the max RTT problem, I have considered it also and I have some idea on improve it. I also have some other places to improve. I will summarize all my new ideas and send you an update. For me to change it, could you please give me a link to download to latest source codes for the whole congestion control module in Linux implementation, including the general module for all algorithms, and the implementation for specific algorithms like TCP-Illinois and H-TCP? Thanks for the help! -Shao -----Original Message----- From: Stephen Hemminger [mailto:shemminger@linux-foundation.org] Sent: Wednesday, November 28, 2007 4:44 PM To: Lachlan Andrew Cc: David S. Miller; Herbert Xu; shaoliu@Princeton.EDU; Douglas Leith; Robert Shorten; netdev@vger.kernel.org Subject: Re: [PATCH] tcp-illinois: incorrect beta usage Ageing the RTT estimates needs to be done anyway. Maybe something can be reused from H-TCP. The two are closely related. -
On Wed, 28 Nov 2007 21:26:12 -0800 The following adds gradual aging of max RTT. --- a/net/ipv4/tcp_illinois.c 2007-11-29 08:58:35.000000000 -0800 +++ b/net/ipv4/tcp_illinois.c 2007-11-29 09:37:33.000000000 -0800 @@ -63,7 +63,10 @@ static void rtt_reset(struct sock *sk) ca->cnt_rtt = 0; ca->sum_rtt = 0; - /* TODO: age max_rtt? */ + /* add slowly fading memory for maxRTT to accommodate routing changes */ + if (ca->max_rtt > ca->base_rtt) + ca->max_rtt = ca->base_rtt + + (((ca->max_rtt - ca->base_rtt) * 31) >> 5); } static void tcp_illinois_init(struct sock *sk) --
Greetings Stephen,
Thanks. We'll have to play with the rate of ageing. I used the slower ageing
if (ca->cnt_rtt > 3) {
u64 mean_rtt = ca->sum_rtt;
do_div (mean_rtt, ca->cnt_rtt);
if (ca->max_rtt > mean_rtt)
ca->max_rtt -= (ca->max_rtt - mean_rtt) >> 9;
}
and still found that the max_rtt drops considerably within a congestion epoch.
What would also really help would be getting rid of RTT outliers
somehow. I ignore RTT measurements when SACK is active:
if (ca->max_rtt < rtt) {
struct tcp_sock *tp = tcp_sk(sk);
if (! tp->sacked_out ) // SACKs cause hi-CPU/hi-RTT. ignore
ca->max_rtt = rtt;
}
which helps a lot, but still gets some outliers. Would it be possible
to time-stamp packets in the hardware interrupt handler, instead of
waiting for the post-processing stage?
Cheers,
Lachlan
--
Lachlan Andrew Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan
--
Hi Stephen and Lachlan, Thanks for the discussion. The gradual aging is surely an option. And another possibility is that, we compute the RTT just before congestion notification, which ideally, represent the full queueing delay + propagation delay. We can compute the average of the last M such values, and either use the average as maxRTT, or use it as a benchmark to judge whether a sample is outlier. How do you think of this idea? And also a question, why the samples when SACK is active are outliers? For the accuracy of time stamping, I am not very familiar with the implementation details. But I can think of two ways, 1) do time stamp in as low layer as possible; 2) use as high priority thread to do it as possible. For 2), we can use separate threads to do time stamp and to process packets. Thanks and I will let you know more of my thoughts after I go over the entire code space! -Shao -----Original Message----- From: Lachlan Andrew [mailto:lachlan.andrew@gmail.com] Sent: Monday, December 03, 2007 3:06 PM To: Stephen Hemminger Cc: shaoliu@Princeton.EDU; David S. Miller; Herbert Xu; Douglas Leith; Robert Shorten; netdev@vger.kernel.org Subject: Re: [RFC] TCP illinois max rtt aging Greetings Stephen, Thanks. We'll have to play with the rate of ageing. I used the slower ageing if (ca->cnt_rtt > 3) { u64 mean_rtt = ca->sum_rtt; do_div (mean_rtt, ca->cnt_rtt); if (ca->max_rtt > mean_rtt) ca->max_rtt -= (ca->max_rtt - mean_rtt) >> 9; } and still found that the max_rtt drops considerably within a congestion epoch. What would also really help would be getting rid of RTT outliers somehow. I ignore RTT measurements when SACK is active: if (ca->max_rtt < rtt) { struct tcp_sock *tp = tcp_sk(sk); if (! tp->sacked_out ) // SACKs cause hi-CPU/hi-RTT. ignore ca->max_rtt = rtt; } which helps a lot, but still gets some outliers. Would it be possible to time-stamp packets in the hardware interrupt handler, instead of waiting ...
On Mon, 3 Dec 2007 15:59:23 -0800 The problem with an average like that would be storing enough values to be useful and choosing how many to store. Perhaps some form of weighted sliding average which favors recent values heavily would work best. Remember that RTT's have a huge noise component and you are fighting against the long tail distribution trying to see the queue Any sample with SACK is going to mean a loss or reordering has occurred. So shouldn't the SACK values be useful, but RTT values from retransmits Right now the resolution is in microseconds using the hardware clock. The clock usage costs a little bit, but makes the math more accurate. It would be worth exploring sensitivity by taking out RTT_STAMP from the flags field and varying HZ. --
Greetings, When SACK is active, the per-packet processing becomes more involved, tracking the list of lost/SACKed packets. This causes a CPU spike just after a loss, which increases the RTTs, at least in my experience. This is a separate issue from the fact that it is hard to get RTT measurements from lost/retransmitted packets themselves. Cheers, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan --
I suspect that as long as old code was able to use hint, it wasn't doing that bad. But it was seriously lacking ability to take advantage of sack processing hint when e.g., a new hole appeared, or cumulative ACK arrived. ...Code available in net-2.6.25 might cure those. -- i. --
Greetings Ilpo, We had been using one of your earlier patches, and still had the problem. I think you've cured the problem with SACK itself, but there still seems to be something taking a lot of CPU while recovering from the loss. It is possible that it was to do with web100 which we have also been running, but I cut out most of the statistics from that and still had problems. Cheers, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan --
On Thu, 6 Dec 2007, Lachlan Andrew wrote: > On 04/12/2007, Ilpo J
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> Yes, it's the classic problem. But it ought to be at least partially masked when TSO is in use, because we'll only process a handful of SKBs. The more effectively TSO batches, the less work clean_rtx_queue() will do. When not doing TSO the behavior is super-stupid, we bump reference counts on the same page multiple times while running over the SKBs since consequetive SKBs cover data in different spans of the same page. The core issue is that we have a poorly behaving data container, and therefore that's obviously what we need to change. Conceptually what we probably need to do is seperate the data maintainence from the SKB objects themselves. There is a blob that maintains the paged data state for everything in the retransmit queue. SKBs are built and get the page pointers but don't actually grab references to the pages, the blob does that and it keeps track of how many SKB references to each page there are, non-atomically. The hardest part is dealing with the page lifetime issues. Unfortunately, when we trim the rtx queue, references to the clones can still exist in the driver output path. It's a difficult problem to overcome in fact, so in the end my suggestion above might not Web100 just provides statistics and other kinds of connection data to userspace, all the actual algorithm etc. modifications have been merged upstream and yanked out of the web100 patch. I was looking at it the other night and it's frankly totally uninteresting these days :-) --
On Fri, 7 Dec 2007, David Miller wrote: > From: "Ilpo_J
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> You're of course right, and it's ironic that I wrote the SACK splitting code so I should have known this :-) A possible approach just occurred to me wherein we maintain the SACK state external to the SKBs so that we don't need to mess with them at all. That would allow us to eliminate the TSO splitting but it would not remove the general problem of clean_rtx_queue()'s overhead. I'll try to give some thought to this over the weekend. --
On Fri, 7 Dec 2007, David Miller wrote: > From: "Ilpo_J
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Interesting approach, but I think there is limited value to this
(arguably) complex form.
The core issue is that the data and the SACK state are maintained in
the same datastructure. The complexity in all the state management
and fixups in your patch is purely because of this.
If we maintain SACK scoreboard information seperately, outside of
the SKB, then there are only two changes to make:
1) Every access to TCP_SKB_CB() SACK scoreboard is adjusted to
new data structure.
2) Retransmit is adjusted so that it can retransmit an SKB
constructed as a portion of an existing SKB. Since TSO
implies SG, this can be handled with simple offset and
length arguments and suitable creation of a clone referencing
the pages in the SG vector that contain the desired data.
I would envision this SACK state thing to reference into the
retransmit queue SKB's somehow. Each SACK block could perhaps
look something like:
struct sack_ref {
struct sk_buff *start_skb;
struct sk_buff *end_skb;
unsigned int start_off;
unsigned int len;
};
Traditionally we've prioritized the design of the SKB and other
infrastructure to suit TCP optimally and I still think we should
operate that way.
Therefore, long term, it is time to make a formal data blob to assist
with all of the repetitive work we do in cases like this and horribly
inefficient places like clean_rtx_queue().
So I'm basically advocating a two-pronged approach to this, the
seperate SACK scoreboard datastructure and the data blob. I
think we can work on the former right now, and take our time with
the data blob because it requires lots of thinking and we should
get it right as it might have network driver interface implications.
--
Greetings all, If you're redoing the driver interface, could I put in a request for packet time-stamping at a lower level? This thread started because TCP processing interferes with RTT estimation. This problem would be eliminated if time-stamping were done as soon as the packet comes off the NIC. $0.02 Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan --
From: "Lachlan Andrew" <lachlan.andrew@gmail.com> We don't do that because such timestamping is too expensive. It used to be the case that we did this, but we stopped doing that a long time ago. On x86 for example, timestamping can involve touching a slow I/O device to read the timestamp. We do not want to do that for every packet. Also, we timestamp differently for TCP, the global high resolution timestamp is overkill for this purpose. Really, this is a silly idea and would only be a bandaid for the problem at hand, that TCP input processing is too expensive in certain circumstances. --
Greetings Dave, OK. Thanks for the background. I thought that a TSC read was fairly cheap. Any messing around to interpret it could be the responsibility of any task which actually needs a high-resolution timestamp, couldn't it? If TSC is disabled, Overkill for Reno and Cubic, but useful for Vegas, LP, veno, Illinois and YeAH which are all in the kernel. They currently use "high resolution" timestamps which are effectively quantized to the scheduler resolution because of the way timestamping is done -- That problem should certainly be fixed as well -- I wasn't suggesting this as an alternative. Will fixing it fix the problem of those TCP modules suffering from CPU load from other sources? (I'm Cc'ing this to Darryl Veitch who has often wanted driver-level time-stamping for achieving high-resolution synchronization between hosts.) Cheers, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan --
From: "Lachlan Andrew" <lachlan.andrew@gmail.com> The TSC is cheap, but on the vast majority of systems it isn't usable as a stable time source, and we have to read the ACPI timer instead. --
On Wed, 12 Dec 2007 15:35:49 -0800 In current kernel, the congestion control can choose the desired accuracy. The units are always the same (microseconds), but if the flag value "TCP_CONG_RTT_STAMP" is set, then TCP tries to the high resolution real time clock. This is relatively expensive so only vegas, veno, LP, and YeAH use it. It would be worth an experiment to see whether it is worth it. All the delay based congestion control algorithms are impractical in the real world, but everyone keeps trying :-) -- Stephen Hemminger <shemminger@linux-foundation.org> --
Yeah, had just too much time while waiting for person who never arrived... :-) It would have covered the typical case quite well Not sure if it is going to all that straightforward because you seem to ignore in this simple description all details of performing the linking between the structures, which becomes tricky when we add those ...I.e., how to count retrans_out origins in such case because the presented sack_ref knows nothing about them nor do we have anything but one bit in ->sacked per skb. We need the origins when retransmission get sacked/acked to reduce retrans_out correctly. To solve this, I think the sack_ref must have ->sacked as well and the structure should store the R-bits there as well, and may L-bits too though their defination is more obvious because it's mostly a complement of sacked (except for small part near fackets_out and timedout marking that probably makes bookkeeping of L still necessary). The pcount boundaries are no longer that well defined after we break skb boundaries during retransmitting, so doing this makes fackets_out calculation even more trickier to track, unless whole pcount is replaced by byte based fackets_out :-/, which is very trivial to determine from seqnos only (TCP_NODELAYers might get unhappy though if we do that in a straightforward way...). This would clearly be a good step from one perspective. Nowadays GSO'ed skbs may get fragmented when mss_now changes, for two or more consecutive non-SACKed skbs that means one or more extra (small) packets when ...I assume that these are fast searchable? And skbs as well (because the linking of start/end_skb at minimum is a costly operation otherwise)? -- i. --
Ilpo, just trying to keep an old conversation from dying off. Did you happen to read a recent blog posting of mine? http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead I've been thinking more and more and I think we might be able to get away with enforcing that SACKs are always increasing in coverage. I doubt there are any real systems out there that drop out of order packets that are properly formed and are in window, even though the SACK specification (foolishly, in my opinion) allows this. If we could free packets as SACK blocks cover them, all the problems go away. For one thing, this will allow the retransmit queue liberation during loss recovery to be spread out over the event, instead of batched up like crazy to the point where the cumulative ACK finally moves and releases an entire window's worth of data. Next, it would simplify all of this scanning code trying to figure out which holes to fill during recovery. And for SACK scoreboard marking, the RB trie would become very nearly unecessary as far as I can tell. I would not even entertain this kind of crazy idea unless I thought the fundamental complexity simplification payback was enormous. And in this case I think it is. What we could do is put some experimental hack in there for developers to start playing with, which would enforce that SACKs always increase in coverage. If violated the connection reset and a verbose log message is logged so we can analyze any cases that occur. Sounds crazy, but maybe has potential. What do you think? --
Luckily we can see that already from MIBs, so quering people who have large servers, which are continously "testing" the internet :-), under their supervision or can access, and asking if they see any might help. I checked my dept's interactive servers and all had zero renegings, but I don't think I have access to www server which would have much wider I thought it a bit yesterday after reading your blog and came to conclusion that they won't, we can still get those nasty ACKs regardless of received SACK info (in here, missing). Even in some valid cases which include ACK losses besides actual data loss, not that this is the most common case but just wanted to point out that cleanup work is at least partially independent of SACK problem. So not "all" problems would go Two key cases for real pattern are: 1. Losses once per n, where n is something small, like 2-20 or so, usually happens at slow start overshoot or when compething traffic slow starts. Cumulative ACKs will cover only small part of the window once rexmits make through, thus this is not a problem. 2. Single loss (or few at the beginning of the window), rest SACKed. Cumulative ACK will cover original window when the last necessary rexmit gets through. Case 1 becomes nasty ACKy only if rexmit is lost as well, but in that case the arriving SACK blocks make the rest of the window equal to 2 :-). So I'm now trying to solve just case 2. What if we could somehow "combine" adjacent skbs (or whatever they're called in that model) if SACK covers them both so that we still hold them but can drop them in a very efficient way. That would make the combining effort split per ACK. And if reneging would occur, we can think a way to put the necessary fuzz into a form which cannot hurt the rest of the system (relatively easy & fast if we add CA_Reneging and allow retransmitting a portion of an skb similar to what you suggested earlier). And it might even be possible then to offer admin a ...
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> RFCs are great guides by which to implement things, but they have been often completely wrong or not practicle to follow strictly. The handling of out of order ACKs with timestamps is my favorite example. Nobody performs an RFC compliant timestamp check on ACK packets, or else their performance would go into the toilet during packet reordering. The URG bit setting is another one. Especially when, practically speaking, we can in fact make changes like I believe we can here I think we should. --
Linux has a code path where this can happen under memory over-commit, in tcp_prune_queue(). Also, I think one of the motivations for making SACK strictly advisory is there was some concern about buggy SACK implementations. Keeping data in your retransmit queue allows you to fall back to timeout and go-back-n if things completely fall apart. For better or worse, we have to deal with the spec the way it is. Even if you made this assumption of "hard" SACKs, you still have to worry about large ACKs if SACK is disabled, though I guess you could say people running with large windows without SACK deserve what they get. :) I haven't thought about this too hard, but can we approximate this by moving scaked data into a sacked queue, then if something bad happens merge this back into the retransmit queue? The code will have to deal with non-contiguous data in the retransmit queue; I'm not sure offhand if that violates any assumptions. You still have a single expensive ACK at the end of recovery, though I wonder how much this really hurts. If you want to ameliorate this, you could save this sacked queue to be batch processed later, in application context for instance. -John --
From: John Heffner <jheffner@psc.edu> That defeats the impetus for the change. We want to free up the data, say, 2 packets at a time as ACKs come in. The key goal is smooth liberation of retransmit queue packets over time. The big problem is that recovery from even a single packet loss in a window makes us run kfree_skb() for a all the packets in a full window's worth of data when recovery completes. If we just move such packets to a seperate list, we still have to iterate over all of them when the cumulative ACK arrives. This problem, that retransmit queue liberation is not smooth, is the biggest flaw in how SACK is specified. I mean, consider Ilpo's mentioned case of 500,000 packet windows. The issue cannot be ignored. SACK is clearly broken. You speak of a path in Linux where we can reneg on SACKs, but I doubt it really ever runs because of how aggressive the queue collapser is. Alexey even has a comment there: * This must not ever occur. */ To be honest this code sits here because it was written before the queue collapser was coded up. Really, find me a box where the LINUX_MIB_OFOPRUNED or LINUX_MIB_RECVPRUNED counters are anything other than zero. So this is a non-issue and I did consider it before proposing that we redefine SACK. --
Greetings David, John also suggested freeing the packets as a lower priority task, just doing it after they're acknowledged. When the ACK finally comes, you could do something like moving John's entire list of packets to a "to be freed" list, and free a few every time (say) another ACK comes in. $0.02, Lachlan -- Lachlan Andrew Dept of Computer Science, Caltech 1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA Ph: +1 (626) 395-8820 Fax: +1 (626) 568-3603 http://netlab.caltech.edu/~lachlan --
From: "Lachlan Andrew" <lachlan.andrew@gmail.com> You could, but this requires extra state and the operations to properly queue such items to the deferred task and wake it up. And none of this would be necessary if we could make SACK indications hard. --
Why exactly is it a problem to free them all at once? Are you worried about kernel preemption latencies? -Andi --
I also wonder how much of a problem this is (for now, with window sizes of order 10000 packets. My understanding is that the biggest problems arise from O(N^2) time for recovery because every ack was expensive. Have current tests shown the final ack to be a major source of problems? -John --
From: John Heffner <jheffner@psc.edu> Yes, several people have reported this. It's a real issue, and it's only going to get worse. --
I had some testing on this a month ago. A small set of recent results with linux 2.6.23.9 are at http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency One of serious cases with a large number of packet losses (initial loss is around 8000 packets) is at http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency/600--TCP-TCP-NONE--400-3-1.0--... Also, there is a comparison among three Linux kernels (2.6.13, 2.6.18-rc4, 2.6.20.3) at http://netsrv.csc.ncsu.edu/wiki/index.php/Efficiency_of_SACK_processing Sangtae --
If I'm reading this right, all these tests occur with large amounts of loss and tons of sack processing. What would be most pertinent to this discussion would be a test with a large window, with delayed ack and sack disabled, and a single loss repaired by fast retransmit. This would isolate the "single big ack" processing from other factors such as doubling the ack rate and sack processing. I could probably set up such a test, but I don't want to duplicate effort if someone else already has done something similar. Thanks, -John --
This thread got started because I tried to solve the other latencies but realized that it helps very little because this latency spike would have remained unsolved and it happens in one of the most common case. -- i. --
From: Andi Kleen <andi@firstfloor.org> If the cpu is there spinning freeing up 500,000 SKBs, it isn't processing RX packets. It adds severe spikes in CPU utilization that are even moderate line rates begins to affect RTTs. Or do you think it's OK to process 500,000 SKBs while locked in a software interrupt. Perhaps you have another broken awk script to prove this :-) --
You can always push it into a work queue. Even put it to other cores if you want. In fact this is already done partly for the ->completion_queue. Wouldn't be a big change to queue it another level down. Also even freeing a lot of objects doesn't have to be that expensive. I suspect the most cost is in taking the slab locks, but that could be batched. Without that the kmem_free fast path isn't particularly expensive, as long as the headers are still in cache. Long ago I had sk_buff optimized for the routing case so that freeing can be all done with a single cache line. That is long gone, but could be probably restored. But asking for protocol changes just to work around such a Your hand waved numbers on inline sizes there were definitely worse than mine. -Andi --
From: Andi Kleen <andi@firstfloor.org> We're touching SKB struct members, doing atomics on them, etc. for objects we haven't referenced for at least two RTTs so are guarenteed to be cache cold. Let's say best case we can get it down to 2 cache line misses, and as a very aggressive goal we can get the cost down to 100 cycles. For 500,000 packets this is 500 million cpu cycles to free them all up. That's 1/4 of a second even on a 2 GHZ cpu. And yes there are inherent costs in handling TCP windows that are 500,000 packets in size. But, that freeing cost should be spread Your primary objective seems to be "being right", and that's fine but realize that it makes discussing anything with you about as fun as picking one's toe nails with an ice axe. Eventually you will be ignored by most folks who get fed up by this style of argument. So, have fun being right rather than being pleasant to work with. --
Hi. Postponing freeing of the skb has major drawbacks. Some time ago I made a patch to postpone skb freeing behind rcu and got 2.5 times slower connection speed on some machines with decreased CPU usage though. So, queueing solution has to be proven with real data and although it looks good in one situation, it can be really bad in another. For interested reader: results of the RCUfication of the kfree_skbmem() http://tservice.net.ru/~s0mbre/blog/devel/networking/2006/12/05 -- Evgeniy Polyakov --
> Postponing freeing of the skb has major drawbacks. Some time ago I Yes, the trick would be to make sure that it also does not tie up too much memory. e.g. it would need some throttling at least. Also the fast path of kmem_cache_free() is actually not that much different from just putting something on a list so perhaps it would not make that much difference. -Andi --
On Fri, 7 Dec 2007, Ilpo J
