This allows less strict control of access to the qdisc attached to a
netdev_queue. It is even allowed to enqueue into a qdisc which is
in the process of being destroyed. The RCU handler will toss out
those packets.We will need this to handle sharing of a qdisc amongst multiple
TX queues. In such a setup the lock has to be shared, so will
be inside of the qdisc itself. At which point the netdev_queue
lock cannot be used to hard synchronize access to the ->qdisc
pointer.One operation we have to keep inside of qdisc_destroy() is the list
deletion. It is the only piece of state visible after the RCU quiesce
period, so we have to undo it early and under the appropriate locking.The operations in the RCU handler do not need any looking because the
qdisc tree is no longer visible to anything at that point.Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/sched/sch_generic.c | 20 +++++++++++---------
1 files changed, 11 insertions(+), 9 deletions(-)diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7e078c5..082db8a 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -545,6 +545,17 @@ EXPORT_SYMBOL(qdisc_reset);
static void __qdisc_destroy(struct rcu_head *head)
{
struct Qdisc *qdisc = container_of(head, struct Qdisc, q_rcu);
+ const struct Qdisc_ops *ops = qdisc->ops;
+
+ gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
+ if (ops->reset)
+ ops->reset(qdisc);
+ if (ops->destroy)
+ ops->destroy(qdisc);
+
+ module_put(ops->owner);
+ dev_put(qdisc_dev(qdisc));
+
kfree((char *) qdisc - qdisc->padded);
}@@ -552,21 +563,12 @@ static void __qdisc_destroy(struct rcu_head *head)
void qdisc_destroy(struct Qdisc *qdisc)
{
- const struct Qdisc_ops *ops = qdisc->ops;
-
if (qdisc->flags & TCQ_F_BUILTIN ||
!atomic_dec_and_test(&qdisc->refcnt))
return;list_del(&qdisc->list);
- gen_kill_estimator(&qdisc->b...
Still working my way through the patches, but this one caught my
eye (we had this before and it caused quite a few problems).One of the problems is that only the uppermost qdisc is destroyed
immediately, child qdiscs are still visible on qdisc_list and are
removed without any locking from the RCU callback. There are also
visibility issues for classifiers and actions deeper down in the
hierarchy.The previous way to work around this was quite ugly. qdisc_destroy()
walked the entire hierarchy to unlink inner classes immediately
from the qdisc_list (commit 85670cc1f changed it to what we do now).
That fixed visibility issues for everything visible only through
qdiscs (child qdiscs and classifiers). Actions are also visible
globally, so this might still be a problem, not sure though since
they don't refer to their parent (haven't thought about it much yet).Another problem we had earlier with this was that qdiscs previously
assumed changes (destruction) would only happen in process context
and thus didn't disable BHs when taking a read_lock for walking the
hierarchy (deadlocking with write_lock in BH context). This seems to
be handled correctly in your tree by always disabling BHs.The remaining problem is data that was previously only used and
modified under the RTNL (u32_list is one example). Modifications
during destruction now need protection against concurrent use in
process context.I still need to get a better understanding of how things work now,
so I won't suggest a fix until then :)--
From: Patrick McHardy <kaber@trash.net>
I took a closer look at u32_list.
It's members seem to be keyed by qdisc :-)
All this thing is doing is making up for the lack of a classifier
private pointer in the Qdisc.It's tough because classifiers of different types can be queued
up to a qdisc, so it's not like we can add one traffic classifier
private pointer to struct Qdisc and be done with it.However, I could not find anything other than u32 that seems to need
some global state that works like this.So what I've done is checked in the following change for now to make
u32 delete operation safe entirely within a qdisc tree's namespace.pkt_sched: Get rid of u32_list.
The u32_list is just an indirect way of maintaining a reference
to a U32 node on a per-qdisc basis.Just add an explicit node pointer for u32 to struct Qdisc an do
away with this global list.Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 0a158ff..8a44386 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -56,6 +56,8 @@ struct Qdisc
int (*reshape_fail)(struct sk_buff *skb,
struct Qdisc *q);+ void *u32_node;
+
/* This field is deprecated, but it is still used by CBQ
* and it will live until better solution will be invented.
*/
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 4d75544..527db25 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -75,7 +75,6 @@ struct tc_u_hnodestruct tc_u_common
{
- struct tc_u_common *next;
struct tc_u_hnode *hlist;
struct Qdisc *q;
int refcnt;
@@ -87,8 +86,6 @@ static const struct tcf_ext_map u32_ext_map = {
.police = TCA_U32_POLICE
};-static struct tc_u_common *u32_list;
-
static __inline__ unsigned u32_hash_fold(__be32 key, struct tc_u32_sel *sel, u8 fshift)
{
unsigned h = ntohl(key & sel->hmask)>>fshift;
@@ -287,9 +284,7 @@ static int u32_init(struct tc...
Actions are fine because they are intended to be globaly shared.
[i.e A classifier on ethx with qdiscA:Y (in/egress) can share an action
with classifer on ethy with qdiscB:Z (eg/ingress)].
Like you i need to digest the patches to understand the impact on the
rest but one thing i did notice was the last patch (replacement of
pfifo_fast):
prioritization based on TOS/DSCP (setsockopt) would no longer work, some
user space code may suffer (routing daemons likely). One suggestion to
fix it is to load pfifo qdisc (which does what fifo_fast is attempting)
for drivers that are h/ware multiq capable.cheers,
jamal--
Not all actions :) That nat action for example wasn't intended to
be shared at all. In fact I still need to submit a patch to make
it skip the shared hash as otherwise it simply won't scale as the
number of nat actions increases (e.g., to 256K).Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
True, sharing in the case of nat will cause scaling challenges because
there is a per-action locking. So you dont want to share in that case.
Let me clarify the global "sharedness" of actions, because i dont think
there is an issue:All actions (on a per-type hash table basis) have an index.
You create filter rule X and specify action nat.
You may specify the index of the action when you create the filter X.
If you then create another filter rule Y, also using the same action
index, then that nat action is shared between rule X and rule Y[1].If you dont specify the index a new nat action is created.
So in essence, if you create 256K rules each with an action and
as long as you dont specify the action index, you should be fine
because none will be shared.
The only scaling thing i can think of is to try and make the nat action
hash table large to reduce the init lookup. Other than that, once the
action is bound to a filter lookup cost is zero.cheers,
jamal[1]This is useful for tow reasons:
a) memory saving purposes: If you dont care that much about performance
or on a uniprocessor machine, one action would just be sufficient for
many rules.
b) accounting purposes; as you know qdiscs/filters/actions are
per-device. Over the years, a need has arosen from some users to have a
"per system" accounting (refer to the IMQ/IFB approach). Eg, if i wanted
the policer action to account for ingress eth0 and egress eth1 for a
user, i couldnt do it without some acrobatics.--
This is exactly what I want to get rid of because otherwise even
if no index was specified we'll still do a hash insertion which
simply falls apart with a small hash table. Using a large hashWe could do a dynamic table but so far I'm not convinced that
it's worth anybody's effort to implement :)Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
True.
But note: this is only during rule creation - once you create the
rule (user space to kernel path), then no more hash table reference.
Fast path has already a filter with actions attached, and is a mereIf user<->kernel performance insertion/deletion is important, it is
worth it.
For example:
Dave implemented dynamic hash tables on xfrm (voip setup time with ipsec
is a metric used in the industry in that case) . The only operational
problem i had with xfrm was lack of an upper bound of how large a table
can grow; i would rather user space be told ENOMEM than continuing to
grow in some cases (I actually implemented a patch which put a stop
after a certain number of sad/spd - but i dont expect hugs if i was to
post it;->).cheers,
jamal--
Unfortunately the scenario that I wrote this for requires frequent
Only if you also want to share it :) In the end I patched it to
Of course if you're volunteering to write the dynamic hash table
for actions then I'd happily switch back to sharing :)Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
I am trying to visualize: if you dont share, you must have 256K copies
It is a unique need like you said earlier (and would require a
medium-size surgery).
How about this: if a second user shows up with such a need I could do
it.If you knew you had a 256K entry, then you could make
NAT_TAB_MASK to be (256K-1) and you are guaranteed to get O(1) lookup if
you dont specify indices.
I know youve patched it already -havent quiet understood how and your
current solution may be better- but one other way is to have a Kconfig
option which lets the user type the size of the nat hash table size at
kernel compile time. So then a change
of the sort:
#ifdef CONFIG_HASH_SIZE
#define NAT_TAB_MASK CONFIGURED_HASH_SIZE
#else
#define NAT_TAB_MASK 15
#endifWhat do you think?
cheers,
jamal--
They can't be shared anyway because each of those 256K rules NATs
Sorry, I think I'll have to poke my eyes out :)
But yeah if we ever get a generic dynamic hash table implmenetation
then I'd be happy for act_nat to use that.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Ok, i have added it to my todo.
cheers,
jamal--
Yes, in that case its not a problem. The case where it behaves
That would perform priorization within each qdisc, the individual
qdiscs would still be transmitted using seperate HW queues though.--
From: Patrick McHardy <kaber@trash.net>
I think from certain perspectives it frankly doesn't matter.
It's not like the skb->priority field lets the SKB bypass the packets
already in the TX ring of the chip with a lower priority.It is true that, once the TX ring is full, the skb->priority thus
begins to have an influence on which packets are moved from the
qdisc to the TX ring of the device.However, I wonder if we're so sure that we want to give normal users
that kind of powers. Let's say for example that you set the highest
priority possible in the TOS socket option, and you do this for a ton
of UDP sockets, and you just blast packets out as fast as possible.
This backlogs the device TX ring, and if done effectively enough could
keep other sockets blocked out of the device completely.Are we really really sure it's OK to let users do this? :)
To me, as a default, I think TOS and DSCP really means just on-wire
priority.If we absolutely want to, we can keep the old pfifo_fast around and use
it (shared on multiq) if a certain sysctl knob is set.
--
Well security questions aside, I have seen this used on routers
to ensure certain traffic (e.g., VOIP) get preferential treatment
when it comes to transmission.Of course some of these routers setup custom qdiscs so they wouldn't
be affected, but there may be ones out there relying on the default
pfifo.Now I understand that having a pfifo is an anathema to multi-queue TX
by definition. However, even if we couldn't preserve the full
semantics by default, if we could at least preserve the ordering
within each queue it would still be a plus.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
From: Herbert Xu <herbert@gondor.apana.org.au>
That's how I basically feel right now as well.
So we can revert that change that killed pfifo_fast
if we want.
--
No, I fully agree that this is too much detail :) Its highly
unlikely that this default behaviour is important on a per
packet level :) I just meant to point out that using a pfifo
is not going to be the same behaviour as previously.--
Indeed QoS is irrelevant unless there is congestion.
The question is whether the packets sitting on the fifo qdisc are being
sorted fairly when congestion kicks in. Remember there is still a single
wire still even on multiple rings;->
If Woz (really) showed up at 9am and the Broussards at 3 am[1] on that
single (congestion-buffering) FIFO waiting for the shop/wire to open up,
then Woz should jump the queue (if he deserves it) when shop opens at
10am.If queues are building up, then by definition you have congestion
somewehere - IOW some resource (wire bandwidth, code-efficiency/cpu,
bus, remote being slow etc) is not keeping up.I am sorry havent read the patches sufficiently to answer that question
but i suspect that stashing the packets into different hardware queues
already solves this since the hardware does whatever scheduling it needsWe do today - if it is a concern, one could make the setsock opts
Agreed - with the caveat above on congestion. i.e it is still a single
IMO, if non-multiq drivers continue to work as before with the prios,
then nice. multiq could be tuned over a period of time.cheers,
jamal--
> The question is whether the packets sitting on the fifo qdisc are being
> sorted fairly when congestion kicks in. Remember there is still a single
> wire still even on multiple rings;->This is definitely true, but it is good to keep in mind that in the near
future we will start to see things look a little like multiple "virtual
wires." This is because of new ethernet standards like per-priority
pause, which makes it possible that one hardware ring on a NIC can
transmit while another ring is paused (possibly because of congestion
far off in the network).- R.
--
Thats essentially what i am arguing for.
[I think some, not all, of the wireless qos schemes also have similar
scheduling].My understanding of these wired "datacentre/virtualization" schemes you
describe is they are strict prio based. When the low prio "virtual wire"
is contending for the "physical wire" with a higher prio "virtual wire",
the high prio always wins.
We just need to make sure this behavior is also maintained whatever
buffering scheme is used within or above the driver(qdisc level).cheers,
jamal--
Sorry, the url of Woz allegedly jumping the queue:
http://news.cnet.com/8301-1023_3-9989823-93.html?hhTest=1%C3%A2%C2%88%
C2%82=rss&subj=news&tag=2547-1_3-0-20cheers,
jamal--
From: jamal <hadi@cyberus.ca>
The fundamental issue is what we believe qdiscs schedule, do they
schedule a device, or do they schedule what their namesake implies,
"queues"?Logically once we have multiple queues, we schedule queues.
Therefore what probably makes sense is that for mostly stateless
priority queueing such as pfifo_fast, doing prioritization at the
queue level is OK.But where non-trivial classification occurs, we have to either:
1) Make the queue selection match what the classifiers would do
exactly.OR
2) Point all the queues at a single device global qdisc.
What we have now implements #2. Later on we can try to do something
as sophisticated as #1.--
In the simple case of a single hardware queue/ring, the mapping between
a hardware queue and "physical wire" is one-to-one.IMO, in the case of multiple hardware queues per physical wire,
and such a netdevice already has a built-in hardware scheduler (they all
seem to have this feature) then if we can feed the hardware queues
directly, theres no need for any intermediate buffer(s).
In such a case, to compare with qdisc arch, its like the root qdisc is
in hardware.The only need for intermediate s/ware queues is for cases of congestion.
If you could have a single software queue for each hardware queue, then
you still have the obligation of correctness to make sure higher prio
hardware rings get fed with packets first (depending on the hardwaressure; i think you could achieve the goals by using the single queue with
a software pfifo_fast which maps skb->prio to hardware queues. such a
pfifo_fast may even sit in the driver. This queue will always be empty
unless you have congestion. The other thing is to make sure there is an
upper bound to the size of this queue; otherwise a remote bug could
cause it to grow infinitely and consume all memory.cheers,
jamal--
From: jamal <hadi@cyberus.ca>
They tend to implement round-robin or some similar fairness algorithm
amongst the queues, with zero concern about packet priorities.It really is just like a bunch of queues to the phsyical layer,
fairly shared.These things are built for parallelization, not prioritization.
--
pfifo_fast would be a bad choice in that case, but even a pfifo cannot
guarantee proper RR because it would present packets in a FIFO order
(and example the first 10 could go to hardware queue1 and the next to
hardware queue2).My view: i think you need a software queue per hardware queue.
Maybe even these queues residing in the driver; that way you take care
of congestion and it doesnt matter if the hardware is RR or strict prio
(and you dont need the pfifo or pfifo_fast anymore).
The use case would be something along:
packets comes in, you classify find its for queue1, grab the
per-hardware-queue1 lock, find the hardware queue1 is overloaded and
stash it instead in s/ware queue1. If it wasnt congested, it would go on
hardware queue1.
When hardware queue1 becomes available and netif-woken, you pick first
from s/ware queue1 (and batching could apply cleanly here) and send themI am suprised prioritization is not an issue. [My understanding of the
intel/cisco datacentre cabal is they serve virtual machines using
virtual wires; i would think in such scenarios youd have some customersTotal parallelization happens in the ideal case. If X cpus classify
packets going to X different hardware queueus each CPU grabs only locks
for that hardware queue. In virtualization, where only one customer's
traffic is going to a specific hardware queue, things would work well.
Non-virtualization scenario may result in collision in which two or more
CPUs may contend for the same hardware queue (either transmitting or
netif-waking etc).cheers,
jamal--
From: jamal <hadi@cyberus.ca>
Jamal, I have to wonder why are you so hung up on matching our
software qdiscs with whatever fairness algorithm the hardware happens
to implement internally for the TX queues?Every time this topic comes up, you insist on them having to match.
And I have no idea why.It's largely irrelevant and the TX queue can be viewed purely as a
buffer between us and the device, nearly a black hole we stick packetsThe problem is that the bottleneck is the qdisc itself since all those
cpus synchonize on it's lock. We can't have a shared qdisc for the
device and get full parallelization.That's why we're having one qdisc per TX queue, so that they all don't
bunch up on the qdisc lock.Otherwise, there is zero point in all of these TX multiqueue features
in the hardware if we can't parallelize things fully.--
I dont insist on them matching, just on correctness. i.e if you say you
have RR, then the scheduling needs to meet those requirements not anThat last sentence i have no issues with - it is what i thought wasnt
happening;-> i misunderstood it to be a single fifo shared by all
hardware tx queues from the begining (otherwise i wouldnt be posting).
We are in sync i think, a single pfifo per TX queue is the way to go. I
was suggesting it goes in the driver, but this is cleaner: In the
future, one could could actually replace the pfifo with another qdisc
since the single virtual wire becomes equivalent to a single virtualparallelization is achieveable in the ideal case.
cheers,
jamal--
I think I get you now. You're suggesting that we essentially
do what Dave has right now in the non-contending case, i.e.,
bypassing the qdisc so we get fully parallel processing until
one of the hardware queues seizes up.At that point you'd stop all queues and make every packet go
through the software qdisc to ensure ordering. This continues
until all queues have vacancies again.If this is what you're suggesting, then I think that will offer
pretty much the same behaviour as what we've got, while still
offering at least some (perhaps even most, but that is debatable)
of the benefits of multi-queue.At this point I don't think this is something that we need right
now, but it would be good to make sure that the architecture
allows such a thing to be implemented in future.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
From: Herbert Xu <herbert@gondor.apana.org.au>
Doing something like the noqueue_qdisc (bypassing the qdisc entirely)
is very attractive because it eliminates the qdisc lock. We're only
left with the TX lock.This whole idea of doing an optimized send to the device when the
TX queue has space is indeed tempting.But we can't do it for qdiscs that measure rates, enforce limits, etc.
And if the device kicks back at us with an error, we have to
perform the normal ->enqueue() path.
--
yes. That way there is no need for an intermediate queueing. As it is
now, packets first get queued to qdisc then we dequeu and send to driver
even when the driver would be happy to take it. That approach is fine
if you want to support non-work conserving schedulers on single-hwqueueI always visualize these as a single netdevice per hardware tx queue.
If i understood correctly, ordering is taken care of already in the
current patches because the stateless filter selects a hardware-queue.Dave has those queues sitting as a qdisc level (as pfifo) - which seems
better in retrospect (than what i was thinking that they should sit in
the driver) because one could decide they want to shape packets in the
future on a per-virtual-customer-sharing-a-virtual-wire and attach an
HTB instead.The one thing i am unsure of still:
I think it would be cleaner to just stop a single queue (instead of all)
when one hardware queue fills up. i.e if there is no congestion on the
other hardware queues, packets should continue to be fed to theirI think it is a pretty good first start (I am a lot more optimistic to
be honest).
Parallelization would work if you can get X CPUs to send to X hardware
queues concurently. Feasible in static host setup like virtualization
environment where you can tie a vm to a cpu. Not very feasible in
routing where you are driven to a random hardware tx queue by arriving
packets.cheers,
jamal--
Right you don't necessarily have to stop all queues but you do
It should work just fine for routing assuming your card does
multi-queue RX.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Ok - You may be able to pull it if you have the exact same hashing on
hardware rx as it is on transmit stateless filtering.
I can see a traffic pattern that could be cooked to give some good
numbers in such a scenario;->cheers,
jamal--
Actually you've hit it on the head, as an alternative to TX hashing
on the packet content, we need to allow TX queue selection based on
the current CPU ID.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
From: Herbert Xu <herbert@gondor.apana.org.au>
This we should avoid, it would allow reordering within a flow.
--
Not if the RX hashing is flow-based...
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
From: Herbert Xu <herbert@gondor.apana.org.au>
Can you control the process scheduler and where it decides
to run the processing running sendmsg() too?That's the problem.
It is totally unwise to do CPU based TX hashing.
--
Right I'm not suggesting having this as a default. However,
if you have a finely tuned system (e.g., a router) where you've
pinned all you RX queues to specific CPUs and your local apps
as well then it would make sense to provide this as an alternative.If this alternative doesn't exist, then unless the RX hash happens
to match the TX hash, for routing at least the packets are going
to jump all over the place which isn't nice.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
From: Herbert Xu <herbert@gondor.apana.org.au>
Where are these places they are going to "jump all over"? :-)
You can view the RX and TX queues as roughly independent namespaces.
If the TX hash is good enough (current one certainly isn't and I will
work on fixing that), it is likely to spread the accesses enough that
there won't be many collisions to matter.If you really think about it, this is even pointless. Packets that
hit RX queue A will always hit TX queue B, and so on and so forth.
Therefore there will be no new level of separation afforded by using
some kind of strict RX to TX mapping.We could provide the option, but it is so dangerous and I also see no
real tangible benfit from it.
--
Well consider the case where you have 4 queues, but a large number
of flows per second (>= 1000). No matter how good your hash is,
there is just no way of squeezing 1000 flows into 4 queues without
getting loads of collisions :)So let's assume that these flows have been distributed uniformly
by both the RX hash and the TX hash such that each queue is handling
~250 flows. If the TX hash does not match the result produced by
the RX hash, you're going to get a hell lot of contention once you
get into the NIC driver on the TX side.This is because for NICs like the ones from Intel ones you have to
protect the TX queue accesses so that only one CPU touches a given
queue at any point in time.The end result is either the driver being bogged down by lock or
TX queue contention, or the mid-layer will have to redistribute
skb's to the right CPUs in which case the synchronisation cost isI agree that what you've got here makes total sense for a host.
The benefit as far as I can see is that this would allow a packet's
entire journey through Linux to stay on exactly one CPU. There will
be zero memory written by multiple CPUs as far as that packet is
concerned.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
From: Herbert Xu <herbert@gondor.apana.org.au>
How so? If the TX hash is well distributed, which it should be,
it is at least going to approximate the distribution provided bySo will the existing one, enough of the time for it to matter,
and yes even on a router or firewall.At least this is my belief :-)
--
This is a matter of probabilities :) In general, if the TX hash
and the RX hash are completely unrelated, then the end result of
the hash distribution should be independent of each other (independent
in the probablistic sense). That is, for the flows which have
been RX hashed into one queue, they should be hashed on average
across all queues by the TX hash. Conversely, those that have
been hashed into one TX queue would be distributed across all
RX queues.Now if you've bound each RX queue to a specific CPU, then this
means for a given TX queue, its packets are going to come from
all the CPUs (well all those involved in network reception anyway).As each TX queue is designed to be accessed by only one CPU at
a time, somebody somewhere has to pay for all this synchronisation :)Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
From: Herbert Xu <herbert@gondor.apana.org.au>
Theoretically perhaps you are right.
Can I at least get some commitment that someone will test
that this really is necessary before we add the CPU ID
hash option?
--
Sure, I'll be testing some related things on this front so I'll
try to produce some results that compare these two cases.For the purposes of demonstrating this I'll choose the worst-case
scenario, that is, I'll ensure that the TX hash result exactly
distributes each RX hash preimage across all possible TX hash
values.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
I haven't had a chance to do the test yet but I've just had an
idea of how we can get the best of both worlds.The problem with always directing traffic based on the CPU alone
is that processes move around and we don't want to introduce packet
reordering because to that.The problem with hashing based on packet headers alone is that
it doesn't take CPU affinity into account at all so we may end
up with a situation where one thread out of a thread pool (e.g.,
a web server) has n sockets which are hashed to n different
queues.So here's the idea, we determine the tx queue for a flow based
on the CPU on which we saw its first packet. Once we have decided
on a queue we store that in a dst object (see below). This
ensures that all subsequent packets of that flow ends up in
the same queue so there is no reordering. It also avoids the
problem where traffic genreated by one CPU gets scattered across
queues.Of course to make this work we need to restart the flow cache
project so that we have somewhere to store this txq assignment.The good thing is that a flow cache would be of benefit for IPsec
users too and I hear that there is some interest in doing that
in the immediate future. So perhaps we can combine efforts and
use it for txq assignment as well.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Assuming multi-rx queues with configurable MSI or otherwise to map
to a receive processor, then in the case of routing/bridging or
otherfavoriteformofforwarding:
If you tie static filters to a specific cpu that will always work.
So no reordering there.Indeed. In the forwarding case, the problem is not reordering rather
all flows will always end up in the same cpu. So if you may end up
just overloading one cpu while the other 1023 stayed idle.
My earlier statement was you could cook traffic scenarios where allWont work with static multi-rx nic; iirc, changing those filters is
_expensive_. so you want it to stay static.cheers,
jamal--
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 22 Aug 2008 16:56:55 +1000IPSEC already has a flow cache doesn't it? :)
I suppose you're talking about the grand unified flow cache
that never gets implemented right? I classify that in the
same category as net channels at the moment, theoretically
very interesting but no practical implementation in sight.
--
In any case, we could implement what I suggested even without
a flow cache, by simply storing the info in the dst. This means
that all flows on the same dst will end up in the same queue, which
is not as good as what a flow cache could give, but it's useful
enough to be an option.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
From: Herbert Xu <herbert@gondor.apana.org.au>
I had another idea in the back of my head that I wanted to mention.
The SKB has a hash, and sources set the hash. It doesn't matter
where in the stack it comes from.At transmit time, the select queue logic in dev_queue_xmit() will use
the SKB hash if one has been set already. Otherwise it will do the
hashing it does currently.So in the simplest case for forwarding, the RX side puts the RSS
hash or whatever into this SKB hash location.Then, taking things to the next level, protocols set hashes for
locally created packets. And this leads to being able to delete
the simple_tx_hash() code entirely and also we won't have to add
support for every protocol on the planet to that function. :)TCP sockets maybe could even do something clever, like, using the
incoming SKB hash of the SYN and SYN+ACK packets for all subsequent
SKBs sent by that socket.Another nice side effect, we have the choice of allowing IPSEC
encapsulation of locally generated frames to not change the TX queue.
This would have to be controlled by a sysctl or similar because I
can see both ways being useful in different circumstances.You get the idea.
--
Makes sense.
In the case Herbert described, I am assuming that the RSS side could be
configured to select a processor statically (MSI or otherwise). And that
the hash value will be that of the processorid ego the tx queueWell, conntrack could be useful (as long as you dont make it mandatory)
since it already keeps track of flows.
When unresolved, one could add it to some default queue.cheers,
jamal--
And this takes care of the hosts.
Best of all it's per flow and we don't even need to add any new
infrastructure :)Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Actually, can i modify that thought and go back to my initial contention
now that things are making more sense?;->
A single s/ware queue per hardware transmit queue is good - but that
being pfifo_fast would be a lot better. It would, in the minimal, keep
things as they were for non-multiq and is a sane choice for any virtual
wire.cheers,
jamal--
From: jamal <hadi@cyberus.ca>
Ok, I'll revert the change that changed pfifo_fast into plain fifo_fast.
Thanks.
--
Except wireless where HW has prioritizing scheduler per physical non-wire.
Tomas
--
From: "Tomas Winkler" <tomasw@gmail.com>
I know that. We're talking about multiqueue ethernet NICs.
--
From: Patrick McHardy <kaber@trash.net>
Indeed, it's the most delicate change.
Thanks for the info about all the tricky bits in this area.
--
One thought that occured to me - we could avoid all the visiblity
issues wrt. dev->qdisc_list by simply getting rid of it :)If we move the qdisc list from the device to the root Qdisc itself,
it would become invisible automatically as soon as we assign a new
root qdisc to the netdev_queue. Iteration would become slightly
more complicated since we'd have to iterate over all netdev_queues,
but I think it should avoid most of the problems I mentioned
(besides the u32_list thing).--
From: Patrick McHardy <kaber@trash.net>
What might make sense is to have a special Qdisc_root structure which
is simply:struct Qdisc_root {
struct Qdisc qd;
struct list_head qdisc_list;
};Everything about tree level synchronization would be type explicit.
Yes, as you say, the qdisc iteration would get slightly ugly. But
that doesn't seem to be a huge deal.But it seems a clean solution to the child qdisc visibility problem.
About u32_list, that thing definitely needs some spinlock. The
consultation of that list, and refcount mods, only occur during config
operations. So it's not like we have to grab this lock in the data
paths.If we really want to sweep this problem under the rug, there is another
way. Have the qdisc_destroy() RCU handler kick off a workqueue, and
grab the RTNL semaphore there during the final destruction calls. :-)--
Device level grafting is also explicit, so that looks like
That would be the safe way. The RCU destruction used to cause us
bugs for at least two years, but I actually believe the Qdisc_root
thing will work :)
--
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Nigel Cunningham | Re: [PATCH] Remove process freezer from suspend to RAM pathway |
| Paul Mundt | Re: 2.6.22-rc4-mm2 |
| Greg Kroah-Hartman | [PATCH 001/196] Chinese: Add the known_regression URI to the HOWTO |
git: | |
| Arjan van de Ven | Re: [GIT]: Networking |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Jarek Poplawski | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Natalie Protasevich | [BUG] New Kernel Bugs |
