Re: [net-next-2.6 PATCH v2 3/3] net_sched: implement a root container qdisc sch_mclass

Previous thread: Re: IPTV buffering by Jesper Dangaard Brouer on Tuesday, December 21, 2010 - 9:24 am. (2 messages)

Next thread: [PATCH net-next 0/3] dcbnl: Extending dcbnl to support HW based DCBX by Shmulik Ravid on Tuesday, December 21, 2010 - 12:32 pm. (1 message)
From: John Fastabend
Date: Tuesday, December 21, 2010 - 12:28 pm

This patch provides a mechanism for lower layer devices to
steer traffic to tx queues using skb->priority. This allows
hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock, while reliably receiving skbs on the correct tx ring
to avoid the head of line blocking that results from shuffling
in the LLD. Finally, all the goodness from txq caching and
xps/rps can still be leveraged.

Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.

By using select_queue for this, drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally, if admins are expected to build a
qdisc and filter rules to steer traffic, this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
This approach also incurs all the overhead of a qdisc with filters.

With the mechanism in this patch users can set skb priority using
expected methods, i.e. setsockopt(), or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case, with
a single traffic class and all queues in this class, everything
works as is until the LLD enables multiple tcs.

To steer the skb we mask the priority down to its lower 4 bits
and allow the hardware to configure up to 15 distinct classes
of traffic. This is expected to be sufficient for most applications;
at any rate it is more than the 802.1Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.
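
The steering described above can be sketched roughly as follows. This is
an illustration only, not the code from the patch: TC_BITMASK, MAX_TC,
struct tc_config, prio_to_tc() and pick_txq() are hypothetical names, and
spreading flows inside a class with a plain modulo on a flow hash is an
assumption made to keep the example short.

```c
#include <stdint.h>

#define TC_BITMASK 0xF	/* hypothetical name: keep the low 4 priority bits */
#define MAX_TC     16	/* hypothetical upper bound for this sketch */

struct tc_txq {		/* a count/offset pair describing one queue range */
	uint16_t count;
	uint16_t offset;
};

struct tc_config {	/* hypothetical container, for illustration only */
	uint8_t       prio_tc_map[TC_BITMASK + 1];
	struct tc_txq tc_to_txq[MAX_TC];
};

/* Map a priority to a traffic class using only its lower 4 bits. */
static uint8_t prio_to_tc(const struct tc_config *cfg, uint32_t priority)
{
	return cfg->prio_tc_map[priority & TC_BITMASK];
}

/* Pick a tx queue inside the range owned by the packet's traffic class. */
static uint16_t pick_txq(const struct tc_config *cfg, uint32_t priority,
			 uint32_t flow_hash)
{
	const struct tc_txq *r = &cfg->tc_to_txq[prio_to_tc(cfg, priority)];

	if (r->count == 0)	/* empty class: fall back to its first queue */
		return r->offset;
	return (uint16_t)(r->offset + flow_hash % r->count);
}
```

Because the queue is always chosen from within the class's own range, a
flow never lands on a ring that belongs to another traffic class.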

This in conjunction with a userspace application such as
lldpad can be used to implement 802.1Q ...
From: John Fastabend
Date: Tuesday, December 21, 2010 - 12:29 pm

This patch modifies the mq qdisc to allow multiple mq qdiscs
to be used, allowing TX queues to be grouped for management.

This allows a root container qdisc to create multiple traffic
classes and use the mq qdisc as a default queueing discipline. It
is expected other queueing disciplines can then be grafted to the
container as needed.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 net/sched/sch_mq.c |   52 +++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index ecc302f..86da74c 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -19,17 +19,32 @@
 
 struct mq_sched {
 	struct Qdisc		**qdiscs;
+	struct netdev_tc_txq	tc_txq;
+	u8 num_tc;
 };
 
+static void mq_queues(struct net_device *dev, struct Qdisc *sch)
+{
+	struct mq_sched *priv = qdisc_priv(sch);
+	if (priv->num_tc) {
+		int queue = TC_H_MIN(sch->parent) - 1;
+		priv->tc_txq.count = dev->tc_to_txq[queue].count;
+		priv->tc_txq.offset = dev->tc_to_txq[queue].offset;
+	} else {
+		priv->tc_txq.count = dev->num_tx_queues;
+		priv->tc_txq.offset = 0;
+	}
+}
+
 static void mq_destroy(struct Qdisc *sch)
 {
-	struct net_device *dev = qdisc_dev(sch);
 	struct mq_sched *priv = qdisc_priv(sch);
 	unsigned int ntx;
 
 	if (!priv->qdiscs)
 		return;
-	for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
+
+	for (ntx = 0; ntx < priv->tc_txq.count && priv->qdiscs[ntx]; ntx++)
 		qdisc_destroy(priv->qdiscs[ntx]);
 	kfree(priv->qdiscs);
 }
@@ -42,20 +57,24 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
 	struct Qdisc *qdisc;
 	unsigned int ntx;
 
-	if (sch->parent != TC_H_ROOT)
+	if (sch->parent != TC_H_ROOT && !dev->num_tc)
 		return -EOPNOTSUPP;
 
 	if (!netif_is_multiqueue(dev))
 		return -EOPNOTSUPP;
 
+	/* Record num tc info in priv so we can tear down cleanly */
+	priv->num_tc = dev->num_tc;
+	mq_queues(dev, sch);
+
 	/* pre-allocate ...
From: John Fastabend
Date: Tuesday, December 21, 2010 - 12:29 pm

This implements an mclass 'multi-class' queueing discipline that by
default creates multiple mq qdiscs, one for each traffic class. Each
mq qdisc then owns a range of queues per the netdev_tc_txq mappings.

Using the mclass qdisc the number of tcs currently in use along
with the range of queues allotted to each class can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.

Configurable parameters:

struct tc_mclass_qopt {
        __u8    num_tc;
        __u8    prio_tc_map[16];
        __u8    hw;
        __u16   count[16];
        __u16   offset[16];
};

Here the count/offset pairings give the queue alignment and the
prio_tc_map gives the mapping from skb->priority to tc. The
hw bit determines if the hardware should configure the count
and offset values. If the hw bit is set then the operation
will fail if the hardware does not implement the ndo_setup_tc
operation. This is to avoid undetermined states where the hardware
may or may not control the queue mapping. Also, minimal bounds
checking is done on the count/offset to verify a queue does not
exceed num_tx_queues and that queue ranges do not overlap. Otherwise
it is left to user policy or hardware configuration to create
useful mappings.
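
The bounds checking described above could look roughly like the sketch
below. It is not the code from the patch. The struct is reproduced with
userspace integer types so the example stands alone, mclass_validate() is
a hypothetical name, and rejecting a class with a zero count is an
assumption rather than something the description states.

```c
#include <errno.h>
#include <stdint.h>

#define MCLASS_MAX_TC 16	/* matches the array sizes in the struct */

struct tc_mclass_qopt {		/* userspace-typed copy of the struct above */
	uint8_t  num_tc;
	uint8_t  prio_tc_map[16];
	uint8_t  hw;
	uint16_t count[16];
	uint16_t offset[16];
};

/* Return 0 if every class's queue range fits in the device and no two
 * ranges overlap, -EINVAL otherwise. */
static int mclass_validate(const struct tc_mclass_qopt *q,
			   unsigned int num_tx_queues)
{
	unsigned int i, j;

	if (q->num_tc > MCLASS_MAX_TC)
		return -EINVAL;

	for (i = 0; i < q->num_tc; i++) {
		unsigned int start = q->offset[i];
		unsigned int end = start + q->count[i];	/* one past last */

		/* Assumption: an empty class is treated as invalid. */
		if (q->count[i] == 0 || end > num_tx_queues)
			return -EINVAL;

		/* Half-open ranges overlap if each starts before the
		 * other ends. */
		for (j = i + 1; j < q->num_tc; j++) {
			unsigned int s2 = q->offset[j];
			unsigned int e2 = s2 + q->count[j];

			if (start < e2 && s2 < end)
				return -EINVAL;
		}
	}
	return 0;
}
```

The pairwise comparison does not require the classes to be listed in
ascending queue order, at the cost of a quadratic loop over at most 16
entries.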

It is expected that hardware QOS schemes can be implemented by
creating appropriate mappings of queues in ndo_setup_tc(). This
scheme can be expanded as needed with additional qdiscs being grafted
onto the root qdisc to provide per tc queueing disciplines, allowing
software and hardware queueing disciplines to be used together.
One expected use case is that drivers will use ndo_setup_tc to map
queue ranges onto 802.1Q traffic classes. This provides a generic
mechanism to map network traffic onto these traffic classes and
removes the need for lower layer drivers to know specifics about
traffic types.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/netdevice.h |    3 
 ...
From: Jarek Poplawski
Date: Thursday, December 30, 2010 - 4:37 pm

Is it really necessary to add one more abstraction layer for this
functionality, which is probably not often used (or even asked for
by users)? Why can't mclass simply do these few extra things itself
instead of attaching (and changing) mq?



Actually, where is this num_tc expected to be set? I can see it inside



Are these offsets etc. validated?

Jarek P.
--

From: Jarek Poplawski
Date: Thursday, December 30, 2010 - 4:56 pm

They are... Forget this last question.

Jarek P.
--

From: John Fastabend
Date: Sunday, January 2, 2011 - 10:43 pm

The statistics work nicely when the mq qdisc is used. 

qdisc mclass 8002: root  tc 4 map 0 1 2 3 0 1 2 3 1 1 1 1 1 1 1 1
             queues:(0:1) (2:3) (4:5) (6:15)
 Sent 140 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc mq 8003: parent 8002:1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc mq 8004: parent 8002:2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc mq 8005: parent 8002:3
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc mq 8006: parent 8002:4
 Sent 140 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc sfq 8007: parent 8005:1 limit 127p quantum 1514b
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc sfq 8008: parent 8005:2 limit 127p quantum 1514b
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

The mclass qdisc gives the statistics for the interface, and the statistics on each mq qdisc then give the statistics for one traffic class. Also, when using the 'mq qdisc' with this abstraction, other qdiscs can be grafted onto the queues. For example, sch_sfq is used in the above example.
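
A rough model of how the counters in the output above relate: each
traffic class total is the sum over the tx queues in its range, and the
root total is the sum over all queues. This is only an illustration and
not how the qdisc code gathers statistics; struct q_stats and
sum_range() are hypothetical names, and reading "(6:15)" as first and
last queue of the class is an assumption.

```c
#include <stdint.h>

struct q_stats {	/* hypothetical per-queue counters */
	uint64_t bytes;
	uint64_t packets;
};

/* Sum the counters of 'count' consecutive tx queues starting at 'offset'. */
static struct q_stats sum_range(const struct q_stats *txq,
				unsigned int offset, unsigned int count)
{
	struct q_stats total = { 0, 0 };
	unsigned int i;

	for (i = offset; i < offset + count; i++) {
		total.bytes += txq[i].bytes;
		total.packets += txq[i].packets;
	}
	return total;
}
```

With 16 queues and the ranges shown above, summing offset 6, count 10
would give the figures reported for 8002:4, and summing offset 0,
count 16 would give the root figures.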







Yes, as your next email noted.

Thanks,
John
--

From: Jarek Poplawski
Date: Monday, January 3, 2011 - 10:02 am

Well, I sometimes add leaf qdiscs only to get class stats with less

IMHO, these tc offsets and counts simply make a two-level hierarchy
(classes with leaf subclasses), similar to (or simpler than) other
classful qdiscs which manage it all inside one module. Of course,
we could think of another way of organizing the code, but that should
rather be done at the beginning of the schedulers' design. The mq qdisc
broke the design a bit by adding a fake root, but I doubt we should go
deeper unless it's necessary. Doing mclass (or something) as a more
complex alternative to mq should be enough. Why couldn't mclass graft

I am not too hung up on this either, especially if it's OK to others,

Maybe you're right. On the other hand, usually flags are added for
more general purpose and the optimal/wrong configs are the matter of

OK, I probably missed this second possibility in the last version.


OK, anyway, all these '16' should be 'upgraded'.
 
Thanks,
Jarek P.
--

From: John Fastabend
Date: Monday, January 3, 2011 - 1:37 pm

If you also want to graft a scheduler onto a traffic class, now you're stuck. For now this qdisc doesn't exist, but I would like to have a software implementation of the currently offloaded DCB ETS scheduler. The 802.1Qaz spec allows different scheduling algorithms to be used on each traffic class. In the current implementation mclass could graft these scheduling schemes onto each traffic class independently.

                              mclass
                                |
    -------------------------------------------------------
    |         |        |        |     |     |     |       |
   mq_tbf   mq_tbf   mq_ets   mq_ets  mq    mq   mq_wrr greedy
   |                            |
 ---------                  ---------
 |   |   |                  |   |   |
red red red                red red red



Yes. I'll do this in the next version.

Thanks,

--

From: Jarek Poplawski
Date: Monday, January 3, 2011 - 3:59 pm

Probably, despite this very nice figure and description, I still miss
something and can't see the problem. If you graft a qdisc/scheduler
to a traffic class you can change the way/range of grafting depending
on additional parameters or even by checking some properties of the
grafted qdisc. My main concern is adding complexity to the qdisc tree
structure (instead of hiding it at the class level) for something,
IMHO, hardly ever popular (like both mq and DCB).

Thanks,
Jarek P.
--

From: John Fastabend
Date: Monday, January 3, 2011 - 5:18 pm

OK, I'm convinced. I'll keep everything contained in mclass. Building this mechanism into the qdisc seems to add extra complexity that is most likely not needed, as you noted.

Although I suspect the "additional parameter" would be something along the lines of a queue index and offset, right? Otherwise how would an mq-like qdisc know which queues it owns?

Thanks,
John.

--

From: John Fastabend
Date: Monday, January 3, 2011 - 7:59 pm

Perhaps something with qdisc_class_ops select_queue() could be done to make it more flexible. When I get around to implementing these hypothetical qdiscs I will have to figure it out. For now though hypothetical qdiscs are not a very compelling use case.

Thanks,
John.
--

From: Johannes Berg
Date: Wednesday, December 22, 2010 - 2:12 am

Is there any chance this might be applicable to the 802.11 layer as
well? We will definitely still need an ndo_select_queue handler to reset
in the case where the peer doesn't support QoS, but it seems the part
that depends on the frame itself could be pushed out to the generic
framework instead of having net/wireless/util.c:cfg80211_classify8021d?

johannes

--

From: John Fastabend
Date: Wednesday, December 22, 2010 - 10:29 pm

Johannes,

I took a quick look at this and I believe it should be doable. It would be nice to completely remove the ndo_select_queue if possible though.

I probably won't have a chance to look any further into this for at least a week, maybe two. So I'll think about it a bit more later unless someone else gets there first.

Thanks,

--

From: Stephen Hemminger
Date: Sunday, December 26, 2010 - 4:47 pm

On Wed, 22 Dec 2010 21:29:43 -0800

The Beceem Wimax driver has the same kind of select queue.

-- 
--
