Re: [net-next-2.6 PATCH v4 1/2] net: implement mechanism for HW based QOS

From: John Fastabend
Date: Monday, January 3, 2011 - 8:05 pm

This patch provides a mechanism for lower layer devices to
steer traffic using skb->priority to tx queues. This allows
for hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock. It also lets the LLD reliably receive skbs on the correct
tx ring, avoiding the head-of-line blocking that results from
shuffling in the LLD. Finally, all the goodness from txq caching
and xps/rps can still be leveraged.

Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.

If select_queue is used for this, drivers need to be updated for
each and every traffic type, and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally if admins are expected to build a
qdisc and filter rules to steer traffic this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
Also this approach incurs all the overhead of a qdisc with filters.

With the mechanism in this patch users can set skb priority using
expected methods, i.e. setsockopt(), or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case with
a single traffic class and all queues in this class everything
works as is until the LLD enables multiple tcs.

To steer the skb we mask the priority down to its lower 4 bits
and allow the hardware to configure up to 15 distinct classes
of traffic. This is expected to be sufficient for most applications;
at any rate it is more than the 802.1Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.
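
As a rough sketch of the lookup described above, assuming the masked priority
indexes a 16-entry priority-to-class table. The macro and function names here
are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PRIO_MASK  0xf   /* keep only the lower 4 bits */
#define SKETCH_PRIO_SLOTS 16    /* one table slot per masked value */

/* Hypothetical helper: translate an skb priority into a traffic class
 * index by masking it and looking the result up in the table. */
static uint8_t sketch_prio_to_tc(const uint8_t prio_tc_map[SKETCH_PRIO_SLOTS],
                                 uint32_t priority)
{
        assert(prio_tc_map);
        return prio_tc_map[priority & SKETCH_PRIO_MASK];
}
```

Any bits above the low four are ignored, so priorities 0x2 and 0x12 land in
the same class.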

This, in conjunction with a userspace application such as
lldpad, can be used to implement 802.1Q ...
From: John Fastabend
Date: Monday, January 3, 2011 - 8:05 pm

This implements a mqprio queueing discipline that by default creates
a pfifo_fast qdisc per tx queue and provides the needed configuration
interface.

Using the mqprio qdisc, the number of tcs currently in use along
with the range of queues allotted to each class can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.

Configurable parameters:

struct tc_mclass_qopt {
        __u8    num_tc;
        __u8    prio_tc_map[16];
        __u8    hw;
        __u16   count[16];
        __u16   offset[16];
};

Here the count/offset pairing gives the queue alignment and the
prio_tc_map gives the mapping from skb->priority to tc.
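
To illustrate how a count/offset pair bounds a class, here is a small userspace
sketch. The struct is a local copy of the layout above using <stdint.h> types,
and spreading flows across the range with a caller-supplied hash is an
assumption made for the example, not a statement of what the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Local userspace copy of the layout quoted above. */
struct sketch_mclass_qopt {
        uint8_t  num_tc;
        uint8_t  prio_tc_map[16];
        uint8_t  hw;
        uint16_t count[16];
        uint16_t offset[16];
};

/* Hypothetical helper: pick a tx queue inside the range owned by
 * traffic class tc. flow_hash is any per-flow hash the caller has. */
static uint16_t sketch_pick_queue(const struct sketch_mclass_qopt *opt,
                                  uint8_t tc, uint32_t flow_hash)
{
        assert(opt && tc < opt->num_tc && opt->count[tc] > 0);
        return (uint16_t)(opt->offset[tc] + flow_hash % opt->count[tc]);
}
```

With two classes of four queues each (offsets 0 and 4), class 1 only ever
selects queues 4 through 7.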

The hw bit determines if the hardware should configure the count
and offset values. If the hardware bit is set then the operation
will fail if the hardware does not implement the ndo_setup_tc
operation. This is to avoid undetermined states where the hardware
may or may not control the queue mapping. Also minimal bounds
checking is done on the count/offset to verify a queue does not
exceed num_tx_queues and that queue ranges do not overlap. Otherwise
it is left to user policy or hardware configuration to create
useful mappings.
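
The bounds checking described above could look roughly like the following
sketch. The function name is hypothetical, and rejecting a class with zero
queues is an assumption of this example rather than something stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_TC 16

/* Hypothetical check: every class range must fit inside num_tx_queues
 * and no two class ranges may share a queue. */
static bool sketch_ranges_valid(uint8_t num_tc, const uint16_t *count,
                                const uint16_t *offset,
                                unsigned int num_tx_queues)
{
        assert(count && offset);
        if (num_tc > SKETCH_MAX_TC)
                return false;

        for (unsigned int i = 0; i < num_tc; i++) {
                unsigned int first = offset[i];
                unsigned int end = first + count[i];  /* one past the last queue */

                /* Assumption: an empty class is treated as invalid. */
                if (count[i] == 0 || end > num_tx_queues)
                        return false;

                for (unsigned int j = i + 1; j < num_tc; j++) {
                        unsigned int other_first = offset[j];
                        unsigned int other_end = other_first + count[j];

                        /* Two half-open ranges overlap if each starts
                         * before the other ends. */
                        if (first < other_end && other_first < end)
                                return false;
                }
        }
        return true;
}
```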

It is expected that hardware QOS schemes can be implemented by
creating appropriate mappings of queues in ndo_setup_tc().

One expected use case is that drivers will use ndo_setup_tc to map
queue ranges onto 802.1Q traffic classes. This provides a generic
mechanism to map network traffic onto these traffic classes and
removes the need for lower layer drivers to know specifics about
traffic types.
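
As one purely illustrative policy a driver might apply when asked to set up
traffic classes, the sketch below splits the tx queues into contiguous,
near-equal ranges. The even split and the function name are assumptions for
the example; real hardware may impose a different layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical policy: divide num_tx_queues into num_tc contiguous
 * ranges. Leftover queues go to the lowest-numbered classes. */
static void sketch_even_split(uint8_t num_tc, uint16_t num_tx_queues,
                              uint16_t *count, uint16_t *offset)
{
        assert(count && offset);
        assert(num_tc > 0 && num_tc <= 16 && num_tx_queues >= num_tc);

        uint16_t base = num_tx_queues / num_tc;
        uint16_t extra = num_tx_queues % num_tc;
        uint16_t next = 0;

        for (uint8_t tc = 0; tc < num_tc; tc++) {
                count[tc] = base + (tc < extra ? 1 : 0);
                offset[tc] = next;
                next += count[tc];
        }
}
```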

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/netdevice.h |    3 
 include/linux/pkt_sched.h |    9 +
 net/sched/Kconfig         |   10 +
 net/sched/Makefile        |    1 
 net/sched/sch_generic.c   |    4 +
 net/sched/sch_mqprio.c    |  357 +++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 384 insertions(+), 0 ...
From: Eric Dumazet
Date: Monday, January 3, 2011 - 11:46 pm

I understand this code already exists in mq; I just want to note that
some qdiscs update their stats in their dump() subroutine, because their
enqueue()/dequeue() don't update all fields.

We might eventually add a gather_stats() method, to get rid of all the
oddities we currently have with 0 backlogs (or qlen) here and there ;)

For example, I am not even sure qdisc->qstats.qlen should not be
replaced by qdisc->qstats.qlen in your loop, as done in
mqprio_dump_class_stats()

Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>


--

From: Jarek Poplawski
Date: Tuesday, January 4, 2011 - 4:18 am

I'm not sure why this unsetting is needed in case it was set by a




Did you give up those stats per tc class? You only show the leaf
classes here, but you could first loop per num_tc (as virtual
parent classes). So in dump_class_stats you should be able to
distinguish class 'level' by cl and do the second loop if necessary.
To show the class hierarchy you change tcm_parent in dump_class
for 'leaf' classes (like eg in sch_htb/htb_dump_class).

--

From: John Fastabend
Date: Tuesday, January 4, 2011 - 11:16 am

I did give up... but got it working with your hint in v5 thanks.

John.
--

From: jamal
Date: Tuesday, January 4, 2011 - 5:32 am

On Mon, 2011-01-03 at 19:05 -0800, John Fastabend wrote:

This seems very hardware specific,
i.e. it may be true for 802.1Q (or 802.1Q-like) hardware and therefore
your qdisc.
There are people with hardware that has thousands of hardware
flow queues (or at least gives that impression). Maybe a naming
convention prefixed with 8021Q would be nicer, or moving it into
your qdisc infrastructure and exporting it so drivers needing it can use it.

cheers,
jamal

--
