[PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition

From: PJ Waskiewicz
Date: Tuesday, May 27, 2008 - 7:13 am

This patchset adds the initial DCB generic netlink interface to the kernel.
It adds the layer as a generic interface for any DCB-capable device through
the netdevice.

This patchset also includes an implementation using this interface in the
ixgbe driver.  It adds the hardware-specific code to turn the interface on,
and includes the netlink callbacks in the driver to perform the requested
operations.

These patches are targeted at the net-next-2.6 tree, for 2.6.27.  The patch
series is as follows:

patch 1: DCB netlink interface in-kernel
patch 2: ixgbe DCB hardware-specific patches
patch 3: enable DCB in ixgbe

Thanks,
-- 
PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
--

From: PJ Waskiewicz
Date: Tuesday, May 27, 2008 - 7:13 am

This patch enables DCB support for 82598.  DCB is a technology using the
802.1Qaz and 802.1Qbb IEEE standards for priority grouping and priority
flow control.  This is useful when trying to allow different types of
traffic on separate flows to be paused and handled with different
priorities across the network without impacting other flows on the same
links.  A target traffic type for this technology is to provide flow
control for Fibre Channel over Ethernet, while not impacting other LAN
traffic flows on the link.

This is a respin of previous patches posted for this.  These new patches
now use the DCBNL netlink interface in the kernel to communicate with
userspace.  The userspace utilities to control this are in the process of
being posted to SourceForge, and should be available in the very near
future.

This is based on the net-next-2.6 tree.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 drivers/net/ixgbe/Makefile        |    2 
 drivers/net/ixgbe/ixgbe.h         |   22 +
 drivers/net/ixgbe/ixgbe_ethtool.c |   36 ++
 drivers/net/ixgbe/ixgbe_main.c    |  577 +++++++++++++++++++++++++++++++++----
 4 files changed, 576 insertions(+), 61 deletions(-)

diff --git a/drivers/net/ixgbe/Makefile b/drivers/net/ixgbe/Makefile
index ccd83d9..20b37cc 100644
--- a/drivers/net/ixgbe/Makefile
+++ b/drivers/net/ixgbe/Makefile
@@ -33,4 +33,4 @@
 obj-$(CONFIG_IXGBE) += ixgbe.o
 
 ixgbe-objs := ixgbe_main.o ixgbe_common.o ixgbe_ethtool.o \
-              ixgbe_82598.o ixgbe_phy.o
+              ixgbe_82598.o ixgbe_phy.o ixgbe_dcb.o ixgbe_dcb_82598.o
diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index d981134..145421f 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -35,6 +35,7 @@
 
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
+#include "ixgbe_dcb.h"
 
 #ifdef CONFIG_DCA
 #include <linux/dca.h>
@@ -98,6 +99,7 @@
 #define IXGBE_TX_FLAGS_TSO		(u32)(1 << 2)
 #define IXGBE_TX_FLAGS_IPV4		(u32)(1 ...
From: PJ Waskiewicz
Date: Tuesday, May 27, 2008 - 7:13 am

This patch adds the netlink interface definition for Data Center Bridging.
This technology uses 802.1Qaz and 802.1Qbb to extend Ethernet so that
different traffic types, e.g. Fibre Channel over Ethernet and regular LAN
traffic, can converge on a single link.  The goal is to use priority flow
control to pause individual flows at the MAC/network level, without
impacting other network flows.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/dcbnl.h     |  241 +++++++++++++++
 include/linux/netdevice.h |    8 
 net/Kconfig               |    1 
 net/Makefile              |    3 
 net/dcb/Kconfig           |   12 +
 net/dcb/Makefile          |    1 
 net/dcb/dcbnl.c           |  722 +++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 988 insertions(+), 0 deletions(-)

diff --git a/include/linux/dcbnl.h b/include/linux/dcbnl.h
new file mode 100644
index 0000000..db50f6c
--- /dev/null
+++ b/include/linux/dcbnl.h
@@ -0,0 +1,241 @@
+#ifndef __LINUX_DCBNL_H__
+#define __LINUX_DCBNL_H__
+/*
+ * Data Center Bridging (DCB) netlink header
+ *
+ * Copyright 2008, Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
+ */
+
+#define DCB_PROTO_VERSION 1
+
+/**
+ * enum dcbnl_commands - supported DCB commands
+ *
+ * @DCB_CMD_UNDEFINED: unspecified command to catch errors
+ * @DCB_CMD_GSTATE: request the state of DCB in the device
+ * @DCB_CMD_SSTATE: set the state of DCB in the device
+ * @DCB_CMD_PGTX_GCFG: request the priority group configuration for Tx
+ * @DCB_CMD_PGTX_SCFG: set the priority group configuration for Tx
+ * @DCB_CMD_PGRX_GCFG: request the priority group configuration for Rx
+ * @DCB_CMD_PGRX_SCFG: set the priority group configuration for Rx
+ * @DCB_CMD_PFC_GCFG: request the priority flow control configuration
+ * @DCB_CMD_PFC_SCFG: set the priority flow control configuration
+ * @DCB_CMD_SET_ALL: apply all changes to the underlying device
+ * @DCB_CMD_GPERM_HWADDR: get the permanent MAC address of the ...
From: Thomas Graf
Date: Wednesday, May 28, 2008 - 2:41 am

Is there a specific reason why you used a separate generic netlink
interface instead of embedding this into the regular link message
via either IFLA_DCB or by using the info API?
--

From: Waskiewicz Jr, Peter P
Date: Wednesday, May 28, 2008 - 9:03 am

There are four reasons we decided to use a separate interface:

1. The netlink messages are generated via userspace when the connection
is set up, plus they're generated from LLDP frames coming in off the
wire.  Those LLDP frames implement the DCBX protocol (Data Center
Bridging Exchange), which is the negotiation protocol between a DCB
device and its link partner.  In most cases, it's a DCB-compliant
switch, like a Cisco Nexus 5000.  So the messages can come out of band
depending on how the network gets configured, and if any events occur
causing the bandwidth credits or priority mappings to change (think
automated backups at night, wanting more bandwidth than during the day).

2. The DCBX protocol is being extended to contain more information, and
second generation DCB devices have more configuration data for the
network.  So we wanted an interface that could be extended on its own to
support the new DCB protocols as they're ratified and implemented in new
equipment, without impacting existing infrastructure.

3. We wanted to use generic netlink, since that seems to be a more
preferred method of netlink communication vs. rtnetlink.  And I don't
know anything about the info API, so I can't comment on why we didn't
look at that for implementation.  Can you suggest something for me to
look at for the info API so I can see what that's all about?

4. We also developed the userspace utilities for the Linux OSVs, which
should be having a pre-release "release" on SourceForge in the next week
or so, to support the DCBX protocol.  They're implemented using the
generic netlink interface, so obviously if we can keep it that way, it'd
be preferred.  :-)

Thanks Thomas.  Other than that, is there anything in the netlink
interface that you would suggest to change?

Cheers,
-PJ Waskiewicz
--

From: Thomas Graf
Date: Wednesday, May 28, 2008 - 3:37 pm

There isn't much difference really, instead of using the separate
interface you could simply add a new link attribute IFLA_DCB and issue
a RTM_SETLINK/RTM_GETLINK and send the same information in the same
format. However, I agree with you that a separate interface is better
in this case as dcb requests are not directly connected to other link


Looks good from here, I didn't read it all line by line though.
--

From: Waskiewicz Jr, Peter P
Date: Sunday, June 1, 2008 - 5:16 am

Thanks Thomas for the review and comments.

Dave and Jeff, have you two taken a peek at this by chance?

Thanks,
-PJ Waskiewicz
--

From: Patrick McHardy
Date: Thursday, June 5, 2008 - 6:17 am

For these and the other numbered attributes: is the maximum number
fixed and/or defined somewhere? If not, I'd suggest to use lists

And in this case lists of nested attributes consisting of

"getpermhwaddr" doesn't seem to belong in this interface but

^^^ kfree_skb



The fact that you do this in every handler makes me wonder whether
rtnetlink wouldn't be the better choice, if only because it uses
the rtnl_mutex and configuration changes are thus serialized with
other networking configuration changes.

For example I don't see anything preventing concurrent changes
to the DCB configuration while it is copied between the temporary
configuration and the real one. In one case it's done in a
path holding the rtnl_mutex, in another case it's done while
holding the genl_mutex in a genetlink callback.
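
A minimal userspace sketch of the concern above, assuming a pthread
mutex stands in for rtnl_mutex: every path that touches the staged or
the live configuration takes the same lock, so the copy cannot
interleave with a concurrent change.  The struct dcb_cfg layout and the
names dcb_stage_pg_bw() and dcb_apply_all() are hypothetical and are
not taken from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical DCB configuration; field names are illustrative only. */
struct dcb_cfg {
	unsigned char pg_bw_pct[8];   /* bandwidth percent per priority group */
	unsigned char pfc_enabled[8]; /* per-priority flow control on/off */
};

static struct dcb_cfg temp_cfg; /* staged by "set" requests */
static struct dcb_cfg real_cfg; /* what the device would be programmed with */

/* One lock for every path that touches either copy. */
static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stage a bandwidth change in the temporary configuration. */
static void dcb_stage_pg_bw(unsigned int pg, unsigned char pct)
{
	assert(pg < 8);
	pthread_mutex_lock(&cfg_lock);
	temp_cfg.pg_bw_pct[pg] = pct;
	pthread_mutex_unlock(&cfg_lock);
}

/* Commit: copy temp -> real while holding the same lock as the stagers. */
static void dcb_apply_all(void)
{
	pthread_mutex_lock(&cfg_lock);
	memcpy(&real_cfg, &temp_cfg, sizeof(real_cfg));
	pthread_mutex_unlock(&cfg_lock);
}
```

If the stage path and the commit path each took a different lock, as
described for the genetlink and rtnl paths, the memcpy() could observe a
half-updated temp_cfg.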

--

From: Waskiewicz Jr, Peter P
Date: Monday, June 9, 2008 - 3:11 pm

I'll see what I can come up with when I move this implementation to

This was a feature in ethtool, but it was removed at some point.  I can


When we wrote this, we didn't know enough about rtnetlink to know what
to choose.  The general trend seemed to be moving towards genetlink for
subsystem changes in the kernel, so we chose that.  But this
serialization issue is a nice catch, and I think we'll move this
implementation to use rtnetlink.

Cheers,
-PJ Waskiewicz
--

From: Patrick McHardy
Date: Tuesday, June 10, 2008 - 12:14 am

I think that's a better idea than to put it in a private
driver interface.


--

From: PJ Waskiewicz
Date: Tuesday, May 27, 2008 - 7:13 am

This patch adds the necessary hardware initialization code for 82598 to
support Data Center Bridging.  The code takes care of bandwidth credit
calculations for the hardware arbiters, priority grouping methods, and
all the hardware accesses to enable the features in 82598.

This is based on the net-next-2.6 tree.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 drivers/net/ixgbe/ixgbe_dcb.c       |  330 +++++++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_dcb.h       |  168 +++++++++++++++
 drivers/net/ixgbe/ixgbe_dcb_82598.c |  400 +++++++++++++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_dcb_82598.h |   98 +++++++++
 4 files changed, 996 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_dcb.c b/drivers/net/ixgbe/ixgbe_dcb.c
new file mode 100644
index 0000000..11be2b8
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_dcb.c
@@ -0,0 +1,330 @@
+/*******************************************************************************
+
+  Intel 10 Gigabit PCI Express Linux driver
+  Copyright(c) 1999 - 2008 Intel Corporation.
+
+  This program is free software; you can redistribute it and/or modify it
+  under the terms and conditions of the GNU General Public License,
+  version 2, as published by the Free Software Foundation.
+
+  This program is distributed in the hope it will be useful, but WITHOUT
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+  more details.
+
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc.,
+  51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+  The full GNU General Public License is included in this distribution in
+  the file called "COPYING".
+
+  Contact Information:
+  Linux NICS <linux.nics@intel.com>
+  e1000-devel Mailing List <e1000-devel@lists.sourceforge.net>
+  Intel Corporation, 5200 ...
From: David Miller
Date: Wednesday, June 4, 2008 - 11:44 am

From: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>

Overall the changes look OK.  In particular the netlink implementation
looks clean.

However we need to think about how this stuff overlaps with existing
'tc' facilities.  For example, what we really need to do here is
define this generic DCB interface such that it normally just sits on
top of a software scheduler layer implementation and therefore there
are always non-NULL DCB ops to invoke.

If there is a device that can implement this in hardware, that's
fine and we define some interface for invoking that.

Because of that, the netdevice is likely not the correct place for the
ops (the only actual ugly part of the patches in my opinion).

I'm still travelling heavily, which is why I haven't responded to
this earlier.  I ask that you express some understanding about this as
there is really nothing I can do to review these kinds of important
changes properly when I am changing 10 timezones every other day.

Besides we're still in bug fix phase, so nothing I say will get this
upstream into Linus's tree any faster, and we really need to get
something like this right because it will be hard to undo this
afterwards if we get it wrong.
--

From: Waskiewicz Jr, Peter P
Date: Wednesday, June 4, 2008 - 11:23 pm

I'm not sure I follow this.  DCB is a scheduling policy, but that
scheduling policy is in the hardware.  The configuration interface,
which is what this is, happens out of band of any scheduling policies in
the kernel.  It's very analogous to the wireless configuration layer for

I agree that having this in the netdevice isn't great.  But given the
nature of how the hardware can be configured (via a user on the host, or
via messages from a switch to the userspace utilities), I think it needs
to be in there.  It's also the only common place where userspace can
enumerate an ethernet device; tc is another way, but the
ethernet device is used, along with the qdisc, which is part of the
netdevice.  But I am certainly all ears for any suggestions to not have

I can certainly sympathize; I've been travelling in Israel this past
week, and will be getting back to Portland on Friday.  Having a 10-hour

Totally understood.  My goal though is to make sure any
feedback/suggestions can be digested prior to the merge window, since
getting any good review during the merge window is next to impossible,
given the number of patches flying in that have already been queued.  I
just want to make sure I'm prepared for a good submission by the time
the merge window opens.

Thanks Dave.  Any guidance from you, or anyone else, as to how I can get
this into better shape for acceptance, I'm all ears.

Cheers,
-PJ Waskiewicz
--

From: David Miller
Date: Thursday, June 5, 2008 - 7:43 am

From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>

And I'm saying we should have an equivalent software scheduler in
the kernel that can implement this if the hardware offloaded version
isn't present.

It overlaps existing functionality to a certain extent, and there is
no real reason for that overlap to exist.  The question is which
(the existing facilities or the new one) subsumes which.
--

From: Thomas Graf
Date: Thursday, June 5, 2008 - 1:29 pm

[Empty message]
From: Waskiewicz Jr, Peter P
Date: Tuesday, June 10, 2008 - 12:55 pm

I really don't think this is something that would work in software.  I
agree that having a bandwidth grouping like 802.1Qaz would be somewhat
useful, but that's the only piece of DCB that would work in software.
And you can achieve the same behavior using sch_prio with cbq or htb on
the nodes, minus the full link aggregation.
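
As a rough illustration of the bandwidth grouping mentioned above, the
sketch below checks a set of per-group percentages and converts one
percentage into a guaranteed share of the link.  It assumes that the
percentages of all priority groups must add up to exactly 100; the
names pg_bw_valid() and pg_share_mbps() are hypothetical and do not
come from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the per-group percentages sum to exactly 100, else 0. */
static int pg_bw_valid(const unsigned int pct[], size_t n)
{
	unsigned int sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += pct[i];
	return sum == 100;
}

/* Guaranteed share of one group, in Mbit/s, on a link of link_mbps. */
static unsigned int pg_share_mbps(unsigned int link_mbps, unsigned int pct)
{
	assert(pct <= 100);
	/* Widen before multiplying so large link rates cannot overflow. */
	return (unsigned int)((unsigned long long)link_mbps * pct / 100);
}
```

For example, with groups at 50, 30 and 20 percent on a 10000 Mbit/s
link, pg_share_mbps(10000, 30) gives the second group 3000 Mbit/s.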

The 802.1Qbb, per-priority pause (flow control), cannot work in a
software implementation.  This is a new flow control frame processed by
the MAC for each priority on the link.  Also, the Rx filtering can't be
emulated in software either.  The MAC filters on VLAN priority.  I know
that can be configured with vconfig and set_ingress_map, but the whole
point of the technology is to have the Rx processing done in the
hardware's packet buffers, much like RSS filtering.
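
To make the per-priority pause frame more concrete, here is a sketch of
how its payload could be laid out.  It assumes the format as later
standardized for priority flow control: a MAC control frame (EtherType
0x8808) with opcode 0x0101, a 16-bit priority-enable vector, and eight
16-bit pause timers counted in quanta of 512 bit times.  The helper
names put_be16() and pfc_build_payload() are hypothetical; in real
hardware the MAC generates and consumes these frames itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_CONTROL_ETHERTYPE 0x8808
#define PFC_OPCODE            0x0101
#define PFC_PAYLOAD_LEN       (2 + 2 + 8 * 2) /* opcode + vector + 8 timers */

/* Store a 16-bit value in network (big-endian) byte order. */
static void put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

/*
 * Fill buf with the bytes that follow the EtherType.
 * enable_mask: bit n set means the timer for priority n is valid.
 * quanta[n]:   pause time for priority n; 0 asks the peer to resume.
 * Returns the number of bytes written.
 */
static size_t pfc_build_payload(uint8_t *buf, uint8_t enable_mask,
				const uint16_t quanta[8])
{
	assert(buf != NULL && quanta != NULL);
	put_be16(buf, PFC_OPCODE);
	put_be16(buf + 2, enable_mask);
	for (int prio = 0; prio < 8; prio++) {
		/* Timers of priorities that are not enabled are sent as zero. */
		uint16_t t = (enable_mask & (1u << prio)) ? quanta[prio] : 0;

		put_be16(buf + 4 + 2 * prio, t);
	}
	return PFC_PAYLOAD_LEN;
}
```

Pausing only priority 3 means setting bit 3 in enable_mask and a
non-zero quanta[3]; the other seven priorities keep flowing.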

This technology really is a hardware-based technology.  The piece that
we could effectively implement in software is the Tx priority grouping.
But that's a small piece of the whole technology.  All we're trying to
do is provide the method of configuring the hardware for this
technology, which is closely coupled with the FCoE work going on in the
SCSI world.  I don't think anyone would benefit trying to emulate it in
software, since everything we can implement in the software can already

The existing facilities for traffic shaping and bandwidth aggregation do
overlap if you loaded those qdiscs with a DCB device.  But I don't think
the existing qdiscs should be removed or modified; the two technologies
are too different, in my opinion, to be combined in software.

Thanks Dave,
-PJ Waskiewicz
--

From: David Miller
Date: Tuesday, June 10, 2008 - 1:07 pm

From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>


This is a straw man, please don't use arguments like that.

Saying that it can't be done at all in software, but then saying
"well, it sort of can be done, but the point is to do it in hardware"
side-steps the very reason I want you to implement a software variant

This sounds like another way of saying "having a software
implementation of even some of this facility would compromise
the value of our hardware implementation."

That's not the kind of decision making process we use when
deciding how to implement things in the kernel.
--

From: Thomas Graf
Date: Wednesday, June 11, 2008 - 10:51 am

Everything is possible in software as long as the hardware doesn't hide
the congestion information. It would be very useful to pass congestion
information received by 802.1Qau frames to the kernel for use when
selecting the nexthop or for the routing daemon to make decisions on.
So far we could only react to link states, now we could actually react to
link congestion on the routing layer.

There is no doubt that doing the prioritization in hardware is much
preferred but we should try and integrate it with other tc techniques.
For example, it would be great if we could control DCB via skb->tc_index
if that is possible. It would allow defining DCB traffic classes with the
rich features of existing classifiers. I've seen there is a mapping
functionality although I haven't found any documentation on how to use
it exactly.

Another area of interest is sending congestion frames on our own. We
could finally implement real ingress software shaping and turn every
Linux system into a DCB-capable node.
--

From: Patrick McHardy
Date: Wednesday, June 11, 2008 - 10:50 am

There was a qdisc submission for a scheduler called "pace" (IIRC)
that did this. It needed some cleanups before merging, but nothing
grave.


--

From: Thomas Graf
Date: Wednesday, June 11, 2008 - 2:28 pm

Do you know if it used congestion notification at the link level? I can't
seem to find the posting.
--

From: Patrick McHardy
Date: Thursday, June 12, 2008 - 3:17 am

They're using PAUSE frames. The latest submission I could find is:

http://marc.info/?t=119625135300006&r=1&w=2
--

From: Waskiewicz Jr, Peter P
Date: Wednesday, June 11, 2008 - 11:28 am

> Everything is possible in software as long as the hardware doesn't

Congestion notification in 802.1Qau is certainly something we need to
support somewhere in the stack.  I was actually talking with one of our
hardware architects while I was in Israel last week about that exact
gap, since the BCN/QCN rate limiting will eventually drop packets if we
don't have a way of telling the upper layers to "slow down."  The
notification mechanism is also needed for 802.1Qbb, since the whole
point of the priority flow control is to provide a no-drop mechanism for
things like FCoE.  But if the upper layers (e.g. FCoE stack) don't know
to pause when the network is too congested, frames will be dropped,
which is bad.

802.1Qau is still being defined in IEEE unfortunately, and we and others
have no hardware that supports it to test the congestion notification
tag processing.  But it is something on our radar that needs to be

The prioritization is only one piece.  The bandwidth aggregation,
different modes of defining group strict vs. link strict priorities
within a bandwidth group, etc., are all hardware modes.  These modes
need to be in sync with the link partner (switch, back to back NIC), and

Once 802.1Qau is defined, and IEEE decides to use BCN or QCN, I think
this is a great direction to go in.  Right now the congestion
notification stuff is too up in the air to latch onto unfortunately.

Thanks for the comments Thomas,

-PJ Waskiewicz
--

From: Thomas Graf
Date: Wednesday, June 11, 2008 - 2:26 pm

That's already possible by calling netif_stop_queue() or, for a single
subqueue, netif_stop_subqueue(). Much more important for the upper layers
to know is when congestion is about to happen so it can be avoided

Again, this piece of hardware basically implements a classful qdisc
like htb or cbq except that it's limited to a flat tree. Yet, the
capabilities of the hardware may not be sufficient, therefore it must
be possible to combine hardware shaping with software qdiscs. It is
therefore crucial to find common grounds to exchange traffic class
information. Is your piece of hardware strictly limited to map VLANs
to traffic classes or would it be possible to attach traffic class
information to the packet in some way? (skb->tc_index) Could you
elaborate on how classification works, especially the configurable
mapping?
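
For reference while discussing the mapping, a small sketch of the VLAN
side of such a classification.  The bit layout is the standard 802.1Q
one: the top three bits of the 16-bit tag control field carry the
priority.  The prio_to_tc table and the function names are hypothetical;
how the hardware actually maps priorities to traffic classes is exactly
the open question here.

```c
#include <assert.h>
#include <stdint.h>

#define VLAN_PRIO_SHIFT 13
#define VLAN_PRIO_MASK  0x7

/* Extract the 3-bit priority from a host-order 802.1Q tag control field. */
static unsigned int vlan_tci_prio(uint16_t tci)
{
	return (tci >> VLAN_PRIO_SHIFT) & VLAN_PRIO_MASK;
}

/* Hypothetical mapping: eight priorities folded onto four traffic classes. */
static const uint8_t prio_to_tc[8] = {0, 0, 1, 1, 2, 2, 3, 3};

/* Look up the traffic class for a tagged frame from its priority bits. */
static unsigned int classify_tci(uint16_t tci)
{
	unsigned int prio = vlan_tci_prio(tci);

	assert(prio < 8);
	return prio_to_tc[prio];
}
```

A software classifier could store the same class number in
skb->tc_index, which would give hardware and software a shared notion of
traffic class.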

The most difficult part for me, and probably others as well, is that there
are no public documents available yet which would describe the direction
in which this is going and how complete the current implementation actually
is. We're pretty much looking at a black box which doesn't work very well
with the existing architecture and requires a completely separate
configuration interface. If we merge a configuration interface for DCB
now it will be pretty much written in stone, yet we have no idea what
other vendors may need.
--
