This patchset adds the initial DCB generic netlink interface to the kernel. It adds the layer as a generic interface for any DCB-capable device through the netdevice.

This patchset also includes an implementation using this interface in the ixgbe driver. It adds the hardware-specific code to turn the interface on, and includes the netlink callbacks in the driver to perform the requested operations.

These patches are targeted at the net-next-2.6 tree, for 2.6.27.

The patch series is as follows:

patch 1: DCB netlink interface in-kernel
patch 2: ixgbe DCB hardware-specific patches
patch 3: enable DCB in ixgbe

Thanks,
--
PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
--
This patch enables DCB support for 82598. DCB is a technology using the 802.1Qaz and 802.1Qbb IEEE standards for priority grouping and priority flow control. It allows different types of traffic on separate flows to be paused and handled with different priorities across the network without impacting other flows on the same links. A target use for this technology is providing flow control for Fibre Channel over Ethernet without impacting other LAN traffic flows on the link.

This is a respin of previous patches posted for this. These new patches now use the DCBNL netlink interface in the kernel to communicate with userspace. The userspace utilities to control this are in the process of being posted to SourceForge, and should be available in the very near future.

This is based on the net-next-2.6 tree.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---
 drivers/net/ixgbe/Makefile        |    2
 drivers/net/ixgbe/ixgbe.h         |   22 +
 drivers/net/ixgbe/ixgbe_ethtool.c |   36 ++
 drivers/net/ixgbe/ixgbe_main.c    |  577 +++++++++++++++++++++++++++++++++----
 4 files changed, 576 insertions(+), 61 deletions(-)

diff --git a/drivers/net/ixgbe/Makefile b/drivers/net/ixgbe/Makefile
index ccd83d9..20b37cc 100644
--- a/drivers/net/ixgbe/Makefile
+++ b/drivers/net/ixgbe/Makefile
@@ -33,4 +33,4 @@ obj-$(CONFIG_IXGBE) += ixgbe.o ixgbe-objs := ixgbe_main.o ixgbe_common.o ixgbe_ethtool.o \ - ixgbe_82598.o ixgbe_phy.o + ixgbe_82598.o ixgbe_phy.o ixgbe_dcb.o ixgbe_dcb_82598.o
diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index d981134..145421f 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -35,6 +35,7 @@ #include "ixgbe_type.h" #include "ixgbe_common.h" +#include "ixgbe_dcb.h" #ifdef CONFIG_DCA #include <linux/dca.h>
@@ -98,6 +99,7 @@ #define IXGBE_TX_FLAGS_TSO (u32)(1 << 2) #define IXGBE_TX_FLAGS_IPV4 (u32)(1 ...
This patch adds the netlink interface definition for Data Center Bridging. This technology uses 802.1Qaz and 802.1Qbb to extend Ethernet so that different traffic types, e.g. Fibre Channel over Ethernet and regular LAN traffic, can converge on a single link. The goal is to use priority flow control to pause individual flows at the MAC/network level, without impacting other network flows.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---
 include/linux/dcbnl.h     |  241 +++++++++++++++
 include/linux/netdevice.h |    8
 net/Kconfig               |    1
 net/Makefile              |    3
 net/dcb/Kconfig           |   12 +
 net/dcb/Makefile          |    1
 net/dcb/dcbnl.c           |  722 +++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 988 insertions(+), 0 deletions(-)

diff --git a/include/linux/dcbnl.h b/include/linux/dcbnl.h
new file mode 100644
index 0000000..db50f6c
--- /dev/null
+++ b/include/linux/dcbnl.h
@@ -0,0 +1,241 @@
+#ifndef __LINUX_DCBNL_H__
+#define __LINUX_DCBNL_H__
+/*
+ * Data Center Bridging (DCB) netlink header
+ *
+ * Copyright 2008, Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
+ */
+
+#define DCB_PROTO_VERSION 1
+
+/**
+ * enum dcbnl_commands - supported DCB commands
+ *
+ * @DCB_CMD_UNDEFINED: unspecified command to catch errors
+ * @DCB_CMD_GSTATE: request the state of DCB in the device
+ * @DCB_CMD_SSTATE: set the state of DCB in the device
+ * @DCB_CMD_PGTX_GCFG: request the priority group configuration for Tx
+ * @DCB_CMD_PGTX_SCFG: set the priority group configuration for Tx
+ * @DCB_CMD_PGRX_GCFG: request the priority group configuration for Rx
+ * @DCB_CMD_PGRX_SCFG: set the priority group configuration for Rx
+ * @DCB_CMD_PFC_GCFG: request the priority flow control configuration
+ * @DCB_CMD_PFC_SCFG: set the priority flow control configuration
+ * @DCB_CMD_SET_ALL: apply all changes to the underlying device
+ * @DCB_CMD_GPERM_HWADDR: get the permanent MAC address of the ...
Is there a specific reason why you used a separate generic netlink interface instead of embedding this into the regular link message via either IFLA_DCB or by using the info API? --
There are four reasons we decided to use a separate interface:

1. The netlink messages are generated from userspace when the connection is set up, and they are also generated from LLDP frames coming in off the wire. Those LLDP frames implement the DCBX protocol (Data Center Bridging Exchange), which is the negotiation protocol between a DCB device and its link partner. In most cases, the link partner is a DCB-compliant switch, like a Cisco Nexus 5000. So the messages can come out of band depending on how the network gets configured, and whenever events occur that cause the bandwidth credits or priority mappings to change (think automated backups at night, wanting more bandwidth than during the day).

2. The DCBX protocol is being extended to contain more information, and second generation DCB devices have more configuration data for the network. So we wanted an interface that could be extended on its own to support the new DCB protocols as they're ratified and implemented in new equipment, without impacting existing infrastructure.

3. We wanted to use generic netlink, since that seems to be the preferred method of netlink communication vs. rtnetlink. I don't know anything about the info API, so I can't comment on why we didn't look at that for implementation. Can you suggest something for me to look at for the info API so I can see what that's all about?

4. We also developed the userspace utilities for the Linux OSVs, which should have a pre-release on SourceForge in the next week or so, to support the DCBX protocol. They're implemented using the generic netlink interface, so obviously if we can keep it that way, it'd be preferred. :-)

Thanks Thomas. Other than that, is there anything in the netlink interface that you would suggest changing?

Cheers,
-PJ Waskiewicz
--
There isn't much difference really. Instead of using the separate interface, you could simply add a new link attribute IFLA_DCB, issue an RTM_SETLINK/RTM_GETLINK, and send the same information in the same format.

However, I agree with you that a separate interface is better in this case as dcb requests are not directly connected to other link

Looks good from here, I didn't read it all line by line though.
--
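To make the alternative discussed above a little more concrete, here is a minimal userspace sketch of how a DCB state value could be packed as a single link attribute in the type-length-value layout that netlink attributes use. It is illustrative only: `IFLA_DCB_SKETCH` and its numeric value, `struct attr_hdr`, and `put_attr()` are names made up for this example (the struct merely mirrors the 16-bit length plus 16-bit type header with 4-byte padding), and no such link attribute existed in the tree at the time of this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a netlink attribute header: length, type, then payload. */
struct attr_hdr {
	uint16_t len;   /* header + payload, excluding trailing padding */
	uint16_t type;
};

#define ATTR_ALIGNTO  ((size_t)4)
#define ATTR_ALIGN(n) (((n) + ATTR_ALIGNTO - 1) & ~(ATTR_ALIGNTO - 1))

/* Hypothetical attribute type; NOT a real IFLA_* value. */
#define IFLA_DCB_SKETCH 100

/*
 * Append one attribute at offset 'off' in 'buf' (capacity 'cap').
 * Returns the offset just past the padded attribute, or 0 if it
 * does not fit.
 */
size_t put_attr(uint8_t *buf, size_t cap, size_t off,
		uint16_t type, const void *data, uint16_t dlen)
{
	struct attr_hdr hdr;
	size_t total = ATTR_ALIGN(sizeof(hdr) + dlen);

	assert(buf != NULL && data != NULL);
	if (off > cap || total > cap - off)
		return 0;

	hdr.len = (uint16_t)(sizeof(hdr) + dlen);
	hdr.type = type;
	memset(buf + off, 0, total);                 /* zero the padding */
	memcpy(buf + off, &hdr, sizeof(hdr));        /* TLV header */
	memcpy(buf + off + sizeof(hdr), data, dlen); /* payload */
	return off + total;
}
```

A caller would pass a one-byte state value (1 for enabled, say) and place the result in the attribute area of a link message; a real implementation would use the kernel's own netlink attribute helpers rather than hand-rolled packing.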
Thanks Thomas for the review and comments. Dave and Jeff, have you two taken a peek at this by chance? Thanks, -PJ Waskiewicz --
For these and the other numbered attributes: is the maximum number fixed and/or defined somewhere? If not, I'd suggest to use lists

And in this case lists of nested attributes consisting of

"getpermhwaddr" doesn't seem to belong in this interface but

^^^ kfree_skb

The fact that you do this in every handler makes me wonder whether rtnetlink wouldn't be the better choice, if only because it uses the rtnl_mutex and configuration changes are thus serialized with other networking configuration changes.

For example, I don't see anything preventing concurrent changes to the DCB configuration while it is copied between the temporary configuration and the real one. In one case it's done in a path holding the rtnl_mutex, in another case it's done while holding the genl_mutex in a genetlink callback.
--
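The serialization concern raised above can be illustrated in userspace. The following is a minimal sketch, not the patch's code: `struct dcb_cfg`, `dcb_stage()`, `dcb_commit()` and `dcb_live_pct()` are hypothetical names, and a pthread mutex stands in for the rtnl_mutex. The only point it makes is that staging changes in the temporary configuration and copying it to the live one have to take the same lock on every path.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define DCB_MAX_PRIO 8

/* Hypothetical configuration: one bandwidth percentage per priority. */
struct dcb_cfg {
	uint8_t bw_pct[DCB_MAX_PRIO];
};

static struct dcb_cfg temp_cfg;   /* staged by "set" requests */
static struct dcb_cfg live_cfg;   /* what the hardware would be given */

/* Stand-in for rtnl_mutex: every path touching either copy takes it. */
static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stage one change in the temporary configuration. */
void dcb_stage(unsigned int prio, uint8_t pct)
{
	assert(prio < DCB_MAX_PRIO);
	pthread_mutex_lock(&cfg_lock);
	temp_cfg.bw_pct[prio] = pct;
	pthread_mutex_unlock(&cfg_lock);
}

/*
 * Copy the staged configuration to the live one inside a single
 * critical section, so a concurrent dcb_stage() cannot interleave
 * with the copy.
 */
void dcb_commit(void)
{
	pthread_mutex_lock(&cfg_lock);
	live_cfg = temp_cfg;
	pthread_mutex_unlock(&cfg_lock);
}

/* Read one value from the live configuration under the same lock. */
uint8_t dcb_live_pct(unsigned int prio)
{
	uint8_t pct;

	assert(prio < DCB_MAX_PRIO);
	pthread_mutex_lock(&cfg_lock);
	pct = live_cfg.bw_pct[prio];
	pthread_mutex_unlock(&cfg_lock);
	return pct;
}
```

If one path used a different lock (as with genl_mutex on one side and rtnl_mutex on the other), the copy in `dcb_commit()` could observe a half-updated temporary configuration.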
I'll see what I can come up with when I move this implementation to

This was a feature in ethtool, but it was removed at some point. I can

When we wrote this, we didn't know enough about rtnetlink to know what to choose. The general trend seemed to be moving towards genetlink for subsystem changes in the kernel, so we chose that. But this serialization issue is a nice catch, and I think we'll move this implementation to use rtnetlink.

Cheers,
-PJ Waskiewicz
--
I think that's a better idea than to put it in a private driver interface. --
This patch adds the necessary hardware initialization code for 82598 to support Data Center Bridging. The code takes care of bandwidth credit calculations for the hardware arbiters, priority grouping methods, and all the hardware accesses to enable the features in 82598.

This is based on the net-next-2.6 tree.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---
 drivers/net/ixgbe/ixgbe_dcb.c       |  330 +++++++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_dcb.h       |  168 +++++++++++++++
 drivers/net/ixgbe/ixgbe_dcb_82598.c |  400 +++++++++++++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_dcb_82598.h |   98 +++++++++
 4 files changed, 996 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_dcb.c b/drivers/net/ixgbe/ixgbe_dcb.c
new file mode 100644
index 0000000..11be2b8
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_dcb.c
@@ -0,0 +1,330 @@
+/*******************************************************************************
+
+  Intel 10 Gigabit PCI Express Linux driver
+  Copyright(c) 1999 - 2008 Intel Corporation.
+
+  This program is free software; you can redistribute it and/or modify it
+  under the terms and conditions of the GNU General Public License,
+  version 2, as published by the Free Software Foundation.
+
+  This program is distributed in the hope it will be useful, but WITHOUT
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+  FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+  more details.
+
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc.,
+  51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+  The full GNU General Public License is included in this distribution in
+  the file called "COPYING".
+
+  Contact Information:
+  Linux NICS <linux.nics@intel.com>
+  e1000-devel Mailing List <e1000-devel@lists.sourceforge.net>
+  Intel Corporation, 5200 ...
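The credit calculation itself is cut off in this excerpt, so the following is only a generic sketch of what "bandwidth credit calculations for the hardware arbiters" can look like, not the ixgbe code. It assumes each priority group is given a bandwidth percentage, that credits are counted in fixed-size units (`CREDIT_QUANTUM`, an assumed 64 bytes), and that the smallest non-zero group must receive enough credit to send one maximum-sized frame. The function name `calc_credits()` and all constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CREDIT_QUANTUM 64u   /* assumed number of bytes per credit */

/*
 * Fill credits[i] for each of the n groups from its bandwidth
 * percentage pct[i]. The smallest non-zero group gets just enough
 * credit for one max_frame-byte frame; the others are scaled in
 * proportion (rounded up).
 * Returns 0 on success, -1 if the percentages do not sum to 100.
 */
int calc_credits(const uint8_t *pct, uint32_t *credits, size_t n,
		 uint32_t max_frame)
{
	uint32_t min_credit = (max_frame + CREDIT_QUANTUM - 1) / CREDIT_QUANTUM;
	uint32_t sum = 0, min_pct = 0;
	size_t i;

	assert(pct != NULL && credits != NULL);

	for (i = 0; i < n; i++) {
		sum += pct[i];
		if (pct[i] != 0 && (min_pct == 0 || pct[i] < min_pct))
			min_pct = pct[i];
	}
	if (sum != 100)
		return -1;   /* also guarantees min_pct != 0 below */

	for (i = 0; i < n; i++)
		credits[i] = (pct[i] * min_credit + min_pct - 1) / min_pct;
	return 0;
}
```

With groups at 10%, 30% and 60% and a 1518-byte maximum frame, this yields 24, 72 and 144 credits, so the refill ratio between the arbiters matches the configured bandwidth split.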
From: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>

Overall the changes look OK. In particular the netlink implementation looks clean.

However we need to think about how this stuff overlaps with existing 'tc' facilities.

For example, what we really need to do here is define this generic DCB interface such that it normally just sits on top of a software scheduler layer implementation and therefore there are always non-NULL DCB ops to invoke.

If there is a device that can implement this in hardware, that's fine and we define some interface for invoking that.

Because of that, the netdevice is likely not the correct place for the ops (the only actual ugly part of the patches in my opinion).

I'm still travelling very actively, which is why I haven't responded to this earlier. I ask that you express some understanding about this as there is really nothing I can do to review these kinds of important changes properly when I am changing 10 timezones every other day.

Besides, we're still in bug fix phase, so nothing I say will get this upstream into Linus's tree any faster, and we really need to get something like this right because it will be hard to undo this afterwards if we get it wrong.
--
I'm not sure I follow this. DCB is a scheduling policy, but that scheduling policy is in the hardware. The configuration interface, which is what this is, happens out of band of any scheduling policies in the kernel. It's very analogous to the wireless configuration layer for

I agree that having this in the netdevice isn't great. But given the nature of how the hardware can be configured (via a user on the host, or via messages from a switch to the userspace utilities), I think it needs to be in there. It's also the only common place from which userspace can enumerate an ethernet device; tc is another way, but the ethernet device is used, along with the qdisc, which is part of the netdevice. But I am certainly all ears for any suggestions to not have

I can certainly sympathize; I've been travelling in Israel this past week, and will be getting back to Portland on Friday. Having a 10-hour

Totally understood. My goal though is to make sure any feedback/suggestions can be digested prior to the merge window, since getting any good review during the merge window is next to impossible, given the number of patches flying in that have already been queued. I just want to make sure I'm prepared for a good submission by the time the merge window opens.

Thanks Dave. Any guidance from you, or anyone else, as to how I can get this into better shape for acceptance, I'm all ears.

Cheers,
-PJ Waskiewicz
--
From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>

And I'm saying we should have an equivalent software scheduler in the kernel that can implement this if the hardware offloaded version isn't present.

It overlaps existing functionality to a certain extent, and there is no real reason for that overlap to exist. The question is which (the existing facilities or the new one) subsumes which.
--
I really don't think this is something that would work in software. I agree that having a bandwidth grouping like 802.1Qaz would be somewhat useful, but that's the only piece of DCB that would work in software. And you can achieve the same behavior using sch_prio with cbq or htb on the nodes, minus the full link aggregation.

The 802.1Qbb, per-priority pause (flow control), cannot work in a software implementation. This is a new flow control frame processed by the MAC for each priority on the link.

Also, the Rx filtering can't be emulated in software either. The MAC filters on VLAN priority. I know that can be configured with vconfig and set_ingress_map, but the whole point of the technology is to have the Rx processing done in the hardware's packet buffers, much like RSS filtering.

This technology really is a hardware-based technology. The piece that we could effectively implement in software is the Tx priority grouping. But that's a small piece of the whole technology. All we're trying to do is provide the method of configuring the hardware for this technology, which is closely coupled with the FCoE work going on in the SCSI world. I don't think anyone would benefit trying to emulate it in software, since everything we can implement in the software can already

The existing facilities for traffic shaping and bandwidth aggregation do overlap if you loaded those qdiscs with a DCB device. But I don't think the existing qdiscs should be removed or modified; the two technologies are too different, in my opinion, to be combined in software.

Thanks Dave,
-PJ Waskiewicz
--
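For readers unfamiliar with the "MAC filters on VLAN priority" step mentioned above, the sketch below shows the classification key in plain C: the 3-bit priority field sits in the top bits of the 16-bit 802.1Q tag control information (TCI), and a small table maps each of the eight priorities to a priority group. The hardware does this in its packet buffers; the function names and the example table (priority 3 carrying FCoE in its own group) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define VLAN_PRIO_SHIFT 13     /* priority occupies TCI bits 15..13 */
#define VLAN_PRIO_MASK  0x7u
#define NUM_PRIO        8

/* Extract the 3-bit priority from a host-byte-order 802.1Q TCI. */
unsigned int vlan_tci_prio(uint16_t tci)
{
	return (tci >> VLAN_PRIO_SHIFT) & VLAN_PRIO_MASK;
}

/*
 * Hypothetical priority -> priority-group table: group 1 is reserved
 * for priority 3 (assumed here to carry FCoE), all other priorities
 * share group 0.
 */
static const uint8_t prio_to_group[NUM_PRIO] = { 0, 0, 0, 1, 0, 0, 0, 0 };

/* Look up the priority group for a tagged frame's TCI. */
unsigned int vlan_tci_group(uint16_t tci)
{
	unsigned int prio = vlan_tci_prio(tci);

	assert(prio < NUM_PRIO);
	return prio_to_group[prio];
}
```

Per-priority pause then acts on the same key: pausing priority 3 would hold back only the frames that land in group 1, leaving the LAN traffic in group 0 flowing.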
From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>

This is a scare crow, please don't use arguments like that.

Saying that it can't be done at all in software, but then saying "well, it sort of can be done, but the point is to do it in hardware" side-steps the very reason I want you to implement a software variant

This sounds like another way of saying "having a software implementation of even some of this facility would compromise the value of our hardware implementation."

That's not the kind of decision making process we use when deciding how to implement things in the kernel.
--
Everything is possible in software as long as the hardware doesn't hide the congestion information. It would be very useful to pass congestion information received by 802.1Qau frames to the kernel for use when selecting the nexthop or for the routing daemon to make decisions on. So far we could only react to link states, now we could actually react to link congestion on the routing layer.

There is no doubt that doing the prioritization in hardware is much preferred, but we should try to integrate it with other tc techniques. For example, it would be great if we could control DCB via skb->tc_index if that is possible. It would allow us to define DCB traffic classes with the rich features of existing classifiers. I've seen there is a mapping functionality although I haven't found any documentation on how to use it exactly.

Another area of interest is sending congestion frames on our own. We could finally implement real ingress software shaping and turn every Linux system into a DCB-capable node.
--
There was a qdisc submission for a scheduler called "pace" (IIRC) that did this. It needed some cleanups before merging, but nothing grave. --
Do you know if it used congestion notification at the link level? I can't seem to find the posting. --
They're using PAUSE frames. The latest submission I could find is: http://marc.info/?t=119625135300006&r=1&w=2 --
> Everything is possible in software as long as the hardware doesn't

Congestion notification in 802.1Qau is certainly something we need to support somewhere in the stack. I was actually talking with one of our hardware architects while I was in Israel last week about that exact gap, since the BCN/QCN rate limiting will eventually drop packets if we don't have a way of telling the upper layers to "slow down." The notification mechanism is also needed for 802.1Qbb, since the whole point of the priority flow control is to provide a no-drop mechanism for things like FCoE. But if the upper layers (e.g. FCoE stack) don't know to pause when the network is too congested, frames will be dropped, which is bad.

802.1Qau is still being defined in IEEE unfortunately, and we and others have no hardware that supports it to test the congestion notification tag processing. But it is something on our radar that needs to be

The prioritization is only one piece. The bandwidth aggregation, different modes of defining group strict vs. link strict priorities within a bandwidth group, etc., are all hardware modes. These modes need to be in sync with the link partner (switch, back to back NIC), and

Once 802.1Qau is defined, and IEEE decides to use BCN or QCN, I think this is a great direction to go in. Right now the congestion notification stuff is too up in the air to latch onto unfortunately.

Thanks for the comments Thomas,
-PJ Waskiewicz
--
That's already possible by calling netif_stop_queue() or netif_stop_subqueue(). Much more important for the upper layers to know is when congestion is about to happen so it can be avoided

Again, this piece of hardware basically implements a classful qdisc like htb or cbq except that it's limited to a flat tree. Yet, the capabilities of the hardware may not be sufficient, therefore it must be possible to combine hardware shaping with software qdiscs. It is therefore crucial to find common ground for exchanging traffic class information.

Is your piece of hardware strictly limited to mapping VLANs to traffic classes, or would it be possible to attach traffic class information to the packet in some way? (skb->tc_index) Could you elaborate on how classification works, especially the configurable mapping?

The most difficult part for me, and probably others as well, is that there are no public documents available yet which would describe the direction of where this is going, or how complete the current implementation actually is. We're pretty much looking at a black box which doesn't work very well with the existing architecture and requires a completely separate configuration interface. If we merge a configuration interface for DCB now it will be pretty much written in stone, yet we have no idea what other vendors may need.
--
