Depending on the network driver, I'm seeing different behavior if a .1q packet is received to an PF_PACKET, SOCK_RAW, ETH_P_ALL socket. On devices what do not use NETIF_F_HW_VLAN_RX, the packet socket gets the complete packet with vlan tag included as the driver simply calls netif_receive_skb() or equivilant. packet_rcv() then gets the whole thing vlan tag included and sends this through the socket. vlan_skb_recv() also gets these all and will drop them because there are no vlans configured. Example, e100 driver gives this to tcpdump: # ifconfig eth1 up # tcpdump -s 2000 -e -n -i eth1 tcpdump: WARNING: eth1: no IPv4 address assigned tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth1, link-type EN10MB (Ethernet), capture size 2000 bytes 14:11:03.707178 00:0b:82:05:22:0a > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 101, p 0, ethertype ARP, arp who-has 192.168.101.191 tell 192.168.101.131 14:11:04.215164 00:0b:82:05:22:05 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 101, p 0, ethertype ARP, arp who-has 192.168.101.191 tell 192.168.101.130 14:11:04.658940 00:0b:82:05:22:0c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 101, p 0, ethertype ARP, arp who-has 192.168.101.191 tell 192.168.101.135 14:11:05.706070 00:0b:82:05:22:0a > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 101, p 0, ethertype ARP, arp who-has 192.168.101.191 tell 192.168.101.131 14:11:05.939195 00:b0:c2:e8:d8:1c > 33:33:00:00:00:01, ethertype 802.1Q (0x8100), length 122: vlan 108, p 0, ethertype IPv6, fe80::2b0:c2ff:fee8:d81c > ff02::1: icmp6: router advertisement [class 0xe0] 14:11:07.222302 00:b0:c2:e8:d8:1c > 33:33:00:00:00:01, ethertype 802.1Q (0x8100), length 122: vlan 110, p 0, ethertype IPv6, fe80::2b0:c2ff:fee8:d81c > ff02::1: icmp6: router advertisement [class 0xe0] 14:11:08.486953 00:b0:c2:e8:d8:1c > 01:00:5e:00:00:05, ethertype 802.1Q (0x8100), length 134: vlan 110, p 0, ethertype IPv4, IP ...
On Wed, 31 Oct 2007 14:43:51 -0400 The VLAN acceleration grabs and hides the tag. It is a design flaw that should be fixed, feel free to post a patch. -- Stephen Hemminger <shemminger@linux-foundation.org> -
There may be several ways to 'fix' this. Perhaps it would be worth discussing what we want the end result to be at least? Should we always pass the vlan header up to raw sockets as part of the data payload? Or, maybe pass it in an auxiliary message such as how timestamps may be passed? The first option seems cleaner, but maybe there are performance problems with this approach? We should also define what a NIC should do with VLANs it doesn't explicitly know about. I think it should pass them up the stack with VLAN tag intact, but again, perhaps there are reasons not to do that? DaveM did the HW Accel for VLANs if I remember correctly...perhaps he has some input? Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com -
From: Ben Greear <greearb@candelatech.com> Not really, I'm busy and also not motivated to work on this, someone else will need to. -
The code in AF_PACKET should fix the skb before passing to user space so that there is no difference between accel and non-accel hardware. Internal choices shouldn't leak to user space. Ditto, the receive checksum offload should be fixed up as well. -
Ok, I guess that will fix the sniffing issues and any user-space bridging type applications. Currently, VLAN devices offer the ability to 'reorder' the header and explicitly remove the VLAN header. I assume we keep this feature and have the AF_PACKET logic check the device flags to see if it should insert the VLAN header for hw-accel vlans? Either way, if we sniff the underlying device, we should always get the VLAN header. What about drivers and filtering VLANs? It seems there is still a difference between software vlans and hw-accel in this case. Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com -
From: Stephen Hemminger <shemminger@linux-foundation.org> The hardware has stripped the VLAN header completely and has not provided it to us at all. In my opinion trying to cobble one up by hand using the known TAG is worse than not providing a VLAN header at all. -
Do the NICs not save the QoS bits in the VLAN header anywhere that we could use to reconstitute the header? Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com -
From: Ben Greear <greearb@candelatech.com> You get the 16-bit VLAN tag. -
Unless the device also supports NETIF_F_HW_VLAN_FILTER, it has no idea
which vlans the kernel cares about, it's up to __vlan_hwaccel_rx().
yep. bad csum on tx packets as reported by tcpdump is also an issue.
Yes, but it's more than just a packet socket issue.
A quick look through the hwaccel capable drivers (in 2.6.23) and most
are doing something like:
if (foo->vlgrp && packet_is_tagged)
vlan_hwaccel_receive_skb(skb, foo->vlgrp, vlan_tag);
else
netif_receive_skb(skb);
The important thing here is if the vlan group is NULL, the MAC must
be configured to NOT strip the tag.
users of NETIF_F_HW_VLAN_RX:
---------------------------
./drivers/net/8139cp.c: looks ok
./drivers/net/acenic.c: *1
./drivers/net/amd8111e.c: unsure, probably *1
./drivers/net/atl1/atl1_main.c: looks ok
./drivers/net/bnx2.c: *2
./drivers/net/bonding/bond_main.c: unsure, probably ok
./drivers/net/chelsio/cxgb2.c: looks ok
./drivers/net/cxgb3/cxgb3_main.c: looks ok
./drivers/net/e1000/e1000_main.c: looks ok
./drivers/net/ehea/ehea_main.c: unsure, probably ok
./drivers/net/forcedeth.c: looks ok
./drivers/net/gianfar.c: looks ok
./drivers/net/ixgb/ixgb_main.c: looks ok
./drivers/net/ns83820.c: unsure, probably ok
./drivers/net/r8169.c: looks ok
./drivers/net/s2io.c: *1
./drivers/net/sky2.c: looks ok
./drivers/net/starfire.c: unsure, probably ok
./drivers/net/tg3.c: *2
./drivers/net/typhoon.c: unsure, probably ok
./drivers/s390/net/qeth_main.c: unsure, probably ok
*1: Driver configures the MAC to strip TAGs even if vlan group is
NULL. MAC strips the tag, but driver calls netif_rx() or
netif_receive_skb() with the packet as untagged. Kernel
processes tagged packet as if it was received untagged. Possible
security issue.
*2: If chip supports 'ASF', tag is always stripped (see ...With TX CKO enabled, there isn't any checksum to fixup when a tx packet is sniffed, so I'm not sure what can be done in the kernel apart from an unpalatable "disable CKO and all which depend upon it when entering promiscuous mode." Having the tap calculate a checksum would be equally bad for performance, and would frankly be incorrect anyway because it would give the user the false impression that was the checksum which went-out onto the wire. One could I suppose try to ammend the information passed to allow tcpdump to say "oh, this was a tx packet on the same machine on which I am tracing so don't worry about checksum mismatch" but I have to wonder if it is _really_ worth it. Already someone has to deal with seeing TCP segments >> the MSS thanks to TSO. (Actually tcpdump got rather confused about that too since the IP length of those was 0, but IIRC we got that patched to use the length of zero as a "ah, this was TSO so wing it" heuristic.) rick jones -
From: Rick Jones <rick.jones2@hp.com> We do this already! -
I'll try to go pester folks in tcpdump-workers then. rick -
From: Rick Jones <rick.jones2@hp.com> The thing to check is "TP_STATUS_CSUMNOTREADY". When using mmap(), it will be provided in the descriptor. When using recvmsg() it will be provided via a PACKET_AUXDATA control message when enabled via the PACKET_AUXDATA socket option. -
Figures... the "dailies" and "weeklies" for tar files of tcpdump and libpcap source are fubar... again. I've email in to tcpdump-workers on that one. If that isn't resolved quickly I'll learn how to access their CVS (pick an SCM, any SCM...) I did an apt-get of debian lenny's tcpdump and sources: hpcpc103:~# tcpdump -V tcpdump version 3.9.8 libpcap version 0.9.8 and that seems to show the false checksum failure and not use the TP_STATUS_CSUMNOTREADY - at least that didn't appear in a grepping of the sources. At first I thought it might be, but then I realized that my snaplen was too short to get the whole TSO'ed frame so tcpdump wasn't even trying to verify. After disabling TSO on the NIC, leaving CKO on, and making my snaplen > 1500 I could see it was doing undesirable stuff. I'll see what top of trunk has at some point and what the folks there think of adding-in a change. rick jones -
This patch changes the following drivers to use vlan_hwaccel_rx()
and/or vlan_hwaccel_receive_skb() with a NULL vlan group if needed as
they are not setup to dynamically enable/disable vlan removal in
their MAC based on the vlan group:
drivers/net/tg3.c Tested on BCM5704, looks good
drivers/net/bnx2.c Tested on BCM5708, looks good
drivers/net/acenic.c Not tested, I don't have one of these
drivers/net/s2io.c Not tested, I don't have one of these
In addition, these drivers might also need changes, but should
probably be done by the maintainer as it's not clear exactly what
change is needed:
drivers/net/amd8111e.c
drivers/net/cxgb3/*
Signed-off-by: Dave Johnson <djohnson@sw.starentnetworks.com>
===== drivers/net/acenic.c 1.77 vs edited =====
--- 1.77/drivers/net/acenic.c 2007-07-24 16:28:41 -04:00
+++ edited/drivers/net/acenic.c 2007-11-03 12:27:40 -04:00
@@ -2036,7 +2036,7 @@
/* send it up */
#if ACENIC_DO_VLAN
- if (ap->vlgrp && (bd_flags & BD_FLG_VLAN_TAG)) {
+ if (bd_flags & BD_FLG_VLAN_TAG) {
vlan_hwaccel_rx(skb, ap->vlgrp, retdesc->vlan);
} else
#endif
===== drivers/net/bnx2.c 1.135 vs edited =====
--- 1.135/drivers/net/bnx2.c 2007-09-20 15:14:21 -04:00
+++ edited/drivers/net/bnx2.c 2007-11-05 09:34:26 -05:00
@@ -2493,7 +2493,8 @@
}
#ifdef BCM_VLAN
- if ((status & L2_FHDR_STATUS_L2_VLAN_TAG) && (bp->vlgrp != 0)) {
+ if ((status & L2_FHDR_STATUS_L2_VLAN_TAG) &&
+ (bp->vlgrp || (bp->flags & ASF_ENABLE_FLAG))) {
vlan_hwaccel_receive_skb(skb, bp->vlgrp,
rx_hdr->l2_fhdr_vlan_tag);
}
===== drivers/net/s2io.c 1.124 vs edited =====
--- 1.124/drivers/net/s2io.c 2007-08-03 18:10:44 -04:00
+++ edited/drivers/net/s2io.c 2007-11-03 12:29:09 -04:00
@@ -6834,8 +6834,7 @@
sp->mac_control.stats_info->sw_stat.mem_freed += skb->truesize;
if (!sp->lro) {
skb->protocol = eth_type_trans(skb, dev);
- if ((sp->vlgrp && RXD_GET_VLAN_TAG(rxdp->Control_2) &&
- vlan_strip_flag)) {
+ if ...The Xframe (S2io) adapter can be programmed dynamically to either, always strip the vlan tag or not. In this case, if the vlan group is NULL, it can be programmed at run time to NOT strip the vlan tag. When a packet with a Vlan id is received that is not added to the group, should it be dropped or indicated up? Either can be handled by the hardware. The vlan tag in this particular case will be stripped as the vlan group is not NULL. -
Dave, If you can remove the patch for the s2io driver, we can submit a patch that can dynamically program Xframe to strip or not strip the vlan tag based on whether the vlan group is not NULL or NULL respectively. For this, we have to modify the initialization as well as the vlan registration. With regards to the question on the Vlan id received not added to the group, I think we should drop these packets. This change requires additional changes to support multiple receive rings. But will add it to the driver on our website, which supports multiple rx rings, tx fifos, multiqueue etc, and is a very small incremental change. Thanks, -
Sure, not having this hardware I didn't want to attempt a complicated change. I'll let you take care of this. -- Dave Johnson Starent Networks -
Some MACs are configured to remove VLAN tags on RX packets even if
the kernel isn't setup to use VLANs.
On these drivers it is bad to simply pass the packets with the VLAN
tag removed to the network stack. This gives the network stack the
impression these packets arrived from the network without a VLAN tag
even though they had one. This has 2 issues: 1, PF_PACKET sockets see
the packet without the VLAN tag when bound to the base device; and 2,
the L3 stacks will process these packets as if they had no tag to
begin with.
This patch changes vlan_hwaccel_rx() and vlan_hwaccel_receive_skb() to
accept a NULL vlan group. This simplifies the drivers as they can
call vlan_hwaccel_rx()/vlan_hwaccel_receive_skb() if the MAC
has removed the VLAN tag, or netif_rx()/netif_receive_skb() if the
MAC hasn't removed a VLAN tag (or the packet arrived untagged).
If the vlan group is NULL or no device for the received VID exists,
vlan_hwaccel_rx()/vlan_hwaccel_receive_skb() will now re-add a VLAN
tag and pass the packet to the base device with tag intact.
It is still preferable for the drivers to disable VLAN tag removal in
the hardware if no vlan group is configured. However for devices that
cannot do this the change will ensure tagged packets remain tagged in
the network stack.
Signed-off-by: Dave Johnson <djohnson@sw.starentnetworks.com>
---
Note, __vlan_hwaccel_rx() needed to move below __vlan_put_tag() so the
change really isn't as big as it may look below.
===== include/linux/if_vlan.h 1.25 vs edited =====
--- 1.25/include/linux/if_vlan.h 2007-07-14 21:53:28 -04:00
+++ edited/include/linux/if_vlan.h 2007-11-03 14:26:43 -04:00
@@ -164,71 +164,6 @@
(VLAN_TX_SKB_CB(__skb)->magic == VLAN_TX_COOKIE_MAGIC)
#define vlan_tx_tag_get(__skb) (VLAN_TX_SKB_CB(__skb)->vlan_tag)
-/* VLAN rx hw acceleration helper. This acts like netif_{rx,receive_skb}(). */
-static inline int __vlan_hwaccel_rx(struct sk_buff *skb,
- struct vlan_group *grp,
- unsigned short ...This looks like a rather expensive operation for the unlikely case that packets will be received by a packet socket. IMO it should only be reconstructed if actually needed, by af_packet itself. As we discussed some time back storing the VLAN tag in the CB on TX clashes with other users of the CB like qdiscs, so we need a new field in the skb for this anyway. -
From: Patrick McHardy <kaber@trash.net> Completely agreed. We should not do this by default when %99 Someone will have to find a way to remove some other fields in sk_buff before I'm going to allow more space to be eaten up by this completely fringe case feature. -
I think there is one more case that matters, which is briding from a device with VLAN stripping for a VLAN not configured locally. The tag will be stripped and will be lost for forwarded packets. But I'm not exactly sure this really can be configured We have a two byte hole after tc_verd where we could fit this in. But I'm pretty sure we also could reuse some other fields on input, like queue_mapping or maybe even destructor for unowned skbs. -
From: Patrick McHardy <kaber@trash.net> If so then when such rules are loaded, just like PF_PACKET, it can set Two bytes is enough, so if there is a hole we can use it. -
I think we should drop such packets on RX. Anyway we shouldn't forward them. -- Krzysztof Halasa -
Bridging eth0 to eth1 should not pay attention to VLAN tags at all (if the pkt comes in on VLAN 7, it should go out on VLAN 7), in my opinion. If the NIC is stripping the VLAN header, then this cannot work unless something re-builds the VLAN header. If the stripped VLAN header is placed into the skb, then any code that does need to rebuild it can do so. It may be less efficient, but users can just not use that NIC hardware for high-end solutions, and at any rate, less efficient is better than broken. Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com -
That is all true. The problem arises when you receive a tagged frame on eth0, the chip removes the tag, and then the bridge sends it out untagged on eth1. I think there are two valid models for VLAN + bridging: a) bridging works on "physical" interfaces, all tags are transmitted unchanged. b) every VLAN is a different logical interface, packets from unknown VLANs are dropped on RX (and thus don't show up anywhere, except counters), bridging uses logical interfaces. VLAN 100 on eth0 may become VLAN 200 on eth1 and may be untagged on eth2. "a" requires "soft" VLANs and/or adding the tags back (with accelerated VLANs). This is how unmanaged switches labeled "802.1Q - transparent" work. Not very flexible but usually good enough. "b" is how switches supporting VLANs (and 802.1Q) usually work. I know ability to see exactly all packets as they are received (including tags) is a really nice thing. But maybe we should change the model? Making the ethX only carry untagged frames (even without hw VLAN acceleration)? -- Krzysztof Halasa -
I believe you misunderstand the configuration I was thinking about:
eth0.1000 br0
| |
eth0 with VLAN stripping
Its slightly different that I thought though, __vlan_hwaccel_rx
drops packets for unknown vids, so as soon as you configure a VLAN
locally on the same device that is used for a bridge, no VLAN
packets will be forwarded at all.
Slightly inconsistent with how other layered devices work, but
can be avoided by adding the VLAN to br0.
-
From: Dave Johnson <djohnson+linux-kernel@sw.starentnetworks.com> We provide a tag to userspace that tcpdump should use to see that the HW is going to checksum the packet, and therefore it should elide trying to verify the checksums. It's not a kernel issue. -
Michael, Could you elaborate if this is really needed, if so is there some workaround that could be done instead? Simply removing the check seemed to work for me, but I'm unsure if this is actually a valid thing to do with these MACs. -- Dave Johnson Starent Networks -
This is needed for management firmware to work properly. Management firmware expects any VLAN tags to be stripped. Unfortunately, VLAN stripping cannot be done independently between the driver and the Most of these on-board devices are shipped with management firmware enabled. Removing the check will make the firmware not functional. We realize this VLAN limitation is causing problems to many users and we are looking for ways to address it. -
From: Dave Johnson <djohnson+linux-kernel@sw.starentnetworks.com> Unfortunately the ASF firmware is very picky. I think were are stuck with this behavior. -
