Re: [PATCH 1/2] NET: Re-add VLAN tag for devices incapable of keeping it

Previous thread: [GIT PULL] FireWire updates by Stefan Richter on Wednesday, October 31, 2007 - 11:24 am. (1 message)

Next thread: BUG? task->nsproxy == NULL after main calls pthread_exit by Tony Battersby on Wednesday, October 31, 2007 - 11:25 am. (1 message)
From: Dave Johnson
Date: Wednesday, October 31, 2007 - 11:43 am

Depending on the network driver, I'm seeing different behavior if
a .1q packet is received to an PF_PACKET, SOCK_RAW, ETH_P_ALL socket.


On devices what do not use NETIF_F_HW_VLAN_RX, the packet socket gets
the complete packet with vlan tag included as the driver simply calls
netif_receive_skb() or equivilant.  packet_rcv() then gets the whole
thing vlan tag included and sends this through the socket.

vlan_skb_recv() also gets these all and will drop them because there
are no vlans configured.

Example, e100 driver gives this to tcpdump:

# ifconfig eth1 up
# tcpdump -s 2000 -e -n -i eth1
tcpdump: WARNING: eth1: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 2000 bytes
14:11:03.707178 00:0b:82:05:22:0a > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 101, p 0, ethertype ARP, arp who-has 192.168.101.191 tell 192.168.101.131
14:11:04.215164 00:0b:82:05:22:05 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 101, p 0, ethertype ARP, arp who-has 192.168.101.191 tell 192.168.101.130
14:11:04.658940 00:0b:82:05:22:0c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 101, p 0, ethertype ARP, arp who-has 192.168.101.191 tell 192.168.101.135
14:11:05.706070 00:0b:82:05:22:0a > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 101, p 0, ethertype ARP, arp who-has 192.168.101.191 tell 192.168.101.131
14:11:05.939195 00:b0:c2:e8:d8:1c > 33:33:00:00:00:01, ethertype 802.1Q (0x8100), length 122: vlan 108, p 0, ethertype IPv6, fe80::2b0:c2ff:fee8:d81c > ff02::1: icmp6: router advertisement [class 0xe0]
14:11:07.222302 00:b0:c2:e8:d8:1c > 33:33:00:00:00:01, ethertype 802.1Q (0x8100), length 122: vlan 110, p 0, ethertype IPv6, fe80::2b0:c2ff:fee8:d81c > ff02::1: icmp6: router advertisement [class 0xe0]
14:11:08.486953 00:b0:c2:e8:d8:1c > 01:00:5e:00:00:05, ethertype 802.1Q (0x8100), length 134: vlan 110, p 0, ethertype IPv4, IP ...
From: Stephen Hemminger
Date: Wednesday, October 31, 2007 - 12:33 pm

On Wed, 31 Oct 2007 14:43:51 -0400

The VLAN acceleration grabs and hides the tag. It is a design flaw
that should be fixed, feel free to post a patch.


-- 
Stephen Hemminger <shemminger@linux-foundation.org>
-

From: Ben Greear
Date: Wednesday, October 31, 2007 - 6:06 pm

There may be several ways to 'fix' this.  Perhaps it would be worth 
discussing what
we want the end result to be at least?

Should we always pass the vlan header up to raw sockets as part of the
data payload?

Or, maybe pass it in an auxiliary message such as how timestamps may be 
passed?

The first option seems cleaner, but maybe there are performance problems 
with this
approach?

We should also define what a NIC should do with VLANs it doesn't 
explicitly know
about.   I think it should pass them up the stack with VLAN tag intact, 
but again, perhaps
there are reasons not to do that?

DaveM did the HW Accel for VLANs if I remember correctly...perhaps he 
has some input?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com


-

From: David Miller
Date: Wednesday, October 31, 2007 - 6:10 pm

From: Ben Greear <greearb@candelatech.com>

Not really, I'm busy and also not motivated to work on this, someone
else will need to.
-

From: Stephen Hemminger
Date: Wednesday, October 31, 2007 - 6:23 pm

The code in AF_PACKET should fix the skb before passing to user space so 
that there is
no difference between accel and non-accel hardware.  Internal choices 
shouldn't
leak to user space.  Ditto, the receive checksum offload should be fixed 
up as well.

-

From: Ben Greear
Date: Wednesday, October 31, 2007 - 6:31 pm

Ok, I guess that will fix the sniffing issues and any user-space 
bridging type applications.

Currently, VLAN devices offer the ability to 'reorder' the header and 
explicitly remove the VLAN
header.  I assume we keep this feature and have the AF_PACKET logic 
check the device
flags to see if it should insert the VLAN header for hw-accel vlans?

Either way, if we sniff the underlying device, we should always get the 
VLAN header.

What about drivers and filtering VLANs?  It seems there is still a 
difference between software
vlans and hw-accel in this case.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com


-

From: David Miller
Date: Wednesday, October 31, 2007 - 9:50 pm

From: Stephen Hemminger <shemminger@linux-foundation.org>

The hardware has stripped the VLAN header completely and has not
provided it to us at all.

In my opinion trying to cobble one up by hand using the known TAG is
worse than not providing a VLAN header at all.
-

From: Ben Greear
Date: Thursday, November 1, 2007 - 8:04 am

Do the NICs not save the QoS bits in the VLAN header anywhere that we could
use to reconstitute the header?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com


-

From: David Miller
Date: Thursday, November 1, 2007 - 2:35 pm

From: Ben Greear <greearb@candelatech.com>

You get the 16-bit VLAN tag.
-

From: Dave Johnson
Date: Thursday, November 1, 2007 - 2:36 pm

Unless the device also supports NETIF_F_HW_VLAN_FILTER, it has no idea
which vlans the kernel cares about, it's up to __vlan_hwaccel_rx().



yep.  bad csum on tx packets as reported by tcpdump is also an issue.



Yes, but it's more than just a packet socket issue.

A quick look through the hwaccel capable drivers (in 2.6.23) and most
are doing something like:

if (foo->vlgrp && packet_is_tagged)
  vlan_hwaccel_receive_skb(skb, foo->vlgrp, vlan_tag);
else
  netif_receive_skb(skb);

The important thing here is if the vlan group is NULL, the MAC must
be configured to NOT strip the tag.

users of NETIF_F_HW_VLAN_RX:
---------------------------
./drivers/net/8139cp.c:             looks ok
./drivers/net/acenic.c:             *1
./drivers/net/amd8111e.c:           unsure, probably *1
./drivers/net/atl1/atl1_main.c:     looks ok
./drivers/net/bnx2.c:               *2
./drivers/net/bonding/bond_main.c:  unsure, probably ok
./drivers/net/chelsio/cxgb2.c:      looks ok
./drivers/net/cxgb3/cxgb3_main.c:   looks ok
./drivers/net/e1000/e1000_main.c:   looks ok
./drivers/net/ehea/ehea_main.c:     unsure, probably ok
./drivers/net/forcedeth.c:          looks ok
./drivers/net/gianfar.c:            looks ok
./drivers/net/ixgb/ixgb_main.c:     looks ok
./drivers/net/ns83820.c:            unsure, probably ok
./drivers/net/r8169.c:              looks ok
./drivers/net/s2io.c:               *1
./drivers/net/sky2.c:               looks ok
./drivers/net/starfire.c:           unsure, probably ok
./drivers/net/tg3.c:                *2
./drivers/net/typhoon.c:            unsure, probably ok
./drivers/s390/net/qeth_main.c:     unsure, probably ok

*1:  Driver configures the MAC to strip TAGs even if vlan group is
     NULL. MAC strips the tag, but driver calls netif_rx() or
     netif_receive_skb() with the packet as untagged.  Kernel
     processes tagged packet as if it was received untagged.  Possible
     security issue.

*2:  If chip supports 'ASF', tag is always stripped (see ...
From: Rick Jones
Date: Thursday, November 1, 2007 - 2:48 pm

With TX CKO enabled, there isn't any checksum to fixup when a tx packet is 
sniffed, so I'm not sure what can be done in the kernel apart from an 
unpalatable "disable CKO and all which depend upon it when entering promiscuous 
mode."  Having the tap calculate a checksum would be equally bad for 
performance, and would frankly be incorrect anyway because it would give the 
user the false impression that was the checksum which went-out onto the wire.

One could I suppose try to ammend the information passed to allow tcpdump to say 
"oh, this was a tx packet on the same machine on which I am tracing so don't 
worry about checksum mismatch" but I have to wonder if it is _really_ worth it. 
Already someone has to deal with seeing TCP segments >> the MSS thanks to TSO. 
(Actually tcpdump got rather confused about that too since the IP length of 
those was 0, but IIRC we got that patched to use the length of zero as a "ah, 
this was TSO so wing it" heuristic.)

rick jones
-

From: David Miller
Date: Thursday, November 1, 2007 - 2:59 pm

From: Rick Jones <rick.jones2@hp.com>

We do this already!

-

From: Rick Jones
Date: Thursday, November 1, 2007 - 3:04 pm

I'll try to go pester folks in tcpdump-workers then.

rick
-

From: David Miller
Date: Thursday, November 1, 2007 - 3:07 pm

From: Rick Jones <rick.jones2@hp.com>

The thing to check is "TP_STATUS_CSUMNOTREADY".

When using mmap(), it will be provided in the descriptor.  When using
recvmsg() it will be provided via a PACKET_AUXDATA control message
when enabled via the PACKET_AUXDATA socket option.


-

From: Rick Jones
Date: Thursday, November 1, 2007 - 4:26 pm

Figures... the "dailies" and "weeklies" for tar files of tcpdump and libpcap 
source are fubar... again.  I've email in to tcpdump-workers on that one.  If 
that isn't resolved quickly I'll learn how to access their CVS (pick an SCM, any 
SCM...)

I did an apt-get of debian lenny's tcpdump and sources:

hpcpc103:~# tcpdump -V
tcpdump version 3.9.8
libpcap version 0.9.8

and that seems to show the false checksum failure and not use the 
TP_STATUS_CSUMNOTREADY - at least that didn't appear in a grepping of the 
sources.  At first I thought it might be, but then I realized that my snaplen 
was too short to get the whole TSO'ed frame so tcpdump wasn't even trying to 
verify.  After disabling TSO on the NIC, leaving CKO on, and making my snaplen > 
1500 I could see it was doing undesirable stuff.

I'll see what top of trunk has at some point and what the folks there think of 
adding-in a change.

rick jones
-

From: Dave Johnson
Date: Monday, November 5, 2007 - 10:47 am

This patch changes the following drivers to use vlan_hwaccel_rx()
and/or vlan_hwaccel_receive_skb() with a NULL vlan group if needed as
they are not setup to dynamically enable/disable vlan removal in
their MAC based on the vlan group:

drivers/net/tg3.c           Tested on BCM5704, looks good
drivers/net/bnx2.c          Tested on BCM5708, looks good
drivers/net/acenic.c        Not tested, I don't have one of these
drivers/net/s2io.c          Not tested, I don't have one of these

In addition, these drivers might also need changes, but should
probably be done by the maintainer as it's not clear exactly what
change is needed:

drivers/net/amd8111e.c
drivers/net/cxgb3/*

Signed-off-by: Dave Johnson <djohnson@sw.starentnetworks.com>

===== drivers/net/acenic.c 1.77 vs edited =====
--- 1.77/drivers/net/acenic.c	2007-07-24 16:28:41 -04:00
+++ edited/drivers/net/acenic.c	2007-11-03 12:27:40 -04:00
@@ -2036,7 +2036,7 @@
 
 		/* send it up */
 #if ACENIC_DO_VLAN
-		if (ap->vlgrp && (bd_flags & BD_FLG_VLAN_TAG)) {
+		if (bd_flags & BD_FLG_VLAN_TAG) {
 			vlan_hwaccel_rx(skb, ap->vlgrp, retdesc->vlan);
 		} else
 #endif
===== drivers/net/bnx2.c 1.135 vs edited =====
--- 1.135/drivers/net/bnx2.c	2007-09-20 15:14:21 -04:00
+++ edited/drivers/net/bnx2.c	2007-11-05 09:34:26 -05:00
@@ -2493,7 +2493,8 @@
 		}
 
 #ifdef BCM_VLAN
-		if ((status & L2_FHDR_STATUS_L2_VLAN_TAG) && (bp->vlgrp != 0)) {
+		if ((status & L2_FHDR_STATUS_L2_VLAN_TAG) &&
+		    (bp->vlgrp || (bp->flags & ASF_ENABLE_FLAG))) {
 			vlan_hwaccel_receive_skb(skb, bp->vlgrp,
 				rx_hdr->l2_fhdr_vlan_tag);
 		}
===== drivers/net/s2io.c 1.124 vs edited =====
--- 1.124/drivers/net/s2io.c	2007-08-03 18:10:44 -04:00
+++ edited/drivers/net/s2io.c	2007-11-03 12:29:09 -04:00
@@ -6834,8 +6834,7 @@
 	sp->mac_control.stats_info->sw_stat.mem_freed += skb->truesize;
 	if (!sp->lro) {
 		skb->protocol = eth_type_trans(skb, dev);
-		if ((sp->vlgrp && RXD_GET_VLAN_TAG(rxdp->Control_2) &&
-			vlan_strip_flag)) {
+		if ...
From: Ramkrishna Vepa
Date: Monday, November 5, 2007 - 7:39 pm

The Xframe (S2io) adapter can be programmed dynamically to either,
always strip the vlan tag or not. In this case, if the vlan group is
NULL, it can be programmed at run time to NOT strip the vlan tag.

When a packet with a Vlan id is received that is not added to the group,
should it be dropped or indicated up? Either can be handled by the
hardware. The vlan tag in this particular case will be stripped as the
vlan group is not NULL.


-

From: Ramkrishna Vepa
Date: Tuesday, November 6, 2007 - 11:28 am

Dave,

If you can remove the patch for the s2io driver, we can submit a patch
that can dynamically program Xframe to strip or not strip the vlan tag
based on whether the vlan group is not NULL or NULL respectively. For
this, we have to modify the initialization as well as the vlan
registration.

With regards to the question on the Vlan id received not added to the
group, I think we should drop these packets. This change requires
additional changes to support multiple receive rings. But will add it to
the driver on our website, which supports multiple rx rings, tx fifos,
multiqueue etc, and is a very small incremental change.

Thanks,

-

From: Dave Johnson
Date: Tuesday, November 6, 2007 - 11:34 am

Sure, not having this hardware I didn't want to attempt a complicated
change.  I'll let you take care of this.

-- 
Dave Johnson
Starent Networks

-

From: Dave Johnson
Date: Monday, November 5, 2007 - 10:46 am

Some MACs are configured to remove VLAN tags on RX packets even if
the kernel isn't setup to use VLANs.

On these drivers it is bad to simply pass the packets with the VLAN
tag removed to the network stack.  This gives the network stack the
impression these packets arrived from the network without a VLAN tag
even though they had one.  This has 2 issues: 1, PF_PACKET sockets see
the packet without the VLAN tag when bound to the base device; and 2,
the L3 stacks will process these packets as if they had no tag to
begin with.

This patch changes vlan_hwaccel_rx() and vlan_hwaccel_receive_skb() to
accept a NULL vlan group.  This simplifies the drivers as they can
call vlan_hwaccel_rx()/vlan_hwaccel_receive_skb() if the MAC
has removed the VLAN tag, or netif_rx()/netif_receive_skb() if the
MAC hasn't removed a VLAN tag (or the packet arrived untagged).

If the vlan group is NULL or no device for the received VID exists,
vlan_hwaccel_rx()/vlan_hwaccel_receive_skb() will now re-add a VLAN
tag and pass the packet to the base device with tag intact.

It is still preferable for the drivers to disable VLAN tag removal in
the hardware if no vlan group is configured.  However for devices that
cannot do this the change will ensure tagged packets remain tagged in
the network stack.

Signed-off-by: Dave Johnson <djohnson@sw.starentnetworks.com>

---

Note, __vlan_hwaccel_rx() needed to move below __vlan_put_tag() so the
change really isn't as big as it may look below.

===== include/linux/if_vlan.h 1.25 vs edited =====
--- 1.25/include/linux/if_vlan.h	2007-07-14 21:53:28 -04:00
+++ edited/include/linux/if_vlan.h	2007-11-03 14:26:43 -04:00
@@ -164,71 +164,6 @@
 	(VLAN_TX_SKB_CB(__skb)->magic == VLAN_TX_COOKIE_MAGIC)
 #define vlan_tx_tag_get(__skb)	(VLAN_TX_SKB_CB(__skb)->vlan_tag)
 
-/* VLAN rx hw acceleration helper.  This acts like netif_{rx,receive_skb}(). */
-static inline int __vlan_hwaccel_rx(struct sk_buff *skb,
-				    struct vlan_group *grp,
-				    unsigned short ...
From: Patrick McHardy
Date: Monday, November 5, 2007 - 11:00 am

This looks like a rather expensive operation for the unlikely case
that packets will be received by a packet socket. IMO it should only
be reconstructed if actually needed, by af_packet itself.

As we discussed some time back storing the VLAN tag in the CB on
TX clashes with other users of the CB like qdiscs, so we need a
new field in the skb for this anyway.

-

From: David Miller
Date: Monday, November 5, 2007 - 4:15 pm

From: Patrick McHardy <kaber@trash.net>

Completely agreed.  We should not do this by default when %99

Someone will have to find a way to remove some other fields in
sk_buff before I'm going to allow more space to be eaten up
by this completely fringe case feature.
-

From: Patrick McHardy
Date: Monday, November 5, 2007 - 5:21 pm

I think there is one more case that matters, which is briding
from a device with VLAN stripping for a VLAN not configured
locally. The tag will be stripped and will be lost for forwarded
packets. But I'm not exactly sure this really can be configured


We have a two byte hole after tc_verd where we could fit this in.
But I'm pretty sure we also could reuse some other fields on input,
like queue_mapping or maybe even destructor for unowned skbs.

-

From: David Miller
Date: Monday, November 5, 2007 - 5:35 pm

From: Patrick McHardy <kaber@trash.net>

If so then when such rules are loaded, just like PF_PACKET, it can set

Two bytes is enough, so if there is a hole we can use it.
-

From: Krzysztof Halasa
Date: Tuesday, November 6, 2007 - 11:03 am

I think we should drop such packets on RX. Anyway we shouldn't
forward them.
-- 
Krzysztof Halasa
-

From: Ben Greear
Date: Tuesday, November 6, 2007 - 11:56 am

Bridging eth0 to eth1 should not pay attention to VLAN tags
at all (if the pkt comes in on VLAN 7, it should go out on VLAN 7),
in my opinion.  If the NIC is stripping the VLAN header, then this
cannot work unless something re-builds the VLAN header.  If the stripped
VLAN header is placed into the skb, then any code that does need to
rebuild it can do so.  It may be less efficient, but users can just
not use that NIC hardware for high-end solutions, and at any rate,
less efficient is better than broken.

Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

-

From: Krzysztof Halasa
Date: Tuesday, November 6, 2007 - 1:08 pm

That is all true. The problem arises when you receive a tagged frame
on eth0, the chip removes the tag, and then the bridge sends it
out untagged on eth1.

I think there are two valid models for VLAN + bridging:

a) bridging works on "physical" interfaces, all tags are transmitted
   unchanged.

b) every VLAN is a different logical interface, packets from unknown
   VLANs are dropped on RX (and thus don't show up anywhere, except
   counters), bridging uses logical interfaces. VLAN 100 on eth0 may
   become VLAN 200 on eth1 and may be untagged on eth2.


"a" requires "soft" VLANs and/or adding the tags back (with
accelerated VLANs). This is how unmanaged switches labeled
"802.1Q - transparent" work. Not very flexible but usually good
enough.

"b" is how switches supporting VLANs (and 802.1Q) usually work.


I know ability to see exactly all packets as they are received
(including tags) is a really nice thing. But maybe we should change
the model? Making the ethX only carry untagged frames (even without
hw VLAN acceleration)?
-- 
Krzysztof Halasa
-

From: Patrick McHardy
Date: Tuesday, November 6, 2007 - 4:55 pm

I believe you misunderstand the configuration I was thinking about:

eth0.1000        br0
     |             |
eth0 with VLAN stripping

Its slightly different that I thought though, __vlan_hwaccel_rx
drops packets for unknown vids, so as soon as you configure a VLAN
locally on the same device that is used for a bridge, no VLAN
packets will be forwarded at all.

Slightly inconsistent with how other layered devices work, but
can be avoided by adding the VLAN to br0.

-

From: David Miller
Date: Thursday, November 1, 2007 - 2:58 pm

From: Dave Johnson <djohnson+linux-kernel@sw.starentnetworks.com>

We provide a tag to userspace that tcpdump should use to see that the
HW is going to checksum the packet, and therefore it should elide
trying to verify the checksums.

It's not a kernel issue.
-

From: Dave Johnson
Date: Friday, November 2, 2007 - 11:08 am

Michael,




Could you elaborate if this is really needed, if so is there some
workaround that could be done instead?

Simply removing the check seemed to work for me, but I'm unsure if
this is actually a valid thing to do with these MACs.



-- 
Dave Johnson
Starent Networks

-

From: Michael Chan
Date: Friday, November 2, 2007 - 2:52 pm

This is needed for management firmware to work properly.  Management
firmware expects any VLAN tags to be stripped.  Unfortunately, VLAN
stripping cannot be done independently between the driver and the

Most of these on-board devices are shipped with management firmware
enabled.  Removing the check will make the firmware not functional.

We realize this VLAN limitation is causing problems to many users and we
are looking for ways to address it.

-

From: David Miller
Date: Friday, November 2, 2007 - 2:20 pm

From: Dave Johnson <djohnson+linux-kernel@sw.starentnetworks.com>

Unfortunately the ASF firmware is very picky.

I think were are stuck with this behavior.
-

Previous thread: [GIT PULL] FireWire updates by Stefan Richter on Wednesday, October 31, 2007 - 11:24 am. (1 message)

Next thread: BUG? task->nsproxy == NULL after main calls pthread_exit by Tony Battersby on Wednesday, October 31, 2007 - 11:25 am. (1 message)