Re: MACVLANs really best solution? How about a bridge with multiple bridge virtual interfaces? (was Re: [PATCH] macvlan: Support creating macvlans from macvlans)

Previous thread: [PATCH] igb: fix kexec with igb by Yinghai Lu on Friday, March 6, 2009 - 9:33 pm. (36 messages)

Next thread: e100 fail in 2.6.29-rc6-git7, kernel requesting firmware by Denys Fedoryschenko on Saturday, March 7, 2009 - 3:10 pm. (1 message)

Hi,


I agree, however there are further things that mac-vlans aren't
currently doing as virtual ethernet interfaces that real ones do.
Unicast ethernet traffic sent out one mac-vlan interface with a
destination address of another mac-vlan interface on the same host
isn't delivered. mac-vlan interfaces, even though they're conceptually
located on the same ethernet segment, are currently isolated from each
other for unicast traffic.

Thinking about the few scenarios where I've been using mac-vlans, it
seems to me that the fundamental goal of mac-vlans is to be
functionally equivalent to having multiple physical ethernet network
cards attached to the same LAN segment, with those physical
interfaces attached together via a switch or a hub.

Virtualising that scenario using mac-vlans and a single physical
interface means that the outbound mac-vlan code on a host needs to
replicate the functionality of a switch or a hub. Outbound unicast
traffic from one mac-vlan interface needs to be either seen by the
destination address matching code of all other mac-vlan interfaces (i.e.
replicating hub functionality), or selectively forwarded to the
intended mac-vlan interface (replicating switch functionality). If the
unicast destination doesn't match one of the other mac-vlan interfaces,
then it should be forwarded out the mac-vlan interfaces' parent physical
interface, as the destination is likely to be on the physical ethernet.
Blindly forwarding the outbound mac-vlan origin traffic out the physical
interface doesn't solve the problem - switches (and I'm pretty sure at
least functionally, hubs) don't forward traffic back out the interface
they receive it on. So the hub/switch forwarding functionality between
mac-vlan interfaces and their parent physical interface needs to happen
within the kernel.
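
A minimal user-space sketch of the forwarding decision described above,
written only to make the rules concrete. It is not kernel code and not
the macvlan driver's implementation; struct vif, tx_decide() and the
TX_* markers are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN     6
#define TX_TO_WIRE  (-1)  /* hypothetical marker: send out the parent physical device */
#define TX_FLOOD    (-2)  /* hypothetical marker: copy to all other mac-vlans and the wire */

/* Hypothetical stand-in for one mac-vlan interface. */
struct vif {
	uint8_t mac[MAC_LEN];
};

/* The group bit (lowest bit of the first octet) marks broadcast/multicast. */
static bool is_group_addr(const uint8_t *mac)
{
	return (mac[0] & 0x01) != 0;
}

/*
 * Decide where a frame transmitted by vifs[src] should go.
 * Returns the index of the matching local interface (switch behaviour),
 * TX_FLOOD for broadcast/multicast, or TX_TO_WIRE when no other local
 * interface owns the destination address.
 */
static int tx_decide(const struct vif *vifs, size_t n, size_t src,
		     const uint8_t *dst_mac)
{
	assert(vifs != NULL && dst_mac != NULL && src < n);

	if (is_group_addr(dst_mac))
		return TX_FLOOD;

	for (size_t i = 0; i < n; i++) {
		if (i == src)
			continue;	/* never hand a frame back to its sender */
		if (memcmp(vifs[i].mac, dst_mac, MAC_LEN) == 0)
			return (int)i;
	}
	return TX_TO_WIRE;
}
```

A linear scan keeps the sketch short; with many interfaces a hash keyed
on the destination MAC would be the obvious replacement.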

I've been thinking of having a go at adding this functionality. But
then I realised that that would just be more duplication of the
functionality of the code that is already in the kernel - the ...

At least for my use, having them all blindly TX is fine.  For thousands
of interfaces, if you did this right and also delivered all broadcast
packets locally (ie, ARP), you will cause a lot of overhead, and unless
you are running a patched kernel (or namespaces perhaps), you can't
really communicate with yourself over the network anyway using IP.

For the behaviour you want, try adding pairs of VETH interfaces and add
one end of the veth's to the bridge.  Add a physical port to the bridge
for egress.  Since this can be done, I don't really see any reason to
change mac-vlan significantly...

If the veth/bridge thing doesn't work, then let us know, as I think that
would be a bug.  I use a similar-to-veth virtual-device pair in this way
and it works fine.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com


--


There is one scenario in which macvlans totally beat bridging veth
devices.  macvlans support the full set of stateless hardware
offloads that the hardware supports, whereas veth devices support none
of them.

I don't think changing macvlans makes a lot of sense.  Beyond the
pain of making it work, there are the semantic differences of local
broadcast working.

Doing something so that bridges have roughly the same performance 
as macvlans would be very nice.  I think it requires advertising
most if not all stateless hardware offloads, and then implementing
them in software on the endpoints that don't support them.

I did get as far as implementing a first draft at looping packets back
locally, and the behaviour differences for broadcast and multicast
made macvlans a bad fit.  For clean code something like
the bridge code where you don't use the original interface directly
for sending and receiving traffic seems required.

Eric
--


On Sat, 07 Mar 2009 10:13:16 -0800

So then, my question is, what are mac-vlans for i.e. what is their
common use case?

The problem I was trying to solve was to run up an arbitrary
number of PPPoE servers on a single LAN segment. I could do that
with physical interfaces, however I only had a maximum of 4 ethernet
interfaces in the host. Using mac-vlans seemed to be the obvious way to
eliminate the physical constraints of the host. I did expect though that
the mac-vlan virtual interfaces would work the same as real interfaces, so
I was expecting that I could bridge them and that unicast traffic
between them would work.

If bridged veth pairs are a solution to my problem, what would I use
mac-vlans for? Ben seems to have a use case, however I know that he
does network testing, so some of his use cases could be quite uncommon.
I do some odd things with networking occasionally, so the mac-vlan
behaviour / limitations might be useful, but I'm curious if there would
be some more common uses?

Regards,
mark.
--


Doesn't pppoe always talk to an upstream box (the pppoe-server)?  If
that is so, why would the local mac-vlans ever need to communicate
directly to each other?

We've used pppoe on mac-vlans, and it *seemed* to work, but perhaps we
were missing something...

I think they might also be useful for adding a more realistic 'virtual
ip' to an interface, perhaps for interesting routing setups.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com


--


Hi Ben,

On Sun, 08 Mar 2009 09:54:02 -0700


That's true, and in that sense I was 'lucky' that I didn't encounter
this limitation in my test environment, but only because my test
environment (i.e. the protocol) didn't require unicast communication
between the mac-vlan interfaces on the same host. If I'd been testing
some other protocol that did require unicast communications between
macvlan interfaces, I would have been scratching my head, wondering
why things weren't working correctly, and may have spent a lot of time
thinking that my test setup was the thing that had failed, rather than
being caused by missing functionality that I assumed was there.

I think the "truth in advertising" or "principle of least surprise"
should hold - if mac-vlans are to be seen by their users as virtual
ethernet network cards, then they should function no differently to real
network cards, or alternatively, if they aren't going to function that
way, there needs to be quite obvious documentation somewhere (other
than this thread) about their limitations, and a corresponding
recommendation that you use bridged veth interfaces if you can't afford
--


I agree on most points. There is one fundamental operational difference
however. With macvlan, all MAC addresses are known and can therefore be
programmed as secondary unicast addresses, while a bridge always uses
promiscuous mode and needs to flood frames for unknown addresses.

This could be changed in the bridging code of course for bridges
consisting purely of local devices. Most of the bridging stuff isn't
needed for macvlans though, so it's probably easier to simply perform
a lookup for local devices in macvlan on transmit, similar to what
is done on reception.
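
As a rough illustration of such a lookup, here is a user-space sketch of
a MAC-keyed hash table with 256 buckets, the same size as
MACVLAN_HASH_SIZE. Choosing the bucket from the last octet of the
address is an assumption made for this sketch, and vif_entry, vif_table
and the helper functions are hypothetical names, not the driver's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN   6
#define HASH_SIZE (1 << 8)	/* same size as MACVLAN_HASH_SIZE */

/* Hypothetical entry for one local virtual interface. */
struct vif_entry {
	uint8_t mac[MAC_LEN];
	struct vif_entry *next;	/* chain of entries sharing a bucket */
};

struct vif_table {
	struct vif_entry *bucket[HASH_SIZE];
};

/* Assumption for this sketch: the last octet selects the bucket. */
static unsigned int mac_bucket(const uint8_t *mac)
{
	return mac[MAC_LEN - 1];
}

/* Insert an entry at the head of its bucket chain. */
static void vif_table_add(struct vif_table *t, struct vif_entry *e)
{
	unsigned int b = mac_bucket(e->mac);

	e->next = t->bucket[b];
	t->bucket[b] = e;
}

/* Return the entry owning this MAC, or NULL if it is not a local device. */
static struct vif_entry *vif_table_lookup(const struct vif_table *t,
					  const uint8_t *mac)
{
	assert(t != NULL && mac != NULL);

	for (struct vif_entry *e = t->bucket[mac_bucket(mac)]; e; e = e->next)
		if (memcmp(e->mac, mac, MAC_LEN) == 0)
			return e;
	return NULL;
}
```

On transmit, a NULL result would mean "not local, send it to the wire",
while a hit would mean the frame can be handed to that device directly.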


--


What I haven't figured out is how you handle the transmit path for
broadcast and multicast ethernet traffic.  How do you test to see if
you have already performed local transmission?

For discussion but not for application because it is incomplete:
This is what I came up with when I played with getting the local
transmission case working the other day.


From 15e4a58ae0cea86338ef9d73ae14ba32e4819f5a Mon Sep 17 00:00:00 2001
From: Eric Biederman <ebiederm@xmission.com>
Date: Thu, 5 Mar 2009 07:46:10 -0800
Subject: [PATCH] macvlan: Reflect macvlan packets meant for other macvlan devices

Switch ports do not send packets back out the same port they came
in on.  This causes problems when using a macvlan device inside
of a network namespace as it becomes impossible to talk to
other macvlan devices.

Signed-off-by: Eric Biederman <ebiederm@aristanetworks.com>
---
 drivers/net/macvlan.c |   92 ++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 68 insertions(+), 24 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index b5241fc..eb2539f 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -29,6 +29,7 @@
 #include <linux/if_link.h>
 #include <linux/if_macvlan.h>
 #include <net/rtnetlink.h>
+#include <net/xfrm.h>
 
 #define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
 
@@ -61,7 +62,8 @@ static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
 }
 
 static void macvlan_broadcast(struct sk_buff *skb,
-			      const struct macvlan_port *port)
+			      const struct macvlan_port *port,
+			      struct net_device *src)
 {
 	const struct ethhdr *eth = eth_hdr(skb);
 	const struct macvlan_dev *vlan;
@@ -77,6 +79,9 @@ static void macvlan_broadcast(struct sk_buff *skb,
 		hlist_for_each_entry_rcu(vlan, n, &port->vlan_hash[i], hlist) {
 			dev = vlan->dev;
 
+			if (dev == src)
+				continue;
+
 			nskb = skb_clone(skb, GFP_ATOMIC);
 			if (nskb == NULL) {
 				dev->stats.rx_errors++;
@@ -99,20 ...

I'm not sure I understand the problem. What's wrong with doing
the same as on transmit, i.e.:

- for multicast/broadcast, deliver everywhere (except self)

- for unicast, deliver to matching local macvlan device or

Pretty much like this :)
--

From: Eric W. Biederman
Date: Monday, March 9, 2009 - 8:48 am

Yes.

There are two tricky parts.

One problem is that macvlans and the primary hardware device share the
same transmit queue.  So when I have a broadcast packet on the primary
devices queue I don't know if I have already sent it out to the
macvlan devices or not.

The second problem arises when I transmit a multicast packet and I
have a local listener.  I believe replicating the packet both at the
ip layer and at the ethernet layer will result in receiving the packet
locally twice.

I'm not certain we need to solve the second problem as having two physical
interfaces plugged into a switch will have the same problem.

The first problem is all about how do we deliver packets everywhere except self.

Eric
--

From: Patrick McHardy
Date: Monday, March 9, 2009 - 8:53 am

So it's about receiving packets on macvlan when transmitting on the
real device? That sounds like a really hard problem that would probably
indeed be better solved by a bridge.

If it's just about whether the packet should be sent out by macvlan
to the wire as well, I'd say yes since that's what two real devices


I think "except self" should just mean "not to the originating virtual
device".
--

From: Eric W. Biederman
Date: Monday, March 9, 2009 - 9:34 am

Yes.  My concern is that if we hook the real device we will
software broadcast packets twice.

Now that I think about it we could call ndo_start_xmit directly
from the macvlan code, and bypass whatever hook we use to
intercept packets going out the normal device.  It should not
be too difficult.

Operationally it would be very nice if arp worked between a macvlan
and the real device.

Eric
--

From: Patrick McHardy
Date: Monday, March 9, 2009 - 9:45 am

It would also require an additional hook in the networking core,

We don't intercept packets on TX, they have to be explicitly delivered
to macvlan.
--

From: Ben Greear
Date: Monday, March 9, 2009 - 11:58 am

It might suck for performance, but mac-vlan could register an 'ALL' protocol
on the physical dev, similar to tcp-dump, to grab pkts on tx and pass the
ones it cares about back up to the vlans?

I'd want run-time control to disable any of these costly options for those that
don't need it, however.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

--

From: Eric W. Biederman
Date: Monday, March 9, 2009 - 2:17 pm

I like that idea.  At least for prototyping.


If well implemented it should not be more expensive than the ingress path,
where we already do that, unless your traffic is highly asymmetric.

Eric
--

From: Ben Greear
Date: Monday, March 9, 2009 - 2:23 pm

Well, the ingress path isn't free, and especially for broadcast pkts it is quite
expensive with large numbers of devices.

In a single namespace implementation, there are very few uses for having two
NICs on the same system able to send to each other since an un-patched kernel
will not do IPv4 traffic between two external ports, and multicast loops back
in software already.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

--

From: Brian Haley
Date: Monday, March 9, 2009 - 11:33 am

If you want a local listener to see the packet you have to set
IP_MULTICAST_LOOP.  A packet from a local source address coming off the "wire"
will be dropped in fib_validate_source(), it would have to come over lo.  I'm
not sure how that relates to how macvlan works, just something I've run into
before (2 nics in same subnet, listener on one, sender on other,
IP_MULTICAST_LOOP=0, no packets).
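
For reference, a small sketch of toggling that socket option from user
space. setsockopt() and getsockopt() with IPPROTO_IP and
IP_MULTICAST_LOOP are the standard sockets API; the two wrapper
functions are hypothetical helpers, and passing the value as an int
follows Linux usage (other systems may expect a different size).

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Enable (non-zero) or disable (0) local loopback of multicast packets
 * sent on this socket. Returns 0 on success, -1 on error.
 */
static int set_multicast_loop(int fd, int enabled)
{
	int val = enabled ? 1 : 0;

	assert(fd >= 0);
	return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &val, sizeof(val));
}

/* Read the current setting back: returns 0 or 1, or -1 on error. */
static int get_multicast_loop(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	assert(fd >= 0);
	if (getsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &val, &len) < 0)
		return -1;
	return val ? 1 : 0;
}
```

The option only controls the software loopback of a sender's own
multicast; it does not change the source validation that drops a
locally-sourced packet arriving off the wire.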

-Brian
--

From: Ben Greear
Date: Monday, March 9, 2009 - 11:54 am

A flag could be added to the skb so that we know it originated from
a mac-vlan.  That shouldn't require any extra hooks but just an extra
check in the mac-vlan rx code to drop any pkt received from the
underlying NIC with this flag set.

For broadcasting (or unicasting) to other mac-vlans or the underlying physical
device, the mac-vlan tx logic could check for local delivery before telling the
lower-level NIC to transmit the pkt.

Since we already have a mac hash, we could probably key off of the dest MAC
fairly easily.  For broadcast, it would be a flood.  We could also add a
flag to mac-vlan tx logic to be clever and only send ARP to the mac-vlans
likely to care.  This might not be a good filter for all possible cases, but
for general cases, and thousands of mac-vlans, it would save a lot of work
cheaply.
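
A very small sketch of the flag idea, in user space and with made-up
names: struct frame stands in for an skb, and from_macvlan is a
hypothetical flag, not an existing sk_buff field. The mac-vlan transmit
path marks the frame, and the receive path refuses any marked frame that
comes back from the underlying NIC, so a frame already delivered locally
is not delivered a second time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb; only the flag matters here. */
struct frame {
	bool from_macvlan;	/* set when a mac-vlan originated this frame */
};

/* Called on the mac-vlan transmit path, after any local delivery. */
static void macvlan_tx_mark(struct frame *f)
{
	assert(f != NULL);
	f->from_macvlan = true;
}

/*
 * Called on the mac-vlan receive path for frames handed up by the
 * underlying NIC. Returns false when the frame must be dropped because
 * a mac-vlan on this host already handled it on transmit.
 */
static bool macvlan_rx_accept(const struct frame *f)
{
	assert(f != NULL);
	return !f->from_macvlan;
}
```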

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

--
