Re: Distributed Switch Architecture (DSA)

Previous thread: [net-next-2.6 PATCH 1/4] e1000e: avoid polling h/w registers during link negotiation by Jeff Kirsher on Thursday, June 17, 2010 - 9:58 pm. (6 messages)

Next thread: [v3 Patch 2/2] mlx4: add dynamic LRO disable support by Amerigo Wang on Friday, June 18, 2010 - 3:55 am. (5 messages)

From: Joakim Tjernlund
Date: Friday, June 18, 2010 - 12:06 am

I am trying to wrap my head around DSA and I need some help.

Assume the example from Lennert:

		 +-----------+       +-----------+
		 |           | RGMII |           |
		 |           +-------+           +------ 1000baseT MDI ("WAN")
		 |           |       |  6-port   +------ 1000baseT MDI ("LAN1")
		 |    CPU    |       |  ethernet +------ 1000baseT MDI ("LAN2")
		 |           |MIImgmt|  switch   +------ 1000baseT MDI ("LAN3")
		 |           +-------+  w/5 PHYs +------ 1000baseT MDI ("LAN4")
		 |           |       |           |
		 +-----------+       +-----------+

If I understand this correctly I get at least 5 virtual I/Fs corresponding
to WAN, LAN1-4, but how is the RGMII I/F modelled?
I guess I will have one "real" ethX I/F which maps to RGMII but do I get one
virtual I/F too?
What use are these virtual I/Fs? Just to read status from the corresponding
ports? Can one TX and RX network pkgs over these I/Fs too?

Now I want to add STP/RSTP to the switch. How would one do that?

 Jocke

--

From: Lennert Buytenhek
Date: Friday, June 18, 2010 - 12:33 am

The RGMII interface is just the interface that your "real" network
driver exports.  In the case of the Kirkwood 6281 A0 Reference Design
(which I developed this code on), that would be eth0.  After the DSA
driver is instantiated, you don't send or receive over eth0 directly

You get a virtual interface for each of the ports on the switch (that
are not CPU or inter-switch ports), i.e. all ports on the right of the
diagram -- wan, lan1, lan2, lan3, lan4.  These interfaces are created

That's one of the purposes, yes.  There's a polling routine that
periodically checks the status of each of the ports on the switch (via
the MII management interface) and feeds back that status to the virtual
interfaces.  I.e. if you plug a cable into lan3, you'll see a syslog
message about the link on the virtual interface lan3 having come up,
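The polling idea can be sketched roughly as follows. This is only an illustration in Python, not the kernel code: poll_links() and the read_link callback are hypothetical names, and the dictionary stands in for the per-port link status that would really be read over the MII management interface.

```python
from typing import Callable, Dict, List


def poll_links(read_link: Callable[[str], bool],
               last_state: Dict[str, bool]) -> List[str]:
    """Compare each port's current link state with the last one seen.

    read_link is a stand-in for a read over the MII management
    interface; it returns True when the port has link.
    """
    messages = []
    for port in sorted(last_state):
        up = read_link(port)
        if up != last_state[port]:
            # Only state changes are reported to the virtual interface.
            messages.append(f"{port}: link {'up' if up else 'down'}")
            last_state[port] = up
    return messages


# Simulated hardware: a cable has just been plugged into lan3.
hw = {"wan": False, "lan1": False, "lan2": False, "lan3": True, "lan4": False}
state = {name: False for name in hw}

first = poll_links(hw.__getitem__, state)   # reports the lan3 change
second = poll_links(hw.__getitem__, state)  # nothing changed since
print(first, second)
```

Reporting only changes, rather than the full state on every poll, is what keeps the syslog quiet while nothing happens.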


First, you'll want the hardware bridging patches that I posted to
netdev@ a while back, e.g.:

	http://patchwork.ozlabs.org/patch/16578/

They aren't upstream-mergeable in their current form, but they
do the job.  These will propagate brctl addif/delif calls into the switch
chip, so that switching between those ports will be done in hardware.

Now if all you want is regular STP, with that patch you'll be done --
the ->bridge_set_stp_state() hook propagates the spanning tree state of
each of the DSA virtual interfaces into the switch chip automatically.

If you want to use a userspace STP implementation, you'll just have to
make sure that STP state (listening/learning/blocking/forwarding/etc) is
correctly propagated to the switch chip similarly to how it's done in the
patch.
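As a rough sketch of what "propagating STP state to the switch chip" amounts to, here is a small Python model. The BR_STATE_* values are the ones from linux/if_bridge.h; the HW_* codes, FakeSwitch and propagate_stp_state() are hypothetical and only illustrate the mapping, they are not the interface used by the patch.

```python
# Linux bridge port states (values as defined in linux/if_bridge.h).
BR_STATE_DISABLED = 0
BR_STATE_LISTENING = 1
BR_STATE_LEARNING = 2
BR_STATE_FORWARDING = 3
BR_STATE_BLOCKING = 4

# Hypothetical hardware port-state codes; real values come from the chip docs.
HW_DISABLED, HW_BLOCKING, HW_LEARNING, HW_FORWARDING = 0, 1, 2, 3

# Assumption: the chip has no separate "listening" state, so listening
# is programmed the same way as blocking.
STP_TO_HW = {
    BR_STATE_DISABLED: HW_DISABLED,
    BR_STATE_LISTENING: HW_BLOCKING,
    BR_STATE_BLOCKING: HW_BLOCKING,
    BR_STATE_LEARNING: HW_LEARNING,
    BR_STATE_FORWARDING: HW_FORWARDING,
}


class FakeSwitch:
    """Stand-in for the switch chip's per-port state registers."""

    def __init__(self):
        self.port_state = {}

    def set_port_state(self, port, hw_state):
        self.port_state[port] = hw_state


def propagate_stp_state(switch, port, br_state):
    """Translate a bridge STP state and write it to the (fake) chip."""
    switch.set_port_state(port, STP_TO_HW[br_state])


sw = FakeSwitch()
propagate_stp_state(sw, "lan1", BR_STATE_FORWARDING)
propagate_stp_state(sw, "lan2", BR_STATE_LISTENING)
print(sw.port_state)
```

A userspace STP implementation would need to do the equivalent of propagate_stp_state() each time it changes a port's state.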

(Ideally, these patches should be reworked to receive bridge configuration
and port status changes via netlink.  Unfortunately, I was asked to return
all my Marvell hardware when I left Marvell, so someone else will have to
do this work.)
--

From: Joakim Tjernlund
Date: Friday, June 18, 2010 - 2:15 am

hmm, but how do I send normal pkgs from the CPU to the switch then?
I envision I would get some interface in the CPU I can set an IP address
on and use as a normal I/F which would be switched by the HW switch to

TX:ing pkgs on such virtual I/F would go directly to the port, bypassing
normal switching?
What about RX? What decides which pkg to route through the switch and

I see, will have to study this a bit closer. One question though,
does this disable MAC learning in the linux bridge?

Do you have any idea how to do DSA on a Broadcom switch?
The control plane is attached via PCI and has a big

--

From: Lennert Buytenhek
Date: Friday, June 18, 2010 - 2:59 am

Yes, these are the DSA/slave interfaces created by net/dsa/slave.c.
You are free to attach IP addresses to the wan/lanX interfaces, and


By default, which is until you enable bridging on some subset of the
ports, all ports have their own address database, and all received
packets are passed directly up to the CPU, where the DSA code will


I have no idea.  When I originally submitted the DSA code for merging,
I contacted Broadcom people about adding support for Broadcom switch
chips to it, but I never heard back from them.
--

From: Joakim Tjernlund
Date: Friday, June 18, 2010 - 4:09 am

An ethernet broadcast pkg flooded onto all ports.
A normal ethernet host DST address would be looked up by


ah, so until I enable bridging, all ports are viewed as a separate
network I/F?
Once I create a linux bridge device and add the virtual I/Fs, one
enables the bridge function.
One drawback with that is that you kill the bridge when you reboot

Doesn't the HW switch handle all MAC learning? Why duplicate
this in the SW bridge?

OK. With DSA, how does one configure VLANs, policing and parameters in the
HW switch that don't map or exist in the linux bridge?

 Jocke

--

From: Lennert Buytenhek
Date: Friday, June 18, 2010 - 5:12 am

This statement assumes that all ports have been configured into a
bridge, which is not the default case.  (And why would it be?  Having each
port in the same VLAN/subnet is only one of the many possible ways of
configuring your switch ports -- and regular (non-DSA) Linux network
interfaces aren't bridged together by default either.)  I.e. after boot,

In current upstream kernels, if you in fact bridge all switch ports
together using Linux bridging, this address lookup will be done by the

That the DSA interfaces will behave just like non-DSA Linux network

Yes.  The original DSA commit message says as much:

    The switch driver presents each port on the switch as a separate

Yes and no.  Right now there is no hardware switch offload code in the
upstream kernel, so all bridging will still be done in software.  You
will need something along the lines of the patch I pointed you to to

With the hardware bridging patch, hardware bridging will continue if
you don't break down your br0 interface before rebooting.  (Of course,
your board might still have a hardware reset line that resets the

Imagine the case where you bridge lan1, lan2 (both on the switch chip)
into br0, together with wlan0 (which is not on the switch chip).

Now a packet is sent out of br0.  Should it be sent to wlan0 or to the
switch chip?  How will you make this decision without an address database
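To make the role of the address database concrete, here is a minimal learning-bridge sketch in Python. It is not the Linux bridge implementation; AddressDatabase and its methods are hypothetical, and aging, VLANs and broadcast/multicast handling are left out.

```python
from typing import Dict, Iterable, List, Optional


class AddressDatabase:
    """Toy forwarding database: MAC address -> port it was last seen on."""

    def __init__(self, ports: Iterable[str]):
        self.ports = list(ports)
        self.table: Dict[str, str] = {}

    def learn(self, src_mac: str, in_port: str) -> None:
        # Remember which port a source address arrived on.
        self.table[src_mac] = in_port

    def egress_ports(self, dst_mac: str,
                     in_port: Optional[str] = None) -> List[str]:
        known = self.table.get(dst_mac)
        if known is not None:
            # Never send a frame back out of the port it came in on.
            return [] if known == in_port else [known]
        # Unknown destination: flood to every port except the ingress one.
        return [p for p in self.ports if p != in_port]


# br0 contains two switch ports and one non-switch interface.
fdb = AddressDatabase(["lan1", "lan2", "wlan0"])

# A packet sent out of br0 itself (no ingress port) to an unknown host.
before = fdb.egress_ports("00:11:22:33:44:55")

# After a frame from that host has been seen on wlan0, br0 knows where to go.
fdb.learn("00:11:22:33:44:55", "wlan0")
after = fdb.egress_ports("00:11:22:33:44:55")
print(before, after)
```

Without the table, every packet from br0 would have to be sent to both wlan0 and the switch chip.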

The idea is to use existing kernel interfaces for this as much as
possible.  So e.g. if you do:

	vconfig add lan1 123
	vconfig add lan2 123
	brctl addbr br123
	brctl addif br123 lan1.123
	brctl addif br123 lan2.123

Then the DSA code (or some userspace netlink listener helper, or some
combination of both) should ideally also detect that VLAN 123 on
interfaces lan1 and lan2 are to be bridged together, and program the
switch chip accordingly.  I think all VLAN configurations that at least
the Marvell hardware supports can be expressed this way.
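A userspace helper could derive the switch programming from the bridge membership along these lines. This is only a sketch: vlan_membership() is a hypothetical function, the bridge-to-interface dictionary stands in for whatever a netlink listener would collect, and it assumes the "port.vid" naming that vconfig produces above.

```python
from collections import defaultdict


def vlan_membership(bridges):
    """Map (bridge name, VLAN ID) -> sorted list of switch ports.

    bridges maps a bridge name to the interfaces enslaved to it,
    e.g. {"br123": ["lan1.123", "lan2.123"]}.
    """
    members = defaultdict(set)
    for bridge, interfaces in bridges.items():
        for ifname in interfaces:
            port, sep, vid = ifname.partition(".")
            # Only VLAN sub-interfaces of the form "port.vid" are considered.
            if sep and vid.isdigit():
                members[(bridge, int(vid))].add(port)
    return {key: sorted(ports) for key, ports in members.items()}


config = {"br123": ["lan1.123", "lan2.123"]}
plan = vlan_membership(config)
print(plan)
```

Each entry in the result says which ports should be bridged together in hardware for a given VLAN; turning that into register writes is chip-specific and not shown.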

To configure things like ingress/egress rate limiting and ...
--

From: Joakim Tjernlund
Date: Friday, June 18, 2010 - 8:13 am

Yes, I am getting there mentally. I just have a hard time letting go of
viewing the HW switch as an external entity :)


hmm, one will have to recreate the exact config in several steps (create br0, add each

True, in this case you need it, but for only HW switch I/Fs you don't
need it and there can be several hundreds of MAC addresses passing
through the HW switch. It would be nice if one didn't need to pass
all those up to the SW bridge, especially if you have a small embedded

Yes, but I imagine that this breaks down when you want to do something a bit more
advanced. For example I don't think linux VLANs support "shared VLAN learning" (SVL)
and to configure a HW switch to do SVL one would first have to impl.
that in Linux VLAN and then add the DSA code to get the config to the switch.

Not sure how one would express whether VLAN tags should be stripped off or not when
egressing the HW switch's physical port.

Furthermore, suppose one has a big HW switch, 48 ports, and lots of VLANs in that
HW switch one would have to create a lot of virtual I/Fs and VLANs in linux

Yes, there are aspects of a HW switch that don't map into DSA currently.
Perhaps one should add some framework to support this?

     Jocke

--

From: Lennert Buytenhek
Date: Friday, June 18, 2010 - 1:12 pm

I think you overestimate the effect that address learning will have on
the host CPU.  It only needs to happen for the first packet for every
new MAC address, and address flooding attacks are something you'll need
to address in either case.

If you're really worried about this scenario, then just configure your
boot loader to bridge all switch ports together, and don't load the DSA
driver.  The switch will then appear as a single interface, 'eth0' (or
whatever your SoC calls it), over which you can talk directly without
any form of tagging.  You won't be able to use any advanced features,

Yes.  But that's really the best way to do it, in my humble opinion.

If you don't go the host networking stack integration route, you end
up with something like the vendor drivers.  Which work fine for most
scenarios.. until you want to do something like talking TCP/IP using
the host TCP stack over some of the switch ports, at which point the

If you transmit a packet onto 'lan', it will be sent to the switch chip
with an "untagged" DSA tag.  If you transmit a packet onto 'lan.123',
it will be sent to the switch chip with a "tagged" DSA tag.  See
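The untagged/tagged distinction can be sketched like this. The byte layout below is an assumption modelled on the 4-byte "from CPU" tag as built in net/dsa/tag_dsa.c and should be checked against that file; from_cpu_tag() is a hypothetical helper, not a kernel function.

```python
def from_cpu_tag(device, port, vlan_tci=None):
    """Build a 4-byte "from CPU" DSA tag (layout is an assumption).

    device, port: switch chip index and port number (0..31 each).
    vlan_tci: the 16-bit 802.1Q TCI of the packet, or None when the
    packet is sent on the plain port interface rather than a VLAN one.
    """
    if not (0 <= device < 32 and 0 <= port < 32):
        raise ValueError("device and port must be in 0..31")

    if vlan_tci is None:
        # "Untagged" tag: no VLAN information is carried.
        return bytes([0x40 | device, port << 3, 0x00, 0x00])

    if not 0 <= vlan_tci <= 0xFFFF:
        raise ValueError("vlan_tci must fit in 16 bits")

    byte1 = port << 3
    hi = (vlan_tci >> 8) & 0xFF
    lo = vlan_tci & 0xFF
    if hi & 0x10:
        # The CFI bit moves out of the TCI field into byte 1.
        byte1 |= 0x01
        hi &= 0xEF
    # "Tagged" tag: the last two bytes carry priority and VLAN ID.
    return bytes([0x60 | device, byte1, hi, lo])


untagged = from_cpu_tag(0, 2)        # e.g. a packet sent on the port itself
tagged = from_cpu_tag(0, 2, 123)     # e.g. a packet sent on VLAN 123
print(untagged.hex(), tagged.hex())
```

In both cases the tag names the destination port explicitly, so the packet goes straight to that port instead of through the switch's address lookup.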

Where the 'resource waste' is on the order of a couple of tens or
hundreds of kilobytes of RAM.  If this is a problem for your host

Sounds good.
--

From: Joakim Tjernlund
Date: Saturday, June 19, 2010 - 7:22 am

I will buy that for the moment. I can't see a better way either if
you truly want to integrate a HW switch into linux. I just wish


Ah, now I get it, thanks.
However, how does this work for LAN to LAN pkgs? LAN1 and LAN2 could be
in the same VLAN but one is implicit (port) VLAN and the

That is not a very good argument, this is how bloat builds.

Any idea what such a framework should look like? What transport
mechanism is suitable to talk to a user space daemon?

--

From: Lennert Buytenhek
Date: Saturday, June 19, 2010 - 9:56 am

Most people deal with this by running a userland STP daemon that uses
raw sockets to inject manually (i.e. in userspace) DSA-tagged packets
onto the eth0 (or whatever) interface.  This "works" (for some
definitions of 'works') for UDP apps such as a DHCP server as well --

If you tell the HW switch to forward these packets, they will never
appear at the CPU interface, so the DSA tagging/untagging doesn't enter

Tell the switch that the vlan is native on one of the ports but not on
the other.  It's been a while since I looked at the chip docs but there

If you have a better way of getting all the features while spending
less resources, please step forward with your ideas.  The current design
is the best I could come up with, but I'm sure it's not optimal in its

Have a look at netlink.
--

From: Joakim Tjernlund
Date: Saturday, June 19, 2010 - 11:48 am

"tell the HW switch"? Doesn't DSA do that already? If not, what
is the point of DSA then if it doesn't use the native forwarding

The current DSA impl. does not support this? There should be some

I don't, I am not that familiar with the inner workings of Linux

I was afraid you would say that, I have no experience with netlink :)

--

From: Lennert Buytenhek
Date: Saturday, June 19, 2010 - 11:57 am

The point is and always was to provide a framework for proper integration
of hardware switch chips into the Linux kernel.  This framework doesn't
become useless just because it doesn't already support every single

Have you even tried the DSA code?
--

From: Joakim Tjernlund
Date: Sunday, June 20, 2010 - 7:41 am

Right, sorry if I sounded a bit harsh.

So DSA currently does a very minimal config of the HW switch to get
things going.
If one wants to do something fancier, one has to
add a control plane to DSA which would possibly talk

Not yet and I don't have any MV HW either :(

--

From: Lennert Buytenhek
Date: Monday, July 5, 2010 - 10:24 am

Yes and no -- yes in the sense that if you want to use more functionality
of the switch chip, you'll have to add some code that extracts that info
from the Linux network interface config and turns it into commands for the
switch chip, and no in the sense that I'm not sure yet what the best way
to implement this would be.  (Doing it all in userspace is one option.)
--
