Re: Add PGM protocol support to the IP stack

Previous thread: [PATCH 2/3] can: add support for Janz VMOD-ICAN3 Intelligent CAN module by Ira W. Snyder on Thursday, March 18, 2010 - 9:38 am. (16 messages)

Next thread: [RFC PATCH 0/2] iproute2: Introduce new commands for L2TPv3 unmanaged tunnels by James Chapman on Thursday, March 18, 2010 - 11:14 am. (3 messages)
From: Christoph Lameter
Date: Thursday, March 18, 2010 - 10:58 am

Is there any work in progress on including PGM support (RFC 3208) in the
kernel?

I know about the openpgm implementation. Openpgm does this at the user
level and requires linking to a library. It is essentially a communication
protocol done in user space. It has privilege issues because it has to
create PGM packets via a raw socket, which also has implications for the
achievable performance. Openpgm seems to be able to interoperate with major
commercial implementations of PGM.

I am looking at openpgm right now and it seems that there are a number of
useful files and functions in there that could be used to implement PGM
support in the kernel.

There is also an existing socket API for handling PGM available in another
operating system whose name we rather avoid mentioning. That socket API
could be used as the basis. PGM use would then be possible without a
library and without privilege and performance issues.

PGM support would cover two different modes of communication:


1. Native PGM (allows NAK suppression by Cisco routers to be used)

	socket(AF_INET, SOCK_RDM, IPPROTO_RM)

(SOCK_RDM is defined in the kernel sources but not implemented. PGM
support would implement SOCK_RDM, IPPROTO_RM would need to be defined
according to the IANA protocol number for PGM).


2. PGM over UDP (which is used by many commercial products but not by the
unspeakable OS). No router support for NAK suppression is available. For
this I guess we would have to support

	socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
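
A minimal sketch of how an application might open the two socket types
listed above. SOCK_RDM with these protocols is not implemented in current
kernels, so socket() is expected to fail there. The helper name
pgm_try_open() is hypothetical, and 113 is the IANA protocol number for
PGM, defined locally on the assumption that the libc headers lack it.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPPROTO_PGM
#define IPPROTO_PGM 113	/* IANA protocol number for PGM (assumed absent from libc) */
#endif

/*
 * Hypothetical helper: try to open a PGM socket.
 * over_udp == 0 requests native PGM, anything else PGM over UDP.
 * Returns the descriptor, or -1 with errno set if the kernel
 * does not support the combination.
 */
static int pgm_try_open(int over_udp)
{
	int proto = over_udp ? IPPROTO_UDP : IPPROTO_PGM;

	return socket(AF_INET, SOCK_RDM, proto);
}
```

A caller would check the return value and fall back to a user space
implementation when the kernel rejects the socket type.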

I would be interested to find others who are interested in such a project
or maybe there is already a project in the works? If not then I will try
to come up with some code to get this going. Any help you could offer
would be appreciated.
--

From: Christoph Lameter
Date: Thursday, March 18, 2010 - 2:58 pm

Here is what I have so far after a couple of hours.
Something hacked together from openpgm and udplite.

---
 Documentation/networking/pgm/TODO       |    8
 Documentation/networking/pgm/references |    2
 Documentation/networking/pgm/usage      |   91 ++++
 include/linux/in.h                      |    2
 include/linux/pgm.h                     |  720 ++++++++++++++++++++++++++++++++
 net/ipv4/Kconfig                        |   14
 net/ipv4/Makefile                       |    3
 net/ipv4/pgm.c                          |  143 ++++++
 8 files changed, 983 insertions(+)

Index: linux-2.6/include/linux/in.h
===================================================================
--- linux-2.6.orig/include/linux/in.h	2010-03-18 11:05:24.000000000 -0500
+++ linux-2.6/include/linux/in.h	2010-03-18 15:47:59.000000000 -0500
@@ -44,6 +44,7 @@ enum {
   IPPROTO_PIM    = 103,		/* Protocol Independent Multicast	*/

   IPPROTO_COMP   = 108,                /* Compression Header protocol */
+  IPPROTO_PGM	 = 113,		/* Pragmatic General Multicast		*/
   IPPROTO_SCTP   = 132,		/* Stream Control Transport Protocol	*/
   IPPROTO_UDPLITE = 136,	/* UDP-Lite (RFC 3828)			*/

@@ -51,6 +52,7 @@ enum {
   IPPROTO_MAX
 };

+#define IPPROTO_RM IPPROTO_PGM

 /* Internet address. */
 struct in_addr {
Index: linux-2.6/include/linux/pgm.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/pgm.h	2010-03-18 16:56:19.000000000 -0500
@@ -0,0 +1,720 @@
+/*
+ * PGM packet formats, RFC 3208.
+ *
+ * Copyright (c) 2006 Miru Limited.
+ * Copyright (c) 2010 Christoph Lameter, The Linux Foundation.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that ...
From: Andi Kleen
Date: Friday, March 19, 2010 - 10:18 am

That seems like a poor reason alone to put something into the kernel.
Perhaps you rather need some way to have unprivileged raw sockets?

The classical way to do this is to start suid root, only open
the socket and then drop privileges.
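
A rough sketch of that pattern, assuming a setuid-root binary. The helper
name open_raw_then_drop() is made up for illustration, and a complete
program would also reset its supplementary groups, which is omitted here.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define PGM_PROTO 113	/* IANA protocol number for PGM */

/*
 * Hypothetical helper for a setuid-root binary: open the raw socket
 * first, then permanently return to the real user and group IDs.
 * Stores the raw socket (or -1 if it could not be opened) in *fd_out.
 * Returns 0 once privileges are dropped, -1 if dropping them failed.
 */
static int open_raw_then_drop(int *fd_out)
{
	*fd_out = socket(AF_INET, SOCK_RAW, PGM_PROTO);	/* needs CAP_NET_RAW */

	/* Drop the group first; after setuid() this may no longer be allowed. */
	if (setgid(getgid()) != 0)
		return -1;
	if (setuid(getuid()) != 0)
		return -1;

	/* Verify that the effective IDs now match the real ones. */
	if (geteuid() != getuid() || getegid() != getgid())
		return -1;
	return 0;
}
```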

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: David Miller
Date: Friday, March 19, 2010 - 2:53 pm

From: Andi Kleen <andi@firstfloor.org>

I completely agree.

We should be able to make a way for unprivileged users to
use RAW sockets in some limited capacity, for cases like this.

But I also don't consider what openpgm has to do right now to
be all that much of a restriction.  You need privileges to
add the protocol to the kernel, you need privileges to run
the userspace variant, there is no real difference.
--

From: H. Peter Anvin
Date: Friday, March 19, 2010 - 3:26 pm

The real difference is if multiplex is needed between multiple
unprivileged users.

	-hpa
--

From: Christoph Lameter
Date: Monday, March 22, 2010 - 7:24 am

It is needed. PGM ports exist and work similarly to UDP and TCP ports.

PGM as provided by openpgm and other solutions avoids native PGM and
instead uses PGM over UDP. But the routers do not support PGM over UDP in
the same way as native PGM. So the NAK suppression and other advanced
features available in Juniper and Cisco switches cannot be used.

openpgm can work with the native PGM protocol via a raw socket but then
one cannot run multiple processes communicating via different ports
effectively.

The fragmentation of packets and the assembly etc in user space is a pain.

--

From: Christoph Lameter
Date: Monday, March 22, 2010 - 7:20 am

Not the only reason. There are also performance implications. NAKing and
other control messages from user space are a pain and the available
implementations add numerous threads just to control the timing of control
messages and the expiration of data etc. It's difficult to listen to a PGM
port from user space. You have to get all messages for the PGM protocol
and then filter in each process.
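
To illustrate the per-process filtering, here is a small sketch of the
check each user space receiver would have to run on every PGM packet it
gets from the raw socket. It relies on the fixed 16-byte PGM header from
RFC 3208 (source port, destination port, type, options, checksum, GSI,
TSDU length). The function name pgm_dport_matches() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PGM_HDR_LEN 16	/* fixed PGM header size per RFC 3208 */

/*
 * Return 1 if buf holds a PGM header whose destination port equals port.
 * buf must point at the PGM header, i.e. past the IP header that a raw
 * IPv4 socket delivers in front of it.
 */
static int pgm_dport_matches(const uint8_t *buf, size_t len, uint16_t port)
{
	uint16_t dport;

	if (len < PGM_HDR_LEN)
		return 0;
	/* Bytes 2-3 carry the destination port in network byte order. */
	dport = (uint16_t)((buf[2] << 8) | buf[3]);
	return dport == port;
}
```

Every process listening on a raw socket pays this cost for all PGM traffic
on the host, including packets for ports it has no interest in.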


Yes, those solutions exist, and the experience with their limitations is
the reason to try to get PGM in the kernel.
--

From: Andi Kleen
Date: Monday, March 22, 2010 - 9:36 am

Ok that sounds like a good reason to have a kernel protocol.
Thanks.

Multicast reliable kernel protocols are somewhat new, I guess one
would need to make sure to come up with a clean generic interface 
for them first.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Christoph Lameter
Date: Monday, March 22, 2010 - 9:51 am

It has been around for a long time in another OS. I wonder if I should use
the socket API realized there as a model or come up with something new
from scratch?

What I have right now is:

1. Opening a socket

        A. Native PGM

                fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM)

        B. PGM over UDP

                fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)

        C. PGM over SHM (?)

                fd = socket(AF_UNIX, SOCK_RDM, 0)


2. Binding to a multicast address

        A. Sender

                Connect the socket to a MC address and port using connect().

                Note that the port is significant since multiple streams on different
                ports can be run over the same MC addr.

        B. Receiver

                I. Bind the socket to the MC address and port of interest.

                II. Listen to the socket.

                        Process will wait until a PGM packet destined to the port of interest
                        is received.

                III. Accept a connection.

                        Establishes a session. Data can then be received.


3. Sending and receiving

        Use the usual socket read and write operations and the various flavors of waiting
        for a packet via select, poll, epoll etc.

        Packet sizes are determined by the number of packets in a single sendmsg() unless
        overridden by the RM_SET_MESSAGE_BOUNDARY socket option.

        The sender will block when the send window is full unless a non blocking write is performed.

        The receiver shows the usual wait semantics. If the stream is set to unreliable then
        packets may arrive in random order. If the socket is set to RM_LISTEN_ONLY then packets may
        just be missing.

4.      Transmitter Socket Options


        A. Setting the window size / rate.

                struct pgm_send_window x;
                x.RateKbitsPerSec = 56;
                x.WindowSizeInMsecs = 60000;
 ...
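
The rate and the window length together determine how much data the sender
has to retain for retransmission. A rough sketch of that arithmetic,
assuming 1 kbit = 1000 bits; struct example_send_window and window_bytes()
are illustrative stand-ins, since the real struct pgm_send_window layout is
not shown in this message.

```c
#include <stdint.h>

/*
 * Illustrative stand-in for the two fields shown above; the real
 * struct pgm_send_window layout is not given in this message.
 */
struct example_send_window {
	uint32_t RateKbitsPerSec;
	uint32_t WindowSizeInMsecs;
};

/*
 * Bytes the sender must retain to cover the whole window,
 * assuming 1 kbit = 1000 bits:
 *   (kbit/s * 1000 / 8) bytes/s * (msecs / 1000) s = kbit/s * msecs / 8
 */
static uint64_t window_bytes(const struct example_send_window *w)
{
	return (uint64_t)w->RateKbitsPerSec * w->WindowSizeInMsecs / 8;
}
```

With 56 kbit/s and 60000 ms this comes to 420000 bytes. The same 60 second
window at 1 Gbit/s would need 7.5 GB, so the two values have to be chosen
together.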
From: Andi Kleen
Date: Monday, March 22, 2010 - 10:43 am

If the other API doesn't have a serious flaw I guess it's better





That's a very large buffer for a socket. It would be better to use the usual


It's difficult to maintain 64 bit counters on 32bit hosts on all targets.
But I guess it would be ok to only fill in 32bit in this case.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Christoph Lameter
Date: Monday, March 22, 2010 - 11:07 am

RDM is Reliable Datagram Multicast I believe. I'd rather have SOCK_PGM if

Multiple processes would communicate via shm segments. Maybe defer to the
future but it's an important operation mode as the systems grow bigger and bigger.
SHM segment would have to contain some sort of ring buffer that the
receivers could tap into. But that mode has not really been thought


No idea why it was implemented. It can be used to use send() for portions
of a message. Triggers the send() only when all bytes have been provided.
Probably necessary if one wants to have very long (megabytes) messages.

Reliable multicast protocols have a defined time period / "reliability
buffer" so that they can resend a message that was missed for a time
period. It is customary to either specify a time period or define the size


32 bit counters have the awful habit of overflowing.
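
A quick sketch of how fast that happens, assuming the counter in question
counts bytes. The function name wrap_seconds() is made up for this example.

```c
#include <stdint.h>

/*
 * Seconds until an unsigned byte counter that is `bits` wide wraps
 * around when bytes arrive at a steady bytes_per_sec.
 */
static double wrap_seconds(unsigned bits, double bytes_per_sec)
{
	double range = 1.0;
	unsigned i;

	/* Compute 2^bits as a double; exact for the widths used here. */
	for (i = 0; i < bits; i++)
		range *= 2.0;
	return range / bytes_per_sec;
}
```

At 10 Gbit/s (1.25e9 bytes per second) a 32 bit byte counter wraps after
roughly 3.4 seconds, while a 64 bit counter lasts for centuries.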
--

From: Andi Kleen
Date: Monday, March 22, 2010 - 11:53 am

AF_UNIX is not SHM today.

The only point is to avoid one copy? (user1 -> kernel -> user2  to user1 -> user2) 
Not sure if that is really worth it. Don't you need another copy to the reliability
buffer anyways?

Letting kernel parse a data structure in user defined memory is also
always somewhat tricky.

But in principle AF_INET over localhost should not be that much less efficient
than AF_UNIX, so you can probably drop it for now (unless you need special AF_UNIX

Those could be a problem in kernel memory consumption. One would need
to be very careful to have a good memory management scheme for the socket

One problem is memory management then. What happens when a process opens 100 of those
sockets and fills them all?


There's just no portable atomic64_t. Ok maybe you can use the socket lock
to synchronize all the counts if they are only per socket.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Christoph Lameter
Date: Monday, March 22, 2010 - 12:32 pm

Not sure either. Access of multiple processes to one reliability buffer

Well let's skip it for now and see if there are performance implications in


Pushes out the app? Same as the user space apps now. Some sort of

Yes.
--

From: Christoph Lameter
Date: Friday, March 26, 2010 - 10:33 am

Here is a pgm.7 manpage describing what the socket API could look like for
a PGM implementation.

I dumped the RM_* based socket options from the other OS since most of the
options were unusable.



.\" This man page is Copyright (C) 2010 Christoph Lameter <cl@linux-foundation.org>.
.\" Permission is granted to distribute possibly modified copies
.\" of this page provided the header is included verbatim,
.\" and in case of nontrivial modification author and date
.\" of the modification is added to the header.
.\"
.TH PGM  7 2010-08-01 "Linux" "Linux Programmer's Manual"
.SH NAME
pgm \- Pragmatic General Multicast Protocol Support for IPv4
.SH SYNOPSIS
.B #include <sys/socket.h>
.br
.B #include <netinet/in.h>
.br
.B #include <linux/pgm.h>
.sp
.B pgm_socket = socket(AF_INET, SOCK_RDM, IPPROTO_PGM);
.br
.B pgm_socket = socket(AF_INET, SOCK_RDM, IPPROTO_UDP);
.SH DESCRIPTION
This is an implementation of the Pragmatic General Multicast Protocol
described in RFC\ 3208.
PGM implements a connection oriented, Reliable Datagram Messaging
(thus SOCK_RDM) protocol. Packets are delivered in order even though the
network may
have reordered, duplicated or dropped packets. Receivers may ask for
retransmission of missed packets (NAK). Transmitters do not keep receiver
state so that an individual sender is able to interact with an unlimited
number of receivers.
The recovery mechanism of PGM can limit the scalability of PGM if too
many receivers are NAKing. Therefore measures exist at various layers
to reduce the potential repair volume that a transmitter may have to
deal with.

PGM supports two variants. The first one is the
.B native PGM protocol
which uses its own IP protocol implementation at the same level as TCP and UDP.
Native PGM supports NAK suppression ("assist") by network elements (Cisco,
Juniper and other commercially available routers have support for PGM) which
is an important measure to reduce the NAK volume in case of packet loss during
multicast replication of ...
From: Andi Kleen
Date: Saturday, March 27, 2010 - 6:11 am

I did a quick read and the manpage/interface seem reasonable to me.

You changed the parameter struct fields to lower case. While
that looks definitely more Linuxy than before does it mean programs
have to #ifdef this? It might be a good idea to have at least some
optional compat header that #defines.

-Andi
--

From: Martin Sustrik
Date: Saturday, March 27, 2010 - 9:54 am

You may also have a look at original PGM implementation by Luigi Rizzo 
(FreeBSD). It's not maintained, but it might give you broader view.

http://info.iet.unipi.it/~luigi/pgm-code/

Martin
--

From: Christoph Lameter
Date: Monday, March 29, 2010 - 7:50 am

Interesting. Which files in that directory contain the most current code?

Looks like the tcpdump patch has been merged.

Here is another tcpdump patch that implements decoding PGM via UDP. Anyone
know how to submit something like that?

(Need to specify -Tpgm option to use pgm decoder on UDP traffic)

Index: tcpdump/interface.h
===================================================================
--- tcpdump.orig/interface.h	2010-02-26 18:50:39.411609391 -0600
+++ tcpdump/interface.h	2010-02-26 18:51:04.270350179 -0600
@@ -74,6 +74,7 @@
 #define PT_CNFP		7	/* Cisco NetFlow protocol */
 #define PT_TFTP		8	/* trivial file transfer protocol */
 #define PT_AODV		9	/* Ad-hoc On-demand Distance Vector Protocol */
+#define PT_PGM		10	/* The PGM protocol */

 #ifndef min
 #define min(a,b) ((a)>(b)?(b):(a))
Index: tcpdump/print-udp.c
===================================================================
--- tcpdump.orig/print-udp.c	2010-02-26 18:51:35.921610552 -0600
+++ tcpdump/print-udp.c	2010-02-26 18:53:54.440349950 -0600
@@ -520,6 +520,11 @@
 			tftp_print(cp, length);
 			break;

+		case PT_PGM:
+			udpipaddr_print(ip, sport, dport);
+			pgm_print(cp, length, (const u_char *)ip);
+			break;
+
 		case PT_AODV:
 			udpipaddr_print(ip, sport, dport);
 			aodv_print((const u_char *)(up + 1), length,
Index: tcpdump/tcpdump.c
===================================================================
--- tcpdump.orig/tcpdump.c	2010-02-26 18:37:13.971601597 -0600
+++ tcpdump/tcpdump.c	2010-02-26 18:37:43.290033748 -0600
@@ -854,6 +854,8 @@
 				packettype = PT_TFTP;
 			else if (strcasecmp(optarg, "aodv") == 0)
 				packettype = PT_AODV;
+			else if (strcasecmp(optarg, "pgm") == 0)
+				packettype = PT_PGM;
 			else
 				error("unknown packet type `%s'", optarg);
 			break;
--

From: Christoph Lameter
Date: Monday, March 29, 2010 - 8:00 am

Thanks. I will then proceed to get a patch out that implements the

The socket API will be completely different. The basic handling of the
sockets is the same (binding, listening, connecting). There is no way of
mapping M$ socket options to Linux socket options with the approach that
I proposed in the manpage. The stats structure is different too since some
key elements were missing.

What users are there of the M$ api? I have seen vendors supplying their
own pgm implementation (guess due to bit rot in the old M$
implementation).


--

From: Andi Kleen
Date: Monday, March 29, 2010 - 2:43 pm

I don't know, it was just a general consideration.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: H. Peter Anvin
Date: Monday, March 29, 2010 - 4:01 pm

In 2.6.34 there is (although some arches which could support it natively
don't as of yet... but that's fixable.)  See lib/atomic64.c.

	-hpa



--

From: Christoph Lameter
Date: Tuesday, March 30, 2010 - 11:12 am

There are also the 64bit thiscpu operations that were merged in 2.6.33.
They do the right thing if the arch does not provide operations.


--
