Is there any work in progress on including PGM support (RFC 3208) in the kernel?

I know about the openpgm implementation. Openpgm does this at the user level and requires linking to a library. It is essentially a communication protocol done in user space. It has privilege issues because it has to create PGM packets via a raw socket, which also has implications for the achievable performance. Openpgm seems to be able to interoperate with major commercial implementations of PGM.

I am looking at openpgm right now and it seems that there are a number of useful files and functions in there that could be used to implement PGM support in the kernel. There is also an existing socket API for handling PGM available in another operating system whose name we rather avoid mentioning. That socket API could be used as the basis. PGM use would then be possible without a library and without privilege and performance issues.

PGM support would support two different modes of communication:

1. Native PGM (allows NAK suppression by Cisco routers to be used)

socket(AF_INET, SOCK_RDM, IPPROTO_RM)

(SOCK_RDM is defined in the kernel sources but not implemented. PGM support would implement SOCK_RDM; IPPROTO_RM would need to be defined according to the IANA protocol number for PGM.)

2. PGM over UDP (which is used by many commercial products but not by the unspeakable OS). No router support for NAK suppression is available. For this I guess we would have to support

socket(AF_INET, SOCK_RDM, IPPROTO_UDP)

I would be interested to find others who are interested in such a project, or maybe there is already a project in the works? If not then I will try to come up with some code to get this going. Any help you could offer would be appreciated.
--
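For illustration, the two modes described above could be probed from user space roughly like this. It is only a sketch: PGM_IPPROTO, enum pgm_mode, pgm_mode_protocol() and pgm_try_open() are made-up names, 113 is the IANA protocol number for PGM, and on a kernel without SOCK_RDM support for AF_INET the socket() call is simply expected to fail.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* IANA protocol number for PGM. Named locally because the libc headers
 * may not provide IPPROTO_PGM / IPPROTO_RM (assumption). */
#define PGM_IPPROTO 113

/* Hypothetical names for the two modes discussed above. */
enum pgm_mode { PGM_MODE_NATIVE, PGM_MODE_UDP };

/* Map a mode to the protocol argument of socket(). */
int pgm_mode_protocol(enum pgm_mode mode)
{
	assert(mode == PGM_MODE_NATIVE || mode == PGM_MODE_UDP);
	return mode == PGM_MODE_NATIVE ? PGM_IPPROTO : IPPROTO_UDP;
}

/* Try to open a SOCK_RDM socket for the given mode.
 * Returns the fd on success or a negative errno value on failure. */
int pgm_try_open(enum pgm_mode mode)
{
	int fd = socket(AF_INET, SOCK_RDM, pgm_mode_protocol(mode));

	return fd >= 0 ? fd : -errno;
}
```

A caller would check for a negative return value and fall back to a user space library such as openpgm in that case.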
Here is what I have so far after a couple of hours.
Something hacked together from openpgm and udplite.
---
Documentation/networking/pgm/TODO | 8
Documentation/networking/pgm/references | 2
Documentation/networking/pgm/usage | 91 ++++
include/linux/in.h | 2
include/linux/pgm.h | 720 ++++++++++++++++++++++++++++++++
net/ipv4/Kconfig | 14
net/ipv4/Makefile | 3
net/ipv4/pgm.c | 143 ++++++
8 files changed, 983 insertions(+)
Index: linux-2.6/include/linux/in.h
===================================================================
--- linux-2.6.orig/include/linux/in.h 2010-03-18 11:05:24.000000000 -0500
+++ linux-2.6/include/linux/in.h 2010-03-18 15:47:59.000000000 -0500
@@ -44,6 +44,7 @@ enum {
IPPROTO_PIM = 103, /* Protocol Independent Multicast */
IPPROTO_COMP = 108, /* Compression Header protocol */
+ IPPROTO_PGM = 113, /* Pragmatic General Multicast */
IPPROTO_SCTP = 132, /* Stream Control Transport Protocol */
IPPROTO_UDPLITE = 136, /* UDP-Lite (RFC 3828) */
@@ -51,6 +52,7 @@ enum {
IPPROTO_MAX
};
+#define IPPROTO_RM IPPROTO_PGM
/* Internet address. */
struct in_addr {
Index: linux-2.6/include/linux/pgm.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/pgm.h 2010-03-18 16:56:19.000000000 -0500
@@ -0,0 +1,720 @@
+/*
+ * PGM packet formats, RFC 3208.
+ *
+ * Copyright (c) 2006 Miru Limited.
+ * Copyright (c) 2010 Christoph Lameter, The Linux Foundation.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that ...

That seems like a poor reason alone to put something into the kernel.

Perhaps you rather need some way to have unprivileged raw sockets?

The classical way to do this is to start suid root, only open the socket and then drop privileges.

-Andi -- ak@linux.intel.com -- Speaking for myself only. --
From: Andi Kleen <andi@firstfloor.org>

I completely agree. We should be able to make a way for unprivileged users to use RAW sockets in some limited capacity, for cases like this. But I also don't consider what openpgm has to do right now to be all that much of a restriction. You need privileges to add the protocol to the kernel, you need privileges to run the userspace variant, there is no real difference. --
The real difference is if multiplex is needed between multiple unprivileged users. -hpa --
It is needed. PGM ports exist and work similarly to UDP and TCP ports. PGM as provided by openpgm and other solutions avoids native PGM and instead uses PGM over UDP. But the routers do not support PGM over UDP in the same way as native PGM. So the NAK suppression and other advanced features available in Juniper and Cisco switches cannot be used. openpgm can work with the native PGM protocol via a raw socket but then one cannot run multiple processes communicating via different ports effectively. The fragmentation of packets and the assembly etc. in user space is a pain. --
Not the only reason. There are also performance implications. NAKing and other control messages from user space are a pain and the available implementations add numerous threads just to control the timing of control messages and the expiration of data etc. It's difficult to listen to a PGM port from user space. You have to get all messages for the PGM protocol and then filter in each process. Yes those solutions exist and the experience with their limitations is the reason to try to get PGM in the kernel. --
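To make the filtering burden concrete, here is roughly what every user space process listening on a raw PGM socket has to do for each received packet. This is a minimal sketch, assuming the fixed 16 byte PGM header layout from RFC 3208 and a pointer that already skips the IP header. PGM_HDR_LEN, get_be16() and pgm_dport_matches() are names invented for the example, and it only looks at the destination port of downstream packets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed PGM header is 16 bytes (RFC 3208):
 * source port (2), destination port (2), type (1), options (1),
 * checksum (2), global source ID (6), TSDU length (2). */
#define PGM_HDR_LEN 16

/* Read a big-endian 16 bit value without alignment assumptions. */
static uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Return 1 if the packet is addressed to the given PGM destination port.
 * "pkt" must point at the PGM header, i.e. past the IP header that a raw
 * socket delivers. Packets shorter than the fixed header never match. */
int pgm_dport_matches(const uint8_t *pkt, size_t len, uint16_t dport)
{
	assert(pkt != NULL);
	if (len < PGM_HDR_LEN)
		return 0;
	return get_be16(pkt + 2) == dport;
}
```

With an in-kernel protocol this demultiplexing would happen once in the receive path instead of once per process.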
Ok that sounds like a good reason to have a kernel protocol. Thanks. Multicast reliable kernel protocols are somewhat new, I guess one would need to make sure to come up with a clean generic interface for them first. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
It has been around for a long time in another OS. I wonder if I should use
the socket API realized there as a model or come up with something new
from scratch?
What I have right now is:
1. Opening a socket
A. Native PGM
fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM)
B. PGM over UDP
fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
C. PGM over SHM (?)
fd = socket(AF_UNIX, SOCK_RDM, 0)
2. Binding to a multicast address
A. Sender
Connect the socket to a MC address and port using connect().
Note that the port is significant since multiple streams on different
ports can be run over the same MC addr.
B. Receiver
I. Bind the socket to the MC address and port of interest.
II. Listen to the socket.
Process will wait until a PGM packet destined to the port of interest
is received.
III. Accept a connection.
Establishes a session. Data can then be received.
3. Sending and receiving
Use the usual socket read and write operations and the various flavors of waiting
for a packet via select, poll, epoll etc.
Packet sizes are determined by the number of bytes in a single sendmsg() unless
overridden by the RM_SET_MESSAGE_BOUNDARY socket option.
The sender will block when the send window is full unless a non blocking write is performed.
The receiver shows the usual wait semantics. If the stream is set to unreliable then
packets may arrive in random order. If the socket is set to RM_LISTEN_ONLY then packets may
just be missing.
4. Transmitter Socket Options
A. Setting the window size / rate.
struct pgm_send_window x;
x.RateKbitsPerSec = 56;
x.WindowSizeInMsecs = 60000;
...

If the other API doesn't have a serious flaw I guess it's better

That's a very large buffer for a socket. It would be better to use the usual

It's difficult to maintain 64 bit counters on 32bit hosts on all targets. But I guess it would be ok to only fill in 32bit in this case.

-Andi -- ak@linux.intel.com -- Speaking for myself only. --
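To put a number on the send window from 4.A: the amount of data the transmitter has to keep around is just rate times window length. A rough sketch, assuming 1 kbit = 1000 bits; pgm_window_bytes() is a made-up helper, not part of any proposed API.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that must be buffered to cover "window_msecs" of traffic at
 * "rate_kbits" kbit/s, taking 1 kbit = 1000 bits (assumption). */
uint64_t pgm_window_bytes(uint32_t rate_kbits, uint32_t window_msecs)
{
	assert(window_msecs > 0);
	/* kbit/s * ms = bits; divide by 8 for bytes. */
	return (uint64_t)rate_kbits * window_msecs / 8;
}
```

For the values above (56 kbit/s, 60000 ms) this gives 420000 bytes. The same 60 second window at 1 Gbit/s would need 7.5 GB, which is where the memory consumption concern comes from.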
RDM is Reliable Datagram Multicast I believe. I'd rather have SOCK_PGM if

Multiple processes would communicate via shm segments. Maybe defer to the future but it's an important operation mode as the systems grow bigger and bigger. SHM segment would have to contain some sort of ring buffer that the receivers could tap into. But that mode has not really been thought

No idea why it was implemented. It can be used to use send() for portions of a message. Triggers the send() only when all bytes have been provided. Probably necessary if one wants to have very long (megabytes) messages.

Reliable multicast protocols have a defined time period / "reliability buffer" so that they can resend a message that was missed for a time period. It is customary to either specify a time period or define the size

32 bit counters have the awful habit of overflowing. --
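How quickly a 32 bit byte counter overflows is simple arithmetic. The helper below is only an illustration (wrap32_seconds() is an invented name) and assumes a constant byte rate.

```c
#include <assert.h>
#include <stdint.h>

/* Whole seconds until a 32 bit byte counter wraps at a constant rate. */
uint64_t wrap32_seconds(uint64_t bytes_per_sec)
{
	assert(bytes_per_sec > 0);
	return (UINT64_C(1) << 32) / bytes_per_sec;
}
```

At 1 Gbit/s (125000000 bytes/s) the counter wraps after about 34 seconds. At 56 kbit/s (7000 bytes/s) it still wraps after roughly 7 days.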
AF_UNIX is not SHM today. The only point is to avoid one copy? (user1 -> kernel -> user2 to user1 -> user2) Not sure if that is really worth it. Don't you need another copy to the reliability buffer anyways? Letting kernel parse a data structure in user defined memory is also always somewhat tricky. But in principle AF_INET over localhost should not be that much less efficient than AF_UNIX, so you can probably drop it for now (unless you need special AF_UNIX

Those could be a problem in kernel memory consumption. One would need to be very careful to have a good memory management scheme for the socket

One problem is memory management then. What happens when a process opens 100 of those sockets and fills them all?

There's just no portable atomic64_t. Ok maybe you can use the socket lock to synchronize all the counts if they are only per socket.

-Andi -- ak@linux.intel.com -- Speaking for myself only. --
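The "use the socket lock" idea can be illustrated in user space with a mutex standing in for the per-socket lock. This is only an analogy, not kernel code: struct pgm_stats and the pgm_stats_*() helpers are hypothetical names, and the point is merely that updates and reads of the 64 bit value happen under one lock, so a 32 bit host never sees a half-updated counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-socket statistics; the mutex stands in for the
 * kernel's per-socket lock. */
struct pgm_stats {
	pthread_mutex_t lock;
	uint64_t bytes_sent;
};

void pgm_stats_init(struct pgm_stats *s)
{
	assert(s != NULL);
	pthread_mutex_init(&s->lock, NULL);
	s->bytes_sent = 0;
}

/* Add to the counter while holding the lock. */
void pgm_stats_add_sent(struct pgm_stats *s, uint64_t bytes)
{
	pthread_mutex_lock(&s->lock);
	s->bytes_sent += bytes;
	pthread_mutex_unlock(&s->lock);
}

/* Take a consistent snapshot of the counter. */
uint64_t pgm_stats_read_sent(struct pgm_stats *s)
{
	uint64_t v;

	pthread_mutex_lock(&s->lock);
	v = s->bytes_sent;
	pthread_mutex_unlock(&s->lock);
	return v;
}
```

This only works as long as the counters really are per socket, as noted above; anything shared across sockets would need a different scheme.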
Not sure either. Access of multiple processes to one reliability buffer

Well let's skip it for now and see if there are performance implications in

Pushes out the app? Same as the user space apps now. Some sort of

Yes. --
Here is a pgm.7 manpage describing what the socket API could look like for a PGM implementation. I dumped the RM_* based socket options from the other OS since most of the options were unusable.

.\" This man page is Copyright (C) 2010 Christoph Lameter <cl@linux-foundation.org>.
.\" Permission is granted to distribute possibly modified copies
.\" of this page provided the header is included verbatim,
.\" and in case of nontrivial modification author and date
.\" of the modification is added to the header.
.\"
.TH PGM 7 2010-08-01 "Linux" "Linux Programmer's Manual"
.SH NAME
pgm \- Pragmatic General Multicast Protocol Support for IPv4
.SH SYNOPSIS
.B #include <sys/socket.h>
.br
.B #include <netinet/in.h>
.br
.B #include <linux/pgm.h>
.sp
.B pgm_socket = socket(AF_INET, SOCK_RDM, IPPROTO_PGM);
.br
.B pgm_socket = socket(AF_INET, SOCK_RDM, IPPROTO_UDP);
.SH DESCRIPTION
This is an implementation of the Pragmatic General Multicast Protocol
described in RFC\ 3208.
PGM implements a connection oriented, Reliable Datagram Messaging
(thus SOCK_RDM) protocol.
Packets are delivered in order even though the network may have reordered,
duplicated or dropped packets.
Receivers may ask for retransmission of missed packets (NAK).
Transmitters do not keep receiver state so that an individual sender is
able to interact with an unlimited number of receivers.

The recovery mechanism of PGM can limit the scalability of PGM if too many
receivers are NAKing.
Therefore measures exist at various layers to reduce the potential repair
volume that a transmitter may have to deal with.

PGM supports two variants.
The first one is the
.B native
PGM protocol which uses its own IP protocol implementation at the same level
as TCP and UDP.
Native PGM supports NAK suppression ("assist") by network elements
(Cisco, Juniper and other commercially available routers have support for PGM)
which is an important measure to reduce the NAK volume in case of packet loss
during multicast replication of ...
I did a quick read and the manpage/interface seem reasonable to me. You changed the parameter struct fields to lower case. While that definitely looks more Linuxy than before, does it mean programs have to #ifdef this? It might be a good idea to have at least some optional compat header that #defines. -Andi --
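As an illustration of what such a compat header could amount to for the field names alone, here is a small sketch. The lower-case field names are guesses, since the actual names in the proposed <linux/pgm.h> are not shown in this thread; only RateKbitsPerSec and WindowSizeInMsecs come from the earlier mail. It says nothing about whether the socket options themselves can be mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lower-case layout; the real field names in the proposed
 * <linux/pgm.h> are not shown in this thread. */
struct pgm_send_window {
	uint32_t rate_kbits_per_sec;
	uint32_t window_size_in_msecs;
};

/* Optional compat layer (would live in a separate header): lets code
 * written against the mixed-case names compile unchanged. */
#define RateKbitsPerSec   rate_kbits_per_sec
#define WindowSizeInMsecs window_size_in_msecs

/* Old-style code keeps compiling against the new struct. */
void legacy_fill(struct pgm_send_window *x)
{
	assert(x != NULL);
	x->RateKbitsPerSec = 56;
	x->WindowSizeInMsecs = 60000;
}
```

Plain #defines like these pollute the namespace, which is one reason to keep them in an opt-in header.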
You may also have a look at original PGM implementation by Luigi Rizzo (FreeBSD). It's not maintained, but it might give you broader view. http://info.iet.unipi.it/~luigi/pgm-code/ Martin --
Interesting. Which files in that directory contain the most current code? Looks like the tcpdump patch has been merged.

Here is another tcpdump patch that implements decoding PGM via UDP. Anyone know how to submit something like that?

(Need to specify -Tpgm option to use pgm decoder on UDP traffic)

Index: tcpdump/interface.h
===================================================================
--- tcpdump.orig/interface.h 2010-02-26 18:50:39.411609391 -0600
+++ tcpdump/interface.h 2010-02-26 18:51:04.270350179 -0600
@@ -74,6 +74,7 @@
 #define PT_CNFP 7 /* Cisco NetFlow protocol */
 #define PT_TFTP 8 /* trivial file transfer protocol */
 #define PT_AODV 9 /* Ad-hoc On-demand Distance Vector Protocol */
+#define PT_PGM 10 /* The PGM protocol */
 
 #ifndef min
 #define min(a,b) ((a)>(b)?(b):(a))
Index: tcpdump/print-udp.c
===================================================================
--- tcpdump.orig/print-udp.c 2010-02-26 18:51:35.921610552 -0600
+++ tcpdump/print-udp.c 2010-02-26 18:53:54.440349950 -0600
@@ -520,6 +520,11 @@
 			tftp_print(cp, length);
 			break;
 
+		case PT_PGM:
+			udpipaddr_print(ip, sport, dport);
+			pgm_print(cp, length, (const u_char *)ip);
+			break;
+
 		case PT_AODV:
 			udpipaddr_print(ip, sport, dport);
 			aodv_print((const u_char *)(up + 1), length,
Index: tcpdump/tcpdump.c
===================================================================
--- tcpdump.orig/tcpdump.c 2010-02-26 18:37:13.971601597 -0600
+++ tcpdump/tcpdump.c 2010-02-26 18:37:43.290033748 -0600
@@ -854,6 +854,8 @@
 				packettype = PT_TFTP;
 			else if (strcasecmp(optarg, "aodv") == 0)
 				packettype = PT_AODV;
+			else if (strcasecmp(optarg, "pgm") == 0)
+				packettype = PT_PGM;
 			else
 				error("unknown packet type `%s'", optarg);
 			break;
--
Thanks. I will then proceed to get a patch out that implements the

The socket API will be completely different. The basic handling of the sockets is the same (binding, listening, connecting). There is no way of mapping M$ socket options to Linux socket options with the approach that I proposed in the manpage. The stats structure is different too since some key elements were missing.

What users are there of the M$ api? I have seen vendors supplying their own pgm implementation (guess due to bit rot in the old M$ implementation). --
I don't know, it was just a general consideration. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
In 2.6.34 there is (although some arches which could support it natively don't as of yet... but that's fixable.) See lib/atomic64.c. -hpa --
There are also the 64bit this_cpu operations that were merged in 2.6.33. They do the right thing if the arch does not provide operations. --
