Re: [PATCH 01/21] RDS: Socket interface

From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Hi Roland,

This patchset adds support for RDS as an Infiniband ULP. RDS is an
Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably,
and is used currently in Oracle RAC and Exadata products. It's lived
in OFED for 2+ years and I think it's time to get it upstream -- most
likely into your -next tree for .30, but if it snuck into .29 via the
"new code merge-window exception" then even better.

I've run checkpatch & sparse to clean up as many issues as possible so
what remains are really the design peculiarities (aka warts) that arise
from being a protocol designed by one company for a single critical
application. I think upstreaming this code is the first step towards
working out those issues, and making the end result available to a wider
audience.

Also available for review at:
git://git.openfabrics.org/~agrover/ofed_1_4/linux-2.6 for-roland

Thoughts? shortlog follows.

Thanks -- Regards -- Andy

Andy Grover (21):
      RDS: Socket interface
      RDS: Main header file
      RDS: Congestion-handling code
      RDS: Transport code
      RDS: Info and stats
      RDS: Connection handling
      RDS: loopback
      RDS: sysctls
      RDS: Message parsing
      RDS: send.c
      RDS: recv.c
      RDS: RDMA support
      RDS/IB: Infiniband transport
      RDS/IB: Ring-handling code.
      RDS/IB: Implement RDMA ops using FMRs
      RDS/IB: Implement IB-specific datagram send.
      RDS/IB: Receive datagrams via IB
      RDS/IB: Stats and sysctls
      RDS: Documentation
      RDS: Kconfig and Makefile
      RDS: Add AF and PF #defines for RDS sockets

 Documentation/networking/rds.txt        |  356 +++++++++++
 drivers/infiniband/Kconfig              |    2 +
 drivers/infiniband/Makefile             |    1 +
 drivers/infiniband/ulp/rds/Kconfig      |   13 +
 drivers/infiniband/ulp/rds/Makefile     |   13 +
 drivers/infiniband/ulp/rds/af_rds.c     |  677 +++++++++++++++++++++
 drivers/infiniband/ulp/rds/bind.c       |  202 +++++++
 ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

RDS handles per-socket congestion by updating peers with a complete
congestion map (8KB). This code keeps track of these maps for itself
and ones received from peers.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/cong.c |  424 +++++++++++++++++++++++++++++++++++++
 1 files changed, 424 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/cong.c

diff --git a/drivers/infiniband/ulp/rds/cong.c b/drivers/infiniband/ulp/rds/cong.c
new file mode 100644
index 0000000..b7c49d2
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/cong.c
@@ -0,0 +1,424 @@
+/*
+ * Copyright (c) 2007 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE ...
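
A note on the 8KB figure in the description above: 8KB is exactly
65536 bits, which would fit one bit per 16-bit port. The following is
only a userspace sketch under that assumption; the struct and function
names (cong_map, cong_set, cong_clear, cong_test) are hypothetical and
are not taken from cong.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CONG_MAP_BYTES 8192                 /* 8KB, as stated in the description */
#define CONG_MAP_BITS  (CONG_MAP_BYTES * 8) /* 65536: one bit per port (assumption) */

/* Hypothetical per-host congestion map: bit N set means port N is congested. */
struct cong_map {
	uint8_t bits[CONG_MAP_BYTES];
};

static void cong_map_init(struct cong_map *m)
{
	memset(m->bits, 0, sizeof(m->bits));
}

/* Mark a port as congested. */
static void cong_set(struct cong_map *m, uint16_t port)
{
	m->bits[port >> 3] |= (uint8_t)(1u << (port & 7));
}

/* Mark a port as no longer congested. */
static void cong_clear(struct cong_map *m, uint16_t port)
{
	m->bits[port >> 3] &= (uint8_t)~(1u << (port & 7));
}

/* Return 1 if the port is marked congested, 0 otherwise. */
static int cong_test(const struct cong_map *m, uint16_t port)
{
	assert((unsigned int)port < CONG_MAP_BITS);
	return (m->bits[port >> 3] >> (port & 7)) & 1;
}
```

With this layout a sender could check cong_test() on its copy of the
peer's map before queueing a datagram, and a full-map update is a
single 8KB copy.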
From: Stephen Hemminger
Date: Monday, January 26, 2009 - 8:48 pm

On Mon, 26 Jan 2009 18:17:40 -0800

So this is starting to look like another "Oracle special" like AIO
and HugeTLB, with lots of caveats and restrictions on the application.

--

From: Andrew Grover
Date: Tuesday, January 27, 2009 - 12:15 pm

On Mon, Jan 26, 2009 at 7:48 PM, Stephen Hemminger

Yep it's a datacenter-centric protocol.

Regards -- Andy
--

From: Evgeniy Polyakov
Date: Tuesday, January 27, 2009 - 6:10 am

Is this a rare condition? Is this protocol intended only for
long-lived connections, and not suitable for cases where lots of
them are created and torn down quickly?

-- 
	Evgeniy Polyakov
--

From: Andrew Grover
Date: Tuesday, January 27, 2009 - 12:10 pm

Connections are long-lived. Imagine a cluster. RDS multiplexes all
sockets' datagrams between 2 hosts over a single transport-layer
connection, so if a node sends ONE datagram to another, an IB
connection is set up and sticks around indefinitely.

Regards -- Andy
--
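
The multiplexing Andy describes above (all sockets between two hosts
share one long-lived transport connection) can be pictured with a
small userspace sketch. Everything here is hypothetical: conn_table,
conn_get and send_datagram are illustrative names, the fixed-size
table is an arbitrary simplification, and none of it is the code in
connection.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONNS 16 /* arbitrary bound for this sketch */

/* One transport-level connection per (local host, peer host) pair. */
struct conn {
	uint32_t laddr;          /* local host address */
	uint32_t faddr;          /* peer host address */
	unsigned long datagrams; /* datagrams carried so far */
	int in_use;
};

static struct conn conn_table[MAX_CONNS];

/* Find the connection for a host pair, creating it on first use.
 * Nothing in this sketch ever tears a connection down. */
static struct conn *conn_get(uint32_t laddr, uint32_t faddr)
{
	struct conn *free_slot = NULL;

	for (size_t i = 0; i < MAX_CONNS; i++) {
		struct conn *c = &conn_table[i];

		if (c->in_use && c->laddr == laddr && c->faddr == faddr)
			return c;
		if (!c->in_use && !free_slot)
			free_slot = c;
	}
	if (!free_slot)
		return NULL; /* table full */

	free_slot->laddr = laddr;
	free_slot->faddr = faddr;
	free_slot->datagrams = 0;
	free_slot->in_use = 1;
	return free_slot;
}

/* Ports identify the sockets at each end; only the host pair selects
 * the connection, so every socket pair between two hosts shares it. */
static struct conn *send_datagram(uint32_t laddr, uint16_t sport,
				  uint32_t faddr, uint16_t dport)
{
	struct conn *c = conn_get(laddr, faddr);

	(void)sport;
	(void)dport;
	if (c)
		c->datagrams++;
	return c;
}
```

Two sockets on the same host sending to the same peer get the same
struct conn back; a different peer gets a new one.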

From: Roland Dreier
Date: Wednesday, January 28, 2009 - 3:57 pm

> +EXPORT_SYMBOL_GPL(rds_cong_map_updated);

What is this being exported to?  AFAICT you are only building a single
RDS module, right?

 - R.
--

From: Andy Grover
Date: Wednesday, January 28, 2009 - 7:39 pm

In the current RDS development repo, transports are modularizable. For
the initial upstream submission I just wanted to include the IB
transport so I changed the build to compile rds-core and rds-ib
together, but didn't pull out the exports.

Steve Wise is working on having the iwarp transport debugged in the near
future. Once that is added and we make transports modularizable then
those exports are needed.

Take them out for now, then?

Thanks -- Regards -- Andy
--

From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Implement the RDS (Reliable Datagram Sockets) interface.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/af_rds.c |  677 +++++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/bind.c   |  202 +++++++++++
 2 files changed, 879 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/af_rds.c
 create mode 100644 drivers/infiniband/ulp/rds/bind.c

diff --git a/drivers/infiniband/ulp/rds/af_rds.c b/drivers/infiniband/ulp/rds/af_rds.c
new file mode 100644
index 0000000..7158438
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/af_rds.c
@@ -0,0 +1,677 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH ...
From: Stephen Hemminger
Date: Monday, January 26, 2009 - 8:46 pm

On Mon, 26 Jan 2009 18:17:38 -0800


Then I would think a high speed protocol would use per-cpu

--

From: Andrew Grover
Date: Wednesday, January 28, 2009 - 8:17 pm

On Mon, Jan 26, 2009 at 7:46 PM, Stephen Hemminger

For a spinlock guarding a socket list? I wouldn't think it would be
worth the complexity.

[other comments snipped, will fix per your advice, thanks!]

Regards -- Andy
--

From: David Miller
Date: Monday, January 26, 2009 - 9:11 pm

Socket family implementations do not belong under the
infiniband subdirectory.

Put it under net/ instead.

I don't care what the interdependencies happen to be.
--

From: Andrew Grover
Date: Thursday, January 29, 2009 - 1:22 pm

Roland, so you're ok with infiniband code under net/ ?

(or should I split it up??)

Thanks -- Regards -- Andy
--

From: Evgeniy Polyakov
Date: Tuesday, January 27, 2009 - 5:08 am

Hi Andy.



Global list of all sockets? This does not scale, maybe it should be


Do RDS sockets work with a high number of creation/destruction

Are you absolutely sure that provided poll_table callback
will not do the bad things here? It is quite unusual to add several
different queues into the same head in the poll callback.

Is there a possibility of a lock interaction problem with the above


Hash table with the appropriate size will have faster lookup/access


Iirc there is a new %pi4 or similar format id.

-- 
	Evgeniy Polyakov
--
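
For reference, the hash-table alternative to a single global socket
list that Evgeniy raises above could look roughly like the userspace
sketch below. The names (sock_entry, sock_hashfn, sock_insert,
sock_lookup, sock_remove), the table size and the hash function are
all assumptions made for illustration; locking is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SOCK_HASH_BITS 8
#define SOCK_HASH_SIZE (1u << SOCK_HASH_BITS)

/* Hypothetical bound-socket record, chained per hash bucket. */
struct sock_entry {
	uint32_t addr;
	uint16_t port;
	struct sock_entry *next;
};

static struct sock_entry *sock_hash[SOCK_HASH_SIZE];

/* Mix address and port into a bucket index (arbitrary mixing function). */
static unsigned int sock_hashfn(uint32_t addr, uint16_t port)
{
	uint32_t h = addr ^ ((uint32_t)port * 2654435761u);

	h ^= h >> 16;
	return h & (SOCK_HASH_SIZE - 1);
}

/* Add a socket at the head of its bucket. */
static void sock_insert(struct sock_entry *e)
{
	unsigned int b = sock_hashfn(e->addr, e->port);

	e->next = sock_hash[b];
	sock_hash[b] = e;
}

/* Walk only one bucket instead of every socket in the system. */
static struct sock_entry *sock_lookup(uint32_t addr, uint16_t port)
{
	struct sock_entry *e = sock_hash[sock_hashfn(addr, port)];

	while (e && !(e->addr == addr && e->port == port))
		e = e->next;
	return e;
}

/* Unlink a socket from its bucket if it is present. */
static void sock_remove(struct sock_entry *e)
{
	struct sock_entry **pp = &sock_hash[sock_hashfn(e->addr, e->port)];

	while (*pp && *pp != e)
		pp = &(*pp)->next;
	if (*pp)
		*pp = e->next;
}
```

The trade-off is the one discussed in the thread: lookups touch one
short chain, at the cost of more code than a single list.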

From: Andrew Grover
Date: Wednesday, January 28, 2009 - 9:02 pm

sch mentioned this too... is socket creation often a bottleneck? If so
we can certainly improve scalability here.

In any case, this is in the code to support a listing of RDS sockets
via the rds-info utility. Instead of having our own custom program to
list rds sockets we probably want to export an interface so netstat
will list them. Unfortunately netstat seems to be hardcoded to look
for particular entries in /proc/net, so both rds and netstat would
need to be updated before this would work, and RDS's custom

from the comments above that code:

 * We have to be careful about racing with the incoming path.  sock_orphan()
 * sets SOCK_DEAD and we use that as an indicator to the rx path that new
 * messages shouldn't be queued.



I don't know. I looked into the poll_wait code a little and it
appeared to be designed to allow multiple.


I didn't see anywhere where they were being acquired in reverse order,
or simultaneously. This is the kind of thing that lockdep would find
immediately, right? I think I've got that turned on but I'll double




Yup, will do.

Thanks again.

Regards -- Andy
--

From: Evgeniy Polyakov
Date: Thursday, January 29, 2009 - 9:24 am

Hi Andy.



It depends on the workload, but it becomes a noticeable portion of
the overhead for multi-client short-lived connections. Likely these
sockets will not be used for web-server-like load, but something similar

Other sockets use a similar technique, but they are grouped into a hash
table, so if you think the number of sockets will be noticeably large or
they will be frequently created and removed, it may be worth pushing them



It depends on how poll_table was initialized and how its callback
(invoked from poll_wait()) operates on the given queue and head.
If you introduce your own polling, some care has to be taken there with
the ordering of the wait queues and what their callbacks return when a
polling event is found.

For example, with your own initialization it is possible that when
multiple queues are registered in the same table, only one of them
will be awakened (its callback invoked).

If you just hook into existing machinery things should be ok though, so

If lockdep entered the bad race, then yes, it will fire this up.
I just wondered that we spin lock under the read lock, so some bad

I meant that if there is an unsigned overflow this will suddenly become
a small number, so network timestamping comparison logic can be used,
but apparently neither address nor port is changed during the lifetime,
so nothing special is needed.

-- 
	Evgeniy Polyakov
--
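
The "comparison logic" Evgeniy mentions above for values that can wrap
is usually written as a signed difference. A minimal sketch, assuming
32-bit counters and a two's-complement platform; seq_before is a
hypothetical name, not something from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* True if a comes "before" b when the 32-bit space is treated as a
 * circle. Valid while the two values are less than 2^31 apart.
 * The uint32_t -> int32_t conversion relies on two's-complement
 * behaviour, which is an assumption about the platform. */
static int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}
```

A plain `a < b` gets the wrapped case wrong: 0xfffffff0 is not less
than 5 as an unsigned number, yet it is "before" 5 once the counter
has wrapped. As the message above concludes, none of this is needed
when the compared values never change.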

From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

RDS supports multiple transports. While this initial submission
only supports Infiniband transport, this abstraction allows others
to be added. We're working on an iWARP transport, and also see
UDP over DCB as another possibility.

This code handles transport registration.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/transport.c |  134 ++++++++++++++++++++++++++++++++
 1 files changed, 134 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/transport.c

diff --git a/drivers/infiniband/ulp/rds/transport.c b/drivers/infiniband/ulp/rds/transport.c
new file mode 100644
index 0000000..e78f8b3
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/transport.c
@@ -0,0 +1,134 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR ...
From: Evgeniy Polyakov
Date: Tuesday, January 27, 2009 - 6:18 am

Wow. Why not declare 15 as some constant and put it into rds_transport

Tabs have run away.

-- 
	Evgeniy Polyakov
--
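
As a rough picture of the transport registration the patch describes,
here is a userspace sketch of a registry with a named size constant in
place of a bare number. All identifiers (transport, trans_register,
trans_unregister, trans_find, TRANS_NAME_MAX, TRANS_MAX, null_xmit)
and both constant values are hypothetical; they are not claimed to
match transport.c or the number being discussed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRANS_NAME_MAX 16 /* hypothetical bound, named rather than hard-coded */
#define TRANS_MAX      4  /* arbitrary registry size for this sketch */

/* Hypothetical transport descriptor: a name plus one operation. */
struct transport {
	char name[TRANS_NAME_MAX];
	int (*xmit)(const void *buf, size_t len); /* returns bytes accepted */
};

static struct transport *trans_table[TRANS_MAX];

/* Register a transport; returns 0 on success, -1 if it is already
 * registered or the table is full. */
static int trans_register(struct transport *t)
{
	for (size_t i = 0; i < TRANS_MAX; i++)
		if (trans_table[i] == t)
			return -1;
	for (size_t i = 0; i < TRANS_MAX; i++) {
		if (!trans_table[i]) {
			trans_table[i] = t;
			return 0;
		}
	}
	return -1;
}

/* Remove a transport from the registry if it is present. */
static void trans_unregister(struct transport *t)
{
	for (size_t i = 0; i < TRANS_MAX; i++)
		if (trans_table[i] == t)
			trans_table[i] = NULL;
}

/* Look a transport up by name; returns NULL if none matches. */
static struct transport *trans_find(const char *name)
{
	for (size_t i = 0; i < TRANS_MAX; i++)
		if (trans_table[i] &&
		    strncmp(trans_table[i]->name, name, TRANS_NAME_MAX) == 0)
			return trans_table[i];
	return NULL;
}

/* Stand-in transmit routine that accepts everything it is given. */
static int null_xmit(const void *buf, size_t len)
{
	(void)buf;
	return (int)len;
}
```

Because the array bound and the strncmp() limit both use
TRANS_NAME_MAX, changing the limit is a one-line edit.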

From: Andrew Grover
Date: Tuesday, January 27, 2009 - 12:36 pm

Will fix. Thanks.

Regards -- Andy
--

From: Evgeniy Polyakov
Date: Tuesday, January 27, 2009 - 2:56 pm

It confuses tags and the like otherwise, and looks more consistent with
the rest of the code. It is likely not a must, but it just looks better.

-- 
	Evgeniy Polyakov
--

From: Andrew Grover
Date: Tuesday, January 27, 2009 - 3:15 pm

Yup, will do, just was curious.

Regards -- Andy
--

From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

In addition to some tunable parameters, we also make protocol #
available here, since until accepted in upstream we do not have
a fixed number assigned. This can be removed once upstream.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/sysctl.c |  164 +++++++++++++++++++++++++++++++++++
 1 files changed, 164 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/sysctl.c

diff --git a/drivers/infiniband/ulp/rds/sysctl.c b/drivers/infiniband/ulp/rds/sysctl.c
new file mode 100644
index 0000000..3337a3e
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/sysctl.c
@@ -0,0 +1,164 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

While arguably the fact that the underlying transport needs a
connection to convey RDS's datagrams reliably is not important
to rds proper, the transports implemented so far (IB and TCP)
have both been connection-oriented, and so the connection
state machine-related code is in the common rds code.

This patch also includes several work items, to handle connecting,
sending, receiving, and shutdown.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/connection.c |  501 +++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/threads.c    |  273 +++++++++++++++++
 2 files changed, 774 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/connection.c
 create mode 100644 drivers/infiniband/ulp/rds/threads.c

diff --git a/drivers/infiniband/ulp/rds/connection.c b/drivers/infiniband/ulp/rds/connection.c
new file mode 100644
index 0000000..6174629
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/connection.c
@@ -0,0 +1,501 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY ...
From: Evgeniy Polyakov
Date: Tuesday, January 27, 2009 - 6:34 am

This one is eventually invoked under the spin_lock with irqs turned off,
which may freeze the machine:
rds_for_each_conn_info() -> spin_lock_irqsave(global lock) ->
rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending()
-> boom.

I did not check further though.

-- 
	Evgeniy Polyakov
--

From: Oliver Neukum
Date: Tuesday, January 27, 2009 - 6:47 am

Why? This is _trylock. It won't block.

	Regards
		Oliver
 


--

From: Evgeniy Polyakov
Date: Tuesday, January 27, 2009 - 6:51 am

Unlock may reschedule.

-- 
	Evgeniy Polyakov
--

From: Steve Wise
Date: Tuesday, January 27, 2009 - 9:28 am

mutex_trylock() uses spin_lock_mutex() which has this in the debug version:

DEBUG_LOCKS_WARN_ON(in_interrupt());

--

From: Andrew Grover
Date: Wednesday, January 28, 2009 - 8:03 pm

What's the best way to fix this?

This is all so rds-info can print out a nice list of connections, and
if they're sending or not. I don't see an easy way to fix this. A
_trylock-like function that didn't grab it would be nice? I can always
just not report this particular bit of info, that actually might be
easiest.

Regards -- Andy
--

From: Evgeniy Polyakov
Date: Thursday, January 29, 2009 - 1:03 am

You use atomic variables for the other cases, add another one here to
mark a locked connection. Looks ugly but does not crash at least.

-- 
	Evgeniy Polyakov
--
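
Evgeniy's suggestion above (an atomic flag marking the connection as
busy, so the reporting path never touches a mutex) can be sketched in
userspace with C11 atomics. This is an analogue only: in the kernel
the same idea would presumably use the kernel's own atomic helpers,
and conn_state, conn_send_begin, conn_send_end and conn_is_sending are
hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-connection state. */
struct conn_state {
	atomic_int sending; /* 1 while a sender owns the send path, else 0 */
};

/* Try to become the sender. Never blocks; returns true on success. */
static bool conn_send_begin(struct conn_state *c)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&c->sending, &expected, 1);
}

/* Give up the send path. */
static void conn_send_end(struct conn_state *c)
{
	atomic_store(&c->sending, 0);
}

/* Reporting path: a plain atomic load, so it takes and releases no
 * lock and is usable where sleeping or rescheduling is not allowed. */
static bool conn_is_sending(struct conn_state *c)
{
	return atomic_load(&c->sending) != 0;
}
```

The info path then only reads the flag, which avoids the
trylock-then-unlock sequence questioned earlier in this subthread.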

From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Upon receiving a datagram from the transport, RDS parses the
headers and potentially queues an ACK.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/recv.c |  550 +++++++++++++++++++++++++++++++++++++
 1 files changed, 550 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/recv.c

diff --git a/drivers/infiniband/ulp/rds/recv.c b/drivers/infiniband/ulp/rds/recv.c
new file mode 100644
index 0000000..691f8cb
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/recv.c
@@ -0,0 +1,550 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Some transports may support RDMA features. This handles the
non-transport-specific parts, like pinning user pages and
tracking mapped regions.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/rdma.c     |  682 +++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/rdma.h     |   84 ++++
 drivers/infiniband/ulp/rds/rds_rdma.h |  245 ++++++++++++
 3 files changed, 1011 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/rdma.c
 create mode 100644 drivers/infiniband/ulp/rds/rdma.h
 create mode 100644 drivers/infiniband/ulp/rds/rds_rdma.h

diff --git a/drivers/infiniband/ulp/rds/rdma.c b/drivers/infiniband/ulp/rds/rdma.c
new file mode 100644
index 0000000..00e3450
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/rdma.c
@@ -0,0 +1,682 @@
+/*
+ * Copyright (c) 2007 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

A simple rds transport to handle loopback connections.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/loop.c |  189 +++++++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/loop.h |    9 ++
 2 files changed, 198 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/loop.c
 create mode 100644 drivers/infiniband/ulp/rds/loop.h

diff --git a/drivers/infiniband/ulp/rds/loop.c b/drivers/infiniband/ulp/rds/loop.c
new file mode 100644
index 0000000..40fa729
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/loop.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

RDS is a reliable datagram protocol used for IPC on Oracle
database clusters. This adds address and protocol family numbers
for it.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 include/linux/socket.h |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 20fc4bb..fda91af 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -191,7 +191,8 @@ struct ucred {
 #define AF_RXRPC	33	/* RxRPC sockets 		*/
 #define AF_ISDN		34	/* mISDN sockets 		*/
 #define AF_PHONET	35	/* Phonet sockets		*/
-#define AF_MAX		36	/* For now.. */
+#define AF_RDS		36	/* RDS sockets 			*/
+#define AF_MAX		37	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -229,6 +230,7 @@ struct ucred {
 #define PF_RXRPC	AF_RXRPC
 #define PF_ISDN		AF_ISDN
 #define PF_PHONET	AF_PHONET
+#define PF_RDS		AF_RDS
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
-- 
1.5.6.3

--

From: Rémi
Date: Tuesday, January 27, 2009 - 12:27 am

You also need to add lock class declaration to net/core/sock.c, I believe.

-- 
Rémi Denis-Courmont
Maemo Software, Nokia Devices R&D

--

From: Andrew Grover
Date: Tuesday, January 27, 2009 - 12:31 pm

On Mon, Jan 26, 2009 at 11:27 PM, Rémi Denis-Courmont

Very true, thanks.

Regards -- Andy
--

From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

This is the code to send an RDS datagram.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/send.c | 1006 +++++++++++++++++++++++++++++++++++++
 1 files changed, 1006 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/send.c

diff --git a/drivers/infiniband/ulp/rds/send.c b/drivers/infiniband/ulp/rds/send.c
new file mode 100644
index 0000000..276f7ac
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/send.c
@@ -0,0 +1,1006 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

IB-specific stats and sysctls.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib_stats.c  |   95 ++++++++++++++++++++++
 drivers/infiniband/ulp/rds/ib_sysctl.c |  137 ++++++++++++++++++++++++++++++++
 2 files changed, 232 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib_stats.c
 create mode 100644 drivers/infiniband/ulp/rds/ib_sysctl.c

diff --git a/drivers/infiniband/ulp/rds/ib_stats.c b/drivers/infiniband/ulp/rds/ib_stats.c
new file mode 100644
index 0000000..02e3e3d
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_stats.c
@@ -0,0 +1,95 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib_ring.c |  168 ++++++++++++++++++++++++++++++++++
 1 files changed, 168 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib_ring.c

diff --git a/drivers/infiniband/ulp/rds/ib_ring.c b/drivers/infiniband/ulp/rds/ib_ring.c
new file mode 100644
index 0000000..d23cc59
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_ring.c
@@ -0,0 +1,168 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+
+#include "rds.h"
+#include "ib.h"
+
+/*
+ * Locking for ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Registers as an RDS transport and an IB client, and uses IB CM
API to allocate ids, queue pairs, and the rest of that fun stuff.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib.c    |  312 +++++++++++++
 drivers/infiniband/ulp/rds/ib.h    |  358 +++++++++++++++
 drivers/infiniband/ulp/rds/ib_cm.c |  882 ++++++++++++++++++++++++++++++++++++
 3 files changed, 1552 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib.c
 create mode 100644 drivers/infiniband/ulp/rds/ib.h
 create mode 100644 drivers/infiniband/ulp/rds/ib_cm.c

diff --git a/drivers/infiniband/ulp/rds/ib.c b/drivers/infiniband/ulp/rds/ib.c
new file mode 100644
index 0000000..cd35fba
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib.c
@@ -0,0 +1,312 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

RDS's main data structure definitions and exported functions.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/rds.h |  763 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 763 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/rds.h

diff --git a/drivers/infiniband/ulp/rds/rds.h b/drivers/infiniband/ulp/rds/rds.h
new file mode 100644
index 0000000..133a237
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/rds.h
@@ -0,0 +1,763 @@
+#ifndef _RDS_H
+#define _RDS_H
+
+#include <net/sock.h>
+#include <linux/scatterlist.h>
+#include <asm/atomic.h>
+
+#include <linux/mutex.h>
+#include "rds_rdma.h"
+
+/*
+ * RDS Network protocol version
+ */
+#define RDS_PROTOCOL_3_0	0x0300
+#define RDS_PROTOCOL_3_1	0x0301
+#define RDS_PROTOCOL_VERSION	RDS_PROTOCOL_3_1
+#define RDS_PROTOCOL_MAJOR(v)	((v) >> 8)
+#define RDS_PROTOCOL_MINOR(v)	((v) & 255)
+#define RDS_PROTOCOL(maj, min)	(((maj) << 8) | (min))
+
+/*
+ * XXX randomly chosen, but at least seems to be unused:
+ * #               18464-18768 Unassigned
+ * We should do better.  We want a reserved port to discourage unpriv'ed
+ * userspace from listening.
+ */
+#define RDS_PORT	18634
+
+#ifndef AF_RDS
+#define AF_RDS          28      /* Reliable Datagram Socket     */
+#endif
+
+#ifndef PF_RDS
+#define PF_RDS          AF_RDS
+#endif
+
+#ifndef SOL_RDS
+#define SOL_RDS         272
+#endif
+
+#define KERNEL_HAS_PROTO_REGISTER 1
+#define KERNEL_HAS_INET_SK_RETURNING_INET_SOCK 1
+#define KERNEL_HAS_CORE_CALLING_DEV_IOCTL 1
+
+#ifdef ATOMIC64_INIT
+#define KERNEL_HAS_ATOMIC64
+#endif
+
+/* x86-64 doesn't include kmap_types.h from anywhere */
+#include <asm/kmap_types.h>
+#include <linux/highmem.h>
+
+#include "info.h"
+
+#ifdef DEBUG
+#define rdsdebug(fmt, args...) pr_debug("%s(): " fmt, __func__ , ##args)
+#else
+/* sigh, pr_debug() causes unused variable warnings */
+static inline void __attribute__ ((format (printf, 1, 2)))
+rdsdebug(char ...
From: Rémi
Date: Tuesday, January 27, 2009 - 12:34 am

Internet transport protocol port number? IANA has a process for assigning port 
numbers to proprietary protocols.

Not that I'd blame you, as I inherited VLC media player's wide abuse of port 

You should probably remove that and put the last patch of your series ahead of 

This is used by RXRPC nowadays, although I myself don't really understand why 
socket option levels need to be unique across all families.

-- 
Rémi Denis-Courmont
Maemo Software, Nokia Devices R&D

--

From: Andrew Grover
Date: Tuesday, January 27, 2009 - 12:27 pm

On Mon, Jan 26, 2009 at 11:34 PM, Rémi Denis-Courmont


OK, shouldn't be too hard to fix.

Thanks.

Regards -- Andy
--

From: Evgeniy Polyakov
Date: Tuesday, January 27, 2009 - 6:05 am

Hi.


--

From: Andrew Grover
Date: Tuesday, January 27, 2009 - 12:23 pm

RDS errors out.

Yeah we're going to want to get an assigned port at some point, I guess.

Regards -- Andy
--

From: Steve Wise
Date: Tuesday, January 27, 2009 - 12:24 pm

You should start that process now.  It took a while to get nfsrdma's 
port number through...


--

From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Parsing of newly-received RDS message headers (including ext.
headers) and copy-to/from-user routines.

page.c implements a per-cpu page remainder cache, to reduce the
number of allocations needed for small datagrams.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/message.c |  414 ++++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/page.c    |  222 ++++++++++++++++++
 2 files changed, 636 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/message.c
 create mode 100644 drivers/infiniband/ulp/rds/page.c

diff --git a/drivers/infiniband/ulp/rds/message.c b/drivers/infiniband/ulp/rds/message.c
new file mode 100644
index 0000000..5cad4d5
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/message.c
@@ -0,0 +1,414 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Specific to IB is a credits-based flow control mechanism, in
addition to the expected usage of the IB API to package outgoing
data into work requests.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib_send.c |  852 ++++++++++++++++++++++++++++++++++
 1 files changed, 852 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib_send.c

diff --git a/drivers/infiniband/ulp/rds/ib_send.c b/drivers/infiniband/ulp/rds/ib_send.c
new file mode 100644
index 0000000..20af976
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_send.c
@@ -0,0 +1,852 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

RDS currently generates a lot of stats that are accessible via
the rds-info utility. This code implements the support for this.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/info.c  |  243 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/info.h  |   43 +++++++
 drivers/infiniband/ulp/rds/stats.c |  150 ++++++++++++++++++++++
 3 files changed, 436 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/info.c
 create mode 100644 drivers/infiniband/ulp/rds/info.h
 create mode 100644 drivers/infiniband/ulp/rds/stats.c

diff --git a/drivers/infiniband/ulp/rds/info.c b/drivers/infiniband/ulp/rds/info.c
new file mode 100644
index 0000000..ff3ba1c
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/info.c
@@ -0,0 +1,243 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR ...
From: Evgeniy Polyakov
Date: Tuesday, January 27, 2009 - 6:28 am

Those bug_ons look quite scary, is there a way to actually have a wrong
optname? Plus, those _INFO definitions are declared twice in the code,

This one is used to temporarily map some address, but functions called
between map and unmap functions (like rds_info_getsockopt()) may sleep,
which is wrong.

-- 
	Evgeniy Polyakov
--

From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib_rdma.c |  641 ++++++++++++++++++++++++++++++++++
 1 files changed, 641 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib_rdma.c

diff --git a/drivers/infiniband/ulp/rds/ib_rdma.c b/drivers/infiniband/ulp/rds/ib_rdma.c
new file mode 100644
index 0000000..69a6289
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_rdma.c
@@ -0,0 +1,641 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+
+#include "rds.h"
+#include "rdma.h"
+#include ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

This file documents the specifics of the RDS sockets API,
as well as covering some of the details of its internal
implementation.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 Documentation/networking/rds.txt |  356 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 356 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/rds.txt

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
new file mode 100644
index 0000000..c67077c
--- /dev/null
+++ b/Documentation/networking/rds.txt
@@ -0,0 +1,356 @@
+
+Overview
+========
+
+This readme tries to provide some background on the hows and whys of RDS,
+and will hopefully help you find your way around the code.
+
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
+
+RDS Architecture
+================
+
+RDS provides reliable, ordered datagram delivery by using a single
+reliable connection between any two nodes in the cluster. This allows
+applications to use a single socket to talk to any other process in the
+cluster - so in a cluster with N processes you need N sockets, in contrast
+to N*N if you use a connection-oriented socket transport like TCP.
+
+RDS is not Infiniband-specific; it was designed to support different
+transports.  The current implementation used to support RDS over TCP as well
+as IB. Work is in progress to support RDS over iWARP, and using DCE to
+guarantee no dropped packets on Ethernet, it may be possible to use RDS over
+UDP in the future.
+
+The high-level semantics of RDS from the application's point of view are
+
+ *	Addressing
+        RDS uses IPv4 addresses and 16bit port numbers to identify
+        the end point of a connection. All socket operations that involve
+        passing addresses between kernel and user space generally
+        use a struct sockaddr_in.
+
+        The fact that IPv4 addresses are used does not mean the underlying
+        ...
From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Add RDS Kconfig and Makefile, and modify infiniband's to add
us to the build.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/Kconfig          |    2 ++
 drivers/infiniband/Makefile         |    1 +
 drivers/infiniband/ulp/rds/Kconfig  |   13 +++++++++++++
 drivers/infiniband/ulp/rds/Makefile |   13 +++++++++++++
 4 files changed, 29 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/Kconfig
 create mode 100644 drivers/infiniband/ulp/rds/Makefile

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index dd0db67..1cba524 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -54,4 +54,6 @@ source "drivers/infiniband/ulp/srp/Kconfig"
 
 source "drivers/infiniband/ulp/iser/Kconfig"
 
+source "drivers/infiniband/ulp/rds/Kconfig"
+
 endif # INFINIBAND
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index ed35e44..39d0203 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_INFINIBAND_NES)		+= hw/nes/
 obj-$(CONFIG_INFINIBAND_IPOIB)		+= ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)		+= ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER)		+= ulp/iser/
+obj-$(CONFIG_INFINIBAND_ISER)		+= ulp/rds/
diff --git a/drivers/infiniband/ulp/rds/Kconfig b/drivers/infiniband/ulp/rds/Kconfig
new file mode 100644
index 0000000..bbc2ba4
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/Kconfig
@@ -0,0 +1,13 @@
+
+config INFINIBAND_RDS
+	tristate "Reliable Datagram Sockets (RDS) (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	---help---
+	  RDS provides reliable, sequenced delivery of datagrams
+	  over Infiniband.
+
+config INFINIBAND_RDS_DEBUG
+        bool "Debugging messages"
+	depends on INFINIBAND_RDS
+        default n
+
diff --git a/drivers/infiniband/ulp/rds/Makefile b/drivers/infiniband/ulp/rds/Makefile
new file mode 100644
index 0000000..d470550
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/Makefile
@@ -0,0 +1,13 ...
From: Roland Dreier
Date: Wednesday, January 28, 2009 - 3:59 pm

 > +obj-$(CONFIG_INFINIBAND_ISER)		+= ulp/rds/

Typo for ..._RDS

 > +config INFINIBAND_RDS_DEBUG
 > +        bool "Debugging messages"
 > +	depends on INFINIBAND_RDS
 > +        default n

No way to enable this?  Disabled by default?

You really want debugging messages to be built by default and controlled
at runtime ... otherwise debugging end-user installations is a pain
(they just install what the distro gives them, and it's very hard for
them to rebuild just to enable debugging).

 > +ib_rds-y :=	af_rds.o bind.o cong.o connection.o info.o message.o   \
 > +			recv.o send.o stats.o sysctl.o threads.o transport.o \
 > +			loop.o page.o rdma.o
 > +
 > +ib_rds-y += ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
 > +			ib_sysctl.o ib_rdma.o

a very strange way to write an assignment statement...

 - R.
--

From: Andy Grover
Date: Wednesday, January 28, 2009 - 7:19 pm

So the solution is just to base debug message output on a variable,
instead of a config option? RDS actually does do this a little already,
so converting totally isn't hard. I hadn't seen mention this was
preferable -- indeed, tons of drivers and subsystems have options for
compile-time debug statements, should these be converted?

RDS is implemented as a core sockets layer and then a transport layer.
IB is currently the only transport so I thought it made sense to just
compile them together, but once there are >1 then RDS's IB support could
be broken out into its own module.

Regards -- Andy
--

From: Roland Dreier
Date: Wednesday, January 28, 2009 - 10:14 pm

 > So the solution is just to base debug message output on a variable,
 > instead of a config option? RDS actually does do this a little already,
 > so converting totally isn't hard. I hadn't seen mention this was
 > preferable -- indeed, tons of drivers and subsystems have options for
 > compile-time debug statements, should these be converted?

My experience is definitely that compile-time switches are a big pain
when you actually have to debug something that can only be reproduced on
someone else's setup (which will happen once users start using your
stuff).  You probably can use the dynamic_printk stuff that went in
recently to make this all very clean and standard.
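
As a rough illustration of the difference, here is a userspace sketch of a
debug helper that is always compiled in and gated by a runtime switch. It is
not the dynamic_printk interface and not code from the patch: debug_enabled,
dbg() and rdsdebug_rt() are hypothetical names, and in a kernel module the
switch would more likely be a module parameter or sysctl.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical runtime switch (assumption: stands in for a module
 * parameter or sysctl); 0 = quiet, non-zero = emit debug output. */
static int debug_enabled;

/* Always built; prints only when the switch is on.
 * Returns the number of characters written, or 0 when disabled. */
static int dbg(const char *func, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (!debug_enabled)
		return 0;

	n = fprintf(stderr, "%s(): ", func);
	va_start(ap, fmt);
	n += vfprintf(stderr, fmt, ap);
	va_end(ap);
	return n;
}

/* Hypothetical wrapper that prefixes the calling function's name,
 * similar in spirit to the rdsdebug() macro in rds.h. */
#define rdsdebug_rt(...) dbg(__func__, __VA_ARGS__)
```

With something like this, a user can flip debug_enabled on a stock build and
get output from rdsdebug_rt("conn %d\n", id) without recompiling anything.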

 - R.
--

From: Andy Grover
Date: Monday, January 26, 2009 - 7:17 pm

Header parsing, ring refill. It puts the incoming data into an
rds_incoming struct, which is passed up to rds-core.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib_recv.c |  894 ++++++++++++++++++++++++++++++++++
 1 files changed, 894 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib_recv.c

diff --git a/drivers/infiniband/ulp/rds/ib_recv.c b/drivers/infiniband/ulp/rds/ib_recv.c
new file mode 100644
index 0000000..516f858
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_recv.c
@@ -0,0 +1,894 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN ...
From: Roland Dreier
Date: Wednesday, January 28, 2009 - 5:05 pm

 > +static int rds_ib_recv_refill_one(struct rds_connection *conn,
 > +				  struct rds_ib_recv_work *recv,
 > +				  gfp_t kptr_gfp, gfp_t page_gfp)
 > +{
 > +	struct rds_ib_connection *ic = conn->c_transport_data;
 > +	dma_addr_t dma_addr;
 > +	struct ib_sge *sge;
 > +	int ret = -ENOMEM;
 > +
 > +	if (recv->r_ibinc == NULL) {
 > +		if (atomic_read(&rds_ib_allocation) >= rds_ib_sysctl_max_recv_allocation) {
 > +			rds_ib_stats_inc(s_ib_rx_alloc_limit);
 > +			goto out;
 > +		}
 > +		recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab,
 > +						 kptr_gfp);
 > +		if (recv->r_ibinc == NULL)
 > +			goto out;
 > +		atomic_inc(&rds_ib_allocation);

This is racy.  You check if you're at the limit, do the allocation, and
then increment the atomic rds_ib_allocation count.  So many threads can
pass the atomic_read() test and then take you over the limit.  If you
want to make it safe then you could do atomic_inc_return() and check if
that took you over the limit.
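
A minimal userspace sketch of that increment-then-check pattern, assuming
C11 <stdatomic.h> as a stand-in for the kernel's atomic_t API
(atomic_fetch_add() + 1 plays the role of atomic_inc_return()). The names
allocation, max_allocation, reserve_slot() and release_slot() are
hypothetical and do not appear in the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for rds_ib_allocation (assumption: plain C11 atomic). */
static atomic_int allocation;

/* Stand-in for rds_ib_sysctl_max_recv_allocation; value is arbitrary. */
static const int max_allocation = 2;

/*
 * Reserve a slot *before* allocating. Each caller sees a distinct
 * post-increment value, so at most max_allocation callers can succeed
 * even when many threads race here.
 */
static bool reserve_slot(void)
{
	/* fetch_add returns the old value; +1 gives the new count. */
	int now = atomic_fetch_add(&allocation, 1) + 1;

	if (now > max_allocation) {
		/* Over the limit: undo our reservation and fail. */
		atomic_fetch_sub(&allocation, 1);
		return false;
	}
	return true;
}

/* Give the slot back, e.g. when the allocation fails or is freed. */
static void release_slot(void)
{
	atomic_fetch_sub(&allocation, 1);
}
```

The caller would do the kmem_cache_alloc() only after reserve_slot()
succeeds, and call release_slot() if the allocation itself then fails.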

 - R.
--

From: Andy Grover
Date: Wednesday, January 28, 2009 - 7:20 pm

Woah, yup, thanks.

-- Andy
--

From: Olaf Kirch
Date: Thursday, January 29, 2009 - 2:02 pm

The refill code used to be single-threaded; and I think it still is. So
this can't race I think

Olaf
--

From: Roland Dreier
Date: Thursday, January 29, 2009 - 2:47 pm

 > > > This is racy.  You check if you're at the limit, do the allocation, and
 > > > then increment the atomic rds_ib_allocation count.  So many threads can
 > > > pass the atomic_read() test and then take you over the limit.  If you
 > > > want to make it safe then you could do atomic_inc_return() and check if
 > > > that took you over the limit.
 > >
 > > Woah, yup, thanks.
 > 
 > The refill code used to be single-threaded; and I think it still is. So
 > this can't race I think

So you don't need the atomic op at all?

 - R.
--

From: Steve Wise
Date: Tuesday, January 27, 2009 - 8:34 am

Hey Andy,

Why didn't you include the iWARP transport as well?



--

From: Andrew Grover
Date: Tuesday, January 27, 2009 - 12:29 pm

Hi Steve,

As I mentioned on IRC, there are some ib/iw coexistence issues and
other minor bugs to resolve, and then I will include the iWARP code.

Regards -- Andy
--

From: Roland Dreier
Date: Wednesday, January 28, 2009 - 3:37 pm

 > This patchset adds support for RDS as an Infiniband ULP. RDS is an
 > Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably,
 > and is used currently in Oracle RAC and Exadata products. It's lived
 > in OFED for 2+ years and I think it's time to get it upstream -- most
 > likely into your -next tree for .30, but if it snuck into .29 via the
 > "new code merge-window exception" then even better.

I'll read this over and comment, but to be honest I agree with Dave:
this is a new socket family, and as such it belongs under net/ and
probably should go through Dave's tree (just as the NFS/RDMA changes
went through the NFS trees and the 9p/RDMA changes went through the 9p
tree); even though it heavily uses RDMA I think the upper layer
interface to sockets/networking is more relevant.

 - R.
--

From: Andy Grover
Date: Wednesday, January 28, 2009 - 6:29 pm

OK no prob. I'll probably be ready early next week with an updated patchset.

Thanks -- Regards -- Andy
--
