[PATCH 02/03] 9prdma: Makefile change for the RDMA transport

Previous thread: patch x86-fix-smp-alternatives-use-mutex-instead-of-spinlock-text_poke-is-sleepable.patch added to 2.6.26-stable tree by gregkh on Wednesday, October 1, 2008 - 4:59 pm. (1 message)

Next thread: [PATCH] matroxfb: Support G200eV chip by Darrick J. Wong on Wednesday, October 1, 2008 - 5:21 pm. (2 messages)
From: Tom Tucker
Date: Wednesday, October 1, 2008 - 5:08 pm

Eric:

This patch series implements an RDMA Transport provider for 9P and 
is relative to your for-next branch.  The RDMA support is built on the 
OpenFabrics API and uses SEND and RECV to exchange data. This patch 
series has been tested with dbench and iozone.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Latchesar Ionkov <lionkov@lanl.gov>

[PATCH 01/03] 9prdma: RDMA Transport Support for 9P

 net/9p/trans_rdma.c |  996 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 996 insertions(+), 0 deletions(-)

[PATCH 02/03] 9prdma: Makefile change for the RDMA transport

 net/9p/Makefile |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

[PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport

 net/9p/Kconfig |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

--

From: Tom Tucker
Date: Wednesday, October 1, 2008 - 5:08 pm

This file implements the RDMA transport provider for 9P. It allows
mounts to be performed over iWARP and IB capable network interfaces
and uses the OpenFabrics API to perform I/O.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Latchesar Ionkov <lionkov@lanl.gov>

---
 net/9p/trans_rdma.c | 1025 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1025 insertions(+), 0 deletions(-)

diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
new file mode 100644
index 0000000..f919768
--- /dev/null
+++ b/net/9p/trans_rdma.c
@@ -0,0 +1,1025 @@
+/*
+ * linux/fs/9p/trans_rdma.c
+ *
+ * RDMA transport layer based on the trans_fd.c implementation.
+ *
+ *  Copyright (C) 2008 by Tom Tucker <tom@opengridcomputing.com>
+ *  Copyright (C) 2006 by Russ Cox <rsc@swtch.com>
+ *  Copyright (C) 2004-2005 by Latchesar Ionkov <lucho@ionkov.net>
+ *  Copyright (C) 2004-2008 by Eric Van Hensbergen <ericvh@gmail.com>
+ *  Copyright (C) 1997-2002 by Ron Minnich <rminnich@sarnoff.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License version 2
+ *  as published by the Free Software Foundation.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to:
+ *  Free Software Foundation
+ *  51 Franklin Street, Fifth Floor
+ *  Boston, MA  02111-1301  USA
+ *
+ */
+
+#include <linux/in.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/ipv6.h>
+#include <linux/kthread.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/un.h>
+#include <linux/uaccess.h>
+#include <linux/inet.h>
+#include ...
From: Tom Tucker
Date: Wednesday, October 1, 2008 - 5:08 pm

This adds a make rule for the 9pnet_rdma module that implements
the RDMA transport.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Latchesar Ionkov <lionkov@lanl.gov>

---
 net/9p/Makefile |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/9p/Makefile b/net/9p/Makefile
index 5192194..bc909ab 100644
--- a/net/9p/Makefile
+++ b/net/9p/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_NET_9P) := 9pnet.o
 obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o
+obj-$(CONFIG_NET_9P_RDMA) += 9pnet_rdma.o
 
 9pnet-objs := \
 	mod.o \
@@ -12,3 +13,6 @@ obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o
 
 9pnet_virtio-objs := \
 	trans_virtio.o \
+
+9pnet_rdma-objs := \
+	trans_rdma.o \
--

From: Tom Tucker
Date: Wednesday, October 1, 2008 - 5:08 pm

This patch adds a config option for the 9P RDMA transport.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Latchesar Ionkov <lionkov@lanl.gov>

---
 net/9p/Kconfig |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/9p/Kconfig b/net/9p/Kconfig
index ff34c5a..c42c0c4 100644
--- a/net/9p/Kconfig
+++ b/net/9p/Kconfig
@@ -20,6 +20,12 @@ config NET_9P_VIRTIO
 	  This builds support for a transports between
 	  guest partitions and a host partition.
 
+config NET_9P_RDMA
+	depends on NET_9P && INFINIBAND && EXPERIMENTAL
+	tristate "9P RDMA Transport (Experimental)"
+	help
+	  This builds support for a RDMA transport.
+
 config NET_9P_DEBUG
 	bool "Debug information"
 	depends on NET_9P
--

From: Roland Dreier
Date: Wednesday, October 1, 2008 - 10:11 pm

very cool... neat that it only takes 1000 lines of code to do this too.

a few quick comments from a cursory read:

 > This file implements the RDMA transport provider for 9P. It allows
 > mounts to be performed over iWARP and IB capable network interfaces
 > and uses the OpenFabrics API to perform I/O.

I don't like this "openfabrics API" terminology -- the RDMA API is just
one part of the kernel API that you're using, and I'm not sure it's
worth calling out specially.  ie just delete that third line from the
changelog.

 > +	atomic_t next_tag;

this seems to be a write-only field... delete it?

 > + * @wc_op: Mellanox's broken HW doesn't provide the original WR op
 > + * when the CQE completes in error. This forces apps to keep track of
 > + * the op themselves. Yes, it's a Pet Peeve of mine ;-)

there's nothing broken about this behavior -- the IB spec very
explicitly calls out that the opcode field is undefined for completions
with error status.

 > +find_req(struct p9_trans_rdma *rdma, u16 tag)

 > +	return found ? req : 0;

using 0 instead of NULL here... probably worth running all this through
sparse to see what else if flags.

 > +	if (atomic_inc_return(&rdma->sq_count) >= rdma->sq_depth)
 > +		wait_event_interruptible
 > +			(rdma->send_wait,
 > +			 (atomic_read(&rdma->sq_count) < rdma->sq_depth));

this worries me a bit... for one thing you use wait_event_interruptible()
without checking the return value, so it's entirely possible that this
gets woken up before the condition actually becomes true.  And to handle
that correctly, you'd need extra code to decrement the sq_count and wake
up other waiters if it was interrupted, etc.

I know counting semaphores are out of favor in the kernel but this seems
to be a good place to use real semaphores, rather than concocting your
own in a much more complicated way.

 > +	wait_for_completion_interruptible(&rdma->cm_done);

similarly there are lots of places you do this interruptible wait but
then don't check ...
From: Tom Tucker
Date: Monday, October 6, 2008 - 11:10 am

Well. The only place I do that is when the connection is being 
established. And actually, I don't(didn't?) really care why it stopped 
waiting, I  only care about the state of the connection and if it's not 
what I need it to be I error out. I suppose there is a theoretical race 
here where the user cancels the wait with ctrl-C, but the connection 
keeps going because the transport always gets back fast enough. This 
would eventually fail in rdma_rpc.


For IB it avoids IPoIB. Which I think is a big win. For iWARP, it's  not 

When using SEND/RECV, there isn't a protocol other than 9P itself. Once 
we add RDMA ops, one will be required to exchange RKEY, etc... But maybe 

ok.

Thanks for the review,

--

From: Roland Dreier
Date: Monday, October 6, 2008 - 3:35 pm

> > Is your protocol documented anywhere?  It seems npfs has support for the
 > > same transport.

 > When using SEND/RECV, there isn't a protocol other than 9P
 > itself. Once we add RDMA ops, one will be required to exchange RKEY,
 > etc... But maybe I don't understand the question?

No, that makes sense.  I don't know that much about 9p but I guess it's
fairly transport agnostic, so if you're just sending/receiving standard
9p messages then there's probably nothing to exchange.

 - R.
--

Previous thread: patch x86-fix-smp-alternatives-use-mutex-instead-of-spinlock-text_poke-is-sleepable.patch added to 2.6.26-stable tree by gregkh on Wednesday, October 1, 2008 - 4:59 pm. (1 message)

Next thread: [PATCH] matroxfb: Support G200eV chip by Darrick J. Wong on Wednesday, October 1, 2008 - 5:21 pm. (2 messages)