Hi Roland,
This patchset adds support for RDS as an Infiniband ULP. RDS is an
Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably,
and is used currently in Oracle RAC and Exadata products. It's lived
in OFED for 2+ years and I think it's time to get it upstream -- most
likely into your -next tree for .30, but if it snuck into .29 via the
"new code merge-window exception" then even better.
I've run checkpatch & sparse to clean up as many issues as possible so
what remains are really the design peculiarities (aka warts) that arise
from being a protocol designed by one company for a single critical
application. I think upstreaming this code is the first step towards
working out those issues, and making the end result available to a wider
audience.
Also available for review at:
git://git.openfabrics.org/~agrover/ofed_1_4/linux-2.6 for-roland
Thoughts? shortlog follows.
Thanks -- Regards -- Andy
Andy Grover (21):
RDS: Socket interface
RDS: Main header file
RDS: Congestion-handling code
RDS: Transport code
RDS: Info and stats
RDS: Connection handling
RDS: loopback
RDS: sysctls
RDS: Message parsing
RDS: send.c
RDS: recv.c
RDS: RDMA support
RDS/IB: Infiniband transport
RDS/IB: Ring-handling code.
RDS/IB: Implement RDMA ops using FMRs
RDS/IB: Implement IB-specific datagram send.
RDS/IB: Receive datagrams via IB
RDS/IB: Stats and sysctls
RDS: Documentation
RDS: Kconfig and Makefile
RDS: Add AF and PF #defines for RDS sockets
Documentation/networking/rds.txt | 356 +++++++++++
drivers/infiniband/Kconfig | 2 +
drivers/infiniband/Makefile | 1 +
drivers/infiniband/ulp/rds/Kconfig | 13 +
drivers/infiniband/ulp/rds/Makefile | 13 +
drivers/infiniband/ulp/rds/af_rds.c | 677 +++++++++++++++++++++
drivers/infiniband/ulp/rds/bind.c | 202 +++++++
...RDS handles per-socket congestion by updating peers with a complete congestion map (8KB). This code keeps track of these maps for itself and ones received from peers. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/cong.c | 424 +++++++++++++++++++++++++++++++++++++ 1 files changed, 424 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/cong.c diff --git a/drivers/infiniband/ulp/rds/cong.c b/drivers/infiniband/ulp/rds/cong.c new file mode 100644 index 0000000..b7c49d2 --- /dev/null +++ b/drivers/infiniband/ulp/rds/cong.c @@ -0,0 +1,424 @@ +/* + * Copyright (c) 2007 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE ...
On Mon, 26 Jan 2009 18:17:40 -0800 So this is starting to look like another "Oracle special" like AIO and HugeTLB. That comes with lots of caveats and restrictions on the application. --
On Mon, Jan 26, 2009 at 7:48 PM, Stephen Hemminger Yep it's a datacenter-centric protocol. Regards -- Andy --
Is this a rare condition? Is this protocol intended only for long-lived connections, and unsuitable for cases where lots of them are created and torn down quickly? -- Evgeniy Polyakov --
Connections are long-lived. Imagine a cluster. RDS multiplexes all sockets' datagrams between 2 hosts over a single transport-layer connection, so if a node sends ONE datagram to another, an IB connection is set up and sticks around indefinitely. Regards -- Andy --
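To illustrate the multiplexing described above, here is a minimal userspace sketch, not the RDS code itself. It assumes a connection is keyed only by the (local address, remote address) pair, so every socket talking between the same two hosts ends up on the same connection object. The names `host_conn`, `conn_get`, `send_datagram` and the fixed-size table are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CONNS 16

/* One transport-level connection per pair of hosts (hypothetical layout). */
struct host_conn {
	uint32_t laddr;               /* local IPv4 address */
	uint32_t faddr;               /* remote (foreign) IPv4 address */
	int in_use;
	unsigned long datagrams_sent;
};

static struct host_conn conn_table[MAX_CONNS];

/* Look up the connection for a host pair, creating it on first use. */
struct host_conn *conn_get(uint32_t laddr, uint32_t faddr)
{
	struct host_conn *free_slot = NULL;
	size_t i;

	for (i = 0; i < MAX_CONNS; i++) {
		struct host_conn *c = &conn_table[i];

		if (c->in_use && c->laddr == laddr && c->faddr == faddr)
			return c;             /* existing connection is reused */
		if (!c->in_use && !free_slot)
			free_slot = c;
	}
	if (!free_slot)
		return NULL;                  /* table full */

	free_slot->laddr = laddr;
	free_slot->faddr = faddr;
	free_slot->in_use = 1;
	free_slot->datagrams_sent = 0;
	return free_slot;                     /* stays around indefinitely */
}

/* Ports identify sockets, but they do not select the connection. */
int send_datagram(uint32_t laddr, uint16_t lport,
		  uint32_t faddr, uint16_t fport)
{
	struct host_conn *c = conn_get(laddr, faddr);

	(void)lport;
	(void)fport;
	if (!c)
		return -1;
	c->datagrams_sent++;
	return 0;
}
```

In this sketch, two sockets on different ports sending between the same hosts both increment the counter on one `host_conn`, which is the reason connection setup cost matters far less than it would for a connection-per-socket design.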
> +EXPORT_SYMBOL_GPL(rds_cong_map_updated); What is this being exported to? AFAICT you are only building a single RDS module, right? - R. --
In the current RDS development repo, transports are modularizable. For the initial upstream submission I just wanted to include the IB transport, so I changed the build to compile rds-core and rds-ib together, but didn't pull out the exports. Steve Wise is working on getting the iWARP transport debugged in the near future. Once that is added and we make transports modularizable, those exports will be needed. Take them out for now, then? Thanks -- Regards -- Andy --
Implement the RDS (Reliable Datagram Sockets) interface. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/af_rds.c | 677 +++++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/bind.c | 202 +++++++++++ 2 files changed, 879 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/af_rds.c create mode 100644 drivers/infiniband/ulp/rds/bind.c diff --git a/drivers/infiniband/ulp/rds/af_rds.c b/drivers/infiniband/ulp/rds/af_rds.c new file mode 100644 index 0000000..7158438 --- /dev/null +++ b/drivers/infiniband/ulp/rds/af_rds.c @@ -0,0 +1,677 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH ...
On Mon, 26 Jan 2009 18:17:38 -0800 Then I would think a high speed protocol would use per-cpu --
On Mon, Jan 26, 2009 at 7:46 PM, Stephen Hemminger For a spinlock guarding a socket list? I wouldn't think it would be worth the complexity. [other comments snipped, will fix per your advice, thanks!] Regards -- Andy --
Socket family implementations do not belong under the infiniband subdirectory. Put it under net/ instead. I don't care what the interdependencies happen to be. --
Roland, so you're ok with infiniband code under net/ ? (or should I split it up??) Thanks -- Regards -- Andy --
Hi Andy.
Global list of all sockets? This does not scale, maybe it should be
Do RDS sockets work with high number of creation/destruction
Are you absolutely sure that the provided poll_table callback will not do bad things here? It is quite unusual to add several different queues into the same head in the poll callback.
Is there a possibility of a lock interaction problem with the above
A hash table of the appropriate size will have faster lookup/access
IIRC there is a new %pi4 or similar format id.
-- Evgeniy Polyakov --
sch mentioned this too... is socket creation often a bottleneck? If so we can certainly improve scalability here. In any case, this is in the code to support a listing of RDS sockets via the rds-info utility.
Instead of having our own custom program to list rds sockets we probably want to export an interface so netstat will list them. Unfortunately netstat seems to be hardcoded to look for particular entries in /proc/net, so both rds and netstat would need to be updated before this would work, and RDS's custom
from the comments above that code:
 * We have to be careful about racing with the incoming path. sock_orphan()
 * sets SOCK_DEAD and we use that as an indicator to the rx path that new
 * messages shouldn't be queued.
I don't know. I looked into the poll_wait code a little and it appeared to be designed to allow multiple.
I didn't see anywhere where they were being acquired in reverse order, or simultaneously. This is the kind of thing that lockdep would find immediately, right? I think I've got that turned on but I'll double
Yup, will do. Thanks again.
Regards -- Andy --
Hi Andy.
It depends on the workload, but it becomes a noticeable portion of the overhead for multi-client short-lived connections. Likely these sockets will not be used for a web-server-like load, but something similar
Other sockets use a similar technique, but they are grouped into a hash table, so if you think that the number of sockets will be noticeably large or that they will be frequently created and removed, it may be worth pushing them
It depends on how poll_table was initialized and how its callback (invoked from poll_wait()) operates with the given queue and head. If you introduce your own polling, some care has to be taken there for the ordering of the wait queues and what their callbacks return when a polling event is found. For example, with your own initialization it is possible that when multiple queues are registered in the same table, only one of them will be awakened (its callback invoked). If you just hook into the existing machinery things should be ok though, so
If lockdep hits the bad race, then yes, it will report it. I just wondered that we spin lock under the read lock, so some bad
I meant that if there is an unsigned overflow this will suddenly become a small number, so network timestamping comparison logic can be used, but apparently neither address nor port are changed during the lifetime, so nothing special is needed.
-- Evgeniy Polyakov --
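For readers unfamiliar with the "network timestamping comparison logic" mentioned above, the following is a small sketch of wraparound-safe comparison of 32-bit counters, in the style of TCP sequence-number checks. It is an illustration only; as noted in the reply, the RDS code in question does not need it. The function name `seq_after` is hypothetical, and the cast relies on the two's-complement conversion that common compilers provide.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if counter value a is "later" than b, even when the
 * unsigned counter has wrapped past zero in between. The unsigned
 * subtraction wraps modulo 2^32; reinterpreting the result as signed
 * turns "a little ahead" into a small positive number and "a little
 * behind" into a negative one.
 */
bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}
```

A plain `a > b` would report that 2 is not after `UINT32_MAX - 1`, although the counter merely wrapped; `seq_after(2, UINT32_MAX - 1)` treats 2 as the later value. The scheme is only meaningful while the two values are less than 2^31 apart.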
RDS supports multiple transports. While this initial submission only supports Infiniband transport, this abstraction allows others to be added. We're working on an iWARP transport, and also see UDP over DCB as another possibility. This code handles transport registration. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/transport.c | 134 ++++++++++++++++++++++++++++++++ 1 files changed, 134 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/transport.c diff --git a/drivers/infiniband/ulp/rds/transport.c b/drivers/infiniband/ulp/rds/transport.c new file mode 100644 index 0000000..e78f8b3 --- /dev/null +++ b/drivers/infiniband/ulp/rds/transport.c @@ -0,0 +1,134 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR ...
Wow. Why not declare 15 as some constant and put it into rds_transport
Tabs have run away.
-- Evgeniy Polyakov --
It confuses tags and the like otherwise, and looks more consistent with the rest of the code. It is probably not a must, it just looks better. -- Evgeniy Polyakov --
Yup, will do, just was curious. Regards -- Andy --
In addition to some tunable parameters, we also make protocol # available here, since until accepted in upstream we do not have a fixed number assigned. This can be removed once upstream. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/sysctl.c | 164 +++++++++++++++++++++++++++++++++++ 1 files changed, 164 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/sysctl.c diff --git a/drivers/infiniband/ulp/rds/sysctl.c b/drivers/infiniband/ulp/rds/sysctl.c new file mode 100644 index 0000000..3337a3e --- /dev/null +++ b/drivers/infiniband/ulp/rds/sysctl.c @@ -0,0 +1,164 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR ...
While arguably the fact that the underlying transport needs a connection to convey RDS's datagrams reliably is not important to rds proper, the transports implemented so far (IB and TCP) have both been connection-oriented, and so the connection state machine-related code is in the common rds code. This patch also includes several work items, to handle connecting, sending, receiving, and shutdown. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/connection.c | 501 +++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/threads.c | 273 +++++++++++++++++ 2 files changed, 774 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/connection.c create mode 100644 drivers/infiniband/ulp/rds/threads.c diff --git a/drivers/infiniband/ulp/rds/connection.c b/drivers/infiniband/ulp/rds/connection.c new file mode 100644 index 0000000..6174629 --- /dev/null +++ b/drivers/infiniband/ulp/rds/connection.c @@ -0,0 +1,501 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY ...
This one is eventually invoked under the spin_lock with irqs turned off, which may freeze the machine: rds_for_each_conn_info() -> spin_lock_irqsave(global lock) -> rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending() -> boom. I did not check further though. -- Evgeniy Polyakov --
Why? This is _trylock. It won't block. Regards Oliver --
Unlock may reschedule. -- Evgeniy Polyakov --
mutex_trylock() uses spin_lock_mutex() which has this in the debug version: DEBUG_LOCKS_WARN_ON(in_interrupt()); --
What's the best way to fix this? This is all so rds-info can print out a nice list of connections, and if they're sending or not. I don't see an easy way to fix this. A _trylock-like function that didn't grab it would be nice? I can always just not report this particular bit of info, that actually might be easiest. Regards -- Andy --
You use atomic variables for the other cases; add another one here to mark a locked connection. It looks ugly, but at least it does not crash. -- Evgeniy Polyakov --
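The suggestion above can be sketched in userspace C11 as follows. This is an assumption-laden illustration, not the RDS implementation: the struct `send_state` and the three helpers are hypothetical names, and C11 `<stdatomic.h>` stands in for the kernel's atomic primitives. The point is that the reporting path only reads a flag, so it never takes or releases a mutex while a spinlock is held with interrupts off.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-connection state: 1 while a sender owns the send path. */
struct send_state {
	atomic_int sending;
};

/* Try to become the sender; fails without blocking if someone else is. */
bool send_try_acquire(struct send_state *s)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&s->sending, &expected, 1);
}

/* Give up the send path. */
void send_release(struct send_state *s)
{
	atomic_store(&s->sending, 0);
}

/*
 * Reporting path: a plain atomic read, safe to call from a context
 * that must not sleep or reschedule.
 */
bool send_is_active(struct send_state *s)
{
	return atomic_load(&s->sending) != 0;
}
```

The value read by `send_is_active()` can be stale by the time it is printed, which seems acceptable for an informational listing such as the one rds-info produces.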
Upon receiving a datagram from the transport, RDS parses the headers and potentially queues an ACK. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/recv.c | 550 +++++++++++++++++++++++++++++++++++++ 1 files changed, 550 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/recv.c diff --git a/drivers/infiniband/ulp/rds/recv.c b/drivers/infiniband/ulp/rds/recv.c new file mode 100644 index 0000000..691f8cb --- /dev/null +++ b/drivers/infiniband/ulp/rds/recv.c @@ -0,0 +1,550 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + ...
Some transports may support RDMA features. This handles the non-transport-specific parts, like pinning user pages and tracking mapped regions. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/rdma.c | 682 +++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/rdma.h | 84 ++++ drivers/infiniband/ulp/rds/rds_rdma.h | 245 ++++++++++++ 3 files changed, 1011 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/rdma.c create mode 100644 drivers/infiniband/ulp/rds/rdma.h create mode 100644 drivers/infiniband/ulp/rds/rds_rdma.h diff --git a/drivers/infiniband/ulp/rds/rdma.c b/drivers/infiniband/ulp/rds/rdma.c new file mode 100644 index 0000000..00e3450 --- /dev/null +++ b/drivers/infiniband/ulp/rds/rdma.c @@ -0,0 +1,682 @@ +/* + * Copyright (c) 2007 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL ...
A simple rds transport to handle loopback connections. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/loop.c | 189 +++++++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/loop.h | 9 ++ 2 files changed, 198 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/loop.c create mode 100644 drivers/infiniband/ulp/rds/loop.h diff --git a/drivers/infiniband/ulp/rds/loop.c b/drivers/infiniband/ulp/rds/loop.c new file mode 100644 index 0000000..40fa729 --- /dev/null +++ b/drivers/infiniband/ulp/rds/loop.c @@ -0,0 +1,189 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE ...
RDS is a reliable datagram protocol used for IPC on Oracle
database clusters. This adds address and protocol family numbers
for it.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
include/linux/socket.h | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 20fc4bb..fda91af 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -191,7 +191,8 @@ struct ucred {
#define AF_RXRPC 33 /* RxRPC sockets */
#define AF_ISDN 34 /* mISDN sockets */
#define AF_PHONET 35 /* Phonet sockets */
-#define AF_MAX 36 /* For now.. */
+#define AF_RDS 36 /* RDS sockets */
+#define AF_MAX 37 /* For now.. */
/* Protocol families, same as address families. */
#define PF_UNSPEC AF_UNSPEC
@@ -229,6 +230,7 @@ struct ucred {
#define PF_RXRPC AF_RXRPC
#define PF_ISDN AF_ISDN
#define PF_PHONET AF_PHONET
+#define PF_RDS AF_RDS
#define PF_MAX AF_MAX
/* Maximum queue length specifiable by listen. */
--
1.5.6.3
--
You also need to add lock class declaration to net/core/sock.c, I believe. -- Rémi Denis-Courmont Maemo Software, Nokia Devices R&D --
On Mon, Jan 26, 2009 at 11:27 PM, Rémi Denis-Courmont Very true, thanks. Regards -- Andy --
This is the code to send an RDS datagram. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/send.c | 1006 +++++++++++++++++++++++++++++++++++++ 1 files changed, 1006 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/send.c diff --git a/drivers/infiniband/ulp/rds/send.c b/drivers/infiniband/ulp/rds/send.c new file mode 100644 index 0000000..276f7ac --- /dev/null +++ b/drivers/infiniband/ulp/rds/send.c @@ -0,0 +1,1006 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include <linux/kernel.h> +#include ...
IB-specific stats and sysctls. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/ib_stats.c | 95 ++++++++++++++++++++++ drivers/infiniband/ulp/rds/ib_sysctl.c | 137 ++++++++++++++++++++++++++++++++ 2 files changed, 232 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib_stats.c create mode 100644 drivers/infiniband/ulp/rds/ib_sysctl.c diff --git a/drivers/infiniband/ulp/rds/ib_stats.c b/drivers/infiniband/ulp/rds/ib_stats.c new file mode 100644 index 0000000..02e3e3d --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_stats.c @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH ...
Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/ib_ring.c | 168 ++++++++++++++++++++++++++++++++++ 1 files changed, 168 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib_ring.c diff --git a/drivers/infiniband/ulp/rds/ib_ring.c b/drivers/infiniband/ulp/rds/ib_ring.c new file mode 100644 index 0000000..d23cc59 --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_ring.c @@ -0,0 +1,168 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include <linux/kernel.h> + +#include "rds.h" +#include "ib.h" + +/* + * Locking for ...
Registers as an RDS transport and an IB client, and uses IB CM API to allocate ids, queue pairs, and the rest of that fun stuff. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/ib.c | 312 +++++++++++++ drivers/infiniband/ulp/rds/ib.h | 358 +++++++++++++++ drivers/infiniband/ulp/rds/ib_cm.c | 882 ++++++++++++++++++++++++++++++++++++ 3 files changed, 1552 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib.c create mode 100644 drivers/infiniband/ulp/rds/ib.h create mode 100644 drivers/infiniband/ulp/rds/ib_cm.c diff --git a/drivers/infiniband/ulp/rds/ib.c b/drivers/infiniband/ulp/rds/ib.c new file mode 100644 index 0000000..cd35fba --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib.c @@ -0,0 +1,312 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT ...
RDS's main data structure definitions and exported functions. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/rds.h | 763 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 763 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/rds.h diff --git a/drivers/infiniband/ulp/rds/rds.h b/drivers/infiniband/ulp/rds/rds.h new file mode 100644 index 0000000..133a237 --- /dev/null +++ b/drivers/infiniband/ulp/rds/rds.h @@ -0,0 +1,763 @@ +#ifndef _RDS_H +#define _RDS_H + +#include <net/sock.h> +#include <linux/scatterlist.h> +#include <asm/atomic.h> + +#include <linux/mutex.h> +#include "rds_rdma.h" + +/* + * RDS Network protocol version + */ +#define RDS_PROTOCOL_3_0 0x0300 +#define RDS_PROTOCOL_3_1 0x0301 +#define RDS_PROTOCOL_VERSION RDS_PROTOCOL_3_1 +#define RDS_PROTOCOL_MAJOR(v) ((v) >> 8) +#define RDS_PROTOCOL_MINOR(v) ((v) & 255) +#define RDS_PROTOCOL(maj, min) (((maj) << 8) | min) + +/* + * XXX randomly chosen, but at least seems to be unused: + * # 18464-18768 Unassigned + * We should do better. We want a reserved port to discourage unpriv'ed + * userspace from listening. + */ +#define RDS_PORT 18634 + +#ifndef AF_RDS +#define AF_RDS 28 /* Reliable Datagram Socket */ +#endif + +#ifndef PF_RDS +#define PF_RDS AF_RDS +#endif + +#ifndef SOL_RDS +#define SOL_RDS 272 +#endif + +#define KERNEL_HAS_PROTO_REGISTER 1 +#define KERNEL_HAS_INET_SK_RETURNING_INET_SOCK 1 +#define KERNEL_HAS_CORE_CALLING_DEV_IOCTL 1 + +#ifdef ATOMIC64_INIT +#define KERNEL_HAS_ATOMIC64 +#endif + +/* x86-64 doesn't include kmap_types.h from anywhere */ +#include <asm/kmap_types.h> +#include <linux/highmem.h> + +#include "info.h" + +#ifdef DEBUG +#define rdsdebug(fmt, args...) pr_debug("%s(): " fmt, __func__ , ##args) +#else +/* sigh, pr_debug() causes unused variable warnings */ +static inline void __attribute__ ((format (printf, 1, 2))) +rdsdebug(char ...
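A side note on the version macros in the rds.h hunk above, as a minimal userspace sketch rather than the posted code: the major number sits in the high byte and the minor in the low byte, so packing and unpacking round-trip. The macro names mirror the patch. The extra parentheses around `min` and the `rds_pack_version()` helper are assumptions added for illustration.

```c
#include <assert.h>

/* Same encoding as the patch: major in the high byte, minor in the low byte. */
#define RDS_PROTOCOL_3_0	0x0300
#define RDS_PROTOCOL_3_1	0x0301
#define RDS_PROTOCOL_MAJOR(v)	((v) >> 8)
#define RDS_PROTOCOL_MINOR(v)	((v) & 255)
/*
 * Assumption: `min` is parenthesized here (the posted macro leaves it bare),
 * so that an expression argument such as `a ? b : c` expands safely.
 */
#define RDS_PROTOCOL(maj, min)	(((maj) << 8) | (min))

/* Hypothetical helper: pack a version after checking both parts fit in a byte. */
static unsigned int rds_pack_version(unsigned int maj, unsigned int min)
{
	assert(maj <= 255 && min <= 255);
	return RDS_PROTOCOL(maj, min);
}
```

With the bare `min` of the posted macro, `RDS_PROTOCOL(3, 0 ? 9 : 1)` would parse as `((3 << 8) | 0) ? 9 : 1` because `|` binds tighter than `?:`.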
Internet transport protocol port number? IANA has a process for assigning port numbers to proprietary protocols. Not that I'd blame you, as I inherited VLC media player's wide abuse of port

You should probably remove that and put the last patch of your series ahead of

This is used by RXRPC nowadays, although I myself don't really understand why socket option levels need to be unique across all families.

-- 
Rémi Denis-Courmont
Maemo Software, Nokia Devices R&D
--
On Mon, Jan 26, 2009 at 11:34 PM, Rémi Denis-Courmont
OK, shouldn't be too hard to fix. Thanks.
Regards -- Andy
--
RDS errors out. Yeah we're going to want to get an assigned port at some point, I guess. Regards -- Andy --
You should start that process now. It took a while to get nfsrdma's port number through... --
Parsing of newly-received RDS message headers (including ext. headers) and copy-to/from-user routines. page.c implements a per-cpu page remainder cache, to reduce the number of allocations needed for small datagrams. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/message.c | 414 ++++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/page.c | 222 ++++++++++++++++++ 2 files changed, 636 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/message.c create mode 100644 drivers/infiniband/ulp/rds/page.c diff --git a/drivers/infiniband/ulp/rds/message.c b/drivers/infiniband/ulp/rds/message.c new file mode 100644 index 0000000..5cad4d5 --- /dev/null +++ b/drivers/infiniband/ulp/rds/message.c @@ -0,0 +1,414 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR ...
Specific to IB is a credits-based flow control mechanism, in addition to the expected usage of the IB API to package outgoing data into work requests. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/ib_send.c | 852 ++++++++++++++++++++++++++++++++++ 1 files changed, 852 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib_send.c diff --git a/drivers/infiniband/ulp/rds/ib_send.c b/drivers/infiniband/ulp/rds/ib_send.c new file mode 100644 index 0000000..20af976 --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_send.c @@ -0,0 +1,852 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE ...
RDS currently generates a lot of stats that are accessible via the rds-info utility. This code implements the support for this. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/info.c | 243 ++++++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/info.h | 43 +++++++ drivers/infiniband/ulp/rds/stats.c | 150 ++++++++++++++++++++++ 3 files changed, 436 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/info.c create mode 100644 drivers/infiniband/ulp/rds/info.h create mode 100644 drivers/infiniband/ulp/rds/stats.c diff --git a/drivers/infiniband/ulp/rds/info.c b/drivers/infiniband/ulp/rds/info.c new file mode 100644 index 0000000..ff3ba1c --- /dev/null +++ b/drivers/infiniband/ulp/rds/info.c @@ -0,0 +1,243 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR ...
Those bug_ons look quite scary, is there a way to actually have a wrong optname? Plus, those _INFO definitions are declared twice in the code,

This one is used to temporarily map some address, but functions called between map and unmap functions (like rds_info_getsockopt()) may sleep, which is wrong.

-- 
Evgeniy Polyakov
--
Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/ib_rdma.c | 641 ++++++++++++++++++++++++++++++++++ 1 files changed, 641 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib_rdma.c diff --git a/drivers/infiniband/ulp/rds/ib_rdma.c b/drivers/infiniband/ulp/rds/ib_rdma.c new file mode 100644 index 0000000..69a6289 --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_rdma.c @@ -0,0 +1,641 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include <linux/kernel.h> + +#include "rds.h" +#include "rdma.h" +#include ...
This file documents the specifics of the RDS sockets API, as well as covering some of the details of its internal implementation. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- Documentation/networking/rds.txt | 356 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 356 insertions(+), 0 deletions(-) create mode 100644 Documentation/networking/rds.txt diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt new file mode 100644 index 0000000..c67077c --- /dev/null +++ b/Documentation/networking/rds.txt @@ -0,0 +1,356 @@ + +Overview +======== + +This readme tries to provide some background on the hows and whys of RDS, +and will hopefully help you find your way around the code. + +In addition, please see this email about RDS origins: +http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html + +RDS Architecture +================ + +RDS provides reliable, ordered datagram delivery by using a single +reliable connection between any two nodes in the cluster. This allows +applications to use a single socket to talk to any other process in the +cluster - so in a cluster with N processes you need N sockets, in contrast +to N*N if you use a connection-oriented socket transport like TCP. + +RDS is not Infiniband-specific; it was designed to support different +transports. The current implementation used to support RDS over TCP as well +as IB. Work is in progress to support RDS over iWARP, and using DCE to +guarantee no dropped packets on Ethernet, it may be possible to use RDS over +UDP in the future. + +The high-level semantics of RDS from the application's point of view are + + * Addressing + RDS uses IPv4 addresses and 16bit port numbers to identify + the end point of a connection. All socket operations that involve + passing addresses between kernel and user space generally + use a struct sockaddr_in. + + The fact that IPv4 addresses are used does not mean the underlying + ...
Add RDS Kconfig and Makefile, and modify infiniband's to add us to the build.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/Kconfig          |    2 ++
 drivers/infiniband/Makefile         |    1 +
 drivers/infiniband/ulp/rds/Kconfig  |   13 +++++++++++++
 drivers/infiniband/ulp/rds/Makefile |   13 +++++++++++++
 4 files changed, 29 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/Kconfig
 create mode 100644 drivers/infiniband/ulp/rds/Makefile

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index dd0db67..1cba524 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -54,4 +54,6 @@ source "drivers/infiniband/ulp/srp/Kconfig"
 
 source "drivers/infiniband/ulp/iser/Kconfig"
 
+source "drivers/infiniband/ulp/rds/Kconfig"
+
 endif # INFINIBAND
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index ed35e44..39d0203 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_INFINIBAND_NES) += hw/nes/
 obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/
+obj-$(CONFIG_INFINIBAND_ISER) += ulp/rds/
diff --git a/drivers/infiniband/ulp/rds/Kconfig b/drivers/infiniband/ulp/rds/Kconfig
new file mode 100644
index 0000000..bbc2ba4
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/Kconfig
@@ -0,0 +1,13 @@
+
+config INFINIBAND_RDS
+	tristate "Reliable Datagram Sockets (RDS) (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	---help---
+	  RDS provides reliable, sequenced delivery of datagrams
+	  over Infiniband.
+
+config INFINIBAND_RDS_DEBUG
+	bool "Debugging messages"
+	depends on INFINIBAND_RDS
+	default n
+
diff --git a/drivers/infiniband/ulp/rds/Makefile b/drivers/infiniband/ulp/rds/Makefile
new file mode 100644
index 0000000..d470550
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/Makefile
@@ -0,0 +1,13 ...
> +obj-$(CONFIG_INFINIBAND_ISER) += ulp/rds/
Typo for ..._RDS
> +config INFINIBAND_RDS_DEBUG
> +	bool "Debugging messages"
> +	depends on INFINIBAND_RDS
> +	default n
No way to enable this? Disabled by default? You really want debugging
messages to be built by default and controlled at runtime ... otherwise
debugging end-user installations is a pain (they just install what the
distro gives them, and it's very hard for them to rebuild just to enable
debugging).
> +ib_rds-y := af_rds.o bind.o cong.o connection.o info.o message.o \
> +	recv.o send.o stats.o sysctl.o threads.o transport.o \
> +	loop.o page.o rdma.o
> +
> +ib_rds-y += ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
> +	ib_sysctl.o ib_rdma.o
a very strange way to write an assignment statement...
- R.
--
So the solution is just to base debug message output on a variable, instead of a config option? RDS actually does do this a little already, so converting totally isn't hard. I hadn't seen mention this was preferable -- indeed, tons of drivers and subsystems have options for compile-time debug statements, should these be converted?

RDS is implemented as a core sockets layer and then a transport layer. IB is currently the only transport so I thought it made sense to just compile them together, but once there are >1 then RDS's IB support could be broken out into its own module.

Regards -- Andy
--
> So the solution is just to base debug message output on a variable,
> instead of a config option? RDS actually does do this a little already,
> so converting totally isn't hard. I hadn't seen mention this was
> preferable -- indeed, tons of drivers and subsystems have options for
> compile-time debug statements, should these be converted?
My experience is definitely that compile-time switches are a big pain
when you actually have to debug something that can only be reproduced on
someone else's setup (which will happen once users start using your
stuff). You probably can use the dynamic_printk stuff that went in
recently to make this all very clean and standard.
- R.
--
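To make the runtime-versus-compile-time point concrete, here is a minimal userspace sketch in which the debug call is always compiled in and a variable decides whether anything is printed. It is not the kernel's dynamic_printk interface. `rds_debug_enabled` and `rds_dbg()` are hypothetical names, and in a module the switch would more plausibly be a module parameter or sysctl.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical runtime switch; always built in, off by default. */
static int rds_debug_enabled;

/*
 * Print "func(): message" to stderr when debugging is on.
 * Returns the number of characters written, or 0 when debugging is off.
 */
static int rds_dbg(const char *func, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (!rds_debug_enabled)
		return 0;

	n = fprintf(stderr, "%s(): ", func);
	va_start(ap, fmt);
	n += vfprintf(stderr, fmt, ap);
	va_end(ap);
	return n;
}

/* Same call shape as the rdsdebug() macro in rds.h, minus the #ifdef DEBUG. */
#define rdsdebug(...) rds_dbg(__func__, __VA_ARGS__)
```

Because the arguments are always type-checked and "used", this form also avoids the unused-variable warnings that the comment in rds.h complains about, at the cost of one branch per call site.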
Header parsing, ring refill. It puts the incoming data into an rds_incoming struct, which is passed up to rds-core. Signed-off-by: Andy Grover <andy.grover@oracle.com> --- drivers/infiniband/ulp/rds/ib_recv.c | 894 ++++++++++++++++++++++++++++++++++ 1 files changed, 894 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib_recv.c diff --git a/drivers/infiniband/ulp/rds/ib_recv.c b/drivers/infiniband/ulp/rds/ib_recv.c new file mode 100644 index 0000000..516f858 --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_recv.c @@ -0,0 +1,894 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN ...
> +static int rds_ib_recv_refill_one(struct rds_connection *conn,
> + struct rds_ib_recv_work *recv,
> + gfp_t kptr_gfp, gfp_t page_gfp)
> +{
> + struct rds_ib_connection *ic = conn->c_transport_data;
> + dma_addr_t dma_addr;
> + struct ib_sge *sge;
> + int ret = -ENOMEM;
> +
> + if (recv->r_ibinc == NULL) {
> + if (atomic_read(&rds_ib_allocation) >= rds_ib_sysctl_max_recv_allocation) {
> + rds_ib_stats_inc(s_ib_rx_alloc_limit);
> + goto out;
> + }
> + recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab,
> + kptr_gfp);
> + if (recv->r_ibinc == NULL)
> + goto out;
> + atomic_inc(&rds_ib_allocation);
This is racy. You check if you're at the limit, do the allocation, and
then increment the atomic rds_ib_allocation count. So many threads can
pass the atomic_read() test and then take you over the limit. If you
want to make it safe then you could do atomic_inc_return() and check if
that took you over the limit.
- R.
--
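The increment-then-check approach suggested above can be sketched in userspace C11 as follows. This is an illustration under assumptions, not the RDS code: `allocation_count` and `max_allocation` stand in for `rds_ib_allocation` and `rds_ib_sysctl_max_recv_allocation`, and `atomic_fetch_add(...) + 1` plays the role of the kernel's `atomic_inc_return()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int allocation_count;	/* stand-in for rds_ib_allocation */
static int max_allocation = 2;		/* stand-in for the sysctl limit */

/*
 * Reserve a slot first, then check the limit. Each caller sees a distinct
 * post-increment value, so at most max_allocation callers can succeed;
 * a caller that overshoots backs its increment out again.
 */
static bool reserve_allocation(void)
{
	/* atomic_fetch_add() returns the old value; old + 1 is the new count. */
	if (atomic_fetch_add(&allocation_count, 1) + 1 > max_allocation) {
		atomic_fetch_sub(&allocation_count, 1);
		return false;
	}
	return true;
}

/* Give a slot back, e.g. when the allocation that followed the reservation failed. */
static void release_allocation(void)
{
	atomic_fetch_sub(&allocation_count, 1);
}
```

In the refill path this would mean calling the reserve step before `kmem_cache_alloc()` and releasing the slot if that allocation returns NULL. The counter may briefly exceed the limit while a loser backs out, but no allocation is made during that window.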
The refill code used to be single-threaded; and I think it still is. So
this can't race I think
Olaf
--
> > > This is racy. You check if you're at the limit, do the allocation, and
> > > then increment the atomic rds_ib_allocation count. So many threads can
> > > pass the atomic_read() test and then take you over the limit. If you
> > > want to make it safe then you could do atomic_inc_return() and check if
> > > that took you over the limit.
> >
> > Woah, yup, thanks.
>
> The refill code used to be single-threaded; and I think it still is. So
> this can't race I think
So you don't need the atomic op at all?
- R.
--
Hey Andy, Why didn't you include the iWARP transport as well? --
Hi Steve, As I mentioned on IRC, there are some ib/iw coexistence issues and other minor bugs to resolve, and then I will include the iWARP code. Regards -- Andy --
> This patchset adds support for RDS as an Infiniband ULP. RDS is an
> Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably,
> and is used currently in Oracle RAC and Exadata products. It's lived
> in OFED for 2+ years and I think it's time to get it upstream -- most
> likely into your -next tree for .30, but if it snuck into .29 via the
> "new code merge-window exception" then even better.
I'll read this over and comment, but to be honest I agree with Dave:
this is a new socket family, and as such it belongs under net/ and
probably should go through Dave's tree (just as the NFS/RDMA changes
went through the NFS trees and the 9p/RDMA changes went through the 9p
tree); even though it heavily uses RDMA I think the upper layer
interface to sockets/networking is more relevant.
- R.
--
OK no prob. I'll probably be ready early next week with an updated patchset. Thanks -- Regards -- Andy --
