InfiniBand/iWARP/RDMA merge plans for 2.6.26 (what's in infiniband.git)

Previous thread: Re: Penguins can't fly by Aaron Gray on Tuesday, April 1, 2008 - 3:53 pm. (2 messages)

Next thread: Linux 2.6.25-rc8 by Linus Torvalds on Tuesday, April 1, 2008 - 4:08 pm. (11 messages)
To: <general@...>, <linux-kernel@...>
Date: Tuesday, April 1, 2008 - 4:02 pm

The 2.6.26 merge window will open soon, so it's time to review what my
plans are for when the merge window opens.

As usual, patch review by non-me people is always welcome.

Anyway, here are all the pending things that I'm aware of. As usual,
if something isn't already in my tree and isn't listed below, I
probably missed it or dropped it by mistake. Please remind me again
in that case.

Core:

- I did a bunch of cleanups all over drivers/infiniband and the
gcc and sparse warning noise is down to a pretty reasonable level.
Further cleanups welcome of course.

ULPs:

- I merged Eli's IPoIB stateless offload changes for checksum
offload and LSO. The interrupt moderation changes are
next, and should not be a problem to merge. Please test IPoIB
on all sorts of hardware!

- Shirley's IPoIB 4 KB MTU changes. I expect these to make it in,
although I would certainly appreciate review from Eli or anyone else.

HW specific:

- Vlad's mlx4 resize CQ support. Looks basically OK, so I think we
should be able to get it in.

- ipath support for 7220 HCAs. I don't expect any issues here once
the patches appear.

Here are a few topics that I believe will not be ready in time for the
2.6.26 window and will need to wait for 2.6.27 at least:

- XRC. I still don't have a good feeling that we have settled on all
the nuances of the ABI we want to expose to userspace for this, and
ideally I would like to understand how ehca LL QPs fit into the
picture as well.

- Remove LLTX from IPoIB. I haven't had time to finish this yet, so
I guess it will probably wait for 2.6.27 now...

- Multiple CQ event vector support. I still haven't seen any
discussions about how ULPs or userspace apps should decide which
vector to use, and hence no progress has been made since we
deferred this during the 2.6.23 merge window.

Here are all the patches I already have in my for-2.6.26 branch:

Arthur Jones (4):
IB/ipath: Fix sparse warning about pointer s...

To: Roland Dreier <rdreier@...>
Cc: <general@...>, <linux-kernel@...>
Date: Wednesday, April 2, 2008 - 8:31 am

We want to add send with invalidate & mask compare and swap.
Eli will be able to send the patches next week and since they are small
I think they can be in for 2.6.26.

What about the split CQ for UD mode? It's improved the IPoIB performance
for small messages significantly.

mlx4- we plan to send patches for the low level driver only to enable
mlx4_en. These only affect our low level driver.

I think we should try to push for XRC in 2.6.26 since there are already
MPI implementations that use it and this ties them to use OFED only.
Also this feature is stable and now being defined in IBTA.
Not taking it causes differences between OFED and the kernel and your
libibverbs, and we wish to avoid such gaps.
Is there anything we can do to help and make it into 2.6.26?

--

To: Tziporet Koren <tziporet@...>
Cc: Roland Dreier <rdreier@...>, <linux-kernel@...>, <general@...>, Dror Goldenberg <gdror@...>
Date: Friday, April 4, 2008 - 1:54 am

On Wed, Apr 2, 2008 at 3:31 PM, Tziporet Koren

Does send with invalidate apply to rkeys generated through the
proprietary FMR API?
If not, what usage do you envision for the new verb on today's IB devices?

Or.
--

To: <tziporet@...>
Cc: <general@...>, <linux-kernel@...>
Date: Wednesday, April 2, 2008 - 12:19 pm

> We want to add send with invalidate & mask compare and swap.
> Eli will be able to send the patches next week and since they are
> small I think they can be in for 2.6.26

Send with invalidate should be OK. Let's see about the masked atomics
stuff -- we have a ton of new verbs and I think we might want to slow
down and make sure it all makes sense.
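
For readers unfamiliar with the masked atomics mentioned above, here is a minimal sketch of what a masked compare and swap does. It is a toy model in Python over a dictionary standing in for remote memory, not a verbs API. The function name, the parameter names, and the exact semantics (compare only the bits selected by one mask, replace only the bits selected by another) are assumptions for illustration and are not taken from this thread.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit atomic operand


def masked_cmp_swap(memory, addr, compare, compare_mask, swap, swap_mask):
    """Toy model of a masked compare and swap on memory[addr].

    Assumed semantics (illustrative only):
      - only the bits selected by compare_mask take part in the comparison
      - on a match, only the bits selected by swap_mask are replaced
      - the original value is always returned to the requester
    """
    old = memory[addr] & MASK64
    if ((old ^ compare) & compare_mask) == 0:
        memory[addr] = ((old & ~swap_mask) | (swap & swap_mask)) & MASK64
    return old


# Hypothetical layout: low byte is a state field, next byte is an owner field.
mem = {0: 0x00A5}

# Match on the state byte only, and update the owner byte only.
first = masked_cmp_swap(mem, 0, compare=0xA5, compare_mask=0xFF,
                        swap=0x1200, swap_mask=0xFF00)

# The state byte is still 0xA5, so comparing against 0x00 fails; no change.
second = masked_cmp_swap(mem, 0, compare=0x00, compare_mask=0xFF,
                         swap=0x3400, swap_mask=0xFF00)
```

With both masks set to all ones this reduces to an ordinary compare and swap, which is one way to see why the masked form is an extension rather than a separate primitive.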

> What about the split CQ for UD mode? It's improved the IPoIB
> performance for small messages significantly.

Oh yeah... I'll try to get that in too.

> mlx4- we plan to send patches for the low level driver only to enable
> mlx4_en. These only affect our low level driver.

No problem in principle, let's see the actual patches.

> I think we should try to push for XRC in 2.6.26 since there are
> already MPI implementation that use it and this ties them to use OFED
> only.
> Also this feature is stable and now being defined in IBTA
> Not taking it causing changes between OFED and the kernel and your
> libibverbs and we wish to avoid such gaps.
> Is there any thing we can do to help and make it into 2.6.26?

I don't have a good feeling that the user-kernel interface is well
thought out, so I want to consider XRC + ehca LL stuff + new iWARP verbs
and make sure we have something that makes sense for the future.

- R.
--

To: Roland Dreier <rdreier@...>
Cc: <tziporet@...>, <linux-kernel@...>, <general@...>
Date: Friday, April 4, 2008 - 4:26 pm

> We want to add send with invalidate & mask compare and swap.
> Eli will be able to send the patches next week and since they are
> small I think they can be in for 2.6.26

We are very interested in these new operations and are moving in the
direction of tightly integrating RDMA along with atomics (if available)
into Oracle. We plan on testing some early prototypes of these in
the next few months.

Send with invalidate is an exact match for our current RDS V3 rdma
driver - and should be more efficient than the current background
syncing of the tpt to ensure keys are invalidated.

We intend on exposing the atomics via the RDS driver along with simple
low level rdma operations to Oracle's internal clients. If Oracle is
running over a transport which exports atomics and rdma - Oracle will
see a dramatic performance boost for several database operations.

--

To: Richard Frank <richard.frank@...>
Cc: <tziporet@...>, <linux-kernel@...>, <general@...>
Date: Friday, April 4, 2008 - 3:34 pm

> We are very interested in these new operations and are moving in the
> direction of tightly integrating RDMA along with atomics (if
> available) into Oracle. We plan on testing some early prototypes of
> these in the next few months.

And you need the ConnectX-only masked atomics? Or do the standard IB
atomic operations work for you? Of course using atomics at all means
that things don't work on iWARP.

> Send with invalidate is an exact match for our current RDS V3 rdma
> driver - and should be more efficient than the current background
> syncing of the tpt to ensure keys are invalidated.

How does send with invalidate interact with the current IB FMR stuff?
Seems that you would run into trouble keeping the state of the FMR
straight if the remote side is invalidating them.

Also I would think that send-with-invalidate would be much more
expensive than the current FMR method of batching up the invalidates,
since you don't get to amortize the cost of syncing up all the internal
HCA state.
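
The amortization argument above can be made concrete with a small cost model. This is only a sketch: the function names, the cost parameters, and the assumption that an unbatched invalidate pays one full sync per key are hypothetical, and the units are arbitrary. Whether send with invalidate really behaves this way on a given HCA is the open question in this thread.

```python
def cost_per_key_batched(unmap_cost, sync_cost, batch_size):
    """Average cost per key when one sync is shared by batch_size keys."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return unmap_cost + sync_cost / batch_size


def cost_per_key_unbatched(unmap_cost, sync_cost):
    """Cost per key if every invalidate pays for its own sync (assumption)."""
    return unmap_cost + sync_cost


# Arbitrary illustrative units, not measurements.
batched = cost_per_key_batched(unmap_cost=1, sync_cost=32, batch_size=32)
unbatched = cost_per_key_unbatched(unmap_cost=1, sync_cost=32)
```

In this model the gap between the two grows with the sync cost and the batch size, and it vanishes if the hardware can invalidate a single key without a full sync.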

- R.
--

To: Roland Dreier <rdreier@...>
Cc: <tziporet@...>, <linux-kernel@...>, <general@...>
Date: Friday, April 4, 2008 - 6:21 pm

We specifically asked for the masked operations.

Yes, this means Oracle will not get the performance boost of atomics on
iWARP - but we still get rdma - and that's a real win / benefit for

The model we implement is based on "use once" keys - we issue the key to
the rdma server and want to toss it as soon as the rdma is complete.
Today, we explicitly free the key after the rdma completes and we get a
message from the rdma server - saying rdma is complete. If the key is
auto invalidated by the recv'ing HCA then we do not need to do it in the
driver... which also means we do not need to issue the sync tpts to force
the HCA to update its cache.
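
The two "use once" key flows described above can be sketched with a toy model. Everything here is hypothetical: `ToyHca` and its methods are stand-ins for illustration, not an RDS or verbs interface, and the model ignores caching and sync costs entirely. It only shows which side performs the invalidation in each flow.

```python
class ToyHca:
    """Stand-in for the key owner's HCA: tracks which keys are valid."""

    def __init__(self):
        self.valid_keys = set()
        self.next_key = 1
        self.explicit_invalidates = 0  # driver-side invalidations issued

    def issue_key(self):
        key = self.next_key
        self.next_key += 1
        self.valid_keys.add(key)
        return key

    def rdma_access(self, key):
        # The rdma server uses the key to read or write our memory.
        if key not in self.valid_keys:
            raise PermissionError(f"key {key} is not valid")

    def explicit_invalidate(self, key):
        # Current flow: the driver frees the key itself.
        self.valid_keys.discard(key)
        self.explicit_invalidates += 1

    def receive_send(self, payload, invalidate_key=None):
        # Send with invalidate: the receiving HCA drops the key on arrival.
        if invalidate_key is not None:
            self.valid_keys.discard(invalidate_key)
        return payload


def transfer_explicit_free(hca):
    key = hca.issue_key()
    hca.rdma_access(key)
    hca.receive_send("rdma complete")
    hca.explicit_invalidate(key)  # extra driver step
    return key


def transfer_send_with_invalidate(hca):
    key = hca.issue_key()
    hca.rdma_access(key)
    hca.receive_send("rdma complete", invalidate_key=key)
    return key
```

In both flows the key is unusable once the completion message has been handled; the difference is only whether the driver has to issue the invalidation itself.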

This is the one piece we do not know - our plans are to test this and
see where the trade offs are. We will keep the current design /
--

To: Roland Dreier <rdreier@...>
Cc: <tziporet@...>, <general@...>, <linux-kernel@...>
Date: Thursday, April 3, 2008 - 7:40 am

I see - but can't we figure this all for the 2.6.26 window?

Tziporet

--

To: Roland Dreier <rdreier@...>
Cc: <general@...>, <linux-kernel@...>
Date: Wednesday, April 2, 2008 - 3:22 am

What's the status of RDS?

Thanks
Shirley

--

To: Shirley Ma <mashirle@...>
Cc: <linux-kernel@...>, <general@...>
Date: Wednesday, April 2, 2008 - 11:27 am

> What's the status of RDS?

I've never seen any patches. I guess ask the RDS guys if/when they want
to start working on getting RDS merged.

- R.
--

To: Roland Dreier <rdreier@...>
Cc: Shirley Ma <mashirle@...>, <linux-kernel@...>, <general@...>, <rds-devel@...>
Date: Wednesday, April 2, 2008 - 1:11 pm

What is the work we need to do here - I was thinking RDS should just work ?

--

To: Richard Frank <richard.frank@...>
Cc: Shirley Ma <mashirle@...>, <linux-kernel@...>, <general@...>, <rds-devel@...>
Date: Wednesday, April 2, 2008 - 12:15 pm

> What is the work we need to do here - I was thinking RDS should just work ?

Stuff doesn't get merged into the kernel on its own. If you want RDS
upstream then the first step is to post patches in a form suitable for
reviewing. Then respond to the review comments.

The files Documentation/SubmittingPatches and to some extent
Documentation/SubmittingDrivers in the kernel source have more info.

- R.
--

To: Roland Dreier <rdreier@...>
Cc: Shirley Ma <mashirle@...>, <linux-kernel@...>, <general@...>, <rds-devel@...>
Date: Wednesday, April 2, 2008 - 1:24 pm

WRT merging RDS into the kernel - our current plans are to wait to
see RDS adopted by more than Oracle - before approaching the kernel
community about inclusion of RDS.

--

To: Richard Frank <richard.frank@...>, Roland Dreier (rdreier) <rdreier@...>
Cc: <rds-devel@...>, <linux-kernel@...>, <general@...>
Date: Wednesday, April 2, 2008 - 12:29 pm

I've seen statements before from someone from Oracle that RDS was only
for Oracle's use, for example, that person did not want netperf changed
to support RDS.

Scott Weitzenkamp
SQA and Release Manager
Data Center Access Engineering
Cisco Systems
--

To: Scott Weitzenkamp (sweitzen) <sweitzen@...>
Cc: Roland Dreier (rdreier) <rdreier@...>, <rds-devel@...>, <linux-kernel@...>, <general@...>
Date: Wednesday, April 2, 2008 - 1:37 pm

I believe there is a patch for NetPerf which supports RDS - although it
may need to be updated - and submitted.

The only prior discussion I can think of - was whether or not NetPerf
exercises RDS as Oracle would.

I'm not proposing that we should enhance NetPerf to do that (but that's
OK with me).

We created a tool rds-stress which does that.

--

To: Richard Frank <richard.frank@...>, Scott Weitzenkamp (sweitzen) <sweitzen@...>
Cc: Roland Dreier (rdreier) <rdreier@...>, <rds-devel@...>, <linux-kernel@...>, <general@...>
Date: Wednesday, April 2, 2008 - 12:46 pm

Rich,

On Nov 1, 2007, you wrote this to rds-devel:

"Netperf is too simplistic in that all it seems to do is stream data
in a simple loop. This is not how Oracle uses the IPC and again does not
reflect what it would take to make UDP reliable.

For this reason we are not interested in having Netperf support RDS
and or seeing Netperf data."

I would like to see RDS supported by existing common tools like netperf,
iperf, etc. so we can easily compare how RDS performs to UDP for IPC
models other than Oracle.

Scott Weitzenkamp
SQA and Release Manager
Data Center Access Engineering
--

To: Scott Weitzenkamp (sweitzen) <sweitzen@...>
Cc: Roland Dreier (rdreier) <rdreier@...>, <rds-devel@...>, <linux-kernel@...>, <general@...>
Date: Wednesday, April 2, 2008 - 2:00 pm

OK - and the conversation was about using NetPerf to compare performance
of RDS to UDP relative to suitability for Oracle use ... so I think
those statements still illustrate my points...

1) NetPerf does not do what Oracle does - and hence is not useful from
Oracle's perspective in comparing ULPs.
2) For some metrics - it's not valid to compare a non-reliable IPC to a
reliable IPC - it's not an apples to apples comparison. Especially when
the app is considered and what the app must do to use UDP vs RDS.

I did not say that NetPerf should not be extended to support RDS - just
that using it to do a comparison of ULPs to determine how well Oracle
would run - is not what we (Oracle) would want - at least that was my
intention..

--

To: Richard Frank <richard.frank@...>
Cc: Roland Dreier (rdreier) <rdreier@...>, <rds-devel@...>, <linux-kernel@...>, <general@...>
Date: Wednesday, April 2, 2008 - 1:04 pm

I'd like to see netperf comparisons of UDP_STREAM/UDP_RR vs
RDS_STREAM/RDS_RR, does anyone have a patch that will apply cleanly to a
recent netperf?

Scott Weitzenkamp
SQA and Release Manager
Data Center Access Engineering
--

To: Roland Dreier <rdreier@...>
Cc: Shirley Ma <mashirle@...>, <linux-kernel@...>, <general@...>, <rds-devel@...>
Date: Wednesday, April 2, 2008 - 1:18 pm

Yes, I see this is for pushing RDS upstream - but what about running RDS
as is over IWARP NICs - that should just work right ?

--

To: Richard Frank <richard.frank@...>
Cc: Shirley Ma <mashirle@...>, <linux-kernel@...>, <general@...>, <rds-devel@...>
Date: Wednesday, April 2, 2008 - 12:26 pm

> Yes, I see this is for pushing RDS upstream - but what about running
> RDS as is over IWARP NICs - that should just work right ?

No idea. It depends on whether you took into account the differences
between IB and iWARP. Anyway that's not really what this thread was about.
--

To: Roland Dreier <rdreier@...>
Cc: Shirley Ma <mashirle@...>, <linux-kernel@...>, <general@...>, <rds-devel@...>
Date: Wednesday, April 2, 2008 - 1:28 pm

got it...

--

To: Roland Dreier <rdreier@...>
Cc: <general@...>, <linux-kernel@...>
Date: Tuesday, April 1, 2008 - 12:55 pm

I did some prototyping for IPoIB to enable multiple CQ event vector
support, and I did see the approach improve aggregate performance across
multiple links. I have also seen some customer requirements for this in
userspace. I will start the discussion as soon as possible, but it will
most likely miss the 2.6.26 window.

Thanks
Shirley

--
