With 2.6.24 probably opening in the not-too-distant future, it's
probably a good time to review what my plans are for when the merge
window opens.At the kernel summit, we discussed patch review (doing a web search
for "kernel summit" "reviewed-by:" should turn up lots of info on
this). Due to an unfortunate combination of vacation and conference
travel, summer colds, and other inconveniences, I am very backed up on
reviewing. And in any case, I've allowed too much code review to be
dumped on me -- when there are dozens of people working on IB and RDMA
stuff, it obviously doesn't work to expect me to do all the reviewing.Unfortunately, due to the length of the backlog and the fact that
2.6.23 seems fairly close, some of the things listed below are going
to miss the 2.6.24 merge window. So, although the plan is to phase in
requiring "Reviewed-by:" gently, for this merge, if you can get
someone other than me to review your work, then the chances of it
being merged increase dramatically. I'm talking about a real review--
ideally, someone independent (from another company would be good) who
is willing to provide a "Reviewed-by:" line that means the reviewer
has really looked at and thought about the patch. There should be a
mailing list thread you can point me at where the reviewer comments on
the patch and a new version of that patch addressing all comments is
posted (or in exceptional cases, where the patch is perfect to start
with, where the reviewer says the patch is great).For example, given the number of IPoIB changes pending, it might be a
good idea for the people submitting them to get together and trade
reviews (ie "If you review my patch, I'll review your patch"). There
are a few cases where getting a review may not be necessary. First of
all, trivial and obvious patches don't need a review. It's a
judgement call what is trivial or obvious, and it's always a good idea
to provide a changelog that makes it clear why a patch is trivial and
obviously correct. Secon...
Roland, could you merge the common TX CQ patch please?
It actually fixes a real problem.--
MST
-
> Roland, could you merge the common TX CQ patch please?
> It actually fixes a real problem.Yes, I will, but it collides with the net-2.6.24 NAPI rework I think,
so it may not go in until a few days after the merge window.Have you verified that the patch cures the interrupt overload issues?
-
Yes.
--
MST
-
Missing from this list (IMPORTANT patch!):
[ofa-general] [PATCH 2 of 2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists
(Posted by me to list on Sept 4)
{patch header:
This is an addendum to Roland's commit 0e6e74162164d908edf7889ac66dca09e7505745
(June 18). This addendum adds prefetch headroom marking processing for s/g segments.We write s/g segments in reverse order into the WQE, in order to guarantee
that the first dword of all cachelines containing s/g segments is written last
(overwriting the headroom invalidation pattern). The entire cacheline will thus
contain valid data when the invalidation pattern is overwritten.
}
This patch series (1 of 2 is for libmlx4, the same issue).
============================================================Also, I'm now posting (in a separate post) the following patch to mlx4, which is important:
display the following device information via sysfs:
board_id, fw_ver, hw_rev, hca_type.The info is displayed under directory /sys/class/infiniband/mlx4_x, where x is
the pci bus sequence number (starting from zero).This patch makes information available to ibstat and ibv_devinfo under the
same directory as is used for tavor/arbel/sinai -- thus requiring no userspace
modifications.- Jack
-
Hi Roland,
I have reviewed the qos patches and provided comments which were
deployed in v2 of the series. I also tested it (ipoib and iser which is
rdma-cm based) against the Voltaire SM/SA to see that nothing was
The IGMP enabling patch posted by me on September 2nd isn't on your list
http://lists.openfabrics.org/pipermail/general/2007-September/040250.htmlJay Vosburgh, the bonding driver maintainer just sent an ack on all
patch series. As for the IPoIB changes, there are three patches, where
are handling a corner-case problems pointed by Michael Tsirkin.
Michael, will you be able to look on it and provide a reviewed-byJust for the record, the 'etc' above relates to the interrupt moderation
support (mlx4, core, ipoib {config through ethertool, usage). Among
other things, what is not clear to me here is if/how this goes
hand-in-hand with NAPI.As you saw the patch adding checksum offload support had a long thread,
and I think the discussion has reached the point where Michael is
waiting for your take on it.As for the LSO, LRO patches, I did not see any review comment.
I will see that I can review from the series, to begin with, will send
This patch series is somehow important as without them iser is useless
over connectx. Can be nice if you merge this and at max fix the abuse later.Or.
-
> The IGMP enabling patch posted by me on September 2nd isn't on your list
> http://lists.openfabrics.org/pipermail/general/2007-September/040250.html
> can you add it?Yes, I lost that somehow. I will add it to my list of things to take
a look at (no opinion yet).- R.
-
Hello Roland,
Since ehca can support 4K MTU, we would like to see a patch in
IPoIB to allow link MTU to be up to 4K instead of current 2K for 2.6.24
kernel. The idea is IPoIB link MTU will pick up a return value from SM's
default broadcast MTU. This patch should be a small patch, I hope you are
OK with this.Thanks
Shirley
-
> Since ehca can support 4K MTU, we would like to see a patch in
> IPoIB to allow link MTU to be up to 4K instead of current 2K for 2.6.24
> kernel. The idea is IPoIB link MTU will pick up a return value from SM's
> default broadcast MTU. This patch should be a small patch, I hope you are
> OK with this.It's actually not small, since it turns the skb allocation into a
4100-byte buffer, which ends up being more than 1 page usually, which
means it fails if memory is fragmented.Anyway given the backlog anything substantial that hasn't been posted
already is almost surely going to have to wait until 2.6.25.
-
I can give this patch a reviewed-by: too, and I will also try to review a couple
The new QoS fields fall into fields that are currently reserved, which should be
This patch was generated in response to an Intel MPI issue. We've seen MPI take
several minutes to respond to a connection request during the middle of large
application runs. When this happens, the active side times out the connection.
In OFED, we added module parameters to adjust the rdma_cm connection timeout on
the active side, but I believe that sending an MRA from the passive side is a
better solution.- Sean
-
> > - My user_mad P_Key index support patch. I'll test the ioctl to
> > change to the new mode and merge this I guess, since Hal and Sean
> > have tested this out.
>
> I can give this patch a reviewed-by: too, and I will also try to review a couple
> of the pending ipoib patches.Thanks!
> > - Sean's QoS changes. These look fine at first glance, and I just
> > plan to understand the backwards compatibility story (ie how this
> > works with an old SM) and merge. Anyone who objects let me know.
>
> The new QoS fields fall into fields that are currently reserved, which should be
> ignored by an older SM. I've only tested this against openSM however.That seems OK -- I'm OK with breaking things if an SM is clearly buggy
(and not ignoring fields that are defined to be ignored in the spec
would certainly be a clear bug to me).> This patch was generated in response to an Intel MPI issue. We've seen MPI take
> several minutes to respond to a connection request during the middle of large
> application runs. When this happens, the active side times out the connection.
> In OFED, we added module parameters to adjust the rdma_cm connection timeout on
> the active side, but I believe that sending an MRA from the passive side is a
> better solution.OK -- just to make sure I'm understanding what you're saying: have you
confirmed that your proposed patches actually fix the issue?- R.
-
Not directly. I cannot easily test kernel patches on our larger, production
clusters. We've seen the issue with specific applications on 512 and 1024
cores, but I've only been able to test the patch on a 48-core cluster. I have
verified that it successfully increases the timeout to where it *should* work,
but cannot absolutely confirm that it will fix the problem. I'm unlikely to
know that until the production clusters move to an OFED release (1.3?)
containing this patch.- Sean
-
> >OK -- just to make sure I'm understanding what you're saying: have you
> >confirmed that your proposed [CM MRA] patches actually fix the issue?
>
> Not directly. I cannot easily test kernel patches on our larger, production
> clusters. We've seen the issue with specific applications on 512 and 1024
> cores, but I've only been able to test the patch on a 48-core cluster. I have
> verified that it successfully increases the timeout to where it *should* work,
> but cannot absolutely confirm that it will fix the problem. I'm unlikely to
> know that until the production clusters move to an OFED release (1.3?)
> containing this patch.Umm... this is a difficult situation for me to merge the changes then.
We're changing the CM retry behavior blind here. How do we know that
the MRA changes don't make the scalability issue worse?- R.
-
What's currently upstream doesn't work for Intel MPI on our larger clusters.
The connection requests time out on the active side before the passive side can
respond.The OFED release works because it provides a kernel patch to make the timeout a
module parameter. I'm trying to avoid adding a module parameter, and the MRA is
designed for this situation.I tested this by simulating a slow passive side responder, and it worked as
expected for those tests. Using an MRA does add another MAD to the CM exchange,
which is why it is sent only after seeing a duplicate request. Alternatively,
we can take the OFED module parameter patch.- Sean
-
> I tested this by simulating a slow passive side responder, and it worked as
> expected for those tests. Using an MRA does add another MAD to the CM exchange,
> which is why it is sent only after seeing a duplicate request. Alternatively,
> we can take the OFED module parameter patch.What the heck, I added this for 2.6.24. If it doesn't work out we can
back it out.- R.
-
Hey Roland,
I was about to post v2 of my patch to avoid port space collisions with
the native stack. Can we get that 2.6.24? It is high priority IMO.
I've tried to solicit review on it, but I think folks are reluctant... ;-)Steve.
-
> I was about to post v2 of my patch to avoid port space collisions with
> the native stack. Can we get that 2.6.24? It is high priority
> IMO. I've tried to solicit review on it, but I think folks are
> reluctant... ;-)I would like to get this in, but I'm still at least a little
reluctant, since we would be committing to a user interface that seems
a little awkward at best, so I'd like to try and find something
better. Just to summarize my understanding:- your patch requires the administration to configure an ethX:iwY
alias address to use iwarp. (By the way is there anything other
than "don't do that" that avoids assigning the same address to the
iwarp alias and a non-iwarp interface?)- it would be nicer to create the alias automatically, but an alias
without an address doesn't make sense. Creating a whole separate
net device causes problems because the iwarp stuff still needs to
use the main net device to do ARP etc.- so I'm out of better ideas but I still want to push back a little
before we commit to something ugly.I've been meaning to track down the bnx2 iscsi offload patch to look
and see if this issue is addressed, since the same problem seems to
exist: it seems an iscsi connection and a main stack tcp connection
might share the same 4-tuple unless something is done to avoid that
happening.Also, I think it behooves us to get some agreement on this approach
with NetEffect and Kanoj (NetXen?) at least, since their iwarp drivers
seem to be imminent.- R.
-
Nope. Its totally up to the admin to create the ethX:iwY interface
-and- to segment his services so host TCP runs on the ethX subnet(s) and
the iwarp rdma ones run on ethX:iwY subnet(s). Without changing theI do log a warning if an iwarp application binds to address 0.0.0.0 and
-
iSCSI does not do passive listens, only active connections to the
target. But you're right, the port space is still shared between iSCSI
and the main stack. We currently rely on user apps binding to the main
stack to reserve certain ephemeral ports, and telling the iSCSI driver
which ports to use.-
> > I've been meaning to track down the bnx2 iscsi offload patch to look
> > and see if this issue is addressed, since the same problem seems to
> > exist: it seems an iscsi connection and a main stack tcp connection
> > might share the same 4-tuple unless something is done to avoid that
> > happening.> iSCSI does not do passive listens, only active connections to the
> target. But you're right, the port space is still shared between iSCSI
> and the main stack. We currently rely on user apps binding to the main
> stack to reserve certain ephemeral ports, and telling the iSCSI driver
> which ports to use.Got it... I wasn't thinking that clearly, but it is clear that a full
4-tuple collision with only active connections is quite unlikely. I
guess you would have to make both an offloaded and a non-offloaded
iSCSI connection to the same target and get really unlucky with
ephemeral port allocation. So in practice I guess it's not an issue
at all with your driver yet.However, do you have any plans to support iSCSI offload for targets?
Also, looking at the first CNIC patch, I can't help but notice that
you seem to have at least some support for iWARP there. How does the
CNIC look? Does it share the same interface/addresses as the
non-offload NIC, or does it create a completely separate netdevice?I want to make sure that whatever solution we come up with for cxgb3
doesn't cause problems for you. And of course if you have a better
idea than what Steve has come up with, that would be great :)- R.
-
We will support iWARP in the future and it should be similar to the way
We are looking at these discussions with great interest. If we have any
new ideas, we'll definitely let everyone know. Thanks.-
Well, if it involves /sharing/ port space with the native stack, i.e.
where port 1234 is IB but 1235 is Linux, pretty much all the networking
devs have NAK'd that approach AFAICS.Jeff
-
> Well, if it involves /sharing/ port space with the native stack,
> i.e. where port 1234 is IB but 1235 is Linux, pretty much all the
> networking devs have NAK'd that approach AFAICS.Just to be clear, InfiniBand has no problem; the issue is port
collisions involving iWARP connections.- R.
-
Jeff, I posted a fix that doesn't do this. No port sharing. The iwarp
device will use its own ip address and subnet to avoid collisions. You
should review the patch when I post v2.Thanks,
Steve.
-
Could you please resend it, since I missed it in netdev@.
--
Evgeniy Polyakov
-
Sounds promising, then! :)
Jeff
-
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Justin C. Sherrill | Re: pkgsrc bulk build and tiff |
| Jeremy Allison | Re: [RFC] Heads up on sys_fallocate() |
| Roland Dreier | Re: Integration of SCST in the mainstream Linux kernel |
| Matt Thomas | Re: Add a MAP_ALIGNED flag for mmap(2). |
| Vsevolod Stakhov | Unicode support in iso9660. |
| Jaromir Dolecek | Re: Speeding up fork/wait path |
| matthew green | re: merge of freebsd eventhandler |
git: | |
| Petr Janda | KDE and OpenSSL = Broken |
| sam | Re: Loader not found |
| Erick Perez | Re: dragonfly pdf documentation |
| Michel Talon | Re: Compatability with FreeBSD Ports [debian package tools] |
