On "large" IB-connected ia64 clusters, I (and some customers) are
seeing failures in MPI programs. This is more common the bigger the
cluster nodes are, but I've seen it with as few as 32P/node.

I'm using "Mellanox Technologies MT23108 InfiniHost (rev a1)"
HCAs, with firmware version 3.5.0 (but this has been seen with
several firmware revisions) and OFED-1.2.

For example, with two 128P systems connected via a single IB port,
using this simple MPI program:

#include <mpi.h>

int main(int argc, char **argv)
{
	MPI_Init(&argc, &argv);
	MPI_Barrier(MPI_COMM_WORLD);
	MPI_Finalize();
	return 0;
}

and running it with something like:

  # mpirun machine1, machine2 128 a.out

I see failures on >1% of runs.

On one run we got this in syslog (ib_mthca's debug_level set to 1):

  15:34:34 ib_mthca 0012:01:00.0: Command 21 completed with status 09
  15:35:34 ib_mthca 0012:01:00.0: HW2SW_MPT failed (-16)
  ....

(status 0x9==MTHCA_CMD_STAT_BAD_RES_STATE => problem with mpi?)

or on another run:

  13:57:15 ib_mthca 0005:01:00.0: Command 1a completed with status 01
  13:57:15 ib_mthca 0005:01:00.0: modify QP 1->2 returnedstatus 01.
  ....

(status 0x1==MTHCA_CMD_STAT_INTERNAL_ERR => ???)

These are just the first debug messages logged (rebooting between
each run), there are lots more, of almost every flavor.

Anyone else seen anything like this? Got any suggestions for debugging?
Should I be looking at MPI, or would you suspect a driver or h/w
problem? Any other info I could provide that'd help to narrow things
down?

Thanks for any pointers.

--
Arthur

_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
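
When sifting through many such syslog lines, it can help to pull the opcode and status out mechanically. The following is only a minimal sketch: the helper names parse_cmd_status() and mthca_status_name() are made up for illustration, the line format is assumed from the two "Command ... completed with status ..." lines quoted above, and only the two status codes named in this thread are listed (the driver header defines more).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map a status byte to its name.
 * Only the two codes mentioned in this thread are listed. */
static const char *mthca_status_name(unsigned int status)
{
	switch (status) {
	case 0x01: return "MTHCA_CMD_STAT_INTERNAL_ERR";
	case 0x09: return "MTHCA_CMD_STAT_BAD_RES_STATE";
	default:   return "(not listed in this sketch)";
	}
}

/* Hypothetical helper: find "Command <op> completed with status <st>"
 * in a log line. Both fields are read as hex, which is assumed from
 * the "Command 1a" line in the thread.
 * Returns 0 on success, -1 if the line does not match. */
static int parse_cmd_status(const char *line, unsigned int *op,
			    unsigned int *status)
{
	const char *p;

	assert(line && op && status);
	p = strstr(line, "Command ");
	if (!p)
		return -1;
	if (sscanf(p, "Command %x completed with status %x", op, status) != 2)
		return -1;
	return 0;
}
```

Feeding it the first log line from the thread yields opcode 0x21 and status 0x09, which mthca_status_name() reports as MTHCA_CMD_STAT_BAD_RES_STATE.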
> On one run we got this in syslog (ib_mthca's debug_level set to 1):
>
> 15:34:34 ib_mthca 0012:01:00.0: Command 21 completed with status 09
> 15:35:34 ib_mthca 0012:01:00.0: HW2SW_MPT failed (-16)
> ....
> (status 0x9==MTHCA_CMD_STAT_BAD_RES_STATE => problem with mpi?)
>
> or on another run:
>
> 13:57:15 ib_mthca 0005:01:00.0: Command 1a completed with status 01
> 13:57:15 ib_mthca 0005:01:00.0: modify QP 1->2 returnedstatus 01.
> ....
> (status 0x1==MTHCA_CMD_STAT_INTERNAL_ERR => ???)
>
> These are just the first debug messages logged (rebooting between
> each run), there are lots more, of almost every flavor.
>
> Anyone else seen anything like this? Got any suggestions for debugging?
> Should I be looking at MPI, or would you suspect a driver or h/w
> problem? Any other info I could provide that'd help to narrow things
> down?

Almost certainly this is a driver and/or firmware bug. MPI and
userspace in general shouldn't be able to do anything that would cause
this type of error.

Given the semi-random nature of the error messages and the fact that
having nodes with lots of CPUs means FW commands are being submitted
in parallel, I have to suspect a race somewhere, possibly in firmware
but possibly in the driver. You could try adding

	dev->cmd.max_cmds = 1;

to the beginning of mthca_cmd_use_events() as a hack, and see if you
still see problems.

I don't really see anything racy in the FW command stuff, but it's
possible that there's something like an mmiowb() missing somewhere (I
have a hard time spotting that type of race for some reason).

 - R.
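
The max_cmds hack suggested above presumably works by capping how many firmware commands can be outstanding at once. As a rough user-space analogy only (this is not the driver code, and run_with_limit(), worker() and the counters are hypothetical names), a POSIX counting semaphore initialised to the limit has the same effect: with a limit of 1, commands are fully serialized.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

#define NTHREADS 8     /* stand-in for CPUs issuing commands */
#define NCMDS    1000  /* commands issued per thread */

static sem_t cmd_sem;  /* counts free command slots */
static pthread_mutex_t stat_lock = PTHREAD_MUTEX_INITIALIZER;
static int in_flight, max_in_flight;

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < NCMDS; i++) {
		sem_wait(&cmd_sem);             /* take a command slot */

		pthread_mutex_lock(&stat_lock);
		if (++in_flight > max_in_flight)
			max_in_flight = in_flight;
		pthread_mutex_unlock(&stat_lock);

		/* ... a command would be posted and completed here ... */

		pthread_mutex_lock(&stat_lock);
		in_flight--;
		pthread_mutex_unlock(&stat_lock);

		sem_post(&cmd_sem);             /* give the slot back */
	}
	return NULL;
}

/* Run all workers with at most max_cmds commands outstanding and
 * return the largest number that was ever in flight at once. */
static int run_with_limit(unsigned int max_cmds)
{
	pthread_t t[NTHREADS];
	int rc;

	in_flight = max_in_flight = 0;
	rc = sem_init(&cmd_sem, 0, max_cmds);
	assert(rc == 0);
	for (int i = 0; i < NTHREADS; i++) {
		rc = pthread_create(&t[i], NULL, worker, NULL);
		assert(rc == 0);
	}
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	sem_destroy(&cmd_sem);
	return max_in_flight;
}
```

Calling run_with_limit(1) never observes more than one command in flight, which is the point of the hack: if the failures disappear with a limit of 1, parallel command submission is implicated.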
> I don't really see anything racy in the FW command stuff, but it's
> possible that there's something like an mmiowb() missing somewhere (I
> have a hard time spotting that type of race for some reason).

Another possibility (independent of the hack I suggested before) would
be to add an mmiowb() before the mutex_unlock() in mthca_cmd_post().

I actually have a good feeling about this theory....

- R.
Genius!

I have completed over 275 runs with the patch below, so
we can be very confident that this has fixed things.

Roland, should I submit a proper patch, or do you want
to take care of this? (And thanks a lot, too!)

diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-06-21 07:38:47.000000000 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-10-05 16:04:38.926857822 -0700
@@ -288,7 +288,7 @@ static int mthca_cmd_post(struct mthca_d
 	else
 		err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier,
 					 op_modifier, op, token, event);
-
+	mmiowb();
 	mutex_unlock(&dev->cmd.hcr_mutex);
 	return err;
 }

--
Arthur
> Roland, should I submit a proper patch, or do you want
> to take care of this? (And thanks a lot, too!)

Thanks for testing... I can take care of this -- I just added the
patches below to my tree (since as far as I can see, mlx4 would be
susceptible to the same bug):

commit 66547550601a706e2b958ea351b34d8dee066b18
Author: Roland Dreier <rolandd@cisco.com>
Date:   Sat Oct 6 13:35:24 2007 -0700

IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up

Firmware commands are sent to the HCA by writing multiple words to a
command register block. Access to this block of registers is
serialized with a mutex. However, on large SGI systems, problems were
seen with multiple CPUs issuing FW commands at the same time, because
the writes to the register block may be reordered within the system
interconnect and reach the HCA in a different order than they were
issued (even with the mutex). Fix this by adding an mmiowb() before
dropping the mutex.

Tested-by: Arthur Kepner <akepner@sgi.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>

diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c
index acc9589..6966f94 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -290,6 +290,12 @@ static int mthca_cmd_post(struct mthca_dev *dev,
 		err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier,
 					 op_modifier, op, token, event);
 
+	/*
+	 * Make sure that our HCR writes don't get mixed in with
+	 * writes from another CPU starting a FW command.
+	 */
+	mmiowb();
+
 	mutex_unlock(&dev->cmd.hcr_mutex);
 	return err;
 }

commit 8c2348735c721eed6f08343eed851bfbec6e5a9a
Author: Roland Dreier <rolandd@cisco.com>
Date:   Sat Oct 6 13:39:38 2007 -0700

mlx4_core: Use mmiowb() to avoid firmware commands getting jumbled up

Firmware commands are sent to the HCA by writing...

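
The failure mode described in the commit message can be illustrated with a deterministic toy model. This is a sketch under loudly stated assumptions, not a description of the real interconnect or of what mmiowb() does on any particular platform: each CPU gets a hypothetical queue of posted writes, toy_mmiowb() simply drains the caller's queue, and all names (post_command(), drain_interleaved(), simulate()) are invented for the illustration. The mutex is modelled by the two CPUs posting strictly one after the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NCPUS     2
#define CMD_WORDS 3   /* words per command; arbitrary for this model */
#define QLEN      16

/* Posted writes from one CPU that have not reached the device yet. */
struct write_queue {
	unsigned int word[QLEN];
	size_t head, tail;
};

static struct write_queue pending[NCPUS];
static unsigned int device_log[NCPUS * CMD_WORDS]; /* arrival order */
static size_t device_len;

/* Let one queued write from this CPU reach the device. */
static bool deliver_one(int cpu)
{
	struct write_queue *q = &pending[cpu];

	if (q->head == q->tail)
		return false;
	assert(device_len < sizeof(device_log) / sizeof(device_log[0]));
	device_log[device_len++] = q->word[q->head++];
	return true;
}

/* Toy stand-in for mmiowb(): drain this CPU's queued writes. */
static void toy_mmiowb(int cpu)
{
	while (deliver_one(cpu))
		;
}

/* Post one command; the caller is assumed to hold the command mutex. */
static void post_command(int cpu, bool use_barrier)
{
	struct write_queue *q = &pending[cpu];

	for (unsigned int i = 0; i < CMD_WORDS; i++) {
		assert(q->tail < QLEN);
		q->word[q->tail++] = (unsigned int)cpu * 0x10 + i; /* cpu tag + index */
	}
	if (use_barrier)
		toy_mmiowb(cpu);
	/* the mutex would be dropped here */
}

/* The "interconnect" delivers leftover writes, alternating between CPUs. */
static void drain_interleaved(void)
{
	bool progress;

	do {
		progress = false;
		for (int cpu = 0; cpu < NCPUS; cpu++)
			if (deliver_one(cpu))
				progress = true;
	} while (progress);
}

/* True if every command's words arrived at the device contiguously. */
static bool commands_intact(void)
{
	if (device_len != (size_t)NCPUS * CMD_WORDS)
		return false;
	for (size_t i = 0; i < device_len; i++)
		if (device_log[i] / 0x10 != device_log[i - i % CMD_WORDS] / 0x10)
			return false;
	return true;
}

static bool simulate(bool use_barrier)
{
	memset(pending, 0, sizeof(pending));
	device_len = 0;
	post_command(0, use_barrier); /* CPU 0: lock, write, unlock */
	post_command(1, use_barrier); /* CPU 1: lock, write, unlock */
	drain_interleaved();
	return commands_intact();
}
```

In this model simulate(false) reports jumbled commands even though the two CPUs never overlap in the critical section, while simulate(true) reports intact commands, mirroring the lock, write, barrier, unlock ordering the patch establishes.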
Roland - is this for 2.6.23 or 24?
Tziporet
