> On one run we got this in syslog (ib_mthca's debug_level set to 1):
>
> 15:34:34 ib_mthca 0012:01:00.0: Command 21 completed with status 09
> 15:35:34 ib_mthca 0012:01:00.0: HW2SW_MPT failed (-16)
> ....
> (status 0x9==MTHCA_CMD_STAT_BAD_RES_STATE => problem with mpi?)
>
> or on another run:
>
> 13:57:15 ib_mthca 0005:01:00.0: Command 1a completed with status 01
> 13:57:15 ib_mthca 0005:01:00.0: modify QP 1->2 returned status 01.
> ....
> (status 0x1==MTHCA_CMD_STAT_INTERNAL_ERR => ???)
>
> These are just the first debug messages logged (rebooting between
> each run), there are lots more, of almost every flavor.
>
> Anyone else seen anything like this? Got any suggestions for debugging?
> Should I be looking at MPI, or would you suspect a driver or h/w
> problem? Any other info I could provide that'd help to narrow things
> down?

Almost certainly this is a driver and/or firmware bug. MPI and
userspace in general shouldn't be able to do anything that would cause
this type of error.

Given the semi-random nature of the error messages and the fact that
having nodes with lots of CPUs means FW commands are being submitted
in parallel, I have to suspect a race somewhere, possibly in firmware
but possibly in the driver.

You could try adding

    dev->cmd.max_cmds = 1;

to the beginning of mthca_cmd_use_events() as a hack, and see if you
still see problems.

I don't really see anything racy in the FW command stuff, but it's
possible that there's something like an mmiowb() missing somewhere (I
have a hard time spotting that type of race for some reason).

 - R.
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
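
For readers who want to see why the max_cmds hack in the reply is a useful experiment, here is a minimal userspace sketch. It is not driver code. It assumes that max_cmds bounds how many firmware commands may be outstanding at once, modelled here as a counting semaphore. The names cmd_ctx, post_cmd and run_commands are hypothetical and exist only for this illustration. It also assumes a Linux-like system with POSIX threads and unnamed semaphores.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical stand-in for the driver's command state (not the mthca struct). */
struct cmd_ctx {
	sem_t slots;           /* counting semaphore, initialised to max_cmds */
	atomic_int in_flight;  /* commands currently outstanding */
	atomic_int peak;       /* highest in_flight value seen so far */
};

/* One simulated firmware command: take a slot, "run", release the slot. */
static void *post_cmd(void *arg)
{
	struct cmd_ctx *c = arg;
	int now, old;

	sem_wait(&c->slots);
	now = atomic_fetch_add(&c->in_flight, 1) + 1;

	/* Record a new peak if this command raised the overlap. */
	old = atomic_load(&c->peak);
	while (now > old &&
	       !atomic_compare_exchange_weak(&c->peak, &old, now))
		;

	usleep(1000);          /* pretend the firmware is busy */
	atomic_fetch_sub(&c->in_flight, 1);
	sem_post(&c->slots);
	return NULL;
}

/* Issue nthreads commands with at most max_cmds outstanding.
 * Returns the peak number of commands that overlapped. */
static int run_commands(int max_cmds, int nthreads)
{
	struct cmd_ctx c;
	pthread_t t[64];
	int i, rc;

	assert(max_cmds > 0 && nthreads > 0 && nthreads <= 64);
	rc = sem_init(&c.slots, 0, (unsigned int)max_cmds);
	assert(rc == 0);
	atomic_init(&c.in_flight, 0);
	atomic_init(&c.peak, 0);

	for (i = 0; i < nthreads; i++) {
		rc = pthread_create(&t[i], NULL, post_cmd, &c);
		assert(rc == 0);
	}
	for (i = 0; i < nthreads; i++)
		pthread_join(t[i], NULL);

	sem_destroy(&c.slots);
	return atomic_load(&c.peak);
}
```

Calling run_commands(1, 8) never lets two simulated commands overlap, while a larger limit such as run_commands(4, 8) allows up to four at a time. If the real failures disappear when commands are serialized like this, that would point toward a race between concurrent commands rather than a problem with any single command.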
