[MITgcm-support] Strange Random MPI crashes running MITgcm
Dierk Polzin
dtpolzin at wisc.edu
Thu Sep 18 14:37:18 EDT 2008
Cluster Issue
Strange crashes of MPI jobs using the PGI Cluster Development Kit.
Diagnosing the Problem?
On our ROCKS cluster, we had to upgrade the operating system in order
to install a new card on the motherboard for a high-speed drive.
Since then we have reinstalled the PGI compilers and everything else.
But now our MPI jobs crash without giving any errors. One run of the
ocean model MITgcm ran for 3 hours and then died while writing out day
17 of the first month in netCDF. Other jobs crash after 20 minutes.
Process 0 always seems to crash first.
Any ideas on how to diagnose the problem?
-- Log files to look at?
-- What kind of simple intermediate MPI jobs can I test run on 16
nodes for 3 hours or 3 days to find out what the glitch is? (I can
get simple MPI jobs to run.)
-- Rebuild the MPI libraries from scratch rather than using the
precompiled ones from PGI?
-- Run a simple ring MPI job for 20 hours and see if that crashes?
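
One way to run the long test in the last item is a small wrapper that
relaunches a short MPI job over and over and records when it first
fails. The following is only a minimal sketch under stated
assumptions: the function name soak and the log format are made up
for illustration, and the command to run is whatever launcher and
test binary the cluster actually provides.

```python
import subprocess
import sys
import time


def soak(cmd, budget_s, log=sys.stdout):
    """Rerun cmd until it fails or budget_s seconds have elapsed.

    Returns (clean_runs, returncode_of_first_failure or None).
    The name and log format are illustrative, not from any MPI toolkit.
    """
    start = time.monotonic()
    runs = 0
    while time.monotonic() - start < budget_s:
        t0 = time.monotonic()
        proc = subprocess.run(cmd, capture_output=True, text=True)
        elapsed = time.monotonic() - t0
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        log.write(f"{stamp} run={runs} rc={proc.returncode} secs={elapsed:.1f}\n")
        if proc.returncode != 0:
            # Keep the tail of stderr; on POSIX a negative rc means the
            # launched process itself was killed by that signal number.
            log.write(proc.stderr[-2000:] + "\n")
            return runs, proc.returncode
        runs += 1
    return runs, None
```

A hypothetical call for the 20-hour ring test would be
soak(["mpirun", "-np", "16", "./ring"], budget_s=20 * 3600), where
./ring is an assumed name for a small ring-pass test binary. The
timestamp of the first failing run can then be matched against the
system logs on the node that hosts process 0.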
Thanks for your help,
Dierk
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Dierk Polzin
mailto:dtpolzin at wisc.edu 608-334-3574 cell
Skype: saildirk
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -