[MITgcm-support] Strange Random MPI crashes running MITgcm

Dierk Polzin dtpolzin at wisc.edu
Thu Sep 18 14:37:18 EDT 2008


Cluster Issue
Strange crashes of MPI jobs built with the PGI Cluster Development Kit.
How can we diagnose the problem?

On our ROCKS cluster, we had to upgrade the operating system in order
to install a new card on the motherboard for a high-speed drive.
Since then we have reinstalled the PGI compilers and everything else.

But now our MPI jobs crash without giving any errors. We run an ocean
model, MITgcm. One run went for 3 hours and then died while writing out
day 17 of the first month in netCDF. Other jobs crash within 20 minutes.
Process 0 always seems to crash first.

Any ideas on how to diagnose the problem?

-- Log files to look at?  

-- What kind of simple intermediate MPI jobs can I test-run on 16
nodes for 3 hours or 3 days to find out what the glitch is?  (I can
get simple MPI jobs to run.)

-- Rebuild the MPI libraries from scratch rather than using the
precompiled ones from PGI?

-- Run a simple MPI ring job for 20 hours and see if that crashes?
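
For the ring idea, something like the sketch below is roughly what I
have in mind. It is only a sketch and has not been run on our cluster.
It assumes mpi4py is installed and built against the same MPI library
the model uses; the same pattern could be written in C or Fortran with
MPI_Sendrecv. The names ring_neighbors, checksum, run_ring and the
RING_SECONDS environment variable are made up for this example. Each
rank passes a payload plus a checksum to its right-hand neighbour, so
a corrupted message is reported together with the host name, and rank
0 decides when to stop so that all ranks leave the loop together.

```python
import os
import time


def ring_neighbors(rank, size):
    """Return (left, right) neighbour ranks in a periodic ring."""
    return (rank - 1) % size, (rank + 1) % size


def checksum(values):
    """Cheap order-dependent checksum used to detect corrupted payloads."""
    total = 0
    for i, v in enumerate(values):
        total = (total * 31 + v + i) % 1_000_000_007
    return total


def run_ring(seconds, report_every=10000):
    """Pass messages around the ring for roughly `seconds` seconds."""
    try:
        # Assumption: mpi4py is available on the compute nodes.
        from mpi4py import MPI
    except ImportError:
        print("mpi4py not available; nothing to run")
        return None

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    left, right = ring_neighbors(rank, size)
    host = MPI.Get_processor_name()
    start = time.time()
    laps = 0
    keep_going = True

    while keep_going:
        payload = list(range(laps, laps + 1024))
        # Send to the right neighbour and receive from the left one.
        data, expected = comm.sendrecv(
            (payload, checksum(payload)), dest=right, source=left
        )
        if checksum(data) != expected:
            print(f"rank {rank} on {host}: bad message from rank {left} "
                  f"at lap {laps}", flush=True)
            comm.Abort(1)
        laps += 1
        if laps % report_every == 0:
            # Heartbeat: the last line per rank shows how far it got.
            print(f"{time.ctime()} rank {rank} on {host}: lap {laps}",
                  flush=True)
        # Rank 0 alone checks the clock; everyone follows its decision.
        flag = (time.time() - start < seconds) if rank == 0 else None
        keep_going = comm.bcast(flag, root=0)

    return laps


if __name__ == "__main__":
    # RING_SECONDS is a hypothetical knob; 72000 would be 20 hours.
    run_ring(float(os.environ.get("RING_SECONDS", "10")))
```

Saved as, say, ring_test.py, this would be started through the usual
MPI launcher on 16 processes with RING_SECONDS set to the desired run
time. The exact launcher command and the way environment variables
reach the nodes depend on the MPI installation. If the heartbeat lines
always stop first on the same host, that would point at a node or
interconnect problem rather than at MITgcm itself.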

Thanks for your help,

Dierk

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Dierk Polzin
    mailto:dtpolzin at wisc.edu    608-334-3574 cell
   Skype: saildirk
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


