[MITgcm-support] Clean exit from errors during MPI runs
Constantinos Evangelinos
ce107 at ocean.mit.edu
Mon Oct 1 16:39:57 EDT 2007
On Mon 01 Oct 2007 15:26, Christopher L. Wolfe wrote:
> Hi modelers,
>
> I recently had a run stop within initialization due to a missing
> pickup file. The run executed the standard error code
>
> write(msgbuf,'(a)')
> & ' MDSREADFIELD: Files do not exist'
> call PRINT_MESSAGE( msgbuf, standardmessageunit,
> & SQUEEZE_RIGHT , mythid)
> call PRINT_ERROR( msgbuf, mythid )
> stop 'ABNORMAL END: S/R MDSREADFIELD'
>
> (from mdsio_readfield.F) and stopped. However, the job (running on
> SDSC's BlueGene) hung in the running state until it exceeded its
> walltime 12 hours later. When I asked the people at SDSC why this
> happened and how I could prevent it in the future, they said "A
> 'stop' statement won't stop the process. You need a MPI finalization
> to finish the process, otherwise the process will still be running."
This is correct: depending on the MPI runtime, a STOP may or may not bring
down the process(es). However, the most generic way to abort execution cannot
be MPI_Finalize, because MPI_Finalize is a collective call that requires
synchronization among all the processes (in this case they will all miss the
pickup file, but in other cases only one may stop). MPI_Abort is supposed to
make a best-effort attempt to shut down everything cleanly, and it can be
called by a single process without the cooperation of the others.
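
As a minimal sketch of the difference (not MITgcm code; the program name and
the file name 'pickup.data' are made up for illustration), a process that
finds the file missing can call MPI_Abort on its own, while MPI_Finalize is
only reached on the normal path where every process arrives:

```fortran
      PROGRAM ABORT_DEMO
C     Illustrative sketch only; not part of MITgcm.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER ierr, myRank
      LOGICAL fileExists

      CALL MPI_INIT( ierr )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, myRank, ierr )

C     Hypothetical file name, standing in for the missing pickup file
      INQUIRE( FILE='pickup.data', EXIST=fileExists )
      IF ( .NOT. fileExists ) THEN
        WRITE(*,'(A,I6)') ' File missing on rank ', myRank
C       Any single rank may call this; the error code 1 is an
C       arbitrary nonzero value handed back to the MPI runtime.
        CALL MPI_ABORT( MPI_COMM_WORLD, 1, ierr )
      ENDIF

C     Normal termination: collective, every rank must get here
      CALL MPI_FINALIZE( ierr )
      END
```

How cleanly the job is torn down after MPI_Abort still depends on the MPI
implementation and the batch system.
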
> I am far from an MPI expert and know even less about how the WRAPPER
> works "under the hood," so I have no idea if this is true, though
> I've had jobs stop without hanging in the running state before. I
> guess what I'm asking is if the explanation I got from SDSC is
> reasonable and, if so, am I going to have to go through the MITgcm
> sprinkling "MPI_Finalize" statements before every "stop" command?
You can go ahead and do it with MPI_Abort instead. We could also define a
macro _STOP (like _BARRIER) that in serial mode translates to STOP and in
parallel mode translates to a call to MPI_Abort.
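
A rough sketch of what such a macro could look like follows. Neither _STOP
nor the helper routine EE_STOP exists in MITgcm; both names are hypothetical,
and the sketch assumes that ALLOW_USE_MPI is the CPP flag that marks an MPI
build. The helper is there so that routines using the macro do not each need
to include mpif.h or declare an error variable.

```fortran
C     Hypothetical sketch: _STOP and EE_STOP are not MITgcm names.
#ifdef ALLOW_USE_MPI
#define _STOP(msg) CALL EE_STOP(msg)
#else
#define _STOP(msg) STOP msg
#endif

#ifdef ALLOW_USE_MPI
      SUBROUTINE EE_STOP( msg )
C     Print the message, then ask MPI to terminate all processes.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      CHARACTER*(*) msg
      INTEGER ierr
      WRITE(*,'(A)') msg
C     Error code 1 is an arbitrary nonzero value
      CALL MPI_ABORT( MPI_COMM_WORLD, 1, ierr )
C     Fallback in case MPI_ABORT returns
      STOP 'ABNORMAL END: S/R EE_STOP'
      END
#endif
```

The last line of the error block in mdsio_readfield.F would then read
_STOP('ABNORMAL END: S/R MDSREADFIELD') instead of the plain stop. How the
preprocessor treats a quoted Fortran string passed as a macro argument should
be checked on each platform before relying on this.
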
Constantinos
--
Dr. Constantinos Evangelinos
Department of Earth, Atmospheric and Planetary Sciences
Massachusetts Institute of Technology