[MITgcm-devel] mpi_finalize

EAPS ce107 at ocean.mit.edu
Thu Mar 29 18:36:27 EDT 2012


For the case where only a subset of processes fails, one should use mpi_abort().
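
As a rough illustration of the difference, here is a minimal sketch in C rather than the model's Fortran. The names USE_MPI, global_error_count, stop_if_error and abort_all are hypothetical and are not MITgcm code. A collective check in the spirit of STOP_IF_ERROR only works if every process takes part in the global sum, whereas MPI_Abort can be called by the failing process alone. Built without USE_MPI, the same code should fall back to a plain exit.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef USE_MPI              /* hypothetical build flag, not an MITgcm macro */
#include <mpi.h>
#endif

/* Sum the per-process error counts so that every process sees the same
 * total. Without MPI there is a single process, so the total is local. */
int global_error_count(int local_errors)
{
    int total = local_errors;
#ifdef USE_MPI
    MPI_Allreduce(&local_errors, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
#endif
    return total;
}

/* Collective stop: must be called by ALL processes, otherwise the
 * global sum (and therefore the whole job) hangs. */
void stop_if_error(int local_errors, const char *msg)
{
    if (global_error_count(local_errors) == 0)
        return;                         /* nobody failed, keep running */
    if (local_errors > 0)
        fprintf(stderr, "error: %s\n", msg);
#ifdef USE_MPI
    MPI_Finalize();                     /* clean shutdown on every process */
#endif
    exit(EXIT_FAILURE);
}

/* Non-collective stop: can be called by a single failing process. */
void abort_all(int code)
{
#ifdef USE_MPI
    MPI_Abort(MPI_COMM_WORLD, code);    /* asks MPI to terminate the whole job */
#endif
    exit(code);
}
```

The trade-off is the one described in the quoted message below: the collective version costs a global sum per check, while the abort version is cheap but does not go through MPI_Finalize.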

Constantinos

On Mar 29, 2012, at 1:20 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:

> Hi Martin,
> 
> I am currently looking at this "termination" problem (with MPI + OpenMP),
> since, with some (old) mpich versions, it sometimes hangs, or finishes
> but leaves some processes behind (that need to be killed afterward).
> 
> Now regarding your question, we already have 2 S/R to end cleanly:
> 1) ALL_PROC_DIE : needs to be called just before the "stop",
>  but it only works if all the MPI procs call it.
> 2) And for the case where a few (but not all) MPI procs detect an error,
>  there is another S/R: STOP_IF_ERROR, which collects the errors
>  and then decides to stop. But this 2nd one is not currently used
>  (and I don't know what TAF will do with this), and the global_sum
>  can slow down the run if used too often.
> 
> The advantage of ALL_PROC_DIE + a STOP compared to a S/R like STOP_THE_MODEL
> (which would contain the stop) is that TAF can see the stop and we provide
> some flow directives for ALL_PROC_DIE (eesupp.flow).
> And ALL_PROC_DIE is used (e.g., in pkg/monitor/mon_solution.F)
> but there are many places where the call is missing.
> 
> Cheers,
> Jean-Michel
> 
> On Thu, Mar 29, 2012 at 09:49:57AM +0200, Martin Losch wrote:
>> Hi there,
>> 
>> as you know, my MPI skills are not very good, which should explain the level of this question:
>> 
>> I often get this type of error message when the model encounters a Fortran "STOP" statement, because some parameters are not set properly, or netcdf files are not overwritten.
>> 
>> MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
>> MPI: aborting job
>> 
>> Some systems then complain about not being able to terminate the job properly and ask for manual intervention (e.g. run a LAM command or whatever), and sometimes some instances of mitgcmuv do remain and are difficult to delete without root privileges.
>> 
>> Would it be useful to replace all "STOP" statements with a S/R STOP_THE_MODEL, or some other fancy name (maybe we even have this routine and I just don't know about it?), where the system is then shut down "cleanly" (by calling MPI_finalize, if necessary)? Is that difficult to do (all sorts of different possibilities, w/ MPI, w/out MPI, etc.)? Is it worth it?
>> 
>> Martin
>> 
>> 
>> _______________________________________________
>> MITgcm-devel mailing list
>> MITgcm-devel at mitgcm.org
>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
> 


