[MITgcm-devel] mpi_finalize
EAPS
ce107 at ocean.mit.edu
Thu Mar 29 18:36:27 EDT 2012
For the case where only a subset of processes fails, one should use mpi_abort().
Constantinos
Sent from my iPhone
On Mar 29, 2012, at 1:20 PM, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> Hi Martin,
>
> I am currently looking at this "termination" problem (with MPI + OpenMP),
> since, with some (old) mpich version, it sometimes hangs, or finishes
> but leaves some processes behind (that need to be killed afterward).
>
> Now regarding your question, we already have 2 S/R to end cleanly:
> 1) ALL_PROC_DIE : needs to be called just before the "stop",
> but it only works if all the MPI procs call it.
> 2) And for the case where a few (but not all) MPI procs detect an error,
> there is another S/R: STOP_IF_ERROR, which collects the error
> and then decides to stop. But this 2nd one is not used currently
> (and I don't know what TAF will do with this), and the global_sum
> can slow down the run if used too often.
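
The second approach above (collect the error on every process, then stop everywhere) can be pictured with a small simulation. This is only a plain-Python sketch with no MPI involved: each list entry stands in for one MPI process, and `global_sum` and `stop_if_error` are hypothetical names chosen for illustration, not the MITgcm routines themselves.

```python
# Plain-Python simulation (no MPI): each list entry stands for one MPI process.
# All names here are hypothetical; this is not MITgcm code.


def global_sum(values):
    """Stand-in for a collective sum: every process receives the same total."""
    total = sum(values)
    return [total] * len(values)


def stop_if_error(local_error_counts):
    """Return one stop/continue decision per process.

    Every process contributes its local error count to the collective sum,
    so all processes see the same total and reach the same decision.
    """
    totals = global_sum(local_error_counts)
    return [total > 0 for total in totals]


# Process 2 of 4 detects a problem; the other processes are fine.
print(stop_if_error([0, 0, 1, 0]))

# No process reports an error, so the run continues everywhere.
print(stop_if_error([0, 0, 0, 0]))
```

Because the decision depends only on the shared total, all processes reach the stop together. The price is one collective operation per check, which is why calling such a routine too often can slow down the run.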
>
> The advantage of ALL_PROC_DIE + a STOP compared to a S/R like STOP_THE_MODEL
> (which would contain the stop) is that TAF can see the stop and we provide
> some flow directives for ALL_PROC_DIE (eesupp.flow).
> And ALL_PROC_DIE is used (e.g., in pkg/monitor/mon_solution.F)
> but there are many places where the call is missing.
>
> Cheers,
> Jean-Michel
>
> On Thu, Mar 29, 2012 at 09:49:57AM +0200, Martin Losch wrote:
>> Hi there,
>>
>> as you know, my MPI skills are not very good, which should explain the level of this question:
>>
>> I often get this type of error message when the model encounters a Fortran "STOP" statement, because some parameters are not set properly, or netcdf files are not overwritten.
>>
>> MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
>> MPI: aborting job
>>
>> Some systems then complain about not being able to terminate the job properly and ask for manual intervention (e.g. run a LAM command or whatever), and sometimes some instances of mitgcmuv do remain and are difficult to delete without root privileges.
>>
>> Would it be useful to replace all "STOP" statements with a S/R STOP_THE_MODEL, or some other fancy name (maybe we even have this routine and I just don't know about it?), where the system is then shut down "cleanly" (by calling MPI_Finalize, if necessary)? Is that difficult to do (all sorts of different possibilities, with MPI, without MPI, etc.)? Is it worth it?
>>
>> Martin
>>
>>
>> _______________________________________________
>> MITgcm-devel mailing list
>> MITgcm-devel at mitgcm.org
>> http://mitgcm.org/mailman/listinfo/mitgcm-devel