[MITgcm-devel] mpi_finalize
Jean-Michel Campin
jmc at ocean.mit.edu
Fri Mar 30 09:56:01 EDT 2012
Hi Martin and others,
I am not sure how to proceed; Constantinos' suggestion to use
MPI_ABORT is certainly better than my unused "stop_if_error.F" routine.
I would propose (because MPI_ABORT might be a little bit strong: it
produces an error return code/message and is less likely to flush buffers)
1) to keep the ALL_PROC_DIE + STOP feature when we know that all procs
will hit this instruction. There are many stops (in config_check
and in all the {PKG}_check.F) where this is the case and ALL_PROC_DIE
can be added. I will try to add those ALL_PROC_DIE calls in one _check.F
S/R and we will see how it goes.
And
2) we need another subroutine (containing an MPI_ABORT call)
to stop when only 1 proc finds an error (maybe this could be similar
to stop_if_error.F usage, containing a stop if single proc and an
MPI_ABORT call if using MPI).
Would people agree with this?
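
To make option 2) concrete, here is a minimal sketch of what such a
routine might look like. It is an illustration only, not existing MITgcm
code: the routine name STOP_ON_LOCAL_ERROR, the error code 1 and the
message text are hypothetical, and it assumes the file is run through CPP
with ALLOW_USE_MPI defined for MPI builds.

```fortran
C     Hypothetical sketch, not existing MITgcm code: the routine name
C     STOP_ON_LOCAL_ERROR and the error code 1 are assumptions.
C     Needs CPP preprocessing (e.g. a .F file); ALLOW_USE_MPI is
C     assumed to be defined when building with MPI.
      SUBROUTINE STOP_ON_LOCAL_ERROR( msg )
      IMPLICIT NONE
#ifdef ALLOW_USE_MPI
#include "mpif.h"
#endif
C     msg :: text describing the error found by this process
      CHARACTER*(*) msg
#ifdef ALLOW_USE_MPI
      INTEGER mpiRC
#endif
C     Report the error on standard output
      WRITE(*,'(2A)') 'STOP_ON_LOCAL_ERROR: ', msg
#ifdef ALLOW_USE_MPI
C     Only some procs reach this point, so a collective shutdown
C     could hang: abort the whole job from this process instead.
      CALL MPI_ABORT( MPI_COMM_WORLD, 1, mpiRC )
#endif
C     Single-process case (or fallback if MPI_ABORT returns)
      STOP 'ABNORMAL END: S/R STOP_ON_LOCAL_ERROR'
      END
```

Unlike the ALL_PROC_DIE + STOP pair in 1), this hides the STOP inside
a subroutine, so it would probably need its own TAF flow directive.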
And regarding the 2 examples you mentioned:
a) pkg/mnc/mnc_var.F : I think all procs will hit this stop, so
we can add an ALL_PROC_DIE call before the stop.
(I added a few ALL_PROC_DIE calls in pkg/mdsio_read/write_field.F
some time ago, but did not go further).
b) model/src/calc_r_start.F : (line 246 ?) Here we need the 2nd
version because not all procs will hit this stop.
Does this make sense ?
Cheers,
Jean-Michel
On Fri, Mar 30, 2012 at 09:20:59AM +0200, Martin Losch wrote:
> Hi Jean-Michel and others,
>
> so you recommend that we always use
> CALL ALL_PROC_DIE(myThid)
> STOP 'SOME MESSAGE'
> also in cases such as, for example, pkg/mnc/mnc_var.F l219, which is
> probably the case that I encounter most often (when I forget to
> remove *.nc files from the working directory), or
> model/src/calc_r_start.F l108?
>
> Martin
>
> On 03/29/2012 07:20 PM, Jean-Michel Campin wrote:
> >Hi Martin,
> >
> >I am currently looking at this "termination" problem (with MPI + OpenMP),
> >since, with some (old) mpich versions, sometimes it hangs or finishes
> >but leaves some processes behind (that need to be killed afterward).
> >
> >Now regarding your question, we already have 2 S/R to end cleanly:
> >1) ALL_PROC_DIE : needs to be called just before the "stop",
> > but it only works if all the MPI procs call it.
> >2) And for the case where a few (but not all) MPI procs detect an error,
> > there is another S/R: STOP_IF_ERROR which collects the errors
> > and then decides whether to stop. But this 2nd one is not used currently
> > (and I don't know what TAF will do with this), and the global_sum
> > can slow down the run if used too often.
> >
> >The advantage of ALL_PROC_DIE + a STOP compared to a S/R like STOP_THE_MODEL
> >(which would contain the stop) is that TAF can see the stop and we provide
> >some flow directives for ALL_PROC_DIE (eesupp.flow).
> >And ALL_PROC_DIE is used (e.g., in pkg/monitor/mon_solution.F)
> >but there are many places where the call is missing.
> >
> >Cheers,
> >Jean-Michel
> >
> >On Thu, Mar 29, 2012 at 09:49:57AM +0200, Martin Losch wrote:
> >>Hi there,
> >>
> >>as you know, my MPI skills are not very good, which should explain the level of this question:
> >>
> >>I often get this type of error message when the model encounters a Fortran "STOP" statement, because some parameters are not set properly, or netcdf files are not overwritten.
> >>
> >>MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
> >>MPI: aborting job
> >>
> >>Some systems then complain about not being able to terminate the job properly and ask for manual intervention (e.g. run a LAM command or whatever), and sometimes some instances of mitgcmuv do remain and are difficult to delete without root privileges.
> >>
> >>Would it be useful to replace all "STOP" statements with a S/R STOP_THE_MODEL, or some other fancy name (maybe we even have this routine and I just don't know about it?), where the system is then shut down "cleanly" (by calling MPI_Finalize, if necessary)? Is that difficult to do (all sorts of different possibilities, with MPI, without MPI, etc.)? Is it worth it?
> >>
> >>Martin
> >>
> >>
> >>_______________________________________________
> >>MITgcm-devel mailing list
> >>MITgcm-devel at mitgcm.org
> >>http://mitgcm.org/mailman/listinfo/mitgcm-devel
>
> --
> Martin Losch
> Alfred Wegener Institute for Polar and Marine Research
> Postfach 120161, 27515 Bremerhaven, Germany;
> Tel./Fax: ++49(0471)4831-1872/1797
>
>