[MITgcm-devel] mpi_finalize
Jean-Michel Campin
jmc at ocean.mit.edu
Mon Apr 2 12:06:43 EDT 2012
Hi,
I made some changes in model/src/config_check.F (not claiming it's
the best way to do this check, but more like a prototype for these
various {PKG}_check):
a) it collects all the errors (instead of stopping at the first one).
This could be an improvement when there are several errors:
we get all the error messages the first time instead of having
to fix+recompile+run one after the other.
b) it adds a call to ALL_PROC_DIE before the stop.
I am not sure that the threading implementation is the best one
(I only let the master thread check & stop, to avoid confusing
error messages from all threads), but in the different tests I ran
it was working fine for me.
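
To illustrate a) and b), here is a minimal stand-alone sketch of the
pattern. It is not the actual config_check.F: it is written in free-form
Fortran for brevity, the parameters nIter and deltaT and the tests on them
are made up, isMasterThread stands in for the real master-thread test, and
all_proc_die_stub is a hypothetical placeholder for CALL ALL_PROC_DIE(myThid).

```fortran
! Hypothetical stand-alone sketch; NOT the actual config_check.F.
program check_sketch
  implicit none
  integer :: errCount
  ! Made-up parameters standing in for run-time settings (assumptions).
  integer, parameter :: nIter = -5
  real, parameter :: deltaT = 0.0
  ! Stand-in for the real master-thread test (assumption).
  logical, parameter :: isMasterThread = .true.

  errCount = 0

  ! Each test reports its problem and increments the counter
  ! instead of stopping right away.
  if (nIter < 0) then
    write(*,'(A)') 'CHECK_SKETCH: nIter must be >= 0'
    errCount = errCount + 1
  end if
  if (deltaT <= 0.0) then
    write(*,'(A)') 'CHECK_SKETCH: deltaT must be > 0'
    errCount = errCount + 1
  end if

  ! Only after all the tests: report the total and stop once.
  ! Only the master thread reports and stops.
  if (errCount >= 1) then
    if (isMasterThread) then
      write(*,'(A,I3,A)') 'CHECK_SKETCH: detected', errCount, ' fatal error(s)'
      call all_proc_die_stub()
      stop 'ABNORMAL END: S/R CHECK_SKETCH'
    end if
  end if
  write(*,'(A)') 'CHECK_SKETCH: all checks passed'

contains

  subroutine all_proc_die_stub()
    ! Placeholder for CALL ALL_PROC_DIE(myThid); here it only prints.
    write(*,'(A)') 'all_proc_die_stub: clean shutdown would happen here'
  end subroutine all_proc_die_stub

end program check_sketch
```

The STOP stays in the calling routine, right after the ALL_PROC_DIE
placeholder, rather than being hidden inside a separate subroutine.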
Cheers,
Jean-Michel
On Fri, Mar 30, 2012 at 09:56:01AM -0400, Jean-Michel Campin wrote:
> Hi Martin and others,
>
> I am not sure how to proceed; Constantinos' suggestion to use
> MPI_ABORT is certainly better than my unused "stop_if_error.F" routine.
>
> I would propose (because MPI_ABORT might be a little bit strong,
> produce an error return code/message, and be less likely to flush buffers)
> 1) to keep the ALL_PROC_DIE + STOP feature when we know that all procs
> will hit this instruction. There are many STOPs (in config_check
> and in all the {PKG}_check.F) where this is the case and ALL_PROC_DIE
> can be added. I will try to add those ALL_PROC_DIE calls in one _check.F
> S/R and we will see how it goes.
> And
> 2) we need another subroutine (containing an MPI_ABORT call)
> to stop when only one proc finds an error (maybe this could be similar
> to the stop_if_error.F usage, containing a STOP if single proc and an
> MPI_ABORT call if using MPI).
> Would people agree with this?
>
> And regarding the 2 examples you mentioned:
> a) pkg/mnc/mnc_var.F : I think all procs will hit this stop, so
> we can add an ALL_PROC_DIE call before the stop.
> (I added a few ALL_PROC_DIE calls in pkg/mdsio_read/write_field.F
> some time ago, but did not go further).
> b) model/src/calc_r_start.F : (line 246 ?) Here we need the 2nd
> version because not all procs will hit this stop.
> Does this make sense?
>
> Cheers,
> Jean-Michel
>
> On Fri, Mar 30, 2012 at 09:20:59AM +0200, Martin Losch wrote:
> > Hi Jean-Michel and others,
> >
> > so you recommend that we always use
> > CALL ALL_PROC_DIE(myThid)
> > STOP 'SOME MESSAGE'
> > also in cases such as, for example, pkg/mnc/mnc_var.F l219, which is
> > probably the case that I encounter most often (when I forget to
> > remove *.nc files from the working directory), or
> > model/src/calc_r_start.F l108?
> >
> > Martin
> >
> > On 03/29/2012 07:20 PM, Jean-Michel Campin wrote:
> > >Hi Martin,
> > >
> > >I am currently looking at this "termination" problem (with MPI + OpenMP),
> > >since, with some (old) mpich versions, it sometimes hangs, or finishes
> > >but leaves some processes behind (that need to be killed afterward).
> > >
> > >Now regarding your question, we already have 2 S/R to end cleanly:
> > >1) ALL_PROC_DIE : needs to be called just before the "stop",
> > > but it only works if all the MPI procs call it.
> > >2) And for the case where a few (but not all) MPI procs detect an error,
> > > there is another S/R: STOP_IF_ERROR, which collects the errors
> > > and then decides to stop. But this 2nd one is not used currently
> > > (and I don't know what TAF will do with this), and the global_sum
> > > can slow down the run if used too often.
> > >
> > >The advantage of ALL_PROC_DIE + a STOP compared to a S/R like STOP_THE_MODEL
> > >(which would contain the stop) is that TAF can see the stop and we provide
> > >some flow directives for ALL_PROC_DIE (eesupp.flow).
> > >And ALL_PROC_DIE is used (e.g., in pkg/monitor/mon_solution.F)
> > >but there are many places where the call is missing.
> > >
> > >Cheers,
> > >Jean-Michel
> > >
> > >On Thu, Mar 29, 2012 at 09:49:57AM +0200, Martin Losch wrote:
> > >>Hi there,
> > >>
> > >>as you know, my MPI skills are not very good, which should explain the level of this question:
> > >>
> > >>I often get this type of error message when the model encounters a Fortran "STOP" statement, because some parameters are not set properly, or netcdf files are not overwritten.
> > >>
> > >>MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
> > >>MPI: aborting job
> > >>
> > >>Some systems then complain about not being able to terminate the job properly and ask for manual intervention (e.g. run a LAM command or whatever), and sometimes some instances of mitgcmuv do remain and are difficult to delete without root privileges.
> > >>
> > >>Would it be useful to replace all "STOP" statements with a S/R STOP_THE_MODEL, or some other fancy name (maybe we even have this routine and I just don't know about it?), where the system is then shut down "cleanly" (by calling MPI_finalize, if necessary)? Is that difficult to do (all sorts of different possibilities, w/ MPI, w/out MPI, etc.)? Is it worth it?
> > >>
> > >>Martin
> > >>
> > >>
> > >>_______________________________________________
> > >>MITgcm-devel mailing list
> > >>MITgcm-devel at mitgcm.org
> > >>http://mitgcm.org/mailman/listinfo/mitgcm-devel
> >
> > --
> > Martin Losch
> > Alfred Wegener Institute for Polar and Marine Research
> > Postfach 120161, 27515 Bremerhaven, Germany;
> > Tel./Fax: ++49(0471)4831-1872/1797
> >