[MITgcm-devel] mpi_finalize

Jean-Michel Campin jmc at ocean.mit.edu
Fri Mar 30 09:56:01 EDT 2012


Hi Martin and others,

I am not sure how to proceed; Constantinos' suggestion to use 
MPI_ABORT is certainly better than my unused "stop_if_error.F" routine. 

Because MPI_ABORT might be a little bit strong (it produces an error
return code/message and is less likely to flush buffers), I would propose:
1) to keep the ALL_PROC_DIE + STOP feature when we know that all procs
 will hit this instruction. There are many STOP statements (in config_check
 and in all the {PKG}_check.F) where this is the case and ALL_PROC_DIE
 can be added. I will try to add those ALL_PROC_DIE calls in one _check.F
 S/R and we will see how it goes.
And
2) we need another subroutine (containing an MPI_ABORT call) 
 to stop when only one proc finds an error (maybe this could be similar 
 to the stop_if_error.F usage, containing a STOP if single proc and an
 MPI_ABORT call if using MPI).
Would people agree with this?
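
For case 2), here is a minimal sketch of what such a routine could look
like. It is only an illustration under stated assumptions, not code from
the repository: the name STOP_ONE_PROC, its errCode argument and the small
demo driver are hypothetical, and ALLOW_USE_MPI is assumed to be the CPP
flag that marks an MPI build. The file is meant to go through the C
preprocessor (a ".F" file) before compilation.

```fortran
C     Hypothetical sketch: STOP_ONE_PROC is an illustrative name, not
C     an existing MITgcm routine. Compile as a preprocessed (.F) file.
      SUBROUTINE STOP_ONE_PROC( errCode )
      IMPLICIT NONE
#ifdef ALLOW_USE_MPI
#include "mpif.h"
#endif
C     errCode :: error code handed to MPI_ABORT (assumed argument)
      INTEGER errCode
#ifdef ALLOW_USE_MPI
      INTEGER mpiRC
C     Terminate every rank, even those that did not detect the error.
      CALL MPI_ABORT( MPI_COMM_WORLD, errCode, mpiRC )
#endif
C     Single-process case (or if MPI_ABORT ever returns): plain STOP.
      STOP 'ABNORMAL END: S/R STOP_ONE_PROC'
      END

C     Hypothetical driver, only here to make the sketch runnable.
      PROGRAM DEMO_STOP
      IMPLICIT NONE
#ifdef ALLOW_USE_MPI
#include "mpif.h"
      INTEGER mpiRC
      CALL MPI_INIT( mpiRC )
#endif
C     Pretend this process detected an error.
      CALL STOP_ONE_PROC( 1 )
      END
```

Unlike ALL_PROC_DIE, a routine like this makes no assumption that the
other procs reach the same instruction, which is why it would suit the
"only 1 proc finds an error" case. Since MPI_ABORT is less likely to
flush buffers, any error message should be written before the call.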

And regarding the 2 examples you mentioned:
a) pkg/mnc/mnc_var.F : I think all procs will hit this stop, so 
   we can add an ALL_PROC_DIE call before the stop.
   (I added a few ALL_PROC_DIE calls in pkg/mdsio_read/write_field.F
   some time ago, but did not go further). 
b) model/src/calc_r_start.F : (line 246?) Here we need the 2nd
  version because not all procs will hit this stop.
Does this make sense?

Cheers,
Jean-Michel

On Fri, Mar 30, 2012 at 09:20:59AM +0200, Martin Losch wrote:
> Hi Jean-Michel and others,
> 
> so you recommend that we always use
> CALL ALL_PROC_DIE(myThid)
> STOP 'SOME MESSAGE'
> also in cases such as, for example, pkg/mnc/mnc_var.F l219, which is
> probably the case that I encounter most often (when I forget to
> remove *.nc files from the working directory), or
> model/src/calc_r_start.F l108?
> 
> Martin
> 
> On 03/29/2012 07:20 PM, Jean-Michel Campin wrote:
> >Hi Martin,
> >
> >I am currently looking at this "termination" problem (with MPI + OpenMP),
> >since, with some (old) mpich versions, it sometimes hangs or finishes
> >but leaves some processes behind (that need to be killed afterward).
> >
> >Now regarding your question, we already have 2 S/R to end cleanly:
> >1) ALL_PROC_DIE : needs to be called just before the "stop",
> >   but it only works if all the MPI procs call it.
> >2) And for the case where a few (but not all) MPI procs detect an error,
> >   there is another S/R: STOP_IF_ERROR, which collects the error
> >   and then decides to stop. But this 2nd one is not used currently
> >   (and I don't know what TAF will do with this), and the global_sum
> >   can slow down the run if used too often.
> >
> >The advantage of ALL_PROC_DIE + a STOP compared to a S/R like STOP_THE_MODEL
> >(which would contain the stop) is that TAF can see the stop and we provide
> >some flow directives for ALL_PROC_DIE (eesupp.flow).
> >And ALL_PROC_DIE is used (e.g., in pkg/monitor/mon_solution.F)
> >but there are many places where the call is missing.
> >
> >Cheers,
> >Jean-Michel
> >
> >On Thu, Mar 29, 2012 at 09:49:57AM +0200, Martin Losch wrote:
> >>Hi there,
> >>
> >>as you know, my mpi-skills are not very good, which should explain the level of question:
> >>
> >>I often get this type of error message when the model encounters a Fortran "STOP" statement, because some parameters are not set properly, or netcdf files are not overwritten.
> >>
> >>MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
> >>MPI: aborting job
> >>
> >>Some systems then complain about not being able to terminate the job properly and ask for manual intervention (e.g. run a LAM command or whatever), and sometimes some instances of mitgcmuv remain and are difficult to delete without root privileges.
> >>
> >>Would it be useful to replace all "STOP" statements with a S/R STOP_THE_MODEL, or some other fancy name (maybe we even have this routine and I just don't know about it?), where the system is then shut down "cleanly" (by calling MPI_Finalize, if necessary)? Is that difficult to do (all sorts of different possibilities, w/ MPI, w/out MPI, etc.)? Is it worth it?
> >>
> >>Martin
> >>
> >>
> >>_______________________________________________
> >>MITgcm-devel mailing list
> >>MITgcm-devel at mitgcm.org
> >>http://mitgcm.org/mailman/listinfo/mitgcm-devel
> 
> -- 
> Martin Losch
> Alfred Wegener Institute for Polar and Marine Research
> Postfach 120161, 27515 Bremerhaven, Germany;
> Tel./Fax: ++49(0471)4831-1872/1797
> 