[MITgcm-devel] strange error messages from diagnostics pkg

Jean-Michel Campin jmc at mit.edu
Thu Jun 2 11:58:16 EDT 2016


Hi Martin,

This is interesting. The addition of "BARRIER" was not strictly necessary,
so the fact that it does not help is perhaps not too surprising.

However, if adding diagnostics_switch_onoff.F to the NOOPTFILES list with
NOOPTFLAGS set to "-O1" works (right?), this is a pretty good solution.
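
As a rough sketch of that setup (assuming the build uses an optfile passed to
genmake2; all other optfile settings are omitted here and depend on the
machine), the two variables would be set like this:

```
# Optfile fragment (sketch only): files listed in NOOPTFILES are compiled
# with NOOPTFLAGS instead of the default optimization flags.
NOOPTFLAGS='-O1'
NOOPTFILES='diagnostics_switch_onoff.F'
```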

Cheers,
Jean-Michel

On Thu, Jun 02, 2016 at 10:55:56AM +0200, Martin Losch wrote:
> Hi Jean-Michel,
> 
> This thread is a year old, but I now have a machine (ollie) that shows the same problem and is much easier to handle, so I can pursue this a little.
> 
> The "new" code with the extra barriers does not work either, and I have to compile diagnostics_switch_onoff.F at least with -O1, which is probably not a problem. It's still interesting though ...
> 
> Martin
> 
> 
> 
> 
> > On 03 Jun 2015, at 14:47, Martin Losch <Martin.Losch at awi.de> wrote:
> > 
> > Hi Jean-Michel,
> > 
> > Thanks, I'll give the new code a try. I haven't had the time/chance to try the NOOPTFILE option.
> > 
> > M.
> > 
> >> On 02 Jun 2015, at 23:19, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> >> 
> >> Hi Martin,
> >> 
> >> On Mon, May 18, 2015 at 05:10:13PM +0200, Martin Losch wrote:
> >>> thanks for checking; 
> >>> 
> >>> Not that I understand the details of this, but do you think it is possible that the master thread is slower than (at least) one of the other 3 threads (nSx=4, nSy=1) and that they call diagnostics_fill before the master thread has set diag_pkgStatus? I am using singleCPUio=True and the Lustre filesystem of this computer tends to be slow sometimes, so I can imagine that the timing is a little off between processes/threads. What do you think? Would a barrier in diagnostics_switch_onoff.F help?
> >> 
> >> As I wrote earlier, I don't have the impression that there was something
> >> wrong in the code (no thread race issue).
> >> But anyway, I have added _BARRIER any time "diag_pkgStatus" is updated,
> >> so that it's a little bit cleaner.
> >> It might also fix your problem in the same way as the NOOPTFILE option would,
> >> since adding BARRIER can prevent the compiler from wrongly optimizing the code.
> >> 
> >> Did you have a chance to try the NOOPTFILE option?
> >> 
> >> Cheers,
> >> Jean-Michel
> >> 
> >>> 
> >>> I'll try the NOOPTFILE option. (BTW, this is not the only strange thing that happens on this computer).
> >>> 
> >>> Martin
> >>> 
> >>>> On 18 May 2015, at 16:41, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> >>>> 
> >>>> Hi Martin,
> >>>> 
> >>>> I have checked the code, and the switch+check of "diag_pkgStatus"
> >>>> does not look wrong (even without BARRIER):
> >>>> - only the Master-Thread updates/modifies diag_pkgStatus.
> >>>> - and even if all threads check its value, only the Master-Thread
> >>>> does the print-error and stop (in diagnostics_status_error.F);
> >>>> 
> >>>> So, in your case, it's not clear why Master-Thread does not
> >>>> set diag_pkgStatus to 20 (=ready2fillDiags) when calling diagnostics_switch_onoff.F
> >>>> for the first time (myIter.EQ.nIter0).
> >>>> Maybe you could try putting diagnostics_switch_onoff.F in the NOOPTFILES list?
> >>>> 
> >>>> Cheers,
> >>>> Jean-Michel
> >>>> 
> >>>> On Mon, May 18, 2015 at 09:45:24AM -0400, Jean-Michel Campin wrote:
> >>>>> Hi Martin,
> >>>>> 
> >>>>> I guess it's related to multi-threading (OpenMP) and it looks like
> >>>>> we are missing a few "BARRIER" in the code. I am currently checking
> >>>>> this and will let you know later.
> >>>>> 
> >>>>> Cheers,
> >>>>> Jean-Michel
> >>>>> 
> >>>>> On Mon, May 18, 2015 at 10:14:29AM +0200, Martin Losch wrote:
> >>>>>> Hi there,
> >>>>>> 
> >>>>>> every now and then I am getting strange error messages from the
> >>>>>> diagnostics pkg:
> >>>>>> 
> >>>>>> (PID.TID 0121.0001) *** DIAGNOSTICS_STATUS_ERROR *** from:
> >>>>>> DIAGNOSTICS_FILL call
> >>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: diagName="ETAN
> >>>>>> ", expectStatus= 20, pkgStatus= 10
> >>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: <== called from
> >>>>>> the WRONG place, i.e.
> >>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: before
> >>>>>> DIAGNOSTICS_SWITCH_ONOFF call in FORWARD_STEP
> >>>>>> 
> >>>>>> this time from processes 11 and 121 of a 624 cpu run (in fact I
> >>>>>> use nPx = 156 and nSx = 4 with OpenMP). The errors tend not to be
> >>>>>> reproducible (i.e. I rerun the same setup without any changes and
> >>>>>> without any problems), but pop up every couple of runs in a
> >>>>>> "chain-job" on the Cray XC-30 (cca.ecmwf.int). This seems to happen
> >>>>>> when the model tries to store the first time slice of my first
> >>>>>> diagnostics (ETAN).
> >>>>>> The code is "vanilla" checkpoint65k plus a few days.
> >>>>>> 
> >>>>>> Have you seen that before? Are there any chances to debug this? Can
> >>>>>> it have anything to do with OpenMP?
> >>>>>> 
> >>>>>> Martin
> >>>>>> 
> >>>>>> 
> >>>>>> -- 
> >>>>>> Martin Losch
> >>>>>> Alfred Wegener Institute for Polar and Marine Research
> >>>>>> Postfach 120161, 27515 Bremerhaven, Germany;
> >>>>>> Tel./Fax: ++49(0471)4831-1872/1797
> >>>>>> 
> >>>>>> 
> >>>>>> _______________________________________________
> >>>>>> MITgcm-devel mailing list
> >>>>>> MITgcm-devel at mitgcm.org
> >>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel


