[MITgcm-devel] strange error messages from diagnostics pkg

Jean-Michel Campin jmc at ocean.mit.edu
Tue Jun 2 17:19:39 EDT 2015


Hi Martin,

On Mon, May 18, 2015 at 05:10:13PM +0200, Martin Losch wrote:
> thanks for checking; 
> 
> Not that I understand the details of this: Do you think it is possible that the master thread is slower than (at least) one of the other 3 threads (nSx=4, nSy=1) and that they call diagnostics_fill before the master thread has set diag_pkgStatus? I am using singleCPUio=True, and the Lustre filesystem of this computer tends to be slow sometimes, so I can imagine that the timing is a little screwed up between processes/threads. What do you think? Would a barrier in diagnostics_switch_onoff.F help?

As I wrote earlier, I don't have the impression that there was something
wrong in the code (no thread race issue).
But anyway, I have added a _BARRIER every time "diag_pkgStatus" is updated,
so that it's a little bit cleaner.
It might also fix your problem in the same way as the NOOPTFILES option would,
since adding a BARRIER can prevent the compiler from wrongly optimizing the code.
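
For illustration only, here is a minimal sketch of the "master thread updates,
then everybody waits at a barrier" pattern. It is written in Python rather than
MITgcm Fortran, and every name in it (run_threads, pkg_status, NOT_READY,
READY_TO_FILL) is hypothetical; only the values 10 and 20 are taken from the
error message quoted below. It shows the ordering guarantee of a barrier, not
the effect on compiler optimization.

```python
import threading

# Hypothetical stand-ins for the two status values seen in the error message.
NOT_READY = 10
READY_TO_FILL = 20


def run_threads(n_threads=4):
    """Let only thread 0 update a shared status, then have all threads read it."""
    shared = {"pkg_status": NOT_READY}
    barrier = threading.Barrier(n_threads)
    seen = [None] * n_threads

    def worker(tid):
        if tid == 0:
            # Only the "master" thread modifies the shared status.
            shared["pkg_status"] = READY_TO_FILL
        # No thread leaves the barrier until every thread, including the
        # master, has reached it, so the update above is already done.
        barrier.wait()
        seen[tid] = shared["pkg_status"]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(run_threads())
```

Without the barrier.wait() call, a non-master thread could read the status
before the master has written it.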

Did you have a chance to try the NOOPTFILES option?

Cheers,
Jean-Michel

> 
> I'll try the NOOPTFILE option. (BTW, this is not the only strange thing that happens on this computer).
> 
> Martin
> 
> > On 18 May 2015, at 16:41, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> > 
> > Hi Martin,
> > 
> > I have checked the code, and the switch+check of "diag_pkgStatus"
> > does not look wrong (even without BARRIER):
> > - only the Master-Thread updates/modifies diag_pkgStatus;
> > - and even if all threads check its value, only the Master-Thread
> >   prints the error and stops (in diagnostics_status_error.F).
> > 
> > So, in your case, it's not clear why Master-Thread does not
> > set diag_pkgStatus to 20 (=ready2fillDiags) when calling diagnostics_switch_onoff.F
> > for the first time (myIter.EQ.nIter0).
> > Maybe you could try putting diagnostics_switch_onoff.F in the NOOPTFILES list?
> > 
> > Cheers,
> > Jean-Michel
> > 
> > On Mon, May 18, 2015 at 09:45:24AM -0400, Jean-Michel Campin wrote:
> >> Hi Martin,
> >> 
> >> I guess it's related to multi-threading (OpenMP) and it looks like
> >> we are missing a few "BARRIER" in the code. I am currently checking
> >> this and will let you know later.
> >> 
> >> Cheers,
> >> Jean-Michel
> >> 
> >> On Mon, May 18, 2015 at 10:14:29AM +0200, Martin Losch wrote:
> >>> Hi there,
> >>> 
> >>> every now and then I am getting strange error messages from the
> >>> diagnostics pkg:
> >>> 
> >>> (PID.TID 0121.0001) *** DIAGNOSTICS_STATUS_ERROR *** from:
> >>> DIAGNOSTICS_FILL call
> >>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: diagName="ETAN
> >>> ", expectStatus= 20, pkgStatus= 10
> >>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: <== called from
> >>> the WRONG place, i.e.
> >>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: before
> >>> DIAGNOSTICS_SWITCH_ONOFF call in FORWARD_STEP
> >>> 
> >>> this time from processes 11 and 121 of a 624-CPU run (in fact I
> >>> use nPx = 156 and nSx = 4 with OpenMP). They tend not to be
> >>> reproducible (i.e. I rerun the same setup without any changes and
> >>> without any problems), but pop up every couple of runs in a
> >>> "chain-job" on the Cray XC-30 (cca.ecmwf.int). This seems to happen
> >>> when the model tries to store the first time slice of my first
> >>> diagnostics (ETAN).
> >>> The code is "vanilla" checkpoint65k plus a few days.
> >>> 
> >>> Have you seen that before? Are there any chances to debug this? Could
> >>> it have anything to do with OpenMP?
> >>> 
> >>> Martin
> >>> 
> >>> 
> >>> -- 
> >>> Martin Losch
> >>> Alfred Wegener Institute for Polar and Marine Research
> >>> Postfach 120161, 27515 Bremerhaven, Germany;
> >>> Tel./Fax: ++49(0471)4831-1872/1797
> >>> 
> >>> 
> >>> _______________________________________________
> >>> MITgcm-devel mailing list
> >>> MITgcm-devel at mitgcm.org
> >>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
> >> 
> >> _______________________________________________
> >> MITgcm-devel mailing list
> >> MITgcm-devel at mitgcm.org
> >> http://mitgcm.org/mailman/listinfo/mitgcm-devel
> > 
> > _______________________________________________
> > MITgcm-devel mailing list
> > MITgcm-devel at mitgcm.org
> > http://mitgcm.org/mailman/listinfo/mitgcm-devel
> 
> 
> _______________________________________________
> MITgcm-devel mailing list
> MITgcm-devel at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-devel


