[MITgcm-devel] strange error messages from diagnostics pkg

Jean-Michel Campin jmc at ocean.mit.edu
Mon May 18 10:41:33 EDT 2015


Hi Martin,

I have checked the code, and the switch+check of "diag_pkgStatus"
does not look wrong (even without BARRIER):
 - only the Master-Thread updates/modifies diag_pkgStatus.
 - and even if all threads check its value, only the Master-Thread
   prints the error and stops (in diagnostics_status_error.F);

So, in your case, it's not clear why the Master-Thread does not
set diag_pkgStatus to 20 (=ready2fillDiags) when calling diagnostics_switch_onoff.F
for the first time (myIter.EQ.nIter0).
Maybe you could try putting diagnostics_switch_onoff.F in the NOOPTFILES list?
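
A minimal sketch of that suggestion, assuming the build uses a genmake2
options file (the optfile is whichever one is already in use on this
machine; the -O0 value is an assumption, not a tested setting):

```shell
# Fragment of a genmake2 build options file (sourced as shell).
# Source files that should be compiled without optimization.
NOOPTFILES='diagnostics_switch_onoff.F'
# Compiler flags applied to the files listed in NOOPTFILES (assumed value).
NOOPTFLAGS='-O0'
```

The Makefile has to be regenerated with genmake2 and the code rebuilt for
this to take effect. If the optfile already sets NOOPTFILES, the routine
would be added to the existing list rather than replacing it.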

Cheers,
Jean-Michel

On Mon, May 18, 2015 at 09:45:24AM -0400, Jean-Michel Campin wrote:
> Hi Martin,
> 
> I guess it's related to multi-threading (OpenMP) and it looks like
> we are missing a few "BARRIER" statements in the code. I am currently
> checking this and will let you know later.
> 
> Cheers,
> Jean-Michel
> 
> On Mon, May 18, 2015 at 10:14:29AM +0200, Martin Losch wrote:
> > Hi there,
> > 
> > every now and then I am getting strange error messages from the
> > diagnostics pkg:
> > 
> > (PID.TID 0121.0001) *** DIAGNOSTICS_STATUS_ERROR *** from:
> > DIAGNOSTICS_FILL call
> > (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: diagName="ETAN
> > ", expectStatus= 20, pkgStatus= 10
> > (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: <== called from
> > the WRONG place, i.e.
> > (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: before
> > DIAGNOSTICS_SWITCH_ONOFF call in FORWARD_STEP
> > 
> > this time from processes 11 and 121 of a 624 cpu run (in fact I
> > use nPx = 156 and nSx = 4 with OpenMP). The errors tend not to be
> > reproducible (i.e. I rerun the same setup without any changes and
> > without any problems), but pop up every couple of runs in a
> > "chain-job" on the Cray XC-30 (cca.ecmwf.int). This seems to happen
> > when the model tries to store the first time slice of my first
> > diagnostics (ETAN).
> > The code is "vanilla" checkpoint65k plus a few days.
> > 
> > Have you seen that before? Are there any chances to debug this?
> > Could it have anything to do with OpenMP?
> > 
> > Martin
> > 
> > 
> > -- 
> > Martin Losch
> > Alfred Wegener Institute for Polar and Marine Research
> > Postfach 120161, 27515 Bremerhaven, Germany;
> > Tel./Fax: ++49(0471)4831-1872/1797
> > 
> > 
> > _______________________________________________
> > MITgcm-devel mailing list
> > MITgcm-devel at mitgcm.org
> > http://mitgcm.org/mailman/listinfo/mitgcm-devel
> 


