[MITgcm-devel] strange error messages from diagnostics pkg

Martin Losch Martin.Losch at awi.de
Mon May 18 11:10:13 EDT 2015


Thanks for checking.

I don't claim to understand the details, but do you think it is possible that the master thread is slower than (at least) one of the other 3 threads (nSx=4, nSy=1) and that they call diagnostics_fill before the master thread has set diag_pkgStatus? I am running with singleCPUio=True, and the Lustre filesystem of this computer tends to be slow sometimes, so I can imagine that the timing gets a little out of sync between processes/threads. What do you think? Would a barrier in diagnostics_switch_onoff.F help?
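
To make the suspected race concrete, here is a toy sketch in Python. It is not MITgcm code: the function names `switch_onoff` and `fill`, the thread count, and the artificial delay are all made up for illustration, and only the status values 10 and 20 are taken from the error message. It only shows the pattern in question: one thread sets a shared status, the others read it, and a barrier forces the readers to wait. Whether anything like this actually happens in the diagnostics pkg is exactly what is not clear.

```python
import threading
import time

NOT_READY = 10       # pkgStatus value reported in the error message
READY_TO_FILL = 20   # expectStatus value (= ready2fillDiags)


def run(n_threads=4, use_barrier=True, master_delay=0.2):
    """Return the status each non-master thread sees when it 'fills'."""
    status = {"diag_pkgStatus": NOT_READY}  # shared by all threads
    seen = {}                               # thread id -> observed status
    barrier = threading.Barrier(n_threads)

    def switch_onoff(tid):
        # Only the master thread (tid 0) updates the shared status.
        if tid == 0:
            time.sleep(master_delay)        # hypothetical slow master (e.g. I/O)
            status["diag_pkgStatus"] = READY_TO_FILL
        if use_barrier:
            barrier.wait()                  # nobody continues until all arrive

    def fill(tid):
        # Stand-in for the status check done before filling a diagnostic.
        seen[tid] = status["diag_pkgStatus"]

    def step(tid):
        switch_onoff(tid)
        if tid != 0:
            fill(tid)

    threads = [threading.Thread(target=step, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print("with barrier:   ", run(use_barrier=True))
    print("without barrier:", run(use_barrier=False))
```

With the barrier, every non-master thread waits until the master has set the status. Without it, a non-master thread can read the status while the master is still delayed and see the old value, which would look like the "expectStatus= 20, pkgStatus= 10" message.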

I’ll try the NOOPTFILES option. (BTW, this is not the only strange thing that happens on this computer.)

Martin

> On 18 May 2015, at 16:41, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> 
> Hi Martin,
> 
> I have checked the code, and the switch+check of "diag_pkgStatus"
> does not look wrong (even without BARRIER):
> - only the Master-Thread updates/modifies diag_pkgStatus.
> - and even if all threads check its value, only the Master-Thread
>   does the print-error and stop (in diagnostics_status_error.F);
> 
> So, in your case, it's not clear why Master-Thread does not
> set diag_pkgStatus to 20 (=ready2fillDiags) when calling diagnostics_switch_onoff.F
> for the first time (myIter.EQ.nIter0).
> Maybe you could try putting diagnostics_switch_onoff.F in the NOOPTFILES list?
> 
> Cheers,
> Jean-Michel
> 
> On Mon, May 18, 2015 at 09:45:24AM -0400, Jean-Michel Campin wrote:
>> Hi Martin,
>> 
>> I guess it's related to multi-threading (OpenMP) and it looks like
>> we are missing a few "BARRIER"s in the code. I am currently checking
>> this and will let you know later.
>> 
>> Cheers,
>> Jean-Michel
>> 
>> On Mon, May 18, 2015 at 10:14:29AM +0200, Martin Losch wrote:
>>> Hi there,
>>> 
>>> every now and then I am getting strange error messages from the
>>> diagnostics pkg:
>>> 
>>> (PID.TID 0121.0001) *** DIAGNOSTICS_STATUS_ERROR *** from: DIAGNOSTICS_FILL call
>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: diagName="ETAN ", expectStatus= 20, pkgStatus= 10
>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: <== called from the WRONG place, i.e.
>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: before DIAGNOSTICS_SWITCH_ONOFF call in FORWARD_STEP
>>> 
>>> this time from processes 11 and 121 of a 624-CPU run (in fact I
>>> use nPx = 156 and nSx = 4 with OpenMP). The errors tend not to be
>>> reproducible (i.e. I rerun the same setup without any changes and
>>> without any problems), but they pop up every couple of runs in a
>>> "chain-job" on the Cray XC-30 (cca.ecmwf.int). This seems to happen
>>> when the model tries to store the first time slice of my first
>>> diagnostic (ETAN).
>>> The code is "vanilla" checkpoint65k plus a few days.
>>> 
>>> Have you seen that before? Is there any chance of debugging this?
>>> Could it have anything to do with OpenMP?
>>> 
>>> Martin
>>> 
>>> 
>>> -- 
>>> Martin Losch
>>> Alfred Wegener Institute for Polar and Marine Research
>>> Postfach 120161, 27515 Bremerhaven, Germany;
>>> Tel./Fax: ++49(0471)4831-1872/1797
>>> 
>>> 
>>> _______________________________________________
>>> MITgcm-devel mailing list
>>> MITgcm-devel at mitgcm.org
>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>> 



