[MITgcm-devel] strange error messages from diagnostics pkg

Martin Losch Martin.Losch at awi.de
Wed Jun 3 08:47:37 EDT 2015


Hi Jean-Michel,

Thanks, I’ll give the new code a try. I haven’t had a chance to try the NOOPTFILE option yet.

M.

> On 02 Jun 2015, at 23:19, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
> 
> Hi Martin,
> 
> On Mon, May 18, 2015 at 05:10:13PM +0200, Martin Losch wrote:
>> thanks for checking; 
>> 
>> Not that I understand the details of this: Do you think it is possible that the master thread is slower than (at least) one of the other 3 threads (nSx=4, nSy=1) and that they call diagnostics_fill before the master thread has set diag_pkgStatus? I am using singleCPUio=True, and the Lustre filesystem of this computer tends to be slow sometimes, so I can imagine that the timing gets a little screwed up between processes/threads. What do you think? Would a barrier in diagnostics_switch_onoff.F help?
> 
> As I wrote earlier, I don't have the impression that there was something
> wrong in the code (no thread race issue).
> But anyway, I have added _BARRIER any time "diag_pkgStatus" is updated,
> so that it's a little bit cleaner.
> It might also fix your problem in the same way as the NOOPTFILE option would,
> since adding a BARRIER can prevent the compiler from wrongly optimizing the code.
> 
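For illustration only, the "master updates, then BARRIER" pattern discussed above can be sketched in Python. This is not MITgcm code and not the _BARRIER macro itself: the names run_step, shared and N_THREADS are hypothetical, threading.Barrier merely stands in for the idea of a barrier, and reading the status value 10 as "not ready yet" is an assumption based on the error message. The sketch shows only the synchronization idea, not the compiler-optimization effect mentioned above.

```python
import threading

N_THREADS = 4        # mirrors the 4 threads mentioned in the report
NOT_READY = 10       # pkgStatus value printed in the error message (meaning assumed)
READY_TO_FILL = 20   # ready2fillDiags, as quoted in the thread


def run_step(use_barrier=True):
    """Master updates the shared status; every thread then reads it."""
    shared = {"diag_pkg_status": NOT_READY}
    barrier = threading.Barrier(N_THREADS)
    seen = [None] * N_THREADS

    def worker(thread_id):
        if thread_id == 0:
            # Only the master thread modifies the shared status.
            shared["diag_pkg_status"] = READY_TO_FILL
        if use_barrier:
            # No thread proceeds until all threads, including the master,
            # have arrived, so the write above is complete before any read.
            barrier.wait()
        seen[thread_id] = shared["diag_pkg_status"]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(run_step(use_barrier=True))
```

With the barrier enabled, every thread reads 20. With use_barrier=False the result depends on thread scheduling, which is the kind of timing dependence asked about above.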
> Did you have a chance to try the NOOPTFILE option?
> 
> Cheers,
> Jean-Michel
> 
>> 
>> I’ll try the NOOPTFILE option. (BTW, this is not the only strange thing that happens on this computer.)
>> 
>> Martin
>> 
>>> On 18 May 2015, at 16:41, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
>>> 
>>> Hi Martin,
>>> 
>>> I have checked the code, and the switch+check of "diag_pkgStatus"
>>> does not look wrong (even without BARRIER):
>>> - only the Master-Thread updates/modifies diag_pkgStatus;
>>> - and even if all threads check its value, only the Master-Thread
>>>  does the print-error and stop (in diagnostics_status_error.F).
>>> 
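As a rough sketch of the check described in the two points above (every thread compares the status, only the master reports and stops), one could write the following in Python. It is not the content of diagnostics_status_error.F: check_status, DiagnosticsStatusError and the is_master flag are hypothetical names, and the message layout only imitates the error output quoted in this thread.

```python
class DiagnosticsStatusError(RuntimeError):
    """Raised by the master thread when the package status is not the expected one."""


def check_status(pkg_status, expect_status, diag_name, is_master,
                 caller="DIAGNOSTICS_FILL"):
    """Return True if the status matches; on a mismatch only the master raises."""
    if pkg_status == expect_status:
        return True
    if is_master:
        # Only the master thread prints the error and stops the run.
        raise DiagnosticsStatusError(
            f'{caller}: diagName="{diag_name}", '
            f"expectStatus= {expect_status}, pkgStatus= {pkg_status}"
        )
    # Other threads detect the mismatch but leave the reporting to the master.
    return False


if __name__ == "__main__":
    print(check_status(20, 20, "ETAN", is_master=True))
```

In this sketch a non-master thread that sees a stale status simply returns False, whereas the master raises; how the other threads behave in the real code is not shown here.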
>>> So, in your case, it's not clear why Master-Thread does not
>>> set diag_pkgStatus to 20 (=ready2fillDiags) when calling diagnostics_switch_onoff.F
>>> for the first time (myIter.EQ.nIter0).
>>> Maybe you could try to put diagnostics_switch_onoff.F in the NOOPTFILES list?
>>> 
>>> Cheers,
>>> Jean-Michel
>>> 
>>> On Mon, May 18, 2015 at 09:45:24AM -0400, Jean-Michel Campin wrote:
>>>> Hi Martin,
>>>> 
>>>> I guess it's related to multi-threading (OpenMP) and it looks like
>>>> we are missing a few "BARRIER" in the code. I am currently checking
>>>> this and will let you know later.
>>>> 
>>>> Cheers,
>>>> Jean-Michel
>>>> 
>>>> On Mon, May 18, 2015 at 10:14:29AM +0200, Martin Losch wrote:
>>>>> Hi there,
>>>>> 
>>>>> every now and then I am getting strange error messages from the
>>>>> diagnostics pkg:
>>>>> 
>>>>> (PID.TID 0121.0001) *** DIAGNOSTICS_STATUS_ERROR *** from: DIAGNOSTICS_FILL call
>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: diagName="ETAN ", expectStatus= 20, pkgStatus= 10
>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: <== called from the WRONG place, i.e.
>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: before DIAGNOSTICS_SWITCH_ONOFF call in FORWARD_STEP
>>>>> 
>>>>> this time from processes 11 and 121 of a 624-CPU run (in fact I
>>>>> use nPx = 156 and nSx = 4 with OpenMP). The errors tend not to be
>>>>> reproducible (i.e. I rerun the same setup without any changes and
>>>>> without any problems), but pop up every couple of runs in a
>>>>> "chain-job" on the Cray XC-30 (cca.ecmwf.int). This seems to happen
>>>>> when the model tries to store the first time slice of my first
>>>>> diagnostic (ETAN).
>>>>> The code is "vanilla" checkpoint65k plus a few days.
>>>>> 
>>>>> Have you seen that before? Is there any chance to debug this? Could
>>>>> it have anything to do with OpenMP?
>>>>> 
>>>>> Martin
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Martin Losch
>>>>> Alfred Wegener Institute for Polar and Marine Research
>>>>> Postfach 120161, 27515 Bremerhaven, Germany;
>>>>> Tel./Fax: ++49(0471)4831-1872/1797
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> MITgcm-devel mailing list
>>>>> MITgcm-devel at mitgcm.org
>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel



