[MITgcm-devel] strange error messages from diagnostics pkg

Thu Jun 2 04:55:56 EDT 2016

Hi Jean-Michel,

this thread is a year old, but I now have a machine (ollie) that shows the same problem and is much easier to handle, so I can pursue this a little.

The “new” code with the extra barriers does not work either, and I have to compile diagnostics_switch_onoff.F at least with -O1, which is probably not a problem. It’s still interesting though …
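
For readers following the thread, the pattern under discussion (only the master thread updates diag_pkgStatus, all threads check it, with a barrier in between) can be sketched roughly as below. This is a Python analogy, not MITgcm code: the names DiagState, run_step and NOT_READY are made up for illustration, the meaning of status 10 is assumed, and the sketch says nothing about the compiler-optimization issue that the -O1 workaround points to.

```python
import threading

READY_TO_FILL = 20  # value quoted in the thread (=ready2fillDiags)
NOT_READY = 10      # pkgStatus seen in the error message; meaning assumed
N_THREADS = 4       # mirrors nSx = 4 from the original report


class DiagState:
    """Hypothetical stand-in for the shared diag_pkgStatus variable."""

    def __init__(self):
        self.pkg_status = NOT_READY


def run_step():
    """Master thread sets the status; every thread reads it after a barrier."""
    state = DiagState()
    barrier = threading.Barrier(N_THREADS)
    seen = [None] * N_THREADS

    def worker(tid):
        if tid == 0:
            # Only the master thread updates the shared status.
            state.pkg_status = READY_TO_FILL
        # No thread passes this point before the master has done its update.
        barrier.wait()
        # Each thread now checks the status, as a fill call would.
        seen[tid] = state.pkg_status

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(run_step())
```

Without the barrier.wait() call, a non-master thread could in principle read the status before the master has set it, which is the scenario asked about further down in the thread.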

Martin

> On 03 Jun 2015, at 14:47, Martin Losch <Martin.Losch at awi.de> wrote:
> 
> Hi Jean-Michel,
> 
> thanks, I’ll give the new code a try. I haven’t had the time/chance to try the NOOPTFILE option yet.
> 
> M.
> 
>> On 02 Jun 2015, at 23:19, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
>> 
>> Hi Martin,
>> 
>> On Mon, May 18, 2015 at 05:10:13PM +0200, Martin Losch wrote:
>>> thanks for checking; 
>>> 
>>> Not that I understand the details of this: do you think it is possible that the master thread is slower than (at least) one of the other 3 threads (nSx=4, nSy=1) and that they call diagnostics_fill before the master thread has set diag_pkgStatus? I am using singleCPUio=True, and the Lustre filesystem of this computer tends to be slow sometimes, so I can imagine that the timing is a little screwed up between processes/threads. What do you think? Would a barrier in diagnostics_switch_onoff.F help?
>> 
>> As I wrote earlier, I don't have the impression that there was something
>> wrong in the code (no thread race issue).
>> But anyway, I have added _BARRIER any time "diag_pkgStatus" is updated,
>> so that it's a little bit cleaner.
>> It might also fix your problem in the same way as the NOOPTFILE option would,
>> since adding BARRIER can prevent the compiler from wrongly optimizing the code.
>> 
>> Did you have a chance to try the NOOPTFILE option ?
>> 
>> Cheers,
>> Jean-Michel
>> 
>>> 
>>> I’ll try the NOOPTFILE option. (BTW, this is not the only strange thing that happens on this computer).
>>> 
>>> Martin
>>> 
>>>> On 18 May 2015, at 16:41, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
>>>> 
>>>> Hi Martin,
>>>> 
>>>> I have checked the code, and the switch+check of "diag_pkgStatus"
>>>> does not look wrong (even without BARRIER):
>>>> - only the Master-Thread updates/modifies diag_pkgStatus.
>>>> - and even if all threads check its value, only the Master-Thread
>>>> does the print-error and stop (in diagnostics_status_error.F);
>>>> 
>>>> So, in your case, it's not clear why the Master-Thread does not
>>>> set diag_pkgStatus to 20 (=ready2fillDiags) when calling diagnostics_switch_onoff.F
>>>> for the first time (myIter.EQ.nIter0).
>>>> Maybe you could try to put diagnostics_switch_onoff.F in the NOOPTFILES list ?
>>>> 
>>>> Cheers,
>>>> Jean-Michel
>>>> 
>>>> On Mon, May 18, 2015 at 09:45:24AM -0400, Jean-Michel Campin wrote:
>>>>> Hi Martin,
>>>>> 
>>>>> I guess it's related to multi-threading (OpenMP) and it looks like
>>>>> we are missing a few "BARRIER" in the code. I am currently checking
>>>>> this and will let you know later.
>>>>> 
>>>>> Cheers,
>>>>> Jean-Michel
>>>>> 
>>>>> On Mon, May 18, 2015 at 10:14:29AM +0200, Martin Losch wrote:
>>>>>> Hi there,
>>>>>> 
>>>>>> every now and then I am getting strange error messages from the
>>>>>> diagnostics pkg:
>>>>>> 
>>>>>> (PID.TID 0121.0001) *** DIAGNOSTICS_STATUS_ERROR *** from:
>>>>>> DIAGNOSTICS_FILL call
>>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: diagName="ETAN
>>>>>> ", expectStatus= 20, pkgStatus= 10
>>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: <== called from
>>>>>> the WRONG place, i.e.
>>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: before
>>>>>> DIAGNOSTICS_SWITCH_ONOFF call in FORWARD_STEP
>>>>>> 
>>>>>> this time from processes 11 and 121 of a 624 cpu run (in fact I
>>>>>> use nPx = 156 and nSx = 4 with OpenMP). They tend not to be
>>>>>> reproducible (i.e. I rerun the same setup without any changes and
>>>>>> without any problems), but pop up every couple of runs in a
>>>>>> "chain-job" on the Cray XC-30 (cca.ecmwf.int). This seems to happen
>>>>>> when the model tries to store the first time slice of my first
>>>>>> diagnostics (ETAN).
>>>>>> The code is "vanilla" checkpoint65k plus a few days.
>>>>>> 
>>>>>> Have you seen that before? Are there any chances to debug this? Can
>>>>>> it have to do anything with OpenMP?
>>>>>> 
>>>>>> Martin
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Martin Losch
>>>>>> Alfred Wegener Institute for Polar and Marine Research
>>>>>> Postfach 120161, 27515 Bremerhaven, Germany;
>>>>>> Tel./Fax: ++49(0471)4831-1872/1797
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> MITgcm-devel mailing list
>>>>>> MITgcm-devel at mitgcm.org
>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel