[MITgcm-devel] strange error messages from diagnostics pkg
Martin Losch
Martin.Losch at awi.de
Wed Jun 3 08:47:37 EDT 2015
Hi Jean-Michel,
Thanks, I’ll give the new code a try. I haven’t had the time/chance to try the NOOPTFILES option yet.
M.
> On 02 Jun 2015, at 23:19, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
>
> Hi Martin,
>
> On Mon, May 18, 2015 at 05:10:13PM +0200, Martin Losch wrote:
>> thanks for checking;
>>
>> Not that I understand the details of this, but do you think it is possible that the master thread is slower than (at least) one of the other 3 threads (nSx=4, nSy=1) and that they call diagnostics_fill before the master thread has set diag_pkgStatus? I am running with singleCPUio=True and the Lustre filesystem of this computer tends to be slow sometimes, so I can imagine that the timing between processes/threads gets a little out of step. What do you think? Would a barrier in diagnostics_switch_onoff.F help?
>
> As I wrote earlier, I don't have the impression that there was something
> wrong in the code (no thread race issue).
> But anyway, I have added _BARRIER any time "diag_pkgStatus" is updated,
> so that it's a little bit cleaner.
> It might also fix your problem in the same way as the NOOPTFILES option would,
> since adding a BARRIER can prevent the compiler from wrongly optimizing the code.
>
> Did you have a chance to try the NOOPTFILES option?
>
> Cheers,
> Jean-Michel
>
>>
>> I’ll try the NOOPTFILES option. (BTW, this is not the only strange thing that happens on this computer.)
>>
>> Martin
>>
>>> On 18 May 2015, at 16:41, Jean-Michel Campin <jmc at ocean.mit.edu> wrote:
>>>
>>> Hi Martin,
>>>
>>> I have checked the code, and the switch+check of "diag_pkgStatus"
>>> does not look wrong (even without BARRIER):
>>> - only the Master-Thread updates/modifies diag_pkgStatus;
>>> - and even if all threads check its value, only the Master-Thread
>>> prints the error and stops (in diagnostics_status_error.F).
>>>
>>> So, in your case, it's not clear why Master-Thread does not
>>> set diag_pkgStatus to 20 (=ready2fillDiags) when calling diagnostics_switch_onoff.F
>>> for the first time (myIter.EQ.nIter0).
>>> Maybe you could try putting diagnostics_switch_onoff.F in the NOOPTFILES list?
>>>
>>> Cheers,
>>> Jean-Michel
>>>
>>> On Mon, May 18, 2015 at 09:45:24AM -0400, Jean-Michel Campin wrote:
>>>> Hi Martin,
>>>>
>>>> I guess it's related to multi-threading (OpenMP) and it looks like
>>>> we are missing a few "BARRIER" in the code. I am currently checking
>>>> this and we will let you know later.
>>>>
>>>> Cheers,
>>>> Jean-Michel
>>>>
>>>> On Mon, May 18, 2015 at 10:14:29AM +0200, Martin Losch wrote:
>>>>> Hi there,
>>>>>
>>>>> every now and then I am getting strange error messages from the
>>>>> diagnostics pkg:
>>>>>
>>>>> (PID.TID 0121.0001) *** DIAGNOSTICS_STATUS_ERROR *** from:
>>>>> DIAGNOSTICS_FILL call
>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: diagName="ETAN
>>>>> ", expectStatus= 20, pkgStatus= 10
>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: <== called from
>>>>> the WRONG place, i.e.
>>>>> (PID.TID 0121.0001) *** ERROR *** DIAGNOSTICS_FILL: before
>>>>> DIAGNOSTICS_SWITCH_ONOFF call in FORWARD_STEP
>>>>>
>>>>> this time from processes 11 and 121 of a 624 cpu run (in fact I
>>>>> use nPx = 156 and nSx = 4 with OpenMP). The errors tend not to be
>>>>> reproducible (i.e. I rerun the same setup without any changes and
>>>>> without any problems), but pop up every couple of runs in a
>>>>> "chain-job" on the Cray XC-30 (cca.ecmwf.int). This seems to happen
>>>>> when the model tries to store the first time slice of my first
>>>>> diagnostics (ETAN).
>>>>> The code is "vanilla" checkpoint65k plus a few days.
>>>>>
>>>>> Have you seen that before? Is there any chance of debugging this? Could
>>>>> it have anything to do with OpenMP?
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>> --
>>>>> Martin Losch
>>>>> Alfred Wegener Institute for Polar and Marine Research
>>>>> Postfach 120161, 27515 Bremerhaven, Germany;
>>>>> Tel./Fax: ++49(0471)4831-1872/1797
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> MITgcm-devel mailing list
>>>>> MITgcm-devel at mitgcm.org
>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
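The barrier idea discussed in this thread (only the master thread updates a shared status, and every thread checks it before filling a diagnostic) can be sketched as follows. This is an illustration only, written in Python with `threading.Barrier` rather than in MITgcm's Fortran/OpenMP; the function and variable names are hypothetical and are not taken from the MITgcm source. The values 10 and 20 are the pkgStatus and expectStatus values quoted in the error message. It does not show that a race actually existed in the diagnostics package, which Jean-Michel doubts above.

```python
import threading

READY_TO_FILL = 20  # "ready2fillDiags", the expected status in the thread
NOT_READY = 10      # the pkgStatus value reported in the error message


def run_step(n_threads=4, use_barrier=True):
    """Return the status value each thread sees just before it would fill."""
    state = {"pkg_status": NOT_READY}   # shared between all threads
    seen = [None] * n_threads
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        if tid == 0:
            # Only the master thread (tid 0 here) updates the shared status.
            state["pkg_status"] = READY_TO_FILL
        if use_barrier:
            # No thread proceeds until all threads, including the master,
            # have reached this point, so the update is already done.
            barrier.wait()
        # Every thread checks the status before filling a diagnostic.
        seen[tid] = state["pkg_status"]

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(run_step(n_threads=4, use_barrier=True))
```

With the barrier in place, every thread sees status 20. Without it (`use_barrier=False`) the result depends on thread scheduling, which is the kind of ordering problem Martin asks about.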