[MITgcm-devel] Re: [MITgcm-support] bug in exch2?
Chris Hill
cnh at mit.edu
Mon Jul 16 19:32:22 EDT 2007
OK, since they are zeroed beforehand, it does sound like an MDS problem. I will
try to look, but I am in a meeting in the UK at the moment.
Chris
Martin Losch wrote:
> Hi Chris,
> these are lines 1625-1627 of exch2_send_rl2.f:
> 1625 val1=sa1*array1(isl,jsl,ktl)
> 1626 & +sa2*array2(isl,jsl,ktl)
> 1627 e2Bufr1_RL(iBufr1)=val1
> Exactly what you thought. What's happening is that array1 and/or array2
> contain NaN, so that val1 is then NaN and the program crashes when
> e2Bufr1_RL(iBufr1) is assigned NaN.
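
A minimal sketch of the propagation described above, assuming IEEE arithmetic and a platform that does not trap on NaN operands (the SX8 in this thread evidently does trap). The variable names follow lines 1625-1627; the values and the free-form program around them are hypothetical. It shows that val1 becomes NaN even when the NaN operand is weighted by zero:

```fortran
program nan_propagation
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, &
                                           ieee_is_nan
  implicit none
  real(kind=8) :: array1, array2, sa1, sa2, val1

  ! Hypothetical values: array1 is a valid number, array2 stands in for
  ! an overlap point that holds a NaN.
  array1 = 1.5d0
  array2 = ieee_value(1.0d0, ieee_quiet_nan)

  ! The NaN term is multiplied by zero, but 0*NaN is still NaN.
  sa1 = 1.0d0
  sa2 = 0.0d0
  val1 = sa1*array1 + sa2*array2

  print '(a,l2)', 'val1 is NaN:', ieee_is_nan(val1)
  if (.not. ieee_is_nan(val1)) error stop 'expected val1 to be NaN'
end program nan_propagation
```

So a zero coefficient does not protect the buffer from a NaN sitting in an overlap point.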
>
> MDS: uVel and vVel are initialized to zero (including the overlaps)
> BEFORE read_pickup; in read_pickup (after read_rec_3d_rl) the overlaps
> suddenly have some nans on them; not the entire overlap, just a few
> points always for (i,j)=(12,-3),(15,0),(18,4), in each vertical layer. I
> checked that with the "hallo-debugger". I use s1800_17x51, so that
> sNx=17, sNy=51.
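
The kind of check described above can be sketched as a scan over the overlap points of a single 2-D slice. This is only an illustration: the overlap width oLx=oLy=4 is an assumption (the thread only gives sNx and sNy), the array fld is a stand-in for one level of uVel, and the NaNs are planted by hand at the three reported index pairs.

```fortran
program scan_overlaps
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, &
                                           ieee_is_nan
  implicit none
  integer, parameter :: sNx = 17, sNy = 51  ! tile size quoted in the thread
  integer, parameter :: oLx = 4, oLy = 4    ! overlap width: an assumption
  real(kind=8) :: fld(1-oLx:sNx+oLx, 1-oLy:sNy+oLy)
  real(kind=8) :: nan
  integer :: i, j, nBad
  logical :: interior

  ! Zero everything, then plant NaNs at the reported overlap points.
  nan = ieee_value(1.0d0, ieee_quiet_nan)
  fld = 0.0d0
  fld(12,-3) = nan
  fld(15, 0) = nan
  fld(18, 4) = nan

  ! Visit every point, skip the interior, report NaNs in the overlap.
  nBad = 0
  do j = 1-oLy, sNy+oLy
    do i = 1-oLx, sNx+oLx
      interior = (i >= 1 .and. i <= sNx .and. j >= 1 .and. j <= sNy)
      if (.not. interior) then
        if (ieee_is_nan(fld(i,j))) then
          nBad = nBad + 1
          print '(a,i4,a,i4,a)', 'NaN in overlap at (i,j)=(', i, ',', j, ')'
        end if
      end if
    end do
  end do

  if (nBad /= 3) error stop 'expected exactly 3 NaN overlap points'
end program scan_overlaps
```

Calling such a scan immediately before and after the read would narrow down which layer introduces the NaNs.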
>
> This tells me that somewhere underneath the read_rec_3d_rl layer, the
> overlaps are re-initialised to NaN, right? I would think that this is an
> MDS issue, isn't it?
>
> BTW, the first CPU (with STDOUT.0000) does not have nans in it, and I am
> using useSingleCPUio=.true. When I unset this flag, the run does not
> even get past reading the pickups in a reasonable time (1h).
>
> Martin
>
>
>
> On 13 Jul 2007, at 18:06, chris hill wrote:
>
>> Hi Martin/JM,
>>
>> In principle the
>>
>> arr(*) -> arr(1-olx:sNx+olx,.....)
>>
>> should be fine. It is not obvious to me that there is an mds problem.
>> It would be legitimate for the overlaps to have NaN, if they are
>> uninitialized.
>>
>> Can you send the fortran line at
>>
>> exch2_send_rl2 ELN=1627
>>
>> it could be a subtle side effect of the way I have done the permute
>> op in exch2 (c=alpha*a+beta*c) and the range of indices I use in exch
>> and exch2, which means that we need to initialize better. If this is a
>> problem there is a safe fix that could be added to exch2, but it
>> wouldn't vectorize too well.
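
The thread does not say what the safe fix would be. One conceivable form, shown here purely as an assumption, is to skip a term whenever its coefficient is zero, so that an uninitialized operand is never touched. The names alpha, beta, a, c follow the formula above; buf and the loop are hypothetical and are not taken from exch2.

```fortran
program guarded_combine
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, &
                                           ieee_is_nan
  implicit none
  integer, parameter :: n = 4
  real(kind=8) :: a(n), c(n), buf(n), alpha, beta
  integer :: i

  ! a holds valid data; c(2) stands in for an uninitialized overlap point.
  a = [1.0d0, 2.0d0, 3.0d0, 4.0d0]
  c = 0.0d0
  c(2) = ieee_value(1.0d0, ieee_quiet_nan)
  alpha = 1.0d0
  beta  = 0.0d0

  ! Guarded form of buf = alpha*a + beta*c: a term with a zero
  ! coefficient is skipped instead of being multiplied by zero.
  do i = 1, n
    buf(i) = 0.0d0
    if (alpha /= 0.0d0) buf(i) = buf(i) + alpha*a(i)
    if (beta  /= 0.0d0) buf(i) = buf(i) + beta*c(i)
  end do

  if (any(ieee_is_nan(buf))) error stop 'NaN leaked into the buffer'
  if (any(buf /= a)) error stop 'buffer should equal a'
  print '(a,4f6.1)', 'buf =', buf
end program guarded_combine
```

The conditionals inside the loop are the sort of construct that can get in the way of vectorization, which would be consistent with the caveat above.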
>>
>> Chris
>>
>> Martin Losch wrote:
>>> Hi Jean-Michel,
>>> thanks for answering. Just to clarify: This thread is called "bug in
>>> exch2", but as I found, the problem is not connected to any exchange
>>> routines but to reading the pickup via read_rec_3d_rl etc. (but I
>>> cannot rename the thread, )-:). I have only encountered the problem
>>> on our SX8 with the cs510; with cs32 I cannot reproduce it.
>>> I can make the problem go away by making the compiler initialize
>>> everything to zero. This solution works for me, but is this
>>> satisfactory for others? What are possible candidates for problems in
>>> the calling sequence
>>> read_pickup -> read_rec_3d_rl -> mdsreadfield ->
>>> mds_read_fields -> mds_seg4torl
>>> ? Is there anything I can try to track down the problem? These mdsio
>>> routines are terribly hard to understand, and I don't want to do
>>> anything in there, really, but I could help identify a potential
>>> problem.
>>> Martin
>>> PS. Do the exch2_* comments refer to the other thread: "Question:
>>> boundary exchange, hrcube condfiguration"?
>>> On 13 Jul 2007, at 17:19, Jean-Michel Campin wrote:
>>>> Hi Martin,
>>>>
>>>> On Thu, Jul 12, 2007 at 03:54:02PM +0200, Martin Losch wrote:
>>>>> Hi again,
>>>>> this was meant to go to the devel list in the first place, oh well.
>>>>>
>>>>> I have tried to find where the nans in the overlaps come from, and
>>>>> they appear when u and v are read from the pickup file with
>>>>> read_rec_3d_rl.
>>>>> read_rec_3d_rl calls mdsreadfield, which in turn calls mds_read_fields
>>>>> In the latter two routines, the array (uVel or vVel) to be read is
>>>>> declared as arr(*), but then mds_read_fields calls, eg. mds_seg4torl,
>>>>> where the array is declared as
>>>>> _RL arr(1-oLx:sNx+oLx,1-oLy:sNy+oLy,nNz,nSx,nSy)
>>>>> Could that be the source of the problem? I don't know. Should we do
>>>>> anything about this?
>>>>
>>>> I don't think this declaration is a problem.
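
The hand-off from an assumed-size dummy arr(*) to an explicit-shape dummy relies on standard Fortran sequence association. A reduced 2-D sketch, with hypothetical routine names (outer_read, inner_fill) and small tile sizes chosen only for illustration, suggests why the declarations by themselves leave the overlaps alone as long as the inner routine writes only interior points:

```fortran
module mds_sketch
  implicit none
  integer, parameter :: sNx = 3, sNy = 2, oLx = 1, oLy = 1
contains
  subroutine outer_read(arr)
    ! Assumed-size dummy, as in the outer mdsio layers.
    real(kind=8) :: arr(*)
    call inner_fill(arr)
  end subroutine outer_read

  subroutine inner_fill(arr)
    ! Explicit-shape dummy with overlaps, as in the inner layer.
    real(kind=8) :: arr(1-oLx:sNx+oLx, 1-oLy:sNy+oLy)
    integer :: i, j
    ! Only interior points are written; overlaps are left untouched.
    do j = 1, sNy
      do i = 1, sNx
        arr(i,j) = 1.0d0
      end do
    end do
  end subroutine inner_fill
end module mds_sketch

program check_declarations
  use mds_sketch
  implicit none
  real(kind=8) :: fld(1-oLx:sNx+oLx, 1-oLy:sNy+oLy)

  fld = 0.0d0                 ! zero everything, overlaps included
  call outer_read(fld)

  ! Interior is filled, and no other point has changed from zero.
  if (any(fld(1:sNx,1:sNy) /= 1.0d0)) error stop 'interior not filled'
  if (count(fld /= 0.0d0) /= sNx*sNy) error stop 'overlap was modified'
  print '(a)', 'overlaps unchanged after read'
end program check_declarations
```

If the real routines behaved like this sketch, something other than the declarations would have to be writing into the overlaps.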
>>>>
>>>>> As a quick fix I can just use the compiler flag that initialises
>>>>> everything to zero, but that would mask any other problems associated
>>>>> with wrong initializations.
>>>>>
>>>>> What's your opinion?
>>>>
>>>> This quick fix is worth trying.
>>>> I have another exch2_uv_cgrid ready to check in, which only
>>>> calls exch2_rl_cube (and not exch2_rl2_cube), and I have the
>>>> impression that it could work, with the chance of getting
>>>> an adjoint version more easily. I have also started an
>>>> exch2_uv_bgrid, but it looks more complicated than I thought.
>>>>
>>>> Jean-Michel
>>>>
>>>>>
>>>>> Martin
>>>>> On 11 Jul 2007, at 15:32, Martin Losch wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> there seems to be an initialisation issue in one/some of the exch2
>>>>>> routines. On our beloved (God, I hate this machine) SX8, the high-
>>>>>> res-cube stops with errors like this:
>>>>>>> * 253 Invalid operation PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>>>> Called from ini_fields ELN=1703(4007d9d18)
>>>>>>> Called from initialise_varia ELN=2018(4008154cc)
>>>>>>> **** 99 Execution suspended PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>> Called from exch2_rl2_cube ELN=1966(40048c594)
>>>>>>> Called from exch2_uv_3d_rl ELN=1603(4004a4a74)
>>>>>>> Called from exch_uv_3d_rl ELN=1826(4006f8478)
>>>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>>> so at the first uv exchange. A closer look confirms that array1 and
>>>>>> array2 in exch2_send_rl2 have nans on them in the overlap. This
>>>>>> problem goes away when I make the compiler initialise everything to
>>>>>> zero by default. (I also learned that apparently not the entire
>>>>>> overlap is exchanged in exch2_rl2_cube, but only olx-1,oly-1
>>>>>> points, at least for cubed exchanges; that would explain, why two
>>>>>> exchanges are necessary, wouldn't it?)
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> MITgcm-support mailing list
>>>>>> MITgcm-support at mitgcm.org
>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>>
>>>>> _______________________________________________
>>>>> MITgcm-devel mailing list
>>>>> MITgcm-devel at mitgcm.org
>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel