[MITgcm-devel] Re: [MITgcm-support] bug in exch2?
chris hill
cnh at mit.edu
Fri Jul 13 12:06:31 EDT 2007
Hi Martin/JM,
In principle the
arr(*) -> arr(1-olx:sNx+olx,.....)
should be fine. It is not obvious to me that there is an mds problem.
It would be legitimate for the overlaps to have NaN, if they are
uninitialized.
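To illustrate the point about the overlaps, here is a minimal sketch in Python (not MITgcm code). It assumes, without having checked the mdsio source, that the reader fills only the sNx*sNy interior of each tile. The tile sizes and the helper names flat_index and read_interior are hypothetical, and NaN stands in for whatever uninitialised memory happens to hold. The sketch shows that viewing the same flat storage as arr(*) or as arr(1-OLx:sNx+OLx,1-OLy:sNy+OLy) addresses the same elements, and that the overlaps keep their initial contents after the read.

```python
import math

# Hypothetical tile sizes, chosen small for illustration (not a real MITgcm setup).
sNx, sNy = 4, 3      # interior points per tile
OLx, OLy = 2, 2      # overlap (halo) width

nx = sNx + 2 * OLx   # full first dimension, as in arr(1-OLx:sNx+OLx, ...)
ny = sNy + 2 * OLy   # full second dimension


def flat_index(i, j):
    """Column-major offset of Fortran element arr(i, j) in the flat buffer."""
    assert 1 - OLx <= i <= sNx + OLx and 1 - OLy <= j <= sNy + OLy
    return (i - (1 - OLx)) + nx * (j - (1 - OLy))


def read_interior(buf, record):
    """Copy sNx*sNy values into the interior, leaving the overlaps untouched."""
    for j in range(1, sNy + 1):
        for i in range(1, sNx + 1):
            buf[flat_index(i, j)] = record[(i - 1) + sNx * (j - 1)]


# NaN stands in for the contents of an uninitialised array.
arr = [math.nan] * (nx * ny)
record = [float(k) for k in range(sNx * sNy)]
read_interior(arr, record)

interior_slots = {flat_index(i, j)
                  for j in range(1, sNy + 1) for i in range(1, sNx + 1)}
interior = [arr[flat_index(i, j)]
            for j in range(1, sNy + 1) for i in range(1, sNx + 1)]
overlap = [v for k, v in enumerate(arr) if k not in interior_slots]
```

Under that assumption the interior holds exactly the record and every overlap point is still NaN, which is harmless until something reads the overlaps before an exchange has filled them.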
Can you send the Fortran line at
exch2_send_rl2 ELN=1627?
It could be a subtle side effect of the way I have done the permute
op in exch2 (c=alpha*a+beta*c) and the range of indices I use in exch
and exch2, which means that we need to initialize better. If this is a
problem there is a safe fix that could be added to exch2, but it
wouldn't vectorize too well.
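As a rough illustration of how an update of the form c=alpha*a+beta*c can pick up uninitialised values, here is a small Python sketch. It is not the exch2 code: the function names are hypothetical, NaN stands in for uninitialised overlap values, and the guarded variant is only a guess at what a safer form might look like, not the fix referred to above. The point is that under IEEE arithmetic 0*NaN is NaN, so the old contents of c matter even when beta is zero.

```python
import math


def permute_accumulate(alpha, a, beta, c):
    """c <- alpha*a + beta*c for every element, with no special cases."""
    return [alpha * ai + beta * ci for ai, ci in zip(a, c)]


def permute_guarded(alpha, a, beta, c):
    """Same update, but c is never read when beta is zero (hypothetical variant)."""
    if beta == 0.0:
        return [alpha * ai for ai in a]
    return [alpha * ai + beta * ci for ai, ci in zip(a, c)]


a = [1.0, 2.0, 3.0]
c = [math.nan, 5.0, math.nan]   # NaN stands in for uninitialised overlap values

plain = permute_accumulate(1.0, a, 0.0, c)   # NaN survives where c was NaN
safe = permute_guarded(1.0, a, 0.0, c)       # old contents of c are ignored
```

On a machine that traps invalid operations, the first form would stop at the first NaN it touches, whereas the second never looks at c in the beta=0 case.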
Chris
Martin Losch wrote:
> Hi Jean-Michel,
>
> thanks for answering. Just to clarify: This thread is called "bug in
> exch2", but as I found, the problem is not connected to any exchange
> routines but to reading the pickup via read_rec_3d_rl etc. (but I cannot
> rename the thread, )-:). I have only encountered the problem on our SX8
> with the cs510, with cs32 I cannot reproduce it.
>
> I can make the problem go away by making the compiler initialize
> everything to zero. This solution works for me, but is this satisfactory
> for others? What are possible candidates for problems in the calling
> sequence
> read_pickup -> read_rec_3d_rl -> mdsreadfield -> calls mds_read_fields
> -> mds_seg4torl
> ? Is there anything I can try to track down the problem? These mdsio
> routines are terribly hard to understand, and I don't want to do
> anything in there, really, but I could help identify a potential problem.
>
> Martin
>
> PS. Do the exch2_* comments refer to the other thread: "Question:
> boundary exchange, hrcube condfiguration"?
>
>
> On 13 Jul 2007, at 17:19, Jean-Michel Campin wrote:
>
>> Hi Martin,
>>
>> On Thu, Jul 12, 2007 at 03:54:02PM +0200, Martin Losch wrote:
>>> Hi again,
>>> this was meant to go to the devel list in the first place, oh well.
>>>
>>> I have tried to find where the nans in the overlaps come from, and
>>> they appear when u and v are read from the pickup file with
>>> read_rec_3d_rl.
>>> read_rec_3d_rl calls mdsreadfield, which in turn calls mds_read_fields
>>> In the latter two routines, the array (uVel or vVel) to be read is
>>> declared as arr(*), but then mds_read_fields calls, eg. mds_seg4torl,
>>> where the array is declared as
>>> _RL arr(1-oLx:sNx+oLx,1-oLy:sNy+oLy,nNz,nSx,nSy)
>>> Could that be the source of the problem? I don't know. Should we do
>>> anything about this?
>>
>> I don't think this declaration is a problem.
>>
>>> As a quick fix I can just use the compiler flag that initialises
>>> everything to zero, but that would mask any other problems associated
>>> with wrong initializations.
>>>
>>> What's your opinion?
>>
>> This quick fix is worth trying.
>> I have another exch2_uv_cgrid ready to check in, which only
>> calls exch2_rl_cube (and not exch2_rl2_cube), and I have the
>> impression that it could work, with the chance of getting
>> an adjoint version more easily. I have also started an
>> exch2_uv_bgrid, but it looks more complicated than I thought.
>>
>> Jean-Michel
>>
>>>
>>> Martin
>>> On 11 Jul 2007, at 15:32, Martin Losch wrote:
>>>
>>>> Hi there,
>>>>
>>>> there seems to be an initialisation issue in one/some of the exch2
>>>> routines. On our beloved (God, I hate this machine) SX8, the high-
>>>> res-cube stops with errors like this:
>>>>> * 253 Invalid operation PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>> Called from ini_fields ELN=1703(4007d9d18)
>>>>> Called from initialise_varia ELN=2018(4008154cc)
>>>>> **** 99 Execution suspended PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>> Called from exch2_rl2_cube ELN=1966(40048c594)
>>>>> Called from exch2_uv_3d_rl ELN=1603(4004a4a74)
>>>>> Called from exch_uv_3d_rl ELN=1826(4006f8478)
>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>> so at the first uv exchange. A closer look confirms that array1 and
>>>> array2 in exch2_send_rl2 have nans on them in the overlap. This
>>>> problem goes away when I make the compiler initialise everything to
>>>> zero by default. (I also learned that apparently not the entire
>>>> overlap is exchanged in exch2_rl2_cube, but only olx-1,oly-1
>>>> points, at least for cubed exchanges; that would explain, why two
>>>> exchanges are necessary, wouldn't it?)
>>>>
>>>> Martin
>>>>
>>>>
>>>> _______________________________________________
>>>> MITgcm-support mailing list
>>>> MITgcm-support at mitgcm.org
>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>
>>> _______________________________________________
>>> MITgcm-devel mailing list
>>> MITgcm-devel at mitgcm.org
>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel