[MITgcm-devel] Re: [MITgcm-support] bug in exch2?
Martin Losch
Martin.Losch at awi.de
Mon Jul 16 07:59:01 EDT 2007
Hi Chris,
these are lines 1625-1627 of exch2_send_rl2.f:
1625 val1=sa1*array1(isl,jsl,ktl)
1626 & +sa2*array2(isl,jsl,ktl)
1627 e2Bufr1_RL(iBufr1)=val1
Exactly what you thought. What's happening is that array1 and/or
array2 contain NaNs, so that val1 is then NaN and the program crashes
when e2Bufr1_RL(iBufr1) is assigned NaN.
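To illustrate why a NaN in either array is enough, here is a minimal
stand-alone sketch (not the exch2 source). The coefficient values and
the program name are assumptions chosen for illustration. The point is
only that under IEEE 754 arithmetic 0*NaN is still NaN, so a zero
coefficient does not mask an uninitialised overlap value.

```fortran
program nan_propagation
  ! Hypothetical demo, not MITgcm code: shows that a linear combination
  ! of the form val1 = sa1*a1 + sa2*a2 is NaN as soon as one operand is
  ! NaN, even when its coefficient is zero.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, &
                                           ieee_quiet_nan
  implicit none
  real(8) :: sa1, sa2, a1, a2, val1

  sa1 = 1.0d0
  sa2 = 0.0d0                               ! assumed: "switches off" a2
  a1  = 2.5d0
  a2  = ieee_value(0.0d0, ieee_quiet_nan)   ! stands in for an
                                            ! uninitialised overlap point
  val1 = sa1*a1 + sa2*a2

  print *, 'val1 is NaN: ', ieee_is_nan(val1)
end program nan_propagation
```

This needs a compiler that provides the Fortran 2003 ieee_arithmetic
module.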
MDS: uVel and vVel are initialized to zero (including the overlaps)
BEFORE read_pickup; in read_pickup (after read_rec_3d_rl) the
overlaps suddenly contain some NaNs; not the entire overlap,
just a few points, always at (i,j)=(12,-3),(15,0),(18,4), in each
vertical layer. I checked that with the "halo-debugger". I use
s1800_17x51, so that sNx=17, sNy=51.
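For anyone wanting to repeat this kind of check, a minimal stand-alone
sketch of an overlap scan follows. It is not the debugging code used
here: the overlap width (OLx=OLy=4), the number of layers (Nr=2) and
the program name are assumptions for illustration, and the NaNs are
planted by hand at the three points reported above.

```fortran
program halo_nan_scan
  ! Hypothetical demo, not MITgcm code: report NaNs found in the
  ! overlap (halo) region of one tile, skipping the interior.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, &
                                           ieee_quiet_nan
  implicit none
  ! sNx, sNy as in this thread; OLx, OLy, Nr are assumed values.
  integer, parameter :: sNx = 17, sNy = 51, OLx = 4, OLy = 4, Nr = 2
  real(8) :: fld(1-OLx:sNx+OLx, 1-OLy:sNy+OLy, Nr)
  integer :: i, j, k, nBad

  ! Start from a clean field, then plant NaNs at the reported points.
  fld = 0.0d0
  fld(12, -3, :) = ieee_value(0.0d0, ieee_quiet_nan)
  fld(15,  0, :) = ieee_value(0.0d0, ieee_quiet_nan)
  fld(18,  4, :) = ieee_value(0.0d0, ieee_quiet_nan)

  nBad = 0
  do k = 1, Nr
    do j = 1-OLy, sNy+OLy
      do i = 1-OLx, sNx+OLx
        ! Interior points are not part of the overlap: skip them.
        if (i >= 1 .and. i <= sNx .and. j >= 1 .and. j <= sNy) cycle
        if (ieee_is_nan(fld(i,j,k))) then
          nBad = nBad + 1
          print '(a,3i5)', 'NaN in overlap at (i,j,k) =', i, j, k
        end if
      end do
    end do
  end do
  print '(a,i0)', 'NaN count in overlaps: ', nBad
end program halo_nan_scan
```

In the model itself the same loop would be placed right after the read
and run per tile; where exactly to call it is left open here.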
This tells me that somewhere underneath the read_rec_3d_rl layer,
the overlaps are re-initialised to NaN, right? I would think that
this is an MDS issue, isn't it?
BTW, the first CPU (with STDOUT.0000) does not have NaNs in it, and I
am using useSingleCPUio=.true. When I unset this flag, the run does
not even get past reading the pickups in a reasonable time (1h).
Martin
On 13 Jul 2007, at 18:06, chris hill wrote:
> Hi Martin/JM,
>
> In principle the
>
> arr(*) -> arr(1-olx:sNx+olx,.....)
>
> should be fine. It is not obvious to me that there is an mds problem.
> It would be legitimate for the overlaps to have NaN, if they are
> uninitialized.
>
> Can you send the fortran line at
>
> exch2_send_rl2 ELN=1627
>
> It could be a subtle side effect of the way I have done the
> permute op in exch2 (c=alpha*a+beta*c) and the range of indices I
> use in exch and exch2, which means that we need to initialize
> better. If this is a problem there is a safe fix that could be
> added to exch2, but it wouldn't vectorize too well.
>
> Chris
>
> Martin Losch wrote:
>> Hi Jean-Michel,
>> thanks for answering. Just to clarify: This thread is called "bug
>> in exch2", but as I found, the problem is not connected to any
>> exchange routines but to reading the pickup via read_rec_3d_rl
>> etc. (but I cannot rename the thread, )-:). I have only encountered
>> the problem on our SX8 with the cs510; with cs32 I cannot
>> reproduce it.
>> I can make the problem go away by making the compiler initialize
>> everything to zero. This solution works for me, but is this
>> satisfactory for others? What are possible candidates for problems
>> in the calling sequence
>> read_pickup -> read_rec_3d_rl -> mdsreadfield ->
>> mds_read_fields -> mds_seg4torl
>> ? Is there anything I can try to track down the problem? These
>> mdsio routines are terribly hard to understand, and I don't want
>> to do anything in there, really, but I could help identify a
>> potential problem.
>> Martin
>> PS. Do the exch2_* comments refer to the other thread: "Question:
>> boundary exchange, hrcube condfiguration"?
>> On 13 Jul 2007, at 17:19, Jean-Michel Campin wrote:
>>> Hi Martin,
>>>
>>> On Thu, Jul 12, 2007 at 03:54:02PM +0200, Martin Losch wrote:
>>>> Hi again,
>>>> this was meant to go to the devel list in the first place, oh
>>>> well.
>>>>
>>>> I have tried to find where the NaNs in the overlaps come from, and
>>>> they appear when u and v are read from the pickup file with
>>>> read_rec_3d_rl.
>>>> read_rec_3d_rl calls mdsreadfield, which in turn calls
>>>> mds_read_fields.
>>>> In the latter two routines, the array (uVel or vVel) to be read is
>>>> declared as arr(*), but then mds_read_fields calls, e.g.,
>>>> mds_seg4torl,
>>>> where the array is declared as
>>>> _RL arr(1-oLx:sNx+oLx,1-oLy:sNy+oLy,nNz,nSx,nSy)
>>>> Could that be the source of the problem? I don't know. Should we do
>>>> anything about this?
>>>
>>> I don't think this declaration is a problem.
>>>
>>>> As a quick fix I can just use the compiler flag that initialises
>>>> everything to zero, but that would mask any other problems
>>>> associated
>>>> with wrong initializations.
>>>>
>>>> What's your opinion?
>>>
>>> This quick fix is worth trying.
>>> I have another exch2_uv_cgrid ready to check in, which only
>>> calls exch2_rl_cube (and not exch2_rl2_cube), and I have the
>>> impression that it could work, with the chance of getting
>>> an adjoint version more easily. I have also started an
>>> exch2_uv_bgrid, but it looks more complicated than I thought.
>>>
>>> Jean-Michel
>>>
>>>>
>>>> Martin
>>>> On 11 Jul 2007, at 15:32, Martin Losch wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> there seems to be an initialisation issue in one/some of the exch2
>>>>> routines. On our beloved (God, I hate this machine) SX8, the high-
>>>>> res-cube stops with errors like this:
>>>>>> * 253 Invalid operation PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>>> Called from ini_fields ELN=1703(4007d9d18)
>>>>>> Called from initialise_varia ELN=2018(4008154cc)
>>>>>> **** 99 Execution suspended PROG=exch2_send_rl2 ELN=1627
>>>>>> (40049c9d8)
>>>>>> Called from exch2_rl2_cube ELN=1966(40048c594)
>>>>>> Called from exch2_uv_3d_rl ELN=1603(4004a4a74)
>>>>>> Called from exch_uv_3d_rl ELN=1826(4006f8478)
>>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>> so at the first uv exchange. A closer look confirms that array1
>>>>> and
>>>>> array2 in exch2_send_rl2 have NaNs in their overlaps. This
>>>>> problem goes away when I make the compiler initialise
>>>>> everything to
>>>>> zero by default. (I also learned that apparently not the entire
>>>>> overlap is exchanged in exch2_rl2_cube, but only olx-1,oly-1
>>>>> points, at least for cubed exchanges; that would explain why two
>>>>> exchanges are necessary, wouldn't it?)
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> MITgcm-support mailing list
>>>>> MITgcm-support at mitgcm.org
>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>
>>>> _______________________________________________
>>>> MITgcm-devel mailing list
>>>> MITgcm-devel at mitgcm.org
>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel