[MITgcm-devel] Re: [MITgcm-support] bug in exch2?

Martin Losch Martin.Losch at awi.de
Mon Jul 16 07:59:01 EDT 2007


Hi Chris,
these are lines 1625-1627 of exch2_send_rl2.f:
1625       val1=sa1*array1(isl,jsl,ktl)
1626    &       +sa2*array2(isl,jsl,ktl)
1627         e2Bufr1_RL(iBufr1)=val1
Exactly what you thought. What's happening is that array1 and/or
array2 contain NaNs, so that val1 is then NaN and the program crashes
when e2Bufr1_RL(iBufr1) is assigned NaN.
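
To illustrate the mechanism (a sketch in Python rather than the Fortran of exch2_send_rl2; the names combine and combine_guarded are made up for illustration): a NaN in either input survives the linear combination even when its coefficient is zero, because 0*NaN is NaN in IEEE arithmetic. Python propagates the NaN quietly, whereas the SX8 run traps on the invalid operation. The guarded variant is only one conceivable way to skip zero-weighted terms; it is an assumption, not something taken from exch2.

```python
import math

def combine(sa1, a1, sa2, a2):
    # Same shape as lines 1625-1626: val1 = sa1*array1 + sa2*array2
    return sa1 * a1 + sa2 * a2

def combine_guarded(sa1, a1, sa2, a2):
    # Hypothetical guard (an assumption, not exch2 code):
    # skip any term whose coefficient is exactly zero.
    val = 0.0
    if sa1 != 0.0:
        val += sa1 * a1
    if sa2 != 0.0:
        val += sa2 * a2
    return val

# An uninitialized overlap point is modelled here as NaN in the second array.
plain = combine(1.0, 2.5, 0.0, math.nan)
guarded = combine_guarded(1.0, 2.5, 0.0, math.nan)
print(math.isnan(plain), guarded)
```

The per-element branches in the guarded version are the kind of construct that tends to vectorize poorly, which may matter on a vector machine.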

MDS: uVel and vVel are initialized to zero (including the overlaps)
BEFORE read_pickup; in read_pickup (after read_rec_3d_rl) the
overlaps suddenly contain some NaNs; not the entire overlap,
just a few points, always at (i,j)=(12,-3),(15,0),(18,4), in each
vertical layer. I checked that with the "hallo-debugger". I use
s1800_17x51, so that sNx=17, sNy=51.
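
A check of this kind can be sketched as follows (Python, for illustration only; this is not the debugging code actually used). The helper names halo_nans, make_tile and set_point are hypothetical, and the overlap widths OLX=OLY=4 are an assumption: the message gives only sNx=17 and sNy=51, and a point at j=-3 implies an overlap of at least 4 in y. The scan reports every NaN outside the interior 1..sNx, 1..sNy using the Fortran-style (i,j) indices.

```python
import math

def make_tile(snx, sny, olx, oly, fill=0.0):
    # One horizontal level of a tile, including overlaps, stored as rows of j.
    return [[fill] * (snx + 2 * olx) for _ in range(sny + 2 * oly)]

def set_point(field, i, j, olx, oly, value):
    # Map Fortran-style index (i, j), starting at (1-olx, 1-oly), to list offsets.
    field[j - 1 + oly][i - 1 + olx] = value

def halo_nans(field, snx, sny, olx, oly):
    # Return the (i, j) indices of all NaNs that lie in the overlap region.
    hits = []
    for jj, row in enumerate(field):
        j = jj + 1 - oly
        for ii, value in enumerate(row):
            i = ii + 1 - olx
            interior = 1 <= i <= snx and 1 <= j <= sny
            if not interior and math.isnan(value):
                hits.append((i, j))
    return hits

SNX, SNY = 17, 51   # from the message
OLX, OLY = 4, 4     # assumed overlap widths
tile = make_tile(SNX, SNY, OLX, OLY)
for i, j in [(12, -3), (15, 0), (18, 4)]:
    set_point(tile, i, j, OLX, OLY, math.nan)

found = halo_nans(tile, SNX, SNY, OLX, OLY)
print(found)
```

All three reported points fall in the overlap: the first two have j below 1, and the third has i=18, one past sNx=17.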

This tells me that somewhere underneath the read_rec_3d_rl layer
the overlaps are re-initialised to NaN, right? I would think that
this is an MDS issue, isn't it?

BTW, the first CPU (with STDOUT.0000) does not have NaNs in it, and I
am using useSingleCPUio=.true. When I unset this flag, the run does
not even get past reading the pickups in a reasonable time (1h).

Martin



On 13 Jul 2007, at 18:06, chris hill wrote:

> Hi Martin/JM,
>
>  In principle the
>
>  arr(*) -> arr(1-olx:sNx+olx,.....)
>
>  should be fine. It is not obvious to me that there is an mds problem.
>  It would be legitimate for the overlaps to have NaN, if they are  
> uninitialized.
>
>  Can you send the fortran line at
>
>   exch2_send_rl2 ELN=1627
>
>  it could be a subtle side effect of the way I have done the
> permute op in exch2 (c=alpha*a+beta*c) and the range of indices I  
> use in exch and exch2, which means that we need to initialize  
> better. If this is a problem there is a safe fix that could be  
> added to exch2, but it wouldn't vectorize too well.
>
> Chris
>
> Martin Losch wrote:
>> Hi Jean-Michel,
>> thanks for answering. Just to clarify: This thread is called "bug  
>> in exch2", but as I found, the problem is not connected to any  
>> exchange routines but to reading the pickup via read_rec_3d_rl
>> etc (but I cannot rename the thread, )-:). I have only encountered  
>> the problem on our SX8 with the cs510, with cs32 I cannot  
>> reproduce it.
>> I can make the problem go away by making the compiler initialize  
>> everything to zero. This solution works for me, but is this
>> satisfactory for others? What are possible candidates for problems
>> in the calling sequence
>> read_pickup -> read_rec_3d_rl -> mdsreadfield -> calls  
>> mds_read_fields -> mds_seg4torl
>> ? Is there anything I can try to track down the problem? These  
>> mdsio routines are terribly hard to understand, and I don't want  
>> to do anything in there, really, but I could help identify a  
>> potential problem.
>> Martin
>> PS. Do the exch2_* comments refer to the other thread: "Question:  
>> boundary exchange, hrcube condfiguration"?
>> On 13 Jul 2007, at 17:19, Jean-Michel Campin wrote:
>>> Hi Martin,
>>>
>>> On Thu, Jul 12, 2007 at 03:54:02PM +0200, Martin Losch wrote:
>>>> Hi again,
>>>> this was meant to go to the devel list in the first place, oh
>>>> well.
>>>>
>>>> I have tried to find where the NaNs in the overlaps come from, and
>>>> they appear when u and v are read from the pickup file with
>>>> read_rec_3d_rl.
>>>> read_rec_3d_rl calls mdsreadfield, which in turn calls  
>>>> mds_read_fields
>>>> In the latter two routines, the array (uVel or vVel) to be read is
>>>> declared as arr(*), but then mds_read_fields calls, eg.  
>>>> mds_seg4torl,
>>>> where the array is declared as
>>>>       _RL arr(1-oLx:sNx+oLx,1-oLy:sNy+oLy,nNz,nSx,nSy)
>>>> Could that be the source of the problem? I don't know. Should we do
>>>> anything about this?
>>>
>>> I don't think this declaration is a problem.
>>>
>>>> As a quick fix I can just use the compiler flag that initialises
>>>> everything to zero, but that would mask any other problems
>>>> associated with wrong initializations.
>>>>
>>>> What's your opinion?
>>>
>>> This quick fix is worth trying.
>>> I have another exch2_uv_cgrid ready to check in, which only
>>> calls exch2_rl_cube (and not exch2_rl2_cube), and I have the
>>> impression that it could work, with the chance of getting
>>> an adjoint version more easily. I have also started an
>>> exch2_uv_bgrid, but it looks more complicated than I thought.
>>>
>>> Jean-Michel
>>>
>>>>
>>>> Martin
>>>> On 11 Jul 2007, at 15:32, Martin Losch wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> there seems to be an initialisation issue in one/some of the exch2
>>>>> routines. On our beloved (God, I hate this machine) SX8, the high-
>>>>> res-cube stops with errors like this:
>>>>>>   * 253 Invalid operation PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>                 Called from read_pickup ELN=2022(40083c6a8)
>>>>>>                 Called from ini_fields ELN=1703(4007d9d18)
>>>>>>                 Called from initialise_varia ELN=2018(4008154cc)
>>>>>> ****  99 Execution suspended PROG=exch2_send_rl2 ELN=1627 
>>>>>> (40049c9d8)
>>>>>>                 Called from exch2_rl2_cube ELN=1966(40048c594)
>>>>>>                 Called from exch2_uv_3d_rl ELN=1603(4004a4a74)
>>>>>>                 Called from exch_uv_3d_rl ELN=1826(4006f8478)
>>>>>>                 Called from read_pickup ELN=2022(40083c6a8)
>>>>> so at the first uv exchange. A closer look confirms that array1 and
>>>>> array2 in exch2_send_rl2 have NaNs in the overlap. This
>>>>> problem goes away when I make the compiler initialise everything to
>>>>> zero by default. (I also learned that apparently not the entire
>>>>> overlap is exchanged in exch2_rl2_cube, but only olx-1,oly-1
>>>>> points, at least for cubed exchanges; that would explain why two
>>>>> exchanges are necessary, wouldn't it?)
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> MITgcm-support mailing list
>>>>> MITgcm-support at mitgcm.org
>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>
>>>> _______________________________________________
>>>> MITgcm-devel mailing list
>>>> MITgcm-devel at mitgcm.org
>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel



