[MITgcm-devel] Re: [MITgcm-support] bug in exch2?

Martin Losch Martin.Losch at awi.de
Tue Jul 17 03:38:06 EDT 2007


Hi Chris,

I cannot find where the global array
_RL sharedLocalBuf(1-Olx:sNx+Olx,1-Oly:sNy+Oly,nSx,nSy)
(in MDSIO_SCPU.h)
is initialized. I now initialize it to zero at the beginning of
mdsio_read_field.F and then the problem goes away. Where should it be
initialized properly? Why is this a global array in a common block
anyway? As far as I can see, it is only used in mdsio_read_field.F and
mdsio_write_field.F, and, as the name implies, it is a "local" array
that does not hold any information that is used outside the respective
routines. Am I missing something?
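
A minimal, self-contained sketch of the suspected mechanism follows. It is
not the actual mdsio code: the routine name read_tile, the tiny tile sizes
and the whole-array copy are assumptions made only for illustration. The
idea is that if only the interior of the buffer is filled from the file but
the full extent (including overlaps) is copied into the field, whatever
happened to be in the buffer's overlaps ends up in the field. Uninitialized
memory is simulated here by filling the buffer with NaN.

```fortran
program halo_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, &
                                           ieee_is_nan
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical tile sizes (the real run uses sNx=17, sNy=51)
  integer, parameter :: sNx = 4, sNy = 3, OLx = 2, OLy = 2
  real(dp) :: buf(1-OLx:sNx+OLx, 1-OLy:sNy+OLy)    ! stands in for sharedLocalBuf
  real(dp) :: field(1-OLx:sNx+OLx, 1-OLy:sNy+OLy)  ! stands in for uVel/vVel
  real(dp) :: nan

  nan = ieee_value(nan, ieee_quiet_nan)

  ! Case 1: buffer is never zeroed before use
  buf = nan          ! simulate garbage left in the common block
  field = 0.0_dp     ! field is zeroed before the read, as in the model
  call read_tile(buf, field, .false.)
  write(*,*) 'NaNs in field, buffer not zeroed: ', count(ieee_is_nan(field))

  ! Case 2: buffer is zeroed at the start of the read routine
  buf = nan
  field = 0.0_dp
  call read_tile(buf, field, .true.)
  write(*,*) 'NaNs in field, buffer zeroed:     ', count(ieee_is_nan(field))

contains

  subroutine read_tile(buf, field, zero_first)
    real(dp), intent(inout) :: buf(1-OLx:, 1-OLy:)
    real(dp), intent(inout) :: field(1-OLx:, 1-OLy:)
    logical,  intent(in)    :: zero_first
    if (zero_first) buf = 0.0_dp
    ! Only the interior points are filled with "data from the file"
    buf(1:sNx, 1:sNy) = 1.0_dp
    ! Assumed behaviour: the full extent, overlaps included, is copied
    field = buf
  end subroutine read_tile

end program halo_demo
```

In this sketch the first case leaves NaNs in the overlaps of the field and
the second leaves none. In the real run only a few overlap points are
affected, which would be consistent with the buffer holding mostly harmless
leftover values and only occasionally a bit pattern that is a NaN.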

Martin

On 17 Jul 2007, at 09:14, Martin Losch wrote:

> A short update. I now have a run with useSingleCPUio = .false. and
> it got past the pickup stage, so the problem must be related to the
> single-CPU I/O code, right? Maybe we can track it down here as soon
> as we get the debugger to work with *.F files (it only looks for
> *.F90 files; we are close to renaming all files ...).
>
> Martin
>
> On 17 Jul 2007, at 01:32, Chris Hill wrote:
>
>> OK, since they are zeroed before, it does sound like an MDS problem.
>> Will try and look, but I am in a meeting in the UK at the moment.
>>
>> Chris
>> Martin Losch wrote:
>>> Hi Chris,
>>> these are lines 1625-1627 of exch2_send_rl2.f:
>>> 1625       val1=sa1*array1(isl,jsl,ktl)
>>> 1626    &       +sa2*array2(isl,jsl,ktl)
>>> 1627         e2Bufr1_RL(iBufr1)=val1
>>> Exactly what you thought. What's happening is that array1 and/or
>>> array2 contain a NaN, so that val1 is then NaN and the program
>>> crashes when e2Bufr1_RL(iBufr1) is assigned NaN.
>>> MDS: uVel and vVel are initialized to zero (including the
>>> overlaps) BEFORE read_pickup; in read_pickup (after
>>> read_rec_3d_rl) the overlaps suddenly have some NaNs in them; not
>>> the entire overlap, just a few points, always at (i,j)=(12,-3),
>>> (15,0),(18,4), in each vertical layer. I checked that with the
>>> "hallo-debugger". I use s1800_17x51, so that sNx=17, sNy=51.
>>> This tells me that somewhere underneath the read_rec_3d_rl
>>> layer, the overlaps are re-initialised to NaN, right? I would
>>> think that this is an MDS issue, isn't it?
>>> BTW, the first CPU (with STDOUT.0000) does not have nans in it,  
>>> and I am using useSingleCPUio=.true. When I unset this flag, the  
>>> run does not even get past reading the pickups in a reasonable  
>>> time (1h).
>>> Martin
>>> On 13 Jul 2007, at 18:06, chris hill wrote:
>>>> Hi Martin/JM,
>>>>
>>>>  In principle the
>>>>
>>>>  arr(*) -> arr(1-olx:sNx+olx,.....)
>>>>
>>>>  should be fine. It is not obvious to me that there is an mds  
>>>> problem.
>>>>  It would be legitimate for the overlaps to have NaN, if they  
>>>> are uninitialized.
>>>>
>>>>  Can you send the fortran line at
>>>>
>>>>   exch2_send_rl2 ELN=1627
>>>>
>>>>  it could be a subtle side effect of the way I have done the
>>>> permute op in exch2 (c=alpha*a+beta*c) and the range of indices
>>>> I use in exch and exch2, which means that we need to initialize
>>>> better. If this is a problem there is a safe fix that could be
>>>> added to exch2, but it wouldn't vectorize too well.
>>>>
>>>> Chris
>>>>
>>>> Martin Losch wrote:
>>>>> Hi Jean-Michel,
>>>>> thanks for answering. Just to clarify: This thread is called
>>>>> "bug in exch2", but as I found, the problem is not connected to
>>>>> any exchange routines but to reading the pickup via
>>>>> read_rec_3d_rl etc. (but I cannot rename the thread )-:). I
>>>>> have only encountered the problem on our SX8 with the cs510;
>>>>> with cs32 I cannot reproduce it.
>>>>> I can make the problem go away by making the compiler
>>>>> initialize everything to zero. This solution works for me, but
>>>>> is this satisfactory for others? What are possible candidates
>>>>> for problems in the calling sequence
>>>>> read_pickup -> read_rec_3d_rl -> mdsreadfield ->
>>>>> mds_read_fields -> mds_seg4torl
>>>>> ? Is there anything I can try to track down the problem? These  
>>>>> mdsio routines are terribly hard to understand, and I don't  
>>>>> want to do anything in there, really, but I could help identify  
>>>>> a potential problem.
>>>>> Martin
>>>>> PS. Do the exch2_* comments refer to the other thread:  
>>>>> "Question: boundary exchange, hrcube condfiguration"?
>>>>> On 13 Jul 2007, at 17:19, Jean-Michel Campin wrote:
>>>>>> Hi Martin,
>>>>>>
>>>>>> On Thu, Jul 12, 2007 at 03:54:02PM +0200, Martin Losch wrote:
>>>>>>> Hi again,
>>>>>>> this was meant to go to the devel list in the first place,
>>>>>>> oh well.
>>>>>>>
>>>>>>> I have tried to find where the nans in the overlaps come  
>>>>>>> from, and
>>>>>>> they appear when u and v are read from the pickup file with
>>>>>>> read_rec_3d_rl.
>>>>>>> read_rec_3d_rl calls mdsreadfield, which in turn calls  
>>>>>>> mds_read_fields
>>>>>>> In the latter two routines, the array (uVel or vVel) to be  
>>>>>>> read is
>>>>>>> declared as arr(*), but then mds_read_fields calls, eg.  
>>>>>>> mds_seg4torl,
>>>>>>> where the array is declared as
>>>>>>>       _RL arr(1-oLx:sNx+oLx,1-oLy:sNy+oLy,nNz,nSx,nSy)
>>>>>>> Could that be the source of the problem? I don't know. Should
>>>>>>> we do anything about this?
>>>>>>
>>>>>> I don't think this declaration is a problem.
>>>>>>
>>>>>>> As a quick fix I can just use the compiler flag that
>>>>>>> initialises everything to zero, but that would mask any other
>>>>>>> problems associated with wrong initializations.
>>>>>>>
>>>>>>> What's your opinion?
>>>>>>
>>>>>> This quick fix is worth trying.
>>>>>> I have another exch2_uv_cgrid ready to check in, which only
>>>>>> calls exch2_rl_cube (and not exch2_rl2_cube), and I have the
>>>>>> impression that it could work, with the chance of getting
>>>>>> an adjoint version more easily. I have also started an
>>>>>> exch2_uv_bgrid, but it looks more complicated than I thought.
>>>>>>
>>>>>> Jean-Michel
>>>>>>
>>>>>>>
>>>>>>> Martin
>>>>>>> On 11 Jul 2007, at 15:32, Martin Losch wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> there seems to be an initialisation issue in one/some of the
>>>>>>>> exch2 routines. On our beloved (God, I hate this machine) SX8,
>>>>>>>> the high-res-cube stops with errors like this:
>>>>>>>>>   * 253 Invalid operation PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>>>>                 Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>>>                 Called from ini_fields ELN=1703(4007d9d18)
>>>>>>>>>                 Called from initialise_varia ELN=2018(4008154cc)
>>>>>>>>> ****  99 Execution suspended PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>>>>                 Called from exch2_rl2_cube ELN=1966(40048c594)
>>>>>>>>>                 Called from exch2_uv_3d_rl ELN=1603(4004a4a74)
>>>>>>>>>                 Called from exch_uv_3d_rl ELN=1826(4006f8478)
>>>>>>>>>                 Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>> so at the first uv exchange. A closer look confirms that
>>>>>>>> array1 and array2 in exch2_send_rl2 have NaNs in the overlap.
>>>>>>>> This problem goes away when I make the compiler initialise
>>>>>>>> everything to zero by default. (I also learned that apparently
>>>>>>>> not the entire overlap is exchanged in exch2_rl2_cube, but only
>>>>>>>> olx-1,oly-1 points, at least for cubed exchanges; that would
>>>>>>>> explain why two exchanges are necessary, wouldn't it?)
>>>>>>>>
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> MITgcm-support mailing list
>>>>>>>> MITgcm-support at mitgcm.org
>>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> MITgcm-devel mailing list
>>>>>>> MITgcm-devel at mitgcm.org
>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel



