[MITgcm-devel] Re: [MITgcm-support] bug in exch2?

Martin Losch Martin.Losch at awi.de
Tue Jul 17 03:45:43 EDT 2007


Hi Chris,

sorry about the previous email, which I did not mean to send. In
fact, initializing sharedLocalBuf in mdsio_read_field.F did NOT help.
I was so confident that it would that I wrote the email before the
run was finished and accidentally pressed "send".
I still have NaNs ... oh well.
Martin
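
One way to narrow this down further might be to scan the overlaps for NaNs after each stage of the read and print the offending (i,j) points. The following is only a minimal sketch of such a check, written in Python rather than Fortran so it can be run stand-alone; the function name, the list-of-rows layout and the toy sizes are assumptions for illustration and are not MITgcm code.

```python
import math

def nan_points_in_overlap(field, snx, sny, olx, oly):
    """Return the (i, j) model indices of NaNs found in the overlap region.

    `field` is a list of rows indexed [jj][ii] covering the tile plus its
    overlaps: (sny + 2*oly) rows of (snx + 2*olx) values. Returned indices
    use the 1-olx..snx+olx / 1-oly..sny+oly convention from the thread.
    """
    hits = []
    for jj, row in enumerate(field):
        j = jj + 1 - oly                      # storage row -> model j
        for ii, value in enumerate(row):
            i = ii + 1 - olx                  # storage column -> model i
            interior = 1 <= i <= snx and 1 <= j <= sny
            if not interior and math.isnan(value):
                hits.append((i, j))
    return hits

# Toy tile (hypothetical sizes): 4x3 interior with a 2-point overlap, all zero
snx, sny, olx, oly = 4, 3, 2, 2
tile = [[0.0] * (snx + 2 * olx) for _ in range(sny + 2 * oly)]

# Plant one NaN in the overlap at model index (i, j) = (2, -1)
i_bad, j_bad = 2, -1
tile[j_bad - (1 - oly)][i_bad - (1 - olx)] = math.nan

# Plant one NaN in the interior as well; the scan should ignore it
tile[1 - (1 - oly)][1 - (1 - olx)] = math.nan

print(nan_points_in_overlap(tile, snx, sny, olx, oly))  # [(2, -1)]
```

Calling such a check once after the raw read and once after the copy into the model array would show which step first introduces the NaNs.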
On 17 Jul 2007, at 09:38, Martin Losch wrote:

> Hi Chris,
>
> I cannot find where the global array
> _RL sharedLocalBuf(1-Olx:sNx+Olx,1-Oly:sNy+Oly,nSx,nSy)
> (in MDSIO_SCPU.h)
> is initialized. I now initialize it to zero at the beginning of  
> mdsio_read_field.F and then the problem goes away. Where should it  
> be initialized properly? Why is this a global array in a common  
> block anyway? It's only used in mdsio_read_field.F and  
> mdsio_write_field.F as far as I can see and, as the name implies,
> it's a "local" array that does not hold any information that is used
> outside the respective routines. Am I missing something?
>
> Martin
>
> On 17 Jul 2007, at 09:14, Martin Losch wrote:
>
>> A short update. I now have a run with useSingleCPUio = .false. and
>> it got past the pickup stage, so the problem must be related
>> to the single-CPU I/O code, right? Maybe we can track it down here,
>> as soon as we get the debugger to work with *.F files (it only  
>> looks for *.F90 files, we are close to renaming all files ...).
>>
>> Martin
>>
>> On 17 Jul 2007, at 01:32, Chris Hill wrote:
>>
>>> OK since they are zero'd before it does sound like an MDS prob.  
>>> Will try and look, but I am in a meeting in the UK at the moment.
>>>
>>> Chris
>>> Martin Losch wrote:
>>>> Hi Chris,
>>>> these are lines 1625-1627 of exch2_send_rl2.f:
>>>> 1625       val1=sa1*array1(isl,jsl,ktl)
>>>> 1626    &       +sa2*array2(isl,jsl,ktl)
>>>> 1627         e2Bufr1_RL(iBufr1)=val1
>>>>  Exactly what you thought. What's happening is that array1 and/or
>>>> array2 contain NaNs, so that val1 is then NaN and the program
>>>> crashes when e2Bufr1_RL(iBufr1) is assigned NaN.
>>>> MDS: uVel and vVel are initialized to zero (including the  
>>>> overlaps) BEFORE read_pickup; in read_pickup (after  
>>>> read_rec_3d_rl) the overlaps suddenly have some nans on them;  
>>>> not the entire overlap, just a few points always for (i,j)= 
>>>> (12,-3),(15,0),(18,4), in each vertical layer. I checked that  
>>>> with the "hallo-debugger". I use s1800_17x51, so that sNx=17,  
>>>> sNy=51.
>>>> This tells me, that somewhere underneath the read_rec_3d_rl  
>>>> layer, the overlaps are re-initialised to NaN, right? I would  
>>>> think that this is an MDS issue, isn't it?
>>>> BTW, the first CPU (with STDOUT.0000) does not have nans in it,  
>>>> and I am using useSingleCPUio=.true. When I unset this flag, the  
>>>> run does not even get past reading the pickups in a reasonable  
>>>> time (1h).
>>>> Martin
>>>> On 13 Jul 2007, at 18:06, chris hill wrote:
>>>>> Hi Martin/JM,
>>>>>
>>>>>  In principle the
>>>>>
>>>>>  arr(*) -> arr(1-olx:sNx+olx,.....)
>>>>>
>>>>>  should be fine. It is not obvious to me that there is an mds  
>>>>> problem.
>>>>>  It would be legitimate for the overlaps to have NaN, if they  
>>>>> are uninitialized.
>>>>>
>>>>>  Can you send the fortran line at
>>>>>
>>>>>   exch2_send_rl2 ELN=1627
>>>>>
>>>>>  it could be a subtle side effect of the way I have done the
>>>>> permute op in exch2 (c=alpha*a+beta*c) and the range of indices
>>>>> I use in exch and exch2, which means that we need to initialize  
>>>>> better. If this is a problem there is a safe fix that could be  
>>>>> added to exch2, but it wouldn't vectorize too well.
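
The side effect described here can be reproduced in isolation: in IEEE arithmetic 0*NaN is NaN, so a weighted sum picks up a NaN from an uninitialised operand even when that operand's weight is zero. The sketch below is in Python, not Fortran, and is not the exch2 code; `permute_safe` is only one possible form a "safe fix" could take (an assumption, not necessarily what is meant above), and its per-term branching hints at why such a fix might vectorize poorly.

```python
import math

def permute(alpha, a, beta, c):
    """Element-wise alpha*a + beta*c, the combination quoted in the thread."""
    return [alpha * x + beta * y for x, y in zip(a, c)]

def permute_safe(alpha, a, beta, c):
    """Same combination, but skip any term whose weight is exactly zero."""
    out = []
    for x, y in zip(a, c):
        total = 0.0
        if alpha != 0.0:      # branch per term: safe, but harder to vectorize
            total += alpha * x
        if beta != 0.0:
            total += beta * y
        out.append(total)
    return out

a = [1.0, 2.0, 3.0]
c = [0.0, math.nan, 0.0]      # one point that was never initialised

plain = permute(1.0, a, 0.0, c)
safe = permute_safe(1.0, a, 0.0, c)

print(plain)  # [1.0, nan, 3.0]  (0.0 * nan is nan)
print(safe)   # [1.0, 2.0, 3.0]
```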
>>>>>
>>>>> Chris
>>>>>
>>>>> Martin Losch wrote:
>>>>>> Hi Jean-Michel,
>>>>>> thanks for answering. Just to clarify: This thread is called  
>>>>>> "bug in exch2", but as I found, the problem is not connected  
>>>>>> to any exchange routines but to reading the pickup via
>>>>>> read_rec_3d_rl etc. (but I cannot rename the thread )-:). I
>>>>>> have only encountered the problem on our SX8 with the cs510,  
>>>>>> with cs32 I cannot reproduce it.
>>>>>> I can make the problem go away by making the compiler  
>>>>>> initialize everything to zero. This solution works for me, but
>>>>>> is this satisfactory for others? What are possible candidates
>>>>>> for problems in the calling sequence
>>>>>> read_pickup -> read_rec_3d_rl -> mdsreadfield ->
>>>>>> mds_read_fields -> mds_seg4torl
>>>>>> ? Is there anything I can try to track down the problem? These  
>>>>>> mdsio routines are terribly hard to understand, and I don't  
>>>>>> want to do anything in there, really, but I could help  
>>>>>> identify a potential problem.
>>>>>> Martin
>>>>>> PS. Do the exch2_* comments refer to the other thread:  
>>>>>> "Question: boundary exchange, hrcube condfiguration"?
>>>>>> On 13 Jul 2007, at 17:19, Jean-Michel Campin wrote:
>>>>>>> Hi Martin,
>>>>>>>
>>>>>>> On Thu, Jul 12, 2007 at 03:54:02PM +0200, Martin Losch wrote:
>>>>>>>> Hi again,
>>>>>>>> this was meant to go to the devel list in the first place,
>>>>>>>> oh well.
>>>>>>>>
>>>>>>>> I have tried to find where the nans in the overlaps come  
>>>>>>>> from, and
>>>>>>>> they appear when u and v are read from the pickup file with
>>>>>>>> read_rec_3d_rl.
>>>>>>>> read_rec_3d_rl calls mdsreadfield, which in turn calls  
>>>>>>>> mds_read_fields
>>>>>>>> In the latter two routines, the array (uVel or vVel) to be  
>>>>>>>> read is
>>>>>>>> declared as arr(*), but then mds_read_fields calls, eg.  
>>>>>>>> mds_seg4torl,
>>>>>>>> where the array is declared as
>>>>>>>>       _RL arr(1-oLx:sNx+oLx,1-oLy:sNy+oLy,nNz,nSx,nSy)
>>>>>>>> Could that be the source of the problem? I don't know.
>>>>>>>> Should we do
>>>>>>>> anything about this?
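
On the arr(*) question: an assumed-size dummy and an explicit-shape dummy refer to the same contiguous, column-major storage; the declaration only fixes how (i,j,k,bi,bj) is mapped to a position in it. The following is a minimal sketch of that mapping, in Python rather than Fortran. `flat_offset` is a hypothetical helper, sNx=17 and sNy=51 are taken from the thread, and the overlap widths, number of levels and tile counts are assumed values chosen only for illustration.

```python
def flat_offset(index, lower, upper):
    """0-based offset of `index` in a Fortran-ordered (column-major) array.

    `lower` and `upper` hold the inclusive bounds of each dimension; the
    first dimension varies fastest, as in Fortran.
    """
    offset, stride = 0, 1
    for idx, lo, hi in zip(index, lower, upper):
        if not lo <= idx <= hi:
            raise IndexError(f"index {idx} outside bounds {lo}:{hi}")
        offset += (idx - lo) * stride
        stride *= hi - lo + 1
    return offset

# sNx, sNy from the thread; oLx, oLy, nz, nSx, nSy are assumptions
sNx, sNy, oLx, oLy, nz, nSx, nSy = 17, 51, 4, 4, 2, 1, 1
lower = (1 - oLx, 1 - oLy, 1, 1, 1)
upper = (sNx + oLx, sNy + oLy, nz, nSx, nSy)

print(flat_offset(lower, lower, upper))            # 0, first overlap corner
print(flat_offset((1, 1, 1, 1, 1), lower, upper))  # 104, first interior point
```

The mapping is only consistent if caller and callee agree on sNx, sNy, oLx and oLy, so a mismatch in those sizes would be a more likely culprit than the arr(*) declaration itself.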
>>>>>>>
>>>>>>> I don't think this declaration is a problem.
>>>>>>>
>>>>>>>> As a quick fix I can just use the compiler flag that
>>>>>>>> initialises
>>>>>>>> everything to zero, but that would mask any other problems
>>>>>>>> associated
>>>>>>>> with wrong initializations.
>>>>>>>>
>>>>>>>> What's your opinion?
>>>>>>>
>>>>>>> This quick fix is worth trying.
>>>>>>> I have ready to check in another exch2_uv_cgrid which only
>>>>>>> calls exch2_rl_cube (and not exch2_rl2_cube), and I have the
>>>>>>> impression that it could work, with the chance of getting
>>>>>>> an adjoint version more easily. I have also started an
>>>>>>> exch2_uv_bgrid, but it looks more complicated than I thought.
>>>>>>>
>>>>>>> Jean-Michel
>>>>>>>
>>>>>>>>
>>>>>>>> Martin
>>>>>>>> On 11 Jul 2007, at 15:32, Martin Losch wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> there seems to be an initialisation issue in one/some of  
>>>>>>>>> the exch2
>>>>>>>>> routines. On our beloved (God, I hate this machine) SX8,  
>>>>>>>>> the high-
>>>>>>>>> res-cube stops with errors like this:
>>>>>>>>>>   * 253 Invalid operation PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>>>>>                 Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>>>>                 Called from ini_fields ELN=1703(4007d9d18)
>>>>>>>>>>                 Called from initialise_varia ELN=2018(4008154cc)
>>>>>>>>>> ****  99 Execution suspended PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>>>>>                 Called from exch2_rl2_cube ELN=1966(40048c594)
>>>>>>>>>>                 Called from exch2_uv_3d_rl ELN=1603(4004a4a74)
>>>>>>>>>>                 Called from exch_uv_3d_rl ELN=1826(4006f8478)
>>>>>>>>>>                 Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>>> so at the first uv exchange. A closer look confirms that  
>>>>>>>>> array1 and
>>>>>>>>> array2 in exch2_send_rl2 have nans on them in the overlap.  
>>>>>>>>> This
>>>>>>>>> problem goes away when I make the compiler initialise
>>>>>>>>> everything to
>>>>>>>>> zero by default. (I also learned that apparently not the  
>>>>>>>>> entire
>>>>>>>>> overlap is exchanged in exch2_rl2_cube, but only olx-1,oly-1
>>>>>>>>> points, at least for cubed exchanges; that would explain,  
>>>>>>>>> why two
>>>>>>>>> exchanges are necessary, wouldn't it?)
>>>>>>>>>
>>>>>>>>> Martin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> MITgcm-support mailing list
>>>>>>>>> MITgcm-support at mitgcm.org
>>>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> MITgcm-devel mailing list
>>>>>>>> MITgcm-devel at mitgcm.org
>>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel



