[MITgcm-devel] Re: [MITgcm-support] bug in exch2?

Martin Losch Martin.Losch at awi.de
Tue Jul 17 04:41:40 EDT 2007


After my previous premature babbling I have now found a fix to the  
problem:
I initialize "local" (which is the name of the argument through which
sharedLocalBuf is passed) and "temp" in scatter_2d and the NaNs in
the overlaps go away.
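
The failure mode and the fix can be sketched in a few lines. This is a minimal illustration in Python, not the model's Fortran: `scatter_tile`, `make_buffer`, the buffer layout and the tile sizes are all hypothetical stand-ins, and a NaN-filled list plays the role of uninitialized memory. It only shows that a reused buffer with overlap points keeps stale NaNs unless it is initialized before the interior is filled.

```python
import math

def make_buffer(nx, ny, ol, fill):
    """Buffer of (ny + 2*ol) rows by (nx + 2*ol) columns, halo width ol."""
    return [[fill for _ in range(nx + 2 * ol)] for _ in range(ny + 2 * ol)]

def scatter_tile(global_field, local, nx, ny, ol, initialize):
    """Copy an nx-by-ny interior from global_field into local.

    With initialize=True the whole buffer (halo included) is zeroed first,
    which mirrors the fix described above. With initialize=False the halo
    keeps whatever was in the buffer before the call.
    """
    if initialize:
        for row in local:
            for i in range(len(row)):
                row[i] = 0.0
    for j in range(ny):
        for i in range(nx):
            local[j + ol][i + ol] = global_field[j][i]
    return local

def count_nans(buf):
    """Number of NaN entries anywhere in the buffer."""
    return sum(1 for row in buf for v in row if math.isnan(v))

nx, ny, ol = 4, 3, 2  # hypothetical tile size and overlap width
field = [[float(10 * j + i) for i in range(nx)] for j in range(ny)]

# Stand-in for uninitialized memory: a buffer full of NaN.
stale = scatter_tile(field, make_buffer(nx, ny, ol, math.nan),
                     nx, ny, ol, initialize=False)
clean = scatter_tile(field, make_buffer(nx, ny, ol, math.nan),
                     nx, ny, ol, initialize=True)

nans_without_init = count_nans(stale)
nans_with_init = count_nans(clean)
```

In this sketch the interior is identical in both cases; only the overlap differs, which is where the NaNs showed up in the pickup read.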

If this is a general solution, I'll happily check this in (but I'll
wait for your approval, as I do not have enough of an overview of
this issue).

Martin

On 17 Jul 2007, at 09:45, Martin Losch wrote:

> Hi Chris,
>
> sorry about the previous email, which I did not mean to send. In  
> fact, initializing sharedLocalBuf in mdsio_read_field.F did NOT  
> help. I was so confident that it would that I wrote the email  
> before the run was finished and accidentally pressed "send".
> I still have NaNs ... oh well.
> Martin
> On 17 Jul 2007, at 09:38, Martin Losch wrote:
>
>> Hi Chris,
>>
>> I cannot find where the global array
>> _RL sharedLocalBuf(1-Olx:sNx+Olx,1-Oly:sNy+Oly,nSx,nSy)
>> (in MDSIO_SCPU.h)
>> is initialized. I now initialize it to zero at the beginning of  
>> mdsio_read_field.F and then the problem goes away. Where should it  
>> be initialized properly? Why is this a global array in a common  
>> block anyway? It's only used in mdsio_read_field.F and  
>> mdsio_write_field.F as far as I can see and, as the name implies,
>> it's a "local" array that does not hold any information that is used
>> outside the respective routines. Am I missing something?
>>
>> Martin
>>
>> On 17 Jul 2007, at 09:14, Martin Losch wrote:
>>
>>> A short update. I now have a run with useSingleCPUio = .false.
>>> and it got past the pickup stage, so the problem must be
>>> related to the single-CPU I/O code, right? Maybe we can track it
>>> down here, as soon as we get the debugger to work with *.F files  
>>> (it only looks for *.F90 files, we are close to renaming all  
>>> files ...).
>>>
>>> Martin
>>>
>>> On 17 Jul 2007, at 01:32, Chris Hill wrote:
>>>
>>>> OK, since they are zeroed before, it does sound like an MDS problem.
>>>> Will try and look, but I am in a meeting in the UK at the moment.
>>>>
>>>> Chris
>>>> Martin Losch wrote:
>>>>> Hi Chris,
>>>>> these are lines 1625-1627 of exch2_send_rl2.f:
>>>>> 1625       val1=sa1*array1(isl,jsl,ktl)
>>>>> 1626    &       +sa2*array2(isl,jsl,ktl)
>>>>> 1627         e2Bufr1_RL(iBufr1)=val1
>>>>>  Exactly what you thought. What's happening is that array1 and/or
>>>>> array2 are NaN, so that val1 is then NaN and the program
>>>>> crashes when e2Bufr1_RL(iBufr1) is assigned NaN.
>>>>> MDS: uVel and vVel are initialized to zero (including the  
>>>>> overlaps) BEFORE read_pickup; in read_pickup (after  
>>>>> read_rec_3d_rl) the overlaps suddenly have some NaNs in them;
>>>>> not the entire overlap, just a few points, always at
>>>>> (i,j) = (12,-3), (15,0), (18,4), in each vertical layer. I checked that
>>>>> with the "hallo-debugger". I use s1800_17x51, so that sNx=17,  
>>>>> sNy=51.
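
A check of this kind can be sketched as follows. This is an illustrative Python helper, not code from the model: `find_halo_nans` and the list-of-lists buffer are hypothetical, and the overlap width of 4 is an assumption (the thread only gives sNx=17 and sNy=51). It maps storage positions back to the Fortran-style indices 1-OLx..sNx+OLx and 1-OLy..sNy+OLy and reports NaNs that lie outside the interior.

```python
import math

def find_halo_nans(buf, snx, sny, olx, oly):
    """Return Fortran-style (i, j) indices of NaNs that lie in the overlap.

    buf[jj][ii] stores the point i = ii + 1 - olx, j = jj + 1 - oly, so that
    i runs over 1-olx..snx+olx and j over 1-oly..sny+oly.
    """
    hits = []
    for jj, row in enumerate(buf):
        for ii, value in enumerate(row):
            i, j = ii + 1 - olx, jj + 1 - oly
            interior = 1 <= i <= snx and 1 <= j <= sny
            if not interior and math.isnan(value):
                hits.append((i, j))
    return hits

snx, sny = 17, 51   # tile size quoted in the message
olx, oly = 4, 4     # assumed overlap width, not stated in the thread

# Zero-filled tile with NaNs planted at the three reported overlap points.
buf = [[0.0] * (snx + 2 * olx) for _ in range(sny + 2 * oly)]
for i, j in [(12, -3), (15, 0), (18, 4)]:
    buf[j - 1 + oly][i - 1 + olx] = math.nan

halo_nans = find_halo_nans(buf, snx, sny, olx, oly)
```

All three reported points fall outside 1..17 in i or 1..51 in j, so they are overlap points rather than interior points.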
>>>>> This tells me that somewhere underneath the read_rec_3d_rl
>>>>> layer, the overlaps are re-initialised to NaN, right? I would
>>>>> think that this is an MDS issue, isn't it?
>>>>> BTW, the first CPU (with STDOUT.0000) does not have NaNs in it,
>>>>> and I am using useSingleCPUio=.true. When I unset this flag,  
>>>>> the run does not even get past reading the pickups in a  
>>>>> reasonable time (1h).
>>>>> Martin
>>>>> On 13 Jul 2007, at 18:06, chris hill wrote:
>>>>>> Hi Martin/JM,
>>>>>>
>>>>>>  In principle the
>>>>>>
>>>>>>  arr(*) -> arr(1-olx:sNx+olx,.....)
>>>>>>
>>>>>>  should be fine. It is not obvious to me that there is an mds  
>>>>>> problem.
>>>>>>  It would be legitimate for the overlaps to have NaN, if they  
>>>>>> are uninitialized.
>>>>>>
>>>>>>  Can you send the fortran line at
>>>>>>
>>>>>>   exch2_send_rl2 ELN=1627
>>>>>>
>>>>>>  it could be a subtle side effect of the way I have done the
>>>>>> permute op in exch2 (c=alpha*a+beta*c) and the range of  
>>>>>> indices I use in exch and exch2, which means that we need to  
>>>>>> initialize better. If this is a problem there is a safe fix  
>>>>>> that could be added to exch2, but it wouldn't vectorize too well.
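
Why a linear combination of this form is sensitive to uninitialized operands can be shown with a small Python sketch. The function names are hypothetical, and the guarded variant is only one possible kind of safe fix, assuming that one of the two coefficients can be exactly zero; it is not necessarily the fix meant above. Under IEEE arithmetic 0.0 * NaN is NaN, so an unused operand still poisons the result unless it is skipped, and the per-element branches are what would hurt vectorization.

```python
import math

def combine(sa1, a1, sa2, a2):
    """Plain linear combination, as in val1 = sa1*array1 + sa2*array2."""
    return sa1 * a1 + sa2 * a2

def combine_guarded(sa1, a1, sa2, a2):
    """Skip any term whose coefficient is exactly zero.

    The extra branches keep a NaN in an unused operand out of the result,
    at the cost of a conditional per element.
    """
    total = 0.0
    if sa1 != 0.0:
        total += sa1 * a1
    if sa2 != 0.0:
        total += sa2 * a2
    return total

# Second operand is unused (coefficient 0.0) but holds a stale NaN.
plain = combine(1.0, 2.5, 0.0, math.nan)
guarded = combine_guarded(1.0, 2.5, 0.0, math.nan)
```

Initializing the arrays properly, as discussed elsewhere in this thread, avoids the problem without touching the inner loop.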
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> Martin Losch wrote:
>>>>>>> Hi Jean-Michel,
>>>>>>> thanks for answering. Just to clarify: This thread is called
>>>>>>> "bug in exch2", but as I found, the problem is not connected
>>>>>>> to any exchange routines but to reading the pickup via
>>>>>>> read_rec_3d_rl etc. (but I cannot rename the thread, )-:). I
>>>>>>> have only encountered the problem on our SX8 with the cs510;
>>>>>>> with cs32 I cannot reproduce it.
>>>>>>> I can make the problem go away by making the compiler
>>>>>>> initialize everything to zero. This solution works for me,
>>>>>>> but is this satisfactory for others? What are possible
>>>>>>> candidates for problems in the calling sequence
>>>>>>> read_pickup -> read_rec_3d_rl -> mdsreadfield ->
>>>>>>> mds_read_fields -> mds_seg4torl
>>>>>>> ? Is there anything I can try to track down the problem?
>>>>>>> These mdsio routines are terribly hard to understand, and I  
>>>>>>> don't want to do anything in there, really, but I could help  
>>>>>>> identify a potential problem.
>>>>>>> Martin
>>>>>>> PS. Do the exch2_* comments refer to the other thread:  
>>>>>>> "Question: boundary exchange, hrcube condfiguration"?
>>>>>>> On 13 Jul 2007, at 17:19, Jean-Michel Campin wrote:
>>>>>>>> Hi Martin,
>>>>>>>>
>>>>>>>> On Thu, Jul 12, 2007 at 03:54:02PM +0200, Martin Losch wrote:
>>>>>>>>> Hi again,
>>>>>>>>> this was meant to go to the devel list in the first place,
>>>>>>>>> oh well.
>>>>>>>>>
>>>>>>>>> I have tried to find where the NaNs in the overlaps come from, and
>>>>>>>>> they appear when u and v are read from the pickup file with
>>>>>>>>> read_rec_3d_rl.
>>>>>>>>> read_rec_3d_rl calls mdsreadfield, which in turn calls mds_read_fields.
>>>>>>>>> In the latter two routines, the array (uVel or vVel) to be read is
>>>>>>>>> declared as arr(*), but then mds_read_fields calls, e.g., mds_seg4torl,
>>>>>>>>> where the array is declared as
>>>>>>>>>       _RL arr(1-oLx:sNx+oLx,1-oLy:sNy+oLy,nNz,nSx,nSy)
>>>>>>>>> Could that be the source of the problem? I don't know. Should we do
>>>>>>>>> anything about this?
>>>>>>>>
>>>>>>>> I don't think this declaration is a problem.
>>>>>>>>
>>>>>>>>> As a quick fix I can just use the compiler flag that initialises
>>>>>>>>> everything to zero, but that would mask any other problems associated
>>>>>>>>> with wrong initializations.
>>>>>>>>>
>>>>>>>>> What's your opinion?
>>>>>>>>
>>>>>>>> This quick fix is worth trying.
>>>>>>>> I have another exch2_uv_cgrid ready to check in, which only
>>>>>>>> calls exch2_rl_cube (and not exch2_rl2_cube), and I have the
>>>>>>>> impression that it could work, with the chance of getting
>>>>>>>> an adjoint version more easily. I have also started an
>>>>>>>> exch2_uv_bgrid, but it looks more complicated than I thought.
>>>>>>>>
>>>>>>>> Jean-Michel
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Martin
>>>>>>>>> On 11 Jul 2007, at 15:32, Martin Losch wrote:
>>>>>>>>>
>>>>>>>>>> Hi there,
>>>>>>>>>>
>>>>>>>>>> there seems to be an initialisation issue in one/some of the exch2
>>>>>>>>>> routines. On our beloved (God, I hate this machine) SX8, the
>>>>>>>>>> high-res-cube stops with errors like this:
>>>>>>>>>>>   * 253 Invalid operation PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>>>>>>                 Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>>>>>                 Called from ini_fields ELN=1703(4007d9d18)
>>>>>>>>>>>                 Called from initialise_varia ELN=2018(4008154cc)
>>>>>>>>>>> ****  99 Execution suspended PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>>>>>>                 Called from exch2_rl2_cube ELN=1966(40048c594)
>>>>>>>>>>>                 Called from exch2_uv_3d_rl ELN=1603(4004a4a74)
>>>>>>>>>>>                 Called from exch_uv_3d_rl ELN=1826(4006f8478)
>>>>>>>>>>>                 Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>>>> so at the first uv exchange. A closer look confirms that array1 and
>>>>>>>>>> array2 in exch2_send_rl2 have NaNs in them in the overlap. This
>>>>>>>>>> problem goes away when I make the compiler initialise everything to
>>>>>>>>>> zero by default. (I also learned that apparently not the entire
>>>>>>>>>> overlap is exchanged in exch2_rl2_cube, but only olx-1,oly-1
>>>>>>>>>> points, at least for cubed exchanges; that would explain why two
>>>>>>>>>> exchanges are necessary, wouldn't it?)
>>>>>>>>>>
>>>>>>>>>> Martin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> MITgcm-support mailing list
>>>>>>>>>> MITgcm-support at mitgcm.org
>>>>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> MITgcm-devel mailing list
>>>>>>>>> MITgcm-devel at mitgcm.org
>>>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel



