[MITgcm-devel] Re: [MITgcm-support] bug in exch2?
Martin Losch
Martin.Losch at awi.de
Tue Jul 17 03:14:35 EDT 2007
A short update: I now have a run with useSingleCPUio = .false. that
got past the pickup stage, so the problem must be related to the
single-CPU I/O code, right? Maybe we can track it down here, as soon
as we get the debugger to work with *.F files (it only looks for
*.F90 files; we are close to renaming all files ...).
Martin
On 17 Jul 2007, at 01:32, Chris Hill wrote:
> OK, since they are zeroed beforehand, it does sound like an MDS problem.
> Will try to look, but I am in a meeting in the UK at the moment.
>
> Chris
> Martin Losch wrote:
>> Hi Chris,
>> these are lines 1625-1627 of exch2_send_rl2.f:
>> 1625 val1=sa1*array1(isl,jsl,ktl)
>> 1626 & +sa2*array2(isl,jsl,ktl)
>> 1627 e2Bufr1_RL(iBufr1)=val1
>> Exactly what you thought. What's happening is that array1 and/or
>> array2 contain NaN, so that val1 is then NaN and the program
>> crashes when e2Bufr1_RL(iBufr1) is assigned NaN.
>> MDS: uVel and vVel are initialized to zero (including the
>> overlaps) BEFORE read_pickup; in read_pickup (after
>> read_rec_3d_rl) the overlaps suddenly have some NaNs in them: not
>> the entire overlap, just a few points, always at (i,j)=(12,-3),
>> (15,0),(18,4), in each vertical layer. I checked that with the
>> "hallo-debugger". I use s1800_17x51, so that sNx=17, sNy=51.
>> This tells me that somewhere underneath the read_rec_3d_rl layer
>> the overlaps are re-initialised to NaN, right? I would think that
>> this is an MDS issue, isn't it?
>> BTW, the first CPU (with STDOUT.0000) does not have NaNs in it,
>> and I am using useSingleCPUio=.true. When I unset this flag, the
>> run does not even get past reading the pickups in a reasonable
>> time (1h).
>> Martin
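The overlap check described above can be sketched in a few lines. This is only an illustration in Python, not MITgcm code: the helper names `make_tile` and `nan_points_in_overlap` are hypothetical, and the overlap widths `olx = oly = 4` are an assumption (the message only gives sNx=17 and sNy=51). The tile is zeroed, NaN is planted at the three reported points, and the scan lists every overlap point that holds NaN.

```python
import math


def make_tile(snx, sny, olx, oly, fill=0.0):
    """Build one 2-D tile including overlaps, keyed by Fortran-style (i, j).

    Indices run over 1-olx..snx+olx and 1-oly..sny+oly, mirroring the
    declaration arr(1-oLx:sNx+oLx, 1-oLy:sNy+oLy, ...).
    """
    return {
        (i, j): fill
        for i in range(1 - olx, snx + olx + 1)
        for j in range(1 - oly, sny + oly + 1)
    }


def nan_points_in_overlap(tile, snx, sny):
    """Return the sorted (i, j) points outside the interior that hold NaN."""
    hits = []
    for (i, j), value in tile.items():
        interior = 1 <= i <= snx and 1 <= j <= sny
        if not interior and math.isnan(value):
            hits.append((i, j))
    return sorted(hits)


# sNx, sNy are from the message; olx, oly are assumed for illustration.
snx, sny, olx, oly = 17, 51, 4, 4
tile = make_tile(snx, sny, olx, oly)

# Plant NaN at the three overlap points reported in the message.
for point in [(12, -3), (15, 0), (18, 4)]:
    tile[point] = math.nan

print(nan_points_in_overlap(tile, snx, sny))
```

Running the same scan once before and once after the read would show whether the read itself is what writes NaN into the overlaps.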
>> On 13 Jul 2007, at 18:06, chris hill wrote:
>>> Hi Martin/JM,
>>>
>>> In principle the
>>>
>>> arr(*) -> arr(1-olx:sNx+olx,.....)
>>>
>>> should be fine. It is not obvious to me that there is an mds
>>> problem.
>>> It would be legitimate for the overlaps to have NaN, if they are
>>> uninitialized.
>>>
>>> Can you send the fortran line at
>>>
>>> exch2_send_rl2 ELN=1627
>>>
>>> it could be a subtle side effect of the way I have done the
>>> permute op in exch2 (c=alpha*a+beta*c) and the range of indices I
>>> use in exch and exch2, which means that we need to initialize
>>> better. If this is a problem there is a safe fix that could be
>>> added to exch2, but it wouldn't vectorize too well.
>>>
>>> Chris
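Why c=alpha*a+beta*c is sensitive to uninitialised data can be shown with a small Python sketch. Under IEEE arithmetic 0.0*NaN is NaN, so a zero coefficient does not protect against an uninitialised operand. The `combine_guarded` variant is only a guess at the kind of "safe fix" meant above, not the actual exch2 code; both function names are hypothetical.

```python
import math


def combine(alpha, a, beta, c):
    """Plain linear combination, as in c = alpha*a + beta*c."""
    return alpha * a + beta * c


def combine_guarded(alpha, a, beta, c):
    """Skip any term whose coefficient is exactly zero.

    An uninitialised (NaN) operand with a zero coefficient then cannot
    leak into the result. The price is a branch per term.
    """
    result = 0.0
    if alpha != 0.0:
        result += alpha * a
    if beta != 0.0:
        result += beta * c
    return result


uninitialised = math.nan

# beta is zero, yet the NaN operand still poisons the plain version.
print(math.isnan(combine(1.0, 2.5, 0.0, uninitialised)))

# The guarded version never touches the operand with the zero coefficient.
print(combine_guarded(1.0, 2.5, 0.0, uninitialised))
```

The extra branches inside the innermost loop are presumably what would make such a fix vectorize poorly.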
>>>
>>> Martin Losch wrote:
>>>> Hi Jean-Michel,
>>>> thanks for answering. Just to clarify: This thread is called
>>>> "bug in exch2", but as I found, the problem is not connected to
>>>> any exchange routines but to reading the pickup via
>>>> read_rec_3d_rl etc. (but I cannot rename the thread, )-:). I have
>>>> only encountered the problem on our SX8 with the cs510; with
>>>> cs32 I cannot reproduce it.
>>>> I can make the problem go away by making the compiler initialize
>>>> everything to zero. This solution works for me, but is this
>>>> satisfactory for others? What are possible candidates for
>>>> problems in the calling sequence
>>>> read_pickup -> read_rec_3d_rl -> mdsreadfield ->
>>>> mds_read_fields -> mds_seg4torl
>>>> ? Is there anything I can try to track down the problem? These
>>>> mdsio routines are terribly hard to understand, and I don't want
>>>> to do anything in there, really, but I could help identify a
>>>> potential problem.
>>>> Martin
>>>> PS. Do the exch2_* comments refer to the other thread:
>>>> "Question: boundary exchange, hrcube condfiguration"?
>>>> On 13 Jul 2007, at 17:19, Jean-Michel Campin wrote:
>>>>> Hi Martin,
>>>>>
>>>>> On Thu, Jul 12, 2007 at 03:54:02PM +0200, Martin Losch wrote:
>>>>>> Hi again,
>>>>>> this was meant to go to the devel list in the first place, oh
>>>>>> well.
>>>>>>
>>>>>> I have tried to find where the NaNs in the overlaps come from,
>>>>>> and they appear when u and v are read from the pickup file with
>>>>>> read_rec_3d_rl.
>>>>>> read_rec_3d_rl calls mdsreadfield, which in turn calls
>>>>>> mds_read_fields.
>>>>>> In the latter two routines, the array (uVel or vVel) to be read
>>>>>> is declared as arr(*), but then mds_read_fields calls, e.g.,
>>>>>> mds_seg4torl, where the array is declared as
>>>>>> _RL arr(1-oLx:sNx+oLx,1-oLy:sNy+oLy,nNz,nSx,nSy)
>>>>>> Could that be the source of the problem? I don't know. Should
>>>>>> we do anything about this?
>>>>>
>>>>> I don't think this declaration is a problem.
>>>>>
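A small Python sketch of why passing arr(*) on to a routine that declares the full shape is normally harmless: Fortran stores arrays in column-major order, so each (i,j,k,bi,bj) of the shaped declaration corresponds to exactly one position in the flat array, provided caller and callee agree on the extents. The helper `flat_offset` is hypothetical and the values nNz=2, nSx=nSy=1 and oLx=oLy=4 are assumptions chosen only for illustration.

```python
def flat_offset(i, j, k, bi, bj, snx, sny, olx, oly, nnz, nsx):
    """Zero-based position in arr(*) of arr(i, j, k, bi, bj).

    Assumes the shaped declaration
        arr(1-oLx:sNx+oLx, 1-oLy:sNy+oLy, nNz, nSx, nSy)
    and Fortran column-major storage (first index varies fastest).
    """
    nx = snx + 2 * olx  # extent of the first dimension, overlaps included
    ny = sny + 2 * oly  # extent of the second dimension, overlaps included
    return ((i - (1 - olx))
            + nx * ((j - (1 - oly))
                    + ny * ((k - 1)
                            + nnz * ((bi - 1)
                                     + nsx * (bj - 1)))))


# sNx, sNy from the thread; the remaining sizes are assumed.
snx, sny, olx, oly, nnz, nsx, nsy = 17, 51, 4, 4, 2, 1, 1
total = (snx + 2 * olx) * (sny + 2 * oly) * nnz * nsx * nsy

# First overlap corner maps to the start of the flat array ...
print(flat_offset(1 - olx, 1 - oly, 1, 1, 1, snx, sny, olx, oly, nnz, nsx))
# ... and the last corner of the last level maps to its final element.
print(flat_offset(snx + olx, sny + oly, nnz, nsx, nsy,
                  snx, sny, olx, oly, nnz, nsx) == total - 1)
```

The mapping only goes wrong if the two routines disagree about sNx, sNy, the overlap widths or the number of levels, which would shift every element after the first mismatch.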
>>>>>> As a quick fix I can just use the compiler flag that
>>>>>> initialises everything to zero, but that would mask any other
>>>>>> problems associated with wrong initializations.
>>>>>>
>>>>>> What's your opinion?
>>>>>
>>>>> This quick fix is worth trying.
>>>>> I have another exch2_uv_cgrid ready to check in, which only
>>>>> calls exch2_rl_cube (and not exch2_rl2_cube), and I have the
>>>>> impression that it could work, with the chance of getting
>>>>> an adjoint version more easily. I have also started an
>>>>> exch2_uv_bgrid, but it looks more complicated than I thought.
>>>>>
>>>>> Jean-Michel
>>>>>
>>>>>>
>>>>>> Martin
>>>>>> On 11 Jul 2007, at 15:32, Martin Losch wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> there seems to be an initialisation issue in one/some of the
>>>>>>> exch2 routines. On our beloved (God, I hate this machine) SX8,
>>>>>>> the high-res-cube stops with errors like this:
>>>>>>>> * 253 Invalid operation PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>> Called from ini_fields ELN=1703(4007d9d18)
>>>>>>>> Called from initialise_varia ELN=2018
>>>>>>>> (4008154cc)
>>>>>>>> **** 99 Execution suspended PROG=exch2_send_rl2 ELN=1627
>>>>>>>> (40049c9d8)
>>>>>>>> Called from exch2_rl2_cube ELN=1966(40048c594)
>>>>>>>> Called from exch2_uv_3d_rl ELN=1603(4004a4a74)
>>>>>>>> Called from exch_uv_3d_rl ELN=1826(4006f8478)
>>>>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>>>> so at the first uv exchange. A closer look confirms that
>>>>>>> array1 and array2 in exch2_send_rl2 have NaNs in them in the
>>>>>>> overlap. This problem goes away when I make the compiler
>>>>>>> initialise everything to zero by default. (I also learned that
>>>>>>> apparently not the entire overlap is exchanged in
>>>>>>> exch2_rl2_cube, but only olx-1,oly-1 points, at least for cubed
>>>>>>> exchanges; that would explain why two exchanges are necessary,
>>>>>>> wouldn't it?)
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> MITgcm-support mailing list
>>>>>>> MITgcm-support at mitgcm.org
>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>>>
>>>>>> _______________________________________________
>>>>>> MITgcm-devel mailing list
>>>>>> MITgcm-devel at mitgcm.org
>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel