[MITgcm-devel] Re: [MITgcm-support] bug in exch2?
Martin Losch
Martin.Losch at awi.de
Tue Jul 17 03:38:06 EDT 2007
Hi Chris,
I cannot find where the global array
_RL sharedLocalBuf(1-Olx:sNx+Olx,1-Oly:sNy+Oly,nSx,nSy)
(in MDSIO_SCPU.h)
is initialized. I now initialize it to zero at the beginning of
mdsio_read_field.F and then the problem goes away. Where should it be
initialized properly? Why is this a global array in a common block
anyway? It's only used in mdsio_read_field.F and mdsio_write_field.F
as far as I can see and as the name implies it's a "local" array,
that does not hold any information that is used outside the respective
routines. Am I missing something?
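
A minimal sketch of the workaround described above, assuming a standalone free-form Fortran program rather than the actual MITgcm source. The buffer name buf, the overlap width OLx = OLy = 4, and nSx = nSy = 1 are illustrative assumptions; only sNx = 17 and sNy = 51 are taken from this thread. The whole halo-padded buffer is zeroed first, then a stand-in for the read fills only the interior, so the overlap points never hold undefined values:

```fortran
! Illustrative sketch only (not mdsio_read_field.F): zero a halo-padded
! buffer before its interior is filled, so overlaps are always defined.
program zero_shared_buffer
  implicit none
  integer, parameter :: sNx = 17, sNy = 51   ! tile size quoted in the thread
  integer, parameter :: OLx = 4, OLy = 4     ! overlap width (assumed)
  integer, parameter :: nSx = 1, nSy = 1     ! tiles per process (assumed)
  double precision :: buf(1-OLx:sNx+OLx, 1-OLy:sNy+OLy, nSx, nSy)
  integer :: i, j, bi, bj

  ! Step 1: initialise the whole buffer, overlaps included.
  do bj = 1, nSy
    do bi = 1, nSx
      do j = 1-OLy, sNy+OLy
        do i = 1-OLx, sNx+OLx
          buf(i,j,bi,bj) = 0.d0
        end do
      end do
    end do
  end do

  ! Step 2: stand-in for the file read; only interior points are set.
  do bj = 1, nSy
    do bi = 1, nSx
      do j = 1, sNy
        do i = 1, sNx
          buf(i,j,bi,bj) = 1.d0
        end do
      end do
    end do
  end do

  ! Overlap points such as (12,-3), (15,0) and (18,4) are still zero.
  if (buf(12,-3,1,1) /= 0.d0) error stop 'overlap (12,-3) not zeroed'
  if (buf(15, 0,1,1) /= 0.d0) error stop 'overlap (15,0) not zeroed'
  if (buf(18, 4,1,1) /= 0.d0) error stop 'overlap (18,4) not zeroed'
  print *, 'interior sum =', sum(buf(1:sNx,1:sNy,:,:))
end program zero_shared_buffer
```

Without step 1, the contents of the overlap points would depend on whatever the compiler or runtime leaves in memory, which is consistent with the problem disappearing when the compiler is told to initialise everything to zero.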
Martin
On 17 Jul 2007, at 09:14, Martin Losch wrote:
> A short update. I now have a run with useSingleCPUio = .false. and
> it got past the pickup stage, so the problem must be related
> to the single-CPU I/O code, right? Maybe we can track it down here,
> as soon as we get the debugger to work with *.F files (it only
> looks for *.F90 files, we are close to renaming all files ...).
>
> Martin
>
> On 17 Jul 2007, at 01:32, Chris Hill wrote:
>
>> OK, since they are zeroed beforehand it does sound like an MDS problem.
>> Will try and look, but I am in a meeting in the UK at the moment.
>>
>> Chris
>> Martin Losch wrote:
>>> Hi Chris,
>>> these are lines 1625-1627 of exch2_send_rl2.f:
>>> 1625 val1=sa1*array1(isl,jsl,ktl)
>>> 1626 & +sa2*array2(isl,jsl,ktl)
>>> 1627 e2Bufr1_RL(iBufr1)=val1
>>> Exactly what you thought. What's happening is that array1 and/or
>>> array2 are NaN, so that val1 is then NaN and the program
>>> crashes when e2Bufr1_RL(iBufr1) is assigned NaN.
>>> MDS: uVel and vVel are initialized to zero (including the
>>> overlaps) BEFORE read_pickup; in read_pickup (after
>>> read_rec_3d_rl) the overlaps suddenly have some nans on them; not
>>> the entire overlap, just a few points always for (i,j)=(12,-3),
>>> (15,0),(18,4), in each vertical layer. I checked that with the
>>> "hallo-debugger". I use s1800_17x51, so that sNx=17, sNy=51.
>>> This tells me, that somewhere underneath the read_rec_3d_rl
>>> layer, the overlaps are re-initialised to NaN, right? I would
>>> think that this is an MDS issue, isn't it?
>>> BTW, the first CPU (with STDOUT.0000) does not have nans in it,
>>> and I am using useSingleCPUio=.true. When I unset this flag, the
>>> run does not even get past reading the pickups in a reasonable
>>> time (1h).
>>> Martin
>>> On 13 Jul 2007, at 18:06, chris hill wrote:
>>>> Hi Martin/JM,
>>>>
>>>> In principle the
>>>>
>>>> arr(*) -> arr(1-olx:sNx+olx,.....)
>>>>
>>>> should be fine. It is not obvious to me that there is an mds
>>>> problem.
>>>> It would be legitimate for the overlaps to have NaN, if they
>>>> are uninitialized.
>>>>
>>>> Can you send the fortran line at
>>>>
>>>> exch2_send_rl2 ELN=1627
>>>>
>>>> it could be a subtle side effect of the way I have done the
>>>> permute op in exch2 (c=alpha*a+beta*c) and the range of indices
>>>> I use in exch and exch2, which means that we need to initialize
>>>> better. If this is a problem there is a safe fix that could be
>>>> added to exch2, but it wouldn't vectorize too well.
>>>>
>>>> Chris
>>>>
>>>> Martin Losch wrote:
>>>>> Hi Jean-Michel,
>>>>> thanks for answering. Just to clarify: This thread is called
>>>>> "bug in exch2", but as I found, the problem is not connected to
>>>>> any exchange routines but to reading the pickup via
>>>>> read_rec_3d_rl etc. (but I cannot rename the thread, )-:). I
>>>>> have only encountered the problem on our SX8 with the cs510;
>>>>> with cs32 I cannot reproduce it.
>>>>> I can make the problem go away by making the compiler
>>>>> initialize everything to zero. This solution works for me, but
>>>>> is this satisfactory for others? What are possible candidates
>>>>> for problems in the calling sequence
>>>>> read_pickup -> read_rec_3d_rl -> mdsreadfield ->
>>>>> mds_read_fields -> mds_seg4torl
>>>>> ? Is there anything I can try to track down the problem? These
>>>>> mdsio routines are terribly hard to understand, and I don't
>>>>> want to do anything in there, really, but I could help identify
>>>>> a potential problem.
>>>>> Martin
>>>>> PS. Do the exch2_* comments refer to the other thread:
>>>>> "Question: boundary exchange, hrcube condfiguration"?
>>>>> On 13 Jul 2007, at 17:19, Jean-Michel Campin wrote:
>>>>>> Hi Martin,
>>>>>>
>>>>>> On Thu, Jul 12, 2007 at 03:54:02PM +0200, Martin Losch wrote:
>>>>>>> Hi again,
>>>>>>> this was meant to go to the devel list in the first place,
>>>>>>> oh well.
>>>>>>>
>>>>>>> I have tried to find where the NaNs in the overlaps come from, and
>>>>>>> they appear when u and v are read from the pickup file with
>>>>>>> read_rec_3d_rl.
>>>>>>> read_rec_3d_rl calls mdsreadfield, which in turn calls
>>>>>>> mds_read_fields.
>>>>>>> In the latter two routines, the array (uVel or vVel) to be read is
>>>>>>> declared as arr(*), but then mds_read_fields calls, e.g.,
>>>>>>> mds_seg4torl, where the array is declared as
>>>>>>> _RL arr(1-oLx:sNx+oLx,1-oLy:sNy+oLy,nNz,nSx,nSy)
>>>>>>> Could that be the source of the problem? I don't know. Should we do
>>>>>>> anything about this?
>>>>>>
>>>>>> I don't think this declaration is a problem.
>>>>>>
>>>>>>> As a quick fix I can just use the compiler flag that initialises
>>>>>>> everything to zero, but that would mask any other problems
>>>>>>> associated with wrong initializations.
>>>>>>>
>>>>>>> What's your opinion?
>>>>>>
>>>>>> This quick fix is worth trying.
>>>>>> I have another exch2_uv_cgrid ready to check in, which only
>>>>>> calls exch2_rl_cube (and not exch2_rl2_cube), and I have the
>>>>>> impression that it could work, with the chance of getting
>>>>>> an adjoint version more easily. I have also started an
>>>>>> exch2_uv_bgrid, but it looks more complicated than I thought.
>>>>>>
>>>>>> Jean-Michel
>>>>>>
>>>>>>>
>>>>>>> Martin
>>>>>>> On 11 Jul 2007, at 15:32, Martin Losch wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> there seems to be an initialisation issue in one/some of the exch2
>>>>>>>> routines. On our beloved (God, I hate this machine) SX8, the
>>>>>>>> high-res-cube stops with errors like this:
>>>>>>>>> * 253 Invalid operation PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>>> Called from ini_fields ELN=1703(4007d9d18)
>>>>>>>>> Called from initialise_varia ELN=2018
>>>>>>>>> (4008154cc)
>>>>>>>>> **** 99 Execution suspended PROG=exch2_send_rl2 ELN=1627
>>>>>>>>> (40049c9d8)
>>>>>>>>> Called from exch2_rl2_cube ELN=1966(40048c594)
>>>>>>>>> Called from exch2_uv_3d_rl ELN=1603(4004a4a74)
>>>>>>>>> Called from exch_uv_3d_rl ELN=1826(4006f8478)
>>>>>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>> so at the first uv exchange. A closer look confirms that array1 and
>>>>>>>> array2 in exch2_send_rl2 have NaNs in the overlap. This
>>>>>>>> problem goes away when I make the compiler initialise everything to
>>>>>>>> zero by default. (I also learned that apparently not the entire
>>>>>>>> overlap is exchanged in exch2_rl2_cube, but only olx-1,oly-1
>>>>>>>> points, at least for cubed exchanges; that would explain why two
>>>>>>>> exchanges are necessary, wouldn't it?)
>>>>>>>>
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> MITgcm-support mailing list
>>>>>>>> MITgcm-support at mitgcm.org
>>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> MITgcm-devel mailing list
>>>>>>> MITgcm-devel at mitgcm.org
>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel