[MITgcm-devel] Re: [MITgcm-support] bug in exch2?
Martin Losch
Martin.Losch at awi.de
Tue Jul 17 04:41:40 EDT 2007
After my previous premature babbling I have now found a fix to the
problem:
I initialize "local" (which is the name of the argument through which
sharedLocalBuf is passed) and "temp" in scatter_2d, and the NaNs in
the overlaps go away.
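
To make the failure mode concrete, here is a minimal Python sketch. It is
not MITgcm code (the real routines are Fortran), and it only assumes that
the scatter step fills the interior of a tile-sized buffer and leaves the
overlap points untouched. The names scatter_tile and make_buffer and the
tile sizes are hypothetical; NaN stands in for whatever an uninitialized
buffer happens to contain.

```python
import math

# Hypothetical tile sizes for illustration only (not the cs510 setup).
SNX, SNY, OLX, OLY = 4, 3, 2, 2


def make_buffer(fill):
    """Return a tile-sized buffer, overlaps included, filled with `fill`."""
    return [[fill] * (SNX + 2 * OLX) for _ in range(SNY + 2 * OLY)]


def scatter_tile(global_tile, local, initialize):
    """Copy interior values into `local`, optionally zeroing it first.

    A stand-in for the idea behind the fix, not the real scatter_2d.
    """
    if initialize:
        for row in local:
            for i in range(len(row)):
                row[i] = 0.0
    # Only the interior points are written; overlaps keep their old content.
    for j in range(SNY):
        for i in range(SNX):
            local[j + OLY][i + OLX] = global_tile[j][i]
    return local


def count_nans(buf):
    """Count NaN entries anywhere in the buffer."""
    return sum(math.isnan(v) for row in buf for v in row)


global_tile = [[float(10 * j + i) for i in range(SNX)] for j in range(SNY)]

# NaN plays the role of garbage left in an uninitialized buffer.
stale = scatter_tile(global_tile, make_buffer(math.nan), initialize=False)
clean = scatter_tile(global_tile, make_buffer(math.nan), initialize=True)

print("NaNs without initialization:", count_nans(stale))
print("NaNs with initialization:   ", count_nans(clean))
```

In this sketch the uninitialized case leaves every overlap point NaN while
the interior is correct, so copying the whole buffer into a field that was
zeroed beforehand would still put NaNs into its overlaps. Zeroing the
buffer at the start of the scatter removes them.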
If this is a general solution, I'll happily check this in (but I'll
wait for your approval, as I do not have enough of an overview on
this issue).
Martin
On 17 Jul 2007, at 09:45, Martin Losch wrote:
> Hi Chris,
>
> sorry about the previous email, which I did not mean to send. In
> fact, initializing sharedLocalBuf in mdsio_read_field.F did NOT
> help. I was so confident that it would that I wrote the email
> before the run was finished and accidentally pressed "send".
> I still have nans ... oh well.
> Martin
> On 17 Jul 2007, at 09:38, Martin Losch wrote:
>
>> Hi Chris,
>>
>> I cannot find where the global array
>> _RL sharedLocalBuf(1-Olx:sNx+Olx,1-Oly:sNy+Oly,nSx,nSy)
>> (in MDSIO_SCPU.h)
>> is initialized. I now initialize it to zero at the beginning of
>> mdsio_read_field.F and then the problem goes away. Where should it
>> be initialized properly? Why is this a global array in a common
>> block anyway? It's only used in mdsio_read_field.F and
>> mdsio_write_field.F as far as I can see and as the name implies
>> it's a "local" array that does not hold any information that is used
>> outside the respective routines. Am I missing something?
>>
>> Martin
>>
>> On 17 Jul 2007, at 09:14, Martin Losch wrote:
>>
>>> A short update. I now have a run with useSingleCPUio = .false.
>>> and it got past the pickup stage, so the problem must be
>>> related to the single-CPU I/O code, right? Maybe we can track it
>>> down here, as soon as we get the debugger to work with *.F files
>>> (it only looks for *.F90 files, we are close to renaming all
>>> files ...).
>>>
>>> Martin
>>>
>>> On 17 Jul 2007, at 01:32, Chris Hill wrote:
>>>
>>>> OK since they are zero'd before it does sound like an MDS prob.
>>>> Will try and look, but I am in meeting in UK at moment.
>>>>
>>>> Chris
>>>> Martin Losch wrote:
>>>>> Hi Chris,
>>>>> these are lines 1625-1627 of exch2_send_rl2.f:
>>>>> 1625 val1=sa1*array1(isl,jsl,ktl)
>>>>> 1626 & +sa2*array2(isl,jsl,ktl)
>>>>> 1627 e2Bufr1_RL(iBufr1)=val1
>>>>> Exactly what you thought. What's happening is that array1 and/
>>>>> or array2 are NaN, so that val1 is then NaN and the program
>>>>> crashes when e2Bufr1_RL(iBufr1) is assigned NaN.
>>>>> MDS: uVel and vVel are initialized to zero (including the
>>>>> overlaps) BEFORE read_pickup; in read_pickup (after
>>>>> read_rec_3d_rl) the overlaps suddenly have some nans on them;
>>>>> not the entire overlap, just a few points always for (i,j)=
>>>>> (12,-3),(15,0),(18,4), in each vertical layer. I checked that
>>>>> with the "hallo-debugger". I use s1800_17x51, so that sNx=17,
>>>>> sNy=51.
>>>>> This tells me, that somewhere underneath the read_rec_3d_rl
>>>>> layer, the overlaps are re-initialised to NaN, right? I would
>>>>> think that this is an MDS issue, isn't it?
>>>>> BTW, the first CPU (with STDOUT.0000) does not have nans in it,
>>>>> and I am using useSingleCPUio=.true. When I unset this flag,
>>>>> the run does not even get past reading the pickups in a
>>>>> reasonable time (1h).
>>>>> Martin
>>>>> On 13 Jul 2007, at 18:06, chris hill wrote:
>>>>>> Hi Martin/JM,
>>>>>>
>>>>>> In principle the
>>>>>>
>>>>>> arr(*) -> arr(1-olx:sNx+olx,.....)
>>>>>>
>>>>>> should be fine. It is not obvious to me that there is an mds
>>>>>> problem.
>>>>>> It would be legitimate for the overlaps to have NaN, if they
>>>>>> are uninitialized.
>>>>>>
>>>>>> Can you send the fortran line at
>>>>>>
>>>>>> exch2_send_rl2 ELN=1627
>>>>>>
>>>>>> it could be a subtle side effect of the way I have done the
>>>>>> permute op in exch2 (c=alpha*a+beta*c) and the range of
>>>>>> indices I use in exch and exch2, which means that we need to
>>>>>> initialize better. If this is a problem there is a safe fix
>>>>>> that could be added to exch2, but it wouldn't vectorize too well.
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> Martin Losch wrote:
>>>>>>> Hi Jean-Michel,
>>>>>>> thanks for answering. Just to clarify: This thread is called
>>>>>>> "bug in exch2", but as I found, the problem is not connected
>>>>>>> to any exchange routines but to reading the pickup via
>>>>>>> read_rec_3d_rl etc (but I cannot rename the thread, )-:). I
>>>>>>> have only encountered the problem on our SX8 with the cs510,
>>>>>>> with cs32 I cannot reproduce it.
>>>>>>> I can make the problem go away by making the compiler
>>>>>>> initialize everything to zero. This solution works for me,
>>>>>>> but is this satisfactory for others? What are possible
>>>>>>> candidates for problems in the calling sequence
>>>>>>> read_pickup -> read_rec_3d_rl -> mdsreadfield ->
>>>>>>> mds_read_fields -> mds_seg4torl
>>>>>>> ? Is there anything I can try to track down the problem?
>>>>>>> These mdsio routines are terribly hard to understand, and I
>>>>>>> don't want to do anything in there, really, but I could help
>>>>>>> identify a potential problem.
>>>>>>> Martin
>>>>>>> PS. Do the exch2_* comments refer to the other thread:
>>>>>>> "Question: boundary exchange, hrcube condfiguration"?
>>>>>>> On 13 Jul 2007, at 17:19, Jean-Michel Campin wrote:
>>>>>>>> Hi Martin,
>>>>>>>>
>>>>>>>> On Thu, Jul 12, 2007 at 03:54:02PM +0200, Martin Losch wrote:
>>>>>>>>> Hi again,
>>>>>>>>> this was meant to go to the devel list in the first place,
>>>>>>>>> oh well.
>>>>>>>>>
>>>>>>>>> I have tried to find where the nans in the overlaps come
>>>>>>>>> from, and
>>>>>>>>> they appear when u and v are read from the pickup file with
>>>>>>>>> read_rec_3d_rl.
>>>>>>>>> read_rec_3d_rl calls mdsreadfield, which in turn calls
>>>>>>>>> mds_read_fields
>>>>>>>>> In the latter two routines, the array (uVel or vVel) to be
>>>>>>>>> read is
>>>>>>>>> declared as arr(*), but then mds_read_fields calls, eg.
>>>>>>>>> mds_seg4torl,
>>>>>>>>> where the array is declared as
>>>>>>>>> _RL arr(1-oLx:sNx+oLx,1-oLy:sNy+oLy,nNz,nSx,nSy)
>>>>>>>>> Could that be the source of the problem? I don't know.
>>>>>>>>> Should we do
>>>>>>>>> anything about this?
>>>>>>>>
>>>>>>>> I don't think this declaration is a problem.
>>>>>>>>
>>>>>>>>> As a quick fix I can just use the compiler flag that initialises
>>>>>>>>> everything to zero, but that would mask any other problems
>>>>>>>>> associated with wrong initializations.
>>>>>>>>>
>>>>>>>>> What's your opinion?
>>>>>>>>
>>>>>>>> This quick fix is worth trying.
>>>>>>>> I have another exch2_uv_cgrid ready to check in, which only
>>>>>>>> calls exch2_rl_cube (and not exch2_rl2_cube), and I have the
>>>>>>>> impression that it could work, with the chance of getting
>>>>>>>> an adjoint version more easily. I have also started an
>>>>>>>> exch2_uv_bgrid, but it looks more complicated than I thought.
>>>>>>>>
>>>>>>>> Jean-Michel
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Martin
>>>>>>>>> On 11 Jul 2007, at 15:32, Martin Losch wrote:
>>>>>>>>>
>>>>>>>>>> Hi there,
>>>>>>>>>>
>>>>>>>>>> there seems to be an initialisation issue in one/some of
>>>>>>>>>> the exch2
>>>>>>>>>> routines. On our beloved (God, I hate this machine) SX8,
>>>>>>>>>> the high-
>>>>>>>>>> res-cube stops with errors like this:
>>>>>>>>>>> * 253 Invalid operation PROG=exch2_send_rl2 ELN=1627(40049c9d8)
>>>>>>>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>>>>> Called from ini_fields ELN=1703(4007d9d18)
>>>>>>>>>>> Called from initialise_varia ELN=2018
>>>>>>>>>>> (4008154cc)
>>>>>>>>>>> **** 99 Execution suspended PROG=exch2_send_rl2 ELN=1627
>>>>>>>>>>> (40049c9d8)
>>>>>>>>>>> Called from exch2_rl2_cube ELN=1966
>>>>>>>>>>> (40048c594)
>>>>>>>>>>> Called from exch2_uv_3d_rl ELN=1603
>>>>>>>>>>> (4004a4a74)
>>>>>>>>>>> Called from exch_uv_3d_rl ELN=1826
>>>>>>>>>>> (4006f8478)
>>>>>>>>>>> Called from read_pickup ELN=2022(40083c6a8)
>>>>>>>>>> so at the first uv exchange. A closer look confirms that
>>>>>>>>>> array1 and
>>>>>>>>>> array2 in exch2_send_rl2 have nans on them in the overlap.
>>>>>>>>>> This
>>>>>>>>>> problem goes away, when I make the compiler initialise
>>>>>>>>>> everything to
>>>>>>>>>> zero by default. (I also learned that apparently not the
>>>>>>>>>> entire
>>>>>>>>>> overlap is exchanged in exch2_rl2_cube, but only olx-1,oly-1
>>>>>>>>>> points, at least for cubed exchanges; that would explain,
>>>>>>>>>> why two
>>>>>>>>>> exchanges are necessary, wouldn't it?)
>>>>>>>>>>
>>>>>>>>>> Martin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> MITgcm-support mailing list
>>>>>>>>>> MITgcm-support at mitgcm.org
>>>>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> MITgcm-devel mailing list
>>>>>>>>> MITgcm-devel at mitgcm.org
>>>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel