[MITgcm-devel] Re: MITgcm vectorization on NEC SX
Martin Losch
Martin.Losch at awi.de
Tue Dec 4 03:13:30 EST 2007
Hi Chris,
thanks for the comments. I'll see what I can do about the loop
exchange in the exchange routines.
1. You are right: exchange_rl_send_put/recv_get_x use about 1-2% each
in my simple configuration with only 1 CPU (and MPI). When I go to 5
CPUs (decomposition in y-direction only) on my small domain
(243x170), the x-exchanges use 2%, but the corresponding y-
exchanges suddenly use 5.8 and 3.6%.
2. I have not tried the exch2 routines yet, but I suspect that there
will be similar "problems". I had left them alone deliberately, because
Patrick asked for a chance to make the adjoint work first, right?
Should I tell Jens-Olaf to have a look at exch2, too?
Martin
On 3 Dec 2007, at 18:24, chris hill wrote:
> Hi Martin,
>
> Exchanging the loops is definitely fine and makes sense. Needs
> testing though once it is done. Most iB's should be able to work
> fine if you change both sides of the loops. The iB0 value formula
> will need to change - which looks a bit confusing in reverse mode!
>
> Two general thoughts however
>
> 1 - from your earlier note I got the impression that the overhead
> was a few %. If that's right then, unless we scale to large numbers
> of SX procs, the impact on time to solution won't be much?
>
> 2 - I suspect Jens-Olaf isn't going to like exch2! Be good if he
> could look at that soon.
>
> Chris
>
> Martin Losch wrote:
>> Chris,
>> do you have any thoughts on the loop exchange? There are also some
>> iB's and iB0's that need to be taken care of, if the loops are
>> exchanged, which I don't quite understand.
>> Martin
>> On 26 Nov 2007, at 18:19, Jean-Michel Campin wrote:
>>> Hi Martin,
>>>
>>> I have a few comments:
>>> a) regarding I/O: without singleCpuIO, MDSIO reads/writes
>>> 1 line at a time; with singleCpuIO a 2-D field is read/written
>>> in 1 instruction. Does this make a difference in your run
>>> (not so easy to check; maybe the easiest would be on 1 processor
>>> only)?
>>>
>>> b) regarding BARRIER :
>>> the IF (nSx.gt.1.or.nSy.gt.1) solution will remove the BARRIER
>>> at compilation time. A more selective way to avoid the
>>> BARRIER call is just to add IF ( nThreads.GT.1 ) but this
>>> is not known at compilation stage.
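>>> As a minimal sketch of the two options (illustration only, not the
>>> actual exchange code):
>>>
>>> ```fortran
>>> C-- Option b1: nSx, nSy are compile-time constants (SIZE.h), so the
>>> C   compiler can remove the BARRIER call entirely when both are 1.
>>>       IF ( nSx.GT.1 .OR. nSy.GT.1 ) THEN
>>>         CALL BARRIER( myThid )
>>>       ENDIF
>>>
>>> C-- Option b2: more selective, but nThreads is only known at run
>>> C   time, so the (cheap) test itself stays in the executable.
>>>       IF ( nThreads.GT.1 ) THEN
>>>         CALL BARRIER( myThid )
>>>       ENDIF
>>> ```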
>>>
>>> c) point 1 (inlining): I remember talking to Constantinos,
>>> and it does not seem so easy, since the syntax of the
>>> inlining instruction is very specific to 1 compiler.
>>> And I fully agree with you, we should find a flexible solution.
>>>
>>> d) point 3 (directive): I am in favor of adding a few directives
>>> if - significant improvement
>>> - in S/R that don't change so often (like cg2d)
>>> - not too much trouble with TAF (no worry with cg2d)
>>> Don't know which solution is the best, but I trust you.
>>>
>>> e) diagnostics:
>>> I checked the code again, and if the diagnostic is not turned off,
>>> the function call "diagnostics_is_on" and the subroutine
>>> "diagnostics_fill" do pretty much the same thing.
>>> Unless your compiler is much more efficient for a function
>>> call than for a subroutine call (and I know some compilers
>>> that aren't), I have the impression that you will not save much
>>> time here.
>>> On the other hand, if the issue is the vectorisation of the 2 loops
>>> in diagnostics_fill (DO n=1,nlists & DO m=1,nActive(n)) because of
>>> the stuff inside the loops, then we could change diagnostics_fill
>>> to first check only whether the diagnostic is turned on, and then
>>> CALL DIAGNOSTICS_FILL_FIELD if it is.
>>> Also, in the case where many diagnostics are turned on,
>>> the modification you propose will slow down the code, since
>>> it means checking the list of active diagnostics 2 times
>>> (instead of 1).
>>> In summary, I would not go for changing diagnostics unless we have
>>> a serious test (with few diagnostics and with many) which supports
>>> a significant performance improvement.
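>>> For reference, the two call patterns being compared would look
>>> roughly like this (the argument lists are schematic, not the real
>>> interfaces, and DIAGNOSTICS_FILL_FIELD as a separate check-free
>>> fill routine is the hypothetical part):
>>>
>>> ```fortran
>>> C-- Current pattern: diagnostics_fill scans the active-diagnostics
>>> C   lists itself and returns quickly if 'UVEL' is not active.
>>>       CALL DIAGNOSTICS_FILL( uFld, 'UVEL    ', myThid )
>>>
>>> C-- Proposed pattern: check first, fill only if active. With many
>>> C   active diagnostics this scans the lists twice per call.
>>>       IF ( DIAGNOSTICS_IS_ON( 'UVEL    ', myThid ) ) THEN
>>>         CALL DIAGNOSTICS_FILL_FIELD( uFld, 'UVEL    ', myThid )
>>>       ENDIF
>>> ```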
>>>
>>> and leave the exchange discussion to Chris.
>>>
>>> Cheers,
>>> Jean-Michel
>>>
>>> On Thu, Nov 22, 2007 at 03:20:37PM +0100, Martin Losch wrote:
>>>> Hi there,
>>>>
>>>> after some time (months!) I would like to pick up this thread again
>>>> and discuss with you what has happened and what should happen in
>>>> terms of further vectorization of the MITgcm (on the NEC SX, as this
>>>> is the only vector machine available to me).
>>>>
>>>> Let me try to summarize:
>>>>
>>>> I have already modified the following code (mostly following Jens-
>>>> Olaf Beismann's suggestions, which you kindly provided all the
>>>> time):
>>>> - calc_r_star/seaice_advection and other routines: error catching
>>>> is moved out of loops
>>>> - seaice_lsr is vectorized, turned on with cpp-flag
>>>> SEAICE_VECTORIZE_LSR
>>>> - exf_radiation/bulkformulae/mapfields: rearranged loops and
>>>> if-blocks, added extra 2d fields
>>>> - mom_calc_visc
>>>> - gad_os7mp_adv_*: completed loops, should also help adjoint
>>>>
>>>> With these and some other minor modifications I have been able to run
>>>> a 243x171x33 arctic configuration with seaice (lsr), exf (but no
>>>> exf_interp), kpp, daily forcing and deltaT=900sec with satisfactory
>>>> efficiency. The average vector operation ratio is 99.2% and I achieve
>>>> 6941 MFLOPS on one CPU. See the attached "example_ftrace.tgz" for a
>>>> detailed "ftrace" analysis. From this analysis you'll see that I
>>>> spend most time in mom_calc_visc, because I use the Leith scheme. The
>>>> most apparent "problems" are associated with exch_rl_send_put/
>>>> recv_get_x; together they take 3% of the time, and "ucase" (0.8%).
>>>> Then when there is more output, the mnc routines start to become more
>>>> important, e.g., mnc_get_ind (here very little time), and the monitor
>>>> routines start to matter with more monitor output.
>>>>
>>>> The reason for the exchange routines being slow is simple: the inner
>>>> loop (i-loop) is short (=OLx) and vectorizing it is inefficient.
>>>> Switching the loop order (together with a compiler directive) would
>>>> improve this situation a lot; likewise the barrier calls at the end
>>>> of these routines should be inlined or not even called (remember
>>>> an earlier posting of Jean-Michel, where he suggests putting these
>>>> calls into IF (nSx.gt.1.or.nSy.gt.1) blocks).
>>>> "ucase" cannot be vectorized ... I don't know what to do about this,
>>>> but making text output upper case costs almost 1% (or only 1%, as you
>>>> wish).
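>>>> To illustrate the loop exchange (schematic only; the real exchange
>>>> routines copy into send buffers, and the array bounds here are
>>>> simplified):
>>>>
>>>> ```fortran
>>>> C-- Original order: the inner i-loop has only OLx iterations,
>>>> C   far too short to fill the vector pipes.
>>>>       DO j = 1-OLy, sNy+OLy
>>>>        DO i = 1, OLx
>>>>          buf(i,j) = fld(sNx-OLx+i,j)
>>>>        ENDDO
>>>>       ENDDO
>>>>
>>>> C-- Exchanged order: the long j-loop is now innermost; a directive
>>>> C   (e.g. !CDIR NODEP) reassures the compiler that it is safe to
>>>> C   vectorize the strided access.
>>>>       DO i = 1, OLx
>>>> !CDIR NODEP
>>>>        DO j = 1-OLy, sNy+OLy
>>>>          buf(i,j) = fld(sNx-OLx+i,j)
>>>>        ENDDO
>>>>       ENDDO
>>>> ```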
>>>>
>>>> In a different configuration with more forcing input, the I/O becomes
>>>> an issue, because the GFS cannot handle small slabs of I/O too well.
>>>>
>>>> Open questions:
>>>> 1. Inlining: according to Jens-Olaf, quite important. The SX compilers
>>>> do automatic inlining only if the routine to be inlined is shorter
>>>> than some defined number of lines (can be set as a compiler flag) AND
>>>> the routine to be inlined is in the same file as the calling routine.
>>>> With compiler flags, external routines can be inlined: one specifies
>>>> the name of the routine and the filename where the compiler should
>>>> look for it. However, this depends on the packages that you want to
>>>> load, because if the compiler does not find the routine it's supposed
>>>> to inline, it returns an error. If possible, we should find a flexible
>>>> way of doing this.
>>>>
>>>> 2. exchange routines:
>>>> a. I can do the changes suggested above if you agree with them (see
>>>> also my earlier emails about this).
>>>> b. In a second step, I/we may want to have special exchange routines
>>>> for the SX8, in particular for CG2D, as this is most expensive in
>>>> this respect. Jens-Olaf has come up with a solution, which is very
>>>> ad hoc and not at all general, but that may be a way to go? This is a
>>>> big project, but according to Jens-Olaf: "the number of exchange
>>>> calls used to exchange fields over processor boundaries and the
>>>> number of subroutine layers between the call and the actual MPI call
>>>> make the communication routines the most expensive ones."
>>>>
>>>> 3. !CDIR OUTERUNROLL=10 in CG2D gives a 30% speedup of cg2d. Do we
>>>> want to have directives like that in the code, maybe within ifdefs
>>>> like TARGET_NEC_SX? Stupid question: can we have a Fortran parameter,
>>>> say OURNR, in SIZE.h or CG2D.h, or one that can be set with
>>>> -DOURNR=10, so that we have a more flexible way of writing the
>>>> directive, like this:
>>>> !CDIR OUTERUNROLL=OURNR
>>>> Is that possible (I haven't tried)? If so, do we want this?
>>>> If not, how can we record this type of optimization so that users can
>>>> easily put it in their code?
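>>>> The CPP variant I have in mind would look roughly like this (an
>>>> untested sketch with an illustrative loop body, not the real cg2d
>>>> code; it relies on the .F file going through CPP, which substitutes
>>>> macros even on !CDIR lines, since those are ordinary text to CPP):
>>>>
>>>> ```fortran
>>>> #ifndef OURNR
>>>> #define OURNR 10
>>>> #endif
>>>> #ifdef TARGET_NEC_SX
>>>> !CDIR OUTERUNROLL=OURNR
>>>> #endif
>>>>       DO j = 1, sNy
>>>>        DO i = 1, sNx
>>>> C       illustrative statement only
>>>>         q(i,j) = a(i,j)*x(i,j) + b(i,j)
>>>>        ENDDO
>>>>       ENDDO
>>>> ```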
>>>>
>>>> 4. special I/O routines that read the forcing files more
>>>> efficiently.
>>>>
>>>> 5. diagnostics: there is a 5% overhead associated with the
>>>> diagnostics package (just turning it on without defining any
>>>> diagnostic in data.diagnostics), probably because diagnostics_fill
>>>> is called so often. The solution would be to replace every
>>>> "call diagnostics_fill (diagnostic)"
>>>> by
>>>> "IF (diagnostics_is_on(diagnostic)) call diagnostics_fill
>>>> (diagnostic)"
>>>> A lot of work, and ugly ...
>>>>
>>>> These are the most important issues that I came across (before I go
>>>> ahead and change my configuration (o:, thsice requires a lot of
>>>> work, which I will not go into now, and I have not tried a cubed
>>>> sphere run yet). The list reflects my personal ranking. Points 2a and
>>>> 3 are relatively easy for me to do, and I will do them if you agree
>>>> (also to a subset of these).
>>>>
>>>> There are also many other small things, e.g. in find_rho it would be
>>>> possible to speed things up, etc., but that's for later.
>>>>
>>>> Please let me know, what I can do now, and what I should not bother
>>>> thinking about, because it is not likely to make it into the
>>>> repository anyway ...
>>>>
>>>> Martin
>>>>
>>>> PS. Enjoy your Thanksgiving weekend!
>>>>
>>>> _______________________________________________
>>>> MITgcm-devel mailing list
>>>> MITgcm-devel at mitgcm.org
>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel