[MITgcm-devel] Re: MITgcm vectorization on NEC SX
chris hill
cnh at mit.edu
Mon Dec 3 12:24:16 EST 2007
Hi Martin,
Exchanging the loops is definitely fine and makes sense. It needs
testing once it is done, though. Most of the iB's should work fine if
you change both sides of the loops. The iB0 value formula will need to
change - which looks a bit confusing in reverse mode!
Two general thoughts, however:
1 - from your earlier note I got the impression that the overhead was
a few %. If that's right then, unless you are scaling to large numbers
of SX procs, the impact on time to solution won't be much?
2 - I suspect Jens-Olaf isn't going to like exch2! It would be good if
he could look at that soon.
Chris
Martin Losch wrote:
> Chris,
>
> do you have any thoughts on the loop exchange? There are also some iB's
> and iB0's that need to be taken care of if the loops are exchanged,
> which I don't quite understand.
>
> Martin
>
> On 26 Nov 2007, at 18:19, Jean-Michel Campin wrote:
>
>> Hi Martin,
>>
>> I have a few comments:
>> a) regarding I/O: without singleCpuIO, MDSIO reads/writes
>> one line at a time; with singleCpuIO, a 2-D field is read/written
>> in one instruction. Does this make a difference in our runs
>> (not so easy to check; maybe the easiest would be on one processor
>> only)?
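>>
>> In code, the contrast is roughly this (a sketch only: READ_SLAB and
>> useWholeField are made-up names, the record numbering is invented,
>> and each mode assumes the file was OPENed with the matching record
>> length):
>>
>>       SUBROUTINE READ_SLAB( dUnit, irec, useWholeField, fld )
>> C     contrast one-record-per-row I/O with a single whole-field read
>>       INTEGER sNx, sNy
>>       PARAMETER ( sNx=243, sNy=171 )
>>       INTEGER dUnit, irec, i, j
>>       LOGICAL useWholeField
>>       Real*8 fld(sNx,sNy)
>>       IF ( useWholeField ) THEN
>> C      like singleCpuIO: one large request for the whole 2-D field
>>        READ(dUnit,rec=irec) fld
>>       ELSE
>> C      like the default MDSIO: many small requests, one per row
>>        DO j=1,sNy
>>         READ(dUnit,rec=irec+j-1) ( fld(i,j), i=1,sNx )
>>        ENDDO
>>       ENDIF
>>       RETURN
>>       END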
>>
>> b) regarding BARRIER:
>> the IF (nSx.gt.1.or.nSy.gt.1) solution will remove the BARRIER
>> at compilation time. A more selective way to avoid the
>> BARRIER call is just to add IF ( nThreads.GT.1 ), but this
>> is not known at compilation time.
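>>
>> In code, the two variants would look something like this (a sketch
>> only):
>>
>> C     nSx and nSy are PARAMETERs from SIZE.h, so when nSx=nSy=1 the
>> C     compiler can discard the whole block at compile time:
>>       IF ( nSx.GT.1 .OR. nSy.GT.1 ) THEN
>>        CALL BARRIER( myThid )
>>       ENDIF
>>
>> C     nThreads is only set at run time, so this test is executed on
>> C     every call, but it still skips the BARRIER whenever the run is
>> C     single-threaded:
>>       IF ( nThreads.GT.1 ) THEN
>>        CALL BARRIER( myThid )
>>       ENDIF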
>>
>> c) point 1 (inlining): I remember talking to Constantinos,
>> and it does not seem so easy, since the syntax of the
>> inlining instruction is very specific to one compiler.
>> And I fully agree with you, we should find a flexible solution.
>>
>> d) point 3 (directive): I am in favor of adding a few directives
>> if - they give a significant improvement
>> - they are in S/R that don't change so often (like cg2d)
>> - they are not too much trouble for TAF (no worry with cg2d)
>> I don't know which solution is the best, but I trust you.
>>
>> e) diagnostics:
>> I checked the code again, and if the diagnostics package is not
>> turned off, the function call "diagnostics_is_on" and the subroutine
>> "diagnostics_fill" are doing pretty much the same thing.
>> Unless your compiler is much more efficient for a function
>> call compared to a subroutine call (but I know some compilers
>> that aren't), I have the impression that you will not save much
>> time here.
>> On the other hand, if the issue is the vectorisation of the 2 loops in
>> diagnostics_fill (DO n=1,nlists & DO m=1,nActive(n)) because of
>> the stuff inside the loops, then we could change diagnostics_fill
>> to first check only whether the diagnostic is turned on, and then
>> CALL DIAGNOSTICS_FILL_FIELD if that is the case.
>> Also, in the case where many diagnostics are turned on,
>> the modification you propose will slow down the code, since
>> it means checking the list of active diagnostics twice
>> (instead of once).
>> In summary, I would not go for changing diagnostics unless we have
>> a serious test (with few diagnostics and with many) which supports
>> a significant performance improvement.
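>>
>> The restructuring in the second case would look roughly like this
>> (a sketch only: nlists and nActive are the existing loop bounds,
>> while cdiag and DIAGNOSTICS_FILL_FIELD are made-up names standing
>> for the active-diagnostics list and for the actual fill work):
>>
>>       DO n=1,nlists
>>        DO m=1,nActive(n)
>> C       keep only the cheap name comparison inside the scan loops
>>         IF ( diagName .EQ. cdiag(m,n) ) THEN
>> C        the heavy, vectorizable copy/increment moves in here
>>          CALL DIAGNOSTICS_FILL_FIELD( inpFld, diagName, m, n, myThid )
>>         ENDIF
>>        ENDDO
>>       ENDDO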
>>
>> And I leave the exchange discussion to Chris.
>>
>> Cheers,
>> Jean-Michel
>>
>> On Thu, Nov 22, 2007 at 03:20:37PM +0100, Martin Losch wrote:
>>> Hi there,
>>>
>>> after some time (months!) I would like to pick up this thread again
>>> and discuss with you what has happened, and what should happen, in
>>> terms of further vectorization of the MITgcm (on NEC SX, as this is
>>> the only vector machine available to me).
>>>
>>> Let me try to summarize:
>>>
>>> I have already modified the following code (mostly following Jens-
>>> Olaf Beismann's suggestions, which you kindly provide all the time):
>>> - calc_r_star/seaice_advection and other routines: error catching is
>>> moved out of loops
>>> - seaice_lsr is vectorized, turned on with cpp-flag SEAICE_VECTORIZE_LSR
>>> - exf_radiation/bulkformulae/mapfields: rearranged loops and
>>> if-blocks, added extra 2-D fields
>>> - mom_calc_visc
>>> - gad_os7mp_adv_*: completed loops, should also help the adjoint
>>>
>>> with these and some other minor modifications I have been able to run
>>> a 243x171x33 arctic configuration with seaice (lsr), exf (but no
>>> exf_interp), kpp, daily forcing and deltaT=900sec with satisfactory
>>> efficiency. The average vector operation ratio is 99.2% and I achieve
>>> 6941 MFLOPS on one CPU. See the attached "example_ftrace.tgz" for a
>>> detailed "ftrace" analysis. From this analysis you'll see that I
>>> spend most of the time in mom_calc_visc, because I use the Leith
>>> scheme. The most apparent "problems" are associated with
>>> exch_rl_send_put/recv_get_x, which together take 3% of the time, and
>>> "ucase" (0.8%). When there is more output, the mnc routines start to
>>> become more important, e.g., mnc_get_ind (here very little time), and
>>> the monitor routines start to matter with more monitor output.
>>>
>>> The reason for the exchange routines being slow is simple: the inner
>>> loop (i-loop) is short (=Olx) and vectorizing it is inefficient.
>>> Switching the loop order (together with a compiler directive) would
>>> improve this situation a lot; likewise, the barrier calls at the end
>>> of these routines should be inlined or not even called (remember an
>>> earlier posting of Jean-Michel, where he suggests putting these
>>> calls into IF (nSx.gt.1.or.nSy.gt.1)-blocks).
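>>>
>>> Concretely, the loop switch would look something like this (schematic
>>> only, with made-up array names, not the actual exch code; the unroll
>>> factor is arbitrary):
>>>
>>> C     now: the inner i-loop has only Olx iterations -> short vectors
>>>       DO j=1,sNy
>>>        DO i=1,Olx
>>>         fld(sNx+i,j) = recvBufX(i,j)
>>>        ENDDO
>>>       ENDDO
>>>
>>> C     switched: the long j-loop is innermost and vectorizes well; a
>>> C     directive can additionally unroll the short outer loop
>>> #ifdef TARGET_NEC_SX
>>> !CDIR OUTERUNROLL=4
>>> #endif
>>>       DO i=1,Olx
>>>        DO j=1,sNy
>>>         fld(sNx+i,j) = recvBufX(i,j)
>>>        ENDDO
>>>       ENDDO
>>>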
>>> "ucase" cannot be vectorized ... I don't know what to do about this,
>>> but making text output upper case costs almost 1% (or only 1%, as
>>> you wish).
>>>
>>> In a different configuration with more forcing input, the I/O becomes
>>> an issue, because the GFS cannot handle small slabs of I/O very well.
>>>
>>> Open questions:
>>> 1. Inlining: according to Jens-Olaf quite important. The SX compilers
>>> do automatic inlining only if the routine to be inlined is shorter
>>> than some defined number of lines (which can be set as a compiler
>>> flag) AND the routine to be inlined is in the same file as the
>>> calling routine. With compiler flags, external routines can be
>>> inlined: one specifies the name of the routine and the filename where
>>> the compiler should look for it. However, this depends on the
>>> packages that you want to use, because if the compiler does not find
>>> the routine it is supposed to inline, it returns an error. If
>>> possible, we should find a flexible way of doing this.
>>>
>>> 2. exchange routines:
>>> a. I can do the changes suggested above if you agree with them (see
>>> also my earlier emails about this).
>>> b. In a second step, I/we may want to have special exchange routines
>>> for the SX8, in particular for CG2D, as this is the most expensive in
>>> this respect. Jens-Olaf has come up with a solution which is very
>>> ad hoc and not at all general, but that may be a way to go? This is a
>>> big project, but according to Jens-Olaf: "the number of exchange
>>> calls used to exchange fields over processor boundaries and the
>>> number of subroutine layers between the call and the actual MPI call
>>> make the communication routines the most expensive ones."
>>>
>>> 3. !CDIR OUTERUNROLL=10 in CG2D gives a 30% speed-up of cg2d: do we
>>> want to have directives like that in the code, maybe within #ifdef
>>> TARGET_NEC_SX? Stupid question: can we have a Fortran parameter, say
>>> OURNR in SIZE.h or CG2D.h, or one that can be set with -DOURNR=10, so
>>> that we have a more flexible way of doing the directive, like this:
>>> !CDIR OUTERUNROLL=OURNR
>>> Is that possible (I haven't tried)? If so, do we want this?
>>> If not, how can we record this type of optimization so that users can
>>> easily put it in their code?
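>>>
>>> To illustrate (untested): a Fortran PARAMETER presumably cannot work
>>> here, because the directive is only a comment as far as the compiler's
>>> Fortran front end is concerned, but since our .F files go through CPP
>>> first, and CPP substitutes macros on comment lines too, something like
>>> this might do (the loop body is just an illustration, not the real
>>> cg2d code):
>>>
>>> #ifndef OURNR
>>> #define OURNR 10
>>> #endif
>>> #ifdef TARGET_NEC_SX
>>> !CDIR OUTERUNROLL=OURNR
>>> #endif
>>>       DO j=1,sNy
>>>        DO i=1,sNx
>>>         cg2d_q(i,j) = cg2d_r(i,j)*aW2d(i,j)
>>>        ENDDO
>>>       ENDDO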
>>>
>>> 4. special io routines that read the forcing files more efficiently.
>>>
>>> 5. diagnostics: there is a 5% overhead associated with the
>>> diagnostics package (just turning it on without defining any
>>> diagnostic in data.diagnostics), probably because diagnostics_fill is
>>> called so often. The solution would be to replace every
>>> "call diagnostics_fill (diagnostic)"
>>> by
>>> "IF (diagnostics_is_on(diagnostic)) call diagnostics_fill (diagnostic)"
>>> A lot of work, and ugly ...
>>>
>>> These are the most important issues that I came across (before I go
>>> ahead and change my configuration (o:, thsice requires a lot of
>>> work, which I will not go into now, and I have not tried a cubed
>>> sphere run yet). The list reflects my personal ranking. Points 2a
>>> and 3 are relatively easy for me to do, and I will do them if you
>>> agree (also to a subset of these).
>>>
>>> There are also many other small things, e.g. in find_rho it would be
>>> possible to speed things up, etc., but that's for later.
>>>
>>> Please let me know what I can do now, and what I should not bother
>>> thinking about because it is not likely to make it into the
>>> repository anyway ...
>>>
>>> Martin
>>>
>>> PS. Enjoy your Thanksgiving weekend!

_______________________________________________
MITgcm-devel mailing list
MITgcm-devel at mitgcm.org
http://mitgcm.org/mailman/listinfo/mitgcm-devel