[MITgcm-devel] Re: MITgcm vectorization on NEC SX

Jean-Michel Campin jmc at ocean.mit.edu
Mon Nov 26 12:19:33 EST 2007


Hi Martin,

I have a few comments:
a) regarding I/O: without singleCpuIO, MDSIO reads/writes
one line at a time; with singleCpuIO a 2-D field is read/written
in one instruction. Does this make a difference in our run?
(Not so easy to check; maybe the easiest would be to try on one
processor only.)

b) regarding BARRIER :
the IF (nSx.gt.1.or.nSy.gt.1) solution will remove the BARRIER
at compilation time. A more selective way to avoid the
BARRIER call is just to add IF ( nThreads.GT.1 ), but nThreads
is not known at compilation stage.
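For concreteness, the two guards would look like this (just a sketch;
nSx and nSy are PARAMETERs from SIZE.h, so the compiler can drop the
first block entirely when nSx=nSy=1, whereas nThreads is only set at
run time, so the second block always survives compilation):

      IF ( nSx.GT.1 .OR. nSy.GT.1 ) THEN
C     compile-time test: whole block removed when nSx=nSy=1
        CALL BARRIER( myThid )
      ENDIF

      IF ( nThreads.GT.1 ) THEN
C     run-time test: more selective, but the call site stays
        CALL BARRIER( myThid )
      ENDIF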

c) point 1 (inlining): I remember talking to Constantinos,
and it does not seem so easy, since the syntax of the
inlining directive is very specific to each compiler.
And I fully agree with you, we should find a flexible solution.

d) point 3 (directives): I am in favor of adding a few directives
if - they bring a significant improvement
   - they are in S/R that don't change so often (like cg2d)
   - they don't cause too much trouble with TAF (no worry with cg2d)
Don't know which solution is the best, but I trust you.

e) diagnostics:
I checked the code again, and when a diagnostic is not turned on,
the function call "diagnostics_is_on" and the subroutine call
"diagnostics_fill" do pretty much the same thing.
Unless your compiler handles a function call much more efficiently
than a subroutine call (and I know some compilers that don't),
I have the impression that you will not save much time here.
On the other hand, if the issue is the vectorisation of the 2 loops in 
diagnostics_fill (DO n=1,nlists &  DO m=1,nActive(n)) because of
the stuff inside the loops, then we could change diagnostics_fill
to first check only whether the diagnostic is turned on, and then
CALL DIAGNOSTICS_FILL_FIELD if it is.
Also, in the case where many diagnostics are turned on,
the modification you propose will slow down the code, since
it means checking the list of active diagnostics twice
(instead of once).
In summary, I would not go for changing diagnostics unless we have
a serious test (with few diagnostics and with many) which supports
a significant performance improvement.
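Just to fix ideas, the restructured routine could look like the sketch
below (DIAGNOSTICS_FILL_FIELD is the hypothetical name used above, and
the real argument lists are longer than shown here):

      SUBROUTINE DIAGNOSTICS_FILL( inpFld, diagName, myThid )
      LOGICAL  DIAGNOSTICS_IS_ON
      EXTERNAL DIAGNOSTICS_IS_ON
C     cheap activity test first; the heavy, vectorizable copy
C     loops move to a separate routine
      IF ( DIAGNOSTICS_IS_ON( diagName, myThid ) ) THEN
        CALL DIAGNOSTICS_FILL_FIELD( inpFld, diagName, myThid )
      ENDIF
      RETURN
      END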

I'll leave the exchange discussion to Chris.

Cheers,
Jean-Michel

On Thu, Nov 22, 2007 at 03:20:37PM +0100, Martin Losch wrote:
> Hi there,
> 
> after some time (months!) I would like to pick up this thread again  
> and discuss with you, what has and what should happen in terms of  
> further vectorization of the MITgcm (on NEC SX, as this is the only  
> vector machine available to me).
> 
> Let me try to summarize:
> 
> I have already modified the following code (mostly following Jens- 
> Olaf Beismann's suggestions, which you kindly provided all along):
> - calc_r_star/seaice_advection and other routines: error catching is  
> moved out of loops
> - seaice_lsr is vectorized, turned on with cpp-flag SEAICE_VECTORIZE_LSR
> - exf_radiation/bulkformulae/mapfields: rearranged loops and if- 
> block, added extra 2d fields
> - mom_calc_visc
> - gad_os7mp_adv_*: completed loops, should also help adjoint
> 
> with these and some other minor modifications I have been able to run  
> a 243x171x33 arctic configuration with seaice (lsr), exf (but no  
> exf_interp), kpp, daily forcing and deltaT=900sec with satisfactory  
> efficiency. The average vector operation ratio is 99.2% and I achieve  
> 6941MFLOPS on one CPU. See the attached "example_ftrace.tgz" for a  
> detailed "ftrace" analysis. From this analysis you'll see that I  
> spend most time in mom_calc_visc, because I use the Leith scheme. The  
> most apparent "problems" are associated with exch_rl_send_put/ 
> recv_get_x, together they take 3% of the time, and "ucase" (0.8%).  
> Then when there is more output, mnc-routines start to become more  
> important, e.g., mnc_get_ind (here very little time), or the monitor  
> routines start to matter with more monitor output.
> 
> The reason for the exchange routines being slow is simple: the inner  
> loop (i-loop) is short (=Olx) and vectorizing it is inefficient.  
> Switching the loop order (together with a compiler directive) would  
> improve this situation a lot; likewise the barrier calls at the end  
> of these routines should be inlined or not even called (remember  
> an earlier posting of Jean-Michel, where he suggests putting these  
> calls into IF (nSx.gt.1.or.nSy.gt.1)-blocks).
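> 
> The loop-order change I have in mind would look roughly like this
> (a sketch with made-up array names, not the actual exchange code;
> the directive name is from memory):
> 
>       DO j = 1-Oly, sNy+Oly
>        DO i = 1, Olx
> C      original: inner i-loop has only Olx iterations,
> C      far too short to fill the vector pipes
>         buf(i,j) = fld(i,j)
>        ENDDO
>       ENDDO
> 
> !CDIR NODEP
>       DO i = 1, Olx
>        DO j = 1-Oly, sNy+Oly
> C      swapped: the long j-loop is innermost and vectorizes well;
> C      the directive helps despite the non-unit stride
>         buf(i,j) = fld(i,j)
>        ENDDO
>       ENDDO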
> "ucase" cannot be vectorized ... I don't know what to do about this,  
> but making text output upper case costs almost 1% (or only 1%, as  
> you wish).
> 
> In a different configuration with more forcing input, the IO becomes  
> an issue, because the GFS cannot handle small slabs of IO too well.
> 
> Open questions:
> 1. Inlining: according to Jens-Olaf quite important. The SX compilers  
> do automatic inlining only if the routine to be inlined is shorter  
> than some defined number of lines (can be set as a compiler flag) AND  
> the routine to be inlined is in the same file as the calling routine.  
> With compiler flags, external routines can be inlined: one specifies  
> the name of the routine and the filename where the compiler should  
> look for it. However, this depends on the packages that you want to  
> use, because if the compiler does not find the routine it's supposed  
> to inline it returns an error. If possible, we should find a flexible  
> way of doing this.
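> 
> As an illustration of the flag-based variant (the option spellings
> here are from memory and should be checked against the sxf90 manual):
> 
>   sxf90 -pi line=200 expin=barrier.f -c exch_rl_send_put_x.f
> 
> i.e. -pi turns on inlining, line= sets the size limit for automatic
> inlining, and expin= names the file to search for routines to be
> inlined; if barrier.f is not part of the build, the compiler stops
> with an error, which is exactly the package-dependence problem.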
> 
> 2. exchange routines:
> a. I can do the changes suggested above if you agree with them (see  
> also my earlier emails about this).
> b. In a second step, I/we may want to have special exchange routines  
> for the SX8, in particular for CG2D, as this is most expensive in  
> this respect. Jens-Olaf has come up with a solution, which is very  
> ad hoc and not at all general, but that may be a way to go. This is a  
> big project, but according to Jens-Olaf: "the number of exchange  
> calls used to exchange fields over processor boundaries and the  
> number of subroutine layers between the call and the actual MPI call  
> make the communication routines the most expensive ones."
> 
> 3. !CDIR OUTERUNROLL=10 in CG2D gives a 30% speed-up of cg2d: Do we  
> want to have directives like that in the code, maybe within #ifdef  
> TARGET_NEC_SX? A stupid question: Can we have a Fortran parameter,  
> say OURNR, in SIZE.h or CG2D.h, or one that can be set with  
> -DOURNR=10, so that we have a more flexible way of writing the  
> directive, like this:
> !CDIR OUTERUNROLL=OURNR
> Is that possible (haven't tried)? If so, do we want this?
> If not, how can we record this type of optimization so that users can  
> easily put this in their code?
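> 
> A Fortran PARAMETER alone cannot work, by the way, because the
> directive line is a comment to the Fortran compiler; but since our
> files go through CPP anyway, a preprocessor symbol should in
> principle be substituted even on the directive line. An untested
> sketch:
> 
> #ifndef OURNR
> #define OURNR 10
> #endif
> #ifdef TARGET_NEC_SX
> !CDIR OUTERUNROLL=OURNR
> #endif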
> 
> 4. special io routines that read the forcing files more efficiently.
> 
> 5. diagnostics: there is a 5% overhead associated with the  
> diagnostics package (just turning it on without defining any  
> diagnostic in data.diagnostics), probably because diagnostics_fill is  
> called so often. The solution would be to replace every
> "call diagnostics_fill (diagnostic)"
> by
> "IF (diagnostic_is_on(diagnostic)) call diagnostics_fill (diagnostic)"
> A lot of work and ugly ...
> 
> These are the most important issues that I came across (before I go  
> ahead and change my configuration (o:, thsice requires a lot of  
> work, which I will not go into now, and I have not tried a cubed- 
> sphere run yet). The list reflects my personal ranking. Points 2a  
> and 3 are relatively easy for me to do, and I will do them if you  
> agree (possibly also only a subset of these).
> 
> There are also many other small things, e.g. in find_rho it would be  
> possible to speed things up, but that's for later.
> 
> Please let me know what I can do now, and what I should not bother  
> thinking about, because it is not likely to make it into the  
> repository anyway ...
> 
> Martin
> 
> PS. Enjoy your Thanksgiving weekend!
> 
> 
> _______________________________________________
> MITgcm-devel mailing list
> MITgcm-devel at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-devel



