[MITgcm-devel] Re: MITgcm vectorization on NEC SX
Martin Losch
Martin.Losch at awi.de
Thu Nov 22 09:20:37 EST 2007
Hi there,
after some time (months!) I would like to pick up this thread again
and discuss what has happened, and what should happen next, in terms
of further vectorization of the MITgcm (on the NEC SX, as this is the
only vector machine available to me).
Let me try to summarize:
I have already modified the following code (mostly following the
suggestions that Jens-Olaf Beismann has kindly provided all along):
- calc_r_star/seaice_advection and other routines: error catching is
moved out of loops
- seaice_lsr is vectorized, turned on with cpp-flag SEAICE_VECTORIZE_LSR
- exf_radiation/bulkformulae/mapfields: rearranged loops and
if-blocks, added extra 2d fields
- mom_calc_visc
- gad_os7mp_adv_*: completed loops, should also help adjoint
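The "error catching moved out of loops" item can be sketched in C; this is a hypothetical illustration of the general pattern, not MITgcm code (the actual routines are Fortran, and the names here are made up):

```c
/* Hypothetical sketch of "error catching moved out of loops": the
 * hot loop only accumulates a count (a plain reduction, which a
 * vector compiler handles well); the diagnosis and any early exit
 * happen once, after the loop. */
int count_nonpositive(const double *h, int n)
{
    int bad = 0;
    for (int i = 0; i < n; i++)   /* no branch out of the loop */
        bad += (h[i] <= 0.0);
    return bad;
}

/* error path, kept outside the hot loop: locate the first bad
 * entry only when the cheap check above says there is one */
int first_nonpositive(const double *h, int n)
{
    if (count_nonpositive(h, n) == 0)
        return -1;
    for (int i = 0; i < n; i++)
        if (h[i] <= 0.0)
            return i;
    return -1;
}
```

In the Fortran routines the same pattern shows up as a flag or counter accumulated inside the k/j/i loops, with the STOP (or error print) issued outside them.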
With these and some other minor modifications I have been able to run
a 243x171x33 Arctic configuration with seaice (lsr), exf (but no
exf_interp), kpp, daily forcing, and deltaT=900sec with satisfactory
efficiency. The average vector operation ratio is 99.2%, and I achieve
6941 MFLOPS on one CPU. See the attached "example_ftrace.tgz" for a
detailed "ftrace" analysis. From this analysis you'll see that I
spend most of the time in mom_calc_visc, because I use the Leith
scheme. The most apparent "problems" are associated with
exch_rl_send_put/recv_get_x (together they take 3% of the time) and
with "ucase" (0.8%).
When there is more output, the mnc routines start to become more
important (e.g., mnc_get_ind, though here it takes very little time),
and the monitor routines start to matter with more monitor output.
The reason for the exchange routines being slow is simple: the inner
loop (i-loop) is short (=Olx) and vectorizing it is inefficient.
Switching the loop order (together with a compiler directive) would
improve this situation a lot; likewise, the barrier calls at the end
of these routines should be inlined or not called at all (remember an
earlier posting of Jean-Michel, where he suggests putting these calls
into IF (nSx.gt.1.or.nSy.gt.1) blocks).
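Why a short inner loop hurts can be sketched as follows. OLX and SNY are made-up stand-ins for Olx and sNy, and since this is C the unit-stride index is the last one rather than the first, but the trade-off is the same: the interchanged version trades unit stride for a long vector length, which is why it needs a directive before the vectorizer will take it.

```c
#include <string.h>

#define OLX 4     /* halo width: the short dimension (stand-in for Olx) */
#define SNY 128   /* interior extent: the long dimension (stand-in for sNy) */

/* i innermost: the vectorized loop has a trip count of only OLX */
void pack_short_inner(double a[SNY][OLX], double buf[SNY][OLX])
{
    for (int j = 0; j < SNY; j++)
        for (int i = 0; i < OLX; i++)
            buf[j][i] = a[j][i];
}

/* interchanged: the long j loop is innermost, so the vector length
 * is SNY; the accesses are now strided, which is where the compiler
 * directive comes in */
void pack_long_inner(double a[SNY][OLX], double buf[SNY][OLX])
{
    for (int i = 0; i < OLX; i++)
        for (int j = 0; j < SNY; j++)
            buf[j][i] = a[j][i];
}

/* sanity check: both loop orders produce identical buffers */
int pack_orders_agree(void)
{
    static double a[SNY][OLX], b1[SNY][OLX], b2[SNY][OLX];
    for (int j = 0; j < SNY; j++)
        for (int i = 0; i < OLX; i++)
            a[j][i] = 10.0 * j + i;
    pack_short_inner(a, b1);
    pack_long_inner(a, b2);
    return memcmp(b1, b2, sizeof b1) == 0;
}
```

The interchange changes nothing about the result, only which loop the hardware vectorizes over.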
"ucase" cannot be vectorized ... I don't know what to do about this,
but making text output upper case cost almost 1% (or only 1% as you
wish).
In a different configuration with more forcing input, the IO becomes
an issue, because the GFS cannot handle small slabs of IO too well.
Open questions:
1. Inlining: according to Jens-Olaf, quite important. The SX compilers
do automatic inlining only if the routine to be inlined is shorter
than some defined number of lines (can be set with a compiler flag)
AND the routine to be inlined is in the same file as the calling
routine. With compiler flags, external routines can be inlined: one
specifies the name of the routine and the file where the compiler
should look for it. However, this depends on the packages that you
want to load, because if the compiler does not find the routine it is
supposed to inline, it returns an error. If possible, we should find
a flexible way of doing this.
2. exchange routines:
a. I can do the changes suggested above if you agree with them (see
also my earlier emails about this).
b. In a second step, I/we may want to have special exchange routines
for the SX8, in particular for CG2D, as this is most expensive in
this respect. Jens-Olaf has come up with a solution which is very ad
hoc and not at all general, but it may be a way to go. This is a
big project, but according to Jens-Olaf: "the number of exchange
calls used to exchange fields over processor boundaries and the
number of subroutine layers between the call and the actual MPI call
make the communication routines the most expensive ones."
3. !CDIR OUTERUNROLL=10 in CG2D gives a 30% speed-up of cg2d: do we
want to have directives like that in the code, maybe within #ifdef
TARGET_NEC_SX blocks? Stupid question: can we have a parameter (CPP
macro), say OURNR, in SIZE.h or CG2D.h, that can be set with
-DOURNR=10, so that we have a more flexible way of writing the
directive, like this:
!CDIR OUTERUNROLL=OURNR
Is that possible (I haven't tried)? If so, do we want this? If not,
how can we record this type of optimization so that users can easily
put it in their code?
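I have not tried it on the SX either, but the mechanism can be illustrated with the C preprocessor. OURNR here is the hypothetical macro from the question; whether the CPP pass that is run over the .F files substitutes inside !CDIR lines is exactly what would need testing.

```c
/* default unroll depth; override at build time with -DOURNR=<n>,
 * the same way a macro would reach a .F file through CPP */
#ifndef OURNR
#define OURNR 10
#endif

#define STR_(x) #x
#define STR(x) STR_(x)

/* CPP substitutes OURNR (and here stringifies it) before the
 * compiler ever sees the line, so the directive text carries the
 * literal number */
const char directive[] = "!CDIR OUTERUNROLL=" STR(OURNR);
```

With the default, directive holds the text "!CDIR OUTERUNROLL=10"; building with -DOURNR=8 would change it to "!CDIR OUTERUNROLL=8" without touching the source.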
4. special io routines that read the forcing files more efficiently.
5. diagnostics: there is a 5% overhead associated with the
diagnostics package (just turning it on without defining any
diagnostic in data.diagnostics), probably because diagnostics_fill is
called so often. The solution would be to replace every
"call diagnostics_fill (diagnostic)"
by
"IF (diagnostic_is_on(diagnostic)) call diagnostics_fill (diagnostic)"
A lot of work and ugly ...
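A hypothetical C rendering of the proposed guard (diagnostics_fill and diagnostic_is_on stand in for the Fortran routines; the ids and the active set are made up):

```c
static int fill_calls = 0;   /* counts how often the costly fill runs */

/* stand-in for diagnostics_fill: even an inactive diagnostic pays
 * for the call itself when invoked unconditionally */
void diagnostics_fill(int id)
{
    (void)id;
    fill_calls++;
}

/* cheap test, a stand-in for the proposed diagnostic_is_on */
int diagnostic_is_on(int id)
{
    return id == 3;          /* pretend only diagnostic 3 is active */
}

/* guarded form: the call overhead is skipped entirely for all
 * inactive diagnostics */
int run_guarded(int ndiags)
{
    fill_calls = 0;
    for (int id = 0; id < ndiags; id++)
        if (diagnostic_is_on(id))
            diagnostics_fill(id);
    return fill_calls;
}
```

Out of ndiags candidate fills, only the active one results in an actual call, which is the point of the guard.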
These are the most important issues that I came across (before I go
ahead and change my configuration (o:; thsice requires a lot of work,
which I will not go into now, and I have not tried a cubed-sphere run
yet). The list reflects my personal ranking. Points 2a and 3 are
relatively easy for me to do, and I will do them if you agree (also
to a subset of them).
There are also many other small things; e.g., in find_rho it would be
possible to speed things up, but that's for later.
Please let me know what I can do now, and what I should not bother
thinking about because it is not likely to make it into the
repository anyway ...
Martin
PS. Enjoy your Thanksgiving weekend!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: example_ftrace.gz
Type: application/x-gzip
Size: 16192 bytes
Desc: not available
URL: <http://mitgcm.org/pipermail/mitgcm-devel/attachments/20071122/52310f36/attachment.gz>