[MITgcm-devel] Re: MITgcm vectorization on NEC SX
Martin Losch
Martin.Losch at awi.de
Thu Nov 22 09:20:37 EST 2007
Hi there,
after some time (months!) I would like to pick up this thread again
and discuss what has happened, and what should happen next, in terms
of further vectorization of the MITgcm (on the NEC SX, as this is the
only vector machine available to me).
Let me try to summarize:
I have already modified the following code (mostly following the
suggestions that Jens-Olaf Beismann has kindly provided all along):
- calc_r_star/seaice_advection and other routines: error catching is
moved out of loops
- seaice_lsr is vectorized, turned on with cpp-flag SEAICE_VECTORIZE_LSR
- exf_radiation/bulkformulae/mapfields: rearranged loops and
if-blocks, added extra 2d fields
- mom_calc_visc
- gad_os7mp_adv_*: completed loops, should also help adjoint
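The "error catching moved out of loops" item can be sketched in C; this is a hypothetical illustration of the general pattern, not MITgcm code (the actual routines are Fortran, and the names here are made up):

```c
/* Hypothetical sketch of "error catching moved out of loops": the
 * hot loop only accumulates a count (a plain reduction, which a
 * vector compiler handles well); the diagnosis and any early exit
 * happen once, after the loop. */
int count_nonpositive(const double *h, int n)
{
    int bad = 0;
    for (int i = 0; i < n; i++)   /* no branch out of the loop */
        bad += (h[i] <= 0.0);
    return bad;
}

/* error path, kept outside the hot loop: locate the first bad
 * entry only when the cheap check above says there is one */
int first_nonpositive(const double *h, int n)
{
    if (count_nonpositive(h, n) == 0)
        return -1;
    for (int i = 0; i < n; i++)
        if (h[i] <= 0.0)
            return i;
    return -1;
}
```

In the Fortran routines the same pattern shows up as a flag or counter accumulated inside the k/j/i loops, with the STOP (or error print) issued outside them.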
With these and some other minor modifications I have been able to run
a 243x171x33 Arctic configuration with seaice (lsr), exf (but no
exf_interp), kpp, daily forcing, and deltaT=900sec with satisfactory
efficiency. The average vector operation ratio is 99.2%, and I achieve
6941 MFLOPS on one CPU. See the attached "example_ftrace.tgz" for a
detailed "ftrace" analysis. From this analysis you'll see that I
spend most of the time in mom_calc_visc, because I use the Leith
scheme. The most apparent "problems" are associated with
exch_rl_send_put/recv_get_x (together they take 3% of the time) and
with "ucase" (0.8%).
When there is more output, the mnc routines start to become more
important (e.g., mnc_get_ind, though here it takes very little time),
and the monitor routines start to matter with more monitor output.
The reason for the exchange routines being slow is simple: the inner
loop (i-loop) is short (=Olx) and vectorizing it is inefficient.
Switching the loop order (together with a compiler directive) would
improve this situation a lot; likewise, the barrier calls at the end
of these routines should be inlined or not called at all (remember an
earlier posting of Jean-Michel, where he suggests putting these calls
into IF (nSx.gt.1.or.nSy.gt.1) blocks).
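Why a short inner loop hurts can be sketched as follows. OLX and SNY are made-up stand-ins for Olx and sNy, and since this is C the unit-stride index is the last one rather than the first, but the trade-off is the same: the interchanged version trades unit stride for a long vector length, which is why it needs a directive before the vectorizer will take it.

```c
#include <string.h>

#define OLX 4     /* halo width: the short dimension (stand-in for Olx) */
#define SNY 128   /* interior extent: the long dimension (stand-in for sNy) */

/* i innermost: the vectorized loop has a trip count of only OLX */
void pack_short_inner(double a[SNY][OLX], double buf[SNY][OLX])
{
    for (int j = 0; j < SNY; j++)
        for (int i = 0; i < OLX; i++)
            buf[j][i] = a[j][i];
}

/* interchanged: the long j loop is innermost, so the vector length
 * is SNY; the accesses are now strided, which is where the compiler
 * directive comes in */
void pack_long_inner(double a[SNY][OLX], double buf[SNY][OLX])
{
    for (int i = 0; i < OLX; i++)
        for (int j = 0; j < SNY; j++)
            buf[j][i] = a[j][i];
}

/* sanity check: both loop orders produce identical buffers */
int pack_orders_agree(void)
{
    static double a[SNY][OLX], b1[SNY][OLX], b2[SNY][OLX];
    for (int j = 0; j < SNY; j++)
        for (int i = 0; i < OLX; i++)
            a[j][i] = 10.0 * j + i;
    pack_short_inner(a, b1);
    pack_long_inner(a, b2);
    return memcmp(b1, b2, sizeof b1) == 0;
}
```

The interchange changes nothing about the result, only which loop the hardware vectorizes over.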
"ucase" cannot be vectorized ... I don't know what to do about this,
but making text output upper case cost almost 1% (or only 1% as you
wish).
In a different configuration with more forcing input, the IO becomes
an issue, because the GFS cannot handle small slabs of IO too well.
Open questions:
1. Inlining: according to Jens-Olaf, quite important. The SX compilers
do automatic inlining only if the routine to be inlined is shorter
than some defined number of lines (can be set with a compiler flag)
AND the routine to be inlined is in the same file as the calling
routine. With compiler flags, external routines can be inlined: one
specifies the name of the routine and the file where the compiler
should look for it. However, this depends on the packages that you
want to load, because if the compiler does not find the routine it is
supposed to inline, it returns an error. If possible, we should find
a flexible way of doing this.
2. exchange routines:
a. I can do the changes suggested above if you agree with them (see
also my earlier emails about this).
b. In a second step, I/we may want to have special exchange routines
for the SX8, in particular for CG2D, as this is most expensive in
this respect. Jens-Olaf has come up with a solution which is very ad
hoc and not at all general, but it may be a way to go. This is a
big project, but according to Jens-Olaf: "the number of exchange
calls used to exchange fields over processor boundaries and the
number of subroutine layers between the call and the actual MPI call
make the communication routines the most expensive ones."
3. !CDIR OUTERUNROLL=10 in CG2D gives a 30% speed-up of cg2d: do we
want to have directives like that in the code, maybe within #ifdef
TARGET_NEC_SX blocks? Stupid question: can we have a parameter (CPP
macro), say OURNR, in SIZE.h or CG2D.h, that can be set with
-DOURNR=10, so that we have a more flexible way of writing the
directive, like this:
!CDIR OUTERUNROLL=OURNR
Is that possible (I haven't tried)? If so, do we want this? If not,
how can we record this type of optimization so that users can easily
put it in their code?
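I have not tried it on the SX either, but the mechanism can be illustrated with the C preprocessor. OURNR here is the hypothetical macro from the question; whether the CPP pass that is run over the .F files substitutes inside !CDIR lines is exactly what would need testing.

```c
/* default unroll depth; override at build time with -DOURNR=<n>,
 * the same way a macro would reach a .F file through CPP */
#ifndef OURNR
#define OURNR 10
#endif

#define STR_(x) #x
#define STR(x) STR_(x)

/* CPP substitutes OURNR (and here stringifies it) before the
 * compiler ever sees the line, so the directive text carries the
 * literal number */
const char directive[] = "!CDIR OUTERUNROLL=" STR(OURNR);
```

With the default, directive holds the text "!CDIR OUTERUNROLL=10"; building with -DOURNR=8 would change it to "!CDIR OUTERUNROLL=8" without touching the source.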
4. special io routines that read the forcing files more efficiently.
5. diagnostics: there is a 5% overhead associated with the
diagnostics package (just turning it on without defining any
diagnostic in data.diagnostics), probably because diagnostics_fill is
called so often. The solution would be to replace every
"call diagnostics_fill (diagnostic)"
by
"IF (diagnostic_is_on(diagnostic)) call diagnostics_fill (diagnostic)"
A lot of work and ugly ...
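A hypothetical C rendering of the proposed guard (diagnostics_fill and diagnostic_is_on stand in for the Fortran routines; the ids and the active set are made up):

```c
static int fill_calls = 0;   /* counts how often the costly fill runs */

/* stand-in for diagnostics_fill: even an inactive diagnostic pays
 * for the call itself when invoked unconditionally */
void diagnostics_fill(int id)
{
    (void)id;
    fill_calls++;
}

/* cheap test, a stand-in for the proposed diagnostic_is_on */
int diagnostic_is_on(int id)
{
    return id == 3;          /* pretend only diagnostic 3 is active */
}

/* guarded form: the call overhead is skipped entirely for all
 * inactive diagnostics */
int run_guarded(int ndiags)
{
    fill_calls = 0;
    for (int id = 0; id < ndiags; id++)
        if (diagnostic_is_on(id))
            diagnostics_fill(id);
    return fill_calls;
}
```

Out of ndiags candidate fills, only the active one results in an actual call, which is the point of the guard.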
These are the most important issues that I came across (before I go
ahead and change my configuration (o:; thsice requires a lot of work,
which I will not go into now, and I have not tried a cubed-sphere run
yet). The list reflects my personal ranking. Points 2a and 3 are
relatively easy for me to do, and I will do them if you agree (also
to a subset of them).
There are also many other small things; e.g., in find_rho it would be
possible to speed things up, but that's for later.
Please let me know what I can do now, and what I should not bother
thinking about because it is not likely to make it into the
repository anyway ...
Martin
PS. Enjoy your Thanksgiving weekend!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: example_ftrace.gz
Type: application/x-gzip
Size: 16192 bytes
Desc: not available
URL: <http://mitgcm.org/pipermail/mitgcm-devel/attachments/20071122/52310f36/attachment.gz>