[MITgcm-devel] Re: MITgcm vectorization on NEC SX
Jean-Michel Campin
jmc at ocean.mit.edu
Mon Nov 26 12:19:33 EST 2007
Hi Martin,
I have a few comments:
a) regarding I/O: without singleCpuIO, MDSIO reads/writes
1 line at a time; with singleCpuIO a 2-D field is read/written
in 1 instruction. Does this make a difference in our run
(not so easy to check, maybe the easiest would be on 1 processor
only)?
b) regarding BARRIER :
the IF (nSx.gt.1.or.nSy.gt.1) solution will remove the BARRIER
at compilation time. A more selective way to avoid the
BARRIER call is just to add IF ( nThreads.GT.1 ) but this
is not known at compilation stage.
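A minimal sketch of the difference between the two guards (not MITgcm
code: barrier_stub, exch_end_static and exch_end_dynamic are hypothetical
names, and the tile counts are made-up values standing in for SIZE.h):

```fortran
! Sketch only: barrier_stub stands in for the real barrier call.
module barrier_sketch
  implicit none
  ! Tile counts are compile-time constants (hypothetical values).
  integer, parameter :: nSx = 1, nSy = 1
  ! The thread count is an ordinary variable, only known at run time.
  integer :: nThreads = 1
  ! Counter used only to show which guard lets the call through.
  integer :: nBarrierCalls = 0
contains
  subroutine barrier_stub()
    nBarrierCalls = nBarrierCalls + 1
  end subroutine barrier_stub

  subroutine exch_end_static()
    ! The condition is a constant expression, so the compiler can
    ! drop the whole branch when there is a single tile.
    if (nSx > 1 .or. nSy > 1) call barrier_stub()
  end subroutine exch_end_static

  subroutine exch_end_dynamic()
    ! The condition is tested on every call; the call is skipped
    ! for one thread, but the test itself stays in the code.
    if (nThreads > 1) call barrier_stub()
  end subroutine exch_end_dynamic
end module barrier_sketch

program barrier_demo
  use barrier_sketch
  implicit none
  call exch_end_static()
  call exch_end_dynamic()
  print *, 'barrier calls, 1 tile and 1 thread:', nBarrierCalls
  nThreads = 2
  call exch_end_dynamic()
  print *, 'barrier calls after nThreads = 2:  ', nBarrierCalls
end program barrier_demo
```

The static guard is coarser (it keeps the barrier whenever there is
more than one tile, threaded or not), while the run-time guard is more
selective but costs one test per call.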
c) point 1 (inlining): I remember talking to Constantinos,
and it does not seem so easy since the syntax of the
inlining instruction is very specific to each compiler.
And I fully agree with you, we should find a flexible solution.
d) point 3 (directive): I am in favor of adding a few directives
if - significant improvement
- in S/R that don't change so often (like cg2d)
- not too much trouble with TAF (no worry with cg2d)
I don't know which solution is the best, but I trust you.
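For illustration, a directive guarded by a cpp flag could look like the
sketch below. The loop nest is a stand-in, not the actual cg2d code, and
the placement of the directive just before the outer loop is an
assumption. The file has to go through the preprocessor (for example by
using a .F90 extension).

```fortran
! Sketch only: a stand-in loop nest, not the actual cg2d code.
program directive_sketch
  implicit none
  integer, parameter :: nx = 64, ny = 32
  real(kind=8) :: a(nx, ny), b(nx, ny), s
  integer :: i, j
  a = 1.0d0
  b = 2.0d0
  s = 0.0d0
  ! The directive line is only emitted when TARGET_NEC_SX is defined.
  ! Other compilers would treat it as a plain comment in any case.
#ifdef TARGET_NEC_SX
!CDIR OUTERUNROLL=10
#endif
  do j = 1, ny
    do i = 1, nx
      s = s + a(i, j) * b(i, j)
    end do
  end do
  print *, 'sum of products =', s
end program directive_sketch
```

Keeping the directive inside the ifdef limits it to one target and to
one routine, which fits the "few directives, stable S/R" criteria above.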
e) diagnostics:
I checked the code again, and if the diagnostics is not turned off,
the function call "diagnostics_is_on" and the subroutine
"diagnostics_fill" are doing pretty much the same thing.
Unless your compiler is much more efficient for a function
call compared to a subroutine call (but I know some compilers
that are not), I have the impression that you will not save much
time here.
On the other hand, if the issue is the vectorisation of the 2 loops in
diagnostics_fill (DO n=1,nlists & DO m=1,nActive(n)) because of
the stuff inside the loops, then we could change diagnostics_fill
to first check only whether the diagnostic is turned on, and then
CALL DIAGNOSTICS_FILL_FIELD if that is the case.
Also, in the case where many diagnostics are turned on,
the modification you propose will slow down the code since
it means checking 2 times (instead of 1) the list of active
diagnostics.
In summary, I would not go for changing diagnostics unless we have
a serious test (with a few diagnostics and with many) that shows
a significant performance improvement.
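A rough sketch of the "check first, then fill" variant of
diagnostics_fill mentioned above. The argument lists are heavily
simplified and the data layout (flds, nActive, the diagnostic names and
the counters) is hypothetical, chosen only to make the example run:

```fortran
! Sketch only: simplified stand-ins, not the real diagnostics package.
module diag_sketch
  implicit none
  integer, parameter :: nlists = 2, maxActive = 3
  ! Number of active diagnostics in each output list (made-up values).
  integer :: nActive(nlists) = (/ 2, 1 /)
  ! Names of the active diagnostics, one column per list.
  character(len=8) :: flds(maxActive, nlists) = reshape( &
       (/ 'THETA   ', 'SALT    ', '        ',            &
          'UVEL    ', '        ', '        ' /),         &
       (/ maxActive, nlists /))
  integer :: nFieldFills = 0
  real(kind=8) :: storedSum = 0.0d0
contains
  subroutine diagnostics_fill(field, chardiag)
    real(kind=8), intent(in) :: field(:, :)
    character(len=*), intent(in) :: chardiag
    integer :: n, m
    logical :: isOn
    ! Cheap scan: the loop body only compares names.
    isOn = .false.
    do n = 1, nlists
      do m = 1, nActive(n)
        if (flds(m, n) == chardiag) isOn = .true.
      end do
    end do
    ! The expensive part is only entered for active diagnostics.
    if (isOn) call diagnostics_fill_field(field)
  end subroutine diagnostics_fill

  subroutine diagnostics_fill_field(field)
    real(kind=8), intent(in) :: field(:, :)
    ! Stand-in for accumulating the field into diagnostics storage.
    storedSum = storedSum + sum(field)
    nFieldFills = nFieldFills + 1
  end subroutine diagnostics_fill_field
end module diag_sketch

program diag_demo
  use diag_sketch
  implicit none
  real(kind=8) :: t(4, 4)
  t = 1.0d0
  call diagnostics_fill(t, 'THETA')  ! active: reaches the fill routine
  call diagnostics_fill(t, 'ETAN')   ! inactive: returns after the scan
  print *, 'calls that reached diagnostics_fill_field:', nFieldFills
end program diag_demo
```

With this layout the calling code stays unchanged, and the list of
active diagnostics is scanned once in diagnostics_fill itself.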
I leave the exchange discussion to Chris.
Cheers,
Jean-Michel
On Thu, Nov 22, 2007 at 03:20:37PM +0100, Martin Losch wrote:
> Hi there,
>
> after some time (months!) I would like to pick up this thread again
> and discuss with you, what has and what should happen in terms of
> further vectorization of the MITgcm (on NEC SX, as this is the only
> vector machine available to me).
>
> Let me try to summarize:
>
> I have already modified the following code (mostly following
> Jens-Olaf Beismann's suggestions, that you kindly provide all the time):
> - calc_r_star/seaice_advection and other routines: error catching is
> moved out of loops
> - seaice_lsr is vectorized, turned on with cpp-flag SEAICE_VECTORIZE_LSR
> - exf_radiation/bulkformulae/mapfields: rearranged loops and
> if-blocks, added extra 2d fields
> - mom_calc_visc
> - gad_os7mp_adv_*: completed loops, should also help adjoint
>
> with these and some other minor modifications I have been able to run
> a 243x171x33 arctic configuration with seaice (lsr), exf (but no
> exf_interp), kpp, daily forcing and deltaT=900sec with satisfactory
> efficiency. The average vector operation ratio is 99.2% and I achieve
> 6941MFLOPS on one CPU. See the attached "example_ftrace.tgz" for a
> detailed "ftrace" analysis. From this analysis you'll see that I
> spend most time in mom_calc_visc, because I use the Leith scheme. The
> most apparent "problems" are associated with exch_rl_send_put/
> recv_get_x, together they take 3% of the time, and "ucase" (0.8%).
> Then when there is more output, mnc-routines start to become more
> important, e.g., mnc_get_ind (here very little time), or the monitor
> routines start to matter with more monitor output.
>
> The reason for the exchange routines being slow is simple: the inner
> loop (i-loop) is short (=Olx) and vectorizing it is inefficient.
> Switching the loop order (together with a compiler directive) would
> improve this situation a lot. Likewise, the barrier calls at the end
> of these routines should be inlined or not even called (remember
> an earlier posting of Jean-Michel, where he suggests putting these
> calls into IF (nSx.gt.1.or.nSy.gt.1)-blocks).
> "ucase" cannot be vectorized ... I don't know what to do about this,
> but making text output upper case costs almost 1% (or only 1%, as you
> wish).
>
> In a different configuration with more forcing input, the IO becomes
> an issue, because the GFS cannot handle small slabs of IO too well.
>
> Open questions:
> 1. Inlining: according to Jens-Olaf quite important. The SX-compilers
> do automatic inlining only if the routine to be inlined is shorter
> than some defined number of lines (can be set as a compiler flag) AND
> the routine to be inlined is in the same file as the calling routine.
> With compiler flags, external routines can be inlined: one specifies
> the name of the routine and the filename where the compiler should
> look for it. However, this depends on the packages that you want to
> load, because if the compiler does not find the routine it's supposed
> to inline, it returns an error. If possible, we should find a flexible
> way of doing this.
>
> 2. exchange routines:
> a. I can do the changes suggested above if you agree with them (see
> also my earlier emails about this).
> b. In a second step, I/we may want to have special exchange routines
> for the SX8, in particular for CG2D, as this is most expensive in
> this respect. Jens-Olaf has come up with a solution, which is very
> ad hoc and not at all general, but that may be a way to go? This is a
> big project, but according to Jens-Olaf: "the number of exchange
> calls used to exchange fields over processor boundaries and the
> number of subroutine layers between the call and the actual MPI call
> make the communication routines the most expensive ones."
>
> 3. !CDIR OUTERUNROLL=10 in CG2D gives a 30% speed up of cg2d: Do we
> want to have directives like that in the code, maybe within ifdef's
> TARGET_NEC_SX? Stupid question: Can we have a fortran parameter, say
> OURNR in SIZE.h or CG2D.h, or something that can be set with -DOURNR=10,
> so that we can have a more flexible way of doing the directive, like this
> !CDIR OUTERUNROLL=OURNR
> Is that possible (haven't tried)? If so, do we want this?
> If not, how can we record this type of optimization so that users can
> easily put this in their code?
>
> 4. special io routines that read the forcing files more efficiently.
>
> 5. diagnostics: there is a 5% overhead associated with the
> diagnostics package (just turning it on without defining any
> diagnostic in data.diagnostics), probably because diagnostics_fill is
> called so often. The solution would be to replace every
> "call diagnostics_fill (diagnostic)"
> by
> "IF (diagnostics_is_on(diagnostic)) call diagnostics_fill (diagnostic)"
> A lot of work and ugly ...
>
> These are the most important issues that I came across (before I go
> ahead and change my configuration (o:, thsice requires a lot of
> work, which I will not go into now and I have not tried a Cubed
> sphere run yet). The list reflects my personal ranking. Point 2a and
> 3. are relatively easy to do for me and I will do them if you agree
> (also to a subset of these).
>
> There are also many other small things, e.g. in find_rho it would be
> possible to speed up things, etc. but that's for later.
>
> Please let me know, what I can do now, and what I should not bother
> thinking about, because it is not likely to make it into the
> repository anyway ...
>
> Martin
>
> PS. Enjoy your Thanksgiving weekend!