[MITgcm-devel] Re: MITgcm vectorization on NEC SX
Martin Losch
Martin.Losch at awi.de
Fri Jun 8 10:38:37 EDT 2007
Hi Chris and others,
I now have a list of changes from Jens-Olaf that helped the
performance and vectorisation of the MITgcm on our NEC SX. I'll try
to translate it (though I am bad at this computer terminology) and
comment:
1. Inlining, in particular of lagran (in exf_interp) and exf_bulk*
2. Change of the error checking in calc_r_star and in the exch2
routines (ifdef W2_USE_E2_SAFEMODE); comment: the former is already
in the code
3. Vectorisation of seaice_lsr (comment: I do not quite see what he
has done there)
4. Vectorisation of exf_radiation, exf_bulkformulae, exf_mapfields,
exf_interp, mon_stats_rl
5. Horner scheme in find_rho.F (see the sketch right after this list)
6. Non-blocking communication in gather_2d.F
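(Sketch for item 5: in find_rho.F the density polynomial can be
evaluated with nested multiplications instead of explicit powers,
which saves operations and keeps the loop vectorisable. This is only
a toy example in the spirit of the change - the routine name,
variables and coefficients below are made up, not the actual MITgcm
code:

      SUBROUTINE HORNER_TOY( n, t, rho )
C     evaluate a cubic polynomial in temperature for n points;
C     c0..c3 are placeholder coefficients, not the real EOS ones
      IMPLICIT NONE
      INTEGER n, i
      REAL*8 t(n), rho(n)
      REAL*8 c0, c1, c2, c3
      PARAMETER ( c0 = 999.8D0, c1 = 6.8D-2,
     &            c2 = -9.1D-3, c3 = 1.0D-4 )
      DO i = 1, n
C      instead of rho(i) = c0 + c1*t(i) + c2*t(i)**2 + c3*t(i)**3
C      use nested multiplications; the i-loop stays a long vector loop
       rho(i) = c0 + t(i)*( c1 + t(i)*( c2 + t(i)*c3 ) )
      ENDDO
      RETURN
      END

The real polynomial has many more terms, in both theta and S, but the
transformation is the same.)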
Specific to our benchmark case (CS510) and not general at all:
- manual splitting of loops (e.g. 515=2*172+171 instead of 2*256+3)
- enlargement of array dimensions by +1 or +3 to avoid bank conflicts
(in particular the latter should be kept in mind; it can come in
handy; see the sketch below)
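(Sketch for the bank-conflict point: on the SX, memory is interleaved
across banks, and strided accesses with a large power-of-two stride
keep hitting the same banks. With column-major arrays this happens
when the leading dimension is a power of two, so padding it by a
small odd number helps. The sizes below are made up, purely for
illustration:

      PROGRAM PAD_TOY
C     pad the leading array dimension to avoid memory-bank conflicts
      IMPLICIT NONE
      INTEGER nx, ny
      PARAMETER ( nx = 512, ny = 512 )
C     a plain phi(nx,ny) with nx a power of two is prone to bank
C     conflicts when accessed along the second index; pad the
C     leading dimension by +1 (or +3):
      REAL*8 phi( nx+1, ny )
      INTEGER i, j
      DO j = 1, ny
       DO i = 1, nx
C       the loops still run over 1..nx only; the padding is unused
        phi(i,j) = 0.D0
       ENDDO
      ENDDO
      PRINT *, 'padded leading dimension:', nx+1
      END

The price is a little wasted memory and slightly less regular
storage, but on the SX it can pay off.)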
Then there are UNROLL or OUTERUNROLL directives in some routines
that have improved the performance in the benchmark case a little;
these are probably useful in many cases, except for small domains
(short vector lengths).
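(For reference, these are just comment-style compiler directives in
front of a loop nest. If I remember the sxf90 directive syntax
correctly, it looks roughly like the following; the loop body and
names are invented and the unroll factor is only an example:

      SUBROUTINE MADD_TOY( nx, ny, a, b, c )
C     toy loop nest; only the directive line in front of the outer
C     loop matters here
      IMPLICIT NONE
      INTEGER nx, ny, i, j
      REAL*8 a(nx,ny), b(nx,ny), c(nx,ny)
!CDIR OUTERUNROLL=4
      DO j = 1, ny
       DO i = 1, nx
        a(i,j) = a(i,j) + b(i,j)*c(i,j)
       ENDDO
      ENDDO
      RETURN
      END

The inner i-loop is vectorised as usual; the directive asks the
compiler to unroll the outer j-loop on top of that. The UNROLL
directive mentioned above is used analogously for a single loop.)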
In impldiff I (Jens-Olaf), besides unrolling, used vector registers
explicitly - that reduces the number of memory accesses, but the
result is very difficult for the user to read (comment: I can
confirm that).
I (Martin) have the code with most of these modifications. My plan
is to sort out which parts are general enough to be included (as far
as I am competent to evaluate that) and then discuss with you what
should go into the repository and what should not. I can also share
the code if you want me to; if so, maybe not via the MITgcm_contrib
repository, as I am not sure how public all of this is.
Martin
On 7 Jun 2007, at 18:39, chris hill wrote:
> Jens-Olaf (and everyone else),
>
> Thanks for your input, it's very useful.
>
> If possible, it would be useful if people could overcome their
> shyness and post their trace information to mitgcm-
> support at mitgcm.org. That way we can rapidly see what is showing up
> in terms of modules and loops that are not vectorizing too well. In
> the past most discussions about vectorization (which the underlying
> algorithms and kernel numerics are, in general, very amenable to)
> on the German NEC installs have studiously left out the core MITgcm
> development team, which isn't always useful!
>
> Chris
> Jens-Olaf Beismann wrote:
>> Dear all,
>> it's good to know that there's quite some interest in using the
>> MITgcm on SX systems. I talked to Martin earlier this afternoon
>> and explained that unfortunately I'm too busy at the moment to
>> really get involved in optimising various MITgcm configurations.
>> But I'll try to answer any specific questions you might have if I
>> find a spare moment. I'd suggest that you collect all relevant
>> information (SIZE.h, MPIPROGINF, ftrace etc.) and make it
>> accessible for me on the machine you're using.
>> Please note that I'll probably not be able to answer any request
>> during the next couple of weeks. You might also want to involve
>> Armin Koehl (I guess most of you know him) in the discussion; he
>> has much experience both with the forward and the adjoint code on
>> SX-6.
>> Cheers,
>> Jens-Olaf
>>> You may have hit upon the problem with my configuration. I am
>>> indeed using pkg/dic where the looping is all
>>> over the slowest k index.
>>>
>>> How did you rewrite your gchem code? Can you share some of it
>>> with me? Perhaps I can try my hand at rewriting
>>> pkg/dic?
>>>
>>> Also, I am not using any tiles in the x-direction. But obviously
>>> the biogeochemistry will kill any benefit of doing so.
>>>
>>> I do find that adding the -C hopt optimization flag (not in any
>>> of the existing build options files) helps quite a bit.
>>>
>>> Samar
>>>
>>> On Jun 7, 2007, at 5:41 PM, Martin Losch wrote:
>>>
>>>> Hi Patrick et al.
>>>>
>>>> I am sorry to have expressed my current experience with our NEC
>>>> SX8 the way I did. For fairly large problems (what is "fairly
>>>> large": 300x300 horizontal points is good, 45x45 is not, 180x108
>>>> is still ok but not great) I find the performance to be good.
>>>> For 1 CPU and 300x300x100 points and the latest code I get
>>>> 5540 MFLOPS, which is approx 15% of the theoretical peak
>>>> performance. This does not involve exf or any seaice pkg, nor
>>>> does it involve any fancy tweaking.
>>>>
>>>> I have not yet found the optimal set of compiler flags; if
>>>> anyone knows anything better than what I have put into
>>>> SUPER-UX_SX-8_sxf90+mpi_awi, I'd love to get (and try) them.
>>>>
>>>> Samar, for your domain you probably do not need any domain
>>>> decomposition (it makes the inner loops too small). I am using
>>>> biogeochemical code in which the loops are "the wrong way
>>>> around", that is, the k-loop is the innermost loop, as is
>>>> commonly the case with these models; e.g., the DIC pkg will
>>>> have this problem and will slow down the code dramatically. In
>>>> my case, the gchem-related code (not part of the cvs
>>>> repository), which takes less than 10% on a parallel machine
>>>> with amd64 cpus, takes 80% of the total time on the NEC. I have
>>>> tried to inline a few routines within the k-loop; that enables
>>>> vectorization of the (short, length 23) k-loop, which already
>>>> reduces the cpu time spent in that routine by a factor of 2.
>>>> But in the end I will have to rewrite this part of the code.
>>>> The other parts (even seaice and exf) do not seem to be a
>>>> terrible problem.
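>>>>
>>>> (To make the "wrong way around" explicit, here is a toy sketch,
>>>> not the actual dic/gchem code; all names are invented:
>>>>
>>>>       SUBROUTINE BIO_TOY( nx, ny, nr, rate, tracer, tend )
>>>> C     toy tracer source term; the point is only the loop order
>>>>       IMPLICIT NONE
>>>>       INTEGER nx, ny, nr, i, j, k
>>>>       REAL*8 rate, tracer(nx,ny,nr), tend(nx,ny,nr)
>>>> C     k innermost ("the wrong way around"): the vector length
>>>> C     is only nr, e.g. 23:
>>>> C       DO j = 1, ny
>>>> C        DO i = 1, nx
>>>> C         DO k = 1, nr
>>>> C          tend(i,j,k) = rate*tracer(i,j,k)
>>>> C         ENDDO
>>>> C        ENDDO
>>>> C       ENDDO
>>>> C     i innermost: same result, but long vectors:
>>>>       DO k = 1, nr
>>>>        DO j = 1, ny
>>>>         DO i = 1, nx
>>>>          tend(i,j,k) = rate*tracer(i,j,k)
>>>>         ENDDO
>>>>        ENDDO
>>>>       ENDDO
>>>>       RETURN
>>>>       END
>>>>
>>>> In the real code there are routine calls inside the k-loop,
>>>> which is why inlining them is needed before the compiler can
>>>> vectorise the loop at all, as described above.)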
>>>>
>>>> Martin
>>>>
>>>> On 7 Jun 2007, at 16:41, Patrick Heimbach wrote:
>>>>
>>>>>
>>>>> Hi Jens-Olaf, Samar, Martin,
>>>>>
>>>>> Jens-Olaf, thanks for pointing out that there are MITgcm
>>>>> setups which run efficiently on some SX platforms (there have
>>>>> been rumors out there that this wasn't possible).
>>>>> It would be great if you could share your experience as to
>>>>> what it took to achieve this efficiency, or known bottlenecks,
>>>>> maybe in working together with Martin at AWI and Samar at Kiel.
>>>>>
>>>>> We think that, in theory, there is little that should prevent
>>>>> MITgcm from vectorizing efficiently, if the domain
>>>>> decomposition is chosen accordingly (long inner loops).
>>>>> My understanding is that compiler optimization will also seek
>>>>> to collapse inner loops where possible (no index dependencies)
>>>>> to extend inner loop lengths.
>>>>> Again, this should work for many (but not all) subroutines.
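>>>>>
>>>>> (As a toy illustration of what such a collapse amounts to: a
>>>>> contiguous 2D tile can be handed to a routine that treats it
>>>>> as 1D, turning two nested loops into one long vector loop.
>>>>> The names below are invented:
>>>>>
>>>>>       SUBROUTINE SCALE1D( n, a, b, fac )
>>>>> C     a and b arrive as contiguous 1D storage, so the former
>>>>> C     double loop becomes a single loop of length n
>>>>>       IMPLICIT NONE
>>>>>       INTEGER n, k
>>>>>       REAL*8 a(n), b(n), fac
>>>>>       DO k = 1, n
>>>>>        a(k) = fac*b(k)
>>>>>       ENDDO
>>>>>       RETURN
>>>>>       END
>>>>>
>>>>> called as CALL SCALE1D( nx*ny, a, b, 2.D0 ) for arrays
>>>>> declared a(nx,ny) and b(nx,ny). A vectorising compiler can do
>>>>> the equivalent automatically when the loop bounds cover the
>>>>> whole arrays.)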
>>>>>
>>>>> Some problematic code for vectorization that I am aware of is
>>>>> the seaice package, as well as the bulk formula code in exf.
>>>>> But Samar isn't using those.
>>>>>
>>>>> If needed we can move this conversation to the devel list.
>>>>>
>>>>> Cheers
>>>>> -Patrick
>>>>>
>>>>>
>>>>>
>>>>> On Jun 7, 2007, at 10:21 AM, Samar Khatiwala wrote:
>>>>>
>>>>>> Hi Jens-Olaf
>>>>>>
>>>>>> This is a bit off topic, but just to follow up on your post:
>>>>>>
>>>>>> I am currently running the MITgcm on an SX-8 at uni-kiel. This
>>>>>> is a coarse resolution configuration (128 x 64 x 15).
>>>>>> Unfortunately, performance has not been so great and, I am told
>>>>>> by other users of the SX, is significantly below its
>>>>>> theoretical peak. Even less than the 25-35% number you quote.
>>>>>>
>>>>>> Perhaps you can advise, off-list, on how I can improve things.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Samar
>>>>>>
>>>>>> On Jun 7, 2007, at 2:28 PM, Jens-Olaf Beismann wrote:
>>>>>>
>>>>>>> Dear MITgcm users,
>>>>>>>
>>>>>>>> This "philosophy" works fairly well for most single cpu and
>>>>>>>> parallel computer architectures, although I am now
>>>>>>>> struggling with a vector computer for which the MITgcm is
>>>>>>>> only efficient if the horizontal domain size is fairly large
>>>>>>>> (because the code generally excludes vectorization in the
>>>>>>>> vertical dimension, and that's not likely to change).
>>>>>>>
>>>>>>> just a quick comment regarding the use of the MITgcm on
>>>>>>> vector machines: I'm not familiar with Martin's application,
>>>>>>> but I know several MITgcm configurations which are used very
>>>>>>> efficiently on SX machines at other computing centres. These
>>>>>>> are "medium-sized" regional ocean models, and they typically
>>>>>>> run at appr. 25-35% of the theoretical peak performance.
>>>>>>>
>>>>>>> As Martin pointed out, it is necessary to have a completely
>>>>>>> vectorised code to achieve good vector performance, but there
>>>>>>> is no general problem in running the MITgcm on a vector machine.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Jens-Olaf
>>>>>>> --
>>>>>>> Dr. Jens-Olaf Beismann Benchmarking Analyst
>>>>>>> NEC High Performance Computing Europe GmbH
>>>>>>> Prinzenallee 11, D-40549 Duesseldorf, Germany
>>>>>>> Tel: +49 4326 288859 (office) +49 160 183 5289 (mobile)
>>>>>>> Fax: +49 4326 288861 http://www.hpce.nec.com
>>>>>>>
>>>>>
>>>>> ---
>>>>> Dr Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>>>>> MIT | EAPS, 54-1518 | 77 Massachusetts Ave | Cambridge, MA 02139, USA
>>>>> FON: +1-617-253-5259 | FAX: +1-617-253-4464 | SKYPE: patrick.heimbach
>>>>>
>>>>>
>>>
>>>
>>>
>