[MITgcm-devel] Re: MITgcm vectorization on NEC SX
Martin Losch
Martin.Losch at awi.de
Fri Jun 8 10:38:37 EDT 2007
Hi Chris and others,
I now have a list of changes from Jens-Olaf that helped the
performance and vectorisation of the MITgcm on our NEC SX. I'll try
to translate it (though I am bad at this computer terminology) and
comment:
1. Inlining, in particular of lagran (in exf_interp) and exf_bulk*
2. Change of the error checking in calc_r_star and in the exch2
routines (ifdef W2_USE_E2_SAFEMODE); comment: the former is already
in the code
3. Vectorisation of seaice_lsr (comment: I do not quite see what he
has done there)
4. Vectorisation of exf_radiation, exf_bulkformulae, exf_mapfields,
exf_interp, mon_stats_rl
5. Horner scheme in find_rho.F (see the sketch right after this list)
6. Non-blocking communication in gather_2d.F
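(Sketch for item 5: in find_rho.F the density polynomial can be
evaluated with nested multiplications instead of explicit powers,
which saves operations and keeps the loop vectorisable. This is only
a toy example in the spirit of the change - the routine name,
variables and coefficients below are made up, not the actual MITgcm
code:

      SUBROUTINE HORNER_TOY( n, t, rho )
C     evaluate a cubic polynomial in temperature for n points;
C     c0..c3 are placeholder coefficients, not the real EOS ones
      IMPLICIT NONE
      INTEGER n, i
      REAL*8 t(n), rho(n)
      REAL*8 c0, c1, c2, c3
      PARAMETER ( c0 = 999.8D0, c1 = 6.8D-2,
     &            c2 = -9.1D-3, c3 = 1.0D-4 )
      DO i = 1, n
C      instead of rho(i) = c0 + c1*t(i) + c2*t(i)**2 + c3*t(i)**3
C      use nested multiplications; the i-loop stays a long vector loop
       rho(i) = c0 + t(i)*( c1 + t(i)*( c2 + t(i)*c3 ) )
      ENDDO
      RETURN
      END

The real polynomial has many more terms, in both theta and S, but the
transformation is the same.)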
Specific to our benchmark case (CS510) and not general at all:
- manual splitting of loops (e.g. 515=2*172+171 instead of 2*256+3)
- enlargement of array dimensions by +1 or +3 to avoid bank conflicts
(in particular the latter should be kept in mind; it can come in
handy; see the sketch below)
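(Sketch for the bank-conflict point: on the SX, memory is interleaved
across banks, and strided accesses with a large power-of-two stride
keep hitting the same banks. With column-major arrays this happens
when the leading dimension is a power of two, so padding it by a
small odd number helps. The sizes below are made up, purely for
illustration:

      PROGRAM PAD_TOY
C     pad the leading array dimension to avoid memory-bank conflicts
      IMPLICIT NONE
      INTEGER nx, ny
      PARAMETER ( nx = 512, ny = 512 )
C     a plain phi(nx,ny) with nx a power of two is prone to bank
C     conflicts when accessed along the second index; pad the
C     leading dimension by +1 (or +3):
      REAL*8 phi( nx+1, ny )
      INTEGER i, j
      DO j = 1, ny
       DO i = 1, nx
C       the loops still run over 1..nx only; the padding is unused
        phi(i,j) = 0.D0
       ENDDO
      ENDDO
      PRINT *, 'padded leading dimension:', nx+1
      END

The price is a little wasted memory and slightly less regular
storage, but on the SX it can pay off.)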
Then there are UNROLL or OUTERUNROLL directives in some routines
that have improved the performance in the benchmark case a little;
these are probably useful in many cases, except for small domains
(short vector lengths).
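(For reference, these are just comment-style compiler directives in
front of a loop nest. If I remember the sxf90 directive syntax
correctly, it looks roughly like the following; the loop body and
names are invented and the unroll factor is only an example:

      SUBROUTINE MADD_TOY( nx, ny, a, b, c )
C     toy loop nest; only the directive line in front of the outer
C     loop matters here
      IMPLICIT NONE
      INTEGER nx, ny, i, j
      REAL*8 a(nx,ny), b(nx,ny), c(nx,ny)
!CDIR OUTERUNROLL=4
      DO j = 1, ny
       DO i = 1, nx
        a(i,j) = a(i,j) + b(i,j)*c(i,j)
       ENDDO
      ENDDO
      RETURN
      END

The inner i-loop is vectorised as usual; the directive asks the
compiler to unroll the outer j-loop on top of that. The UNROLL
directive mentioned above is used analogously for a single loop.)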
In impldiff I (Jens-Olaf), besides unrolling, used vector registers
explicitly - that reduces the number of memory accesses, but the
result is very difficult for the user to read (comment: I can
confirm that).
I (Martin) have the code with most of these modifications. My plan
is to sort out which parts are general enough to be included (as far
as I am competent to evaluate that) and then discuss with you what
should go into the repository and what should not. I can also share
the code if you want me to; if so, maybe not via the MITgcm_contrib
repository, as I am not sure how public all of this is.
Martin
On 7 Jun 2007, at 18:39, chris hill wrote:
> Jens-Olaf (and everyone else),
>
> Thanks for your input, it's very useful.
>
> If possible, it would be useful if people could overcome their
> shyness and post their trace information to mitgcm-
> support at mitgcm.org. That way we can rapidly see what is showing up
> in terms of modules and loops that are not vectorizing too well. In
> the past most discussions about vectorization (which the underlying
> algorithms and kernel numerics are, in general, very amenable to)
> on the German NEC installs have studiously left out the core MITgcm
> development team, which isn't always useful!
>
> Chris
> Jens-Olaf Beismann wrote:
>> Dear all,
>> it's good to know that there's quite some interest in using the
>> MITgcm on SX systems. I talked to Martin earlier this afternoon
>> and explained that unfortunately I'm too busy at the moment to
>> really get involved in optimising various MITgcm configurations.
>> But I'll try to answer any specific questions you might have if I
>> find a spare moment. I'd suggest that you collect all relevant
>> information (SIZE.h, MPIPROGINF, ftrace etc.) and make it
>> accessible for me on the machine you're using.
>> Please note that I'll probably not be able to answer any request
>> during the next couple of weeks. You might also want to involve
>> Armin Koehl (I guess most of you know him) in the discussion; he
>> has much experience both with the forward and the adjoint code on
>> SX-6.
>> Cheers,
>> Jens-Olaf
>>> You may have hit upon the problem with my configuration. I am
>>> indeed using pkg/dic where the looping is all
>>> over the slowest k index.
>>>
>>> How did you rewrite your gchem code? Can you share some of it
>>> with me? Perhaps I can try my hand at rewriting
>>> pkg/dic?
>>>
>>> Also, I am not using any tiles in the x-direction. But obviously
>>> the biogeochemistry will kill any benefit of doing so.
>>>
>>> I do find that adding the -C hopt optimization flag (not in any
>>> of the existing build options files) helps quite a bit.
>>>
>>> Samar
>>>
>>> On Jun 7, 2007, at 5:41 PM, Martin Losch wrote:
>>>
>>>> Hi Patrick et al.
>>>>
>>>> I am sorry to have expressed my current experience with our NEC
>>>> SX8 the way I did. For fairly large problems (what is "fairly
>>>> large": 300x300 horizontal points is good, 45x45 is not, 180x108
>>>> is still ok but not great) I find the performance to be good.
>>>> For 1 CPU and 300x300x100 points and the latest code I get
>>>> 5540 MFLOPS, which is approx 15% of the theoretical peak
>>>> performance. This does not involve exf or any seaice pkg, nor
>>>> does it involve any fancy tweaking.
>>>>
>>>> I have not yet found the optimal set of compiler flags; if
>>>> anyone knows anything better than what I have put into
>>>> SUPER-UX_SX-8_sxf90+mpi_awi, I'd love to get (and try) them.
>>>>
>>>> Samar, for your domain you probably do not need any domain
>>>> decomposition (it makes the inner loops too small). I am using
>>>> biogeochemical code in which the loops are "the wrong way
>>>> around", that is, the k-loop is the innermost loop, as is
>>>> commonly the case with these models; e.g., the DIC pkg will
>>>> have this problem and will slow down the code dramatically. In
>>>> my case, the gchem-related code (not part of the cvs
>>>> repository), which takes less than 10% on a parallel machine
>>>> with amd64 cpus, takes 80% of the total time on the NEC. I have
>>>> tried to inline a few routines within the k-loop; that enables
>>>> vectorization of the (short, length 23) k-loop, which already
>>>> reduces the cpu time spent in that routine by a factor of 2.
>>>> But in the end I will have to rewrite this part of the code.
>>>> The other parts (even seaice and exf) do not seem to be a
>>>> terrible problem.
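>>>>
>>>> (To make the "wrong way around" explicit, here is a toy sketch,
>>>> not the actual dic/gchem code; all names are invented:
>>>>
>>>>       SUBROUTINE BIO_TOY( nx, ny, nr, rate, tracer, tend )
>>>> C     toy tracer source term; the point is only the loop order
>>>>       IMPLICIT NONE
>>>>       INTEGER nx, ny, nr, i, j, k
>>>>       REAL*8 rate, tracer(nx,ny,nr), tend(nx,ny,nr)
>>>> C     k innermost ("the wrong way around"): the vector length
>>>> C     is only nr, e.g. 23:
>>>> C       DO j = 1, ny
>>>> C        DO i = 1, nx
>>>> C         DO k = 1, nr
>>>> C          tend(i,j,k) = rate*tracer(i,j,k)
>>>> C         ENDDO
>>>> C        ENDDO
>>>> C       ENDDO
>>>> C     i innermost: same result, but long vectors:
>>>>       DO k = 1, nr
>>>>        DO j = 1, ny
>>>>         DO i = 1, nx
>>>>          tend(i,j,k) = rate*tracer(i,j,k)
>>>>         ENDDO
>>>>        ENDDO
>>>>       ENDDO
>>>>       RETURN
>>>>       END
>>>>
>>>> In the real code there are routine calls inside the k-loop,
>>>> which is why inlining them is needed before the compiler can
>>>> vectorise the loop at all, as described above.)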
>>>>
>>>> Martin
>>>>
>>>> On 7 Jun 2007, at 16:41, Patrick Heimbach wrote:
>>>>
>>>>>
>>>>> Hi Jens-Olaf, Samar, Martin,
>>>>>
>>>>> Jens-Olaf, thanks for pointing out that there are MITgcm
>>>>> setups which run efficiently on some SX platforms (there have
>>>>> been rumors out there that this wasn't possible).
>>>>> It would be great if you could share your experience as to
>>>>> what it took to achieve this efficiency, or known bottlenecks,
>>>>> maybe in working together with Martin at AWI and Samar at Kiel.
>>>>>
>>>>> We think that, in theory, there is little that should prevent
>>>>> MITgcm from vectorizing efficiently, if the domain
>>>>> decomposition is chosen accordingly (long inner loops).
>>>>> My understanding is that compiler optimization will also seek
>>>>> to collapse inner loops where possible (no index dependencies)
>>>>> to extend inner loop lengths.
>>>>> Again, this should work for many (but not all) subroutines.
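>>>>>
>>>>> (As a toy illustration of what such a collapse amounts to: a
>>>>> contiguous 2D tile can be handed to a routine that treats it
>>>>> as 1D, turning two nested loops into one long vector loop.
>>>>> The names below are invented:
>>>>>
>>>>>       SUBROUTINE SCALE1D( n, a, b, fac )
>>>>> C     a and b arrive as contiguous 1D storage, so the former
>>>>> C     double loop becomes a single loop of length n
>>>>>       IMPLICIT NONE
>>>>>       INTEGER n, k
>>>>>       REAL*8 a(n), b(n), fac
>>>>>       DO k = 1, n
>>>>>        a(k) = fac*b(k)
>>>>>       ENDDO
>>>>>       RETURN
>>>>>       END
>>>>>
>>>>> called as CALL SCALE1D( nx*ny, a, b, 2.D0 ) for arrays
>>>>> declared a(nx,ny) and b(nx,ny). A vectorising compiler can do
>>>>> the equivalent automatically when the loop bounds cover the
>>>>> whole arrays.)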
>>>>>
>>>>> Some problematic code for vectorization that I am aware of is
>>>>> the seaice package, as well as the bulk formula code in exf.
>>>>> But Samar isn't using those.
>>>>>
>>>>> If needed we can move this conversation to the devel list.
>>>>>
>>>>> Cheers
>>>>> -Patrick
>>>>>
>>>>>
>>>>>
>>>>> On Jun 7, 2007, at 10:21 AM, Samar Khatiwala wrote:
>>>>>
>>>>>> Hi Jens-Olaf
>>>>>>
>>>>>> This is a bit off topic, but just to follow up on your post:
>>>>>>
>>>>>> I am currently running the MITgcm on an SX-8 at uni-kiel. This
>>>>>> is a coarse resolution configuration (128 x 64 x 15).
>>>>>> Unfortunately, performance has not been so great and, I am told
>>>>>> by other users of the SX, is significantly below its
>>>>>> theoretical peak. Even less than the 25-35% number you quote.
>>>>>>
>>>>>> Perhaps you can advise, off-list, on how I can improve things.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Samar
>>>>>>
>>>>>> On Jun 7, 2007, at 2:28 PM, Jens-Olaf Beismann wrote:
>>>>>>
>>>>>>> Dear MITgcm users,
>>>>>>>
>>>>>>>> This "philosophy" works fairly well for most single cpu and
>>>>>>>> parallel computer architectures, although I am now
>>>>>>>> struggling with a vector computer for which the MITgcm is
>>>>>>>> only efficient if the horizontal domain size is fairly large
>>>>>>>> (because the code generally excludes vectorization in the
>>>>>>>> vertical dimension, and that's not likely to change).
>>>>>>>
>>>>>>> just a quick comment regarding the use of the MITgcm on
>>>>>>> vector machines: I'm not familiar with Martin's application,
>>>>>>> but I know several MITgcm configurations which are used very
>>>>>>> efficiently on SX machines at other computing centres. These
>>>>>>> are "medium-sized" regional ocean models, and they typically
>>>>>>> run at appr. 25-35% of the theoretical peak performance.
>>>>>>>
>>>>>>> As Martin pointed out, it is necessary to have a completely
>>>>>>> vectorised code to achieve good vector performance, but there
>>>>>>> is no general problem in running the MITgcm on a vector machine.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Jens-Olaf
>>>>>>> --
>>>>>>> Dr. Jens-Olaf Beismann Benchmarking Analyst
>>>>>>> NEC High Performance Computing Europe GmbH
>>>>>>> Prinzenallee 11, D-40549 Duesseldorf, Germany
>>>>>>> Tel: +49 4326 288859 (office) +49 160 183 5289 (mobile)
>>>>>>> Fax: +49 4326 288861 http://www.hpce.nec.com
>>>>>>>
>>>>>
>>>>> ---
>>>>> Dr Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>>>>> MIT | EAPS, 54-1518 | 77 Massachusetts Ave | Cambridge, MA 02139, USA
>>>>> FON: +1-617-253-5259 | FAX: +1-617-253-4464 | SKYPE: patrick.heimbach
>>>>>
>>>>>
>>>
>>>
>>>
>