[MITgcm-devel] Re: thsice is reeeeeeeeeally scalar!

Martin Losch Martin.Losch at awi.de
Mon Oct 1 07:51:50 EDT 2007


Hi Jens-Olaf,

thanks for your input, a few questions/answers below:
On 1 Oct 2007, at 12:48, Jens-Olaf Beismann wrote:

> Martin,
>
> I just had a very brief look at your ftraces:
>
> - on how many processors did you run these tests?
1 CPU only, does that matter?
> - in both tests the total number of procedure calls is very high
> - in the THSICE case, thsice_get_exf and thsice_reshape_layers  
> together give appr. 25e6 calls
> - can these be inlined, and might inlining improve the  
> vectorisation of the thsice routines you mentioned? Maybe  
> vectorising THSICE isn't that big a task after all.
You are right, the many thsice subroutines are called from within
i,j-loops and are candidates for inlining, but as you and I have found
out, inlining is not trivial for sxf90+MITgcm because it breaks the
genmake2 script. It will be necessary to restrict this optimization
option to specific configurations. Alternatively, one could inline the
subroutines manually or do loop-pushing, but that is probably the
huge task I was talking about.
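To make concrete what I mean by loop-pushing, here is a schematic
sketch (not actual MITgcm code; THSICE_POINT_UPDATE, THSICE_TILE_UPDATE,
hIce and qFlx are made-up names): instead of calling a scalar routine
once per grid point, one hands the whole tile to the routine and moves
the i,j loops inside it, so that sxf90 sees a plain vectorisable inner
loop.

C     Before: one scalar call per grid point; the CALL inside the
C     i-loop prevents vectorisation.
      DO j=1,sNy
       DO i=1,sNx
        CALL THSICE_POINT_UPDATE( hIce(i,j), qFlx(i,j) )
       ENDDO
      ENDDO

C     After loop-pushing: the loops live inside the routine, which
C     now works on the whole tile, and the inner loop vectorises.
      CALL THSICE_TILE_UPDATE( sNx, sNy, hIce, qFlx )

      SUBROUTINE THSICE_TILE_UPDATE( nx, ny, hIce, qFlx )
      IMPLICIT NONE
      INTEGER nx, ny, i, j
      Real*8 hIce(nx,ny), qFlx(nx,ny)
      DO j=1,ny
       DO i=1,nx
C       stand-in for the real per-point thermodynamics
        hIce(i,j) = hIce(i,j) + qFlx(i,j)
       ENDDO
      ENDDO
      RETURN
      END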
> - inlining should also be applied to other routines, cf. the ones I  
> listed in the cubed sphere case
Same problem as above. I actually do inline a few subroutines. Some
routines, however, e.g. fool_the_compiler, are meant to defeat the
optimizer and should NOT be inlined. I am not too familiar with the
code bits where this is important (multithreading!).
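A minimal sketch of the kind of pattern I have in mind (schematic
only, not the actual eesupp source; WAIT_FOR_MASTER and the calling
pattern are illustrative): a thread spin-waits on a shared flag, and
the opaque external call is exactly what keeps the compiler from
caching the flag in a register and collapsing the loop; inline it and
that protection is gone.

      SUBROUTINE WAIT_FOR_MASTER
      IMPLICIT NONE
      INTEGER flag
      COMMON /SYNC_FLAG/ flag
C     Spin until another thread sets the shared flag.
   10 CONTINUE
      IF ( flag .EQ. 0 ) THEN
C      The external call is opaque to the compiler: it has to assume
C      "flag" may change, so it reloads it from memory every pass.
C      Inlining the routine would let the optimizer see through it.
       CALL FOOL_THE_COMPILER( flag )
       GOTO 10
      ENDIF
      RETURN
      END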
> - you might want to try to get rid of some "barrier" calls as well.
That's for Jean-Michel to decide; he knows which ones are superfluous.
> - regarding the advection routines, it would be helpful to compare  
> the corresponding compiler listings
True, but I was hoping that Jean-Michel would be able to tell me
right away what's different between these routines; they should be
very similar.

Martin
>
> Cheers,
>
> Jens-Olaf
>
>> in my crusade to turn the MITgcm into a true vector code I noticed
>> that the thsice package would require a lot of work. I have
>> attached (in a gzipped tar-ball) the output of a comparison
>> between runs with seaice+thsice and seaice only. The domain is
>> 243x170x33 (Rüdiger Gerdes' Arctic Ocean configuration from
>> AOMIP), and I integrate for 10 days with deltaT=900 sec, so 960
>> timesteps.
>> If you have a look at ftrace.txt_thsice and ftrace.txt_seaice  
>> (from flow trace analyses) you'll notice a few things:
>> 1. mom_calc_visc is by far the most expensive routine, probably
>> because I use the Leith scheme; I use a slightly lower
>> optimization, -Cvopt instead of -Chopt, for this routine, but I
>> still find this surprising. I would have expected cg2d to be the
>> top runner.
>> 2. all routines that start with thsice_* have zero vector  
>> operation ratio, and from the MFLOPS you can see that they are  
>> really slow because of that.
>> 3. Exception: seaice_advection (V. OP. Ratio = 83%) vectorises
>> worse than thsice_advection (99.53%). I have no idea why.
>> 4. everything else looks decent except for the exch_rl_send/recv  
>> routines. I am not touching them without detailed instructions.
>> As a consequence the seaice+thsice run is slower (692 sec vs.
>> 558 sec, stdout.*). The excess time is spent in THSICE_MAIN
>> (146.91 sec, as opposed to seaice_growth+seaice_advdiff =
>> 31.48-13.21 = 18.27 sec).
>> I don't want to undertake the huge task of vectorizing thsice,
>> but why is seaice_advection so different from thsice_advection
>> (Jean-Michel?).
>> Martin
>> CC to Jens-Olaf, although he cannot reply to this list, I guess  
>> (just MITgcm-support at mitgcm.org).
>
>
> -- 
> Dr. Jens-Olaf Beismann           Benchmarking Analyst
> NEC High Performance Computing Europe GmbH
> Prinzenallee 11, D-40549 Duesseldorf, Germany
> Tel: +49 4326 288859 (office)  +49 160 183 5289 (mobile)
> Fax: +49 4326 288861              http://www.hpce.nec.com
>




