[MITgcm-devel] exch_ and adjoint
Martin Losch
Martin.Losch at awi.de
Wed Dec 2 11:31:55 EST 2009
Hi there,
I am struggling with a performance issue: I am trying to increase my
throughput, but when I go from 1 to 2 (vector) CPUs I divide the
domain in the y-direction. In both cases I run with MPI. With this
change the time required per CPU for exch_rl_*_y increases (second
column, EXCLUSIVE TIME) from
FREQUENCY  EXCLUSIVE     AVER.TIME   MOPS  MFLOPS  V.OP   AVER.  VECTOR  I-CACHE  O-CACHE  BANK    PROG.UNIT
           TIME[sec](%)  [msec]                    RATIO  V.LEN  TIME    MISS     MISS     CONF
[...]
    83874  0.263( 0.6)   0.003     1606.6     0.0  77.88   45.5   0.108   0.1018   0.0086  0.0001  exch_rl_send_put_y
    83874  0.259( 0.6)   0.003     1684.4    62.9  77.33  146.9   0.068   0.0917   0.0341  0.0001  exch_rl_recv_get_y
to
   167748  5.867( 8.9)   0.035      381.0     5.7  68.74   56.7   1.641   2.1979   0.7529  0.0269  exch_rl_recv_get_y
    83874  3.386         0.040      346.5     4.9  69.09   43.0   1.203   1.0984   0.3761  0.0159  0.0
    83874  2.481         0.030      428.1     6.8  68.35   87.9   0.438   1.0995   0.3768  0.0110  0.1
   167748  2.371( 3.6)   0.014      701.7     0.3  73.86   66.9   0.459   1.1025   0.2787  0.0314  exch_rl_send_put_y
    83874  1.193         0.014      697.1     0.3  73.86   66.9   0.229   0.5535   0.1433  0.0157  0.0
    83874  1.177         0.014      706.4     0.3  73.86   66.9   0.229   0.5490   0.1354  0.0157  0.1
So over a factor of ten for recv_get_y and a factor of 5 for
send_put_y. For the corresponding x-routines, nothing much changes.
When I divide the domain in x instead, the relative roles of the
y- and x-exchanges switch. Is there a simple explanation for this?
Further, exch_* seems to get called many times for the adjoint (also
just for the corners, which makes these routines very inefficient on
our vector computer). But still, why does it get so much worse for 2
CPUs? Is there a way around that?
Martin