[MITgcm-support] Scalability

chris hill cnh at mit.edu
Wed Aug 3 09:57:35 EDT 2005


Stefano,

  Sorry to hear that it wasn't a simple I/O thing.
  What sort of network do you have connecting the nodes in the cluster?
  Not sure what the problem is - it could be the network connecting the
nodes, or it could be a system configuration issue.
  If you can give us access to the system we can probably figure out
what the cause is and whether it is easy to fix.
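  As a quick check of the interconnect, a bare MPI ping-pong between two
of the nodes will tell you the round-trip latency the pressure solver
pays on every global sum. Something along these lines should do (a rough
sketch, not part of the MITgcm tree; run one process on each of two
nodes, e.g. mpirun -np 2):

      PROGRAM pingpong
C     Bounces a one-word message back and forth nIter times and
C     reports the mean round-trip time on rank 0.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER ierr, rank, i, nIter
      INTEGER status(MPI_STATUS_SIZE)
      PARAMETER (nIter = 1000)
      DOUBLE PRECISION buf(1), t0, t1
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf(1) = 0.0D0
      t0 = MPI_WTIME()
      DO i = 1, nIter
        IF (rank .EQ. 0) THEN
          CALL MPI_SEND(buf, 1, MPI_DOUBLE_PRECISION, 1, 0,
     &                  MPI_COMM_WORLD, ierr)
          CALL MPI_RECV(buf, 1, MPI_DOUBLE_PRECISION, 1, 0,
     &                  MPI_COMM_WORLD, status, ierr)
        ELSE IF (rank .EQ. 1) THEN
          CALL MPI_RECV(buf, 1, MPI_DOUBLE_PRECISION, 0, 0,
     &                  MPI_COMM_WORLD, status, ierr)
          CALL MPI_SEND(buf, 1, MPI_DOUBLE_PRECISION, 0, 0,
     &                  MPI_COMM_WORLD, ierr)
        ENDIF
      ENDDO
      t1 = MPI_WTIME()
      IF (rank .EQ. 0) THEN
        WRITE(*,*) 'mean round-trip time [us]:',
     &             (t1-t0)/DBLE(nIter)*1.0D6
      ENDIF
      CALL MPI_FINALIZE(ierr)
      END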

Thanks,

Chris
Stefano Querin wrote:
> Hi,
> I'm wondering if there is something wrong in my model configuration.
> I'm running on a small AMD Opteron 64 cluster (12 CPUs: 4 SunFire V20z
> + 1 SunFire V40z). I made some runs (on 1, 2, 4 and 8 CPUs) to check the
> scalability of the code: the results are in the file Scalability.txt.
> I also attach the file SIZE.h (taken from the 8 CPU run).
> The domain is 88 x 128 x 28 points with a horizontal resolution of 250 m,
> doubly periodic. Levels are 0.5 m thick (the upper 6) and 1.0 m thick
> (the other 22). The simulation consists of 4320 steps, the timestep is
> 10 s, and the dump time is 1800 s. The model (checkpoint57j_post) is
> forced with surface heat fluxes and wind stress.
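> (For reference, SIZE.h sets up the decomposition along these lines -
> exact values are in the attached file, and the overlap widths here are
> just a typical choice - with the 88 x 128 domain split into a 2 x 4
> process grid of one 44 x 32 tile per CPU:
> 
>       INTEGER sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy, Nx, Ny, Nr
>       PARAMETER (
>      &           sNx =  44,
>      &           sNy =  32,
>      &           OLx =   3,
>      &           OLy =   3,
>      &           nSx =   1,
>      &           nSy =   1,
>      &           nPx =   2,
>      &           nPy =   4,
>      &           Nx  = sNx*nSx*nPx,
>      &           Ny  = sNy*nSy*nPy,
>      &           Nr  =  28 )
> 
> With tiles this small, the overlap regions exchanged every step are a
> sizeable fraction of each tile.)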
> It seems that the model doesn't scale so well. User time scales almost
> linearly; wall clock time, on the contrary, gets worse (especially going
> from 4 to 8 CPUs).
> In the "sea+land" column I put the time elapsed running exactly the same
> simulation, but on a different domain: in this case, the southern half
> of the domain consists of land points (i.e., running on 8 CPUs, 4 handle
> only sea points and the other 4 only land points). The simulations are
> not that much faster (do land points require only slightly less
> computational effort?). Scalability is similar to the "all sea"
> experiment.
> My question is: why does the gap between user+system time and wall clock
> time grow so much as I increase the number of CPUs?
> I know that the model was designed to be highly scalable on much larger
> computational platforms, so I'm probably making a mistake somewhere!
> I'm sure it is not an I/O problem: the DO_THE_MODEL_IO routine takes
> only a few seconds, since I turned GlobalFiles off (and I no longer use
> the NFS filesystem).
> Most of the time (almost all of it) is "lost" in the SOLVE_FOR_PRESSURE
> and BLOCKING_EXCHANGES routines, which (I think) involve the whole
> domain. Could this be a communication problem with our cluster (network,
> switches...)?
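> (As far as I understand, the conjugate-gradient loop in
> SOLVE_FOR_PRESSURE needs a global dot product on every iteration, so
> every iteration pays a fixed communication latency that does not shrink
> as the tiles get smaller. A bare global-sum timing like the sketch below
> - my own test, not MITgcm code - would show what that latency is on our
> switch; run it on the same 8 CPUs as the model, e.g. mpirun -np 8:
> 
>       PROGRAM sumtest
> C     Times one MPI_ALLREDUCE per loop pass, i.e. the collective
> C     that a solver iteration cannot avoid.
>       IMPLICIT NONE
>       INCLUDE 'mpif.h'
>       INTEGER ierr, rank, nProcs, i, nIter
>       PARAMETER (nIter = 1000)
>       DOUBLE PRECISION locSum, glbSum, t0, t1
>       CALL MPI_INIT(ierr)
>       CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>       CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nProcs, ierr)
>       locSum = DBLE(rank)
>       t0 = MPI_WTIME()
>       DO i = 1, nIter
>         CALL MPI_ALLREDUCE(locSum, glbSum, 1,
>      &       MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
>       ENDDO
>       t1 = MPI_WTIME()
>       IF (rank .EQ. 0) THEN
>         WRITE(*,*) nProcs, ' ranks, mean ALLREDUCE time [us]:',
>      &             (t1-t0)/DBLE(nIter)*1.0D6
>       ENDIF
>       CALL MPI_FINALIZE(ierr)
>       END
> 
> )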
> Any other ideas?
> Thank you very much!
> 
> Regards,
> 
> Stefano
> 
>                          all sea    sea+land
> 
> 1p  user time [s]           7526        6737
>     system time [s]            2           2
>     wall clock time [s]     7532        6743
> 
> 2p  user time [s]           3680        3208
>     system time [s]           49          46
>     wall clock time [s]     3764        3393
> 
> 4p  user time [s]           1710        1553
>     system time [s]           60          55
>     wall clock time [s]     1961        1833
> 
> 8p  user time [s]            897         773
>     system time [s]           94          94
>     wall clock time [s]     1575        1404
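> 
> P.S. A quick efficiency calculation from the "all sea" wall clock times
> above:
> 
>   2 CPUs: speedup 7532/3764 = 2.00  ->  efficiency 100%
>   4 CPUs: speedup 7532/1961 = 3.84  ->  efficiency  96%
>   8 CPUs: speedup 7532/1575 = 4.78  ->  efficiency  60%
> 
> i.e. almost all of the degradation appears between 4 and 8 CPUs.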