[MITgcm-support] MITgcm on clusters!

Van Thinh Nguyen vtnguyen at moisie2.math.uwaterloo.ca
Fri Mar 30 10:59:54 EDT 2007


Hi Baylor and Matt,

Thanks so much for your advice. I will first try useSingleCpuIO and some 
"FLAGS" options (see the namelist sketch below), and will let you know what 
happens. I know Baylor's hints are worth following, but my model is quite big. 
I definitely can't run it on fewer than 8 CPUs on my cluster, because when I 
used 32 CPUs, it showed the model needs the following memory on each CPU:

VIRT=162 GB
RES=2.3 GB
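
For reference, my understanding is that useSingleCpuIO is a runtime flag in 
the PARM01 namelist of the main "data" file, which makes the master MPI 
process gather and write single global output files instead of one file per 
tile.  A minimal sketch of the change I plan to try (everything else in 
PARM01 stays as in my setup):

  &PARM01
 # ... existing model parameters unchanged ...
 # let the master process write single global output files
   useSingleCpuIO=.TRUE.,
  &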

I used a resolution of 4000 x 500 x 60 (run with the non-hydrostatic and 
free-surface options) to study internal solitary waves in an estuary. The 
tiles are divided as:

                 sNx = 125,
                 sNy = 250,
                 OLx =   3,
                 OLy =   3,
                 nSx =   1,
                 nSy =   1,
                 nPx =  32,
                 nPy =   2,
                 Nx  = sNx*nSx*nPx,
                 Ny  = sNy*nSy*nPy,
                 Nr  =  60)
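
As a rough back-of-envelope check (assuming 8-byte reals, and guessing that 
the non-hydrostatic, free-surface setup carries on the order of 100-150 3-D 
arrays per process -- an assumption, I have not counted them):

  points per tile (incl. overlaps) = (125+2*3) * (250+2*3) * 60
                                   = 131 * 256 * 60  ~ 2.0e6
  one 3-D real*8 field per tile    ~ 2.0e6 * 8 bytes ~ 16 MB
  ~140 such fields                 ~ 2.2 GB per process

which is roughly consistent with the RES figure above.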

Info of my cluster:

Operating system: HP Linux XC 3.0
Interconnect: Quadrics Elan4
Nodes: 1-768
     CPU: 2 x Opteron 2.60 GHz
     Memory: 8 GB
     Local disk storage: 160 GB
Total number of processors/cores: 1536


Of course, I can use a smaller case to follow Baylor's hints.

Thanks

Van Thinh
---------------------------------------------------------

On Fri, 30 Mar 2007, Baylor Fox-Kemper wrote:

> Hi Van Thinh,
> The performance will be quite different, since memory sharing across the 
> processors will typically be much slower on a cluster (especially if it is 
> built with a Gigabit Ethernet connection between the processors rather than 
> Myrinet or InfiniBand).  Also, the size of the memory on each processor and 
> the speed of each processor are probably not the same.
>
> I recommend an experimental approach.  If you are considering how to complete 
> a given set of runs, it is always a good idea to experiment:
>
> 1 Processor speed (if you've got enough memory for the whole run on one 
> processor)
> 2 Processor speed (if you've got enough memory for the whole run on 2 
> processors)
> 4 Processor speed (if you've got enough memory for the whole run on 4 
> processors)
>
> etc., until you get to 64.  You don't need to run for that many timesteps, 
> just enough to be sure that the length of the run is mostly consumed by the 
> timestepping and saving of files that you expect the production run to have.
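
For a short timing test like this, the run length can be cut down in the 
PARM03 namelist of the "data" file; a minimal sketch, with placeholder 
values:

   &PARM03
  # run only a few hundred steps, starting from iteration 0
    nIter0=0,
    nTimeSteps=200,
  # keep dumpFreq etc. the same as the production run so the cost of
  # saving files is included in the timing
   &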
>
> You can then plot the wallclock time versus the number of processors and see 
> what you get.  Or better yet, plot the execution time multiplied by the 
> number of processors versus the number of processors.  If the code were 
> scaling optimally, time*cpus would be constant, regardless of the number of 
> processors.
>
> In practice, there will be a 'sweet spot' where you will see the best 
> performance, or at least where the performance is comparable to the 
> performance on one processor.  After this, the performance will drop suddenly 
> to much lower values.  For example, you might find 8 cpu*hours on one 
> processor, 8.1 cpu*hours on 4 processors, 8.5 cpu*hours on 16 processors and 
> 30 cpu*hours on 64 processors.  Obviously, you want to use 16 cpus, not 64. 
> A curiosity of modern systems may even make one spot better due to the size 
> of the memory.  It is possible that you may get, say, 7.8 cpu*hours on 8 
> processors because the whole model fits in cache memory with 8 processors and 
> the memory speedup exceeds the communication slowdown.  This sweet spot will 
> typically occur on fewer processors in a cluster than it will in a more 
> high-performance system.  Basically, the slow communication between 
> processors in the cluster means that adding more processors eventually leads 
> to only minimal improvements in execution time.
>
> In any case, you then know the best way to execute your code.  Let's say you 
> find that 8 processors is near optimal, as in the example above.  Then, you 
> just submit 8 jobs, each on 8 processors, *at the same time*, covering the 
> runs you need to do: control runs, different parameters, different 
> configurations, etc.  This gives you 'embarrassingly parallel' performance 
> across the different jobs (the geek term for submitting jobs that do not rely 
> on interconnected cpus) and near-optimal scaleup within each run...  Overall, 
> you will get execution close to 64x faster than on a single processor.  Of 
> course, maybe you are only trying to do one big run, not 8 little ones--in 
> that case you're out of luck...
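
A minimal sketch of what submitting the 8 independent jobs could look like 
(the batch command is a placeholder for whatever the local scheduler uses; 
mitgcmuv is the MITgcm executable):

  # launch 8 independent 8-cpu runs at the same time, one per run directory
  for d in run01 run02 run03 run04 run05 run06 run07 run08 ; do
      ( cd $d && bsub -n 8 mpirun -np 8 ./mitgcmuv )
  done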
>
> Also, during this experimentation you can try some of the flags that might 
> have a big impact on performance, e.g., useSingleCpuIO, or change the 
> frequency of saving to disk, or try different compiler optimizations (-O2, 
> -O3, -qhot, etc.).
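
The optimization level normally comes from the FOPTIM variable in the 
build-options file ("optfile") used by genmake2, so a sketch of trying a 
different setting would be (paths and the optfile name are placeholders for 
the ones used on this cluster):

  # edit FOPTIM in the chosen build_options file, e.g.
  FOPTIM='-O3'

  # then reconfigure and rebuild
  ../../../tools/genmake2 -mods=../code -of=<my_optfile> -mpi
  make depend
  make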
>
> This is the best I can do without more specifics, but good luck!
>  -Baylor
>
>
>
> On Mar 30, 2007, at 12:48 AM, Van Thinh Nguyen wrote:
>
>> Hi,
>> 
>> I tried to run MITgcm on a cluster, using 64 CPUs.  Unfortunately, in 
>> comparison with the same run on shared memory (SMP, using 32 CPUs), the 
>> code runs very slowly on the cluster.  Does anyone have an idea how to 
>> speed up the run on a cluster?
>> 
>> Thanks,
>> 
>> Van Thinh
>> 
>> -------------------------------------------------
>> 
>
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-support


