[MITgcm-support] MITgcm on clusters!
Baylor Fox-Kemper
baylor at MIT.EDU
Fri Mar 30 09:38:39 EDT 2007
Hi Van Thinh,
The performance will be quite different, since sharing data across
the processors will typically be much slower on a cluster
(especially if it is built with a gigabit Ethernet connection between
the processors rather than Myrinet or InfiniBand). Also, the size
of the memory on each processor and the speed of each processor are
probably not the same.
I recommend an experimental approach. If you are considering how to
complete a given set of runs, it is always a good idea to experiment:
1-processor speed (if you've got enough memory for the whole run on
one processor)
2-processor speed (if you've got enough memory for the whole run on 2
processors)
4-processor speed (if you've got enough memory for the whole run on 4
processors)
etc., until you get to 64. You don't need to run for that many
timesteps, just enough to be sure that the length of the run is
mostly consumed by the timestepping and saving of files that you
expect the production run to have.
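If it helps, here is a very rough Python sketch of how you might
script those short timing runs. The run-directory names, the "mpirun"
launcher, and the "mitgcmuv" executable name are assumptions about
your setup, and remember that with MITgcm each processor count
generally needs its own SIZE.h and its own build, so each directory
would hold its own executable:

# Rough sketch: time a short run at each processor count.  Assumes one
# run directory per count (run_np1, run_np2, ...), each holding a
# mitgcmuv built for that decomposition, and an "mpirun" launcher --
# all of these names are guesses about your setup.
import subprocess
import time

for n in [1, 2, 4, 8, 16, 32, 64]:
    rundir = "run_np%d" % n
    start = time.time()
    subprocess.run(["mpirun", "-np", str(n), "./mitgcmuv"],
                   cwd=rundir, check=True)
    wallclock = time.time() - start
    print("%2d processors: %.1f s wallclock, %.1f cpu-seconds"
          % (n, wallclock, n * wallclock))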
You can then plot the wallclock time versus the number of processors
and see what you get. Or, better yet, plot the execution time
multiplied by the number of processors versus the number of
processors. If the code were scaling optimally, the time*cpus would
be constant, regardless of the number of processors.
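Just as a sketch of what that plot might look like in
Python/matplotlib (the timing numbers here are made-up placeholders;
substitute the wallclock times you actually measured):

import matplotlib.pyplot as plt

# Made-up placeholder timings; replace with your measured wallclock seconds.
cpus      = [1, 2, 4, 8, 16, 32, 64]
wallclock = [3600, 1850, 950, 500, 280, 180, 160]
cpu_time  = [n * t for n, t in zip(cpus, wallclock)]  # time * cpus

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.loglog(cpus, wallclock, "o-")
ax1.set_xlabel("processors")
ax1.set_ylabel("wallclock time (s)")
ax2.semilogx(cpus, cpu_time, "o-")
ax2.set_xlabel("processors")
ax2.set_ylabel("time * cpus (cpu-s)")  # flat curve = optimal scaling
fig.tight_layout()
fig.savefig("scaling.png")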
In practice, there will be a 'sweet spot' where you will see the best
performance, or at least where the performance is comparable to the
performance on one processor. After this, the performance will drop
suddenly to much lower values. For example, you might find 8
cpu*hours on one processor, 8.1 cpu*hours on 4 processors, 8.5
cpu*hours on 16 processors and 30 cpu*hours on 64 processors.
Obviously, you want to use 16 cpus, not 64. A curiosity of modern
systems may even make one spot better due to the size of the memory.
It is possible that you may get, say, 7.8 cpu*hours on 8 processors
because the whole model fits in cache memory with 8 processors and
the memory speedup exceeds the communication slowdown. This sweet
spot will typically occur on fewer processors in a cluster than it
will in a more high-performance system. Basically, the slow
communication between processors in the cluster means that adding
more processors eventually leads to only minimal improvements in
execution time.
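In code, picking that sweet spot from the (entirely made-up) numbers
above is just a matter of finding the largest processor count whose
cpu*hours is still close to the single-processor cost; the 10%
tolerance here is arbitrary:

# cpu*hours from the illustrative example above.
cpu_hours = {1: 8.0, 4: 8.1, 8: 7.8, 16: 8.5, 64: 30.0}

budget = 1.10 * cpu_hours[1]   # allow ~10% overhead over the 1-cpu cost
sweet_spot = max(n for n, cost in cpu_hours.items() if cost <= budget)
print("use", sweet_spot, "processors")   # 16 with these numbers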
In any case, you then know the way to execute your code. Let's say
you find that 8 processors is near optimal, as in the example above.
Then, you just submit 8 jobs, each on 8 processors, *at the same
time*, covering the runs you need to do: control runs, different
parameters, different configurations, etc. This gives you
'embarrassingly parallel' performance across the different jobs (the
geek term for submitting jobs that do not rely on interconnected
cpus) and near-optimal scaleup within each run... Overall, you will
get execution close to 64x faster than on a single processor. Of
course, you may only be trying to do one big run, not 8 little
ones--in that case you're out of luck...
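A sketch of that "submit them all at once" step, assuming a PBS-style
scheduler with a "qsub" command and one directory per run holding a
submit script called job.sh (all of these names are placeholders for
whatever your cluster actually uses):

import subprocess

# Hypothetical run directories -- one per experiment, each configured
# for the near-optimal processor count found above.
run_dirs = ["control", "high_visc", "low_visc", "double_res",
            "exp_a", "exp_b", "exp_c", "exp_d"]

for d in run_dirs:
    # Submit every job at once; the scheduler runs them side by side.
    subprocess.run(["qsub", "job.sh"], cwd=d, check=True)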
Also, during this experimentation you can try some of the flags that
might have a big impact on performance, e.g., useSingleCpuIO, or
change the frequency of saving to disk, or try different compiler
optimizations (-O2, -O3, -qhot, etc.).
This is the best I can do without more specifics, but good luck!
-Baylor
On Mar 30, 2007, at 12:48 AM, Van Thinh Nguyen wrote:
> Hi,
>
> I tried to run MITgcm on a cluster, using 64 cpus. Unfortunately,
> in comparison with the same run on shared memory (SMP) (using 32
> cpus), the code runs very slowly on the cluster. Does anyone
> have an idea to speed up the run on a cluster?
>
> Thanks,
>
> Van Thinh
>