[MITgcm-support] MITgcm on clusters!
Baylor Fox-Kemper
baylor at MIT.EDU
Fri Mar 30 09:38:39 EDT 2007
Hi Van Thinh,
The performance will be quite different, since sharing data across
the processors will typically be much slower on a cluster
(especially if it is built with a gigabit Ethernet connection between
the processors rather than Myrinet or InfiniBand). Also, the size
of the memory on each processor and the speed of each processor are
probably not the same.
I recommend an experimental approach. If you are considering how to
complete a given set of runs, it is always a good idea to experiment:
1-processor speed (if you've got enough memory for the whole run on
one processor)
2-processor speed (if you've got enough memory for the whole run on 2
processors)
4-processor speed (if you've got enough memory for the whole run on 4
processors)
etc., until you get to 64. You don't need to run for that many
timesteps, just enough to be sure that the length of the run is
mostly consumed by the timestepping and saving of files that you
expect the production run to have.
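If it helps, here is a very rough Python sketch of how you might
script those short timing runs. The run-directory names, the "mpirun"
launcher, and the "mitgcmuv" executable name are assumptions about
your setup, and remember that with MITgcm each processor count
generally needs its own SIZE.h and its own build, so each directory
would hold its own executable:

# Rough sketch: time a short run at each processor count.  Assumes one
# run directory per count (run_np1, run_np2, ...), each holding a
# mitgcmuv built for that decomposition, and an "mpirun" launcher --
# all of these names are guesses about your setup.
import subprocess
import time

for n in [1, 2, 4, 8, 16, 32, 64]:
    rundir = "run_np%d" % n
    start = time.time()
    subprocess.run(["mpirun", "-np", str(n), "./mitgcmuv"],
                   cwd=rundir, check=True)
    wallclock = time.time() - start
    print("%2d processors: %.1f s wallclock, %.1f cpu-seconds"
          % (n, wallclock, n * wallclock))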
You can then plot the wallclock time versus the number of processors
and see what you get. Or, better yet, plot the execution time
multiplied by the number of processors versus the number of
processors. If the code were scaling optimally, the time*cpus would
be constant, regardless of the number of processors.
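Just as a sketch of what that plot might look like in
Python/matplotlib (the timing numbers here are made-up placeholders;
substitute the wallclock times you actually measured):

import matplotlib.pyplot as plt

# Made-up placeholder timings; replace with your measured wallclock seconds.
cpus      = [1, 2, 4, 8, 16, 32, 64]
wallclock = [3600, 1850, 950, 500, 280, 180, 160]
cpu_time  = [n * t for n, t in zip(cpus, wallclock)]  # time * cpus

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.loglog(cpus, wallclock, "o-")
ax1.set_xlabel("processors")
ax1.set_ylabel("wallclock time (s)")
ax2.semilogx(cpus, cpu_time, "o-")
ax2.set_xlabel("processors")
ax2.set_ylabel("time * cpus (cpu-s)")  # flat curve = optimal scaling
fig.tight_layout()
fig.savefig("scaling.png")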
In practice, there will be a 'sweet spot' where you will see the best
performance, or at least where the performance is comparable to the
performance on one processor. After this, the performance will drop
suddenly to much lower values. For example, you might find 8
cpu*hours on one processor, 8.1 cpu*hours on 4 processors, 8.5
cpu*hours on 16 processors and 30 cpu*hours on 64 processors.
Obviously, you want to use 16 cpus, not 64. A curiosity of modern
systems may even make one spot better due to the size of the memory.
It is possible that you may get, say, 7.8 cpu*hours on 8 processors
because the whole model fits in cache memory with 8 processors and
the memory speedup exceeds the communication slowdown. This sweet
spot will typically occur on fewer processors in a cluster than it
will in a more high-performance system. Basically, the slow
communication between processors in the cluster means that adding
more processors eventually leads to only minimal improvements in
execution time.
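In code, picking that sweet spot from the (entirely made-up) numbers
above is just a matter of finding the largest processor count whose
cpu*hours is still close to the single-processor cost; the 10%
tolerance here is arbitrary:

# cpu*hours from the illustrative example above.
cpu_hours = {1: 8.0, 4: 8.1, 8: 7.8, 16: 8.5, 64: 30.0}

budget = 1.10 * cpu_hours[1]   # allow ~10% overhead over the 1-cpu cost
sweet_spot = max(n for n, cost in cpu_hours.items() if cost <= budget)
print("use", sweet_spot, "processors")   # 16 with these numbers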
In any case, you then know the way to execute your code. Let's say
you find that 8 processors is near optimal, as in the example above.
Then, you just submit 8 jobs, each on 8 processors, *at the same
time*, covering the runs you need to do: control runs, different
parameters, different configurations, etc. This gives you
'embarrassingly parallel' performance across the different jobs (the
geek term for submitting jobs that do not rely on interconnected
cpus) and near-optimal scaleup within each run... Overall, you will
get execution close to 64x faster than on a single processor. Of
course, you may only be trying to do one big run, not 8 little
ones--in that case you're out of luck...
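A sketch of that "submit them all at once" step, assuming a PBS-style
scheduler with a "qsub" command and one directory per run holding a
submit script called job.sh (all of these names are placeholders for
whatever your cluster actually uses):

import subprocess

# Hypothetical run directories -- one per experiment, each configured
# for the near-optimal processor count found above.
run_dirs = ["control", "high_visc", "low_visc", "double_res",
            "exp_a", "exp_b", "exp_c", "exp_d"]

for d in run_dirs:
    # Submit every job at once; the scheduler runs them side by side.
    subprocess.run(["qsub", "job.sh"], cwd=d, check=True)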
Also, during this experimentation you can try some of the flags that
might have a big impact on performance, e.g., useSingleCpuIO, or
change the frequency of saving to disk, or try different compiler
optimizations (-O2, -O3, -qhot, etc.).
This is the best I can do without more specifics, but good luck!
-Baylor
On Mar 30, 2007, at 12:48 AM, Van Thinh Nguyen wrote:
> Hi,
>
> I tried to run MITgcm on a cluster, using 64 cpus. Unfortunately,
> in comparison with the same run on shared memory (SMP) (using 32
> cpus), the code runs very slowly on the cluster. Does anyone
> have an idea to speed up the run on a cluster?
>
> Thanks,
>
> Van Thinh
>