[MITgcm-support] MITgcm on clusters!
Van Thinh Nguyen
vtnguyen at moisie2.math.uwaterloo.ca
Fri Mar 30 10:59:54 EDT 2007
Hi Baylor and Matt,
Thanks so much for your advice. I will first try useSingleCpuIO and some
"FLAGS" options, and will let you know what happens.
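If I understand it correctly, that flag goes into the PARM01 namelist of the
run-time "data" file, something like this (please correct me if I have the
place wrong):

    # in the file "data"
     &PARM01
      useSingleCpuIO=.TRUE.,
     &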
I know Baylor's hints are valuable, but my model is quite big. I definitely
can't run it on fewer than 8 CPUs on my cluster, because when I used 32 CPUs
it showed that the model needs this much memory on each CPU:
VIRT=162 GB
RES=2.3 GB
I use a resolution of 4000*500*60 (run non-hydrostatic with a free
surface) to study internal solitary waves in an estuary. The tiles are
divided as:
sNx = 125,
sNy = 250,
OLx = 3,
OLy = 3,
nSx = 1,
nSy = 1,
nPx = 32,
nPy = 2,
Nx = sNx*nSx*nPx,
Ny = sNy*nSy*nPy,
Nr = 60
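As a rough check of those numbers (just my own back-of-envelope estimate,
not an exact count of the model's arrays, which depends on the packages
compiled in and on the non-hydrostatic solver), one real*8 3-D field on one
tile takes about:

    sNx, sNy, OLx, OLy, Nr = 125, 250, 3, 3, 60
    one_field = (sNx + 2*OLx) * (sNy + 2*OLy) * Nr * 8   # bytes per 3-D field
    print(one_field / 1e6)          # about 16 MB per field
    print(150 * one_field / 1e9)    # about 2.4 GB for ~150 such fields

so a hundred or two of such fields per process is already in the range of
the RES value above.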
Info of my cluster:
Operating system: HP Linux XC 3.0
Interconnect: Quadrics Elan4
Nodes: 1-768
CPU: 2 x Opteron 2.60 GHz
Memory: 8 GB
Local disk storage: 160 GB
Total number of processors/cores: 1536
Of course, I can use a smaller case to follow Baylor's hints.
Thanks
Van Thinh
---------------------------------------------------------
On Fri, 30 Mar 2007, Baylor Fox-Kemper wrote:
> Hi Van Thinh,
> The performance will be quite different, since memory sharing across the
> processors will typically be much slower on a cluster (especially if it is
> built with a gigabit ethernet connection between the processors rather than a
> myrinet or infiniband). Also, the size of the memory on each processor and
> the speed of each processor is probably not the same.
>
> I recommend an experimental approach. If you are considering how to complete
> a given set of runs, it is always a good idea to experiment:
>
> 1 Processor speed (if you've got enough memory for the whole run on one
> processor)
> 2 Processor speed (if you've got enough memory for the whole run on 2
> processors)
> 4 Processor speed (if you've got enough memory for the whole run on 4
> processors)
>
> etc., until you get to 64. You don't need to run for that many timesteps,
> just enough to be sure that the length of the run is mostly consumed by the
> timestepping and saving of files that you expect the production run to have.
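> Just as a sketch of what I mean (the launch command, the build directory
> names, and the executable paths here are placeholders; use whatever your
> cluster provides, with a SIZE.h whose nPx*nPy matches each processor count):
>
>     import subprocess, time
>     for ncpu in [1, 2, 4, 8, 16, 32, 64]:
>         t0 = time.time()
>         subprocess.run(["mpirun", "-np", str(ncpu),
>                         "./build_%dcpu/mitgcmuv" % ncpu], check=True)
>         wall = time.time() - t0
>         print(ncpu, wall, ncpu * wall)   # cpus, wallclock, cpu*time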
>
> You can then plot the wallclock time versus the number of processors and
> see what you get. Or better yet, plot the execution time multiplied by the
> number of processors against the number of processors. If the code were
> scaling optimally, time*cpus would be constant, regardless of the number of
> processors.
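> For the plot, something as simple as this would do (assuming you saved the
> (cpus, wallclock) pairs from the short runs in a text file, say scaling.txt):
>
>     import numpy as np
>     import matplotlib.pyplot as plt
>     ncpu, wall = np.loadtxt("scaling.txt", unpack=True)
>     plt.plot(ncpu, ncpu * wall, "o-")   # a flat line means perfect scaling
>     plt.xlabel("number of processors")
>     plt.ylabel("wallclock time * processors")
>     plt.savefig("scaling.png")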
>
> In practice, there will be a 'sweet spot' where you will see the best
> performance, or at least where the performance is comparable to the
> performance on one processor. After this, the performance will drop suddenly
> to much lower values. For example, you might find 8 cpu*hours on one
> processor, 8.1 cpu*hours on 4 processors, 8.5 cpu*hours on 16 processors and
> 30 cpu*hours on 64 processors. Obviously, you want to use 16 cpus, not 64.
> A curiosity of modern systems may even make one spot better due to the size
> of the memory. It is possible that you may get, say, 7.8 cpu*hours on 8
> processors because the whole model fits in cache memory with 8 processors and
> the memory speedup exceeds the communication slowdown. This sweet spot will
> typically occur on fewer processors in a cluster than it will in a more
> high-performance system. Basically, the slow communication between
> processors in the cluster means that adding more processors eventually leads
> to only minimal improvements in execution time.
>
> In any case, you then know the way to execute your code. Let's say you
> find that 8 processors is near optimal as in the example above. Then, you
> just submit 8 jobs, each on 8 processors, *at the same time* covering the
> runs you need to do, control runs, different parameters, different
> configurations, etc. This gives you 'embarrassingly parallel' performance
> across the different jobs (the geek term for submitting jobs that do not rely
> on interconnected cpus) and near-optimal scaleup within each run. Overall,
> you will come close to execution 64x faster than on a single processor. Of
> course, maybe you are only trying to do one big run, not 8 little ones--in
> that case you're out of luck...
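> If you do have several independent runs, the submission itself can be as
> simple as a loop over run directories (the submit command and the run.sh
> wrapper below are only placeholders for whatever your queue system uses):
>
>     import subprocess
>     for d in ["run_control", "run_case1", "run_case2"]:   # your 8 cases
>         subprocess.run(["qsub", d + "/run.sh"], check=True)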
>
> Also, during this experimentation you can try some of the flags that might
> have a big impact on performance, e.g., useSingleCpuIO, or change the
> frequency of saving to disk, or try different compiler optimizations (-O2,
> -O3, -qhot, etc.)
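> With genmake2 these usually go into the build options file you pass with
> -of; the optimization level is typically set there through something like
>
>     FOPTIM='-O3'
>
> (the exact variable name depends on the optfile for your compiler).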
>
> This is the best I can do without more specifics, but good luck!
> -Baylor
>
>
>
> On Mar 30, 2007, at 12:48 AM, Van Thinh Nguyen wrote:
>
>> Hi,
>>
>> I tried to run MITgcm on a cluster using 64 CPUs. Unfortunately, in
>> comparison with the same run on a shared-memory (SMP) machine (using 32
>> CPUs), the code runs very slowly on the cluster. Does anyone have an idea
>> how to speed up the run on a cluster?
>>
>> Thanks,
>>
>> Van Thinh
>>
>> -------------------------------------------------
>>
>> _______________________________________________
>> MITgcm-support mailing list
>> MITgcm-support at mitgcm.org
>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-support