[MITgcm-support] speedup for cs64 on a linux cluster
Stefano Querin
squerin at ogs.trieste.it
Thu May 24 13:38:12 EDT 2012
Hi everybody,
just a few considerations (by a NON-expert). Sorry for any imprecision
regarding technical aspects and terminology...
I run the MITgcm on a small Linux cluster (44 cores in total) and I have
faced some scalability problems related to both I/O and the inter-node
connection.
Recently, I also had rather big scalability problems with a new node
(24 cores): an Sgi H2106-G7 (two 12-core Opteron 6172 CPUs, 2.1 GHz clock).
I wrote to this mailing list about these problems (the "Scalability on a
new Sgi node" thread): thanks to Constantinos, Jean-Michel and Martin for
their very helpful suggestions!
> Things like I/O heavily affect the performance (and scaling). You
> have to test whether useSingleCpuIO helps, set debugLevel=-1, increase
> the monitorFreq, etc. Or read from and write to different (local)
> file systems that are faster. For a scaling analysis I'd turn off
> all I/O for a start and then later on start with some I/O; see the
> above link to Hill et al.
As regards the I/O, we experienced scalability problems some years
ago: in short, we solved them by doing the I/O locally on each
compute node (with useSingleCpuIO=.FALSE.) and then copying all the
files back to the front-end node (see, for example, the script at the
end of this message).
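For reference, these are the run-time flags we are talking about, as they
would appear in the "data" namelist file (just a sketch: the monitor
interval is a placeholder value, and you should check the parameter
placement against your own configuration):

 &PARM01
# let each MPI process write its own tile files (e.g. to a local disk)
 useSingleCpuIO=.FALSE.,
# reduce the amount of standard output
 debugLevel=-1,
 &

 &PARM03
# write monitor statistics less often (interval in seconds, placeholder)
 monitorFreq=86400.,
 &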
> There are many other factors that affect scaling, the most important
> one being the architecture you are on. I have access to computers
> where the exchange between cores on one node is fast, but the
> exchange between nodes is slow, so that when your CPU count exceeds
> the number of CPUs per node, scaling goes down.
We also had this problem: a slow inter-node connection heavily affected
the speed-up. The STDOUT diagnostics help to understand whether your
application is "squeezing the hardware" properly: if your "system
time" is of the same order of magnitude as (or only one order below)
the "user time", something is not working properly.
You can also try some profiling tools; see the sketch below.
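As a concrete example (a sketch only: the exact label in MITgcm's timer
summary may differ between versions, and GNU time is assumed to be
installed), you can compare user and system time like this:

# user vs. system time from MITgcm's own end-of-run timer summary
grep -A 4 'Seconds in section "ALL' STDOUT.0000

# or time the whole MPI job from the outside with GNU time
/usr/bin/time -v mpiexec -np 24 ./mitgcmuv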
> A further issue is that with multicore chips there is a bandwidth
> issue, because many cores try to access main memory through the same
> bus. In that case the only thing that helps is to use only some of
> the cores on a chip.
This was the problem with the new node. The main issue was memory-core
binding, which we were not able to exploit properly, partly because of
the old version of the compiler (PGI 6.1). We optimized the system by
switching to an up-to-date version of the (open source) Open64 compiler
(http://www.open64.net/home.html or
http://developer.amd.com/tools/open64/Pages/default.aspx) and by using
the MPICH2 implementation of MPI
(http://www.mcs.anl.gov/research/projects/mpich2/index.php).
To compile and install MPICH2, try: http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.4.1-installguide.pdf
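For what it is worth, the build boils down to something like this (a
minimal sketch, assuming the Open64 compiler drivers are called opencc
and openf90 and that you install under your home directory; follow the
install guide above for the details):

tar xzf mpich2-1.4.1.tar.gz
cd mpich2-1.4.1
# build MPICH2 with the Open64 compilers
./configure --prefix=$HOME/mpich2-1.4.1 CC=opencc F77=openf90 FC=openf90
make
make install
# make the MPICH2 wrappers (mpicc, mpif90, mpiexec) visible
export PATH=$HOME/mpich2-1.4.1/bin:$PATH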
We used these optimization flags (in "build_options"):
FOPTIM='-O2 -march={yourarchitecture} -ipa -OPT:Ofast -fno-math-errno -ffast-math -LNO:fusion=2 -OPT:roundoff=1:IEEE_arithmetic=3'
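(For completeness, this is roughly how such an optfile is picked up when
building the model; the optfile name below is a placeholder, not our
actual file.)

cd build
# point genmake2 at the optfile that contains the FOPTIM line above
../tools/genmake2 -mpi -mods=../code -of={your_optfile}
make depend
make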
We then launched the run with mpiexec:
$mpiexec -np $NSLOTS -binding cpu:cores -f $TMPDIR/machines.local $basedir/$wrkdir/mitgcmuv
The most important option is "-binding cpu:cores": it really
slashed the wall clock time! We obtained very good scaling (up to 24
cores) on the node, quite close to linear. We also tested numactl
options (--membind=..., --cpunodebind=...), but we did not obtain the
improvement we had with "-binding cpu:cores".
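If you want to check where the processes actually end up, something like
this can help (numactl --hardware is standard; the executable name below
is the one from our mpiexec line):

# show the NUMA layout of the node
numactl --hardware
# print the CPU affinity of each running mitgcmuv process
for pid in $(pidof mitgcmuv); do taskset -cp $pid; done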
Of course, using more cores (and hence other nodes as well) no longer
gave such good scaling.
We did several other tests and tried other optimization flags, but
"the heart of the matter" is memory-core binding (or affinity).
You can find very interesting ideas and hints for multicore systems
here:
http://blogs.fau.de/hager/files/2011/02/PPoPP11-Tutorial-Multicore-final.pdf
Hope this helps.
Cheers!
Stefano
> On May 22, 2012, at 7:31 PM, Angela Zalucha wrote:
>
>> I also have not found any scaling analysis anywhere, but here is
>> the test I performed: I essentially run the 3D Held-Suarez cs
>> experiment (with slightly more advanced RT) with 30 levels. The
>> test was performed on the Texas Advanced Computing Center Lonestar
>> Linux cluster. The test went for 120,000 iterations.
>>
>> I attached a plot. The number of processors increases by powers of
>> 2 times 12 (i.e. 12, 24, 48, 96, 192, 384, 768, 1536). I did not
>> plot the ideal case, but it would be a line. The scaling is not
>> ideal at large numbers of processors. Oddly, the scaling is also
>> not constant, e.g. 48 to 96 and 192 to 384 produce a greater
>> improvement than 96 to 192.
>>
>> Also, I noticed that for a given number of processors, lower nSx and
>> nSy are always faster.
>>
>> In 2D, on my group's local (Linux) cluster at Southwest, 2 procs is
>> better than 1 proc, but 4 procs actually runs slower.
>>
>> Angela
>>
>>
>>
>> On Tue, 22 May 2012, Maura BRUNETTI wrote:
>>
>>> Dear MITgcm users,
>>> I am studying scaling properties of ocean-only configurations on a
>>> Linux cluster.
>>> The results shown in the attached figure are obtained with a cubed
>>> sphere configuration with 64x64 face resolution and 15
>>> vertical levels (points in the figure correspond to: 6 tiles 64x64
>>> on 1 proc, 1 tile 64x64 on 6 procs, 1 tile 32x64 on 12
>>> procs, 1 tile 32x32 on 24 procs and 1 tile 16x32 on 48 procs).
>>> Only packages GMredi and tave are activated at run time.
>>> The scaling is not very good, starting already at 12 procs (see
>>> blue line). I have not found other scaling analyses in the
>>> literature; could you please suggest where I can find them?
>>> From my analysis, I have seen that it is not worth
>>> doing runs with a tile dimension smaller than 32 grid points. Is
>>> that correct?
>>> Thanks,
>>> Maura --
>>> Dr. Maura Brunetti
>>> Institute for Environmental Sciences (ISE)
>>> University of Geneva -- Switzerland
>> <timevsproc_plot.eps>
>
> Martin Losch
> Martin.Losch at awi.de
**************************************
Script example
**************************************
#!/bin/sh
#$ -S /bin/sh
# use current working directory
#$ -cwd
# merged output
#$ -j y
#$ -pe mpich 24
wrkdir={yourworkingdirectory}
fecwd=$(pwd)
# local I/O
basedir={localIOdir}
# NSLOTS is set by the job scheduler
MPI_HOST={yourhostname}
#$ -v MPI_HOST
frontend={yourfrontendnode}
mpirun={yourcompilerpath}/mpirun
MPI_HOST=$frontend
#$ -v MPI_HOST
echo "sge machines (modified):"
cat $TMPDIR/machines
sed "s/$/.local/" $TMPDIR/machines > $TMPDIR/machines.local
echo "machines.local: "
cat $TMPDIR/machines.local
echo "copy working directory in local scratch space"
mynodes=$( uniq $TMPDIR/machines.local )
for node in $mynodes
do
ssh -n $node mkdir $basedir/$wrkdir
scp $frontend:$fecwd/* $node:$basedir/$wrkdir
done
cd $basedir/$wrkdir
echo "launching mpirun..."
$mpirun -leave_pg -np $NSLOTS -machinefile $TMPDIR/machines.local $basedir/$wrkdir/{yourexecutable}
echo "copy back local scratch space in working directory on front end"
for node in $mynodes
do
ssh -n $frontend mkdir $fecwd/results_$node
scp -rp $node:$basedir/$wrkdir $frontend:$fecwd/results_$node
ssh -n $node rm -R $basedir/$wrkdir
done
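(The script is written for SGE, so it is submitted with something like
the following; the script name is just a placeholder.)

qsub {yourscriptname}
# check that the requested 24 slots of the "mpich" PE were granted
qstat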