[MITgcm-support] speedup for cs64 on a linux cluster
Stefano Querin
squerin at ogs.trieste.it
Thu May 24 13:38:12 EDT 2012
Hi everybody,
just a few considerations (by a NON-expert). Sorry for any imprecision
regarding technical aspects and terminology...
I run the MITgcm on a small Linux cluster (44 cores in total) and I have
faced some scalability problems related to both I/O and the inter-node
connection.
Recently, I also had rather big scalability problems with a new node
(24 cores): an Sgi H2106-G7 (two 12-core Opteron 6172 CPUs, 2.1 GHz clock).
I wrote to this mailing list about these problems (the "Scalability on a
new Sgi node" thread): thanks to Constantinos, Jean-Michel and Martin for
their very helpful suggestions!
> Things like I/O heavily affect the performance (and scaling). You
> have to test whether useSingleCpuIO helps, set debugLevel=-1, increase
> the monitorFreq, etc. Or read from and write to different (local)
> file systems that are faster. For a scaling analysis I'd turn off
> all I/O for a start and then later on start with some I/O; see the
> above link to Hill et al.
As regards the I/O, we experienced scalability problems some years
ago: in short, we solved them by doing the I/O locally on each
compute node (with useSingleCpuIO=.FALSE.) and then copying all the
files back to the front-end node (see, for example, the script at the
end of this message).
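For reference, these are the run-time flags we are talking about, as they
would appear in the "data" namelist file (just a sketch: the monitor
interval is a placeholder value, and you should check the parameter
placement against your own configuration):

 &PARM01
# let each MPI process write its own tile files (e.g. to a local disk)
 useSingleCpuIO=.FALSE.,
# reduce the amount of standard output
 debugLevel=-1,
 &

 &PARM03
# write monitor statistics less often (interval in seconds, placeholder)
 monitorFreq=86400.,
 &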
> There are many other factors that affect scaling, the most important
> one being the architecture you are on. I have access to computers
> where the exchange between cores on one node is fast, but the
> exchange between nodes is slow, so that when your CPU count exceeds
> the number of CPUs per node, scaling goes down.
We also had this problem: a slow inter-node connection heavily affected
the speed-up. The STDOUT diagnostics help to understand whether your
application is "squeezing the hardware" properly: if your "system
time" is of the same order of magnitude as (or only one order below)
the "user time", something is not working properly.
You can also try some profiling tools; see the sketch below.
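As a concrete example (a sketch only: the exact label in MITgcm's timer
summary may differ between versions, and GNU time is assumed to be
installed), you can compare user and system time like this:

# user vs. system time from MITgcm's own end-of-run timer summary
grep -A 4 'Seconds in section "ALL' STDOUT.0000

# or time the whole MPI job from the outside with GNU time
/usr/bin/time -v mpiexec -np 24 ./mitgcmuv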
> A further issue is that with multicore chips there is a bandwidth
> issue, because many cores try to access main memory through the same
> bus. In that case the only thing that helps is to use only some of
> the cores on a chip.
This was the problem with the new node. The main issue was memory-core
binding, which we were not able to exploit properly, partly because of
the old version of the compiler (PGI 6.1). We optimized the system by
switching to an up-to-date version of the (open source) Open64 compiler
(http://www.open64.net/home.html or
http://developer.amd.com/tools/open64/Pages/default.aspx) and by using
the MPICH2 implementation of MPI
(http://www.mcs.anl.gov/research/projects/mpich2/index.php).
To compile and install MPICH2, try: http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.4.1-installguide.pdf
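For what it is worth, the build boils down to something like this (a
minimal sketch, assuming the Open64 compiler drivers are called opencc
and openf90 and that you install under your home directory; follow the
install guide above for the details):

tar xzf mpich2-1.4.1.tar.gz
cd mpich2-1.4.1
# build MPICH2 with the Open64 compilers
./configure --prefix=$HOME/mpich2-1.4.1 CC=opencc F77=openf90 FC=openf90
make
make install
# make the MPICH2 wrappers (mpicc, mpif90, mpiexec) visible
export PATH=$HOME/mpich2-1.4.1/bin:$PATH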
We used these optimization flags (in "build_options"):
FOPTIM='-O2 -march={yourarchitecture} -ipa -OPT:Ofast -fno-math-errno -ffast-math -LNO:fusion=2 -OPT:roundoff=1:IEEE_arithmetic=3'
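(For completeness, this is roughly how such an optfile is picked up when
building the model; the optfile name below is a placeholder, not our
actual file.)

cd build
# point genmake2 at the optfile that contains the FOPTIM line above
../tools/genmake2 -mpi -mods=../code -of={your_optfile}
make depend
make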
We then launched the run with mpiexec:
$mpiexec -np $NSLOTS -binding cpu:cores -f $TMPDIR/machines.local $basedir/$wrkdir/mitgcmuv
The most important option is "-binding cpu:cores": it really
slashed the wall clock time! We obtained very good scaling (up to 24
cores) on the node, quite close to linear. We also tested numactl
options (--membind=..., --cpunodebind=...), but we did not obtain the
improvement we had with "-binding cpu:cores".
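If you want to check where the processes actually end up, something like
this can help (numactl --hardware is standard; the executable name below
is the one from our mpiexec line):

# show the NUMA layout of the node
numactl --hardware
# print the CPU affinity of each running mitgcmuv process
for pid in $(pidof mitgcmuv); do taskset -cp $pid; done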
Of course, using more cores (and hence other nodes as well) no longer
gave such good scaling.
We did several other tests and tried other optimization flags, but
"the heart of the matter" is memory-core binding (or affinity).
You can find very interesting ideas and hints for multicore systems
here:
http://blogs.fau.de/hager/files/2011/02/PPoPP11-Tutorial-Multicore-final.pdf
Hope this helps.
Cheers!
Stefano
> On May 22, 2012, at 7:31 PM, Angela Zalucha wrote:
>
>> I also have not found any scaling analysis anywhere, but here is
>> the test I performed: I essentially run the 3D Held-Suarez cs
>> experiment (with slightly more advanced RT) with 30 levels. The
>> test was performed on the Texas Advanced Computing Center Lonestar
>> Linux cluster. The test went for 120,000 iterations.
>>
>> I attached a plot. The number of processors increases by powers of
>> 2 times 12 (i.e. 12, 24, 48, 96, 192, 384, 768, 1536). I did not
>> plot the ideal case, but it would be a line. The scaling is not
>> ideal at large numbers of processors. Oddly, the scaling is also
>> not constant, e.g. 48 to 96 and 192 to 384 produce a greater
>> improvement than 96 to 192.
>>
>> Also, I noticed that for a given number of processors, lower nSx and
>> nSy are always faster.
>>
>> In 2D, on my group's local (Linux) cluster at Southwest, 2 procs is
>> better than 1 proc, but 4 procs actually runs slower.
>>
>> Angela
>>
>>
>>
>> On Tue, 22 May 2012, Maura BRUNETTI wrote:
>>
>>> Dear MITgcm users,
>>> I am studying scaling properties of ocean-only configurations on a
>>> Linux cluster.
>>> The results shown in the attached figure are obtained with a cubed
>>> sphere configuration with 64x64 face resolution and 15
>>> vertical levels (points in the figure correspond to: 6 tiles 64x64
>>> on 1 proc, 1 tile 64x64 on 6 procs, 1 tile 32x64 on 12
>>> procs, 1 tile 32x32 on 24 procs and 1 tile 16x32 on 48 procs).
>>> Only packages GMredi and tave are activated at run time.
>>> The scaling is not very good, starting already at 12 procs (see
>>> blue line). I have not found other scaling analyses in the
>>> literature; could you please suggest where I can find them?
>>> From my analysis, I have seen that it is not worth
>>> doing runs with a tile dimension smaller than 32 grid points. Is
>>> that correct?
>>> Thanks,
>>> Maura --
>>> Dr. Maura Brunetti
>>> Institute for Environmental Sciences (ISE)
>>> University of Geneva -- Switzerland
>> <timevsproc_plot.eps>
>
> Martin Losch
> Martin.Losch at awi.de
**************************************
Script example
**************************************
#!/bin/sh
#$ -S /bin/sh
# use current working directory
#$ -cwd
# merged output
#$ -j y
#$ -pe mpich 24
wrkdir={yourworkingdirectory}
fecwd=$(pwd)
# local I/O
basedir={localIOdir}
# NSLOTS is set by the job scheduler
MPI_HOST={yourhostname}
#$ -v MPI_HOST
frontend={yourfrontendnode}
mpirun={yourcompilerpath}/mpirun
MPI_HOST=$frontend
#$ -v MPI_HOST
echo "sge machines (modified):"
cat $TMPDIR/machines
sed "s/$/.local/" $TMPDIR/machines > $TMPDIR/machines.local
echo "machines.local: "
cat $TMPDIR/machines.local
echo "copy working directory in local scratch space"
mynodes=$( uniq $TMPDIR/machines.local )
for node in $mynodes
do
ssh -n $node mkdir $basedir/$wrkdir
scp $frontend:$fecwd/* $node:$basedir/$wrkdir
done
cd $basedir/$wrkdir
echo "launching mpirun..."
$mpirun -leave_pg -np $NSLOTS -machinefile $TMPDIR/machines.local $basedir/$wrkdir/{yourexecutable}
echo "copy back local scratch space in working directory on front end"
for node in $mynodes
do
ssh -n $frontend mkdir $fecwd/results_$node
scp -rp $node:$basedir/$wrkdir $frontend:$fecwd/results_$node
ssh -n $node rm -R $basedir/$wrkdir
done
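(The script is written for SGE, so it is submitted with something like
the following; the script name is just a placeholder.)

qsub {yourscriptname}
# check that the requested 24 slots of the "mpich" PE were granted
qstat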