[MITgcm-support] Quad core performance notes

Blundell J.R. jeff at noc.soton.ac.uk
Fri Oct 24 09:57:28 EDT 2008


Further to the messages on 21st Oct (Martin Losch) and 22nd Oct
(Chris Hill) regarding scaling on SGI Altix quad-core systems,
perhaps I could contribute the following observations.
I recently started running MITgcm on a somewhat similar system, an
Altix ICE 8200 to be precise (see
http://www.sgi.com/products/servers/altix/ice/ for details).
Like the Weizmann system, it has quad-core Intel Xeons (though in
our case 2.83GHz E5440 "Harpertowns") and the same 667MHz memory.
I'm guessing your system has the same chipset ("Greencreek"), and
thus the memory bandwidth should be very similar.
The system was bought primarily to run the NEMO ocean model
(http://www.nemo-ocean.eu/), which was extensively benchmarked by
several vendors, who all commented on its heavy requirement for both
memory bandwidth and network performance. We were advised by SGI to
run with the nodes half-occupied, with the threads of execution
carefully placed on the cores to minimise contention for memory
bandwidth (and minimise contention for the shared cache on the chip).
I have since confirmed that the same applies to running MITgcm.
If I run with 64 cores (i.e. 64 MPI processes), I get the following:
full occupancy overran the wallclock limit of 11h 55m, and the
integration had only reached time 277200 s out of the desired
604800 s (i.e. 7 days).

Full occ: 277200 simulated seconds in 715 mins =  387.69 sec/min
Half occ: 604800 simulated seconds in 575 mins = 1051.8  sec/min

a speed ratio of 2.71.
By full occupancy, I mean using 8 compute nodes, each having
2 x quad-core processors; half occupancy means using 16 nodes
(with exclusive use, to prevent other people's jobs competing
for bandwidth) and using only 4 cores on each. Using the PBS job
scheduler, this looks like:

#PBS -l walltime=11:55:00
#PBS -l select=16:ncpus=8:mpiprocs=4
#PBS -l place=scatter:excl
...
echo PBS job ID is $PBS_JOBID
echo This job runs on the following machines:
echo `cat $PBS_NODEFILE | uniq`

echo "MITgcm job starting"
date
echo "-----------------------------"
limit coredumpsize 0
limit stacksize 800m
set numnodes = `wc $PBS_NODEFILE | awk '{ print $1 }'`

#! Create a machine file for MPI
cat $PBS_NODEFILE | head -$numnodes > host.file.$PBS_JOBID

#! Run the parallel MPI executable (nodes*ppn)

echo "Running mpirun -np $numnodes -hostfile host.file.$PBS_JOBID MITgcm"

setenv PSM_SHAREDPORTS 1
setenv LD_LIBRARY_PATH '/usr/mpi/mvapich-0.9.9/intel/lib/shared:/sw/Intel/fce/10.1.013/lib'
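# (The two setenv lines above are interconnect (PSM) and library-path
#  settings for this particular MVAPICH/Intel build; VIADEV_USE_AFFINITY=0
#  in the command below presumably stops MVAPICH doing its own process
#  pinning, so that mpi_place's explicit placement is what actually
#  takes effect.)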

time /usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh -np $numnodes -hostfile host.file.$PBS_JOBID \
     /usr/bin/env VIADEV_USE_AFFINITY=0 /fibre/jeff/mpi_place 4 2 \
     ./mitgcmuv >& output.txt.run9.part5

date
echo "-----------------------------"
echo "MITgcm job finished"

Any other scheduler will presumably have something equivalent.
The script "mpi_place" was provided by SGI, and distributes the threads onto
the cores in what SGI found to be the optimal way. You need to know arcane
stuff like how the cores on a node are numbered. Details from SGI I'm afraid;
I have thus far failed to find the info. on their webpages. I could tell
you the answer for our system; not sure it's the same for yours.
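
For what it's worth, the essence of such a wrapper is only a few lines.
The sketch below is NOT SGI's mpi_place (whose interface and placement
choices I can't reproduce here); it just illustrates the idea of pinning
each MPI process to an explicitly chosen core with taskset. The rank
variable MPIRUN_RANK (which MVAPICH's mpirun_rsh exports) and the example
core map are assumptions you would need to check and adapt for your own
system. This is presumably also why VIADEV_USE_AFFINITY=0 appears in the
job script above: it stops the MPI library's own affinity settings from
overriding whatever the placement wrapper chooses.

#!/bin/csh -f
# Hypothetical placement wrapper -- NOT SGI's mpi_place, just a sketch of
# the idea: pin each MPI process to a chosen core so that processes do not
# share an L2 cache or fight for the same memory bus.
# Usage:   place_sketch <procs-per-node> <program> [args]
# Assumes the launcher exports this process's rank as $MPIRUN_RANK
# (MVAPICH's mpirun_rsh does; other launchers use other variable names).
set ppn = $1
shift

# Which core each local rank should land on.  These numbers are purely
# illustrative; the real mapping depends on how your system numbers the
# cores across the two sockets and their shared L2 caches (check
# /proc/cpuinfo, or ask your vendor, as noted above).
set coremap = (0 2 4 6 1 3 5 7)

# Local rank on this node, then look up its core (csh arrays are 1-indexed).
@ localrank = $MPIRUN_RANK % $ppn
@ idx = $localrank + 1
exec taskset -c $coremap[$idx] $*

It would be invoked in place of mpi_place in the mpirun_rsh line above
(with a single argument, the number of processes per node, rather than
mpi_place's two).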

The upshot of all the above is that you might be better off using 16 cores
with excellent memory access rather than 32 which are blocking one another
(counter-intuitive, but see the speed ratio of 2.71 that I found).
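(To put that another way: the half-occupied run used twice as many nodes,
but even per node it wins, roughly 1051.8/16 ≈ 66 against 387.7/8 ≈ 48
simulated seconds per minute per node, so a node running 4 well-placed
processes out-computes the same node running 8.)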

We are also investigating the effect of the MPI libraries that we use.
Benchmark experiments on other (computational chemistry) codes suggest
that the mvapich library we have used thus far is seriously outperformed
by the newer Silicon Graphics "MPT" library. We haven't yet tested this on
our ocean models NEMO and MITgcm, so I can't say whether it would be worth
considering; we hope to decide in the next few weeks whether to make the
software upgrade here. If anyone wants more details on any of the above,
please e-mail me directly, as this must be deadly dull for readers who don't
have an Altix system (though it probably applies to any quad-core machine).

                                                    Jeff Blundell

======================================================================
|                    e-mail:  jeff at noc.soton.ac.uk                   |
|   Jeff Blundell,  Room 256/09  |  Ocean Modelling & Forecasting,   |
|    phone: +44 [0]23 8059 6201  |  National Oceanography Centre,    |
|     fax : +44 [0]23 8059 6204  |  Southampton, Empress Dock,       |
|                                |  SOUTHAMPTON SO14 3ZH, UK.        |
|               WWW:  http://www.noc.soton.ac.uk/omf/                |
======================================================================





