[MITgcm-support] Scalability on a new Sgi node
Constantinos Evangelinos
ce107 at ocean.mit.edu
Tue Jun 14 11:23:15 EDT 2011
On Tue, 14 Jun 2011 14:21:35 +0200
Stefano Querin <squerin at ogs.trieste.it> wrote:
> Dear MITgcmers,
>
> I'm experiencing a problem with a new Sgi node (AMD Opteron - 24
> cores) mounted on our HPC cluster (called COBRA).
> The problem is that the model (checkpoint61t) does not scale fine,
> especially switching from 12 to 24 processors. It even takes a bit
> more time when using 24 rather than 12 cores!
> Of course, this is not a MITgcm issue: I run the SAME simulation,
> with the SAME configuration (namelists, I/O, ...), on another (Intel
> based) HPC cluster (called PLX) and the results, also using 48 cores,
> are reasonable.
>
> I'm reporting below some details about the two clusters and the
> numerical experiments that we carried out.
>
> My questions are:
> - could this be due to the old version of the compiler on the COBRA
> cluster (see below)?
Not really - an old compiler might cost you some raw performance, but it
would not affect scaling like this.
> - could there be something wrong in the compilation/optimization
> flags?
You might do better by using flags specific to the processor you are
employing instead of generic ones - a newer compiler version would give
you more processor-specific targets to choose from.
> - it seems that the two (12 core) CPUs are not "talking to each
> other" efficiently. Could this be a hardware problem?
You are probably seeing the effects of bogging down the memory
subsystem - not simply the HyperTransport links between the processors
(there are in fact more than one) - with too many requests from each of
your MPI processes, both in terms of memory bandwidth pressure and MPI
traffic.
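If you want to see whether rank placement makes a difference, here is a
minimal sketch, assuming your Open MPI build is recent enough to have
the binding options (mpirun --help will tell you):

  # spread the ranks round-robin over the sockets and pin each one to a
  # core, so all memory controllers are in use even at low rank counts;
  # --report-bindings prints the layout actually applied
  mpirun -np 12 --bysocket --bind-to-core --report-bindings ./mitgcmuv

  # for comparison, pack the ranks onto as few sockets as possible
  mpirun -np 12 --bycore --bind-to-core --report-bindings ./mitgcmuv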
> node name: cobra-0-5
> SGI H2106-G7 server, one quad-socket board. Chipset: AMD SR5690 +
> SR5670 + SP5100
> CPUs: 24 x 2.05 GHz (2 Opteron 6172 with 12 cores, 2.1GHz, and 12MB
> L3 cache)
> Memory (RAM): 15.68 GB
> Disk: 1 x 600 GB SAS 15k RAID 1, 2 x 10/100/1000, 6 PCIe slots.
> (http://www.sgi.com/products/servers/four_way/)
Is this a half-populated quad-socket? You chose the high core-count
chips (AMD also makes 8-core versions; for the same power envelope the
8-core parts run at a higher frequency, or use less power at the same
frequency, and of course offer more effective memory bandwidth per
core).
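To see how the board is actually populated, something like the
following on the compute node should do (assuming numactl is installed;
each 12-core Magny-Cours package should show up as two 6-core NUMA
nodes):

  # list the NUMA nodes, their CPUs and local memory; two 12-core
  # Opteron 6172s should appear as four 6-core NUMA nodes
  numactl --hardware

  # how the cores map onto physical packages
  grep 'physical id' /proc/cpuinfo | sort | uniq -c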
> PGI compiler:
> FFLAGS='-r8 -Mnodclchk -Mextend -Ktrap=fp'
> FOPTIM='-tp k8-64 -pc=64 -fastsse -O3 -Msmart
> -Mvect=cachesize:1048576,transform'
These are generic flags - the cachesize, for example, happens by chance
to be right for the L3 per core on this chip, but it would be wrong for
some other recent Opterons. You might want to add -tp k8-64e if your PGI
version supports it.
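For what it's worth, the optfile lines would then look something like
this (only the -tp target changes; whether the 6.1 release accepts
k8-64e is something you will have to check):

  FFLAGS='-r8 -Mnodclchk -Mextend -Ktrap=fp'
  # as before, but targeting the later AMD64 revisions; drop k8-64e
  # again if pgf90 6.1 rejects it
  FOPTIM='-tp k8-64e -pc=64 -fastsse -O3 -Msmart -Mvect=cachesize:1048576,transform'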
> Operating system:
> Rocks 5.4 (Maverick) x86_64
Keep in mind that, as noted above, the age of the compiler mostly
limits which processor-specific optimizations you can use; it should
not change the scaling behaviour.
>
> Compiler, debugger, profiler (our PGI version is almost 4 years old:
> could this be the issue?):
> PGDBG 6.1-2 x86-64
> PGPROF 6.1-2
> /share/apps/pgi/linux86-64/6.1
>
> Job scheduler:
> /opt/gridengine/bin/lx26-amd64/qsub
>
> Mpirun:
> /opt/openmpi/bin/mpirun
>
>
> TESTS (same simulation, using 4, 12 and 24 cores):
>
> 4 (2x2) cobra-0-5
> User time: 3318.979990234599
> System time: 98.77000091597438
> Wall clock time: 4033.110437154770
>
> 12 (6x2) cobra-0-5
> User time: 1735.529926758260
> System time: 77.22999450750649
> Wall clock time: 2233.585635900497
>
> 24 (6x4) cobra-0-5
> User time: 1704.439960937947
> System time: 124.3899948149920
> Wall clock time: 2270.645053148270
This does seem like memory bandwidth congestion. Notice how going from
4 to 12 cores only brings the wall clock time from ~4033 s down to
~2234 s - not even a halving, where ideal scaling would give one third
(~1344 s).
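Putting numbers on it (just the speedup/efficiency arithmetic on the
wall clock times you posted, relative to the 4-core run):

  # speedup S = T(4)/T(n), parallel efficiency E = S/(n/4)
  awk 'BEGIN {
    t4 = 4033.1; t12 = 2233.6; t24 = 2270.6
    printf "12 cores: speedup %.2f, efficiency %.0f%%\n", t4/t12, 100*t4/t12/3
    printf "24 cores: speedup %.2f, efficiency %.0f%%\n", t4/t24, 100*t4/t24/6
  }'

which gives roughly 60% efficiency at 12 cores and 30% at 24.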
> ************************* control node (PLX) used for comparisons
> *************************
Only one?
> PLX DataPlex Cluster @ CINECA - RedHat EL 5.6!
> ( http://www.cineca.it/en/hardware/ibm-plx2290-0 WARNING! website not
> up to date... ) Qlogic QDR (40Gb/s) Infiniband high-performance
> network
>
> 274 Compute node
> 2 six-core Intel(R) Xeon(R) CPU E5645 @2.40GHz per Compute node
> 48 GB RAM per Compute node
> 2 Nvidia Tesla M2070 GPU per Compute node
>
> 8 Fat node
> 2 Intel(R) Xeon(R) CPU X5570 @2.93GHz per Fat node
> 128 GB RAM per Fat node
>
> 3352 Total cores
>
> 6 Remote Visualization Login
> 2 Nvidia QuadroPlex 2200 S4
> PBSpro 10.1 batch scheduler
>
> Intel compiler
> FFLAGS="$FFLAGS -WB -fno-alias -assume byterecl"
> FOPTIM='-O3 -xSSE4.2 -unroll=4 -axSSE4.2 -ipo -align -fno-alias
> -assume byterecl'
>
>
> TESTS (same simulation, using 4, 12, 24 and 48 cores):
I assume this is using one node, with 2 cores on each of the two
sockets.
> 4 (2x2)
> User time: 1174.67938481271
> System time: 0.000000000000000E+000
> Wall clock time: 1215.21540594101
Is this still using only one node (6 cores on each of the sockets)?
> 12 (6x2)
> User time: 642.731264695525
> System time: 0.000000000000000E+000
> Wall clock time: 692.214211940765
At this point you are either using 2 nodes (6 cores on each socket) or
hyperthreading on one node (and in that case you're seeing amazing
scaling for hyperthreading).
> 24 (6x4)
> User time: 328.649038668722
> System time: 0.000000000000000E+000
> Wall clock time: 360.033116102219
In this case, if you are not going over Infiniband, I'm flabbergasted.
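If you want to confirm what those runs went over (assuming PLX's mpirun
is also Open MPI; with Intel MPI or MVAPICH the knobs are different),
something along these lines:

  # restrict Open MPI to the InfiniBand, shared-memory and self
  # transports; the run aborts at startup if openib is not available
  mpirun -np 24 --mca btl openib,sm,self ./mitgcmuv

  # or just turn up the verbosity and read which BTLs get selected
  mpirun -np 24 --mca btl_base_verbose 30 ./mitgcmuv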
> 48 (8x6)
> User time: 179.773665569723
> System time: 0.000000000000000E+000
> Wall clock time: 233.140372991562
Constantinos
--
Dr. Constantinos Evangelinos
Research Staff Member
IBM Research, Computational Science Center
HPC Applications and Tools