[MITgcm-support] Scalability on a new Sgi node
Stefano Querin
squerin at ogs.trieste.it
Tue Jun 14 08:21:35 EDT 2011
Dear MITgcmers,
I'm experiencing a problem with a new SGI node (AMD Opteron, 24 cores)
recently added to our HPC cluster (called COBRA).
The problem is that the model (checkpoint61t) does not scale well,
especially when going from 12 to 24 processes: the run actually takes
slightly longer on 24 cores than on 12!
Of course, this is not a MITgcm issue: I ran the SAME simulation, with
the SAME configuration (namelists, I/O, ...), on another, Intel-based
HPC cluster (called PLX), and there the results, even up to 48 cores,
are reasonable.
I'm reporting below some details about the two clusters and the
numerical experiments that we carried out.
My questions are:
- could this be due to the old version of the compiler on the COBRA
cluster (see below)?
- could there be something wrong with the compilation/optimization flags?
- it seems that the two 12-core CPUs are not "talking to each other"
efficiently. Could this be a hardware problem? (See the NUMA check
sketched just below.)
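One thing worth checking for the last point is the NUMA layout of the
node and how sensitive a run is to memory placement. Just a sketch (the
node numbering below is an assumption; it depends on what numactl
actually reports on cobra-0-5):

   # show NUMA layout: memory nodes, which cores belong to each, node distances
   numactl --hardware

   # hypothetical single-process test: once with memory local to the cores,
   # once with memory forced onto the other socket, to gauge the cross-socket penalty
   numactl --cpunodebind=0 --membind=0 ./mitgcmuv
   numactl --cpunodebind=0 --membind=1 ./mitgcmuv

If the second run were much slower, that would point to memory locality
and inter-socket traffic rather than to the compiler.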
Thanks for any hint/suggestion,
cheers!
Stefano
************************* SGI node (COBRA) *************************
node name: cobra-0-5
SGI H2106-G7 server, quad-socket; chipset: AMD SR5690 + SR5670 + SP5100
CPUs: 24 x 2.05 GHz (2 x Opteron 6172, 12 cores each, 2.1 GHz, 12 MB L3 cache)
Memory (RAM): 15.68 GB
Disk: 1 x 600 GB SAS 15k, RAID 1; network: 2 x 10/100/1000 Ethernet; 6 PCIe slots.
(http://www.sgi.com/products/servers/four_way/)
PGI compiler:
FFLAGS='-r8 -Mnodclchk -Mextend -Ktrap=fp'
FOPTIM='-tp k8-64 -pc=64 -fastsse -O3 -Msmart -Mvect=cachesize:1048576,transform'
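A side note on these flags: -tp k8-64 targets the original (K8) Opteron
core, and PGI 6.1 predates the Opteron 6100 series entirely. If a newer
PGI release were installed, a more CPU-specific target might be worth
trying; only a sketch, since the exact -tp name (e.g. istanbul-64) is an
assumption that depends on the PGI version available:

   FFLAGS='-r8 -Mnodclchk -Mextend -Ktrap=fp'
   FOPTIM='-tp istanbul-64 -pc=64 -fastsse -O3 -Msmart -Mvect=cachesize:1048576,transform'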
Operating system:
Rocks 5.4 (Maverick) x86_64
Compiler, debugger, profiler (our PGI version is almost 4 years old:
could this be the issue?):
PGDBG 6.1-2 x86-64
PGPROF 6.1-2
/share/apps/pgi/linux86-64/6.1
Job scheduler:
/opt/gridengine/bin/lx26-amd64/qsub
Mpirun:
/opt/openmpi/bin/mpirun
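It may also be worth checking how mpirun places and binds the 24 ranks
across the two sockets. A sketch, assuming an Open MPI release that
supports these options (to be verified with mpirun --help):

   # report where each rank is bound (any executable will do)
   mpirun -np 24 --report-bindings hostname

   # bind each rank to a core, distributing ranks across the sockets
   mpirun -np 24 --bind-to-core --bysocket ./mitgcmuv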
TESTS (same simulation, using 4, 12 and 24 cores; all runs on node cobra-0-5):

  cores (tiling)   user time [s]   system time [s]   wall clock time [s]
   4  (2x2)             3319.0             98.8             4033.1
  12  (6x2)             1735.5             77.2             2233.6
  24  (6x4)             1704.4            124.4             2270.6
************************* control cluster (PLX) used for comparison *************************
PLX DataPlex Cluster @ CINECA - RedHat EL 5.6
( http://www.cineca.it/en/hardware/ibm-plx2290-0 - WARNING: website not up to date... )
Qlogic QDR (40Gb/s) Infiniband high-performance network
274 compute nodes
2 six-core Intel(R) Xeon(R) E5645 CPUs @ 2.40 GHz per compute node
48 GB RAM per compute node
2 Nvidia Tesla M2070 GPUs per compute node
8 fat nodes
2 Intel(R) Xeon(R) X5570 CPUs @ 2.93 GHz per fat node
128 GB RAM per fat node
3352 total cores
6 remote visualization login nodes
2 Nvidia QuadroPlex 2200 S4
PBSpro 10.1 batch scheduler
Intel compiler:
FFLAGS="$FFLAGS -WB -fno-alias -assume byterecl"
FOPTIM='-O3 -xSSE4.2 -unroll=4 -axSSE4.2 -ipo -align -fno-alias -assume byterecl'
TESTS (same simulation, using 4, 12, 24 and 48 cores):

  cores (tiling)   user time [s]   system time [s]   wall clock time [s]
   4  (2x2)             1174.7              0.0             1215.2
  12  (6x2)              642.7              0.0              692.2
  24  (6x4)              328.6              0.0              360.0
  48  (8x6)              179.8              0.0              233.1