[MITgcm-support] Scalability on a new Sgi node
Constantinos Evangelinos
ce107 at ocean.mit.edu
Tue Jun 14 11:23:15 EDT 2011
On Tue, 14 Jun 2011 14:21:35 +0200
Stefano Querin <squerin at ogs.trieste.it> wrote:
> Dear MITgcmers,
>
> I'm experiencing a problem with a new Sgi node (AMD Opteron - 24
> cores) mounted on our HPC cluster (called COBRA).
> The problem is that the model (checkpoint61t) does not scale fine,
> especially switching from 12 to 24 processors. It even takes a bit
> more time when using 24 rather than 12 cores!
> Of course, this is not a MITgcm issue: I run the SAME simulation,
> with the SAME configuration (namelists, I/O, ...), on another (Intel
> based) HPC cluster (called PLX) and the results, also using 48 cores,
> are reasonable.
>
> I'm reporting below some details about the two clusters and the
> numerical experiments that we carried out.
>
> My questions are:
> - could this be due to the old version of the compiler on the COBRA
> cluster (see below)?
Not really - an old compiler might cost you some raw performance, but it
would not affect scaling like this.
> - could there be something wrong in the compilation/optimization
> flags?
You might do better by using flags specific to the processor you are
employing instead of generic ones - a newer compiler version would give
you more processor-specific targets to choose from.
> - it seems that the two (12 core) CPUs are not "talking to each
> other" efficiently. Could this be a hardware problem?
You are probably seeing the effects of bogging down the memory
subsystem - not simply the HyperTransport links between the processors
(there are in fact more than one) - with too many requests from each of
your MPI processes, both in terms of memory bandwidth pressure and MPI
traffic.
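If you want to see whether rank placement makes a difference, here is a
minimal sketch, assuming your Open MPI build is recent enough to have
the binding options (mpirun --help will tell you):

  # spread the ranks round-robin over the sockets and pin each one to a
  # core, so all memory controllers are in use even at low rank counts;
  # --report-bindings prints the layout actually applied
  mpirun -np 12 --bysocket --bind-to-core --report-bindings ./mitgcmuv

  # for comparison, pack the ranks onto as few sockets as possible
  mpirun -np 12 --bycore --bind-to-core --report-bindings ./mitgcmuv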
> node name: cobra-0-5
> SGI H2106-G7 server, one quad-socket board. Chipset: AMD SR5690 +
> SR5670 + SP5100
> CPUs: 24 x 2.05 GHz (2 Opteron 6172 with 12 cores, 2.1GHz, and 12MB
> L3 cache)
> Memory (RAM): 15.68 GB
> Disk: 1 x 600 GB SAS 15k RAID 1, 2 x 10/100/1000, 6 PCIe slots.
> (http://www.sgi.com/products/servers/four_way/)
Is this a half-populated quad-socket? You chose the high core-count
chips (AMD also makes 8-core versions; for the same power envelope the
8-core parts run at a higher frequency, or use less power at the same
frequency, and of course offer more effective memory bandwidth per
core).
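To see how the board is actually populated, something like the
following on the compute node should do (assuming numactl is installed;
each 12-core Magny-Cours package should show up as two 6-core NUMA
nodes):

  # list the NUMA nodes, their CPUs and local memory; two 12-core
  # Opteron 6172s should appear as four 6-core NUMA nodes
  numactl --hardware

  # how the cores map onto physical packages
  grep 'physical id' /proc/cpuinfo | sort | uniq -c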
> PGI compiler:
> FFLAGS='-r8 -Mnodclchk -Mextend -Ktrap=fp'
> FOPTIM='-tp k8-64 -pc=64 -fastsse -O3 -Msmart
> -Mvect=cachesize:1048576,transform'
These are generic flags - the cachesize, for example, happens by chance
to be right for the L3 per core on this chip, but it would be wrong for
some other recent Opterons. You might want to add -tp k8-64e if your PGI
version supports it.
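For what it's worth, the optfile lines would then look something like
this (only the -tp target changes; whether the 6.1 release accepts
k8-64e is something you will have to check):

  FFLAGS='-r8 -Mnodclchk -Mextend -Ktrap=fp'
  # as before, but targeting the later AMD64 revisions; drop k8-64e
  # again if pgf90 6.1 rejects it
  FOPTIM='-tp k8-64e -pc=64 -fastsse -O3 -Msmart -Mvect=cachesize:1048576,transform'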
> Operating system:
> Rocks 5.4 (Maverick) x86_64
Keep in mind that, as noted above, the age of the compiler mostly
limits which processor-specific optimizations you can use; it should
not change the scaling behaviour.
>
> Compiler, debugger, profiler (our PGI version is almost 4 years old:
> could this be the issue?):
> PGDBG 6.1-2 x86-64
> PGPROF 6.1-2
> /share/apps/pgi/linux86-64/6.1
>
> Job scheduler:
> /opt/gridengine/bin/lx26-amd64/qsub
>
> Mpirun:
> /opt/openmpi/bin/mpirun
>
>
> TESTS (same simulation, using 4, 12 and 24 cores):
>
> 4 (2x2) cobra-0-5
> User time: 3318.979990234599
> System time: 98.77000091597438
> Wall clock time: 4033.110437154770
>
> 12 (6x2) cobra-0-5
> User time: 1735.529926758260
> System time: 77.22999450750649
> Wall clock time: 2233.585635900497
>
> 24 (6x4) cobra-0-5
> User time: 1704.439960937947
> System time: 124.3899948149920
> Wall clock time: 2270.645053148270
This does seem like memory bandwidth congestion. Notice how going from
4 to 12 cores only brings the wall clock time from ~4033 s down to
~2234 s - not even a halving, where ideal scaling would give one third
(~1344 s).
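Putting numbers on it (just the speedup/efficiency arithmetic on the
wall clock times you posted, relative to the 4-core run):

  # speedup S = T(4)/T(n), parallel efficiency E = S/(n/4)
  awk 'BEGIN {
    t4 = 4033.1; t12 = 2233.6; t24 = 2270.6
    printf "12 cores: speedup %.2f, efficiency %.0f%%\n", t4/t12, 100*t4/t12/3
    printf "24 cores: speedup %.2f, efficiency %.0f%%\n", t4/t24, 100*t4/t24/6
  }'

which gives roughly 60% efficiency at 12 cores and 30% at 24.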
> ************************* control node (PLX) used for comparisons
> *************************
Only one?
> PLX DataPlex Cluster @ CINECA - RedHat EL 5.6!
> ( http://www.cineca.it/en/hardware/ibm-plx2290-0 WARNING! website not
> up to date... ) Qlogic QDR (40Gb/s) Infiniband high-performance
> network
>
> 274 Compute node
> 2 six-core Intel(R) Xeon(R) CPU E5645 @2.40GHz per Compute node
> 48 GB RAM per Compute node
> 2 Nvidia Tesla M2070 GPU per Compute node
>
> 8 Fat node
> 2 Intel(R) Xeon(R) CPU X5570 @2.93GHz per Fat node
> 128 GB RAM per Fat node
>
> 3352 Total cores
>
> 6 Remote Visualization Login
> 2 Nvidia QuadroPlex 2200 S4
> PBSpro 10.1 batch scheduler
>
> Intel compiler
> FFLAGS="$FFLAGS -WB -fno-alias -assume byterecl"
> FOPTIM='-O3 -xSSE4.2 -unroll=4 -axSSE4.2 -ipo -align -fno-alias
> -assume byterecl'
>
>
> TESTS (same simulation, using 4, 12, 24 and 48 cores):
I assume this is using one node, with 2 cores on each of the two
sockets.
> 4 (2x2)
> User time: 1174.67938481271
> System time: 0.000000000000000E+000
> Wall clock time: 1215.21540594101
Is this still using only one node (6 cores on each of the sockets)?
> 12 (6x2)
> User time: 642.731264695525
> System time: 0.000000000000000E+000
> Wall clock time: 692.214211940765
At this point you are either using 2 nodes (6 cores on each socket) or
hyperthreading on one node (and in that case you're seeing amazing
scaling for hyperthreading).
> 24 (6x4)
> User time: 328.649038668722
> System time: 0.000000000000000E+000
> Wall clock time: 360.033116102219
In this case, if you are not going over Infiniband, I'm flabbergasted.
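If you want to confirm what those runs went over (assuming PLX's mpirun
is also Open MPI; with Intel MPI or MVAPICH the knobs are different),
something along these lines:

  # restrict Open MPI to the InfiniBand, shared-memory and self
  # transports; the run aborts at startup if openib is not available
  mpirun -np 24 --mca btl openib,sm,self ./mitgcmuv

  # or just turn up the verbosity and read which BTLs get selected
  mpirun -np 24 --mca btl_base_verbose 30 ./mitgcmuv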
> 48 (8x6)
> User time: 179.773665569723
> System time: 0.000000000000000E+000
> Wall clock time: 233.140372991562
Constantinos
--
Dr. Constantinos Evangelinos
Research Staff Member
IBM Research, Computational Science Center
HPC Applications and Tools