[MITgcm-support] scaling on SGI altix, AMD quad core cpus is terrible

Tue Oct 21 19:29:26 EDT 2008

Hi Martin,

  Something is wrong! Hard to say exactly what yet :-). Certainly there 
is only so much memory bandwidth and so things will be limited by that 
if nothing else limits first. However, that wouldn't normally produce a 
discrepancy between wall and user time. That is almost certainly coming 
from somewhere else.

  Is it possible that there are other things using up CPU cycles on the
system? Can you do a couple of runs on each of the boxes individually,
get the ratio of wall-clock to user time for a single box, and see if
the ratios are consistent between boxes?
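
For what it's worth, here is a rough sketch of that wall-clock versus
user comparison. The function name, the nprocs argument and the way the
job is launched are all assumptions for illustration, not anything that
ships with MITgcm; the timing summary at the end of the model stdout is
still the first place to look.

```python
import os
import subprocess
import time


def wall_to_user_ratio(cmd, nprocs=1):
    """Run cmd once and return (wall, user, ratio).

    user is the user CPU time summed over all child processes that
    were waited for, so it is divided by nprocs (an assumed number of
    ranks on this box) before comparing with the wall-clock time.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = os.times()

    user = after.children_user - before.children_user
    user_per_proc = user / nprocs
    # A ratio near 1 means the CPUs were busy with the job; a large
    # ratio means the job spent most of its time waiting on something.
    ratio = wall / user_per_proc if user_per_proc > 0 else float("inf")
    return wall, user, ratio
```

Calling this with the same launch command on each box in turn, for
example wall_to_user_ratio(["./mitgcmuv"]) with whatever executable name
and launcher you actually use, gives one ratio per box to compare.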

The things that produce big deltas between wall-clock and user time are
  1 - inadequate interconnect e.g. machines are connected with a 56K
modem or via email.
  2 - uneven load balancing e.g. heterogeneous hardware so everyone has
to wait for the slowest computer
  3 - I/O problems e.g. everyone is waiting for some sparc station NFS
server
  4 - something else is running on some of the CPUs e.g. the machine is
shared and somebody else is busy running matlab, open office, firefox, 
xscreensaver, crack etc.....
  5 - you are using a useless MPI implementation e.g. the MPI
implementation is going via the network interface even for local comms
within a box - maybe because you aren't using the nice optimized SGI MPI.
  6 - you are using too much memory so things are swapping e.g. you have
three hundred tracers and their diagnostics and only 16KB of memory. The
size and nm commands are useful here to tell you how much memory is needed.
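
As a back-of-envelope complement to size and nm for item 6, the sketch
below estimates how much memory a set of 3-D fields needs. The grid
dimensions are the ones from Martin's message; the function names and
the count of 300 fields are illustrative assumptions, and the estimate
ignores overlap regions, extra time levels and diagnostics, so treat it
as a lower bound.

```python
def field_bytes(nx, ny, nz, bytes_per_value=8):
    """Bytes for one 3-D field of 64-bit values, without overlap regions."""
    return nx * ny * nz * bytes_per_value


def total_gib(nx, ny, nz, n_fields, bytes_per_value=8):
    """Total size in GiB of n_fields such 3-D fields."""
    return n_fields * field_bytes(nx, ny, nz, bytes_per_value) / 2**30


# Grid from the original message; 300 fields is a hypothetical count.
one_field = field_bytes(680, 128, 38)
all_fields = total_gib(680, 128, 38, n_fields=300)
print(f"one field: {one_field} bytes, 300 fields: {all_fields:.2f} GiB")
```

One field on this grid is 26460160 bytes, so 300 of them come to about
7.4 GiB before any of the overheads mentioned above.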

  OpenMP/threads really only helps with situation 5 and it's probably
easier and almost as effective to find/buy a decent MPI implementation
that knows about shared memory comms. Also, for some unknown reason, 
only me, Jean-Michel and Oliver can reliably write code that works with 
threads! So some packages can have issues with multi-threading.

  It is possible to address memory bandwidth issues a bit by using a mix 
of 32-bit and 64-bit floats, but I wouldn't do that until you've got to 
the bottom of the user v. wall clock issue. These changes would be 
relatively easy ( :-) ) for an eager volunteer to test for cg2d/cg3d.
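
To put a number on the 32-bit versus 64-bit point, here is a minimal
sketch of the arithmetic only. It is not how cg2d or cg3d are written;
the function names are made up, and the 680x128 horizontal grid is
taken from Martin's message.

```python
import struct


def bytes_per_sweep(nx, ny, bytes_per_value):
    """Bytes read when one 2-D field is swept through once."""
    return nx * ny * bytes_per_value


def through_float32(x):
    """Round a Python float (64-bit) to 32-bit precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


NX, NY = 680, 128  # horizontal grid from the original message
double_traffic = bytes_per_sweep(NX, NY, struct.calcsize("<d"))
single_traffic = bytes_per_sweep(NX, NY, struct.calcsize("<f"))

# Storing a field in 32 bits halves the traffic per sweep...
print(double_traffic, single_traffic)
# ...but keeps only about 7 significant digits of each value.
print(abs(through_float32(1.0 / 3.0) - 1.0 / 3.0))
```

The halved traffic is the attraction; the lost digits are the reason
only some of the arrays would be candidates for 32-bit storage.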

Thanks,

Chris
Martin Losch wrote:
> Hi there,
> 
> this is a question to the hardware gurus (I am counting on Constantinos 
> and Chris).
> 
> Here at Weizmann, there is a complex of 3 SGI altix computers with 
> Quadcore cpus and PGI compilers. 2 of the machines have 4 cpus each, and 
> 1 has 2, details are described in the PDF H.G.updated...(attached). 
> There was a message by Constantinos about bandwidth limitations with 
> Quadcore cpus in this thread:
> http://forge.csail.mit.edu/pipermail/mitgcm-support/2007-October/005076.html
> 
> Now we find this:
> 1. A model on a fairly large grid (680x128x38 grid points) does not scale 
> on this system. Further, on 32cpus (the target number of cpus) the wall 
> clock time is about 9 times larger than the user time. Most of this 
> excess wall clock time is spent in CG2D and BLOCKING_EXCHANGES, that is 
> in places where there are many exchanges (and DYNAMICS but I believe 
> it's because OBCS is used that includes also a few exchanges).
> 2. We did some simple benchmarks with a 256x256x5 grid and all physical 
> packages (including obcs) turned off. Because we chose only 5 vertical 
> layers, the blocking exchanges do not take too much time, but the 
> principle is the same: It seems to spend time in the exchange routines, 
> see the attached figure, numbers are taken from the rough estimate at 
> the end of the model stdout.
> 3. Further, we have an independent analysis from some compute guys who 
> come to the conclusion that the MITgcm is memory bandwidth limited 
> (probably meaning that it requires a lot of bandwidth that the hardware 
> cannot provide), also attached, so are the specifics of the system.
> 
> Do you have any suggestions as to what might be going wrong? How can we make 
> the code more efficient and speed up the integrations? We have tried a 
> few CPP_EEOPTIONS flags, but nothing helps. OpenMP/Multithreading?
> 
> Martin, Eli, and Hezi
> 
> ------------------------------------------------------------------------
> 
>> Hi Hezi,
>>
>> A couple of months ago, with a lot of help from Pierre Choukroun and his
>> team, we ran some analysis tests on the MITGCM application in order to
>> find out the reason for the drop in performance when running on one node
>> using 8 cores.
>>
>> The following tests were run:
>>
>>    1. I/O tests - We used iostat tool in order to understand the I/O
>>       behavior of the application.
>>       The behavior involves reading several input files at the beginning
>>       of run and writing some output files at the end of the run.
>>       Results: the application is not I/O bound. I/O is not the
>>       reason for the performance drop.
>>
>>    2. Memory tests - We used memory analysis tools like sar to find out
>>       whether the application is using all of the memory, so that swap
>>       is involved.
>>       On the model I was running, the application did not use all of the
>>       free RAM and no swap was needed. Nevertheless, I
>>       recommended repeating the test using bigger models. We loaned some
>>       extra memory to Pierre and installed it on the Hezi4 machine.
>>
>>    3. CPU tests - The application utilized 100% of the CPU on each core.
>>       96% are used by the application, 4% are used by the O.S on behalf
>>       of the application.
>>       Using the Intel VTune tool (which was also installed together with
>>       Pierre on one of your systems) we managed to find a single
>>       function which keeps the CPU busy (SUBROUTINE CG3D). You may
>>       consider optimizing this subroutine.
>>
>>    4. Interconnect tests - Although the performance drop occurs on one
>>       node only, we ran MITGCM on 2 nodes using Ethernet and InfiniBand.
>>       We found out that Ethernet is sufficient for this type of
>>       application.
>>
>>    5. Memory bandwidth tests - Using the Intel VTune tool we tested the
>>       memory bandwidth used by the application. We found out that the
>>       application is memory bandwidth limited, which means that when
>>       2 sockets with 4 cores on each socket are trying to access their
>>       local memory at the same time, the memory bus is saturated, which
>>       creates a bottleneck.
>>
>>
>> According to our tests, the memory bandwidth limitation of MITGCM is the
>> cause of its nonlinear scaling on a single node.
>>
> 