[MITgcm-support] scaling on SGI altix, AMD quad core cpus is terrible

Martin Losch Martin.Losch at awi.de
Tue Oct 21 05:14:54 EDT 2008


Hi there,

this is a question for the hardware gurus (I am counting on
Constantinos and Chris).

Here at Weizmann, there is a complex of 3 SGI Altix computers with
quad-core CPUs and PGI compilers. Two of the machines have 4 CPUs each,
and one has 2; details are described in the attached PDF
(H.G updated offer .pdf). There was a message by Constantinos about
bandwidth limitations with quad-core CPUs in this thread:
http://forge.csail.mit.edu/pipermail/mitgcm-support/2007-October/005076.html

Now we find this:
1. A model on a fairly large grid (680x128x38 grid points) does not
   scale on this system. Further, on 32 CPUs (the target number of
   CPUs) the wall-clock time is about 9 times larger than the user
   time. Most of this excess wall-clock time is spent in CG2D and in
   BLOCKING_EXCHANGES, that is, in places with many exchanges (and in
   DYNAMICS, but I believe that is because OBCS is used, which also
   includes a few exchanges).
2. We did some simple benchmarks with a 256x256x5 grid and all
   physical packages (including obcs) turned off. Because we chose
   only 5 vertical layers, the blocking exchanges do not take too much
   time, but the principle is the same: the code seems to spend its
   time in the exchange routines; see the attached figure, whose
   numbers are taken from the rough timing estimate at the end of the
   model stdout.
3. Further, we have an independent analysis from some compute people
   who came to the conclusion that the MITgcm is memory-bandwidth
   limited (probably meaning that it requires more memory bandwidth
   than the hardware can provide); their report is attached as well,
   as are the specifics of the system. A quick way to see such a limit
   is sketched right after this list.
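
To make the bandwidth argument concrete, here is a minimal STREAM-style
triad sketch (the array size is arbitrary; compile with your compiler's
OpenMP flag and rerun with OMP_NUM_THREADS=1,2,4,8). On a
bandwidth-limited node the GB/s rate stops growing well before the core
count does, which is exactly the signature described in point 3:

  /* triad.c - STREAM-style memory-bandwidth probe (sketch).
     Build:  gcc -O2 -fopenmp triad.c -o triad
     Run:    OMP_NUM_THREADS=1 ./triad ; OMP_NUM_THREADS=8 ./triad  */
  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  #define N (16L * 1024 * 1024)        /* three double arrays, ~384 MB */

  int main(void) {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      double t0 = omp_get_wtime();
  #pragma omp parallel for
      for (long i = 0; i < N; i++)
          a[i] = b[i] + 3.0 * c[i];    /* 2 loads + 1 store per element */
      double t1 = omp_get_wtime();

      /* 24 bytes move per element; report the sustained rate */
      printf("%d threads: %6.2f GB/s (check %.1f)\n",
             omp_get_max_threads(), 24.0 * N / 1e9 / (t1 - t0), a[N/2]);
      free(a); free(b); free(c);
      return 0;
  }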

Do you have any suggestions as to what might be going wrong? How can we
make the code more efficient and speed up the integrations? We have
tried a few CPP_EEOPTIONS flags, but nothing helps. Would
OpenMP/multithreading help?
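
In case multithreading is worth a try: the shared-memory path is
selected by the nTx/nTy entries in eedata together with the tile counts
nSx/nSy in SIZE.h, and each thread then works on (nSx/nTx) x (nSy/nTy)
tiles. A minimal sketch, assuming the standard conventions (the 2x2
split is only an example, and the executable must be built with the
compiler's threading flags):

  # eedata (sketch): 2 threads in x times 2 threads in y per process;
  # nSx must be a multiple of nTx and nSy a multiple of nTy in SIZE.h.
   &EEPARMS
   nTx=2,
   nTy=2,
   &

Whether this would help here is unclear, though: threads on the same
socket still share the same memory bus, so a bandwidth limit would not
simply go away.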

Martin, Eli, and Hezi



> Hi Hezi,
>
> A couple of months ago, with a lot of help from Pierre Choukroun and
> his team, we ran some analysis tests on the MITgcm application in
> order to find out the reason for the drop in performance when running
> on one node using 8 cores.
>
> The following tests were performed:
>
>    1. I/O tests - We used the iostat tool to understand the I/O
>       behavior of the application, which involves reading several
>       input files at the beginning of the run and writing some output
>       files at the end of the run.
>       Result: the application is not I/O bound; I/O is not the reason
>       for the performance drop.
>
>    2. Memory tests - We used memory-analysis tools such as sar to
>       find out whether the application fills the entire memory, so
>       that swap memory gets involved.
>       On the model I was running, the application did not use all of
>       the free RAM and no swap memory was needed. Nevertheless, I
>       recommended repeating the test with bigger models. We loaned
>       some extra memory to Pierre and installed it on the Hezi4
>       machine.
>
>    3. CPU tests - The application utilized 100% of the CPU on each
>       core: 96% is used by the application, 4% by the OS on behalf of
>       the application.
>       Using the Intel VTune tool (which was also installed together
>       with Pierre on one of your systems) we managed to find a single
>       subroutine that keeps the CPU busy (SUBROUTINE CG3D); you may
>       consider optimizing this subroutine.
>
>    4. Interconnect tests - Although the performance drop occurs on a
>       single node, we also ran the MITgcm on 2 nodes using Ethernet
>       and InfiniBand. We found that Ethernet is sufficient for this
>       type of application.
>
>    5. Memory-bandwidth tests - Using the Intel VTune tool we measured
>       the memory bandwidth used by the application. We found that the
>       application is memory-bandwidth limited, which means that when
>       2 sockets with 4 cores each try to access their local memory at
>       the same time, the memory bus is saturated, which creates a
>       bottleneck.
>
>
> According to our tests, this memory-bandwidth limitation of the
> MITgcm is the cause of its nonlinear scaling on a single node.
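
One more data point that may connect the two observations: CG2D needs a
global sum (a GLOBAL_SUM, i.e. an MPI_Allreduce in an MPI build) on
every solver iteration, so at 32 CPUs part of the excess wall-clock
time may be plain reduction latency rather than bandwidth. A minimal,
self-contained latency check (the repetition count is arbitrary):

  /* allreduce.c - rough MPI_Allreduce latency probe (sketch).
     Build: mpicc -O2 allreduce.c -o allreduce
     Run:   mpirun -np 32 ./allreduce                            */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, reps = 10000;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double local = rank, global;
      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int i = 0; i < reps; i++)
          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                        MPI_SUM, MPI_COMM_WORLD);
      double t1 = MPI_Wtime();

      if (rank == 0)   /* mean time per global sum */
          printf("mean MPI_Allreduce: %.1f us\n",
                 (t1 - t0) / reps * 1e6);
      MPI_Finalize();
      return 0;
  }

Multiplying the reported latency by the number of CG2D iterations per
time step gives a floor on the per-step cost that no amount of per-core
optimization can remove.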
-------------- next part --------------
Attachment: hezibench.png (image/png, 35948 bytes)
<http://mitgcm.org/pipermail/mitgcm-support/attachments/20081021/54f4e625/attachment.png>
Attachment: H.G updated offer .pdf (application/pdf, 43448 bytes)
<http://mitgcm.org/pipermail/mitgcm-support/attachments/20081021/54f4e625/attachment.pdf>

