[MITgcm-support] building with MPI on a dual-core mac

Constantinos Evangelinos ce107 at ocean.mit.edu
Sun Jul 19 13:24:33 EDT 2009


On Saturday 18 July 2009 23:30:07, Klymak Jody wrote:
> Hi Brian,
>
> On 18-Jul-09, at 7:32 PM, Brian Rose wrote:
> > But in general, when would you expect to see a performance boost
> > from increasing the number of processes beyond the number of
> > hardware cores?  Currently during my 2-core, 2-process runs, each
> > process is already occupying about 99% of its respective CPU.
>
> Oops, my mistake - I forgot that "Core 2 Duo" just means 2 cores.
> Sneaky Intel marketing. Of course there is no point in
> oversubscribing a core.
>
> However, on the new Nehalem Xeon chips there is some advantage to
> having more processes than cores, though it's not huge.  Some very
> rough tests with a run that has no written output, on a 2x4-core Xeon:
>
> 8 processes on one machine:        54 minutes
> 16 processes on one machine:       40 minutes
>
> However, if you have two machines connected by gigabit ethernet:
>
> 8+8 processes spread over two machines:  29 minutes
>
> This was for a 10x150x1600 grid with the non-hydrostatic code turned
> on, using openMPI.
>
> So for the new Nehalem chips my crude tests seem to imply that there
> is an advantage to oversubscribing if your MPI overhead is small.

That's because Nehalem supports Intel's version of simultaneous multithreading 
(in marketing speak, "hyperthreading"), which makes each core appear as two. The 
old Pentium 4s after a certain stepping had this as well, but it was a far 
inferior implementation that did not offer much for most codes. The Power5 
and Power6 series of processors also have it (and there it works very nicely). 
Essentially, SMT uses empty slots in the pipeline (due to cache misses or 
branches) to schedule instructions from another thread. Obviously you shouldn't 
expect it to double performance unless very specific conditions are met, but it 
can be a boost, especially if you're not FP-pipeline bound and your memory 
bandwidth is not already consumed by a single thread.
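
A quick way to check whether SMT actually buys you anything on a given box,
without building the full model, is a small compute-bound MPI kernel timed
once with as many ranks as physical cores and once with twice that.  Below is
a minimal sketch in Python/mpi4py rather than Fortran, with a made-up problem
size -- treat it as an illustration, not a proper benchmark:

    # smt_check.py -- time a communication-free, FP-heavy loop per rank.
    # Assumes mpi4py and numpy are installed; the problem size is arbitrary.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    n = 2000000                      # per-rank array length (made up)
    a = np.random.rand(n)
    b = np.random.rand(n)

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(200):             # repeated FP work, no communication
        a = a * b + 0.5 * b
    comm.Barrier()
    t1 = MPI.Wtime()

    # the slowest rank is what limits a real run, so report the max
    elapsed = comm.reduce(t1 - t0, op=MPI.MAX, root=0)
    if comm.Get_rank() == 0:
        print("ranks: %d  wallclock: %.2f s" % (comm.Get_size(), elapsed))

Run it as e.g. "mpirun -np 8 python smt_check.py" and again with -np 16
(recent Open MPI versions may need --oversubscribe for the second run); if
the 16-rank wallclock is not much worse than the 8-rank one, SMT is helping
for that kind of workload.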

Just to help everyone with quick and dirty benchmarking: if you add the "-ts" 
flag to genmake2 (or even testreport), you get user, system and wallclock time 
per timestep printed as the code runs, so you can get a pretty good idea of 
the speed achieved without waiting for the run to finish. There are 
postprocessing scripts in MITgcm_contrib/timing (I still need to add one for 
parallel runs that produces statistics over all the STDOUT.* files).
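
For the parallel case, the kind of thing I have in mind is simply to pull the
per-timestep wallclock numbers out of every STDOUT.* file and print statistics
across ranks.  A rough Python sketch follows; note that the regular expression
is just a placeholder, since the exact wording of the "-ts" timing lines
depends on your build -- adjust it to whatever appears in your STDOUT.0000:

    # ts_stats.py -- summarize per-timestep wallclock times over STDOUT.*
    import glob
    import re
    import statistics

    # placeholder pattern: grab a number following the word "wallclock"
    pattern = re.compile(r"wallclock\D*([0-9.]+)", re.IGNORECASE)

    for fname in sorted(glob.glob("STDOUT.*")):
        times = []
        for line in open(fname):
            m = pattern.search(line)
            if m:
                times.append(float(m.group(1)))
        if times:
            print("%s: %d steps, mean %.3f, min %.3f, max %.3f (s/step)"
                  % (fname, len(times), statistics.mean(times),
                     min(times), max(times)))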

Constantinos
-- 
Dr. Constantinos Evangelinos
Department of Earth, Atmospheric and Planetary Sciences
Massachusetts Institute of Technology





