[MITgcm-support] OpenMP and multithreading
menemenlis at sbcglobal.net
Thu Oct 18 09:37:00 EDT 2007
Thank you Constantine, and sorry Paola for forgetting to mention the nTx and nTy runtime parameters. It's been a while since I last used (or attempted to use) threaded code. In my personal experience there were two instances where threaded code worked better than message passing. One was with an early Linux cluster that had dual CPUs per box. The MPI implementation was such that running two separate processes in each box was less efficient than running two threads per box and then using message passing across boxes.
The second case was a 4000-cpu test integration on Columbia, where the most efficient setup turned out to be threaded code within each 512-cpu box and message passing between the eight 512-cpu boxes. This was done in order to force NUMAlink communications within each box and InfiniBand across boxes.
In both of the above cases, the threaded setup was non-trivial and required lots of help from Chris Hill and from other computer gurus. Against these two successful examples, I have many more where threaded code either failed to run or ran more slowly than MPI code on shared-memory platforms.
My suspicion is (Constantine, please correct me if I am wrong) that if MPI is properly set up and calibrated, then MPI code should be able to run at least as efficiently as threaded code on any shared-memory platform. For example, for the 4000-cpu Columbia test, the mixed threads-plus-MPI setup was needed only because the MPI implementation was not set up to make optimal use of the two different communication channels.
From: Constantinos Evangelinos <ce107 at ocean.mit.edu>
Subj: Re: [MITgcm-support] OpenMP and multithreading
Date: Thu Oct 18, 2007 2:50 am
To: mitgcm-support at mitgcm.org
On Wednesday 17 October 2007 8:22:28 pm Dimitris Menemenlis wrote:
> Paola, the total size of the domain is Nx*Ny where
> & Nx = sNx*nSx*nPx,
> & Ny = sNy*nSy*nPy,
> nPx*nPy is the total number of (MPI) processes,
> nSx*nSy is the total number of tiles per process, and
> sNx*sNy is the dimension of each tile.
> For shared-memory threaded code each one of the nSx*nSy tiles will
> be handled by a different thread. If nSx*nSy=1, then you will have
> only one thread per MPI process.
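As a quick illustration of the arithmetic quoted above, here is a minimal Python sketch that computes the global domain size and the related counts from the six decomposition parameters. The function name and the sample values are hypothetical and chosen only for illustration; in MITgcm itself these parameters are set at compile time, not computed this way.

```python
def domain_layout(sNx, sNy, nSx, nSy, nPx, nPy):
    """Return the quantities implied by a tile/process decomposition.

    sNx, sNy : tile dimensions in X and Y
    nSx, nSy : tiles per process in X and Y
    nPx, nPy : (MPI) processes in X and Y
    """
    return {
        "Nx": sNx * nSx * nPx,            # total domain size in X
        "Ny": sNy * nSy * nPy,            # total domain size in Y
        "mpi_processes": nPx * nPy,       # total number of processes
        "tiles_per_process": nSx * nSy,   # tiles handled by each process
        "points_per_tile": sNx * sNy,     # horizontal points in one tile
    }


# Hypothetical example: 30x20 tiles, 2x2 tiles per process, 3x2 processes.
layout = domain_layout(sNx=30, sNy=20, nSx=2, nSy=2, nPx=3, nPy=2)
for name, value in layout.items():
    print(f"{name:>18s} = {value}")
```

With these sample values the domain is 180 x 80 points, split over 6 processes with 4 tiles each.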
Actually the number of threads is set in eedata, separately for the X (nTx)
and Y (nTy) directions (with their product nTx*nTy assumed to be equal
to OMP_NUM_THREADS for OpenMP code). The obvious restriction is that the
number of threads in X should be a divisor of nSx and the number of threads
in Y should be a divisor of nSy. So it is entirely possible to have nSx=nSy=2
and nTx=2, nTy=1.
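The thread-count restrictions can be sketched as a small consistency check in Python. This is only an illustration, not code from MITgcm: the function name is hypothetical, and it assumes the divisibility requirement applies to the number of tiles per process in each direction (nSx, nSy), since tiles are what the threads are distributed over.

```python
def check_thread_layout(nSx, nSy, nTx, nTy, omp_num_threads=None):
    """Return a list of problems with a thread layout (empty list if none).

    Assumption: nTx must divide nSx and nTy must divide nSy, i.e. the
    tiles per process are shared evenly among the threads.
    """
    problems = []
    if nSx % nTx != 0:
        problems.append(f"nTx={nTx} does not divide nSx={nSx}")
    if nSy % nTy != 0:
        problems.append(f"nTy={nTy} does not divide nSy={nSy}")
    # For OpenMP code the product nTx*nTy is expected to match
    # OMP_NUM_THREADS; the check is skipped if no value is given.
    if omp_num_threads is not None and nTx * nTy != omp_num_threads:
        problems.append(
            f"nTx*nTy={nTx * nTy} differs from OMP_NUM_THREADS={omp_num_threads}"
        )
    return problems


# The layout mentioned above: 2x2 tiles per process, 2 threads in X, 1 in Y.
ok = check_thread_layout(nSx=2, nSy=2, nTx=2, nTy=1, omp_num_threads=2)
bad = check_thread_layout(nSx=2, nSy=2, nTx=3, nTy=1, omp_num_threads=2)
print("valid layout problems:", ok)
print("invalid layout problems:", bad)
```

The first call reports no problems; the second reports both a divisibility problem in X and a mismatch with the assumed OMP_NUM_THREADS value.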
> Some words of caution:
> 1) shared-memory threaded code is not as well supported as MPI code,
> especially in the packages where careless programmers (like myself)
> sometimes (accidentally) introduce constructs that break the threading.
You can look at the daily testreports with multithreading turned on to see
which of the test cases appear to work. On some platforms (e.g. Linux on PPC
with the IBM XL compilers) we have had a complete lack of success, for some
so far unknown reason.
> 2) with some exceptions there is very little gain in using threaded code vs
> MPI code, even on shared memory platforms. For example on the SGI origin
> and altix we typically use MPI rather than threaded code, even though they
> are shared memory platforms.
With dual and quad core processors we may need to revisit that question. For
the time being however quad core seems to be suffering from a lack of memory
bandwidth and OpenMP would not help there.
Dr. Constantinos Evangelinos
Department of Earth, Atmospheric and Planetary Sciences
Massachusetts Institute of Technology
--- message truncated ---