[MITgcm-support] inefficient pressure solver

David Wang climater at gmail.com
Tue Jul 14 23:02:52 EDT 2009


Hi Martin et al.,

I switched on the flag GLOBAL_SUM_SEND_RECV in CPP_EEOPTIONS.h and
recompiled the model (with ifort 11 and OpenMPI 1.3.2; the opt file is
very similar to linux_amd64_ifort+mpi_beagle). It made things worse: the
runs became far slower. I didn't actually get the timing statistics
because I had set a pretty small walltime limit (15 min, enough for my
previous 4-node test runs), and the job got killed. Is there anything
else I can tweak?
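
For reference, the only change I made was flipping this flag in
CPP_EEOPTIONS.h (just a sketch of the relevant line; the surrounding
comments of the real header file are omitted here):

C     previously (the default in my builds):  #undef  GLOBAL_SUM_SEND_RECV
#define GLOBAL_SUM_SEND_RECV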

Thanks,
David
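
P.S. If I read Martin's explanation and the flag name correctly, the
switch only changes how cg2d's global sum is carried out: either the MPI
library does the reduction in a single mpi_allreduce call, or the sum is
assembled with explicit sends/receives. Below is my own schematic of the
two patterns in plain MPI Fortran -- it is NOT the actual MITgcm code,
and the routine and variable names are made up:

      subroutine global_sum_sketch( localSum, globalSum )
      implicit none
      include 'mpif.h'
      real*8 localSum, globalSum
      real*8 tmp
      integer ierr, myRank, nProcs, ip
      integer status(MPI_STATUS_SIZE)
      call mpi_comm_rank( MPI_COMM_WORLD, myRank, ierr )
      call mpi_comm_size( MPI_COMM_WORLD, nProcs, ierr )
#ifndef GLOBAL_SUM_SEND_RECV
C     default: the MPI library does the reduction in one call
      call mpi_allreduce( localSum, globalSum, 1,
     &     MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr )
#else
C     alternative: every process sends its partial sum to rank 0,
C     which adds them up and broadcasts the result back
      if ( myRank .eq. 0 ) then
        globalSum = localSum
        do ip = 1, nProcs-1
          call mpi_recv( tmp, 1, MPI_DOUBLE_PRECISION, ip, 0,
     &         MPI_COMM_WORLD, status, ierr )
          globalSum = globalSum + tmp
        enddo
      else
        call mpi_send( localSum, 1, MPI_DOUBLE_PRECISION, 0, 0,
     &         MPI_COMM_WORLD, ierr )
      endif
      call mpi_bcast( globalSum, 1, MPI_DOUBLE_PRECISION, 0,
     &     MPI_COMM_WORLD, ierr )
#endif
      end

Either way, as Martin says below, cg2d does two of these global sums
(plus two 2D exchanges) per iteration, so the cost of these calls is
incurred many times per time step.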

On Tue, Jul 14, 2009 at 10:50 AM, David Wang <climater at gmail.com> wrote:

> Hi Martin and David,
>
> Thanks for the responses. I think Martin got me right: I use all 8 cores
> (2 CPUs, 4 cores each) in a compute node.
>
> It seems that, as Martin pointed out, MPI is not doing the right thing.
> I'm no expert on this. I have been using the default, which is #undef
> GLOBAL_SUM_SEND_RECV, both on our cluster and on TACC's Ranger and
> Lonestar (perhaps TACC has applied some relevant MPI optimizations by
> default?). But I find the following in one of the verification
> experiments (verification/global_ocean.90x40x15/code/CPP_EEOPTIONS.h):
>
> #define GLOBAL_SUM_SEND_RECV
>
> I will try this. Thanks a lot!
> D.
>
>
> On Tue, Jul 14, 2009 at 10:16 AM, Martin Losch <Martin.Losch at awi.de>wrote:
>
>> The memory bandwidth problem appears as soon as you use more than 1 or 2
>> cores per quad-core unit, so what David is seeing here is probably
>> something different, because it looks like he is running with fully
>> loaded nodes, right?
>>
>> cg2d does 2 2D exchanges and 2 global sums per iteration. I suspect that
>> one of these operations (or both) is very expensive on your system. Can
>> you do a flow-trace analysis that lets you see where the time is actually
>> spent? If I am right, it's not spent in the routine cg2d itself, but in
>> the MPI routines (mpi_send/recv/allreduce, depending on which flags you
>> are using; you can change the behavior a little by defining the
>> appropriate flags in CPP_EEOPTIONS.h).
>>
>> Martin
>>
>>
>> On Jul 14, 2009, at 3:53 PM, David Hebert wrote:
>>
>>  David,
>>>
>>> I recall a discussion earlier in the year about difficulties with
>>> quad-core processors and memory bandwidth. Could this be what you are
>>> seeing as you increase the number of cores?
>>>
>>> David
>>>
>>> David Wang wrote:
>>>
>>>> Hi MITgcmers,
>>>>
>>>> We have experienced problems with MITgcm on a small local cluster (24
>>>> nodes, each with two quad-core AMD Opteron "Shanghai" CPUs, InfiniBand
>>>> interconnect, OpenMPI 1.3.2). The symptom is that when we increase the
>>>> number of processors (nProcs), the pressure solver cg2d takes a
>>>> progressively larger share (SOLVE_FOR_PRESSURE in STDOUT.0000) of the
>>>> total walltime (ALL in STDOUT.0000), and this percentage is much larger
>>>> than on other clusters (specifically TACC's Ranger and Lonestar).
>>>>
>>>> Some 1-year hydrostatic, implicit free-surface test runs on a
>>>> 360x224x46 grid with asynchronous timestepping (1200 s / 43200 s) give
>>>> the following statistics:
>>>>
>>>> nodes  cores  ALL (sec)  SOLVE_FOR_PRESSURE (sec)  SOLVE_FOR_PRESSURE/ALL (%)
>>>>     1      8       1873                        93                       4.97
>>>>     2     16        922                       129                      13.99
>>>>     4     32        682                       310                      45.45
>>>>
>>>> And with 96 cores, this percentage soars to about 80%!
>>>>
>>>> However, our experience with TACC's Ranger and Lonestar shows that this
>>>> percentage does increase with the number of processors, but it never
>>>> rises above 40%. TACC's machines use MVAPICH, so we also tested MVAPICH
>>>> on our local cluster, but with no better luck.
>>>>
>>>> We have no idea why the cg2d pressure solver runs so inefficiently on
>>>> our cluster. If anyone can kindly provide a few clues, we would very
>>>> much appreciate it.
>>>>
>>>> Thanks,
>>>> David
>>>>



-- 
turn and live.