[MITgcm-devel] [MITgcm-support] BLOCKING_EXCHANGES slowdown when using pkg/ptracers on Columbia
Dimitris Menemenlis
menemenlis at jpl.nasa.gov
Fri Mar 12 11:50:41 EST 2010
Martin, Jean-Michel, and Holly, thank you for the feedback.
Martin, thank you for the code. I will be at MIT next week and, if JM wants me to, I can clean it up and check it in.
But as JM points out, this is unlikely to solve An's problem, since the slowdown already occurs with a single tracer.
An has moved her ptracer job to another system (Pleiades) and is now running with a reasonable slowdown, i.e.,
comparable to that reported by Holly.
The slowdown could be a hardware or system problem specific to Columbia. We have also reported it to the NAS support people, and they are looking into it.
Most of Columbia is being decommissioned in the near future, and since the configuration is running well on Pleiades, it is probably not worth spending any more time diagnosing the cause of the slowdown.
Dimitris
Dimitris Menemenlis <menemenlis at jpl.nasa.gov>
Jet Propulsion Lab, California Institute of Technology
MS 300-323, 4800 Oak Grove Dr, Pasadena CA 91109-8099, USA
tel: 818-354-1656; cell: 818-625-6498; fax: 818-393-6720
On Mar 12, 2010, at 6:25 AM, Holly Dail wrote:
> Hello -
>
> I found about a 50% slowdown on long runs on the Altix here at MIT
> with 7 tracers.
>
> My runs were of a 120x108x23 N. Atlantic grid, 3600 sec time step on
> 20 processors and using 7 tracers. 300 years of simulation took 60 h
> without tracers and about 91 h with.
>
> Holly
>
> On Mar 12, 2010, at 2:29 AM, Martin Losch wrote:
>
>> I agree with Jean-Michel, it is puzzling that even 1 ptracer is
>> enough to cause this problem (and in this case my hack won't do
>> anything).
>>
>> As I said, my solution is a hack, probably specific to the platform
>> I was using. It made my simulations (with 16 tracers) fast enough
>> that I could continue with the integration; I never bothered
>> to look into this in detail.
>>
>> Just as additional information: in my/our case we were running on
>> JUMP, an IBM690/P4 cluster in Jülich (now offline), and observed bad
>> scaling with the number of CPUs. With the hack we were happy with the
>> scaling and did not investigate any further (see this terrible gray
>> paper, which describes the problem:
>> <http://epic.awi.de/Publications/Los2008a.pdf>).
>>
>> Martin
>>
>> On Mar 11, 2010, at 10:48 PM, Jean-Michel Campin wrote:
>>
>>> Hi,
>>>
>>> I think it would still be useful to understand why Dimitris
>>> sees such a slowdown, especially with only 1 tracer (in this case
>>> Martin's solution, copying back and forth to an array of
>>> exactly the same shape, is unlikely to speed up the run).
>>> Then the 8-tracer case, which scales like the 1-tracer case, will be
>>> easy to interpret once we understand the 1-tracer case.
>>>
>>> Jean-Michel
>>>
>>> On Thu, Mar 11, 2010 at 05:30:43PM +0100, Martin Losch wrote:
>>>> Hi Constantinos,
>>>>
>>>> I am moving this to the devel-list:
>>>>
>>>> I would have added the code long ago, if it had been clean, but it
>>>> isn't. Jean-Michel might even remove my cvs privileges if he only
>>>> _sees_ the code changes (o:
>>>>
>>>> I am happy to provide my changes (they are against very much
>>>> outdated code), and Dimitris can see if it works for him. Then we
>>>> should think about how to include this in a more general context,
>>>> but that's way beyond my technical (and currently even time)
>>>> capabilities.
>>>>
>>>> Martin
>>>> PS. Here comes the code (between checkpoint59n and p).
>>>>
>>>
>>>
>>>>
>>>>
>>>> On Mar 11, 2010, at 5:02 PM, Constantinos Evangelinos wrote:
>>>>
>>>>> On Thursday 11 March 2010 05:52:23 am Martin Losch wrote:
>>>>>
>>>>>> I observed a huge slowdown with ptracers as well; the solution
>>>>>> was to copy the ptracers into a new 5D array where the 3rd index
>>>>>> (k) is nPtracer*Nr long,
>>>>>
>>>>> You mean 3D I take it.
>>>>>
>>>>>> and then use one exchange for that (and I think I had to hack the
>>>>>> corresponding exchange routine to allow such large arrays). I can
>>>>>> provide code if required.
>>>>>
>>>>> Since more than one person has this problem it might be a good
>>>>> time to add the
>>>>> code (as an IFDEF) in the main code.
>>>>>
>>>>> Constantinos
>>>>> --
>>>>> Dr. Constantinos Evangelinos
>>>>> Department of Earth, Atmospheric and Planetary Sciences
>>>>> Massachusetts Institute of Technology
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> MITgcm-support mailing list
>>>>> MITgcm-support at mitgcm.org
>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>
>>>
>>>> _______________________________________________
>>>> MITgcm-devel mailing list
>>>> MITgcm-devel at mitgcm.org
>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>
>>>
>>
>>
>
>