[MITgcm-support] changing number of processors

Fri Feb 6 17:43:54 EST 2015

Matt, I would like to know more about what you are saying, because I 
obviously want to maximize the efficiency of the code: I need to run 
some very long simulations (multiple Pluto years, where 1 Pluto year = 
248 Earth years).  My conclusions about processors came from testing on 
TACC Lonestar (12 cores/node, now defunct), TACC Stampede (16 
cores/node), two local machines (Notus & Boreas with 24 and Ghost with 
64 processors each), a funny computer cluster at the University of 
Houston (Titan) that has both 12-core and 8-core nodes, where the 
individual nodes are very fast but poor connections between nodes make 
it really only useful up to 12 nodes, and NASA HEC Pleiades (which 
offhand I think has 12 cores per node).

You're right, the scaling is quite bad under my scheme, so if you or 
anyone could help, it would be quite valuable to me.
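
To make the overlap overhead Matt mentions concrete, here is a minimal 
Python sketch that compares the size of a tile including its overlap 
region to its interior.  It is only a rough proxy for the extra memory 
and exchange work, not a model of MITgcm run time.  The function name 
halo_ratio is made up for illustration, and the overlap width of 2 
(OLx, OLy) is an assumption, not a value given in this thread.

```python
def halo_ratio(sNx, sNy, OLx=2, OLy=2):
    """Ratio of tile points including overlap to interior points.

    sNx, sNy: interior tile size; OLx, OLy: overlap width on each side
    (assumed to be 2 here; the thread does not state the actual value).
    """
    interior = sNx * sNy
    with_halo = (sNx + 2 * OLx) * (sNy + 2 * OLy)
    return with_halo / interior


# Compare a few square tile sizes under the assumed overlap width.
for sNx, sNy in [(2, 2), (8, 8), (32, 32)]:
    print(f"sNx={sNx:2d} sNy={sNy:2d} ratio={halo_ratio(sNx, sNy):.3f}")
```

Under that assumed overlap, a 2x2 tile carries 36 points for 4 interior 
points (a ratio of 9), while a 32x32 tile carries about 1.27 points per 
interior point.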

I've attached a plot of my findings.  I've included only the fastest 
times because, as I said before, there are multiple ways to configure, 
say, 24 processors.  (Sorry there are two files; I pulled them from 
different sources, since I have only recently gained access to Stampede 
and my access to other machines has been yanked.)
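
As an illustration of the "multiple ways" for a given processor count, 
here is a minimal Python sketch that lists the (nPx, nPy) splits of a 
192 x 32 grid with one tile per process (nSx = nSy = 1).  The function 
name decompositions is made up for illustration.  The sketch treats 
the grid as a plain rectangle and ignores any cube-sphere face 
constraints, so some of the splits it lists may not be usable in 
practice.

```python
def decompositions(Nx, Ny, nprocs):
    """List (nPx, nPy, sNx, sNy) with nPx*nPy == nprocs and nSx = nSy = 1.

    Only splits that divide the grid evenly are kept, so that
    sNx*nPx == Nx and sNy*nPy == Ny.
    """
    found = []
    for nPx in range(1, nprocs + 1):
        if nprocs % nPx != 0:
            continue
        nPy = nprocs // nPx
        if Nx % nPx == 0 and Ny % nPy == 0:
            found.append((nPx, nPy, Nx // nPx, Ny // nPy))
    return found


# Grid size taken from sNx*nSx*nPx = 192 and sNy*nSy*nPy = 32.
for nPx, nPy, sNx, sNy in decompositions(192, 32, 24):
    print(f"nPx={nPx:3d} nPy={nPy:3d} sNx={sNx:3d} sNy={sNy:3d}")
```

For 24 processes this gives four splits; for 1536 processes the list 
includes the nPx=96, nPy=16, sNx=2, sNy=2 setup quoted below.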

	Angela

On 02/06/2015 01:39 PM, Matthew Mazloff wrote:
> Hi Angela
>
> The MITgcm scales far better than you are reporting. Given your use of sNx=2, I think you are not considering the extra overhead you are introducing by increasing the overlapping areas.
>
> And regarding node dependence, that is very dependent on platform and memory/process of your executable. I don't think it has anything to do with the faces of the cube-sphere setup you are running…but perhaps I am wrong on this. What I think happened is that when you exceeded 12 processes on the node you exceeded the available local memory, and that has nothing to do with communication.
>
> Finally, the number of processes/core you request will also be machine dependent. I suspect some cores would actually do better with nSx=2, even given the extra overlap.
>
> sorry to derail this thread...
> Matt
>
>
> On Feb 6, 2015, at 10:38 AM, Angela Zalucha <azalucha at seti.org> wrote:
>
>> Hi,
>>
>> I'm not sure why you would be getting NaNs, but I have found that there is a trick to increasing the number of processors.  I ran on a machine that has 12 processors per node, and the highest number of processors I could run on was 1536. (I should point out that at high processor counts I found the code to be less efficient, so if you have a limited number of processor hours, you might be better off running with fewer processors; e.g., the wall clock time difference between 768 and 1536 processors is only a factor of 1.03.)
>>
>> Anyway, here are my SIZE.h parameters:
>> sNx=2
>> sNy=2
>> nSx=1
>> nSy=1
>> nPx=96
>> nPy=16
>>
>> I have noticed the following during my scaling tests (and maybe someone can confirm my explanations for this behavior):
>> 1) In scaling tests on a machine with 12 processors per node, a 12 processor/node test had faster wall clock times than a 16 processor/node test. I think this is because the cube-sphere geometry has a "built-in" factor of 6, and communication across cube faces gets strange when the number of processors is not a multiple of 6.
>> (This deeply saddens me, because the 12 processor machine I used to use was retired Jan. 1, and now I have to run on a 16 processor machine; even though this is the wave of the future, it hurts my efficiency.)
>> 2) sNx*nSx*nPx = 192 and sNy*nSy*nPy = 32
>> 3) For the same number of processors, faster wall clock times are achieved when nSx and nSy are minimized.
>>
>> I can produce tables and tables of configurations if you want, since at low processor counts there is degeneracy between sNx, nSx, nPx and sNy, nSy, nPy, respectively.
>>
>>    Angela
>>
>>
>> On 02/06/2015 08:45 AM, Jonny Williams wrote:
>>> Hi everyone
>>>
>>> I'm trying to run my regional model on 480 processors, up from a
>>> successfully working 48 processor version.
>>>
>>> I have recompiled my code.
>>>
>>> To do this (in SIZE.h) I reduced sNy by a factor of 10 and increased nPy
>>> by a factor of ten so that nPx*nPy was increased by a factor of 10,
>>> which I think is the total number of processors.
>>>
>>> The executable was created fine and the model does run but the data I am
>>> getting out in my NetCDF files (mnc package) is all NaNs.
>>>
>>> Has anyone encountered this type of issue or know how to fix it?
>>>
>>> Is there a maximum number of processors?
>>>
>>> Many thanks
>>>
>>> Jonny
>>>
>>> --
>>> Dr Jonny Williams
>>> School of Geographical Sciences
>>> Cabot Institute
>>> University of Bristol
>>> BS8 1SS
>>>
>>> +44 (0)117 3318352
>>> jonny.williams at bristol.ac.uk <mailto:jonny.williams at bristol.ac.uk>
>>> http://www.bristol.ac.uk/geography/people/jonny-h-williams
>>> <http://bit.ly/jonnywilliams>
>>>
>>>
>>> _______________________________________________
>>> MITgcm-support mailing list
>>> MITgcm-support at mitgcm.org
>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>
>>
>> --
>> =====================
>> Angela Zalucha, PhD
>> Research Scientist
>> SETI Institute
>> +1 (617) 894-2937
>> =====================
>
>

-- 
=====================
Angela Zalucha, PhD
Research Scientist
SETI Institute
+1 (617) 894-2937
=====================
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scaling_all_2-eps-converted-to.pdf
Type: application/pdf
Size: 10093 bytes
Desc: not available
URL: <http://mitgcm.org/pipermail/mitgcm-support/attachments/20150206/52e4ead0/attachment-0001.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scaling_all.eps
Type: image/x-eps
Size: 28568 bytes
Desc: not available
URL: <http://mitgcm.org/pipermail/mitgcm-support/attachments/20150206/52e4ead0/attachment-0001.bin>