[MITgcm-support] speedup for cs64 on a linux cluster

Martin Losch Martin.Losch at awi.de
Tue May 22 15:50:45 EDT 2012


Hi there,

I know of one refereed publication that also discusses scaling:
http://ecco2.org/manuscripts/2007/Hill_etal_07_SciProg.pdf
and there are others in the grey literature, e.g. http://epic.awi.de/18337/1/Los2008a.pdf, and more that I couldn't find today.

Based on my own scaling analysis (and on MITgcm rumors), tile sizes below about 30x30 are not very efficient, mostly because of the pressure solver cg2d.F, so you can only expect good scaling for grids that are big enough. The cs32 grid, or even a cs64 grid, is not very large, and I would not expect anything useful beyond 6 CPUs in the former case and 24 in the latter.
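
For illustration, a 32x32 decomposition of the cs64 grid (6 faces of 64x64, usually laid out as a 384x64 domain) with one tile per MPI process on 24 processes would look roughly like the SIZE.h sketch below; the overlap widths and the exact nPx/nPy split are assumptions that depend on your particular setup:

C     Sketch of SIZE.h for cs64: 384x64 domain split into 24 tiles of
C     32x32, one tile per MPI process (nSx=nSy=1, nPx*nPy=24).
C     Overlap widths (OLx/OLy) depend on the advection scheme etc.;
C     4 is only an example.  Nr=15 matches the 15-level setup below.
      INTEGER sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy, Nx, Ny, Nr
      PARAMETER (
     &           sNx =  32,
     &           sNy =  32,
     &           OLx =   4,
     &           OLy =   4,
     &           nSx =   1,
     &           nSy =   1,
     &           nPx =  12,
     &           nPy =   2,
     &           Nx  = sNx*nSx*nPx,
     &           Ny  = sNy*nSy*nPy,
     &           Nr  =  15 )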

Things like I/O heavily affect the performance (and scaling). You have to test whether useSingleCpuIO helps, whether debugLevel=-1 or increasing monitorFreq helps, etc., or whether reading from and writing to a different (local, faster) file system helps. For a scaling analysis I'd turn off all I/O for a start and only later add some I/O back in; see the Hill et al. link above.
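
For example, in the main "data" namelist file the relevant switches would look roughly like this (just a sketch; which namelist group each parameter lives in can differ between model versions, so check against your own data file):

 &PARM01
# let a single process collect and write global output files instead of one file per tile
 useSingleCpuIO=.TRUE.,
# reduce standard-output diagnostics
 debugLevel=-1,
 &

 &PARM03
# write monitor statistics less often, e.g. once per model day (value in seconds)
 monitorFreq=86400.,
 &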

There are many other factors that affect scaling, the most important one being the architecture you are running on. I have access to computers where the exchange between cores on one node is fast, but the exchange between nodes is slow, so that as soon as the number of CPUs exceeds the number of CPUs per node, scaling degrades.

A further issue with multicore chips is memory bandwidth, because many cores try to access main memory through the same bus. In that case the only thing that helps is to use only some of the cores on each chip.

Hope that helps,
Martin

On May 22, 2012, at 7:31 PM, Angela Zalucha wrote:

> I also have not found any scaling analysis anywhere, but here is the test I performed: I essentially ran the 3D Held-Suarez cs experiment (with slightly more advanced RT) with 30 levels. The test was performed on the Texas Advanced Computing Center Lonestar Linux cluster and went for 120,000 iterations.
> 
> I attached a plot. The number of processors increases by powers of 2 times 12 (i.e. 12, 24, 48, 96, 192, 384, 768, 1536). I did not plot the ideal case, but it would be a straight line. The scaling is not ideal at large numbers of processors. Oddly, the scaling is also not constant, e.g. 48 to 96 and 192 to 384 produce a greater improvement than 96 to 192.
> 
> Also, I noticed that for a given number of processors, lower nSx and nSy are always faster.
> 
> In 2D, on my group's local (Linux) cluster at Southwest, 2 procs is better than 1 proc, but 4 procs actually runs slower.
> 
>  Angela
> 
> 
> 
> On Tue, 22 May 2012, Maura BRUNETTI wrote:
> 
>> Dear MITgcm users,
>> I am studying scaling properties of ocean-only configurations on a linux cluster.
>> The results shown in the attached figure are obtained with a cubed-sphere configuration with 64x64 face resolution
>> and 15 vertical levels (points in the figure correspond to: 6 tiles of 64x64 on 1 proc, 1 tile of 64x64 on 6 procs,
>> 1 tile of 32x64 on 12 procs, 1 tile of 32x32 on 24 procs, and 1 tile of 16x32 on 48 procs). Only the packages GMredi
>> and tave are activated at run time.
>> The scaling is not very good, starting already at 12 procs (see the blue line). I have not found other scaling
>> analyses in the literature; could you please suggest where I can find some? From my analysis, I have seen that it is
>> not worth doing runs with tile dimensions smaller than 32 grid points. Is that correct?
>> Thanks,
>> Maura --
>> Dr. Maura Brunetti
>> Institute for Environmental Sciences (ISE)
>> University of Geneva -- Switzerland
> <timevsproc_plot.eps>

Martin Losch
Martin.Losch at awi.de