[MITgcm-support] Better than expected HPC Scaling

Edward Doddridge edward.doddridge at utas.edu.au
Tue Oct 20 17:05:43 EDT 2020


Thanks Martin. I should have thought to use the profiling info in STDOUT to explore the details.

Ed


On 19 Oct 2020, at 19:02, Martin Losch <martin.losch at awi.de> wrote:

Hi Ed,

certainly not my field of expertise, but could you use the profiling information at the end of STDOUT to figure out which parts of the code are scaling super-linearly? E.g., is it the dynamics (solve_for_pressure, probably not) or the thermodynamics? Maybe that can give you a clue.
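
As a rough sketch of how one might pull those numbers out for comparison across runs (the regexes assume the timing summary in STDOUT.0000 contains 'Seconds in section "..."' blocks with a 'Wall clock time:' entry; the exact phrasing may differ between MITgcm versions, so treat them as a starting point):

import re, sys

section_re = re.compile(r'Seconds in section "([^"]+)"')
wall_re = re.compile(r'Wall clock time:\s*([0-9.Ee+-]+)')

def read_timings(path):
    """Return {section name: wall-clock seconds} parsed from one STDOUT file."""
    timings, current = {}, None
    with open(path) as f:
        for line in f:
            m = section_re.search(line)
            if m:
                current = m.group(1).strip()
                continue
            m = wall_re.search(line)
            if m and current is not None:
                timings[current] = float(m.group(1))
                current = None
    return timings

# Usage: python parse_timings.py run_048/STDOUT.0000 run_480/STDOUT.0000 ...
for path in sys.argv[1:]:
    for name, secs in sorted(read_timings(path).items(), key=lambda kv: -kv[1])[:8]:
        print(f"{path:30s} {name:45s} {secs:10.1f} s")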

Did you check out
https://en.wikipedia.org/wiki/Speedup#Super-linear_speedup
for more possible reasons?

Martin

> On 15. Oct 2020, at 00:07, Edward Doddridge <edward.doddridge at utas.edu.au> wrote:
>
> Thanks Matt and Dimitris.
>
> 48 cores is definitely a small number for a job like this, but I wasn’t pushing the memory limits – these cores all have plenty of memory (4GB per core). As for sharing the node, it’s a good thought, but each node has 48 cores (which is why I went in multiples of 48). I also tried to keep the tiles as square as possible. They weren’t always perfect, but the 480 core run actually had slightly squarer tiles than the 384 core run and the 768 core run had perfectly square tiles.
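>
> For what it's worth, here is a back-of-envelope estimate of the redundant halo work for a few illustrative decompositions of the 600x800 domain. The overlap width is a guess of OLx = OLy = 3 (the real value depends on the advection scheme), and these particular processor grids are examples only, not necessarily the ones the runs used:
>
> Nx, Ny, OL = 600, 800, 3   # horizontal grid and an assumed overlap width
>
> def overlap_fraction(npx, npy):
>     """Fraction of extra (halo) points computed per tile for an npx x npy decomposition."""
>     snx, sny = Nx // npx, Ny // npy            # interior tile size
>     full = (snx + 2 * OL) * (sny + 2 * OL)     # tile including overlap rows/columns
>     return full / (snx * sny) - 1.0
>
> # Illustrative decompositions only; the actual runs may have used different ones.
> for npx, npy in [(6, 8), (12, 8), (24, 16), (24, 20), (24, 32)]:  # 48 ... 768 cores
>     print(f"{npx*npy:4d} cores ({npx}x{npy}): {100*overlap_fraction(npx, npy):5.1f}% halo overhead")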
>
> “Another possibility is that this is within the machine noise”
> That’s certainly a possibility. I haven’t rerun all of the tests, but I reran a couple and the timings were very similar. The variation wasn’t enough to pull the curve below the ideal scaling curve. I don’t think this is noise; it seems to me that there is structure in the signal.
>
> “Are you doing any I/O?”
> I’m not doing much I/O. I set it to output a few monthly mean fields, but that was all. Using the timing breakdown in STDOUT, the time spent in “DO_THE_MODEL_IO” scales pretty well with the core count (see attached).
>
> Cheers,
> Ed
>
>
> Dimitris Menemenlis menemenlis at jpl.nasa.gov
> Tue Oct 13 00:38:33 EDT 2020
>
> I like the theory of better cache utilization initially, from 48 to ~250 processors, then degrading at higher processor counts because of communications (and, to a lesser extent, more grid cells and hence more computation in the overlap regions).
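>
> A quick way to sanity-check the cache idea is to estimate the per-core working set. The array count and 8-byte reals below are rough assumptions; the true footprint depends on which packages are compiled in:
>
> # Rough per-core working-set estimate for a 600x800x50 domain, assuming
> # ~20 three-dimensional state/tendency arrays of 8-byte reals (a guess).
> Nx, Ny, Nr = 600, 800, 50
> n_arrays, bytes_per_value = 20, 8
>
> for cores in (48, 96, 192, 384, 480, 768):
>     cells = Nx * Ny * Nr / cores
>     mb = cells * n_arrays * bytes_per_value / 1e6
>     print(f"{cores:4d} cores: ~{mb:6.1f} MB per core")
>
> Dropping from tens of MB per core towards a few MB is roughly where a shared last-level cache starts to hold most of the working set, which would be consistent with the super-linear regime between 48 and ~250 cores.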
>
> Are you doing any I/O?
>
> > On Oct 12, 2020, at 9:24 PM, Matthew Mazloff <mmazloff at ucsd.edu> wrote:
> >
> > Hi Ed
> >
> > It depends on the machine you are running on. But it’s only slightly better at 250 than at 48, and 48 does seem a bit low for a job of that size. Maybe you were pushing max memory? It is odd, but I suspect you are on a machine that has very fast interconnects and I/O, and 250 is just more efficient.
> >
> > Another possibility is that this is within the machine noise. Or that you were sharing a node when you ran the 48 and 96 job. Or that your tiles were very rectangular for the 48 and 96 core jobs so you had more overlap.
> >
> > Matt
> >
> >
>
>
> From: Edward Doddridge <edward.doddridge at utas.edu.au>
> Date: Tuesday, 13 October 2020 at 15:00
> To: "mitgcm-support at mitgcm.org" <mitgcm-support at mitgcm.org>
> Subject: Better than expected HPC Scaling
>
> Hi MITgcmers,
>
> As part of an HPC bid I need to provide some scaling information for MITgcm on their cluster. The test configuration is a reentrant channel with 600x800x50 grid points, using just the ocean component and some idealised forcing fields. As I increased the core count between 48 and 384 the model scaled better than the theoretical ideal (see attached figure). I’m not complaining that it ran faster, but I was surprised. Any thoughts about what would cause this sort of behaviour? I wondered if it might be something to do with the tiles not fitting in the cache for the low core count simulations. The bid might be more convincing if I can give a plausible explanation for why the model scales better than ideal.
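>
> For the bid I'm summarising the runs as speedup and parallel efficiency relative to the 48-core baseline, along these lines (the wall-clock times below are placeholders, not the measured values):
>
> # Speedup and parallel efficiency relative to the smallest run.
> # Substitute the measured wall-clock times; 1.0 is a placeholder.
> wall_times = {48: 1.0, 96: 1.0, 192: 1.0, 384: 1.0, 480: 1.0, 768: 1.0}
>
> base_cores = min(wall_times)
> base_time = wall_times[base_cores]
> for cores in sorted(wall_times):
>     speedup = base_time / wall_times[cores]
>     ideal = cores / base_cores
>     efficiency = speedup / ideal
>     print(f"{cores:4d} cores: speedup {speedup:6.2f} (ideal {ideal:5.1f}), efficiency {efficiency:5.2f}")
>
> Efficiency above 1.0 is the super-linear regime in the attached figure.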
>
> Cheers,
> Ed
>
>
> Edward Doddridge
> Research Associate and Theme Leader
> Australian Antarctic Program Partnership (AAPP)
> Institute for Marine and Antarctic Studies (IMAS)
> University of Tasmania (UTAS)
>
> doddridge.me
>
>
>
> <Screen Shot 2020-10-15 at 09.02.33.png>

_______________________________________________
MITgcm-support mailing list
MITgcm-support at mitgcm.org
http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support


