[MITgcm-support] changing number of processors

Martin Losch Martin.Losch at awi.de
Mon Mar 16 12:16:58 EDT 2015


Jonny, 

in the end you’d have to do proper profiling to find out what the problem is. With netcdf, a lot of time is often spent finding the correct netcdf-id of a variable (some integer function, mnc_int or similar), but other, totally unrelated issues can stem from the file system. Depending on its type, you may want to adjust your I/O parameters. E.g. on a global file system (such as an NFS-mounted system or a GFS) the processes may not be allowed to write at the same time and “queue” up. In that case it may help to have only one CPU do all of the I/O (useSingleCPUio = .TRUE.), or to write to local hard disks (in cases where each node has its own tmp-disk) … You’ll have to experiment a little.
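
For reference, a minimal sketch of where that flag goes (assuming the usual spelling useSingleCpuIO in the PARM01 group of the runtime file "data"; double-check the name against your MITgcm version):

  # excerpt of the runtime parameter file "data" (sketch only, not a complete file)
  # the flag makes one process gather and write global output files,
  # instead of every MPI rank writing its own tile
   &PARM01
   useSingleCpuIO=.TRUE.,
   &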

Martin

> On 16 Mar 2015, at 14:58, Jonny Williams <Jonny.Williams at bristol.ac.uk> wrote:
> 
> Thanks for this Dimitris
> 
> The timings at the end of the STDOUT.0000 file are extremely useful and have enabled me to confirm that I/O definitely is the limiting factor in my runs at the moment.
> 
> For example, in going from 48 (6x8) to 480 (15x32) processors, the time spent in the section called "DO_THE_MODEL_IO [FORWARD_STEP]" in STDOUT.0000 increased from 3% to 67% of the amount of total run time used!
> 
> I may well have to look into using the mdsio package again, unless there is a way around this NetCDF I/O issue?
> 
> Many thanks again
> 
> Jonny
> 
> 
> 
> On 5 March 2015 at 11:57, Menemenlis, Dimitris (329D) <Dimitris.Menemenlis at jpl.nasa.gov> wrote:
> My personal prejudice (and it may be wrong): if you want efficient I/O, you need to get rid of the netcdf package; mdsio and its extensions are a lot more flexible and efficient. In any case there is no need to guess about the cause of the bottleneck; just look at the timings at the end of your STDOUT.0000 file.
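> 
> (For concreteness, a hedged sketch of that switch, assuming a standard MITgcm setup: turning the mnc/netcdf package off in data.pkg makes the state output fall back to mdsio, i.e. plain-binary .data/.meta tile files written at the dumpFreq interval set in PARM03 of "data". Check the flag names against your model version.)
> 
>   # data.pkg (illustrative excerpt only; keep your other package flags as they are)
>   # with mnc switched off, the model writes mdsio output instead
>    &PACKAGES
>    useMNC=.FALSE.,
>    &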
> 
> On Mar 5, 2015, at 3:20 AM, Jonny Williams <Jonny.Williams at bristol.ac.uk> wrote:
> 
>> As a related question to this thread, is it possible to output one NetCDF file per stream (state*.nc, ptracers*.nc, etc) rather than one per process?
>> 
>> I am currently running on ARCHER, the UK national supercomputing facility, and I am not getting the speed-up that I expect for a long job, whereas I did get the expected speed-up for a very short test job.
>> 
>> I am thinking that I/O may perhaps be the bottleneck here?
>> 
>> Cheers!
>> 
>> Jonny
>> 
>> On 10 February 2015 at 07:38, Martin Losch <Martin.Losch at awi.de> wrote:
>> Hi Jonny and others,
>> 
>> I am not sure if I understand your question about "the utility of the overlap cells": the overlaps are filled with the values of the neighboring tiles so that you can compute terms of the model equations near the boundary; without the overlap you would not be able to evaluate any horizontal gradient or average at the tile edges.
>> The size of the overlap depends on the computational stencil that you want to use. A 2nd-order operator needs an overlap of 1, a 3rd-order operator needs an overlap of 2, and so forth. I think the model tells you when your choice of advection scheme requires more overlap than you have specified.
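>> 
>> (An illustrative sketch of the idea, not MITgcm code: a single tile is declared with its halo, and a 2nd-order centered difference at the western tile edge, i=1, reads phi(0,j), which only exists because of the overlap. In the real model the halo values come from an exchange with the neighbouring tiles.)
>> 
>>       PROGRAM OVERLAP_DEMO
>> C     Sketch only: a tile of sNx x sNy interior points plus an
>> C     OLx/OLy halo, and a centered x-difference that needs the
>> C     halo value phi(0,j) at the western edge of the tile.
>>       IMPLICIT NONE
>>       INTEGER sNx, sNy, OLx, OLy
>>       PARAMETER ( sNx = 4, sNy = 4, OLx = 1, OLy = 1 )
>>       REAL*8 phi(1-OLx:sNx+OLx, 1-OLy:sNy+OLy)
>>       REAL*8 dPhiDx(sNx,sNy), dx
>>       INTEGER i, j
>>       dx = 1.0D0
>> C     fill the tile and its halo with a simple field; in MITgcm the
>> C     halo is filled from the neighbouring tiles by an exchange call
>>       DO j = 1-OLy, sNy+OLy
>>        DO i = 1-OLx, sNx+OLx
>>         phi(i,j) = DBLE(i)
>>        ENDDO
>>       ENDDO
>> C     2nd-order centered difference: an overlap of 1 is enough here;
>> C     a wider stencil (e.g. a 3rd-order advection scheme) needs more
>>       DO j = 1, sNy
>>        DO i = 1, sNx
>>         dPhiDx(i,j) = ( phi(i+1,j) - phi(i-1,j) ) / (2.0D0*dx)
>>        ENDDO
>>       ENDDO
>>       PRINT *, 'dPhiDx(1,1) = ', dPhiDx(1,1)
>>       END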
>> 
>> Martin
>> 
>> PS:
>> Here’s my experience with scaling or not scaling (by no means are these absolute numbers or recommendations):
>> As a rule of thumb, the MITgcm dynamics/thermodynamics kernel (various packages may behave differently) usually scales nearly linearly down to tile sizes (sNx * sNy) of about 30*30, at which point the ratio of overlap to interior becomes unfavorable (because of too many local communications between individual tiles) and the global pressure solver takes its toll (because of global communications, when all processes have to wait). Below this tile size the time to solution still decreases with more processors, but ever more slowly, until the overhead costs more than the speedup gains. I am reiterating what Matt already wrote: it’s obvious that for a 30x30 tile the number of overlap cells is nearly 2*(OLx*sNy + OLy*sNx), so for an overlap of 2 you already have 8*30 = 240 cells in the overlap, more than one quarter of the 900 cells in the interior, etc. From this point of view a tile size of 2x2 plus 1 gridpoint of overlap is totally inefficient (16 allocated cells for only 4 interior ones).
>> Further, it is probably better to have nearly square tiles (sNx ~ sNy), except on vector machines, where you try to make sNx as large as possible (at least until you reach the maximum vector length of your machine).
>> 
>> In my experience you need to test this for every new computer that you have access to, to find the range of processor counts that you can run with efficiently. For example, it may be more economical to use fewer processors and wait a little longer for the result, but have enough CPU time left to do a second run of the same type, than to spend all your CPU time on a run with twice as many processors that finishes faster, but not twice as fast, because the linear scaling limit has been reached.
>> 
>> > On 09 Feb 2015, at 16:05, Jonny Williams <Jonny.Williams at bristol.ac.uk> wrote:
>> >
>> > Dear Angela, Matthew
>> >
>> > Thanks you very much for your emails.
>> >
>> > For your information, I have now gotten round my initial NaN problem by using a shorter timestep, although I don't know why this would have made much difference...
>> >
>> > Your discussion about the overlap parameters and run speed is of interest to me because I found that a decrease in timestep by a factor of 4 and an increase in the number of processors by a factor of 10 resulted in an almost identical run speed!
>> >
>> > My SIZE.h parameters were as follows...
>> >
>> > PARAMETER (
>> >      &           sNx =  75,
>> >      &           sNy =  10,
>> >      &           OLx =   4,
>> >      &           OLy =   4,
>> >      &           nSx =   1,
>> >      &           nSy =   1,
>> >      &           nPx =   6,
>> >      &           nPy =   80,
>> >      &           Nx  = sNx*nSx*nPx,
>> >      &           Ny  = sNy*nSy*nPy,
>> >      &           Nr  =   50)
>> >
>> > ... so (using the calculation from the earlier email) I have (4+75+4)*(4+10+4)=1494 grid cells per process and (75*10/1494)=50% are cells I care about.
>> >
>> > This is really good to know, but it got me to thinking: what is the utility of these overlap cells in the first place?
>> >
>> > Many thanks!
>> >
>> > Jonny
>> 
>> 
> 
> 
> 
> 
> 
> -- 
> Dr Jonny Williams
> School of Geographical Sciences
> Cabot Institute
> University of Bristol
> BS8 1SS
> 
> +44 (0)117 3318352
> jonny.williams at bristol.ac.uk
> http://www.bristol.ac.uk/geography/people/jonny-h-williams