[MITgcm-support] changing number of processors

Jonny Williams Jonny.Williams at bristol.ac.uk
Mon Mar 16 12:46:12 EDT 2015


Thanks a lot Martin

For your information, we are running on ARCHER, the UK national
supercomputing service. It is a Cray machine and I am writing to a Lustre
file system.

I am already running with the useSingleCPUio = .TRUE. flag.
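
For reference, that flag sits in the PARM01 namelist of my data file, along
these lines (a sketch with the other entries omitted; I believe the exact
spelling in PARAMS.h is useSingleCpuIO, although namelist input is
case-insensitive anyway):

 &PARM01
 # ... other run-time parameters ...
  useSingleCpuIO=.TRUE.,
 &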

I will get in touch with the ARCHER support people and will let you
know if I make any more progress!

Thanks to everyone for their continued support.

Jonny

On 16 March 2015 at 16:16, Martin Losch <Martin.Losch at awi.de> wrote:

> Jonny,
>
> in the end you’d have to do proper profiling to find out what the
> problem is. With netcdf, a lot of time is often spent on finding the correct
> netcdf ID of a variable (some integer function mnc_int or similar), but
> other, totally unrelated issues can stem from the file system. Depending on
> its type, you may want to adjust your I/O parameters. E.g. on a global
> file system (like an NFS-mounted system or a GFS) the processes may not be
> allowed to write at the same time and “queue” up. In that case having only
> one CPU do all of the I/O may be useful (useSingleCPUio = .TRUE.), or
> writing to local hard disks (in cases where each node has its own
> tmp-disk) … You’ll have to experiment a little.
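>
> A minimal sketch of what I mean (in data, PARM01; please double-check the
> exact parameter names in PARAMS.h, mdsioLocalDir only applies to mdsio
> output, and the path is just a placeholder):
>
>  &PARM01
>  # let one process collect and write global files:
>   useSingleCpuIO=.TRUE.,
>  # or, where each node has a local scratch disk, write mdsio output there:
>  # mdsioLocalDir='/tmp/myrun/',
>  &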
>
> Martin
>
> > On 16 Mar 2015, at 14:58, Jonny Williams <Jonny.Williams at bristol.ac.uk>
> wrote:
> >
> > Thanks for this Dimitris
> >
> > The timings at the end of the STDOUT.0000 file are extremely useful and
> have allowed me to confirm that I/O is indeed the limiting factor in my
> runs at the moment.
> >
> > For example, in going from 48 (6x8) to 480 (15x32) processors, the time
> spent in the section called "DO_THE_MODEL_IO [FORWARD_STEP]" in STDOUT.0000
> increased from 3% to 67% of the total run time!
> >
> > I may well have to look into using the mdsio package again, unless there
> is a way around this NetCDF I/O issue?
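> >
> > If I do go back to mdsio, I think the switch is roughly the following
> > (a sketch only: I am assuming mnc is controlled by the usual flag in
> > data.pkg and that the binary output interval is set by dumpFreq in data;
> > the numbers are just placeholders):
> >
> >  # data.pkg
> >   &PACKAGES
> >    useMNC=.FALSE.,
> >   &
> >
> >  # data (PARM03): interval of plain-binary (mdsio) output, in seconds
> >   dumpFreq=2592000.,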
> >
> > Many thanks again
> >
> > Jonny
> >
> >
> >
> > On 5 March 2015 at 11:57, Menemenlis, Dimitris (329D) <
> Dimitris.Menemenlis at jpl.nasa.gov> wrote:
> > My personal prejudice (and it may be wrong): if you want efficient I/O,
> you need to get rid of the netcdf package; mdsio and its extensions are a
> lot more flexible and efficient. In any case there is no need to guess about
> the cause of the bottleneck; just look at the timings at the end of your
> STDOUT.0000 file.
> >
> > On Mar 5, 2015, at 3:20 AM, Jonny Williams <Jonny.Williams at bristol.ac.uk>
> wrote:
> >
> >> As a related question to this thread, is it possible to output one
> NetCDF file per stream (state*.nc, ptracers*.nc, etc) rather than one per
> process?
> >>
> >> I am currently running on ARCHER, the UK national supercomputing facility,
> and I am not getting the speed-up that I expect for a long job, whereas I
> did get the expected speed-up for a very short test job.
> >>
> >> I am thinking that the I/O may be a bottleneck here perhaps?
> >>
> >> Cheers!
> >>
> >> Jonny
> >>
> >> On 10 February 2015 at 07:38, Martin Losch <Martin.Losch at awi.de> wrote:
> >> Hi Jonny and others,
> >>
> >> I am not sure if I understand your question about "the utility of the
> overlap cells": the overlaps are filled with the values of the neighboring
> tiles so that you can compute terms of the model equations near the
> boundary; without the overlap you would not be able to evaluate any
> horizontal gradient or average at the domain boundary.
> >> The size of the overlap depends on the computational stencil that you
> want to use. A 2nd-order operation needs an overlap of 1, a 3rd-order
> operator needs an overlap of 2, and so forth. I think the model tells
> you when your choice of advection scheme requires more overlap than
> you have actually specified.
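> >>
> >> Just to make the rule of thumb concrete: in SIZE.h that choice is simply
> >> (a sketch which, by the rule above, I would expect to be sufficient for a
> >> 3rd-order advection operator)
> >>
> >>      &           OLx =   2,
> >>      &           OLy =   2,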
> >>
> >> Martin
> >>
> >> PS:
> >> Here’s my experience with scaling or not scaling (by no means are these
> absolute numbers or recommendations):
> >> As a rule of thumb, the MITgcm dynamics/thermodynamics kernel (various
> packages may behave differently) usually scales nearly linearly down to tile
> sizes (sNx * sNy) of about 30*30, where the overhead of the overlap relative
> to the domain size becomes unfavorable (because of too many local
> communications between individual tiles) and the global pressure solver
> takes its toll (because of global communications for which all processes
> have to wait). Below this tile size the time to solution still decreases
> with more processors, but ever more slowly, until the overhead costs more
> than the speedup gains. I am reiterating what Matt already wrote: it’s
> obvious that for a 30x30 tile the overlap contains nearly
> 2*(OLx*sNy+OLy*sNx) cells, so for an overlap of 2 you already have 8*30 =
> 240 cells in the overlap, more than one quarter of the cells in the
> interior. From this point of view a tile size of 2x2 with a 1-gridpoint
> overlap is totally inefficient.
> >> Further, it is probably better to have nearly quadratic tiles (so sNx ~
> sNy), except for vector machines, where you try to make sNx as large as
> possible (at least until you reach the maximum vector length of your
> machine).
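> >>
> >> To make that concrete for your grid (Nx = 450, Ny = 800 from the SIZE.h
> >> below), a more nearly quadratic decomposition on 480 processors would be
> >> something along these lines (a sketch only, I have not tested it):
> >>
> >>      &           sNx =  30,
> >>      &           sNy =  25,
> >>      &           nPx =  15,
> >>      &           nPy =  32,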
> >>
> >> In my experience you need to test this for every new computer that you
> have access to, to find out the range of processor counts that you can run
> with efficiently. For example, it may be more economical to use fewer
> processors and wait a little longer for the result, but have enough CPU time
> left to do a second run of the same type, than to spend all your CPU time on
> a run with twice as many processors that may finish faster, but not twice as
> fast because the linear-scaling limit has been reached.
> >>
> >> > On 09 Feb 2015, at 16:05, Jonny Williams <
> Jonny.Williams at bristol.ac.uk> wrote:
> >> >
> >> > Dear Angela, Matthew
> >> >
> >> > Thank you very much for your emails.
> >> >
> >> > For your information, I have now got round my initial problem with
> the NaNs by using a shorter timestep, although I don't know why this
> would have made much difference...
> >> >
> >> > Your discussion about the overlap parameters and run speed is of
> interest to me because I found that a decrease in timestep by a factor of 4
> and an increase in the number of processors by a factor of 10 resulted in
> an almost identical run speed!
> >> >
> >> > My SIZE.h parameters were as follows...
> >> >
> >> > PARAMETER (
> >> >      &           sNx =  75,
> >> >      &           sNy =  10,
> >> >      &           OLx =   4,
> >> >      &           OLy =   4,
> >> >      &           nSx =   1,
> >> >      &           nSy =   1,
> >> >      &           nPx =   6,
> >> >      &           nPy =   80,
> >> >      &           Nx  = sNx*nSx*nPx,
> >> >      &           Ny  = sNy*nSy*nPy,
> >> >      &           Nr  =   50)
> >> >
> >> > ... so (using the calculation from the earlier email) I have
> (4+75+4)*(4+10+4) = 1494 grid cells per process, of which 75*10/1494 ≈ 50%
> are cells that I care about.
> >> >
> >> > This is really good to know, but it got me to thinking: what is the
> utility of these overlap cells in the first place?
> >> >
> >> > Many thanks!
> >> >
> >> > Jonny
> >>



-- 
Dr Jonny Williams
School of Geographical Sciences
Cabot Institute
University of Bristol
BS8 1SS

+44 (0)117 3318352
jonny.williams at bristol.ac.uk
http://www.bristol.ac.uk/geography/people/jonny-h-williams