[MITgcm-support] results quite different depending on number of procs used
Camille Mazoyer
mazoyer at univ-tln.fr
Sat Apr 2 10:00:17 EDT 2016
Hello,
I ran my simulation on another cluster (case1 and case2), with ifort
12.01 + openmpi 1.4.5, but I still have these errors.
I am still investigating why my runs behave this way. When I have an
idea, I will write to the list.
Thanks for all your help, and see you next time,
Camille Mazoyer
On 21/03/2016 at 22:08, Camille Mazoyer wrote:
> Hello Jody,
> My mistake! I thought I needed to use subtiles because of this
> sentence: with a domain of 3008x32, if sNx=47 and nPx=8, then
> nSx=8 (maybe the author meant to write nPx=64 instead).
>
> I tried to reproduce the problem, using your set-up with sNx=47,nPx=8
> (sNy=16,nPy=2):
>
> So, without subtiles: I have tried two new simulations, with even nSx
> and nSy.
> case1: 160x150 domain, 10 procs with sNx=1, nSx=80, nPx=2 and sNy=1,
> nSy=30, nPy=5
> versus
> case2: 160x150 domain, 120 procs with sNx=1, nSx=20, nPx=8 and sNy=1,
> nSy=10, nPy=15
> Is this decomposition correct, with even nSx and nSy?
>
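As a side note, a decomposition can be sanity-checked arithmetically: the domain must satisfy Nx = sNx*nSx*nPx and Ny = sNy*nSy*nPy, with nPx*nPy processes in total. A minimal sketch (a hypothetical helper, not MITgcm code), using the case1/case2 numbers above:

```python
# Minimal sanity check for an MITgcm SIZE.h tile layout:
# the tiles must cover the domain exactly in each direction.
# (Illustrative helper only; MITgcm configures this in SIZE.h, in Fortran.)

def check_decomposition(Nx, Ny, sNx, nSx, nPx, sNy, nSy, nPy):
    """Return the total proc count if the layout covers the domain exactly."""
    assert sNx * nSx * nPx == Nx, "x-direction does not cover the domain"
    assert sNy * nSy * nPy == Ny, "y-direction does not cover the domain"
    return nPx * nPy

# case1 from the message above:
print(check_decomposition(160, 150, sNx=1, nSx=80, nPx=2, sNy=1, nSy=30, nPy=5))   # prints 10
# case2 from the message above:
print(check_decomposition(160, 150, sNx=1, nSx=20, nPx=8, sNy=1, nSy=10, nPy=15))  # prints 120
```

Both cases cover the 160x150 domain exactly, so in that arithmetic sense the decompositions are valid.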
> Even if the number of tiles is even, there are still significant
> differences, for example at time=5 min (as you can see in the attached
> file).
> I have to try ifort + openmpi on another cluster. I will be in touch as
> soon as I have the results.
>
> Thanks for all,
> Camille
>
>
> On 21/03/2016 at 16:45, Jody Klymak wrote:
>>> On Mar 21, 2016, at 6:25 AM, Camille Mazoyer <mazoyer at univ-tln.fr>
>>> wrote:
>>>
>>> Hi Jody,
>>> Thanks for the tip, but unfortunately it has not fixed my problem
>>> yet. It did help me think about compiler and MPI implementation
>>> issues, though.
>>> I changed my domain to:
>>> case1: 160x150 domain, with sNx=10, nSx=8, nPx=2, and
>>> sNy=15, nSy=2, nPy=5
>>> case2: 160x150 domain, with sNx=20, nSx=4, nPx=2, and
>>> sNy=15, nSy=2, nPy=5
>>> So for both, nSx is even-numbered. Is that what you meant?
>>> I ran these simulations on 10 procs (nPx*nPy procs).
>>
>> Why are you using subtiles? Also, I’d imagine the same error might
>> occur if sNy is odd, which it is.
>>
>> Cheers, Jody
>>
>>>
>>> ------ ifort----------------
>>> My ifort compiler is 14.0.1 20131008.
>>> Concerning the MPI implementation, when I check the mpiiofrt script,
>>> I see the lines:
>>>     if [[ ! -n ${MP_MPILIB} ]] ; then
>>>         # Pick MPICH2 as default library -- since this is the MPICH script.
>>>         export MP_MPILIB=mpich2
>>>     fi
>>> => So I think the MPI implementation is MPICH2. Is there another way
>>> to find out?
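A few common ways to identify which MPI implementation a wrapper uses (availability varies by installation): `mpif90 -show` (MPICH family) or `mpif90 --showme` (Open MPI) print the underlying compile line, and `mpirun --version` usually names the implementation in its first line. A portable probe, sketched here, just looks for implementation-specific tools on the PATH and falls back to "unknown":

```shell
# Hedged sketch: guess the MPI flavour from implementation-specific tools.
# ompi_info ships with Open MPI; mpichversion/mpich2version with MPICH/MPICH2.
flavor=unknown
if command -v ompi_info >/dev/null 2>&1; then
  flavor=openmpi
elif command -v mpichversion >/dev/null 2>&1 || command -v mpich2version >/dev/null 2>&1; then
  flavor=mpich
fi
echo "detected MPI flavour: $flavor"
```

This only detects what is on the PATH, which may differ from what a given wrapper script actually links against; the `-show`/`--showme` output of the wrapper itself is more authoritative.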
>>>
>>> Attached, you can see the differences in surface temperature at
>>> time=5 min and at time=1 h, with the ifort compiler.
>>> The differences really blow up.
>>>
>>> ------ gfortran----------------
>>> So I tried to compile my code with gfortran (4.4.7). I think the MPI
>>> implementation is MPICH2, but I'm not sure (I read the mpif90
>>> script).
>>> And the result is nearly the same: the differences blow up. At
>>> time=5 min, the ifort and gfortran results are the same, and at
>>> time=1 h, the results differ slightly between compilers.
>>>
>>>
>>> ------ sum up------------------
>>> To sum up, changing the compiler from ifort 14.01 to gfortran 4.4.7
>>> doesn't fix the problem. I'm not sure about the MPI implementation,
>>> so it might be the same one, and it might be the cause of my bug.
>>> I'm going to try my code on another machine just to see its
>>> behaviour. But I need to compute on the cluster where I get those
>>> problems, so I hope to find quickly what's wrong.
>>>
>>> Cheers,
>>> Camille
>>>
>>>
>>>
>>>
>>> On 18/03/2016 at 20:27, Jody Klymak wrote:
>>>> Hi Camille,
>>>>
>>>> I had trouble with tiles that had odd values for nSx. There was
>>>> some weird instability that seemed to be just a compiler issue.
>>>> Search on “mitgcm problem at tile bdys”. I think ifort was the
>>>> problem, which I note that you are using:
>>>>
>>>> My take-home was: always use even-numbered nSx...
>>>>
>>>> Cheers, Jody
>>>>
>>>> "One thing that strikes me is that my 128-core simulation was using
>>>> 47x16-sized tiles in a 3008x32 domain (i.e. 64x2 tiles). My
>>>> 64-core simulation, which is running fine right now, is 94x16 (same
>>>> domain, 32x2 tiles). Could it be that there is an even/odd
>>>> problem? I imagine most of the time folks do things in powers of
>>>> 2, or at least with even-sized tiles.”
>>>>
>>>> "Hi Jody,
>>>>
>>>> I tried to reproduce the problem, using your set-up with sNx=47,nPx=8
>>>> (sNy=16,nPy=2):
>>>> 1) it runs fine with no tile-edge problems on the first cluster,
>>>> using gfortran and openmpi (optfile: linux_amd64_gfortran);
>>>> 2) same good results on another cluster, using
>>>> ifort (10.0.025) and mpich-mx-1.2.7 (optfile:
>>>> linux_amd64_ifort_beagle)
>>>> 3) but when I try (on the first cluster) to use
>>>> ifort (13.0.0.079) and mvapich2-1.7 (optfile: linux_amd64_ifort11),
>>>> it blows up at iteration 68 and shows the same problems as in your
>>>> run. And with the same compiler/MPI/optfile, with sNx=48 it runs
>>>> fine (no problem).
>>>>
>>>> So, it looks like there is a problem with ifort 13.0 (it can't be
>>>> MPI, since you are using the Intel one and (3) fails with mpich2).
>>>> Will take a look at compiler options.”
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> On Mar 18, 2016, at 6:58 AM, Camille Mazoyer
>>>>> <mazoyer at univ-tln.fr> wrote:
>>>>>
>>>>> Hi Jean-Michel,
>>>>>
>>>>> Thank you very much for your reply, and sorry for my late
>>>>> response. I checked the different points below:
>>>>>
>>>>> On 07/03/2016 at 16:52, Jean-Michel Campin wrote:
>>>>>> Hi Camille,
>>>>>>
>>>>>> A few comments here:
>>>>>> 1) With the tile size reduced to sNx=20, sNy=10 (120 procs), it
>>>>>> likely will not scale as well (in part due to the increase in the
>>>>>> number of points when including the overlap). But it should work
>>>>>> as well as the 10-procs case.
>>>>>> 2) One thing you can check would be to compare, let's say,
>>>>>> an 80-procs case (sNx=20, sNy=15, nPx=8, nPy=10) with
>>>>>> a 10-procs case with the same tile size (sNx=20, sNy=15) but with
>>>>>> more tiles per proc (e.g., nSx=8, nSy=1, nPx=1, nPy=10).
>>>>>> These two cases should give identical results with a recent
>>>>>> version of the code (#define GLOBAL_SUM_ORDER_TILES, added on
>>>>>> Aug 25, 2015).
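The reason enforcing a fixed tile order in the global sum matters: floating-point addition is not associative, so the same partial sums combined in a processor-dependent order can round differently, and those last-bit differences can grow over a long integration. A generic Python illustration (not MITgcm code):

```python
# Floating-point addition is not associative: summing the same 64 tile
# partial sums in two different orders generally gives slightly
# different results, which is why a fixed summation order is needed
# for bit-reproducibility across decompositions.
import random

random.seed(1)
# 64 tile partial sums with widely varying magnitudes
tile_sums = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
             for _ in range(64)]

in_order = sum(tile_sums)          # fixed, reproducible tile order
shuffled = list(tile_sums)
random.shuffle(shuffled)           # a different "processor" order
reordered = sum(shuffled)

# The two sums agree only up to rounding; the gap is tiny but,
# in general, nonzero.
print("fixed order:", in_order)
print("shuffled   :", reordered)
print("difference :", in_order - reordered)
```

The difference is bounded by a few units of rounding in the largest terms, which is exactly the scale at which the 10-procs and 120-procs runs can start to diverge before the flow amplifies it.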
>>>>> You're right! I have exactly the same results.
>>>>>> 3) With different tile sizes, we expect small differences, but
>>>>>> in your case the differences seem quite large:
>>>>>> a) it could be that the flow regime is unstable, or the model
>>>>>> parameters are close to unstable, and then a small difference
>>>>>> grows with time;
>>>>>> b) or there is something not right with one of the 2 tile sizes.
>>>>>> I would suggest repeating 2 short runs (one for each case) but
>>>>>> turning off the compiler optimisation flag (e.g., -O0).
>>>>> I ran short simulations (time=1 hour) with -O0 and -O2. There are
>>>>> differences (e.g. in temperature) between -O0 and -O2 for each
>>>>> configuration.
>>>>> My flags are:
>>>>> - debug compilation: mpiifort -w95 -W0 -WB -convert big_endian
>>>>> -assume byterecl -fPIC -O0 -noalign -xW -ip -mp
>>>>> - standard compilation: mpiifort -w95 -W0 -WB -convert big_endian
>>>>> -assume byterecl -fPIC -O2 -align -xW -ip
>>>>> Configurations tested:
>>>>> - 10x1 procs
>>>>> - 10x1 procs (same tile size as 80 procs)
>>>>> - 8x10 procs
>>>>> - 8x15 procs (120 procs)
>>>>>
>>>>> When I compare two simulations (with no compiler optimisation), it
>>>>> appears that I still have some differences between the
>>>>> simulations, except for 10x1 versus 10x1 procs with the same tile
>>>>> size as 80 procs, which give the same results.
>>>>> I am sending you plots which show the differences in surface
>>>>> temperature after t=1 hour. As you can see in the attached plots,
>>>>> the differences are bigger between 10x1 vs 8x15 than between 10x1
>>>>> vs 8x10.
>>>>>
>>>>> For the simulations 10x1 vs 8x15 (compiled with -O0), an
>>>>> interesting thing is that after a time of only 5 min, differences
>>>>> in surface temperature appear in the south. Their shape is more or
>>>>> less lines: I checked, and these lines are right at the borders
>>>>> between 2 tiles (file: diff_10x1_8x15_5min_k130.gif).
>>>>>
>>>>> I see in CPP_EEOPTIONS.h that there are CPP keys for the MPI SUM.
>>>>> Could another CPP key from this file be useful for my problem?
>>>>> In fact, the main problem for me is that I don't know which run is
>>>>> the safer one. I'm afraid the safer runs are the less parallelized
>>>>> ones for the moment?
>>>>>
>>>>> Thank you,
>>>>> Camille
>>>>>
>>>>>> There have been reports of compiler optimisation problems that
>>>>>> only show up for some tile sizes but are just fine for others.
>>>>>>
>>>>>> Cheers,
>>>>>> Jean-Michel
>>>>>>
>>>>>> On Mon, Mar 07, 2016 at 11:29:11AM +0100, Camille Mazoyer wrote:
>>>>>>> Dear all,
>>>>>>>
>>>>>>> I ran two simulations of a configuration of the Mediterranean
>>>>>>> coast, near Toulon, France.
>>>>>>> The simulations are exactly the same except for the number of
>>>>>>> procs (10 procs for one run, 120 procs for the other). I only
>>>>>>> changed the file SIZE.h to change the number of procs.
>>>>>>> I know we can't expect exactly the same results, but I was very
>>>>>>> surprised to see the differences. After 5 days, for example, the
>>>>>>> maximum difference between the temperature fields is around
>>>>>>> 0.034. Have you ever seen such differences when changing the
>>>>>>> number of procs? Is this OK for you? If not, do you know where I
>>>>>>> might have made a mistake?
>>>>>>>
>>>>>>> In the attached files, you can see different plots comparing a
>>>>>>> run with 10 procs and a run with 120 procs:
>>>>>>> - the difference in temperature at the surface (k=kmax):
>>>>>>> diff_temp_kmax_5days.gif
>>>>>>> - the difference in the u field at the surface (k=kmax):
>>>>>>> diff_u_kmax_5days.gif
>>>>>>> - the difference in the v field at the surface (k=kmax):
>>>>>>> diff_v_kmax_5days.gif
>>>>>>> - I calculated the mean of the differences over the Nx*Ny*Nz
>>>>>>> domain and plotted it for each time: mean_diff_temp.gif
>>>>>>> (temperature), mean_diff_u.gif (u zonal), mean_diff_v.gif
>>>>>>> (v meridional).
>>>>>>> => The differences increase with time.
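The kind of comparison described above can be sketched as follows (generic Python with NumPy; the two fields here are random stand-ins for the model output, not actual MITgcm data):

```python
# Hedged sketch of the diagnostic described above: given the same
# 160x150x130 field from two runs, compute the max and domain-mean
# absolute difference. The fields are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
Nx, Ny, Nz = 160, 150, 130
temp_10procs = rng.normal(15.0, 1.0, (Nz, Ny, Nx))
# Second "run": same field plus a small perturbation standing in for
# decomposition-dependent rounding differences.
temp_120procs = temp_10procs + rng.normal(0.0, 1e-2, (Nz, Ny, Nx))

diff = np.abs(temp_10procs - temp_120procs)
print("max diff :", diff.max())
print("mean diff:", diff.mean())   # mean over the whole Nx*Ny*Nz domain
```

Plotting `diff[-1]` (the k=kmax level) against the tile boundaries, as done in the attached figures, is what makes the tile-edge pattern visible.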
>>>>>>>
>>>>>>>
>>>>>>> Number of points in the domain: Nx=160, Ny=150, Nz=130.
>>>>>>> Subdomains for 120 procs: sNx=20, sNy=10 points => Is that too
>>>>>>> small for a subdomain?
>>>>>>> Subdomains for 10 procs: sNx=160, sNy=15 points
>>>>>>>
>>>>>>>
>>>>>>> Thank you for your advice and ideas,
>>>>>>> Camille
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ------------------------------------------
>>>>>>> Camille Mazoyer
>>>>>>> Phd Student
>>>>>>> Mediterranean Institute of Oceanography (MIO)
>>>>>>> Institut de Mathématiques de Toulon (IMATH)
>>>>>>> Université de TOULON
>>>>>>> Bat X - CS 60584
>>>>>>> 83041 TOULON cedex 9
>>>>>>> France
>>>>>>> http://mio.pytheas.univ-amu.fr/
>>>>>>> http://imath.fr/
>>>>>>> _______________________________________________
>>>>>>> MITgcm-support mailing list
>>>>>>> MITgcm-support at mitgcm.org
>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>>>>>
>>>>> Attachments: diff_O2_10x1_vs_10x1tilesize80procs_l13_1h.gif,
>>>>> diff_O0_O2_10x1_tilesize80procs_l13_1h.gif,
>>>>> diff_O0_O2_10x1_l13_1h.gif,
>>>>> diff_O0_10x1_vs_10x1tilesize80procs_l13_1h.gif,
>>>>> diff_10x1_8x15_5min_k130.gif
>>>
>>> Attachments: diff_case1_vs_case2_t1h.png, diff_case1_vs_case_t5min.png
>>
>
>
>