[MITgcm-support] Results quite different depending on number of procs used

Camille Mazoyer mazoyer at univ-tln.fr
Sat Apr 2 10:00:17 EDT 2016


Hello,
I ran my simulations on another cluster (case1 and case2), with ifort 
12.01 + openmpi 1.4.5, but I still get these errors.
I am continuing to investigate why my runs behave this way. When I have 
an idea, I will write to the list.
Thanks for all your help, and see you next time,
Camille Mazoyer

On 21/03/2016 22:08, Camille Mazoyer wrote:
> Hello Jody,
> My mistake!  I thought I needed to use subtiles because of this 
> sentence from the earlier thread: as the domain is 3008x32, if sNx=47 
> and nPx=8, then nSx=8 (maybe the author meant to write nPx=64 instead).
>
> "I tried to reproduce the problem, using your set-up with sNx=47, nPx=8
> (sNy=16, nPy=2):"
>
> So, without subtiles: I have tried two new simulations, with even nSx 
> and nSy.
> case1: 160x150 domain, 10 procs, with sNx=1, nSx=80, nPx=2 and sNy=1, 
> nSy=30, nPy=5
> versus
> case2: 160x150 domain, 120 procs, with sNx=1, nSx=20, nPx=8 and sNy=1, 
> nSy=10, nPy=15
> Are these decompositions correct, with even nSx and nSy? (A SIZE.h 
> sketch for case1 follows below.)
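>
> For reference, here is a minimal sketch of the corresponding SIZE.h 
> parameter block for case1 (the overlap widths OLx/OLy=4 are my 
> assumption; Nr=130 is our vertical size; MITgcm requires 
> Nx = sNx*nSx*nPx and Ny = sNy*nSy*nPy):
>
>       PARAMETER (
>      &           sNx =   1,
>      &           sNy =   1,
>      &           OLx =   4,
>      &           OLy =   4,
>      &           nSx =  80,
>      &           nSy =  30,
>      &           nPx =   2,
>      &           nPy =   5,
>      &           Nx  = sNx*nSx*nPx,
>      &           Ny  = sNy*nSy*nPy,
>      &           Nr  = 130 )
>
> Both cases check out: case1 gives Nx=1*80*2=160 and Ny=1*30*5=150, and 
> case2 gives Nx=1*20*8=160 and Ny=1*10*15=150, so both decompositions 
> cover the 160x150 domain exactly.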
>
> Even with an even number of tiles, there are still important 
> differences, for example at time=5 min (as you can see in the attached 
> file).
> I still have to try ifort + openmpi on another cluster. I will be in 
> touch as soon as I have the results.
>
> Thanks for everything,
> Camille
>
>
> On 21/03/2016 16:45, Jody Klymak wrote:
>>> On Mar 21, 2016, at  6:25 AM, Camille Mazoyer <mazoyer at univ-tln.fr> 
>>> wrote:
>>>
>>> Hi Jody,
>>> Thanks for the tip; unfortunately it hasn't managed to fix my 
>>> problem yet, but it did help me think about compiler and MPI 
>>> implementation issues.
>>> I changed my domain decomposition to:
>>> case1: 160x150 domain, with sNx=10, nSx=8, nPx=2, and 
>>> sNy=15, nSy=2, nPy=5
>>> case2: 160x150 domain, with sNx=20, nSx=4, nPx=2, and 
>>> sNy=15, nSy=2, nPy=5
>>> So both have an even-numbered nSx. Is that what you meant?
>>> I ran these simulations on 10 procs (nPx*nPy procs).
>>
>> Why are you using subtiles?  Also, I’d imagine the same error might 
>> occur if sNy is odd, which it is.
>>
>> Cheers,   Jody
>>
>>>
>>> ------ ifort----------------
>>> My ifort compiler is 14.0.1 20131008.
>>> Concerning the MPI implementation, when I check the mpiiofrt script, 
>>> I see these lines:
>>> if [[ ! -n ${MP_MPILIB} ]] ; then
>>>    # Pick MPICH2 as default library -- since this is the MPICH script.
>>>    export MP_MPILIB=mpich2
>>> fi
>>> => So I think the MPI implementation is MPICH2. Is there another way 
>>> to identify the MPI implementation?
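>>>
>>> One other way I could try (a sketch, assuming the library supports 
>>> the MPI-3 call MPI_GET_LIBRARY_VERSION; an older MPICH2 may not have 
>>> it) is to ask the library itself:
>>>
>>>       PROGRAM MPIVER
>>> C     Print the MPI library version string, e.g. "MPICH ..." or
>>> C     "Open MPI ..." (MPI-3 call; may be absent from older MPICH2).
>>>       IMPLICIT NONE
>>>       INCLUDE 'mpif.h'
>>>       CHARACTER*(MPI_MAX_LIBRARY_VERSION_STRING) version
>>>       INTEGER resultlen, ierr
>>>       CALL MPI_INIT( ierr )
>>>       CALL MPI_GET_LIBRARY_VERSION( version, resultlen, ierr )
>>>       PRINT *, version(1:resultlen)
>>>       CALL MPI_FINALIZE( ierr )
>>>       END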
>>>
>>> In the attached files, you can see the differences between surface 
>>> temperatures at time=5 min and at time=1 h, with the ifort compiler. 
>>> The differences really blow up.
>>>
>>> ------ gfortran----------------
>>> So I tried to compile my code with gfortran (4.4.7). I think the MPI 
>>> implementation is MPICH2, but I'm not sure (I read the mpif90 script).
>>> And the result is nearly the same: the differences blow up. At time=5 
>>> min, the ifort and gfortran results are the same; at time=1 h, the 
>>> results differ slightly between compilers.
>>>
>>>
>>> ------ sum up------------------
>>> To sum up, changing the compiler from ifort 14.0.1 to gfortran 4.4.7 
>>> doesn't fix the problem. I'm not sure about the MPI implementation: 
>>> it might be the same in both builds, and it might be the cause of my 
>>> bug. I'm going to try my code on another machine just to see how it 
>>> behaves. But I need to compute on the cluster where I get these 
>>> problems, so I hope to find out quickly what's wrong.
>>>
>>> Cheers,
>>> Camille
>>>
>>>
>>>
>>>
>>> On 18/03/2016 20:27, Jody Klymak wrote:
>>>> Hi Camille,
>>>>
>>>> I had trouble with tiles that had odd values for nSx.  There was 
>>>> some weird instability that seemed to be just a compiler issue.  
>>>> Search for “mitgcm problem at tile bdys”.  I think ifort was the 
>>>> problem, which I note you are using:
>>>>
>>>> My take-home was: always use even-numbered nSx...
>>>>
>>>> Cheers,  Jody
>>>>
>>>> "One thing that strikes me is that my 128-core simulation was using 
>>>> 47x16-sized tiles in a 3008x32 domain (i.e. 64x2 tiles).  My 
>>>> 64-core simulation, which is running fine right now, is 94x16 (same 
>>>> domain, 32x2 tiles).  Could it be that there is an even/odd 
>>>> problem?  I imagine most of the time folks do things in powers of 
>>>> 2, or at least with even-sized tiles.”
>>>>
>>>> "Hi Jody,
>>>>
>>>> I tried to reproduce the problem, using your set-up with sNx=47,nPx=8
>>>> (sNy=16,nPy=2):
>>>> 1) it runs fine with no tile edges problem on the 1st cluster, using
>>>> gfortran and openmpi (optfile: linux_amd64_gfortran);
>>>> 2) same good results on another cluster, using
>>>> ifort (10.0.025) and mpich-mx-1.2.7 (optfile: 
>>>> linux_amd64_ifort_beagle);
>>>> 3) but when I try (on the 1st cluster) to use
>>>> ifort (13.0.0.079) and mvapich2-1.7 (optfile: linux_amd64_ifort11),
>>>> it blows up at iteration 68 and shows the same problems as in your 
>>>> run. And with the same compiler/mpi/optfile, but with sNx=48, it 
>>>> runs fine (no problem).
>>>>
>>>> So, it looks like there is a problem with ifort 13.0 (it can't be 
>>>> MPI, since you are using the Intel one and (3) fails with mvapich2).
>>>> Will take a look at compiler options.”
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> On Mar 18, 2016, at  6:58 AM, Camille Mazoyer 
>>>>> <mazoyer at univ-tln.fr> wrote:
>>>>>
>>>>> Hi Jean-Michel,
>>>>>
>>>>> Thank you very much for your reply, and sorry for the delay in mine.
>>>>> I checked the different points below:
>>>>>
>>>>> On 07/03/2016 16:52, Jean-Michel Campin wrote:
>>>>>> Hi Camille,
>>>>>>
>>>>>> A few comments here:
>>>>>> 1) With the tile size reduced to sNx=20, sNy=10 (120 procs), it is 
>>>>>>    likely that it will not scale as well (in part due to the 
>>>>>>    increase in the number of points once the overlap is included), 
>>>>>>    but it should work as well as the 10-proc case.
>>>>>> 2) One thing you can check would be to compare, let's say,
>>>>>>    an 80-proc case (sNx=20, sNy=15, nPx=8, nPy=10) with
>>>>>>    a 10-proc case with the same tile size (sNx=20, sNy=15) but 
>>>>>>    with more tiles per proc (e.g., nSx=8, nSy=1, nPx=1, nPy=10).
>>>>>>    These two cases should give identical results with a recent 
>>>>>>    version of the code (#define GLOBAL_SUM_ORDER_TILES, added on 
>>>>>>    Aug 25, 2015).
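>>>>>>    For reference, a sketch of the relevant line in 
>>>>>>    CPP_EEOPTIONS.h (the comment is my paraphrase of what the 
>>>>>>    option does; check the header in your own copy of the code):
>>>>>>
>>>>>> C     Cumulate the tile local sums in a fixed tile order, so that
>>>>>> C     the global sum does not depend on how the tiles are
>>>>>> C     distributed over the processes.
>>>>>> #define GLOBAL_SUM_ORDER_TILES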
>>>>> You're right! I have exactly the same results.
>>>>>> 3) With a different tile size we expect small differences, but in 
>>>>>>    your case the differences seem quite large:
>>>>>>    a) it could be that the flow regime is unstable, or the model 
>>>>>>    parameters are close to unstable, so that a small difference 
>>>>>>    grows with time;
>>>>>>    b) or there is something not right with one of the 2 tile 
>>>>>>    sizes. I would suggest repeating 2 short runs (one for each 
>>>>>>    case) but turning off the compiler optimisation flags (e.g., -O0).
>>>>> I ran short simulations (time=1 hour) with -O0 and -O2. There are 
>>>>> differences (e.g., in temperature) between -O0 and -O2 for each 
>>>>> configuration.
>>>>> My flags are:
>>>>> - debug compilation: mpiifort -w95 -W0 -WB -convert big_endian 
>>>>> -assume byterecl -fPIC -O0 -noalign -xW -ip -mp
>>>>> - standard compilation: mpiifort -w95 -W0 -WB -convert big_endian 
>>>>> -assume byterecl -fPIC -O2 -align -xW -ip
>>>>> Configurations tested:
>>>>> - 10x1 procs
>>>>> - 10x1 procs (same tile size as 80 procs)
>>>>> - 8x10 procs
>>>>> - 8x15 procs (120 procs)
>>>>>
>>>>> When I compare two simulations (with no compiler optimisation), it 
>>>>> appears that I still have some differences between the simulations, 
>>>>> except for 10x1 versus 10x1 procs with the same tile size as 80 
>>>>> procs, which give the same results.
>>>>> I am sending plots which show the differences in surface 
>>>>> temperature after t=1 hour. As you can see in the attached plots, 
>>>>> the differences are bigger between 10x1 vs 8x15 than between 10x1 
>>>>> vs 8x10.
>>>>>
>>>>> For the simulations 10x1 vs 8x15 (compiled with -O0), an 
>>>>> interesting thing is that after only 5 min, differences in surface 
>>>>> temperature appear in the south. Their shape is more or less lines; 
>>>>> I checked, and these lines lie exactly on the border between 2 
>>>>> tiles (file: diff_10x1_8x15_5min_k130.gif).
>>>>>
>>>>> I see in CPP_EEOPTIONS.h that there are CPP keys for the MPI SUM. 
>>>>> Could another CPP key from this file be useful for my problem?
>>>>> In fact, the main problem for me is that I don't know which run is 
>>>>> the safer one. I'm afraid that, for the moment, the safer runs are 
>>>>> the less parallelized ones?
>>>>>
>>>>> Thank you,
>>>>> Camille
>>>>>
>>>>>>    There have been reports of compiler optimisation problems that 
>>>>>>    only show up for some tile sizes but are just fine for others.
>>>>>>
>>>>>> Cheers,
>>>>>> Jean-Michel
>>>>>>
>>>>>> On Mon, Mar 07, 2016 at 11:29:11AM +0100, Camille Mazoyer wrote:
>>>>>>> Dear all,
>>>>>>>
>>>>>>> I ran two simulations of a configuration of the Mediterranean 
>>>>>>> coast, near Toulon, France.
>>>>>>> The simulations are exactly the same except for the number of 
>>>>>>> procs (10 procs for one run, 120 procs for the other). I only 
>>>>>>> change the file SIZE.h to change the number of procs.
>>>>>>> I know we can't expect to get exactly the same results, but I was
>>>>>>> very surprised to see the differences. After 5 days, for example,
>>>>>>> the max of the differences between the temperature fields is 
>>>>>>> around 0.034.
>>>>>>> Have you ever seen such differences when changing the number of 
>>>>>>> procs? Is this OK for you? If not, do you know where I might have 
>>>>>>> made a mistake?
>>>>>>>
>>>>>>> In the attached files, you can see different plots comparing a 
>>>>>>> run with 10 procs and a run with 120 procs:
>>>>>>> - the difference in temperature at the surface (k=kmax):
>>>>>>> diff_temp_kmax_5days.gif
>>>>>>> - the difference in the u field at the surface (k=kmax): 
>>>>>>> diff_u_kmax_5days.gif
>>>>>>> - the difference in the v field at the surface (k=kmax): 
>>>>>>> diff_v_kmax_5days.gif
>>>>>>> - I calculate the mean of the differences over the domain 
>>>>>>> Nx*Ny*Nz (a sketch of this computation follows below), and I plot 
>>>>>>> it for each time: mean_diff_temp.gif (temperature), 
>>>>>>> mean_diff_u.gif (u zonal), mean_diff_v.gif (v meridional).
>>>>>>> => The differences increase with time.
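>>>>>>>
>>>>>>> The mean is computed as in this minimal sketch (here with the 
>>>>>>> absolute value; the arrays are left unfilled, and in my real 
>>>>>>> script the two temperature fields are read from each run's 
>>>>>>> model output):
>>>>>>>
>>>>>>>       PROGRAM MEANDIF
>>>>>>> C     Domain-mean absolute difference between two runs' 3-D fields.
>>>>>>>       IMPLICIT NONE
>>>>>>>       INTEGER Nx, Ny, Nz
>>>>>>>       PARAMETER ( Nx=160, Ny=150, Nz=130 )
>>>>>>>       REAL*8 t10(Nx,Ny,Nz), t120(Nx,Ny,Nz)
>>>>>>>       REAL*8 dsum
>>>>>>>       INTEGER i, j, k
>>>>>>> C     ... fill t10 (10-proc run) and t120 (120-proc run) here ...
>>>>>>>       dsum = 0.D0
>>>>>>>       DO k = 1, Nz
>>>>>>>        DO j = 1, Ny
>>>>>>>         DO i = 1, Nx
>>>>>>>          dsum = dsum + ABS( t10(i,j,k) - t120(i,j,k) )
>>>>>>>         ENDDO
>>>>>>>        ENDDO
>>>>>>>       ENDDO
>>>>>>>       PRINT *, 'mean |diff| =', dsum/DBLE(Nx*Ny*Nz)
>>>>>>>       END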
>>>>>>>
>>>>>>>
>>>>>>> Number of points in the domain: Nx=160, Ny=150, Nz=130.
>>>>>>> Subdomains for 120 procs: sNx=20, sNy=10 points => Is that too 
>>>>>>> small for a subdomain?
>>>>>>> Subdomains for 10 procs: sNx=160, sNy=15 points
>>>>>>>
>>>>>>>
>>>>>>> Thank you for your advice and ideas,
>>>>>>> Camille
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> [Attachments: diff_O2_10x1_vs_10x1tilesize80procs_l13_1h.gif, 
>>>>> diff_O0_O2_10x1_tilesize80procs_l13_1h.gif, 
>>>>> diff_O0_O2_10x1_l13_1h.gif, 
>>>>> diff_O0_10x1_vs_10x1tilesize80procs_l13_1h.gif, 
>>>>> diff_10x1_8x15_5min_k130.gif]
>>>
>>> [Attachments: diff_case1_vs_case2_t1h.png, diff_case1_vs_case_t5min.png]
>>
>
>
>

-- 
------------------------------------------
Camille Mazoyer
PhD Student
Mediterranean Institute of Oceanography (MIO)
Institut de Mathématiques de Toulon (IMATH)
Université de TOULON
Bat X - CS 60584
83041 TOULON cedex 9
France
http://mio.pytheas.univ-amu.fr/
http://imath.fr/
