[MITgcm-support] results quite different depending on number of procs used

Jody Klymak jklymak at uvic.ca
Mon Mar 21 11:45:41 EDT 2016


> On Mar 21, 2016, at  6:25 AM, Camille Mazoyer <mazoyer at univ-tln.fr> wrote:
> 
> Hi Jody,
> Thanks for the tip, but unfortunately it hasn't fixed my problem yet. It did help me think about compiler and MPI implementation issues, though.
> I changed my domain decomposition to:
> case1: 160x150 domain, with sNx=10, nSx=8, nPx=2, and sNy=15,nSy=2,nPy=5
> case2: 160x150 domain, with sNx=20, nSx=4, nPx=2, and sNy=15,nSy=2,nPy=5
> So for both cases, sNx and nSx are even. Is that what you meant?
> I ran these simulations on 10 procs (nPx*nPy procs).


Why are you using subtiles?  Also, I’d imagine the same error might occur if sNy is odd, which it is.
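
If it helps, a quick way to sanity-check a decomposition before running is a few lines of shell (a minimal sketch using your case-1 numbers; in SIZE.h the domain must satisfy Nx = sNx*nSx*nPx and Ny = sNy*nSy*nPy):

  # verify the tiling covers the 160x150 domain exactly
  sNx=10; nSx=8; nPx=2    # tile width, subtiles per proc, procs in x
  sNy=15; nSy=2; nPy=5    # tile height, subtiles per proc, procs in y
  echo "Nx = $((sNx*nSx*nPx))  (want 160)"
  echo "Ny = $((sNy*nSy*nPy))  (want 150)"
  echo "nprocs = $((nPx*nPy))"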

Cheers,   Jody

> 
> 
> ------ ifort----------------
> My ifort compiler is 14.0.1 20131008.
> Concerning the MPI implementation: when I check the mpiifort script, I see the lines:
> if [[ ! -n ${MP_MPILIB} ]] ; then
>   # Pick MPICH2 as default library -- since this is the MPICH script.
>   export MP_MPILIB=mpich2
> fi
> => So I think the MPI implementation is MPICH2. Is there another way to identify which MPI implementation is used?
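> One way I can think of to check (a sketch, assuming the usual MPICH/Open MPI command-line tools are installed on the cluster; names vary with the vendor stack):
> 
>   mpif90 -show      # MPICH-family wrappers print the real compile/link line
>   mpif90 -showme    # Open MPI's spelling of the same option
>   mpichversion      # small utility shipped with MPICH/MPICH2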
> 
> Attached, you can see the surface temperature differences at time=5 min and at time=1 h, with the ifort compiler. The differences really blow up.
> 
> ------ gfortran----------------
> So I tried compiling my code with gfortran (4.4.7). I think the MPI implementation is MPICH2, but I'm not sure (I read the mpif90 script).
> And the result is nearly the same: the differences blow up. At time=5 min the ifort and gfortran results are identical, and at time=1 h the results from the two compilers are slightly different.
> 
> 
> ------ sum up------------------
> To sum up, changing the compiler from ifort 14.0.1 to gfortran 4.4.7 doesn't fix the problem. I'm not sure about the MPI implementation, so it might be the same in both builds, and it might be the cause of my bug. I'm going to try my code on another machine just to see how it behaves. But I need to compute on the cluster where I get these problems, so I hope to find out quickly what's wrong.
> 
> Cheers,
> Camille
> 
> 
> 
> 
> On 18/03/2016 20:27, Jody Klymak wrote:
>> Hi Camille,
>> 
>> I had trouble with tiles that had odd values for sNx.  There was some weird instability that seemed to be just a compiler issue.  Search on “mitgcm problem at tile bdys”.  I think ifort was the problem, which I note you are using:
>> 
>> My take-home was to always use an even-numbered sNx...
>> 
>> Cheers,  Jody
>> 
>> "One thing that strikes me is that my 128-core simulation was using 47x16-sized tiles in a 3008x32 domain (i.e. 64x2 tiles).  My 64-core simulation, which is running fine right now, is 94x16 (same domain, 32x2 tiles).  Could it be that there is an even/odd problem?  I imagine most of the time folks do things in powers of 2, or at least with even-sized tiles.”
>> 
>> "Hi Jody,
>> 
>> I tried to reproduce the problem, using your set-up with sNx=47,nPx=8
>> (sNy=16,nPy=2):
>> 1) it runs fine with no tile-edge problems on the first cluster, using
>> gfortran and openmpi (optfile: linux_amd64_gfortran);
>> 2) same good results on another cluster, using
>> ifort (10.0.025) and mpich-mx-1.2.7 (optfile: linux_amd64_ifort_beagle)
>> 3) but when I try (on the first cluster) to use
>> ifort (13.0.0.079) and mvapich2-1.7 (optfile: linux_amd64_ifort11)
>> it blows up at iteration 68 and shows the same problems as in your run.
>> And with the same compiler/MPI/optfile but sNx=48, it runs fine (no problem).
>> 
>> So, it looks like there is a problem with ifort 13.0 (it can't be MPI,
>> since you are using the Intel one and (3) fails with mvapich2).
>> Will take a look at compiler options.”
>> 
>> 
>> 
>> 
>> 
>> 
>>> On Mar 18, 2016, at  6:58 AM, Camille Mazoyer <mazoyer at univ-tln.fr> wrote:
>>> 
>>> Hi Jean-Michel,
>>> 
>>> Thank you very much for your reply, and sorry for the delay in mine.
>>> I checked the different points below:
>>> 
>>> On 07/03/2016 16:52, Jean-Michel Campin wrote:
>>>> Hi Camille,
>>>> 
>>>> A few comments here:
>>>> 1) With tile size reduced to sNx=20, sNy=10 (120 procs), it's likely that
>>>>   it will not scale as well (in part due to the increase in the number of
>>>>   points when including the overlap). But it should work as well as the
>>>>   10-proc case.
>>>> 2) One thing you can check would be to compare, let's say,
>>>>   an 80-proc case (sNx=20, sNy=15, nPx=8, nPy=10) with
>>>>   a 10-proc case with the same tile size (sNx=20, sNy=15) but with more
>>>>   tiles per proc (e.g., nSx=8, nSy=1, nPx=1, nPy=10).
>>>>   These two cases should give identical results with a recent version of
>>>>   the code (#define GLOBAL_SUM_ORDER_TILES, added on Aug 25, 2015).
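>>>>   (To check that your source has this key, something like the following
>>>>   should find it -- a sketch assuming the standard layout of the code:
>>>>     grep -n GLOBAL_SUM_ORDER_TILES eesupp/inc/CPP_EEOPTIONS.h
>>>>   no match would suggest your checkout predates the change.)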
>>> You're right! I have exactly the same results.
>>>> 3) With different tile sizes, we expect small differences, but in your
>>>>   case the differences seem quite large:
>>>>   a) it could be that the flow regime is unstable, or the model parameters
>>>>    are close to instability, and then a small difference grows with time.
>>>>   b) or there is something not right with one of the 2 tile sizes. I would
>>>>   suggest repeating the 2 short runs (one for each case) but turning off
>>>>   the compiler optimisation flags (e.g., -O0).
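>>>>   One way to do that (a sketch; in most optfiles genmake2's "-ieee" switch
>>>>   replaces the optimised flags with safe -O0/IEEE ones -- adjust paths and
>>>>   the optfile name to your setup):
>>>>     cd build
>>>>     ../tools/genmake2 -mpi -ieee -of=../tools/build_options/linux_amd64_ifort11
>>>>     make depend && make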
>>> I ran short simulations (time=1 hour) with -O0 and -O2. There are differences (e.g., in temperature) between -O0 and -O2 for each configuration.
>>> My flags are:
>>> - debug compilation: mpiifort -w95 -W0 -WB -convert big_endian -assume byterecl -fPIC -O0 -noalign -xW -ip -mp
>>> - standard compilation: mpiifort -w95 -W0 -WB -convert big_endian -assume byterecl -fPIC -O2 -align -xW -ip
>>> Configurations tested:
>>> - 10x1 procs
>>> - 10x1 procs (same tile size as 80 procs)
>>> - 8x10 procs
>>> - 8x15 procs (120 procs)
>>> 
>>> When I compare two simulations (with no compiler optimisation), it appears that I still have some differences between them, except for 10x1 versus 10x1 procs with the same tile size as 80 procs, which give identical results.
>>> I'm sending plots which show the differences in surface temperature after t=1 hour. As you can see in the attached plots, the differences are bigger between 10x1 vs 8x15 than between 10x1 vs 8x10.
>>> 
>>> For the 10x1 vs 8x15 simulations (compiled with -O0), an interesting thing is that after only 5 min, surface temperature differences appear in the south. Their shape is more or less lines; I checked, and these lines lie exactly on the border between 2 tiles (file: diff_10x1_8x15_5min_k130.gif).
>>> 
>>> I see in CPP_EEOPTIONS.h that there are CPP keys for the MPI global sum. Could another CPP key from this file be useful for my problem?
>>> In fact, the main problem for me is that I don't know which run is the safer one. I'm afraid the safer ones are the less parallelized runs, for the moment?
>>> 
>>> Thank you,
>>> Camille
>>> 
>>>>   There have been reports of compiler optimisation problems that only show
>>>>   up for some tile sizes but are just fine for others.
>>>> 
>>>> Cheers,
>>>> Jean-Michel
>>>> 
>>>> On Mon, Mar 07, 2016 at 11:29:11AM +0100, Camille Mazoyer wrote:
>>>>> Dear all,
>>>>> 
>>>>> I ran two simulations of a configuration of the Mediterranean coast,
>>>>> near Toulon, France.
>>>>> The simulations are exactly the same except the number of procs (10
>>>>> procs for one run, 120 procs for the other run). I only changed the
>>>>> file SIZE.h to change the number of procs.
>>>>> I know we can't expect to get exactly the same results, but I was
>>>>> very surprised to see the differences. After 5 days, for example,
>>>>> the max of the differences between the temperature fields is around 0.034.
>>>>> Have you ever seen such differences when changing the number of procs?
>>>>> Does this look ok to you? If not, do you know where I might have made a
>>>>> mistake?
>>>>> 
>>>>> In the attached files, you can see different plots comparing a run
>>>>> with 10 procs and a run with 120 procs:
>>>>> - the difference of temperature at the surface (k=kmax) :
>>>>> diff_temp_kmax_5days.gif
>>>>> - the difference of u field at the surface (k=kmax) : diff_u_kmax_5days.gif
>>>>> - the difference of v field at the surface (k=kmax) : diff_v_kmax_5days.gif
>>>>> - I calculated the mean of the differences over the domain Nx*Ny*Nz, and
>>>>> plotted it for each time: mean_diff_temp.gif (temperature),
>>>>> mean_diff_u.gif (zonal u), mean_diff_v.gif (meridional v).
>>>>> =>>>> Differences increase with time.
>>>>> 
>>>>> 
>>>>> Number of points on the domain: Nx=160, Ny=150, Nz=130.
>>>>> Subdomains for 120 procs: sNx=20, sNy=10 points  => Is that too small
>>>>> for a subdomain?
>>>>> Subdomains for  10 procs: sNx=160, sNy=15 points
>>>>> 
>>>>> 
>>>>> Thank you for your advice and ideas,
>>>>> Camille
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> ------------------------------------------
>>>>> Camille Mazoyer
>>>>> Phd Student
>>>>> Mediterranean Institute of Oceanography (MIO)
>>>>> Institut de Mathématiques de Toulon (IMATH)
>>>>> Université de TOULON
>>>>> Bat X - CS 60584
>>>>> 83041 TOULON cedex 9
>>>>> France
>>>>> http://mio.pytheas.univ-amu.fr/
>>>>> http://imath.fr/
>>> -- 
>>> ------------------------------------------
>>> Camille Mazoyer
>>> Phd Student
>>> Mediterranean Institute of Oceanography (MIO)
>>> Institut de Mathématiques de Toulon (IMATH)
>>> Université de TOULON
>>> Bat X - CS 60584
>>> 83041 TOULON cedex 9
>>> France
>>> http://mio.pytheas.univ-amu.fr/
>>> http://imath.fr/
>>> 
>>> Attachments: <diff_O2_10x1_vs_10x1tilesize80procs_l13_1h.gif>, <diff_O0_O2_10x1_tilesize80procs_l13_1h.gif>, <diff_O0_O2_10x1_l13_1h.gif>, <diff_O0_10x1_vs_10x1tilesize80procs_l13_1h.gif>, <diff_10x1_8x15_5min_k130.gif>
>> 
> 
> -- 
> ------------------------------------------
> Camille Mazoyer
> Phd Student
> Mediterranean Institute of Oceanography (MIO)
> Institut de Mathématiques de Toulon (IMATH)
> Université de TOULON
> Bat X - CS 60584
> 83041 TOULON cedex 9
> France
> http://mio.pytheas.univ-amu.fr/
> http://imath.fr/
> 
> Attachments: <diff_case1_vs_case2_t1h.png>, <diff_case1_vs_case_t5min.png>




More information about the MITgcm-support mailing list