[MITgcm-support] results quite different depending on the number of procs used

Jody Klymak jklymak at uvic.ca
Fri Mar 18 15:27:00 EDT 2016


Hi Camille,

I had trouble with tiles that had odd values for sNx.  There was some weird instability that seemed to be just a compiler issue; search on “mitgcm problem at tile bdys”.  I think ifort, which I note you are using, was the problem: 

My take-home: always use even-numbered sNx...
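
If it helps, here is the kind of quick sanity check I run on a SIZE.h decomposition before submitting a job (a hypothetical Python helper, not part of MITgcm):

    # hypothetical helper, not part of MITgcm
    def check_decomposition(Nx, Ny, sNx, sNy, nSx=1, nSy=1, nPx=1, nPy=1):
        """Verify the tiles exactly cover the Nx x Ny domain (SIZE.h convention)."""
        assert sNx * nSx * nPx == Nx, 'x tiling does not cover Nx'
        assert sNy * nSy * nPy == Ny, 'y tiling does not cover Ny'
        if sNx % 2 or sNy % 2:
            print('warning: odd tile size (sNx=%d, sNy=%d)' % (sNx, sNy))

    check_decomposition(3008, 32, sNx=47, sNy=16, nPx=64, nPy=2)  # warns: sNx=47 is odd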

Cheers,  Jody

"One thing that strikes me is that my 128-core simulation was using 47x16-sized tiles in a 3008x32 domain (i.e. 64x2 tiles).  My 64-core simulation, which is running fine right now, is 94x16 (same domain, 32x2 tiles).  Could it be that there is an even/odd problem?  I imagine most of the time folks do things in powers of 2, or at least with even-sized tiles.”

"Hi Jody,

I tried to reproduce the problem, using your set-up with sNx=47,nPx=8
(sNy=16,nPy=2):
1) it runs fine with no tile-edge problem on the first cluster, using
gfortran and openmpi (optfile: linux_amd64_gfortran);
2) same good results on another cluster, using
ifort (10.0.025) and mpich-mx-1.2.7 (optfile: linux_amd64_ifort_beagle);
3) but when I try (on the first cluster) to use
ifort (13.0.0.079) and mvapich2-1.7 (optfile: linux_amd64_ifort11),
it blows up at iteration 68 and shows the same problems as in your run.
And with the same compiler/MPI/optfile but sNx=48, it runs fine (no problem).

So it looks like there is a problem with ifort 13.0 (it can't be MPI,
since you are using the Intel one and (3) fails with mvapich2).
Will take a look at compiler options.”

> On Mar 18, 2016, at  6:58 AM, Camille Mazoyer <mazoyer at univ-tln.fr> wrote:
> 
> Hi Jean-Michel,
> 
> Thank you very much for your reply, and sorry for my delayed answer.
> I address the different points below:
> 
> On 07/03/2016 at 16:52, Jean-Michel Campin wrote:
>> Hi Camille,
>> 
>> A few comments here:
>> 1) With the tile size reduced to sNx=20, sNy=10 (120 procs), it will likely
>>   not scale as well (in part due to the increased number of points once the
>>   overlap is included). But it should work as well as the 10-procs case.
>> 2) One thing you can check would be to compare, let's say,
>>   an 80-procs case (sNx=20, sNy=15, nPx=8, nPy=10) with
>>   a 10-procs case with the same tile size (sNx=20, sNy=15) but with more
>>   tiles per proc (e.g., nSx=8, nSy=1, nPx=1, nPy=10).
>>   These two cases should give identical results with a recent version of the
>>   code (#define GLOBAL_SUM_ORDER_TILES, added on Aug 25, 2015).
> You're right! I have exactly the same results.
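> 
> (Indeed, that makes sense: with GLOBAL_SUM_ORDER_TILES the global sum is
> accumulated in a fixed tile order, and in floating point the order of a sum
> changes the result. A toy illustration, with made-up numbers:)
> 
>     vals = [1.0e16, 1.0, -1.0e16, 1.0]        # made-up numbers
>     print(sum(vals))                          # 1.0 : the first 1.0 is absorbed by 1e16
>     print(sum([1.0, 1.0, 1.0e16, -1.0e16]))   # 2.0 : small terms added first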
>> 3) With different tile sizes we expect small differences, but in your
>>   case the differences seem quite large:
>>   a) it could be that the flow regime is unstable, or the model parameters
>>    are close to instability, and then a small difference grows with time;
>>   b) or there is something not right with one of the 2 tile sizes. I would
>>   suggest repeating 2 short runs (one for each case) but turning off the
>>   compiler optimisation flags (e.g., -O0).
> I ran short simulations (time = 1 hour) with -O0 and -O2. There are differences (e.g. in temperature) between -O0 and -O2 for each configuration.
> My flags are:
> - debug compilation: mpiifort -w95 -W0 -WB -convert big_endian -assume byterecl -fPIC -O0 -noalign -xW -ip -mp
> - standard compilation: mpiifort -w95 -W0 -WB -convert big_endian -assume byterecl -fPIC -O2 -align -xW -ip
> Configurations tested:
> - 10x1 procs
> - 10x1 procs (same tile size as 80 procs)
> - 8x10 procs
> - 8x15 procs (120 procs)
> 
> When I compare two simulations (with no compiler optimisation), it appears that I still have some differences between them, except for 10x1 versus 10x1 with the same tile size as 80 procs, which give the same results.
> I send you plots which show the differences in surface temperature after t = 1 hour. As you can see in the attached plots, the differences are bigger between 10x1 vs 8x15 than between 10x1 vs 8x10.
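> 
> (For reference, this is roughly how I compute the difference fields, assuming
> standard MDS output and the MITgcmutils Python package; the paths and the
> iteration number are just illustrative:)
> 
>     import numpy as np
>     from MITgcmutils import rdmds      # shipped with MITgcm under utils/python
> 
>     t_a = rdmds('run_10x1/T', 720)     # temperature snapshots at the same iteration
>     t_b = rdmds('run_8x15/T', 720)
>     diff = t_a - t_b                   # shape (Nz, Ny, Nx)
>     print('max  |dT|:', np.abs(diff).max())
>     print('mean |dT|:', np.abs(diff).mean())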
> 
> For the simulations 10x1 vs 8x15 (compiled with -O0), an interesting thing is that after only 5 minutes, differences in surface temperature appear in the south. Their shape is more or less lines: I checked, and these lines lie exactly on the border between 2 tiles (file: diff_10x1_8x15_5min_k130.gif).
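> 
> (To confirm that, I compare the rows with the largest differences against
> multiples of the tile height; a quick hypothetical check, reusing diff from
> the snippet above:)
> 
>     sNy = 10                                 # tile height in the 8x15 run
>     k = 129                                  # level k=130, 0-based
>     rowmax = np.abs(diff[k]).max(axis=1)     # max |dT| along each y-row
>     for j in np.argsort(rowmax)[-5:]:        # the five rows with largest diffs
>         print(j, 'tile edge' if j % sNy in (0, sNy - 1) else 'interior')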
> 
> I see in CPP_EEOPTIONS.h that there are CPP keys for the MPI SUM. Could another CPP key from this file be useful for my problem?
> In fact, the main problem for me is that I don't know which run is the safer one. I'm afraid that, for the moment, the safer runs are the less parallelized ones?
> 
> Thank you,
> Camille
> 
>>   There have been reports of compiler optimisation problems that only show up
>>   for some tile sizes but are just fine for others.
>> 
>> Cheers,
>> Jean-Michel
>> 
>> On Mon, Mar 07, 2016 at 11:29:11AM +0100, Camille Mazoyer wrote:
>>> Dear all,
>>> 
>>> I ran two simulations of a configuration of the Mediterranean coast,
>>> near Toulon, France.
>>> The simulations are exactly the same except for the number of procs (10
>>> procs for one run, 120 procs for the other). I only changed the
>>> file SIZE.h to change the number of procs.
>>> I know we can't expect exactly the same results, but I was
>>> very surprised to see the differences. After 5 days, for example,
>>> the max of the differences between the temperature fields is around 0.034.
>>> Have you ever seen such differences when changing the number of procs?
>>> Is this OK for you? If not, do you know where I might have made a
>>> mistake?
>>> 
>>> In the attached files, you can see different plots comparing a run
>>> with 10 procs and a run with 120 procs:
>>> - the difference in temperature at the surface (k=kmax):
>>> diff_temp_kmax_5days.gif
>>> - the difference in the u field at the surface (k=kmax): diff_u_kmax_5days.gif
>>> - the difference in the v field at the surface (k=kmax): diff_v_kmax_5days.gif
>>> - I calculated the mean of the differences over the Nx*Ny*Nz domain, and I
>>> plot it for each time: mean_diff_temp.gif (temperature),
>>> mean_diff_u.gif (zonal u), mean_diff_v.gif (meridional v).
>>> =>>>> The differences increase with time.
>>> 
>>> 
>>> Number of points in the domain: Nx=160, Ny=150, Nz=130.
>>> Subdomains for 120 procs: sNx=20, sNy=10 points => Is that too small
>>> for a subdomain?
>>> Subdomains for 10 procs: sNx=160, sNy=15 points.
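>>> 
>>> (Both decompositions cover the domain; assuming one tile per proc,
>>> i.e. nSx=nSy=1, the nPx and nPy below are inferred from that:)
>>> 
>>>     Nx, Ny = 160, 150
>>>     assert 20 * 8 == Nx and 10 * 15 == Ny    # 120 procs: nPx=8,  nPy=15
>>>     assert 160 * 1 == Nx and 15 * 10 == Ny   # 10 procs:  nPx=1,  nPy=10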
>>> 
>>> 
>>> Thank you for your advice and ideas,
>>> Camille
>>> 
>>> 
>>> 
>>> -- 
>>> ------------------------------------------
>>> Camille Mazoyer
>>> Phd Student
>>> Mediterranean Institute of Oceanography (MIO)
>>> Institut de Mathématiques de Toulon (IMATH)
>>> Université de TOULON
>>> Bat X - CS 60584
>>> 83041 TOULON cedex 9
>>> France
>>> http://mio.pytheas.univ-amu.fr/
>>> http://imath.fr/
>> 
> 
> -- 
> ------------------------------------------
> Camille Mazoyer
> Phd Student
> Mediterranean Institute of Oceanography (MIO)
> Institut de Mathématiques de Toulon (IMATH)
> Université de TOULON
> Bat X - CS 60584
> 83041 TOULON cedex 9
> France
> http://mio.pytheas.univ-amu.fr/
> http://imath.fr/
> 
> <diff_O2_10x1_vs_10x1tilesize80procs_l13_1h.gif> <diff_O0_O2_10x1_tilesize80procs_l13_1h.gif> <diff_O0_O2_10x1_l13_1h.gif> <diff_O0_10x1_vs_10x1tilesize80procs_l13_1h.gif> <diff_10x1_8x15_5min_k130.gif>




More information about the MITgcm-support mailing list