[MITgcm-devel] SIZE.h matters on Columbia for CS510?
Hong Zhang
hong.zhang at caltech.edu
Thu Jul 10 17:38:55 EDT 2008
Another interesting point for the CS510 runs:
The size of each tile has a significant impact on performance.
We have two 450-cpu runs.
One is configured as follows (in SIZE.h):
& sNx = 34,
& sNy = 34,
& OLx = 8,
& OLy = 8,
& nSx = 3,
& nSy = 1,
& nPx = 450,
& nPy = 1
Our estimate of the per-cpu burden, i.e. the total tile area including
overlaps divided by the interior area, is (sNx+2*OLx)*(sNy+2*OLy)/(sNx*sNy),
i.e. 2.16.
The other set is
& sNx = 17,
& sNy = 51,
& OLx = 8,
& OLy = 8,
& nSx = 4,
& nSy = 1,
& nPx = 450,
& nPy = 1,
which gives a per-cpu burden of about 2.55.
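As a quick sanity check of these burden numbers, here is a small Python
sketch (Python rather than Fortran, purely for the arithmetic; the tile
sizes and overlaps are the ones from the two SIZE.h excerpts above):

def burden(sNx, sNy, OLx=8, OLy=8):
    # total tile area including overlap halos / interior (useful) area
    return (sNx + 2 * OLx) * (sNy + 2 * OLy) / (sNx * sNy)

b1 = burden(34, 34)          # first run,  34x34 tiles -> ~2.16
b2 = burden(17, 51)          # second run, 17x51 tiles -> ~2.55
print(b1, b2, 1 - b1 / b2)   # ~2.16, ~2.55, ~0.15 -> first run ~15% cheaper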
The first run is therefore expected to be about 15% faster than the second.
But the real runs show that the first run is faster by much more than 15%.
Look at the time stamps for the first run:
09:59 THETA.0000002232.data
11:16 THETA.0000004320.data
12:49 THETA.0000006552.data
14:05 THETA.0000008712.data
about 82 minutes per monthly output.
While the time stamps for the second run are
15:13 THETA.0000002232.data
17:08 THETA.0000004320.data
19:17 THETA.0000006552.data
21:12 THETA.0000008712.data
about 2 hours (120 minutes) per monthly output.
So the lesson is that we had better set up square (or near-square) tiles,
which minimize the overlap-to-interior ratio, to improve efficiency.
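To put numbers on "much more than 15%", here is a small Python sketch that
averages the intervals between the monthly THETA outputs listed above (the
time stamps are taken straight from the listings; the same day is assumed
within each run):

from datetime import datetime

run1 = ["09:59", "11:16", "12:49", "14:05"]   # 34x34 tiles
run2 = ["15:13", "17:08", "19:17", "21:12"]   # 17x51 tiles

def minutes_per_month(stamps):
    t = [datetime.strptime(s, "%H:%M") for s in stamps]
    gaps = [(b - a).total_seconds() / 60.0 for a, b in zip(t, t[1:])]
    return sum(gaps) / len(gaps)

m1 = minutes_per_month(run1)   # ~82 minutes per monthly output
m2 = minutes_per_month(run2)   # ~120 minutes per monthly output
print(m1, m2, m2 / m1)         # observed ratio ~1.46, versus ~1.18 predicted
                               # by the burden estimate, so the square tiles
                               # do even better than the estimate suggests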
Hong Zhang wrote:
> Following the thread of "It seems like possible trouble with the
> global_sum or something like that? Would anyone have a suggestion as to
> what individual files we could try compiling with -O0 to proceed?"
> We tried many experiments, linking in different object files generated
> with the -O2 or -O0 options. The major candidates were the global_sum
> and cg2d/cg3d routines, as suggested by Chris.
> Finally we identified global_sum_tile.F as the file that breaks under
> optimization.
> So as a remedy, we compile this file with -O0 while all other files
> are compiled with -O3.
> And now it works for the 450-cpu config.
> The details of the opt-file are at
> http://mitgcm.org/cgi-bin/viewcvs.cgi/MITgcm_contrib/high_res_cube/code-mods/linux_ia64_ifort%2Bmpi_altix_nas?rev=1.11&content-type=text/vnd.viewcvs-markup
>
>
> thanks,
> hong
>
>
>
> Dimitris Menemenlis wrote:
>> The w2_e2setup.F and W2_EXCH2_TOPOLOGY.h that we use have been around
>> for a long time and we have successfully used them on many occasions
>> before.
>>
>> We have determined that the problem is most likely a compiler
>> optimization issue. In addition to being able to run successfully on
>> 270 CPUs (but failing on 54, 216, and 450), the code will also run
>> successfully if we use -O0 optimization. We have tried -O0
>> successfully on both 54 and 450 CPUs.
>>
>> The way the model fails, when it fails, is the appearance of randomly
>> distributed spikes in Eta, up to +/- 200 m, during the second time step:
>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e2.pdf
>>
>> Initial Eta does not contain these spikes:
>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e1.pdf
>>
>> The spikes only appear in ETAN (at first). All the other model
>> prognostic variables (we have looked at THETA and SALT and monitored
>> UVEL/VVEL) seem OK.
>>
>> The spikes are randomly distributed everywhere in the domain, i.e.,
>> they do not appear to be associated with edge effects of any sort.
>>
>> Has anyone ever seen a similar problem? It seems like possible
>> trouble with the global_sum or something like that? Would anyone
>> have a suggestion as to what individual files we could try compiling
>> with -O0 to proceed?
>>
>> Hong and Dimitris
>>
>> On Jun 24, 2008, at 7:57 AM, Patrick Heimbach wrote:
>>
>>>
>>> Hi Hong,
>>>
>>> afaik,
>>> the files w2_e2setup.F and W2_EXCH2_TOPOLOGY.h
>>> are dependent on your domain decomposition on the cubed sphere,
>>> i.e. if you change that decomposition in SIZE.h
>>> (which seems to be what you did), you need to regenerate
>>> these two files so that all tile and face neighbor information
>>> on the cube remains correct.
>>> The matlab script to do that is in
>>> utils/exch2/matlab-topology-generator/driver.m
>>>
>>> At least from your mail it sounds like you didn't do that.
>>> And it means your problem is not a code version problem.
>>>
>>> Hope this helps
>>> -Patrick
>>>
>>>
>>>
>>> On Jun 23, 2008, at 7:13 PM, Hong Zhang wrote:
>>>
>>>> Dear all,
>>>> last time we reported a problem (attached here:
>>>> ---------
>>>> Something has happened to the code from checkpoint59l to the current
>>>> head branch, which makes it impossible to restart the CS510 code. Any
>>>> clues where we should look and which checkpoints to test?
>>>> Job crashes on third time step with
>>>>
>>>>> WARNING: r*FacC < hFacInf at 3 pts : bi,bj,Thid,Iter= 1 1 1 218
>>>>> e.g. at i,j= 65 85 ; rStarFac,H,eta = -1.237739 4.755480E+03 -1.064152E+04
>>>>> STOP in CALC_R_STAR : too SMALL rStarFacC !
>>>>>
>>>> ---------
>>>> We found that this problem is related to the configuration in SIZE.h
>>>> and w2_e2setup.F.
>>>> We tested s216t_85x85/SIZE.h_216, s1800t_17x51/SIZE.h_450, and
>>>> s216t_85x85/SIZE.h_54.
>>>> They all failed and caused the same error as mentioned above.
>>>> But the config of s1350t_34x34/SIZE_270.h is workable.
>>>> For s216t_85x85/SIZE.h_54 we further switched off the optimization
>>>> (by setting FOPTIM = in the Makefile), but it has the same problem.
>>>> We checked the output at the second time step
>>>> but didn't find an obvious overlap problem.
>>>> Does anyone have any clue?
>>>>
>>>> hong
>>>
>>> ---
>>> Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>>> MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
>>> FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach
>>>
>>>
>>
>>
>