[MITgcm-devel] SIZE.h matters on Columbia for CS510?

Dimitris Menemenlis dmenemenlis at gmail.com
Fri Jul 11 12:12:32 EDT 2008


My guess (without confirmation) is that the speed-up is due to item 2
on your list, i.e., cache memory.  With Chris we had a similar
experience running the 1/8th and 1/16th configurations: overall
super-linear speed-up in some cases as the processor count was
increased, even though the exchange, global_sum, and I/O routines
slowed down.  D.
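
To make the cache argument concrete, here is a rough back-of-the-envelope
sketch (Python).  The Nr = 50 level count, the 8-byte reals, and the
one-field accounting are my own illustrative assumptions, not measured
values for Columbia:

    # Per-process working set for one 3D field, halo included.
    # sNx, sNy: tile size; nSx: tiles per process; OLx, OLy: overlap.
    def working_set_mb(sNx, sNy, nSx, OLx=8, OLy=8, Nr=50):
        points = nSx * (sNx + 2 * OLx) * (sNy + 2 * OLy) * Nr
        return points * 8 / 1e6  # 8-byte reals

    print(working_set_mb(85, 85, 4))  # ~16.3 MB: 85x85 tiles, 4 per cpu (54-cpu style)
    print(working_set_mb(34, 34, 3))  # ~3.0 MB: 34x34 tiles, 3 per cpu (450-cpu run)

Once the per-process working set of the most-used fields drops toward
the cache size, memory traffic falls sharply, which is how super-linear
scaling can arise.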

Dimitris Menemenlis
DMenemenlis at gmail.com

On Jul 11, 2008, at 7:16 AM, Jean-Michel Campin wrote:

> Hello Hong,
>
> I don't know much, but on some platforms the "volume" of
> data that is sent/received when doing an exchange matters
> in terms of speed.  This could be the case here (since
> Nr is not so small).
> Another thing: could the smaller tiles fit better into the cache
> memory?  Not sure about that (also difficult to tell since the
> number of tiles is different).
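>
> A quick sketch of that volume argument (Python; the Nr = 50 value is
> my assumption, just for illustration):
>
>     # Halo points sent/received per process per exchange call, for
>     # nSx tiles of sNx x sNy with overlap width OL on all four edges
>     # (corners and cube-face connectivity ignored).
>     def halo_points(sNx, sNy, nSx, OL=8, Nr=50):
>         return nSx * 2 * (sNx + sNy) * OL * Nr
>
>     print(halo_points(34, 34, 3))  # 163,200 points (34x34 run)
>     print(halo_points(17, 51, 4))  # 217,600 points (17x51 run)
>
> So the 17x51 decomposition would move about a third more halo data
> per exchange than the 34x34 one.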
>
> Cheers,
> Jean-Michel
>
> On Thu, Jul 10, 2008 at 02:38:55PM -0700, Hong Zhang wrote:
>> Another interesting point for the CS510 runs:
>> The size of each tile has a significant impact on performance.
>> We have two 450-cpu runs.
>> One has (in SIZE.h)
>>    &           sNx =  34,
>>    &           sNy =  34,
>>    &           OLx =   8,
>>    &           OLy =   8,
>>    &           nSx =   3,
>>    &           nSy =   1,
>>    &           nPx = 450,
>>    &           nPy =   1
>> Our estimate for the burden on each cpu is
>> (sNx+2*OLx)*(sNy+2*OLy)/(sNx*sNy), i.e., 2.16.
>> The other set is
>>    &           sNx =  17,
>>    &           sNy =  51,
>>    &           OLx =   8,
>>    &           OLy =   8,
>>    &           nSx =   4,
>>    &           nSy =   1,
>>    &           nPx = 450,
>>    &           nPy =   1,
>> for which the burden on each cpu is about 2.55.
>> The first run is thus expected to be about 15% faster than the
>> second (2.16/2.55 = 0.85).  But the actual runs show that the first
>> run is much more than 15% faster.
>> Look at the time stamps for the first run:
>> 09:59 THETA.0000002232.data
>> 11:16 THETA.0000004320.data
>> 12:49 THETA.0000006552.data
>> 14:05 THETA.0000008712.data
>> about 82 minutes per monthly output.
>> While the second run is
>> 15:13 THETA.0000002232.data
>> 17:08 THETA.0000004320.data
>> 19:17 THETA.0000006552.data
>> 21:12 THETA.0000008712.data
>> about 2 hours per month.
>>
>> So the lesson is that we had better set up square tiles to improve
>> the efficiency.
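>>
>> To double-check the burden numbers above, a small Python sketch:
>>
>>     # Ratio of (interior + halo) points to interior points per tile.
>>     def overhead(sNx, sNy, OLx=8, OLy=8):
>>         return (sNx + 2 * OLx) * (sNy + 2 * OLy) / (sNx * sNy)
>>
>>     print(overhead(34, 34))  # 2.16 for the square tiles
>>     print(overhead(17, 51))  # 2.55 for the elongated tiles
>>
>> For a fixed number of interior points per cpu (3*34*34 = 4*17*51 =
>> 3468 here), squarer tiles minimize the halo overhead, consistent
>> with the timings above.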
>>
>> Hong Zhang wrote:
>>> Following the thread of "
>>> It seems like possible trouble with the global_sum or something like
>>> that?  Would anyone have a suggestion as to what individual files we
>>> could try compiling with -O0 to proceed? "
>>> We tried many experiments, linking different object files
>>> generated with the -O2 or -O0 options.  The major candidates were
>>> the global_sum and cg2d/cg3d routines, as suggested by Chris.
>>> We finally identified global_sum_tile.F as the file that breaks
>>> under optimization.
>>> As a remedy, we compile this file with -O0 and all other files
>>> with -O3.  And now it works for the 450-cpu config.
>>> The details of the opt-file are at
>>> http://mitgcm.org/cgi-bin/viewcvs.cgi/MITgcm_contrib/high_res_cube/code-mods/linux_ia64_ifort%2Bmpi_altix_nas?rev=1.11&content-type=text/vnd.viewcvs-markup
>>>
>>> thanks,
>>> hong
>>>
>>>
>>>
>>> Dimitris Menemenlis wrote:
>>>> The w2_e2setup.F and W2_EXCH2_TOPOLOGY.h that we use have been
>>>> around for a long time, and we have successfully used them on
>>>> many occasions before.
>>>>
>>>> We have determined that the problem is most likely a compiler
>>>> optimization issue.  In addition to running successfully on 270
>>>> CPUs (but failing on 54, 216, and 450), the code will also run
>>>> successfully if we use -O0 optimization.  We have tried -O0
>>>> successfully on both 54 and 450 CPUs.
>>>>
>>>> The way the model fails, when it fails, is the appearance of
>>>> randomly distributed spikes in Eta, up to +/- 200 m, during the
>>>> second time step:
>>>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e2.pdf
>>>>
>>>> Initial Eta does not contain these spikes:
>>>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e1.pdf
>>>>
>>>> The spikes appear (at first) only in ETAN.  All the other model
>>>> prognostic variables (we have looked at THETA and SALT and
>>>> monitored UVEL/VVEL) seem OK.
>>>>
>>>> The spikes are randomly distributed everywhere in the domain, i.e.,
>>>> they do not appear to be associated with edge effects of any sort.
>>>>
>>>> Has anyone ever seen a similar problem?  It seems like possible
>>>> trouble with the global_sum or something like that.  Would anyone
>>>> have a suggestion as to which individual files we could try
>>>> compiling with -O0 to proceed?
>>>>
>>>> Hong and Dimitris
>>>>
>>>> On Jun 24, 2008, at 7:57 AM, Patrick Heimbach wrote:
>>>>
>>>>>
>>>>> Hi Hong,
>>>>>
>>>>> afaik,
>>>>> the files w2_e2setup.F and W2_EXCH2_TOPOLOGY.h
>>>>> depend on your domain decomposition on the cubed sphere,
>>>>> i.e., if you change that decomposition in SIZE.h
>>>>> (which seems to be what you did), you need to regenerate
>>>>> these two files so that all tile and face neighbor information
>>>>> on the cube remains correct.
>>>>> The matlab script to do that is
>>>>> utils/exch2/matlab-topology-generator/driver.m
>>>>>
>>>>> At least from your mail it sounds like you didn't do that,
>>>>> which would mean your problem is not a code-version problem.
>>>>>
>>>>> Hope this helps
>>>>> -Patrick
>>>>>
>>>>>
>>>>>
>>>>> On Jun 23, 2008, at 7:13 PM, Hong Zhang wrote:
>>>>>
>>>>>> Dear all,
>>>>>> last time we reported a problem (attached here):
>>>>>> ---------
>>>>>> Something has happened to the code from checkpoint59l to the
>>>>>> current head branch, which makes it impossible to restart the
>>>>>> CS510 code.  Any clues where we should look and which
>>>>>> checkpoints to test?
>>>>>> The job crashes on the third time step with
>>>>>>
>>>>>>> WARNING: r*FacC < hFacInf at       3 pts : bi,bj,Thid,Iter=   1   1   1     218
>>>>>>> e.g. at i,j=  65  85 ; rStarFac,H,eta = -1.237739  4.755480E+03 -1.064152E+04
>>>>>>> STOP in CALC_R_STAR : too SMALL rStarFacC !
>>>>>>>
>>>>>> ---------
>>>>>> We found this problem is related to the configuration in SIZE.h
>>>>>> and w2_e2setup.F.
>>>>>> We tested s216t_85x85/SIZE.h_216, s1800t_17x51/SIZE.h_450, and
>>>>>> s216t_85x85/SIZE.h_54.
>>>>>> They all failed with the same error as mentioned above.
>>>>>> But the configuration s1350t_34x34/SIZE_270.h works.
>>>>>> For s216t_85x85/SIZE.h_54 we further switched off the
>>>>>> optimization (in the Makefile, setting FOPTIM = ), but it has
>>>>>> the same problem.
>>>>>> We checked the output at the second time step
>>>>>> but didn't find an obvious overlap problem.
>>>>>> Does anyone have any clue?
>>>>>>
>>>>>> hong
>>>>>
>>>>> ---
>>>>> Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>>>>> MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
>>>>> FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
> _______________________________________________
> MITgcm-devel mailing list
> MITgcm-devel at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-devel