[MITgcm-devel] SIZE.h matters on Columbia for CS510?

Hong Zhang hong.zhang at caltech.edu
Wed Jul 2 13:01:38 EDT 2008


Following up on the earlier question in this thread:
"It seems like possible trouble with the global_sum or something like
that?  Would anyone have a suggestion as to what individual files we
could try compiling with -O0 to proceed?"
We ran many experiments, linking object files compiled with either
-O2 or -O0. The main candidates were the global-sum and cg2d/cg3d
routines, as suggested by Chris.
We finally identified global_sum_tile.F as the file that breaks under
optimization.
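
The search described above (swapping in -O0 objects for subsets of files
until the failure disappears) can be organized as a bisection. The sketch
below is only an illustration, not the procedure we actually scripted:
`run_ok` is a hypothetical callback standing in for a full rebuild and test
run, the list of suspects is illustrative, and the search assumes exactly one
file is at fault.

```python
from typing import Callable, Sequence, Set


def find_bad_file(files: Sequence[str],
                  run_ok: Callable[[Set[str]], bool]) -> str:
    """Return the single file that must be built at -O0 for a clean run.

    run_ok(unoptimized) is a hypothetical callback: it should rebuild with
    the given files at -O0 and all others optimized, run the model, and
    return True if the run is clean. Assumes exactly one file is at fault.
    """
    candidates = list(files)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        half = candidates[:mid]
        # If de-optimizing this half fixes the run, the culprit is inside it;
        # otherwise it must be in the other half.
        if run_ok(set(half)):
            candidates = half
        else:
            candidates = candidates[mid:]
    return candidates[0]


# Stand-in for a real build-and-run cycle, for illustration only.
def fake_run_ok(unoptimized: Set[str]) -> bool:
    return "global_sum_tile.F" in unoptimized


# Illustrative suspect list, not the full set of files we tested.
suspects = ["cg2d.F", "cg3d.F", "global_sum.F",
            "global_sum_tile.F", "global_max.F"]
print(find_bad_file(suspects, fake_run_ok))
```

With N suspect files this needs about log2(N) build-and-run cycles instead of
N, which matters when each cycle is a large MPI job.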
As a remedy, we compile this file with -O0 and all other files with
-O3.
With this change the 450-CPU configuration now works.
The opt-file is available at
http://mitgcm.org/cgi-bin/viewcvs.cgi/MITgcm_contrib/high_res_cube/code-mods/linux_ia64_ifort%2Bmpi_altix_nas?rev=1.11&content-type=text/vnd.viewcvs-markup
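
For anyone who wants to check a run for the failure quoted below (isolated
spikes in Eta of up to +/- 200 m), a simple threshold scan is enough. This is
a minimal sketch under stated assumptions: the field is already loaded as a
2-D list of values in metres, the 20 m limit is an arbitrary choice well above
realistic sea-surface heights, and `find_spikes` is a hypothetical helper, not
part of MITgcm.

```python
def find_spikes(eta, limit=20.0):
    """Return (j, i, value) for every point where |eta| exceeds limit.

    eta is a 2-D list of sea-surface heights in metres; limit (metres) is an
    assumed threshold, not a value taken from the model.
    """
    return [(j, i, value)
            for j, row in enumerate(eta)
            for i, value in enumerate(row)
            if abs(value) > limit]


# Tiny made-up field with two spikes, for illustration only.
eta = [[0.1, -0.3, 0.2],
       [0.0, 187.5, -0.1],
       [0.4, 0.2, -200.0]]

for j, i, value in find_spikes(eta):
    print(f"spike at j={j}, i={i}: eta = {value} m")
```

An empty result after the second time step is a quick indication that the
spikes are gone; it does not replace comparing against a known-good -O0 run.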

thanks,
hong



Dimitris Menemenlis wrote:
> The w2_e2setup.F and W2_EXCH2_TOPOLOGY.h that we use have been around 
> for a long time and we have successfully used them on many occasions 
> before.
>
> We have determined that the problem is most likely a compiler 
> optimization issue.  In addition to being able to run successfully on 
> 270 CPUs (but failing on 54, 216, and 450), the code will also run 
> successfully if we use -O0 optimization.  We have tried -O0 
> successfully on both 54 and 450 CPUs.
>
> The way the model fails, when it fails, is the appearance of randomly 
> distributed spikes in Eta, up to +/- 200 m, during the second time step:
> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e2.pdf
>
> Initial Eta does not contain these spikes:
> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e1.pdf
>
> The spikes only appear in ETAN (at first).  All the other model 
> prognostic variables (we have looked at THETA and SALT and monitored 
> UVEL/VVEL) seem OK.
>
> The spikes are randomly distributed everywhere in the domain, i.e., 
> they do not appear to be associated with edge effects of any sort.
>
> Has anyone ever seen a similar problem?  It seems like possible 
> trouble with the global_sum or something like that?  Would anyone have 
> a suggestion as to what individual files we could try compiling with 
> -O0 to proceed?
>
> Hong and Dimitris
>
> On Jun 24, 2008, at 7:57 AM, Patrick Heimbach wrote:
>
>>
>> Hi Hong,
>>
>> afaik,
>> the files w2_e2setup.F and W2_EXCH2_TOPOLOGY.h
>> are dependent on your domain decomposition on the cubed sphere,
>> i.e. if you change that decomposition in SIZE.h
>> (which seems to be what you did), you need to regenerate
>> these two files so that all tile and face neighbor information
>> on the cube remains correct.
>> The matlab script to do that is in
>> utils/exch2/matlab-topology-generator/driver.m
>>
>> At least from your mail it sounds like you didn't do that.
>> And it means your problem is not a code version problem.
>>
>> Hope this helps
>> -Patrick
>>
>>
>>
>> On Jun 23, 2008, at 7:13 PM, Hong Zhang wrote:
>>
>>> Dear all,
>>> last time we reported a problem (attached here):
>>> ---------
>>> Something has happened to code from checkpoint59l to current head 
>>> branch, which makes it impossible to restart CS510 code.  Any clues 
>>> where we should look and what checkpoints to test?
>>> Job crashes on third time step with
>>>
>>>> WARNING: r*FacC < hFacInf at       3 pts : bi,bj,Thid,Iter=   1   1   1       218
>>>> e.g. at i,j=  65  85 ; rStarFac,H,eta = -1.237739  4.755480E+03 -1.064152E+04
>>>> STOP in CALC_R_STAR : too SMALL rStarFacC !
>>>>
>>> ---------
>>> We found that this problem is related to the configuration of SIZE.h
>>> and w2_e2setup.F.
>>> We tested s216t_85x85/SIZE.h_216, s1800t_17x51/SIZE.h_450, and 
>>> s216t_85x85/SIZE.h_54.
>>> They all failed with the same error as mentioned above.
>>> But the configuration s1350t_34x34/SIZE_270.h works.
>>> For s216t_85x85/SIZE.h_54 we further switched off the optimization
>>> (by setting FOPTIM = in the Makefile) but it has the same problem.
>>> We checked the output at the second time step
>>> but didn't find any obvious overlap problem.
>>> Does anyone have any clue?
>>>
>>> hong
>>> _______________________________________________
>>> MITgcm-devel mailing list
>>> MITgcm-devel at mitgcm.org
>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>
>> ---
>> Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>> MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
>> FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach
>>
>>
>
>



