[MITgcm-devel] SIZE.h matters on Columbia for CS510?

Dimitris Menemenlis dmenemenlis at gmail.com
Fri Jun 27 12:47:33 EDT 2008


The w2_e2setup.F and W2_EXCH2_TOPOLOGY.h that we use have been around  
for a long time and we have successfully used them on many occasions  
before.

We have determined that the problem is most likely a compiler
optimization issue.  With optimization on, the code runs successfully
on 270 CPUs but fails on 54, 216, and 450.  With -O0 it runs
successfully, which we have confirmed on both 54 and 450 CPUs.
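
As a sanity check on the decompositions named in the quoted report below (s216t_85x85, s1800t_17x51, s1350t_34x34), here is a minimal sketch of the tile arithmetic. It assumes that CS510 means six cube faces of 510x510 points, that a name like s216t_85x85 reads as "216 tiles of 85x85 points", and that tiles are shared equally among processes; none of this is read from the actual SIZE.h files.

```python
# Assumption: CS510 is a cubed sphere with 6 faces of 510 x 510 grid points.
FACE_SIZE = 510
N_FACES = 6


def tile_count(tile_nx, tile_ny):
    """Number of tiles covering the whole cube for a given tile size."""
    if FACE_SIZE % tile_nx or FACE_SIZE % tile_ny:
        raise ValueError("tile size does not divide the face evenly")
    return N_FACES * (FACE_SIZE // tile_nx) * (FACE_SIZE // tile_ny)


def tiles_per_process(n_tiles, n_cpus):
    """Tiles handled by each process, assuming an equal split."""
    if n_tiles % n_cpus:
        raise ValueError("tiles do not divide evenly among processes")
    return n_tiles // n_cpus


# (tile_nx, tile_ny, CPUs) for the four configurations named in the thread.
configs = [(85, 85, 54), (85, 85, 216), (17, 51, 450), (34, 34, 270)]

for nx, ny, ncpu in configs:
    n_tiles = tile_count(nx, ny)
    print(f"{nx}x{ny} tiles: {n_tiles} total, "
          f"{tiles_per_process(n_tiles, ncpu)} per process on {ncpu} CPUs")
```

Under these assumptions every configuration in the thread divides evenly, so the arithmetic alone does not separate the working 270-CPU case from the failing ones.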

When the model fails, randomly distributed spikes of up to +/- 200 m
appear in Eta during the second time step:
http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e2.pdf

Initial Eta does not contain these spikes:
http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e1.pdf

The spikes only appear in ETAN (at first).  All the other model  
prognostic variables (we have looked at THETA and SALT and monitored  
UVEL/VVEL) seem OK.

The spikes are randomly distributed everywhere in the domain, i.e.,  
they do not appear to be associated with edge effects of any sort.
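
One way to list such spikes and check whether they cluster near tile edges is sketched below. It is only an illustration: it assumes Eta for a single tile or face has already been loaded into a plain 2-D list, and the function name and threshold are hypothetical choices, not part of MITgcm.

```python
from statistics import median


def find_spikes(eta, threshold):
    """Return (j, i, value) for interior points that differ from the
    median of their four direct neighbours by more than `threshold`."""
    spikes = []
    for j in range(1, len(eta) - 1):
        for i in range(1, len(eta[j]) - 1):
            neighbours = [eta[j - 1][i], eta[j + 1][i],
                          eta[j][i - 1], eta[j][i + 1]]
            if abs(eta[j][i] - median(neighbours)) > threshold:
                spikes.append((j, i, eta[j][i]))
    return spikes


# Synthetic 5x5 field with one isolated 150 m spike (made-up data).
field = [[0.1] * 5 for _ in range(5)]
field[2][3] = 150.0
print(find_spikes(field, threshold=10.0))
```

Comparing against the neighbour median rather than the mean keeps a single spike from flagging the points next to it.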

Has anyone ever seen a similar problem?  It seems like possible
trouble with the global_sum or something like that.  Would anyone have
a suggestion as to which individual files we could try compiling with
-O0 to proceed?
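
Absent a specific suspect, one generic approach is to bisect over the source files. The sketch below assumes that exactly one file is responsible, that the failure is deterministic, and that a run with every file at -O0 succeeds. The `run_ok` callback is hypothetical: it stands for "rebuild with these files at -O0 and the rest optimized, run two time steps, and return True if Eta has no spikes". The file names are placeholders.

```python
from typing import Callable, Sequence, Set


def bisect_culprit(files: Sequence[str],
                   run_ok: Callable[[Set[str]], bool]) -> str:
    """Find the one file that must be built at -O0 for the run to pass.

    Each step builds half of the remaining candidates at -O0. If the run
    passes, the culprit is in that half; otherwise it is in the other.
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if run_ok(set(half)):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


# Stand-in for the real build-and-run step (hypothetical): the run passes
# only when the placeholder culprit is among the files built at -O0.
placeholder_files = [f"file_{c}.F" for c in "abcdefgh"]


def fake_run_ok(unoptimized: Set[str]) -> bool:
    return "file_c.F" in unoptimized


print(bisect_culprit(placeholder_files, fake_run_ok))
```

With N source files this needs about log2(N) rebuilds. If several files interact, the single-culprit assumption breaks and the result should be treated as a hint only.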

Hong and Dimitris

On Jun 24, 2008, at 7:57 AM, Patrick Heimbach wrote:

>
> Hi Hong,
>
> afaik,
> the files w2_e2setup.F and W2_EXCH2_TOPOLOGY.h
> are dependent on your domain decomposition on the cubed sphere,
> i.e. if you change that decomposition in SIZE.h
> (which seems to be what you did), you need to regenerate
> these two files so that all tile and face neighbor information
> on the cube remains correct.
> The matlab script to do that is in
> utils/exch2/matlab-topology-generator/driver.m
>
> At least from your mail it sounds like you didn't do that.
> And it means your problem is not a code version problem.
>
> Hope this helps
> -Patrick
>
>
>
> On Jun 23, 2008, at 7:13 PM, Hong Zhang wrote:
>
>> Dear all,
>> last time we reported a problem (attached here):
>> ---------
>> Something has happened to code from checkpoint59l to current head  
>> branch, which makes it impossible to restart CS510 code.  Any clues  
>> where we should look and what checkpoints to test?
>> Job crashes on third time step with
>>
>>> WARNING: r*FacC < hFacInf at       3 pts : bi,bj,Thid,Iter=   1    1   1       218
>>> e.g. at i,j=  65  85 ; rStarFac,H,eta = -1.237739  4.755480E+03  -1.064152E+04
>>> STOP in CALC_R_STAR : too SMALL rStarFacC !
>>>
>> ---------
>> We found that this problem is related to the configuration of
>> SIZE.h and w2_e2setup.F.
>> We tested s216t_85x85/SIZE.h_216, s1800t_17x51/SIZE.h_450, and  
>> s216t_85x85/SIZE.h_54.
>> They all failed and caused the same error as mentioned above.
>> But the config of s1350t_34x34/SIZE_270.h is workable.
>> For s216t_85x85/SIZE.h_54 we further switched off the optimization
>> (setting FOPTIM = in the Makefile) but it has the same problem.
>> We checked the output at the second timestep
>> but didn't find any obvious overlap problem.
>> Does anyone have any clue?
>>
>> hong
>> _______________________________________________
>> MITgcm-devel mailing list
>> MITgcm-devel at mitgcm.org
>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>
> ---
> Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
> MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
> FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach
>
>



