[MITgcm-devel] SIZE.h matters on Columbia for CS510?

Fri Jun 27 15:56:34 EDT 2008

Hi Dimitris,

Do you need to completely turn off the optimisation (-O0), or
would just reducing it (-O3 -> -O2) work?
Which machine and optfile are you working with?
Cheers,
Jean-Michel

On Fri, Jun 27, 2008 at 12:42:04PM -0700, Dimitris Menemenlis wrote:
> Martin, thanks for taking a look.  We'll have a closer look at seaice  
> output.
>
> The weird thing is that the problem goes away when we use a specific number
> of processors or when we turn off optimization.
>
> D.
>
> Dimitris Menemenlis
> DMenemenlis at gmail.com
>
> On Jun 27, 2008, at 11:44 AM, Martin Losch wrote:
>
>> Hallo,
>>
>> I work on an Apple computer, so your huge PDFs absolutely killed my
>> machine (o; but what I saw, before I had to restart my poor old Apple,
>> makes me wonder whether the problem may have something to do with the
>> seaice model. Extreme values seem to occur more often near the ice
>> edge and underneath the ice. Since the ice model is global, numerical
>> problems may actually affect eta everywhere (via some spikes in
>> phi0Surf). There have been many changes in pkg/seaice since 59l, ...
>>
>> Martin
>>
>> PS. Can you send the actual files ETAN(iter=1) and ETAN(iter=2),  
>> because I guess with matlab I can have a closer look than with these  
>> PDFs.
>>
>>
>> On 27 Jun 2008, at 18:47, Dimitris Menemenlis wrote:
>>
>>> The w2_e2setup.F and W2_EXCH2_TOPOLOGY.h that we use have been  
>>> around for a long time and we have successfully used them on many  
>>> occasions before.
>>>
>>> We have determined that the problem is most likely a compiler  
>>> optimization issue.  In addition to being able to run successfully  
>>> on 270 CPUs (but failing on 54, 216, and 450), the code will also  
>>> run successfully if we use -O0 optimization.  We have tried -O0  
>>> successfully on both 54 and 450 CPUs.
>>>
>>> The way the model fails, when it fails, is the appearance of  
>>> randomly distributed spikes in Eta, up to +/- 200 m, during the  
>>> second time step:
>>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e2.pdf
>>>
>>> Initial Eta does not contain these spikes:
>>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e1.pdf
>>>
>>> The spikes only appear in ETAN (at first).  All the other model  
>>> prognostic variables (we have looked at THETA and SALT and monitored 
>>> UVEL/VVEL) seem OK.
>>>
>>> The spikes are randomly distributed everywhere in the domain, i.e.,  
>>> they do not appear to be associated with edge effects of any sort.
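
One way to make the "spikes appear between the first and second time step" observation quantitative is to flag every point whose Eta changes by more than some threshold in a single step. The following is only a minimal sketch on synthetic data: the function name `find_spikes`, the 50 m threshold, and the nested-list layout of the fields are assumptions for illustration, not part of MITgcm.

```python
def find_spikes(eta_prev, eta_next, threshold=50.0):
    """Return (j, i, value) for points where eta changes by more than
    `threshold` metres between two consecutive snapshots."""
    spikes = []
    for j, (row_prev, row_next) in enumerate(zip(eta_prev, eta_next)):
        for i, (a, b) in enumerate(zip(row_prev, row_next)):
            if abs(b - a) > threshold:
                spikes.append((j, i, b))
    return spikes


# Synthetic 2 x 3 fields standing in for ETAN at iter=1 and iter=2.
eta1 = [[0.1, -0.2, 0.0],
        [0.3, 0.1, -0.1]]
eta2 = [[0.1, -0.2, 180.0],
        [0.3, -195.0, -0.1]]

spikes = find_spikes(eta1, eta2)
```

The resulting index list could then be compared against tile boundaries or the ice edge to test whether the spikes really are randomly distributed.
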
>>>
>>> Has anyone ever seen a similar problem?  It seems like possible
>>> trouble with the global_sum or something like that?  Would anyone
>>> have a suggestion as to which individual files we could try compiling
>>> with -O0 to proceed?
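
Since the run succeeds with everything at -O0, the search for the file (or files) that must be de-optimised can be done by bisection rather than by guessing. Below is a minimal sketch of that idea, assuming exactly one source file is responsible. The predicate `run_ok` is hypothetical: it stands for "rebuild with only these files at -O0, run two time steps, and report whether Eta stays free of spikes". The file names are placeholders.

```python
def bisect_culprit(files, run_ok):
    """Find the single file that must be compiled at -O0.

    `run_ok(o0_files)` is a hypothetical callback: it should rebuild with
    only `o0_files` at -O0 (everything else optimised) and return True if
    the run is clean. Assumes exactly one file is responsible.
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if run_ok(set(half)):
            candidates = half                    # culprit is in this half
        else:
            candidates = candidates[len(half):]  # culprit is in the other half
    return candidates[0]


# Stand-in for the real build-and-run step, for illustration only.
all_files = ["file_a.F", "file_b.F", "file_c.F", "file_d.F", "file_e.F"]
fake_run_ok = lambda o0_files: "file_d.F" in o0_files

culprit = bisect_culprit(all_files, fake_run_ok)
```

With N source files this needs about log2(N) rebuilds. If more than one file is involved, the single-culprit assumption fails and the result should be checked by a final build with only the reported file at -O0.
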
>>>
>>> Hong and Dimitris
>>>
>>> On Jun 24, 2008, at 7:57 AM, Patrick Heimbach wrote:
>>>
>>>>
>>>> Hi Hong,
>>>>
>>>> afaik,
>>>> the files w2_e2setup.F and W2_EXCH2_TOPOLOGY.h
>>>> are dependent on your domain decomposition on the cubed sphere,
>>>> i.e. if you change that decomposition in SIZE.h
>>>> (which seems to be what you did), you need to regenerate
>>>> these two files so that all tile and face neighbor information
>>>> on the cube remains correct.
>>>> The matlab script to do that is in
>>>> utils/exch2/matlab-topology-generator/driver.m
>>>>
>>>> At least from your mail it sounds like you didn't do that.
>>>> And it means your problem is not a code version problem.
>>>>
>>>> Hope this helps
>>>> -Patrick
>>>>
>>>>
>>>>
>>>> On Jun 23, 2008, at 7:13 PM, Hong Zhang wrote:
>>>>
>>>>> Dear all,
>>>>> last time we reported a problem (attached here):
>>>>> ---------
>>>>> Something has happened to code from checkpoint59l to current head
>>>>> branch, which makes it impossible to restart CS510 code.  Any
>>>>> clues where we should look and what checkpoints to test?
>>>>> Job crashes on third time step with
>>>>>
>>>>>> WARNING: r*FacC < hFacInf at       3 pts : bi,bj,Thid,Iter=    1   1   1       218
>>>>>> e.g. at i,j=  65  85 ; rStarFac,H,eta = -1.237739  4.755480E+03 -1.064152E+04
>>>>>> STOP in CALC_R_STAR : too SMALL rStarFacC !
>>>>>>
>>>>> ---------
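
The three numbers in the warning are mutually consistent if rStarFac is read as the ratio of the actual to the resting column thickness, (H + eta) / H. That reading is an assumption here, but it reproduces the logged value. The short check below shows that a free surface of about -10.6 km over a 4.76 km deep column gives a negative thickness factor, which is why CALC_R_STAR stops.

```python
# Values copied from the warning line in the log (metres).
H = 4.755480e3     # resting column depth
eta = -1.064152e4  # free-surface elevation at the failing point

# Assumed definition: ratio of actual to resting column thickness.
r_star_fac = (H + eta) / H

logged = -1.237739  # rStarFac as printed in the log
difference = abs(r_star_fac - logged)
```

The difference is within the rounding of the printed inputs, so the crash itself is a downstream symptom of the unphysical Eta value rather than a separate problem.
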
>>>>> We found that this problem is related to the configuration of
>>>>> SIZE.h and w2_e2setup.F.
>>>>> We tested s216t_85x85/SIZE.h_216, s1800t_17x51/SIZE.h_450, and
>>>>> s216t_85x85/SIZE.h_54.
>>>>> They all failed with the same error as mentioned above.
>>>>> But the configuration s1350t_34x34/SIZE_270.h works.
>>>>> For s216t_85x85/SIZE.h_54 we further switched off the optimization
>>>>> (by setting FOPTIM = in the Makefile), but it has the same problem.
>>>>> We checked the output at the second time step
>>>>> but did not find any obvious overlap problem.
>>>>> Does anyone have any clue?
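
As a quick sanity check on the four decompositions listed above, the sketch below verifies that each one tiles the full CS510 cube (6 faces of 510 x 510 points) and computes the number of tiles per process. It assumes the directory names encode the tile count and tile size (for example, s216t_85x85 meaning 216 tiles of 85 x 85 points) and that the SIZE.h suffix is the CPU count. Both readings are inferred from the names, not confirmed in the thread.

```python
FACES, N = 6, 510
TOTAL_POINTS = FACES * N * N  # horizontal grid points on the CS510 cube

# name: (number of tiles, tile size in x, tile size in y, number of CPUs)
configs = {
    "s216t_85x85/SIZE.h_216":  (216, 85, 85, 216),
    "s1800t_17x51/SIZE.h_450": (1800, 17, 51, 450),
    "s216t_85x85/SIZE.h_54":   (216, 85, 85, 54),
    "s1350t_34x34/SIZE_270.h": (1350, 34, 34, 270),
}

summary = {}
for name, (ntiles, snx, sny, ncpu) in configs.items():
    covers_cube = ntiles * snx * sny == TOTAL_POINTS
    tiles_per_cpu, leftover = divmod(ntiles, ncpu)
    summary[name] = (covers_cube, tiles_per_cpu, leftover)
```

Under these assumptions all four decompositions cover the cube exactly and divide evenly among processes; the one working case uses 5 tiles per process, while the failing cases use 1 or 4. This is only an observation about the configurations and does not by itself identify the cause.
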
>>>>>
>>>>> hong
>>>>> _______________________________________________
>>>>> MITgcm-devel mailing list
>>>>> MITgcm-devel at mitgcm.org
>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>>
>>>> ---
>>>> Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>>>> MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
>>>> FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach