[MITgcm-devel] SIZE.h matters on Columbia for CS510?

Dimitris Menemenlis dmenemenlis at gmail.com
Fri Jun 27 15:42:04 EDT 2008


Martin, thanks for taking a look.  We'll have a closer look at seaice  
output.

The weird thing is that the problem goes away when we use a specific
number of processors or when we turn off optimization.

D.

Dimitris Menemenlis
DMenemenlis at gmail.com

On Jun 27, 2008, at 11:44 AM, Martin Losch wrote:

> Hallo,
>
> I work on an Apple computer, so your huge pdfs absolutely killed my  
> machine (o; but what I saw, before I had to restart my poor old Apple,  
> makes me wonder whether the problem may have something to do with the  
> seaice model. Extreme values seem to occur more often near the ice  
> edge and underneath the ice. Since the ice model is global,  
> numerical problems may actually affect eta everywhere (via some  
> spikes in the phi0Surf). There have been many changes in pkg/seaice  
> since 59l, ...
>
> Martin
>
> PS. Can you send the actual files ETAN(iter=1) and ETAN(iter=2),  
> because I guess with matlab I can have a closer look than with these  
> PDFs.
>
>
> On 27 Jun 2008, at 18:47, Dimitris Menemenlis wrote:
>
>> The w2_e2setup.F and W2_EXCH2_TOPOLOGY.h that we use have been  
>> around for a long time and we have successfully used them on many  
>> occasions before.
>>
>> We have determined that the problem is most likely a compiler  
>> optimization issue.  In addition to being able to run successfully  
>> on 270 CPUs (but failing on 54, 216, and 450), the code will also  
>> run successfully if we use -O0 optimization.  We have tried -O0  
>> successfully on both 54 and 450 CPUs.
>>
>> The way the model fails, when it fails, is the appearance of  
>> randomly distributed spikes in Eta, up to +/- 200 m, during the  
>> second time step:
>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e2.pdf
>>
>> Initial Eta does not contain these spikes:
>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e1.pdf
>>
>> The spikes only appear in ETAN (at first).  All the other model  
>> prognostic variables (we have looked at THETA and SALT and  
>> monitored UVEL/VVEL) seem OK.
>>
>> The spikes are randomly distributed everywhere in the domain, i.e.,  
>> they do not appear to be associated with edge effects of any sort.
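The spike description above can be turned into a simple check. The following is a minimal sketch, assuming the two ETAN fields have already been read into nested lists of metres; the function name `find_spikes` and the 10 m threshold are illustrative choices, not part of MITgcm.

```python
def find_spikes(eta_before, eta_after, max_jump=10.0):
    """Return (j, i, jump) for cells where eta changed by more than max_jump metres.

    max_jump is an arbitrary illustrative threshold, not a model parameter.
    """
    spikes = []
    for j, (row_b, row_a) in enumerate(zip(eta_before, eta_after)):
        for i, (b, a) in enumerate(zip(row_b, row_a)):
            jump = a - b
            if abs(jump) > max_jump:
                spikes.append((j, i, jump))
    return spikes


# Synthetic 2 x 3 example: one cell jumps by 200 m between the two steps.
eta_iter1 = [[0.5, 0.25, 0.0], [-0.5, 0.0, 0.25]]
eta_iter2 = [[0.5, 200.25, 0.0], [-0.5, 0.0, 0.25]]
for j, i, jump in find_spikes(eta_iter1, eta_iter2):
    print(f"spike at j={j}, i={i}: change of {jump} m")
```

Listing the spike locations this way makes it easier to see whether they cluster near tile edges or are scattered, as reported here.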
>>
>> Has anyone ever seen a similar problem?  It seems like possible  
>> trouble with the global_sum or something like that?  Would anyone  
>> have a suggestion as to what individual files we could try  
>> compiling with -O0 to proceed?
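One generic way to narrow down which files to compile with -O0 is bisection. The sketch below is not an MITgcm tool: `fails_when_optimized` is a hypothetical callback that would rebuild with only the given files optimized (everything else at -O0), run two time steps, and report whether the spikes appear. It also assumes a single file is responsible.

```python
def find_culprit(files, fails_when_optimized):
    """Bisect `files` down to one file whose optimized build makes the run fail.

    `fails_when_optimized(subset)` is a hypothetical callback returning True
    when optimizing only `subset` reproduces the failure. Assumes exactly one
    file is responsible; otherwise the result is only a starting point.
    """
    candidates = list(files)
    if not candidates:
        raise ValueError("no files to search")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if fails_when_optimized(first_half):
            candidates = first_half
        else:
            candidates = candidates[mid:]
    return candidates[0]


# Demo with placeholder file names and a fake build/run check.
demo_files = ["a.F", "b.F", "c.F", "d.F", "e.F"]
print(find_culprit(demo_files, lambda subset: "d.F" in subset))
```

Each round halves the candidate list, so even a few hundred source files need fewer than ten rebuild-and-run cycles.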
>>
>> Hong and Dimitris
>>
>> On Jun 24, 2008, at 7:57 AM, Patrick Heimbach wrote:
>>
>>>
>>> Hi Hong,
>>>
>>> afaik,
>>> the files w2_e2setup.F and W2_EXCH2_TOPOLOGY.h
>>> are dependent on your domain decomposition on the cubed sphere,
>>> i.e. if you change that decomposition in SIZE.h
>>> (which seems to be what you did), you need to regenerate
>>> these two files so that all tile and face neighbor information
>>> on the cube remain correct.
>>> The matlab script to do that is in
>>> utils/exch2/matlab-topology-generator/driver.m
>>>
>>> At least from your mail it sounds like you didn't do that.
>>> And it means your problem is not a code version problem.
>>>
>>> Hope this helps
>>> -Patrick
>>>
>>>
>>>
>>> On Jun 23, 2008, at 7:13 PM, Hong Zhang wrote:
>>>
>>>> Dear all,
>>>> last time we reported a problem (attached here):
>>>> ---------
>>>> Something has happened to code from checkpoint59l to current head  
>>>> branch, which makes it impossible to restart CS510 code.  Any  
>>>> clues where we should look and what checkpoints to test?
>>>> Job crashes on third time step with
>>>>
>>>>> WARNING: r*FacC < hFacInf at       3 pts : bi,bj,Thid,Iter=    1   1   1       218
>>>>> e.g. at i,j=  65  85 ; rStarFac,H,eta = -1.237739  4.755480E+03  -1.064152E+04
>>>>> STOP in CALC_R_STAR : too SMALL rStarFacC !
>>>>>
>>>> ---------
>>>> We found this problem is related to the config of SIZE.h and  
>>>> w2_e2setup.F.
>>>> We tested s216t_85x85/SIZE.h_216, s1800t_17x51/SIZE.h_450, and  
>>>> s216t_85x85/SIZE.h_54.
>>>> They all failed and caused the same error as mentioned above.
>>>> But the config of s1350t_34x34/SIZE_270.h is workable.
>>>> For s216t_85x85/SIZE.h_54 we further switched off the optimization
>>>> (in Makefile setting FOPTIM =) but it has the same problem.
>>>> We checked the output at the second time step
>>>> but didn't find any obvious overlap problem.
>>>> Does anyone have any clue?
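The four configurations named above can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming CS510 means a cubed sphere with 6 faces of 510 x 510 cells, that a directory name such as s216t_85x85 means 216 tiles of 85 x 85 cells, and that the number in the SIZE.h file name is the process count; the helper `tiles_per_process` is hypothetical.

```python
# Assumptions (not stated explicitly in the thread): 6 faces of 510 x 510
# cells, and "s<N>t_<nx>x<ny>" meaning N tiles of nx x ny cells.
FACES = 6
FACE_SIZE = 510


def tiles_per_process(n_tiles, tile_nx, tile_ny, n_procs):
    """Return tiles per process, or raise ValueError if the layout is inconsistent."""
    if FACE_SIZE % tile_nx or FACE_SIZE % tile_ny:
        raise ValueError("tile size does not divide the face evenly")
    expected = FACES * (FACE_SIZE // tile_nx) * (FACE_SIZE // tile_ny)
    if expected != n_tiles:
        raise ValueError(f"expected {expected} tiles, got {n_tiles}")
    if n_tiles % n_procs:
        raise ValueError("tiles do not divide evenly among processes")
    return n_tiles // n_procs


# (tiles, tile nx, tile ny, processes) for the configurations in the message
configs = {
    "s216t_85x85/SIZE.h_216": (216, 85, 85, 216),
    "s216t_85x85/SIZE.h_54": (216, 85, 85, 54),
    "s1800t_17x51/SIZE.h_450": (1800, 17, 51, 450),
    "s1350t_34x34/SIZE_270.h": (1350, 34, 34, 270),
}
for name, cfg in configs.items():
    print(name, "->", tiles_per_process(*cfg), "tile(s) per process")
```

Under these assumptions all four layouts tile the cube consistently, which suggests the tile counts themselves are not where the failing and working cases differ.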
>>>>
>>>> hong
>>>> _______________________________________________
>>>> MITgcm-devel mailing list
>>>> MITgcm-devel at mitgcm.org
>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>
>>> ---
>>> Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>>> MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
>>> FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach
>>>
>>>



