[MITgcm-devel] SIZE.h matters on Columbia for CS510?
Dimitris Menemenlis
dmenemenlis at gmail.com
Fri Jun 27 15:42:04 EDT 2008
Martin, thanks for taking a look. We'll have a closer look at seaice
output.
The weird thing is that the problem goes away when we use a specific
number of processors or when we turn off optimization.
D.
Dimitris Menemenlis
DMenemenlis at gmail.com
On Jun 27, 2008, at 11:44 AM, Martin Losch wrote:
> Hallo,
>
> I work on an Apple computer, so your huge pdfs absolutely killed my
> machine (o; but what I saw, before I had to restart my poor old Apple,
> makes me wonder whether the problem may have something to do with the
> seaice model. Extreme values seem to occur more often near the ice
> edge and underneath the ice. Since the ice model is global,
> numerical problems may actually affect eta everywhere (via some
> spikes in the phi0Surf). There have been many changes in pkg/seaice
> since 59l, ...
>
> Martin
>
> PS. Can you send the actual files ETAN(iter=1) and ETAN(iter=2)?
> I guess I can have a closer look with matlab than with these
> PDFs.
>
>
> On 27 Jun 2008, at 18:47, Dimitris Menemenlis wrote:
>
>> The w2_e2setup.F and W2_EXCH2_TOPOLOGY.h that we use have been
>> around for a long time and we have successfully used them on many
>> occasions before.
>>
>> We have determined that the problem is most likely a compiler
>> optimization issue. In addition to being able to run successfully
>> on 270 CPUs (but failing on 54, 216, and 450), the code will also
>> run successfully if we use -O0 optimization. We have tried -O0
>> successfully on both 54 and 450 CPUs.
>>
>> The way the model fails, when it fails, is the appearance of
>> randomly distributed spikes in Eta, up to +/- 200 m, during the
>> second time step:
>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e2.pdf
>>
>> Initial Eta does not contain these spikes:
>> http://ecco2.jpl.nasa.gov/data1/cube/cube81/run_test/e1.pdf
>>
>> The spikes only appear in ETAN (at first). All the other model
>> prognostic variables (we have looked at THETA and SALT and
>> monitored UVEL/VVEL) seem OK.
>>
>> The spikes are randomly distributed everywhere in the domain, i.e.,
>> they do not appear to be associated with edge effects of any sort.
>>
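One way to quantify spikes like those described above, rather than judging them by eye from the PDFs, is to flag grid points whose Eta differs from the median of their immediate neighbours by more than some threshold. The following is only a minimal sketch on a plain 2-D list. The function name `find_spikes`, the 3x3 neighbourhood, and the 10 m threshold are assumptions for illustration and are not part of MITgcm. It also ignores cubed-sphere face connectivity, so points on an array edge only see neighbours inside the same array.

```python
from statistics import median


def find_spikes(eta, threshold=10.0):
    """Return (j, i) indices where eta deviates from the median of its
    in-array neighbours by more than `threshold` (same units as eta)."""
    nj, ni = len(eta), len(eta[0])
    spikes = []
    for j in range(nj):
        for i in range(ni):
            # Collect the up-to-8 surrounding points, clipped at the edges.
            neighbours = [
                eta[jj][ii]
                for jj in range(max(0, j - 1), min(nj, j + 2))
                for ii in range(max(0, i - 1), min(ni, i + 2))
                if (jj, ii) != (j, i)
            ]
            if abs(eta[j][i] - median(neighbours)) > threshold:
                spikes.append((j, i))
    return spikes


# Hypothetical 4x4 field (metres) with one artificial 200 m spike.
field = [
    [0.1, 0.2, 0.1, 0.0],
    [0.0, 200.0, 0.2, 0.1],
    [0.1, 0.1, 0.0, 0.2],
    [0.2, 0.0, 0.1, 0.1],
]
print(find_spikes(field))
```

Using the median rather than the mean keeps a single spike from contaminating the reference value of its neighbours. Counting the flagged points per tile at iter=1 and iter=2 would show whether the spikes really are spread evenly over the domain.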
>> Has anyone ever seen a similar problem? It seems like possible
>> trouble with the global_sum or something like that. Would anyone
>> have a suggestion as to which individual files we could try
>> compiling with -O0 to proceed?
>>
>> Hong and Dimitris
>>
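If a single source file is responsible, one way to narrow it down is to bisect the file list: build half of the files with -O0 and the rest with the normal flags, run, and keep whichever half makes the spikes go away. The sketch below assumes exactly one culprit file and a failure that reproduces reliably. `run_ok` is a hypothetical callback standing in for the real rebuild-and-run step, and the file names are placeholders.

```python
def find_culprit(files, run_ok):
    """Bisect `files` to find the one that must be built with -O0.

    run_ok(unoptimized) is a hypothetical callback: it should rebuild with
    the given files at -O0 (everything else optimized), run the model, and
    return True if the run is clean. Assumes exactly one culprit file.
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if run_ok(half):
            # Spikes vanished, so the culprit is in the unoptimized half.
            candidates = half
        else:
            # Spikes remain, so the culprit is still being optimized.
            candidates = candidates[len(half):]
    return candidates[0]


# Placeholder file names and a fake test, for illustration only:
# pretend the run is clean whenever "d.F" is built with -O0.
files = ["a.F", "b.F", "c.F", "d.F", "e.F"]


def fake_run_ok(unoptimized):
    return "d.F" in unoptimized


print(find_culprit(files, fake_run_ok))
```

With N files this needs about log2(N) rebuilds instead of N. If the problem comes from an interaction between several optimized files, the single-culprit assumption fails and the result should not be trusted.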
>> On Jun 24, 2008, at 7:57 AM, Patrick Heimbach wrote:
>>
>>>
>>> Hi Hong,
>>>
>>> afaik,
>>> the files w2_e2setup.F and W2_EXCH2_TOPOLOGY.h
>>> are dependent on your domain decomposition on the cubed sphere,
>>> i.e. if you change that decomposition in SIZE.h
>>> (which seems to be what you did), you need to regenerate
>>> these two files so that all tile and face neighbor information
>>> on the cube remains correct.
>>> The matlab script to do that is in
>>> utils/exch2/matlab-topology-generator/driver.m
>>>
>>> At least from your mail it sounds like you didn't do that.
>>> And it means your problem is not a code version problem.
>>>
>>> Hope this helps
>>> -Patrick
>>>
>>>
>>>
>>> On Jun 23, 2008, at 7:13 PM, Hong Zhang wrote:
>>>
>>>> Dear all,
>>>> last time we reported a problem (attached here):
>>>> ---------
>>>> Something has happened to code from checkpoint59l to current head
>>>> branch, which makes it impossible to restart CS510 code. Any
>>>> clues where we should look and what checkpoints to test?
>>>> Job crashes on third time step with
>>>>
>>>>> WARNING: r*FacC < hFacInf at 3 pts : bi,bj,Thid,Iter=
>>>>> 1 1 1 218
>>>>> e.g. at i,j= 65 85 ; rStarFac,H,eta = -1.237739 4.755480E+03
>>>>> -1.064152E+04
>>>>> STOP in CALC_R_STAR : too SMALL rStarFacC !
>>>>>
>>>> ---------
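The three numbers in the warning above are consistent with each other if one assumes rStarFac = (H + eta) / H, i.e. the ratio of the actual column thickness to the resting thickness. That formula is an assumption here and is not quoted from the code, but it reproduces the logged value to within the rounding of the printed digits. A minimal check:

```python
H = 4.755480e3      # resting column thickness from the log (m)
eta = -1.064152e4   # surface elevation from the log (m)


def r_star_fac(depth, surface):
    """Assumed definition: actual column thickness over resting thickness."""
    return (depth + surface) / depth


print(f"rStarFac = {r_star_fac(H, eta):.5f}")
```

Under that assumption, an Eta of about -10.6 km over a column that is only about 4.8 km deep is what drives the factor negative, so the crash in CALC_R_STAR reads as a symptom of the bad Eta rather than as a separate problem.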
>>>> We found that this problem is related to the configuration of SIZE.h
>>>> and w2_e2setup.F.
>>>> We tested s216t_85x85/SIZE.h_216, s1800t_17x51/SIZE.h_450, and
>>>> s216t_85x85/SIZE.h_54.
>>>> They all failed with the same error as mentioned above.
>>>> But the s1350t_34x34/SIZE_270.h configuration works.
>>>> For s216t_85x85/SIZE.h_54 we also switched off optimization
>>>> (by setting FOPTIM = in the Makefile) but it has the same problem.
>>>> We checked the output at the second time step
>>>> but didn't find any obvious overlap problem.
>>>> Does anyone have any clue?
>>>>
>>>> hong
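The directory names in the message above encode the tile count and tile size. As a sanity check, the short sketch below recomputes the tile counts and the number of tiles per CPU for the four configurations tested. It assumes that CS510 means six cube faces of 510x510 points each, and it takes the CPU counts from the SIZE.h suffixes. The function name `tiles_total` is made up for this example.

```python
FACE_SIZE = 510  # assumed: CS510 has 510x510 grid points per cube face
N_FACES = 6      # a cubed sphere has six faces


def tiles_total(snx, sny, face=FACE_SIZE, faces=N_FACES):
    """Number of snx-by-sny tiles needed to cover all cube faces."""
    if face % snx or face % sny:
        raise ValueError("tile size must divide the face size evenly")
    return faces * (face // snx) * (face // sny)


# (tile nx, tile ny, number of CPUs) as named in the message above.
configs = [(85, 85, 216), (17, 51, 450), (85, 85, 54), (34, 34, 270)]
for snx, sny, ncpu in configs:
    n = tiles_total(snx, sny)
    print(f"{snx}x{sny}: {n} tiles on {ncpu} CPUs = {n / ncpu:g} tiles per CPU")
```

The computed totals (216, 1800, and 1350 tiles) match the numbers in the directory names, which supports the 510x510 assumption.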
>>>> _______________________________________________
>>>> MITgcm-devel mailing list
>>>> MITgcm-devel at mitgcm.org
>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>
>>> ---
>>> Patrick Heimbach | heimbach at mit.edu | http://www.mit.edu/~heimbach
>>> MIT | EAPS 54-1518 | 77 Massachusetts Ave | Cambridge MA 02139 USA
>>> FON +1-617-253-5259 | FAX +1-617-253-4464 | SKYPE patrick.heimbach
>>>
>>>