[MITgcm-devel] (not so) funny things happen in seaice_lsr and pickups
Jean-Michel Campin
jmc at ocean.mit.edu
Fri Feb 27 11:08:13 EST 2009
Hi Martin,
Just a short question:
Did you try with zero optimisation on the SX8 to see if it solves
some of the restart issues?
I also know that David does not always get a perfect restart on
Columbia; I don't remember whether that was with zero optimisation,
but he was using small tiles.
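In case it helps, turning off optimisation should just be a matter of
changing the optimisation flags in the SX8 build-options file passed to
genmake2 and rebuilding from scratch, roughly like this (the exact flag
the SX compiler wants may differ, so treat it as a sketch):

  # in the SX8 optfile handed to genmake2:
  FOPTIM='-O0'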
Cheers,
Jean-Michel
On Fri, Feb 27, 2009 at 04:29:58PM +0100, Martin Losch wrote:
> Hi Dimitris, et al.,
> thanks for your comments/suggestions. A lot points at the lat-lon
> grid, which would probably mean that the metric terms are the problem.
> I am now running a lat-lon configuration in which I turned off the
> metric terms in seaice_lsr. However, I am not quite sure whether the
> pickup issues are unrelated, because I can influence the results with
> things that should not have any effect (such as the pickup). Further
> comments on your suggestions below:
> On Feb 27, 2009, at 2:34 PM, Dimitris Menemenlis wrote:
>
>> Martin, let's assume for a moment that the restart and C-LSR blow-ups
>> are unrelated, and that the restart issue is peculiar to the SX8
>> compiler. From following this discussion, the common thread is that
>> C-LSR blows up on lat-lon grids but not on curvilinear grids. This
>> would also explain the problems that Matt has had on his SOSE grid,
>> while the CS510 integrations have run stably for 1000+ years of
>> cumulative integration.
>>
>> Some (random) suggestions:
>>
>> 1. turn on debugLevel=3 and monitorFreq=1 to see if you can catch
>> anything anomalous right before the crash
> That's how I found out that it is in the seaice code; I actually added
> debug_stats information to seaice_lsr before and after the iterations
> start (not yet checked in, but maybe useful? Let me know).
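> (For completeness, I have these set in the main "data" namelists
> roughly as below; I am quoting the parameter names from memory, so
> double-check them:
>
>   &PARM01
>    debugLevel = 3,
>   &
>   &PARM03
>    monitorFreq = 1.,
>   &
> )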
>>
>>
>> 2. increase the LSR solver accuracy and the number of allowed iterations
> Ongoing
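> (Concretely, I am tightening the solver settings in data.seaice along
> these lines; the names below (LSR_ERROR, SOLV_MAX_ITERS) are from
> memory, so please check them against seaice_readparms.F:
>
>   &SEAICE_PARM01
>    LSR_ERROR      = 1.E-12,
>    SOLV_MAX_ITERS = 1500,
>   &
> )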
>>
>>
>> 3. use a different compiler on the SX8 to see if the restart problem
>> goes away
> Not possible; this platform comes with only a cross compiler for the
> compute nodes. The head node is a Linux system with amd64 CPUs and
> g77/ifort, but that's irrelevant for this problem.
>>
>>
>> 4. use a different platform and/or compiler that does not exhibit the
>> restart problem, in order to help you debug your C-LSR lat-lon crashes.
> Honestly, that's what I should do in order to separate the restart
> problem from the ice problem, but I am afraid I don't have these
> resources at the moment. Also, I have not yet observed the seaice
> problem on other platforms (I think).
>>
>>
>> (if you do not have access to a different platform, you can send me
>> your
>> config details and I can set it up on one of the JPL or NAS altices)
> Thanks for offering: I put a 427MB tar file into my home directory on
> skylla, which expands into an input and a code directory. You need to
> edit SIZE.h: Nr needs to be 23 (not 50), and I usually use up to 18
> (nPx=6, nPy=3) non-vector CPUs for this configuration. On the SX8 it's
> only 1 (vector) CPU.
> Unfortunately, the configuration is quite stable, and you might need to
> run it longer than the 100 years that are specified in the data file
> (all with asynchronous time stepping, so it shouldn't take too long).
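> For reference, the SIZE.h edit amounts to setting
>
>   nPx =  6,
>   nPy =  3,
>   Nr  = 23
>
> (everything else stays as in the code directory of the tar file); for
> the single (vector) CPU run on the SX8 it is nPx=nPy=1 instead.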
>
> Martin
>
>
>>
>>
>> D.
>>
>> On Fri, 2009-02-27 at 14:07 +0100, Martin Losch wrote:
>>> Hi all, but probably in particular Jean-Michel,
>>>
>>> I have now found this on our SX8:
>>>
>>> 1. Restarts that work elsewhere (e.g. lab_sea on eddy.csail.mit.edu)
>>> do not work. I have no idea why; it is not connected with a
>>> particular package, since the restart is broken even for experiments
>>> where data.pkg has no entries. This is clearly an issue related to
>>> the SX8, as the restart behavior is regularly tested. I am still
>>> looking for the precise reason, but at the moment I am clueless.
>>> Suggestions are welcome.
>>>
>>> 2. "spontaneous" explosions happen in the C-LSR solver, but so far
>>> not
>>> in the B-LSR or C-EVP solver. I am not sure to what extent this is
>>> just coincidence. Currently this happens in a 1cpu-2deg-lat-lon
>>> configuration, a 2cpu Arctic configuration with a rotated lat-lon
>>> grid
>>> and .25deg resolution and with OBCS, and regional 0.5deg resolution
>>> for the Weddell Sea (so far without OBCS). I have run the CS510 for
>>> 16year without problems, also I have run the above Arcttc
>>> configuration with a curvilinear grid (basically the grid is the
>>> same,
>>> but the metric terms in the ice model are no there) without any
>>> problems. It "looks" like it's connected to the lat-lon grid (and
>>> thus
>>> metric terms?).
>>>
>>> 3. C-LSR (and B-LSR) is basically a set of iterations. At the
>>> beginning, the first time-level velocity is copied to the third:
>>> uice(i,j,3,bi,bj)=uice(i,j,1,bi,bj); later we compute an innovation
>>> like this:
>>> u(1) = u(3) + .95*(URT-u(3)),
>>> and at the end of each iteration there is an
>>> exch_uv_3d_rl(uice,vice,.true.,3,mythid).
>>> All of these computations happen within j=1,sNy; i=1,sNx (but
>>> partially in separate loops). u(3) is never used outside of
>>> seaice_lsr.F (and lsr.F), except in an obsolete and never-used
>>> ice/ocean-stress computation. I have made a change so that
>>> uice(3)=uice(1) is now done for the entire array: j=1-Oly,sNy+Oly;
>>> i=1-Olx,sNx+Olx, that is, including the overlaps. These overlaps of
>>> u(3) (and v(3)) are never touched elsewhere, except in the exchange
>>> routines. After this change (copy of u/v(1) to u/v(3), including
>>> overlaps), the results should not change; they do not change on,
>>> say, eddy.csail.mit.edu, but they do change on our SX8. In some
>>> cases the "spontaneous" explosions go away, in others they are
>>> "delayed" by order(1000) time steps.
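>>> (Schematically, the change to the initial copy is just extending the
>>> loop bounds to include the overlaps, something like the following;
>>> this is a simplified sketch, the actual loops in seaice_lsr.F differ
>>> in details:
>>>
>>>       DO bj=myByLo(myThid),myByHi(myThid)
>>>        DO bi=myBxLo(myThid),myBxHi(myThid)
>>>         DO j=1-Oly,sNy+Oly
>>>          DO i=1-Olx,sNx+Olx
>>>           uice(i,j,3,bi,bj) = uice(i,j,1,bi,bj)
>>>           vice(i,j,3,bi,bj) = vice(i,j,1,bi,bj)
>>>          ENDDO
>>>         ENDDO
>>>        ENDDO
>>>       ENDDO
>>>
>>> whereas before the copy only ran over j=1,sNy and i=1,sNx.)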
>>>
>>> My preliminary conclusion is that the problems with seaice_lsr and
>>> the pickups are actually connected. The only thing that can go wrong
>>> in the pickups is that something fishy is happening in the exchanges.
>>> The other option is that it is somehow connected to the metric terms
>>> in the ice model, which I find hard to believe; that would not
>>> explain the restart problem.
>>>
>>> What should I try next to figure out this problem?
>>>
>>> Martin
>>> cc to Olaf Klatt
>>>
>>>
>>> On Feb 19, 2009, at 8:57 AM, Martin Losch wrote:
>>>
>>>> Hi Jinlun and Matt, thanks for your comments,
>>>>
>>>> I did comparison runs with the B-grid code and with EVP, and in the
>>>> particular instances I am interested in, they do not crash. That's a
>>>> bit discomforting for me, but on the other hand, I do not use the
>>>> B-grid or EVP code too often, so I don't have an appropriate
>>>> statistical sample (again, in nearly all cases the C-LSR code is
>>>> absolutely stable, and Dimitris does all his CS510 runs with C-LSR).
>>>>
>>>> Matt, the original seaice_growth.F has lots of these
>>>>> HEFF(I,J,2,bi,bj) = MAX(0. _d 0, HEFF(I,J,2,bi,bj) )
>>>>> HSNOW(I,J,bi,bj)  = MAX(0. _d 0, HSNOW(I,J,bi,bj) )
>>>>> AREA(I,J,2,bi,bj) = MAX(0. _d 0, AREA(I,J,2,bi,bj) )
>>>> as well, but we will try this also. I don't think that the
>>>> thermodynamic growth is the problem; it's more likely that changing
>>>> anything in the sea ice model makes the model not crash at a
>>>> particular point (e.g., interrupting and restarting an integration
>>>> from a pickup rather than doing everything in one go; in this sense
>>>> changing from the C- to the B-grid is a change, too, and not a small
>>>> one). But I guess if we have some funny HEFF etc., the LSR solver
>>>> might get into trouble, too.
>>>> So I'll try this.
>>>>
>>>> Martin
>>>>
>>>> On Feb 18, 2009, at 5:42 PM, Jinlun Zhang wrote:
>>>>
>>>>> Martin,
>>>>> Have you tried LSR on the B-grid with the bug fixed, just for
>>>>> comparison?
>>>>> Good luck, Jinlun
>>>>>
>>>>> Martin Losch wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> just to let you know that we are experiencing problems with the
>>>>>> LSR sea ice solver on the C-grid: at unpredictable points of the
>>>>>> integration, it appears to become unstable and blows up. I have
>>>>>> not been able to isolate this in all cases, because a small issue
>>>>>> with pickups hampers this:
>>>>>>
>>>>>> Apparently, starting from a pickup is NOT exact. We have tried the
>>>>>> famous 2+2=4 test with our 8-CPU job on our SX8 (cc to Olaf, who
>>>>>> has been mostly involved in this) and found no difference in the
>>>>>> cg2d output (or other output). However, when we run an experiment
>>>>>> for a longer time, the same test fails, e.g., 2160+2160 != 4320
>>>>>> (we can provide plots if required). I assume that this is
>>>>>> expected, because double precision is not more than double
>>>>>> precision, and the cg2d output (and other monitor output) only
>>>>>> ever shows 15 digits, so we don't know about the 16th one,
>>>>>> correct? Anyway, this tiny pickup issue hinders me from
>>>>>> approaching the point of the model crash with pickups, because
>>>>>> after starting from a pickup, the model integrates beyond the
>>>>>> problem and crashes (sometimes) at a much later time. This is to
>>>>>> say that the problem in seaice_lsr (the problem only appears when
>>>>>> the C-LSR solver is used) is very sensitive; the code crashes
>>>>>> without any warning from one time step to the next. A while ago,
>>>>>> in a different case, I was able to get close enough to the point
>>>>>> of crashing to do some diagnostics, but it is almost impossible to
>>>>>> identify why the model explodes. I am assuming that for random
>>>>>> pathological cases one or more matrix entries are nearly zero,
>>>>>> which then prevents the solver from converging.
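>>>>>> (To spell out the 2+2=4 test: one run with nTimeSteps=4320 in the
>>>>>> PARM03 namelist of "data" is compared against two consecutive legs
>>>>>> of 2160 steps, the second started from the pickup written by the
>>>>>> first, roughly:
>>>>>>
>>>>>>  &PARM03
>>>>>>   nIter0     = 2160,
>>>>>>   nTimeSteps = 2160,
>>>>>>  &
>>>>>>
>>>>>> and then the monitor/cg2d output of the overlapping steps is
>>>>>> compared.)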
>>>>>>
>>>>>> Any comments? Any similar experience?
>>>>>>
>>>>>> I run this code in so many different configurations, and I have
>>>>>> these problems only very seldom/randomly, so I am a little at a
>>>>>> loss as to where I should continue looking; any hint is
>>>>>> appreciated.
>>>>>>
>>>>>> Martin
>>>>>>
>> --
>> Dimitris Menemenlis <DMenemenlis at gmail.com>
>>