[MITgcm-devel] (not so) funny things happen in seaice_lsr and pickups
Martin Losch
Martin.Losch at awi.de
Fri Feb 27 10:29:58 EST 2009
Hi Dimitris, et al.,
thanks for your comments/suggestions. A lot points to the lat-lon
grid, which would probably mean that the metric terms are the problem.
I am now running a lat-lon grid where I turned off the metric terms
in seaice_lsr. However, I am not quite sure that the pickup issues are
unrelated, because I can influence the results with things that should
not have any effect (such as the pickup). Further comments on your
suggestions below:
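As a side note on the metric-terms hypothesis: on a spherical lat-lon
grid the metric terms in the momentum equations carry a factor of
tan(latitude)/R, which grows rapidly toward the poles. A minimal sketch
(plain Python rather than the model's Fortran, with illustrative names
and an illustrative radius):

```python
import math

R_SPHERE = 6.371e6  # sphere radius in metres (illustrative value)

def metric_factor(lat_deg):
    """tan(lat)/R: the factor multiplying metric terms such as
    u*v*tan(phi)/R in the momentum equations on a spherical lat-lon
    grid.  It grows without bound toward the poles, which is one way
    a lat-lon configuration can differ numerically from a curvilinear
    one covering the same region."""
    return math.tan(math.radians(lat_deg)) / R_SPHERE

for lat in (40.0, 60.0, 80.0, 89.0):
    print(f"lat {lat:4.1f} deg: tan(lat)/R = {metric_factor(lat):.3e} 1/m")
```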
On Feb 27, 2009, at 2:34 PM, Dimitris Menemenlis wrote:
> Martin, let's assume for a moment that restart and c-lsr blow-ups are
> unrelated. Let's assume that the restart issue is peculiar to SX8
> compiler. From following this discussion, the common thread is that
> c-lsr blows up in lat-lon grids but not in curvilinear grids. This
> would also explain the problems that Matt has had on his SOSE grid
> while
> the CS510 integrations have run stably for 1000+ years of cumulative
> integrations.
>
> Some (random) suggestions:
>
> 1. turn on debuglevel=3 and monitorfreq=1, to see if you can catch
> anything anomalous right before crash
That's how I found out that it is in the seaice code: I added
debug_stats information to seaice_lsr before and after the iterations
start (not yet checked in, but maybe useful? Let me know).
>
>
> 2. increase lsr solver accuracy and number of allowed iterations
Ongoing
>
>
> 3. use a different compiler on SXB to see if restart problem goes away
Not possible, this platform comes only with a cross compiler for the
compute nodes. The head node is a linux system with amd64 cpus with
g77/ifort, but that's irrelevant for this problem.
>
>
> 4. use a different platform and/or compiler, which do not exhibit the
> restart problem, in order to help you debug your c-lsr-latlon crashes.
Honestly, that's what I should do in order to separate the restart
problem from the ice problem, but I don't have these resources at the
moment, I am afraid. Also I have not yet observed the seaice problem
on other platforms (I think).
>
>
> (if you do not have access to a different platform, you can send me
> your
> config details and I can set it up on one of the JPL or NAS altices)
Thanks for offering: I put a 427MB tar file into my home directory on
skylla, which expands into an input and a code directory. You need to
edit SIZE.h: Nr needs to be 23 (not 50), and I usually use up to 18
(nPx=6,nPy=3) non-vector cpus for this configuration. On the SX8 it's
only 1 (vector) cpu.
Unfortunately, the configuration is quite stable, and you might need to
run it longer than the 100 years that are specified in the data file
(all with asynchronous time stepping, so it shouldn't take too long).
Martin
>
>
> D.
>
> On Fri, 2009-02-27 at 14:07 +0100, Martin Losch wrote:
>> Hi all, but probably in particular Jean-Michel,
>>
>> I have now found this on our SX8:
>>
>> 1. restarts that work elsewhere (e.g. lab_sea on eddy.csail.mit.edu)
>> do not work. I have no idea why, it is not connected with a
>> particular
>> package, but also for experiments where data.pkg has no entries the
>> restart is broken. This is clearly an issue related to SX8, as the
>> restart behavior is regularly tested. I am still looking for the
>> precise reason, but at the moment I am clueless. Suggestions are
>> welcome.
>>
>> 2. "spontaneous" explosions happen in the C-LSR solver, but so far
>> not
>> in the B-LSR or C-EVP solver. I am not sure to what extent this is
>> just coincidence. Currently this happens in a 1cpu-2deg-lat-lon
>> configuration, a 2cpu Arctic configuration with a rotated lat-lon
>> grid
>> and .25deg resolution and with OBCS, and regional 0.5deg resolution
>> for the Weddell Sea (so far without OBCS). I have run the CS510 for
>> 16 years without problems; also, I have run the above Arctic
>> configuration with a curvilinear grid (basically the grid is the
>> same,
>> but the metric terms in the ice model are not there) without any
>> problems. It "looks" like it's connected to the lat-lon grid (and
>> thus
>> metric terms?).
>>
>> 3. C-LSR (and B-LSR) is basically a set of iterations. At the
>> beginning,
>> the first timelevel velocity is copied to the third: uice(i,j,
>> 3,bi,bj)=uice(i,j,1,bi,bj), then later we compute an innovation like
>> this:
>> u(1) = u(3) + .95*(URT-u(3)).
>> and at the end of each iteration there is an
>> exch_uv_3d_rl(uice,vice,.true.,3,mythid).
>> All of these computations happen within j=1,sNy; i=1,sNx (but
>> partially in separate loops). u(3) is never used outside of
>> seaice_lsr.F (lsr.F), except in some obsolete and never-used
>> ice/ocean-stress computation. I have made a change so that
>> uice(3)=uice(1) is
>> now done for the entire array: j=1-Oly,sNy+Oly; i=1-Olx,sNx+Olx, that
>> is including the overlaps. These overlaps of u(3) (and v(3)) are
>> never
>> touched elsewhere, except in the exchange routines. After this change
>> (copy of u/v(1) to u/v(3), including overlaps), the results should
>> not
>> change; they do not change on, say, eddy.csail.mit.edu, but they do
>> change on our SX8. In some cases the "spontaneous" explosions go
>> away,
>> in others they are "delayed" by order(1000) timesteps.
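The structure described in point 3 can be sketched roughly as follows
(plain Python with made-up tile sizes; u1/u3/urt stand for
uice(:,:,1), uice(:,:,3) and the tridiagonal-solve result URT, and are
not the actual MITgcm variables):

```python
OLX = 2   # halo (overlap) width, illustrative
SNX = 6   # interior tile width, illustrative
FULL = SNX + 2 * OLX

def copy_interior(src):
    """uice(:,3) = uice(:,1) over the interior only (i = 1..sNx):
    the halo cells of the copy keep whatever was there before."""
    dst = [float('nan')] * FULL          # stale halo values
    dst[OLX:OLX + SNX] = src[OLX:OLX + SNX]
    return dst

def copy_with_overlap(src):
    """The change described above: copy the full array including the
    halos (i = 1-OLx .. sNx+OLx), so u(3) never carries stale halo
    values between exchanges."""
    return list(src)

def relax_update(u3, urt, relax=0.95):
    """One under-relaxation step of the iteration:
    u(1) = u(3) + 0.95*(URT - u(3))."""
    return [a + relax * (b - a) for a, b in zip(u3, urt)]

u1 = [float(i) for i in range(FULL)]
u3a = copy_interior(u1)      # halos of u3a are stale (NaN here)
u3b = copy_with_overlap(u1)  # identical to u1 everywhere
print(u3a[0], u3b[0])        # nan 0.0
```

The sketch only illustrates why the two copy variants can give
bit-different results when the halo values of u(3) ever reach the
arithmetic (e.g. through the exchange routines); it is not the model
code.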
>>
>> My preliminary conclusion is that the problems with seaice_lsr and
>> the pickups are actually connected. The only thing that can go wrong
>> in the pickups is that something fishy is happening in the exchanges.
>> The other option is that it is somehow connected to the metric terms
>> in the ice model, which I find hard to believe; it would not explain
>> the restart problem.
>>
>> What should I try next to figure out this problem?
>>
>> Martin
>> cc to Olaf Klatt
>>
>>
>> On Feb 19, 2009, at 8:57 AM, Martin Losch wrote:
>>
>>> Hi Jinlun and Matt, thanks for your comments,
>>>
>>> I did comparison runs with the B-grid code and with EVP and in the
>>> particular instances I am interested in, they do not crash. That's a
>>> bit discomforting for me, but on the other hand, I do not use the B-
>>> grid or EVP code too often, so that I don't have an appropriate
>>> statistical sample (again, in nearly all cases the C-LSR code is
>>> absolutely stable, and Dimitris does all his CS510 runs with C-
>>> LSR).
>>>
>>> Matt, the original seaice_growth.F has lots of these
>>>> HEFF(I,J,2,bi,bj)  = MAX(0. _d 0, HEFF(I,J,2,bi,bj) )
>>>> HSNOW(I,J,bi,bj)   = MAX(0. _d 0, HSNOW(I,J,bi,bj) )
>>>> AREA(I,J,2,bi,bj)  = MAX(0. _d 0, AREA(I,J,2,bi,bj) )
>>> as well, but we will try this also. I don't think that the
>>> thermodynamic growth is the problem, it's more likely that changing
>>> anything in the sea ice model makes the model not crash at a
>>> particular point (e.g., interrupting and restarting an integration
>>> from a pickup rather than doing everything in one go, in this sense
>>> changing from C to B-grid is a change, too, and not a small one),
>>> but I guess, if we have some funny HEFF etc, the LSR solver might
>>> get into trouble, too.
>>> So I'll try this.
>>>
>>> Martin
>>>
>>> On Feb 18, 2009, at 5:42 PM, Jinlun Zhang wrote:
>>>
>>>> Martin,
>>>> Have you tried LSR on B-grid with the bug fixed, just for a
>>>> comparison?
>>>> Good luck, Jinlun
>>>>
>>>> Martin Losch wrote:
>>>>> Hi all,
>>>>>
>>>>> just to let you know that we are experiencing problems with the
>>>>> LSR sea ice solver on the C-grid: At unpredictable points of the
>>>>> integration, it appears to become unstable and blows up. I have
>>>>> not been able to isolate this in all cases, because a small issue
>>>>> with pickups hampers this:
>>>>>
>>>>> Apparently, starting from pickup is NOT exact. We have tried the
>>>>> famous 2+2=4 test with our 8CPU job on our SX8 (cc to Olaf, who's
>>>>> been mostly involved in this) and found no difference between the
>>>>> cg2d output (and other output). However, when we run an experiment
>>>>> for a longer time, the same test fails, e.g., 2160+2160 != 4320
>>>>> (we can provide plots if required). I assume that this is
>>>>> expected, because double precision is not more than double
>>>>> precision, and in the cg2d output (and other monitor output) there
>>>>> are always only 15 digits, and we don't know about the 16th one,
>>>>> correct? Anyway, this tiny pickup issue hinders me from
>>>>> approaching the point of model crash with pickups, because after
>>>>> starting from a pickup, the model integrates beyond the problem and
>>>>> crashes (sometimes) at a much later time. This is to say that the
>>>>> problem in seaice_lsr (which only appears when the C-LSR
>>>>> solver is used) is very sensitive; the code crashes without any
>>>>> warning from one time step to the other. A while ago, in a
>>>>> different case I was able to get close enough to the point of
>>>>> crashing to do some diagnostics, but it's almost impossible to
>>>>> identify why the model explodes. I am assuming that for random
>>>>> pathological cases one or more matrix entries are nearly zero,
>>>>> which then prevents the solver from converging.
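The 15-digit point above can be illustrated in a few lines of Python,
whose floats are IEEE doubles: two states that differ only in the last
bit print identically at 15 significant digits, so identical monitor
output does not prove a bitwise-identical restart.

```python
import sys

eps = sys.float_info.epsilon   # ~2.22e-16: gap between 1.0 and the next double
a = 1.0
b = 1.0 + eps                  # differs from a only in the last bit

# Both round to the same 15-significant-digit string...
print(f"{a:.15g}")   # 1
print(f"{b:.15g}")   # 1
# ...yet the underlying doubles are different:
print(a == b)        # False
```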
>>>>>
>>>>> Any comments? Any similar experience?
>>>>>
>>>>> I run this code in so many different configurations, and I have
>>>>> these problems only very seldom/randomly, so I am a little at a
>>>>> loss where I should continue looking, so any hint is appreciated.
>>>>>
>>>>> Martin
>>>>>
>>>> _______________________________________________
>>>> MITgcm-devel mailing list
>>>> MITgcm-devel at mitgcm.org
>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>
>>
> --
> Dimitris Menemenlis <DMenemenlis at gmail.com>
>