[MITgcm-devel] (not so) funny things happen in seaice_lsr and pickups

Fri Feb 27 08:34:19 EST 2009

Martin, let's assume for a moment that restart and c-lsr blow-ups are
unrelated.  Let's assume that the restart issue is peculiar to the SX8
compiler.  From following this discussion, the common thread is that
c-lsr blows up in lat-lon grids but not in curvilinear grids.  This
would also explain the problems that Matt has had on his SOSE grid while
the CS510 integrations have run stably for 1000+ years of cumulative
integrations.

Some (random) suggestions:

1. turn on debugLevel=3 and monitorFreq=1, to see if you can catch
anything anomalous right before the crash

2. increase lsr solver accuracy and number of allowed iterations

3. use a different compiler on the SX8 to see if the restart problem goes away

4. use a different platform and/or compiler that does not exhibit the
restart problem, in order to help you debug your c-lsr-latlon crashes.

(if you do not have access to a different platform, you can send me your
config details and I can set it up on one of the JPL or NAS altices)

D.

On Fri, 2009-02-27 at 14:07 +0100, Martin Losch wrote:
> Hi all, but probably in particular Jean-Michel,
> 
> I have now found this on our SX8:
> 
> 1. restarts that work elsewhere (e.g. lab_sea on eddy.csail.mit.edu)  
> do not work. I have no idea why; it is not connected with a particular
> package, since the restart is also broken for experiments where
> data.pkg has no entries. This is clearly an issue related to SX8, as the
> restart behavior is regularly tested. I am still looking for the
> precise reason, but at the moment I am clueless. Suggestions are
> welcome.
> 
> 2. "spontaneous" explosions happen in the C-LSR solver, but so far not  
> in the B-LSR or C-EVP solver. I am not sure to what extent this is  
> just coincidence. Currently this happens in a 1cpu-2deg-lat-lon  
> configuration, a 2cpu Arctic configuration with a rotated lat-lon grid  
> and .25deg resolution and with OBCS, and regional 0.5deg resolution  
> for the Weddell Sea (so far without OBCS). I have run the CS510 for
> 16 years without problems, and I have also run the above Arctic
> configuration with a curvilinear grid (basically the grid is the same,
> but the metric terms in the ice model are not there) without any
> problems. It "looks" like it's connected to the lat-lon grid (and thus  
> metric terms?).
> 
> 3. C-LSR (and B-LSR) is basically a set of iterations. At the beginning,
> the first time-level velocity is copied to the third:
> uice(i,j,3,bi,bj)=uice(i,j,1,bi,bj), then later we compute an
> innovation like this:
> u(1) = u(3) + .95*(URT-u(3)).
> and at the end of each iteration there is an
> exch_uv_3d_rl(uice,vice,.true.,3,mythid).
> All of these computations happen within j=1,sNy; i=1,sNx (but
> partially in separate loops). u(3) is never used outside of
> seaice_lsr.F (lsr.F, except in some obsolete and never used
> ice/ocean-stress computation). I have made a change so that
> uice(3)=uice(1) is now done for the entire array: j=1-Oly,sNy+Oly;
> i=1-Olx,sNx+Olx, that is, including the overlaps. These overlaps of
> u(3) (and v(3)) are never touched elsewhere, except in the exchange
> routines. After this change (copy of u/v(1) to u/v(3), including
> overlaps), the results should not change; they do not change on, say,
> eddy.csail.mit.edu, but they do change on our SX8. In some cases the
> "spontaneous" explosions go away, in others they are "delayed" by
> order(1000) timesteps.
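A minimal sketch of the reasoning in point 3, in Python rather than the
model's Fortran, on a 1-D periodic tile. The names OL, SN, STALE and
exchange are hypothetical stand-ins, not MITgcm code; exchange only mimics
what exch_uv_3d_rl is expected to do to the overlaps. If the exchange fills
every overlap point from interior data, copying u(1) to u(3) with or without
the overlaps should give the same field once the exchange has been called:

```python
OL, SN = 2, 6          # overlap width and interior size (hypothetical values)
STALE = -999.0         # marks overlap points that were never initialised

def exchange(a):
    # Fill both overlaps of a 1-D periodic tile from its own interior.
    a = list(a)
    for k in range(OL):
        a[k] = a[SN + k]              # left overlap  <- last OL interior points
        a[OL + SN + k] = a[OL + k]    # right overlap <- first OL interior points
    return a

# u1 plays the role of uice(:,:,1): interior values with consistent overlaps.
u1 = exchange([STALE] * OL + [float(i) for i in range(SN)] + [STALE] * OL)

# Variant A: copy the interior only (i=1,sNx); overlaps of u3 stay stale.
u3_interior = [STALE] * (SN + 2 * OL)
u3_interior[OL:OL + SN] = u1[OL:OL + SN]

# Variant B: copy the whole array (i=1-Olx,sNx+Olx), overlaps included.
u3_full = list(u1)

differ_before = u3_interior != u3_full                     # overlaps differ
agree_after = exchange(u3_interior) == exchange(u3_full)   # exchange repairs them
print(differ_before, agree_after)
```

In this toy setting the two variants can only differ after the exchange if
the exchange leaves some overlap point unfilled, or if something reads the
overlaps of u(3) before the exchange is called.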
> 
> My preliminary conclusion is that the problems with seaice_lsr and
> pickups are actually connected. The only thing that can go wrong in
> the pickups is that something fishy is happening in the exchanges.
> The other option is that it is somehow connected to metric terms in the
> ice model, which I find hard to believe; it would not explain the
> restart problem.
> 
> What should I try next to figure out this problem?
> 
> Martin
> cc to Olaf Klatt
> 
> 
> On Feb 19, 2009, at 8:57 AM, Martin Losch wrote:
> 
> > Hi Jinlun and Matt, thanks for your comments,
> >
> > I did comparison runs with the B-grid code and with EVP and in the  
> > particular instances I am interested in, they do not crash. That's a  
> > bit discomforting for me, but on the other hand, I do not use the
> > B-grid or EVP code too often, so I don't have an appropriate
> > statistical sample (again, in nearly all cases the C-LSR code is
> > absolutely stable, and Dimitris does all his CS510 runs with C-LSR).
> >
> > Matt, the original seaice_growth.F has lots of these
> >>                 HEFF(I,J,2,bi,bj)  = MAX(0. _d 0, HEFF(I,J,2,bi,bj)  )
> >>                 HSNOW(I,J,bi,bj)   = MAX(0. _d 0, HSNOW(I,J,bi,bj)   )
> >>                 AREA(I,J,2,bi,bj)  = MAX(0. _d 0, AREA(I,J,2,bi,bj)  )
> > as well, but we will try this also. I don't think that the  
> > thermodynamic growth is the problem; it's more likely that changing
> > anything in the sea ice model makes the model not crash at a
> > particular point (e.g., interrupting and restarting an integration
> > from a pickup rather than doing everything in one go; in this sense
> > changing from C to B-grid is a change, too, and not a small one),
> > but I guess, if we have some funny HEFF etc., the LSR solver might
> > get into trouble, too.
> > So I'll try this.
> >
> > Martin
> >
> > On Feb 18, 2009, at 5:42 PM, Jinlun Zhang wrote:
> >
> >> Martin,
> >> Have you tried LSR on B-grid with the bug fixed, just for a  
> >> comparison?
> >> Good luck, Jinlun
> >>
> >> Martin Losch wrote:
> >>> Hi all,
> >>>
> >>> just to let you know that we are experiencing problems with the  
> >>> LSR sea ice solver on the C-grid: At unpredictable points of the
> >>> integration, it appears to become unstable and blows up. I have
> >>> not been able to isolate this in all cases, because a small issue  
> >>> with pickups hampers this:
> >>>
> >>> Apparently, starting from pickup is NOT exact. We have tried the  
> >>> famous 2+2=4 test with our 8CPU job on our SX8 (cc to Olaf, who's  
> >>> been mostly involved in this) and found no difference between the  
> >>> cg2d output (and other output). However, when we run an experiment  
> >>> for a longer time, the same test fails, e.g., 2160+2160 != 4320  
> >>> (we can provide plots if required). I assume that this is  
> >>> expected, because double precision is not more than double
> >>> precision and in the cg2d output (and other monitor output) there
> >>> are always only 15 digits, and we don't know about the 16th one,
> >>> correct? Anyway, this tiny pickup issue hinders me from
> >>> approaching the point of model crash with pickups, because after
> >>> starting from a pickup, the model integrates beyond the problem and
> >>> crashes (sometimes) at a much later time. This is to say that the
> >>> problem in seaice_lsr (the problem only appears when the C-LSR
> >>> solver is used) is very sensitive; the code crashes without any
> >>> warning from one time step to the other. A while ago, in a
> >>> different case I was able to get close enough to the point of
> >>> crashing to do some diagnostics, but it's almost impossible to
> >>> identify why the model explodes. I am assuming that for random
> >>> pathological cases one or more matrix entries are nearly zero,  
> >>> which then prevents the solver from converging.
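A toy illustration of the digits question above, in Python and under loud
assumptions: the logistic map stands in for a nonlinear model step, and
r = 3.9, the starting value 0.13 and the offset 4e-16 are arbitrary choices
that have nothing to do with MITgcm. The sketch compares two doubles that
agree in all 15 printed digits but not bit for bit, and then follows how far
the two trajectories drift apart, which is why 15-digit monitor output
cannot by itself show that a restart is exact:

```python
def step(x, r=3.9):
    # One step of the logistic map: a toy nonlinear "model time step".
    return r * x * (1.0 - x)

def run(x, nsteps):
    # Integrate nsteps steps starting from state x.
    for _ in range(nsteps):
        x = step(x)
    return x

a = 0.13
b = 0.13 + 4e-16         # differs from a only beyond the 15th digit

same_at_15_digits = ("%.14e" % a) == ("%.14e" % b)   # 15 significant digits
same_bits = a.hex() == b.hex()                       # exact, bitwise comparison

# The "2+2=4" test with an exact hand-over of the full state.
exact_restart_ok = run(run(a, 2), 2) == run(a, 4)

# Follow both trajectories and record the largest late-time difference.
xa, xb, max_diff = a, b, 0.0
for n in range(300):
    xa, xb = step(xa), step(xb)
    if n >= 100:
        max_diff = max(max_diff, abs(xa - xb))

print(same_at_15_digits, same_bits, exact_restart_ok, max_diff)
```

Comparing a.hex() (or the raw binary pickup and output files) rather than
printed decimals is one way to check the 16th digit directly.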
> >>>
> >>> Any comments? Any similar experience?
> >>>
> >>> I run this code in so many different configurations, and I have  
> >>> these problems only very seldom/randomly, so I am a little at a  
> >>> loss where I should continue looking, so any hint is appreciated.
> >>>
> >>> Martin
> >>>
> >> _______________________________________________
> >> MITgcm-devel mailing list
> >> MITgcm-devel at mitgcm.org
> >> http://mitgcm.org/mailman/listinfo/mitgcm-devel
-- 
Dimitris Menemenlis <DMenemenlis at gmail.com>