[MITgcm-devel] (not so) funny things happen in seaice_lsr and pickups
Dimitris Menemenlis
dmenemenlis at gmail.com
Fri Feb 27 08:34:19 EST 2009
Martin, let's assume for a moment that restart and c-lsr blow-ups are
unrelated. Let's assume that the restart issue is peculiar to the SX8
compiler. From following this discussion, the common thread is that
c-lsr blows up on lat-lon grids but not on curvilinear grids. This
would also explain the problems that Matt has had on his SOSE grid while
the CS510 integrations have run stably for 1000+ years of cumulative
integrations.
Some (random) suggestions:
1. turn on debuglevel=3 and monitorfreq=1, to see if you can catch
anything anomalous right before crash
2. increase lsr solver accuracy and number of allowed iterations
3. use a different compiler on SX8 to see if restart problem goes away
4. use a different platform and/or compiler, which do not exhibit the
restart problem, in order to help you debug your c-lsr-latlon crashes.
(if you do not have access to a different platform, you can send me your
config details and I can set it up on one of the JPL or NAS altices)
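For suggestion 1, a small script can scan the monitor output for the first time step at which a statistic leaves a plausible range. The following is a minimal sketch and not part of MITgcm: it assumes monitor lines of the form "%MON name = value" in STDOUT and that each monitor block starts with time_tsnumber. The statistic name seaice_uice_max, the sample text and the threshold are placeholders to adapt.

```python
import re

# Assumed line shape: "(PID.TID 0000.0001) %MON seaice_uice_max = 1.2E-01"
MON_LINE = re.compile(r"%MON\s+(\S+)\s*=\s*(\S+)")

def parse_monitor(text):
    """Return one dict {name: value} per monitor block found in text."""
    records, current = [], {}
    for line in text.splitlines():
        match = MON_LINE.search(line)
        if not match:
            continue
        name, raw = match.groups()
        try:
            value = float(raw)          # also accepts "NaN"
        except ValueError:
            continue                    # skip values Python cannot parse
        # Assumption: a new block begins with the time step number.
        if name == "time_tsnumber" and current:
            records.append(current)
            current = {}
        current[name] = value
    if current:
        records.append(current)
    return records

def first_anomaly(records, name, limit):
    """Return (time step, value) of the first NaN or |value| > limit."""
    for rec in records:
        value = rec.get(name)
        if value is not None and (value != value or abs(value) > limit):
            return int(rec.get("time_tsnumber", -1)), value
    return None

# Made-up sample output, for illustration only.
sample = """\
(PID.TID 0000.0001) %MON time_tsnumber = 10
(PID.TID 0000.0001) %MON seaice_uice_max = 2.5000000000000E-01
(PID.TID 0000.0001) %MON time_tsnumber = 11
(PID.TID 0000.0001) %MON seaice_uice_max = 4.0000000000000E+01
"""
records = parse_monitor(sample)
hit = first_anomaly(records, "seaice_uice_max", limit=5.0)
print(hit)
```

With monitor output written at every time step, the returned step number gives a point shortly before the crash at which to start looking at the fields.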
D.
On Fri, 2009-02-27 at 14:07 +0100, Martin Losch wrote:
> Hi all, but probably in particular Jean-Michel,
>
> I have now found this on our SX8:
>
> 1. restarts that work elsewhere (e.g. lab_sea on eddy.csail.mit.edu)
> do not work. I have no idea why; it is not connected with a particular
> package, because even for experiments where data.pkg has no entries the
> restart is broken. This is clearly an issue related to the SX8, as the
> restart behavior is regularly tested. I am still looking for the
> precise reason, but at the moment I am clueless. Suggestions are
> welcome.
>
> 2. "spontaneous" explosions happen in the C-LSR solver, but so far not
> in the B-LSR or C-EVP solver. I am not sure to what extent this is
> just coincidence. Currently this happens in a 1cpu-2deg-lat-lon
> configuration, a 2cpu Arctic configuration with a rotated lat-lon grid
> and .25deg resolution and with OBCS, and regional 0.5deg resolution
> for the Weddell Sea (so far without OBCS). I have run the CS510 for
> 16 years without problems, and I have also run the above Arctic
> configuration with a curvilinear grid (basically the grid is the same,
> but the metric terms in the ice model are not there) without any
> problems. It "looks" like it's connected to the lat-lon grid (and thus
> metric terms?).
>
> 3. C-LSR (and B-LSR) is basically a set of iterations. At the beginning,
> the first timelevel velocity is copied to the third:
> uice(i,j,3,bi,bj)=uice(i,j,1,bi,bj), then later we compute an innovation
> like this:
> u(1) = u(3) + .95*(URT-u(3)).
> and at the end of each iteration there is an
> exch_uv_3d_rl(uice,vice,.true.,3,mythid).
> All of these computations happen within j=1,sNy; i=1,sNx (but
> partially in separate loops). u(3) is never used outside of
> seaice_lsr.F (lsr.F, except in some obsolete and never used
> ice/ocean-stress computation). I have made a change so that
> uice(3)=uice(1) is
> now done for the entire array: j=1-Oly,sNy+Oly; i=1-Olx,sNx+Olx, that
> is including the overlaps. These overlaps of u(3) (and v(3)) are never
> touched elsewhere, except in the exchange routines. After this change
> (copy of u/v(1) to u/v(3), including overlaps), the results should not
> change; they do not change on, say, eddy.csail.mit.edu, but they do
> change on our SX8. In some cases the "spontaneous" explosions go away,
> in others they are "delayed" by order(1000) timesteps.
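The argument in item 3 can be illustrated with a toy version of one iteration. This is a sketch in Python, not the code in seaice_lsr.F: the tile sizes, overlap widths and field values are made up, URT is taken as given instead of coming from the line solve, and only the copy of level 1 to level 3 and the relaxation step are modelled.

```python
OLX, OLY = 2, 2      # overlap (halo) widths, hypothetical values
SNX, SNY = 4, 3      # interior tile size, hypothetical values
RELAX = 0.95         # relaxation factor quoted in the message

def make_field(func):
    """Build a (SNY+2*OLY) x (SNX+2*OLX) field from func(j, i)."""
    return [[func(j, i) for i in range(SNX + 2 * OLX)]
            for j in range(SNY + 2 * OLY)]

def copy_level(src, dst, include_overlaps):
    """Copy src into dst over the interior only, or over the full array."""
    j0, j1 = (0, SNY + 2 * OLY) if include_overlaps else (OLY, OLY + SNY)
    i0, i1 = (0, SNX + 2 * OLX) if include_overlaps else (OLX, OLX + SNX)
    for j in range(j0, j1):
        for i in range(i0, i1):
            dst[j][i] = src[j][i]

def relax_interior(u1, u3, urt):
    """Update u(1) = u(3) + RELAX*(URT - u(3)) on interior points only."""
    for j in range(OLY, OLY + SNY):
        for i in range(OLX, OLX + SNX):
            u1[j][i] = u3[j][i] + RELAX * (urt[j][i] - u3[j][i])

def one_iteration(include_overlaps):
    u1 = make_field(lambda j, i: 0.01 * (10 * j + i))
    u3 = make_field(lambda j, i: 999.0)        # stale values in level 3
    urt = make_field(lambda j, i: 0.02 * (j + i))
    copy_level(u1, u3, include_overlaps)
    relax_interior(u1, u3, urt)
    return u1, u3

u1_a, u3_a = one_iteration(include_overlaps=False)
u1_b, u3_b = one_iteration(include_overlaps=True)

same_interior = all(u1_a[j][i] == u1_b[j][i]
                    for j in range(OLY, OLY + SNY)
                    for i in range(OLX, OLX + SNX))
# Only the overlap of level 3 differs between the two variants.
stale_halo = u3_a[0][0] == 999.0 and u3_b[0][0] != 999.0
print(same_interior, stale_halo)
```

As long as the update reads level 3 only on interior points, both variants give the same interior values and differ only in the overlap of level 3. A platform on which the results change would therefore suggest that something reads the overlap values before they are refreshed, which is consistent with suspecting the exchanges.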
>
> My preliminary conclusion is that the problems with seaice_lsr and
> pickups are actually connected. The only thing that can go wrong in
> the pickups is that something fishy is happening in the exchanges.
> The other option is that it is somehow connected to the metric terms
> in the ice model, which I find hard to believe; it would not explain
> the restart problem.
>
> What should I try next to figure out this problem?
>
> Martin
> cc to Olaf Klatt
>
>
> On Feb 19, 2009, at 8:57 AM, Martin Losch wrote:
>
> > Hi Jinlun and Matt, thanks for your comments,
> >
> > I did comparison runs with the B-grid code and with EVP and in the
> > particular instances I am interested in, they do not crash. That's a
> > bit discomforting for me, but on the other hand, I do not use the
> > B-grid or EVP code too often, so that I don't have an appropriate
> > statistical sample (again, in nearly all cases the C-LSR code is
> > absolutely stable, and Dimitris does all his CS510 runs with C-LSR).
> >
> > Matt, the original seaice_growth.F has lots of these
> >> HEFF(I,J,2,bi,bj) = MAX(0. _d 0, HEFF(I,J,2,bi,bj) )
> >> HSNOW(I,J,bi,bj) = MAX(0. _d 0, HSNOW(I,J,bi,bj) )
> >> AREA(I,J,2,bi,bj) = MAX(0. _d 0, AREA(I,J,2,bi,bj) )
> > as well, but we will try this also. I don't think that the
> > thermodynamic growth is the problem, it's more likely that changing
> > anything in the sea ice model makes the model not crash at a
> > particular point (e.g., interrupting and restarting an integration
> > from a pickup rather than doing everything in one go, in this sense
> > changing from C to B-grid is a change, too, and not a small one),
> > but I guess, if we have some funny HEFF etc, the LSR solver might
> > get into trouble, too.
> > So I'll try this.
> >
> > Martin
> >
> > On Feb 18, 2009, at 5:42 PM, Jinlun Zhang wrote:
> >
> >> Martin,
> >> Have you tried LSR on B-grid with the bug fixed, just for a
> >> comparison?
> >> Good luck, Jinlun
> >>
> >> Martin Losch wrote:
> >>> Hi all,
> >>>
> >>> just to let you know that we are experiencing problems with the
> >>> LSR sea ice solver on the C-grid: At unpredictable points of the
> >>> integration, it appears to become unstable and blows up. I have
> >>> not been able to isolate this in all cases, because a small issue
> >>> with pickups hampers this:
> >>>
> >>> Apparently, starting from pickup is NOT exact. We have tried the
> >>> famous 2+2=4 test with our 8CPU job on our SX8 (cc to Olaf, who's
> >>> been mostly involved in this) and found no difference between the
> >>> cg2d output (and other output). However, when we run an experiment
> >>> for a longer time, the same test fails, e.g., 2160+2160 != 4320
> >>> (we can provide plots if required). I assume that this is
> >>> expected, because double precision is not more than double
> >>> precision and in the cg2d output (and other monitor output) there
> >>> are always only 15 digits, and we don't know about the 16th one,
> >>> correct? Anyway, this tiny pickup issue hinders me from
> >>> approaching the point of model crash with pickups, because after
> >>> starting from a pickup, the model integrates beyond the problem and
> >>> crashes (sometimes) at a much later time. This is to say that the
> >>> problem in seaice_lsr (the problem only appears when the C-LSR
> >>> solver is used) is very sensitive; the code crashes without any
> >>> warning from one time step to the other. A while ago, in a
> >>> different case I was able to get close enough to the point of
> >>> crashing to do some diagnostics, but it's almost impossible to
> >>> identify why the model explodes. I am assuming that for random
> >>> pathological cases one or more matrix entries are nearly zero,
> >>> which then prevents the solver from converging.
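On the 15-digit question above: two double-precision numbers that differ only in the last bit can print identically with 15 significant digits, and 17 significant digits are needed to tell any two doubles apart. So identical 15-digit monitor output does not prove that a restart is bitwise exact. A small sketch in Python (the value 0.1 is arbitrary, and the helper next_double is written here just for the illustration):

```python
import struct

def next_double(x):
    """Return the next representable double above a positive finite x."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<q", bits + 1))[0]

a = 0.1
b = next_double(a)                    # differs from a only in the last bit

# "%.14E" prints 15 significant digits, "%.16E" prints 17.
fifteen_a, fifteen_b = "%.14E" % a, "%.14E" % b
seventeen_a, seventeen_b = "%.16E" % a, "%.16E" % b

print(a == b)                         # False
print(fifteen_a == fifteen_b)         # True
print(seventeen_a == seventeen_b)     # False
```

A difference of this size is invisible in the printed output at first, but it can grow over a few thousand time steps until runs that looked identical diverge.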
> >>>
> >>> Any comments? Any similar experience?
> >>>
> >>> I run this code in so many different configurations, and I have
> >>> these problems only very seldom/randomly, so I am a little at a
> >>> loss where I should continue looking, so any hint is appreciated.
> >>>
> >>> Martin
> >>>
> >> _______________________________________________
> >> MITgcm-devel mailing list
> >> MITgcm-devel at mitgcm.org
> >> http://mitgcm.org/mailman/listinfo/mitgcm-devel
--
Dimitris Menemenlis <DMenemenlis at gmail.com>