[MITgcm-devel] (not so) funny things happen in seaice_lsr and pickups
Martin Losch
Martin.Losch at awi.de
Fri Feb 27 11:31:45 EST 2009
Hi Jean-Michel,
it's probably a good idea for me to first tackle the restart problem.
Here's what I get on 1 CPU (two tiles, sNx=2) with my aggressive
optimization for lab_sea/input.lsr (output.0-10 is for a total of 10
steps; output.5-10 starts from a pickup at niter0=5):
sx8::tr_run.lsr> grep cg2d: output.0-10
[...]
cg2d: Sum(rhs),rhsMax = 3.07698311274862E-13 1.19974476101239E+00
cg2d: Sum(rhs),rhsMax = 4.01567668006919E-13 1.19252858573205E+00
cg2d: Sum(rhs),rhsMax = 5.02708985550271E-13 1.18194572452171E+00
cg2d: Sum(rhs),rhsMax = 6.01629857044372E-13 1.16776484963845E+00
cg2d: Sum(rhs),rhsMax = 8.02802269106451E-13 1.15096778602035E+00
sx8::tr_run.lsr> grep cg2d: output.5-10
cg2d: Sum(rhs),rhsMax = 3.07975867031018E-13 1.19974476101239E+00
cg2d: Sum(rhs),rhsMax = 4.01789712611844E-13 1.19252858573205E+00
cg2d: Sum(rhs),rhsMax = 5.03430630516277E-13 1.18194572452171E+00
cg2d: Sum(rhs),rhsMax = 6.03184169278848E-13 1.16776484963844E+00
cg2d: Sum(rhs),rhsMax = 8.05300270911857E-13 1.15096778602035E+00
and with the lowest possible optimization ("ssafe", i.e. only safe
scalar optimization):
sx8::tr_run.lsr> grep cg2d: output.0-10
[...]
cg2d: Sum(rhs),rhsMax = 3.05866443284231E-13 1.19974475698064E+00
cg2d: Sum(rhs),rhsMax = 4.00179889226138E-13 1.19252857858165E+00
cg2d: Sum(rhs),rhsMax = 5.01432229071952E-13 1.18194571749093E+00
cg2d: Sum(rhs),rhsMax = 6.03017635825154E-13 1.16776484246162E+00
cg2d: Sum(rhs),rhsMax = 8.00970401115819E-13 1.15096777725923E+00
sx8::tr_run.lsr> grep cg2d: output.5-10
cg2d: Sum(rhs),rhsMax = 3.05810932132999E-13 1.19974475698064E+00
cg2d: Sum(rhs),rhsMax = 3.99458244260131E-13 1.19252857858165E+00
cg2d: Sum(rhs),rhsMax = 5.01820807130571E-13 1.18194571749093E+00
cg2d: Sum(rhs),rhsMax = 6.02740080068997E-13 1.16776484246162E+00
cg2d: Sum(rhs),rhsMax = 8.03301869467532E-13 1.15096777725923E+00
Note that in both cases the rhsMax values are identical after the
pickup, but the Sum(rhs) values are not (subtraction of large
numbers?); with aggressive optimization I am losing one digit of
precision (3 instead of 4, big deal). On eddy, both numbers are
identical.
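The 2+2=4 restart check described above can be scripted. This is a
minimal sketch, assuming the output.0-10/output.5-10 naming used in
this run; the two small heredoc files are stand-ins for real monitor
output so the snippet is self-contained:

```shell
# Stand-in monitor files (hypothetical sample lines, not real output).
cat > output.0-10 <<'EOF'
cg2d: Sum(rhs),rhsMax = 6.01629857044372E-13 1.16776484963845E+00
cg2d: Sum(rhs),rhsMax = 8.02802269106451E-13 1.15096778602035E+00
EOF
cat > output.5-10 <<'EOF'
cg2d: Sum(rhs),rhsMax = 6.03184169278848E-13 1.16776484963845E+00
cg2d: Sum(rhs),rhsMax = 8.05300270911857E-13 1.15096778602035E+00
EOF

# Compare only rhsMax (the last field); Sum(rhs) is expected to differ
# in round-off because it adds/subtracts large numbers of both signs.
grep 'cg2d:' output.0-10 | awk '{print $NF}' > run.full
grep 'cg2d:' output.5-10 | awk '{print $NF}' > run.pickup

if diff -q run.full run.pickup > /dev/null; then
  echo "restart OK: rhsMax identical"
else
  echo "restart FAILED: rhsMax differs"
fi
```

For a real check one would of course grep the actual monitor output of
the two runs, skipping the first five steps of the longer run.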
Martin
On Feb 27, 2009, at 5:08 PM, Jean-Michel Campin wrote:
> Hi Martin,
>
> Just a short question:
> Did you try with zero optimisation on SX8 to see if it solves
> some of the restart issues ?
> And I also know that David is not always getting a perfect
> restart on Columbia; I don't remember if this was with zero
> optimisation, but he was using small tiles.
> Cheers,
> Jean-Michel
>
> On Fri, Feb 27, 2009 at 04:29:58PM +0100, Martin Losch wrote:
>> Hi Dimitris, et al.,
>> thanks for your comments/suggestions. A lot is pointing at the lat-lon
>> grid, which would probably mean that the metric terms are the problem.
>> I am now running a lat-lon grid where I have turned off the metric
>> terms in seaice_lsr. However, I am not quite sure whether the pickup
>> issues are unrelated, because I can influence the results with things
>> that should not have any effect (such as the pickup). Further comments
>> on your suggestions below:
>> On Feb 27, 2009, at 2:34 PM, Dimitris Menemenlis wrote:
>>
>>> Martin, let's assume for a moment that restart and c-lsr blow-ups are
>>> unrelated. Let's assume that the restart issue is peculiar to the SX8
>>> compiler. From following this discussion, the common thread is that
>>> c-lsr blows up on lat-lon grids but not on curvilinear grids. This
>>> would also explain the problems that Matt has had on his SOSE grid,
>>> while the CS510 integrations have run stably for 1000+ years of
>>> cumulative integration.
>>>
>>> Some (random) suggestions:
>>>
>>> 1. turn on debuglevel=3 and monitorfreq=1, to see if you can catch
>>> anything anomalous right before crash
>> That's how I found out that it is in the seaice code; I actually
>> added debug_stats information to seaice_lsr before and after the
>> iterations start (not yet checked in, but maybe useful? Let me know).
>>>
>>>
>>> 2. increase lsr solver accuracy and number of allowed iterations
>> Ongoing
>>>
>>>
>>> 3. use a different compiler on the SX8 to see if the restart problem
>>> goes away
>> Not possible; this platform comes only with a cross compiler for the
>> compute nodes. The head node is a Linux system with amd64 CPUs and
>> g77/ifort, but that's irrelevant for this problem.
>>>
>>>
>>> 4. use a different platform and/or compiler that does not exhibit
>>> the restart problem, in order to help you debug your c-lsr-latlon
>>> crashes.
>> Honestly, that's what I should do in order to separate the restart
>> problem from the ice problem, but I don't have the resources at the
>> moment, I am afraid. Also, I have not yet observed the seaice problem
>> on other platforms (I think).
>>>
>>>
>>> (if you do not have access to a different platform, you can send me
>>> your
>>> config details and I can set it up on one of the JPL or NAS altices)
>> Thanks for offering: I put a 427MB tar file into my home directory on
>> skylla, which expands into an input and a code directory. You need to
>> edit SIZE.h: Nr needs to be 23 (not 50), and I usually use up to 18
>> (nPx=6, nPy=3) non-vector CPUs for this configuration. On the SX8 it's
>> only 1 (vector) CPU.
>> Unfortunately, the configuration is quite stable, and you might need
>> to run it longer than the 100 years that are specified in the data
>> file (all with asynchronous time stepping, so it shouldn't take too
>> long).
>>
>> Martin
>>
>>
>>>
>>>
>>> D.
>>>
>>> On Fri, 2009-02-27 at 14:07 +0100, Martin Losch wrote:
>>>> Hi all, but probably in particular Jean-Michel,
>>>>
>>>> I have now found this on our SX8:
>>>>
>>>> 1. restarts that work elsewhere (e.g. lab_sea on
>>>> eddy.csail.mit.edu) do not work. I have no idea why; it is not
>>>> connected with a particular package, and the restart is broken even
>>>> for experiments where data.pkg has no entries. This is clearly an
>>>> issue related to the SX8, as the restart behavior is regularly
>>>> tested. I am still looking for the precise reason, but at the
>>>> moment I am clueless. Suggestions are welcome.
>>>>
>>>> 2. "spontaneous" explosions happen in the C-LSR solver, but so far
>>>> not in the B-LSR or C-EVP solver. I am not sure to what extent this
>>>> is just coincidence. Currently this happens in a 1-CPU 2deg lat-lon
>>>> configuration, in a 2-CPU Arctic configuration with a rotated
>>>> lat-lon grid at 0.25deg resolution with OBCS, and in a regional
>>>> 0.5deg configuration for the Weddell Sea (so far without OBCS). I
>>>> have run the CS510 for 16 years without problems, and I have also
>>>> run the above Arctic configuration with a curvilinear grid
>>>> (basically the grid is the same, but the metric terms in the ice
>>>> model are not there) without any problems. It "looks" like it's
>>>> connected to the lat-lon grid (and thus the metric terms?).
>>>>
>>>> 3. C-LSR (and B-LSR) is basically a set of iterations. At the
>>>> beginning, the first time-level velocity is copied to the third:
>>>> uice(i,j,3,bi,bj)=uice(i,j,1,bi,bj). Later we compute an innovation
>>>> like this:
>>>> u(1) = u(3) + .95*(URT-u(3))
>>>> and at the end of each iteration there is an
>>>> exch_uv_3d_rl(uice,vice,.true.,3,mythid).
>>>> All of these computations happen within j=1,sNy; i=1,sNx (but
>>>> partially in separate loops). u(3) is never used outside of
>>>> seaice_lsr.F (lsr.F, except in some obsolete and never-used
>>>> ice/ocean stress computation). I have made a change so that
>>>> uice(3)=uice(1) is now done for the entire array:
>>>> j=1-Oly,sNy+Oly; i=1-Olx,sNx+Olx, that is, including the overlaps.
>>>> These overlaps of u(3) (and v(3)) are never touched elsewhere,
>>>> except in the exchange routines. After this change (copy of u/v(1)
>>>> to u/v(3), including overlaps), the results should not change; they
>>>> do not change on, say, eddy.csail.mit.edu, but they do change on
>>>> our SX8. In some cases the "spontaneous" explosions go away, in
>>>> others they are "delayed" by order(1000) timesteps.
>>>>
>>>> My preliminary conclusion is that the problems with seaice_lsr and
>>>> pickups are actually connected. The only thing that can go wrong in
>>>> the pickups is that something fishy is happening in the exchanges.
>>>> The other option is that it is somehow connected to the metric
>>>> terms in the ice model, which I find hard to believe; it would not
>>>> explain the restart problem.
>>>>
>>>> What should I try next to figure out this problem?
>>>>
>>>> Martin
>>>> cc to Olaf Klatt
>>>>
>>>>
>>>> On Feb 19, 2009, at 8:57 AM, Martin Losch wrote:
>>>>
>>>>> Hi Jinlun and Matt, thanks for your comments,
>>>>>
>>>>> I did comparison runs with the B-grid code and with EVP, and in
>>>>> the particular instances I am interested in, they do not crash.
>>>>> That's a bit discomforting for me, but on the other hand, I do not
>>>>> use the B-grid or EVP code too often, so I don't have an
>>>>> appropriate statistical sample (again, in nearly all cases the
>>>>> C-LSR code is absolutely stable, and Dimitris does all his CS510
>>>>> runs with C-LSR).
>>>>>
>>>>> Matt, the original seaice_growth.F has lots of these
>>>>>> HEFF(I,J,2,bi,bj) = MAX(0. _d 0, HEFF(I,J,2,bi,bj))
>>>>>> HSNOW(I,J,bi,bj)  = MAX(0. _d 0, HSNOW(I,J,bi,bj))
>>>>>> AREA(I,J,2,bi,bj) = MAX(0. _d 0, AREA(I,J,2,bi,bj))
>>>>> as well, but we will try this also. I don't think that the
>>>>> thermodynamic growth is the problem; it's more likely that
>>>>> changing anything in the sea ice model makes the model not crash
>>>>> at a particular point (e.g., interrupting and restarting an
>>>>> integration from a pickup rather than doing everything in one go;
>>>>> in this sense, changing from C- to B-grid is a change, too, and
>>>>> not a small one). But I guess that if we have some funny HEFF
>>>>> etc., the LSR solver might get into trouble, too.
>>>>> So I'll try this.
>>>>>
>>>>> Martin
>>>>>
>>>>> On Feb 18, 2009, at 5:42 PM, Jinlun Zhang wrote:
>>>>>
>>>>>> Martin,
>>>>>> Have you tried LSR on B-grid with the bug fixed, just for a
>>>>>> comparison?
>>>>>> Good luck, Jinlun
>>>>>>
>>>>>> Martin Losch wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> just to let you know that we are experiencing problems with the
>>>>>>> LSR sea ice solver on the C-grid: at unpredictable points of the
>>>>>>> integration, it appears to become unstable and blows up. I have
>>>>>>> not been able to isolate this in all cases, because a small
>>>>>>> issue with pickups hampers this:
>>>>>>>
>>>>>>> Apparently, starting from a pickup is NOT exact. We have tried
>>>>>>> the famous 2+2=4 test with our 8-CPU job on our SX8 (cc to Olaf,
>>>>>>> who's been mostly involved in this) and found no difference
>>>>>>> between the cg2d output (and other output). However, when we run
>>>>>>> an experiment for a longer time, the same test fails, e.g.,
>>>>>>> 2160+2160 != 4320 (we can provide plots if required). I assume
>>>>>>> that this is expected, because double precision is not more than
>>>>>>> double precision, and in the cg2d output (and other monitor
>>>>>>> output) there are always only 15 digits, and we don't know about
>>>>>>> the 16th one, correct? Anyway, this tiny pickup issue keeps me
>>>>>>> from approaching the point of the model crash with pickups,
>>>>>>> because after starting from a pickup, the model integrates
>>>>>>> beyond the problem and crashes (sometimes) at a much later time.
>>>>>>> This is to say that the problem in seaice_lsr (the problem only
>>>>>>> appears when the C-LSR solver is used) is very sensitive; the
>>>>>>> code crashes without any warning from one time step to the next.
>>>>>>> A while ago, in a different case, I was able to get close enough
>>>>>>> to the point of crashing to do some diagnostics, but it's almost
>>>>>>> impossible to identify why the model explodes. I am assuming
>>>>>>> that for random pathological cases one or more matrix entries
>>>>>>> are nearly zero, which then prevents the solver from converging.
>>>>>>>
>>>>>>> Any comments? Any similar experience?
>>>>>>>
>>>>>>> I run this code in so many different configurations, and I have
>>>>>>> these problems only very seldom/randomly, so I am a little at a
>>>>>>> loss as to where I should continue looking; any hint is
>>>>>>> appreciated.
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>> _______________________________________________
>>>>>> MITgcm-devel mailing list
>>>>>> MITgcm-devel at mitgcm.org
>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>>>
>>>>
>>> --
>>> Dimitris Menemenlis <DMenemenlis at gmail.com>
>>>
>>