[MITgcm-devel] (not so) funny things happen in seaice_lsr and pickups

Martin Losch Martin.Losch at awi.de
Fri Feb 27 11:48:47 EST 2009


Hi Jean-Michel,

sorry, I was thrown off by the Sum(rhs); all other values that I checked
(dynstat_theta/uvel_min/max/mean/sd) agree perfectly (for both the
aggressive and the minimal optimization), so for lab_sea the restart
seems to be OK. So I need to go to my configuration and do the checks
there (where there really are differences and the restart does not
work). I'll have to figure out what's different from lab_sea (mostly
parameters) and narrow down the problem; more to follow ...
Martin

On Feb 27, 2009, at 5:31 PM, Martin Losch wrote:

> Hi Jean-Michel,
>
> it's probably a good idea for me to first tackle the restart
> problem. Here's what I get on 1 CPU (two tiles, sNx=2) with my
> aggressive optimization for lab_sea/input.lsr (output.0-10 is for a
> total of 10 steps, output.5-10 starts from a pickup at niter0=5):
> sx8::tr_run.lsr> grep cg2d: output.0-10
> [...]
> cg2d: Sum(rhs),rhsMax =   3.07698311274862E-13  1.19974476101239E+00
> cg2d: Sum(rhs),rhsMax =   4.01567668006919E-13  1.19252858573205E+00
> cg2d: Sum(rhs),rhsMax =   5.02708985550271E-13  1.18194572452171E+00
> cg2d: Sum(rhs),rhsMax =   6.01629857044372E-13  1.16776484963845E+00
> cg2d: Sum(rhs),rhsMax =   8.02802269106451E-13  1.15096778602035E+00
> sx8::tr_run.lsr> grep cg2d: output.5-10
> cg2d: Sum(rhs),rhsMax =   3.07975867031018E-13  1.19974476101239E+00
> cg2d: Sum(rhs),rhsMax =   4.01789712611844E-13  1.19252858573205E+00
> cg2d: Sum(rhs),rhsMax =   5.03430630516277E-13  1.18194572452171E+00
> cg2d: Sum(rhs),rhsMax =   6.03184169278848E-13  1.16776484963844E+00
> cg2d: Sum(rhs),rhsMax =   8.05300270911857E-13  1.15096778602035E+00
>
> and with the lowest possible optimization ("ssafe", i.e., only safe
> scalar optimization):
> sx8::tr_run.lsr> grep cg2d: output.0-10
> [...]
> cg2d: Sum(rhs),rhsMax =   3.05866443284231E-13  1.19974475698064E+00
> cg2d: Sum(rhs),rhsMax =   4.00179889226138E-13  1.19252857858165E+00
> cg2d: Sum(rhs),rhsMax =   5.01432229071952E-13  1.18194571749093E+00
> cg2d: Sum(rhs),rhsMax =   6.03017635825154E-13  1.16776484246162E+00
> cg2d: Sum(rhs),rhsMax =   8.00970401115819E-13  1.15096777725923E+00
> sx8::tr_run.lsr> grep cg2d: output.5-10
> cg2d: Sum(rhs),rhsMax =   3.05810932132999E-13  1.19974475698064E+00
> cg2d: Sum(rhs),rhsMax =   3.99458244260131E-13  1.19252857858165E+00
> cg2d: Sum(rhs),rhsMax =   5.01820807130571E-13  1.18194571749093E+00
> cg2d: Sum(rhs),rhsMax =   6.02740080068997E-13  1.16776484246162E+00
> cg2d: Sum(rhs),rhsMax =   8.03301869467532E-13  1.15096777725923E+00
>
> Note that in both cases the rhsMax values are identical after the
> pickup, but the Sum(rhs) values are not (subtraction of large
> numbers?); with aggressive optimization I am losing one digit of
> precision (3 matching digits instead of 4, big deal). On eddy, both
> numbers are identical.
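>
> Just to illustrate what I mean about the last digits: Sum(rhs) is a
> long global sum over tiles and grid points, and merely reordering
> such a sum (which an aggressive optimizer is free to do) already
> changes the trailing digits. A minimal standalone sketch, not MITgcm
> code:
>
>       program sumorder
> c     sum the same numbers in two different orders; the results
> c     typically differ in the last digit or two, because
> c     floating-point addition is not associative
>       implicit none
>       integer i, n
>       parameter ( n = 100000 )
>       double precision sfwd, sbwd
>       sfwd = 0.d0
>       sbwd = 0.d0
>       do i = 1, n
>         sfwd = sfwd + 1.d0/dble(i)**2
>       enddo
>       do i = n, 1, -1
>         sbwd = sbwd + 1.d0/dble(i)**2
>       enddo
>       write(*,'(a,1p,2e24.16)') ' forward, backward: ', sfwd, sbwd
>       end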
>
> Martin
>
> On Feb 27, 2009, at 5:08 PM, Jean-Michel Campin wrote:
>
>> Hi Martin,
>>
>> Just a short question:
>> Did you try with zero optimisation on the SX8 to see if it solves
>> some of the restart issues?
>> I also know that David is not always getting a perfect restart on
>> Columbia; I don't remember if this was with zero optimisation, but
>> he was using small tiles.
>> Cheers,
>> Jean-Michel
>>
>> On Fri, Feb 27, 2009 at 04:29:58PM +0100, Martin Losch wrote:
>>> Hi Dimitris, et al.,
>>> thanks for your comments/suggestions. A lot is pointing at the
>>> lat-lon grid, which would probably mean that the metric terms are
>>> the problem. I am now running a lat-lon configuration where I have
>>> turned off the metric terms in seaice_lsr. However, I am not quite
>>> sure whether the pickup issues are unrelated, because I can
>>> influence the results with things that should not have any effect
>>> (such as restarting from a pickup). Further comments on your
>>> suggestions below:
>>> On Feb 27, 2009, at 2:34 PM, Dimitris Menemenlis wrote:
>>>
>>>> Martin, let's assume for a moment that the restart and c-lsr
>>>> blow-ups are unrelated.  Let's assume that the restart issue is
>>>> peculiar to the SX8 compiler.  From following this discussion, the
>>>> common thread is that c-lsr blows up on lat-lon grids but not on
>>>> curvilinear grids.  This would also explain the problems that Matt
>>>> has had on his SOSE grid, while the CS510 runs have been stable
>>>> over 1000+ years of cumulative integration.
>>>>
>>>> Some (random) suggestions:
>>>>
>>>> 1. turn on debugLevel=3 and monitorFreq=1, to see if you can catch
>>>> anything anomalous right before the crash
>>> That's how I found out that it is in the seaice code. I actually
>>> added debug_stats information to seaice_lsr before and after the
>>> iterations start (not yet checked in, but maybe useful? Let me know).
>>>>
>>>>
>>>> 2. increase lsr solver accuracy and number of allowed iterations
>>> Ongoing
>>>>
>>>>
>>>> 3. use a different compiler on the SX8 to see if the restart
>>>> problem goes away
>>> Not possible; this platform comes with only a cross compiler for the
>>> compute nodes. The head node is a Linux system with amd64 CPUs and
>>> g77/ifort, but that's irrelevant for this problem.
>>>>
>>>>
>>>> 4. use a different platform and/or compiler that does not exhibit
>>>> the restart problem, in order to help you debug your c-lsr lat-lon
>>>> crashes.
>>> Honestly, that's what I should do in order to separate the restart
>>> problem from the ice problem, but I am afraid I don't have the
>>> resources at the moment. Also, I have not yet observed the seaice
>>> problem on other platforms (I think).
>>>>
>>>>
>>>> (if you do not have access to a different platform, you can send
>>>> me your config details and I can set it up on one of the JPL or
>>>> NAS Altix machines)
>>> Thanks for offering: I put a 427MB tar file into my home directory
>>> on skylla; it expands into an input and a code directory. You need
>>> to edit SIZE.h: Nr needs to be 23 (not 50), and I usually use up to
>>> 18 (nPx=6, nPy=3) non-vector CPUs for this configuration. On the SX8
>>> it's only 1 (vector) CPU.
>>> Unfortunately, the configuration is quite stable and you might need
>>> to run it longer than the 100 years specified in the data file (all
>>> with asynchronous time stepping, so it shouldn't take too long).
>>>
>>> Martin
>>>
>>>
>>>>
>>>>
>>>> D.
>>>>
>>>> On Fri, 2009-02-27 at 14:07 +0100, Martin Losch wrote:
>>>>> Hi all, but probably in particular Jean-Michel,
>>>>>
>>>>> I have now found this on our SX8:
>>>>>
>>>>> 1. restarts that work elsewhere (e.g. lab_sea on
>>>>> eddy.csail.mit.edu) do not work. I have no idea why; it is not
>>>>> connected with a particular package, and even for experiments
>>>>> where data.pkg has no entries the restart is broken. This is
>>>>> clearly an issue related to the SX8, as the restart behavior is
>>>>> regularly tested. I am still looking for the precise reason, but
>>>>> at the moment I am clueless. Suggestions are welcome.
>>>>>
>>>>> 2. "spontaneous" explosions happen in the C-LSR solver, but so
>>>>> far not in the B-LSR or C-EVP solver. I am not sure to what
>>>>> extent this is just coincidence. Currently this happens in a
>>>>> 1-CPU 2deg lat-lon configuration, in a 2-CPU Arctic configuration
>>>>> with a rotated lat-lon grid at 0.25deg resolution and with OBCS,
>>>>> and in a regional 0.5deg configuration for the Weddell Sea (so
>>>>> far without OBCS). I have run the CS510 for 16 years without
>>>>> problems, and I have also run the above Arctic configuration with
>>>>> a curvilinear grid (basically the grid is the same, but the
>>>>> metric terms in the ice model are not there) without any
>>>>> problems. It "looks" like it's connected to the lat-lon grid (and
>>>>> thus the metric terms?).
>>>>>
>>>>> 3. C-LSR (and B-LSR) is basically a set of iterations. At the
>>>>> beginning, the first time-level velocity is copied to the third:
>>>>> uice(i,j,3,bi,bj)=uice(i,j,1,bi,bj); then later we compute an
>>>>> innovation like this:
>>>>> u(1) = u(3) + .95*(URT-u(3)),
>>>>> and at the end of each iteration there is an
>>>>> exch_uv_3d_rl(uice,vice,.true.,3,myThid).
>>>>> All of these computations happen within j=1,sNy; i=1,sNx (but
>>>>> partially in separate loops). u(3) is never used outside of
>>>>> seaice_lsr.F (lsr.F), except in some obsolete and never-used
>>>>> ice/ocean-stress computation. I have made a change so that
>>>>> uice(3)=uice(1) is now done for the entire array:
>>>>> j=1-OLy,sNy+OLy; i=1-OLx,sNx+OLx, that is, including the
>>>>> overlaps. These overlaps of u(3) (and v(3)) are never touched
>>>>> elsewhere, except in the exchange routines. After this change
>>>>> (copy of u/v(1) to u/v(3), including overlaps), the results
>>>>> should not change; they do not change on, say,
>>>>> eddy.csail.mit.edu, but they do change on our SX8. In some cases
>>>>> the "spontaneous" explosions go away, in others they are
>>>>> "delayed" by O(1000) timesteps.
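>>>>>
>>>>> To make the suspicion concrete, here is a toy sketch (standalone
>>>>> Fortran, not the actual seaice_lsr.F code; the tile and overlap
>>>>> sizes are made up) of why an uninitialized overlap in the third
>>>>> time level could matter if anything, e.g. a vectorized loop, ever
>>>>> reads it:
>>>>>
>>>>>       program haloinit
>>>>> c     u1 plays the role of uice(:,:,1), u3 of uice(:,:,3);
>>>>> c     sNx, sNy, OLx, OLy are made-up tile and overlap sizes
>>>>>       implicit none
>>>>>       integer sNx, sNy, OLx, OLy
>>>>>       parameter ( sNx = 4, sNy = 4, OLx = 2, OLy = 2 )
>>>>>       double precision u1(1-OLx:sNx+OLx,1-OLy:sNy+OLy)
>>>>>       double precision u3(1-OLx:sNx+OLx,1-OLy:sNy+OLy)
>>>>>       integer i, j
>>>>> c     u1 is fully initialized; u3 starts with a marker value
>>>>> c     standing in for whatever garbage happens to be in memory
>>>>>       do j = 1-OLy, sNy+OLy
>>>>>        do i = 1-OLx, sNx+OLx
>>>>>         u1(i,j) = 1.d0
>>>>>         u3(i,j) = -9999.d0
>>>>>        enddo
>>>>>       enddo
>>>>> c     old version: copy the interior only; the overlap of u3
>>>>> c     keeps the marker value
>>>>>       do j = 1, sNy
>>>>>        do i = 1, sNx
>>>>>         u3(i,j) = u1(i,j)
>>>>>        enddo
>>>>>       enddo
>>>>>       write(*,*) 'overlap after interior-only copy:', u3(0,1)
>>>>> c     new version: copy the full array including the overlaps
>>>>>       do j = 1-OLy, sNy+OLy
>>>>>        do i = 1-OLx, sNx+OLx
>>>>>         u3(i,j) = u1(i,j)
>>>>>        enddo
>>>>>       enddo
>>>>>       write(*,*) 'overlap after full-array copy:   ', u3(0,1)
>>>>>       end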
>>>>>
>>>>> My preliminary conclusion is that the problems with seaice_lsr
>>>>> and pickups are actually connected. The only thing that can go
>>>>> wrong in the pickups is that something fishy is happening in the
>>>>> exchanges. The other option is that it is somehow connected to
>>>>> the metric terms in the ice model, which I find hard to believe;
>>>>> that would not explain the restart problem.
>>>>>
>>>>> What should I try next to figure out this problem?
>>>>>
>>>>> Martin
>>>>> cc to Olaf Klatt
>>>>>
>>>>>
>>>>> On Feb 19, 2009, at 8:57 AM, Martin Losch wrote:
>>>>>
>>>>>> Hi Jinlun and Matt, thanks for your comments,
>>>>>>
>>>>>> I did comparison runs with the B-grid code and with EVP, and in
>>>>>> the particular instances I am interested in they do not crash.
>>>>>> That's a bit discomforting for me, but on the other hand I do
>>>>>> not use the B-grid or EVP code too often, so I don't have an
>>>>>> appropriate statistical sample (again, in nearly all cases the
>>>>>> C-LSR code is absolutely stable, and Dimitris does all his CS510
>>>>>> runs with C-LSR).
>>>>>>
>>>>>> Matt, the original seaice_growth.F has lots of these
>>>>>>>         HEFF(I,J,2,bi,bj) = MAX(0. _d 0, HEFF(I,J,2,bi,bj))
>>>>>>>         HSNOW(I,J,bi,bj)  = MAX(0. _d 0, HSNOW(I,J,bi,bj))
>>>>>>>         AREA(I,J,2,bi,bj) = MAX(0. _d 0, AREA(I,J,2,bi,bj))
>>>>>> as well, but we will try this also. I don't think that the
>>>>>> thermodynamic growth is the problem; it's more likely that
>>>>>> changing anything in the sea ice model makes the model not crash
>>>>>> at a particular point (e.g., interrupting and restarting an
>>>>>> integration from a pickup rather than doing everything in one
>>>>>> go; in this sense changing from C- to B-grid is a change, too,
>>>>>> and not a small one). But I guess that if we have some funny
>>>>>> HEFF etc., the LSR solver might get into trouble, too.
>>>>>> So I'll try this.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>> On Feb 18, 2009, at 5:42 PM, Jinlun Zhang wrote:
>>>>>>
>>>>>>> Martin,
>>>>>>> Have you tried LSR on B-grid with the bug fixed, just for a
>>>>>>> comparison?
>>>>>>> Good luck, Jinlun
>>>>>>>
>>>>>>> Martin Losch wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> just to let you know that we are experiencing problems with the
>>>>>>>> LSR sea ice solver on the C-grid: at unpredictable points of
>>>>>>>> the integration, it appears to become unstable and blows up. I
>>>>>>>> have not been able to isolate this in all cases, because a
>>>>>>>> small issue with pickups hampers this:
>>>>>>>>
>>>>>>>> Apparently, starting from a pickup is NOT exact. We have tried
>>>>>>>> the famous 2+2=4 test with our 8-CPU job on our SX8 (cc to
>>>>>>>> Olaf, who's been mostly involved in this) and found no
>>>>>>>> difference in the cg2d output (and other output). However,
>>>>>>>> when we run an experiment for a longer time, the same test
>>>>>>>> fails, e.g., 2160+2160 != 4320 (we can provide plots if
>>>>>>>> required). I assume that this is expected, because double
>>>>>>>> precision is not more than double precision, and in the cg2d
>>>>>>>> output (and other monitor output) there are always only 15
>>>>>>>> digits, so we don't know about the 16th one, correct? Anyway,
>>>>>>>> this tiny pickup issue keeps me from approaching the point of
>>>>>>>> the model crash with pickups, because after starting from a
>>>>>>>> pickup the model integrates beyond the problem and crashes
>>>>>>>> (sometimes) at a much later time. This is to say that the
>>>>>>>> problem in seaice_lsr (it only appears when the C-LSR solver
>>>>>>>> is used) is very sensitive; the code crashes without any
>>>>>>>> warning from one time step to the next. A while ago, in a
>>>>>>>> different case, I was able to get close enough to the point of
>>>>>>>> crashing to do some diagnostics, but it's almost impossible to
>>>>>>>> identify why the model explodes. I am assuming that for random
>>>>>>>> pathological cases one or more matrix entries are nearly zero,
>>>>>>>> which then prevents the solver from converging.
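>>>>>>>>
>>>>>>>> On the 15th vs. 16th digit: a small standalone sketch (not
>>>>>>>> MITgcm code, the value below is made up) of the point that two
>>>>>>>> doubles can look the same in the ~15 digits the monitor prints
>>>>>>>> and still differ, which printing 17 significant digits would
>>>>>>>> reveal:
>>>>>>>>
>>>>>>>>       program digit16
>>>>>>>> c     two double-precision values that may print identically
>>>>>>>> c     with 15 significant digits (as in the monitor output)
>>>>>>>> c     but are clearly distinct with 17 significant digits
>>>>>>>>       implicit none
>>>>>>>>       double precision x, y
>>>>>>>>       x = 1.19974476101239d0
>>>>>>>>       y = x + epsilon(x)
>>>>>>>>       write(*,'(a,1p,2e23.15)') ' 15 digits: ', x, y
>>>>>>>>       write(*,'(a,1p,2e25.17)') ' 17 digits: ', x, y
>>>>>>>>       end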
>>>>>>>>
>>>>>>>> Any comments? Any similar experience?
>>>>>>>>
>>>>>>>> I run this code in so many different configurations, and I
>>>>>>>> have these problems only very seldom/randomly, so I am a
>>>>>>>> little at a loss as to where I should continue looking; any
>>>>>>>> hint is appreciated.
>>>>>>>>
>>>>>>>> Martin
>>>>>>>>
>>>> -- 
>>>> Dimitris Menemenlis <DMenemenlis at gmail.com>
>>>>
>



