[MITgcm-devel] (not so) funny things happen in seaice_lsr and pickups

Jean-Michel Campin jmc at ocean.mit.edu
Sun Mar 1 19:54:00 EST 2009


Hi Martin,

I am a little bit confused: 
If cg2d_rhs (sum or max, doesn't matter) is not identical,
it means the state is different, and I guess if you do 
a diff of the final pickup (as I wrote earlier, this is what I
consider to be the "true" answer), it will be different too.
So it seems to me that there is a more fundamental problem with
restart/pickup, because only 3 or 4 correct digits for
Sum(rhs) does not look very good.

Could you try to run "../tools/do_tst_2+2" from
MITgcm/verification where the last SX8 testreport was run?
I made some changes recently for the MPI restart test, and put an
automatic restart test after the aces_ifc_mpi testreport
(see the changes in tools/example_scripts/ACESgrid/aces_test_ifc_mpi).
You don't need to recompile anything, so the cross-compiler issue
should not be a problem.
And if something in those scripts does not work on this platform,
I would be happy to try to fix it.

Cheers,
Jean-Michel

On Fri, Feb 27, 2009 at 05:48:47PM +0100, Martin Losch wrote:
> Hi Jean-Michel,
>
> sorry, I was thrown off by the Sum(rhs); all other values that I
> checked (dynstat_theta/uvel_min/max/mean/sd) do agree perfectly (for
> both the aggressive and minimal optimization), so for lab_sea the
> restart seems to be OK. So I need to go to my configuration and do the
> checks there (where there really are differences and the restart does
> not work). I'll have to figure out what's different from lab_sea
> (parameters mostly) and narrow down the problem; more to follow ...
> Martin
>
> On Feb 27, 2009, at 5:31 PM, Martin Losch wrote:
>
>> Hi Jean-Michel,
>>
>> it's probably a good idea for me to first tackle the restart
>> problem. Here's what I get on 1 CPU (two tiles, sNx=2) with my
>> aggressive optimization for lab_sea/input.lsr (output.0-10 is for a
>> total of 10 steps, output.5-10 starts from a pickup at niter0=5):
>> sx8::tr_run.lsr> grep cg2d: output.0-10
>> [...]
>> cg2d: Sum(rhs),rhsMax =   3.07698311274862E-13  1.19974476101239E+00
>> cg2d: Sum(rhs),rhsMax =   4.01567668006919E-13  1.19252858573205E+00
>> cg2d: Sum(rhs),rhsMax =   5.02708985550271E-13  1.18194572452171E+00
>> cg2d: Sum(rhs),rhsMax =   6.01629857044372E-13  1.16776484963845E+00
>> cg2d: Sum(rhs),rhsMax =   8.02802269106451E-13  1.15096778602035E+00
>> sx8::tr_run.lsr> grep cg2d: output.5-10
>> cg2d: Sum(rhs),rhsMax =   3.07975867031018E-13  1.19974476101239E+00
>> cg2d: Sum(rhs),rhsMax =   4.01789712611844E-13  1.19252858573205E+00
>> cg2d: Sum(rhs),rhsMax =   5.03430630516277E-13  1.18194572452171E+00
>> cg2d: Sum(rhs),rhsMax =   6.03184169278848E-13  1.16776484963844E+00
>> cg2d: Sum(rhs),rhsMax =   8.05300270911857E-13  1.15096778602035E+00
>>
>> and with the lowest possible optimization ("ssafe", only safe scalar
>> optimizations):
>> sx8::tr_run.lsr> grep cg2d: output.0-10
>> [...]
>> cg2d: Sum(rhs),rhsMax =   3.05866443284231E-13  1.19974475698064E+00
>> cg2d: Sum(rhs),rhsMax =   4.00179889226138E-13  1.19252857858165E+00
>> cg2d: Sum(rhs),rhsMax =   5.01432229071952E-13  1.18194571749093E+00
>> cg2d: Sum(rhs),rhsMax =   6.03017635825154E-13  1.16776484246162E+00
>> cg2d: Sum(rhs),rhsMax =   8.00970401115819E-13  1.15096777725923E+00
>> sx8::tr_run.lsr> grep cg2d: output.5-10
>> cg2d: Sum(rhs),rhsMax =   3.05810932132999E-13  1.19974475698064E+00
>> cg2d: Sum(rhs),rhsMax =   3.99458244260131E-13  1.19252857858165E+00
>> cg2d: Sum(rhs),rhsMax =   5.01820807130571E-13  1.18194571749093E+00
>> cg2d: Sum(rhs),rhsMax =   6.02740080068997E-13  1.16776484246162E+00
>> cg2d: Sum(rhs),rhsMax =   8.03301869467532E-13  1.15096777725923E+00
>>
>> Note that in both cases the rhsMax values are identical after the
>> pickup, but the Sum(rhs) values are not (subtraction of large
>> numbers?); with aggressive optimization I am losing one digit of
>> precision (3 instead of 4, big deal). On eddy, both numbers are
>> identical.
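This behaviour is consistent with floating-point summation order: Sum(rhs) is a global sum of O(1) terms that nearly cancel down to O(1e-13), so re-associating the sum (different optimization levels, vectorization, tiling) shifts the last digits, while rhsMax is an order-independent reduction. A minimal Python sketch of the effect, purely illustrative and not MITgcm code:

```python
# Floating-point addition is not associative, so a global sum can change
# in its last digits when the accumulation order changes, while an
# order-independent reduction like max() stays bit-identical.
vals = [1.0] + [1e-16] * 10

forward = 0.0
for v in vals:            # large term first: the small terms are rounded away
    forward += v

backward = 0.0
for v in reversed(vals):  # small terms first: they accumulate and survive
    backward += v

print(forward)                            # 1.0
print(backward > forward)                 # True: a last-digit difference
print(max(vals) == max(reversed(vals)))   # True: max is order-independent
```

The same mechanism explains why only the near-cancelling Sum(rhs) moves between runs while rhsMax matches exactly.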
>>
>> Martin
>>
>> On Feb 27, 2009, at 5:08 PM, Jean-Michel Campin wrote:
>>
>>> Hi Martin,
>>>
>>> Just a short question:
>>> Did you try with zero optimisation on SX8 to see if it solves
>>> some of the restart issues ?
>>> And I also know that David is not always getting a perfect
>>> restart on Columbia; I don't remember if this was with zero
>>> optimisation, but he was using small tiles.
>>> Cheers,
>>> Jean-Michel
>>>
>>> On Fri, Feb 27, 2009 at 04:29:58PM +0100, Martin Losch wrote:
>>>> Hi Dimitris, et al.,
>>>> thanks for your comments/suggestions. A lot is pointing at the
>>>> lat-lon grid, which would probably mean that the metric terms are
>>>> the problem. I am now running a lat-lon grid where I turned off the
>>>> metric terms in seaice_lsr. However, I am not quite sure if the
>>>> pickup issues are unrelated, because I can influence the results
>>>> with things that should not have any effect (such as the pickup).
>>>> Further comments on your suggestions below:
>>>> On Feb 27, 2009, at 2:34 PM, Dimitris Menemenlis wrote:
>>>>
>>>>> Martin, let's assume for a moment that restart and c-lsr blow-ups 
>>>>> are
>>>>> unrelated.  Let's assume that the restart issue is peculiar to SX8
>>>>> compiler.  From following this discussion, the common thread is  
>>>>> that
>>>>> c-lsr blows up in lat-lon grids but not in curvilinear grids.  This
>>>>> would also explain the problems that Matt has had on his SOSE grid
>>>>> while
>>>>> the CS510 integrations have run stably for 1000+ years of  
>>>>> cumulative
>>>>> integrations.
>>>>>
>>>>> Some (random) suggestions:
>>>>>
>>>>> 1. turn on debuglevel=3 and monitorfreq=1, to see if you can catch
>>>>> anything anomalous right before crash
>>>> That's how I found out that it is in the seaice code; I actually
>>>> added debug_stats information to seaice_lsr before and after the
>>>> iterations start (not yet checked in, but maybe useful? Let me
>>>> know).
>>>>>
>>>>>
>>>>> 2. increase lsr solver accuracy and number of allowed iterations
>>>> Ongoing
>>>>>
>>>>>
>>>>> 3. use a different compiler on SXB to see if restart problem goes 
>>>>> away
>>>> Not possible, this platform comes only with a cross compiler for
>>>> the compute nodes. The head node is a Linux system with amd64 CPUs
>>>> with g77/ifort, but that's irrelevant for this problem.
>>>>>
>>>>>
>>>>> 4. use a different platform and/or compiler, which do not exhibit 
>>>>> the
>>>>> restart problem, in order to help you debug your c-lsr-latlon  
>>>>> crashes.
>>>> Honestly, that's what I should do in order to separate the restart
>>>> problem from the ice problem, but I don't have these resources at
>>>> the moment, I am afraid. Also, I have not yet observed the seaice
>>>> problem on other platforms (I think).
>>>>>
>>>>>
>>>>> (if you do not have access to a different platform, you can send me
>>>>> your
>>>>> config details and I can set it up on one of the JPL or NAS  
>>>>> altices)
>>>> Thanks for offering: I put a 427MB tar file into my home directory
>>>> on skylla, which expands into an input and a code directory. You
>>>> need to edit SIZE.h: Nr needs to be 23 (not 50), and I usually use
>>>> up to 18 (nPx=6,nPy=3) non-vector CPUs for this configuration. On
>>>> the SX8 it's only 1 (vector) CPU.
>>>> Unfortunately, the configuration is quite stable and you might need
>>>> to run it longer than the 100 years that are specified in the data
>>>> file (all with asynchronous time stepping, so it shouldn't take too
>>>> long).
>>>>
>>>> Martin
>>>>
>>>>
>>>>>
>>>>>
>>>>> D.
>>>>>
>>>>> On Fri, 2009-02-27 at 14:07 +0100, Martin Losch wrote:
>>>>>> Hi all, but probably in particular Jean-Michel,
>>>>>>
>>>>>> I have now found this on our SX8:
>>>>>>
>>>>>> 1. restarts that work elsewhere (e.g. lab_sea on
>>>>>> eddy.csail.mit.edu) do not work. I have no idea why; it is not
>>>>>> connected with a particular package, as the restart is broken
>>>>>> also for experiments where data.pkg has no entries. This is
>>>>>> clearly an issue related to the SX8, as the restart behavior is
>>>>>> regularly tested. I am still looking for the precise reason, but
>>>>>> at the moment I am clueless. Suggestions are welcome.
>>>>>>
>>>>>> 2. "spontaneous" explosions happen in the C-LSR solver, but so
>>>>>> far not in the B-LSR or C-EVP solver. I am not sure to what
>>>>>> extent this is just coincidence. Currently this happens in a
>>>>>> 1cpu-2deg-lat-lon configuration, in a 2cpu Arctic configuration
>>>>>> with a rotated lat-lon grid, .25deg resolution, and OBCS, and in
>>>>>> a regional 0.5deg-resolution configuration for the Weddell Sea
>>>>>> (so far without OBCS). I have run the CS510 for 16 years without
>>>>>> problems; I have also run the above Arctic configuration with a
>>>>>> curvilinear grid (basically the grid is the same, but the metric
>>>>>> terms in the ice model are not there) without any problems. It
>>>>>> "looks" like it's connected to the lat-lon grid (and thus the
>>>>>> metric terms?).
>>>>>>
>>>>>> 3. C-LSR (and B-LSR) is basically a set of iterations. At the
>>>>>> beginning, the first time-level velocity is copied to the third:
>>>>>> uice(i,j,3,bi,bj)=uice(i,j,1,bi,bj); then later we compute an
>>>>>> innovation like this:
>>>>>> u(1) = u(3) + .95*(URT-u(3)),
>>>>>> and at the end of each iteration there is an
>>>>>> exch_uv_3d_rl(uice,vice,.true.,3,mythid).
>>>>>> All of these computations happen within j=1,sNy; i=1,sNx (but
>>>>>> partially in separate loops). u(3) is never used outside of
>>>>>> seaice_lsr.F (lsr.F, except in some obsolete and never-used
>>>>>> ice/ocean-stress computation). I have made a change so that
>>>>>> uice(3)=uice(1) is now done for the entire array:
>>>>>> j=1-Oly,sNy+Oly; i=1-Olx,sNx+Olx, that is, including the
>>>>>> overlaps. These overlaps of u(3) (and v(3)) are never touched
>>>>>> elsewhere, except in the exchange routines. After this change
>>>>>> (copy of u/v(1) to u/v(3), including overlaps), the results
>>>>>> should not change; they do not change on, say,
>>>>>> eddy.csail.mit.edu, but they do change on our SX8. In some cases
>>>>>> the "spontaneous" explosions go away, in others they are
>>>>>> "delayed" by order(1000) timesteps.
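For readers following along, the iteration structure described in point 3 can be sketched as follows. This is a hedged Python/NumPy sketch, not the actual seaice_lsr.F code; the tile sizes, the name URT (the solver's new velocity estimate), and the `interior` helper are illustrative assumptions:

```python
import numpy as np

OLx = OLy = 2      # overlap (halo) width, assumed
sNx = sNy = 8      # interior tile size, assumed
relax = 0.95       # under-relaxation factor, as in u(1)=u(3)+.95*(URT-u(3))

shape = (sNy + 2 * OLy, sNx + 2 * OLx)
uice1 = np.zeros(shape)   # time level 1
uice3 = np.empty(shape)   # time level 3: the previous iterate

def interior(a):
    # the j=1,sNy; i=1,sNx range of the Fortran loops
    return a[OLy:OLy + sNy, OLx:OLx + sNx]

# The change under discussion: copy the *entire* array, overlaps
# included, instead of only the interior points, so the halo of the
# third time level is never left uninitialized.
uice3[:] = uice1

URT = np.ones(shape)      # stand-in for the solver's new estimate

# Under-relaxed update, applied on the interior only (the halo is then
# refreshed by the exchange, exch_uv_3d_rl, not modeled here):
interior(uice1)[:] = interior(uice3) + relax * (interior(URT) - interior(uice3))

print(interior(uice1)[0, 0])   # 0.95
```

The point of the full-array copy is that any stale values in the u(3)/v(3) halos can no longer leak into the update on machines where the exchange or the memory initialization behaves differently.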
>>>>>>
>>>>>> My preliminary conclusion is that the problems with seaice_lsr
>>>>>> and pickups are actually connected. The only thing that can go
>>>>>> wrong in the pickups is that something fishy is happening in the
>>>>>> exchanges. The other option is that it is somehow connected to
>>>>>> the metric terms in the ice model, which I find hard to believe;
>>>>>> it would not explain the restart problem.
>>>>>>
>>>>>> What should I try next to figure out this problem?
>>>>>>
>>>>>> Martin
>>>>>> cc to Olaf Klatt
>>>>>>
>>>>>>
>>>>>> On Feb 19, 2009, at 8:57 AM, Martin Losch wrote:
>>>>>>
>>>>>>> Hi Jinlun and Matt, thanks for your comments,
>>>>>>>
>>>>>>> I did comparison runs with the B-grid code and with EVP, and in
>>>>>>> the particular instances I am interested in, they do not crash.
>>>>>>> That's a bit discomforting for me, but on the other hand, I do
>>>>>>> not use the B-grid or EVP code too often, so I don't have an
>>>>>>> appropriate statistical sample (again, in nearly all cases the
>>>>>>> C-LSR code is absolutely stable, and Dimitris does all his CS510
>>>>>>> runs with C-LSR).
>>>>>>>
>>>>>>> Matt, the original seaice_growth.F has lots of these
>>>>>>>>   HEFF(I,J,2,bi,bj)  = MAX(0. _d 0, HEFF(I,J,2,bi,bj)  )
>>>>>>>>   HSNOW(I,J,bi,bj)   = MAX(0. _d 0, HSNOW(I,J,bi,bj)   )
>>>>>>>>   AREA(I,J,2,bi,bj)  = MAX(0. _d 0, AREA(I,J,2,bi,bj)  )
>>>>>>> as well, but we will try this also. I don't think that the
>>>>>>> thermodynamic growth is the problem; it's more likely that
>>>>>>> changing anything in the sea ice model makes the model not crash
>>>>>>> at a particular point (e.g., interrupting and restarting an
>>>>>>> integration from a pickup rather than doing everything in one
>>>>>>> go; in this sense changing from C- to B-grid is a change, too,
>>>>>>> and not a small one), but I guess if we have some funny HEFF
>>>>>>> etc., the LSR solver might get into trouble, too.
>>>>>>> So I'll try this.
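The clamping in the quoted seaice_growth.F lines can be sketched like this (illustrative Python, not the actual MITgcm code; the sample values are made up): after the thermodynamic update, effective ice thickness (HEFF), snow thickness (HSNOW), and ice concentration (AREA) are reset to non-negative values, so a slightly negative result of the update cannot later poison the LSR matrix.

```python
import numpy as np

# Hypothetical post-update field: the tiny negative entry stands in for
# the kind of roundoff-level undershoot the MAX(0,...) lines guard against.
heff = np.array([0.4, -1.0e-12, 2.1])

# Elementwise clamp, the NumPy analogue of Fortran's MAX(0. _d 0, HEFF(...)):
heff = np.maximum(0.0, heff)

print(heff.min())   # 0.0: no negative thickness survives
```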
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>> On Feb 18, 2009, at 5:42 PM, Jinlun Zhang wrote:
>>>>>>>
>>>>>>>> Martin,
>>>>>>>> Have you tried LSR on the B-grid with the bug fixed, just for
>>>>>>>> a comparison?
>>>>>>>> Good luck, Jinlun
>>>>>>>>
>>>>>>>> Martin Losch wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> just to let you know that we are experiencing problems with
>>>>>>>>> the LSR sea ice solver on the C-grid: at unpredictable points
>>>>>>>>> of the integration, it appears to become unstable and blows
>>>>>>>>> up. I have not been able to isolate this in all cases, because
>>>>>>>>> a small issue with pickups hampers this:
>>>>>>>>>
>>>>>>>>> Apparently, starting from a pickup is NOT exact. We have
>>>>>>>>> tried the famous 2+2=4 test with our 8CPU job on our SX8 (cc
>>>>>>>>> to Olaf, who's been mostly involved in this) and found no
>>>>>>>>> difference between the cg2d output (and other output).
>>>>>>>>> However, when we run an experiment for a longer time, the
>>>>>>>>> same test fails, e.g., 2160+2160 != 4320 (we can provide
>>>>>>>>> plots if required). I assume that this is expected, because
>>>>>>>>> double precision is not more than double precision, and in
>>>>>>>>> the cg2d output (and other monitor output) there are always
>>>>>>>>> only 15 digits, and we don't know about the 16th one,
>>>>>>>>> correct? Anyway, this tiny pickup issue hinders me from
>>>>>>>>> approaching the point of the model crash with pickups,
>>>>>>>>> because after starting from a pickup, the model integrates
>>>>>>>>> beyond the problem and crashes (sometimes) at a much later
>>>>>>>>> time. This is to say that the problem in seaice_lsr (the
>>>>>>>>> problem only appears when the C-LSR solver is used) is very
>>>>>>>>> sensitive; the code crashes without any warning from one time
>>>>>>>>> step to the next. A while ago, in a different case, I was
>>>>>>>>> able to get close enough to the point of crashing to do some
>>>>>>>>> diagnostics, but it's almost impossible to identify why the
>>>>>>>>> model explodes. I am assuming that for random pathological
>>>>>>>>> cases one or more matrix entries are nearly zero, which then
>>>>>>>>> prevents the solver from converging.
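The "15 digits" question above can be checked directly: IEEE-754 double precision carries a 53-bit significand, which guarantees 15 decimal digits, so two model states can agree in every printed monitor digit and still differ in the last stored bit. A short Python sketch (the sample value is taken from the cg2d output quoted earlier in the thread):

```python
import sys

print(sys.float_info.mant_dig)   # 53 bits in the significand
print(sys.float_info.dig)        # 15 guaranteed decimal digits

a = 1.15096778602035             # an rhsMax-like value
b = a + sys.float_info.epsilon   # exactly one ulp larger, for a in [1,2)

print(a == b)                      # False: the two states differ ...
print(f"{a:.14e}" == f"{b:.14e}")  # True: ... yet 15 printed digits agree
```

So identical 15-digit monitor output is necessary but not sufficient for a bit-perfect restart; comparing the binary pickup files themselves is the stricter test.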
>>>>>>>>>
>>>>>>>>> Any comments? Any similar experience?
>>>>>>>>>
>>>>>>>>> I run this code in so many different configurations, and I
>>>>>>>>> have these problems only very seldom/randomly, so I am a
>>>>>>>>> little at a loss as to where I should continue looking; any
>>>>>>>>> hint is appreciated.
>>>>>>>>>
>>>>>>>>> Martin
>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> MITgcm-devel mailing list
>>>>>>>> MITgcm-devel at mitgcm.org
>>>>>>>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>>>>>>>
>>>>>>
>>>>> -- 
>>>>> Dimitris Menemenlis <DMenemenlis at gmail.com>
>>>>>
>>>>
>>
>


