[MITgcm-devel] (not so) funny things happen in seaice_lsr and pickups

Mon Mar 2 08:24:52 EST 2009

Hi Martin,

A few comments:
o What I assume is that the restart is OK when the sets of pickup files
  written at the end of the 4-iteration run and at the end of the second
  2-iteration run are identical (zero diff). This is what tst_2+2 does
  to decide pass or fail. I think it makes sense, because the
  real*8 pickup files are supposed to contain the "state" variables,
  and if they are identical, it's a good indication that there is
  no difference in the state. Now, if the state variables were
  held in memory with more precision than 64 bits, my assumption
  would not be right (zero diff would just mean: small differences).
  What I don't understand is how sum(cg2d_rhs) could be so different
  when the pickup files at the end are identical?
o I will work out how to fix the (do_)tst_2+2 script on some other
  platforms (but I don't promise anything on Sun-OS).
o For a non-standard test, you can try to use the script "tst_2+2",
  which is used by "do_tst_2+2" for each experiment (it needs an executable
  + a previous successful run to get some pickup files and a standard
  output file to figure out where to start). Each step can be done
  separately (as opposed to what is done in do_tst_2+2: "tst_2+2 All"),
  which should make the debugging easier:
> danton{tools}% tst_2+2
> Usage: tst_2+2 flag [-mpi] [-exe EXECUTABLE] [-command COMMAND]
>  Check restart: compare 1 run of 2 x 2 it long
>     with 2 consecutive runs of 2 it long each
> where: flag = 0 -> prepare (from a previous run) pickup & data files
>        flag = 1 -> do the 3 runs (using COMMAND
>                         or simply using EXECUTABLE, default=./mitgcmuv)
>        flag = 2 -> compare std_outp
>        flag = 3 -> compare pickup files
>        flag = 4 -> clean-up output files
>      flag = All -> do 0,1,2,3,4 sequentially
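
As an aside on the zero-diff criterion in the first comment, here is a
minimal sketch, in Python, of a byte-for-byte comparison of two sets of
pickup files. It is not how tst_2+2 is implemented (that is a shell
script whose internals are not shown here); the function name
differing_pickups, the "pickup*" pattern and the two directory arguments
are assumptions made for illustration only.

```python
import filecmp
import glob
import os


def differing_pickups(dir_a, dir_b, pattern="pickup*"):
    """Return the sorted names of pickup files that are not byte-identical.

    dir_a and dir_b are hypothetical run directories; a file that exists
    in only one of them is also reported as differing.
    """
    names_a = {os.path.basename(p) for p in glob.glob(os.path.join(dir_a, pattern))}
    names_b = {os.path.basename(p) for p in glob.glob(os.path.join(dir_b, pattern))}

    # Files present on one side only can never give a zero diff.
    bad = set(names_a ^ names_b)

    for name in names_a & names_b:
        # shallow=False forces a comparison of the file contents,
        # not just of size and modification time.
        same = filecmp.cmp(os.path.join(dir_a, name),
                           os.path.join(dir_b, name),
                           shallow=False)
        if not same:
            bad.add(name)
    return sorted(bad)
```

Called with the run directory of the 4-iteration run and that of the
second 2-iteration run, an empty list would correspond to the "zero
diff" that counts as a pass above.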
Thanks,
Jean-Michel

On Mon, Mar 02, 2009 at 11:49:13AM +0100, Martin Losch wrote:
> Hi Jean-Michel:
> On Mar 2, 2009, at 1:54 AM, Jean-Michel Campin wrote:
>
>> Hi Martin,
>>
>> I am a little bit confused:
> so am I.
>>
>> If cg2d_rhs (sum or max, doesn't matter) is not identical,
>> it means the state is different, and I guess if you do
>> a diff of the final pickup (as I wrote earlier, this is what I
>> consider to be the "true" answer), it will be different too.
>> So, it seems to me that there is a more fundamental problem with
>> restart/pickup. Because only 3 or 4 correct digits for
>> Sum(rhs) does not look very good.
>>
>> Could you try to run the "../tools/do_tst_2+2" from
>> MITgcm/verification where the last SX8 testreport has run ?
>> I made some changes recently for MPI restart test, and put an
>> automatic restart test after the aces_ifc_mpi testreport
>> (see the changes in tools/example_scripts/ACESgrid/aces_test_ifc_mpi)
>> You don't need to recompile anything, so the issue of cross compiler
>> should not be a problem.
>> And if something in those scripts does not work on this platform,
>> I would be happy to try to fix it.
> Thanks for the modified do_tst_2+2 (BTW, tst_2+2 does not work on my  
> Apple/Leopard, some sed syntax issues, I think, but I did not have the  
> time to sort it out, as my sed skills are poor; does the script work on 
> other non-linux platforms? I assume that the shell tools are  
> different/GNU vs. BSD Unix, etc).
>
> I ran do_tst_2+2 on the SX8 for lab_sea (I will include these tests in
> the weekly routine), and all four tests pass. I repeated the procedure
> with grid rotation, and the tests still pass. So everything that is
> tested in do_tst_2+2 seems to be perfectly OK. But does the script test
> the Sum(rhs) numbers? (I am running the tests on all verification
> experiments now and so far there are no fails, except for fizhi, where
> the verification tests did not run, either)
>
> Now I have to figure out why I am diagnosing wrong restarts in my
> specific configuration. What do I need to do to run your scripts on my
> non-verification configuration?
>
> Unrelated to the restart issues:
> Over the weekend I have solved at least one problem: I now understand why,
> for me, the loop counters for the copy of u(3)=u(1) matter. On the SX8
> the default is to have SEAICE_VECTORIZE_LSR defined. Then tLev=3  
> (otherwise 1), and u(i,j-1,tLev,bi,bj) and u(i,j+1,tlev,bi,bj) are  
> actually used, and thus the overlap of u(3) is actually used. My  
> mistake!
>
> Further, there are some inconsistencies in the discretisation of the
> metric terms in seaice_lsr.F; these lead to slight asymmetries in the
> solutions, when the solutions should be symmetric (e.g. I have put my
> funnel/channel at the equator, so that everything should be symmetric
> about the equator). I have not yet managed to get everything symmetric,
> but one difficulty is that even the grid parameters, such as rA and fCori,
> are not quite symmetric (at the truncation level). As a matter of fact,
> when SEAICE_VECTORIZE_LSR is undefined, the solutions are even "more"
> non-symmetric, so there is something in the LSOR algorithm itself that is
> not quite consistent (probably just the solver accuracy).
> However, remember the non-symmetric discretization in the B-grid
> code? Once that was removed, Dimitris' problems with CS510 with the
> B-grid LSR disappeared, so I would not be surprised if these asymmetries
> in the metric terms cause the "spontaneous" explosions that I was
> talking about initially. I am testing this now and will check in my fixes
> soon.
>
> To summarize:
> 1. there are no restart problems in the verification experiments, as  
> diagnosed by do_tst_2+2
> 2. the restart problem in my specific configuration (and the one that  
> Olaf Klatt uses, a regular lat lon grid) remains, the grid rotation  
> changes the behavior, but in the end the restarts fail here, too  
> (pickups are different)
> 3. the overlap problem is solved, as usual I am the culprit
> 4. the explosions may be caused by non-symmetric or even wrong
> discretizations (by me, again) in the metric terms, but that's not
> clear yet.
>
> I am attaching the code directory and more numbers in restarttest.out.
> Maybe you have an idea why I am getting the wrong restarts (maybe it's
> just in my diagnostics, which are not automated as in your script;
> again, how can I use the script on my example?).
>
> Martin

>> Cheers,
>> Jean-Michel
>>
>> On Fri, Feb 27, 2009 at 05:48:47PM +0100, Martin Losch wrote:
>>> Hi Jean-Michel,
>>>
>>> sorry, I was thrown off by the Sum(rhs); all other values that I checked
>>> (dynstat_theta/uvel_min/max/mean/sd) agree perfectly (for both the
>>> aggressive and minimal optimization), so that for lab_sea the restart
>>> seems to be OK. So I need to go to my configuration and do the checks
>>> there (where there really are differences and the restart does not
>>> work). I'll have to figure out what's different from lab_sea
>>> (parameters mostly) and narrow down the problem, more to follow ...
>>> Martin
>>>
>>> On Feb 27, 2009, at 5:31 PM, Martin Losch wrote:
>>>
>>>> Hi Jean-Michel,
>>>>
>>>> it's probably a good idea for me to first tackle the restart problem.
>>>> Here's what I get on 1 CPU (two tiles, snx=2) with my aggressive
>>>> optimization for lab_sea/input.lsr (output.0-10 is for a total of 10
>>>> steps, output.5-10 is starting from a pickup at niter0=5)
>>>> sx8::tr_run.lsr> grep cg2d: output.0-10
>>>> [...]
>>>> cg2d: Sum(rhs),rhsMax =   3.07698311274862E-13  1.19974476101239E+00
>>>> cg2d: Sum(rhs),rhsMax =   4.01567668006919E-13  1.19252858573205E+00
>>>> cg2d: Sum(rhs),rhsMax =   5.02708985550271E-13  1.18194572452171E+00
>>>> cg2d: Sum(rhs),rhsMax =   6.01629857044372E-13  1.16776484963845E+00
>>>> cg2d: Sum(rhs),rhsMax =   8.02802269106451E-13  1.15096778602035E+00
>>>> sx8::tr_run.lsr> grep cg2d: output.5-10
>>>> cg2d: Sum(rhs),rhsMax =   3.07975867031018E-13  1.19974476101239E+00
>>>> cg2d: Sum(rhs),rhsMax =   4.01789712611844E-13  1.19252858573205E+00
>>>> cg2d: Sum(rhs),rhsMax =   5.03430630516277E-13  1.18194572452171E+00
>>>> cg2d: Sum(rhs),rhsMax =   6.03184169278848E-13  1.16776484963844E+00
>>>> cg2d: Sum(rhs),rhsMax =   8.05300270911857E-13  1.15096778602035E+00
>>>>
>>>> and with the lowest possible optimization ("ssafe" only safe scalar
>>>> optimization):
>>>> sx8::tr_run.lsr> grep cg2d: output.0-10
>>>> [...]
>>>> cg2d: Sum(rhs),rhsMax =   3.05866443284231E-13  1.19974475698064E+00
>>>> cg2d: Sum(rhs),rhsMax =   4.00179889226138E-13  1.19252857858165E+00
>>>> cg2d: Sum(rhs),rhsMax =   5.01432229071952E-13  1.18194571749093E+00
>>>> cg2d: Sum(rhs),rhsMax =   6.03017635825154E-13  1.16776484246162E+00
>>>> cg2d: Sum(rhs),rhsMax =   8.00970401115819E-13  1.15096777725923E+00
>>>> sx8::tr_run.lsr> grep cg2d: output.5-10
>>>> cg2d: Sum(rhs),rhsMax =   3.05810932132999E-13  1.19974475698064E+00
>>>> cg2d: Sum(rhs),rhsMax =   3.99458244260131E-13  1.19252857858165E+00
>>>> cg2d: Sum(rhs),rhsMax =   5.01820807130571E-13  1.18194571749093E+00
>>>> cg2d: Sum(rhs),rhsMax =   6.02740080068997E-13  1.16776484246162E+00
>>>> cg2d: Sum(rhs),rhsMax =   8.03301869467532E-13  1.15096777725923E+00
>>>>
>>>> Note that in both cases the rhsMax values are identical after the
>>>> pickup, but the Sum(rhs) values are not (subtraction of large numbers?);
>>>> with aggressive optimization I am losing one digit of precision (3
>>>> instead of 4, big deal). On eddy, both numbers are identical.
>>>>
>>>> Martin
>>>>
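
The "subtraction of large numbers?" guess in the oldest message above can
be illustrated with a small Python sketch. It only shows the general
floating-point effect: when terms of order 1 cancel down to a sum of
order 1e-13, the sum keeps only a few stable digits and depends on the
order of the additions, while the maximum does not. It says nothing
about what cg2d actually does, and it does not settle whether the
restart in the thread is correct. The values and the helper name
running_sum are made up for illustration.

```python
def running_sum(values):
    """Add the values one by one, left to right, in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


# Made-up terms: two of order 1 that cancel, plus a tiny remainder,
# loosely mimicking rhsMax ~ 1.2 and Sum(rhs) ~ 3e-13.
terms = [1.2, 3.0e-13, -1.2]
same_terms_reordered = [1.2, -1.2, 3.0e-13]

forward = running_sum(terms)
reordered = running_sum(same_terms_reordered)

# The largest magnitude is independent of the ordering.
max_forward = max(abs(v) for v in terms)
max_reordered = max(abs(v) for v in same_terms_reordered)

# The sums agree only to a few digits: adding 3e-13 to 1.2 rounds it to
# the spacing of doubles near 1.2 (about 2.2e-16) before the cancellation.
relative_difference = abs(forward - reordered) / abs(reordered)

print("forward   sum:", forward)
print("reordered sum:", reordered)
print("relative difference:", relative_difference)
print("max identical:", max_forward == max_reordered)
```

In this sketch the two sums differ by far less than 1e-16 in absolute
terms, which is still a visible relative difference for a sum of order
1e-13.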

> _______________________________________________
> MITgcm-devel mailing list
> MITgcm-devel at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-devel