[MITgcm-devel] (not so) funny things happen in seaice_lsr and pickups

Mon Mar 2 08:56:40 EST 2009

Hi Jean-Michel,
as an addendum, here is the result of do_tst_2+2 on the entire verification
suite (as run on SX8). Besides the two fizhi runs, two more runs did
not work: isomip and tutorial_plume_on_slope. Whatever that means,
it's not the terrible pkg/shelfice (I turned it off and the test still
fails); maybe the Orlanski BCs in the latter?

I was able to run tst_2+2 on my Arctic configuration and, as with my
"manual" tests, the restart test fails:
sx8::crash> ~/aaomip/MITgcm/tools/tst_2+2 2 -mpi -command ./runit_sxf90
cmdEXE='./runit_sxf90'
  start-end iter: 368210 , 368212 , 368214
  sufix: '0000368210' '0000368212' '0000368214'
  cmdEXE=./runit_sxf90

== compare cg2d_init_res :
  run 1iA:
     3.90305695808906E+00
     1.44674997413935E+00
  run 1iB:
     1.09426703195375E+00
     9.19285268621797E-01
  run 2it:
     3.90305695808906E+00
     1.44674997413935E+00
     1.09426757967368E+00
     9.19283711567311E-01
sx8::crash> ~/aaomip/MITgcm/tools/tst_2+2 3 -mpi -command ./runit_sxf90
cmdEXE='./runit_sxf90'
  start-end iter: 368210 , 368212 , 368214
  sufix: '0000368210' '0000368212' '0000368214'
  cmdEXE=./runit_sxf90

== diff pickup files : end of 1rst run (2x2 it) & end of 3rd run (2nd 2 it)
--> file=pickup.0000368214, listY=001.001 001.002
  diff res_2it/pickup.0000368214.001.001.data res_1iB
Files res_2it/pickup.0000368214.001.001.data and res_1iB/pickup.0000368214.001.001.data differ
Diff outp= 2  ==> stop

OK, now I have to figure out why. But how ...
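
One way to narrow this down might be to look at where the two pickup files differ, rather than only whether they differ. The sketch below is not part of tst_2+2. It assumes that the pickup .data files hold raw big-endian real*8 values without record markers (an assumption about the file layout that should be checked), and the function names are made up for illustration. It reports how many values differ, the index of the first one, and the largest absolute difference.

```python
import struct


def compare_real8(raw_a, raw_b):
    """Compare two byte strings holding big-endian real*8 values.

    Returns (n_values, n_differing, first_differing_index, max_abs_diff).
    """
    if len(raw_a) != len(raw_b) or len(raw_a) % 8 != 0:
        raise ValueError("sizes differ or are not a multiple of 8 bytes")
    n = len(raw_a) // 8
    a = struct.unpack(f">{n}d", raw_a)
    b = struct.unpack(f">{n}d", raw_b)
    differing = [i for i in range(n) if a[i] != b[i]]
    first = differing[0] if differing else None
    max_abs = max((abs(a[i] - b[i]) for i in differing), default=0.0)
    return n, len(differing), first, max_abs


def compare_real8_files(path_a, path_b):
    """Same comparison, reading the two files from disk."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return compare_real8(fa.read(), fb.read())


# Small self-check with made-up numbers (not model output).
reference = struct.pack(">4d", 1.0, 2.0, 3.0, 4.0)
perturbed = struct.pack(">4d", 1.0, 2.0, 3.5, 4.0)
print(compare_real8(reference, perturbed))
```

For the case above one would call compare_real8_files("res_2it/pickup.0000368214.001.001.data", "res_1iB/pickup.0000368214.001.001.data"). Under the layout assumption, dividing the first differing index by sNx*sNy gives the number of the 2-D record that is affected, which may hint at the field involved.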

Martin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: restartsx8
Type: application/octet-stream
Size: 11513 bytes
Desc: not available
URL: <http://mitgcm.org/pipermail/mitgcm-devel/attachments/20090302/23af5e4e/attachment.obj>
-------------- next part --------------

On Mar 2, 2009, at 2:24 PM, Jean-Michel Campin wrote:

> Hi Martin,
>
> a few comments:
> o What I assume is that the restart is OK when the sets of pickup files
>  written at the end of the 4-iteration run and at the end of the 2nd
>  2-iteration run are identical (zero diff). This is what tst_2+2 checks
>  to decide pass or fail. I think it makes sense, because the
>  real*8 pickup files are supposed to contain the "state" variables,
>  and if they are identical, it's a good indication that there is
>  no difference in the state. Now, if the state variables were
>  held in memory with more precision than 64 bits, my assumption
>  would not be right (zero diff would just mean: small differences).
>  Now, I don't understand how sum(cg2d_rhs) could be so different
>  with identical pickup files at the end?
> o I will work out how to fix the (do_)tst_2+2 scripts on some other
>  platforms (but I don't promise anything on Sun-OS).
> o for a non-standard test, you can try to use the script "tst_2+2",
>  which is used by "do_tst_2+2" for each experiment (it needs an executable
>  + a previous successful run to get some pickup files and a standard output
>  file to figure out where to start). Each step can be done
>  separately (as opposed to what is done in do_tst_2+2: "tst_2+2 All"),
>  which should make the debugging easier:
>> danton{tools}% tst_2+2
>> Usage: tst_2+2 flag [-mpi] [-exe EXECUTABLE] [-command COMMAND]
>> Check restart: compare 1 run of 2 x 2 it long
>>    with 2 consecutive runs of 2 it long each
>> where: flag = 0 -> prepare (from a previous run) pickup & data files
>>       flag = 1 -> do the 3 runs (using COMMAND
>>                        or simply using EXECUTABLE, default=./mitgcmuv)
>>       flag = 2 -> compare std_outp
>>       flag = 3 -> compare pickup files
>>       flag = 4 -> clean-up output files
>>     flag = All -> do 0,1,2,3,4 sequentially
> Thanks,
> Jean-Michel
>
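
The check described in the comments above (one run of 2 x 2 iterations versus two consecutive runs of 2 iterations, compared through the final pickup) can be sketched with a toy model. Everything here is hypothetical: the logistic-map step stands in for a model time step, and write_pickup/read_pickup simply pack the state as big-endian real*8. Because the whole state goes through the pickup at full 64-bit precision, the toy passes. A real model would fail the same check if some part of the state were missing from the pickup or, as noted above, were held at higher precision in memory.

```python
import struct


def step(state):
    """One time step of a toy model: a logistic-map update of each value."""
    return [3.7 * x * (1.0 - x) for x in state]


def run(state, nsteps):
    """Advance the toy model by nsteps time steps."""
    for _ in range(nsteps):
        state = step(state)
    return state


def write_pickup(state):
    """Serialize the full state as big-endian real*8 values."""
    return struct.pack(f">{len(state)}d", *state)


def read_pickup(blob):
    """Restore the state from a pickup written by write_pickup."""
    return list(struct.unpack(f">{len(blob) // 8}d", blob))


initial = [0.1, 0.25, 0.6]  # made-up initial state

# Run 1: a single run of 2 x 2 iterations.
pickup_4it = write_pickup(run(initial, 4))

# Runs 2 and 3: two consecutive runs of 2 iterations, restarted from a pickup.
pickup_mid = write_pickup(run(initial, 2))
pickup_2plus2 = write_pickup(run(read_pickup(pickup_mid), 2))

# Pass/fail criterion: the final pickups must be byte-identical.
restart_ok = pickup_4it == pickup_2plus2
print("restart test:", "pass" if restart_ok else "FAIL")
```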
> On Mon, Mar 02, 2009 at 11:49:13AM +0100, Martin Losch wrote:
>> Hi Jean-Michel:
>> On Mar 2, 2009, at 1:54 AM, Jean-Michel Campin wrote:
>>
>>> Hi Martin,
>>>
>>> I am a little bit confused:
>> so am I.
>>>
>>> If cg2d_rhs (sum or max, doesn't matter) is not identical,
>>> it means the state is different, and I guess if you do
>>> a diff of the final pickup (as I wrote earlier, this is what I
>>> consider to be the "true" answer), it will be different too.
>>> So, it seems to me that there is a more fundamental Pb with
>>> restart/pickup. Because only 3 or 4 correct digits for
>>> Sum(rhs) does not look very good.
>>>
>>> Could you try to run "../tools/do_tst_2+2" from
>>> MITgcm/verification, where the last SX8 testreport has run?
>>> I made some changes recently for the MPI restart test, and put an
>>> automatic restart test after the aces_ifc_mpi testreport
>>> (see the changes in tools/example_scripts/ACESgrid/aces_test_ifc_mpi).
>>> You don't need to recompile anything, so the cross-compiler issue
>>> should not be a problem.
>>> And if something in those scripts does not work on this platform,
>>> I would be happy to try to fix it.
>> Thanks for the modified do_tst_2+2 (BTW, tst_2+2 does not work on my
>> Apple/Leopard, some sed syntax issues, I think, but I did not have the
>> time to sort it out, as my sed skills are poor; does the script work on
>> other non-Linux platforms? I assume that the shell tools are
>> different, GNU vs. BSD Unix, etc.).
>>
>> I ran do_tst_2+2 on the SX8 for lab_sea (I will include these tests in
>> the weekly routine), and all four tests pass. I repeated the procedure
>> with grid rotation, and the tests still pass. So everything that is
>> tested in do_tst_2+2 seems to be perfectly OK. But does the script test
>> the Sum(rhs) numbers? (I am running the tests on all verification
>> experiments now and so far there are no fails, except for fizhi, where
>> the verification tests did not run, either.)
>>
>> Now I have to figure out why I am diagnosing wrong restarts in my
>> specific configuration. What do I need to do to run your scripts on my
>> non-verification configuration?
>>
>> Unrelated to the restart issues:
>> Over the weekend I solved at least one problem: I understood why
>> the loop counters for the copy of u(3)=u(1) matter for me. On the SX8
>> the default is to have SEAICE_VECTORIZE_LSR defined. Then tLev=3
>> (otherwise 1), and u(i,j-1,tLev,bi,bj) and u(i,j+1,tLev,bi,bj) are
>> actually used, and thus the overlap of u(3) is actually used. My
>> mistake!
>>
>> Further, there are some inconsistencies in the discretisation of the
>> metric terms in seaice_lsr.F; these lead to slight asymmetries in the
>> solutions when the solutions should be symmetric (e.g. I have put my
>> funnel/channel at the equator, so that everything should be symmetric
>> about the equator). I have not yet managed to get everything symmetric,
>> but one difficulty is that even the grid parameters, such as rA and fCori,
>> are not quite symmetric (at the truncation level). As a matter of fact,
>> when SEAICE_VECTORIZE_LSR is undefined, the solutions are even "more"
>> non-symmetric, so there is something in the LSOR algorithm itself that is
>> not quite consistent (probably just the solver accuracy).
>> However, remember the non-symmetric discretization in the B-grid
>> code? Once that was removed, Dimitris' problems with CS510 with the
>> B-grid LSR disappeared, so I would not be surprised if these asymmetries
>> in the metric terms cause the "spontaneous" explosions that I have been
>> talking about initially. I am testing this now and will check in my fixes
>> soon.
>>
>> To summarize:
>> 1. there are no restart problems in the verification experiments, as
>> diagnosed by do_tst_2+2
>> 2. the restart problem in my specific configuration (and the one that
>> Olaf Klatt uses, a regular lat lon grid) remains, the grid rotation
>> changes the behavior, but in the end the restarts fail here, too
>> (pickups are different)
>> 3. the overlap problem is solved, as usual I am the culprit
>> 4. the explosions may be caused by non-symmetric or even wrong
>> discretizations (by me, again) in the metric terms, but that is not yet
>> clear.
>>
>> I am attaching the code directory and more numbers in restarttest.out.
>> Maybe you have an idea why I am getting the wrong restarts (maybe it's
>> just in my diagnostics, which are not automated as in your script;
>> again, how can I use the script on my example?).
>>
>> Martin
>
>
>
>>> Cheers,
>>> Jean-Michel
>>>
>>> On Fri, Feb 27, 2009 at 05:48:47PM +0100, Martin Losch wrote:
>>>> Hi Jean-Michel,
>>>>
>>>> sorry, I was thrown off by the Sum(rhs); all other values that I checked
>>>> (dynstat_theta/uvel_min/max/mean/sd) agree perfectly (for both the
>>>> aggressive and minimal optimization), so for lab_sea the restart
>>>> seems to be OK. So I need to go to my configuration and do the checks
>>>> there (where there really are differences and the restart does not
>>>> work). I'll have to figure out what's different from lab_sea (parameters
>>>> mostly) and narrow down the problem; more to follow ...
>>>> Martin
>>>>
>>>> On Feb 27, 2009, at 5:31 PM, Martin Losch wrote:
>>>>
>>>>> Hi Jean-Michel,
>>>>>
>>>>> it's probably a good idea for me to first tackle the restart problem.
>>>>> Here's what I get on 1 CPU (two tiles, snx=2) with my aggressive
>>>>> optimization for lab_sea/input.lsr (output.0-10 is for a total of 10
>>>>> steps, output.5-10 is starting from a pickup at niter0=5)
>>>>> sx8::tr_run.lsr> grep cg2d: output.0-10
>>>>> [...]
>>>>> cg2d: Sum(rhs),rhsMax =   3.07698311274862E-13  1.19974476101239E+00
>>>>> cg2d: Sum(rhs),rhsMax =   4.01567668006919E-13  1.19252858573205E+00
>>>>> cg2d: Sum(rhs),rhsMax =   5.02708985550271E-13  1.18194572452171E+00
>>>>> cg2d: Sum(rhs),rhsMax =   6.01629857044372E-13  1.16776484963845E+00
>>>>> cg2d: Sum(rhs),rhsMax =   8.02802269106451E-13  1.15096778602035E+00
>>>>> sx8::tr_run.lsr> grep cg2d: output.5-10
>>>>> cg2d: Sum(rhs),rhsMax =   3.07975867031018E-13  1.19974476101239E+00
>>>>> cg2d: Sum(rhs),rhsMax =   4.01789712611844E-13  1.19252858573205E+00
>>>>> cg2d: Sum(rhs),rhsMax =   5.03430630516277E-13  1.18194572452171E+00
>>>>> cg2d: Sum(rhs),rhsMax =   6.03184169278848E-13  1.16776484963844E+00
>>>>> cg2d: Sum(rhs),rhsMax =   8.05300270911857E-13  1.15096778602035E+00
>>>>>
>>>>> and with the lowest possible optimization ("ssafe", only safe scalar
>>>>> optimization):
>>>>> sx8::tr_run.lsr> grep cg2d: output.0-10
>>>>> [...]
>>>>> cg2d: Sum(rhs),rhsMax =   3.05866443284231E-13  1.19974475698064E+00
>>>>> cg2d: Sum(rhs),rhsMax =   4.00179889226138E-13  1.19252857858165E+00
>>>>> cg2d: Sum(rhs),rhsMax =   5.01432229071952E-13  1.18194571749093E+00
>>>>> cg2d: Sum(rhs),rhsMax =   6.03017635825154E-13  1.16776484246162E+00
>>>>> cg2d: Sum(rhs),rhsMax =   8.00970401115819E-13  1.15096777725923E+00
>>>>> sx8::tr_run.lsr> grep cg2d: output.5-10
>>>>> cg2d: Sum(rhs),rhsMax =   3.05810932132999E-13  1.19974475698064E+00
>>>>> cg2d: Sum(rhs),rhsMax =   3.99458244260131E-13  1.19252857858165E+00
>>>>> cg2d: Sum(rhs),rhsMax =   5.01820807130571E-13  1.18194571749093E+00
>>>>> cg2d: Sum(rhs),rhsMax =   6.02740080068997E-13  1.16776484246162E+00
>>>>> cg2d: Sum(rhs),rhsMax =   8.03301869467532E-13  1.15096777725923E+00
>>>>>
>>>>> Note that in both cases the rhsMax values are identical after the
>>>>> pickup, but the Sum(rhs) are not (subtraction of large numbers?); with
>>>>> aggressive optimization I am losing one digit of precision (3 instead of
>>>>> 4, big deal). On eddy, both numbers are identical.
>>>>>
>>>>> Martin
>>>>>
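
The "subtraction of large numbers" guess can be made concrete with a small sketch in Python. In the output above, Sum(rhs) is of order 1e-13 while rhsMax is of order 1, so the sum is the small remainder of terms that nearly cancel. With made-up values (not model output) that behave the same way, a difference of one unit in the last place in a single term leaves the maximum unchanged at 15 printed digits, while the two sums agree to only about three digits. This does not say where such a difference would come from in the model; it only suggests why Sum(rhs) is a far more sensitive diagnostic than rhsMax.

```python
def naive_sum(values):
    """Left-to-right floating-point sum, as a simple loop would compute it."""
    total = 0.0
    for v in values:
        total += v
    return total


# Made-up numbers, chosen to be exactly representable in IEEE double precision.
rhs_a = [1.0, 2.0**-43, -1.0]
# Same values, except that the first one is larger by one unit in the last place.
rhs_b = [1.0 + 2.0**-52, 2.0**-43, -1.0]

sum_a, sum_b = naive_sum(rhs_a), naive_sum(rhs_b)
max_a = max(abs(v) for v in rhs_a)
max_b = max(abs(v) for v in rhs_b)
rel_diff = abs(sum_b - sum_a) / abs(sum_a)

# Print in the same style as the cg2d lines (15 significant digits).
print(f"Sum(rhs): {sum_a:.14E}  {sum_b:.14E}")
print(f"rhsMax:   {max_a:.14E}  {max_b:.14E}")
print(f"relative difference of the sums: {rel_diff:.3E}")
```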
>
>> _______________________________________________
>> MITgcm-devel mailing list
>> MITgcm-devel at mitgcm.org
>> http://mitgcm.org/mailman/listinfo/mitgcm-devel