[MITgcm-devel] (not so) funny things happen in seaice_lsr and pickups

Martin Losch Martin.Losch at awi.de
Mon Mar 2 05:49:13 EST 2009


Hi Jean-Michel:
On Mar 2, 2009, at 1:54 AM, Jean-Michel Campin wrote:

> Hi Martin,
>
> I am a little bit confused:
so am I.
>
> If cg2d_rhs (sum or max, doesn't matter) is not identical,
> it means the state is different, and I guess if you do
> a diff of the final pickup (as I wrote earlier, this is what I
> consider to be the "true" answer), it will be different too.
> So it seems to me that there is a more fundamental problem with
> restart/pickup, because only 3 or 4 correct digits for
> Sum(rhs) does not look very good.
>
> Could you try to run the "../tools/do_tst_2+2" from
> MITgcm/verification where the last SX8 testreport has run ?
> I made some changes recently for the MPI restart test, and put an
> automatic restart test after the aces_ifc_mpi testreport
> (see the changes in tools/example_scripts/ACESgrid/aces_test_ifc_mpi).
> You don't need to recompile anything, so the cross-compiler issue
> should not be a problem.
> And if something in those scripts does not work on this platform,
> I would be happy to try to fix it.
Thanks for the modified do_tst_2+2 (BTW, do_tst_2+2 does not work on my
Apple/Leopard machine, some sed syntax issue, I think, but I did not
have the time to sort it out, as my sed skills are poor; does the
script work on other non-Linux platforms? I assume that the shell
tools are different, GNU vs. BSD Unix, etc.).

I ran do_tst_2+2 on the SX8 for lab_sea (I will include these tests in
the weekly routine), and all four tests pass. I repeated the procedure
with grid rotation, and the tests still pass. So everything that is
tested in do_tst_2+2 seems to be perfectly OK. But does the script test
the Sum(rhs) numbers? (I am running the tests on all verification
experiments now and so far there are no failures, except for fizhi,
where the verification tests did not run either.)

Now I have to figure out why I am diagnosing wrong restarts in my
specific configuration. What do I need to do to run your scripts on my
non-verification configuration?

Unrelated to the restart issues:
Over the weekend I solved at least one problem: I understood why the
loop counters for the copy u(3)=u(1) matter for me. On the SX8 the
default is to have SEAICE_VECTORIZE_LSR defined. Then tLev=3 (otherwise
1), and u(i,j-1,tLev,bi,bj) and u(i,j+1,tLev,bi,bj) are referenced, so
the overlap of u(3) is indeed used. My mistake!
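
For illustration, here is a minimal sketch of that indexing detail
(made-up array sizes and a simplified one-tile layout, not the actual
seaice_lsr.F code): with tLev=3 the copy into the third time level has
to cover the overlap, because the j-1/j+1 neighbours of the first and
last interior rows are read from there.

      PROGRAM lsrovlap
C     Minimal sketch (made-up sizes, simplified one-tile arrays,
C     not the actual seaice_lsr.F code) of why the copy into time
C     level 3 must include the overlap when tLev = 3: the j-1/j+1
C     neighbours of the first and last interior rows live there.
      IMPLICIT NONE
      INTEGER sNx, sNy, OLx, OLy
      PARAMETER ( sNx = 4, sNy = 4, OLx = 2, OLy = 2 )
      INTEGER i, j, tLev
      REAL*8 u(1-OLx:sNx+OLx,1-OLy:sNy+OLy,3)
      REAL*8 rhs(sNx,sNy)
C     with SEAICE_VECTORIZE_LSR defined, tLev = 3 (otherwise 1)
      tLev = 3
C     fill time level 1 everywhere, including the overlap
      DO j = 1-OLy, sNy+OLy
       DO i = 1-OLx, sNx+OLx
        u(i,j,1) = DBLE(i) + 10.D0*DBLE(j)
        u(i,j,2) = 0.D0
        u(i,j,3) = 0.D0
       ENDDO
      ENDDO
C     the copy u(3) = u(1) must cover the overlap as well; with
C     interior-only loop bounds the j-1/j+1 accesses below would
C     read the zeros left in the overlap of level 3
      DO j = 1-OLy, sNy+OLy
       DO i = 1-OLx, sNx+OLx
        u(i,j,3) = u(i,j,1)
       ENDDO
      ENDDO
C     one sweep of a line-relaxation-like stencil: the neighbours
C     are taken from time level tLev, so for j = 1 and j = sNy the
C     overlap rows of u(:,:,tLev) are actually used
      DO j = 1, sNy
       DO i = 1, sNx
        rhs(i,j) = u(i,j-1,tLev) + u(i,j+1,tLev)
       ENDDO
      ENDDO
      PRINT *, 'rhs(1,1), rhs(1,sNy) = ', rhs(1,1), rhs(1,sNy)
      END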

Further, there are some inconsistencies in the discretisation of the
metric terms in seaice_lsr.F; these lead to slight asymmetries in the
solutions when the solutions should be symmetric (e.g. I have put my
funnel/channel at the equator, so that everything should be symmetric
about the equator). I have not yet managed to get everything symmetric,
but one difficulty is that even the grid parameters, such as rA and
fCori, are not quite symmetric (at the truncation level). As a matter
of fact, when SEAICE_VECTORIZE_LSR is undefined, the solutions are even
"more" non-symmetric, so there is something in the LSOR algorithm
itself that is not quite consistent (probably just the solver
accuracy).
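
To show what I mean by truncation-level asymmetry, here is a minimal
sketch with a made-up 1-D stand-in grid (Ny, yC, dyC and the formulas
below are assumptions for the illustration; only the names rA and
fCori are taken from the model): when the latitudes are accumulated
from the southern end, the expected mirror (anti-)symmetry about the
equator only holds up to rounding.

      PROGRAM symcheck
C     Minimal sketch (made-up 1-D stand-in grid, not MITgcm code) of
C     a symmetry check: for a channel centred on the equator, rA(j)
C     should equal rA(Ny+1-j) and fCori(j) should equal -fCori(Ny+1-j).
C     Because the latitudes are accumulated from the southern end,
C     the mirror symmetry only holds to truncation level.
      IMPLICIT NONE
      INTEGER Ny, j
      PARAMETER ( Ny = 32 )
      REAL*8 yC(Ny), rA(Ny), fCori(Ny)
      REAL*8 dyC, deg2rad, omega, rSphere, errA, errF
      PARAMETER ( omega = 7.292115D-5, rSphere = 6.37D6 )
      dyC     = 1.D0/3.D0
      deg2rad = 4.D0*ATAN(1.D0)/180.D0
C     accumulate cell-centre latitudes from the southern boundary
      yC(1) = ( 0.5D0 - 0.5D0*DBLE(Ny) )*dyC
      DO j = 2, Ny
       yC(j) = yC(j-1) + dyC
      ENDDO
C     crude stand-ins for cell area and Coriolis parameter
      DO j = 1, Ny
       rA(j)    = rSphere*dyC*COS( yC(j)*deg2rad )
       fCori(j) = 2.D0*omega*SIN( yC(j)*deg2rad )
      ENDDO
C     residuals of the expected mirror (anti-)symmetry
      errA = 0.D0
      errF = 0.D0
      DO j = 1, Ny
       errA = MAX( errA, ABS( rA(j)    - rA(Ny+1-j)    ) )
       errF = MAX( errF, ABS( fCori(j) + fCori(Ny+1-j) ) )
      ENDDO
      PRINT *, 'max asymmetry in rA    :', errA
      PRINT *, 'max asymmetry in fCori :', errF
      END

The residuals this reports are tiny, of course, but they are the kind
of truncation-level asymmetry in the grid that I mean; the real check
would loop over the model's 2-D rA and fCori arrays instead.
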
However, remember the non-symmetric discretization in the B-grid code?
Once that was removed, Dimitris' problems with CS510 with the B-grid
LSR disappeared, so I would not be surprised if these asymmetries in
the metric terms cause the "spontaneous" explosions that I have been
talking about initially. I am testing this now and will check in my
fixes soon.

To summarize:
1. there are no restart problems in the verification experiments, as
diagnosed by do_tst_2+2
2. the restart problem in my specific configuration (and the one that
Olaf Klatt uses, a regular lat-lon grid) remains; the grid rotation
changes the behavior, but in the end the restarts fail here, too
(pickups are different)
3. the overlap problem is solved; as usual, I am the culprit
4. the explosions may be caused by non-symmetric or even wrong
discretizations (by me, again) of the metric terms, but that is not
yet clear.

I am attaching the code directory and more numbers in restarttest.out.
Maybe you have an idea why I am getting the wrong restarts (maybe it is
just in my diagnostics, which are not automated as in your script;
again, how can I use the script on my example?).
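
For reference, the kind of crude pickup comparison I have in mind is
something like this (a sketch only: the file names are placeholders,
RECL in bytes is compiler dependent, and the real MDS pickups come
with .meta files and tile structure that this ignores):

      PROGRAM cmppickup
C     Crude sketch of a by-hand restart check (assumptions: both
C     files are plain real*8 streams of equal length written on the
C     same machine, and RECL is counted in bytes, which is compiler
C     dependent; real MDS pickups have .meta files and tile structure
C     that this ignores).
      IMPLICIT NONE
      INTEGER i, ios1, ios2
      REAL*8 a, b, dmax
      OPEN(11, FILE='pickup1.data', ACCESS='DIRECT', RECL=8,
     &     STATUS='OLD')
      OPEN(12, FILE='pickup2.data', ACCESS='DIRECT', RECL=8,
     &     STATUS='OLD')
      dmax = 0.D0
      i    = 0
   10 CONTINUE
      i = i + 1
      READ(11, REC=i, IOSTAT=ios1) a
      READ(12, REC=i, IOSTAT=ios2) b
      IF ( ios1.EQ.0 .AND. ios2.EQ.0 ) THEN
       dmax = MAX( dmax, ABS(a-b) )
       GOTO 10
      ENDIF
      PRINT *, 'records compared   :', i-1
      PRINT *, 'max abs difference :', dmax
      END

When the pickups are supposed to be bitwise identical, a plain
"cmp pickup1.data pickup2.data" does the same job, of course.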

Martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: restarttest.out
Type: application/octet-stream
Size: 5799 bytes
Desc: not available
URL: <http://mitgcm.org/pipermail/mitgcm-devel/attachments/20090302/64cd00dd/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: code.tgz
Type: application/octet-stream
Size: 12223 bytes
Desc: not available
URL: <http://mitgcm.org/pipermail/mitgcm-devel/attachments/20090302/64cd00dd/attachment-0001.obj>
-------------- next part --------------
> Cheers,
> Jean-Michel
>
> On Fri, Feb 27, 2009 at 05:48:47PM +0100, Martin Losch wrote:
>> Hi Jean-Michel,
>>
>> sorry, I was thrown off by the Sum(rhs); all other values that I
>> checked (dynstat_theta/uvel_min/max/mean/sd) agree perfectly (for
>> both the aggressive and the minimal optimization), so for lab_sea
>> the restart seems to be OK. So I need to go to my configuration and
>> do the checks there (where there really are differences and the
>> restart does not work). I'll have to figure out what's different
>> from lab_sea (mostly parameters) and narrow down the problem, more
>> to follow ...
>> Martin
>>
>> On Feb 27, 2009, at 5:31 PM, Martin Losch wrote:
>>
>>> Hi Jean-Michel,
>>>
>>> it's probably a good idea for me to first tackle the restart  
>>> problem.
>>> Here's what I get on 1CPU (two tiles, snx=2) with my aggressive
>>> optimization for lab_sea/input.lsr (output.0-10 is for a total of 10
>>> steps, output.5-10 is starting from a pickup at niter0=5)
>>> sx8::tr_run.lsr> grep cg2d: output.0-10
>>> [...]
>>> cg2d: Sum(rhs),rhsMax =   3.07698311274862E-13  1.19974476101239E+00
>>> cg2d: Sum(rhs),rhsMax =   4.01567668006919E-13  1.19252858573205E+00
>>> cg2d: Sum(rhs),rhsMax =   5.02708985550271E-13  1.18194572452171E+00
>>> cg2d: Sum(rhs),rhsMax =   6.01629857044372E-13  1.16776484963845E+00
>>> cg2d: Sum(rhs),rhsMax =   8.02802269106451E-13  1.15096778602035E+00
>>> sx8::tr_run.lsr> grep cg2d: output.5-10
>>> cg2d: Sum(rhs),rhsMax =   3.07975867031018E-13  1.19974476101239E+00
>>> cg2d: Sum(rhs),rhsMax =   4.01789712611844E-13  1.19252858573205E+00
>>> cg2d: Sum(rhs),rhsMax =   5.03430630516277E-13  1.18194572452171E+00
>>> cg2d: Sum(rhs),rhsMax =   6.03184169278848E-13  1.16776484963844E+00
>>> cg2d: Sum(rhs),rhsMax =   8.05300270911857E-13  1.15096778602035E+00
>>>
>>> and with the lowest possible optimization ("ssafe", i.e. only safe
>>> scalar optimizations):
>>> sx8::tr_run.lsr> grep cg2d: output.0-10
>>> [...]
>>> cg2d: Sum(rhs),rhsMax =   3.05866443284231E-13  1.19974475698064E+00
>>> cg2d: Sum(rhs),rhsMax =   4.00179889226138E-13  1.19252857858165E+00
>>> cg2d: Sum(rhs),rhsMax =   5.01432229071952E-13  1.18194571749093E+00
>>> cg2d: Sum(rhs),rhsMax =   6.03017635825154E-13  1.16776484246162E+00
>>> cg2d: Sum(rhs),rhsMax =   8.00970401115819E-13  1.15096777725923E+00
>>> sx8::tr_run.lsr> grep cg2d: output.5-10
>>> cg2d: Sum(rhs),rhsMax =   3.05810932132999E-13  1.19974475698064E+00
>>> cg2d: Sum(rhs),rhsMax =   3.99458244260131E-13  1.19252857858165E+00
>>> cg2d: Sum(rhs),rhsMax =   5.01820807130571E-13  1.18194571749093E+00
>>> cg2d: Sum(rhs),rhsMax =   6.02740080068997E-13  1.16776484246162E+00
>>> cg2d: Sum(rhs),rhsMax =   8.03301869467532E-13  1.15096777725923E+00
>>>
>>> Note that in both cases the rhsMax values are identical after the
>>> pickup, but the Sum(rhs) values are not (subtraction of large
>>> numbers?); with aggressive optimization I am losing one digit of
>>> precision (3 instead of 4, big deal). On eddy, both numbers are
>>> identical.
>>>
>>> Martin
>>>
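
To illustrate the "subtraction of large numbers" point in the quoted
message above (a made-up example, not model output): when the rhs
entries are O(1) but their sum is O(1.E-13), the computed Sum(rhs) is
only determined to a few digits; here the summation order stands in
for the last-bit differences a restart or a different optimization
level can introduce, while Max|rhs| is insensitive to all of this.

      PROGRAM sumorder
C     Made-up illustration of the 'subtraction of large numbers'
C     point: the entries are O(1) but their sum is O(1.E-13), so
C     rounding at the level of a few digits, here simply the order
C     of summation, changes Sum(rhs) while Max|rhs| is unaffected.
      IMPLICIT NONE
      INTEGER n, i
      PARAMETER ( n = 5001 )
      REAL*8 rhs(n), sfwd, sbwd, rmax
C     entries of order one that cancel pairwise, plus a tiny residual
      rhs(1) = 3.D-13
      DO i = 2, n-1, 2
       rhs(i)   =  1.D0 + 1.D0/DBLE(i)
       rhs(i+1) = -rhs(i)
      ENDDO
      sfwd = 0.D0
      sbwd = 0.D0
      rmax = 0.D0
      DO i = 1, n
       sfwd = sfwd + rhs(i)
       sbwd = sbwd + rhs(n+1-i)
       rmax = MAX( rmax, ABS(rhs(i)) )
      ENDDO
      PRINT *, 'Sum(rhs), forward  :', sfwd
      PRINT *, 'Sum(rhs), backward :', sbwd
      PRINT *, 'Max|rhs|           :', rmax
      END

The exact digits depend on compiler and platform, but the forward and
backward sums typically agree to only a few digits, much like the
Sum(rhs) lines above, while the maximum is identical by construction.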

