[MITgcm-support] tutorial_global_oce_optim optimisation failed

Andrew McRae andrew.mcrae at physics.ox.ac.uk
Fri May 4 11:11:18 EDT 2018


And, still no luck(?)

Running for a year (swapping the commented and uncommented nTimeSteps and
lastinterval settings in data and data.cost), optim.x (lsopt+optim,
*not* optim_m1qn3) now gives the output

  cost function............... 0.60514949E+01
  norm of x................... 0.00000000E+00
  norm of g................... 0.23235517E+00

  optimization stopped because :
  ifail =   4    the search direction is not a descent one


On 4 May 2018 at 13:58, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:

> On 4 May 2018 at 06:04, Patrick Heimbach <heimbach at mit.edu> wrote:
>
>> Hi Matt,
>>
>> as you indicated, all is still good, and I suspect the same thing you did
>> regarding what might be at issue.
>>
>> I just downloaded latest MITgcm, re-ran adjoint, and conducted 2
>> iterations (using lsopt).
>>
>> It still works "out of the box" ... if one realizes that a manual is part
>> of that "box", and section 3.18 (old manual prior to readthedocs) has some
>> description of this tutorial, thanks to dfer (admittedly somewhat out of
>> date, but still mostly relevant). In particular, it says there that the
>> optimization was conducted for a 1-year simulation.
>>
>
> Okay, thanks.  I interpreted the manual footnote as "running a 1-year
> simulation will reproduce the scientifically-interesting graphs in the
> manual", not as "the default parameters are *only* useful for verifying
> correctness of the adjoint, but will break the optimisation routine".  I'll
> see if I have more success with the longer run.
>
>
>>
>> Since we do not want to conduct 1-year integrations for *any* of the
>> tutorials within our regression tests (these tests consist of 90 forward,
>> 24 adjoint/TAF, 10 adjoint/OpenAD, and 16 tangent-linear/TAF
>> configurations, each needing to be compiled and executed), we have
>> shortened the number of time steps to 10 (= 10 days) to perform efficient
>> nightly regression tests of the adjoint. Keeping that shortened number of
>> time steps means optimizing in the noise - in fact, the cost function goes
>> up in that case.
>>
>> That the user's cost function does not change at all suggests a more
>> basic problem though (hard to speculate what it might be).
>>
>> I made a quick test by extending nTimeSteps from 10 to 90 (i.e. 90 days),
>> which leads to cost reduction as desired. Namely, for:
>>  numiter=1,
>>  nfunc=3,
>>  fmin=5.74,
>> (values in data.optim that comes with tutorial_global_oce_optim)
>> I obtain the following costs:
>> iter. 0: fc =  0.184199260445164D+02
>> iter. 1: fc =  0.130860446841901D+02
>> iter. 2: fc =  0.979374136987667D+01
>>
>> I did that test "by hand", i.e. not using the script cycsh also provided
>> (see manual). Doing so by hand requires two more lines in data.ctrl:
>>  &CTRL_PACKNAMES
>>  costname='ecco_cost',
>>  ctrlname='ecco_ctrl',
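>>
>> Done by hand, one cycle (iteration 0 -> 1, say) then looks roughly like
>> the sketch below; file names are the tutorial defaults, and the
>> iteration-counter variable (optimcycle) is from memory - check it against
>> your data.optim:
>>
>>  ./mitgcmuv_ad                     # writes ecco_cost_MIT_CE_000.opt0000
>>  cp ecco_cost_MIT_CE_000.opt0000 ecco_ctrl_MIT_CE_000.opt0000 OPTIM/
>>  ( cd OPTIM && ./optim.x )         # writes ecco_ctrl_MIT_CE_000.opt0001
>>  cp OPTIM/ecco_ctrl_MIT_CE_000.opt0001 .
>>  # now set optimcycle=1 in data.optim before rerunning ./mitgcmuv_ad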
>>
>> Since gradients produced with TAF are extremely similar (10+ digits?) to
>> those produced with OpenAD (see the results/ directory, which has both TAF
>> and OpenAD reference results), I expect it to work with OpenAD too (I have
>> not tested that just now).
>>
>> -Patrick
>>
>>
>>
>> > On May 2, 2018, at 12:34 PM, Andrew McRae <
>> andrew.mcrae at physics.ox.ac.uk> wrote:
>> >
>> > Thanks for this.
>> >
>> > Just as a sanity check, before I involve optim_m1qn3 again, the output
>> of my ./testreport -t tutorial_global_oce_optim -oad includes
>> >
>> > There were 16 decimal places of similarity for "ADM CostFct"
>> > There were 16 decimal places of similarity for "ADM Ad Grad"
>> > There were 0 decimal places of similarity for "ADM FD Grad"
>> >
>> > Should I be concerned about this?
>> >
>> > E.g. lines 2116-2118 of my output_oadm.txt file are
>> >
>> > (PID.TID 0000.0001)  ADM  ref_cost_function      =  6.20023228182329E+00
>> > (PID.TID 0000.0001)  ADM  adjoint_gradient       = -2.69091500991183E-06
>> > (PID.TID 0000.0001)  ADM  finite-diff_grad       =  0.00000000000000E+00
>> >
>> > But at least my cost function value is the same:
>> >
>> > (PID.TID 0000.0001)   local fc =  0.620023228182329D+01
>> > (PID.TID 0000.0001)  global fc =  0.620023228182329D+01
>> >
>> > Andrew
>> >
>> > On 2 May 2018 at 10:34, Martin Losch <Martin.Losch at awi.de> wrote:
>> > Hi Andrew,
>> >
>> > I won’t be able to help you much with the optim/lsopt code, because I
>> would have to get it running again myself. But I do recommend using the
>> MITgcm_contrib/mlosch/optim_m1qn3 code. It’s not very well documented,
>> but I am attaching a skeleton script to illustrate how to use it. Please
>> give it a try and if you find it useful, I can add this script to the
>> repository.
>> >
>> > The two versions of the optimization routine are similar: both
>> > implement the same optimization algorithm (BFGS), but optim_m1qn3 uses a
>> > later version of the m1qn3 code. I think it’s easier to compile (only one
>> > Makefile), and I believe (though there’s debate about this) that it does
>> > the right thing, as opposed to the optim/lsopt variant, which somehow
>> > truncates the optimization in each iteration. Having said that, I have
>> > used both in parallel, and the reduction of the cost function (which is
>> > really all we care about) is sometimes better with the optim_m1qn3 code,
>> > sometimes better with the optim/lsopt code. The optim_m1qn3 code is
>> > closer to the idea of the original m1qn3 code.
>> >
>> > Let me know if you can use my attached instructions.
>> >
>> > Martin
>> >
>> >
>> >
>> > > On 1. May 2018, at 00:00, Andrew McRae <andrew.mcrae at physics.ox.ac.uk>
>> wrote:
>> > >
>> > > Right, but the cost function is the same value each time, the norm of
>> x is 0 each time, and the norm of g is the same each time.  This suggests
>> nothing is happening.  It's a bit ridiculous that one of the core tutorials
>> simply isn't working out of the box...
>> > >
>> > > I will have a go at debugging.
>> > >
>> > > Andrew
>> > >
>> > > On 30 April 2018 at 22:54, Matthew Mazloff <mmazloff at ucsd.edu> wrote:
>> > > Well, you are correct that it's not actually taking a step, because the
>> > > norm of the control vector is 0:
>> > >>> norm of x................... 0.00000000E+00
>> > > meaning the controls are all still 0.
>> > >
>> > > However, the gradients are non-zero
>> > >>> norm of g................... 0.12730927E-01
>> > > so the line search should take a step, and
>> > > ecco_ctrl_MIT_CE_000.opt0001
>> > > should not be all zero.
>> > >
>> > > To debug this, you could put a print statement in optim_writedata.F to
>> > > see what it is writing...
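>> > >
>> > > Without recompiling, a quick sanity check is to dump the packed
>> > > control file as floats, e.g.
>> > >
>> > >  od -f ecco_ctrl_MIT_CE_000.opt0001 | head
>> > >
>> > > od collapses repeated lines into a single "*", so an all-zero file
>> > > shows just one row of zeros; the byte order may be wrong for reading
>> > > actual values, but all-zero bytes are all-zero either way.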
>> > >
>> > > I don’t know enough about this tutorial to be of more help, sorry.
>> > >
>> > > Matt
>> > >
>> > >
>> > >> On Apr 30, 2018, at 2:50 PM, Andrew McRae <
>> andrew.mcrae at physics.ox.ac.uk> wrote:
>> > >>
>> > >> Yes, I did.
>> > >>
>> > >> On 30 April 2018 at 22:42, Matthew Mazloff <mmazloff at ucsd.edu>
>> wrote:
>> > >> This is still iteration 0. You have to update data.optim to tell it
>> you are now at iteration 1
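>> > >>
>> > >> i.e. bump the iteration counter in data.optim; if I remember the
>> > >> variable name right (check your copy of the file), something like
>> > >>
>> > >>  &OPTIM
>> > >>   optimcycle=1,
>> > >>  &
>> > >>
>> > >> leaving the other entries unchanged.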
>> > >>
>> > >> Matt
>> > >>
>> > >>
>> > >>> On Apr 30, 2018, at 2:38 PM, Andrew McRae <
>> andrew.mcrae at physics.ox.ac.uk> wrote:
>> > >>>
>> > >>> I tried a few steps of this, but the output of optim.x always has
>> > >>>
>> > >>>   cost function............... 0.62002323E+01
>> > >>>   norm of x................... 0.00000000E+00
>> > >>>   norm of g................... 0.12730927E-01
>> > >>>
>> > >>> near the end, with no decrease in the cost function.  So I guess
>> it's not actually taking the step?
>> > >>>
>> > >>> Andrew
>> > >>>
>> > >>> On 27 April 2018 at 18:04, Andrew McRae <
>> andrew.mcrae at physics.ox.ac.uk> wrote:
>> > >>> !!!  Okay...
>> > >>>
>> > >>> Yes, it produced the .opt0001 file.  I'll see how this goes.
>> > >>>
>> > >>> Thanks,
>> > >>> Andrew
>> > >>>
>> > >>> On 27 April 2018 at 17:57, Matthew Mazloff <mmazloff at ucsd.edu>
>> wrote:
>> > >>> Hello
>> > >>>
>> > >>> It’s been a while, but I am pretty sure that is the normal output. It
>> > >>> says “fail”, but it did give you a new ecco_ctrl_MIT_CE_000.opt0001
>> > >>> (correct?), and if you unpack it and run, the cost will likely descend.
>> > >>>
>> > >>> I think it worked correctly. lsopt/optim are just confusing... but I
>> > >>> think it’s working. I think all is good!
>> > >>>
>> > >>> Matt
>> > >>>
>> > >>>
>> > >>>
>> > >>>> On Apr 27, 2018, at 8:25 AM, Andrew McRae <
>> andrew.mcrae at physics.ox.ac.uk> wrote:
>> > >>>>
>> > >>>> Just separating this from the other thread, I got the bundled
>> MITgcm optim routine built (having made these changes, based on this thread
>> from 2010 and this one from 2016).
>> > >>>>
>> > >>>> I use OpenAD to create the adjoint.
>> > >>>>
>> > >>>> My steps are (collected as a shell sketch after this list):
>> > >>>> 1) in the build directory, run ../../../tools/genmake2 -oad -mods=../code_oad
>> > >>>> 2) run make depend and make adAll
>> > >>>> 3) copy input_oad/ into a new folder scratch/
>> > >>>> 4) within scratch/, run ./prepare_run
>> > >>>> 5) copy mitgcmuv_ad from build/ into scratch/, and copy optim.x into scratch/OPTIM/
>> > >>>> 6) run ./mitgcmuv_ad
>> > >>>> 7) in scratch/OPTIM, create symlinks to ../data.optim and ../data.ctrl
>> > >>>> 8) copy the files ecco_cost_MIT_CE_000.opt0000 and ecco_ctrl_MIT_CE_000.opt0000 into the OPTIM subdirectory
>> > >>>> 9) run ./optim.x within the subdirectory
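>> > >>>>
>> > >>>> The shell sketch (paths relative to the tutorial directory; where
>> > >>>> optim.x was built is left out here):
>> > >>>>
>> > >>>>  cd build
>> > >>>>  ../../../tools/genmake2 -oad -mods=../code_oad
>> > >>>>  make depend && make adAll
>> > >>>>  cd .. && cp -r input_oad scratch && cd scratch
>> > >>>>  ./prepare_run
>> > >>>>  cp ../build/mitgcmuv_ad .
>> > >>>>  cp /path/to/optim.x OPTIM/      # from wherever optim was built
>> > >>>>  ./mitgcmuv_ad
>> > >>>>  cd OPTIM
>> > >>>>  ln -s ../data.optim ../data.ctrl .
>> > >>>>  cp ../ecco_cost_MIT_CE_000.opt0000 ../ecco_ctrl_MIT_CE_000.opt0000 .
>> > >>>>  ./optim.x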
>> > >>>>
>> > >>>> The full output is attached, but I assume the optimisation failed
>> since the last lines are
>> > >>>>
>> > >>>>   optimization stopped because :
>> > >>>>   ifail =   4    the search direction is not a descent one
>> > >>>>
>> > >>>> Any ideas?  (I guess this isn't something that is tested in the
>> daily builds?)
>> > >>>>
>> > >>>> In the meantime, I'll try the m1qn3 routine as in the other
>> > >>>> thread, which should help distinguish between a problem with the
>> > >>>> optimisation routine and one with the gradient generated by mitgcmuv_ad.
>> > >>>>
>> > >>>> Andrew
>> > >>>> <out.txt>