[MITgcm-support] tutorial_global_oce_optim optimisation failed
Patrick Heimbach
heimbach at mit.edu
Sat May 5 15:12:25 EDT 2018
A quick update:
This tutorial works as advertised (in the manual), but not as "hoped".
What I mean is that it was developed, and has only ever been fully tested and used in optimization mode, with TAF-generated code (and that's what's documented in the manual).
Of course, it should not make a difference whether we use TAF vs. OpenAD, as long as gradients are correct. But as it turns out, with the OpenAD code there appears to be a little glitch. The gradient seems correct, and the iteration 1 update is properly read in, but it is then not used (instead it is reset to zero). Oh well. I'll need to check where that happens, so stay tuned.
p.
> On May 4, 2018, at 10:11 AM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>
> And, still no luck(?)
>
> Running for a year (switching the commented and uncommented nTimeSteps and lastinterval declarations in data and data.cost), optim.x (lsopt+optim, not optim_m1qn3) now gives the output
>
> cost function............... 0.60514949E+01
> norm of x................... 0.00000000E+00
> norm of g................... 0.23235517E+00
>
> optimization stopped because :
> ifail = 4 the search direction is not a descent one
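>
> (For reference, the one-year switch just toggles commented alternatives that are already in the input files, roughly:
>
>  # data, PARM03: run for one year instead of 10 days
>  nTimeSteps=360,
>  # data.cost: evaluate the cost over the last year
>  lastinterval=31104000.,
>
> where the numbers assume the tutorial's 1-day time step and a 360-day year; take the exact values from the commented lines in the files themselves.)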
>
>
> On 4 May 2018 at 13:58, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
> On 4 May 2018 at 06:04, Patrick Heimbach <heimbach at mit.edu> wrote:
> Hi Matt,
>
> as you indicated, all is still good, and I suspect the same thing you did regarding what might be at issue.
>
> I just downloaded the latest MITgcm, re-ran the adjoint, and conducted 2 iterations (using lsopt).
>
> It still works "out of the box" ... if one realizes that a manual is part of that "box". Section 3.18 (old manual, prior to readthedocs) has some description of this tutorial, thanks to dfer (admittedly somewhat out of date, but still mostly relevant). In particular, it says there that the optimization was conducted for a 1-year simulation.
>
> Okay, thanks. I interpreted the manual footnote as "running a 1-year simulation will reproduce the scientifically-interesting graphs in the manual", not as "the default parameters are only useful for verifying correctness of the adjoint, but will break the optimisation routine". I'll see if I have more success with the longer run.
>
>
> Since we do not want to conduct 1-year integrations for *any* of the tutorials within our regression tests (these tests consist of 90 forward, 24 adjoint/TAF, 10 adjoint/OpenAD, and 16 tangent-linear/TAF configurations, each needing to be compiled and executed), we have shortened the number of time steps to 10 (= 10 days) to permit efficient nightly regression testing of the adjoint. Not changing the number of time steps leads to optimizing in the noise - in fact, the cost function goes up in that case.
>
> That the user's cost function does not change at all suggests a more basic problem though (hard to speculate what it might be).
>
> I made a quick test by extending nTimeSteps from 10 to 90 (i.e. from 10 to 90 days), which leads to cost reduction as desired, namely, for:
> numiter=1,
> nfunc=3,
> fmin=5.74,
> (values in data.optim that comes with tutorial_global_oce_optim)
> I obtain the following costs:
> iter. 0: fc = 0.184199260445164D+02
> iter. 1: fc = 0.130860446841901D+02
> iter. 2: fc = 0.979374136987667D+01
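>
> For reference, the corresponding &OPTIM namelist in data.optim then looks something like this (quoting from memory; check the file that ships with the tutorial):
>
>  &OPTIM
>  optimcycle=0,
>  numiter=1,
>  nfunc=3,
>  fmin=5.74,
>  &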
>
> I did that test "by hand", i.e. not using the script cycsh also provided (see manual). Doing so by hand requires two more lines in data.ctrl:
> &CTRL_PACKNAMES
> costname='ecco_cost',
> ctrlname='ecco_ctrl',
> &
>
> Since gradients produced with TAF are extremely similar (10+ digits?) to those produced with OpenAD (see the results/ directory, which has both TAF and OpenAD reference results), I expect it to work with OpenAD too (I have not tested that just now).
>
> -Patrick
>
>
>
> > On May 2, 2018, at 12:34 PM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
> >
> > Thanks for this.
> >
> > Just as a sanity check, before I involve optim_m1qn3 again, the output of my ./testreport -t tutorial_global_oce_optim -oad includes
> >
> > There were 16 decimal places of similarity for "ADM CostFct"
> > There were 16 decimal places of similarity for "ADM Ad Grad"
> > There were 0 decimal places of similarity for "ADM FD Grad"
> >
> > Should I be concerned about this?
> >
> > E.g. lines 2116-2118 of my output_oadm.txt file are
> >
> > (PID.TID 0000.0001) ADM ref_cost_function = 6.20023228182329E+00
> > (PID.TID 0000.0001) ADM adjoint_gradient = -2.69091500991183E-06
> > (PID.TID 0000.0001) ADM finite-diff_grad = 0.00000000000000E+00
> >
> > But at least my cost function value is the same:
> >
> > (PID.TID 0000.0001) local fc = 0.620023228182329D+01
> > (PID.TID 0000.0001) global fc = 0.620023228182329D+01
> >
> > Andrew
> >
> > On 2 May 2018 at 10:34, Martin Losch <Martin.Losch at awi.de> wrote:
> > Hi Andrew,
> >
> > I won’t be able to help you much with the optim/lsopt code, because I would have to get it running again myself. But I do recommend using the MITgcm_contrib/mlosch/optim_m1qn3 code. It’s not very well documented, but I am attaching a skeleton script to illustrate how to use it. Please give it a try and if you find it useful, I can add this script to the repository.
> >
> > The two versions of the optimization routine are similar: both implement the same optimization algorithm (BFGS), but optim_m1qn3 uses a later version of the m1qn3 code. I think it’s easier to compile (only one Makefile), and I believe (but there’s debate about this) that it does the right thing, as opposed to the optim/lsopt variant, which somehow truncates the optimization in each iteration. Having said that, I have used both in parallel, and the reduction of the cost function (which is really all we care about) is sometimes better with the optim_m1qn3 code, sometimes better with the optim/lsopt code. The optim_m1qn3 code is closer to the idea of the original m1qn3 code.
> >
> > Let me know if you can use my attached instructions.
> >
> > Martin
> >
> >
> >
> > > On 1. May 2018, at 00:00, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
> > >
> > > Right, but the cost function is the same value each time, the norm of x is 0 each time, and the norm of g is the same each time. This suggests nothing is happening. It's a bit ridiculous that one of the core tutorials simply isn't working out of the box...
> > >
> > > I will have a go at debugging.
> > >
> > > Andrew
> > >
> > > On 30 April 2018 at 22:54, Matthew Mazloff <mmazloff at ucsd.edu> wrote:
> > > Well, you are correct that it's not actually taking a step, because the dot product of the control is 0:
> > >>> norm of x................... 0.00000000E+00
> > > meaning the controls are all 0 still.
> > >
> > > However the gradients are non-zero
> > >>> norm of g................... 0.12730927E-01
> > > so the linesearch should step and
> > > ecco_ctrl_MIT_CE_000.opt0001
> > > should not be all zero.
> > >
> > > To debug this you could put a print statement in optim_writedata.F to see what it is writing…
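> > >
> > > For example, something like this next to the write (a sketch only: optimcycle and xx are placeholders for whatever names optim_writedata.F actually uses):
> > >
> > > c-- debug: dump a few entries of the control vector before it is written
> > >       print *, 'optim_writedata: optimcycle =', optimcycle
> > >       print *, 'optim_writedata: xx(1:5)    =', xx(1:5)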
> > >
> > > I don’t know enough about this tutorial to be a bigger help, sorry
> > >
> > > Matt
> > >
> > >
> > >> On Apr 30, 2018, at 2:50 PM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
> > >>
> > >> Yes, I did.
> > >>
> > >> On 30 April 2018 at 22:42, Matthew Mazloff <mmazloff at ucsd.edu> wrote:
> > >> This is still iteration 0. You have to update data.optim to tell it you are now at iteration 1.
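> > >>
> > >> That is, bump the cycle counter in data.optim, something like (assuming the usual &OPTIM namelist):
> > >>
> > >>  &OPTIM
> > >>  optimcycle=1,
> > >>  &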
> > >>
> > >> Matt
> > >>
> > >>
> > >>> On Apr 30, 2018, at 2:38 PM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
> > >>>
> > >>> I tried a few steps of this, but the output of optim.x always has
> > >>>
> > >>> cost function............... 0.62002323E+01
> > >>> norm of x................... 0.00000000E+00
> > >>> norm of g................... 0.12730927E-01
> > >>>
> > >>> near the end, with no decrease in the cost function. So I guess it's not actually taking the step?
> > >>>
> > >>> Andrew
> > >>>
> > >>> On 27 April 2018 at 18:04, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
> > >>> !!! Okay...
> > >>>
> > >>> Yes, it produced the .opt0001 file. I'll see how this goes.
> > >>>
> > >>> Thanks,
> > >>> Andrew
> > >>>
> > >>> On 27 April 2018 at 17:57, Matthew Mazloff <mmazloff at ucsd.edu> wrote:
> > >>> Hello
> > >>>
> > >>> It's been a while, but I am pretty sure that is the normal output. It says "fail", but it did give you a new ecco_ctrl_MIT_CE_000.opt0001 (correct?), and if you unpack and run, the cost will likely descend.
> > >>>
> > >>> I think it worked correctly. lsopt/optim are just confusing… but I think it's working. I think all is good!
> > >>>
> > >>> Matt
> > >>>
> > >>>
> > >>>
> > >>>> On Apr 27, 2018, at 8:25 AM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
> > >>>>
> > >>>> Just separating this from the other thread, I got the bundled MITgcm optim routine built (having made these changes, based on this thread from 2010 and this one from 2016).
> > >>>>
> > >>>> I use OpenAD to create the adjoint.
> > >>>>
> > >>>> My steps are:
> > >>>> 1) in the build directory, run ../../../tools/genmake2 -oad -mods=../code_oad
> > >>>> 2) run make depend and make adAll
> > >>>> 3) copy input_oad/ into a new folder scratch/
> > >>>> 4) within scratch/, run ./prepare_run
> > >>>> 5) copy mitgcmuv_ad from build/ into scratch/, copy optim.x into scratch/OPTIM/
> > >>>> 6) run ./mitgcmuv_ad
> > >>>> 7) in scratch/OPTIM, create symlinks to ../data.optim and ../data.ctrl
> > >>>> 8) copy the files ecco_cost_MIT_CE_000.opt0000 and ecco_ctrl_MIT_CE_000.opt0000 into the OPTIM subdirectory
> > >>>> 9) run ./optim.x within the subdirectory
> > >>>>
> > >>>> The full output is attached, but I assume the optimisation failed since the last lines are
> > >>>>
> > >>>> optimization stopped because :
> > >>>> ifail = 4 the search direction is not a descent one
> > >>>>
> > >>>> Any ideas? (I guess this isn't something that is tested in the daily builds?)
> > >>>>
> > >>>> In the meantime, I'll try the m1qn3 routine as in the other thread, which should help distinguish between a problem with the optimisation routine and a problem with the gradient generated by mitgcmuv_ad.
> > >>>>
> > >>>> Andrew
> > >>>> <out.txt>