[MITgcm-support] tutorial_global_oce_optim optimisation failed

Tue Jun 19 08:43:26 EDT 2018

The active_read_xy routine used in OpenAD mode looks suspicious:
https://github.com/MITgcm/MITgcm/blob/master/pkg/openad/externalDummies.F#L269-L296

1) ALLOW_OPENAD_ACTIVE_READ_XY isn't defined for tutorial_global_oce_optim;
I guess it should be?

2) This routine seems to be basically a no-op anyway?  I guess
active_var_file should be read into active_var, or similar?
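
To illustrate point 2, here is a toy Python sketch. It is not MITgcm code: the function names and the numbers are made up for illustration, and it only mimics my reading of the symptom. A reader that never copies the file contents into active_var leaves the control at its initial zero value, which would explain "norm of x = 0" even though the update file on disk is nonzero.

```python
import math


def noop_active_read(file_values, active_var):
    """Hypothetical stand-in for a reader that never fills active_var."""
    _ = list(file_values)  # the data is "read" from the file ...
    return active_var      # ... but the caller's array is left untouched


def copying_active_read(file_values, active_var):
    """Hypothetical stand-in for the expected behaviour: file -> active_var."""
    return list(file_values)


def norm(values):
    """Euclidean norm, comparable to the 'norm of x' line printed by optim.x."""
    return math.sqrt(sum(v * v for v in values))


update_on_disk = [0.5, -1.0, 2.0]  # made-up iteration-1 control update
xx_initial = [0.0, 0.0, 0.0]       # controls are initialised to zero

print(norm(noop_active_read(update_on_disk, xx_initial)))     # prints 0.0
print(norm(copying_active_read(update_on_disk, xx_initial)))  # nonzero
```

If the real routine behaves like the first function, the optimiser's update would be silently discarded on every cycle.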

Andrew

On 18 June 2018 at 18:04, Andrew McRae <andrew.mcrae at physics.ox.ac.uk>
wrote:

> Not sure if you've had a chance to look at this yet... the only time I can
> see tmpfld2d being written to (and not just initialised to 0.0 or 1.0) is
> in pkg/admtlm/bypassad.F line 96.  Presumably that package isn't switched
> on here.  I can't see xx_hfluxm being written to at all.
>
> A few lines above, active_read_xy is called with xx_hfluxm_dummy as the
> last argument... should this have been xx_hfluxm, perhaps?
> (xx_hfluxm_dummy is a single variable, while xx_hfluxm is an array, so this
> probably won't work as-is...)
>
> Andrew
>
> On 13 June 2018 at 23:18, Andrew McRae <andrew.mcrae at physics.ox.ac.uk>
> wrote:
>
>> Okay, thank you.  If you do have any advice on debugging this, do say.  I
>> guess you already got as far as spotting that all the terms on the RHS of
>> https://github.com/MITgcm/MITgcm/blob/master/pkg/ctrl/ctrl_map_forcing.F#L259
>> are zero.
>>
>> Andrew
>>
>> On 13 June 2018 at 21:36, Patrick Heimbach <heimbach at mit.edu> wrote:
>>
>>> Andrew,
>>>
>>> I have not been able to look into this due to various other commitments
>>> over the last couple of months.
>>>
>>> I'll be grounded for a while in Austin starting next week, and this will
>>> be near the top of my ToDo list.
>>>
>>> Patrick
>>>
>>> > On Jun 13, 2018, at 12:56 PM, Andrew McRae <
>>> andrew.mcrae at physics.ox.ac.uk> wrote:
>>> >
>>> > MITgcm built with OpenAD is not making use of the ecco_ctrl files for
>>> optimcycle >= 1.  The file apparently gets read in, but the contents get
>>> dropped on the floor somewhere.
>>> >
>>> > Andrew
>>> >
>>> > On 13 June 2018 at 18:51, Matthew Mazloff <mmazloff at ucsd.edu> wrote:
>>> > Hello
>>> >
>>> > Sorry, I lost track. What needs to be debugged? Can you please
>>> reiterate the problem?
>>> >
>>> > Thanks
>>> > Matt
>>> >
>>> >
>>> >> On Jun 13, 2018, at 10:14 AM, Andrew McRae <
>>> andrew.mcrae at physics.ox.ac.uk> wrote:
>>> >>
>>> >> Hi Patrick,
>>> >>
>>> >> Were you able to make any progress with this?  If not, do you have
>>> any advice on debugging this?  (I'm getting lost in ctrl_unpack as to which
>>> variable the control vector is even read into)
>>> >>
>>> >> Thanks,
>>> >> Andrew
>>> >>
>>> >> On 5 May 2018 at 20:12, Patrick Heimbach <heimbach at mit.edu> wrote:
>>> >> A quick update:
>>> >>
>>> >> This tutorial works as advertised (in the manual), but not as "hoped".
>>> >> What I mean is that it has been developed and only ever fully tested
>>> and used  in optimization mode with TAF-generated code (and that's what's
>>> documented in the manual).
>>> >>
>>> >> Of course, it should not make a difference whether we use TAF vs.
>>> OpenAD as long as gradients are correct. But as it turns out, with the
>>> OpenAD code there appears to be a little glitch. Gradient seems correct,
>>> and iteration 1 update is properly read in, but then not used (instead it
>>> is reset to zero). Oh well. I'll need to check where that happens, so stay
>>> tuned.
>>> >>
>>> >> p.
>>> >>
>>> >> > On May 4, 2018, at 10:11 AM, Andrew McRae <
>>> andrew.mcrae at physics.ox.ac.uk> wrote:
>>> >> >
>>> >> > And, still no luck(?)
>>> >> >
>>> >> > Running for a year (switching the commented and uncommented
>>> nTimeSteps and lastinterval declarations in data and data.cost), optim.x
>>> (lsopt+optim, not optim_m1qn3) now gives the output
>>> >> >
>>> >> >   cost function............... 0.60514949E+01
>>> >> >   norm of x................... 0.00000000E+00
>>> >> >   norm of g................... 0.23235517E+00
>>> >> >
>>> >> >   optimization stopped because :
>>> >> >   ifail =   4    the search direction is not a descent one
>>> >> >
>>> >> >
>>> >> > On 4 May 2018 at 13:58, Andrew McRae <andrew.mcrae at physics.ox.ac.uk>
>>> wrote:
>>> >> > On 4 May 2018 at 06:04, Patrick Heimbach <heimbach at mit.edu> wrote:
>>> >> > Hi Matt,
>>> >> >
>>> >> > as you indicated, all is still good, and I suspect the same as you did
>>> regarding what might be at issue.
>>> >> >
>>> >> > I just downloaded latest MITgcm, re-ran adjoint, and conducted 2
>>> iterations (using lsopt).
>>> >> >
>>> >> > It still works "out of the box" ... if one realizes that a manual
>>> is part of that "box", and section 3.18 (old manual prior to readthedocs)
>>> has some description of this tutorial, thanks to dfer (admittedly somewhat
>>> out of date, but still mostly relevant). In particular it says there that
>>> the optimization has been conducted for a 1-year simulation.
>>> >> >
>>> >> > Okay, thanks.  I interpreted the manual footnote as "running a
>>> 1-year simulation will reproduce the scientifically-interesting graphs in
>>> the manual", not as "the default parameters are only useful for verifying
>>> correctness of the adjoint, but will break the optimisation routine".  I'll
>>> see if I have more success with the longer run.
>>> >> >
>>> >> >
>>> >> > Since we do not want to conduct 1-year integrations for *any* of
>>> the tutorials within our regression tests (these tests consist of 90
>>> forward, 24 adjoint/TAF, 10 adjoint/OpenAD, and 16 tangent-linear/TAF
>>> configurations, each needing to be compiled and executed) we have shortened
>>> the number of time steps to 10 (= 10 days) to perform efficient nightly
>>> regression tests of the adjoint. Not changing the number of time steps
>>> leads to optimizing in the noise - in fact cost function goes up in that
>>> case.
>>> >> >
>>> >> > That the user's cost function does not change at all suggests a
>>> more basic problem though (hard to speculate what it might be).
>>> >> >
>>> >> > I made a quick test by extending nTimeSteps from 10 to 90 days,
>>> which leads to cost reduction as desired, namely, for:
>>> >> >  numiter=1,
>>> >> >  nfunc=3,
>>> >> >  fmin=5.74,
>>> >> > (values in data.optim that comes with tutorial_global_oce_optim)
>>> >> > I obtain the following costs:
>>> >> > iter. 0: fc =  0.184199260445164D+02
>>> >> > iter. 1: fc =  0.130860446841901D+02
>>> >> > iter. 2: fc =  0.979374136987667D+01
>>> >> >
>>> >> > I did that test "by hand", i.e. not using the script cycsh also
>>> provided (see manual). Doing so by hand requires two more lines in
>>> data.ctrl:
>>> >> >  &CTRL_PACKNAMES
>>> >> >  costname='ecco_cost',
>>> >> >  ctrlname='ecco_ctrl',
>>> >> >
>>> >> > Since gradients produced with TAF are extremely similar (10+
>>> digits?) to those produced with OpenAD (see results/ directory which has
>>> both TAF and OpenAD reference results), I expect it to work with OpenAD too
>>> (have not tested it right now).
>>> >> >
>>> >> > -Patrick
>>> >> >
>>> >> >
>>> >> >
>>> >> > > On May 2, 2018, at 12:34 PM, Andrew McRae <
>>> andrew.mcrae at physics.ox.ac.uk> wrote:
>>> >> > >
>>> >> > > Thanks for this.
>>> >> > >
>>> >> > > Just as a sanity check, before I involve optim_m1qn3 again, the
>>> output of my ./testreport -t tutorial_global_oce_optim -oad includes
>>> >> > >
>>> >> > > There were 16 decimal places of similarity for "ADM CostFct"
>>> >> > > There were 16 decimal places of similarity for "ADM Ad Grad"
>>> >> > > There were 0 decimal places of similarity for "ADM FD Grad"
>>> >> > >
>>> >> > > Should I be concerned about this?
>>> >> > >
>>> >> > > E.g. lines 2116-2118 of my output_oadm.txt file are
>>> >> > >
>>> >> > > (PID.TID 0000.0001)  ADM  ref_cost_function      =
>>> 6.20023228182329E+00
>>> >> > > (PID.TID 0000.0001)  ADM  adjoint_gradient       =
>>> -2.69091500991183E-06
>>> >> > > (PID.TID 0000.0001)  ADM  finite-diff_grad       =
>>> 0.00000000000000E+00
>>> >> > >
>>> >> > > But at least my cost function value is the same:
>>> >> > >
>>> >> > > (PID.TID 0000.0001)   local fc =  0.620023228182329D+01
>>> >> > > (PID.TID 0000.0001)  global fc =  0.620023228182329D+01
>>> >> > >
>>> >> > > Andrew
>>> >> > >
>>> >> > > On 2 May 2018 at 10:34, Martin Losch <Martin.Losch at awi.de> wrote:
>>> >> > > Hi Andrew,
>>> >> > >
>>> >> > > I won’t be able to help you much with the optim/lsopt code,
>>> because I would have to get it running again myself. But I do recommend
>>> using the MITgcm_contrib/mlosch/optim_m1qn3 code. It’s not very well
>>> documented, but I am attaching a skeleton script to illustrate how to use
>>> it. Please give it a try and if you find it useful, I can add this script
>>> to the repository.
>>> >> > >
>>> >> > > The two versions of the optimization routine are similar, both
>>> implement the same optimization algorithm (BFGS), but optim_m1qn3 uses a
>>> later version of the m1qn3 code, I think it’s easier to compile (only one
>>> Makefile) and I believe (but there’s debate about this) that it does the
>>> right thing as opposed to the optim/lsopt variant, which somehow truncates
>>> the optimization in each iteration. Having said that, I have used both in
>>> parallel, and the reduction of the cost function (which is really all we
>>> care about) is sometimes better with the optim_m1qn3 code, sometimes it is
>>> better with the optim/lsopt code. The optim_m1qn3 code is closer to the
>>> idea of the original m1qn3 code.
>>> >> > >
>>> >> > > Let me know if you can use my attached instructions.
>>> >> > >
>>> >> > > Martin
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > > > On 1. May 2018, at 00:00, Andrew McRae <
>>> andrew.mcrae at physics.ox.ac.uk> wrote:
>>> >> > > >
>>> >> > > > Right, but the cost function is the same value each time, the
>>> norm of x is 0 each time, and the norm of g is the same each time.  This
>>> suggests nothing is happening.  It's a bit ridiculous that one of the core
>>> tutorials simply isn't working out of the box...
>>> >> > > >
>>> >> > > > I will have a go at debugging.
>>> >> > > >
>>> >> > > > Andrew
>>> >> > > >
>>> >> > > > On 30 April 2018 at 22:54, Matthew Mazloff <mmazloff at ucsd.edu>
>>> wrote:
>>> >> > > > Well, you are correct that it's not actually taking a step
>>> because the dot product of the control is 0:
>>> >> > > >>> norm of x................... 0.00000000E+00
>>> >> > > > meaning the controls are all 0 still.
>>> >> > > >
>>> >> > > > However the gradients are non-zero
>>> >> > > >>> norm of g................... 0.12730927E-01
>>> >> > > > so the linesearch should step and
>>> >> > > > ecco_ctrl_MIT_CE_000.opt0001
>>> >> > > > should not be all zero.
>>> >> > > >
>>> >> > > > To debug this you could put a print statement in
>>> optim_writedata.F to see what it is writing…..
>>> >> > > >
>>> >> > > > I don’t know enough about this tutorial to be a bigger help,
>>> sorry
>>> >> > > >
>>> >> > > > Matt
>>> >> > > >
>>> >> > > >
>>> >> > > >> On Apr 30, 2018, at 2:50 PM, Andrew McRae <
>>> andrew.mcrae at physics.ox.ac.uk> wrote:
>>> >> > > >>
>>> >> > > >> Yes, I did.
>>> >> > > >>
>>> >> > > >> On 30 April 2018 at 22:42, Matthew Mazloff <mmazloff at ucsd.edu>
>>> wrote:
>>> >> > > >> This is still iteration 0. You have to update data.optim to
>>> tell it you are now at iteration 1
>>> >> > > >>
>>> >> > > >> Matt
>>> >> > > >>
>>> >> > > >>
>>> >> > > >>> On Apr 30, 2018, at 2:38 PM, Andrew McRae <
>>> andrew.mcrae at physics.ox.ac.uk> wrote:
>>> >> > > >>>
>>> >> > > >>> I tried a few steps of this, but the output of optim.x always
>>> has
>>> >> > > >>>
>>> >> > > >>>   cost function............... 0.62002323E+01
>>> >> > > >>>   norm of x................... 0.00000000E+00
>>> >> > > >>>   norm of g................... 0.12730927E-01
>>> >> > > >>>
>>> >> > > >>> near the end, with no decrease in the cost function.  So I
>>> guess it's not actually taking the step?
>>> >> > > >>>
>>> >> > > >>> Andrew
>>> >> > > >>>
>>> >> > > >>> On 27 April 2018 at 18:04, Andrew McRae <
>>> andrew.mcrae at physics.ox.ac.uk> wrote:
>>> >> > > >>> !!!  Okay...
>>> >> > > >>>
>>> >> > > >>> Yes, it produced the .opt0001 file.  I'll see how this goes.
>>> >> > > >>>
>>> >> > > >>> Thanks,
>>> >> > > >>> Andrew
>>> >> > > >>>
>>> >> > > >>> On 27 April 2018 at 17:57, Matthew Mazloff <mmazloff at ucsd.edu>
>>> wrote:
>>> >> > > >>> Hello
>>> >> > > >>>
>>> >> > > >>> It's been a while, but I am pretty sure that is the normal
>>> output. It says "fail", but it did give you a new
>>> ecco_ctrl_MIT_CE_000.opt0001 (correct?), and if you unpack and run, the
>>> cost will likely descend.
>>> >> > > >>>
>>> >> > > >>> I think it worked correctly. lsopt/optim are just
>>> confusing... but I think it's working. I think all is good!
>>> >> > > >>>
>>> >> > > >>> Matt
>>> >> > > >>>
>>> >> > > >>>
>>> >> > > >>>
>>> >> > > >>>> On Apr 27, 2018, at 8:25 AM, Andrew McRae <
>>> andrew.mcrae at physics.ox.ac.uk> wrote:
>>> >> > > >>>>
>>> >> > > >>>> Just separating this from the other thread, I got the
>>> bundled MITgcm optim routine built (having made these changes, based on
>>> this thread from 2010 and this one from 2016).
>>> >> > > >>>>
>>> >> > > >>>> I use OpenAD to create the adjoint.
>>> >> > > >>>>
>>> >> > > >>>> My steps are:
>>> >> > > >>>> 1) in the build directory, run ../../../tools/genmake2 -oad
>>> -mods=../code_oad
>>> >> > > >>>> 2) run make depend and make adAll
>>> >> > > >>>> 3) copy input_oad/ into a new folder scratch/
>>> >> > > >>>> 4) within scratch/, run ./prepare_run
>>> >> > > >>>> 5) copy mitgcmuv_ad from build/ into scratch/, copy optim.x
>>> into scratch/OPTIM/
>>> >> > > >>>> 6) run ./mitgcmuv_ad
>>> >> > > >>>> 7) in scratch/OPTIM, create symlinks to ../data.optim and
>>> ../data.ctrl
>>> >> > > >>>> 8) copy the files ecco_cost_MIT_CE_000.opt0000 and
>>> ecco_ctrl_MIT_CE_000.opt0000 into the OPTIM subdirectory
>>> >> > > >>>> 9) run ./optim.x within the subdirectory
>>> >> > > >>>>
>>> >> > > >>>> The full output is attached, but I assume the optimisation
>>> failed since the last lines are
>>> >> > > >>>>
>>> >> > > >>>>   optimization stopped because :
>>> >> > > >>>>   ifail =   4    the search direction is not a descent one
>>> >> > > >>>>
>>> >> > > >>>> Any ideas?  (I guess this isn't something that is tested in
>>> the daily builds?)
>>> >> > > >>>>
>>> >> > > >>>> In the meantime, I'll try the m1qn3 routine as in the other
>>> thread, which should help distinguish between a problem with the
>>> optimisation routine or the gradient generated by mitgcmuv_ad.
>>> >> > > >>>>
>>> >> > > >>>> Andrew
>>> >> > > >>>> <out.txt>_______________________________________________
>>> >> > > >>>> MITgcm-support mailing list
>>> >> > > >>>> MITgcm-support at mitgcm.org
>>> >> > > >>>> http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support