[MITgcm-support] tutorial_global_oce_optim optimisation failed
Ron Goldman
ron at ocean.org.il
Thu Jun 21 05:15:31 EDT 2018
Hi Andrew,
It compiled, and grdchk returned output that matched the finite difference. I recall that optim reduced the norm by only a little, but I don't recall whether the change in OPENAD_OPTIONS.h was needed for that.
Ron
On 06/21/18 10:22, Andrew McRae wrote:
Hi Ron,
"It worked" = it compiled, or it compiled + everything now seems to work (including the optimization)?
Andrew
On 21 June 2018 at 05:57, Ron Goldman <ron at ocean.org.il> wrote:
Hi Andrew,
I've been having the same issue. It worked when I changed the code by dropping the %v %d.
Changing tools/OAD_support/ad_template.active_read_xy.F will propagate the changes to externalDummies_cb2m_oad.f.
I am still not sure what makes OpenAD decide whether active_var is type(active) or real.
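For what it's worth, the shape of the change was roughly the following (the field name and indices below are placeholders, not the actual template code). Where the template addressed the value component of the active type, e.g.
      active_var(i,j,bi,bj)%v = fld(i,j,bi,bj)
I dropped the %v / %d component selectors, i.e.
      active_var(i,j,bi,bj) = fld(i,j,bi,bj)
so the same statement also compiles when the post-processed code declares active_var as a plain REAL.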
Best regards,
Ron
On 06/20/18 20:28, Andrew McRae wrote:
Damn. After doing this, the gradient written into ecco_cost seems to be all 0.0. Help?
Andrew
On 19 June 2018 at 15:37, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
Okay, I have
1) copied OPENAD_OPTIONS.h from pkg/openad to the code_oad/ subfolder of the tutorial, changing it to define ALLOW_OPENAD_ACTIVE_READ_XY
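(Concretely, code_oad/OPENAD_OPTIONS.h now contains the line
#define ALLOW_OPENAD_ACTIVE_READ_XY
where the stock pkg/openad header leaves the flag undefined.)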
Good news: the main body of tools/OAD_support/ad_template.active_read_xy.F (which is wrapped in #ifdef ALLOW_OPENAD_ACTIVE_READ_XY) now appears in externalDummies_cb2m_oad.f
Bad news: this gives a compile error in externalDummies_cb2m_oad.f of "Error: Unexpected '%' for nonderived-type variable 'active_var'". This seems to be because active_var is declared as a REAL(w2f__8) in externalDummies_cb2m_oad.f, not a type(active). The lines of code corresponding to
active_var = dummy + active_var
dummy = active_var(1,1,1,1) + dummy
don't appear in the post-processed code [optimized out by the OpenAD toolchain, or something else?], which is probably why active_var doesn't become an active variable. Therefore, I....
2) change the type of active_var to type(active) in the post-processed file (yuck). make adAll continues from where it left off, and mitgcmuv_ad now compiles :)
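The hand edit is just the declaration, roughly (the array bounds stay whatever the generated file already has; only the type changes):
C     as generated:
C     REAL(w2f__8) active_var(...)
C     hand-edited so the % references from the template body compile:
      type(active) active_var(...)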
(I tried changing the type of this variable in pkg/openad/externalDummies.F (https://github.com/MITgcm/MITgcm/blob/master/pkg/openad/externalDummies.F#L285), but this breaks the OpenAD toolchain)
I can confirm the cost function changes from iteration to iteration, and I'll now test if the optimization works. Hopefully you can find a more permanent solution to the above.
Andrew
On 19 June 2018 at 13:43, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
The active_read_xy routine used in OpenAD mode looks suspicious: https://github.com/MITgcm/MITgcm/blob/master/pkg/openad/externalDummies.F#L269-L296
1) ALLOW_OPENAD_ACTIVE_READ_XY isn't defined for tutorial_global_oce_optim; I guess it should be?
2) This routine seems to be basically a no-op anyway? I guess active_var_file should be read into active_var, or similar?
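Presumably something along these lines is intended -- though I'm guessing at the right I/O wrapper and its arguments (prec and iRec below are placeholders), so treat it only as a sketch:
      CALL MDSREADFIELD( active_var_file, prec, 'RL', 1,
     &                   active_var, iRec, myThid )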
Andrew
On 18 June 2018 at 18:04, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
Not sure if you've had a chance to look at this yet... the only time I can see tmpfld2d being written to (and not just initialised to 0.0 or 1.0) is in pkg/admtlm/bypassad.F line 96. Presumably that package isn't switched on here. I can't see xx_hfluxm being written to at all.
A few lines above, active_read_xy is called with xx_hfluxm_dummy as the last argument... should this have been xx_hfluxm, perhaps? (xx_hfluxm_dummy is a single variable, while xx_hfluxm is an array, so this probably won't work as-is...)
Andrew
On 13 June 2018 at 23:18, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
Okay, thank you. If you do have any advice on debugging this, do say. I guess you already got as far as spotting that all the terms on the RHS of https://github.com/MITgcm/MITgcm/blob/master/pkg/ctrl/ctrl_map_forcing.F#L259 are zero.
Andrew
On 13 June 2018 at 21:36, Patrick Heimbach <heimbach at mit.edu> wrote:
Andrew,
I have not been able to look into this due to various other commitments over the last couple of months.
I'll be grounded for a while in Austin starting next week, and this will be near the top of my ToDo list.
Patrick
> On Jun 13, 2018, at 12:56 PM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>
> MITgcm built with OpenAD is not making use of the ecco_ctrl files for optimcycle >= 1. The file apparently gets read in, but the contents get dropped on the floor somewhere.
>
> Andrew
>
> On 13 June 2018 at 18:51, Matthew Mazloff <mmazloff at ucsd.edu> wrote:
> Hello
>
> Sorry, I lost track. What needs to be debugged? Can you please reiterate the problem?
>
> Thanks
> Matt
>
>
>> On Jun 13, 2018, at 10:14 AM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>>
>> Hi Patrick,
>>
>> Were you able to make any progress with this? If not, do you have any advice on debugging this? (I'm getting lost in ctrl_unpack as to which variable the control vector is even read into)
>>
>> Thanks,
>> Andrew
>>
>> On 5 May 2018 at 20:12, Patrick Heimbach <heimbach at mit.edu> wrote:
>> A quick update:
>>
>> This tutorial works as advertised (in the manual), but not as "hoped".
>> What I mean is that it has been developed and only ever fully tested and used in optimization mode with TAF-generated code (and that's what's documented in the manual).
>>
>> Of course, it should not make a difference whether we use TAF vs. OpenAD, as long as the gradients are correct. But as it turns out, with the OpenAD code there appears to be a little glitch. The gradient seems correct, and the iteration 1 update is properly read in, but then not used (instead it is reset to zero). Oh well. I'll need to check where that happens, so stay tuned.
>>
>> p.
>>
>> > On May 4, 2018, at 10:11 AM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>> >
>> > And, still no luck(?)
>> >
>> > Running for a year (switching the commented and uncommented nTimeSteps and lastinterval declarations in data and data.cost), optim.x (lsopt+optim, not optim_m1qn3) now gives the output
>> >
>> > cost function............... 0.60514949E+01
>> > norm of x................... 0.00000000E+00
>> > norm of g................... 0.23235517E+00
>> >
>> > optimization stopped because :
>> > ifail = 4 the search direction is not a descent one
>> >
>> >
>> > On 4 May 2018 at 13:58, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>> > On 4 May 2018 at 06:04, Patrick Heimbach <heimbach at mit.edu> wrote:
>> > Hi Matt,
>> >
>> > as you indicated, all is still good, and I suspect the same you did regarding what might be at issue.
>> >
>> > I just downloaded latest MITgcm, re-ran adjoint, and conducted 2 iterations (using lsopt).
>> >
>> > It still works "out of the box" ... if one realizes that a manual is part of that "box", and section 3.18 (old manual prior to readthedocs) has some description of this tutorial, thanks to dfer (admittedly somewhat out of date, but still mostly relevant). In particular it says there that the optimization has been conducted for a 1-year simulation.
>> >
>> > Okay, thanks. I interpreted the manual footnote as "running a 1-year simulation will reproduce the scientifically-interesting graphs in the manual", not as "the default parameters are only useful for verifying correctness of the adjoint, but will break the optimisation routine". I'll see if I have more success with the longer run.
>> >
>> >
>> > Since we do not want to conduct 1-year integrations for *any* of the tutorials within our regression tests (these tests consist of 90 forward, 24 adjoint/TAF, 10 adjoint/OpenAD, and 16 tangent-linear/TAF configurations, each needing to be compiled and executed), we have shortened the number of time steps to 10 (= 10 days) to perform efficient nightly regression tests of the adjoint. Not changing the number of time steps leads to optimizing in the noise - in fact the cost function goes up in that case.
>> >
>> > That the user's cost function does not change at all suggests a more basic problem though (hard to speculate what it might be).
>> >
>> > I made a quick test by extending nTimeSteps from 10 to 90 days, which leads to cost reduction as desired, namely, for:
>> > numiter=1,
>> > nfunc=3,
>> > fmin=5.74,
>> > (values in data.optim that comes with tutorial_global_oce_optim)
>> > I obtain the following costs:
>> > iter. 0: fc = 0.184199260445164D+02
>> > iter. 1: fc = 0.130860446841901D+02
>> > iter. 2: fc = 0.979374136987667D+01
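>> > (In case it helps, the edit itself is just the nTimeSteps entry in the &PARM03 block of the tutorial's data file, roughly:
>> >  &PARM03
>> > # nTimeSteps=10,
>> >  nTimeSteps=90,
>> > # (other entries unchanged)
>> >  &
>> > with the time step left at one day, so 90 steps = 90 days.)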
>> >
>> > I did that test "by hand", i.e. not using the script cycsh also provided (see manual). Doing so by hand requires two more lines in data.ctrl:
>> > &CTRL_PACKNAMES
>> > costname='ecco_cost',
>> > ctrlname='ecco_ctrl',
>> >
>> > Since gradients produced with TAF are extremely similar (10+ digits?) to those produced with OpenAD (see the results/ directory, which has both TAF and OpenAD reference results), I expect it to work with OpenAD too (I have not tested it just now).
>> >
>> > -Patrick
>> >
>> >
>> >
>> > > On May 2, 2018, at 12:34 PM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>> > >
>> > > Thanks for this.
>> > >
>> > > Just as a sanity check, before I involve optim_m1qn3 again, the output of my ./testreport -t tutorial_global_oce_optim -oad includes
>> > >
>> > > There were 16 decimal places of similarity for "ADM CostFct"
>> > > There were 16 decimal places of similarity for "ADM Ad Grad"
>> > > There were 0 decimal places of similarity for "ADM FD Grad"
>> > >
>> > > Should I be concerned about this?
>> > >
>> > > E.g. lines 2116-2118 of my output_oadm.txt file are
>> > >
>> > > (PID.TID 0000.0001) ADM ref_cost_function = 6.20023228182329E+00
>> > > (PID.TID 0000.0001) ADM adjoint_gradient = -2.69091500991183E-06
>> > > (PID.TID 0000.0001) ADM finite-diff_grad = 0.00000000000000E+00
>> > >
>> > > But at least my cost function value is the same:
>> > >
>> > > (PID.TID 0000.0001) local fc = 0.620023228182329D+01
>> > > (PID.TID 0000.0001) global fc = 0.620023228182329D+01
>> > >
>> > > Andrew
>> > >
>> > > On 2 May 2018 at 10:34, Martin Losch <Martin.Losch at awi.de> wrote:
>> > > Hi Andrew,
>> > >
>> > > I won’t be able to help you much with the optim/lsopt code, because I would have to get it running again myself. But I do recommend using the MITgcm_contrib/mlosch/optim_m1qn3 code. It’s not very well documented, but I am attaching a skeleton script to illustrate how to use it. Please give it a try and if you find it useful, I can add this script to the repository.
>> > >
>> > > The two versions of the optimization routine are similar: both implement the same optimization algorithm (BFGS), but optim_m1qn3 uses a later version of the m1qn3 code. I think it’s easier to compile (only one Makefile), and I believe (though there’s debate about this) that it does the right thing, as opposed to the optim/lsopt variant, which somehow truncates the optimization in each iteration. Having said that, I have used both in parallel, and the reduction of the cost function (which is really all we care about) is sometimes better with the optim_m1qn3 code, sometimes better with the optim/lsopt code. The optim_m1qn3 code is closer to the idea of the original m1qn3 code.
>> > >
>> > > Let me know if you can use my attached instructions.
>> > >
>> > > Martin
>> > >
>> > >
>> > >
>> > > > On 1. May 2018, at 00:00, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>> > > >
>> > > > Right, but the cost function is the same value each time, the norm of x is 0 each time, and the norm of g is the same each time. This suggests nothing is happening. It's a bit ridiculous that one of the core tutorials simply isn't working out of the box...
>> > > >
>> > > > I will have a go at debugging.
>> > > >
>> > > > Andrew
>> > > >
>> > > > On 30 April 2018 at 22:54, Matthew Mazloff <mmazloff at ucsd.edu> wrote:
>> > > > Well, you are correct that it's not actually taking a step, because the dot product of the control is 0:
>> > > >>> norm of x................... 0.00000000E+00
>> > > > meaning the controls are all 0 still.
>> > > >
>> > > > However the gradients are non-zero
>> > > >>> norm of g................... 0.12730927E-01
>> > > > so the linesearch should step and
>> > > > ecco_ctrl_MIT_CE_000.opt0001
>> > > > should not be all zero.
>> > > >
>> > > > To debug this you could put a print statement in optim_writedata.F to see what it is writing…..
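>> > > > Something as crude as the following would do -- the variable name is a guess, use whatever optim_writedata.F actually calls the vector it is about to write:
>> > > >       print *, 'optim_writedata: min/max = ', minval(xx), maxval(xx)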
>> > > >
>> > > > I don’t know enough about this tutorial to be a bigger help, sorry
>> > > >
>> > > > Matt
>> > > >
>> > > >
>> > > >> On Apr 30, 2018, at 2:50 PM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>> > > >>
>> > > >> Yes, I did.
>> > > >>
>> > > >> On 30 April 2018 at 22:42, Matthew Mazloff <mmazloff at ucsd.edu> wrote:
>> > > >> This is still iteration 0. You have to update data.optim to tell it you are now at iteration 1
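>> > > >> i.e. bump the cycle number in data.optim (in its &OPTIM namelist, leaving the other entries alone), something like:
>> > > >>  optimcycle=1,
>> > > >> before re-running.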
>> > > >>
>> > > >> Matt
>> > > >>
>> > > >>
>> > > >>> On Apr 30, 2018, at 2:38 PM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>> > > >>>
>> > > >>> I tried a few steps of this, but the output of optim.x always has
>> > > >>>
>> > > >>> cost function............... 0.62002323E+01
>> > > >>> norm of x................... 0.00000000E+00
>> > > >>> norm of g................... 0.12730927E-01
>> > > >>>
>> > > >>> near the end, with no decrease in the cost function. So I guess it's not actually taking the step?
>> > > >>>
>> > > >>> Andrew
>> > > >>>
>> > > >>> On 27 April 2018 at 18:04, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>> > > >>> !!! Okay...
>> > > >>>
>> > > >>> Yes, it produced the .opt0001 file. I'll see how this goes.
>> > > >>>
>> > > >>> Thanks,
>> > > >>> Andrew
>> > > >>>
>> > > >>> On 27 April 2018 at 17:57, Matthew Mazloff <mmazloff at ucsd.edu> wrote:
>> > > >>> Hello
>> > > >>>
>> > > >>> It's been a while, but I am pretty sure that is the normal output. It says "fail", but it did give you a new ecco_ctrl_MIT_CE_000.opt0001 (correct?), and if you unpack and run, the cost will likely descend.
>> > > >>>
>> > > >>> I think it worked correctly. lsopt/optim are just confusing… but I think it's working. I think all is good!
>> > > >>>
>> > > >>> Matt
>> > > >>>
>> > > >>>
>> > > >>>
>> > > >>>> On Apr 27, 2018, at 8:25 AM, Andrew McRae <andrew.mcrae at physics.ox.ac.uk> wrote:
>> > > >>>>
>> > > >>>> Just separating this from the other thread, I got the bundled MITgcm optim routine built (having made these changes, based on this thread from 2010 and this one from 2016).
>> > > >>>>
>> > > >>>> I use OpenAD to create the adjoint.
>> > > >>>>
>> > > >>>> My steps are:
>> > > >>>> 1) in the build directory, run ../../../tools/genmake2 -oad -mods=../code_oad
>> > > >>>> 2) run make depend and make adAll
>> > > >>>> 3) copy input_oad/ into a new folder scratch/
>> > > >>>> 4) within scratch/, run ./prepare_run
>> > > >>>> 5) copy mitgcmuv_ad from build/ into scratch/, copy optim.x into scratch/OPTIM/
>> > > >>>> 6) run ./mitgcmuv_ad
>> > > >>>> 7) in scratch/OPTIM, create symlinks to ../data.optim and ../data.ctrl
>> > > >>>> 8) copy the files ecco_cost_MIT_CE_000.opt0000 and ecco_ctrl_MIT_CE_000.opt0000 into the OPTIM subdirectory
>> > > >>>> 9) run ./optim.x within the subdirectory
>> > > >>>>
>> > > >>>> The full output is attached, but I assume the optimisation failed since the last lines are
>> > > >>>>
>> > > >>>> optimization stopped because :
>> > > >>>> ifail = 4 the search direction is not a descent one
>> > > >>>>
>> > > >>>> Any ideas? (I guess this isn't something that is tested in the daily builds?)
>> > > >>>>
>> > > >>>> In the meantime, I'll try the m1qn3 routine as in the other thread, which should help distinguish between a problem with the optimisation routine and a problem with the gradient generated by mitgcmuv_ad.
>> > > >>>>
>> > > >>>> Andrew
>> > > >>>> <out.txt>