[MITgcm-support] optim_m1qn3 line search getting stuck
Andrew McRae
andrew.mcrae at physics.ox.ac.uk
Wed Feb 13 18:24:47 EST 2019
Good question. I don't know. I don't have much practical experience in
this area.
My gut feeling is that if you're far enough away from the minimum for
non-convexity to be a problem, gradient descent (i.e. a cold start) is as
good a choice as any. But maybe this is unduly pessimistic.
There's definitely literature on globalisation strategies for BFGS. One of
the most highly cited papers is
https://doi.org/10.1016/S0377-0427(00)00540-9 (no numerical examples, only
proofs), and that simply replaces y_k by y_k' = y_k + r_k*s_k, where r_k is
a scalar of order ||g_k||.
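As a rough sketch of what that kind of modification looks like in code
(Python/NumPy; the precise choice of r_k below is my own guess at a
safeguard of order ||g_k||, not lifted from the paper):

    import numpy as np

    def modified_y(y, s, g, c=1.0):
        """Modified gradient increment y' = y + r*s, with r of order ||g||.

        The idea is to pick r large enough that <y', s> > 0, so the BFGS
        update keeps its inverse-Hessian approximation positive definite
        even on non-convex problems.  The exact rule for r used here is an
        assumption, not a quote from the paper above.
        """
        ys = np.dot(y, s)
        ss = np.dot(s, s)
        if ys > 0.0:
            return y                  # curvature condition already holds
        r = c * np.linalg.norm(g) + max(0.0, -ys / ss)
        return y + r * s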
On Wed, 13 Feb 2019 at 21:48, Martin Losch <Martin.Losch at awi.de> wrote:
> Hi Andrew,
>
> do you think you could “cold start” the routine in the middle of an
> optimization, i.e. throw away all stored Hessian information, or just the
> new (problematic) contribution with <y, s> < 0? I find this routine so
> difficult to read (especially with all the French comments) and haven’t
> had a good look at it in a long time.
>
> Martin
>
> > On 13. Feb 2019, at 14:30, Andrew McRae <andrew.mcrae at physics.ox.ac.uk>
> wrote:
> >
> > Okay, I think I understand what is happening. In short, the BFGS
> algorithm, which underpins m1qn3, is only guaranteed to behave sensibly if
> the cost function is convex. If this is not the case, it can produce
> approximations to H, the inverse Hessian, which aren't positive definite.
> This can lead to the algorithm producing uphill search directions.
> (With gradient g and search direction d = -Hg: if H is positive definite,
> then g.d = -g.Hg < 0 as long as g isn't 0; if H isn't positive definite,
> this isn't guaranteed.) We're definitely seeing d.g > 0 in the mlis3 output
> quoted further down, which can't be good.
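> >
> > As a toy illustration of that sign argument (nothing m1qn3-specific,
> > just NumPy):
> >
> >     import numpy as np
> >
> >     g = np.array([1.0, 2.0])
> >     H_spd = np.diag([2.0, 1.0])    # symmetric positive definite
> >     H_ind = np.diag([1.0, -3.0])   # indefinite: one negative eigenvalue
> >
> >     for H in (H_spd, H_ind):
> >         d = -H @ g                 # quasi-Newton search direction
> >         print(np.dot(g, d))        # -6.0 (downhill) vs +11.0 (uphill!)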
> >
> > Maybe I, or someone else, should write a wrapper that automatically
> cold-restarts m1qn3 if something goes wrong (either by checking d.g, or by
> checking y.s, see below)?
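> >
> > Schematically, the trigger for such a restart could be as simple as the
> > following (a hypothetical helper, not anything that exists in
> > optim_m1qn3):
> >
> >     import numpy as np
> >
> >     def needs_cold_restart(d, g, s, y):
> >         """Hypothetical safeguard check around one m1qn3 iteration.
> >
> >         True if the quasi-Newton state looks unusable: the proposed
> >         direction d is uphill (<d, g> >= 0) or the newest stored pair
> >         has non-positive curvature (<y, s> <= 0).  A wrapper could then
> >         wipe the stored (s, y) pairs and restart from the current x
> >         with d = -g (steepest descent).
> >         """
> >         return np.dot(d, g) >= 0.0 or np.dot(y, s) <= 0.0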
> >
> > Longer version:
> >
> > Define s_k = x_{k+1} - x_k, the state increment.
> > Define y_k = g_{k+1} - g_k, the gradient increment.
> >
> > BFGS computes an approximation H_{k+1} for the inverse Hessian, from
> H_k, s_k, and y_k. If H_k is symmetric then H_{k+1} is also symmetric.
> Furthermore, if H_k is positive definite and <y, s> > 0, H_{k+1} is also
> positive definite. <y, s> > 0 holds for sensible update steps in a
> convex optimization problem, but not in general (think of a parabola,
> but imagine it briefly steepening as you approach the minimum due to
> nonlinearities). So an H that is no longer positive definite can be
> generated.
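> >
> > For reference, the textbook form of that inverse-Hessian update (written
> > in NumPy, not copied from m1qn3's source) is:
> >
> >     import numpy as np
> >
> >     def bfgs_inverse_update(H, s, y):
> >         """H_{k+1} = (I - rho*s*y^T) H_k (I - rho*y*s^T) + rho*s*s^T,
> >         with rho = 1/<y, s>.  If H_k is symmetric positive definite and
> >         <y, s> > 0, then H_{k+1} is too; if <y, s> <= 0 the update is
> >         not safe to apply."""
> >         rho = 1.0 / np.dot(y, s)
> >         I = np.eye(len(s))
> >         V = I - rho * np.outer(s, y)
> >         return V @ H @ V.T + rho * np.outer(s, s)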
> >
> > The version used is actually L-BFGS, which maintains a diagonal matrix
> D_k, and forms H_k from D_k and the m most recent pairs (s_k, y_k), but the
> same things hold.
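> >
> > In L-BFGS the product H_k*g is never formed explicitly; it comes out of
> > the usual two-loop recursion over the stored pairs, roughly as in this
> > textbook sketch, with D_k taken as a plain diagonal scaling:
> >
> >     import numpy as np
> >
> >     def lbfgs_direction(g, pairs, d0):
> >         """Two-loop recursion: returns d = -H_k g.
> >
> >         pairs: the m most recent (s_i, y_i), oldest first.
> >         d0:    diagonal of the initial matrix D_k.
> >         """
> >         q = g.astype(float).copy()
> >         alphas = []
> >         for s, y in reversed(pairs):              # newest to oldest
> >             rho = 1.0 / np.dot(y, s)
> >             a = rho * np.dot(s, q)
> >             q -= a * y
> >             alphas.append((rho, a))
> >         r = d0 * q                                # apply the diagonal D_k
> >         for (s, y), (rho, a) in zip(pairs, reversed(alphas)):
> >             b = rho * np.dot(y, r)
> >             r += (a - b) * s
> >         return -r
> >
> > Any stored pair with <y, s> <= 0 can destroy positive definiteness of
> > the implied H_k.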
> >
> >
> >
> > On Tue, 12 Feb 2019 at 16:11, Andrew McRae <
> andrew.mcrae at physics.ox.ac.uk> wrote:
> > Actually I was wrong: it's the first Wolfe test being violated. I.e.
> f(x + t*d) turns out to be larger than f(x), even for very small t, where t
> is the step multiplier and d is the search direction.
> >
> > The values being printed out are t, f(x + t*d) - f(x), and <d, g>, where
> g is the gradient. The second number should be negative for small enough
> t, and the third number should definitely be negative else it's not a
> descent direction.
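> >
> > Schematically, the test on those two quantities looks something like
> > this (the constant 1e-4 is a typical Armijo value; I haven't checked
> > what m1qn3 actually uses):
> >
> >     def armijo_ok(f0, ft, t, dg0, c1=1e-4):
> >         """First Wolfe (sufficient decrease) test.
> >
> >         f0 = f(x), ft = f(x + t*d), dg0 = <d, g(x)> (should be < 0).
> >         Accept the step only if the actual decrease is at least a
> >         fraction c1 of what the linear model t*<d, g> predicts.
> >         """
> >         return ft - f0 <= c1 * t * dg0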
> >
> > In fact, even in a successful run, this occasionally happens (second
> number negative = good, but the third number is often positive!)
> >
> > mlis3 1.000D+00 -1.108D+00 -9.114D-01
> > mlis3 1.000D+00 -1.379D-01 6.624D+00
> > mlis3 1.000D+00 -5.767D-01 4.122D+00
> > mlis3 1.000D+00 -1.120D+00 -7.614D-01
> > mlis3 1.000D+00 -5.014D-01 -3.579D-01
> > mlis3 1.000D+00 -6.903D-01 -1.962D-01
> > mlis3 1.000D+00 -1.657D-01 1.824D-01
> >
> > I'm now confused, because the lines
> https://github.com/dorugeber/MITgcm/blob/optim_m1qn3/optim_m1qn3/m1qn3_offline.F#L551-L559
> seem to guard against the case d.g > 0.
> >
> > On Tue, 12 Feb 2019 at 12:02, Andrew McRae <
> andrew.mcrae at physics.ox.ac.uk> wrote:
> > This is using the optim_m1qn3 package from mitgcm_contrib.
> >
> > Quite often, the algorithm runs a few steps, then gets stuck in a line
> search. E.g., after 3 good iterations, I get
> >
> > m1qn3: iter 4, simul 4, f= 3.25818816D+00, h'(0)=-2.84510D+00
> >
> > m1qn3: line search
> >
> > mlis3 fpn=-2.845D+00 d2= 9.74D+01 tmin= 5.72D-07 tmax= 1.00D+20
> > mlis3    1.000D+00   8.992D-01    1.438D+00
> > mlis3    2.468D-01   1.155D-01    6.129D-01
> > mlis3    2.468D-03   7.656D-04    2.227D+02
> > mlis3    2.345D-03   7.240D-04    1.019D+01
> > mlis3    1.759D-03   5.496D-04    2.970D+00
> > mlis3    1.231D-03   3.855D-04   -3.000D+00
> > mlis3    8.618D-04   2.713D-04    3.105D-01
> > mlis3    6.032D-04   1.894D-04    2.202D-01
> > mlis3    4.223D-04   1.341D-04    2.223D+00
> > mlis3    2.956D-04   9.450D-05    3.954D-01
> > mlis3    2.069D-04   6.755D-05    3.121D-01
> > mlis3    6.207D-05   1.948D-05    3.642D-01
> >
> > This is for the included tutorial_global_oce_optim running for 1 year,
> and with mult_hflux_tut set to 0.2 rather than 2.
> >
> > To be honest, I'm not even sure what numbers are being printed, but I
> suspect one of those first two numbers is the step size multiplier. I.e.
> it's trying to take smaller and smaller steps, but these are still being
> rejected.
> >
> > I'm about to dive in with gdb and see what's going on, but my hypothesis
> is that the second Wolfe test is being violated. Roughly speaking, this
> forces the gradient to decrease by 10% each iteration (at least, the
> component of the gradient in the descent direction). This makes sense once
> the algorithm is in the basin near the minimizer of the cost function, but
> there's no a priori reason for it to hold further away.
> >
> > Is there likely to be anything wrong with modifying this check to allow
> (some) gradient steepening? I guess I'll find out...
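> >
> > Spelled out, the check I'd be relaxing is the curvature condition below
> > (schematic; c2 = 0.9 is a common default, I haven't confirmed what
> > m1qn3 sets):
> >
> >     def curvature_ok(dg0, dgt, c2=0.9):
> >         """Second Wolfe test (curvature condition).
> >
> >         dg0 = <d, g(x)>        slope along d at the old point (< 0)
> >         dgt = <d, g(x + t*d)>  slope along d at the trial point
> >         Requires the slope to have become less negative by a factor
> >         c2.  Allowing some steepening would mean taking c2 > 1, or
> >         skipping this test while still far from the minimum.
> >         """
> >         return dgt >= c2 * dg0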
>
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support
>