<div dir="ltr"><div>Just a small update: this was wrong, and the m1qn3 code adequately checks that <y, s> is positive and <d, g> is negative. If these are violated, the code exists pretty quickly with omode set to 7.</div><div><br></div><div>The numbers being printed in the log file, which I was worried about, seem to be <d_old, g_new>. It's fine for this to be positive.... it just means the step went past the minimum and up the other side.</div><div><br></div><div>(The control flow in this routine is disgusting! Welcome to 1980s Fortran. It was only clear when various variables were being updated once I stepped through line-by-line with a debugger for a few iterations.)</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 13 Feb 2019 at 13:30, Andrew McRae <<a href="mailto:andrew.mcrae@physics.ox.ac.uk">andrew.mcrae@physics.ox.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Okay, I <i>think</i> I understand what is happening. In short, the BFGS algorithm, which underpins m1qn3, is only guaranteed to behave sensibly if the cost function is convex. If this is not the case, it can produce approximations to H, the inverse Hessian, which aren't positive definite. This can lead to the algorithm producing uphill search directions. (gradient g, descent direction d, d = -Hg. If H is pos def, g.d = g.(-Hg), which is < 0 as long as g isn't 0. If H isn't pos-def, this isn't guaranteed). We're definitely seeing d.g > 0 in the output above, which can't be good.<br></div><div><br></div><div>Maybe I, or someone else, should write a wrapper that automatically cold-restarts m1qn3 if something goes wrong (either by checking d.g, or by checking y.s, see below)?<br></div><div><br></div><div>Longer version:<br></div><div><br></div><div>Define s_k = x_{k+1} - x_k, the state increment</div><div>Define y_k = g_{k+1} - g_k, the gradient increment.</div><div><br></div><div>BFGS computes an approximation H_{k+1} for the inverse Hessian, from H_k, s_k, and y_k. If H_k is symmetric then H_{k+1} is also symmetric. Furthermore, if H_k is positive definite <b>and <y, s> > 0</b>, H_{k+1} is also positive definite. <y, s> > 0 is true for sensible updates steps in a convex optimization problem, but not true in general (think of a parabola, but imagine it briefly steepening as you approach the minimum due to nonlinearities). So a negative-definite H can be generated.<br></div><div><br></div><div>The version used is actually L-BFGS, which maintains a diagonal matrix D_k, and forms H_k from D_k and the m most recent pairs (s_k, y_k), but the same things hold.</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 12 Feb 2019 at 16:11, Andrew McRae <<a href="mailto:andrew.mcrae@physics.ox.ac.uk" target="_blank">andrew.mcrae@physics.ox.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Actually I was wrong, it's the first Wolfe test being violated. I.e. f(x + t*d) turns out to be larger than f(x), even for very small t, where t is the step multiplier and d is the search direction.</div><div><br></div><div>The values being printed out are t, f(x + t*d) - f(x), and <d, g>, where g is the gradient. The second number should be negative for small enough t, and the third number should definitely be negative else it's not a descent direction.<br></div><div><br></div>In fact, even in a sucessful run, this occasionally happens (2nd number negative = good, but third number is often positive!)<br></div><div dir="ltr"><br></div><div dir="ltr"> mlis3 1.000D+00 -1.108D+00 -9.114D-01</div><div dir="ltr"> mlis3 1.000D+00 -1.379D-01 6.624D+00</div><div dir="ltr"> mlis3 1.000D+00 -5.767D-01 4.122D+00</div><div dir="ltr"> mlis3 1.000D+00 -1.120D+00 -7.614D-01</div><div dir="ltr"> mlis3 1.000D+00 -5.014D-01 -3.579D-01<br></div><div dir="ltr"> mlis3 1.000D+00 -6.903D-01 -1.962D-01<br></div><div dir="ltr"> mlis3 1.000D+00 -1.657D-01 1.824D-01</div><div dir="ltr"><br><div>I'm now confused, because the lines <a href="https://github.com/dorugeber/MITgcm/blob/optim_m1qn3/optim_m1qn3/m1qn3_offline.F#L551-L559" target="_blank">https://github.com/dorugeber/MITgcm/blob/optim_m1qn3/optim_m1qn3/m1qn3_offline.F#L551-L559</a> seem to guard against the case d.g > 0.<br></div></div></div></div></div></div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 12 Feb 2019 at 12:02, Andrew McRae <<a href="mailto:andrew.mcrae@physics.ox.ac.uk" target="_blank">andrew.mcrae@physics.ox.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div>This is using the optim_m1qn3 package from mitgcm_contrib.</div><div><br></div><div>Quite often, the algorithm runs a few steps, then gets stuck in a line search. E.g., after 3 good iterations, I get<br></div><div><br></div><div dir="ltr"></div><div dir="ltr"> m1qn3: iter 4, simul 4, f= 3.25818816D+00, h'(0)=-2.84510D+00<br><br> m1qn3: line search<br><br> mlis3 fpn=-2.845D+00 d2= 9.74D+01 tmin= 5.72D-07 tmax= 1.00D+20<br> mlis3 1.000D+00 8.992D-01 1.438D+00<br> mlis3 2.468D-01 1.155D-01 6.129D-01<br> mlis3 2.468D-03 7.656D-04 2.227D+02<br> mlis3 2.345D-03 7.240D-04 1.019D+01<br> mlis3 1.759D-03 5.496D-04 2.970D+00<br> mlis3 1.231D-03 3.855D-04 -3.000D+00<br> mlis3 8.618D-04 2.713D-04 3.105D-01<br> mlis3 6.032D-04 1.894D-04 2.202D-01<br> mlis3 4.223D-04 1.341D-04 2.223D+00<br> mlis3 2.956D-04 9.450D-05 3.954D-01<br> mlis3 2.069D-04 6.755D-05 3.121D-01<br> mlis3 6.207D-05 1.948D-05 3.642D-01</div><div dir="ltr"><br></div><div dir="ltr">This is for the included tutorial_global_oce_optim running for 1 year, and with mult_hflux_tut set to 0.2 rather than 2.</div><div dir="ltr"><br></div><div>To be honest, I'm not even sure what numbers are being printed, but I suspect one of those first two numbers is the step size multiplier. I.e. it's trying to take smaller and smaller steps, but these are still being rejected.<br>
</div><div dir="ltr"><br></div><div dir="ltr">
I'm about to dive in with gdb and see what's going on, but my hypothesis is that the second <a href="https://en.wikipedia.org/wiki/Wolfe_conditions" target="_blank">Wolfe test</a> is being violated. Roughly speaking, this forces the gradient to decrease by 10% each iteration (at least, the component of the gradient in the descent direction). This makes sense once the algorithm is in the basin near the minimizer of the cost function, but there's no apriori reason for it to hold further away.</div><div dir="ltr"><br></div><div>Is there likely to be anything wrong with modifying this check to allow (some) gradient steepening? I guess I'll find out...</div></div></div>
</blockquote></div>
</blockquote></div>
</blockquote></div>