This is something that has been bugging me for a while, and I couldn't find any satisfactory answers online, so here goes:

After reviewing a set of lectures on convex optimization, Newton's method seems to be a far superior algorithm to gradient descent for finding globally optimal solutions, because Newton's method can provide a guarantee for its solution, it's affine invariant, and most of all it converges in far fewer steps. Why are second-order optimization algorithms, such as Newton's method, not as widely used as stochastic gradient descent in machine learning problems?

For neural networks, deeplearningbook.org Section "8.6 Approximate Second-Order Methods" gives a nice overview. In summary "Beyond the challenges created by certain features of the objective function, such as saddle points, the application of Newton’s method for training large neural networks is limited by the significant computational burden it imposes." There exist alternatives that attempt to gain some of the advantages of Newton’s method while side-stepping the computational hurdles, but they have their own issues. – Franck Dernoncourt 2 days ago
see this related question and comments, stats.stackexchange.com/questions/232305/… – hxd1011 2 days ago
Note that the other comments have some wider applicability to machine learning beyond just "deep learning". However while all ML problems can tend to be "big data", not all ML problems are necessarily "big features" (i.e. many parameters to tune), though deep learning invariably is. – GeoMatt22 2 days ago

Gradient descent optimizes a function using knowledge of its derivative. Newton's method, a root-finding algorithm applied to that derivative, uses knowledge of the second derivative as well. That can be faster when the second derivative is known and easy to compute (the Newton-Raphson algorithm is used in logistic regression). However, the analytic expression for the second derivative is often complicated or intractable, requiring a lot of computation. Numerical methods for computing the second derivative also require a lot of computation -- if $N$ values are required to compute the first derivative, on the order of $N^2$ are required for the second derivative.
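
To make the cost difference concrete, here is a minimal sketch in Python (not from the answer; the data, step size, and iteration counts are made up for illustration). Newton-Raphson for logistic regression takes a handful of steps, each requiring the $N \times N$ Hessian to be formed and solved, while gradient descent takes many steps, each needing only the $N$-entry gradient:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient(w, X, y):
        # first derivative of the negative log-likelihood: N numbers
        return X.T @ (sigmoid(X @ w) - y)

    def hessian(w, X, y):
        # second derivative: an N-by-N matrix -- the expensive part
        p = sigmoid(X @ w)
        return X.T @ (X * (p * (1 - p))[:, None])

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5))
    y = (rng.random(200) < sigmoid(X @ np.ones(5))).astype(float)

    # Newton-Raphson: a few expensive steps (form and solve the Hessian each time)
    w_newton = np.zeros(5)
    for _ in range(8):
        w_newton -= np.linalg.solve(hessian(w_newton, X, y), gradient(w_newton, X, y))

    # Gradient descent: many cheap steps (gradient only)
    w_gd = np.zeros(5)
    for _ in range(5000):
        w_gd -= 0.01 * gradient(w_gd, X, y)

    print(np.round(w_newton, 3))  # both runs approach the same optimum
    print(np.round(w_gd, 3))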

Worth noting that (things based on) the Gauss-Newton method are probably more common. This is a specialization of Newton to nonlinear least squares. – GeoMatt22 2 days ago
I wouldn't call Gauss-Newton a specialization of Newton to nonlinear least squares. I'd call it a bastardized approximation of Newton for nonlinear least squares, which uses a more inaccurate Hessian approximation, the larger the residuals in the fitted equations, and accordingly, the further the argument is from optimality. – Mark L. Stone 2 days ago
@MarkL.Stone fair point, I was trying to not get into technicalities :) It is true that Gauss-Newton style methods try to "fake" 2nd order w/only 1st order info. Personally I have never used Newton methods for optimization, just Gauss-Newton (or LM, or ~similar UKF) or DFO-SQP methods (e.g. BOBYQA). "Optimality" is a tricky question I would say ... for a ML problem, vs. say an engineering design-optimization problem, the reliability/informativeness of a "local Hessian" can be dubious. Perhaps non-local DFO-SQP is ~"stochastic Newton"? (e.g. "online") – GeoMatt22 2 days ago
On second thought, DFO-SQP approaches tend to be nonlocal in the parameter space, rather than data batches. The UKF may be the closest in flavor to "stochastic Newton" as it is online w/limited memory ... but it effectively assumes a positive-definite Hessian (i.e. Gaussian approx.). – GeoMatt22 2 days ago

I recently learned this myself - the problem is the proliferation of saddle points in high-dimensional spaces, which Newton's method tends to converge to. See this article: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.

Indeed the ratio of the number of saddle points to local minima increases exponentially with the dimensionality N.

While gradient descent dynamics are repelled away from a saddle point to lower error by following directions of negative curvature, ...the Newton method does not treat saddle points appropriately; as argued below, saddle-points instead become attractive under the Newton dynamics.
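
As a toy illustration of the quoted behaviour (not taken from the paper), consider $f(x, y) = x^2 - y^2$, which has a saddle at the origin: a pure, unsafeguarded Newton step jumps straight onto the saddle, while gradient descent is repelled from it along the negative-curvature direction.

    import numpy as np

    # f(x, y) = x^2 - y^2 has a saddle point at the origin.
    def grad(w):
        return np.array([2.0 * w[0], -2.0 * w[1]])

    def hess(w):
        return np.array([[2.0, 0.0], [0.0, -2.0]])

    w0 = np.array([1.0, 0.1])

    # A pure Newton step solves H d = -g and lands exactly on the saddle.
    w_newton = w0 + np.linalg.solve(hess(w0), -grad(w0))
    print("Newton step lands at", w_newton)      # [0. 0.] -- the saddle

    # Gradient descent is repelled along the negative-curvature direction.
    w_gd = w0.copy()
    for _ in range(100):
        w_gd -= 0.1 * grad(w_gd)
    print("gradient descent moves to", w_gd)     # x -> 0, |y| grows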

Could you add some explanation to why this is so? In theory, Newton's method performs a weighted gradient descent with "optimal" weights for each of the eigenvectors. – nbubis 2 days ago
What that article says about Newton methods "wanting to" converge to saddle points is only true for garbage implementations of Newton's method. – Mark L. Stone yesterday

More people should be using Newton's method in machine learning*. I say this as someone with a background in numerical optimization, who has dabbled in machine learning over the past couple of years.

Most of the drawbacks in answers here (and even in the literature) are not really an issue if you use Newton's method correctly. Moreover, the drawbacks that do matter also slow down gradient descent the same amount or more, but through less obvious mechanisms.

  • Convergence to saddle points is dealt with through a line search with the Wolfe conditions, or through trust regions. A proper gradient descent implementation should be doing this too.

  • The whole (dense) Hessian doesn't need to be constructed; the inverse can be applied with iterative methods that only use matrix-vector products (e.g., Krylov methods like conjugate gradient). See, for example, the CG-Steihaug trust region method. (A minimal sketch of this matrix-free approach appears below.)

  • Hessian matrix-vector products can be computed efficiently through solving two higher order adjoint equations of the same form as the adjoint equation that is already used to compute the gradient (e.g., the work of two backpropagation steps in neural network training).

  • Ill conditioning slows the convergence of iterative linear solvers, but it slows gradient descent just as much or more. Basically, the difficulty has been shifted from the nonlinear optimization stage (where not much can be done to improve the situation) to the linear algebra stage (where it can be attacked with the entire arsenal of numerical linear algebra preconditioning techniques).

  • Also, the computation shifts from "many many cheap steps" to "a few costly steps", opening up more opportunities for parallelism at the sub-step (linear algebra) level.

*Of course, Newton's method will not help you with L1 or other similar compressed sensing/sparsity promoting penalty functions, since they lack the required smoothness.
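
Here is a minimal sketch of the matrix-free idea from the bullet points above, on a small logistic-regression problem (the data and iteration counts are made up). The Hessian-vector product is approximated by a finite difference of gradients purely for brevity; the answer's point is that in a neural network the same product can be computed exactly with additional adjoint/backpropagation passes.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient(w, X, y):
        return X.T @ (sigmoid(X @ w) - y)

    def hess_vec(w, v, X, y, eps=1e-6):
        # Hessian-vector product without ever forming the Hessian.
        # (Finite difference of gradients here; in a neural network this can be
        # done exactly with an extra adjoint/backpropagation pass.)
        return (gradient(w + eps * v, X, y) - gradient(w, X, y)) / eps

    def cg_solve(matvec, b, max_iter=50, tol=1e-8):
        # Plain conjugate gradients: needs only matrix-vector products.
        x = np.zeros_like(b)
        r = b - matvec(x)
        p = r.copy()
        rs = r @ r
        for _ in range(max_iter):
            Ap = matvec(p)
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 20))
    y = (rng.random(500) < sigmoid(X @ np.ones(20))).astype(float)

    w = np.zeros(20)
    for _ in range(10):                                    # inexact ("Hessian-free") Newton
        g = gradient(w, X, y)
        if np.linalg.norm(g) < 1e-8:
            break
        d = cg_solve(lambda v: hess_vec(w, v, X, y), -g)   # solve H d = -g by CG
        w += d

    print("final gradient norm:", np.linalg.norm(gradient(w, X, y)))

The Newton system $Hd = -g$ is solved only approximately with a few CG iterations; each outer step is more expensive than a gradient step, but far fewer of them are needed.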

I think we're in violent agreement with each other, not with everyone else. – Mark L. Stone yesterday
Tell me what you think of this paper link.springer.com/article/10.1007/s11081-016-9323-4 ? Line search Newton method applied to very non-convex problems w/no adjustment to Hessian or searching along directions of negative curvature. No wonder he thinks Quasi-Newton (probably BFGS) is more robust than Newton. The author claims he used a simple version of Newton's method so as to have an apples to apples comparison between Euclidean and Riemannian Newton's method. (continued in next comment) – Mark L. Stone yesterday
That's like comparing whether UK or USA produces better research mathematicians by comparing the mathematical abilities of 26 year old drug addict high school dropouts, rather than by comparing the top echelon of mathematics graduate students coming out of each country's finest schools. The paper is signed, sealed, and delivered, no one, and I mean no one's changing it or withdrawing it now. Incredible. – Mark L. Stone yesterday
@MarkL.Stone It seems a conversation happened here and was deleted while I was away. Anyways, I think you're right that we agree with each other and no one else. I guess this is to be expected based on our background as compared to the other people here. As you probably expect I don't think very much of the linked paper. On the other hand, I do think that Riemannian manifold Newton's method, where one shoots a geodesic trajectory in a Newton search direction, is a technique with a lot of promise for very hard problems. – Nick Alger yesterday

The gradient descent direction is cheaper to calculate, and performing a line search in that direction is a more reliable, steady source of progress toward an optimum. In short, gradient descent is relatively reliable.

Newton's method is relatively expensive in that you need to calculate the Hessian on the first iteration. Then, on each subsequent iteration, you can either fully recalculate the Hessian (as in Newton's method) or merely "update" the prior iteration's Hessian (in quasi-Newton methods) which is cheaper but less robust.

In the extreme case of a very well-behaved function, especially a perfectly quadratic function, Newton's method is the clear winner. If it's perfectly quadratic, Newton's method will converge in a single iteration.

In the opposite extreme case of a very poorly behaved function, gradient descent will tend to win out. It'll pick a search direction, search down that direction, and ultimately take a small-but-productive step. By contrast, Newton's method will tend to fail in these cases, especially if you try to use the quasi-Newton approximations.

In between gradient descent and Newton's method, there are methods like the Levenberg–Marquardt algorithm (LMA), though I've seen the names confused a bit. The gist is to use a more gradient-descent-informed search when things are chaotic and confusing, then switch to a more Newton-informed search when things are getting more linear and reliable.
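
As a rough sketch of that gist (this is the generic damped-Newton flavour rather than the exact least-squares LM update; the test function and constants are only illustrative): solve $(H + \lambda I)d = -g$, shrink $\lambda$ when a step succeeds (more Newton-like) and grow it when a step fails (more gradient-descent-like).

    import numpy as np

    def f(w):
        x, y = w
        return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2      # Rosenbrock: a badly behaved valley

    def grad(w):
        x, y = w
        return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                         200 * (y - x ** 2)])

    def hess(w):
        x, y = w
        return np.array([[2 - 400 * (y - x ** 2) + 800 * x ** 2, -400 * x],
                         [-400 * x, 200.0]])

    w = np.array([-1.2, 1.0])   # classic hard starting point
    lam = 1e-3                  # damping: large -> gradient-descent-like, small -> Newton-like

    for _ in range(200):
        g, H = grad(w), hess(w)
        if np.linalg.norm(g) < 1e-10:
            break
        step = np.linalg.solve(H + lam * np.eye(2), -g)
        if f(w + step) < f(w):                   # step helped: accept it, trust the model more
            w, lam = w + step, max(lam / 3.0, 1e-12)
        else:                                    # step hurt: reject it, lean toward gradient descent
            lam *= 3.0

    print(w)   # approaches the minimizer (1, 1)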

Boy, you must use terrible implementations of Newton and Quasi-Newton. If using either with a non-positive-definite Hessian, then either use trust regions or do a line search along direction(s) of negative curvature. If so, they are MORE reliable than steepest descent (i.e., gradient descent with line search or trust region). In short, gradient descent is much less reliable than a properly implemented Quasi-Newton method, which is less reliable than a properly implemented Newton method. Computation time and memory requirements per iteration are a different matter however. – Mark L. Stone 2 days ago
I think you mean perfectly quadratic function. That is, Newton's method converges in a single iteration with a quadratic objective function, which has a linear gradient. – Elizabeth Santorella 2 days ago
@MarkL.Stone: By definition, gradient descent goes in the optimal direction for an infinitely small step. By contrast, Newton's method uses second-order effects to project a zero, attempting to get a more direct route assuming that you'll make a more significant step. For a very poorly behaved function where step sizes are arbitrarily small, gradient descent produces the optimal search direction. If your implementation of Newton's method gives you the optimal search direction for tiny steps, it's not Newton's method - it's steepest descent, by definition. – Chemical Engineer yesterday
@ElizabethSantorella: Yup, you're right! I updated the answer. – Chemical Engineer yesterday
@Chemical Engineer In your last sentence, you have a very strange definition. In the limit, if you restrict algorithms to take a zero-length step, then steepest descent is in a multi-way tie for being optimal; otherwise, not. Here's a way to think about steepest descent: it is Newton's method, which is based on a quadratic expansion, except that instead of using the true Hessian in the 2nd-order term, it uses the approximation that the Hessian equals the identity matrix. This decreases the fidelity of the quadratic model, and makes any non-zero step based on it inferior unless the true Hessian equals the identity. – Mark L. Stone yesterday

For large dimensions, the Hessian is typically expensive to store, and solving $Hd = -g$ for a search direction can be expensive. It is also more difficult to parallelise.

Newton's method works well when close to a solution, or if the Hessian is slowly varying, but needs some tricks to deal with lack of convergence and lack of definiteness.

Often an improvement is sought, rather than an exact solution, in which case the extra cost of Newton or Newton like methods is not justified.

There are various ways of ameliorating the above, such as variable metric or trust region methods.

As a side note, in many problems a key issue is scaling and the Hessian provides excellent scaling information, albeit at a cost. If one can approximate the Hessian, it can often improve performance considerably. To some extent, Newton's method provides the 'best' scaling in that it is affine invariant.
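
A toy illustration of the scaling point (constants made up): on a badly scaled quadratic, plain gradient descent is throttled by the stiffest direction, while the Hessian-scaled (Newton) step removes the scaling problem in one solve.

    import numpy as np

    # Badly scaled quadratic: f(w) = 0.5 * w' H w, condition number 1e4.
    H = np.diag([1.0, 1e4])
    grad = lambda w: H @ w

    w = np.array([1.0, 1.0])
    for _ in range(1000):              # gradient descent: the step size is capped by the
        w -= 1e-4 * grad(w)            # stiffest direction (~1/largest eigenvalue)
    print("gradient descent:", w)      # the slow coordinate has barely moved

    w = np.array([1.0, 1.0])
    w -= np.linalg.solve(H, grad(w))   # Newton: the Hessian supplies the scaling
    print("Newton:", w)                # exactly [0, 0] after one step

Newton's one-step convergence here is just the affine-invariance point made above: rescaling the coordinates changes nothing for Newton, while it changes everything for gradient descent.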

