A Gentle Introduction to the BFGS Optimization Algorithm



The Broyden, Fletcher, Goldfarb, and Shanno, or BFGS Algorithm, is a local search optimization algorithm.

It is a type of second-order optimization algorithm, meaning that it makes use of the second-order derivative of an objective function. It belongs to a class of algorithms referred to as Quasi-Newton methods, which approximate the second derivative (called the Hessian) for optimization problems where the second derivative cannot be calculated.

The BFGS algorithm is perhaps one of the most widely used second-order algorithms for numerical optimization and is commonly used to fit machine learning algorithms such as the logistic regression algorithm.

In this tutorial, you will discover the BFGS second-order optimization algorithm.

After completing this tutorial, you will know:

  • Second-order optimization algorithms are algorithms that make use of the second-order derivative, called the Hessian matrix for multivariate objective functions.
  • The BFGS algorithm is perhaps the most popular second-order algorithm for numerical optimization and belongs to a group called Quasi-Newton methods.
  • How to minimize objective functions using the BFGS and L-BFGS-B algorithms in Python.

Let’s get started.

A Gentle Introduction to the BFGS Optimization Algorithm
Photo by Timo Newton-Syms, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Second-Order Optimization Algorithms
  2. BFGS Optimization Algorithm
  3. Worked Example of BFGS

Second-Order Optimization Algorithms

Optimization involves finding values for input parameters that maximize or minimize an objective function.

Newton-method optimization algorithms are those algorithms that make use of the second derivative of the objective function.

You may recall from calculus that the first derivative of a function is the rate of change or slope of the function at a specific point. The derivative can be followed downhill (or uphill) by an optimization algorithm toward the minimum of the function (the input values that result in the smallest output of the objective function).

Algorithms that make use of the first derivative are called first-order optimization algorithms. An example of a first-order algorithm is the gradient descent optimization algorithm.

  • First-Order Methods: Optimization algorithms that make use of the first-order derivative to find the optima of an objective function.

The second-order derivative is the derivative of the derivative, or the rate of change of the rate of change.

The second derivative can be followed to more efficiently locate the optima of the objective function. This makes sense more generally, as the more information we have about the objective function, the easier it may be to optimize it.

The second-order derivative allows us to know not only which direction to move (like the first-order derivative) but also to estimate how far to move in that direction, called the step size.

Second-order information, on the other hand, allows us to make a quadratic approximation of the objective function and approximate the right step size to reach a local minimum …

— Page 87, Algorithms for Optimization, 2019.

Algorithms that make use of the second-order derivative are referred to as second-order optimization algorithms.

  • Second-Order Methods: Optimization algorithms that make use of the second-order derivative to find the optima of an objective function.

An example of a second-order optimization algorithm is Newton’s method.
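
As a rough one-dimensional illustration of how the second derivative supplies a step size, the sketch below applies a single Newton step. The objective f(x) = x^2, the starting value, and the function names are assumptions chosen only for this example.

```python
# one Newton step in one dimension on the illustrative objective f(x) = x^2

def first_derivative(x):
    # slope of f(x) = x^2
    return 2.0 * x

def second_derivative(x):
    # curvature of f(x) = x^2, constant everywhere
    return 2.0

def newton_step(x):
    # the slope gives the direction, the inverse curvature scales the step
    return x - first_derivative(x) / second_derivative(x)

x = 4.0
x_new = newton_step(x)
print('Moved from %.1f to %.1f' % (x, x_new))
```

Because this objective is exactly quadratic, the quadratic model used by Newton's method matches the function and a single step lands on the minimum. On other functions several steps are generally needed.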

When an objective function has more than one input variable, the input variables together may be thought of as a vector, which may be familiar from linear algebra.

The gradient is the generalization of the derivative to multivariate functions. It captures the local slope of the function, permitting us to predict the effect of taking a small step from a point in any direction.

— Page 21, Algorithms for Optimization, 2019.

Similarly, the first derivative of a function with multiple input variables is also a vector, where each element is called a partial derivative. This vector of partial derivatives is referred to as the gradient.

  • Gradient: Vector of partial first derivatives for multiple input variables of an objective function.

This idea generalizes to the second-order derivatives of the multivariate inputs, which form a matrix of second derivatives called the Hessian matrix.

  • Hessian: Matrix of partial second-order derivatives for multiple input variables of an objective function.

The Hessian matrix is square and symmetric if the second derivatives are all continuous at the point where we are calculating the derivatives. This is often the case when solving real-valued optimization problems and an expectation when using many second-order methods.
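
To make the gradient and Hessian concrete, here is a small sketch for the illustrative two-variable function f(x, y) = x^2 + y^2. The function and the helper names are assumptions made for this example, and the derivatives are written out by hand rather than computed automatically.

```python
# gradient and Hessian of the illustrative function f(x, y) = x^2 + y^2
from numpy import array, allclose

def gradient(x, y):
    # vector of first partial derivatives: [df/dx, df/dy]
    return array([2.0 * x, 2.0 * y])

def hessian(x, y):
    # matrix of second partial derivatives, constant for this function
    return array([[2.0, 0.0],
                  [0.0, 2.0]])

g = gradient(1.0, -2.0)
H = hessian(1.0, -2.0)
print(g)
print(H)
# a Hessian with continuous second derivatives equals its own transpose
print(allclose(H, H.T))
```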

The Hessian of a multivariate function is a matrix containing all of the second derivatives with respect to the input. The second derivatives capture information about the local curvature of the function.

— Page 21, Algorithms for Optimization, 2019.

As such, it is common to describe second-order optimization algorithms as making use of or following the Hessian to the optima of the objective function.

Now that we have a high-level understanding of second-order optimization algorithms, let’s take a closer look at the BFGS algorithm.

BFGS Optimization Algorithm

BFGS is a second-order optimization algorithm.

It is an acronym, named for the four co-discoverers of the algorithm: Broyden, Fletcher, Goldfarb, and Shanno.

It is a local search algorithm, intended for convex optimization problems with a single optimum.

The BFGS algorithm is perhaps best understood as belonging to a group of algorithms that are an extension to Newton’s Method optimization algorithm, referred to as Quasi-Newton Methods.

Newton’s method is a second-order optimization algorithm that makes use of the Hessian matrix.

A limitation of Newton’s method is that it requires the calculation of the inverse of the Hessian matrix. This is a computationally expensive operation and may not be stable depending on the properties of the objective function.

Quasi-Newton methods are second-order optimization algorithms that approximate the inverse of the Hessian matrix using the gradient, meaning that the Hessian and its inverse do not need to be available or calculated precisely for each step of the algorithm.

Quasi-Newton methods are among the most widely used methods for nonlinear optimization. They are incorporated in many software libraries, and they are effective in solving a wide variety of small to midsize problems, in particular when the Hessian is hard to compute.

— Page 411, Linear and Nonlinear Optimization, 2009.

The main difference between different Quasi-Newton optimization algorithms is the specific way in which the approximation of the inverse Hessian is calculated.

The BFGS algorithm is one specific way of updating the calculation of the inverse Hessian, instead of recalculating it every iteration. It, or its extensions, may be one of the most popular Quasi-Newton or even second-order optimization algorithms used for numerical optimization.

The most popular quasi-Newton algorithm is the BFGS method, named for its discoverers Broyden, Fletcher, Goldfarb, and Shanno.

— Page 136, Numerical Optimization, 2006.

A benefit of using the Hessian, when available, is that it can be used to determine both the direction and the step size to move in order to change the input parameters to minimize (or maximize) the objective function.

Quasi-Newton methods like BFGS approximate the inverse Hessian, which can then be used to determine the direction to move, but we no longer have the step size.

The BFGS algorithm addresses this by using a line search in the chosen direction to determine how far to move in that direction.
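
The loop described above (approximate the inverse Hessian, pick a direction, line search for the step size, update the approximation) can be sketched in a few lines. This is a teaching sketch under stated assumptions, not the SciPy implementation: it uses a simple backtracking line search instead of the Wolfe-condition search that production code typically uses, and the names bfgs_sketch and backtracking, the test function, and all default values are illustrative.

```python
# minimal BFGS sketch with a backtracking line search (illustrative only)
from numpy import array, eye, outer, dot
from numpy.linalg import norm

def objective(x):
    return x[0] ** 2.0 + x[1] ** 2.0

def derivative(x):
    return array([2.0 * x[0], 2.0 * x[1]])

def backtracking(f, x, fx, g, d, alpha=1.0, shrink=0.5, c=1e-4):
    # shrink the step until the sufficient-decrease (Armijo) condition holds
    while f(x + alpha * d) > fx + c * alpha * dot(g, d):
        alpha *= shrink
    return alpha

def bfgs_sketch(f, grad, x0, n_iter=50, tol=1e-8):
    x = array(x0, dtype=float)
    n = len(x)
    H = eye(n)  # initial approximation of the inverse Hessian
    g = grad(x)
    for _ in range(n_iter):
        if norm(g) < tol:
            break
        d = -H.dot(g)  # search direction from the approximation
        alpha = backtracking(f, x, f(x), g, d)  # step size from line search
        s = alpha * d
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        sy = dot(s, y)
        if sy > 1e-12:  # skip the update if the curvature condition fails
            rho = 1.0 / sy
            I = eye(n)
            # BFGS update of the inverse Hessian approximation
            H = (I - rho * outer(s, y)).dot(H).dot(I - rho * outer(y, s)) + rho * outer(s, s)
        x, g = x_new, g_new
    return x

solution = bfgs_sketch(objective, derivative, [3.0, -4.0])
print(solution, objective(solution))
```

Note that the approximation H is built only from differences of successive points (s) and gradients (y); no second derivative is ever evaluated.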

For the derivation and calculations used by the BFGS algorithm, I recommend the resources in the further reading section at the end of this tutorial.

The size of the Hessian and its inverse grows with the square of the number of input parameters to the objective function. As such, the matrix can become very large for hundreds, thousands, or millions of parameters.

… the BFGS algorithm must store the inverse Hessian matrix, M, that requires O(n^2) memory, making BFGS impractical for most modern deep learning models that typically have millions of parameters.

— Page 317, Deep Learning, 2016.
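
To put the O(n^2) figure in perspective, a back-of-the-envelope calculation helps. The sketch below assumes a dense n-by-n matrix stored as 8-byte floating point values; the helper name and the parameter counts are illustrative.

```python
# rough memory needed to store a dense n x n inverse Hessian approximation

def inverse_hessian_bytes(n_params, bytes_per_value=8):
    # n rows times n columns, each entry an 8-byte float by assumption
    return n_params * n_params * bytes_per_value

for n in (100, 1000, 1000000):
    print('%d parameters: %d bytes' % (n, inverse_hessian_bytes(n)))
```

Under these assumptions, one million parameters would need 8 x 10^12 bytes, or roughly 8 terabytes, for the matrix alone.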

Limited Memory BFGS (or L-BFGS) is an extension to the BFGS algorithm that addresses the cost of having a large number of parameters. It does this by not requiring that the entire approximation of the inverse matrix be stored, instead assuming a simplification of the inverse Hessian in the previous iteration of the algorithm (used in the approximation).

Now that we are familiar with the BFGS algorithm from a high level, let’s look at how we might make use of it.

Worked Example of BFGS

In this section, we will look at some examples of using the BFGS optimization algorithm.

We can implement the BFGS algorithm for optimizing arbitrary functions in Python using the minimize() SciPy function.

The function takes a number of arguments, but most importantly, we can specify the name of the objective function as the first argument, the starting point for the search as the second argument, and specify the “method” argument as “BFGS”. The name of the function used to calculate the derivative of the objective function can be specified via the “jac” argument.

Let’s look at an example.

First, we can define a simple two-dimensional objective function, a bowl function, e.g. x^2. It is simply the sum of the squared input variables with an optimum at f(0, 0) = 0.0.
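
A minimal sketch of this objective, assuming the input arrives as a two-element sequence; the name objective is an illustrative choice:

```python
# objective function: sum of the squared input variables
def objective(x):
    return x[0] ** 2.0 + x[1] ** 2.0
```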

Next, let’s define a function for the derivative of the function, which is [x*2, y*2].
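
This might be written as follows, again assuming a two-element input; the name derivative is illustrative, and the result is returned as a NumPy array so it can be passed directly to SciPy:

```python
# derivative of the objective function: one partial derivative per input
from numpy import asarray

def derivative(x):
    return asarray([x[0] * 2.0, x[1] * 2.0])
```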

We will define the bounds of the function as a box with the range -5 to 5 in each dimension.

The starting point of the search will be a randomly generated position in the search domain.
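
One way to sketch the bounds and the random starting point, assuming NumPy's rand() for uniform samples; the names r_min, r_max, and pt are illustrative:

```python
# define the range for the input and draw a random starting point inside it
from numpy.random import rand

r_min, r_max = -5.0, 5.0
# rand(2) gives two uniform values in [0, 1), rescaled here to [-5, 5)
pt = r_min + rand(2) * (r_max - r_min)
print(pt)
```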

We can then apply the BFGS algorithm to find the minimum of the objective function by specifying the name of the objective function, the initial point, the method we want to use (BFGS), and the name of the derivative function.

We can then review the result reporting a message as to whether the algorithm finished successfully or not and the total number of evaluations of the objective function that were performed.

Finally, we can report the input variables that were found and their evaluation versus the objective function.

Tying this together, the complete example is listed below.
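
A sketch of that complete example, assuming NumPy and SciPy are installed. The function and variable names are illustrative, and no random seed is set, so the starting point differs on every run.

```python
# bfgs algorithm local optimization of a convex function
from scipy.optimize import minimize
from numpy import asarray
from numpy.random import rand

# objective function
def objective(x):
    return x[0] ** 2.0 + x[1] ** 2.0

# derivative of the objective function
def derivative(x):
    return asarray([x[0] * 2.0, x[1] * 2.0])

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the bfgs algorithm search
result = minimize(objective, pt, method='BFGS', jac=derivative)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
```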

Running the example applies the BFGS algorithm to our objective function and reports the results.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that four iterations of the algorithm were performed and a solution very close to the optimum f(0.0, 0.0) = 0.0 was discovered, at least to a useful level of precision.

The minimize() function also supports the L-BFGS algorithm, which has lower memory requirements than BFGS.

Specifically, it supports the L-BFGS-B version of the algorithm, where the -B suffix indicates a “boxed” version in which the bounds of the domain can be specified.

This can be achieved by specifying the “method” argument as “L-BFGS-B”.

The complete example with this update is listed below.
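
A sketch of the updated example, under the same assumptions as before. Passing the box through the bounds argument of minimize() is an optional addition made here to show what the “boxed” variant allows; the search also runs without it.

```python
# l-bfgs-b algorithm local optimization of a convex function
from scipy.optimize import minimize
from numpy import asarray
from numpy.random import rand

# objective function
def objective(x):
    return x[0] ** 2.0 + x[1] ** 2.0

# derivative of the objective function
def derivative(x):
    return asarray([x[0] * 2.0, x[1] * 2.0])

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# one (low, high) pair per input variable (optional, illustrative)
bounds = [(r_min, r_max), (r_min, r_max)]
# perform the l-bfgs-b algorithm search
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative, bounds=bounds)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
```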

Running the example applies the L-BFGS-B algorithm to our objective function and reports the results.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Again, we can see that the minimum of the function is found in very few evaluations.

It might be a fun exercise to increase the dimensions of the test problem to millions of parameters and compare the memory usage and run time of the two algorithms.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

APIs

Articles

Summary

In this tutorial, you discovered the BFGS second-order optimization algorithm.

Specifically, you learned:

  • Second-order optimization algorithms are algorithms that make use of the second-order derivative, called the Hessian matrix for multivariate objective functions.
  • The BFGS algorithm is perhaps the most popular second-order algorithm for numerical optimization and belongs to a group called Quasi-Newton methods.
  • How to minimize objective functions using the BFGS and L-BFGS-B algorithms in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



Author: Shantun Parmar
