Code Adam Gradient Descent Optimization From Scratch
Last Updated on January 16, 2021
Gradient descent is an optimization algorithm that follows the negative gradient of an
By Nick Cotes
Last Updated on January 16, 2021
Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.
A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent like AdaGrad and RMSProp update the algorithm to use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.
The Adaptive Movement Estimation algorithm, or Adam for short, is an extension to gradient descent and a natural successor to techniques like AdaGrad and RMSProp that automatically adapts a learning rate for each input variable for the objective function and further smooths the search process by using an exponentially decreasing moving average of the gradient to make updates to variables.
In this tutorial, you will discover how to develop gradient descent with Adam optimization algorithm from scratch.
After completing this tutorial, you will know:
Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adam.
How to implement the Adam optimization algorithm from scratch and apply it to an objective function and evaluate the results.
Let’s get started.
Gradient Descent Optimization With Adam From Scratch Photo by Don Graham, some rights reserved.
This tutorial is divided into three parts; they are:
The first-order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.
If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.
Gradient: First-order derivative for a multivariate objective function.
The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.
Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.
The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.
The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.
The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.
A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.
x(t) = x(t-1) – step_size * f'(x(t-1))
The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.
Step Size (alpha): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.
If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the Adam algorithm.
Adam Optimization Algorithm
Adaptive Movement Estimation algorithm, or Adam for short, is an extension to the gradient descent optimization algorithm.
Adam is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.
This is achieved by calculating a step size for each input parameter that is being optimized. Importantly, each step size is automatically adapted throughput the search process based on the gradients (partial derivatives) encountered for each variable.
We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation
First, we must maintain a moment vector and exponentially weighted infinity norm for each parameter being optimized as part of the search, referred to as m and v (really the Greek letter nu) respectively. They are initialized to 0.0 at the start of the search.
m = 0
v = 0
The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from x(t-1) to x(t).
It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.
First, the gradient (partial derivatives) are calculated for the current time step.
g(t) = f'(x(t-1))
Next, the first moment is updated using the gradient and a hyperparameter beta1.
m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)
Then the second moment is updated using the squared gradient and a hyperparameter beta2.
v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2
The first and second moments are biased because they are initialized with zero values.
… these moving averages are initialized as (vectors of) 0’s, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the betas are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates …
Next the first and second moments are bias-corrected, starring with the first moment:
mhat(t) = m(t) / (1 – beta1(t))
And then the second moment:
vhat(t) = v(t) / (1 – beta2(t))
Note, beta1(t) and beta2(t) refer to the beta1 and beta2 hyperparameters that are decayed on a schedule over the iterations of the algorithm. A static decay schedule can be used, although the paper recommend the following:
beta1(t) = beta1^t
beta2(t) = beta2^t
Finally, we can calculate the value for the parameter for this iteration.
This is then repeated for each parameter that is being optimized.
At the end of the iteration we can evaluate the new parameter values and report the performance of the search.
# evaluate candidate point
# report progress
print('>%d f(%s) = %.5f'%(t,x,score))
We can tie all of this together into a function named adam() that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.
Note: we have intentionally used lists and imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.
We can then define our hyperparameters and call the adam() function to optimize our test objective function.
In this case, we will use 60 iterations of the algorithm with an initial steps size of 0.02 and beta1 and beta2 values of 0.8 and 0.999 respectively. These hyperparameter values were found after a little trial and error.
Running the example applies the Adam optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that a near-optimal solution was found after perhaps 53 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.