Gradient Descent With RMSProp from Scratch

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that it uses the same step size (learning rate) for each input variable. Adaptive Gradient, or AdaGrad for short, is an extension of the gradient descent optimization algorithm that allows the step size in each dimension to be automatically adapted based on the gradients (partial derivatives) seen for the variable over the course of the search.

A limitation of AdaGrad is that it can result in a very small step size for each parameter by the end of the search, which can slow the progress of the search down too much and may mean the optima are not located.

Root Mean Squared Propagation, or RMSProp, is an extension of gradient descent and the AdaGrad version of gradient descent that uses a decaying average of partial gradients in the adaptation of the step size for each parameter. The use of a decaying moving average allows the algorithm to forget early gradients and focus on the most recently observed partial gradients seen during the progress of the search, overcoming the limitation of AdaGrad.

In this tutorial, you will discover how to develop the gradient descent with RMSProp optimization algorithm from scratch.

After completing this tutorial, you will know:

Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying moving average of partial derivatives, called RMSProp.
How to implement the RMSProp optimization algorithm from scratch, apply it to an objective function, and evaluate the results.

Let’s get started.


Tutorial Overview

This tutorial is divided into three parts; they are:

Gradient Descent
Root Mean Squared Propagation (RMSProp)
Gradient Descent With RMSProp
1. Two-Dimensional Test Problem
2. Gradient Descent Optimization With RMSProp
3. Visualization of RMSProp

Gradient Descent

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is often referred to as the gradient.

Gradient: First order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

x = x - step_size * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

Step Size (alpha): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
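The effect of the step size can be seen in a small sketch. It applies the update rule to the one-dimensional function f(x) = x^2; the helper names gradient_descent() and f_prime() and the three step sizes are assumptions chosen only for illustration.

```python
# minimal sketch: plain gradient descent on f(x) = x^2 with three step sizes

# derivative of f(x) = x^2
def f_prime(x):
    return 2.0 * x

# repeatedly apply x = x - step_size * f'(x)
def gradient_descent(x0, step_size, n_iter):
    x = x0
    for _ in range(n_iter):
        x = x - step_size * f_prime(x)
    return x

# illustrative step sizes (assumed values, not tuned)
x_small = gradient_descent(1.0, 0.001, 20)  # barely moves from the start
x_good = gradient_descent(1.0, 0.1, 20)     # gets close to the minimum at 0
x_large = gradient_descent(1.0, 1.1, 20)    # overshoots further each step
print(x_small, x_good, x_large)
```

With the smallest step size the point has hardly left x = 1.0 after 20 iterations, while the largest one moves away from the minimum instead of toward it.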

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at RMSProp.

Root Mean Squared Propagation (RMSProp)

Root Mean Squared Propagation, or RMSProp for short, is an extension to the gradient descent optimization algorithm.

It is an unpublished extension, first described in Geoffrey Hinton’s lecture notes for his Coursera course on neural networks, specifically Lecture 6e titled “rmsprop: Divide the gradient by a running average of its recent magnitude.”

RMSProp is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

It is related to another extension to gradient descent called Adaptive Gradient, or AdaGrad.

AdaGrad is designed to specifically explore the idea of automatically tailoring the step size (learning rate) for each parameter in the search space. This is achieved by first calculating a step size for a given dimension, then using the calculated step size to make a movement in that dimension using the partial derivative. This process is then repeated for each dimension in the search space.

AdaGrad calculates the step size for each parameter by first summing the squared partial derivatives for the parameter seen so far during the search, then dividing the initial step size hyperparameter by the square root of that sum.

The calculation of the custom step size for one parameter is as follows:

cust_step_size = step_size / (1e-8 + sqrt(s))

Where cust_step_size is the calculated step size for an input variable for a given point during the search, step_size is the initial step size, sqrt() is the square root operation and s is the sum of the squared partial derivatives for the input variable seen during the search so far.

This has the effect of smoothing out the oscillations in the search for optimization problems that have a lot of curvature in the search space.

AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure.

— Pages 307-308, Deep Learning, 2016.

A problem with AdaGrad is that it can slow the search down too much, resulting in very small learning rates for each parameter or dimension of the search by the end of the run. This has the effect of stopping the search too soon, before the minima can be located.
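The shrinking step size can be illustrated with a short sketch, assuming a hypothetical parameter whose partial derivative stays at 1.0 for the whole search. The helper adagrad_step_sizes() is an illustrative name, not part of any library.

```python
# minimal sketch: AdaGrad custom step size for one parameter over time
from math import sqrt

# return the custom step size after each gradient in the history
def adagrad_step_sizes(gradients, step_size=0.1, eps=1e-8):
    s = 0.0
    sizes = list()
    for g in gradients:
        # accumulate the sum of the squared partial derivatives
        s += g ** 2.0
        # divide the initial step size by the root of the sum
        sizes.append(step_size / (eps + sqrt(s)))
    return sizes

# assumed history: a constant partial derivative of 1.0 for 100 iterations
sizes = adagrad_step_sizes([1.0] * 100)
print('first: %.5f, last: %.5f' % (sizes[0], sizes[-1]))
```

Because the sum can only grow, the custom step size can only decrease, even when the gradient itself never changes.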

RMSProp extends Adagrad to avoid the effect of a monotonically decreasing learning rate.

— Page 78, Algorithms for Optimization, 2019.

RMSProp can be thought of as an extension of AdaGrad in that it uses a decaying average or moving average of the squared partial derivatives instead of the sum in the calculation of the learning rate for each parameter.

This is achieved by adding a new hyperparameter we will call rho that acts like momentum for the partial derivatives.

RMSProp maintains a decaying average of squared gradients.

— Page 78, Algorithms for Optimization, 2019.

Using a decaying moving average of the partial derivative allows the search to forget early partial derivative values and focus on the most recently seen shape of the search space.

RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.

— Page 308, Deep Learning, 2016.

The calculation of the mean squared partial derivative for one parameter is as follows:

s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0-rho))

Where s(t+1) is the decaying moving average of the squared partial derivative for one parameter for the current iteration of the algorithm, s(t) is the decaying moving average of the squared partial derivative for the previous iteration, f'(x(t))^2 is the squared partial derivative for the current parameter, and rho is a hyperparameter, typically with a value of 0.9, like momentum.
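To see how the decaying average forgets old values while a plain sum does not, consider the sketch below. The gradient history is made up for illustration (steep early on, nearly flat later), and the helpers running_sum() and decaying_average() are hypothetical names.

```python
# minimal sketch: running sum vs. decaying average of squared partial derivatives

# AdaGrad-style accumulation: every squared gradient counts forever
def running_sum(gradients):
    s = 0.0
    for g in gradients:
        s += g ** 2.0
    return s

# RMSProp-style accumulation: old squared gradients fade by rho each step
def decaying_average(gradients, rho=0.9):
    s = 0.0
    for g in gradients:
        s = (s * rho) + (g ** 2.0 * (1.0 - rho))
    return s

# assumed history: 10 large partial derivatives followed by 200 small ones
history = [10.0] * 10 + [0.1] * 200

total = running_sum(history)
average = decaying_average(history)
print('sum of squares: %.4f' % total)
print('decaying average: %.6f' % average)
```

The sum is still dominated by the ten early large gradients, whereas the decaying average has settled close to the square of the recent small gradient.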

Because we take a decaying average of the squared partial derivatives and then calculate the square root of this average, the technique is named after the root mean square (RMS), i.e. the square root of the mean squared partial derivatives. For example, the custom step size for a parameter may be written as follows, where RMS(s(t+1)) is the square root of s(t+1):

cust_step_size(t+1) = step_size / (1e-8 + RMS(s(t+1)))

Once we have the custom step size for the parameter, we can update the parameter using the custom step size and the partial derivative f'(x(t)).

x(t+1) = x(t) - cust_step_size(t+1) * f'(x(t))

This process is then repeated for each input variable until a new point in the search space is created and can be evaluated.
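A single RMSProp update for one parameter can be sketched as follows. The function name rmsprop_update() and the example values are assumptions for illustration; the objective is taken to be f(x) = x^2, so the partial derivative is 2x.

```python
# minimal sketch: one RMSProp update for a single parameter
from math import sqrt

# return the new parameter value and the new decaying average
def rmsprop_update(x, s, grad, step_size=0.01, rho=0.9):
    # update the decaying average of the squared partial derivative
    s = (s * rho) + (grad ** 2.0 * (1.0 - rho))
    # calculate the custom step size for this parameter
    cust_step_size = step_size / (1e-8 + sqrt(s))
    # move against the gradient
    x = x - cust_step_size * grad
    return x, s

# first update from two very different starting points for f(x) = x^2
x_near, s_near = rmsprop_update(1.0, 0.0, 2.0 * 1.0)
x_far, s_far = rmsprop_update(100.0, 0.0, 2.0 * 100.0)
print('moved %.5f from x=1.0' % (1.0 - x_near))
print('moved %.5f from x=100.0' % (100.0 - x_far))
```

Starting from an average of zero, the first move has nearly the same length in both cases, roughly step_size / sqrt(1.0 - rho), because dividing by the root of the average cancels the magnitude of the gradient.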

RMSProp is a very effective extension of gradient descent and is one of the preferred approaches often used to fit deep learning neural networks.

Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks. It is currently one of the go-to optimization methods being employed routinely by deep learning practitioners.

— Page 308, Deep Learning, 2016.

Now that we are familiar with the RMSprop algorithm, let’s explore how we might implement it and evaluate its performance.

Gradient Descent With RMSProp

In this section, we will explore how to implement the gradient descent optimization algorithm with adaptive gradients using the RMSProp algorithm.

Two-Dimensional Test Problem

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
# add_subplot() is used because recent Matplotlib releases no longer accept gca(projection='3d')
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

Three-Dimensional Plot of the Test Objective Function

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Two-Dimensional Contour Plot of the Test Objective Function

Now that we have a test objective function, let’s look at how we might implement the RMSProp optimization algorithm.

Gradient Descent Optimization With RMSProp

We can apply gradient descent with RMSProp to the test problem.

First, we need a function that calculates the derivative for this function.

f(x) = x^2
f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension. The derivative() function implements this below.

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
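One way to sanity-check an analytic derivative like this is to compare it against a central finite-difference estimate. The sketch below does that at an arbitrarily chosen point; numerical_gradient() and the step h are assumptions introduced for illustration and are not part of the tutorial's implementation.

```python
# minimal sketch: check the analytic derivative against finite differences
from numpy import allclose
from numpy import asarray

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# central finite-difference estimate of the gradient (illustrative helper)
def numerical_gradient(func, x, y, h=1e-6):
    dx = (func(x + h, y) - func(x - h, y)) / (2.0 * h)
    dy = (func(x, y + h) - func(x, y - h)) / (2.0 * h)
    return asarray([dx, dy])

# arbitrary test point inside the bounds
point = (0.3, -0.7)
analytic = derivative(*point)
numeric = numerical_gradient(objective, *point)
print(analytic, numeric)
print('match:', allclose(analytic, numeric, atol=1e-6))
```

If the two gradients disagree by more than a small tolerance, the derivative function is likely wrong.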

Next, we can implement gradient descent optimization.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search, with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of the dimension.

...
# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to initialize the decaying average of the squared partial derivatives for each dimension to 0.0 values.

...
# list of the average square gradients for each variable
sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]

We can then enumerate a fixed number of iterations of the search optimization algorithm specified by a “n_iter” hyperparameter.

...
# run the gradient descent
for it in range(n_iter):
    ...

The first step is to calculate the gradient for the current solution using the derivative() function.

...
# calculate gradient
gradient = derivative(solution[0], solution[1])

We then need to calculate the square of the partial derivative and update the decaying average of the squared partial derivatives with the “rho” hyperparameter.

...
# update the average of the squared partial derivatives
for i in range(gradient.shape[0]):
    # calculate the squared gradient
    sg = gradient[i]**2.0
    # update the moving average of the squared gradient
    sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))

We can then use the moving average of the squared partial derivatives and the gradient to calculate the step size for the next point.

We will do this one variable at a time, first calculating the step size for the variable, then the new value for the variable. These values are built up in an array until we have a completely new solution that is in the steepest descent direction from the current point using the custom step sizes.

...
# build a solution one variable at a time
new_solution = list()
for i in range(solution.shape[0]):
    # calculate the step size for this variable
    alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))
    # calculate the new position in this variable
    value = solution[i] - alpha * gradient[i]
    # store this variable
    new_solution.append(value)

This new solution can then be evaluated using the objective() function and the performance of the search can be reported.

...
# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

And that’s it.

We can tie all of this together into a function named rmsprop() that takes the names of the objective function and the derivative function, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations, the initial learning rate, and rho, and returns the final solution and its evaluation.

This complete function is listed below.

# gradient descent algorithm with rmsprop
def rmsprop(objective, derivative, bounds, n_iter, step_size, rho):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

Note: we have intentionally used lists and an imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.
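As one possible take on that suggestion, the sketch below replaces the per-variable loops with NumPy array operations. The name rmsprop_vectorized() is hypothetical, the progress printing is left out, and the hyperparameter values are passed in as plain arguments; treat it as an illustration rather than a reference implementation.

```python
# minimal sketch: a vectorized variant of the rmsprop() function
from numpy import asarray
from numpy import sqrt
from numpy import zeros
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent with rmsprop using array operations (hypothetical name)
def rmsprop_vectorized(objective, derivative, bounds, n_iter, step_size, rho):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # array of the average square gradients, one entry per variable
    sq_grad_avg = zeros(bounds.shape[0])
    for _ in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the moving average of the squared gradient for all variables
        sq_grad_avg = (sq_grad_avg * rho) + (gradient**2.0 * (1.0 - rho))
        # calculate the custom step size for all variables at once
        alpha = step_size / (1e-8 + sqrt(sq_grad_avg))
        # move every variable against its partial derivative
        solution = solution - alpha * gradient
    return solution, objective(solution[0], solution[1])

# example call with assumed hyperparameter values
seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = rmsprop_vectorized(objective, derivative, bounds, 50, 0.01, 0.99)
print('f(%s) = %f' % (best, score))
```

The arithmetic per variable is the same as in the loop-based version; only the bookkeeping changes, since sq_grad_avg and alpha are now arrays with one entry per dimension.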

We can then define our hyperparameters and call the rmsprop() function to optimize our test objective function.

In this case, we will use 50 iterations of the algorithm, an initial learning rate of 0.01, and a value of 0.99 for the rho hyperparameter, all chosen after a little trial and error.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.01
# momentum for rmsprop
rho = 0.99
# perform the gradient descent search with rmsprop
best, score = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)
print('Done!')
print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent optimization with RMSProp is listed below.

# gradient descent optimization with rmsprop for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with rmsprop
def rmsprop(objective, derivative, bounds, n_iter, step_size, rho):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.01
# momentum for rmsprop
rho = 0.99
# perform the gradient descent search with rmsprop
best, score = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example applies the RMSProp optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near optimal solution was found after perhaps 33 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

...
>30 f([-9.61030898e-14 3.19352553e-03]) = 0.00001
>31 f([-3.42767893e-14 2.71513758e-03]) = 0.00001
>32 f([-1.21143047e-14 2.30636623e-03]) = 0.00001
>33 f([-4.24204875e-15 1.95738936e-03]) = 0.00000
>34 f([-1.47154482e-15 1.65972553e-03]) = 0.00000
>35 f([-5.05629595e-16 1.40605727e-03]) = 0.00000
>36 f([-1.72064649e-16 1.19007691e-03]) = 0.00000
>37 f([-5.79813754e-17 1.00635204e-03]) = 0.00000
>38 f([-1.93445677e-17 8.50208253e-04]) = 0.00000
>39 f([-6.38906842e-18 7.17626999e-04]) = 0.00000
>40 f([-2.08860690e-18 6.05156738e-04]) = 0.00000
>41 f([-6.75689941e-19 5.09835645e-04]) = 0.00000
>42 f([-2.16291217e-19 4.29124484e-04]) = 0.00000
>43 f([-6.84948980e-20 3.60848338e-04]) = 0.00000
>44 f([-2.14551097e-20 3.03146089e-04]) = 0.00000
>45 f([-6.64629576e-21 2.54426642e-04]) = 0.00000
>46 f([-2.03575780e-21 2.13331041e-04]) = 0.00000
>47 f([-6.16437387e-22 1.78699710e-04]) = 0.00000
>48 f([-1.84495110e-22 1.49544152e-04]) = 0.00000
>49 f([-5.45667355e-23 1.25022522e-04]) = 0.00000
Done!
f([-5.45667355e-23 1.25022522e-04]) = 0.000000

Visualization of RMSProp

We can plot the progress of the search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the rmsprop() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with rmsprop
def rmsprop(objective, derivative, bounds, n_iter, step_size, rho):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build solution
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the learning rate for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.01
# decay factor for the rmsprop moving average
rho = 0.99
# perform the gradient descent search with rmsprop
solutions = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)
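With these settings, the size of the very first step can be worked out by hand, which helps when reading the plot. Because sq_grad_avg starts at zero, the first update leaves it at (1 - rho) * g^2, so the first step has a magnitude of roughly step_size / sqrt(1 - rho) = 0.01 / 0.1 = 0.1 in each variable, whatever the size of the gradient (ignoring the small 1e-8 term). The sketch below checks this arithmetic; first_step() is a hypothetical helper introduced only for illustration and is not part of the tutorial code.

```python
# sketch: size of the first rmsprop step for a single variable
from math import sqrt

def first_step(gradient, step_size, rho):
    # hypothetical helper: the moving average starts at zero,
    # so only the newly added term remains after the first update
    sq_grad_avg = (0.0 * rho) + (gradient**2.0 * (1.0 - rho))
    # per-variable learning rate, computed as in rmsprop()
    alpha = step_size / (1e-8 + sqrt(sq_grad_avg))
    # change applied to the variable on the first iteration
    return -alpha * gradient

# gradients of very different sizes give first steps of similar size
for g in [0.05, 0.5, -2.0]:
    print(g, first_step(g, step_size=0.01, rho=0.99))
```

Later steps differ because the moving average then mixes the current squared gradient with earlier ones.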

We can then create a contour plot of the objective function, as before.

...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

...
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the RMSProp optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the rmsprop search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with rmsprop
def rmsprop(objective, derivative, bounds, n_iter, step_size, rho):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build solution
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the learning rate for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.01
# decay factor for the rmsprop moving average
rho = 0.99
# perform the gradient descent search with rmsprop
solutions = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

Running the example performs the search as before, except in this case, the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.
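The same trend can be checked numerically rather than by eye. The sketch below is a minimal illustration, not the tutorial's listing: rmsprop_path() is a hypothetical compact variant of rmsprop() that uses NumPy array operations in place of the per-variable loops and, as an assumption made for this example, also stores the initial point. It then measures the Euclidean distance of each stored point from the optimum at [0, 0].

```python
# sketch: measure how far each stored solution is from the optimum at [0, 0]
from numpy import asarray, sqrt, zeros
from numpy.random import rand, seed

# derivative of the test objective f(x, y) = x^2 + y^2
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# hypothetical compact variant of rmsprop() that returns the whole path
def rmsprop_path(bounds, n_iter, step_size, rho):
    # generate an initial point inside the bounds
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # moving average of the squared gradient, one entry per variable
    sq_grad_avg = zeros(bounds.shape[0])
    # assumption for this sketch: keep the starting point as well
    path = [solution]
    for _ in range(n_iter):
        gradient = derivative(solution[0], solution[1])
        # update the moving average for all variables at once
        sq_grad_avg = (sq_grad_avg * rho) + (gradient**2.0 * (1.0 - rho))
        # per-variable learning rate
        alpha = step_size / (1e-8 + sqrt(sq_grad_avg))
        solution = solution - alpha * gradient
        path.append(solution)
    return asarray(path)

seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
path = rmsprop_path(bounds, 50, 0.01, 0.99)
# Euclidean distance of every point on the path from the optimum
distances = sqrt((path**2.0).sum(axis=1))
print('start distance: %.5f' % distances[0])
print('final distance: %.5f' % distances[-1])
```

A final distance much smaller than the starting distance corresponds to the line of dots moving toward the center of the contour plot.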

Contour Plot of the Test Objective Function With RMSProp Search Results Shown

Summary

In this tutorial, you discovered how to develop the gradient descent with RMSProp optimization algorithm from scratch.

Specifically, you learned:

Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called RMSProp.
How to implement the RMSProp optimization algorithm from scratch and apply it to an objective function and evaluate the results.
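To make the "decaying average" point concrete, the sketch below applies the same moving-average update used in rmsprop() to a sequence with one large squared gradient followed by zeros. The helper decaying_average() is a hypothetical name used only for this illustration. After the first update, the contribution of that early gradient shrinks by a factor of rho on every iteration, which is how the algorithm gradually forgets early gradients.

```python
# sketch: how a decaying average forgets an early squared gradient

def decaying_average(values, rho):
    # hypothetical helper: apply the moving-average update from rmsprop()
    # to each value in turn and record the average after every update
    avg = 0.0
    history = []
    for value in values:
        avg = (avg * rho) + (value * (1.0 - rho))
        history.append(avg)
    return history

# one large squared gradient followed by 200 zeros
values = [1.0] + [0.0] * 200
history = decaying_average(values, rho=0.99)
# the average right after the large gradient, then 100 and 200 updates later
print(history[0], history[100], history[200])
```

A smaller rho would make the average forget faster, at the cost of a noisier estimate of the recent squared gradient.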

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


Problems are not stop signs, they are guidelines
