This article was published as a part of the Data Science Blogathon.
Gradient descent is one of the most widely used optimization algorithms in machine learning. And yet it confounds a lot of newcomers.
I get that! The math behind gradient descent isn’t easy if you’re just starting out. My aim in this article is to help you build an intuition for gradient descent.
We will quickly cover the role of a cost function, how gradient descent works, how to choose the learning rate, and the effect of overshooting. Let’s begin!
What is a Cost Function?
A cost function measures the performance of a model on given data. It quantifies the error between predicted values and expected values and presents that error as a single real number.
After making a hypothesis with initial parameters, we calculate the cost function. Then, with the goal of reducing the cost function, we modify the parameters using the gradient descent algorithm over the given data.
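As a concrete instance, the mean squared error is a common choice of cost function for regression. A minimal sketch in Python, assuming a simple linear hypothesis h(x) = theta0 + theta1·x (the hypothesis and the toy data are illustrative, not from the article):

```python
import numpy as np

def cost_function(theta0, theta1, x, y):
    """Mean squared error cost for the linear hypothesis h(x) = theta0 + theta1 * x."""
    m = len(y)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cost_function(0.0, 2.0, x, y))  # a perfect fit gives a cost of 0.0
```

Any set of parameters is summarized by this single number; gradient descent will try to drive it down.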
What is Gradient Descent?
The million-dollar question!
Let’s say you are playing a game where the players are at the top of a mountain, and they are asked to reach the lake at the lowest point of the mountain. Additionally, they are blindfolded. So, what approach do you think would make you reach the lake?
Take a moment to think about this before you read on.
The best way is to feel the ground and find where the land descends. From that position, take a step in the descending direction and repeat this process until you reach the lowest point.
Gradient descent is an iterative optimization algorithm for finding the local minimum of a function.
To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient of the function at the current point (moving against the gradient). If we instead take steps proportional to the positive of the gradient (moving along the gradient), we approach a local maximum of the function; that procedure is called gradient ascent.
Gradient descent was originally proposed by Cauchy in 1847. It is also known as steepest descent.
The goal of the gradient descent algorithm is to minimize the given function (say cost function). To achieve this goal, it performs two steps iteratively:
- Compute the gradient (slope), the first-order derivative of the function at the current point
- Make a step (move) in the direction opposite to the gradient, i.e. opposite the direction of increasing slope, by alpha times the gradient at that point
Alpha is called Learning rate – a tuning parameter in the optimization process. It decides the length of the steps.
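The two steps above amount to the update rule θ := θ − α·∇J(θ). A minimal one-dimensional sketch (the function f(x) = x² and its derivative are illustrative assumptions):

```python
def gradient_step(x, grad, alpha):
    """One gradient descent step: move opposite the gradient, scaled by alpha."""
    return x - alpha * grad(x)

f_prime = lambda x: 2 * x  # first-order derivative of f(x) = x**2
x = 10.0
for _ in range(5):
    x = gradient_step(x, f_prime, alpha=0.1)
print(x)  # steadily approaches the minimum at x = 0
```

Each iteration computes the gradient at the current point, then steps against it; alpha scales the length of that step.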
Plotting the Gradient Descent Algorithm
When we have a single parameter (theta), we can plot the dependent variable cost on the y-axis and theta on the x-axis. If there are two parameters, we can go with a 3-D plot, with cost on one axis and the two parameters (thetas) along the other two axes.
It can also be visualized using contours. A contour plot shows a 3-D surface in two dimensions, with the parameters along both axes and the response drawn as contour rings. The response has the same value everywhere along a ring and increases with distance from the center (along any given direction).
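Producing a contour view boils down to evaluating the cost on a grid of parameter values. A minimal sketch with NumPy (the bowl-shaped cost J = θ₀² + θ₁² is an illustrative assumption; `matplotlib.pyplot.contour(T0, T1, J)` would render the rings):

```python
import numpy as np

theta0 = np.linspace(-5, 5, 100)
theta1 = np.linspace(-5, 5, 100)
T0, T1 = np.meshgrid(theta0, theta1)  # all (theta0, theta1) pairs on the grid
# Illustrative bowl-shaped cost: equal-cost rings around the minimum at (0, 0)
J = T0 ** 2 + T1 ** 2
print(J.shape)  # (100, 100)
```

Each contour ring connects parameter pairs with the same cost, which is exactly the picture described above.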
Alpha – The Learning Rate
We have the direction we want to move in; now we must decide the size of the step to take.
The learning rate must be chosen carefully to end up at the local minimum.
- If the learning rate is too high, we might OVERSHOOT the minimum and keep bouncing, without ever reaching it
- If the learning rate is too small, training might take too long
- a) Learning rate is optimal: the model converges to the minimum
- b) Learning rate is too small: it takes more time, but still converges to the minimum
- c) Learning rate is higher than the optimal value: it overshoots but converges (1/C < η < 2/C)
- d) Learning rate is very large: it overshoots and diverges, moving away from the minimum, and performance degrades as learning proceeds
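These regimes are easy to reproduce numerically. A short sketch on f(x) = x², whose gradient is 2x (the function, starting point, and learning rates are illustrative assumptions):

```python
def descend(alpha, x=10.0, iters=20):
    """Run gradient descent on f(x) = x**2 and return the final distance from the minimum."""
    for _ in range(iters):
        x -= alpha * 2 * x  # step opposite the gradient 2x
    return abs(x)

print(descend(0.1))  # small learning rate: converges toward the minimum at 0
print(descend(1.1))  # too-large learning rate: overshoots and diverges
```

With alpha = 0.1 each step multiplies x by 0.8, so the iterate shrinks toward 0; with alpha = 1.1 each step multiplies x by −1.2, so it bounces across the minimum with growing amplitude, which is exactly case d) above.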
Note: As we move towards the local minimum, the gradient decreases, so the size of the step decreases as well. The learning rate (alpha) can therefore be kept constant over the optimization and need not be varied iteratively.
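The note above can be checked numerically: with alpha held fixed, the step length |α·f′(x)| shrinks on its own as the gradient shrinks near the minimum. A short sketch, again using the illustrative f(x) = x²:

```python
alpha = 0.1
f_prime = lambda x: 2 * x  # gradient of f(x) = x**2
x = 10.0
steps = []
for _ in range(5):
    step = alpha * f_prime(x)  # step length = alpha times the current gradient
    steps.append(step)
    x -= step
print(steps)  # each step is smaller than the last, with alpha held constant
```

The first step is 2.0, the next 1.6, then 1.28, and so on: the constant learning rate produces naturally shrinking steps.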
The cost function may have many local minima. Gradient descent may settle in any one of them, depending on the initial point (i.e. the initial parameters theta) and the learning rate. Therefore, the optimization may converge to different points for different starting points and learning rates.
Code Implementation of Gradient Descent in Python
Once we tune the learning rate (alpha) and settle on a good value, we iterate until we converge to the local minimum.
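A minimal end-to-end sketch, assuming the mean squared error cost and the linear hypothesis h(x) = theta0 + theta1·x (the toy data, learning rate, and iteration count below are illustrative choices, not prescribed by the article):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, iters=2000):
    """Fit h(x) = theta0 + theta1 * x by gradient descent on the MSE cost."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0  # initial parameters
    for _ in range(iters):
        predictions = theta0 + theta1 * x
        error = predictions - y
        # Gradients of J = (1/2m) * sum(error**2) with respect to each parameter
        grad0 = np.sum(error) / m
        grad1 = np.sum(error * x) / m
        # Step opposite the gradient, scaled by the learning rate
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0  # true relationship: theta0 = 1, theta1 = 2
theta0, theta1 = gradient_descent(x, y)
print(theta0, theta1)  # approaches the true parameters (1, 2)
```

Each iteration performs exactly the two steps described earlier: compute the gradient of the cost at the current parameters, then move both parameters a small step in the opposite direction.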