Skip to content

Gradient Descent Algorithm in Machine Learning

The gradient Descent algorithm is a machine-learning algorithm that is used to train machine learning and neural network algorithms. It is one of the best available algorithms to minimize the cost function for solving machine-learning problems.

The application of the gradient descent is to find optimum values of Ө0 and Ө1 that minimize the cost function. Gradient descent minimizes the cost function by tweaking the parameters iteratively.

What is Gradient Descent Algorithm

The gradient descent algorithm works by tweaking the parameters iteratively to minimize the cost function. Parameters that minimize the cost function are selected as final parameters.

This image shows Gradient Descent Algorithm Approach

We can understand the Gradient-Decent approach using the example of a traveler on the top of a mountain. He is given the challenge to come down the mountain blindfolded.

The best strategy is to take the correct step in the correct direction where the slope is maximum. This is the only way we can reach to bottom in the shortest possible steps. The step size is critical in this approach. The gradient Descent algorithm follows a similar approach to reach the global minima where the error or cost is minimal.

Inputs for Gradient Descent Algorithm

We need the following inputs to implement gradient descent to determine the unknown parameters that minimize the cost function.

  1. Learning Rate or Step Size
  2. Random initialization parameters or weights.
  3. Maximum Number of Iterations

1. Impact of Leaning Rate on Gradient Decent

The learning rate hyper-parameter determines the step size in gradient descent algorithm implementation.

For a smaller learning rate, the algorithm will require a comparatively large number of iterations to reach global minima and converge the solution. This will increase the computational requirements and time to find the solution. For example, we need more time to come to the bottom of the hill when we take small steps.

The ML algorithm needs a few iterations to reach the global minimum for a high learning rate. For example, we need less time to come to the bottom of a hill when we take large steps. But at the same time, we might skip the global minima and reach the other side of the valley.

2. Impact of Random Initialization of weights

The cost function is not always a regular bowl. It may have local minima and plateaus. These irregularities make it difficult to converge the cost function.

In gradient descent algorithm, we aim to increase the values of Ө0 and Ө1 in a direction where the squared error reduces. Apart from step size from where we start (initial values for Ө0 and Ө1) play a critical role in selecting the right parameters.

This image shows the Impact of step initialization on gradient descent
Impact of random initialization of weights

As shown in the above image, The solution will converge to local minima if the initialization starts from point A and we go towards the left side.

Whereas, the solution will converge to global minima if initialization starts from the start point and we move towards the right, provided:

  • Right step size is selected.
  • Solution is not stopped too early.

3. Impact of number of Iteration

We can resolve the impact of initialization (starting from left or right) just by selecting enough number of steps. The following two points ensure gradient descent algorithm reaches a global minimum provided the number of iterations is enough and the Learning rate is not large.

  • For convex cost function, if we pick any two points on the curve, the line joining them never crosses the curve. As a result there are no local minima, just one global minimum.
  • The cost function should be a continuous function with a slope that never changes abruptly.

Why do we need Features on the same scale for Gradient Decent

The Cost function shape is a bowl if the features are on the same scale. Whereas, it becomes an elongated bowl if the features have very different scales. With an elongated bowl, we can still get the local minima but it will take relatively more number of iterations.

Therefore it is always recommended to keep features on the same scale, otherwise, it will take longer to converge.

Steps in Gradient Descent to minimize cost function

This image shows the steps in gradient descent to minimize the cost function.

Types of Gradient Descent Algorithms

We can further classify gradient descent algorithm into the following three types:

  1. Batch Gradient Descent Algorithm
  2. Stochastic Gradient Descent Algorithm
  3. Mini Batch Gradient Descent Algorithm

Batch Gradient Descent Algorithm

We can calculate the batch gradient descent by calculating the GD of the cost function for each parameter (Ө0 and Ө1). To calculate the gradient descent, we need to the derivative for the cost function against each parameter.

Batch gradient descent computes the gradient of the cost function over the complete batch of the training set.

Formula for Cost function for Batch Gradient Descent Algorithm

This image shows the formula for cost function for mean squared error
Formula for cost function for mean squared error

The above formula calculates the cost function over a full training set at each gradient decent step.

Formula to calculate gradient for Cost function

This image shows the calculation Formula for derivative for cost function

Formula to update weights as per learning rate

This image shows formula to Update parameters or weights with learning rate

Implementation of Batch Gradient Descent Algorithm in Python

We can use following code to calculate the best values for weights or parameters for linear regression analysis in python.

# Defining Required Inputs for Batch Gradient Descent Algorithm

eta = 0.1  # Defining the Learning Rate
n_iterations = 1000 # Defining Maximum numbner of Iterations
m=100 # Number of data points

theta = np.random.randn (2,1) # Random Initialization of theta
print("Random initialized value for theta are:", theta)
# Running Iterations for Batch Gradient Descent
for iterations in range(n_iterations):
    gradients = (2 / m)*(X_b.T).dot(X_b.dot(theta)-y)
    theta = theta - eta * gradients
print ("The best value for theta is:", theta)

Impact of Learning Rate on Batch Gradient decent Model

If the learning rate is small, it requires a lot of iterations to converge. Whereas, when the learning rate is large, It may diverge from the solution. Therefore learning rate is an important parameter that we need to optimize.

How to define the Number of Iterations for batch Gradient Descent Model

We will not get an optimum solution if you set a relatively small number of iterations. Whereas the solution time will increase drastically if we set a high number of iterations. In this case, the best approach is to:

  • Set a large number of iterations.
  • Interrupt the algorithm if the gradient vector becomes small.

In batch gradient descent, we take a single step or update the parameter after considering all training data. As a result, Batch GD is very slow for large training sets. In other words, Number of epochs = Number of Steps

To update the gradient descent, we take the average of the gradients of all training examples and then use a mean gradient to update the parameters.

Limitations of Batch Gradient Descent Algorithms

In batch gradient descent, the gradient is computed over the whole training dataset at each step to update the parameters. For example, if we have a million instances in the training set, the model needs to calculate one million gradients for each step. As a result, the model becomes very slow.

Stochastic Gradient Descent Algorithm

Instead of using all training data to calculate the gradient, Stochastic Gradient Descent uses random instances in the training data at every step. As a result, the algorithm becomes fast since it has to compute small data at every iteration.

  • Reduction in the cost function is not smooth in Stochastic Gradient Decent. The cost function may increase or decrease with every iteration. But on average it always reduces with multiple iterations.
  • The irregular nature of the cost function helps the algorithm skip the local minima.

Limitations or Stochastic Gradient Decent

The possibility is the SGD algorithm is not set to minima because of cost function irregular behavior. We can solve this problem by keeping the learning rate large initially and reducing it gradually as the algorithm is about to reach the solution.

  • The solution can stuck in a local minimum if the learning rate is reduced quickly.
  • A solution may jump around the minimum if the learning rate is reduced slowly.

Steps in Stochastic Gradient Decent

Stochastic Gradient Descent algorithm works in the following steps:

  1. Pick a random instance in the training set at every step.
  2. Computes the gradients based only on that single instance.
  3. Update the weights.
This image shows the Steps in stochastic gradient descent algorithm.

Implementation of Stochastic Gradient Decent in Python

eta = 0.1  # Defining the Learning Rate
n_iterations = 1000 # Defining Maximum numbner of Iterations
m=100 # Number of data points

n_epochs = 50 
# n_epochs: Number of times a learning algorithm passes through the training data
t0, t1 = 5, 50 # Learning schedule hyperparameters

# Define a function for learning schedule

def learning_schedule(t):
    return t0 / (t + t1)
theta_sgd = np.random.randn(2 , 1) # Randomization of theta
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m) #Generate a random number from 0 to (m-1)
        xi = X_b[random_index : random_index+1]
        yi= y[random_index : random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta_sgd) - yi) #Calculation of gradiants
        eta = learning_schedule(epoch * m + i) #learning Rate will change wrt epoch number and sample number
        theta_sgd = theta_sgd-eta*gradients
print ("The best value for theta is:", theta_sgd)
  • The calculated value of unknown parameters is similar to what we calculated using batch gradient descent algorithm in relatively smaller iterations.
  • In the previous case Batch Gradient Descent code iterated 1,000 times through the whole training set. Whereas, stochastic gradient descent goes through the training set only 50 times (n_epchos / m) and reaches a fairly good solution.

Implementation of SGD in Python using Scikit-Learn SGDRegressor class

We can also use Scikit-Learn SGDRegressor class to do stochastic gradient decent.

from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=0.05, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
print ("The best value for theta is:", sgd_reg.intercept_, sgd_reg.coef_)
  • The above code runs for a maximum of 1000 epochs (max_iter=1000) or until the loss drops by less than 0.05 during one epoch.
  • The learning rate will start from 0.1 with the default learning schedule.
  • It does not use any regularization.

Mini-batch Gradient Descent Algorithm

Mini-batch Gradient Descent takes advantage of both batch GD and Stochastic GD. It computes the gradients on small random sets of instances called mini-batches.

  • Reduction in the cost function is comparatively less irregular in mini-batch gradient descent compared to stochastic gradient descent.
  • If the batch size is large, it becomes difficult to skip the local minima.

Steps in mini-batch Gradient Descent

Implementation of Stochastic Gradient Decent in Python

theta_path_mgd = []

n_iterations = 100 # Define Number of itterations
minibatch_size = 20 # Define minibatch size

np.random.seed(42)

theta_mini = np.random.randn(2, 1)  # Random initialization of theta

t0, t1 = 200, 1000  # Corrected variable names
def learning_schedule(t):
    return t0 / (t + t1)

t = 0

for epoch in range(n_iterations):
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_b_shuffled[i: i + minibatch_size]
        yi = y_shuffled[i: i + minibatch_size]
        gradients = 2 / minibatch_size * xi.T.dot(xi.dot(theta_mini) - yi)  # Corrected variable name
        eta = learning_schedule(t)
        theta_mini = theta_mini - eta * gradients
        theta_path_mgd.append(theta_mini.copy())  # Append a copy of theta_mini to the path

print("The best value for theta with minibatch gradient descent is:", theta_mini)

The best value for theta with minibatch gradient descent is: [[6.20309696] [1.67912157]]

Batch vs Stochastic vs Mini-Batch Gradient Descent

Parameters Batch GD Stochastic GD Mini-Batch GD
Unknown Parameters Update Entire Training Data One Training Instance Batch of Training Instance
Update in Cost Function Smooth: Always Decreases Less Smooth: May increase or decrease Irregular: May increase or decrease
Speed Slow High In-between Batch and SGD
Local Minima Difficult to skip local minima. Irregular cost function helps the algorithm skip the local minima. Difficult to skip the local minima for large dataset.
Global Minima Reach global minima if learning rate is small and enough number of itterations are selected. Does not guarantee convergence to the global minimum

Frequently Asked Questions on Gradient Descent

For a smaller learning rate, the algorithm will require a comparatively large number of iterations to reach global minima and converge the solution. whereas, with large learning rate you may skip the global minima and end up hoping around the global minima.

Best approach is to use random initialization of weights.

  1. Select enough number of iterations.
  2. The cost function is convex.
  3. The cost function should be a continuous function with a slope that never changes abruptly.

In batch gradient descent, the gradient is calculated using the entire training dataset. Whereas, in stochastic gradient descent, the gradient is calculated using one instance of training set.

We may end up at a sub-optimal solution and does not reach the global minimum.

Multiple Choice Questions on Gradient Descent

Welcome to your Quiz Interview Questions and answers on Gradient Descent

Select all the correct statement from the following.

Which of following are input to gradient descent or hyper-parameter for GD?

It is always a great idea to set smallest learning rate for gradient descent.

Select all the correct statement from the following for gradient descent.

Is random values of parameters or weight has any impact on solution using gradient descent

Which of the following statement is true for batch gradient descent?

Select all the correct statement from the following for stochastic gradient descent.

Select all the correct statement from the following for mini batch gradient descent.

Leave a Reply

Your email address will not be published. Required fields are marked *