
Linear Regression Algorithm in Machine Learning

A good practice before implementing any ML algorithm is to understand how it works mathematically; this helps in selecting the right hyperparameters for your requirements. The linear regression algorithm is one of the simplest machine learning algorithms, in which the dependent and independent variables are linearly related.

Regression is a statistical technique for establishing a relationship between a dependent variable (y) and one or more independent variables (X). This relationship can be linear, parabolic, or something else in nature. Linear regression is one of these regression techniques.

What is Linear Regression Algorithm in ML?

Linear Regression is a statistical technique to determine a linear relationship between a dependent variable and one or more independent variables.

The ultimate goal of running a linear regression analysis is to determine the parameters of the straight line that fits the data with minimum error.

Let's consider a dataset where X is the independent variable and y is the dependent variable. This is a supervised machine learning problem because the output (dependent variable) is known.

Figure: Linear regression fit on sample data.

The first step in solving an ML problem is to visualize the data. When we plot a graph between the independent (X) and dependent (y) variables and find that the relationship is linear, we can be confident in solving the problem with a simple linear regression algorithm.

In the simple linear regression algorithm, we find the best-fit line that minimizes the error (e.g., the sum of squared errors or the root mean squared error) over all training data. Afterward, we can predict unknown y values for given values of X from the best-fit line.

Figure: Types of linear regression algorithms.

We can classify linear regression models in machine learning into the following two types according to the number of independent variables.

  1. Simple Linear Regression
  2. Multiple Linear Regression

Simple Linear Regression Algorithm in Machine Learning

A Simple Linear Regression Model is a type of linear regression with one independent variable. This model is quite simple to visualize and solve with minimum complexity.

Equation

The basic equation of a simple linear regression model with one independent and one dependent variable can be represented in the following ways.

y = m X + C

y = b0 + b1 X

hθ(x) = θ0 + θ1 X

Where:
y = Dependent variable;
X = Independent variable;
b0 and b1 (or θ0 and θ1) are the linear regression model parameters or weights.

Multiple Linear Regression Algorithm

A Multiple Linear Regression Algorithm is a type of linear regression with multiple independent variables and one dependent variable. It becomes very difficult to visualize a multiple linear regression model as the number of variables increases.

Equation

The basic equation of the multiple linear regression algorithm with multiple independent variables and one dependent variable is given by:

y = θ0 + θ1 X1 + θ2 X2 + ... + θn Xn

Where:
y = dependent variable;
X1, X2,..., Xn are independent variables;
θ0, θ1, θ2, ..., θn are the (n + 1) multiple linear regression parameters;

Here the goal of the multiple linear regression algorithm is to calculate the parameters that minimize the squared error. A linear regression model makes a prediction by computing a weighted sum of the input features plus a constant (the bias term).

Vector form of Linear Regression Model

We can represent the linear regression model in the following compact vector form:

y = X · θ

Where:
y = A column vector of the dependent variable.
X = The matrix of input features (independent variables). The values in the first column of this matrix are ones, so that θ0 acts as the intercept.
θ = A column vector of the coefficients.
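As a minimal NumPy sketch of this vector form (the data below is purely illustrative), every prediction comes from a single matrix-vector product:

import numpy as np

# Illustrative data: the first column of X is all ones, so theta[0] acts as the intercept.
X = np.array([[1.0, 0.5],
              [1.0, 1.0],
              [1.0, 1.5]])
theta = np.array([[6.0], [2.0]])  # column vector [intercept, slope]

y_hat = X.dot(theta)  # one matrix-vector product predicts all instances at once
print(y_hat)          # [[7.], [8.], [9.]]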

Common Challenges in Linear Regression Model Development

It is comparatively easy to develop a linear regression algorithm. However, we face many challenges in ensuring its quality. Here is a list of common challenges during linear regression model development.

  1. Outliers in the data distort the linear regression parameters. As a result, the LR model does not perform as expected on training and test data.
  2. Overfitting occurs when the ML model performs well on training data but fails to generalize to unseen test data.
  3. Underfitting is a common problem in LR models. It occurs when the algorithm performs poorly on both training and test data.
  4. Any violation of the linear regression assumptions (listed below) leads to inefficient weight estimation.
  5. Correlation among the independent variables (multicollinearity) leads to unreliable parameter estimates, because linear regression assumes the effects of the independent variables are additive.
  6. An inappropriate choice of hyperparameters, such as the learning rate in gradient descent, may prevent the optimization from converging.

How to implement Linear Regression ML Algorithm?

We can use the following two approaches to solve linear regression problems in ML.

  1. Direct Approach: Directly compute the model parameters that minimize the cost function on training data.
  2. Gradient Descent: An iterative optimization approach that minimizes the cost function on training data.

Both of the above approaches to minimizing the cost function and selecting the best parameters have their advantages and limitations.

Here is the list of prior conditions or assumptions of the linear regression machine learning model. We need to ensure these assumptions are fulfilled before implementing linear regression.

  1. A linear relationship between the independent and dependent variables.
  2. Multi-collinearity does not exist in the data (a quick check is sketched after this list).
  3. No Auto-correlation exists.
  4. Homoscedasticity.
  5. Normally Distributed Prediction Error.
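As a quick, hedged sketch of checking the multicollinearity assumption (item 2), we can inspect pairwise feature correlations with pandas; the DataFrame and column names here are illustrative, not from the original dataset:

import numpy as np
import pandas as pd

# Illustrative feature data; replace with your own features.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.random(100), "x2": rng.random(100)})
df["x3"] = df["x1"] * 2 + rng.normal(0, 0.01, 100)  # deliberately collinear with x1

# Pairwise correlations close to +/-1 signal multicollinearity.
print(df.corr())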

Cost Function for Linear Regression Algorithm

Our goal during linear regression model development is to find the parameters that minimize the root mean squared error (RMSE) or mean squared error (MSE). It is computationally easier to minimize MSE than RMSE, and both yield the same optimal parameters.

The cost function, also known as the loss function, is the mean squared error for the linear regression algorithm. In other words, the cost function is the error that we want to minimize on the training data.

The cost function for the linear regression model, where m is the number of training examples:

J(θ) = (1 / 2m) * Σ_{i=1..m} (hθ(x(i)) − y(i))²
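As an illustration, here is a minimal Python sketch of this cost function; compute_cost is a hypothetical helper written for this article, not a library function:

import numpy as np

def compute_cost(X_b, y, theta):
    # J(theta) = (1 / (2m)) * sum((X_b.theta - y)^2), where m is the number of examples
    m = len(y)
    errors = X_b.dot(theta) - y
    return (1 / (2 * m)) * np.sum(errors ** 2)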

Why is the cost function divided by "2m"?

The cost function is divided by 2m instead of m simply to simplify the calculations: when we take the derivative to minimize the cost function, the factor of 2 from the squared term cancels.
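Concretely, differentiating the squared term brings down a factor of 2 that cancels the 2 in the denominator, leaving a clean gradient:

∂J/∂θj = (1/m) * Σ_{i=1..m} (hθ(x(i)) − y(i)) * xj(i)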

Direct Approach to Solve Linear Regression Problems

We can use one of the following two approaches to calculate the unknown coefficients.

  1. Normal Equation Approach
  2. Singular Value Decomposition (SVD) Approach

Normal Equation Approach to determine unknown coefficients

θ = (XᵀX)⁻¹ Xᵀ y

Here θ is the vector of unknown coefficients that minimizes the cost function. Now we will use Python code to determine the unknown coefficients using the normal equation.

Python Code to solve Normal Equation for Linear Regression Model Development

The normal equation approach inverts an (n + 1) × (n + 1) matrix (where n is the number of features) to solve the linear regression problem. If we double the number of features, the computation time increases drastically.

# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

We will use the NumPy library to generate the training data. Afterward, we will plot this data using the Matplotlib library.

# Generate dependent and independent values using the equation y = 2*X + 6 + error (y = m*X + c + error)
X = 1.5 * np.random.rand(100, 1)  # 100 values drawn from a uniform distribution on [0, 1.5)
error = np.random.randn(100, 1)   # Gaussian noise with mean 0 and standard deviation 1
y = 2 * X + 6 + error
# Plot the X and y data
plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 1.5, 2, 16])  # limit the axis ranges

Just by looking at the above plot, we can see that a straight line can fit the input data. Therefore, we can use simple linear regression here.

The next step in solving the normal equation is to add a column of ones to the independent data (the bias term x0 = 1).

X_b = np.c_[np.ones((100, 1)), X]

# np.ones((100, 1)) creates a 100 x 1 matrix with all elements = 1
# np.c_ joins two matrices column-wise
# np.c_[100x1_matrix, 100x1_matrix]: the output is a 100 x 2 matrix whose first column is all ones

print(X_b.shape)

# Implementing the normal equation to calculate the coefficients
coeff_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# X_b.T computes the transpose of the matrix
# (X_b.T).dot(X_b) computes the matrix product of X_b.T and X_b
# np.linalg.inv(matrix) computes the inverse of the matrix
print("Best Coefficient values are:", coeff_best)

Best Coefficient values are: [[6.22343373] [1.69388673]]

The calculated coeff_0 and coeff_1 are 6.223 and 1.694, respectively. In the generating equation y = 2*X + 6 + error, we set these to 6 and 2, so the calculated values are close to the actual values.

# Predicting y for new values of X
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]  # Add x0 = 1 to each instance
y_predict = X_new_b.dot(coeff_best)      # Dot product of the augmented inputs with the coefficients
print("Predicted values for Y for x_new are:", y_predict)

Predicted values for Y for x_new are: [[6.22343373] [9.61120718]]

# Plotting the best-fit line
plt.plot(X_new, y_predict, "r-")  # Plot the fitted line through the two predicted points
plt.plot(X, y, "b.")              # Plot the original training data points
plt.axis([0, 1.5, 3.5, 10])

Singular Value Decomposition (SVD) Approach for Linear Regression

In the SVD approach, we calculate linear regression model coefficients using Scikit-Learn’s LinearRegression class. 

  • Scikit-Learn's LinearRegression relies on SciPy, which computes the pseudo-inverse using a standard matrix factorization technique known as Singular Value Decomposition (SVD). SVD decomposes the training set matrix X into the matrix product of three matrices.
  • The normal equation does not work if XᵀX is singular (non-invertible). In such cases, the pseudo-inverse still produces a solution.
  • If we double the number of features, the computation time roughly quadruples with SVD.
SVD Implementation example in Python for Linear Regression Model Development
# Import the sklearn LinearRegression class
from sklearn.linear_model import LinearRegression
# Create a LinearRegression instance
lin_reg = LinearRegression()
# Train the linear regression model on the previous data.
# There is no need to add a column of ones: LinearRegression handles the intercept itself.
lin_reg.fit(X, y)
# Get the coefficient values from the trained linear regression model
print("Best Coefficient values are:", lin_reg.intercept_, "and", lin_reg.coef_)

Best Coefficient values are: [6.22343373] and [[1.69388673]]

The output values are similar to the values we calculated with the normal equation method.
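For reference, the same coefficients can be computed directly with NumPy's SVD-based pseudo-inverse; a one-line sketch reusing the X_b and y arrays from earlier:

# np.linalg.pinv computes the Moore-Penrose pseudo-inverse via SVD
theta_svd = np.linalg.pinv(X_b).dot(y)
print("Coefficients from the pseudo-inverse:", theta_svd)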

Limitations of Solving Linear Regression Problem using Direct Method

  1. The normal equation approach to finding the unknown parameters is computationally expensive. If we double the number of features, the computation requirements increase roughly 8 times.
  2. The SVD approach is comparatively less expensive than the normal equation approach. If we double the number of features, the computation requirements increase roughly 4 times.
  3. Both the normal equation and SVD approaches are best used when the number of features is relatively small.

We can also use a gradient descent approach to determine unknown parameters. It is an alternative approach to the direct method and comparatively less computationally expensive. We can use this approach if the number of features is relatively large.

Figure: The gradient descent algorithm approach.

To implement gradient descent effectively, all features should be on a similar scale. The applications of gradient descent are not limited to the linear regression algorithm.

Inputs for Gradient Descent

We need the following inputs to implement gradient descent to determine the unknown parameters that minimize the cost function.

  1. Random initialization values for parameters (theta).
  2. Learning Rate
  3. Maximum Number of Iterations

How the MSE Cost Function Avoids the Irregular Cost Surface Problem

The following two points ensure gradient descent reaches the global minimum, provided the number of iterations is sufficient and the learning rate is not too large.

  • The MSE cost function for a Linear Regression is a convex function. It means that if we pick any two points on the curve, the line joining them never crosses the curve. As a result, there are no local minima, just one global minimum.
  • It is also a continuous function with a slope that never changes abruptly.

Implementation of Batch Gradient Descent in Python

Define the required inputs for batch gradient descent:

eta = 0.1            # Learning rate
n_iterations = 1000  # Maximum number of iterations
m = 100              # Number of data points

theta = np.random.randn(2, 1)  # Random initialization of theta
print("Random initialized values for theta are:", theta)

Random initialized value for theta are: [[ 0.10850119] [-0.56628056]]

Your randomly initialized values will differ from run to run, since no random seed is set.

Running the code for batch gradient descent:

for iterations in range(n_iterations):
    gradients = (2 / m) * (X_b.T).dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
print("The best value for theta is:", theta)

The best value for theta is: [[6.22343373] [1.69388673]]

  • The above code updates theta for the defined number of iterations.
  • The calculated theta values match what we observed earlier with SVD and the normal equation.

Stochastic Gradient Descent Algorithm

Instead of using all the training data to calculate the gradient, stochastic gradient descent uses a single random instance from the training set at every step. As a result, the algorithm becomes fast, since it computes on very little data at each iteration.

Stochastic Gradient Descent works in the following steps:

  1. Pick a random instance from the training set.
  2. Compute the gradient based only on that single instance.
  3. Update the weights.
Implementation of Stochastic Gradient Descent with a simple learning schedule in Python

m = 100  # Number of data points (as before)

n_epochs = 50 
# n_epochs: Number of times a learning algorithm passes through the training data
t0, t1 = 5, 50 # Learning schedule hyperparameters

# Define a function for learning schedule

def learning_schedule(t):
    return t0 / (t + t1)
theta_sgd = np.random.randn(2, 1)  # Random initialization of theta
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)  # Random integer from 0 to m-1
        xi = X_b[random_index : random_index + 1]
        yi = y[random_index : random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta_sgd) - yi)  # Gradient from a single instance
        eta = learning_schedule(epoch * m + i)  # Learning rate decays with the step count
        theta_sgd = theta_sgd - eta * gradients
print("The best value for theta is:", theta_sgd)

The best value for theta is: [[6.22911403] [1.69601783]]

Implementation of SGD in Python using Scikit-Learn SGDRegressor class

We can also use the Scikit-Learn SGDRegressor class to perform stochastic gradient descent.

from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
print ("The best value for theta is:", sgd_reg.intercept_, sgd_reg.coef_)

The best value for theta is: [6.1550789] [1.69565741]

  • The above coefficient values are quite similar to those we calculated with the normal equation.

Mini-Batch Gradient Descent Algorithm

Mini-batch gradient descent combines the advantages of both batch GD and stochastic GD: it computes the gradients on small random sets of instances called mini-batches.

theta_path_mgd = []

n_iterations = 100
minibatch_size = 20

np.random.seed(42)

theta_mini = np.random.randn(2, 1)  # Random initialization of theta

t0, t1 = 200, 1000  # Learning schedule hyperparameters
def learning_schedule(t):
    return t0 / (t + t1)

t = 0

for epoch in range(n_iterations):
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_b_shuffled[i: i + minibatch_size]
        yi = y_shuffled[i: i + minibatch_size]
        gradients = 2 / minibatch_size * xi.T.dot(xi.dot(theta_mini) - yi)  # Average gradient over the mini-batch
        eta = learning_schedule(t)
        theta_mini = theta_mini - eta * gradients
        theta_path_mgd.append(theta_mini.copy())  # Append a copy of theta_mini to the path

print("The best value for theta with minibatch gradient descent is:", theta_mini)

The best value for theta with minibatch gradient descent is: [[6.20309696] [1.67912157]]

Comparison of Linear Regression Algorithms

Algorithm       | Number of Training Samples | Number of Features                  | Scaling Required | Scikit-learn Class
Normal Equation | Doesn't matter             | Best for a small number of features | No               | N/A
SVD             | Doesn't matter             | Best for a small number of features | No               | LinearRegression
Batch GD        | Small                      | Doesn't matter                      | Yes              | SGDRegressor
Stochastic GD   | Doesn't matter             | Doesn't matter                      | Yes              | SGDRegressor
Mini-Batch GD   | Doesn't matter             | Doesn't matter                      | Yes              | SGDRegressor

How can we improve the accuracy of a Linear Regression Model?

We can improve linear regression model accuracy by ensuring the quality of the training data and selecting the right hyperparameters. Here is a list of points we can consider to improve model accuracy.

Handling Outliers

Outliers can greatly impact the performance of machine learning algorithms. Depending on your data and requirements, you can decide whether to remove an outlier or replace it with another value.
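As one hedged sketch of a common approach, we can filter out points that fall outside 1.5x the interquartile range; the DataFrame and the column name "value" are illustrative:

import pandas as pd

# Illustrative data; 'value' is a hypothetical column name.
df = pd.DataFrame({"value": [1.0, 1.2, 0.9, 1.1, 25.0]})
q1, q3 = df["value"].quantile(0.25), df["value"].quantile(0.75)
iqr = q3 - q1
mask = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]  # 25.0 falls outside the IQR fence and is dropped
print(df_clean)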

Handling Missing Values

Missing values in the data affect linear regression model performance. Some missing values carry useful information, and data engineers need to decide how to handle them so that the impact on model performance is minimal.
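One hedged sketch using scikit-learn's SimpleImputer to replace missing values with the column mean (the tiny array here is illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([[1.0], [np.nan], [3.0]])  # illustrative data with a gap
imputer = SimpleImputer(strategy="mean")        # other strategies: "median", "most_frequent"
X_filled = imputer.fit_transform(X_missing)     # nan -> 2.0 (mean of 1 and 3)
print(X_filled)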

Feature Selection

The best machine learning models use only relevant features. Therefore, good practice is to remove irrelevant or redundant features that do not contribute significantly to the prediction.
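As a hedged illustration with scikit-learn (the synthetic data and the choice of k are assumptions for the example), SelectKBest keeps the k features most strongly related to the target:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
X_feat = rng.random((100, 5))                  # five candidate features (synthetic)
y_t = 3 * X_feat[:, 0] + rng.normal(size=100)  # only the first feature is informative
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_feat, y_t)
print(selector.get_support())                  # boolean mask of the kept features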

Regularization Techniques

We can use one of the following regularization techniques to improve machine learning model performance (a short sketch follows the list).

  1. Ridge Regression
  2. Lasso Regression
  3. Early Stopping
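Here is a hedged sketch of Ridge and Lasso regression, reusing the X and y arrays generated earlier; the alpha values are illustrative and control the regularization strength:

from sklearn.linear_model import Ridge, Lasso

ridge_reg = Ridge(alpha=1.0)  # L2 penalty shrinks the weights toward zero
ridge_reg.fit(X, y.ravel())
lasso_reg = Lasso(alpha=0.1)  # L1 penalty can zero out weights entirely
lasso_reg.fit(X, y.ravel())
print("Ridge:", ridge_reg.intercept_, ridge_reg.coef_)
print("Lasso:", lasso_reg.intercept_, lasso_reg.coef_)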

Independent Variable Scaling

We need to scale the input features if we are using gradient descent techniques to find the best parameters.
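A minimal StandardScaler sketch (standardizing each feature to zero mean and unit variance), again reusing the X array from earlier:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # (x - mean) / std, computed per feature
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # approximately 0 and 1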

Hyperparameter Tuning

We can use grid search, randomized search, or cross-validation techniques to select the best hyperparameters for linear regression.
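A hedged GridSearchCV sketch tuning SGDRegressor on the X and y arrays from earlier; the grid values below are illustrative assumptions, not recommended defaults:

from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"eta0": [0.01, 0.1], "penalty": [None, "l2"]}  # illustrative grid
grid = GridSearchCV(SGDRegressor(max_iter=1000, tol=1e-3), param_grid, cv=5)
grid.fit(X, y.ravel())
print("Best parameters:", grid.best_params_)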

Frequently Asked Questions on Linear Regression

What is linear regression?

Linear regression is a statistical technique to determine a linear relationship between a dependent variable and one or more independent variables.

What is a regression line?

The regression line, also known as the best-fit line, is the straight line that defines the relationship between the dependent and independent variables.

Is linear regression a classification algorithm?

No, it is a regression algorithm used to predict continuous variables.

How do we calculate the error in linear regression?

We can use one of the following techniques to calculate the error in linear regression.

  1. Mean squared error
  2. Root mean squared error
  3. Mean absolute error
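For example, scikit-learn exposes these metrics directly; a minimal sketch on purely illustrative arrays:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.0])  # illustrative actual values
y_pred = np.array([2.5, 5.5, 7.0])  # illustrative predictions
mse = mean_squared_error(y_true, y_pred)
print("MSE:", mse, "RMSE:", np.sqrt(mse), "MAE:", mean_absolute_error(y_true, y_pred))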

Why is the cost function divided by 2m?

The cost function is divided by 2m instead of m simply to simplify the calculations: the factor of 2 cancels when taking the derivative of the cost function.

What are the assumptions of the linear regression model?

  1. A linear relationship between the independent and dependent variables.
  2. Multi-collinearity does not exist in the data.
  3. No auto-correlation exists.
  4. Homoscedasticity.
  5. Normally distributed prediction errors.

Why do we minimize MSE instead of RMSE?

It is easier to minimize MSE than RMSE, and both lead to the same optimal parameters.
