Gradients for linear regression

Understanding gradients is essential in machine learning, as they indicate the direction and rate of change of the loss function with respect to the model parameters. This post covers the gradients for the vanilla linear regression case, taking two loss functions, Mean Square Error (MSE) and Mean Absolute Error (MAE), as examples.

The gradients computed analytically are compared against gradients computed using the deep learning framework PyTorch. Further, using these gradients, a training loop based on gradient descent is implemented for the simplest example of fitting a straight line.

As always, the CS229 Lecture Notes and the notation used in the Deep Learning Specialization (C1W1L01) from Dr Andrew Ng form the key references.

Model

Let us take an example of estimating $y$ based on a feature vector $x$ having $n$ features, i.e. $x \in \mathbb{R}^n$.

There are $m$ training examples.

Assume that the estimate $\hat{y}$ is a linear function of $x$.

For a single training example, this can be written as:

$$\hat{y} = w^T x + b$$

where,

  • $x$ is the feature vector of size $n$, i.e. $x \in \mathbb{R}^{n \times 1}$,
  • $w \in \mathbb{R}^{n \times 1}$ is the weight (parameter) vector, and
  • $b$ is a scalar (the bias)

Least Mean Squares

To find the parameters $w$ and $b$ based on the $m$ training examples, we need to formalise a metric to quantify the “closeness” of the estimate $\hat{y}$ to the true value $y$. As an arbitrary choice, let us define a metric based on the mean square error (MSE) as,

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

The goal is to find the parameters $w$ and $b$ which minimize the metric $J$. This can be considered an ordinary least squares model (see the wiki entry on ordinary least squares).

To find the values of the parameters $w$ and $b$ which minimise the metric $J$, let us try the gradient descent method, where we

i) start with initial random values of the parameters $w$ and $b$, and

ii) repeatedly update the parameters simultaneously for all values of $w_j$ and $b$:

$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j} \qquad b := b - \alpha \frac{\partial J}{\partial b}$$

where,

$\alpha$ is the learning rate, and

$\frac{\partial J}{\partial w_j}$ and $\frac{\partial J}{\partial b}$ are the partial derivatives of the loss metric $J$ over the parameters $w_j$ and $b$ respectively.

The intuition is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent (see the Wiki article on gradient descent).

Gradients

In this formulation, we need to find the derivative of a scalar, i.e. the loss $J$, over the vector of parameters $w$ and the scalar $b$.

For easier understanding, we can define the derivative over each parameter as below,

$$\nabla_w J = \begin{bmatrix} \frac{\partial J}{\partial w_1} \\ \vdots \\ \frac{\partial J}{\partial w_n} \end{bmatrix}$$

Further, taking only one training example, the loss is

$$\mathcal{L} = \left( \hat{y} - y \right)^2 = \left( w^T x + b - y \right)^2$$

Taking the derivative w.r.t. the first parameter $w_1$,

$$\frac{\partial \mathcal{L}}{\partial w_1} = 2 \left( \hat{y} - y \right) x_1$$

Similarly, for the parameter $w_j$, the gradient is

$$\frac{\partial \mathcal{L}}{\partial w_j} = 2 \left( \hat{y} - y \right) x_j$$

For the bias parameter $b$, the gradient is

$$\frac{\partial \mathcal{L}}{\partial b} = 2 \left( \hat{y} - y \right)$$

The intuition from the above equations is: if the estimate $\hat{y}$ is close to the true value $y$, then the gradient is small, and the update to the parameters is correspondingly smaller.

With $m$ training examples the loss is averaged, and this becomes:

$$\frac{\partial J}{\partial w_j} = \frac{2}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)} \qquad \frac{\partial J}{\partial b} = \frac{2}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$

Vectorised operations

Vectorised operations allow CPUs/GPUs to do SIMD (Single Instruction, Multiple Data; refer Wiki) processing, making them much faster than explicit for-loops.
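
As a quick illustration (the sizes and data here are made up), the vectorised form produces the same result as a per-example loop:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 3))   # m=10000 examples, n=3 features
w = rng.standard_normal(3)
b = 0.5

# For-loop version: one dot product per example
y_loop = np.array([X[i] @ w + b for i in range(X.shape[0])])

# Vectorised version: a single matrix-vector product
y_vec = X @ w + b

assert np.allclose(y_loop, y_vec)
```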

Inputs & Outputs

In the current example, this translates to the parameters $w$ and $b$ represented as,

$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \in \mathbb{R}^{n \times 1} \quad \text{and} \quad b \in \mathbb{R}$$

respectively.

The $m$ training examples of $n$ features are represented as

$$X = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)} \end{bmatrix} \in \mathbb{R}^{m \times n}$$

The output is

$$\hat{y} = X w + b$$

with $b$ broadcast across all $m$ rows.

Gradients


The gradient w.r.t. $w$ can be represented in matrix operations as,

$$\frac{\partial J}{\partial w} = \frac{2}{m} X^T \left( \hat{y} - y \right)$$

Similarly, for the bias term $b$,

$$\frac{\partial J}{\partial b} = \frac{2}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$

Training

Training loop – using the derivatives

Below is the code for linear regression using the gradient descent updates derived in the previous section.
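
A minimal NumPy sketch of such a loop, assuming a toy dataset of a noisy straight line $y = 3x + 2$ (the data, seed, learning rate and epoch count are illustrative choices, not the post's original values):

```python
import numpy as np

# Toy data: a noisy straight line y = 3x + 2
rng = np.random.default_rng(42)
m = 100
X = rng.uniform(-1, 1, size=(m, 1))              # m examples, n=1 feature
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.standard_normal(m)

w = np.zeros(1)      # weight vector, shape (n,)
b = 0.0              # bias, scalar
alpha = 0.1          # learning rate

for epoch in range(200):
    y_hat = X @ w + b                # predictions, shape (m,)
    err = y_hat - y                  # residuals
    dw = (2.0 / m) * (X.T @ err)     # dJ/dw = (2/m) X^T (y_hat - y)
    db = (2.0 / m) * err.sum()       # dJ/db = (2/m) sum(y_hat - y)
    w -= alpha * dw                  # simultaneous update
    b -= alpha * db

print(f"w = {w[0]:.3f}, b = {b:.3f}")  # should approach 3 and 2
```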

Computing Gradients

Using PyTorch

For the simple linear regression example, it is relatively straightforward to derive the gradients and implement the training loop. When the function for estimation involves multiple stages/layers, a.k.a. deep learning (refer wiki), it becomes harder to derive the gradients.

Popular deep learning frameworks like PyTorch provide tools for automatic differentiation (torch.autograd; refer the PyTorch entry on autograd) to find the gradients of each parameter based on the loss function.
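
As a minimal sketch of autograd in action (the numbers here are made up for illustration):

```python
import torch

# Gradients of L = (w*x + b - y)^2 via autograd
x = torch.tensor(2.0)
y = torch.tensor(7.0)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

loss = (w * x + b - y) ** 2
loss.backward()                 # autograd computes dL/dw and dL/db

print(w.grad)   # 2*(w*x+b-y)*x = 2*(2-7)*2 = -20
print(b.grad)   # 2*(w*x+b-y)   = -10
```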

Numerical approximation (finite difference method)

To verify the gradients, derivatives can be computed numerically using the finite difference method (refer the wiki entry on finite difference), i.e.

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$

where,

$f'(x)$ is the true derivative of the function $f$, and

$h$ is a small constant.

Example – Analytic vs PyTorch vs Numerical Approximation

For the toy example below, we can see that the gradients computed analytically, by PyTorch, and by numerical approximation using the finite difference method all match.
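
A sketch of such a comparison (the toy data and the helper function `mse` are assumptions for illustration):

```python
import torch

# Toy data: 4 examples, 1 feature
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([3.0, 5.0, 7.0, 9.0])
w = torch.tensor([0.5], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
m = X.shape[0]

def mse(w, b):
    return ((X @ w + b - y) ** 2).mean()

# 1) Analytic gradients (vectorised formulas from above)
with torch.no_grad():
    err = X @ w + b - y
    dw_analytic = (2.0 / m) * (X.T @ err)
    db_analytic = (2.0 / m) * err.sum()

# 2) PyTorch autograd
loss = mse(w, b)
loss.backward()

# 3) Central finite difference with a small constant h
h = 1e-4
with torch.no_grad():
    dw_numeric = (mse(w + h, b) - mse(w - h, b)) / (2 * h)
    db_numeric = (mse(w, b + h) - mse(w, b - h)) / (2 * h)

print(dw_analytic.item(), w.grad.item(), dw_numeric.item())  # all match
print(db_analytic.item(), b.grad.item(), db_numeric.item())  # all match
```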

Training Loop – using PyTorch

Key aspects in the code for implementing the training loop using PyTorch (a sketch follows the list):

  • the variables are defined as torch tensors (refer the PyTorch article on tensors).
    • Tensors are similar to NumPy ndarrays, with the ability to run on GPUs/hardware accelerators, and are optimized for automatic differentiation.
  • defining the parameters needing gradient computation.
    • the parameters $w$ and $b$ which need gradient computation are initialised with requires_grad=True
  • computing the gradients
    • the call loss.backward() is used to compute the gradients for the parameters $w$ and $b$.
    • this makes the gradient values available in w.grad and b.grad respectively
  • updating the parameters
    • as gradient tracking is unnecessary during parameter updates, they are performed within the torch.no_grad() context
  • zeroing gradients between calls
    • PyTorch accumulates gradients by default during each backward pass, i.e. each loss.backward() call
    • so, calling w.grad.zero_() and b.grad.zero_() is needed to clear the previous gradients.
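
Putting these aspects together, a minimal sketch of the PyTorch training loop (using the same assumed noisy-line toy data as the NumPy version):

```python
import torch

# Toy data: noisy straight line y = 3x + 2 (illustrative values)
torch.manual_seed(42)
m = 100
X = torch.rand(m, 1) * 2 - 1
y = 3.0 * X[:, 0] + 2.0 + 0.1 * torch.randn(m)

# Parameters that need gradients are created with requires_grad=True
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
alpha = 0.1

for epoch in range(200):
    y_hat = X @ w + b
    loss = ((y_hat - y) ** 2).mean()
    loss.backward()                  # populates w.grad and b.grad

    with torch.no_grad():            # no gradient tracking for updates
        w -= alpha * w.grad
        b -= alpha * b.grad

    w.grad.zero_()                   # clear accumulated gradients
    b.grad.zero_()

print(f"w = {w.item():.3f}, b = {b.item():.3f}")
```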

As one would expect, both training-loop approaches converge to similar values for the parameters $w$ and $b$.

Mean Absolute Error

Another popular metric to quantify the “closeness” of the estimate $\hat{y}$ to the true value $y$ is Mean Absolute Error (MAE). In cases where there are outliers in the data, MAE is preferred over MSE, as MAE penalizes errors linearly rather than quadratically.

Formally,

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left| \hat{y}^{(i)} - y^{(i)} \right|$$

To compute the gradient of the Mean Absolute Error loss, we need to find the gradient of the absolute value function.

Gradient – Absolute function

The absolute value function is defined as,

$$|x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases}$$

The derivative is

$$\frac{d|x|}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \end{cases}$$

This can be compactly written as,

$$\frac{d|x|}{dx} = \operatorname{sign}(x), \quad x \neq 0$$

The absolute value function is non-differentiable at $x = 0$, where the function has a sharp corner.

The concept of a subderivative (or subgradient) generalises the derivative to convex functions which are not differentiable (refer the wiki entry on subderivative). With this definition, the subderivative at $x = 0$ lies in the interval $[-1, 1]$.

Using the concept of the symmetric derivative (refer the wiki entry on symmetric derivative), the subderivative at $x = 0$ can be chosen as $0$.

In practice, deep learning frameworks (like PyTorch, TensorFlow) and numerical libraries like NumPy define $\operatorname{sign}(0) = 0$. This is a valid subgradient, and it works fine in optimization.
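
A quick check of this convention in PyTorch:

```python
import torch

print(torch.sign(torch.tensor(0.0)))       # tensor(0.)

# Autograd also uses 0 as the (sub)gradient of |x| at x = 0
x = torch.tensor(0.0, requires_grad=True)
torch.abs(x).backward()
print(x.grad)                               # tensor(0.)
```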

Training

Deriving the Gradients

For a single training example, the loss is $\mathcal{L} = \left| \hat{y} - y \right|$. For the parameter $w_j$, the gradient is

$$\frac{\partial \mathcal{L}}{\partial w_j} = \operatorname{sign}\left( \hat{y} - y \right) x_j$$

Similarly, for the bias term $b$,

$$\frac{\partial \mathcal{L}}{\partial b} = \operatorname{sign}\left( \hat{y} - y \right)$$

Training Loop – using derivatives and PyTorch

For the same example, below is the code for training the linear regression using Mean Absolute Error as the loss function.
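
A sketch of both loops (the toy data, learning rate and epoch count are again illustrative assumptions):

```python
import torch

# Toy data: noisy straight line y = 3x + 2 (illustrative values)
torch.manual_seed(42)
m = 100
X = torch.rand(m, 1) * 2 - 1
y = 3.0 * X[:, 0] + 2.0 + 0.1 * torch.randn(m)
alpha, epochs = 0.05, 500

# --- Manual gradients using sign(), per the derivation above ---
w, b = torch.zeros(1), torch.zeros(1)
for _ in range(epochs):
    err = X @ w + b - y
    w -= alpha * (torch.sign(err) * X[:, 0]).mean()   # dJ/dw
    b -= alpha * torch.sign(err).mean()               # dJ/db

# --- PyTorch autograd with the MAE loss ---
w2 = torch.zeros(1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)
for _ in range(epochs):
    loss = (X @ w2 + b2 - y).abs().mean()
    loss.backward()
    with torch.no_grad():
        w2 -= alpha * w2.grad
        b2 -= alpha * b2.grad
    w2.grad.zero_()
    b2.grad.zero_()

print(w.item(), b.item())      # manual gradients
print(w2.item(), b2.item())    # autograd; should be very close
```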

We can see that both training loops for Mean Absolute Error (MAE), using PyTorch and the analytic approach, converge to the same parameters $w$ and $b$.

Summary

The post covers the following key aspects:

  • Gradient Basics: How to derive the gradients for the Mean Square Error and Mean Absolute Error loss functions
  • Efficient Computation: Use of vectorized operations and PyTorch autograd
  • Gradient Computation: Analytical, numerical (finite difference), and PyTorch comparison
  • Training Loops: Implementing updates using both manual and PyTorch-based gradients

Have any questions or feedback on the gradient computation techniques? Feel free to drop a note in the comments section. 🙂
