Understanding gradients is essential in machine learning, as they indicate the direction and rate of change of the loss function with respect to the model parameters. This post derives the gradients for vanilla linear regression, taking two loss functions, Mean Square Error (MSE) and Mean Absolute Error (MAE), as examples.
The analytically computed gradients are compared against gradients computed using the deep learning framework PyTorch. Further, using these gradients, a training loop based on gradient descent is implemented for the simplest example of fitting a straight line.
As always, the CS229 Lecture Notes and the notation used in the course Deep Learning Specialization C1W1L01 from Dr Andrew Ng form the key references.
Model
Let us take an example of estimating $y$ based on a feature vector $x$ having $n$ features, i.e. $x \in \mathbb{R}^n$. There are $m$ training examples. Assume that the estimate $\hat{y}$ is a linear function of $x$. For a single training example, this can be written as:

$$\hat{y} = w^T x + b$$

where $w$ is the weight vector of the same size $n$ as the feature vector, i.e. $w \in \mathbb{R}^n$, and $b$ is a scalar.
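As a tiny illustration (a minimal sketch with made-up numbers, not the post's actual data), the model for a single example is just a dot product plus a bias:

```python
import numpy as np

# Hypothetical toy values: n = 3 features for a single example
w = np.array([0.5, -1.2, 0.7])   # weight vector, shape (n,)
b = 2.0                          # scalar bias
x = np.array([1.0, 2.0, 3.0])    # feature vector, shape (n,)

y_hat = np.dot(w, x) + b         # y_hat = w^T x + b
print(y_hat)                     # 0.5 - 2.4 + 2.1 + 2.0 = 2.2
```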
Least Mean Squares
To find the parameters $w$ and $b$ based on the $m$ training examples, we need to formalise a metric to quantify the “closeness” of the estimate $\hat{y}$ to the true value $y$. As an arbitrary choice, let us define a metric $J(w, b)$ based on the mean square error (MSE) as,

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

The goal is to find the parameters $w$ and $b$ which minimize the metric $J(w, b)$. This can be considered an ordinary least squares model (see the wiki entry on ordinary least squares).
To find the values of the parameters $w$ and $b$ which minimise the metric $J(w, b)$, let us try the gradient descent method, where we
i) start with initial random values of the parameters $w$ and $b$, and
ii) repeatedly update the parameters simultaneously for all values of $j = 1, \ldots, n$:

$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$

where $\alpha$ is the learning rate, and $\frac{\partial J}{\partial w_j}$ and $\frac{\partial J}{\partial b}$ are the partial derivatives of the loss metric $J$ with respect to the parameters $w_j$ and $b$ respectively.
The intuition is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent (see the Wikipedia article on gradient descent).
Gradients
In this formulation, we need to find the derivative of a scalar (the loss) with respect to the vector of parameters $w$ and the scalar $b$. For easier understanding, we can define the derivative over each parameter as below,

$$\frac{\partial J}{\partial w} = \begin{bmatrix} \frac{\partial J}{\partial w_1} \\ \frac{\partial J}{\partial w_2} \\ \vdots \\ \frac{\partial J}{\partial w_n} \end{bmatrix}$$

Further, taking only one training example, the loss is

$$L = \frac{1}{2} \left( \hat{y} - y \right)^2$$

Taking the derivative w.r.t. the first parameter $w_1$,

$$\frac{\partial L}{\partial w_1} = \left( \hat{y} - y \right) \frac{\partial \hat{y}}{\partial w_1} = \left( \hat{y} - y \right) x_1$$

Similarly, for the parameter $w_j$, the gradient is

$$\frac{\partial L}{\partial w_j} = \left( \hat{y} - y \right) x_j$$

For the bias parameter $b$, the gradient is

$$\frac{\partial L}{\partial b} = \hat{y} - y$$
The intuition from the above equations is: if the estimate $\hat{y}$ is close to the true value $y$, then the gradient is small, and the update to the parameters is correspondingly smaller.
With $m$ training examples the loss is averaged, and the gradients become:

$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
Vectorised operations
Vectorised operations allow CPUs/GPUs to do SIMD (Single Instruction, Multiple Data; see the Wikipedia entry) processing, making them much faster than equivalent for-loops.
Inputs & Outputs
In the current example, this translates to the parameters $w$ and $b$ represented as

$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \in \mathbb{R}^{n \times 1} \quad \text{and} \quad b \in \mathbb{R}$$

respectively. The $m$ training examples of $n$ features are represented as

$$X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix} \in \mathbb{R}^{n \times m}$$

with one example per column. The output is

$$\hat{Y} = w^T X + b \in \mathbb{R}^{1 \times m}$$
Gradients
The gradient w.r.t. $w$ can be represented in matrix operations as,

$$\frac{\partial J}{\partial w} = \frac{1}{m} X \left( \hat{Y} - Y \right)^T$$

Similarly, for the bias term,

$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
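Below is a minimal sketch of these vectorised gradients in NumPy, assuming the shapes described above ($X$ of shape $(n, m)$ with one example per column); the values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5
X = rng.normal(size=(n, m))          # m examples of n features
Y = rng.normal(size=(1, m))          # true values
w = np.zeros((n, 1))                 # weight vector, shape (n, 1)
b = 0.0                              # scalar bias

Y_hat = w.T @ X + b                  # predictions, shape (1, m)
dw = X @ (Y_hat - Y).T / m           # dJ/dw, shape (n, 1)
db = np.mean(Y_hat - Y)              # dJ/db, scalar
print(dw.shape, db)
```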
Training
Training loop – using the derivatives
The code below implements linear regression using the gradient descent updates defined in the previous section.
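A minimal sketch of such a loop, assuming synthetic straight-line data (the learning rate and iteration count here are arbitrary choices):

```python
import numpy as np

# Synthetic straight-line data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(42)
m = 100
X = rng.uniform(-1, 1, size=(1, m))            # shape (n, m) with n = 1
Y = 2.0 * X + 1.0 + 0.01 * rng.normal(size=(1, m))

w = np.zeros((1, 1))
b = 0.0
lr = 0.1                                       # learning rate (alpha)

for step in range(500):
    Y_hat = w.T @ X + b                        # forward pass
    dw = X @ (Y_hat - Y).T / m                 # dJ/dw (MSE)
    db = np.mean(Y_hat - Y)                    # dJ/db (MSE)
    w -= lr * dw                               # gradient descent update
    b -= lr * db

print(w.ravel(), b)                            # expect roughly [2.0] and 1.0
```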
Computing Gradients
Using PyTorch
For the simple linear regression example, it is relatively straightforward to derive the gradients and implement the training loop. When the function used for estimation involves multiple stages/layers, a.k.a. deep learning (refer wiki), it becomes harder to derive the gradients by hand.
Popular deep learning frameworks like PyTorch provide tools for automatic differentiation (`torch.autograd`; refer to the PyTorch entry on autograd) to find the gradients of each parameter from the loss function.
Numerical approximation (finite difference method)
To verify the gradients, derivatives can be computed numerically using the finite difference method (refer to the wiki entry on finite difference), i.e.

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$

where $f'(x)$ is the true derivative of the function $f$ and $h$ is a small constant.
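As a quick self-contained check (a sketch with a hypothetical function, using the central difference form above), we can verify the derivative of $f(x) = x^2$ at $x = 3$, where the true derivative is $2x = 6$:

```python
# Central finite-difference check of a known derivative: f(x) = x^2, f'(3) = 6
def f(x):
    return x ** 2

h = 1e-5
x = 3.0
numerical = (f(x + h) - f(x - h)) / (2 * h)
print(numerical)   # approximately 6.0
```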
Example – Analytic vs PyTorch vs Numerical Approximation
For the toy example below, we can see that the gradients computed analytically, by PyTorch, and by numerical approximation using the finite difference method all match.
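A minimal sketch of such a three-way comparison (the data and parameter values here are arbitrary placeholders, not necessarily the post's original toy example):

```python
import numpy as np
import torch

# Toy example: compute dJ/dw and dJ/db for the MSE loss three ways.
X = np.array([[1.0, 2.0, 3.0]])     # shape (n, m) = (1, 3)
Y = np.array([[3.0, 5.0, 7.0]])     # true values on the line y = 2x + 1
w = np.array([[0.5]])
b = 0.1
m = X.shape[1]

# 1) Analytic gradients
Y_hat = w.T @ X + b
dw_analytic = (X @ (Y_hat - Y).T / m).item()
db_analytic = np.mean(Y_hat - Y)

# 2) PyTorch autograd
wt = torch.tensor(w, requires_grad=True)
bt = torch.tensor(b, dtype=torch.float64, requires_grad=True)
Xt, Yt = torch.tensor(X), torch.tensor(Y)
loss = 0.5 * torch.mean((wt.T @ Xt + bt - Yt) ** 2)
loss.backward()

# 3) Numerical approximation (central finite difference)
def loss_fn(w_, b_):
    return 0.5 * np.mean((w_.T @ X + b_ - Y) ** 2)

h = 1e-6
dw_numerical = (loss_fn(w + h, b) - loss_fn(w - h, b)) / (2 * h)
db_numerical = (loss_fn(w, b + h) - loss_fn(w, b - h)) / (2 * h)

print(dw_analytic, wt.grad.item(), dw_numerical)   # all approximately -8.8
print(db_analytic, bt.grad.item(), db_numerical)   # all approximately -3.9
```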
Training Loop – using PyTorch
Key aspects in the code for implementing the training loop using PyTorch (a minimal sketch follows this list):
- Defining tensors: the variables are defined as torch tensors (refer to the PyTorch article on tensors). Tensors are similar to NumPy ndarrays, with the ability to run on GPUs/hardware accelerators, and are optimized for automatic differentiation.
- Defining the parameters needing gradient computation: the parameters $w$ and $b$ which need gradient computation are initialised with `requires_grad=True`.
- Computing the gradients: the call `loss.backward()` computes the gradients for the parameters $w$ and $b$. This makes the gradient values available in `w.grad` and `b.grad` respectively.
- Updating the parameters: as gradient tracking is unnecessary during parameter updates, they are performed within the `torch.no_grad():` context.
- Zeroing gradients between calls: PyTorch accumulates gradients by default during each backward pass, i.e. each `loss.backward()` call, so calling `w.grad.zero_()` and `b.grad.zero_()` is needed to clear the previous gradients.
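Putting these pieces together, here is a minimal sketch of the PyTorch training loop (the synthetic data and hyperparameters are arbitrary choices):

```python
import torch

# Synthetic straight-line data: y = 2x + 1
torch.manual_seed(42)
m = 100
X = torch.rand(1, m) * 2 - 1                 # shape (n, m) with n = 1
Y = 2.0 * X + 1.0

# Parameters that need gradients are created with requires_grad=True
w = torch.zeros(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

for step in range(500):
    Y_hat = w.T @ X + b                      # forward pass
    loss = 0.5 * torch.mean((Y_hat - Y) ** 2)
    loss.backward()                          # computes w.grad and b.grad

    with torch.no_grad():                    # no gradient tracking for updates
        w -= lr * w.grad
        b -= lr * b.grad

    w.grad.zero_()                           # clear accumulated gradients
    b.grad.zero_()

print(w.item(), b.item())                    # expect roughly 2.0 and 1.0
```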
As one would expect, both training loop approaches converge to similar values for the parameters $w$ and $b$.
Mean Absolute Error
Another popular metric to quantify the “closeness” of the estimate $\hat{y}$ to the true value $y$ is the Mean Absolute Error (MAE). In cases where there are outliers in the data, MAE is preferred over MSE, as MAE penalizes errors linearly rather than quadratically.
Formally,

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left| \hat{y}^{(i)} - y^{(i)} \right|$$
To compute the gradient of the Mean Absolute Error loss, we need the gradient of the absolute value function.
Gradient – Absolute function
The absolute value function is defined as,

$$|x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases}$$

The derivative is

$$\frac{d}{dx}|x| = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \end{cases}$$

This can be compactly written as,

$$\frac{d}{dx}|x| = \operatorname{sign}(x), \quad x \neq 0$$

The absolute value function is non-differentiable at $x = 0$, where it has a sharp corner.
The concept of the subderivative (or subgradient) generalises the derivative to convex functions which are not differentiable everywhere (refer to the wiki entry on Subderivative). With this definition, the subderivative at $x = 0$ lies in the interval $[-1, 1]$.
Using the concept of the symmetric derivative (refer to the wiki entry on symmetric derivative), the subderivative at $x = 0$ can be chosen as $0$.
In practice, deep learning frameworks (like PyTorch, TensorFlow) and numerical libraries like NumPy define $\operatorname{sign}(0) = 0$. This is a valid subgradient, and it works fine in optimization.
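A quick check of this convention (the behaviour shown is what current NumPy and PyTorch releases do; worth re-verifying on your version):

```python
import numpy as np
import torch

# Both libraries use the sign(0) = 0 convention
print(np.sign(0.0))                        # 0.0
print(torch.sign(torch.tensor(0.0)))       # tensor(0.)

# PyTorch's autograd applies the same convention to |x| at x = 0
x = torch.tensor(0.0, requires_grad=True)
torch.abs(x).backward()
print(x.grad)                              # tensor(0.)
```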
Training
Deriving the Gradients
For a single training example, for the parameter $w_j$, the gradient is

$$\frac{\partial L}{\partial w_j} = \operatorname{sign}\left( \hat{y} - y \right) x_j$$

Similarly, for the bias term,

$$\frac{\partial L}{\partial b} = \operatorname{sign}\left( \hat{y} - y \right)$$
Training Loop – using derivatives and PyTorch
For the same example, the code below trains the linear regression using Mean Absolute Error as the loss function.
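A minimal sketch of the analytic MAE loop (the PyTorch variant only changes the loss to `torch.mean(torch.abs(Y_hat - Y))`; data and hyperparameters are arbitrary):

```python
import numpy as np

# Synthetic straight-line data: y = 2x + 1
rng = np.random.default_rng(42)
m = 100
X = rng.uniform(-1, 1, size=(1, m))
Y = 2.0 * X + 1.0

w = np.zeros((1, 1))
b = 0.0
lr = 0.05

for step in range(2000):
    Y_hat = w.T @ X + b
    s = np.sign(Y_hat - Y)                 # subgradient of |.|, sign(0) = 0
    dw = X @ s.T / m                       # dJ/dw for MAE
    db = np.mean(s)                        # dJ/db for MAE
    w -= lr * dw
    b -= lr * db

print(w.ravel(), b)                        # expect roughly [2.0] and 1.0
```

Note that with MAE the gradient magnitude does not shrink as the estimate approaches the true value, so with a fixed learning rate the parameters hover in a small neighbourhood of the optimum rather than settling exactly.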
We can see that both training loops for Mean Absolute Error (MAE), using PyTorch and the analytic gradients, converge to the same parameters $w$ and $b$.
Summary
The post covers the following key aspects:
- Gradient Basics: deriving the gradients for the Mean Square Error and Mean Absolute Error loss functions
- Efficient Computation: use of vectorised operations and PyTorch autograd
- Gradient Computation: comparison of analytical, numerical (finite difference), and PyTorch-computed gradients
- Training Loops: implementing updates using both manual and PyTorch-based gradients
Have any questions or feedback on the gradient computation techniques? Feel free to drop your feedback in the comments section. 🙂