Understanding gradients is essential in machine learning, as they indicate the direction and rate of change of the loss function with respect to the model parameters. This post derives the gradients for vanilla linear regression, taking two loss functions, Mean Square Error (MSE) and Mean Absolute Error (MAE), as examples.
The analytically computed gradients are compared against gradients computed using the deep learning framework PyTorch. Further, using these gradients, a training loop based on gradient descent is implemented for the simplest example of fitting a straight line.
As always, the CS229 Lecture Notes and the notation used in the course Deep Learning Specialization C1W1L01 from Dr Andrew Ng form the key references.
Model
Let us take an example of estimating $y$ based on a feature vector $x$ having $n$ features, i.e. $x \in \mathbb{R}^n$. There are $m$ training examples. Assume that the estimate $\hat{y}$ is a linear function of $x$. For a single training example, this can be written as:

$$\hat{y} = w^T x + b$$

where $w$ is the weight vector of the same size $n$ as the feature vector, i.e. $w \in \mathbb{R}^n$, and $b$ is a scalar.
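As a tiny illustration (a minimal sketch with made-up numbers, not the post's actual data), the model for a single example is just a dot product plus a bias:

```python
import numpy as np

# Hypothetical toy values: n = 3 features for a single example
w = np.array([0.5, -1.2, 0.7])   # weight vector, shape (n,)
b = 2.0                          # scalar bias
x = np.array([1.0, 2.0, 3.0])    # feature vector, shape (n,)

y_hat = np.dot(w, x) + b         # y_hat = w^T x + b
print(y_hat)                     # 0.5 - 2.4 + 2.1 + 2.0 = 2.2
```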
Least Mean Squares
To find the parameters $w$ and $b$ based on the $m$ training examples, we need to formalise a metric to quantify the “closeness” of the estimate $\hat{y}$ to the true value $y$. As an arbitrary choice, let us define a metric $J(w, b)$ based on the mean square error (MSE) as,

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

The goal is to find the parameters $w$ and $b$ which minimize the metric $J(w, b)$. This can be considered an ordinary least squares model (see the wiki entry on ordinary least squares).
To find the values of the parameters $w$ and $b$ which minimise the metric $J(w, b)$, let us try the gradient descent method, where we
i) start with initial random values of the parameters $w$ and $b$, and
ii) repeatedly update the parameters simultaneously for all values of $j = 1, \ldots, n$:

$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$

where $\alpha$ is the learning rate, and $\frac{\partial J}{\partial w_j}$ and $\frac{\partial J}{\partial b}$ are the partial derivatives of the loss metric $J$ with respect to the parameters $w_j$ and $b$ respectively.
The intuition is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent (see the Wikipedia article on gradient descent).
Gradients
In this formulation, we need to find the derivative of a scalar (the loss) with respect to the vector of parameters $w$ and the scalar $b$. For easier understanding, we can define the derivative over each parameter as below,

$$\frac{\partial J}{\partial w} = \begin{bmatrix} \frac{\partial J}{\partial w_1} \\ \frac{\partial J}{\partial w_2} \\ \vdots \\ \frac{\partial J}{\partial w_n} \end{bmatrix}$$

Further, taking only one training example, the loss is

$$L = \frac{1}{2} \left( \hat{y} - y \right)^2$$

Taking the derivative w.r.t. the first parameter $w_1$,

$$\frac{\partial L}{\partial w_1} = \left( \hat{y} - y \right) \frac{\partial \hat{y}}{\partial w_1} = \left( \hat{y} - y \right) x_1$$

Similarly, for the parameter $w_j$, the gradient is

$$\frac{\partial L}{\partial w_j} = \left( \hat{y} - y \right) x_j$$

For the bias parameter $b$, the gradient is

$$\frac{\partial L}{\partial b} = \hat{y} - y$$
The intuition from the above equations is: if the estimate $\hat{y}$ is close to the true value $y$, then the gradient is small, and the update to the parameters is correspondingly smaller.
With $m$ training examples the loss is averaged, and the gradients become:

$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
Vectorised operations
Vectorised operations allow CPUs/GPUs to do SIMD (Single Instruction, Multiple Data; see the Wikipedia entry) processing, making them much faster than equivalent for-loops.
Inputs & Outputs
In the current example, this translates to the parameters $w$ and $b$ represented as

$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \in \mathbb{R}^{n \times 1} \quad \text{and} \quad b \in \mathbb{R}$$

respectively. The $m$ training examples of $n$ features are represented as

$$X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix} \in \mathbb{R}^{n \times m}$$

with one example per column. The output is

$$\hat{Y} = w^T X + b \in \mathbb{R}^{1 \times m}$$
Gradients
The gradient w.r.t. $w$ can be represented in matrix operations as,

$$\frac{\partial J}{\partial w} = \frac{1}{m} X \left( \hat{Y} - Y \right)^T$$

Similarly, for the bias term,

$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
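Below is a minimal sketch of these vectorised gradients in NumPy, assuming the shapes described above ($X$ of shape $(n, m)$ with one example per column); the values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5
X = rng.normal(size=(n, m))          # m examples of n features
Y = rng.normal(size=(1, m))          # true values
w = np.zeros((n, 1))                 # weight vector, shape (n, 1)
b = 0.0                              # scalar bias

Y_hat = w.T @ X + b                  # predictions, shape (1, m)
dw = X @ (Y_hat - Y).T / m           # dJ/dw, shape (n, 1)
db = np.mean(Y_hat - Y)              # dJ/db, scalar
print(dw.shape, db)
```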
Training
Training loop – using the derivatives
The code below implements linear regression using the gradient descent updates defined in the previous section.
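A minimal sketch of such a loop, assuming synthetic straight-line data (the learning rate and iteration count here are arbitrary choices):

```python
import numpy as np

# Synthetic straight-line data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(42)
m = 100
X = rng.uniform(-1, 1, size=(1, m))            # shape (n, m) with n = 1
Y = 2.0 * X + 1.0 + 0.01 * rng.normal(size=(1, m))

w = np.zeros((1, 1))
b = 0.0
lr = 0.1                                       # learning rate (alpha)

for step in range(500):
    Y_hat = w.T @ X + b                        # forward pass
    dw = X @ (Y_hat - Y).T / m                 # dJ/dw (MSE)
    db = np.mean(Y_hat - Y)                    # dJ/db (MSE)
    w -= lr * dw                               # gradient descent update
    b -= lr * db

print(w.ravel(), b)                            # expect roughly [2.0] and 1.0
```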
Computing Gradients
Using PyTorch
For the simple linear regression example, it is relatively straightforward to derive the gradients and implement the training loop. When the function used for estimation involves multiple stages/layers, a.k.a. deep learning (refer wiki), it becomes harder to derive the gradients by hand.
Popular deep learning frameworks like PyTorch provide tools for automatic differentiation (`torch.autograd`; refer to the PyTorch entry on autograd) to find the gradients of each parameter from the loss function.
Numerical approximation (finite difference method)
To verify the gradients, derivatives can be computed numerically using the finite difference method (refer to the wiki entry on finite difference), i.e.

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$

where $f'(x)$ is the true derivative of the function $f$ and $h$ is a small constant.
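As a quick self-contained check (a sketch with a hypothetical function, using the central difference form above), we can verify the derivative of $f(x) = x^2$ at $x = 3$, where the true derivative is $2x = 6$:

```python
# Central finite-difference check of a known derivative: f(x) = x^2, f'(3) = 6
def f(x):
    return x ** 2

h = 1e-5
x = 3.0
numerical = (f(x + h) - f(x - h)) / (2 * h)
print(numerical)   # approximately 6.0
```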
Example – Analytic vs PyTorch vs Numerical Approximation
For the toy example below, we can see that the gradients computed analytically, by PyTorch, and by numerical approximation using the finite difference method all match.
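A minimal sketch of such a three-way comparison (the data and parameter values here are arbitrary placeholders, not necessarily the post's original toy example):

```python
import numpy as np
import torch

# Toy example: compute dJ/dw and dJ/db for the MSE loss three ways.
X = np.array([[1.0, 2.0, 3.0]])     # shape (n, m) = (1, 3)
Y = np.array([[3.0, 5.0, 7.0]])     # true values on the line y = 2x + 1
w = np.array([[0.5]])
b = 0.1
m = X.shape[1]

# 1) Analytic gradients
Y_hat = w.T @ X + b
dw_analytic = (X @ (Y_hat - Y).T / m).item()
db_analytic = np.mean(Y_hat - Y)

# 2) PyTorch autograd
wt = torch.tensor(w, requires_grad=True)
bt = torch.tensor(b, dtype=torch.float64, requires_grad=True)
Xt, Yt = torch.tensor(X), torch.tensor(Y)
loss = 0.5 * torch.mean((wt.T @ Xt + bt - Yt) ** 2)
loss.backward()

# 3) Numerical approximation (central finite difference)
def loss_fn(w_, b_):
    return 0.5 * np.mean((w_.T @ X + b_ - Y) ** 2)

h = 1e-6
dw_numerical = (loss_fn(w + h, b) - loss_fn(w - h, b)) / (2 * h)
db_numerical = (loss_fn(w, b + h) - loss_fn(w, b - h)) / (2 * h)

print(dw_analytic, wt.grad.item(), dw_numerical)   # all approximately -8.8
print(db_analytic, bt.grad.item(), db_numerical)   # all approximately -3.9
```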
Training Loop – using PyTorch
Key aspects in the code for implementing the training loop using PyTorch (a minimal sketch follows this list):
- Defining tensors: the variables are defined as torch tensors (refer to the PyTorch article on tensors). Tensors are similar to NumPy ndarrays, with the ability to run on GPUs/hardware accelerators, and are optimized for automatic differentiation.
- Defining the parameters needing gradient computation: the parameters $w$ and $b$ which need gradient computation are initialised with `requires_grad=True`.
- Computing the gradients: the call `loss.backward()` computes the gradients for the parameters $w$ and $b$. This makes the gradient values available in `w.grad` and `b.grad` respectively.
- Updating the parameters: as gradient tracking is unnecessary during parameter updates, they are performed within the `torch.no_grad():` context.
- Zeroing gradients between calls: PyTorch accumulates gradients by default during each backward pass, i.e. each `loss.backward()` call, so calling `w.grad.zero_()` and `b.grad.zero_()` is needed to clear the previous gradients.
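Putting these pieces together, here is a minimal sketch of the PyTorch training loop (the synthetic data and hyperparameters are arbitrary choices):

```python
import torch

# Synthetic straight-line data: y = 2x + 1
torch.manual_seed(42)
m = 100
X = torch.rand(1, m) * 2 - 1                 # shape (n, m) with n = 1
Y = 2.0 * X + 1.0

# Parameters that need gradients are created with requires_grad=True
w = torch.zeros(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

for step in range(500):
    Y_hat = w.T @ X + b                      # forward pass
    loss = 0.5 * torch.mean((Y_hat - Y) ** 2)
    loss.backward()                          # computes w.grad and b.grad

    with torch.no_grad():                    # no gradient tracking for updates
        w -= lr * w.grad
        b -= lr * b.grad

    w.grad.zero_()                           # clear accumulated gradients
    b.grad.zero_()

print(w.item(), b.item())                    # expect roughly 2.0 and 1.0
```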
As one would expect, both training loop approaches converge to similar values for the parameters $w$ and $b$.
Mean Absolute Error
Another popular metric to quantify the “closeness” of the estimate $\hat{y}$ to the true value $y$ is the Mean Absolute Error (MAE). In cases where there are outliers in the data, MAE is preferred over MSE, as MAE penalizes errors linearly rather than quadratically.
Formally,

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left| \hat{y}^{(i)} - y^{(i)} \right|$$
To compute the gradient of the Mean Absolute Error loss, we need the gradient of the absolute value function.
Gradient – Absolute function
The absolute value function is defined as,

$$|x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases}$$

The derivative is

$$\frac{d}{dx}|x| = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \end{cases}$$

This can be compactly written as,

$$\frac{d}{dx}|x| = \operatorname{sign}(x), \quad x \neq 0$$

The absolute value function is non-differentiable at $x = 0$, where it has a sharp corner.
The concept of the subderivative (or subgradient) generalises the derivative to convex functions which are not differentiable everywhere (refer to the wiki entry on Subderivative). With this definition, the subderivative at $x = 0$ lies in the interval $[-1, 1]$.
Using the concept of the symmetric derivative (refer to the wiki entry on symmetric derivative), the subderivative at $x = 0$ can be chosen as $0$.
In practice, deep learning frameworks (like PyTorch, TensorFlow) and numerical libraries like NumPy define $\operatorname{sign}(0) = 0$. This is a valid subgradient, and it works fine in optimization.
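A quick check of this convention (the behaviour shown is what current NumPy and PyTorch releases do; worth re-verifying on your version):

```python
import numpy as np
import torch

# Both libraries use the sign(0) = 0 convention
print(np.sign(0.0))                        # 0.0
print(torch.sign(torch.tensor(0.0)))       # tensor(0.)

# PyTorch's autograd applies the same convention to |x| at x = 0
x = torch.tensor(0.0, requires_grad=True)
torch.abs(x).backward()
print(x.grad)                              # tensor(0.)
```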
Training
Deriving the Gradients
For a single training example, for the parameter $w_j$, the gradient is

$$\frac{\partial L}{\partial w_j} = \operatorname{sign}\left( \hat{y} - y \right) x_j$$

Similarly, for the bias term,

$$\frac{\partial L}{\partial b} = \operatorname{sign}\left( \hat{y} - y \right)$$
Training Loop – using derivatives and PyTorch
For the same example, the code below trains the linear regression using Mean Absolute Error as the loss function.
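A minimal sketch of the analytic MAE loop (the PyTorch variant only changes the loss to `torch.mean(torch.abs(Y_hat - Y))`; data and hyperparameters are arbitrary):

```python
import numpy as np

# Synthetic straight-line data: y = 2x + 1
rng = np.random.default_rng(42)
m = 100
X = rng.uniform(-1, 1, size=(1, m))
Y = 2.0 * X + 1.0

w = np.zeros((1, 1))
b = 0.0
lr = 0.05

for step in range(2000):
    Y_hat = w.T @ X + b
    s = np.sign(Y_hat - Y)                 # subgradient of |.|, sign(0) = 0
    dw = X @ s.T / m                       # dJ/dw for MAE
    db = np.mean(s)                        # dJ/db for MAE
    w -= lr * dw
    b -= lr * db

print(w.ravel(), b)                        # expect roughly [2.0] and 1.0
```

Note that with MAE the gradient magnitude does not shrink as the estimate approaches the true value, so with a fixed learning rate the parameters hover in a small neighbourhood of the optimum rather than settling exactly.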
We can see that both training loops for Mean Absolute Error (MAE), using PyTorch and the analytic gradients, converge to the same parameters $w$ and $b$.
Summary
The post covers the following key aspects:
- Gradient Basics: deriving the gradients for the Mean Square Error and Mean Absolute Error loss functions
- Efficient Computation: use of vectorised operations and PyTorch autograd
- Gradient Computation: comparison of analytical, numerical (finite difference), and PyTorch-computed gradients
- Training Loops: implementing updates using both manual and PyTorch-based gradients
Have any questions or feedback on the gradient computation techniques? Feel free to drop your feedback in the comments section. 🙂