# Least Squares in Gaussian Noise – Maximum Likelihood

In the previous posts on Linear Regression (batch gradient descent, stochastic gradient descent, closed-form solution), we discussed a couple of different ways to estimate the parameter vector in the least-square-error sense for a given training set. However, how does the least-square-error criterion behave when the training set is corrupted by noise? In this post, let us discuss the case where the training set is corrupted by Gaussian noise.

For the $j^{th}$ training example, the system model is:

$y^{(j)} = \theta^Tx^{(j)} + n^{(j)}$,

where,

$x^{(j)}$ is the input sequence,

$y^{(j)}$ is the output sequence,

$\theta$ is the parameter vector and

$n^{(j)}$ is the noise in the observations.

Let us assume that the noise terms $n^{(j)}$ are independent and identically distributed, following a Gaussian distribution with mean 0 and variance $\sigma^2$.
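To make the model concrete, here is a minimal sketch that generates a synthetic training set for this model. The parameter values ($\theta$, $\sigma$, $m$) are illustrative assumptions, not from the post:

```python
import numpy as np

# Generate a synthetic training set for y = theta^T x + n,
# with i.i.d. Gaussian noise n ~ N(0, sigma^2).
# theta_true, sigma, and m below are assumed for illustration.
rng = np.random.default_rng(0)

m = 100                              # number of training examples
theta_true = np.array([2.0, -1.0])   # assumed "true" parameter vector
sigma = 0.5                          # noise standard deviation

X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])  # intercept + one feature
n = rng.normal(0.0, sigma, m)        # n^(j) ~ N(0, sigma^2)
y = X @ theta_true + n               # y^(j) = theta^T x^(j) + n^(j)
```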

The probability density function of the noise term can be written as,

$p\left(n^{(j)}\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{\left(n^{(j)}\right)^2}{2\sigma^2}}$.

This means that the probability density of the output $y^{(j)}$, given $x^{(j)}$ and parameterised by $\theta$, is,

$p\left(y^{(j)}\mid x^{(j)}; \theta\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{\left(y^{(j)}-\theta^Tx^{(j)}\right)^2}{2\sigma^2}}$.
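The density above is just the Gaussian density of the residual $y^{(j)} - \theta^Tx^{(j)}$. A small numerical sketch, with hypothetical values for $x$, $\theta$, $y$, and $\sigma$:

```python
import numpy as np

# Evaluate p(y | x; theta): the Gaussian density of the residual
# y - theta^T x. All concrete values here are illustrative assumptions.
def gaussian_density(residual, sigma):
    # 1/sqrt(2*pi*sigma^2) * exp(-residual^2 / (2*sigma^2))
    return np.exp(-residual**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = np.array([1.0, 0.3])       # hypothetical input (with intercept term)
theta = np.array([2.0, -1.0])  # hypothetical parameter vector
y_obs = 1.8                    # hypothetical observed output
sigma = 0.5

residual = y_obs - theta @ x   # this residual is n^(j) under the model
p = gaussian_density(residual, sigma)
```

Note that the density is largest when the residual is zero, i.e. when the model prediction matches the observation exactly.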

Let us write the likelihood of $\theta$, given all the observed inputs $X$ and outputs $Y$, as,

$L(\theta)=p(Y|X;\theta)$.

Given that all the $m$ observations are independent, the likelihood of $\theta$ is,

$\begin{array}{lll} L(\theta) & = & \prod_{i=1}^{m}p\left(y^{(i)}\mid x^{(i)}; \theta\right)\\ & = & \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{\left(y^{(i)}-\theta^Tx^{(i)}\right)^2}{2\sigma^2}}\end{array}$.

Taking logarithm on both sides, the log-likelihood function is,

$\begin{array}{lll}l(\theta) & = & \log L(\theta)\\ & = & \log \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{\left(y^{(i)}-\theta^Tx^{(i)}\right)^2}{2\sigma^2}}\\ & = & \sum_{i=1}^{m}\log \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{\left(y^{(i)}-\theta^Tx^{(i)}\right)^2}{2\sigma^2}}\\ & = & m \log \frac{1}{\sqrt{2\pi\sigma^2}}-\frac{1}{2\sigma^2}\sum_{i=1}^m\left(y^{(i)}-\theta^Tx^{(i)}\right)^2\end{array}$.
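This decomposition of the log-likelihood into a constant term plus a scaled sum of squared residuals can be checked numerically. A sketch, with illustrative data and parameter values:

```python
import numpy as np

# Verify numerically that the log-likelihood decomposes as
#   l(theta) = m * log(1/sqrt(2*pi*sigma^2)) - J(theta)/(2*sigma^2),
# where J(theta) is the sum of squared residuals.
# The data and parameter values are assumed for illustration.
rng = np.random.default_rng(1)
m, sigma = 50, 0.5
theta = np.array([2.0, -1.0])
X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])
y = X @ theta + rng.normal(0.0, sigma, m)

residuals = y - X @ theta

# Direct evaluation: sum of log Gaussian densities.
log_densities = -0.5 * np.log(2 * np.pi * sigma**2) - residuals**2 / (2 * sigma**2)
log_likelihood = log_densities.sum()

# Decomposed form: constant term minus J(theta)/(2*sigma^2).
J = np.sum(residuals**2)
decomposed = m * np.log(1 / np.sqrt(2 * np.pi * sigma**2)) - J / (2 * sigma**2)
```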

From the above expression, since the first term does not depend on $\theta$, maximizing the likelihood function $L(\theta)$ is the same as minimizing

$\sum_{i=1}^m\left(y^{(i)}-\theta^Tx^{(i)}\right)^2 = J(\theta)$

Recall: This is the same cost function that was minimized in the Least Squares solution.
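Accordingly, the least-squares estimate (here computed via the normal equations, as in the closed-form-solution post) should also maximize the Gaussian log-likelihood. A sketch under assumed data and parameter values:

```python
import numpy as np

# The least-squares estimate (normal equations) maximizes the Gaussian
# log-likelihood. Data and parameter values below are illustrative.
rng = np.random.default_rng(2)
m, sigma = 200, 0.5
theta_true = np.array([2.0, -1.0])
X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])
y = X @ theta_true + rng.normal(0.0, sigma, m)

# Closed-form least-squares solution: solve (X^T X) theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def log_likelihood(theta):
    r = y - X @ theta
    return -m / 2 * np.log(2 * np.pi * sigma**2) - np.sum(r**2) / (2 * sigma**2)
```

Perturbing `theta_hat` in any direction lowers the log-likelihood, since the log-likelihood is a concave quadratic in $\theta$ with its maximum at the least-squares solution.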

Summarizing:

a) When the observations are corrupted by independent and identically distributed Gaussian noise, the least squares solution is the maximum likelihood estimate of the parameter vector $\theta$.

b) The term $\frac{1}{\sigma^2}$ does not play a role in this minimization, since it scales the cost without changing the minimizing $\theta$. However, if the noise variance differs across observations, it needs to be factored in. We will discuss this in another post.

## References

CS229 Lecture Notes 1, Chapter 3 (Probabilistic Interpretation), Prof. Andrew Ng, Stanford University.