In the previous posts on Linear Regression (using Batch Gradient Descent, Stochastic Gradient Descent, and the Closed form solution), we discussed a couple of different ways to estimate the parameter vector in the least squares error sense for a given training set. However, how does the least squares error criterion behave when the training set is corrupted by noise? In this post, let us discuss the case where the training set is corrupted by Gaussian noise.

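For the training set of $m$ observations, the system model is:

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, \quad i = 1, \ldots, m,$$

where,

$x^{(i)}$ is the input sequence,

$y^{(i)}$ is the output sequence,

$\theta$ is the parameter vector and

$\epsilon^{(i)}$ is the noise in the observations.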

Let us assume that the noise terms $\epsilon^{(i)}$ are independent and identically distributed, following a Gaussian probability density with mean 0 and variance $\sigma^2$.
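To make the noise model concrete, here is a minimal sketch that draws a training set in which each output is a linear function of the input plus i.i.d. zero-mean Gaussian noise. The function name `make_training_set`, the two-element parameter vector, the uniform inputs and the chosen standard deviation are all assumptions made for illustration only.

```python
import random

def make_training_set(theta, m, sigma, seed=0):
    """Draw m samples of y = theta^T x + eps with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(m):
        # x = [1, u]: a constant term plus one feature drawn uniformly from [-1, 1]
        x = [1.0, rng.uniform(-1.0, 1.0)]
        eps = rng.gauss(0.0, sigma)  # independent Gaussian noise sample
        y = sum(t * xj for t, xj in zip(theta, x)) + eps
        xs.append(x)
        ys.append(y)
    return xs, ys

true_theta = [2.0, -3.0]  # hypothetical parameter vector
xs, ys = make_training_set(true_theta, m=1000, sigma=0.5)

# Recover the noise samples and check their empirical mean and variance
noise = [y - sum(t * xj for t, xj in zip(true_theta, x)) for x, y in zip(xs, ys)]
mean_noise = sum(noise) / len(noise)
var_noise = sum((e - mean_noise) ** 2 for e in noise) / len(noise)
```

With enough samples, `mean_noise` should sit close to 0 and `var_noise` close to the square of the chosen standard deviation.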

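The probability density function of the noise term $\epsilon^{(i)}$ can be written as,

$$p\left(\epsilon^{(i)}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(\epsilon^{(i)}\right)^2}{2\sigma^2}\right).$$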

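This means that the probability density of the output $y^{(i)}$ given $x^{(i)}$ and parameterised by $\theta$ is,

$$p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right).$$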

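Let us write the **likelihood of** $\theta$, given all the observations of the input sequence $X$ and the output $\vec{y}$, as,

$$L(\theta) = L\left(\theta; X, \vec{y}\right) = p\left(\vec{y} \mid X; \theta\right).$$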

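Given that all the $m$ observations are independent, the **likelihood of** $\theta$ is,

$$L(\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right).$$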

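Taking the logarithm on both sides, the **log-likelihood function** is,

$$\ell(\theta) = \log L(\theta) = m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2.$$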

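From the above expression, the first term does not depend on $\theta$ and the factor $1/\sigma^2$ is positive, so maximizing the likelihood function is the same as minimizing

$$\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2.$$

Recall: This is the same cost function that was minimized in the Least Squares solution.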
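The equivalence can be checked numerically. The sketch below assumes a scalar parameter (one input feature, no intercept) so that the least squares estimate has a one-line closed form; the helper names `log_likelihood` and `least_squares`, the simulated data and the noise level are illustrative choices, not part of the derivation.

```python
import math
import random

def log_likelihood(theta, xs, ys, sigma):
    """Gaussian log-likelihood of a scalar theta for the model y = theta * x + eps."""
    m = len(ys)
    sse = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return m * math.log(1.0 / (math.sqrt(2.0 * math.pi) * sigma)) - sse / (2.0 * sigma ** 2)

def least_squares(xs, ys):
    """Closed-form least squares estimate for the scalar model y = theta * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Simulated training set (assumed values: true theta = 1.5, noise std = 0.3)
rng = random.Random(1)
sigma = 0.3
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [1.5 * x + rng.gauss(0.0, sigma) for x in xs]

theta_ls = least_squares(xs, ys)
ll_at_ls = log_likelihood(theta_ls, xs, ys, sigma)

# Moving theta away from the least squares estimate should only lower the log-likelihood
ll_nearby = [log_likelihood(theta_ls + d, xs, ys, sigma) for d in (-0.1, -0.01, 0.01, 0.1)]
```

Every entry of `ll_nearby` should be smaller than `ll_at_ls`, which is what we expect if the least squares estimate is also the maximum likelihood estimate.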

**Summarizing:**

a) When the observations are corrupted by **independent Gaussian noise**, the **least squares solution** is the **Maximum Likelihood estimate** of the parameter vector $\theta$.

b) The term $\sigma^2$ does not play a role in this minimization. However, if the noise variance of each observation is different, this needs to be factored in. We will discuss this in another post.
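Point (b) can be illustrated with a small grid search: when all observations share one noise variance, the value of the parameter that maximizes the log-likelihood should not change with the assumed noise standard deviation. This is only a sketch under assumptions: a scalar parameter, a hypothetical grid with step 0.01, and simulated data.

```python
import math
import random

def log_likelihood(theta, xs, ys, sigma):
    """Gaussian log-likelihood of a scalar theta for the model y = theta * x + eps."""
    m = len(ys)
    sse = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return m * math.log(1.0 / (math.sqrt(2.0 * math.pi) * sigma)) - sse / (2.0 * sigma ** 2)

# Simulated training set (assumed values: true theta = 0.8, noise std = 0.5)
rng = random.Random(2)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
ys = [0.8 * x + rng.gauss(0.0, 0.5) for x in xs]

# Candidate values of theta from -2.00 to 2.00 in steps of 0.01
grid = [i / 100.0 for i in range(-200, 201)]

# For each assumed sigma, pick the grid point with the highest log-likelihood
best = {}
for s in (0.1, 0.5, 2.0):
    best[s] = max(grid, key=lambda t: log_likelihood(t, xs, ys, s))
```

The three entries of `best` should agree: changing the assumed standard deviation rescales and shifts the log-likelihood but does not move its maximizer.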

**References**

CS229 Lecture Notes 1, Chapter 3: Probabilistic Interpretation, Prof. Andrew Ng