In the previous post on Batch Gradient Descent and Stochastic Gradient Descent, we looked at two iterative methods for finding the parameter vector which minimizes the square of the error between the predicted value and the actual output for all values in the training set.
A closed form solution for finding the parameter vector is possible, and in this post let us explore that. Of course, I thank Prof. Andrew Ng for making all this material available in the public domain (Lecture Notes 1, retrieved 11 Sep 2024).
Notations
Let’s revisit the notations.
Let $m$ be the number of training examples (in our case, the top 50 articles),
$x$ be the input sequence (the page index),
$y$ be the output sequence (the page views for each page index),
$n$ be the number of features/parameters ($n = 2$ for our example).
The value of $(x^{(i)}, y^{(i)})$ corresponds to the $i^{\text{th}}$ entry of the training set.
The number of page views for a given page index is predicted using a hypothesis defined as:
$h_\theta(x) = \theta_0 + \theta_1 x$,
where
$x$ is the page index, and
$\theta_0, \theta_1$ are the parameters of the hypothesis.
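As a quick illustration, the hypothesis can be written as an anonymous function in Octave/Matlab (a minimal sketch; the parameter values below are arbitrary placeholders, not the fitted ones):

h = @(theta, x) theta(1) + theta(2).*x;   % h_theta(x) = theta_0 + theta_1*x
h([1000; -20], 5)                         % predicted page views for page index 5, with placeholder theta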
Formulating in matrix notation:
The input sequence is
$X = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{bmatrix}$, of dimension [m x n].
The measured values are
$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$, of dimension [m x 1].
The parameter vector is
$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$, of dimension [n x 1].
The hypothesis term is
$h_\theta = \begin{bmatrix} h_\theta(x^{(1)}) \\ h_\theta(x^{(2)}) \\ \vdots \\ h_\theta(x^{(m)}) \end{bmatrix}$, of dimension [m x 1].
From the above,
$h_\theta = X\theta$.
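To make the dimensions concrete, here is a small Octave/Matlab sketch (the variable names are just for illustration; the code snippet later in the post calls the design matrix x):

x_raw = (1:50).';        % page indices, [m x 1]
m = length(x_raw);       % number of training examples
X = [ones(m,1) x_raw];   % design matrix with a column of ones, [m x n], n = 2
theta = zeros(2,1);      % parameter vector, [n x 1]
h = X*theta;             % hypothesis vector, [m x 1]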
Recall: our goal is to find the parameter vector $\theta$ which minimizes the square of the error between the predicted value and the actual output for all values in the training set, i.e. to minimize
$\frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$.
From matrix algebra, we know that
$\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = (X\theta - y)^T(X\theta - y)$.
So we can now define the cost function as
$J(\theta) = \frac{1}{2}(X\theta - y)^T(X\theta - y)$.
To find the value of $\theta$ which minimizes $J(\theta)$, we can differentiate $J(\theta)$ with respect to $\theta$:
$\nabla_\theta J(\theta) = X^T X\theta - X^T y$.
To find the value of $\theta$ which minimizes $J(\theta)$, we set
$\nabla_\theta J(\theta) = 0$,
$X^T X\theta = X^T y$.
Solving,
$\theta = \left(X^T X\right)^{-1} X^T y$.
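For completeness, the gradient above follows from expanding the quadratic form (the only assumption beyond the derivation itself is that $X^T X$ is invertible):

$J(\theta) = \frac{1}{2}\left(\theta^T X^T X\theta - 2y^T X\theta + y^T y\right)$

$\nabla_\theta J(\theta) = \frac{1}{2}\left(2X^T X\theta - 2X^T y\right) = X^T X\theta - X^T y$

using the two matrix calculus identities listed in the note further below. Setting the gradient to zero then yields the closed form $\theta = (X^T X)^{-1} X^T y$.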
Note (Update 7th Dec 2011):
As pointed out by Mr. Andrei Krishkevich, the above solution is the same as the formula for the linear least squares fit (see linear least squares on Wikipedia).
Matlab/Octave code snippet
clear; close all;
x = [1:50].';
y = [4554 3014 2171 1891 1593 1532 1416 1326 1297 1266 ...
     1248 1052  951  936  918  797  743  665  662  652 ...
      629  609  596  590  582  547  486  471  462  435 ...
      424  403  400  386  386  384  384  383  370  365 ...
      360  358  354  347  320  319  318  311  307  290 ].';
m = length(y);        % store the number of training examples
x = [ones(m,1) x];    % add a column of ones to x
n = size(x,2);        % number of features
theta_vec = inv(x'*x)*x'*y;   % closed form solution
The computed values are approximately
theta_vec = [1840.6; -39.8], i.e. $\theta_0 \approx 1840.6$ and $\theta_1 \approx -39.8$.
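As a usage sketch (assuming x, y and theta_vec from the snippet above), the fit can be used to predict the page views for a given page index and to plot the fitted line against the data:

page_idx = 10;                                      % any page index of interest
predicted_views = [1 page_idx]*theta_vec;           % h_theta(x) = theta_0 + theta_1*x
plot(x(:,2), y, 'b.', x(:,2), x*theta_vec, 'r-');   % data points vs fitted line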
Note:
a) $\nabla_\theta\left(\theta^T X^T X\theta\right) = 2X^T X\theta$
(Refer: Matrix calculus notes – University of Colorado)
b) $\nabla_\theta\left(y^T X\theta\right) = X^T y$
(Refer: Matrix Calculus Wiki)
References
An Application of Supervised Learning – Autonomous Driving (Video Lecture, Class 2)
CS 229 Machine Learning Course Materials
Matrix calculus notes – University of Colorado
You use:
theta_vec = inv(x'*x)*x'*y;
While that is theoretically correct, why not just use:
theta_vec = x \ y;
@Will: Yes, I could have used that. Just that I preferred to use the expanded version.
And for reference, from the Mathworks documentation on mldivide (\):
http://www.mathworks.in/help/techdoc/ref/mldivide.html
“If the equation Ax = b does not have a solution (and A is not a square matrix), x = A\b returns a least squares solution — in other words, a solution that minimizes the length of the vector Ax – b, which is equal to norm(A*x – b).”
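A quick sketch (assuming the same x and y as in the snippet in the post) to check that the two approaches agree:

theta_inv = inv(x'*x)*x'*y;        % explicit normal equation
theta_mld = x \ y;                 % least squares via mldivide
max(abs(theta_inv - theta_mld))    % should be close to zero, up to numerical noise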
Very nice post. Please correct me if I'm wrong, but it looks like you're simply deriving the linear least squares fit formula. It's probably worth mentioning in the article.
@Andrei: Yup, it turns out to be the closed form for least squares fit. Recall, we are trying to minimize the square of the errors, which is indeed least squares. I will add a note on that… 🙂
Hello, is it the same as the MATLAB function polyfit?
@linkin8834: I have not played with polyfit(), but reading the Mathworks documentation on polyfit(), both seem to be similar.
http://www.mathworks.in/help/techdoc/ref/polyfit.html
“p = polyfit(x,y,n) finds the coefficients of a polynomial p(x) of degree n that fits the data, p(x(i)) to y(i), in a least squares sense.”
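A rough sketch of how polyfit would be called here (assuming the page-view vector y from the snippet in the post); note that polyfit returns the coefficients from the highest degree down:

p = polyfit((1:50).', y, 1);         % degree-1 least squares fit
theta_from_polyfit = [p(2); p(1)];   % reorder to [theta_0; theta_1]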