Newton’s method to find square root, inverse

Some of us would have used Newton’s method (also known as Newton-Raphson method) in some form or other. The method has quite a bit of history,  starting with the Babylonian way of finding the square root and later over centuries reaching the present recursive way of finding the solution. In this post, we will describe Newton’s method and apply it to find the square root and the inverse of a number.

Geometrical interpretation

We know that the derivative of a function at  is the slope of the tangent (red line) at  i.e.,

.

Rearranging, the intercept of the tangent at x-axis is,

.

From the figure above, can see that the tangent (red line) intercepts the x-axis at  which is closer to  the  where compared to . Keep on doing this operation recursively, and it converges to the zero of the function OR in another words the root of the function.

In general for iteration, the equation is :

.

 

Finding square root

Let us, for example try to use this method for finding the square root of D=100. The function to zero out in the Newton’s method frame work  is,

, where .

The first derivative is

.

The recursive equation is,

.

Matlab code snippet

clear ; close all;
D = 100; % number to find the square root
x = 1; % initial value
for ii = 1:10
	fx = x.^2 - D;
	f1x = 2*x;
	x = x - fx/f1x;
	x_v(ii) = x;
end
x_v' =
   50.5000000000000
   26.2400990099010
   15.0255301199868
   10.8404346730269
   10.0325785109606
   10.0000528956427
   10.0000000001399
   10.0000000000000
   10.0000000000000
   10.0000000000000

We can see that the it converges within around 8 iterations. Further, playing around with the initial value,

a) if we start with initial value of x = -1, then we will converge to -10.

b) if we start with initial value of x = 0, then we will not converge

and so on…

Finding inverse (division)

Newton’s method can be used to find the inverse of a variable D. One way to write the function to zero out is, but we soon realize that this does not work as we need know in the first place.

Alternatively the function to zero out can be written as,

.

The first derivative is,

.

The equation in the recursive form is,

.

Matlab code snippet

clear ; close all;
D = .1; % number to find the square root
x = [.1:.2:1]; % initial value
for ii = 1:10
	fx = (1./x) - D;
	f1x = -1./x.^2;
	x = x - fx./f1x;
	x_v(:,ii) = x;
end
plot(x_v');
legend('0.1', '0.3', '0.5', '0.7', '0.9');
grid on; xlabel('number of iterations'); ylabel('inverse');
title('finding inverse newton''s method');

The following plot shows the convergence of inverse computation to the right value for different values of for this example matlab code snippet.


Figure : convergence of inverse computation

Finding the minima of a function

To find the minima of a function, we to find where the derivative of the function becomes zero i.e.  .

Using Newton’s method, the recursive equation becomes :

.

Thoughts

We have  briefly gone through the Newton’s method and its applications to find the roots of a function, inverse, minima etc. However, there are quite a few aspects which we did not go over, like :

a) Impact of the initial value on the convergence of the function

b) Rate of the convergence

c) Error bounds of the converged result

d) Conditions where the convergence does not happen

and so on…

Hoping to discuss those in another post… 🙂

References

Wiki on Newton’s method

Closed form solution for linear regression

In the previous post on Batch Gradient Descent and Stochastic Gradient Descent, we looked at two iterative methods for finding the parameter vector  which minimizes the square of the error between the predicted value  and the actual output  for all  values in the training set.

A closed form solution for finding the parameter vector  is possible, and in this post let us explore that. Ofcourse, I thank Prof. Andrew Ng for putting all these material available on public domain (Lecture Notes 1).retrieved 11th Sep 2024

Continue reading “Closed form solution for linear regression”

Stochastic Gradient Descent

For curve fitting using linear regression, there exists a minor variant of Batch Gradient Descent algorithm, called Stochastic Gradient Descent.

In the Batch Gradient Descent, the parameter vector  is updated as,

.

(loop over all elements of training set in one iteration)

For Stochastic Gradient Descent, the vector gets updated as, at each iteration the algorithm goes over only one among  training set, i.e.

.

When the training set is large, Stochastic Gradient Descent can be useful (as we need not go over the full data to get the first set of the parameter vector )

For the same Matlab example used in the previous post, we can see that both batch and stochastic gradient descent converged to reasonably close values.

Matlab/Octave code snippet

clear ;
close all;
x = [1:50].';
y = [4554 3014 2171 1891 1593 1532 1416 1326 1297 1266 ...
	1248 1052 951 936 918 797 743 665 662 652 ...
	629 609 596 590 582 547 486 471 462 435 ...
	424 403 400 386 386 384 384 383 370 365 ...
	360 358 354 347 320 319 318 311 307 290 ].';
%
m = length(y); % store the number of training examples
x = [ ones(m,1) x]; % Add a column of ones to x
n = size(x,2); % number of features
theta_batch_vec = [0 0]';
theta_stoch_vec = [0 0]';
alpha = 0.002;
err = [0 0]';
theta_batch_vec_v = zeros(10000,2);
theta_stoch_vec_v = zeros(50*10000,2);
for kk = 1:10000
	% batch gradient descent - loop over all training set
	h_theta_batch = (x*theta_batch_vec);
	h_theta_batch_v = h_theta_batch*ones(1,n);
	y_v = y*ones(1,n);
	theta_batch_vec = theta_batch_vec - alpha*1/m*sum((h_theta_batch_v - y_v).*x).';
	theta_batch_vec_v(kk,:) = theta_batch_vec;
	%j_theta_batch(kk) = 1/(2*m)*sum((h_theta_batch - y).^2);

	% stochastic gradient descent - loop over one training set at a time
	for (jj = 1:50)
		h_theta_stoch = (x(jj,:)*theta_stoch_vec);
		h_theta_stoch_v = h_theta_stoch*ones(1,n);
		y_v = y(jj,:)*ones(1,n);
		theta_stoch_vec = theta_stoch_vec - alpha*1/m*((h_theta_stoch_v - y_v).*x(jj,:)).';
		%j_theta_stoch(kk,jj) = 1/(2*m)*sum((h_theta_stoch - y).^2);
		theta_stoch_vec_v(50*(kk-1)+jj,:) = theta_stoch_vec;
	end
end

figure;
plot(x(:,2),y,'bs-');
hold on
plot(x(:,2),x*theta_batch_vec,'md-');
plot(x(:,2),x*theta_stoch_vec,'rp-');
legend('measured', 'predicted-batch','predicted-stochastic');
grid on;
xlabel('Page index, x');
ylabel('Page views, y');
title('Measured and predicted page views');

j_theta = zeros(250, 250);   % initialize j_theta
theta0_vals = linspace(-2500, 2500, 250);
theta1_vals = linspace(-50, 50, 250);
for i = 1:length(theta0_vals)
	  for j = 1:length(theta1_vals)
		theta_val_vec = [theta0_vals(i) theta1_vals(j)]';
		h_theta = (x*theta_val_vec);
		j_theta(i,j) = 1/(2*m)*sum((h_theta - y).^2);
    end
end
figure;
contour(theta0_vals,theta1_vals,10*log10(j_theta.'))
xlabel('theta_0'); ylabel('theta_1')
title('Cost function J(theta)');
hold on;
plot(theta_stoch_vec_v(:,1),theta_stoch_vec_v(:,2),'rs.');
plot(theta_batch_vec_v(:,1),theta_batch_vec_v(:,2),'kx.');

The converged values are :

.

 

Measured and predicted pageviews Batch and Stochastic gradient descent

From the below plot, we can see that Batch Gradient Descent (black line) goes straight down to the minima, whereas Stochastic Gradient Descent (red line) keeps hovering around (thick red line) before going down to the minima.

References

An Application of Supervised Learning – Autonomous Deriving (Video Lecture, Class2)

CS 229 Machine Learning Course Materials

 

 

 

 

Batch Gradient Descent

I happened to stumble on Prof. Andrew Ng’s Machine Learning classes which are available online as part of Stanford Center for Professional Development. The first lecture in the series discuss the topic of fitting parameters for a given data set using linear regression.  For understanding this concept, I chose to take data from the top 50 articles of this blog based on the pageviews in the month of September 2011.

Continue reading “Batch Gradient Descent”