Numeric optimization#
For computational reasons it can be impractical to calculate \(\widehat{\boldsymbol w}\) by the formula (19): for instance, when the number of features is large, inverting \(\boldsymbol X^\top \boldsymbol X\) becomes expensive and numerically unstable. In such cases a numerical optimization method can be used instead. The most common one is the gradient descent algorithm.
For ordinary linear regression the loss function is proportional to
\[
\mathcal L(\boldsymbol w) = \frac 12\Vert\boldsymbol {Xw} - \boldsymbol y \Vert_2^2,
\]
therefore, \(\nabla \mathcal L(\boldsymbol w) = \boldsymbol X^\top(\boldsymbol{Xw} - \boldsymbol y)\).
Gradient descent#
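Starting from some initial guess \(\boldsymbol w_0\), gradient descent repeatedly steps against the gradient,

\[
\boldsymbol w_{k+1} = \boldsymbol w_k - \eta \nabla \mathcal L(\boldsymbol w_k) = \boldsymbol w_k - \eta \boldsymbol X^\top(\boldsymbol{Xw}_k - \boldsymbol y),
\]

where \(\eta > 0\) is the learning rate, and stops once the gradient norm drops below a tolerance.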
Now let’s implement the gradient descent algorithm for this objective:
import numpy as np

def linear_regression_gd(X, y, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # random starting point
    w = np.random.normal(size=X.shape[1])
    gradient = X.T.dot(X.dot(w) - y)
    for i in range(max_iter):
        # stop as soon as the gradient is small enough
        if np.linalg.norm(gradient) <= tol:
            return w
        # step against the gradient and recompute it
        w -= learning_rate * gradient
        gradient = X.T.dot(X.dot(w) - y)
    print("max_iter exceeded")
    return w
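A note on the step size: because the loss is quadratic, gradient descent with a constant step converges exactly when the learning rate is below \(2/\lambda_{\max}(\boldsymbol X^\top \boldsymbol X)\). A small hypothetical helper (the name safe_learning_rate is ours, not from the text) to pick a conservative value:

def safe_learning_rate(X):
    # lambda_max(X^T X) equals the squared spectral norm of X,
    # so 1 / lambda_max is a safely convergent step size
    return 1.0 / np.linalg.norm(X, 2) ** 2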
Try it on some synthetic data:
X = np.random.normal(size=(50, 20))
y = np.random.rand(50)
w = linear_regression_gd(X, y, tol=1e-4)
print("MSE:", np.mean((X.dot(w) - y)**2))
MSE: 0.17863408307966058
from sklearn.linear_model import LinearRegression

# fit_intercept=False to match our model, which has no bias term
LR = LinearRegression(fit_intercept=False)
LR.fit(X, y)
print("r-score:", LR.score(X, y))
print("MSE:", np.mean((LR.predict(X) - y) ** 2))
print("Difference from w:", np.linalg.norm(LR.coef_ - w))
r-score: -1.2670845255665677
MSE: 0.17863408305332582
Difference from w: 1.3493667742254247e-05
The negative r-score is expected here: the targets are pure noise, and with fit_intercept=False the model can do worse than the constant baseline \(\bar y\) used by the score. What matters is that the weights agree with the closed-form solution almost exactly.

Now apply it to the Boston dataset:
import pandas as pd

boston = pd.read_csv("../ISLP_datsets/Boston.csv").drop("Unnamed: 0", axis=1)
target = boston['medv']
train = boston.drop(['medv'], axis=1)
# prepend a column of ones so the first weight acts as an intercept
X_train = np.hstack([np.ones((len(train), 1)), train])
weights = linear_regression_gd(X_train, target.values, learning_rate=9e-9)
print("MSE:", np.mean((X_train.dot(weights) - target)**2))
max_iter exceeded
MSE: 54.281765917872875
Looks not so good… The culprit is feature scaling: the Boston columns live on very different scales, which makes \(\boldsymbol X^\top \boldsymbol X\) badly conditioned. The learning rate has to be as tiny as 9e-9 to keep the iterations from diverging, and after 10000 iterations we are still far from the least-squares optimum.
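One standard remedy is to rescale the features before running gradient descent. A minimal sketch, reusing train and target from above; the learning rate 1e-4 is an assumed value and may need tuning:

# standardize each feature to zero mean and unit variance;
# this makes X^T X much better conditioned, allowing a larger step size
X_scaled = np.hstack([
    np.ones((len(train), 1)),
    (train - train.mean()) / train.std(),
])
weights_scaled = linear_regression_gd(X_scaled, target.values, learning_rate=1e-4)
print("MSE:", np.mean((X_scaled.dot(weights_scaled) - target)**2))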
Newton’s method#
TODO

- Add more experiments for gradient descent along with visualizations
- Apply Newton’s method to linear regression (see the sketch below)
- Add regularization
- Compare the performance of GD and Newton’s method
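As a preview of the second item: the loss \(\mathcal L\) is quadratic, so its Hessian \(\nabla^2 \mathcal L(\boldsymbol w) = \boldsymbol X^\top \boldsymbol X\) is constant, and a single Newton step from any starting point lands exactly on the least-squares solution. A minimal sketch (the function name is ours, not from the text):

def linear_regression_newton(X, y):
    # Newton step from w = 0: the gradient there is -X^T y and the
    # Hessian is X^T X, so one step solves the normal equations
    # X^T X w = X^T y; np.linalg.solve avoids forming the inverse
    return np.linalg.solve(X.T.dot(X), X.T.dot(y))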