Numeric optimization#

For computational reasons it can be impractical to calculate \(\widehat{\boldsymbol w}\) directly by the formula (19). In that case a numerical optimization method can be used instead. The most common one is the gradient descent algorithm.

For ordinary linear regression the loss function is proportional to

\[ \mathcal L(\boldsymbol w) = \frac 12\Vert\boldsymbol {Xw} - \boldsymbol y \Vert_2^2, \]

therefore, \(\nabla \mathcal L(\boldsymbol w) = \boldsymbol X^\top(\boldsymbol{Xw} - \boldsymbol y)\).
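As a quick sanity check of this gradient formula, we can compare it with a finite-difference approximation on small throwaway data (the arrays below are made up just for this check and are overwritten later):

import numpy as np

# Throwaway data for checking the analytic gradient of 1/2 * ||Xw - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
w = rng.normal(size=5)

def loss(w):
    return 0.5 * np.sum((X.dot(w) - y) ** 2)

analytic = X.T.dot(X.dot(w) - y)

# Central finite differences along each coordinate
eps = 1e-6
numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(len(w))])
print(np.max(np.abs(analytic - numeric)))  # should be tiny (~1e-6 or less)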

Gradient descent#

Now let’s implement the gradient descent algorithm for this objective:

import numpy as np

def linear_regression_gd(X, y, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # Start from random weights
    w = np.random.normal(size=X.shape[1])
    # Gradient of 1/2 * ||Xw - y||^2
    gradient = X.T.dot(X.dot(w) - y)
    for i in range(max_iter):
        # Stop once the gradient norm is small enough
        if np.linalg.norm(gradient) <= tol:
            return w
        # Step against the gradient and recompute it
        w -= learning_rate * gradient
        gradient = X.T.dot(X.dot(w) - y)
    print("max_iter exceeded")
    return w

Try it on some synthetic data:

X = np.random.normal(size=(50, 20))
y = np.random.rand(50)
w = linear_regression_gd(X, y, tol=1e-4)
print("MSE:", np.mean((X.dot(w) - y)**2))
MSE: 0.17863408307966058
from sklearn.linear_model import LinearRegression
LR = LinearRegression(fit_intercept=False)
LR.fit(X, y)
print("r-score:", LR.score(X, y))
print("MSE:", np.mean((LR.predict(X) - y) ** 2))
print("Difference from w:", np.linalg.norm(LR.coef_ - w))
r-score: -1.2670845255665677
MSE: 0.17863408305332582
Difference from w: 1.3493667742254247e-05

Now apply it to the Boston dataset:

import pandas as pd
boston = pd.read_csv("../ISLP_datsets/Boston.csv").drop("Unnamed: 0", axis=1)
target = boston['medv']
train = boston.drop(['medv'], axis=1)
X_train = np.hstack([np.ones(506)[:, None], train])
weights = linear_regression_gd(X_train, target.values, learning_rate=9e-9)
print("MSE:", np.mean((X_train.dot(weights) - target)**2))
max_iter exceeded
MSE: 54.281765917872875

This does not look so good: the algorithm exhausted max_iter before the gradient norm reached the tolerance. The Boston features live on very different scales, so the loss is badly conditioned and the learning rate has to be tiny (9e-9 here), which makes convergence painfully slow.
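One possible remedy is to standardize the features and pick the step size from the largest singular value of the design matrix. The sketch below is my own addition, not an experiment from this text, and no outputs are claimed:

# Hypothetical follow-up: standardize the features so the loss is better conditioned.
# The step size 1 / sigma_max^2 is the classical "1/L" choice for this quadratic loss.
X_scaled = np.hstack([np.ones((len(train), 1)), (train - train.mean()) / train.std()])
lr = 1.0 / np.linalg.norm(X_scaled, 2) ** 2
weights_scaled = linear_regression_gd(X_scaled, target.values, learning_rate=lr)
print("MSE:", np.mean((X_scaled.dot(weights_scaled) - target.values) ** 2))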

Newton’s method#

TODO

  • Add more experiments for gradient descent along with visualizations

  • Apply Newton’s method to linear regression (a preliminary sketch follows this list)

  • Add regularization

  • Compare the performance of GD and Newton’s method
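As a head start on Newton’s method, here is a minimal sketch for this loss (my own sketch, not the book’s implementation). Since the loss is quadratic, the Hessian is the constant matrix \(\boldsymbol X^\top \boldsymbol X\), so a single Newton step from any starting point already lands on the least-squares solution.

# A minimal sketch of Newton's method for 1/2 * ||Xw - y||^2 (illustrative only).
# The Hessian X^T X is constant, so in exact arithmetic one step is enough;
# the loop is kept only to match the usual Newton template.
def linear_regression_newton(X, y, tol=1e-3, max_iter=100):
    w = np.zeros(X.shape[1])
    hessian = X.T.dot(X)
    for i in range(max_iter):
        gradient = X.T.dot(X.dot(w) - y)
        if np.linalg.norm(gradient) <= tol:
            return w
        # Solve H * step = gradient instead of forming the inverse explicitly
        w -= np.linalg.solve(hessian, gradient)
    print("max_iter exceeded")
    return w

Because the Hessian does not depend on \(\boldsymbol w\), this amounts to solving the normal equations with np.linalg.solve, which also hints at why Newton’s method is mostly interesting for non-quadratic losses.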