Numeric optimization#

To find the optimal weights of the logistic regression, we can use the gradient descent algorithm. To apply this algorithm, one needs to calculate the gradient of the loss function.
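
Gradient descent then iterates the standard update (with \(\eta > 0\) denoting the learning rate, called learning_rate in the code below):

\[ \boldsymbol w^{(t+1)} = \boldsymbol w^{(t)} - \eta \nabla \mathcal L(\boldsymbol w^{(t)}). \]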

Binary logistic regression#

Multiply the loss function (25) by \(n\):

\[ \mathcal L(\boldsymbol w) = -\sum\limits_{i=1}^n \big(y_i \log(\sigma(\boldsymbol x_i^\top \boldsymbol w)) + (1- y_i)\log(1 - \sigma(\boldsymbol x_i^\top \boldsymbol w))\big). \]

To find \(\nabla \mathcal L(\boldsymbol w)\) observe that

\[ \nabla \log(\sigma(\boldsymbol x_i^\top \boldsymbol w)) = \frac {\nabla \sigma(\boldsymbol x_i^\top \boldsymbol w)}{\sigma(\boldsymbol x_i^\top \boldsymbol w)} = \frac{\sigma'(\boldsymbol x_i^\top \boldsymbol w) \nabla(\boldsymbol x_i^\top \boldsymbol w)}{{\sigma(\boldsymbol x_i^\top \boldsymbol w)}}. \]

Also,

\[ \nabla \log(1 - \sigma(\boldsymbol x_i^\top \boldsymbol w)) = -\frac {\nabla \sigma(\boldsymbol x_i^\top \boldsymbol w)}{1 - \sigma(\boldsymbol x_i^\top \boldsymbol w)} = -\frac{\sigma'(\boldsymbol x_i^\top \boldsymbol w) \nabla(\boldsymbol x_i^\top \boldsymbol w)}{1 - \sigma(\boldsymbol x_i^\top \boldsymbol w)}. \]

Q. What is \(\nabla(\boldsymbol x_i^\top \boldsymbol w)\)?

Since \(\sigma'(t) = \sigma(t)(1 - \sigma(t))\), putting it all together, we get

\[ \nabla \mathcal L(\boldsymbol w) = -\sum\limits_{i=1}^n \big(y_i(1 - \sigma(\boldsymbol x_i^\top \boldsymbol w))\boldsymbol x_i - (1-y_i)\sigma(\boldsymbol x_i^\top \boldsymbol w)\boldsymbol x_i\big) = \sum\limits_{i=1}^n (\sigma(\boldsymbol x_i^\top \boldsymbol w) - y_i)\boldsymbol x_i. \]

Question

How can \(\nabla \mathcal L(\boldsymbol w)\) be written as a product of a matrix and a vector, avoiding the explicit summation?

Question

What is the Hessian \(\nabla^2 \mathcal L(\boldsymbol w)\)?
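
As a sanity check, the analytic gradient derived above can be compared with a finite-difference approximation of the loss on small synthetic data (a minimal sketch; the helpers loss and grad and the toy data are illustrative and not part of the original text):

import numpy as np
from scipy.special import expit

def loss(w, X, y):
    # negative log-likelihood summed over the samples
    p = expit(X.dot(w))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # sum_i (sigma(x_i^T w) - y_i) x_i, written as an explicit summation
    return sum((expit(x_i.dot(w)) - y_i) * x_i for x_i, y_i in zip(X, y))

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = rng.integers(0, 2, size=20)
w_toy = rng.normal(size=3)

# central finite differences along each coordinate direction
eps = 1e-6
finite_diff = np.array([
    (loss(w_toy + eps * e, X_toy, y_toy) - loss(w_toy - eps * e, X_toy, y_toy)) / (2 * eps)
    for e in np.eye(3)
])
np.allclose(finite_diff, grad(w_toy, X_toy, y_toy))  # expected to be True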

Breast cancer dataset: numeric optimization#

Fetch the dataset:

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
X.shape, y.shape
((569, 30), (569,))

Apply the gradient descent algorithm to logistic regression:

import numpy as np
from scipy.special import expit

def logistic_regression_gd(X, y, C=1, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # random initialization of the weights
    w = np.random.normal(size=X.shape[1])
    # gradient of the negative log-likelihood plus an L2 penalty term C * w
    # (note: here C multiplies the penalty, unlike sklearn's inverse-regularization C)
    gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    for i in range(max_iter):
        if np.linalg.norm(gradient) <= tol:
            return w
        w -= learning_rate * gradient
        # recompute the gradient at the updated weights
        gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    print("max_iter exceeded")
    return w

Fit the logistic regression on the whole dataset:

w = logistic_regression_gd(X, y, learning_rate=1e-9, max_iter=10**5)
w
max_iter exceeded
array([ 1.56038059, -0.2675925 , -0.49726469,  0.01489382,  1.40672143,
        0.49241637,  0.22854445, -1.31087023,  0.51856803,  0.10672971,
        1.71801498,  1.11273329,  1.67293441, -0.09067153, -0.41158665,
        1.14139333, -0.15143644,  0.42951181,  0.55172117,  0.24684593,
       -0.21485563,  0.03532925,  0.24591635, -0.00934616,  0.17156271,
        1.02783243,  0.91185502, -1.12774813,  0.68919211, -0.05270757])

Calculate the accuracy score:

from sklearn.metrics import accuracy_score
accuracy_score(y, expit(X.dot(w)) > 0.5)
0.4797891036906854

Compare with sklearn:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(fit_intercept=False, max_iter=5000)
log_reg.fit(X, y)
print(log_reg.score(X, y))
print(accuracy_score(y, log_reg.predict(X)))
log_reg.coef_
0.9595782073813708
0.9595782073813708
array([[ 2.11889390e+00,  1.22744492e-01, -6.67312763e-02,
        -3.27123359e-03, -1.50002223e-01, -4.12798324e-01,
        -6.42392447e-01, -3.32819680e-01, -2.20815385e-01,
        -2.73001384e-02, -1.78421266e-02,  1.32475836e+00,
         7.49355569e-04, -9.60832715e-02, -1.57868909e-02,
        -6.94498579e-03, -5.52225661e-02, -3.91172002e-02,
        -4.27511650e-02,  4.87715350e-03,  1.29800504e+00,
        -3.50698893e-01, -1.19614382e-01, -2.47450987e-02,
        -2.76962316e-01, -1.20191436e+00, -1.61528773e+00,
        -6.42735657e-01, -6.89168691e-01, -1.21859280e-01]])
np.linalg.norm(w - log_reg.coef_)
5.514889572272746
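
The gap is likely due in part to the raw features being on very different scales, so plain gradient descent with a single tiny learning rate converges very slowly. One possible remedy, not used in the experiment above, is to standardize the features first (a sketch assuming sklearn's StandardScaler; the learning rate below is a guess and may need tuning):

from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
w_scaled = logistic_regression_gd(X_scaled, y, learning_rate=1e-4, max_iter=10**5)
accuracy_score(y, expit(X_scaled.dot(w_scaled)) > 0.5)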

Multinomial logistic regression#

Recall that the loss function in this case is

\[\begin{split} \begin{multline*} \mathcal L(\boldsymbol W) = -\sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik} \bigg(\boldsymbol x_i^\top\boldsymbol w_{k} -\log\Big(\sum\limits_{\ell=1}^K \exp(\boldsymbol x_i^\top\boldsymbol w_{\ell})\Big)\bigg) = \\ = -\sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik} \bigg(\sum\limits_{j=1}^d x_{ij} w_{jk} -\log\Big(\sum\limits_{\ell=1}^K \exp\Big(\sum\limits_{j=1}^d x_{ij} w_{j\ell}\Big)\Big)\bigg) \end{multline*} \end{split}\]

One can show that

\[ \nabla \mathcal L(\boldsymbol W) = \boldsymbol X^\top (\boldsymbol {\widehat Y} - \boldsymbol Y) = \boldsymbol X^\top ( \sigma(\boldsymbol{XW}) - \boldsymbol Y). \]
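
This matrix form translates directly into a gradient-descent loop mirroring the binary implementation above (a sketch only; softmax_regression_gd and the one-hot label matrix Y are not part of the original text):

import numpy as np
from scipy.special import softmax

def softmax_regression_gd(X, Y, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # X has shape (n, d); Y has shape (n, K) and contains one-hot encoded labels
    W = np.zeros((X.shape[1], Y.shape[1]))
    for i in range(max_iter):
        P = softmax(X.dot(W), axis=1)   # predicted class probabilities, row-wise softmax
        gradient = X.T.dot(P - Y)       # X^T (Y_hat - Y)
        if np.linalg.norm(gradient) <= tol:
            return W
        W -= learning_rate * gradient
    print("max_iter exceeded")
    return W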

TODO

  • Try numerical optimization on several datasets

  • Apply Newton’s method and compare its performance with GD