Numeric optimization#

To find the optimal weights of the logistic regression, we can use the gradient descent algorithm. To apply this algorithm, one needs to calculate the gradient of the loss function.
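
Gradient descent then iterates the standard update (with \(\eta > 0\) denoting the learning rate, called learning_rate in the code below):

\[ \boldsymbol w^{(t+1)} = \boldsymbol w^{(t)} - \eta \nabla \mathcal L(\boldsymbol w^{(t)}). \]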

Binary logistic regression#

Multiply the loss function (25) by \(n\):

\[ \mathcal L(\boldsymbol w) = -\sum\limits_{i=1}^n \big(y_i \log(\sigma(\boldsymbol x_i^\top \boldsymbol w)) + (1- y_i)\log(1 - \sigma(\boldsymbol x_i^\top \boldsymbol w))\big). \]

To find \(\nabla \mathcal L(\boldsymbol w)\) observe that

\[ \nabla \log(\sigma(\boldsymbol x_i^\top \boldsymbol w)) = \frac {\nabla \sigma(\boldsymbol x_i^\top \boldsymbol w)}{\sigma(\boldsymbol x_i^\top \boldsymbol w)} = \frac{\sigma'(\boldsymbol x_i^\top \boldsymbol w) \nabla(\boldsymbol x_i^\top \boldsymbol w)}{{\sigma(\boldsymbol x_i^\top \boldsymbol w)}}. \]

Also,

\[ \nabla \log(1 - \sigma(\boldsymbol x_i^\top \boldsymbol w)) = -\frac {\nabla \sigma(\boldsymbol x_i^\top \boldsymbol w)}{1 - \sigma(\boldsymbol x_i^\top \boldsymbol w)} = -\frac{\sigma'(\boldsymbol x_i^\top \boldsymbol w) \nabla(\boldsymbol x_i^\top \boldsymbol w)}{1 - \sigma(\boldsymbol x_i^\top \boldsymbol w)}. \]

Q. What is \(\nabla(\boldsymbol x_i^\top \boldsymbol w)\)?

Since \(\sigma'(t) = \sigma(t)(1 - \sigma(t))\), putting it all together, we get

\[ \nabla \mathcal L(\boldsymbol w) = -\sum\limits_{i=1}^n \big(y_i(1 - \sigma(\boldsymbol x_i^\top \boldsymbol w))\boldsymbol x_i - (1-y_i)\sigma(\boldsymbol x_i^\top \boldsymbol w)\boldsymbol x_i\big) = \sum\limits_{i=1}^n (\sigma(\boldsymbol x_i^\top \boldsymbol w) - y_i)\boldsymbol x_i. \]

Question

How can \(\nabla \mathcal L(\boldsymbol w)\) be written as a product of a matrix and a vector, avoiding the explicit summation?

Question

What is the Hessian \(\nabla^2 \mathcal L(\boldsymbol w)\)?
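
As a sanity check, the analytic gradient derived above can be compared with a finite-difference approximation of the loss on small synthetic data (a minimal sketch; the helpers loss and grad and the toy data are illustrative and not part of the original text):

import numpy as np
from scipy.special import expit

def loss(w, X, y):
    # negative log-likelihood summed over the samples
    p = expit(X.dot(w))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # sum_i (sigma(x_i^T w) - y_i) x_i, written as an explicit summation
    return sum((expit(x_i.dot(w)) - y_i) * x_i for x_i, y_i in zip(X, y))

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = rng.integers(0, 2, size=20)
w_toy = rng.normal(size=3)

# central finite differences along each coordinate direction
eps = 1e-6
finite_diff = np.array([
    (loss(w_toy + eps * e, X_toy, y_toy) - loss(w_toy - eps * e, X_toy, y_toy)) / (2 * eps)
    for e in np.eye(3)
])
np.allclose(finite_diff, grad(w_toy, X_toy, y_toy))  # expected to be True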

Breast cancer dataset: numeric optimization#

Fetch the dataset:

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
X.shape, y.shape
((569, 30), (569,))

Apply the gradient descent algorithm to logistic regression:

import numpy as np
from scipy.special import expit

def logistic_regression_gd(X, y, C=1, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # random initialization of the weights
    w = np.random.normal(size=X.shape[1])
    # gradient of the negative log-likelihood plus an L2 penalty term C * w
    # (note: here C multiplies the penalty, unlike sklearn's inverse-regularization C)
    gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    for i in range(max_iter):
        if np.linalg.norm(gradient) <= tol:
            return w
        w -= learning_rate * gradient
        # recompute the gradient at the updated weights
        gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    print("max_iter exceeded")
    return w

Fit the logistic regression on the whole dataset:

w = logistic_regression_gd(X, y, learning_rate=1e-9, max_iter=10**5)
w
max_iter exceeded
array([ 1.56038059, -0.2675925 , -0.49726469,  0.01489382,  1.40672143,
        0.49241637,  0.22854445, -1.31087023,  0.51856803,  0.10672971,
        1.71801498,  1.11273329,  1.67293441, -0.09067153, -0.41158665,
        1.14139333, -0.15143644,  0.42951181,  0.55172117,  0.24684593,
       -0.21485563,  0.03532925,  0.24591635, -0.00934616,  0.17156271,
        1.02783243,  0.91185502, -1.12774813,  0.68919211, -0.05270757])

Calculate the accuracy score:

from sklearn.metrics import accuracy_score
accuracy_score(y, expit(X.dot(w)) > 0.5)
0.4797891036906854

Compare with sklearn:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(fit_intercept=False, max_iter=5000)
log_reg.fit(X, y)
print(log_reg.score(X, y))
print(accuracy_score(y, log_reg.predict(X)))
log_reg.coef_
0.9595782073813708
0.9595782073813708
array([[ 2.11889390e+00,  1.22744492e-01, -6.67312763e-02,
        -3.27123359e-03, -1.50002223e-01, -4.12798324e-01,
        -6.42392447e-01, -3.32819680e-01, -2.20815385e-01,
        -2.73001384e-02, -1.78421266e-02,  1.32475836e+00,
         7.49355569e-04, -9.60832715e-02, -1.57868909e-02,
        -6.94498579e-03, -5.52225661e-02, -3.91172002e-02,
        -4.27511650e-02,  4.87715350e-03,  1.29800504e+00,
        -3.50698893e-01, -1.19614382e-01, -2.47450987e-02,
        -2.76962316e-01, -1.20191436e+00, -1.61528773e+00,
        -6.42735657e-01, -6.89168691e-01, -1.21859280e-01]])
np.linalg.norm(w - log_reg.coef_)
5.514889572272746
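
The gap is likely due in part to the raw features being on very different scales, so plain gradient descent with a single tiny learning rate converges very slowly. One possible remedy, not used in the experiment above, is to standardize the features first (a sketch assuming sklearn's StandardScaler; the learning rate below is a guess and may need tuning):

from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
w_scaled = logistic_regression_gd(X_scaled, y, learning_rate=1e-4, max_iter=10**5)
accuracy_score(y, expit(X_scaled.dot(w_scaled)) > 0.5)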

Multinomial logistic regression#

Recall that the loss function in this case is

\[\begin{split} \begin{multline*} \mathcal L(\boldsymbol W) = -\sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik} \bigg(\boldsymbol x_i^\top\boldsymbol w_{k} -\log\Big(\sum\limits_{\ell=1}^K \exp(\boldsymbol x_i^\top\boldsymbol w_{\ell})\Big)\bigg) = \\ = -\sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik} \bigg(\sum\limits_{j=1}^d x_{ij} w_{jk} -\log\Big(\sum\limits_{\ell=1}^K \exp\Big(\sum\limits_{j=1}^d x_{ij} w_{j\ell}\Big)\Big)\bigg) \end{multline*} \end{split}\]

One can show that

\[ \nabla \mathcal L(\boldsymbol W) = \boldsymbol X^\top (\boldsymbol {\widehat Y} - \boldsymbol Y) = \boldsymbol X^\top ( \sigma(\boldsymbol{XW}) - \boldsymbol Y). \]
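
This matrix form translates directly into a gradient-descent loop mirroring the binary implementation above (a sketch only; softmax_regression_gd and the one-hot label matrix Y are not part of the original text):

import numpy as np
from scipy.special import softmax

def softmax_regression_gd(X, Y, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # X has shape (n, d); Y has shape (n, K) and contains one-hot encoded labels
    W = np.zeros((X.shape[1], Y.shape[1]))
    for i in range(max_iter):
        P = softmax(X.dot(W), axis=1)   # predicted class probabilities, row-wise softmax
        gradient = X.T.dot(P - Y)       # X^T (Y_hat - Y)
        if np.linalg.norm(gradient) <= tol:
            return W
        W -= learning_rate * gradient
    print("max_iter exceeded")
    return W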

TODO

  • Try numerical optimization on several datasets

  • Apply Newton’s method and compare its performance with GD