Logistic regression#

Let \(\mathcal D = \{(\boldsymbol x_i, y_i)\}_{i=1}^n\), \(y_i \in \{0, 1\}\). The logistic regression model predicts the probability of the positive class:

\[ \widehat y = \sigma(\boldsymbol x^\top \boldsymbol w) = \mathbb P(\boldsymbol x \in \text{class }1), \]

where \(\sigma(t) = \frac 1{1 + e^{-t}}\) is the sigmoid function.


Q. What is \(\sigma'(t)\)?

The linear output \(\boldsymbol x^\top \boldsymbol w\) is also called the logit.
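A minimal NumPy sketch of this mapping from logit to probability (the weights and features below are made up for illustration):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

w = np.array([0.5, -1.2])   # illustrative weight vector
x = np.array([2.0, 1.0])    # illustrative feature vector

logit = x @ w               # linear output (logit)
print(sigmoid(logit))       # predicted probability of class 1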

The loss function is the binary cross-entropy:

\[\begin{split}
\mathcal L(\boldsymbol w) &= -\frac 1n\sum\limits_{i=1}^n \big(y_i \log \widehat y_i + (1-y_i)\log(1-\widehat y_i)\big) \\
&= -\frac 1n\sum\limits_{i=1}^n \big(y_i \log\sigma(\boldsymbol x_i^\top \boldsymbol w) + (1- y_i)\log(1 - \sigma(\boldsymbol x_i^\top \boldsymbol w))\big).
\end{split}\]
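A minimal sketch of evaluating this loss with NumPy, cross-checked against log_loss from sklearn.metrics (the labels and predicted probabilities below are made up):

import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1])                # true labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6])    # predicted probabilities of class 1

bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(bce, log_loss(y, y_hat))            # the two values coincide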

Question

How will the cross-entropy loss change if \(\mathcal Y = \{-1, 1\}\)?

Regularization#

The loss function for \(L_2\)-regularized logistic regression with \(\mathcal Y = \{-1, 1\}\) is

\[ \mathcal L(\boldsymbol w) = \frac 1n\sum\limits_{i=1}^n \log \big(1 + e^{-y_i \boldsymbol x_i^\top \boldsymbol w}\big) + C \boldsymbol w^\top \boldsymbol w. \]

There are also versions with an \(L_1\) penalty or an elastic net penalty.
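In scikit-learn the penalty is chosen via the penalty argument of LogisticRegression; a minimal sketch (not every solver supports every penalty, so the solver is set explicitly here):

from sklearn.linear_model import LogisticRegression

# L2 penalty is the default
l2_model = LogisticRegression(penalty="l2", C=1.0)

# L1 penalty requires a solver that supports it, e.g. liblinear or saga
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# Elastic net mixes L1 and L2; only the saga solver supports it
enet_model = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000)

Note that in scikit-learn the parameter C is the inverse of the regularization strength, so it plays the opposite role to the coefficient \(C\) in the formula above.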

Example: breast cancer dataset#

This is a dataset with \(30\) features and a binary target.

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
data['data'].shape, data['target'].shape
((569, 30), (569,))

Malignant or benign?

data.target_names
array(['malignant', 'benign'], dtype='<U9')
data.feature_names
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

Split the dataset into train and test parts:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)

Now take the logistic regression from sklearn:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
/builder/home/.local/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
LogisticRegression()

The default value of max_iter is \(100\), and here it is not enough for the solver to converge. However, the accuracy is not bad:

print("Train accuracy:", log_reg.score(X_train, y_train))
print("Test accuracy:", log_reg.score(X_test, y_test))
Train accuracy: 0.9384615384615385
Test accuracy: 0.9649122807017544
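One way to see that the solver stopped early is to inspect the fitted n_iter_ attribute (a quick check, not part of the original output):

print(log_reg.n_iter_)   # the solver ran for 100 iterations, i.e. it hit the max_iter limit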

Now increase the max_iter argument:

log_reg = LogisticRegression(max_iter=3000)
log_reg.fit(X_train, y_train)
LogisticRegression(max_iter=3000)

The improvement in accuracy does not seem significant:

print("Train accuracy:", log_reg.score(X_train, y_train))
print("Test accuracy:", log_reg.score(X_test, y_test))
Train accuracy: 0.9516483516483516
Test accuracy: 0.9649122807017544
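As the convergence warning suggests, another option is to standardize the features before fitting. A minimal sketch with a scikit-learn pipeline (the exact accuracies will depend on the random train/test split):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scale each feature to zero mean and unit variance, then fit logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print("Train accuracy:", pipe.score(X_train, y_train))
print("Test accuracy:", pipe.score(X_test, y_test))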