"""
====================================================
Multiclass sparse logistic regression on 20newgroups
====================================================
Comparison of multinomial logistic L1 vs one-versus-rest L1 logistic regression
to classify documents from the newgroups20 dataset. Multinomial logistic
regression yields more accurate results and is faster to train on the larger
scale dataset.
Here we use the l1 sparsity that trims the weights of not informative
features to zero. This is good if the goal is to extract the strongly
discriminative vocabulary of each class. If the goal is to get the best
predictive accuracy, it is better to use the non sparsity-inducing l2 penalty
instead.
A more traditional (and possibly better) way to predict on a sparse subset of
input features would be to use univariate feature selection followed by a
traditional (l2-penalised) logistic regression model.
"""
# Author: Arthur Mensch

import timeit
import warnings

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
t0 = timeit.default_timer()
# We use the SAGA solver: it supports the l1 penalty with both the multinomial
# and one-versus-rest formulations, and it handles sparse input efficiently.
solver = "saga"
# Turn down for faster run time
n_samples = 5000
X, y = fetch_20newsgroups_vectorized(subset="all", return_X_y=True)
X = X[:n_samples]
y = y[:n_samples]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y, test_size=0.1
)
train_samples, n_features = X_train.shape
n_classes = np.unique(y).shape[0]
print(
    "Dataset 20newsgroups, train_samples=%i, n_features=%i, n_classes=%i"
    % (train_samples, n_features, n_classes)
)
models = {
    "ovr": {"name": "One versus Rest", "iters": [1, 2, 3]},
    "multinomial": {"name": "Multinomial", "iters": [1, 2, 5]},
}
for model in models:
    # Add initial chance-level values for plotting purpose
    accuracies = [1 / n_classes]
    times = [0]
    densities = [1]

    model_params = models[model]

    # Small number of epochs for fast runtime
    for this_max_iter in model_params["iters"]:
        print(
            "[model=%s, solver=%s] Number of epochs: %s"
            % (model_params["name"], solver, this_max_iter)
        )
        lr = LogisticRegression(
            solver=solver,
            multi_class=model,
            penalty="l1",
            max_iter=this_max_iter,
            random_state=42,
        )
        t1 = timeit.default_timer()
        lr.fit(X_train, y_train)
        train_time = timeit.default_timer() - t1

        y_pred = lr.predict(X_test)
        accuracy = np.sum(y_pred == y_test) / y_test.shape[0]
        density = np.mean(lr.coef_ != 0, axis=1) * 100
        accuracies.append(accuracy)
        densities.append(density)
        times.append(train_time)

    models[model]["times"] = times
    models[model]["densities"] = densities
    models[model]["accuracies"] = accuracies

    print("Test accuracy for model %s: %.4f" % (model, accuracies[-1]))
    print(
        "%% non-zero coefficients for model %s, per class:\n %s"
        % (model, densities[-1])
    )
    print(
        "Run time (%i epochs) for model %s: %.2f"
        % (model_params["iters"][-1], model, times[-1])
    )
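
# The l1 penalty drives many weights exactly to zero, so the surviving non-zero
# coefficients can be read as the most discriminative features of each class
# (see the docstring). A minimal, illustrative sketch: `lr` is simply the last
# model fitted in the loop above, and looking at the 5 largest coefficients of
# the first 3 classes is an arbitrary choice.
top_k = 5
for class_idx, class_coef in enumerate(lr.coef_[:3]):
    top_features = np.argsort(np.abs(class_coef))[-top_k:][::-1]
    print(
        "Class %i: indices of the %i largest coefficients: %s"
        % (class_idx, top_k, top_features)
    )
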
fig = plt.figure()
ax = fig.add_subplot(111)
for model in models:
    name = models[model]["name"]
    times = models[model]["times"]
    accuracies = models[model]["accuracies"]
    ax.plot(times, accuracies, marker="o", label="Model: %s" % name)
ax.set_xlabel("Train time (s)")
ax.set_ylabel("Test accuracy")
ax.legend()
fig.suptitle("Multinomial vs One-vs-Rest Logistic L1\nDataset %s" % "20newsgroups")
fig.tight_layout()
fig.subplots_adjust(top=0.85)
run_time = timeit.default_timer() - t0
print("Example run in %.3f s" % run_time)
plt.show()
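
# The docstring mentions a more traditional route to a sparse model: univariate
# feature selection followed by an l2-penalised logistic regression. Below is a
# minimal sketch of that pipeline; the number of selected features (k=1000) and
# the epoch budget are arbitrary illustrative choices, not tuned values.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

select_then_l2 = make_pipeline(
    SelectKBest(chi2, k=1000),
    LogisticRegression(solver=solver, penalty="l2", max_iter=5, random_state=42),
)
select_then_l2.fit(X_train, y_train)
print(
    "Test accuracy for SelectKBest(chi2) + l2 logistic regression: %.4f"
    % select_then_l2.score(X_test, y_test)
)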