THE LOGISTIC REGRESSION GUIDE

How to Improve Logistic Regression?

Section 3: Tuning the Model in Python

Kopal Jain · Published in Analytics Vidhya · 4 min read · Jan 11, 2021


Before continuing, refer to How to Implement Logistic Regression? Section 2: Building the Model in Python.

[10] Define Grid Search Parameters

param_grid_lr = {
    'max_iter': [20, 50, 100, 200, 500, 1000],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'class_weight': ['balanced']
}
  • max_iter is the maximum number of iterations allowed for the solver to converge.
  • solver is the optimization algorithm used to fit the model.
  • class_weight adjusts for imbalanced classes; 'balanced' weights each class inversely proportional to its frequency in the data.

Why this step: To define the parameter values the grid search will combine and evaluate when looking for the optimal combination. The sklearn.linear_model.LogisticRegression documentation provides a complete list of parameters, with descriptions, that can be used in grid search functionalities.
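If you prefer to inspect the available parameters from code rather than the documentation, every scikit-learn estimator exposes get_params(); a minimal sketch:

from sklearn.linear_model import LogisticRegression
# The keys of get_params() are the constructor arguments that can appear in a grid
print(LogisticRegression().get_params().keys())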

[11] Hyperparameter Tuning Using the Training Data

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

logModel_grid = GridSearchCV(estimator=LogisticRegression(random_state=1234),
                             param_grid=param_grid_lr, verbose=1, cv=10, n_jobs=-1)
logModel_grid.fit(X_train, y_train)
print(logModel_grid.best_estimator_)
...
Fitting 10 folds for each of 30 candidates, totalling 300 fits
LogisticRegression(C=1.0, class_weight='balanced', dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=20, multi_class='warn',
                   n_jobs=None, penalty='l2', random_state=1234, solver='newton-cg',
                   tol=0.0001, verbose=0, warm_start=False)

Note: The total number of fits is 300 because cv is set to 10 and there are 30 candidate combinations (max_iter has 6 values, solver has 5 values, and class_weight has 1 value). Therefore, the total number of fits is 10 x [6 x 5 x 1] = 300.
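To sanity-check this arithmetic directly from the grid dictionary, the candidate count is just the product of the list lengths in param_grid_lr; a minimal sketch:

from math import prod
# Number of candidate combinations = product of the number of values per parameter
n_candidates = prod(len(values) for values in param_grid_lr.values())  # 6 * 5 * 1 = 30
print(n_candidates * 10)  # 30 candidates x 10 folds = 300 fits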

  • estimator is the machine learning model of interest, provided the model has a scoring function; in this case, the model assigned is LogisticRegression().
  • random_state is the seed of the pseudo-random number generator to use when shuffling the data. To avoid variances in model numeric evaluation output, set the seed to a consistent number for model-to-model comparison; in this case, the number is set to 1234.
  • param_grid is a dictionary with parameters names (string) as keys and lists of parameter settings to try as values; this enables searching over any sequence of parameter settings.
  • verbose is the verbosity: the higher, the more messages; in this case, it is set to 1.
  • cv determines the cross-validation splitting strategy (an integer, a cross-validation generator, or an iterable); in this case, 10-fold cross-validation is used.
  • n_jobs is the maximum number of concurrently running workers; in this case, it is set to -1 which implies that all CPUs are used.

Why this step: To find an optimal combination of hyperparameters that minimizes a predefined loss function to give better results.
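Beyond best_estimator_, the fitted GridSearchCV object also exposes the winning parameter combination and its mean cross-validated score; a minimal sketch (the commented output reflects the best estimator printed above):

# Best parameter combination found by the search
print(logModel_grid.best_params_)  # e.g. {'class_weight': 'balanced', 'max_iter': 20, 'solver': 'newton-cg'}
# Mean cross-validated score of the best combination
print(logModel_grid.best_score_)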

[12] Predict on Testing Data

y_pred = logModel_grid.predict(X_test)
print(y_pred)
...
[1 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0 1 1 1 0 1 0 0 0 1 1 1 0 0 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0]

Why this step: To obtain model prediction on testing data to evaluate the model’s accuracy and efficiency.
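If class probabilities are needed instead of hard 0/1 labels (for example, to experiment with a different decision threshold), the tuned model also supports predict_proba; a minimal sketch:

# Probability of the positive class for each test sample
y_prob = logModel_grid.predict_proba(X_test)[:, 1]
print(y_prob[:5])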

[13] Numeric Analysis

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred), ": is the confusion matrix \n")
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred), ": is the accuracy score")
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred), ": is the precision score")
from sklearn.metrics import recall_score
print(recall_score(y_test, y_pred), ": is the recall score")
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred), ": is the f1 score")
...
[[85 21]
 [12 90]] : is the confusion matrix

0.8413461538461539 : is the accuracy score
0.8108108108108109 : is the precision score
0.8823529411764706 : is the recall score
0.8450704225352113 : is the f1 score

Note: Using the confusion matrix, the True Negative, False Positive, False Negative, and True Positive values can be extracted, which aid in calculating the accuracy score, precision score, recall score, and F1 score. In scikit-learn's convention, rows are actual labels and columns are predicted labels, so the matrix above reads:

  • True Negative = 85
  • False Positive = 21
  • False Negative = 12
  • True Positive = 90
Equations for Accuracy, Precision, Recall, and F1:

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 = 2 x (Precision x Recall) / (Precision + Recall)
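To pull these four counts out programmatically rather than reading them off the printed matrix, scikit-learn's documented ravel() pattern can be used; a minimal sketch that also re-derives two of the scores:

from sklearn.metrics import confusion_matrix
# ravel() flattens the 2x2 matrix in row-major order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)   # 85 21 12 90
print(tp / (tp + fp))   # precision ≈ 0.811
print(tp / (tp + fn))   # recall ≈ 0.882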

Why this step: To evaluate the performance of the tuned classification model. As you can see, the accuracy, precision, recall, and F1 scores have all improved over the basic Logistic Regression model built in Section 2.
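As a compact alternative to printing each metric separately, scikit-learn's classification_report summarizes precision, recall, and F1 for both classes in one call; a minimal sketch:

from sklearn.metrics import classification_report
# Per-class precision, recall, F1, and support in a single table
print(classification_report(y_test, y_pred))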

