THE SUPPORT VECTOR MACHINE GUIDE

How to Implement Support Vector Machine?

Section 2: Building the Model in Python

Kopal Jain

--

Before continuing, refer to What is Support Vector Machine? Section 1: Defining the Model.

[1] Import Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
  • NumPy is a Python library used for working with arrays.
  • Matplotlib is a Python library used for creating static, animated, and interactive visualizations.
  • Pandas is a Python library that provides fast, flexible, and expressive data structures.

Why this step: Python libraries are collections of useful functions that eliminate the need to write code from scratch, especially when developing machine learning, deep learning, data science, and data visualization applications.

[2] Read & Store Data

df = pd.read_csv('Mammographic_Data_Cleaned.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 831 entries, 0 to 830
Data columns (total 5 columns):
AGE         831 non-null float64
SHAPE       831 non-null int64
MARGIN      831 non-null int64
DENSITY     831 non-null int64
SEVERITY    831 non-null int64
dtypes: float64(1), int64(4)
memory usage: 32.6 KB

Note: The dataset used here has already been preprocessed and cleaned. To follow the preprocessing steps, refer to the code on GitHub.

Notice how SHAPE, MARGIN, and DENSITY are int64 data types. Since these features are nominal, they need to be converted into object data types. If the data types in your data set are already correct, this step can be skipped; otherwise, use the following code:

for data in [df]:
    # Convert data type for SHAPE
    data['SHAPE'] = data['SHAPE'].astype(str)
    # Convert data type for MARGIN
    data['MARGIN'] = data['MARGIN'].astype(str)
    # Convert data type for DENSITY
    data['DENSITY'] = data['DENSITY'].astype(str)
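
To confirm the conversion, the data types can be inspected again; SHAPE, MARGIN, and DENSITY should now appear with the object data type:

print(df.dtypes)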

Why this step: In order to use the data and perform manipulations on it, the data must first be read and stored in a uniform structure. Pandas is used to read the .csv file and store the data in a DataFrame named df.

[3] Split Data (Independent Variables [X] & Dependent Variable [y])

dependentVar = 'SEVERITY'
X = df.loc[:, df.columns != dependentVar]
y = df[dependentVar].values
print("Number of observations and dimensions in 'X':", X.shape)
print("Number of observations in 'y':", y.shape)

Number of observations and dimensions in 'X': (831, 4)
Number of observations in 'y': (831,)

Why this step: The goal is to use the independent variables (or features) to predict the dependent variable (or outcome). Hence these variables need to be split into X and y, where X represents all the features input into the model and y represents the outcome result from the model.

[4] Encode Independent Variables [X]

X = pd.get_dummies(X)
print("Number of observations and dimensions in 'X':", X.shape)
print("Number of observations in 'y':", y.shape)

Number of observations and dimensions in 'X': (831, 14)
Number of observations in 'y': (831,)

Note: Notice how the number of features in X has increased from 4 to 14, meaning 10 new columns were added to the model. To see exactly which features were created, reference the following code:

features = X.columns.tolist()
print(features)

['AGE', 'SHAPE_1', 'SHAPE_2', 'SHAPE_3', 'SHAPE_4', 'MARGIN_1', 'MARGIN_2', 'MARGIN_3', 'MARGIN_4', 'MARGIN_5', 'DENSITY_1', 'DENSITY_2', 'DENSITY_3', 'DENSITY_4']

Note: Notice how, for example, the feature SHAPE is broken down into four new features: SHAPE_1, SHAPE_2, SHAPE_3, and SHAPE_4. Across these four columns, each original value maps to a binary vector:

SHAPE = 1 is represented as [1, 0, 0, 0]

SHAPE = 2 is represented as [0, 1, 0, 0]

SHAPE = 3 is represented as [0, 0, 1, 0]

SHAPE = 4 is represented as [0, 0, 0, 1]

  • One Hot Encoding: Each label is mapped to a binary vector.

Why this step: Machine learning algorithms require that input and output variables are represented as numbers. Since this data set includes categorical features, they must be encoded to numbers before they can be used to fit and evaluate a model.
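
To see the encoding in isolation, pd.get_dummies can be run on a small, made-up column (the values below are illustrative, not taken from the dataset):

# Toy example: a categorical column containing the values '1', '2', and '4'
toy = pd.DataFrame({'SHAPE': ['1', '2', '4', '1']})
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())   # ['SHAPE_1', 'SHAPE_2', 'SHAPE_4'] -- one column per distinct value
print(encoded.values)             # each row contains a single 1 (or True) marking its original value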

[5] Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
print(X)

[[ 0.76580356 -0.54443719 -0.52583048 ... -0.2688086   0.31497039 -0.09859277]
 [ 0.15166622 -0.54443719 -0.52583048 ... -0.2688086   0.31497039 -0.09859277]
 [-1.89545824  1.83675916 -0.52583048 ... -0.2688086   0.31497039 -0.09859277]
 ...
 [ 0.56109111 -0.54443719 -0.52583048 ... -0.2688086   0.31497039 -0.09859277]
 [ 0.69756608 -0.54443719 -0.52583048 ... -0.2688086   0.31497039 -0.09859277]
 [ 0.42461615 -0.54443719 -0.52583048 ... -0.2688086   0.31497039 -0.09859277]]

Why this step: Some machine learning algorithms calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be dominated by that feature. Standardization and normalization are techniques applied to the range of the independent variables so that each feature contributes proportionately to the final distance.
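
Under the hood, StandardScaler rescales each feature as z = (x - mean) / standard deviation, computed per column. A quick sanity check on the scaled X confirms this:

# After scaling, every column should have (approximately) zero mean and unit variance
print(np.round(X.mean(axis=0), 6))   # ~0 for each of the 14 features
print(np.round(X.std(axis=0), 6))    # ~1 for each of the 14 features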

[6] Split Data (Train & Test)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 32)
print("Number of observations and dimensions in training set:", X_train.shape)
print("Number of observations and dimensions in test set:", X_test.shape)
print("Number of observations in training set:", y_train.shape)
print("Number of observations in test set:", y_test.shape)

Number of observations and dimensions in training set: (623, 14)
Number of observations and dimensions in test set: (208, 14)
Number of observations in training set: (623,)
Number of observations in test set: (208,)
  • Training Set is used to train, or fit, the model.
  • Test Set is used to obtain an unbiased evaluation of the final model.

Why this step: To assess the predictive performance of the model, it is important to have an unbiased evaluation. This can be accomplished by splitting the dataset before using it. The data is randomly split into a training set and a testing set, where 75% of the data is set aside for training and the remaining 25% for testing.
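
Optionally, the class balance of the outcome in the two sets can be checked to make sure the random split did not skew the distribution of SEVERITY (an extra check, not strictly required):

# Counts of class 0 and class 1 in each split
print(np.unique(y_train, return_counts=True))
print(np.unique(y_test, return_counts=True))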

[7] Build Model on Training Data

from sklearn.svm import SVC

svmModel = SVC(random_state=1234, probability=True)
svmModel.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto_deprecated', kernel='rbf', max_iter=-1, probability=True, random_state=1234, shrinking=True, tol=0.001, verbose=False)

Why this step: To train the model on training data so it can accurately predict the outcome.
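
Note from the model summary above that the defaults were used: an RBF kernel with C=1.0. These are hyperparameters that can be tuned; as a sketch, a different kernel or regularization strength could be tried (the values below are illustrative, not tuned for this dataset):

# Hypothetical alternative: a linear kernel with a smaller C
svmLinear = SVC(kernel='linear', C=0.5, random_state=1234, probability=True)
svmLinear.fit(X_train, y_train)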

[8] Predict on Testing Data

y_pred = svmModel.predict(X_test)
print(y_pred)

[1 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0]

Why this step: To obtain the model's predictions on the testing data so that its performance can be evaluated.
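
Since the model was built with probability=True, class probabilities can also be obtained alongside the hard 0/1 predictions (shown here as an optional extra):

# Predicted probability of each class for the test observations
y_prob = svmModel.predict_proba(X_test)
print(y_prob[:5])   # first five rows; column order follows svmModel.classes_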

[9] Numeric Analysis

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred), ": is the confusion matrix")
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred), ": is the accuracy score")
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred), ": is the precision score")
from sklearn.metrics import recall_score
print(recall_score(y_test, y_pred), ": is the recall score")
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred), ": is the f1 score")

[[82 24]
 [14 88]] : is the confusion matrix

0.8173076923076923 : is the accuracy score
0.7857142857142857 : is the precision score
0.8627450980392157 : is the recall score
0.8224299065420562 : is the f1 score

Note: Using the confusion matrix, the True Positive, False Positive, False Negative, and True Negative counts can be extracted; these are what the accuracy, precision, recall, and f1 scores are calculated from. With scikit-learn's convention (rows are actual values, columns are predicted values), the matrix above reads:

  • True Negative = 82
  • False Positive = 24
  • False Negative = 14
  • True Positive = 88

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)

Why this step: To evaluate the performance of a classification model.
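
As a cross-check, the same scores can be computed by hand from the four counts above:

# Unpack the confusion matrix: scikit-learn orders it [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)   # should match the scores printed above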
