THE SUPPORT VECTOR MACHINE GUIDE

Section 1: Defining the Model

What is the Algorithm?

Support Vector Machine (SVM) is a supervised machine learning algorithm. SVM’s purpose is to predict the classification of a query sample by relying on labeled input data that are separated into two classes by a margin. Specifically, the data are transformed into a higher dimension, and a support vector classifier is used as a threshold (or hyperplane) to separate the two classes with minimum error.

How Does the Algorithm Work?

Step 1: Transform training data from a low dimension into a higher dimension.

Step 2: Find a Support Vector Classifier [also called Soft Margin…
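
As a rough illustration of these steps, here is a minimal sketch (not part of the original guide) using scikit-learn’s SVC on made-up toy data; the kernel argument performs the higher-dimensional transformation implicitly, and C controls how soft the margin is.

# Minimal sketch: an RBF-kernel SVC on toy 1-D data (illustrative only).
import numpy as np
from sklearn.svm import SVC

# Toy training data: class 1 sits between two groups of class 0,
# so the classes are not linearly separable in one dimension.
X_train = np.array([[0.0], [0.5], [3.5], [4.0], [1.8], [2.0], [2.2]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1])

# The 'rbf' kernel implicitly maps the data into a higher dimension (Step 1),
# and C sets how soft the margin is (Step 2): a lower C tolerates more misclassification.
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X_train, y_train)

print(clf.predict([[0.2], [2.0]]))  # classify two new query samples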


THE K-NEAREST NEIGHBORS GUIDE

Section 1: Defining the Model

What is the Algorithm?

K-Nearest Neighbors (KNN) is a supervised machine learning and lazy learning algorithm. KNN’s purpose is to predict the classification of a query sample by relying on labeled input data that are separated into several classes. One of the most popular parameters to tune is k, which refers to the number of nearest neighbors included in the majority-voting process.

How Does the Algorithm Work?

Step 1: Determine parameter k (number of nearest neighbors).

Step 2: Calculate the distance (e.g., Euclidean distance) between the query sample and all training samples.
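
To make Steps 1 and 2 concrete, here is a minimal NumPy sketch with made-up toy data; the closing majority vote is only included so the example runs end to end.

# Minimal sketch: Euclidean distances from a query sample to all training samples.
import numpy as np

X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [6.0, 5.0]])  # toy samples
y_train = np.array([0, 0, 1, 1])                                      # toy labels
query = np.array([2.5, 2.5])

k = 3  # Step 1: choose the number of nearest neighbors

# Step 2: Euclidean distance between the query and every training sample.
distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))

# The k nearest neighbors then vote on the predicted class.
nearest = np.argsort(distances)[:k]
prediction = np.bincount(y_train[nearest]).argmax()
print(prediction)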


THE LOGISTIC REGRESSION GUIDE

Section 1: Defining the Model

What is the Algorithm?

Logistic Regression (LR) is a supervised machine learning algorithm. LR’s purpose is to predict the classification of a query sample (e.g., yes/no). It predicts the probability (between 0 and 1) of the outcome from labeled input data with the help of a sigmoid function. To determine the class outcome, a threshold value is selected as the cutoff above which the event is predicted to happen.

How Does the Algorithm Work?

Step 1: Perform linear regression on the query sample to predict the outcome as a continuous value.
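
As a rough sketch of how this linear output feeds into a sigmoid and a threshold, the example below uses hypothetical coefficient values; it illustrates the mechanics only, not a trained model.

# Minimal sketch: linear combination -> sigmoid -> threshold (illustrative values only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned coefficients and a query sample.
weights = np.array([0.8, -0.5])
bias = 0.1
query = np.array([1.2, 0.7])

z = np.dot(weights, query) + bias      # Step 1: continuous linear output
probability = sigmoid(z)               # squashed into (0, 1)
prediction = int(probability >= 0.5)   # threshold (0.5 here) decides the class
print(probability, prediction)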


THE NAIVE BAYES GUIDE

Section 4: Evaluating the Model Tradeoffs

Reference How to Improve Naive Bayes? Section 3: Tuning the Model in Python, prior to continuing…

A D V A N T A G E S

Q1: Is Naive Bayes a simple or difficult classifier to understand?

Answer: Simple

Q2: Is Naive Bayes an interpretable classifier or not an interpretable classifier?

Answer: Interpretable

Q3: Is Naive Bayes a fast or slow classifier?

Answer: Fast

Q4: Can Naive Bayes handle missing data, or is it sensitive to missing data?

Answer: Handle Missing Data

Q5: Does Naive Bayes increase in error as the number of features increases?

Answer: No Curse of Dimensionality

Q6: Is Naive Bayes more prone to overfitting or less prone to…


THE NAIVE BAYES GUIDE

Section 3: Tuning the Model in Python

Reference How to Implement Naive Bayes? Section 2: Building the Model in Python, prior to continuing…

[10] Define Grid Search Parameters

param_grid_nb = {
    'var_smoothing': np.logspace(0, -9, num=100)
}
  • var_smoothing is a stability calculation to widen (or smooth) the curve and therefore account for more samples that are further away from the distribution mean. In this case, np.logspace returns 100 numbers spaced evenly on a log scale, starting at 10^0 and ending at 10^-9.

Why this step: To set the selected parameters used to find the optimal combination. By referencing the sklearn.naive_bayes.GaussianNB documentation, you can see which parameters are available to tune.
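
Here is a minimal sketch of how param_grid_nb might be passed to GridSearchCV; the synthetic dataset stands in for the guide's preprocessed X_train and y_train, and the cv and scoring settings are assumptions.

# Minimal sketch: cross-validated search over the smoothing values defined above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for the preprocessed training set.
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid_nb = {'var_smoothing': np.logspace(0, -9, num=100)}

nb_grid = GridSearchCV(GaussianNB(), param_grid_nb, cv=5, scoring='accuracy')
nb_grid.fit(X_train, y_train)

print(nb_grid.best_params_)  # the var_smoothing value that scored best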


THE NAIVE BAYES GUIDE

Section 2: Building the Model in Python

Reference What is Naive Bayes? Section 1: Defining the Model, prior to continuing…

[1] Import Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
  • NumPy is a Python library used for working with arrays.
  • Matplotlib is a Python library used for creating static, animated, and interactive visualizations.
  • Pandas is a Python library used for providing fast, flexible, and expressive data structures.

Why this step: Python libraries are collections of useful functions that eliminate the need to write code from scratch, especially when developing machine learning, deep learning, data science, data visualization applications, and more!

[2]…


THE NAIVE BAYES GUIDE

Section 1: Defining the Model

What is the Algorithm?

Naive Bayes (NB) is a supervised machine learning algorithm. NB’s purpose is to predict the classification of a query sample by relying on labeled input data that are separated into classes. The name “naive” stems from the algorithm’s assumption that the features are independent of one another, and “Bayes” stems from its use of a statistical classification technique called Bayes’ Theorem.

How Does the Algorithm Work?

Step 1: Calculate the Prior Probability for given class labels in training data.

Step 2: Obtain Likelihood Probability with each feature attribute for each class.

Step…
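
As a rough numeric illustration of the prior and likelihood steps above, the toy sketch below uses made-up probabilities for a two-class, two-feature example; it is not part of the original guide.

# Minimal sketch of the prior x likelihood idea behind Naive Bayes (toy numbers only).
import numpy as np

# Step 1: prior probability of each class, estimated from training label counts.
priors = {'spam': 0.4, 'ham': 0.6}

# Step 2: per-class likelihoods of the query's two feature values (assumed independent).
likelihoods = {'spam': [0.7, 0.2], 'ham': [0.1, 0.5]}

# Unnormalized posterior per class: prior times the product of the feature likelihoods.
scores = {c: priors[c] * np.prod(likelihoods[c]) for c in priors}
prediction = max(scores, key=scores.get)
print(scores, prediction)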


THE SUPPORT VECTOR MACHINE GUIDE

Section 4: Evaluating the Model Tradeoffs

Reference How to Improve Support Vector Machine? Section 3: Tuning the Model in Python, prior to continuing…

A D V A N T A G E S

Q1: Is Support Vector Machine a simple or difficult classifier to understand?

Answer: Simple

Q2: Can Support Vector Machine solve linear problems or non-linear problems?

Answer: Linear Problems & Non-Linear Problems

Q3: Does Support Vector Machine increase in error as the number of features increases?

Answer: No Curse of Dimensionality

Q4: Can Support Vector Machine handle outliers, or is it sensitive to outliers?

Answer: Handle Outliers

D I S A D V A N T A G E S

Q5: Is Support Vector Machine a fast or slow classifier?

Answer: Slow

Q6: Can Support Vector Machine handle…


THE SUPPORT VECTOR MACHINE GUIDE

Section 3: Tuning the Model in Python

Reference How to Implement Support Vector Machine? Section 2: Building the Model in Python, prior to continuing…

[10] Define Grid Search Parameters

param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'class_weight': ['balanced']
}
  • C is the penalty parameter of the error term; this parameter controls the trade-off between a smooth decision boundary and classifying the training points correctly. A low C tolerates more misclassified training points and yields a smoother (wider-margin) boundary, while a high C penalizes training errors more heavily and can lead to overfitting.
  • gamma is the kernel coefficient for ‘rbf’, ‘poly’, and ‘sigmoid’; this parameter controls how much curvature the decision boundary has. …
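
Here is a minimal sketch of how param_grid_svm might be passed to GridSearchCV; the synthetic dataset stands in for the guide's preprocessed X_train and y_train, and the cv and scoring settings are assumptions.

# Minimal sketch: exhaustive cross-validated search over every C / gamma / kernel combination.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for the preprocessed training set.
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'class_weight': ['balanced']
}

svm_grid = GridSearchCV(SVC(), param_grid_svm, cv=5, scoring='accuracy')
svm_grid.fit(X_train, y_train)

print(svm_grid.best_params_)  # the best-scoring parameter combination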


THE SUPPORT VECTOR MACHINE GUIDE

Section 2: Building the Model in Python

Reference What is Support Vector Machine? Section 1: Defining the Model, prior to continuing…

[1] Import Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
  • NumPy is a Python library used for working with arrays.
  • Matplotlib is a Python library used for creating static, animated, and interactive visualizations.
  • Pandas is a Python library used for providing fast, flexible, and expressive data structures.

Why this step: Python libraries are collections of useful functions that eliminate the need to write code from scratch, especially when developing machine learning, deep learning, data science, data visualization applications, and more!

Kopal Jain

Genentech Data Engineer | Harvard Data Science Grad | RPI Biomedical Engineer
