This article was published as a part of the Data Science Blogathon.
One of the classifiers that we come across while learning about machine learning is Support Vector Machine or SVM. This algorithm is one of the most popular classification algorithms used in machine learning.
In this article, we will learn about the mathematics involved behind the SVM classifier, how it classifies the classes, and gives a prediction.
Table of Contents
- What is SVM?
- Few Concepts to know before learning the mathematics of SVM
- Diving into the mathematics of SVM
- 3.1 Background
- 3.1.1 Case 1: Perfect separated binary classified dataset
- 3.1.2 Equation for Perfect separation
- 3.2 Case 2: Imperfect Separation dataset
- 3.2.1 Final Equation to solve for Case 2
- 3.2.2 How to further solve the final equation
- 3.2.3 Use of kernelization to obtain final results
- 3.1 Background
1. Whats is SVM?
Support Vector Machine or SVM is a supervised and linear Machine Learning algorithm most commonly used for solving classification problems and is also referred to as Support Vector Classification.
There is also a subset of SVM called SVR which stands for Support Vector Regression which uses the same principles to solve regression problems. SVM is most commonly used and effective because of the use of the Kernel method which helps in solving the non-linearity.
2. Topics before diving into complete mathematics of SVM
2.1 A point in a space
Assuming, that we are familiar with what is a domain, range, and co-domain while defining a function in real space. Now, for any set of points to be classified, every point is represented by some feature vector x. Then-
x ∊ RD
Since, this is simple for many applications, mapping the point on a complex feature space-
Φ(x) ∊ RM
The transformed feature space for each input feature mapped to a transformed basis vector Φ(x) is:
2.2 Decision Boundary
This is the main separator for dividing the points into their respective classes.
Equation of Hyperplane :
The equation for a straight line with slope m and an intercept c is mx+c = 0
The hyperplane equation dividing the points (for classifying) is-
H: wT(x) + b = 0
Here: b = Intercept and bias term of the hyperplane equation
Note: In D dimensional space, the hyperplane would always be D -1 operator. For example, for 2-D space, a hyperplane is a straight line (1-D).
2.3 Distance Measure
The distance of any line, ax + by + c = 0 from a given point say, (x0 , y0) is given by d. Similarly , distance of a hyperplane equation : wTΦ(x) + b = 0 from a given point vector Φ(x0) is :
where ||w||2 is the Euclidean norm for the length of w given by :
Now that the terms are clear, let’s get into the algorithm used behind.
3. Diving Into Mathematics of SVM
Not all classification predictive models support multi-class classification, algorithms such as the Logistic Regression and Support Vector Machines were designed for binary classification and do not natively support classification tasks with more than two classes.
One approach for using binary classification algorithms for multi-classification problems is to split the multi-class classification dataset into multiple binary classification datasets and fit a binary classification model on each. Two different examples of this approach are the One-vs-Rest and One-vs-One strategies.
The Problem with SVM having multi-classification can be understood in more depth from here.
So, to understand in the least complex manner, we are going to consider the One-vs-One approach for two classes.
3.1.1 Case 1: (Perfect Separation for Binary Classified data) –
Data has only two classifications, positive and negative group, and is perfectly separable, meaning that, a hyperplane can accurately separate the training classes.
Now, there could be many hyperplanes giving 100% accuracy, so to choose the optimal/best hyperplane, place the hyperplane right at the center where the distance is maximum from the closest points and give the least test errors further.
The goal of the algorithm involved behind SVM:
Finding a hyperplane with the maximum margin from the closest points (known as support vectors).
In other words, “The goal is to maximize the minimum distance.”
Now, while making the predictions on the training data which were binary classified as positive and negative groups, if the point is substituted from the positive group in the hyperplane equation, we will get a value greater than 0 (zero), Mathematically,
wT(Φ(x)) + b > 0
And predictions from the negative group in the hyperplane equation would give negative value as wT(Φ(x)) + b < 0. Therefore, the product of a predicted and actual label would be greater than 0 (zero) on correct prediction, otherwise less than zero.
For a perfectly separable dataset, the optimal hyperplane classifies all the points correctly, further substituting the optimal values in the weight equation.
arg max is an abbreviation for arguments of the maxima which are basically the points of the domain of a function at which function values are maximized. (For further working of arg max in machine learning, do read here.)
Further, bringing the independent term of weight outside gives:
The inner term (minn yn |wT Φ(x) + b | ) represents the minimum distance of a point to the decision boundary and the closest point to the decision boundary H. Re-scaling the distance of the closest point as 1. i.e. (minn yn |wT Φ(x) + b |) = 1. Here, the vectors remain in the same direction and the hyperplane equation will not change. It is like changing the scale of a picture; the objects expand or shrink, but the directions remain the same, and the image remains unaltered.
Rescaling of distance is done by substituting,
The equation now becomes (describing that every point is at least 1/||w||2 distance apart from hyperplane) as
This maximization problem is equivalent to the following minimization problem which is multiplied by a constant as they don’t affect the results.
3.1.2 Primal Form of SVM (Perfect Separation) :
The above optimization problem is the Primal formulation since the problem statement has original variables.
3.2 CASE 2 : (Non – Perfect Separation)
Since the above discussed is the ideal case and never happens. We allow making few mistakes while classifying the points and therefore,
Adding a slack variable as a penalty for every miss-classification for each data point represented by β (beta). So, no penalty means the data point is correctly classified, β = 0, and at any misclassification β > 1, as a penalty.
3.2.1 Primal Form of SVM(Non -Perfect Separation):
Here: for β and C
Slack for every variable should be as low as possible and further regularized by hyperparameter C
- If C = 0, means less complex boundary as classifier would be not penalized by slack, as a result, the optimum hyperplane can use it anywhere and accept all large misclassifications. And as a result, the decision boundary would be linear and under fitted.
- If C = infinitely high, then even small slacks would be highly penalized and the classifier can’t afford to misclassify points and therefore overfitting. So parameter C is important.
The above equation is an example of convex quadratic optimization since the objective function is quadratic in W and constraints are linear in W and β.
Solution for Primal form: (Non – perfect separation):
Since we have Φ, which has a complex notation. we would rewrite the equation.
To get rid of Φ is to rewrite Primal formulation in Dual Formulation known as the dual form of a problem and to solve the obtained constraint optimization problem we can use the Lagrange Multiplier method.
In simple words :
- Dual Form: rewrites the same problem using a different set of variables. So the alternate formulation will help in eliminating the dependence on Φ and reducing the effect will be done with Kernelization.
- Lagrange Multiplier Method: It is a strategy to find the local minima or maxima of a function subject to the condition that one or more equations have to be satisfied exactly by the chosen values of variables.
Further, the formal definition of Dual Problem can be defined as :
The Lagrangian dual problem is obtained by first forming a Lagrangian of the already obtained minimization problem with the help of a Lagrange Multiplier so that new constraints could be added to the objective function and now would be solved for the primal variable values that minimize the original objective function.
This new solution makes the primal variables as functions of the Lagrange multipliers, and are called dual variables so that the new problem is to maximize the objective function for the dual variables with the new derived constraints.
This blog has explained the working of Lagrangian Multipliers in a very good way.
Taking an example of the application of Lagrange multiplier for a better understanding of how to convert a primal formulation using Lagrange multipliers to solve for optimization.
Below x is the original primal variable and to minimize the function f under a set of constraints given by g, and rewriting for a new set of variables called Lagrangian multipliers.
Solving the primal variables by differentiating the unconstrained Lagrange.
And lastly substituting back in the Lagrange equation and rewriting the constraints
Getting back at our Primal form:
Step 1: Obtain Primal and Determine Lagrangian form from primal
Step 2: Obtaining solution by expressing primal in form of duals
Step 3: Substitute obtained values in the Lagrangian form
The final dual form from the above simplification is :
The above dual form still have Φ terms, and here this can be easily solved with the help of Kernelization
The kernel by definition avoids the explicit mapping that is needed to get linear learning algorithms to learn a nonlinear function or decision boundary For all x and x’ in the input space Φ certain functions k(x,x’) can be expressed as an inner product in another space Ψ. The function
is referred to as a Kernel. In short, for machine learning kernel is defined as written in the form of a “feature map”
For a better understanding of kernels, one can refer to this link of quora
The kernel has two properties:
It is symmetric in nature k(xn , xm) = k(xm , xn)
It is Positive semi-definite
To have an intuition of the working of kernels, this link of quora might be useful.
By the definition of the kernel, we can substitute these values
So substituting properties of the kernel and by definition of the kernel in our dual form,
We get The New Equation as :
and this solution is free from Φ and so much easier to compute. And this was the mathematics involved behind the SVM model.
4. To summarize
Finally, we are at the end of the article, and to sum up, all the gibber gabber written above
Whenever, we are given any test feature vector x to predict, mapped to a complex Φ(x) and asked to predict which is basically wTΦ(x), then rewrite it using the duals so that the prediction is completely independent of complex feature basis Φ and is further predicted using kernels
So hence concludes that we don’t need a complex basis to model non-separable data well, and the kernel does the job.
I hope I was able to cover the basic intuition and mathematics behind SVM, for the classifier model, and keeping the length of the article in mind, I have tried to attach all the important links for the corresponding links within the article.
This was my understanding of involved mathematics in SVM, and I am open to suggestions regarding the same.