Revision



Introduction

SVM (Support Vector Machine) is a classification algorithm. The goal of SVM is (if the inputs are perfectly separable) to maximize the minimum distance between the training points and the separation plane.

One of the main advantages of SVM is that it can compute this separation plane in a projection space (called the feature space) without having to actually project the inputs into this feature space.

It means that it can operate in large feature spaces (even infinite-dimensional ones, for example with a Gaussian kernel) at a low computational cost. This is possible when the inner product of elements of the feature space can be simplified and does not require computing the projection of the elements. From a mathematical point of view it means that the result of \(K(x,z)=\langle \phi(x), \phi(z) \rangle\) can be computed without evaluating \(\phi(x)\) or \(\phi(z)\).
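As an illustration, here is a minimal sketch (assuming numpy and hypothetical vectors x and z) showing that the Gaussian kernel value, which corresponds to an inner product in an infinite-dimensional feature space, is computed directly from the raw inputs in linear time:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)): the inner product
    # <phi(x), phi(z)> in an infinite-dimensional feature space,
    # computed without ever evaluating phi.
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 1.0])
print(gaussian_kernel(x, z))
```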

The trick behind this is known as the kernel trick. If the kernel matrix associated with the kernel is symmetric positive semi-definite then it is a valid kernel (ie the kernel trick can be used). This is Mercer's theorem.


Notation

\(x_i\) represents the \(i\)-th individual.


Computation details

Let \(w^t x+b=0\) be the separation plane. It is parametrized by \(w\) and \(b\).


Functional margin

The functional margin of an input \(x_i\) with respect to the decision plane is:

\[\hat{\gamma_i}=y_i(w^t x_i+b)\]

Where \(y_i \in \{-1, +1\}\) is the label of \(x_i\).


Note that this margin can be made arbitrarily large by scaling \(w\) and \(b\) by a scalar (whereas the separation plane itself is not modified). We look for the smallest margin among all \(x_i\) (ie the margin of the \(x_i\) closest to the plane):


\[\hat{\gamma} = \min\widehat{\gamma_i}\]
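As a small illustration (with a hypothetical plane \(w\), \(b\) and toy points), the functional margin can be computed directly from its definition, and scaling \(w\) and \(b\) scales it by the same factor:

```python
import numpy as np

# Hypothetical plane parameters and labeled points
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.0, 3.0], [4.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])

# Functional margin: gamma_hat = min_i y_i (w^t x_i + b)
gamma_hat = np.min(y * (X @ w + b))
gamma_hat_scaled = np.min(y * (X @ (10 * w) + 10 * b))
print(gamma_hat, gamma_hat_scaled)  # the second value is 10 times the first
```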


Geometrical margin

The geometric margin follows the same idea as the functional margin but from a geometrical point of view. Moreover, this margin is invariant to a scaling of \(w\) and \(b\). To understand the geometric margin, first note that \(w\) is orthogonal to the separation plane (check the separation plane for \(b=0\) and \(w=(1, -1)\) to convince yourself, and remember that if \(\langle x, y \rangle=0\) then \(x\) and \(y\) are orthogonal).

For any input \(x_i\), let's define its distance to the plane as \(\gamma_i\). Note that \(\frac{w}{\Vert w \Vert}\) is a unit-length vector orthogonal to the separation plane. Then the projection of \(x_i\) onto the separation plane is \(x_i - \gamma_i \frac{w}{\Vert w \Vert}\) and, as this point belongs to the separation plane:

\[w^t\left(x_i-\gamma_i \frac{w}{\Vert w \Vert}\right)+b=0\] \[\gamma_i=\left(\frac{w}{\Vert w \Vert}\right)^t x_i+\frac{b}{\Vert w \Vert}\]

This distance can be negative if \(x_i\) is behind the separation plane. To keep it positive, just multiply by \(y_i\):

\[\gamma_i=y_i\left[\left(\frac{w}{\Vert w \Vert}\right)^tx_i+\frac{b}{\Vert w \Vert}\right]\]

Let \(\gamma=\min{\gamma_i}\) be the minimum of these distances.


Note that if \(\Vert w \Vert=1\) the geometrical margin and the functional margin are equal. The geometrical margin is invariant to a scaling of \(w\) and \(b\) but the functional margin is not.
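A quick numerical check of this invariance, with the same hypothetical plane and points as above:

```python
import numpy as np

w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.0, 3.0], [4.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])

def geometric_margin(w, b, X, y):
    # gamma = min_i y_i ((w/||w||)^t x_i + b/||w||)
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

print(geometric_margin(w, b, X, y))
print(geometric_margin(10 * w, 10 * b, X, y))  # same value: invariant to scaling
```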


Optimal margin

We want to find the parameters of the separation plane (ie \(w\) and \(b\)) that maximize \(\gamma\) (or equivalently that maximize \(\hat{\gamma}\), but subject to some constraint on the norm, as scaling \(w\) and \(b\) can make \(\hat{\gamma}\) arbitrarily large).

Starting with the geometrical margin we look for:

\[\begin{eqnarray} \max_{w,b} \; && \gamma \\ st \; && y_i\left(\left(\frac{w}{\Vert w \Vert}\right)^t x_i + \frac{b}{\Vert w \Vert}\right) \geq \gamma \end{eqnarray}\]

It is equivalent to:

\[\begin{eqnarray} \max_{w,b} \; && \widehat{\gamma} \\ st \; && y_i(w^t x_i + b) \geq \widehat{\gamma} \\ && \Vert w \Vert=1 \end{eqnarray}\]

Which is equivalent to:

\[\begin{eqnarray} \max_{w,b} \; && \frac{\widehat{\gamma}}{\Vert w \Vert} \\ st \; && y_i(w^t x_i + b) \geq \widehat{\gamma} \end{eqnarray}\]


Using the scaling invariance

We will now use the scaling invariance to our advantage. Recall that we can scale \(w\) and \(b\) arbitrarily without changing the plane. In particular we can choose a scaling factor \(\lambda\) for \(w\) and \(b\) in order to make \(\hat{\gamma}\) equal to \(1\). In this case we want all \(x_i\) to have a functional margin greater than or equal to 1 with respect to the separation plane. Another way of saying this is to remark that our problem is over-parametrized, as \(\hat{\gamma}\) totally depends on \(w\) and \(b\).


Let's thus set \(\hat{\gamma}=1\) (ie let's choose values of \(w\) and \(b\) such that \(\hat{\gamma}=1\)):

\[\begin{eqnarray} \max_{w,b} \; && \frac{1}{\Vert w \Vert} \\ st \; && y_i(w^t x_i + b) \geq 1 \end{eqnarray}\]

This maximization problem can be transformed into a minimization problem:

\[\begin{eqnarray} \min_{w,b}\; && \frac{1}{2}\Vert w \Vert^2 \\ st \; && y_i(w^t x_i + b) \geq 1 \end{eqnarray}\]

This minimization problem can be solved using, for example, a QP solver.
We can also use the Lagrangian (and more specifically take advantage of Lagrangian duality) to solve it.
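As a sketch, the hard-margin primal above can be written almost verbatim with a generic convex solver. The following assumes cvxpy is available and uses a small hypothetical separable dataset:

```python
import cvxpy as cp
import numpy as np

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min 1/2 ||w||^2  s.t.  y_i (w^t x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```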


Lagrange Duality

As said previously, the above optimization problem can be solved using the Lagrangian.

Let's define the Lagrangian as:

\[L(w,b,\alpha)=\frac{1}{2}\Vert w \Vert^2-\sum\alpha_i\left[y_i\left(w^t x_i + b\right)-1\right]\]

Where the \(\alpha_i \geq 0\) are the Lagrange multipliers.

We can solve this using two different formulations.


Primal

The primal finds the optimal point by:

  1. Taking the \(\max\) with respect to \(\alpha_i\),
  2. Taking the \(\min\) with respect to \(w\) and \(b\).

So the primal takes the \(\min\) with respect to \(w\) and \(b\) of the \(\max\) with respect to \(\alpha_i\).



Dual

The dual finds the optimal point by:

  1. Taking the \(\min\) with respect to \(w\) and \(b\),
  2. Taking the \(\max\) with respect to \(\alpha_i\).

So the dual takes the \(\max\) with respect to \(\alpha_i\) of the \(\min\) with respect to \(w\) and \(b\).


The link between the Primal and the Dual is:

\[d = \max_{\alpha} \min_{w,b} L(w, b, \alpha) \leq \min_{w,b} \max_{\alpha} L(w, b, \alpha) = p\]

Under the KKT conditions, the dual and the primal are equal.


Back to SVM

In the case of our SVM problem, the KKT conditions are satisfied and therefore using the primal or the dual leads to an equivalent solution.

We will use the dual to find our optimal parameters as it leads to a convenient formulation (one that allows the use of the kernel trick).


Recall the Lagrangian of our optimization problem:

\[L(w,b,\alpha)=\frac{1}{2}\Vert w \Vert^2-\sum\alpha_i\left[y_i\left(w^t x_i + b\right)-1\right]\]

The first step is to compute the \(\min\): setting the derivative with respect to \(w\) to 0 (\(\frac{\partial L}{\partial w} = w - \sum\alpha_i y_i x_i = 0\)) gives:

\[w=\sum\alpha_i y_i x_i\]

Setting the derivative with respect to \(b\) to 0 leads to:

\[\sum\alpha_i y_i = 0\]

We can reinject these results into the Lagrangian \(L\):

\[L(w,b,\alpha)=\sum \alpha_i - \frac{1}{2} \sum y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle\]

Finally, the dual problem to find the \(\alpha_i\) is:

\[\begin{eqnarray} \max_{\alpha} \; W(\alpha) &&= \sum \alpha_i - \frac{1}{2} \sum y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \\ && st \; \alpha_i \geq 0 \\ && st \; \sum \alpha_i y_i = 0 \end{eqnarray}\]


We can use an SMO algorithm (see CS229 - 3 SVM - page 20) to find the \(\alpha_i\). Once the \(\alpha_i\) are found we can easily compute \(w\) using the previous equation and \(b\) using the primal constraints (see this StackExchange question).
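A minimal sketch of this recovery step, using scikit-learn's SVC (which solves the dual internally) on hypothetical toy data; a very large C approximates the hard-margin problem:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # large C ~ hard margin
clf.fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors
w = clf.dual_coef_ @ clf.support_vectors_  # w = sum alpha_i y_i x_i
b = clf.intercept_
print(w, b)
```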

Inference function

Recalling that:

\[w=\sum\alpha_i y_i x_i\]

The inference function is:

\[\begin{eqnarray} w^t x + b &&= \left(\sum\alpha_i y_i x_i\right)^t x + b \\ &&= \sum \alpha_i y_i \langle x_i, x \rangle + b \end{eqnarray}\]

Hence, if we’ve found the \(\alpha_i\)’s, in order to make a prediction, we have to calculate a quantity that depends only on the inner product between \(x\) and the points in the training set.

If we are in the projected case (ie the inputs are mapped to a feature space by \(\phi\)) we have:

\[w^t x + b = \sum \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b\]

Hence if we are able to find a Kernel associated with the projection function \(\phi\), we will be able to use complex projection functions \(\phi\) without having to compute the projection of the points.
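A sketch of this kernelized inference function, assuming the \(\alpha_i\), the support points and \(b\) have already been obtained (e.g. via SMO), and using a Gaussian kernel as an example:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    # Gaussian kernel: inner product in an (implicit) infinite-dimensional space
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def predict(x, support_x, support_y, alphas, b, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b ; the sign gives the predicted class
    s = sum(a_i * y_i * kernel(x_i, x)
            for a_i, y_i, x_i in zip(alphas, support_y, support_x))
    return np.sign(s + b)
```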


Kernel

Mercer’s theorem

Let \(K\) be a kernel. If \(K\) is symmetric positive semi-definite, ie:

\[K(x,y)=K(y,x)\]

and:

\[\sum_i \sum_j \alpha_i \alpha_j K(x_i, x_j) \geq 0\]

(or, if \(K^{(mat)}\) is the matrix where \(K_{i,j}^{(mat)} = K(x_i, x_j)\), then \(\left(K^{(mat)}\right)^t=K^{(mat)}\) and \(v^t K^{(mat)} v \geq 0\) for any vector \(v\)).

Then \(K\) can represent the scalar product associated with a projection function \(\phi\) without having to compute the projection of the elements:

\[K(x,y)=\langle \phi(x), \phi(y) \rangle\]

Where \(\phi\) maps the inputs into the feature space (possibly infinite-dimensional).
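A sketch of this check on a finite sample (the Gram matrix of a hypothetical linear kernel): symmetry plus non-negative eigenvalues imply positive semi-definiteness:

```python
import numpy as np

def is_valid_kernel_matrix(K_mat, tol=1e-10):
    # Mercer check on a finite sample: symmetric and positive semi-definite
    symmetric = np.allclose(K_mat, K_mat.T)
    psd = np.all(np.linalg.eigvalsh(K_mat) >= -tol)
    return symmetric and psd

# Gram matrix of the linear kernel K(x, z) = <x, z> on random points
X = np.random.randn(5, 3)
print(is_valid_kernel_matrix(X @ X.T))  # True
```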


Example of a kernel usable with the kernel trick

Let's define a kernel as follows:

\[\begin{eqnarray} K(x,z) &&= (x^T z)^2 \\ &&= \left(\sum_i x_i z_i\right)\left(\sum_j x_j z_j\right)\\ &&= \sum_i\sum_j x_i x_j z_i z_j\\ &&= \sum_{i,j}(x_i x_j) (z_i z_j) \end{eqnarray}\]

We can deduce that the projection function associated with this kernel is \(\phi_K(x)=(x_i x_j)_{i,j}\), the vector of all pairwise products \(x_i x_j\).

The computation time of \(\phi_K(x)\) is quadratic (\(O(n^2)\)) whereas computing the kernel directly on the inputs is linear (\(O(n)\)) (check with \(n=3\)).
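A quick numerical check (with hypothetical vectors and \(n=3\)) that the kernel computed directly on the inputs matches the inner product of the explicit projections:

```python
import numpy as np

def phi(x):
    # Explicit projection: the n^2 pairwise products x_i * x_j
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

k_direct = (x @ z) ** 2        # O(n): kernel on the raw inputs
k_projected = phi(x) @ phi(z)  # O(n^2): same value via the explicit projection
print(np.isclose(k_direct, k_projected))  # True
```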


The optimization problem and the inference function only involve scalar products of the projected points, so the kernel trick may be applied in order to avoid computing the projection of the points.


SVM with misclassification

In all the previous parts, we assumed that no input was misclassified. However, in practice, even in large feature spaces, some inputs remain misclassified.

Hence the loss function is a little bit different.


Misclassification Loss function

The loss function used to take misclassified inputs into account in SVM is the hinge loss:

\[L(w, b; (X, Y))=\max(0,1 - Y \cdot (w^t X + b))\]

Over the whole dataset it is:

\[l(w, b) = \sum_{j=1}^{n_{pop}} \max(0, 1 - Y^{(j)} \cdot (w^t X^{(j)} + b))\]

Where \(w\) and \(b\) represent the parameters of the SVM model.


Total Loss function with misclassification

Hence the total loss is:

\[l(w, b) = \frac{1}{2} \Vert w \Vert_2^2 + \sum_{j=1}^{n_{pop}} \max(0, 1 - Y^{(j)} \cdot (w^t X^{(j)} + b))\]
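A sketch of this total loss in numpy; note that a trade-off constant \(C\) (not present in the formula above, defaulting to 1 here) is commonly placed in front of the hinge term:

```python
import numpy as np

def total_loss(w, b, X, Y, C=1.0):
    # 1/2 ||w||^2 + C * sum_j max(0, 1 - Y_j (w^t X_j + b))
    margins = Y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)
```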


References

See: CS229 lecture notes - Support Vector Machines (referenced above).