## Nt

However, a small Remp does not guarantee a small actual risk. To see this, imagine fitting a function that hits every data point in the sample, so that there is no 'error'. It is unlikely that new data would also fall exactly on the function as fitted. Therefore, we need not only to minimize the empirical risk (maximize consistency), but also simultaneously maximize our generalization ability to avoid overfitting.

### Structural Risk Minimization

The structural risk minimization principle developed by Vapnik and Chervonenkis gives us the following bound, for any a p A and for a sample size of l > h, with probability at least 1 — q

where the confidence term 0 is given by

where h is the VC dimension (for Vapnik-Chervonenkis) of the set of functions {fa : a p A}. The VC-dimension of a set of functions is the largest set of points that can be shattered by elements in that set. To shatter a set of points is to be able to separate any possible classification, that is, for each possible classification there exists an a p A such that fa takes on the value 1 for one class and —1 on the other, no matter how the points have been labeled. Concisely, a set of points {x1, x2,..., x1} for which the set fh(xi), h(x2),..., h(xi) : hpH} - f -1,1}1

is said to be 'shattered' by H. For a set of l points, there are 2l possible binary classifications. For ease in conceptualization, consider the set of all linear functions in R2. The largest number of points that can be shattered using only lines is three. Hence the VC dimension of the set {fa1,a2(x) = a1 + a2x : 8(a1, a2) p R } is equal to three. Generally, the VC dimension of linear functions in R" is equal to n + 1. For nonlinear functions, the VC dimension can be calculated by finding the dimension of the space where the function is linear (called the feature space) and calculating as above. For example, a quadratic function of two variables x1 and x2 can be represented as a linear function in R3 since there are three possible monomials of degree two. Let ^ be the map then

{fx : f\(xu x2) = axj2 + bx22 + cx1x2} can be written as the set f fx : fx{z1, Z2, Z3) = az1 + bz2 + CZ3g which is linear in R3, hence the VC dimension of a quadratic in R is four.

The VC dimension grows quickly for an increasing number of variables and even more quickly for an increasing order of polynomial. Intuitively, if we always choose a hypothesis space with an high enough VC dimension, we will always be able to perfectly classify our training set, even if there is no relationship between our observations and their labels, and thus our decision rule may behave rather badly when classifying future, unknown data. We control our prediction error by controlling the VC dimension, which in turn can be controlled by reducing the richness of the set F = {8fa : a p A} called the 'hypothesis space'. To do this we impose a structure on the hypothesis space. If we consider only the set of classification functions {fa : a p A' C A}, we reduce the richness of our function class, and hence the VC dimension. The following structure can be imposed on A and hence on F. Let

A1 C A2 C ■■■ C An c ••• and Fi :— {f* : a p A,}, then

are hypothesis spaces whose VC dimensions satisfy hi < ¿2 < ■ ■ ■ < h„ < ■■■

SVM separation of a bivariate N (6,2,0) and a bivariate N (1,1,0)

The structural risk minimization principle allows us to find the function fa« in the subset {fa : a e An} for which a balance between consistency and generalizability has been reached. That is, when the right-hand side of eqn[4] is minimized, we have achieved the optimal machine. In practice, we must find a subset such that the VC dimension of our hypothesis space is less than our sample size.

### Separating Hyperplanes

Assume that our sample xP = (x1, x2,..., x1) is linearly separable in some dot product space F. Then our hypothesis space is the set of all hyperplanes in F. Any given hyperplane, with a = (w, b), can be written

for some w e F and b e R1. Then, for a = (w, b), iffa is a separating hyperplane, then the decision rule is: classify x as class one if fa (x) > 0 and as class 2 if fa (x) < 0. Note that multiplying both w and b by the same nonzero constant yields the same hyperplane. However, we can obtain the canonical hyperplane if we require min i = 1, ..., r

that is, the scaling ofw and b be such that the point closest to the hyperplane has a distance from the hyperplane of 1/ || w ||. Now we are ready to make the connection between the structure we have imposed upon our hypothesis space and our selection of the optimal hyperplane. We require the following theorem:

Theorem 1. Let R be the radius of the smallest ball BR(a) = {xeF : |w - a | < R}(aeF) containing the points x1,. .., xr, and let fw,b = sgn((w?x) + b)

be the canonical hyperplane decision functions defined on these points. Then the set {f^b : |w| < A} has a VC dimension h satisfying h < R2A2 + 1

Note that condition [11] allows for two canonical hyperplanes (w, b) and (—w, —b). However, this ambiguity will be resolved later.

Without the condition |w | < A, the VC dimension of our hypothesis space is NF + 1, where NF is the dimensionality of F. When we choose a value of A, we are choosing how rich a subset of our hypothesis space {fa : a e A,} we are considering. Smaller values of A or equivalently larger margins yield hypothesis spaces with smaller VC dimensions. Hence, the above theorem allows

SVM separation of a bivariate N (6,2,0) and a bivariate N (1,1,0)

Class 1 Class 2 Support vector Class 1 margin Decision boundary Class 2 margin

Figure 1 The margins and support vectors for a perfectly separable data set.

Class 1 Class 2 Support vector Class 1 margin Decision boundary Class 2 margin

Figure 1 The margins and support vectors for a perfectly separable data set.

us to control our prediction error by suppressing the increase in the VC dimension that would otherwise occur with high-dimensional data or when transforming the data into a high-dimensional feature space. Therefore we wish to find a hyperplane that minimizes w , and at the same time classifies the training set with as few mistakes as possible. We consider two different classes of hyperplanes.

Figure 1 shows a perfectly separating hyperplane between two classes of random variables in R2. The support vectors are the data points that lie on the margins of each class. Note that the choice of hyperplanes is drastically limited by a wide margin. Indeed, with a margin large enough, it is possible that only one perfectly separating hyperplane exists for a given data set. Hence, the VC dimension has also been drastically reduced.

### Optimal Margin Hyperplanes

Given a set of examples {(x1,j1), (x2,j1),.. .,(x1,y1)}, with xi e F andyi e {±1}, assume there exists a decision function fw b = sgn(w ? xi + b) with the property that fw,b(xi )= y,; i = 1;...,l

Then, canonicality implies y,-(xi w + b) > 1, i = 1,... ,l

Note that only one of the two canonical hyperplanes (w, b) and (—w, —b) will satisfy eqns [14] and [15]. Hence, we can label our data without prior knowledge of whether a class such as an exotic species will result in the data lying above or below the hyperplane. Only the hyperplane consistent with our data will satisfy the above two constraints.

We can restate our minimization problem as a convex problem, which allows us to use Lagrangian theory. The optimal hyperplane can be found by minimizing

subject to [15]. The Lagrangian form is then 1 1

L(w, b, a) = - ||w ||2 - £ ai(y<((x,-w) + b) -1) [17] i=1

with the constraint that ai > 0. We then maximize L with respect to the a, while minimizing with respect to w and b, under the additional constraints which yields and

which yields the optimal parameters a* and b*. The optimal hyperplane and the corresponding decision function can then be expressed in the dual representation in terms of these parameters and the support vectors:

f (x,a*; b*) = sgn yia*(x, x i=i sgni Ylyia*'(x'?x)

\ipSV

where SV is the set of all support vectors obtained in the optimization routine.

Once the problem has been reparametrized into the Lagrangian form, we must also reparametrize Theorem 1. We use the parameter C as an upper bound on the sum of the ai as part of our optimization, that is, ^ j aj < C. The a is directly related to the norm of w since

Hence, under the Lagrangian form, rather than choosing an upper bound for ||w | | during structural risk minimization, we choose an upper bound on the sum of the aj.

This produces a unique solution for w due to the strict convexity of [16]; however, the ai are not necessarily unique. By the Kuhn-Tucker theorem, the only Lagrange multipliers (the ai) that can be nonzero are those for which the constraints are exactly met, that is, ai ?[ y, ((x,-w + b) -1)]— 0, i = 1,..., l [22]

The observations Xj for which aj > 0 are called the support vectors. Since these lie exactly on the margin, they alone determine the separating hyperplane. For observations other than the support vectors, the constraints are satisfied automatically, and since they have weight zero, they do not appear in the solution for w ([21]).

We can further simplify the problem and obtain a decision function in terms of the support vectors by substituting the conditions for the extremum into the Lagrangian. This form of the optimization problem is called the 'dual form'. Maximize i i

subject to the constraints