### Soft Margin Hyperplanes

A perfectly separating hyperplane may not always exist in the input space for a desired margin width. We could address this by projecting the data into a higher-dimensional space using a polynomial or other nonlinear function. However, as discussed earlier, this can result in overfitting and should not be done on a whim. To allow a few margin errors in the training set, we introduce slack variables $\xi_i \ge 0$, $i = 1, \ldots, l$, and use the relaxed constraints

$$y_i((x_i \cdot w) + b) \ge 1 - \xi_i, \qquad i = 1, \ldots, l \qquad [29]$$

Our convex minimization problem then becomes

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$$

subject to the above constraints. As before, minimizing $\frac{1}{2}\|w\|^2$ amounts to minimizing the VC dimension, while minimizing $\sum_{i=1}^{l} \xi_i$ amounts to minimizing the classification error on the training set (without eliminating it completely unless we achieve $\xi_i = 0$ for all $i$). As with optimal margin hyperplanes, we can use the Lagrangian form

$$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

subject to the constraints

$$0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0$$

For a suitably chosen C, this approach is a practical implementation of the structural risk minimization principle on the hypothesis space.
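The soft-margin training problem described above can be sketched numerically. The following is a minimal illustration of my own, not the article's implementation: projected gradient ascent on the dual objective with the box constraints $0 \le \alpha_i \le C$. For simplicity the sketch drops the equality constraint $\sum_i \alpha_i y_i = 0$ (i.e., it treats the bias only approximately, recovering it afterward from the margin support vectors).

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=1e-3, epochs=4000):
    """Projected gradient ascent on the soft-margin dual.
    Sketch only: the equality constraint sum(alpha_i * y_i) = 0 is dropped,
    and an approximate bias is recovered from the margin support vectors."""
    K = X @ X.T                                   # Gram matrix of dot products
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        grad = 1.0 - y * (K @ (alpha * y))        # gradient of W(alpha)
        alpha = np.clip(alpha + lr * grad, 0.0, C)  # project onto the box [0, C]
    w = (alpha * y) @ X                           # primal weights from the dual
    on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)
    b = float(np.mean(y[on_margin] - X[on_margin] @ w)) if on_margin.any() else 0.0
    return w, b, alpha

# toy separable data: two clusters on either side of the origin
X = np.array([[2., 2.], [3., 3.], [2., 3.], [3., 2.],
              [-2., -2.], [-3., -3.], [-2., -3.], [-3., -2.]])
y = np.array([1., 1., 1., 1., -1., -1., -1., -1.])
w, b, alpha = soft_margin_svm(X, y, C=1.0)
```

With a small enough step size the iteration converges on this toy problem; a production solver would instead use a quadratic programming routine that enforces the equality constraint exactly.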

### Support Vectors in Nonlinear Feature Spaces

When the training error is large in the input space, we may want to work with nonlinear functions of the data in high-dimensional feature spaces where the classes are linearly separable, using a map $\phi : x_i \mapsto z_i$. Structural risk minimization allows us to control the growth of computations while still gaining the advantage that the increased richness of our hypothesis space offers in terms of classifying the training set, since we can control the growth of the VC dimension. The key is to work exclusively within a dot product space and to find a nonlinear kernel representation that allows us to compute the dot product in the feature space without explicitly mapping into it. A 'kernel' is a function $K$ such that for all $x, z \in X$

$$K(x, z) = \langle \phi(x), \phi(z) \rangle$$

With such a kernel, we can maximize the target function and evaluate the decision rule entirely in terms of the kernel. Since we never explicitly map the data into the new space, once the data are transformed we perform no more calculations than we would have in the original space. This yields a decision function of the form

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{l} \alpha_i y_i k(x, x_i) + b\right) \qquad [36]$$

Hence, all of the above-developed theory for the linear machine applies to nonlinear machines by using a valid kernel $K$ instead of the Euclidean dot product. Therefore, to calculate the theoretical bound on the prediction error, we simply calculate $R$, the radius of the smallest ball $B_R(a) = \{z \in F : \|z - a\| < R\}$ containing the transformed data points $z_1, \ldots, z_l$, and control our VC dimension by choosing the optimal subset of the hypothesis space.
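Because distances in $F$ expand into dot products, such a radius can be estimated from the Gram matrix alone, without ever mapping the data. A small numpy sketch of my own (it centers the ball at the mean of the mapped points, a simple stand-in for the true smallest enclosing ball, which requires a dedicated solver):

```python
import numpy as np

def radius_sq(K):
    """Squared radius of the ball centered at the mean of the mapped points,
    computed purely from the Gram matrix K[i, j] = k(x_i, x_j):
    ||z_i - c||^2 = K_ii - (2/n) * sum_j K_ij + (1/n^2) * sum_jk K_jk."""
    n = K.shape[0]
    dists = np.diag(K) - 2.0 * K.sum(axis=1) / n + K.sum() / n**2
    return float(dists.max())

# four unit vectors with the linear kernel: their mean is the origin,
# so every mapped point lies at squared distance exactly 1
X = np.array([[1., 0.], [0., 1.], [-1., 0.], [0., -1.]])
r2 = radius_sq(X @ X.T)
```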

One important aspect of using kernels is that we do not need to know the underlying feature map in order to be able to learn in the feature space. If we can identify that $K(x, z)$ is a kernel for some map $\phi$ into some dot product space $F$, we do not need to actually identify the $\phi$ which produced it. We can in fact identify any given function $K(\cdot, \cdot)$ as a kernel if it has certain properties.

Theorem 2. Let $X$ be a finite input space with $K(x, z)$ a symmetric function on $X$. Then $K(x, z)$ is a kernel function if and only if the matrix

$$K = \big(K(x_i, x_j)\big)_{i,j=1}^{n}$$

is positive semidefinite (has non-negative eigenvalues).
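Theorem 2 gives a direct numerical test: form the Gram matrix of a candidate function on a finite set of points and check symmetry and the sign of its eigenvalues. A minimal numpy sketch (the helper name is mine):

```python
import numpy as np

def is_kernel(k, points, tol=1e-10):
    """Check Theorem 2 on a finite point set: a valid kernel must produce
    a symmetric Gram matrix with no negative eigenvalues."""
    G = np.array([[k(x, z) for z in points] for x in points])
    if not np.allclose(G, G.T):
        return False
    return bool(np.linalg.eigvalsh(G).min() >= -tol)

pts = [np.array(p) for p in [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]]
poly2 = lambda x, z: (x @ z) ** 2      # degree-2 polynomial kernel: valid
neg_dot = lambda x, z: -(x @ z)        # negated dot product: not a kernel
```

The tolerance absorbs the tiny negative eigenvalues that floating-point arithmetic can introduce for a genuinely positive semidefinite matrix.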

As an example, consider polynomials of degree 2 on points in $\mathbb{R}^2$. Then we have

$$\phi : (x_1, x_2) \in \mathbb{R}^2 \mapsto (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2) \in F \subseteq \mathbb{R}^3 \qquad [38]$$

Then the dot product in $F$ takes the form

$$\langle \phi(x), \phi(z) \rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x \cdot z)^2$$

This result also holds in general for polynomials of degree $d$ in $n$ variables.

Proposition 1. Define $\phi$ as the map which takes $x \in X$ to $z \in F$, where the entries of the vector $z$ are all possible $d$th-degree ordered products of the entries of $x$. Then the corresponding kernel that computes the dot product in $F$ is

$$K(x, z) = \langle \phi(x), \phi(z) \rangle = (x \cdot z)^d$$
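Proposition 1 can be checked numerically: enumerating all ordered degree-$d$ index tuples with itertools builds the explicit feature vector, and its dot product reproduces the kernel value exactly.

```python
import numpy as np
from itertools import product

def phi(x, d):
    """The map of Proposition 1: one coordinate per ordered d-tuple of
    indices, holding the product of the corresponding entries of x."""
    return np.array([np.prod([x[i] for i in idx])
                     for idx in product(range(len(x)), repeat=d)])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
d = 3
explicit = phi(x, d) @ phi(z, d)   # dot product computed in F (27 coordinates)
kernel = (x @ z) ** d              # the same value from the kernel alone
```

Note the computational point the text makes: the explicit map needs $n^d$ coordinates, while the kernel computes the same dot product with a single $n$-dimensional dot product and one exponentiation.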

For classifiers such as the radial basis function, $k(x, x_i) = \exp(-\|x - x_i\|^2 / c)$, and the neural network kernel, $k(x, x_i) = \tanh(\kappa\,(x \cdot x_i) + \Theta)$, we know that since these functions satisfy the definition of a kernel, there is a corresponding dot product space where our theory based on linear functions holds, and therefore that the problem has a solution.
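Both kernels can be written down directly from the formulas above; the parameter names `c`, `kappa`, and `theta` mirror the text.

```python
import numpy as np

def rbf_kernel(x, xi, c=1.0):
    """Radial basis function kernel: exp(-||x - xi||^2 / c)."""
    return float(np.exp(-np.sum((x - xi) ** 2) / c))

def nn_kernel(x, xi, kappa=1.0, theta=0.0):
    """'Neural network' (sigmoid) kernel: tanh(kappa * (x . xi) + theta).
    Caveat: this one is positive semidefinite only for some parameter choices."""
    return float(np.tanh(kappa * (x @ xi) + theta))
```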

Figure 2 shows the margins, support vectors, and leave-one-out cross-validation errors for an iris data set. The data are taken from the well-known iris benchmark, available from the UCI Repository of Machine Learning Databases (http://www.ics.uci.edu/~mlearn/MLRepository.html).

### Leave-One-Out Cross-Validation

Figure 2 Support vector classification using the radial basis function to separate one iris variety from two others based on petal width and length (panel title: 'Distinguishing the IRIS versicolor variety from setosa and virginica'; the plot marks the two classes, support vectors, class margins, decision boundary, and leave-one-out errors).

Figure 3 The margins, support vectors, and classification errors for the full data set of two bivariate normals (panel title: 'SVM separation of a bivariate N(5, 2.5, 0) and a bivariate N(1, 1, 0)'; the plot marks the two classes, support vectors, and decision boundary).

In practice, the theoretical bounds (often called probably approximately correct, or PAC, bounds) are too loose to

be helpful. (This author has seen PAC bounds of the form ''There is 95% confidence that the prediction error is no larger than 100%!'') A tighter bound on the prediction error is possible using leave-one-out cross-validation. Traditionally, leave-one-out cross-validation involves removing one data point, fitting the model, and using the left-out point as a test point to see how well the classifier performs on data not seen during training. This is done for each observation in the data set, and the fraction of misclassified test points is an estimate of the prediction error. When using an SVM, only the support vectors are actually relevant to the classification: only when the left-out point is a support vector will the classification of the remaining data change. Hence, when cross-validating an SVM, one only needs to perform cross-validation on the support vectors. Whatever errors were (or were not) made on the training set remain exactly the same until one of the support vectors is removed as a test point. So the estimate of the prediction error is obtained by adding the number of training points misclassified during training to the number of support vectors misclassified during cross-validation and then dividing by 2n. For large data sets, the savings in computational effort can be enormous. See Figures 3-5 for an example with two-dimensional data that are not perfectly separable. Notice the change in the decision rule and the size of the margins as support vectors from each class are removed.
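This shortcut can be sketched in code. The following is my own minimal illustration, not the article's implementation: it reuses a tiny projected-gradient dual trainer (bias omitted for brevity, so it implicitly assumes roughly origin-centered classes) and refits only when the held-out point is a support vector.

```python
import numpy as np

def train(X, y, C=1.0, lr=1e-3, epochs=3000):
    """Minimal linear soft-margin SVM: projected gradient ascent on the dual.
    A sketch only -- the bias term is omitted for brevity."""
    K = X @ X.T
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        alpha = np.clip(alpha + lr * (1.0 - y * (K @ (alpha * y))), 0.0, C)
    w = (alpha * y) @ X
    return w, alpha > 1e-6             # weights and support-vector mask

def loo_error(X, y, C=1.0):
    """Leave-one-out error, refitting only for support vectors: removing a
    non-support vector cannot change the fitted classifier."""
    w, sv = train(X, y, C)
    n, errors = len(y), 0
    for i in range(n):
        if sv[i]:                      # only support vectors require a refit
            mask = np.arange(n) != i
            w_i, _ = train(X[mask], y[mask], C)
            errors += np.sign(X[i] @ w_i) != y[i]
        else:                          # model unchanged without this point
            errors += np.sign(X[i] @ w) != y[i]
    return errors / n

# toy data: two separable clusters placed symmetrically about the origin
X = np.array([[2., 2.], [3., 3.], [2., 3.], [3., 2.],
              [-2., -2.], [-3., -3.], [-2., -3.], [-3., -2.]])
y = np.array([1., 1., 1., 1., -1., -1., -1., -1.])
err = loo_error(X, y)
```

On a data set with few support vectors relative to its size, the loop above performs only a handful of refits instead of one per observation, which is the computational saving the text describes.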

SVMs provide a powerful tool to ecologists to classify complicated data sets and predict important ecological parameters from high-dimensional, noisy, and ill-behaved data and can be applied to a wide variety of ecological problems.



Figure 4 The margins, support vectors, and classification errors when a class 1 support vector is left out during cross-validation (panel title: 'Cross-validation of SVM separation of two bivariate normals').


Figure 5 The margins, support vectors, and classification errors when a class 2 support vector is left out during cross-validation.

See also: Artificial Neural Networks: Temporal Networks; Multilayer Perceptron.


Relevant Website: http://mlearn.ics.uci.edu (UCI Machine Learning Repository, UCI Machine Learning Group).