## Data Mining Algorithms

The previous section described several types of patterns that can be found in data. This section outlines some basic algorithms that can be used to find such patterns in data. In most cases, this involves heuristic search through the space of possible patterns of the selected form.

Figure 3 A regression tree and the partition of the data space induced by the tree. The tree predicts the value of the variable Total from the variables Age and Income.

Figure 4 A partition of the data space induced by an ordered list of rules, derived from the data in Table 1. The shaded box corresponds to the first rule in the list IF Income < 102 000 AND Age < 58 THEN BigSpender=No, while the remainder of the data space is covered by the default rule BigSpender=Yes.

Table 2 An ordered (top) and an unordered (bottom) set of classification rules derived from the data in Table 1

Ordered rules

IF Age < 60 AND Income < 81 000 THEN BigSpender=No
ELSE IF Age > 42 THEN BigSpender=Yes
ELSE IF Income > 113 500 THEN BigSpender=Yes
ELSE DEFAULT BigSpender=No

Unordered rules

IF Income > 108 000 THEN BigSpender=Yes
IF Age > 49 AND Income > 57 000 THEN BigSpender=Yes
IF Age < 56 AND Income < 98 500 THEN BigSpender=No
IF Income < 51 000 THEN BigSpender=No
IF 33 < Age < 42 THEN BigSpender=No
DEFAULT BigSpender=Yes
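The difference between the two kinds of rule sets can be sketched in code. The thresholds below are transcribed from Table 2; the function names and the majority-vote conflict-resolution scheme for the unordered rules are illustrative assumptions, not part of the original table.

```python
def classify_ordered(age, income):
    """Ordered rules from Table 2: the first matching rule fires."""
    if age < 60 and income < 81_000:
        return "No"
    elif age > 42:
        return "Yes"
    elif income > 113_500:
        return "Yes"
    else:
        return "No"  # default rule

def classify_unordered(age, income):
    """Unordered rules from Table 2: collect every matching rule,
    then resolve conflicts (here by majority vote, an illustrative choice)."""
    votes = []
    if income > 108_000:
        votes.append("Yes")
    if age > 49 and income > 57_000:
        votes.append("Yes")
    if age < 56 and income < 98_500:
        votes.append("No")
    if income < 51_000:
        votes.append("No")
    if 33 < age < 42:
        votes.append("No")
    if not votes:
        return "Yes"  # default rule
    return max(set(votes), key=votes.count)
```

Note that reordering the clauses in `classify_ordered` would change its predictions, while the rules in `classify_unordered` can be evaluated in any order.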

### Linear and Multiple Regression

Linear regression is the simplest form of regression. Bivariate linear regression assumes that the class variable can be expressed as a linear function of one attribute, that is, $C = \alpha + \beta \times A$. Given a set of data, the coefficients $\alpha$ and $\beta$ can be calculated using the method of least squares, which minimizes the squared error $\sum_i (c_i - \alpha - \beta a_i)^2$ between the measured values for $C$ ($c_i$) and the values calculated from the measured values for $A$ ($a_i$) using the above equation. We have

$$\beta = \frac{\sum_i (a_i - \bar{a})(c_i - \bar{c})}{\sum_i (a_i - \bar{a})^2}, \qquad \alpha = \bar{c} - \beta \bar{a}$$

where $\bar{a}$ is the average of $a_1, \ldots, a_n$ and $\bar{c}$ is the average of $c_1, \ldots, c_n$.
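The two formulas above translate directly into code; the following is a minimal sketch (the function name is illustrative):

```python
def bivariate_regression(a, c):
    """Least-squares estimates of alpha and beta for c = alpha + beta * a."""
    n = len(a)
    a_bar = sum(a) / n
    c_bar = sum(c) / n
    # beta = sum (a_i - a_bar)(c_i - c_bar) / sum (a_i - a_bar)^2
    beta = (sum((ai - a_bar) * (ci - c_bar) for ai, ci in zip(a, c))
            / sum((ai - a_bar) ** 2 for ai in a))
    # alpha = c_bar - beta * a_bar
    alpha = c_bar - beta * a_bar
    return alpha, beta
```

For data lying exactly on the line $c = 1 + 2a$, such as `a = [1, 2, 3, 4]`, `c = [3, 5, 7, 9]`, the function recovers $\alpha = 1$ and $\beta = 2$.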

Multiple regression extends linear regression to allow the use of more than one attribute. The class variable can thus be expressed as a linear function of a multidimensional attribute vector, that is, $C = \sum_{j=1}^{n} \beta_j \times A_j$. This form assumes that the dependent variable and the independent variables have mean values of zero (which is achieved by transforming the variables: the mean value of a variable is subtracted from each measured value for that variable). The method of least squares can also be applied to find the coefficients $\beta_j$. If we write the equation $C = \sum_{j=1}^{n} \beta_j \times A_j$ in matrix form as $C = \beta A$, where $C = (c_1, \ldots, c_n)$ is the vector of measured values for the dependent variable and $A$ is the matrix of measured values for the independent variables, we can calculate the vector of coefficients $\beta$ as

$$\beta = (A^T A)^{-1} A^T C$$
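A minimal sketch of this computation in pure Python, under the zero-mean assumption stated above (the helper names are illustrative, and the normal equations $A^T A \beta = A^T C$ are solved by elimination rather than explicitly inverting $A^T A$, which is the numerically preferable route):

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(aug[r][i]))
        aug[i], aug[p] = aug[p], aug[i]
        for r in range(i + 1, n):
            f = aug[r][i] / aug[i][i]
            for k in range(i, n + 1):
                aug[r][k] -= f * aug[i][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i][n] - sum(aug[i][k] * x[k] for k in range(i + 1, n))) / aug[i][i]
    return x

def multiple_regression(A, C):
    """Least-squares coefficients beta from A^T A beta = A^T C."""
    At = transpose(A)
    AtA = matmul(At, A)
    AtC = [sum(At[j][k] * C[k] for k in range(len(C))) for j in range(len(At))]
    return solve(AtA, AtC)
```

For example, rows generated by $c = 2 a_1 + 3 a_2$ yield the coefficient vector $(2, 3)$.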

where the operations of matrix transposition $A^T$ and matrix inversion are used. The use of nonlinear transformations of the attributes, such as $A_i' = A_i^2$, $i = 1, \ldots, n$, allows nonlinear models to be found by using multiple regression: such models are linear in the parameters.

Note that both for linear and multiple regression, the coefficients $\alpha$, $\beta$, and $\beta_j$ can be calculated directly from a formula and no search through the space of possible equations takes place. Equation discovery approaches, which do not assume a particular functional form, search through a space of possible functional forms and look both for an appropriate structure and coefficients of the equation.

Linear regression is normally used to predict a continuous class, but it can also be used to predict a discrete class. Generalized linear models, of which logistic regression is a typical representative, can be used for this purpose. The fitting of generalized linear models is currently the most frequently applied statistical technique.
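The flavor of logistic regression can be conveyed with a small sketch. This fits $P(\text{class}=1) = \sigma(\beta_0 + \sum_j \beta_j a_j)$ by stochastic gradient descent; the fitting method and all names here are illustrative simplifications (statistical packages typically use iteratively reweighted least squares instead).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(A, y, lr=0.1, epochs=2000):
    """Fit logistic-regression coefficients by stochastic gradient ascent
    on the log-likelihood (an illustrative fitting method)."""
    b0, b = 0.0, [0.0] * len(A[0])
    for _ in range(epochs):
        for row, target in zip(A, y):
            p = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, row)))
            err = target - p          # gradient of the log-likelihood
            b0 += lr * err
            b = [bj + lr * err * xj for bj, xj in zip(b, row)]
    return b0, b

def predict(b0, b, row):
    """Predict the discrete class by thresholding the probability at 0.5."""
    return 1 if sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, row))) >= 0.5 else 0
```

On linearly separable one-attribute data, e.g. classes 0 at $a \in \{0, 1, 2\}$ and 1 at $a \in \{8, 9, 10\}$, the fitted model assigns the training points to the correct classes.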