Simple and multiple linear regression models

Simple linear regression involves modeling the relationship between one continuous response variable and one continuous predictor using a linear equation. Multiple linear regression employs more than one predictor variable. It is a common error to think that the method is called 'linear' regression because a plot of the relation between the response and predictors should yield a straight line. This is not the case: the relation is assumed to be a linear function of the model's parameters, not necessarily the predictors. This can be seen in the way a simple linear regression model is typically written as y = a + fix + e [1]

In this equation, a and fi are the model parameters, and e is a vector representing the variation in y that is not predicted or 'explained' by the rest of the right-hand side. The equation would still be a linear regression model if a second variable were added which was the square of x:

because the right-hand side is still a linear combination of the parameters a, fi1, and fi2. In fact, any linear combination of the functions f(x), g(x),... is a linear regression model, as long as these functions are prespecified and not estimated simultaneously with the regression parameters. As eqn [2] also shows, when more than one variable is used in a regression model, separate corresponding values of fi are usually assumed.

Understanding the assumptions that are made with respect to the error term, e, is critical to estimating and applying regression models. It is generally assumed that the errors:

• are independent,

• follow a normal (or Gaussian) distribution, and

• have a variance which is constant with respect to the predictors.

Predictor variable, x

Figure 1 A simple linear regression model with solid line representing the conditional mean prediction, dashed lines representing uncertainty in the location of the mean due to uncertainty in the intercept a and slope fi, and dotted lines representing uncertainty in the value of a new prediction for y due to both uncertainty in the mean and residual error.

Predictor variable, x

Figure 1 A simple linear regression model with solid line representing the conditional mean prediction, dashed lines representing uncertainty in the location of the mean due to uncertainty in the intercept a and slope fi, and dotted lines representing uncertainty in the value of a new prediction for y due to both uncertainty in the mean and residual error.

These assumptions imply that the parameters a and fi can be estimated from data on x and y by following the method of 'ordinary least squares'. This consists of finding the values of a and fi that minimize the sum of the squared residuals (defined as the difference between measured values ofy and values predicted by the linear equation involving x) across all data points. In practice, this minimization problem is solved as a straightforward algebraic operation on matrices containing the measured data.

For the purposes of prediction, the above assumptions imply that, conditional on a value of x, y is normally distributed with a mean value, or expectation, of

and a variance equal to the variance ofthe error term, e. The mean value itself may also be uncertain, due to uncertainty in estimating a and fi. This uncertainty is described by the 'standard errors' of a and fi, which one should not confuse with the 'standard deviation' (the square root of the variance) of the model residuals. The total uncertainty in predicting the value ofy corresponding to a new measurement of x is the sum (expressed in terms of the variance) of these two contributions (Figure 1).

When the predictors, x, of a linear model consist entirely of categorical variables, and yet the response, y, remains continuous, the model is referred to as an 'Analysis of Variance' (ANOVA) model. If an additional, continuous predictor is also included, and it is assumed that the effect ofthat covariate on the response is equal for all categories, the model is referred to as an 'Analysis of Covariance' (ANCOVA) model. Traditionally, the use of ANOVA

and ANCOVA-type models has focused on hypothesis testing, but the models can also be used for prediction. There are three basic classes of ANOVA models:

1. 'Fixed effects models' in which the data are assumed to come from normallydistributed populations which differ only in their means. In these models, the parameter P is simply regarded as a vector of constants.

2. 'Random effects models' in which the data are assumed to come from a hierarchy of populations, and differences are constrained by the hierarchy. In these models, P is treated as a vector of random variables.

3. 'Mixed effects models' in which both fixed and random effects are present.

Because the random components in ANOVA and ANCOVA models are assumed to be normally distributed, the methods used to fit and apply simple and multiple linear regression models can continue to be used.

A convenient and flexible way to proceed when we have a mix of categorical and continuous predictors is to represent the categorical variables by 'indicator' or 'dummy' variables. If there are only two possible categories of a variable, then a single indicator variable is required that takes the value of zero or one. If there are c categories, then c-1 indicator variables are necessary. As with ANOVA, the basic methods used to fit and apply simple linear regression models apply to indicator variables regression as well. In this case, the coefficients on the indicator variables represent the differential effects of each category compared to a reference category in which all indicator variables have a value of zero. The model can be written so that the differential effects occur with respect to either the intercept or the slope of the linear relation.

There may be cases in which a variable is categorical, but the categories have a natural ordering (as opposed to the unordered categories assumed for ANOVA and the type of indicator variables regression described in the previous paragraph). For example, categories may consist of the levels 'low', 'medium', and 'high', or even of multiple discrete levels of a measured continuous quantity. In such cases, the categorical variable can simply be treated as a continuous variable, and standard simple and multiple linear regression methods can be applied. However, this requires assigning numerical scores to the categories that are spaced at intervals that can be expected to lead to a linear effect on the response variable. For example, if the effect on the response variable in going from 'low' to 'medium', can be expected to be the same as in going from 'medium' to 'high', then equally spaced scores can be assigned to the three levels, such as 1,2, and 3. If the effect of the latter is expected to be twice as great as that of the former, then levels such as 1, 2, and 4 can be used. Such a renumbering is simply an example of the type of transformation of variables that is often used in regression analysis to approximate a linear relation.

Using numerical scores to represent ordered categories is generally a better way to proceed for the purposes of predictive modeling than ANOVA. The method is more efficient in that it requires fewer model parameters. Therefore, it is a more powerful statistical approach in the sense of being able to detect an effect when one is present. It also provides quantitative results that are more useful for prediction because the relation between the levels of a categorical variable is more clearly defined. However, this method also assumes a linear relationship between y and (the possibly transformed) x, which is not assumed by ANOVA. Therefore, as with any linear models, the assumptions of linearity must be carefully considered.

If the predictors, x, of a linear model are continuous, but the response, y, is binary, the model is called a 'logistic regression'. In such a case, the error term e is assumed to be binomially distributed (rather than normal), and the model is usually written as logit(^;) = ln

= a + ßixii h-----h ßkXk,, where i = 1,..., n, and k is the probability of one of the two possible outcomes for y.

Estimation of the parameters of a logistic regression model proceeds by the method of 'maximum likelihood', in which values of the parameters a and P are chosen which maximize the likelihood of the measured data, given the assumptions of the model (i.e., the linear model form and the binomial error distribution). Detailed description of the required algorithms can be found in the sources recommended for further reading.

When used for prediction, logistic regression models produce a probability of the selected outcome resulting from specified values of the predictor variable, x.

As should be clear by now, the basic linear model (eqn [1]) can be generalized to suit a variety of settings, depending on the type and number of variables and the assumptions that are made with respect to the error term. In fact, if we simply state that the response variable y consists of a mean (or expected value) component described by

(where g is called the 'link function'), and a random component which is a function of the mean

then we have what is called the generalized linear model (GLM).

Predictor variables |
Response variable | |

Categorical |
Continuous | |

Categorical |
Contingency tables and log-linear regression |
ANOVA and indicator variables regression |

Continuous |
Binomial or multinomial regression |
Simple or multiple regression |

Both |
Generalized logistic regression |
ANCOVA and indicator variables regression |

Adapted from Dobson A (1999) An Introduction to Generalized Linear Models. Boca Raton, FL: Chapman and Hall/CRC Press.

Adapted from Dobson A (1999) An Introduction to Generalized Linear Models. Boca Raton, FL: Chapman and Hall/CRC Press.

When the link function, g, is identity, f follows from the normal distribution, and y is continuous, we have the simple or multiple regression models. When y is binary, g is the logit function, and f is binomial, we have a logistic regression model. The probit model, popular for estimating dose-response relationships in toxicology, results from the same assumptions but with a normal link function. In fact, a wide variety of functions can be used as the link and distributional forms. Contingency tables, log-linear regression, survival analysis, and heteroscedas-tic (i.e., nonconstant variance) error models, for example, are all easily conceived as GLMs (Table 1).

From a practical standpoint, it is easier to estimate the parameters of a GLM if the variance function f follows from the exponential family - which actually covers a wide range of useful distributions. This fitting usually follows the procedure of maximum likelihood estimation, of which the least-squares method, described above, is a special case.

Once the link function and distributional forms of a GLM have been specified and the model parameters have been estimated, predictive probability distributions of y for any specified values of x follow naturally from eqns [4] and [5].

Although, technically, the predicted conditional mean and variance from eqns [4] and [5] can be used to predict any conditional quantiles of the response variable distribution, it may also be appropriate to estimate functions for desired quantiles directly. This has become an increasingly popular practice for ecologists to model the effects of factors believed to serve as constraints on populations. For example, estimates of the response of the 0.95 quantile of population density to a measured predictor variable are generally considered to be a better estimate of the effect of that variable as a limiting factor than estimates of the response of the mean. This is because other, unmeasured factors may sometimes be the active limiting constraint, producing a 'wedge-shaped' pattern of unequal variation in the response variable through an interaction with the measured factors (Figure 2). Therefore, estimating the upper edge of this wedge is often the focus of quantile regression.

Figure 2 A quantile regression model of average annual acorn biomass versus an index of acorn suitability based on oak forest characteristics. Solid lines represent seven selected quantile estimates, and the dashed line (overlain by the 50th quantile) represents the traditional, least-squares regression estimate of the mean. Reproduced from Cade BS, Terrell JW, and Schroeder RJ (1999) Estimating effects of limiting factors with quantiles. Ecology 80: 311-323.

Figure 2 A quantile regression model of average annual acorn biomass versus an index of acorn suitability based on oak forest characteristics. Solid lines represent seven selected quantile estimates, and the dashed line (overlain by the 50th quantile) represents the traditional, least-squares regression estimate of the mean. Reproduced from Cade BS, Terrell JW, and Schroeder RJ (1999) Estimating effects of limiting factors with quantiles. Ecology 80: 311-323.

The pth linear regression quantile is a linear function such that approximately p proportion of the observations are below, and 1 -p proportion are above the estimated line. If a factor is a limiting constraint on a population, the slopes of the regression quantiles will increase in absolute value with higher quantiles. The upper regression quan-tiles will have a slope that approximates the change in the response variable when the predictor variable is the active constraint. Traditional linear regression, which focuses on the slope of the mean, would not necessarily be able to detect such relations.

Estimating the parameters of a quantile regression model is not simply a matter of conventional least-squares or maximum likelihood optimization. Rather, estimates are obtained by minimizing a weighted loss function of the residuals in which positive residuals are given weight p and negative residuals are given weight 1 -p. Fortunately, routines to perform quantile regression are now available for many popular statistical software packages.

Compared to GLM, quantile regression has the advantage that a specification of how changes in the variance of y are linked to the mean of y is not required, nor is one practically restricted to the exponential family of distributions. In fact, methods have been established for detecting precisely how the shape of the distribution of y changes across values of the predictor variables. This can result in more accurate probabilistic predictions of the response variable than other regression-based methods which make stricter distributional assumptions.

Of course, the linear model can be further generalized to include nonlinear functions as y = f (x, 0) + e where f is a nonlinear function of the predictors, x, and model parameters, 0. Unfortunately, even when the assumption of independent, normally distributed, mean zero errors is made for nonlinear models, there is no matrix algebraic expression to determine the best-fitting parameter estimates, as there is for linear regression. Numerical optimization routines are usually required, and there may actually be multiple local optima depending on model form.

In some cases, nonlinear equations can be linearized through appropriate transformations. For example, taking the logarithm of the equation leads to the linear equation log y — log a + bx in which estimates of log a and b can be made by standard linear regression. However, it should be kept in mind that the error term will have a different distributional assumption in the two equations, leading to a potentially unintended weighting scheme on the data points.

Fortunately, most spreadsheet programs and statistical software packages include routines for nonlinear optimization, allowing ecological modelers to easily test a variety of model forms.

Generalized additive models

As a further generalization, one can write a model in which the mean of the response variable y is the sum of arbitrary, nonlinear functions of x:

E (y\x) — a + f (xi) + f,X) + ••• + fk (xk) [6]

These functions are estimated from the data and can even be nonparametric. Often such a generalized additive model (GAM) will use a scatterplot smoothing function, such as a locally weighted mean, for the function f This usually provides a better fit to the data than other types of models, but can lead to problems of overfitting and difficulty with interpretation. Cross-validation procedures can be used to help avoid overfitting problems.

Often a constructive way to proceed is to use nonpara-metric GAMs to identify the dominant relationships in data and then use parametric models such as GLMs for final model-fitting and prediction. GLMs can more efficiently communicate information about the model and may be more readily interpretable ecologically.

Was this article helpful?

Get All The Support And Guidance You Need To Be A Success At Helping Save The Earth. This Book Is One Of The Most Valuable Resources In The World When It Comes To How To Recycle to Create a Better Future for Our Children.

## Post a comment