# Specifying the Conditional Probabilities

Once a graphical model is drawn that has a causal structure and satisfies the Markov property, it defines the appropriate factorization of the joint distribution of all variables in the system. 'Appropriate' means that: (1) conditional distributions characterizing the relationships between a variable and its parents will be stable to changes in other variables and relationships; and (2) as a logical consequence, the full network can be modularized to allow the characterization of individual subnetworks to proceed independently without regard to the broader context. This implies that each subnetwork can be specified using an approach suitable for the type and scale of information available.

Specification of the conditional probabilities can proceed in several ways depending on the properties of the variables involved and the nature of the knowledge being brought to bear. Most examples of BN modeling in ecology have used either inherently discrete variables or continuous variables that have been discretized into a finite number of categories for representation in the network. However, this may have more to do with convenience and ease of interpretation than fidelity to the true properties of the system. When all variables in a network are discrete, then the network relationships are specified by conditional probability tables for each node that provide the probability of it being in a particular state (or category), given any combination of states of its parents. This has the advantage of being fairly easy to interpret when the parents are set to particular states, but when many different states are possible, the number of probabilities required to fill out the table quickly becomes prohibitive. For example, if T, H, and K in Figure 1 are each assumed to have only three possible states, then specifying the full conditional probability table for K would require 3³ = 27 probabilities (although 9 of these could be inferred from the law of total probability). Estimating this many conditional probabilities from either data or expert judgment is a demanding task.
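The combinatorial growth described above is easy to make concrete. The short sketch below is purely illustrative (the function name and variable layout are not from the text); it counts the entries in a full conditional probability table for a node like K with two three-state parents:

```python
def cpt_size(n_states, parent_states):
    """Number of probabilities in a full conditional probability table:
    one probability per child state for every combination of parent states."""
    combos = 1
    for s in parent_states:
        combos *= s
    return n_states * combos

# K with 3 states and parents T and H, each with 3 states: 3 * 3 * 3 = 27
# entries. One entry per parent combination (9 here) is implied by each
# row summing to 1, leaving 18 independent probabilities to assess.
total = cpt_size(3, [3, 3])
free = total - 3 * 3
```

Adding a third three-state parent would triple the table to 81 entries, which is why CPT size, not network size, is often the practical bottleneck.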

Representing all variables in a network model as being discrete does have the advantage that software is readily available to handle all possible calculations one would want to make with such a model. However, discretizing variables that are inherently continuous introduces a degree of imprecision into the model that would otherwise not exist. This is because of the vagueness that arises from assigning all values within a specified range of a continuous variable to the same discrete state. For example, we might define nutrient concentration (N) in Figure 1 to have three possible states, corresponding to the ranges of 0–10, 10–40, and >40 µg l⁻¹, respectively, and then specify the probability of various levels of algal density (A) conditional on each of these states. The inability of the probability table to distinguish between the different values for A likely to result from a value for N of 11 µg l⁻¹ compared to a value of 39 µg l⁻¹ adds significant imprecision to model predictions and inferences.

Another problem associated with discretization is that it encourages vagueness in variable definitions. For example, many BN studies have been published that define the states of variables only as 'low', 'medium', and 'high', without giving precise quantitative definitions. This is unacceptable, as it opens up the possibility for model developers or users to have very different ideas of what the variable and its different states represent, which can lead to errors in assessing the probabilities required by the model or in applying the results for decision making. The clarity test provides confirmation that variables and states have been defined in adequate detail. To implement the test, one imagines that, at some point in the future, perfect information will be available regarding all aspects of the system. Would it then be possible to determine unequivocally the state of every node in the network, without any interpretation or judgment? If not, then further specificity is required.

A satisfying alternative to constructing BNs entirely of discrete variables related by conditional probability tables is to use continuous variables when appropriate, connected by functional equations. Probabilities are introduced through the assumption that certain variables or parameters in the equations are uncertain or unobserved. In many ways, this is more consistent with the semideterministic way that causal models are conceived in biology, physics, and engineering. In its most general form, a probabilistic functional equation for a network variable x_i consists of an equation of the form

x_i = f_i(pa_i, u_i)

where pa_i denotes the parents of x_i, and u_i the disturbances caused by omitted variables or random (e.g., measurement) errors. This conceptualization can be considered a nonlinear, nonparametric version of the more familiar linear structural equation models (SEMs). It has been shown that for every BN characterized by some distribution P, there exists a functional model of the form above that generates a distribution identical to P. In other words, characterizing each relationship as a functional equation, instead of a conditional probability P(x_i | pa_i), preserves all the valuable properties of a Markov model, and, as shown by Pearl (2000), this holds regardless of the choice of function f_i or error distribution P(u_i). This implies that for all applications of BNs, including synthesis, prediction, and inference, one can regard functional models as a legitimate way of specifying the conditional distributions.
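The functional formulation lends itself directly to forward simulation: each variable is generated by evaluating its equation on sampled parents and disturbances. The two-node sketch below (nutrients N feeding algae A) uses made-up linear functions and Gaussian disturbances purely to illustrate the mechanics; it is not a model from the text:

```python
import random

# Forward sampling from a functional Bayesian network: each variable is a
# deterministic function of its parents plus an independent disturbance u.
# The functions and parameter values are illustrative placeholders.
def sample_network(rng):
    u_n = rng.gauss(0.0, 1.0)   # disturbance on the root node
    n = 20.0 + 5.0 * u_n        # nutrient concentration N = f_N(u_N)
    u_a = rng.gauss(0.0, 0.3)   # disturbance on algae
    a = 0.5 * n + u_a           # algal density A = f_A(N, u_A)
    return n, a

rng = random.Random(42)
draws = [sample_network(rng) for _ in range(10000)]
```

Repeating such draws many times traces out exactly the joint distribution P that the equivalent conditional-probability specification would encode.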

As an example of the value of using a functional expression rather than a conditional probability table to describe the relationship between variables, consider how one might use measured data on nutrient concentration and algal density to characterize the relation between N and A in Figure 1. A linear regression fit to log-transformed data of Dillon and Rigler (1974) would provide the following functional form:

log(A) = β₀ + β₁ log(N) + U_A

where β₀ and β₁ are model coefficients and U_A is a normally distributed disturbance term with a mean of zero and a standard deviation derived from the residuals of the model fit. As described above, U_A might be represented as an explicit node in the model or as an implicit error term, depending on whether the omitted factors it represents also influence variables beyond A. The same holds true for the model coefficients β₀ and β₁, which also have uncertainty associated with them; they can either be treated as explicit nodes or implicit disturbance terms with distributions defined by the parameter means, standard errors, and correlations estimated by the regression procedure.
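A fit of this kind can be sketched in a few lines. The example below uses synthetic data in place of the Dillon and Rigler (1974) observations, with invented 'true' coefficients and error scale, so the numbers are illustrative only:

```python
import math
import random

# Fit log(A) = b0 + b1*log(N) + U_A by ordinary least squares on synthetic
# data. The generating coefficients below are placeholders, not values from
# Dillon and Rigler (1974).
rng = random.Random(0)
b0_true, b1_true, sigma = -0.5, 1.2, 0.4
n_vals = [rng.uniform(1.0, 100.0) for _ in range(200)]
log_n = [math.log(n) for n in n_vals]
log_a = [b0_true + b1_true * x + rng.gauss(0.0, sigma) for x in log_n]

# Closed-form OLS for a single predictor.
m = len(log_n)
xbar = sum(log_n) / m
ybar = sum(log_a) / m
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(log_n, log_a)) / \
     sum((x - xbar) ** 2 for x in log_n)
b0 = ybar - b1 * xbar
resid = [y - (b0 + b1 * x) for x, y in zip(log_n, log_a)]
s = math.sqrt(sum(r * r for r in resid) / (m - 2))  # residual standard error
```

The residual standard error `s` estimates the standard deviation of the disturbance U_A, while the standard errors of `b0` and `b1` (obtainable from the same sums of squares) characterize the coefficient uncertainty discussed above.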

An actual fit to data (Figure 4) suggests that the conditional probability distribution of A given a value for N of 11 µg l⁻¹, for example, can be appropriately represented by a lognormal distribution with median 2.5 µg l⁻¹ and standard deviation of 2.4 µg l⁻¹. If N were to be 39 µg l⁻¹, A would have a lognormal conditional distribution with median 15.5 µg l⁻¹ and standard deviation of 15.0 µg l⁻¹.
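Given such a conditional lognormal, the probability of A falling in any discrete category follows directly from the lognormal CDF. The log-scale parameters below are illustrative stand-ins, not the fitted values behind Figure 4:

```python
import math

def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal variable whose log has mean mu and sd sigma."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

# Illustrative log-scale parameters for A | N = 11: median exp(mu) = 2.5,
# spread sigma chosen arbitrarily for the sketch.
mu, sigma = math.log(2.5), 0.7
p_low = lognormal_cdf(2.0, mu, sigma)                       # P(A < 2)
p_med = lognormal_cdf(15.0, mu, sigma) - p_low              # P(2 <= A < 15)
p_high = 1.0 - lognormal_cdf(15.0, mu, sigma)               # P(A >= 15)
```

This is the calculation that produces the 'functional results' rows of Table 1: a different N shifts mu and hence the category probabilities, which the single contingency-table row cannot do.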

[Figure 4 appears here: a log-log scatter plot of summer chlorophyll (µg l⁻¹) against spring phosphorus (µg l⁻¹) with fitted regression line and interval bands.]

Figure 4 Linear regression fit to data on spring phosphorus concentration and summer chlorophyll level (as a measure of algal density) from 46 lakes. The solid line represents the mean prediction, dashed lines represent the 95% confidence interval in the mean resulting from uncertainty in model coefficients, and dotted lines represent the 95% predictive interval representing the full conditional distribution. Vertical and horizontal dashed lines represent the thresholds for categorical definitions, as described in the text.

Table 1 Conditional categorical frequencies and probabilities of summer chlorophyll (µg l⁻¹), given spring phosphorus

| Spring phosphorus (µg l⁻¹) | Low (<2) | Medium (2–15) | High (>15) |
|---|---|---|---|
| *Discrete variables* | | | |
| Low (<10) | 13/15 | 2/15 | 0/15 |
| Medium (10–40) | 6/18 | 10/18 | 2/18 |
| High (>40) | 0/13 | 1/13 | 12/13 |
| *Functional results* | | | |
| 11 | 0.37 | 0.63 | 0 |
| 39 | 0 | 0.51 | 0.49 |

Fractions given in the body of the table represent the conditional categorical frequencies of the data in Figure 4.


If, instead of representing N and A as continuous variables, we were to artificially discretize them into three categories, then conditional probabilities could be derived from the data using a two-way contingency table (Table 1). With thresholds of 10 and 40 µg l⁻¹ for spring phosphorus and 2 and 15 µg l⁻¹ for summer chlorophyll, such a table would predict a conditional probability distribution for A of low: 33%, medium: 56%, and high: 11%, regardless of whether N were 11 µg l⁻¹, 39 µg l⁻¹, or anywhere in between. This is a significantly less precise prediction than the functional results, which, even if A were to be discretized, would capture the difference between 11 and 39 µg l⁻¹ for N (Table 1).
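Deriving such a contingency-table CPT is mechanical once thresholds are fixed. The sketch below uses a handful of invented (N, A) pairs rather than the 46 lakes of Figure 4:

```python
# Building a conditional probability table for discretized variables by
# cross-tabulating (N, A) observations. The data pairs are made up for
# illustration only.
def categorize(value, thresholds):
    """Index of the category a value falls in, given ascending thresholds."""
    for i, t in enumerate(thresholds):
        if value < t:
            return i
    return len(thresholds)

data = [(5, 1.0), (8, 2.5), (12, 3.0), (25, 8.0), (30, 18.0), (55, 22.0)]
n_thresh, a_thresh = [10, 40], [2, 15]   # category boundaries for N and A

counts = [[0] * 3 for _ in range(3)]     # rows: N category; cols: A category
for n, a in data:
    counts[categorize(n, n_thresh)][categorize(a, a_thresh)] += 1

# Row-normalize the counts to obtain P(A category | N category).
cpt = [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]
```

Note that every N value in a row is pooled into one distribution over A, which is precisely the loss of precision the functional approach avoids.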

Regardless of whether variables in a BN are represented as continuous or discrete, there are a variety of ways to determine the appropriate conditional probabilities. Data-based statistical techniques, such as the regression or contingency-table approaches exemplified above, are one possibility. Another is to use the results of complex, process-based simulation models run externally to the BN, converted into reduced-form, response-surface approximations for use in BN specification. Of course, this requires a comprehensive uncertainty analysis to characterize the conditional probability distributions. This may not always be possible, but response-surface surrogate models can also help in this regard.

When data and process models are not available for specifying conditional probabilities, the carefully elicited judgment of subject-matter experts may be required. This approach is consistent with the Bayesian perspective on statistical inference and decision, which holds that probabilities are a useful way of expressing subjective degrees of belief. Established techniques exist for eliciting probability distributions from experts and help to ensure accurate and honest assessments. These distributions can then serve as Bayesian 'priors', to be formally updated according to Bayes's theorem as data become available or knowledge improves.
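As a minimal illustration of this updating step, consider a conjugate Beta-Binomial update of an elicited prior; the prior parameters and observation counts below are hypothetical:

```python
# Updating an elicited prior with data via Bayes's theorem: an expert's
# Beta(a, b) prior on a probability (say, that a lake exceeds an algal
# threshold), combined with hypothetical new observations. For binomial
# data the Beta posterior has a closed form (conjugate update).
prior_a, prior_b = 2.0, 8.0      # elicited prior: mean 0.2, fairly diffuse
successes, failures = 7, 13      # hypothetical new survey counts

post_a = prior_a + successes
post_b = prior_b + failures
post_mean = post_a / (post_a + post_b)
```

The posterior mean sits between the expert's prior mean and the observed frequency, with the data progressively dominating as more observations accumulate.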