## Learning the Structure

Learning the conditional probabilities of a BN, given the graphical structure and data, is referred to as parameter learning. Another type of learning in BNs is structure learning, which attempts to recover the causal structure that underlies a set of data based on the patterns of probabilistic dependence between variables. Many algorithms for learning BN structure employ the concept of d-separation (short for directed separation) to translate between probabilistic patterns and graphical structures. Graphically, two nodes (or sets of nodes) X and Y are said to be d-separated by another node (or set of nodes) Z if and only if (for details, the reader is referred to 'Further reading' section):

1. the path between X and Y contains a chain X! m ! Y or a fork X^ m ! Y such that the middle node m is a member of Z, or

2. the path between X and Ycontains an inverted fork (or collider) X! m ^ Y such that the middle node m is not in Z and such that no descendant of m is in Z.

The term path indicates a sequence of consecutive edges (of any directionality) connecting two nodes in a graph.

The notion of d-separation has important implications for the probability relations implied by a graph. Namely, if two nodes X and Y are d-separated by a node Z in a graphical model, then the corresponding variable X is independent of the variable Y, conditional on the variable Z. Conversely, if X and Y and not d-separated by Z, then X and Y are dependent conditional on Z. These implications are entirely general and do not depend on any assumptions about the distributional form of the variables or the functional form of the causal relationships.

The notion of a d-separating chain was used to establish the parental Markov property, which forms the basis for modularization of BNs. For example, in Figure 1 , d-separation implies that H is independent of N given A. The notion of a d-separating fork can be exemplified by the idea that T and H in Figure 1 are unconditionally dependent (they are both more likely to occur under conditions of high algal density), but become independent after conditioning on A (all correlation between the two is accounted for by A). Colliders indicate two causes having a common effect (such as the relation between H, T, and Kin Figure 1) and act the opposite way to forks. For example, if H and T are assumed to be independent conditional of A, they will become dependent once K is known. This is referred to as the explaining away effect because, given that a consequence has occurred, information about one of two possible individually sufficient causes tends to reduce our belief that the second cause occurred.

The d-separation criterion can be used to test the causal relations represented in a graphical model by confirming that all d-separation relations in the graph are mirrored by equivalent statistical independences in the observational data. If the test is negative, then the graph does not represent the underlying causal mechanisms generating the data. If the test is positive, then the causal structure cannot be rejected. However, there may be other structures that are also consistent with the patterns in the data. Such structures are termed observationally equivalent, and cannot be distinguished from the existing data alone.

The process of proposing and testing a large number of graphical models in order to learn the causal structure underlying a set of variables is not a very efficient procedure. Instead, structural learning algorithms have been developed that take data (or a statistical summary of data) as inputs and return the set of graphical structures that are consistent with those data as outputs. In other words, the algorithms construct the set of all directed acyclic graphs that show d-separation relations that can be supported by statistical dependences contained in the data. When all variables are observed and are either discrete or related by linear Gaussian models, reliable and efficient algorithms exist for recovering the causal structure. However, when some variables are unobserved, the situation is more complicated, for it is no longer clear that graphs consistent with the distributions of observed data must have a directed acyclic structure. Therefore, the available algorithms lead to graphs with four types of edges between any two nodes: (1) a definitively directed edge indicating genuine causation, (2) a possibly directed edge indicating potential causation (leaving open the possibility of a latent common cause), (3) a bidirected edge indicating spurious association (the existence of a latent common cause), or (4) an undirected edge indicating an undetermined relationship.

Distinguishing which of the four types of edges should be used to denote the relationship between two variables X and Y requires that a third variable Z exhibit a particular pattern of dependency with X and Y. This is consistent with the notion that causal claims are defined by their ability to correctly specify the behavior of X and Y under the influence of a third variable that corresponds to an external control on either X or Y. However, in the absence of experimental manipulation, the variable Z serves as a virtual control and must come from the observed data. Algorithms for generating partially directed graphs can be regarded as a systematic way for identifying the variables Z that qualify as virtual controls.

A partially directed graph that includes the four types of edges described above provides a concise way of revealing all the structures that are observationally equivalent with a particular set of data. When the directions of some edges remain ambiguous, additional tests can be performed to identify which additional observations or experiments are required to reveal the underlying causal structure.