
CJN is the primary author of the tutorial. JRB and DRW have advised on the biological examples. AJB and DRW have contributed their pedagogical knowledge on the topic. All authors have advised on the selection and presentation of the material.

Chris J. Needham and Andrew J. Bulpitt are with the School of Computing, University of Leeds, Leeds, United Kingdom. James R. Bradford and David R. Westhead are with the Institute of Molecular and Cellular Biology, University of Leeds, Leeds, United Kingdom.

The authors have declared that no competing interests exist.

Bayesian networks (BNs) provide a neat and compact representation for expressing joint probability distributions (JPDs) and for inference. They are becoming increasingly important in the biological sciences for the tasks of inferring cellular networks [

There are many applications in biology where we wish to classify data; for example, gene function prediction. To solve such problems, a set of rules is required that can be used for prediction, but often such knowledge is unavailable, or in practice there turn out to be so many exceptions to the rules, or so many rules, that this approach produces poor results.

Machine learning approaches often produce better results, where a large number of examples (the training set) is used to adapt the parameters of a model, which can then be used for performing predictions or classifications on data. There are many different types of models that may be required and many different approaches to training them, each with its pros and cons. An excellent overview of the topic can be found in [

In a graphical model representation, variables are represented by nodes that are connected together by edges representing relationships between variables.

Gene regulatory networks provide a natural example for BN application. Genes correspond to nodes in the network, and regulatory relationships between genes are shown by directed edges. In the simple example above, gene G1 regulates G2, G3, and G5, gene G2 regulates G4 and G5, and gene G3 regulates G5. The probability distribution for the expression levels of each gene is modelled by the BN parameters. Simplification results from the fact that the probability distribution for a gene depends only on its regulators (parents) in the network. For instance, the expression levels of G4 and G5 are related only because they share a common regulator G2. In mathematical terms, they are conditionally independent given G2. Such relationships lead to factorisation of the full JPD into component conditional distributions, where each variable depends only on its parents in the network.
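As a minimal sketch of this idea in Python, the example network can be written down as a mapping from each gene to its parents; the factorised JPD then follows mechanically from the parent sets (the gene names G1 to G5 and their regulatory edges are taken from the example above):

```python
# Parents of each gene in the example regulatory network:
# G1 regulates G2, G3, G5; G2 regulates G4, G5; G3 regulates G5.
parents = {
    "G1": [],
    "G2": ["G1"],
    "G3": ["G1"],
    "G4": ["G2"],
    "G5": ["G1", "G2", "G3"],
}

# Each factor of the JPD conditions a gene only on its parents:
factors = [
    f"p({g}|{','.join(pa)})" if pa else f"p({g})"
    for g, pa in parents.items()
]
print(" * ".join(factors))
# p(G1) * p(G2|G1) * p(G3|G1) * p(G4|G2) * p(G5|G1,G2,G3)
```

Note that the full JPD over five binary genes would need 31 free parameters, whereas this factorisation needs only 1 + 2 + 2 + 2 + 8 = 15.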

For BNs, the edges of the graph must form a directed acyclic graph (DAG): a graph with no cyclic paths (no loops). This allows for efficient inference and learning. JPDs can be expressed in a compact way, reducing model size by exploiting conditional independence relationships (two variables are conditionally independent if they are independent given the state of a third variable). A benefit of BNs is that they may be interpreted as a causal model which generated the data; thus, arrows (directed edges) in the DAG can represent causal relations/dependencies between variables. However, it must be noted that learning a causal model requires more than association data alone, and this is discussed toward the end of this primer under the heading Causality.

Bioinformatics applications of BNs have included gene clustering and the inference of cellular networks [

The relationships between variables are encoded by conditional probability distributions (CPDs) of the form p(X_i | pa(X_i)), where pa(X_i) denotes the set of parents of X_i in the network.

For BNs with discrete variables, each CPD takes the form of a conditional probability table (CPT): the parameters θ_B are the probabilities of each value x_i of X_i for each configuration of its parents pa(X_i).

In a similar way, regression models for CPDs of continuous variables with continuous parents may be used. In this case, θ_B consists of regression coefficients and a variance, giving a linear Gaussian CPD p(X_i | pa(X_i)) = N(Σ_j β_j x_j + β_0, σ²); i.e., the CPD for X_i is a Gaussian whose mean depends linearly on the values of its parents.

It is the JPD over all the variables that is of great interest. However, the number of model parameters needed to define the JPD grows rapidly with the number of variables. Through exploiting conditional independence between variables, the models may be represented in a compact manner, with orders of magnitude fewer parameters.

Relationships between variables are captured in a BN structure B, which comprises a DAG over the variables X_1, …, X_n together with parameters θ = (θ_1, …, θ_n), where θ_i defines the CPD p(x_i | pa_i, θ_i) of variable X_i given its parents pa_i. The JPD then factorises as p(x_1, …, x_n) = Π_i p(x_i | pa_i, θ_i).

For the known BN structure (gene regulatory network) in

Conceptually, inference is straightforward: any probability of interest can be computed from the JPD by summing (marginalising) over the unobserved variables, although efficient algorithms are needed for this to remain tractable in larger networks.
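Inference by brute-force enumeration can be sketched for the five-gene example. The CPT values below are invented purely for illustration; the point is that the joint is a product of conditionals, and any conditional probability follows by summing the joint over the unobserved genes:

```python
from itertools import product

# Hypothetical CPTs for the example network (all numbers invented).
# cpt[g] maps a tuple of parent values to p(g = 1 | parents).
parents = {"G1": (), "G2": ("G1",), "G3": ("G1",),
           "G4": ("G2",), "G5": ("G1", "G2", "G3")}
cpt = {
    "G1": {(): 0.4},
    "G2": {(0,): 0.2, (1,): 0.8},
    "G3": {(0,): 0.3, (1,): 0.7},
    "G4": {(0,): 0.1, (1,): 0.9},
    "G5": {pa: 0.05 + 0.3 * sum(pa) for pa in product((0, 1), repeat=3)},
}

def joint(x):
    """p(G1..G5) as the product of the conditional distributions."""
    p = 1.0
    for g in parents:
        pa = tuple(x[q] for q in parents[g])
        p1 = cpt[g][pa]
        p *= p1 if x[g] == 1 else 1.0 - p1
    return p

# Marginalise: p(G5 = 1 | G2 = 1) by enumeration over all joint states.
num = den = 0.0
for vals in product((0, 1), repeat=5):
    x = dict(zip(parents, vals))
    if x["G2"] == 1:
        den += joint(x)
        num += joint(x) if x["G5"] == 1 else 0.0
print(num / den)
```

Enumeration is exponential in the number of variables; practical systems use algorithms (e.g., variable elimination or belief propagation) that exploit the same factorisation to avoid summing over all states.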

Two variables are conditionally independent if they are independent given the state of a third variable. Mathematically, X and Y are conditionally independent given Z if p(X, Y | Z) = p(X | Z) p(Y | Z), or equivalently p(X | Y, Z) = p(X | Z).

Conditional independence relationships are encoded in the structure of the network, as illustrated in the three cases below: serial, diverging, and converging connections in the regulation of three genes.

In the case of a converging connection, it is also worthwhile noting that when the value of the common child is observed, its parents are no longer independent of one another; this effect is often referred to as "explaining away".

Thus, the structure of the model captures/encodes the dependencies between the variables, and each structure leads to a different factorisation of the JPD and a different set of independence relations.
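The "explaining away" behaviour of a converging connection can be checked numerically. The sketch below uses a hypothetical converging network X → Z ← Y (think of two independent regulators of one gene), with all probabilities invented for illustration:

```python
from itertools import product

# Converging connection X -> Z <- Y, with invented probabilities.
px, py = 0.5, 0.5                      # independent priors on X and Y
pz = {(0, 0): 0.05, (0, 1): 0.6,       # p(Z = 1 | X, Y): either parent
      (1, 0): 0.6, (1, 1): 0.9}        # on its own can "explain" Z = 1

def joint(x, y, z):
    p = (px if x else 1 - px) * (py if y else 1 - py)
    p1 = pz[(x, y)]
    return p * (p1 if z else 1 - p1)

def cond(x_val, **given):
    """p(X = x_val | given), computed by enumeration."""
    match = lambda y, z: all({"y": y, "z": z}[k] == v for k, v in given.items())
    num = sum(joint(x, y, z) for x, y, z in product((0, 1), repeat=3)
              if x == x_val and match(y, z))
    den = sum(joint(x, y, z) for x, y, z in product((0, 1), repeat=3)
              if match(y, z))
    return num / den

# Marginally, X and Y are independent: p(X=1) equals p(X=1 | Y=1).
assert abs(cond(1) - cond(1, y=1)) < 1e-12
# Once Z is observed, they become dependent: seeing Y=1 "explains" Z=1
# and lowers the probability that X was responsible.
print(cond(1, z=1), cond(1, z=1, y=1))  # the second value is lower
```

Observing Y = 1 lowers p(X = 1 | Z = 1) from about 0.70 to 0.60 with these numbers, even though X and Y are marginally independent.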

As a simple example, consider the task of predicting interaction sites on protein surfaces from measures of conservation and hydrophobicity of surface patches. This gives three variables: interaction site (I), conservation (C), and hydrophobicity (H). In general, a full JPD over n binary variables has 2^n − 1 free parameters, whereas a naïve Bayes classifier, in which the attributes are conditionally independent given the class, has only 2(n − 1) + 1. [For n = 100 variables, the naïve Bayes model is more than 2^{90} times smaller!] In the next section of this primer, the learning of parameters for this simple example is illustrated. This example is inspired by [
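The saving can be checked directly. The counts below assume binary variables, with one of the n variables acting as the class and the remaining n − 1 as attributes (standard naïve Bayes accounting; other conventions differ slightly):

```python
# Free-parameter counts for n binary variables.
def full_jpd_params(n):
    return 2 ** n - 1            # one probability per joint state, minus one

def naive_bayes_params(n):
    # 1 class prior + 2 parameters per attribute (one per class value);
    # one of the n variables is the class, leaving n - 1 attributes.
    return 1 + 2 * (n - 1)

for n in (3, 10, 100):
    print(n, full_jpd_params(n), naive_bayes_params(n))
```

For n = 3 (the I, C, H example) this is 7 versus 5 parameters; for n = 100 the ratio exceeds 2^{90}.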

The simplest approach to learn the parameters of a network is to find the parameter set that maximises the likelihood that the observed data came from the model.

In essence, a BN is used to model a probability distribution p(X) from which the training data D = {x_1, …, x_N} are assumed to be drawn, where x_i is the i-th case (observation) in the dataset.

The learning paradigm which aims to maximise the likelihood of the data given the model, p(D|θ), yields a point estimate (θ_ML), where θ_ML is the maximum (log) likelihood model: θ_ML = arg max_θ ln p(D|θ).

In order to consider a prior distribution, a maximum a posteriori point estimate (θ_MAP) may be used instead, where θ_MAP maximises the posterior probability (the likelihood of the "model given data"): θ_MAP = arg max_θ ln p(θ|D) = arg max_θ [ln p(D|θ) + ln p(θ)].
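For a single binary parameter estimated from counts, both estimates have closed forms. The sketch below assumes a Beta(a, b) prior (a standard conjugate choice for a binomial parameter; the particular prior strengths are invented):

```python
from math import isclose

# k "successes" observed in n trials, with a hypothetical Beta(a, b) prior.
def theta_ml(k, n):
    return k / n                         # maximum likelihood estimate

def theta_map(k, n, a, b):
    # Mode of the Beta(k + a, n - k + b) posterior.
    return (k + a - 1) / (n + a + b - 2)

# With a flat Beta(1, 1) prior, MAP coincides with ML:
assert isclose(theta_map(5, 10, 1, 1), theta_ml(5, 10))
# A prior favouring higher values pulls the estimate upwards:
print(theta_ml(5, 10), theta_map(5, 10, 8, 4))  # 0.5 0.6
```

With little data the prior dominates; as n grows, both estimates converge to the empirical frequency k/n.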

Often ML and MAP estimates are good enough for the application in hand, and produce good predictive models. The numerical example at the end of this section illustrates the effects of ML and MAP estimates with different strength priors and training set sizes. Both ML and MAP produce a point estimate for θ. Point estimates are a single snapshot of parameters (though confidence intervals on their values can be calculated).

For a full Bayesian model, the uncertainty in the values of the parameters is modelled as a probability distribution over the parameters. The parameters are considered to be latent variables, and the key idea is to marginalise over these unknown parameters, rather than to make point estimates; this is known as the marginal likelihood. The computation of a full posterior distribution, or model averaging, avoids severe overfitting and allows direct model comparison. In [

The joint probability of the training data, the model, and a new observation x is p(x, θ, D).

Applying the product rule [

This is computing a full Bayesian posterior. In order to do this, a prior distribution, p(θ), for the model parameters needs to be specified. There are many types of priors that may be used, and there is much debate about the choice of prior [

The parameters for BNs may be learned even when the training dataset is incomplete, i.e., the values of some variables in some cases are unknown. Commonly, the Expectation–Maximisation (EM) algorithm is used, which estimates the missing values by computing the expected values and updating parameters using these expected values as if they were observed values.

EM is used to find local maxima of the ML or MAP objectives. EM begins with a particular parameter configuration θ^0 and alternates two steps until convergence: an E-step, which computes the expected complete-data log likelihood Q(θ | θ^t) = E[ln p(D_C | θ)], where D_C denotes the data completed with expected values for the missing entries; and an M-step, which updates the parameters to θ^{t+1} = arg max_θ Q(θ | θ^t).

Using EM to find a point estimate for the model parameters can be efficient to calculate and gives good results when learning from incomplete data or for network structures with hidden nodes (those for which there is no observed data). With large sample sizes, the effect of the prior p(θ) becomes small, and ML is often used instead of MAP in order to simplify the calculation. More sophisticated (and computationally expensive) sampling methods, such as those mentioned below, may also be applied to incomplete data. One advantage of these methods is that they avoid one of the possible drawbacks of EM: becoming trapped in local optima.
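A minimal EM sketch, assuming the naïve Bayes interaction-site model from earlier but with the class I never observed (i.e., I is a hidden node). All starting values and the synthetic dataset are invented for illustration:

```python
# EM for p(I) p(C|I) p(H|I) with the binary class I hidden.
def em(data, iters=100):
    pi, pc, ph = 0.6, [0.3, 0.8], [0.4, 0.7]   # arbitrary starting point
    for _ in range(iters):
        # E-step: expected value of the hidden I for each case (c, h).
        resp = []
        for c, h in data:
            l1 = pi * (pc[1] if c else 1 - pc[1]) * (ph[1] if h else 1 - ph[1])
            l0 = (1 - pi) * (pc[0] if c else 1 - pc[0]) * (ph[0] if h else 1 - ph[0])
            resp.append(l1 / (l1 + l0))
        # M-step: re-estimate parameters from the expected counts.
        n1 = sum(resp)
        n0 = len(data) - n1
        pi = n1 / len(data)
        pc = [sum((1 - r) * c for (c, h), r in zip(data, resp)) / n0,
              sum(r * c for (c, h), r in zip(data, resp)) / n1]
        ph = [sum((1 - r) * h for (c, h), r in zip(data, resp)) / n0,
              sum(r * h for (c, h), r in zip(data, resp)) / n1]
    return pi, pc, ph

# Synthetic data with two clusters: (high C, high H) and (low C, low H).
data = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 40
pi, pc, ph = em(data)
print(round(pi, 2), [round(p, 2) for p in pc], [round(p, 2) for p in ph])
```

Each iteration provably does not decrease the (observed-data) likelihood, but, as noted above, the fixed point reached depends on the starting configuration.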

There may be cases of hidden nodes in gene regulatory networks, where the network is known, but experiments have not provided expression levels for all genes in the network—model parameters can still be learned. The ability to handle incomplete data is an important one, particularly when considering that expression data may come from different laboratories, each looking at different parts of a gene regulatory network, with overlap of some genes whilst others are missing. In this case, all the collected data can be used.

A number of sampling methods have been used to estimate the (full) posterior distribution of the model parameters, p(θ|D), rather than point estimates such as θ_MAP.

In this numerical example, we illustrate the approaches described in the text for learning Bayesian network parameters, using the simple example of a naïve Bayes classifier to predict protein interaction sites (I) using information on conservation (C) and hydrophobicity (H). Each variable has two possible values: I = yes/no; H = high/low and C = high/low. The conditional probability tables defining the network are shown in the figure, parameterised by θ_{1–5}.

To illustrate the different methods, we will focus on parameter θ_2, the probability that conservation is high (C = high), given that this is a protein interaction site (I = yes). The value of θ_2 is to be estimated from count data; in this case, the number of interaction sites observed with high and with low conservation.

Each graph plots the likelihood of the data as a function of θ_2. The other graph curves are the prior over θ_2, and the posterior.

(A) In this case, the observed data is ten interaction sites, of which five have high conservation, five low. As expected, in this case the likelihood peaks at θ_2 = 0.5. The prior, which is peaked away from θ_2 = 0.5, is also shown, along with the posterior and the MAP estimate of θ_2. The influence of the prior information in this case where the observed counts are low is clear.

(B) Learning from 100 training examples (75 high, 25 low). Here the weak prior has little influence, and the MAP estimate is close to the ML estimate (θ_2 ≈ 0.75). The posterior distribution for θ_2 is narrower: some of the uncertainty about its value has been removed given the evidence (training examples).

(C) Using a stronger prior, whose mode for θ_2 is 0.7. Note, however, that the prior is narrower: a lot of evidence would be needed to be convinced that θ_2 was less than 0.6, say. Small samples are more susceptible to noise than larger samples. For a training set with five high and five low conservation scores, the ML estimate (θ_2 = 0.5) is quite different from the MAP estimate of about 0.7, which takes into account the prior. Hopefully, this illustrates why priors are useful, but also cautions against choosing the wrong prior (or too strong/weak a prior)!

(D) This final example combines the stronger prior with the larger (100-example) training set, giving a MAP estimate that lies between the prior mode and the ML estimate of θ_2.
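The qualitative behaviour of these panels can be reproduced with a few lines of arithmetic, assuming Beta priors on θ_2 (the particular Beta strengths below are invented; they are chosen only so that the weak prior is broad and the strong prior has mode 0.7):

```python
# Point estimates for theta_2 = p(C = high | I = yes) under a Beta(a, b)
# prior; the posterior is Beta(a + k, b + n - k) for k highs in n sites.
def estimates(k, n, a, b):
    ml = k / n                               # maximum likelihood
    map_ = (k + a - 1) / (n + a + b - 2)     # posterior mode
    post_mean = (k + a) / (n + a + b)        # used when averaging over theta
    return ml, map_, post_mean

# (A) 10 sites, 5 high, weak prior favouring ~0.7: MAP pulled off 0.5.
print(estimates(5, 10, 7, 3))
# (B) 100 sites, 75 high: the data dominate the weak prior.
print(estimates(75, 100, 7, 3))
# (C) small sample, stronger prior with mode 0.7: MAP stays near 0.7.
print(estimates(5, 10, 29, 13))
```

With five highs out of ten, the weak prior moves the MAP estimate from 0.5 to about 0.61, while the strong prior holds it near 0.66; with 75 out of 100, the MAP estimate is essentially the ML value of 0.75.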

The Bayesian approach of calculating the marginal likelihood does not involve making a point estimate of the parameter; instead, the posterior distribution is averaged over, in order to take the uncertainty in the parameter values fully into account.

Particularly in the domain of biology, the inference of network structures is the most interesting aspect; for example, the elucidation of regulatory and signalling networks from data. This involves identifying real dependencies between measured variables, distinguishing them from simple correlations. The learning of model structures, and particularly causal models, is difficult, and often requires careful experimental design, but can lead to the learning of unknown relationships and excellent predictive models.

So far, only the learning of parameters of a BN of known structure has been considered. Sometimes the structure of the network may be unknown, and this too may be learned from data. The marginal likelihood of the data given a structure hypothesis S^h is obtained by averaging over the parameters: p(D | S^h) = ∫ p(D | θ_S, S^h) p(θ_S | S^h) dθ_S.

However, the computation of a full posterior distribution over the parameter space and the model structure space is intractable for all practical applications (those with more than a handful of variables).

Even for a relatively small number of variables, there are an enormous number of possible network structures, and the computation of a full posterior probability distribution is difficult. There are several approaches to this problem, including Markov chain Monte Carlo (MCMC) methods (such as the Metropolis–Hastings algorithm), which are used to obtain a set of "good" sample networks from the posterior distribution p(S^h, θ_S | D); averaging over these samples approximates the full posterior.

A faster alternative to MCMC is to use a heuristic search, such as greedy hill-climbing, to find a single high-scoring structure.

The two key components of a structure learning algorithm are a search strategy, to explore the space of possible structures, and a scoring function, to evaluate each candidate structure.

There are two common approaches used to decide on a "good" structure. The first is to test whether the conditional independence assertions implied by the structure of the network are satisfied by the data. The second approach is to assess the degree to which the resulting structure explains the data (as described for learning the parameters of the network). This is done using a score such as the Bayesian information criterion (BIC): BIC = ln p(D | θ_ML, S^h) − (d/2) ln N, where θ_ML is the ML estimate of the parameters, d is the number of free parameters, and N is the number of cases. The BIC score has a measure of how well the model fits the data, and a penalty term to penalise model complexity. This is an example of a penalised likelihood score.
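As a small sketch of score-based structure selection, the code below compares two candidate structures over two binary variables X and Y (independent versus X regulating Y) using the standard BIC formula; the dataset is synthetic and strongly correlated, so the extra parameter of the dependent structure should pay for itself:

```python
from math import log

# BIC = ln p(D | theta_ML, S) - (d / 2) ln N for two structures.
def loglik_indep(data):
    n = len(data)
    px = sum(x for x, y in data) / n
    py = sum(y for x, y in data) / n
    return sum(log(px if x else 1 - px) + log(py if y else 1 - py)
               for x, y in data)

def loglik_dep(data):
    # X -> Y: fit p(Y | X) separately for each value of X.
    n = len(data)
    px = sum(x for x, y in data) / n
    ll = sum(log(px if x else 1 - px) for x, y in data)
    for x0 in (0, 1):
        sub = [y for x, y in data if x == x0]
        py = sum(sub) / len(sub)
        ll += sum(log(py if y else 1 - py) for y in sub)
    return ll

def bic(ll, d, n):
    return ll - 0.5 * d * log(n)

data = [(1, 1)] * 40 + [(0, 0)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10
n = len(data)
# Independent structure has d = 2 free parameters, X -> Y has d = 3.
print(bic(loglik_indep(data), 2, n), bic(loglik_dep(data), 3, n))
```

With this data the dependent structure scores higher despite its complexity penalty; with uncorrelated data the penalty would tip the score towards the simpler, independent structure.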

In the case of gene regulatory networks, these structure learning algorithms may be used to identify the most probable structure to give an influence diagram for a gene regulatory network learned from data. Imoto et al. [

Often the really interesting problems involve the learning of causal relationships [

Sachs et al. [

An essential feature of many biological systems is feedback, which cannot be represented directly in a BN because the graph must be acyclic. When BNs are used to model time series and feedback loops, the variables are indexed by time and replicated in the BN; such networks are known as dynamic Bayesian networks (DBNs).

As an example, if in the earlier gene regulatory network example, gene G5 regulated G1, then a feedback loop (cyclic graph) would be formed. In order to perform efficient inference, BNs require a DAG to define joint probabilities in terms of the product of conditional probabilities. For probabilistic graphical models with loops, as described, either iterative methods such as loopy belief propagation must be used, or the cyclic graph must be transformed into a DAG. Assuming a (first-order) Markov process governs gene regulation, the network may be rolled out in time, to create a DBN. Generally, DBNs contain two time slices, with an instance of each variable in each time slice (
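Unrolling in time can be sketched directly. The snippet below takes the example network with the hypothetical feedback edge G5 → G1 and, under the simplifying assumption that every regulatory influence acts across exactly one time step, produces an acyclic two-slice graph:

```python
# Cyclic regulatory graph: as in the text, suppose G5 also regulates G1.
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G5"),
         ("G2", "G4"), ("G2", "G5"), ("G3", "G5"),
         ("G5", "G1")]          # the feedback edge creates a cycle

# First-order Markov unrolling: each edge u -> v becomes an edge from
# u at time t to v at time t + 1.  Every edge now points forward in
# time, so the unrolled graph is acyclic by construction.
def unroll(edges, slices=2):
    return [((u, t), (v, t + 1)) for t in range(slices - 1) for u, v in edges]

for parent, child in unroll(edges, slices=2):
    print(parent, "->", child)
```

The feedback loop G1 → … → G5 → G1 becomes the chain (G5, t) → (G1, t + 1), which a standard BN inference algorithm can handle.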

Many applications in computational biology have taken advantage of BNs or, more generally, probabilistic graphical models. These include: protein modelling; systems biology; gene expression analysis; biological data integration; protein–protein interaction and functional annotation; DNA sequence analysis; and genetics, phylogeny, and linkage analysis. However, perhaps the most interesting application of BNs in the biological domain has been the modelling of networks and pathways. These analyses combine all the features of BNs: the ability to learn from incomplete noisy data, the ability to combine both expert knowledge and data to derive a suitable network structure, and the ability to express causal relationships. Recent application of DBNs has allowed more sophisticated relationships to be modelled; for example, systems which incorporate feedback. Furthermore, the marriage of improved experimental design with new data acquisition techniques promises to be a very powerful approach in which causal relations of complex interactions may be elucidated.

Heckerman has written an excellent mathematical tutorial on learning with BNs [

BN, Bayesian network

BIC, Bayesian information criterion

CPD, conditional probability distribution; CPT, conditional probability table

DAG, directed acyclic graph

DBN, dynamic Bayesian network

EM, expectation–maximisation

HMM, hidden Markov model

JPD, joint probability distribution

MAP, maximum a posteriori

MCMC, Markov chain Monte Carlo

ML, maximum likelihood