The authors have declared that no competing interests exist.

Conceived and designed the experiments: MM HHC STW. Performed the experiments: MJM HHC. Analyzed the data: MJM HHC. Contributed reagents/materials/analysis tools: MJM STW. Wrote the paper: MJM HHC.

Bayesian Networks (BN) have been a popular predictive modeling formalism in bioinformatics, but their application in modern genomics has been slowed by an inability to cleanly handle domains with mixed discrete and continuous variables. Existing free BN software packages either discretize continuous variables, which can lead to information loss, or do not include inference routines, which makes prediction with the BN impossible. We present CGBayesNets, a BN package focused on prediction of a clinical phenotype from mixed discrete and continuous variables, which fills these gaps. CGBayesNets implements Bayesian likelihood and inference algorithms for the conditional Gaussian Bayesian network (CGBN) formalism, one appropriate for predicting an outcome of interest from, e.g., multimodal genomic data. We provide four different network learning algorithms, each making a different tradeoff between computational cost and network likelihood. CGBayesNets provides a full suite of functions for model exploration and verification, including cross validation, bootstrapping, and AUC manipulation. We highlight several results obtained previously with CGBayesNets, including predictive models of wood properties from tree genomics, leukemia subtype classification from mixed genomic data, and robust prediction of intensive care unit mortality outcomes from metabolomic profiles. We also provide detailed example analyses on public metabolomic and gene expression datasets. CGBayesNets is implemented in MATLAB and available as MATLAB source code, under an Open Source license and anonymous download at

A Bayesian network (BN) is a data structure that encodes conditional probability distributions between variables of interest by using a graph composed of nodes and directed edges. In a BN, variables in the domain are modeled as random variables and represented by nodes, and edges between them represent a statistical dependence of the child node on the parent node. Each node is annotated with the conditional distribution of the variable given the values of its parents, and this information can be used to answer questions about the most probable values of variables in the BN given assignments to other variables in the BN.

BNs are attractive because they offer an interpretable picture of dependence and independence between domain variables, while modeling complex statistical relationships among them and providing prediction of an outcome of interest. BNs, as a mathematical modeling formalism, have enjoyed success in recent years

However, Bayesian inference algorithms can be extremely complex and difficult to implement. There are several software packages for doing BN analysis, but to our knowledge all existing free implementations have one of two problems: 1) they do not allow mixed discrete and continuous data in a fully Bayesian mathematical formalism; or 2) they do not perform inference with the BN, merely performing the network learning step. The first limitation is one of traditional BNs, which are limited to considering only discrete variables. Researchers who wish to analyze continuous data (such as gene expression or metabolomics data) generally discretize these data, leading to loss of information and a concurrent loss of power. In the second situation, a BN implementation that does not include inference is incapable of making predictions on new data, and researchers wishing to use the BN for prediction must either implement their own inference software, or take apart the BN model and put the pieces into some other modeling formalism, such as a logistic regression or a support vector machine. In addition to being inconvenient, this translation process leads to suboptimal performance, since the variables identified in a BN are chosen for their relations to and interactions with the other variables within the network, interactions that can be difficult to accurately recapitulate in another predictive modeling formalism.

CGBayesNets addresses both of these problems by implementing both network learning and network inference in a more modern type of Bayesian network, the conditional Gaussian Bayesian network (CGBN)

CGBayesNets was primarily designed to aid genomic researchers in building predictive models of a phenotype of interest using multimodal genomic data, combined with demographic and clinical data, as possible predictors. A typical scenario is one where the researcher has a case-control dataset of patients with and without a binary condition: cancer

Of course, the primary outcome of a clinical case-control trial need not be used: other discrete variables of interest can be treated as the phenotype, and networks predictive of these secondary outcomes can be computed separately. In a future update, CGBayesNets will also allow continuous (normally distributed) variables to be used as the phenotype.

While there are several software packages for learning and predicting with Bayesian networks, none provide the mix of features presented by CGBayesNets; in particular, no free implementations provide algorithms for inference in networks of mixed discrete and continuous variables.

Some Bayesian packages focus only on learning the network structure, including the popular Banjo

Other BN packages provide network learning and inference but do not implement inference with continuous variables without discretization. These include the machine learning Java platform Weka 3.6.9

Some network-learning packages are not Bayesian, but instead use other formalisms for defining statistical relationships between variables, such as GlobalMIT

GDAGSim is a Gaussian linear model simulator

Perhaps closest to our CGBayesNets package is the BNfinder 2.0 package

In total, CGBayesNets provides new functionality: learning and predicting with Bayesian networks composed of discrete and continuous variables. In the following, we discuss our implementation of these and additional features.

CGBayesNets is entirely Bayesian, using the Bayesian marginal likelihood to guide network search and for performing inference. Using Bayesian statistics allows leveraging of Bayesian priors to bias network structure learning toward parsimonious models that are more likely to predict well on new datasets, while also providing a consistent mathematical treatment throughout the package. Please see the Supplementary Materials for a full mathematical treatment of the Bayesian semantics.

We refer to the process of predicting an outcome of interest using a Bayesian network as “inference,” the term commonly used for this in the BN literature. Inference in a BN can proceed either forward or backward along the directed edges of the network. The best prediction of a phenotype node is often obtained when that node has no parents, but many children. This is structurally similar to a type of network known as a Naïve Bayes Network, where all other (non-phenotype) nodes are conditionally independent of one another given the phenotype. Although Naïve Bayes networks are simple, in practice they can provide extremely good prediction
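To make the Naïve Bayes structure concrete, the following is a minimal Python sketch of prediction in a network with one discrete phenotype and continuous children, each modeled as a Gaussian conditional on the phenotype. It is not the CGBayesNets implementation (which is in MATLAB and fully Bayesian): the function names are hypothetical, and the parameters are assumed to be simple maximum-likelihood plug-in estimates rather than posterior quantities.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fit_naive_cg(samples, labels):
    """Estimate a class prior and per-class Gaussian (mean, variance) for
    each continuous child. Plug-in estimates: an assumption of this sketch."""
    model = {}
    for c in sorted(set(labels)):
        rows = [s for s, y in zip(samples, labels) if y == c]
        n = len(rows)
        params = []
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / n
            var = sum((v - mean) ** 2 for v in col) / n + 1e-6  # variance floor
            params.append((mean, var))
        model[c] = (n / len(labels), params)
    return model

def posterior(model, x):
    """P(phenotype | x): inference runs backward from children to phenotype."""
    log_joint = {}
    for c, (prior, params) in model.items():
        log_joint[c] = math.log(prior) + sum(
            gaussian_logpdf(v, m, s2) for v, (m, s2) in zip(x, params)
        )
    top = max(log_joint.values())  # subtract the max for numerical stability
    weights = {c: math.exp(lj - top) for c, lj in log_joint.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Toy data: two continuous children, binary phenotype (0 = control, 1 = case).
samples = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [5.0, 3.0], [5.2, 3.1], [4.9, 2.9]]
labels = [0, 0, 0, 1, 1, 1]
model = fit_naive_cg(samples, labels)
post = posterior(model, [5.0, 3.0])
```

Here `post` maps each phenotype state to its posterior probability for the new observation; in this toy dataset the observation lies close to the case group, so nearly all of the posterior mass falls on state 1.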

To perform inference in a CGBN, different algorithms are used on the discrete and continuous portions of the network. We have implemented the Cowell algorithm for inference in conditional Gaussian network nodes

We chose the Cowell algorithm for inference because it is numerically stable. There are very few algorithms to choose from for exact Bayesian inference in mixed networks. The Cowell algorithm is based on the earlier Lauritzen and Jensen junction tree algorithm

While the focus of CGBayesNets is on inference in mixed Bayesian networks, CGBayesNets provides four main network structure search algorithms. The problem of searching for the best Bayesian network is one that has received much attention over the last 30 years, and there are many possible algorithms (and software implementations of those) that a researcher may want to employ. We have endeavored to make our software package modular and extensible so that researchers familiar with MATLAB will be able to easily add their own network search algorithms; we also read common network file formats so that researchers can use other packages that have more extensive and specialized network search procedures to find a good Bayesian network, and then use CGBayesNets to perform inference in that network.

In all of the network search algorithms in CGBayesNets, network scoring is done by a metric sometimes known as the ^{2}) possible directed networks), all networks cannot be investigated; rather heuristics are employed to search for good networks that might be satisfactory. One heuristic we employ throughout is a limit on the maximum number of possible parents a node may have, which may be set by the researcher to any appropriate value.

The first network learning algorithm we provide in CGBayesNets is a K2-style ^{2} possible edges, where n is the number of variables and k is the maximum number of parents a node can have, a parameter of the search algorithm. K2 is frequently almost as good as other methods that consider a much larger number of possible edges, and thus take much more computational time. The K2 procedure we have implemented can be considered a hill-climbing algorithm that allows backtracking, but does not consider all possible edges, only those that obey the K2 ordering constraint.

A second structure learning algorithm provided in CGBayesNets is a greedy, exhaustive, search algorithm that starts with an empty network and adds the best edge, iteratively. It does not rely upon a K2-style node priority list to avoid cycles, but rather does its own cycle-checking with depth-first search. This algorithm is a greedy hill-climber, in that at every step it adds the edge that increases data likelihood the most. It is exhaustive in that it considers all possible legal edges, between any two nodes (not that it considers all possible ^{2}+kn^{2} edges, and may find better networks than K2, although it is much slower. In sum, this algorithm is appropriate for smaller datasets, and will have prohibitive computational cost for networks with thousands of nodes.
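The greedy search with depth-first cycle checking described above can be sketched in Python as follows. This is an illustrative outline only, not the package's MATLAB code: `score_gain` is a hypothetical callable standing in for the change in network score from adding an edge, and the toy gain table is invented for the example.

```python
def creates_cycle(children, parent, child):
    """Return True if adding parent -> child would close a directed cycle,
    i.e. if `parent` is already reachable from `child` (iterative DFS)."""
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, ()))
    return False

def greedy_search(nodes, score_gain, max_parents):
    """Start from an empty network; repeatedly add the legal edge with the
    largest positive score gain until no edge improves the score."""
    children = {n: set() for n in nodes}
    parents = {n: set() for n in nodes}
    while True:
        best, best_gain = None, 0.0
        for p in nodes:
            for c in nodes:
                if p == c or c in children[p]:
                    continue  # no self-loops or duplicate edges
                if len(parents[c]) >= max_parents:
                    continue  # respect the parent limit
                if creates_cycle(children, p, c):
                    continue  # keep the graph acyclic
                gain = score_gain(p, c, frozenset(parents[c]))
                if gain > best_gain:
                    best, best_gain = (p, c), gain
        if best is None:
            return parents
        p, c = best
        children[p].add(c)
        parents[c].add(p)

# Hypothetical gain table; any edge not listed lowers the score.
toy_gains = {("A", "B"): 2.0, ("B", "A"): 1.8, ("B", "C"): 1.5, ("C", "A"): 1.0}
result = greedy_search(
    ["A", "B", "C"],
    lambda p, c, current_parents: toy_gains.get((p, c), -1.0),
    max_parents=2,
)
```

In this toy run the search adds A → B and then B → C; the edges B → A and C → A are rejected by the cycle check even though their gains are positive.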

The third network learning algorithm provided with CGBayesNets is pheno-centric search, which builds a network based around a particular phenotype node, sufficient to perform learning and prediction of that particular variable ^{2} possible edges. Second, it does not require a K2-style node order list for parent constraints, and as such does not needlessly exclude many potential network structures from consideration. On the other hand, it can result in overfitting of the data: making too many connections to the phenotype that may just be due to random noise in the dataset.

The fourth algorithm is a hill-climbing technique known as simulated annealing ^{3} possible edges to allow the search to consider many possible permutations of the n^{2} possible edges in a network. However this is the slowest of our four search algorithms and as such may perform worse than the other three given limited computational time.

Finally, for learning networks of many variables, CGBayesNets includes simple filtering functions that filter the number of variables by Bayes Factor of association with the phenotype, where the Bayes Factor is the ratio of posterior likelihood of the data with the variable dependent upon the phenotype, to the likelihood of the data independent of the phenotype
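As an illustration of Bayes Factor filtering, the sketch below computes a log Bayes Factor for a discrete variable against a discrete phenotype. It assumes a Dirichlet-multinomial marginal likelihood with a symmetric prior, which is a standard choice for discrete nodes but is an assumption here; the scoring actually used by CGBayesNets, including its treatment of continuous variables, is described in the Supplementary Materials. All function names are hypothetical.

```python
from collections import Counter
from math import lgamma

def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of categorical counts under a symmetric
    Dirichlet(alpha) prior (Dirichlet-multinomial)."""
    k, n = len(counts), sum(counts)
    out = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        out += lgamma(alpha + c) - lgamma(alpha)
    return out

def log_bayes_factor(variable, phenotype, alpha=1.0):
    """log [ P(data | variable depends on phenotype) / P(data | independent) ]."""
    states = sorted(set(variable))
    pooled = Counter(variable)
    log_indep = log_marginal([pooled[s] for s in states], alpha)
    log_dep = 0.0
    for ph in sorted(set(phenotype)):
        sub = Counter(v for v, p in zip(variable, phenotype) if p == ph)
        log_dep += log_marginal([sub[s] for s in states], alpha)
    return log_dep - log_indep

def filter_by_bayes_factor(columns, phenotype, threshold=0.0):
    """Keep the names of variables whose log Bayes Factor exceeds threshold."""
    return [
        name for name, values in columns.items()
        if log_bayes_factor(values, phenotype) > threshold
    ]

phenotype = [0] * 10 + [1] * 10
columns = {
    "associated": [0] * 10 + [1] * 10,  # tracks the phenotype exactly
    "unrelated": [0, 1] * 10,           # same distribution in both groups
}
kept = filter_by_bayes_factor(columns, phenotype)
```

With these toy columns only `"associated"` survives the filter: its log Bayes Factor is positive, while that of `"unrelated"` is negative.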

The CGBayesNets package is intended to support all phases of the predictive modeling process.

CGBayesNets provides the four network structure learning algorithms described above. In addition, in our software implementation, CGBayesNets provides separate functions for learning the parameters of a network and learning its structure from data, and base functions for computing Bayesian likelihood of variables. These functions make it easy for advanced users to add their own network learning algorithms. Once structure and parameters are learned, the model may be tested on a dataset: either the existing dataset or a new (replication) dataset. CGBayesNets provides functions for making testing on multiple different datasets simple and direct. In all cases the Area Under the Receiver Operating Characteristic Curve (AUC) is reported as a measure of predictive accuracy of the network
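For readers unfamiliar with the AUC, the following Python sketch computes it from predicted scores and binary labels using its rank interpretation: the probability that a randomly chosen case is scored higher than a randomly chosen control, with ties counted as one half. The function name is hypothetical and the quadratic pairwise loop is chosen for clarity rather than speed; it is not the routine shipped with CGBayesNets.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases (label 1)
    and controls (label 0); ties contribute 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Predicted probability of being a case, and the true labels.
example_auc = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

In the example, three of the four case-control pairs are ordered correctly, so `example_auc` is 0.75. An AUC of 0.5 corresponds to chance-level ranking.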

To increase the performance of networks on replication datasets, CGBayesNets provides functions for employing cross-validation (CV) and bootstrapping. The cross-validation functions will either perform CV to determine the best settings of Bayesian prior parameters, or to estimate the performance on an unknown replication dataset. Bootstrapping is provided to obtain estimates of the frequency of individual edges within a given Bayesian network, by comparing frequencies of edges in different bootstrap realizations of the dataset. This results in a single aggregate network with fractional probabilities for each edge; functions are provided to translate these into concrete Bayesian networks and test their performance.
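The bootstrap aggregation of edges can be outlined as below. This Python sketch assumes a structure learner is supplied as a callable returning a list of directed edges; `toy_learner` is a hypothetical stand-in, not one of the package's search algorithms, and the 0.5 threshold is an arbitrary illustrative choice.

```python
import random
from collections import Counter

def bootstrap_edge_frequencies(rows, learn_edges, n_boot=100, seed=0):
    """Resample rows with replacement n_boot times, learn a network on each
    resample, and return the fraction of resamples containing each edge."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        resample = [rng.choice(rows) for _ in rows]
        counts.update(set(learn_edges(resample)))
    return {edge: c / n_boot for edge, c in counts.items()}

def consensus_network(edge_freq, threshold=0.5):
    """Turn fractional edge frequencies into a concrete edge list.
    Note: thresholding alone does not guarantee an acyclic graph."""
    return sorted(e for e, f in edge_freq.items() if f >= threshold)

def toy_learner(rows):
    """Hypothetical learner: always proposes pheno -> x, and proposes
    pheno -> y only when the resampled mean of y exceeds 0.5."""
    edges = [("pheno", "x")]
    if sum(r[1] for r in rows) / len(rows) > 0.5:
        edges.append(("pheno", "y"))
    return edges

rows = [(1.2, 0), (0.7, 1), (1.9, 0), (0.3, 1), (1.1, 1), (0.8, 0)]
freq = bootstrap_edge_frequencies(rows, toy_learner, n_boot=200, seed=1)
network = consensus_network(freq, threshold=0.5)
```

Edges that appear in every resample receive frequency 1.0, while unstable edges receive intermediate values; a real analysis would then evaluate the predictive performance of the thresholded network.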

We have endeavored to make CGBayesNets easier to use by providing several data reading and writing functions. There are input functions for reading several different types of PED SNP files, and text files formatted with mixed string and numeric data, such as those output by the popular R statistical language. We output networks in Trivial Graph Format (TGF), which can be manipulated, for example, by the free program yEd (yWorks, Tübingen, Germany).
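Trivial Graph Format is simple enough to illustrate directly: a list of `id label` lines for the nodes, a line containing only `#`, and then one `from_id to_id` line per edge. The Python sketch below serializes a small network in this layout; the function name and node names are hypothetical, and the exact output of the CGBayesNets writer (for instance, whether edge labels are included) may differ.

```python
def to_tgf(nodes, edges):
    """Serialize a directed graph to Trivial Graph Format text.
    nodes: list of node names; edges: list of (parent, child) name pairs."""
    ids = {name: i for i, name in enumerate(nodes, start=1)}
    lines = [f"{ids[name]} {name}" for name in nodes]
    lines.append("#")  # separator between the node and edge sections
    lines.extend(f"{ids[p]} {ids[c]}" for p, c in edges)
    return "\n".join(lines) + "\n"

tgf_text = to_tgf(
    ["pheno", "geneA", "metaboliteB"],
    [("pheno", "geneA"), ("pheno", "metaboliteB")],
)
```

Writing `tgf_text` to a file with a `.tgf` extension yields a graph that can be opened in a TGF-aware viewer.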

CGBayesNets is distributed as MATLAB source code. Each function is commented and documented with the input and output specifications so that it may be employed in the user's application as necessary. To make this as easy as possible, we make recommendations as to which functions are suggested for modification, and which represent inner workings of the algorithms, and should not generally be altered. We also provide example code to copy and edit demonstrating how to combine our lower-level Bayesian inference functions to assemble higher-level search and diagnostic routines.

The primary form of biological insight provided by CGBayesNets is predictive network models that differentiate cases from controls. CGBayesNets is the only existing free software package for doing so with Bayesian networks of mixed discrete and continuous domains.

Discretization is, of course, possible: researchers can convert continuous variables to discrete ones and then use discrete Bayesian network methods. However, we argue that this necessarily results in a loss of information and a concomitant loss in power. See supplemental material for an example of a mixed discrete-continuous domain where we compare performance of BNs using discretization of continuous variables to using CGBayesNets. Results from this experiment are shown in

| Data | True network: nodes | BNfinder 2.0 (K2): nodes | BNfinder 2.0 (K2): AUC | BNfinder 2.0 (reverse-K2): nodes | BNfinder 2.0 (reverse-K2): AUC | Weka 3.6.9: nodes | Weka 3.6.9: AUC | CGBayesNets: nodes | CGBayesNets: AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 14 | 5 | 50% | 3 | 98.7% | - | - | 8 | 99.3% |
| Discretized | 14 | 3 | 50% | 2 | 61.3% | 20 | 50.0% | 5 | 72.6% |
| Average Original | - | - | 68.7% | - | 82.4% | - | - | - | 99.2% |
| Average Discretized | - | - | 68.7% | - | 68.7% | - | 70.0% | - | 70.0% |

We have used CGBayesNets in several applications. We have performed eQTL analysis with earlier versions of the software that resulted in predictive models for subtypes of acute lymphoblastic leukemia

To consider one application in greater detail, we recount the network analysis strategy employed in our previous work: the identification of a metabolic signature for predicting mortality in hospital intensive-care units (ICU) from metabolomic profiling

We include several test datasets with the CGBayesNets download. These are intended both to demonstrate the suggested use of our software and to assure its correct installation and function. We provide a metabolomic profiling dataset of a cachexia sample from the MetaboAnalyst2.0

Bayesian Networks remain an important machine learning methodology within bioinformatics, although their recent application in genomics, metabolomics, and proteomics has been limited by the necessity to learn networks over mixed discrete and continuous variables. The development of effective algorithms for learning and reasoning with conditional Gaussian Bayesian networks addresses this issue, although freely available implementations of these algorithms have so far been unavailable. Our CGBayesNets package fills this need. We are committed to continued development of CGBayesNets to meet our own needs for predictive Bayesian network software, as we continue to apply these techniques to biomedical domains; these improvements will be available to all users of CGBayesNets in the future.

The CGBayesNets package is available from the authors and via anonymous download from

We thank Professor Kelan Tantisira for his generous help and support. We thank the late Dr. Marco Ramoni, who laid the theoretical groundwork for the Bayesian treatment of Gaussian variables employed herein.