IMM and IM conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, and wrote the paper.

The authors have declared that no competing interests exist.

Computational methods for predicting evolutionarily conserved rather than thermodynamic RNA structures have recently attracted increased interest. These methods are indispensable not only for elucidating the regulatory roles of known RNA transcripts, but also for predicting RNA genes. It has been notoriously difficult to devise them to make the best use of the available data and to predict high-quality RNA structures that may also contain pseudoknots. We introduce a novel theoretical framework for co-estimating an RNA secondary structure including pseudoknots, a multiple sequence alignment, and an evolutionary tree, given several RNA input sequences. We also present an implementation of the framework in a new computer program, called SimulFold, which employs a Bayesian Markov chain Monte Carlo method to sample from the joint posterior distribution of RNA structures, alignments, and trees. We use the new framework to predict RNA structures, and comprehensively evaluate the quality of our predictions by comparing our results to those of several other programs. We also present preliminary data that show SimulFold's potential as an alignment and phylogeny prediction method. SimulFold overcomes many conceptual limitations that current RNA structure prediction methods face, introduces several new theoretical techniques, and generates high-quality predictions of conserved RNA structures that may include pseudoknots. It is thus likely to have a strong impact, both on the field of RNA structure prediction and on a wide range of data analyses.

Many RNA genes function by assuming a distinct three-dimensional structure in which the molecule folds back onto itself. Contacts are formed by hydrogen bonds between non-consecutive nucleotides that are complementary to each other. These hydrogen bonds are weak compared with covalent bonds. The three possible consensus pairs of complementary nucleotides are {A, U}, {G, C}, and {G, U}. It turns out that many properties of the three-dimensional RNA molecule can already be studied even if we know only the positions in the RNA sequence that form base-pairs. This is the level of abstraction that is predominantly chosen for studying RNA structure. For our purposes, an RNA structure is unambiguously defined by the set of base-pairing positions in the RNA sequence. This set of base-pairing sequence positions defines the RNA secondary structure. We count pseudoknotted structures, i.e., structures that contain non-nested base-pairs (e.g., two pairs
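The distinction between nested and non-nested base-pairs can be made concrete with a short check. The sketch below is illustrative only: it assumes the usual crossing criterion (two pairs {i, j} and {k, l} with i < k < j < l), and the function name `is_pseudoknotted` is hypothetical rather than part of SimulFold.

```python
from itertools import combinations


def is_pseudoknotted(pairs):
    """Return True if any two base-pairs cross, i.e. i < k < j < l.

    `pairs` is an iterable of two-element collections of sequence positions.
    Assumes every position takes part in at most one base-pair.
    """
    ordered = sorted(tuple(sorted(p)) for p in pairs)
    for (i, j), (k, l) in combinations(ordered, 2):
        # After sorting, i < k always holds, so only the crossing test remains.
        if k < j < l:
            return True
    return False
```

For example, the nested pairs {1, 10} and {2, 9} form a pseudoknot-free structure, whereas {1, 10} and {5, 15} cross and therefore form a pseudoknot.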

Most RNA structure prediction programs investigate only secondary structures that do not contain pseudoknots. In addition, most of these programs aim to predict the pseudoknot-free secondary structure that minimizes the free energy of the entire RNA molecule. The first empirical and theoretical investigations of the free energies of RNA secondary structures were conducted by Tinoco and his colleagues in the early 1970s [

The MFE approach has, however, a number of limitations. One conceptual limitation is the underlying assumption that a given RNA sequence will assume its MFE structure in the cell, i.e., its thermodynamic RNA structure. This assumption is not well supported in the general case. Theoretical, comparative studies of RNA molecules [

There exist several programs that aim to simulate the dynamic folding process of an RNA molecule in the cell to predict RNA structures that may contain pseudoknots [

Another conceptual limitation of the MFE approach is that the Zuker–Sankoff algorithm cannot handle pseudoknotted secondary structures, i.e., structures with non-nesting base-pairs. However, pseudoknotted RNA structures are known to fulfill diverse and important functional roles in the cell [^{4}) to ^{6}) for an RNA sequence of length

The best information for predicting the functional RNA structure can be derived from functionally equivalent RNA sequences of evolutionarily related organisms. This is due to the fact that evolutionarily related RNA sequences that serve the same purpose in the cell are likely to employ the same mechanism for exerting this function. In particular, if the function of these RNA sequences depends on their structure, this RNA structure (but not necessarily the RNA sequences themselves) should be highly conserved. If we therefore align the RNA sequences such that structurally equivalent parts are grouped together, we can detect pairs of columns in the sequence alignment where the primary sequence conservation may be low, but the functional conservation in terms of base-pairs is high. These base-pairing columns in the alignment where compensatory mutations occur in a correlated way are called co-varying or co-evolving columns. They provide the main sequence signal that many comparative structure prediction programs detect to predict the base-pairs of evolutionarily conserved RNA secondary structures. RNAalifold [

There already exist comparative programs that attempt pseudoknot prediction. These programs use the maximum weighted matching algorithm (MWM algorithm) [^{3}) time to analyze an RNA sequence of length

The fundamental conceptual problem that all of these comparative programs face is that they require an input alignment of high quality to be able to predict the conserved RNA secondary structure. However, such an alignment can often only be established if we already know the conserved RNA secondary structure. These comparative structure prediction programs thus face a major chicken-and-egg problem. If the sequences are very well conserved and easy to align based on primary sequence similarity, the resulting alignment may contain no or few co-varying columns. If, at the other extreme, the sequences are only distantly related, a trustworthy sequence alignment that would exhibit many co-varying columns is impossible to establish based on primary sequence similarity alone. Comparative RNA structure prediction methods that take a fixed alignment as input can therefore analyze only a very limited range of available data successfully.
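The co-variation signal that these comparative programs rely on can be illustrated with the mutual information between two alignment columns. This is a minimal sketch of one common co-variation measure, not the scoring scheme of any particular program discussed here; the function name is hypothetical, and gaps are treated as ordinary symbols for simplicity.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    Each column is a string with one character per sequence.
    """
    n = len(col_i)
    joint = Counter(zip(col_i, col_j))
    freq_i, freq_j = Counter(col_i), Counter(col_j)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(count * n / (freq_i[a] * freq_j[b]))
    return mi
```

Two perfectly co-varying columns such as `"GCAU"` and `"CGUA"` carry 2 bits of mutual information, while two fully conserved columns carry none. This mirrors the observation above that very well conserved sequences yield few co-varying columns.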

This chicken-and-egg problem has been addressed by several comparative structure prediction programs that do not require a fixed input alignment, Dynalign [

To summarize, all of the existing RNA structure prediction programs face at least one of the following challenges: (1) the MFE structure rather than the evolutionarily conserved structure that is likely to correspond to the functional structure is predicted, (2) unstructured regions of the RNA are not explicitly modeled, (3) input alignments are fixed and cannot be altered and improved, (4) pseudoknotted structures are either completely ignored or computationally too expensive to predict, (5) only two evolutionarily related RNA sequences are used as input, or (6) the evolutionary relationship between the RNA sequences is not explicitly modeled.

There are good reasons to believe that many of these problems are best solved simultaneously. For example, a good structure prediction should improve the prediction of the alignment, and vice versa. Likewise, a good alignment should improve the prediction of the evolutionary relationship of the RNA sequences, and vice versa.

The idea of co-estimating RNA secondary structures, multiple sequence alignments, and evolutionary trees was first suggested in a theory paper by David Sankoff in 1985 [

We here propose a novel theoretical framework for solving the problem of co-estimating RNA secondary structures including pseudoknots, multiple sequence alignments, and evolutionary trees. We introduce a joint distribution of RNA structures, alignments, and trees in a Bayesian framework. As it is not feasible to analytically calculate any interesting statistics in this model in reasonable computational time, we propose a Markov chain Monte Carlo (MCMC) method with which we can sample from the posterior distribution.

According to elementary probability theory, the following equation holds:
_{1}, _{1}, _{1}|_{2}, _{2}, _{2}|
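In generic notation, the Bayesian decomposition into a likelihood and a prior can be written as follows. The symbols are assumptions made for this illustration (S for the RNA structure, A for the alignment, T for the tree, and D for the input sequences) and need not match the notation of Equation 1.

```latex
P(S, A, T \mid D)
  \;=\; \frac{P(D \mid S, A, T)\, P(S, A, T)}{P(D)}
  \;\propto\; P(D \mid S, A, T)\, P(S, A, T)
```

Because the normalizing term P(D) does not depend on the structure, alignment, or tree, it cancels whenever two states are compared, which is what makes sampling-based methods applicable.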

We now explain how we calculate the terms on the right-hand side of Equation 1 and introduce the models that we employ to do so. The definitions that we propose for the prior probabilities merit a detailed discussion, as there is currently no widely accepted consensus on how to define these prior distributions. For the likelihood, we deliberately use the widely known Felsenstein likelihood.
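To illustrate the kind of calculation behind a Felsenstein likelihood, the following sketch evaluates a single alignment column on a small tree with the pruning recursion. It is a simplified illustration under stated assumptions: it uses the Jukes-Cantor substitution model with uniform base frequencies rather than the rate matrices used by SimulFold, it handles only unpaired columns, and the tree encoding (nested tuples of `(child, branch_length)`) is hypothetical.

```python
import math

BASES = "ACGU"


def jc69(t):
    """Jukes-Cantor transition probabilities for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {a: {b: (same if a == b else diff) for b in BASES} for a in BASES}


def partial_likelihoods(node):
    """Pruning recursion.

    A leaf is a single base; an inner node is a tuple of (child, branch_length).
    Returns, for each base x, the probability of the observed leaves below
    `node` given that `node` carries base x.
    """
    if isinstance(node, str):
        return {b: (1.0 if b == node else 0.0) for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        p = jc69(t)
        below = partial_likelihoods(child)
        for b in BASES:
            result[b] *= sum(p[b][c] * below[c] for c in BASES)
    return result


def column_likelihood(tree):
    """Likelihood of one alignment column, assuming uniform root frequencies."""
    return sum(0.25 * value for value in partial_likelihoods(tree).values())
```

For instance, `column_likelihood((("A", 0.1), ("A", 0.1)))` gives the probability of observing A in both of two sequences that are each 0.1 expected substitutions away from their common ancestor.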

The likelihood,

We write the prior, _{1}, _{2}, and _{3} is, to a certain extent, arbitrary and reflects our understanding of the underlying biological problem. We now explain in detail our reasons for choosing this decomposition. The consensus secondary structure, _{1}, _{2}, and _{3}, which we now introduce in detail.

The likelihood function, _{1}(_{i}

The consensus secondary structure, _{2}(_{2}(

The entropy of a pseudoknot-free secondary structure can be calculated by decomposing it into loops [

For pseudoknotted secondary structures, the calculation of the structure's entropy becomes more complicated. We use a simple model where each stretch of unpaired nucleotides of length

The structure prior, _{2}(

Other structure priors for pseudoknotted structures have, for example, been developed by Isambert and Siggia [

The alignment, _{3}(_{3}(_{3}(

For calculating the gap contribution to the prior, _{3}(

The analytical calculation of the posterior distribution is computationally too expensive. Instead, we employ a Bayesian MCMC method [_{new} := (_{new} in our case, is drawn from a proposal distribution, _{accept} [

The mixing of the Markov chain depends on how closely the proposal distribution resembles the target distribution. Gibbs sampling is a special case of MCMC sampling, where each state _{1}, _{2},..._{n}_{i}_{i}

As it is generally not possible to sample from an arbitrary conditional distribution, the Gibbs sampling strategy can only rarely be used. However, it is possible to mimic the conditional distribution with an auxiliary distribution. This strategy is employed in partial importance sampling; see MacKay [

It is possible to define tree moves that are independent from the actual alignment and RNA structure. However, it is generally impossible to alter the alignment without disturbing the RNA structure. We therefore use the following three types of moves: changing the length of an edge in the tree, changing the tree topology, and using a complex move that alters both the RNA structure and the alignment.

We use the Metropolis-Hastings algorithm [_{new}, from the interval [max{0, _{new})). We accept the new edge length if a random number _{new} is the length of interval [max {0, _{new} − δ}, _{new} + δ] and δ
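The edge-length move can be sketched as follows. This is a minimal illustration, assuming that the new length is drawn uniformly from the window [max{0, t − δ}, t + δ] and that the Hastings correction is the ratio of the two window lengths; the function names and the `log_target` argument (the log of the unnormalized posterior as a function of the edge length) are hypothetical.

```python
import math
import random


def window(t, delta):
    """Proposal interval [max(0, t - delta), t + delta] around edge length t."""
    return max(0.0, t - delta), t + delta


def propose_edge_length(t, delta, log_target, rng):
    """One Metropolis-Hastings update of a single edge length."""
    lo, hi = window(t, delta)
    t_new = rng.uniform(lo, hi)
    lo_new, hi_new = window(t_new, delta)
    # Near zero the windows have different lengths, so the proposal is not
    # symmetric and the ratio of window lengths enters the acceptance ratio.
    log_hastings = math.log(hi - lo) - math.log(hi_new - lo_new)
    log_alpha = log_target(t_new) - log_target(t) + log_hastings
    # 1 - random() lies in (0, 1], which keeps the logarithm finite.
    if math.log(1.0 - rng.random()) < log_alpha:
        return t_new
    return t
```

Calling the function repeatedly with, for example, `log_target=lambda t: -t` produces a chain of non-negative edge lengths whose distribution approaches an exponential distribution with mean one.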

For changing the topology of the tree, we pick a tree node at random and swap this node and its aunt node to alter its topology (see _{new})). We accept the new tree topology if a random number

The topology of the tree on the left gets modified into the tree on the right by swapping an aunt (A) and its niece (B). Niece–aunt swapping has been shown to be ergodic.
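The swap shown in the figure can be sketched on a rooted binary tree with parent pointers. The `Node` class and the helper names below are hypothetical and serve only to illustrate the move; edge lengths and the acceptance step are omitted.

```python
class Node:
    """Node of a rooted binary tree; leaves have a name and no children."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self


def swap_niece_aunt(niece):
    """Exchange `niece` with its aunt, i.e. the sibling of its parent."""
    parent = niece.parent
    if parent is None or parent.parent is None:
        raise ValueError("node has no aunt")
    grandparent = parent.parent
    aunt = next(c for c in grandparent.children if c is not parent)
    # Put each node into the child slot previously held by the other.
    parent.children[parent.children.index(niece)] = aunt
    grandparent.children[grandparent.children.index(aunt)] = niece
    aunt.parent, niece.parent = parent, grandparent


def newick(node):
    """Topology-only Newick string, for inspecting the result."""
    if not node.children:
        return node.name
    return "(" + ",".join(newick(c) for c in node.children) + ")"
```

Starting from the topology `((B,C),A)` and swapping niece B with aunt A yields `((A,C),B)`.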

Simultaneously changing the structure and alignment is more sophisticated and merits a detailed discussion. The challenge is to define moves that can significantly change the actual state, that can be calculated efficiently, and that have a high probability of being accepted.

We sample a new alignment by choosing a random window of the alignment that is devoid of base-pairs and by realigning it (step 1). We then sample a new RNA structure by removing a random set of helices from the given structure (step 2) and by adding a set of new helices (step 3).

The alignment is sampled in the following way while keeping the RNA structure and tree fixed. We first find intervals in the alignment that do not contain any base-paired positions and that cannot be further extended. We then choose one of these intervals at random and propose a new alignment for this interval. This alignment is created using a stochastic version of iterative alignment along the fixed tree [_{new}, _{proposal} and _{backproposal} denote the proposal and backproposal probabilities, respectively. The backproposal probability is defined as the probability of choosing the old state from the proposal distribution given that we are in the new state of the Markov chain.
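For reference, the general Metropolis-Hastings acceptance probability with proposal and backproposal terms has the following standard form. The symbols π (the target posterior), x (the current state), and x_new (the proposed state) are notation assumed for this illustration; the expression used in SimulFold may group its terms differently.

```latex
P_{\mathrm{accept}}
  \;=\; \min\left\{ 1,\;
  \frac{\pi(x_{\mathrm{new}})\; P_{\mathrm{backproposal}}}
       {\pi(x)\; P_{\mathrm{proposal}}} \right\}
```

The ratio of backproposal to proposal probability corrects for any asymmetry in how states are proposed, which is why both probabilities must be computable for every move.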

Once the alignment has been sampled, we sample the RNA structure while keeping the alignment and the tree fixed. The challenge is to devise a unique way of proposing a new structure; otherwise we cannot easily calculate the proposal and backproposal probabilities. We propose a new structure in the following way. We first decide on the number of helices to be removed from the given set of helices by drawing a random number from the truncated Poisson distribution with parameter λ = 3. We then remove this number of helices from a weighted distribution, where the weight of each helix is the log-odds ratio of its Felsenstein likelihoods (i.e., the likelihood of the RNA without any base-pairs and the likelihood of the RNA with the helix). The set of removed helices is denoted _{j} H_{new}, _{proposal} and _{backproposal} are the proposal and backproposal probabilities, respectively.
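The first two steps of this structure move can be sketched as follows. The sketch rests on several assumptions that the text does not settle: the Poisson distribution is truncated to the range from zero to the number of available helices, helices are removed one at a time without replacement, and all helix weights are positive. The function names are hypothetical.

```python
import math
import random


def truncated_poisson_pmf(lam, n_max):
    """Poisson(lam) probabilities restricted to 0..n_max and renormalized."""
    weights = [lam**k / math.factorial(k) for k in range(n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def choose_helices_to_remove(helix_weights, lam, rng):
    """Pick how many helices to remove, then pick them by weight.

    `helix_weights` maps a helix identifier to a positive weight.
    Returns the removed helices in the order in which they were drawn.
    """
    remaining = dict(helix_weights)
    pmf = truncated_poisson_pmf(lam, len(remaining))
    k = rng.choices(range(len(remaining) + 1), weights=pmf)[0]
    removed = []
    for _ in range(k):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names])[0]
        removed.append(pick)
        del remaining[pick]
    return removed
```

Because the number of removed helices and each individual draw have explicit probabilities, the proposal probability of a given removal can be obtained by multiplying them, which is what a unique proposal scheme requires.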

The primary result of an MCMC run is a large set of simulated (

We project the RNA structure,

There already exist a number of programs that determine a consensus structure for a given set of RNA structures, e.g., RNAdistance of the Vienna package [

We use the following procedure for deriving a consensus bi-secondary structure for each individual sequence. For each simulated (^{3}) time and ^{2}) memory to analyze a sequence in which

The pairing probabilities estimated by SimulFold are colour-coded and range from bright green (high probability) to bright red (low probability). The pairing probabilities for this structure range from 0.62 for the pair at sequence positions {4, 38} to one for most of the base-pairs, e.g., the one at sequence positions {6, 36}. For this figure, we have adopted a nonlinear colouring scheme; otherwise, all base-pairs would simply come in slightly different shades of green.
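Pairing probabilities of this kind can be estimated as the fraction of sampled structures that contain a given base-pair. The following sketch assumes that each sample is available as a collection of base-pairs in the coordinates of one sequence; the function name is hypothetical.

```python
from collections import Counter


def pairing_probabilities(sampled_structures):
    """Fraction of sampled structures that contain each base-pair."""
    counts = Counter()
    for structure in sampled_structures:
        # Normalize pair orientation and count each pair once per sample.
        counts.update({tuple(sorted(pair)) for pair in structure})
    n = len(sampled_structures)
    return {pair: count / n for pair, count in counts.items()}
```

A pair that occurs in every sample receives probability one, whereas a pair that occurs in only half of the samples receives 0.5.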

SimulFold is, to our knowledge, the first program that predicts an RNA structure including pseudoknots while simultaneously estimating an alignment as well as an evolutionary tree for several evolutionarily related input RNA sequences. It was therefore not possible to present a comparison to a truly equivalent program. Instead, we compare the RNA structures predicted by SimulFold to those predicted by RNAalifold [

RNAalifold takes a fixed alignment as input and predicts a consensus RNA secondary structure without pseudoknots. It extends the MFE algorithm employed by the non-comparative MFE methods Mfold and RNAfold by interpreting the fixed input alignment as a hyper-sequence and by simultaneously minimizing the overall free energy while taking the primary sequence conservation and co-varying columns in the fixed alignment into account. The optimization is implemented in a dynamic programming procedure that combines free energy parameters with conservation scores.

Hxmatch is an extension of RNAalifold. Like RNAalifold, it takes a fixed alignment as the only input and predicts a consensus RNA secondary structure. However, unlike RNAalifold, it is capable of predicting secondary structures with pseudoknots. Hxmatch employs a two-step procedure. In the first step, the fixed input alignment is analyzed with RNAalifold [^{3}) time and ^{2}) memory to analyze an input alignment of length

Pfold takes as input not only a fixed alignment, but also an evolutionary tree relating the sequences, and predicts a consensus secondary structure without pseudoknots, which Pfold cannot handle. Pfold employs an SCFG, i.e., a probabilistic rather than an MFE model, to derive the consensus secondary structure. Similar to RNAalifold, it takes the primary sequence conservation and co-varying columns in the fixed input alignment into account. Unlike RNAalifold, Pfold also takes the known evolutionary relationship of the input sequences, i.e., the input tree, explicitly into account. Both Pfold and RNAalifold require O(L^{3}) time and O(L^{2}) memory to analyze an input alignment of length L.

CARNAC is also a comparative RNA structure prediction method. It takes several unaligned RNA sequences as input and predicts an RNA structure for each individual RNA sequence which does not contain pseudoknots, as CARNAC cannot handle pseudoknots. Similarly to Hxmatch and RNAalifold, it does not take the evolutionary relationship of the input sequences explicitly into account. CARNAC employs a multi-step procedure for generating predictions. In the first step, potential helices are predicted for each RNA sequence separately. In the second step, an optimal consensus secondary structure is extracted from these helices for every possible pair of RNA sequences. In the third and last step, the different secondary structures that were predicted in a pairwise fashion for each individual RNA sequence are combined into one secondary structure using graph theoretical techniques. This is the final RNA structure reported by CARNAC for that RNA sequence. In the most general case, the algorithms underlying CARNAC would require ^{6}) time and ^{4}) memory to analyze input sequences of length ^{2}) time and memory.

We compiled a large and diverse dataset from previously published data [

Performance of CARNAC, Hxmatch, RNAalifold, Pfold, and SimulFold for Predicting RNA Structures

To evaluate the structure prediction performance, we compared the known RNA structure of the reference organism in each set with the corresponding predicted RNA structure. We measured the quality of the structure predictions in terms of the number of correctly predicted base-pairs (true positive base-pairs, or TP, see

CARNAC generally shows a high specificity, i.e., a low number of incorrectly predicted base-pairs, often in combination with a low sensitivity, i.e., a low number of true positive base-pairs. This low sensitivity can even be found for sets whose average pid is fairly high, e.g., set U5 (high) with an average pid of 88%. CARNAC's performance is naturally limited by the fact that it cannot predict pseudoknotted structures.
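The base-pair counts behind these measures can be computed directly from the reference and predicted structures. The sketch below assumes the definitions that are common in the RNA structure prediction literature: sensitivity is TP / (TP + FN), and specificity is reported as the positive predictive value TP / (TP + FP). These formulas and the function name are assumptions for illustration.

```python
def compare_structures(reference, predicted):
    """Count TP, FP, and FN base-pairs and derive sensitivity and PPV."""
    ref = {tuple(sorted(pair)) for pair in reference}
    pred = {tuple(sorted(pair)) for pair in predicted}
    tp = len(ref & pred)   # predicted pairs that are in the reference
    fp = len(pred - ref)   # predicted pairs that are not in the reference
    fn = len(ref - pred)   # reference pairs that were missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "sensitivity": sensitivity, "PPV": ppv}
```

A program that predicts few but mostly correct base-pairs, as described for CARNAC, scores a high PPV and a low sensitivity under these definitions.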

Besides SimulFold, Hxmatch is the only other investigated program that is capable of predicting pseudoknotted secondary structures. Hxmatch has the tendency to over-predict base-pairs, as indicated by the high number of false positive base-pairs. This happens for low average pids (e.g., set tRNA [low]) and for high pids (e.g., set SSU [high]). It is interesting to note that RNAalifold often does better than Hxmatch at predicting the base-pairs of pseudoknotted reference structures, e.g., the results for RNaseP (medium), RNase P8, SSU (medium), and SSU (high). However, there are also examples, see the corona set, where the reverse holds.

Like RNAalifold and Hxmatch, Pfold is a program that takes a fixed alignment as input. Its performance tends to be low for the low average pid range, e.g., the U5 (low), tRNA (low), and rRNA (low) sets, which all have average pids below 50%. For the high pid range, its performance can be limited because Pfold cannot model pseudoknots, e.g., in the corona, entero, and hepatitis delta virus (HDV) sets. Its performance for the RNaseP (medium) set constitutes a notable exception to this trend.

SimulFold is the only program that simultaneously co-estimates alignments, structures, and trees. It clearly outperforms all other programs in terms of overall performance for eight out of 16 sets: U5 (low), group II intron (low and high), tRNA (low), rRNA (low and high), entero, and HDV. It also shows a competitive performance for the sets U5 (high) and tRNA (high). These sets cover a wide range of average pids, from 40% to 91%. The results for the two SSU sets show that SimulFold has problems analyzing these two sets, whose reference alignments span more than 1,500 nucleotides. However, the results for the RNase P8 set show that SimulFold can successfully predict structures with high sensitivity even for comparatively long sequences (the reference alignment of the RNase P8 set has a length of 472 nucleotides). The results for the RNase P8 and the HDV sets show the benefits of parallel tempering. When investigating the predictions for the HDV set, we concluded from the loglikelihood plot (see

With parallel tempering (grey) and without parallel tempering (black). Without parallel tempering, the MCMC chain gets stuck in local optima.
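The reason parallel tempering helps a chain escape local optima lies in its swap step, in which two chains that run at different temperatures exchange their current states. The following sketch shows the standard swap acceptance probability for chains that target the posterior raised to the powers β_i and β_j. It is the textbook form of the technique, given under the assumption that SimulFold's tempering scheme follows it; the function name is hypothetical.

```python
import math


def swap_acceptance(beta_i, beta_j, log_post_i, log_post_j):
    """Probability of accepting a state swap between two tempered chains.

    beta_i, beta_j: inverse temperatures of the two chains.
    log_post_i, log_post_j: log posterior of the current state of each chain.
    """
    log_ratio = (beta_i - beta_j) * (log_post_j - log_post_i)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)
```

A swap that moves the better state into the colder chain (larger β) is always accepted, while the reverse swap is accepted only with a probability that shrinks with the difference in log posterior.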

Our initial motivation for devising a novel method that simultaneously co-estimates RNA structures, alignments, and evolutionary trees was to improve the prediction of RNA structures, in particular those with pseudoknots. A very interesting additional benefit of our approach is that SimulFold can also be used as an alignment and phylogeny prediction program.

We here present preliminary results for sequences from the HDV set that show SimulFold's potential as an alignment and phylogeny prediction program. By the same argument that we made above for RNA structure prediction, we should also be able to derive better alignments and trees if we co-estimate all three, interdependent quantities together rather than in isolation.

The HDV dataset contains 15 sequences of HDV ribozymes from several strains (their NCBI accession numbers are shown on the figures showing the alignments and the consensus networks). The ribozyme contains one pseudoknot and a variable helix. We calculated posterior probabilities of alignment columns from the multiple alignments that the MCMC method sampled, i.e., the probability of seeing a particular alignment column in a sampled multiple alignment.
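The column posteriors described here can be estimated by counting how often each column occurs among the sampled alignments. In the sketch below, a column is identified by the residue index that each sequence contributes to it, with `None` marking a gap. This encoding and the function names are assumptions made for illustration.

```python
from collections import Counter


def columns(alignment):
    """Encode each column as a tuple of per-sequence residue indices.

    `alignment` is a list of equally long gapped strings ('-' marks a gap).
    """
    positions = [0] * len(alignment)
    encoded = []
    for column in zip(*alignment):
        key = []
        for row, char in enumerate(column):
            if char == "-":
                key.append(None)
            else:
                key.append(positions[row])
                positions[row] += 1
        encoded.append(tuple(key))
    return encoded


def column_posteriors(sampled_alignments):
    """Fraction of sampled alignments in which each column occurs."""
    counts = Counter()
    for alignment in sampled_alignments:
        counts.update(set(columns(alignment)))
    n = len(sampled_alignments)
    return {column: count / n for column, count in counts.items()}
```

Columns that appear in every sampled alignment receive a posterior of one, which makes it possible to flag reliably and unreliably aligned regions.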

We calculated the maximum posterior decoding (MPD) alignment using a dynamic programming procedure [

The MPD alignment is shown in

The name of each sequence indicates the NCBI accession number of the strain. The ribozyme contains one pseudoknot and a variable helix as shown in the line above the alignment, which denotes the known reference structure in dot-bracket, or Vienna, notation. The posterior probabilities for each alignment column were derived from the multiple sequence alignments that the MCMC method sampled and are indicated at the top of the figure.

If we compare our MPD alignment to the alignment generated by Clustal-X [

Some parts of the AB037948 and L22063 sequences are misaligned (red characters), which causes nonsense base-pairs (highlighted in green) when mapping the known reference structure onto these sequences.

We calculated consensus networks based on the evolutionary trees sampled from the posterior distribution by the MCMC using the method of Holland and Moulton [

The name of each sequence indicates the NCBI accession number of the strain (see also

These preliminary results show that SimulFold not only allows us to derive consensus multiple alignments and evolutionary trees, but even enables us to highlight particularly well or poorly estimated parts of these alignments and trees.

It is easy to think of situations where one does not want to simultaneously co-estimate RNA structures, alignments, and trees, e.g., because a high-confidence RNA structure (or alignment or tree) has already been established. It is straightforward to employ SimulFold in these situations, as the program can be easily told to keep the input RNA structure or alignment or tree (or any combination thereof) fixed.

An MCMC can suffer from a low efficiency for three main reasons: (1) the acceptance ratio is low, (2) the Markov chain gets stuck in local optima, or (3) the computational time to perform each step is large. We introduced partial Metropolis importance sampling to quickly propose moves that replace only part of the data and to keep the rejection probability and autocorrelation low. For the HDV set, the initial loglikelihood plot shows poor mixing (see ^{2}) time, where

We propose here a novel theoretical framework for co-estimating an RNA structure including pseudoknots,

Our novel theoretical framework allows us to sample (

Our work is significant in a number of ways. SimulFold overcomes several limitations of existing RNA structure prediction methods, in particular the conceptual limitations of SCFG-based methods. SimulFold does not rely on a fixed input alignment or tree, it can predict pseudoknotted RNA structures, it can take any number of related RNA sequences as input, it aims to predict the evolutionarily conserved RNA structure rather than the thermodynamic or MFE structure, it explicitly models the evolutionary relationship between the RNA sequences, it is a fully probabilistic method that is capable of quantifying the reliability of its predictions, and, most important for the majority of users, it works in a computationally efficient way and can be used on any standard desktop computer. Furthermore, SimulFold derives the RNA structure that is best supported by the posterior distribution, rather than the RNA structure that maximizes the likelihood, which is what SCFG-based structure prediction methods do.

We use a number of novel theoretical and computational tricks to achieve the above. We devised a new expression for the prior ^{2}) pre-processing time, we do an MCMC step modifying the base-pairs that requires

The performance of SimulFold in predicting RNA secondary structures with and without pseudoknots compares very well to the performance of RNAalifold, Hxmatch, Pfold, and CARNAC across a wide range of average pids and sequence lengths. We also present encouraging preliminary results that show SimulFold's potential as an alignment and phylogeny prediction program. It is not only possible to derive a consensus alignment and tree, but also to highlight those parts of the alignment and tree that can be particularly well or poorly estimated. This information is very valuable for interpreting the results in great detail.

It is easy to think of situations where one does not want to simultaneously co-estimate RNA structures, alignments, and trees, e.g., because a high-confidence RNA structure (or alignment or tree) has already been established. We therefore implemented special flags in SimulFold that allow the user to keep the input RNA structure or alignment or tree (or any combination thereof) fixed. We hope that this feature will make SimulFold a useful program for a wide range of interesting tasks and data analyses.

In the future, we intend to investigate different models and priors for use in SimulFold, e.g., a co-transcriptional folding prior. We also hope to further improve the properties of the sampling, e.g., through partial importance sampling of trees or an even better structure sampler, to improve the performance for very long sequences.

SimulFold opens up a large number of possibilities for exciting data analysis. Most importantly, we can now start analyzing data whose low primary sequence conservation has so far prevented their analysis with methods that require a high-quality input alignment. We hope that our work inspires other researchers to also develop methods that predict or investigate the functional structure of RNA sequences so that we learn more about how RNA sequences play their diverse functional roles in the cell.

SimulFold as well as information on the input and output files of this analysis can be found at

IMM would like to thank Elena Rivas, Eric Westhof, and the participants of the computational RNA workshop in Benasque, Spain, for many inspiring discussions. IM would like to thank Péter Ittzés for discussing consensus networks. We would both like to thank Bjarne Knudsen for providing us with the diagonalized rate matrices of Pfold and Paul Gardner for help with the dataset. Last, but not least, we would like to thank the three anonymous referees for their constructive comments.

CM, covariance model

FN, false negative

FP, false positive

HDV, hepatitis delta virus

indel, insertion and deletion

MCC, Matthews correlation coefficient

MCMC, Markov chain Monte Carlo

MFE, minimum free energy

MPD, maximum posterior decoding

MWM algorithm, maximum weighted matching algorithm

pid, pairwise sequence identity

SCFG, stochastic context-free grammar

TP, true positive