The authors have declared that no competing interests exist.

Conceived and designed the experiments: AFN. Performed the experiments: AFN. Analyzed the data: AFN. Contributed reagents/materials/analysis tools: AFN SFA. Wrote the paper: AFN SFA. Designed and implemented routines for Dirichlet mixture priors and BILD scores: SFA. Designed and implemented other aspects of the GISMO program: AFN.

We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are:

Existing multiple alignment programs typically utilize (i) bottom-up progressive strategies, which require the time-consuming alignment of each pair of input sequences, (ii)

This is a

A common starting point for the computational analysis of proteins is the construction of a multiple sequence alignment (MSA). Insofar as they result from protein functional similarities and differences, the patterns of residue conservation and divergence within such an alignment provide clues to biological function. Of course the biological relevance of any observed patterns depends upon an alignment’s accuracy, and alignments of larger sequence sets have greater statistical power. For biologically appropriate scoring systems applied to more than a very small number of sequences, however, no optimization procedures are known that are both tractable and rigorous; thus all practical MSA programs rely upon heuristic methods.

The most widely used general approach to multiple alignment is the progressive technique [^{2}) time to compute such scores, and this becomes the time-dominating step for large

An alternative O(

Aligning distantly related sequences presents major algorithmic and statistical challenges because such sequences typically share similarity only within a common structural core, with sizable insertions often occurring between core elements. Classical dynamic programming alignment algorithms typically have difficulty spanning these insert regions because the log-odds scores associated with weakly conserved core elements are often too low to offset the gap penalties incurred. Fortunately, even when the conserved blocks are very subtle, an MCMC strategy can take advantage of a large number of input sequences to detect weak yet statistically significant similarities.

Two factors have tended to slow previous MCMC sampling procedures, or to trap them in local optima. The first is the inclusion of correlated sequences within the input. When a set of such sequences is misaligned to the main body of sequences, it favors recurring misalignment when individual sequences from the set are resampled. This problem may be partially addressed by removing from the program’s input all but one sequence among closely related sets; these sequences may be added to the alignment at the program’s end. The second factor is the difficulty in accurately identifying the number and locations of aligned columns corresponding to the structural core, and the corresponding placement of indels. A previous sampler addressed this problem with only partial success by splitting or joining contiguous blocks, extending or trimming blocks, and by allowing short indels within a block [

A critical issue for multiple alignment programs is how they internally assess sequence alignment quality. One widely used measure is the sum of the implied pairwise scores, but this measure lacks a good mathematical justification. Previous MCMC programs introduced measures with a rigorous statistical basis by sampling over the posterior probability distribution defined by a statistical model for aligned columns [

In this article, we describe a new approach to MSA, whose main features are as follows.

GISMO shares certain algorithmic and statistical features with an earlier version of this sampler [

Most multiple alignment methods utilize a “bottom-up” progressive alignment strategy. That is, they start by aligning the most closely related sequences and then progress to those more distantly related. GISMO takes an inverted, “top-down” MCMC approach that starts by aligning, among all sequences, the core regions they share. It first generates a random alignment consisting of many short (5- to 15-column) co-linear aligned blocks (

A program’s measure of multiple alignment quality, either explicit or implicit, plays a vital role in determining the alignments it will produce. GISMO’s measure corresponds to an underlying generative statistical model, specifically an HMM, and, to the extent that it can efficiently explore alignment space, GISMO will converge on an optimal solution (assuming that sequence weights remain the same; see below). GISMO’s statistical model has several features worth noting. (i) To counter redundancy or bias among the input sequences, GISMO down-weights closely correlated sequences [

As described in Methods, GISMO infers HMM transition probabilities at each model position from the evolving alignment itself. More specifically, the observed numbers of each type of transition, together with specified prior probability distributions for these transitions, imply posterior probabilities for each transition at each position. These correspond to implicit gap penalties, which favor placing insertions or deletions (indels) within a given sequence at positions where indels have been inferred within other sequences. This automatically tends to favor nearly block-based alignments, a characteristic that the following column-sampling strategy exploits.

The HMM’s evolving position-specific transition probabilities tend to align conserved regions of the proteins as contiguous blocks, separated by insertions of varying lengths. To improve the alignment, however, it is desirable for the sampler to explore alternative configurations of aligned columns. To determine the proper extent of an implied block, GISMO adds or removes columns based on their

When individual sequences are realigned to the evolving HMM, they may be sampled (as described in Methods) one at a time, with the HMM parameters recomputed after each sequence is removed from the alignment. However, this approach encounters difficulties when an alignment consists of distinct clusters of more closely related sequences, because a sampled sequence is biased by the remaining sequences of its cluster to realign as before. Sampling all the sequences of a cluster in tandem can overcome this “stickiness”. GISMO does this in two distinct ways. First, for a cluster

Alternatively, for a cluster

Finally, GISMO applies three different coordinated sampling strategies: (i) It realigns sequences using a ‘purged’ set as follows: first, it groups all sequences into closely-related clusters; second, it retains in the alignment only the one sequence from each cluster closest to the cluster’s consensus sequence; third, it realigns by sampling each of the remaining sequences; finally, it resamples into the alignment each of the sequences that were originally excluded. This step resembles another, recently-described MSA strategy [
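The selection of one representative per cluster in the 'purged' set strategy can be illustrated with a minimal sketch. It assumes that the clusters are already given as lists of equal-length aligned rows, and that "closest to the consensus" means the fewest mismatches against a majority-rule consensus. Both are illustrative assumptions; the function names `consensus`, `representative`, and `purge` are hypothetical and are not taken from the GISMO source.

```python
from collections import Counter

def consensus(rows):
    """Majority-rule consensus: the most common character in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def representative(rows):
    """Index of the row with the fewest mismatches to the cluster consensus."""
    cons = consensus(rows)

    def mismatches(row):
        return sum(a != b for a, b in zip(row, cons))

    return min(range(len(rows)), key=lambda i: mismatches(rows[i]))

def purge(clusters):
    """Keep one representative row per cluster (assumed distance measure)."""
    return [cluster[representative(cluster)] for cluster in clusters]

clusters = [["ACDE", "ACDF", "GCDE"], ["WWW"]]
print(purge(clusters))
```

In this sketch, the sequences left out of the purged set would be resampled into the alignment afterwards, as the text describes.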

Given the stochastic nature of MCMC sampling, it is advantageous to focus on refining the best alignment among several initial candidate alignments. GISMO does this as follows. (i) It generates a rough block-based alignment for all input sequences, which it uses to construct clusters of closely related sequences, and then selects one sequence from each cluster for further preliminary alignment. (ii) For these sequences, it independently generates a population of block-based alignments, ten by default. (iii) It converts each of these alignments into an HMM alignment and resamples its sequences permitting the introduction of gaps. (iv) It scores each alignment by its similarity to the other alignments (see

The GISMO program was implemented in C++. We tested GISMO on 408 protein sequence sets; these correspond to those domain alignments within version 3.14 of the NCBI Conserved Domain Database (CDD) [

GISMO was compared to four widely used MSA programs, MUSCLE (v3.8.31) [

We assess alignment accuracy using SP-scores, with the CDD alignments as benchmarks. In brief, an SP-score (from "Sum of the Pairs") is the proportion of aligned residue pairs within a benchmark multiple alignment that are also aligned with each other within a test multiple alignment. Note that elsewhere the term "SP-score" frequently has a related but distinct meaning: an objective measure of multiple alignment quality, rather than a measure of alignment accuracy relative to a benchmark, which is its meaning here. Note also that our benchmark CDD alignments leave many residues in many sequences unaligned, and these are ignored in calculating SP-scores, so a program that aligns these residues is neither penalized nor advantaged. In practice, GISMO leaves many of these residues unaligned as well, in contrast to most other multiple alignment programs. To the extent that one is not merely agnostic about these residues’ proper alignment, but believes they should in fact be left unaligned, GISMO’s performance is underestimated here.
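The SP-score defined above can be computed along the following lines. This is a minimal sketch under an assumed input convention, not the scoring code used for this study: each alignment is a list of rows in which uppercase letters are aligned residues, lowercase letters are unaligned residues, and `-` or `.` are gaps. The function names are hypothetical.

```python
from itertools import combinations

def column_map(row):
    """Map residue index -> column index for the aligned (uppercase) residues."""
    mapping = {}
    residue = 0
    for col, ch in enumerate(row):
        if ch in "-.":
            continue                      # gap: no residue consumed
        if ch.isupper():
            mapping[residue] = col        # aligned residue
        residue += 1                      # lowercase residues are counted but unaligned
    return mapping

def sp_score(benchmark, test):
    """Fraction of benchmark-aligned residue pairs that share a column in test."""
    bench = [column_map(r) for r in benchmark]
    tst = [column_map(r) for r in test]
    total = matched = 0
    for a, b in combinations(range(len(benchmark)), 2):
        col_to_res_b = {col: res for res, col in bench[b].items()}
        for res_a, col in bench[a].items():
            res_b = col_to_res_b.get(col)
            if res_b is None:
                continue                  # residue of a is not paired with b
            total += 1
            col_a, col_b = tst[a].get(res_a), tst[b].get(res_b)
            if col_a is not None and col_a == col_b:
                matched += 1
    return matched / total if total else 0.0

print(sp_score(["AC-D", "ACED"], ["ACD-", "ACED"]))
```

Residues that the benchmark leaves unaligned never enter `total`, so a test alignment is neither rewarded nor penalized for how it treats them, matching the convention described in the text.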

To compare GISMO to other programs, we define the GISMO ∆SP-score as the SP-score for GISMO minus the SP-score for the other program. ^{−5} for all five programs (see

As described in the text, for each analysis the CDD test sets were first ordered based on the property specified on the x-axis and then split into four equal-sized partitions. The x-coordinates for all data points are averages, for the property in question, over the test sets assigned to the various partitions; similarly, the GISMO ΔSP-scores for each program are averages taken over these partitions.
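The ordering, partitioning, and averaging described in this caption can be sketched as follows. The input format, a list of `(property_value, delta_sp)` pairs with one pair per test set, and the function name are assumptions made for illustration only.

```python
from statistics import mean

def partition_averages(test_sets, n_parts=4):
    """Sort test sets by property, split into n_parts groups, average each group.

    test_sets: list of (property_value, delta_sp) tuples; assumes at least
    n_parts entries so that no group is empty.
    Returns one (mean property, mean delta_sp) point per partition.
    """
    ordered = sorted(test_sets, key=lambda t: t[0])
    size, extra = divmod(len(ordered), n_parts)
    points = []
    start = 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)  # spread any remainder
        chunk = ordered[start:end]
        points.append((mean(p for p, _ in chunk), mean(d for _, d in chunk)))
        start = end
    return points

data = [(p, p / 10) for p in [4, 1, 3, 2, 8, 5, 7, 6]]
for x, y in partition_averages(data):
    print(round(x, 2), round(y, 3))
```

With 408 test sets and four partitions, each partition holds exactly 102 test sets, so the remainder handling is not needed in that case.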

GISMO’s enhanced alignment quality for larger sequence sets may be due either to improved detection of the conserved domain within full-length sequences or to improved placement of indels within the domain or both. The analyses in

Also worth noting about the analyses in

Unlike most MSA programs, GISMO is stochastic and therefore will return a different MSA for each run. This raises the question of GISMO’s run-to-run SP-score variability, as well as how this compares to the variability in SP-scores among distinct deterministic programs. To start,

SP-scores are based upon the CDD MSAs as benchmarks and vary from 0 (no correctly aligned sequence pairs) to 1 (all pairs aligned correctly). A. The sorted SP-scores for a single GISMO run (red line with yellow back-glow) compared with the sorted scores for the five other programs. B. Run-to-run variability in SP-scores over six GISMO runs. Test set data points are sorted along the x-axis by the SP-score obtained for each set on the first of the six runs (red data points). C. SP-scores for the six programs analyzed, sorted by the GISMO score on each test set. GISMO SP-scores (for a single run) are shown in red. Each red data point and the five black data points (one point for each program) plotted in the same column correspond to the same test set. D. SP-scores for the six programs, sorted by the Clustal-Ω score on each test set. Data points for GISMO and for Clustal-Ω are shown in red and green, respectively.

Log-log plots of each program’s runtimes for each of the 408 CDD test sets are given in ^{1.6} and ^{2.2}. There is considerable variability in GISMO runtimes for a given

Each data point corresponds to one MSA generated by the program indicated. Estimated time complexities based on trendline slopes were for: GISMO, ^{0.96}; Clustal-Ω, ^{1.6}; Kalign, ^{1.6}; MUSCLE, ^{2.1} and Dialign ^{2.2}, where
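Estimating an empirical time complexity from a trendline slope on a log-log plot amounts to a least-squares fit of log(runtime) against log(input size). The sketch below illustrates the idea. The `sizes` argument is a placeholder for whichever input-size measure is plotted on the x-axis, and the function name is hypothetical.

```python
import math

def loglog_slope(sizes, runtimes):
    """Least-squares slope of log(runtime) versus log(size).

    A slope of s suggests runtime grows roughly as size**s over the
    range of sizes observed.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in runtimes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic data following runtime = 3 * size**1.6
sizes = [100, 1000, 10000]
runtimes = [3 * s ** 1.6 for s in sizes]
print(round(loglog_slope(sizes, runtimes), 3))
```

Such a slope is an empirical summary over the tested range and not a formal complexity bound.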

Because GISMO is designed to align only those regions conserved by all of the sequences included in the input set, it is most appropriate to benchmark it against CDD alignments, which likewise only align the common conserved region. However, we were curious to know how it performs against a benchmark set designed for MSA programs that globally align all of the input sequences. For this we selected the Prefab benchmark set [

An example of how GISMO aligns representative proteins of known structure for acetylase domain proteins is shown in

Representative proteins of known structure are shown; no two of these share more than 27% sequence identity over the domain footprint. The full alignment consists of 2,125 sequences.

By eliminating the need for a guide tree, our earlier MSA Gibbs samplers [

GISMO’s advantage over progressive alignment methods is most noticeable when a shared domain is present within long, multi-domain proteins. Consider, for example, the MAFFT and GISMO MSAs in

In contrast, due to its statistical basis, GISMO will align only sequences sharing significant similarity. This feature also allows GISMO to identify the HMM architecture and parameters most likely to generate the input sequences and to thereby define the extent of the core alignment more precisely. Sampling from the Bayesian posterior probability distribution leads to output alignment variability, which some biologists might find troubling. However, we believe that this allows a more realistic assessment of what can reliably be inferred from the input sequences than does repeatedly returning the same suboptimal alignment. As illustrated in

For some domains, different runs of the current version of GISMO (or runs of different programs) generate significantly differing alignments, some of which appear to be far from optimal. However, the statistical and algorithmic foundations laid here provide avenues for further improvement: Close examination of misaligned regions can suggest new sampling strategies for escaping suboptimal traps. Such strategies may yield more than merely incremental improvement to alignment results. With this in view, we anticipate many further enhancements to GISMO. In particular, there is a large body of literature on MCMC sampling strategies [

Finally, we ask: What is the benefit of a large, high quality alignment of evolutionarily-related sequences? We suggest an answer through an analogy to physical chemistry: Statistical thermodynamics describes the macroscopic properties of matter as average molecular properties arising from probability distributions over quantum mechanical states. Its central concept is the Boltzmann distribution, which specifies the most probable population of molecular states for a system in thermodynamic equilibrium. This distribution defines all of the thermodynamic properties central to our understanding of chemistry—such as entropy, heat capacity, enthalpy and free energy.

Likewise, the biological properties of proteins may be better understood by considering average properties implied by probability distributions over polypeptide states, with the central concept being a distribution specifying the most probable population of sequences for a protein class in evolutionary equilibrium. GISMO can be used, in combination with a companion MCMC sampler for protein classification [

The following notation is used for vectors _{1},…,_{n})^{T} and _{1},…,_{n})^{T}: |_{1}|+…+|_{n}|, _{1}+_{1},…,_{n}+_{n})^{T}, _{1}/_{1},…,_{n}/_{n})^{T}, _{1})…Γ(_{n}). Given _{k} is the _{k,i} corresponds to the _{k}) returns a length 20 vector of the counts for the residue types in _{k}.

A block-based alignment of the input sequences is defined by _{j} = {_{1,j},…,_{K,j}}. We define _{j[−k]} ≡ _{j} − {_{k,j}} to denote the set _{j} without _{k,j}. An alignment is defined by the matrix _{1},…,_{w})^{T} and {_{k,j}: _{C}. For instance, _{{A}} = {_{k,j}:

The residue frequencies observed for column _{c} = (_{1,c},…,_{20,c})^{T} where _{i,c} > 0 for all _{1},…,_{w}) defines a product multinomial model corresponding to the full alignment. The vector _{0} corresponds to a background amino acid residue distribution. Hence, the complete-data likelihood function is given by
_{0} ∼ _{1},…,_{w}) specifies the Dirichlet distribution parameters (commonly interpreted as numbers of pseudocounts) at each column position _{1},…,_{w}) = (_{k,j})_{K×w} where _{k,j} indicates the position of the

The conditional predictive probability distribution of this conserved region occurring at position _{[−k]}. This statistical model serves as the foundation for the HMM [

In order to capture the fact that certain biochemically or structurally similar amino acid residues are more likely to occur together we have incorporated Dirichlet Mixture priors [

Sequences are down weighted for redundancy using the following procedure. For each sequence _{max} corresponds to the maximum non-integer sequence weight. Because these weights depend upon the evolving alignment, they are updated after each sampling cycle.

We model the transition probabilities for the HMM shown in

Transitions into M and I states emit a residue as specified by the Θ of our statistical model.

_{k}

Ignoring the indexing variable _{mi}, _{md}, _{mm}, _{ii}, _{im}, _{dd}, _{dm} are corresponding prior pseudo counts. The corresponding maximum a posteriori probability (MAP) estimates for the transition probabilities at each position
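To illustrate how observed transition counts and prior pseudocounts combine into position-specific transition probabilities, the following sketch normalizes (count + pseudocount) over the transitions leaving each state. This common form is an assumption made here for illustration; the exact expressions of the estimator are not reproduced. The keys `mm`, `mi`, `md`, `ii`, `im`, `dd`, and `dm` follow the subscripts used in the text, and the function name is hypothetical.

```python
# Transitions grouped by source state: match (m), insert (i), delete (d).
GROUPS = {"m": ("mm", "mi", "md"), "i": ("ii", "im"), "d": ("dd", "dm")}

def transition_estimates(counts, pseudo):
    """Estimate transition probabilities at one model position.

    counts: observed (possibly weighted) transition counts, keyed by type.
    pseudo: positive prior pseudocounts, keyed by type.
    Assumed form: (count + pseudocount) / total over the same source state.
    """
    probs = {}
    for keys in GROUPS.values():
        total = sum(counts.get(k, 0.0) + pseudo[k] for k in keys)
        for k in keys:
            probs[k] = (counts.get(k, 0.0) + pseudo[k]) / total
    return probs

counts = {"mm": 90, "mi": 5, "md": 5, "ii": 3, "im": 1}
pseudo = {k: 1.0 for keys in GROUPS.values() for k in keys}
print(transition_estimates(counts, pseudo))
```

Under this form, positions where many sequences already carry an insertion or deletion receive higher `mi` or `md` probabilities, which corresponds to a lower implicit gap penalty at those positions.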

Given the alignment and thus the paths Λ, we have the conditional posterior distribution

Sampling on the distribution for each position

For computational efficiency, the ι and δ may be integrated out [

This gives rise to a new posterior distribution

GISMO’s MCMC sampling algorithm explores the space of possible alignments by executing Markovian transitions between alignments. This involves sampling alternative alignments of either individual sequences or groups of sequences. In either case, such sampling is done as follows: First, the sequence or sequences are removed from the alignment and the posterior parameters of the HMM are recalculated based on the retained aligned sequences and the priors. Next, emission probabilities for the twenty amino acids at each position are sampled from the posterior emission probability distributions defined by the HMM parameters; note that these sampled probabilities define a sampled HMM. Finally, the previously removed sequences are optimally realigned to the sampled HMM. We explored sampling transition probabilities in the same way, but found little benefit in doing so; instead, the MAP estimates for transition probabilities are used. GISMO applies simulated annealing [
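The emission-sampling step can be illustrated for a single column. The sketch below draws a probability vector from a Dirichlet posterior by normalizing independent gamma variates. It assumes a single Dirichlet prior per column, which is a simplification of the Dirichlet mixture priors described above. The function name and the example counts are hypothetical.

```python
import random

def sample_emission_probs(counts, pseudocounts, rng=random):
    """Draw one probability vector from Dirichlet(counts + pseudocounts).

    counts: (possibly weighted) residue counts for one aligned column.
    pseudocounts: positive prior parameters, one per residue type.
    Uses the standard construction: normalize independent Gamma(alpha, 1) draws.
    """
    gammas = [rng.gammavariate(c + a, 1.0) for c, a in zip(counts, pseudocounts)]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)                 # fixed seed for reproducibility
counts = [0.0] * 19 + [50.0]           # a column dominated by one residue type
pseudo = [0.5] * 20                    # illustrative prior, not GISMO's
probs = sample_emission_probs(counts, pseudo, rng)
print(len(probs), round(sum(probs), 6))
```

Repeating this draw for every column yields one sampled set of emission probabilities, to which the removed sequences would then be realigned.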

The GISMO program and the CDD benchmark MSAs and sequence sets used for this study are available at


The nine sequences between the two lines are misaligned; the insert residues shown in red correspond structurally to the first 10 columns shown. Note that these misaligned sequences share two distinguishing features: (i) they contain 27–30 residue insertions that the other sequences lack and (ii) they conserve a glycine (G) residue in the seventh column instead of the consensus arginine (R) residue. GISMO relies on such features to identify and realign clusters of correlated sequences.

(PDF)

This corresponds to the same sequences and domain footprint as the MAFFT alignment in

(PDF)

This corresponds to the same sequences and domain footprint as the GISMO alignment in

(PDF)

This corresponds to the same sequences and domain footprint as the MAFFT alignment in

(PDF)

This corresponds to the same sequences and domain footprint as the GISMO alignment in

(PDF)

This corresponds to the same sequences and domain footprint as the MAFFT alignment in

(PDF)

This corresponds to the same sequences and domain footprint as the GISMO alignment in

(PDF)

We thank L. Aravind for critical assessment of the GISMO program and helpful discussions.