Conceived and designed the experiments: AKM CP PMFN EJC. Performed the experiments: AKM. Analyzed the data: AKM. Contributed reagents/materials/analysis tools: AKM. Wrote the paper: AKM CP PMFN EJC.

The authors have declared that no competing interests exist.

Identifying transcription factor (TF) binding sites (TFBSs) is an important step towards understanding transcriptional regulation. A common approach is to use gaplessly aligned, experimentally supported TFBSs for a particular TF, and algorithmically search for more occurrences of the same TFBSs. The largest publicly available databases of TF binding specificities contain models represented as position weight matrices (PWMs). Other methods use more sophisticated representations, but these have more limited databases, or are not publicly available. Therefore, this paper focuses on methods that search using one PWM per TF. An algorithm, MATCH™, for identifying TFBSs corresponding to a particular PWM is available, but is not based on a rigorous statistical model of TF binding, making it difficult to interpret or adjust the parameters and output of the algorithm. Furthermore, there is no public description of the algorithm sufficient to exactly reproduce it. Another algorithm, MAST, computes a p-value for the presence of a TFBS using true probabilities of finding each base at each offset from that position. We developed a statistical model, BaSeTraM, for the binding of TFs to TFBSs, taking into account random variation in the base present at each position within a TFBS. Treating the counts in the matrices and the sequences of sites as random variables, we combine this TFBS composition model with a background model to obtain a Bayesian classifier. We implemented our classifier in a package (SBaSeTraM). We tested SBaSeTraM against a MATCH™ implementation by searching all probes used in an experimental

Identifying which transcription factors bind to which promoters is an important step towards understanding the transcriptional regulatory code. This identification process can be divided into two parts: determining the binding specificity of specific transcription factors, and then identifying TFBSs in a sequence using the binding specificity information.

There have been a number of papers proposing methods for one or both parts of the problem. Methods for finding transcription factors (as motifs which are statistically over-represented in sequences) can be broadly classified as those based on phylogenetic footprinting, and those which are not. These methods have been widely compared

The link between determining binding specificity and finding sites where the transcription factor is likely to bind is the way in which binding specificity is represented. At present, the largest databases which are generally available, such as TRANSFAC

There are more sophisticated representations for transcription factor binding specificity, such as the Hidden Markov Model (HMM) approach used by MAPPER

Representing transcription factor binding specificities in this form means that no data is stored on the interaction of binding specificity between different base positions in the binding sites. This is a reasonable approximation, as molecular binding models describing the interactions between transcription factors and DNA have shown that binding energies are approximately additive between bases

Existing PWM-based search methodologies, such as MATCH™, have not been justified on the basis of a formal statistical model. MATCH™ instead computes scores using the formula

Let

We define a TFBS as a locus that is under evolutionary pressure to retain a sequence to which a particular transcription factor will bind. The sequence is used as evidence supporting the hypothesis that there is a TFBS at a particular locus. For example, the presence of a sequence identical to the consensus sequence for the transcription factor is strong evidence for a TFBS. A sequence which is less similar to the consensus sequence is weaker evidence for a TFBS. This is because the number of possible sequences increases as the deviation from the consensus sequence increases, and so the null hypothesis that similarity to the consensus sequence arose by chance (as opposed to natural selection) becomes more credible.

Under this definition, a transcription factor either binds to a TFBS, or it does not; there is no attempt to model the degree of affinity, only to determine if there is evidence for an underlying process. Note that evolutionary pressure may select for a moderate TF-TFBS affinity, but against a stronger affinity. In this case, evidence for the TFBS is reduced, but may still be enough to detect the site.

We use two models of putative TFBS sequences. The foreground model describes the distribution of sequences under the alternative hypothesis that there is a TFBS at the site. The background model describes the distribution of sequences under the null hypothesis that there is no TFBS at the site.

Our foreground model is best introduced in terms of a matrix of hidden parameters

Our foreground model requires that each base in a TFBS is independently selected in accordance with the hidden parameters. In practice, there are two ways in which new TFBSs are likely to arise. They may arise from convergent evolution, in which case the TFBS sequence is independent of all other TFBSs. Alternatively, an existing TFBS could be copied in a duplication event, creating a paralogous TFBS which is not independent of the original. Over time, however, mutations to less strongly conserved bases in the two TFBSs will reduce this dependence. For this reason, the independence assumption is reasonable except for very recently duplicated TFBSs.
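To illustrate what the independence assumption means computationally, the following Python sketch scores a candidate site as a sum of per-position log-probabilities. The function name `foreground_log_prob`, the toy counts, and in particular the estimator (the posterior mean of a binomial proportion under a uniform prior, (n + 1) / (N + 2)) are assumptions made for this example, and may differ from the expressions used in BaSeTraM.

```python
import math

BASES = "ACGT"


def foreground_log_prob(counts, site):
    """Natural-log probability of `site` under an independent-positions model.

    counts: one dict per motif position, mapping base -> raw count.
    Assumption (illustrative only): P(base at position i) is estimated as
    (n + 1) / (N + 2), the posterior mean of a binomial proportion under a
    uniform prior, where n is the count of that base and N the column total.
    """
    if len(site) != len(counts):
        raise ValueError("site length must match motif length")
    total = 0.0
    for column, base in zip(counts, site):
        n_total = sum(column[b] for b in BASES)
        # Independence: per-position log-probabilities simply add up.
        total += math.log((column[base] + 1) / (n_total + 2))
    return total


# Made-up 3-position motif built from 10 aligned sites.
toy_counts = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 2, "T": 0},
    {"A": 3, "C": 3, "G": 2, "T": 2},
]
consensus_score = foreground_log_prob(toy_counts, "ACA")
mismatch_score = foreground_log_prob(toy_counts, "TGA")
```

In this toy example the consensus-like site `ACA` scores higher than `TGA`, and a base never observed in the counts still receives a non-zero probability, reflecting uncertainty arising from the small number of aligned sites.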

If

We assume, under this same model, that

Hence,

Now,

Note that we assume that

This gives us the ability to compute the probability of a given sequence under the alternative hypothesis:

We used a simple first-order Markov chain model, with one parameter for each base,

We will assume that the foreground and background models are complementary. This is an approximation, because sequences might have higher-order interactions not explained by either the foreground or the background model. Making a simplifying assumption here is unavoidable because of the high complexity of these higher-order interactions. For example, polypeptide-coding sequences are considered background, and the distribution of the sequence of bases is determined by the effect of the polypeptide sequence on evolutionary fitness; modelling this would require more knowledge about biological function than is available, and is too complex to include in the background model.

However, the model nevertheless provides a principled approach for correcting for the length of the sequence, and for differences in the frequency of bases or pairs of bases. Hence,

Recall that

In order to combine the foreground and background models, we start with Bayes' theorem:

We assume the foreground and background models are complementary, so

Due to complementarity,

This leaves the prior probability

We note that this combination of foreground and background models is able to represent a number of features to the extent that the information is present in the raw counts matrix. For example, gaps in the sequence correspond to regions in which the foreground is indistinguishable from the background, in which the value of

Our model shares some similarities with the model used in a previous study

One major difference between the two approaches is that Lähdesmäki et al. aim to identify the posterior probability of alignments of one or more motifs in a given promoter region, while BaSeTraM computes the probability that a single motif is found at a given site, and uses this to annotate a sequence with probable sites. Another difference is that BaSeTraM does not take into account uncertainty in the background probabilities (and instead focuses entirely on the uncertainties in frequencies in the foreground model). This approximation can be justified by the large quantity of data available to build the background model (as opposed to the foreground models), and the correspondingly low estimator variance. Using this simpler background model allows BaSeTraM to efficiently use a context-dependent background model.

In addition, Lähdesmäki et al. used a different derivation, by representing all foreground model frequencies at each position using a four-way multinomial distribution across all bases. In this paper we instead use a binomial distribution, where one Bernoulli outcome is that a base at position

We developed an implementation, SBaSeTraM, of the Bayesian search method, BaSeTraM, described above. We also implemented the method described in

In addition, we have created a wrapper, called WrapMAST, around the stand-alone MAST

The addition of

SBaSeTraM, GMATIM, and WrapMAST are written in Haskell, and we have aimed to make the source code of each program a succinct and readable description of the corresponding algorithm. SBaSeTraM, WrapMAST, and GMATIM provide a similar command line interface (and share common code), so as to simplify the design of analyses which compare the algorithms.

Due to the possibility of numerical underflow from very small probabilities, our SBaSeTraM and GMATIM implementations make use of log probabilities (base

It is necessary for SBaSeTraM to compute the posterior probability,

The

Note that equation 18 is a log-transformed equivalent of equation 8, and similarly, equation 19 is a log-transformed equivalent of equation 7.

We compute the vector

For each site, we compute the log-posterior probability and test it against a cut-off (as discussed below) to decide whether the TFBS occurs at that site. We search for sites, both on the sequences provided, and on the reverse complement of those sequences.
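The scanning step described above can be sketched in Python as follows. This is a minimal illustration, not the SBaSeTraM code: the function names, the toy foreground and background scorers, the prior of 0.01, the cut-off of 0.2, and the use of natural logarithms are all assumptions. The posterior is obtained from Bayes' theorem with two complementary hypotheses, evaluated in log space to avoid underflow.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def log_posterior(log_fg, log_bg, prior):
    """log P(TFBS | site) for two complementary hypotheses (natural logs)."""
    a = math.log(prior) + log_fg        # log P(site | TFBS) P(TFBS)
    b = math.log1p(-prior) + log_bg     # log P(site | no TFBS) P(no TFBS)
    m = max(a, b)                       # log-sum-exp trick for the evidence
    return a - (m + math.log(math.exp(a - m) + math.exp(b - m)))


def scan(seq, log_fg_fn, log_bg_fn, width, prior, log_cutoff):
    """Return (strand, offset, site, log posterior) for every site passing the cut-off.

    Offsets on the "-" strand index into the reverse complement of `seq`.
    """
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            site = s[i:i + width]
            lp = log_posterior(log_fg_fn(site), log_bg_fn(site), prior)
            if lp >= log_cutoff:
                hits.append((strand, i, site, lp))
    return hits


# Hypothetical scorers: a foreground favouring "ACG" and a uniform background.
def toy_log_fg(site, motif="ACG"):
    return sum(math.log(0.9 if s == m else 0.1 / 3) for s, m in zip(site, motif))


def toy_log_bg(site):
    return len(site) * math.log(0.25)


hits = scan("TTACGGTT", toy_log_fg, toy_log_bg, width=3, prior=0.01,
            log_cutoff=math.log(0.2))
```

With these toy settings, only the exact match on the forward strand passes the cut-off; the two-mismatch-free partial matches on the reverse complement fall well below it because the prior is small.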

We retrieved the online supplement for

Where a matrix used estimated rather than raw counts, as indicated by the occurrence of a decimal point in the ‘frequency’ matrix, that matrix was excluded (as we have assumed that raw counts will be used).

We filtered the set of probes, based on the experimental data, to include only those bound by at least one transcription factor for which we had a corresponding PWM. This left 1259 probes.

We then used each method to search the entire set of probes for TFBSs corresponding to each matrix, across all positions in the probe. Where the method detected the occurrence of a TFBS for a particular TF at any position in a probe, a positive result for that TF-probe combination was recorded. If no TFBSs were found at any position in the probe for a given TF, a negative result was recorded. These results were then compared against the ‘gold standard’ experimental data. Only TFs which had corresponding matrices in TSM, and were also in the experimental results, were included.

We classified each included TF-probe pair into 4 categories:

True Positive (TP) - positive prediction, and experimental determination of TF-probe interaction;

False Positive (FP) - positive prediction, but no experimental determination of TF-probe interaction;

True Negative (TN) - negative prediction, and no experimental determination of TF-probe interaction;

False Negative (FN) - negative prediction, but experimental determination of TF-probe interaction.
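The four categories above, and the true and false positive rates derived from them, can be computed as in the following Python sketch. The TF and probe names and the toy prediction and binding sets are hypothetical placeholders, not data from the study.

```python
from collections import Counter
from itertools import product


def classify(predicted, observed):
    """Assign one TF-probe pair to TP, FP, TN or FN."""
    if predicted and observed:
        return "TP"
    if predicted:
        return "FP"
    if observed:
        return "FN"
    return "TN"


def confusion_counts(tfs, probes, predicted_pairs, bound_pairs):
    """Count categories over every TF-probe combination."""
    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for pair in product(tfs, probes):
        counts[classify(pair in predicted_pairs, pair in bound_pairs)] += 1
    return counts


def rates(counts):
    """Return (true positive rate, false positive rate)."""
    tpr = counts["TP"] / (counts["TP"] + counts["FN"])
    fpr = counts["FP"] / (counts["FP"] + counts["TN"])
    return tpr, fpr


# Hypothetical toy data: 2 TFs x 3 probes.
tfs = ["TF_A", "TF_B"]
probes = ["probe1", "probe2", "probe3"]
predicted = {("TF_A", "probe1"), ("TF_A", "probe2"), ("TF_B", "probe3")}
bound = {("TF_A", "probe1"), ("TF_B", "probe2"), ("TF_B", "probe3")}

counts = confusion_counts(tfs, probes, predicted, bound)
tpr, fpr = rates(counts)
```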

In this paper we have used

There were 38 different transcription factors with matrices in the TRANSFAC Saccharomyces Module (TSM), of which 32 had matrices made up of raw counts. Of these, 16 were also found in the ChIP-chip dataset. These were tested against the 1259 different probes in the chromatin immunoprecipitation experiment. This gives 16 × 1259 = 20144 different TF-probe pairs where we can classify whether the TF binds to the probe, and then check the classification. These results are shown in

For SBaSeTraM, the posterior cut-off was varied to obtain a series of points. For MAST, the p-value cut-off was varied. For GMATIM, the parameters listed in the MATCH™ paper were used to generate the point on the curve.
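Varying a cut-off to trace out a series of ROC points can be sketched as below. Because a TF-probe pair is called positive when any site passes the cut-off, each pair can be summarised by its best per-site score. The function name `roc_points` and the toy scores and labels are assumptions for illustration only.

```python
def roc_points(best_scores, labels, cutoffs):
    """Return one (false positive rate, true positive rate) point per cut-off.

    best_scores: highest per-site score for each TF-probe pair.
    labels: True where the experiment found the TF bound to the probe.
    """
    positives = sum(1 for bound in labels if bound)
    negatives = len(labels) - positives
    points = []
    for cutoff in cutoffs:
        tp = sum(1 for s, bound in zip(best_scores, labels) if s >= cutoff and bound)
        fp = sum(1 for s, bound in zip(best_scores, labels) if s >= cutoff and not bound)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up scores for five TF-probe pairs, two of which are experimentally bound.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [True, False, True, False, False]
points = roc_points(toy_scores, toy_labels, cutoffs=[0.0, 0.35, 0.85, 1.0])
```

A method with fixed parameters, by contrast, contributes only a single point to the same plot.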

We generated a ROC curve (

The point on the ROC curve generated using the parameters from

SBaSeTraM outperforms MAST when used through WrapMAST. It is worth noting that MAST is not typically used with TRANSFAC PWMs, and usually, multiple PWMs are used for each TF, and so the results cannot be used to make inferences about how well MAST performs together with MEME. The results do, however, illustrate the benefit of methods which take into account uncertainty in the foreground model.

We also carried out an analysis to see whether any particular TFs were making a large contribution to the overall prediction accuracy at this point.

The results are shown with the overall False Positive Rate for SBaSeTraM matched to that obtained from GMATIM with the parameters in the MATCH™ paper, namely 53.3%. Arrows run from the point obtained using SBaSeTraM to the point obtained using GMATIM.

We also analysed the spread of true and false positive rates for each method.

The results are shown with the overall False Positive Rate for SBaSeTraM matched to that obtained from GMATIM with the parameters in the MATCH™ paper, namely 53.3%.

In addition, we used the bisection method to find a separate posterior probability cutoff for each of the 16 TFs that gave the SBaSeTraM method a FPR (for that TF) close to the FPR obtained with GMATIM. We allowed the method to terminate when a cutoff was found that brought the
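A bisection search for a cut-off that matches a target false positive rate can be sketched as follows. It relies on the FPR being non-increasing as the cut-off rises. The function names, the `tolerance` and `max_iter` parameters, and the toy scores are hypothetical; the termination criterion actually used is not reproduced here.

```python
def fpr_at(cutoff, negative_scores):
    """Fraction of experimentally unbound pairs still called positive."""
    return sum(1 for s in negative_scores if s >= cutoff) / len(negative_scores)


def bisect_cutoff(negative_scores, target_fpr, tolerance, max_iter=60):
    """Search [0, 1] for a posterior cut-off whose FPR is near `target_fpr`."""
    lo, hi = 0.0, 1.0
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        fpr = fpr_at(mid, negative_scores)
        if abs(fpr - target_fpr) <= tolerance:
            return mid
        if fpr > target_fpr:
            lo = mid  # too many false positives: raise the cut-off
        else:
            hi = mid  # too few false positives: lower the cut-off
    # The FPR is a step function, so an exact match may not exist.
    return (lo + hi) / 2


# Made-up best-site posteriors for ten unbound TF-probe pairs of one TF.
toy_negatives = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
cutoff = bisect_cutoff(toy_negatives, target_fpr=0.3, tolerance=0.01)
```

Since the empirical FPR changes in steps, the fallback return value after `max_iter` iterations is only the closest cut-off the search could bracket.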

Using the same methodology used on the entire dataset (as discussed above), we tested for a statistically significant difference in proportion of predictions which were correct for each transcription factor, between GMATIM and SBaSeTraM (with the posterior probability cutoffs discussed in the previous paragraph). We obtained only one result where the p-value was less than

We have developed a Bayesian classifier for identifying TFBSs, which performs comparably to an existing algorithm, but which has a more principled statistical explanation, so that the trade-off between sensitivity and selectivity can be trivially adjusted, and the method can be altered to use different background models.

It is clear that the two methods are very similar in overall performance, and there is insufficient data in TSM to tell the two apart. The 95% confidence interval for the difference of the proportion correctly classified above runs from SBaSeTraM being 1.03% better, to GMATIM being 0.93% better. We therefore conclude that until there is more evidence that one method is better, from a performance standpoint, the two methods can be used interchangeably.

However, the fact that the statistical interpretation of BaSeTraM has been explained in rigorous terms, combined with the ease with which the posterior probability cut-off can be adjusted (as opposed to needing to adjust two separate parameters and re-run the analysis) makes the use of BaSeTraM preferable for many applications.

We note that despite the similarity in accuracy, the predictions made are not all the same; only 62.8% of all predictions of transcription factor binding made by SBaSeTraM with this posterior probability cut-off were also made by GMATIM.

The BaSeTraM statistical model takes the background model as an explicit, replaceable component. While a relatively uninformative background model is useful with the synthetic probes used in ChIP-chip analyses, using a different background model is likely to be important on genome-scale data, where there are localised variations in base frequencies.

When dealing with genomic scale data, it is also important that computation is reasonably efficient. It is also preferable that this computation can occur on modest hardware, so it is usable by groups without access to high-performance computing infrastructure.

In order to achieve these goals, we also developed a C++ implementation of BaSeTraM, called CBaSeTraM, which we optimised for the AMD64 architecture. We used Callgrind

GMATIM, SBaSeTraM, and CBaSeTraM, as well as the programs used to test the methods, are Free/Open Source software. Instructions for building these programs are included as an online supplement.

The authors would like to thank the two anonymous reviewers for providing helpful feedback on this manuscript.