The authors have declared that no competing interests exist.

Conceived and designed the experiments: AM WWW. Performed the experiments: AM. Analyzed the data: AM WWW. Wrote the paper: AM WWW. Implemented the software package: AM.

Finding where transcription factors (TFs) bind to DNA is key to deciphering gene regulation at the transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs), which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction, and they do not account for flexible length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible and can model both position interdependence within TFBSs and variable length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that conveys properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence.
These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.

Transcription factors are critical proteins for sequence-specific control of transcriptional regulation. Finding where these proteins bind to DNA is of key importance for global efforts to decipher the complex mechanisms of gene regulation. Greater understanding of the regulation of transcription promises to improve human genetic analysis by specifying critical gene components that have eluded investigators. Classically, computational prediction of transcription factor binding sites (TFBSs) is based on models giving weights to each nucleotide at each position. We introduce a novel statistical model for the prediction of TFBSs that tolerates a broader range of TFBS configurations than can be conveniently accommodated by existing methods. The new models are designed to address the confounding properties of nucleotide composition, inter-positional sequence dependence and variable lengths (e.g. variable spacing between half-sites) observed in the more comprehensive experimental data now emerging. The new models generate scores consistent with DNA-protein affinities measured experimentally and can be represented graphically, retaining desirable attributes of past methods. These results demonstrate the capacity of the new approach to accurately assess DNA-protein interactions. With the rich experimental data generated from chromatin immunoprecipitation experiments, a greater diversity of TFBS properties has emerged that can now be accommodated within a single predictive approach.

Transcription factors (TFs) and their specific binding sites act to modulate the rate of gene transcription. They are central to key biological processes, such as organ development and tissue differentiation, nutrient and environmental stress responses and physiological signals. Delineating specific positions at which TFs bind to DNA is of high importance in deciphering gene regulation at the transcriptional level. Each TF binds a variety of DNA sites with sequence-specific affinity

Classically, computational prediction of TFBSs is based on models called position weight matrices (PWMs) that reflect the preferred binding motifs associated with the corresponding TFs by providing an additive score for any sequence. They approximate the true specificity of a TF and their parameters can be estimated through different methods (see

Moreover, basic PWMs are restricted to the detection of motifs with a fixed length. This constraint has previously led to alternative heuristic approaches for the modeling of TFBS for TFs tolerant of variable widths, such as nuclear receptors

Several efforts have explored more flexible models for the prediction of TFBSs. Bayesian hierarchical hidden Markov models (HMMs) have been used to model

Recently, a new experimental technique has been developed to study sequences where proteins interact with DNA. This procedure is a combination of chromatin immunoprecipitation and massively parallel sequencing technologies - the well-known ChIP-seq procedure

We introduce here a novel TFBS model and prediction system based on HMMs, hereafter referred to as TF Flexible Model (TFFM). Building upon earlier models that capture dinucleotide dependencies and flexible lengths (see

We present a new HMM-based framework to model and predict TFBSs. HMMs have been extensively used in computational biology to model DNA sequences

Recent advances in the prediction of TFBSs have incorporated inter-positional properties through the analysis of dinucleotide properties across the sites. To construct models capturing the dinucleotide compositional properties of TFBSs, we implemented two HMM-based approaches.

Initially, we constructed standard first-order HMMs as TFFMs (denoted later as

(A) 1st-order HMM schema used in 1st-order TFFMs, where the first state represents the background and the following states represent the consecutive positions within a TFBS. Each state emits a nucleotide with a probability dependent on the nucleotide emitted previously. (B) HMM schema used in detailed TFFMs, where each state in the 1st-order HMM is decomposed into four states (one per nucleotide). Transition probabilities reflect the emission probabilities of the 1st-order HMM. This decomposition allows the probability of starting a TFBS to depend on the nucleotide emitted by the background states.
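The decomposition described in panel (B) can be illustrated with a minimal sketch. It assumes that the emission probabilities of one 1st-order state are stored as a nested dictionary indexed by the previous and the current nucleotide; the function name `detailed_transitions` and the numeric values are hypothetical and are not taken from the TFFM-framework.

```python
NUCLEOTIDES = "ACGT"


def detailed_transitions(first_order_emissions):
    """Turn the emissions of one 1st-order state into detailed-HMM transitions.

    first_order_emissions[prev][cur] is the probability that the 1st-order
    state emits `cur` when `prev` was emitted just before. In the detailed
    HMM, each state emits a single nucleotide with probability 1, so the same
    value becomes the transition probability from the state emitting `prev`
    to the state emitting `cur` at the next position.
    """
    transitions = {}
    for prev in NUCLEOTIDES:
        row = first_order_emissions[prev]
        total = sum(row[cur] for cur in NUCLEOTIDES)
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"emissions after {prev!r} sum to {total}, not 1")
        for cur in NUCLEOTIDES:
            transitions[(prev, cur)] = row[cur]
    return transitions


# Hypothetical emission table for one motif position (illustrative values).
example = {
    "A": {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.05, "C": 0.05, "G": 0.1, "T": 0.8},
    "T": {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
}
transitions = detailed_transitions(example)
```

Each 1st-order state therefore yields 16 transitions, one per ordered pair of nucleotides, and the four outgoing transitions of any detailed state sum to 1.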

In 1st-order TFFMs, starting a TFBS is given by a unique probability (representing the transition from the background to the TFBS) whatever the nucleotide found in the surrounding sequence. To allow for starting a TFBS depending on the nucleotide emitted in the background state, we implemented a more detailed and descriptive HMM template as TFFMs (denoted later as

By constructing models taking into consideration local dinucleotide dependencies, we aim to better model, characterize, and understand TFBS properties. When trying to analyze and understand a model, a visual representation provides insight into the underlying properties. Basic PWMs for instance can be graphically represented using sequence logos

(A) Graphical representation of a TFFM constructed for the Hnf4A TF. Each column corresponds to a position within a TFBS. Each row captures the probability of each nucleotide appearing, depending on the nucleotide found at the previous position. The opacity of a cell represents the probability of reaching this cell given the probability of appearance of the corresponding nucleotide at the previous position (the higher the opacity, the higher the probability). (B) The summary logo compacts all the information to summarize the dense logo in (A). (C) Zooming in on the dense TFFM logo for positions 10 to 13 (corresponding to the box in (A)). We observe that a “C” is more likely to appear at position 12 if nucleotide “T” was found at position 11, whereas a “T” is more likely to appear at position 12 if nucleotide “G” was found at position 11.

We provide the TFFM-framework to construct TFFMs from ChIP-seq data sets and to predict TFBSs within DNA sequences. When constructing a TFFM from ChIP-seq data, we extract (using MEME

All ChIP-seq ENCODE data sets from human and mouse (with at least 1800 peaks and a peak max position indicated, i.e. 206 data sets) were used to compare the two types of TFFMs with PWMs and DWMs. Sequences around the peak max positions (50 nucleotides on both sides) were extracted to construct the models and make predictions. The rationale for this is that ChIP-seq peak max positions represent where the maximum amount of ChIP-seq reads map on the genome of reference and TFBSs are expected to be strongly enriched in close proximity to the peak max position

For each ENCODE ChIP-seq data set, the area under the curve (AUC) of the corresponding ROC curve was computed for every predictive method. To compare the predictive powers of the different methods, we focus on ChIP-seq data sets for which at least one predictive method achieves an

For the 96 ChIP-seq data sets obtaining an

The statistical significance of the differences in discriminative power between the different methods was computed for each pair of methods using a Wilcoxon signed rank test

1st-order TFFM | detailed TFFM | DWM | |

- | - | ||

- | |||

The table contains the Benjamini-Hochberg corrected

To understand whether the TFFMs perform better than the PWMs because of the model or because of the training method (as both differ), we introduce a 0-order TFFM which is basically a PFM modeled by an HMM and trained using the Baum-Welch algorithm (see

For the 96 ChIP-seq data sets used in

This analysis shows that the TFFMs perform better than PWMs and DWMs more often, with a statistically significant difference, and leads us to hypothesize that the TFFMs are, overall, better at capturing TFBS features found in the experimental data. To further evaluate this property, we analyzed how TFFM scoring correlates with the biological signal found in ChIP-seq data.

An attractive feature of PWMs is that they can produce scores that are correlated with the energetic binding affinity between a protein and a DNA sequence

(A) ChIP-seq signal values obtained from ENCODE data sets were compared to prediction values obtained with the four different predictive methods. The distribution of Spearman's correlation values from all data sets are given for 1st-order TFFMs, detailed TFFMs, PWMs, and DWMs. An over-representation of Spearman's correlations around 1 (perfect correlation) is found for the four methods. (B) Pearson correlation between scores obtained using the different predictive methods and DNA-binding affinities from

In the previous section, we hypothesized that the signal values from ENCODE ChIP-seq peaks reflected the affinity of the TF protein to bind to DNA sequences. In

For each predictive model, we computed the correlation between the predicted scores and the DNA-binding affinity values measured experimentally. Since some mutated sequences can no longer be bound by the Max TF (or with very weak affinity), it is interesting to focus on the sequences to which the TF can actually bind. Hence, we analyzed the correlation between predicted and experimentally measured DNA-binding affinity values by first focusing on the sequences lying in the top 10-percentile affinity values, then the top 20-percentile, and up to including all the sequences using 10-percentile steps. The results of higher interest, corresponding to stronger DNA-binding affinity values, are the top percentiles but all percentiles were computed for completeness. Using such a methodology, we expect the predicted scores obtained from the models to better correlate with high DNA-binding affinity values than with low values.
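The cumulative percentile analysis described above can be sketched as follows. This is an illustrative outline only: it assumes that measured affinities and predicted scores are available as two parallel lists, and the function names `pearson` and `correlation_by_top_percentile` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant data
    return cov / math.sqrt(var_x * var_y)


def correlation_by_top_percentile(affinities, scores, step=10):
    """Correlate scores with affinities on the top 10%, 20%, ..., 100%.

    Sequences are ranked by decreasing measured affinity; each subset
    contains all sequences of the previous one (cumulative percentiles).
    """
    pairs = sorted(zip(affinities, scores), key=lambda p: p[0], reverse=True)
    results = {}
    for pct in range(step, 101, step):
        # Keep at least two points so that a correlation can be computed.
        count = max(2, math.ceil(len(pairs) * pct / 100))
        subset = pairs[:count]
        results[pct] = pearson([a for a, _ in subset], [s for _, s in subset])
    return results
```

Calling `correlation_by_top_percentile(affinities, scores)` returns one coefficient per percentile threshold, so the top percentiles (strongest binders) can be inspected separately from the full set.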

To understand what characteristic(s) the TFFMs are capturing that is not represented by either the PWMs or the DWMs, we examined the DNA sequences obtaining the highest DNA-binding affinity values. We looked at the motifs for which the DNA-binding affinities are the highest by considering the top-scored 25, 50, 75, and 100 sequences (see

In

Isoform | Method | Pearson correlation coefficient |

0.65 | ||

0.66 | ||

0.60 | ||

0.61 | ||

The table contains the Pearson correlation coefficients computed between the experimentally measured and predicted changes in DNA-binding affinity between the optimal sequence and mutated ones. Correlations have been computed for both Max TF isoforms and all the predictive methods. Associated

In the previous sections, the TFFMs were used to model TFBSs with fixed length by taking into consideration the dinucleotide composition of the sequences. Another feature of the TFBSs that can be accommodated by TFFMs is flexible length.

A subset of TFs bind to the DNA with different structural conformations, leading to TFBSs of different lengths

The first TF analyzed is JunD which has been previously shown to bind to motifs of flexible length using protein binding microarrays in mouse

TFFMs allowing a flexible length motif have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs. Flexible TFFMs outperform the other models: their ROC curve lies above the ROC curves of all the other models.

STAT TFs bind with a flexible length motif

TFFMs allowing a flexible length motif have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs on STAT4 (A) and STAT6 (B) ChIP-seq data. Flexible TFFMs do not perform significantly better than fixed-length TFFMs. DWMs, PWMs, and GLAM2 have a lower discriminative power than the TFFMs.

In the previous examples of flexible length binding motifs, we focused on motifs where a spacer was found between the two halves of a core motif. One can also consider motifs with a flexible edge flanking a core motif. When analyzing MEME output for ENCODE ChIP-seq data, we observed that a MafK data set showed a weak motif on its edge separated by a 1 nt spacer from the core motif (see

TFFMs allowing a motif with a flexible edge have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs. Flexible TFFMs perform slightly better than fixed-length TFFMs and both outperform the other models.

In the previous sections, TFFMs have been used to predict specific TFBS positions. We extended the TFFM-framework to compute an integrated TF occupancy score across a DNA sequence using the TFFM scores. Using the TFFMs, the probability of occupancy (Pocc) of a TF within a defined DNA sequence is obtained by multiplying the TFBS probabilities at each position (see Material and Methods section for details). This is a simpler approach than the physico-chemical models used in tools like GOMER

In this report, we have introduced a flexible HMM-based framework for TFBS prediction. The new models are demonstrated to perform as well as classic methods for most data, while exhibiting improved performance for a subset of TFs. The new approach retains the desirable attribute of producing scores correlated with the binding energy of TF-DNA interactions. A new graphical representation is introduced to illustrate the properties of the models, complementing the classic and widely used sequence logos. In applications, TFFMs have been shown to handle variable spacing between half sites, and to allow for the incorporation of flanking sequence properties into TFBS analysis. With a convenient software package and a breadth of opportunities for improvement, TFFMs are a suitable foundation for the next generation of TFBS prediction.

The new TFFM-framework provides an opportunity for researchers to analyze more deeply the features of TF-DNA binding interaction by looking at local dinucleotide dependencies captured by the TFFMs and represented by the new logos. One can see the TFFMs as the probabilistic analog of the energetic BEEML models developed in

For TFFMs, the greatest utility is in handling the growing subset of TFs with complex binding properties. Such complex binding characteristics of TFs may be decomposed into four categories

The TFFM-framework creates new opportunities for innovation in TFBS bioinformatics analysis. Drawing from the initial studies here, it is apparent that refined approaches can be pursued for the identification of TFs capable of binding to motifs of variable width and the analysis of the role of TFBS flanking sequence on TF binding. While the number of cases of TFs tolerant of variable width binding sites has grown with access to high-throughput TFBS data, the TFFM-framework could be extended to enable a comprehensive survey of ChIP-seq data collections to identify additional cases. As observed in the analysis of MafK TFBS flanking sequences, TFFMs are sufficiently flexible to incorporate additional information represented in TFBS proximal sequences. There have been some indications that such sequences may specify interactions with co-factors

Beyond the analysis of non-canonical TF binding motifs, there is a significant scientific opportunity to develop a new computational approach for the prediction of functionally significant DNA variations within

A key to the long-term development and adoption of TFFMs is the access of researchers to both the binding models and the software for their generation. It is our plan to generate a collection of TFFMs trained on ChIP-seq data sets from ENCODE, as well as other sources compiled into the PAZAR repository

The new TFFMs described in this report are designed to address the confounding properties of position inter-dependencies in site composition and variable lengths observed in experimental data. These two challenges have become increasingly pressing with the availability of large-scale ChIP-seq data, which reveal greater complexity of TFBSs than could be observed in the past. The TFFM graphical motif representation conveys properties of position inter-dependence, allowing researchers to visually analyze the features captured by the model. TFFMs have been assessed on human and mouse ChIP-seq data sets from ENCODE, revealing a higher discriminative power than established methods. TFFMs produce scores consistent with protein-DNA affinities measured experimentally and have the capacity to predict the impact of TF binding site mutations on TF-DNA binding affinities.

The analysis of TFBSs is a central challenge in bioinformatics. TFFMs provide a powerful and flexible framework within which a broad range of problems can be addressed. While many motif discrimination methods are available, it is our perception that TFFMs will emerge as a preferred approach for TFBS analysis.

Comparisons between the different predictive methods were done using ChIP-seq data sets from the ENCODE project

For assessing the performance of TFFMs allowing for flexible length motifs, we used the following ChIP-seq data sets: human ENCODE JunD TF from K562 cells by the University of Chicago, mouse ENCODE MafK TF from Ch12 cells by Stanford University, and STAT4 and STAT6 TFs from

1st-order HMMs used in 1st-order TFFMs are composed of a state modeling the background sequences surrounding TFBSs and one state per position

HMMs used in detailed TFFMs decompose each state of the 1st-order HMM with four corresponding states in the detailed HMM, each one emitting a nucleotide (A, C, G, or T) with a probability equal to 1 (see

HMMs used in 0-order TFFMs are constructed with the same set of states as the ones used for the HMMs of the 1st-order TFFMs. The emission probabilities differ since no dependency between positions is captured. Hence, each state is associated with only four emission probabilities, one for each of the four nucleotides (see

The TFFMs provide, at each position

The predictive powers of the different models were compared using a 10-fold cross-validation methodology on human and mouse ChIP-seq ENCODE data sets.
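For readers unfamiliar with the procedure, a generic 10-fold split can be sketched as below. This is not the partitioning code of the TFFM-framework: the shuffling, the seed, and the name `k_fold_splits` are assumptions made for illustration.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (training, test) lists for a k-fold cross-validation.

    Items are shuffled once with a fixed seed (an assumption made here for
    reproducibility) and dealt into k folds; each fold serves as the test
    set exactly once while the remaining k - 1 folds form the training set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical usage with peak identifiers standing in for peak sequences.
peaks = [f"peak_{i}" for i in range(25)]
splits = list(k_fold_splits(peaks))
```

Every item appears in exactly one test set, so each peak receives a score from a model that was not trained on it.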

Given a ChIP-seq data set

Ten background data sets (

Finally, we generate 10 additional background data sets (

When predicting TFBSs using a TFFM, input sequences are scored at every position on both strands. The corresponding scores are the posterior probabilities of being at the final state of the underlying HMM, computed by the forward and backward algorithms

Data sets described in the previous section were used to initialize, train, and test the predictive methods through a cross-validation methodology. The procedure was as follows:

Apply MEME

Use the motif to initialize the 0-order, 1st-order, and detailed TFFMs (emission probabilities, for the 0-order and 1st-order TFFMs, and transition probabilities, for the detailed TFFM, are derived from nucleotide frequencies at each position of the motif).

Train TFFMs on the 10 training sets from

Apply the TFFMs on matching

Compute the corresponding ROC curve.

Apply MAST

Construct corresponding PWMs (i.e. log-odds weight matrices derived from the PFMs

Apply the PWMs and the DWMs on

Compute the corresponding ROC curves for each method.
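The AUC of a ROC curve obtained in the steps above can be computed directly from the scores, since it equals the probability that a randomly chosen ChIP-seq sequence scores higher than a randomly chosen background sequence (ties counting for one half). The sketch below uses this rank-based identity rather than tracing the curve; the function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve from best-hit scores of two sequence sets.

    positive_scores: one score per ChIP-seq (foreground) sequence.
    negative_scores: one score per background sequence.
    The pairwise comparison is quadratic in the number of sequences, which
    is acceptable for a sketch but not for very large data sets.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Illustrative scores only; they are not taken from the paper.
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

A value of 0.5 corresponds to a method that cannot separate ChIP-seq from background sequences, and a value of 1 to perfect separation.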

For each ChIP-seq data set and predictive model, the best hits from

The statistical significance of a difference between two predictive methods has been assessed through a

To assess the statistical significance of the trend of correlation between model scores and ChIP-seq signal values, we computed the

The summarized features captured by the TFFMs are represented through a sequence logo similar to the ones used for basic PWMs. To construct the sequence logos, the probability of getting each one of the four nucleotides is computed at each position starting from an equiprobability of A, C, G, and T in the background. Let

The classic sequence logos do not give any information about the dinucleotide dependencies captured by the TFFMs. We introduce a new graphical representation of the TFBSs modelled by the TFFMs that is able to capture this feature (see

Given the probabilities of finding each nucleotide at each TFBS position, we compute the information content (IC) of a TFFM by summing the IC of all the positions computed as

DNA-binding affinities between human Max transcription factor (isoforms A and B) and DNA sequences have been obtained experimentally by using the MITOMI method and reported in

By applying the 10-fold cross-validation methodology to the ChIP-seq data sets (note that the 600 best/top peaks are not considered in the 10-fold cross-validation), we obtained a score for each one of the peaks (corresponding to the score of the best hit per peak). The ENCODE data associate a signal value with each one of the peaks. The signal value is a measure of the enrichment for the overall peak region (usually, the average). As the peak scores coming from ENCODE may be unevenly distributed, we computed the median of the distribution of prediction scores for the sequences within each 5-percentile of the peak scores. Hence, each peak score 5-percentile is associated with a predictive score corresponding to the median of the prediction scores within that percentile. A Spearman's rank correlation coefficient has been used to compare prediction scores and ChIP-seq peak scores for these latter data using the

We compared experimental DNA-binding affinities from

The detailed methodology used is as follows:

Construct all 12 nt-long sequences conserving nucleotides GTG at positions 9, 10, and 11 (a nucleotide A is used at the very beginning for the 1st-order TFFM with no impact on the scores).

Construct TFFMs, a PWM (i.e. a log-odds weight matrix), and a DWM, each initialized from the top 600 peaks and trained on the whole ENCODE human Max K562 ChIP-seq data set.

Compute the prediction scores on all the sequences using the different methods (only one strand is used for the score computations here).
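The first step of this methodology, enumerating all sequences around a fixed GTG core, can be sketched as follows. The sketch assumes 1-based positions, so that GTG occupies positions 9 to 11 of a 12 nt sequence and the nine remaining positions are free; the optional leading A mentioned for the 1st-order TFFM is not added. The function name `sequences_with_fixed_core` is hypothetical.

```python
from itertools import islice, product


def sequences_with_fixed_core(length=12, core="GTG", core_start=9):
    """Yield every DNA sequence of `length` nt carrying `core` at `core_start`.

    `core_start` is 1-based. All positions outside the core take each of
    the four nucleotides in turn, so 4 ** (length - len(core)) sequences
    are produced lazily, without holding them all in memory.
    """
    start = core_start - 1
    if start < 0 or start + len(core) > length:
        raise ValueError("core does not fit inside the sequence")
    free_positions = length - len(core)
    for combo in product("ACGT", repeat=free_positions):
        letters = "".join(combo)
        yield letters[:start] + core + letters[start:]


# Peek at the first few sequences without generating the full set.
first_three = list(islice(sequences_with_fixed_core(), 3))
```

With the default arguments the generator produces 4^9 = 262,144 sequences, each of which can then be passed to the scoring step.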

Score associations between experimental values and results obtained using the above methodology are given in

Starting from the optimal binding site containing CACGTG, we also compared the changes in prediction scores between a mutated site and the optimal sequence with the changes using the experimental DNA-binding affinities for the same sequences. A Pearson correlation coefficient between the two scores is then computed using the function

The first over-represented motif found by MEME in the JunD data set used for the flexible motif analysis is 14 nt-long with a G/C (G or C) at the centre of the core motif (see position 9 in

To construct a flexible length TFFM modelling STAT4 TFBSs, a gap position has been manually added to the original initialized models between position 5 and 6 (see

To construct a flexible length TFFM modelling STAT6 TFBSs, position 5 from the original motif (see

Using the original initialized TFFM modelling MafK TFBSs, we allowed the motif to end at position 14 in order to model a flexible edge (see

GLAM2

The TFFM-framework is available at


We thank David J. Arenillas for implementing the web-based application and for helpful comments and Rebecca Worsley-Hunt for the retrieval of the ENCODE data sets, providing the scripts and data sets to construct genomic backgrounds, and helpful comments. We thank Hugues Richard for helpful discussions. We thank the four referees for helpful comments and suggestions. We thank Miroslav Hatas for systems support and Dora Pak for management support.