
The authors have declared that no competing interests exist.

The rapidly emerging field of systems biology is helping us to understand the molecular determinants of phenotype on a genomic scale

(A) A schematic of transcriptional regulation is shown. Motifs 1, 2, and 3 are bound by their respective TFs and thus are active, while motif 4 is not. Furthermore, TFs 1 and 2 are shown to be interacting. (B) Box plots of the logarithm of the expression ratio (E_{g}/E_{gC}) of genes containing the MCB element ACGCGT (marked as >0, group 1) and genes that do not contain the element (marked as 0, group 2) are shown for the alpha arrest experiment. E_{g}/E_{gC} is the expression of the gene relative to its average across all time points. At 21 min (G1/S phase), there is a statistically significant difference: log_{2}(E_{g}/E_{gC}) of these two groups is 0.27 and −0.02, respectively. At 35 min (G2/M phase), there is no such association (log_{2}(E_{g}/E_{gC}) = 0.04 vs 0.01). This type of approach is elucidated in detail in _{2}(expression ratio) (

A vast amount of work over the past decade has shown that omics data can be used to learn

A set of approaches based on regression has been developed to overcome the above limitations

A regression method is essentially a curve-fitting approach. When there is one observed variable (

Let us consider the case of a single

A regression approach is a generalized version of the method described above. Here, the data is not binary any more. Instead, we plot the actual motif counts against the mRNA levels for all genes genome-wide (

The best fit shown in _{g}_{g}
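As a minimal sketch of such a single-motif fit, the following Python snippet regresses a log expression ratio on motif counts by ordinary least squares. The function name, variable names, and toy numbers are illustrative assumptions and are not taken from any published tool.

```python
import numpy as np

def fit_single_motif(counts, log_ratios):
    """Least-squares fit of log_ratio ~ intercept + slope * motif_count."""
    counts = np.asarray(counts, dtype=float)
    log_ratios = np.asarray(log_ratios, dtype=float)
    # Design matrix: a column of ones (intercept) and the motif counts.
    X = np.column_stack([np.ones_like(counts), counts])
    coef, *_ = np.linalg.lstsq(X, log_ratios, rcond=None)
    intercept, slope = coef
    return intercept, slope

# Toy data (made up): motif counts and log2 expression ratios for six genes.
counts = [0, 0, 1, 1, 2, 3]
log_ratios = [0.0, 0.1, 0.3, 0.2, 0.5, 0.8]
intercept, slope = fit_single_motif(counts, log_ratios)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

A positive slope would suggest that genes with more copies of the motif tend to be more highly expressed under the condition studied; a slope near zero would suggest that the motif is not active.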

Under any specific condition, multiple _{12} = 0). This involves two steps: (a) feature selection, i.e., identifying which specific elements are active, and (b) model building, i.e., specifying the regression model involving these elements. These two steps may be executed simultaneously
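The two steps can be sketched together with a simple greedy forward selection: at each round, the candidate motif that most improves the least-squares fit is added, and the search stops when the improvement becomes negligible. This is one generic scheme chosen for illustration; it is not claimed to be the algorithm of any specific tool, and the function name, the `min_gain` threshold, and the simulated data are assumptions.

```python
import numpy as np

def forward_select(M, y, max_motifs=3, min_gain=0.01):
    """Greedy forward selection over the motif columns of M (genes x motifs)."""
    M = np.asarray(M, dtype=float)
    y = np.asarray(y, dtype=float)
    n_genes, n_motifs = M.shape
    total = np.sum((y - y.mean()) ** 2)
    selected, best_r2 = [], 0.0
    coef = np.array([y.mean()])  # intercept-only model to start with
    while len(selected) < max_motifs:
        candidates = []
        for j in range(n_motifs):
            if j in selected:
                continue
            # Refit the model with the already selected motifs plus motif j.
            X = np.column_stack([np.ones(n_genes), M[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            r2 = 1.0 - np.sum((y - X @ beta) ** 2) / total
            candidates.append((r2, j, beta))
        if not candidates:
            break
        r2, j, beta = max(candidates, key=lambda c: c[0])
        if r2 - best_r2 < min_gain:
            break  # no candidate improves the fit enough
        selected.append(j)
        best_r2, coef = r2, beta
    return selected, coef, best_r2

# Simulated data (assumption): only motifs 1 and 3 influence expression.
rng = np.random.default_rng(0)
M = rng.integers(0, 4, size=(200, 5))
y = 0.5 * M[:, 1] - 0.3 * M[:, 3] + rng.normal(0.0, 0.1, size=200)
selected, coef, r2 = forward_select(M, y)
print(selected, np.round(coef, 2), round(r2, 3))
```

Feature selection and model building happen simultaneously here, because every candidate is evaluated by refitting the full model.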

The steps are shown for constructing a model with linear functions; however, with some small modifications, they are applicable to nonlinear functions as well.

An additional complexity is that functional interactions among TFs are often essential to transcriptional control. The regression coefficients (subscripted 1, 2, 3, and 12, the last for the interaction between motifs 1 and 2) are learnt from the data, again using a least squares fit. An interaction coefficient greater than zero implies a synergistic interaction, while one less than zero implies a competitive interaction.
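One common way to represent such an interaction in a linear model is to add a product term of the two motif counts. The sketch below assumes that form; the coefficient names `a0`, `a1`, `a2`, `a3`, and `a12`, the function name, and the simulated data are hypothetical choices for illustration.

```python
import numpy as np

def fit_with_interaction(n1, n2, n3, y):
    """Least-squares fit of y ~ a0 + a1*n1 + a2*n2 + a3*n3 + a12*n1*n2."""
    n1, n2, n3, y = (np.asarray(v, dtype=float) for v in (n1, n2, n3, y))
    # The last column is the product term that models the pairwise interaction.
    X = np.column_stack([np.ones_like(y), n1, n2, n3, n1 * n2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["a0", "a1", "a2", "a3", "a12"], beta))

# Simulated data (assumption): motifs 1 and 2 act synergistically (a12 = 0.4).
rng = np.random.default_rng(1)
n1, n2, n3 = rng.integers(0, 3, size=(3, 300))
y = 0.2 * n1 + 0.1 * n2 - 0.3 * n3 + 0.4 * n1 * n2 + rng.normal(0.0, 0.05, 300)
fit = fit_with_interaction(n1, n2, n3, y)
kind = "synergistic" if fit["a12"] > 0 else "competitive"
print({k: round(float(v), 2) for k, v in fit.items()}, kind)
```

Setting the interaction coefficient to zero recovers the purely additive model.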

The sequence logo for the PWM of E2F-1, a key transcription factor for regulating the mammalian cell-cycle, is shown (

To use PWMs in regression methods, we would first score each promoter sequence against each PWM. The probabilities of each base at each position are used to compute the scores. These scores are related to the binding affinity of a TF for the DNA sequence
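A minimal sketch of promoter scoring is given below. It assumes one simple scheme: the probability of each window under the PWM, normalized by the probability of the best possible match, with the highest-scoring window taken as the promoter score. Published tools may score differently (for example, with log-odds against a background model); the function name and the toy matrix are made up for illustration.

```python
import numpy as np

BASES = "ACGT"

def best_pwm_score(sequence, pwm):
    """Highest normalized PWM match (between 0 and 1) over all windows.

    pwm has shape (motif_length, 4); columns follow BASES order and each
    row holds the base probabilities at that motif position.
    Only the given strand is scanned.
    """
    pwm = np.asarray(pwm, dtype=float)
    width = pwm.shape[0]
    max_product = np.prod(pwm.max(axis=1))  # probability of the consensus
    index = {b: i for i, b in enumerate(BASES)}
    best = 0.0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in index for b in window):
            continue  # skip windows with ambiguous bases such as N
        product = np.prod([pwm[pos, index[b]] for pos, b in enumerate(window)])
        best = max(best, product / max_product)
    return best

# Toy PWM (made up) with consensus ACG.
pwm = [[0.7, 0.1, 0.1, 0.1],
       [0.1, 0.7, 0.1, 0.1],
       [0.1, 0.1, 0.7, 0.1]]
print(best_pwm_score("TTACGTT", pwm))  # contains the consensus
print(best_pwm_score("TTTTTTT", pwm))  # no good match, yet the score is nonzero
```

The resulting per-promoter scores can then take the place of motif counts as the predictor in the regression.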

Although one can use linear methods with PWM scores

(A) mRNA expression (_{g}

| Software/Publication | Reference | Linear or Nonlinear? | Degenerate or Nondegenerate Motifs? | Identifies Target Genes? | Web Site for Download |
|---|---|---|---|---|---|
| REDUCE | | Linear | Nondegenerate | N | |
| MODEM | | Linear | Weakly degenerate | Y | |
| Pham et al. | | Nonlinear (sigmoidal) | — | Y | NA |
| MARSMotif | | Nonlinear (MARS) | Nondegenerate or weakly degenerate | N | |
| MARSMotif-M | | Nonlinear (Linear spline/MARS) | Degenerate | Y | |
| MotifRegressor | | Linear | Degenerate | N | |
| Keles et al. | | Linear | Nondegenerate | N | Available upon request |
| Motif Expression Decomposition (MED) | | Nonlinear | Degenerate | Y | NA |
| Inferelator | | Nonlinear (LARS/LASSO) | — | Y | |
| RSIR | | Nonlinear (SIR) | Degenerate | N | Available upon request |
| MatrixREDUCE | | Linear | Degenerate | N | |
| TRANSMODIS | | Linear | Degenerate | Y | |
| Segal et al. | | Nonlinear (sigmoidal) | Degenerate | Y | NA |
| Prego | | Nonparametric | Degenerate | Y | |
| MA-Networker | | Linear | — | Y | |
| fREDUCE | | Linear | Degenerate | N | |
| SCAD | | Nonlinear | Degenerate | N | NA |

The tools marked with an asterisk were not originally used with

NA indicates not available (we did not find this reported in the original paper or via Web search).

In a regression method, the input is a candidate motif. Once we have identified the active motif, we therefore have the additional task of determining which genes are targets of the cognate TF. In contrast to coexpression-based approaches, where we assume that groups of co-expressed genes are co-regulated, regression methods infer co-regulation of genes a posteriori. In the case of DNA words, it may seem that all promoters containing an instance of the word will always be bound by its partner TF. However, such a word may represent only the core of the motif. Thus, to discriminate the true targets, additional sequence information flanking the core motif may be essential.

The challenge with PWM scores is that they are generally continuous and nonzero (on a scale from zero to one, zero indicating that the motif is absent). Thus, most promoters contain a low-scoring instance of each PWM. This is especially true for motifs of high degeneracy, as in humans.

A popular metric to assess the quality of a regression model is how much of the variation in the expression data it can explain. This is parameterized as R^{2}, sometimes referred to as the percent reduction in variance, which compares the variance of the original expression data with the residual variance that remains after the fit.
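The metric can be computed in a few lines. The sketch below uses the standard definition, one minus the ratio of the residual to the original sum of squares; the function and variable names are illustrative assumptions.

```python
import numpy as np

def percent_reduction_in_variance(y, y_pred):
    """R^2 = 1 - (residual sum of squares) / (original sum of squares)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_original = np.sum((y - y.mean()) ** 2)  # variation in the raw data
    ss_residual = np.sum((y - y_pred) ** 2)    # variation left after the fit
    return 1.0 - ss_residual / ss_original

# Toy example (made up): observed log ratios and two sets of predictions.
y = [1.0, 2.0, 3.0, 4.0]
r2_perfect = percent_reduction_in_variance(y, y)
r2_mean_only = percent_reduction_in_variance(y, [2.5, 2.5, 2.5, 2.5])
print(r2_perfect, r2_mean_only)
```

A model that predicts every gene perfectly gives a value of one, while a model that predicts only the mean expression for every gene gives zero.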

A large number of studies have shown that the motifs identified by regression methods are indeed functional motifs. The organisms where these methods have been applied include yeast

In this tutorial, we have focused on transcriptional regulation. However, regression methods may also be applied to other stages of gene regulation that are mediated by

We have summarized the currently available software based on regression along with their key features in

In this tutorial, we have described the basic aspects of regression methods. These are complementary to alternative approaches for motif discovery, such as comparative genomics

We thank Sam Ng for a careful reading of the manuscript.

During the preparation of this manuscript, a new regression approach based on the Fast Orthogonal Search (FOS) method