The authors have declared that no competing interests exist.

Unwanted variation can be highly problematic and so its detection is often crucial. Relative log expression (RLE) plots are a powerful tool for visualizing such variation in high dimensional data. We provide a detailed examination of these plots, with the aid of examples and simulation, explaining what they are and what they can reveal. RLE plots are particularly useful for assessing whether a procedure aimed at removing unwanted variation, i.e. a normalization procedure, has been successful. These plots, while originally devised for gene expression data from microarrays, can also be used to reveal unwanted variation in many other kinds of high dimensional data, where such variation can be problematic.

Relative log expression (RLE) plots are a simple, yet powerful, tool for visualizing unwanted variation in high dimensional data. They were originally devised for analyzing data from gene expression studies involving microarrays, e.g. [

Because of their ability to detect unwanted variation, RLE plots are particularly useful for assessing whether a

Our aim here is to provide a detailed examination of these plots, with the aid of examples and simulation. We begin by explaining what an RLE plot is, then we describe what it can reveal. We then discuss a number of important points to keep in mind when interpreting the plot. To make our discussion concrete we frame it in terms of one kind of data: microarray data. Nearly everything we say applies

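We suppose that our microarray expression data (after log transformation) is organized into a matrix with rows corresponding to samples and columns corresponding to genes. Let $y_{ij}$ be the log expression for gene $j$ in sample $i$, and let $y_{*j}$ denote the $j$th column of $[y_{ij}]$. An RLE plot is constructed as follows: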

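For each gene $j$, calculate its median expression across the samples, i.e. $\mathrm{Med}(y_{*j})$, then calculate the deviations from this median, i.e. calculate $y_{ij} - \mathrm{Med}(y_{*j})$, across the samples $i$.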

For each sample, generate a boxplot of all the deviations for that sample.

We use the median, a robust measure of centre, to protect against outliers.
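The two construction steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes the log expressions are held in a NumPy array with samples in rows and genes in columns, and the function names `rle_deviations` and `rle_box_stats` are hypothetical.

```python
import numpy as np


def rle_deviations(y):
    """Step 1: subtract each gene's median (taken across samples).

    `y` is assumed to have samples in rows and genes in columns.
    """
    y = np.asarray(y, dtype=float)
    gene_medians = np.median(y, axis=0, keepdims=True)  # one median per gene
    return y - gene_medians


def rle_box_stats(y):
    """Step 2, summarised numerically: per-sample median and IQR.

    These are the centre and width of each sample's box in an RLE plot.
    """
    dev = rle_deviations(y)
    q1, med, q3 = np.percentile(dev, [25, 50, 75], axis=1)
    return med, q3 - q1
```

To draw the plot itself, one option is matplotlib's `boxplot`, which draws one box per column of a 2D array, so passing the transposed deviations (`rle_deviations(y).T`) should give one box per sample.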

To furnish an example we consider data from a study by Vawter et al. [

For our purposes we restrict our attention to a small subset of 27 samples: these were all the samples analyzed with the same kind of microarray (the Affymetrix HG-U95A microarray, measuring the expression of 12,626 genes), but processed at two different laboratories (24 at University of Michigan, 3 at UC Davis). We henceforth refer to this as the gender data.

Gender data with colour coding for the University of Michigan and UC Davis laboratories: (a) boxplots; (b) RLE plot.

The most obvious feature an RLE plot reveals is sample heterogeneity. For example, the plot for the gender data shows large differences between samples. A deeper interpretation of an RLE plot can be made if we assume the following:

(A) Expression levels of a majority of genes are unaffected by the biological factors of interest.

This is often a plausible assumption. In the gender study, for example, it is plausible to assume that only a minority of genes will be expressed differently between men and women for the same brain region. The same applies to different brain regions in the same individual, male or female.

In ideal circumstances, i.e. where no unwanted variation is present, under assumption (A) the log expression measurements for a majority of genes would simply consist of a mean plus a random variation about that mean: $y_{ij} = \mu_j + \epsilon_{ij}$, where $\epsilon_{ij}$ has zero mean and constant variance (depending on the gene $j$). Since $\mathrm{Med}(y_{*j}) \approx \mu_j$, by subtracting the median when constructing an RLE plot, we would obtain $y_{ij} - \mathrm{Med}(y_{*j}) \approx \epsilon_{ij}$. The RLE plot would then essentially display boxplots of the $\epsilon_{ij}$s: the boxplots would be roughly centred on zero and would roughly be the same size. Thus, under (A), sample heterogeneity is a sign of unwanted variation.

The RLE plot for the gender data is far from the above ideal, and so reveals substantial unwanted variation. We see unwanted variation both between and within batches as indicated by the varying

We have seen that “bad” RLE plots reveal unwanted variation in two ways: varying boxplot position and varying boxplot width. What kinds of effects, in a statistical sense, produce these features in a plot? To help answer this question we use simulation. We simulate log expressions _{ij}, for gene _{j} and _{i} are _{ij} is a _{ij} is a random error. We will use a simple _{ij} is the kind discussed previously [_{ij} as follows:

For each gene

For each sample

For a fixed gene

This model implies that mean expression varies across genes, and each gene varies differently across samples. We can obtain batch effects by assigning different values of _{θ} to different batches of samples. Note that we require

With this model, we simulate four different data sets each with

_{θ} = 0 and λ = 0 for all

_{θ} = 0 for _{θ} = 2 for

_{θ} = 0 and λ = 1 for all

_{θ} = 0 for _{θ} = 2 for

In all instances, we set

(a) additive effects only; (b) additive effects only, in two batches; (c) additive and non-additive effects; (d) additive and non-additive effects, in two batches.
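The four scenarios can be imitated with a small simulation. The sketch below is only illustrative: the model form (gene mean, plus additive sample effect, plus a multiplicative sample-by-gene term scaled by `lam`, plus noise), the distributions, the sample and gene counts, and all parameter values are assumptions made for this example and are not taken from the original simulation settings. The names `simulate`, `rle` and `box_summary` are hypothetical.

```python
import numpy as np


def rle(y):
    """Deviations from per-gene medians (samples in rows, genes in columns)."""
    y = np.asarray(y, dtype=float)
    return y - np.median(y, axis=0, keepdims=True)


def simulate(theta_means, lam, n_genes=5000, seed=0):
    """Simulate log expressions under an ASSUMED, illustrative model:

        y[i, j] = gene_mean[j] + theta[i] + lam * a[i] * b[j] + noise[i, j]

    theta_means : mean of the additive sample effect for each sample;
                  giving different means to groups of samples mimics batches.
    lam         : strength of the non-additive (multiplicative) effect.
    """
    rng = np.random.default_rng(seed)
    theta_means = np.asarray(theta_means, dtype=float)
    n_samples = theta_means.size
    gene_mean = rng.normal(8.0, 2.0, size=n_genes)    # assumed gene means
    theta = rng.normal(theta_means, 0.3)              # additive sample effects
    a = np.linspace(-2.0, 2.0, n_samples)             # assumed sample factor
    b = rng.normal(0.0, 1.0, size=n_genes)            # assumed gene factor
    noise = rng.normal(0.0, 0.5, size=(n_samples, n_genes))
    return gene_mean[None, :] + theta[:, None] + lam * np.outer(a, b) + noise


def box_summary(y):
    """Per-sample median and IQR of the RLE deviations."""
    q1, med, q3 = np.percentile(rle(y), [25, 50, 75], axis=1)
    return med, q3 - q1


one_batch = np.zeros(20)
two_batches = np.r_[np.zeros(10), np.full(10, 2.0)]

# Analogues of scenarios (a) to (d) under the assumed model.
for label, means, lam in [("a", one_batch, 0.0), ("b", two_batches, 0.0),
                          ("c", one_batch, 1.0), ("d", two_batches, 1.0)]:
    med, iqr = box_summary(simulate(means, lam))
    print(label, "median range:", np.ptp(med).round(2),
          "IQR max/min:", (iqr.max() / iqr.min()).round(2))
```

Under these assumptions, the per-sample medians should separate by batch whenever the additive means differ, while the ratio of largest to smallest box width should stay close to one unless `lam` is non-zero.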

These results show, perhaps not surprisingly, that shifting boxplots are produced by additive sample effects. However, perhaps more surprisingly, we see that additive sample effects are not sufficient to produce variation in the boxplot widths; this feature only appeared when non-additive sample effects were present. We do not claim that these are the

Can additive and non-additive sample effects help explain “bad” RLE plots for _{ij} is the log expression for gene ^{′} (

(a) RLE plot of the gender data with the additive sample effect removed; (b) RLE plots of the gender data with the additive and successive non-additive sample effects removed, i.e.

To investigate the non-additive effects we examine the following residual, which is the standard estimate for the non-additive component of a linear model:
^{′}, whose entries are these residuals. First observe that the SVD of ^{′} can be written as a sum of (rank 1) matrices:
^{′}, _{k} is the _{k} and _{k} are _{ij}] = _{k} _{i}_{j}, i.e. a product with one factor indexed by _{k}. In other words, we have decomposed the non-additive component of the data into a sum of non-additive effects of simple multiplicative type. So, by subtracting _{k} matrices from ^{′} we can remove non-additive sample effects from the data, in addition to the additive sample effect removed previously. Given this, define
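The idea of peeling off successive rank 1 terms of the SVD can be sketched as follows. This is a hedged illustration, not the authors' procedure: it assumes the residual is the usual double-centred quantity from a two-way additive model (entry minus row mean minus column mean plus grand mean), and the function names `double_centre` and `remove_rank1_terms` are hypothetical.

```python
import numpy as np


def double_centre(y):
    """Residuals from a two-way additive fit (ASSUMED form of the residual):

        r[i, j] = y[i, j] - rowmean[i] - colmean[j] + grandmean
    """
    y = np.asarray(y, dtype=float)
    return (y
            - y.mean(axis=1, keepdims=True)
            - y.mean(axis=0, keepdims=True)
            + y.mean())


def remove_rank1_terms(resid, k):
    """Subtract the first k rank 1 SVD terms (largest singular values first)."""
    u, s, vt = np.linalg.svd(resid, full_matrices=False)
    # Sum of the first k terms s[m] * outer(u[:, m], vt[m, :]).
    leading = (u[:, :k] * s[:k]) @ vt[:k, :]
    return resid - leading
```

Each subtracted term is a product of a factor indexed by sample and a factor indexed by gene, so removing the first few terms removes non-additive effects of simple multiplicative type; an RLE plot of what remains can then be inspected for each choice of `k`.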

We see that as

Since additive and non-additive sample effects are not the only kinds of statistical effects that can conceivably produce “bad” RLE plots, we do not wish to claim that these effects provide the only explanation for the features seen in the RLE plot for the gender data, only that they provide a

We commented in the introduction that RLE plots are particularly useful for assessing whether a normalization procedure, i.e. a procedure that attempts to remove unwanted variation, has been successful; a “bad” plot indicates a failure to normalize. It is important to note, however, that achieving an ideal RLE plot after applying a normalization procedure does

Lastly, we mention two important points about assumption (A), i.e. that expression levels of a majority of genes are unaffected by the biological factors of interest. Firstly, this assumption is not always needed to infer the presence of unwanted variation from an RLE plot. Large differences between

Secondly, assumption (A) is not always

We have seen that RLE plots, i.e. boxplots of deviations from gene medians, provide a simple yet powerful tool for detecting and visualizing unwanted variation in high dimensional microarray data, the presence of which is often problematic. The only assumption we need in order to interpret sample heterogeneity in an RLE plot as a sign of unwanted variation is that expression levels of a majority of genes are unaffected by the biological factors of interest. We noted, however, that while this assumption is often plausible, it is sometimes not safe to make, and sometimes not even needed. We have seen that RLE plots can reveal unwanted variation in two ways, namely varying boxplot position and varying boxplot width, and that additive and non-additive sample effects can produce these features, although additive effects alone cannot produce variation in boxplot width. We showed how simulated data with these effects produce these features, and that these effects provide an explanation of these features for real data. We have emphasised that, due to their ability to detect unwanted variation, RLE plots are particularly useful for assessing whether a normalization procedure has been successful, with a “bad” plot usually suggesting that the procedure has failed. However, we cautioned that while bad-looking RLE plots are excellent evidence that a normalization procedure has failed, good-looking RLE plots are only weak evidence that a procedure has succeeded, in the sense of not also removing signal of interest. Although our discussion has been framed in terms of microarray expression data, the original context in which RLE plots were devised, we hope we have conveyed how RLE plots might be useful for revealing unwanted variation in many other kinds of high dimensional data, where such variation can be problematic.