The authors have declared that no competing interests exist.

Conceived and designed the experiments: AS JRS CC JH RKW MLS. Performed the experiments: AS JRS JH RKW MLS. Analyzed the data: AS JRS. Wrote the paper: AS JRS.

Missing data can arise in bioinformatics applications for a variety of reasons, and imputation methods are frequently applied to such data. We are motivated by a colorectal cancer study where miRNA expression was measured in paired tumor-normal samples of hundreds of patients, but data for many normal samples were missing due to lack of tissue availability. We compare the precision and power performance of several imputation methods, and draw attention to the statistical dependence induced by K-Nearest Neighbors (KNN) imputation. This imputation-induced dependence has not previously been addressed in the literature. We demonstrate how to account for this dependence, and show through simulation how the choice to ignore or account for this dependence affects both power and type I error rate control.

MicroRNAs (miRNAs) are small non-coding RNA molecules that regulate gene expression by targeting messenger RNAs. They were first discovered in 1993 during a study into development in the nematode Caenorhabditis elegans (C. elegans) regarding the protein gene lin-14 [

The scientific community is currently highly interested in the functional roles of miRNAs. Properly functioning miRNA biogenesis results in normal rates of cellular growth, proliferation, differentiation, and cell death. However, a reduction or deletion of miRNAs caused by defects at any stage of miRNA biogenesis leads to inappropriate expression of miRNA-target oncoproteins, which increases proliferation, invasiveness, or angiogenesis, or decreases levels of apoptosis [

The miRBase database, a searchable database of published miRNA sequences and annotation, listed 2,588 unique mature human miRNAs as of July 2014 (from

The association between miRNAs and colorectal cancer (CRC) was first reported in 2003, when the miR-143 and miR-145 genes were found to be downregulated in CRC tumor tissues compared with normal tissues [

Like most other expression data, miRNA data can be represented as large matrices of expression levels of features (rows) in different subjects (columns). A data set might have some features missing in some samples, or all features missing in some samples. The former case often occurs due to insufficient resolution, image corruption, dust or scratches on the slide, and various other experimental and technical reasons, while the latter case may happen due to lack of collected tissue or limited funds. As an example of the latter case, we present a case study from research to determine the association of miRNAs with CRC in paired normal-tumor samples. As part of a preliminary analysis using the first available subjects, we wanted to compare miRNA expression profiles of normal and tumor samples from each of more than 400 subjects, with 2006 miRNAs measured in each sample. We also collected extensive information about demographic and lifestyle variables of these CRC patients; few CRC studies have collected such extensive data on these variables. However, in the final analysis using all available subjects, 10% to 50% of the subjects will have missing normal samples due to lack of tissue availability.

The immediate objective in this CRC case study is to understand the alternatives for imputation, along with their comparative strengths and weaknesses. Specifically, we wish to know for a given imputation method whether its application to missing miRNA data among normal samples will yield accurate predictions of their actual expression levels, and how such predictions are further affected by the percentage of subjects with missing values. We further wish to understand how these results affect statistical power to detect differentially expressed miRNA while controlling for Type I error.

With the proliferation of gene expression studies over the past decade, more attention has been paid to imputation methods for miRNA data. Conventional approaches often involve simply excluding miRNAs with missing values, replacing missing values with zeroes, or imputing using row or column averages. Such options ignore the correlation structure of the data and have limited power [

In this paper, we introduce and evaluate an imputation method that accounts for the dependence induced by weighted K-Nearest Neighbor (KNN) imputation and incorporates covariates. We compare it with multiple imputation techniques using MCMC and EM with bootstrapping algorithms, as well as with the case deletion technique, using characteristics of this large CRC data set.

This paper is arranged as follows. First, we provide an overview of imputation assumptions and methods, as well as the RMSE method used to assess the performance of the various imputation techniques. Then we demonstrate the application of the imputation techniques using simulated data sets. Finally, we conclude with a discussion of the important issues presented in the paper, such as the performance of the KNN imputation method that accounts for the dependence relative to the multiple imputation techniques.

Before imputing missing data, it is necessary to know whether the missingness occurs randomly, results from unobserved factors, or is intentional. Two assumptions need to be taken into consideration: missing at random (MAR) and missing completely at random (MCAR) [

We consider the following methods to estimate the miRNA expression levels for missing normal samples of patients:

Multiple imputation (MI) was originally designed to handle missingness in public-use large data sets [

Multiply imputed data sets can be generated by the MCMC method, which can be applied to arbitrary missing data patterns under the assumption of multivariate normality. MCMC has been used to explore posterior probability distributions of unknown parameters in Bayesian inference. Using this method, the entire joint posterior distribution of the unknown quantities is simulated, and parameter estimates based on the simulation are generated [

This process can be described in two steps. The first step is the imputation I-step, which randomly draws values for missing values from the assumed distribution of missing values given observed values using the estimated mean vector and variance-covariance matrix, i.e. it draws values for $Y_{mis}$ from $p(Y_{mis} \mid Y_{obs}, \theta^{t})$, where $Y_{mis}$ and $Y_{obs}$ are variables with missing values and observed values, respectively, and $\theta^{t}$ is a parameter estimate at the $t^{th}$ iteration.

The posterior P-step randomly simulates the population mean vector and variance-covariance matrix from the complete sample estimates, i.e. it draws $\theta^{(t+1)}$ from $p(\theta \mid Y_{mis}, Y_{obs})$. Enough iterations are carried out to have reliable results for a multiply imputed dataset and to converge to its stationary distribution from which we can simulate an approximately random draw of the missing values [
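The alternation between the I-step and the P-step can be illustrated with a deliberately simplified sketch. It assumes a single normally distributed variable (not the multivariate normal model described above) and a noninformative prior, so that the P-step reduces to a scaled inverse chi-square draw for the variance followed by a normal draw for the mean. The function name `data_augmentation` and the toy data are hypothetical and are not taken from the study.

```python
import math
import random


def data_augmentation(values, n_iter=200, seed=0):
    """Alternate I- and P-steps for one normal variable; None marks a missing value.

    Simplifying assumptions (not from the paper): univariate data and a
    noninformative prior on the mean and variance.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    n = len(values)

    # Start from the observed-data estimates of the mean and variance.
    mu = sum(observed) / len(observed)
    sigma2 = sum((v - mu) ** 2 for v in observed) / (len(observed) - 1)

    completed = list(values)
    for _ in range(n_iter):
        # I-step: draw each missing value given the current parameter draw.
        completed = [
            v if v is not None else rng.gauss(mu, math.sqrt(sigma2))
            for v in values
        ]

        # P-step: draw new parameters from their posterior given the
        # completed data.
        mean_c = sum(completed) / n
        ss = sum((v - mean_c) ** 2 for v in completed)
        # Chi-square draw with n - 1 degrees of freedom (gamma with scale 2).
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = ss / chi2
        mu = rng.gauss(mean_c, math.sqrt(sigma2 / n))

    return completed, mu, sigma2


data = [4.1, None, 5.3, 4.8, None, 5.0, 4.6]
completed, mu, sigma2 = data_augmentation(data)
```

Several imputed data sets could be obtained from such a chain by keeping the completed data at well-separated iterations, or by running independent chains with different seeds.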

The EM algorithm is a very general iterative algorithm for maximum likelihood estimation in the presence of missing data [

The EM algorithm consists of two steps, the Expectation (E) and the Maximization (M) steps. In the expectation step, the algorithm calculates the conditional expectation of the missing values given the non-missing values and the current parameter estimates. In the maximization step, the calculated expected values are used to maximize the likelihood of the complete data. These steps are iterated until the likelihood converges. The maximization step may not have an explicit form; in that case, the maximum can be obtained iteratively within the maximization step.
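As a concrete illustration of the two steps, the following sketch runs EM for a bivariate normal model in which one variable is fully observed and the other is partly missing. This two-variable setting, the function name `em_bivariate`, and the toy data are assumptions chosen for brevity; the sketch also stops when the parameter estimates stabilize rather than monitoring the likelihood directly.

```python
def em_bivariate(x, y, n_iter=500, tol=1e-10):
    """EM estimates for a bivariate normal; x is complete, y uses None for missing."""
    n = len(x)
    obs = [v for v in y if v is not None]

    # x is fully observed, so its mean and variance estimates never change.
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n

    # Starting values for the parameters that involve y.
    mu_y = sum(obs) / len(obs)
    s_yy = sum((v - mu_y) ** 2 for v in obs) / len(obs)
    s_xy = 0.0

    for _ in range(n_iter):
        # E-step: expected y and y^2 for each subject given x and the
        # current parameter estimates.
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        ey, ey2 = [], []
        for xi, yi in zip(x, y):
            if yi is None:
                m = mu_y + beta * (xi - mu_x)
                ey.append(m)
                ey2.append(m * m + resid_var)
            else:
                ey.append(yi)
                ey2.append(yi * yi)

        # M-step: maximize the expected complete-data likelihood.
        new_mu_y = sum(ey) / n
        new_s_xy = sum(xi * m for xi, m in zip(x, ey)) / n - mu_x * new_mu_y
        new_s_yy = sum(ey2) / n - new_mu_y ** 2

        change = (abs(new_mu_y - mu_y) + abs(new_s_xy - s_xy)
                  + abs(new_s_yy - s_yy))
        mu_y, s_xy, s_yy = new_mu_y, new_s_xy, new_s_yy
        if change < tol:
            break

    return mu_x, mu_y, s_xx, s_xy, s_yy


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, None, 5.1, None]
mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate(x, y)
```

In this setting the maximization step has a closed form, so no inner iteration is needed.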

The maximization step can be computationally expensive, which can make the EM algorithm unattractive. Fortunately, the EM with bootstrapping algorithm resolves this problem. It uses the conventional EM algorithm on multiple bootstrapped samples of the original missing data to draw values of the complete-data parameters. Then it draws imputed values from each set of bootstrapped parameters, replacing the missing values with these draws. The EM with bootstrapping algorithm can impute missing values in much less time than the EM algorithm itself [

The conventional KNN method replaces missing values using

[_{i1} ≤ … ≤ _{ik} be the sorted distances of the _{i1}, …, _{ik} among the

Our proposed imputation method accounts for the dependence induced by weighted KNN and can use the additional covariates such as demographic, general health, genetic, and lifestyle variables, as well as other biologically related information. The proposed imputation method takes advantage of the conventional KNN [

This modified KNN technique imputes all miRNA expression levels of missing normal samples by finding the

Another advantage of this method is that it can simultaneously integrate multiple covariates by aggregating and normalizing their distance matrices (Euclidean, Manhattan, Minkowski, etc.) to find the nearest neighbor subjects. Specifically, two between-subject distance matrices are constructed based on the fully observed continuous and discrete covariates separately, using Euclidean and Manhattan distances, respectively. These two distance matrices are normalized by scaling between 0 and 1 [
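The construction of the aggregated covariate distance can be sketched as follows. The sketch assumes that the two normalized matrices are aggregated by simple summation and that normalization is a min-max scaling over all entries of each matrix; neither detail is confirmed by the text above. All function names and the four toy subjects are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def distance_matrix(rows, metric):
    """Between-subject distance matrix for one block of covariates."""
    n = len(rows)
    return [[metric(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def min_max_scale(matrix):
    """Scale all entries of a distance matrix to the interval [0, 1]."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # guard against a constant matrix
    return [[(v - lo) / span for v in row] for row in matrix]


def aggregate_distance(continuous, discrete):
    """Sum of the scaled Euclidean and Manhattan matrices (assumed aggregation)."""
    d_cont = min_max_scale(distance_matrix(continuous, euclidean))
    d_disc = min_max_scale(distance_matrix(discrete, manhattan))
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(d_cont, d_disc)]


def nearest_neighbors(dist, target, donors, k):
    """Indices of the k donor subjects closest to the target subject."""
    return sorted(donors, key=lambda j: dist[target][j])[:k]


# Toy covariates: (age, BMI) and two binary indicators per subject.
continuous = [[50, 25.0], [52, 26.0], [70, 31.0], [49, 24.5]]
discrete = [[1, 0], [1, 0], [0, 1], [1, 1]]
dist = aggregate_distance(continuous, discrete)

# Subject 3 lacks a normal sample; subjects 0-2 are fully observed donors.
neighbors = nearest_neighbors(dist, target=3, donors=[0, 1, 2], k=2)
```

Scaling each matrix before aggregation keeps either block of covariates from dominating the combined distance purely because of its units.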

There have been many studies carried out to determine the optimal choice (parameter) of

However, the choice of a small

Because the weighted KNN-imputed expression values are linear combinations of the fully observed subjects’ expression values, the imputed values are not necessarily independent of the fully observed values. The modified KNN-based imputation method has the advantage of accounting for this dependence by providing a variance-covariance matrix for each miRNA, which can be used when searching for differentially expressed miRNAs. We refer to this method as “KNN dependent”, and to the KNN imputation method that ignores the dependence as “KNN independent”. Its algorithm works almost the same as the algorithms of the conventional KNN-based methods, except that it treats the rows as subjects or samples, and the columns as miRNAs.
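The source of this dependence can be made concrete with a small sketch. It assumes, purely for illustration, that the observed expression values of a given miRNA are independent across fully observed subjects with a common variance `sigma2`. Stacking the observed and imputed values then gives a linear transformation A of the observed values, whose covariance is sigma2 times A times its transpose. The function name and the weights are hypothetical, and this is not the exact variance-covariance expression used in the paper.

```python
def imputed_covariance(weights, sigma2=1.0):
    """Covariance of (observed, imputed) values for one miRNA.

    weights: one row per imputed subject and one column per fully observed
    subject; each row holds that subject's KNN weights (zero for non-neighbors).
    Assumption: observed values are independent with variance sigma2.
    """
    n_obs = len(weights[0])
    identity = [[1.0 if i == j else 0.0 for j in range(n_obs)]
                for i in range(n_obs)]
    # A maps the observed values to the stacked (observed, imputed) vector.
    a = identity + [[float(w) for w in row] for row in weights]
    return [[sigma2 * sum(p * q for p, q in zip(r1, r2)) for r2 in a]
            for r1 in a]


# Two imputed subjects, three fully observed subjects.
weights = [[0.5, 0.5, 0.0],
           [0.0, 0.25, 0.75]]
cov = imputed_covariance(weights)
```

In this toy example the first imputed subject has covariance 0.5 with each of its two donors and 0.125 with the second imputed subject, because the two imputed subjects share a donor. Treating the imputed values as independent observations would ignore these nonzero off-diagonal entries.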

To see how the proposed imputation method estimates the miRNA expression levels in missing normal samples and accounts for the dependence induced by the weighted KNN, suppose that in the CRC study of _{1}, …, _{k}), and impute the missing miRNA expression values by multiplying the miRNA expressions from normal samples of the

Here, _{lj} is the observed expression value of miRNA _{lj} is the weight of the subject in the imputation. The weights _{i1}, …, _{ik} are obtained as outlined in

Here, _{1}, _{2}, …, _{k}, and are the coefficients _{i1}, _{i2}, …, _{ik} in

The variance-covariance matrix of the normal tissue expression for miRNA

Here,

The paired t-test [

Here, ^{th} miRNA expressions for normal and tumor samples, _{j} is a single parameter representing the difference of mean expression levels of miRNA

The mean tumor-normal difference for miRNA

The

Then, the estimated variance of

Finally, the test statistic will be found using

This paired t-test can be used with the other imputation methods by replacing

The performance of the imputation methods on miRNA data is evaluated through root mean squared error (RMSE). The RMSE-based evaluation technique is the most commonly used method for comparing true expression values with imputed expression values. Several variants of the RMSE measure are used in the literature: the non-normalized RMSE measure [

In the motivating CRC case study, all miRNA expression levels must be imputed for up to 50% of the normal samples, i.e. up to 50% of the rows (samples) of the miRNA data are missing. Thus, the non-normalized RMSE, which measures the difference between the imputed part of the matrix and the corresponding original part of the matrix, divided by the number of missing cells, can be used. It is calculated as

Here, _{ij} is the original value for missing sample
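A minimal sketch of this computation is given below, assuming the usual reading of the description: the squared differences between imputed and original values are summed over all cells of the missing rows, divided by the number of those cells, and the square root is taken. The function name and the toy matrices are hypothetical.

```python
import math


def rmse_missing_rows(original, imputed, missing_rows):
    """Non-normalized RMSE over the cells of the rows that were imputed."""
    squared_error = 0.0
    n_cells = 0
    for i in missing_rows:
        for true_value, imputed_value in zip(original[i], imputed[i]):
            squared_error += (imputed_value - true_value) ** 2
            n_cells += 1
    return math.sqrt(squared_error / n_cells)


# Rows are samples and columns are miRNAs; rows 1 and 2 were imputed.
original = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
imputed = [[1.0, 2.0], [3.5, 3.5], [5.0, 7.0]]
rmse = rmse_missing_rows(original, imputed, missing_rows=[1, 2])
```

Here the squared errors sum to 1.5 over 4 missing cells, so the RMSE is the square root of 0.375.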

We evaluated the performance of the proposed imputation method, which accounts for the dependence induced by weighted KNN and considers the demographic and lifestyle covariates (KNN dependent), against the weighted KNN ignoring the dependence (KNN independent), MI techniques using MCMC and EM with bootstrapping algorithms, and the case deletion technique, which only considers fully observed subjects [

While we have complete normal and tumor sample data for more than 400 subjects in the CRC study, we compare imputation methods using simulated data to have clearly defined power and Type I errors. The imputation analyses were performed on normally distributed paired data matrices of

To ensure that the simulated data sets reflected the characteristics of the CRC study, and that the demographic and lifestyle variables carried some useful information for imputation, the multivariate covariate data sets with demographic and lifestyle variables of subjects were simulated based on

Here, _{0} is the mean age of the patients in CRC case study, and _{j} is uniformly distributed with a minimum and a maximum of up to 5% of the minimum and the maximum of the CRC case study patients’ age, respectively. In this paper, we used 2% of the minimum and the maximum of the continuous variables with _{j} is the expression of truly differentially expressed miRNA

Binary variables, such as the gender of subjects, were simulated using a logistic regression model in Eqs

Here,

Here, _{0} is the mode of the patients’ gender in the CRC case study, and _{j} is uniformly distributed as

In our simulation study, a subject was denoted as male if the value of

Demographic and lifestyle variables were thus simulated based on characteristics of five continuous (age, number of cigarettes/day, calories, BMI, and lutein and zeaxanthin concentration) and five binary (gender, recent aspirin/NSAID use, recent smoker, menopause, and post menopause taking HRT within 2 years statuses) variables from the CRC study.

We carried out the performance analyses as follows. First, we arbitrarily designated a subset of subjects as having missing normal samples. Then, we imputed the expression levels of the missing normal samples using the imputation methods described in the Methods section. We evaluated the performance of these imputation methods against the initially generated data matrices by calculating the RMSE for each simulated data set. Moreover, we carried out differential expression (DE) analyses on the imputed data sets to check whether the KNN dependent method has statistical power comparable to that of the other imputation techniques in finding differentially expressed miRNAs.

The performance of the modified KNN method was assessed against MI techniques using MCMC and EM with bootstrapping algorithms for data matrices with different numbers of subjects and different percentages of missing normal samples. In

The KNN imputation method is also robust to increases in the percentage of missing normal samples and in the number of subjects in the miRNA data sets. Its performance remains relatively constant across all levels of missingness and all numbers of subjects.

Moreover, the KNN imputation method required much less computational expense than the MI techniques using MCMC and EM with bootstrapping algorithms. For example, to impute the expressions of 50% missing normal samples in 400 subjects on a machine with CPU speed of 1.86 GHz and 2 GB RAM, the KNN method took approximately 35 minutes, whereas MCMC and EM with bootstrapping algorithms took approximately 10 and 5 hours, respectively.

We applied the paired t-test to the data sets imputed by the various imputation methods to see how well we could identify differentially expressed miRNAs. First, we obtained a test statistic and a p-value for each miRNA feature in each imputed data set, and controlled the false discovery rate (FDR) at 0.05 within each simulation. Then, we calculated the true positive rate (TPR), the false positive rate (FPR), and the FDR based on the miRNAs which were designated as truly differentially expressed in the simulations. The TPR and FPR were defined and calculated as in [
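The evaluation step can be sketched as follows. The FDR-controlling procedure is not named above, so the sketch assumes the Benjamini-Hochberg step-up procedure, and it assumes the standard confusion-matrix definitions of TPR, FPR, and FDR. The function names, p-values, and truth labels are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean rejection flags from the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls below its threshold.
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


def error_rates(rejected, truly_de):
    """TPR, FPR, and FDR given rejection flags and the simulated truth."""
    tp = sum(1 for r, t in zip(rejected, truly_de) if r and t)
    fp = sum(1 for r, t in zip(rejected, truly_de) if r and not t)
    fn = sum(1 for r, t in zip(rejected, truly_de) if not r and t)
    tn = sum(1 for r, t in zip(rejected, truly_de) if not r and not t)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return tpr, fpr, fdr


p_values = [0.001, 0.008, 0.039, 0.012, 0.6, 0.9]
truly_de = [True, True, True, False, False, False]
rejected = benjamini_hochberg(p_values, alpha=0.05)
tpr, fpr, fdr = error_rates(rejected, truly_de)
```

In this toy example three miRNAs are declared significant, two of which are truly differentially expressed, so the TPR is 2/3 and both the FPR and the FDR are 1/3.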

From

In the simulated data sets, the imputation accuracy of the proposed KNN imputation method, using the aggregated distance matrices of the demographic and lifestyle data, was higher than that of the MI methods using MCMC and EM with bootstrapping algorithms. Moreover, the proposed KNN method was robust and imputed the miRNA features of missing normal samples with less computational expense than the other imputation methods.

The DE tests of the KNN-imputed data sets show that the KNN method that accounts for the dependence of the imputed values (KNN dependent) provided greater power than if no imputation were done (the case deletion approach) and maintained control of the FDR. The KNN method that ignores the dependence (KNN independent), as well as the MCMC and EM with bootstrapping algorithms, had higher power than KNN dependent but failed to control the FDR. These effects are more pronounced for larger percentages of missing samples and larger numbers of subjects.

Depending on the study goals, researchers could select the KNN method while ignoring the dependence (achieving more power but a higher proportion of false discoveries) or while accounting for the dependence (a moderate loss of power but a lower proportion of false discoveries). In the motivating CRC study, the chosen approach is the KNN method that accounts for the dependence, which incurs a moderate loss of power but maintains control of the FDR.

The case deletion method showed the lowest power to identify differentially expressed miRNAs, though its FDR control was similar to that of the KNN dependent method.

In this paper, we applied the paired t-test to identify differentially expressed miRNAs from normally distributed simulated miRNA data while accounting for the dependence structure of the imputed data. However, miRNA data can be noisy and not normally distributed. Currently available nonparametric tests may not be directly applicable either, because they assume independence. In this respect, it is challenging to construct a statistical model which tests for significant miRNAs from paired samples while accounting for the dependence. Our future work is to develop a nonparametric analogue of the paired t-test that can be applied to large numbers of miRNAs, using permutations with manageable computational expense, while accounting for the dependence induced by KNN imputation.

We would like to thank Dr. Adele Cutler and Dr. Daniel Coster for their helpful comments and suggestions regarding numerical issues in weighted least squares. We thank Erica Wolff and Michael Hoffman for miRNA assessment, Sandie Edwards, Courtney Maxfield, and Lila Mullany for tissue collection, Dr. Wade Samowitz for pathology review, and Brett Milash for miRNA bioinformatics assessment. We also thank the Division of Research Computing at USU for providing technical resources to perform numerous study simulations. This research was supported by an NIH grant, award number 1R01CA163683-01A1.