
The authors have declared that no competing interests exist.

Conceived and designed the experiments: OA SV ALO. Performed the experiments: OA. Analyzed the data: OA. Wrote the paper: OA SV ALO.

It is widely agreed that complex diseases are typically caused by the joint effects of multiple genetic variations rather than a single one. These genetic variations may show stronger effects when considered together than when considered individually, a phenomenon known as epistasis or multilocus interaction. In this work, we explore the applicability of information interaction to the discovery of pairwise epistatic effects related to complex diseases. We start by showing that traditional approaches such as classification methods or greedy feature selection methods (such as the Fleuret method) do not perform well on this problem. We then compare our information interaction method with BEAM and SNPHarvester on artificial datasets simulating epistatic interactions and show that our method has more power to detect pairwise epistatic interactions than its competitors. We then show the results of applying the information interaction method to the WTCCC breast cancer dataset. Our results are validated using permutation tests. We were able to find 89 statistically significant pairwise interactions with a p-value lower than

The availability of ever more extensive genetic information has spurred intense research on the search for the genetic factors that influence common complex traits. Genome Wide Association Studies (GWAS) aim at discovering associations between genetic factors and complex traits such as diseases. In GWAS, hundreds of thousands of Single Nucleotide Polymorphisms (SNPs) are analyzed to determine whether they are associated with the disease or conditions of interest. Due to limitations on the data, these analyses are usually performed using single SNP statistical tests and correcting for multiple testing.

This approach has severe limitations since epistatic interactions of SNPs are very important in determining susceptibility to complex diseases. Existing methods for SNP interaction discovery perform poorly when marginal effects of disease loci are weak or absent. As an example of a case where this may happen, it has been suggested that many genes with small effects rather than few genes with strong effects contribute to the development of asthma. The problem is that the individual effects of the interacting SNPs may be too small to be detected with the most commonly used statistical methods. Therefore, there is a need for more powerful methods that are able to identify interactions between SNPs with low marginal effects.

A number of different methods have been used to find epistatic interactions, including statistical methods (e.g. ATOM

In this paper, we describe results obtained using a measure known as

The results obtained in artificial datasets show that the approach vastly outperforms existing methods such as SNPHarvester

To test methods that find interactions with low marginals, we need datasets that simulate this type of interaction. We selected simulated datasets that were tested with the SNPHarvester algorithm

Sixty different epistatic models formed the basis for generating datasets in which disease loci do not have marginal effects. These epistatic models, first used in

The datasets with multiple disease loci without marginal effects are meant to simulate the expectation that multiple SNP-SNP interactions may exist in association studies. Eight hybrid models were used to generate these datasets. Each hybrid model is a mixture of five pure epistatic models with the same heritability and MAF. For example, if a hybrid model HM1 consists of models

The artificial datasets are very important to test the methods and estimate their power. If we can develop a method that successfully detects the associations in the artificial datasets, then we can apply that method to real world datasets in order to detect associations with similar characteristics. The Wellcome Trust Case Control Consortium (WTCCC) breast cancer dataset is a large real world dataset with 1045 cases, 1476 controls and 15436 SNPs. We used only the SNPs that were also used in the WTCCC original publication

Our first approach was to select one artificial dataset that contains multiple disease loci without marginal effects and apply classification methods to see whether the phenotype could be explained from the 1000 SNPs (10 of which are the disease loci).

We used WEKA to test several classification methods: Alternating Decision Trees, Voted Perceptrons and Support Vector Machines. We also tested the Sparse Multinomial Logistic Regression (SMLR) method.

Alternating Decision (AD) Trees

Voted perceptrons

Support Vector Machines (SVMs)

Sparse Multinomial Logistic Regression (SMLR)

Fleuret published a fast binary feature selection technique (that we call the Fleuret method) based on Conditional Mutual Information Maximization (CMIM)

The goal of this method is to select a small subset of features that carries as much information as possible. To measure the amount of information carried by the features, the Fleuret method uses Conditional Mutual Information. This information theory measure is based on another very important measure which is the Entropy

This value can be seen as the difference between the average remaining uncertainty of
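As a minimal sketch of these quantities (not the authors' implementation), entropy, mutual information and conditional mutual information can be estimated from empirical counts over discrete columns such as genotypes and case/control labels. The function names are illustrative assumptions, and all values are in bits.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def mutual_information(y, x):
    """I(Y; X) = H(Y) + H(X) - H(Y, X)."""
    return entropy(y) + entropy(x) - entropy(y, x)


def conditional_mutual_information(y, x, z):
    """I(Y; X | Z) = H(Y, Z) + H(X, Z) - H(Z) - H(Y, X, Z)."""
    return entropy(y, z) + entropy(x, z) - entropy(z) - entropy(y, x, z)


# Toy data: the phenotype y is the XOR of two binary variables x and z.
x = [0, 0, 1, 1]
z = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x, z)]

mi_x = mutual_information(y, x)
cmi_x_given_z = conditional_mutual_information(y, x, z)
```

In this toy example `x` alone carries no information about `y` (`mi_x` is 0 bits), while conditioning on `z` reveals a full bit (`cmi_x_given_z` is 1 bit).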

The standard implementation of the algorithm is done by keeping a score vector
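The greedy selection loop can be sketched as follows. This is a plain, non-optimized rendering of CMIM written for illustration, not the code distributed by Fleuret; the function and variable names are assumptions. Each candidate's score starts at its marginal mutual information with the class and is then lowered to the smallest conditional mutual information given any feature already selected.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def mi(y, x):
    return entropy(y) + entropy(x) - entropy(y, x)


def cmi(y, x, z):
    return entropy(y, z) + entropy(x, z) - entropy(z) - entropy(y, x, z)


def cmim_select(features, y, k):
    """Greedily pick k feature indices (k must not exceed len(features))."""
    # Initial score: marginal information I(Y; X_n) of each feature.
    score = [mi(y, f) for f in features]
    selected = []
    for _ in range(k):
        candidates = [n for n in range(len(features)) if n not in selected]
        best = max(candidates, key=lambda n: score[n])
        selected.append(best)
        # Keep, for each remaining feature, the minimum of I(Y; X_n | X_s)
        # over all features s selected so far.
        for n in candidates:
            if n != best:
                score[n] = min(score[n], cmi(y, features[n], features[best]))
    return selected


# Toy data: y is the XOR of x and z; w is an unrelated-looking third feature.
x = [0, 0, 1, 1]
z = [0, 1, 0, 1]
w = [0, 1, 1, 1]
y = [a ^ b for a, b in zip(x, z)]
picked = cmim_select([x, z, w], y, 2)
```

In this toy sample `x` and `z` both start with a score of zero, so the first pick goes to `w`, the only feature with a nonzero marginal score. Under these assumptions, the example shows why a greedy start has nothing to guide it toward loci without marginal effects.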

A systematic search over all possible pairs of SNPs came as a natural idea after the poor performance of the Fleuret method. If we have no information to guide the choice of the first disease locus, we need to consider all pairs of SNPs to decide which ones may be interacting in a way that helps us explain the phenotype.

The choice of information interaction as the metric to evaluate if a pair of SNPs is associated with the phenotype arose as a natural option. We define that a pair of SNPs

If the information that SNP

In our experiments on the breast cancer dataset, the user-defined threshold
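The exhaustive pairwise search can be sketched as below. The score used here is one standard form of interaction information, I(Y; A | B) - I(Y; A), which is an assumption on our part since the exact formula is not reproduced in this text. The function names and the toy threshold are also illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def information_interaction(y, a, b):
    """I(Y; A | B) - I(Y; A): information about Y gained only by pairing A with B."""
    cmi = entropy(y, b) + entropy(a, b) - entropy(b) - entropy(y, a, b)
    mi = entropy(y) + entropy(a) - entropy(y, a)
    return cmi - mi


def interacting_pairs(snps, y, threshold):
    """Score every pair of SNP columns and keep those above the threshold."""
    hits = []
    for i, j in combinations(range(len(snps)), 2):
        score = information_interaction(y, snps[i], snps[j])
        if score > threshold:
            hits.append((i, j, score))
    return sorted(hits, key=lambda h: -h[2])


# Toy data: SNPs 0 and 1 jointly determine the phenotype; SNP 2 is noise.
snp0 = [0, 0, 1, 1, 0, 0, 1, 1]
snp1 = [0, 1, 0, 1, 0, 1, 0, 1]
snp2 = [0, 0, 0, 0, 1, 1, 1, 1]
phenotype = [a ^ b for a, b in zip(snp0, snp1)]
hits = interacting_pairs([snp0, snp1, snp2], phenotype, threshold=0.5)
```

Only the pair (0, 1) exceeds the threshold in this example, with a score of 1 bit, even though neither SNP is informative on its own. Because every pair is scored, the result does not depend on any choice of a first locus.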

We also applied mutual information to all SNPs in order to find single-SNP associations with the phenotype and to compare the results with those of the information interaction method.

Similarly to the information interaction method, a SNP

In our experiments on the breast cancer dataset, the user-defined threshold

Permutation testing is a non-parametric procedure for determining statistical significance based on rearrangements of the labels of a dataset. It is a robust method but it can be computationally intensive. A test statistic, which is computed from the dataset, is compared with the distribution of permutation values. These permutation values are computed similarly to the test statistic, but under a random rearrangement of the labels of the dataset. Permutation tests can help reduce the multiple testing burden

In bioinformatics, permutation tests have become a widely used technique. The reason for this popularity is their non-parametric nature, since in many bioinformatics applications there is no solid evidence or sufficient data to assume a particular model for the obtained measurements of the biological events under investigation
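A permutation test of this kind can be sketched as follows. The choice of the maximum mutual information over all features as the test statistic, and the add-one correction in the p-value, are common conventions assumed here for illustration rather than details taken from the paper. Using the maximum over all features means a single null distribution covers every test at once.

```python
import random
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def max_mutual_information(features, labels):
    """Test statistic: largest I(labels; feature) over all features."""
    return max(entropy(labels) + entropy(f) - entropy(labels, f) for f in features)


def permutation_p_value(statistic, features, labels, n_permutations=1000, seed=0):
    """Empirical p-value of the observed statistic under label shuffling."""
    rng = random.Random(seed)
    observed = statistic(features, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # random rearrangement of the labels
        if statistic(features, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_permutations + 1)


# Toy data: one feature copies the labels, the other is unrelated to them.
labels = [0] * 20 + [1] * 20
associated = list(labels)
unrelated = [i % 2 for i in range(40)]
p = permutation_p_value(max_mutual_information, [associated, unrelated],
                        labels, n_permutations=200)
```

The smallest p-value this estimator can report is 1 / (n_permutations + 1), which is why resolving very small p-values requires many permutations.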

The application of the classification methods to our selected artificial dataset did not produce good results. In fact, as we can see from

AD Trees | 55.5% |

Voted Perceptron | 54.5% |

SVMs | 51.5% |

SMLR | 56.5% |

We also performed experiments using feature selection methods in WEKA. The 10-fold cross validation accuracies improved, but none of the variables selected by the feature selection methods corresponded to the disease loci. We tried several feature selection methods available in WEKA and none was able to identify any of the 10 disease loci.

A question that arose was how the performance of classifiers changes when more noisy variables are added. To study this, we started by training a classifier with only the 10 disease loci. We then added one noisy variable at each step and retrained the algorithm using the same training parameters. Even taking into account that the training parameters are not adjusted when more variables are added, it is interesting to see how rapidly the 10-fold cross validation accuracy degrades with the addition of noisy variables (see

We conclude that classification methods are not capable of identifying the relevant factors when many irrelevant attributes are present.

In our experiments, we used an implementation of the Fleuret method available on the author's web page

These poor results may occur because the Fleuret method initializes the score table

The first steps are crucial, which is why we propose another approach based on testing all possible pairs and evaluating them with information theory measures.

We applied the information interaction metric to all possible pairs of SNPs in one artificial dataset and found a large difference between the values of true disease loci pairs and those of noisy pairs of SNPs. It was possible to define a threshold that perfectly distinguished true disease loci pairs from non-associated SNPs.

The results obtained in this artificial dataset encouraged us to apply the method to all the datasets, generated with equal or different simulation parameters. We could then compare our results with those of the SNPHarvester and BEAM methods.

One of the main issues in applying information interaction (II) search is the choice of threshold

We performed a comparison of our method with the results reported in

As we can see, our Information Interaction method (IIM) outperforms both SNPHarvester and BEAM in terms of power. In Hybrid Model 8 our method has more difficulty discovering the interactions because the signal is more dispersed in the noise. However, even under those harder conditions, our method performed better than SNPHarvester and BEAM. We also show in

SNPHarvester false positives were obtained by inspecting the SNPHarvester result files that are available in

Information interaction search proved to be a very powerful method for detecting interactions between SNPs with low marginals and is therefore a valid option to apply to real GWAS data.

We tried to apply regression using SPSS, but this was not feasible: due to the high number of variables, the software stopped responding (on a computer with 8 gigabytes of RAM). We then decided to use the statistical tool R for our regression experiments, and the same problem occurred given the high dimensionality of the parameter space. The methodology that we were able to run successfully used the Bayesian Information Criterion (BIC) as the model selection criterion. When we learned the model of marginal effects using Lasso penalized logistic regression and BIC, only 38 variables were selected. We then built the data matrix with all the interactions between these 38 selected variables and trained the model with Lasso and BIC. The final model includes 37 (of the 38) marginal variables and 11 interaction variables. This methodology has the disadvantage of eliminating SNPs with small effects that can be involved in interactions with low marginals.

After the pre-processing of the WTCCC Breast Cancer dataset, we applied the information interaction method. To determine the threshold

Information Interaction | Number of SNP Pairs |

The null hypothesis considered in this work is that the phenotype is independent of the SNPs. If this is true, then permuting the phenotype should leave the distribution of information interaction values essentially unchanged. However, we observed that in our 1000 permutation tests it was not possible to find any II value above

We also applied the mutual information method to the WTCCC Breast Cancer dataset. As already mentioned, we used 1000 permutation tests to determine the threshold

Mutual Information | Number of SNPs |

The execution of the information interaction method on the WTCCC breast cancer dataset takes around 45 minutes on a 2.3 GHz AMD Opteron processor. The execution of 1000 permutations on a single processor would take about a month, or about 3 days using 10 processors. We also ran some experiments with an Alzheimer's disease dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) with 600000 SNPs, and the execution after data quality procedures took around four days. We did not complete the 1000 permutation tests because we could conclude in advance that the results would not be statistically significant. The complete set of 1000 permutation tests would take several years on a single processor, although it could be parallelized. Since access to a large number of processors is often impractical, there is still a need for methods that require fewer permutation tests to measure statistical significance. We are currently working on methods that estimate the p-values with a much smaller number of permutation tests.
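The running-time estimates above follow from simple arithmetic, sketched here. The pair count assumes that all 15436 SNPs reported for the dataset are analysed, which is an upper bound if pre-processing removed some; the variable names are illustrative.

```python
# Upper bound on the number of SNP pairs scored in one exhaustive pass.
n_snps = 15436
n_pairs = n_snps * (n_snps - 1) // 2

# Breast cancer dataset: about 45 minutes per pass, 1000 permutations.
minutes_per_run = 45
n_permutations = 1000
single_cpu_days = minutes_per_run * n_permutations / (60 * 24)
ten_cpu_days = single_cpu_days / 10

# ADNI dataset: about 4 days per pass, 1000 permutations on one processor.
adni_days_per_run = 4
adni_single_cpu_years = adni_days_per_run * n_permutations / 365
```

This gives roughly 119 million pairs per pass, 31.25 days on one processor (about 3.1 days on ten), and close to 11 years for the ADNI permutations on a single processor.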

In our work, we adapted the source code of the Fleuret method to calculate information interaction over all possible pairs of SNPs. With this approach we benefited from the efficient calculation of conditional mutual information that was already implemented. Even though computational efficiency is an important issue, because we want results in feasible time, it was not our main focus. We gave priority to designing a method that could give us some guarantee of finding pairwise interactions in which both variables have low marginals, if they exist, even if we have to wait for the results. The results on the artificial datasets showed that our method is powerful compared to other state-of-the-art methods. Therefore, if interactions of that type exist in the real breast cancer dataset, we would expect our method to detect them. Another advantage of our information interaction method over stochastic methods is that it is deterministic. One of the problems of stochastic methods is that they can give different results in different runs. Our method performs an exhaustive analysis of pairs of SNPs and always gives the same results.

In the artificial datasets that were used with the information interaction method, all the interactions had low marginals. Each SNP of the pairwise interactions did not have any statistically significant association with the phenotype when considered alone. We showed in our results that our method based on information interaction could find these pairs of SNPs more often than other state of the art methods such as SNPHarvester or BEAM.

The application of the mutual information method to the WTCCC breast cancer dataset found that there were 90 SNPs that individually shared information with the phenotype with a p-value

IIM and MIM (48) | | IIM only (1) | | MIM only (42) | |
SNP | Gene | SNP | Gene | SNP | Gene |
rs1048347 | BTBD16 | rs660895 | MHC | NCBI35_X_15303842 | |
rs1129923 | DUSP23 | | | rs10456324 | LRRC16A |
rs12003 | GMIP | | | rs1059655 | HLA-E |
rs12665700 | MUC22 | | | rs1132200 | TMEM39A |
rs12833456 | KRT72 | | | rs1151687 | OR2G2 |
rs1367580 | FANCM | | | rs11558709 | ENSG00000188699 |
rs1635 | NKAPL | | | rs12984558 | DHX34 |
rs17256042 | TMEM176B | | | rs1385698 | EDA2R |
rs17319801 | ENTHD1 | | | rs1727 | IFIT2 |
rs17641488 | C5orf4 | | | rs17280682 | NLRP14 |
rs1800255 | COL3A1 | | | rs176024 | MAGEC3 |
rs180223 | TG | | | rs1788799 | NPC1 |
rs197414 | DDX20 | | | rs1800280 | DMD |
rs2071299 | SLC17A2 | | | rs1800309 | GAA |
rs2073924 | GBGT1 | | | rs1804027 | SP110 |
rs2273198 | NRD1 | | | rs1968956 | ELTD1 |
rs2281820 | MLN | | | rs2071307 | ELN |
rs2290344 | PIGB | | | rs2072994 | PTCHD2 |
rs2293877 | C14orf55 | | | rs2093066 | BPIFB3 |
rs3088040 | USP36 | | | rs2207337 | |
rs363504 | GRIK1 | | | rs2227289 | CD320 |
rs3764795 | GML | | | rs2229995 | APC |
rs3787429 | HRH3 | | | rs2283432 | FANCI |
rs3810715 | FATE1 | | | rs2706762 | PCYOX1 |
rs4826381 | PAGE3 | | | rs3012075 | CTBP2 |
rs482912 | LAMP3 | | | rs3115572 | |
rs4830 | LGALS14 | | | rs3810510 | JPH2 |
rs4905757 | C14orf177 | | | rs393414 | |
rs592229 | SKIV2L | | | rs4826957 | |
rs5927629 | TAB3 | | | rs4827331 | SYTL5 |
rs5930931 | GPR112 | | | rs4897783 | FLJ46300 |
rs5931046 | GPR101 | | | rs5924658 | PASD1 |
rs5951328 | TAF7L | | | rs5930932 | GPR112 |
rs5956583 | XIAP | | | rs5951332 | ARMCX4 |
rs5969783 | TXLNG | | | rs595413 | FAM217A |
rs5983 | F13A1 | | | rs5955762 | MAP3K15 |
rs598318 | TCP1P2 | | | rs599176 | |
rs604630 | CTSW | | | rs631357 | KIF17 |
rs6553229 | | | | rs652438 | MMP12 |
rs6564956 | SDR42E1 | | | rs6525447 | SLC7A3 |
rs664850 | RGS3 or LOC100288542 | | | rs662204 | MHC |
rs7054230 | PHKA1 | | | rs706107 | SPINK4 |
rs723077 | TTC12 | | | | |
rs7349683 | EPHA5 | | | | |
rs7645635 | LPP | | | | |
rs7879053 | LOC139363 | | | | |
rs90951 | CLEC10A | | | | |
rs911973 | MYO16 | | | | |

Even though the scientific literature refers to the need for methods to discover low marginal interactions, our results suggest that most epistatic interactions with relevance to breast cancer have moderate or high marginals. In addition, none of the interactions discovered with the information interaction method were reported in the STRING network. We believe that this method should be applied to more breast cancer datasets, and to datasets from other diseases, in order to learn more about the kind of epistatic interactions that exist in real diseases.

The authors wish to acknowledge José Caldas for his support in the experiments with regression methods.

This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from

The source code used and developed in this work is available at