A Notion of Feature Importance by Decorrelation and Detection of Trends by Random Forest Regression

In many studies, we want to determine the influence of certain features on a dependent variable. More specifically, we are interested in the strength of the influence -- i.e., is the feature relevant? -- and, if so, how the feature influences the dependent variable. Recently, data-driven approaches such as \emph{random forest regression} have found their way into applications (Boulesteix et al., 2012). These models allow one to directly derive measures of feature importance, which are a natural indicator of the strength of the influence. For the relevant features, the correlation or rank correlation between the feature and the dependent variable has typically been used to determine the nature of the influence. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. In particular, when machine learning models are used, SHAP scores are a recent and prominent method to determine these trends (Lundberg et al., 2017). In this paper, we introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. Furthermore, we propose two estimators for identifying trends in the data using random forest regression, the so-called absolute and relative transversal rate. We empirically compare the properties of our estimators with those of well-established estimators on a variety of synthetic and real-world datasets.


Introduction
In many studies, scientific researchers are faced with high-dimensional but limited data when determining the influence of specific features on a dependent variable. Typically, the data consist of both numerical and categorical features, and strong, artificial multivariate correlations appear. In particular, when data are generated from observations of live animals or collected in medical procedures, it is very likely that the data are unbalanced and, even worse, that not all combinations of features contain samples. Therefore, it is unlikely that all necessary assumptions of classical statistical tests will be met. Machine learning methods have gained popularity among researchers because they can produce robust effect estimates with minimal assumptions. A plain but prominent example is random forest regression. Due to advances in data science concepts, as well as the increasing computational power available to any research group, such data-driven approaches are finding their way into life science studies [3]. Random forest regression, like many machine learning models, makes few assumptions about the distributions of the underlying data and is particularly robust to noise and outliers. Finally, it allows one to directly derive measures of feature importance, which are a natural indicator of the strength of influence of individual features [2,9,15]. In cases where classical statistical tools such as ANOVA can be applied, it is well known that most features found to be significant by ANOVA also have high feature importance, and vice versa [7,16].
Once relevant features have been found, it is important to determine how the values of the features affect the dependent variable. Probably the oldest approach is to measure the correlation or rank correlation between a feature and the dependent variable. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. A model (e.g., a multivariate linear regression model) is trained, and its parameters can be used to determine trends. Especially when machine learning models are used, SHAP scores [17] are a recent and prominent method to determine these trends. These approaches use the model rather than the raw data. This can help to identify trends that are not directly visible in the data, but are hidden behind noise. On the other hand, a decent model is required for these trends to be reliable.
The goal of this paper is twofold. First, since dependencies between features are known to influence feature importance scores, we introduce a notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. This notion is empirically compared with a similar approach based on residual learning and with the classical impurity-based feature importance and permutation importance. Second, we propose two estimators to identify trends in the data using random forest regression. We exploit the structure of random forests, i.e., at each split node we can compare the average prediction in the left and right subtrees. Since the left subtree is built on data below a threshold and the right subtree contains data above that threshold, this induces a natural estimator of some kind of correlation between the feature and the predicted variable.

Background and Notation
2.1. Feature Importance. With respect to random forests, two types of feature importance scores are well known in the literature. The first is the impurity-based feature importance. The so-called impurity is quantified by the splitting criterion of the collection of contained decision trees. Therefore, it is likely to overestimate the importance of large numerical features (if the data set is not standardized). Furthermore, in the case of overfitting, features that are not predictive on unseen data may be found to be important. For these reasons, a second type of feature importance, the so-called permutation importance, has found its way into the literature and is to be preferred [4]. It is defined as the decrease in model performance when a single feature is randomly shuffled. Of course, this permutation-based approach has its shortcomings, in particular if there are clusters of (highly) correlated features [4]. One approach to overcome this problem, which is often used in the process of feature extraction, is to keep only one variable per cluster [6,10,13]. If the ultimate goal is to design a decent prediction model with as few features as possible, this is the state of the art. But in some cases, researchers are actually more interested in estimating the importance of each feature to determine which features influence the dependent variable and how strongly. In this setting, it may be convenient to treat the correlations differently. There are at least two decorrelation techniques that are usually used either for clustering data or for designing well-performing prediction models: the Gram-Schmidt decorrelation technique [19] and residual-based decorrelation [8]. The main idea in both cases is to subtract from a given feature $F_i$ the information contained in $F_1, \dots, F_{i-1}, F_{i+1}, \dots, F_d$ and to use this residual to train the model.

2.2. Trends. We compare three different ways to define trends in the data set. The simplest way one might think of examining a trend between the values of a feature $X$ and the predicted variable $Y$ is to use the correlation coefficient $r(X, Y) := \mathrm{Cov}(X, Y)/(\sigma(X)\sigma(Y))$, which reflects linear trends. More general correlation coefficients that handle arbitrary monotone trends are the various types of rank correlation coefficients, such as the Spearman correlation coefficient $\rho(X, Y) = r(R(X), R(Y))$, where $R(\cdot)$ denotes the rank function. This method of finding trends is well established and only considers the observable raw data.
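As a concrete illustration, both coefficients are available in SciPy. In the small sketch below (variable names are our own), the relationship is monotone but nonlinear, so the rank correlation detects the trend more fully than the linear one.

```python
import numpy as np
from scipy import stats

# Monotone but nonlinear relationship: Pearson's r understates the trend,
# while Spearman's rho (Pearson applied to ranks) captures it fully.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 4.0, size=200)
y = np.exp(x)  # strictly increasing in x

r, _ = stats.pearsonr(x, y)     # linear correlation r(X, Y)
rho, _ = stats.spearmanr(x, y)  # rank correlation rho(X, Y) = r(R(X), R(Y))

print(f"r = {r:.3f}, rho = {rho:.3f}")  # rho is (numerically) 1 here
```

Since the transform is strictly increasing, the ranks agree exactly and the Spearman coefficient reaches its maximum, while the Pearson coefficient reflects only the linear part of the dependence.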
Another approach does not look at the raw data, but fits a model and looks for trends in that model. Many practitioners identify trends in multivariate tasks by fitting a linear model to the data and interpreting the sign and corresponding $p$-value of the coefficient of a feature as a trend. We will denote this coefficient by $r_{LM}(F)$. However, we will see that this can be very misleading, even for very simple data sets.
In recent years, an old concept from mathematical game theory, called Shapley values, has been used to interpret machine learning models [14]. In particular, they are well understood mathematically for tree-based models and random forests. The Shapley value of a feature with respect to a data point measures how much the feature value contributes to the prediction compared to the average prediction, and is defined as the average marginal contribution of the feature value among all possible combinations of features. For a formal definition, see Shapley's original paper [17], and for a detailed discussion of how to use the concept in machine learning, see [12,18]. Clearly, these Shapley scores can be used to determine trends.

Studied Datasets.
To test the performance of our estimators in practice, we use two well-known real data sets, the Kaggle fish market data set (FISH) [1] and the California housing data set (HOUSING) [5]. In addition, we create three synthetic data sets to explore certain aspects of the estimators.
FISH contains the records of seven different common fish species in fish market sales. The features are species, weight, vertical length, diagonal length, transverse length, height, and width for each fish. Of these features, we used weight, height, and width to predict vertical length. The California housing data refers to the houses found in a given California county and summary statistics based on the 1990 census data. The features are longitude, latitude, median age of the house, total number of rooms, total number of bedrooms, population, number of households, median income, and ocean proximity for each county, with median house value as the prediction target. We transformed the ocean proximity feature into an ordinal scale.
The first synthetic data set (SYN1) is derived from a base data set $B$ consisting of 1000 samples and 10 features, 3 of which are informative. The base data set is standardised by removing the mean and scaling to unit variance. It is then combined with a noise data set $N$, standardised in the same way and with the same structure but no informative features. A family of data sets $\{D_w\}$ is obtained by mixing $B$ with the noise data set at noise proportion $w$; SYN1 consists of the combination of the base data set with 250 different random noise data sets. SYN1 is used to compare the robustness of trend estimators.
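A minimal sketch of this construction, assuming the mixing is a convex combination of the standardized base and noise data (the exact mixing rule is not spelled out here, so this is one plausible reading, and the helper names are our own):

```python
import numpy as np

def standardize(D):
    """Remove the column means and scale each column to unit variance."""
    return (D - D.mean(axis=0)) / D.std(axis=0)

def mix_with_noise(B, N, w):
    """Combine a base data set B with a noise data set N of the same shape,
    both standardized, with noise proportion w in [0, 1]."""
    return (1.0 - w) * standardize(B) + w * standardize(N)
```

At `w = 0` the mixture is the (standardized) base data set, at `w = 1` it is pure noise, and intermediate values interpolate between the two.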
The second synthetic data set (SYN2) consists of 100 samples with independently generated features $X_0, X_1, X_2$. Furthermore, given $X_0$, we define $A_i = X_0 + W_i$ for noise terms $W_i$ of different strengths. The true label is given by $Y = 4X_0^{1.5} + 2X_1 + 0.5X_2^2$. Thus, the real labels depend on $X_0, X_1, X_2$, and $A_0, A_1, A_2$ can be considered noisy instances of $X_0$ with different types of dependencies. SYN2 is used to compare different notions of feature importance.
The third synthetic data set (SYN3) consists of 100 samples with only one informative feature $X_0$, defined as previously. Moreover, $A_0, A_1, A_2$ are defined as above. The true labels are now given by $Y = 4X_0^{1.5}$. SYN3 is used to compare the notions of feature importance on a cluster of correlated features, in direct comparison to SYN2, in which two additional, uncorrelated, informative features are present.

3.1. Finding Trends in a Dataset. We compare the commonly used correlation coefficients $r$ and $\rho$, the linear model-based trend estimator, and a Shapley-based trend estimator, and we propose two novel estimators based on random forest regression to determine the trends of features. For this purpose, we simply define the Shapley-based trend of a feature as the correlation between its values $X$ and its Shapley values $s(X)$, so that we obtain the estimators $r(X, s(X))$ and $\rho(X, s(X))$, respectively.
The two proposed trend estimators are the absolute and the relative transversal rate. The random forest regression model uses an ensemble of uncorrelated decision trees. At each node, the current data set is partitioned into two partition classes based on the values of the node's feature. We assume without loss of generality that the data in the left partition class belong to small feature values and the data in the right partition class belong to large feature values. To determine the trend of a feature $F$, it seems reasonable to compare the mean of the predicted variable in the left and right partition classes per node. If the average value of the predicted variable in the left tree is smaller than in the right tree, this corresponds to a positive correlation with the feature $F$. More formally, let $\{F_j\}_{j=1,\dots,n}$ denote the set of nodes in the random forest in which the data is partitioned with respect to feature $F$. The corresponding partition classes are called $L(F_j)$ and $R(F_j)$. If the feature $F$ is clear from the context, we abbreviate these classes to $L_j$ and $R_j$. Furthermore, for a subset $A$ of the values of the predicted variable, we define $\mathrm{AVG}(A) = |A|^{-1} \sum_{a \in A} a$ as the average value of the set $A$.
This allows us to define our trend estimators.

Definition 1. Given a random forest $R$, let $\{F_j\}_{j=1,\dots,n}$ denote the set of nodes in the random forest in which the data is split with respect to feature $F$. The absolute transversal rate of feature $F$ is defined as
$$\mathrm{ATR}(F) := \frac{1}{n} \sum_{j=1}^{n} \mathrm{sign}\big(\mathrm{AVG}(R_j) - \mathrm{AVG}(L_j)\big),$$
and the relative transversal rate of feature $F$ is defined as
$$\mathrm{RTR}(F) := \frac{1}{n} \sum_{j=1}^{n} \frac{\mathrm{AVG}(R_j) - \mathrm{AVG}(L_j)}{|\mathrm{AVG}(R_j)| + |\mathrm{AVG}(L_j)|}.$$

The ATR formalizes the idea that we have a trend when a higher feature value causes the model to return a higher value. The RTR additionally takes into account the relative difference between the average values in the partition classes.

3.2. Measures of Importance. We compare four different notions of feature importance. Two of them, the impurity-based feature importance and the permutation-based feature importance, are well-studied objects [4]. We use scikit-learn's default implementation of these measures.
In addition, we introduce two novel types of feature importance based on residual learning. The idea is that the importance of feature $F_i$ is determined by its residuals given the features $F_1, \dots, F_{i-1}, F_{i+1}, \dots, F_d$. With a slight abuse of notation, we interpret $F_i \in \mathbb{R}^n$ as the vector of all values corresponding to feature $F_i$ and denote by $Y \in \mathbb{R}^n$ the values of the dependent variable. We denote by $A_j$ an arbitrary algorithm that takes $F_1, \dots, F_{j-1}$ as input and outputs a vector in $\mathbb{R}^n$. Given a fixed permutation $\pi$ of $[d]$, we denote by $i_1, \dots, i_d$ the new order under $\pi$. To determine the importance of $F_i$, we determine its importance under all permutations $\pi$ with the property that $i_d = i$ and weight it by the performance of a model consisting only of the feature $F_i$. The interpretation is as follows: given all other features, what can be learned from feature $F_{i_d}$? The algorithm to compute the importance can now be expressed as follows.
• For all permutations $\pi$ which map $i \mapsto d$, do the following:
  – Train a random forest with features $\{W^\pi_{i_j}\}$.
  – Determine the impurity-based feature importance of $W^\pi_{i_d}$.
• Determine the average feature importance of $F_i$ as the mean over all $W^\pi_{i_d}$; call this $(FI)_i$.
• Train a random forest regressor $R_i$ with feature $F_i$ and dependent variable $Y$ and measure $r(R_i(F_i), Y)$.
After applying this algorithm, we are left with $\tilde{f}_1, \dots, \tilde{f}_d$, where $\tilde{f}_i := (FI)_i \cdot r(R_i(F_i), Y)$. Finally, we define the feature importance based on the residual algorithm $A$ as the standardized version of this estimator, namely $f_i(A) := \tilde{f}_i / \sum_{j=1}^{d} \tilde{f}_j$. Formally, the algorithm is given as Algorithm 1. We note the following.
• $f_j(A)$ is a random quantity because it depends on the training of the random forest regressors $R_1, \dots, R_d$ and of the random forests using the features $\{W^\pi_{i_j}\}$.
• In applications, it may not be possible to iterate over all permutations $\pi$. Instead, the average impurity-based feature importance is estimated by sampling some permutations.
• The algorithm is highly dependent on the residual algorithm $A$.
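Putting these steps together, the procedure can be sketched as follows. Here `residualize` stands for the residual algorithm $A$ and is supplied by the caller; the weighting by a one-feature model and the final standardization follow our reading of the description above, and all function and parameter names are our own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def residual_importance(F, y, residualize, n_perm=5, seed=0):
    """For each feature i: sample permutations ending in i, residualize the
    features in that order, train a forest on the residuals, and read off the
    impurity importance of the last residual. The average over permutations is
    weighted by the correlation between a one-feature model's prediction and
    y, and the result is standardized to sum to one."""
    rng = np.random.default_rng(seed)
    d = F.shape[1]
    raw = np.empty(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        scores = []
        for _ in range(n_perm):
            order = [int(j) for j in rng.permutation(others)] + [i]
            W = residualize(F, order)  # columns in permuted order, residualized
            rf = RandomForestRegressor(n_estimators=50, random_state=seed).fit(W, y)
            scores.append(rf.feature_importances_[-1])  # importance of W_{i_d}
        # performance weight: one-feature model on F_i alone
        ri = RandomForestRegressor(n_estimators=50, random_state=seed).fit(F[:, [i]], y)
        raw[i] = np.mean(scores) * np.corrcoef(ri.predict(F[:, [i]]), y)[0, 1]
    return raw / raw.sum()  # standardized importance scores
```

Any residual algorithm with the signature `residualize(F, order)` can be plugged in, including the two variants analyzed below.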

Algorithm 1. Residual-based feature importance.

In this contribution, we empirically analyze the feature importance based on two different residual algorithms: classical residual learning by random forest regression and decorrelation by the Gram-Schmidt method.
Residual Learning-based Feature Importance. Following [8], it is a natural idea to define the family of residual algorithms $A_2, \dots, A_d$ as a family of random forest regressors. More precisely, given $W^\pi_{i_1}, \dots, W^\pi_{i_{j-1}}$, we train a random forest regressor $R$ on those features with the dependent variable $F_{i_j}$. Hence, $W^\pi_{i_j} := F_{i_j} - R(W^\pi_{i_1}, \dots, W^\pi_{i_{j-1}})$. Thus, we subtract from $F_{i_j}$ everything that can be learned by random forest regressors from the first $j-1$ features under $\pi$. This approach is classically known as residual learning and finds prominent applications in machine learning [11].
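A simplified in-sample sketch of this residual algorithm (in practice one might prefer out-of-fold predictions, since in-sample forest residuals are optimistically small; the helper name is our own):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def residual_learning(F, order, seed=0):
    """Process features in the given order; each feature is replaced by its
    residual after a random forest regression on the preceding residuals."""
    F = np.asarray(F, dtype=float)
    W = np.empty((F.shape[0], len(order)))
    for j, idx in enumerate(order):
        if j == 0:
            W[:, 0] = F[:, idx]  # nothing to subtract for the first feature
        else:
            rf = RandomForestRegressor(n_estimators=30, random_state=seed)
            rf.fit(W[:, :j], F[:, idx])
            W[:, j] = F[:, idx] - rf.predict(W[:, :j])  # W = F - R(W_1..W_{j-1})
    return W
```

Because the regressor is nonparametric, this variant can also remove nonlinear dependencies between features, at the price of residuals that themselves depend on the randomness of the forest.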
Gram-Schmidt decorrelation-based Feature Importance. Another natural approach is to use the famous Gram-Schmidt orthogonalization technique. While it has been used in mathematics for over a century to generate orthogonal bases of vector spaces, it was first applied in the early 2000s to find independent components in complex data sets [19]. The most important observation is that the covariance is an inner product, so the very general Gram-Schmidt orthogonalization technique can be applied with the covariance to create decorrelated features. Here, we define
$$W^\pi_{i_j} := F_{i_j} - \sum_{k=1}^{j-1} \frac{\mathrm{Cov}(F_{i_j}, W^\pi_{i_k})}{\mathrm{Var}(W^\pi_{i_k})} W^\pi_{i_k}.$$
A major advantage may be that this orthogonalization method, unlike the above approach, is fully mathematically tractable. However, it may be brittle with respect to nonlinear dependencies.
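This decorrelation can be sketched directly (our own helper; centering each column first lets the plain dot product play the role of the covariance inner product):

```python
import numpy as np

def gram_schmidt_decorrelate(F):
    """Sequentially decorrelate the columns of F (n_samples x d): each column
    is replaced by its residual after projecting out, with respect to the
    covariance inner product, all previously decorrelated columns."""
    F = np.asarray(F, dtype=float)
    W = np.empty_like(F)
    for j in range(F.shape[1]):
        w = F[:, j] - F[:, j].mean()  # center: covariance becomes a dot product
        for k in range(j):
            wk = W[:, k]
            var = wk @ wk
            if var > 1e-12:  # skip (numerically) constant residuals
                w = w - ((w @ wk) / var) * wk
        W[:, j] = w
    return W
```

After this pass, the sample covariance between any two output columns vanishes (up to floating-point error), while each residual retains exactly the part of its feature that is linearly unexplained by its predecessors.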

Finding Trends.
In the following, we report our empirical results on the performance of the different trend estimators on the HOUSING, FISH, and SYN1 datasets.
4.1.1. SYN1. This data set was used to test the robustness of the different trend estimators with respect to the mixing of the data set with noise. To do this, the trend estimator module was applied to $D_w$ for each $w \in \{0.01, \dots, 1\}$ and each of the 250 random noise data sets. In our experiment, the aggregated output shows that both ATR and RTR, as well as the Shapley correlation, are more robust than the linear model for the informative features (Fig. 2). The Shapley values are the most robust, followed by RTR and ATR. Interestingly, non-informative features were also assigned large $\rho_S$, $r_S$, RTR, and ATR values.
4.1.2. FISH. We performed three experiments on the FISH data set, using the features Weight, Height, and Width to predict Length. All three selected features are positively correlated with the target (Fig. 3). First, we applied the trend estimation module to the FISH data set. To control for random effects, we performed 100 bootstrapping iterations, each sampling a subset of 70% of the data. The linear regression model assigned a negative coefficient to the Height feature, while the other trend estimators reported a positive trend (Fig. 4).
To evaluate the robustness of the trend estimators to noise, we used a random mixing strategy similar to that used to create SYN1. The FISH data were standardized and mixed with random noise ranging from 0% to 99% noise before being used as input to the trend estimator module. We found that the linear model and the RTR became unstable as the feature-to-target correlations $r$ and $\rho$ decreased, while the ATR and the Shapley measures $r_S$ and $\rho_S$ remained relatively unaffected up to much higher mixing rates (Fig. 5).
4.1.3. HOUSING. The feature population is very weakly negatively correlated with the housing price. However, all trend estimators report a significant negative trend for population. The feature total rooms is positively correlated with the target. However, the linear model assigns a negative coefficient to the total number of rooms; all other trend estimators report a positive trend (Fig. 6).
4.2. Measures of Importance. We compare the impurity-based feature importance, the permutation-based feature importance, and the feature importance induced by the two described residual algorithms (residual learning and Gram-Schmidt decorrelation). A run consists of fitting a random forest. To determine the residual-based importance scores, for each feature, 20 permutations that assign this feature to the last position are sampled independently.
First, we compare the different scores on the synthetic data sets SYN2 and SYN3 (see Fig. 7). Perhaps the most important observation is that the impurity-based feature importance assigns the same score to all features -- in both data sets. This is in strong contrast to all other feature importance scores. It is noteworthy that both residual-based approaches produce very comparable scores on the given data sets. Both residual-based approaches and the permutation-based score assign roughly the same score to $X_0$ and the slightly noisy variant $A_0$. However, feature $A_1$, which is subject to much more noise, receives a significantly higher score under residual-based scoring. Especially on SYN2, the residual-based approaches assign a not too small score to all informative features. The permutation-based score for the informative features $X_1$ and $X_2$ is comparatively small. However, all scores assign a higher importance to the noisy instance $A_0$ of $X_0$ than to the informative features $X_1$ and $X_2$.

Next, we compare the different scores on the real data sets HOUSING and FISH (see Fig. 8). For HOUSING, it is most striking that the residual learning-based, impurity-based, and permutation-based scores assign the largest value to the median income, followed by the proximity to the ocean and the latitude/longitude, while the Gram-Schmidt-based score assigns a large value only to the median income, and all other features receive comparable scores. In addition, the population is found to be more important by the impurity-based and permutation-based approaches, as opposed to the residual learning-based approach.
For FISH, all measures assign the highest score to Weight, and all measures assign a nonvanishing score to all three variables. However, the score of Width is within one standard deviation of the other scores in the residual learning-based, impurity-based, and permutation-based approaches. Only the Gram-Schmidt-based score assigns a significantly larger value to Weight and considers Height to be the second most important feature.

Conclusion
We present two novel estimators for monotone trends in a data set based on random forest regression. They perform much more reliably than the often proposed linear model coefficient and are robust to noise. However, the SHAP values perform equally well and are much better understood from a theoretical point of view. Nevertheless, we believe that the transversal rate-based approach has its merits. It depends only on the random forest model (trained on some data set), and the computation is completely independent of the specific data once the model exists. SHAP values, on the other hand, are computed as a combination of the model and some data (which may also have its own merits). With respect to feature importance, we introduced the residual-based approach. We compared the results on synthetic data and two real instances. It is noteworthy that both residual-based approaches produce comparable results on the synthetic data sets, but this may be due to the fact that the noise is added linearly. Overall, the residual-based approaches perform much better on highly correlated features than the impurity-based approach. Their results are comparable to those of the permutation-based approach in many facets. However, significant differences were also found. In particular, weakly informative features were assigned higher values than by the permutation-based score. Therefore, we believe that the residual-based feature importance scores should be preferred on data sets with highly dependent features.

Figure 2. Mean and 95% confidence interval for the different trend estimators on SYN1 over 250 independent trials. On the x-axis, the proportion of noise is reported. Features 1-3 are informative, whereas features 4-10 are non-informative.

Figure 3. Pairplot of the used fish market data set features (Weight, Height, and Width) and the predicted variable (Length).

Figure 4. Comparison of the trend estimators on FISH. We report the mean and the standard deviation of the different trend estimators over 100 bootstrap iterations, each containing 70% of the data.

Figure 5. Mean and 95% confidence interval w.r.t. 100 independent iterations over noise on FISH. The x-axis reports the proportion of noise mixed into the real data.

Figure 6. Comparison of the trend estimators on HOUSING. The linear model assigns a negative coefficient to the total rooms feature, even though the feature itself is positively correlated with the target.

Figure 7. Comparison of the four different notions of feature importance on synthetic data. The l.h.s. shows results with respect to SYN2. Here, the labels are generated as $Y = 4X_0^{1.5} + 2X_1 + 0.5X_2^2$, so that two additional (weakly) informative features are present, and the $\{A_i\}$ are given as $X_0 + W_i$ for differently strong noise $W_i$. The r.h.s. reports results with respect to SYN3, where the labels are generated as $Y = 4X_0^{1.5}$.

Figure 8. Comparison of the four different notions of feature importance on real-world instances. The l.h.s. reports the feature importance scores on the FISH data set (mean and standard deviation over 400 independent runs), the r.h.s. on HOUSING (mean and standard deviation over 100 independent runs).