I'll briefly talk about what I tried this time and what did or didn't work for me :-)
I started from kxx's Kernel [1] and made a lot of changes to it. Most of what I did were model tricks.


For categorical

\[ LL = \frac{30 \cdot \bar{y} + \sum_{i \in G} y_i}{30 + |G|} \] where \(G\) is the set of indices of the training examples that share the same value of the categorical feature, \(|G|\) is the size of group \(G\), and \(\bar{y}\) is the global mean of the target. The number "30" here is a smoothing constant; you can change it to any reasonable value. [2]
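Here's a minimal sketch of this smoothed encoding in R (the data frame `train`, the column `CAT_FEATURE`, and the target name `TARGET` are just placeholders):

library(dplyr)

k     <- 30                  # smoothing constant from the formula above
y_bar <- mean(train$TARGET)  # global mean of the target

cat_encoding <- train %>% 
  group_by(CAT_FEATURE) %>% 
  summarise(likelihood_enc = (k * y_bar + sum(TARGET)) / (k + n()))  # smoothed group mean

train <- train %>% left_join(cat_encoding, by = "CAT_FEATURE")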


For Numeric


Quick summary

| Trick | Worked? | Note |
| --- | --- | --- |
| n-way | X | - |
| WoE | X | - |
| xgb-embedding | X | - |
| count encoding | X | - |
| label encoding | O | label encoding gave the highest CV |
| one-hot encoding | X | - |
| likelihood encoding | X | - |
| grouped mean | O | grouped by n-way interactions; other stats like sd/min/max failed too (see the sketch after this table) |
| lgb_prediction | O | trained on all tables |
| diff features | O | - |
| kmeans on EXTs | X | it decreased my CV a lot (0.001), but I didn't check the diversity, that's my bad |
| ridge tricks | X | trained on all tables |
| binning | X | - |
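For the grouped mean trick, here's a minimal sketch of what I mean (the grouping columns and the numeric column are only illustrative, not necessarily the exact ones I used):

library(dplyr)

# mean of a numeric feature grouped by a 2-way categorical interaction
grp_mean <- application_train %>% 
  group_by(NAME_EDUCATION_TYPE, CODE_GENDER) %>% 
  summarise(grp_mean_ext1 = mean(EXT_SOURCE_1, na.rm = TRUE)) %>% 
  ungroup()

application_train <- application_train %>% 
  left_join(grp_mean, by = c("NAME_EDUCATION_TYPE", "CODE_GENDER"))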


Table

For the numeric features in each table, I joined them in an aggregated way (only max / min / mean, and for some of them also sum) and did some messy FE (some ratios / handling DPD situations / really basic cleaning). For the categorical ones, I only counted them or transformed them into ratios, like:

# Here's a quick example:
library(dplyr)

sum_pc_balance_all <- poscash_balance %>% 
  group_by(SK_ID_CURR) %>% 
  summarise(count_act_pc = sum(NAME_CONTRACT_STATUS == "Signed"),       # just count as frequency
            ratio_act_pc = sum(NAME_CONTRACT_STATUS == "Signed") / n()  # ratio over all records
            )
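The numeric aggregations follow the same pattern; a minimal sketch (the column `CNT_INSTALMENT` is just an example of a numeric column being aggregated):

agg_pc_balance <- poscash_balance %>% 
  group_by(SK_ID_CURR) %>% 
  summarise(max_cnt_instalment  = max(CNT_INSTALMENT, na.rm = TRUE),
            min_cnt_instalment  = min(CNT_INSTALMENT, na.rm = TRUE),
            mean_cnt_instalment = mean(CNT_INSTALMENT, na.rm = TRUE))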

And I also generated some features based on specific time periods (past 6m / 1yr / 2yr / 3yr / 5yr), and these helped a little.
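For example, a minimal sketch of a 1-year window (assuming the table has MONTHS_BALANCE with 0 as the most recent month; the aggregated column is again only illustrative):

pc_balance_1yr <- poscash_balance %>% 
  filter(MONTHS_BALANCE >= -12) %>%   # keep only the last 12 months
  group_by(SK_ID_CURR) %>% 
  summarise(mean_cnt_instalment_1yr = mean(CNT_INSTALMENT, na.rm = TRUE))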
About lgbm, I didn't spend too much time on tuning, but this time, in my case, super heavy regularization helped improve performance.
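To give a rough idea of what I mean by heavy regularization, here's a placeholder sketch (the values below are only illustrative, not my final parameters, and train_x / train_y are placeholders):

library(lightgbm)

params <- list(objective        = "binary",
               metric           = "auc",
               learning_rate    = 0.02,
               num_leaves       = 31,
               lambda_l1        = 10,     # heavier-than-usual L1/L2 penalties
               lambda_l2        = 10,
               min_data_in_leaf = 200,
               feature_fraction = 0.5,
               bagging_fraction = 0.8,
               bagging_freq     = 1)

dtrain <- lgb.Dataset(data = as.matrix(train_x), label = train_y)
model  <- lgb.train(params = params, data = dtrain, nrounds = 2000)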


Reference