I'll briefly talk about what I tried this time and what did or didn't work for me :-)
I started from kxx's Kernel [1] and made a lot of changes to it. Most of what I did were model tricks.


For categorical

\[ LL = \frac{30 \cdot \bar{y} + \sum_{i \in G} y_i}{30 + |G|} \] where \(G\) is the set of indices of the training examples that share the same value of the categorical feature, \(|G|\) is the size of group \(G\), and \(\bar{y}\) is the global mean of the target. The number "30" here is a smoothing constant; you can change it to any reasonable value. [2]
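Here's a minimal sketch of this smoothed encoding in R (the data frame `train`, the column `CAT_FEATURE`, and the target name `TARGET` are just placeholders):

library(dplyr)

k     <- 30                  # smoothing constant from the formula above
y_bar <- mean(train$TARGET)  # global mean of the target

cat_encoding <- train %>% 
  group_by(CAT_FEATURE) %>% 
  summarise(likelihood_enc = (k * y_bar + sum(TARGET)) / (k + n()))  # smoothed group mean

train <- train %>% left_join(cat_encoding, by = "CAT_FEATURE")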


For Numeric


Quick summary

| Trick | Worked? | Note |
| --- | --- | --- |
| n-way | X | - |
| WoE | X | - |
| xgb-embedding | X | - |
| count encoding | X | - |
| label encoding | O | label encoding gave the highest CV |
| one-hot encoding | X | - |
| likelihood encoding | X | - |
| grouped mean | O | grouped by n-way interactions; other stats like sd/min/max failed too (see the sketch after this table) |
| lgb_prediction | O | trained on all tables |
| diff features | O | - |
| kmeans on EXTs | X | it decreased my CV a lot (0.001), but I didn't check the diversity, that's my bad |
| ridge tricks | X | trained on all tables |
| binning | X | - |
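For the grouped mean trick, here's a minimal sketch of what I mean (the grouping columns and the numeric column are only illustrative, not necessarily the exact ones I used):

library(dplyr)

# mean of a numeric feature grouped by a 2-way categorical interaction
grp_mean <- application_train %>% 
  group_by(NAME_EDUCATION_TYPE, CODE_GENDER) %>% 
  summarise(grp_mean_ext1 = mean(EXT_SOURCE_1, na.rm = TRUE)) %>% 
  ungroup()

application_train <- application_train %>% 
  left_join(grp_mean, by = c("NAME_EDUCATION_TYPE", "CODE_GENDER"))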


Table

For the numeric features in each table, I joined them in an aggregated way (only max / min / mean, and for some of them also sum) and did some messy FE (some ratios / handling DPD situations / really basic cleaning). For the categorical ones, I only counted them or transformed them into ratios, like:

# Here's a quick example:
library(dplyr)

sum_pc_balance_all <- poscash_balance %>% 
  group_by(SK_ID_CURR) %>% 
  summarise(count_act_pc = sum(NAME_CONTRACT_STATUS == "Signed"),       # just count as frequency
            ratio_act_pc = sum(NAME_CONTRACT_STATUS == "Signed") / n()  # ratio over all records
            )
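The numeric aggregations follow the same pattern; a minimal sketch (the column `CNT_INSTALMENT` is just an example of a numeric column being aggregated):

agg_pc_balance <- poscash_balance %>% 
  group_by(SK_ID_CURR) %>% 
  summarise(max_cnt_instalment  = max(CNT_INSTALMENT, na.rm = TRUE),
            min_cnt_instalment  = min(CNT_INSTALMENT, na.rm = TRUE),
            mean_cnt_instalment = mean(CNT_INSTALMENT, na.rm = TRUE))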

And I also generated some features based on specific time periods (past 6m / 1yr / 2yr / 3yr / 5yr), and these helped a little.
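For example, a minimal sketch of a 1-year window (assuming the table has MONTHS_BALANCE with 0 as the most recent month; the aggregated column is again only illustrative):

pc_balance_1yr <- poscash_balance %>% 
  filter(MONTHS_BALANCE >= -12) %>%   # keep only the last 12 months
  group_by(SK_ID_CURR) %>% 
  summarise(mean_cnt_instalment_1yr = mean(CNT_INSTALMENT, na.rm = TRUE))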
About lgbm, I didn't spend too much time on tuning, but this time, in my case, super heavy regularization helped improve performance.
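To give a rough idea of what I mean by heavy regularization, here's a placeholder sketch (the values below are only illustrative, not my final parameters, and train_x / train_y are placeholders):

library(lightgbm)

params <- list(objective        = "binary",
               metric           = "auc",
               learning_rate    = 0.02,
               num_leaves       = 31,
               lambda_l1        = 10,     # heavier-than-usual L1/L2 penalties
               lambda_l2        = 10,
               min_data_in_leaf = 200,
               feature_fraction = 0.5,
               bagging_fraction = 0.8,
               bagging_freq     = 1)

dtrain <- lgb.Dataset(data = as.matrix(train_x), label = train_y)
model  <- lgb.train(params = params, data = dtrain, nrounds = 2000)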


Reference