I’ll briefly talk about what I tried and what did/didn’t work for me this time :-)
I started from kxx’s kernel [1] and made a lot of changes to it. Most of what I did were model tricks.
For categorical features
\[ LL = \frac {30 \cdot \bar{y} + \sum_{i \in G} y_i} {30 + |G|} \] where \(G\) is the set of indices of the training examples sharing the same value of the categorical feature, \(|G|\) is the size of group \(G\), and \(\bar{y}\) is the global mean of the target. The number 30 here is a smoothing constant; you can change it to any other reasonable value. [2]
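Roughly, in dplyr this encoding looks something like the sketch below (the data frame `train`, its `TARGET` column, and `NAME_EDUCATION_TYPE` are just placeholders; for real use you’d also want to compute it out-of-fold to avoid leakage):

```r
library(dplyr)

k <- 30                            # the smoothing constant from the formula
global_mean <- mean(train$TARGET)  # ybar, the global target mean

enc_map <- train %>%
  group_by(NAME_EDUCATION_TYPE) %>%
  summarise(cat_enc = (k * global_mean + sum(TARGET)) / (k + n()))

train <- train %>% left_join(enc_map, by = "NAME_EDUCATION_TYPE")
```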
For numeric features
Means of some important features (EXT_1/2/3, AMT_ANNUITY, AMT_CREDIT, AMT_CREDIT / AMT_ANNUITY, …) grouped by different categories. These were computed on train+test. (I also tried this with n-way groupings and it surprisingly dropped my CV; I also tried std/max/min/median etc., but those all failed too.)
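A rough sketch of such grouped means (here `full` is train+test stacked together; the grouping column and the columns being averaged are placeholders):

```r
library(dplyr)

full <- full %>%
  group_by(NAME_EDUCATION_TYPE) %>%
  mutate(mean_ext2_by_cat     = mean(EXT_SOURCE_2, na.rm = TRUE),
         mean_annuity_by_cat  = mean(AMT_ANNUITY, na.rm = TRUE),
         mean_cred_ann_by_cat = mean(AMT_CREDIT / AMT_ANNUITY, na.rm = TRUE)) %>%
  ungroup()
```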
I built an LGB model on all tables to predict some important features (only EXT_1/2/3 in the final version; predicting other features like AMT_ANNUITY or AMT_CREDIT didn’t give positive feedback, which was pretty weird). All predictions were computed in an OOF way.
CAUTION: There are a lot of NA values in EXT_1 and EXT_3. I treated those rows as a “test set”: train on the non-NA rows and predict on the NA rows, as in the sketch below.
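A rough sketch of that idea for EXT_1 (here `feats` is a placeholder feature matrix built from all tables and aligned with `full` = train+test; the OOF loop over the non-NA rows is omitted for brevity, and the parameters are only illustrative):

```r
library(lightgbm)

has_ext1 <- !is.na(full$EXT_SOURCE_1)
dtrain   <- lgb.Dataset(data = feats[has_ext1, ], label = full$EXT_SOURCE_1[has_ext1])

params <- list(objective = "regression", metric = "rmse",
               learning_rate = 0.05, num_leaves = 31)

model <- lgb.train(params = params, data = dtrain, nrounds = 500)

# treat the NA rows as a "test set" and fill them with the model prediction
full$pred_EXT_SOURCE_1 <- full$EXT_SOURCE_1
full$pred_EXT_SOURCE_1[!has_ext1] <- predict(model, feats[!has_ext1, ])
```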
k-means on (ext1,2,3), (pred_ext1,2,3), and (ext1,2,3 with NAs filled by pred_ext), with cluster counts of 3, 5, 9, 15, 30, 40, treating the cluster ids as categorical features. All of them got really high gain in LightGBM.
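Something like the following sketch (here `ext_filled` is a placeholder numeric matrix of ext1/2/3 with NAs already filled, e.g. by the predictions above):

```r
# fit k-means for several cluster counts and keep the cluster id as a factor
for (k in c(3, 5, 9, 15, 30, 40)) {
  km <- kmeans(scale(ext_filled), centers = k, nstart = 10)
  full[[paste0("ext_kmeans_", k)]] <- factor(km$cluster)  # treated as categorical
}
```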
Eight features from Branden (interest rate and related features) gave me about a 0.0015 boost on CV; see Branden’s description.
Quick summary
TRICKS | WORKED? | NOTE |
---|---|---|
n-way | X | - |
WoE | X | - |
xgb-embedding | X | - |
count encoding | X | - |
label encoding | O | label encoding had the highest CV |
one-hot encoding | X | - |
likelihood encoding | X | - |
grouped mean | O | n-way groupings and other stats like sd/min/max failed, though |
lgb_prediction | O | trained on all tables |
diff features | O | - |
kmeans on EXTs | X | it decreased my CV a lot (0.001), but I didn’t check the diversity, that’s my bad |
ridge tricks | X | trained on all tables |
binning | X | - |
Table
For the numeric features in each table, I joined them in an aggregated way (only max/min/mean, and for some also sum) and did some messy FE (some ratios / handling DPD situations / really basic cleaning). For the categorical ones, I only counted them or transformed the counts into ratios, like:
```r
# Here's a quick example:
library(dplyr)

sum_pc_balance_all <- poscash_balance %>%
  group_by(SK_ID_CURR) %>%
  summarise(count_act_pc = sum(NAME_CONTRACT_STATUS == "Signed"),        # just count as frequency
            ratio_act_pc = sum(NAME_CONTRACT_STATUS == "Signed") / n())  # as a ratio
```
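And the numeric side of the same idea, again only as a rough sketch (SK_DPD is just one example column to aggregate):

```r
library(dplyr)

sum_pc_balance_num <- poscash_balance %>%
  group_by(SK_ID_CURR) %>%
  summarise(max_dpd  = max(SK_DPD, na.rm = TRUE),
            min_dpd  = min(SK_DPD, na.rm = TRUE),
            mean_dpd = mean(SK_DPD, na.rm = TRUE),
            sum_dpd  = sum(SK_DPD, na.rm = TRUE))
```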
And I also generated some features based on specific time periods (past 6m/1yr/2yr/3yr/5yr), and these helped a little.
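A sketch of the time-window idea for the 1-year window (MONTHS_BALANCE counts months back from the application in the POS_CASH table; the other windows only change the cutoff, and the aggregated columns are placeholders):

```r
library(dplyr)

pc_last_1yr <- poscash_balance %>%
  filter(MONTHS_BALANCE >= -12) %>%                       # keep only the past year
  group_by(SK_ID_CURR) %>%
  summarise(mean_dpd_1yr   = mean(SK_DPD, na.rm = TRUE),
            count_late_1yr = sum(SK_DPD > 0, na.rm = TRUE))
```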
About lgbm: I didn’t spend too much time on tuning, but this time, in my case, super heavy regularization helped improve performance.
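Not the exact configuration used, just an illustration of what “super heavy regularization” can look like in LightGBM parameters:

```r
params <- list(objective        = "binary",
               metric           = "auc",
               learning_rate    = 0.02,
               num_leaves       = 31,
               min_data_in_leaf = 1000,  # force big leaves
               feature_fraction = 0.3,   # strong column subsampling
               lambda_l1        = 10,
               lambda_l2        = 10)
```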
Reference