Predicting Board Game Collections

Gyges’s Collection

Author

Phil Henrickson

Published

May 21, 2025

About

This report details the results of training and evaluating a classification model for predicting games for a user’s boardgame collection.

Note

To view games predicted by the model, go to Section 5.

Collection

The data in this project comes from BoardGameGeek.com. The data used is at the game level, where an individual observation contains features about a game, such as its publisher, categories, and playing time, among many others.

I train a classification model at the user level to learn the relationship between game features and the games that a user owns. In other words: what predicts a user’s collection?

username status games
Gyges ever_owned 1522
Gyges own 594
Gyges rated 1424

I evaluate the model’s performance on a training set of historical games via resampling, then validate it on a set-aside set of newer releases. I then refit the model on the combined training and validation sets and predict upcoming releases to find the new games that the user is most likely to add to their collection.

username | years | type | own: no | own: yes
Gyges | -3500-2021 | train | 25888 | 474
Gyges | 2022-2023 | valid | 10214 | 94
Gyges | 2024-2028 | test | 8567 | 26
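
As a rough illustration of the year-based split summarized in the table above, the following base R sketch assigns games to train, valid, and test partitions. The data frame, its column names, and the `assign_split()` helper are hypothetical stand-ins, not the project's actual code; only the year cutoffs are taken from the table.

```r
# Hypothetical toy data: column names are assumptions, not the report's schema
games <- data.frame(
  game_id = 1:6,
  yearpublished = c(1995, 2019, 2021, 2022, 2023, 2025),
  own = c("yes", "no", "yes", "no", "yes", "no")
)

# Assign each game to a split using the year cutoffs shown in the table
# (train through 2021, valid for the next two years, test afterwards)
assign_split <- function(year, end_train_year = 2021, valid_years = 2) {
  ifelse(
    year <= end_train_year, "train",
    ifelse(year <= end_train_year + valid_years, "valid", "test")
  )
}

games$type <- assign_split(games$yearpublished)

# Count owned and not-owned games within each split
split_counts <- table(games$type, games$own)
split_counts
```

Splitting by publication year rather than at random means the validation and test sets only contain games released after everything the model was trained on, which mirrors how the model is used on upcoming releases.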

Types of Games

What types of games does the user own? The following plot displays the most frequent publishers, mechanics, designers, artists, etc. that appear in the user’s collection.

collection |>
    filter(own == 1) |>
    collection_by_category(
        games = games_raw
    ) |>
    plot_collection_by_category() +
    ylab("feature")

The following plot shows the years in which games in the user’s collection were published. This often indicates when someone first entered the hobby.

Games in Collection

What games does the user currently have in their collection? The following table can be used to examine games the user owns, along with some helpful information for selecting the right game for a game night!

Use the filters above the table to sort/filter based on information about the game, such as year published, recommended player counts, or playing time.

collection |>
    filter(own == 1) |>
    prep_collection_datatable(
        games = games_raw
    ) |>
    filter(!is.na(image)) |>
    collection_datatable()

Modeling

I’ll now examine the predictive models trained on the user’s collection.

For an individual user, I train a predictive model on their collection in order to predict whether a user owns a game. The outcome, in this case, is binary: does the user have a game listed in their collection or not? This is the setting for training a classification model, where the model aims to learn the probability that a user will add a game to their collection based on its observable features.

How does a model learn what a user is likely to own? The training process is a matter of examining historical games and finding patterns that exist between game features (designers, mechanics, playing time, etc.) and the games in the user’s collection.

I make use of many potential features for games, the vast majority of which are dummies indicating the presence or absence of things such as a publisher, artist, or designer. The “standard” BGG features for every game contain information that is typically listed on the box: its playing time, player counts, and recommended minimum age.
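
A minimal sketch of how such dummy features might be built, using base R and a made-up set of games and publishers. The names here are illustrative assumptions; the report's actual preprocessing is not shown in this section.

```r
# Hypothetical toy data: each game lists its publishers as a character vector
publishers <- list(
  "Game A" = c("Publisher X", "Publisher Y"),
  "Game B" = c("Publisher Y"),
  "Game C" = character(0)
)

# One column per distinct publisher; 1 if the game lists it, 0 otherwise
levels_seen <- sort(unique(unlist(publishers)))
dummies <- t(vapply(
  publishers,
  function(p) as.integer(levels_seen %in% p),
  integer(length(levels_seen))
))
colnames(dummies) <- levels_seen
dummies
```

Each row is a game and each column is a presence/absence indicator, so a game with no listed publisher simply gets a row of zeros.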

Note

I train models to predict whether a user owns a game based only on information that could be observed about the game at its release: playing time, player count, mechanics, categories, genres, and selected designers, artists, and publishers. I do not make use of BGG community information, such as its average rating, weight, or number of user ratings. This is to ensure the model can predict newly released games without relying on information from the BGG community.

What Predicts A Collection?

A predictive model gives us more than just predictions. We can also ask, what did the model learn from the data? What predicts the outcome? In the case of predicting a boardgame collection, what did the model find to be predictive of games a user has in their collection?

To answer this, I examine the coefficients from a logistic regression model with ridge regularization (which I will refer to as a penalized logistic regression).

Positive values indicate that a feature increases a user’s probability of owning/rating a game, while negative values indicate a feature decreases the probability. To be precise, the coefficients indicate the effect of a particular feature on the log-odds of a user owning a game.
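
To make the log-odds interpretation concrete, here is a small worked example in base R. The intercept and coefficient are invented for illustration and are not values from the fitted model.

```r
# Hypothetical values chosen for illustration; not taken from the fitted model
intercept <- -4.0      # baseline log-odds of owning a game
coef_designer <- 1.5   # coefficient on a designer indicator

# plogis() is the inverse logit: it maps log-odds to a probability
p_without <- plogis(intercept)
p_with <- plogis(intercept + coef_designer)

# The coefficient is additive on the log-odds scale,
# so exp(coefficient) is the multiplicative change in the odds
odds_ratio <- exp(coef_designer)

c(p_without = p_without, p_with = p_with, odds_ratio = odds_ratio)
```

With these made-up numbers, the feature raises the predicted probability from roughly 0.018 to roughly 0.076, which corresponds to multiplying the odds by exp(1.5), about 4.5.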

The following visualization shows the path of each feature as it enters the model, with highly influential features tending to enter the model early with large positive or negative effects. The dotted line indicates the level of regularization that was selected during tuning.

model_glmnet |>
    pluck("wflow", 1) |>
    trace_plot.glmnet(max.overlaps = 30) +
    facet_wrap(~ params$username)
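
The role of the regularization level can be sketched with a hand-rolled ridge-penalized logistic regression in base R. This is a toy under stated assumptions, not the glmnet fit used in this report: the simulated data, the `ridge_nll()` helper, and the two penalty values are all hypothetical.

```r
# Simulated data (hypothetical): two features, only the first affects the outcome
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
y <- rbinom(100, 1, plogis(x[, 1]))

# Mean negative log-likelihood of a logistic model plus a ridge penalty
# on the non-intercept coefficients
ridge_nll <- function(beta, x, y, lambda) {
  eta <- beta[1] + x %*% beta[-1]
  -mean(y * eta - log1p(exp(eta))) + lambda * sum(beta[-1]^2) / 2
}

# Fit the same model under a weak and a strong penalty
fit_weak <- optim(rep(0, 3), ridge_nll, x = x, y = y, lambda = 0.01, method = "BFGS")
fit_strong <- optim(rep(0, 3), ridge_nll, x = x, y = y, lambda = 1, method = "BFGS")

# Compare the estimated coefficients (intercept first)
rbind(weak = fit_weak$par, strong = fit_strong$par)
```

A larger penalty pulls the coefficients toward zero, which is the shrinkage a trace plot displays as the regularization level changes.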

Partial Effects

What are the effects of individual features?

Use the buttons below to examine the effects different types of predictors had in predicting the user’s collection.

Assessment

How well did the model do in predicting the user’s collection?

This section contains a variety of visualizations and metrics for assessing the performance of the model(s). If you’re not particularly interested in predictive modeling, skip down further to the predictions from the model.

The following displays the model’s performance in resampling on a training set, a validation set, and a holdout set of upcoming games.

metrics |>
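    # round every numeric column to three decimal places
    mutate(across(where(is.numeric), \(x) round(x, 3))) |>
    # spread the metrics into one column each
    pivot_wider(
        names_from = .metric,
        values_from = .estimate
    ) |>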
    gt::gt() |>
    gt::sub_missing() |>
    gt_options()

username wflow_id type .estimator mn_log_loss roc_auc pr_auc
Gyges glmnet resamples binary 0.071 0.873 0.157
Gyges glmnet test binary 0.028 0.943 0.050
Gyges glmnet valid binary 0.047 0.861 0.116
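
For readers unfamiliar with mn_log_loss, the following base R sketch computes mean log loss on a handful of made-up predictions. The `mean_log_loss()` helper and the data are hypothetical; the report's own metrics come from its modeling pipeline, which may handle edge cases differently.

```r
# Hypothetical predictions: truth is 1 if the user owns the game, 0 otherwise
truth <- c(1, 0, 0, 1, 0)
prob <- c(0.9, 0.1, 0.2, 0.6, 0.05)

# Mean log loss: the average negative log-likelihood of the observed outcomes
mean_log_loss <- function(truth, prob, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)  # clip to avoid log(0)
  -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
}

mean_log_loss(truth, prob)
```

Lower values are better. Because owned games are rare, a model can reach a low log loss largely by assigning small probabilities everywhere, which is one reason to read it alongside roc_auc and pr_auc.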

An easy way to visually examine the performance of a classification model is to view a separation plot.

I plot the predicted probabilities from the model for every game (during resampling) from lowest to highest. I then overlay a blue line for any game that the user does own. A good classifier is one that is able to separate the blue (games owned by the user) from the white (games not owned by the user), with most of the blue occurring at the highest probabilities (right side of the chart).

preds |>
    filter(type %in% c("resamples", "valid")) |>
    plot_separation(outcome = params$outcome)

I can more formally assess how well each model did in resampling by looking at the area under the ROC curve (roc_auc). A perfect model would receive a score of 1, while a model that predicts no better than chance scores 0.5. What counts as a good score depends on the setting, but generally anything in the 0.8 to 0.9 range is very good, while the 0.7 to 0.8 range is perfectly acceptable.
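
One way to see what roc_auc measures is the rank-sum identity: the AUC equals the probability that a randomly chosen owned game receives a higher predicted probability than a randomly chosen unowned game. The base R sketch below uses made-up predictions and a hypothetical `auc_rank()` helper; it is not the function used to produce the metrics in this report.

```r
# Hypothetical predictions: truth is 1 if the user owns the game, 0 otherwise
truth <- c(1, 0, 0, 1, 0, 1)
prob <- c(0.9, 0.1, 0.4, 0.35, 0.2, 0.8)

# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(truth, prob) {
  r <- rank(prob)  # average ranks handle tied predictions
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(truth, prob)
```

In this toy example, 8 of the 9 owned/unowned pairs are ordered correctly, so the AUC is 8/9.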

preds |>
    nest(data = -c(username, wflow_id, type)) |>
    mutate(
        roc_curve = map(
            data,
            safely(~ .x |> safe_roc_curve(truth = params$outcome))
        )
    ) |>
    mutate(result = map(roc_curve, ~ .x |> pluck("result"))) |>
    select(username, wflow_id, type, result) |>
    unnest(result) |>
    plot_roc_curve()

Top Games in Training

What were the model’s top games in the training set?

preds |>
    filter(type == "resamples") |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games in Validation

What were the model’s top games in the validation set?

preds |>
    filter(type == "valid") |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games by Year

The following table displays the model’s top 15 games for each of the 15 most recent years in the training and validation sets.

preds |>
    filter(type %in% c("resamples", "valid")) |>
    top_n_preds(
        games = games,
        outcome = params$outcome,
        top_n = 15,
        n_years = 15
    ) |>
    gt_top_n(collection = collection |> prep_collection())

Rank | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023
1 | Endeavor | Labyrinth: The War on Terror, 2001 – ? | Ora et Labora | Terra Mystica | Terror in Meeple City | Blue Moon Legends | Warhammer Quest: The Adventure Card Game | Agricola (Revised Edition) | Spirit Island | Nemesis | Tainted Grail: The Fall of Avalon | Rush M.D. | The Great Wall | Horizons of Spirit Island | Earthborne Rangers
2 | Hansa Teutonica | Earth Reborn | Ascending Empires | Archipelago | Impulse | Red7 | The King Is Dead | Scythe | Pandemic Legacy: Season 2 | Concordia Venus | Mega Empires: The West | The Vote: Suffrage and Suppression in America | Imperium: Legends | Agricola 15 | Deliverance
3 | Vasco da Gama | 7 Wonders | King of Tokyo | Libertalia | Disc Duelers | La Granja | Trans-Siberian Railroad | One Deck Dungeon | This War of Mine: The Board Game | The Edge: Dawnfall | Ancient Civilizations of the Inner Sea | Pandemic Legacy: Season 0 | Excavation Earth | Dire Alliance: Horror | Nekojima
4 | Dungeon Twister 2: Prison | Innovation | Dungeon Petz | Tzolk'in: The Mayan Calendar | Concordia | Roll for the Galaxy | Star Wars: X-Wing Miniatures Game – The Force Awakens Core Set | Terraforming Mars | Folklore: The Affliction | Dungeon Alliance | Cloudspire | Altar Quest | Kemet: Blood and Sand | Lasting Tales | Monster Pit
5 | Revolution! | Dust Tactics | Colonial: Europe's Empires Overseas | Among the Stars | BattleCON: Devastation of Indines | Irish Gauge | Mysterium | Exceed Fighting System | 878 Vikings: Invasions of England | Cosmic Encounter: 42nd Anniversary Edition | Pax Pamir: Second Edition | Etherfields | Ankh: Gods of Egypt | アンドーンテッド:ノルマンディー・プラス (Undaunted: Normandy Plus) | Undaunted: Battle of Britain
6 | Fear and Faith | Norenberc | Dungeon Fighter | We Didn't Playtest This: Legacies | Steam Park | Thunderstone Advance: Worlds Collide | Pandemic Legacy: Season 1 | The Manhattan Project: Energy Empire | Gloomhaven | Orc-lympics | Castle Itter: The Strangest Battle of WWII | The King Is Dead: Second Edition | Mega Empires: The East | Undaunted: Stalingrad | Ascent of Dragons
7 | Alea Iacta Est | Forbidden Island | Mage Knight Board Game | Pocket Battles: Macedonians vs. Persians | 7-Card Slugfest | Pixel Tactics 3 | 7 Wonders Duel | The Others | Anachrony | Heroes of Terrinoth | Era: Medieval Age | Undaunted: North Africa | Stargrave: Science Fiction Wargames in the Ravaged Galaxy | Frosthaven | Too Many Bones: Unbreakable
8 | Pocket Battles: Celts vs. Romans | The Hobbit | Eminent Domain | The Resistance: Avalon | Glass Road | New Dawn | Pixel Tactics Deluxe | Millennium Blades | Myth: Dark Frontier | Betrayal Legacy | Nights of Fire: Battle for Budapest | Reign of Witches | Assassin's Creed: Brotherhood of Venice | Libertalia: Winds of Galecrest | Voidfall
9 | Last Train to Wensleydale | Glen More | Eclipse: New Dawn for the Galaxy | Keyflower | Star Trek: Attack Wing | Viticulture: Complete Collector's Edition | Bottom of the 9th | Rogue Stars: Skirmish Wargaming in a Science Fiction Underworld | Dark Souls: The Board Game | Lords of Hellas | Undaunted: Normandy | Hallertau | Hour of Need: Judge and Jury | Gateway Island | Masters of the Universe: The Board Game – Clash for Eternia
10 | Claustrophobia | Catacombs | Lancaster | We Didn't Playtest This at All with Chaos Pack Expansion | City of Iron | Antike II | Forbidden Stars | Advanced Song of Blades and Heroes | One Deck Dungeon: Forest of Shadows | Critical Mass: Patriot vs Iron Curtain | Brook City | Hansa Teutonica: Big Box | Nicaea | Marvel Zombies: Heroes' Resistance | Empire's End
11 | Hellenes: Campaigns of the Peloponnesian War | Pocket Battles: Elves vs. Orcs | Star Trek: Fleet Captains | Bolt Action | Forbidden Desert | A Fistful of Kung Fu: Hong Kong Movie Wargame Rules | BattleCON: Fate of Indines | A Feast for Odin | Pandemic: Rising Tide | New Frontiers | Tiny Epic Mechs | Eclipse: Second Dawn for the Galaxy | Blood of the Northmen | One Deck Galaxy | Marvel Zombies: A Zombicide Game
12 | Shipyard | Masques | City Tycoon | Pixel Tactics | Batman: Gotham City Strategy Game | The Witcher Adventure Game | Viticulture Essential Edition | Hit Z Road | Wasteland Express Delivery Service | War Chest | Godtear | By Stealth and Sea | Good Puppers | Pisces: A High-Stakes Fishing Competition | Tamashii: Chronicle of Ascend
13 | We Didn't Playtest This Either | Warmachine Prime Mk II | Dungeons & Dragons: Wrath of Ashardalon Board Game | 1989: Dawn of Freedom | Euphoria: Build a Better Dystopia | Power Grid Deluxe: Europe/North America | Stockpile | The Fog of War | Gaia Project | Champions of Hara | Bios: Origins (Second Edition) | Planet Apocalypse | Canvas | Mosaic: A Story of Civilization | Expeditions
14 | Win, Lose, or Banana | Dominant Species | Last Will | Lords of Waterdeep | Pixel Tactics 2 | Warmachine: High Command – Faith & Fortune | Pixel Tactics 5 | SeaFall | Startups | Trapwords | Living Planet: Deluxe Edition | Merv: The Heart of the Silk Road | Steamwatchers | ISS Vanguard | Storybook Battles
15 | Dungeon Lords | Flying Lead | Malifaux Rules Manual | New Amsterdam | This Is Not a Test: Post-Apocalyptic Skirmish Rules | Flip City | Adorable Pandaring | Days of Ire: Budapest 1956 | Widower's Wood: An Iron Kingdoms Adventure Board Game | The Walking Dead: No Sanctuary | Hellboy: The Board Game – Deluxe Edition | Anachrony: Infinity Box | Blitzkrieg!: World War Two in 20 Minutes | Nemesis: Lockdown | Fire for Light

Predictions

New and Upcoming Games

What were the model’s top predictions for new and upcoming board game releases?

new_preds |>
    filter(type == "upcoming") |>
    # imposing a minimum threshold to filter out games with no info
    filter(usersrated >= 1) |>
    # removing this goddamn boxing game that has every mechanic listed
    filter(game_id != 420629) |>
    prep_predictions_datatable(
        games = games_new,
        outcome = params$outcome
    ) |>
    predictions_datatable(outcome = params$outcome)

Older Games

What were the model’s top predictions for older games?