Predicting Board Game Collections

Gyges’s Collection

Author

Phil Henrickson

Published

May 21, 2025

About

This report details the results of training and evaluating a classification model for predicting games for a user’s boardgame collection.

Note

To view games predicted by the model, go to Section 5.

Collection

The data in this project comes from BoardGameGeek.com. The data used is at the game level, where an individual observation contains features about a game, such as its publisher, categories, and playing time, among many others.

I train a classification model at the user level to learn the relationship between game features and the games a user owns. In other words, what predicts a user’s collection?

username	status	games
Gyges	ever_owned	1522
Gyges	own	594
Gyges	rated	1424

I evaluate the model’s performance on a training set of historical games via resampling, then validate its performance on a set-aside set of newer releases. I then refit the model on the training and validation sets combined and predict upcoming releases in order to find new games that the user is most likely to add to their collection.

username	years	type	Own: no	Own: yes
Gyges	-3500-2021	train	25888	474
Gyges	2022-2023	valid	10214	94
Gyges	2024-2028	test	8567	26
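
As a minimal sketch of how a year-based split like the one in the table could be assigned, the snippet below labels games as train, valid, or test from their publication year. The data frame, the column names, and the `assign_split` helper are hypothetical and chosen for illustration; only the year cutoffs come from the table.

```r
# Toy data: hypothetical games with a publication year and an ownership flag.
games <- data.frame(
  game_id = 1:6,
  yearpublished = c(1995, 2019, 2021, 2022, 2023, 2025),
  own = c("yes", "no", "yes", "no", "yes", "no")
)

# Assign each game to a split using the year ranges from the table above:
# train through 2021, valid for the next two years, test afterwards.
assign_split <- function(year, end_train_year = 2021, valid_years = 2) {
  ifelse(
    year <= end_train_year, "train",
    ifelse(year <= end_train_year + valid_years, "valid", "test")
  )
}

games$type <- assign_split(games$yearpublished)

# Count owned and unowned games per split.
split_counts <- table(games$type, games$own)
print(split_counts)
```

Splitting by year rather than at random keeps the validation and test sets strictly newer than the training data, which mirrors how the model is used to predict upcoming releases.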

Types of Games

What types of games does the user own? The following plot displays the most frequent publishers, mechanics, designers, artists, etc. that appear in the user’s collection.

collection |>
    filter(own == 1) |>
    collection_by_category(
        games = games_raw
    ) |>
    plot_collection_by_category() +
    ylab("feature")

The following plot shows the years in which games in the user’s collection were published. This can usually indicate when someone first entered the hobby.

Games in Collection

What games does the user currently have in their collection? The following table can be used to examine games the user owns, along with some helpful information for selecting the right game for a game night!

Use the filters above the table to sort/filter based on information about the game, such as year published, recommended player counts, or playing time.

collection |>
    filter(own == 1) |>
    prep_collection_datatable(
        games = games_raw
    ) |>
    filter(!is.na(image)) |>
    collection_datatable()

Modeling

I’ll now examine the predictive models trained on the user’s collection.

For an individual user, I train a predictive model on their collection in order to predict whether a user owns a game. The outcome, in this case, is binary: does the user have a game listed in their collection or not? This is the setting for training a classification model, where the model aims to learn the probability that a user will add a game to their collection based on its observable features.

How does a model learn what a user is likely to own? The training process is a matter of examining historical games and finding patterns that exist between game features (designers, mechanics, playing time, etc) and games in the user’s collection.

I make use of many potential features for games, the vast majority of which are dummies indicating the presence or absence of things such as a publisher, artist, or designer. The “standard” BGG features for every game contain information that is typically listed on the box, such as its playing time, player counts, or recommended minimum age.
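
To illustrate what these dummy features look like, here is a small sketch in base R that turns a multi-valued field into one 0/1 column per value. The toy data and column names are hypothetical, and this is not necessarily how the report's own preprocessing is implemented.

```r
# Toy data: hypothetical games, each with one or more mechanics.
games <- data.frame(
  game_id = c(1, 2, 3),
  mechanics = c(
    "Deck Building, Hand Management",
    "Worker Placement",
    "Hand Management"
  )
)

# Split the comma-separated field into one vector of values per game.
values <- strsplit(games$mechanics, ",\\s*")

# One column per distinct value: 1 if the game has it, 0 otherwise.
all_values <- sort(unique(unlist(values)))
dummies <- t(vapply(
  values,
  function(v) as.integer(all_values %in% v),
  integer(length(all_values))
))
colnames(dummies) <- paste0("mechanics_", gsub(" ", "_", tolower(all_values)))

features <- cbind(games["game_id"], dummies)
print(features)
```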

Note

I train models to predict whether a user owns a game based only on information that could be observed about the game at its release: playing time, player count, mechanics, categories, genres, and selected designers, artists, and publishers. I do not make use of BGG community information, such as its average rating, weight, or number of user ratings. This is to ensure the model can predict newly released games without relying on information from the BGG community.

What Predicts A Collection?

A predictive model gives us more than just predictions. We can also ask, what did the model learn from the data? What predicts the outcome? In the case of predicting a boardgame collection, what did the model find to be predictive of games a user has in their collection?

To answer this, I examine the coefficients from a logistic regression model with ridge regularization (which I will refer to as a penalized logistic regression).
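
For reference, a ridge-penalized logistic regression can be written as the objective below. This is a sketch that assumes the usual glmnet parameterization with the mixing parameter set to zero. Here $y_i$ is 1 if the user owns game $i$ and 0 otherwise, $x_i$ is the game's feature vector, $N$ is the number of games, and $\lambda$ is the penalty chosen during tuning.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
+ \frac{\lambda}{2}\,\lVert \beta \rVert_2^2
```

Larger values of $\lambda$ shrink the coefficients toward zero, which matters here because most features are sparse dummies observed in only a handful of games.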

Positive values indicate that a feature increases a user’s probability of owning/rating a game, while negative values indicate a feature decreases the probability. To be precise, the coefficients indicate the effect of a particular feature on the log-odds of a user owning a game.
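
As a worked example of reading a coefficient on the log-odds scale, the sketch below converts log-odds to probabilities with the inverse logit. The intercept and coefficient are made-up values for illustration and are not taken from the fitted model.

```r
# Hypothetical values, chosen only for illustration.
intercept <- -4.0       # baseline log-odds of owning a game
coef_designer <- 1.5    # coefficient on a designer dummy

# Convert log-odds to probabilities with the inverse logit.
p_without <- plogis(intercept)
p_with <- plogis(intercept + coef_designer)

# On the odds scale, the feature multiplies the odds by exp(coefficient).
odds_ratio <- exp(coef_designer)

print(c(without = p_without, with = p_with, odds_ratio = odds_ratio))
```

The same coefficient produces a different change in probability depending on the baseline, which is why the effects are stated in log-odds rather than in probability points.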

The following visualization shows the path of each feature as it enters the model, with highly influential features tending to enter the model early with large positive or negative effects. The dotted line indicates the level of regularization that was selected during tuning.

model_glmnet |>
    pluck("wflow", 1) |>
    trace_plot.glmnet(max.overlaps = 30) +
    facet_wrap(~ params$username)

Partial Effects

What are the effects of individual features?

Use the buttons below to examine the effects different types of predictors had in predicting the user’s collection.

Assessment

How well did the model do in predicting the user’s collection?

This section contains a variety of visualizations and metrics for assessing the performance of the model(s). If you’re not particularly interested in predictive modeling, skip down further to the predictions from the model.

The following table displays the model’s performance in resampling on the training set, on the validation set, and on a holdout set of upcoming games.

metrics |>
    # round every numeric column to three decimal places
    mutate(across(where(is.numeric), \(x) round(x, 3))) |>
    pivot_wider(
        names_from = c(".metric"),
        values_from = c(".estimate")
    ) |>
    gt::gt() |>
    gt::sub_missing() |>
    gt_options()

username	wflow_id	type	.estimator	mn_log_loss	roc_auc	pr_auc
Gyges	glmnet	resamples	binary	0.071	0.873	0.157
Gyges	glmnet	test	binary	0.028	0.943	0.050
Gyges	glmnet	valid	binary	0.047	0.861	0.116
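
To put the pr_auc and mn_log_loss columns in context, it can help to compare them against what a model with no skill would score. As a rough sketch, the code below takes the owned and unowned counts from the split table earlier in this report and computes the share of owned games in each split (roughly what a no-skill classifier would score on pr_auc), along with the log loss of always predicting that share. This assumes the metrics were computed on the same games counted in that table, which may not hold exactly.

```r
# Counts of owned ("yes") and unowned ("no") games, copied from the split table.
splits <- data.frame(
  type = c("train", "valid", "test"),
  no = c(25888, 10214, 8567),
  yes = c(474, 94, 26)
)

# Share of games the user owns in each split.
splits$prevalence <- splits$yes / (splits$no + splits$yes)

# Log loss of a constant prediction equal to the prevalence.
splits$baseline_log_loss <- with(
  splits,
  -(prevalence * log(prevalence) + (1 - prevalence) * log(1 - prevalence))
)

print(splits)
```

Because owned games are rare in every split, a small log loss or a modest pr_auc should be judged against these baselines rather than against 0 or 1.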

An easy way to visually examine the performance of a classification model is to view a separation plot.

I plot the predicted probabilities from the model for every game (during resampling) from lowest to highest. I then overlay a blue line for any game that the user does own. A good classifier is one that is able to separate the blue (games owned by the user) from the white (games not owned by the user), with most of the blue occurring at the highest probabilities (left side of the chart).

preds |>
    filter(type %in% c("resamples", "valid")) |>
    plot_separation(outcome = params$outcome)

I can more formally assess how well each model did in resampling by looking at the area under the ROC curve (roc_auc). A perfect model would receive a score of 1, while a model that cannot predict the outcome will default to a score of 0.5. The extent to which something is a good score depends on the setting, but generally anything in the .8 to .9 range is very good while the .7 to .8 range is perfectly acceptable.

preds |>
    nest(data = -c(username, wflow_id, type)) |>
    mutate(
        roc_curve = map(
            data,
            safely(~ .x |> safe_roc_curve(truth = params$outcome))
        )
    ) |>
    mutate(result = map(roc_curve, ~ .x |> pluck("result"))) |>
    select(username, wflow_id, type, result) |>
    unnest(result) |>
    plot_roc_curve()
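
For intuition about what roc_auc measures, the sketch below computes it in base R using the rank-sum identity: the score equals the probability that a randomly chosen owned game receives a higher prediction than a randomly chosen unowned game. The `roc_auc_rank` function and the toy predictions are hypothetical. This is not the function used to produce the metrics in this report.

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
# Ties in the predictions count as half, through average ranks.
roc_auc_rank <- function(truth, estimate) {
  pos <- truth == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  ranks <- rank(estimate)
  (sum(ranks[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predictions, for illustration only.
truth <- c(0, 0, 1, 0, 1, 1)
estimate <- c(0.05, 0.20, 0.30, 0.40, 0.70, 0.90)

auc <- roc_auc_rank(truth, estimate)
print(auc)
```

A model that ranks every owned game above every unowned game scores 1, and one that assigns the same prediction to every game scores 0.5.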

Top Games in Training

What were the model’s top games in the training set?

preds |>
    filter(type == "resamples") |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games in Validation

What were the model’s top games in the validation set?

preds |>
    filter(type %in% c("valid")) |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games by Year

The following table displays the model’s top 15 games for each of the 15 most recent years in the training and validation sets.

preds |>
    filter(type %in% c("resamples", "valid")) |>
    top_n_preds(
        games = games,
        outcome = params$outcome,
        top_n = 15,
        n_years = 15
    ) |>
    gt_top_n(collection = collection |> prep_collection())

Rank	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023
1	Endeavor	Labyrinth: The War on Terror, 2001 – ?	Ora et Labora	Terra Mystica	Terror in Meeple City	Blue Moon Legends	Warhammer Quest: The Adventure Card Game	Agricola (Revised Edition)	Spirit Island	Nemesis	Tainted Grail: The Fall of Avalon	Rush M.D.	The Great Wall	Horizons of Spirit Island	Earthborne Rangers
2	Hansa Teutonica	Earth Reborn	Ascending Empires	Archipelago	Impulse	Red7	The King Is Dead	Scythe	Pandemic Legacy: Season 2	Concordia Venus	Mega Empires: The West	The Vote: Suffrage and Suppression in America	Imperium: Legends	Agricola 15	Deliverance
3	Vasco da Gama	7 Wonders	King of Tokyo	Libertalia	Disc Duelers	La Granja	Trans-Siberian Railroad	One Deck Dungeon	This War of Mine: The Board Game	The Edge: Dawnfall	Ancient Civilizations of the Inner Sea	Pandemic Legacy: Season 0	Excavation Earth	Dire Alliance: Horror	Nekojima
4	Dungeon Twister 2: Prison	Innovation	Dungeon Petz	Tzolk'in: The Mayan Calendar	Concordia	Roll for the Galaxy	Star Wars: X-Wing Miniatures Game – The Force Awakens Core Set	Terraforming Mars	Folklore: The Affliction	Dungeon Alliance	Cloudspire	Altar Quest	Kemet: Blood and Sand	Lasting Tales	Monster Pit
5	Revolution!	Dust Tactics	Colonial: Europe's Empires Overseas	Among the Stars	BattleCON: Devastation of Indines	Irish Gauge	Mysterium	Exceed Fighting System	878 Vikings: Invasions of England	Cosmic Encounter: 42nd Anniversary Edition	Pax Pamir: Second Edition	Etherfields	Ankh: Gods of Egypt	アンドーンテッド：ノルマンディー・プラス (Undaunted: Normandy Plus)	Undaunted: Battle of Britain
6	Fear and Faith	Norenberc	Dungeon Fighter	We Didn't Playtest This: Legacies	Steam Park	Thunderstone Advance: Worlds Collide	Pandemic Legacy: Season 1	The Manhattan Project: Energy Empire	Gloomhaven	Orc-lympics	Castle Itter: The Strangest Battle of WWII	The King Is Dead: Second Edition	Mega Empires: The East	Undaunted: Stalingrad	Ascent of Dragons
7	Alea Iacta Est	Forbidden Island	Mage Knight Board Game	Pocket Battles: Macedonians vs. Persians	7-Card Slugfest	Pixel Tactics 3	7 Wonders Duel	The Others	Anachrony	Heroes of Terrinoth	Era: Medieval Age	Undaunted: North Africa	Stargrave: Science Fiction Wargames in the Ravaged Galaxy	Frosthaven	Too Many Bones: Unbreakable
8	Pocket Battles: Celts vs. Romans	The Hobbit	Eminent Domain	The Resistance: Avalon	Glass Road	New Dawn	Pixel Tactics Deluxe	Millennium Blades	Myth: Dark Frontier	Betrayal Legacy	Nights of Fire: Battle for Budapest	Reign of Witches	Assassin's Creed: Brotherhood of Venice	Libertalia: Winds of Galecrest	Voidfall
9	Last Train to Wensleydale	Glen More	Eclipse: New Dawn for the Galaxy	Keyflower	Star Trek: Attack Wing	Viticulture: Complete Collector's Edition	Bottom of the 9th	Rogue Stars: Skirmish Wargaming in a Science Fiction Underworld	Dark Souls: The Board Game	Lords of Hellas	Undaunted: Normandy	Hallertau	Hour of Need: Judge and Jury	Gateway Island	Masters of the Universe: The Board Game – Clash for Eternia
10	Claustrophobia	Catacombs	Lancaster	We Didn't Playtest This at All with Chaos Pack Expansion	City of Iron	Antike II	Forbidden Stars	Advanced Song of Blades and Heroes	One Deck Dungeon: Forest of Shadows	Critical Mass: Patriot vs Iron Curtain	Brook City	Hansa Teutonica: Big Box	Nicaea	Marvel Zombies: Heroes' Resistance	Empire's End
11	Hellenes: Campaigns of the Peloponnesian War	Pocket Battles: Elves vs. Orcs	Star Trek: Fleet Captains	Bolt Action	Forbidden Desert	A Fistful of Kung Fu: Hong Kong Movie Wargame Rules	BattleCON: Fate of Indines	A Feast for Odin	Pandemic: Rising Tide	New Frontiers	Tiny Epic Mechs	Eclipse: Second Dawn for the Galaxy	Blood of the Northmen	One Deck Galaxy	Marvel Zombies: A Zombicide Game
12	Shipyard	Masques	City Tycoon	Pixel Tactics	Batman: Gotham City Strategy Game	The Witcher Adventure Game	Viticulture Essential Edition	Hit Z Road	Wasteland Express Delivery Service	War Chest	Godtear	By Stealth and Sea	Good Puppers	Pisces: A High-Stakes Fishing Competition	Tamashii: Chronicle of Ascend
13	We Didn't Playtest This Either	Warmachine Prime Mk II	Dungeons & Dragons: Wrath of Ashardalon Board Game	1989: Dawn of Freedom	Euphoria: Build a Better Dystopia	Power Grid Deluxe: Europe/North America	Stockpile	The Fog of War	Gaia Project	Champions of Hara	Bios: Origins (Second Edition)	Planet Apocalypse	Canvas	Mosaic: A Story of Civilization	Expeditions
14	Win, Lose, or Banana	Dominant Species	Last Will	Lords of Waterdeep	Pixel Tactics 2	Warmachine: High Command – Faith & Fortune	Pixel Tactics 5	SeaFall	Startups	Trapwords	Living Planet: Deluxe Edition	Merv: The Heart of the Silk Road	Steamwatchers	ISS Vanguard	Storybook Battles
15	Dungeon Lords	Flying Lead	Malifaux Rules Manual	New Amsterdam	This Is Not a Test: Post-Apocalyptic Skirmish Rules	Flip City	Adorable Pandaring	Days of Ire: Budapest 1956	Widower's Wood: An Iron Kingdoms Adventure Board Game	The Walking Dead: No Sanctuary	Hellboy: The Board Game – Deluxe Edition	Anachrony: Infinity Box	Blitzkrieg!: World War Two in 20 Minutes	Nemesis: Lockdown	Fire for Light
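
A table like the one above can be built by ranking games within each year by predicted probability and keeping the top few. The sketch below does this in base R on toy data. The `top_n_by_year` helper and the `.pred_yes` column name are assumptions for illustration, not the report's own `top_n_preds()` implementation.

```r
# Hypothetical predictions, for illustration only.
preds <- data.frame(
  name = c("A", "B", "C", "D", "E", "F"),
  yearpublished = c(2022, 2022, 2022, 2023, 2023, 2023),
  .pred_yes = c(0.10, 0.60, 0.30, 0.05, 0.20, 0.80)
)

top_n_by_year <- function(data, n = 2) {
  # Sort by year, then by predicted probability from highest to lowest.
  sorted <- data[order(data$yearpublished, -data$.pred_yes), ]
  # Rank games within each year and keep the first n.
  sorted$rank <- ave(sorted$.pred_yes, sorted$yearpublished, FUN = seq_along)
  sorted[sorted$rank <= n, ]
}

top <- top_n_by_year(preds, n = 2)

# Spread into one column per year, with one row per rank.
wide <- reshape(
  top[, c("rank", "yearpublished", "name")],
  idvar = "rank", timevar = "yearpublished", direction = "wide"
)
print(wide)
```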

Predictions

New and Upcoming Games

What were the model’s top predictions for new and upcoming board game releases?

new_preds |>
    filter(type == "upcoming") |>
    # imposing a minimum threshold to filter out games with no info
    filter(usersrated >= 1) |>
    # removing this goddamn boxing game that has every mechanic listed
    filter(game_id != 420629) |>
    prep_predictions_datatable(
        games = games_new,
        outcome = params$outcome
    ) |>
    predictions_datatable(outcome = params$outcome)

Older Games

What were the model’s top predictions for older games?