Predicting Board Game Collections

GOBBluth89’s Collection

Author

Phil Henrickson

Published

May 21, 2025

About

This report details the results of training and evaluating a classification model for predicting games for a user’s boardgame collection.

Note

To view games predicted by the model, go to Section 5 (Predictions).

Collection

The data in this project comes from BoardGameGeek.com. The data used is at the game level, where an individual observation contains features about a game, such as its publisher, categories, and playing time, among many others.

I train a classification model at the user level to learn the relationship between game features and games that a user owns - what predicts a user’s collection?

username	status	games
GOBBluth89	ever_owned	135
GOBBluth89	own	122

I evaluate the model’s performance on a training set of historical games via resampling, then validate it on a held-out set of newer releases. I then refit the model on the combined training and validation sets and predict upcoming releases to find the new games that the user is most likely to add to their collection.

username	years	type	own: no	own: yes
GOBBluth89	-3500-2021	train	26263	99
GOBBluth89	2022-2023	valid	10288	20
GOBBluth89	2024-2028	test	8590	3
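
The split above is made by publication year. A minimal sketch of that assignment in base R, assuming a hypothetical data frame with a `yearpublished` column; the function name, argument names, and toy rows are illustrative and not the report's actual pipeline:

```r
# Toy stand-in for the game-level data; column names are assumptions.
games_toy <- data.frame(
    game_id = 1:6,
    yearpublished = c(1995, 2019, 2021, 2022, 2023, 2025)
)

# Assign each game to a split using the year boundaries from the table above:
# train through 2021, validation for the next two years, test afterwards.
assign_split <- function(year, end_train_year = 2021, valid_years = 2) {
    ifelse(
        year <= end_train_year, "train",
        ifelse(year <= end_train_year + valid_years, "valid", "test")
    )
}

games_toy$type <- assign_split(games_toy$yearpublished)
table(games_toy$type)
```

Splitting on time rather than at random mirrors how the model is used: it only ever sees older games when asked to score newer ones.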

Types of Games

What types of games does the user own? The following plot displays the most frequent publishers, mechanics, designers, artists, etc. that appear in the user’s collection.

collection |>
    filter(own == 1) |>
    collection_by_category(
        games = games_raw
    ) |>
    plot_collection_by_category() +
    ylab("feature")

The following plot shows the years in which games in the user’s collection were published. This can usually indicate when someone first entered the hobby.

Games in Collection

What games does the user currently have in their collection? The following table can be used to examine games the user owns, along with some helpful information for selecting the right game for a game night!

Use the filters above the table to sort/filter based on information about the game, such as year published, recommended player counts, or playing time.

collection |>
    filter(own == 1) |>
    prep_collection_datatable(
        games = games_raw
    ) |>
    filter(!is.na(image)) |>
    collection_datatable()

Modeling

I’ll now examine the predictive models trained on the user’s collection.

For an individual user, I train a predictive model on their collection in order to predict whether a user owns a game. The outcome, in this case, is binary: does the user have a game listed in their collection or not? This is the setting for training a classification model, where the model aims to learn the probability that a user will add a game to their collection based on its observable features.

How does a model learn what a user is likely to own? The training process is a matter of examining historical games and finding patterns that exist between game features (designers, mechanics, playing time, etc) and games in the user’s collection.

I make use of many potential features for games, the vast majority of which are dummy variables indicating the presence or absence of things such as a particular publisher, artist, or designer. The “standard” BGG features for every game contain information that is typically listed on the box, such as its playing time, player counts, or recommended minimum age.
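
As a rough illustration of what these dummy features look like, the sketch below turns a long table of game/mechanic pairs into a 0/1 indicator matrix using base R. The data frame, its column names, and the mechanic labels are hypothetical; the report's own feature engineering is not shown here and may differ.

```r
# Hypothetical long-format data: one row per game/mechanic pair.
game_mechanics <- data.frame(
    game_id  = c(1, 1, 2, 3, 3, 3),
    mechanic = c("Deck Building", "Hand Management", "Dice Rolling",
                 "Deck Building", "Dice Rolling", "Area Control")
)

# Cross-tabulate games against mechanics, then convert counts to 0/1 flags.
counts <- table(game_mechanics$game_id, game_mechanics$mechanic)
dummies <- (unclass(counts) > 0) * 1L

# Rows are games, columns are mechanics; 1 means the game lists that mechanic.
dummies
```

The same pattern applies to designers, artists, and publishers, which is why the number of features grows quickly.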

Note

I train models to predict whether a user owns a game based only on information that could be observed about the game at its release: playing time, player count, mechanics, categories, genres, and selected designers, artists, and publishers. I do not make use of BGG community information, such as its average rating, weight, or number of user ratings. This is to ensure the model can predict newly released games without relying on information from the BGG community.

What Predicts A Collection?

A predictive model gives us more than just predictions. We can also ask, what did the model learn from the data? What predicts the outcome? In the case of predicting a boardgame collection, what did the model find to be predictive of games a user has in their collection?

To answer this, I examine the coefficients from a logistic regression model with ridge regularization (which I will refer to as a penalized logistic regression).

Positive values indicate that a feature increases a user’s probability of owning/rating a game, while negative values indicate a feature decreases the probability. To be precise, the coefficients indicate the effect of a particular feature on the log-odds of a user owning a game.
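
To make the log-odds interpretation concrete, here is a small worked example in base R. It treats the training base rate (99 owned games out of 26,362) as a stand-in baseline and applies a hypothetical coefficient of 1.5; that value is purely illustrative and is not taken from the fitted model.

```r
# Base rate of ownership in the training set, from the split table above.
p_base <- 99 / (26263 + 99)

# Hypothetical coefficient for a binary feature (illustrative value only).
beta <- 1.5

# Coefficients act additively on the log-odds scale...
log_odds_new <- qlogis(p_base) + beta

# ...which is multiplicative on the odds scale and nonlinear in probability.
p_new <- plogis(log_odds_new)
odds_ratio <- exp(beta)

c(p_base = p_base, p_new = p_new, odds_ratio = odds_ratio)
```

Because ownership is so rare, even a large positive coefficient moves the predicted probability from well under 1% to only a few percent, so the ranking of games matters more than the raw probabilities.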

The following visualization shows the path of each feature as it enters the model, with highly influential features tending to enter the model early with large positive or negative effects. The dotted line indicates the level of regularization that was selected during tuning.

model_glmnet |>
    pluck("wflow", 1) |>
    trace_plot.glmnet(max.overlaps = 30) +
    facet_wrap(~ params$username)
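
The shrinkage behind a coefficient path can be sketched without glmnet. The block below is a minimal base R implementation of ridge-penalized logistic regression fitted by Newton-Raphson on simulated data. It is not the report's glmnet fit: the function `fit_ridge_logit`, the feature names, the simulated effects, and the penalty grid are all assumptions made for illustration.

```r
# Ridge-penalized logistic regression via Newton-Raphson.
# Minimizes mean negative log-likelihood + (lambda / 2) * sum(beta^2),
# leaving the intercept unpenalized.
fit_ridge_logit <- function(x, y, lambda, n_iter = 50) {
    X <- cbind(1, x)                              # add intercept column
    n <- nrow(X)
    penalty <- diag(c(0, rep(lambda, ncol(x))))   # no penalty on intercept
    beta <- rep(0, ncol(X))
    for (i in seq_len(n_iter)) {
        p <- plogis(drop(X %*% beta))
        grad <- -crossprod(X, y - p) / n + penalty %*% beta
        hess <- crossprod(X, X * (p * (1 - p))) / n + penalty
        beta <- beta - drop(solve(hess, grad))
    }
    beta
}

# Simulated binary features with known effects (illustrative only).
set.seed(1)
n <- 500
x <- matrix(rbinom(n * 3, 1, 0.3), ncol = 3,
            dimnames = list(NULL, c("designer_a", "mechanic_b", "publisher_c")))
y <- rbinom(n, 1, plogis(-2 + 1.5 * x[, 1] - 1 * x[, 2]))

# Fit over a decreasing penalty grid; coefficients grow as the penalty relaxes.
lambdas <- c(1, 0.1, 0.01, 0.001)
path <- sapply(lambdas, function(l) fit_ridge_logit(x, y, l))
rownames(path) <- c("(Intercept)", colnames(x))
colnames(path) <- paste0("lambda=", lambdas)
round(path, 3)
```

Reading the columns from left to right corresponds to moving along the horizontal axis of a trace plot: the heavier the penalty, the closer every coefficient sits to zero.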

Partial Effects

What are the effects of individual features?

Use the buttons below to examine the effects different types of predictors had in predicting the user’s collection.

Assessment

How well did the model do in predicting the user’s collection?

This section contains a variety of visualizations and metrics for assessing the performance of the model(s). If you’re not particularly interested in predictive modeling, skip down further to the predictions from the model.

The following table displays the model’s performance in resampling on the training set, on the validation set, and on a holdout set of upcoming games.

metrics |>
    # round every numeric column to three decimals
    mutate(across(where(is.numeric), \(x) round(x, 3))) |>
    # spread to one column per metric
    pivot_wider(
        names_from = .metric,
        values_from = .estimate
    ) |>
    gt::gt() |>
    gt::sub_missing() |>
    gt_options()

username	wflow_id	type	.estimator	mn_log_loss	roc_auc	pr_auc
GOBBluth89	glmnet	resamples	binary	0.017	0.944	0.206
GOBBluth89	glmnet	test	binary	0.005	0.923	0.016
GOBBluth89	glmnet	valid	binary	0.012	0.893	0.055
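
These metrics are easier to read against a no-skill baseline, because owned games are a tiny share of each set. The sketch below computes two such baselines for the validation set from the class counts in the split table. It is plain arithmetic in base R, not the report's yardstick code.

```r
# Class counts in the validation set, from the split table above.
n_no <- 10288
n_yes <- 20
prevalence <- n_yes / (n_no + n_yes)

# Mean log loss of a constant prediction equal to the prevalence.
baseline_log_loss <- -(prevalence * log(prevalence) +
    (1 - prevalence) * log(1 - prevalence))

# A classifier that ranks at random has an expected PR AUC near the
# prevalence and a ROC AUC of 0.5.
round(c(prevalence = prevalence, baseline_log_loss = baseline_log_loss), 4)
```

On these counts the prevalence is about 0.002 and the constant-prediction log loss is about 0.014. Set against the reported validation values, a PR AUC of 0.055 is far above the no-skill level even though it looks small in absolute terms, while a log loss of 0.012 is only a modest improvement over the baseline.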

An easy way to visually examine the performance of a classification model is to view a separation plot.

I plot the predicted probabilities from the model for every game (during resampling) from lowest to highest. I then overlay a blue line for any game that the user does own. A good classifier is one that is able to separate the blue (games owned by the user) from the white (games not owned by the user), with most of the blue occurring at the highest probabilities (left side of the chart).

preds |>
    filter(type %in% c("resamples", "valid")) |>
    plot_separation(outcome = params$outcome)

I can more formally assess how well the model did in resampling by looking at the area under the ROC curve (roc_auc). A perfect model would receive a score of 1, while a model that does no better than random guessing scores 0.5. What counts as a good score depends on the setting, but generally anything in the 0.8 to 0.9 range is very good, while the 0.7 to 0.8 range is perfectly acceptable.

preds |>
    nest(data = -c(username, wflow_id, type)) |>
    mutate(
        roc_curve = map(
            data,
            safely(~ .x |> safe_roc_curve(truth = params$outcome))
        )
    ) |>
    mutate(result = map(roc_curve, ~ .x |> pluck("result"))) |>
    select(username, wflow_id, type, result) |>
    unnest(result) |>
    plot_roc_curve()
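
The area under the ROC curve can also be read as the probability that a randomly chosen owned game receives a higher score than a randomly chosen unowned game. The sketch below computes it that way with the rank-sum identity in base R. The function name and the toy predictions are illustrative; this is not the yardstick implementation behind the report's metrics.

```r
# ROC AUC via the rank-sum (Mann-Whitney) identity.
roc_auc_rank <- function(truth, prob) {
    r <- rank(prob)                     # average ranks handle ties
    n_pos <- sum(truth == 1)
    n_neg <- sum(truth == 0)
    (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy predictions (illustrative values, not from the report).
truth <- c(0, 0, 1, 0, 1, 0, 0, 1)
prob  <- c(0.01, 0.02, 0.30, 0.05, 0.04, 0.03, 0.10, 0.60)

# 13 of the 15 owned/unowned pairs are ordered correctly.
roc_auc_rank(truth, prob)
```

Because the score depends only on the ordering of predictions, it is unaffected by how small the predicted probabilities are, which suits a setting where nearly every game is unowned.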

Top Games in Training

What were the model’s top games in the training set?

preds |>
    filter(type == "resamples") |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games in Validation

What were the model’s top games in the validation set?

preds |>
    filter(type == "valid") |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games by Year

The following table displays the model’s top 15 games for each of the 15 most recent years in the training and validation sets.

preds |>
    filter(type %in% c("resamples", "valid")) |>
    top_n_preds(
        games = games,
        outcome = params$outcome,
        top_n = 15,
        n_years = 15
    ) |>
    gt_top_n(collection = collection |> prep_collection())

Rank	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023
1	Middle-Earth Quest	Runewars	A Game of Thrones: The Board Game (Second Edition)	Descent: Journeys in the Dark (Second Edition)	Glass Road	Star Wars: Imperial Assault	Pandemic Legacy: Season 1	Star Wars: Rebellion	Stop Thief!	Star Wars: X-Wing (Second Edition)	Maracaibo	Unmatched: Little Red Riding Hood vs. Beowulf	Unmatched: Battle of Legends, Volume Two	The Lord of the Rings: The Card Game – Revised Core Set	Undaunted: Battle of Britain
2	Chaos in the Old World	Battles of Westeros	Mansions of Madness	Star Wars: X-Wing Miniatures Game	NFL Game Day	Pandemic: The Cure	Blood Rage	Agricola (Revised Edition)	Century: Spice Road	The Lord of the Rings: The Card Game – Two-Player Limited Edition Starter	Unmatched Game System	Unmatched: Cobble & Fog	Brian Boru: High King of Ireland	Unmatched: Redemption Row	Unmatched: Teen Spirit
3	Kuhhandel Master	Space Hulk: Death Angel – The Card Game	The Lord of the Rings: The Card Game	Galaxy Trucker: Anniversary Edition	Eldritch Horror	Nyakuza	Forbidden Stars	Junk Art	Gloomhaven	Azul: Stained Glass of Sintra	Unmatched: Battle of Legends, Volume One	Unmatched: Jurassic Park – InGen vs Raptors	Boonlake	Unmatched: Hell's Kitchen	Unmatched: For King and Country
4	Age of Conan: The Strategy Board Game	Troyes	Gears of War: The Board Game	Rex: Final Days of an Empire	Caverna: The Cave Farmers	Pandemic: Contagion	Star Wars: X-Wing Miniatures Game – The Force Awakens Core Set	Sherlock Holmes Consulting Detective: Jack the Ripper & West End Adventures	Twilight Imperium: Fourth Edition	Root	Unmatched: Robin Hood vs. Bigfoot	Unmatched: Buffy the Vampire Slayer	Savannah Park	Unmatched: Jurassic Park – Dr. Sattler vs. T. Rex	Unmatched: Brains and Brawn
5	Warhammer: Invasion	DungeonQuest (Third Edition)	Rune Age	Wiz-War (Eighth Edition)	BattleLore: Second Edition	Alchemists	7 Wonders Duel	Scythe	My Little Scythe	Camel Up (Second Edition)	Star Wars: Outer Rim	Undaunted: North Africa	Unfathomable	Unmatched: Houdini vs. The Genie	The Witcher: Old World
6	Cyclades	7 Wonders	Letters from Whitechapel	Keyflower	Impulse	Five Tribes: The Djinns of Naqala	Star Wars: Armada	Terraforming Mars	Folklore: The Affliction	Fireball Island: The Curse of Vul-Kar	Tapestry	Century: Golem Edition – An Endless World	Railroad Ink Challenge: Lush Green Edition	SPYBAM	Rough Draft
7	Chronicle	Dominant Species	Dark Moon	Android: Netrunner	Relic	Camel Up	The King Is Dead	A Feast for Odin	Century: Golem Edition	Rising Sun	Undaunted: Normandy	Gloomhaven: Jaws of the Lion	Galaxy Trucker (Second Edition)	アンドーンテッド：ノルマンディー・プラス (Undaunted: Normandy Plus)	Empire's End
8	Ubongo 3D	Wars of the Roses: Lancaster vs. York	Dungeon Fighter	Star Wars: The Card Game	Rococo	Akrotiri	Mombasa	New Angeles	Spirit Island	The Estates	Subtext	Merv: The Heart of the Silk Road	Railroad Ink Challenge: Shining Yellow Edition	Undaunted: Stalingrad	Arkendom Conquista Starter Set
9	Endeavor	High Frontier	Quarriors!	Agricola: All Creatures Big and Small	Lewis & Clark: The Expedition	Fields of Arle	Oh My Goods!	Arkham Horror: The Card Game	Star Wars: Destiny – Two-Player Game	War Chest	Century: A New World	Eclipse: Second Dawn for the Galaxy	Great Plains	Frosthaven	Witchcraft!
10	The Adventurers: The Temple of Chac	Forbidden Island	King of Tokyo	Antike Duellum	Euphoria: Build a Better Dystopia	Warhammer 40,000: Conquest	Codenames	Great Western Trail	Azul	Cosmic Encounter: 42nd Anniversary Edition	Era: Medieval Age	The Search for Planet X	Bloodborne: The Board Game	Foundations of Rome	Century: Big Box
11	Small World	Sid Meier's Civilization: The Board Game	Dungeon Petz	Clash of Cultures	Nations	Spyfall	Viticulture Essential Edition	Love Letter: Premium Edition	Sherlock Holmes Consulting Detective: Vanishing from Hyde Park	Newton	Century: Golem Edition – Eastern Mountains	New York Zoo	Ark Nova	Foundations of Rome (Emperor Edition)	Chase Chess
12	Wings of War: WW2 Deluxe set	Glen More	Belfort	Il Vecchio	Spyrium	Port Royal	A Game of Thrones: The Card Game (Second Edition)	Sakura Arms	Downforce	Everdell	The King's Dilemma	Lost Ruins of Arnak	Arkham Horror: The Card Game (Revised Edition)	Return to Dark Tower	Unmatched Adventures: Tales to Amaze
13	EVE: Conquests	Earth Reborn	Mage Knight Board Game	Merchant of Venus (Second Edition)	Francis Drake	La Granja	Watson & Holmes	Captain Sonar	Bunny Kingdom	Western Legends	The Taverns of Tiefenthal	Dune: Imperium	Corrosion	Agricola 15	Mad Mars
14	Wings of War: Fire from the Sky	Labyrinth: The War on Terror, 2001 – ?	Eclipse: New Dawn for the Galaxy	Terra Mystica	Sails of Glory	Star Wars: Empire vs. Rebellion	Steampunk Rally	Star Wars: Destiny	Indulgence	Century: Eastern Wonders	The Isle of Cats	Star Wars: Armada – Galactic Republic Fleet Starter	Cascadia	Bardsung	General Orders: World War II
15	Imperial 2030	Mousquetaires du Roy	A Few Acres of Snow	Pax Porfiriana	Cube Quest	Antike II	Kraftwagen	DOOM: The Board Game	Sherlock Holmes Consulting Detective: Carlton House & Queen's Park	Coimbra	Marvel Champions: The Card Game	Star Wars: Armada – Separatist Alliance Fleet Starter	Oath	ISS Vanguard	Pirate Tales
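
The per-year ranking behind a table like this can be sketched in base R: split the predictions by publication year, sort each group by predicted probability, and keep the top rows. The helper `top_n_by_year`, the column names (`yearpublished`, `.pred_yes`), and the toy data are assumptions; the report's own `top_n_preds()` is not shown and may work differently.

```r
# Toy predictions; column names are assumptions for illustration.
preds_toy <- data.frame(
    name = c("A", "B", "C", "D", "E", "F"),
    yearpublished = c(2022, 2022, 2022, 2023, 2023, 2023),
    .pred_yes = c(0.10, 0.40, 0.25, 0.05, 0.30, 0.20)
)

# Keep the n highest-probability games within each year.
top_n_by_year <- function(data, n = 2) {
    by_year <- split(data, data$yearpublished)
    ranked <- lapply(by_year, function(d) {
        d <- d[order(-d$.pred_yes), ]   # highest probability first
        d$rank <- seq_len(nrow(d))
        head(d, n)
    })
    do.call(rbind, ranked)
}

top_n_by_year(preds_toy, n = 2)
```

Ranking within each year rather than across all years keeps older, well-documented games from crowding out recent releases.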

Predictions

New and Upcoming Games

What were the model’s top predictions for new and upcoming board game releases?

new_preds |>
    filter(type == "upcoming") |>
    # imposing a minimum threshold to filter out games with no info
    filter(usersrated >= 1) |>
    # removing this goddamn boxing game that has every mechanic listed
    filter(game_id != 420629) |>
    prep_predictions_datatable(
        games = games_new,
        outcome = params$outcome
    ) |>
    predictions_datatable(outcome = params$outcome)

Older Games

What were the model’s top predictions for older games?