Predicting Board Game Collections

ZeeGarcia’s Collection

Author

Phil Henrickson

Published

May 21, 2025

About

This report details the results of training and evaluating a classification model for predicting games for a user’s boardgame collection.

Note

To view games predicted by the model, go to Section 5.

Collection

The data in this project comes from BoardGameGeek.com. The data used is at the game level, where an individual observation contains features about a game, such as its publisher, categories, and playing time, among many others.

I train a classification model at the user level to learn the relationship between game features and the games that a user owns. In other words, what predicts a user’s collection?

username	status	games
ZeeGarcia	ever_owned	1975
ZeeGarcia	own	435
ZeeGarcia	rated	2512

I evaluate the model’s performance on a training set of historical games via resampling, then validate it on a set-aside set of newer releases. I then refit the model on the combined training and validation sets and predict upcoming releases to find the new games that the user is most likely to add to their collection.

username	years	type	own: no	own: yes
ZeeGarcia	-3500-2021	train	26085	277
ZeeGarcia	2022-2023	valid	10203	105
ZeeGarcia	2024-2028	test	8540	53
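
The year-based split in the table above can be sketched in base R as follows. This is a minimal illustration rather than the report's actual pipeline: the data frame `games_example`, its columns, and the helper `assign_split()` are hypothetical, and the year cutoffs are simply read off the table.

```r
# Hypothetical toy data: one row per game, with its publication year
# and whether the user owns it.
games_example <- data.frame(
    game_id = 1:6,
    yearpublished = c(1995, 2015, 2021, 2022, 2023, 2025),
    own = c("yes", "no", "yes", "no", "yes", "no")
)

# Assign each game to a split based on its publication year.
# Assumed cutoffs, taken from the table above: through 2021 = train,
# the next two years = valid, anything later = test.
assign_split <- function(year, end_train_year = 2021, valid_years = 2) {
    ifelse(
        year <= end_train_year, "train",
        ifelse(year <= end_train_year + valid_years, "valid", "test")
    )
}

games_example$type <- assign_split(games_example$yearpublished)

# Count owned and not-owned games in each split.
split_counts <- table(games_example$type, games_example$own)
print(split_counts)
```

Splitting on publication year rather than at random mirrors how the model is used in practice: it is trained on older games and asked to predict newer ones.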

Types of Games

What types of games does the user own? The following plot displays the most frequent publishers, mechanics, designers, artists, etc. that appear in a user’s collection.

Show the code

collection |>
    filter(own == 1) |>
    collection_by_category(
        games = games_raw
    ) |>
    plot_collection_by_category() +
    ylab("feature")

The following plot shows the years in which games in the user’s collection were published. This can usually indicate when someone first entered the hobby.

Games in Collection

What games does the user currently have in their collection? The following table can be used to examine games the user owns, along with some helpful information for selecting the right game for a game night!

Use the filters above the table to sort/filter based on information about the game, such as year published, recommended player counts, or playing time.

Show the code

collection |>
    filter(own == 1) |>
    prep_collection_datatable(
        games = games_raw
    ) |>
    filter(!is.na(image)) |>
    collection_datatable()

Modeling

I’ll now examine the predictive models trained on the user’s collection.

For an individual user, I train a predictive model on their collection in order to predict whether a user owns a game. The outcome, in this case, is binary: does the user have a game listed in their collection or not? This is the setting for training a classification model, where the model aims to learn the probability that a user will add a game to their collection based on its observable features.

How does a model learn what a user is likely to own? The training process is a matter of examining historical games and finding patterns between game features (designers, mechanics, playing time, etc.) and the games in the user’s collection.

I make use of many potential features for games, the vast majority of which are dummies indicating the presence or absence of things such as a publisher/artist/designer. The “standard” BGG features for every game contain information that is typically listed on the box: its playing time, player counts, or its recommended minimum age.
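
As a rough illustration of what these dummy features look like, the sketch below builds 0/1 indicator columns from lists of mechanics in base R. The game names, the mechanics, and the helper `make_dummies()` are made up for the example; the report's own feature engineering is not shown in this section.

```r
# Hypothetical input: each game has a vector of mechanics.
mechanics <- list(
    game_a = c("Deck Building", "Cooperative"),
    game_b = c("Worker Placement"),
    game_c = c("Cooperative", "Worker Placement")
)

# Build a games x mechanics matrix of 0/1 indicators:
# 1 if the game has the mechanic, 0 otherwise.
make_dummies <- function(x) {
    lvls <- sort(unique(unlist(x)))
    out <- t(vapply(
        x,
        function(values) as.integer(lvls %in% values),
        integer(length(lvls))
    ))
    colnames(out) <- lvls
    out
}

dummies <- make_dummies(mechanics)
print(dummies)
```

Each column then enters the model as its own feature, so a designer or publisher that appears often in the collection can receive its own coefficient.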

Note

I train models to predict whether a user owns a game based only on information that could be observed about the game at its release: playing time, player count, mechanics, categories, genres, and selected designers, artists, and publishers. I do not make use of BGG community information, such as its average rating, weight, or number of user ratings. This is to ensure the model can predict newly released games without relying on information from the BGG community.

What Predicts A Collection?

A predictive model gives us more than just predictions. We can also ask, what did the model learn from the data? What predicts the outcome? In the case of predicting a boardgame collection, what did the model find to be predictive of games a user has in their collection?

To answer this, I examine the coefficients from a logistic regression model with ridge regularization (which I will refer to as a penalized logistic regression).

Positive values indicate that a feature increases a user’s probability of owning/rating a game, while negative values indicate a feature decreases the probability. To be precise, the coefficients indicate the effect of a particular feature on the log-odds of a user owning a game.
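
Written out, and assuming the standard formulation of ridge-penalized logistic regression (the notation is illustrative and not taken from the report's code), the model for the probability $p_i$ that the user owns game $i$ with features $x_{i1}, \dots, x_{iJ}$ is:

```latex
\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{J} \beta_j x_{ij}

\hat{\beta} = \arg\min_{\beta_0,\, \beta}
  \left\{
    -\frac{1}{N} \sum_{i=1}^{N}
      \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]
    + \frac{\lambda}{2} \sum_{j=1}^{J} \beta_j^2
  \right\}
```

Here $y_i$ is 1 if the user owns game $i$ and 0 otherwise, and $\lambda$ is the amount of regularization chosen during tuning. The penalty term is written as in glmnet's objective with `alpha = 0`. For a dummy feature, $\beta_j$ is the change in log-odds when the feature is present, which is the same as multiplying the odds of ownership by $\exp(\beta_j)$.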

The following visualization shows the path of each feature as it enters the model, with highly influential features tending to enter the model early with large positive or negative effects. The dotted line indicates the level of regularization that was selected during tuning.

Show the code

model_glmnet |>
    pluck("wflow", 1) |>
    trace_plot.glmnet(max.overlaps = 30) +
    facet_wrap(~ params$username)

Partial Effects

What are the effects of individual features?

Use the buttons below to examine the effects different types of predictors had in predicting the user’s collection.

Assessment

How well did the model do in predicting the user’s collection?

This section contains a variety of visualizations and metrics for assessing the performance of the model(s). If you’re not particularly interested in predictive modeling, skip down further to the predictions from the model.

The following table displays the model’s performance in resampling on the training set, on the validation set, and on a holdout set of upcoming games.

Show the code

metrics |>
    mutate(across(where(is.numeric), ~ round(.x, 3))) |>
    pivot_wider(
        names_from = c(".metric"),
        values_from = c(".estimate")
    ) |>
    gt::gt() |>
    gt::sub_missing() |>
    gt_options()

username	wflow_id	type	.estimator	mn_log_loss	roc_auc	pr_auc
ZeeGarcia	glmnet	resamples	binary	0.045	0.892	0.145
ZeeGarcia	glmnet	test	binary	0.036	0.870	0.073
ZeeGarcia	glmnet	valid	binary	0.044	0.900	0.162
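
For reference, the `mn_log_loss` column is the mean log loss, where lower is better. The sketch below computes it in base R and compares it against a naive baseline that predicts the same ownership rate for every game. The helper `mean_log_loss()` is hypothetical and written only for this illustration; the only numbers taken from the report are the training counts (277 owned, 26085 not owned).

```r
# Mean log loss for a binary outcome.
# truth: 1 if the user owns the game, 0 otherwise.
# prob:  predicted probability of ownership.
mean_log_loss <- function(truth, prob, eps = 1e-15) {
    # Clip probabilities away from 0 and 1 to avoid log(0).
    prob <- pmin(pmax(prob, eps), 1 - eps)
    -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
}

# Naive baseline on the training set: predict the overall
# ownership rate for every game.
n_yes <- 277
n_no <- 26085
truth <- c(rep(1, n_yes), rep(0, n_no))
base_rate <- n_yes / (n_yes + n_no)
baseline <- mean_log_loss(truth, rep(base_rate, length(truth)))
print(baseline)
```

Because only about 1% of games are owned, even this baseline has a small log loss, so the model's values are best judged relative to it rather than in absolute terms.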

An easy way to visually examine the performance of a classification model is to view a separation plot.

I plot the predicted probabilities from the model for every game (during resampling) from lowest to highest. I then overlay a blue line for any game that the user does own. A good classifier is one that is able to separate the blue (games owned by the user) from the white (games not owned by the user), with most of the blue occurring at the highest probabilities (left side of the chart).

Show the code

preds |>
    filter(type %in% c("resamples", "valid")) |>
    plot_separation(outcome = params$outcome)

I can more formally assess how well each model did in resampling by looking at the area under the ROC curve (roc_auc). A perfect model would receive a score of 1, while a model that cannot predict the outcome will default to a score of 0.5. What counts as a good score depends on the setting, but generally anything in the 0.8 to 0.9 range is very good, while the 0.7 to 0.8 range is perfectly acceptable.

Show the code

preds |>
    nest(data = -c(username, wflow_id, type)) |>
    mutate(
        roc_curve = map(
            data,
            safely(~ .x |> safe_roc_curve(truth = params$outcome))
        )
    ) |>
    mutate(result = map(roc_curve, ~ .x |> pluck("result"))) |>
    select(username, wflow_id, type, result) |>
    unnest(result) |>
    plot_roc_curve()
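
To make the roc_auc score more concrete, the sketch below computes it directly from ranks in base R. The area under the ROC curve equals the probability that a randomly chosen owned game receives a higher predicted probability than a randomly chosen unowned game. The helper `auc_rank()` and the toy data are hypothetical and are not the functions used in the report.

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
# truth: 1 if the user owns the game, 0 otherwise.
# prob:  predicted probability of ownership.
auc_rank <- function(truth, prob) {
    n_pos <- sum(truth == 1)
    n_neg <- sum(truth == 0)
    # rank() assigns average ranks to ties, so tied pairs count as 0.5.
    ranks <- rank(prob)
    (sum(ranks[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 5 of the 6 owned/unowned pairs are ordered correctly.
truth <- c(0, 0, 1, 0, 1)
prob <- c(0.1, 0.2, 0.3, 0.4, 0.9)
auc <- auc_rank(truth, prob)
print(auc)
```

A model that gives every game the same probability scores exactly 0.5 under this definition, which is the "cannot predict the outcome" case described above.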

Top Games in Training

What were the model’s top games in the training set?

Show the code

preds |>
    filter(type == "resamples") |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games in Validation

What were the model’s top games in the validation set?

Show the code

preds |>
    filter(type %in% c("valid")) |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games by Year

The following table displays the model’s top 15 games for each of the 15 most recent years in the training and validation sets.

Show the code

preds |>
    filter(type %in% c("resamples", "valid")) |>
    top_n_preds(
        games = games,
        outcome = params$outcome,
        top_n = 15,
        n_years = 15
    ) |>
    gt_top_n(collection = collection |> prep_collection())

Rank	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023
1	Dice Town	7 Wonders	Takenoko	Mage Wars Arena	Lewis & Clark: The Expedition	Abyss	7 Wonders Duel	Conan	Gloomhaven	Château Aventure	Cthulhu: Death May Die	Furnace	Moving Pictures	Sea Salt & Paper	Welcome To...: Collector's Edition
2	Mr. Jack in New York	Hanabi	Mage Knight Board Game	Robinson Crusoe: Adventures on the Cursed Island	Eldritch Horror	Five Tribes: The Djinns of Naqala	Pandemic Legacy: Season 1	Reign of Cthulhu	Circle the Wagons	Rising Sun	Deep Blue	Puerto Rico	Sleeping Gods	Frosthaven	Empire's End
3	Finca	Earth Reborn	Ninjato	Escape: The Curse of the Temple	Longhorn	Pandemic: The Cure	Arboretum	Scythe	DIG	Everdell	Wingspan	Pandemic Legacy: Season 0	Pandemic: Hot Zone – Europe	Knight Fall	Arkeis
4	Cyclades	Forbidden Island	Mundus Novus	Libertalia	Legacy: The Testament of Duke de Crecy	Imperial Settlers	Raptor	Iberia	Pandemic Legacy: Season 2	Root	Trails of Tucana	Via Magica	Flourish	Tribes of the Wind	Numbsters
5	Endeavor	Troyes	Summoner Wars: Master Set	Ginkgopolis	City of Iron	Port Royal	Mission: Red Planet (Second/Third Edition)	Arkham Horror: The Card Game	Meeple Circus	Legendary Encounters: The X-Files Deck Building Game	Naga Raja	Gloomhaven: Jaws of the Lion	Cascadia	Endless Winter: Paleoamericans	Earth
6	Jaipur	Glen More	Puerto Rico	Fleet	Forbidden Desert	Madame Ching	Trambahn	When I Dream	Near and Far	Seals	The Magnificent	Hues and Cues	Botanik	Amsterdam	Ticket to Ride Legacy: Legends of the West
7	Long Shot	Merchants & Marauders	Elder Sign	Love Letter	SOS Titanic	Artifacts, Inc.	GEM	Dice Stars	Mythic Battles: Pantheon	Treasure Island	Antinomy	Nidavellir	Tides	Hamburg	Naturopolis
8	Kuhhandel Master	Innovation	The New Era	Seasons	Ghooost!	AquaSphere	The Little Prince: Rising to the Stars	Kanagawa	Smile	Micropolis	Herbaceous Sprouts	Top Ten	Ankh: Gods of Egypt	Marvel Zombies: Heroes' Resistance	51st State: Ultimate Edition
9	La Habana	Hive Pocket	The City	Descent: Journeys in the Dark (Second Edition)	Bruges	Nyakuza	Blood Rage	Legendary Encounters: A Firefly Deck Building Game	Pandemic: Rising Tide	Yellow & Yangtze	Aftermath	Forgotten Waters	ROVE: Results-Oriented Versatile Explorer	Revive	Expeditions
10	Claustrophobia	Merkator	Tournay	Star Wars: The Card Game	Eight-Minute Empire: Legends	Dragon Run	...and then, we held hands.	Islebound	WOO	Architects of the West Kingdom	Draftosaurus	Planet Apocalypse	Arkham Horror: The Card Game (Revised Edition)	Wildtails: A Pirate Legacy	Sleeping Gods: Distant Skies
11	Macao	Tikal II: The Lost Temple	Timeline: Science & Discoveries	Il Vecchio	The Little Prince: Make Me a Planet	Blue Moon Legends	Vs System 2PCG: The Marvel Battles	Terraforming Mars	Azul	Fall of Rome	Siege of the Citadel	Deep Vents	Batman: The Animated Series Adventures – Shadow of the Bat	Wayfarers of the South Tigris	Fire for Light
12	Warhammer: Invasion	Mousquetaires du Roy	Tales & Games: The Hare & the Tortoise	Eight-Minute Empire	Pathfinder Adventure Card Game: Rise of the Runelords – Base Set	Heroes Wanted	Viticulture Essential Edition	Vs System 2PCG: The Defenders	Herbaceous	Arkham Horror (Third Edition)	Tapestry	Viscounts of the West Kingdom	Marvel United: X-Men	Nemesis: Lockdown	51st State: Ultimate Edition (Gamefound Edition)
13	Einauge sei wachsam!	51st State	Friday	Antartik	Corto	Desperados of Dice Town	Tokaido: Deluxe Edition	Quadropolis	RUM	The River	Pandemic: Rapid Response	Lost Ruins of Arnak	Sobek: 2 Players	Now or Never	Sleeping Gods: Primeval Peril
14	Martinique	Luna	Mansions of Madness	Shadows over Camelot: The Card Game	Terror in Meeple City	Red7	Plums	Pocket Madness	The Godfather: Corleone's Empire	Zombicide: Green Horde	Tang Garden	Sleeping Gods: Primeval Peril	Bullet♥︎	Everdell: The Complete Collection	The Castles of Burgundy: Special Edition
15	Food Chain	GOSU	Mondo	Gentlemen Thieves	Cinque Terre	Isle of Trains	Elysium	Mansions of Madness: Second Edition	Legendary Forests	Underwater Cities	Tainted Grail: The Fall of Avalon	Food Chain Island	King of Tokyo: Monster Box	1001 Islands	Forbidden Jungle

Predictions

New and Upcoming Games

What were the model’s top predictions for new and upcoming board game releases?

Show the code

new_preds |>
    filter(type == "upcoming") |>
    # imposing a minimum threshold to filter out games with no info
    filter(usersrated >= 1) |>
    # removing this goddamn boxing game that has every mechanic listed
    filter(game_id != 420629) |>
    prep_predictions_datatable(
        games = games_new,
        outcome = params$outcome
    ) |>
    predictions_datatable(outcome = params$outcome)

Older Games

What were the model’s top predictions for older games?