Predicting Board Game Collections

J_3MBG’s Collection

Author

Phil Henrickson

Published

May 21, 2025

About

This report details the results of training and evaluating a classification model that predicts which games belong in a user’s board game collection.

Note

To view games predicted by the model, go to Section 5.

Collection

The data in this project comes from BoardGameGeek.com. The data used is at the game level, where an individual observation contains features about a game, such as its publisher, categories, and playing time, among many others.

I train a classification model at the user level to learn the relationship between game features and the games that a user owns. In other words, what predicts a user’s collection?

username	status	games
J_3MBG	ever_owned	638
J_3MBG	own	412
J_3MBG	rated	704

I evaluate the model’s performance on a training set of historical games via resampling, then validate its performance on a set-aside set of newer releases. I then refit the model on the combined training and validation sets and predict upcoming releases in order to find new games that the user is most likely to add to their collection.

username	years	type	own = no	own = yes
J_3MBG	-3500-2021	train	26052	310
J_3MBG	2022-2023	valid	10236	72
J_3MBG	2024-2028	test	8563	30
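
The year-based split summarized in the table above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the data frame, the column names `yearpublished` and `own`, and the helper `assign_split()` are all assumptions made for the example; only the year cutoffs come from the table.

```r
# Hypothetical game-level data; `yearpublished` and `own` are assumed column names.
games <- data.frame(
    game_id = 1:6,
    yearpublished = c(1995, 2019, 2021, 2022, 2023, 2025),
    own = c(1, 0, 1, 0, 1, 0)
)

# Hypothetical helper: assign each game to a split using the year cutoffs
# shown in the table (train through 2021, valid 2022-2023, test afterwards).
assign_split <- function(year, train_end = 2021, valid_end = 2023) {
    ifelse(year <= train_end, "train",
        ifelse(year <= valid_end, "valid", "test"))
}

games$type <- assign_split(games$yearpublished)

# Count owned vs. not owned games within each split.
table(games$type, games$own)
```

Splitting by publication year rather than at random mirrors how the model is used in practice: it is always asked about games newer than the ones it was trained on.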

Types of Games

What types of games does the user own? The following plot displays the most frequent publishers, mechanics, designers, artists, etc. that appear in the user’s collection.

Show the code

collection |>
    filter(own == 1) |>
    collection_by_category(
        games = games_raw
    ) |>
    plot_collection_by_category() +
    ylab("feature")

The following plot shows the years in which games in the user’s collection were published. This often indicates when someone first entered the hobby.

Games in Collection

What games does the user currently have in their collection? The following table can be used to examine games the user owns, along with some helpful information for selecting the right game for a game night!

Use the filters above the table to sort/filter based on information about the game, such as year published, recommended player counts, or playing time.

Show the code

collection |>
    filter(own == 1) |>
    prep_collection_datatable(
        games = games_raw
    ) |>
    filter(!is.na(image)) |>
    collection_datatable()

Modeling

I’ll now examine the predictive models trained on the user’s collection.

For an individual user, I train a predictive model on their collection in order to predict whether a user owns a game. The outcome, in this case, is binary: does the user have a game listed in their collection or not? This is the setting for training a classification model, where the model aims to learn the probability that a user will add a game to their collection based on its observable features.

How does a model learn what a user is likely to own? The training process is a matter of examining historical games and finding patterns that exist between game features (designers, mechanics, playing time, etc) and games in the user’s collection.

I make use of many potential features for games, the vast majority of which are dummies indicating the presence or absence of things such as a publisher/artist/designer. The “standard” BGG features for every game contain information that is typically listed on the box, such as its playing time, player counts, or recommended minimum age.
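
To make the idea of presence/absence dummies concrete, here is a small sketch in base R. The long-format table and the publisher names are invented for illustration, and the project's actual preprocessing is not shown in this report, so treat this as one possible approach.

```r
# Hypothetical long-format data: one row per game-publisher pair.
game_publishers <- data.frame(
    game_id = c(1, 1, 2, 3, 3),
    publisher = c("Publisher A", "Publisher B", "Publisher A",
                  "Publisher B", "Publisher C")
)

# Cross-tabulate to get one row per game and one column per publisher.
counts <- table(game_publishers$game_id, game_publishers$publisher)
dummies <- as.data.frame.matrix(counts)

# Convert counts to 0/1 indicators: 1 if the publisher is attached to the game.
dummies[] <- lapply(dummies, function(x) as.integer(x > 0))

dummies
```

Each publisher, designer, artist, or mechanic becomes its own column, which is why the feature set grows so large.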

Note

I train models to predict whether a user owns a game based only on information that could be observed about the game at its release: playing time, player count, mechanics, categories, genres, and selected designers, artists, and publishers. I do not make use of BGG community information, such as its average rating, weight, or number of user ratings. This is to ensure the model can predict newly released games without relying on information from the BGG community.
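
A minimal sketch of that restriction is shown below. The table is invented, and every column name except `usersrated` (which appears in code later in this report) is an assumption about how the BGG fields might be named.

```r
# Hypothetical game table; all column names except `usersrated` are assumptions.
games <- data.frame(
    game_id = 1:3,
    minplayers = c(1, 2, 2),
    playingtime = c(60, 90, 30),
    average = c(7.8, 6.9, 7.2),        # BGG community average rating
    averageweight = c(3.1, 2.4, 1.8),  # BGG community complexity rating
    usersrated = c(12000, 450, 3100)   # number of BGG user ratings
)

# Columns that are only known after the community has played and rated a game.
community_cols <- c("average", "averageweight", "usersrated")

# Keep only the features that are observable at release.
features <- games[, setdiff(names(games), community_cols)]
names(features)
```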

What Predicts A Collection?

A predictive model gives us more than just predictions. We can also ask, what did the model learn from the data? What predicts the outcome? In the case of predicting a boardgame collection, what did the model find to be predictive of games a user has in their collection?

To answer this, I examine the coefficients from a logistic regression model with ridge regularization (which I will refer to as a penalized logistic regression).
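
For readers unfamiliar with this kind of model, the sketch below fits a ridge-penalized logistic regression with the glmnet package on simulated data. The data, the feature names, the number of folds, and the use of `lambda.min` are all assumptions for illustration; the tuning setup actually used for this report is not shown here.

```r
library(glmnet)

set.seed(1)

# Simulated stand-in for the game feature matrix: 500 games, 20 binary features.
n <- 500
p <- 20
x <- matrix(rbinom(n * p, size = 1, prob = 0.2), nrow = n, ncol = p)
colnames(x) <- paste0("feature_", seq_len(p))

# Simulated ownership: only the first two features affect the outcome.
log_odds <- -2 + 1.5 * x[, 1] - 1.0 * x[, 2]
y <- rbinom(n, size = 1, prob = plogis(log_odds))

# alpha = 0 requests the ridge penalty; lambda is chosen by cross-validation.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0, nfolds = 5)

# Coefficients (on the log-odds scale) at the selected penalty.
coefs <- coef(cv_fit, s = "lambda.min")
head(as.matrix(coefs))
```

The penalty shrinks coefficients toward zero, which helps when there are many sparse dummy features and relatively few owned games.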

Positive values indicate that a feature increases a user’s probability of owning/rating a game, while negative values indicate a feature decreases the probability. To be precise, the coefficients indicate the effect of a particular feature on the log-odds of a user owning a game.
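
As a worked example of what a log-odds effect means, the sketch below converts a baseline probability to log-odds, adds a coefficient, and converts back. The baseline probability and the coefficient are made-up numbers, not values from the fitted model.

```r
# Hypothetical numbers for illustration only.
baseline_prob <- 0.01   # probability of owning a game without the feature
coefficient <- 0.7      # assumed coefficient for a binary feature

# Convert to log-odds, add the coefficient, and convert back to a probability.
new_log_odds <- qlogis(baseline_prob) + coefficient
new_prob <- plogis(new_log_odds)

# The odds are multiplied by exp(coefficient), regardless of the baseline.
odds_ratio <- exp(coefficient)

round(c(new_prob = new_prob, odds_ratio = odds_ratio), 4)
```

Because the effect is additive on the log-odds scale, the same coefficient changes the probability by different amounts depending on where a game starts.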

The following visualization shows the path of each feature as it enters the model, with highly influential features tending to enter the model early with large positive or negative effects. The dotted line indicates the level of regularization that was selected during tuning.

Show the code

model_glmnet |>
    pluck("wflow", 1) |>
    trace_plot.glmnet(max.overlaps = 30) +
    facet_wrap(~ params$username)

Partial Effects

What are the effects of individual features?

Use the buttons below to examine the effects different types of predictors had in predicting the user’s collection.

Assessment

How well did the model do in predicting the user’s collection?

This section contains a variety of visualizations and metrics for assessing the performance of the model(s). If you’re not particularly interested in predictive modeling, skip down further to the predictions from the model.

The following displays the model’s performance in resampling on a training set, a validation set, and a holdout set of upcoming games.

Show the code

metrics |>
    mutate(across(where(is.numeric), \(x) round(x, 3))) |>
    pivot_wider(
        names_from = c(".metric"),
        values_from = c(".estimate")
    ) |>
    gt::gt() |>
    gt::sub_missing() |>
    gt_options()

username	wflow_id	type	.estimator	mn_log_loss	roc_auc	pr_auc
J_3MBG	glmnet	resamples	binary	0.047	0.907	0.201
J_3MBG	glmnet	test	binary	0.024	0.852	0.043
J_3MBG	glmnet	valid	binary	0.034	0.893	0.098
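
The pr_auc values look low in absolute terms, so it helps to compare them with how rare owned games are. The sketch below does that arithmetic using the counts from the split table and the pr_auc values from the metrics table above. It relies on a common rule of thumb, stated here as an assumption: a classifier with no skill has a PR AUC roughly equal to the share of positive cases.

```r
# Counts of not-owned and owned games, copied from the split table in this report.
splits <- data.frame(
    type = c("train", "valid", "test"),
    no = c(26052, 10236, 8563),
    yes = c(310, 72, 30)
)

# Share of games the user owns in each split.
splits$prevalence <- splits$yes / (splits$no + splits$yes)

# PR AUC values copied from the metrics table
# (the "resamples" row is computed on the training set).
splits$pr_auc <- c(0.201, 0.098, 0.043)

# How many times larger each PR AUC is than the prevalence baseline.
splits$lift <- splits$pr_auc / splits$prevalence

splits
```

Under that assumption, the model's PR AUC is roughly 12 to 17 times the baseline in each split, even though the raw values are small.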

An easy way to visually examine the performance of a classification model is to view a separation plot.

I plot the predicted probabilities from the model for every game (during resampling), ordered from lowest to highest. I then overlay a blue line for every game that the user owns. A good classifier separates the blue (games owned by the user) from the white (games not owned by the user), with most of the blue occurring at the highest probabilities (the right side of the chart, given the ascending order).

Show the code

preds |>
    filter(type %in% c("resamples", "valid")) |>
    plot_separation(outcome = params$outcome)

I can more formally assess how well each model did in resampling by looking at the area under the ROC curve (roc_auc). A perfect model would receive a score of 1, while a model that cannot predict the outcome better than chance will score around 0.5. What counts as a good score depends on the setting, but as a rough guide anything in the 0.8 to 0.9 range is very good, while the 0.7 to 0.8 range is perfectly acceptable.

Show the code

preds |>
    nest(data = -c(username, wflow_id, type)) |>
    mutate(
        roc_curve = map(
            data,
            safely(~ .x |> safe_roc_curve(truth = params$outcome))
        )
    ) |>
    mutate(result = map(roc_curve, ~ .x |> pluck("result"))) |>
    select(username, wflow_id, type, result) |>
    unnest(result) |>
    plot_roc_curve()
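
The area under the ROC curve has a simple interpretation: it is the probability that a randomly chosen owned game receives a higher predicted probability than a randomly chosen unowned game. The sketch below computes it from ranks in base R. The function `auc_rank()` and the six predictions are hypothetical; the report's own metrics come from the code shown in the Assessment section.

```r
# Rank-based (Mann-Whitney) estimate of the area under the ROC curve.
# `truth` is 1 for owned games and 0 otherwise; `prob` is the predicted probability.
auc_rank <- function(truth, prob) {
    r <- rank(prob)                 # average ranks handle tied predictions
    n_pos <- sum(truth == 1)
    n_neg <- sum(truth == 0)
    (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predictions for six games.
truth <- c(0, 0, 1, 0, 1, 1)
prob  <- c(0.01, 0.02, 0.03, 0.04, 0.20, 0.35)

auc_rank(truth, prob)
```

In this toy example, 8 of the 9 owned/unowned pairs are ordered correctly, so the result is 8/9.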

Top Games in Training

What were the model’s top games in the training set?

Show the code

preds |>
    filter(type == "resamples") |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games in Validation

What were the model’s top games in the validation set?

Show the code

preds |>
    filter(type %in% c("valid")) |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games by Year

The following table displays the model’s top 15 games for each of the 15 most recent years in the training and validation sets.

Show the code

preds |>
    filter(type %in% c("resamples", "valid")) |>
    top_n_preds(
        games = games,
        outcome = params$outcome,
        top_n = 15,
        n_years = 15
    ) |>
    gt_top_n(collection = collection |> prep_collection())

Rank	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023
1	Dominion: Intrigue	Troyes	A Game of Thrones: The Board Game (Second Edition)	Suburbia	Eldritch Horror	Imperial Settlers	The Gallerist	Terraforming Mars	Azul	Architects of the West Kingdom	Star Wars: Outer Rim	Viscounts of the West Kingdom	The Great Wall	Frostpunk: The Board Game	51st State: Ultimate Edition
2	Chaos in the Old World	7 Wonders	Village	Archipelago	Lewis & Clark: The Expedition	Viticulture: Complete Collector's Edition	Pandemic Legacy: Season 1	Star Wars: Rebellion	Pandemic Legacy: Season 2	Cosmic Encounter: 42nd Anniversary Edition	Pax Pamir: Second Edition	Dune: Imperium	Sleeping Gods	Nemesis: Lockdown	Night Flowers
3	Hellenes: Campaigns of the Peloponnesian War	Labyrinth: The War on Terror, 2001 – ?	Elder Sign	Keyflower	Gearworld: The Borderlands	Artifacts, Inc.	Steampunk Rally	Scythe	Spirit Island	Azul: Stained Glass of Sintra	Paladins of the West Kingdom	Tidal Blades: Heroes of the Reef	Galactic Era	ISS Vanguard	Lords of Ragnarok
4	Hansa Teutonica	51st State	Rune Age	Terra Mystica	Forbidden Desert	La Granja	Tiny Epic Galaxies	Islebound	Anachrony	Nemesis	Noctiluca	Versailles 1919	Ark Nova	Wayfarers of the South Tigris	51st State: Ultimate Edition (Gamefound Edition)
5	Small World	Dominant Species	Tournay	The Manhattan Project	Legacy: The Testament of Duke de Crecy	Five Tribes: The Djinns of Naqala	Thunderbirds	Mansions of Madness: Second Edition	Twilight Imperium: Fourth Edition	Underwater Cities	Wingspan	Beyond the Sun	The Rocketeer: Fate of the Future	Carnegie	Ticket to Ride Legacy: Legends of the West
6	War of the Ring	Forbidden Island	The Lord of the Rings: The Card Game	Empires of the Void	City of Iron	Pandemic: The Cure	Raiders of the North Sea	Arkham Horror: The Card Game	Near and Far	Newton	Era: Medieval Age	Florenza: X Anniversary Edition	Arkham Horror: The Card Game (Revised Edition)	Planet Unknown	Voidfall
7	Middle-Earth Quest	DungeonQuest (Third Edition)	The New Era	Wiz-War (Eighth Edition)	Eight-Minute Empire: Legends	AquaSphere	Viticulture Essential Edition	The Manhattan Project: Energy Empire	Fallout	Everdell	Tapestry	Raiders of Scythia	Radlands	Endless Winter: Paleoamericans	Darwin's Journey
8	Age of Conan: The Strategy Board Game	Commands & Colors: Napoleonics	Dungeon Fighter	Robinson Crusoe: Adventures on the Cursed Island	Navajo Wars	Greed	Het Koninkrijk Dominion	Quadropolis	My Little Scythe	Pax Emancipation	Res Arcana	Pandemic Legacy: Season 0	Lorenzo il Magnifico: Big Box	Starship Captains	Expeditions
9	Endeavor	Alien Frontiers	Eclipse: New Dawn for the Galaxy	Descent: Journeys in the Dark (Second Edition)	Rococo	Alchemists	The Voyages of Marco Polo	Hit Z Road	Gaia Project	Brass: Birmingham	Fantastic Factories	Century: Golem Edition – An Endless World	Bloodborne: The Board Game	Woodcraft: Roll and Write	Hybris: Disordered Cosmos
10	Shipyard	Sid Meier's Civilization: The Board Game	King of Tokyo	City of Gears	Nations	Sons of Anarchy: Men of Mayhem	Super Motherload	Great Western Trail	This War of Mine: The Board Game	Coimbra	Mega Empires: The West	The Search for Planet X	Terraforming Mars: Ares Expedition	Everdell: The Complete Collection	Forbidden Jungle
11	Stronghold	Prêt-à-Porter	Space Empires 4X	Android: Infiltration	Blueprints	Castles of Mad King Ludwig	Mission: Red Planet (Second/Third Edition)	Agricola (Revised Edition)	Fate of the Elder Gods	The Edge: Dawnfall	The Magnificent	Etherfields	Canvas	The Age of Atlantis	Shogun no Katana
12	American Rails	Battles of Westeros	Mansions of Madness	Rex: Final Days of an Empire	Concordia	Istanbul	Stockpile	Junk Art	Sagrada	Space Park	Century: A New World	On Mars	Cascadia	Merchants of the Dark Road	People Power: Insurgency in the Philippines, 1981-1986
13	Axis & Allies: 1942	Runewars	War of the Ring: Second Edition	Merchant of Venus (Second Edition)	Russian Railroads	Orléans	Elysium	Lorenzo il Magnifico	Dinosaur Island	Rising Sun	PARKS	Fallout Shelter: The Board Game	Genotype: A Mendelian Genetics Game	Shogun No Katana Deluxe Edition	Rolling Heights
14	Terra Prime	Defenders of the Realm	The Castles of Burgundy	Neuroshima: Convoy	Spyrium	Pandemic: Contagion	Above and Below	Star Trek: The Dice Game	First Martians: Adventures on the Red Planet	Arkham Horror (Third Edition)	Tang Garden	Hallertau	Roll Camera!: The Filmmaking Board Game	Woodcraft	Diora
15	Cyclades	Castaways	Mage Knight Board Game	Il Vecchio	City of Remnants	Praetor	The Bloody Inn	51st State: Master Set	Bob Ross: Art of Chill Game	Forbidden Sky	Horrified	Dwellings of Eldervale	Steampunk Rally Fusion: Atomic Edition	Tindaya	Age of Rome
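
A per-year ranking like the one in the table above could be produced along the following lines with dplyr. This is only a sketch: the internals of `top_n_preds()` are not shown in this report, and the example data and the column names `yearpublished` and `.pred_yes` are assumptions.

```r
library(dplyr)

# Hypothetical predictions; `.pred_yes` is an assumed name for the predicted probability.
preds_example <- tibble(
    name = c("Game A", "Game B", "Game C", "Game D", "Game E", "Game F"),
    yearpublished = c(2022, 2022, 2022, 2023, 2023, 2023),
    .pred_yes = c(0.10, 0.40, 0.25, 0.05, 0.30, 0.15)
)

top_by_year <- preds_example |>
    group_by(yearpublished) |>
    # keep the n highest-probability games in each year
    slice_max(.pred_yes, n = 2, with_ties = FALSE) |>
    # rank within year, 1 = most likely to be owned
    mutate(rank = row_number(desc(.pred_yes))) |>
    ungroup() |>
    arrange(yearpublished, rank)

top_by_year
```

Pivoting the result so that years become columns and ranks become rows would give the wide layout used in the table.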

Predictions

New and Upcoming Games

What were the model’s top predictions for new and upcoming board game releases?

Show the code

new_preds |>
    filter(type == "upcoming") |>
    # imposing a minimum threshold to filter out games with no info
    filter(usersrated >= 1) |>
    # removing this goddamn boxing game that has every mechanic listed
    filter(game_id != 420629) |>
    prep_predictions_datatable(
        games = games_new,
        outcome = params$outcome
    ) |>
    predictions_datatable(outcome = params$outcome)

Older Games

What were the model’s top predictions for older games?