Predicting Board Game Collections

VWValker’s Collection

Author

Phil Henrickson

Published

May 21, 2025

About

This report details the results of training and evaluating a classification model for predicting games for a user’s boardgame collection.

Note

To view games predicted by the model, go to Section 5.

Collection

The data in this project comes from BoardGameGeek.com. The data used is at the game level, where an individual observation contains features about a game, such as its publisher, categories, and playing time, among many others.

I train a classification model at the user level to learn the relationship between game features and the games a user owns: what predicts a user’s collection?

username	status	games
VWValker	ever_owned	183
VWValker	own	183
VWValker	rated	105

I evaluate the model’s performance on a training set of historical games via resampling, then validate its performance on a held-out set of newer releases. I then refit the model on the combined training and validation sets and predict upcoming releases in order to find new games that the user is most likely to add to their collection.

username	years	type	Own: no	Own: yes
VWValker	-3500-2021	train	26233	129
VWValker	2022-2023	valid	10256	52
VWValker	2024-2028	test	8591	2
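
The year-based split in the table above can be sketched in base R. This is only an illustration, not the report’s actual pipeline: the toy `games` data frame, the `assign_split()` helper, and its arguments are hypothetical, with cutoffs chosen to match the year ranges shown in the table.

```r
# Toy data: hypothetical game ids and publication years (not real BGG records)
games <- data.frame(
    game_id = 1:6,
    yearpublished = c(1995, 2015, 2021, 2022, 2023, 2025)
)

# Hypothetical helper: label each game by the split its year falls into
assign_split <- function(year, end_train_year = 2021, valid_years = 2) {
    end_valid_year <- end_train_year + valid_years
    ifelse(
        year <= end_train_year, "train",
        ifelse(year <= end_valid_year, "valid", "test")
    )
}

games$type <- assign_split(games$yearpublished)
table(games$type)
```

Splitting on publication year rather than at random keeps newer games out of training, which mirrors how the model is actually used: to score games released after the ones it learned from.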

Types of Games

What types of games does the user own? The following plot displays the most frequent publishers, mechanics, designers, artists, etc. that appear in the user’s collection.

collection |>
    filter(own == 1) |>
    collection_by_category(
        games = games_raw
    ) |>
    plot_collection_by_category() +
    ylab("feature")

The following plot shows the years in which games in the user’s collection were published. This can usually indicate when someone first entered the hobby.

Games in Collection

What games does the user currently have in their collection? The following table can be used to examine games the user owns, along with some helpful information for selecting the right game for a game night!

Use the filters above the table to sort/filter based on information about the game, such as year published, recommended player counts, or playing time.

collection |>
    filter(own == 1) |>
    prep_collection_datatable(
        games = games_raw
    ) |>
    filter(!is.na(image)) |>
    collection_datatable()

Modeling

I’ll now examine the predictive models trained on the user’s collection.

For an individual user, I train a predictive model on their collection in order to predict whether a user owns a game. The outcome, in this case, is binary: does the user have a game listed in their collection or not? This is the setting for training a classification model, where the model aims to learn the probability that a user will add a game to their collection based on its observable features.

How does a model learn what a user is likely to own? The training process is a matter of examining historical games and finding patterns that exist between game features (designers, mechanics, playing time, etc) and games in the user’s collection.

I make use of many potential features for games, the vast majority of which are dummies indicating the presence or absence of things such as a publisher, artist, or designer. The “standard” BGG features for every game contain information that is typically listed on the box: its playing time, player counts, and recommended minimum age.
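
As a rough sketch of what these dummy features look like, the base R snippet below turns a long table of game/mechanic pairs into 0/1 indicator columns. The `game_mechanics` data frame and its values are made up for illustration; the report’s own feature engineering may work differently.

```r
# Toy long-format data: one row per game/mechanic pair (hypothetical values)
game_mechanics <- data.frame(
    game_id = c(1, 1, 2, 3, 3),
    mechanic = c(
        "Worker Placement", "Deck Building", "Worker Placement",
        "Dice Rolling", "Deck Building"
    )
)

# Cross-tabulate games against mechanics, then convert counts to 0/1 so that
# each column indicates whether the game lists that mechanic
counts <- table(game_mechanics$game_id, game_mechanics$mechanic)
dummies <- (unclass(counts) > 0) * 1L
dummies
```

Each row is a game and each column is one mechanic; the same pattern would apply to publishers, designers, or artists.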

Note

I train models to predict whether a user owns a game based only on information that could be observed about the game at its release: playing time, player count, mechanics, categories, genres, and selected designers, artists, and publishers. I do not make use of BGG community information, such as its average rating, weight, or number of user ratings. This is to ensure the model can predict newly released games without relying on information from the BGG community.

What Predicts A Collection?

A predictive model gives us more than just predictions. We can also ask, what did the model learn from the data? What predicts the outcome? In the case of predicting a boardgame collection, what did the model find to be predictive of games a user has in their collection?

To answer this, I examine the coefficients from a logistic regression model with ridge regularization (which I will refer to as a penalized logistic regression).

Positive values indicate that a feature increases a user’s probability of owning/rating a game, while negative values indicate a feature decreases the probability. To be precise, the coefficients indicate the effect of a particular feature on the log-odds of a user owning a game.
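
To make the log-odds interpretation concrete, here is a small worked example in base R. The baseline probability and the coefficient are hypothetical numbers chosen for illustration; they are not taken from the fitted model.

```r
# Hypothetical values for illustration only (not taken from the fitted model)
baseline_log_odds <- qlogis(0.005)  # a game with a 0.5% baseline probability
coefficient <- 1.2                  # assumed effect of one feature being present

# A coefficient adds to the log-odds; plogis() maps log-odds back to probability
prob_without <- plogis(baseline_log_odds)
prob_with <- plogis(baseline_log_odds + coefficient)

# Equivalently, the odds are multiplied by exp(coefficient)
odds_ratio <- exp(coefficient)

c(without = prob_without, with = prob_with, odds_ratio = odds_ratio)
```

Because the effect is additive on the log-odds scale, the same coefficient changes the probability by different amounts depending on where the game starts.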

The following visualization shows the path of each feature as it enters the model, with highly influential features tending to enter the model early with large positive or negative effects. The dotted line indicates the level of regularization that was selected during tuning.

model_glmnet |>
    pluck("wflow", 1) |>
    trace_plot.glmnet(max.overlaps = 30) +
    facet_wrap(~ params$username)

Partial Effects

What are the effects of individual features?

Use the buttons below to examine the effects different types of predictors had in predicting the user’s collection.

Assessment

How well did the model do in predicting the user’s collection?

This section contains a variety of visualizations and metrics for assessing the performance of the model(s). If you’re not particularly interested in predictive modeling, skip down further to the predictions from the model.

The following table displays the model’s performance in resampling on the training set, on the validation set, and on a holdout set of upcoming games.

metrics |>
    mutate_if(is.numeric, round, 3) |>
    pivot_wider(
        names_from = c(".metric"),
        values_from = c(".estimate")
    ) |>
    gt::gt() |>
    gt::sub_missing() |>
    gt_options()

username	wflow_id	type	.estimator	mn_log_loss	roc_auc	pr_auc
VWValker	glmnet	resamples	binary	0.025	0.898	0.069
VWValker	glmnet	test	binary	0.009	0.885	0.001
VWValker	glmnet	valid	binary	0.025	0.874	0.125
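
These metrics are easier to read against how rare ownership is in each split. The sketch below is a plain base R version of the usual mean log loss formula, applied to a naive model that predicts the base rate for every game. The `mean_log_loss()` helper is hypothetical and may differ in detail from the implementation behind the table; the counts are copied from the split table earlier in the report.

```r
# Hypothetical helper: mean log loss for a binary outcome coded 0/1
mean_log_loss <- function(truth, prob, eps = 1e-15) {
    prob <- pmin(pmax(prob, eps), 1 - eps)  # avoid log(0)
    -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
}

# Owned / not owned counts copied from the split table earlier in the report
owned <- c(train = 129, valid = 52, test = 2)
not_owned <- c(train = 26233, valid = 10256, test = 8591)
prevalence <- owned / (owned + not_owned)

# Log loss of a naive model that predicts the prevalence for every game
naive_log_loss <- sapply(names(owned), function(s) {
    truth <- c(rep(1, owned[[s]]), rep(0, not_owned[[s]]))
    mean_log_loss(truth, rep(prevalence[[s]], length(truth)))
})

round(rbind(prevalence, naive_log_loss), 4)
```

With so few owned games in each split, and only two in the test set, the precision-recall AUC in particular should be judged relative to these base rates rather than on an absolute scale.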

An easy way to visually examine the performance of a classification model is to view a separation plot.

I plot the predicted probabilities from the model for every game (during resampling) from lowest to highest. I then overlay a blue line for any game that the user does own. A good classifier is one that is able to separate the blue (games owned by the user) from the white (games not owned by the user), with most of the blue occurring at the highest probabilities (right side of the chart).

preds |>
    filter(type %in% c("resamples", "valid")) |>
    plot_separation(outcome = params$outcome)
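
The ordering logic behind a separation plot can be sketched in a few lines of base R. The `preds_toy` data frame and its column names are hypothetical; this is not the body of `plot_separation()`, only the idea it is built on.

```r
# Toy predictions (hypothetical probabilities and ownership labels)
preds_toy <- data.frame(
    .pred_yes = c(0.02, 0.40, 0.01, 0.75, 0.10, 0.03),
    own = c(0, 1, 0, 1, 0, 0)
)

# Order games from lowest to highest predicted probability
ordered <- preds_toy[order(preds_toy$.pred_yes), ]
ordered$position <- seq_len(nrow(ordered))

# Positions of the owned games; a well-separated model pushes these
# toward the high end of the ordering
owned_positions <- ordered$position[ordered$own == 1]
owned_positions
```

In this toy case the two owned games land in the last two positions, which is what a cleanly separated plot would look like.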

I can more formally assess how well the model did in resampling by looking at the area under the ROC curve (roc_auc). A perfect model would receive a score of 1, while a model that cannot predict the outcome will default to a score of 0.5. The extent to which something is a good score depends on the setting, but generally anything in the 0.8 to 0.9 range is very good, while the 0.7 to 0.8 range is perfectly acceptable.

preds |>
    nest(data = -c(username, wflow_id, type)) |>
    mutate(
        roc_curve = map(
            data,
            safely(~ .x |> safe_roc_curve(truth = params$outcome))
        )
    ) |>
    mutate(result = map(roc_curve, ~ .x |> pluck("result"))) |>
    select(username, wflow_id, type, result) |>
    unnest(result) |>
    plot_roc_curve()
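
To show where the reference scores of 1 and 0.5 come from, the sketch below computes the area under the ROC curve with the rank-sum (Mann-Whitney) identity in base R. The `auc_rank()` helper and the toy vectors are hypothetical; this is not the function used to produce the metrics in this report.

```r
# Hypothetical helper: ROC AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(truth, prob) {
    n_pos <- sum(truth == 1)
    n_neg <- sum(truth == 0)
    ranks <- rank(prob)  # average ranks handle ties
    (sum(ranks[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c(0, 1, 0, 1, 0, 0)
perfect <- c(0.02, 0.40, 0.01, 0.75, 0.10, 0.03)  # owned games ranked highest
constant <- rep(0.5, 6)                            # no ability to rank games

auc_rank(truth, perfect)   # 1
auc_rank(truth, constant)  # 0.5
```

Read this way, the AUC is the probability that a randomly chosen owned game receives a higher predicted probability than a randomly chosen game the user does not own.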

Top Games in Training

What were the model’s top games in the training set?

preds |>
    filter(type == "resamples") |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games in Validation

What were the model’s top games in the validation set?

preds |>
    filter(type %in% c("valid")) |>
    prep_predictions_datatable(
        games = games,
        outcome = params$outcome
    ) |>
    predictions_datatable(
        outcome = params$outcome,
        remove_description = TRUE,
        remove_image = TRUE,
        pagelength = 15
    )

Top Games by Year

The following table displays the model’s top 15 games for each of the 15 most recent years in the training and validation sets.

preds |>
    filter(type %in% c("resamples", "valid")) |>
    top_n_preds(
        games = games,
        outcome = params$outcome,
        top_n = 15,
        n_years = 15
    ) |>
    gt_top_n(collection = collection |> prep_collection())

Rank	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023
1	Shipyard	Warhammer: The Island of Blood	Mage Knight Board Game	Terra Mystica	Lewis & Clark: The Expedition	Orléans	Forbidden Stars	Star Wars: Rebellion	Spirit Island	Everdell	Barrage	Hansa Teutonica: Big Box	Bloodborne: The Board Game	Endless Winter: Paleoamericans	Masters of the Universe: The Board Game – Clash for Eternia
2	Axis & Allies: 1942	Troyes	Mansions of Madness	Tzolk'in: The Mayan Calendar	Terror in Meeple City	Nyakuza	Codenames	Codenames: Deep Undercover	Charterstone	The World of SMOG: Rise of Moloch	Detective: City of Angels	Lost Ruins of Arnak	Ark Nova	Gateway Island	The White Castle
3	Kuhhandel Master	Prêt-à-Porter	A Game of Thrones: The Board Game (Second Edition)	Ginkgopolis	Hanamikoji	Star Wars: Imperial Assault	Grand Austria Hotel	Aeon's End	Unfair	Underwater Cities	Tainted Grail: The Fall of Avalon	Gùgōng: Deluxe Big Box	Ankh: Gods of Egypt	Frostpunk: The Board Game	La Granja: Deluxe Master Set
4	Hansa Teutonica	Zombie in My Pocket	The New Era	Seasons	Madeira	Sons of Anarchy: Men of Mayhem	Mysterium	Lorenzo il Magnifico	Gloomhaven	The Edge: Dawnfall	Dune	High Rise	Llamaland	Frosthaven	51st State: Ultimate Edition
5	Steam	Warhammer: The Game of Fantasy Battles (8th Edition)	Gears of War: The Board Game	Axis & Allies: 1941	Five Points: Gangs of New York	Imperial Settlers	Viticulture Essential Edition	Citadels	This War of Mine: The Board Game	Newton	Clank! Legacy: Acquisitions Incorporated	5x5 Zoo	Boonlake	Woodcraft	Marvel Zombies: A Zombicide Game
6	Axis & Allies: Pacific 1940	51st State	Sekigahara: The Unification of Japan	Robinson Crusoe: Adventures on the Cursed Island	Glass Road	AquaSphere	Food Chain Magnate	Fields of Green	Altiplano	Nemesis	Star Wars: Outer Rim	Praga Caput Regni	Dinosaur Island: Rawr 'n Write	Tiletum	Marvel Zombies: X-Men Resistance
7	At the Gates of Loyang	Norenberc	Ascending Empires	Descent: Journeys in the Dark (Second Edition)	The Builders: Middle Ages	La Granja	Risk: Europe	Codenames: Pictures	Mythic Battles: Pantheon	Quacks	Cthulhu: Death May Die	Dune: Imperium	Tinners' Trail: Expanded Edition	Aeon Trespass: Odyssey	Dune: Imperium – Uprising
8	Stronghold	Axis & Allies: Europe 1940	Rune Age	Star Wars: The Card Game	Sushi Go!	The Staufer Dynasty	Blood Rage	Tramways	Gaia Project	Dungeon Alliance	Fields of Arle: Big Box	Guild Master	The Great Wall	Marvel Zombies: Heroes' Resistance	51st State: Ultimate Edition (Gamefound Edition)
9	Vasco da Gama	Space Hulk: Death Angel – The Card Game	Takenoko	Rex: Final Days of an Empire	Concept	Tiny Epic Kingdoms	Zombicide: Black Plague	Last Will	Pandemic Legacy: Season 2	A Song of Ice & Fire: Tabletop Miniatures Game – Night's Watch Starter Set	Siege of the Citadel	Eclipse: Second Dawn for the Galaxy	Origins: First Builders	ISS Vanguard	Nekojima
10	Space Pirates	Flicochet	Eclipse: New Dawn for the Galaxy	Machi Koro	Tajemnicze Domostwo	Three Kingdoms Redux	9 Lives	Terraforming Mars	Calimala	Spy Club	The Taverns of Tiefenthal	Tawantinsuyu: The Inca Empire	Steamwatchers	Atiwa	Anunnaki: Dawn of the Gods
11	Middle-Earth Quest	Runewars	Ora et Labora	Galaxy Trucker: Anniversary Edition	Pathfinder Adventure Card Game: Rise of the Runelords – Base Set	Maskmen	Haspelknecht: The Story of Early Coal Mining	Age of Thieves	Pulsar 2849	Yellow & Yangtze	Caylus 1303	Etherfields	Adventure Tactics: Domianne's Tower	Tindaya	Empire's End
12	Egizia	Firenze	Pictomania	Cockroach Poker Royal	Circus Train (Second Edition)	New Dawn	Bottom of the 9th	Agricola (Revised Edition)	Flip Ships	Rising Sun	Tang Garden	Altar Quest	Stronghold: Undead (Second Edition) – Kickstarter Edition	Treehouse Diner	Unmatched Adventures: Tales to Amaze
13	Imperial 2030	Dominant Species	Tournay	Siberia: The Card Game	Pelican Bay	Fields of Arle	Minerva	Scythe	Outlive	Duelosaur Island	Tiny Towns	Gloomhaven: Jaws of the Lion	Brazil: Imperial	Horizons of Spirit Island	1971
14	Tarantel Tango	Catacombs	Village	Sheepland	Koryŏ	Power Grid Deluxe: Europe/North America	Super Motherload	Clank!: A Deck-Building Adventure	Dungeon of Mandom VIII	Mage Knight: Ultimate Edition	Zombicide: Invader	Honey Buzz	Stroganov	Warhammer: The Horus Heresy – Age of Darkness	Night Flowers
15	Skyline 3000	Sneaks & Snitches	Last Will	Escape: The Curse of the Temple	Romolo o Remo?	Praetor	Oh My Goods!	Star Trek: Frontiers	878 Vikings: Invasions of England	Everdell: Collector's Edition	Court of the Dead: Mourners Call	Merv: The Heart of the Silk Road	Tabannusi: Builders of Ur	Skymines	Marvel United: Spider-Geddon

Predictions

New and Upcoming Games

What were the model’s top predictions for new and upcoming board game releases?

new_preds |>
    filter(type == "upcoming") |>
    # imposing a minimum threshold to filter out games with no info
    filter(usersrated >= 1) |>
    # removing this goddamn boxing game that has every mechanic listed
    filter(game_id != 420629) |>
    prep_predictions_datatable(
        games = games_new,
        outcome = params$outcome
    ) |>
    predictions_datatable(outcome = params$outcome)

Older Games

What were the model’s top predictions for older games?