A Comparison of Topic Modeling Approaches Using Networked Discussion Forum Posts From the City-data.com Corpus

The City-Data.com Corpus provides over 15,000 discussion forum posts scraped from City-Data.com, a website that hosts information about cities across the United States. Like the 20 Newsgroups dataset, the City-Data.com Corpus is weakly labeled by forum topics and thread titles and can be used to trial natural language processing techniques or to stage lessons in digital textual analysis in digital humanities pedagogy.

(2) DATASET DESCRIPTION
The City-Data.com Corpus consists of five discussion forum threads from the site City-Data.com (Advameg, Inc., n.d.a), scraped using the BeautifulSoup Python library (Richardson 2007). City-Data.com aggregates geographic, demographic, and historical information about major cities across the United States and Canada from public and governmental sources. The site provides data visualizations that enable users to compare cities. As such, City-Data.com targets audiences who are traveling or relocating, real estate professionals, and advertisers. City-Data.com also hosts discussion forums by region and topicality (e.g., Classified Ads or Food and Drink). City-Data.com forums are moderated and prohibit trolling, hate speech, spam, doxxing, and cross-posting (Advameg, Inc. n.d.b). Moreover, posters are advised to stay on topic and avoid personal attacks.
At the time of this writing, the City-Data.com forums house 2,940,053 threads and 62,355,814 posts authored by 2,476,620 members (Advameg, Inc., n.d.c). In comparison with other social media sites enlisted for data modeling, machine learning, and network analysis, City-Data Forums has a small footprint. Reddit, by contrast, boasts over 13 billion posts and comments (Reddit Inc., 2023).
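The corpus was built by scraping forum HTML with BeautifulSoup, as noted above. The extraction step can be sketched with Python's standard-library `html.parser`; the markup and class names below are assumptions for illustration, not City-Data.com's actual page structure, and BeautifulSoup offers a more convenient API for the same task.

```python
from html.parser import HTMLParser

# Hypothetical markup: each post sits in <div class="post" id="...">...</div>.
SAMPLE = """
<div class="post" id="p1">Philly's retail corridor is recovering.</div>
<div class="post" id="p2">Crime stats dropped this quarter.</div>
"""

class PostExtractor(HTMLParser):
    """Collect (post_id, text) pairs from div.post elements."""
    def __init__(self):
        super().__init__()
        self._in_post = False
        self._pid = None
        self._buf = []
        self.posts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "post":
            self._in_post = True
            self._pid = attrs.get("id")

    def handle_data(self, data):
        if self._in_post:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._in_post:
            self.posts.append((self._pid, "".join(self._buf).strip()))
            self._in_post, self._pid, self._buf = False, None, []

parser = PostExtractor()
parser.feed(SAMPLE)
```

The same tabular output (post id plus text) underlies the data model shown in Table 2.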

REPOSITORY NAME
Zenodo 10.5281/zenodo.10086354

(3) METHOD
Topic modeling is a method through which the latent structure of documents is inferred from lists of terms (topics) that collocate with high statistical significance. For example, latent Dirichlet allocation (LDA) (Blei et al. 2003; Steyvers & Griffiths 2007; Hoffman et al. 2010) creates a topic model by generating probability distributions over words. These probability distributions designate the word features that would most likely generate the documents under the model. While some topic modeling procedures like latent semantic indexing (Deerwester et al. 1990) do not produce human-interpretable topic models, most current topic modeling procedures like LDA produce term lists that order the most influential features of a topic. There are general limitations to LDA, however. As Vayansky and Kumar (2020) note in their review of topic modeling methods, LDA performance suffers when applied to short texts. Moreover, the bag-of-words document representations used in many traditional LDA approaches are less informative than more advanced embedding-based representations, which can capture the dense relationships between words in context. For this reason, embedding-based approaches have been explored (Bhatia et al. 2016; Grootendorst 2022; Angelov 2020; Aharoni & Goldberg 2020; Bianchi et al. 2021a; Bianchi et al. 2021b; Zhang et al. 2022; Limwattana & Prom-on 2021). Unlike bag-of-words approaches, which capture only the presence or absence of a term in documents, sentence embeddings encode word contexts derived from large language models pretrained on vast corpora. The richness of sentence embedding representations recommends their incorporation into topic modeling approaches. The density of information should lead to more nuance when grouping documents into topics, which should in turn lead to more relevant topical term lists. My method for performing sentence embedding-based topic modeling (henceforth, SE-Topics) follows Grootendorst's (2022) pipeline, which involves transforming texts into sentence embeddings, clustering the sentence embeddings, and extracting representative word lists or topics. To group the thread embeddings, I use scikit-learn's (Pedregosa et al. 2011) k-means clustering implementation.
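The three-stage pipeline described above (embed, cluster, extract terms) can be sketched as follows. The two-dimensional "embeddings" and the from-scratch k-means are illustrative stand-ins only; in practice the embeddings would come from a pretrained sentence encoder and the clustering from scikit-learn's KMeans, as the text describes.

```python
import math
import random
from collections import Counter

# Toy (embedding, text) pairs. Real embeddings would come from a pretrained
# sentence encoder; these 2-d vectors are stand-ins for illustration.
docs = [
    ([0.9, 0.1], "crime rate murders police"),
    ([1.0, 0.0], "crime police arrests"),
    ([0.0, 1.0], "retail store shopping mall"),
    ([0.1, 0.9], "store retail openings"),
]

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign points to nearest center, then recenter."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

labels = kmeans([emb for emb, _ in docs], k=2)

# Merge the posts in each cluster into one document, then take the most
# frequent terms as that cluster's topical term list.
topics = {}
for lab in set(labels):
    merged = " ".join(text for (_, text), l in zip(docs, labels) if l == lab)
    topics[lab] = [w for w, _ in Counter(merged.split()).most_common(3)]
```

On this toy data the crime-related posts cluster together and yield a term list led by their shared vocabulary, mirroring how merged cluster documents produce topical term lists in the SE-Topics method.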
Posts assigned to the same cluster are merged into a single document. Prior work with these approaches indicates that frequency counts lead to better topical quality than TFIDF for SE-Topic and LDA topic modeling. This topic modeling approach emphasizes what Grootendorst (2022) describes as pipeline "modularity." Because the step that derives significant textual features is separate from the embedding and clustering steps, this modularity enables researchers to replace or extend a phase in the topic modeling process. For example, dimensionality reduction can be applied to sentence embeddings to accelerate clustering; clustering algorithms can be exchanged (e.g., spectral or density-based clustering can be used); and frequency distributions can be replaced with TFIDF vectorization when deriving topical terms. 1

(3.1) EXPERIMENTAL TRIALS
I conduct 16 topic modeling trials designed to leverage the networked structure of the City-Data.com Corpus. I evaluate the topic quality of models trained on post-level segments, thread-level segments, and topics guided by prior information (see Li et al. 2018; El-Assady et al. 2019; Popa and Rebedea 2021; Gourru et al. 2018). Guided topic modeling uses prior information about the data to center model priorities. Post- and thread-level topic modeling tests how the unitization of City-Data.com's networked content influences topic modeling and is primarily a data preparation step. Guided topic modeling intervenes in the topic modeling process itself: to guide the k-means clustering process (the basis of the sentence embedding-based topic model), I manually designate initial cluster centers.

The embedding-based topic modeling approach illustrated here departs from Grootendorst's (2022) BERTopic significantly. First, the k-means clustering used to generate SE-Topics produces flat clusters; BERTopic's default implementation uses hierarchical density-based clustering (HDBSCAN; McInnes et al. 2017). HDBSCAN forms clusters around points of density in semantic space. Data not proximal enough to these points of density are labeled as outliers. The lack of outliers among SE-Topics produces a topic model more comparable to LDA. Moreover, at the time of this writing, BERTopic's "guided topic modeling" (equivalent to the use of topical priors) was inoperable due to dependency issues.

In all, the topical diversity measures used in this study determine intra-topical term list diversity, inter-topical term list diversity, and generalized topical term list diversity when compared to a reference corpus. To calculate each measure, I use the top-25 terms per topic. Because the Metro forum also discusses criminal activity, I set the topic number at 4 to correspond to the following thematic categories for topic modeling: coronavirus, crime, plan, and retail. 4

Tables 3 and 4 illustrate the SE-Topics and LDA modeling coherence and diversity scores across different data segmentations.
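Guiding k-means by designating initial cluster centers, as described above, can be sketched as follows. The seed vectors stand in for embeddings of high-engagement posts (hypothetical values chosen for illustration).

```python
import math

def kmeans_guided(points, init_centers, iters=20):
    """k-means whose starting centers are designated manually rather than
    sampled at random; the seeds steer which clusters form."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(len(centers)),
                      key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Hypothetical embeddings: seeds come from two anchor posts; the remaining
# points fall into the cluster whose seed they sit nearest.
seeds = [[1.0, 0.0], [0.0, 1.0]]
points = [[0.9, 0.2], [0.8, 0.1], [0.1, 0.9], [0.2, 0.8]]
labels = kmeans_guided(points, seeds)
```

With scikit-learn, the equivalent guidance is passing an array of seed vectors as `KMeans(n_clusters=len(seeds), init=seed_array, n_init=1)`.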
In general, LDA topic modeling produced better topical coherence and diversity scores compared to the sentence embedding approaches, with an average UMASS coherence of -2.31 and PUW and JD diversity scores near 1.0. Mean WE-CD for the LDA topic models is also 0.12 points greater, signifying that LDA topic models are more semantically distant from the reference corpus than SE-Topics with high coherence. Box and whisker plots of topical coherence and diversity scores (see Figures 1 and 2) also indicate that LDA topic modeling results are more consistent across segment types. SE-Topics coherence and diversity scores show a wider interquartile range between the best performing segments (high degree posts) and the lowest performing segments, with a difference of 2.93. On the other hand, there is only a 0.53 difference in coherence scores between the best performing LDA topics (high degree posts and threads) and the worst performing topic model (topic titles).
Put another way, the best scoring SE-Topics (guided initial posts and guided high degree posts) perform as well as the two worst scoring LDA topic models.
The discrepancies in coherence and diversity scores, however, are complicated by qualitative assessments.
Comparing SE-Topics and LDA topics guided by high degree posts, we can discern close similarities among term collocations within individual topics. Both SE-Topics and LDA topics represent the Retail thread and the development rhetoric of the Plan thread; however, SE-Topics have agglomerated crime and coronavirus discourse (see Topic 0 in Table 5). The LDA topic model seeded by high degree posts (see Table 6), by contrast, has captured discussions about government restriction of public services that ensued in the first months of the 2020 COVID-19 pandemic in Philadelphia, with terms such as money, state, local, care, and help.
Thread segmentation modestly improves LDA topical coherence (mean -2.2 compared to mean -2.4). Coherence scores for SE-Topics, on the other hand, worsen with thread segmentation (mean -4.41 compared to mean -3.49).
SE-Topics models benefited the most from the injection of prior information. The results of the SE-Topics informed by the embeddings bundled with the most quoted replies are particularly interesting when compared with the generally poor performance of thread-based segments. Both permutations depend on the networked structure of the City-Data.com Corpus to structure topics. Threads utilize the chained structure of posts and quoted replies to crystallize texts that emerge through interaction; high degree node priors utilize the linkages of posts with other posts. Although more information is encoded when threads are joined into single texts, the centrality of posts within conversations lends better guidance to the formation of more cohesive topics. This makes intuitive sense in that posts with numerous linkages will anchor common content. Threads, on the other hand, are more intrinsically diverse segments, striated by conversational turns and/or dissensus among interlocutors.
Guided topic modeling returned better topical coherence and diversity scores than unguided models (with the sentence embedding model guided by initial posts the exception). Posts with the most incoming and outgoing linkages (degree) produced the best scoring topic models. This finding suggests that network structure can influence the development of topical content in extended asynchronous conversations. Messages enmeshed in replies are more likely conserved as interlocutors attend to given information as they add commentary.
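Extracting the high-degree posts used as priors amounts to counting incoming and outgoing quote links per post. A minimal sketch, with a made-up edge list standing in for the real reply graph:

```python
from collections import Counter

# Hypothetical reply graph: (replying_post, quoted_post) edges.
edges = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p4", "p2"), ("p5", "p3")]

degree = Counter()
for src, dst in edges:
    degree[src] += 1  # outgoing link: src quotes another post
    degree[dst] += 1  # incoming link: dst is quoted

# The highest-degree posts would supply the topical priors described above.
top_posts = [post for post, _ in degree.most_common(2)]
```

Here `p1`, quoted by three later posts, surfaces as the most enmeshed message, matching the intuition that heavily quoted posts anchor common content.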
(4) REUSE POTENTIAL

Because BERTopic assigns isolated datapoints to an outlier cluster, I generate 5 topics for each trial to yield at least 4 cohesive topics. Topics were required to store a minimum of 50 textual units (posts or threads).
Table 9 illustrates the coherence and diversity scores of BERTopic models of post and thread units. Both BERTopic models yield UMASS coherence and diversity scores comparable to the SE-Topics method. Qualitatively, the BERTopic Post Model produces the most legible topics. Topic -1 (the outlier topic) is characterized by general references to philadelphia and people (see Table 10). Topic 0 captures references to city infrastructure found in the plan forum. Topic 1 includes references to race and crime (white, crime, murders) indicative of the crime and metro forums.
The BERTopic Thread Model is similar to the Post Model, although it more clearly features references to terms in the retail forum such as retail, store, and stores. Both the BERTopic Post and Thread Topic Models feature low quality topics (Topics 3 and 2, respectively). Neither conveys interpretable information about the content of the forums (see Tables 10 and 11); each merely indexes two short posts in the corpus.
However, I would not argue that the SE-Topic modeling method discussed above is a clear improvement over BERTopic. These results do not represent the full range of BERTopic parameters but are useful to highlight how little tuning the SE-Topic method requires to produce legible topics that yield similar coherence and diversity scores.

2 Due to the length of some posts, I provide the post_id in lieu of the full text.
3 I employ Terragni's (2023) suite of diversity scripts to measure PUW, JD, and WE-CD. See also Röder (2015a; 2015b), Stevens et al. (2012), and Terragni et al. (2021).

Figure 1
Boxplots of SE-Topics and LDA Coherence Scores.

Figure 2
Boxplots of SE-Topics and LDA Diversity Scores.

Table 2
Tabular data model for City-Data.com forum posts.

Omizo. Data DOI: 10.5334/johd.182

Manually designated initial cluster centers guide subsequent recenterings as the model converges to reduce the distance between intra-cluster datapoints (Arthur & Vassilvitskii 2007). To guide the LDA topic modeling, I adapt Li et al.'s (2018) method of injecting seed words extracted from the data into the topic modeling process (see also Jagarlamudi et al. 2012). In one experiment conducted on the 20 Newsgroups dataset, Li et al. (2018) used forum label text (e.g., talk.politics.guns) as seeds with the assumption that forum labels provide distinguishing categorical information about potential topics. I test three types of topic seeds:

1. Topic titles - Following Li et al. (2018), I use thread topic titles as seeds (see Table 1).
2. Initial forum posts - Initiating posts declare the horizons of participating in forum discussions. Empirical work by Sobkowicz and Sobkowicz (2010), for example, demonstrates that social media discussions evidence strong first mover advantage similar to scientific papers (Newman 2009): papers that appear early in the rise of a discipline will outpace the citation rate of newer papers. Here, the hypothesis is that posts that appear first gain more engagement, and this increased engagement will condition the content of a sizable portion of the thread (see Table 7 in Appendix A). 2
3. Posts with the highest degree - Leveraging the network properties of the City-Data.com Corpus, I create a graph of each forum and extract posts with the most incoming and outgoing links, or node degree (Gerlach et al. 2018; Duan et al. 2021; Yang et al. 2016). Posts that receive numerous quoted replies and/or are replying to other posts serve as proxies for engagement. Like the intuition behind the use of initial forum posts, posts that are bound up in more extensive conversations condition more content because respondents must stay on topic to sustain discourse (see Table 8 in Appendix A).

I calculate topical coherence and diversity scores to measure model quality. Topical coherence includes several measures that indicate how well topical term lists reflect the underlying data. In this study, I use Mimno et al.'s (2011; see also Hinneburg et al. 2014) UMASS method. UMASS coherence scores a topic by the smoothed frequency with which its term pairs co-occur in documents relative to the document frequency of the individual terms. Unlike other coherence measures such as UCI coherence, which compare topical terms to a large reference corpus like Wikipedia dumps, UMASS coherence is derived from the original dataset. I balance coherence scores against topical diversity scores (Mimno et al. 2011). I employ Gensim's coherence measures (Řehůřek & Sojka 2011) to calculate UMASS coherence. Topical diversity refers to a family of metrics that indicate the variability of topical terms and, thus, the range of data explainable by topical term lists. Quality topics are both coherently related to the underlying data and distinctive enough to offer thorough faceting (Dieng et al. 2020). For this study, I employ the following diversity measures (Terragni 2023):

1. Proportion of unique words (PUW) - PUW determines the ratio of unique topical terms for all topics (Dieng et al. 2020). Scores closer to 1 indicate diverse topics; scores closer to 0 indicate repetitive topical terms.
2. Jaccard Distance between topical term lists (JD) - Proposed by Tran et al. (2013), this diversity measure evaluates the Jaccard distance between topical term lists. Greater distances between topical terms indicate more topical diversity (Terragni 2023).
3. Word embedding centroid distance (WE-CD) (Bianchi et al. 2020b) - WE-CD calculates the distance between collocations in the topical term list and a reference corpus of word embeddings. This metric determines how diverse topical term lists are in comparison to generalized usage in embedding models of large volumes of texts like Wikipedia or the Common Crawl corpus of internet sites. For this paper, I use the FastText Common Crawl word embedding model with 300 dimensions and 2 million subword vectors (Bojanowski et al. 2016; Joulin et al. 2016a; Joulin et al. 2016b). 3

1 BERTopic models offer results comparable to LDA or the sentence embedding-based routine tested in this study. Despite this, I include BERTopic's modeling of City-Data.com Corpus threads and posts, with discussion, in Appendix B.
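The evaluation measures described above can be sketched directly. The toy documents and term lists below are made-up values for illustration; WE-CD is omitted because it requires a pretrained embedding model, and in practice the study uses Gensim's coherence implementation and Terragni's (2023) diversity scripts.

```python
import math
from itertools import combinations

# Toy corpus (documents as term sets) and topical term lists.
docs = [{"crime", "police", "rate"}, {"crime", "police"}, {"store", "retail"}]
topics = [["crime", "police"], ["store", "retail"]]

def umass_coherence(terms, docs):
    """Sum of log((D(w_i, w_j) + 1) / D(w_j)) over ordered term pairs, where
    D counts documents containing the given words (Mimno et al. 2011)."""
    def d(*words):
        return sum(1 for doc in docs if all(w in doc for w in words))
    return sum(
        math.log((d(terms[i], terms[j]) + 1) / d(terms[j]))
        for i in range(1, len(terms))
        for j in range(i)
    )

def puw(topics):
    """Proportion of unique words across all topical term lists."""
    terms = [t for topic in topics for t in topic]
    return len(set(terms)) / len(terms)

def mean_jaccard_distance(topics):
    """Average Jaccard distance between every pair of term lists;
    values near 1 indicate diverse, non-overlapping topics."""
    dists = [
        1 - len(set(a) & set(b)) / len(set(a) | set(b))
        for a, b in combinations(topics, 2)
    ]
    return sum(dists) / len(dists)
```

On this toy data the two term lists share no words, so PUW and mean JD both reach 1.0, the diverse end of each scale.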
4 Text processing, topic modeling, and evaluation scripts are available at https://github.com/rotemple/city-datacom-corpus-scripts/blob/main/sentence-embedding-tm.ipynb.

Table 3
SE-Topics Coherence and Diversity Scores.

Table 5
SE-Topics Guided Post (high degree nodes).
I have explored the potential of sentence embedding-based topic modeling against LDA benchmarks. LDA approaches yielded better topical coherence and diversity scores in comparison to sentence embeddings. Inspection of topical term lists suggests that qualitative distances between methods are less pronounced, though (see Appendix B for a truncated comparison to BERTopic (Grootendorst 2022)). That said, final topical coherence and diversity scores are less important than the different topical permutations that the City-Data.com Corpus allowed us to test. The networked structure of the City-Data.com Corpus enables the unitization of data into posts or threads as well as principled means to designate topical priors based on posts with the highest engagement. Consequently, the City-Data.com Corpus is conducive to evaluating the strength of text analysis algorithms as well as aspects of research design such as text segmentation and the incorporation of topical guidance. Illustrating the effects of different modeling parameters can be a boon to data-driven pedagogies because students can witness how different data selection choices impact their topic modeling results. Along these pedagogical lines, the City-Data.com Corpus provides opportunities for students to practice other data processing techniques such as scraping, cleaning, and parsing HTML data, as well as other methods that rely upon labeled data such as machine classification.

APPENDIX A. INITIAL FORUM POSTS AND HIGH DEGREE POST IDS

APPENDIX B. BERTOPIC RESULTS ON CITY-DATA.COM CORPUS POSTS AND THREADS

For completeness, I trialed BERTopic's topic modeler on City-Data.com Corpus posts and threads.

Table 7
Initial post ids per City-Data.com Corpus forum.

Table 8
High degree City-Data.com Corpus forum posts used for guided topic modeling.

Table 9
BERTopic Coherence and Diversity Scores for the City-Data.com Corpus.