A Mixed Method Twitter Methodology and Anonymous Corpus

This dataset represents the first five years of posts made by the anonymous Twitter bot @lis_grievances, combined with a series of custom and pre-established metrics. The bot is a platform for workers in libraries and affiliated fields to make unattributable pronouncements. A simple vetting process ensured no defamatory or explicit posts were made. Anonymity is assured because no evidence of who made a submission is retained, and none is available even to the operator of the bot. This dataset represents a collection of thoughts, fears, and cutting remarks made by information workers about their field and the places where they work.


CONTEXT
The @lis_grievances bot was first activated on February 26, 2016, and continues to operate to this day. The messages that it tweets are all anonymously harvested and thus allow workers, presumably in libraries and related fields, to 'air their grievances' (@lis_grievances bot, n.d.). There has been an enthusiastic discussion in social media and within Library and Information Science (LIS) literature on the utility of the bot and both its harm and benefit to the profession (Skyrme & Levesque, 2019).
The five-year archive was created as the basis for a chapter (Ribaric, 2022b) within a monograph that investigated the hypothesis that libraries are dysfunctional workplaces (Acadia, 2022). This research conducted an analysis that partitioned the tweets into various categories in order to understand themes found in the corpus. It also introduced a novel metric called the Grief Index (GI), which gave a quantitative ratio of how many submissions made to the bot were not posted. Every submission made to the bot is checked by a moderator before it is posted, to ensure that no one is specifically mentioned in a tweet and that discriminatory language is not used. This difference between submissions to the bot and actual posted messages is the basis of GI. The value provides a proxy for understanding the amount of material submitted that is not suitable for posting, and it avoids the need to share the actual inappropriate posts.
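The idea behind GI can be illustrated with a minimal sketch. The exact formula is not given here, so the code assumes GI is the fraction of submissions that were not posted; the function name, its parameters, and the counts are hypothetical and are not taken from the dataset.

```python
def grief_index(submitted: int, posted: int) -> float:
    """Fraction of submissions that were not posted (assumed definition of GI)."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    if not 0 <= posted <= submitted:
        raise ValueError("posted must be between 0 and submitted")
    return (submitted - posted) / submitted


# Hypothetical counts for one period: 10 submissions received, 8 posted.
gi = grief_index(submitted=10, posted=8)
print(gi)  # 0.2
```

Under this assumed definition, a value of 0 means every submission was posted, and higher values mean more material was withheld by the moderator.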

STEPS
The dataset is an aggregation of tweets made by the @lis_grievances account (n = 4096), retrieved from the Twitter API using the Tweepy platform (Roesslein, 2020). Some metadata of the tweets is retained and augmented with a custom metric called the Grief Index, as well as the three components of the VADER sentiment score for the full text of the tweet (Hutto & Gilbert, 2014). The complete software used to create the bot is hosted on GitHub (Ribaric, 2022a). A key component of this software is a mechanism that retrieves the direct messages sent to the bot through a process that ensures anonymity of the sender from the operator of the bot. The Twitter archive of the account was requested on February 27, 2021 and, as such, any favourite or retweet counts are current to that day. The term 'bot' is preferred to describe this account since the posting and retrieval of messages is mediated through an API in conjunction with custom software that ensures anonymity of posts. While all posts are submitted by humans, no submission is posted without a comprehensive mediated quality control process.
The basis of the analysis was the creation of a metric dubbed Engagement Score (ES), which was the sum of the retweets and favourites a tweet received in the five-year period. By combining this quantitative scoring with a close reading of the tweets, a mixed-method analysis was conducted. Tweets with a high ES were examined in an attempt to uncover themes present in the full corpus.
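The ES calculation and the selection of high-scoring tweets for close reading can be sketched as follows. This is an illustration only: the `Tweet` class, its field names, and the sample values are hypothetical and do not reflect the dataset's actual column names or contents.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    retweets: int
    favourites: int

    @property
    def engagement_score(self) -> int:
        # ES as defined in the text: retweets plus favourites.
        return self.retweets + self.favourites


def top_by_engagement(tweets, k):
    """Return the k tweets with the highest ES, highest first."""
    return sorted(tweets, key=lambda t: t.engagement_score, reverse=True)[:k]


# Placeholder rows, not real tweets from the corpus.
sample = [
    Tweet("placeholder a", retweets=3, favourites=10),
    Tweet("placeholder b", retweets=0, favourites=1),
    Tweet("placeholder c", retweets=7, favourites=20),
]

for tweet in top_by_engagement(sample, 2):
    print(tweet.engagement_score, tweet.text)
```

Ranking by ES first narrows the close reading to a manageable subset, which is the point of the mixed-method approach.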

COLUMN DESCRIPTION
Descriptions of all columns retained from the Twitter archive export, and of the additional data added as part of the analysis, can be found in Table 1. This dataset is a combination of data exported directly from Twitter and enriched with additional analysis specific description of the provenance of each column, as described in

STATISTICS AND CONTENTS
As mentioned, the investigation focused on the ES of the corpus, but it contrasted this score against other dynamics of the tweets. Box plots of the different facets used to partition the tweets are seen in Figure 1. Here we see that the inclusion of swear words, for example, leads to a higher mean score compared to other facets.
Figure 2 shows the general distribution of ES across all tweets in the corpus. This provides a quick view of the distribution of scores, along with some evidence that outliers were also present. To further shed light on the corpus, VADER sentiment scores were also calculated. Figure 3 shows an example of sentiment score composition for the swear word facet.
Lastly, to provide a general sense of what is in the corpus, a word cloud is presented in Figure 4. It appears that librarians enjoyed talking about themselves and the places in which they work.
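A facet comparison of this kind can be sketched with the Python standard library. The sketch assumes each tweet has already been reduced to an ES and a boolean flag saying whether it belongs to the facet (for example, whether it contains a swear word); how that flag is assigned is not shown, and the rows below are invented values rather than figures from the dataset.

```python
from statistics import mean, median


def summarise_by_facet(rows):
    """Summarise ES for tweets inside and outside a facet.

    rows: iterable of (engagement_score, has_facet) pairs.
    """
    groups = {True: [], False: []}
    for score, has_facet in rows:
        groups[bool(has_facet)].append(score)
    return {
        flag: {"n": len(scores), "mean": mean(scores), "median": median(scores)}
        for flag, scores in groups.items()
        if scores  # skip empty groups, since mean() rejects empty input
    }


# Invented (ES, has_facet) pairs for illustration.
rows = [(30, True), (10, True), (4, False), (6, False), (2, False)]
summary = summarise_by_facet(rows)
print(summary[True]["mean"], summary[False]["mean"])
```

Reporting the median alongside the mean is useful here because the centre line of a box plot marks the median, and a few high-ES outliers can pull the mean well above it.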

REUSE POTENTIAL
The primary goal of the research was to propose a mixed-method approach to analysing the corpus in order to derive insights into its contents without the need for a researcher to examine each tweet and hand-code it for themes; however, many other uses of the corpus can be devised. This archive of tweets has the potential to inform investigations in many different areas. For example, it can be used to assess the perceived accuracy of the VADER sentiment analysis scoring system. It can also be used to study the online disinhibition effect (ODE). ODE is the supposition that, when given anonymity, people will express themselves in stronger ways than if their speech is attributed.
Lastly, the dataset can be used for sociological or LIS inquiry, such as to investigate a profession's self-image. Within the LIS field, romanticisation of professional self-identity is

Figure 1 Engagement score box plots for tweets with different characteristics.

Figure 2 Engagement score distribution of all tweets.

Figure 3 VADER sentiment score breakdown of all tweets in the archive.

Figure 4 Word cloud of all the tweets in the archive.

Table 1 All columns found in the dataset.