Digital Narratives of COVID-19: A Twitter Dataset for Text Analysis in Spanish.

Digital Narratives of COVID-19 (DHCovid) offers a curated Twitter corpus of digital conversations about the coronavirus pandemic. The dataset is collected through a script via Twitter's Application Programming Interface (API) starting on April 24th, 2020, and stored on GitHub as an open-access repository of tweet identifiers that can be consulted, downloaded, and reused by scholars interested in Natural Language Processing (NLP), topic modelling, and other quantitative methods. A stable version of the dataset has also been released through Zenodo. The Twitter datasets are structured in three main collections: tweets in Spanish worldwide; geolocated tweets in six Spanish-speaking areas spanning North America (Mexico), South America (Argentina, Colombia, Ecuador, Peru), and Europe (Spain); and geolocated tweets in English and Spanish from the greater Miami area in South Florida.

(2) METHOD STEPS
To assemble the Twitter corpus, a script written in the PHP programming language mines the data streaming through Twitter's Application Programming Interface (API) and recovers a series of specific tweet identifiers (IDs). Our data-mining sampling strategy rests on four main variables: language, keywords, region, and date.
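These four variables map naturally onto the parameters of Twitter's standard search API (`q`, `lang`, `geocode`, `until`). The following sketch, written in Python rather than the project's PHP, shows how a query could be assembled from them; the function name and keyword list are illustrative assumptions, not the project's actual code:

```python
def build_query(keywords, lang, geocode=None, until=None):
    """Assemble standard-search parameters from the four sampling variables:
    keywords, language, region (as a "lat,long,radius" circle), and date."""
    params = {"q": " OR ".join(keywords), "lang": lang}
    if geocode:
        params["geocode"] = geocode  # e.g. "21.295658,-100.291341,450mi"
    if until:
        params["until"] = until      # tweets posted before this date (YYYY-MM-DD)
    return params

# Hypothetical usage with an assumed keyword list:
params = build_query(["covid19", "coronavirus"], "es",
                     geocode="21.295658,-100.291341,450mi",
                     until="2020-04-25")
```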
The corpus is available through three repositories:
(1) GitHub. Tweet IDs are stored in a MySQL relational database, where they are "hydrated": that is, all metadata associated with the tweets, including the body text, is recovered. An additional script then organizes the tweet IDs in the database by day, language, and region, and creates a plaintext file for each combination with a list of the corresponding tweet IDs. The script generates these files daily and organizes them into folders, where each directory represents one day. These are uploaded directly to our public GitHub repository (Table 1). The data collection began on April 24th, 2020, and new tweet IDs will be automatically uploaded daily until May 2021.
(2) Project website endpoint. A free-access endpoint for querying and downloading "hydrated" tweets is available from the DHCovid website. An additional script queries the database and recovers the body text of the tweets (see Quality Control section). Access to a tidied and structured Twitter corpus for on-demand querying is one of the most meaningful contributions of our project for data reuse and text-mining activities.
(3) Zenodo. A first stable version of the dataset, published on May 13th, 2020, was released through Zenodo as a compressed ZIP file containing folders of daily tweets posted between April 24th, 2020 and May 12th, 2020. A second and final version will be uploaded at the end of the project in May 2021 with the complete collection of tweet IDs.
Our strategy is also shaped by Twitter API policies. First, the free version of the API does not allow querying tweets older than seven days. Second, Twitter allows users to publish georeferenced tweets (with an exact location), but retrieving geotagged tweets is complicated by the absence of a facility for querying by geographic region. A pragmatic approach led us to define each "country" as a circle surrounding the area of interest; e.g., "Mexico" is defined as latitude 21.295658, longitude -100.291341, and a radius of 450 miles. Political and national borders will not always follow our selection criteria, so an area-specific corpus can sometimes contain tweets from a neighbouring country; e.g., the query for Argentina captures parts of Uruguay, and the one for Colombia parts of Ecuador.
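These circular definitions map directly onto the `geocode` parameter of Twitter's standard search API, which takes a `latitude,longitude,radius` string. A minimal sketch using the Mexico circle given above (the only one specified in the text; the dictionary layout is an illustrative assumption):

```python
# Circular "country" definitions as (latitude, longitude, radius) tuples.
# Only the Mexico circle is given in the text.
REGIONS = {
    "mexico": (21.295658, -100.291341, "450mi"),
}

def geocode_param(region):
    """Build the API `geocode` value, e.g. "21.295658,-100.291341,450mi"."""
    lat, lon, radius = REGIONS[region]
    return f"{lat},{lon},{radius}"
```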

QUALITY CONTROL
The "hydration" of the collected tweet IDs undergoes an additional data-tidying process before any body text is returned to the user. We apply a set of rules to the tweet body text: enforce lowercase, remove accents and punctuation, remove mentions of users (@user) to protect privacy, and replace all links with a generic "URL" term. While enforcing UTF-8 encoding throughout, a particular challenge unique to the Spanish corpus is accented characters and graphemes such as "ñ", which can be difficult to process and preserve. Most of those cases were resolved through a script that detects special entity codes and replaces them with the correct character (e.g., "&ntilde;" as "ñ"). We have also transliterated emojis into their corresponding UTF-8 code points and, for now, eliminated them from our experiments. This processing facilitates the application of NLP techniques.
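The tidying rules above can be sketched as a single function. This is a hypothetical reimplementation in Python under the stated rules, not the project's actual script; in particular, the order of operations and the choice to drop (rather than transliterate) emojis are assumptions:

```python
import html
import re
import unicodedata

def tidy_tweet(text):
    """Hypothetical reimplementation of the tidying rules described above."""
    # Replace HTML entity codes with the correct character (e.g. "&ntilde;" -> "ñ")
    text = html.unescape(text)
    # Enforce lowercase
    text = text.lower()
    # Remove user mentions (@user) to protect privacy
    text = re.sub(r"@\w+", "", text)
    # Replace all links with a generic "URL" term
    text = re.sub(r"https?://\S+", "URL", text)
    # Strip accents while preserving "ñ": protect it, decompose, drop combining marks
    text = text.replace("ñ", "\x00")
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.replace("\x00", "ñ")
    # Remove punctuation (this also drops emojis, which are non-word characters)
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

tidy_tweet("¡Mañana más info! https://t.co/x &ntilde; @user")
# → "mañana mas info URL ñ"
```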

LICENSE
Creative Commons license Attribution 4.0 International (CC BY 4.0).

REPOSITORY NAME
GitHub for the daily updated version; Zenodo for the stable version with a DOI.

PUBLICATION DATE
First published to the GitHub repository on 2020-04-24; subsequently released on Zenodo on 2020-05-13.

STATISTICS AND CONTENTS
In Table 2 and Figure 1 we offer the basic statistics of the dataset.

(4) REUSE POTENTIAL
The COVID-19 pandemic has motivated a wealth of ambitious digital research focusing on Twitter, including the role and importance of automated Twitter accounts, also known as bots (Ferrara, 2020), the rise of politically radical discourse in social media (Jiang, Chen, Yan, Lerman & Ferrara, 2020), and general public perception (Abdo, Alghonaim & Essam, 2020).