Multilingual Analysis and Visualization of Bibliographic Metadata and Texts With the AVOBMAT Research Tool

The objective of this paper is to introduce the workflow of the AVOBMAT (Analysis and Visualization of Bibliographic Metadata and Texts) multilingual research tool, which enables researchers to critically analyse bibliographic data and texts at scale with the help of data-driven methods supported by Natural Language Processing (NLP) techniques. This exploratory tool offers a range of dynamic text and data mining tasks and provides interactive parameter tuning and control from the pre-processing to the analytical stages. Owing to its scalable infrastructure, it can pre-process, analyse and (semantically) enrich a vast number of texts and metadata in several languages. The implemented analytical and visualization tools support both close and distant reading of texts and bibliographic data. AVOBMAT combines bibliographic data and NLP research methods in one integrated, interactive, user-friendly web application, allowing users to ask complex research questions.


UPLOADING THE CORPUS
Users can upload metadata and texts in several formats: Zotero collections in CSV and RDF formats and EPrints (library) repositories as XML files (metadata or metadata with links to the full texts). AVOBMAT can also import full texts, for example, by uploading a zip file of documents along with a CSV of the metadata. Documents from external databases can be imported by providing URLs to the full texts in the CSV. It can process texts in several formats since the Apache Tika library converts them to plain text.

CLEANING THE CORPUS
AVOBMAT provides several options for cleaning the text corpus. For example, users can
• remove non-alphabetical tokens (e.g. of OCR-ed texts);
• upload a list of words and replace words (e.g. synonyms) and characters;
• make use of regular expressions.
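The cleaning options listed above can be illustrated with a minimal Python sketch. It is not AVOBMAT's implementation: the function name `clean_text`, its parameters and the sample sentence are hypothetical, and "non-alphabetical" is assumed here to mean a token without any alphabetic character.

```python
import re


def clean_text(text, replacements=None, patterns=None, drop_non_alpha=True):
    """Apply user-defined cleaning steps to one document (illustrative only)."""
    # 1. Regular-expression substitutions, applied to the raw text.
    for pattern, repl in (patterns or []):
        text = re.sub(pattern, repl, text)
    tokens = text.split()
    # 2. Word-level replacements (e.g. mapping a variant or synonym to one form).
    replacements = replacements or {}
    tokens = [replacements.get(tok, tok) for tok in tokens]
    # 3. Drop tokens that contain no alphabetic character (typical OCR noise).
    if drop_non_alpha:
        tokens = [tok for tok in tokens if any(ch.isalpha() for ch in tok)]
    return " ".join(tokens)


cleaned = clean_text(
    "The lodge ~~ met in 1723 at ye tavern ||",
    replacements={"ye": "the"},      # hypothetical user-supplied word list
    patterns=[(r"\s+", " ")],        # hypothetical user-supplied regex
)
print(cleaned)
```

Note that, under this assumption, purely numeric tokens such as years are removed together with OCR debris, so the order and scope of the cleaning steps matter for diachronic corpora.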
A context filter is implemented to keep the context of a keyword or keywords and remove all other parts of the document.
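A context filter of this kind might be sketched as follows. The window size, the token-based definition of "context" and the function name `context_filter` are assumptions made for illustration; the source does not specify how AVOBMAT delimits the context.

```python
def context_filter(tokens, keywords, window=5):
    """Keep only tokens within `window` positions of any keyword occurrence."""
    keywords = {k.lower() for k in keywords}
    keep = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok.lower() in keywords:
            # Mark the keyword and its neighbours on both sides.
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                keep[j] = True
    return [tok for tok, flag in zip(tokens, keep) if flag]


sample = "a b c magic d e f g h i magic j".split()
filtered = context_filter(sample, ["magic"], window=1)
print(filtered)
```

Overlapping windows are merged automatically because each token is marked at most once.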

CONFIGURING THE PARAMETERS
Users can create different configurations for each analysis, where the outcome depends on the language of the texts. There are two ways to assign a language to a document: researchers can manually select a language for the full dataset (52 languages) or choose the automatic language detection option. With the latter, the system chooses a language independently for each document. Based on the language, it offers stopword and punctuation filtering, drawing on the spaCy library, and lemmatization (SpaCy Models and Languages). Extra stopword and punctuation lists can also be added. Lemmatization relies on spaCy language models, with LemmaGen models being used for languages not supported by spaCy (Juršic et al., 2010).
The following pre-processing options are implemented:
• choose spaCy language model (small, large or transformer);
• make text lowercase;
• remove numbers;
• set minimal character length.
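The token-level options above can be sketched in plain Python. This example deliberately does not call spaCy; the stopword set, the default values and the function name `preprocess` are hypothetical, and lemmatization is omitted because it requires a language model.

```python
import string


def preprocess(tokens, lowercase=True, remove_numbers=True, min_length=3,
               stopwords=frozenset(), punctuation=frozenset(string.punctuation)):
    """Filter a token list according to user-selected options (illustrative)."""
    result = []
    for tok in tokens:
        if lowercase:
            tok = tok.lower()
        # Skip tokens made up solely of punctuation characters.
        if all(ch in punctuation for ch in tok):
            continue
        # Skip purely numeric tokens.
        if remove_numbers and tok.isdigit():
            continue
        # Enforce the minimal character length.
        if len(tok) < min_length:
            continue
        # Stopword lookup is case-insensitive.
        if tok.lower() in stopwords:
            continue
        result.append(tok)
    return result


raw = ["The", "Lodge", "met", "in", "1723", ",", "at", "the", "tavern", "."]
kept = preprocess(raw, stopwords={"the", "met"})
print(kept)
```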
The metadata enrichment includes the identification of the gender of the authors (male, female, unknown gender or without author) and automatic language detection. Users can also upload a list of male and female first names, supplementing and replacing the ones found in the dictionaries of the programme.
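A dictionary-based gender lookup of the kind described could look like the sketch below. The name lists, the handling of "Surname, Given name" order and the function name `author_gender` are assumptions for illustration only; the four return values mirror the categories named in the text.

```python
def author_gender(author, male_names, female_names):
    """Classify an author string by first name using two name lists."""
    if not author or not author.strip():
        return "without author"
    name = author.strip()
    # Assumption: "Surname, Given" and "Given Surname" are both possible.
    if "," in name:
        name = name.split(",", 1)[1]
    parts = name.split()
    if not parts:
        return "unknown gender"
    first = parts[0].lower()
    if first in male_names:
        return "male"
    if first in female_names:
        return "female"
    return "unknown gender"


male = {"james", "dan"}          # hypothetical user-uploaded list
female = {"joanne", "mary"}      # hypothetical user-uploaded list
labels = [author_gender(a, male, female)
          for a in ["James Anderson", "Rowling, Joanne", "J. K. Rowling", ""]]
print(labels)
```

Initials and names missing from both lists fall into the "unknown gender" category, which is why user-supplied lists can noticeably change the result.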
As for topic modelling, the user also has the option to separate the documents into sections of equal size. Users can specify the so-called window length for certain lexical diversity analyses (MSTTR, MATTR).
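To show what the window length controls, the following sketch computes both measures from their usual definitions: MSTTR averages the type-token ratio (TTR) over consecutive non-overlapping segments, while MATTR averages it over a window that slides one token at a time. The handling of texts shorter than the window is an assumption, and the code is not taken from AVOBMAT.

```python
def ttr(tokens):
    """Type-token ratio of a token list (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def msttr(tokens, window):
    """Mean segmental TTR: non-overlapping segments, incomplete tail dropped."""
    segments = [tokens[i:i + window]
                for i in range(0, len(tokens) - window + 1, window)]
    if not segments:              # text shorter than one window
        return ttr(tokens)
    return sum(ttr(seg) for seg in segments) / len(segments)


def mattr(tokens, window):
    """Moving-average TTR: window slides forward one token at a time."""
    if len(tokens) < window:      # text shorter than one window
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)


words = "a b a b c d c d".split()
print(msttr(words, 4), mattr(words, 4))
```

In this toy example the two segments each repeat two types, so MSTTR is lower than MATTR, whose sliding windows also cover the more varied middle of the text.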

VALIDATING THE SETTINGS
AVOBMAT cleans and pre-processes a small sample of the uploaded database where the user can check if the set parameters are appropriate. The settings can be saved in a template if the configuration is acceptable. If the parameters need to be fine-tuned, the user can start the cleaning and configuration process again. AVOBMAT identifies missing values and gaps in the metadata.

FILTERING THE CORPUS
The user can search and filter the metadata and texts in faceted, advanced and command-line modes and perform all the subsequent analyses on the filtered dataset (Figure 1). The NLP analyses of the documents semantically enrich the metadata. For example, the recognized named entities, such as persons, appear in all types of searches, and the user can search for (disambiguated) named entities in 16 languages. The tool supports fuzzy and proximity searches.

INTERACTIVE METADATA ANALYSIS
Having filtered the uploaded databases and selected the metadata field(s) to be explored (Figure 2), the user can, among other actions,
• analyse and visualize the bibliographic data chronologically in line and area charts in normalized and aggregated formats (Figure 4);
• create an interactive network analysis of the metadata fields (Figure 3);
• make pie, horizontal and vertical bar charts.

INTERACTIVE CONTENT ANALYSIS
The following options are available for interactive content analysis.

N-GRAM VIEWER
This diachronic analysis of texts shows the yearly count of the specified n-grams generated at the pre-processing stage in aggregated and normalized views (Figure 5).
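A yearly n-gram count of this kind can be sketched as below. The interpretation of "aggregated" as raw yearly counts and "normalized" as counts divided by the total number of n-grams in that year is an assumption, as are the function names and the toy data.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return the list of n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_ngram_counts(documents, target, n):
    """documents: iterable of (year, tokens); returns {year: (count, share)}."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for year, tokens in documents:
        grams = ngrams(tokens, n)
        counts[year] += grams.count(target)
        totals[year] += len(grams)
    return {
        year: (counts[year],
               counts[year] / totals[year] if totals[year] else 0.0)
        for year in sorted(totals)
    }


corpus = [
    (1997, "the boy who lived".split()),
    (1997, "the boy".split()),
    (1998, "the chamber".split()),
]
series = yearly_ngram_counts(corpus, "the boy", 2)
print(series)
```

The normalized value makes years with very different amounts of text comparable, which the raw count alone does not.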

FREQUENCY ANALYSIS
Frequency analyses and word clouds can be efficient tools to highlight the prominent terms in a corpus. The significant text analysis tool shows what differentiates a subset of the documents from the others using four different metrics (e.g. chi-square) (Manning et al., 2009; Rudi and Vitányi, 2007; see Significant text aggregation). The TagSphere analysis enables users to investigate the context of a word by creating tag clouds showing the words that co-occur with a specified search term within a specified word distance (Figures 6 and 7) (Jänicke and Scheuermann, 2017). Words can be interactively removed from the clouds. Bar chart versions of the analyses present the applied scores and frequencies.
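The co-occurrence counting that underlies a TagSphere-style view can be approximated with the sketch below. It only counts words within a given token distance of the search term; the function name `cooccurrences` and the toy sentence are hypothetical, and the layout of the actual visualization is not reproduced.

```python
from collections import Counter


def cooccurrences(tokens, term, distance):
    """Count words occurring within `distance` tokens of each `term` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != term:
            continue
        lo = max(0, i - distance)
        hi = min(len(tokens), i + distance + 1)
        for j in range(lo, hi):
            if j != i:            # do not count the hit itself
                counts[tokens[j]] += 1
    return counts


text = "dark magic is old magic and dark".split()
near = cooccurrences(text, "magic", 2)
print(near.most_common())
```

A word that lies within the window of two different hits is counted twice, which is one simple way of weighting words that cluster around the search term.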

KEYWORD-IN-CONTEXT
The keyword-in-context function supports the close reading of texts (Figure 9).
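A keyword-in-context listing can be produced with very little code. The sketch below assumes a token-based context width and case-insensitive matching; the sample sentence is invented for the example and is not a quotation.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


sentence = "the old wand held strong magic for the young wizard".split()
hits = kwic(sentence, "magic", width=2)
for left, key, right in hits:
    # Right-align the left context so that keywords line up in a column.
    print(f"{left:>20}  {key}  {right}")
```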

TOPIC MODELLING
The Latent Dirichlet Allocation function calculates and graphically represents topic models (Blei et al., 2003). It shows the most relevant words and most relevant documents related to each topic, visualizes the distribution of these topics chronologically, highlights the correlation of different topics and exports the results in various formats (Figures 10 and 11). It has the following parameters: the minimum number of occurrences of words, the number of topics and iterations, per-document topic distribution (alpha), and per-topic word distribution (beta) parameters. Users can interactively remove stopwords.
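To make the role of these parameters concrete, the following is a minimal sketch of LDA fitted with collapsed Gibbs sampling. The source does not state which inference method or library AVOBMAT uses, so the sampler, the function name `lda_gibbs`, the default values and the toy documents are all assumptions; only the parameter set (minimum word count, number of topics, iterations, alpha, beta) follows the description above.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01,
              min_count=1, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns (theta, phi)."""
    rng = random.Random(seed)
    # Drop words that occur fewer than `min_count` times in the corpus.
    freq = Counter(w for doc in docs for w in doc)
    docs = [[w for w in doc if freq[w] >= min_count] for doc in docs]
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]        # n(d, k)
    topic_word = [[0] * V for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                      # n(k)
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    # Smoothed per-document topic and per-topic word distributions.
    theta = [[(n + alpha) / (len(doc) + num_topics * alpha) for n in row]
             for row, doc in zip(doc_topic, docs)]
    phi = [{vocab[v]: (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
            for v in range(V)} for k in range(num_topics)]
    return theta, phi


toy_docs = [["wand", "magic", "wand"], ["code", "symbol", "code"],
            ["magic", "wand"], ["symbol", "code"]]
theta, phi = lda_gibbs(toy_docs, num_topics=2, iterations=50)
```

Here `theta` holds one topic distribution per document and `phi` one word distribution per topic; smaller alpha and beta values push both towards sparser distributions. The topic indices are arbitrary and depend on the random seed.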

PART-OF-SPEECH TAGGING
AVOBMAT currently identifies part-of-speech tags in 16 languages by using the spaCy language models. It produces different interactive visualizations and statistical tables of the results (Figures 12 and 13).

NAMED ENTITY RECOGNITION, DISAMBIGUATION AND LINKING
AVOBMAT currently identifies named entities such as persons and places in 16 languages. The number and type of named entities differ by language, as seen in Table 1. AVOBMAT creates different statistical tables and visualizations of these entities. The entities are also displayed in the full-text view. For English, it disambiguates the entities and links them to Wikidata, VIAF and ISNI (Figure 14).

Figure 2 Interactive metadata visualization setting.

Figure 3 Network analysis of authors, publishers and booksellers involved in the publications of 18th-century books concerning Freemasonry with a particular focus on James Anderson (author).

Figure 4 Chronological distribution of the detected languages of the 53411 articles and books in the University of Szeged publication repository.

Figure 7 The same TagSphere analysis as in Figure 6. Bar chart view with statistical data.

Figure 8 Lexical diversity metrics in J. K. Rowling's Harry Potter novels.

Figure 9 Keyword-in-context. The word "magic" in J. K. Rowling's Harry Potter and the Philosopher's Stone.

Figure 12 Part-of-speech analysis in Dan Brown's novels.

Figure 10 Topic modelling of Dan Brown's novels.

Figure 13 Part-of-speech analysis of J. K. Rowling's Harry Potter novels. Statistical results.

Table 1 Named entity recognition in different languages.
Figure 14 Named entity recognition and linking in Dan Brown's Da Vinci Code.