The ChildPoeDE Corpus: 1082 German Children’s Poems for Computational and Experimental Studies on Poetry Reception


(2) METHOD

STEPS
Based on recommendations by five experts on German children's literature and poetry (all professors or research associates from the fields of German Studies, Didactics or Pedagogy), we selected the following seven anthologies as sources: Großer Ozean (Gelberg, 2000), Die schönsten Kindergedichte (Kruse, 2003), Sieben Ziegen fliegen durch die Nacht (Gutzschhahn, 2018), Sieben kecke Schnirkelschnecken (Sailer, 2010), Im Mondlicht wächst das Gras (Andresen, 1991), Ich liebe dich wie Apfelmus (Fried & Hein, 2006) and So viele Tage wie das Jahr hat (Krüss, 1998). The experts were asked to suggest anthologies based on the following criteria: the poems should still be widely read today, be aimed mostly at primary school children, have been written between 1800 and 2018, cover a wide range of poetry (classics as well as lesser-known poems), be written in German and rely on text (not on pictures) to convey meaning. We chose these seven anthologies because they were named multiple times by different experts. Since we intended to provide data suitable as stimulus material in contemporary studies, we only included anthologies with editions published in the last 25 years. The texts were digitised using OCR software (Tesseract and Adobe).
Further, we collected poem-level and token-level metadata (csv). Some poem-level information was added manually (author, title, anthology, anthology count, publisher, publication year, ISBN). From the Integrated Authority File (GND), we retrieved additional data about the authors (GND id, author gender, year of birth, year of death) to ensure their accurate identification. We decided to provide the author's year of birth and year of death as an indication of when the poem could have been published, as it was difficult and, in some cases, impossible to find original publication dates for single poems. Most features, however, were extracted with our own Python script (poemtool.py), e.g. word, stanza and line counts and data on case, punctuation, layout, rhyme and sonority. To determine rhyme patterns, we used rhymetagger (Plecháč, 2018). Calculations for the sonority score are based on Jacobs (2017) and Stenneken et al. (2005). We also calculated the lexical density and type-token ratio (TTR) for each poem to provide information on lexical richness. Along with the standard TTR, we computed Moving-Average-TTRs (MATTRs) to account for different text lengths (Covington & McFall, 2010). As MATTRs are usually computed for longer texts, we used different window sizes. All TTR and MATTR values were calculated using the R package quanteda (Benoit et al., 2018). Data on onomatopoeia was annotated manually. The token-level metadata file additionally provides data on word length, word position and parts of speech at different levels of granularity. Part-of-speech information was generated with TreeTagger (Schmid, 1995). We also published a frequency table with absolute and relative frequencies for all tokens present in the corpus.
Figure 1 summarises the childPoeDE corpus in descriptive statistics. It includes frequency tables for the features special layout, rhyme and onomatopoeia, histograms with boxplots for poem length (measured in the number of stanzas and lines), poem sonority, TTR, lexical density and rhyming degree, a word cloud of the most frequent content words, a pie chart on gender distribution and a table with the ten most frequent authors and the number of poems they contributed to the corpus.
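As a rough illustration of the two lexical-richness measures described in this section, the following Python sketch computes a standard TTR and a MATTR over a list of tokens. It is not the quanteda implementation used for the published values: the function names, the lower-casing step, the fallback to plain TTR for texts shorter than the window and the toy token list are all assumptions made for this example.

```python
from typing import List


def ttr(tokens: List[str]) -> float:
    """Type-token ratio: number of distinct tokens divided by number of tokens.

    Tokens are lower-cased before counting (an assumption of this sketch).
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    types = {token.lower() for token in tokens}
    return len(types) / len(tokens)


def mattr(tokens: List[str], window: int) -> float:
    """Moving-average TTR: mean TTR over all overlapping windows of a fixed size."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        # Assumption: texts shorter than the window fall back to plain TTR.
        return ttr(tokens)
    ratios = [
        ttr(tokens[start:start + window])
        for start in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical toy input, not taken from the corpus.
TOY_TOKENS = ["die", "katze", "sieht", "die", "maus", "die", "maus", "rennt"]

if __name__ == "__main__":
    print(ttr(TOY_TOKENS))                  # 5 types / 8 tokens = 0.625
    print(round(mattr(TOY_TOKENS, 5), 2))   # windows give 0.8, 0.8, 0.6, 0.6
```

Because every window has the same length, the MATTR is less sensitive to overall text length than the plain TTR, which is why short poems call for small windows.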

SAMPLING STRATEGY
We included as many poems from the anthologies as possible. However, poems relying on pictures, graphical layout or typography to convey meaning were excluded, as were poems that used archaic or difficult language (e.g. all poems from "Des Knaben Wunderhorn") and poems consisting of a single repeated word. A list of the omitted poems can be found on Zenodo. If a poem appeared in more than one anthology, this was noted in the column "anthology count" in the poem-level metadata file. The childPoeDE corpus in its current state is a first (yet still imperfect) attempt to collect data on German poetry for children. Ideally, a corpus should be balanced with regard to author gender. We will work towards this in the future. For now, the gender imbalance of the corpus reflects the gender imbalance present in the anthologies.
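The "anthology count" column was added manually. Purely as an illustration of how such a count could be cross-checked, the sketch below groups records by author and title and counts the distinct anthologies per poem. The function name, the record layout and the placeholder authors and titles are hypothetical; only the anthology titles come from the text above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str, str]  # (author, title, anthology): hypothetical layout


def anthology_counts(records: Iterable[Record]) -> Dict[Tuple[str, str], int]:
    """Count in how many distinct anthologies each (author, title) pair occurs."""
    seen: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for author, title, anthology in records:
        # Assumption: a poem is identified by case-insensitive author and title.
        key = (author.strip().casefold(), title.strip().casefold())
        seen[key].add(anthology.strip())
    return {key: len(anthologies) for key, anthologies in seen.items()}


# Placeholder records; the poems and authors are invented for the example.
SAMPLE_RECORDS = [
    ("Author A", "Poem X", "Großer Ozean"),
    ("Author A", "Poem X", "Die schönsten Kindergedichte"),
    ("Author B", "Poem Y", "Großer Ozean"),
]

if __name__ == "__main__":
    for key, count in anthology_counts(SAMPLE_RECORDS).items():
        print(key, count)
```

In practice, title variants across anthologies would need manual review, which is presumably one reason the column was curated by hand.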

QUALITY CONTROL
All texts were checked for OCR errors. Additionally, whitespace and special characters, such as quotation marks, were normalised. We also harmonised the poems' structure to simplify automatic text processing. Detailed information on normalisation processes and explanations of text features can be found in the README files on Zenodo. The part-of-speech data was checked and manually corrected if necessary. Finally, we conducted a quality check by reviewing randomly selected data.
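The exact normalisation rules are documented in the README files and are not reproduced here. As a minimal sketch of what whitespace and quotation-mark normalisation of this kind can look like, the following Python function maps typographic quotes to plain ASCII quotes, collapses runs of spaces and tabs, and reduces multiple blank lines to a single stanza break. The function name, the set of characters handled and the convention of one blank line between stanzas are assumptions of this example, not the authors' rules.

```python
import re

# Assumption: all typographic quotation marks are mapped to ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "„": '"', "“": '"', "”": '"', "»": '"', "«": '"',
    "‚": "'", "‘": "'", "’": "'",
})


def normalise_poem(text: str) -> str:
    """Normalise quotation marks and whitespace in a poem (illustrative only)."""
    text = text.translate(QUOTE_MAP)
    # Collapse runs of spaces, tabs and no-break spaces, then trim each line.
    lines = [
        re.sub(r"[ \t\u00a0]+", " ", line).strip()
        for line in text.splitlines()
    ]
    joined = "\n".join(lines).strip("\n")
    # Assumption: stanzas are separated by exactly one blank line.
    return re.sub(r"\n{3,}", "\n\n", joined)


if __name__ == "__main__":
    raw = "„Guten  Tag“,\tsagt er. \n\n\n\nEr geht."
    print(normalise_poem(raw))
```

A uniform stanza separator makes features such as stanza and line counts straightforward to derive by splitting on blank lines.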

(4) REUSE POTENTIAL
Although there is much research available on German poetry, both on corpora (e.g. Haider & Eger, 2019) and computational assessments (e.g. Reinig & Rehbein, 2019), these works never focus on German poetry for children alone. Thus, our data offers new research scenarios for anyone interested in poetry for children, such as empirical scholars, researchers in didactics or digital humanists. In experimental studies, the texts can be used as stimulus material to investigate children's emotional involvement when reading poetry. Elaborate metadata allows for a precise poem selection along specific criteria, including rhyme, sonority or onomatopoeia. However, the corpus cannot provide all information that might be useful for empirical studies, such as publication dates for individual poems or an evaluation of age appropriateness.
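To illustrate how stimulus selection along such criteria might work, the sketch below filters poem-level metadata rows by length and by the presence of onomatopoeia. The column names (poem_id, line_count, onomatopoeia), the 0/1 coding and the sample rows are hypothetical; the actual column names and value codings are described in the README files on Zenodo.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional


def select_stimuli(
    rows: Iterable[Dict[str, str]],
    min_lines: int,
    max_lines: int,
    onomatopoeia: Optional[bool] = None,
) -> List[str]:
    """Return ids of poems within a line-count range, optionally filtered
    by whether they contain onomatopoeia (None means no filter)."""
    selected = []
    for row in rows:
        n_lines = int(row["line_count"])
        if not min_lines <= n_lines <= max_lines:
            continue
        # Assumption: onomatopoeia is coded as "1" (present) or "0" (absent).
        has_onomatopoeia = row["onomatopoeia"] == "1"
        if onomatopoeia is not None and has_onomatopoeia != onomatopoeia:
            continue
        selected.append(row["poem_id"])
    return selected


# Invented sample data standing in for the poem-level metadata file.
SAMPLE_CSV = """poem_id,line_count,onomatopoeia
p001,8,1
p002,24,0
p003,12,1
p004,6,0
"""

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    print(select_stimuli(rows, 6, 12, onomatopoeia=True))  # ['p001', 'p003']
```

The same pattern extends to other criteria, such as rhyme or sonority, by adding further conditions on the corresponding columns.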
In the context of digital humanities, especially computational literary studies, our data allows for investigations of different poetic features and their correlations. Many approaches from the field of Natural Language Processing can be applied to the data and might yield new insights into German poetry for children. These include linguistic corpus analysis, sentiment analysis (for an example on children's books, see Jacobs et al., 2020), text similarity assessment, topic modelling, named entity recognition or explorative approaches through visualisations.
Overall, the childPoeDE corpus lays the foundations for a wide range of research scenarios while remaining extensible. The data could be enriched with additional metadata (e.g. sentiment values, reading age or text complexity measures), linked to other data sets through the authors' GND ids or used for comparisons with corpora from other genres (e.g. childLex (Schroeder et al., 2015)).