A Survey of Body Part Construction Metaphors in the Neo-Assyrian Letter Corpus

The dataset consists of approximately 2,400 examples of what we term Body Part Constructions (BPCs), a vehicle for metaphor in Akkadian, drawn from the letter sub-corpus of the State Archives of Assyria online (SAAo). The dataset was generated by a multi-step process: training a language model and applying it to the SAAo letter sub-corpus, converting the resulting annotations to a linked open data format amenable to searching for BPCs, and manually adding metalinguistic data to the search results. The intermediate annotation files, in CONLLU and TTL formats, are also made available in this publication. The BPC dataset is stored as a CSV file and can serve as an easy starting place for other scholars interested in finding socio-linguistic usage patterns of this construction.


CONTEXT
The royal archives of the late Neo-Assyrian kings (8th-7th century BCE) constitute an important source for understanding many facets of the Neo-Assyrian empire. Ranging from treaty tablets and legal documents to prophecies, ritual instructions, and even court literature, the approximately five thousand texts in this corpus primarily come from the palatial complex at Nineveh and document the reigns of Sargon II (r. 721-705), Sennacherib (r. 704-681), Esarhaddon (r. 680-669), and Assurbanipal (r. 668-627). Over the past four decades, much of these archives has been published in the State Archives of Assyria (SAA) volumes at the University of Helsinki, and in more recent years has appeared digitally under the Munich Open-access Cuneiform Corpus Initiative (LMU Munich) as the State Archives of Assyria online (SAAo).
Within this corpus, the set of letters constitutes a sizeable sub-corpus that is valuable not only in terms of reconstructing social history, but also as a representative of vernacular late Akkadian. It is this linguistic fact that motivates our dataset, discussed more extensively in Ong and Gordin (under review). Here we provide a summary of the dataset's contents. It is a CSV file with approximately 2,400 examples of what we term Body Part Constructions (BPCs) in Akkadian, where a BPC is defined as a verb with a compound prepositional phrase based on a body part term. For instance, X ina qāt Y šûlû literally means 'to lift X from the hand of Y', but colloquially means 'to estrange X from Y'. These constructions are interesting as they are a productive vehicle for metaphors in Akkadian as well as a socio-linguistic feature of various subgroups within the Neo-Assyrian letter corpus.
For a given BPC, the CSV file lists the lexical item representing each syntactic component of the BPC (verb, body part term, direct object, etc.), the lemmas associated with these lexical items, and the CDLI P-number of the text that the BPC appears in.¹ Most BPCs also come with a translation in context, as well as a number of fields describing the letter they appear in, such as sender, receiver, sender's location, date of composition, dialect and script of composition, genre, and provenience, alongside additional linguistic and rhetorical properties of the BPC. Most of these fields are described in detail in Ong and Gordin (under review).
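As a rough illustration of how a file of this shape might be explored, the following Python sketch loads a few rows and counts attestations by construction and by dialect. The column names (`p_number`, `verb_lemma`, `body_part_lemma`, and so on), the P-numbers, and the sample rows are all hypothetical placeholders; the published CSV may use different headers and values.

```python
import csv
import io
from collections import Counter

# Hypothetical header and rows: the real column names and values in the
# published CSV may differ. The P-numbers below are placeholders.
SAMPLE_CSV = """\
p_number,verb_lemma,body_part_lemma,preposition,sender,dialect
P000001,elû,qātu,ina,sender_a,Neo-Assyrian
P000002,šakānu,pānu,ina,sender_b,Neo-Assyrian
P000003,elû,qātu,ina,sender_b,Neo-Babylonian
"""


def load_bpcs(handle):
    """Read BPC attestations into a list of dicts, one per CSV row."""
    return list(csv.DictReader(handle))


def count_by(rows, *fields):
    """Count attestations grouped by the given column names."""
    return Counter(tuple(row[field] for field in fields) for row in rows)


rows = load_bpcs(io.StringIO(SAMPLE_CSV))
by_construction = count_by(rows, "verb_lemma", "body_part_lemma")
by_dialect = count_by(rows, "dialect")
```

To work with the actual file, one would pass an opened file handle (with `encoding="utf-8"` and `newline=""`) to `load_bpcs` and substitute the real column names.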

METHOD
The dataset was generated via a three-step process:

LANGUAGE MODEL TRAINING
We first trained a spaCy language model² on a subset of manually annotated, normalized texts drawn from a variety of Oracc projects (consisting both of letters and other genres).³ The training set included all texts in SAAo 1, 2, 5, 9, 15, and 21; a small set of texts from each of SAAo 8, 10, 13, 16, 17, 18, and 19; SB Anzu; a few extispicy texts from the Corpus of Ancient Mesopotamian Scholarship (CAMS)/Barutu project; selected royal inscriptions of Esarhaddon found in the Royal Inscriptions of Assyria Online project (RIAO); a few Middle Assyrian letters from the Text Corpus of Middle Assyrian project (TCMA); and about two hundred and fifty synthetic training sentences generated manually.⁴ The model files and training, development, and test data are available from the above-listed Zenodo repository as well as the primary author's GitHub account.⁵ The manual annotations for these texts were made in INCEpTION and then encoded in CONLLU format (Figure 1).⁶ We also incorporated into our training set the annotated Neo-Assyrian royal inscriptions from Luukko, Sahala, Hardwick, and Lindén (2020), slightly modified to match our own annotation format.
We then used the resulting language model to generate automatic parses of the remainder of the SAAo letter sub-corpus and combined them with the original training set to yield a complete set of annotations for the SAAo letters in CONLLU format. More information about the accuracy of the automatic parses can be found in Ong and Gordin (under review).

¹ The Cuneiform Digital Library Initiative (CDLI) P-number is a conventional ID in the field of cuneiform studies.
⁴ The Oracc pages for these texts are:
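For readers unfamiliar with the CONLLU (CoNLL-U) format, the following minimal Python sketch reads annotations of this kind into token records. It relies only on the general CoNLL-U layout (ten tab-separated columns, `#` comment lines, blank lines between sentences). The sample sentence and its dependency labels are invented for illustration and are not taken from the dataset; the authors' actual annotation scheme may differ.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    upos: str
    head: int
    deprel: str


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (each a list of Token)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment / metadata line
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():     # skip multiword ranges (1-2) and empty nodes (1.1)
            continue
        head = int(cols[6]) if cols[6].isdigit() else -1
        current.append(Token(int(cols[0]), cols[1], cols[2], cols[3], head, cols[7]))
    if current:
        sentences.append(current)
    return sentences


def _row(*cols):
    return "\t".join(cols)


# Invented example; the annotation choices here are assumptions.
SAMPLE = "\n".join([
    "# text = ina qāt šarri ušēli",
    _row("1", "ina", "ina", "ADP", "_", "_", "2", "case", "_", "_"),
    _row("2", "qāt", "qātu", "NOUN", "_", "_", "4", "obl", "_", "_"),
    _row("3", "šarri", "šarru", "NOUN", "_", "_", "2", "nmod", "_", "_"),
    _row("4", "ušēli", "elû", "VERB", "_", "_", "0", "root", "_", "_"),
]) + "\n"

sentences = parse_conllu(SAMPLE)
```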

CONVERSION TO LINKED OPEN DATA
We converted the CONLLU format annotations from the previous step to RDF Turtle format (TTL) using a Java package named conll-rdf (Figure 2).⁷ Each SAAo volume was associated with a single TTL file forming part of our dataset. Both the CONLLU and TTL files of SAAo texts, broken down by individual SAA volume, are made available with the current BPC dataset within the above-listed Zenodo repository.⁸ The TTL files were then uploaded to TriplyDB, a data hosting service that enables users to easily query their own linked open data projects through a variety of APIs.⁹ We then used a SPARQL query to search through the various SAAo letter volumes individually for the exact syntactic and lexical features defining BPCs.¹⁰ The resulting attestations were then amalgamated into a CSV file.
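The SPARQL query itself is not reproduced here. As a hedged illustration of the kind of pattern such a search targets, the Python sketch below looks for a verb governing a body-part noun that is in turn marked by a preposition. It operates on plain dependency records rather than RDF, so it is a stand-in for the authors' method, not a rendering of it. The body-part list, the dependency labels, and the sample sentences are assumptions made for this example.

```python
# Hypothetical list of body-part lemmas (hand, face, eye, heart);
# the set actually used in the dataset may differ.
BODY_PART_LEMMAS = {"qātu", "pānu", "īnu", "libbu"}


def find_bpcs(sentence):
    """Return verb / preposition / body-part triples found in one sentence.

    `sentence` is a list of dicts with keys id, lemma, upos, head, deprel.
    A match is: ADP --case--> body-part NOUN --(any relation)--> VERB.
    """
    by_id = {tok["id"]: tok for tok in sentence}
    matches = []
    for tok in sentence:
        if tok["upos"] != "ADP" or tok["deprel"] != "case":
            continue
        noun = by_id.get(tok["head"])
        if noun is None or noun["lemma"] not in BODY_PART_LEMMAS:
            continue
        verb = by_id.get(noun["head"])
        if verb is None or verb["upos"] != "VERB":
            continue
        matches.append({
            "verb": verb["lemma"],
            "preposition": tok["lemma"],
            "body_part": noun["lemma"],
        })
    return matches


# Invented sentences: "ina qāt šarri ušēli" and "ina bīti ušib".
WITH_BPC = [
    {"id": 1, "lemma": "ina", "upos": "ADP", "head": 2, "deprel": "case"},
    {"id": 2, "lemma": "qātu", "upos": "NOUN", "head": 4, "deprel": "obl"},
    {"id": 3, "lemma": "šarru", "upos": "NOUN", "head": 2, "deprel": "nmod"},
    {"id": 4, "lemma": "elû", "upos": "VERB", "head": 0, "deprel": "root"},
]
WITHOUT_BPC = [
    {"id": 1, "lemma": "ina", "upos": "ADP", "head": 2, "deprel": "case"},
    {"id": 2, "lemma": "bītu", "upos": "NOUN", "head": 3, "deprel": "obl"},
    {"id": 3, "lemma": "wašābu", "upos": "VERB", "head": 0, "deprel": "root"},
]

hits = find_bpcs(WITH_BPC)
misses = find_bpcs(WITHOUT_BPC)
```

A graph query expresses the same idea declaratively, as a set of triple patterns over token nodes, which is what makes the linked open data representation convenient for this kind of search.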

ENRICHING THE METADATA
The last step involved enriching the search results from Step 2 in two ways. First, we extracted certain metadata from the Oracc JSON files associated with the SAAo letter sub-corpus and merged it into the list of attestations. The result was an enlarged CSV file which provided, for a BPC within a given letter, the dialect of Akkadian, likely date and ruler under which that letter

⁵ The model distribution is provided under ak_AkkParser_Norm_1_2_5_8_9_10_13_15_16_17_18_19_21_anzu_barutu_rinap4_tcmaassur-0.0.0.tar.gz in the Zenodo repository. The training, development, and test data are available in CONLLU format under the ak_norm_conllu.zip file. Given that spaCy requires one to first convert all training data to a special binary format before it can be used for model training, we have also included these pre-compiled binaries under the ak_norm_spacy.zip file of the Zenodo archive. The CONLLU, model, and binary files may also be found at the primary author's GitHub repository: https://github.com/megamattc/Akkadian-language-models/tree/main/ak_norm_model/assets/UD_Akkadian, https://github.com/megamattc/Akkadian-language-models/tree/main/ak_norm_model/packages/ak_AkkParser_Norm_1_2_5_8_9_10_13_15_16_17_18_19_21_anzu_barutu_rinap4_tcmaassur-0.0.0/ak_AkkParser_Norm_1_2_5_8_9_10_13_15_16_17_18_19_21_anzu_barutu_rinap4_tcmaassur and https://github.com/megamattc/Akkadian-language-models/tree/main/ak_norm_model/corpus/UD_Akkadian, respectively (last accessed: 20 November 2023). Readers wishing to train the model from scratch should also consult the documentation available at https://github.com/megamattc/Akkadian-language-models/tree/main (last accessed: 20 November 2023).
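The first enrichment, merging catalogue metadata into the list of attestations, can be sketched in Python as a lookup keyed on the P-number. The JSON layout assumed here (a top-level `members` object keyed by P-number), the field names, and all sample values are assumptions made for illustration and should be checked against the actual Oracc JSON files.

```python
import json

# Assumed catalogue layout and placeholder values; not taken from Oracc.
CATALOGUE_JSON = """
{
  "members": {
    "P000001": {"genre": "letter", "provenience": "Nineveh"},
    "P000002": {"genre": "letter", "provenience": "Kalhu"}
  }
}
"""

# Hypothetical attestation rows, as might come out of the search step.
ATTESTATIONS = [
    {"p_number": "P000001", "verb_lemma": "elû", "body_part_lemma": "qātu"},
    {"p_number": "P999999", "verb_lemma": "šakānu", "body_part_lemma": "pānu"},
]


def enrich(attestations, catalogue, fields):
    """Return copies of the attestation rows with catalogue fields added.

    Rows whose P-number is missing from the catalogue get empty strings,
    so every output row has the same set of columns.
    """
    members = catalogue.get("members", {})
    enriched = []
    for att in attestations:
        meta = members.get(att["p_number"], {})
        row = dict(att)
        for field in fields:
            row[field] = meta.get(field, "")
        enriched.append(row)
    return enriched


catalogue = json.loads(CATALOGUE_JSON)
enriched = enrich(ATTESTATIONS, catalogue, ["genre", "provenience"])
```

Filling missing metadata with empty strings rather than dropping the row keeps every attestation in the output, which matters when the enriched list is written back out as a CSV with a fixed header.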

Figure 1 Section of SAAo letter in CONLLU format.

Figure 2 Section of SAAo letter in TTL format.