CREMMA Medii Aevi: Literary manuscript text recognition in Latin

This paper presents a novel segmentation and handwritten text recognition dataset for Medieval Latin from the 11th to the 16th century. It connects with Medieval French datasets, as well as earlier Latin datasets, by enforcing common guidelines, bringing 263,000 new characters and now totaling over a million characters for medieval manuscripts in both languages. We provide our own addition to Ariane Pinche's Old French guidelines to deal with specific Latin cases. We also offer an overview of how we addressed this dataset compilation through the use of pre-existing resources. With a higher abbreviation ratio and a better representation of abbreviating marks, we offer new models that outperform the Old French base model on Latin datasets, improving accuracy by 5% on unknown Latin manuscripts.

Our dataset builds on the experience of Ariane Pinche, specifically her work on the CREMMA Medieval dataset, which treats different variations of Old French from the 13th to the 15th century, with a heavy focus on the first section of the period. As the first recipient of the CREMMALab post-doctoral funding, Pinche co-organized a research seminar around the formalization of transcription guidelines for graphemic transcription of Old French (Pinche, 2022c). Based on her recommendations, a few datasets emerged around the École nationale des chartes and the CREMMA project. Notably, the Gallic(orpor)a corpora (Gabay, Pinche, Leroy, & Christensen, 2022) and the course project DecameronFR (Biay, Boby, Konstantinova, & Cappe, 2022) provided two additions for Old French and Middle French data, centered around the end of the Middle Ages. In contrast, the Caroline Minuscule project (Hawk, Karaisl, & White, 2018) was realigned in ALTO XML and adapted to the guidelines as it provided some foundations for recognizing the

GENERAL ASPECTS OF THE CORPUS
Corpus construction theory

Borrowing the terminology from the linguistic domain (Bauer & Aarts, 2000), where data construction methods have long been examined, evaluated, and reconsidered, we shall examine the following methodological aspects. Contrary to the notion of "sampling", which is, by definition, a random selection procedure, "corpus construction" implies a systematic selection of materials that obeys a specific rationale, whose efficiency depends on the research question. "Representative sampling" is where these two approaches converge. Sampling secures efficiency in research by providing a rationale for studying only parts of a population without losing information. Its key feature is "representativeness" of the system in question. Sampling criteria and focal variables correlate. In HTR for medieval manuscripts, "representativeness" was approached in terms of the characteristics of the medieval handwritten Latin language as a system comprising abbreviations, ligatures, and punctuation signs alongside graphemes. Different genres, scripts, and their degrees of formality served as instances of this system.

Document sampling strategy

Of the three registers making up the construction of a qualitative corpus according to Bauer and Aarts (2000), namely channel, domain, and function, only the first parameter is constant in our case: the sample represents exclusively the written Latin language while giving room to texts of multiple functions, addressed to different audiences and belonging to various genres (while not aiming at exhaustiveness at this stage). The corpus construction can be regarded as a cyclical process: it has not been entirely determined a priori but rather evolved, bearing in mind the logic of complementarity regarding the already existing datasets. Estimated abbreviation rate, use of specific characters, and known genres and scripts were the criteria used to compensate for what was thought to be missing from the network of datasets and the corpus itself, in order to make it as "representative" as possible. HTR engines are language agnostic, but the same cannot be said for the resulting models: the representativeness of the sample determines whether a model will work on "similar" or "out-of-domain" documents.
Three distinct selection processes have been applied in our case:

1. The first set of documents was selected purely on its linguistic features, its readability, and its availability as both digitized manuscripts and editions, which could be found either online or in local libraries. This led to the inclusion of classical texts such as Seneca's Medea. Script was not taken into account.

2. In a logic of complementarity, the second part of the corpus was inversely dictated by content. More specifically, given the relative absence of ligatures and abbreviations in classical texts, we chose documents that would display a higher degree of abbreviation. This induced a genre selection process, specifically for medical and scholastic data. At the same time, script diversity was added to the considerations and came naturally as a sort of by-product.

3. Finally, as we wanted to test Kraken models, we sought a transcription project that would provide us with data that would help us evaluate our own. This led to the alignment of the Eichenberger and Suwelack (2021) dataset, produced in the context of a transcribathon in Berlin and containing genres new to our corpus (Books of Hours, Psalms, etc.).

Quantitative aspects of the corpus
Corpus size depends largely on the subjective criteria and resources of each project, and little can be said as a general rule: one needs to consider the limitations that stem from the effort put into producing the corpus, the budget available, the number of representations one wants to characterize, and some minimal and maximal requirements (in our case, the quota for the production of an efficient HTR model). Building a turn-key HTR model applicable to as large a range of unseen manuscripts as possible is undoubtedly the end goal. With the production of ground truth being expensive but with increasingly more open-access models available to the public, the challenge is finding the right combination of GTs (either to create a model from scratch or to fine-tune an existing one) that yields the best results. This is where considerations of size and variety enter the discussion and directly affect the quantitative corpus construction strategy.
More specifically, while conducting an experiment on Caroline Minuscule OCR models, Hawk et al. (2018) conclude that "relative preponderance"4 in small training pools was a considerably more important factor than size, which inversely impacts the accuracy of the models resulting from larger training pools. A careful conclusion would be that a specific combination of manuscripts can yield exceptional results, even though the reasons behind such results, or the criteria for the respective manuscripts to be combined, are not entirely clear yet. This means that, quantity-wise, we sought a balance between the diversity and size of the GT, always making sure that the ground truth yields an efficient model for individual manuscripts in the training set. Training and fine-tuning experiments conducted by Pinche showed that a specialized model per script is not always necessary, but that the variety of the training set increases its robustness. Therefore, the size of each GT belonging to the training set was limited to 5 pages per script variation (depending on the density of the layout),5 examining whether this balance can contribute to the production of generic models.6

Journal of Open Humanities Data DOI: 10.5334/johd.97

Segmentation vocabulary: SegmOnto

With the emergence of efficient layout analyzers and easy-to-use interfaces, the need for efficient segmentation models (as well as for large amounts of data) based on the aggregation of heterogeneous documents increases. Alongside text recognition, eScriptorium allows for layout annotation using ontologies and controlled vocabularies. For this, researchers need to agree on a limited common vocabulary and share common practices to facilitate the interoperability of their ground truth.
In order to identify the different areas of the document and the type of lines present on the page as well as to characterize them from a codicological point of view, we decided to implement the controlled vocabulary SegmOnto (Gabay, Camps, Pinche, & Jahan, 2021).SegmOnto was born out of the need for a small/restricted common ontology based on existing standards for the description and analysis of document layout, ranging from content categorization to text recognition, mainly addressing the case of manuscripts and early printed books.
SegmOnto has already been implemented in several projects led by Pinche and connected to the CREMMALab project, such as Gabay et al. (2022), resulting in segmentation models mainly for late medieval manuscripts and early prints.7 As for the CREMMA Medii Aevi dataset, the documents present two kinds of layout: multiple columns and single columns, for which lines are most often long, except for the Psalms and Books of Hours. SegmOnto offers multiple levels of description, of which only the first is completely standardized, as the second is intended for custom refinement and the third for local and document-based differentiation. For the purposes of the project, only the first level of SegmOnto has been utilized, such as MainZone for columns and MarginTextZone for marginalia.
Pinche's Transcription Guidelines

Pinche (2022c) stressed that HTR was an answer to the need for scientific projects to acquire textual data, either to undertake editions or to constitute large corpora. Her guidelines address the need to establish principles common to projects dealing with the transcription of manuscripts in order to:
• build shareable, reusable, and durable ground truth data sets;
• produce robust generic models, reusable on "out-of-domain" manuscripts;
• minimize the collective cost, including that of training people;
• build GT that seeks to optimize the learning space of HTR models.
Pinche has privileged a graphemic transcription, which reproduces graphemes, i.e. a canonical form for each character, instead of a graphetic one, which tries to reproduce each variation of a letter (such as ſ and s).8 Pushing the imitation too far through a graphetic approach risks making the transcription harder to complete (as it requires technical skills to recognize differentiated shapes of characters), harder to make uniform (specifically as more annotators participate in a dataset), and potentially unusable for HTR (as it might introduce more characters and ultimately more noise for the HTR engine to learn). Therefore, in cases where functional signs have more than one graphetic manifestation but essentially the same function, they can be represented by the same sign: for example, for every manifestation of the paragraph sign, we opt for the pilcrow sign "¶" (U+00B6) on every occasion, instead of several variations such as "" (U+F1E1).9 In the context of the guidelines, we set up a list of allowed characters and a list of common and rare cases (such as Tables 2 and 3).
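This graphemic normalization amounts to a character-level mapping. The following is a hypothetical sketch, not the project's actual tooling; the codepoints are the ones cited above:

```python
# Hypothetical sketch of graphemic normalization: every graphetic variant of
# a sign is mapped to one canonical grapheme, as done for the pilcrow.
GRAPHEME_MAP = {
    "\uF1E1": "\u00B6",  # MUFI private-use paragraph sign -> pilcrow (U+00B6)
    "\u017F": "s",       # long s -> s, per the graphemic (not graphetic) choice
}

def to_graphemic(text: str) -> str:
    """Replace every known graphetic variant with its canonical grapheme."""
    return "".join(GRAPHEME_MAP.get(ch, ch) for ch in text)
```

Each variant collapses to one learnable target, which is exactly why the graphemic approach reduces the character inventory seen by the HTR engine.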
On the topic of abbreviations, resolving them produces specific difficulties for HTR engines, as it leads them to learn more about the language than originally intended.10 Abbreviations

7 More information and case studies can be found here: https://segmonto.github.io/.
8 See Pinche, Duval, and Camps (2022). On this particular topic, Gueville and Wrisley (2022) and Pinche took different paths. However, while graphemic transcriptions cannot be converted and reused for graphetic model training, graphetic transcriptions can easily be turned into graphemic GT, at the cost of establishing a "translation" table for each character.

9 While most of the special characters exist as such in MUFI, a conscious effort has been made to avoid as much as possible the private domain of MUFI.
10 Most HTR engines learn directly from transcriptions and do not include a separate mechanism for abbreviation resolution or spotting. Transcriptions produced by these models thus do not show where an abbreviation was resolved, making it difficult to distinguish HTR errors from abbreviation resolution errors. The stakes do not specifically concern the scores, which seem to be close to each other (Camps, Vidal-Gorène, & Vernet, 2021), but the long-term use of ground truth data and silver data in a sustainable way.

are not resolved in our dataset, as this constitutes rather an interpretative act linked to the specificity of each document. It is not the same as a textual prediction, and it could prove detrimental to the extension of an HTR model in the long term. Pinche's graphemic approach without abbreviation resolution simplifies the interpretation step of the text, and in turn the reduction of character diversity ultimately smooths the learning curves of both the human transcriber and the HTR engine.
In order to ensure the rigorous application of these guidelines and the homogeneity of the data produced, we introduced quality control to the production and publication workflow.Each manuscript transcription was passed through ChocoMufin (Clérice & Pinche, 2021), using project-provided character translation and control tables.
This software, alongside these tables, allows each dataset to be both controlled at the character level and adapted to guideline specifications and modifications. It also allows project-specific transcription guidelines to be translated into a more common one, such as CREMMALab's (Pinche & Camps, 2022).11 This process was used extensively in the first months of the CREMMA Medieval project, as the guidelines were still being drafted. It allowed Pinche to produce or align datasets first and harmonize later, as long as the harmonization went from a higher level of detail (closer to graphetic) to a lower one (closer to graphemic).
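Such a control-and-translation pass can be pictured as follows. This is a minimal sketch assuming a CSV table of characters; the column names and behavior are illustrative and do not reproduce ChocoMufin's actual file format:

```python
# Illustrative sketch of character-level control and translation driven by a
# CSV table (column names are assumptions, not ChocoMufin's real schema).
import csv
import io

TABLE_CSV = """char,replacement,allowed
\u017F,s,false
\u00B6,\u00B6,true
"""

def load_table(csv_text):
    """Parse the table into {char: (replacement, allowed_as_is)}."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {r["char"]: (r["replacement"], r["allowed"] == "true") for r in rows}

def control(text, table):
    """Apply replacements; flag characters missing from the table for review."""
    out, unknown = [], set()
    for ch in text:
        if ch in table:
            out.append(table[ch][0])
        elif ch.isascii():
            out.append(ch)
        else:
            unknown.add(ch)  # not in the table: report, keep unchanged
            out.append(ch)
    return "".join(out), unknown
```

The key design point is that harmonization is table-driven: changing the guidelines means editing the table, not re-transcribing the data.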

TRANSCRIPTION GUIDELINES FOR THE CREMMA MEDII AEVI
The section that follows aims to guide the reader through the transcription norms followed for the Medii Aevi dataset, illustrating the process and the more common and complex cases, especially where new characters have been introduced compared to the CREMMA Medieval dataset.
The project adheres to the general principles laid out by Pinche (Pinche, 2022c, Tables pp. 4-15) concerning the base cases (punctuation, word separation, functional signs, superscript letters, abbreviations, ligatures, and roman numerals). Using the project-provided character conversion table, ChocoMufin controls the transcription and corrects any anticipated error by transforming the character automatically so that it conforms to the pre-defined guidelines (data should be used in their post-ChocoMufin converted state, as conversion sometimes corrected mistranscriptions). However, where the guidelines did not directly address the situation (new characters, new types of abbreviations), we positioned ourselves and interpreted the guidelines in light of the situation. Each decision was discussed with the original guidelines' author.12 In general, the main differences that we isolated between the CREMMA Medieval and Medii Aevi datasets, stemming from the language as well as the genres' own characteristics, are:

1. the dataset bears no accentuated vowels, unlike the Old French texts (a rare occurrence even in that corpus);13
2. no normalization or distinction of u and v was provided, nor of i and j;
3. two variations of con are found, namely the antisigma and the 9-shaped form;
4. a higher diversity of abbreviating character usage and signification;
5. Arabic numerals alongside roman ones, mostly in scholastic and medical treatises.
Reference marks, functional signs, and punctuation

In general, complex medieval punctuation has been simplified as much as possible: single-sign punctuation is reduced to "." and commas are rendered as ",". Double-sign punctuation (mainly punctus elevatus and punctus interrogativus) is consistently reduced to ":". The hyphenation of words that continue on the next line has been marked with a unique "-" (U+002D) sign, following 3.1.

Contractions: A word is abbreviated by contraction when one or more of the middle letters are missing. Such an omission is indicated by one of the general signs of abbreviation, present in both corpora, always following Pinche (2022c). Thus, macrons and generally horizontal line diacritics over letters, such as tildes, are represented by combining horizontal tildes, and any vertical zigzag and similarly shaped forms are simplified into combining vertical tildes. In our corpus, in cases where a macron extends over more than one letter due to the cursivity of the script, this trait has been reproduced in the transcription, as has the case of stacked diacritics, usual in later medieval manuscripts (cf. Table 4), as long as it was a semantic feature and not a decorative one.
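In Unicode terms, these abbreviation strokes are combining marks that follow their base letter. A minimal sketch, assuming U+0303 (COMBINING TILDE) for the horizontal strokes as cited elsewhere in the guidelines, and U+033E (COMBINING VERTICAL TILDE) for the vertical zigzags (the latter codepoint is our assumption, not stated in the text):

```python
# Combining diacritics in practice: the stroke is a separate codepoint that
# attaches to the preceding base letter and renders as one glyph cluster.
p_tilde = "p\u0303"  # p + COMBINING TILDE: horizontal abbreviation stroke
q_vert  = "q\u033E"  # q + COMBINING VERTICAL TILDE (assumed codepoint)

# Each abbreviated form is one visible glyph but two Unicode codepoints.
assert len(p_tilde) == 2 and len(q_vert) == 2
```

Counting codepoints rather than glyphs matters for HTR training, since the engine learns the base letter and the combining mark as separate output symbols.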
Abbreviation marks significant in themselves: "Standard" abbreviation signs have been preserved as such, like pr(a)e - p̃ (p + combining tilde, p + U+0303), pro - ꝓ (U+A753), hoc - ħ (U+0127), ẜ (s with diagonal stroke, U+1E9C) for secundum or ser-, ꝯ for 9-shaped con/cum (U+A76F), Tironian sign ꝰ for the desinence -us (U+A770), ᷑ for (t)ur (U+1DD1), and Ꝙ / ꝙ for quod. Absent from CREMMA Medieval but present in Medii Aevi, the truncated ending -is is transcribed using the character ꝭ (U+A76D). The "inverted c" variation of the preposition con/cum is a good example of the difference between the graphetic and graphemic approaches: while using the antisigma (ↄ) is more faithful, it simply is an allograph of the original ꝯ. For -rum, the symbol ꝵ is used rather than the rotunda -rum ꝝ (U+A75D).14

Abbreviation marks significant in context: The abbreviation for the enclitic -que, or simply -bus or vertical -m in later manuscripts, has been reduced to the semicolon-shaped ; sign (U+F1AC), avoiding the private domain ligature-specific q (U+E8BF) character but also avoiding confusion with the regular semicolon.
Conventional signs: a category that includes all signs that stand for a frequently used word or phrase; they are almost always isolated (cf. Pluta (2020)). First, a rather frequent one: the abbreviation sign for esse is represented by the mathematical operator ≈ (U+2248). The division sign ÷ is used ubiquitously for the abbreviation sign of est/id est. Tironian et (U+204A, all variations of it, cf. below) is transcribed by ⁊. Etiam can also be found abbreviated by a combination of the Tironian et and the macron symbol (see Table 4).

14 The same two-shaped mark on the baseline, combined with a downward stroke, may stand as well for "-ris" as in "Aristoteles", though it is more often used at the end for "rum".
Ligatures, i.e. combinations of more than two letters in one form with the reduction of proclitic and enclitic letters, or abbreviating symbols placed above or joined with letters, are reduced to their original alphabetical components. Ligatures between letters in cursive scripts, such as the ſt (U+FB05) ligature or the ff (U+FB00) ligature, are resolved as -st- and -ff-. For the very frequent quia, the transcription qr has been privileged, avoiding the MUFI sign  that belongs to the private domain. More examples are provided in Table 4.15

Superscript letters and interlinear additions

A standard way of contracting a word is by adding a superscript letter which gives information about the abbreviated sequence.
Frequent ones are open a, u, o, or the ending of a word altogether. These were all rendered with the aid of superscript characters (Pinche, 2022c, p. 11). Ergo and igitur are two of the most frequent examples of abbreviations with superscript letters. Superscript letters without any baseline letter are simply represented with the same combining superscript character, with a space as the supporting baseline character (e.g. " ͣ ͭ ": space + combining a + space + combining t, cf. Figure 1).
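The space-as-carrier convention can be made concrete. A small sketch, where U+0363 and U+036D are the combining small letters a and t used in the " ͣ ͭ " example above:

```python
# Superscript letters with no baseline letter: a plain space carries the
# combining character, so the mark still has a base to attach to.
combining_a = "\u0363"  # COMBINING LATIN SMALL LETTER A
combining_t = "\u036D"  # COMBINING LATIN SMALL LETTER T

# " ͣ ͭ " = space + combining a + space + combining t
standalone = " " + combining_a + " " + combining_t
assert len(standalone) == 4  # two carrier spaces, two combining marks
```

Using a space as carrier keeps the transcription aligned with the line image: the HTR engine still sees one symbol pair per superscript mark.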
Superscript letters, alongside their abbreviating functions, were sometimes used to render interlinear additions. Missing content or annotations are added in the interlinear space, especially in manuscripts of scholastic and medical content. This was at first a challenge for the transcription process due to segmentation constraints: it can be, at times, impossible to completely differentiate the segmentation masks of two vertically adjacent lines (such as interlinear additions). Therefore, provided that the corresponding combining letter exists and both words can be formulated, no new lines were created for the interlinear additions. Where this was deemed too complex, interlinear additions were omitted (see Figure 2).
15 Other transcription guidelines privilege "q2" as a reference to the r rotunda-shaped abbreviation sign that lies next to the q; our choice of qr reduces this r rotunda-shaped abbreviation sign to the simpler r. The original insular abbreviation has a simple vertical tilde next to the letter "q".

CHARACTER(S) UNICODE RESOLUTION EXAMPLES
Table 3 Freestanding, letter-combining abbreviations and their corresponding transcription signs. đ cannot be found in our dataset and is mentioned here as it might be a common case in other datasets.

Rare characters and Numerals

Referring to corpus construction practices for balanced corpora, Maniaci (1993) stresses that "sporadically attested variables will therefore be preferred to those that appear in all - or almost all - the individuals that are part of the corpus." Rare characters, a subset of freestanding abbreviation signs, specifically occurring in the Medii Aevi dataset, are therefore given special attention (cf. Table 5). In two of the manuscripts, both of medical content, some occurrences of graphemes denoting the metric values ounce and semuncia were encountered. For their transcription, ℥ (U+2125) and  (U+10192) were used. "Barred O" is represented by ∅ (U+2205) and is widely used to transcribe the word instans, instead of ꝋ (U+A74B) which, according to MUFI documentation, stands for the abbreviation of obi(i)t (Coulson & Babcock, 2020, p. 10).
Last but not least, in addition to roman numerals, often preceded and followed by dots such as ".ii.", Arabic numerals are also comprised in the dataset, mainly due to the medical treatises (see Figures 3 and 4).

Production pipeline
The data was built using eScriptorium and Kraken for both segmentation of zones and lines (specifically the BLLA model). Manuscripts were annotated successively: first, the manuscript is automatically segmented; then its segmentation is manually corrected; and finally the text is transcribed. Once each sample is entirely annotated, its use of characters is controlled via the ChocoMufin software, while its conformity to the segmentation classification vocabulary is controlled by HTRVX. Finally, data are released on GitHub.16 All the combining and abbreviation signs suggested for use by the present adaptation of Pinche's guidelines can also be found in a custom-made eScriptorium keyboard configuration, in order to facilitate reuse and compatibility with the guidelines.17

RESULTS AND DISCUSSION
Properties of the resulting dataset

The resulting version of the dataset (see Table 6) is built on 18 + 3 manuscripts. All alignments are original, but some draw their transcription from online projects (cf. Acknowledgements).
The current version of the dataset shows a wide variety of genres, and thus a wide vocabulary.
From medical and grammatical content to literary and scholastic, a certain level of arbitrariness is introduced in the sequence of characters, as they are not as repetitive and predictable for the machine as in a homogeneous, genre- or topic-driven dataset. The collection was built not to be representative of one specific use of the Latin language and is not thematically unified, while the CREMMA Medieval dataset focuses more on literary texts, specifically hagiographic and chanson de geste texts. Medical and scholastic genres, furthermore, induce the use of a range of rare characters and often underrepresented letters (such as "z", "y" and "k"). Other features, such as layout and type of digitization (microfilm or original), provide different representations of texts, with more or less noise in the mask of each line given the space between them, and with more or less contrast. Colored text yields less "information" in digitized manuscripts, as it tends to be a duller shade of grey than black ink, while clearly departing from the manuscript "background" in color.
A timespan of five centuries separates the earliest and the latest manuscripts, with a clear focus on the period starting in the 1200s and finishing in 1500. This leads to a good representation of a variety of Gothic scripts,18 including personal hands alongside formal categories such as the ones described by Rossi (2022), with different levels of execution (cursivity and formality).
18 Characterisation of scripts was made by the transcriber where the information was not available in the notice of the manuscript. The criteria followed for the Gothic scripts are those of Derolez (2003).

Character frequencies in the CREMMA Medieval and the Medii Aevi datasets

We set up this corpus both to complement the CREMMA Medieval dataset and to grow the available set of data for Latin through the Middle Ages, noting that at least two datasets for Medieval Latin already existed (Caroline Minuscule and Eutyches) in abbreviated form for pre-10th century documents.

Unlike CREMMA Medieval, our approach has been feature-driven to compensate for rare characters in the dataset network. In this regard, we succeeded, as we have a higher frequency of special characters in our dataset than in Pinche's, despite it being smaller overall (see Table 7 and Figure 5). Only three characters are more represented in CREMMA Medieval: the Tironian et, the superscript combining r (common in words such as "grand"), and "&". The character ꝯ is equally present in both datasets: resolved as con- or com-, it is often used in words such as ꝯmence (commence). Some very frequent diacritics, such as the horizontal and vertical lines transcribed as tildes, are more frequent in our dataset, by a factor of 2.51 for horizontal ones and of 3.93 for vertical ones. This will allow better recognition of these two frequent marks, as they now total around 19,000 occurrences in both datasets for the horizontal tilde and 4,500 for the vertical one, making them the first and the third most represented abbreviating characters.
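Frequency comparisons like these can be computed directly from the transcriptions with a character counter. An illustrative sketch with made-up snippets, not the actual dataset figures:

```python
# Sketch of per-codepoint frequency counting over transcription lines,
# including combining marks such as U+0303 (combining tilde).
from collections import Counter

def char_frequencies(lines):
    """Count every Unicode codepoint, combining diacritics included."""
    counts = Counter()
    for line in lines:
        counts.update(line)
    return counts

# Toy transcriptions (invented, not real dataset lines).
old_french = char_frequencies(["\u204A si \uA76Fmence le co\u0303te"])
latin = char_frequencies(["\uA76Ftra natura\u0303 \u204A ratione\u0303"])

# Relative frequency factor for the combining tilde between the two samples.
ratio = latin["\u0303"] / old_french["\u0303"]
```

Counting at the codepoint level is what makes abbreviation strokes directly comparable across datasets that share the same transcription guidelines.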
Some manuscripts have nearly no abbreviation (cf.

IMPLICATIONS/APPLICATIONS
With this addition to the overall amount of datasets available, we now have 1.149 million characters for medieval manuscripts with book scripts, ranging from the 9th to the 15th century. These data offer more than characters: we can imagine using them in the context of linguistic studies (evolution of dialects, abbreviation usage, etc.) thanks to the shared transcription norm, or in codicology studies (evolution of layouts, relations between layouts) using the common segmentation vocabulary, in both cases using the original data or automatically annotated ones.
HTR data and models have a fairly high level of reuse potential. First and foremost, while such reuse is still relatively rare, these data, visualised correctly, can easily serve as teaching materials: e-teaching of paleography has been gaining some traction,20 but simply moving away from printed to digital and interactive hand-outs using open data and transcriptions is a first step that undoubtedly some have already taken.21 Reuse can then move to the analysis of the transcriptions themselves: Stutzmann (2018) and Stutzmann, Mariotti, and Ceresato (2020) have shown that analysis of graphematic data can yield information about scribal practices. Such data can also be used for model training: projects like Possamaï, Gaiffre, Souvaye, Duval, and Ducos (2022) and Foehr-Janssens, Ventura, Carnaille, and Meylan (2021) have used automatic transcription models to speed up the transcription of large collections of manuscripts, using base models which were then fine-tuned on samples of data to yield better results, as described by Pinche (2022b, 4.4). Finally, models can be used for data mining and for research at scale on non-manually transcribed manuscripts: Camps, Clérice, and Pinche (2021) proved the hypothesis of a 19th-century scholar by analysing a full manuscript with automatic transcription, and Franzini et al. (2018) proposed a stylometric analysis of data obtained through automatic transcription.
As a direct output, we trained a model which would allow for transcribing, or starting the transcription of, Latin medieval manuscripts. In order to evaluate the gain from our data, we trained three models:22
• (2016).23 In the case of dubia, additional corrections have been made for the faithful reproduction of the abbreviations;
• for Berlin, Hdschr. 25, the Faithful Transcriptions Data Set (Eichenberger & Suwelack, 2021);
• for the Donatus manuscripts - Laurentianus Pluteus 53.08 and 53.09 - the edition of HyperDonat by Bruno Bureau & Christian Nicolas has been consulted (Bureau, Nicolas, & Ingarao, 2008; Pinche, Bureau, & Nicolas, 2016), preserving, nevertheless, the manuscript lectiones/errors;
• in the same vein, for Latin 16195, the critical edition of Questiones de coitu (Cartelle, 2017); for Montpellier H 318 and CLM 1302, the critical edition of Liber minor de coitu (Cartelle, 1987); and for Philadelphia, College of Physicians, 10a 135, the critical edition of the Tractatus de sterilitate (Cartelle, 1993), all by Enrique Montero Cartelle, were consulted respectively as references.

Table 2
Punctuation, functional signs and hyphenation.

Table 4
Ligatures and special contraction cases.

Table 6
Table 9). Laur. Plut. 39.34 notably so, as it contains only 3 abbreviated words, each a single-character abbreviation (⁊, et). A little less than half of our manuscripts are less abbreviated than the most abbreviated text in the CREMMA Medieval dataset, while the other half can exceed it by up to ten points. However, both languages show similar maximum frequencies in terms of non-single-letter abbreviations (abbreviations made up of a single Unicode codepoint, such as ⁊, &, ꝑ).19

Table 8
Finally, despite showing a similar number of pages, we see a large variation in terms of word density with a limited variation in terms of unique words (cf. Table 8). This shows that pages as a metric are not enough to characterize a corpus for HTR and layout segmentation purposes: the number of columns, lines, and words or characters supplements it. To showcase this argument, the Berlin, Hdschr. 25 manuscript has the highest number of pages (17) but the third lowest number of words (961).

Table 9
From Medii Aevi, as stated earlier, all aligned data from the Faithful Transcriptions Data Set are kept for testing, as an out-of-domain set. Each model uses at least 10% of the pages of each dataset for the development set. CREMMA Medieval and Medii Aevi are furthermore split with another 10% subset for evaluation, providing an "in-domain" evaluation.
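The page-level split described here can be sketched as follows. This is a hypothetical sketch: the function name and seed are ours, and the actual split was performed per dataset:

```python
# Sketch of a page-level train/dev/test partition: at least 10% of pages to
# the development set and a further 10% to the in-domain test set.
import random

def split_pages(pages, dev_frac=0.10, test_frac=0.10, seed=42):
    """Shuffle pages deterministically, then carve off dev and test subsets."""
    rng = random.Random(seed)
    pages = pages[:]          # copy so the caller's list is untouched
    rng.shuffle(pages)
    n_dev = max(1, int(len(pages) * dev_frac))
    n_test = max(1, int(len(pages) * test_frac))
    dev = pages[:n_dev]
    test = pages[n_dev:n_dev + n_test]
    train = pages[n_dev + n_test:]
    return train, dev, test

train, dev, test = split_pages([f"page-{i}" for i in range(20)])
```

Splitting at the page level, rather than the line level, keeps whole pages out of training and so gives a more honest estimate of performance on unseen material.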

Table 11
Details on errors from the test presented in Table 10. Space % shows the portion of error points due to bad spacing: e.g., the All model has a 94.30% accuracy on the CREMMA Medieval test set, which means a 5.7% Character Error Rate (CER); unrecognized spaces represent 1.7 points of CER, more than a quarter of it. Other numbers are absolute values of missed characters (deletions or substitutions) to make comparisons between models possible; insertions are not accounted for.
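For reference, CER is the edit distance between prediction and ground truth divided by the length of the ground truth. A minimal, illustrative sketch using a pure-Python Levenshtein distance (not the evaluation code actually used):

```python
# Sketch of CER computation: edit distance over the reference length.
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: edit distance normalized by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)
```

Counting space errors separately, as in Table 11, simply means restricting the missed characters to the space codepoint before dividing by the reference length.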