Assessing Automatic Sentence Segmentation in Medieval Slavic Texts

1. Introduction

Our study is set in the context of a project that aims to equip young scholars in the field of Slavic medieval research with digital skills; within it, we are developing an automated workflow for assigning temporal and geographical information to Slavic medieval texts. These include handwritten and printed texts from three historical periods (10th–11th, 15th–16th, and 18th centuries), originating from South Slavic or East Slavic regions and covering a spectrum from Old Church Slavic (OCS) to its younger varieties. Some of these manually transcribed texts had served as ground truth (GT) for Handwritten Text Recognition (HTR) models (Rabus et al. 2023). On the manuscript level, they had been qualitatively attributed with temporal and geographical information (among others by Vaillant 1974 [1930] and Krustev / Boyadjiev 2012).

In Lendvai et al. (2023) we explored domain adaptation and fine-tuning of BERT (Devlin et al. 2019) for region and time classification on the sentence level, uniformly using Stanza (Qi et al. 2020) with its OCS model to segment the manuscript texts. However, this specific model may have been suboptimal, since our texts vary in their temporal and geographical origin. Additionally, a sample of the sentences misclassified by BERT appeared to be syntactically and semantically less well-formed than those correctly classified. Consequently, our current study compares sentence segmentation by Stanza (version 1.6.1) and UDPipe (Straka 2018, version 2.12). We evaluate from a philological perspective, specifically by examining the resulting segments for well-formedness¹, regarding a well-formed sentence as one whose boundaries do not split (i) a clause simplex (i.e. a subject-predicate structure), (ii) complements², or (iii) titles. Correct sentence segmentation is a prerequisite for many tasks on such historical texts, e.g. syntactic analysis or the alignment of Greek-Slavic translations. Importantly, unlike in modern languages, punctuation in OCS texts does not provide guidance for sentence segmentation: bullet point-like characters, for example, are presumed to be breath marks rather than sentence endings.
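The well-formedness criterion can be illustrated with a small sketch. It assumes, purely for illustration, a GT in which the units that must stay intact (clause simplexes, complements, titles) are annotated as token spans; the function name and the span representation are hypothetical and do not describe how the philological judgements in this study were actually made.

```python
def split_units(boundaries, protected_spans):
    """Return the system boundaries that fall strictly inside a protected span.

    boundaries: token indices at which the segmenter starts a new sentence.
    protected_spans: (start, end) token spans, end exclusive, that a
        well-formed segmentation must not split (hypothetical annotation).
    """
    return sorted(
        b for b in set(boundaries)
        if any(start < b < end for start, end in protected_spans)
    )


# Two protected units covering tokens 0-2 and 5-7.
protected = [(0, 3), (5, 8)]
# A boundary at 3 sits between units and is fine; 2 and 6 cut through a unit.
print(split_units([2, 3, 6], protected))  # prints [2, 6]
```

A boundary that coincides with the edge of a protected span is not counted, which matches the idea that only splits inside a unit make a segment ill-formed.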

2. Method

Both Stanza and UDPipe offer a model for OCS (Stanza: language code ‘cu’; UDPipe: old_church_slavonic-proiel-ud-2.12-230717, hereafter ‘proiel’) as well as for Old East Slavic (OES; Stanza: language code ‘orv’, with package ‘torot’; UDPipe: old_east_slavic-torot-ud-2.12-230717, hereafter ‘torot’)³.
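As a minimal sketch of how these four models might be invoked for sentence segmentation only: the Stanza part uses `stanza.Pipeline` restricted to the `tokenize` processor, and the UDPipe part addresses the models by name through the public LINDAT UDPipe REST service. The endpoint URL, the `MODELS` table and all helper names are assumptions made for illustration; the study does not state how the tools were called.

```python
import json
import urllib.parse
import urllib.request

# Model identifiers as named in the text; the dictionary layout is hypothetical.
MODELS = {
    "ocs": {"stanza": {"lang": "cu"},
            "udpipe": "old_church_slavonic-proiel-ud-2.12-230717"},
    "oes": {"stanza": {"lang": "orv", "package": "torot"},
            "udpipe": "old_east_slavic-torot-ud-2.12-230717"},
}

# Assumed endpoint of the public UDPipe REST service (not given in the text).
UDPIPE_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"


def stanza_kwargs(variety):
    """Keyword arguments for stanza.Pipeline, limited to tokenization."""
    return {**MODELS[variety]["stanza"], "processors": "tokenize"}


def segment_with_stanza(text, variety):
    """Return sentence strings; requires the stanza package and its models."""
    import stanza  # imported lazily so the other helpers work without it
    nlp = stanza.Pipeline(**stanza_kwargs(variety))
    return [sentence.text for sentence in nlp(text).sentences]


def sentences_from_conllu(conllu):
    """Collect the '# text = ...' comment lines that open CoNLL-U sentences."""
    prefix = "# text = "
    return [line[len(prefix):] for line in conllu.splitlines()
            if line.startswith(prefix)]


def segment_with_udpipe(text, variety):
    """Send text to the REST service and return sentence strings (needs network)."""
    data = urllib.parse.urlencode({
        "model": MODELS[variety]["udpipe"],
        "tokenizer": "",      # run tokenization and sentence splitting only
        "output": "conllu",
        "data": text,
    }).encode("utf-8")
    with urllib.request.urlopen(UDPIPE_URL, data=data) as response:
        result = json.load(response)["result"]
    return sentences_from_conllu(result)
```

Calling `segment_with_stanza(text, "ocs")` and `segment_with_udpipe(text, "ocs")` on the same transcription would then give two lists of segments that can be compared against the GT.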

Focusing on texts from the earliest time period (10th–11th centuries, of South Slavic provenance), we used three manuscripts as input for the sentence segmentation task and evaluated on the first 100 sentences in the respective GT.

1. Four texts⁴ from the Codex Suprasliensis, from the test set of the Universal Dependencies (UD) treebank. (Note that the UD treebank had served as benchmark data for training both Stanza and UDPipe.)

2. The translation of the Catechetical Lectures of Saint Cyril of Jerusalem, a large part of which is preserved in an East Slavic manuscript (GIM, Sin. 478).⁵

3. The translation of the treatise On leprosy by Methodius of Olympus, preserved in a copy within a compilation from the 16th c. (GIM, Sin. 995)⁶, whereas its original is dated to the 10th c.

Note that texts (2) and (3) have no benchmark GTs, thus we created our own GT resources for them. Both (2) and (3) maintain an archaic syntax, meaning that their word order, which is crucial for sentence segmentation, resembles the South Slavic pattern typical for the oldest surviving OCS texts rather than the (more recent) OES. We assumed that their mainly graphical-orthographic changes towards the OES variety would not affect sentence segmentation. To verify this, and since OES models in principle would fit the geographical location where (2) appeared and the time when (3) appeared, we also tested the OES models on these texts. Finally, we also ran the OES models on the OCS dataset (1), expecting that their performance would degrade.⁷

3. Results

3.1. Results by Old Church Slavic models

For all texts, Table 2 shows the quantification of incorrect segmentations as judged by the criterion defined in Section 1. We additionally report F1-scores calculated by the UD benchmark evaluation script.⁸ The scores are low, in line with those reported on the benchmark.⁹ In Table 1 we exemplify segmentation mismatches across the GT and the segmenter tools. On dataset (1), the OCS models of both tools showed similar efficacy: on Life of Aninas the Wonderworker, Stanza generated fewer units (72) than UDPipe (91). Both made incorrect segmentations (45 and 56 respectively, cf. Table 2), including shared errors due to erroneous word segmentation in the GT data leading to sentence boundaries (cf. the underlined parts in Table 1, reflecting the state of the manuscript)¹⁰.
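To make the F1-score concrete, the following sketch computes a sentence-level F1 in the spirit of the CoNLL 2018 evaluation: each sentence is mapped to a span over the non-whitespace characters of the text, and a system sentence counts as correct only if its span matches a gold span exactly. It is a simplified illustration with hypothetical function names, not the evaluation script used in the study.

```python
def sentence_spans(sentences):
    """Map sentences to (start, end) spans over non-whitespace characters."""
    spans = set()
    position = 0
    for sentence in sentences:
        length = sum(1 for ch in sentence if not ch.isspace())
        if length:
            spans.add((position, position + length))
        position += length
    return spans, position


def sentence_f1(gold, system):
    """Return (precision, recall, F1) for exact sentence-span matches."""
    gold_spans, gold_chars = sentence_spans(gold)
    system_spans, system_chars = sentence_spans(system)
    if gold_chars != system_chars:
        raise ValueError("gold and system must segment the same text")
    correct = len(gold_spans & system_spans)
    precision = correct / len(system_spans) if system_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The system merges the first two gold sentences and gets the third right.
gold = ["a b", "c d", "e"]
system = ["a b c d", "e"]
precision, recall, f1 = sentence_f1(gold, system)
print(precision, round(recall, 3), round(f1, 3))  # prints 0.5 0.333 0.4
```

Because only exact span matches count, a single misplaced boundary makes both neighbouring sentences wrong, which is one reason such scores can be low even when many boundaries are placed plausibly.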

Table 1: Example of sentence boundary errors produced by OCS models on dataset (1)

On dataset (2), both tools performed considerably worse, producing units of similar length that were nevertheless erroneous. Here, Stanza created 91 units with 115 erroneous segment boundaries, while UDPipe performed slightly better, producing 74 units with 86 errors. The performance disparity is most notable for dataset (3). Using the OCS models, Stanza and UDPipe generated 74 and 75 units, respectively.

3.2. Results by Old East Slavic models

Applying the OES models from both tools on text (2) does not yield improved segmentation: Stanza divides the text into 84 units with 113 segmentation errors (some due to incorrect line-end word segmentation); UDPipe yields 100 errors in 92 created units.

For dataset (3), the OES models split the text into 134 and 67 units, respectively. While UDPipe produces an even higher number of boundary errors, Stanza yields the fewest errors in this task.

3.3. Results by Old East Slavic models on Old Church Slavic data

When we ran the OES models on the OCS dataset (1), the results indicated a decline in performance, as anticipated. However, the observed number of incorrect splits was not substantial; the OES models even seemed to demonstrate greater robustness against word segmentation errors.

Table 2: Quantifying segmentation in the first 100 GT sentences on dataset (1), (2) and (3)

4. Conclusion

Our aim is to evaluate sentence segmentation as a preprocessing step in applied end tasks such as provenance attribution of historical language data. Our current investigation indicates that both Stanza and UDPipe show suboptimal performance on manuscripts unseen during training, likely also because these were compiled in a language region or century different from those of the currently available OCS and OES benchmarks. This highlights the challenges in compiling representative training resources for historical language processing.

Appendix A

Bibliography

Devlin, Jacob / Chang, Ming-Wei / Lee, Kenton / Toutanova, Kristina (2019): “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, Minneapolis, Minnesota, 4171–4186.
Jouravel, Anna / Sieber, Janina / Bracht, Katharina (2024): Methodius von Olympus: De lepra. Griechischer und slavischer Text, mit Einleitung und deutscher Übersetzung. Die griechischen christlichen Schriftsteller der ersten drei Jahrhunderte NF, 31. Berlin.
Krustev, Georgi / Boyadjiev, Andrei (2012): “On the Dating of Codex Suprasliensis”, in: Miltenova, Anissava (ed.): Rediscovery: Bulgarian Codex Suprasliensis of 10th century. Sofia: East-West Publishers, 17–23.
Lendvai, Piroska / Reichel, Uwe / Jouravel, Anna / Rabus, Achim / Renje, Elena (2023): “Domain-Adapting BERT for Attributing Manuscript, Century and Region in Pre-Modern Slavic Texts”, in: Proceedings of the 4th International Workshop on Computational Approaches to Historical Language Change 2023 (LChange'23) co-located with EMNLP2023, 06.12.2023, Singapore, 15–21.
Qi, Peng / Zhang, Yuhao / Zhang, Yuhui / Bolton, Jason / Manning, Christopher D. (2020): “Stanza: A Python Natural Language Processing Toolkit for Many Human Languages”, in: Celikyilmaz, Asli / Wen, Tsung-Hsien (eds.): Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Stroudsburg, PA, USA: Association for Computational Linguistics, 101–108.
Rabus, Achim / Arnold, Eckhart / Jouravel, Anna / Lendvai, Piroska / Meindl, Martin / Polomac, Vladimir / Renje, Elena (2023): “Developing a Pipeline for Automatic Linguistic Analysis of Historical Manuscripts and Early Printings: The Pre-Modern Slavic Case”, in: Proceedings of Digital Humanities Conference, 14.07.2023, Graz, Austria, 112–113.
SJS = Kurz, Josef (1958): Slovník jazyka staroslověnského. Lexicon linguae palaeoslovenicae. Praha: Nakladatelství Československé Akademie věd.
Straka, Milan (2018): “UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task”, in: Zeman, Daniel / Hajič, Jan (eds.): Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Stroudsburg, PA, USA: Association for Computational Linguistics, 197–207.
Vaillant, André (1974 [1930]): “Le De autexusio de Méthode d'Olympe: Version slave et texte grec édités et traduits en français”, in: Patrologia Orientalis 22, 5, 111, pp. 631–888.
Weiher, Eckhard (ed.) (2017): Die altbulgarische Übersetzung der Katechesen Kyrills von Jerusalem. Monumenta linguae Slavicae dialecti veteris 64. Freiburg i.Br.

Notes

1. For typically achieved segmentation performance on contemporary and ancient languages and for evaluation metrics, cf. https://stanfordnlp.github.io/stanza/performance.html ; https://universaldependencies.org/conll18/evaluation.html .

2. Acknowledging the challenge of clearly defining criteria for clauses, especially in cases like absolute constructions or infinitive clauses, we have provisionally treated the splitting of such ambiguous cases as incorrect. However, this requires further discussion. Against this background, it is important to understand that there are various equally correct ways of segmenting a syntactic unit. Thus, the incorrect segmentations of the models given here are not always stricto sensu wrong, but occasionally only deviate from the benchmark GT or our proposed GT.

3. Both tools’ RNC models (Stanza: language code 'orv', package 'rnc'; UDPipe: old_east_slavic-rnc-ud-2.12-230717) are oriented towards modern punctuation and thus proved ineffective, splitting the text into extremely long 'sentences', while the language variant in the Birchbark corpus is irrelevant for our study.

4. Cf. https://github.com/UniversalDependencies/UD_Old_Church_Slavonic-PROIEL#data-splits . The texts are: Vita (Passio) XL martyrum Sebastenorum (fols. 34v22–37v30; 39r1–42r27), Iohannes Chrysostomus, In ramos palmarum homilia (fols. 159v10–166v11), Patriarchae Photii, In ramos et Lazarum homilia (fols. 166v13–171v28), Vita Aninae Thaumaturgi (fols. 272r13–285v30).

5. Ed. Weiher 2017.

6. Fols. 310r–315r (transcription by A. Jouravel; on the text see Jouravel et al. 2024).

7. Note that since for texts (2) and (3) there is no GT available, we defined our own GT segment boundaries in line with our evaluation criteria (cf. Section 1).

8. https://github.com/ufal/udpipe/blob/udpipe-2/udpipe2_eval.py .

9. https://github.com/ufal/udpipe/blob/udpipe-2/udpipe2_eval.py ; https://ufal.mff.cuni.cz/udpipe/2/models .

10. This refers to a disrupted passage in the manuscript, for which various solutions have been proposed (see for example the SJS).