Discovering Works in Historic Languages in HathiTrust with Language Models: A Case Study in Armeno-Turkish

This study presents an experimental workflow for training and applying language models to discover works in Armeno-Turkish (vernacular Turkish written in Armenian script) in HathiTrust, as a test case for a more generalizable approach to increasing the discoverability of documents in historic languages.

When working with a digital repository, a fundamental research concern is the ability to construct a corpus in a specific language or languages. HathiTrust is currently the largest open collaborative repository, housing more than 18 million volumes and offering remote and open access to scholars around the world who work with these sources in a variety of disciplines. However, scholars working with historic languages and multilingual documents face a significant challenge in finding such works in HathiTrust, because the catalog records for these works tend to have wrong language labels, missing script information, OCR errors, and other problems. Furthermore, even when the overarching language label is correct, it does not provide information about the multilingual composition of a given document. As a result, marginalized historic, nonstandard, and multi-scripted languages remain neglected at all stages of research, from exploration to experimentation.

In this study, we focus on collating a dataset of works in Armeno-Turkish, that is, vernacular Turkish written in Armenian script, as a test case for a more generalizable solution for the discoverability of documents in historic languages. The Armeno-Turkish corpus is an established area of study, comprising at least 3,000 books and 100 periodicals published in many territories across Europe and the Middle East between the 18th and early 20th centuries. While scholarly interest in Armeno-Turkish and in similar historic languages and nonstandard script combinations is on the rise, the challenges listed above make it practically impossible to collate an Armeno-Turkish dataset using HathiTrust metadata alone.

This project addresses the above challenges and contributes to increasing the discoverability of documents in historic languages in HathiTrust in two ways: (1) by demonstrating an experiment that discovers works in our target language, Armeno-Turkish, without relying on metadata and while taking into account both the language and the script of a given document; and (2) by offering a more granular language identification of the sections within each document, which reveals patterns of multilinguality instead of only classifying the document as a whole.

To reach these goals, we employ the following methodology. We start by creating a dataset of works labeled according to both language and script. For Armeno-Turkish, we use expert-labeled documents from HathiTrust (HT). For negative examples, we first use HT's MARC index to create a reverse mapping from languages (as assigned by librarians at the contributing institution) to documents, skipping anything dated before 1500 CE. We remove languages that have fewer than 100 documents or whose code is not a valid ISO 639 code. To ensure diverse temporal representation of each language, we split the range from the earliest to the latest document in that language into 5 buckets covering equal time periods, and randomly select one document from each bucket. We then select an additional 5 documents at random from the overall remaining set, for a total of 10 representative documents per language. Finally, we split each document into sub-documents of contiguous script according to the Unicode script specification.
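The negative-example selection described above can be sketched roughly as follows. This is a minimal illustration rather than the study's actual code: the function name, the `(doc_id, language_code, year)` record format, and the optional `valid_codes` set are all assumptions made for the example, and empty time buckets are simply skipped.

```python
import random
from collections import defaultdict


def sample_representatives(records, valid_codes=None, min_year=1500,
                           min_docs=100, n_buckets=5, n_extra=5, seed=0):
    """Pick temporally spread documents per language.

    `records` is a hypothetical iterable of (doc_id, language_code, year)
    tuples with unique doc_ids; `valid_codes` is an optional set of
    accepted language codes (for example, a list of ISO 639 codes).
    """
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for doc_id, lang, year in records:
        if year < min_year:  # skip anything dated before the cutoff
            continue
        if valid_codes is not None and lang not in valid_codes:
            continue
        by_lang[lang].append((doc_id, year))

    selected = {}
    for lang, docs in by_lang.items():
        if len(docs) < min_docs:  # drop sparsely represented languages
            continue
        years = [year for _, year in docs]
        lo, hi = min(years), max(years)
        # Equal-width time buckets; fall back to 1 if all years coincide.
        width = (hi - lo) / n_buckets or 1
        buckets = defaultdict(list)
        for doc_id, year in docs:
            index = min(int((year - lo) / width), n_buckets - 1)
            buckets[index].append(doc_id)
        # One random document from each non-empty bucket, oldest first.
        chosen = [rng.choice(buckets[i]) for i in sorted(buckets)]
        picked = set(chosen)
        remaining = [doc_id for doc_id, _ in docs if doc_id not in picked]
        # Extra documents drawn at random from everything not yet chosen.
        chosen += rng.sample(remaining, min(n_extra, len(remaining)))
        selected[lang] = chosen
    return selected
```

With the default parameters, a language whose documents fall into all five time buckets yields ten documents; a language with empty buckets yields fewer, which a fuller implementation might handle by drawing additional documents.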

At test time, we segment each document into smaller sections. Because we record the segmentation offsets, the original document can be reconstructed together with the inferred language information for each section. This approach offers a much more granular language identification of a given document, allowing us to distinguish the different languages present in it.
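One way to picture offset-preserving segmentation is to split text into runs of contiguous script and keep the character offsets of each run. The sketch below is an assumption-laden approximation: Python's standard library does not expose the Unicode Script property, so the hypothetical helper `script_of` uses the first word of each character's Unicode name as a stand-in, and characters without a script of their own (digits, punctuation, whitespace) are attached to the surrounding run.

```python
import unicodedata


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name.

    This is a stand-in for the real Unicode Script property, which the
    standard library does not provide.
    """
    if not ch.isalpha():
        return None  # digits, punctuation, whitespace: treated as neutral
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name
        return None


def script_runs(text):
    """Return (start, end, script) spans that cover `text` without gaps."""
    runs = []
    current = None
    start = 0
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            continue  # neutral characters stay in the current run
        if current is None:
            current = script
        elif script != current:
            runs.append((start, i, current))
            start, current = i, script
    if text:
        runs.append((start, len(text), current or "UNKNOWN"))
    return runs


sample = "Hello world. Բարեւ ձեզ!"
for start, end, script in script_runs(sample):
    print(start, end, script, repr(sample[start:end]))
```

Because the spans are contiguous and cover the whole string, concatenating `text[start:end]` over all runs reproduces the original text, so a label predicted for each span can be mapped back onto the source document.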

We compare a trigram character language model, a simple and performant approach to language identification, with a FastText language identification model trained on our labeled dataset. Finally, we apply our best-performing model, the trained FastText model, which achieves an F-score of 0.91 on the test set, to all documents in the HTC tagged as Turkish (tur), Ottoman Turkish (ota), or Armenian (arm).
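For readers unfamiliar with the baseline, a character trigram language model for language identification can be sketched as below: one model per label estimates the probability of each character given the two preceding characters, and a text is assigned to the label whose model gives it the highest log-probability. The class name, the add-alpha smoothing, the sentinel characters, and the toy Latin-script training strings are all assumptions for illustration; they are not the study's implementation or data, whose labels combine language and script.

```python
import math
from collections import Counter

START, END = "\x02", "\x03"  # sentinel characters marking text boundaries


class CharTrigramLM:
    """Character trigram model P(c | two previous chars), add-alpha smoothed."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.trigrams = Counter()  # counts of (context + next character)
        self.contexts = Counter()  # counts of two-character contexts
        self.chars = set()

    @staticmethod
    def _pad(text):
        return START * 2 + text + END

    def fit(self, texts):
        for text in texts:
            padded = self._pad(text)
            self.chars.update(padded)
            for i in range(2, len(padded)):
                self.trigrams[padded[i - 2:i + 1]] += 1
                self.contexts[padded[i - 2:i]] += 1
        return self

    def log_prob(self, text):
        padded = self._pad(text)
        vocab = len(self.chars) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(2, len(padded)):
            seen = self.trigrams[padded[i - 2:i + 1]]
            context = self.contexts[padded[i - 2:i]]
            total += math.log((seen + self.alpha) / (context + self.alpha * vocab))
        return total


def identify(models, text):
    """Return the label whose model assigns `text` the highest log-probability."""
    return max(models, key=lambda label: models[label].log_prob(text))


# Toy training data, invented for this example only.
toy_training = {
    "tur": ["bu bir kitaptır", "kitap okumak güzeldir"],
    "eng": ["this is a book", "reading books is good"],
}
models = {label: CharTrigramLM().fit(texts) for label, texts in toy_training.items()}
print(identify(models, "bir kitap"))
```

A real comparison would train on far more text per label and evaluate on held-out sections; the sketch only shows the scoring mechanism.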

We use the example of Armeno-Turkish as the basis for a robust, reproducible, and generalizable workflow for collating a corpus in a target language. As a result of the model's findings, we present 29 new records in Armeno-Turkish. These records include different translations of the New Testament, religious commentaries, dictionaries, textbooks for learning foreign languages, and Ottoman legal documents. Among the notable findings is a two-column bilingual edition, in Armenian and in Armeno-Turkish, of the Mejelle, the civil code of the Ottoman Empire in the late 19th and early 20th centuries. These findings contribute to a more complete and nuanced set of records for researchers. The study also reveals discoverability challenges in HathiTrust, which may not be readily apparent, for works outside the high-resource language group. To this end, we present critical insights into digital materiality, drawing attention to book history and publishing practices by showing how the physical qualities of a document might not transfer to its catalog or metadata description. These examples include previously unknown bilingual editions and the practice of binding multiple books together in one volume, both of which were mislabeled in the HathiTrust catalog as monolingual and single-volume items.

Appendix A

Bibliography
  1. HathiTrust Foundation. 2023. HathiTrust Digital Library.
  2. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Hale Sirin (halesirin@gmail.com), Johns Hopkins University and Ali Bolcakan (bolcakan@umich.edu), University of Michigan and Sabrina Li (sli159@jhu.edu), Johns Hopkins University and Thomas Lippincott (tom.lippincott@jhu.edu), Johns Hopkins University