Pipeline for the Structuring and Publication of Language Resources

1. Introduction

We propose a pipeline model for creating and transforming linguistic data from dictionary sources, including handwritten, printed, and plain-text materials, into machine-readable and user-friendly formats such as spreadsheets and websites. Our goal is to enable both researchers and members of local communities to find, access, interoperate with, and reuse the data for various purposes, thereby supporting language research and preservation efforts. In this paper, we focus on examples from endangered languages of Japan. We begin by digitizing printed or handwritten texts and structuring them, and then introduce scripts and examples for converting the structured texts into various formats.

2. OCR/HTR

Many scholars and institutions have researched endangered languages in Japan and house numerous field notes and lexical data. For example, the Nakasone Seizen Collection, part of the Ryukyu/Okinawa-related Materials Digital Special Collections operated by the University of the Ryukyus Library (2002–), publishes the handwritten field notes of Seizen Nakasone on Ryukyuan language materials using IIIF. However, such materials are predominantly handwritten, necessitating a significant time investment for transcription. Some documents were printed by letterpress in the past, but worn ink and outdated fonts often make them unreadable for standard Optical Character Recognition (OCR) software. In such cases, a program designed for modern letterpress materials (NDL-OCR: NDL Lab, 2022) can extract text from images and convert it into digital text. Handwritten materials, in turn, can be deciphered with Handwritten Text Recognition (HTR) programs such as Transkribus (Kahle et al., 2017) or eScriptorium with the Kraken OCR engine (Kiessling et al., 2019) after model training. However, materials written in a mixture of kanji logograms and the hiragana and katakana syllabaries require more training data than the alphabetic scripts common in non-East Asian languages. Because of the wide variety of kanji, many characters appear only a few times in a single document, and securing a sufficient volume of training examples for each character can require an immense number of pages. An HTR program tailored to handwritten materials of Japan's endangered languages is currently under development, with a successful implementation demonstrated in the Hateruma Language Lexical Dictionary for the Hateruma dialect of the Yaeyama languages.

3. Structured data as the source for transformation

We recommend the spreadsheet format, since most linguists and some members of local communities are familiar with spreadsheets such as Microsoft Excel and Google Sheets. We emphasize the importance of defining the meaning of each row and column clearly and uniquely. Even if dictionary makers are unfamiliar with computer programming, others can process their texts as long as the texts are well structured and available under an open license such as CC BY. Once well-structured data becomes available, we can transform it into other formats for different purposes. So far, structured resources such as the Audio Database of Hatoma Lexicon (Kajiku & Nakagawa, 2021) and the Audio Database of Hatoma Example Sentences (Kajiku et al., 2022) have been published online under a CC BY license. We hope they will serve as exemplary cases of our model.
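
As a minimal illustration, one such layout keeps one entry per row and one field per column. The column names and values below are hypothetical placeholders, not those of the published databases:

    id,headword,pronunciation,part_of_speech,definition,audio_file
    0001,hana,hana,noun,flower,0001.wav
    0002,umi,umi,noun,sea,0002.wav

Because every column has exactly one interpretation, a script (or another researcher) can process the file without guesswork.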

4. Writing scripts

If language materials are adequately structured, scripts can form a pipeline that connects different formats and edits the materials automatically. We can also write scripts, as described below, to organize unstructured materials, such as a text file that has just come out of HTR.

As noted above, one can generate a text file from the image files of a printed dictionary with OCR or HTR. Platforms such as Transkribus and Kraken recognize text and layout regions on physical media. However, they do not perfectly segment the text according to its structure: when they read a dictionary page, for example, they do not separate headwords from definitions. We must therefore use other methods to split entries into fields. The most powerful way to structure such materials is to write scripts with regular expression patterns that rewrite the original text into the format of your choice. This method works especially well when the original material is relatively well formed and written in a uniform notation. For example, the electronic dictionary of Bantu languages presented in de Schryver (2023) was created by converting text transcribed by HTR into JSON files using Python scripts with regular expression patterns.
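
As a sketch of this approach, the following Python script assumes a hypothetical layout in which each HTR output line has the form "[headword] definition"; the pattern and file names would need to be adapted to the actual material:

    import csv
    import re

    # Hypothetical entry pattern: a headword in square brackets followed by
    # a definition, e.g. "[hana] flower". Adapt to the actual dictionary layout.
    ENTRY = re.compile(r"^\[(?P<headword>[^\]]+)\]\s*(?P<definition>.+)$")

    with open("htr_output.txt", encoding="utf-8") as src, \
            open("entries.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["headword", "definition"])
        for line in src:
            match = ENTRY.match(line.strip())
            if match:  # silently skip lines that do not look like entries
                writer.writerow([match["headword"], match["definition"]])

Lines that fail to match can instead be logged for manual review, which is usually preferable when cleaning HTR output.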

If the original material is not written in a uniform format, it can be structured using appropriately trained large language models (LLMs). For example, the original source material of Nakagawa et al. (eds., to appear), a dictionary of the Gǀui language, was not written in a unified format and was difficult to structure; part of it, however, was structured automatically using trained LLMs. The LLM was instructed to split the unstructured entries of the dictionary into fields such as headwords and definitions and then mark them up appropriately. As a result, the dictionary was structured with reasonable accuracy, though not perfectly. The remainder had to be corrected manually, but the use of LLMs can significantly reduce the time involved. Note, however, that accuracy often deteriorates when the material to be processed contains low-resource languages.
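
Such an instruction can be issued through any chat-style LLM API. The sketch below assumes the OpenAI Python SDK, an illustrative model name, and an invented sample entry; it is not the exact prompt used for the Gǀui dictionary:

    from openai import OpenAI

    client = OpenAI()  # reads the API key from the OPENAI_API_KEY variable

    raw_entry = "haka n. example word; cf. haka-ri"  # invented sample entry

    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[
            {"role": "system",
             "content": "Split the dictionary entry into fields and return "
                        "JSON with the keys headword, pos, definition, and "
                        "cross_reference. Use null for missing fields."},
            {"role": "user", "content": raw_entry},
        ],
    )
    print(response.choices[0].message.content)  # structured output to verify

As noted above, the returned fields still require human verification before they enter the published data.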

Once the text is structured in this way, it can be reformatted as desired. For example, the data of Okinawago Jiten (Okinawan Dictionary: NINJAL, 1963) was structured in a spreadsheet and converted into TEI Lex-0 and a Markdown-based Hugo website using XSLT (Miyagawa et al., 2023). The original spreadsheet of Okinawago Jiten contained one entry per row, with columns for headwords, definitions, pronunciations, and so on.
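
The XSLT used for Okinawago Jiten is specific to that project, but the same row-per-entry layout can also be converted with a short script. The following sketch, with hypothetical file and column names, writes one Hugo Markdown page per entry:

    import csv
    from pathlib import Path

    out_dir = Path("content/entries")  # hypothetical Hugo content directory
    out_dir.mkdir(parents=True, exist_ok=True)

    with open("dictionary.csv", encoding="utf-8") as f:  # hypothetical file
        for i, row in enumerate(csv.DictReader(f)):
            page = (
                "---\n"
                f'title: "{row["headword"]}"\n'  # Hugo front matter
                "---\n\n"
                f"**Pronunciation:** {row['pronunciation']}\n\n"
                f"{row['definition']}\n"
            )
            (out_dir / f"entry-{i:05d}.md").write_text(page, encoding="utf-8")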

5. TEI

To ensure the interoperability and reusability of textual data, adherence to the Text Encoding Initiative (TEI) Guidelines for markup is essential. Nevertheless, TEI markup practices are diverse for spoken-language resources such as corpora and dictionaries. For dictionary data, we employ TEI Lex-0, a TEI-conformant schema developed specifically for lexicographical information, which prescribes rigorous rules and a sophisticated set of tags for dictionary encoding. Despite its robust framework, TEI Lex-0 has not previously been applied to East Asian languages or to dictionaries incorporating audio data. We propose a more versatile approach that encodes multi-script notations, including kanji, katakana, and hiragana, and integrates audio file references within the TEI Lex-0 framework (Miyagawa et al., 2023; Nakagawa & Miyagawa, 2023); the result is convertible into OntoLex-Lemon, a LOD/RDF ontological framework for lexical resources (Almeida et al., 2022). We are also creating a standard for corpus data that incorporates interlinear glossed text with audio and video under the TEI Lex-0 and TEI P5 Guidelines. A transformation program that converts structured data into these standards is under development, though it will take more time to complete.
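
As an illustration of the target encoding, the sketch below generates a TEI-style entry with Python's standard library. The element names (entry, form, orth, pron, sense, def, media) are TEI dictionary elements, but the @type values, the sample forms, and the audio path are illustrative assumptions rather than our published schema:

    import xml.etree.ElementTree as ET

    # Illustrative entry: one lemma in two scripts, plus an audio reference.
    entry = ET.Element("entry", {"xml:id": "example-001"})
    form = ET.SubElement(entry, "form", type="lemma")
    ET.SubElement(form, "orth", type="kana").text = "ひと"   # hiragana notation
    ET.SubElement(form, "orth", type="kanji").text = "人"    # kanji notation
    pron = ET.SubElement(form, "pron")
    pron.text = "çito"  # illustrative phonetic notation
    ET.SubElement(pron, "media", mimeType="audio/wav",
                  url="audio/example-001.wav")  # pointer to the recording
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "def").text = "person; human being"

    ET.dump(entry)  # serialize the entry to stdout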

6. LaTeX

LaTeX can also be used to output structured data automatically in a form suitable for publication. For example, suppose a lexicon is converted into a CSV file with one entry per row and the headwords and meanings in separate columns. In advance, define macros in the LaTeX source, such as \headword{} and \definition{}, that appropriately format and arrange each field. If each row is then converted, with regular expressions or the like, into the form \headword{homme} \definition{man}, the result can be typeset as a TeX file. In this way, once a script converts CSV or XML into TeX format, a PDF of the lexicon can be obtained semi-automatically whenever structured data is available. Kajiku (2020) was typeset and published in this way, by converting a spreadsheet of the pre-structured dictionary into a TeX file using regular expressions.
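
A sketch of such a converter follows, assuming a hypothetical lexicon.csv with 'headword' and 'definition' columns; the macro bodies in the preamble are placeholders to be adapted to the desired page design:

    import csv

    PREAMBLE = r"""\documentclass{article}
    \newcommand{\headword}[1]{\textbf{#1}}       % headword, set in bold
    \newcommand{\definition}[1]{#1\par\medskip}  % definition, then a gap
    \begin{document}
    """

    with open("lexicon.csv", encoding="utf-8") as src, \
            open("lexicon.tex", "w", encoding="utf-8") as dst:
        dst.write(PREAMBLE)
        for row in csv.DictReader(src):
            # NB: real data would also need LaTeX special characters escaped
            dst.write(f"\\headword{{{row['headword']}}} "
                      f"\\definition{{{row['definition']}}}\n")
        dst.write("\\end{document}\n")

Running any LaTeX engine on the generated lexicon.tex then yields the PDF.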

Another advantage of using LaTeX to generate PDF files is that it separates presentation from structure. Data structures can thus be retained in LaTeX files as they are, which works well with structured data such as the spreadsheets and TEI mentioned above. For example, by defining a LaTeX macro \headword{}, the LaTeX source explicitly indicates that the marked-up text came from the "headword" column of the original spreadsheet.

In regional communities in Japan, many speakers of local languages are elderly, and only a limited number of people have access to network-connected computers. These speakers must therefore be able to access lexical resources in paper form. To date, we have published some dictionaries in print (e.g., Kajiku, 2020; Nakagawa et al., eds., to appear).

7. Community building

Dictionaries and other regional language resources are created through the collaboration of local communities, researchers studying the languages of those regions, and researchers or programmers who convert the data into formats that are easily accessible to anyone. This is possible only through the equal contribution of researchers and community members. If members of the local community and researchers of regional languages produce structured open data, it will be widely used. Researchers and programmers who convert the data also need to learn the local language and culture to make the data more accessible to everyone. It is therefore essential for everyone involved in language resources to collaborate. In this study, we have presented our own example of collaboration with local communities.

8. Conclusion

In this study, we have drawn a comprehensive picture of the pipeline, from the conversion of dictionaries into electronic text, through their structuring, to their transformation, and discussed ways to publish existing language resources in a more sophisticated form.

Appendix A

Bibliography
  1. Almeida, B., Costa, R., Salgado, A., Ramos, M., Romary, L., Khan, F., Carvalho, S., Khemakhem, M., Silva, R., & Tasovac, T. (2022). Modelling usage information in a legacy dictionary: from TEI Lex-0 to Ontolex-Lemon. Workshop on Computational Methods in the Humanities 2022 (COMHUM 2022), Laboratoire lausannois d’informatique et statistique textuelle, Jun 2022, Lausanne, Switzerland. hal-04170939v2
  2. Kahle, P., Colutto, S., Hackl, G., & Mühlberger, G. (2017, November). Transkribus: a service platform for transcription, recognition and retrieval of historical documents. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (Vol. 4, pp. 19-24). IEEE.
  3. Kajiku, S. (2020). Hatoma hôgen jiten (N. Nakagawa, Ed.). NINJAL Language Variation Division [Hatoma-Japanese Dictionary].
  4. Kajiku, S., & Nakagawa, N. (2021). Hatoma hôgen onsei goi dêtabêsu. NINJAL Language Variation Division. https://doi.org/10.15084/00003209 [The Audio Database of Hatoma Lexicon]
  5. Kajiku, S., Nakagawa, N., & Kato, K. (2022). Hatoma hôgen onsei reibun dêtabêsu. NINJAL Language Variation Division. https://doi.org/10.15084/00003659 [The audio database of Hatoma Example Sentences]
  6. Kiessling, B., Tissot, R., Stokes, P., & Ezra, D. S. B. (2019, September). eScriptorium: an open source platform for historical document analysis. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) (Vol. 2, p. 19). IEEE.
  7. Miyagawa, S., Kato, K., Zlazli, M., Carlino, S., & Machida, S. (2023). Building Okinawan Lexicon Resource for Language Reclamation/Revitalization and Natural Language Processing Tasks such as Universal Dependencies Treebanking. In N. Ilinykh, F. Morger, D. Dannélls, S. Dobnik, B. Megyesi, & J. Nivre (Eds.), Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023), pp. 86–91. Association for Computational Linguistics. https://aclanthology.org/2023.resourceful-1.12
  8. Nakagawa, H., Sugawara, K. & Tanaka, J. (eds., to appear). A Gǀui Dictionary.
  9. Nakagawa, N., & Miyagawa, S. (2023). A multi-media dictionary of endangered languages with TEI Lex-0: A case study of Hatoma, Yaeyama Ryukyuan. Encoding Cultures: Joint MEC & TEI Conference 2023. https://teimec2023.uni-paderborn.de/contributions/172.html
  10. NDL Lab. (2022). Development of Japanese OCR software (FY2021). https://lab.ndl.go.jp/data_set/ocr_en/r3_software/
  11. National Institute for Japanese Language and Linguistics (NINJAL). (1963). Okinawago Jiten. Zaimusho Insatsukyoku, Tokyo. [Okinawan Dictionary]
  12. de Schryver, G.-M. (2023). Investigating the feasibility of a hub-and-spoke model to hold ILCAA's Bantu lexica into a single multipurpose online dictionary database. Workshop 'The Past and Present of Bantu Languages: Integrating Micro-Typology, Historical-Comparative Linguistics and Lexicography.' Tokyo University of Foreign Studies, March 6, 2023.
Kanji Kato (kanji_kato@nii.ac.jp), The Joint Support-Center for Data Science Research, Center for Open Data in the Humanities, Japan and So Miyagawa (miyagawa.so.36u@kyoto-u.jp), National Institute for Japanese Language and Linguistics and Natsuko Nakagawa (nakagawanatuko@gmail.com), National Institute for Japanese Language and Linguistics