Optical Character Recognition systems for low-resource languages: A case for Sesotho

Africa is home to more than 2,000 languages, constituting about one-third of the world's languages (Pérez Báez, Vogel, & Patolo 2019). However, many African languages are considered low-resourced: they lack the resources needed for high-level Natural Language Processing (NLP) activities, and few new resources are being created for them. NLP tools need data (textual, speech or multimodal) to be developed, yet many African languages do not have enough corpora to support such tasks. As the creation of corpora is a major step toward the development of digital resources for languages, this project focuses on Optical Character Recognition (OCR) tools for Sesotho, a language that is low-resourced even though it has one of the longest histories as a written language (Nhlapo 2021).

Sesotho is spoken in several countries in Africa (with different orthographies), particularly in South Africa, where it is one of the 12 official languages, and Lesotho, where it is one of the two official languages (Makutoane 2022). OCR tools, which exist for many of the world's major languages, are defined as technologies that can read text from scanned paper and convert images into a format that a computer can understand (Isheawy & Hasan 2015). For example, one can use an OCR system to scan an article from a magazine or book and input it straight into an electronic file, which can then be edited with a text editor. However, despite the extensive writing traditions of various South African languages, a small portion of the data is only accessible in printed form (Hocking and Puttkammer 2016), which affects the development of NLP tools.

While OCR tools are readily available for high-resourced languages like English, the same cannot be said for languages like Sesotho. This sentiment is echoed by Hocking and Puttkammer (2016), who emphasize the scarcity of OCR engines for smaller languages like Tshivenda and others spoken in South Africa. This scarcity is attributed to a lack of emphasis on the development of these languages, with insufficient investment in both funding and skilled people to work on them. The unavailability of corpora contributes significantly to this dearth of resources, as NLP projects rely heavily on corpora. OCR is instrumental in preserving languages while simultaneously creating corpora that can be made available as open-source material for research.

Large corpora are required for NLP language model training, and they are becoming increasingly crucial for the best possible digitization and analysis of historical textual sources. While digitizing resources for South African languages is a way of preserving, protecting and developing them (Hocking and Puttkammer 2016), it is imperative that printed resources are digitized into a format that allows them not only to be archived electronically but also to be used as reference works in research and further development. This means that they should follow the FAIR principles, which require that data be findable, accessible, interoperable, and reusable ( https://www.go-fair.org/fair-principles/ ). However, the training of NLP tools for African languages requires very specific efforts, since these languages are also somewhat harder to analyze (Gabay, Ortiz Suarez, Bartz, Chagué, Bawden, Gambette, & Sagot 2022). As a result, language-specific training is crucial in order to maximize the accuracy of OCR systems during digitization. Using an OCR tool designed for another language could negatively impact the accuracy of the resulting text (Hocking and Puttkammer 2016).

Hocking and Puttkammer (2016) created an OCR system for South African languages ( https://repo.sadilar.org/handle/20.500.12185/322?show=full ). The system is trained mainly on South African government data. The purpose of this project is to see how it performs when given input from different genres, such as books, and how it handles the different orthographies of Sesotho. The project will analyze the accuracy and character recognition rate of the OCR system.
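As an illustration of how such an evaluation could be carried out, the following minimal Python sketch computes a character error rate (CER) from the edit distance between a manually corrected reference transcription and the OCR output. It is not the project's actual evaluation procedure: the function names, the use of Unicode NFC normalization, the definition of the character recognition rate as 1 - CER, and the sample strings are all assumptions made for the example.

```python
import unicodedata


def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `reference` into `hypothesis`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character dropped by OCR
                current[j - 1] + 1,      # extra character inserted by OCR
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    # Assumption: both texts are normalized to NFC so that a letter with a
    # diacritic counts as one character however it was encoded.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Illustrative strings only, not data from the project.
ground_truth = "lefatshe"
ocr_output = "lefatse"

cer = character_error_rate(ground_truth, ocr_output)
print(f"CER: {cer:.3f}")                   # CER: 0.125
print(f"Recognition rate: {1 - cer:.3f}")  # Recognition rate: 0.875
```

Because the CER is normalized by the length of the reference, it can exceed 1 when the OCR output contains many spurious characters, so a recognition rate derived as 1 - CER should be read with that in mind. Scores computed per page or per genre could then be compared across the different Sesotho orthographies.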

This project is important because it will assist in the preservation of low-resourced languages by creating and expanding corpora for these languages. The project is not meant for South African languages only; it aims to be a blueprint for all languages that are not yet represented in OCR models or whose presence is limited. This is relevant for discussions on the creation of NLP resources for low-resource languages and promises to address inequities that hinder access to technology. This is a work in progress, and the results will be shared in a publication to follow.

Appendix A

Bibliography
  1. Gabay, Simon / Ortiz Suarez, Pedro / Bartz, Alexandre / Chagué, Alix / Bawden, Rachel / Gambette, Philippe / Sagot, Benoît (2022): From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French. arXiv preprint arXiv:2202.09452.
  2. Hocking, Justin / Puttkammer, Martin (2016): “Optical character recognition for South African languages”, in: Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), 1-5. IEEE.
  3. Isheawy, Najib Ali Mohamed / Hasan, Habibul (2015): Optical character recognition (OCR) system. IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN 2278-0661.
  4. Makutoane, Tshokolo (2022): The people divided by a common language: The orthography of Sesotho in Lesotho, South Africa, and the implications for Bible translation. HTS Teologiese Studies/Theological Studies, 78(1), 7605.
  5. Nhlapo, Moselane Andrew (2021): Historical perspectives on the development of Sesotho linguistics with reference to syntactic categories. University of the Free State. https://scholar.ufs.ac.za/server/api/core/bitstreams/919f59d5-5fe1-4c55-82c3-2a2834dcc63a/content accessed on 08 December 2023
  6. Palkovic, A. J. (2008): “Improving optical character recognition”, in: Proceedings of the 2nd Villanova University Undergraduate Computer Science Research Symposium (CSRS 2008). December (Vol. 5, No. 8).
  7. Pérez Báez, Gabriela / Vogel, Rachel / Patolo, Uia (2019): Global Survey of Revitalization Efforts: A mixed methods approach to understanding language revitalization practices. Language Documentation & Conservation 13: 446-513.
  8. FAIR Principles https://www.go-fair.org/fair-principles/ accessed on 09 December 2023
  9. South African Centre for Digital Language Resources https://repo.sadilar.org/handle/20.500.12185/322?show=full accessed on 27 November 2023
Mmasibidi Setaka (mmasibidi.setaka@nwu.ac.za), South African Centre for Digital Language Resources (SADiLaR), South Africa