This ongoing research applies the Bidirectional Encoder Representations from Transformers (BERT) language model to sentence segmentation and punctuation restoration in ancient Buddhist texts written in Classical Chinese. Leveraging the CBETA corpus and contextual embeddings, it addresses the linguistic challenges posed by ancient Buddhist literature.
The advent of contextual embedding models, such as BERT and its variants, has revolutionized the field of natural language processing, as they offer opportunities for deep semantic understanding of text. (Devlin et al. 2019; Barbecho 2023) This research extends the application of BERT-style models, focusing on sentence segmentation and punctuation restoration in ancient Buddhist texts. These texts differ from other forms of writing in several ways: their morphological and semantic characteristics, together with the density of terminology tied to Buddhist doctrine and historical context, all present obstacles to text analysis. The high incidence of polysemy in Classical Chinese further complicates accurate interpretation. (Zhu and Li 2018; Meisterernst 2018) Applying computational linguistics techniques to Buddhist corpora has proved to be both meaningful and challenging. (Veidlinger 2019; Zheng 2024)
The Chinese Buddhist Electronic Text Association (CBETA) corpus is a large collection of the Buddhist Tripitaka, including scriptures and commentaries written in Classical Chinese. It encompasses a vast array of texts, some of which have already been segmented and punctuated, while others remain as raw text. (Tu 2015; Bingenheimer 2020) In this research, the corpus serves as the foundation for continual pre-training and fine-tuning, enabling the model to adapt to ancient Buddhist literature.
Several Classical Chinese embedding models build on the architecture of BERT or its variants, such as RoBERTa (Liu et al. 2019) and DeBERTa (He et al. 2021). Notable examples include RoBERTa-classical-chinese (Yasuoka et al. 2022), SikuBERT, SikuRoBERTa (Wang et al. 2021), and BERTguwen (Tang et al. 2023). We performed continual pre-training of the RoBERTa-classical-chinese model on the Buddhist texts described above. Training used the Masked Language Modeling objective, randomly masking fifteen percent of the tokens, with the AdamW optimizer. (Loshchilov and Hutter 2019)
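A minimal sketch of this continual pre-training setup is shown below, assuming the Hugging Face transformers and datasets libraries; the checkpoint name, file path, and hyperparameters are illustrative placeholders rather than the exact configuration used in this research.

```python
# Sketch: continual pre-training with masked language modeling (MLM).
# Checkpoint name, corpus path, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumed RoBERTa-classical-chinese checkpoint (Yasuoka et al. 2022).
base = "KoichiYasuoka/roberta-classical-chinese-base-char"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# CBETA passages, one per line (hypothetical file).
dataset = load_dataset("text", data_files={"train": "cbeta_train.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, as in the original BERT recipe.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cbeta-roberta-mlm",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,      # Trainer uses AdamW by default
    weight_decay=0.01,
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```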
At the current stage, we have constructed a dataset containing 200,000 sentences selected from CBETA and divided it into training, validation, and test sets in an 8:1:1 ratio. We added labels for the beginning and ending positions of sentences as well as for punctuation marks. For sentence segmentation, we fine-tuned the model to predict sentence boundaries from context and syntactic structure rather than from explicit delimiters. For punctuation restoration, the model was fine-tuned to insert appropriate punctuation marks. We achieved an F1 score of 93.06% on sentence segmentation and 79.53% on punctuation restoration.
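Both tasks can be framed as character-level token classification. The sketch below illustrates this framing for punctuation restoration; the label set, the rule for deriving labels from punctuated CBETA text, and the checkpoint path are assumptions for illustration, not necessarily the exact scheme used here.

```python
# Sketch: punctuation restoration as character-level token classification.
# Label set and label-derivation rule are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical labels: "O" for no mark, otherwise the mark following the character.
labels = ["O", "COMMA", "PERIOD", "PAUSE"]
id2label = dict(enumerate(labels))
label2id = {v: k for k, v in id2label.items()}

checkpoint = "cbeta-roberta-mlm"   # hypothetical path to the continually pre-trained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id)

def make_example(punctuated: str):
    """Strip punctuation from a punctuated CBETA sentence and tag each
    remaining character with the mark (if any) that immediately followed it."""
    marks = {"，": "COMMA", "。": "PERIOD", "、": "PAUSE"}
    chars, tags = [], []
    for ch in punctuated:
        if ch in marks and chars:
            tags[-1] = marks[ch]       # attach the mark to the preceding character
        else:
            chars.append(ch)
            tags.append("O")
    return chars, tags

chars, tags = make_example("如是我聞，一時佛在舍衛國。")
enc = tokenizer(chars, is_split_into_words=True, truncation=True)
# enc, together with the aligned tag ids, feeds a standard Trainer loop as in the
# pre-training sketch; sentence segmentation uses the same framing with
# boundary labels instead of punctuation labels.
```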
The methodologies developed here have implications for the study of other ancient languages and texts, by integrating language models with traditional philological scholarship. (Sommerschield et al. 2023) For future work, we will conduct additional experiments on a larger set of texts and a detailed comparison with other Classical Chinese embedding models. We will also apply the trained model to further tasks such as Buddhist scripture classification. (Huang, Wang, and Hung 2024) We hope that the outcomes of this research will eventually enable scholars to interact with historical documents in a more productive way. Contextual embeddings based on large language models (LLMs), such as gte-Qwen2-7B-instruct, may also be tested. (Bai et al. 2023) Such LLM-based embedding models may offer improved performance, albeit at the cost of greater computational demands.