Leveraging Bidirectional Encoder Representations from Transformers (BERT) for Analyzing Ancient Buddhist Texts

This ongoing research applies the Bidirectional Encoder Representations from Transformers (BERT) language model to ancient Buddhist texts in Classical Chinese for sentence segmentation and punctuation restoration. Leveraging the CBETA corpus and contextual embeddings, it tackles the linguistic challenges posed by ancient Buddhist literature.

The advent of contextual embedding models such as BERT and its variants has revolutionized natural language processing by offering opportunities for deep semantic understanding of text (Devlin et al. 2019; Barbecho 2023). This research extends the application of BERT to sentence segmentation and punctuation restoration in ancient Buddhist texts. These texts are distinct from other forms of text in several ways: their morphological and semantic characteristics, along with the density of terminology tied to Buddhist doctrine and historical context, present obstacles for text analysis, and the high incidence of polysemy in Classical Chinese further complicates accurate interpretation (Zhu and Li 2018; Meisterernst 2018). Applying computational linguistics techniques to Buddhist corpora has thus proved both meaningful and challenging (Veidlinger 2019; Zheng 2024).

The Chinese Buddhist Electronic Text Association (CBETA) corpus is a large collection of the Buddhist Tripitaka, including scriptures and commentaries written in Classical Chinese. It encompasses a vast array of texts, some of which have already been segmented and punctuated, while others remain raw text (Tu 2015; Bingenheimer 2020). This corpus serves as the foundation for continual pre-training and fine-tuning in this research, enabling the model to adapt to ancient Buddhist literature.

Several Classical Chinese embedding models build on the architecture of BERT or its variants such as RoBERTa (Liu et al. 2019) and DeBERTa (He et al. 2021). Notable examples include RoBERTa-classical-chinese (Yasuoka et al. 2022), SikuBERT and SikuRoBERTa (Wang et al. 2021), and BERTguwen (Tang et al. 2023). We performed continual pre-training on the RoBERTa-classical-chinese model using the Buddhist texts described above. Our training approach used the Masked Language Modeling objective, in which fifteen percent of the tokens were randomly masked, optimized with AdamW (Loshchilov and Hutter 2019).
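
The following is a minimal sketch of this continual pre-training step, assuming the Hugging Face transformers and datasets libraries, a publicly available RoBERTa-classical-chinese checkpoint id, and a plain-text file of CBETA passages; these names and hyperparameters are illustrative rather than a specification of our actual pipeline.

```python
# Continual pre-training sketch: Masked Language Modeling with 15% masking.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "KoichiYasuoka/roberta-classical-chinese-base-char"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load raw CBETA passages (one passage per line; file name is a placeholder).
dataset = load_dataset("text", data_files={"train": "cbeta_passages.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of tokens, as described above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# The Trainer uses AdamW by default; the hyperparameters here are placeholders.
args = TrainingArguments(
    output_dir="roberta-classical-chinese-cbeta",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    weight_decay=0.01,
)
Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```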

At the current stage, we have constructed a dataset of 200,000 sentences selected from CBETA and divided it into training, validation, and test sets in an 8:1:1 ratio. We added labels marking the beginning and ending positions of sentences as well as punctuation marks. For sentence segmentation, the model was fine-tuned to predict sentence boundaries from context and syntactic structure rather than from explicit delimiters; for punctuation restoration, it was fine-tuned to insert the appropriate punctuation marks. The model achieved an F1 score of 93.06% on sentence segmentation and 79.53% on punctuation restoration.
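
As an illustration of how such labels can be derived and used, the sketch below strips the punctuation from an already-punctuated sentence and labels each remaining character with the mark that follows it (the full stop doubling as a sentence-boundary signal), then computes a token-classification loss. The label set, checkpoint id, example sentence, and one-token-per-character alignment are our own assumptions, not the exact scheme used in the experiments.

```python
# Sketch of label construction and token-classification fine-tuning.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

PUNCT = set("，。、；：？！")                               # marks to restore
LABELS = ["O", "，", "。", "、", "；", "：", "？", "！"]     # "O" = no mark follows
label2id = {label: i for i, label in enumerate(LABELS)}

def make_example(punctuated: str):
    """Strip punctuation; label each kept character with the mark that follows it."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCT:
            if labels:
                labels[-1] = ch          # the mark attaches to the preceding character
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels

model_name = "KoichiYasuoka/roberta-classical-chinese-base-char"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(LABELS))

chars, labels = make_example("如是我聞，一時佛在舍衛國。")        # illustrative sentence
enc = tokenizer("".join(chars), return_tensors="pt")
# Assumes character-level tokenization: [CLS] + one token per character + [SEP];
# -100 marks positions ignored by the loss.
label_ids = torch.tensor([[-100] + [label2id[l] for l in labels] + [-100]])
loss = model(**enc, labels=label_ids).loss   # fine-tuning minimizes this loss
```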

The methodologies developed here have implications for the study of other ancient languages and texts by integrating language models with traditional philological scholarship (Sommerschield et al. 2023). For future work, we will conduct additional experiments on a larger set of texts and a detailed comparison with other Classical Chinese embedding models. We will also apply the trained model to further tasks such as the classification of Buddhist scriptures (Huang, Wang, and Hung 2024). We hope that the outcomes of this research will eventually enable scholars to interact with historical documents more productively. Contextual embeddings based on large language models (LLMs), such as gte-Qwen2-7B-instruct, may also be tested (Bai et al. 2023); such LLM-based embedding models may offer better performance, albeit at the cost of greater computational demands.

Appendix A

Bibliography
  1. Bai, Jinze, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, et al. (2023). “Qwen Technical Report”. arXiv preprint arXiv:2309.16609
  2. Barbecho, Lidia Bocanegra. (2023). “Review: BERT for Humanists”. Reviews in Digital Humanities 4 (4)
  3. Bingenheimer, Marcus. (2020). “Digitization of Buddhism (Digital Humanities and Buddhist Studies)”. Oxford Bibliographies in Buddhism
  4. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. (2019). “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 4171–86
  5. He, Pengcheng, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. (2021). “DeBERTa: Decoding-Enhanced BERT with Disentangled Attention”. In International Conference on Learning Representations
  6. Huang, Shu-Ling, Yu-Chun Wang, and Jen-Jou Hung. (2024). “Application of Deep-Learning Methods in the Classification of Chinese Buddhist Canonical Catalogs”. Journal of Library and Information Studies 22 (1): 133–64
  7. Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. (2019). “RoBERTa: A Robustly Optimized BERT Pretraining Approach”. arXiv preprint arXiv:1907.11692
  8. Loshchilov, Ilya, and Frank Hutter. (2019). “Decoupled Weight Decay Regularization”. In International Conference on Learning Representations
  9. Meisterernst, Barbara. (2018). “Buddhism and Chinese Linguistics”. Buddhism and Linguistics: Theory and Philosophy, 123–48
  10. Sommerschield, Thea, Yannis Assael, John Pavlopoulos, Vanessa Stefanak, Andrew Senior, Chris Dyer, John Bodel, Jonathan Prag, Ion Androutsopoulos, and Nando de Freitas. (2023). “Machine Learning for Ancient Languages: A Survey”. Computational Linguistics 49 (3): 703–47
  11. Tang, Xuemei, Qi Su, Jun Wang, Yuhang Chen, and Hao Yang. (2023). “Automatic Traditional Ancient Chinese Texts Segmentation and Punctuation Based on Pre-Trained Language Model”. Journal of Chinese Information Processing 37 (8): 159–68
  12. Tu, Aming. (2015). “The Creation of the CBETA Chinese Electronic Tripitaka Collection in Taiwan”. In Spreading Buddha's Word in East Asia: The Formation and Transformation of the Chinese Buddhist Canon, edited by Jiang Wu and Lucille Chia, 321–36. New York: Columbia University Press
  13. Veidlinger, Daniel. (2019). “Computational Linguistics and the Buddhist Corpus”. Digital Humanities and Buddhism, 43–58
  14. Wang, Dongbo, Chang Liu, Zihe Zhu, Jiangfeng Liu, Haotian Hu, Si Shen, and Bin Li. (2021). “Construction and Application of Pre-Training Model of ‘Siku Quanshu’ Oriented to Digital Humanities”. Library Tribune
  15. Yasuoka, Koichi, Christian Wittern, Tomohiko Morioka, Takumi Ikeda, Naoki Yamazaki, Yoshihiro Nikaido, Shingo Suzuki, Shigeki Moro, and Kazunori Fujita. (2022). “Designing Universal Dependencies for Classical Chinese and Its Application”. Journal of Information Processing 63 (2): 355–63
  16. Zheng, Yutong. (2024). “Buddhist Transformation in the Digital Age: AI (Artificial Intelligence) and Humanistic Buddhism”. Religions 15 (1): 79–80
  17. Zhu, Qingzhi, and Bohan Li. (2018). “The Language of Chinese Buddhism: From the Perspective of Chinese Historical Linguistics”. International Journal of Chinese Linguistics 5 (1): 1–32
Tong Li (tongli@link.cuhk.edu.hk), The Chinese University of Hong Kong