Design and Construction of the corpus of “Kokinshu-Tōkagami”: An Early Modern Annotated Book on Classical Japanese Poetry Anthology

The Corpus of Historical Japanese (CHJ), developed and published by the National Institute for Japanese Language and Linguistics (NINJAL), is a large-scale corpus with morphological data (lemmas and parts of speech etc.) for all its texts, and has become an indispensable resource for the study of Japanese language history. Currently, we are working on the inclusion of “Kokinshu Tōkagami” (1793) in the Edo period series of the CHJ. “Kokinshu Tōkagami” is an annotated edition of the Kokin Waka-shū (lit. Anthology of Ancient and Modern Japanese Poetry) by the famous 18th century philologist and linguist, Motoori Norinaga. The Kokin Waka-shū is an imperial anthology of waka poems that was compiled in the 10th century and has been historically important as a classic of Japanese literature. In this presentation, we will explore the markup and automatic analysis of “Kokinshu Tōkagami” as a corpus containing morphological information. Additionally, we will provide examples of how this corpus can be utilized.

An important feature of the "Kokinshu Tōkagami" is that it is an annotated edition on the "Kokin Waka-shū ", translated verbatim by Motoori Norinaga using the colloquial language of Kyoto during the Edo period. This makes it an important source of colloquial Japanese in the late Edo period in Japan. The book contains three types of text, each written in a different style: the original text of the Kokin Waka-shū (fig.1 blue box), an annotation text by Norinaga (fig.1 green box), and a translation into colloquial Japanese by Norinaga (fig.1 red box). The original text of the Kokin Waka-shū is a waka poem from the Heian period (9th-10th century), Norinaga's annotations are in the Edo period literary style, and the colloquial translation is in the Edo period Kansai egion colloquial style.

Figure 1: Image of Kokinshu Tōkagami (1793), p.31

A full-text searchable corpus (Himawari edition) of "Kokinshu Tōkagami" tagged with document structure (without morphological information) is already available. This data is structured in XML, and the above-mentioned text type is also distinguished by the “type” attribute of the “quotation” tag. In addition, since "Kokinshu Tōkagami" has a very large number of inline elements such as bylines and glosses, we have taken a form that corresponds to them by using more tags than the Japanese colloquial materials of the Edo period that have been implemented in the CHJ so far. These tags are annotated with a TEI-like tag set, although not entirely according to TEI guidelines.

In order to be included in the CHJ, morphological information must be added. To perform morphological analysis for this purpose, MeCab was used as the morphological analyzer, and as the dictionary for morphological analysis, “UniDic for Early Middle Japanese” was used for the ground text using literary sentences, “UniDic for Waka” for the original text of waka poems, and “UniDic for Early Modern Kamigata Colloquial” for the translated part of Norinaga, and XML document structure tags were used to Automatic analysis was performed by switching between these multiple dictionaries using XML document structure tags. This resulted in sufficiently practical analysis results (approximately 90%) when manual correction was assumed.

Utilizing the corpus of "Kokinshu Tōkagami" constructed in this way, we investigated how the auxiliary verbs used in the Kokin Wakashu were translated. As a result, we were able to see differences in the interpretation of auxiliary verbs that could not be obtained from other sources by Norinaga.

The corpus of "Kokinshu Tōkagami" is still under construction, but once the morphological information is assigned and implemented in CHJ, it will be possible to search for words without being affected by word forms or notational quirks in the text, thus making exhaustive searches possible. The addition of morphological information is crucial for Japanese materials from the early modern period, which have a wide variety of notations. Furthermore, it will enable comparison with other materials already included in the CHJ, while taking into consideration the amount of text.

As described above, the construction of this corpus will further increase the value of "Kokinshu Tōkagami" as a resource for research on the history of the Japanese language, and it is expected that research on the history of the Japanese language using this corpus will become more diverse.

Acknowledgments

This presentation was supported by the NINJAL collaborative research project 'The Construction of Diachronic Corpora and New Developments in Research on the History of Japanese.' and JSPS KAKENHI Grant Numbers JP22K00577・JP23KJ1822.

Appendix A

Bibliography
  1. Taro Ichimura (2021) Corpus of Kokinshu Tōkagami, Himawari Edition ver.0.81 https://sites.google.com/view/ichimurat/%E5%B8%82%E6%9D%91%E7%A0%94%E7%A9%B6%E5%AE%A4 (accessed 5th December 2023).
  2. Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto (2004) Applying Conditional Random Fields to Japanese Morphological Analysis, EMNLP 2004.
  3. National Institute for Japanese Language and Linguistics (2023), Corpus of Historical Japanese, Edo Period Series. https://ccd.ninjal.ac.jp/chj/edo-en.html (accessed 5th December 2023).
  4. Toshinobu Ogiso, Mamoru Komachi, Yuji Matsumoto (2013) Morphological Analysis of Historical Japanese Text, Journal of natural language processing, Vol.20, No.5, pp.727-748. In Japanese
Masako Kubo (gs20233501@ninjal.ac.jp), The Graduate University for Advanced Studies, SOKENDAI and Taro Ichimura (tichimura@kpu.ac.jp), Kyoto Prefectural University and Toshinobu Ogiso (togiso@ninjal.ac.jp), National Institute for Japanese Language and Linguistics, JAPAN; The Graduate University for Advanced Studies, SOKENDAI