Digital Humanities Approach to Comparing Tang and Song Poetry: Revealing Thematic Evolution Through Multiple Runs of LDA Topic Modeling

Chinese poetry has a long and rich history, and the comparison between Tang （618-907） and Song （960-1279） poetry is a long-standing topic of inquiry. It can be traced back at least to the Southern Song critic Yan Yu （嚴羽, ?-1245）, ¹ who proposed in his 'Canglang Shihua' that the Song dynasty valued reason but lacked artistic spirit, whereas the Tang dynasty revered artistic spirit, with reason embedded within, ² a judgment that highlights the stylistic differences between the two periods. Beyond style, the themes of poetry also changed across the dynasties.

Tang poetry is known for its vitality and its depictions of grand natural landscapes; the mountains and rivers portrayed by poets such as Wang Wei（王維, 700-761） represent a pinnacle of Chinese landscape poetry.

Song poetry, on the other hand, displayed a more introverted and delicate style, often depicting details of life, such as Su Shi（蘇軾, 1037-1101）'s profound observations of society and life.

Considering the vast body of Tang and Song poetry, a comprehensive discussion of their themes is a challenge.

In recent years, the development of digital humanities, including the application of computational linguistics methods, has opened new avenues for this kind of research. Lee and Wong （2012） used computational linguistics to analyze Tang poetry, outlining themes such as geography and seasons in the poems; Liu et al. （2018） used digital tools to explore Chinese poetry from linguistic, literary, and historical perspectives, demonstrating a diversified approach to classical poetry studies; Hong et al. （2020） proposed a collaborative crowdsourcing framework that combines artificial intelligence with expert knowledge to extract knowledge and associations from Tang poetry efficiently. All three studies illustrate innovative uses of digital tools and computational linguistics in classical poetry research. ³

Building on this work, we use topic modeling, specifically LDA （Latent Dirichlet Allocation）, ⁴ to analyze and compare the themes of Tang and Song poetry, in order to reveal the stylistic and thematic transitions between the two periods.

This study applies LDA topic models to Quan Tang Shi （全唐詩） and Quan Song Shi （全宋詩）. By modeling the co-occurrence patterns of vocabulary, we initially identified ten potential themes in each collection, aiming to discern the distinct facets of Tang and Song poetry.
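To make concrete how LDA groups words that tend to co-occur, the following is a minimal sketch of a collapsed Gibbs sampler, one common way to fit LDA. It is an illustration only, not the implementation used in this study: the function name `lda_gibbs`, the hyperparameter defaults, and the toy documents are all assumptions, and the sampler works on raw token counts rather than TF-IDF weights.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0, n_top=5):
    """Collapsed Gibbs sampling for LDA on pre-segmented documents.

    Returns, for each topic, the n_top words most often assigned to it.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # tokens of doc d in topic k
    topic_word = [[0] * n_vocab for _ in range(n_topics)]   # count of word w in topic k
    topic_total = [0] * n_topics                            # all tokens in topic k
    assign = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            t = rng.randrange(n_topics)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word_id[w]] += 1
            topic_total[t] += 1
        assign.append(row)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, t = word_id[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][wid] -= 1
                topic_total[t] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta) / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][wid] += 1
                topic_total[t] += 1

    top_words = []
    for k in range(n_topics):
        ranked = sorted(range(n_vocab), key=lambda j: topic_word[k][j], reverse=True)
        top_words.append([vocab[j] for j in ranked[:n_top]])
    return top_words

# Toy, hand-made documents (not drawn from the corpus).
toy_docs = [
    ["山", "水", "雲", "月"],
    ["山", "月", "松", "水"],
    ["戰", "塞上", "馬", "侵"],
    ["戰", "侵", "塞上", "馬"],
]
topics = lda_gibbs(toy_docs, n_topics=2, n_top=3)
```

Because the sampler is seeded, repeated calls with the same arguments return the same top words; on a real corpus a library implementation would be far faster than this pure-Python loop.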

The dataset includes 42,863 poems from the Tang dynasty and 185,113 from the Song dynasty, totaling 227,976 poems. Quan Tang Shi, compiled by Qing dynasty officials, is crucial for studying Tang poetry, containing works from over two thousand poets. Similarly, Quan Song Shi, compiled by the Institute of Ancient Texts at Peking University, is an anthology of Song dynasty poetry, including works from over nine thousand poets.

In the data preprocessing phase, we used Jieba for Chinese word segmentation, splitting the verses of Quan Tang Shi and Quan Song Shi into individual words. ⁵ We then applied the TF-IDF vectorization method to transform these segmented texts into numerical vectors. ⁶ During this process, we adjusted the max_df and min_df parameters of TF-IDF to filter out words that appear too frequently or too infrequently, thereby improving data quality.
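The effect of the max_df and min_df thresholds can be sketched without any external library. The example below is a dependency-free illustration, assuming the poems have already been segmented into token lists (for instance by Jieba). It reads min_df as an absolute document count and max_df as a fraction of documents, and it uses the smoothed IDF formula with L2 normalization that scikit-learn documents as the default for `TfidfVectorizer`. The function name, threshold values, and toy documents are assumptions; the values used in this study are not stated.

```python
import math
from collections import Counter

def tfidf_with_df_filter(docs, min_df=2, max_df=0.8):
    """TF-IDF over pre-segmented documents with document-frequency filtering.

    docs   : list of token lists (e.g. output of a word segmenter)
    min_df : keep terms found in at least this many documents (absolute count)
    max_df : keep terms found in at most this fraction of documents
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(
        t for t, c in df.items() if c >= min_df and c / n_docs <= max_df
    )
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in idf)
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalize each document vector (guard against empty documents).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vocab, vectors

# Toy, hand-made documents (not drawn from the corpus).
toy_docs = [
    ["山", "月", "松"],
    ["山", "月", "雲"],
    ["山", "戰", "雲"],
    ["山", "戰", "塞上"],
]
vocab, vectors = tfidf_with_df_filter(toy_docs, min_df=2, max_df=0.8)
```

In this toy input, "山" appears in every document and is dropped by max_df, while "松" and "塞上" appear only once and are dropped by min_df, leaving only the mid-frequency terms as features.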

To determine the most appropriate number of topics, we calculated coherence scores for Quan Tang Shi and Quan Song Shi. The coherence score is an important metric for assessing the quality of topic models: a higher score usually indicates better topic quality, reflecting greater consistency and relevance among the words within a topic. Based on these results, we set the number of topics to ten for this experiment to explore the potential thematic distributions in Quan Tang Shi and Quan Song Shi. ⁷
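As an illustration of how a coherence score rewards topics whose words appear together, the sketch below computes the mean pairwise PMI of a topic's top words from document-level co-occurrence. This is a simplified variant in the spirit of PMI-based coherence; the text does not specify which coherence measure was used, so the measure, the function names, the smoothing constant `eps`, and the toy documents are all assumptions.

```python
import math
from itertools import combinations

def topic_coherence(top_words, docs, eps=1e-12):
    """Mean pairwise PMI of a topic's top words (needs at least two words).

    Probabilities are estimated as the fraction of documents containing
    the word, or both words of a pair.
    """
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    pmis = [
        math.log((prob(a, b) + eps) / (prob(a) * prob(b) + eps))
        for a, b in combinations(top_words, 2)
    ]
    return sum(pmis) / len(pmis)

def model_coherence(topics, docs):
    """Average coherence over all topics of one model."""
    return sum(topic_coherence(t, docs) for t in topics) / len(topics)

def best_topic_count(coherence_by_k):
    """Pick the topic count whose model has the highest coherence."""
    return max(coherence_by_k, key=coherence_by_k.get)

# Toy, hand-made documents (not drawn from the corpus).
toy_docs = [
    ["山", "水", "雲"],
    ["山", "水", "月"],
    ["戰", "塞上", "馬"],
    ["戰", "塞上", "雲"],
]
coherent = topic_coherence(["山", "水"], toy_docs)  # words that always co-occur
mixed = topic_coherence(["山", "戰"], toy_docs)     # words that never co-occur
```

Here `coherent` is higher than `mixed`, because "山" and "水" share documents while "山" and "戰" do not. Selecting the number of topics then amounts to fitting one model per candidate count, scoring each with `model_coherence`, and passing the resulting mapping to `best_topic_count`.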

To enhance the stability and accuracy of the LDA model, we took several measures. First, we ran the model multiple times and averaged the coherence scores across runs to mitigate run-to-run variability. Second, we set a fixed random seed and used consistent model parameters to ensure reproducibility. Finally, we ran a large number of iterations during training, improving the model's ability to identify and differentiate themes. We then compared the coherence scores of the LDA model with adjusted hyperparameters against those of the basic LDA model. The improved model showed better accuracy and stability, indicating that these measures made the model more reliable.
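The multiple-run procedure can be sketched as a small harness that repeats training over a fixed list of seeds and averages the resulting coherence scores. This is one possible reading of the procedure, assuming each run uses its own fixed seed so that the whole experiment stays reproducible. `average_coherence`, `toy_fit_and_score`, and `SEEDS` are hypothetical names, and the toy scoring function is only a stand-in for "train LDA with this seed and return its coherence"; its numbers are not results.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

def average_coherence(
    fit_and_score: Callable[[int], float], seeds: Sequence[int]
) -> Tuple[float, float, List[float]]:
    """Run one training-and-scoring function per seed.

    Returns the mean score, the sample standard deviation (0.0 for a
    single run), and the individual scores.
    """
    scores = [fit_and_score(seed) for seed in seeds]
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread, scores

def toy_fit_and_score(seed: int) -> float:
    # Stand-in only: a real version would train an LDA model with this
    # seed and return its coherence score.
    rng = random.Random(seed)
    return 0.40 + rng.uniform(-0.05, 0.05)

SEEDS = [0, 1, 2, 3, 4]  # fixed seed list, so reruns give identical numbers
mean_score, spread, scores = average_coherence(toy_fit_and_score, SEEDS)
```

Reporting the spread alongside the mean shows how much a single run could differ from the average, which is the variability the repeated runs are meant to smooth out.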

In the preceding analysis, we explored the thematic differences between Tang and Song poetry and identified the core themes in each period. We then subdivided Tang poetry into four periods: Early Tang （初唐）, High Tang （盛唐）, Mid Tang （中唐）, and Late Tang （晚唐）, and applied the LDA model to analyze the thematic characteristics of each.

In Early Tang, we found numerous religious terms such as "Immortals" （仙人）, "Three Pure Ones" （三清）, and "Monks" （僧家）. These results confirm the earlier interpretation of "Immortals" and "Incense Burning" （焚香）. The Early Tang period, transitioning from the chaotic Wei, Jin, and Sui dynasties, had a strong reliance on religious beliefs, reflected in the language used in poetry.

As we moved into High Tang and Mid Tang, we noticed an increase in war-related terms such as "Battle" （戰）, "Invasion" （侵）, and "Borderlands" （塞上）. These terms reflect the impact of the An Lushan Rebellion （安史之亂）, marking the transition from the height of the Tang dynasty to its decline. Frequent wars and civil unrest are evident in the language, revealing the anxiety and helplessness of the people during this time. In Late Tang, with the empire's further decline, we observed terms like "Desolate" （茫茫）, "Scattered" （零落）, and "Lonely" （寂寥）, expressing hopelessness and despair. This reflects a general disillusionment with reality and a return to religious beliefs, as indicated by the reappearance of "Immortals."

Our LDA model effectively reveals the thematic evolution of Tang and Song poetry, demonstrating the close relationship between historical contexts and poetic creation. Yoshikawa Kōjirō （吉川幸次郎） in A General Introduction to Song Poetry （宋詩概說） highlighted differences between Tang and Song poetry, such as the connection to social issues, detailed depictions of daily life, and the exploration of philosophical questions. These characteristics are all reflected in our analysis. Notably, in the depiction of natural scenery, we found a strong connection between Song poetry and Late Tang poetry, with a marked absence of religious terms in Song poetry. This supports the validity of our LDA model in reflecting the internal traces of the eras within the poetry.

Appendix A

Bibliography

Yan Yu 嚴羽 (1983). Canglang shi hua 滄浪詩話. Taipei: Taiwan shang wu yin shu guan.
Lee, J. S., & Wong, T. S. (2012, December). Glimpses of ancient China from classical Chinese poems. In Proceedings of COLING 2012: Posters (pp. 621-632).
Liu, C. L., Mazanec, T. J., & Tharsen, J. R. (2018). Exploring Chinese poetry with digital assistance: Examples from linguistic, literary, and historical viewpoints. Journal of Chinese Literature and Culture, 5(2), 276-321.
Hong, L., Hou, W., Wu, Z., & Han, H. (2020). A cooperative crowdsourcing framework for knowledge extraction in digital humanities–cases on Tang poetry. Aslib Journal of Information Management, 72(2), 243-261.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
Jieba. https://github.com/fxsjy/jieba
TF-IDF. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010, June). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 100-108).

Notes

¹ Our information on Chinese historical figures comes from the China Biographical Database Project （CBDB） at Harvard University （harvard.edu）.

² Yan Yu 嚴羽 （?-1245）, Canglang shi hua 滄浪詩話. Taipei: Taiwan shang wu yin shu guan, 1983.

³ Lee, J. S., & Wong, T. S. （2012, December）. Glimpses of ancient China from classical Chinese poems. In Proceedings of COLING 2012: Posters （pp. 621-632）; Liu, C. L., Mazanec, T. J., & Tharsen, J. R. （2018）. Exploring Chinese poetry with digital assistance: Examples from linguistic, literary, and historical viewpoints. Journal of Chinese Literature and Culture, 5（2）, 276-321; Hong, L., Hou, W., Wu, Z., & Han, H. （2020）. A cooperative crowdsourcing framework for knowledge extraction in digital humanities–cases on Tang poetry. Aslib Journal of Information Management, 72（2）, 243-261.

⁴ Blei, D. M., Ng, A. Y., & Jordan, M. I. （2003）. Latent Dirichlet allocation. Journal of Machine Learning Research, 3（Jan）, 993-1022.

⁵ Jieba. https://github.com/fxsjy/jieba

⁶ TF-IDF. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

⁷ Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. （2010, June）. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics （pp. 100-108）.