Echoes Unveiled: Quantifying Literary Historical Scholarship with Word Embeddings

In computational literary studies, the exploration of literary history as a research subject has been primarily based on collections of historical literary works. These text corpora serve as the basis for identifying trends, common themes, stylistic structures, and topics, which are then juxtaposed with established literary historiographical narratives—sometimes corroborating, while at other times challenging them (Jockers 2013; Bode 2018). Some analyses have been expanded to incorporate what Bode refers to as "data-rich literary history" (2018: 37–57), encompassing metadata and traces of a text's "history of transmission" (2018: 38). Despite these advancements (Heuser 2016; Odebrecht et al. 2021; Schöch et al. 2022; Maryl et al. 2023), the integration of literary historiography as textual data remains less prevalent due to the difficulty of incorporating unstructured data into quantitative analyses.

The academic literary historical discourse is, however, an abundant source of data: Authors, genres, traditions, and literary texts are integrated in a resonating system of equation, comparison, and contrast. In an early examination of the possibilities of "computational historiography", Mimno (2012) demonstrates how a topic modeling approach to modern Classical scholarship can reveal trends in research topics. In other disciplines, word embeddings have proven useful for dealing with the "latent content" (Tshitoyan et al. 2019) of (academic) discourses: Wevers and Koolen (2020) show the utility of word embeddings in tracing the semantic change of concepts in newspapers, Garg et al. (2018) employ word embeddings to quantify gender and ethnic stereotypes, and Tshitoyan et al. (2019) use them to explore and summarize the latent knowledge of materials science. While similar applications are not as common in computational literary studies, word embeddings have been used to model literary characters (Bamman et al. 2014), identify patterns of intertextuality (Burns et al. 2021), examine gendered roles of characters (Grayson et al. 2017), and perform sentiment analysis (Jacobs 2019; Schöch 2022).

With these approaches in mind, this contribution outlines the steps necessary to train a word embedding on the academic literary historical discourse in order to quantify which authors and texts appear in similar contexts, and presents first results of such an embedding covering the long 18th and 19th centuries of European Anglophone literary history. Given the importance of named entities (especially persons and works of art) for the aforementioned system of reference encoded in literary historical scholarship, named entity recognition (NER) with a fine-tuned model was performed to generalize variations of entity names. In combination with an entity disambiguation using Wikidata identifiers, the NER process also ensures the compatibility of the extracted information with other data points structured as Linked Open Data (LOD).

A corpus of 27 literary histories and companions to specific genres and literary periods was compiled. The secondary texts were chosen to represent different notions of canonicity—normative, academic, and the counter-canon—in order to cover a broad range of literary historical scholarship. As an additional prerequisite, only texts available in native PDF versions were used. The majority of texts were published between 2004 and 2017 (see Figure 1); the secondary texts analyzed thus represent a snapshot of recent literary historiographical scholarship. Figure 2 shows the periods of time covered in the respective literary histories as indicated by chronologies in front matters or by timeframes stated in introductions. On average, each literary history consists of 148,720 tokens, amounting to a total of 4.5 million tokens across the corpus.

Figure 1. Distribution of publication years of secondary sources over years.
Figure 2. Distribution of coverage of secondary sources over years.

To compare and contrast how authors and texts are discussed and contextualized, their names and titles first had to be recognized with a fine-tuned spaCy (Explosion AI 2022) NER model. For this, the spaCy en_core_web_lg model was trained on manually curated training data and tested on a gold standard, both taken directly from the corpus. With this kind of domain adaptation, the performance for both relevant labels, but especially for WORK_OF_ART, could be improved (see Table 1). The fine-tuned NER model was then used in combination with the spaCy entity fishing module (Lopez 2022), which extracts information for identified entities from Wikidata. The resulting pipeline produced text versions of the literary histories with entities replaced by unique identifiers consisting of capitalized versions of the entity names and, if applicable, Wikidata IDs (e.g., WILLIAM_SHAKESPEARE_Q692).
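The replacement step at the end of this pipeline can be illustrated with a minimal sketch. It assumes that entity spans have already been recognized and, where possible, linked to Wikidata. The function names, the tuple layout, and the example spans are hypothetical and do not reproduce the actual pipeline; only the identifier format (capitalized name plus optional Wikidata ID) follows the description above.

```python
import re
from typing import List, Optional, Tuple

# (start offset, end offset, canonical entity name, Wikidata ID or None)
EntitySpan = Tuple[int, int, str, Optional[str]]


def make_identifier(name: str, qid: Optional[str] = None) -> str:
    """Build an identifier such as WILLIAM_SHAKESPEARE_Q692."""
    # Collapse every run of non-word characters (spaces, punctuation)
    # into a single underscore, then upper-case the result.
    token = re.sub(r"\W+", "_", name).strip("_").upper()
    return f"{token}_{qid}" if qid else token


def replace_entities(text: str, spans: List[EntitySpan]) -> str:
    """Replace each entity span in `text` with its identifier."""
    # Work from right to left so that earlier offsets stay valid.
    for start, end, name, qid in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + make_identifier(name, qid) + text[end:]
    return text


sentence = "Shakespeare wrote Hamlet around 1600."
# Hypothetical NER output: surface forms mapped to canonical names;
# the second entity is left without a Wikidata ID for illustration.
spans = [(0, 11, "William Shakespeare", "Q692"), (18, 24, "Hamlet", None)]
print(replace_entities(sentence, spans))
```

Mapping all surface variants of a name onto one token is what allows a subsequent embedding to treat them as a single vocabulary item.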

Table 1. Precision (p), recall (r), and F1 values for the NER labels PERSON and WORK_OF_ART before and after fine-tuning.
      spaCy en_core_web_lg      fine-tuned model
      PERSON   WORK_OF_ART      PERSON   WORK_OF_ART
p     0.64     0.43             0.89     0.74
r     0.78     0.11             0.91     0.72
f1    0.70     0.17             0.90     0.73
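
The F1 values can be checked against the precision and recall values, since F1 is the harmonic mean of the two. The following sketch recomputes it from the rounded figures reported in Table 1; small deviations in the second decimal are to be expected because the inputs are themselves rounded. The dictionary layout is only an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in Table 1
table_1 = {
    ("en_core_web_lg", "PERSON"): (0.64, 0.78, 0.70),
    ("en_core_web_lg", "WORK_OF_ART"): (0.43, 0.11, 0.17),
    ("fine-tuned", "PERSON"): (0.89, 0.91, 0.90),
    ("fine-tuned", "WORK_OF_ART"): (0.74, 0.72, 0.73),
}

for (model, label), (p, r, reported) in table_1.items():
    recomputed = f1_score(p, r)
    print(f"{model:15s} {label:12s} recomputed={recomputed:.3f} reported={reported:.2f}")
```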

After additional manual curation and correction of homonyms, the corpus was used as the basis for a word2vec embedding with 100 dimensions, a window size of 10, and a minimum count of 5 (Mikolov et al. 2013). The resulting word embedding allows for the calculation of the semantic similarity between each previously detected entity and every other entity encoded in the embedding. This makes it possible to represent entities in a similarity network based on how they are discussed in the academic discourse. Kruskal’s algorithm (1956) was used to filter the densely connected initial network (the so-called hairball), yielding a Minimum Spanning Tree (MST), a subgraph that keeps all nodes connected while pruning less significant edges.
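The filtering step can be sketched as follows. The example assumes cosine similarity as the similarity measure and converts it to a distance (1 minus similarity) so that the minimum spanning tree retains the most similar pairs; neither choice is specified above, and both are assumptions of this sketch. The three-dimensional vectors and the AUTHOR_* names are hypothetical stand-ins for the 100-dimensional entity vectors.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Vector = List[float]
Edge = Tuple[float, str, str]  # (distance, entity a, entity b)


def cosine_similarity(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def minimum_spanning_tree(vectors: Dict[str, Vector]) -> List[Edge]:
    """Kruskal's algorithm on the complete graph, weight = 1 - cosine similarity."""
    edges = sorted(
        (1.0 - cosine_similarity(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    )
    parent = {name: name for name in vectors}  # union-find forest

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree: List[Edge] = []
    for dist, a, b in edges:  # shortest (most similar) edges first
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # edge joins two components, so it adds no cycle
            parent[root_a] = root_b
            tree.append((dist, a, b))
    return tree


# Hypothetical toy vectors; real entity vectors would come from the embedding.
toy_vectors = {
    "WILLIAM_SHAKESPEARE_Q692": [0.9, 0.1, 0.0],
    "AUTHOR_A": [0.8, 0.2, 0.1],
    "AUTHOR_B": [0.1, 0.9, 0.2],
    "AUTHOR_C": [0.0, 0.8, 0.3],
}
for dist, a, b in minimum_spanning_tree(toy_vectors):
    print(f"{a} -- {b}  (distance {dist:.3f})")
```

For n entities the tree keeps exactly n - 1 edges, which is what reduces the fully connected network to a readable structure.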

As a representation of the "reference system" of literary history, this filtered network facilitates a meta-reading of literary historical discourse, enhancing accessibility through association: Lesser-known texts and authors can be explored through their connections with more canonical players in literary history. For quantitative text analysis, this means that network data based on this interconnectedness—such as similarities, clusters, and centrality measures—can be used as additional metadata, ensuring that even the analysis of lesser-known literary historical entities is context- and data-rich.
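
As one illustration of how such network measures could be derived as metadata, the following sketch computes degree centrality from an edge list. The edge list and entity names are hypothetical, and degree centrality is only one of several possible measures; the sketch assumes an undirected network without duplicate edges or self-loops.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def degree_centrality(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Share of all other nodes that each node is directly connected to."""
    degree = Counter(node for edge in edges for node in edge)
    n = len(degree)
    if n < 2:
        return {}
    return {node: count / (n - 1) for node, count in degree.items()}


# Hypothetical edges of a filtered similarity network
network_edges = [
    ("AUTHOR_A", "WILLIAM_SHAKESPEARE_Q692"),
    ("AUTHOR_B", "AUTHOR_C"),
    ("AUTHOR_A", "AUTHOR_B"),
]
metadata = degree_centrality(network_edges)
for entity, score in sorted(metadata.items()):
    print(entity, round(score, 2))
```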

Appendix A

Bibliography
  1. Bamman, David / Underwood, Ted / Smith, Noah A. (2014): "A Bayesian Mixed Effects Model of Literary Character", in: Association for Computational Linguistics (ed.): Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, MD, 2014: 370–79. DOI:10.3115/v1/P14-1035.
  2. Bode, Katherine (2018): A World of Fiction: Digital Collections and the Future of Literary History. Ann Arbor, MI: University of Michigan Press.
  3. Burns, Patrick J. / Brofos, James A. / Li, Kyle / Chaudhuri, Pramit / Dexter, Joseph P. (2021): "Profiling of Intertextuality in Latin Literature Using Word Embeddings", in: Association for Computational Linguistics (ed.): Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online, 2021: 4900–07. DOI:10.18653/v1/2021.naacl-main.389.
  4. Explosion AI (2022): spaCy <https://spacy.io>.
  5. Garg, Nikhil / Schiebinger, Londa / Jurafsky, Dan / Zou, James (2018): "Word embeddings quantify 100 years of gender and ethnic stereotypes", in: Proceedings of the National Academy of Sciences, 115, 16. DOI:10.1073/pnas.1720347115.
  6. Grayson, Siobhán / Mulvany, Maria / Wade, Karen / Meaney, Gerardine / Greene, Derek (2017): "Exploring the Role of Gender in 19th Century Fiction Through the Lens of Word Embeddings", in: Gracia, Jorge / Bond, Francis / McCrae, John P. / Buitelaar, Paul / Chiarcos, Christian / Hellmann, Sebastian (eds.): Language, Data, and Knowledge. Cham: Springer, 358–64. DOI:10.1007/978-3-319-59888-8_30.
  7. Heuser, Ryan (2016): Word Vectors in the Eighteenth Century <https://ryanheuser.org/word-vectors/>.
  8. Jacobs, Arthur M. (2019): "Sentiment Analysis for Words and Fiction Characters From the Perspective of Computational (Neuro-)Poetics", in: Frontiers in Robotics and AI 6: 53. DOI:10.3389/frobt.2019.00053.
  9. Jockers, Matthew L. (2013): Macroanalysis: Digital Methods and Literary History. Urbana, IL: University of Illinois Press.
  10. Kruskal, Joseph B. (1956): "On the shortest spanning subtree of a graph and the traveling salesman problem", in: Proceedings of the American Mathematical Society, 7, 1: 48–50. DOI:10.1090/S0002-9939-1956-0078686-7.
  11. Lopez, Patrice (2022): spaCy fishing <https://github.com/Lucaterre/spacyfishing>.
  12. Maryl, Maciej / Karlińska, Agnieszka / Walentynowicz, Wiktor / Walkowiak, Tomasz (2023): "Providing Digital Answers to Disciplinary Questions with Graph Literary Exploration Machine", in: Baillot, Anne / Tasovac, Toma / Scholger, Walter / Vogeler, Georg (eds.): Annual International Conference of the Alliance of Digital Humanities Organizations, Conference Abstracts, Graz, Austria, July 2023: 173–74. DOI:10.5281/ZENODO.8107852.
  13. Mikolov, Tomas / Chen, Kai / Corrado, Greg / Dean, Jeffrey (2013): "Efficient Estimation of Word Representations in Vector Space", arXiv:1301.3781 [cs] <http://arxiv.org/abs/1301.3781>.
  14. Mimno, David (2012): "Computational historiography: Data mining in a century of classics journals", in: Journal on Computing and Cultural Heritage 5, 1: 1–19. DOI:10.1145/2160165.2160168.
  15. Odebrecht, Carolin / Burnard, Lou / Schöch, Christof (eds.) (2021): European Literary Text Collection (ELTeC). COST Action Distant Reading for European Literary History. DOI:10.5281/zenodo.4662444.
  16. Schöch, Christof (2022): "Quantitative Semantik. Word Embedding Models für literaturwissenschaftliche Fragestellungen", in: Jannidis, Fotis (ed.): Digitale Literaturwissenschaft. Stuttgart: J.B. Metzler, 535–62. DOI:10.1007/978-3-476-05886-7_22.
  17. Schöch, Christof / Hinzmann, Maria / Röttgermann, Julia / Dietz, Katharina / Klee, Anne (2022): "Smart Modelling for Literary History", in: International Journal of Humanities and Arts Computing 16, 1: 78–93. DOI:10.3366/ijhac.2022.0278.
  18. Tshitoyan, Vahe / Dagdelen, John / Weston, Leigh / Dunn, Alexander / Rong, Ziqin / Kononova, Olga / Persson, Kristin A. / Ceder, Gerbrand / Jain, Anubhav (2019): "Unsupervised word embeddings capture latent knowledge from materials science literature", in: Nature 571, 7763: 95–98. DOI:10.1038/s41586-019-1335-8.
  19. Wevers, Melvin / Koolen, Marijn (2020): "Digital begriffsgeschichte: Tracing semantic change using word embeddings", in: Historical Methods: A Journal of Quantitative and Interdisciplinary History 53, 4: 226–43. DOI:10.1080/01615440.2020.1760157.
Judith Brottrager (judith.brottrager@tu-darmstadt.de), TU Darmstadt, Germany