Distant Reading a Minor Literature: A New Hope?

1. Introduction

Stylometry was first applied to establish the chronology of texts (Lutosławski 1897) and then became a preferred tool of computational authorship attribution, beginning at least with Mosteller and Wallace (1964). These and many later studies found that counting the most frequent words (MFWs), such as pronouns, modals, articles and prepositions, is enough to group texts by author, but that other “signals” such as those of translator, genre, chronology or gender can influence patterns of similarity and difference (Rybicki 2014, 2017, 2022).

It is not at all surprising that this approach to texts – which allows computation over any number of texts – merged with another tendency to “read” as many written products of human culture as is humanly impossible yet computationally possible: Moretti’s distant reading soon moved in the direction of “macroanalysis” (Jockers 2013). Criticized from various positions (Drucker 2017, Mandell 2019), it still remains a way to ask questions of literary “big data” that cannot be answered by traditional close reading because there are simply too many books around.

While most of this research has been done on the dominant English-language literature, its usefulness obviously does not end there. This study strives to show the advantages of this approach within the Polish literary tradition.

2. Material

We have produced a digital literary collection of more than 10,000 full Polish texts for the purpose of stylometry and distant reading. This is the largest collection of Polish literature in ready-to-analyse format. In comparison, the National Corpus of the Polish Language – made, as it is, for very different tasks – is based on only 2,500 books, including non-fiction, and ca. 340 press titles. Still, it is important to note the limits of representativeness, as the contents of the collection were conditioned by a number of historical “filters”: others’ decisions in the past to print, to translate, to digitize, and to produce digital copies of acceptable quality; and, finally, the contemporary decision to include a text in this collection.

Unlike many other similar attempts, the Polish collection presented here contains both public-domain and copyrighted texts, and an almost equal number of original Polish texts and texts translated into Polish from 23 other languages.

The collection contains texts by 1682 authors (original: 674, translated: 1019) and by 1735 translators; 105 of the Polish authors are also translators. It is annotated with basic textual metadata such as author and translator names and dates, first (Polish) publication dates, author and translator sex, first place of publication, etc. The earliest texts date back to the turn of the 14th and 15th centuries, and the latest text is from 2022; statistically reliable numbers of texts appear from the late 18th century. In the future, this collection will be fully open-access (either in full-text mode or, in the case of copyrighted material, as frequencies of the required lexical units).

3. Methods

This study combines metadata analysis of the large-scale collection, in the original distant-reading mode, with stylometry based on MFW frequencies. The aim is to see whether large-scale collections of literary texts such as the one used here, in combination with their metadata and simple semantics (such as “content” rather than “function” word occurrences), may provide better insight into the relative strengths of the various “signals.”

The study adopts a well-tested workflow (Eder 2017) that combines most-frequent-word analysis of the texts, performed with the stylo package (Eder et al. 2016) for R, with network analysis in Gephi. The result is a large network, or map, of this Polish collection, in which proximity between each pair of texts reflects the similarity of their MFW usage. The results of various metadata and semantic counts were then mapped onto this network.
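The distance step of this workflow can be illustrated with a minimal sketch. It is not the stylo implementation used in the study: it is plain Python, it tokenises on whitespace, and it assumes that Cosine Delta means the cosine distance between vectors of z-scored MFW relative frequencies. The edge-building step, which links each text to its k nearest neighbours with decreasing weights, only approximates the network construction of Eder (2017). All function names and the toy texts are hypothetical.

```python
import math
from collections import Counter


def mfw_table(texts, n_mfw):
    """Relative frequencies of the n_mfw most frequent words of the whole corpus."""
    tokenised = {name: text.lower().split() for name, text in texts.items()}
    corpus_counts = Counter()
    for tokens in tokenised.values():
        corpus_counts.update(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    mfw = [word for word, _ in ranked[:n_mfw]]
    table = {}
    for name, tokens in tokenised.items():
        counts = Counter(tokens)
        table[name] = [counts[word] / len(tokens) for word in mfw]
    return mfw, table


def z_scores(table):
    """Standardise each word's frequencies across texts (needs at least 2 texts)."""
    names = list(table)
    n_words = len(table[names[0]])
    scored = {name: [] for name in names}
    for j in range(n_words):
        column = [table[name][j] for name in names]
        mean = sum(column) / len(column)
        sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
        for name in names:
            scored[name].append((table[name][j] - mean) / sd if sd > 0 else 0.0)
    return scored


def cosine_delta(u, v):
    """Cosine distance (1 - cosine similarity) between two z-score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def nearest_neighbour_edges(scored, k=3):
    """Link each text to its k nearest neighbours; the closest gets weight k."""
    edges = []
    for a in scored:
        ranked = sorted(
            (cosine_delta(scored[a], scored[b]), b) for b in scored if b != a
        )
        for rank, (_, b) in enumerate(ranked[:k]):
            edges.append((a, b, k - rank))
    return edges


# Toy corpus (hypothetical); real input would be full texts.
toy_texts = {
    "a": "the cat and the dog",
    "b": "the cat and the bird",
    "c": "a fish in a pond in a lake",
    "d": "a fish in a river in a sea",
}
toy_mfw, toy_table = mfw_table(toy_texts, n_mfw=4)
toy_edges = nearest_neighbour_edges(z_scores(toy_table), k=1)
```

The resulting (source, target, weight) triples could be written to a CSV edge list and opened in a network tool such as Gephi for layout and coloring by metadata.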

4. Results

For lack of space, only the main results are discussed here.

The oft-observed phenomenon where comparison of the usage of MFWs is enough to elicit a chronological progression within any collection of texts can be easily recreated with Polish originals in a network analysis (Figure 1).

Figure 1. Network based on cluster analysis of Cosine Delta distances (translations: grey, Polish originals: color).

Figure 2 shows the same network, now classified as originals and translations. There is a clear division into original and translated literature, somewhat complicated by the signal of genre (see below). This presents good evidence of the existence of what can be called stylometric translationese: differences in word usage between original and translated texts. What is more, translationese seems at least partially further classifiable according to source language. This is visible in Figure 3, where different colors have been applied to translations from the main source languages in the collection. It should be noted, again, that the smaller communities of the same colors appearing in the left part of this graph are due to differences in genre; this becomes evident after comparison to Figure 4.

Figure 2. Network based on cluster analysis of Cosine Delta distances (originals: red, translations: blue).
Figure 3. Network based on cluster analysis of Cosine Delta distances (originals: grey, colors: translations).
Figure 4. Network based on cluster analysis of Cosine Delta distances (prose: white, poetry: red, drama: light blue).

Mapping keyword (“content word”) data onto the same networks creates other interesting possibilities. In Figure 5, the count of all words denoting colors in the collection’s texts allows classifying each text on a “color-words” quartile scale; when this is overlaid with the MFW-based network, poetic texts (left) not only share similar most-frequent-word usage proportions, but also constitute the most “colorful” community. Interestingly, the same part of the network consists of texts in the highest quartile of love-and-sex-related terms, making poetry a genre that combines color and love! Meanwhile, the community of dramas largely coincides with the least colorful texts.

Figure 5. Network based on cluster analysis of Cosine Delta distances with overlaid color word quartiles.
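The quartile overlay described for Figure 5 can be sketched as follows. This is a hedged illustration rather than the procedure actually used: it assumes that color-word counts are normalised by text length, that texts have already been lemmatised (Polish inflection would otherwise hide most matches), and that quartile boundaries come from Python's `statistics.quantiles` with its default method. The three-word lexicon and the function names are hypothetical placeholders.

```python
import statistics

# Hypothetical, deliberately tiny lexicon of Polish color lemmas.
COLOR_WORDS = {"czerwony", "zielony", "niebieski"}


def color_rate(tokens, lexicon=COLOR_WORDS):
    """Share of tokens that belong to the color lexicon."""
    return sum(token in lexicon for token in tokens) / len(tokens)


def quartile_labels(rates):
    """Map each text to a quartile from 1 (lowest rate) to 4 (highest rate)."""
    cuts = statistics.quantiles(list(rates.values()), n=4)  # three cut points
    return {name: 1 + sum(rate > cut for cut in cuts) for name, rate in rates.items()}
```

The labels returned by `quartile_labels` could then be attached to the nodes of the MFW-based network as an attribute and used for coloring, which is one way to obtain an overlay of the kind shown in Figure 5.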

Other results include:

5. Discussion

Our data and our specific results make this large-scale distant-reading/stylometric approach worthwhile, at least from a Polish-studies perspective. But there is more: this may be a way to breach the traditional division between the world’s (mainly post-colonial) “major” literatures and their more modest relatives: in this perspective, all literatures become much more comparable and thus comparative. This gives us new hope of laying to rest the two-centuries-old imperial (and indeed imperialist) bias that “minor literatures may be—not neglected, but passed by with courteous excuse, by the comparative historian… But to the general student they are at most facultative, and the general historian on a limited scale can hardly spare them a faculty of competing” (Saintsbury 1907: 403).

6. Acknowledgement

This research has been funded by the Jagiellonian University’s Flagship Project “Digital Humanities Lab.”

Appendix A

Bibliography
  1. Burrows, John F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87.
  2. Drucker, Johanna (2017). “Why Distant Reading Isn’t,” PMLA, 132(3): 628-635.
  3. Eder, Maciej (2017). “Visualization in stylometry: cluster analysis using networks.” Digital Scholarship in the Humanities, 32(1): 50-64.
  4. Eder, Maciej, Jan Rybicki and Mike Kestemont (2016). Stylometry with R: a package for computational text analysis. The R Journal, 8(1): 107–21.
  5. Jockers, Matthew (2013). Macroanalysis: Digital Methods and Literary History. Champaign: University of Illinois Press.
  6. Krajewska, Wanda (1972). Recepcja literatury angielskiej w Polsce w okresie modernizmu 1887-1918. Wrocław: Ossolineum.
  7. Lutosławski, Wincenty (1897). The origin and growth of Plato's logic; with an account of Plato's style and of the chronology of his writings. London: Longman’s.
  8. Mandell, Laura (2019). “Gender and Cultural Analytics: Finding or Making Stereotypes?” in Debates in the Digital Humanities (ed. Matthew K. Gold and Lauren F. Klein), 3–26. University of Minnesota Press. https://doi.org/10.5749/j.ctvg251hk.4.
  9. Mosteller, Frederick and David L. Wallace (1964). Inference and Disputed Authorship: the Federalist Papers. Reading, Mass.: Addison-Wesley.
  10. Rybicki, Jan (2014). “Pierwszy rzut oka na stylometryczną mapę literatury polskiej,” Teksty drugie 2(146): 106-128.
  11. Rybicki, Jan (2017). “A second glance at a stylometric map of Polish literature,” Forum of Poetics, 8: 6–21.
  12. Rybicki, Jan (2022). “A third glance at a stylometric map of native and translated literature in Polish,” in Retracing the history of literary translation in Poland : people, politics, poetics, (ed. Magdalena Heydel and Zofia Ziemann), London: Routledge, 247-261.
  13. Saintsbury, George (1907). The Earlier Renaissance in Periods in European Literature (ed. George Saintsbury), Edinburgh: Blackwood.
  14. Strzałka, Filip (2021). sentimentPL. https://github.com/philvec/sentimentPL
Agata Kwaśnicka-Janowicz (agata.kwasnicka-janowicz@uj.edu.pl), Faculty of Polish Studies, Uniwersytet Jagielloński, Poland and Jan Rybicki (jan.rybicki@uj.edu.pl), Faculty of Philology, Uniwersytet Jagielloński, Poland