Finding connections between fundamentally different conceptions of “models”: Explorations of highly structured data in the context of Large Language Models.

One of the major projects in DH has been to find meanings in text through digital means: text analysis. Recently, the emergence of Large Language Models (LLMs) has presented a radically new way of thinking about the meaning captured in texts. The products of LLM research (Zhao 2023), such as ChatGPT (https://chat.openai.com), appear to draw on emerging semantics of texts that allow them at least to mimic human responses in person-machine conversations, much like that envisioned in the Turing Test. The debate about whether LLMs echo aspects of human intelligence has begun (see, for example, Prudkov 2023).

In contrast, although almost all my professional DH life at KCL (in 25 substantial funded DH projects) has been spent on the representation of meaning by digital means, it has centered not on text analysis but on highly structured data (usually in the form of the relational database). Bradley 2021 and Bradley et al. 2019 explore the relationship between digital formal structures and the semantics we found there. Furthermore, my Pliny project has explored how aspects of the more informal process of humanities research, which is at least in good part about finding new interpretations, could be usefully supported digitally. See Bradley 2008 and Bradley / Pasin 2017, and Pliny’s website: https://www.kcl.ac.uk/research/pliny-project.

Both highly structured data and LLMs create models expressing some of the semantics of their material, but an LLM’s model is a very different kind of thing from the models represented by graph-oriented highly structured data projects (and in Pliny). Is there any point of connection? A good amount of the material stored in Pliny consists of traditional annotations, where each annotation is likely to be a short piece of prose text. Thus, a significant part of the semantics in a Pliny dataset might well lie not only in the evident object structures Pliny records, but also hidden, as it were, within the text of the annotations themselves. Similarly, we have found in our large structured-data projects that users often wish to query the text found in title and note objects in the database structure. For example, in the Record of Early English Drama’s EMLoT resource, some regular users have revealed that the word search mechanism is their main point of entry.

To explore this issue we have built mechanisms to extract the textual components from EMLoT, and from the Pliny dataset of a significant Pliny user. Text analysis techniques can then be applied to uncover structure; we have begun this work with Voyant.

We have also begun to explore these texts in the context of LLM models and methods, encouraged by examples found online that work with small texts. One of the fundamental ideas behind Large Language Models is the embedding: an approach that has been described as “[numeric] vectors stored in an index within a vector database” (Besen 2023). An embedding vector (with a high number of dimensions, perhaps thousands) is associated with each word form found in the collection of training texts. Each vector represents a position in an n-dimensional space, and the similarity between two words can be asserted as a numeric measure of the similarity of their embedding vectors (often their cosine similarity). Tools such as word2vec (Mikolov 2013a and 2013b) produce embeddings in which words with high cosine similarity usually also turn out to have semantically related meanings (Ladd 2020). The n-dimensional space in which they sit thus takes on characteristics of a semantic space too. As Besen says, these vectors become a representation of the word’s meaning, but “precisely [what] these numbers mean is known only to the transformer model that generated them” (Besen 2023).
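The cosine-similarity comparison described above can be sketched in a few lines of Python. The vectors here are made-up toy examples with only four dimensions, not output from a real word2vec model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors:
    1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional "embeddings" (real models use hundreds or
# thousands of dimensions, learned from the training texts).
king  = [0.9, 0.7, 0.1, 0.3]
queen = [0.8, 0.8, 0.2, 0.3]
apple = [0.1, 0.2, 0.9, 0.7]

print(cosine_similarity(king, queen))  # close to 1: nearby in the space
print(cosine_similarity(king, apple))  # much lower: far apart
```

In a trained model, “nearby in the space” is what tends to correspond to semantic relatedness.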

Text operating semantically can thus be found in highly structured data, and particularly in Pliny. Hence, part of the semantic significance of these projects is found not only in the object structure but also in the text within those structures, and the semantics carried by the text and by the data structure are likely to be complementary. Visualisations of the data-structure aspect of these projects can be created as mathematical graphs. For visualising the text, in turn, we have the visualisations created by text analysis tools such as Voyant. Furthermore, embeddings provide a new viewpoint for capturing meaning, and can be analysed with statistical tools such as principal component analysis. This poster will present some ways in which points of contact can be found between these two quite different kinds of visualisations. Do they enable a fuller vision of what our materials represent than either does by itself?
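The principal component analysis mentioned above can be sketched briefly: high-dimensional embedding vectors are projected onto their first two principal components so that they can be plotted on a page. The sketch below uses NumPy and made-up 5-dimensional vectors, not real embeddings:

```python
import numpy as np

def pca_2d(vectors):
    """Project high-dimensional vectors onto their first two
    principal components (the directions of greatest variance)."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)            # centre the data
    # SVD of the centred matrix yields the principal axes in Vt.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T               # coordinates in the 2-D plane

# Toy 5-dimensional "embeddings" for four words.
emb = {
    "king":  [0.9, 0.7, 0.1, 0.3, 0.5],
    "queen": [0.8, 0.8, 0.2, 0.3, 0.4],
    "apple": [0.1, 0.2, 0.9, 0.7, 0.1],
    "pear":  [0.2, 0.1, 0.8, 0.8, 0.2],
}
coords = pca_2d(list(emb.values()))
for word, (x, y) in zip(emb, coords):
    print(f"{word:5s} ({x:+.2f}, {y:+.2f})")
```

Words that are close in the full embedding space remain close in the 2-D projection, which is what makes such plots a candidate point of contact with the graph visualisations of the structured data.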

Appendix A

Bibliography
  1. Besen, Sandi (2023): “LLM Embeddings—Explained Simply”. Blog entry at Medium. <https://pub.aimind.so/llm-embeddings-explained-simply-f7536d3d0e4b>. May 15, 2024.
  2. Bradley, John (2008): “Thinking about Interpretation: Pliny and Scholarship in the Humanities”, in: Literary and Linguistic Computing 23, 3: 263–279. DOI: 10.1093/llc/fqn021.
  3. Bradley, John (2021): “Creating Historical Identity with Data: a digital prosopography perspective”, in: Flüh, Marie / Horstmann, Jan / Jacke, Janina / Schumacher, Mareike (eds.): Towards Undogmatic Reading: Narratology, Digital Humanities and Beyond. Hamburg: University of Hamburg Press: 94–108. DOI: 10.15460/HUP.209.
  4. Bradley, John / Rio, Alice / Hammond, Matthew / Broun, Dauvit (2019): “Exploring a model for the semantics of medieval legal charters”, in: International Journal of Humanities and Arts Computing 13, 1-2: 136–154. DOI: 10.3366/ijhac.2017.0184.
  5. Bradley, John / Pasin, Michele (2017): “Fitting Personal Interpretations with the Semantic Web: lessons learned from Pliny”, in: Digital Humanities Quarterly 11, 1. <http://www.digitalhumanities.org/dhq/vol/11/1/000279/000279.html>. May 15, 2024.
  6. Ladd, John R. (2020): “Understanding and Using Common Similarity Measures for Text Analysis”, in: Programming Historian 9. DOI: 10.46430/phen0089.
  7. Mikolov, Tomas / Chen, Kai / Corrado, Greg / Dean, Jeffrey (2013a): “Efficient Estimation of Word Representations in Vector Space”. <https://arxiv.org/pdf/1301.3781.pdf>. May 15, 2024.
  8. Mikolov, Tomas / Sutskever, Ilya / Chen, Kai / Corrado, Greg / Dean, Jeffrey (2013b): “Distributed Representations of Words and Phrases and their Compositionality”, in: Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 2013. <https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf>. May 15, 2024.
  9. Prudkov, Pavel (2023): “Do large language models approach the level of human intelligence?” Blog entry at ResearchGate. <https://www.researchgate.net/post/Do_large_language_models_approach_the_level_of_human_intelligence>. May 15, 2024.
  10. Zhao, Wayne Xin / Zhou, Kun / Li, Junyi / Tang, Tianyi / Wang, Xiaolei / Hou, Yupeng / Min, Yingqian / Zhang, Beichen / Dong, Zican / Du, Yifan / Yang, Chen / Chen, Yushuo / Chen, Zhipeng / Jiang, Jinhao / Ren, Ruiyang / Li, Yifan / Tang, Xinyu / Liu, Zikang / Liu, Peiyu / Nie, Jian-Yun / Wen, Ji-Rong (2023): “A Survey of Large Language Models”. <https://arxiv.org/abs/2303.18223>. May 15, 2024.
John Bradley (john.bradley@kcl.ac.uk), King's College London, United Kingdom