The potential of large language models (LLMs) for creative writing tasks and their relative performances has been investigated in previous literature (Gómez-Rodríguez / Williams 2023). Some experiments have also been conducted to simulate authorial voices and evaluate their convincingness. This was often met with negative results, as those described by (Sawicki et al. 2023) and (Rebora 2023), with AI not being able to deceive stylometric methods or classifiers. While (Garrido-Merchán et al. 2023) reported some success in tricking human testers, this outcome relied on high-level, domain-aware prompt engineering and the testers’ unfamiliarity with the original works.
This work-in-progress contribution, however, does not aim to assess how faithfully a large language model replicates a specific author. Instead, it explores how such attempts to replicate authorial voices are conducted, focusing on the models’ ‘literary craft’ and trying to identify the textual components they use to mimic writers. To achieve this, we first build a test corpus by asking a LLM to imitate a diverse sample of authors. Then, we employ some standard text exploration methods from computational literary studies (CLS) to assess the difference between AI- and human-generated texts in terms of content and style and perform a qualitative analysis of the results.
In our experiment, we asked one of the most diffuse LLMs, GPT-4 (9), to write short stories in the style of three English-speaking writers: Edgar Allan Poe, Arnold Bennet, and Katherine Mansfield. We choose to work with English texts because of better performances, and with short stories because their size allows for easier generation while keeping internal coherence. As a comparison corpus, we collected 10 random short stories for each author 1 . We then gave our model a minimalistic starter prompt: “ Write a short story in the style of {author}”, with temperature set at 1. We refrained from any fine-tuning, since we wanted to gauge the model’s “raw potential” without advanced prompt-building (Rebora 2023: 297).
We ran several experiments on our parallel corpora to measure their differences and try to capture key components of authorial voice. First, we tried to understand if there was any significant difference in terms of content via topic modelling, a household technique in CLS (Du 2019) which employs a probabilistic LDA model to uncover thematic structures within corpora.
To try to capture style, instead, we recurred to the Zeta measure, first developed by stylometrists as a ‘distinctiveness’ or ‘keyness’ measure (Burrows 2007) and since then used for contrastive corpora analysis (Schöch 2018). In our experiment, we used its implementation within the PyDistinto package (Du et al. 2021) and compared top keywords in the two groups of texts, investigating which literary devices they commonly use.
A preliminary look at the results of the LDA procedure shows that, beyond the direct influence of training materials, stereotypicality plays a huge role in influencing how AI tries to imitate an author. This can be seen e.g. in the case of Poe, a versatile author whose short stories touched upon several different genres. His prevalent association with Romanticism and Gothic fiction, however, leads the model to produce only texts filled with horror and thriller themes, ignoring other aspects of his production 2 .
As for keyness contrastive analysis, we observed a tendency of AI to use more abstract and refined literary language than humans. The list of words with the highest keyness in AI-Mansfield, as opposed to its real counterpart, was dominated by abstract notions (such as solitude, essence, reality) and meta-narrative language ( storyteller, tale, narrative). Both features were also common to other generated texts while being quite rare in human authors and can thus be considered LLM-specific quirks.
Since the model appears to reproduce topics typical for an author while maintaining a mostly uniform writing style, we can argue that authorial mimicry happens primarily on a thematic rather than stylistic level. In other words, idiosyncratic style nuances seem to be often superseded by generic, cliché-filled composition strategies, whereas the focus remains on the ‘creative’ remixing of contents frequently found in the training set.
While this poster outlines only an initial proof-of-concept, we aim to replicate it in the future with other LLMs, such as Gemini (Gemini Team 2023) and LLaMA (Touvron et al. 2023), and use advanced prompt engineering to measure variations in output features. Increasing and diversifying the authors in the sample might also prove fruitful, especially when pitting genre writers – for which an even stronger thematic imitation can be hypothesised – against more ‘nuanced’ ones. Ultimately, model behaviour in dealing with other genres, such as novels or drama 3 , should be addressed.
More information on sources and editions used, together with the Python script used to run all tests, are available at https://github.com/lucagiovannini7/ai-storyteller .
See e.g. the noun topics the LDA model recognised in AI-Poe: existence, eyes, life, heart, dread, dance, waltz, love, time, heart, death, chateau, despair, mirror, parchment, mansion .
Cf. a previous experiment in (Giovannini and Skorinkin 2023).