Unlike in other philologies and area studies, computational approaches to American drama have been limited, with a few exceptions such as the early study by Argamon et al. (2009) or the (currently inactive) NASSA project by Mischke et al. (2020). While there are several commercial databases covering American drama, none of them is open-access, provides state-of-the-art text encoding, or includes advanced tools for computational analysis. To fill this gap and provide a significant resource to scholars of American drama, we present here an early version of the American Drama Corpus (AmDraCor), the newest addition to the Drama Corpora (DraCor) collections (Fischer et al. 2019).
Started in 2017, the DraCor project has become the reference infrastructure for the computational investigation of theatre plays, both for its large holdings (more than 3000 plays in more than a dozen languages) and for its resources for textual analysis. All items in DraCor are encoded in TEI-XML and their metadata are enriched with Linked Open Data, allowing the API to extract and compute a range of textual metrics, from network values to speech distribution. 1 In this contribution, we briefly describe the corpus-building process for AmDraCor and showcase its potential for digital American Studies by conducting some initial exploratory data analysis on the plays’ network values.
AmDraCor presently includes a growing number of out-of-copyright American dramatic texts from the early eighteenth to the early twentieth century. Following a low-resource pragmatic approach, all texts have been retrieved from freely available online sources such as Wikisource, Oxford Internet Archive, Project Gutenberg, and Internet Archive. The files, mostly in HTML and TXT, have been converted to the standard DraCor TEI-XML format through the pipeline previously described in Börner et al. (2023). On the occasion of DH2024, we will release a sample corpus of about 40 plays: 2 while most files will need further work to reach DraCor markup standards, especially in terms of metadata, they are already compatible with the platform and can be analysed through its tools. Furthermore, offering an early preview of the corpus is meant as an invitation to external scholars to contribute, according to the distributed and participative approach outlined in Giovannini et al. (2023).
One key feature of the DraCor platform is the possibility to use its API (directly or through wrappers) 3 to extract structured data from the XML files and compute textual statistics. A common procedure, for example, involves grouping texts based on metadata properties (such as period, author, or genre) and then performing comparative analyses of their network values according to the principles of literary network analysis [10] and distant reading [8].
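The grouping procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the experiment: the record layout (the field names `genre`, `size`, `density`), the helper `median_by_group`, and all values are assumptions made up for the example, standing in for whatever structured data is retrieved from the API.

```python
from collections import defaultdict
from statistics import median

# Illustrative records only: the field names and all values below are
# assumptions for this sketch, not actual AmDraCor data.
plays = [
    {"title": "Play A", "genre": "comedy", "size": 12, "density": 0.80},
    {"title": "Play B", "genre": "comedy", "size": 16, "density": 0.90},
    {"title": "Play C", "genre": "comedy", "size": 14, "density": 0.85},
    {"title": "Play D", "genre": "tragedy", "size": 15, "density": 0.50},
    {"title": "Play E", "genre": "tragedy", "size": 17, "density": 0.40},
    {"title": "Play F", "genre": "tragedy", "size": 20, "density": 0.60},
]


def median_by_group(records, group_key, value_key):
    """Return {group: median of value_key} for records sharing group_key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {group: median(values) for group, values in groups.items()}


# Compare the same network metrics across genres.
density_medians = median_by_group(plays, "genre", "density")
size_medians = median_by_group(plays, "genre", "size")
print(density_medians)
print(size_medians)
```

The same helper works for any other metadata property (for instance a period or author field), provided each record carries it.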
In this showcase experiment, run on a larger collection of texts, 4 we measured network density distribution across the best-represented genres within the corpus (tragedy, comedy, farce, historical drama). As in previous studies (Trilcke et al. 2015, Szemes and Vida 2024), we found that comic genres display a significantly higher density than tragic or historical plays (Figure 1), often because all the characters come together in the final scene. 5 Density also shows a moderate inverse correlation (-0.49) with network size; this is consistent with results from previous measurements on other corpora, since comic plays generally have tighter plots with fewer highly interconnected characters (i.e., high density and small network size). Since the size difference between comedies and tragedies is marginal, 6 the variance in density thus cannot be explained without accounting for the specific features of comedy.
Figure 1. Density (left) and size (right) distribution across four genres in AmDraCor.
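To make the density measure and the effect of a final ensemble scene concrete, the following sketch computes the density of an undirected network as the share of possible character pairs that are actually linked, 2E / (N(N-1)). It assumes a co-presence model in which two characters are linked whenever they appear in the same scene; the function names and the four-character play are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def cooccurrence_edges(scenes):
    """Link every pair of characters that share at least one scene."""
    edges = set()
    for cast in scenes:
        for a, b in combinations(sorted(set(cast)), 2):
            edges.add((a, b))
    return edges


def density(scenes):
    """Undirected density: observed pairs / possible pairs, 2E / (N(N-1))."""
    characters = {name for cast in scenes for name in cast}
    n = len(characters)
    if n < 2:
        return 0.0  # density is undefined for fewer than two characters
    return 2 * len(cooccurrence_edges(scenes)) / (n * (n - 1))


# Hypothetical four-character play, segmented into scenes (assumed data).
scenes = [["Anna", "Ben"], ["Ben", "Cora"], ["Cora", "Dan"]]
finale = [["Anna", "Ben", "Cora", "Dan"]]

print(density(scenes))           # 3 of the 6 possible pairs are linked
print(density(scenes + finale))  # the ensemble finale links every pair
```

In this toy case a single scene uniting the whole cast raises the density from 0.5 to 1.0, which is the mechanism suggested above for comic genres.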
The only genre with a notably bigger network size (median = 21) is the history play. As in some other corpora of dramatic texts, such plays often have complex polycentric networks with multiple clusters, reflecting the presentation of events spread across time and space (military campaigns, castle sieges, secret missions, etc.) and depicting opposing groups of people engaged in large historical conflicts. One such example is Jeanne d’Arc by Percy MacKaye (1907), where each act takes place at a different stage of the Hundred Years' War (Figure 2).
Figure 2. Character network for MacKaye’s Jeanne d’Arc. Data from the DraCor API, elaborated in Gephi.
While the example of exploratory data analysis we just presented is quite basic, it offers a glimpse into the potential of AmDraCor once thorough metadata refinement and markup post-correction are completed. In future developments, we aim to further exploit the availability of diverse corpora in DraCor to conduct transnational analyses, as showcased in Trilcke et al. (2024). Accordingly, a possible line of inquiry could involve comparing American drama to other contemporary literary milieus, such as French or German dramatic texts. Even more pertinently, the ingestion of a coeval corpus of British drama 7 into DraCor would also allow measuring whether and when US drama developed some degree of formal autonomy from its European source.
1. A more detailed presentation of the platform can be read in Börner and Trilcke (2023).
2. The corpus will be accessible at https://dracor.org/am.
3. See https://pypi.org/project/pydracor (Python) and https://cran.r-project.org/web/packages/rdracor/index.html (R).
4. This experiment has been run both on AmDraCor and on a larger collection of American plays we cannot publicly release yet because of unclear copyright status.
5. Median density values: comedies (0.83), farces (0.87), history plays (0.56), tragedies (0.5).
6. Median values: comedy (14), tragedy (16).
7. E.g. by onboarding the Victorian Plays Project, which is currently being relaunched with its texts up to DraCor standards (cf. Burnard 2023).