Over the past two decades, crowdsourcing has emerged as a viable and effective tool for creating transcriptions from digitized images of physical source material such as handwritten manuscripts. Many early crowdsourced text transcription efforts were one-off projects, but swelling demand led to the development of several online platforms, including Zooniverse, the largest virtual platform for crowdsourced research. Since 2010 more than 75 transcription projects have been built on different iterations of the Zooniverse codebase and platform, testing different approaches and tools for transcription. Many Zooniverse platform users have told their own stories through peer-reviewed publications and gray literature (Brohan, 2019; Brusuelas, 2010; Deines et al., 2018), but the genesis and iterative development of the platform have yet to be described. We will address this lacuna by describing three common challenges that informed the theories and practices underpinning the development of transcription tools on Zooniverse. The paper serves as a reflection on responsible resource management and collaboration in DH.
Crowdsourcing is by nature deeply collaborative, and requires significant design and planning in order to be done ethically, e.g. by not wasting volunteers’ time, being transparent about methods, producing open-access results, and supporting diverse teams and volunteers in using the platform. We aim to provide an example for anyone using, creating, or maintaining technical products used for data production in DH, by demonstrating how the history of technical infrastructure and DH resources provides critical context for interpreting project results.
In this short paper, we will present ongoing research into the history of Zooniverse text transcription projects from 2010 to the present, including the methodological shift away from bespoke development after the 2015 launch of the free-to-use Project Builder (PB), which allows teams to create their own projects and connect with Zooniverse volunteers. We will show how generations of project teams, composed of Zooniverse researchers and diverse academic and cultural heritage partners, tested different approaches to text transcription, and adapted, reused, and refined methods to amplify successes and mitigate the challenges or shortcomings of previous approaches.
This discussion of Zooniverse transcription tools highlights the inherent challenges of designing for distributed transcription by multiple volunteers on a platform that was constantly adapting to meet increasing demand for crowdsourcing tools from many disciplines and practitioners (peaking in 2020 in response to physical institutional closures due to the Covid-19 pandemic), a growing volunteer base, and constant advances in web development (Samuel, 2021). Through a literature review, content analysis, and our work with teams who have used the platform since 2010, we identify three key challenges across Zooniverse transcription projects:
Text data is complex no matter how it is transcribed or tagged. Original documents can be written in diverse languages and scripts, and layout can be structured (e.g. forms or tables), unstructured or semi-structured. Decisions about whether transcribers should produce diplomatic, semi-diplomatic or regularized transcriptions and how to communicate conventions succinctly online can be even more complex when working with distributed volunteers than with scholarly editing teams. These factors make it extremely difficult to design a single transcription approach that can be used for many types of text.
Zooniverse transcriptions can be broken down at the level of a page, paragraph, sentence, line, or character, but until 2018 the unit of classification for the purposes of aggregation was almost always the page/image (Blickhan et al., 2019). Different projects have designed the unit of transcription in radically different ways for various reasons: to increase user confidence, to lower barriers to participation, and to ease automated aggregation. These decisions sometimes compounded the complexity of the resulting data. We will discuss examples of different approaches, their affordances and often unanticipated challenges, as well as successive efforts to tweak the tools and platform to enhance data quality and improve the user experiences of volunteers and project teams.
Data aggregation is the foremost challenge for text transcription projects on the Zooniverse platform, which gathers multiple “classifications” or assessments per page or image, compares them, and seeks a majority assessment. For the first Zooniverse project, Galaxy Zoo, volunteers were asked simple multiple-choice questions, and aggregation was fairly straightforward (Lintott et al.). For text, however, the difficulty again comes with the issue of units. If a page of text is broken down into distinct sections or units, a major challenge is grouping, or clustering, the positional data that tells the platform back-end which transcriptions refer to the same unit, so that the appropriate units of text can be aggregated together. Even if a page is broken down into physical lines of text, each line will contain multiple words, and slight differences in volunteers’ transcriptions, such as spelling and punctuation, can affect the quality of the result. The challenges of aggregating highly variable textual data amplify typical skills gaps between academic disciplines, and between those who code in their line of work and those who don’t, but these challenges cannot be attributed solely to disciplinary differences (Van Hyning, 2019).
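To make the aggregation problem concrete, the following is a minimal illustrative sketch in Python. It is not the Zooniverse aggregation code. It assumes that each volunteer mark has been reduced to a vertical position in pixels plus a transcribed string, that lines can be grouped by a fixed pixel tolerance, and that words can be compared position by position. All function names, the tolerance value, and the sample data are hypothetical.

```python
from collections import Counter
from itertools import zip_longest


def cluster_by_y(marks, tolerance=15):
    """Group (y, text) marks into lines.

    A mark joins the current cluster when its y position lies within
    `tolerance` pixels of the previous mark (hypothetical threshold).
    """
    clusters = []
    for y, text in sorted(marks, key=lambda m: m[0]):
        if clusters and y - clusters[-1][-1][0] <= tolerance:
            clusters[-1].append((y, text))
        else:
            clusters.append([(y, text)])
    return clusters


def consensus_line(texts):
    """Pick the most frequent token at each word position (plurality vote).

    This naive version assumes volunteers split the line into the same
    number of words; real data would need proper sequence alignment.
    """
    token_lists = [t.split() for t in texts]
    consensus = []
    for column in zip_longest(*token_lists):
        votes = Counter(tok for tok in column if tok is not None)
        consensus.append(votes.most_common(1)[0][0])
    return " ".join(consensus)


def aggregate(marks, tolerance=15):
    """Cluster marks into lines, then build one consensus string per line."""
    return [
        consensus_line([text for _, text in cluster])
        for cluster in cluster_by_y(marks, tolerance)
    ]


# Invented sample: three volunteers transcribe the same two lines.
marks = [
    (102, "Dear Sir,"), (100, "Dear Sir"), (98, "Dear Sir,"),
    (141, "I recieved your letter"), (139, "I received your letter"),
    (143, "I received your letter"),
]
print(aggregate(marks))
```

Even in this toy case, the result depends on the chosen tolerance and on volunteers agreeing about word boundaries. A missing or extra word shifts every later position, which is one reason text is harder to aggregate than multiple-choice answers.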
This short paper will provide brief examples of the above challenge categories, and illustrate lessons learned through each developmental stage. We hope that feedback from this session will help to guide our future research efforts, by suggesting how we might keep refining sample datasets and documentation intended to guide Zooniverse transcription project creators through all stages of project design and of working with their resulting data.
This historical overview of text transcription on the Zooniverse platform will be relevant to current and future project creators using the PB, and to anyone working with transcription data produced on Zooniverse in the bespoke or PB instances. This paper could help teams trying to describe the theoretical and methodological underpinnings of the platform they used to gather their data, and provide crucial context for those trying to reuse the data for new purposes. We believe our findings are also applicable to DH practice more broadly: we argue that the collaborative methods of adapting and sustaining existing technologies, and of refining our documentation and understanding of the platforms we create and the resulting data, are ultimately more impactful than continuously prioritizing new builds and tools.