Responsible Datasets in Context: Teaching with Data and Emphasizing the Value of Historical Context

This panel brings together five collaborators on a project called “Responsible Datasets in Context: Collaboratively Designing for Ethical Humanities Data Education,” which seeks to strengthen students’ capacity to work with data responsibly. We created a website that hosts datasets paired with rich documentation, data essays, and teaching resources, all of which foreground context and draw on humanistic perspectives and methods: www.responsible-datasets-in-context.com. Our project is supported by the Mozilla Foundation, the Mellon Foundation, and other partners through a 2023-2024 Responsible Computing Challenge award. The panelists are Sylvia Fernández (University of Texas at San Antonio), Anna Preus (University of Washington), Miriam Posner (UCLA), Amardeep Singh (Lehigh University), and Melanie Walsh (University of Washington).

In this panel, we will share and discuss work that we have done 1) to develop undergraduate courses related to socially responsible computing in the humanities and information science, and 2) to create our online repository of “responsible datasets in context”: a repository of interesting, meaningful cultural datasets that are richly documented with the data’s social and historical context and accompanied by sample lesson plans and class exercises showing how the data can be used in a classroom. Each panelist will discuss the dataset that they have curated and its context; why the dataset is valuable for digital humanities or computing education; how they have taught (or how they plan to teach) with the dataset; how they have encouraged students to consider and incorporate its context into analyses and interpretations; and other key takeaways, challenges, and questions. The goal of the panel is to offer concrete examples, strategies, and frameworks for teaching with data, and to articulate the value of the humanities and digital humanities in undergraduate computing-related education.

Melanie Walsh will begin by discussing problems she has observed in undergraduate computing classes taught outside humanities departments, chiefly that students are not typically trained to care about or seek out information about a dataset’s context and history. She has found that, without this context, students often misinterpret datasets and produce flat findings as a result. To help address these issues, Walsh will discuss how she incorporates a dataset of visits to U.S. National Parks into her introductory programming classes, with the specific aim of demonstrating how a dataset’s construction, documentation, limitations, and history can drastically impact interpretations of that data. This particular dataset, a curated version of data released by the National Park Service, includes information about how many people visited each National Park in the U.S. from 1979 to the present. Following Catherine D’Ignazio and Lauren Klein’s insistence that “what gets counted counts,” Walsh asks her students to research what “counts” as a visit to a National Park, which turns out to have a complicated and hard-to-find answer (a combination of ticket sales, car traffic counters, manual counters, plane flyovers, regression formulas, and more). She also asks them to investigate why certain remote parks in Alaska are recorded as having no visits in certain years, especially parks that are located near existing Indigenous communities, which raises questions about who counts as a visitor as well as how uncertainty should be quantified in data. Through these exercises as well as through paired readings and programming problems, Walsh encourages her students to unpack the complex cultural, environmental, and political histories embedded within this seemingly objective government data, and to recognize the inherent contingency of all data.

Sylvia Fernández will discuss how feminist activists, academics, and people in the public sphere around the world have worked vigorously within data maintenance movements to raise awareness about feminicides and gender violence. These record-keeping efforts, which predate the digital age, demonstrate the reach of gender violence in communities in both the northern and southern hemispheres. There are several known resources that use both analog and digital records related to feminicides in the Americas; in many cases, personal and institutional data were converted to digital resources (such as databases, repositories, maps, timelines, and other multimodal visualizations).

With the intention of expanding these efforts, Fernández will address the design of a course that teaches students how to analyze, develop, and visualize datasets of gender-based violence, feminicide, and human rights feminist movements in ethical and responsible ways. Through a focus on the feminicides and movements in Ciudad Juárez and El Paso, Texas in the early 1990s, students are exposed to material that addresses cases of data terrorism, data violence, and missing data; they are also exposed to material that encourages them to embrace intersectional, transborder, transnational, and translingual humanistic practices to create new, counter, or alternative datasets in solidarity with this global movement. The course considers feminicide at both a local and global level (Fregoso and Bejarano, 2010) and synthesizes analyses of counter data, multilingual border literature, periodical publications, and digital activist projects in order to understand how gender violence has affected and continues to affect women, especially women of color.

Amardeep Singh has been compiling a dataset and corpus consisting of books of poetry published by African American poets between 1850 and 1945. The starting point, Dorothy Porter's North American Negro Poets: A Bibliographical Checklist 1760-1944, is close to a comprehensive account of all books of poetry published by Black authors during these years. One goal is to use the dataset as a tool for discovery (including, where possible, to fill in gaps in Porter’s Checklist). Using this dataset allows us to look at African American poetry without the constraints of editorial filters and academic tradition. The dataset might thus encourage a decentering of the 'Harlem Renaissance' as an endpoint, as well as a decentering of some of the best-known names from the 1920s. Doing so requires closely reading the poetry and employing traditional literary critical classificatory schemes.

Other possible uses of the dataset might be quantitative. Did the amount of poetry published by Black poets increase over time? What was the gender breakdown in African American poetry during this period, and did it change? What was the geographic breakdown in terms of publishing locations and the locations of the authors themselves? Did the writing become more urban, metropolitan, and northern over time? What was the breakdown in terms of self-published poetry vs. big commercial publishing houses, and did that change over time? From within the poetry itself, what are some patterns that we can identify in terms of style and theme? Did the poetry generally become more politicized over time, or less so? How did Black poets use or deviate from established poetic forms? What role does the use of African American Vernacular English (AAVE) play in Black poetry during this period? Here, Singh will present some provisional answers to these quantitative questions.

Anna Preus will discuss approaches to teaching data science students about information systems designed for the organization of print-based and digitized collections of historical texts. Understanding widely used library metadata schemas allows students to unlock huge quantities of annotated historical textual data held in digital libraries like HathiTrust and Internet Archive. However, the terms used to categorize books are highly contested, with metadata standards like Library of Congress Subject Headings taking on a prominent role in debates about how cultural heritage institutions reckon with histories of colonialism, violence, extraction, and erasure that continue to define their present collections. Engaging with these debates can help to illuminate, for data science students, how all information organization systems are historically defined and contingent. Yet metadata schemas can also be hard to teach and to make interesting, particularly when they are detached from the texts themselves. While it is fairly straightforward to, for example, create a dataset of MARC records, it is not so easy to connect it to the books to which those records apply and to discuss the subject tags alongside the texts themselves. Addressing this gap, Preus will discuss building a full-text corpus of public-domain novels included in OCLC’s “Library Top 500” (a list of the 500 most widely held novels in libraries around the world, as reported by OCLC, which maintains the largest database of library records), which she uses to highlight historical contexts for data categorization and metadata creation, to problematize processes of textual preservation and canonization, and to link systems of knowledge organization to methods of close and distant textual analysis.

Finally, Miriam Posner will discuss techniques for aiding the development of “tech-shy” students: those students who have little or no history of working with datasets and find the prospect intimidating. Using the example of the Database of Silent Race Film, which Posner assembled along with her students, the presentation will show how tech-shy students can draw on their existing strengths and knowledge to emerge with new confidence, abilities, and interests. These techniques include setting expectations at the beginning of each lesson; using a set of best practices for composing and incorporating instructional material; creating a low-stakes, cooperative environment during the lesson itself; and enlisting more tech-confident students to ensure that their participation is supportive rather than intimidating. This presentation will focus on practical techniques for creating this environment within a range of settings, relying both on Posner’s own experience with the Database of Silent Race Film and on recent literature on effective pedagogical practices. The oral presentation will be supplemented by a website that summarizes these principles and offers further reading for the participants’ later reference.

Melanie Walsh (melwalsh@uw.edu), University of Washington; Sylvia Fernández (sylvia.fernandez@utsa.edu), University of Texas at San Antonio; Anna Preus (apreus@uw.edu), University of Washington; Miriam Posner (mposner@humnet.ucla.edu), UCLA; Amardeep Singh (amsp@lehigh.edu), Lehigh University