We present the details of the government transparency project Digital Democracy (Blakeslee 2015) and discuss two major contributions: first, a novel legislative informatics schema capturing entity relationships in legislative proceedings, and second, a corpus of comprehensive state legislative proceedings (all speeches and remarks) from four states, covering one or two sessions between 2015 and 2018. Produced at a cost of about $5 million by Cal Poly's Institute for Advanced Technology and Public Policy (IATPP 2024) over four years, the bulk of the corpus consists of human-verified and data-augmented transcripts of state legislative committee hearings and floor sessions in the four largest US states: California, Texas, New York and Florida, together with contextual metadata on all bills discussed at the hearings and on all speakers. The California corpus consists of two full legislative sessions, 2015-2016 and 2017-2018, while the other three state corpora are limited to the 2017-2018 session. The corpus is now released under a Creative Commons license via the IATPP Hugging Face repository huggingface.co/datasets/iatpp/digitaldemocracy-2015-2018 doi: 10.57967/hf/2802.
Scholars have long been interested in parliamentary records. They have been used, among other things, to study gender linguistics (Yu 2014), immigration policy (Card 2022), social science testimonies (Maher 2020) and language (Albuquerque 2022, Temper 2022, Li 2017). New corpora have been created to allow wider scholarship. Recent efforts include corpora released from the parliaments of Switzerland (Salamanca 2024), Finland (Virkkunen 2023), Catalonia (Külebi 2022), Italy (Frasneli 2022), and German-speaking countries (Abrami 2022). Many legislatures already publish proceedings in some form, which makes the creation of corpora a matter of converting formats or digitizing written records. However, unlike the US Federal Government, US state legislatures do not publish written records of discussions, and Digital Democracy is the only known source for these.
Digital Democracy uses videos of legislative sessions to create highly accurate human-verified transcripts, speaker identifications and bill alignments, which are then integrated with other data such as motions, bills, votes, service records, committees, and biographies. A simplified data-ingestion pipeline for the project is shown in Figure 1. For more details, see Ruprechter (2020).
Parts of the corpus have already been used for studies with permission (Latner 2017, Kauffman 2018, Budhwar 2018, Ruprechter 2020, Howe 2022, Grace 2023 and Perkonigg 2023), but this paper marks the first open and public release of Digital Democracy data. We note that even this release is not the entire data set. For purposes of timeliness, clarity and focus on the main speech corpus, many other data dimensions such as voting records, parliamentary motions, bill versions, organizational affiliations, and financial records have been left for future work. As of 2024, Digital Democracy has been integrated with the nonprofit news agency CalMatters, with a new public portal and live records for the 2023-2024 California legislative session (CalMatters 2024).
Figure 1. High level view of data acquisition for Digital Democracy.
The released corpus represents all transcribed legislative sessions in four states. As each legislature is independent, the corpus is organized by state and session (2 years). Table 1 summarizes the legislative sessions available in the corpus, and provides a general overview of the content: number of legislative discussions that took place in each session, number of bills, number of unique speakers in all discussions, and total number of speeches recorded.
Table 1. Corpus overview
State | Session | # Discussions | # Bills | # Speakers | # Speeches |
California | 2015 – 2016 | 13,030 | 3,628 | 15,983 | 298,962 |
California | 2017 – 2018 | 16,089 | 4,657 | 24,370 | 392,358 |
Florida | 2017 – 2018 | 7,176 | 2,746 | 4,717 | 220,513 |
New York | 2017 – 2018 | 7,599 | 4,697 | 2,327 | 156,422 |
Texas | 2017 – 2018 | 11,502 | 4,744 | 12,011 | 466,854 |
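As a reading aid for Table 1, the following minimal sketch computes the total number of speeches in the release and the average number of speeches per discussion for each state and session. The figures are copied from Table 1; the variable and function names are illustrative and are not part of the released corpus or its tooling.

```python
# Rows copied from Table 1:
# (state, session, discussions, bills, speakers, speeches)
TABLE_1 = [
    ("California", "2015-2016", 13030, 3628, 15983, 298962),
    ("California", "2017-2018", 16089, 4657, 24370, 392358),
    ("Florida", "2017-2018", 7176, 2746, 4717, 220513),
    ("New York", "2017-2018", 7599, 4697, 2327, 156422),
    ("Texas", "2017-2018", 11502, 4744, 12011, 466854),
]


def total_speeches(rows):
    """Sum the speech counts over all state/session rows."""
    return sum(row[5] for row in rows)


def speeches_per_discussion(rows):
    """Average number of speeches per discussion, keyed by (state, session)."""
    return {(row[0], row[1]): row[5] / row[2] for row in rows}


if __name__ == "__main__":
    print("Total speeches:", total_speeches(TABLE_1))
    for key, value in speeches_per_discussion(TABLE_1).items():
        print(key, round(value, 1))
```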
For each legislative session we provide nine CSV files, documented in Table 2. These files represent entity and relationship sets that exist in the internal database and touch upon legislative activity, with one difference: whereas our internal database is highly normalized, with every attribute appearing in exactly one database table, the data files in the released corpus are somewhat denormalized. We chose to denormalize some of the data files so that the content of individual rows is as human-readable and as independent of other data files as possible. See the Hugging Face dataset for field names, updates and example Python code for processing.
Figure 2 presents a simplified Entity-Relationship diagram of the part of the internal database that was used to create the corpus and its concordance with the data files. Entity sets and relationship sets in blue (darker shade in B/W renderings) represent parts of the database that were turned (with additional data brought via many-to-one relationship sets) into data files (names of the respective data files are listed as well).
Informally, here is what is captured in the released dataset. The data for a legislative session document the activities of state legislators, i.e. lawmakers (“legislature.csv”). The legislature forms legislative committees (“committees.csv”) on which legislators serve. Floor sessions, in which all members of a chamber participate, are treated as a special committee. Legislators write/sponsor bills (“bills.csv”). The bills are discussed as parts of hearings held by committees (“committeeHearings.csv”), and hearings are further subdivided into discussions. Our project builds all hearing transcripts from hearing videos (“videos.csv”) and associates them with hearings. Original hearing videos, often as long as 4-5 hours, are split into multiple short videos of 15-30 minutes each, yielding a many-to-one relationship between video files and hearings. The conversion to smaller videos simplifies the work of preparing and upleveling hearing transcripts, and makes it easier to load and play these videos in an app or on a website.
Table 2. File names and descriptions
Filename | Description |
bills.csv | List of bills discussed during a legislative session |
committees.csv | List of legislative committees that held hearings during the period, with name, type and numbers of members broken down by party |
committeeHearings.csv | Mapping of hearings to corresponding committees, including committee type and chamber |
committeeRosters.csv | Legislative committee rosters for a given legislative session |
hearings.csv | A list of legislative hearings that took place during the legislative session |
legislature.csv | Detailed information on legislators in the session including person id (pid), district, party, biography and twitter account |
people.csv | A dictionary of all persons with first and last name listed by pid |
speeches.csv | All recorded spoken words during the legislative session, broken into individual speeches |
videos.csv | List of videos with links documenting the proceedings in the legislative session and their mapping to legislative hearings |
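The following is a minimal sketch of how the files in Table 2 might be read for one state and session, using only the Python standard library. It assumes that the nine files for a session sit together in a single directory and are named as in Table 2; the directory layout and the function name `load_session` are assumptions made for illustration, not a description of the Hugging Face repository or of the project's own example code.

```python
import csv
from pathlib import Path

# File names as listed in Table 2 (assumed to be spelled this way on disk).
SESSION_FILES = [
    "bills.csv",
    "committees.csv",
    "committeeHearings.csv",
    "committeeRosters.csv",
    "hearings.csv",
    "legislature.csv",
    "people.csv",
    "speeches.csv",
    "videos.csv",
]


def load_session(session_dir):
    """Read each Table 2 file found in session_dir into a list of row dicts.

    Files that are not present are skipped, so the caller can compare the
    returned keys against SESSION_FILES to see what is missing.
    """
    # Long text fields (e.g. bill texts) can exceed the csv module's
    # default per-field size limit, so raise it before reading.
    csv.field_size_limit(10**9)
    tables = {}
    for name in SESSION_FILES:
        path = Path(session_dir) / name
        if not path.exists():
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            tables[name] = list(csv.DictReader(handle))
    return tables
```

Each row is returned as a dictionary keyed by column name, so the field names given in the data dictionary can be used directly once the files are loaded.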
Each committee hearing or floor session is divided into separate discussions of the different bills brought before the committee or on the floor. The project produces annotated transcripts of bill discussions in which identified individual lawmakers, lobbyists, members of the public, government representatives and expert witnesses discuss the bill. We capture information about the individuals who speak in the discussions (“people.csv”, “legislature.csv”) and attribute all spoken words in the transcript to them via “pid” (a unique person identification number). The transcript itself is broken into speeches (“speeches.csv”). For the purposes of this corpus, we define a speech as all words spoken by one person without interruption, from the moment they start speaking until the moment another person begins speaking or the meeting adjourns.
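To make the definition of a speech concrete, the sketch below merges a chronological list of (pid, text) fragments into speeches by grouping consecutive fragments from the same speaker. This only illustrates the definition on toy data; it is not the project's transcription pipeline, and the fragment representation and the function name are assumptions.

```python
from itertools import groupby


def segments_to_speeches(segments):
    """Merge consecutive (pid, text) fragments by the same speaker.

    A new speech starts whenever the speaker changes; the end of the list
    plays the role of the meeting adjourning.
    """
    speeches = []
    for pid, run in groupby(segments, key=lambda segment: segment[0]):
        speeches.append((pid, " ".join(text for _, text in run)))
    return speeches


# Toy fragments in chronological order (invented for illustration).
segments = [
    (101, "Thank you, Chair."),
    (101, "This bill addresses water quality."),
    (202, "Any questions?"),
    (101, "None from me."),
]
speeches = segments_to_speeches(segments)
for pid, text in speeches:
    print(pid, text)
```

Note that speaker 101 ends up with two separate speeches, since another person spoke in between.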
Alongside the Digital Democracy Corpus, we release a comprehensive data dictionary that describes all attributes/columns found in all released CSV files, and identifies the keys/unique identifiers for rows in each CSV file as well as any foreign key columns (i.e. columns in one CSV file pointing to a row in another CSV file). While describing all attributes of all released files is outside the scope of this document, we provide a brief description of the portions of the corpus that contain rich textual data.
Figure 2. ER diagram.
Here we describe all natural language text fields in the corpus, of which “speeches” are the primary and most voluminous contribution.
Speeches. The heart of the corpus is the hearing transcripts, broken into individual speeches associated with specific bill discussions and individual speakers. These are presented in the speeches.csv file. In addition to the text of each speech, we record its duration. All speeches are indexed so that the full transcript of a given hearing can easily be retrieved in chronological order.
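The retrieval described above can be sketched as follows: filter the speech rows for one hearing, sort them by their ordering field, and resolve speaker names through the person id. The column names used here (`hid`, `startTime`, `pid`, `text`, `first`, `last`) are hypothetical placeholders, as is the toy data; the actual field names are given in the data dictionary released with the corpus.

```python
import csv
import io

# Hypothetical column names; consult the data dictionary for the real ones.
HEARING_COL, ORDER_COL, PID_COL, TEXT_COL = "hid", "startTime", "pid", "text"


def hearing_transcript(speech_rows, people_rows, hearing_id):
    """Return 'Name: text' lines for one hearing in chronological order."""
    names = {p["pid"]: f'{p["first"]} {p["last"]}' for p in people_rows}
    rows = [r for r in speech_rows if r[HEARING_COL] == hearing_id]
    rows.sort(key=lambda r: float(r[ORDER_COL]))
    return [f'{names.get(r[PID_COL], "Unknown")}: {r[TEXT_COL]}' for r in rows]


# Toy stand-ins for speeches.csv and people.csv (invented for illustration).
SPEECHES_CSV = """hid,startTime,pid,text
h1,30,2,"I second the motion."
h2,5,1,"Different hearing."
h1,0,1,"The committee will come to order."
"""
PEOPLE_CSV = """pid,first,last
1,Ada,Smith
2,Ben,Jones
"""

speech_rows = list(csv.DictReader(io.StringIO(SPEECHES_CSV)))
people_rows = list(csv.DictReader(io.StringIO(PEOPLE_CSV)))
transcript = hearing_transcript(speech_rows, people_rows, "h1")
for line in transcript:
    print(line)
```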
Bill texts. The released corpus contains the full text of the final version of each bill (stored in bills.csv). It contains not just the text but also recent annotations, such as insertions and deletions, as they appear on legislative websites such as LegInfo. The field “finalVersionText” in bills.csv contains an XML version of the final text together with the annotations, some of which may be “cross outs.” We therefore urge researchers to be cautious in their pre-processing: simply removing the XML tags will likely leave in text that is meant to be omitted.
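The caution above can be illustrated with a short sketch that drops the content of deletion elements before extracting text, and contrasts it with naive tag stripping. The tag name `strike` and the sample markup are assumptions made purely for illustration; the element names actually used in “finalVersionText” may differ by state and should be checked against the data, and the sketch also assumes the field parses as well-formed XML.

```python
import xml.etree.ElementTree as ET


def text_without_deletions(xml_string, deleted_tags=("strike",)):
    """Extract text from XML, skipping the content of deletion elements.

    deleted_tags holds hypothetical tag names marking crossed-out text.
    If the XML uses namespaces, tags appear as '{namespace}name' and
    must be listed in that form.
    """
    root = ET.fromstring(xml_string)
    parts = []

    def walk(element):
        if element.tag not in deleted_tags:
            if element.text:
                parts.append(element.text)
            for child in element:
                walk(child)
        # Text following an element belongs to its parent, so it is kept
        # even when the element itself is a deletion.
        if element.tail:
            parts.append(element.tail)

    walk(root)
    return " ".join("".join(parts).split())


# Invented sample markup, not taken from the corpus.
SAMPLE = "<bill><p>The fee is <strike>ten</strike><ins>twenty</ins> dollars.</p></bill>"

naive = "".join(ET.fromstring(SAMPLE).itertext())
cleaned = text_without_deletions(SAMPLE)
print(naive)    # naive tag stripping keeps the crossed-out word
print(cleaned)  # deletion content is dropped
```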
Bill digests. A bill digest is a short summary of the content of a bill written in more informal language. We provide the digest of the final version of the bill in bills.csv in a field called ‘digest’. Unlike the bill text, the digest is in plain text form.
Legislator biographies. The “legislature.csv” file contains official legislator biographies (‘biography’ field) in double-quote encapsulated plain text. These are collected by the Digital Democracy project from the official web pages of the legislators.
The authors thank IATPP, Arnold Ventures, the Rita Allen Foundation, Dr. Charles Munger Jr., and the National Science Foundation funded program “Central Coast Data Science Partnership” (NSF award #1924008) for their generous support for this release. We thank the numerous students, professors and colleagues from Cal Poly, CalMatters, Civic Actions, Graz University of Technology, Munich University of Applied Sciences, University of Miami, and others who have contributed to this project.