The Digital Democracy Corpus: Comprehensive Proceedings of Four State Legislatures 2015-2018

1. Introduction

We present the details of the government transparency project Digital Democracy (Blakeslee 2015) and discuss two major contributions: first, a novel legislative informatics schema capturing entity relationships in legislative proceedings, and second, a corpus of comprehensive state legislative proceedings (all speeches and remarks) from four states, covering one or two sessions each between 2015 and 2018. Produced at a cost of about $5 million by Cal Poly's Institute for Advanced Technology and Public Policy (IATPP 2024) over four years, the bulk of the corpus consists of human-verified and data-augmented transcripts of state legislative committee hearings and floor sessions in the four largest US states: California, Texas, New York and Florida, together with contextual metadata on all bills discussed at the hearings and on all speakers. The California corpus consists of two full legislative sessions, 2015-2016 and 2017-2018, while the other three state corpora are limited to the 2017-2018 session. The corpus is now released under a Creative Commons license via the IATPP Hugging Face repository huggingface.co/datasets/iatpp/digitaldemocracy-2015-2018 (doi: 10.57967/hf/2802).

2. Background and Methodology

Scholars have long been interested in parliamentary records. They have been used, among other things, to study gender linguistics (Yu 2014), immigration policy (Card 2022), social science testimony (Maher 2020) and language (Albuquerque 2022, Tamper 2022, Li 2017). New corpora have been created to allow wider scholarship. Recent efforts include corpora released from the parliaments of Switzerland (Salamanca 2024), Finland (Virkkunen 2023), Catalonia (Külebi 2022), Italy (Frasnelli 2024), and German-speaking countries (Abrami 2022). Many legislatures already publish proceedings in some form, which makes the creation of corpora a matter of transferring formats or digitizing written records. However, unlike the US Federal Government, US state legislatures do not publish written records of discussions, and Digital Democracy is the only known source for these.

Digital Democracy uses videos of legislative sessions to create highly accurate, human-verified transcripts, speaker identifications and bill alignments, which are then integrated with other data such as motions, bills, votes, service records, committees, and biographies. A simplified data-ingestion pipeline for the project is shown in Figure 1. For more details, see Ruprechter (2020).

Parts of the corpus have already been used for studies with permission (Latner 2017, Kauffman 2018, Budhwar 2018, Ruprechter 2020, Howe 2022, Grace 2023 and Perkonigg 2023), but this paper marks the first open and public release of Digital Democracy data. We note that even this is not the entire set. For purposes of timeliness, clarity and focus on the main speech corpus, many other data dimensions such as voting records, parliamentary motions, bill versions, organizational affiliations, and financial records have been left for future work. As of 2024, Digital Democracy was integrated with nonprofit news agency CalMatters, with a new public portal and live records for the 2023-2024 California legislative session (CalMatters 2024).

Figure 1. High level view of data acquisition for Digital Democracy.

3. Corpus Description and Schema

The released corpus represents all transcribed legislative sessions in four states. As each legislature is independent, the corpus is organized by state and session (2 years). Table 1 summarizes the legislative sessions available in the corpus, and provides a general overview of the content: number of legislative discussions that took place in each session, number of bills, number of unique speakers in all discussions, and total number of speeches recorded.

Table 1. Corpus overview

State       Session      # Discussions  # Bills  # Speakers  # Speeches
California  2015 – 2016         13,030    3,628      15,983     298,962
California  2017 – 2018         16,089    4,657      24,370     392,358
Florida     2017 – 2018          7,176    2,746       4,717     220,513
New York    2017 – 2018          7,599    4,697       2,327     156,422
Texas       2017 – 2018         11,502    4,744      12,011     466,854

For each legislative session we provide nine CSV files, documented in Table 2. These files represent entity and relationship sets that exist in the internal database and touch upon legislative activity, with one difference: whereas our internal database is highly normalized, with every attribute appearing in exactly one database table, the data files in the released corpus are somewhat denormalized. We chose to denormalize some of the data files to make the content of individual rows as human-readable, and as independent of other data files, as possible. See the Hugging Face dataset for field names, updates and example Python code for processing.
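As an illustration of working with the per-session files, the following is a minimal sketch that reads whichever of the nine files are present in a local directory. The file names come from Table 2; the directory layout, the function name `load_session` and the UTF-8 encoding are assumptions, and the example code in the Hugging Face repository should be preferred where it differs.

```python
import csv
from pathlib import Path

# File names as listed in Table 2. Keeping one directory per state and
# session is an assumption about how the files are stored locally.
CORPUS_FILES = [
    "bills.csv", "committees.csv", "committeeHearings.csv",
    "committeeRosters.csv", "hearings.csv", "legislature.csv",
    "people.csv", "speeches.csv", "videos.csv",
]


def load_session(session_dir):
    """Read every corpus file found in session_dir into a list of row dicts."""
    # Bill texts and speeches can be long, so raise the per-field size limit.
    csv.field_size_limit(2**31 - 1)
    tables = {}
    for name in CORPUS_FILES:
        path = Path(session_dir) / name
        if not path.exists():
            continue  # tolerate a partial download
        with path.open(newline="", encoding="utf-8") as handle:
            tables[name] = list(csv.DictReader(handle))
    return tables
```

Each row is returned as a dictionary keyed by the column headers of its file, so the denormalized rows can be inspected without joining against other files.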

4. Entity Relationships

Figure 2 presents a simplified Entity-Relationship diagram of the part of the internal database that was used to create the corpus, and its concordance with the data files. Entity sets and relationship sets in blue (darker shade in B/W renderings) represent parts of the database that were turned into data files, with additional data brought in via many-to-one relationship sets; the names of the respective data files are listed as well.

Informally, here is what is captured in the released dataset. A legislative session documents the activities of state legislators, i.e. the lawmakers listed in “legislature.csv”. The legislature forms legislative committees (“committees.csv”) on which legislators serve. Floor sessions, in which all members of a chamber participate, are treated as a special committee. Legislators write or sponsor bills (“bills.csv”). The bills are discussed at hearings held by committees (“committeeHearings.csv”), which are further subdivided into discussions. Our project builds all hearing transcripts from hearing videos (“videos.csv”) and associates them with hearings. Original hearing videos, often 4-5 hours long, are split into multiple short videos of 15-30 minutes each, yielding a many-to-one relationship between video files and hearings. Splitting into smaller videos simplifies the work of preparing and upleveling hearing transcripts, and makes the videos easier to load and play in an app or on a website.

Table 2. File names and descriptions

Filename  Description
bills.csv List of bills discussed during a legislative session
committees.csv List of legislative committees that held hearings during the period, with name, type and numbers of members broken down by party
committeeHearings.csv Mapping of hearings to corresponding committees, including committee type and chamber
committeeRosters.csv Legislative committee rosters for a given legislative session
hearings.csv A list of legislative hearings that took place during the legislative session
legislature.csv Detailed information on legislators in the session including person id (pid), district, party, biography and twitter account 
people.csv A dictionary of all persons with first and last name listed by pid
speeches.csv All recorded spoken words during the legislative session, broken into individual speeches
videos.csv List of videos with links documenting the proceedings in the legislative session and their mapping to legislative hearings

Each committee hearing or floor session is divided into separate discussions of the different bills brought before the committee or on the floor. The project produces annotated transcripts of each bill discussion, in which identified individual lawmakers, lobbyists, members of the public, government representatives and expert witnesses discuss the bill. We capture information about the individuals who speak in the discussions (“people.csv”, “legislature.csv”) and attribute all spoken words in the transcript to them via “pid” (a unique person identification number). The transcript itself is broken into speeches (“speeches.csv”). For the purposes of this corpus, we define a speech as all words spoken by one person without interruption, from the moment they start speaking until the moment another person begins speaking or the meeting adjourns.
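The speech definition above can be illustrated with a short sketch. It assumes a hypothetical input of time-ordered (pid, text) fragments and collapses every uninterrupted run by the same speaker into one speech. The released speeches.csv is already segmented this way, so this is only meant to make the definition concrete, not to reproduce the project's transcription pipeline.

```python
from itertools import groupby


def merge_into_speeches(utterances):
    """Collapse consecutive (pid, text) fragments by the same speaker."""
    speeches = []
    # groupby only groups adjacent items, which is exactly the
    # "without interruption" condition in the definition of a speech.
    for pid, run in groupby(utterances, key=lambda u: u[0]):
        speeches.append((pid, " ".join(text for _, text in run)))
    return speeches


# Hypothetical fragments in chronological order; the pids are made up.
utterances = [
    (101, "Thank you, Chair."),
    (101, "I rise in support of the bill."),
    (202, "Any questions from members?"),
    (101, "None from me."),
]
speeches = merge_into_speeches(utterances)
print(len(speeches))  # 3
```

Speaker 101 ends up with two separate speeches because speaker 202 interrupts the run.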

Alongside the Digital Democracy Corpus, we release a comprehensive data dictionary that describes all attributes/columns found in the released CSV files, identifies the keys/unique identifiers for rows in each CSV file, and lists any foreign key columns (i.e. columns in one CSV file pointing to a row in another CSV file). While describing all attributes of all released files is outside the scope of this document, we provide a brief description of the portions of the corpus that contain rich textual data.
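A foreign key of the kind described above can be sanity-checked after loading. The sketch below is a generic check over rows held as dictionaries; it assumes, for illustration only, that speech rows carry a “pid” column pointing at “pid” in people.csv. The actual key and foreign key columns should be taken from the data dictionary.

```python
def dangling_references(child_rows, parent_rows, child_column, parent_key):
    """Return child rows whose foreign key value matches no parent row."""
    known = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[child_column] not in known]


# Hypothetical rows; real column names are given in the data dictionary.
people = [{"pid": "1", "last": "Rivera"}, {"pid": "2", "last": "Chen"}]
speech_rows = [
    {"pid": "1", "text": "Good morning."},
    {"pid": "9", "text": "Thank you."},
]
orphans = dangling_references(speech_rows, people, "pid", "pid")
print(len(orphans))  # 1
```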

Figure 2. ER diagram.

5. Textual Fields

Here we describe all natural language text fields in the corpus of which “speeches” are the primary and most voluminous contribution.

Speeches. The heart of the corpus is the hearing transcripts, broken into individual speeches associated with specific bill discussions and individual speakers. These are presented in the speeches.csv file. In addition to the text of each speech, we track its duration. All speeches are indexed to facilitate easy retrieval of the full transcript of a given hearing in chronological order.
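Reassembling a hearing transcript from speech rows might look like the sketch below. The column names “hid”, “startTime”, “pid” and “text” are placeholders chosen for illustration, not the corpus's actual field names. Because a hearing is split across several videos, a real ordering key may also need to include the video order.

```python
def hearing_transcript(speech_rows, hearing_id):
    """Return (pid, text) pairs for one hearing in chronological order."""
    selected = [row for row in speech_rows if row["hid"] == hearing_id]
    # CSV values arrive as strings, so convert before sorting numerically.
    selected.sort(key=lambda row: float(row["startTime"]))
    return [(row["pid"], row["text"]) for row in selected]


# Hypothetical rows using the placeholder column names described above.
rows = [
    {"hid": "H1", "startTime": "95.0", "pid": "2", "text": "Second."},
    {"hid": "H2", "startTime": "3.0", "pid": "5", "text": "Other hearing."},
    {"hid": "H1", "startTime": "12.5", "pid": "1", "text": "First."},
]
transcript = hearing_transcript(rows, "H1")
```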

Bill texts. The released corpus contains the full text of the final version of each bill (stored in bills.csv). It contains not just the text but also recent annotations, such as insertions and deletions on the bill, as they appear on legislative websites such as LegInfo. The field “finalVersionText” in bills.csv contains an XML version of the final text together with the annotations, some of which may be “cross outs.” We therefore urge researchers to be cautious in pre-processing: simply removing the XML tags will likely leave in text that is meant to be omitted.
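The pitfall with crossed-out text can be shown on a toy example. The sketch below assumes, purely for illustration, that deletions are wrapped in a `strike` element and insertions in an `ins` element; the markup actually used in “finalVersionText” may differ by state and may use XML namespaces, so the tag set should be adjusted after inspecting real bills.

```python
import xml.etree.ElementTree as ET

# Hypothetical deletion markup; check the real bill XML for the actual tags.
DELETION_TAGS = {"strike"}


def visible_text(element):
    """Return the text of element, skipping deletion-marked subtrees."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in DELETION_TAGS:
            parts.append(visible_text(child))
        # Text that follows a child belongs to the parent, so always keep it.
        parts.append(child.tail or "")
    return "".join(parts)


sample = "<bill>The fee is <strike>ten</strike><ins>twenty</ins> dollars.</bill>"
root = ET.fromstring(sample)
naive = "".join(root.itertext())  # tag stripping keeps the deleted word
clean = visible_text(root)
print(naive)  # The fee is tentwenty dollars.
print(clean)  # The fee is twenty dollars.
```

Dropping the whole deleted subtree while keeping its trailing text is the important detail: naive tag stripping merges struck-out and inserted wording into one string.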

Bill digests. A bill digest is a short summary of the content of a bill written in more informal language. We provide the digest of the final version of the bill in bills.csv in a field called ‘digest’. Unlike the bill text, the digest is in plain text form.

Legislator biographies. The “legislature.csv” file contains official legislator biographies (‘biography’ field) in double-quote encapsulated plain text. These are collected by the Digital Democracy project from the official web pages of the legislators.

6. Acknowledgements

The authors thank IATPP, Arnold Ventures, Rita Allen Foundation, Dr. Charles Munger Jr., and the National Science Foundation funded program “Central Coast Data Science Partnership” (NSF award #1924008) for their generous support for this release. We thank the numerous students, professors and colleagues from Cal Poly, CalMatters, Civic Actions, Graz University of Technology, Munich University of Applied Sciences, University of Miami, and others who have contributed to this project.

Appendix A

Bibliography
  1. Abrami, G., Bagci, M., Hammerla, L., & Mehler, A. (2022). German parliamentary corpus (gerparcor). arXiv preprint arXiv:2204.10422.
  2. Albuquerque, H. O., Costa, R., Silvestre, G., Souza, E., da Silva, N. F., Vitório, D., ... & Oliveira, A. L. (2022, March). UlyssesNER-Br: a corpus of Brazilian legislative documents for named entity recognition. In International Conference on Computational Processing of the Portuguese Language (pp. 3-14). Cham: Springer International Publishing.
  3. Blakeslee, S., Dekhtyar, A., Khosmood, F., Kurfess, F., Kuboi, T., Poschman, H., ... & Durst, S. (2015). Digital democracy project: making government more transparent one video at a time. Digit. Hum.
  4. Budhwar, A., Kuboi, T., Dekhtyar, A., & Khosmood, F. (2018, May). Predicting the vote using legislative speech. In Proceedings of the 19th annual international conference on digital government research: governance in the data age (pp. 1-10).
  5. CalMatters (2024) Digital Democracy. https://digitaldemocracy.calmatters.org.
  6. Card, D., Chang, S., Becker, C., Mendelsohn, J., Voigt, R., Boustan, L., ... & Jurafsky, D. (2022). Computational analysis of 140 years of US political speeches reveals more positive but increasingly polarized framing of immigration. Proceedings of the National Academy of Sciences, 119(31), e2120510119.
  7. Frasnelli, V., & Aprosio, A. P. (2024, May). There’s Something New about the Italian Parliament: The IPSA Corpus. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 16037-16046).
  8. Grace, J., & Khosmood, F. (2023, June 30). Feature Engineering for US State Legislative Hearings: Stance, Affiliation, Engagement and Absentees. Digital Humanities 2023. Collaboration as Opportunity (DH2023), Graz, Austria. https://doi.org/10.5281/zenodo.8108060
  9. Howe, P., Robertson, C., Grace, L., & Khosmood, F. (2022). Exploring reporter-desired features for an AI-generated legislative news tip sheet. ISOJ, 12(1), 17-44.
  10. Kauffman, D., Williams, M., Washington, C., Socher, G., & Khosmood, F. (2018, May). Multimodal speaker identification in legislative discourse. In Proceedings of the 19th Annual International Conference on Digital Government Research: Governance in the Data Age (pp. 1-10).
  11. Külebi, B., Armentano-Oller, C., Rodríguez-Penagos, C., & Villegas, M. (2022, June). ParlamentParla: A speech corpus of Catalan parliamentary sessions. In Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference (pp. 125-130).
  12. Latner, M., Dekhtyar, A. M., Khosmood, F., Angelini, N., & Voorhees, A. (2017). Measuring legislative behavior: An exploration of Digitaldemocracy.org. California Journal of Politics and Policy, 9(3).
  13. Li, S. (2017). A corpus-based study of vague language in legislative texts: Strategic use of vague terms. English for Specific Purposes, 45, 98-109.
  14. Maher, T. V., Seguin, C., Zhang, Y., & Davis, A. P. (2020). Social scientists’ testimony before Congress in the United States between 1946-2016, trends from a new dataset. Plos one, 15(3), e0230104.
  15. Perkonigg, M., Khosmood, F., & Gütl, C. (2023, August). Automatic Bill Recommendation for Statehouse Journalists. In International Conference on Electronic Government (pp. 128-143). Cham: Springer Nature Switzerland.
  16. Ruprechter, T., Khosmood, F., & Guetl, C. (2020). Deconstructing human-assisted video transcription and annotation for legislative proceedings. Digital Government: Research and Practice, 1(3), 1-24.
  17. Salamanca, L., Brandenberger, L., Gasser, L., Schlosser, S., Balode, M., Jung, V., Perez-Cruz, F. & Schweitzer, F. (2024). Processing Large-Scale Archival Records: The Case of the Swiss Parliamentary Records. Swiss Political Science Review, 30, 140–153. https://doi.org/10.1111/spsr.12590
  18. Tamper, M., Leal, R., Sinikallio, L., Leskinen, P., Tuominen, J., & Hyvönen, E. (2022, August). Extracting knowledge from parliamentary debates for studying political culture and language. In International Workshop on Knowledge Graph Generation From Text and the International Workshop on Modular Knowledge (pp. 70-79). CEUR-WS.org.
  19. Virkkunen, A., Rouhe, A., Phan, N., & Kurimo, M. (2023). Finnish parliament ASR corpus: Analysis, benchmarks and statistics. Language Resources and Evaluation, 57(4), 1645-1670.
Foaad Khosmood (foaad@calpoly.edu), California Polytechnic State University, United States of America and Alex Dekhtyar (dekhtyar@calpoly.edu), California Polytechnic State University, United States of America and Sarah Ellwein (sellwein@calpoly.edu), California Polytechnic State University, United States of America and Bella White (bwhite17@calpoly.edu), California Polytechnic State University, United States of America