Analyzing Online TOUs (Terms-of-Use) with R Shiny

Digital platforms have changed the way humans interact with each other and the world around them (Samples 2023). Facebook and Instagram have taken over social lives, X (formerly Twitter) and Google have changed how people get information, and Venmo and PayPal have transformed personal finance, to name just a few. To use any such platform, a user must agree to several contracts known as Terms of Use (TOUs), which are notorious for convoluted, difficult-to-comprehend language (Samples 2023, Samples et al. 2024). If a user is unable to read a contract, this raises the question of whether the contract is fair. To understand trends in contract complexity and reading difficulty for consumers, this work analyzes the linguistic patterns in TOUs longitudinally.

To accomplish this objective, a dataset was compiled by scraping 323 TOUs from 21 platforms, dating from 1999 to 2024, using the Internet Archive. The full corpus consists of approximately 3 million words. The TOUs come from platforms spanning social media, finance, dating, gaming, and business. After initial data gathering and compilation, the text files were run through several rounds of regular expression scripts to remove non-UTF-8 characters and prepare the data for further analysis and Natural Language Processing (NLP). Next, the data was submitted to part-of-speech tagging and analysis with the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC), generating over 100 indices of linguistic complexity for each TOU. At present, the project focuses on eight variables associated with noun phrase and verbal complexity, in addition to results for traditional readability formulas (Flesch and FRE, obtained with the R packages quanteda and the tidyverse; Benoit et al. 2018, Wickham et al. 2019). To see how readability has changed longitudinally, this study needed a tool to dynamically view how each platform changed over time. To aid the researcher, this tool required the flexibility to compare different platforms, focus on varying sub-periods, and choose from varying graph styles.
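The cleaning and readability steps described above can be illustrated with a minimal R sketch. It is not the project's actual scripts: the function name `clean_tou`, the sample strings, and the choice of readability measures are assumptions made for illustration. The readability call assumes a recent quanteda release, in which `textstat_readability()` lives in the companion package quanteda.textstats, and it is skipped if those packages are not installed.

```r
# Sketch only: `clean_tou` and the sample strings are hypothetical,
# not the project's actual regular expression scripts.
clean_tou <- function(text) {
  # Drop any bytes that are not valid UTF-8.
  text <- iconv(text, from = "UTF-8", to = "UTF-8", sub = "")
  # Replace control characters (tabs, newlines, etc.) with spaces.
  text <- gsub("[[:cntrl:]]+", " ", text)
  # Collapse runs of whitespace and trim both ends.
  trimws(gsub("\\s+", " ", text))
}

# Made-up snippets standing in for scraped TOU files.
raw <- c(
  platform_a_2010 = "By using  the Service,\tyou agree to these Terms.\nWe may change them.",
  platform_b_2020 = "You must be 13 or older.   Do not misuse our Services."
)
cleaned <- vapply(raw, clean_tou, character(1))

# Readability scores, computed only if the optional packages are available.
# The measures chosen here are an assumption; the source does not list them
# beyond naming Flesch-type formulas.
if (requireNamespace("quanteda", quietly = TRUE) &&
    requireNamespace("quanteda.textstats", quietly = TRUE)) {
  corp <- quanteda::corpus(cleaned)
  scores <- quanteda.textstats::textstat_readability(
    corp,
    measure = c("Flesch", "Flesch.Kincaid")
  )
  print(scores)
}
```

Keeping the cleaning step as a plain function makes it easy to apply the same rules to every archived version of a TOU before any indices are computed.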

Using R Shiny, this project created an app that allows the user to select any number of platforms and any number of metrics to view (Chang et al. 2024). As platforms and metrics are chosen, they are added to the chart space, along with a table describing the chosen metrics and platforms. Each metric is listed with a description and an example, and each platform with a description and its number of users. The default view shows each metric on its own step chart, with each line representing a separate platform, differentiated by color. The range of years depicted in the chart autoscales to fit the data selected. When the user hovers over a data point, the app displays information about that point. Thanks to the Plotly package in R, each graph can be saved as a picture, zoomed in and out, and panned. The app also offers customizable viewing options. First, the user can choose not to overlay the platforms on the graph, which separates each chart by platform. Second, the user can change the year span to home in on a certain time frame. Third, the user can choose the graph view: a step chart, a linear model, or a LOESS model. This structure gives the researcher the flexibility to view the data in multiple ways, allowing for a thorough longitudinal trend analysis.
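An app with the controls described above could be structured roughly as follows. This is a sketch under stated assumptions, not the project's code: the toy data frame `tou`, the helper names `select_rows` and `build_plot`, the input IDs, and the use of ggplot2 with `plotly::ggplotly()` are all illustrative choices, since the source only states that Shiny and Plotly are used. The app portion is defined only if shiny, plotly, and ggplot2 are installed, and it launches only in an interactive session.

```r
# Hypothetical long-format data: one row per platform, metric, and year.
# The values are invented purely so the sketch has something to draw.
tou <- expand.grid(
  platform = c("PlatformA", "PlatformB"),
  metric = c("word_count", "flesch"),
  year = 2005:2024,
  stringsAsFactors = FALSE
)
tou$value <- ifelse(
  tou$metric == "word_count",
  3000 + 250 * (tou$year - 2005),
  50 - 0.5 * (tou$year - 2005)
) + ifelse(tou$platform == "PlatformB", 5, 0)

# Keep only the platforms, metrics, and year span the user selected.
select_rows <- function(data, platforms, metrics, years) {
  keep <- data$platform %in% platforms &
    data$metric %in% metrics &
    data$year >= years[1] & data$year <= years[2]
  data[keep, , drop = FALSE]
}

# Build one chart per metric (overlay) or per metric and platform (no overlay).
build_plot <- function(data, style = "step", overlay = TRUE) {
  p <- ggplot2::ggplot(
    data,
    ggplot2::aes(x = year, y = value, colour = platform)
  )
  p <- p + switch(
    style,
    step = ggplot2::geom_step(),
    linear = ggplot2::geom_smooth(method = "lm", formula = y ~ x, se = FALSE),
    loess = ggplot2::geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
  )
  if (overlay) {
    p + ggplot2::facet_wrap(~metric, scales = "free_y", ncol = 1)
  } else {
    p + ggplot2::facet_grid(metric ~ platform, scales = "free_y")
  }
}

pkgs <- c("shiny", "plotly", "ggplot2")
if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  ui <- shiny::fluidPage(
    shiny::selectInput("platforms", "Platforms",
      choices = unique(tou$platform), selected = unique(tou$platform),
      multiple = TRUE
    ),
    shiny::selectInput("metrics", "Metrics",
      choices = unique(tou$metric), selected = "word_count", multiple = TRUE
    ),
    shiny::sliderInput("years", "Year span",
      min = min(tou$year), max = max(tou$year),
      value = range(tou$year), step = 1, sep = ""
    ),
    shiny::radioButtons("style", "Graph view",
      choices = c("step", "linear", "loess")
    ),
    shiny::checkboxInput("overlay", "Overlay platforms", value = TRUE),
    plotly::plotlyOutput("chart")
  )

  server <- function(input, output, session) {
    output$chart <- plotly::renderPlotly({
      # Wait until at least one platform and one metric are selected.
      shiny::req(input$platforms, input$metrics)
      rows <- select_rows(tou, input$platforms, input$metrics, input$years)
      # ggplotly() adds hover tooltips, zoom, pan, and image export.
      plotly::ggplotly(build_plot(rows, input$style, input$overlay))
    })
  }

  if (interactive()) shiny::runApp(shiny::shinyApp(ui, server))
}
```

Separating the filtering and plotting helpers from the reactive code keeps the server function short and lets the helpers be checked outside a running app.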

Since this tool was implemented, the study has begun to reveal interesting findings. Initial results show that, in general, the size of TOUs is increasing across all platforms in terms of word count. Although there is no clear linguistic pattern of change over time when the dataset is considered collectively, trends reveal that metrics for platforms from a common corporation (e.g., Meta platforms) tend to converge. This suggests that shared corporate management leads to similar TOUs. In addition, certain platforms reveal increasing verbal complexity longitudinally. Financial technology companies (e.g., Venmo and PayPal) tend to have longer and more linguistically complex TOUs than most other platforms. This may in large part be due to financial jargon added to the TOU, which adds more clauses and greater lexical diversity.

While this tool will continue to provide meaningful insight into TOUs, the app is also promising because it generalizes to other longitudinal studies. The structure of the code can be easily adapted to any longitudinal dataset that has a list of metrics and a grouping variable. Examples of use cases include economic metrics of different countries over time, sports statistics of different teams or players over time, education metrics of different colleges over time, and more.
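To make the required data shape concrete, the following R sketch reshapes a hypothetical country-level table into one row per group, time point, and metric. The helper `to_long`, the column names, and the values are invented for illustration; the source does not specify how the app stores its data.

```r
# Hypothetical helper: stack the chosen metric columns into long format.
to_long <- function(data, group, time, metrics) {
  pieces <- lapply(metrics, function(m) {
    data.frame(
      group = data[[group]],
      time = data[[time]],
      metric = m,
      value = data[[m]],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

# Made-up economic indicators for two countries.
wide <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(2000, 2010, 2000, 2010),
  gdp_growth = c(2.1, 1.4, 3.0, 2.2),
  unemployment = c(5.0, 6.5, 7.2, 4.8)
)

long <- to_long(
  wide,
  group = "country",
  time = "year",
  metrics = c("gdp_growth", "unemployment")
)
print(long)
```

With data in this shape, the grouping variable plays the role of the platform and each metric can be charted over time in the same way, whatever the subject area.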

Appendix A

Bibliography

Amos, Ryan, Acar, Gunes, Lucherini, Eli, Kshirsagar, Mihir, Narayanan, Arvind, & Mayer, Jonathan: (2021). ‘Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset’, IW3C2 (International World Wide Web Conference Committee), Ljubljana, Slovenia.
Arnold, Taylor: (2017). ‘A tidy data model for natural language processing using cleanNLP’, arXiv preprint arXiv:1703.09570.
Benoit, Kenneth, Watanabe, Kohei, Wang, Haiyan, Nulty, Paul, Obeng, Adam, Müller, Stefan, & Matsuo, Akitaka: (2018). ‘quanteda: An R package for the quantitative analysis of textual data’, Journal of Open Source Software, 3/30: 774–774. The Open Journal.
Benoliel, Uri, & Becher, Samuel I: (2019). ‘The Duty to Read the Unreadable’, Boston College Law Review, 60: 2255. HeinOnline.
Bernstein, Anya: (2021). ‘Legal corpus linguistics and the half-empirical attitude’, Cornell Law Review, 106: 1397. HeinOnline.
Biber, Douglas, Gray, Bethany, Staples, Shelley, & Egbert, Jesse: (2020). ‘Investigating Grammatical Complexity in L2 English Writing Research: Linguistic Description versus Predictive Measurement’. Journal of English for Academic Purposes. Elsevier.
Chang, Winston, Cheng, Joe, Allaire, JJ, Sievert, Carson, Schloerke, Barret, Xie, Yihui, Allen, Jeff, McPherson, Jonathan, Dipert, Alan, and others: (2024). shiny: Web Application Framework for R. v. 1.8.0.
Jockers, Matthew: (2013). Macroanalysis: Digital Methods and Literary History. University of Illinois Press.
Kyle, Kris: (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication (Doctoral dissertation). Georgia State University, Atlanta, Georgia.
Kyle, Kris, & Crossley, Scott: (2018). ‘Measuring syntactic complexity in L2 writing using fine-grained clausal and phrasal indices’, The Modern Language Journal, 102/2: 333–49. Wiley Online Library.
Larsson, Tove, & Kaatari, Henrik: (2020). ‘Syntactic complexity across registers: Investigating (in) formality in second-language writing’, Journal of English for Academic Purposes, 45: 100850. Elsevier.
Martínez, Eric, Mollica, Francis, & Gibson, Edward: (2022). ‘Poor writing, not specialized concepts, drives processing difficulty in legal language’, Cognition, 224: 105070. Elsevier.
R Core Team: (2023). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Samples, Tim: (2023). Consumer Contracting in the Smartphone Era: New Challenges, An Old Conundrum, in Emerging Issues at the Intersection of Law and Technology, Elvy, S.A. & Kim, S. (Eds.)
Samples, Tim, Ireland, Katherine, & Kraczon, Caroline: (2024). TL;DR: The Law and Linguistics of Social Platform Terms-of-Use. Berkeley Technology Law Journal.
Vogel, Friedemann, Hamann, Hanjo, & Gauer, Isabelle: (2018). ‘Computer-assisted legal linguistics: Corpus analysis as a new tool for legal studies’, Law & Social Inquiry, 43/4: 1340–63. Cambridge University Press.
Wickham, Hadley: (2016). ggplot2: elegant graphics for data analysis. Springer.
Wickham, Hadley, Averick, Mara, Bryan, Jennifer, Chang, Winston, McGowan, Lucy D’Agostino, François, Romain, Grolemund, Garrett, and others: (2019). ‘Welcome to the Tidyverse’, Journal of Open Source Software, 4/43: 1686.
Wijffels, Jan: (2023). ‘udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the “UDPipe” “NLP” Toolkit’.