Do you care about my gender? The long and winding road to data justice in Wikidata’s representation of gender

The ongoing processes of digitalization and datafication of society are having a significant impact on our daily lives (Kitchin 2021; Pink et al. 2017). Data about us is being generated, shared, and reused at a staggering pace, often with little regard for the risks and ethical questions that such processes entail (Boyd / Crawford 2012). Handling personal data, and especially sensitive data such as gender identity, sexual orientation, or ethnicity, is a ‘matter of care’ (Puig de la Bellacasa 2017) that requires careful ethical consideration.

Indeed, a lack of care in the management, use, or sharing of data entails significant risks for people who are discriminated against or marginalized (see e.g., Noble 2018; Buolamwini / Gebru 2018). This is particularly true for queer people, who are often a minority in popular online communities and therefore more vulnerable to hateful practices (Nash / Browne 2020; Pain 2022). In this context, researching digital projects that involve queer communities is crucially relevant. It is especially important to look at the ethical decisions that are often made behind the scenes and without careful evaluation, which may lead to unfairness, erasure, and injustice. This often happens in for-profit platforms that handle large amounts of data, such as the most popular social networks (Righetti 2021), but it may likewise occur in open and collaborative platforms such as Wikipedia and Wikidata, which are less studied.

In the Wikidata Gender Diversity project (WiGeDi), funded through the Wikimedia Research Fund program, we study how gender is represented in the Wikidata platform (Vrandečić / Krötzsch 2014). A sister project to the more popular Wikipedia, Wikidata is an open, structured knowledge base managed and edited by a wide, international community of volunteers that contribute to the continuous improvement of the data based on their interests and areas of expertise (Piscopo et al. 2017). Due to its openness, data from Wikidata is reused by countless other projects, potentially in uncritical ways that run the risk of amplifying any biases and errors contained in it.

In our project, we look at how the Wikidata community has approached the complex issue of modeling and populating gender data, focusing especially on queer identities, such as those of trans and non-binary people. Working from the idea that gender is a complex social construct (Butler 2004), the project investigates how Wikidata has approached the representation of gender over a span of 10 years to reach its current state. To do so, we adopt the framework of critical data studies (Kitchin / Lauriault 2014; Iliadis / Russo 2016) and a transfeminist, intersectional, and queer perspective (Butler 1990; Wittig 1992; Lorde 1984; Serano 2007; D’Ignazio / Klein 2020). We reject cis-heteronormative, binary, and essentialist views of gender, recognizing the wide variety of gender identities and expressions that exist and have existed throughout human history (Herdt 1994; DeVun 2021).

At first glance, it may be easy to think that a self-governed community of volunteers would automatically handle personal data in a more careful way than a large corporation with profit incentives that rely on the exploitation of the data. However, if the community is not sufficiently large and representative of all corners of society, and if it is not taking serious precautions to evaluate possible risks, it can also run into pitfalls. This is what happened historically with gender data in Wikidata, as the community started from a very narrow binary view of gender which was pushed by some of its most influential users. While there was no explicit intention to exclude or discriminate, there was a desire to add as much data as quickly as possible, which eventually trumped all other considerations and resulted in the automated mass-addition of gender data based on unreliable sources and questionable methods (e.g., inferring gender from the name of a person). Over a span of 10 years, gender data was added for more than nine million people, resulting in significant errors. When the community evolved and critical voices became more vocal, Wikidata users embarked on a slow and difficult process of review of the model and the data that is still ongoing today.

In this presentation, we discuss how the Wikidata community has historically handled gender data, looking at the issue from three perspectives: (1) how gender has been modeled, e.g., by including or excluding specific gender identities, classifying gender identities, or defining properties in implicitly gendered ways; (2) how gender data has been populated, often by automated means and without much regard for the reliability of the data sources; (3) how the community has discussed and deliberated on these issues on Wikidata’s discussion pages.

First, we have looked at how Wikidata models gender and how it has done so historically, attempting to understand the extent to which this representation is fair and inclusive, and how the model has evolved over time to support the representation of a wider spectrum of identities. On a technical level, we have qualitatively investigated the classes and properties that make up the Wikidata ontology, the underlying taxonomies, and labels and descriptions in multiple languages. 1

Second, we have analyzed the data stored in the knowledge base in a quantitative way to gather insights and identify possible gaps involving diverse and marginalized gender identities. We are additionally looking at contextual data such as geographic provenance, dates of birth, occupations, and other relevant data points such as statement sourcing, completeness of the descriptions, multilingual labeling, number of linked Wikipedia biographies, etc. We have developed a Wikidata Gender Dashboard that shows real-time statistics and visualizes our most important findings. 2
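As an illustration of the kind of quantitative summary described above, the following is a minimal sketch and not the project's actual pipeline. It assumes the standard Wikidata identifiers P31 (instance of), Q5 (human), and P21 (sex or gender); the query string is shown for orientation only and may well time out on the public query service given the millions of items involved. The function `summarize_gender_statements`, its input keys `gender` and `has_reference`, and the sample records are all hypothetical and invented for this example.

```python
from collections import Counter

# Illustrative SPARQL query (assumption: run against a Wikidata SPARQL endpoint).
# P31 = "instance of", Q5 = "human", P21 = "sex or gender".
GENDER_COUNT_QUERY = """
SELECT ?gender (COUNT(?person) AS ?count) WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P21 ?gender .
}
GROUP BY ?gender
ORDER BY DESC(?count)
"""


def summarize_gender_statements(statements):
    """Count gender statements per value and the share that cite a reference.

    `statements` is an iterable of dicts with the hypothetical keys
    'gender' (a label string) and 'has_reference' (a bool).
    """
    totals = Counter()
    referenced = Counter()
    for statement in statements:
        totals[statement["gender"]] += 1
        if statement["has_reference"]:
            referenced[statement["gender"]] += 1
    return {
        gender: {
            "count": count,
            "referenced_share": referenced[gender] / count,
        }
        for gender, count in totals.items()
    }


# Hypothetical sample records, made up purely for illustration.
sample = [
    {"gender": "female", "has_reference": True},
    {"gender": "female", "has_reference": False},
    {"gender": "non-binary", "has_reference": True},
    {"gender": "male", "has_reference": False},
]
summary = summarize_gender_statements(sample)
for gender, stats in sorted(summary.items()):
    print(gender, stats["count"], stats["referenced_share"])
```

Reporting the share of referenced statements alongside raw counts reflects the concern with statement sourcing raised in the text: a gender value that appears often but is rarely sourced deserves closer scrutiny than the count alone would suggest.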

Third, we have looked at how the community has handled the move towards the inclusion of a wider spectrum of gender identities. Gender representation is often intrinsically connected to language, and this is especially relevant in a multilingual project such as Wikidata (Melis et al. forthcoming). Therefore, we have analyzed a corpus of user discussions about gender by employing critical discourse analysis (Fairclough 2010) and topic modeling analysis (Blei et al. 2003). 3 The discussions have allowed us to identify how users implemented changes over the years, including to the policies that govern the community. A crucial result of this work is the Wikidata Gender Timeline, a tool to explore the history of gender in Wikidata based on an analysis of hundreds of entities, discussion pages and edit histories. 4

Our presentation covers all three perspectives, showing our current findings and the tools we have developed to explore them. We take particular care to show the relevance of our findings beyond the confines of Wikidata itself, and especially in the many Digital Humanities projects that handle gender data, sometimes through crowdsourcing and sometimes by relying on external sources that may themselves be problematic. The presentation engages with the conference theme of “Reinvention & Responsibility” by highlighting practices of data repair (Ramakrishnan et al. 2021) that can be adopted to address past exclusionary practices and achieve data justice (Taylor 2017).

Appendix A

Bibliography
  1. Blei, David M. / Ng, Andrew Y. / Jordan, Michael I. (2003): “Latent Dirichlet Allocation”, in: Journal of Machine Learning Research 3: 993–1022.
  2. Buolamwini, Joy / Gebru, Timnit (2018): “Gender shades: intersectional accuracy disparities in commercial gender classification”, in: Friedler, Sorelle A. / Wilson, Christo (eds.): Proceedings of the 1st Conference on Fairness, Accountability and Transparency. PMLR, 77–91.
  3. Butler, Judith (1990): Gender Trouble: Feminism and the Subversion of Identity. London: Routledge.
  4. Butler, Judith (2004): Undoing Gender. London: Routledge.
  5. D’Ignazio, Catherine / Klein, Lauren F. (2020): Data Feminism. Cambridge: MIT Press.
  6. DeVun, Leah (2021): The Shape of Sex: Nonbinary Gender from Genesis to the Renaissance. New York: Columbia University Press.
  7. Fairclough, Norman (2010): Critical Discourse Analysis: The Critical Study of Language. London: Routledge.
  8. Herdt, Gilbert, ed. (1994): Third Sex, Third Gender: Beyond Sexual Dimorphism in Culture and History. New York: Zone Books.
  9. Iliadis, Andrew / Russo, Federica (2016): “Critical data studies: an introduction”, in: Big Data & Society 3, 2.
  10. Kaffee, Lucie-Aimée / Piscopo, Alessandro / Vougiouklis, Pavlos / Simperl, Elena / Carr, Leslie / Pintscher, Lydia (2017): “A glimpse into Babel: an analysis of multilinguality in Wikidata”, in: OpenSym '17: Proceedings of the 13th International Symposium on Open Collaboration. New York: Association for Computing Machinery.
  11. Kaffee, Lucie-Aimée / Simperl, Elena (2018): “Analysis of editors’ languages in Wikidata”, in: OpenSym '18: Proceedings of the 14th International Symposium on Open Collaboration. New York: Association for Computing Machinery.
  12. Kitchin, Rob (2021): Data Lives: How Data Are Made and Shape Our World. Bristol: Bristol University Press.
  13. Kitchin, Rob / Lauriault, Tracey P. (2014): “Towards critical data studies: charting and unpacking data assemblages and their work”, in: Thatcher, Jim / Eckert, Josef / Shears, Andrew (eds.): Thinking Big Data in Geography: New Regimes, New Research. Lincoln: University of Nebraska Press.
  14. Lorde, Audre (1984): Sister Outsider: Essays and Speeches. Berkeley: Crossing Press.
  15. Melis, Beatrice (2023): Wikidata Gender Diversity: studying gender representation in Wikidata through the lens of data, model and community. Master’s Thesis. University of Pisa.
  16. Melis, Beatrice / Paolini, Chiara / Fioravanti, Marta / Metilli, Daniele (forthcoming): “What does it mean to be queer in Wikidata? Practices of gender representation within a transnational online community”, in: Bayramoğlu, Yener / Szulc, Łukasz / Gajjala, Radhika (eds.): Communication, Culture and Critique: Special Issue on Transnational Queer Cultures and Digital Media.
  17. Metilli, Daniele / Paolini, Chiara (2023): “Non-binary Gender Representation in Wikidata”, in: Watson, B. M. / Provo, Alexandra / Burlingame, Kathleen (eds.): Ethics in Linked Data. Sacramento: Litwin Books, 221–264.
  18. Nash, Catherine Jean / Browne, Kath (2020): Heteroactivism: Resisting Lesbian, Gay, Bisexual and Trans Rights and Equalities. London: Zed Books.
  19. Noble, Safiya Umoja (2018): Algorithms of Oppression. New York: New York University Press.
  20. Pain, Paromita, ed. (2022): LGBTQ Digital Cultures: A Global Perspective. London: Routledge.
  21. Pink, Sarah / Sumartojo, Shanti / Lupton, Deborah / Heyes La Bond, Christine (2017): “Mundane data: the routines, contingencies and accomplishments of digital living”, in: Big Data & Society 4, 1: 1–12.
  22. Piscopo, Alessandro / Phethean, Chris / Simperl, Elena (2017): “What makes a good collaborative knowledge graph: Group composition and quality in Wikidata”, in: Ciampaglia, Giovanni Luca / Mashhadi, Afra / Yasseri, Taha (eds.): Social Informatics: 9th International Conference, SocInfo 2017, Oxford, UK, September 13-15, 2017, Proceedings, Part I. Cham: Springer, 305–322.
  23. Puig de la Bellacasa, María (2017): Matters of Care: Speculative Ethics in More Than Human Worlds. Minneapolis: University of Minnesota Press.
  24. Ramakrishnan, Kavita / O’Reilly, Kathleen / Budds, Jessica (2021): “The temporal fragility of infrastructure: Theorizing decay, maintenance, and repair”, in: Environment and Planning E: Nature and Space 4, 3: 674–695.
  25. Righetti, Nicola (2021): “The anti-gender debate on social media. A computational communication science analysis of networks, activism, and misinformation”, in: Comunicazione Politica 23, 2: 223–250.
  26. Serano, Julia (2007): Whipping Girl: A Transsexual Woman on Sexism and the Scapegoating of Femininity. New York: Seal Press.
  27. Taylor, Linnet (2017): “What is data justice? The case for connecting digital rights and freedoms globally”, in: Big Data & Society 4, 2.
  28. Vrandečić, Denny / Krötzsch, Markus (2014): “Wikidata: A free collaborative knowledgebase”, in: Communications of the ACM 57, 10: 78–85.
  29. Wittig, Monique (1992): The Straight Mind and Other Essays. Boston: Beacon Press.
Notes
  1. A visualization of the current gender model can be seen at: https://wigedi.com/model. Additional results are presented in Metilli / Paolini (2023) and Melis (2023).
  2. The Wikidata Gender Dashboard is available at: https://wigedi.com/data.
  3. A preliminary topic modeling study is reported in Metilli / Paolini (2023). Further results are forthcoming.
  4. The Wikidata Gender Timeline can be explored at: https://wigedi.com/timeline.

Daniele Metilli (d.metilli@ucl.ac.uk), University College London, UK and Beatrice Melis (beatrice.melis@phd.unipi.it), University of Pisa, Italy; Gran Sasso Science Institute, Italy and Marta Fioravanti (marta@oio.studio), oio.studio, UK and Chiara Paolini (chiara.paolini@kuleuven.be), KU Leuven, Belgium