Nothing to Link to: Linked Open Data and the Colonial Archive

This paper addresses the limitations that existing reference vocabularies impose on modelling non-“Western” concepts as Linked Open Data (LOD), and how to mitigate them, using the GLOBALISE project as a case study. 1 These limitations risk reinforcing archival absences and colonial viewpoints if left unaddressed by digital history projects dealing with colonial archives.

LOD has been suggested as a particularly suitable approach to modelling data derived from the colonial archive (Ortolja-Baird and Nyhan 2022). Its flexibility allows for the representation of multiple epistemologies in a single dataset, potentially embedding multiperspectivity into the data model (Oldman and Tanase 2018). In order to implement this, thesauri or reference vocabularies representing these epistemologies are needed.

Thesauri and reference vocabularies play a key role in modelling linked data. They allow different instances of entities of the same type to be connected, which allows for query expansion. This is particularly valuable for entities for which there are no authority files to identify them directly, which is especially likely for marginalised entities. Furthermore, it allows for the description of unnamed entities (Luthra et al. 2023). A golden keris (a type of dagger) mentioned in the archive is likely impossible to connect to a particular surviving object. However, if identified as an instance of the type keris, it can be connected to other instances of the same type, either other mentions within the same archive or even to surviving objects in collections.
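The type-based linking and query expansion described above can be illustrated with a minimal sketch. It assumes a toy set of triples and a SKOS-style `skos:broader` hierarchy; every identifier (`archive:mention/1`, `museum:object/42`, `thesaurus:keris`, and so on) is hypothetical and is not taken from the GLOBALISE data.

```python
# Toy (subject, predicate, object) triples. All identifiers are hypothetical.
TRIPLES = [
    ("archive:mention/1", "rdf:type", "thesaurus:keris"),   # unnamed golden keris
    ("archive:mention/2", "rdf:type", "thesaurus:keris"),   # another archival mention
    ("museum:object/42", "rdf:type", "thesaurus:keris"),    # surviving object
    ("archive:mention/3", "rdf:type", "thesaurus:pike"),
    ("thesaurus:keris", "skos:broader", "thesaurus:dagger"),
    ("thesaurus:dagger", "skos:broader", "thesaurus:weapon"),
    ("thesaurus:pike", "skos:broader", "thesaurus:weapon"),
]


def narrower_concepts(concept, triples):
    """Return the concept plus every concept below it via skos:broader."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in triples:
            if pred == "skos:broader" and obj == current and subj not in found:
                found.add(subj)
                frontier.append(subj)
    return found


def instances_of(concept, triples, expand=True):
    """List entities typed with the concept, optionally expanding to narrower concepts."""
    concepts = narrower_concepts(concept, triples) if expand else {concept}
    return sorted(s for s, p, o in triples if p == "rdf:type" and o in concepts)


# Mentions and a surviving object meet through the shared type, not a shared identity.
print(instances_of("thesaurus:keris", TRIPLES))
# Query expansion: asking for the broader concept also retrieves the keris instances.
print(instances_of("thesaurus:dagger", TRIPLES))
```

Without the shared `thesaurus:keris` concept, the three keris entities would have no link to one another, and a query for the broader concept would return nothing.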

This cross-dataset connection is only possible where datasets use the same (or connected) thesauri. For this reason, reuse of vocabularies is recommended, as unique vocabularies would leave linked data fragmented (Hyland and Villazón Terrazas 2011). However, existing standard vocabularies are woefully inadequate for the description of non-“Western” concepts. LOD, like data in general, suffers from a white, male, “Western” bias (Radstok, Chekol, and Schaefer 2021; D’Ignazio and Klein 2020). Standard vocabularies are no exception. Thus, when we try to create LOD representations of (pre-)colonial subjects, we are often left with nothing to link to, thereby leaving marginalised subjects doubly absent: from the data and from the metadata.

We encountered this problem in our work at the GLOBALISE project, which aims to improve the accessibility of a key series of documents from the Dutch East India Company (VOC) archives (1602-1795) through Handwritten Text Recognition (HTR) as well as entity and event recognition (Petram and van Rossum 2022). In our endeavour to represent entities (e.g. people, places, objects) found in the archive as LOD, we have found it impossible to describe many of them with existing standard vocabularies. The entities we are able to link are those already prioritised in the colonial archive, leaving marginalised entities to be described with much more generic categories. This does not help to redress the aforementioned imbalance, as underrepresented entities remain vaguely described.

Another issue is that even when specific concepts exist, they are often inadequately categorised, either reinforcing hegemonic categorisations or leaving concepts completely unclassified and isolated. Exacerbating these problems is the fact that the process by which existing vocabularies are formed is largely inaccessible to those outside the institutions that maintain them. Wikidata provides a possible alternative: its open and collaborative nature renders it more accessible to those outside “Western” institutions, it has a much wider language coverage, and it generally covers more non-“Western” concepts than institutional standard vocabularies (Zhu et al. 2023). However, it too suffers from an overrepresentation of “Western” concepts, as well as general inaccuracies (Zhu et al. 2023; Radstok, Chekol, and Schaefer 2021).

Due to these limitations, we cannot rely (solely) on existing thesauri for the description of entities found in the VOC archives, particularly those already marginalised by the archive creators. Using only these vocabularies would result in the re-silencing of the subjects we have taken care to make visible again in the archive. While creating custom vocabularies can allow for a more restorative depiction of the subject, it does run the risk of leaving the data isolated. One suggested solution is to use both custom and reference vocabularies to enrich data (Ortolja-Baird and Nyhan 2022). While this embeds multiperspectivity into the data model, it still prioritizes the hegemonic perspective represented by the authorized vocabulary. Linking occurs through the reference vocabulary, meaning that discovery remains mediated through its Eurocentric categorizations.
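The trade-off of the combined approach can be made concrete with a small sketch. It assumes a hypothetical mapping table from custom concepts to reference concepts (comparable in spirit to a SKOS mapping such as `skos:closeMatch`); the dataset contents and all names (`custom:keris`, `ref:daggers`, `MAPPINGS`, and so on) are invented for illustration only.

```python
# Hypothetical mapping from custom-thesaurus concepts to a reference vocabulary.
MAPPINGS = {
    "custom:keris": "ref:daggers",               # only a generic reference concept exists
    "custom:concept-without-match": None,        # nothing to link to
}

# Dataset A uses the custom thesaurus; dataset B uses only the reference vocabulary.
DATASET_A = {"a:item1": "custom:keris", "a:item2": "custom:concept-without-match"}
DATASET_B = {"b:item9": "ref:daggers"}


def reference_concept(concept, mappings):
    """Resolve a concept to its reference-vocabulary equivalent, if there is one."""
    if concept.startswith("ref:"):
        return concept
    return mappings.get(concept)


def linked_pairs(dataset_a, dataset_b, mappings):
    """Pair items across datasets whenever they resolve to the same reference concept."""
    pairs = []
    for item_a, concept_a in dataset_a.items():
        ref_a = reference_concept(concept_a, mappings)
        if ref_a is None:
            continue  # unmapped custom concepts stay isolated
        for item_b, concept_b in dataset_b.items():
            if reference_concept(concept_b, mappings) == ref_a:
                pairs.append((item_a, item_b, ref_a))
    return pairs


for pair in linked_pairs(DATASET_A, DATASET_B, MAPPINGS):
    print(pair)
```

In this toy setup the only cross-dataset link runs through the generic reference concept, so the specificity of the custom concept is lost at the point of discovery, and the unmapped concept is not discoverable at all.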

The GLOBALISE project, like several other projects, 2 is taking the custom thesaurus approach (Nijman and Pepping 2023). A first subset containing over 3000 concepts has been released (Pepping et al. 2023). We recognize that addressing these issues of Eurocentrism and isolated data requires communal approaches. Isolation cannot be solved in isolation. For this reason, we pursue collaboration with other digitization projects of colonial archives, such as the aforementioned Unlocking Colonial Archives, 3 as well as IN_CONTEXT, 4 and CAPASIA. 5 We also reach out to other cultural heritage professionals through activities such as our discussion at the Linked Pasts symposium (Nijman and Kuruppath 2023). This short paper will present the results of such collaboration and show how the criticisms it raised have been incorporated into the GLOBALISE thesaurus.

Appendix A

Bibliography
  1. Candela, Gustavo, Javier Pereda, Dolores Sáez, Pilar Escobar, Alexander Sánchez, Andrés Villa Torres, Albert A. Palacios, Kelly McDonough, and Patricia Murrieta-Flores. 2023. ‘An Ontological Approach for Unlocking the Colonial Archive’. Journal on Computing and Cultural Heritage 16 (4): 1–18. https://doi.org/10.1145/3594727.
  2. D’Ignazio, Catherine, and Lauren F. Klein. 2020. Data Feminism. MIT Press.
  3. Hyland, Bernadette, and Boris Villazón Terrazas. 2011. ‘Linked Data Cookbook’. W3C. 2011. https://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook.
  4. Luthra, Mrinalini, Konstantin Todorov, Charles Jeurgens, and Giovanni Colavizza. 2023. ‘Unsilencing Colonial Archives via Automated Entity Recognition’. Journal of Documentation. https://doi.org/10.1108/JD-02-2022-0038.
  5. Nijman, Brecht, and Manjusha Kuruppath. 2023. How Can Authoritative Vocabularies Be More Inclusive of Non-Western Concepts? Presented at Linked Pasts 9, Online, December 4.
  6. Nijman, Brecht, and Kay Pepping. 2023. Building a VOCabulary: The Uses and Challenges of Thesauri for Working with Early Modern Recognized Entities. Presented at DHBenelux 2023, Brussels. https://doi.org/10.5281/zenodo.7973694.
  7. Oldman, Dominic, and Diana Tanase. 2018. ‘Reshaping the Knowledge Graph by Connecting Researchers, Data and Practices in ResearchSpace’. In The Semantic Web – ISWC 2018, edited by Denny Vrandečić, Kalina Bontcheva, Mari Carmen Suárez-Figueroa, Valentina Presutti, Irene Celino, Marta Sabou, Lucie-Aimée Kaffee, and Elena Simperl, 325–40. Lecture Notes in Computer Science. Cham: Springer. https://doi.org/10.1007/978-3-030-00668-6_20.
  8. Ortolja-Baird, Alexandra, and Julianne Nyhan. 2022. ‘Encoding the Haunting of an Object Catalogue: On the Potential of Digital Technologies to Perpetuate or Subvert the Silence and Bias of the Early-Modern Archive’. Digital Scholarship in the Humanities 37 (3): 844–67. https://doi.org/10.1093/llc/fqab065.
  9. Pepping, K., H. Vellinga, M. Kuruppath, L. Van Wissen, and M. Van Rossum. 2023. ‘GLOBALISE Thesaurus - Commodities’. IISH Data Collection. https://hdl.handle.net/10622/YAWDOV.
  10. Petram, Lodewijk, and Matthias van Rossum. 2022. ‘Transforming Historical Research Practices – a Digital Infrastructure for the VOC Archives (GLOBALISE)’. International Journal of Maritime History 34 (3): 494–502. https://doi.org/10.1177/08438714221112873.
  11. Posthumus, Etienne, and Hongxing Zhang. 2023. ‘Chinese Iconography Thesaurus (CIT) Database and Online Service’. Presented at NKOS Consolidated Workshop 2023. Daegu, South Korea, November 9.
  12. Radstok, Wessel, Mel Chekol, and Mirko Schaefer. 2021. ‘Are Knowledge Graph Embedding Models Biased, or Is It the Data That They Are Trained On?’ In Wikidata Workshop 2021 Co-Located with the 20th International Semantic Web Conference (ISWC 2021).
  13. Zhu, Lihong, Amanda Xu, Sai Deng, Greta Heng, and Xiaoli Li. 2023. ‘Entity Management Using Wikidata for Cultural Heritage Information’. Cataloging & Classification Quarterly 61 (1): 20–46. https://doi.org/10.1080/01639374.2023.2188338.
Notes
  1. https://globalise.huygens.knaw.nl/
  2. See https://enslaved.org/; https://unlockingarchives.com/ (Candela et al. 2023); https://chineseiconography.org/ (Posthumus and Zhang 2023).
  3. https://unlockingarchives.com/
  4. https://in-context.sbb.berlin/?lang=en
  5. https://www.capasia.eu/

Brecht Flora Marie Nijman (brecht.nijman@huygens.knaw.nl), KNAW Huygens Institute, Amsterdam, The Netherlands