From “Can’t…” to “Cancún”: Fine-tuning spaCy’s Spanish-Language Transformer Model for Better and More User-Friendly Named Entity Recognition

1. Introduction

As part of HathiTrust Research Center’s two-year, Mellon-funded “Scholar Curated Worksets for Analysis, Reuse, and Dissemination” (SCWAReD) project, we collected, curated, and transformed worksets (i.e., custom collections) of diverse digitized materials in the HathiTrust Digital Library in order to enable further scholarly use. One way the worksets were transformed was through the generation of derived data, including sets of named entities extracted from the full text of each volume in a given workset. The SCWAReD worksets include collections of materials in English and Spanish, and although numerous Named Entity Recognition (NER) options are available for English, the landscape is more limited for Spanish (Cañete et al., 2023). Because our other SCWAReD NER datasets were generated with a spaCy transformer-based NER model (Honnibal & Montani, 2017) – a popular library that produced accurate results for our materials – we hoped to use spaCy’s Spanish NER model to achieve similar results.

However, two key challenges surfaced when applying an off-the-shelf Spanish NER model. First, spaCy’s vanilla Spanish NER implementation (es_core_news) yielded poor results for our NER tasks, likely because it was trained on newspaper data that does not resemble our literary sources; and second, spaCy’s Spanish transformer model (es_dep_news_trf) lacks built-in NER functionality. Transformer models have numerous benefits, including better results for Spanish-language materials (Pan et al., 2023), parity with our pipeline for English-language datasets, and the possibility of fine-tuning on custom datasets for a variety of downstream tasks.
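
To illustrate the second challenge concretely, loading both packages (assuming they have been installed with python -m spacy download) shows that the transformer pipeline ships without an NER component, while the CNN-based news pipeline includes one trained on news text:

    import spacy

    # CNN-based Spanish news pipeline: ships with a pretrained "ner" component,
    # but that component is trained on newspaper text, not literary sources.
    nlp_news = spacy.load("es_core_news_lg")
    print("ner" in nlp_news.pipe_names)  # True

    # Spanish transformer pipeline: no "ner" component out of the box,
    # so entity recognition has to be added and fine-tuned.
    nlp_trf = spacy.load("es_dep_news_trf")
    print("ner" in nlp_trf.pipe_names)   # False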

Our project aligns with the DH2024 theme of “Reinvention & Responsibility,” and in response to the evolving demands of digital humanities research on Spanish-language HathiTrust collections, we reinvent the transformer-based language model pipeline to enhance NER capabilities for Spanish-language materials and make them accessible to the community. We have developed a reusable, open-source pipeline for fine-tuning a custom spaCy transformer NER model. This customization draws on diverse Spanish NER datasets, ensuring a nuanced and contextually rich understanding of named entities in Spanish-language text. The resulting model, packaged as a standard spaCy pipeline, can then be easily integrated into analytic systems for inference.
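
In practice, fine-tuning in spaCy v3 is driven by its config system (the spacy init config and spacy train commands) with a transformer component; the actual configuration lives in the repository cited in section 2. The snippet below is only a minimal, hypothetical sketch of the underlying training API, using a blank Spanish pipeline and a single made-up example to show how an NER component is updated:

    import random

    import spacy
    from spacy.training import Example

    # Minimal, hypothetical sketch of spaCy's NER training API: a blank Spanish
    # pipeline with an "ner" component covering the four entity types used here.
    nlp = spacy.blank("es")
    ner = nlp.add_pipe("ner")
    for label in ("PER", "LOC", "ORG", "MISC"):
        ner.add_label(label)

    # Made-up training item: (text, {"entities": [(start_char, end_char, label)]}).
    TRAIN_DATA = [
        ("Gabriel García Márquez nació en Aracataca.",
         {"entities": [(0, 22, "PER"), (32, 41, "LOC")]}),
    ]
    examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in TRAIN_DATA]

    optimizer = nlp.initialize(lambda: examples)
    for epoch in range(10):
        random.shuffle(examples)
        losses = {}
        for example in examples:
            nlp.update([example], sgd=optimizer, losses=losses)
        print(epoch, losses.get("ner"))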

2. Proposed Datasets and Workflows

For preliminary research, we used three existing NER datasets, primarily based on Wikipedia data, to develop our in-house Spanish (abbreviated here as “ES”) NER extraction model (Nothman et al., 2013; Pan et al., 2017; Tedeschi et al., 2021). These datasets, spanning various contexts, share four common entity types—PER (Person), LOC (Location), ORG (Organization), and MISC (Miscellaneous), the last encompassing named entities not classified under the first three categories. The combination of these three datasets was then strategically partitioned into three segments: training, validation, and testing.
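
spaCy v3 expects its training, validation, and test partitions as serialized DocBin files; a minimal sketch of the kind of conversion involved, with a hypothetical example sentence and placeholder file name, might look like this:

    import spacy
    from spacy.tokens import DocBin

    # Hypothetical slice of the combined corpus:
    # (text, [(start_char, end_char, label), ...]) pairs.
    combined = [
        ("Miguel de Cervantes escribió el Quijote en Madrid.",
         [(0, 19, "PER"), (43, 49, "LOC")]),
    ]

    nlp = spacy.blank("es")
    doc_bin = DocBin()
    for text, entities in combined:
        doc = nlp.make_doc(text)
        spans = [doc.char_span(start, end, label=label) for start, end, label in entities]
        doc.ents = [span for span in spans if span is not None]  # skip misaligned spans
        doc_bin.add(doc)

    # The same conversion would be repeated for the validation and test partitions,
    # writing each partition to its own file (file name here is a placeholder).
    doc_bin.to_disk("train.spacy")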

We used the training subset to fine-tune the spaCy transformer model for the downstream NER task, and we used the validation subset as a control to mitigate overfitting and ensure the robustness of the model. The training process resulted in an open, reusable spaCy transformer model for inference tasks on diverse systems, including within the HTRC Data Capsule. We established a dedicated, rigorously tested repository for extracting named entities from the SCWAReD datasets (Parulian, 2023). In addition to the model, we also published the derived named-entity datasets that resulted from applying this model to the SCWAReD datasets (HathiTrust, 2023).
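
Once packaged, the fine-tuned model loads like any other spaCy pipeline; a minimal inference sketch (with a hypothetical model path and example sentence) is:

    import spacy

    # Hypothetical path to the fine-tuned pipeline produced by the training step.
    nlp = spacy.load("./es_ner_model/model-best")

    doc = nlp("Gabriela Mistral recibió el Premio Nobel en Estocolmo.")
    for ent in doc.ents:
        # Illustrative output, e.g. ("Gabriela Mistral", "PER"), ("Estocolmo", "LOC");
        # actual predictions depend on the trained model.
        print(ent.text, ent.label_)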

3. Evaluation and Summary

For evaluation, we compared the extraction results of the fine-tuned Spanish NER model with spaCy’s built-in es_core_news_lg model on the testing subset. The preliminary results presented in the table below show that the custom Spanish model performs best on all metrics (precision, recall, and F1) across all entity types, with the largest gains over the built-in model for ORG and MISC entities; a sketch of how such a comparison can be reproduced follows the table.

Entity Type | ES NER Custom Model (P / R / F1) | es_core_news_lg (P / R / F1)
PER         | 93.67 / 94.48 / 94.08            | 88.15 / 88.34 / 88.24
LOC         | 92.17 / 93.29 / 92.72            | 84.26 / 86.07 / 85.15
ORG         | 86.79 / 81.06 / 83.83            | 77.56 / 54.25 / 63.84
MISC        | 87.88 / 80.65 / 84.11            | 38.82 / 78.39 / 51.93
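
A comparison of this kind can be reproduced with spaCy’s built-in scoring; the sketch below assumes a gold-annotated test partition serialized as test.spacy and a fine-tuned model saved at a hypothetical local path:

    import spacy
    from spacy.tokens import DocBin
    from spacy.training import Example

    def ner_scores(nlp, test_path="test.spacy"):
        # Load the gold test partition, pair an unannotated copy of each text with
        # its gold document, run the pipeline, and score the predicted entities.
        gold_docs = DocBin().from_disk(test_path).get_docs(nlp.vocab)
        examples = [Example(nlp.make_doc(gold.text), gold) for gold in gold_docs]
        scores = nlp.evaluate(examples)
        # "ents_per_type" holds the per-type precision/recall/F1 reported in the table.
        return {key: scores[key] for key in ("ents_p", "ents_r", "ents_f", "ents_per_type")}

    # Hypothetical path for the fine-tuned model; es_core_news_lg is the built-in baseline.
    print("custom  :", ner_scores(spacy.load("./es_ner_model/model-best")))
    print("built-in:", ner_scores(spacy.load("es_core_news_lg")))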

Additionally, to ensure the reliability of our findings for the SCWAReD datasets, we conducted a qualitative assessment through human evaluation. In this initial iteration of our repurposing approach, we opted for a targeted quality check of small samples, which served as an initial indicator of the model's effectiveness on a new dataset. In the next iteration, we plan a more comprehensive and extensive evaluation that expands these samples and incorporates the newly evaluated sets into the fine-tuning pipeline. This evaluation will help us understand the model's capabilities on the diverse and extensive datasets relevant to our goals, uncover key factors for successful NER, and identify avenues for further improvement in Spanish-language NER. Finally, we hope the improved Spanish NER pipeline will be a more effective tool for working with multilingual book data.

Appendix A

Bibliography
  1. Nothman, J. / Ringland, N. / Radford, W. / Murphy, T. / Curran, J. R. (2013). Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194, 151-175.
  2. Honnibal, M. / Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. https://spaCy.io/.
  3. Pan, X. / Zhang, B. / May, J. / Nothman, J. / Knight, K. / Ji, H. (2017, July). Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1946-1958).
  4. Tedeschi, S. / Maiorca, V. / Campolungo, N. / Cecconi, F. / Navigli, R. (2021, November). WikiNEuRal: Combined neural and knowledge-based silver data creation for multilingual NER. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 2521-2533).
  5. Cañete, J. / Chaperon, G. / Fuentes, R. / Ho, J. H. / Kang, H. / Pérez, J. (2023). Spanish pre-trained BERT model and evaluation data. arXiv preprint arXiv:2308.02976.
  6. Pan, R. / García-Díaz, J. A. / Valencia-García, R. (2023, June). Evaluation of Transformer-Based Models for Punctuation and Capitalization Restoration in Spanish and Portuguese. In International Conference on Applications of Natural Language to Information Systems (pp. 243-256). Cham: Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-35320-8_17.
  7. Parulian, N. N. (2023). SCWAReD NER Pipeline. https://github.com/htrc/scwared-ner-pipelines.
  8. HathiTrust (2023). “Spanish American Fiction in the HathiTrust Digital Library.” Accessed December 8, 2023. https://htrc.github.io/scwared-spanish-american-fiction/.
Nikolaus Nova Parulian (nnp2@illinois.edu), Ryan Dubnicek (rdubnic2@illinois.edu), Sarah Griebel (sarahg8@illinois.edu), Glen Layne-Worthey (gworthey@illinois.edu), and J. Stephen Downie (jdownie@illinois.edu), University of Illinois Urbana-Champaign, United States of America