Natural Language Processing (NLP) has seen increased interest, not least due to the rise of Large Language Models (LLMs) such as ChatGPT. Scholarly work as well as advances in the private sector are, especially in recent years, largely based on models trained on huge datasets and requiring immense processing power (Sharir et al. 2020). Research groups and students who research and implement systems in smaller projects or for their final theses have benefited from, but have also been restricted by, these developments in NLP. This paper presents observations gathered across several bachelor’s and master’s theses from 2020 to 2023.
The theses were supervised by the Chair of Media Informatics at the University of Bamberg. Since no dedicated chair for NLP existed in Bamberg at the time, a wide-ranging set of topics was submitted, including text classification, summarization, and simplification, information extraction and retrieval, research data management, the generation of misinformation, and more. The authors of these theses also represent a diverse group of students, majoring in subjects such as computer science, information systems, and computing in the humanities.
LLMs have changed the NLP landscape since the announcement of ChatGPT in 2022. The capabilities of LLMs are demonstrably impressive in many cases and applications. Yet problems such as hallucinations in the generated texts or a lack of reproducibility persist, and neither advancements through larger models nor the development of competing products such as Bing Chat or Bard have solved them adequately (Augenstein et al. 2023, Sallam 2023). Additionally, running state-of-the-art models is often only possible via paid subscriptions and is therefore prohibitive for students or projects with limited budgets.
Thus, a look at more traditional and sometimes simpler models was not only reasonable but sometimes necessary to comply with resource restrictions. We use the terms traditional or simpler models for, e.g., support vector machines (SVMs) as compared to large neural networks, extractive summarization as compared to abstractive/generative summarization, or even rule-based approaches. Due to the long history of such well-established techniques, documentation as well as instructions for setup, customization to the respective use case, and evaluation are widely available and often of high quality. Our students, who were usually not experts in NLP, needed clear directions as well as understandable models. Traditional models requiring only a few dependencies and no large development environments were well suited for beginners.
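To illustrate how lightweight such approaches can be, the following is a minimal sketch of a frequency-based extractive summarizer in plain Python. It is not taken from any of the theses; the function and scoring scheme are our own illustrative assumptions, and a real system would use a proper sentence tokenizer.

```python
import re
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 3) -> str:
    """Select the sentences with the highest word-frequency scores."""
    # Naive sentence splitting on end-of-sentence punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    # Score each sentence by the summed frequency of its words,
    # normalized by length so long sentences are not favored.
    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    selected = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    # Output the selected sentences in their original order.
    return " ".join(s for s in sentences if s in selected)
```

Such a summarizer requires no trained model and no GPU, which is what made comparable extractive baselines attractive starting points for the students.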
A wide-ranging collection of NLP applications was developed in the supervised theses. Hallucinations were apparent in students’ work on text simplification and on text generation in the application scenario of misinformation (Fruth 2022, Matschat 2022). Both theses used generative models based on the transformer architecture (Vaswani et al. 2017). Grammatical errors were also present in the texts generated in these theses, owing to the generative nature of the models. An even bigger problem arose from the prevalence of the English language in NLP research (Khurana et al. 2023). Multilingual as well as language-specific models targeting other languages exist, but their performance is often inferior to that of their English counterparts (Aumiller et al. 2023). This was noticed in nearly all theses that worked with German texts (Fruth 2022, Matschat 2022, Pulver 2023, Raab 2023). The size of the available training datasets plays a role here, as do the adaptations necessary when customizing an approach and evaluation metrics originally intended for English to the specifics of another language.
Two theses will be presented briefly to showcase specific observations from which we have drawn our conclusions. The first dealt with information extraction from a large database of company information, carried out with an industry partner (Pulver 2023). The goal was to identify each company’s sector and its legal form, e.g., "GmbH" (German for limited liability company), from the company’s official name. After first experimenting with neural nets to extract both kinds of meta-information, a rule-based approach was used to extract the legal form. The variations used to declare the legal form are finite and can thus be captured by rules with higher accuracy than by neural nets, showing the advantages of traditional models in this case; a sketch of such rules follows below.
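The core of such a rule-based extractor can be sketched with a handful of regular expressions. The list of legal forms, their spelling variants, and the function name below are illustrative assumptions on our part and are not taken from the thesis, which covered a larger set of forms.

```python
import re

# Common German legal forms with typical spelling variants (illustrative, not exhaustive).
LEGAL_FORMS = {
    "GmbH & Co. KG": [r"GmbH\s*&\s*Co\.?\s*KG"],
    "GmbH": [r"\bGmbH\b", r"Gesellschaft mit beschränkter Haftung"],
    "AG": [r"\bAG\b", r"Aktiengesellschaft"],
    "KG": [r"\bKG\b", r"Kommanditgesellschaft"],
    "e.V.": [r"\be\.\s?V\."],
}

def extract_legal_form(company_name: str) -> str | None:
    """Return the canonical legal form found in a company name, if any."""
    # Compound forms are checked first so "GmbH & Co. KG" is not matched as plain "GmbH".
    for canonical, patterns in LEGAL_FORMS.items():
        for pattern in patterns:
            if re.search(pattern, company_name, flags=re.IGNORECASE):
                return canonical
    return None

print(extract_legal_form("Musterfirma GmbH & Co. KG"))  # -> "GmbH & Co. KG"
```

Because the set of admissible legal forms is closed, each rule can be inspected and corrected directly, which is exactly the kind of transparency the neural approaches in the thesis could not offer.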
The second thesis focused on text classification for domain-specific language and for domains with little training data, in this case legal and insurance data, again with an industry partner (Raab 2023). Models based on the transformer architecture were used here once more; however, SVMs were evaluated as a point of comparison and matched the performance of the modern models, demonstrating the continued relevance of traditional models even in this text classification use case.
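An SVM baseline of this kind typically amounts to only a few lines. The following sketch uses scikit-learn with TF-IDF features; the concrete pipeline, parameters, and the toy examples are our own assumptions and do not reproduce the domain-specific data or settings of the thesis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative examples; the thesis worked with confidential
# legal and insurance texts that cannot be reproduced here.
train_texts = [
    "The policy covers damage caused by burst pipes.",
    "The insurer refused to pay the claim for the stolen bicycle.",
    "The defendant appealed against the judgment of the regional court.",
    "The contract was declared void by the court.",
]
train_labels = ["insurance", "insurance", "legal", "legal"]

# TF-IDF word features combined with a linear SVM: few dependencies,
# fast to train on a laptop, and easy to explain to students.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

print(model.predict(["The court dismissed the appeal."]))  # most likely ['legal'] on this toy data
```

Unlike fine-tuning a transformer, such a baseline runs in seconds without a GPU, which made it a practical reference point for the comparison reported in the thesis.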