Overview
Level of experience for attendees with relevant technologies: intermediate
This workshop will explore theoretical and practical aspects of evaluating automated subject indexing methods. As artificial intelligence and machine learning become increasingly prevalent in information retrieval and subject indexing, it is essential to understand how to evaluate these methods effectively in order to inform system design choices, ensure quality and identify room for further improvement. Key questions include: How can I drill down into the subject suggestions generated by various methods and determine the strengths and weaknesses of these methods? What benefits do LLM-based methods offer over traditional methods? The workshop aims to give participants a comprehensive understanding of the key metrics, their modes of aggregation and the dimensions of evaluation that should be considered, as well as hands-on experience with CASIMiR, an R evaluation toolkit newly developed at the German National Library (DNB).
We will examine the strengths and limitations of various automated subject indexing approaches, including Lexical Matching ([1], [8]), Partitioned Label Trees ([2], [3], [4]), X-Transformer ([5], [6]), and LLM-generated subject terms ([7]). As an example, we will study the application of these algorithms to a test set of German book titles, predicting subject terms from the Integrated Authority File (GND) [9]. Its sheer size makes the GND a very challenging target vocabulary and offers rich opportunities to study the advantages and disadvantages of the various subject indexing approaches.
The workshop will conclude with an open discussion in which participants can share their perspectives and experience on aspects beyond the quality of subject suggestions that should be factored into an in-depth evaluation, such as resource requirements, feasibility and open-source availability.
Prerequisites and Preparation
- Participants are not expected to understand German. Example book titles and German subject terms will be translated into English
- Participants are expected to have a basic understanding of information retrieval and subject indexing concepts, in particular the basic information retrieval metrics precision, recall and F-score
- Familiarity with R and basic programming concepts is helpful but not required. Participants will be provided with fully functional code examples
- Participants are encouraged to bring their own laptops with R (and preferably also an IDE such as RStudio, Positron or VS Code) installed to work through the provided example notebooks (see below)
- Software and data will be provided at: https://github.com/deutsche-nationalbibliothek/casimir-workshop. Please make sure to follow the installation instructions in advance of the workshop
Participants will be provided with:
- Example datasets with subject term suggestions from various subject indexing methods
- Access to the CASIMiR package (https://github.com/deutsche-nationalbibliothek/casimir)
- Example Quarto notebook(s) that contain the first steps of an analysis with CASIMiR
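As a refresher on the prerequisite metrics, the following minimal Python sketch computes precision, recall and F1 on a toy example and contrasts two common modes of aggregation: averaging the per-document scores versus pooling the counts over all documents. It is independent of CASIMiR and does not reflect its API; the function names, document IDs and subject terms are invented for illustration, and scoring undefined ratios as 0.0 is an assumption of this sketch.

```python
def prf(tp, n_suggested, n_gold):
    """Precision, recall and F1 from raw counts.

    Assumption of this sketch: undefined ratios are scored as 0.0.
    """
    precision = tp / n_suggested if n_suggested else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(gold, suggested):
    """Return per-document scores, their document average, and pooled scores."""
    per_doc = {}
    tp_total = suggested_total = gold_total = 0
    for doc_id, gold_terms in gold.items():
        suggested_terms = suggested.get(doc_id, set())
        tp = len(gold_terms & suggested_terms)  # correctly suggested terms
        per_doc[doc_id] = prf(tp, len(suggested_terms), len(gold_terms))
        tp_total += tp
        suggested_total += len(suggested_terms)
        gold_total += len(gold_terms)

    n_docs = len(per_doc)
    # Document average: every document contributes equally.
    doc_avg = tuple(
        sum(scores[i] for scores in per_doc.values()) / n_docs if n_docs else 0.0
        for i in range(3)
    )
    # Pooled (micro) scores: every suggested term contributes equally.
    micro = prf(tp_total, suggested_total, gold_total)
    return per_doc, doc_avg, micro


# Hypothetical toy data: gold-standard and suggested subject terms per document.
gold = {
    "doc1": {"Machine learning", "Libraries"},
    "doc2": {"Indexing"},
}
suggested = {
    "doc1": {"Machine learning", "Archives"},
    "doc2": {"Indexing", "Cataloguing", "Metadata", "Thesauri"},
}

per_doc, doc_avg, micro = evaluate(gold, suggested)
for doc_id, (p, r, f) in per_doc.items():
    print(f"{doc_id}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print("document average:", tuple(round(x, 3) for x in doc_avg))
print("pooled (micro):  ", tuple(round(x, 3) for x in micro))
```

The two aggregates differ because the second document receives twice as many suggestions as the first: pooling lets it weigh more heavily, whereas the document average treats both documents alike.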
Planned Outcomes
By the end of this workshop, participants will:
- Understand the theoretical aspects to consider when starting an evaluation project,
- Be familiar with some pros and cons of current approaches to automated subject indexing,
- Have hands-on experience with a drill-down analysis in R using the CASIMiR package,
- Be able to apply the knowledge gained to their own evaluation projects.
Detailed Timetable
-----------------------------------------------------------------------------------
Theory I: 30 Minutes
- dataset statistics: why subject indexing is hard
- existing methods for automated indexing
- foundational metrics: set retrieval vs. ranked retrieval
-----------------------------------------------------------------------------------
Work-Book 1: 10 Minutes
- Comparing specific examples of automated indexing results
-----------------------------------------------------------------------------------
Work-Book 2: 5 Minutes
- Computing overall set retrieval metrics
- Comparing results by subject group
-----------------------------------------------------------------------------------
Work-Book 3: (optional, for fast learners)
- Precision-recall curves
- Ranked retrieval metrics
-----------------------------------------------------------------------------------
Theory II: 10 Minutes
-----------------------------------------------------------------------------------
Work-Book 3: 15 Minutes
- Stratifying results by label frequency
- Propensity-scored metrics (optional, for fast learners)
-----------------------------------------------------------------------------------
Theory III: 15 Minutes
- Graded relevance and expert ratings
- Combining multiple methods
-----------------------------------------------------------------------------------
Work-Book 4: 10 Minutes
- Computing graded relevance results
-----------------------------------------------------------------------------------
Work-Book 5: (optional, for fast learners)
-----------------------------------------------------------------------------------
Discussion: 15 Minutes
- Other aspects of evaluation
-----------------------------------------------------------------------------------
Additional Information about Instructor
The workshop will be taught by Maximilian Kähler, Research Software Engineer at the German National Library.
Mr Kähler holds degrees in mathematical sciences from the universities of Göttingen, Durham (UK) and Leipzig. After completing his studies, he specialized as a data scientist and research software engineer. He previously worked at the Federal Institute for Quality Assurance and Transparency in Health Care (IQTIG) in Berlin and the Helmholtz Centre for Environmental Research (UFZ) in Leipzig before joining the German National Library (DNB) in October 2021. Kähler is part of the Department for Automatic Indexing and Online Publications and leads a DNB research project that investigates how recent advances in natural language processing and novel machine learning approaches can be exploited for automated subject indexing.
References
[1] O. Suominen, “Maui Like Lexical Matching,” https://github.com/NatLibFi/Annif/wiki/Backend%3A-MLLM.
[2] O. Suominen, “Omikuji Backend,” https://github.com/NatLibFi/Annif/wiki/Backend%3A-Omikuji.
[3] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma, “Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising,” The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018, pp. 993–1002, Apr. 2018, doi: 10.1145/3178876.3185998.
[4] S. Khandagale, H. Xiao, and R. Babbar, “Bonsai: diverse and shallow trees for extreme multi-label classification,” Mach Learn, vol. 109, no. 11, pp. 2099–2119, Nov. 2020, doi: 10.1007/s10994-020-05888-2.
[5] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, and I. S. Dhillon, “Taming Pretrained Transformers for Extreme Multi-label Text Classification,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, New York, NY, USA: ACM, Aug. 2020, pp. 3163–3171. doi: 10.1145/3394486.3403368.
[6] J. Zhang, W. Chang, H. Yu, and I. S. Dhillon, “Fast Multi-Resolution Transformer Fine-tuning for Extreme Multi-label Text Classification,” Oct. 2021, Accessed: Nov. 08, 2021. [Online]. Available: https://arxiv.org/abs/2110.00685v2
[7] L. Kluge and M. Kähler, “DNB-AI-Project at SemEval-2025 Task 5: An LLM-Ensemble Approach for Automated Subject Indexing,” Apr. 2025, Accessed: May 07, 2025. [Online]. Available: https://arxiv.org/abs/2504.21589v1
[8] O. Medelyan, E. Frank, and I. H. Witten, “Human-competitive tagging using automatic keyphrase extraction,” ACL and AFNLP, pp. 6–7, 2009, doi: 10.5555/3454287.3454810.
[9] Geschäftsstelle der GND-Zentrale an der Deutschen Nationalbibliothek, “Gemeinsame Normdatei.” Accessed: Oct. 11, 2024. [Online]. Available: https://gnd.network/