Conference Agenda
Session
WS06: Evaluating Automated Subject Indexing Methods
Presentations
Evaluating Automated Subject Indexing Methods
German National Library, Germany

Overview
Level of experience for attendees with relevant technologies: intermediate

This workshop will explore theoretical and practical aspects of evaluating automated subject indexing methods. As artificial intelligence and machine learning become increasingly prevalent in information retrieval and subject indexing, it is essential to understand how to evaluate these methods effectively in order to inform choices in system design, ensure quality, and explore possibilities for further improvement. Key questions include: How can I drill down into subject suggestions generated by various methods and determine the strengths and weaknesses of these methods? What benefits do LLM-based methods offer over traditional methods?

This workshop aims to provide participants with a comprehensive understanding of the key metrics, their modes of aggregation, and the dimensions of evaluation that should be considered, as well as hands-on experience with CASIMiR, an R evaluation toolkit newly developed at the German National Library (DNB). We will examine the strengths and limitations of various automated subject indexing approaches, including Lexical Matching ([1], [8]), Partitioned Label Trees ([2], [3], [4]), X-Transformer ([5], [6]), and LLM-generated subject terms ([7]). As an example, we will study the application of these algorithms to a test set of German book titles, predicting subject terms from the Integrated Authority File (GND) [9]. Because of its huge size, the GND is a very challenging target vocabulary and offers rich opportunities to study the advantages and disadvantages of the various subject indexing approaches.
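To make "metrics and their modes of aggregation" concrete, the following minimal sketch compares two common ways of aggregating set-based precision, recall, and F1 over a test set: micro-averaging (pooling all counts) and document-averaging (scoring each document, then taking the mean). It is written in Python purely for illustration and is not the CASIMiR API; the function names, document IDs, and subject labels are all hypothetical.

```python
from typing import Dict, Set, Tuple


def prf(tp: int, n_pred: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts; undefined ratios become 0.0."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(
    gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]
) -> Dict[str, Tuple[float, float, float]]:
    """Compare micro-averaged and document-averaged precision/recall/F1."""
    counts = []
    for doc_id in sorted(gold):
        suggested = pred.get(doc_id, set())
        true_positives = len(gold[doc_id] & suggested)
        counts.append((true_positives, len(suggested), len(gold[doc_id])))

    # Micro-average: pool the counts of all documents, then compute the metrics once.
    micro = prf(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )

    # Document-average: compute the metrics per document, then take the mean.
    per_doc = [prf(*c) for c in counts]
    doc_avg = tuple(sum(m[i] for m in per_doc) / len(per_doc) for i in range(3))

    return {"micro": micro, "doc_avg": doc_avg}


# Hypothetical gold-standard and suggested subject terms for two documents.
gold = {"doc1": {"A", "B"}, "doc2": {"C"}}
pred = {"doc1": {"A", "B", "X", "Y"}, "doc2": {"C"}}

scores = evaluate(gold, pred)
print("micro   (P, R, F1):", scores["micro"])
print("doc_avg (P, R, F1):", scores["doc_avg"])
```

In this toy example the two modes disagree on precision: the document with many wrong suggestions dominates the pooled counts, whereas the document average gives both documents equal weight. Differences of this kind are one reason the mode of aggregation needs to be reported alongside the metric.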
The workshop will conclude with an open discussion in which participants can share their perspectives and experience on aspects beyond the quality of subject suggestions that should be factored into an in-depth evaluation: resource requirements, feasibility, open-source availability, etc.

Prerequisites and Preparation
Participants will be provided with:
Planned Outcomes
By the end of this workshop, participants will:
Detailed Timetable
Theory I: 30 minutes
Work-Book 1: 10 minutes
Work-Book 2: 5 minutes
Work-Book 3: (optional, for fast study)
Theory II: 10 minutes
Work-Book 3: 15 minutes
Theory III: 15 minutes
Work-Book 4: 10 minutes
Work-Book 5: (optional, for fast study)
Discussion: 15 minutes
Additional Information about Instructor
The workshop will be led by Maximilian Kähler, Research Software Engineer at the German National Library. Mr Kähler holds degrees in mathematical sciences from the universities of Göttingen, Durham (UK) and Leipzig. After completing his studies, he specialized as a data scientist and research software engineer. He worked at the Federal Institute for Quality Assurance and Transparency in Health Care (IQTIG) in Berlin and the Helmholtz Center for Environmental Science (UFZ) in Leipzig before joining the German National Library (DNB) in October 2021. Kähler is part of the Department for Automatic Indexing and Online Publications and is project lead for a DNB research project that investigates how recent advances in natural language processing and novel machine learning approaches can be used for automated subject indexing.

References
[1] O. Suominen, “Maui Like Lexical Matching,” https://github.com/NatLibFi/Annif/wiki/Backend%3A-MLLM.
[2] O. Suominen, “Omikuji Backend,” https://github.com/NatLibFi/Annif/wiki/Backend%3A-Omikuji.
[3] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma, “Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising,” The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018, pp. 993–1002, Apr. 2018, doi: 10.1145/3178876.3185998.
[4] S. Khandagale, H. Xiao, and R. Babbar, “Bonsai: diverse and shallow trees for extreme multi-label classification,” Mach Learn, vol. 109, no. 11, pp. 2099–2119, Nov. 2020, doi: 10.1007/s10994-020-05888-2.
[5] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, and I. S. Dhillon, “Taming Pretrained Transformers for Extreme Multi-label Text Classification,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, New York, NY, USA: ACM, Aug. 2020, pp. 3163–3171. doi: 10.1145/3394486.3403368.
[6] J. Zhang, W. Chang, H. Yu, and I. S. Dhillon, “Fast Multi-Resolution Transformer Fine-tuning for Extreme Multi-label Text Classification,” Oct. 2021, Accessed: Nov. 08, 2021. [Online]. Available: https://arxiv.org/abs/2110.00685v2
[7] L. Kluge and M. Kähler, “DNB-AI-Project at SemEval-2025 Task 5: An LLM-Ensemble Approach for Automated Subject Indexing,” Apr. 2025, Accessed: May 07, 2025. [Online]. Available: https://arxiv.org/abs/2504.21589v1
[8] O. Medelyan, E. Frank, and I. H. Witten, “Human-competitive tagging using automatic keyphrase extraction,” ACL and AFNLP, pp. 6–7, 2009, doi: 10.5555/3454287.3454810.
[9] Geschäftsstelle der GND-Zentrale an der Deutschen Nationalbibliothek, “Gemeinsame Normdatei,” Accessed: Oct. 11, 2024. [Online]. Available: https://gnd.network/