Conference Agenda
Session
SP04: Short papers
Presentations
'Distrusting, techno-pessimistic and conservative'. National Libraries and the Politics of AI Training Data

KB, National Library of the Netherlands, The Netherlands

In January 2024, the National Library of the Netherlands (KB) issued a statement restricting the reuse of its digitised collections for the training of commercial language models by AI companies. The decision was prompted by Dutch data journalists, who revealed that one of the KB's digital platforms, containing thousands of digitised publications in the field of Dutch language and literature, many of them under copyright, had become a key source of training data for OpenAI. The revelation sparked both internal reflection and external debate on the library's role in the rapidly evolving AI data ecosystem.

This moment marked a turning point for the KB: it came to understand that its position in the AI landscape is not limited to leveraging AI as an institutional tool, but also encompasses its role as a provider of training data. This duality has raised significant legal, ethical, and operational questions. While heritage institutions have traditionally championed open access as part of their public mission, they now face increasingly complex decisions about how, and to whom, such access should be provided, especially when commercial AI development is involved.

In the case of the KB, this dilemma is shaped by several contextual factors. The Netherlands lacks a legal deposit system, so the KB has long relied on trust-based agreements with publishers, authors, and rights holders to acquire, digitise, and disseminate in-copyright publications. Many of these agreements were established long before the rise of AI and machine learning, often with individual creators who had public access for human readers in mind, not computational processing by opaque corporate systems. These creators trusted that the KB would act as a responsible intermediary, safeguarding both access and integrity.
This trust-based model has not only enabled access but also allowed the KB to play a pioneering role in digitisation efforts. Because of these relationships built on mutual trust, the KB has often been able to realise projects well before there was legal infrastructure to support them. For example, it concluded a collective agreement with collective management organisations (CMOs) ten years before extended collective licensing (ECL) was formally enshrined in Dutch law. As a result, the KB was already able, fifteen years ago, to make large-scale collections of 20th-century newspapers available online, and to provide them as data for research purposes. This demonstrates that, in the absence of formal legal mandates, a cooperative approach with stakeholders can serve as a powerful enabler of public access and innovation.

In addition to the statement, the KB revised its terms of use, blocked the OpenAI crawler via its robots.txt file, and initiated broader discussions with stakeholders, including CMOs and publishers. However, these actions drew criticism from multiple sides. Creators accused the library of insufficiently protecting their work. Scholars argued that public institutions should not be the arbiters of access: 'CHIs and research libraries are not in the position to define who should have access and to decide to impose restrictions on access' (Lehmann and Sichani). At the same time, fellow heritage professionals questioned whether limiting access to cultural heritage datasets conflicted with the public mission of such institutions: 'Isn't contributing trustworthy qualitative information and fighting misinformation and bias (in algorithms) more in line with their objectives?' (Matas). The KB was described as exhibiting 'techno-pessimism' and 'a desire to control their curated content, a conservative stance that can clash with the mission to make content publicly accessible' (Lazarova and Luth). Despite these tensions, the KB remains committed to openness.
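The robots.txt measure mentioned above can be illustrated with the directive that excludes OpenAI's documented GPTBot crawler (a generic example, not a reproduction of the KB's actual file):

```text
User-agent: GPTBot
Disallow: /
```

Crawlers that honour the Robots Exclusion Protocol will skip the entire site under this rule; the mechanism offers no technical enforcement against crawlers that choose to ignore it, which is why such directives are typically paired with revised terms of use.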
The institution acknowledges its foundational role not just as an intermediary between the public and information, but also between creators and information systems. As such, the library continues to publish bulk downloads of public domain works, provides API access to digitised collections (under regulated conditions), and has initiated a national project to develop a secure virtual research environment. This environment is designed to enable responsible text and data mining (TDM) of in-copyright and sensitive collections that cannot legally be shared openly. At the same time, the KB actively facilitates conversations between Dutch AI developers, publishers and authors, serving as an intersectoral mediator.

In this presentation, we will critically reflect on the KB's policy development over the past year, highlighting the practical, legal, and ideological challenges faced along the way. We will share lessons learned in navigating a landscape where digital collections are simultaneously cultural heritage, research infrastructure, and commercial resource. Our case study contributes to a timely conversation on the responsibilities of heritage institutions, and national libraries in particular, in the machine learning age. Specifically, we ask: how can institutions balance their public mission with the interests of creators, the imperatives of technological progress, and the demands of public trust? Our goal is to foster a more nuanced understanding of how cultural institutions can adapt to their new role as gatekeepers of data in the age of AI.

References

The Europeana Public Domain Charter. Version 2, 2025. Available at: https://pro.europeana.eu/post/the-europeana-public-domain-charter

Hofman, E. and Veerbeek, J. (2023, June 7). 'Dat zijn toch al gewoon ál onze artikelen? De bronnen van ChatGPT', Groene Amsterdammer, 23.

Kleppe, M. (2024, January 9). Statement on commercial generative AI (KB – National Library of the Netherlands).
Available at: https://www.kb.nl/en/ai-statement

Lehmann, J. and Sichani, A.-M. (2025). 'A Position Paper on AI and Copyrights in Cultural Heritage and Research (EU and UK)', Journal of Open Humanities Data, 11(1), p. 25. Available at: https://doi.org/10.5334/johd.290

Lazarova, A. and Luth, E. (2024). 'To Mine or Not to Mine: Knowledge Custodians Managing Access to Information in the Age of AI', Stockholm IP Law Review, 2, 45–54. Available at: https://doi.org/10.53292/33313cc8.44e622dc

Matas, A. (2024, August 22). AI 'opt-outs': Should Cultural Heritage Institutions (dis)allow the Mining of Cultural Heritage Data? Available at: https://pro.europeana.eu/post/ai-opt-outs-should-cultural-heritage-institutions-dis-allow-the-mining-of-cultural-heritage-data

Zeinstra, M. and Loef, M. (2025). Kennisdocument AI en erfgoed [Knowledge document on AI and heritage] (Koninklijke Vereniging Archiefsector Nederland en Netwerk Digitaal Erfgoed Nederland). Available at: https://www.kvan.nl/nieuws/kennisdocument-ai-en-erfgoed-gepubliceerd/

AInTheCards: Surfacing Environmental and Cultural History from Observational Records via VLMs and a Web-Based Review Interface

Stanford University Libraries, United States of America

Several deep learning-based AI systems for working with text and images have emerged within just the past year or two that greatly enhance the level of assistance such technologies can provide to librarians and archivists in accomplishing semi-complex document processing tasks. These developments raise the possibility of using such methods to surface the contents of digitized but largely untranscribed collections comprising hundreds to many thousands of records, including those of such scale that human effort alone may not suffice to process them unassisted (Brohan 2025).
Smaller but capable "open-weight" models that can run locally on relatively affordable hardware may be especially appealing to libraries and archives, given that they can perform the same task repeatedly for days, weeks or months at a time without incurring significant further costs. This is in contrast to proprietary, commercial "frontier" models, which may require untenable amounts of cloud "credits" to process such collections and may also take a cavalier attitude towards the privacy of the data sent to them.

Although the developments in AI described above may raise hopes of instantiating an unflaggingly productive digital assistant for the description and transcription of digitized archival materials, the reality is that adopting a "prompt and pray" strategy, in which such models are applied to a collection with minimal oversight, is unlikely to achieve the desired level of accuracy. Rather, a more promising strategy is to provide interfaces that position humans "in the loop" to review, correct and approve the models' results, and subsequently to suggest instructions that ameliorate the models' undesirable tendencies (Portinari Maranca et al. 2025). Some approaches might employ the humans in the loop to provide "gold standard" training data to fine-tune a model for a specific task or collection via transfer learning, or even to train the model iteratively via reinforcement learning with human feedback. Increasingly, however, it is sufficient for the human interventions to involve simply adjusting the prompts that are provided to the models (sometimes referred to as "test-time training").
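The prompt-adjustment loop described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `transcribe` is a hypothetical stand-in for a local VLM call, stubbed here, and the prompt wording is invented.

```python
# Human-in-the-loop "test-time training": reviewer corrections are
# folded back into the prompt used for subsequent records, without
# any retraining of the model itself.

def transcribe(prompt: str, image: str) -> str:
    """Stub standing in for a local vision-language model call."""
    # A real system would send `prompt` and `image` to the model.
    return "3/15/42 | depth: 12 ft | Pisaster ochraceus"

def refine_prompt(base_prompt: str, corrections: list) -> str:
    """Fold accumulated reviewer feedback into the working prompt."""
    if not corrections:
        return base_prompt
    notes = "\n".join("- " + c for c in corrections)
    return base_prompt + "\n\nReviewer corrections to respect:\n" + notes

prompt = "Transcribe every labeled field on the record card as 'field: value'."
corrections = []

# Round 1: run the model with the base prompt; a human reviews the output.
first_pass = transcribe(refine_prompt(prompt, corrections), "card_001.tif")

# The reviewer flags a recurring error; the corrective instruction is
# appended and shapes the model's behaviour on all later cards.
corrections.append("Write all dates as YYYY-MM-DD.")
revised_prompt = refine_prompt(prompt, corrections)
```

The appeal of this design is that each human intervention is cheap (one sentence of feedback) yet applies to every subsequent record in the batch.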
This short presentation will relate the results of implementing such a system, applying recent open-weight vision-language models (VLMs) including Gemma 3 and Qwen3-VL to transcribe collections of digitized record cards containing observations of ecological and cultural phenomena across substantial spans of time, in concert with a web-based interactive user interface for human review of the results. The observations to be parsed and rendered into data are written or, more commonly, typed into fields on the cards. Previous optical character recognition technologies have not been able to make such contents accessible, because doing so involves not only transcribing the texts but also deciphering the layouts of the labeled fields, accommodating variable positioning and contents of the typed or handwritten data entries, and associating the transcribed readings with their corresponding data fields. These are tasks that the latest VLMs can accomplish much more capably, although, as noted above, the process still needs to incorporate UX affordances for rapid and effective human oversight of the outputs.

The initial collection of interest contains more than 800 survey cards developed and recorded by the legendary marine scientist Ed Ricketts at the Pacific Biological Laboratories in Monterey, California, beginning in the early 1940s. These observations were central to a massive effort to catalog the observed locations, depths, and habitat types of intertidal species from southern Alaska to northern Mexico; participants in survey voyages with Ricketts included the author John Steinbeck. The approach we prototype with these materials can then be applied to other collections, including a set of handwritten records on printed templates from marine surveys compiled by students and faculty researchers on the west coast of the United States in the 1960s and 1970s.
Activating observations pertaining to the same ecological region from across such a span of time, and subsequently cross-referencing them with contemporary records, could be a major contribution to longitudinal studies of biodiversity, climate and other pressing environmental issues. In addition to such materials, we will describe efforts to surface the contents of a large number of archaeological record cards, a process that is currently awaiting a review to exclude potentially sensitive materials – work which, needless to say, must be conducted by human experts.

In our workflow, the VLM outputs structured data containing the source fields and transcribed data entries, which are then stored in an interim database. The web-based review interface subsequently extracts and presents the transcriptions alongside the original images of the record cards, enabling efficient review and revision of the extracted data. This process also enables the gradual assembly of a set of verified correct "gold standard" transcriptions of card contents, which can then be used to evaluate the efficacy of different models, settings, prompt instructions, and recently emergent best practices for employing VLMs and LLMs as assistants or "agents." For example, we can test approaches such as prompt chaining (prompting the model to run OCR on a card image, then feeding this output into a second prompt instructing the model to assign the OCR results to the expected data fields, again conditioned on the card image), or specifying expected ranges of values for particular fields like dates and depth measurements. Further, we can prompt the models to provide numerical assessments of the quality of their own transcriptions; as these self-evaluations tend to correlate well with human evaluations (Angelopoulos et al. 2023), we can prioritize the transcriptions with the lowest self-reported "confidence" scores for human review via the web-based interface.
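The two ideas just described, prompt chaining and confidence-based triage, can be sketched as follows. `call_vlm` is a hypothetical stand-in for querying a model such as Gemma 3 or Qwen3-VL, stubbed here so the control flow can be shown; the prompts and field names are illustrative.

```python
import json

def call_vlm(prompt: str, image: str) -> str:
    """Stub for a vision-language model call returning JSON text."""
    return json.dumps({"date": "1942-03-15", "depth_ft": 12, "confidence": 0.62})

def transcribe_card(image: str) -> dict:
    # Step 1 of the chain: raw OCR of the card image.
    ocr_text = call_vlm("Read all text visible on this record card.", image)
    # Step 2: assign the OCR output to the expected labeled fields,
    # conditioned on the same image, with a 0-1 self-confidence score.
    fields_json = call_vlm(
        "Assign this OCR output to the card's labeled fields as JSON, "
        "including a 'confidence' score between 0 and 1:\n" + ocr_text,
        image,
    )
    return json.loads(fields_json)

def review_queue(images: list) -> list:
    """Order transcriptions lowest-confidence first for human review."""
    results = [(img, transcribe_card(img)) for img in images]
    return sorted(results, key=lambda r: r[1].get("confidence", 0.0))
```

Sorting by self-reported confidence means reviewer effort concentrates on the cards the model itself flags as doubtful, which is where corrections are most likely to be needed.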
The experiences and conclusions derived from the activities described above can inform future efforts to process further large-scale observational record collections via VLMs and human review interfaces.

References (selected)

Angelopoulos, A. N., et al. (2023). "Prediction-powered inference." Science 382, 669–674. DOI: 10.1126/science.adi6000.

Brohan, P. (2025). "AI Data Rescue: Daily Precipitation." https://brohan.org/AI_daily_precip/index.html. Accessed May 2025.

Portinari Maranca, R. A., et al. (2025). "Correcting the Measurement Errors of AI-Assisted Labeling in Image Analysis Using Design-Based Supervised Learning." Sociological Methods & Research. DOI: 10.1177/00491241251333372.

From Microfilm to Metadata: AI-Powered Indexing of 139 Years of a Student Newspaper

University of Kansas, United States of America

This presentation will demonstrate how AI tools (large language models (LLMs)/generative AI, machine learning, and neural networks) can transform access to historical collections at unprecedented scale. The University Daily Kansan (UDK) collection exemplifies the challenge facing cultural heritage institutions (large, rich collections locked behind analog access barriers, with staff lacking resources for comprehensive cataloging) and offers a practical, meaningful example of how to address it.

The UDK is the student newspaper of the University of Kansas (KU). The University Archives at KU holds a nearly complete run of this publication, covering the years 1878-2017 (including several pre-1904 title variations) and comprising approximately 150,000 pages. Now produced exclusively as an online publication, the UDK ceased print operations in 2020 with the COVID-19 pandemic. The physical archive of this newspaper had been converted to microfilm in several stages, beginning in the 1970s.
Only accessible in the University Archives' reading room, this collection of 174 reels has long been the most popular microfilm collection for researchers at the KU Libraries. However, years of use have resulted in physical damage to many of the reels, and accessible metadata was virtually non-existent for the microfilm collection. Individual reels were labeled with the included date range, and researchers could consult with the University Archivist to search an internal Access database containing inconsistent descriptive metadata for about 15% of the collection.

In 2023, in order to increase accessibility for its users, the Libraries began to explore a project to create digital copies of the images from the master negative microfilm reels held by the Kansas State Historical Society (KSHS). After completion of the digitization phase, the Digital Initiatives Librarian began using AI tools and methodologies to explore a variety of avenues for improving access and metadata for the collection. This presentation will describe the results of various approaches that were tested during the process, focusing on the use of AI tools to generate descriptive metadata. It will compare paid cloud-based LLM API access (Claude AI), locally run open-source models (Mistral, Qwen, Gemma), and cloud-hosted open-source models (GLM-V) on the National Research Platform (NRP), which provides shared resource access to member institutions. Comparative testing of these multiple AI models provided practical guidance for selecting appropriate tools for this and similar large digital projects. We used these multimodal LLMs, in combination with layout analysis neural networks, to generate extensive metadata including article titles and content summaries, subject tags, advertisement inventories, image identification, bounding boxes for improved IIIF access, and other features.
Rather than relying on OCR preprocessing, which introduced its own errors and complications, we fed the high-resolution TIFF files directly to the models. Through carefully crafted prompts that position the AI as a scholarly expert in academic newspapers, the process was automated to output JSON-formatted results. This metadata underwent a sample-based quality control process using a web interface built with Streamlit, giving library stakeholders ample input into the results of the various processes. This iterative and collaborative approach allows the tools to augment rather than replace human expertise within our library.

Following the QC and testing phases, the full collection was processed using NRP's GPU and distributed computing infrastructure. The metadata was integrated into the collection at the page level, as well as being made available as a publicly accessible dataset, with disclaimers noting the use of AI in metadata creation.

This short presentation will detail the technical methodology, compare preliminary and final results, and discuss lessons learned from processing materials that span 139 years of publication. Key challenges include managing diverse historical layouts, establishing reliable and scalable quality metrics for AI-generated metadata, and scaling computational resources sustainably within institutional constraints. In addition to the metadata itself, important deliverables include reusable Jupyter notebooks, Python scripts, and documentation, which have already undergone preliminary testing by a KU faculty member working with a historical social/political journal. This comprehensive AI-powered indexing project provides an example of a practical and relatively low-barrier application, demonstrating how modern AI tools can unlock previously inaccessible archival materials while maintaining rigorous scholarly standards and institutional best practices.
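The JSON-output pattern described above can be sketched as a prompt plus a validation step that runs before results enter the database. The prompt wording and field names here are illustrative, not the project's actual prompt or schema.

```python
import json

# Hypothetical prompt framing the model as a newspaper expert and
# demanding machine-readable JSON rather than free text.
PROMPT = (
    "You are a scholarly expert in academic newspapers. For the attached "
    "page image, return ONLY valid JSON with these keys: 'articles' (list "
    "of objects with 'title', 'summary', 'subjects'), 'advertisements' "
    "(list of strings), and 'images' (list of objects with 'description' "
    "and 'bbox')."
)

def parse_page_metadata(raw_response: str) -> dict:
    """Parse the model's response and check the expected keys exist."""
    data = json.loads(raw_response)
    for key in ("articles", "advertisements", "images"):
        if key not in data:
            raise ValueError("missing required key: " + key)
    return data

# A mock model response that passes validation:
mock_response = (
    '{"articles": [{"title": "Jayhawks Win", "summary": "Game recap.", '
    '"subjects": ["sports"]}], "advertisements": [], "images": []}'
)
metadata = parse_page_metadata(mock_response)
```

Rejecting malformed responses at this boundary keeps occasional model failures out of the downstream dataset and makes them easy to re-queue for another pass.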
Ultimately, the project transformed a collection previously accessible only to onsite researchers into a fully searchable public resource, while creating sustainable workflows that complement existing institutional capacity. This project demonstrates a measured approach to AI integration that balances potential benefits with human intervention. The methodology offers a low-cost model for improving access to archival collections worldwide.

Wildflowers as Benchmarks: Creating a LAM-Specific LLM Evaluation Using Historical Botanical Text

Smithsonian Institution, United States of America

The rapid advancement of large language models has created significant opportunities for transforming workflows in library, archive, and museum (LAM) institutions. While general LLM evaluation benchmarks are useful for comparing and selecting models, existing benchmarks predominantly focus on general language tasks or commercial applications [1], failing to capture the specialized knowledge, cultural sensitivity, and domain-specific reasoning required for LAM contexts. This gap between general LLM evaluation and LAM-specific needs has left practitioners without reliable methods for assessing whether these powerful tools can meet their specialized requirements while maintaining the accuracy and cultural sensitivity that their collections and communities demand.
One promising approach to addressing this evaluation gap is through narrow, domain-specific information extraction tasks that test an LLM's ability to accurately parse and structure specialized content. Recent work by Derek Willis [2] demonstrating targeted evaluation through information extraction from fundraising emails—where models extract specific fields like donation amounts, donor information, and campaign details—illustrates how focused, task-specific benchmarks can provide more meaningful assessments of LLM performance in specialized domains than broad, general-purpose evaluations.
In this project, we create a new information extraction benchmark dataset for the LAM domain, using the open access "Wild Flowers of North America" as our basis. "Wild Flowers of North America" [3] is a five-volume set of botanical illustrations published in 1925 by Mary Vaux Walcott, known as "the Audubon of Botany". The Smithsonian-published work is available as open access and contains beautiful illustrations and descriptive free text, much like the holdings of many museums, archives, and libraries. The volumes contain 400 textual descriptions of wildflower species, each of which includes geographic range, physical descriptions of the plants, and even personal descriptions of the specimens that were illustrated. But because the descriptions were written in narrative format, it is difficult to consistently parse out these important pieces of information.
In order to construct this dataset, we first produced a manually validated text version of the automatically OCRed volume. We then coordinated a team of colleagues to create standard guidelines and manually parse out the different components of each text description. Next, we iterated on a standard prompt that we could use to compare this information extraction task across many different LLMs. Finally, we built an automated framework to run the evaluation on several LLMs (both commercial and open source) and score their accuracy.
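The final scoring step can be sketched as per-field exact-match accuracy against the manually parsed gold standard. The field names below are illustrative, not the dataset's actual schema.

```python
# Compare each model's extracted fields against the gold standard and
# compute exact-match accuracy per field across all descriptions.

def field_accuracy(gold, predicted):
    """Per-field exact-match accuracy over paired gold/predicted records."""
    scores = {}
    for field in gold[0]:
        correct = sum(1 for g, p in zip(gold, predicted) if g[field] == p.get(field))
        scores[field] = correct / len(gold)
    return scores

gold = [
    {"species": "Trillium grandiflorum", "range": "Quebec to Georgia"},
    {"species": "Castilleja miniata", "range": "Alaska to California"},
]
predicted = [
    {"species": "Trillium grandiflorum", "range": "Quebec to Florida"},
    {"species": "Castilleja miniata", "range": "Alaska to California"},
]
scores = field_accuracy(gold, predicted)  # species: 1.0, range: 0.5
```

Running the same function over each model's output yields a per-field leaderboard, which is more diagnostic than a single aggregate score: a model may parse species names perfectly while garbling geographic ranges.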
This open access dataset can be used to teach LAM practitioners how to evaluate LLMs. In addition, the dataset construction process can serve as a roadmap for other LAM institutions to create their own domain-specific evaluation datasets for additional areas like audio transcription, image description, and catalog metadata generation.
[1] Chang, Y., Wang, X., Wang, J., et al. (2023). A Survey on Evaluation of Large Language Models. ACM Transactions on Intelligent Systems and Technology, 15(3), 1-45. https://dl.acm.org/doi/full/10.1145/3641289

[2] Willis, Derek. "LLM Extraction Challenge: Fundraising Emails." The Scoop, January 27, 2025. https://thescoop.org/archives/2025/01/27/llm-extraction-challenge-fundraising-emails/index.html

[3] Walcott, Mary Vaux. North American Wild Flowers. Smithsonian Institution, 1925. https://doi.org/10.5962/bhl.title.67774

Unlocking rare collections in Trove

National Library of Australia, Australia

The aim? To leverage AI technology to enable archival research through Trove anywhere, anytime. Like many organisations, the National Library of Australia has long been able to deliver text searching for its digitised newspaper collections, gazettes, magazines and almanacs, which continues to provide value for a wide range of audiences. But what about similar search and discovery for rare archival collections? For the National Library and its many audiences, this was an unfulfilled ambition until it embarked upon its Handwritten Text Recognition program in 2024, converting images of handwritten, typed and printed text into editable and searchable transcriptions.

This presentation maps the Library's journey to unlock its rare manuscript collections, enabling new insights and meaning to be found, and connecting previously disparate content in new and exciting ways. It showcases how the Library has leveraged AI to enhance a core part of its business: enabling meaningful engagement with its collections. It will unpack how the Library turned a proof of concept of 10,000 images into an embedded workflow that each year delivers over 1 million pages of automated transcriptions of the Library's digitised manuscript collections into Trove.
This innovation has enabled the Library to embark upon an ambitious project to retrospectively apply handwritten text recognition technology to its previously digitised archival materials. And, most importantly, it has been embedded as a routine part of the Library's digitisation program, with handwritten text recognition applied at the point new manuscript materials are digitised.

The impact of this innovation for the cultural and research sectors cannot be overstated: through keyword searching, the Library's manuscript collections have been unlocked, allowing new meaning and insights to be discovered, all integrated into Trove. Trove users can search content by keywords as well as by item titles and descriptions. Even better, these outputs can be enhanced and refined by users as they engage with the content.

This presentation unpacks the Library's strategy for delivering this innovation, applying a careful and measured approach to leveraging AI technologies: amplifying its mission of access and engagement, while retaining audience trust. It demonstrates, in a clear and practical way, how this program of work embodies the organisation's Artificial Intelligence Framework. The framework, released earlier this year, includes as a fundamental principle using AI to extend the Library's practices in support of its statutory role of maintaining, developing and protecting a national collection of library material so audiences can discover, learn and create new knowledge, now and in the future. The presentation also delves into another critical part of the program's success: incorporating handwritten text recognition into a scalable business process, from technical enhancements to business systems, to enhanced internal and external workflows, to investment and support across the organisation.
This business-driven approach has provided the foundation for the success of the program and has helped define what AI looks like in practice for the National Library of Australia. This pragmatic and considered approach to the use of AI has been especially important in the context of Trove and the Library's collections. The Library holds over 10,000 manuscript collections, varying from single books and diaries to hundreds of boxes, with over 9 million pages from these collections already digitised.

AI technologies present enormous opportunities for the cultural sector, often with an equal measure of challenges. While both sides of this equation have been in play in the delivery of this innovation, its success has been underpinned by people, processes and, just as importantly, smart and responsible adoption of AI technologies. From its inception through to delivery, this program has been driven by providing better experiences for audiences and enabling them to more deeply engage with the breadth of the Library's collections anywhere, anytime through Trove.