Evaluation Everywhere, But Not All At Once
Glen Berman1, James Smithies1, Karaitiana Taiuru2, John Moore3, Barbara McGillivray4, Martin Spychal5
1Australian National University, Australia; 2Taiuru and Associates Ltd; 3The National Archives (UK); 4King's College London; 5History of Parliament Trust
As the Fantastic Futures 2025 theme indicates, the spectre of AI is suddenly everywhere in the GLAM sector. AI-branded software products are being released at pace, often with claims that they enable radical new productivity gains and new modes of interacting with digital media. Publicly funded GLAM institutions, which operate under resource constraints and face considerable pressure to demonstrate value to taxpayers, need to show strategic and proactive engagement with AI – both as a tool to be enrolled in the day-to-day work of GLAM practitioners, and as a phenomenon that GLAM institutions can help the broader public understand and interact with. In this context, an urgent question for GLAM institutions is: how best to determine which AI technologies or products to invest in, and how to deploy and maintain them?
The AI as Infrastructure (AIINFRA) project, launched at Fantastic Futures 2024, explores how AI can be responsibly and effectively integrated into GLAM infrastructures through a multinational collaboration between researchers and practitioners in Australia, the UK, and Aotearoa (New Zealand). AIINFRA is based at the Australian National University, with collaborating colleagues at King’s College London, The National Archives (UK), the UK History of Parliament project, the National Library of Australia, the Australian Parliamentary Library, and the Aotearoa Department of Internal Affairs. Our strategy for supporting GLAM institutions in evaluating AI technologies is twofold. First, we are developing ATLAS, a prototype designed both to test whether open-source development remains feasible in an age of AI and to enable GLAM practitioners to compare different AI language products and models while controlling for a wide range of variables. ATLAS is open source (https://github.com/AI-as-Infrastructure/aiinfra-atlas/), with the current version supporting experimentation across different AI models, system prompts, and retrieval augmented generation settings. Second, we are developing a holistic AI evaluation framework to enable GLAM institutions to build in-house capacity to experiment with and evaluate AI products, both through use of ATLAS and through the development of their own testing environments. To support this work, in 2025 we have conducted semi-structured interviews with GLAM researchers and practitioners across Australia, Europe, and North America to explore their existing AI evaluation practices and potential AI use cases, and we have hosted workshops with GLAM institutions in Aotearoa, Australia, and the UK focused on gathering feedback on ATLAS and co-developing our draft evaluation framework.
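The abstract does not spell out ATLAS's internal interfaces, so the sketch below only illustrates the kind of controlled comparison it describes: holding a test question set fixed while varying the model, system prompt, and retrieval-augmented generation settings one combination at a time. The `TrialConfig` fields and the `run_query` stub are hypothetical stand-ins, not part of the ATLAS codebase.

```python
from dataclasses import dataclass
from itertools import product
import csv

@dataclass
class TrialConfig:
    model: str            # identifier of the model under test (hypothetical)
    system_prompt: str    # framing instructions given to the model
    retrieval_top_k: int  # number of passages retrieved for RAG

def run_query(config: TrialConfig, question: str) -> str:
    """Hypothetical stand-in: a real harness would call the selected model
    with the given system prompt and retrieval settings and return its answer."""
    return f"[{config.model} | top_k={config.retrieval_top_k}] answer to: {question}"

questions = ["When did the debate on the Electoral Act take place?"]
models = ["model-a", "model-b"]
prompts = ["You are a cautious archival assistant.", "Answer concisely."]
top_ks = [3, 10]

with open("trials.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "system_prompt", "retrieval_top_k", "question", "answer"])
    for model, prompt, k in product(models, prompts, top_ks):
        cfg = TrialConfig(model, prompt, k)
        for q in questions:
            writer.writerow([cfg.model, cfg.system_prompt, cfg.retrieval_top_k, q, run_query(cfg, q)])
```

Logging every combination against a fixed question set keeps the comparison auditable, which is the kind of repeatable, in-house experimentation the project aims to support.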
Our approach to enabling GLAM institutions, although embracing traditions of technical experimentation common to Digital Humanities fields, reflects an understanding of AI evaluation that is broader than the technical, benchmark-driven approach adopted in mainstream AI discourse. For GLAM institutions, evaluating which AI technologies or products to adopt is not just a question of determining which AI models are most accurate or most cost-effective, but also a question of evaluating how AI products align with the social, cultural, political, and environmental commitments of the GLAM sector, including existing data use standards such as the FAIR and CARE principles. This implies a need for AI imaginaries: to determine which AI technologies to invest in, GLAM institutions must not only evaluate AI products as they appear today, but also imagine how AI technologies may continue to develop and how public expectations of AI may continue to change. Our design philosophy for ATLAS prioritizes this creative attitude, presenting a model for testing and scholarly activity that aligns with GLAM and research sector traditions even if developing it for production purposes would be infeasible. Our claim is that the challenge facing GLAM institutions is fundamentally one of empowerment and capacity building: GLAM institutions need to build internal resources and expertise to enable ongoing critical engagement with, and evaluation of, AI technologies and products on their own terms and in ways that can account for our cultural values. Yet there is scant empirical data on existing AI evaluation practices in the GLAM sector, and no holistic AI evaluation framework exists for the sector. The findings from our 2025 workshops and interviews begin to address this gap, while the ATLAS prototype provides a proof of concept for what critical engagement with AI technologies within GLAM sector constraints may look like.
In this short presentation, we will introduce our ATLAS prototype and share our early findings with the AI4LAM community. These findings highlight the breadth of approaches GLAM institutions are taking to conceptualizing potential AI use cases and evaluating AI technologies. They also demonstrate the need for an infrastructural approach to AI evaluation, in which evaluation is understood as an ongoing and open-ended task and AI models are treated as interoperable and steadily evolving. Finally, our findings highlight the opportunity to leverage existing GLAM sector areas of expertise, particularly in responsible stewardship of data and public engagement, to improve AI evaluation practices more generally. To ensure these findings are actionable and can inform future research, we conclude our presentation with an early iteration of our evaluation framework, including a set of guiding principles that the AIINFRA project has developed and is now validating, which offer GLAM practitioners a structured approach to critical experimentation with AI technologies.
Unlocking Transcultural Modernist Artists’ Networks with Natural Language Processing
Maribel Hidalgo Urbaneja1,2
1University of the Arts London, United Kingdom; 2Carleton University, Canada
Harvesting data about artists' lives and trajectories and establishing meaningful points of connection between these artists is a key element of Mobile Subjects, Contrapuntal Modernisms (1900-1989), a project that investigates the circulation of artists from the decolonizing world through the colonial and artistic capitals of London and Paris. Most accounts of Modernism have focused on the North Atlantic. Despite significant research in the past 25 years on modern art in Asia, Africa, and Latin America, there has been no systematic attempt to understand how these multiple narratives intersect with, and challenge rather than merely supplement, Eurocentric histories. The project draws from critical archival studies to identify colonial systems, structures, and relationships in archival source materials and documentary practices.
The project's event-based database links artists through temporally and spatially defined “events,” such as exhibitions and periods of education and training at art schools and academies. It defines the identities of artists by encompassing aspects such as citizenship, gender, social class, ethnicity, political affiliations, languages spoken, and membership in artistic groups, as well as the techniques and media used in their work. It captures nuances and variability in the data, such as the names of people, organisations, and events in different languages and scripts.
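As a rough illustration of what such an event-based model might look like in code (a sketch only; the project's actual schema, field names, and controlled vocabularies are not described in this abstract), artists and events could be represented as linked records, with name variants keyed by language or script:

```python
from dataclasses import dataclass, field

@dataclass
class Artist:
    # Name variants keyed by language/script code, e.g. {"en": "...", "ar": "..."}.
    names: dict[str, str]
    citizenships: list[str] = field(default_factory=list)
    gender: str | None = None
    languages_spoken: list[str] = field(default_factory=list)
    group_memberships: list[str] = field(default_factory=list)

@dataclass
class Event:
    # An event is temporally and spatially bounded, e.g. an exhibition or a
    # period of training at an art school or academy.
    kind: str                 # "exhibition", "education", ...
    names: dict[str, str]     # event titles, e.g. in English and/or French
    place: str
    start_year: int
    end_year: int
    participants: list[Artist] = field(default_factory=list)
```

Two artists are then connected whenever they appear among the participants of the same event, which is what makes exhibitions and periods of training usable as points of connection.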
After the ingestion of the various structured datasets into the database, the project has reached a stage where additional data from multiple archival sources needs to be extracted and incorporated. Research projects in the fields of digital art history and cultural heritage that address text processing tasks similar to ours propose the use of Fine-grained Named Entity Recognition (FiNER). FiNER allows entities to be classified into subtypes, a feature present in some of the entities within our database schema. However, applying this technique in the cultural heritage domain poses specific challenges, particularly in terms of model training, due to the lack of high-quality annotated datasets that would aid in recognising entities and fine-grained sub-entity types.
The heterogeneous nature of the data in the project presents additional challenges for training the models. The names of the artists we study appear in different languages, may have multiple spellings, or may be written in different scripts. Additionally, the names of organisations and events can be in either English or French. Furthermore, our data is biased, shaped through the lens of Western art history and memory institutions, and the language used to identify artists may contain terminology that is problematic.
To train the FiNER model, the project will leverage existing resources, namely the Getty Vocabularies and Wikidata, as training datasets, as well as the project's own database, which contains approximately 12,270 people and about 500 events. Additional training datasets in several languages may be needed to label the names of artists who are not already present in our project database.
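One common way to turn such resources into FiNER training data is to project entity annotations onto tokens as BIO tags for a token-classification model. The sketch below assumes whitespace tokenisation and illustrative fine-grained labels (PERSON.artist, ORG.art_school, LOC.city); the project's actual label scheme, tokeniser, and annotation pipeline may well differ.

```python
def to_bio(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Convert character-level entity spans into token-level BIO tags."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
        for j, i in enumerate(inside):
            tags[i] = ("B-" if j == 0 else "I-") + label
    return list(zip(tokens, tags))

example = "Anwar Jalal Shemza studied at the Slade School of Fine Art in London"
spans = [(0, 18, "PERSON.artist"), (34, 58, "ORG.art_school"), (62, 68, "LOC.city")]
for token, tag in to_bio(example, spans):
    print(token, tag)
```

The same conversion applies regardless of language or script, which matters for names that appear in multiple spellings.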
As this method is being implemented in the project as one of the use cases developed within the Digital Skills for the Arts and Humanities (DISKAH) network, this lightning talk will provide an overview of the project’s approach to FiNER, focusing on preliminary results and initial challenges.
Refactoring the IIIF Artificial Intelligence Community
Martin Kalfatovic
IIIF-C, United States of America
This talk is a call to participate in the newly reformed IIIF AI community group. The International Image Interoperability Framework Consortium (IIIF-C) formed an AI/ML Community Group in 2023. The utility of IIIF in these contexts is clear, but best practices in this domain are less so, and awareness of the full extent of IIIF capabilities is uneven. In forming the group, there was clear value in bringing together practitioners in this domain to gather and highlight use cases, align output formats, and promote interoperability more generally. At the same time, discussion of AI in the cultural heritage sector has grown rapidly, as has the AI4LAM community. The IIIF-C community has taken part in various ways at the six Fantastic Futures conferences, but there remains a disconnect between key components of the IIIF-C and AI4LAM communities. Starting in 2025, the IIIF-C began refactoring the existing AI/ML Community Group, a process that continued with in-person discussions at the IIIF Annual Meeting (June 2025), with the goal of creating a new “IIIF AI” community group that will focus on working closely with existing AI4LAM communities towards common goals. Among these goals are raising awareness of IIIF in the AI4LAM community; integrating AI into IIIF workflows; exchanging ideas around best practices and guidelines related to IIIF and AI; and highlighting successful IIIF/AI implementations.
Reasoning with Small Language Models (SLM) for Trustworthy Generative AI (GenAI)
Jason Clark
Montana State University, United States of America
The advent of powerful Large Language Models (LLMs) presents a significant challenge: how do we trust systems that offer confident answers without transparent reasoning? A central problem with current GenAI interfaces is their rapid, often opaque, output generation. These implementations are often optimized for efficiency, which can inadvertently obscure the output generation process. I would suggest that there’s no malice here, but rather a human-computer interaction (HCI) oversight that misses how to create systems we can trust and learn from. We trust what we can see and explain. The rise of more controllable inference techniques, such as chain-of-thought prompting and step-by-step reasoning, offers pathways to slow down these systems and reveal their content generation processes.
To address the challenge of transparency and trust, I’ll introduce a natural language interface designed for slower, more deliberate analysis of a language model's process, built on small language models (models small enough to run on a phone or laptop). The SLM-based prototype is conditioned for discrete agentic tasks, and the language model's context is intentionally limited to curated sources like Wikipedia, Wikidata, and a custom LAM dataset, allowing for focused grounding and refinement of queries. The interface surfaces the actions taken, the agent's path, and an explained confidence score, empowering users to evaluate the language model's reasoning.
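The implementation details of the prototype are not given here, so the following sketch only illustrates the general pattern the abstract points to: a prompt that asks a locally hosted small model to expose its plan, its path through the provided context, and an explained confidence score before answering. `slm_generate` is a hypothetical placeholder for whatever SLM backend (for example a model served via Ollama or llama.cpp) is actually used.

```python
SYSTEM_PROMPT = """You are a careful research assistant for a library collection.
Answer ONLY from the provided context. Before your answer, show your work:
1. PLAN: list the actions you will take.
2. PATH: note which context passages you used at each step.
3. ANSWER: the answer itself.
4. CONFIDENCE: a score from 0 to 1 with a one-sentence explanation."""

def slm_generate(system: str, prompt: str) -> str:
    """Hypothetical placeholder: the prototype would call a small, locally
    hosted language model here rather than a remote LLM API."""
    return "(model output would appear here)"

def answer_with_visible_reasoning(question: str, context_passages: list[str]) -> str:
    # Ground the model in curated passages only (e.g. Wikipedia/Wikidata
    # extracts or a custom LAM dataset), not its parametric knowledge.
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(context_passages, 1))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return slm_generate(SYSTEM_PROMPT, prompt)

print(answer_with_visible_reasoning(
    "When was this photograph collection acquired?",
    ["[Sample curated passage about the collection's acquisition history.]"],
))
```

Because the context is restricted to curated passages, a reader can check each step of the stated path against the sources the model was actually given.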
This lightning talk will demonstrate the prototype, discuss our research leveraging SLMs, and showcase applied reasoning/inference methods. Attendees will gain insights into leveraging the unique affordances of dialogic interfaces to foster transparency and begin building genuine trust in these AI systems. I’ll also make the case that experiments like this are essential components of GenAI Literacy and offer actionable lessons for attendees to incorporate into their own work to understand and teach about GenAI.
NHMxDCMS AI Pilots: Rapidly prototyping AI for GLAM
Benjamin Scott
Natural History Museum, United Kingdom
From November 2024 to March 2025, the Natural History Museum (NHM), London, launched a six-month pilot programme to accelerate the adoption of artificial intelligence (AI) across UK museums, galleries, libraries, and archives (GLAM). Funded by the UK Department for Culture, Media and Sport (DCMS), the initiative enabled NHM’s AI & Innovation team to expand with four new data scientists and invite proposals from regional institutions within the DiSSCo UK (https://dissco-uk.org/) network to co-develop AI tools addressing local challenges.
Five projects were selected from over thirty submissions, spanning domains from natural history and agriculture to art and digital collections. Collaborating institutions included the Zoological Society of London, Lyme Regis Museum, National Museums Liverpool, the Royal Agricultural University, and the Royal Botanic Garden Edinburgh. The resulting prototypes tackled problems such as illegal wildlife trade detection via species identification, automated transcription of historical specimen index cards, AI-generated alt text for visual collections, herbarium chatbot development, and machine learning-based digitisation quality control.
This talk will reflect on the model of deploying centralised AI expertise into under-resourced institutions, discuss the challenges of rapid prototyping in heritage contexts, and advocate for scalable, collaborative AI hubs to democratise innovation across the GLAM sector.
AI as assistive technology in cultural organisations: a Disability Gain perspective
Rafie Cecilia
King's College London, United Kingdom
AI tools are increasingly promoted in museums as solutions to accessibility, yet they are often developed for disabled people rather than with them. While projects like Microsoft’s SeeingAI demonstrate the power of lived experience in shaping effective and innovative design, much of the current AI hype in the cultural sector sidelines this model. Instead, organisations are being offered expensive products with little transparency, minimal consultation, and accessibility features added late or used as marketing tools.
My research explores the hopes, expectations, and concerns of disabled audiences regarding AI in museums. Survey results from over 100 respondents indicate strong interest in AI’s potential, but more than 90% expressed concern that exclusion from design and development undermines trust and usability. This presentation argues that for AI to fulfil its promise in cultural organisations, it must be co-designed with disabled people as leaders, collaborators, and innovators.