Beyond the Single Voice: Co-Creating Knowledge with AI
Jeff Steward
Harvard Art Museums, United States of America
AI is suddenly everywhere—and expectations and hype are sky-high. With so many new tools available, GLAMs face real opportunities and real risks when it comes to using AI for collections metadata and access.
At Harvard Art Museums, we’re in the early stages of developing a system that brings together computer vision (CV) and large language models (LLMs) to help enrich and diversify our traditional, human-generated metadata. This work builds on more than a decade of experience experimenting with and applying AI to our collections: a dataset of about 500,000 images that has already inspired roughly 100 million AI-generated tags and descriptions, all of which are available in the museums’ public API.
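As a minimal sketch of what that access looks like, the request below pulls a few object records from the public API. It assumes the documented `object` resource and `apikey`/`size` parameters; the key is a placeholder you would obtain by registering with the museums.

```python
import requests  # third-party HTTP client

API_KEY = "YOUR_API_KEY"  # placeholder: request a key from the museums

# Fetch a small page of object records from the public API.
resp = requests.get(
    "https://api.harvardartmuseums.org/object",
    params={"apikey": API_KEY, "size": 5},
    timeout=30,
)
resp.raise_for_status()
for record in resp.json().get("records", []):
    print(record.get("objectid"), record.get("title"))
```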
Our new approach draws on multiple AI models at once: where human cataloguing sometimes leaves gaps or reflects our own biases (like leaving out important, common terms or styles), we use a “group opinion” from several models to suggest new tags, descriptions, and even counts of objects in images. But the process is never fully automated—only if a majority of models agree do we consider a new observation (whether a fact or a subjective interpretation) for human review. This group-decision method acknowledges the known weaknesses of these systems (such as LLMs’ difficulty with visual counting), keeps the process transparent, and puts people firmly in charge.
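As an illustrative sketch of that “group opinion” idea (not the museums’ production pipeline), a simple majority vote over per-model tag sets might look like this; the tag sets are invented for the example.

```python
from collections import Counter

def consensus_tags(model_outputs, min_agreement=None):
    """Return tags proposed by a majority of models, for human review.

    model_outputs: one set of tags per CV/LLM model.
    """
    if min_agreement is None:
        min_agreement = len(model_outputs) // 2 + 1  # simple majority
    counts = Counter(tag for tags in model_outputs for tag in set(tags))
    return {tag for tag, n in counts.items() if n >= min_agreement}

# Three hypothetical models describe the same image; note the disagreement
# over the object count, a known weak spot for LLMs.
candidates = consensus_tags([
    {"boat", "river", "three figures"},
    {"boat", "bridge", "two figures"},
    {"boat", "river", "two figures"},
])
print(candidates)  # {'boat', 'river', 'two figures'} -> queued for human review
```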
One fascinating discovery: AI can generate valid, sometimes unexpected subjective interpretations of art that don’t always align with expert opinions of art historians or curators. We see this tension as an opportunity, where GLAMs can use AI to create a more expansive and inclusive experience for museum visitors, allowing for a wider range of perspectives.
As we take these steps, we’re learning a lot about building trust and AI literacy—how to be open about where AI is useful, where it falls short, and why human oversight matters. Our work aims to keep AI human-centered, offering new tools to support and empower staff instead of replacing their expertise. We hope our experiences help spark a broader conversation about how GLAMs can use AI responsibly to support their values, welcome new viewpoints, and fulfill their public missions in a rapidly shifting landscape.
The presentation will cover the following questions and topics:
- What should an art museum want—and not want—from AI?
- More than ten years of experience working with AI in collections: what changes have we seen, and what challenges have we encountered?
- Real examples of gaps in metadata (intentional and not) and how they affect the utility of our collections
- Practical illustration of our AI-based consensus system, including a sample image’s journey through it
- An illustration of how the tension between human and machine interpretations can create a more inclusive and expansive visitor experience.
- How transparency, human oversight, and openness support trust.
- What questions should we all be asking before handing more decisions to AI?
- The promise and limits of responsible, human-centered AI in GLAMs, based on direct experience.
References
- HAM API: https://hvrd.art/api
- HAM AI Website: https://ai.harvardartmuseums.org/
- Why Do Large Language Models (LLMs) Struggle to Count Letters?: https://arxiv.org/abs/2412.18626
Computational description: an end-to-end case study
Matthew McGrattan1, Abigail Potter2
1Digirati, United Kingdom; 2Library of Congress
LAM institutions share common challenges arising from large and often ever-increasing collections that are undescribed or under-described; that are inaccessible to users because they have not been digitised or exist only in languages or scripts that are not widely understood; and that cannot be found because the good-quality metadata that supports search does not exist. Solving these problems has historically required expert labour, which is a challenge when staff time and budgets are spread thinly. AI and ML methods are a potential solution to some of these problems, supplementing expert staff time with computational methods.
Typical use cases for AI or computational methods in LAM institutions include:
- Transcription, e.g. converting digitised images of handwritten or printed text, or audio recordings, into machine-readable text
- Translation, e.g. from less widely understood languages to those understood by the broad population of LAM users
- Classification: describing resources using unstructured keywords or controlled vocabularies
- Data extraction, e.g. the conversion of unstructured data such as images, plaintext or audio into structured data suitable for discovery.
All of these potentially aid in the description, discoverability and accessibility of LAM collections.
While all of the above opportunities to experiment with AI-assisted description are a key part of LAM workflows, classification and data extraction are particularly difficult problems in the LAM domain for several reasons. LAM institutions usually require that the extracted data:
- Conform to a particular data model or schema,
- Use multiple specific controlled vocabularies or authority-controlled values for key terms,
- Follow consistent standards in the formatting and structure of data fields, and
- Be sufficiently comprehensive and accurate to serve as reference data for the resource or resources being described.
LAM data models may be complex and deeply nested; controlled vocabularies may have hundreds of thousands of terms; and formatting and cataloging rules may be nuanced and require training and expertise to apply consistently. Meeting these challenging standards is a key part of the discoverability and interoperability of data across institutions. AI models have not always been able to meet these kinds of robust standards for data quality.
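As a hedged illustration of how such outputs can at least be checked, the sketch below validates a model-proposed record against a toy schema and a tiny, invented controlled vocabulary before it would reach a reviewer; real authority files such as LCSH are orders of magnitude larger and would be looked up, not inlined.

```python
from dataclasses import dataclass, field
from typing import List

# Tiny stand-in for a controlled vocabulary (invented for the example).
CONTROLLED_SUBJECTS = {"Printing -- History", "Bookbinding", "Paper making"}

@dataclass
class ProposedRecord:
    title: str
    creator: str
    subjects: List[str] = field(default_factory=list)

def validate(record: ProposedRecord) -> List[str]:
    """Return problems found; an empty list means the record can join the review queue."""
    problems = []
    if not record.title.strip():
        problems.append("missing title")
    invalid = [s for s in record.subjects if s not in CONTROLLED_SUBJECTS]
    if invalid:
        problems.append(f"subjects not in controlled vocabulary: {invalid}")
    return problems

print(validate(ProposedRecord(
    title="On Printing",
    creator="Doe, Jane",
    subjects=["Printing -- History", "Typography"],
)))  # -> ["subjects not in controlled vocabulary: ['Typography']"]
```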
In this presentation we will use a single case study—cataloging books and/or ebooks in MARC or BIBFRAME using large language models—to discuss how LAM organisations can use LLMs to support staff in the structured description of previously undescribed or sparsely described collection objects.
We will work through the key stages of running an AI/ML experiment from initial concept to prototyping with staff and users using a specific real-world example to illustrate the benefits and the risks of applying AI methods to LAM problems.
These stages include:
- Discovery:
  - Problem definition / needs analysis: gathering a deep understanding, through workshops, interviews and other activities, of where the specific challenges exist for staff and users; where pain points or resource bottlenecks exist; and where richer or better-structured data can aid users in the discovery of LAM resources.
  - Success criteria and metrics: defining what counts as good enough, either as input into a human-in-the-loop workflow that aids expert LAM staff or for direct presentation to end users. This often requires identifying key quantitative metrics that provide a statistical measure of data quality, alongside standards for manual review that capture the informed opinions of expert reviewers (a minimal metric sketch follows this list).
  - Data and workflow audit: what can potentially be done with AI/ML methods depends heavily on the availability of data that can be used to evaluate, fine-tune or train models, or on the potential for enriching existing data without excessive burden on staff time and budgets. A thorough, technically informed audit is key, and it needs to pay careful attention to copyright and other rights concerns; issues of statistical, demographic or historic bias; and other risks such as the presence of personally identifying information.
  - Landscape analysis and model selection: the landscape of AI models has been changing rapidly over the past four years. There may be a range of models that can be brought to bear on the problem(s) identified during problem definition, or there may be none, in which case it may be necessary to train a model from scratch. This stage involves identifying which models are suitable for evaluation and practically useful given technical and budgetary constraints.
- Implementation:
  - Data enrichment and transformation to create datasets: given the data audit and landscape analysis, the LAM institution needs to create reusable datasets that can be used for training, testing and evaluating models.
  - Prompting, training and fine-tuning: using the datasets and the selected models to thoroughly train or fine-tune the relevant models.
  - UI/UX development: development and provision of any interfaces used as part of human-in-the-loop (HITL) workflows.
- Review:
  - Evaluation: evaluating the outputs against the success criteria using a mixture of statistical measures and manual review, including user testing of interfaces.
  - Reporting and decision making: making final decisions about whether the outputs of the AI/ML workflows really do solve the problems identified during problem definition, without undue risk and within the budgetary and time constraints of the LAM institution, bearing in mind that the conclusion may be negative.
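The metric sketch referenced above: a minimal example of one quantitative success measure, precision and recall of model-proposed subject headings against a small expert-catalogued “gold” record. All values are invented for illustration.

```python
def precision_recall(proposed, gold):
    """Precision/recall of proposed headings against an expert 'gold' set."""
    proposed, gold = set(proposed), set(gold)
    true_positives = len(proposed & gold)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

gold_subjects = ["Printing -- History", "Bookbinding"]
model_subjects = ["Printing -- History", "Typography"]
print(precision_recall(model_subjects, gold_subjects))  # (0.5, 0.5)
```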
In the case of Exploring Computational Description—the latest in a series of experiments to explore the application of machine methods to book cataloging—the requirement was to describe ebooks using MARC or BIBFRAME. For fields that use controlled vocabularies or URIs, the requirement was to use the appropriate controlled vocabulary such as LCNAF (for names), LCGFT (for genres) and LCSH (for subjects). We will discuss, in the context of this experiment:
- The specific challenges of this data,
- The methods used to restructure and enrich the available data for model training and evaluation,
- The standards used to evaluate the outputs, and
- The user interfaces developed to support human-in-the-loop review of the outputs.
Exploring Computational Description will form a case study throughout the presentation to illustrate potential opportunities and potential risks in the application of generative AI and LLMs to the creation of structured LAM catalog data from uncataloged or under-cataloged resources.
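To make the shape of the task concrete, here is a hedged sketch of a prompt that asks a generic chat-style LLM for structured bibliographic fields constrained to the vocabularies named above; it is not the prompt used in Exploring Computational Description, only an illustration of the kind of instruction involved.

```python
def build_prompt(ebook_excerpt: str) -> str:
    """Assemble an illustrative cataloging prompt for a generic chat-style LLM."""
    return (
        "You are assisting a cataloguer. From the excerpt below, return a JSON "
        "object with the fields: title, primary_contributor (formatted as it "
        "would appear in LCNAF), genre (an LCGFT term) and subjects (a list of "
        "LCSH headings). Use null for any value you are unsure of rather than "
        "guessing.\n\n"
        f"Excerpt:\n{ebook_excerpt}"
    )

print(build_prompt("CHAPTER I. The history of movable type in Europe..."))
# The model's JSON would then be validated against the relevant authority
# files and routed into the human-in-the-loop review described above.
```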