Assessments that are frequently used for large language models (LLMs) include Massive Multitask Language Understanding, Graduate-Level Reasoning, and standardized tests (1, 2). While convenient for direct comparison of models, performance on the SAT, or even on graduate-level tests such as the GRE, is not a sufficient or relevant indicator of a model's suitability and performance for library, archive, and museum (LAM)-focused operations. LAM collections contain materials from different sources and in different media, modalities, and formats. For example, a model that can properly caption a color photograph may not be the most appropriate tool to caption a medieval lithograph. A model that successfully parses a handwritten 20th-century letter may struggle with a medieval manuscript. Additional complexity is added when these artifacts come from non-Western and/or non-English-speaking countries, because English sources dominate LLM training data. LAM institutions need their own leaderboard, focused on evaluating AI tools for LAM-specific tasks.
One common use of LLMs is to summarize text. Patrons may use that functionality to decide whether a particular article is relevant to their research topic. Another common intended use of LLM summarization is making library collections more accessible (3, 4). However, assessing summaries is a challenging task.
Even when model testing does occur at cultural institutions, the approaches tend to have limitations. Institutions often provide access to a single commercial model, so no comparison takes place. In some cases, the professionals who need to assess model capabilities directly lack the technical skills to set up testing environments for open-source models. Furthermore, testing tends to be done ad hoc, without shared scoring rubrics or assessment metrics. Even carefully planned, extensive experiments are rarely published together with their datasets, which prevents others from building on these successful tests. All of these factors prevent the comparisons needed to identify which models perform best on which LAM-specific tasks.
In response, the AI Evaluation working group was formed under AI4LAM. We are nearing completion of our first use case: the evaluation of summaries of research articles produced by commercial and open-source LLMs. While a number of automated scoring algorithms based on word similarity have been developed, we chose a different approach. We defined the article's abstract as the gold standard for that article's summary, the equivalent of a control in an experiment, and identified the key points in it. We then determined how many of those key points are present in the summary output by the model. The workshop we are proposing is grounded in this experience. It will focus on developing and implementing an assessment workflow for summarization, but the process can be translated to designing other evaluation workflows.
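To make the scoring step concrete, below is a minimal sketch of how key-point coverage can be tallied once a reviewer has extracted the key points from an abstract and judged whether each appears in the model's summary. The data layout and function names are illustrative assumptions for this workshop, not the working group's actual tooling.

```python
# Minimal sketch of key-point coverage scoring (illustrative, not official tooling).
# A reviewer lists the key points found in the gold-standard abstract and records
# whether each one is present in the model's summary; the score is the covered fraction.
from dataclasses import dataclass

@dataclass
class KeyPoint:
    text: str                  # a key point identified in the gold-standard abstract
    present_in_summary: bool   # reviewer's judgment for the model's summary

def coverage_score(key_points: list[KeyPoint]) -> float:
    """Return the fraction of gold-standard key points covered by the summary."""
    if not key_points:
        raise ValueError("No key points defined for this abstract.")
    covered = sum(kp.present_in_summary for kp in key_points)
    return covered / len(key_points)

# Example: four key points, three captured by the model's summary -> 0.75
points = [
    KeyPoint("States the research question", True),
    KeyPoint("Describes the dataset used", True),
    KeyPoint("Names the main method", True),
    KeyPoint("Reports the central finding", False),
]
print(f"Coverage: {coverage_score(points):.2f}")
```

Because the score is simply the fraction of gold-standard key points covered, results remain comparable across models tested on the same article.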
Intended Learning Outcomes:
Through completing this workshop, participants will gain the skills to:
1. Structure an evaluation rubric
2. Deploy an open-source LLM on their machine (optional)
3. Define a gold standard against which to assess the model
4. Translate the evaluation workflow to their own use case
Timetable for the session:
- Introductions to participants (10 min)
- Context: The AI4LAM AI Evaluation WG (5 min)
- Introduction to the evaluation use case (5 min)
- Confirming technical setups (5 min)
- Evaluation task part 1: Defining gold standard reference data (25 min)
- Evaluation task part 2: Generating (or sharing) test output (10 min)
- Evaluation task part 3: Performing comparison (25 min)
- Evaluation task part 4: Calculating final scores (5 min)
- Share out to the group (15 min)
- General discussion (15 min)
Expectations:
Required level of experience: Beginner
If you have a specific document you would like to propose for testing in this use case, please contact vensberg@ucdavis.edu by November 12th, 2025. We encourage these submissions to make this workshop directly applicable to your work.
Participants in the workshop should bring their own laptops. They should have either:
- Access to a cloud-based LLM (i.e., a setup provided by their institution or by a vendor). Free accounts with chat-based applications are sufficient for this workshop, or
- Access to an LLM on their own laptop. If choosing this option, please pre-install LM Studio and follow these instructions (https://lmstudio.ai/docs/app/basics/download-model) to download a model. The application will indicate which models are small enough to run locally. Try downloading Llama-3.2-3B-Instruct; we chose the version published by hugging-quants. The file will appear as llama-3.2-3b-instruct-q8_0.gguf. A minimal example of calling a locally hosted model appears after this list.
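For participants choosing the local option, the sketch below shows one way to request a test summary programmatically, assuming LM Studio's OpenAI-compatible local server is running on its default port and the model above has been downloaded and loaded. The file name, model identifier, and prompt are illustrative assumptions; the same step can also be done entirely in LM Studio's chat interface with no code.

```python
# Minimal sketch: request a summary from a model served locally by LM Studio.
# Assumes the LM Studio local server is running at its default address
# (http://localhost:1234/v1) with Llama-3.2-3B-Instruct loaded.
# "article.txt" and the model identifier are placeholders for illustration.
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("article.txt", encoding="utf-8") as f:
    article_text = f.read()

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",  # use the identifier LM Studio shows for the loaded model
    messages=[
        {"role": "system", "content": "You summarize research articles for library patrons."},
        {"role": "user", "content": "Summarize the following article in about 150 words:\n\n" + article_text},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```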
When registering for the workshop, participants should indicate how they will access LLMs. Participants will also have the opportunity to submit a .txt or .pdf of a text object they would like to create a summary for. By submitting a digital version of the object, they consent to other workshop participants having access to it and using it during the workshop. We highly encourage submitting materials with a license that allows them to be included in the open-access collection we are creating to test LLMs for LAM operations.
Due to the limited workshop time, and depending on the number of submissions, we may have to choose which ones to use. In that case, materials with Creative Commons licenses will be given priority.
If we do not receive submissions, we will provide the materials necessary to conduct the workshop.
Note that the focus of the workshop is on creating standards and workflows for tool assessment. Models used on local machines will have limited size and therefore limited functionality.
An additional advantage of the workshop is that participants will learn how to lower the barrier to entry for experimenting with LLMs and, as a result, will be able to engage a wider group of colleagues in projects around artificial intelligence.
This workshop focuses on developing standards for benchmarking and evaluating AI tools for LAM operations. We believe that by creating a shared collection, evaluation workflows, and metrics, we will shift practice in LAM from siloed, ad hoc experimentation to a streamlined, federated evaluation process. Cost sharing will also benefit all participants, since few organizations can afford to license multiple AI tools on their own. Overall, adopting the process taught in this workshop will allow institutions to make evidence-based decisions when choosing an AI tool or LLM for their operations while controlling their costs.
References
1. Anthropic. (2024, March 4). Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family
2. OpenAI et al. (2024). GPT-4 Technical Report. arXiv. https://doi.org/10.48550/arXiv.2303.08774
3. Yale Library. (2024, November 25). Yale Library is developing an AI application that could transform research in digitized collections. https://library.yale.edu/news/yale-library-developing-ai-application-could-transform-research-digitized-collections
4. Caizzi, C., Deschenes, A., & Snydman, S. (2024, December). Transforming access to collections with AI-driven exploration [Slide show]. Coalition for Networked Information, Washington, D.C., United States. https://www.cni.org/wp-content/uploads/2024/12/CNI-2024-Reimagining-Discovery.pdf