The BL Labs Collider – Feedback on the Labs/Data.bl.uk Pilot 2 SuperMVP Vision
Silvija Aurylaite
British Library, United Kingdom
The British Library Labs/Data.bl.uk Pilot 2 SuperMVP Vision explores a new model for future computational access to a curated selection of datasets, one in which the portal gives the Library's own staff access too, inviting them as collaborators in dataset exploration, visualisation and challenge creation. This workshop invites the international GLAM community to engage directly with the wireframes and UX strategy of this platform, currently at the wireframe stage, and to offer structured feedback that will shape its global relevance and real-world impact.
Unlike traditional research data portals, the SuperMVP's innovation lies in its dual-access model: both external users and internal staff gain visibility into datasets, tools for running analyses, and interfaces for suggesting and shaping Labs activities. This supports both “AI for staff” (enhancing institutional knowledge work) and “AI with users” (community-driven discovery and co-creation).
Objectives
- Present the Labs/Data.bl.uk SuperMVP (Pilot 2) wireframes, strategy and implementation diagrams
- Demonstrate how systems that invite external user creativity can be embedded in GLAM workflows, not only layered on externally, enabling the search for shared data-challenge ideas by bringing internal subject experts, rich British Library datasets and creative, skilled users together in a collider of innovation potential
- Gather actionable feedback on usability, staff-user interaction design, and future opportunities for participatory AI in GLAM
- Co-imagine extensions: how might other institutions adapt this model?
Why Is This Innovative?
BL Labs is exploring a GLAM dataset portal in which every staff member has the same visibility as Labs staff and can actively propose and join beginner-level computational challenges. It turns the institution into a living lab—where computational access is not just a backend tool, but a shared, learnable infrastructure. This workshop offers a live opportunity to shape that future.
Vision, Strategy and Implementation Plan of the New BL Labs for Computational Access by Leader/Manager Silvija Aurylaite
AI Literacy is Information Literacy: Teaching STEM students and research labs about generative AI
Hannah Edlund, Ian Mellor-Crummey
Rice University Fondren Library, United States of America
The recent proliferation of generative AI tools and the corresponding surge in interest have created a need for STEM students and researchers to learn how to use these tools properly and within the bounds of institution-provided or cost-free options. This presentation will cover some of the strategies we have used to introduce generative AI, approached from an AI literacy framework.
As librarians, we are well situated to provide guidance on evaluating and implementing these tools in an academic context, drawing on our training and expertise in information literacy. We have learned that AI literacy is similar to information literacy: most people need not only instruction on how the tools function, but also guidance on how to select the right tool for the right job, develop/engineer adequate prompts, and fully consider ethical implications ranging from environmental impact to academic integrity.
Each presentation and demonstration session is tailored to the set of attendees and the setting: an on-demand presentation to an academic group, research lab, or class, or a scheduled one-shot course in the library. We aim to help attendees learn how generative AI can be a strong supporting part of a research arsenal when used mindfully. We also teach them how to access and best leverage the university-provided models and what their specific strengths and weaknesses are. Because public awareness of generative AI is frequently focused exclusively on LLMs like ChatGPT and Copilot, we help attendees expand their repertoire to other options. The “right tool, right job” approach is exemplified by showing how different tools can be used well or poorly depending on the prompt and context. Attendees learn how to evaluate tools and what kinds of test cases to use for evaluation. These demonstrations focus on free-to-use tools or models provided by the university.
Another important aspect that we highlight is the responsibility to protect personal information within an academic environment. In the United States, all students are protected by the Family Educational Rights and Privacy Act (FERPA), which is much stricter than other US privacy laws. Many generative AI tools that are not accessed through university portals are not compliant, and researchers, students, and other university community members need to be aware of their responsibilities.
Beyond the concerns of FERPA, we also teach about the implications of generative AI models harvesting user data and the potential security risks this may pose for researchers. In particular, many researchers who work with sensitive intellectual property are not aware that some free models may still harvest their inputs. Additional intellectual property concerns we address are those of alleged copyright infringement; not all users are aware of current lawsuits brought against generative AI companies.
We also encourage attendees to fully consider the environmental impacts of using generative AI when it is not helpful or necessary. We teach them how to determine when to use these tools alone and when to use them in conjunction with others, such as an internet search or browsing the library catalog. By restricting their use of generative AI to situations where it is most applicable, they can minimize unnecessary environmental impact.
We are also careful to ensure that attendees are aware of academic integrity expectations when using generative AI. A single set of rules or guidelines does not exist, so we show them where to find information from the university and then take questions about specific circumstances. We have learned that people are rarely certain of what is or is not permissible, and that leadership needs to provide a more precise or detailed statement of what is allowed in which circumstances. Even when lengthier guidance is provided, we have learned that people still need a contact to ask questions.
Through this layered approach, we strive to empower students, faculty, and staff to be more informed and effective users of generative AI. The range of tools grows daily and the potential applications in different fields are all but countless, making it impossible for us to cover every possibility. Structuring our instruction around our expertise in information literacy allows us to support a broad cross-section of our campus community and augment their discipline-specific expertise with a practical understanding of available AI technologies and strategies for incorporating them into their workflows. As the AI landscape and the university's position toward it change, we are able to nimbly adapt our one-shot, small-scale sessions to reflect new developments. There remains quite a bit of uncertainty surrounding generative AI in higher education, and our hope is to maintain the library's position as a source of guidance for information needs, regardless of format or source.
Ethical Considerations Around Machine Learning-Engaged Online Participatory Research
Samantha Blickhan, Hillary Burgess
Zooniverse, The Adler Planetarium, United States of America
This poster will present work in progress that brings interdisciplinary researchers, practitioners, and members of the public together in a series of workshops to discuss the ethics of machine learning-engaged online participatory research (often called ‘crowdsourcing’), wherein machine learning (ML) is a component at some stage of an online participatory research project. The discussions from these workshops will be used to produce the first-ever framework for ML-engaged online participatory research on the Zooniverse crowdsourcing platform.
The field of participatory research is mature, with clearly-articulated best practices for equitable approaches to this work which prioritize volunteer experience at the same level as data production (Ridge et al. 2021). The introduction of widespread ML into this field has the potential to disrupt current practice if we do not approach it carefully and with intention. Examples of ML-engaged participatory research projects can include: incorporating ML models for data pre- or post-processing before or after being used in a project; using volunteer-labeled data to train ML models after a project is complete; and feedback-style “Human-in-the-Loop” processes, which combine these approaches to fine-tune models (Fortson et al. 2024). These methods are increasingly prevalent as ML increases in sophistication and accessibility for researchers and the public alike. In the sciences, recent research has demonstrated that the best systems for scientific research output are those that combine the strengths of humans and machine efforts (Zevin et al. 2024), and ongoing work in the LAM sector seeks to explore potential barriers to ingesting AI-enhanced data (Ridge et al. 2024). As ML becomes increasingly integrated into research practices across disciplines, it is crucial to address the risks, opportunities, challenges, and broader ethical questions. Online participatory research provides an illuminating context through which to explore these questions via ground-up discussions which have the potential to set new standards across academia and industry. The Zooniverse platform (https://www.zooniverse.org) offers a unique case study, with nearly 3 million volunteers worldwide who have contributed to 450+ online participatory research projects (including 75+ Digital Humanities and/or LAM-led projects) each led by a different research team and reliant on public participation. All projects include discussion forums where researchers and volunteers engage. As ML has become more prevalent—now in ⅓ of Zooniverse projects—it has sparked a range of reactions within the volunteer community, reflecting broader societal discourse that runs the gamut from curiosity to fear. These discussions have surfaced concerns and insights on issues like data ownership, agency, transparency, and trust.
In response to these concerns, we proposed a series of virtual workshops to bring together diverse stakeholders including the volunteer community, multi-disciplinary researchers, ethicists, data licensing experts, data scientists, communication experts, and Zooniverse project leads. This project received funding in early 2025 from the Kavli Foundation (Drahl 2024). The virtual workshops will run from June through November 2025, on topics related to the theme of ML-engaged online participatory research, including Transparency and Communication Best Practices (to develop guidelines to support researchers in effectively communicating with the public about ML-engaged participatory research); Ethical Approaches to ML (to explore and identify foundational elements of an ethical approach to ML-engaged online participatory research, addressing risks while leveraging opportunities); Deepening Contextual Understanding (expanding on ethical considerations by examining a matrix of factors including disciplinary differences, task type affordances, and the varied needs of stakeholders like researchers, volunteers, and platform maintainers); and Downstream Data Protection (considering recommendations for licensing frameworks to use with online participatory research data outputs that align with platform values, particularly in relation to generative AI).
The project output will include the first-ever Zooniverse framework for ML-engaged online participatory research, which we will incorporate into the platform’s user documentation and project review process. With over 80 active projects and 2-4 new projects launching each month, it is our hope that this documentation will guide hundreds of researchers to effectively address ML ethical considerations and public engagement in their work.
This poster will provide an overview of the process, and share some early results of the project. We also hope that by sharing this work, we can create opportunities for discussion among the DH/LAM community around the challenges and opportunities of creating frameworks for the widespread adoption of a new technology in public-engagement driven research spaces which prioritize transparency, communication, and equitable participation.
Works Cited
Drahl, Carmen. 2024. Relaying Public Input on Machine Learning to Researchers. The Kavli Foundation (blog). October 15. https://www.kavlifoundation.org/news/relaying-public-input-on-machine-learning-to-researchers.
Fortson, Lucy, Kevin Crowston, Laure Kloetzer, and Marisa Ponti. 2024. Artificial Intelligence and the Future of Citizen Science. Citizen Science Theory & Practice 9(1): 32. doi: 10.5334/cstp.812.
Ridge, Mia, Samantha Blickhan, Meghan Ferriter, Austin Mast, Ben Brumfield, Brendon Wilkins, Daria Cybulska, Denise Burgher, Jim Casey, Kurt Luther, Michael Haley Goldman, Nick White, Pip Willcox, Sara Carlstead Brumfield, Sonya J. Coleman, and Ylva Berglund Prytz. 2021. The Collective Wisdom Handbook: Perspectives on Crowdsourcing in Cultural Heritage. Chapter 4: Identifying, aligning, and enacting values in your project. doi: 10.21428/a5d7554f.1b80974b.
Ridge, Mia, Meghan Ferriter, and Samantha Blickhan. 2024. Closing the loop: integrating enriched metadata into collections platforms. Fantastic Futures 2024 conference. October 18. https://zenodo.org/records/14040682.
Zevin, Michael, Corey B. Jackson, Zoheyr Doctor, Yunan Wu, Carsten Østerlund, L. Clifton Johnson, Christopher P. L. Berry, Kevin Crowston, Scott B. Coughlin, Vicky Kalogera, Sharan Banagiri, Derek Davis, Jane Glanzer, Renzhi Hao, Aggelos K. Katsaggelos, Oli Patane, Jennifer Sanchez, Joshua Smith, Siddharth Soni, Laura Trouille, Marissa Walker, Irina Aerith, Wilfried Domainko, Victor-Georges Baranowski, Gerhard Niklasch, and Barbara Téglás. 2024. Gravity Spy: lessons learned and a path forward. European Physical Journal Plus 139(1): 100. doi: 10.1140/epjp/s13360-023-04795-4.
Fast, cheap, and good? Building an image recognition prototype in the AI era
Neil Hawkins
Cogapp, United Kingdom
Recent developments in AI tooling, fuelled by wider industry investment, are lowering the time, funding, and expertise barriers traditionally associated with the field. This project explored whether a small team (of one), using off-the-shelf tools, could rapidly prototype an image recognition API for a complex GLAM-specific challenge — specifically, recognising 3D artworks such as sculptures in a gallery environment.
Recognising 3D objects presents particular challenges. These include the need to handle multiple viewpoints and self-occlusion, variations in lighting and reflections on complex surfaces, the lack of labelled multi-angle datasets for a specific collection, and the limitations of image recognition techniques which work well for “2D” works.
The prototype development involved three key stages:
- Dataset creation: using small, high-quality datasets captured with mobile devices
- Annotation: using modern segmentation tools (e.g. Segment Anything 2) and model-driven annotation loops to accelerate the process
- Iterative training: fine-tuning a model on the annotated data, as sketched below
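As a hedged illustration of that final stage (the abstract names neither the framework nor the model, so torchvision and ResNet-50 are stand-ins), fine-tuning on a small labelled dataset of gallery objects might look like this:

```python
# Illustrative sketch only: fine-tuning a pretrained backbone on a small,
# mobile-captured dataset of artworks. Framework and model are assumptions,
# not the project's actual stack.
import torch
from torch import nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# One folder per artwork, e.g. data/train/<artwork_id>/*.jpg (hypothetical layout)
train_set = datasets.ImageFolder("data/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # a few epochs often suffice for small datasets
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

In a setup like this, multiple viewpoints of the same object simply become additional labelled examples in each class folder, which is one way a feedback loop can grow the dataset where accuracy is weakest.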
Key findings:
- A small, high-quality dataset, refined through a feedback loop, achieved useful accuracy.
- Modern annotation workflows substantially reduced hands-on labelling time and effort.
- Off-the-shelf tools and cloud services proved highly affordable, bringing the overall project within a modest budget.
Rapid and low-cost prototyping lowers the threshold for experimentation with fine-tuning models across many potential applications, not only image recognition, allowing new ideas to be tested before committing to larger-scale investments.
The prototype also revealed limitations. Extreme occlusion and highly reflective or translucent surfaces still require additional data capture. Ethical considerations arise around transparency of training data, biases in foundation models, and intellectual-property and rights management when relying on external cloud platforms.
This project demonstrated that fast and affordable AI prototyping is achievable for a challenging task such as 3D object recognition. This suggests a similar outcome could be achieved in other problem areas or applications, opening up new opportunities for innovation.
Reimagining Museum Work: How Museums Respond to AI’s Impact on Workforce, Skills and Equitable Collaboration
Yingru Qian
King's College London, United Kingdom
As artificial intelligence (AI) is increasingly applied in the cultural sector, professional work practices in museums are undergoing profound changes. While AI has been piloted in areas such as collection management and visitor experience, there is little understanding of how it affects the internal structures of museum labour, particularly from the perspectives of museum professionals. This research takes an internal perspective, focusing on museums in the UK, US, and China to examine how they understand and respond to AI's influence on professional work, skill requirements and equitable collaboration.
The key innovation of this study lies in the first application of the AI Exposure Index to the museum context. This approach quantifies which roles and tasks are most susceptible to AI-driven change. Through interviews and discussion groups with museum professionals, the research seeks to gain a deeper understanding of their perceptions, experience and responses to AI integration in their daily work. By using a mixed-methods approach, this research also builds a cross-cultural framework to understand how AI might change existing inequalities in museum workplaces, which will help promote structural and ethical understanding of the changes brought about by AI in the cultural sector.
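The abstract does not give the index's formula, so purely as a loose illustration of how such indices are typically constructed, a role-level exposure score can be computed as a task-weighted average of exposure ratings; every task name and number below is hypothetical:

```python
# Hypothetical sketch of a task-weighted AI exposure score for one museum
# role. The actual AI Exposure Index methodology is not specified in the
# abstract; this only illustrates the general weighted-average pattern.

tasks = [
    # (task, share of working time, AI exposure rating in [0, 1])
    ("cataloguing and metadata entry", 0.40, 0.8),
    ("object handling and conservation", 0.30, 0.1),
    ("visitor engagement and tours", 0.20, 0.3),
    ("exhibition research and writing", 0.10, 0.6),
]

def exposure_index(tasks):
    """Weighted average of task-level exposure, weighted by time share."""
    total = sum(share for _, share, _ in tasks)
    return sum(share * exposure for _, share, exposure in tasks) / total

print(f"Role-level AI exposure: {exposure_index(tasks):.2f}")  # 0.47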
Though the research is still in the early stages, three primary objectives have been defined:
• Identify AI's impact on different museum professions and skill requirements, and compare it with that of existing technologies such as digitisation and automation.
• Explore how museum professionals understand and respond to AI’s influence on their roles, expertise and collaboration.
• Investigate the evolving relationship between AI and equity in museum work environments, analysing how national cultural policies and legislation shape these dynamics.
The project aims to provide forward-looking insights for museum practitioners, policymakers, and cultural organisations navigating the complex and rapidly evolving AI landscape. It also contributes to broader GLAM conversations about ethical AI, human-centred technologies, and internationally informed approaches to AI implementation.
Key References
Arts Council England. (2016). Equality, Diversity & the Creative Case: A Data Report 2015–2016 (pp. 1–40). Arts Council England.
Cecilia, R. (2024). Challenging ableism: Including non-normative bodies and practices in collections care. In Krmpotich, C. and Stevenson, A. (eds.), Collections Management as Critical Museum Practice.
Gray, C. (2015). The Politics of Museums: New Directions in Cultural Policy Research. Palgrave Macmillan UK.
Magdalena, P.S. (2023). Artificial intelligence in the context of cultural heritage and museums: Complex challenges and new opportunities.
Maslej, N., Fattorini, L., Perrault, R., Parli, V., Reuel, A., Brynjolfsson, E., Etchemendy, J., Ligett, K., et al. (2024). The AI Index 2024 Annual Report. Stanford, CA: Stanford University, Institute for Human-Centered AI.
Murphy, O. & Villaespesa, E. (2020). AI: A Museum Planning Toolkit. Goldsmiths, University of London.
Villaespesa, E. and Crider, S. (2021). A critical comparison analysis between human and machine-generated tags for the Metropolitan Museum of Art's collection. Journal of Documentation, 77(4), 946–964.
Webb, M. (2019). The impact of artificial intelligence on the labor market.
Transforming Record Appraisal at NAS with AI: A Proof-of-Concept Exploration
Joyce Wong
National Library Board, Singapore
With the push to adopt Artificial Intelligence (AI) within the Singapore Public Service, the National Archives of Singapore (NAS) embarked on a Proof-of-Concept (POC) to utilise the Large Language Model (LLM) capabilities in LaunchPad to improve its appraisal methodology. As part of the POC, NAS leveraged the LaunchPad AI community platform, set up by Singapore's Government Technology Agency, to access valuable AI resources and harness the power of collaboration to tackle its challenges in appraisal. This presentation will share key lessons from NAS' AI exploration for appraisal and outline its plans for refining the POC.
NAS is mandated to collect, preserve and provide access to government records of historical or national significance. The joint appraisal process with government agencies, which determines the retention period and disposition action for records, is an elaborate process considered one of the most challenging tasks for archivists. Faced with the exponential growth of records and staff turnover, continuing the current appraisal methodology was no longer sustainable, and NAS decided to explore the use of AI.
There were two phases in the POC:
- The POC was originally intended to assist agencies in identifying and matching their records to the pre-approved dispositions in the Common Records Retention Schedules (CRRS), thereby reducing the need to submit them to NAS for separate adjudication. However, the POC faced issues such as data preparation challenges, hallucinated outputs, and uncertainty over changes in LaunchPad's business model. This phase concluded with LaunchPad's transition to AIBots, a self-service generative AI chatbot platform.
- In the second phase of the POC, NAS developed an AI bot using AIBots, incorporating customised prompts with the CRRS as its knowledge base. While the AI bot showed promise, the POC faced challenges including inconsistent results and structural issues in the knowledge base. The POC has since been paused due to a lack of data science expertise to address these technical challenges.
NAS will continue to explore various options, including switching to an alternative platform such as Pair (an AI assistant created by Open Government Products (OGP) using LLM technology) that presents potential solutions for an improved user experience. Nevertheless, these experiences have provided learning points for NAS, such as the importance of i) a structured knowledge base and clear requirements, ii) staff capability and readiness to adopt AI, and iii) technical expertise in AI implementation.
AI Products: A Licence Review Tool
Alex Fenlon, Lisa Bird
University of Birmingham, United Kingdom
GLAM staff and researchers can see the potential in using new AI products to create work efficiencies or as ways of discovering hidden information and generating new knowledge. Our digital scholars and archivists are using AI products to enrich and label datasets, to add text recognition and transcriptions to digitised versions of printed materials, and to conduct linguistic analysis that uncovers new meaning and knowledge. This paper will explore how GLAM staff can support their communities to use such products safely by ensuring terms of service are evaluated and understood.
The boom in AI has led to an explosion in the number of products and services available to support GLAM activity, teaching and research. These new technologies have driven the digital transformation we now see, and the proliferation of AI-based products is accelerating this digital shift. Some of these products are large and institutionally vetted, approved, and supported. They go through extensive scrutiny to ensure they are fit for our purposes, secure, and compliant with data protection and accessibility legislation, among various other checks, before being rolled out across our organisations.
At the same time, there are numerous other tools (often free, or paid for personally) that are used by staff and students. These do not have this same level of organisational inspection or support. Each AI tool will have its own terms of service controlling their use. GLAM staff and researchers need to have the skills and knowledge to understand and assess these contracts.
Users will click ‘yes’ to terms of service without reading them, let alone understanding them. Hidden within these terms are often problems for individuals and organisations. So-called “Shadow AI” products can generate financial and reputational risk for researchers and organisations: obligations and guarantees signed, licences granted, liabilities agreed, and risks owned, all by individuals who do not have the authority to do so.
Embrace it, resist it, or fear it: we need to know what we're signing up to.
This talk will focus on how Library staff at the University of Birmingham have developed an AI Products Licence Review Tool to help draw attention to the key terms within agreements for AI products, shadow or not. They have developed two versions of this tool: one aimed at GLAM staff members who may want to fully review a licence before making a purchase, and the other aimed at researchers or staff members who just want a quick review of the terms to check for the most problematic items in an AI product licence. The tool is designed to be used by non-experts but identifies where expert support should be sought. Jisc's “AI Procurement Due Diligence”[1] helped shape the tool, although that guidance is aimed at those directly involved in procuring products rather than those in the shadow space. Another Jisc post, “Licensing Options for Generative AI”[2], provides useful guidance as well.
If we look to specific examples of the challenges faced by GLAM professionals wanting to embrace AI technologies, text recognition products provide a useful demonstration. A leading product uses AI and machine learning to build models capable of identifying handwriting and turning script into machine-readable, human-readable digital text: an extremely powerful product with the ability to unlock archival materials far more quickly than manual transcription. The problem is that the product needs permission to retain and train on the data, and a user must grant this permission, not only agreeing to it but also confirming that they have the right to grant it. Users have to grant a:
“worldwide, non-exclusive, fully paid-up, royalty-free, irrevocable, perpetual, sublicenseable and transferable license to use, reproduce, display, transmit and prepare derivative works of your Media, and to additionally distribute and publicly perform Media in connection with the Service and the Company’s (and its successor’s) business, in any media formats and through any media channels”.
Many archival materials may well be long out of copyright, but others, due to quirks of UK copyright law, may not be, and the archive, while it possesses the physical item, may not own the copyright.
Another example comes from a linguistic analysis tool for research papers. This time the terms limit the supplier's liability, while the user is required to:
“defend, indemnify and hold the Supplier harmless from and against any and all claims, costs, damages, losses, liabilities and expenses (including attorneys’ fees and costs).”
Embedded within the “Russell Group Principles on the use of generative AI tools in education”[3] is the concept of AI Literacy which includes sections on “privacy and data considerations,” “ethics codes,” and “plagiarism.” Using the AI Products Licence Review Tool helps to increase literacy of these important issues by flagging contractual clauses that cover these areas. For example, the tool requires entries for AI training data, privacy statements and contractual wording around the retention or training of user inputs.
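As an illustration only (the tool's actual fields are not reproduced here), the kind of record a quick review might capture could look like the following, with every field name and value hypothetical:

```python
# Hypothetical quick-review record; field names are illustrative, not the
# tool's published structure.
licence_review = {
    "product": "ExampleTranscribe",   # hypothetical AI product
    "trains_on_user_inputs": True,    # does the supplier retain/train on uploads?
    "licence_granted_to_supplier": "perpetual, irrevocable, sublicenseable",
    "privacy_statement_reviewed": True,
    "user_indemnifies_supplier": True,  # as in the clause quoted above
}

# Flag for expert support if any high-risk clause is present, mirroring the
# tool's principle of identifying where non-experts should escalate.
escalate = (licence_review["trains_on_user_inputs"]
            or licence_review["user_indemnifies_supplier"])
print("Seek expert review" if escalate else "Low risk on checked items")
```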
Vitae’s “Researcher Development Framework”[4] references ‘information literacy and management’ in “Domain A: Knowledge and Intellectual Abilities”, as well as ‘legal requirements,’ ‘intellectual property and copyright,’ and ‘risk management’ in “Domain C: Research Governance and Organisation”.
Both the Framework and the Principles offer direction on how GLAM professionals and the users we support should behave ethically and with due consideration, yet the focus on the products leads to these core concepts being overlooked. Digital literacy frameworks are being updated to include AI elements as “Critical digital, media and information literacy skills are crucial now more than ever”[5].
At a time where AI products and usage is increasing, and the benefits of the digital transformation are revolutionising GLAM activity, harnessing AI in the right way is critical. The AI Products Licence Review Tool provides a useful starting point to identify and manage some of the risks in shadow AI usage. GLAM staff are under increasing pressure with budget cuts and reduced resourcing across the sector. Being able to provide self-directed guidance to our colleagues, flagging where support should be sought, will help alleviate some of the AI burden.
By raising awareness of these issues, the AI Products Licence Review Tool will also help GLAM professionals engage with suppliers to ensure the terms of service do not expose our organisations to unlimited liabilities. Enhanced AI literacy will enable our staff and users to maximise the benefits of the AI and digital transformation ethically and safely.
[1] AI Procurement Due Diligence, Jisc, July 2024
[2] Licensing Options for Generative AI, Jisc, October 2024
[3] Russell Group Principles on the use of generative AI tools in education, Russell Group, 2023
[4] The Vitae Researcher Development Framework - Vitae
[5] “AI Literacy & the criticality of Public Libraries,” Feeny, D., Information Literacy Website.
Gallica Images - Applying AI and UX to vast iconographic collections
Danaë Di Salvo, Jean-Philippe Moreux
French National Library (BnF), France
The project we would like to present is one of the main AI programs at the French National Library (BnF).
For over thirty years, the BnF has been committed to digitizing documents in all formats, preserving them, and making them available online, whether from its own collections or those of partner institutions, via Gallica, the BnF's digital library.
This vast digitized collection is primarily made searchable and discoverable through bibliographic metadata and full-text indexing generated via Optical Character Recognition applied to printed materials. This infrastructure has opened possibilities for information retrieval, text and data mining, and scholarly research.
The process has produced huge amounts of digital data, much of it hard to reach. This matters greatly, as the collection contains part of France's national heritage (writings by celebrated authors, first editions of national treasures…) but also documents that help people understand who they are (administrative documents linked to family history, for example).
Iconographic heritage, however, continues to sit somewhat outside the circle of large-scale discovery and reuse. Unlike textual documents, images are not easily transcribed or indexed through automated mass-processing methods comparable to OCR. The semantic richness, interpretive ambiguity, and contextual specificity of visual content pose unique challenges that resist straightforward computational treatment.
For more than a decade, the BnF has made it a priority to address this. Through targeted innovation projects and frameworks, we explored pathways to unlock the potential of our visual collections. In the realm of iconographic content specifically, Gallica Images is one answer to this challenge.
We believe that sharing our journey through the making of Gallica Images during the conference could help and enrich the experience of other GLAM practitioners facing comparable issues.
Gallica Images
As we have established, most of the images in Gallica are not identified as such and can only be accessed by visually browsing through the documents.
Each document to which they belong also bears a wide range of information, but some of it may be missing, depending on the person who registered the document or the state of the nomenclature at the time of recording.
Moreover, we face huge numbers of images (as of today we do not know how many), deriving from heterogeneous origins and of diverse types (decorative vignettes for menus, musical scores, photojournalistic images, graphite drawings, and so on…).
After building a prototype, we decided to build a case for the production of a tool that would channel different models designed to help us retrieve the images and their coordinates within our digitized collections. We were acutely aware that many of our partners throughout Europe were applying AI at a growing scale in their work on digitized data, and the results achieved by the GLAM community so far appeared enticing enough to make us confident in choosing this kind of process.
We identified segmentation as extremely helpful, as well as models able to identify the individual items present in an image. These would allow users to search for specific categories or elements (materials, color, topic…) and allow us to put each image, its coordinates, and all the metadata recorded by librarians over time into a database open to the public and researchers.
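As an illustrative sketch only (the production models are not named in this abstract, so a generic off-the-shelf detector stands in), extracting image regions and their page coordinates for such a database might look like this:

```python
# Illustrative sketch: detecting illustration regions on a digitized page
# and pairing their coordinates with existing catalogue metadata. The
# detector, weights, file names and identifiers below are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder weights; a layout-trained model would be used
results = model("gallica_page.jpg")

records = []
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    records.append({
        "document_id": "ark:/12148/example",  # hypothetical identifier
        "page": 1,
        "bbox": [x1, y1, x2, y2],             # image coordinates on the page
        "librarian_metadata": {},             # existing catalogue record merged here
    })
```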
The database will ultimately be made accessible to users through a dedicated interface, designed and developed with strong UX research and UI design. Given the wide variation in data types and structures, the interface intentionally avoids visual clutter to maintain clarity and usability, which is essential given how heterogeneous the data is.
A well-thought-out design is crucial for a combined library and AI project, as both can be quite complicated to navigate. We have seen plenty of examples of projects whose powerful core was hindered by a front end that decisively undermined its value.
We came together with partners and contracted with a private entity that would provide us with the means to build and adapt extensive AI models. Our north star was to stay in control of the project by making sure the pipeline would be deployed in-house, allowing staff to operate the models and make them evolve.
The global endeavor was funded by the BnF and the Caisse des dépôts et consignations. We expected the project to take three years, starting in 2024.
We initiated an extensive public procurement process aimed at securing the most advanced expertise available in the private sector. The objective was to collaborate with a firm capable of leveraging open-source technologies, particularly models that had already demonstrated high performance in production environments, while bringing in experienced professionals in the field of data science.
Fortunately, the firm we selected had a solid track record of working with public institutions. Their team combined a data science skill set with an understanding of our organizational workflows and constraints. Importantly, they also demonstrated the flexibility to iterate based on our feedback. This responsiveness has been essential to the quality of the outcome.
From the outset, we ensured that the project was supported not only by our core team, but also by key figures across the institution—particularly from the Infrastructure and Systems Department of the National Library. This included the Library's head of mission for AI, who had developed the first prototype for the solution, as well as skills and people from all around the library. We agreed on shared governance between functional and technical staff. This allowed for a true “project mode” with people coming together.
Challenges
Starting in 2022, the project was not without challenges, which we will showcase during the talk.
The first thing we wanted to address was the novelty of it all. In putting into effect the first AI program of this scale at the library, we had to learn everything from scratch and hope for the best. Fortunately, we were able to draw significant lessons from each iteration. We also relied on senior profiles to guide us through the project. Lastly, we navigated this by accepting that not everything would go as planned: we were prepared to negotiate with the firm whenever it could provide rigorous, documented reasons for falling below the threshold of what was asked or departing slightly from the work material we provided.
The second was the pace at which the technology was evolving. We were dealing with very recent technologies, and we noticed early on that the technology was moving faster than we were. We had to upgrade our approach, going from a vision that was librarian-centric, with a system of indexes for every kind of object, to one that was embeddings-driven and less intuitive.
The embeddings allowed us to be extremely accurate when comparing two images, to generate many keywords, and to more easily surface concepts and ideas around a single image. This system represents each image as a long list of numbers, known as a vector, which makes it easier for the machine to compare images and draw on what it has learned. It meant that at the beginning of the project we had to rethink everything, and we lost a little time.
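As a hedged illustration (the pipeline's actual models are unspecified, so a public CLIP checkpoint loaded through sentence-transformers stands in), embedding-based comparison works roughly like this:

```python
# Illustrative sketch of embedding-based image comparison. Model choice
# and file names are assumptions, not the production pipeline.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # illustrative public model

# Each image becomes a vector (here, 512 numbers); similar images end up
# close together in this vector space.
emb_a = model.encode(Image.open("engraving.jpg"))
emb_b = model.encode(Image.open("photograph.jpg"))

# Cosine similarity close to 1.0 means the images are semantically similar.
print(util.cos_sim(emb_a, emb_b))

# The same space supports text-to-image search without hand-built indexes
# for every kind of object:
query = model.encode("a sailing ship in a storm")
print(util.cos_sim(query, emb_a))
```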
Another quite common challenge was the heterogeneity of the data, which made it difficult to find a model accommodating very different cases. We compared the performance of the models without preconceived notions and found that some, sometimes unexpectedly, performed better than others.
Last but not least, the lack of human resources. Initially, we would have needed a lot of data to be labeled by actual humans. This is of course a widespread concern for this kind of project: you need a lot of training and supervision data to make sure that models are working properly. We were not able to satisfy all the demands associated with training. We then chose embeddings partly because they demanded less data and could still provide relevant results.
Concerning design, developing and testing for a non-existent front end was a conundrum. We decided to ground our work in the reality we knew: Gallica. The library has long benefited from the presence of in-house sociologists, and we made the most of it. We already had a solid understanding of our audience, not just from analytics but from years of qualitative insights and direct interactions with users. Yet a lot of our work was exploratory and remains so.
In summary
Working on this project was enriching, giving the library team a quick understanding of the requirements of an AI project. We approached this endeavor with a thorough understanding of our duty as libraries to reshape the balance of power and to make sure this technical effort is ethically and soundly channeled through clear strategy, design insights and thorough communication.
Through Gallica Images, we hope to have demonstrated that AI can be a force for expanding access to knowledge — provided it is developed with respect for context, history, and the human dimension of information.
Descriptive Debt and AI: Rethinking the Future of Archival Processing
Emilie Hardman
ITHAKA, United States of America
AI is increasingly proposed as a remedy for scale-related challenges in libraries, archives, and museums, but its use demands scrutiny, particularly when applied to cultural memory work. This presentation examines the development of Seeklight, an AI archival processing tool designed to address a specific challenge: descriptive debt. Informed by six months of field research with special collections and archives across the US and UK, the presentation situates Seeklight within a broader inquiry into how institutions can responsibly integrate AI into archival workflows.
Descriptive debt refers to the growing gap between the minimal description practices widely adopted in response to backlogs and the more robust metadata necessary to support digital access, discovery, and use. It is the legacy of triage: the result of processing strategies that prioritized physical control and access over descriptive richness, often under conditions of austerity. While More Product, Less Process (MPLP) and less formally defined minimal processing strategies allowed many institutions to gain intellectual control over large volumes of material, their application in a digital context has deferred rather than eliminated descriptive labor. Metadata that was “good enough” for analog settings is frequently insufficient in digital systems.
This problem is particularly acute for collections that document marginalized communities and activist movements. Within JSTOR's Reveal Digital program, an initiative to digitize independent, oppositional, and subcultural publishing for open access, collections such as HIV, AIDS, and the Arts exemplify how sparse or absent metadata can render materials digitally invisible. These records carry deep social, political, and emotional significance, but without adequate description, their visibility and utility to scholars and communities are severely constrained. Seeklight was developed to intervene in this context not as a wholesale solution to archival challenges, but as a tool designed to reduce the friction of descriptive bottlenecks, enabling faster baseline processing while preserving pathways for refinement and context building. It uses generative AI to produce structured metadata aligned with professional standards, with a focus on efficiency, transparency, interoperability, and a continuous centering of the human in the loop.
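Seeklight's implementation is not described here; purely as a hypothetical sketch of the general pattern (an LLM drafting standards-aligned metadata that a human then reviews), with the client, model name and sample text all assumptions:

```python
# Hypothetical sketch of LLM-drafted structured metadata with a human in
# the loop; this is not Seeklight's actual code or prompt.
import json
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured

prompt = (
    "From the following folder-level inventory line, draft descriptive "
    "metadata as JSON with keys: title, dates, scope_and_content, subjects. "
    "Mark any uncertain value with 'needs_review'.\n\n"
    "Folder: HIV/AIDS arts newsletters, 1987-1994, 23 items."  # invented sample
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

draft = json.loads(response.choices[0].message.content)
print(draft)  # an archivist reviews, corrects and contextualizes this draft
```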
From the outset, the development of Seeklight was grounded in field-based research and participatory design. Through direct engagement with 23 institutions, ranging from large academic libraries to small community-based archives, interviews and working sessions were conducted with practitioners responsible for archival and special collections processing, description, systems, and digital stewardship. These conversations revealed widespread recognition of descriptive debt and a clear demand for tools that extend rather than replace human labor. Participants emphasized the importance of robust metadata, the risks of overpromising automation, and the need to preserve space for community and reparative description. These insights directly informed Seeklight's design and helped define its boundaries.
This presentation argues that the responsible integration of AI into archival practice requires more than technical feasibility. It demands attention to institutional capacity, labor structures, and the politics of representation embedded in metadata. Descriptive debt is not a technical failure; it is a structural inheritance. And any attempt to address it through AI must be evaluated not only in terms of outputs, but in how it redistributes labor, encodes values, and shapes future infrastructure.
AI pedagogy and new methods for humanities scholars: A reflective case study
Chris Haffenden, Justyna Sikora
Kungliga biblioteket, National Library of Sweden, Sweden
How can we address the divide between increasingly large-scale computational approaches to GLAM collections and the more qualitatively inclined perspective of humanities scholars focused on close reading (Jaillant and Aske 2024)? And how might we raise user awareness of the possibilities and limitations of novel AI-based search systems, which are becoming an increasingly prevalent feature of the research landscape, even as AI literacy among scholars remains distinctly varied? What pedagogical options for such methodological outreach are available to those of us who work hands-on with digital research infrastructure and research services?
This brief talk draws on our experience at KBLab at the National Library of Sweden (Börjeson et al. 2024), working with various outreach initiatives within Huminfra, the Swedish national infrastructure for digital research (https://www.huminfra.se/). In particular, we discuss a new workshop we have developed to showcase the potential of multi-modal topic modelling for heritage organisations with large collections of unlabelled images and for researchers who explore visual culture. This builds on our earlier work both with using CLIP (Contrastive Language–Image Pre-training) to enhance the searchability of the library's postcard holdings (Haffenden et al. 2023), and with making a user-friendly, script-based workshop for BERTopic (Rekathati). We reflect upon our design choices in using Google Colab to show how image collections can be clustered according to topic, especially the questions of how much code to include and how many steps it is feasible to explain.
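As a minimal sketch of the kind of pipeline such a workshop demonstrates (the actual Colab notebook may differ; the model, clustering choice and paths here are illustrative), unlabelled images can be embedded with CLIP and grouped into visual topics:

```python
# Illustrative sketch: embed unlabelled images with CLIP and cluster them
# into "topics". BERTopic-style pipelines typically use UMAP + HDBSCAN;
# KMeans is used here only for brevity.
from pathlib import Path
from PIL import Image
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("clip-ViT-B-32")  # illustrative public model

paths = sorted(Path("postcards").glob("*.jpg"))  # hypothetical image folder
embeddings = model.encode([Image.open(p) for p in paths])

clusters = KMeans(n_clusters=10, random_state=0).fit_predict(embeddings)

for path, cluster in zip(paths, clusters):
    print(cluster, path.name)  # inspect which images share a visual topic
```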
By focusing on a concrete scenario, we highlight how AI literacy efforts in the humanities can be rooted in domain-specific needs and questions. We argue that effective outreach requires not only demystifying technical tools, but also fostering dialogue around how such tools intersect with established interpretive practices. The case study illustrates the value of flexible, script-based formats that empower researchers to engage critically with emerging methods without requiring full technical fluency. At the same time, we remain sceptical about the extent to which short, one-off workshops can substantially raise technical proficiency, and instead emphasise the importance of longer-term, transdisciplinary forms of methodological interaction.
References:
Börjeson, L., Haffenden, C., Malmsten, M., Klingwall, F., Rende, E., Kurtz, R., Rekathati, F., Hägglöf, H., & Sikora, J. 2024. “Transfiguring the Library as Digital Research Infrastructure: Making KBLab at the National Library of Sweden.” College & Research Libraries, 85(4), 564–582. https://doi.org/10.5860/crl.85.4.564.
Haffenden, C., Rekathati, F. & Rende, E. 2023. “Unearthing Forgotten Images With the Help of AI.” The KBLabBlog. https://kb-labb.github.io/posts/2023-10-20-unearthing-forgotten-images-with-the-help-of-ai/. (Link to the image search demo itself: https://lab.kb.se/bildsok)
Jaillant, L. & Aske, K. 2024. “Are Users of Digital Archives Ready for the AI Era? Obstacles to the Application of Computational Research Methods and New Opportunities.” Journal on Computing and Cultural Heritage, 16(4). https://doi.org/10.1145/3631125.
Rekathati, F. “BERTopic Workshop: Analyzing Swedish Parliamentary Motions.” https://colab.research.google.com/drive/10kB3wfoHSfZE48vEKmznIw-ff36uR8gs?usp=sharing (Colab script, requires Google log-in)