Conference Agenda
Session
SP05: Short papers

Presentations
Errare Machina Est? The Impact of Errors in Large-Scale AI-Based Information Extraction from Archival Documents
1 TEKLIA, France; 2 Institut national d'études démographiques, France; 3 Paris School of Economics, France

Introduction

Automatic information extraction using artificial intelligence—particularly handwritten text recognition (HTR)—is transforming research in the quantitative humanities [1]. By automating the transcription of vast archival collections, AI enables the creation of datasets of a scale previously unimaginable. The SocFace project exemplifies this shift: it aims to process all surviving French census records from 1836 to 1936, covering more than 25 million pages and an estimated 500 million individual entries. Using an AI pipeline for page classification, text recognition, and entity tagging [2], we have already extracted over 150 million person-level mentions.

Yet this transformation comes with new challenges. AI-generated data are not only noisy—they are noisy in fundamentally different ways than human transcription. Errors are dispersed across a “long tail” of rare variants, and may appear deceptively accurate. This paper takes a critical look at the nature of these errors and their impact on downstream use. We ask: What types of errors do machine learning systems produce? How do they differ from human mistakes? What are the implications for research, archival practice, and genealogy? And how can we evaluate and manage them responsibly? Above all, we argue that errors are not absolute: their impact depends both on the context in which they occur and the purpose for which the data are used.

Uncertainty by Design: why AI gets it wrong

Understanding why AI makes errors starts with understanding its statistical nature. Unlike deterministic systems, AI models are probabilistic, predicting outcomes based on patterns in training data rather than fixed rules. The accuracy of these predictions depends on the quality of the input, the structure of the document, and how well the training data represent the variation in the sources. Variability in output is not an anomaly — it's inherent to the system.

Evaluating probabilistic systems is equally complex. Although widely used, Character Error Rate (CER) often fails to convey the real impact of errors, particularly when a single character significantly alters meaning, as in names or occupations that differ by gender. Although closer to human perception, Word Error Rate (WER) also misrepresents error severity when some words carry more weight than others. In both cases, context is important: minor errors in fields with high variability can be misleading, whereas major errors in more constrained fields are easier to detect. Metrics such as precision and recall provide a more targeted evaluation, but they must be adapted to noisy settings, for instance through fuzzy matching.
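To make the metric discussion concrete, the following is a minimal sketch of how CER and WER can be computed as normalised edit distances at character and word level. It is illustrative only, not the SocFace evaluation code, and the example strings are invented; it shows how a single gendered character difference registers as a small CER while the meaning of the field changes entirely.

```python
# Minimal sketch: CER and WER as normalised edit distances.
# Illustrative only; not the SocFace evaluation code, example strings invented.

def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

# A couple of misread characters turn a feminine occupation into a masculine one:
# the CER stays small, yet the recorded gender of the person is lost.
print(cer("couturière", "couturier"))
print(wer("Marie couturière", "Marie couturier"))
```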
Too wrong to read, or too polished to be true?

Errors produced by AI systems generally fall into two broad, equally problematic categories. The first category includes non-words or out-of-vocabulary terms — character sequences that do not correspond to any word or term in reference lists. While these errors are often easy to detect, they can be difficult to correct. Although helpful, reference lists are inevitably incomplete, and attempts at normalisation risk introducing further errors or erasing meaningful historical variation through excessive standardisation.

The second category is more insidious: hallucinations. These are grammatically and visually plausible outputs that do not correspond to anything in the original document. Unlike non-words, hallucinations cannot be identified by comparing them with external references because they blend seamlessly into the transcription. Their plausibility makes them particularly dangerous, as they hinder data validation and can erode user trust in both the dataset and AI tools more broadly.

Fifty shades of wrong

Unlike humans, who tend to make a limited number of systematic transcription errors, statistical models produce a wide range of variations around the correct form. This diversity results in an 'error cloud', whereby the same name can be misrecognised in a dozen different ways, none of which occurs frequently enough to be corrected systematically. We present a case study of first names in the SocFace dataset and show that the number of distinct first names produced by the AI model exceeds that found in reference lists and in genealogical data produced by humans. This suggests that AI systems do not simply misread words, but generate a variety of approximate alternatives, some closer to the correct form than others. A huge number of them are unique occurrences, which makes correction against a fixed list, whether pre-existing or built from the data themselves, very difficult. Indeed, in our first batch of data, containing around 140M lines from the French censuses, there are 2M distinct first names, of which 1.5M occur only once, many being orthographic variations that a human coder would never produce. These variations have direct negative impacts on linking individuals between censuses, as has been shown for human transcription errors in the US case [3].

This variability poses significant challenges for subsequent use. In search interfaces, for example, it undermines exact string matching and complicates fuzzy searches. In statistical analysis, it introduces data sparsity and impairs the construction of reliable aggregates. In historical inference, it can obscure or distort patterns relating to naming, migration, or family structure. The problem is not only that errors exist, but also that they are too diverse to be easily managed.

Eyes to See, Minds to Know

Because of the nature of AI-generated data, human knowledge of context, logic and domain-specific rules is crucial for identifying errors that the models themselves cannot detect. In the SocFace project, we observed that certain external features were correlated with higher error rates. For example, the final pages of registers often contained more 'hallucinations', likely due to an irregular layout or the model's assumption that every page should contain 30 lines. While these predictable patterns cannot be fully corrected, they can be flagged algorithmically.

Internal coherence checks also play a critical role. In structured records such as censuses, inconsistencies between fields — such as a daughter being older than her mother, or a household listing two heads — can indicate errors. While these contradictions may not yield automatic corrections, they offer useful clues for review and analysis. To support end users, we advocate the creation of quality indices that combine error signals, confidence scores and contextual logic. Rather than classifying data as simply right or wrong, these indices would convey degrees of reliability to help users judge how to interpret, verify or exclude certain records.
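As an illustration of what such coherence checks and quality indices can look like, the sketch below flags two of the contradictions mentioned above (a child older than their mother, a household with more or fewer than one head) and blends them with mean recognition confidence into a simple per-household reliability score. The field names, thresholds and weights are hypothetical and do not reflect the SocFace schema or pipeline.

```python
# Illustrative sketch of rule-based coherence checks feeding a quality index.
# Field names, weights and the example household are hypothetical, not SocFace data.

def coherence_flags(household):
    """Return a list of coherence problems found in one household (list of person dicts)."""
    flags = []
    heads = [p for p in household if p.get("relation") == "head"]
    if len(heads) != 1:
        flags.append("household_head_count")  # e.g. two household heads, or none
    mother_ages = [p["age"] for p in household if p.get("relation") == "wife"]
    for person in household:
        if person.get("relation") in ("daughter", "son") and mother_ages:
            if person["age"] >= min(mother_ages):
                flags.append("child_older_than_mother")
    return flags

def quality_score(household, weight_per_flag=0.2):
    """Combine mean recognition confidence with a penalty per coherence flag (0-1 scale)."""
    confidence = sum(p.get("confidence", 1.0) for p in household) / len(household)
    penalty = weight_per_flag * len(coherence_flags(household))
    return max(0.0, confidence - penalty)

# Invented example: a daughter recorded as older than her mother.
household = [
    {"name": "Jean", "relation": "head", "age": 41, "confidence": 0.93},
    {"name": "Marie", "relation": "wife", "age": 38, "confidence": 0.90},
    {"name": "Louise", "relation": "daughter", "age": 52, "confidence": 0.71},
]
print(coherence_flags(household))          # ['child_older_than_mother']
print(round(quality_score(household), 2))  # reliability score rather than right/wrong
```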
Conclusions

Automatic transcription should not be viewed as a perfect representation of documents, but rather as a noisy gateway to historical information. Just as the experimental sciences routinely deal with measurement error and uncertainty, and the social sciences have long worked with incomplete or biased datasets, scholars using AI-generated data must accept a certain degree of imperfection. The aim is not to eliminate error — an impossible task given the inaccessibility of historical truth — but rather to model and manage it. In this context, AI transcription is not a substitute for human understanding, but rather a tool that enables access on a large scale, facilitating the analysis of vast archives that would otherwise remain inaccessible. However, this access comes with responsibilities: we must understand the sources of noise and recognise the forms that errors take. We must also build methods that account for their presence in the production of historical knowledge. In the SocFace project, we are learning to strike a balance between accuracy and scale, and between automation and expertise. The challenge is not to silence the machine's mistakes, but to learn to read through them with eyes open and minds alert.

[1] Nockels, J., Gooding, P., & Terras, M. (2024). The implications of handwritten text recognition for accessing the past at scale. Journal of Documentation, 80(7), 148-167.
[2] Boillet, M., Tarride, S., Schneider, Y., Abadie, B., Kesztenbaum, L., & Kermorvant, C. (2024). The Socface Project: Large-Scale Collection, Processing, and Analysis of a Century of French Censuses. In Document Analysis and Recognition, LNCS, vol. 14806.
[3] Hwang, S. I. M., & Squires, M. (2024). Linked Samples and Measurement Error in Historical US Census Data. Explorations in Economic History, 93, 101579.

Collections Explorer: Building a User-Centered, AI-Driven Discovery Platform at Harvard Library
Harvard Library, United States of America

Academic libraries are experiencing growing pressure to evolve discovery systems that serve both expert researchers and casual users in a digital environment increasingly shaped by generative AI. At Harvard Library, we are addressing this challenge through Reimagining Discovery, a multi-year initiative that aims to modernize how users discover and engage with our vast and diverse collections. This talk will present the goals, strategy, implementation process, and lessons learned from the first phase of the project, focusing on the development of Collections Explorer—an experimental platform that offers natural language searching, semantic retrieval, and generative AI features.

The Harvard Library discovery environment spans billions of records and multiple platforms, including HOLLIS, HOLLIS for Archival Discovery, and CURIOSity Digital Collections. These collections encompass over 14 million print titles, 4 million digital items, 15,000 archival finding aids, and a wide range of content types—from images to spatial data and born-digital objects. As users struggle to navigate these fragmented systems, they increasingly turn to AI tools like ChatGPT for discovery. Analytics confirm that traffic from AI chatbots to Harvard’s library websites is rising. A 2024 survey of over 200 undergraduates found that users now expect search systems to support natural language input. These insights reinforce the need for discovery platforms that support natural language searching.
Collections Explorer is Harvard Library’s first prototype to integrate semantic search and large language models (LLMs) into a user-centered discovery platform. It was developed to address existing usability issues, especially around special collections. Feedback from faculty and students underscores these challenges: discovery is overwhelming, inconsistent, and intimidating—particularly for new researchers. Users want clearer pathways into the collections and contextual explanations for why specific results appear. To address these pain points, we have created a platform that supports discovery via natural language and provides AI-generated related prompts and summaries that help users easily explore materials.

To accelerate our learning and refine the product strategy, we partnered with Mozilla.ai—a mission-aligned startup—to design a technical framework and strengthen our team’s AI development capabilities. Through a collaborative kickoff workshop and three months of design partnership, we scoped and launched the initial build phase for Collections Explorer.

At the core of our product strategy was a commitment to user-centered design. We built on a framework that prioritized user value, measurement, and validation through embedded analytics, user testing, and iterative feedback. We integrated students as Product Insights Interns throughout the development process. These students represented both special collections researchers and general users, and they participated in biweekly meetings, asynchronous design reviews, and usability evaluations. Their feedback—alongside input from technical stakeholders and accessibility testers—informed both the interface and the AI features embedded in the system.

The AI components of Collections Explorer include two major technologies: semantic retrieval using an embedding model and a large language model for generative features. For semantic search, we first selected an open source embedding model, which allowed the system to match queries to conceptually related results, even when keywords do not align. The LLM-powered explanation of relevance ("About this Material") helps users understand these matches by referencing semantic concepts from the metadata. Large language models also power additional features: suggested searches ("You Might Also Try") and search translation ("Need Books or Articles"), which transforms natural language prompts into traditional keyword queries. These LLMs were selected based on quality and tone, as evaluated by our student interns and library staff. All LLM outputs were controlled using carefully tuned system prompts to ensure relevance.
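As a rough illustration of the embedding-based retrieval described above, the sketch below encodes metadata records and a natural-language query as vectors and ranks the records by cosine similarity. It assumes the open-source sentence-transformers package with a generic publicly available model as a stand-in; the model name, records, and query are placeholders, not the Collections Explorer configuration.

```python
# Minimal sketch of embedding-based semantic retrieval over metadata records.
# Assumes the open-source `sentence-transformers` package; the model name,
# records and query are placeholders, not Harvard Library's production setup.
from sentence_transformers import SentenceTransformer, util

records = [
    "Letters and diaries of a 19th-century New England whaling family",
    "Photographs documenting Boston streetcar construction, 1890-1910",
    "Broadsides and pamphlets from early American political campaigns",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
record_vectors = model.encode(records, normalize_embeddings=True)

query = "everyday life aboard whaling ships"
query_vector = model.encode(query, normalize_embeddings=True)

# Cosine similarity lets conceptually related records match the query
# even when they share no keywords with it.
scores = util.cos_sim(query_vector, record_vectors)[0]
for score, text in sorted(zip(scores.tolist(), records), reverse=True):
    print(f"{score:.2f}  {text}")
```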
Usability testing of the prototype, conducted in Fall 2024, provided critical insights that are shaping the next phase of development. Participants expressed a strong preference for individual item-level results over collection-level descriptions, citing efficiency and relevance. Users also valued transparency about AI functionality, requesting clearer labeling and visible explanations without needing to click. Advanced researchers preferred AI-generated relevance statements that focused on objective item descriptions rather than inferred user intent. Users appreciated the suggested searches but wanted them more closely tailored to the Harvard Library context. Feedback also highlighted the importance of clarity around digitization: users want to know whether an item is digitized and accessible without navigating through multiple pages. We are responding by improving metadata visibility and interface design to prioritize access status and enable filtering by digitization, year, repository, and language.

With a full launch planned for the next academic year, our technical team has built a robust ingest pipeline to manage the flow of metadata records from disparate sources. The team also evaluated multiple vector databases and implemented one capable of supporting tens of millions of metadata records. The technical team partnered with librarians to evaluate which embedding model to use in production: librarians assessed the relevance of results returned by three different models, and the team chose the one that performed best. The team also plans to repeat this evaluation for the LLMs used in the summarization and suggested-searches features.

We are designing the system with flexibility in mind: AI models can be swapped as the landscape evolves, and hybrid search (combining semantic and keyword approaches) will ensure relevance and precision.
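The sketch below gives a rough sense of what such hybrid scoring can look like: a keyword score (a toy token-overlap measure here, standing in for BM25 or an existing catalogue index) is blended with an embedding similarity like the one in the previous sketch. The weighting, functions, and example values are illustrative assumptions, not the production design.

```python
# Illustrative hybrid ranking: blend a keyword score with a semantic score.
# The 50/50 weighting and the toy keyword scorer are assumptions for this sketch;
# a production system would typically rely on BM25 or the existing catalogue index.

def keyword_score(query, text):
    """Fraction of query tokens that literally appear in the record text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / max(len(query_tokens), 1)

def hybrid_score(query, text, semantic_score, alpha=0.5):
    """alpha weights the semantic component; (1 - alpha) the keyword component."""
    return alpha * semantic_score + (1 - alpha) * keyword_score(query, text)

# Example, reusing a cosine similarity such as the one computed in the earlier sketch:
print(hybrid_score("whaling ships diaries",
                   "Letters and diaries of a 19th-century New England whaling family",
                   semantic_score=0.62))
```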
In parallel, we are expanding the scope of indexed content beyond archival finding aids to include digital images, full-text documents, and born-digital materials. Ultimately, this system will serve as a foundation for modernizing discovery across the Harvard Library ecosystem—transforming not just how users search, but how they begin to explore, understand, and connect with the richness of Harvard’s collections.

Throughout the project, we’ve anchored our work in the values of openness, trust, transparency, and user engagement. We are budgeting for the operational realities of AI, including compute costs, environmental impact, and ongoing model maintenance. We are also investing in long-term user education and cross-functional collaboration to support responsible adoption of these technologies.

This talk will provide an in-depth view into the research-driven development process behind Collections Explorer, including technical architecture, user research, model selection, and evaluation. It will also reflect on broader implications for academic libraries adopting AI-powered systems: how to balance innovation with user needs, how to future-proof design in a rapidly shifting technical environment, and how to remain grounded in library values while embracing emerging tools.

Tailor-made or ready-to-wear? The challenge of reusing computer vision models and processing workflows
1 Ecole nationale des chartes - PSL, France; 2 MSH Mondes
Format: Panel (45 min)
Presenters:

Summary: PictorIA is a consortium for the development of shared practices and tools in computer vision for cultural heritage institutions in France, hosted by the research infrastructure Huma-Num. The consortium is co-chaired by Julien Schuh at MSH Mondes, Anne-Violaine Szabados at CNRS and Jean-Philippe Moreux at the BnF. Its aim is to help develop a joint community of cultural heritage institutions and research teams in the digital humanities, working together towards the implementation of computer vision solutions in libraries, museums and archives.

One of the major questions that has arisen within PictorIA is the reuse of available tools and workflows across projects with different types of collections and different goals. While many computer vision tools and models are now mature and have proved their effectiveness in several cultural heritage use cases—including, for instance, YOLO, SAM, CLIP, Florence-2, and tools such as Label Studio and Roboflow—adapting them to the specific needs of a project often remains a challenge. It means finding a balance between a "tailor-made" approach, where each project would develop its own tools and protocols, and a "ready-to-wear" approach, where models and workflows are already available but require the projects to adapt. PictorIA provides a forum for comparing use cases, needs and methods, sharing vocabulary and experimental protocols, designing adaptable processing workflows and making sure that they are documented and fit for reuse. In this panel, we would like to discuss how we are working towards a balance between ready-to-wear and tailor-made, based on some core examples that have been shared within PictorIA during its first year of existence.
Potential discussion topics:
- Reusability: When does a pretrained model stop being useful due to bias or changes in the target domain? How can we evaluate whether a model should be reused, retrained or replaced?
- Interoperability and institutional constraints: How do existing metadata structures, cataloguing practices and rights restrictions affect the reuse of AI workflows? What practical steps can ensure these workflows stay consistent with FAIR principles?
- Scalability: How do we move beyond experimental notebooks or demos to build processing pipelines that can handle large collections, be maintained in the long term and be reused across institutions?
- Cross-domain collaboration: How can we make documentation, terminology, and onboarding processes more accessible to researchers, engineers and cultural heritage professionals working together?
Collaborative AI for Cultural Heritage: Building Trust and Literacy through Community and Institutional Partnerships
1 READ-COOP SCE, Austria; 2 British Library; 3 Wikimedia Foundation

Artificial Intelligence is increasingly permeating every aspect of society, from industry to education to everyday digital interactions. The cultural heritage sector faces particular challenges and opportunities in this landscape. On the one hand, AI holds the potential to dramatically expand access to rare, at-risk, and dispersed collections of historical texts. On the other, its use raises questions of trust, inclusivity, transparency, and the role of both institutions and communities in shaping how AI tools are developed and deployed. This paper explores collaborative models for AI in the cultural heritage domain, emphasising how partnerships between custodial institutions, volunteer communities, and cooperative technology providers can advance not only access to cultural materials but also AI literacy and democratic engagement.

A central case study is the Wikisource Loves Manuscripts programme, which demonstrates the transformative potential of participatory AI for heritage preservation. Volunteers engaged in this initiative can train custom AI models to transcribe different languages, including historical Indonesian materials such as Javanese manuscripts. In many cases, such models are essential to make recognition of these texts possible in the first place, opening up access to materials that would otherwise remain inaccessible or at risk of loss. The project has already contributed to the preservation of over 20,000 manuscripts. Beyond digitisation, the programme strengthens communities by equipping participants with digital literacy and technical skills, enabling them to act as both stewards and interpreters of their own cultural heritage. The emphasis on transparency and accessibility ensures that AI is not an opaque, external imposition but rather a tool shaped by and for the communities that use it.

The British Library provides another perspective on how publicly owned custodians of written heritage are advancing the integration of AI. Through targeted digitisation programmes, such as those focusing on Asian rare books and manuscripts and the Endangered Archives Programme (EAP), the Library has made significant strides in expanding access to fragile and geographically dispersed collections. Its collaborations with Wikimedia platforms further illustrate the power of institutional-community partnerships in fostering global digital literacy and public engagement.
Crucially, the Library’s work within a trilateral partnership with Wikisource and READ-COOP reflects a shift towards a community-centred, AI-driven approach, in which large institutions share authority and agency with volunteers and cooperative developers. This model demonstrates how heritage custodians can play an enabling role, ensuring quality and sustainability while supporting openness and inclusivity.

The cooperative model pioneered by READ-COOP, through its flagship platform Transkribus, offers an alternative to more commercial or centralised approaches to AI in cultural heritage. As a community-owned initiative, Transkribus allows institutions, researchers, and individuals to collaborate in the digitisation, transcription, and enrichment of archival materials without the prohibitive costs or restrictions that often accompany proprietary systems. Its flexibility in creating customisable, user-friendly AI models has been particularly impactful in contexts where resources are limited, including low- and middle-income countries. By prioritising explainable AI and collective ownership, the cooperative strengthens both trust and literacy, positioning AI as a shared resource rather than a technology controlled by external actors. The recent collaboration between Transkribus, Wikisource, and the British Library represents a major step forward in aligning community-driven and institutional efforts, offering a practical example of how AI development can remain transparent, participatory, and socially responsible.

Equally significant is the role of education and co-creation in ensuring that AI is both inclusive and culturally respectful. Workshops designed around Transkribus highlight how stakeholder involvement must move beyond the framing of volunteers as passive transcribers. Instead, participants are treated as knowledge experts whose insights are essential for building models that accurately and respectfully represent the cultural material involved. This approach foregrounds listening, exchanging, and learning as central to workshop design, making the process itself a site of cultural engagement. From these experiences emerges the notion of “Minimal ATR”—a low-barrier approach to AI-assisted text recognition that is accessible not only in terms of use but also in terms of creation. By enabling communities to develop their own models, this approach decentralises control of AI and makes it more adaptable to diverse cultural and linguistic contexts.

Together, these case studies illustrate a broader argument: that building trust and literacy in AI requires collaborative, community-centred strategies. The examples of Wikisource, the British Library, and READ-COOP demonstrate how heritage institutions, volunteer networks, and cooperatives can pool their expertise to democratise access, empower communities, and establish sustainable practices for the future. Rather than treating AI as an external technology to be adopted, these initiatives embed AI within ongoing cultural processes, ensuring transparency, inclusivity, and respect for diverse forms of knowledge.