Conference Agenda
WS13: Auto-Cataloging Research Materials from the Endangered Archives Programme
Presentations
Auto-Cataloging Research Materials from the Endangered Archives Programme
1 Princeton University, United States of America; 2 University of New Brunswick, Canada; 3 University of Pennsylvania, United States of America; 4 Grupo de Investigación sobre historia regional: Fundación Muntú Bantú
Building on our experience with materials from the British Library’s Endangered Archives Programme and 19th-century court records from the Circuit Court of Istmina, Chocó, Colombia, which are damaged and in different manuscript and typescript formats, we will teach participants how to utilize AI to generate research data from document images. Using vision-language models (VLMs), we will demonstrate how to extract text and other metadata from images and publish this data in accessible formats for research. This transformative process makes it possible to search and analyze collections of digitized documents in ways that facilitate exploratory analysis and computational research methods.
Machine-readable text makes documents more accessible to researchers, students, and the public. Additionally, auto-cataloging of the collection provides standardized metadata, such as the occurrences of person names, place names, and case summaries, which facilitates research across the entire collection. VLMs are particularly useful for handwritten text recognition and the description of materials at the case, box, and collection levels.
Our workshop seeks to address three key outcomes:
We will address common problems facing researchers with optical character recognition (OCR) and handwritten text recognition (HTR). The most common problem is finding a model that is “good enough.” Most out-of-the-box general models do not provide sufficiently accurate results for researchers’ needs. These models work well for modern languages but do poorly with historical documents or languages with non-Latin scripts. Platforms like Transkribus and eScriptorium provide essential tools for annotating such text and creating models for specific scripts, periods, and languages. However, our team’s experience with eScriptorium resulted in some frustration: too often, the generated text was spotty or so inaccurate as to be useless. Segmentation was inconsistent across the different types of documents in the collection, and any text missed by the segmenter is not processed in the recognition stage.
In response, we shifted to transformer-based vision-language models (VLMs), which provide significantly improved handwritten text recognition and extraction of complex metadata for our nineteenth-century typescript and manuscript court documents from Colombia. Rather than working with whole images, we split the document into chunks and use smaller, fine-tuned models for specific tasks. For example, we can generate legal case summaries and identify names and places in the context of the case. We identify not only a person’s name, but also that they were the defendant or judge. A place is not just a name, but the place where the crime occurred or the court's jurisdiction. The capabilities of VLMs allow us to create automatic workflows that not only extract text for research but also offer the metadata essential for creating archival finding aids.
Timeline (2 hours total)
First session (45 minutes)
Break (10 minutes)
Second session (45 minutes)
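The chunking step mentioned in the overview above (splitting a document image rather than processing it whole) could be sketched as follows. This is a minimal illustration, not the workshop's actual code: the abstract does not say how chunks are defined, so the choice of overlapping horizontal strips, the function name `strip_boxes`, and the example dimensions are all assumptions. The overlap is there so that a line of handwriting cut by one strip boundary appears whole in the neighbouring strip.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; hypothetical type alias for this sketch
Box = Tuple[int, int, int, int]


def strip_boxes(width: int, height: int, strip_height: int, overlap: int) -> List[Box]:
    """Return crop boxes for overlapping horizontal strips that cover a page.

    Assumption: chunks are full-width strips. The source does not specify
    the chunking scheme, so treat this as one possible approach.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if strip_height <= 0:
        raise ValueError("strip_height must be positive")
    if not 0 <= overlap < strip_height:
        raise ValueError("overlap must be at least 0 and smaller than strip_height")

    step = strip_height - overlap  # distance between the tops of adjacent strips
    boxes: List[Box] = []
    top = 0
    while True:
        # Clamp the last strip to the bottom edge of the page.
        bottom = min(top + strip_height, height)
        boxes.append((0, top, width, bottom))
        if bottom == height:
            break
        top += step
    return boxes


if __name__ == "__main__":
    # Illustrative page size only; real scans will differ.
    for box in strip_boxes(width=2000, height=3000, strip_height=1000, overlap=200):
        print(box)
```

Each box uses the (left, upper, right, lower) ordering that Pillow's `Image.crop` accepts, so the strips can be cut from a loaded page image and sent to a model one at a time.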
Resources
Workshop participants will need a laptop with an internet connection. All code notebooks will be available in a GitHub repository and can be run in the browser using Codespaces or Google Colab. These systems require a Google or Microsoft user account. Participants are welcome to download and run the code locally if they prefer. We will use the Qwen2-VL-2B-Instruct model available on HuggingFace for the workshop. This relatively small model can be downloaded and run on consumer computers. We will also provide an API key to access Qwen models on Alibaba Cloud. Our process is model- and provider-agnostic, so participants can use the models and resources they prefer going forward.
Preparation
Participants should know basic Python syntax and be familiar with Jupyter Notebooks, JSON, and YAML. We will provide links to instructional material on these topics for preparation.
Additional information about instructors
Andrew Janco has taught several previous workshop series, including “New Languages for NLP: Building Diversity in the Digital Humanities” at Princeton’s Center for Digital Humanities. He provides technical expertise to Professors Farnsworth-Alvear and Tubb on their projects related to gold and platinum mining in the Chocó region of Colombia. This work includes the fichero catalog system and software.
Ann Farnsworth-Alvear is a co-editor of The Colombia Reader and author of Dulcinea in the Factory. She teaches Latin American History at the University of Pennsylvania. Her current work explores the extraction of platinum and gold from the San Juan River and its tributaries between 1870 and 1970; it depends on collaborations with other researchers, especially Daniel Varela of the University of Michigan and the Semillero de Jóvenes del Centro de Memoria Muntú Bantú, in Quibdó, Colombia.
Daniel Tubb is an anthropologist working on gold mining, agrarian change, and rural life in Colombia. He is the author of Shifting Livelihoods: Gold Mining and Subsistence in the Chocó, Colombia, and he is working with Ann Farnsworth-Alvear and Andrew Janco on projects exploring extraction and memory in Colombia’s San Juan and Atrato River regions. Their work includes the development of fichero, a digital tool for cataloguing and transcribing historical documents related to mining and land struggles.
Kelly López-Roldán is a doctoral student at the University of Pennsylvania. She has participated in various digitization projects involving materials in Colombia, including EAP1477, EAP1531, and EAP1740. Her experience with public history projects focused on memory and engagement has included publishing collectively authored books that showcase community-level research with primary sources.
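As a preparation aid for the JSON handling listed under Preparation, the role-aware metadata described in the workshop overview (a person recorded as defendant or judge, a place as the site of the crime or the court's jurisdiction) could be represented and parsed along the following lines. This is a hedged sketch only: the field names `summary`, `people`, `places`, `name`, and `role`, the class names, and the sample model output are all invented for illustration and are not taken from the workshop notebooks or from fichero.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Entity:
    """A name plus the role it plays in the case (hypothetical schema)."""
    name: str
    role: str


@dataclass
class CaseRecord:
    """Catalog entry for one court case (hypothetical schema)."""
    summary: str
    people: List[Entity] = field(default_factory=list)
    places: List[Entity] = field(default_factory=list)


def _entities(items: Any) -> List[Entity]:
    """Keep only well-formed {"name": ..., "role": ...} items."""
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if isinstance(item, dict) and item.get("name"):
            role = item.get("role") or "unknown"  # tolerate a missing or null role
            entities.append(Entity(name=str(item["name"]).strip(), role=str(role).strip()))
    return entities


def parse_case_record(model_output: str) -> CaseRecord:
    """Parse a model reply that contains one JSON object, possibly with extra text around it."""
    start = model_output.find("{")
    end = model_output.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(model_output[start:end + 1])
    return CaseRecord(
        summary=str(data.get("summary") or "").strip(),
        people=_entities(data.get("people")),
        places=_entities(data.get("places")),
    )


if __name__ == "__main__":
    # Invented reply for illustration; not real archival data.
    sample = (
        "Here is the result:\n"
        '{"summary": "Dispute over a mining claim.", '
        '"people": [{"name": "Person A", "role": "defendant"}, '
        '{"name": "Person B", "role": "judge"}], '
        '"places": [{"name": "Istmina", "role": "court jurisdiction"}]}'
    )
    record = parse_case_record(sample)
    print(record.summary)
    for person in record.people:
        print(person.name, "-", person.role)
```

Keeping the role next to each name, rather than storing a flat list of names, is what lets a finding aid answer questions such as which cases a given person appeared in as a judge.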