Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only the sessions held on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).

 
 
Session Overview
Session
SESSION#14: COLLECTIONS AS DATA & DATA QUALITY
Time: Friday, 31/May/2024, 8:45am - 10:15am

Session Chair: Mahendra Mahey, Tallinn University, Estonia
Location: H-205 [2nd floor]

Location map: https://www.hi.is/sites/default/files/atli/byggingar/khi-stakkahl-2h_2.gif

Presentations
8:45am - 9:15am

Data Availability and Evaluation Reproducibility for Automatic Date Detection in Texts, a Survey

Tommi Jauhiainen

University of Helsinki, Finland

Automatic date detection in texts has been applied to various languages and periods in research related to digital humanities. In this article, we investigate the availability of the datasets used in published experiments that include a date detection component. We take a closer look at roughly two dozen of the most recent articles. Primarily, for each of these articles, we examine whether the dataset used can be acquired based on the information presented in the article itself. Secondarily, we examine whether the exact evaluation setting described in the article can be reproduced, e.g., whether the dataset can be divided into the same training and testing portions. We find that, as far as the datasets are concerned, we would be able to reproduce the evaluation setting of four of the articles using the information in the articles themselves.

Jauhiainen-Data Availability and Evaluation Reproducibility for Automatic Date Detection-167.pdf


9:15am - 9:45am

Source criticism, bias, and representativeness in the digital age: A case study of digitized newspaper archives

Jørgen Burchardt

Museum Vestfyn, Denmark

Historians must critically scrutinize their sources, a task further complicated in the digital age by the need to evaluate the technical infrastructure of digital archives. This article critically examines digital newspaper archives, revealing error rates in optical character recognition (OCR) that compromise the reliability of results, and word frequency-based datasets that introduce biases due to issues in the shaping of the OCR corpus and in later post-processing. Beyond technical issues, copyright restrictions hinder access to crucial newspapers, while incomplete archives pose representativeness challenges. Accessing datasets from different countries is cumbersome, commercial archives are costly, and uneven publication rates necessitate corrections over time. The use of digital archives presents new tasks: the researcher needs to account for the reliability of the digital source, which can often only be done in interdisciplinary working groups. The digital archives must ensure transparency by detailing to researchers the technical manipulations performed on the original source.

Burchardt-Source criticism, bias, and representativeness in the digital age-205.docx


9:45am - 10:00am

Engineering 19th century poetry from the National Library of Norway's Collection

Lars Magne Drønen Tungland, Ingerid Løyning Dale

National Library of Norway, Norway

The Digital Humanities support unit (DH-lab) of the National Library of Norway provides data and digital analysis tools in the EU-funded NORN project, led by the Ibsen centre and the University of Oslo. NORN analyses and reinterprets 19th-century Norwegian literature through the lens of national romanticism. It merges fields like cultural and critical race studies with both qualitative and quantitative methods from literary science, history, and digital humanities. The project aims to explore the emotional aspects of national romanticism and to analyse literary works objectively in order to critique literary history.

A segment of the NORN project is dedicated to examining 19th-century Norwegian poetry. Here we present insights from the process of extracting and analysing 19th-century Norwegian poetry from our digital text collection. The genre poses particular challenges for quantitative analysis and established NLP methods, due to a combination of visual, phonemic, semantic, and syntactic structures that differ markedly from typical NLP data such as reviews, news articles, or novels.

This paper offers insights from the DH-lab's perspective. Our aim has been to operationalize the structures we find in poems published in the 1800s through rigorous text mining, to create a comprehensive dataset with quantitative metrics. The dataset presents the texts in enriched, annotated, and aggregated structured formats which researchers can analyse in useful, interesting, and perhaps surprising ways, in addition to the standard literary practice of close reading. The core of our discussion revolves around the NLP-engineering challenges encountered and the methodologies employed to provide researchers with actionable data.

The dataset has been designed for interoperability with the DH-lab’s suite of tools for corpus analytics, fostering an integrated research environment for digital humanities scholars. Enrichments to the texts in the dataset include text and token classification such as topics, sentiment analysis, and named entities (Nielsen, 2024), as well as phonemic transcriptions using supervised machine learning models. We have used traditional N-gram language models, transformer models such as BERT, and experimented with newer Large Language Models (LLMs) such as Llama 2. An important step involved utilising the Norwegian Language Bank's grapheme-to-phoneme (G2P) models (‘Grapheme-to-Phoneme Models for Norwegian’, n.d.) and training new models to phonemically transcribe the nuances of 19th-century written Norwegian, enabling us to annotate rhymes in the poetry corpus.
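To make the rhyme-annotation step more concrete, here is a minimal C++ sketch, not the DH-lab's actual pipeline: it assumes the G2P models have already produced space-separated phoneme strings for line-final words, and the example phoneme strings and the fixed comparison window are purely illustrative.

```cpp
// Minimal sketch: decide whether two verse lines end-rhyme, given
// phonemic transcriptions of their final words (hypothetical input;
// in the project such transcriptions come from G2P models).
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Split a space-separated phoneme string (e.g. "s k uu g en") into tokens.
std::vector<std::string> phonemes(const std::string& s) {
    std::vector<std::string> out;
    std::string tok;
    for (char c : s) {
        if (c == ' ') { if (!tok.empty()) out.push_back(tok); tok.clear(); }
        else tok += c;
    }
    if (!tok.empty()) out.push_back(tok);
    return out;
}

// Approximation: two words rhyme if their final n phonemes match.
// (A fuller treatment would compare from the last stressed vowel onward.)
bool rhymes(const std::string& a, const std::string& b, std::size_t n = 2) {
    auto pa = phonemes(a), pb = phonemes(b);
    if (pa.size() < n || pb.size() < n) return false;
    return std::equal(pa.end() - n, pa.end(), pb.end() - n);
}

int main() {
    // Hypothetical G2P output for two Norwegian line-final words.
    std::string word1 = "s k uu g en";   // "skogen"
    std::string word2 = "b uu g en";     // "bogen"
    std::cout << (rhymes(word1, word2) ? "rhyme" : "no rhyme") << "\n";
    return 0;
}
```

In practice, comparing from the last stressed vowel onward, rather than a fixed number of phonemes, would match the usual definition of end rhyme more closely.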

Further, we have used word embeddings, e.g. Word2vec (Mikolov et al., 2013), to assess word similarity within the poetry corpus, juxtaposing it with general 19th-century Norwegian language usage to uncover distinctly poetic vocabularies. Our team has also experimented with automatic enjambment detection and with the creation of word clusters using co-occurrence analysis, such as collocations (Johnsen, 2021) and LDA (Blei et al., 2003). We have developed a system for sentiment analysis specific to 19th-century poetry, and have also experimented with mapping the frequency of words related to emotions.

This paper does not delve into the results of the dataset analysis but focuses on the journey of dataset creation, the development of analysis tools, and identifying measurable aspects within the poems. We discuss the lessons learned and challenges encountered, such as dealing with OCR errors, interpreting the visual formatting of poems, understanding genre-specific linguistic features, and addressing domain effects of off-the-shelf tools.

In sum, our work not only underscores the challenges in treating poetry as data for NLP analysis but also highlights the solutions and tools we have developed, contributing findable, interoperable and reusable resources to the field of digital literary studies.



10:00am - 10:15am

Publishing 10 000+ archival index cards of modern cultural heritage as open (as open as possible) data: challenges with OCR, data cleaning, restricted data and maintenance

Niklas Alén, Maria Niku

The Finnish Literature Society, Finland

The Finnish Literature Society (SKS) has recently published online a large, unique collection of modern cultural heritage, which is preserved at the SKS archives. The web publication can be accessed at https://aineistot.finlit.fi/exist/apps/harju (in Finnish).

Johan K. Harju (1910–1976) was a major collector of heritage for SKS, who recorded the lives and traditions of marginalised people in society. He documented urban culture, alcoholics' circumstances, and life in prisons at a time when the emphasis of heritage collection was still on the countryside and collection in urban environments was only just beginning. The collection, stored on archival index cards, contains personal accounts of homelessness and alcohol substitutes, local stories from Helsinki inner-city districts, interviews conducted at night shelters, prison and sanatorium traditions, jokes, and youth and wartime reminiscences. The digital edition contains the digitised cards and their OCR-produced transcriptions and metadata, both fully searchable.

The publication project was a test run of sorts for SKS. The goal was to develop effective and as automated as possible ways to publish large-scale archival collections that may contain over 100 000 units, with their text and metadata fully searchable. The largest part of this was finding OCR solutions that would make it possible to extract both the text and the metadata of the cards and store them in publishable, standard-conforming XML/TEI 5. Our paper examines the various challenges encountered and solutions found during the project: OCR, automated processes for data cleaning, the various aspects to be considered with the restricted parts of the data, requirements for the database and frontend for this kind of digital edition, and technical solutions for continued database maintenance. Our paper thus falls under the third special theme of the conference, the life cycle of digital humanities and arts projects, and in particular its sub-theme "Creating and using cultural heritage collections as data: workflows, checklists, tools".

In order to develop an effective OCR solution, the structure of the material was first carefully analysed. It was quickly determined that the material followed a specific pattern in which regions of interest were always positioned in specific areas of the card: for example, a geographic collection code, if present, was always in the upper right-hand corner of the card, titles were often clearly separated from the rest of the text, and the main text was also mostly separated from the metadata on the card. Based on this information, a C++ programme was developed using open-source libraries. The programme performed card analysis, content segmentation, and OCR using the Tesseract OCR engine’s API. The result was then serialized to XML/TEI 5 using libxml2’s API.
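As a rough illustration of this approach, the sketch below OCRs assumed card regions with the Tesseract API and serializes the output with libxml2. The region coordinates, the Finnish language code, and the simple element names are assumptions; the real programme's segmentation logic and TEI structure are not reproduced here.

```cpp
// Minimal sketch: OCR fixed regions of a card image with Tesseract and
// serialize the result to a small XML document with libxml2.
// Coordinates and element names are illustrative only.
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <libxml/tree.h>

int main(int argc, char** argv) {
    if (argc < 3) return 1;                      // usage: ocrcard <image> <out.xml>
    Pix* image = pixRead(argv[1]);
    if (!image) return 1;

    tesseract::TessBaseAPI api;
    if (api.Init(nullptr, "fin") != 0) return 1; // assumes Finnish language data
    api.SetImage(image);

    // Assumed layout: collection code in the upper right corner,
    // main text in the lower part of the card (made-up coordinates).
    api.SetRectangle(1200, 0, 600, 200);
    char* code = api.GetUTF8Text();
    api.SetRectangle(0, 400, 1800, 1200);
    char* body = api.GetUTF8Text();

    // Serialize to a simple XML file (a stand-in for the TEI output).
    xmlDocPtr doc = xmlNewDoc(BAD_CAST "1.0");
    xmlNodePtr root = xmlNewNode(nullptr, BAD_CAST "card");
    xmlDocSetRootElement(doc, root);
    xmlNewTextChild(root, nullptr, BAD_CAST "collectionCode", BAD_CAST code);
    xmlNewTextChild(root, nullptr, BAD_CAST "text", BAD_CAST body);
    xmlSaveFormatFileEnc(argv[2], doc, "UTF-8", 1);

    xmlFreeDoc(doc);
    delete[] code;
    delete[] body;
    api.End();
    pixDestroy(&image);
    return 0;
}
```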

In J. K. Harju's case, the digitised collection of archival cards contained a large number of duplicates, which needed to be removed before publication. Due to the scale of the collection, it was essential to find an automatic solution for this data cleaning. Initially, the possibility of removing duplicates based on their metadata was considered, but due to OCR errors this was deemed infeasible. A string-similarity-based approach was devised instead. Two factors were important in choosing a string similarity algorithm: it should have good performance, and it should tolerate small variations in the text. Cosine similarity was chosen based on these criteria, and a C++ programme was developed to perform this task.
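The following is a minimal sketch of the string-similarity idea, assuming word-frequency vectors over the OCR text and an illustrative decision threshold; the actual programme's text representation and threshold are not described in the abstract.

```cpp
// Minimal sketch: cosine similarity between two card transcriptions,
// represented as word-frequency vectors. The threshold mentioned in the
// comments is illustrative, not the project's actual value.
#include <cmath>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

std::map<std::string, double> wordCounts(const std::string& text) {
    std::map<std::string, double> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) counts[word] += 1.0;
    return counts;
}

double cosineSimilarity(const std::map<std::string, double>& a,
                        const std::map<std::string, double>& b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& [word, count] : a) {
        normA += count * count;
        auto it = b.find(word);
        if (it != b.end()) dot += count * it->second;
    }
    for (const auto& [word, count] : b) normB += count * count;
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}

int main() {
    // Two transcriptions that differ only by a small OCR variation.
    std::string cardA = "haastattelu helsingin kallion asuntolasta vuodelta 1968";
    std::string cardB = "haastattelu helsingin kallion asuntolasta vuodelta 1968 .";
    double sim = cosineSimilarity(wordCounts(cardA), wordCounts(cardB));
    std::cout << "similarity = " << sim << "\n";
    // Pairs above a tuned threshold (e.g. 0.9) would be flagged as duplicates.
    return 0;
}
```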

The goal with publishing large collections should be to make them available as open data, in order to benefit research in the broadest way possible. Open data is the only way the collections will be of use for research that utilises digital humanities methods. Open data is not an issue when the documents date from the 19th century or earlier. The Harju collection, however, contains about 2000 cards whose informant was born less than 100 years ago or whose birth date is not known. Such documents cannot be published without restrictions.

The decision was therefore made to publish the unrestricted part of the data (10 000+ cards) fully openly, with the entire data available for download and further use under CC BY 4.0, and to find a way to make the restricted part available to researchers in a more limited manner. This part of the process involved devising mechanical solutions for separating the restricted part of the data from the open part, meeting the different requirements for the web interfaces of both, and implementing continued maintenance (updating the open edition with cards as they can be opened to the public).

In order to efficiently create the different datasets and update the public online publication, a couple of C++ programmes were written. The first compared the images and XML files to a list of persons and their corresponding birth dates and created the different sets based on this information. The second programme is meant to update the openly accessible publication yearly. When developing this programme, security was our first concern; to this end, the programme employs multiple features to ensure safe data transfers. The programme uses the aforementioned list of persons to update the public publication yearly with data from persons born at least 100 years before the year in question.
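A small sketch of the 100-year rule check such a programme could apply is given below; the semicolon-separated person list, the field names, and the handling of unknown birth years are assumptions, not the project's actual implementation (which also handles file transfer and security).

```cpp
// Minimal sketch: decide which cards can be opened to the public, based on
// a person list with birth years. The CSV format and cut-off handling are
// assumptions for illustration only.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    const int currentYear = 2024;

    // Assumed format, one card per line: cardId;informantName;birthYear (empty = unknown).
    std::ifstream persons("persons.csv");
    std::string line;
    while (std::getline(persons, line)) {
        std::istringstream fields(line);
        std::string cardId, name, yearField;
        std::getline(fields, cardId, ';');
        std::getline(fields, name, ';');
        std::getline(fields, yearField, ';');

        int birthYear = 0;
        try { birthYear = std::stoi(yearField); } catch (...) { birthYear = 0; }

        // Cards stay restricted if the birth year is unknown or the
        // informant was born less than 100 years ago.
        bool open = (birthYear != 0) && (currentYear - birthYear >= 100);
        std::cout << cardId << " -> " << (open ? "open" : "restricted") << "\n";
    }
    return 0;
}
```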

A good digital edition for data like the Harju collection requires powerful search features and effective facets and filters for browsing and filtering the data, with the metadata fields chosen for indexing, faceting, and filtering in a way that best serves both the data and the users. SKS has for the past few years used the open-source XML database and software development platform eXist-db (https://exist-db.org) for smaller XML/TEI 5-based editions, with the TEI Publisher application (https://teipublisher.com) used as a basis for building the editions. As eXist-db has a robust full-text search and range index based on Apache Lucene, it was found to be suitable for a larger-scale edition like the J. K. Harju collection as well. The only concern was how the indexing would handle a large number of small documents. However, this was found not to be an issue once a sufficient amount of RAM was allotted to the indexing.

In the open data Harju edition, users can filter the catalogue with four facets (region, location, year, and informant's name) or alternatively with searches in a number of metadata fields. The full-text search covers the text content as well as the metadata. All the data of the edition is available for download and further use in XML and CSV; users can download the entire data or parts of it filtered with the relevant facets or with a metadata/full-text search. The restricted Harju edition contains the entire collection, including the c. 2000 restricted cards, and is accessible only on SKS's intranet. In terms of features it is similar to the open version but does not offer data download options.