Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions held on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).

Please note that all times are shown in the time zone of the conference.

 
 
Session Overview
Date: Monday, 27/May/2024
12:00pm - 12:30pm Arrival
Location: Skriða [1st floor]
12:30pm - 2:30pm WS1-1: SSH Open Marketplace
Location: K-205 [2nd floor]
Session Chair: Edward Joseph Gray, DARIAH-EU / IR* Huma-Num, France
 

SSH Open Marketplace

Edward Joseph Gray

DARIAH-EU / IR* Huma-Num

This workshop will present to the DHNB 2024 community what the SSH Open Marketplace is and how to use it, and will guide users through the process of uploading their resources to the Marketplace.


The Social Sciences and Humanities Open Marketplace (SSH Open Marketplace) - marketplace.sshopencloud.eu - is a discovery portal which pools and contextualizes resources for Social Sciences and Humanities research communities: tools, services, training materials, datasets, publications and workflows. The SSH Open Marketplace showcases solutions and research practices for every step of the research data life cycle. In doing so, it facilitates discoverability and findability of research services and products that are essential to enable sharing and re-use of workflows and methodologies. The SSH Open Marketplace, developed during the Horizon 2020 project SSHOC, is maintained by three ERICs: DARIAH, CLARIN, and CESSDA.
Following a brief presentation of what the SSH Open Marketplace is and how it works, participants will be supported by members of the Editorial Board of this discovery portal to write and document their research scenarios, based on the tools and research methods that they use in their daily practice.

This workshop is aimed at those who develop, maintain, or use digital humanities tools and services and wish to valorize them in a European context. Beyond this group, this workshop is of interest to those researchers who would like to harness the power of SSH Open Marketplace Workflows to create a step-by-step guide for completing a research process and, in doing so, produce a new form of research output beyond the usual blog or journal article, one that is fully inscribed in an open-science context. The first half of the workshop can be useful for those participants who wish to learn more about the SSH Open Marketplace, how it functions, and how to use it; the second half will be dedicated to onboarding resources. In any case, participants should bring a laptop, and ideally, have a few ideas in mind of what they would like to put in the SSH Open Marketplace.
LIMIT: 50 participants

 
2:30pm - 3:00pm Break
Location: Háma
Date: Tuesday, 28/May/2024
8:00am - 8:30am Arrival & registration
Location: Skriða [1st floor]
8:30am - 10:00am WS3-1: FULL-DAY WORKSHOP (DHSS)
Location: K-205 [2nd floor]
Session Chair: Koraljka Golub, Linnaeus University, Sweden

8.00-8.30 Coffee and registration
8.30-8.45 Welcome and introductions 
8.45-9.10 Eetu Mäkelä: Computational Humanities in the upcoming Helsinki Liberal Arts and Sciences Bachelor's
9.10-9.35 Olle Sköld: Digital Humanities Master Program at Uppsala University
9.35-10.00 Koraljka Golub: Digital Humanities Master Program at Linnaeus University
 
10.00-10.30 Coffee break
 
10.30-10.55 Anda Baklāne: Baltic Summer School of Digital Humanities: Towards Collaborative Learning 
10.55-11.20 Marianne Ping Huang: On Social Justice and Decoloniality in Digital Humanities. Perspectives from new #dariahTeach course 
11.20-12.00 Koraljka Golub, Marianne Ping Huang and Ahmad Kamal: DARIAH platforms for open educational resources – an overview and future goals 
 
12.00-13.00 Lunch break
 
13.00-14.00 Discussion on micro-credentials and up/re-skilling initiatives; alumni and other topics of interest

More information: https://lnu.se/en/research/research-groups/digital-humanities/workshop-reykjavik-2024/.

 

Higher Education in Digital Humanities and Social Sciences (DHSS)

Koraljka Golub1, Isto Huvila2, Jonas Ingvarsson3, Ahmad Kamal1, Marianne Ping Huang4, Olle Sköld2, Mikko Tolonen5

1Linnaeus University, Sweden; 2Uppsala University, Sweden; 3University of Gothenburg; 4Aarhus University; 5University of Helsinki

Session format: Presentations and discussion in person (not hybrid or online)

Audience: The intended audience of this workshop is course instructors and programme managers for Digital Humanities and Social Sciences (DHSS) and Digital Humanities and Cultural Heritage (DHCH) programs; researchers working on DHSS/DHCH education; and professionals interested in DHSS/DHCH programs, courses, or modules.

Number of Participants: About 30 participants are expected to attend.

Technical requirements: Projector and computer

Description

In Nordic, Baltic, and other regions, many universities have developed and are maintaining programs in Digital Humanities and Social Sciences (DHSS/DHCH). The University of Gothenburg launched a Master in Digital Humanities in autumn 2017, followed by Uppsala University in 2019 and Linnaeus University in 2020. In Croatia, the BALADRIA summer school in Digital Humanities was first given in 2019, while the University of Helsinki has been offering a module in Digital Humanities for several years and is starting a master programme together with social sciences. Many more universities have come to offer courses in digital methods as a part of new or existing programs; other times DHSS/DHCH-related topics and perspectives are included as part of established courses. This has led to a variety of approaches, areas of concentration, and teaching materials in DHSS/DHCH education. It has also led instructors and administrators to encounter numerous challenges, whether pedagogical or infrastructural. As such, drawing on the now robust DHSS/DHCH educational community is imperative for learning from one another’s experiences and leveraging the scale of our endeavours to realize new learning opportunities for ourselves and our students.

This workshop will provide an opportunity for DHSS and DHCH educators to share their experiences; discuss existing programs, modules, courses, research, training, and development activities; review evaluation approaches; and reflect on lessons learned. The workshop will also engage in discussions on common areas of interest, in order to set the groundwork for concrete collaborative implementations (such as student exchanges or regular DHSS/DHCH pedagogy seminars). The primary audiences of this workshop are course instructors and programme managers in DHSS/DHCH programs as well as researchers working on education in this field.

The workshop focuses on higher education in DHSS/DHCH, aiming at pedagogical development and infrastructure building. With respect to the first goal – pedagogical development – workshop participants will share their DHSS teaching experiences, including discussions of strategies, tools, platforms, evaluations, outcomes, and problems. These participants will be recruited through an Open Call, with accepted applicants presenting their experiences in Session 2 (see Workshop Structure). The Open Call will be managed by the workshop organizers.

With respect to the workshop’s second goal – infrastructure building – the workshop will explore a series of initiatives intended to enable collaborative education among fellow universities with DHSS/DHCH programmes/courses/modules within DHNB and DARIAH-EU.

This workshop will allow voices from established and recent DHSS/DHCH programmes to learn from one another on issues related to education and discover opportunities for richer and more sustainable pedagogy within the field. In keeping with the spirit of DHNB, the workshop encourages participation by teachers, researchers, and developers bringing different perspectives, with the goal of translating their insights into actionable plans. The long-term goal of this is capacity building for collaboration among DHSS/DHCH programmes for more interdisciplinary and international learning in the field of DHSS/DHCH in the Nordic and Baltic regions, and beyond.

Workshop Structure

  • Session 1: Welcome and introductions (15 min)

  • Session 2: Presentation and discussion of submitted papers from Open Call (180 min)

  • Session 3: Directed discussion emerging from the main session (45 min)

Call for proposals

Candidates are invited to submit their proposal for a 10-minute presentation. The presentations will be held in the workshop’s open period segment. The presentation should address a specific topic related to the workshop’s theme.

Proposals should be 300 words in length and are to be submitted to dh.edu.ws@lnu.se by 15 April 2024. Proposals will be reviewed by the workshop organizers. Presenters will be notified of acceptance the following week.

Workshop themes

  • Collaborations/exchanges in digital humanities (DH) instruction

  • Project-based/problem-based DH education

  • Interdisciplinary/cross-disciplinary/cross-sectoral/international cooperation in DH education

  • Existing programs, modules or individual courses in DH (e.g., design, target student groups, content, job market, evaluation, experiences, lessons learned)

  • Currently developed programs, modules or individual courses in DH (e.g., design choices, target student groups, resource management, related issues)

  • Capacity building for student employability

Golub-Higher Education in Digital Humanities and Social Sciences-114_a.pdf
Golub-Higher Education in Digital Humanities and Social Sciences-114_b.pdf
 
8:30am - 10:00am WS4-1: FULL-DAY WORKSHOP (SOLRWAYBACK)
Location: K-206 [2nd floor]
Session Chair: Anders Klindt Myrvoll, Royal Danish Library, Denmark
Session Chair: Jon Carlstedt Tønnessen, National Library of Norway, Norway
 

SolrWayback as a search & discovery tool for researchers to work with web archive collections

Jon Carlstedt Tønnessen1, Anders Klindt Myrvoll2

1National Library of Norway; 2Royal Danish Library

The archived web is considered inaccessible to most digital humanities researchers, partly due to a lack of easy-to-use tools and/or a narrow, single-URL-based methodology. This is where SolrWayback becomes relevant.


SolrWayback is a powerful web application for searching and exploring data from the archived web (ARC/WARC files). It has an advanced search syntax that allows full-text search and search in metadata. It also has a replay service, similar to the Wayback Machine, with a toolbox for source inspection. SolrWayback further has tools for network analysis of domains, powerful visualization tools such as N-grams, tools for data insight and, not least, the possibility to export data and derivatives from search results.
SolrWayback is open source and gaining more and more users. By spreading knowledge of SolrWayback, and the gains from using it, at workshops for scholars and anyone interested in working with web archive collections, we aim to build a thriving, resilient and robust SolrWayback software and community.

During this workshop, participants will install and run the SolrWayback bundle to index WARC files. After indexing, they will be able to explore and analyze archival records in the web application, and learn how to export derivatives from the search results.
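The workshop itself relies on SolrWayback and Solr for indexing; purely as an illustration of what the WARC records being indexed contain, here is a minimal Python sketch using the warcio library (the file name example.warc.gz is a placeholder):

    from warcio.archiveiterator import ArchiveIterator

    # Peek at the response records in a (hypothetical) local WARC file.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                date = record.rec_headers.get_header("WARC-Date")
                print(date, uri)

Each such record (crawl date, target URI, payload) is what SolrWayback's index makes searchable.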

This workshop will

  1. Briefly explain how a web archive is typically created, covering key elements such as crawling, indexing, what a WARC file is, tools for viewing the archived web, and more.

  2. Present the SolrWayback solutions at the web archives in Norway and Denmark - search, discovery, QA, tools, visualization and more - to show how easily SolrWayback can be adapted to the needs of other institutions and users.

  3. Explain the ecosystem for SolrWayback.

  4. Perform a walkthrough of installing and running the SolrWayback bundle. Participants are expected to follow the installation guide and will be assisted whenever stuck.

  5. Leave participants with a fully working stack to index, discover and play back WARC files.

  6. End with an open discussion on findings from the workshop and key takeaways.

Prerequisites:

  • Participants should have a Linux, Mac or Windows computer with Java installed. To check that Java is installed, type this in a terminal: java -version

  • Basic technical knowledge of starting a program from the command line is required; the SolrWayback bundle is designed for easy deployment.

  • Mac users with M1 processors (MacBook Pro 2021+) need to have Java 17 installed (later versions are not compatible with Solr 9, which is used for indexing).

  • For Windows computers, an administrator account may be required.

  • Downloading the latest release of SolrWayback Bundle beforehand is recommended: https://github.com/netarchivesuite/solrwayback/releases

  • Having institutional WARC files available is a plus, but sample files can be downloaded from https://archive.org/download/testWARCfiles or you can make your own WARC files with https://archiveweb.page/ or similar tools.

  • A mix of WARC files from different harvests/years will showcase SolrWayback's capabilities in the best way possible.

  • Optionally, install SolrWayback in advance (we aim to make instructional videos for different operating systems available beforehand).

Target audience

  • Researchers

  • Cultural heritage experts, developers, curators, and archivists (not just web archivists), or simply anyone interested in working or being inspired by these technologies and topics.

Anticipated number of participants:
20-30 (fewer would be fine as well - 1:1 help is difficult with many attendees - which is another reason to install the SolrWayback bundle in advance if possible).

Ideal length:
5 hours - full day workshop

Background & resources

Access to and interaction with web archive data are commonly described as a problem for researchers, and SolrWayback is a key to unlocking the potential of data, archives and collections.

SolrWayback 5 (https://github.com/netarchivesuite/solrwayback) is a major rewrite that strongly focuses on improving usability. It provides real-time full-text search, discovery, statistics extraction and visualisation, data export and playback of web archive material.
SolrWayback uses Solr (https://solr.apache.org/) as the underlying search engine. The index is populated using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery). The full stack is open source.

Documentation for SolrWayback in the Norwegian online archive:
https://nlnwa.github.io/research-services/docs/solrwayback


IIPC WAC 2022: SESSION 1 #2: SOLRWAYBACK AT THE ROYAL DANISH LIBRARY
https://www.youtube.com/watch?v=-q4a-edVP5E
See more at SolrWayback - presentations, examples & more

 
8:30am - 10:00am TUTORIAL1-1: HALF-DAY TUTORIAL (INTERACTIVE VISUALISATION)
Location: K-207 [2nd floor]
 

Interactive visualisation of GLAM+ data

Lena Krause1,2

1Maison MONA; 2Université de Montréal

The proposal for this tutorial is a participatory demonstration of data visualisation, focusing on content exploration and analysis using interactive inputs. The term GLAM+ data leans toward a more inclusive definition of cultural heritage or GLAM data. Smaller or unconventional non-profit cultural organisations, local councils and educational institutions are also engaged in preserving, documenting, and mediating art and culture, as well as in creating and disseminating datasets.

Such GLAM+ data typically documents a series of objects, such as artworks or archival documents. As they are all described with the same properties, one can visualise them in timelines, categorical charts, etc. Participants with coding experience may bring their own dataset or sample in JSON or CSV (if you haven't collected your data yet, you can create a demo sample to use during the course, and update your dataset source once it is ready). Participants with little or no coding experience are welcome to the tutorial as a way to read through the code and understand the process behind it. Peer-coding or partnering up to share knowledge is recommended, and there will be time to form small groups at the beginning of the tutorial.

The demonstration will begin with an introduction to Notebook environments on the Observable platform. Observable can be used as a web-hosted sandbox for playing with data and visualising it. It enables collaborative exploration, analysis, visualisation, and communication of data. Public individual and collective workspaces are free, favouring an open-source policy and generating a trove of examples to learn from or to fork. Coding itself is in JavaScript, using the libraries Plot and D3.js.

The workshop will focus on a single notebook (provided), allowing all to follow step by step, with the possibility of forking it to add comments or to make some tweaks, such as using your own data. After a brief overview of the goal of the visualisation, we will examine each step required to produce it, including:

  • getting data in JSON or CSV from an API, file link or simple upload

  • exploring the content of the dataset

  • cleaning or preparing it for the visualisation

  • choosing the inputs and filters

  • updating the output data based on the selection

  • visualising it and thinking about scales, colours and interactions such as hover and clicks

We will also look at other examples on Observable, thus seeing a wide array of code and visualisations whilst exploring one in depth and detail.
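For orientation, the same sequence of steps can also be sketched outside Observable. The tutorial itself works in JavaScript with Plot and D3, so the following Python analogue is only illustrative, and the file and column names (artworks.csv, year, category) are invented placeholders:

    import pandas as pd
    import matplotlib.pyplot as plt

    # 1. get data from a file link or upload (JSON or CSV)
    records = pd.read_csv("artworks.csv")

    # 2.-3. explore the content and clean/prepare it
    records = records.dropna(subset=["year", "category"])
    records["year"] = records["year"].astype(int)

    # 4.-5. choose a filter ("input") and update the selection
    selected_category = "sculpture"  # in Observable this would be an interactive input
    selection = records[records["category"] == selected_category]

    # 6. visualise the selection as a simple timeline of objects per year
    selection.groupby("year").size().plot(kind="bar")
    plt.title(f"Objects per year: {selected_category}")
    plt.xlabel("Year")
    plt.ylabel("Count")
    plt.show()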

Learning outcomes

  • discover the Observable platform and its notebooks

  • understand the main steps of interactive data visualisation

  • handle typical GLAM+ data challenges

  • learn to use the Plot JavaScript library

  • take in some d3.js code examples

Target audience: all welcome. The tutorial is organised as a collective experience of sharing knowledge for computer scientists, digital humanists and cultural heritage workers alike.

Anticipated number of participants: max ~ 15 to 20 participants

Ideal length: 3 hours

Technical requirements: participants should bring their own laptop and have WIFI access. They can create an Observable account or use their Github identifiers to log into the platform.

Instructor: Lena MK is an art historian and computer scientist, currently working as CTO of Maison MONA and as lab manager at Ouvroir Laboratory of Digital Art History and Museology, University of Montreal (UdeM). Specialising in data visualisation for cultural data, she is also a PhD candidate in art history and research-creation at UdeM and teaches data visualisation (HNU3056-6056) in the digital humanities program there.

 
8:30am - 10:00am WS2-1: FULL-DAY WORKSHOP (DiPaDA 2024)
Location: K-208 [2nd floor]
Session Chair: Mats Fridlund, University of Gothenburg, Sweden
Session Chair: Matti La Mela, Uppsala University, Sweden

Full programme at: https://dhnb.eu/conferences/dhnb2024/workshops/dipada/

09:00-09:10 Welcoming (Organizing Committee)

09:10-09:30 Transcriber effects in the Icelandic parliament corpus (Anton Karl Ingason, Lilja Björk Stefánsdóttir)

09:30-10:00 Augmenting the Analysis of Political Discourse: A Word Embedding and Context-sensitive Methodological Approach to the Swedish Parliamentary Corpus (Leif-Jöran Olsson, Daniel Brodén, Mats Fridlund, Magnus P. Ängsal, Patrik Öhberg)

 

Transcriber effects in the Icelandic parliament corpus

Anton Karl Ingason, Lilja Björk Stefánsdóttir

University of Iceland, Iceland

The Icelandic parliament corpus is being used to study individual lifespan change in sociolinguistic style-shift. We report on how the word order effect in question is affected by decisions made by those who transcribe the speeches and show that while some changes are made by the transcribers, the overall pattern of linguistic usage is not substantially altered. Ideally, each recording is manually checked by an annotator, but automatic annotation can be used with the understanding that quantitative findings are subject to minor errors.

Ingason-Transcriber effects in the Icelandic parliament corpus-245.pdf


Augmenting the Analysis of Political Discourse: A Word Embedding and Context-sensitive Methodological Approach to the Swedish Parliamentary Corpus

Leif-Jöran Olsson, Daniel Brodén, Mats Fridlund, Magnus P. Ängsal, Patrik Öhberg

University of Gothenburg, Sweden, Sweden

Introduction

Recent years have witnessed a rapidly expanding interest in data-driven research on parliamentary collections as well as a drive towards further development and standardisation of parliamentary infrastructures (La Mela et al. 2022: 3–4), including cross-national initiatives such as Parla-CLARIN (https://github.com/clarin-eric/parla-clarin). As regards Sweden, although major work is carried out within the SWERIK infrastructure and the Welfare State Analytics project (WeStAc), research on the Swedish parliamentary datasets has, so far, been primarily exploratory and driven by broad conceptual issues and application of common statistical measurements (see Ohlsson et al. 2022; Brodén et al. 2023; Jarlbrink & Norén 2023), leaving a significant gap in the data modelling and the contextual understanding of the data.

This paper will present the augmented methodological approach to analysing the Swedish parliamentary discourse on terrorism developed in the ‘Terrorism in Swedish Politics’ (SweTerror) project (2021–2025) (Edlund et al. 2022). Drawing upon a mixed methods approach, we will discuss a set of context-sensitive analyses of the Swedish parliamentary record, integrating language technology (LT) and contextualising methods from, among other research areas, political science, history of ideas and linguistics. Notably, the discussion connects to the current debate within digital humanities (DH) about the need for engaging with the contextual complexities of text mining large-scale archival collections. According to digital historian Jo Guldi (2023), dedication to the question of what makes text mining accurate and robust will only get data-driven analysis of large-scale collections so far, as without a contextual sensibility applied to the materials the results tend to raise more questions than they answer (see also Bode 2018).

We here outline SweTerror’s enactment of a contextualised understanding of the text data through the development of a custom-made dataset and the use of word embeddings (vectors) for analysing the framing of terrorism in the parliamentary debates, 1968–2018. The paper presents an LT approach firmly grounded in humanities and social sciences (HSS) research questions, and highlights its methodological and analytical potential by presenting results from two case studies of how the Swedish Parliament (riksdag) and the different political parties have engaged with the issue of terrorism.

The dataset and LT approach of SweTerror

The paper will focus on our work with developing the SweTerror corpus, using and adapting the Swedish Parliament Corpus of the edited transcripts of minutes that are currently being cleaned up, partly re-digitised and curated for research purposes (current version 0.14). The dataset is longitudinal and encompasses both the bicameral and unicameral Parliament (1867–1970 and 1971–2018, respectively), consisting of roughly 4 M tokens per parliamentary year (see below). The structure of speeches is reintroduced with a correctness of 90+ percent. Notably, the dataset is annotated with metadata about Members of Parliament (MPs) concerning name, party affiliation, gender and regional representation. Furthermore, we describe the exchange between SweTerror and SWERIK, with SweTerror’s LT analyst Olsson serving on the advisory board and technical advisory board of SWERIK and further enriching and curating the Swedish Parliament Corpus for the benefit of the infrastructure and our research purposes. This work includes contributing various forms of quality control; in this paper we will point at some issues of relevance for our analysis, including the identification of omissions in the dataset such as missing debate protocols.

From the infrastructure perspective, the paper will highlight the integration of workflows into the Språkbanken Text (SB Text) infrastructure, including the Korp tool (Borin et al. 2012) to avoid reprocessing in the SB Text infrastructure. In turn, the process will introduce more flexibility to Korp’s word picture functionalities and feed into new Sparv plugins (Borin et al. 2016) as well as be accessible through APIs. In extension, this means that workflows and data will also be integrated into the CLARIN ERIC infrastructure.

Concerning contextualisation, writings in Digital Humanities on Swedish Parliamentary data have mostly focused on more technical and formalistic aspects of the documentary record, such as issues surrounding OCR quality and metadata as well as how the transcriptions of the minutes are the result of post-speech editing (see Norén & Jarlbrink 2024). However, SweTerror seeks to enact a more contextualising understanding of the data, a simple yet significant element being our choice to, contrary to most parliamentary datasets, group the debates by parliamentary year (autumn–summer), rather than calendar year, to distinctly represent the mandate period of the Parliament. An important rationale for this is that changes of government during election years (mid-calendar year) affect the political dynamics, and we have previously shown that governmental position is a major factor for MPs’ motion writing on the topic of terrorism (Brodén et al. 2023). Following our lead, SWERIK in 2024 adopted this principle for their Swedish Parliament Corpus.

From the technical perspective, the paper will describe our workflows around the annotation pipelines, where the outputs are continuously aggregated and analysed in an iterative process with each layer of annotation having at least one manual evaluation. Specifically, we discuss our work with word vectors and metadata, respectively, for contextual readings in two case studies concerning the occurrence of terrorism discourse in the Swedish parliamentary debate transcripts.

Case study 1: Vectors of violence

A key part of the SweTerror project is applying word vectors and word embeddings (Mikolov et al. 2013) for longitudinal exploration and examination of conceptual changes and conceptual similarities related to the notion of terrorism (see Stampnitzky 2013; Ditrych 2014; Zoller 2021). The word vectors are used in combination with enriched document annotation and quality-assessed metadata to create ‘temporal lenses’ to traverse our analytic universe. Furthermore, we highlight the work concerning document-based analyses of classification and Named Entity Recognition (NER). Both of these enrichments are used for traversal or retrieval of similar sections and related activities based on network relations.

This case study explores the Swedish parliamentary speech on terrorism with regard to discourse semantic patterns and competing terms from the realm of political violence. We aim to compare the usage of discourse-relevant lexical items such as the Swedish terms ‘terror’, ‘terrorism’, and ‘våldsbejakande extremism’ (‘violence-affirming extremism’). To highlight this, we diachronically compare these units by means of their similarity and closeness, estimated through a plethora of word vectors, allowing us to trace the development of the parliamentary discourse on terrorism with regard to continuities and discontinuities. Comparing different vectors of semantically related lexical units diachronically allows us to identify potential discursive shifts in the framing of terrorism over time. A key focus of the analysis is whether, and if so how, the vectors of ‘terror’ undergo a change when the modern Swedish usage of the word ‘terrorism’ emerged in the early 1970s. Another point of interest is possible discursive shifts related to the establishment of the specific term ‘våldsbejakande extremism’ in the 2010s (Andersson Malmros 2022) and whether it had an impact on the vectors of ‘terrorism’.

In our preliminary findings, the calculated similarity between ‘terror’ and ‘terrorism’ remains rather high over time, ranging from the lowest value of 0.73 (2015) to 0.93 (1974), with a value of 1.0 meaning identical embeddings in the data. Although there are noteworthy differences between separate years, a general decrease in similarity after 2001 is discernible, with 2005 as the first year when similarity drops below 0.8 and 2015 displaying the all-time lowest degree of similarity. This discursive change can likely be interpreted as a further specialisation of the term ‘terrorism’ post 9/11, meaning that it increasingly diverges from the more general term ‘terror’. A further point of inquiry is examining different ways of wording political violence by way of the lexical items mentioned, in relation to MPs’ party affiliation. This methodological advancement enables the outlining of differences and similarities in framing terrorism with respect to lines of political-ideological difference in parliament diachronically.
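As a rough illustration of this kind of diachronic similarity measurement (not the SweTerror pipeline itself), one could train a small embedding model per parliamentary year and compare the two terms; the toy data, hyperparameters and gensim setup below are assumptions for the sketch only:

    from gensim.models import Word2Vec

    # Toy stand-in for {parliamentary_year: tokenised speeches}; in practice this
    # would be built from the annotated Swedish Parliament Corpus.
    sentences_by_year = {
        "1974": [["terror", "terrorism", "attentat", "våld"]] * 50,
        "2015": [["terrorism", "våldsbejakande", "extremism", "terror"]] * 50,
    }

    for year, sentences in sorted(sentences_by_year.items()):
        # One small model per parliamentary year; hyperparameters are illustrative.
        model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)
        print(year, model.wv.similarity("terror", "terrorism"))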

Case study 2: Terrorism is not for beginners

Another integral part of the SweTerror project is combining word vectors and metadata (see above) for classifier tasks and traversal. The act of giving speeches holds significance for analysis, as it influences the discourse surrounding policies, but also serves to inform the public about position-taking and the law-making process. However, the allocation of speaking opportunities and debates on certain issues among legislators is not random. There are patterns. One such pattern is gendered speech behaviour. For instance, women often deliver fewer speeches (Bäck, Debus & Müller 2014) and exhibit more emotional (Dietrich, Hayes & O'Brien 2019) and less aggressive speech tendencies (Kathlene 1994). Moreover, women tend to speak less on subjects regarded as ‘masculine’ (Bäck & Debus 2019).

The parliamentary speeches in our dataset are automatically assigned a Persistent Identifier (PID) for the speaker, which can be linked to other metadata and calculated metadata. This allows for an analysis of structural differences in the debates on terrorism at party level, including government versus opposition, speech volume measured as token percentage, and differences between men and women. Furthermore, we will integrate ‘seniority’ as an analytical factor through measurements based on the speaker’s age, years in parliament, and position (Minister or not, membership in Parliamentary Committees, committee chairs, governmental party, etc.). Our temporal periods are dynamic in the sense that they can encompass, among other things, parliamentary year, government period, and eras defined by MPs who have been particularly influential in the debate on terrorism, all of which will be explored for both continuities and discontinuities. Networking inside parliament is explored, and workflows have been established for visualising and comparing extra-parliamentary and intra-parliamentary networking.
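A minimal sketch of such a breakdown, assuming a hypothetical speech-level table (the column names below are placeholders, not the actual SweTerror/SWERIK schema):

    import pandas as pd

    speeches = pd.DataFrame({
        "speaker_pid": ["Q1", "Q2", "Q3", "Q1"],
        "gender": ["f", "m", "f", "f"],
        "party": ["S", "M", "S", "S"],
        "in_government": [True, False, True, True],
        "n_tokens": [350, 1200, 90, 640],
    })

    # Speech volume as a share of all tokens, by gender and by government/opposition.
    total = speeches["n_tokens"].sum()
    print((speeches.groupby("gender")["n_tokens"].sum() / total * 100).round(1))
    print((speeches.groupby("in_government")["n_tokens"].sum() / total * 100).round(1))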

This case study also provides an overview of women’s speech participation over an extended period of time and explores which MPs have discussed terrorism. Our preliminary findings suggest that female speech percentages have risen slowly since the early 20th century, levelling out close to 50 percent in the last decades (since 1988). Nowadays, women speak about as much as men do. Whether that pattern is also reproduced in the terrorism debate is something we will present and discuss once we have produced more complete results for our conference presentation.

Conclusions

We conclude by drawing together our lines of thought about the augmented methodological approach of SweTerror to analysing Swedish parliamentary discourse and how it can enrich a contextual understanding of parliamentary data, in Sweden and beyond.

Olsson-Augmenting the Analysis of Political Discourse-247.pdf
 
10:00am - 10:30am Break
Location: Háma
10:30am - 12:00pm WS3-2: FULL-DAY WORKSHOP (DHSS)
Location: K-205 [2nd floor]
Session Chair: Koraljka Golub, Linnaeus University, Sweden
10:30am - 12:00pm WS4-2: FULL-DAY WORKSHOP (SOLRWAYBACK)
Location: K-206 [2nd floor]
Session Chair: Anders Klindt Myrvoll, Royal Danish Library, Denmark
Session Chair: Jon Carlstedt Tønnessen, National Library of Norway, Norway
10:30am - 12:00pm TUTORIAL1-2: HALF-DAY TUTORIAL (INTERACTIVE VISUALISATION)
Location: K-207 [2nd floor]
10:30am - 12:00pm WS2-2: FULL-DAY WORKSHOP (DiPaDA 2024)
Location: K-208 [2nd floor]
Session Chair: Mats Fridlund, University of Gothenburg, Sweden
Session Chair: Daniel Brodén, University of Gothenburg, Sweden

Full programme at: https://dhnb.eu/conferences/dhnb2024/workshops/dipada/

10:30-11:00 Parliaments as Networks of Power: The Analysis of Power and Gender Relations in Selected European Parliaments (Jure Skubic)

11:00-11:20 “Matrikelmoderation”, what else? Topic Modeling of the Imperial Diet Records of 1576 (Roman Bleier, Florian Zeilinger)

11:20-11:40 Reppin' your constituency? Geographical Representation in Swedish Parliamentary Speech (Albert Wendsjö)

11:40-12:00 The Relevance of AI: Perspectives from the British and Slovenian Parliament (Ajda Pretnar Žagar, David Moats)

 

 

Parliaments as Networks of Power: The Analysis of Power and Gender Relations in Selected European Parliaments

Jure Skubic

Institute of Contemporary History, Ljubljana, Slovenia

Parliamentary debates and parliamentary discourse are an integral part of a nation's political power because parliaments, as representative institutions, are responsible for shaping the legislation that affects people's daily lives. Thus, parliaments are a source of power for Members of Parliament (MPs) and other politicians (Bischof and Ilie 2018). Parliamentary discourse is at the heart of political decision-making and is an expression of political power – an important and widely theorized concept especially in cultural and social studies (Simon 1952; Parsons 1963). Parliamentary debates are therefore an important source of highly relevant data not only for the social sciences and humanities, but also for computer science, making parliamentary discourse interesting for both qualitative (van Dijk 2000; Bayley 2004; Ilie 2015) and quantitative (Abercrombie and Batista-Navarro 2020; Rheault et al. 2016; Cherepnalkoski and Mozetič 2016) and multidisciplinary research (Andrushchenko et al. 2022; Blaxill 2013).

We analyzed gender representation and power relations in the national parliaments of three European countries – Spain, Slovenia and the United Kingdom. The aim of the work was twofold: first, we analyzed the differences in the political representation of women in national parliaments and examined different manifestations of power within parliamentary discourse through the analysis of parliamentary speeches. With our analysis, we wanted to examine the differences in women's political representation and show that high political representation alone does not necessarily warrant the actual power of women MPs in the respective parliaments. Our second, equally important goal was to show that the intertwining of data science and social science research can generate meaningful results that might otherwise be overlooked.

To gain insight into the distribution of power of parliamentarians in the different national parliaments in Europe, we used one of the most comprehensive parliamentary datasets available – the ParlaMint dataset (Erjavec et al. 2023). This dataset contains session transcripts from more than 20 European national parliaments and spans several legislative periods, with transcripts of speeches made between 2009 and 2022 (depending on the country). The corpora are uniformly encoded, linguistically annotated using the Universal Dependencies standard and contain extensive and informative metadata on more than 11 thousand speakers. In addition, the named entities are annotated, which was very important for our analysis.

We analyzed the lower houses of the national parliaments of three European countries: the House of Commons of the United Kingdom, the Congreso de los Diputados in Spain and the Državni zbor in Slovenia. We analyzed the last completed parliamentary term (UK 2017–2019, Spain 2016–2019, Slovenia 2014–2018) in order to have a fully balanced group of parliamentarians. We were interested in three things in particular: first, we examined the overall gender representation in all three parliaments so that we could analyze the proportion of female MPs in national parliaments. Secondly, we analyzed the argumentative power of MPs, focusing on how speeches and mentions of MPs can shed light on their power in political debates. We inspected how argumentative power becomes visible through parliamentary discourse and how speeches and mentions shape the power relations in parliament. Last but not least, we focused on the structural power of MPs within the respective parliaments. We were interested in how the speaking practices of male and female MPs relate to certain topics and the distribution of power. We approached this question by referring to the division of topics within parliamentary debates into 'hard' topics (i.e. topics that are more likely to be discussed by male MPs) and 'soft' topics (i.e. topics that are more likely to be discussed by female MPs) (Baeck et al. 2014). We focused on five key policy issues, namely energy, finance ('hard' topics), education, healthcare ('soft' topics) and immigration, which we treated as an ambiguous topic. In doing so, we wanted to investigate whether this pattern of division could be observed in the selected parliaments. To better understand and visualize the power relations between MPs, we created directed networks in which the nodes represent MPs, the edges their mutual mentions, and the edge weights the number of mentions. For each parliament, we created both general and topic-specific networks, which helped us to better understand the speech dynamics on a particular topic.
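The mention networks described here can be illustrated with a minimal sketch (toy data and invented MP names; the actual study builds the networks from the annotated ParlaMint named entities):

    import networkx as nx

    # (speaker, mentioned MP) pairs extracted from speeches; placeholders only.
    mentions = [
        ("MP_A", "MP_B"), ("MP_A", "MP_B"), ("MP_B", "MP_C"), ("MP_C", "MP_A"),
    ]

    G = nx.DiGraph()
    for speaker, mentioned in mentions:
        if G.has_edge(speaker, mentioned):
            G[speaker][mentioned]["weight"] += 1
        else:
            G.add_edge(speaker, mentioned, weight=1)

    # Weighted in-degree as a rough proxy for how often an MP is mentioned.
    print(dict(G.in_degree(weight="weight")))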

Our results show that the political representation of women in all three national parliaments is quite high (between 32% and 41%), but still below the expected and desired level (50%). This suggests that although the implementation of gender quotas has been rather successful, political participation still remains highly gendered and there is still room for improvement. Furthermore, we found that despite the relatively high level of political representation, women still have much less argumentative power than men. In all three parliaments, the active and passive importance of women was lower than their share of political representation, with Spain showing particularly negative values. This shows that although the Spanish parliament has the highest proportion of female MPs, the mere presence of female MPs in parliament does not guarantee their participation in debates. Although they hold the same position, female MPs have fewer opportunities to express their opinions.

We also show that the dichotomy between 'hard' and 'soft' topics is not universal, but depends largely on the political and cultural context of the parliament under study. The fact is that structural power in parliaments varies thematically and that even on issues that are generally considered 'soft', women do not necessarily have more say than men. Our networks visualize the distribution of speeches and mentions of male and female MPs well, showing that despite the presence of some prominent female politicians, men often still dominate parliamentary discourse. This visualization, combined with statistical and social network analysis, suggests that the gender of MPs may have a significant impact on their argumentative and structural power.

Although many of the findings collected in our study deserve further analysis and explanation, we have scratched the surface of an important social issue of gender inequalities in politics and, most importantly, shown that multidisciplinary research can provide meaningful results and uncover constructions that serve as harmful blockages to the realization of gender equality in politics.

Keywords: parliamentary discourse, network analysis, argumentative power, structural power, relevance

Skubic-Parliaments as Networks of Power-237.docx


“Matrikelmoderation”, what else? Topic Modeling of the Imperial Diet Records of 1576

Roman Bleier1, Florian Zeilinger2

1University of Graz, Austria; 2Historical Commission at the Bavarian Academy of Sciences and Humanities, Germany

Proposal for a short (10-minute) paper at DiPaDA 2024, DHNB 2024

During the 16th century the Imperial Diets, or Reichstage, of the Holy Roman Empire of the German Nation were convened by the emperors of the House of Austria at intervals of one or more years in changing locations. They were events in which the head of the empire called the aristocratic and urban estates to him to discuss and decide on political topics and thus the fate of Central Europe. Imperial Diets were therefore proto-parliamentary legislative assemblies and can be seen as forerunners of modern parliaments and an important part of the constitutional history of the Empire. The subjects of deliberation are diverse and touch on different and central areas of social coexistence, such as, in modern terms, questions of internal security, the judicial system, taxes, defense against external danger, monetary policy and the shaping of social coexistence in the face of confessional plurality, to name just the most important.

Imperial Diet research is primarily based on the edition series of the Historical Commission at the Bavarian Academy of Sciences and Humanities, in which records for each individual Diet have been collected and published in print since the 19th century. While this enterprise is still ongoing, the Commission has begun the retro-digitisation of existing volumes. However, without technical tools and the search options they provide, and given the often hermetic, source-oriented terminology, it is almost impossible to gain an overview of this data.

Over the last several years a team of researchers at the University of Graz and the Historical Commission created the first genuinely digital edition of the Imperial Diet of 1576. In this context it also experimented for the first time with methods of information extraction. A follow-up project, currently still in the application phase, will focus on the topic history of the Imperial Diets in the 16th century. In a small project in spring 2024 the authors tested Mallet (https://mimno.github.io/Mallet/topics.html), a standard topic modeling tool frequently used in Digital Humanities projects, on the electronic texts of the digital edition of the Diet of 1576.
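The experiment uses Mallet, a Java command-line tool; as a rough analogue of the same idea in Python (not the authors' setup), an LDA model could be fitted with gensim on the segmented edition texts, with toy tokens standing in for the real input:

    from gensim import corpora, models

    # Toy stand-in for the segmented, tokenised text units of the 1576 edition.
    documents = [
        ["matrikel", "moderation", "reichstag", "steuer"],
        ["türkenhilfe", "steuer", "reichstag", "verhandlung"],
        ["münze", "münzordnung", "reichstag", "moderation"],
    ]

    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]

    # Small LDA model; the project itself uses Mallet, this is only an analogue.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=1)
    for topic_id, words in lda.print_topics(num_words=4):
        print(topic_id, words)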

The heterogeneous edited, transcribed and annotated texts are segmented and have human-assigned keywords for negotiation topics linked to the corresponding index. These keywords follow the traditional indices from previous editions, but were slightly modified for the edition of 1576 (more abstract analytical terms; no complex, nested general registers of people, places and topics). This will be very useful when comparing them with computer-generated topics to critically reflect on the editorial keywording and indexing of contemporary source texts that document the political discourse. In particular, the automated generation and historiographical interpretation of topics and their associated keywords make it possible to practise data and editorial criticism.

In order to contextualise the edition-critical results, the digital edition of 1576 was compared with a second edition: the Imperial Diet records from 1556/57, which are available in retro-digitised form (https://reichstagsakten.de/), are differently structured texts (a general index with links to PDFs of individual pages). This made it necessary to manually extract the subject headings from the general register and link them to specific text passages.

This experiment provides first insights into the potential of topic modeling for the Imperial Diet records. It also highlights issues that will need to be addressed in the new project focusing on topic modeling, in which the assignment of negotiation topics in several editions can be checked and the keywords can be merged into a cross-edition index.

Bleier-“Matrikelmoderation”, what else Topic Modeling of the Imperial Diet-239.pdf


Reppin' your constituency? Geographical Representation in Swedish Parliamentary Speech

Albert Wendsjö

University of Gothenburg, Sweden

In representative democracy parliamentarians are elected to represent their voters. That they succeed in doing this is one of the most important goals in representative democracies (e.g. Pitkin 1967), and a lot of research has studied to what extent this is achieved; for example, to what extent do parliamentarians represent people of different gender, ethnicity or age (see e.g. Elsässer and Schäfer 2023, Wägnerud 2009, Persson et al. 2023). However, one important aspect of representation is geographical representation, that is, that parliamentarians represent the interests of all geographical regions of the polity. It is to this end that many representative democracies elect parliamentarians through local constituencies that are geographically spread across the country. However, to what extent geographical representation is substantially achieved is a question that has partially been overlooked. The purpose of this paper is to show how geographical text analysis (GTA) (see e.g. Porter et al. 2015) can be used to study geographical representation using parliamentary debates, and to study to what extent geographical representation is achieved in the Swedish Riksdag. In other words, this paper studies which places politicians mention in their parliamentary speeches.

When talking about representation, one often makes the distinction between substantial and descriptive representation (Pitkin 1967). While descriptive representation refers to whether parliamentarians share characteristics with those they represent, substantial representation refers to whether parliamentarians represent their political preferences. In the case of gender, descriptive representation refers to whether the share of female parliamentarians is proportional to the share of females in the population, whereas substantial representation asks whether parliamentarians represent their interests. In the case of geographical representation, the equivalent questions become whether or not parliamentarians are elected from all geographical regions, and whether they represent their interests. While the first question is ensured through the electoral system, the second question is less certain.

Prior research has mainly focused on substantial representation in terms of gender, ethnicity or class and less so in terms of geography. Prior research in Sweden has shown that parliamentarians adapt their language on immigration based on local conditions (Olander 2018), but we know little of how this generalizes to other forms of substantial representation. At a more local level, Folke et al. (2021) found that where local politicians come from can reduce the likelihood of public bads in those areas, but we do not know if parliamentarians at the national level act the same way for the regions they are elected to represent. Beyond Sweden, a few studies have started to examine geographical representation in a broader sense, often through mentions of local regions. Using this approach, studies have found several explanations of local representation, for example electoral volatility, politicians' backgrounds and the electoral system (Schürmann and Stier 2023, Zittel et al. 2021, Russo 2021, Nagtzaam and Louwerse 2023). However, most studies have been limited in temporal coverage, and carry methodological limitations.

To add to this literature, this study examines geographical representation in the Swedish Riksdag from 1864-2022 using the SWERIK dataset (Swerik 2023). To measure geographical representation, previous studies have mainly relied on dictionary approaches, which force the researcher to identify all possible references to local areas. To move beyond this, this study uses what can be referred to as geographical text analysis (see e.g. Porter et al. 2015). Specifically, the study uses named-entity recognition (NER) to extract all geographical mentions in the parliamentary data; it then geotags all places in order to create a dataset of geographical representation over time.
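A minimal sketch of such an NER-plus-geotagging step (the spaCy model name, the entity labels and the geopy geocoder below are assumptions for illustration, not the study's actual setup):

    import spacy
    from geopy.geocoders import Nominatim

    nlp = spacy.load("sv_core_news_sm")  # assumed Swedish pipeline with NER
    geolocator = Nominatim(user_agent="geo-representation-sketch")

    speech = "Vi måste satsa mer på järnvägen mellan Göteborg och Borås."
    doc = nlp(speech)

    # Label names depend on the model; LOC/GPE are common location labels.
    places = [ent.text for ent in doc.ents if ent.label_ in ("LOC", "GPE")]
    for name in places:
        location = geolocator.geocode(name)
        if location:
            print(name, location.latitude, location.longitude)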

Using this data, the study examines two aspects of geographical representation. First, the study explores to what extent geographical representation is achieved, that is, to what extent parliamentarians talk about a region in proportion to how many people live there. Second, the study explores to what extent parliamentarians talk more about their own constituency compared to other constituencies.

Overall, the contribution of this study is twofold. Firstly, it explores how geographical text analysis can be used to study parliamentary speech, to answer substantial questions of representation. Through exploratory analysis, the study also points to several avenues for future research; for example, geographical text analysis can be further used to study how globalization emerges through parliamentary speech. Secondly, this study contributes substantively to the literature on representation, by showing how local representation in Sweden has developed over time.



Perspectives on AI in the British and Slovenian parliament

Ajda Pretnar Žagar1, David Moats2

1Institute of Contemporary History, Slovenia; 2Faculty of Social Sciences, University of Helsinki, Finland

Parliamentary debates illustrate which legislative topics are relevant and how these topics shift over time. When observing how a specific topic is handled in the parliament, it is possible to pinpoint tangential topics and focal points. The identification of policy foci is particularly interesting in a cross-country comparison. In this contribution, we analyse and compare the results from seven years of parliamentary debates (2015-2022) from the British and Slovenian ParlaMint corpus (Erjavec et al. 2023). While previous research in comparative computational linguistics covered debates on migration (Blaette et al. 2020, Navarretta et al. 2022), attitudes to the EU (Hörner 2013), and right-wing populism (Schwalbach), we focus on the debates about artificial intelligence (AI) to investigate how AI was discussed in the two countries.

The main research question is how the debates differ in an “industry leader” country, such as the UK, versus an “industry follower” country, such as Slovenia. To answer this question, we employ computational and close reading techniques. We used collocation networks to visualize the semantic relationships between words within the debates. By identifying frequently co-occurring word pairs, we discern key themes that centre around AI. We used word enrichment analysis to identify the lexico-semantic patterns of each parliamentary discourse. This allowed us to pinpoint significant AI-related topics, revealing the unique emphases of British and Slovenian legislators. We used semantic document maps (Godec et al. 2021) to further elaborate on the topics pertinent to artificial intelligence. This provided a landscape of policy foci for further cross-country comparison. We supplemented the findings with close reading. We also analysed how the debate is characterised by parties and different MPs.
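A collocation network of the kind described here can be sketched as a simple co-occurrence graph (toy sentences and a raw-count threshold stand in for the corpus and for a proper association measure):

    from collections import Counter
    from itertools import combinations
    import networkx as nx

    # Toy stand-in for sentence-segmented, tokenised ParlaMint debates.
    sentences = [
        ["artificial", "intelligence", "healthcare", "nhs"],
        ["artificial", "intelligence", "deepmind", "takeover"],
        ["digital", "transformation", "artificial", "intelligence"],
    ]

    pairs = Counter()
    for tokens in sentences:
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1

    G = nx.Graph()
    for (a, b), count in pairs.items():
        if count >= 1:  # in practice, a higher threshold or an association measure
            G.add_edge(a, b, weight=count)

    # Words most strongly connected to "intelligence" in the toy graph.
    print(sorted(G["intelligence"].items(), key=lambda kv: -kv[1]["weight"]))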

Initial observations reveal markedly different topics, reflecting the distinct political landscapes and socio-cultural contexts of the two nations. Both debates started to take off after 2016. However, the UK debate is much more prominent, while in Slovenia there are more than a handful of mentions of AI only in late 2019. The UK debate centres around protecting national interests, mostly in relation to company takeovers such as DeepMind. A prominent topic in the British subcorpus is also how AI can be implemented in healthcare. In Slovenia, on the other hand, the debate revolves around general digital transformation.

The presented research is part of a larger project mapping public values in algorithmic systems. Further research will include the analysis of additional national parliamentary debates from Denmark, Finland, and Sweden.

Pretnar Žagar-Perspectives on AI in the British and Slovenian parliament-248.docx
 
12:00pm - 1:00pm Lunch
Location: Háma
1:00pm - 2:30pm WS3-3: FULL-DAY WORKSHOP (DHSS)
Location: K-205 [2nd floor]
Session Chair: Koraljka Golub, Linnaeus University, Sweden
1:00pm - 2:30pm WS4-3: FULL-DAY WORKSHOP (SOLRWAYBACK)
Location: K-206 [2nd floor]
Session Chair: Anders Klindt Myrvoll, Royal Danish Library, Denmark
Session Chair: Jon Carlstedt Tønnessen, National Library of Norway, Norway
1:00pm - 2:30pm TUTORIAL2-1: HALF-DAY TUTORIAL (LyX)
Location: K-207 [2nd floor]
 

LaTeX Light: Introduction to LyX

Elisabeth Maria Magin

Universitetet i Oslo, Norway

Despite the presence of computers in our daily lives, choosing the right tool for the right task can be tricky. This especially applies when we’re already familiar with a specific tool for a specific task – like writing and publishing papers. Students in particular, but also many older colleagues, tend to use Microsoft Word or a similar word processor for the purpose, leading to a range of issues, not the least of which is compatibility. There are better tools available, like LaTeX, which offers a wide range of benefits over traditional Word documents, particularly for those submitting contributions to a variety of academic publications with different styles and requirements. LaTeX excels at compiling bibliographies and outputting them according to different stylesheets, meaning that resubmitting an article to a different journal only requires one change in the settings and a rerun for all of the references to be formatted according to the guidelines of the new journal. It further offers automated numbering and handling of cross-references, ensuring that the table of contents, page numbers, bibliography and cross-references are always up to date without any manual work required. This is especially useful for those working with longer texts, where chapters and sections may have to be moved around during the writing process, and spares one the headache of having to manually update cross-references in the text.
Beyond these, LaTeX also offers other benefits. Because it uses plain text and code, files can be edited on any computer regardless of access to proprietary software, rendering .tex files completely portable and platform-independent; and since the visual output is decoupled from the content, it is also much easier to output the same document according to different stylesheets, for example using different fonts or page margins, than it is with Word.
But LaTeX can not only render the same document in different ways; it is also possible to create several different files out of the same basic document. This feature is especially useful if one happens to be writing worksheets for students or exam papers, or is a PhD student required to hand in a draft to one’s supervisors where some parts aren’t quite finished yet. Keeping the student exam paper and the solution sheet, or the draft meant for one’s supervisors versus the drafts one is still working on, in separate documents can easily lead to much confusion and to the wrong versions of documents being sent to the wrong person.
But despite the many benefits of LaTeX, the switch from a word processor to a plain text editor with a markup language can be intimidating and difficult to tackle by oneself. This is where LyX comes in. LyX is a graphical user interface that aims to make the transition from word processor to LaTeX easier by using many familiar buttons from word processors to enter the LaTeX code "behind the scenes", allowing newcomers to LaTeX to focus on getting the text written instead of trying to remember the right commands. However, the use of LaTeX in the background enables LyX users to make use of many of the features LaTeX offers – such as consistent formatting, easy cross-referencing, consistent bibliographies and creating different output documents from the same source document, thus keeping everything in the same space.
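For orientation, a minimal, purely illustrative example of the kind of LaTeX source that LyX writes behind the scenes (the bibliography file references.bib and the citation key are invented placeholders):

    \documentclass{article}

    \begin{document}

    \section{Introduction}\label{sec:intro}
    Cross-references and numbering are handled automatically.

    \section{Analysis}
    As noted in Section~\ref{sec:intro}, references stay correct even if sections move.
    A claim backed by a source~\cite{magin2024lyx}. % key assumed to exist in references.bib

    \bibliographystyle{plain}
    \bibliography{references} % expects a BibTeX file named references.bib

    \end{document}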
LaTeX Light: Introduction to LyX is meant for everyone looking for alternatives to Microsoft Word, whether student or established academic; everyone who has ever spent hours adapting the bibliography to a different journal’s style or tried to stop the illustrations from completely messing up the formatting of the text. Over the course of four hours, this introductory course teaches up to 15 attendees how to write a structured text of any length – whether a 5-page journal article or a 15-chapter book – using LyX and compile it with LaTeX. In units of approximately 30 minutes, the course covers:
* The basic principles of “what you see is what you get” versus “what you get is what you want”
* The LyX interface
* Basic structuring in a LyX document (chapters, sections and subsections)
* Including images and tables in your document
* Bibliographies and cross-referencing
* Outputting different versions of the same document
To ensure that participants can do the practical exercises for each of the units, they should bring an unformatted paper draft (or some other text) with at least 3 sections, 3-5 subsections, 1 image and 1 table, which will be used as a training document.
Participants are also required to install the latest version of either MikTeX or TeXLive (only ONE of them) and LyX before the workshop. All of these are freeware and available for Windows, Linux and Mac.
The course will be led by Elisabeth Magin, who has given similar introductions to LyX to a number of colleagues and students at the University of Oslo.

Magin-LaTeX Light-214_a.pdf
Magin-LaTeX Light-214_b.pdf
 
1:00pm - 2:30pmWS2-3: FULL-DAY WORKSHOP (DiPaDA 2024)
Location: K-208 [2nd floor]
Session Chair: Daniel Brodén, University of Gothenburg, Sweden
Session Chair: Mats Fridlund, University of Gothenburg, Sweden

Full programme at: https://dhnb.eu/conferences/dhnb2024/workshops/dipada/

13:00-13:30 Decoding the parliamentary debate on marketization of education in Sweden through computational analyses (Eric Borgström, Martin Karlsson, Christian Lundahl)

13:30-13:50 Bilingual Parliament? Functions of Swedish, English and Latin in the Parliament of Finland (Anna Ristilä)

13:50-14:10 Accessing nature before “allemansrätten”? Combining two national parliament datasets to study a tradition before it was named (Matti La Mela)

14:10-14:30 Finding Patterns across Multiple Time Series Datasets: Democracy in the Twentieth-century Political Discourses in the United Kingdom, Sweden, and Finland (Risto Turunen, Hugo Bonin, Pasi Ihalainen, Jani Marjanen)

 

Decoding the parliamentary debate on marketization of education in Sweden through computational analyses

Eric Borgström, Martin Karlsson, Christian Lundahl

Örebro University, Sweden

During the last four decades, the Swedish education system has gone through a transformation from one of the most centrally planned and uniform to one of the most marketized and decentralized in the OECD area (cf. Fredriksson, 2010; Lundahl et al., 2013). Though part of a global trend of marketization of welfare services (Fuller, 2019), the development of the Swedish education system is more radical in comparison both with other welfare areas in Sweden and with education reforms in other countries (Lundahl, 2016). As such, the trajectory of Swedish education policy from the 1980s until today can be described as a shift between two extreme positions seldom witnessed in either welfare policy or education policy research (Fredriksson, 2010).

While a large number of studies have produced important insights into the causes, processes and consequences of this development (cf. Fredriksson, 2010; Ringarp, 2011 & Hultén, 2019), the complexity, fragmentation and extent of this process have made comprehensive and systematic analyses difficult. The task of analyzing this policy process extends beyond what is possible in traditional qualitative policy analysis. This paper aims to illustrate the potential of utilizing novel computational methods and open parliamentary data to systematically investigate the complex processes of educational reform. The paper makes use of recent developments in Natural Language Processing (NLP) to analyse the parliamentary debate on school marketization in Sweden.

Aim and research questions

The study draws on the comprehensive and systematized record of the Swedish parliament's current and historical work, made public as an open resource (data.riksdagen.se). Using a combination of computational techniques – including sentiment analysis and topic modeling – as well as qualitative text analysis, this study aims to map the dominant arguments in parliamentary debates surrounding the marketization of the Swedish education system between 1993 and 2023. The study is guided by the following research questions:

1. What are the most frequent types of arguments made in the parliamentary debate on school marketization in Sweden, 1993-2023?

2. How does the prevalence of these different types of arguments change over time?

3. How does the prevalence of these different types of arguments vary across parties?

4. Which MPs and political parties are most active in the debate on school marketization across the time period, 1993-2023?

Taken together, answering these research questions will lay the foundation for creating a comprehensive picture of the structure and development of argumentation in the parliamentary debate related to school marketization in the Swedish parliament.

Theory

The marketization of the Swedish education system can be said to represent a paradigm shift in education policy (Lundahl et al., 2013; Alexiadou & Lundahl, 2016). A policy paradigm can be defined as “a framework of ideas and standards that specifies not only the goals of policy and the kinds of instruments that are used to attain them, but also the very nature of the problems they are meant to be addressing” (Hall, 1993: 278). In very general terms, the transformation of the Swedish education system in a marketized direction represents a shift in the understanding of the central problems of education policy, from a problem of attaining equity and equality towards a problem of insufficient efficiency and flexibility in the school system (Börjesson, 2016: 77 & 144). This shift in the definition of the central problems of education policy was followed by a transformation of the policy instruments used as well as the goals of those instruments.

A paradigmatic view of policy change is thus a suitable theoretical framework for analyzing the marketization of the Swedish education system. Such a perspective puts focus on the link between ideational dynamics, the ebb and flow of normative ideas about education, and the policy instruments designed and implemented. Policy paradigm theory suggests that the understanding of policy development requires analysis of the problems and goals identified in the policy debate.

This paper focuses on understanding the ideational development in the parliamentary debate on school marketization. The empirical basis is open parliament data, which forms an ample basis for empirically analyzing policy actors’ normative ideas, not least through transcripts of parliamentary speeches and debates.

Data and materials

The data utilized in this study comes from the open data of the Swedish parliament, available at https://data.riksdagen.se/. This database includes parliamentary speeches, government bills, motions, government commission reports, voting records, and records of appointments. The sample selected for analysis consists of all parliamentary speeches identified using the search term “friskol*” (i.e. independent schools with alternative word endings) in the open data of the Swedish parliament (using the GUI riksdagssök – riksdagsdata.oru.se), between 1993 and 2023. The sample consists of 2538 speeches made by 1279 members of parliament. Apart from a text transcript of the speech, the data contains the time of the speech as well as the party affiliation, name and identification number of each speaker.

The search term chosen to delimit the sample, friskola [independent school], is a central concept in the debate on school marketization. The central reform creating a quasi-market among education providers did so by allowing privately operated and owned schools to receive the same public funding per student as public schools. However, the sample used is not comprehensive, as parts of the debate about school marketization may consist of speeches that do not mention independent schools.

Research methods

The first stage of the analysis consisted of applying automated NLP technologies (using Dcipher Analytics, https://www.dcipheranalytics.com) to extract arguments from the data set. First, a rough set of arguments was identified using GPT-3. Each argument was classified based on its sentiment (positive, negative, neutral) using sentiment analysis. The arguments were then plotted in a two-dimensional vector space and clustered into distinct topics using topic detection (Figure 1). Finally, clusters were aggregated, yielding a preliminary compilation of 88 different machine-identified arguments, each with aggregated information including 15 debate excerpts illustrative of the particular argument.

In the second stage of the analysis, each of the arguments was validated, refuted or refined through qualitative analysis of the debate excerpts. In each excerpt collection, arguments (i.e. claims and premises, see Rocha et al. 2022) were manually identified, categorized as pro or contra educational marketization and labelled inductively. The automated sentiment analysis of stage one proved inadequate for identifying pro and contra arguments (an argument in favour of a particular view may well be stated with a negative sentiment, e.g. independent schools are not inferior schools). The qualitative analysis resulted in a tentative, high-level typology comprising four main topics and nine argument types for and against independent schools, which was then used to code the data set.
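The extraction itself was done in a proprietary tool, but the general shape of the first stage – embed debate excerpts, cluster them into candidate argument topics, attach a coarse sentiment label – can be approximated with open-source libraries. A minimal sketch under that assumption follows; the file layout, model names and cluster count are illustrative only, not the Dcipher Analytics workflow used in the paper:

```python
# Rough open-source approximation of the first analysis stage described above:
# embed debate excerpts, cluster them into candidate argument topics, and attach
# a coarse sentiment label. This is NOT the Dcipher Analytics workflow itself.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from transformers import pipeline

# Hypothetical input: one excerpt per row, with party and year metadata.
excerpts = pd.read_csv("friskola_excerpts.csv")  # assumed columns: text, party, year

# 1. Embed each excerpt as a dense vector (assumed Swedish sentence-BERT model).
embedder = SentenceTransformer("KBLab/sentence-bert-swedish-cased")
vectors = embedder.encode(excerpts["text"].tolist(), show_progress_bar=True)

# 2. Cluster the vectors into candidate argument topics.
excerpts["cluster"] = KMeans(n_clusters=88, random_state=0).fit_predict(vectors)

# 3. Attach a coarse sentiment label to each excerpt for later manual validation.
sentiment = pipeline("sentiment-analysis",
                     model="nlptown/bert-base-multilingual-uncased-sentiment")
excerpts["sentiment"] = [r["label"]
                         for r in sentiment(excerpts["text"].tolist(), truncation=True)]

# Inspect a few excerpts per cluster, mirroring the qualitative second stage.
print(excerpts.groupby("cluster").head(3)[["cluster", "sentiment", "text"]])
```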

Preliminary results

  • The arguments put forward in the parliamentary debate on school marketization are characterized by a broad diversity of topics. However, four high-level topics can be distinguished that encompass a substantial share of the arguments made. These are:
    • Meta-debate on school marketization: Like any other policy debate, the parliamentary debate on school marketization has a substantial element of meta-debate, i.e. debate on the subject of debating school marketization. Meta-debate arguments are concerned with the debate itself (e.g. whether school marketization is an important issue to debate), or with the actors of the debate (e.g. debating the credibility of MPs or political parties).
    • Organization of the school market: The organization of the school market is one prevalent topic in the parliamentary debate on school marketization. Rather than questioning or defending the existence of a school market, the arguments made within this topic focus on how such a market should be organized.
    • Existence of the school market: While arguments in this category often resemble those in other categories, the difference lies in their starting point. What is debated in this theme is the existence of the marketized school system in itself, rather than its organization or regulation. Hence, the debate focuses on whether the system should prevail or be abandoned.
    • Quality of independent schools: The last theme of arguments in the debate on school marketization concerns the quality of independent schools. While there are arguments put forward on both sides, it is important to point out the strong skewness towards pro-marketization arguments within this theme.
  • There has been a gradual shift in issue ownership in the parliamentary debate on school marketization over time, from the ideological right to the left. At the start of the investigated time period, the political right dominated the debate. However, over time MPs from the parties on the left have gradually become more active in the debate on school marketization.
  • The distribution of parliamentary speeches among Swedish MPs follows a clear power-law, or long-tail, distribution. A small minority of MPs have delivered a majority of the parliamentary speeches mentioning independent schools. The large majority of MPs that have mentioned independent schools in parliamentary speeches have done so in few or single speeches, while 11% (n=114) of the MPs account for a majority of speeches mentioning independent schools.
Borgström-Decoding the parliamentary debate on marketization-246.pdf


Bilingual Parliament? Functions of Swedish, English and Latin in the Parliament of Finland

Anna Ristilä

University of Turku, Finland

According to Finnish law (Constitution of Finland §51), only the two official languages of Finland – Finnish and Swedish – may be used in the Finnish parliament (Eduskunta), but fragments of other languages are sometimes present in the discussions as well. The presence of these fragments has not been extensively studied, e.g. whether certain topics are more prone to include fragments of foreign languages. The speeches given in the Finnish parliament have been topic modelled before (Loukasmäki & Makkonen 2019, Ristilä & Elo 2023), but the distribution of languages over topics has not been studied.

This study attempts to fill this gap by examining the distributions of Swedish, English and Latin across topics identified in the Finnish parliament 1970-2020 and by discussing the different languages’ functions in the parliamentary context. These three languages have very different popularity trends: Swedish is an official language in Finland but its use has long been in decline, English is not an official language in Finland but is generally used as a lingua franca, and Latin is a dead language but has a strong footing especially in legal contexts. Comparing the topic distributions of these languages gives us a better understanding of their functions in the political context, and of how and why different languages can be used as political tools and devices.

The materials used in this study are the plenary speeches given in the Finnish parliament between 1970 and 2020. The speeches have been made computer readable and enriched with metadata (Hyvönen et al 2024).

This study built on the topic model by Ristilä and Elo (2023) and used language detection (Python’s Lingua library) to determine which languages are present in which topics. Language-use occurrences and, when necessary, their surrounding context were processed through the topic model to obtain topic distributions for each language. The topic model was monolingual, and since there were many Swedish passages, all Swedish content was machine translated into Finnish (eTranslation). English and Latin only appeared as small fragments, so their close context was used to define the topics.
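As an illustration of the detection step, a minimal sketch using the Lingua library mentioned above; the example fragments and the restriction to four candidate languages are assumptions made for the sake of the example:

```python
# Minimal sketch of the language-detection step with the Lingua library
# (pip install lingua-language-detector); the example fragments are invented.
from lingua import Language, LanguageDetectorBuilder

candidates = [Language.FINNISH, Language.SWEDISH, Language.ENGLISH, Language.LATIN]
detector = LanguageDetectorBuilder.from_languages(*candidates).build()

fragments = [
    "Ärade talman, värderade ledamöter",   # Swedish address to the Speaker
    "de facto",                            # Latin phrase embedded in a speech
    "level playing field",                 # English loan expression
]

for fragment in fragments:
    language = detector.detect_language_of(fragment)
    print(f"{fragment!r} -> {language}")
```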

Functions of language use were defined with close reading. Ten function categories were defined for all languages, and three additional ones for Swedish. These were divided into two groups: functional and rhetorical. Kari Palonen’s four concepts of politics (Palonen 2003, 1993) – policy, polity, politicking and politicization – were used to gain a better understanding of how the functions work in a political context: policy entails the regulatory aspect of politics, polity can be understood as the sphere of established norms and procedures, politicking is the performative aspect of politics, and politicization makes something political. Of particular interest was the politicization function of language choice, i.e. making something political just by using a marked (non-typical) language.

The preliminary results indicate that Swedish and Latin usages have very similar topic distributions. Both were used significantly more than average around topics named public sector and legislation, and significantly less in topics foreign and security policy and traffic and transport. The distribution of English usage, on the other hand, closely followed the average topic distribution, reflecting its commonplace role.



Accessing nature before “allemansrätten”? Combining two national parliament datasets to study a tradition before it was named

Matti La Mela

Uppsala University, Sweden

Sweden, Finland, and Norway share a longstanding tradition of public access rights known as allemansrätten, which grants access to nature for activities such as camping and foraging on both public and private lands. What is interesting, however, is that allemansrätten, although considered an age-old Nordic custom, was not named until the 1930s and only gained common usage after World War II. This has led some scholars to view allemansrätten as a political construct and to challenge the prevailing narrative about the historical roots of these public access rights (see e.g. Wiktorsson 1996; Valguarnera 2016). The later history of the term allemansrätten has been explored through parliamentary debates in Finland (Kettunen & La Mela 2022), and in Sweden through selected Swedish parliamentary motions, protocols, and official inquiries (Sténs & Sandström 2014). This paper focuses on the roots of allemansrätten before the term was coined, through parliamentary debates in Sweden and Finland. It asks whether such practices of access rights to nature were recognized in Sweden and Finland before they received the name “allemansrätten”; in other words, to what extent can we identify the concept or idea before it was formally defined?

The data used in the paper consists of linked open parliamentary datasets published in both countries: in Sweden in the SWERIK project for the years 1867-2022 (https://swerik-project.github.io/), and in Finland in the Semantic Parliament project from 1907 until today (Hyvönen et al. 2023). The paper builds on approaches in digital conceptual history that employ parliamentary debate data (see e.g. Jarlbrink & Norén 2023; Ihalainen & Sahala 2020; Elo 2022). As the concept of allemansrätten (in Finnish, jokamiehenoikeus) becomes a shared concept between Sweden and Finland after the 1940s, studying the national parliamentary debates from these countries provides a way to follow and contrast the trajectories of how the term became articulated. At this first stage of research, the paper focuses on the years where the two datasets converge, thus on the years from 1907 onwards. Moreover, the focus is on the practices of access to nature, studied through debates about foraging for the resources of wild nature. This is done because the (universal) right to forage berries, mushrooms and non-protected plants is at the core of today’s allemansrätten.

The paper presents preliminary results where the access rights are traced in the debates at two levels. First, the paper examines the debates where wild berries and mushrooms are mentioned, and investigates the word contexts around them. Second, the paper employs topic modeling on the Swedish parliamentary debates to identify debates related to access rights, with a particular focus again on foraging and the use of wild resources as key terms. For this, the paper applies the Swedish BERTopic implementation and the BERT language models developed by the Swedish National Library (KBLab). The topic modeling is also guided by seed terms drawn from contemporary debates on allemansrätten, such as access right and outdoor recreation, which steers the models towards topics where the concept appears. During both steps, the identified debates are classified manually based on their legislative context and read more closely to study the views of the members of parliament.
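A minimal sketch of the seeded (guided) topic-modeling step described above, using BERTopic's seed-topic mechanism together with a KBLab sentence-embedding model; the seed terms, model name and data loading are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of guided topic modeling over Swedish parliamentary speeches.
# Seed terms, the embedding model and the input file are assumptions.
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

speeches = pd.read_csv("riksdag_speeches.csv")["text"].tolist()  # hypothetical file

# Seed topics built from contemporary allemansrätten vocabulary, steering the
# model towards debates about access rights, foraging and outdoor recreation.
seed_topic_list = [
    ["allemansrätt", "tillträde", "friluftsliv"],
    ["bär", "svamp", "plockning"],
]

embedding_model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")  # assumed

topic_model = BERTopic(
    embedding_model=embedding_model,
    seed_topic_list=seed_topic_list,
    language="multilingual",
)
topics, probabilities = topic_model.fit_transform(speeches)

# Inspect the topics most influenced by the seed terms for closer reading.
print(topic_model.get_topic_info().head(10))
```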

The paper contributes new knowledge about the early roots of allemansrätten. It also provides an example of how to identify and study a concept through the semantic content of a bilingual parliament dataset, rather than approaching it through the term that names the concept.

References

Elo, K. 2022. Debates on European Integration in the Finnish Parliament (Eduskunta) 1990-2020. In Proceedings of the Digital Parliamentary Data in Action (DiPaDA 2022) Workshop, CEUR-WS, 129-145.

Hyvönen E., Sinikallio S., Leskinen P., Drobac S., Leal R., La Mela M., Tuominen J., Poikkimäki H., & Rantala H. 2023. Plenary Speeches of the Parliament of Finland as Linked Open Data and Data Services. Proceedings of the International Workshop of Knowledge Generation from Text (TEX2KG), co-located with ESWC 2023, Hersonissos, Greece, May 29th, 2023. CEUR Workshop Proceedings 3447, 1–20.

Ihalainen, P. & Sahala, A. 2020. Evolving conceptualisations of internationalism in the UK parliament: Collocation analyses from the League to Brexit. In Fridlund, M., Oiva, M., Paju, P. (Eds). Digital Histories: Emergent Approaches within the New Digital History. Helsinki: Helsinki University Press.

Jarlbrink, J., & Norén, F. 2023. The rise and fall of ‘propaganda’ as a positive concept: a digital reading of Swedish parliamentary records, 1867–2019. Scandinavian Journal of History, 48(3), 379-399.

Kettunen K. & La Mela M. 2022. Semantic tagging and the Nordic tradition of Everyman’s rights. Digital Scholarship in the Humanities, 37(2), 483-496.

Sténs, A. & Sandström, C. 2014. Allemansrätten in Sweden: a resistant custom. Landscapes, 15(2): 106–18.

Valguarnera, F. 2016. Allemansrätten: en internationell förebild. Nordisk miljörättslig tidskrift, 147-159.

Wiktorsson, G. 1996. Den grundlagsskyddade myten: om allemansrättens lansering i Sverige. Stockholm: City Univ. Press.



Finding Patterns across Multiple Time Series Datasets: Democracy in the Twentieth-century Political Discourses in the United Kingdom, Sweden, and Finland

Risto Turunen1, Hugo Bonin1, Pasi Ihalainen1, Jani Marjanen2

1University of Jyväskylä, Finland; 2University of Helsinki, Finland

This paper analyses the contextual variation of nouns and adjectives related to democracy in the United Kingdom, Sweden, and Finland in the twentieth century. We compare parliamentary data (Hansard, Riksdag, and Eduskunta) against press data (UK: Guardian and Times, Sweden: Dagens Nyheter and Svenska Dagbladet, Finland: Helsingin Sanomat and Suomen Kuvalehti). By including both liberal and conservative newspapers as well as parliamentary speeches, our study offers a fresh perspective on the relation between democratic discourses produced by politicians and journalists.

The approach includes visualizing the main similarities and differences in the use of democratic vocabulary between multiple historical time series datasets, as well as applying cross-correlation analysis to automatically find matching patterns between parliament and media or across different nations. The similarity of various word-frequency time series is evaluated using the Pearson correlation coefficient (PCC), which can vary from -1 to 1. When two time series display simultaneous increases and decreases, the PCC value is closer to 1 (Derrick & Thomas 2004). The strengths of the PCC are its mathematical simplicity, easy interpretability, and tolerance for noise, while its main limitation is sensitivity to extreme outliers, which can be mitigated by using sliding windows to analyze segments of the time series instead of the whole.
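As a small illustration of this comparison step, the following sketch computes the overall and sliding-window Pearson correlation between two word-frequency time series with pandas; the yearly frequencies are invented placeholders, not the Hansard, Riksdag or Eduskunta data:

```python
# Sketch: Pearson correlation between two word-frequency time series,
# overall and over sliding windows to dampen the effect of extreme outliers.
# The yearly frequencies below are invented placeholders.
import pandas as pd

years = pd.RangeIndex(1900, 1910, name="year")
democracy_parliament = pd.Series([2, 3, 3, 5, 8, 13, 9, 7, 6, 10], index=years, dtype=float)
democracy_press      = pd.Series([1, 2, 4, 4, 9, 12, 10, 8, 5, 11], index=years, dtype=float)

# Overall PCC for the whole period (value in [-1, 1]).
overall = democracy_parliament.corr(democracy_press)

# Sliding-window PCC: correlation within each 5-year window, so that a single
# outlier year only affects nearby windows rather than the whole series.
windowed = democracy_parliament.rolling(window=5).corr(democracy_press)

print(f"overall PCC: {overall:.2f}")
print(windowed.dropna())
```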

Our findings indicate that the cross-correlation is strongest between similar political terms in the same dataset, e.g., the relative frequency of “democracy” and “democratic” over time in a national parliament (in Hansard 0.91, Riksdag 0.76, and Eduskunta 0.65). Another strong set of cross-correlations can be observed when the same political term appears in different datasets from the same country, e.g., the frequency of “democracy” in the liberal and conservative press (in the UK 0.87, in Sweden 0.82, and 0.61 in Finland). Transnational correlations of political terms were not as strong as intra-national correlations, but they were clearly evident in the PCC values; e.g., for the frequency of “democracy” they varied from 0.58 to 0.68 between the three parliaments under investigation. The shared patterns between parliaments include a general increase in the use of “democracy” over time, with notable peaks in the 1930s as a reaction to totalitarianism, around the year 1968 related to the rise of social movements, and in the 1990s, with the expansion of digital communication (Ihalainen et al. 2022). We ensured that our results were not due to intrinsic structural properties of the chosen datasets by also calculating the PCC values for non-political terms, which showed weak or non-existent correlations between political and non-political terms.

Methodologically, our contribution introduces time series methods to the digital humanities, a field which has mostly focused on the manual examination of time series visualizations, with only a few exceptions (Wevers, Gao & Nielbo 2020). From the humanities perspective, we empirically demonstrate the strong linkage between the political discourses in parliament and the press, challenging the notion of parliamentary speech as elite political speech, distinct from a broader society.

Turunen-Finding Patterns across Multiple Time Series Datasets-241.docx
 
2:30pm - 3:00pmBreak
Location: Háma
3:00pm - 4:30pmWS4-4: FULL-DAY WORKSHOP (SOLRWAYBACK)
Location: K-206 [2nd floor]
Session Chair: Anders Klindt Myrvoll, Royal Danish Library, Denmark
Session Chair: Jon Carlstedt Tønnessen, National Library of Norway, Norway
3:00pm - 4:30pmWS2-4: FULL-DAY WORKSHOP (DiPaDA 2024)
Location: K-208 [2nd floor]
Session Chair: Matti La Mela, Uppsala University, Sweden
Session Chair: Mats Fridlund, University of Gothenburg, Sweden

Full programme at: https://dhnb.eu/conferences/dhnb2024/workshops/dipada/

15:00-15:30 The Politics of Compound Neologisms: A Novel Methodology for Mining of Conceptual Transformations in Swedish Parliamentary Discourse and Data (Daniel Brodén, Claes Ohlsson, Henrik Björk, Mats Fridlund, Leif-Jöran Olsson, Leif Runefelt, Shafqat M. Virk, Magnus P. Ängsal)

15:30-16:00 Concluding discussion

 

The Politics of Compound Neologisms: A Novel Methodology for Mining of Conceptual Transformations in Swedish Parliamentary Discourse and Data

Daniel Brodén1, Claes Ohlsson2, Henrik Björk1, Mats Fridlund1, Leif-Jöran Olsson1, Leif Runefelt3, Shafqat M. Virk1, Magnus P. Ängsal1

1University of Gothenburg, Sweden; 2Linnaeus University, Sweden; 3Södertörn University, Sweden

Introduction

This paper draws from two distinct research projects in text mining, each exploring the use of concepts and shifts in historical meaning, albeit within different contexts and with diverse inquiries. One project, ‘Terrorism in Swedish Politics’ (2020–2026, see Edlund et al. 2022), examines the framing of political terror within parliamentary discourse 1968–2018, while the other ‘The Market Language’ (2022–2025, see Ohlsson et al. 2022) investigates the discourse surrounding markets, spanning from the Middle Ages to contemporary times. However, despite having disparate focuses, previous results from both projects strongly suggest that compound words and compound neologisms play a significant role in the formation and development of different political discourses.

Hence, the overarching aim of this project-wide paper is to explore and showcase the analytical significance of compounds in navigating new concepts and phenomena within parliamentary discourse. More specifically, we argue that employing corpus linguistic methods in conjunction with conceptual history perspectives to text mining of compounds is an underdeveloped analytical approach in historical text analysis, exemplifying its relevance with two case studies of the transformations of the concepts of market and terror, respectively, based on Swedish parliamentary datasets. Drawing upon a combination of distant and close reading, we will describe patterns of compound use in the data sets at hand, and examine the context of the compounds in a more nuanced way, deepening the historical contextualization of the significance of compounds in concept development.

The analytical significance of compounds

The ‘Terrorism in Swedish Politics’ and ‘The Market Language’ projects are multidisciplinary endeavours that integrate analytical methods for processing large textual datasets with inquiries into actual language usage and the utilisation of particular concepts within the political texts under examination. In this sense, the two projects follow previous research that has explored the development of key concepts in Swedish parliamentary datasets, drawing upon the application of statistical measurements (Jarlbrink et al. 2022; Norén et al. 2022).

However, the present paper will specifically focus on the analytical work in the two projects that concerns compound neologisms, framing the issue in the wider context of how multi-word phrases are expressed in different languages. N-grams often serve as a widely accepted method for describing recurrent multi-word combinations in natural language, both in spoken and written discourse, and are frequently employed as units of analysis in numerous studies (Lyse & Andersen 2012). N-grams are a useful tool for mapping recurring multi-word combinations, but they do not capture the combinatory potential and the ease of creating neologisms that characterize Swedish. Notably, the creation of new words through the synthetic amalgamation of different terms into compounds, functioning as cohesive lexical units, is a distinctive morphological trait of the Swedish language (Finkbeiner and Schlücker 2019). While similar morphological patterns can be observed in other Germanic languages, including the Nordic languages, Dutch and English, the latter is characterised by fuzzy boundaries between compounds and multi-word expressions (Bauer 2019). This is reflected in irregularities in the spelling of English compounds, which are often represented orthographically as separate lexical units (cf. the English compound labour market and its Swedish equivalent arbetsmarknad). Orthographic irregularities of this kind considerably narrow the potential for detecting compounds computationally in large datasets. Thus, the paper will highlight the analytical potential of compounds in discourse analysis, based on how Swedish allows for and makes visible the creation of compounds by combining existing words into new morphologically coherent lexical units. We will argue that this particular form of lexical composition holds relevance as a unit of analysis in computational linguistic studies as well as a discursive phenomenon, offering concentrated semantic information compared to simplex nouns, which often need to be embedded in multi-word expressions to convey similar meaning.

Through a close examination of how our words of interest (‘terrorism’ and ‘market’) co-occur with other words in compounds, we are able to describe semantic patterns of word usage that include evaluative approaches and attitudes. Key analytical tools in our context include word frequency analysis, examination of keyword collocations, and exploration of multi-word expressions, such as phrases, involving the focal keywords. The latter often reveal entrenched phraseological relations for a word, providing insights into its lexical, syntactic, and semantic characteristics (Koteyko et al. 2010). We argue that this is crucial for understanding the discursive use of concepts – how words are situated within specific contexts or text genres – and contributes to the creation of a recurring perception of the phenomena they represent.
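Because Swedish writes such compounds as single orthographic words, a first rough pass at these patterns can be as simple as counting the distinct compound types that contain a focal stem, year by year. A minimal sketch under that assumption follows; the input file, regular expression and stem are illustrative, and real analysis would add lemmatization and manual filtering of false hits:

```python
# Rough sketch: count distinct compound types containing a focal stem per year.
# The input layout, the stem and the tokenization are illustrative assumptions;
# real analysis would add lemmatization and manual filtering of false positives.
import re
from collections import defaultdict

import pandas as pd

tokens = pd.read_csv("riksdag_tokens.csv")  # hypothetical columns: year, token

stem = "marknad"  # e.g. arbetsmarknad, bostadsmarknad, marknadsekonomi
pattern = re.compile(rf"\w+{stem}\w*|{stem}s?\w+", re.IGNORECASE)

types_per_year: dict[int, set[str]] = defaultdict(set)
for year, token in zip(tokens["year"], tokens["token"].astype(str)):
    if pattern.fullmatch(token.lower()):
        types_per_year[year].add(token.lower())

# Number of distinct compound types per year as a crude productivity measure.
productivity = pd.Series({year: len(forms) for year, forms in sorted(types_per_year.items())})
print(productivity)
```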

Material

Both projects utilise publicly available Swedish parliamentary datasets. The Market Language project case for this paper draws from the complete data set of texts from the Bicameral Parliament of 1867 to 1970. This data set has been available through the Swedish parliament website in .pdf and .xml formats for some years, but it has been downloaded by the project’s LT analysts and has also been processed, annotated and included in the Språkbanken Text infrastructure as a subcorpus, “Tvåkammarriksdagen”, with the possibility to employ sub-genre categories for specialised searches. The period of the Swedish Bicameral Parliament from 1867 to 1970 is of particular interest for the project, since it covers a time dominated by industrialization and economic change, when the foundations for a modern market economy and modern Swedish democracy were laid.

The work in the SweTerror project is based on the dataset of the corpus of the minutes provided by Westac and SWERIK, which is currently being cleaned up, partly re-digitised and curated for research purposes (latest version 0.14). Notably, the dataset is annotated with metadata about Members of Parliament (MPs) concerning name, party affiliation, gender and regional representation. There is also an ongoing exchange between SweTerror and Westac, with SweTerror’s LT analyst Olsson (on the advisory board of SWERIK) further enriching and curating the data for our research purposes and contributing various forms of quality control. The reason for the longer study period, extending beyond 1970 in contrast to the Market case study, is that terrorism went through drastic changes during the 1970s and the early 2000s.

Case Study 1: Market

The results from the Market Language Project so far are primarily based on the productivity of compounds, both in terms of their occurrence and frequency in the current material and in terms of the emergence of new compound forms that begin to be used over time in the text material of the Bicameral Parliament. We have previously discussed this productivity aspect of compositional forms in Ohlsson et al. (2022). The earlier results indicate that new compositional forms that exert significant influence in terms of usage represent areas that gain prominence in political debates and also have the capacity to generate further compositional forms.

A trend that we further explore in this paper is the consistent increase in new compositional forms featuring "market" as an element, spanning the period from 1867 to 1970 in the parliamentary texts at hand, with an accelerating rate particularly evident from around 1920 onwards, and later also an increase in the post-WW2 period aligned with the so-called Swedish model of combining a free market economy with redistributive politics. For instance, the emergence of the compound arbetsmarknad (labour market) leads to the creation of additional compositions based on that compound. These patterns of compositional productivity serve as a foundation for discussing the growing utilisation of the concept of market in political discourse in general and the attribution of new properties and roles to the concept itself.

Case Study 2: Terrorism

The second case study will focus on the development of the closely related words ‘terror’ and ‘terrorism’ when they appear as constituents in compounds, as manifested in the Swedish parliamentary discourse, 1867–2018. We have previously shown that although the word terrorism has been used since 1867, terror-related words and compounds first gained traction from 1918 and onwards with the word terrorism gaining its modern meaning in the early 1970s. More specifically, we found a distinct legislative framing of the issue of terrorism in the Swedish Parliament with 9/11 in 2001 serving as a watershed moment for the rise in the production of compound neologisms and a stronger counterterrorism discourse (Fridlund et al. 2022; Brodén et al. 2023). We also observed an increase in the production of compound neologisms with ‘terrorism’ from 2015 and onwards, likely resulting from the emergence of the Islamic State and a range of terrorist attacks in Europe (Ängsal et al. i.p).

Besides chronologically tracing compound neologisms, we will apply word vectors to examine discursive transformations more deeply, looking at other terms that carry similar meaning to the neologisms. Furthermore, by taking into account metadata about the party affiliation of MPs who use neologisms, we will be able to focus on the extent to which different political parties have used different compounds, allowing for a more multidimensional perspective on the role of neologisms in the development of the Swedish parliamentary discourse on terrorism.

Conclusions

We conclude by drawing together the two projects’ specific lines of inquiry, highlighting compound neologisms as a morphological feature primarily of Germanic languages and, thus, as a potential key to opening up methodological perspectives on text mining the formation and development of political discourses. In our engagement with the data we have been able to discuss and showcase the potential of compound neologisms as a front door to discursive change over time. One such example is how the market concept has ‘colonised’ different political domains through new types of compounds, such as bostadsmarknad (housing market). Another example is the rise of novel terrorism compound types such as terrorresa/terroristresa/terrorismresa (terrorism travel) after the emergence of the Islamic State, indicating how the parliamentary engagement with political violence has changed discursively. Our contribution will thus feed into ongoing discussions about text mining approaches for tracing discursive transformations in large-scale text collections.

Brodén-The Politics of Compound Neologisms-244.pdf
 
3:00pm - 5:30pmTUTORIAL2-2: HALF-DAY TUTORIAL (LyX)
Location: K-207 [2nd floor]
Date: Wednesday, 29/May/2024
8:00am - 8:30amArrival & registration
8:30am - 10:00amWS6-1: HALF-DAY WORKSHOP (CATALOGUE AS DATA)
Location: K-205 [2nd floor]
Session Chair: Rossitza Ilieva Atanassova, British Library, United Kingdom
 

Catalogues as Data for Computational Analysis

Rossitza Ilieva Atanassova, Harry Lloyd

British Library, United Kingdom

Title

Catalogues as Data in Library practice and research: A Use Case with the British Library Printed Catalogue of Books Published in the 15th Century.

Keywords: data, tools, corpus linguistics, metadata enrichment, transcription

Target Audience

20 maximum, laptops required

  • Librarians and curators interested in computational methods for working with collections metadata

  • Cultural Heritage and Higher Education professionals who support digital research projects

  • Researchers interested in library professional practice around data

  • Those with subject specific interest in incunabula collections or catalogues

Session format

The workshop will demonstrate to both cultural heritage professionals and researchers how bibliographic data can be generated from printed catalogues and disseminated for library and research use using computational approaches. It will cover the methodology, process and tools used to transform printed catalogue descriptions into data to be used for computational analysis and metadata enrichment.

Catalogues occupy a central place in the work of GLAM professionals and are an essential resource for users, including digital humanities researchers. They are not only an important aid to accessing the collections but are scholarly resources in their own right, revealing institutional cataloguing and curatorial practice. We will showcase how computational analysis of catalogue data by library professionals can reveal new insights about their holdings and historical collecting practices, and how this can contribute to the efforts of cultural heritage and HE institutions to improve the accessibility and inclusivity of their collections and data for diverse audiences. We will demonstrate one particular approach, namely a method for gaining new insights into historical cataloguing and curatorial voice through linguistic analysis.

For this workshop we use the outputs from a research project funded under the Research Libraries UK and Arts and Humanities Research Council Professional Practice Fellowship scheme that enables library professionals to be active participants in research. The project digitised and extracted descriptions from volumes 1-10 of the Catalogue of books printed in the 15th century now at the British Museum (1908-1974) and prepared the data for corpus linguistics analysis and enrichment of the British Library’s online catalogue. The incunabula catalogue dataset is published on the BL Shared Research Repository.

The workshop fits in with several of the conference themes and puts an emphasis on the life-cycle of creating and using cultural heritage collections as data and the importance of cross-institutional collaboration. We will reflect both on our experience of working with historical printed catalogues as data and on how the project benefited from close collaboration with printed heritage curators at the Library and Digital Humanities colleagues at the University of Southampton. The corpus linguistics approach we are using was first tested by a previous research project which co-produced training materials with input from GLAM professionals in the UK. We have shared this method and our learning with colleagues at the British Library and the Research Libraries UK community, and we want to disseminate our practice to the DH Nordic and Baltic community of GLAM professionals and DH researchers.

Coordinators

The workshop will be delivered by members of the Digital Research Team at the British Library.

  • Rossitza Atanassova, Digital Curator, has worked on major digitisation projects and supports digital scholarship activities at the Library with focus on access and reuse of digital collections. She was the RLUK Professional Practice Fellow 2022 who led the project with the incunabula data.

  • Harry Lloyd, Research Software Engineer, has a background in Chemistry and Data Science and supports a range of digital research projects to enable enrichment of the Library’s collections. Harry maintains the code for the derived data.

Agenda

The workshop includes a combination of presentations and hands-on exercises.

  • Welcome [15 min]:

    • A brief overview of the concept of Catalogues as Data.

  • Demonstration: Transforming Printed Catalogues into Data: [20 min]:

    • Use case

    • Data preparation: training data, transcription (OCR) with Transkribus and code development

    • Data outputs


  • Hands-on Exercise:

    • Guided exploration of catalogue entry extraction from transcribed text using a Jupyter notebook [40 min]

Comfort Break [20 min]

  • Demonstration: Introduction to the corpus analysis tool AntConc: [20 min]

    • Set up

    • Data import

    • Tools and Setting

  • Hands-on Exercise:

    • Guided exploration of the text data with AntConc [40 min]

  • Wrap up discussion [15 min]

Technical requirements

No prior knowledge is required. Participants will need to bring their laptops and have internet access.

Before the workshop participants will be sent detailed instructions on how to download the data they will work with and how to pre-install the corpus linguistics analysis tool. Help will be offered during the session.

As an optional pre-workshop activity participants could take a look at existing training materials on Computational analysis of catalogue data with AntConc.

Learning Outcomes

Participants will understand:

  • An approach to working with collection catalogues as data, from transforming digitised images into text data to exploring that text data with computational tools.

  • The workflow that takes digitised images, transcribes them, and processes the transcribed text into logical units ready for corpus linguistic analysis.

  • The potential for computational analysis of catalogue records and collection metadata to provide professional and scholarly insights for library professionals and researchers alike.

Participants will be able to:

  • Use code to parse an XML file containing transcribed catalogue text into catalogue entries (a minimal sketch follows after this list).

  • Load corpus text data into AntConc and carry out basic linguistic analysis.
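To give a flavour of that first outcome, here is a minimal sketch that splits transcribed catalogue text into entries. It assumes a simplified XML layout with one <line> element per printed line and an entry-start heuristic based on a shelfmark-like pattern; the element names and rules in the actual workshop notebook will differ:

```python
# Sketch: split transcribed catalogue text (one <line> element per printed line)
# into catalogue entries. The XML layout and the entry-start heuristic are
# simplified assumptions; the workshop notebook's actual rules will differ.
import re
import xml.etree.ElementTree as ET

tree = ET.parse("catalogue_volume.xml")  # hypothetical transcription export
lines = [el.text.strip() for el in tree.iter("line") if el.text and el.text.strip()]

# Assume each entry begins with a shelfmark-like token such as "IB.12345."
entry_start = re.compile(r"^I[ABC]\.\s?\d+")

entries, current = [], []
for line in lines:
    if entry_start.match(line) and current:
        entries.append(" ".join(current))
        current = []
    current.append(line)
if current:
    entries.append(" ".join(current))

print(f"{len(entries)} catalogue entries extracted")
```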

 
8:30am - 10:00amWS7-1: HALF-DAY WORKSHOP (DIGITAL LEXICOGRAPHY)
Location: K-206 [2nd floor]
Session Chair: Tarrin Wills, University of Copenhagen, Denmark
 

Digital historical lexicography in the Nordic languages

Tarrin Wills1, Simonetta Battista1, Johnny Lindholm1, Ellert Þór Jóhannsson2

1University of Copenhagen, Denmark; 2University of Iceland

Coordinators: Simonetta Battista, Johnny Lindholm, Ellert Þ. Jóhannsson, with expected contributions from Tarrin Wills, Steinþór Steingrímsson and Trausti Dagsson

Target audience: The workshop is open to all interested in integrating digital lexicography, particularly historical lexicography of the Nordic languages (Icelandic, Norwegian, Danish, Swedish and Faroese).

Anticipated number of participants: 10

Ideal length: Half-day, preferably close to the start of the conference

Technical requirements: Digital projector (PowerPoint)

Expected learning outcomes: Participants will have gained an understanding of the digital landscape for Nordic historical lexicography and its challenges, as well as worked towards new models for interoperability for projects in this area.

Session 1: Review of existing projects and applications/technologies

Session facilitator/ chair: Simonetta Battista and Johnny Lindholm

Presentation of ongoing and recent digital dictionary projects.

The goal of this session is to gain an overview of digital projects and technologies in the field of historical Scandinavian lexicography in order to lay the foundation for developing methods and standards for interoperability.

Content will include scope, applications for editing, publishing and sharing data, technologies employed with an evaluation of benefits, issues and challenges.

1.1. Presentations (1 hour)

Projects presented will include:

  • Dictionary of Old Norse Prose, Copenhagen

  • Árnastofnun dictionaries: Ritmálssafn, Íslensk orðsifjabók, etc.

  • Digitised print dictionaries: ODS, Fritzner, Cleasby and Vigfusson, etc.

Additional participants may be invited to present their own projects.

1.2. Discussion (30 minutes)

Discussion topics for closing of session:

  • What do the projects have in common that can form the basis of future collaboration?

  • What technologies /standards are emerging as the dominant ones in this area of digital lexicography?

  • What methods and technologies provide potential obstacles for interoperability?

Break (30 minutes)

Session 2: Key challenges in digital integration of resources

In this session all participants will be invited to contribute to a discussion of specific issues relating to integration of resources. The outcome should be a roadmap and/or recommendations for different projects to begin work on technologies for integrating their digital materials.

2.1. Interoperability of word/lemma lists (30 minutes)

Discussion facilitator: Ellert Þór Jóhannsson

In order to enable cross-language searching of cognate words, methods will need to be developed for linking together cognate words in the different languages, that is, in older and younger stages of the various Scandinavian languages.

Discussion topics:

  • Is a common Scandinavian word/lemma list feasible?

  • What challenges emerge from the different projects’ approaches to their own lemma lists?

  • What are the linguistic challenges (definitions of homography etc.)?

  • What model could be used for linking the different projects’ lemma lists?

  • How could a resulting combined lemma list be managed and maintained?

2.2. General standards for interoperability (1 hour)

Discussion facilitator: Tarrin Wills

Linking of external resources requires common standards and technologies. We will start by reviewing some of the existing approaches that might be applicable here (TEI, Ontolex, Elexis, Wordnet, etc.) and then proceed with a discussion in these parts:

  1. Requirements: What is the primary goal of interoperability? To facilitate further lexicographic editing? To enable research, both quantitative and qualitative? To reach new audiences for historical lexicography and the languages they cover?

  2. End-user scenarios: Depending on the primary goal, how do we envisage an end-user accessing lexicographic resources through interoperable applications? What would a web portal look like? How would existing resources be integrated into other, external applications? How could the resources be accessed digitally using programming languages (APIs)?

  3. Roadmap: Which standards should be deployed to begin working towards these scenarios? What features are missing from existing standards but required for integration of our resources?

Wills-Digital historical lexicography in the Nordic languages-150_a.pdf
Wills-Digital historical lexicography in the Nordic languages-150_b.pdf
 
8:30am - 10:00amTUTORIAL3-1: HALF-DAY TUTORIAL (JUPYTER NOTEBOOKS)
Location: K-207 [2nd floor]
Session Chair: Gustavo Candela, University of Alicante, Spain
 

Reusing digital collections from GLAM Labs: a Jupyter Notebook approach

Gustavo Candela1, Mirjam Cuper2, Olga Holownia3, Max Pedersen4

1University of Alicante, Spain; 2National Library of the Netherlands; 3International Internet Preservation Consortium; 4Royal Danish Library, Denmark

For decades, GLAM organizations have been exploring new ways to make their collections available. Recent methods for publishing digital collections through initiatives such as “GLAM Labs” (glamlabs.io) and “Collections as Data” (collectionsasdata.github.io) have focused on the adoption and reuse of computational access methods. Over the past years, cultural heritage institutions have gradually been offering access to digital collections containing different materials such as metadata, text, and images. In this context, Jupyter Notebooks have emerged as a powerful tool for providing additional documentation about the collections for digital humanities researchers. GLAM institutions have started to employ Jupyter Notebooks as a new approach to demonstrate how reusers can access and experiment with datasets derived from their collections. The International GLAM Labs Community compiled a selection of projects provided by relevant institutions such as the Data Foundry at the National Library of Scotland (data.nls.uk), the National Library of Luxembourg (data.bnl.lu), the Library of Congress (data.labs.loc.gov), and the Austrian National Library (labs.onb.ac.at/en/datasets); these are available on the website in the section dedicated to computational access (glamlabs.io/computational-access-to-digital-collections). Some members of the community have also published a research article providing a methodology to assess the quality of these Jupyter Notebook projects [1] and a checklist for publishing collections as data [2].

The main goal of the workshop is to demonstrate how Jupyter Notebooks can be used as a tool for working with a dataset derived from a library collection. Following other approaches such as the GLAM Workbench (glam-workbench.net) [3], a new Jupyter Notebook collection will be made available before the conference and used during the workshop as the content for the practical exercises, providing examples of how to reuse digital collections. The collection will be made available through GitHub and will be prepared to be opened in an executable environment such as Binder, making the code reproducible. Additional tools will be introduced, such as Python environments (e.g., conda) and libraries for working with tabular data and natural language processing (e.g., pandas and NLTK: Natural Language Toolkit). This work intends to foster the use of Jupyter Notebooks in the GLAM context as well as provide an introduction for DH researchers and anyone interested in working with digital collections.
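As a taste of the kind of exercise planned, the following minimal sketch combines pandas and NLTK on a small derived dataset; the file name and column are placeholders rather than a specific library's collection:

```python
# Minimal sketch: load a derived GLAM dataset with pandas and run a simple
# NLTK word-frequency analysis. The file name and column are placeholders.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

nltk.download("punkt", quiet=True)      # tokenizer models for word_tokenize
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK releases

records = pd.read_csv("newspaper_texts.csv")  # hypothetical column: "text"

tokens = []
for text in records["text"].dropna():
    tokens.extend(w.lower() for w in word_tokenize(text) if w.isalpha())

# Twenty most frequent word forms in the collection sample.
print(FreqDist(tokens).most_common(20))
```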

The workshop will be structured as follows:

  1. [15 mins] Introduction to the workshop
  2. [15 mins] Introduction to the computational access section on the GLAM labs website (https://glamlabs.io/computational-access-to-digital-collections/). Several aspects will be covered including the projects, how the information about the projects is introduced in Wikidata to create charts as well as how the new section was implemented.

  3. [2 hours] Practical exercises will cover the following steps:

  • opening a selection of Jupyter Notebooks in an executable environment using a web browser as the main tool
  • notebooks' structure (code and markdown cell), and how users can add, run and remove cells
  • learning how to create a project of Jupyter Notebooks from scratch, and understanding the tools, platforms and services that can be used.
  • a brief installation guide for Python and environments such as Anaconda.
  4. [25 mins] Presentation of the article “An approach to assess the quality of Jupyter projects published by GLAM institutions” recently published in the Journal of the Association for Information Science and Technology. This article describes a methodology to assess the quality of Jupyter Notebook projects made available by GLAM institutions. Based on the best practices and guidelines, it provides a list of steps to follow in order to assess a Jupyter Notebook project.

  5. [5 mins] Wrap-up

Workshop coordinators

  • Gustavo Candela, University of Alicante
  • Olga Holownia, International Internet Preservation Consortium
  • Max Odsbjerg Pedersen, Royal Danish Library
  • Mirjam Cuper, National Library of the Netherlands

Format: on-site tutorial

Target audience: DH and CS researchers, librarians, archive staff, university staff and students

Number of participants: 20-25 max.

Technical requirements: Internet connection and laptops. Prior knowledge about the use of Jupyter Notebooks and Python programming language is not required but recommended. If participants want to develop and run the Jupyter Notebooks on their computers, they should install Python or an environment such as Anaconda before the workshop. The web browsers recommended are the latest versions of Firefox or Google Chrome.

Learning outcomes

The following list describes the learning outcomes of this workshop:

  • create awareness of the relevance of Collections as Data in the context of GLAM
  • appreciate the usefulness of Jupyter Notebooks in the GLAM context and its history
  • analyze the structure of a digital collection suitable for computational use
  • understand the steps involved when creating a collection of Jupyter Notebooks
  • understand the structure of a Jupyter Notebook and how to use it
  • create awareness of the relevance of documentation

References

[1] Candela, G., Chambers,S., and Sherratt, T. (2023), “An approach to assess the quality of Jupyter projects published by GLAM institutions”, J. Assoc. Inf. Sci. Technol. 74(13): 1550-1564. https://doi.org/10.1002/asi.24835

[2] Candela, G., Gabriëls, N., Chambers, S., Dobreva, M., Ames, S., Ferriter, M., Fitzgerald, N., Harbo, V., Hofmann, K., Holownia, O., Irollo, A., Mahey, M., Manchester, E., Pham, T.-A., Potter, A. and Van Keer, E. (2023), "A checklist to publish collections as data in GLAM institutions", Global Knowledge, Memory and Communication, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/GKMC-06-2023-0195


[3] Sherratt, T. (2021), “GLAM Workbench (version v1.0.0)”, Zenodo. https://doi.org/10.5281/zenodo.5603060

Candela-Reusing digital collections from GLAM Labs-204_a.pdf
Candela-Reusing digital collections from GLAM Labs-204_b.pdf
 
8:30am - 10:00amWS5-1: HALF-DAY WORKSHOP (LABS)
Location: K-208 [2nd floor]
 

Navigating Digital Landscapes: Experiences from A Decade of Engaging with users of Digital Humanities at the National Library of Norway

Lars G Johnsen, Jana Sverdljuk, Ingerid Løyning Dale, Lars Magne Tungland, Marie Iversdatter Røsok

National Library of Norway, Norway

This workshop will explore the evolution of the National Library of Norway's Digital Humanities Laboratory over the past decade. Based on the laboratory's rich history of collaboration and engagement, it will delve into the lessons learned from its interactions with academic institutions and with humanities and social sciences communities, focusing on the main theme: interacting with users. Workshop participants will reflect on and demonstrate the integration of user-friendly elements in the digital humanities laboratory’s functioning, such as: 1) providing accessibility of data, 2) fostering usability of analysis and 3) assisting with interpreting results. These elements correspond to the main stages of digital humanities projects that the laboratory walks through together with its users: from bachelor, master and PhD students to academic units and institutions.

1) By providing preprocessed and easily accessible datasets, the Digital Humanities Laboratory streamlines the initial stages of research for academic communities. Researchers can focus on analysis and interpretation rather than spending significant time on data preprocessing.

2) Creating applications with user-friendly interfaces democratizes access to digital humanities tools. Academics without extensive technical expertise can actively engage in digital research, broadening participation across disciplines.

3) The applications are supplemented with descriptions of the methods used, which helps secure a better understanding of the results and initiates reflection on the combination of quantitative and qualitative methods. Sufficient explanation of the analysis outcomes encourages digital humanists to engage in meaningful conversations with scholars from traditional humanities disciplines, fostering a mutually beneficial relationship.

The workshop will consist of three parts:

Ensuring Data Accessibility (45 mins):

We will showcase the current methods employed to make data available for users, highlighting text mining and big data analysis.

The National Library of Norway has a large collection of digitized published works, which have gone through an extensive processing pipeline:

  • Large scale image scanning of books, magazines, newspapers, and other publications

  • Optical Character Recognition

  • Text extraction from the resulting files into indexed databases, allowing for full-text search

  • Tokenization, splitting text into word tokens, allowing for quantitative analysis at the word level (see the sketch after this list)

  • Constructing metadata databases, allowing for metadata search and corpus construction
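
A minimal sketch of the tokenization and word-level counting idea mentioned above, using a toy in-memory structure rather than the library's actual pipeline or indexes:

    # Toy illustration of tokenization and word-level frequency counting
    # (illustrative only; not the National Library's production pipeline).
    import re
    from collections import Counter

    extracted_pages = [
        {"year": 1900, "text": "Det var en gang en konge."},
        {"year": 1950, "text": "Kongen og dronningen reiste til byen."},
    ]

    counts_per_year = {}
    for page in extracted_pages:
        tokens = re.findall(r"\w+", page["text"].lower())  # naive word tokenization
        counts_per_year.setdefault(page["year"], Counter()).update(tokens)

    print(counts_per_year[1900].most_common(3))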

We will connect the showcased techniques to the DH-lab's history, illustrating how they have evolved based on user feedback and engagement. A key point of our interfaces is that we can share and present the data without infringing on copyrights.

  • User-facing application programming interfaces (APIs)

  • Workshops with researchers and students

  • Jupyter Notebooks with Python and pandas

  • Project collaborations

  • Apps

Example datasets: bibliographies

Creating User-Friendly Interfaces and Applications (60 mins):

This session explores the innovative approaches the National Library of Norway has adopted in creating user-friendly interfaces and applications for digital humanities projects.

We'll showcase our DH-lab app development methodology, which incorporates the use of Flask APIs and the dhlab Python library, alongside the integration of Streamlit for interactive, web-friendly applications.
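
As a rough illustration of how such a stack fits together, the sketch below shows a minimal Streamlit front end that queries a backend HTTP API and displays the result; the endpoint URL and response shape are placeholder assumptions, not the DH-lab's actual API (which is documented at https://www.nb.no/dh-lab/):

    # freq_app.py - minimal Streamlit sketch; run with: streamlit run freq_app.py
    # The endpoint and JSON shape below are hypothetical placeholders.
    import requests
    import streamlit as st

    st.title("Word frequency lookup")
    word = st.text_input("Search term", value="demokrati")

    if word:
        # Hypothetical backend (e.g. a Flask API) returning a list of {"year": ..., "count": ...} records
        response = requests.get("https://example.org/api/frequencies", params={"word": word})
        st.dataframe(response.json())  # rendered as an interactive table in the browser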

With our indexed databases of digital texts, we have developed a suite of apps for analysing reduced representations of the data:

  • Corpus construction: Lists of document IDs and their metadata

  • N-gram statistics: Frequency counts of uni-, bi- and trigrams over time

  • Concordances: Immediate contexts of specific search terms

  • Collocations: Aggregated concordances, ranked by word similarity scores calculated with pointwise mutual information (see the sketch after this list)
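
A minimal sketch of the collocation-ranking principle, using a toy text and a simple co-occurrence window; it illustrates pointwise mutual information in general, not the DH-lab's implementation:

    # Toy collocation ranking by pointwise mutual information (PMI); illustration only.
    import math
    from collections import Counter

    tokens = "the library digitises books and the library indexes books".split()
    window = 2  # co-occurrence window size (an assumption for this example)

    unigrams = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        for c in tokens[i + 1:i + 1 + window]:
            pairs[(w, c)] += 1

    total_words = sum(unigrams.values())
    total_pairs = sum(pairs.values())

    def pmi(pair):
        w, c = pair
        joint = pairs[pair] / total_pairs
        return math.log2(joint / ((unigrams[w] / total_words) * (unigrams[c] / total_words)))

    for pair, _ in pairs.most_common():
        print(pair, round(pmi(pair), 2))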

The session will be a hands-on experience, inviting participants to navigate through these interfaces. This practical exploration aims to provide an immersive understanding of the user-centric design principles we've implemented over the years.

The apps are available from the National Library’s official website: https://www.nb.no/dh-lab/apper/

Combining Quantitative and Qualitative Methods (45 mins):

Showcasing collaborative projects that integrate quantitative methods (text mining, collocation analysis, georeferencing, topic modelling) with qualitative methods (e.g. discourse analysis).

  • Analysing examples of papers using quantitative and qualitative methods

  • Combining tools for quantitative research and critical discourse analysis

  • Reflecting on the evolution of these integrative methodologies, considering insights gained from papers and user involvement in educational and outreach events

  • Examples from bachelor students’ involvement with digital humanities tools at the NLN

Target audience: Academics, researchers, librarians, and professionals interested in incorporating digital humanities methods into their research. Both technical and non-technical participants are welcome.

Outcomes: Attendees will learn data preprocessing techniques for efficient analysis, navigate user-friendly interfaces, and engage in hands-on exercises combining quantitative and qualitative methods. Participants will leave with practical skills, a deeper understanding of the digital humanities landscape, and the confidence to integrate these methods into their academic research endeavours and institutions.

Anticipated Number of Participants: ideally 20-30 participants, to facilitate meaningful interactions.

Technical Requirements: Participants can use their laptops or mobile phones to work directly from the DH lab’s webpage: https://www.nb.no/dh-lab/

 
10:00am - 10:30amBreak
Location: Háma
10:30am - 12:00pmWS6-2: HALF-DAY WORKSHOP (CATALOGUE AS DATA)
Location: K-205 [2nd floor]
Session Chair: Rossitza Ilieva Atanassova, British Library, United Kingdom
10:30am - 12:00pmWS7-2: HALF-DAY WORKSHOP (DIGITAL LEXICOGRAPHY)
Location: K-206 [2nd floor]
Session Chair: Tarrin Wills, University of Copenhagen, Denmark
10:30am - 12:00pmTUTORIAL3-2: HALF-DAY TUTORIAL (JUPYTER NOTEBOOKS)
Location: K-207 [2nd floor]
Session Chair: Gustavo Candela, University of Alicante, Spain
10:30am - 12:00pmWS5-2: HALF-DAY WORKSHOP (LABS)
Location: K-208 [2nd floor]
12:00pm - 1:00pmLight lunch & registration
1:00pm - 1:20pmWelcome: Opening remarks
Location: Skriða [1st floor]
1:20pm - 2:35pmOpening Keynote: Sally Chambers: From Collections as Data experiments to sustainable Data Services: experiences at the intersection of cultural heritage and digital humanities
Location: Skriða [1st floor]
2:35pm - 2:45pmShort break
Location: Háma
2:45pm - 4:30pmPanel: Publication and reuse of digital collections: A GLAM Labs approach
Location: H-207 [2nd floor]
Session Chair: Mahendra Mahey, Tallinn University, Estonia
 
2:45pm - 4:15pm

Publication and reuse of digital collections: A GLAM Labs approach

Gustavo Candela1, Sally Chambers2, Nele Gabriëls3, Katrine Hofmann Gasser4, Olga Holownia5, Lars Johnsen6

1University of Alicante, Spain; 2DARIAH, Belgium; 3KU Leuven, Belgium; 4Royal Danish Library, Denmark; 5IIPC, United States of America; 6National Library of Norway


For decades, GLAM institutions (Galleries, Libraries, Archives and Museums) have been exploring new ways to make their digital collections available. They host a wide diversity of rich content including, for example, maps, images, born-digital materials, text, audio and video materials, available in many forms in terms of access and copyright. Recent advances in technology based on Artificial Intelligence and Machine Learning have provided a new context in which data-level access has become a crucial aspect of engaging with the research community. GLAM institutions can play a relevant role in this new context based on their expertise and knowledge as curators and content publishers [1], and their efforts can be maximised by the recently established common European data space for cultural heritage [2].

New initiatives such as Collections as Data [3] and the International GLAM Labs Community [4] have recently emerged in the cultural heritage sector to promote the publication of digital collections suitable for computational use as well as the reuse of content in innovative ways. Following their principles, a growing number of cultural heritage institutions have been making their digital collections available under open licenses, releasing prototypes and creating sandboxes for researchers. Some examples include the Data Foundry at the National Library of Scotland, the Library of Congress Labs, and the British Library Labs [5]. Inspired by previous approaches focused on the use of Jupyter Notebooks, such as the GLAM Workbench [6], several institutions have started to use notebooks to make documentation and code based on their digital collections available [7]. In addition, a checklist describing the steps to publish Collections as Data, focused on small and medium-sized GLAM institutions, has recently been published as a collaborative effort by the International GLAM Labs Community [8, 9].

These efforts provide an extensive demonstration of different initiatives to publish and reuse digital collections suitable for computational use. However, GLAM institutions need guidance in order to meet the current and emerging needs of the research community covering the following aspects: i) data workflows and checklists to provide data level access; ii) data quality in terms of content (e.g., OCR) and metadata; iii) documentation about the digital collections; and iv) reproducible examples of use.

The purpose of this panel is to introduce the work performed in the context of the International GLAM Labs Community to help GLAM organizations adopt best practices when using new trends such as Collections as Data. This proposal fits several of the conference topics, including “creating and using cultural heritage collections as data: workflows, checklists, tools” and “the reproducibility and repurposing of data, workflows, and lessons learned”.

Format

The speakers representing the GLAM Labs community will provide an introduction to the concepts and practices mentioned above, with a particular focus on the checklist to publish collections as data as well as the planned next steps. This will be followed by two case studies looking at lessons learned from Library Labs, and an overview of the potential of the European data space for cultural heritage. In the presentations and the 30-minute Q&A part of the panel, we will cover the following questions: i) what is computational access and how can it be achieved in small and medium-sized institutions?; ii) how can GLAM institutions provide documentation and reproducible examples of use based on their digital collections?; iii) what are the steps and best practices to publish digital collections suitable for computational use?; and iv) how to establish a community to share ideas and knowledge about GLAM? In addition, future work will be explored regarding potential research lines for the International GLAM Labs Community.

Proposed format:

Presentations

Sally Chambers & Olga Holownia: Introduction to GLAM Labs as a community, recent projects and next steps [12 minutes]

Gustavo Candela & Nele Gabriëls: Introduction to Checklist for Publishing Collections as Data [12 minutes]

In order to support GLAM institutions in meeting the needs of the research community, a checklist for preparing collections as datasets for computational use was created. It is set up as a tool for GLAMs to leverage their digital assets for digital scholarship. This presentation will talk about how the checklist was developed based on input from the community through a survey of their needs. The checklist will be presented as well as a brief case study, offering both GLAM professionals and DH researchers insight into the principles and their implementation.

Katrine Hofmann Gasser: Connecting people with data at KB Labs at the Royal Danish Library: lessons learned from collaborative projects and initiatives. [12 minutes]

KB Labs (labs.kb.dk) was set up in 2016 and for the past 8 years has focused on providing opportunities for students and researchers to work with library collections and on using the tools developed by the KB IT Department. This presentation will give a brief overview of the labs within the Library, how we work with data and how we connect with our users through collaborative projects. The presentation will also cover the most recent initiatives involving the AI Lab and AI politics.

Lars Johnsen: Connecting people with tools: lessons learned at the DH-lab at the National Library of Norway [12 minutes]

DH-lab assists scholars and students in the use of digital tools and methods. Since 2013, they have built a research infrastructure that allows for computational analysis in alignment with the FAIR principles, primarily through Jupyter notebooks and user-friendly web applications. This presentation will focus on the lessons learned from providing tools to researchers who work with digital collections provided by the National Library of Norway through the DH-Lab.

Sally Chambers & Gustavo Candela: Crossroads: the common European data space for cultural heritage [12 minutes]

Building on the experience of Europeana, the launch of a data space by the European Union has opened up new possibilities for the sharing and reuse of cultural heritage data. The data space has the potential to maximize the efforts carried out by cultural heritage institutions, also connecting them with wider academic and research communities. This presentation will focus on the new perspectives it can bring to the GLAM Labs Community.

Moderated Q&A discussion [30 minutes]

After a brief introductory round of questions about the attendees' experiences with GLAM collections as data - be it from the perspective of a GLAM institution or Lab, or that of a dataset user - the Q&A discussion will focus on the following topics/questions:

  • Access to data: key challenges for researchers and GLAM institutions

  • Solutions for scaling up and reproducibility

  • Next steps for organisations that have adopted a checklist.

References

[1] Research Libraries UK. A manifesto for the digital shift in research libraries, 2020, https://www.rluk.ac.uk/digital-shift-manifesto/.

[2] https://pro.europeana.eu/page/data-space-deployment

[3] Padilla, T., Allen, L., Frost, H., Potvin, S., Russey Roke, E., & Varner, S. (2019). Final Report --- Always Already Computational: Collections as Data (Version 1). Zenodo. https://doi.org/10.5281/zenodo.3152935

[4] Data Foundry at the National Library of Scotland: https://data.nls.uk/, the Library of Congress Labs: https://labs.loc.gov/, British Library Labs: https://labs.biblios.tech/.

[5] https://labs.biblios.tech/item-category/datasets/

[6] Tim Sherratt. (2021). GLAM Workbench (version v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.5603060

[7] Candela, G., Chambers, S., & Sherratt, T. (2023). An approach to assess the quality of Jupyter projects published by GLAM institutions. Journal of the Association for Information Science and Technology, 74(13), 1550–1564. https://doi.org/10.1002/asi.24835

[8] Candela, G., Gabriëls, N., Chambers, S., Dobreva, M., Ames, S., Ferriter, M., Fitzgerald, N., Harbo, V., Hofmann, K., Holownia, O., Irollo, A., Mahey, M., Manchester, E., Pham, T.-A., Potter, A. and Van Keer, E. (2023), "A checklist to publish collections as data in GLAM institutions", Global Knowledge, Memory and Communication, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/GKMC-06-2023-0195

[9] Mahey, M., Al-Abdulla, A., Ames, S., Bray, P., Candela, G., Chambers, S., Derven, C., Dobreva-McPherson, M., Gasser, K., Karner, S., Kokegei, K., Laursen, D., Potter, A., Straube, A., Wagner, S-C. and Wilms, L. with forewords by: Al-Emadi, T. A., Broady-Preston, J., Landry, P. and Papaioannou, G. (2019) Open a GLAM Lab. Digital Cultural Heritage Innovation Labs, Book Sprint, Doha, Qatar, 23-27 September 2019. https://glamlabs.io/books/open-a-glam-lab/

Candela-Publication and reuse of digital collections-216_a.pdf
Candela-Publication and reuse of digital collections-216_b.pdf


4:15pm - 4:30pm

Surveying cultural heritage data labs

Kaspar Beelen1, Marten Düring2, Danièle Guido2

1School of Advanced Study, University of London, United Kingdom; 2Centre for Contemporary and Digital History, University of Luxembourg

The “Always already computational. Collections as data” paradigm, coined by a research project of the same name, has since 2016 resonated strongly among libraries, archives and other GLAM institutions worldwide. Many of them strive to offer access to their data and are experimenting with public APIs, dedicated data labs, data dumps, and even closed computing environments. In parallel, the decidedly computational analysis of cultural heritage data has emerged as a vibrant subfield which has so far produced a dedicated journal, a conference series, workshops and monographs. We define the subfield “computational humanities” as a distinct user group of computer-savvy humanists who wish to analyze cultural (heritage) data at scale, harnessing advanced methods from data science and machine learning.

This short paper addresses the extent to which GLAM institutions succeed in meeting the needs of the research community. It is motivated by our ongoing work to create a data lab for the impresso project. impresso aims to break down national and institutional data silos, providing unified access to newspaper and radio archives in Western Europe. The forthcoming impresso data lab strives to facilitate access to such a complex, multilingual and multimodal collection, focusing especially on “programming historians” as a distinct user group.

The presentation will include an overview of the current state of the art in commercially and publicly funded cultural heritage data labs, present user requirements and researcher personas as well as transparency requirements. More specifically, this entails:

  1. A survey of data labs for computational humanities research: the first part of our analysis comprises a survey of existing data labs in the fields of cultural heritage and digital humanities. To determine the shape of the impresso data lab, we need to gather ideas and best practices. We investigate how data labs provide access to collections, e.g. via APIs, data dumps or other means, what types of information they make available (metadata, text, image) and to what extent these labs manage to integrate heterogeneous data (or provide access in parallel). Besides access, we inspect whether labs provide computational infrastructure to support the analysis and exploration of their data, for example by allowing users to spin up dedicated VMs or, less costly, Google Colab notebooks or Binder environments. We especially focus on the role of notebooks as a bridge between infrastructure and research applications (Melgar-Estrada et al. 2019).

  2. User requirements and researcher personas: after establishing what exists (in terms of data labs), we elicit user requirements from researchers interested in working with historical media archives at scale. Building an infrastructure doesn’t automatically mean it will be used by the community (Zundert, 2012). Therefore, in the second part of the presentation, we report on interviews conducted with researchers in the computational humanities. More generally, we will discuss how we envisage creating communities around the tools and models we develop as part of the data lab, and ensuring longer-term support and use (Arnold et al., 2019).

  3. Requirements for transparency and data criticism: transparency has been a key value of the impresso project since its inception. The quality of digital research depends on being able to understand (and control) the process through which data was collected, processed and analyzed. We assess how a data lab can maximize both transparency and utility (allowing users to look under the hood and be in charge of the research, without this becoming a burden or hindrance). We discuss various methods to enhance transparency, for example collecting paradata on the collections by documenting archival knowledge; releasing the models used in processing data; ensuring users can recreate and repurpose data pipelines; and facilitating data criticism through overviews of both present and missing data, etc. (Beelen et al. 2023).

The computational analysis of cultural heritage data has been embraced by both data providers and the digital humanities community. It is, however, not obvious how exactly institutions can effectively and efficiently support research practices. This short presentation will report on the lessons we learned during the survey, interviews, and our own design process.

 
2:45pm - 4:30pmSESSION#01: DATABASES & DIGITAL ARCHIVES
Location: K-205 [2nd floor]
Session Chair: Edward Joseph Gray, DARIAH-EU / IR* Huma-Num, France
 
2:45pm - 3:15pm

Let's Start at the Start: Remodelling Runic Databases

Elisabeth Maria Magin

Universitetet i Oslo, Norway

The presentation aims to give a comprehensive overview of the experience gained when attempting to model the workflow of a philologist working with medieval runic inscriptions as entities in a relational database – in order to develop a kind of all-purpose basic model for a runological research database, containing the basic information every runologist requires for their work, but offering enough flexibility to conduct research from the point of view of different disciplines. It outlines the process of comparing the models of already existing runic databases, the distillation of the basic “core” features these share and how they are modelled for the specific purposes they were built for, all of which contributed to the proposed new, extendable model. Lastly, it will show examples of analyses that can now be conducted using this new model.
The database created based on this model was the main research tool of the interdisciplinary PhD project “Runes, runic inscriptions and runic writing as primary sources for town development”, begun in 2015 with the explicit aim of collecting, comparing and analysing previous attempts by runologists to use databases (relational and other) as tools to support their research. An analysis of this kind was, in the author's opinion, sorely needed; the first runic database was created as early as the 1990s, with other researchers and their databases soon following. However, only a few preliminary work reports were published concerning the underlying data models, structures and technologies these databases were using. Within the runologist community, there seemed to be a general consensus that, as long as these databases worked (somewhat), understanding how and why they worked (or did not work, where specific research was concerned) was not a requirement.
The 2015 PhD project, finished in 2021 and published in 2023 as “Data-based Runes: Macrostudies on the Bryggen Runic inscriptions” (DOI: 10.15845/bryggen.v100), set out to question that approach. Instead of using a relational database solely as the tool to curate and gather data on runic inscriptions, it asked questions like: “Should we even be using relational databases for this purpose? Is this the right tool for what we want to do? If it is, how can we ensure the broadest possible range of use of the data (and the model) by other scholars?”
All of these are questions that, in the author’s opinion, are not asked often enough at the start of a digital humanities project, especially PhD projects. Instead, PhD students often rely on the software their supervisors use or on software that they are already familiar with, even when their project would be better served by using a different digital tool. The presentation aims to provide some guidelines for future digital humanities projects, not just in terms of which questions might be beneficial to ask at the start of a new DH project, but also on how to explain to supervisors not particularly familiar with DH projects why these questions should be asked, and why the most obvious or traditional tool may not be the best for the task at hand.
As mentioned above, the project’s ultimate goal was to design a basic model for a runological research database. The model was required to be able to store the basic information every runologist needs, but also to be flexible enough to include research from different disciplines. This flexibility was a key focus, since runic inscriptions are of interest to a broad variety of scholars from different disciplines, amongst them archaeology, philology, history and sociology. The question was therefore whether relational databases were up to the task. The goal was informed by more recent approaches to creating large, reusable collections of structured data, but also by the lack of explanation for data modelling choices where the already existing databases were concerned, which made understanding and replicating query results from these databases difficult. At its most basic, a relational database should attempt to model the world it is representing (as, for example, Ramsay explains in his 2016 contribution “Databases”, DOI: 10.1002/9780470999875); in this case, the reading, interpretation and contextualisation of runic inscriptions within the broader context of the world they originate in. A specific point of interest was thus the translation of a runologist’s workflow when deciphering a runic inscription into appropriate data models and structures, as well as the limitations of the relational technology when modelling this reality. The presentation looks at a range of hurdles and pitfalls when trying to model different stages of interpretation while working with texts written in non-Roman scripts and how to work with “texts” when the field itself doesn't quite agree on what makes a “text”.
It also provides a short overview of how already existing databases such as Samnordisk runtextdatabas and the database of the Kieler Runenprojekt solved the same issues in order to understand their data modelling approaches. During the project, it soon became obvious that several lessons could be learnt from these older models, most of which had clearly been tailored to answer very specific research questions, thus limiting the reuse of the existing data for other projects. Further issues arose from a lack of documentation, both where the data model and the data itself were concerned, limiting reproducibility of results and even research into the underlying assumptions. Some ideas and tips on how to get around a lack of documentation are presented as well.
After the stages of analysis and comparison of already existing databases, the PhD project continued to develop its own data model, based on runological methodology and taking into account what previous databases had done. This “core” model database was then filled with data concerning the almost 700 medieval runic inscriptions from Bergen, Norway. This specific corpus of runic inscriptions was particularly suited to serve as a test corpus, because the inscriptions are carved into a wide variety of objects (such as wood, bone, ceramics, leather) and cover a wide range of different topics (from prayers to personal correspondence to vulgarities). Having been discovered in the course of archaeological excavations in the old town quarter Bryggen in Bergen, the objects themselves are of interest to archaeologists and tied to archaeological excavation data, while their textual contents have drawn the attention of runologists, Old Norse scholars, literature scholars and onomasticians. They were therefore well-suited to the task, since examining them in the context of their origins required using approaches from archaeology, runology, traditional text analysis, history and name studies.
Due to the broad range of potential example studies that can be conducted using the material, this particular project, however, only took into account the archaeology, onomastics and traditional text analysis. The presentation finishes by giving a short introduction into how the core database model as well as the various modular “research databases” modelling the archaeological, onomastic and textual aspects were developed to allow connecting and analysing different aspects of the runic inscriptions in relation to each other. From the perspective of three years on and in a follow-up project, it looks at where the final PhD database model “got it wrong” and how different factors such as unfamiliarity with the technology and difficult access to appropriate support contributed to these shortcomings. It concludes with a checklist of tips for how future projects, especially PhD projects, can avoid some of the issues this project ran into along the way. And, not least, how to convince your supervisors that yes, talking about the tool you are using should be part of your thesis.
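
Purely by way of illustration, and not as the model developed in the project described above, a “core plus modules” relational layout of the kind discussed could look roughly as follows, here created with Python's sqlite3; all table and column names are hypothetical:

    # Hypothetical sketch of a core runological schema with a discipline-specific module.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE object (                 -- the physical carrier (wood, bone, leather, ...)
        object_id  INTEGER PRIMARY KEY,
        material   TEXT,
        find_site  TEXT
    );
    CREATE TABLE inscription (            -- core record every runologist needs
        inscription_id INTEGER PRIMARY KEY,
        object_id      INTEGER REFERENCES object(object_id),
        signum         TEXT
    );
    CREATE TABLE reading (                -- one inscription can have several readings/interpretations
        reading_id      INTEGER PRIMARY KEY,
        inscription_id  INTEGER REFERENCES inscription(inscription_id),
        transliteration TEXT,
        interpreted_by  TEXT
    );
    CREATE TABLE archaeology (            -- example of a modular, discipline-specific extension
        object_id  INTEGER REFERENCES object(object_id),
        excavation TEXT,
        dating     TEXT
    );
    """)
    conn.close()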



3:15pm - 3:45pm

Pishu Tebe Digital Archive: Uncovering The Multimodality of Historical Postcards

Anna Golub1,2, Timur Khusyainov1,3, Dmitry Zharov1,4

1Pishu Tebe; 2University of Stuttgart; 3HSE University; 4Central European University

The paper examines the contribution of the Pishu Tebe project to the digitization of historical postcards. Pishu Tebe contains 45,000 marked-up postcards and is thus one of the largest projects involved in postcard digitization. The principal innovation proposed and successfully implemented by Pishu Tebe is a multidimensional approach to the digital preservation of postcards as cultural entities. In contrast to other initiatives, Pishu Tebe builds a digital archive that opens the way for analyzing all kinds of postcard-related data: visual, textual, chronological, geographical and personal (sender/recipient names). The paper starts with a discussion of the place of Pishu Tebe in contemporary postcard studies, its conceptual methodology and IT background. It then describes the key phases of the digitization process and identifies major challenges faced by a voluntary digitization project. Finally, the paper presents quantitative and qualitative results of Pishu Tebe and outlines further plans.

Golub-Pishu Tebe Digital Archive-189.pdf


3:45pm - 4:00pm

Sarpur – A Treasure Trove of information about Icelandic Cultural Heritage

Sveinbjörg Sveinsdóttir

Rekstrarfélag Sarps, Iceland

Sarpur is the Icelandic collective cultural history collection database and associated management system. Rekstrarfélag Sarps is responsible for the operation of Sarpur, which serves the majority of accredited museums in Iceland, with on the order of 300 staff users, a number that is steadily increasing. In total, more than 60 different museums and memory institutions in Iceland currently use Sarpur.

The varied museums span from The National Museum of Iceland, The National Gallery of Iceland and the Icelandic Museum of Natural History to urban city, folk and art museums, small regional museums run by the municipalities, and collections managed by non-profit organizations and private foundations. In 2023 there were over 1.6 million artifacts, photographs, art works, historic sites, houses, drawings, documents, archaeological materials, books, coins and items of intangible cultural heritage, such as site names and material from ethnological collections, registered in Sarpur. Around 1.1 million of the registrations are displayed on the external web sarpur.is.

Sarpur facilitates both management and overview of Icelandic cultural heritage shared across member museums in the whole country. It enables museums to register their collections and to process them further, for instance by managing object locations, conservation, exhibitions and outgoing loans. It also facilitates online exhibitions, crowdsourcing with local communities, and orders for images from the public.

Icelandic museums are often quite small compared to similar institutions abroad. This smallness has facilitated the collaborative effort of registering and managing cultural heritage in one shared data repository, Sarpur. Its opportunities are obvious when it comes to maintaining an overview of cultural heritage in Iceland, for example for the benefit of research, education and communication.

Currently, Rekstrarfélag Sarps is working on replacing the technical infrastructure of Sarpur, including both the database and the software. The present software, dating from 2012, will be replaced with products from the company Zetcom. The first version of Sarpur came about in 1998. All configurations and daily operations of the system will continue to be carried out at the central office of Rekstrarfélag Sarps, but the affiliated institutions maintain certain ground-level functions and manage their “part” of the database.

In this presentation Rekstrarfélag Sarps will briefly introduce Sarpur and explain why and how it came about. It will outline the significance of Sarpur for research activities in the field of digital humanities and art in Icelandic society, using examples of past and current research projects. Opportunities in the field of education will be addressed. Pros and cons of Sarpur as a collaborative effort that started out as an experiment and resulted in a nationwide practice will be discussed. Similar examples from the field of libraries in Iceland will be mentioned. The presentation will touch on the societal benefit that may result from establishing cultural heritage databases on a national level in a small society with limited resources. If time allows, future plans for Sarpur will be discussed.



4:00pm - 4:15pm

The Uralic Trove - The digital data infrastructure of the Uralic language speaker area

Outi Vesakoski, Jenni Santaharju, Timo Rantanen, Meeli Roose

University of Turku, Finland

The Uralic Trove (UraLaari) – the digital data infrastructure of the Uralic language speaker area


Keywords: Interdisciplinary studies, Human past, Finland, Spatial data, Language data

This paper presents a diverse data collection related to the human past in the Uralic language speaker area, extending from Scandinavia to Siberia. The publication promotes the F (findability) of the FAIR principles: the whole collection is briefly introduced here for further use.

Integrative approaches to building holistic human histories are little by little covering the globe, and the work by the BEDLAN team (Biological Evolution and Diversification of Languages, www.bedlan.net) and the Human Diversity consortium at the University of Turku has integrated the North-West Eurasian area into this emerging network of global integrative studies of the human past. We are framing a new infrastructure combining our current and forthcoming datasets on human diversity in the Uralic language speaker area. With the Uralic Trove we aim to advance studies and the development of methods for conducting truly integrative research.

Currently, the Uralic Trove includes four datasets related to the Uralic language speaker area and four related especially to the area of Finland. The datasets are available in repositories and user interfaces. We have also built an interactive web app for easy access to static maps and to allow a lay audience to create their own maps: the Uralic Historical Atlas (URHIA), presented in Roose et al., this conference.

Uralic language speaker area

UraTyp is a language typological dataset (Norvik et al. 2022) consisting of 360 linguistic traits formulated as questions with binary answers. The data is built in GitHub, offered in Zenodo, and easily approachable in the web app Uralic Areal Typology Online built by MPI-EVA (Robert Forkel, uralic.clld.org). The features in fact come from two typological lists:

1) Grambank data (GB) is a list of 195 grammatical features collected to date for 2700 of the world's languages with the aim of studying global typological diversity. The list was produced by the Grambank initiative of the Max Planck Institute (Skirgård et al. 2023, grambank.clld.org). BEDLAN contributed the Uralic languages to GB.

2) Uralic-specific traits (UT): 165 extra traits that we developed to resolve variation within the Uralic languages (Norvik et al. 2022).

UraLex is a basic vocabulary dataset with cognate assessments indicating etymological connections between the words for a given meaning in different languages. This is a commonly used data type for constructing quantitative language phylogenies (e.g. Gray & Atkinson 2003, Grollemund et al. 2015). Basic vocabulary consists of core items of the lexicon existing in all languages, such as lower numerals, pronouns and body parts. The data is published in Lexibank, a data repository organised by the Max Planck Institute for the Science of Human History (https://github.com/lexibank/uralex; Syrjänen et al. 2018). UraLex 2.0 covers 26 languages and a reconstruction of Proto-Uralic. We have thoroughly renewed the UraLex dataset (definitions of meanings, selection of words, updates to cognate assessments), and we will soon publish version 3.0 in Zenodo. The data will also be made available and visible through Uralic Areal Typology Online.

The Geographical database of the Uralic languages, in turn, is a spatial dataset with polygons of Uralic language speaker areas (Rantanen et al. 2022). It consists of multiple versions of each speaker area, first digitised from the literature and later corrected by experts. The data and maps are also available in the URHIA web app.

The Interdisciplinary spatial database of human history in NW Eurasia is a collection of GIS files and map visualisations produced for different publications. We now offer the GIS files for further interdisciplinary use.

Finnish area

The Uralic Trove also includes data covering only the Finnish language area. A new profile area of the University of Turku, the Human Diversity consortium (www.humandiversity.fi), focuses especially on Finland because multidisciplinary datasets are available there. At the moment we have published or are publishing the following:

Preindustrial dialectal landscape of Finland is based on spatial data of linguistic variation in Finland collected by Lauri Kettunen. Santaharju et al. (submitted abstract to this conference) will discuss this data.

The Archaeological Artefact Database of Finland (AADA) is a spatial dataset offering the location and typological classification of 49,000 artefacts held in Finnish museums. The database was compiled in 2013-2020, has recently been published, and will be described in Pesonen et al. (submitted ms.); Pesonen et al. have also submitted a poster abstract about AADA to this conference.

The Historical travel environment model over Finland is a spatial dataset with terrain and landscape attributes, combining information from historical sources that characterise the landscape in terms of travel effort, given the environmental and human-related factors current up until the late 19th century. The data is described in Rantanen et al. (2021) and will be made open access during spring 2024.

The Uralic Trove will also include spatial data on environmental variation in Finland (used in Honkola et al. 2018), cultural data used in Honkola et al. (2018), and folk-culture data from 1600-1800 used in Rantanen, Santaharju et al. (in preparation). Publishing these data is an ongoing process and will also be explained in the paper.

This paper collects together the datasets produced within and around the BEDLAN project (Biological Evolution and Diversification of Languages). Some of them were already open access; others are published along with this article. We did not promote open access at the beginning of the project but learned to appreciate the FAIR principles along the way: we now see that digital humanities research using OA data could spark international interest in studying the human history of NW Eurasia. We encourage Uralicists and other researchers of the area to make their data OA, for more researchers will yield more voices and cumulative insights, and eventually, hopefully, lead to a holistic understanding of human history in Scandinavia and NW Eurasia.

References

Honkola, T., Ruokolainen, K., Syrjänen, K. J. J., Leino, U., Tammi, I., Wahlberg, N. & Vesakoski, O. (2018). Evolution within a language: Environmental differences contribute to divergence of dialect groups. BMC Evolutionary Biology 18:132.

Norvik, M., Jing, Y., Dunn, M., Forkel, R., Honkola, T., Klumpp, G., Kowalik, R., Metslang, H., Pajusalu, K., Piha, Saar, E., Saarinen, S. & Vesakoski, O. (2022). Uralic typology in the light of a new comprehensive data set. Journal of Uralic Linguistics 1: 4-41.

Rantanen, T., Tolvanen, H., Honkola, T., Vesakoski, O. 2021: A comprehensive spatial model for historical travel effort – a case study in Finland. Fennia 199 (1) 61-88.

Rantanen, Timo, Harri Tolvanen, Meeli Roose, et al. (2022). ‘Best Practices for Spatial Language Data Harmonization, Sharing and Map Creation—A Case Study of Uralic’. PLOS ONE 17 (6): e0269648.

Roose, Meeli, Timo Rantanen, Dmitri Kuznesov, et al. (2023). ‘Collection of Spatial Information and Maps of Human Past and Environment in the Uralic Languages Speaker Area’. https://doi.org/10.5281/zenodo.10081902 [data set]

Skirgård et al. (>100 authors) 2023. Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances. http://dx.doi.org/10.1126/sciadv.adg6175

Vesakoski, O., Salmela, E. and Piezonka, H. (2024). Uralic archaeolinguistics. In Oxford Handbook of Archaeology and Languages, ed. by Martine Robbeets and Mark Hudson. In press.

 
2:45pm - 4:30pmSESSION#02: LITERARY STUDIES
Location: K-206 [2nd floor]
Session Chair: Anda Baklāne, National Library of Latvia, Latvia
 
2:45pm - 3:15pm

Between and Behind the Lines: A Case Study of Strikethrough Removal and Handwritten Text Recognition on Astrid Lindgren’s Shorthand Manuscripts

Raphaela Heil1, Malin Nauwerck2

1Independent Researcher; 2The Swedish Institute for Children's Books, Stockholm, Sweden

1 Introduction

Where does a literary work begin and end, and how does its meaning emerge and transform? From a perspective of textual or genetic criticism, the literary work is not considered a finished product, but an inconstant, changeable process, determined by the author’s creativity as well as their historical, material and social context.

As has been pointed out by Van Hulle (2004), manuscript research and genetic criticism may create the feeling of “rummaging in the author’s ‘scientifically annotated waste-paper basket’”. But the fact remains that many authors, and especially modernist authors, have not thrown their handwritten manuscripts into the waste-paper basket but rather – like Lindgren – preserved them for future research.

Genetic research is based on the material evidence of the creative process, and typically involves the analysis of manuscripts, typescripts, notebooks, and other preparatory documents. It plays a natural role in an ecosystem of related fields of study, such as bibliography, book history, archive studies, filologia d’autore, variantistica, writing studies, digital humanities, and scholarly editing (Van Hulle 2022). Genetic research has been described as drawing attention to the labour, craftsmanship, and dynamic of the creative process by entering the “workshop” of the writer (Hay 2004). The process often includes the compiling and deciphering of relevant documents, establishing a chronological order, and the transcribing and editing of texts.

In our project The Astrid Lindgren Code (2020–2024) (https://www.barnboksinstitutet.se/en/forskning/astrid-lindgren-koden/), Swedish author Astrid Lindgren’s original drafts and manuscripts, preserved in 670 stenographed notepads, constitute the object of investigation. The mixed-methods approach of the project has so far resulted in approximately 60 digitised and transliterated notepads, primarily containing the novel The Brothers Lionheart (1973), which Lindgren notoriously struggled to finish and whose ending was rewritten several times. As is generally the case in genetic research, mapping out the author’s revisions through deletion, alteration, and rewriting is essential for exploring variations within the text that reflect the path of a literary work, and the author’s creative process. On a material level, these revisions are often manifested in crossed-out or struck-through words, lines, and paragraphs.

When it comes to exploring the path of Lindgren’s literary work, the revisions at manuscript level are also more relevant than in many other cases. Generally, book publishing is a collaborative process in which several agencies are involved in bringing a manuscript to publication, and the influence on a manuscript can subsequently be traced to early readers, editors, and publishers. Belonging to the category of author-publishers, Astrid Lindgren, however, assumed the roles of editor and publisher herself. Lindgren wrote and edited in shorthand, then typed up the manuscripts herself before sending them directly to the printer. Her position at the publishing house Rabén & Sjögren guaranteed her full control of the editing and publishing of her own books and contributed heavily to the “Lindgren myth” (Nauwerck 2022). The revisions Lindgren made in her shorthand drafts, including striking through words, subsequently provide the only first-hand source for the author’s creative process and editorial work at script level. In order to access them, it is in this case therefore necessary to be able to read not only between, but also behind, the lines.

In our prior work (Heil and Nauwerck 2024) we have demonstrated that state-of-the-art handwritten text recognition (HTR) models can be used to automatically transliterate portions of Lindgren’s manuscripts. Our experiments have also shown that the presence of the aforementioned editorial marks, strikethrough and additions, negatively affects the recognition performance, resulting in a reduction of 30-40 percentage points (pp) with respect to the character error rate (CER). In this work, we investigate whether the application of strikethrough removal techniques, which aim to produce the original, clean words, can improve the recognition performance, allowing us to read behind the lines.

2 Case Study Design

The general design of our case study is centred around the application of strikethrough removal to affected words from Lindgren’s manuscripts, followed by a comparison of text recognition results obtained before and after applying the cleaning process. With respect to handwritten stenography recognition (HSR), we reuse the baseline models originally proposed by Sousa Neto et al. (2022) that were trained and evaluated as part of our prior work (Heil and Nauwerck 2024).

2.1 Strikethrough Removal

Several approaches for strikethrough removal have been proposed in recent years, for example (Heil, Vats, and Hast 2021) and (Poddar et al. 2021). The underlying data ranges from fully uncontrolled data, collected from genuine manuscripts, to entirely synthetic strikethrough. Detailed discussions of various cleaning and data collection approaches can be found in (Nisa 2023) and (Heil 2023). In this work, we focus on a combination of fully synthetic and manually cleaned data, using the strikethrough removal approach proposed in (Heil, Vats, and Hast 2022). The chosen approach has been demonstrated to work well for genuine and synthetic strikethrough applied to handwriting in Latin script, while requiring comparatively little computational power. It is based on paired image-to-image translation, using an autoencoder-based (Bourlard and Kamp 1988) deep neural network.
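
To illustrate the general shape of such a paired image-to-image setup, the sketch below defines a small convolutional autoencoder in PyTorch together with a single reconstruction-loss step; it is a schematic stand-in, not the architecture or training regime of Heil, Vats, and Hast (2022):

    # Schematic paired image-to-image autoencoder for strikethrough removal (illustration only).
    import torch
    import torch.nn as nn

    class StrikethroughCleaner(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2), nn.Sigmoid(),
            )

        def forward(self, struck):
            return self.decoder(self.encoder(struck))

    model = StrikethroughCleaner()
    struck = torch.rand(8, 1, 64, 128)  # batch of struck-through word images
    clean = torch.rand(8, 1, 64, 128)   # their clean counterparts (training targets)
    loss = nn.functional.mse_loss(model(struck), clean)  # minimise reconstruction error
    loss.backward()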

2.2 Data

We base our experiments on the LION dataset (Heil and Nauwerck 2023) which consists of several digitised notepads from Astrid Lindgren’s Swedish shorthand manuscripts, complete with corresponding transliterations, obtained via expert crowdsourcing (Andersdotter and Nauwerck 2022). The dataset includes 2195 clean lines, which are free from any editorial marks, and 307 struck lines, which contain at least one word that has been struck through. Each word in the dataset is annotated with bounding box coordinates, its transliteration and whether it is clean or struck through. For our experiments, we follow the previously established splitting into training, validation, and test sets (Heil and Nauwerck 2024).

In order to train and evaluate the strikethrough removal model, pairs of images, i.e. a struck-through word and its clean counterpart, are required. The training set for our experiments is obtained by extracting all clean words from the original training split and superimposing synthetic strikethrough in various shapes, for example straight or wavy lines, following (Heil, Vats, and Hast 2021). This step allows us to create a large and visually diverse set of training images, while keeping the degree of human labour to a minimum. In order to retain some amount of genuine data in the process, the validation set is obtained by extracting the struck-through words from the original validation portion, and manually erasing strikethrough strokes and related artefacts, using the image manipulation program GIMP and a graphic tablet. This approach requires considerable human involvement and is therefore only feasible at the scale of a small validation set and not applicable to obtain entire training sets with hundreds to thousands of images.
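
A minimal sketch of this kind of synthetic stroke generation is given below, using Pillow to superimpose a straight or wavy line on a clean word image; stroke width, position, and amplitude are illustrative assumptions, not the parameters used to build the actual training set:

    # Superimpose a synthetic strikethrough stroke on a clean word image (illustration only).
    import math
    import random
    from PIL import Image, ImageDraw

    def add_synthetic_strike(word_img, wavy=False):
        img = word_img.convert("L").copy()
        draw = ImageDraw.Draw(img)
        w, h = img.size
        y = h // 2 + random.randint(-h // 8, h // 8)  # stroke roughly at mid-height
        if wavy:
            points = [(x, y + int(3 * math.sin(x / 5))) for x in range(2, w - 2)]
            draw.line(points, fill=0, width=2)
        else:
            draw.line([(2, y), (w - 2, y)], fill=0, width=2)
        return img

    clean = Image.new("L", (120, 48), color=255)  # stand-in for a clean word crop
    struck = add_synthetic_strike(clean, wavy=True)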

2.3 Experimental Setting

As indicated above, our experiments focus on measuring the text recognition performance before and after applying strikethrough removal. Initial experiments, reusing the original model weights from (Heil, Vats, and Hast 2022), indicated that models trained on Latin handwriting do not generalise to text written in Swedish stenography. This observation is in line with our expectations, as words in the latter writing system are considerably shorter and often contain strokes that are visually less distinct from strikethrough than Latin characters. We therefore train the chosen strikethrough removal architecture specifically for Swedish shorthand, using the combination of synthetic training and genuine validation data outlined above. Once the strikethrough removal model has reached convergence, it is applied to all struck-through images from the test portion of the LION dataset. The original, struck-through words are then replaced by their cleaned counterparts in the test line images and the HSR performance is measured, line by line.

3 Results and Discussion

On average, combining the strikethrough removal approach with the previously trained HSR model yields a CER of 53.42%. While this constitutes an improvement, compared to the original recognition performance of 56.48%, obtained on the original, struck-through lines, it is still much higher than the CER for entirely clean lines, which amounts to 25.68%.
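
For reference, CER here is the character-level edit distance between the predicted and the ground-truth transliteration, normalised by the length of the ground truth; a minimal computation is sketched below:

    # Character error rate (CER): Levenshtein distance normalised by reference length.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def cer(prediction, reference):
        return levenshtein(prediction, reference) / len(reference)

    print(cer("och en", "och e"))  # 1 edit over 5 reference characters = 0.2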

A visual inspection of the strikethrough removal results reveals that there are several strikethrough strokes that are hardly, or not at all, removed by the models. Furthermore, in some cases the wrong strokes, i.e. strokes belonging to a word, are partially removed, thus potentially altering the transliteration. For these results, several plausible reasons can be identified.

Firstly, as is the nature of shorthand, individual words are considerably shorter, in terms of image dimensions, than many of the words in Latin or other longhand scripts. This leaves comparably less information to determine which strokes belong to the word and which should be removed. In addition to this, several of the frequently used words in Swedish, such as “och” (English: and) and “en” (English: a, one), are written with strokes that are also frequently seen in strikethrough, i.e. short horizontal or angled lines, arcs, and waves. When these symbols are combined with a strikethrough stroke it is virtually impossible to identify what is a shorthand word and what is the strikethrough, unless the context, in the form of surrounding words, is given.

Another aspect that may contribute to the limited improvement lies in the synthetic data, as it does not fully reflect the reality of Lindgren’s style of striking through words. One of these aspects is that the stroke generation has been implemented such that the generated strokes do not extend beyond the edges of the word image and instead, a small margin is left. As Lindgren often struck-through several consecutive words with a single stroke, the strikethrough continues to the edge (and beyond) of the words’ bounding boxes.

Finally, as our approach only treats individual words, any strikethrough strokes that lie outside any of the words’ bounding boxes will not be processed by our model at all, leaving short stroke segments behind in those areas when replacing the processed words. As indicated earlier, short strokes, for example horizontal lines, are valid symbols in Swedish shorthand and may thus be recognised as words by the HSR model, instead of as the strikethrough artefacts that they actually are. To demonstrate the impact of the latter issue, we mask any content within the text lines that lies outside the bounding boxes of any of the words and evaluate the HSR performance on these lines, resulting in a CER of 47.92%, i.e. a further improvement of approximately 5.5 pp.

4 Conclusions and Outlook

In this work, we have examined the impact of combining strikethrough removal with handwritten stenography recognition, using Astrid Lindgren’s shorthand manuscripts as a case study. The investigated approach has resulted in an improvement with respect to the recognition CER; however, the obtained performance is still considerably worse than that obtained on clean lines. Based on the identified challenges in the data, a sensible next step is to move from a word-based to a line-based strikethrough removal approach. This is expected to introduce the much-needed context that would allow the strikethrough removal model to identify which strokes are words and which are strikethrough. Besides this, an interesting avenue for future work is to extend this case study to other writers who frequently use strikethrough in their works and for whom reading behind the lines could be of interest to, for example, the genetic research community.

Acknowledgements

This work is supported by Riksbankens Jubileumsfond (RJ) (Dnr P19-0103:1). The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at Chalmers Centre for Computational Science and Engineering (C3SE) partially funded by the Swedish Research Council through grant agreement no. 2022-06725.



3:15pm - 3:45pm

Romani Literature in the Digital Era: Opportunities and Challenges

Sofiya Zahova

University of Iceland

Since the late 1990s and particularly after 2000, the development of the Romani literature field – the literary works written by Roma and/or for Romani audiences – has been influenced by the growth of digital technologies and the internet. The use of ICT among Romani communities has previously been discussed in relation to language usage (Leggio and Matras 2017; Leggio 2020), identity, appropriation and representation (Akkaya 2015; Szczepanik 2015), and migrations (Clavé-Mercier 2015; Hajská 2019). Based on the author's recent research in the framework of a broader investigation of Romani literary heritage and digital forms of Romani literature (Zahova 2020, 2021), the proposed paper discusses how digital technologies and tools have been applied in the preservation, production, access to and research of Romani literature. The paper's first part examines how digital technologies have been incorporated into Romani literary heritage preservation and how they have been utilised in the production of digital forms of Romani literature. To address the conference theme, this overview focuses on how collections, archives and holders of Romani literary materials have applied digital tools to facilitate the access, use and distribution of Romani literature, and what lessons can be learned from the experience of the already existing collections, particularly when it comes to community engagement and activities. The second part of the paper brings in the topic of Romani digital publishing and the practices applied by authors, publishers and promoters of Romani literature. Using the example of existing practices of publishing on websites, blogs and social media, I examine how multimedia elements, hyperlinking, and options to post, comment and engage have (not) been utilised.

Within libraries, the widespread use of digital technology was first applied in the digitisation of old books and manuscripts that are in the public domain. The digitisation pattern for Romani literature repeats this model, although its corpus in the public domain is not sizeable, nor is there a library that digitises only Romani literary heritage. The only corpus modelled according to this digitisation policy is the Zingarica collection at the National Library of Finland website (https://fennougrica.kansalliskirjasto.fi/handle/10024/85841), with the items in the collection amounting to 224 books and 24 manuscripts. Other efforts have been rather unsystematic, driven by projects or by individual activities, and thus with limited access and questionable sustainability. An initiative, for instance, was undertaken by the National Library of Serbia as part of The European Library (TEL) project. The idea behind the project was to form a bibliography of Romani literature titles available at all TEL members and digitise publications and materials (Injac 2009: 5‒6), which ended up creating only a bibliography of Romani literature available in some of the libraries. There are also cases of digitisation of materials as part of libraries’ initiatives that are not specifically related to Romani literary heritage and do not create a Romani collection. Such is the case of the scanning of Romani-language literature materials produced in the second half of the nineteenth century by Roma in the Austro-Hungarian Empire at the Hungarian Electronic Library (http://mek.niif.hu/) and as part of the Digital Library collection at the Lucian Blaga Central University Library in Cluj-Napoca. Apart from these large-scale, essentially top-down approaches, more grassroots, less systematic initiatives for the digitisation of traditionally printed literature and educational materials and the upload of this material to the internet have also been undertaken by individuals. The materials are made available on file storage websites or as files accompanying social network posts. In these cases, there is limited access to the digital content.

Digital forms of publishing seem to provide good opportunities for minority literature in general and non-commercial organisations involved in Romani publishing, in particular, to overcome challenges and restrictions of the book publishing market and reach readers across the globe. Romani publications are going digital not only through the digitisation of public domain materials but also as digitally born texts in the forms of e-books, internet publishing and social media publishing.

The most sizeable and widely available form of digital Romani literature consists of short literary pieces published on websites and online platforms devoted to Romani culture and/or literature. These websites usually present works by authors from a common geographic, cultural, and historical space, for example, writers from the former Soviet Union, Czechoslovakia, and Yugoslavia. In some ways, the digital publishing scene mirrors the area in which other Romani cultural activities take place, including the printing and distribution of traditional publications by authors from the area. This can be explained by the shared history of the Roma (living in the same political formation) and thus the maintenance of professional contacts between Romani activists, cultural producers, and authors, as well as by the common Romani dialects and the majority languages in which the publications appear (Czech, Russian, Serbian/Croatian/Bosnian/Montenegrin). Paradoxically, despite the unlimited opportunities for access to literature offered by internet publishing, the online platforms uniting Romani authors comprise digital publications from the same cultural and geographic space.

In some cases, online forms and digital publishing mirror a rather active print literature scene in general (Czech Republic, Sweden) or an active project in publishing (Romano Kher in Romania). In other instances, however, online publishing seems to be more active than print forms and goes beyond mirroring print editions by publishing content that is available only online (Russia and Hungary for instance). Finally, in certain cases (countries formed from the former Yugoslavia), despite the comparatively active Romani literature scene in print, digital forms of Romani literature have not been developed so far by individual authors and publishers. In this respect, digital publishing is mainly determined by the tendencies in the region or country of its production but also depends on personal agency.

Although Romani authors generally do not make a profit from royalties and copyrights, many are still not willing to distribute their works through internet publishing. The authors may fear plagiarism, or may simply be conservative – holding the sentiment that a proper book is one you can hold in your hands and read only in print form. Romani publishers have not yet taken advantage of self-publishing through digital formats, which is considered an essential and effective solution for publishing minority literature. At the same time, Romani digital publishing faces many of the issues related to Romani publishing in general: lack of promotion and distribution strategies; limited reading audiences due to unawareness of its existence among wider circles and limited reading ability in Romani; differences in dialect or orthography of publishing; and the prevalence of publishing in the languages of the majorities, especially in the prose genres, which are thus inaccessible for Roma from other countries.

There are still many challenges in Romani digital publishing – intermediary and multimedia forms that make digitally published texts more attractive and user-friendly are barely used. So far, there are no festivals, competitions, or promotional campaigns to highlight digital writing and publishing. Nevertheless, the peak in the digital life of Romani literature is still to come, and we can expect more authors to share their texts and reach out to the online community, educate readership, and engage in digital literary activities. As of 2023, I dare claim that digitally available Romani literature is already more successful in terms of accessibility and readership than traditional print literature.

Digital or more interactive forms of publishing may stimulate interest among younger generations who read and write in Romani on the internet in forums, social media, and in communication with their relatives and communities worldwide (Leggio 2020). Much like their peers, most of the young Roma today are “digital natives” (Palfrey and Gasser 2008), using and integrating contemporary technology in most aspects of everyday life, who are also interested in sharing digital content with Roma/Gypsy representations across the world. New ways of presenting Romani literature digitally may expand the audience for Romani literature by reaching these Romani digital natives.



3:45pm - 4:00pm

A hybrid approach in the close and distant reading of Ibsen’s plays in the light of the characters’ stylometric profiles
Keywords: stylometry, hybrid-reading, DSE, Henrik Ibsen, stylometric profiles

Sasha Rudan1,2, Eugenia Kelbert3,2,4, Linnea Eirin Timmermann Buerskogen1

1University of Oslo, Norway; 2LitTerra Foundation; 3Institute of World Literature, Slovak Academy of Sciences; 4University of East Anglia

This paper analyses Henrik Ibsen's plays in the connected world of their translations, demonstrating both the patterns in the original texts and their transfer across the translations. We demonstrate the process of collecting, analysing, and presenting the corpora and the methodology of hybrid reading – distant and close reading conducted across two institutions: the Centre for Ibsen Studies in Oslo, Norway, and the LitTerra Foundation in Belgrade, Serbia. In this process, we use our original digital infrastructure: Bukvik for cross-lingual corpus analysis and LitTerra for cross-lingual corpus reading and for presenting distant reading findings.

We explore the plays Et dukkehjem (A Doll’s House), Fruen fra havet (The Lady from the Sea), Hedda Gabler, Gengangere (Ghosts), Vildanden (The Wild Duck) and En folkefiende (An Enemy of the People). Our corpus consists of the original texts (written in the Dano-Norwegian language/dialect) and their translations into English, Serbian, Croatian, and Russian. We are primarily interested in exploring the author’s contextual style, where the context is each play’s character: we identify the textual patterns in the characters’ lines over the course of a play and in the characters’ mutual interaction, and thus eventually build each character’s stylometric profile.

Finally, we explore the patterns and stylometric profiles across the translations, making it possible to understand and present the translators’ work: on the one hand, their grasp of and commitment to the original’s stylometric features, and on the other, their creative freedom and conscious play within the generally more rigid form of the play as compared to the novel or poetry.

This work exemplifies cooperation across disparate institutions and scholars: on the one hand, the Centre for Ibsen Studies in Oslo, Norway, which holds the competence in Ibsen studies and all the original texts by Ibsen in critical editions and various translations; and on the other, the LitTerra Foundation in Belgrade, Serbia, which holds competence in cross-lingual stylometric analysis (maintaining Bukvik, the cross-lingual corpora analysis tool) and in the hybrid reading approach (maintaining LitTerra, the cross-lingual corpora reading platform).

Our workflow is the following: we collect Ibsen’s multilingual corpus and feed it to Bukvik workflows that enable the normalisation and cross-lingual alignment of the originals and their translations at the level of sentences and sentence parts. We also identify the provenance of each line in the play, either through direct annotation (using the TEI format, for the original texts) or through automatic extraction (for the translations), which helps us to classify and associate the text with the plays’ characters and later understand their contribution to the characters’ contextual styles. After that, we are ready to apply stylometric analysis (limited to the orthographic issues and old Dano-Norwegian language style of Ibsen’s original writing) to understand the uniqueness of each character’s style, its development throughout the play and its interaction with other characters, together with each translator’s awareness of it and its transfer to a new text “version” in translation.
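
To make the character-level approach concrete, the sketch below shows one very simplified way such a step could look in Python: speech lines are grouped by the speaker recorded in a TEI-encoded play and turned into relative-frequency profiles over the most frequent words. It is an illustration only, not the authors' Bukvik implementation, and the file name, element choices and feature set are assumptions.

```python
# Minimal sketch (not the authors' Bukvik pipeline): group the spoken text of a
# TEI-encoded play by speaker (sp/@who, common TEI drama practice) and build a
# simple stylometric profile from relative frequencies of frequent words.
from collections import Counter
from xml.etree import ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # standard TEI namespace

def speaker_texts(tei_path):
    """Return {speaker_id: concatenated spoken text} for one play."""
    root = ET.parse(tei_path).getroot()
    by_speaker = {}
    for sp in root.iter(TEI + "sp"):
        who = sp.get("who", "unknown").lstrip("#")
        # Simplification: itertext() also picks up <speaker> labels and stage
        # directions; a real pipeline would filter those children out.
        by_speaker.setdefault(who, []).append(" ".join(sp.itertext()))
    return {who: " ".join(parts) for who, parts in by_speaker.items()}

def stylometric_profile(text, vocabulary):
    """Relative frequency of each vocabulary word in one character's lines."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in vocabulary]

if __name__ == "__main__":
    texts = speaker_texts("et_dukkehjem.xml")        # hypothetical file name
    # Use the 100 most frequent words across all characters as features.
    all_tokens = Counter(" ".join(texts.values()).lower().split())
    vocab = [w for w, _ in all_tokens.most_common(100)]
    profiles = {who: stylometric_profile(t, vocab) for who, t in texts.items()}
```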

Using the LitTerra platform, we ensure a parallel view of all the translations of each play, mutually aligned and annotated with stylometric findings, eventually providing a close-reading experience augmented with distant-reading findings. Thus, we can read the texts in a completely new way: dynamically change our focus and range of interest (for example, focusing only on female characters, or aggregating them), integrate distant reading charts interactively with close reading, and pinpoint the excerpts of the text that contribute most to the deviations of stylometric profiles across the translations, seamlessly fusing and navigating between two often distinct experiences.

With this, we present an innovative model of DSE (Digital Scholarly Edition) platform and a continuous workflow for hybrid critical reading that we currently practice within the Centre for Ibsen Studies in light of Ibsen’s upcoming 200th anniversary.



4:00pm - 4:15pm

Jon Fosse and world drama: using map visualisation to interrogate the global dissemination of the fresh Nobel laureate

Jon Carlstedt Tønnessen, Jens-Morten Hanssen

National Library of Norway

During the 1990s Jon Fosse went from being a prose writer with a limited Norwegian readership to becoming a playwright with an ever-expanding international distribution. Within a decade after he wrote his first drama, Fosse held a prominent position on the global stage. This paper explores the geographical distribution of stage productions based on his works, with a point of departure in a performance dataset established by the National Library of Norway in collaboration with the performing arts archive Sceneweb. The dataset was created by extracting information from a relatively large archive of material related to Fosse’s authorship that the National Library received in 2021. A total of 870 Fosse productions worldwide have been entered into the Sceneweb database with information about theatre, venue, work, production title, performance dates, performance language, stage artists and other contributors, tour schedule, etc. (https://sceneweb.no/nb/artist/3170/Jon_Fosse). Venues are registered with geodata such as street address, city, state or province, and country, enabling the use of tools for geographical analysis.

Using Python-based tools for geoparsing and map visualisation, we employ interpretive digital tools to shed new light on the events that led to Fosse’s current standing as a world dramatist. Existing accounts indicate that the international distribution of stage productions is particularly strong in three geographical areas: Scandinavia, France, and the German-speaking parts of Europe. One specific production, namely Claude Régy’s staging of Someone Is Going to Come at the Festival d’Automne in Paris in the autumn of 1999, supposedly marked a turning point and laid the foundation for Fosse’s European breakthrough, initiating a strong wave of performances particularly on the European mainland. However, new studies suggest that there would not have been a global Fosse had it not been for his long-standing strong position in Scandinavian theatre and the long-term transnational artistic collaboration initiated and upheld by Scandinavian theatre organizations with a strong ownership in Fosse.

We will present an interactive map, built with the Folium package for Python (https://python-visualization.github.io/folium), visualising the spread of Fosse over three decades, from 1994, when he was presented on stage for the first time, until 2024, drawing on the combination of precise spatial and temporal data contained in the Sceneweb dataset. Moreover, we will zoom in on France, Germany, Austria, Switzerland, Norway, Sweden, and Denmark for a closer scrutiny of events of particular significance in these areas. Fosse has been presented on stage on six continents, and we will, last but not least, explore how Fosse conquered the stage outside of Europe.
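
As an illustration of the kind of map described above, the following minimal Folium sketch plots a couple of invented production records as circle markers and saves an interactive HTML map. It is not the National Library's actual code; the records, coordinates and styling are placeholders.

```python
# Illustrative sketch only: plot hypothetical Fosse productions on a Folium map.
import folium

productions = [  # invented placeholder records
    {"title": "Nokon kjem til å komme", "city": "Paris",  "year": 1999,
     "lat": 48.8566, "lon": 2.3522},
    {"title": "Namnet",                 "city": "Bergen", "year": 1995,
     "lat": 60.3913, "lon": 5.3221},
]

m = folium.Map(location=[55.0, 10.0], zoom_start=4, tiles="CartoDB positron")
for p in productions:
    folium.CircleMarker(
        location=[p["lat"], p["lon"]],
        radius=5,
        popup=f'{p["title"]} ({p["city"]}, {p["year"]})',
        color="#2b6cb0",
        fill=True,
    ).add_to(m)

m.save("fosse_productions.html")  # open the saved HTML file in a browser
```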

The paper addresses several of the overarching themes of the DHNB 2024 conference. The paper will demonstrate services and applications developed by the National Library’s DH Lab, thereby illustrating the role of such labs in collaborative projects across the GLAM sector. The paper will furthermore demonstrate the reproducibility and repurposing of a dataset initially created by extracting performance data from a cultural heritage collection.



4:15pm - 4:30pm

Norwegian writers during the Second World War. Literary studies and relational analysis combined.

Sofie Arneberg, Lars Johnsen

National Library of Norway, Norway

Introduction:
How can we use text mining to identify traits of, or describe, the literature of Norwegian writers before and during World War II? How can a statistical analysis of the writers’ political and aesthetic position taking in the 1930s and 40s be used as a stepping stone for new studies of a literary epoch?

Objective:
Cultural life in the 1930s and 40s can be studied as a site where the prelude to World War II both took place and manifested itself. Most companions to European literary history mark out the period as special, almost deviant, and with a clear before-and-after perspective. Literature of the period has consequently been read and studied differently. Knowledge or assumptions about the writers’ political stances have naturally played a role. How precise or relevant are literary interpretations coloured by these assumptions or facts? Can a distant reading of the collective works of 308 writers either confirm or invalidate presumptions about their artistic production?

Case: Grouping the writers based on their aesthetic positioning was one of the tasks when the dataset was created. It was of interest to the project to test the accuracy of this method.

Methodology:
The Words and Violence project (Literary intellectuals between democracy and dictatorship 1933-1952) analyses the democratic resilience and vulnerability of cultural life in the 1930s and ’40s. The project is funded by the Norwegian Research Council and consists of a research consortium with members from both universities and cultural heritage centers such as The Norwegian Center for Holocaust and Minority Studies, The Falstad Centre and The National Library of Norway. The latter contributes to the project with research, library resources and access to the digitized books from their laboratory for digital humanities (DH-LAB).

Based on prosopographical data on 308 writers active in Norway during the 1930s and 40s, researchers from the project have conducted a systematic analysis of the writers’ various position takings during WWII.

The relations between three sets of structures were analyzed: the writers’ social backgrounds, their position takings in the literary field and their political position taking during the German occupation. Inspired by the works of Pierre Bourdieu and Gisèle Sapiro and based on a newly constructed database on the writers, the study demonstrates the potential of correspondence analysis for mapping relational structures in culture.
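
As a rough illustration of the technique mentioned above, the sketch below runs a plain correspondence analysis on a small invented contingency table using NumPy. It is not the project's analysis; the categories, counts and axis interpretation are placeholders.

```python
# Hedged illustration: simple correspondence analysis via SVD of the matrix of
# standardised residuals, applied to a toy table crossing (invented) literary
# positions with (invented) wartime political stances.
import numpy as np

table = np.array([        # rows: literary positions, cols: political stances
    [30,  5,  2],
    [10, 12,  6],
    [ 4,  8, 15],
], dtype=float)

P = table / table.sum()                # correspondence matrix
r = P.sum(axis=1)                      # row masses
c = P.sum(axis=0)                      # column masses
S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

row_coords = np.diag(r ** -0.5) @ U @ np.diag(sv)     # principal coordinates
col_coords = np.diag(c ** -0.5) @ Vt.T @ np.diag(sv)
inertia = sv ** 2 / (sv ** 2).sum()                    # variance share per axis
print(np.round(row_coords[:, :2], 3))
print(np.round(inertia[:2], 3))
```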

The established database offers a large number of variables. Several of these concern aspects of the writers’ literary production: what was their preferred written language (Norwegian, New Norwegian or dialectal variants); which genres did they write in; which publishing houses were they affiliated with; how frequently did they publish; and were their books translated into other languages and published abroad? This information was gathered from detailed bibliographies on the writers, produced specifically for the project at the National Library of Norway.

Based on the new bibliographies, a corpus consisting of the literary works of the 308 writers, approximately 2000 books, was built and studied.

Case: Signal words such as "life", "inner", "strange", and "mind" were scrutinized within a subset of works by authors classified as having an interest in psychoanalysis. Subsequently, a document-term matrix was generated.

(figure)
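
For readers unfamiliar with the step described above, the following minimal sketch shows how a document-term matrix restricted to a handful of signal words could be built with scikit-learn. The documents, English stand-in words and labels are invented for illustration and do not reproduce the project's corpus.

```python
# Minimal sketch: a document-term matrix limited to a fixed signal-word vocabulary.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

signal_words = ["life", "inner", "strange", "mind"]   # English stand-ins
documents = {                                          # invented toy texts
    "novel_a": "the inner life of the mind was strange to her",
    "play_b":  "he walked out and said nothing at all",
}

vectorizer = CountVectorizer(vocabulary=signal_words)
matrix = vectorizer.fit_transform(documents.values())
dtm = pd.DataFrame(matrix.toarray(),
                   index=documents.keys(),
                   columns=vectorizer.get_feature_names_out())
print(dtm)   # rows: documents, columns: signal-word counts
```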

Findings: 1) The initial classification was not deemed incorrect. 2) The authors who used the signal words most frequently in their novels hardly used them at all in their plays; this should be investigated more thoroughly.

Conclusions:

The combination of statistical investigation methods on one front and literary text mining on the other has the potential to yield valuable insights. This convergence allows for the testing of specific assumptions while simultaneously giving rise to new questions about the material.

 
2:45pm - 4:30pmSESSION#03: LINGUISTICS (NLP)
Location: K-207 [2nd floor]
Session Chair: Maciej Rapacz, AGH University of Kraków, Poland
 
2:45pm - 3:15pm

Developing named-entity recognition for state authority archives

Ida Toivanen1, Mikko Lipsanen2, Venla Poso1, Tanja Välisalo1

1University of Jyväskylä, Finland; 2National Archives of Finland, Finland

Digitisation of archives is a core trend of archival practices around the world. The National Archives of Finland launched a mass digitisation project in 2019 to digitise state authority records. Over the next decade, they aim to digitise more than 200 kilometres of modern government documents (Hirvonen, 2017). This endeavour involves generating both image and text data from the original documents. Even though this process enhances accessibility to archival materials, possibilities for information retrieval from unstructured and noisy text remain at a low level. Consequently, there is a need for interdisciplinary work to find different solutions to enrich the digitised data and serve a more varied user base (Guldi, 2023; Guldi, 2018; Colavizza et al., 2021). Continuing the digitisation process with enrichment of the data helps to avoid the pitfalls of digitisation (Jeurgens, 2013) and opens up new possibilities for researchers.

Aiming at more efficient and innovative uses for archival material, we developed a named entity recognition (NER) model for Finnish state authority archival data. NER is one of the most common tasks in natural language processing (NLP) and has been described as “among the first and most crucial processing steps” (Ehrmann et al., 2023) in enriching archival data. It was originally developed for information extraction (Palmer and Day, 1997) but has since been used in various areas of NLP. NER creates a multitude of potential uses for research. At its simplest, it allows the researcher to filter data more efficiently (Guldi, 2023). In more advanced use cases, in the context of state authority archives, NER can, for instance, facilitate research on policy trends and historical memory, identification of different advocacy groups communicating with state authorities, or effects of different local and world events on decision-making.

The diversity of the archival data, containing texts from different domains, as well as noise due to imperfect optical character recognition (OCR), creates a challenge for named entity recognition. Pre-trained language models can, however, be exploited in a transfer learning setting to improve the generalizability of a neural NER model and overcome these issues (Ehrmann et al., 2023). Current state-of-the-art NER models made for modern texts, like the TurkuNER model (Luoma et al., 2021), did not produce adequate results on an out-of-domain test set consisting of OCR’d archival data. This gave us an incentive to train our own NER model and see whether our attempts would bring suitable results in an archival setting. We therefore followed the approach presented in Luoma, Oinonen, Pyykönen et al. (2020) and used FinBERT (Virtanen et al., 2019) as a base model, which we fine-tuned to recognise and classify named entities from text input. The aim of our study was to answer the following research questions in the context of state authority archives:

  1. Can a NER model trained with archival text data give comparable or improved results to existing Finnish NER models trained with modern text data?

  2. Is it possible to create a NER model that performs well with both archival and non-archival data?

Our preliminary results clearly indicate that named entity recognition for state authority data needs its own model training due to its specialised formal language. Another difficulty for previous models with digitised state authority data is the noise caused by OCR. We approached these problems through choices made in the pre-training phase. First, we selected training data from various state authority sources to represent the complexity of formal state authority language. Second, we avoided cleaning OCR errors from the data in the annotation phase in order to train a model that performs with OCR noise. In model training we used the TurkuONE corpus (Luoma et al., 2021) as well as annotated document data from Finnish state authority records. The time span of the texts is from the 1970s to the 2000s; the latest documents in the TurkuONE corpus are born-digital, while the text content of the state authority records was retrieved using OCR. The document and text types vary from blog posts to public administration records and legal texts.

NER models are typically trained to recognise entities like date, organisation, person, and location (see, e.g., Nadeau and Sekine, 2007). The named entity categories for our model development were based on surveys aimed at different archive user groups (Poso et al., 2023). The following named entity categories were included: person (PERSON), organisation (ORG), location (LOC), geopolitical location (GPE), product (PRODUCT), event (EVENT), date (DATE), nationality, religious and political group (NORP), Finnish business identity code (FIBC) and journal number (JON). The TurkuONE and NewsEye corpora were readily available with annotations but were supplemented with the missing named entity categories (i.e., FIBC and JON). The state archival dataset was annotated by seven annotators using the IOB2 annotation scheme, and to determine the level of agreement between all annotators we calculated inter-annotator agreement (Fleiss' kappa 0.84).
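
For orientation, a compressed sketch of what fine-tuning a BERT encoder for IOB2 token classification can look like with the Hugging Face transformers library is given below. The checkpoint name (assumed here to be the public FinBERT release), the label subset and the toy sentence are illustrative assumptions, not the project's actual setup or data.

```python
# Hedged sketch: one training step of BERT fine-tuning for IOB2 token classification.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PERSON", "I-PERSON", "B-ORG", "I-ORG"]   # subset of the paper's tags
label2id = {l: i for i, l in enumerate(labels)}

sentences = [["Kansallisarkisto", "sijaitsee", "Helsingissä", "."]]   # toy example
tags = [["B-ORG", "O", "O", "O"]]

checkpoint = "TurkuNLP/bert-base-finnish-cased-v1"   # assumed public FinBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

enc = tokenizer(sentences, is_split_into_words=True, padding=True,
                truncation=True, return_tensors="pt")

# Align word-level IOB2 tags with sub-word tokens; special tokens get -100 (ignored).
aligned = []
for i, sent_tags in enumerate(tags):
    word_ids = enc.word_ids(batch_index=i)
    aligned.append([-100 if w is None else label2id[sent_tags[w]] for w in word_ids])
labels_tensor = torch.tensor(aligned)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
out = model(**enc, labels=labels_tensor)   # one illustrative forward/backward step
out.loss.backward()
optimizer.step()
```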

Our initial results show that current state-of-the-art Finnish NER models (namely, TurkuNER model) work well with modern texts (weighted F1-score 0.9166) but show a drop in performance when tested with archival data (weighted F1-score 0.6694). When tested with modern texts and archival data, our model shows consistent performance in both domains (weighted F1-scores 0.9200 and 0.8710, respectively). Our model also showed improvement in F1-score, precision, and recall for all of the named entity groups (excluding the two new entity categories FIBC and JON) in comparison to the TurkuNER model when tested with an archival dataset. For the TurkuONE test dataset our model performed very similarly to the TurkuNER model. We can deduce from this that the increased diversity of the training data improved the model performance – that is, even though we included archival data with OCR noise, the model still learned to detect named entities correctly from noise-free, non-archival data.

As our results show, the users’ needs, the complexities of the archival data and the different domains presented in archival settings present challenges for NER and possibly to other related NLP tasks. Further examination is needed to separate the impact of OCR noise and the impact of state authority language respectively on the model performance.



3:15pm - 3:45pm

SWENER-1800: A Corpus for Named Entity Recognition in 19th Century Swedish

Eva Pettersson1, Lars Borin2, Erik Lenas3

1Uppsala University, Sweden; 2University of Gothenburg, Sweden; 3Swedish National Archives

Named entity recognition (NER) is the process of automatically identifying persons, places, organisations and other name-like entities in text, in order to perform natural language processing tasks such as automatic extraction of metadata from text, anonymisation/pseudonymisation of sensitive personal data, or as a preprocessing step for linking different terms describing the same entity to a single reference. While NER is a mature language technology, it is generally lacking for historical language varieties. We describe our work on compiling SWENER-1800, a large (half a million words) reference corpus of historical Swedish texts, covering the time period from the first half of the 18th century until about 1900, and manually annotating it with named entity types identified as significant for this time period, as well as with sentence boundaries, notoriously difficult to recognise automatically in historical text. This corpus can then be used to train and evaluate NER systems and sentence segmenters for historical Swedish text. An additional concrete contribution from this work is a manual for annotation of named entities in historical Swedish.

Pettersson-SWENER-1800-106.pdf


3:45pm - 4:15pm

BERT-aided normalization of historical Estonian texts using character-level statistical machine translation

Gerth Jaanimäe

University of Tartu, Estonia

Historical texts serve as an invaluable resource for linguists, historians, and other researchers who use archives in their work. However, the automatic analysis and searchability of these writings are often hindered by the old and different spelling systems used in the period when they were written. A common solution to this issue is to convert these texts from the old spelling system to the contemporary one, a process also called normalization. This task, however, poses its own difficulties, mainly caused by the small amount of data and the significant variation within it. This presentation gives an overview of using statistical machine translation for normalizing historical Estonian texts, with BERT language models as a preprocessing step to improve the results.

The dataset used in this research comprises parish court records written in the 19th century. These texts describe court cases regarding debts, petty thefts, public fights and other minor offences. The parish court also served as a place where peasants could perform notarial acts, such as buying and selling property or writing down a will, and settle their civil disputes. These writings therefore provide insights into peasant life, thought processes, relationships, language usage, etc. The texts were initially handwritten; due to the discrepancies between hands and the prevalence of non-standard spelling, conventional approaches to digitizing them, such as machine learning or optical character recognition, proved infeasible. The texts therefore had to be transcribed through a crowdsourcing project by the Estonian National Archives. There are currently over 130,000 texts in the database, of which 1,000 texts are used within this study.

Many of these writings use the old spelling system, which was introduced around the end of the 17th century and was heavily influenced by the German orthography of the time. The primary divergence from the contemporary spelling system lies in the manner in which the length of a sound was denoted [1]. For instance, the term ’kooli’ (the genitive, partitive, and illative form of ’kool’, school in English) was historically transcribed as ’koli’. The terms presently spelled as ’koli’ (junk) and ’kolli’ (the genitive and partitive form of ’koll’, monster) were both previously written as ’kolli’.

Although there are some parish court records written in Modern Estonian orthography, most are written in the older spelling and some during a transitional period, when people still wrote some words in the earlier spelling out of habit. What complicates matters even more is that there were two written languages in parallel use until the end of the 19th century, representing North vs. South Estonian. Eventually, the North Estonian language and spelling standard became the single standard for the whole country. The spelling standard Estonians know and use today was introduced in 1843 and started gaining popularity in the 1870s [1].

The dataset used for the normalization experiments consists of 1,000 texts, around 330,000 words, which were manually normalized for training and evaluating the machine translation system. Due to significant dialectal variation, the texts were categorized into eight dialectal areas, with varying amounts of data chosen for each area according to its differences from modern Estonian. The central, insular, western, northeast coastal, and eastern dialects belong to North Estonian, with the central dialect being the closest to standard Estonian. The Võru, Tartu and partially Mulgi dialects belong to South Estonian. The proportion of words within the datasets that actually need to be normalized differs from dialect to dialect, ranging from 32% to 51%. The length of the writings also varies, with an average length of 359 words.

There are numerous normalization methods available, among which character-level statistical machine translation (CSMT) holds prominence and is also employed in the current research. One of the first uses of the method was to normalize historical Slovene texts [2]. A similar approach has also previously been adopted for normalizing Estonian parish court records [3]. However, the use of CSMT introduces inherent challenges, as it interprets text on a word-by-word basis, potentially leading to the erroneous normalization of words. This issue is complicated by the morphological richness of the Estonian language, which gives rise to form homonymy – a phenomenon where a single form can correspond to distinct lemmas in the contemporary writing system. The old writing system was also ambiguous. For instance, the term ’kalla’ might denote the contemporary form ’kalla’ (pour in Estonian) or ’kala’ (fish in Estonian). While a human can discern the correct form based on context, the same task is considerably more complex for a machine. It is therefore necessary to determine whether a given word should in fact be normalized. To address this, a two-step methodology was employed. First, a BERT model was fine-tuned to identify the specific words in sentences composed in the old writing system that require normalization. Subsequently, the Moses toolkit for character-level statistical machine translation (CSMT) was utilized. This approach mirrors a similar method previously employed for normalizing old Slovene texts [4].
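
To make the two-step idea concrete, the sketch below shows, in plain Python, how flagged words can be rewritten into the space-separated character format typically fed to a character-level Moses model and then reassembled, while unflagged words pass through untouched. The flagging function and the "decoder" here are trivial stand-ins for the BERT classifier and Moses, and the example words are illustrative only, not the study's data.

```python
# Hedged sketch of the two-step pipeline: flag words, normalize only those.
def to_char_level(word):
    """'kolli' -> 'k o l l i', the usual input format for character-level SMT."""
    return " ".join(word)

def from_char_level(chars):
    return chars.replace(" ", "")

def normalize_sentence(words, flag_words, normalize_chars):
    out = []
    for word, needs_norm in zip(words, flag_words(words)):
        if needs_norm:
            out.append(from_char_level(normalize_chars(to_char_level(word))))
        else:
            out.append(word)
    return out

# Toy stand-ins: the flagger marks two hard-coded words, and the "decoder"
# collapses doubled consonants (purely illustrative, not real Moses output).
words = ["kolli", "ja", "temma", "kohtus"]
result = normalize_sentence(
    words,
    flag_words=lambda ws: [w in {"kolli", "temma"} for w in ws],
    normalize_chars=lambda c: c.replace("l l", "l").replace("m m", "m"),
)
print(result)   # ['koli', 'ja', 'tema', 'kohtus']
```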

The training and testing processes were performed using the Simpletransformers library for Python and executed on the High-Performance Computing clusters of the University of Tartu [5].

For identifying words that need normalization, two pre-trained models for Estonian, EstBERT [6] and Est-RoBERTa [7], were experimented with. The latter was trained on a larger dataset drawn from the media, the former on a smaller but more diverse dataset. Both models were subjected to cross-validation on the 8 datasets representing different dialectal areas, in 5 iterations. Training was performed for 50 epochs; the other hyperparameters were left unchanged. The data were divided into training and testing sets with an 80%/20% split. Despite Est-RoBERTa being trained on a larger dataset, EstBERT exhibited slightly superior results, achieving a macro average accuracy of 95% across the 8 dialectal areas, compared to 93.5%. This can be attributed to its pretraining on a more diverse dataset. The predictions were also compared to a baseline approach using Vabamorf, a toolkit designed for the morphological analysis of Estonian texts [8]: words unrecognized by Vabamorf as legitimate Estonian words were designated as requiring normalization. Vabamorf achieved an average accuracy of 90.6%.

Following the EstBERT model training, the same datasets were used to train and tune the Moses machine translation system. 75% of the data were allocated for training, with an additional 5% earmarked for tuning using Minimum Error Rate Training (MERT); the remaining 20% constituted the test sets. Words not requiring normalization were excluded from the training and development sets. This ensured that training was performed only on words that actually have to be normalized, and not on those that are potentially homonymous with them. For comparison, the same texts were used to train the Moses system without the EstBERT preprocessing. Both trained models were evaluated on the same test sets: for Moses without the EstBERT model, the whole test set was fed through the machine translation system, while for Moses with the EstBERT model only the words that the classifier flagged as needing normalization were translated.

Although the work is still ongoing, the preliminary results are promising: Moses without EstBERT achieved an average accuracy across dialects of 90% on the whole dataset of approximately 330,000 words, while adding EstBERT increased accuracy to 94%.

An additional concern addressed by this research pertains to the presence of numerous words in parish court records that, in contemporary usage, manifest as compounds but were often previously written as separate tokens. Conversely, the opposite scenario is also observed, wherein words currently written as two distinct tokens were originally composed as compounds. For example, ’kohtumees’ (judge in Estonian) was often written as ’kohtu mees’ and ’ära läinud’ (gone away) as ’äraläinud’. To alleviate this issue, a proposed strategy involved training an additional EstBERT model to ascertain whether a given word is regular, part of a compound, or necessitates separation into two distinct tokens. As the annotation process for these distinctions began later in the study, not all of the previously described data could be utilized.

Consequently, for this task the training dataset comprised the manually normalized side of the corpus, encompassing approximately 160,000 words, in which, besides normalization, annotations were also provided indicating whether a word should be joined into a compound (3,718 words) or treated as two separate tokens (2,022 words). The initial results were not very promising: precision was 65% and recall 56%. This can be attributed to the fact that compounds were annotated later during the normalization and there is therefore less data available for training. Furthermore, the occurrence of such words is notably sparse within these texts. Efforts to improve these results are currently ongoing.

References

[1] M. Erelt, Estonian language, volume 1 of Linguistica Uralica Supplementary Series, Estonian Academy Publishers, 2007.

[2] Y. Scherrer, T. Erjavec, Modernizing historical Slovene words with character-based SMT, in: 4th Biennial Workshop on Balto-Slavic Natural Language Processing, 2013.

[3] G. Jaanimäe. Challenges Of Using Character Level Statistical Machine Translation For Normalizing Old Estonian Texts. Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022).

[4] Y. Scherrer, N. Ljubešić. Sesame Street to Mount Sinai: BERT-constrained character-level Moses models for multilingual lexical normalization. Proceedings of the 2021 EMNLP Workshop W-NUT: The Seventh Workshop on Noisy User-generated Text, pp. 465–472.

[5] University of Tartu. UT Rocket. share.neic.no. https://doi.org/10.23673/PH6N-0144

[6] H. Tanvir, C. Kittask, S. Eiche, K. Sirts. EstBERT: A Pretrained Language-Specific BERT for Estonian. Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa).

[7] M. Ulčar, M. Robnik-Šikonja. Training Dataset and Dictionary Sizes Matter in BERT Models: The Case of Baltic Languages. Lecture Notes in Computer Science book series (LNCS,volume 13217).

[8] S. Laur, S. Orasmaa, D. Särg, P. Tammo. EstNLTK 1.6: Remastered Estonian NLP Pipeline. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 7152–7160.



4:15pm - 4:30pm

Augmenting BERT to model remediation processes in Finnish countermedia: feature comparisons for supervised text classification

Ümit Bedretdin, Pihla Toivanen, Eetu Mäkelä

University of Helsinki, Finland

This paper showcases a supervised machine learning classifier to bridge the gap between qualitative and quantitative research in media studies, leveraging recent advancements in data-driven approaches. Current machine learning methods make it possible to gain insights from large datasets that would be impractical to analyze with more traditional methods. Supervised document classification presents a good platform for combining specific domain knowledge and close reading with broader quantitative analysis. The study focuses on a dataset of 37 185 articles from the Finnish countermedia publication MV-lehti, annotated into three categories based on frame analysis. Contextual sequence representations from the finBERT language model, topic distributions from a trained topic model, and a structural, HTML-aware featureset developed in prior work are employed as classification features. The hypothesis that BERT-based embeddings could be improved upon by augmenting them with additional information is supported by recent promising results in natural language benchmarks and tasks (Peinelt, Nguyen, and Liakata 2020; Glazkova 2021). In our study, combining contextual embeddings with topics resulted in only marginal performance increases, and this improvement was observed mostly in minority classes. Despite this, potential future developments to achieve better classification performance are outlined. Based on the experiments, automated frame analysis with neural classifiers is possible, but the accuracy is not yet sufficient for inferences of high certainty.
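
As a sketch of the general feature-combination idea described above (not the authors' exact pipeline), the code below concatenates a BERT [CLS] document embedding with a per-document topic distribution and fits a linear classifier. The checkpoint name, example documents, topic vectors and labels are all placeholders.

```python
# Hedged sketch: combine contextual embeddings with topic-model features.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

checkpoint = "TurkuNLP/bert-base-finnish-cased-v1"   # assumed Finnish BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)
encoder.eval()

def embed(text):
    """Return the [CLS] vector for one document (truncated to 512 tokens)."""
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**enc)
    return out.last_hidden_state[0, 0].numpy()

docs = ["ensimmäinen esimerkkiartikkeli", "toinen esimerkkiartikkeli"]  # toy docs
topic_dists = np.array([[0.7, 0.2, 0.1],    # e.g. per-document topic proportions
                        [0.1, 0.3, 0.6]])   # from a separately trained topic model
frame_labels = [0, 1]                        # placeholder frame categories

features = np.hstack([np.vstack([embed(d) for d in docs]), topic_dists])
clf = LogisticRegression(max_iter=1000).fit(features, frame_labels)
```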

Bedretdin-Augmenting BERT to model remediation processes in Finnish countermedia-135.pdf
 
4:30pm - 5:30pmDHNB: Annual General Meeting
Location: K-205 [2nd floor]
Session Chair: Eetu Mäkelä, University of Helsinki, Finland
7:30pm - 10:00pmDinner
Location: Reykjavík Art Museum

Social dinner in the Reykjavík Art Museum.

Date: Thursday, 30/May/2024
8:45am - 10:15amKeynote panel: DHNB
Location: Bratti [1st floor]
Session Chair: Camilla Holm Soelseth, OsloMet - Oslo Metropolitan University, Norway
Session Chair: Olga Holownia, IIPC, United States of America

Panellists: 

Anda Baklāne, Camilla Holm Soelseth, Eetu Mäkelä, Eiríkur Smári Sigurðarson, Jurgita Vaičenonienė, Katrine Hofmann Gasser, Mari Sarv, Matti La Mela and Olga Holownia.

10:15am - 10:30amSHORT BREAK
Location: Háma
10:30am - 12:00pmPANEL#02: Measuring the Impact of Open Educational Resources on Digital Methods for Humanists
Location: H-207 [2nd floor]
 

Measuring the Impact of Open Educational Resources on Digital Methods for Humanists

Sofia Papastamkou1, Krebs Stefan1, Garnett Vicky2, McElduff Siobhán3, Hawes Anisa4, Chevrie Charlotte4

1Luxembourg Centre for Contemporary and Digital History, University of Luxembourg; 2DARIAH-EU; 3University of British Columbia; 4Programming Historian

Panel Proposers

Sofia Papastamkou & Stefan Krebs (C²DH, Ranke.2)

Measuring the Impact of Open Educational Resources on Digital Methods for Humanists

The rise of digital humanities (DH) in the early 2000s was paired with an inherent tension regarding pedagogy and education. On the one hand, humanists need to appropriate and integrate digital technologies in their research and teaching, both processes being subject to what has been qualified as “digital interferences” (Lucchesi 2020). On the other hand, the higher education institutions prove(d) slow to adapt their curricula to these needs. In this context, informal networks of scholars stepped in to fulfil training needs on an ad hoc basis (workshops, summer schools, THATCamps…) acting as a sort of “invisible college” for digital humanities (Crymble 2021). This both community-led and open pedagogy (Kirschenbaum 2010, Varin 2013) produced pedagogical outputs, electronically published and openly distributed thanks to the web technologies, now widely known as open educational resources (OER) as per UNESCO’s definition.

In this context, over the past couple of decades, several online platforms offering OER for digital methods and skills in the humanities and arts have emerged (see e.g. Schreibman et al. 2016; Edmond & Garnett 2017). These projects are committed to open access and generate community engagement in various forms, whether top-down, bottom-up or mixed. They also contribute to a linguistically more diverse DH landscape, either by being multilingual or by publishing non-English resources.

The lifespan of OER in DH now gives us the necessary distance, and data, to ask the question of their impact on the community. Impact is a broad concept that can be understood in different ways. Following a macro-level definition provided by the OECD, impact addresses the “so what?” question: the difference a given intervention can make and its potential transformative effects. We propose a panel as stakeholders in three publishing initiatives of open educational resources on digital methods for the Humanities: DARIAH-Campus (https://campus.dariah.eu/), Programming Historian (https://programminghistorian.org/) and Ranke.2 (https://ranke2.uni.lu/). The panel focuses on possible ways of measuring and understanding the impact of the lessons we publish, for the purpose of better serving the expectations of the communities involved: publishers, authors and various contributors (translators, reviewers...), educators and learners, as well as various supporters. The panel will consist of three presentations (10 minutes each) and a joint discussion (30 minutes).

Evaluating the (Re-)Use of Open Educational Resources: the case of Ranke.2

Sofia Papastamkou (C²DH, Ranke.2)

In the recent scientific literature on OER, impact seems to be synonymous principally with their (re)use, e.g. by educators integrating them in their teaching practices or by learners to fulfil their training needs (Manju & Batt 2021) or indeed in the sense of “any types of reworking activities that secondary users engage with when they work with resources that have been produced by primary creators, such as modifying, adapting, remixing, translating, repurposing, personalising or re-reversioning” (Pulker 2020).

The focus of this talk is the project Ranke.2 (ranke.uni.lu), a history-oriented teaching and training platform on digital source criticism. Launched in 2018, Ranke.2 is a top-down initiative, open to the community, the existence of which is intertwined with the constitution of a digital history interdisciplinary centre (Centre for Contemporary and Digital History) at the University of Luxembourg.

The paper will use quantitative and qualitative findings from Ranke.2, to explore how educators made use of the lessons in their courses, how they adapted them to their teaching needs and what were the learning outcomes their students achieved. This will help us to better understand the users’ perspective on OER, and what we can learn from it to improve our teaching resources. Beyond uses, the talk will also lean on a broader framework as per (Ebner et al. 2022) to propose an overall assessment of impact and appreciate the investment of various types of resources on the production of such OER.

Going Beyond Quantitative Metrics to Evaluate OER: the case of Programming Historian

Anisa Hawes (Programming Historian), Charlotte Chevrie (Programming Historian)

Programming Historian encompasses four Diamond Open Access journals (English, español, français, português) publishing ‘novice-friendly, peer-reviewed lessons that help humanists learn a wide range of digital tools, techniques, and workflows to facilitate research and teaching’.

Programming Historian lessons are created by a global community of authors, editors, peer-reviewers and translators who have collaborated to create more than 200 lessons. Since our online launch in 2012, our website has had over 10.8 million page views. That is 6.1 million unique users from countries across every continent. Analysis of web traffic data has allowed us to track patterns of global and local usage, and better understand the diversity of digital humanities skills sought and needed across different cultural contexts (Crymble & Im 2023).

As part of our commitment to open source values and practices, we publish all our journals under the CC-BY 4.0 Licence. Within its terms, the CC-BY Licence enables each of our four journals to translate, adapt and localise what we have already published across their collective directories. A broader impetus for choosing a licence which offers the freedom of adaptation and circulation is that it opens opportunities for others to do the same.

Various non-affiliated groups and individuals have already taken up the opportunity to translate and localise some of our lessons for their own audience. Mapping of initiatives that emerge from the opportunity offered by our CC-BY Licence as self-organised ‘community teams’ is an alternative way of visualising Programming Historian’s reach.

Meanwhile, a growing network of educators regularly centre our lessons within their classrooms. This involves agile adaptation of text and code upon screen, to voice and action in the classroom. Some educators and institutions have developed their own teaching models which integrate and build upon the lessons we publish. Capturing this use of Programming Historian in teaching is another prism through which we can measure our impact.

Beyond quantitative metrics, this presentation will explore possible qualitative measures for evaluating how learners and educators in our communities experience, use and re-use our resources.

Exploring the User Base of OER: the case of DARIAH-Campus

Vicky Garnett (DARIAH-EU / Trinity College Dublin)

Siobhán McElduff (Trinity College Dublin)

DARIAH is the ‘Digital Research Infrastructure for the Arts and Humanities’, an EU-wide organisation that supports the research and professional practices of scholars and practitioners in the Digital Arts and Humanities. DARIAH-Campus is the hub for training resources developed within the wider DARIAH eco-system. It functions as both a discovery framework for resources hosted on third-party websites, and as a hosting platform to publish new training resources directly: the majority of resources catalogued on DARIAH-Campus are ‘External Resources’, that is those hosted on third-party websites (usually project or institution websites). Some resources are published directly to DARIAH-Campus (‘Hosted Resources’). In addition to these two main resource types, DARIAH-Campus also offers a means to ‘capture’ live training events, in a structure that mimics the programme of the event, compiling presentation slides, videos, speaker biographies and event photos.

There are two key audiences for DARIAH-Campus: those who publish training content on the platform (in whichever form), and those who use the training and learning content on the platform either as supporting resources in the provision of training or as resources for personal skills development and lifelong learning. Much of the outreach activity to date has been aimed at encouraging content providers to publish their resources to DARIAH-Campus, and this has been successful. However, having achieved a strong representation on DARIAH-Campus of excellent Digital Humanities training content from the DH community, we want to start placing more emphasis on building the consumer base: those who use the training and learning materials.

This paper will therefore showcase the initial results of a long-term project investigating the uses of DARIAH-Campus within the broader community of scholars, educators and practitioners in the Digital Humanities, and what barriers and opportunities exist to accessing and utilising the training and learning resources available on online platforms.

Panellists & Moderator [live session]

Panellists

Vicky Garnett (DARIAH-Campus)

Anisa Hawes (Programming Historian)

Sofia Papastamkou (C²DH, Ranke.2)

Moderator

Stefan Krebs (C²DH, Ranke.2)

Papastamkou-Measuring the Impact of Open Educational Resources-200_a.pdf
Papastamkou-Measuring the Impact of Open Educational Resources-200_b.pdf
 
10:30am - 12:00pmSESSION#04: ART, MULTIMODAL & 3D
Location: H-205 [2nd floor]
Session Chair: Camilla Holm Soelseth, OsloMet - Oslo Metropolitan University, Norway
 
10:30am - 10:45am

Our Art - Art and data, an art exhibition curated by the public through an online collection

Sigurður Trausti Traustason

The Reykjavik Art Museum, Iceland

Our Art was a 2023 project of the Reykjavik Art Museum in collaboration with the Human Rights and Democracy office of the City of Reykjavik and the Citizens Foundation. The Citizens Foundation is a global nonprofit offering creative and secure open-source digital democracy solutions used in 45 countries. The Democracy office uses the Citizens Foundation's solution for its Better Reykjavik programme (www.betrireykjavik.is), a platform where citizens of Reykjavik can suggest and vote on projects they would like the city to build or implement in their neighbourhood. The idea was to use the platform to allow people to vote on which works from the art collection of the Reykjavik Art Museum should be installed in a special exhibition.

The art collection of the Reykjavik Art Museum is the largest in Iceland, totalling nearly 18,000 accessioned works. All of these are documented in a purpose-built FileMaker database that is run by the museum and hosted by the IT department of the City of Reykjavik. The database holds a vast amount of data on the accessioned artworks. In 2023 the museum celebrated its 50th anniversary. In honour of that occasion the museum put together an anniversary programme that focused in part on the collection of the Reykjavik Art Museum. The main goal was to inform the citizens of Reykjavík of their collective ownership of the collection. One of the projects was Myndlistin okkar // Our Art, where the public was given the opportunity to choose artworks from the collection for an exhibition; they were, in a way, given the role of the curator. The museum has for the past decade published nearly all of its collection online for visitors to browse. This is thought of as a way for the public to get to know the collection, but it can at the same time be a powerful research tool for art historians and other researchers. In Our Art, information on 3,000 artworks from the museum's online database (safneign.listasafnreykjavikur.is) was exported, alongside images of them, to the Citizens Foundation's platform. The need to make a smaller selection was mainly practical: not all of the works can be shown, owing for instance to the limitations of the exhibition space and the condition of the works, and some artists have thousands of works in the collection, which would have completely drowned the selection. The selection website is still open for browsing here: https://myndlistin-okkar.betrireykjavik.is/. Information on the final show and images from it can be found here: https://listasafnreykjavikur.is/syningar/myndlistin-okkar. The project was a big success and garnered much attention from the public and media in Iceland. Over 26,000 votes were cast in the project, with nearly 5,000 individual logins.

In this short presentation the author will be giving an overview of the project as a whole, its aims and goals, the challenges involved in its management and how the museum found an interesting way to reproduce and mediate the data from its database.

Sigurður Trausti Traustason is the Head of Collections and Research at the Reykjavik Art Museum. He was previously the manager of Sarpur, the collective collection management system of Icelandic museums. His interests lie in documentation and the preservation of cultural heritage.



10:45am - 11:00am

Not Following the Book: A Journey from Museum Conservator to Digital Humanities Researcher through the Creation of a Contemporary Art Management Database

Zoë Renaudie

Université de Montréal, Canada

This paper recounts the transformative journey of a museum conservator thrust into the realm of Digital Humanities through the inception of a groundbreaking database for managing contemporary art collections and facilitating exhibition production. The narrative unfolds at Luma Arles in France, where, as an assistant conservator, I unexpectedly became the project owner for a crucial collection management database initiative. Initiated in conjunction with the opening of Luma Arles in the summer of 2021, the project aimed to address the unique challenges posed by the dynamic nature of contemporary art.

The intricacies of the project involved grappling with existing private solutions that proved inadequate for the institution's needs and overcoming the shortcomings of an agile method adopted by the previous project owner. Upon the departure of the original project owner, the endeavor faced dissolution due to its perceived high cost and lengthy production timeline. Faced with the essential need for a database, I assumed responsibility for the project and proposed alternative solutions, leading to the successful realization of the database.

This paper illuminates the unconventional path taken during the database creation, highlighting both successful and unsuccessful practices. It emphasizes the importance of deviating from conventional approaches, revealing instances where embracing chaos and fostering a collaborative making process proved more efficient than strictly adhering to established methodologies.

Key aspects to be discussed include the evolution of the database creation process, challenges faced, and the ultimate solution devised. Furthermore, the narrative delves into the strategies employed to engage professionals from various sectors, emphasizing the transformation of the database from a mere archival tool to a dynamic and indispensable asset in the management of contemporary art collections.

As a conservator intimately involved in the project, I will delve into the intricacies of collection management, exploring the diverse types of data related to both the physical and conceptual aspects of artworks. Special attention will be given to the utilization of digital solutions as tools for conservation, particularly in the context of multimedia, performance, and installation artworks, showcasing the complexity that the project aimed to comprehend and manage.

This presentation extends beyond the project's technical aspects, sharing personal insights and experiences as an untrained project owner who, amidst challenges, discovered the realm of Digital Humanities. The culmination of this transformative journey has inspired the pursuit of a Ph.D. in Art History at the Université de Montréal (CA), supervised by Emmanuel Château-Dutier in the Research Digital Humanities Laboratory called l'Ouvroir (https://ouvroir.umontreal.ca/accueil).

In this Ph.D, I approach the question of exhibition conservation using a conceptual framework that considers it as a network of interconnected elements. To grasp this, I plan to consult the sociological notion of a boundary object (Star and Griesemer 1989), whose materiality arises from action rather than physicality. This concept also emphasizes the links between social worlds, allowing for different considerations of the object. Concurrently, the actor-network theory (Akrich, Callon, and Latour 2006) will define a referential state based on a set of multiple properties located in various times and spaces, involving multiple actors.

I also intend to invoke Jean-Pierre Cometti's investigation methodology in philosophy (2016), which opens the profession to a more biographical and holistic approach. This practice is also found in conservation-restoration theories of contemporary art and of works deemed complex (Saaze 2009; Scholte and Wharton 2011; Stigter 2017). These will play a central role in the work I intend to develop. The creation of the documentary model will also involve a theoretical framework specific to digital humanities (Schweibenz and Scopigno 2018; Barok et al. 2019), for which I will analyze the CIDOC-CRM ontology of ICOM following the documentation logic created by and for conservation-restoration (Leveau 2012). With this interdisciplinary approach, I propose to envision a tool to capture the complexity of exhibitions in collaboration with the Partnership for New Uses of Collections in Art Museums, CIECO (https://cieco.co/fr).

My research, in general, aims to explore ways to make digital tools accessible for museum professionals. I wish to expand perspectives on Galleries, Libraries, Archives, and Museums ([GLAM](https://glamdatasci.network/)) and advocate for alternative solutions in addition to traditional relational databases. The paper underscores the value of hands-on experience, adaptability, and a willingness to embrace innovation in navigating the intersections between contemporary art, conservation, and the evolving landscape of Digital Humanities, as I understand it.

Renaudie-Not Following the Book-145.pdf


11:00am - 11:15am

Beyond Creating Collections: A Scoping Review of 3D Heritage Storytelling

Nicole Basaraba

Trinity College Dublin, Ireland

This short paper aims to demonstrate how virtual reality (VR) and augmented reality (AR) can employ 3D archaeological reconstructions to immerse viewers in historical places, in combination with digital narrative development in the heritage and tourism sectors. Countless galleries, libraries, archives, and museums have included digitally re-created objects and re-imagined scenes in situ through theatrical sets and live historical re-enactments to help visitors see and contextualise objects from history. What if these techniques were applied to 3D virtual narrative experiences? This scoping review focuses primarily on selected studies published within the last five years (2018-2023) that create 3D models for archaeology, VR and AR applications that utilise 3D models, and virtual heritage ‘edutainment’ experiences. It gives an overview of the current methods and technologies used to create 3D heritage productions for scientific studies, heritage preservation and tourism, and educational applications. Ultimately, this paper discusses how VR and 3D modelling could, in future work, be transformed into narrative experiences based on historians’ and/or archaeologists’ current understanding of the past.

Basaraba-Beyond Creating Collections-181.pdf


11:15am - 11:30am

Transfer learning in digital art history: a flagrant need of standards for patrimonial images segmentation

Léa Maronet1, Alice Truc2

1École Pratique des Hautes Études, France - Université de Montréal, Canada - Centre National de la Recherche Scientifique, France; 2Université Rennes 2, France - Université de Montréal, Canada

Since Johanna Drucker questioned the existence of a “digital art history” (Drucker, 2013), the last decade has seen a growing number of art historical projects make use of methods such as automatic pattern recognition, confirming the transfer of a technology born in the narrow field of computer vision to a growing number of researchers. However, digital art history remains a fragmented field: until now, these projects have not seriously interacted with each other, as they preferred to develop their own technical and methodological solutions. While the prevailing project-based logic offers fresh insights and original solutions to similar problems, it requires a significant financial and time investment, large quantities of data and technical skills that are not accessible to all research units, let alone all researchers (Romein and al., 2020, 310). Our communities would benefit from pooling together our efforts to help reduce these costs and reduce research obstacles.

One consequence of this project-based logic is to limit the standardization of digital practices, both in terms of data recording and algorithm training. In particular, image segmentation - an operation at the heart of computer vision and pattern recognition tasks, which involves detecting and grouping pixels according to zones of interest and recording their coordinates - is not subject to any standards or harmonization. The adoption of standards is, however, not entirely absent from the field of digital art history: there are, for example, standards such as Dublin Core (DCMI Usage Board, 2012), or projects such as Metadata Culture, for recording metadata (Näslund et al., 2020). The production of ontologies, such as IconClass (Posthumus & Brandhorst, 2005) or SegmOnto (Gabay et al., 2023), to describe image content is also developing. However, there is no collective program to standardize the data produced during the segmentation of cultural images.

It is not so much a question of trying to standardize the description of image content - an attempt that falls within the realm of ontology, and whose terms vary from one project to another and from one work of art to another - as of proposing a way of harmonizing the recording of elements' coordinates within these images. Such initiatives exist in the field of literary studies, with projects such as HTR-United (Chagué, Clérice and Romary, 2021), but in digital art history the question remains little addressed (Bardiot, 2021). Yet having a corpus of segmented images, whose elements have been identified by their coordinates and recorded in a hierarchical and standardized way in a document, would make it possible to create ground truths that could serve as a basis for training new algorithms.
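To make the idea concrete, a harmonized record could store, for each image, the zones of interest with a controlled label and their polygon coordinates in a predictable structure. The sketch below is purely illustrative, loosely modeled on COCO-style annotation files with a SegmOnto-like label; it is not a standard proposed by the authors, and all identifiers and field names are hypothetical.

```python
# Purely illustrative sketch of one harmonized segmentation record: one image,
# one annotated zone, polygon coordinates stored in a predictable structure.
# Field names, identifiers and the SegmOnto-like label are assumptions.
ground_truth_record = {
    "image": {
        "id": "ms_example_f1r",                 # hypothetical image identifier
        "iiif_url": "https://example.org/iiif/f1r/full/full/0/default.jpg",
        "width": 4000,
        "height": 6000,
    },
    "annotations": [
        {
            "zone_label": "GraphicZone:Illustration",   # SegmOnto-like label (assumed)
            "polygon": [[512, 640], [2480, 640], [2480, 2210], [512, 2210]],
            "annotator": "project-A",
        }
    ],
}
```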

Why is this question crucial in our discipline today? The development and training of computer vision algorithms in digital art history is generally based on transfer learning, which involves taking pre-existing, already trained algorithms and adapting them to the specificities of a new corpus. While transfer learning can reduce the amount of data required to train algorithms and, at the same time, improve the results obtained, its current methods remain ill-suited to art history: the algorithms available are for the most part trained on “natural” images, which means they are not relevant to the specificity and plurality of cultural images. The lack of harmonization of the solutions proposed project after project leads to a bitter conclusion, drawn in the course of our doctoral research: the available tools evolve rapidly and become obsolete in the short term, until they are discarded; many of them are poorly documented or inaccessible to a non-expert public; others prove unusable in practice for a variety of reasons, ranging from operating-system incompatibility to unmaintained libraries. It therefore appears vital to us today to pool our segmented images so that they can serve as ground truths for future research, minimizing model training while enhancing the results obtained.
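A common transfer-learning pattern in this area is to start from a segmentation model pre-trained on natural images and replace its prediction heads with ones sized for art-historical zone classes, before fine-tuning on pooled cultural ground truths. The sketch below uses torchvision's Mask R-CNN to illustrate the general technique; it is not the workflow of any specific project discussed here, and the number of classes is an assumption.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_art_segmentation_model(num_classes: int):
    # Start from a Mask R-CNN pre-trained on "natural" images (COCO weights),
    # then swap its prediction heads so they output art-historical zone classes.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model

# Fine-tuning then proceeds on the pooled ground truths (segmented cultural images),
# which is where shared, standardized annotations become decisive.
model = build_art_segmentation_model(num_classes=5)  # 4 assumed zone classes + background
```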

However, it is important to keep in mind that segmentation is an intellectual operation, which always proposes an interpretation of the documents it takes as its object and must be adapted to specific research questions. How can we design digital practices and tools capable of handling segmentation tasks for different data corpora and settings, while remaining relevant to the epistemological requirements of art history and accessible to its researchers? Asking this question implies putting into tension issues of interoperability and reproducibility of patrimonial data, on the one hand, and epistemological reasoning in relation to the tools developed, on the other. These questions should not be addressed after tool development but upstream, and should accompany the entire workflow (Stutzmann, 2010, 247-278). To address these tensions, we will identify the specific needs of the discipline of digital art history with regard to questions of image segmentation. We will then offer a quick overview of segmentation tools that can be adapted to be functional for every art historical project. Finally, we will highlight the shortcomings and the solutions that need to be developed as a priority.

Pooling cultural ground truths is nowadays crucial if research is to move towards a better integration of computer vision and automatic pattern recognition in digital art history projects. In addition to the availability of quality, properly documented and cataloged data, this pooling of resources can encourage all researchers to gradually adopt homogeneous practices, allowing better reproducibility and repurposing of data. The aim is twofold: to produce ready-to-use models and workflows adapted to cultural data, and to share the data that enables these models to be produced. For the ground truths of each project to be reusable, segmentation data must be produced according to norms and standards designed specifically for images of works of art, with a view to conceptualizing these standards collectively.

Maronet-Transfer learning in digital art history-121.pdf


11:30am - 11:45am

Time-based media artworks at Reykjavík Art Museum

Edda Halldórsdóttir

Reykjavík Art Museum, Iceland

Modern-day art museums face both ideological and practical questions when dealing with the preservation of time-based media artworks. Instability of technology is a fact, and changes occur when new technologies present themselves. Time-based media works are unstable by nature; their technological components become obsolete, they often require adaptation, and every re-iteration poses new interpretations and changes in installation. With this development new professions have emerged, such as digital archivist, time-based media conservator and curator of new media, which all underline the fact that there is a need for professionally trained people in this field.

Museums have long faced, and still face, the challenge of not having enough storage space for their collections; they are now dealing with the same problem digitally, needing more digital storage space in order to store large files with the same care as they store classical paintings.

What I want to introduce is how Reykjavík Art Museum is facing the challenge of collecting and preserving works of art that contain complex technical components; works like video works and other technology-based artworks, digital images, sound works and works containing mechanisms relying on electricity and electronics as a basis for the existence of the works.

In 2017, Reykjavík Art Museum embarked on an exhibition project meant to deal with these challenges. The exhibition project was called Bout, where the majority of the time-based media in the museum’s collection was exhibited. The museum had identified gaps in the documentation of its time-based media art and so the project was conceived as a response to that. The title of the project, Bout – Video Works from the Collection, referred to the works being exhibited in four different bouts, each having its own theme which was based on the approach and subjects of the artists.

The exhibition allowed the audience to experience recent and historic works from the collection, but also provided the museum staff with a chance to review its archives and technical classifications – to learn from and improve the process behind exhibiting and caring for these works.

It furthermore opened a dialogue about new attitudes towards the collecting and the preservation of video art. As technology has progressed, new means of creation have opened up, and many of the works reflect experimentations with a medium in a constant flux. The artistic content has furthermore expanded in conjunction with technical possibilities at each given moment.

The Reykjavík Art Museum's collection holds roughly 17,000 artworks: paintings, sculptures, drawings, video, installations and outdoor sculptures in the city. Around 60 works are registered as time-based media artworks. The collection spans a period from experimental film in the sixties, through the early days of basic video equipment in the eighties, to the elaborate digital installations of the present. The first video works came into the collection around the year 2000.

Seven years have now passed since the exhibition Bout was carried out, and the museum has since had the opportunity to install the works again in different exhibitions. The documents and data gathered in 2017 have proven useful when installing the works anew, although there is always room for improvement. I can only address the topic from my own point of view, that is, as a registrar with no particular background in technological matters, trying to keep up with the latest developments in audio/video, hardware/software and related areas. My concern is how best to register these works so that future generations can preserve and exhibit them according to the will of the artist, but also logistically, within the limits of what is possible. It is then up to other specialists to take this information and install the work successfully.



11:45am - 12:00pm

Sonic Mapping: Creating Digital Interactive Soundscapes Based on Acoustic Surveys

Garrison Charles Gerard

University of Iceland, Iceland

Soundscape ecology is an expanding field that holds a variety of opportunities for understanding our changing ecosystems due to climate change and the impact of anthropogenic noise on natural soundscapes. Passive acoustic monitoring (PAM) provides enormous amounts of data detailing sound levels, species present, and other information about an ecosystem, but presenting that information in an approachable format is a challenge. I argue that music composition and digital interfaces provide an avenue for creating engaging systems from an interdisciplinary perspective that convey meaningful ecological information to an audience. In this paper, I examine this through the lens of my acoustic surveys in Iceland’s national parks and a sample system that sonifies the data from the PAM surveys and combines it with the field recordings from the natural spaces.

First, I examine the kinds of data made available by passive acoustic monitoring, using my acoustic surveys of Iceland's national parks as a case study. My surveys in Þingvellir, Snæfellsjökull, and Vatnajökull national parks were carried out from September 2023 to April 2024, producing more than 10,000 hours of audio. This audio was analyzed in multiple ways, including temporal audibility analysis, noise level analysis, multiple acoustic indices, and spectrogram analysis. These analyses provide a window into noise pollution in the areas and the impact of anthropogenic noise on these natural spaces.

Many of the results from the analyses illustrate expected trends, but some defy preconceived notions. For instance, the Normalized Difference Soundscape Index (NDSI)—which is a measure of the ratio of biologic sound (biophony) to human-produced sound (anthrophony)—provides a window into where and when biologic activity occurs compared with human activity (Kasten et al. 2012). At stations close to popular trails in Skaftafell national park, the NDSI shows the expected trend of biologic activity spikes around sunrise and sunset (dawn and dusk choruses). However, in areas far removed from human activity, biologic activity actually rises even higher between the dusk and dawn chorus. The recordings also provide insight into interactions between various elements of the soundscape such as between air traffic and biologic activity.
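For readers unfamiliar with the index, NDSI can be computed from the power spectral density of a recording as (biophony - anthrophony) / (biophony + anthrophony), with anthrophony conventionally taken as the 1-2 kHz band and biophony as the 2-8 kHz band (Kasten et al. 2012). The minimal sketch below illustrates this; the file name and exact band limits are illustrative and do not describe the actual analysis pipeline used in the surveys.

```python
import numpy as np
from scipy import signal
import soundfile as sf

def ndsi(path, anthro_band=(1000, 2000), bio_band=(2000, 8000)):
    """Illustrative NDSI, following Kasten et al. (2012):
    (biophony - anthrophony) / (biophony + anthrophony),
    computed from the power spectral density of a field recording."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:                      # mix stereo recorders down to mono
        audio = audio.mean(axis=1)
    freqs, psd = signal.welch(audio, fs=sr, nperseg=4096)
    anthro = psd[(freqs >= anthro_band[0]) & (freqs < anthro_band[1])].sum()
    bio = psd[(freqs >= bio_band[0]) & (freqs < bio_band[1])].sum()
    return (bio - anthro) / (bio + anthro)

print(ndsi("fieldrecording.wav"))   # hypothetical file; values range from -1 to 1
```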

The results of the noise analyses from the acoustic surveys provide interesting information, but they can be fairly opaque to an observer outside of acoustic ecology. My solution is a program that sonically maps the field recordings and data so that listeners can experience the soundscape and gain a deeper understanding of the information gleaned from these analyses. I accomplish this by creating a system where listeners can traverse an acoustic soundscape and experience the impact of various parameters based on changing variables. For instance, by changing season, weather, or time of day the listener can hear a different soundscape. They can also manipulate variables such as the number of people present or the amount of air traffic in a given ecosystem and hear the impact of those sounds on the environment. The system accomplishes this by using the underlying field recordings as sonic material, but also by using the information from the noise analyses to manipulate and process the field recordings to represent the impact of noise on these spaces. For instance, increasing the number of people present in a space applies filters that obscure the sounds present, mirroring the real-world effects of noise pollution. In this way, the system makes large-scale acoustic interactions apparent by representing them in an interactive format. The system is built in the Max programming environment, chosen for its high audio quality and the ease of creating portable versions of the system, but these same ideas could be adapted to other programming environments.
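To illustrate the masking idea outside of Max, the sketch below is a simple Python analogue, not the actual Max patch: as the simulated number of visitors increases, a low-pass filter with a lower cutoff is applied to the field recording, obscuring high-frequency biophony much as anthropogenic noise does. The cutoff mapping and file names are assumptions.

```python
import soundfile as sf
from scipy import signal

def mask_by_visitors(infile: str, outfile: str, visitors: int):
    """Rough analogue of the interactive system: more simulated visitors -> lower
    low-pass cutoff, so bird song and other high-frequency biophony is obscured.
    The 8 kHz-to-1 kHz mapping is chosen purely for illustration."""
    audio, sr = sf.read(infile)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    cutoff = max(1000.0, 8000.0 - 700.0 * visitors)
    sos = signal.butter(4, cutoff, btype="low", fs=sr, output="sos")
    sf.write(outfile, signal.sosfilt(sos, audio), sr)

mask_by_visitors("fieldrecording.wav", "masked.wav", visitors=6)  # hypothetical files
```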

The long-term impact of this project will be the development of more effective tools for communicating acoustic data to audiences. The system can be presented as a public installation, a webpage, or in a concert setting. The goal is to create a system that can sonify and represent the growing variety of acoustic surveys and soundscape information being produced every day. This project would not be possible without beginning from a fundamentally interdisciplinary perspective and leveraging the capabilities of digital systems to convey information in a portable and transparent fashion.

 
10:30am - 12:00pmSESSION#05: BORN-DIGITAL DATA
Location: K-205 [2nd floor]
Session Chair: Sally Chambers, DARIAH, Belgium
 
10:30am - 11:00am

Understanding the Challenges for the Use of Web Archives in Academic Research

Sharon Healy1, Helena Byrne2, Olga Holownia3

1Independent Researcher; 2British Library, United Kingdom; 3International Internet Preservation Consortium (IIPC)

In this paper we examine the challenges for the use of web archives in academic research through a synthesis of the findings from two research studies that were published through the WARCnet research network. The aim of the WARCnet network was to promote high-quality national and transnational research to enable a better understanding of the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives. The network activities ran from 2020-2023.

This paper fits with the Education and Advocacy theme of the conference, with regards to advances in approaches to teaching digital tools and methods. In order to advance approaches to teaching digital tools and methods with a specific focus on the use of web archives for research, it is important to understand the challenges faced by both non-users and users of web archives in the first place. Other studies have also done substantive work in this area focusing on users of web archives, awareness and engagement with web archives, and the scholarly use of web archives (Hockx-Yu, 2014; Riley and Crookston, 2015; Costea, 2018; Gooding et al., 2019). What is evident from such studies is the need to continually examine the challenges for users and non-users of web archives in order to assess and develop the skills, tools, and knowledge requirements for working with web archives for research purposes.

The first study under review is the Scholarly Use of Web Archives Across Ireland: The Past, Present & Future(s). This was a collaborative project which incorporated a review of resources and literature, informal dialogues with heritage colleagues, and the use of an online survey. The study sought to (i) examine the causes of the loss of digital heritage and how this relates to Ireland, (ii) offer an overview of the landscape of web archives based across Ireland, and their availability and accessibility as resources for Irish-based research, and (iii) provide some insight into the awareness of, and engagement with, web archives in Irish third-level academic institutions. For this paper we focus on the section of the study that looks at scholarly engagement with web archives and the challenges perceived by users and non-users of web archives in Irish academia. We discuss scholarly awareness of and engagement with web archives in Irish academic institutions and offer some insights which may be useful when it comes to providing support and incentives to assist scholars in the use of the archived web for research and teaching. This study could be replicated in other jurisdictions to help understand user needs when using web archives in academic research.

From the findings, participants from an Irish academic setting attributed their lack of engagement with web archives to:

  • a lack of awareness of the availability of web archives,

  • uncertainty as to how relevant, useful, or beneficial a web archive would be for their research,

  • not knowing how to use a web archive,

  • not having the technical skills to use or process web-archived content,

  • not knowing how to find archived websites relevant to their research in a web archive,

  • not knowing how to cite or reference an archived website from a web archive,

  • uncertainty about the credibility or authority of using archived websites as a source,

  • uncertainty about copyright implications of using archived web content for research,

  • a sense that web archives are not relevant to their discipline.

Participants further outlined the perceived challenges in using web archives for research due to:

  • a lack of awareness of the existence, content, and value of web archives,

  • understanding search and navigation mechanisms,

  • working with large volumes of data,

  • understanding access and discovery mechanisms,

  • understanding the representativeness and completeness of the data in web archives,

  • perceptions of web-archived data being a non-established source and/or lacking source credibility,

  • citing archived web content.

Overall, the findings suggest that there is a limited awareness of the existence of web archives in Irish academic institutions. Therefore, for an unfamiliar audience, more effort is needed to demonstrate the importance of archiving the web and to promote the value of web archives as resources for research, as well as to disseminate use cases in Irish-based research that demonstrate the use of web archives as a research resource. The study recommends developing multidisciplinary and interdisciplinary research networks in Irish academia to address potential solutions for developing research models and paradigms for the use of web archives in Irish-based research that are fit for purpose across a broad spectrum of research fields.

The second study that we review is the Skills, Tools, and Knowledge Ecologies in Web Archive Research. This was a collaborative project between seven members of the WARCnet network from five institutions. Two of these institutions were academic and three were from the GLAM sector. The study sought to identify and document the skills, tools, and knowledge required to achieve a broad range of goals within the web archiving life cycle, and to explore the challenges for participation in web archive research and the overlaps of such challenges across communities of practice. The methodology for the study entailed desktop research, participation in WARCnet meeting discussions, and an online questionnaire. Respondents who participated in the online questionnaire identified as residing in North America, Europe, and Asia. In this paper we focus on the section of the study that looks at the challenges faced by researchers who work in an academic setting when using web archives.

From the findings, participants who identified as researchers in an academic setting offered several insights on the challenges in working with web archives due to:

  • a lack of research methods, theory, and approaches for combining traditional methods with web archive research,

  • having to learn new skills (e.g., programming, data sheets, etc.),

  • working with large volumes of data in terms of storage, processing and analysis,

  • a lack of access to more comprehensive metadata and documentation for web archive collections,

  • legalities in terms of access to, use of, and storage of data from web archives, as well as complications with legal deposit, copyright, and GDPR,

  • a lack of experience in handling protected data from a web archive,

  • the inability to download data from some web archives,

  • citing archived web content and datasets.

Overall, the study emphasises that collaboration between the creators of web archives and end users is key. It notes how the web archive field would be enriched through the inputs of both communities in developing a better understanding of the research methods and approaches for using web archives. For example, the study indicates that there would be some value in extending introductory web archiving training to researchers, in a bid to offer them a better understanding of the limitations of web archiving strategies due to technical challenges, legal constraints, and a lack of resources. The study also highlights that challenges for end users do not diminish with increasing experience, and emphasises the need for training across all levels of experience.

To end, both of these studies highlight that there is a steep learning curve in working with web archives for research. One challenge is understanding the terminology and technical language used within this field. To improve the accessibility of these reports, Healy et al. produced a glossary that was published through WARCnet. Towards a Glossary for Web Archive Research: Version 1.0 aimed to discuss the development of a glossary of terms and concepts for web archive research, using a novel approach which can be built on depending on user needs.



11:00am - 11:15am

Diving Into the Digital Heritage: (re)Searching the Norwegian Web Archive

Jon Carlstedt Tønnessen

National Library of Norway

For more than two decades, web media have played a pivotal role in cultural and societal transformations. During that time, web archive initiatives have collected and preserved petabytes of web content that can serve as an important basis for studies of these recent and ongoing processes. However, scholars who want to study web archives have described significant obstacles to finding, exploring and analysing relevant data. A common critique is that web archives are “messy”, “chaotic” and unstructured, lacking an organisation that makes sense to humans. Further, scholars describe the organisation of archival data and the materiality of WARC (Web ARChive) records as highly complex, requiring technical expertise for meaningful interaction and analysis. While web archives may be fundamental to understanding contemporary culture, phenomena, and events, the current situation does not satisfy the expectations and needs of many researchers.[1]

Learning from these lessons, the National Library of Norway has started to index the content of the Norwegian Web Archive (NWA) collection, allowing for search in full text and rich metadata. Using SolrWayback, a bundle of technologies developed in the context of the NetarchiveSuite initiative[2], the service being built enables researchers to find and explore data and metadata, perform various analyses, and produce corpora of both mono- and multimodal material for computational analysis. The paper will present how our efforts are made within existing legal and ethical commitments, in alignment with the FAIR principles, and address some of the main opportunities and challenges associated with the current platform.

The paper will unfold in three parts. First, I will make the case that web archives are not unstructured. Rather, the data in web archives are organised according to different principles than in paper-based archives. While paper-based archives are often designed with a hierarchical structure, ordering items into a tree-like structure with nested categories and subcategories that aim to reflect the relationships and contexts of the documents in a way that often makes sense to humans, web archives are organised more flatly, based on the way web crawlers work and retrieve data.[3] This increases the need for research infrastructures that provide relevant metadata with descriptions of content, context, and other relevant attributes, making it easier to find related items independently of the structure of the archive. It also underlines the need for interfaces where scholars can search, filter, explore and interact with the archival content in a meaningful way.[4]

Second, I will present the results from indexing domain crawls from 2019 to 2022, enabling search in full text and rich metadata for 100 million web resources. This includes a demonstration of the service, displaying the power of its advanced search syntax, examples of the richness of the metadata, and how it supports the FAIR principles. Further, I will show how export functionalities can be used to build corpora and facilitate computational analysis, such as NLP, speech-to-text, image classification and network analysis of domain relations. I will also share the main findings of a pilot study, which estimates that the HTML documents in the NWA collection contain more than 240 billion words in total. This alone is 50% more than all the printed newspapers and books digitised by the National Library, making it one of the largest text corpora in the world.
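To give a concrete sense of what full-text search against a Solr-indexed web archive can look like, the sketch below issues a generic Solr query. The endpoint URL and the field names (content_text, domain, crawl_date) are assumptions for illustration only; they do not describe the NWA service's actual schema or access routes, which are mediated through SolrWayback and subject to the legal framework mentioned above.

```python
import requests

# Hypothetical Solr endpoint; real access to the NWA goes through SolrWayback.
SOLR = "https://example.org/solr/netarchive/select"

params = {
    "q": 'content_text:"klimakrise"',          # free-text query (field name assumed)
    "fq": [                                     # filter queries (field names assumed)
        "domain:*.no",
        "crawl_date:[2019-01-01T00:00:00Z TO 2022-12-31T23:59:59Z]",
    ],
    "rows": 100,
    "wt": "json",
}
docs = requests.get(SOLR, params=params).json()["response"]["docs"]
print(len(docs), "matching archived resources")
```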

Third, I will review lessons learned from testing the service with more than 20 scholarly users. In addition to bringing forward important user experiences and observing how scholars experience the service, opening the archive to researchers has provided valuable insight into the collection, already improving harvesting. This is not only a reminder that archives gain value through their usage, but also highlights the importance of working systematically with user orientation in designing and developing tools and services for the Digital Humanities.

Wrapping things up, I will briefly present how scholars can get access to the NWA service, and address some challenges that are currently unsolved, such as the identification of low-resource languages and scalability to infrastructures with distributed storage.

[1] Milligan, ‘Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives’; Ruest, Fritz, and Milligan, ‘Creating Order from the Mess: Web Archive Derivative Datasets and Notebooks’; Völske et al., ‘Web Archive Analytics’; Schafer and Winters, ‘The Values of Web Archives’; Vlassenroot et al., ‘Web Archives as a Data Resource for Digital Scholars’; Gomes and Costa, ‘The Importance of Web Archives for Humanities’.

[2] https://github.com/netarchivesuite/solrwayback/

[3] Webster, ‘Existing Web Archives’, 35–37.

[4] Brügger, ‘The Need for Research Infrastructures for the Study of Web Archives’, 221–22.



11:15am - 11:30am

Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora

Amanda Myntti1, Veronika Laippala1, Erik Henriksson1, Elian Freyermuth2

1University of Turku, Finland; 2National Graduate School of Engineering of Caen

Web-scale corpora, automatically collected from the web and encompassing billions of words, present significant opportunities for diverse fields of research. These corpora play a pivotal role in the advancement of large language models, such as the one underpinning ChatGPT. Moreover, they include masses of texts produced in different situations with different objectives, and they host new forms of digital cultural heritage that are constantly emerging and evolving. Therefore, they open up new avenues for research within the humanities and social sciences, but also require multidisciplinary collaboration to guarantee their usability (Laippala et al. 2021; Välimäki and Aali 2022).

A notable challenge associated with web-scale corpora is the absence of metadata detailing their contents. Typically, they lack information regarding the origin and the content of the documents. Documents featuring different text varieties, ranging from legal notices to advertisements, news articles, fiction, and song lyrics, all have an equal status in the corpora. This study aims to address this challenge by exploring various approaches to classifying web corpora into specific subsections. In particular, we focus on registers, typically applied in corpus linguistics and defined as situationally defined text varieties (Biber and Conrad 2019), and genres, often utilized in literary studies when examining various forms of literary work (e.g., Goyal and Prakash 2022; Zhang et al. 2022).

In recent years, web register identification has taken leaps forward, with web register classifiers achieving nearly human-level performances (Laippala et al. 2023; Kuzman et al. 2023). However, in practice, when applied to web-scale corpora, the predicted register classes are still very broad, including a wide range of linguistic variation. Therefore, in this study, we examine if and how the available information can be deepened by combining two approaches: registers and genres.

We apply machine learning to train two text classifiers: one targeting registers and one focusing on genres. We utilize these classifiers to predict the classes for one million documents in the widely applied, web-scale Oscar dataset (Suarez et al. 2019; Laippala et al. 2022). Then, we 1) evaluate the distributions of the text classes predicted using the two classifiers; 2) analyze the intersections of the classes, and 3) examine how the combination of the two approaches extends the metadata available for the corpus.

The register classifier is trained using the CORE corpora (Biber and Egbert 2018; Laippala et al. 2022, 2023). The scheme is hierarchical and covers eight main categories with broad, functional labels such as narrative, informational explanation and opinion, and more detailed subcategories such as news report, research article and review. The training data for our genre classification model consists of books from Kindle US. The genre categories are assigned by the authors and selected from the possible categories of Kindle US, which include categories such as Children’s books, Science & Math, and Action & Adventure. The original dataset is available at https://huggingface.co/datasets/marianna13/the-eye and a cleaned version used in training is available at our Huggingface page at [deleted-for-review].

As some of the genre classes in the dataset are overlapping, we perform some pre-processing steps to improve the data quality. We choose a subset of genres that maximize the performance in two ways: firstly, the chosen genres need to be present in our corpus, Oscar. As a web corpus, genres most suitable for our task include Medicine & Health; Cookbooks, Food & Wine; Engineering & Transportation; and Politics & Social Sciences, which are common topics in online sources. Secondly, as categories are partially overlapping and some contain very few examples, we choose other categories based on their support in the dataset and by testing the performance with different candidate subsets.

The register model is implemented by fine-tuning XLM-RoBERTa-Large (Conneau et al. 2020) on the CORE corpus, with register identification modeled as multilabel classification. For training, we use the Huggingface Transformers library. Preliminary results show the register classifier reaching an F1-score of 0.77. We also use XLM-RoBERTa-Large as the basis of our genre classifier. Experiments are done on multiple genre subsets as described above. Similarly to the register model, the task is framed as multilabel classification and, again, we use the Huggingface Transformers library. We select the best prediction threshold based on the F1-score. Our preliminary results show an F1-score of 0.70.
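As a rough illustration of this setup (not the authors' training code), multilabel classification with XLM-RoBERTa-Large can be configured in the Transformers library as sketched below. The label list and the 0.5 threshold are placeholders; in the study the threshold is selected on the basis of the F1-score, and the model must first be fine-tuned on CORE (or the genre data) before its predictions are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder label inventory; the CORE scheme has eight main registers plus subregisters.
labels = ["Narrative", "Informational", "Opinion", "Interactive Discussion"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-large",
    num_labels=len(labels),
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

def predict_registers(text: str, threshold: float = 0.5):
    # Threshold is a placeholder; fine-tuning on labelled data must happen first.
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    return [lab for lab, p in zip(labels, probs) if p >= threshold]
```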

We use the two classifiers to label one million documents of the Oscar corpus. In our experiments, we see some expected combinations between certain registers and genres, such as the Lyrical register and the Literature & Fiction genre often coinciding, but equally registers such as Interactive Discussion are divided into multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the topic of the discussion. Our preliminary qualitative evaluation shows that the predicted genre and register labels provide valuable auxiliary information which facilitates new uses of the corpus in the study of digital cultural heritage. We will analyze the intersection and the different combinations of genre-register pairs using topic modeling and keyword analysis, as well as evaluate the benefits of cross-labeling a corpus as a tool for creating additional metadata.
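One straightforward way to inspect such genre-register intersections, assuming the predictions have been collected into a table, is a simple cross-tabulation; the file and column names below are assumptions for illustration, not the project's actual outputs.

```python
import pandas as pd

# Assume one row per Oscar document with its predicted register and genre labels
# (file and column names are assumptions for illustration).
df = pd.read_csv("oscar_predictions.csv")          # columns: doc_id, register, genre
crosstab = pd.crosstab(df["register"], df["genre"], normalize="index")
print(crosstab.round(2))   # e.g. share of Lyrical documents labelled Literature & Fiction
```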



11:30am - 11:45am

A test of browser-based collection of streaming services’ interfaces

Andreas Lenander Ægidius

The Royal Danish Library, Denmark

This paper presents a test of browser-based Web crawling on a sample of streaming services' websites and web players. We are especially interested in their graphical user interfaces, since the Royal Danish Library collects most of the content by other means. In a legal deposit setting, and for the purposes of this test, we argue that streaming services consist of three main parts: their catalogue, their metadata, and their graphical user interfaces. We find that collecting all three parts is essential in order to preserve and play back what we could call 'the streaming experience'. The goal of the test is to see if we can capture a representative sample of the contemporary streaming experience, from the initial login to (momentary) playback of the contents, for the benefit of digital preservation and media research.

Currently, the Danish Web archive (Netarkivet) implements browser-based crawl systems to optimize its collection of the Danish Web sphere (Myrvoll et al., n.d.). The test will run on Browsertrix Cloud (Webrecorder, n.d.). Our sample includes streaming services for books, music, TV-series, and gaming, e.g. Netflix, DRTV, Spotify, and Twitch.tv.

In the streaming era, the very thing that defines it is what threatens to impede access to important media history and cultural heritage. Streaming services are transnational and paywalled, while content catalogues and interfaces change constantly (Colbjørnsen et al., 2021). This challenges the collection and preservation of how they present and play back the available content. On a daily basis, Danes stream more TV (47 pct.) than they watch flow TV (37 pct.), and six out of ten Danes subscribe to Netflix (Kantar-Gallup, 2022). Streaming is a standard for many and no longer a first-mover activity, at least in the Nordic region of Europe (Lüders et al., 2021).

The Danish Web archive collects websites of streaming services as part of its quarterly cross-sectional crawls of the Danish Web sphere (The Royal Danish Library, n.d.). A recent analysis of its collection of web sites and interfaces concluded that the automated collection process provides insufficient documentation of the Danish streaming services (Aegidius and Andersen, in review).

This paper presents findings from a test of browser-based crawls of streaming services' interfaces. We will discuss the most prominent sources of errors and how we may optimize the collection of national and international streaming services. The test will include a concurrent dialogue with the developers of the software via their GitHub repository. This collaborative approach highlights how digital humanities tools can be 'live' and in the making. We hope to discuss the quality of what we can collect and what we gain from librarians and developers collaborating: how does the collaboration, i.e. the active development of tools, impact the archive that is being made, and how do we document the process? Can we capture these aspects of our test and the resulting archives in a datasheet specifically for digital cultural heritage datasets? Alkemade et al. (2023) propose a datasheet that supports the documentation of practices and procedures established in GLAM institutions that lead to establishing collections' descriptions. Collecting and preserving the interfaces of streaming services produces digital cultural heritage datasets that are marked by specific characteristics. They provide a case for digital archives as datasets that are often the product of multiple layers of selection; they may have been created for different purposes than establishing a statistical sample according to a specific research question; and they change over time and are heterogeneous (Alkemade et al., 2023).



11:45am - 12:00pm

Memory in the Mediated Age: Unveiling the Dynamics of American Society's Memory through Twitter Discourse on Lynching

Feeza Vasudeva, Narges Azizi Fard, Eetu Mäkelä

University of Helsinki, Finland

In the contemporary mediated age, the landscape of collective memory is undergoing a transformative shift, particularly evident in the realm of social media. Marked by the 'connective turn', which emphasizes the sudden surge of digital media, communication networks and online archiving, we witness an unprecedented shift in our understanding of and engagement with memory (Hoskins, 2011). Exploring this shift, the study delves into the intricate web of remembrance and oblivion, focusing on how American society remembers and forgets specific historical events, centring on the enduring historical trauma of lynching. Utilizing mixed methods and computational tools, the research aims to explore the dynamics of memory formation and perseverance, as well as memory decay, within Twitter discourse.

The legacy of lynching, a deplorable chapter in American history, continues to echo in collective memory, undergoing a transformative evolution. From concrete and literal spectacles of white supremacist violence, lynching has morphed into one of the most vivid symbols of racial oppression, serving as a poignant metaphor for ongoing racial relations in the United States (Rice, 2006). Through the analysis of extensive tweet datasets, this research seeks to uncover patterns, sentiments, and discourse structures, offering insights into the evolving nature of collective memory surrounding lynching. Computational methods including Named Entity Recognition, topic modeling, network analysis, and LIWC software are employed to reveal the emotional dimensions of collective memory, identify recurring themes, and trace the social dynamics shaping the discourse (Jiang & Xu, 2023; Sumikawa, Jatowt & Düring, 2018).

Furthermore, the ‘connective turn’ has re-engineered memory, liberating it from traditional constraints like spatial archives, organizational structures, and institutions. Instead, memory is distributed continuously through connectivity. This prompts an exploration of the temporal dimensions linking historical memory to contemporary events, investigating instances where the memory of lynching intertwines with modern occurrences, influencing and reshaping public discourse on Twitter. The findings contribute valuable insights into understanding the delicate balance between remembering and decay in the mediated age, particularly through the lens of social media platforms. Additionally, the study sheds light on the compulsive nature of contemporary connective practices within digital media content. Individuals and groups actively engage in various connective actions like posting, liking, tweeting, scrolling, forwarding, etc., forming a coercive multitude that eschews traditional debates in favor of digital emotive expressions, often conveyed through emoticons (Hoskins 2011).

As the research unfolds, the connective turn introduces a crucial perspective on the ontological shift in what memory is and does (Hoskins 2017). This paradigm shift, both arresting and unmooring the past, challenges traditional notions of historical consciousness, particularly in the light of ‘technology-mediated memory’ (Elwood & Mitchell, 2015). The insights gained from computational analyses inform discussions on the role of social media platforms in shaping historical memory, thereby also serving as a bridge between disciplines of memory studies and computational studies.

References

Hoskins, A. (2011). Anachronisms of media, anachronisms of memory: From collective memory to a new memory ecology. In On media memory: Collective memory in a new media age (pp. 278-288). London: Palgrave Macmillan UK.

Rice, A. (2006). How We Remember Lynching. Nka: Journal of Contemporary African Art, 20(1), 32-43.

Jiang, K., & Xu, Q. (2023). Analyzing the dynamics of social media texts using coherency network analysis: a case study of the tweets with the co-hashtags of #BlackLivesMatter and #StopAsianHate. Front Res Metr Anal.

Sumikawa, Y., Jatowt, A., & Düring, M. (2018). Digital History meets Microblogging: Analyzing Collective Memories in Twitter. In Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL '18). Association for Computing Machinery, New York, NY, USA, 213-222.

Hoskins, A. (2017). The restless past: An introduction to digital memory and media. In Digital memory studies (pp. 1-24). Routledge.

Elwood, S., & Mitchell, K. (2015). Technology, memory, and collective knowing. Cultural Geographies, 22(1), 147-154.

 
10:30am - 12:00pmSESSION#06: COLLABORATIONS & RESEARCHER ENGAGEMENT
Location: K-206 [2nd floor]
Session Chair: Rakel Adolphsdottir, National and University Library of Iceland, Iceland
 
10:30am - 11:00am

Ghosts in the Archives? The Search for Feminist and Queer Archival Materials in Sweden

Rachel Pierce1, Siska Humlesjö2

1KvinnSam, Humanities Library, Gothenburg University, Sweden; 2Gothenburg Research Infrastructure in Digital Humanities, Gothenburg University, Sweden

Archives have become a central point of discussion and activism within queer and feminist communities, internationally and locally, during the past few years. Building on and referencing work by Ann Cvetkovich (2003) and Jack Halberstam (2005), amongst others, gay, lesbian, queer, and feminist cultural heritage institutions have sprung up. In Sweden, “arkivism” (Larsson Pousette & Thomsgård 2021) around collecting materials from historically marginalized groups has resulted in the recent establishment of two queer archives (QRAB and SAQMI) and one archive for Black Swedes (Black Archives Sweden) that specifically highlights queer stories within the Black Swedish community. This is a new wave of archives-building – KvinnSam, the National Library for Gender Research, was the result of similar organizing in the late 1950s. Despite the long history of work with feminist and queer materials, how the collections and major themes of this kind of broader feminist and queer archival work are represented and made visible in platforms for Swedish cultural heritage is still an unexplored question.

Researchers and librarians have devoted much time to developing, testing, and analyzing indexing standards and practices that allow for thematized search. Work on feminist and queer metadata systems has exploded in the past few years, spurred by the development of Homosaurus, a linked open data vocabulary, which has been partially translated into Swedish and applied to Swedish literature in the project Queerlit (Golub, Bergenmar & Humlesjö 2023). However, archival collections and description practices are quite different from indexing practices for printed materials. As a result, there has been little development of archival metadata that facilitates thematic searches for “hidden” or “ghostly” materials and themes within physical collections (for work on archival ghosts, see Harris 2021). This paper aims to lay some groundwork for that development.

Just because the metadata is good does not mean that it is visible or usable. Johanna Drucker (2014) has long argued for a humanistic approach to digital infrastructure design that recognizes the arguments embedded in such design. It is worth thinking about what information is prioritized and what information is made invisible, and how platform design directs users to seek out particular kinds of information in particular ways. Further, there is plenty of research on the naivete of researchers, who often think that all information and data has made its way into digital databases, when this is far from true (for a Swedish example, see Weber 2022).

This assumption that databases are complete is an especially dangerous situation for materials related to women, gender, sexuality, and queer themes, given traditional descriptive practices for archives, which privilege certain kinds of lives and achievements based in a normative definition of male public-facing work and hide non-normative or traditionally female-coded work and life. This is as true for Sweden as it is for other places, as Anna Nordenstam (2008) notes. Digital humanist Tara McPherson (2014) has argued that “If a core activity of the digital humanities has been the building of tools, we should design our tools differently, in a mode that explicitly engages power and difference from the get-go, laying bare our theoretical allegiances and exploring the inter-actions of culture and matter” (p. 182).

This paper will build on an initial study of searching the Swedish cultural heritage platform Alvin for queer and feminist histories (Pierce 2024), expanding analysis to the other two large archival databases in Sweden, Arken and NAD (the National Archival Database). Together, these platforms represent over 200 institutions and archival divisions, meaning that they represent the vast majority of Sweden’s archival holdings available via digital channels. These platforms are intended to increase collaboration and research over institutional and disciplinary boundaries, by combining information on archival collections that were often housed, described, and made accessible locally until very recently.

Our study will examine how archival finding aids – the roads into and directions for using physical and digital archival collections – are represented and whether/how they make feminist and queer histories accessible. Initial findings from the examination of Alvin pointed towards problems with the representation of physical archives via finding aids on digital platforms. These problems stem from a set of rules for web design that prioritize visual materials, the promotion of bigness in search functionality, where greater numbers are privileged as relevancy, and the fuzziness of the concept of “archive” within a system where all “documents” (i.e. posts) are considered equal and distinct. All of these tendencies are at odds with the structure and descriptive rules that govern archival finding aids.

This study will examine these three platforms on a number of levels. First, we will assess the search functionality of each platform, analyzing which search routes are facilitated by the platform and which are not. Second, we will examine how search via free text functions in contrast to using subject words. Third, we will examine how subject headings are used, if they are used at all. Fourth, we will examine how each platform constructs a post/document. How open or closed are these posts, in terms of links and the facilitation of what historian Laura Putnam (2016) terms “side-glancing,” a particular kind of archives-based browsing? And fifth, we will examine three archival finding aids that appear most centrally positioned within lesbian and feminist history on each of the three platforms as defined by our searches. How is this post/document/finding aid structured, and why does this result in its relevancy across our array of searches?

An examination of NAD and Arken complements the Alvin analysis through comparison. Arken is a very small database for the Swedish National Library (KB) and Umeå University, which includes both finding aids and digitized materials. Given that KB also controls the national subject word list, the inclusion of this platform is significant. NAD is the largest database for finding aids in Sweden, with well over 150 institutions represented. Overlaps between NAD and the other two platforms exist, in part because almost all institutions with special collections have, at one point or another, prioritized the outreach that NAD can offer. This is a database that most historians in Sweden consider a one-stop shop for archives. It is also a database and platform under threat, underfunded to such an extent that it can no longer be updated by the National Archives, let alone developed to accommodate digitized material, a situation that may lead to its eventual demise (Isacson et al. 2023). A truncated free-text search gives a result of 15 013 168 records for NAD, 62 511 records for Arken and 433 233 records for Alvin (search conducted 2024-01-12).

In the case of NAD, the database is clearly developed by and for archivists, meaning that researchers familiar with archives will feel at home. However, NAD is poorly equipped for current digital humanities thinking or, indeed, users without significant experience in archives. The database is built for finding aids, although digitized materials are now available. The metadata produced for the platform is minimal and does not facilitate browsing. And the most used areas for archival description do not lend themselves to digital search. A search for “protokoll” (approved meeting notes) will turn up every organizational archive in NAD. Interlinked organizational and personal names offer the only non-hierarchical search options, obscuring thematic overlaps between collections. NAD’s main strength is its scope combined with the uniformity of its record metadata, making it possible to provide general search instructions. These strengths mean that feminist and queer archival material seems quite findable, but the representativity (are materials only reflective of post-WWII histories?) and thoroughness of search results are hard to determine.

Arken is a much smaller search service developed and managed by the Royal Library (KB) for its own special collections materials, including finding aids and digitized materials. Umeå University recently joined the platform, moving from Alvin. UU also has records in NAD. The database is built upon the open-source web application AtoM (Access to Memory) and adheres to international standards set forth by the International Council of Archives. This is by far the easiest platform to navigate, regardless of prior experience in archives. However, subject headings do not seem to be applied in a systematic way, and there is nearly non-existent metadata related to feminist and queer histories. Given the much smaller nature of Arken, search results reflect an historical erasure of these histories.

The Alvin paper identified areas for potential development that will be further explored in this paper. In particular, relationships and networks between individuals, organizations, themes, and archival holdings might be made more central to website design, to facilitate queer and feminist search across archives and institutions without labeling individuals as lesbians or feminists – a politically and historically dubious approach. While NAD does not apply keywords and has very limited operationalization of linking and Alvin fails to interlink its subject words, this kind of network approach has been applied to some degree within Arken via linked subject headings attached to archival finding aids, to facilitate thematic search.

Further exploration of these platforms will have two concrete benefits. First, institutions like KvinnSam will be able to develop finding aids and metadata that ensure the visibility of their materials for women’s, gender, and queer researchers, as well as a broader research community that has perhaps overlooked the relevancy of this kind of archival material. In other words, greater findability can assist in integrating women’s, gender, and queer history into national and international studies that do not explicitly focus on these topics. Second, the study will lay the groundwork for the further development of platforms for archives in Sweden, in ways that take into account but also look beyond the traditional hierarchical finding aid structure and capitalize on the insights of queer and feminist archives scholars. Development of platforms that facilitate access to archival materials must be based in and facilitate dialogue across the divide between cultural heritage professionals, researchers, and the general public, in order to build sustainable access to historical collections.

Pierce-Ghosts in the Archives The Search for Feminist and Queer Archival Materials-105.pdf


11:00am - 11:30am

Understanding Researchers’ Perspectives on Work Tasks in Digital Humanities and Computational Social Sciences

Anna Sendra, Elina Late, Sanna Kumpulainen

Tampere University, Finland

The usefulness of research infrastructures (RIs) for digital humanities (DH) and computational social sciences (CSS) depends on their capability to support research work tasks. RIs can provide more effective support and services when they have a clear understanding of the work practices and specific tasks researchers are engaged in. This paper explores the ways of working of social sciences and humanities (SSH) scholars in order to develop resources that genuinely support their data-intensive research processes. In particular, the study investigates how SSH scholars interact with digital tools and materials, both to determine their information needs and to establish how to better support them. A qualitative analysis of 21 semi-structured interviews with potential end-users of a national RI for DH and CSS that is currently under development revealed three themes: digitizing SSH research, meeting the information needs related to DH/CSS, and supporting DH/CSS research. Based on our findings, we were able to create an understanding of the work task-based requirements of SSH scholars that can be used to inform designs for improved RIs. Suggestions for enhancing the sustainability of resources and services in SSH are also put forth, emphasizing the importance of understanding and accommodating the unique scholarly practices within these fields.

Sendra-Understanding Researchers’ Perspectives on Work Tasks-102.pdf


11:30am - 11:45am

The outcome of course-integrated DH instructions developed in a library-faculty partnership

Lars Kjær

The Royal Danish Library, Denmark

This text presents lessons learned from KUB Datalab's digital humanities (DH) course development and is based on student evaluations and on dialogue between staff at the library and at the Faculty of Humanities.

The courses, which we define as course-integrated instruction, incorporate DH into courses that do not have a DH orientation as a starting point[1].

We find that researchers, in their role as teachers, ask the library to contribute DH modules that can be incorporated into their courses. This paper looks into that trend. Many humanities courses deal with elements that are to some extent attached to the digital, and the teachers see an opportunity to include DH. We also find that students are interested in studying text and image material from online communities. Online subcultures in particular are a popular subject among students, and the teachers wish to meet this interest by letting students work more with digital sources.

It is not only at the University of Copenhagen's Faculty of Humanities that DH is introduced in a few modules of courses where DH is not the core content. At other universities, too, library staff contribute to courses where students are not required to submit and be examined on a DH assignment[2].

Teaching these courses is an exciting challenge, because the intensive framework demands that different types of tools and exemplary methods are considered at the planning stage in order to didacticise DH properly, so that it is relevant for the students in precisely the context of that single module or those few modules. The evaluations and the feedback we receive show that we do not always get it right, and that timing in relation to the students' academic level is important.

When KUB Datalab enters into a collaboration on the development of a DH course, it starts with a dialogue about the framework and the content. The most important aspect of the framework is the scope of the course, because the scope is of decisive importance for the content. KUB Datalab normally offers between one and three modules as course-integrated instruction. In addition to the modules, students have the opportunity to consult the library staff, which is especially useful for those who choose to include digital technologies in their exam papers.

The text also presents the challenges of teaching DH within this intensive framework. In the intensive single-module courses, we often limit the content to no-code software and online platforms (Voyant, Orange, DraCor, etc.), because this gives us an opportunity to illustrate the purpose of DH methods in a limited period of time. The problem, with regard to complexity, is that we can only hint at its scope and depth. This means that methodological issues, technological issues, and digital challenges in relation to using DH within the subject remain a black box. No-code software makes frequent use of linkage, distance, clustering and dimensionality algorithms, which form the cornerstone of data mining[3]. These are central elements in software education programmes, but of course they do not fit into a course that may have an ambition to introduce DH methods without being oriented towards a basic understanding of the digital.

For the courses that consist of two to three modules, we often organize tailored content that illustrates a workflow the students can follow in order to safely reach the goal with a result they can include if they choose to use DH methods in their exam assignment. On the one hand, the pre-defined workflows remove some of the creativity from the part of the assignment concerned with selecting methodology; on the other hand, the students get a chance to focus on other things, such as problem solving and data selection, which constitute a much more important element of a humanities assignment than of an assignment in software education. The workflows give the students a starting point, and with a little extra guidance they will often be able to use them to examine text collections of their own choosing.

As can be seen above, this paper contributes to shedding light on issues related to teaching DH, which is relevant for other humanities faculties and libraries where DH is incorporated into courses not oriented towards DH.



11:45am - 12:00pm

Exploring research networks through citations and references

Lars Johnsen, Ingerid Løyning Dale, Marie Iversdatter Røsok

National Library of Norway, Norway

Standards for citations and referencing were developed in the late 19th century, and revised and changed during the 20th. It should therefore be feasible to automatically construct relations between texts based on the citations between them.

We report on work towards building a registry of citations and references for books and articles. This can be used to analyze particular scientific practices and how texts refer to other texts (e.g. Bergmark and Lagoze, 2000, Sivesind and Hörmann 2022). To achieve this within the context of Norwegian scientific and humanities discourse, we focus our efforts on the digitized texts at the National Library of Norway (NLN), which provides documents that are prepared using an OCR-process, and made available for analysis and feature extraction. However, the methods we develop can be used on any collection, as long as it provides a certain window into the texts, in the form of text fragments like concordances. The methods described will fit repositories like Hathi Trust, as well as local collections.

Our aim is twofold: first, to discuss methods for extracting citations and references from full text, following the work of Chenet (2017), and the requirements of the repository; and second, to consider how one may go about building a database of citations and references that represents the network of documents. Each book is associated with a list of the other books it refers to, together with how they are cited in that particular book.

The method is applied to books and the references they contain, covering both freely available material and texts that are under copyright.

As we distinguish between citations and references, we approach them in two ways. We rely on the API to access text fragments that form the basis for the analysis.

For citations we use the concordance extraction available in the DH-lab API: 1) select concordances that contain years as four-digit numbers, 2) run regular expressions on these to select the citations, if any, within them, and store the concordances with the extracted citation. While this pipeline captures most of the citations, and the citations that are found mostly seem valid, some are still missed. We have used the output of the regular expressions, i.e. keeping the valid citations, to train a machine learning pipeline with spaCy's SpanCategorizer. The pipeline includes SpanFinder, which helps us find more candidate token spans in the text fragments, and the SpanCategorizer scores each candidate with a probability estimate for the target label "CITATION". We only keep the candidates with a score higher than 50%.

According to the APA style, a simple in-text citation should contain the author's surname and the publication year inside parentheses: (Wertsch, 2002). The following regular expression captures this format: "\([A-Z]([a-z])+,\s\d{4}\)". The expression matches parentheses containing one uppercase letter followed by lowercase letters, a comma, a whitespace and four digits. Although similar, this expression does not match (Sture 1437‒1503), because of the missing comma and the extra dash and year, which indicate a lifespan. We make a subset of the good matches and the bad matches, which are subsequently fed to spaCy's machine learning model.
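
To make the regex step concrete, a minimal Python sketch is given below; the concordance strings and the helper name extract_citations are illustrative assumptions, and the capture group is placed around the whole citation for convenience rather than copied verbatim from the authors' code.

```python
import re

# Hedged sketch: APA-style in-text citations such as "(Wertsch, 2002)".
# The pattern follows the description in the text; the authors' actual
# regular expression may differ in details.
CITATION_RE = re.compile(r"\(([A-Z][a-z]+,\s\d{4})\)")

def extract_citations(concordance: str) -> list[str]:
    """Return the citation strings found in one concordance fragment."""
    return CITATION_RE.findall(concordance)

# Invented example concordances, for illustration only:
fragments = [
    "memory is mediated by cultural tools (Wertsch, 2002) and practices",
    "the regent Sten Sture (Sture 1437\u20131503) is not a citation",
]

for frag in fragments:
    # Upstream, only fragments containing a four-digit year are selected.
    if re.search(r"\d{4}", frag):
        print(frag, "->", extract_citations(frag))
```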

Within a text, the reference section contains the list of the documents identified by the citations. Here we will report and evaluate two methods for analyzing and linking citations to actual documents. The first method utilizes fragments (concordances) filtered on author names and year of publication. The second method processes the bibliography as an integrated text segment extracted from the entire document. Given that most texts we process are from the 20th century, we carefully consider copyright concerns. Fragments, small or large, should not infringe upon copyright. It's important to note that bibliographies, as compilations of cited works, are generally regarded as non-infringing under copyright law (except if the bibliography itself is the work). We will shed more light on this issue in the presentation.

 
12:00pm - 1:00pmLUNCH
Location: Háma
1:00pm - 2:30pmSESSION#07: DATABASES & PROJECT LIFECYCLE
Location: H-207 [2nd floor]
Session Chair: Katrine Gasser, Royal Danish Library, Denmark
 
1:00pm - 1:15pm

Migrating Heritage: The Icelandic Immigrant Literacy Database

Katelin Marit Parsons

University of Iceland/Árnastofnun

Immigrant heritage is transnational by its very definition, yet immigrants’ lives often occupy marginal spaces in cultural heritage initiatives designed to collect, preserve and disseminate historical materials. Immigrants may be represented in the digital collections of multiple institutions under very different identities, or they may be wholly absent from the libraries, archives and museums of both their birth and receiving countries—vanishing into the gaps between collection scopes. While digitization has been highly successful in bridging physical distances, this is not the only challenge for researchers, professionals and communities working for the preservation of immigrant heritage.

This paper approaches questions of digital collaboration and sharing of written immigrant heritage across national and regional boundaries through the case study of the Icelandic Immigrant Literacy Database (IILD) at the Árni Magnússon Institute for Icelandic Studies in Reykjavík, which opened on 21 October 2023. The focus of IILD is on Icelandic-language materials that were created and/or owned by Icelandic immigrants in North America. The database is an output of the Fragile Heritage Project (Icelandic: Í fótspor Árna Magnússonar í Vesturheimi, 2015–present), which seeks to locate and document material in collaboration with Canadian and American LAM institutions, community organisations, families and individuals.[1] The data consists of digital images from 2009–2019 and notes on Icelandic-language manuscripts, books, letters and other documents, of which 1,000 items have been catalogued on IILD.

From 1873 to the beginning of the First World War, as many as 20,000 Icelandic men and women emigrated to North America, the majority of whom were destined for Western Canada.[2] Literacy rates were high in Iceland, with personal and household book ownership common by the early 1800s.[3] In recent decades, the literacy practices of non-elite Icelanders in the nineteenth century have attracted considerable attention, in particular the production of manuscripts in domestic settings and the existence of informal scribal communities or networks.[4] Virtually all adult immigrants had achieved some level of reading fluency prior to their departure, but writing skills were not universal: individuals educated before instruction in writing was made mandatory in 1880 were frequently self-taught writers, if they could write at all.[5]

The initial focus of the present author’s research was on digital documentation of older Icelandic literary manuscripts brought to Canada and the United States, with the preliminary objective of studying the quantitative effects of mass emigration on manuscript culture. A key shift came with the discovery of scribal networks within Icelandic settlements in North America that were comparable to those found in Iceland.[6] This finding underlined the need for a more holistic approach to immigrant literacy, both pre- and post-migration, to which IILD is a response.

Although the largest and best-known historical Icelandic immigrant settlement was New Iceland in Manitoba, the project fieldwork confirmed the value of initiating extensive cross-institutional and cross-community collaboration beyond a single region. Outside of LAMs dedicated specifically to the study of Icelandic-North American heritage, language emerged as the most significant barrier to the full interpretation and inclusion of written Icelandic immigrant heritage in digital collections. Given that many LAMs do not have easy access to fluent speakers of immigrant languages represented in their collections, digital collaboration and initiatives such as IILD can be mutually beneficial for researchers and professionals.

[1] Funding for the project was received from the Government of Iceland, Eimskip University Fund, the INL of Iceland, Landsbanki, Eimskip, the Canadian Initiative for Nordic Studies (CINS), the Manitoba Heritage Grants Program, the Icelandic American Society of Minnesota, the Áslaug Hafliðadóttir Memorial Fund and the Icelandic Department of the University of Manitoba. The present author is the project manager and database editor; Trausti Dagsson of the Árni Magnússon Institute for Icelandic Studies designed the database infrastructure.

[2] Vigfús Geirdal, cited in Sigurður Gylfi Magnússon, “Sársaukans land: Vesturheimsferðir og íslensk hugsun,” in Davíð Ólafsson and Sigurður Gylfi Magnússon (eds.), Burt og meir en bæjarleið: Dagbækur og persónuleg skrif Vesturheimsfara á síðari hluta 19. aldar, Sýnisbók íslenskrar alþýðumenningar 5 (Reykjavík: Háskólaútgáfan, 2001), 13–69, at 52.

[3] Sólrún Jensdóttir, “Books owned by ordinary people in Iceland 1750–1830,” Saga-Book 19 (1974–1977): 264–292.

[4] Davíð Ólafsson, “Vernacular Literacy Practices in Nineteenth-Century Icelandic Scribal Culture,” in Ann-Catrine Edlund (ed.), Att läsa och att skriva: Två vågor av vardaligt skriftbruk i Norden 1800–2000 (Umeå: Umeå universitet, 2012), Nordliga studier 3: Vardagligt skriftbruk 1, 65–86.

[5] Loftur Guttormsson, Bernska, ungdómur og uppeldi á einveldisöld: Tilraun til félagslegrar og lýðfræðilegrar greiningar, Ritsafn Safnfræðistofnunar 10 (Reykjavík: Sagnfræðistofnun, 1983); Sigurður Gylfi Magnússon and Davíð Ólafsson, Minor Knowledge and Microhistory: Manuscript Culture in the Nineteenth Century (New York: Routledge, 2017).

[6] Katelin Marit Parsons, “Albert Jóhannesson and the scribes of Hecla Island: Manuscript culture and scribal production in an Icelandic-Canadian settlement,” Gripla 30 (2019): 7–46.



1:15pm - 1:30pm

Setting up a Research Data Repository Based on Invenio RDM: An Experience Report

Herbert Lange

University of Gothenburg, Sweden

Setting up a research data repository is a time-consuming task. Learning from others who have gone a similar way can be very helpful. For that reason we want to share our experience. This article is based on the lessons learned while setting up a repository based on Invenio RDM at the Leibniz Institute for the German Language (IDS) in Mannheim from March 2022 to September 2023. We explain our decisions and the steps taken along the way. Even though our requirements differ from those of other institutions, we are confident that this description can be helpful to others. And although we mostly worked with language data, the basic considerations are just as valid and relevant to all areas within the Digital Humanities and beyond. This article is neither intended to convince you to set up your own repository nor to do the opposite. It should, however, help you to make your own informed decisions.

Lange-Setting up a Research Data Repository Based on Invenio RDM-184.pdf


1:30pm - 1:45pm

From TeX to Network Graph: Creating a Web Platform for an Etymology Dictionary

Trausti Dagsson, Ellert Þór Jóhannsson, Einar Freyr Sigurðsson, Steinþór Steingrímsson, Finnur Ágúst Ingimundarson, Árni Davíð Magnússon

Árni Magnússon Institute for Icelandic Studies, Iceland

We describe a recent project creating a new interactive web platform of the etymological dictionary of Icelandic, Íslensk orðsifjabók (ÍO), by the lexicographer and linguist Ásgeir Blöndal Magnússon, now available at https://ordsifjabok.arnastofnun.is/. This dictionary contains around 42,000 headwords and is a comprehensive resource for exploring the origin of a substantial portion of the vocabulary of Icelandic. The dictionary material was gathered over a period of decades and published in print in 1989. ÍO is the only etymological dictionary focusing on Modern Icelandic.

Like other etymological dictionaries, ÍO provides information about the history of words as well as information about how the meaning and form of words may have changed. It shows how inherited words are derived from roots that can be traced back to an earlier stage of the language or a reconstructed proto-language. Cognates that may be present in various related languages are also listed. In many cases the words do not originate in an earlier language stage but have different historical roots. They can be borrowings from other languages, or neologisms: more recent creations resulting from Icelandic-specific derivational or morphological processes.

Extensive information about various words and different types of origins and relationships between them appears in ÍO. The structure of the information in each entry is not entirely consistent but the same key words and phrases indicating types of relationships occur frequently.

To create a digital version of this dictionary, it was necessary to use the material from the print version. The dictionary was originally produced at the start of the digital age, and the original TeX files have been preserved to this day. The first digital version of the dictionary was made available online on the web portal málið.is in 2016, using the original working files as its base. It enabled a simple lookup of headwords but did not offer any advanced search features or the possibility of browsing the entries alphabetically, as only one entry could be displayed at a time.

We present a new rich online platform for the dictionary where references in the entries are cross-linked using automatic methods to parse the TeX files. The entries are also linked through common referenced words in foreign languages. The platform contains a graph-model component that enables users to explore these relationships through an interactive network visualization. The graph model was created by using the dictionary's parsed entries where referenced words were extracted from the descriptive text in each entry to create links between entries. The data was then imported into the graph database software Neo4j. These links are defined as different types of relationships based on the occurrence of particular words, e.g. sjá (e. see also), sbr. (e. compare) and skylt (e. related to). The graph model also includes references to words in other languages and therefore links entries across the whole dictionary that did not have direct cross-referencing before.
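
As a rough illustration of how such parsed cross-references could be loaded into a graph database, the sketch below uses the official Python driver for Neo4j; the connection details, the Entry node label, the relationship names (SJA, SBR, SKYLT) and the sample word pairs are assumptions made for illustration, not the project's actual schema or data.

```python
from neo4j import GraphDatabase

# Hypothetical parsed cross-references: (headword, relation keyword, target word).
# Keywords correspond to "sjá" (see also), "sbr." (compare) and "skylt" (related to).
parsed_links = [
    ("himinn", "sjá", "heiður"),
    ("orð", "sbr.", "yrða"),
]

RELATION_TYPES = {"sjá": "SJA", "sbr.": "SBR", "skylt": "SKYLT"}

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    for head, keyword, target in parsed_links:
        rel = RELATION_TYPES[keyword]
        # MERGE avoids creating duplicate nodes or relationships on re-import.
        session.run(
            f"MERGE (a:Entry {{headword: $head}}) "
            f"MERGE (b:Entry {{headword: $target}}) "
            f"MERGE (a)-[:{rel}]->(b)",
            head=head, target=target,
        )

driver.close()
```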

The new online platform of ÍO offers diverse ways of working with the dictionary and visualizing the data. It makes it easier for users to move between words and get an overview of related words.

ÍO offers insights into the historical development of Icelandic by examining the origins of words and the cultural and historical context in which Icelandic evolved. It highlights how the use of words has changed over time and what cognate words exist in related languages.

Overall, the online platform of ÍO serves as a valuable tool for anyone interested in exploring the history and evolution of Icelandic words and the cultural connections between them as well as a novel way of publishing and interacting with etymological dictionaries.



1:45pm - 2:15pm

Exploring the Life Cycle Assessment (LCA) to increase the sustainability of DH projects

Andrea Alessandro Gasparini1, Tom Gheldof2

1Department of Informatics, University of Oslo, Norway; 2Research Unit of Ancient History, KU Leuven, Belgium

Introduction

The growing number of Digital Humanities (DH) services, their heterogeneity (Ulutas Aydogan et al., 2021), and the dislocated locations where the data, applications, and metadata are stored require new approaches for humanities scholars to understand all the crucial aspects of a system. This paper argues for a novel understanding of the strong connection between the sustainability of DH projects and their paths using the Life Cycle Assessment (LCA) method. How can researchers monitor and address all the different phases from the start of a DH project until it becomes sustainable? What kind of framework is necessary to understand all the aspects? This paper presents the potential of LCA as a framework for disassembling such paths.

Life Cycle Assessment

LCA is defined as a framework for assessing how sustainable human production, such as technology or services, is, and for highlighting possible issues. It is used to define all the different steps of a life cycle, from the design, development, acquisition, production, and use phases to when it becomes waste (Finnveden et al., 2009). A general and important definition is found in Brekke et al. (2019), where LCA is described as “an umbrella term for several methods needed to translate both the uses of natural resources and the emissions and waste to the environment…”.

In addition, LCA is also used as a governance tool to support the transition to more sustainable results (Brekke et al., 2019), as well as to look into sustainability in ICT projects (Pendergrass et al., 2019). This paper argues for the use of this framework to understand more holistically, through the lens of sustainability, how DH projects may be developed and governed better. Especially in the last phase, the transition from a controlled and intensive project period to a drift period is always subject to issues (Ciborra, 2000). A few other examples of the connection between an LCA analysis and a DH project are:

  1. Competence: often dispersed with mobility of the staff or poor description of the development process

  2. Data: not all data kept in a developed DH project is planned to be re-used. Creating and analysing data is also a form of energy consumption. Where the data is stored also has a measurable impact

Applying LCA to a DH project can include the following steps:

  1. Design of the system

  2. Production: Implementation of the system on various servers

  3. Competence creation: if lost, what is the impact on sustainability?

  4. Use: the use can be understood as the total ecological footprint of a DH project (based on the same calculation of CO2 emissions, water and energy use for servers running)

  5. Reuse: since not all data inside a DH project is open and easy to be reused, what is the impact on sustainability?

  6. End-of-life

This last stage is often underestimated. Questions regarding where to move the data and ensure that the data and metadata are not lost often coincide. Finally, the manner in which knowledge survives in the next phase determines the path ahead.

The LCA method also responds well to the selected Sustainable Development Goals (SDGs). The United Nations has outlined 17 SDGs that address various issues, needs, and perspectives related to poverty, gender, education, health, energy, consumption, production, peace, and climate change. Among these, problems associated with climate change and energy consumption have direct implications for DH and are well addressed by LCA.

LCA and sustainability

During the various phases of the LCA for DH projects, the value of using the LCA framework emerges. The transition from a project period to a drift period is crucial. In this space, one needs to plan for extending the lifetime of the service, allowing the resource to acquire values that support sustainability. For instance, easy access for the right users, and how data are used and made available from a longer-term perspective, can support the evolution of the service over the life cycle of the project. Those governing the project are responsible for making the correct choices. We often see examples of how valuable services, worthy of remaining valuable resources for researchers for decades, “die” over time for lack of funding, leadership, and technological understanding. In addition, the core of such projects, the knowledge, may be lost if it is not addressed properly. The reuse of data is also often addressed poorly. By using the LCA as a tool to define and analyse all aspects of the life cycle of a DH project, it may be possible to avoid negative outcomes. Usually, an LCA is done while a system is operational. Nevertheless, this paper argues for using the framework several times during development and also just before the launch of a project as a stable service.

ENCODE

This paper analyses the ENCODE project (ENCODE, n.d.) and affiliated (dislocated) resources as examples of a DH project to which the LCA framework can be applied. In the framework of ENCODE, a three-year (2020-2023) Erasmus+ Strategic partnership for higher education, partners from 6 European universities (Alma Mater Studiorum Università di Bologna, Julius Maximilian Universität Würzburg, KU Leuven, Università degli studi di Parma, Universität Hamburg, and Universitetet i Oslo) aimed at enhancing the digital competences of students and researchers who are studying ancient writing cultures. This resulted in the creation of several outputs, such as an open online course on the #dariahTeach platform and a platform for collecting guidelines on the use of existing digital resources and tools in this (broad) field (Salvaterra et al., 2023).

Maintaining (DH) services in a sustainable manner requires a combination of strategies, including institutional support, collaboration agreements, keeping costs manageable, keeping user communities engaged, and keeping (or at least exporting) data in forms that can survive a loss or transition of support. It is almost equally important to involve and provide long-term support for technical specialists, obtain institutional commitments to ensure this support, and intentionally engage partner projects and share data with them to ensure the long-term preservation of these projects’ outputs. Some of the affiliated DH services that partnered with the ENCODE project (all making use of Linked Open Data to adhere to the FAIR principles), Papyri.info, Trismegistos and Epigraphic Database Heidelberg (EDH), especially struggled with the last phases in the LCA, the “Use”, “Reuse” and “End-of-life” phase.

All these projects obtained longer-term funding in their initial stage, resulting in the creation of several services for the study of Ancient World Data, such as providing URIs, searchable databases, RDF and XML data exports, et cetera. However, after their initial funding, different solutions were applied to sustain the uptake of these services and/or their data, ranging from a large-scale endowment and crowdfunding rounds (Papyri.info) to the implementation of a subscription model (Trismegistos). Other projects, such as EDH, make their data available via Linked Open Data methods and data dumps (Cayless, 2019).

Finally, for a short-term funded project such as ENCODE, other possibilities exist, such as applying to the European Blended Intensive Programmes (BIP) scheme. This is intended for EU-funded educational initiatives that bring together students, educators, and researchers from across Europe for intensive training in DH. As a result, participation in BIPs can extend the life cycle of DH projects by facilitating knowledge transfer and dissemination, building capacity among students and educators, fostering networking and collaboration, and supporting sustainability planning for long-term viability, thus hopefully preventing a negative outcome of the LCA “End-of-life” phase.

Findings

The findings show how the LCA can be used during different phases of a DH project: for instance during the design process, just before the transition to a drift situation, or when the end-of-life is approaching. The discussions toward the end of the project period have focused on how to support a long-standing service for scholars. As mentioned, this Erasmus+ project is interdisciplinary and international, and its last stage culminated in launching the various platforms across Europe and writing the very demanding final report. Usually, after this effort, a project group focuses on other tasks. Thus far, this has been the de facto situation for the authors. Projects are often based on the willingness of the participants to use their work and free time to bring a DH project to fruition. The LCA framework may address some of the aforementioned issues, especially unforeseen side-effects and the phases where the focus is usually weak during a DH project.

Discussion

DH projects are interdisciplinary, and often international, because experts are part of a small research milieu and are not easy to enrol; projects must include them part-time. This context creates a vulnerable situation, and we have countless examples of projects ending in an improper manner. When proposing a framework for guidance and governance, different roles emerge: users, researchers, PIs, funding organisations, universities, and libraries are all important stakeholders. The LCA helps a project focus on the vulnerable stages, for instance funding and personal resources such as experts, since these are underestimated, especially in the transition period from a project to a drift situation, and have quite an important impact and implication when the end-of-life is approaching. If the vulnerable stages are not properly addressed, the lack of resources will at a later stage result in disrupted services, dissatisfied scholars and, in the end, even more difficulties in securing funding.

Conclusion

This paper emphasises the critical need for sustainable approaches in Digital Humanities projects, particularly in understanding and addressing the various phases of a project's life cycle. The use of Life Cycle Assessment is a novel and valuable way of looking into hidden aspects of DH projects. By advocating for the application of the LCA framework, the paper highlights how this method offers a holistic perspective on sustainability, encompassing design, production, use, and end-of-life considerations. Through the analysis of the ENCODE project and affiliated resources, the article underscores the challenges faced during the transition from project implementation to long-term sustainability.

Gasparini-Exploring the Life Cycle Assessment-234.pdf
 
1:00pm - 2:30pmSESSION#08: NEWSPAPER CORPORA & HISTORICAL DOCUMENTS
Location: H-205 [2nd floor]
Session Chair: Jon Carlstedt Tønnessen, National Library of Norway, Norway
 
1:00pm - 1:30pm

Analysis of Textual Complexity in Danish News Articles on Climate Change

Florian Meier, Mikkel Fugl Eskjær

Aalborg University, Denmark

Structural linguistic features are often overlooked yet potentially important aspects of journalistic practice. Especially in news reporting on climate change, these features can play a crucial role as the proper use of language is tied to message credibility, processing fluency and knowledge retention, which can positively influence the reader to take more climate action.
This article analyzes language use in Danish news articles on climate change using a sample of around 32,000 articles from four different outlet types (quality news, niche papers, tabloids, and public service broadcasters) published from 1990 to 2021. We create a machine-learning model of text complexity covering this concept's semantic and syntactic dimensions.
Our findings confirm expected differences in complexity between news outlets, highlighting tabloid articles as engaging with higher semantic complexity, while quality papers and niche papers exhibit higher syntactic complexity. We observe a significant decrease in semantic complexity and a slight increase in syntactic complexity over time, a trend towards more generic language, and an increased use of pronouns, verbs, and adverbs. Most of these changes can be attributed to the emergence of articles by public service broadcasters. Articles by public service broadcasters are characterised by high syntactic complexity, which we consider problematic due to their popularity among the general public.

Meier-Analysis of Textual Complexity in Danish News Articles-130.pdf


1:30pm - 1:45pm

Quantifying Western Discourses about Technology. An Analysis of Press Coverage using a Dataset of Multilingual Newspapers (1999-2018)

Elena Fernández Fernández1, Germans Savcisens2

1University of Zurich; 2Technical University of Denmark (DTU)

In this article we explore the ontological properties of technology in contemporary societies using quantitative methods and a dataset of multilingual newspapers in English, French, Spanish, Italian, and German as a proxy. Our observation period covers twenty years (1999-2018). We filter documents using four technological key terms: nuclear, oil, internet, and automation. Our methodology is formed by a five-step pipeline that includes Topic Modelling (Pachinko Allocation), word embeddings, Ward’s hierarchical cluster analysis, network analysis, and sentiment analysis. We divide our analysis into two different categories: information stability (which we define as low levels of semantic diversity in our data outcomes over time at the key-term level) and information homogeneity (which we understand as low levels of semantic diversity in our data outcomes across our selection of key terms). We seek to observe to what extent our selection of technologically related terms permeates press discourses similarly or differently, and whether those discourses fall into semantic categories that could be considered essential defining elements of the fabric of contemporary societies (such as finance, education, or politics). Results show, firstly, a consistent overlap of content across newspapers’ semantic fields that could be considered pillars of society. Secondly, we notice a progressive simplification of information historically, reflecting less polarizing views across countries and, therefore, demonstrating an increasing agreement on technologically related discourses in present times. We interpret these results as an indicator of the rising intrusion of technology into the essence of Western, industrialized countries.
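
As a minimal sketch of one step in a pipeline of this kind, Ward's hierarchical clustering can be run over document vectors with SciPy; the toy vectors and the number of clusters below are arbitrary assumptions, not the authors' data or settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy document embeddings (rows = documents); in a study like this they would
# come from the word-embedding step of the pipeline.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(8, 50))

# Ward's method merges the pair of clusters that minimises within-cluster variance.
Z = linkage(doc_vectors, method="ward")

# Cut the dendrogram into a fixed number of clusters (3 is chosen arbitrarily here).
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```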

Fernández Fernández-Quantifying Western Discourses about Technology An Analysis-103.pdf


1:45pm - 2:15pm

Using document similarity in historical patents to study value in innovation

Matti La Mela1, Jonas Frankemölle2, Fredrik Tell3

1Department of ALM, Uppsala University; 2Centre for Digital Humanities Uppsala, Department of ALM, Uppsala University; 3Department of Business Studies, Uppsala University

Patents are temporary exclusive rights granted to inventors to protect their commercial interests in the new technologies they have invented. To receive this exclusive right to sell, use, and licence an invention, the inventor needs to disclose to the public how the patented invention works. In other words, published patents include a description of the invention, a claim about what is novel about the invention, and often also drawings to support the description. Consequently, patents offer rich information about technological change spanning almost two centuries: when and by whom have new tools, machines, or industrial processes been invented? A major challenge remains, however: to understand which of the inventions among the thousands of recorded patents have actually been important and valuable.

This paper investigates the problem of how to estimate the importance of a patent. The question of assessing value is particularly relevant in historical contexts, where relevant metadata used for valuation, such as patent citations, are not available. In the paper, we establish a method to build ”citations” ex post between historical patents by looking at the language employed in the patent documents themselves. We apply the method of tf-idf (term frequency-inverse document frequency) to our dataset of historical Swedish patents (1890-1929), and investigate the importance of patents along two dimensions: first, the novelty of the vocabulary used in the patent description, and second, the impact of this new vocabulary on subsequent patents. The idea is that an important patent uses new terms when it introduces a new technology, and these terms are then further discussed and elaborated by others in their subsequent patents.

In previous work, the question of assessing value in historical patents has been approached from several perspectives. The most common measure of value has been yearly patent fee payments, that is, the number of times the patentee decided to renew the patent to keep it valid (e.g. MacLeod et al. 2003). One approach that captures the societal relevance of a patent relative to other patents has been to build indexes from contemporary sources (see Nuvolari, Tartari, and Tranchero 2021). Finally, post-WW2 patent citations have been used as indirect measures, as they can refer to historical patents too, and thus show their persisting relevance for modern technologies (Moser and Nicholas 2004; Nicholas 2011).

Examining similarities and differences between texts and their elements is a common task in natural language processing. Related to our approach, previous work has studied the innovative use of language and language patterns in various contexts, such as measuring novelty and resonance in political speech (Barron et al. 2018) or in news reporting (Nielbo et al. 2023; Gunda 2018). In the context of patents, there is a large strand of research employing contemporary patent documents granted since the 1980s, both as more traditional text data (Balsmeier et al. 2018) and through word embeddings and trained language models (Hain et al. 2022). As Hain et al. (2022) note about studying patent similarity, text-based approaches have been dominant, and only recently have vector-based solutions been applied to similarity analyses. Regarding historical patents in the U.S., the landmark is the paper by Kelly et al. (2021), which presents a measure for studying similarity between historical patents. They build on tf-idf as a measure and examine backward and forward similarity in a dataset of U.S. patents granted between 1840 and 2010 to build a score for the importance of a patent.

This paper builds on the work of Kelly et al. (2021). The novelty of our paper is that we improve their approach by including normalization to control for the variation in the number of yearly patents, and by using edit distance to tackle OCR challenges inherent to historical texts. Moreover, we apply the method in a different language area and historical-institutional context (for Sweden, see e.g. Andersson and Tell 2019). The paper lays the ground for further study of the similarity of historical patents at a semantic level using embeddings and pre-trained language models.

The source data we use in the paper are scanned and OCR-processed patents granted between 1885 and 1945 that have been digitised and published in the Swedish Historical Patent Database (1746-1945) project (https://svenskahistoriskapatent.se/EN/). Besides the OCR-read patents, which include the description, claim, and patent drawings, the Swedish Historical Patent Database includes information about each patent and individual patentee (e.g. Andersson and La Mela 2020). In the paper, we study the importance of patents between 1890 and 1929: this allows us to have five years of data for calculating backward similarity and ten years for studying forward similarity.

Figure 1 displays the number of patents granted per year in Sweden between 1885 and 1945 (by application date). We see that the number of patents per year grows throughout the period; in contrast to Kelly et al. (2021), we have decided to normalize our measures of novelty and impact, whereas theirs are calculated as a sum of the individual values. As Figure 2 shows, the average length of the patent documents stays rather stable over time. The structure of the patent documents remains rather similar over the period, containing first a description of the patent and then the patent claim. The patents are typewritten throughout the period, and the OCR quality has been estimated by manual reading to be adequate for the method.

(FIG 1)

(FIG 2)

The patent data is preprocessed by removing numbers, special characters and single letters; spaCy’s Swedish lemmatizer (Honnibal et al. 2020) is applied to all tokens, and Swedish stop words are removed using the NLTK library (Bird, Klein, and Loper 2009). To remove errors stemming from the OCR process, similar words are identified by calculating the Levenshtein distance between them; each of these similar words is then mapped to a single canonical form. We also filter out low-frequency words as another way to deal with potential OCR errors and to eliminate words with little influence.
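
A minimal sketch of this kind of preprocessing, assuming a standard Swedish spaCy model and the NLTK Swedish stop-word list, might look as follows; the exact thresholds, model choice and variant-grouping strategy used by the authors are not shown here.

```python
import re
import spacy
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

nlp = spacy.load("sv_core_news_sm")            # a Swedish spaCy pipeline (assumed)
swedish_stops = set(stopwords.words("swedish"))

def preprocess(text: str) -> list[str]:
    """Strip noise, lemmatise and remove Swedish stop words from one patent text."""
    text = re.sub(r"[^a-zA-ZåäöÅÄÖ\s]", " ", text)  # drop numbers and special characters
    tokens = [tok.lemma_.lower() for tok in nlp(text)]
    return [t for t in tokens if len(t) > 1 and t not in swedish_stops]

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, used to group OCR variants."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]
```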

To measure the similarity between given patents 𝑖 and 𝑗, Kelly et al. (2021) propose the tf-bidf score, a modified version of tf-idf. The bidf score measures the importance of a term in a patent based on the patents filed prior to the patent in question. As a result, terms in influential patents receive a more appropriate assessment. For a given patent 𝑝 and term 𝑤, the tf-bidf score is calculated as follows:

(EQ1)

The term frequency (tf ) is defined as:

(EQ2)

where 𝑐 counts the number of occurrences of a term in a patent and bidf is defined as

(EQ3)

To calculate the similarity 𝑝𝑖,𝑗 between patent 𝑖 and 𝑗, two vectors 𝑉𝑖 and 𝑉𝑗 with the size of the union of terms in patents 𝑖 and 𝑗 are created. These vectors store the calculated tf-bidf scores for each term 𝑤 in 𝑖 and 𝑗 respectively. The vectors are normalized and the cosine similarity between the two vectors is calculated, which is the measure for patent similarity.
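
Taking the tf-bidf scores as given (the paper's equations above define how they are computed), the similarity computation described here can be sketched as follows; the function name and the toy score dictionaries are illustrative assumptions.

```python
import math

def patent_similarity(tfbidf_i: dict[str, float], tfbidf_j: dict[str, float]) -> float:
    """Cosine similarity of two patents' tf-bidf vectors over the union of their terms."""
    vocab = sorted(set(tfbidf_i) | set(tfbidf_j))
    v_i = [tfbidf_i.get(w, 0.0) for w in vocab]
    v_j = [tfbidf_j.get(w, 0.0) for w in vocab]
    norm_i = math.sqrt(sum(x * x for x in v_i))
    norm_j = math.sqrt(sum(x * x for x in v_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return sum(a * b for a, b in zip(v_i, v_j)) / (norm_i * norm_j)

# Toy example with made-up tf-bidf scores:
print(patent_similarity({"kullager": 0.4, "axel": 0.1}, {"kullager": 0.3, "motor": 0.2}))
```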

Novel patents should be distinct from prior patents, as they propose new, disruptive technology. This means that the similarity of a novel patent to prior patents should be low. We calculate the backward similarity (BS) as a measure of novelty for a given patent. We use the application date of the patent as the threshold date. The BS is defined as the average similarity between a given patent 𝑝 and all patents filed 5 years prior to 𝑝. By averaging rather than summing, we account for the differing numbers of patents within these 5-year windows across patents. This enables us to normalize novelty scores over time.

(EQ4)

ℬ𝑗,𝜏 denotes the set of patents filed 𝜏 = 5 years before patent 𝑗. Thus, a novel patent is characterized by a low BS score, as its similarity to prior patents is low. Impactful patents are those that influence future patents, which results in high similarities between those. We define the forward similarity (FS), which is the average similarity between a patent 𝑝 and all patents filed 10 years after 𝑝, as a measure for the impact of 𝑝. Again, we use the application date of the patent as the threshold date.

(EQ5)

ℱ𝑗,𝜏 denotes the set of patents filed 𝜏 = 10 years after patent 𝑗. An impactful patent is characterized by a high FS score, as its similarity to future patents is high.

An important patent is both novel and impactful. This can be described as the ratio of patent impact (FS) to novelty (BS).

(EQ6)

The importance value 𝑞 for a patent 𝑗 will be high if its FS score outweighs its BS score.
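
A compact sketch of the backward similarity, forward similarity and importance ratio as described above is given below; the function names are assumptions, and patent_similarity stands for the cosine measure sketched earlier.

```python
from statistics import mean

def backward_similarity(p, prior_patents, similarity):
    """Average similarity of patent p to all patents filed in the 5 years before it."""
    return mean(similarity(p, q) for q in prior_patents) if prior_patents else 0.0

def forward_similarity(p, later_patents, similarity):
    """Average similarity of patent p to all patents filed in the 10 years after it."""
    return mean(similarity(p, q) for q in later_patents) if later_patents else 0.0

def importance(p, prior_patents, later_patents, similarity):
    """High when forward similarity (impact) outweighs backward similarity (novelty)."""
    bs = backward_similarity(p, prior_patents, similarity)
    fs = forward_similarity(p, later_patents, similarity)
    return fs / bs if bs > 0 else float("inf")
```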

Similar to Kelly et al. (2021), we define a ”breakthrough” patent as a patent that ranks within the top 10 percent of patents based on its importance score. This definition identifies patents that made a significant contribution to technological progress. Figure 3 presents the share of breakthrough patents among total patents per year. The graph shows that the identified breakthrough patents are rather evenly distributed across the period. There is a slight downward trend in breakthrough patents over time. This could relate to the growth in the absolute number of patents and the slight increase in the average length of the patents, which decreases novelty due to a broader vocabulary in the patents. We see a peak during the war years, which could indicate that specific kinds of technologies were patented during those years. We then manually study the top breakthrough patents for the peak years and search for known important inventions among them. We find some, such as the refrigerator by Carl Munters and Baltzar von Platen and the inventions concerning ball bearings by Sven Wingquist. Finally, we contrast the number of yearly renewals of the breakthrough patents with the yearly payments of all patents.

To conclude, we discuss some limitations of the method and how to further validate the findings. First, is the method capturing relevant technical terms rather than merely tracing language change? To answer this, we refined the data-processing steps iteratively. The thresholds for filtering out low- and high-frequency words, and for setting a minimum document range (the number of patents in which a word needs to appear), were chosen by examining the ”lost” vocabulary at different levels. We find that the filtering enabled us to narrow the studied vocabulary to a robust central ground of relevant terms concerning the technologies themselves. Some challenges remain, however, that can be relevant in the Swedish case. It could be investigated whether compound words should be split and processed as separate stems, in order to examine and group together certain technological vocabularies. Moreover, like Kelly et al. (2021), we use the complete patent texts in our study. It could be investigated whether the right unit of analysis would be the beginning of each patent rather than the complete document. Second, to what extent can innovation be conceptualized in terms of novel technical vocabulary? It should be discussed more thoroughly whether the amount of change in patent vocabulary for new innovations differs between technology sectors. Our proxy for breakthrough patents relates strongly to the appearance of new vocabulary, which might also indicate changes in how an existing technical area is viewed or conceptualised.

La Mela-Using document similarity in historical patents to study value-223.pdf


2:15pm - 2:45pm

Between the Arduous and the Automatic: A Comparative Approach to Transparency in the Classification of Book Reviews in Swedish Newspapers

Daniel Brodén1, Jonas Ingvarsson1, Aram Karimi1, Lina Samuelsson2, Niklas Zechner1

1University of Gothenburg, Sweden; 2Mälardalen University, Sweden

This presentation focuses on the task of identifying a particular genre of texts in newspaper collections. When performed manually the task is arduous, and when done computationally it is challenging for reasons related to both computation and digitisation. To make sense of the disconnect between the hyperbole surrounding text mining and the nitty-gritty details of such a task, one element of our project The New Order of Criticism: A Mixed-Methods Study of 150 Years of Book Reviews (2020–2024, Ingvarsson et al. 2022) is to train algorithms for the classification of book reviews in the newspaper collection of the National Library of Sweden (Kungliga Biblioteket, KB), combining expertise from language technology and comparative literature. Answering the growing demand for transparency (see Bode 2018), this paper contributes to the discussion of text mining book reviews for tracing the historical discourse of literary criticism (Underwood 2019; Brottrager et al. 2022) by highlighting the methodological and contextual complexities of classifying reviews in large-scale text corpora, taking document context and OCR issues into account.

Specifically, our aim is to comparatively discuss different approaches to identifying book reviews in KB’s newspaper collection, focusing on material from the year 1906 and using the platforms of KB’s digital lab, KBLab. We present different approaches, comparing a traditional manual approach (primarily drawing upon press biographies), with different computational methods in the form of frequency-based statistical classification methods and transformer-based language models.

Our study addresses two research questions:

  • What are the affordances and limitations of these different approaches for identifying literary book reviews from a specific year in KB’s collection?

  • How can a comparative methodological approach enrich the understanding of automated classification of literary criticism, emphasising transparency?

These questions raise wider methodological issues, including the need for articulation of the complex relationships between documentary record, large-scale digitisation and historical analysis and the state of the practice in text mining KB’s newspaper data.

Context-sensitive approach

We begin by framing our study in the light of the discussion about the need for context-sensitive approaches to text mining, where commentators have underscored the criticality of engaging with the original context of archival data to produce robust and nuanced analyses. Katherine Bode (2018: 5) argues against the tendency to rely on data models that inadequately represent the ways in which the original texts generated meaning in the past, and Jo Guldi (2023: 48) warns about naïve assumptions about the relationship between data and the documentary record. Regarding newspaper collections, there is a dire need to pay attention to the forms and genres of newspaper text (Gooding 2017: xiii). Unless one is text mining for high-altitude patterns, a conception of newspaper text that conflates a range of text types, including news, editorials, reviews, adverts, and TV listings, into the single category of ‘newspaper data’ is inadequate.

This paper touches upon the specific complexities surrounding the identification of literary book reviews when it comes to both the genre as such and the manual work involved, drawing upon a pragmatic definition of the genre (having as its core to inform about and assess a newly published literary work) and project team member Lina Samuelsson’s previous study (2013) of the discourse of literary reviews in Swedish newspapers.

In the early 20th century, large Swedish newspapers did not publish book reviews according to a predictable pattern (if they published them at all) (Forser 2002). Given these uncertainties, Samuelsson originally used a press bibliography (Våge 2015), previously available as a rudimentary database, that catalogues major newspapers and information on different article types, to identify book reviews from the year 1906. The database was, however, not organised by year, so Samuelsson used the information to build a simple database by searching for the year 1906 among the entries categorised in the bibliography as book reviews. In total, approximately 3,000 literary reviews from the whole year were found, but since this material was deemed too large for close reading, the focus was narrowed to two months. These reviews (197 in total) were identified on microfilm copies, transcribed and organised in a spreadsheet with metadata, including the title of the book reviewed, date of publication, literary genre, and name and gender of author and reviewer (when available).

However, the material was found to be quite heterogeneous and sometimes stretching the limits of what we today perceive as a review. Notably, while the press bibliography was helpful as a shortcut, some reviews were not listed, and it covered only 17 newspapers out of approximately 200, some of them only partially (Tollin 1967). Furthermore, in 1906, articles were often published over multiple columns and different pages, which could make finding the full texts in the dense newspaper layouts time-consuming, even with a bibliography at hand.

Dirty newspaper data and annotation

Turning to the task of automatic classification of book reviews based on KB’s newspaper collection, the suboptimal digitization and OCR constitutes a significant problem (Jarlbrink et al. 2016: 30–31). Basically, the newspaper pages displayed on KBLab’s platforms consist of a “mishmash of text blocks: we no longer know which blocks belong together to form part of the same article, which articles comprise part of the same section, nor which texts are editorial content rather than adverts” (Börjesson et al., 2023: 5). Though KBLab is working on proper article segmentation, data cleaning and curation remain crucial issues (Sikora and Haffenden, 2024).

Thus, the paper discusses the details of OCR errors in KB’s data and their effects on different methods. First, there are noise characters, some of which might stem from misreading graphical elements. Second, at least two newspapers from the year 1906, Aftonbladet and Svenska Dagbladet, are almost entirely missing punctuation, making some types of analysis impossible and hampering others (the effects would be worse for any researcher unaware of the problem). Third, there are a large number of missing spaces, at least some of which seem to be caused by mishandling line breaks. Since most computational methods are based in some way on words, and each missing space makes two words unrecognisable, this is likely to have an effect on the efficacy of classification.

In the paper, we present our annotation to acquire useful data for our classification experiments and an evaluation of the annotation. Although the 197 book reviews originally transcribed by Samuelsson constitute a substantial bulk of text, for the machine learning processes we needed more text data and for this task we enlisted three annotators, who annotated reviews in 8 newspapers from 1906. We used a preliminary algorithm to find the 500 pages most likely to contain reviews in each newspaper and annotators 1 and 2 were then given 4 newspapers each. Annotator 3 was given the same 500 pages of one newspaper as Annotator 1 as well as all pages of another newspaper of which Annotator 2 had annotated 500 pages. This way, we were able to assess both inter-annotator agreement and the efficacy of preliminary algorithms. For each review found, the annotators marked the corresponding text blocks (a way of navigating the effects of the OCR) and noted if the reviewed material was fiction, non-fiction, or something else. For fiction, they also noted the title, author and reviewer, providing us with metadata in line with Samuelsson’s original study.

Classification – methods and experiments

There is a wide variety of methods for automatic text classification (Zechner 2017). Some of these can be labelled as numerical feature list (NFL) methods, which consist in choosing a set of numerical features of the text (word frequencies, character frequencies, sentence lengths, etc.) and then choosing a statistical method for comparing unclassified text samples to texts where the class is known. Using a larger number of features, such as the frequencies of sequences of words, can improve the accuracy, but make the computation slower. In the last decade, a technique which has gained popularity is using word embeddings (Mikolov et al. 2013) to compress this type of data into a smaller set of numbers. This makes analyses based on longer word sequences feasible, but has the downside of being less transparent, as the numbers no longer represent an overt feature of the text. Recently, techniques have emerged which use large amounts of unrelated text to improve this feature compression, creating large language models (LLM), with a prominent LLM method being BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018). These methods have led to great improvements in many areas of natural language processing, but have the downside of requiring a very large one-time computation to create the model. The large training text may also contain hidden bias, which due to the less transparent nature of the method may be difficult to avoid or detect.

In the case of the LLMs, we adopt a methodology involving two primary approaches: fine-tuning the KB-BERT model (Malmsten et al., 2020) and utilising Word2Vec embeddings (https://fasttext.cc/docs/en/crawl-vectors.html) with different types of classifiers. For fine-tuning KB-BERT, we leverage the annotated review data to adapt the pre-trained model to the review classification task. Simultaneously, we apply Word2Vec embeddings to represent text documents and feed them into traditional classifiers, namely a Support Vector Classifier (SVC), a Naive Bayes Classifier (NBC), and Linear Regression (LR). Furthermore, we incorporate TF-IDF vectorization to examine its impact on the classification performance of these models.
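
As an illustration of the traditional baselines, a minimal scikit-learn pipeline combining TF-IDF vectorization with the three classifier types might look as follows. The texts and labels are placeholders, the feature settings are not the project's actual configuration, and the LR baseline is implemented here as scikit-learn's LogisticRegression, the usual choice for a classification task.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder data: in the project these would be OCR'd text blocks and labels
# from the manual annotation (1 = book review, 0 = other newspaper text).
texts = [
    "en ny roman av forfattaren anmales i dagens nummer",
    "boken ar valskriven och stilen ar njutbar",
    "forfattarens senaste verk imponerar pa anmalaren",
    "romanen skildrar livet pa landsbygden med varme",
    "annons billiga tyger till salu i butiken vid torget",
    "telegram fran utlandet om politiska handelser",
    "vader och vind storm vantas over ostersjon",
    "taxa for annonser och prenumerationspriser",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

for name, clf in [("SVC", LinearSVC()), ("NBC", MultinomialNB()),
                  ("LR", LogisticRegression(max_iter=1000))]:
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```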

The training phase involved fine-tuning the model’s parameters using an annotated dataset exclusively comprising review texts. We evaluated the model’s performance using standard metrics, notably accuracy. In order to establish a benchmark, we also assessed the traditional models (SVC, NBC, and LR) using the same dataset. Our experimental outcomes demonstrate the superior performance of the fine-tuned KB-BERT, which achieves 93% accuracy in review classification. This advantage can be attributed to KB-BERT’s adeptness at capturing the intricate semantic nuances and contextual cues inherent in review texts, and this robust performance underscores KB-BERT’s potential as a tool for review classification tasks. By comparison, when employing Word2Vec for text representation, the accuracy results for the traditional models are as follows: SVC, 74%; NBC, 83%; and LR, 84%.

On top of choosing different methods, we also discuss variations in the experimental setup, such as which data is used for training and testing. One example is testing whether accuracy falls when the training and test data come from different publications. Results can also be presented in the form of rankings or similarity distributions instead of classification proper. The comparative analysis consists of evaluating the accuracy of each method by manually checking the results, and will include an evaluation of the type, and amount, of preprocessing (i.e. annotation and OCR cleaning) needed to perform the computational analyses. Just as we consider the accuracies and shortcomings of the computational methods, we also view the human annotation as an imperfect method. Thus, we can compare the annotations of different annotators to get an idea of their accuracy. We can also manually analyse the texts chosen by the computational methods as likely reviews, to see if there are reviews that the human annotators did not find, or other texts which a human might also consider similar to a review.

Conclusions and Summary

We conclude by drawing together our lines of inquiry, highlighting affordances and limitations provided by the manual and different computational approaches concerning the task of identifying a particular genre of texts in a newspaper collection. We discuss results of the different computational methods, also taking the transparency of the classification processes into account. Our discussion feeds both into the general methodological understanding of text mining literary criticism and the ongoing discussion about productive analytical approaches to KB’s newspaper data.

Brodén-Between the Arduous and the Automatic-142.pdf
 
1:00pm - 2:30pmSESSION#09: LITERARY & FOLKLORE STUDIES: POETRY
Location: K-205 [2nd floor]
Session Chair: Sofiya Zahova, University of Iceland, Iceland
 
1:00pm - 1:30pm

Quantifying Feelings: Data on Emotive Words in Saga Poetry

Brynja Þorgeirsdóttir

University of Iceland, Iceland

The medieval sagas of Icelanders, celebrated for their emotionally restrained narrative style, merge prose with verse in a genre-defining prosimetrum. Traditional analysis has largely focused on the prose for emotional insights, while acknowledging that the poetry in the sagas conveys feelings more openly. This paper presents findings concerning the use of emotive language in the poetry from “The Íslendingasögur as Prosimetrum” (ÍSP) database, which contains a comprehensive collection of quantitative data on the features of all 722 stanzas in the corpus of the sagas. The database facilitates diverse analytical explorations and assessments of correlations, allowing for a comprehensive quantitative scrutiny of established assumptions within the entire corpus of sagas. Consequently, it emerges as a robust and influential instrument for scholarly inquiry. The present study finds that internalized self-expression through first-person emotive words is relatively rare in saga poetry, representing less than a fifth of instances where emotion words are used. More frequently, self-expression is externalized, with speakers reflecting on their emotions as if they were external forces. The most prevalent usage of emotion words is in the description of others’ emotions, rather than as direct expressions of the poet’s own inner state. This suggests that the “inner world” of the poet, as scholars describe it, is most often not explicitly revealed but is instead implied through these observations. Furthermore, emotive stanzas in sagas often function as performative narrative elements, complementing or contrasting with direct emotional expressions. Analysing their roles may provide deeper understanding beyond merely reflecting the poet’s psyche.

Þorgeirsdóttir-Quantifying Feelings-120.pdf


1:30pm - 1:45pm

Text Clustering of Icelandic Post-Medieval Þulur: Explorations Using the Runoregi Interface

Yelena Sesselja Helgadóttir1, Maciej Michał Janicki2

1The Árni Magnússon Institute for Icelandic Studies, Iceland; 2University of Helsinki, Finland

This article explores the possibilities of applying Runoregi – a user interface for browsing automatically computed textual similarity, originally designed for studying the Finnic oral tradition – to the research of Icelandic post-medieval þulur. Along with challenges common to oral poetry collections, the latter corpus presents some peculiar issues related to the specificity of the genre, as well as to the differences between Icelandic and the Finnic languages. Preliminary results show that the automatic poem clustering featured in Runoregi corresponds well to an existing scholarly analysis, while features such as dendrograms, side-by-side alignments and similar-passage search are useful in exploratory research.

Helgadóttir-Text Clustering of Icelandic Post-Medieval Þulur-154.pdf


1:45pm - 2:00pm

Oral poetry collections as a dialectal spoken language corpus

Antti Kanner1, Mari Sarv2, Kati Kallio3, Helina Harend4, Jakob Lindström3, Kaarel Veskis4, Eetu Mäkelä1

1University of Helsinki, Finland; 2Estonian Literary Museum, Estonia; 3Finnish Literary Society & University of Helsinki, Finland; 4Estonian Literary Museum & University of Tartu, Estonia

In the Finnic languages of the Circum-Baltic region, large collections of oral poetry have been gathered with periodically varying intensity from the 17th up to the 21st century. These collections include many genres, such as epic poetry, nursery rhymes, wedding songs, spells and enchantments, that form part of the common Finnic poetic tradition (known as runosong, Kalevala-metric song, Finnic alliterative tetrameter, etc.), as well as texts in other vernacular poetic forms, and they represent a wide variety of Finnic languages and dialects, including Karelian, Estonian, Finnish, Ingrian, Ludic, Veps and Votic. While this material has been used sporadically and has even seen some dedicated linguistic studies (for example, Peegel 2006, Palola 2009 and others), it has remained virtually unused for large-scale corpus linguistic study. One of the main reasons for this is that it has not been clear how the poetic language in these collections relates to the linguistic variation in the Finnic-speaking region. This paper aims to address that gap.

In this paper, we report the results of statistical analyses that seek to map the oral poetry collections onto the more established corpora of spoken dialectal varieties of Finnic languages. These corpora include the dialectal section of the Finnish Syntax Archives (FSA), the Digitized Morphological Archives (DMA), Samples of Spoken Finnish (SSF) and the Estonian Dialect Corpus (EDC). Given that the oral poetry collections form a much larger data set than any of the aforementioned dialectal corpora, and are generally thought to represent a more archaic form of language, they have considerable potential for the historical linguistic study of the languages and dialects they contain, as well as for a general description of linguistic variation in the Finnic area. The oral poetry collections available in the joint database of Finnic runosongs (the FILTER database) total a little over 14M tokens, whereas the sizes of the established dialectal corpora range from 800K (SSF) to 4M (DMA) tokens and total 7.1M tokens.

First, we characterise the oral poetry collections as a corpus and examine how closely it resembles the other corpora in terms of basic corpus linguistic summary statistics, such as type-token ratios and the shapes of word frequency distributions. This verifies the general usability of the corpus and the viability of applying basic corpus linguistic tools. Then, we assess how much the different language datasets at our disposal overlap in terms of word types and geographical distribution. If the amount of overlap is small, there is very little to compare between the data sources; a high level of overlap, on the other hand, promises more viable comparisons. We then examine how similarly the frequency distributions of high-frequency types behave in all data sources and whether the oral poetry collections stand out, and if they do, whether they deviate in a systematic or non-systematic way. Finally, these analyses are summarised and compared against variables concerning the geographical variation of language in each dataset. The main question here is how much the language of oral poetry reflects the dialectal variety of language spoken in the areas the poems have been collected from.
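
A minimal sketch of the kind of summary statistics meant here, assuming each corpus is available as a plain list of tokens; the token lists below are placeholders, not material from the actual collections. Note that the type-token ratio is sensitive to corpus size, so in practice comparisons are usually made over fixed-size samples.

```python
from collections import Counter

def corpus_summary(tokens):
    """Type-token ratio and the head of the word frequency distribution for one corpus."""
    freqs = Counter(tokens)
    ttr = len(freqs) / len(tokens)
    return ttr, freqs.most_common(5)

# Placeholder token lists; in practice these would be loaded from the FILTER,
# FSA, DMA, SSF and EDC exports.
corpora = {
    "runosongs": ["vaka", "vanha", "väinämöinen", "vaka", "vanha"],
    "fsa": ["se", "oli", "sitten", "se"],
}
for name, tokens in corpora.items():
    ttr, top = corpus_summary(tokens)
    print(f"{name}: type-token ratio {ttr:.2f}, most frequent {top}")
```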

We show that, due to the high level of overlap especially among high-frequency word types, the oral poetry collections reliably capture dialectal variation, but when that variation is compared to the other data sources, the corpora do not align geographically one-to-one. Rather, the oral poetry shows some systematic displacement patterns, which are interesting both linguistically and from the perspective of established folkloristic hypotheses regarding the complex development and transmission processes of the oral poetic tradition. Combined with the knowledge of the inherently considerable variation in the data, the results invite further studies, especially case studies targeting particular poetic types and geographic regions.



2:00pm - 2:15pm

How to analyse variation in folklore: Finnic runosongs and Ukrainian dumas

Olha Petrovych1,4, Mari Sarv1, Maciej Michał Janicki2, Kati Kallio2,3

1Estonian Literary Museum, Estonia; 2University of Helsinki, Finland; 3Finnish Literature Society, Finland; 4Vinnytsia Mykhailo Kotsiubynskyi State Pedagogical University, Ukraine

Variation is a core feature of folklore that reveals itself in its various aspects: the content, style, performance, usage and functions of folkloric communication differ slightly between individuals as well as between regions, while at the same time manifesting a recognizable similarity of phenomena on a larger or smaller scale. According to Juri Lotman (1977), a specific type of aesthetics prevails in folklore, which he calls the “aesthetics of sameness”: it is considered beautiful if the artistic expression does not deviate too much from the general standard; creativity is appreciated, but within certain limits.

The digitization and datafication of folklore materials enable us to explore this variation with the help of statistical and computational methods. Due to the layered nature of folkloric variation, as well as the unevenness of collection history, this is a complex task. Within the FILTER project, funded by the Academy of Finland, we have developed methods and procedures to handle variation in a large multilingual corpus of poetic texts. In addition to purely folkloric variation, underlying linguistic variation poses an additional challenge. In order to detect similar but still slightly different verses, we have successfully applied a similarity detection method based on cosine similarity of verse bigrams (Janicki et al. 2023). By clustering the verses by their similarity measures, we obtain verse clusters that are close enough to verse types. Verse types that often occur near each other tend to form adjacency groups of (1) stereotypic formulae, (2) motifs or (3) even whole songs.
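
A rough sketch of the general idea, under the assumption that each verse is represented as a bag of character bigrams and that clusters are cut from the pairwise cosine-similarity matrix (the project's actual representation and clustering follow Janicki et al. 2023; the verses and the threshold below are placeholders, and the metric argument requires scikit-learn 1.2 or later).

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

verses = [
    "vaka vanha väinämöinen",        # placeholder verses
    "vaka vanha väinämöisen",
    "itse se sanoiksi virkki",
]

# Represent each verse as a bag of character bigrams and compute pairwise cosine similarity.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))
X = vectorizer.fit_transform(verses)
similarity = cosine_similarity(X)

# Group verses whose mutual distance (1 - similarity) stays below a threshold;
# each resulting group approximates a verse type.
clustering = AgglomerativeClustering(
    n_clusters=None, metric="precomputed", linkage="average", distance_threshold=0.5
)
labels = clustering.fit_predict(1 - similarity)
print(labels)
```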

According to the theory of oral-formulaic composition (Parry 1971; Lord 1991) that laid the foundation for understanding oral epic composition, the poetic composition in the conditions of oral transmission relies on the mastery of essential elements such as formulas, themes, and story-patterns. These recurring patterns, encompassing form, meaning, and narrative, transcend mere repetition; they are the fundamental building blocks of the oral tradition.

Exploration of variation among the texts of a song type has been one of the central tasks of geographical–historical research, which in the 19th century was a central method in folkloristics. Although the aims of folkloristic research have changed since, it is still a relevant task to figure out the main features and branches of a large number of similar texts. We have been looking for ways to do this computationally, and as one possible solution we have applied network visualization to the adjacency measures of verse clusters. In the network view, the modularity groups are able to represent the most common sub-motifs and regional branches of the plot. Thus far, we have applied these procedures to the Finnic runosong tradition. In our current presentation, we will expand this to Ukrainian dumas, to find out (1) whether the methods we have used are applicable to the analysis of oral poetry in more universal terms; and (2) whether there are considerable differences in the variation patterns of Slavic and Finnic poetic traditions.
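
One way to realise the network step, sketched here on the assumption that co-occurrence counts between verse clusters serve as edge weights and that modularity-based community detection (networkx's greedy algorithm in this sketch) surfaces the groups; the cluster names and counts are invented for illustration.

```python
import networkx as nx
from networkx.algorithms import community

# Hypothetical adjacency counts: how often two verse clusters occur near each other
# across the variants of a song type.
adjacency = {("A", "B"): 12, ("B", "C"): 9, ("C", "D"): 2, ("D", "E"): 11, ("E", "F"): 8}

G = nx.Graph()
for (u, v), weight in adjacency.items():
    G.add_edge(u, v, weight=weight)

# Modularity-based communities correspond to densely connected groups of verse clusters,
# which can be read as common sub-motifs or regional branches of the plot.
groups = community.greedy_modularity_communities(G, weight="weight")
for i, group in enumerate(groups):
    print(f"group {i}: {sorted(group)}")
```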

Dumas form a lyric-epic genre in which the songs tell about historical or social events in Ukraine (Lanovyk & Lanovyk 2005: 261). As research material we use a corpus of Ukrainian Folk Dumas compiled on the basis of a large academic publication (Skrypnyk 2019). Our analysis focuses on the duma "Escape of three brothers from the city of Azov, from Turkish captivity," preserved in 74 variants. This specific heroic narrative touches on the themes of captivity, escape, and death in the steppes, making it a unique and fascinating part of Ukrainian oral folklore. Despite its origins in a single escape incident, the duma gained widespread popularity, generating numerous variants and becoming a symbol of national tragedy for the Ukrainian people. Its enduring appeal lies in the narrative's rich content, which goes beyond a family tragedy and encapsulates a broader national situation.

Ultimately, this computational exploration attempts to bridge the gap between traditional scholarship and modern computational methods, offering a novel perspective on the dynamic and interconnected nature of oral traditions. The insights gained from this study have implications for preserving cultural diversity, fostering cross-cultural understanding, and advancing computational approaches in the study of folk songs.



2:15pm - 2:30pm

Sources and development of the Kalevala as an example for the quantitative analysis of literary editions and sources

Eetu Mäkelä1, Kati Kallio1,2, Maciej Janicki1

1University of Helsinki, Finland; 2Finnish Literature Society SKS, Finland

This paper describes in-progress work to quantitatively analyse the development of the Finnish national epic Kalevala, both in terms of the oral poetry sources used, as well as in terms of the development of the final edition through multiple preparatory publications and manuscripts. In terms of methodological development, the paper relates to prior research on how to visualise relations between texts. However, the particular nature of oral poetry and the editing process at play here precludes using any of the established methods directly, leading instead to the development of complementary approaches.

Mäkelä-Sources and development of the Kalevala as an example-226.pdf
 
1:00pm - 2:30pmSESSION#10: COLLABORATIONS
Location: K-206 [2nd floor]
Session Chair: Eiríkur Smári Sigurðarson, University of Iceland, Iceland
 
1:00pm - 1:30pm

Modern Times 1936

Pelle Snickars1, Emil Stjernholm2, Mathias Johansson3, Maria Eriksson4, Fredrik Norén5, Robert Aspenskog6

1Lund University, Sweden; 2Lund University, Sweden; 3Lund University, Sweden; 4University of Basel, Switzerland; 5Malmö University, Sweden; 6University of Gothenburg

What is it that software sees, hears and perceives when technologies for pattern recognition are applied to media historical sources? All historical work requires interpretation, but what kind of algorithmic interpretations of modernity does software yield from historical archives? Modern Times 1936 is a research project funded by Sweden's second largest funding body (Riksbankens jubileumsfond), running between 2022 and 2025. The project involves four researchers, one developer and one master's student (http://modernatider1936.se/en/). The project is empirically committed to everyday experiences and sets out to study how machines interpret symbols of modernity in media from the 1930s. By utilizing primarily photographic and audiovisual collections, the project seeks to analyze how modern Sweden actually was, while also exploring how computational methods can help us understand modernity in new ways.

We would like to propose a one-hour panel around some aspects of the research we have so far done within Modern Times 1936, with a focus on audiovisual media, foremost film and photography. Within our project we have collaborated with the heritage sector, assembling a dataset of some 80,000 photographs from the 1930s, scraped from the heritage portal DigitaltMuseum. We have also initiated a joint venture with the Nordic Museum in Stockholm. Regarding moving images, we have used the media historical portal filmarkivet.se, run by the National Library of Sweden and the Swedish Film Institute. We have used a number of speech-to-text models on newsreels from the 1930s, but foremost upscaling algorithms for explorative film restoration. The latter work stems from previous film historical projects (Snickars 2015), as well as research done within European History Reloaded: Curation and Appropriation of Digital Audiovisual Heritage, an EU-funded project that examined algorithmic approaches to archival film reuse and introduced a method for mapping video reuse with the help of AI and convolutional neural nets (Eriksson, Skotare & Snickars 2022). Hence, given the special theme of DHNB 2024 of collaboration with the heritage sector and involvement of ALM professionals alongside digital humanities research, we think that a panel about Modern Times 1936 is a perfect match.

In our panel we will first give a short general presentation of Modern Times 1936 (and yes, the Chaplin pun is intended). In short, our project explores how artificial intelligence and machine learning methods can foster new knowledge about the history of Swedish modernity, while at the same time critically scrutinizing algorithmic toolboxes for the study of the past. Within the panel we propose to emphasize three particular strands that we have been working with: (1.) upscaling algorithms for film restoration, (2.) generative AI and pattern exploration within a major photographic dataset, and (3.) photographic super-resolution fantasies and the production of synthetic media.

(1.) Following a boom in user-friendly artificial intelligence tools in recent years, AI-enhanced (or manipulated) films have been framed as a serious threat to film archives. Film archivists are usually conservative; following their métier, they are in the business of safeguarding film heritage. Today, however, the film archive – understood in a wide sense – is also elsewhere, most prominently online, where each media asset becomes "at the instant of its release, an archive to be plundered, an original to be memorized, copied, and manipulated" (de Kosnik 2016). To explore these matters, and to trace and critically evaluate how algorithmic upscaling can modify older films, we initiated a collaboration within our project with the Swedish AI artist ColorByCarl. He has been working with silent films from filmarkivet.se, and drawing on this collaboration we were able to study the procedures within the AI enhancement community, highlighting generative AI's potential to encourage reuse, remix and rediscovery of the filmic past.

(2.) Within our project we have been using pattern exploration within a major photographic dataset of some 80,000 photographs from the 1930s (taken from DigitaltMuseum). We have tagged each image with metadata categories using Doccano, and the general idea has been to train models to automatically find different types of images. Gender has been a case in point; two preliminary models can detect images with (or without) men and women with approximately 95 percent accuracy. We have also worked with different kinds of object recognition models to track symbols of modernity such as factories, vehicles or cinemas. This has proven significantly more difficult, but we believe the work can be improved by using more period imagery as training data. The ambition of this work has been to develop models that can automatically annotate and sort larger visual heritage collections, and consequently we have initiated a collaboration with the Nordic Museum using their new collection 100,000 Bildminnen (image memories) as a case. On the one hand, we are developing models that can help the Nordic Museum search and sort this collection in new ways; on the other, we are also interested in producing new images based on the same collection. Since Stable Diffusion is open source, it can be trained on a specific dataset, such as 100,000 Bildminnen, and hence generate new historical photographs. Our aim is of course not to demonstrate that generative AI can picture the past in a better way. Rather, we believe that such a collaboration will open up new ways of understanding historical image collections.
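
One plausible way to set up such a binary detector is transfer learning on a pretrained torchvision model, sketched below. This is not the project's actual pipeline; the tag, learning rate and architecture are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical setup: fine-tune a pretrained ResNet-18 to predict one binary tag
# exported from Doccano (e.g. "women depicted" vs. not).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: tag present / absent

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimisation step over a batch of preprocessed 1930s photographs."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```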

(3.) Generative AI can indeed prove a useful tool for tracing tropes and patterns in historical datasets (Offert & Bell 2020), and recent scholarship has also suggested that generative AI can offer new opportunities, not least in media history (Wilde 2023). Super-resolution technologies describe a set of computational methods for enhancing the resolution and/or sharpness of low-resolution digital visual content. Designed to fix what Hito Steyerl once referred to as poor images (Steyerl 2009), image upscaling is frequently promoted as a tool for improving visual content, for example poorly digitized historical photographs. Yet what do super-resolution technologies actually do to visual and audiovisual content, and what do they make us see? Focusing on the real and imagined ways in which resolution technologies enrich visual imagery, we have explored within our project how machine learning models are increasingly shaping ways of seeing, interpreting, and caring for historic photographs. Drawing from a series of experiments aimed at studying if and how super-resolution technologies hallucinate and introduce new visual elements into historic photographs, we have explored how super-resolution technologies unsettle boundaries between reality and fiction, hence both clarifying and occluding visions of the past.

Forthcoming publications:

Eriksson, Maria (2024) "On the Meaning of Scale in Image Upscaling", MontageAV (forthcoming)

Eriksson, Maria (2024), "Truthful Pixels: Synthetic Images and the Measurement of Photorealism", Transbordeur (forthcoming)

Stjernholm, Emil & Snickars, Pelle (2024), "Upscaling Swedish Biograph", Journal of Scandinavian Cinema (forthcoming)

Snickars-Modern Times 1936-118_a.pdf
Snickars-Modern Times 1936-118_b.pdf


1:30pm - 1:45pm

Cross-Institutional collaboration to create learning materials and metadata as LOD for promoting the use of digital cultural heritage in schools

Masao OI1,2, Satoru NAKAMURA2

1National Institutes for the Humanities, Japan; 2The University of Tokyo, Japan

[Purpose and Methodology]

The purpose of this study is to construct a network of "people" and "data" that connects diverse digital cultural heritage with children's enriched learning.

To this aim, we propose "S×UKILAM (School × University, Kominkan (Regional Community Center), Industry, Library, Archives, Museum) collaboration" as a scheme for co-creating educational use of digital cultural heritage through collaboration between schools and various institutions that hold and release digital cultural resources.

[Workshop on "Creating Learning Materials" by utilizing digital cultural heritage]

At first, as a practical method of S×UKILAM collaboration, a workshop was designed to co-create "learning materializations" of diverse cultural heritage and to assign metadata; more than 10 such workshops have been held, ranging from the national scale to various local governments, universities, and libraries. As a result, 396 institutions from over 95% of the prefectures participated in these workshops.

The scheme has been scaled up and an international version of the S×UKILAM collaboration was held as a workshop to develop learning materials together with members of Europeana.

The international version of the workshop utilized the digital cultural heritage of Japan and Europe to create learning materials that would contribute to the learning of children in various countries from a global perspective.

[Questionnaire results]

The results of the questionnaire survey suggested the effectiveness of the workshop, as well as the issues that were revealed through the dialogue between participants from different backgrounds in utilizing cultural heritage. The importance of open licensing and the co-creation of "educational metadata" based on the perspective of the educational field were also indicated.

[Development archive of learning materials]

Over 100 unique learning materials co-created in the workshops were licensed for secondary use and given back to society and the future using the internationally interoperable IIIF (International Image Interoperability Framework). This creates a cycle in which new knowledge and information produced by using the digital archive are returned to the digital archive, sets previously static cultural resources in circulation, and establishes a scheme in which cultural resources are given new value and passed on to the future.

[Constructing Linked Open Data]

An LOD (Linked Open Data) model was also developed to connect and structure the learning materials based on digital cultural heritage, created through this inter-organizational and inter-disciplinary collaboration, with existing diverse information, digital collections, and curricula in a more machine-readable form. Specifically, we developed a Resource Description Framework (RDF) dataset and SPARQL endpoints using the archive of educational materials co-created through the "S×UKILAM collaboration". As a result, it was possible to construct a findable LOD model by connecting educational information, such as curriculum guidebook codes and video content by NHK and Japan Search, in a highly machine-readable form. The model contributes to the informatization of education and the promotion of the use of digital cultural heritage by providing easy access to learning materials and diverse educational information.
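
A minimal rdflib sketch of the kind of linking described, with illustrative URIs and property names only; the project's actual vocabulary, curriculum codes and endpoints are not reproduced here.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("https://example.org/sukilam/")  # illustrative namespace, not the project's

g = Graph()
material = URIRef(EX["material/001"])
g.add((material, RDF.type, EX.LearningMaterial))
g.add((material, DCTERMS.title, Literal("Learning material based on a digitized map")))
g.add((material, EX.curriculumCode, Literal("ABC123")))                # hypothetical guideline code
g.add((material, DCTERMS.source, URIRef("https://jpsearch.go.jp/")))   # link out to Japan Search

print(g.serialize(format="turtle"))
```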

[Development of curation application for end users]

In addition, considering the difficulty of handling SPARQL, we also developed an application for end users. This application automatically curates and provides not only the metadata assigned to the learning materials but also related information and content from the Web, through simple operations and an easily understandable UI.

[Implications and Contributions]

The workflow provided by this research is highly versatile internationally and suggests a way to construct a network of "people" and "data" to advance the educational use of digital cultural heritage through the collaboration of DH research institutions, GLAM, and school education.



1:45pm - 2:00pm

Nichesourcing in Terminology Work: Collaborative Projects Across and Beyond Disciplines

Harri Kettunen, Niklas Laxström, Tiina Onikki-Rantajääskö

University of Helsinki, Finland

The Helsinki Term Bank for the Arts and Sciences (HTB) is an internationally unique Semantic MediaWiki-based platform for interdisciplinary cooperation and a discussion forum for society as a whole. It utilizes nichesourcing in which teams of experts from different disciplines are responsible for content creation with the aim of bringing all disciplines together in the same online service. This presentation focuses on the educational and societal goals of the HTB, lessons learned from collaborative projects between various disciplines, aspects of multilingualism, pedagogical applications and implications, technical challenges, and the potential for international cooperation.

Objectives

The objectives of the HTB serve both the scientific community and society as a whole. Thus, the aim is to bring together the current terminology of different disciplines in a single online service, thereby supporting the development of national and minority languages as languages of science and facilitating multilingualism in the academic world. The online service based on Semantic MediaWiki also enables a new kind of collaboration and co-authorship that can be particularly useful in multidisciplinary and interdisciplinary research. At the same time, it makes all scientific knowledge creation available to all interested parties, thus promoting fairness and equality in education and open science.

Current situation

The HTB has been in existence for ten years and its contents are still under construction. At present, it covers over 50 disciplines or specialty fields and over a thousand experts from different fields have contributed to it. There are over 45,000 concept pages or articles and over 300,000 terms with translation equivalents.

As a wiki, the HTB is continuously evolving and therefore remains a work in progress. Bringing in active experts is a prerequisite for expanding the content. At present, over a thousand experts are already involved in the work, but their activity varies. The incompleteness of the database requires patience from both users and experts.

The HTB is used by people of all ages from school children to retirees, although the largest user group consists of students. The HTB has a total of almost two million page views per year.

Utilizing the terminology in society

The opportunities to use the HTB affect society more broadly than just research and university education. As the user statistics show, school students are an important user group. The HTB is important for the education system in the national languages: it involves developing specialized vocabulary in the national languages and, more generally, opening up how scientific knowledge is constructed. Although not all terms are needed in educational materials, broader background knowledge is important for authors of educational materials. The same applies to non-fiction and the popularization of science in general.

Materials in the HTB can be used freely, and different applications can therefore draw on them and, for example, indirectly improve the quality of language models. Societal activities are increasingly based on electronic applications, and it is essential that these are compatible and that the terminology used for interoperability can be interpreted in the same way. The HTB can serve as an aid to this.

It is also worth noting that the importance of reliable information, source criticism, and media literacy is emphasized at a time when attempts are being made to influence people's minds in a variety of ways: with fake news, inaccurate or misleading information, or with different beliefs. Knowledge that is based on research and is openly accessible meets a great need and, just as importantly, opens up how knowledge is formed and how concepts emerge.

There is also a continuum of knowledge utilization between experts and laymen. Researchers are experts in their own specialties; the further they move away from their own discipline, the more they lack the kind of clear explanatory knowledge that shows how that knowledge has been generated. Scientific activity and the societal impact of science thus go hand in hand at the HTB.

Challenges

Scientific terminology can only be created by experts in the different disciplines. However, experts are rarely terminologists, and the limited voluntary input results in somewhat uneven and heterogeneous content. Furthermore, the MediaWiki experts need to understand what the scientific experts are doing, while the scientific experts need to understand what can be done on the platform.

The best situation is reached when terminologists and discipline experts work together. Wikis can be used as databases, content management systems, and publishing platforms, but their unique strength is facilitating asynchronous collaboration via simple editing, discussion pages, and change histories.

One of the reasons Semantic MediaWiki was chosen for the HTB was that it allows for loose and dynamic data modeling that is accessible to people who are not programmers. In a way, we apply the iterative progress idea also to the data modeling, not only to the primary content.

Furthermore, nichesourcing has a self-correcting effect that works if a critical mass of researchers participates in the terminology work. However, in today’s academic world, everything ultimately depends on resources. Besides funding, insufficient working time is an obstacle or a delaying factor. Nonetheless, there is a solution to this: carrying out terminology work together with PhD students would integrate it into teaching at the university. Such cooperation has already begun with various doctoral programs in Finland.

Finally, terminology work corresponds fully to the three main tasks of universities: research, teaching, and societal impact. Voluntary contributions are necessary for the terminology work, but their limited nature also presents a challenge.



2:00pm - 2:15pm

Transnational Research Infrastructure: A Journey Through CLARIN Knowledge Centres

Jurgita Vaičenonienė1, Michal Kren2, Vesna Lušicky3, Vincent Vandeghinste4,5

1Vytautas Magnus University, Lithuania; 2Charles University, Prague; 3Universität Wien, Wien; 4Instituut voor de Nederlandse Taal, Leiden; 5KU Leuven, Belgium

This paper discusses best practices in the development and operation of the network of Knowledge centres within the framework of the transnational and cross-disciplinary CLARIN Knowledge Infrastructure. A Knowledge centre is an institution, or a collective of institutions, which “have agreed to share their knowledge and expertise with others” (Branco et al. 2023), which has been certified by the Common Language Resources and Technology Infrastructure (CLARIN), and which has an established website providing users with information on the offered services and contact forms. Currently, there are 28 Knowledge centres from different countries covering a wide range of topics, offering various research- and training-related services, and organised either as a virtual centre of one or several institutional partners in one country or as a centre distributed across multiple countries. This paper aims to present this rich diversity of the research infrastructure and its state of the art by reviewing existing Knowledge centres and their areas of expertise, describing the activities initiated by CLARIN to strengthen K-centre networking, showcasing different types of K-centres, and presenting the operation and collaboration practices shared by K-centres in their annual reports for 2022-2023.

Vaičenonienė-Transnational Research Infrastructure-113.pdf


2:15pm - 2:30pm

The Labour’s Memory Project

Olle Sköld1, Raphaela Heil2, Silke Neunsinger3, Eva Pettersson4, Örjan Simonson2, Jonas Söderqvist3

1Department of ALM, Uppsala University; 2Popular Movements’ Archive in Uppsala; 3Swedish Labour Movement Archives and Library; 4Department of Linguistics and Philology, Uppsala University

Introduction

The labour movement has facilitated the development of Swedish democracy, and has significantly shaped the Swedish labour market and welfare state system (e.g., Jansson, 2017). Correspondingly, the archives of the labour movement hold great value in many professional and scholarly areas where knowledge of historical events and processes is of importance – including but not limited to history, economics, political science, and trade union work. The main deliverable of the Labour's Memory (LM) project (2020–2023, with prolongation) is to digitise and make accessible annual and financial reports from Swedish trade union organisations at the local, regional, national, and international levels within an 1880–2020 timeframe.

The LM project team consists of archivists, librarians, and researchers from history and economic history, information studies, computational linguistics, and computerised image processing, who work towards making the digitised documents available via a portal developed on the basis of the needs and preferences of key user groups. The portal will include features like robust document metadata; capable search features based on applications of computational linguistics, including spelling normalization of historical text and named entity recognition (NER); transcriptions produced by the application of handwritten text recognition (HTR) and optical character recognition (OCR) techniques; and high-quality digital reproductions of selected archival holdings from the Swedish Labour Movement’s Archive and Library (ARAB), the Popular Movements’ Archive in Uppsala (FAC), the International Institute of Social History (IISH) and the Archive of Social Democracy (AdSD).

The purpose of this paper is to trace the main domains of knowledge and expertise represented in LM and to reflect on how they intersect in different elements of project work. With its focus on digitization and heritage dissemination driven by the collaborative efforts of multiple archival stakeholders and researchers from different domains, LM epitomises many of the characteristics and challenges of large-scale heritage projects in the digital humanities (Allington et al., 2016; Luhmann and Burghardt, 2021). Reflections and lessons learned from LM project experience have broader relevance in the digital humanities, specifically so for projects and other collaborative efforts where desired outcomes are dependent on co-work and communication across scholarly and GLAM communities.

Mapping the Labour’s Memory knowledge domains

Work in LM draws on expertise from several different research areas and organisations and is characterised by interdisciplinary, international, and inter-institutional exchanges. The project is distributed across four knowledge domains that are mapped and numbered below, alongside their respective contributions.

(1) At the core of LM lies the handling and digitisation of the annual and financial reports. This essential expertise is provided by the archival project partners, who implement the long-term storage and digitisation in accordance with the respective material’s requirements and limitations. Digitisation approaches range from in-house image acquisition, employing digital cameras and dedicated book scanners, to outsourcing via digitisation providers.

(2) The digital image acquisition is followed by OCR for printed and typewritten documents, implemented via off-the-shelf solutions, and by HTR for manuscripts. The HTR methods are developed in-house at FAC in cooperation with the Department of Information Technology, Uppsala University. OCR and HTR yield machine-readable texts that enable full-text searches and subsequent processing steps, and make the contents more accessible to interested readers by removing the barrier of deciphering old handwriting or faded prints.

(3) The LM portal’s search capabilities and document interaction opportunities are further enriched through the efforts of computational linguists from the Department of Linguistics and Philology, Uppsala University. Firstly, spelling normalisation of historical texts allows users to search for words in their modern, standardised spelling and receive results that include older variants. Secondly, metadata, such as the names of persons and organisations, are extracted from the text via NER, making it possible for users to perform faceted searches and filtering, for example limiting a query to documents containing references to a specific organisation (Tudor and Pettersson, 2024).
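
A sketch of how these two enrichments could interact at query time, assuming a simple lookup table from historical to modern spellings and per-document organisation names extracted by NER; all names, spellings and structures here are illustrative, not the portal's actual index.

```python
# Illustrative lookup from historical to modern Swedish spellings.
normalisation = {"hvarje": "varje", "qvinna": "kvinna", "arbetarne": "arbetarna"}

# Illustrative document index: tokens plus organisation names extracted via NER.
documents = [
    {"id": "report-1901", "tokens": ["hvarje", "qvinna", "och", "man"], "organisations": ["Metall"]},
    {"id": "report-1922", "tokens": ["arbetarne", "samlades"], "organisations": ["LO"]},
]

def normalise(token):
    return normalisation.get(token, token)

def search(modern_word, organisation=None):
    """Match a modern-spelling query against normalised tokens, optionally filtered by an NER facet."""
    hits = []
    for doc in documents:
        if modern_word in {normalise(t) for t in doc["tokens"]}:
            if organisation is None or organisation in doc["organisations"]:
                hits.append(doc["id"])
    return hits

print(search("kvinna"))                        # found via the historical spelling "qvinna"
print(search("arbetarna", organisation="LO"))
```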

(4) Development work in LM is supplemented by user-driven design methodologies. Researchers from the Department of ALM, Uppsala University, elicit feedback from representatives of vital user groups via a set of user studies (Sköld and Huvila, 2024). The collected insights, for example regarding potential user motivations and goals, inform the development and design of the portal, its search functionalities, and the underpinning metadata structures.

Discussion

LM is presently in its final phase, preparing the platform, local infrastructures and data for publication. At this point, two intersections between the domains of knowledge and expertise represented in the project emerge as most impactful on LM's work processes and results. The first key intersection is between archival and technical expertise, two areas that lie at the very core of LM and many other DH projects in the heritage data domain. Key lessons learnt include the importance of working deliberately not only towards un-siloing these knowledge domains within the project framework, but also towards creating the conditions for mutual cross-domain learning, inquiry, and exploration of issues to circumvent or solve.

The second intersection of knowledge domains with major impact on the course of LM work is that between user knowledge and the aggregate knowledge domain represented by the LM project team. The user studies conducted in LM confirm that representatives of pivotal user groups – researchers in history, literary studies, and the digital humanities, as well as trade-union professionals – possess significant practical and theoretical experience and insights that can immensely assist in creating a portal that succeeds in helping future users accomplish their goals and ambitions. Experience from LM shows that while it is easier to elicit and make use of user feedback at the early and late stages of development work, it is notably more difficult to do so at the midpoint of the development cycle. The main take-away with regard to user-driven design in DH projects with teams of multiple academic and GLAM stakeholders is to jointly identify, at an early stage, the areas where user feedback is most beneficial, and then determine what tasks need to be accomplished to make that feedback possible to engender and use.

Labour's Memory is funded by Riksbankens Jubileumsfond under grant agreement IN20-0040.

 
2:30pm - 3:00pmBREAK
Location: Háma
3:00pm - 3:45pmSESSION#11: CORPORA
Location: H-207 [2nd floor]
Session Chair: Mari Sarv, Estonian Literary Museum, Estonia
 
3:00pm - 3:15pm

Selecting texts for a corpus of Early Modern Icelandic

Jóhannes B. Sigtryggsson, Ellert Þór Jóhannsson

Stofnun Árna Magnússonar í íslenskum fræðum, Iceland

This presentation accounts for aspects of a recently established collaborative project which aims at digitizing Early Modern Icelandic texts and creating a text corpus from them. The project is supported by experts from three institutions: the National Library of Iceland – University Library (Lbs-Hbs), the National Archives of Iceland (ÞÍ) and the Árni Magnússon Institute for Icelandic Studies (SÁM). The project is funded with a research grant from the Icelandic Infrastructure Fund (Innviðasjóður) and coordinated by the Centre for Digital Humanities and Arts (Miðstöð stafrænna hugvísinda og lista, mshl.is).

The objective is threefold: firstly, to digitize, through an OCR process, large volumes of manuscripts, documents and printed material from the period 1540–1850, thus making the texts available and accessible; secondly, to mark up the texts grammatically and use them as a basis for a full-fledged language corpus; and thirdly, through the mark-up process and normalization of texts, to facilitate access to a part of Icelandic cultural heritage that has until now been inaccessible to most people.

The beginning of the period is determined by the publication of Oddur Gottskálksson’s Icelandic translation of the New Testament (1540), which marks the beginning of printing in Iceland. The end of the period is determined by the first edition of Piltur og stúlka by Jón Thoroddsen from 1850, considered to be the first modern novel in Icelandic. All Icelandic books published during this three-hundred-year period are preserved and have been digitized and made available online through the website of the National Library, baekur.is.

Selected printed publications will be treated with OCR along with selected documents from the National Archives and manuscripts from the National Library, paying special attention to the writings of officials and scholars of this period.

The project will utilize the infrastructure that has already been created at the Centre for Digital Humanities and Arts, e.g. by using the artificial intelligence program Transkribus for the OCR processing of manuscripts and other written documents. The knowledge that has accumulated at the Árni Magnússon Institute over the last decade regarding digitization of texts as well as further processing of them will also be utilized.

The project is set to be carried out in three steps:

· OCR treatment of manuscripts, books and documents.

· Systematic production of grammatically marked-up texts (tagging and lemmatization), ensuring the preservation and availability of digital OCR-produced texts.

· Error detection and correction of computer-readable OCR-produced text using artificial intelligence.

For this project, OCR models must be created. Experts from the aforementioned institutions will select manuscripts, documents and printed books for OCR processing that will cover the entire period. Digital images of most of the texts are already available on the websites handrit.is, baekur.is, timarit.is and heimildir.is, which the participating institutions run and manage.

In this presentation we will account for the project and take a closer look at the text material that will be used in the corpus, how it will be selected and the criteria involved. We will discuss how the texts must be treated after the OCR process to be internally compatible. We will account for normalization standards, treatment of linguistically ambiguous forms and different possible representations of the digital texts in the corpus. Finally, we will mention some possible ways of enhancing the data, e.g. linking them to other resources such as dictionaries and other corpora.

The project will also have other benefits besides the text corpus. Correction models created by working with OCR texts will be useful for the further development of OCR solutions for Icelandic texts and of standards created at the Árni Magnússon Institute. Correction and standardisation models for OCR-treated texts will be made publicly available on the research infrastructure site CLARIN (clarin.is/). Furthermore, this project will encourage the development of various technical solutions that will be useful to other institutions hosting Icelandic databases in the humanities and arts for conducting research. The project will strengthen cooperation between institutions and contribute to the development of new databases and the updating of those that already exist. All technical solutions developed will adhere to international standards for metadata.

The corpus of Early Modern Icelandic will be especially useful for the study of the history of the language and the development of the lexis. The digital processing of the texts and the creation of the text corpus will radically increase accessibility to the texts and their content and facilitate their use for research in fields such as history, ethnography, folklore studies and more.



3:15pm - 3:30pm

Representativity, Biases and Choices in Digital Corpora Curation: Latvian Diary Corpus

Haralds Matulis, Ilze Ļaksa-Timinska, Elvīra Žvarte

Institute of Literature, Folklore and Art of University of Latvia, Latvia

The presentation will assess the representativity of the Latvian Diary Corpus by reflecting on potential biases and describing the collection creators' choices in the formation of a digital corpus of diaries. The central question is to what degree a collection of digitized diaries can serve as a representative digital corpus to be analyzed with computational methods. When creating a collection of cultural heritage artifacts, other values besides representativity are at stake, and oftentimes everything of historical value is added to the collection. Humanities researchers then study the collection with close reading and other qualitative methods, their domain competence allowing them to fill in the gaps and judge the representativity or accidentality of particular collection items. However, to fully harness the possibilities of the computational methods of digital humanities in corpus analysis, the corpus has to be representative, or at least it is essential to be aware of its biases; otherwise any statistical inferences might be merely accidental.

The Latvian Diary Corpus is part of the “Autobiographical Collection” curated by the Institute of Literature, Folklore and Art of the University of Latvia (ILFA). The Autobiography Collection, established in 2018, evolved from materials people have written during various periods to document their own lives and the times they have experienced. Although the Autobiography Collection has been largely open to depositors of various autobiographical materials, all potential materials are assessed for their relevance to the collection. The manuscripts are then scanned, the originals are archived or returned to the owner, and the scanned materials are queued for transcription, which is done by volunteers in the digital archive's transcription tool or by specially recruited transcribers or researchers (Reinsone, Ļaksa-Timinska, Žvarte, 2024). Mostly these autobiographical materials are diaries, written life stories, memoirs, and letters, as well as various other materials providing complementary information – photographs, interviews with the authors, and their relatives' stories about them – so far containing a total of 233 autobiographical units of varying size and content, of which 118 units comprise the Latvian Diary Corpus.

The representativity of the diary corpus strongly depends on balanced gender, age, social class and regional coverage. Further methodological decisions arise from different diary-writing practices, prompting the question of whether such different kinds of writing are the same phenomenon: “daily routine” vs “event-triggered” diaries (Lejeune, 34), “journals” vs “autobiographical-style diaries” are all joined in a corpus under the label of diary. Representativity is further mediated by the profile of the persons who submit the diaries to the corpus. Who are the submitters of the diaries – authors, heirs, librarians, citizen historians – and how does that correlate with the content of the diaries? Which surviving materials are, because of their content, not donated to the corpus? Lynn Z. Bloom makes a distinction between “truly private diaries in contradistinction to those of private diaries intended as public documents” (Bloom, 27). Some researchers go further and note that a “truly private diary” should never be seen by other eyes, and that authors should rather destroy their diaries (Shiffman, 101). And then there is a tradition of self-writing, coming from the field of autobiography, where the narrator functions as an “autobiographical I” and is usually cast in a good light, with the potential of publication looming in the mind of the writer (Heehs, 7). A closer look at the diary profiles in the Autobiography Collection reveals that all the diary types mentioned above are present, as well as interesting borderline cases, but not all diary types are represented in proportionally similar parts.

Decisions of the corpus curators on borderline self-writing items also influence the final shape of the corpus. The reluctance to decline potentially precious materials is countered by the need to maintain the conceptual integrity of the corpus. How should different cases of self-writing materials be categorized where the distinction is not clear-cut, and what should be done with materials in which several autobiographical self-writing styles are mixed, e.g., memoirs followed by a diary? Oftentimes, due to limited resources, digitization (scanning, transcribing) cannot be applied immediately to all donated items. The prioritization of what to transcribe first, and what remains undigitized for the time being, then also influences the composition of the digital corpus, which is already used by researchers.

Although this presentation focuses on the Latvian Diary Corpus and each archival collection is unique, the assessment given here can provide some indication of the scale of the biases and noise caused by different facets of the curation process, which also apply to similar datasets, and could be useful to researchers in other areas of digital humanities.



3:30pm - 3:45pm

LatSenRom (1879-1940): the creation and iterations of the corpus of Latvian early novels

Anda Baklāne, Valdis Saulespurēns

National Library of Latvia, Latvia

LatSenRom (1879-1940) is a comprehensive corpus of Latvian long prose fiction, covering all novels released in Latvian as books from 1879 to 1940. The corpus was created by the National Library of Latvia, in collaboration with the Institute of Folklore, Literature, and Art of the University of Latvia, and the Institute of Mathematics and Informatics of the University of Latvia. This paper marks the first exposition detailing the dataset's genesis, design, versions, and iterations.

The paper itemizes the workflows entailed in the creation of the corpus, from digitization, optical character recognition, preprocessing, and markup up to the version control and iterations designed for specific use cases.

The design of data sets, although based on some general principles, is also dependent on the aims and methodologies of a particular study; hence there is more than one way to design sound data sets. While the design of LatSenRom follows general principles, its versions or iterations are tailored to specific research needs.

The creation of LatSenRom drew inspiration from the Distant Reading COST Action, which was carried out to produce the European Literary Text Collection (ELTeC) – a compilation of balanced collections of European novels in different languages, spanning publications from 1840 until 1920. Since all language collections were devised according to the same principles, the finalized datasets are methodologically rigorous and mutually comparable.

Within the context of ELTeC, the Latvian language collection could not achieve the target of 100 required novels, and the total number of works included was significantly reduced after balancing the corpus, due to the absence of novels in several categories: there were no novels published between 1840 and 1879, and only two novels were written by female authors. This situation was partly shared by collections in other smaller languages, sparking interesting discussions on the reasons why the novel genre emerged later in some regions, and how to redefine the criteria for representativeness when achieving balance among several categories is not feasible. Although the corpus of early Latvian novels, comprising 50 novels, was not representative according to ELTeC's criteria, it nonetheless encompassed every novel, providing complete coverage of publications from the first novel until 1920. The concept of full coverage inspired further development of the corpus as a collection of all works, foregoing the selection process. In its current iterations, the corpus includes all novels published as books until 1940, totaling approximately 460 works.

The question of how many Latvian novels exist proved to be one of the most challenging to answer. First, one might question what constitutes a novel. What distinguishes a novel from a very long story, a novella, a biography, or other forms of documentary literature with a significant fictional element? Second, the issue arises of where one novel ends and another begins: many works of prose fiction consist of parts, are published as trilogies, or as series. A rather formal approach was adopted to address the problem of defining the genre: items were selected based on existing bibliographies of novels. Multi-part works were separated or assembled according to specific guidelines.

Another challenge that arises when amassing corpora spanning long time periods is the change in writing tradition and grammar. At the beginning of the 20th century, Latvian publishing gradually transitioned from the old Gothic (Fraktur) to the Antiqua typeface, and the rules of orthography were changing as well.

It was decided that it was important to address the needs of different types of researchers: those interested in studying transcriptions that are as close as possible to the original texts, and those interested in the comparative analysis of works from different time periods who therefore require normalized (equalized) versions of the texts. The concept of iteration is used to distinguish parallel versions of texts from the versions that arise as the corpus changes over time - through the correction of mistakes and the addition or removal of works.

The normalization of the corpus can be realized in (at least) two steps: normalization of the script (typeface) and normalization of grammar. The second step was not yet fully implemented at the time of writing this paper.

In addition to the three iterations - original text, typeface-normalized text, and grammar-normalized text - LatSenRom is available in a structured form with morphological, syntactic, and NER markup. The corpora are available to users as individual datasets, or they can be explored through the interfaces provided by the National Library of Latvia - the corpus analysis platform (nosketch.lnb.lv) and the Latvian Prose Counter website (proza.lnb.lv).

A bibliographical dataset was created, encompassing records of all works featured in LatSenRom as well as records of their reprints, facilitating further analysis of the canonization of these works.

 
3:00pm - 3:45pmSSH Open Marketplace
Location: H-205 [2nd floor]
Session Chair: Edward Joseph Gray, DARIAH-EU / IR* Huma-Num, France

Hands-on practical session following Monday's introduction to the SSH Open Marketplace. All are welcome.

3:00pm - 3:45pmSESSION#12: TRANSLATION STUDIES
Location: K-205 [2nd floor]
Session Chair: Vicky Garnett, DARIAH-EU, Ireland
 
3:00pm - 3:15pm

Title Matters: Film Metadata Analysis

Agata Hołobut1, Miłosz Stelmach1, Maciej Rapacz2

1Jagiellonian University, Poland; 2AGH University, Poland

Starting with the early attempts at statistical analysis of style (e.g., Salt, 1974; 1983; cf. Baxter, 2014), quantitative and digital approaches to film study have attracted growing scholarly attention. According to Burghardt et al. (2020), three main disciplinary strands can be distinguished within this research area: (1) the infrastructural strand, advanced by digital archivists working for cultural institutions and documentation centres, which focuses on making digital collections accessible and mineable; (2) the computational strand, advanced by computer scientists and media informaticians working on semi-automatic and automatic analysis of film and video content based on multimedia information retrieval; (3) the media strand, advanced by film and media scholars adopting a quantitative approach to film study along the lines set out by numerical humanities (Roth, 2019).

Our contribution locates itself within the third subfield mentioned above. However, unlike scholars involved in the quantitative analysis of the cinematic medium, who investigate e.g. shot scale and length patterns, the cutting structure, luminosity and colour information or visual activity patterns (e.g., Cutting et al., 2011; Baxter, Khitrova and Tsivian, 2017; Heftberger, 2018; Bakels et al., 2020; Flueckiger and Halter, 2020; Hielscher, 2020; Pustu et al., 2020; Wang 2023), we mine and analyse film metadata, i.e., the verbal and numerical cultural information preserved in film databases. This approach has already proved useful in cultural evolutionary studies and cultural analytics, with scholars studying, among other things, changes in crew structure (Tinits and Sobchuk, 2020); institutional funding patterns (Van Beek and Willems, 2022); distribution circuits in film festivals (Zemaityte et al., 2023); remaking and cultural recycling practices (Stelmach, Hołobut and Rybicki, 2022) as well as the dissemination of selected themes in mainstream cinema (Viacom 2017).

In our presentation, we wish to discuss the interdisciplinary insights we gathered from a collaborative project that brought together a film scholar, translation scholar and information technologist. Based on the analysis of film metadata, we set out to explore global trends in film distribution, focusing on title translation practices around the world (Hołobut, Stelmach and Rapacz, forthcoming).

Film titles have so far attracted relatively little attention in film studies, with major quantitative advancements coming from entertainment industry economists and marketers investigating the impact of naming practices on box-office revenues (e.g. Xiao, Cheng, and Kim, 2021; Chung and Eoh, 2019; Bae and Kim, 2019). Most insights have been offered to date by linguists exploring the persuasive power of film titles (e.g. Dengler 1975) and translation scholars exploring linguistic and cultural shifts in title translation (or transcreation) practices (e.g. Santaemilia and Soler-Pardo, 2014; Ross, 2018; Yin 2009; Shokri 2014; Fakharzadeh 2022; Iliescu-Gheorghiu, 2016; Gabrić et al. 2022). The latter studies have been numerous, yet predominantly qualitative, limited to local film markets and narrow timeframes.

Our project was designed to offer a global diachronic perspective on title localisation patterns. To this end, we sourced title translations of two corpora of feature films released between 1950 and 2022, based on the metadata retrieved from the Internet Movie Database (IMDb). Our dataset contained, respectively:

  • all localised title variants (the so-called “AKAs”) of feature films of the Official Selection at Cannes Festival since 1950 (i.e., art films directed at cinephiles around the world and usually distributed by independents, 3,114 films and 59,953 localised title tokens in total);
  • annual top fifty productions distributed in most language versions since 1950, according to IMDb (3,650 films and 154,277 localised title tokens, distributed by major studios).

Each record in our dataset contained the following attributes:

  • original movie title
  • year of release
  • localised movie title – the title under which the movie was distributed in a given region
  • region of distribution
  • genre(s) of the movie
  • country of origin.

We analysed the dataset in order to test four preliminary hypotheses concerning:

  • the growing share of films distributed with non-translated titles (e.g. Dutch and Ecuadorian official titles of the American comedy My Big Fat Greek Wedding being exactly the same as the original version);
  • the growing share of films distributed with hybrid names that combine non-translated titles with added subheadings (e.g. the German variant of the title reading My Big Fat Greek Wedding – Hochzeit auf griechisch);
  • the presence of global and regional discrepancies in the ways festival and mainstream film titles are being localised for respective audiences;
  • the variation of localising trends relative to genre and region of distribution.

For our trend analysis, we used a rolling mean with a window of size five. Based on over 200,000 localized titles, the study confirmed only a few of our preliminary hypotheses. It showed a steady global increase in the annual share of films distributed internationally under their original titles – the share started at around 20% in the 1950s and has climbed to roughly 35% in recent years. It also showed a stable but insignificant share of titles that combine a non-translated original with target-language additions, thus undermining our preliminary assumption that such hybrids are gaining popularity around the world. We also discovered remarkable similarities in global trends concerning title translation for mainstream and arthouse productions.
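To make the trend computation concrete, the following minimal sketch shows how such a five-year rolling mean could be computed with pandas; the file name and column names (year, is_original) are illustrative assumptions, not the authors' actual data layout.

```python
# Sketch: annual share of non-translated titles, smoothed with a 5-year rolling mean.
# Assumes a hypothetical file "titles.csv" with columns: year, is_original (0/1).
import pandas as pd

df = pd.read_csv("titles.csv")

# Share of localised titles per year that are identical to the original title.
annual_share = df.groupby("year")["is_original"].mean()

# Centred rolling mean with a window of five years, as used for the trend analysis.
trend = annual_share.rolling(window=5, center=True).mean()

print(trend.loc[1950:1960])
```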

Since most movies included in the “top fifty” corpus are American exports (i.e., 66%, 2,402 films) accounting for two-thirds of localized title tokens (i.e., 104,378), the steady Americanization of titles used in their worldwide distribution illustrates the economic, cultural, and political hegemony of “Global Hollywood” (Miller 2007) or “Hollyworld” (Hozic 2001) and the spread of English as a lingua franca. Currently, most of the production and worldwide distribution of American films is controlled by media and entertainment conglomerates, such as Warner Bros Discovery, Sony Columbia, Paramount Global or The Walt Disney Company, which exert enormous influence on localisation trends and patterns (Ross 2018). Quite surprisingly, however, on a global scale, similar patterns are also visible in the translation of arthouse films, the production and distribution of which involve different stakeholders.

On a regional scale, however, major discrepancies are visible both in dominant localisation solutions and in approaches to festival and mainstream productions. With regard to the former, Central and Eastern Europe invariably prefer domesticating practices, while Asia Pacific or the Middle East and Africa prefer “non-translation” practices. As for the latter, in some parts of the world (Europe, South and Central America), title translation norms are similar regardless of the corpus; in others (Asia Pacific, North America, Middle East and Africa), they are diversified, possibly reflecting different stakeholders involved in the decision-making process (cf. Ross 2013; 2018).

Further research paths that we envisage for our project include an in-depth explanation of the growing non-translation practices in local contexts, more detailed regional breakdowns, and the automated fine-grained classification of translation strategies other than verbatim transfer and hybridization, which will be of special interest to translation scholars.



3:15pm - 3:30pm

Found in Translation: Sourcing parallel corpora for low-resource language pairs

Hinrik Hafsteinsson, Steinþór Steingrímsson

Árni Magnússon Institute for Icelandic Studies, Iceland

This paper describes the sourcing, processing, and application of parallel text data for Icelandic and Polish, demonstrating how a parallel corpus can be compiled for a language pair that has no available parallel data, by pivoting through a common language. We show the usefulness of the corpus by training and evaluating a machine translation (MT) model on the data. Iceland's linguistic landscape is evolving, with an increasing need for multilingual support due to the growing immigrant population. Polish, in particular, stands out as the language of the largest single minority in Iceland, underscoring the importance of this project.

Openly available parallel corpora for Icelandic are currently limited to English-Icelandic corpora, the most important being ParIce (Barkarson and Steingrímsson 2019), although that language pair is also included in a number of multilingual data collection projects mostly containing web-scraped texts. An ongoing project, `Mikilvægur orðaforði fyrir fjöltyngi og vélþýðingar' (en. Important Vocabulary for Multilinguality and Machine Translation), leverages openly available datasets to source parallel texts and lexicons for the languages of the main immigrant communities in the country, with the goal of using them to compile bilingual dictionaries as well as for MT and general NLP tasks. The languages sourced so far are Polish, Spanish, Tagalog, Thai, and Ukrainian. To demonstrate the wider project's methodology and applicability, we illustrate the methods and source datasets used specifically for sourcing Icelandic–Polish parallel text data, as Polish speakers are the largest single minority in Iceland.

In the case of Polish, we source our parallel texts from CommonCrawl Aligned (CCAligned, El-Kishky et al. 2020), CommonCrawl Matrix (CCMatrix, Schwenk et al. 2019b, Fan et al. 2021), No Language Left Behind (NLLB, Costa-jussà 2022), OpenSubtitles (Lison and Tiedemann 2016), ParaCrawl (versions 6 through 9, Bañón et al. 2020), TildeMODEL (Rosiz et al. 2017) and WikiMatrix (Schwenk et al. 2019). All of these datasets were sourced via OPUS (Tiedemann 2012). These datasets contain pre-aligned parallel texts for multiple languages, including English and Icelandic. We employ English as a pivot language in this context, which means that we first identify sentences in English that are translations of both Icelandic and Polish sentences. This approach is based on the assumption that if an Icelandic sentence and a Polish sentence have the same English translation, they are likely to convey the same meaning. The extraction process involves automatically comparing and matching English sentences across the Icelandic and Polish datasets. This pivoting step is not performed on each dataset in isolation; instead, we iterate through target language datasets (here English–Polish) and compare the English sentences to a concatenation of all English–Icelandic sentence pairs found in all the datasets. This comprehensive approach aims to capture any potential Icelandic-Polish sentence pair combinations, acknowledging the method's inherent greediness. Subsequent filtering steps address any resulting redundancy or extraneous data.
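The pivoting step can be illustrated with a minimal sketch; the tab-separated file names and the light normalisation of the English pivot sentences are assumptions for illustration, not the project's actual pipeline.

```python
# Sketch: pivot through English to pair Icelandic and Polish sentences.
# Assumes hypothetical tab-separated files "en-is.tsv" and "en-pl.tsv",
# with one "english<TAB>other" sentence pair per line.
from collections import defaultdict

def load_pairs(path):
    pairs = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue
            en, other = parts
            # Light normalisation of the English pivot sentence before matching.
            pairs[en.strip().lower()].append(other)
    return pairs

en_is = load_pairs("en-is.tsv")   # concatenation of all English-Icelandic pairs
en_pl = load_pairs("en-pl.tsv")   # one English-Polish dataset at a time

# If the same English sentence has both an Icelandic and a Polish translation,
# assume the Icelandic and Polish sentences convey the same meaning.
is_pl = [(is_sent, pl_sent)
         for en, pl_sents in en_pl.items() if en in en_is
         for is_sent in en_is[en]
         for pl_sent in pl_sents]

print(len(is_pl), "candidate Icelandic-Polish pairs")
```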

To ensure optimal coverage of the available data for each language, we combine our data with "pre-pivoted" datasets, which, in the case of Polish, are MultiCCAligned and MultiParacrawl, both of which are provided by OPUS.

Initial manual checks of the output sentence pairs were performed to ensure data quality, though a comprehensive manual review of each pair was beyond the scope of this project (automatic evaluation methods are detailed below). Given the potential overlap in the input datasets, we applied a deduplication filter to the output, ensuring each sentence pair is unique in the final dataset. Additionally, we implemented a blanket-filtering step: only sentences shorter than 2000 characters were retained, and each sentence had to have at least 60% of its characters belonging to its respective language's alphabet. This criterion serves as a basic assurance that each sentence pair accurately represents its designated languages, Icelandic and Polish, respectively. Ultimately, this methodology yielded a bilingual dataset comprising 3,138,529 sentence pairs.
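A minimal sketch of the deduplication and blanket-filtering step follows; the character sets standing in for the Icelandic and Polish alphabets are rough approximations used only for illustration.

```python
# Sketch: deduplication and blanket filtering of candidate sentence pairs.
# The alphabets below are rough approximations used only for illustration.
ICELANDIC = set("aábdðeéfghiíjklmnoóprstuúvxyýþæö")
POLISH = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")

def alpha_ratio(sentence, alphabet):
    # Fraction of characters belonging to the given alphabet
    # (spaces and punctuation count against the ratio in this simplification).
    text = sentence.lower()
    return sum(c in alphabet for c in text) / len(text) if text else 0.0

def keep(pair):
    is_sent, pl_sent = pair
    return (len(is_sent) < 2000 and len(pl_sent) < 2000        # length filter
            and alpha_ratio(is_sent, ICELANDIC) >= 0.6          # >= 60% Icelandic letters
            and alpha_ratio(pl_sent, POLISH) >= 0.6)            # >= 60% Polish letters

pairs = [
    ("Þetta er dæmi.", "To jest przykład."),
    ("Þetta er dæmi.", "To jest przykład."),   # duplicate, removed by set()
]
filtered = [p for p in set(pairs) if keep(p)]
print(len(filtered), "pairs kept")
```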

As a measure of the quality of this new bilingual dataset, we train an MT model and apply it to a standard task. To gauge the quality of the sentence pairs themselves, we use Language-agnostic BERT Sentence Embedding (LaBSE, Feng et al. 2022) to vectorize the sentences and give a rough estimate of each pair's similarity. This process scores each pair from 0 to 1 based on similarity, facilitating the creation of subsets for training MT models by removing sentence pairs likely to be incorrectly translated or otherwise detrimental for MT training:

  • 3,138,529 pairs (all sentences)
  • 3,094,917 pairs (LaBSE score ≥ 0.3)
  • 2,950,339 pairs (LaBSE score ≥ 0.5)
  • 2,505,558 pairs (LaBSE score ≥ 0.7)
  • 979,326 pairs (LaBSE score ≥ 0.9)
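Such LaBSE similarity scores could be computed along the following lines with the sentence-transformers library; the example sentences and the 0.7 cut-off shown are illustrative.

```python
# Sketch: score Icelandic-Polish pairs with LaBSE cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

is_sentences = ["Þetta er dæmi."]
pl_sentences = ["To jest przykład."]

emb_is = model.encode(is_sentences, convert_to_tensor=True, normalize_embeddings=True)
emb_pl = model.encode(pl_sentences, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between corresponding pairs (diagonal of the similarity matrix).
scores = util.cos_sim(emb_is, emb_pl).diagonal()

# Keep only pairs above a chosen threshold, e.g. 0.7 as in the subset used for MT training.
kept = [(s, p) for s, p, score in zip(is_sentences, pl_sentences, scores) if score >= 0.7]
```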

We found that by fine-tuning the mBART-50 model (Tang et al. 2020) for the Icelandic->Polish translation direction using only sentence pairs from our dataset with LaBSE scores above 0.7, we obtained a BLEU score (Papineni et al. 2002) of 13.0 on the Flores (Guzmán et al. 2019) evaluation set, just below the BLEU score of 13.3 obtained by the only previously published MT model for this language pair (Símonarson et al. 2022). Furthermore, the models trained on the subsets reveal insights into the relationship between data quality and translation accuracy, showing similar effects as demonstrated by Steingrímsson et al. (2023).
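The BLEU comparison could be reproduced roughly as follows with the sacrebleu library; the sentences below are placeholders rather than Flores data.

```python
# Sketch: corpus-level BLEU with sacrebleu (placeholder sentences, not the Flores set).
import sacrebleu

hypotheses = ["To jest przykładowe tłumaczenie."]       # MT output
references = [["To jest przykładowe tłumaczenie."]]     # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```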

This work has significant implications for real-world applications. Effective Icelandic–Polish translation models can facilitate communication in educational, governmental, and social contexts, directly benefiting the immigrant community and promoting cultural integration. The methodology and findings also contribute to the broader field of NLP, especially in language pairings involving lesser-studied languages.

Future work includes expanding our dataset to encompass more languages and refining our methodologies based on these initial findings. Additionally, exploring advanced data processing techniques to handle linguistic nuances and idiomatic expressions more effectively remains a priority.

In conclusion, this project not only addresses a critical need in Iceland's changing linguistic landscape but also contributes to the global effort in NLP research, particularly in enhancing machine translation for minority languages. Our approach, rooted in leveraging existing datasets and innovative processing techniques, paves the way for more inclusive and effective language technology solutions.



3:30pm - 3:45pm

The world in Norwegian: Glimpses from a digital exploration of bibliomigration patterns

Inger Hesjevoll Schmidt-Melbye1, Marcus Axelsson2, Siri Fürst Skogmo3

1NTNU, Norway; 2Østfold University College, Norway; 3Inland Norway University of Applied Sciences, Norway

Many important patterns in literary history are still poorly understood, because they weren’t easily grasped at the scale of individual reading (Underwood, 2017, np.)

Our project, The world in Norwegian, is positioned in the intersection of Digital Humanities, Translation Studies, Sociology of literature and Library development. Although methods and perspectives from Digital Humanities have advanced in many fields for several decades, they remain relatively unexplored within Translation Studies (Tanasescu, 2021; Wakabayashi 2019), especially in the context of smaller language cultures as is the case for Norwegian (Horvath, 2021; Spence & Brandao, 2021). We also see a potential in strengthening the collaboration between DH and research support from university libraries (Zhang, Liu & Mathews 2015; Zhang, Xue & Xue, 2021).

This study is a collaboration between translation scholars from three different Norwegian higher education institutions and research librarians from the Norwegian National Library and the Norwegian University of Science and Technology. Through this collaboration we contribute to enhancing university libraries’ knowledge and infrastructure in Digital Humanities in Norway. The project has received funding from the Norwegian National Library from August 2023 to August 2025.

The primary aim of our study is to investigate translation flows using digital methods and tools. We use Norway as a case in point and investigate the Norwegian literary import over the last 200 years (since 1800). A central concept in the project is bibliomigrancy or bibliomigration (Mani, 2014; Lindqvist, 2015) which is “an umbrella term that describes the migration of literary works in the form of books from one part of the world to the other” (Mani 2014, p. 289).

We partly draw on theories by Heilbron (1999; see also Sapiro, 2010), suggesting that the literary world can be described in terms of centres and peripheries and in open and closed literary systems. According to these theories, Norway is a periphery and an open literary system. This means that Norwegian literature, to a large extent, consists of translated works, and that translated works have been an important literary factor ever since the country gained its independence some 200 years ago. English dominates as the most common source language in Scandinavia and is sometimes described as a hyper-central language (Lindqvist 2015, p. 78-80; De Swaan 2001, p. 4-5). English has traditionally been used - and still is - in indirect translation as an intermediate language for lesser-known languages in Norway (Rindal et al. 1998).

The main material for the study consists of library catalogues, mainly the National library catalogue of Norway, but also metadata from international library catalogues. The DH lab at the Norwegian National Library has developed an app which collects information from the National library catalogue and, through searching for different bibliographic system criteria, generates an overview of translated fiction.

We developed the app to discover bibliomigration patterns for translated literature in Norway, gaining a more multifaceted picture of the source cultures and languages involved and going beyond the traditional confirmation of Anglo-American dominance.

Throughout the process, we (the researchers and the National Library) have been working closely to figure out on the one hand, which metadata we ideally want to collect from a Translation Studies perspective, and on the other hand, which metadata is actually possible to collect in large quantities through digital humanities tools.

One purpose of the app is to provide a visual representation of bibliomigration patterns by showing on a map which countries the translated literature originates from. This map will be open access and freely available to researchers, but our aim is also that the map serve more than one purpose and be useful to everyone who is interested in bibliomigration patterns and the sociology of literature – for example in public and school libraries, in the publishing sector and in the Norwegian Association of Translators.

The creation of the map poses challenges on at least two levels: First, how to collect the data which provides us with information about the publication place of the first edition of the original work, and second, how to visually represent diachronic data in a geopolitically shifting reality.

In addition to serving as the foundation of the map, the app will also be useful for example in finding and researching translations of works by a particular author, or by a particular translator. In the test stages, we have also found that the app can be used by librarians for information retrieval. The collaborative aspects of this project have sparked several outcomes that were not included in our initial proposal, and we foresee that we will experience more of this as the project progresses. In our paper, we will present the project as work in progress and demonstrate the digital resources in their current state, as well as discuss the challenges we experience.

 
3:00pm - 3:45pmSESSION#13: ARTIFICIAL INTELLIGENCE
Location: K-206 [2nd floor]
Session Chair: Matti La Mela, Uppsala University, Sweden
 
3:00pm - 3:15pm

A qualitative survey of archivist and technologist perspectives on the use of AI in archives

Larissa von Bychelberg1, Johannes Widegren2

1Uppsala University, Sweden; 2Linnaeus University, Sweden

The opportunities offered by artificial intelligence (AI) and machine learning (ML) for the archive sector have been addressed in several scholarly articles in recent years. Some of the articles describe projects (e.g. Carter et al., 2022; Han et al., 2022) featuring a collaboration between archivists, generally defined for this paper as persons working in the archival sector, and technologists, defined as persons with a professional technical background. Some other archival projects have been undertaken by computer scientists or digital humanities scholars without the involvement of archivists (e.g. Luthra et al., 2022). Providing a qualitative angle, some articles have looked at digital archivists’ opinions on the implementations of AI techniques in archives (e.g. Cushing & Osti, 2023). Finally, more general articles have been published, both by archivists and technologists on how AI may be implemented in the archival sector (e.g. Colavizza et al., 2022; Hutchinson, 2020; Sabharwal, 2017).

This paper presents a qualitative analysis of how the perspectives of archivists and technologists on AI in archives are presented in a selection of recent articles. The selection includes both articles describing case studies, such as project descriptions and interview studies, as well as literature reviews. The articles are categorized according to the type of project described or suggested: articles written from an archival perspective, articles written from a technologist perspective, and articles that describe or propose joint projects run by archivists and technologists in cooperation. Differences can be observed in these research articles regarding 1) how archival expertise is valued, 2) the proposed importance of archival theory for successful AI implementation and 3) the degree of influence ascribed to the archivists in collaborations.

The results indicate that viewpoints clearly differ depending on the professional role and background of the contributors to the articles. Articles written from a technologist perspective are more likely to criticize archivist work, and in some cases even blame them for obstructing access and “perpetuating silences” in archives (Luthra et al., 2022). Randby & Marciano (2020) describe the goals of AIC (Advanced Information Collaboratory), a project in which Randby is involved, as aiming towards information professionals learning “to think computationally and rapidly adapt new technologies”; the article goes through a computational workflow without further mentioning archivists on a larger scale.

Articles written by archivists, on the other hand, emphasize the importance of incorporating archival principles and taking advantage of archivists’ knowledge in the AI implementation process (Hutchinson, 2020). Cushing & Osti (2023) highlight the expertise of archivists and stress how their professional background is a requisite to control AI decision-making. An important distinction is also that articles published in archivist journals such as Murphy et al.’s (2015) highlight the perspective of archivists by portraying them as subjects (“archivists”) which is contrasted with a passive portrayal of the technological aspects (“technology” instead of “technologists” or “machine learning experts”). Other articles point out the challenges of using AI in archives; for example, Jaillant & Caputo (2022) mention “ethical challenges” and problems with bias in AI. Cushing & Osti (2023) describe how participants in their study (archival experts) are confident about the possibilities of AI, but also skeptical about practical integration into their work. Lee (2018) argues that certain AI tools are not supportive of the “holistic view” that archivists have of their work.

Those articles that present a mutual collaboration between archivists and technologists highlight the importance of combining the expertise of both fields for successful AI implementation (e.g. Murphy et al., 2015; Carter et al., 2022; Han et al., 2022). Poole & Garwood (2018), an information scientist and computer and informatics researcher respectively, call for the involvement of archivists and librarians in Digital Humanities projects. Tsabedze (2023) offers an important perspective on archival professionals in Eswatini, highlighting the need for digital education; Tsabedze argues that the interview participants in Eswatini are afraid to lose their jobs without proper training. Nevertheless, Marciano et al. (2018) are positive about the disciplines achieving more together than each would have on their own. As a contrasting addition, Jo & Gebru (2020) argue not for implementing AI technologies in archives but for implementing archival expertise in AI development. They propose that archival document collection practices can inform data collection in sociocultural ML, because archivists possess the language and procedures to address issues of consent, transparency, inclusivity etc.

Finally, Marciano et al. (2018) believe that archival expertise of the future will involve knowledge of digital systems. Sabharwal (2017) suggests that in the future, the distinction between archivists and technologists will not be as clear anymore. Jaillant & Caputo (2022) also highly encourage collaboration between archivists and technologists in order to address future challenges. Cushing & Osti (2023) suggest that AI technology in archives can even change how archivists describe “digital archival expertise”. In conclusion, our findings suggest that there is a great diversity of opinions; as Poole & Garwood (2018) suggest, more research is necessary on how the collaboration between professions, as well as implementation of AI in archives, can be improved.



3:15pm - 3:30pm

AI-Powered English: Insights from GPT-Enhanced Classrooms in South Iceland

Luis F. T. Meza1,2, Charlotte Eliza Wolff1

1University of Iceland; 2Fjölbrautaskóla Suðurlands

  1. Introduction

Responding to the DHNB's call for innovative approaches using digital tools and methods in teaching, we propose a paper and presentation that shares results and key insights from an upcoming educational workshop on the use of Artificial Intelligence, specifically Language Model Interfaces (LMIs) and Generative Pretrained Models (GPT), e.g. ChatGPT. The workshop, funded by Fjölbrautaskóla Suðurlands (South Iceland College), presents technology as an integral tool for supporting English language teaching and learning in the Icelandic educational context, where learners are adept at understanding English through media, yet struggle with formal academic language. Bridging this gap is imperative for equipping students with the necessary skills to achieve academic and professional success.

  2. Workshop Context

Recent research in Iceland (cf. Jeeves, 2022; Prinz & Arnbjörnsdóttir, 2021) has highlighted the influential role of informal media in English language acquisition, but the potential role of GPT technology as a formal education tool has not been thoroughly investigated. This study expands upon the existing research by exploring how actively engaging students with AI can help scaffold the execution of complex language tasks, such as academic writing, by providing a practical and interactive learning environment reflecting the digital nature of contemporary education.

The workshop develops teachers' capacity to apply GPT tools as both a supplementary aid and a substantial pedagogical tool. Emphasis is placed on facilitating students’ understanding of key communication principles by focusing on content – students’ own opinions, perceptions, and arguments – and discourse, i.e., the forms these ideas naturally take. We hypothesize that incorporating these natural language processing tools will enable students to navigate the intricate landscape of text as a social interaction, while reducing emphases on language formalism.

We aim to deepen students’ understanding of the “norms based on social purpose and the expectations of a community” (Prinz & Arnbjörnsdóttir, 2021) and empower them to convey their views, opinions, and observations successfully. In addition, we seek to empower educators to effectively implement this technology and facilitate meaningful communication with their students.

  3. GPT Language Models and their Applicability in Language Teaching.

When considering Generative Pretrained Transformer (GPT) models in education, it is imperative to differentiate between the overarching concept of Artificial Intelligence and the specific functionalities of GPT. AI represents a broad field with various applications, while GPT models, such as OpenAI's ChatGPT, represent a more focused area within this domain. These models belong to a larger framework of neural network-based software characterized by their attention mechanism. This mechanism ensures that the output generated is not random. Instead, it adheres to a specialized set of rules that increase the likelihood that the output produced is comprehensible to humans.

To illustrate, consider the sentence, “The ocean’s color is pink.” While grammatically correct, it conveys false information and thus lacks real-world reference. GPT models, trained on vast amounts of text, typically avoid such ‘hallucinations’ by correlating terms like ‘ocean’ and ‘blue’ via statistical analysis: this word-pair appears in closer proximity more frequently than the pair ‘ocean’ and ‘pink.’ Although often successful, this programming is not immune to generating false statements, underlining the importance of understanding its capabilities and limitations.

GPT models can be viewed as interactive texts that expect users to provide expert input or ‘prompts’, to then receive an output which they can respond to with further prompting. This interaction highlights three skills that teachers and students require:

  1. The ability to understand and describe tasks logically, such that a computer program can efficiently produce the required outcomes.

  2. The ability to critically evaluate the relevancy, accuracy, and utility of the responses generated.

  3. The ability to gauge the depth of users’ comprehension by analyzing the quality of both the prompts and the users’ responses to AI output.

The workshop offers teachers guidelines for developing these skills by demonstrating their usefulness for developing competencies outlined in the Icelandic National Curriculum. Given the interactive nature of GPT models and their text-processing capabilities, we aim to explore how these tools support higher-level tasks, i.e. essay writing and information processing. We furthermore aim to stimulate an evidence-based discussion on how teachers and students currently use this tool to support language development, with a view to ethical guidelines for general use.

  4. English Language Education at Fjölbrautaskóla Suðurlands (FSu).

English at FSu is taught across three levels of competence, using an approach that integrates students from different tracks into one class. This creates an inclusive learning environment where students with diverse educational and professional aspirations, neurodivergent diagnoses, and varying degrees of English proficiency learn together. This diversity underscores the importance of implementing malleable learning tools, like GPT models, into the curriculum.

  5. Workshop Outcomes and Future Directions.

Our paper will present the FSu workshop and the affordances and constraints observed when harnessing the potential of GPT language models for English language teaching. This initiative aligns with the innovative spirit called for by the DHNB while also addressing notable challenges in the Icelandic context. The digital intervention helps narrow the gap between students’ informal familiarity with English and the formal academic proficiency required for success in higher education and professional pursuits.

The workshop focuses on developing educators’ competencies for interacting with GPT models as the first step towards improving our learners’ experience. The ability to efficiently formulate prompts and critically evaluate AI-generated content is a fundamental aspect of digital literacy in the 21st century. Insights gained from this workshop can provide a valuable blueprint for how such technologies can be ethically and effectively integrated into Icelandic educational settings.

GPT modeling presents an opportunity to support a diverse range of learning needs. Strategies shared in the workshop support both language learning and democratic participation by making advanced learning tools accessible to all students, regardless of their background or proficiency. Thus, the FSu workshop can serve as a model for other institutions grappling with similar challenges in language education and contribute to the ongoing conversation about the role of AI in Icelandic education, particularly in terms of how educators can leverage technology to enhance learning and prepare students for the evolving demands of the 21st century.



3:30pm - 3:45pm

AI for improving access to archives pertaining to the Sámi: An overview of current approaches and future possibilities

Johannes Widegren

Linnaeus University, Sweden

Facilitating access to archives via metadata creation and enrichment can be a monumental task for large archival collections. The past decade has witnessed an increasing use of artificial intelligence (AI) and machine learning (ML) to assist in these tasks in automatic or semi-automatic workflows (Colavizza et al., 2022). While technologies such as named entity recognition and topic modeling are useful in many different archival contexts, they have found special relevance for colonial archives and archives pertaining to underrepresented communities. Recent projects have for example explored the possibilities of using AI and ML to optimize information discovery in under‑utilized, Holocaust‑related records (Carter et al., 2022), extract mentions of underrepresented people in Dutch colonial records (Luthra et al., 2023), and transform Indigenous and Spanish colonial archives originating from Mexico into Linked Open Data repositories (Candela et al., 2023).

This paper presents, firstly, an overview of current state-of-the-art approaches for AI in archives, and secondly, a project in progress intended to align with the first three goals of the ongoing InterPARES Trust AI project (Duranti et al., 2021): to identify specific AI technologies that can address critical records and archives challenges; determine the benefits and risks of using AI technologies on records and archives; and ensure that archival concepts and principles inform the development of responsible AI. The opportunities offered by these technologies are contrasted with the risks of using automated approaches in general and AI in particular for improving access to archives. The paper also discusses a related approach for increasing discoverability in collections, i.e. semantic search, and compares the pros and cons of these approaches. Furthermore, the potential uses of generative pre-trained transformers (GPTs) for both indexing and retrieval, and the risks associated with these, are addressed.

Bringing the overview to a Swedish context, the paper describes an ongoing project aiming to explore the risks and possibilities of using AI and ML to provide up-to-date, enriched metadata for Swedish archives pertaining to the Sámi, an Indigenous population of the Nordic countries. Managing Indigenous heritage material demands distinct sensitivities that acknowledge the colonial past and the voice of the community in creating and maintaining the cultural record. While AI technologies can be a means for promoting cultural heritage from a Sámi perspective, using them without properly addressing colonial aspects runs the risk of reifying and perpetuating the colonial dynamics of past history writing. When properly applied, however, the hope is that these technologies may be of assistance in the remediation of the digital cultural record to counter the colonial dynamics of the analog cultural record (see Risam, 2019).

The project aims to gauge the potential of these technologies for improving the searchability and usability of records pertaining to the Sámi, of both colonial and Indigenous origin. The research is intended to follow an iterative approach, with continuous evaluation and feedback from experts and end users ensuring the suitability of the metadata generated by the technologies and its usefulness in facilitating search. The expected outcome of the project as a whole is a framework for assessing how to safely and effectively implement selected AI techniques in archival and related institutions while maintaining authenticity.

 
4:00pm - 4:40pmPoster slam
Location: Háma
Session Chair: Olga Holownia, IIPC, United States of America
Session Chair: Katrine Gasser, Royal Danish Library, Denmark

The poster session will start with a poster slam: a 1-minute presentation of each poster in the plenary session in Skriða.

4:40pm - 5:30pmPoster session
Location: Háma
 

Insights into the Labour’s Memory Project Infrastructure

Raphaela Heil1, Theo Erbenius1, Isto Huvila2, Eva Pettersson3, Örjan Simonson1, Olle Sköld2

1Popular Movements’ Archive Uppsala, Sweden; 2Department of ALM, Uppsala University, Sweden; 3Department of Linguistics and Philology, Uppsala University, Sweden

The Labour’s Memory (LM) project aims to make the annual and financial reports from the years 1880 to 2020 from local, regional and national Swedish blue-collar trade union organisations, as well as their international umbrella organisations, the International Trade Secretariats (ITS) and the International Confederation of Free Trade Unions (ICFTU), digitally available to researchers and trade union organisations via a dedicated web platform. The report documents fulfil a similar purpose for organisations worldwide and can therefore provide a basis for comparison and serve as entryways to the life and work of trade unions.

The reports from the various organisations are held and digitised at the collaborating archives: the Swedish Labour Movement’s Archive and Library (ARAB, Stockholm, Sweden), the Popular Movements’ Archive in Uppsala (FAC, Uppsala, Sweden), the Archive of Social Democracy (AdSD, Bonn, Germany) and the International Institute of Social History (IISH, Amsterdam, the Netherlands). The presentation of the digitised reports on the platform and the users’ experiences are enhanced via computational linguistics, in the form of spelling normalisation and named entity recognition to facilitate full-text and faceted searches; computerised image processing, making the report contents digitally available via automatic text recognition; and user-driven design methodologies, for example eliciting the needs and goals of platform users. This expertise is provided by researchers from the Department of Linguistics and Philology, the Department of Information Technology and the Department of ALM at Uppsala University.

The purpose of this poster is to present the Labour’s Memory project to the wider audience of the DHNB community and share experiences from the process of developing the digital infrastructure with researchers and engineers. Besides providing an update on the progress of the project, the poster will focus on sharing insights into the infrastructural design choices made during the process to inform future work on digital platforms aimed at a mixed audience.

In LM, the Omeka S system (https://omeka.org/s/) was chosen as the framework for the web platform, through which the digitised material will be made available. For each report, the metadata and transcriptions are delivered directly to the platform from each of the four archiving institutions, in order to leverage Omeka’s search and filtering capabilities. Besides this, the respective digital images are made available via the International Image Interoperability Framework (IIIF, https://iiif.io/), which ensures a unified and well-defined interface for accessing whole collections, individual pages or even portions of an image, regardless of each institution’s individual implementation. In the case of LM, the choice of IIIF furthermore allows for the reuse of existing systems (IISH, AdSD) and facilitates the implementation of an infrastructure that can be reused for future projects (ARAB, FAC). The poster will highlight the advantages of an Omeka S-based approach, how the system was configured and customised for LM and why. Moreover, the limitations of using an existing framework, as compared to developing a custom platform, will be discussed.
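As an illustration of the kind of access IIIF provides, the sketch below requests a scaled page image using the IIIF Image API URL pattern; the endpoint and identifier are hypothetical placeholders, not the project's actual servers.

```python
# Sketch: fetch a scaled page image via the IIIF Image API
# ({identifier}/{region}/{size}/{rotation}/{quality}.{format}).
# The base URL and identifier below are hypothetical placeholders.
import requests

base = "https://iiif.example.org/iiif"     # hypothetical IIIF Image API endpoint
identifier = "report-1921-page-001"        # hypothetical image identifier

# Full region, scaled to 800 pixels wide, no rotation, default quality, JPEG.
url = f"{base}/{identifier}/full/800,/0/default.jpg"

response = requests.get(url, timeout=30)
response.raise_for_status()
with open("page.jpg", "wb") as f:
    f.write(response.content)
```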

Labour’s Memory is funded by Riksbankens Jubileumsfond under grant agreement IN20-0040.



Letter Collections - from Word to Web

Senka Drobac1, Hanna-Leena Paloposki2, Ilona Pikkanen2

1University of Helsinki, Finland; 2The Finnish Literature Society, Finland

This paper describes the transformation of letter catalogues written as Word documents into Resource Description Framework (RDF) data for publication on the Semantic Web. Part of the digital humanities consortium project Constellations of Correspondence (CoCo) (Tuominen et al. 2022, Drobac et al. 2023a), this work aims to aggregate, harmonize, link, enrich, and publish 19th-century epistolary metadata from various Finnish Cultural Heritage (CH) organizations. A key challenge in this task is the catalogues' format: although well-suited for human use, their inconsistent formatting and structure pose difficulties for computational processing.

Many CH organizations, including the National Library of Finland and the Swedish Literature Society in Finland, keep their epistolary collections in traditional Word documents. Typically a file begins with the record creator's name and biography, followed by details of documents in their archive. The Letter Exchange section is usually divided into subcategories such as Received Letters, Sent Letters, Letter Concepts, and exchanges between other individuals. Each subsection contains varying details about correspondences. The creation of these catalogues over decades by different archivists has led to many exceptions in both formatting and structure.

Automatically parsing these files is complicated due to the variation in document structure and line-level inconsistencies. Sometimes, documents include information about multiple persons, breaking the typical structure format. For instance, the Åkerman-Voipio family archive contains records of 19 different persons. Personal archives can also include documents of family members, as seen in the Frans Victor Hannus archive, which includes records of his wife's correspondence. Additionally, the lack of standard naming conventions in subsections and the possibility of letter correspondence information spanning multiple lines with additional comments or locations add to the complexity.

To confront these challenges, our initial step was to manually bring the documents into a uniform format. Research assistants reviewed the documents for inconsistencies and standardized the format for automatic parsing. This process involved separating catalogues with multiple main archival persons, harmonizing subsection titles, and introducing specific markers for line breaks and comments. Our goal was to standardize sections, especially those involving correspondence between other actors, replacing diverse formats with a consistent one.

Following this groundwork, we created a rule-based parser capable of reading standardized Word documents. It extracts essential information such as actors (sender, recipient), dates of sending, amount of sent letters, and archival information. Whenever available, we also capture biographical information on persons. On occasion, the parser also identifies details on mentioned persons and places, letter types, or other relevant information. Once all required information is extracted, it is fed into the transformation pipeline. This pipeline converts the data into RDF following the CoCo model (Drobac et al. 2023b) and publishes it on the Semantic Web portal.
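A minimal sketch of such rule-based parsing and RDF conversion, using python-docx and rdflib, is given below; the line format, regular expression, and namespace are simplified assumptions rather than the CoCo data model itself.

```python
# Sketch: parse a standardised letter-catalogue line and emit RDF triples.
# Assumes hypothetical lines like "Anna Virtanen to Jean Sibelius, 1901, 3 letters".
import re
from docx import Document
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/coco/")   # placeholder namespace, not the CoCo model
LINE = re.compile(r"^(?P<sender>.+?) to (?P<recipient>.+?), (?P<year>\d{4}), (?P<count>\d+) letters?$")

g = Graph()
doc = Document("catalogue.docx")             # a standardised Word catalogue

for i, para in enumerate(doc.paragraphs):
    m = LINE.match(para.text.strip())
    if not m:
        continue
    corr = EX[f"correspondence/{i}"]
    g.add((corr, EX.sender, Literal(m["sender"])))
    g.add((corr, EX.recipient, Literal(m["recipient"])))
    g.add((corr, EX.year, Literal(int(m["year"]))))
    g.add((corr, EX.letterCount, Literal(int(m["count"]))))

g.serialize("letters.ttl", format="turtle")
```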

This combination of manual and rule-based automatic processing ensures accurate data handling, making the process both reliable and trustworthy. It is a crucial step in making these historical documents accessible on the Semantic Web, thereby enhancing their utility for academic research and public interest.

This poster will present the key aspects of our methodology, challenges faced, and the innovative solutions we implemented.



Large language models to supercharge digital humanities

Andres Karjus1,2

1Tallinn University; 2Estonian Business School

The increasing capacities of large language models present an unprecedented opportunity to scale up data analytics in the humanities and social sciences, augmenting and automating qualitative analytic tasks previously typically allocated to human labor. This contribution describes a systematic qual-quant mixed methods framework to harness and combine expert knowledge, machine scalability, and rigorous quantification, with attention to transparency and replicability. Seventeen machine-assisted case studies are showcased as proof of concept. These cover linguistic and discourse analysis, lexical semantic change detection, interview analysis, historical event cause inference and text mining, detection of political stance, text and idea reuse, genre composition in literature and film, social network inference, automated lexicography, missing metadata augmentation, and multimodal visual cultural analytics. In contrast to the focus on English in the emerging LLM applicability literature, many examples here deal with scenarios involving smaller languages and historical texts prone to digitization distortions. In all but the most difficult tasks requiring expert knowledge, generative LLMs can demonstrably serve as viable research instruments or artificial assistants, reflecting recent LLM applicability research (cf. Gilardi et al 2023, Ziems et al. 2023). Machines as well as humans may contain errors and variation, but the agreement rate can and should be accounted for in subsequent statistical modeling; a bootstrapping approach is discussed. The replications among the case studies (Mulder et al. 2023, Sobchuk et al. 2023, Kanger et al. 2022) illustrate how tasks previously requiring potentially months of team effort and complex computational pipelines can now be accomplished by an LLM-assisted scholar in a fraction of the time.
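The bootstrapping idea for handling disagreement between human and machine annotations can be illustrated with a small sketch; the labels below are invented for illustration.

```python
# Sketch: bootstrap a confidence interval for human-LLM agreement (invented labels).
import numpy as np

rng = np.random.default_rng(0)
human = np.array(["pos", "neg", "pos", "neu", "pos", "neg", "neu", "pos"])
model = np.array(["pos", "neg", "neu", "neu", "pos", "pos", "neu", "pos"])

agreement = (human == model).astype(float)

# Resample items with replacement and recompute the agreement rate each time.
boot = [rng.choice(agreement, size=len(agreement), replace=True).mean()
        for _ in range(10_000)]

low, high = np.percentile(boot, [2.5, 97.5])
print(f"agreement = {agreement.mean():.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```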



Social Media Analysis of Public Reactions to the Israel-Gaza War: Insights from Facebook and Instagram

Wajdi Zaghouani2, Anissa Jrad1

1HBKU, Qatar; 2HBKU, Qatar

The Israel-Gaza War remains a highly intricate and polarizing issue that captures global attention. This study delves into the dynamics of public reactions on Facebook and Instagram, platforms pivotal in shaping the discourse around this geopolitical event. We analyze user-generated content to uncover how individuals react to and engage with posts related to the Israel-Gaza War, and what insights can be gleaned from their interactions.

Research Question: How do users on Facebook and Instagram engage with and respond to content about the Israel-Gaza War? What patterns emerge in their interactions, and how do these reflect broader public sentiment and discourse?

Methodology: Utilizing CrowdTangle, a comprehensive social media analytics tool, we compiled a large dataset of posts from Facebook and Instagram. Our collection strategy involved keyword searches and hashtag tracking (e.g., #Israel, #Gaza, #PeaceInTheMiddleEast) to ensure a diverse representation of perspectives—from news coverage and official communications to grassroots activism and personal narratives. We meticulously documented user interactions, including likes, shares, comments, and emotional reactions, to analyze engagement patterns.
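As an illustration of this kind of keyword and hashtag filtering, the sketch below operates on a hypothetical CSV export rather than the CrowdTangle API itself; the file and column names are assumptions.

```python
# Sketch: filter an exported post dataset by hashtags and summarise engagement.
# Assumes a hypothetical CSV export with columns: message, platform, likes, shares, comments.
import pandas as pd

HASHTAGS = ("#israel", "#gaza", "#peaceinthemiddleeast")

posts = pd.read_csv("posts_export.csv")
mask = posts["message"].str.lower().str.contains("|".join(HASHTAGS), na=False)
relevant = posts[mask]

# Engagement summary per platform.
summary = relevant.groupby("platform")[["likes", "shares", "comments"]].sum()
print(summary)
```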

Findings:

Temporal Patterns: Examination of post and reaction distributions revealed spikes in activity following key events like bombings or attacks, reflecting an initial shock and engagement that brings graphic images and major announcements to the forefront of global attention. Over time, certain hashtags or narratives gain traction, maintaining sustained discussions that influence public perception and policy discussions.

Sentiment Analysis: Sentiment analysis of user comments and reactions indicated significant negative sentiments towards the actions of both sides, especially highlighting issues like hospital bombing, censorship, misinformation, and humanitarian impacts. Sentiments can shift rapidly in response to new incidents or information disclosures. Moreover, support and opposition for each party vary widely across different social media platforms and user communities, with some showing a higher prevalence of pro-Palestinian content while others display more balanced or varied political leanings.

User Engagement: Analysis of engagement metrics showed that content depicting graphic imagery or major updates tends to receive heightened interaction, indicating user priorities and the impact of visual content in activism.

Influential Voices: We identified key influencers, including activists, media outlets, and notable personalities, who significantly shape the narrative and mobilize engagement through their extensive reach.

Geographical Analysis: Geotagged data provided insight into regional sentiment variations, underscoring how local contexts influence perceptions of the war.

This research elucidates the critical role of social media in facilitating dialogue, spreading information, and fostering collective expression during complex geopolitical events. By deploying a robust methodology and an extensive dataset, our study enhances understanding of public reactions and the influence of social media on contemporary discourse and activism surrounding the Israel-Gaza War on key platforms like Facebook and Instagram.

Acknowledgments

This work was made possible by grant NPRP14C-0916-210015 / MARSAD Sub-Project from the Qatar National Research Fund / Qatar Research Development and Innovation Council (QRDI). The contents herein reflect the work and are solely the authors’ responsibility.

Zaghouani-Social Media Analysis of Public Reactions to the Israel-Gaza War-233.pdf


MARSAD Observatory: Monitoring and Analyzing Social Networks Topics in the MENA Region

Wajdi Zaghouani

Hamad Bin Khalifa University, Qatar

The MARSAD Observatory aims to revolutionize social media data monitoring and analysis in the Middle East and North Africa (MENA) region. This live social media observatory provides real-time insights into the dynamic digital discourse in MENA, serving individuals and institutions. MARSAD takes a multidisciplinary approach to offer comprehensive data on trending topics, sentiments, emotions, and nuanced insights into gendered and generational dynamics within Qatar's digital public sphere.

The project's objectives encompass a diverse range of ambitious goals. Objective 1 involves creating a large and balanced annotated dataset that covers multiple Arabic dialects and regions, addressing a crucial gap in current data resources. Objective 2 focuses on developing an AI-powered Social Media Monitoring Platform to enable real-time data analysis, providing users with an intuitive interface to navigate the MENA region's digital landscape.

MARSAD's commitment to bridging language barriers is a distinguishing feature. It will be the first social media monitoring tool to offer support for Arabic dialects, capturing diverse linguistic nuances prevalent in the region and making it a unique and indispensable resource.

Beyond real-time monitoring, MARSAD supports archiving topics (Objective 3), ensuring that historical public datasets remain accessible. This feature enables researchers to analyze the evolution of digital discourse over time, crucial for studying online narratives, social trends, and the impact of digital activism.

Fostering inclusivity within the digital public sphere is central to Objective 4. MARSAD investigates gendered and generational dynamics in Qatar's digital landscape, with a focus on marginalized groups. Informed by feminist theories, the project seeks to illuminate both empowering and challenging aspects of digital activism, promoting more inclusive digital cultures in MENA.

Methodologically, MARSAD employs cutting-edge techniques in Artificial Intelligence (AI) and Natural Language Processing (NLP). Data collection relies on the X API, extracting live public X streams from MENA. Premium access to the X API ensures historical data availability. A substantial sample undergoes meticulous annotation for sentiments, emotions, hate speech, irony, sarcasm, and stance.

To achieve Objective 1, Python scripts automate data collection, focusing on specific keywords and posts in standard Arabic and dialects. Skilled annotators follow comprehensive guidelines, ensuring high accuracy through blind and double annotations.

Objective 2 integrates advanced machine learning and deep learning techniques using the annotated dataset. Techniques include LSTM, Bi-LSTM, CNNs, RNNs, and transformers like BERT and GPT-2.
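One of the transformer-based classifiers mentioned above could be set up roughly as follows with the Hugging Face libraries; the multilingual base model, the three-way sentiment labels, and the toy examples are assumptions rather than MARSAD's actual configuration.

```python
# Sketch: fine-tune a multilingual BERT for 3-way sentiment (illustrative setup only).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "bert-base-multilingual-cased"   # placeholder; an Arabic-specific model could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

# Tiny invented examples standing in for the annotated MARSAD dataset.
data = Dataset.from_dict({
    "text": ["خبر رائع", "هذا مزعج جدا", "اجتماع اليوم"],
    "label": [2, 0, 1],                  # 0 = negative, 1 = neutral, 2 = positive
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length",
                                    max_length=64), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="marsad-sentiment", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```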

The project culminates in a dynamic real-time social media monitor, offering daily trend exploration, topic and hashtag searches, and interaction visualization. An Application Programming Interface (API) facilitates external application integration. MARSAD includes a user feedback feature, contributing to ongoing annotation and data enhancement efforts.

In conclusion, the MARSAD Observatory advances our understanding of MENA's digital discourse. Through innovative data collection, analysis, and inclusivity, MARSAD empowers researchers, policymakers, and the community to navigate and shape the region's digital landscape effectively. Interdisciplinary collaboration, cutting-edge technology, and a critical perspective drive MARSAD to contribute significantly to the study of digital communication in MENA, promoting more inclusive and informed digital cultures.

This work was made possible by grant NPRP14C-0916-210015 / MARSAD from Qatar National Research Fund.

Zaghouani-MARSAD Observatory-232.pdf


Uppsala Runestaff Database

Michael Dunn

Uppsala University, Sweden

Runestaves are perpetual calendars made and used in Sweden and surrounding areas from the turn of the millennium until they were superseded by printed almanacs in the 17th or 18th century (Hallonquist 1994). They were usually carved in wood and took various forms, most commonly staff, sword, paddle, or (wooden) book. The runestaves were marked with sequences of symbols, mostly from the younger futhark runic alphabet, from which Sundays, the new moons, and various fixed feasts could be read for any year, and which allowed a simple calculation of Easter. As calendrical calculators, runestaves represent folk-scientific instruments of considerable complexity (Halonen 2020). The Uppsala Runestaff database contains aligned transcriptions of more than 600 runestaves, representing more than half of the runestaves known to exist. This digital tool is designed primarily to facilitate analysis of runestaves as texts, using stemmatological and cultural evolutionary approaches (Roelli 2020). Analysis of the textual content of the runestaves corpus provides insight into how they were used, allows reconstruction of regional and temporal traditions in their manufacture, and gives evidence of the manner in which the knowledge required to make and use them was transmitted.



Using ChatGPT for (semi-) automatic subject indexing of different document types

Johannes Widegren1, Koraljka Golub1, Jue Wang2

1Linnaeus University, Sweden; 2University of Chinese Academy of Sciences

We are currently in a phase where it seems that new applications for large language models (LLMs) in general and generative pre-trained transformers (GPTs) in particular are tested every day. Examples are as diverse as automated data mining for building energy management (C. Zhang et al., 2024), evaluating the accuracy of differential-diagnosis lists for clinical vignettes (Hirosawa et al., 2023), and human-machine-augmented intelligent vehicles (J. Zhang et al., 2023). They can also be used to extract structured information from unstructured text (Söderström, 2023). This poster presents a pilot study on one such application, the potential use of OpenAI’s ChatGPT for automatic subject indexing of archival documents in Swedish, Swedish LGBTQ fiction and Chinese fiction. The accuracy of the assigned subject index terms is compared with the output from ANNIF (Suominen et al., 2022), an established automatic subject indexing software used in libraries.

The results display an impressive degree of accuracy for the subject index terms assigned by ChatGPT, but challenges have been identified in all three document types. For example, the appropriateness of the terms for historical text is highly questionable at times. The terms assigned by ANNIF, in contrast, are drawn from a controlled vocabulary, which ensures that they have been manually selected as suitable subject index terms. The pilot study shows that it is feasible to run the index terms suggested by ChatGPT through ANNIF to get index terms from a controlled vocabulary while harnessing ChatGPT’s state-of-the-art natural language understanding. This presents intriguing opportunities for implementing GPTs in archival and library cataloging workflows. Semi-automatic approaches and manual checks are still to be preferred, however, in order to maintain the authenticity of the generated metadata.
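A minimal sketch of the pipeline described above (ChatGPT suggestions mapped onto a controlled vocabulary via ANNIF) might look as follows. The model name, the local ANNIF base URL and the project id are assumptions for illustration only, and the suggest endpoint path reflects our reading of ANNIF's REST API rather than a guaranteed interface.

  import requests
  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  def gpt_keywords(text):
      """Ask the chat model for free-text subject keywords (illustrative prompt)."""
      resp = client.chat.completions.create(
          model="gpt-4o-mini",  # placeholder model name
          messages=[{"role": "user",
                     "content": "Suggest 5 subject keywords for this text:\n" + text}],
      )
      return resp.choices[0].message.content

  def annif_suggest(text, project_id="placeholder-project"):
      """Map free text onto a controlled vocabulary via a (hypothetical) local ANNIF instance."""
      r = requests.post(f"http://localhost:5000/v1/projects/{project_id}/suggest",
                        data={"text": text})
      return r.json().get("results", [])

  document = "En roman om ..."  # placeholder document text
  print(annif_suggest(gpt_keywords(document)))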

Widegren-Using ChatGPT for (semi-) automatic subject indexing-155.pdf


A new resource of Icelandic sagas: Digitizing normalized scholarly editions and enhancing textual data

Ellert Þór Jóhannsson1, Þórður Ingi Guðjónsson2, Finnur Ágúst Ingimundarson1

1Árni Magnússon Institute for Icelandic studies, Iceland; 2Old Icelandic Text Society

Introduction

This poster presents a new project aimed at digitizing Icelandic saga text editions and further enhancing the resulting textual data. This includes implementing lemmatization processes, creating a corpus, linking the lemmas with a lexicographic resource, and constructing a comprehensive inflectional database. The aim is to make the material accessible online on a user-friendly platform. The initiative seeks to provide scholars, researchers, and language enthusiasts with a dynamic resource for exploring and understanding Old Norse vocabulary and linguistic structures in a detailed and straightforward manner.

Background

Icelandic family sagas represent the peak of Icelandic medieval writing culture. Until now, these sagas have only been freely available online in a normalized form compatible with Modern Icelandic language standards, both orthographically and morphologically, eliminating various nuances present in their original Old Norse form.

Íslenska fornritafélagið (Old Icelandic Text Society) was founded in 1928 and has ever since worked on publishing Icelandic medieval texts in print with detailed introductions and textual commentary. The editions of the society use their own normalization standard of the language as it was around the year 1200. Most of the print volumes were published before the electronic processing of texts. The current project aims at facilitating access to the texts in a state closer to their original linguistic form and at creating a versatile resource for the study of Old Norse in a standardized form.

Methods and workflow

  1. Digitization of text editions:
  • Employ Optical Character Recognition (OCR) and text-processing techniques to convert printed editions into machine-readable formats.
  • Enhance accessibility by providing an online platform to explore the digitized texts.
  2. Lemmatization:
  • Implement natural language processing algorithms to identify and categorize word forms within the texts.
  • Facilitate linguistic analysis by generating lemmatized versions of the texts, revealing the base or dictionary forms of words.
  3. Linking to The Dictionary of Old Norse Prose (ONP):
  • Integrate the lemmatized texts with ONP, creating a link between the sagas and an authoritative lexicographic resource.
  • Enable users to cross-reference words with their definitions, contextualizing the language within a linguistic framework and the broader context of other medieval text genres.
  4. Inflectional Database:
  • Develop a paradigmatic database showcasing different inflectional forms for each headword, allowing users to explore morphological variations (a minimal sketch of this grouping follows the list).
  • Include unattested forms in the inflectional descriptions, providing a comprehensive view of potential linguistic forms and expanding the understanding of Old Norse grammar.
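As a purely illustrative example of the paradigm grouping mentioned in step 4, the following Python sketch builds an inflectional index from lemmatized tokens. The token records, lemmas and morphological tags are invented placeholders, not project data or the project's actual data model.

  from collections import defaultdict

  # Hypothetical lemmatized tokens: (surface form, lemma, morphological tag).
  tokens = [
      ("konungr", "konungr", "noun nom sg"),
      ("konungi", "konungr", "noun dat sg"),
      ("konunga", "konungr", "noun acc pl"),
  ]

  # Build a paradigm index: lemma -> {morphological tag: attested forms}.
  paradigms = defaultdict(lambda: defaultdict(set))
  for form, lemma, tag in tokens:
      paradigms[lemma][tag].add(form)

  for lemma, cells in paradigms.items():
      print(lemma, {tag: sorted(forms) for tag, forms in cells.items()})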

Results

This project contributes to the field of Old Norse studies by offering a unified digital platform that combines digitized sagas, a lemmatized corpus, and a link to a comprehensive dictionary. The inclusion of inflectional descriptions offers a useful resource to Old Norse learners and researchers, promoting a more nuanced understanding of Old Norse morphology. The platform's user-friendly interface will cater to a diverse audience, fostering research and facilitating the exploration of Old Norse language resources. As a result, important texts will be more accessible and engaging for scholars, learners and enthusiasts alike.



404 Not Found. Dire Straits and Safe Havens for Digital Scholarly Editions in Norway

Annika Rockenberger, Johanne Emilie Christensen, Federico Aurora

University of Oslo Library, Norway

Since its inaugural conference in Oslo in 2016, presentations and discussions about digital scholarly editions (DSEs) at DHNB conferences have dwindled. The "big names" and national icons (Henrik Ibsen, N.F.S. Grundtvig, S. Kierkegaard, Strindberg, Z. Topelius, L. Holberg etc.) have been published, and the eagerness and technological momentum of the earlier days of DSEs have come to a halt. We are now in a situation where even the larger, previously well-funded DSE projects face issues with legacy code and systems (Evensen 2020), maintenance, research software engineer/developer knowledge transfer, necessary back-end and front-end updates, and the institutional requirement to make research data FAIR, while no funds or research time are allocated (Baunvig et al. 2023). Not to mention the many small DSE projects in which individual researchers put much effort, knowledge, and expertise into creating valuable scholarly resources that are now virtually useless because they can no longer be accessed.

Against the backdrop of this serious situation, the University of Oslo Library has launched an initiative to systematically investigate the "state of affairs" of DSEs in Norway and chart a route to a sustainable national infrastructure for digital editions.

Building and fostering a network of interdisciplinary researchers, software engineers/developers, and cultural heritage specialists is one measure for achieving sustainability. With the network, we aim to retain and spread knowledge about DSEs, their technical infrastructure, and local solutions for hosting and maintenance. We believe the key to keeping valuable humanities data in the form of DSEs is a strong community that is interested in, and can lobby for, allocating funds and resources to an infrastructure fit for long-term archiving and accessibility of editions.

Furthermore, we believe that sustainability can only be achieved when DSEs are conceptualized, planned, developed, and published within a realistic setting and with as much clarity about standards for data, software, systems, and maintenance requirements as possible. We will thus develop a set of recommendations for researchers at the University of Oslo and beyond, built on the outcomes of a feasibility study we are doing in the fall of 2024 and an in-depth survey of the state of the art of DSEs in the spring of 2024.

In our poster, we will provide an overview of the project - its background and aims - and will highlight our work with (a) community building as a key to sustainability, (b) the design and expected outcomes of our DSE survey, and (c) the design of our feasibility study for implementing the lessons learned into a national infrastructure.



Collaborative Infrastructure as a Disruptive Force for Interdisciplinary Digital Scholarship, Illustrated by Use Cases of the Transkribus Stakeholder Platform.

Andy Stauder1, Annika Rockenberger2, Minna Kaukonen3, Bragi Þorgrímur Ólafsson4, Unnar Ingvarsson5, Therese Foldvik6, Johanne Emilie Christensen2

1READ-COOP SCE, Austria; 2University of Oslo Library; 3National Library of Finland; 4National and University Library of Iceland; 5National Archives of Iceland; 6University of Oslo

Introduction:

We explore the positive feedback loop created by shared infrastructure that enables collaboration in scholarly research. Collaboration can take various forms, including data sharing, joint AI-model training, direct collaboration, and education. Each of these forms contributes to the loop, leading to improved capabilities and usefulness of the collaborative infrastructure, making collections of documents suddenly more valuable and increasing the attractiveness of scanning projects for memory institutions such as museums, libraries and archives. This leads to more interest in the infrastructure, which in turn makes it more useful, and so forth. This positive feedback loop leads to increased study of original sources and replicability of studies, i.e., it strengthens scholarly quality criteria. The framework for this collaborative infrastructure is the READ co-operative, a stakeholder-oriented rather than profit-oriented social business founded for the purpose of maintaining and further developing the technology and community built up during two EU-funded academic research projects. The socio-technological platform that was built during these projects, and is now widely known in the academic community and beyond, is called Transkribus.

Data Sharing:

Data sharing is a fundamental aspect of collaboration in scholarly research. By sharing data, researchers can build upon each other's work, leading to new insights and discoveries. This process creates a positive feedback loop where the more data is shared, the more research can be conducted, and the more knowledge is generated. The same is true for training data that is fed into recognition models for handwritten text recognition, natural language processing and information extraction.

Direct Collaboration:

Direct collaboration involves researchers working together on a common project, e.g. the transcription, annotation, statistical analysis or digital publication of original sources. This type of collaboration can lead to the development of new ideas, methods, and theories. It also fosters a sense of community and support among researchers, which can further enhance the research process. The discussed Transkribus software platform fosters this type of collaboration through its cloud-based approach.

Education:

Education is a critical component of collaboration in scholarly research. By educating students and early-career researchers on the importance of collaboration, including collaboration tools and infrastructure, we can create a new generation of researchers committed to sharing data and working together. This will further strengthen the positive feedback loop of collaboration and lead to even greater advances in scholarly research.

Human-Artificial-Intelligence Positive Feedback Loop:

The combination of networked human intelligence and artificial intelligence creates a positive feedback loop that amplifies the benefits of collaboration. When these two patterns of information processing are combined, they can create systems that are more efficient, effective, and adaptable than either pattern alone. Not only does this lead to greater quantities of historical data that can be processed, but also to qualitatively new insights.

Compounding Effect:

The positive feedback loop of collaboration has a compounding effect, leading to exponential growth in the amount of knowledge generated. As more researchers collaborate, more data is shared, better AI models trained, more research conducted, and more knowledge produced. This cycle continues to repeat itself, leading to a rapid expansion of human knowledge, across disciplines.

Practical Use Cases:

The collaborative Transkribus infrastructure has applications in many disciplines, in particular in the digital humanities, and, among others, in the Nordic and Baltic countries. A few of them are presented in this paper, answering the following questions: A) What was the general scope and purpose of the project? B) If the project coordinator is a memory institution, how has it collaborated with researchers and research institutions; or, if the project coordinator is a researcher, research team or research institution, how have they collaborated with memory institutions? C) What role have Transkribus and the co-operative played in this?

University of Oslo Library

A)

  1. Transcribing two Early Modern prints for a bilingual digital scholarly edition. The Ethica Complementoria, together with the Tranchierbuch, in the German version from 1674 and its Danish translation from 1678. Public/shared models from the Platform were re-used and a dedicated model was trained for German print to minimise manual corrections. Export in XML/TEI will form the basis of a digital edition.

  2. Transcribing the private correspondence of Christopher Hansteen (Norwegian astronomer), especially the letters from the expedition to Siberia in the late 1820s. This collaboration with the National Library is re-using one of their models and customising it for Hansteen’s hand. It will be re-used to transcribe the professional correspondence held at the Museum for University History and the History of Science, University of Oslo, which collaborates with the University Library for its digitisation efforts.

  3. Planned: Transcriptions of the Library’s East Asian Special Collection, including Tibetan prints from the 18th and 19th centuries.

B)

  1. Training (Transkribus and other recognition software) and project or individual guidance. Guidance sessions with researchers at the University who want to use handwritten text recognition software for their transcriptions. These range from Arabic, Ottoman Turkish, Coptic, French, Spanish, German, English, Danish, Norwegian, and Latin texts in handwriting or print to musical notation (mediaeval) and specimen descriptions from the Museum of Natural History. The Library mainly does guidance and training sessions for researchers as part of their research support.

  2. There is collaboration with other memory institutions, namely the Museum for University History and the History of Science, the Dept. of Pedagogy and the Dept. of Economics at the University of Oslo, and the National Library of Norway.

  3. There is the aim of becoming a member of the Co-operative and offering researchers and partners in smaller memory institutions access to its services. This will also be part of the skills development and knowledge sharing hub “BærUt! Sustainable Digital Scholarly Editions” from 2024-2026.

C)

  1. Community support and quick answers to odd problems via a dedicated Slack channel

  2. Documentation and how-to-guides

  3. Ready-to-use training materials (e.g. presentation slides)

  4. Re-use of public models

  5. A scholarship programme that supports training sessions

National Library of Finland

A)

The primary role has been within the NewsEye project, funded by the Horizon 2020 programme. First, the researchers from the University of Helsinki – to which the National Library belongs – chose the newspapers most important from a research perspective to be reprocessed with Transkribus and used as pilot material in the Research and Innovation Action project. This entailed about 500,000 pages in two languages. Second, after the project, the research community speaking Finnish and Swedish has been able to enjoy improved search results thanks to a large-scale reprocessing of another 2 million pages.

B)

The National Library of Finland works regularly with researchers, either as project partners in research projects or cooperating in other manners, for instance by offering data services, digital source materials or training in using both of these.

C)

The National Library of Finland has cooperated with READ-COOP SCE to improve text recognition for Finnish historical newspapers from 1771 to the 1920s. The improvement rate that Transkribus provided is remarkable compared to the earlier text recognition results. The reprocessed corpus consists of about 2.5 million pages in Finnish and Swedish. The improved versions of the recognized newspaper texts are accessible in the publication and presentation system of the National Library of Finland at https://digi.nationallibrary.fi. The READ-COOP SCE team, and before that the Transkribus project team, have provided user support, project management, and a recognition technology that made new levels of correctness possible, providing significant value to the academic community.

The National and University Library of Iceland (NULI) and National Archives of Iceland

A)

A digital humanities project via the Icelandic Centre for Digital Humanities and Arts. The Centre is a forum for the development, hosting, and consultation on the development of and access to digital databases in the humanities and arts, as well as for research based on these databases. The project aimed to train the Transkribus application to read old Icelandic handwriting. The project, carried out by historian Emil Gunnlaugsson, resulted in two models for late 18th-century and 19th-century Icelandic handwriting. The models have been made publicly available to Transkribus users on the platform.

B)

Being one research and one memory institution, the University and the Archives have been working through the Digital Humanities Center to foster collaboration between researchers on the one hand and holders of large collections of historical documents on the other. The goal is to promote the development of and access to research infrastructure in the field of digital humanities and to link Icelandic research to international development in the field.

C)

For the discussed project, the READ co-operative has mainly acted as a technology provider, offering an easy-to-use, customisable tool for working with historical documents, even for languages with a relatively small number of speakers. An additional advantage is that even where recognition results are not perfect, they still make the documents easier to read for users of varying skill levels and for the general public.

The project shows that the Transkribus infrastructure is also very compatible with other types of infrastructure and collaboration projects, fostering collaboration on the metalevel, too.

University of Oslo

A)

SAMLA – digitizing Norwegian tradition archives: Three Norwegian archives containing cultural-historical material, including folktales, legends, traditions, etc., are in the process of being digitised. SAMLA has generated ca. 500,000 image files. The aim is to make this material accessible through one web-based portal, which will be launched in the autumn of 2024.

B)

The project is coordinated by a research institution and thus works from a research perspective. One crucial part of the project was the source material, whose holders are both memory institutions and other research institutions, namely:

- Norwegian folklore archives, University of Oslo

- Norwegian ethnological research, The Norwegian folk museum

- Ethno-folkloristic archives, University of Bergen, project owner.

This means two of them are archives situated within research institutions, which makes collaboration easier due to similarities in structure and workflows, while one of the institutions is a museum. This shows the importance of strong connections between research institutions and other organisations in society, in order to have both methodological rigour in research on the one hand and relevant objects of study on the other.

C)

Transcriptions play an important role in making the material accessible. The materials vary in dialect, spelling, and layout, and are a mixture of handwritten and typed documents, clean text and drafts. SAMLA is currently experimenting with Transkribus recognition to build models for layout and text, aiming for as low a character error rate as possible so that the transcriptions are readable for the public. There are plans to connect Transkribus to an existing (Goobi) database for more automatic file management and larger batches of documents for transcription before publishing the transcriptions on the web portal. SAMLA also plans to find a good workflow for crowdsourcing, using Transkribus, where the public may correct errors they find in the transcribed material.
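For readers unfamiliar with the metric, character error rate (CER) is conventionally computed as the character-level edit distance between a reference transcription and the recognized text, divided by the length of the reference. The following Python sketch, with invented example strings, is purely illustrative and independent of Transkribus:

  def cer(reference, hypothesis):
      """Character error rate: Levenshtein distance / reference length."""
      m, n = len(reference), len(hypothesis)
      dp = list(range(n + 1))               # edit distances for the empty reference prefix
      for i in range(1, m + 1):
          prev, dp[0] = dp[0], i
          for j in range(1, n + 1):
              cur = dp[j]
              cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
              dp[j] = min(dp[j] + 1,        # deletion
                          dp[j - 1] + 1,    # insertion
                          prev + cost)      # substitution (or match)
              prev = cur
      return dp[n] / max(m, 1)

  # Invented example: one inserted character gives CER = 1/12 ≈ 0.083.
  print(cer("folkeeventyr", "folke eventyr"))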

Conclusion:

The positive feedback loop of collaboration in scholarly research is a powerful force for advancing human knowledge. By sharing data, working together, and educating future researchers, we can create a more collaborative and productive research environment that will lead to even greater discoveries.



Archaeological Artefact Database of Finland (AADA)

Petro Pesonen1,2,3, Ulla Moilanen2, Meeli Roose2, Jarkko Saipio1,2, Jasse Tiilikkala2,3, Usman Sanwal2, Visa Immonen4, Outi Vesakoski2, Päivi Onkamo2

1Finnish Heritage Agency, Finland; 2University of Turku, Finland; 3University of Helsinki, Finland; 4University of Bergen, Norway

The Archaeological Artefact Database of Finland (AADA) is planned to cover all prehistoric artefacts in Finland. So far, the database offers comprehensive information on over 49,000 collection entries of Finnish archaeological materials. It covers the whole prehistory of Finland, from the beginning of the pioneer settlement after the Last Ice Age (c. 8900 calBC) until the beginning of the Medieval period (c. 1300 AD). Geographically, it covers the entire territory of present-day Finland, including the Åland Islands, as well as artefacts collected before the Second World War from the territories ceded to Russia in 1945 (e.g. Karelia, Petsamo). The artefacts are categorized by type and are accompanied by photos. The database includes the details and measurements of artefacts that can be classified typologically, excluding, e.g., flakes, informal chipped tools (scrapers, awls, burins, etc.), iron knives, and nails.

The database provides a spatio-temporal context for comparing artefacts across different time periods and regions. To facilitate data usage, we also offer a geospatial framework to support visualization and analyses of the database. The AADA database offers a valuable resource for studying Finland's prehistory and is accessible on Zenodo. The data will be continuously updated in a GitHub repository managed by the Finnish Heritage Agency and the University of Turku, and new versions of AADA will be released on Zenodo at regular intervals. The AADA database is part of the trend towards more open materials that can be used in collaborative research, representing also a shift towards greater reliability and quality.
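To illustrate the kind of geospatial usage the framework is meant to support, the sketch below loads a hypothetical AADA export into geopandas and plots one spatio-temporal slice. The file name, column names and the dating threshold are assumptions for illustration; they do not describe the actual schema of the Zenodo dataset.

  import pandas as pd
  import geopandas as gpd

  # Hypothetical AADA export: one row per collection entry with coordinates,
  # artefact type and a numeric dating. Column names are illustrative only.
  df = pd.read_csv("aada_sample.csv")   # e.g. columns: id, type, date_start, lat, lon

  gdf = gpd.GeoDataFrame(df,
                         geometry=gpd.points_from_xy(df["lon"], df["lat"]),
                         crs="EPSG:4326")

  # Simple spatio-temporal slice: earliest entries only, reprojected to the Finnish grid.
  early = gdf[gdf["date_start"] < -1700].to_crs("EPSG:3067")
  early.plot(column="type", legend=True, markersize=5)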



Latvian Prose Counter: from digitized books to data visualizations

Anda Baklāne, Valdis Saulespurēns

National Library of Latvia, Latvia

The Latvian Prose Counter (LPC) is a multifaceted digital platform that showcases the potential of digital text analysis and visualization, provides comprehensive insights into Latvian novels from the 19th and 20th centuries, and serves as an experimental hub for full-text and metadata analysis of these novels. This initiative is a collaborative effort between the National Library of Latvia (NLL) and the Institute of Literature, Folklore, and Art of the University of Latvia (ILFA), aiming to synergize the resources of both institutions to forge a comprehensive digital resource. The morphological and syntactical markup of the texts is realized using NLP tools created by the Institute of Informatics and Mathematics of the University of Latvia.

The poster delineates the LPC's workflow, which encompasses text digitization, preprocessing, analysis, visualization, and enrichment with references to full-text objects, authoritative data, and ILFA's database contributions. Utilizing open-source Jupyter Notebooks for data processing and visualization underscores the project's commitment to transparency and reusability.

As the number of novels digitized by the NLL increases, the content and functionalities of the LPC are constantly updated. This ongoing development is anticipated to evolve into a holistic representation of the Latvian prose landscape that will facilitate a nuanced understanding through distant reading methodologies.

At the time of this poster presentation, the Latvian Prose Counter offers insights into novels from the Corpus of Latvian Early Novels (1879-1940) and features four distinguished authors from the Soviet era (195 authors in total). Users can delve into various quantitative aspects, such as the frequency of words (categorized by author, work, and parts of speech), prevalent sentence types, and lexical diversity within texts.
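As a minimal, purely illustrative sketch of the kind of notebook computation behind such quantitative aspects (not the LPC's actual pipeline; the tokenizer and sample text are placeholder stand-ins), word frequencies and a simple lexical-diversity measure can be derived as follows:

  import re
  from collections import Counter

  def tokenize(text):
      """Lowercased word tokens; a stand-in for the NLP pipeline actually used."""
      return re.findall(r"\w+", text.lower())

  def profile(text, top_n=10):
      tokens = tokenize(text)
      freq = Counter(tokens)
      ttr = len(freq) / len(tokens) if tokens else 0.0   # type-token ratio as a diversity proxy
      return freq.most_common(top_n), ttr

  sample = "Viņš gāja un gāja pa tumšo mežu"   # placeholder sentence
  print(profile(sample))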

Moreover, the platform emphasizes the importance of data visualization in complementing the presentation of quantitative parameters. It aims not merely to furnish details on these parameters but also to present this information through engaging and intuitive visual representations.



Digital tools, citizen engagement and vulnerable cultural heritage

Eiríkur Smári Sigurðarson1, Skúli Björn Gunnarsson2

1University of Iceland, Iceland; 2Gunnar Gunnarsson Institute, Iceland

The CINE project (Connected Culture and Natural Heritage in a Northern Environment), funded by the INTERREG Northern Periphery and Arctic Programme, aimed at transforming people’s experiences of outdoor heritage sites through technology, building on the idea of “museums without walls”. New digital interfaces such as augmented reality, virtual world technology, and easy-to-use apps brought the past alive and allowed people to visualise the effects of the changing environment on heritage sites, helping them to imagine possible futures. CINE developed content management toolkits, enabling curators, archivists, historians, individuals and communities to make innovative heritage projects that create unique on-site and off-site customer experiences in specific locations.

We will present the main results of the CINE project and further work on developing a “citizen science app” (Muninn) to monitor and register cultural heritage sites, to collect new information about known sites (descriptions, photographs, 360° photographs and 3D photogrammetry models), and to add to geographic databases.



Jubileumsportalen – contextualizing 1923’s jubilee exhibition using digital methods

Siska Humlesjö, Johan Åhlfeldt, Anders Strinnholm

University of Gothenburg, Sweden

Jubileumsportalen is a web portal collecting the views, images and other surviving documents produced during the Gothenburg Jubilee Exhibition of 1923. The Jubilee Exhibition was inaugurated on 8 May 1923 by King Gustav V to commemorate the foundation of the city in 1621; due to extremely high ambitions, financial troubles, and the aftermath of WWI, the exhibition was delayed by two years. Within the exhibition grounds, temporary structures were erected, including a modern lighthouse, an aerial railway, the world's largest restaurant, and a 7,000 square metre industrial exhibition hall. While most of these buildings were later demolished, some of Gothenburg's most iconic landmarks were established during this jubilee period.

Jubileumsportalen[1] leverages digital methods to make this part of local history available to a wider audience. The portal utilizes digitized photographs from different sources. These materials are linked to one of the contemporary maps of the exhibition area using geographical data. By providing context to high-resolution photographs available in IIIF format, the project aims to make the exhibition's materials accessible. The digitized collection includes photographs taken by an official exhibition photographer, menus from the exhibition's restaurants, official posters, and more. Additionally, the project utilizes digitized literature, such as guidebooks, official publications, and news articles, to describe and contextualize both the materials and the geographical locations.

The portal is a collaborative initiative between Gothenburg University Library and Gothenburg Research Infrastructure in Digital Humanities (GRIDH). Its purpose is to employ digital humanities technology to showcase the library's collection, with the goal of making this historical material accessible and engaging for a wider audience.

The main interface presents a contemporary map of the exhibition. The user can click the different data points on the map and access the material in depth, with links to the source material published in GUPEA (Gothenburg University Publications Electronic Archive). Places and buildings are described and contextualized to give the user an understanding of the exhibition's scope and the society that formed it. A toggle also allows the user to explore the data collectively in a gallery mode. Making the images available in IIIF gives the user the possibility to zoom in on details. All images are free to download and have citation information.
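The zooming described above follows from the IIIF Image API, whose request URLs follow the pattern {identifier}/{region}/{size}/{rotation}/{quality}.{format}. The sketch below composes such URLs in Python; the base URL and image identifier are placeholders, not the portal's actual endpoints.

  def iiif_url(base, identifier, region="full", size="max",
               rotation=0, quality="default", fmt="jpg"):
      """Compose an IIIF Image API request URL (region/size/rotation/quality.format)."""
      return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

  BASE = "https://iiif.example.org/iiif"   # placeholder image server
  photo = "jubileum_1923_0042"             # placeholder identifier

  full_view = iiif_url(BASE, photo)                                   # whole image
  detail = iiif_url(BASE, photo, region="1024,512,600,400", size="1200,")  # zoomed region
  print(full_view, detail, sep="\n")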

[1] Accessible at https://jubileet1923.dh.gu.se/



Representing the Íslendinga Saga As Knowledge Graphs of Events and Social Relationships: Developing Workflows Based on a Pilot Case

Shintaro YAMADA1, Jun OGAWA2, Ikki OHMUKAI1

1The University of Tokyo, Japan; 2ROIS-DS Center for Open Data in the Humanities, Japan

The sagas of medieval Iceland comprise several genres. The Íslendinga saga in the Sturlunga saga, classified as a contemporary saga, depicts events surrounding the powerful Sturlungar family clan and the social and political circumstances of the time. This research examines the Íslendinga saga with the aim of representing its narrative content as a knowledge graph.

A knowledge graph is a network of data that describes relationships between things in a machine-readable form. One way of representing such data is with RDF (Resource Description Framework), which can be used to structure information as a graph by linking entities through common formats. A knowledge graph is externally extensible; as long as a common descriptive format is used, separately created graphs can be easily integrated. Employing widely used vocabularies like CIDOC-CRM and HIMIKO will enable combining knowledge graphs. It should be noted that a research project in Iceland has also begun exploring an ontology for Icelandic sagas, and part of their work can be seen in their GitHub repository.

In constructing a knowledge graph of the Íslendinga saga, this research focuses on two aspects: 1) maintaining chronological continuity when representing the various events described in the saga, and 2) capturing relationships between characters. The graphs will be used to analyze how the characters solve the problems that arise in the saga, and in doing so can reveal the dynamics of problem-solving in medieval Icelandic society.

We create two different knowledge graphs to represent the Íslendinga saga according to these two aspects. One is an event-oriented graph describing events in the saga, and the other is a character-oriented graph outlining relationships between characters. The event-oriented graph captures entities like persons, places, and objects and their associations in events such as conflicts, killings, and lawsuits. The character-oriented graph describes kinship ties like marriages and sibling relationships, as well as social relationships between characters where possible. Both graphs are constructed according to the texts of the saga, but the character-oriented graph also includes interpreted information based on a reading of the texts, such as friendships or social relationships between characters, which are sometimes not explicitly mentioned in the texts but are implicitly recognizable through understanding the contexts. These two knowledge graphs can be integrated into a single knowledge graph representing the knowledge contained in the Íslendinga saga.
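To make the two graph types concrete, the following rdflib sketch asserts one event-oriented and one character-oriented statement in RDF. The namespaces, property names and the CIDOC-CRM mapping are illustrative assumptions, not the project's published ontology; the persons and event are placeholder examples.

  from rdflib import Graph, Namespace, RDF

  # Hypothetical namespaces; the project's actual URIs and vocabulary alignment differ.
  SAGA = Namespace("https://example.org/islendinga-saga/")
  CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")

  g = Graph()
  g.bind("saga", SAGA)
  g.bind("crm", CRM)

  # Event-oriented statement: a killing event with a perpetrator, a victim and a place.
  event = SAGA["event/killing_001"]
  g.add((event, RDF.type, CRM["E7_Activity"]))
  g.add((event, SAGA.perpetrator, SAGA["person/placeholder_perpetrator"]))
  g.add((event, SAGA.victim, SAGA["person/placeholder_victim"]))
  g.add((event, SAGA.place, SAGA["place/placeholder_farm"]))

  # Character-oriented statement: a kinship tie between two characters.
  g.add((SAGA["person/placeholder_perpetrator"], SAGA.siblingOf,
         SAGA["person/placeholder_sibling"]))

  print(g.serialize(format="turtle"))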

The Íslendinga saga has approximately two hundred chapters, and over a hundred people appear in the texts. As a pilot study for constructing graphs of the entire work, this research initially attempts to cover approximately one-quarter of the narrative, focusing on establishing workflows for graph construction. Specifically, it aims to compile vocabularies for appropriately capturing narrative content and to identify entities and resources that need representation in the graphs.

YAMADA-Representing the Íslendinga Saga As Knowledge Graphs-217.pdf


Display, Ontology and Database for Exhibition Documentation

Emmanuel Château-Dutier, Lena Krause, David Valentine, Zoë Renaudie

Université de Montréal

This poster aims to present the Display project developed within L’Ouvroir, the Digital Art History and Museology Laboratory at the Université de Montréal for the Partnership for New Uses with CIECO. Bringing together a team of researchers in art history, computer science, and museology, the laboratory is working on a digital tool to assist research on exhibition displays. We would like to present our methodology and the role of the DH Lab in the conception of this tool.

The mobilization and utilization of numerous archival sources to document the history of art museum exhibitions and enable their reconstruction is at the heart of the CIECO project. L’Ouvroir is conceiving the tool necessary to support all research operations, from collecting historical information to formulating hypotheses and recording results.

In an often sparse documentary context, this abstract model allows for the recording of historical information about exhibition installations through a spatial approach, defining the possibilities of topological inferences between objects in the exhibition space. The first step was to design a computer ontology to explicitly and formally describe the characteristics of an exhibition installation (proximity and contiguity of exhibits, facing, is-left-of, etc.). It will be compatible with the CIDOC-CRM ontology (https://www.cidoc-crm.org/), a conceptual reference model promoted by the international museum organization, forming an extension to cover the specific domain of exhibition installations.
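As a minimal illustration of what such topological statements and inferences could look like in RDF (the namespace, property names and artworks below are hypothetical placeholders; the Display ontology itself is still being finalized), consider the following rdflib sketch:

  from rdflib import Graph, Namespace

  # Hypothetical namespace for the exhibition-display extension.
  DISP = Namespace("https://example.org/display/")

  g = Graph()
  g.bind("disp", DISP)

  # Recorded observation (e.g. from an archival photograph): painting A hangs left of painting B.
  g.add((DISP["artwork/A"], DISP.isLeftOf, DISP["artwork/B"]))

  # Minimal topological inference: materialise the inverse relation.
  for s, o in g.subject_objects(DISP.isLeftOf):
      g.add((o, DISP.isRightOf, s))

  print(g.serialize(format="turtle"))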

The second research axis of this project is the reflection on the database tool to be provided to researchers and its communication to the developer. The user-friendly database will be usable by art historians without specific technical skills and facilitates:

-     The creation or automated import of lists of artworks

-     The recording or definition of the geometry of an exhibition space

-     The localization of artworks in this space.

This brief presentation will account for the various roles in the team, the decisions made, and the documents produced for this purpose. We are currently finalizing the database model to be submitted for production and would be pleased to present it to the expertise of the audience.

Château-Dutier-Display, Ontology and Database for Exhibition Documentation-146.pdf


Runoregi: A User Interface for Exploring Text Similarity in Oral Poetry

Maciej Michał Janicki1, Kati Kallio1,2, Mari Sarv3, Eetu Mäkelä1

1University of Helsinki, Finland; 2Finnish Literature Society; 3Estonian Literary Museum

This demonstration presents the user interface Runoregi used for exploring text similarity in large collections of Finnic oral poetry. We showcase the different views of the interface and their applications in folkloristic research.

Janicki-Runoregi-164.pdf


Towards Humanistic AI: Mapping an Emergent Field of DH Practices

Mats Fridlund1, Daniel Brodén1, David Alfter1, Ashely Green1, Aram Karimi1, Gustaf Nelhans2, Cecilia Lindhé1

1University of Gothenburg, Sweden; 2University of Borås, Sweden

Although many consider ‘AI’ or ‘artificial intelligence’ a debatable and fuzzy concept, it nevertheless today figures prominently in academic discourse, funding politics and policy-making. One could even talk about a broad institutionalisation process currently underway outside the fields of engineering and data science, as seen in several initiatives to manage the use of AI in higher education, the Swedish Research Council’s recent guidelines for using AI tools in research project applications (https://www.vr.se/english/applying-for-funding/applying-for-a-grant/guidelines-for-the-use-of-ai-tools.html), and the large Swedish research program WASP-HS (2019–2028) (https://wasp-hs.org/) aimed at fostering interdisciplinary knowledge about AI and autonomous systems in the humanities and social sciences, and their impact on human and social development. Notably, throughout these developments the role of the humanities is emphasised.

Following the surge of ChatGPT and ‘generative AI’, the importance of humanities researchers is often argued to lie in exploring ethical, social, and related aspects of AI (Dimock, 2020). However, humanities scholars – perhaps most prominently within corpus linguistics and language technology, but also in Digital Humanities (DH) and, to some extent, traditional disciplines such as archaeology, comparative literature and history – have for a long time been developing and using resources that are today often associated with AI, and also tend to use this term in communicating their research to academic and non-academic stakeholders. At the same time, humanists are sometimes active within the fields of AI without realising the depth, degree or character of their involvement. Depending on how it is applied, AI is used as a terminology for technological applications (including the applications themselves), a general field of expertise, or an imaginary of something that does not fully exist. For better or worse, one thing is clear: the influence of the term will likely continue to be felt in academic discourse for the near future, as a “fact”, “fantasy”, “desire” or perceived “destiny” (Zhao, 2022).

To not only meet the discursive norms but also open up a more structured discussion and demystify the notion of AI within the humanities, we have elsewhere proposed Humanistic AI as an apt term for discussing an emergent field of practices at the intersection of the application of AI tools and the interests that fall within the domain of DH and the humanities, taking into account the contested nature of the term AI (Fridlund et al., 2024). As noted by one of the few publications that use the term, “While there is massive investment all over the world related to one side of AI, namely engineering, it is also important to create rules and competence related to humanistic AI and its effects on people and societies” (Zhao, 2022). Thus, this paper will further explore ‘Humanistic AI’ as a term that enables DH to contribute to the expanding AI discourse, by highlighting what could be viewed as AI-related work within DH centres and research.

By providing a conceptual overview and a bibliometric mapping of the emergence of a terminology that refers both to the field of AI and humanities in research publications, the paper will delineate the context of the term and what we mean by it. To ground our discussion in practice, we will address three core areas of practice considered as Humanistic AI, in terms of using, developing or interrogating AI which will be concretised through projects involving the Gothenburg Research Infrastructure in Digital Humanities (GRIDH, formerly Centre for Digital Humanities) at the University of Gothenburg. We conclude by suggesting the term’s pragmatic usefulness for communication with the wider research community.

Humanities + artificial intelligence

We will primarily use ‘Humanistic AI’ to reference activities within humanities research and cultural heritage that apply, develop or study AI tools and applications. For clarity, we briefly sketch out what we mean by ‘humanistic’ and ‘AI’, respectively. Within AI, ‘humanistic’ can be used to designate ‘humane’ or ‘human-like’ functionalities and behaviours as well as to describe aspects related to humanities disciplines or knowledge domains (our use concerns the latter sense). Among historians, there is a consensus that ‘the humanities’ consists of a complex of academic disciplines and practices perceived as distinct and yet under continuous renegotiation (Bon, 2013). For instance, in Sweden many humanistic disciplines move across different university faculties, and before the 1960s the faculty of humanities represented both the humanities and the social sciences (Ekström & Östh Gustafsson, 2022).

The meaning of AI and artificial intelligence is somewhat more problematic due to its increasingly widened and contested meanings. To clarify and critique the various uses (and abuses) of the AI term, a number of alternative terms have been introduced, including ‘augmented intelligence’, ‘intelligence augmentation’, ‘automated approaches’, ‘autonomous systems’ and ‘intelligent systems’. The “classic” textbook describes AI as a field “concerned with not just understanding but also building intelligent entities—machines that can compute how to act effectively and safely in a wide variety of novel situations”, encompassing “logic, probability, and continuous mathematics; perception, reasoning, learning, and action; fairness, trust, social good, and safety; and applications that range from microelectronic devices to robotic planetary explorers to online services with billions of users” (Russell & Norvig, 2021). While such a broad range of notions exists about AI, we will pragmatically discuss it as an emergent field of practice that develops and studies so-called intelligent machines, as well as the use of such algorithms and machines. In particular, this refers to machine and software applications from the subfields of, among others, Expert Systems, Machine Learning, Natural Language Processing, Speech Recognition, Computer Vision, Robotics, and Genetic Algorithms, which include a range of applications such as clustering, deep learning, image segmentation, text classification and topic modelling.

To discuss the emergence of a research field related to the humanities and humanistic endeavours, we have conducted bibliometric searches in Google Scholar and Web of Science for publications including the term ‘humanistic AI’, as well as broader search criteria to capture wider uses in the humanities through articles mentioning AI together with ‘humanist’, ‘humanitarian’, ‘humanities’, etc. With these broader search criteria, we found 1,300 articles, reviews and proceedings papers in Web of Science. When visualising these more than 130 different AI-related keywords in VOSviewer (van Eck & Waltman, 2010), we found several distinct clusters: one revolving around digital humanities, technical applications of machine learning, the internet of things, and image and text analysis; a cluster depicting AI and risks; another centred around posthumanities, cybernetics, and ethics; one on applications within humanitarian law and military applications; and one referring to heterogeneous fields of AI applications within education, culture, media and digital methods. The paper will further analyse these visualisations of the clustering of co-occurring keywords, including such central humanities-related ones as ‘AI ethics’, ‘ethical AI’, ‘human-centred AI’, ‘responsible AI’, ‘explainable artificial intelligence’ – and ‘humanistic AI’.

Notably, during the last two decades the term ‘Humanistic AI’ has been used in different ways. For instance, in 2003 it was used to describe the trajectory within the design of intelligent machines that tries to emulate human cognitive capabilities rather than mimicking the human brain’s anatomical functioning (Krishnakumar, 2002). More recently, ‘Human-Centered AI’ (HAI) has been used for similar AI activities and processes. Such efforts are often shaped by a rationale implying that HAI, by augmenting rather than replacing human decision-making, is not just efficient but also more ‘fair’, ‘compatible’, and ‘humane’. Furthermore, there is a range of similar AI-related activities drawing on HSS perspectives (see Saheb et al., 2022). The term is also increasingly used within academic research. The Media Lab at KTH Royal Institute of Technology engages in interdisciplinary research combining “advanced engineering with philosophy, art, aesthetics and other disciplines from the humanities” to “develop a strong humanistic stance with respect to AI” (https://www.kth.se/hct/mid/research/media-lab/about-1.929121), and the University of Bologna’s Humanistic AI unit applies AI techniques to the humanities, including the “classification, exploration, management, and preservation of cultural heritage, archives, or demo-ethno-anthropological materials” (https://centri.unibo.it/alma-ai/en/scientific-units/humanistic-ai).

Applying, developing and interrogating AI

Drawing together these topics and themes, we suggest that AI is involved in humanistic research mainly through three core practices (exemplified below through the expertise at GRIDH): 1) humanistic scholars applying existing tools incorporating AI applications in their research; 2) engineers and programmers developing custom-made, AI-related resources for humanities research; 3) humanists interrogating AI through reflexive critical analysis of AI tools’ embedded values, positions (‘bias’) and affordances.

Application of AI involves a range of diverse techniques and methods, including vector representations of text, contextual search, data annotation, clustering, image classification, and recognition. Specific examples implemented at GRIDH include advanced word embeddings (Word2Vec, FastText, etc.) to create vector representations of textual content, allowing for semantic similarity analysis, topic modelling, and contextual understanding; word embeddings in combination with domain-specific ontologies to enhance semantic understanding; capturing evolving themes and topics in historical text through topic modelling techniques such as Dynamic Topic Modeling (DTM); semantic search to clarify the meaning of queries and documents and to improve search recall and precision; and image colour clustering based on similarity of embeddings.
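As a minimal illustration of the word-embedding approach to semantic similarity (a sketch only; the toy sentences below are invented and far too small to produce meaningful vectors, whereas the GRIDH projects train on large domain corpora), a gensim Word2Vec model can be trained and queried as follows:

  from gensim.models import Word2Vec

  # Toy corpus of tokenised sentences (placeholder data).
  sentences = [
      ["kung", "och", "drottning", "regerade", "landet"],
      ["drottning", "kristina", "abdikerade"],
      ["kungen", "regerade", "riket"],
  ]

  # Skip-gram model with small dimensions for the toy corpus.
  model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

  # Semantic-similarity query over the learned vector space.
  print(model.wv.most_similar("drottning", topn=3))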

Development of AI involves developing resources for solving complex research issues not easily solvable by simply applying existing applications. This can be done in different ways, such as training classifiers, fine-tuning existing models, or training new transformer models from scratch based on specific text or image corpora. Such development practices at GRIDH include computer vision and deep learning techniques for automatic image annotation, and object detection and segmentation for image labelling. However, developing more general AI applications requires a deeper understanding of the underlying principles (and implications) and large amounts of training data, which is a constraint often hard to satisfy in the humanities (e.g. documents to be analysed are in extinct languages, or artefacts under scrutiny no longer exist).

Interrogation of AI entails applying humanistic, research-based reflection to interrogate the implications of AI tools and methods. This partly relates to practices within fields such as Critical Digital Humanities, Critical Code Studies, digitalSTS, etc., that concern interdisciplinary analysis of advanced data-driven approaches, software, etc., and the socio-cultural production of knowledge in a digitalised society. In practice, such AI reflexiveness at times comes as explicit interdisciplinary studies including humanities scholars, as well as tacitly in project conversations with humanist scholars probing the interpretative limits and affordances of the data generated by AI tools. This often entails making opaque AI algorithms fathomable, or at least trying to work out their inner workings, by conducting in-depth analyses of model performance and of training processes that include human-in-the-loop components or active learning techniques.

The presentation will explore these three areas of Humanistic AI practice through descriptions of text-based and multimodal DH projects at GRIDH: the ‘Nordisk familjebok’ research infrastructure project, developed together with the Data as Impact Lab at the University of Borås, which implements ‘likeness’ search functionalities using a Word2vec model; ‘The New Order of Criticism’ (Ingvarsson et al., 2022), a mixed-methods project that uses Swedish LLMs for classification of book reviews in newspaper corpora, with a comparative perspective on quantitative and qualitative approaches; the ‘Literary Lab’, developed for the Swedish Literature Bank, which uses ML algorithms to cluster images of illustrations, initials, graphic ornaments, and sheet music for visualisations; the ‘Ivar Aroseniusarkivet’ project, which visualises thematic clusterings of a large archival collection of artworks; the project ‘Rock Art in Three Dimensions’ (Horn et al., 2022), which uses AI-enhanced Augmented Reality (AR) technologies aligned with the ethics of conservation; and the research project ‘Terrorism in Swedish politics’ (Edlund et al., 2022), carried out together with Språkbanken Text (University of Gothenburg) and Språkbanken Speech (KTH Royal Institute of Technology), which studies parliamentary discourse on terrorism, drawing on both speech analysis in the form of automatic speech recognition (ASR) and deep neural networks, and text analysis, using, among other things, word vectors to trace conceptual development in nuanced ways.

Summary

By combining a conceptualisation of Humanistic AI with highlighting the AI-related DH practices at GRIDH, our paper will contribute to opening up a discussion and demystifying AI as a field of interest within the humanities.



Towards Standards in Digital Editions of Old Norse Prose: A Case Study

Sebastian Pohland

University of Oslo, Norway

This paper provides an overview of the author's ongoing PhD research project at the University of Oslo, which aims to contribute to ongoing efforts to transition the field of Old Norse philology into the age of digital humanities by providing a detailed meta-analysis of the tools and technologies currently available for the creation of Old Norse digital editions, as well as the opportunities and pitfalls presented by the transition towards digital-first edition projects given current trends.

Pohland-Towards Standards in Digital Editions of Old Norse Prose-203.docx


Digital Datasets Created from Archival Sources: The Problem of Data Quality in the Study of Private Letters

Marin Laak, Kadri Vider, Neeme Kahusk, Mari Sarv

Estonian Literary Museum, Estonia

We will focus our presentation on the analysis of the results we have obtained from studying the letters and correspondence in the Estonian Cultural Archives through the textual databases created from them. The empirical basis of our research is the collection of manuscript private letters of Estonian literary figures in the 20th century. We would like to discuss some methodological issues related to data preparation to highlight an important problem of the impact of the quality of textual data on research results. The goal of our work is to analyse the results of the application of computational methods and their dependence on data quality. For this purpose, we compare different datasets created from archival sources of cultural history and highlight how the quality of metadata and the structuring of content elements affect the content of research results.

The research presented in this paper was conducted in the framework of the research project „Source Documents in the Cultural Process: Estonian Materials in the Collections and Databases of the Estonian Literary Museum” (I and II, 2019-2023, funded by the Ministry of Education and Research of the Republic of Estonia). The project focused on the implementation of digital methods and international standards in the management, publication, and research of archival sources. The use of existing and emerging textual data and databases, with the help of computational analysis, will allow for an increasingly better and more evidence-based overview of the various aspects of the information stored in the collections of the Estonian Literary Museum (ELM), as well as of changes in society, culture, and mindsets. Our interdisciplinary research is inspired by the surprising results achieved in the study of Estonian and Finnish folklore using computational methods, e.g. in the study of poetic and narrative text corpora of Estonian folklore (Sarv, Järv 2023) and the similarity analysis applied to Finnish oral folklore (Janicki, Kallio, Sarv 2023).

The archival sources of our research consist of the private letters and correspondence between Marie Under (1883-1980) and Ivar Ivask (1927-1992), two Estonian exile/diaspora writers in the West after the Second World War. Marie Under was one of the most appreciated poets of the Estonian diaspora in Sweden and a candidate for the Nobel Prize. Her correspondence with the younger literary scholar and poet Dr. Ivar Ivask in Minnesota (and later Oklahoma), USA, contains altogether ca. 550 letters from 1957-1979. Their letters are exceptionally poetic, but also long and informative, as we experienced in traditional literary analyses (close reading).

For exploring the thematic and temporal variability of the correspondence, we had to create a textual database of the letters. The methodological challenge relates to archival data preparation. The first step in this complicated process was converting the handwritten letters into a machine-readable format so that they could be used as textual data and subjected to automatic language analyses (see also Laak et al 2019).

The raw version of the dataset of the Marie Under and Ivar Ivask correspondence consists of approx. 300,000 words; the average annual number of words in the letters was about 3,000. The raw data (text only) were unstructured, and the authors and dates of the individual letters could not be distinguished automatically; the data quality was thus unknown and metadata extraction was required.

Our initial hypothesis, formed through traditional qualitative literary research, was that the correspondence of these refugee/diaspora writers thematically covers a wide range of topics, which reveal the productive activity of literati in preserving national culture in exile and diaspora communities. To explore this hypothesis, we applied frequency and theme analysis to the raw data.

Theme analysis applied to the original textual data revealed the dynamics of the correspondence over the years, as well as a large number of topics that were rather personal. The frequency analysis of the top content words also yielded several unexpected results.

The frequency analyses showed that the top eight content words (nouns and verbs in lemma form) are: letter (‘kiri’), poem (‘luuletus’), to write (‘kirjutama’), to do/perform (‘tegema’), time (‘aeg’), to read (‘lugema’), poetry (‘luule’), and to see/meet (‘nägema’) (Laak, Kirss 2023). We expected that Estonia (‘Eesti’, as a proper name or possible gerund) would also be among the top words, but surprisingly it was not.
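For illustration, a minimal sketch of this kind of content-word frequency count, assuming the letters have already been lemmatised and POS-tagged; the file name, input format and tag labels here are hypothetical placeholders, not the project's actual pipeline:

    # Count the most frequent content-word lemmas (nouns and verbs) in a
    # lemmatised letter corpus. Assumed input: one token per line,
    # "lemma<TAB>pos", with blank lines separating letters.
    from collections import Counter

    CONTENT_POS = {"NOUN", "VERB"}  # hypothetical tag labels

    def top_content_lemmas(path, n=8):
        counts = Counter()
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # blank line = letter boundary
                lemma, _, pos = line.partition("\t")
                if pos in CONTENT_POS:
                    counts[lemma] += 1
        return counts.most_common(n)

    if __name__ == "__main__":
        for lemma, freq in top_content_lemmas("under_ivask_lemmas.tsv"):
            print(f"{lemma}\t{freq}")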

The frequency plots of the top content words show that before 1965 the subject matter of the letters was broader; then, for some reason, the relationship between the correspondents changed. In 1965 the correspondence stopped for a while, and afterwards relatively formal letters were exchanged. From this moment on, the relative frequencies of the top words deviate from the usual distribution: the content words ‘letter’ and ‘read’, which refer to a more formal relationship, come to the fore.

The frequency analysis of topical content words did not support the hypothesis about the broad range of topics related to national exile and diaspora activities. It turned out that the range of topics at the centre of the correspondence was relatively narrow: the letters focused on reading and analysing the artistic value of poems, on the translation of poetry, and on the preparation of books for publication.

The results of the project showed that in order to study textual datasets created from manuscript archival sources using computational methods, a number of specific problems must be solved. The next challenges are to 1) analyse the frequency of topical content words while distinguishing the authors of the letters, based on a structured and annotated database of the letters, and 2) carry out a network analysis in order to determine the geographical extent of the international contacts of Estonian writers in exile.

One of the goals of our study is to analyse the results of the application of computational methods and their dependence on data quality. In our presentation we will compare different datasets created from older archival sources and highlight how the quality of metadata and the structuring of content elements affect the content of research results.

Based on the topic analysis applied to text collections with higher quality data, it is possible to highlight topics personally related to the authors of the letters and the differences between the authors.

The computational exploration of digital data from archival sources of cultural history revealed best practices and lessons learned from the collaboration between archives, literary studies, and computational linguistics.



Capitalizing on experience to experiment and innovate: feedback and reflection on the future of the Huma-Num research infrastructure

Antoine Silvestre de Sacy, Stéphane Pouyllau

IR* Huma-Num (UAR 3598), CNRS, France

The French national research infrastructure Huma-Num has just celebrated its tenth anniversary. We feel that this is an opportune moment to take a step back and reflect on the past ten years, combining the experience gained with the innovation required of any infrastructure.

The three historical missions of the infrastructure have always remained the same:

  • To accompany the evolution of the humanities and social sciences (SSH) communities in the context of digitalization and Open Science;
  • To implement an infrastructure for the "FAIRization" (Findable, Accessible, Interoperable, Reusable) of data;
  • To participate in the construction of international infrastructures with the SSH communities in the context of the European Open Science Cloud (EOSC).

Relying on national and international communities, the Huma-Num IR* has been built around a chain of services following the data lifecycle and designed to promote this "FAIRization" of research data, developing services in-house or offering off-the-shelf services corresponding to needs typical of digital humanities projects (data storage, web hosting, virtual machines, computing power, data warehouses, search engines, etc.).

The core principles of the infrastructure are built around three axes:

  • Innovation through use: relying on communities of researchers to co-develop services based on and adapted to their needs.
  • Service operation and data control: hosted at the IN2P3 computing center in Lyon, Huma-Num hosts all data in a sovereign and controlled manner.
  • Mixed and in-house development: backed by industrial partnerships, Huma-Num co-develops part of its services, taking advantage of the opportunities offered by the CNRS (partnerships, joint laboratories with industrial companies, etc.).

However, ten years after its construction, at a time when we can say that the infrastructure has reached a phase of maturity, having overcome the initial challenges associated with its start-up and established a solid base in the national and international landscape, the question of what a national research infrastructure in the humanities and social sciences is today arises all the more acutely, especially at a time of democratization of artificial intelligence and massive use of generative models:

  • How can we reconcile the exploitation of existing services and the innovation required for tomorrow's challenges?
  • What is the right balance between listening to communities, anticipating needs and guiding research practices?
  • How can a national research infrastructure capitalize on past experience to design tomorrow's research infrastructure, especially at a time when innovation seems, at first glance, to be the preserve of private companies?


Visualizing quire structures on Handrit.is

Beeke Stegmann

Árni Magnússon Institute for Icelandic Studies, Iceland

Information about the quire structures of manuscripts can be quite complex, but it is often highly relevant for researchers. This is particularly the case when scholars are investigating aspects related to the manuscripts’ materiality and genesis. Therefore, quire structures are a standard element included in most catalogue descriptions of handmade books.

Traditionally, quire structures are given in more or less condensed formulas, and their format can vary considerably between subfields. To increase the usability of quire structure information and make it more easily accessible to different users, we experimented with including visualizations of quire structures in the joint online catalogue Handrit.is. Instead of re-inventing the wheel, however, we wanted to take advantage of available software and decided to build on “VisColl” (Collation Visualization; see Porter et al. 2017), an open-source system for modelling and visualizing the physical collation of manuscripts. The idea was to integrate the existing code into our online catalogue, but adjusting it for our particular use and needs was not without challenges.

The aim of the present presentation is to reflect on this collaborative initiative at the crossroads of archives, conservation, manuscript studies, cataloguing, DH and data management.[1] Lessons learned will be addressed with a focus on how DH can best support the digital use of collections - in this case through additional tools in an online catalogue - and satisfy the needs and requirements of both general ongoing cataloguing and particular large-scale research projects.[2]

Incorporating the open-source data module was not as straightforward as initially thought. Even though the creators of VisColl make all their code available and share it in an exemplary manner, the nature of our use differs from what the original software was aimed at, requiring considerable adjustments. In particular, employing visualizations for cataloguing purposes and sharing them with users on a large scale makes a one-by-one project layout and personal logins impractical. In the end, the most practical and efficient solution for us was to use the VisColl interface for entering data, to export the underlying code generated by the software, and to employ our own, newly developed code for displaying the graphs. That way, a pop-up window can be embedded in the front-end of the online catalogue Handrit.is for manuscripts that have the relevant encoding. Also, no login is required for the user, and additional visualizations can be created one by one as the cataloguing progresses.
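As an illustration of this division of labour (data entry in VisColl, display with the catalogue's own code), here is a minimal sketch of a converter from an exported collation file to a compact structure that a front-end pop-up could render. The XML element and attribute names below are hypothetical placeholders, not the actual VisColl schema:

    # Convert a (hypothetical) collation XML export into a compact JSON list
    # of quires, each with its leaves, for display in a catalogue pop-up.
    import json
    import xml.etree.ElementTree as ET

    def collation_to_json(xml_path):
        tree = ET.parse(xml_path)
        quires = []
        # Assumed structure: <quire n="1"><leaf n="1" mode="original"/>...</quire>
        for quire in tree.getroot().iter("quire"):
            quires.append({
                "quire": quire.get("n"),
                "leaves": [
                    {"n": leaf.get("n"), "mode": leaf.get("mode", "original")}
                    for leaf in quire.iter("leaf")
                ],
            })
        return json.dumps(quires, ensure_ascii=False, indent=2)

    if __name__ == "__main__":
        print(collation_to_json("collation_export.xml"))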

[1] Institutions collaborating on this initiative are The Árni Magnússon Institute for Icelandic Studies and the National and University Library of Iceland. Main participants are, in alphabetical order by first names, Beeke Stegmann (SÁM), Halldóra Kristinsdóttir (Lbs), Kristinn Sigurðsson (Lbs), Silvia Hufnagel (SÁM) and Trausti Dagsson (SÁM).

[2] The initiative is part of a three-year research project, “Life of Paper: Cycles of production, Use and Reuse of 17th-Century Paper in Iceland”, funded by The Icelandic Research Council; Grant Number 228695.



Understanding researchers' needs by surveying to support them

Liisa Näpärä

National Library of Finland, Finland

In order to effectively support researchers, collaborate with them, and understand their needs, there is a demand for systematic information collection as digital research, methodologies, technology, pedagogy, and practices evolve. In recent years, the National Library of Finland (NLF) has emphasized its focus on this aspect. This aligns with the purpose of repeating, in the first months of 2024, a survey to evaluate researchers' current needs and experiences with digital resources, materials, and services. The survey follows the design of the initial survey conducted in the spring of 2020.

The survey about researchers’ needs for the digital resources and services of the National Library of Finland has proven to be a remarkable way to obtain new development ideas and improve current services. It has been structured with the research data life cycle and data management planning in mind, to help identify services and their relevance to researchers. After the 2020 survey, various activities and improvements have been carried out. For instance, a specific tool was developed to collect customized datasets and to serve those who are interested in and capable of handling moderate-size (not necessarily big) data with qualitative or mixed methods. Data download numbers have been promising in the two years since its launch. Additionally, a contact point for collaboration has attracted researchers on a regular basis. However, there is still room to raise awareness about this option among the broader research community.

Besides the survey-based actions, a number of research collaborations have been active. To name a few: 1) the Fin-Clariah research infrastructure for the social sciences and humanities, which provides out-of-copyright digitized materials in a computational environment, and 2) the development of a large Finnish language model together with the TurkuNLP group, based on the modern-language and legal deposit collections.

After a few years, it is time to reassess the relevance of current and existing research and data services. The survey aims to identify areas requiring enhancement and to guide the NLF in focusing the development of digital collections, data, and research services. The core emphasis of the survey lies in understanding researchers' experiences and cultivating ideas for the improvement of digital resources and services. Comparing the upcoming results with the previous ones will indicate, or at least reflect, how well and widely the NLF’s recent developments have become known among researchers.

In the presentation, the 2024 survey results are analysed, further initiatives for development are presented, and conclusions are compared with the previous research survey, highlighting areas of improvement. Repeated surveys are essential in providing evidence of evolving needs, with particular attention to potential shifts in research profiles. The previous answers indicate that the methods used in research on digital resources are still very traditional for the humanities and social sciences. Additionally, the survey implied that most attention is needed for digital newspapers and for research collaboration in general. Actions have since been taken, and collaboration has been carried out in projects and on other occasions.

The survey results are anticipated to provide significant value and serve as indicators for the further development of collaboration, research, and data services, aligning closely with the evolving needs of researchers.



Uralic Historical Atlas (URHIA): Interactive web app for spatial data

Meeli Roose1, Tua Nylén1, Petro Pesonen2, Harri Tolvanen1, Outi Vesakoski1

1University of Turku, Finland; 2Finnish Heritage Agency (Museovirasto)

The field of digital humanities has advanced significantly, with improved infrastructure and storage capabilities fostering overall research development. This progress, including the evolution of datasets, has expanded opportunities for spatial data storage, management, and analysis. The "spatial turn" in digital humanities incorporates spatial analysis, GIS, and other methodologies into evolving datasets, facilitating cultural studies and providing clear spatial views of complex data.

For 15 years, the University of Turku has fostered interdisciplinary collaboration in studying language evolution and human diversity (www.bedlan.net, https://sites.utu.fi/urko/, www.humandiversity.fi). We compile spatial data and offer open access to databases in Finland and Northern Eurasia. To ensure easy access, we developed the Uralic Historical Atlas (URHIA), an interactive platform (https://sites.utu.fi/urhia/) for researchers and lay audiences. URHIA is built on UTU-GeoNode (http://geonode.utu.fi/) as a versatile resource hub.

Within URHIA, we curate thematic spatial datasets, creating a dynamic space beyond a mere repository. It serves as a live data showroom, presenting various datasets through interactive online maps and enabling active user engagement. Current offerings include the Uralic Language Atlas, showcasing speaker areas, and the Archaeological Artefact Atlas of Finland. We will discuss URHIA's developmental challenges, particularly those of the two current showrooms, emphasizing the need to address them for future advancements.

The initial URHIA platform, developed collaboratively within the URKO project (2020-2022) at the University of Turku, focused on spatial language data (Rantanen et al., 2022). It introduced the Uralic Language Atlas, utilizing interactive maps to showcase diverse information on Uralic languages' speaker areas. This approach, guided by user-centered design (UCD) principles (Roose et al., 2021), set a standard for cross-disciplinary collaboration, involving experts from biology, linguistics, archaeology, geoinformatics, and IT support.

The development of the second showroom, the Archaeological Artefact Atlas of Finland, initiated in 2023, introduced fresh challenges for the underlying platform and necessitated the integration of the Oskari platform into UTU-GeoNode, forming the foundation for URHIA. The data within the Archaeological Artefact Atlas encompasses the typological classification of prehistoric artefacts along with their coordinates (Pesonen et al. 2024, submitted). Sharing archaeological data on a spatial platform poses challenges due to the specific functionalities needed for subsetting large datasets on the map. The database offers a spatio-temporal (c. 8900 calBC - 1300/1500 calAD) context for comparing artefacts across different periods and regions, encompassing approximately 38,000 individual artefacts and approximately 10,000 pottery-type identifications.
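The subsetting functionality mentioned above can be pictured as a simple spatio-temporal filter. A minimal sketch, assuming each artefact record carries coordinates, a typological class and a dating range; the field names, year encoding and the sample record are illustrative assumptions only:

    # Subset artefact records by bounding box, time window and (optionally) type.
    # Assumed record shape:
    #   {"id": 1, "lon": 24.9, "lat": 60.2, "type": "pottery", "from": -5000, "to": -4000}
    # Years are signed integers (negative = calBC), a simplifying assumption.

    def subset(records, bbox, start, end, artefact_type=None):
        min_lon, min_lat, max_lon, max_lat = bbox
        out = []
        for r in records:
            if not (min_lon <= r["lon"] <= max_lon and min_lat <= r["lat"] <= max_lat):
                continue
            if r["to"] < start or r["from"] > end:
                continue  # dating range does not overlap the requested window
            if artefact_type and r["type"] != artefact_type:
                continue
            out.append(r)
        return out

    if __name__ == "__main__":
        demo = [{"id": 1, "lon": 24.9, "lat": 60.2, "type": "pottery", "from": -5000, "to": -4000}]
        print(subset(demo, bbox=(20.0, 59.0, 32.0, 70.0), start=-6000, end=-3000))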

The design and user-friendly layout of historical spatial data platforms is crucial for enhancing spatial data usability (Slingerland et al., 2020). Data visualisation plays a significant role in enriching studies, emphasising the importance of how a map view is designed in spatial data platforms (Coetzee et al., 2020; Jiang et al., 2019; Kraak and Ormeling, 2020). Through teamwork and a dedication to user-centric design, the URHIA spatial data platform has become a dynamic tool, ready to meet the diverse research needs of the community exploring Uralic historical and cultural data.



Examples from the Translocalis: Cultural Heritage, Narratives, Emotions, Perceptions and Voices of the Finnish Media, People, and Soldiers on the Imperial War.

Aytac Yurukcu

University of Eastern Finland Karelian Institute

"Like everywhere in Finland, the call to provide woollen clothes for soldiers has been well received, and the same goes for here. The government, in turn, has granted 25,000 marks for the acquisition of fur coats for the guard." Satakunta, 10 September 1877, 45, 3. (A letter from the reader of the Satakunta newspaper from Helsinki.).

The nineteenth century was a challenging period not only for the Ottoman Empire, which Russia condescendingly referred to as "The Sick Man of Europe," but also for Russia, which faced all the major powers in the Crimean War. These empires fought nine times between the beginning of the seventeenth century and the Russo-Turkish War of 1877–78. The war had far-reaching consequences in the Balkans and the Caucasus and was unique in the way journalists portrayed it to Europeans, Russians, Turks, and the Balkan nations. The media, journalists, military attachés, and foreign correspondents shaped war news, which drew a sizable audience and frequently engaged people deeply, both emotionally and intellectually. However, the conflict also exerted a substantial influence on the ethnic majorities and peripheral minorities of the Russian Empire and in the army, including Finns, Estonians, and Poles. Hence, the war had the capacity to significantly impact the Finnish media, society, and the character of the national movement with regard to patriotism and the nation's position within the empire (Liikanen, 1995; Alapuro, 2018).

The war created new circumstances in realpolitik all around Europe and between the empires. One of the war's most notable impacts was on the Baltic Provinces and the Grand Duchy of Finland (Thaden, 1964), dominated by the Russian Empire during the golden age of nationalism in the late nineteenth century. Within the imperial context in Finland (Snellman and Kalleinen, 2022), the war also shaped the different media and the war perspectives of the people and the army officers on the threshold. This paper provides a scholarly contribution to the study of the history of vernacular writing by using previously unused letters from readers (Kokko, 2021; Kuismin & Driscoll, 2013) and soldiers as primary sources.

The Translocalis Database (1) contains over 60 published letters from readers and soldiers in 1877–78 that will be examined in this research. Using local letters, sent to newspapers by readers and soldiers from the war zone, as source material gives details on the perspectives of society, the people, and the war, and shares the community's experiences of the war. The study posits a significant hypothesis: that Finland's emerging notion of a distinct state and nationhood was influenced by wartime events, as evidenced by particular instances from readers' letters, news coverage of the war in newspapers, and the tales of soldiers.

The theoretical framework builds on the interaction between one's personal identity and a sense of belonging in the community, as argued by Knott (2017), the concept of social layer experiences, and the theory of historical times as developed by Koselleck (2004), who emphasizes the impacts of war and the link between "the space of experience" and "the horizon of expectation" in national and social politics. Finally, the classical nationalist theory of the ‘imagined community’ (Anderson, 1991) is drawn upon, since the public sphere was the essential foundation for nationalism as an "imagined community" and for the development of mass movements of civil society in Finland. Cultural heritage collections, qualitative content analysis, and digital humanities tools (https://korp.csc.fi/korp/, https://voyant-tools.org/, https://digi.kansalliskirjasto.fi/, and https://digi.kansalliskirjasto.fi/collections?id=742) will be used to analyze the written texts by key themes: society, soldiers, solidarity, narratives, war news, and enemy images.

Many scholars—Kansanaho, 1965; Hiisivaara, 1969; Backström, 1996; Laitila, 2001; Suistola and Tiilikainen, 2014; Outinen, 2016; Parppei, 2021—have discussed the war from various perspectives; however, these discussions have not been directly linked to societal experiences. By using double-sided letters, newspapers, and diaries together, such practices around the war (Kettunen, 2018) address the issues of nationhood, solidarity, welfare, and nation-building. Of course, firsthand evidence of the national experience has been difficult to obtain. However, the Translocalis also contains articles from readers that were published in newspapers throughout times of conflict, providing distinct and extensive information about people's war experiences. In addition, the Digital Collections of the National Library of Finland (2) contain mass media sources that provide insight into the effects of the war on the mindset of the Finnish media and military personnel, and into the information disseminated by them and by media outlets during the peak development period of the Finnish media and the later period of societal renaissance.

This research looks at how the war altered the general, local, social, and political history of information circulation in Finland by asking: How did the war affect the community? How did the war of 1877–78 impact the media and soldiers' narratives and perceptions of liberty and independence? What kinds of war news, narratives, and emotions were expressed in the newspapers, as well as in the readers’ and soldiers’ letters, and why? How did they imagine, construct, narrate, and visualize their senses and sensations during wartime?

(1) The Translocalis Database, developed by the Academy of Finland Centre of Excellence in the History of Experiences (HEX), https://digi.kansalliskirjasto.fi/sanomalehti/binding/431835?page=3

(2) Digital Newspaper Collections: National Library of Finland Digital Newspaper Collection. http://digi.kansalliskirjasto.fi/



Automation of Linguistic Annotation in Historical Lithuanian Corpus

Mindaugas Šinkūnas, Ignas Rudaitis

Institute of the Lithuanian Language, Lithuania

The Institute of the Lithuanian Language holds the largest corpus of historical Lithuanian, which has been collected for more than two decades in cooperation with scholars from Lithuanian and German universities. The texts date from the first Lithuanian printed book in 1547 to the formation of standard Lithuanian at the end of the 19th century. The precise transcription of books and manuscripts was achieved by working with original copies, which are held in libraries across Lithuania, Poland, Germany, the UK, Sweden, etc. The corpus of ca. 6 million tokens is mostly used to research the history of the Indo-European and Baltic languages and literature. The research potential is increased by tagged non-linguistic metadata and partial linguistic annotation.

Four types of annotation are set to augment the corpus: (1) part of speech, (2) inflectional features, (3) lemma, and (4) modernized spelling. In the general context of linguistic corpora, (1–3) are commonplace, while (4) reflects the specifics of a historical corpus. The Historical Lithuanian Corpus, like many other corpora of this kind, consists of texts written in various dialects and orthographic conventions; these vary greatly even between texts of the same period, and differ from today's orthographic standard.

To minimize manual labour in producing these annotations, various computational approaches will be assessed, with most focus on supervised machine learning models. This choice follows naturally from the fact that prior to this work, the present authors had already compiled a training dataset in the order of hundreds of thousands of manually annotated tokens, consolidating multiple scholarly sources.

All of (1–4) can be—and have been—stated as computational problems. In the parlance of natural language processing (NLP), (1) is known as part-of-speech tagging, (2) as morphological analysis, and (3) as lemmatization. In NLP, the interest in (1–3) is already widespread, producing many well-performing solutions. Therefore, a developer of any historical corpus could also be expected to benefit from them.

However, (4) is different. In the literature, “historical spelling normalization” is the preferred term for this task; its relevance is mostly confined to historical corpora, making it much less widely studied in comparison. In addition, even (1–3) have their specifics when applied to historical language varieties. These specifics, arising mostly from the sparsity of data, are sufficient to deprive some first-line NLP approaches of their otherwise established advantage.
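To make the task concrete, here is a minimal sketch of a supervised memorisation-plus-fallback baseline for historical spelling normalization, trained on (historical form, modern form) pairs of the kind found in manually annotated data; the training pairs shown are invented placeholders, not corpus data, and this is only one simple point on the spectrum of approaches assessed:

    # Baseline historical spelling normalizer: memorise training pairs and fall
    # back to the normalization of the closest known historical form.
    import difflib
    from collections import Counter, defaultdict

    def train(pairs):
        table = defaultdict(Counter)
        for hist, modern in pairs:
            table[hist][modern] += 1
        # keep the most frequent modern form for each historical form
        return {h: c.most_common(1)[0][0] for h, c in table.items()}

    def normalize(word, table):
        if word in table:
            return table[word]
        # fallback: copy the normalization of the most similar seen form, if any
        close = difflib.get_close_matches(word, table.keys(), n=1, cutoff=0.8)
        return table[close[0]] if close else word

    if __name__ == "__main__":
        model = train([("ira", "yra"), ("giwenimas", "gyvenimas")])  # invented examples
        print(normalize("ira", model), normalize("giwenima", model))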

With that in mind, it is prudent to regularly reassess the state of problems (1–4) in the context of historical language varieties. It is especially pressing nowadays, given the unprecedented flux in the NLP ecosystem, first brought about by deep learning and distributional semantics, and then by large language models (LLMs). The development of Historical Lithuanian Corpus has presented such an opportunity.

This paper will focus on the following questions, to the extent that the experience with historical Lithuanian varieties permits: (a) To date, have LLMs become relevant to tasks of historical NLP, and if so, how are they to be integrated into the pipeline? (b) How do the most common models in neural NLP compare when historical datasets are used, and is the comparison influenced by the specific nature of the datasets? (c) In this context, can these models benefit from different choices of pipelining and representation? (d) Can annotated corpora of modern language varieties be used to augment historical training datasets, and if so, in what ways? (e) Given that, as a computational problem, historical spelling normalization overlaps with spelling correction, grapheme-to-phoneme transduction and some other NLP tasks, can the advances in solving these latter tasks be transferred to solving historical spelling normalization? (f) Given this very same overlap, is an analogous kind of transfer possible in the opposite direction?

Lastly, the paper will remark on some typological aspects of Lithuanian, which should help the reader situate the results in the greater perspective of historical NLP.



Text Recognition, Network Analysis, and Spatial Analysis: Approaching 17th-Century Court Records from a New Perspective.

Ville-Pekka Iivari Kääriäinen

University of Helsinki, Finland

This paper explores 17th-century court records from the Parish of Iisalmi, utilizing a digital humanities approach that integrates Handwritten Text Recognition (HTR), Social Network Analysis, and Spatial Analysis. By examining the progression of state formation at the local level, this research offers fresh insights into the dynamics of state building “from below,” challenging traditional narratives centered on the decrees of central political elites. The study leverages the court records of Iisalmi parish, a rich but underexploited source that, despite its detailed content, presents challenges due to archaic language and handwriting. Employing HTR technology, particularly through the Transkribus application, the researcher has developed a model for deciphering 17th-century handwritten Swedish, making these valuable records more accessible to the academic community.

The creation of a relational database from these records has allowed for an unprecedented level of analysis, collecting metadata and descriptive data for each legal case and linking individuals across multiple cases. This methodology not only bridges the gap between quantitative and qualitative research methods but also highlights the importance of everyday interactions and the agency of seemingly marginal figures in the historical process of state formation. The paper further discusses the potential for future research, particularly the application of spatial network analysis to better understand the geographical dimensions of these interpersonal relationships.
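A minimal sketch of the general kind of relational structure described above, with cases, persons, and a link table connecting persons to the cases they appear in; the table names, columns and sample rows are hypothetical illustrations, not the project's actual schema:

    # Toy relational database for court records: cases, persons, and appearances.
    import sqlite3

    SCHEMA = """
    CREATE TABLE cases   (case_id INTEGER PRIMARY KEY, year INTEGER, parish TEXT,
                          case_type TEXT, summary TEXT);
    CREATE TABLE persons (person_id INTEGER PRIMARY KEY, name TEXT, home_farm TEXT);
    CREATE TABLE appearances (case_id INTEGER REFERENCES cases(case_id),
                              person_id INTEGER REFERENCES persons(person_id),
                              role TEXT);  -- e.g. plaintiff, defendant, witness
    """

    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.execute("INSERT INTO cases VALUES (1, 1639, 'Iisalmi', 'debt', 'unpaid grain debt')")
    con.execute("INSERT INTO cases VALUES (2, 1641, 'Iisalmi', 'slander', 'dispute between neighbours')")
    con.execute("INSERT INTO persons VALUES (1, 'Olof Larsson', 'example farm')")
    con.execute("INSERT INTO appearances VALUES (1, 1, 'defendant')")
    con.execute("INSERT INTO appearances VALUES (2, 1, 'witness')")

    # Persons appearing in more than one case provide the links of the social network.
    rows = con.execute("""
        SELECT person_id, COUNT(DISTINCT case_id) AS n_cases
        FROM appearances GROUP BY person_id HAVING n_cases > 1
    """).fetchall()
    print(rows)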

Kääriäinen-Text Recognition, Network Analysis, and Spatial Analysis-191.pdf


Historical Farm and People Registry – Turning static list entries into network nodes

Eiríkur Smári Sigurðarson1, Pétur Húni Björnsson2

1University of Iceland, Iceland; 2Árni Magnússon Institute for Icelandic Studies

The aim of developing the Historical Farm and People Registry is to create a reliable infrastructure for research involving data on people and places in Iceland from 1703 to 1920, based on the official censuses of the period. This has been done by (a) establishing a reliable historical farm registry, (b) mapping census data onto the farm registry, and (c) connecting people between censuses, thereby transforming the static lists of the censuses into an interconnected network of nodes. In the spring of 2024 an attempt is being made to use AI solutions to suggest linkages between individuals that have not so far been linked, "finishing" the product as far as possible.
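For illustration, a minimal sketch of the kind of rule-based record linkage that connects individuals between two census lists, of the sort that AI-assisted suggestions would then extend; the field names, thresholds and sample records are hypothetical, not the project's actual method:

    # Link people between two census lists using name similarity and birth year.
    import unicodedata
    from difflib import SequenceMatcher

    def norm(s):
        # lowercase and strip diacritics so that e.g. "Jón" matches "Jon"
        return "".join(c for c in unicodedata.normalize("NFD", s.lower())
                       if unicodedata.category(c) != "Mn")

    def similar(a, b):
        return SequenceMatcher(None, norm(a), norm(b)).ratio()

    def link(census_a, census_b, name_threshold=0.85, max_year_diff=2):
        links = []
        for pa in census_a:
            best, best_score = None, 0.0
            for pb in census_b:
                if abs(pa["birth_year"] - pb["birth_year"]) > max_year_diff:
                    continue
                score = similar(pa["name"], pb["name"])
                if score > best_score:
                    best, best_score = pb, score
            if best and best_score >= name_threshold:
                links.append((pa["id"], best["id"], round(best_score, 2)))
        return links

    if __name__ == "__main__":
        a = [{"id": "1703-17", "name": "Jon Jonsson", "birth_year": 1671}]
        b = [{"id": "1729-03", "name": "Jón Jónsson", "birth_year": 1672}]
        print(link(a, b))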

This presentation will outline the aim and scope of the project, explain the development process and problems encountered on the way, and demonstrate a use case for the "final" product.



Collecting streaming services

Andreas Lenander Ægidius1, Mads Møller Tommerup Andersen2

1The Royal Danish Library, Denmark; 2University of Copenhagen, Denmark

In the streaming era, the very thing that defines it is what threatens to impede access to important media history and cultural heritage. Streaming's barriers to entry and its interim content catalogs challenge the actual collection and preservation of streaming for research and teaching purposes. If researchers and libraries do not work together to document and preserve these services, we will keep losing important sources and data. A legal mandate to collect and preserve cultural heritage drives the national library to address this issue. The same issue sparks engagement among scholars who wish to collect streaming for research purposes, but who end up making individual archives unknown to their associates. We therefore wish to address this issue collaboratively by answering the following research questions: What methodological challenges do we find when we collect and study streaming services using our two different collection methods? What characterizes collections of streaming interfaces, and how can we improve future collections? From a collection perspective, we argue that streaming services consist of their catalog, metadata, and graphical user interfaces. First, we map the large-scale legal deposit collection of streaming at a national library as well as a media researcher's small-scale targeted collection. Second, we compare the resulting collections of web sites and graphical user interfaces in order to discuss methodological challenges. The findings of this comparative analysis indicate existing deficiencies in both collections and suggest potential improvements in the collection and preservation of streaming services.

Concluding discussion

In the following, we will focus on methodological challenges surrounding the collection process. We will divide our discussion into three sections. First, we will discuss the advantages and disadvantages of the two approaches. Second, we will discuss the extent to which the two collections can supply or support each other and in that way mitigate their methodological challenges. Finally, we will discuss steps to achieve greater transparency and better data about and from streaming services.

We have described the national library’s method as a general collection approach and the media researcher’s method as a targeted collection approach. A parallel difference is whether a collection has a macro-level and/or micro-level approach. To exemplify this, we briefly recount how the library has initiated a general automated collection from commercial aggregators of born-digital music and books, along with a targeted automated effort to collect streaming-only TV programs from the two major public service TV stations. However, the library’s current attempted micro-level collection effort provides but a narrow slice of the dimensions of streaming. So far, the method does not capture the graphical user interface, only video files, program descriptions, images, and various other metadata.

The researcher’s targeted approach seems to have a wider margin of success by collecting a few streaming services in full or thematically. This should provide data that satisfies that researcher’s immediate needs. However, it runs the risk of being too narrow and hence supporting very few insights into the different dimensions and roles of streaming services in contemporary society.

We have already touched upon another important distinction, to do with the degree of automation of the collection process. At the library, curatorial and technical staff thoroughly test and improve their practice of automated collection through many years of quality control. Yet the library’s collection only happens outside any paywalls, which is a key collection bias and analytical disadvantage. What is not automatically collected will be missing from the collection. The interfaces, arguably the most important aspect of the streaming experience, are missing. In other words, our evidence suggests that the automated process has deficiencies in terms of a “lack of depth” in the collection of the interfaces of the streaming services.

In contrast, Author B has documented versions of streaming service interfaces with manual screenshots after having logged in to the services. If a web page does not load successfully, the researcher can simply reload the page to capture all the content in the subsequent screenshot. In the hand-held collection process, the researcher will inevitably tackle such obstacles in the short run at the micro level. However, the obstacles can accumulate over a longer scheduled collection. In other words, hand-held collection is subject to the researcher's stamina. If collection fatigue sets in at some point, the consistency of that collection could suffer. This is a potential hindrance for studies that aim to provide longitudinal insights into the development of streaming services. Author B remedied this by downshifting from monthly to quarterly collection. It seems that the hand-held method has longitudinal potential, while the library’s collection method is established as longitudinal. Since there is no exact time-based definition of a longitudinal study, we could argue that the library’s collection method should provide very lengthy longitudinal advantages. Its allocated work effort of testing and improving its method should increase the replicability of the automated method. A threat worth mentioning stems from the fact that the very lengthy longitudinal collection is subject to the library’s strategic priorities, which must be renegotiated every 3–5 years.

Still, the two methods could support each other simply because they cover different depths and timelines of streaming. We will describe two possible interdependencies, at the practical level and at the policy level. The targeted approach can document user profiles better and collect more micro-level details, e.g. full samples of images in a carousel and aspects of personalization, which are missing in the automatic collection. The choice to, or attempt to, collect inside paywalls or logins is a very important one. The library’s general collection of other types of content and related materials from streaming services can serve as context for both methods. In the case of national services, the library should be able to give researchers access to the videos featured in the screenshots and to historic news coverage of specific services. However, we should consider the methods as interdependent rather than one being subordinate to the other. In other words, researchers can help the library patch the obvious holes in past collections. Looking forward, researchers can help the library adjust and improve collections. Given its wide and ambitious scope, it should be in everyone’s interest to have an optimal, broad, very lengthy longitudinal collection of streaming. It is a demanding task for the curators to assess what future needs will be. Researchers, and eventually the public, will continuously have to help the library assess the merit of its general approach that aims to collect everything automatically. This will require a better and continuous dialogue between the library and these stakeholders about what should be collected from the internet (Brügger, 2018; Schafer and Winters, 2021). The same goes for the library and researchers concerning the various dimensions of streaming captured in varied datasets (Kelly and Sørensen, 2021: 87).

Not only are collections mutually beneficial on a practical level; we reiterate that they are interdependent at a policy level. The library initiated its collection practices based on investigations made by appointed experts and researchers that led to legal deposit legislation for dynamic web sites (Bache and Finnemann, 2003). Without renewed and continued exchange of advice and assessments of what needs to be collected, there will not be a wide, long, and lasting contextual collection to draw from and build upon. As such, there is a fundamental interdependence at the level of policy and at the level of the individual collections of specific types of content, as shown here in the case of streaming services. There are multiple advantages to developing relationships between researchers and institutions. Not only do they facilitate greater access to data, but they may also potentially increase access to the expertise and tools required to make sense of said data, while also enforcing appropriate digital data preservation policies (Kelly, 2022: 16–17).

We wish to add that the streaming services are also important partners and collaborators. However, their preservation needs and research interests might not match those of the libraries and the researchers. Nevertheless, greater awareness of the legal deposit obligations and the benefits of research collaborations could produce better preservation of digital cultural heritage, more research insights, improved public-service offers, and potential commercial gains. The dynamic, elusive Netflix catalog seems like a hyperobject, an n-dimensional non-entity on par with the internet itself (Morton, 2013). Yet Netflix does have a research division and a website that links to their research (Netflix, n.d.). A quick survey of the site suggests that while they do present at conferences and publish in ACM proceedings, they do not seem open to collaboration with independent researchers.

We can pick up, browse through, and pass on books made of paper and their born-digital counterparts. We cannot, at this point in time, ‘press play’ on a collected version of the software and content assemblage we call streaming services. Their constituent parts are spread across various collections. Curators and researchers face a near-insurmountable task of reconstructing them as a piece of cultural heritage and as a research object. Potentially, an automated collection process inside login would collect every page and its contents that is linked to in Figure 2. Barring any conventional download of content in a ripping manner, this would be akin to having a bot “watch” all Netflix shows in real time or work its way through a curated playlist of content that is covered by Danish legal deposit law. Such a scenario presents a probable collection method with a very high degree of complexity. Which elements in the interface should be “clicked”, and in which sequence? Will the play button be consistently placed in the interface? Will the ‘more info’ button load a pop-up or take us to a different “site” of the interface? Let alone the apparent impossibility of automatically collecting interactive TV series!

In summary, this means that collecting streaming services is very problematic and fraught with challenges. Nonetheless, we are hopeful, since we can identify overlapping collections that can support each other. The collections can be built upon and further enriched by research requests if researchers and institutions, in collaboration with the services, acknowledge the interdependencies and produce rich documentation and metadata.

We urge researchers and curators to help politicians realize how big a problem the lack of transparency around streaming actually is. As we have discussed, the actual interfaces of these services are not well preserved for future reference, and every single day important material is regrettably not collected. Also, if a streaming service used its interface in a way that was societally problematic, the current state of our collections might make that difficult for anyone to prove. One solution is a political intervention that would make it mandatory for streaming services to hand over documentation of their service's appearance (inside paywalls) to the library on a recurring basis. Another major transparency issue is the lack of reliable "ratings" or numbers that document streaming use, which most services do not share with the public. Altogether, these circumstances show how streaming services are difficult to research and escape important critical observation.

In the meantime, we recommend that more parties test and share their experiences of using various tools that support archival and research-based collection of online materials. An example of this is Webrecorder's software Browsertrix Cloud, which provides a user interface for non-developers while being compliant with standardized web archive formats (Myrvoll et al., n.d.). As tools become easier to use, it is increasingly important to document choices made before and during the collection, to increase the methodological transparency and reflexivity of a given collection. This will help future exchanges and access to collections of streaming services.
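In that spirit, a minimal sketch of what machine-readable documentation of a collection event could look like: one JSON record per captured page, written alongside the screenshots or web archive files. All field names and example values are our own illustrative choices, not an established standard:

    # Write one JSON-lines record per collection event so that future users can
    # reconstruct how, when, and under which conditions an interface was captured.
    import json
    from datetime import datetime, timezone

    def log_capture(logfile, service, url, method, logged_in, profile, artefact_path, notes=""):
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "service": service,
            "url": url,
            "method": method,            # e.g. "manual screenshot", "automated crawl"
            "logged_in": logged_in,
            "profile": profile,          # which user profile was active, if any
            "artefact": artefact_path,   # path to screenshot / WARC in the collection
            "notes": notes,
        }
        with open(logfile, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")

    if __name__ == "__main__":
        log_capture("collection_log.jsonl", "ExampleStream", "https://example.org/front",
                    "manual screenshot", True, "adult profile", "captures/2024-05-01_front.png")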

Funding

A grant from the Ministry of Culture Denmark funded the research project that provided insights for this article: FPK-2021-0004.



Og að mér lifanda lifir enn hans hamingja – Rare syntactic phenomena in parsed historical corpora

Ingunn Hreinberg Indriðadóttir, Þórhallur Eyþórsson

University of Iceland, Iceland

Introduction
This paper examines the historical distribution of the Prepositional Absolute Construction (PAC), a rare and underdescribed construction in Icelandic. Our study shows how a syntactically annotated digital corpus of historical texts can be used to describe the preservation and development of uncommon syntactic phenomena across different stages of a language. PAC has been defined as a small clause containing a subject NP and a present or past participle whose case is governed by the preposition ‘at’ (Eyþórsson and Indriðadóttir 2018). The construction consists of three types, labeled Type 1, Type 2a and Type 2b. Examples are given in (1):

(1a) Type 1: að öllum sjáandi

at all.dat seeing

‘While everybody sees/saw.’

(1b) Type 2a: að viku liðinni

at week.dat passed.dat

‘When the week had passed.’

(1c) Type 2b: að athuguðu máli

at considered.dat matter.dat

‘When the matter had been considered.’

In Type 1 (1a), the verb is in the present participle and the NP corresponds to the subject of an active clause with a finite verb. In Type 2a (1b), the verb is in the past participle and the NP corresponds to the subject of an active clause with a finite verb. In Type 2b (1c), the verb is in the past participle and the NP corresponds to the subject of a passive clause with a finite verb (an underlying object). The past participle shows agreement with the NP in case, number and gender in both Old and Modern Icelandic. The present participle is not inflected in Modern Icelandic, but it shows agreement in Old Icelandic, with a distinction in the masculine singular (OIc. komandi ‘coming’ (nom.sg.) / komanda (obl.sg.) vs. Modern Icelandic komandi (nom./obl.sg.)).

The grammatical function of the NP in PAC is either that of a subject of a finite active clause (Type 1 and Type 2a) or a subject of a finite passive clause, i.e. an “underlying” object (Type 2b), as shown in (2):

(2a) Allir sjá þetta.

all.nom see this.acc

‘Everybody sees this.’

(2b) Vikan líður.

the-week.nom passes

‘The week passes.’

(2c) Mál var athugað.

case.nom was considered.nom

‘A case was considered.’

Due to their correspondence to subjects in finite clauses, in this paper we call the NPs in PAC “subjects”, as has been argued by Indriðadóttir and Eyþórsson (to appear). However, it is unclear whether it can be shown independently, by means of the standard tests for subjecthood, that the relevant oblique NP is actually a subject in the PAC.

Previous studies of this structure have suggested that the distribution of the present and past participles is different in Old and Modern Icelandic, i.e. that there are very few occurrences of Type 1 in Modern Icelandic, whereas Type 2 is relatively common, in particular Type 2b. It has also been maintained that case marking within PAC developed very early on, as both dative and accusative NPs are attested in the PAC in Old Icelandic, whereas in Modern Icelandic only dative NPs seem to be found (Eyþórsson and Indriðadóttir 2018).

The results presented in this paper show that the distribution pattern of present and past participles in PAC has been consistent throughout all stages of the language from the 12th century to Modern Icelandic. This is important as it means that while PAC has always been a rare construction in the Icelandic language, it has maintained a similar level of use and productivity. Furthermore, our results show that accusative NPs in PAC survived in Icelandic until the 19th century, many centuries longer than has previously been considered.

PAC in IcePaHC

In this paper, we describe our detailed investigation of the distribution and development of PAC in Icelandic that we conducted in the Icelandic Parsed Historical Corpus (IcePaHC) (Wallenberg et al. 2011). We searched for all possible types of PAC with a single search command and extracted a total of 243 results. The results were then analyzed in R (2021), where the distribution of PAC from the 12th to the 21st century was defined, based on the three types described in (1). We then describe the results in detail, considering issues such as the grammatical function of the NP and the participle and the word order patterns within the construction.

Results

Of the 243 results, 5 were eliminated as they were considered improperly annotated. The remaining 238 results were analyzed in R (2021), where we defined the distribution of PAC from the 12th to the 21st century, based on the three types described in (1). Fig. 1 shows the distribution of the examples by century.

Figure 1. Distribution of PAC by type and century (see the PDF for the figure).
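The cross-tabulation behind Figure 1 can be reproduced in a few lines. The authors worked in R; the following is a Python sketch of the same tabulation, assuming the extracted hits are stored in a CSV with "year" and "type" columns (the file name and column names are hypothetical):

    # Cross-tabulate PAC hits by century and type.
    import csv
    from collections import Counter

    def counts_by_century(path):
        table = Counter()
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                century = int(row["year"]) // 100 + 1  # e.g. 1350 -> 14th century
                table[(century, row["type"])] += 1
        return table

    if __name__ == "__main__":
        for (century, pac_type), n in sorted(counts_by_century("pac_hits.csv").items()):
            print(f"{century}\t{pac_type}\t{n}")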

As shown in Fig. 1, the vast majority of examples were of Type 2b (3), while considerably fewer examples were found of Type 2a (4), and only a few examples were found of Type 1 (5). Most examples of Types 2a and 2b date from the 14th and 17th centuries.

(3) En [að sénu þessu mikla tákni]

but at seen.dat this.dat great.dat sign.dat

lofuðu allir guð og sæla Maríu Magdalenu.

praised all god and blessed Mary Magdalen

‘But having seen this great sign everybody praised God and the blessed Mary Magdalen.’ (ID 1350.MARTA.REL-SAG,.914)

(4) [en að morgni komnum] þá stóð Jesús í sjávarfjörunni.

but at morning.dat come.dat then stood Jesus in seashore

‘but when morning came Jesus was standing on the seashore.’

(ID 1540.NTJOHN.REL-BIB,232.1631)

(5) Og [að mér lifanda] lifir enn hans hamingja.

and at me.dat living.dat lives still his happiness

‘And while I am living his happiness still lives.’

(ID 1300.ALEXANDER.NAR-SAG,.554)

A closer look reveals that most of the examples from these periods (and in the results in general) come from only three publications. There are numerous examples from the 14th century, including 16 examples from the Saga of Bishop Árni (1325) and 16 examples from the Saga of Martha and Mary Magdalene (1350). From the 17th century, quite a few examples were also found, of which 18 come from Jón Ólafsson Indíafari’s Travel Book (1661). In other respects, the number of examples is fairly evenly distributed across the centuries. The increase in examples in the 14th and 17th centuries is therefore probably due more to the personal style of the respective authors than to increased general use of PAC in these periods. These results are consistent with the description of PAC in Modern Icelandic, i.e. that PAC Type 2 is relatively common, in particular Type 2b, which enjoys more productivity than the others. Nevertheless, all three types of PAC can be found throughout the different stages of Icelandic, even though the construction is rare.

Previous studies have maintained that case marking within PAC developed very early on, as examples of dative and accusative NPs have been found in PAC in Old Icelandic, whereas in Modern Icelandic only dative NPs seem to be found (Indriðadóttir and Eyþórsson, to appear, and Eyþórsson and Indriðadóttir 2018). In our search, we found much newer examples of PAC with an accusative NP, such as example (6), which is from the 17th century.

(6) Og [eftir máltíð gerða] var hann kallaður

and after meal.acc done.acc was he called

að fylgja einu líki sem verið hafði kaupmaður

to follow one corpse that been had merchant

til sinnar greftrunar.

to his burial

‘And after the meal he was summoned to a funeral procession to the burial of someone who had been a merchant.’ (ID 1628.OLAFUREGILS.BIO-TRA,.769)

We also found examples that contain the preposition á ‘on’ rather than the usual að, (7)–(8). This is interesting in its own right, as examples of PAC with prepositions other than að have not been discovered before. Moreover, while the overt NP töðuslætti ‘haymaking’ in (7) is in the dative case, the missing NP in (8) is in the accusative, as shown by the form of the participle gert ‘done’:

(7) Anno 1661-1661 [á liðnum töðuslætti]

year 1661-1661 on passed.dat haymaking.dat

brann allur bærinn á Gröf á Höfðaströnd

burned all the-farm on Gröf on Höfðaströnd

hvar biskupinn herra Gísli hafði bú og mikið af fjárhlutum

where the-bishop mister Gísli had estate and much of property

sem þar voru inni.

that there was inside

‘In the year 1661, after haymaking, the whole farm on Gröf on Höfðaströnd, where Bishop Gísli had an estate and much property inside, burned.’

(ID 1725.BISKUPASOGUR.NAR-REL,.943)

(8) Þrællinn kvað honum illa farið að eggja sig til stórræða

the-slave said him badly gone to incite him to big-venture

en svíkja sig [á svo gert ofan]

but betray him on so done.acc down

um frelsi og fé er hann bauð honum til.

about freedom and money that he offered him to

‘The slave said that he had fared badly by inciting him to a big venture and thereupon betraying him of the freedom and money that he had offered him.’

(ID 1830.HELLISMENN.NAR-SAG,.220)

These results are important as this means that accusative NPs in PAC survived in Icelandic until the 19th century, many centuries longer than has previously been considered.

Finally, we considered word order within PAC, which can either be verb–subject (9) or subject–verb (10):

(9) [Að liðnum þessum jólum] var herra Árni biskup

at passed.dat this.dat Christmas.dat was mister Árni bishop

að stóli sínum til vors.

on seat his to spring

‘When this Christmas had passed Bishop Árni was in his bishopric until spring.’

(ID 1325.ARNI.NAR-SAG,.442)

(10) og [að veizlunni endaðri] voru menn með gjöfum útleystir.

and at feast.dat ended.dat were men with farewell-presents parted

‘and when the feast was over the men were given farewell gifts.’

(ID 1675.ARMANN.NAR-FIC,121.1007)

The latter appears to be obligatory if the NP is a pronoun, as in (5). In the paper, we argue that similar rules apply to word order in PAC as in constructions like Object Shift, i.e. an unstressed pronominal object must always precede a sentential adverb and cannot be left in the default object position, or in situ (see Holmberg 1986 and much later work). Our conclusion is supported by the fact that pronouns are generally affected by different word-order rules than other types of NPs in Icelandic syntax (see e.g. Thráinsson 2007:31–37).

While the rules and variation in word order in PAC have not changed over time, the construction tends to be shorter in Modern Icelandic than in older stages of the language, as argued by Indriðadóttir and Eyþórsson (to appear). In our paper, we demonstrate that PAC in older stages of Icelandic allowed much longer and more complex NPs than can be found in Modern Icelandic, and that long and complex NPs can be found preceding the participle (11) or following it (12).

(11) Og svo [að þessu samtali í það sinn öllu enduðu]

and so at this.dat conversation.dat in that time all.dat ended.dat

lét kóngur kalla á sinn eigin víntapparameistara

let king call on his own wine-steward

Kristján Skammelsson.

Kristján Skammelsson

‘and so when this conversation was all over at that time the King had his own wine

steward, Kristján Skammelsson, summoned.’

(ID 1661.INDIAFARI.BIO-TRA,40.375)

(12) Á miðja nátt fyrir framferðartíma sællar Mörtu

on mid night before death-time blessed Martha

[að sofnaðum bræðrum og þeim mönnum sem vöku héldu

at asleep.dat brothers.dat and those.dat men.dat that wake held

með kertum og ljósum] þaut mikill hvirfilvindur(...)

with candles and lights whistled great whirlwind

‘At midnight before the time of death of the blessed Mary, when the brothers and the men who held a wake with candles and lights had fallen asleep, a great whirlwind whistled (...).’

(ID 1350.MARTA.REL-SAG,.703)

The results of this study are robust, providing a detailed historical overview of PAC, which has not been done before, and describing how this construction has been preserved in Icelandic and how it has developed over time. Our study demonstrates how syntactically annotated digital corpora can contribute to the study of the preservation and development of uncommon syntactic phenomena across different stages of a language.

Indriðadóttir-Og að mér lifanda lifir enn hans hamingja -- Rare syntactic phenomena-209.pdf


Three 3D scanners and 13 institutions: How to prioritize projects and spark interest

Hrönn Konráðsdóttir

National Museum of Iceland, Iceland


In the starting stages of the Centre for Digital Humanities and Arts in Iceland, three 3D scanners from Artec were purchased. They were chosen with the idea of being able to scan everything from chess pieces to rooms and houses. The scanners are located at the National Museum of Iceland but are easily movable between institutions. They can also be brought on site when needed.

The scanners serve as a collaborative resource for 13 institutions. The National Museum's role is to utilize the scanners to scan its collections, but also to supervise their lending to other participating institutions and to assist with their use when possible. The initial stages have highlighted the importance of effective project prioritization and institutional engagement. The presentation will delve into the initial phases of the project, addressing the challenges and successes encountered while initiating this venture. Furthermore, it will provide insight into ongoing projects, their current status, and future aspirations.



Exploring Existentialist Design in Digital Humanities: A Case Study of User Experience at the National Library of Norway

Jana Sverdljuk

National Library of Norway, Norway

This paper examines the application of existentialist philosophical principles to user experience design within the Digital Humanities Laboratory at the National Library of Norway (NLN). Over the past decade, the NLN DH Lab has evolved from organizing workshops on Jupyter Notebook tools to developing user-friendly applications tailored to streamline computer-mediated research on cultural heritage materials. Drawing from existentialist philosophy, the paper explores how these applications prioritize exploration, accommodate subjectivity and interpretation, embrace fluidity and multiplicity of meanings, and facilitate engagement with being-in-the-world. By analyzing user personas and their interactions with the applications, the paper demonstrates the profound impact of existentialist design principles on user experiences and interactions within the digital humanities landscape. The paper concludes with reflections on the dynamic relationship between technology design and user engagement, underscoring the role of design principles in shaping meaningful and impactful experiences for diverse user groups.

Sverdljuk-Exploring Existentialist Design in Digital Humanities-172.docx


Digitized language variation for computational dialectology: The Dialect Atlas of Finnish by Lauri Kettunen (1940)

Jenni Santaharju1, Terhi Honkola1, Perttu Seppä2, Kaj Syrjänen1, Unni Leino3, Outi Vesakoski1,4

1University of Turku, Finland; 2University of Helsinki, Finland; 3Tampere University, Finland; 4Turku Institute of Advanced Studies

Our work introduces open access data conversions based on the Dialect Atlas of Finnish (Kettunen 1940) that are suitable for computational analyses of dialectal traits. The dialect atlas was collected by Lauri Kettunen in the 1920s–1930s and describes the spatial variation of Finnish ca. 100 years ago. The data are organised into 213 maps, each describing the areal variation of one (mostly) morpho-phonological feature and identifying which variant(s) of the feature were present in each of the 525 Finnish-speaking municipalities. The dialect atlas remains the most comprehensive dataset of dialectal variation in Finnish and the only data available on its historical linguistic landscape. As the dialect atlas was collected before urbanisation and the mass population movements of WW2, it is suitable for studying, for example, the development of the preindustrial linguistic landscape and the mechanisms driving dialect formation. The dialect atlas has been studied both with traditional (reviewed in Aarikka 2023) and quantitative methods (reviewed in Syrjänen 2016).

A digital version of the dialect atlas can be found at http://kettunen.fnhost.org, and an undocumented version of the data corrected by us is available at http://urn.fi/urn:nbn:fi:csc-kata20151130145346403821. The latter is based on the first digitised version of the dialect atlas by Embleton & Wheeler (1997, 2000), done in collaboration with the Institute for the Languages of Finland (KOTUS) and corrected by us (www.bedlan.net). In this paper we provide the raw data of the dialect atlas readily formatted in different coding schemes, along with additional annotations to further facilitate the use of this valuable linguistic data. We also reconstruct Kettunen's data collection procedure and provide a linguistic classification of the collected traits. Furthermore, we promote the use of the data and FAIR principles by offering different coding schemes, as used in different BEDLAN papers: we have modified the dialect atlas into multiple versions according to the needs of different studies (Syrjänen et al. 2016, Honkola et al. 2018 and Santaharju et al., ms. in revision). We offer not only the master copy and metadata, but also the different versions, all available in the open repository. Furthermore, we contribute to the development of methodology in evolutionary language sciences and computational dialectology by discussing different coding schemes for the language data. Even though the analogies between linguistic and genetic data have been discussed from a theoretical point of view, e.g. in Andersen (2006), Croft (2008) and Pakendorf (2014), the different alternatives for coding linguistic data so that it matches genetic data have received little attention (see Leino et al. 2020).
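To illustrate what such recoding can look like in practice, the following is a minimal sketch (ours, not the BEDLAN pipeline) that turns hypothetical long-format atlas records with columns municipality, feature and variant into a binary presence/absence matrix and a multistate matrix; all column names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical long-format records: one row per attested variant per municipality.
records = pd.DataFrame({
    "municipality": ["Turku", "Turku", "Oulu", "Oulu"],
    "feature":      ["map_001", "map_002", "map_001", "map_002"],
    "variant":      ["a", "ts", "b", "tt"],
})

# Binary (presence/absence) coding: one column per feature:variant combination,
# the kind of matrix that aligns with formats used in evolutionary biology.
binary = (
    records.assign(trait=records["feature"] + ":" + records["variant"], value=1)
           .pivot_table(index="municipality", columns="trait",
                        values="value", fill_value=0)
)

# Multistate coding: one column per feature, the cell holding the variant label.
multistate = records.pivot_table(index="municipality", columns="feature",
                                 values="variant", aggfunc="first")

print(binary)
print(multistate)
```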

Our work and the dataset provided here contribute to the rise of the evolutionary language sciences by adapting approaches from evolutionary biology and the computational sciences to operate with large digitised linguistic datasets. This paper will especially advance computational dialectometry by providing a new data resource to an international audience and by presenting a well-grounded framework for conducting studies within digital humanities.



Using BERT to Study Semantic Variations of Climate Change Keywords in Danish News Articles

Florian Meier

Aalborg University, Denmark

The terms greenhouse effect, global warming, and climate change are often used synonymously in everyday conversations about the warming planet. Utilizing Danish BERT, the study employs masked language model tasks to uncover semantic shifts and variations of these climate change-related keywords in Danish news articles from 1990 to 2021. The findings offer insights into contextual understandings and framing nuances by journalists, contributing to a deeper comprehension of these terms in the Danish media discourse on climate change.
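As a rough illustration of a masked-language-model probe of this kind, the sketch below uses the Hugging Face fill-mask pipeline; the model identifier is a placeholder to be replaced with an actual Danish BERT checkpoint, and the example sentence is invented, so neither reflects the study's actual setup.

```python
from transformers import pipeline

# Placeholder identifier: substitute an actual Danish BERT checkpoint here.
fill = pipeline("fill-mask", model="path-or-name-of-danish-bert")

# Mask the climate keyword in a year-specific context and inspect the model's
# top predictions; changes in these predictions across contexts drawn from
# different years can hint at shifts in how the term is used.
sentence = f"Forskere advarer om konsekvenserne af {fill.tokenizer.mask_token} for landbruget."
for pred in fill(sentence, top_k=5):
    print(pred["token_str"], round(pred["score"], 3))
```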

Meier-Using BERT to Study Semantic Variations of Climate Change Keywords-125.pdf


From Miðgarð to Marvel: Norse Mythology, Augmented Heritage and the Prose Edda

Alan Thomas Searles

University of Iceland, Iceland

This research proposes that the digital integration of cultural heritage in Iceland should expand beyond urban settings to encompass heritage sites, monuments, churches, graveyards and natural landscapes, focusing initially on the research centre at Snorrastofa in Reykholt and its association with Snorri Sturluson, the Prose Edda and Norse Mythology.

In recent decades, digital technologies have increasingly been employed in cultural and heritage spaces in Iceland. Typically these spaces (museums, galleries and libraries) are located in and around Reykjavík and have relatively large visitor numbers. Technologies such as barcodes, QR codes and NFC tags have all been utilised with varying degrees of success in attempts to improve the participatory and experiential engagement of visitors (Werthner et al.). Mobile phones are increasingly being used as interactive devices in cultural and heritage spaces (Lombardi). The capabilities of recent convergent technologies, applications and cloud storage afford opportunities for comprehensive integration of digital interfaces with cultural heritage.

In 2021 the European Commission launched the Digital Decade framework, which includes a strategy, targets and objectives for digital innovation in Europe until 2030. As part of the Digital Decade framework, a number of projects and policies have been initiated. These include the project on Digital Cultural Heritage, focusing on the areas of digitisation, online access to cultural material and digital preservation. According to the Digital Cultural Heritage website, "Cultural heritage is evolving rapidly thanks to digital technologies. The momentum is now to preserve our cultural heritage and bring it to this digital decade." The Cultural Heritage Cloud initiative has been launched with the aim of developing specific digital collaborative tools for the sector while removing barriers for smaller and remote institutions. The Cultural Heritage Cloud aims to add a new digital dimension to cultural heritage preservation, conservation, restoration and enhancement.

This research will use the framework and initiatives currently being implemented by the EU to extend digital cultural mapping and virtual heritage technologies to a traditional museum in rural Iceland. Research indicates that public engagement with a culturally significant encounter, such as visiting a museum or heritage centre, can be greatly enhanced through the implementation of interactive tools and mobile devices (Ruiz-Gómez et al.). These tools can be as simple as QR codes located appropriately to promote user engagement with an interactive experience or as complex as augmented, virtual or extended realities designed to create an immersive experiential environment for users.

The primary objective of the research project will be to connect people visiting places of cultural significance to the literature and history of the place. We hope to increase visitors' understanding of the intimate relationship between the Prose Edda and Norse Mythology and, by extension, their appreciation of the far-reaching and enduring impact of Reykholt on twentieth-century culture, cinema and literature. To achieve this objective, the plan is to implement technologies which will allow individuals to seamlessly interact with specific locations and, through digital interfaces and extended realities, connect those physical locations with medieval Norse literature and mythology.

The proposed research will engage and collaborate closely with Snorrastofa, an independent research centre located in Reykholt in western Iceland, the main residence of Snorri Sturluson (1179–1241). The main goal of Snorrastofa is to facilitate research on the medieval period in general, and on Snorri and his works in particular. Reykholt is the place where the Norse Mythology we know today was written down and compiled into a single book for the first time, around 800 years ago.

A number of innovative interactive digital projects have been implemented at heritage sites in Iceland in recent years. The Find the Past project at Thingvellir National Park and the Back to Hofsstaðir project in Garðabær both utilise WebXR technology to create an extended virtual environment which visitors can access via mobile devices. While this technology is immersive, it is still a passive experience for users. A more interactive experience is being offered at the 1238 visitor centre at Sauðárkrókur in north Iceland, where guests can take part in the Battle of Örlygsstaðir using virtual reality headsets and haptic feedback breastplates. Three-dimensional holographic images linked to specific locations via QR codes are being investigated by the town of Hafnarfjörður, and a 360-degree scanned map of the statue garden at the Einar Jónsson museum has been created by the engineering firm Efla and made available online to schoolchildren. These and other projects indicate that the technology and expertise are available in Iceland for private interests and tourist companies to implement digitally enhanced experiences for tourists and visitors to heritage sites. This project will attempt to apply some of these tools, technologies and expertise in a digital humanities project focusing on virtual heritage. While the concept of digital cultural heritage is not new, the technology and tools to implement the ideas developed in recent decades have reached a point where the theory can now be practically implemented (Salek Farokhi and Hosseini).

According to the European Commission website on Culture and Creativity, "cultural heritage … encompasses a broad spectrum of resources inherited from the past in all forms and aspects", and "Research and innovation nurture smart and technologically advanced solutions to help Europe protect and promote its cultural heritage."

The EU project “Integrated e-Services for Advanced Access to Heritage in Cultural Tourist Destinations (ISAAC)” confirmed that adopting innovative ICT could enhance and improve the promotion of local cultural heritage and improve cooperation across sectors and research disciplines.

During the implementation of this project I will draw on the work of Erik Champion and his extensive research on the concept of virtual heritage, as well as following EU guidelines for Digital Cultural Heritage projects. The project is an attempt to address a gap in digital humanities research in Iceland, specifically in virtual heritage studies, and it will hopefully contribute a structure which future studies of digital heritage and cultural studies may utilise and benefit from.

 
Date: Friday, 31/May/2024
8:45am - 10:15amDARIAH DAY: PANEL
Location: H-207 [2nd floor]
Session Chair: Olga Holownia, IIPC, United States of America

Edward Gray & Vicky Garnett: DARIAH 

Guðbjörg Andrea Jónsdóttir: Ministry of Higher Education, Science and Innovation 

Guðmundur Hálfdanarson: The Icelandic Infrastructure Fund 

Starkaður Barkarson: CLARIN-IS 

Eiríkur Smári Sigurðarson: CDHA

 

8:45am - 10:15amSESSION#14: COLLECTIONS AS DATA & DATA QUALITY
Location: H-205 [2nd floor]
Session Chair: Mahendra Mahey, Tallinn University, Estonia
 
8:45am - 9:15am

Data Availability and Evaluation Reproducibility for Automatic Date Detection in Texts, a Survey

Tommi Jauhiainen

University of Helsinki, Finland

Automatic date detection in texts has been applied to various languages and periods in research related to digital humanities. In this article, we investigate the availability of the datasets used in published experiments that include a date detection component. We take a closer look at c. two dozen of the newest articles. Primarily, for each article, we examine the possibility of acquiring the dataset used, based on the information presented in the article itself. Secondarily, we examine the possibility of reproducing the exact evaluation setting described in the articles, e.g., the possibility of dividing the dataset into the same training and testing portions. We find that, as far as the datasets are concerned, we would be able to reproduce the evaluation setting of four of the articles using the information in the articles themselves.

Jauhiainen-Data Availability and Evaluation Reproducibility for Automatic Date Detection-167.pdf


9:15am - 9:45am

Source criticism, bias, and representativeness in the digital age: A case study of digitized newspaper archives

Jørgen Burchardt

Museum Vestfyn, Denmark

Historians must critically scrutinize their sources, a task further complicated in the digital age by the need to evaluate the technical infrastructure of digital archives. This article critically examines digital newspaper archives, revealing error rates in optical character recognition (OCR) that compromise result reliability, and word frequency-based datasets that introduce biases due to issues in the shaping of the OCR corpus and later post-processing. Beyond technical issues, copyright restrictions hinder access to crucial newspapers, while incomplete archives pose representativeness challenges. Accessing datasets from different countries is cumbersome. Commercial archives are costly, and uneven publication rates necessitate corrections over time. The use of digital archives presents new exercises: the researcher needs to explain the reliability of the digital source, which often can only be achieved in interdisciplinary working groups. The digital archives must ensure transparency by detailing to researchers the technical manipulations performed on the original source.

Burchardt-Source criticism, bias, and representativeness in the digital age-205.docx


9:45am - 10:00am

Engineering 19th century poetry from the National Library of Norway's Collection

Lars Magne Drønen Tungland, Ingerid Løyning Dale

National Library of Norway, Norway

The Digital Humanities support unit (DH-lab) of the National Library of Norway is providing data and digital analysis tools in the EU-funded NORN project, led by the Ibsen centre and the University of Oslo. NORN analyses and reinterprets 19th-century Norwegian literature through national romanticism. It merges fields like cultural and critical race studies with both qualitative and quantitative methods from literary science, history, and digital humanities. The project aims to explore the emotional aspects of national romanticism and objectively analyse literary works to critique literary history.

A segment of the NORN project is dedicated to examining 19th-century Norwegian poetry. We present here insights from the process of extracting and analysing 19th-century Norwegian poetry from our digital text collection. The genre poses particular challenges for quantitative analysis and established NLP methods, due to a combination of visual, phonemic, semantic and syntactic structures that are very different from typical NLP data such as reviews, news articles or novels.

This paper offers insights from the DH-lab's perspective. Our aim has been to operationalize the structures we find in poems published in the 1800s through rigorous text mining, to create a comprehensive dataset with quantitative metrics. The dataset presents the texts in enriched, annotated, and aggregated structured formats which can be analysed by researchers in a useful, interesting, and perhaps surprising way, in addition to the standard literary practice of close reading. The core of our discussion revolves around the NLP-engineering challenges encountered and the methodologies employed to provide researchers with actionable data.

The dataset has been designed for interoperability with the DH-lab's suite of tools for corpus analytics, fostering an integrated research environment for digital humanities scholars. Enrichments to the texts in the dataset include text and token classification such as topics, sentiment analysis, named entities (Nielsen, 2024), as well as phonemic transcriptions using supervised machine learning models. We have used traditional N-gram language models, transformer models such as BERT, and experimented with newer Large Language Models (LLMs) such as Llama 2. An important step involved utilising the Norwegian Language Bank's grapheme-to-phoneme (G2P) models (‘Grapheme-to-Phoneme Models for Norwegian’, n.d.) and training new models to phonemically transcribe the nuances of 19th-century written Norwegian, enabling us to annotate rhymes in the poetry corpus.
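The rhyme-annotation step that follows the phonemic transcription could, in much simplified form, look like the sketch below; the transcriptions are invented, and rhyme is approximated as identity of the final two phonemes rather than the proper criterion from the last stressed vowel onward.

```python
# Minimal rhyme-annotation sketch: assumes phonemic transcriptions of line-final
# words are already available (e.g. from a grapheme-to-phoneme model).
line_final = {
    1: ["s", "k", "o", "g"],   # hypothetical transcriptions of line-final words
    2: ["b", "o", "g"],
    3: ["h", "a", "v"],
    4: ["g", "a", "v"],
}

def rhymes(a, b, tail=2):
    """Crude rhyme test: the last `tail` phonemes are identical."""
    return a[-tail:] == b[-tail:]

pairs = [(i, j) for i in line_final for j in line_final
         if i < j and rhymes(line_final[i], line_final[j])]
print(pairs)  # [(1, 2), (3, 4)] -> an AABB-like pattern for this toy stanza
```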

Further, we have used word embeddings, e.g. Word2vec (Mikolov et al., 2013), to assess word similarity within the poetry corpus, juxtaposing it with general 19th-century Norwegian language usage to unravel unique poetic vocabularies. Our team has also experimented with automatic enjambment detection and the creation of word clusters using co-occurrence analysis, such as collocations (Johnsen, 2021) and LDA (Blei et al., 2003). We have developed a system for sentiment analysis specific to 19th-century poetry, and also experimented with mapping the frequency of words related to emotions.
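A minimal sketch of such a comparison with gensim: one embedding space is trained on (toy) poetry lines and one on a (toy) general-language sample, and a word's nearest neighbours are contrasted across the two spaces. The corpora and hyperparameters here are placeholders, not the project's.

```python
from gensim.models import Word2Vec

# Toy tokenized corpora standing in for the poetry corpus and a general
# 19th-century reference corpus.
poetry_sents = [["hjerte", "banker", "mod", "natten"], ["natten", "synger", "stille"]]
general_sents = [["natten", "var", "mørk", "og", "kold"], ["mod", "morgenen", "kom", "lys"]]

poetry_model = Word2Vec(poetry_sents, vector_size=50, min_count=1, window=3, epochs=50)
general_model = Word2Vec(general_sents, vector_size=50, min_count=1, window=3, epochs=50)

# Comparing a word's nearest neighbours in the two spaces hints at
# poetry-specific usage of the word.
word = "natten"
print(poetry_model.wv.most_similar(word, topn=3))
print(general_model.wv.most_similar(word, topn=3))
```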

This paper does not delve into the results of the dataset analysis but focuses on the journey of dataset creation, the development of analysis tools, and identifying measurable aspects within the poems. We discuss the lessons learned and challenges encountered, such as dealing with OCR errors, interpreting the visual formatting of poems, understanding genre-specific linguistic features, and addressing domain effects of off-the-shelf tools.

In sum, our work not only underscores the challenges in treating poetry as data for NLP analysis but also highlights the solutions and tools we have developed, contributing findable, interoperable and reusable resources to the field of digital literary studies.



10:00am - 10:15am

Publishing 10 000+ archival index cards of modern cultural heritage as open (as open as possible) data: challenges with OCR, data cleaning, restricted data and maintenance

Niklas Alén, Maria Niku

The Finnish Literature Society, Finland

The Finnish Literature Society (SKS) has recently published online a large, unique collection of modern cultural heritage, which is preserved at the SKS archives. The web publication can be accessed at https://aineistot.finlit.fi/exist/apps/harju (in Finnish).

Johan K. Harju (1910–1976) was a major collector of heritage for SKS, who recorded the lives and traditions of marginalised people in society. He documented urban culture, alcoholics' circumstances and life in prisons at a time when the emphasis of heritage collection was still on the countryside and collection in urban environments was only just beginning. The collection, stored on archival index cards, contains personal accounts of homelessness and of alcohol substitutes, local stories from Helsinki inner-city districts, interviews conducted at night shelters, prison and sanatorium traditions, jokes, and youth and wartime reminiscences. The digital edition contains the digitised cards and their OCR-produced transcriptions and metadata, both fully searchable.

The publication project was a test run of sorts for SKS. The goal was to develop effective and as automated as possible ways to publish large-scale archival collections that may contain over 100 000 units, with their text and metadata fully searchable. The largest part of this was finding OCR solutions that would make it possible to extract both the text and the metadata of the cards and store them in publishable, standard-conforming XML/TEI 5. Our paper examines the various challenges encountered and solutions found during the project: OCR, automated processes for data cleaning, the various aspects to be considered with the restricted parts of the data, requirements for the database and frontend for this kind of digital edition, and technical solutions for continued database maintenance. Our paper thus falls under the third special theme of the conference, the life cycle of digital humanities and arts projects, and in particular its sub-theme "Creating and using cultural heritage collections as data: workflows, checklists, tools".

In order to develop an effective OCR solution, the structure of the material was first carefully analysed. It was quickly determined that the material followed a specific pattern in which regions of interest were always positioned in specific areas of the card: for example, a geographic collection code, if present, was always in the upper right-hand corner of the card; titles were often clearly separated from the rest of the text; and the main text was also mostly separated from the metadata on the card. Based on this information a C++ programme was developed using open source libraries. The programme performed card analysis, content segmentation and OCR using the Tesseract OCR engine's API. The result was then serialized to XML/TEI 5 using libxml2's API.
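The project's programme was written in C++ against the Tesseract and libxml2 APIs; purely as an illustration of the same region-based approach, here is a small Python sketch using pytesseract and Pillow, with invented region coordinates and a deliberately minimal TEI-like serialization.

```python
from xml.sax.saxutils import escape
from PIL import Image
import pytesseract

# Hypothetical fixed regions of interest on a scanned index card, as pixel boxes
# (left, upper, right, lower); the real project derived its regions from an
# analysis of the card layout.
REGIONS = {
    "collection_code": (1600, 50, 2000, 200),   # upper right-hand corner
    "title":           (100, 50, 1500, 250),
    "main_text":       (100, 300, 2000, 1400),
}

def ocr_card(path, lang="fin"):
    """Segment one card image into regions and OCR each region separately."""
    image = Image.open(path)
    return {name: pytesseract.image_to_string(image.crop(box), lang=lang).strip()
            for name, box in REGIONS.items()}

def to_tei(fields):
    """Deliberately minimal TEI-like serialization of the recognised fields."""
    return (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div type="card">'
        f'<head>{escape(fields["title"])}</head>'
        f'<note type="collection">{escape(fields["collection_code"])}</note>'
        f'<p>{escape(fields["main_text"])}</p>'
        "</div></body></text></TEI>"
    )

# Example: print(to_tei(ocr_card("card_0001.jpg")))
```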

In J. K. Harju's case, the digitised collection of archival cards contained a large number of duplicates, which needed to be removed before publication. Due to the scale of the collection it was essential to find an automatic solution for this data cleaning. Initially the possibility of removing duplicates based on their metadata was considered, but due to OCR errors it was deemed infeasible. A string-similarity-based approach was devised instead. Two factors were important in choosing a string similarity algorithm: it should have good performance, and it should account for small variations in the text. The cosine similarity algorithm was chosen based on these criteria. A C++ programme was then developed to perform this task.
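The sketch below illustrates the general idea of similarity-based deduplication using character n-gram TF-IDF vectors and cosine similarity; the threshold, n-gram range and toy transcriptions are assumptions, and the project's own C++ implementation may differ in detail.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cards = [
    "Yöllä majassa kerrottiin tarinoita",   # toy transcriptions
    "Yölla majassa kerrotiin tarinoita",    # near-duplicate with OCR errors
    "Vankilassa laulettiin usein",
]

# Character n-grams are tolerant of small OCR variation between copies.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(cards)
sim = cosine_similarity(vectors)

THRESHOLD = 0.8   # assumed cut-off; the project's actual value is not given
duplicates = [(i, j) for i in range(len(cards)) for j in range(i + 1, len(cards))
              if sim[i, j] >= THRESHOLD]
print(duplicates)
```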

The goal with publishing large collections should be to make them available as open data, in order to benefit research in the broadest way possible. Open data is the only way the collections will be of use for research that utilises digital humanities methods. Open data is not an issue when the documents are from the 19th century or older. The Harju collection, however, contains about 2,000 cards whose informant was born less than 100 years ago or whose birth date is not known. Such documents cannot be published without restrictions.

The decision was therefore made to publish the unrestricted part of the data (10 000+ cards) fully openly, with the entire data available for download and further use under CC BY 4.0, and to find a way to make the restricted part available to researchers in a more limited manner. This part of the process involved devising mechanical solutions for separating the restricted part of the data from the open part, different requirements for the web interfaces of both, as well as implementing continued maintenance (updating the open edition with cards as they can be opened to the public).

In order to efficiently create the different datasets and update the public online publication, a couple of C++ programmes were written. One compared the images and XML files to a list of persons and their corresponding birth dates and created the different sets based on this information. The second programme is meant to update the openly accessible publication yearly. When developing this programme, security was our first concern. To this end the programme employs multiple features to ensure safe data transfers. The programme uses the aforementioned list of persons to update the public publication yearly with data from persons born 100 years before the year in question.
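The selection logic behind the yearly update can be sketched as follows (in Python rather than the project's C++, and with hypothetical field names): a card becomes releasable once its informant was born at least 100 years before the current year, and cards with unknown birth dates stay restricted.

```python
from datetime import date

# Hypothetical card register: the informant's birth year may be unknown (None).
cards = [
    {"id": "harju_0001", "birth_year": 1898},
    {"id": "harju_0002", "birth_year": 1931},
    {"id": "harju_0003", "birth_year": None},
]

def releasable(card, year=None):
    """A card joins the open edition once the informant was born 100+ years ago;
    cards with unknown birth years stay restricted."""
    year = year or date.today().year
    return card["birth_year"] is not None and card["birth_year"] <= year - 100

open_now = [c["id"] for c in cards if releasable(c)]
restricted = [c["id"] for c in cards if not releasable(c)]
print(open_now, restricted)
```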

A good digital edition for data like the Harju collection requires powerful search features and effective facets and filters for browsing and filtering the data, with metadata fields chosen for indexing, and facets and filters designed in a way that best serves both the data and the users. SKS has for the past few years used the open source XML database and software development platform eXist-db (https://exist-db.org) for smaller XML/TEI 5-based editions, with the TEI Publisher application (https://teipublisher.com) used as a basis for building the editions. As eXist-db has a robust full-text search and range index based on Apache Lucene, it was found to be suitable for a larger-scale edition like J. K. Harju as well. The only concern was how the indexing would handle a large number of small documents. However, this was found not to be an issue once a sufficient amount of RAM was allotted to the indexing.

In the open data Harju edition, users can filter the catalogue with four facets (region, location, year and informant's name) or alternatively with searches in a number of metadata fields. The full-text search covers the text content as well as metadata. All the data of the edition is available for download and further use in XML and CSV. Users can download the entire data or parts of it filtered with relevant facets or metadata/full-text searches. The restricted Harju edition contains the entire collection, including the restricted data of c. 2,000 cards, and is accessible only on SKS's intranet. In terms of features it is similar to the open version but does not have data download options.

 
8:45am - 10:15amSESSION#16: BIBLIOGRAPHIC DATA ANALYSIS
Location: K-206 [2nd floor]
Session Chair: Max Odsbjerg Pedersen, Royal Danish Library, Denmark
 
8:45am - 9:15am

The multilingual cultural history in national bibliographies: the Baltic case 1800-1940

Peeter Tinits1,2, Krister Kruusmaa2, Laura Nemvalts2

1University of Tartu, Estonia; 2National Library of Estonia

Introduction

National bibliographies are collections of data that gather information on printed publications broadly connected to a particular country or cultural community (e.g. published within a country's modern borders or by its diaspora abroad). Composed mostly for cataloguing, these collections have recently come to be used as datasets for cultural historical studies, as an aspect of bibliographic data science (e.g. Lahti et al. 2019). While the coverage of national bibliographies can vary, the register of known published books and the associated metainformation can prove a valuable source of data for studies in cultural history.

In this talk, we explore the cultural processes surrounding the national awakening and independence of the Estonian language community during the 19th century with the help of the Estonian National Bibliography. We expand the study by exploring the same dynamics in the Latvian National Bibliography, both to broaden the coverage for the Estonian language community and to look for similar patterns in the Latvian language community. 19th-century European book publishing was transnational, and cultural communities do not always follow modern national boundaries, so a good use of national bibliographies for cultural historical research may have to rely on data from different countries.

Historical context

In the 19th century, Estonians and Latvians were mostly rural communities with low cultural prestige, situated in the three Baltic governorates of the Russian Empire: Estonia, Livonia and Courland. These areas were politically and culturally dominated by a minority population of Baltic Germans, who had a special status within the Russian Empire as the local administration. In the context of the Russian Empire, the Baltics had an exceptionally high level of literacy, due to German and Scandinavian influences in education (Raun 2017). By the early 1800s, more than 70% of the rural population could likely already read, though mostly in the context of learning religious texts. The majority of books written in Estonian or Latvian were written by the local Baltic German elite, either for religious or administrative purposes or as part of a national romantic project aimed at the locals.

During the 19th century, this situation changed. Gradually, Estonians and Latvians came to enter the sphere of written communication by publishing books, newspapers and other works. The teaching of reading and writing skills became more practical, aimed at everyday communication. It became common for Estonians and Latvians to read and write in their native language, which became the language of choice instead of German, Latin or Russian. As part of local national romantic movements, strong language-based communities were established, eventually leading up to the declarations of independence from the Russian Empire by Estonia and Latvia in 1918.

Notably, these events took place in a multilingual cultural space, where individuals had some agency in which language and cultural communities they participated in and oriented towards. For example, an Estonian with intellectual aspirations in the 1860s was likely to aim to participate in the German or Russian cultural context, as these major cultures offered a wider audience and a richer cultural heritage. Thus, the transition towards a preference for Estonian or Latvian as the language of choice for the native elite provides an interesting case of a community's agency in choosing its path.

The study

Here we study the processes behind this transition with a few substantive questions.

- How were the transitions in language choice influenced by major political decisions made at the time?

  • For example, in the 1880s the Russian Empire started a process of Russification, making Russian the primary language in schools and administration in many places, specifically targeting, among other things, the cultural dominance of Baltic Germans in the Baltic governorates. Did this influence the language of choice for the books published in Estonia?

- How did the movements of national romanticism influence the demographics of the intellectual community?

  • For example, in Estonia and Latvia, a strong movement of local national romanticism emerged in the 1870s. This was partly due to increased economic and social freedoms, partly due to improved education, and partly due to the spread of influential ideas. Can we see, as part of this emergence, the book publishing community becoming younger as the new young people join, or is it more of a choice for each author in regular generational transitions?

- Did the local intellectual communities grow organically out of the cultural interests of Baltic German elites, or were the communities formed fairly independently from them?

  • For example, the major works written about Estonian culture and language in the 1820s were all written by Baltic German ‘Estophiles’ for whom Estonian language was learned as a second language and more as an intellectual pastime. In the 1860s, many writings were already written by native Estonians. Did this transition happen gradually, as the new writers joined the existing networks of authors and publishers or did the new writers form their own communities, independent of the Baltic German predecessors?

We make use of the data within the national bibliographies to tackle these questions.

Data

We rely on the records of the Estonian National Bibliography from 1800-1940 (n = 39,442) and exploratorily on the records of the Latvian National Bibliography from 1800-1940 (n = 40,123). The data are estimated to have very good coverage of all the known books published then. We transform the library records to allow for cultural data analysis and enrich the datasets in a few ways.

Based on the titles of the books, we add information on the language, where it is not given in the dataset. We harmonize place names and publishers associated with the books using rules and fuzzy matching. We rely on links made to VIAF in these collections to combine different pseudonyms so that each person is represented with an individual id. Where possible, we augment the author names with information on the birth places, to build a separate subset of locally born authors. Due to conventions of the time, the publishers are sometimes listed as people, sometimes as organizations associated with the works; we harmonize these fields for major publishers for network analysis.
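As a small illustration of rule-plus-fuzzy harmonization, the sketch below combines an explicit rule table for historical name forms with standard-library fuzzy matching against a (hypothetical) authority list; the cutoff and lists are assumptions, not the project's actual resources.

```python
import difflib

# Hypothetical authority list of harmonized place names and a small rule table
# for historical (e.g. German) name forms that fuzzy matching cannot catch.
CANONICAL = ["Tartu", "Tallinn", "Riga", "Pärnu"]
RULES = {"Dorpat": "Tartu", "Reval": "Tallinn"}

def harmonize(raw, cutoff=0.8):
    """Map a raw place-name string to a canonical form via rules, then fuzzy
    matching; return the raw value unchanged when nothing is close enough."""
    if raw in RULES:
        return RULES[raw]
    match = difflib.get_close_matches(raw, CANONICAL, n=1, cutoff=cutoff)
    return match[0] if match else raw

for name in ["Dorpat", "Tarto", "Reval", "Tallinna"]:
    print(name, "->", harmonize(name))
```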

Working with two bibliographies brings with it an extra challenge and opportunity: some authors and books are present in both. Thus, sometimes we can use one to get a more complete picture of the works by an author present in another. At the same time in combining the data, some entries may be given in duplicate. To work with both datasets we rely on VIAF links to combine instances of authors across the bibliographies, and on partial matching of book metadata to find instances of duplicate books across the bibliographies.

Analysis

To understand the role historical events played in language choice, we calculate the share of books in each language for each year. Here, we rely on the language data augmented by information in book titles. We see a gradual growth in publications in Estonian across the time period, establishing relative dominance over other languages. Notably, we can see how the Russification policies did substantially increase the share of books in Russian, but only at the expense of German. The share of books in Estonian kept a stable growth throughout this period. This suggests that the mechanisms responsible for growing the community had become somewhat independent from the administrative circumstances. For example the switch in the primary language of teaching in schools from Estonian to Russian did not stop this growth. A major transition occurred with the independence from the Russian Empire that led the Estonian language to dominate the books published, while other languages kept a stable minority position. Here, the importance of political events can be seen as Estonian took on more functions in the community.
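Computing the yearly language shares from bibliography records is straightforward once the language field is filled in; a minimal sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical bibliography records: publication year and language of each book.
records = pd.DataFrame({
    "year":     [1885, 1885, 1885, 1886, 1886, 1886],
    "language": ["et", "de", "et", "et", "ru", "et"],
})

# Share of books in each language per year (each row sums to 1).
shares = pd.crosstab(records["year"], records["language"], normalize="index")
print(shares)
```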

To understand the role of generational shifts in the growing status of local publications, we look at the age and the origin of the contributors named for each book. This includes authors, but also illustrators, publishers, and translators. As part of an international community, a number of the named authors originate from past eras (Aristotle, Martin Luther); we exclude them by including only authors born between 1750 and 1920. In order to focus on the local community, we run a separate analysis in which we exclude the original authors of translated works and, as an additional constraint, consider only contributors born locally in the Baltic area. Here, we see how the age of the average contributor to a book decreases as new contributors join the community, specifically during the period of national romanticism. This suggests that the national romantic movement, not just among its leaders but among many contributors, was driven by the younger members of the community, rather than new ideas simply becoming popular in a regular transition between generations.
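The corresponding age analysis can be sketched as follows, again with hypothetical column names: contributors are restricted to birth years 1750-1920 and the mean age at publication is tracked per decade.

```python
import pandas as pd

# Hypothetical contributor records: publication year and contributor birth year.
df = pd.DataFrame({
    "pub_year":   [1860, 1860, 1875, 1875, 1890],
    "birth_year": [1795, 1832, 1846, 1850, 1865],
})

# Keep contributors born 1750-1920 (excluding e.g. Aristotle or Luther),
# then track the mean age at publication per decade.
df = df[df["birth_year"].between(1750, 1920)].copy()
df["age"] = df["pub_year"] - df["birth_year"]
df["decade"] = (df["pub_year"] // 10) * 10
print(df.groupby("decade")["age"].mean())
```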

To understand more precisely how the new community came to be formed, we look at the choices of language use for individual authors and their roles in local social networks. We look at the share of books in each language for individuals and track who formed the emerging linguistic communities: were they authors publishing mostly in other languages and only sometimes in the local language, or authors publishing mostly in the local language but also in other languages? Our analysis shows the community becoming increasingly dependent on authors writing mostly in the local language. To track this in detail, we perform a network analysis based on the coauthorship patterns in the dataset. We track the central actors over time as measured by betweenness centrality and closeness centrality. We also look at the personal networks of major actors across the time period. With this information we can see how the central figures become increasingly Estonian-dominated, but also how the Estonian community relies on a few actors very active in Estonian and other languages.
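A minimal sketch of the centrality part of such a network analysis with networkx; the co-contribution edges below are invented for illustration and do not come from the bibliographies.

```python
import networkx as nx

# Toy co-contribution edges (two contributors named on the same book);
# in the real analysis these would be derived from the bibliography records.
edges = [("Jannsen", "Koidula"), ("Koidula", "Kreutzwald"),
         ("Kreutzwald", "Faehlmann"), ("Jannsen", "Hurt")]

G = nx.Graph(edges)
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)

# Rank potential "broker" figures in the emerging publishing community.
for node in sorted(G, key=betweenness.get, reverse=True):
    print(f"{node}: betweenness={betweenness[node]:.2f}, closeness={closeness[node]:.2f}")
```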

We perform the main analysis on Estonian, and report on preliminary analysis of the same questions for Latvian National Bibliography. We are able to complement the study of the Estonian community with the information on Baltic Germans available in the Latvian dataset, and at the same time are able to compare the development of the Estonian community with the Latvian community as shown in the bibliographies. We report on the differences between the datasets and the process of joining the two datasets.

Conclusions

We used the book records in the Estonian and Latvian National Bibliographies to study the cultural history of the Estonian and Latvian communities. Specifically, we looked at the share of languages in the context of major political events, the demographics of the contributors involved, and the role of individuals and social networks in these changes. We found that bibliographic data allows us to systematically address cultural historical questions (e.g. the choice of language in book publication) and offers new ways to measure historical trends previously described only in general terms (e.g. the demographics of individuals participating in the emerging literary community).

We argue that bibliographic data can become a very valuable source for approaching cultural historical questions. These datasets may provide fairly comprehensive overviews of the cultural activities of particular communities and thus also allow an easy way to compare these communities. As stated by the proponents of bibliographic data science (e.g. Lahti et al. 2019), this can be accomplished by combining methods from data science, digital humanities and cultural history. The research interest may in turn facilitate the development and enrichment of these datasets, mostly held by national libraries, helping them find further value in their collections.



9:15am - 9:45am

Fruchtbringende Gesellschaft (1617-1680) Member Publication Patterns in the VD17

Eetu Mäkelä1, Thea Lindquist2, Narges Azizifard1, Julius Arnold2

1University of Helsinki, Finland; 2University of Colorado Boulder, USA

This paper presents the first results of a larger project that draws on large-scale data analysis to investigate the publication patterns and networks of the 890 members of the Fruchtbringende Gesellschaft (1617–1680), or Fruitbearing Society, the first and largest cultural society in early modern Central Europe. First, we elucidate the major steps, including data wrangling, evaluation, clean-up, and algorithmic enrichment, necessary to transform the already high-quality VD17 bibliographic database into research data. Then, we relate the first results of our investigation of the publication patterns of Society members, which brings more nuance to the existing narrative of a society that shifted from focusing on the literary and linguistic aspects of its agenda in the first period (1617–1650) to a more courtly one in the later periods (1651–1662/67) of its existence.

Mäkelä-Fruchtbringende Gesellschaft-148.pdf


9:45am - 10:00am

Unleash the Apparatus? Towards a shared representation of knowledge about connections between primary sources

Jacob Langeloh

University of Copenhagen, Denmark

Philological text editions that are 'born digital' can and should be linked to other primary sources and, ideally, this happens within an ecosystem of clear and short IRIs. Yet, linking texts with each other is and has been a common practice in any form of critical editing: the source apparatus under the text draws parallels between the text being edited and other primary texts.
This paper explores the possibility of mining the source apparatuses of existing editions, of enhancing them with further scholarship, and thereby of democratizing the knowledge about intertextual connections. Based on the analysis of existing proposals and on the experiences of the OTRA project (Ontology for Re-Use and Argumentative Patterns), I propose five requirements that make it possible to document the knowledge that is already contained in the source apparatus. Setting up this infrastructure on the basis of Linked Open Data would allow its creation to be distributed and its hosting to be federated among different actors such as libraries, archives, and museums.

Langeloh-Unleash the Apparatus Towards a shared representation-157.pdf
 
8:45am - 10:15amSESSION#15: LINGUISTIC ANALYSIS
Location: K-207 [2nd floor]
Session Chair: Matti La Mela, Uppsala University, Sweden
 
8:45am - 9:15am

Gly2Mdc v.2.0: Lessons Learned from Building a Tool for Hieroglyphic Texts

Heidi Annika Jauhiainen

University of Helsinki, Finland

In order to advance digital methods in Egyptology, machine-readable hieroglyphic texts are needed. While machine-readable cuneiform texts have been extensively employed in Assyriological studies, the intricate nature of the hieroglyphic script poses challenges in creating accessible corpora. Dedicated hieroglyphic text editors are used to produce pictures of the texts, with signs placed correctly above and next to each other in a kind of box arrangement. The pictures are used in publications, but the machine-readable project files are generally discarded. In this paper, I introduce Gly2Mdc v.2.0, a tool designed to transform the .gly files containing encoded hieroglyphic texts into a more human-readable format. The tool extracts and cleans the encoding and offers users options for saving the text in different formats. The aim is to give the users of hieroglyphic text editors a chance to publish their texts in machine-readable form as well and to increase the amount of text available for building digital methods. Challenges faced in developing this tool are discussed, including the impossibility of achieving a faithful rendition of the original text in machine-readable form and the challenges of converting the encoding to Unicode.

Jauhiainen-Gly2Mdc v20-183.pdf


9:15am - 9:30am

An unexpected gender-agreement pattern in Icelandic

Einar Freyr Sigurðsson1, Oddur Snorrason2, Ása Bergný Tómasdóttir3

1The Árni Magnússon Institute for Icelandic Studies, Iceland; 2Queen Mary University of London; 3University of Iceland

This paper examines gender-agreement variation for Icelandic sports-team names using the Icelandic Gigaword Corpus. Feminine and masculine sports-team names, such as Keflavík and Fjölnir, respectively, allow two different agreement patterns: (a) the expected (feminine/masculine) gender agreement corresponding to the gender of the team name, see (1) below, or (b) unexpected neuter agreement, see (2) below.

(1) Fjölnir er fallinn ʻFjölnir.MASC is relegated.MASC’

(2) Fjölnir er fallið ʻFjölnir.MASC is relegated.NEUT’

Interestingly, our corpus results reveal that the vast majority of the examples show neuter agreement, i.e., 80% of the total number. It is unclear how to account for this unexpected gender-agreement pattern. We discuss a few possible explanatory factors.

Sigurðsson-An unexpected gender-agreement pattern in Icelandic-215.pdf


9:30am - 9:45am

Word of the year 1919: Conveying the media’s favorite annual linguistic parlor game to a different era

Steinþór Steingrímsson, Einar Freyr Sigurðsson, Starkaður Barkarson, Atli Jasonarson, Ágústa Þorbergsdóttir

The Árni Magnússon Institute for Icelandic Studies, Iceland

In Iceland, the word of the year is chosen annually, both by the Icelandic National Broadcasting Service and by the Árni Magnússon Institute for Icelandic Studies (AMI). We explore the possibility of doing the same for a year more than 100 years in the past. We try to apply the same methods as AMI uses for our own time. This approach has various limitations, which we discuss, and raises many questions, such as how well texts from journals and periodicals reflect the actual word use of the time.
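The abstract does not spell out the scoring used; one common approach is to rank words by how much their relative frequency in the target year exceeds that in a reference corpus, as in this toy sketch (counts, candidate words and smoothing are invented).

```python
from collections import Counter

# Toy token counts for 1919 and for a reference period (placeholders for the
# real newspaper and periodical corpora).
counts_1919 = Counter({"fullveldi": 40, "spánska": 35, "veikin": 30, "og": 5000})
counts_ref  = Counter({"fullveldi": 5, "spánska": 2, "veikin": 4, "og": 52000})

total_1919 = sum(counts_1919.values())
total_ref = sum(counts_ref.values())

def keyness(word, smoothing=0.5):
    """Relative-frequency ratio of a word in 1919 vs. the reference corpus
    (a simple stand-in for whatever scoring the institutes actually apply)."""
    f_target = (counts_1919[word] + smoothing) / total_1919
    f_ref = (counts_ref[word] + smoothing) / total_ref
    return f_target / f_ref

candidates = sorted(counts_1919, key=keyness, reverse=True)
print(candidates[:3])
```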

Steingrímsson-Word of the year 1919-225.pdf


9:45am - 10:00am

Analysing lexical cohesion and topicality in online interaction

Antti Kanner, Anna Vatanen, Eetu Mäkelä

University of Helsinki, Finland

Free forms of conversation are characterised by informal flow of topic and theme, where the original or temporally first topic does not necessarily constrain the following turns of conversation. Topic shifts in spoken language have been studied in interactional linguistics, especially in Conversation Analysis (CA), where the concept of topicality encompasses thematic structuredness and progression as well as the means the speakers use to manage these structures (for an overview, see Couper-Kuhlen & Selting 2018: 312-328).

The concept of lexical cohesion, on the other hand, was most famously introduced by Halliday & Hasan (1976), according to whom lexical cohesion in texts is upheld by lexical selections that are somehow, in whatever way, predictable from selections made earlier. Intuitively, words belonging to the same conceptual spheres are often found in the same segments of discourse. That conceptual sphere then forms one aspect through which the subject matter of the discourse can be characterised. This intuition is often exploited in computational approaches to topicality, such as topic modelling. However, from the Conversation Analytic perspective, specific vocabulary constitutes only one dimension along which a shift in topic can be observed. Others include, for example, the use of specialised expressions to explicitly signal the initiation of or a shift to a new topic, and shifts in the temporal orientation of the discourse (moving from recounting previous events to planning for the future). Furthermore, topic closures can also be marked with specific verbal expressions. Yet the interplay between these different dimensions has not been thoroughly investigated. (Couper-Kuhlen & Selting 2018: 312-328.)

Unlike in several other fields of linguistics, computational methodologies have been used relatively rarely in Conversation Analysis. However, O’Keeffe & Walsh (2012) have argued that corpus linguistics and CA are, despite their ontological differences, not mutually incompatible, and have shown this by utilising corpus linguistic methods in several studies on classroom interaction. Other previous studies approaching computational methodologies from a CA perspective include Haugh & Musgrave (2018), who present a combinatorial procedure for identifying examples of an interactional practice across relatively large tracts of data. From the perspective of computational linguistics and HCI, structures of interaction have attracted more attention, as developing talking machines has been a steady interest (see, e.g., Compagno et al. 2018).

By combining the conversation analytical perspective on topic shifts and computational analysis of lexical cohesion, we ask: what is the role of vocabulary in topic formation and, consequently, how far a purely word-based method can go in recognizing topic shifts in conversation? In our study, we use roughly 10,000 lines of online chatroom discussion data to assess the degree to which computationally mapped lexical cohesion and qualitatively analysed topic shifts converge.

In operationalizing word-based lexical cohesiveness, the most common text analysis methods (including topic modelling) seek to measure the degree of co-occurrence between groups of words and build computational topics as representations of these co-occurrence patterns. This approach aligns well with text-internal lexical cohesion: the words’ meanings outside the data are not taken into account; what matters is only how they stand in relation to each other in the data. Word embeddings trained on a much larger dataset than our 10,000 lines, and representing word distributions at the type level, are on the other hand well suited to tracking text-external associations. Word embeddings, being based on distributional similarities, are often very greedy when it comes to establishing proximities: any feature of a word, as long as it has a distributional imprint, will be reflected in the distributions and, by extension, in vector space models. In our case this is, however, not an issue, as Halliday & Hasan’s definition of adequate association between words is equally generous.

In our study, we experiment with which statistical models best capture the topical structuredness of discourse. We mark the associations identified by both the topic (CA-wise) and word embedding models in the data, and experiment with different ways to extract semantically cohesive structures. These include chains connecting associated word pairs, and sliding windows within which the overall associativeness is measured. We assess which models best align with the qualitatively analysed topic shifts and where the misalignments particular to each model reside. As an outcome, we are able to discuss how lexical cohesion plays a part in topic development in thematically unbounded online discussions. This discussion contributes to the development of large-scale automated methods seeking to understand topical progression in online discussion data.
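One of the window-based measures can be sketched as follows: the mean pairwise similarity of embedded tokens inside each sliding window, where a drop between adjacent windows is a candidate topic-shift point. The toy vectors below stand in for embeddings trained on a much larger corpus.

```python
import numpy as np

# Toy word vectors standing in for embeddings trained on a much larger corpus.
vectors = {
    "weather": np.array([0.9, 0.1, 0.0]),
    "rain":    np.array([0.8, 0.2, 0.1]),
    "storm":   np.array([0.85, 0.15, 0.05]),
    "exam":    np.array([0.1, 0.9, 0.2]),
    "grade":   np.array([0.05, 0.95, 0.1]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def window_cohesion(tokens, size=3):
    """Mean pairwise similarity of embedded tokens inside each sliding window;
    a drop between adjacent windows is a candidate topic-shift point."""
    scores = []
    for start in range(len(tokens) - size + 1):
        window = [vectors[t] for t in tokens[start:start + size] if t in vectors]
        pairs = [cos(a, b) for i, a in enumerate(window) for b in window[i + 1:]]
        scores.append(sum(pairs) / len(pairs) if pairs else 0.0)
    return scores

chat = ["weather", "rain", "storm", "exam", "grade"]
print(window_cohesion(chat))   # cohesion dips where the topic shifts
```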



10:00am - 10:15am

A humanist in search of computer scientists: A (so far unsuccessful) attempt to apply topic modeling techniques to Wittgenstein’s Nachlass

Filippo Mosca

Wittgenstein Archives, University of Bergen; University of Rome Tor Vergata.

This paper explores the application of the Latent Dirichlet Allocation (LDA) model to the writings of the philosopher Ludwig Wittgenstein. More specifically, it explains what topic modelling is and why it can be useful for philosophical interpretation, shows the two major stages of how LDA works in practice (pre-processing the data and running the model), and addresses some important challenges in assembling the corpus of Wittgenstein’s writings: the issue of multilingual corpora, the issue of repeated text sequences and the issue of the basic textual units. Finally, the paper presents the results of a specific analysis of Wittgenstein’s Nachlass through LDA and points out limitations and problems related to these results.

Mosca-A humanist in search of computer scientists-116.pdf
 
10:15am - 10:45amBREAK & DHNB DHLAM MEET-UP
Location: H-207 [2nd floor]

https://dhnb.eu/about-dhnb/working-groups/dhlam/

10:45am - 11:45amDARIAH DAY: Centre for Digital Humanities and Arts showcase
Location: H-207 [2nd floor]
10:45am - 11:45amSESSION#17: MEDIEVAL STUDIES
Location: H-205 [2nd floor]
Session Chair: Mats Fridlund, University of Gothenburg, Sweden
 
10:45am - 11:00am

Analyzing Kinship Sentiment in Medieval Documents

Clelia R. LaMonica1, Patrick Burns4, Pramit Chaudhuri5, Jennifer Devereaux2, Joseph Dexter2, Liwen Hou6, Joseph Henrich2, Jonathan Schulz3

1CDHU: Uppsala University, Sweden; 2Culture, Cognition, Coevolution Lab: Harvard University; 3George Mason University; 4New York University; 5University of Texas, Austin; 6Northeastern University

Complex kinship systems have long shaped societies, influencing relationships within different familial structures. These systems can extend beyond immediate and genetically-related family to include distant relatives, in-laws and others, and have historically been used to determine who is or is not eligible for marriage. However, little is understood about the historical psychological impacts of kinship systems, leaving aspects of historical societies’ sentiments surrounding kinship to be explored.

In modern WEIRD (Western Educated Industrialized Rich and Democratic) societies, individuals often exhibit traits of individualism, non-conformity, and trust in societal structures. The influence of kinship structures on WEIRD psychology was previously explored in Schulz et al.'s (2019) study which examined effects of changes in Medieval Europe. It revealed that prolonged exposure to the Western Christian (pre-Catholic) Church led to reduced rates of cousin marriage, which correlated with a more individualistic and impartially prosocial modern psychology. This shift was attributed to the Church's emphasis on promoting nuclear households, weakening extended family ties, and fostering mobility within the community.

The present study takes a new approach to examining the impact of kinship psychology through the medieval church. It leverages empirical data extracted from historical texts using natural language processing techniques, including sentiment analysis of texts written in Latin. The objective is not only to support theoretical claims and shed light on historical kinship sentiment but also to explore computational methodologies applicable to comparative historical research in other regions, such as the influence of religion on kinship and social psychology in other European regions. Such results are to be aggregated and eventually compared within an interdisciplinary team involving classics, computational linguistics, economic history, anthropology, and psychology across regions and time considering the development of WEIRD kinship structures alongside surrounding economic, historical, and socio-psychological influences.

One key aspect of this research is a cross-institutional and interdisciplinary approach, engaging collaboratively with an international network of experts in, e.g. linguistics, history, and psychology, as well as with key cultural institutions. This collaborative framework has been instrumental in enriching our materials, critical to building an understanding of historical kinship systems, and allowing for a more detailed analysis through the integration of diverse perspectives. Such partnerships have not only facilitated access to valuable historical texts but also have provided unique insights into the interpretative methodologies across the various disciplines engaged in this project.

The present preliminary analysis uses the CBMA (Corpus Burgundiae Medii Aevi), which consists of over 22,000 medieval charters, hagiographies, and other religious, legal, and communicative texts from medieval Burgundy spanning roughly the 5th to 15th centuries. These texts, available in XML and pre-processed parsed formats, were prepared for analysis, which included the extraction of kinship terms and matched sentiment scores within their contexts.

Various methods, including a LatinBERT language model (Bamman & Burns 2020) and dependency parsing, were explored to determine the different contexts surrounding kinship terms. Ultimately, dependency parsing using LatinCy (Burns 2023) was the chosen method, as it incorporates grammatical relationships between words, enhancing the overall analysis and improving accuracy and interpretation. Plain-text versions of the texts were first processed and parsed using LatinCy. Sentences containing both kinship terms and sentiment terms according to specified lists (over 600 kinship terms and 6,000 sentiment terms) were extracted. Specifically, sentiment-related terms were identified using a sentiment dictionary (Sprugnoli et al. 2021; 2023), aiding the exploration of sentiment-related language within the context of kinship terms. To gain an understanding of the weight of certain kinship terms within the corpus, a TF-IDF score was also obtained and combined with the terms' average sentiment scores, highlighting the terms' varying importance across different texts; notably, more religious texts contain more references to, e.g., 'father' and 'son'.
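A simplified sketch of the extraction step is given below, using a spaCy pipeline for Latin; the model name is illustrative (the installed LatinCy pipeline identifier should be substituted), the kinship and sentiment lists are tiny placeholders for the project's 600+ and 6,000+ term lists, and co-occurrence is computed at sentence level rather than through the dependency-based contexts described above.

```python
import spacy

# Illustrative model name; substitute the installed LatinCy pipeline.
nlp = spacy.load("la_core_web_lg")

KINSHIP = {"pater", "mater", "frater", "soror", "filius", "filia"}          # toy lists
SENTIMENT = {"carus": 1.0, "dilectus": 1.0, "iratus": -1.0, "tristis": -1.0}

def kinship_sentiment(text):
    """Yield (sentence, kinship lemmas, mean sentiment of co-occurring
    sentiment lemmas) for every sentence containing both kinds of term."""
    for sent in nlp(text).sents:
        lemmas = [tok.lemma_.lower() for tok in sent]
        kin = [l for l in lemmas if l in KINSHIP]
        scores = [SENTIMENT[l] for l in lemmas if l in SENTIMENT]
        if kin and scores:
            yield sent.text, kin, sum(scores) / len(scores)

for sent, kin, score in kinship_sentiment("Frater meus carus et dilectus est."):
    print(kin, score, "|", sent)
```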

The outcomes of this analysis yielded valuable initial insights, including average sentiment scores for kinship terms weighted by TF-IDF scores. Furthermore, sentiment scores were assigned to each kinship term in individual documents, enabling chronological comparisons to track sentiment shifts related to terms over time. Specific kinship terms were furthermore extracted, including gender pairs for comparison (e.g. 'brother', 'sister', 'mother-in-law', 'father-in-law'). The terms with the most variability over the timespan of the corpus (measured by their standard deviation) were also examined. Subsets of the corpus based on genre (diplomatic texts, hagiographies, etc.) were also specified and explored individually.

The use of the CBMA corpus serves as a valuable example for researchers exploring questions pertaining to kinship terms and sentiment within various texts. The use of NLP methods to explore and analyze such a sizeable corpus of Latin texts furthermore expands the boundaries of what has been done within computational Latin studies, as the models and pipelines used have only recently been developed or are still under development. It lays a foundation for further in-depth investigations of kinship in linguistics, classics, and across the historical social sciences, facilitating further discoveries in this interdisciplinary field.



11:00am - 11:30am

Asynchronous linked editing of texts in physical objects

Tarrin Wills

University of Copenhagen, Denmark

Introduction

Several projects (at the University of Copenhagen these include Editiones Arnamagæanae Electronicae and the Dictionary of Old Norse Prose) aim to digitally record an analysis of early texts and their language which is closely based on material evidence, normally manuscripts. Digital methods for these processes can be assisted by various techniques including imaging, text recognition and linguistic analysis. The overall process normally proceeds in a single direction: objects are imaged, transcribed, structured as texts and linguistically normalised and parsed. At each stage information is often discarded, particularly as the data standards for each process are often incompatible. The present paper describes a working model and application (at https://menotag.ku.dk) that allows these processes to proceed asynchronously and without information loss, that is, linguistically-annotated texts can be linked in detail to manuscript imaging, and manuscript imaging can be used to produce linguistically-annotated texts. This technology produces richly interactive editions and linguistic analyses that are grounded in and linked with the material evidence. It further provides the potential ground-truth set, based on existing editions, for training new handwritten text recognition models.

Background

There are a number of standards for the digital description of text-bearing objects and/or texts deriving from those objects. Many of these standards and applications are focused on the material objects (IIIF, CIDOC-CRM, Transkribus / PAGE XML and related formats, and TEI's manuscript description tagset); that is, they take the material object as their starting point but may have extensions for encoding the text (e.g. IIIF annotations, CRMtext). A number of alternative standards and applications exist for encoding philological editions of early texts. The Text Encoding Initiative (TEI), an XML-based standard, is by far the most widely implemented of these. Thirdly, there exists a set of applications and de facto standards for linguistic analysis, such as Corpus Workbench. These use yet other structures and models, even though, in the case of historical languages and texts, the underlying corpora derive from unique physical text-bearing artefacts.

Some projects have gone some way towards overcoming these boundaries. The Menota project has for two decades been maintaining a set of standards based on TEI as well as hosting digital editions based on those standards. Menota's editions are based on unique physical objects (manuscripts, charters and inscriptions), and the page and line boundaries of the material source are incorporated into almost all editions. At the same time, word and punctuation tokens form a central part of the model, which allows for additional linguistic annotation. Menota's archive is currently hosted as part of Norway's CLARIN infrastructure. This approach provides a bridge between the physical artefact (manuscript images are linked to pages of transcription) and corpus-linguistic tools.

A new set of technologies has emerged more recently that allows for automatic analysis of digital images of text, such as Transkribus and eScriptorium/Kraken. These can identify text regions, lines, words and characters on a digitally-imaged page and then recognise the characters with varying degrees of accuracy, once trained to do so. Transkribus, for example, is a tool that allows synchronous editing of texts from objects: handwritten text recognition (HTR) technology generates text from a manuscript page, which can then be corrected and edited as TEI/XML. The workflow is unidirectional, however: TEI/XML documents cannot be linked to the manuscript pages in any detail, and there is little compatibility between the HTR data formats (PAGE XML, for example) and the resulting TEI/XML.

This study describes an application and model (MenotaG) which is designed to integrate the processes of describing a physical object, editing its text and analysing its language. This application has been driven by two projects in particular: Editiones Arnamagnæanae Electronicae, a collaboration headed by the present author between the Universities of Iceland and Copenhagen to publish peer-reviewed manuscript-based digital editions, as well as the Dictionary of Old Norse Prose, which requires high-quality editions for its work in the semantic analysis of the Old Norse corpus.

An asynchronous process and model makes it possible to take existing TEI/XML documents and link them in detail to the physical objects they derive from, and vice versa. This opens the possibility of using existing transcriptions to train new HTR models, as well as of creating interactive digital editions that integrate image and text.

Model

The data model developed for this project is designed to link together the three main domains of analysis. The principal goal is to link the textual, linguistic and material information together effectively, while maintaining compatibility and, ideally, digital links to the data sources. The model is under development and is described at https://menotag.ku.dk/q?p=menota/home/about. The model is graph-based, realised in the application as a relational database. This allows for leveraging mature technologies for editing and publishing data and building complex applications, as well as providing the suite of built-in spatial types and functions (OpenGIS standard) found in modern relational database management systems.

The text-bearing object is treated as a series of surfaces on which text is written in lines; each line is in turn a series of word tokens (which may also continue onto another line) or punctuation tokens. The spatial relationships between the surface and its lines, words and non-linguistic features are recorded as polygons relative to the digital images representing the surface. This extends the capabilities of IIIF annotations, which can only refer to rectangles on the image. As the data encoded is a superset of IIIF, it can also be represented in that standard.
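A simplified illustration of such a record, assuming shapely polygons in image pixel coordinates (the class and field names are invented for this sketch and are not those of the MenotaG schema); the bounding box is what a plain IIIF region would reduce the outline to:

    from dataclasses import dataclass
    from shapely.geometry import Polygon

    @dataclass
    class TokenZone:
        surface_id: str   # digitised page (image) the zone belongs to
        line_no: int      # line on the surface
        outline: Polygon  # free-form polygon around the written token

        def iiif_region(self):
            """Rectangular fallback: the xywh region an IIIF annotation allows."""
            minx, miny, maxx, maxy = self.outline.bounds
            return f"{int(minx)},{int(miny)},{int(maxx - minx)},{int(maxy - miny)}"

    zone = TokenZone("ms-001, fol. 1r", 3,
                     Polygon([(120, 410), (205, 408), (207, 446), (118, 448)]))
    print(zone.iiif_region())  # "118,408,89,40"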

The text itself is treated in a way compatible with TEI and Menota's application of it: a work consisting of a hierarchical, ordered structure which, at the most detailed level, consists of a series of word and punctuation tokens. All elements in this structure are linked to the underlying TEI by means of XPath paths. The tokens in turn have different representations depending on their relative closeness, on the one hand, to the physical object's representation of the text ('facsimile' level in Menota parlance) and, on the other, to the underlying linguistic entity ('normalised' text), with an intermediate form corresponding to Old Norse diplomatic edition conventions (roughly: script normalised, orthography unnormalised, expansions of abbreviations marked). The token is what allows the linguistic structure to be connected to the physical object.
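The three textual levels and the XPath link back to the source TEI might be represented roughly as follows (a sketch with invented field names and example values, following Menota's facsimile/diplomatic/normalised distinction):

    from dataclasses import dataclass

    @dataclass
    class Token:
        xpath: str  # pointer into the underlying TEI/XML document
        facs: str   # facsimile level: close to the object, abbreviations kept
        dipl: str   # diplomatic level: expansions marked, orthography unnormalised
        norm: str   # normalised level: the underlying linguistic form

    t = Token(
        xpath="/TEI/text/body/div[1]/p[2]/w[14]",
        facs="kgr",       # example abbreviation
        dipl="konungr",
        norm="konungr",
    )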

Application

The application at menotag.ku.dk is designed to use existing tools and resources while providing optimal methods and workflows for textual editing. A user wishing to begin transcribing and editing a manuscript that does not have an existing digital transcription can start by linking images, either from an IIIF manifest or from the main repository of digitised manuscripts in Old Norse and Icelandic at handrit.is. Manuscript pages can be segmented automatically using Kraken, and the user is then provided with an interface where the transcription of each line can be entered beneath the image of the line itself. For Latin works, Kraken's existing downloadable models can be used to perform a preliminary HTR on the page. If the manuscript is unreadable by these technologies, the user can easily draw outlines of the manuscript lines themselves. The resulting transcriptions can be tokenised automatically, and normalisation and lemmatisation can be applied with machine-learning assistance. The resulting information can be exported to both PAGE XML and TEI.
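For the segmentation step, a workflow along these lines could be scripted around Kraken's command-line interface (file names are placeholders; the exact options should be checked against the documentation of the installed Kraken version):

    import json
    import subprocess

    # Baseline segmentation of one page image into line polygons (Kraken CLI).
    subprocess.run(
        ["kraken", "-i", "page_001.jpg", "page_001_lines.json", "segment", "-bl"],
        check=True,
    )

    # The JSON output lists the detected lines; each can then be shown to the
    # user for transcription or passed on to an HTR model.
    with open("page_001_lines.json") as f:
        segmentation = json.load(f)
    print(len(segmentation.get("lines", [])), "lines detected")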

Where existing Menota-style TEI documents are available, these can be imported using a simple interface, and the user can begin the process of linking the text and tokens to the image. Once a page is segmented, either manually or with the help of Kraken, the transcription lines can be automatically linked to the TEI text. The original TEI document can then be updated with the new information and exported.

This paper will demonstrate an edition that mixes these techniques. The thirteenth-century Icelandic Third Grammatical Treatise survives in three independent manuscripts. One has been transcribed and is available in Menota's archives; a second is a relatively clear copy; and a third is a very difficult-to-read palimpsest. The first is imported, linked to images and enhanced in this system. The second has been segmented automatically and transcribed manually, and the third segmented and transcribed manually. The three versions are linked using automatic collation tools, providing automatic registration of variants and generation of a stemma.
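The collation step could, for instance, be scripted with the Python collatex package (illustrative only, with dummy witness text; the project may rely on different tooling):

    from collatex import Collation, collate

    # Three witnesses of the same short passage (dummy text).
    witnesses = {
        "W": "ok er þat upphaf þessa mals",
        "A": "ok er þat upphaf þess mals",
        "B": "er þat upphaf þessa mals",
    }

    collation = Collation()
    for siglum, text in witnesses.items():
        collation.add_plain_witness(siglum, text)

    # Alignment table showing agreements and variants across the witnesses.
    print(collate(collation, layout="vertical"))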

Future development

The application is not currently released as open source, mainly because of institutional security policies, but work is under way to comply with local rules in order to release the code and ensure that it is externally deployable. The machine-learning tools, however, benefit from a single deployment per language. The application itself is being tested in both teaching and research environments and is being actively developed in anticipation of the launch of the first digital volume of Editiones Arnamagnæanae Electronicae later this year (late 2024). Work is also underway on exposing the image-text links using the IIIF annotations API. Ultimately, the whole data structure will be exposed as Linked Open Data and/or through a SPARQL endpoint.

Challenges remain with respect to speed: while the application leverages different browser APIs to ensure that image processing is done as efficiently as possible, initial page loads for each image are slow at the server end (around 10 s per page image). This is because the complex, partly recursive model linking the manuscript to the text structure is loaded in full for each manuscript page. Other standards-related issues remain, particularly in exporting full TEI XML including the text structure, and in re-exporting imported TEI where word tokens have been deleted or added. MenotaG is also awaiting the outcome of other projects on standardising graph models for text, particularly the Semantic TEI project.



11:30am - 11:45am

Quantifying Medieval Scribal Habits – The Case of Abbreviations in West Norse Manuscripts

Eva Pettersson1, Lasse Mårtensson2, Veturliði Óskarsson2

1Department of Linguistics and Philology, Uppsala University, Sweden; 2Department of Scandinavian Languages, Uppsala University, Sweden

Abbreviations were an integral part of medieval script in Latin letters, as a way of increasing writing speed and saving parchment. Originating in Latin texts, the system was transferred into texts in vernacular languages, although to a highly varying degree. Among the Nordic medieval vernaculars, the West Norse area (Norway and Iceland) made greater use of abbreviations than the East Norse area (Sweden and Denmark). Furthermore, within the West Norse area, the Icelandic manuscripts are especially characterized by an elaborate use of abbreviations. In the Menota Handbook, ch. 6.1, it is stated that as much as one third of the words are abbreviated in some Icelandic manuscripts.

In the present study, the West Norse abbreviation system is in focus. Our aim is to extract as much information as possible regarding the use of the abbreviations from digital editions of Old West Norse texts, by combining digital methods for information extraction and qualitative analysis of the retrieved data. We have divided our investigation into four separate, but interrelated, questions:

  • Does the frequency with which abbreviations are used increase or decrease during the Middle Ages in Old Icelandic manuscripts, and if so to what degree?
  • Does the number of different abbreviations increase or decrease during the Middle Ages in Old Icelandic manuscripts?
  • Are there any common features of the words that are abbreviated and not abbreviated respectively, regarding frequency or linguistic characteristics?
  • What differences in the frequency with which abbreviations are used can be observed between the Old Icelandic and the Old Norwegian manuscripts?

For our digital method in question 2 above, we need digital texts transcribed in a format where both the abbreviation sign and its expansion are available. Texts containing only the expansion can also be used for questions 1, 3 and 4. Furthermore, we need texts of a certain length (at least 10,000 tokens) in order to make the quantification reliable. We have used all texts meeting these demands that are available in the digital text archives Menota[1] and Emroon[2]. Within this text corpus, we have texts from both medieval Norway and Iceland, allowing for a comparison between Iceland and Norway, and texts from different times, allowing for the investigation of the chronological development. In total, the Icelandic dataset contains approximately 350,000 words from the period 1280–1425, and the Norwegian dataset roughly 720,000 words from the period 1200–1350. Of course, when more material becomes available, the picture given by our investigation will be further nuanced.
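As an indication of how such counts can be extracted, a sketch assuming Menota-style TEI in which each word is a <w> element and abbreviation marks and expansions are encoded with <am> and <ex> (the file name is a placeholder):

    from lxml import etree

    NS = {"tei": "http://www.tei-c.org/ns/1.0"}

    tree = etree.parse("example-manuscript.xml")
    words = tree.findall(".//tei:w", NS)

    # A word counts as abbreviated if it contains an abbreviation mark <am>
    # or an editorial expansion <ex> anywhere below it.
    abbreviated = [w for w in words
                   if w.find(".//tei:am", NS) is not None
                   or w.find(".//tei:ex", NS) is not None]

    print(f"{len(abbreviated)} of {len(words)} words abbreviated "
          f"({100 * len(abbreviated) / len(words):.1f}%)")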

A common formal classification of medieval abbreviations divides them into suspensions (the end of the word is abbreviated), contractions (the middle of the word is abbreviated), superscript letters and special signs (e.g. Hreinn Benediktsson 1965: 85). Regarding the content of the abbreviations, i.e. what linguistic units they represent, they can be divided into two types: 1) abbreviations representing lexical or grammatical forms and 2) abbreviations representing graphemic-phonetic units. Many of the words that are abbreviated in a lexical way are highly frequent words, either names that are highly frequent within a certain text (e.g. Gunnlaugr in Gunnlaugs saga) or other words recurring in many texts (e.g. verbs like segja or mǽla). It should be noted that one and the same abbreviation sign can be used for both lexical and graphemic-phonetic representation; superscript ‘m’ can be used in lexical abbreviations like ‘r’ + superscript ‘m’ for riddurum (dat. of riddari) as well as for the sequence ‘um’ occurring in different words. It is also known that recurring formulas are often abbreviated through suspensions, e.g. in manuscripts with poetry or legal texts (e.g. Hreinn Benediktsson 1965: 87). In our investigation, we address both these perspectives: we account for the type of abbreviation as well as for the units being abbreviated.

Indeed, tentative answers can be given beforehand for all the research questions above. It is sometimes stated in the literature that the earliest Icelandic manuscripts are less abbreviated than the later ones, and one could assume that there is a general trend during the Middle Ages towards a more frequent use of abbreviations (nr 1). One could also assume that the number of different abbreviations would decrease during the Middle Ages, parallel to the reduction of the number of different ordinary letter forms (nr 2). Furthermore, it is very likely that the frequency of a word affects the scribe's proneness to abbreviate it (nr 3), and finally it is often pointed out that the tendency to abbreviate is stronger in Iceland than in Norway (nr 4). However, our empirical basis and our digital methodology allow us to quantify the use of abbreviations in West Norse manuscripts in a way that has not been done before.

In our study, we show that some of the previously stated hypotheses about the use of abbreviations in West Norse manuscripts can be validated on an empirical basis. Abbreviations appear to be used gradually more frequently during the Middle Ages in Iceland. The number of different abbreviation signs does not increase in the texts that we have investigated, however; on the contrary, certain abbreviation signs used in the older texts have disappeared in the later ones. The tendency for Icelandic manuscripts to be more heavily abbreviated than the Norwegian ones is also confirmed and quantified in our investigation.

[1] https://clarino.uib.no/menota/catalogue

[2] https://www.emroon.no/#

Pettersson-Quantifying Medieval Scribal Habits – The Case of Abbreviations-110.pdf
 
10:45am - 11:45amSESSION#18: OUTREACH & ADVOCACY
Location: K-205 [2nd floor]
Session Chair: Lars Johnsen, National Library of Norway, Norway
 
10:45am - 11:15am

Going with the Flow: Merging Research and Cultural Outreach in GLAM+ Data

Julie Graff1,3, Lena Krause1,2, Alexia Pinto Ferretti1,2, Camille Delattre1,2, Simon Janssen1,2, Kim Trihn1,2, David Valentine1,2

1Maison MONA; 2Université de Montréal; 3Université du Québec à Montréal

Maison MONA, a non-profit organization founded in 2020, is dedicated to highlighting art and culture in public spaces in Québec (Canada). Our main initiative is the development of a free mobile app (iOS and Android), MONA: it aims to spark curiosity for, and appreciation of, public art, heritage sites, and cultural spaces in Québec by transforming their discovery into a life-size treasure hunt. To do so, we align a dozen open datasets about public art and heritage sites to generate a rich, diverse and complex panorama, a sort of "meta" cultural collection. From data collection to cultural outreach, we follow the workflow, finding solutions and passing them on to the cultural community and to the general public, whilst promoting the exchange of knowledge and equal, nonhierarchical participation in the development of research. Our challenges lie in the often incomplete or ambiguous datasets, which are also scattered and unevenly structured. Their contents, the cultural objects themselves, are also difficult to research or even to actually find IRL. Establishing their provenance and biographies requires accessing information located in oral history sources, physical archives and other difficult-to-access materials. As a result, these public cultural objects (such as murals, historical buildings, or even free cultural spaces) and their corresponding data tend to remain elusive to the general public, to cultural workers, and to researchers. Bringing together professionals and young researchers from a variety of disciplines (art history, museum studies, digital humanities, computer science…) in our team, we tailor our methodology through a process of trial and error, following an action research approach. We combine a series of concrete activities (open data creation/crowdsourcing, field research, and cultural mediation), a process of knowledge sharing, and an ongoing critical reflection touching on technical, methodological, and ethical issues.
From the beginning, cultural mediation was a key component of our process to bridge the gap between researchers and the public. The MONA app offers a new dimension to outreach with its participatory approach. Users produce photographs, comments, and ratings, thus documenting artworks and heritage sites whilst also revealing how each individual perceives them. Such user data advances a new approach to studying art appreciation. Partnering with the Art+site research group (P.I. Suzanne Paquet, Université de Montréal), we are developing a feedback loop between the general public and sociology-of-art researchers to better understand how public cultural objects affect the lived experiences of urban spaces. We will reflect on the opportunities offered, challenges faced and lessons learned from merging research and cultural outreach while working with heritage and public art open data. The paper will also introduce our understanding of GLAM (Galleries, Libraries, Archives, Museums), which in practice includes a variety of cultural workers operating outside, yet often in relation to, these four specific institutions: local councils, schools (including universities), smaller or unconventional non-profit cultural organizations, and so on. We are all engaged in preserving, documenting and mediating public art and heritage sites, as well as in creating and disseminating datasets. This paper will therefore present how this loose network operates to create GLAM+ Data.
In order to do so, we will also focus on a case study, a public art dataset from the Magdalene Islands (Îles-de-la-Madeleine, Qc.). In 2023, we were commissioned by AdMare (Centre d'artistes en art actuel des Îles-de-la-Madeleine; the Magdalene Islands' artist-run centre for contemporary art) to co-create a public art tour and its digital mediation. As Quebec has an art and architecture integration policy, we began our field work by researching these artworks. The only available documentation was a government-issued PDF file (https://www.quebec.ca/culture/integration-oeuvres-art-public/listes-oeuvres-art-public). Beyond the format issue, this file contained rather elusive information, limited to the year of production of the artwork, the name of the artist, the institution for which it was created, and its address (at the time of creation). We had to fill a lot of gaps, starting with the titles of the artworks and updating their locations. While the AdMare team was able to provide us with crucial information, thanks to their extensive knowledge of their territory, it remained on occasion nebulous from an outsider's viewpoint. We alternated between leg work, crossing the islands in search of public artworks, and talking with local workers from libraries, hospitals, local heritage institutions, and municipal services, all the while documenting what we learned. From this experience, we compiled rich qualitative documentation, including photographs and testimonies.
With the idea of furthering digital literacy, we wanted to entrust data production to the AdMare team. Adopting tabular data, we kept the structure as simple as possible to favour the durability of both the dataset and the knowledge of how to update it. This simplicity might be considered to limit the quality of the data produced; however, we had to find a balance between respecting standards and good practices and making data production possible and accessible for cultural workers. We also introduced the principles of open data, which are very close to the philosophy of several artist-run centres including AdMare, and recommended publishing the dataset as a CSV export on the government open data platform, Données Québec (https://www.donneesquebec.ca/recherche/dataset/art_public_iles-de-la-madeleine). Tabular data also gave us a first opportunity to map these artworks and locations, a welcome visualization after our rather wild-goose chase through the isles. On our side, we added the API provided by Données Québec to our scripts, which import and update all our open data sources. This script requires some fine-tuning, checking that new data aligns with our current data model, but it allows us to automate most updates (Krause and Janssen 2023). Once the data enters the MONA server and our database, we share it with all mobile apps through our own API, creating 31 new cultural troves to be discovered throughout the islands.
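A stripped-down version of such an import-and-check step might look like the following (the dataset URL and the expected column names are placeholders, not the actual Données Québec resource or schema):

    import csv
    import io
    import urllib.request

    EXPECTED_COLUMNS = {"titre", "artiste", "annee", "latitude", "longitude"}
    RESOURCE_URL = "https://example.org/art_public_iles-de-la-madeleine.csv"  # placeholder

    def fetch_and_validate(url):
        """Download the CSV export and check that it still fits the data model."""
        with urllib.request.urlopen(url) as resp:
            reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
            missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
            if missing:
                raise ValueError(f"columns missing from dataset: {missing}")
            return list(reader)  # rows ready to be merged into the database

    rows = fetch_and_validate(RESOURCE_URL)
    print(len(rows), "artworks imported")
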
The last part of our project with AdMare focused on giving back to the community through cultural outreach. We collaborated with local artist Gabrielle Desrosiers to design a paper map complementing the digital version available in the MONA app. The artist's rendering focused on the Cap-aux-Meules harbour. That sector was also the site of several public activities organized during fall 2023, which allowed direct contact with the public. We met and discussed art and culture in public space, bringing attention to an artwork and sharing local, historical and contemporary knowledge. Participants were encouraged to use the mobile app, in which photography and the prompts to comment on and rate the discoveries further a creative and participatory approach. Through these practices, we are dedicated to nurturing the opportunity for each participant to develop their own perspective on the artwork, as we hold the belief that each person can and should develop their own taste and appreciation of art (Krause 2020). With the idea of durability in mind, we sought to produce sustainable outreach content: the mobile app is freely available at all times, and the artist's map is being distributed by AdMare. We also produced teaching content for secondary school teachers.
By sharing this case, we hope to offer insight into the feedback loop between digital good practices and the pragmatic needs of a cultural outreach project, turning a critical and reflexive eye on these practices for a collective learning experience. Our conclusion will, moreover, focus on our current uses of this dataset in the research context, through the analysis of user-generated content on the one hand and through the production of Linked Open Data on the other (Delattre et al. 2024, forthcoming).



11:15am - 11:45am

Experimenting with a more-than-representational approach to urban visual electioneering: what’s the potential of DH methods?

Jurate Kavaliauskaite

Vilnius University, Lithuania

Despite its complexity and continuing practical relevance, outdoor electioneering not only loses scholarly attention to the proliferation of digital political campaigning and communication (Woolley, Howard 2017; Coleman, Sorensen, 2023), but also remains predominantly treated as a discursive phenomenon (Lilleker, Veneti 2023; Allen, Stevens, 2018), and thus tends to be undeservingly reduced to the domain of rhetoric, signification, symbolism, and the art of representing truth and falsity. Can digital humanities methods empower us to move 'beyond the box' of mainstream political science and help us re-engage with urban electoral campaigning in a less conventional but more comprehensive manner? How and what kind of DH tools and techniques are conducive to the ideas of more-than-representational and new-materialist theoretical approaches (Thrift, 2007; Anderson & Harrison, 2010; Müller, 2015) that invite and allow us to re-imagine outdoor electioneering as a complex extra-discursive phenomenon, entwined with its spatial, physical and socio-technical environments as well as with sensorial experiences?

The paper aims to answer these questions about the potential of DH methods by showcasing and sharing lessons from an experimental educational project, designed for and carried out with third-year political science students of Vilnius University in the Lithuanian capital in the wake of the parliamentary election in 2020. At its core, the project aimed to create relevant and viable links across distinct study and research fields via experiential learning and 'hands-on' experimentation with an array of mixed techniques and tools. It invited the participants to explore the untapped potential of humanist digital methods as a 'gateway to inter- and transdisciplinarity' (Cosgrave, 2019) that enables us to question and go beyond traditional studies of urban electoral advertising. What do electoral battles across city spaces entail? How does political advertising happen in and via urban environs, and why is this a different question from 'what do ads signify'? (How) can such a transient phenomenon be traced, captured and preserved, not only as a political but also as a socio-cultural manifestation that is spatially distributed, material and affective?

In order to sensitise the students to this novel view and spur creative ideas leading beyond the conventions of their discipline, the study design framed data collection as urban fieldwork based on a tailored set of mixed techniques: in-person city walking (looking for and identifying places and surfaces with electoral ads), taking digital pictures of political posters, billboards, etc. and of entire advertising sites (using mobile/smartphones), GIS-enabled mapping of their locations and configurations in urban space (using mobile/smartphones and respective settings), as well as participants' auto-tracking of their own journey trails (using Strava). Such a profile of the learning-based activity met several objectives.

Firstly, the fieldwork required aligning knowledge of the city(scape) with skills of navigation, observation and digital recording/tracking, as well as learning the relevant technicities. More importantly, the students took an active part in the entire data collection process, which allowed them to reflect upon their role as producers of 'data as capta' (Drucker, 2011) in relation to the complexity and local situatedness of the traced phenomenon. This was in purposeful contrast to the study of political advertising in terms of autonomous, self-contained visual artefacts – 'image bites', devoid of their temporal, spatial, social placement or processual nature (Holtz-Bacha, Johansson, 2017).

Secondly, the aforementioned methods of data collection gently but tactically directed the analytical focus towards 'extra-textual' aspects of outdoor electioneering, taking the phenomenon beyond the traditional focus on signification and linguistic meaning noted earlier. The data did capture the content of the ads; nevertheless, it was intentionally generated as multi-dimensional, allowing visual electioneering to be experienced as discursive but also physical, sensorial, geographically bounded and relational in terms of its urban surroundings, as proposed by more-than-representational theories (Thrift, 2007; Anderson & Harrison, 2010; Müller, 2015). From this point of view, the data allowed us to explore not only the strategic instrumentality of political advertising but also the material affordances of the city as a 'giant medium', composed of an intricate arrangement of sites, surfaces, and flows of the local attention economy.

Regarding the data analysis stage of the project, the nature of the overall collection – 5000 ad-focused and 2000 contextual digital photos with location data (EXIF metadata) and journey tracklogs (.gpx files) covering Vilnius electoral districts, plus an inventory of the collection (including collectively, manually coded standardized metadata) – made it possible to explore the urban electoral campaign at different scales and to compare their advantages. At the micro level of analysis, the students were invited to conduct a qualitative (ethnographic) study of photographed ad sites, urban surfaces and material tactics of power, and to present the results in the form of cartographic and visual storytelling (using ESRI Classic Story Maps). At the macro level of analysis, the students looked for and mapped interesting geo-spatial distributions, structures, intensities and rhythms of visual political competition across city space (using Excel 3D Maps). Finally, inspired by more-than-representational theories and cultural analytics (Manovich, 2020), the final phase of analysis proposed a new lens for exploring outdoor visual electioneering (and its visual aspect in particular) as, first of all, a sensorial and affective phenomenon, given the stimulus-overloaded, hasty, attention-seeking nature of current urban milieus. The idea was explicated by putting the collection of captured ads as pictures under a 'macroscope' (Graham et al. 2016), adapted for 'big visual data' and offering a (non-linguistic) distant reading of urban electioneering art with computational machine vision.
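As an indication of how such a collection can be processed computationally, a sketch that reads GPS coordinates from the photographs' EXIF metadata and points from the GPX tracklogs (file names are placeholders; exifread and gpxpy are one possible choice of libraries, and hemisphere handling of the GPS reference tags is omitted):

    import exifread
    import gpxpy

    def photo_coordinates(path):
        """Return (lat, lon) from a photo's EXIF GPS tags, or None if absent."""
        with open(path, "rb") as f:
            tags = exifread.process_file(f, details=False)
        lat, lon = tags.get("GPS GPSLatitude"), tags.get("GPS GPSLongitude")
        if not lat or not lon:
            return None
        def to_degrees(tag):
            d, m, s = [v.num / v.den for v in tag.values]
            return d + m / 60 + s / 3600
        return to_degrees(lat), to_degrees(lon)

    print(photo_coordinates("billboard_042.jpg"))

    # Journey trails recorded by the participants (GPX tracklogs).
    with open("walk_team03.gpx") as f:
        track = gpxpy.parse(f)
    points = [(p.latitude, p.longitude)
              for t in track.tracks for s in t.segments for p in s.points]
    print(len(points), "track points recorded")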

Overall, though the realisation of this complex collective undertaking was not without challenges and limitations, which will be discussed further, it provided the political science students with a conceptual and empirical interdisciplinary exercise, facilitated by a gentle introduction to humanist digital methods and tools along the entire research cycle. Quite unexpectedly, next to the empirical experimentation with the more-than-representational approach in the study of politics, the learning-focused activity generated one of the most comprehensive data collections of urban visual electioneering in Lithuania. More importantly, the initiative also demonstrated affinities, intersections and vectors of potential collaboration between humanities-oriented political scientists and digital humanists, cutting across current trends and agendas of computational social science.

 
11:45am - 12:45pmLUNCH (Háma) & DHNB Doctoral program proposal lunch-date (H-205)
Location: Háma
12:45pm - 2:00pmClosing keynote: Thor Magnusson: Intelligent Instruments in the Experimental Humanities
Location: Bratti [1st floor]

Thor Magnusson, research professor at the University of Iceland


 