Conference Agenda
SESSION#03: LINGUISTICS (NLP)
Presentations
2:45pm - 3:15pm
Developing named-entity recognition for state authority archives
University of Jyväskylä, Finland; National Archives of Finland, Finland

Digitisation of archives is a core trend in archival practice around the world. The National Archives of Finland launched a mass digitisation project in 2019 to digitise state authority records. Over the next decade, it aims to digitise more than 200 kilometres of modern government documents (Hirvonen, 2017). This endeavour involves generating both image and text data from the original documents. Although this process enhances access to archival materials, possibilities for information retrieval from unstructured and noisy text remain limited. Consequently, interdisciplinary work is needed to find ways to enrich the digitised data and serve a more varied user base (Guldi, 2023; Guldi, 2018; Colavizza et al., 2021). Following digitisation with enrichment of the data helps to avoid the pitfalls of digitisation (Jeurgens, 2013) and opens up new possibilities for researchers.

Aiming at more efficient and innovative uses of archival material, we developed a named entity recognition (NER) model for Finnish state authority archival data. NER is one of the most common tasks in natural language processing (NLP) and has been described as "among the first and most crucial processing steps" (Ehrmann et al., 2023) in enriching archival data. It was originally developed for information extraction (Palmer and Day, 1997) but has since been used in many areas of NLP. NER creates a multitude of potential uses for research. At its simplest, it allows the researcher to filter data more efficiently (Guldi, 2023). In more advanced use cases, in the context of state authority archives, NER can, for instance, facilitate research on policy trends and historical memory, identification of advocacy groups communicating with state authorities, or the effects of local and world events on decision-making.

The diversity of the archival data, which contains texts from different domains, as well as noise from imperfect optical character recognition (OCR), makes named entity recognition challenging. Pre-trained language models can, however, be exploited in a transfer learning setting to improve the generalisability of a neural NER model and overcome these issues (Ehrmann et al., 2023). Current state-of-the-art NER models built for modern texts, such as the TurkuNER model (Luoma et al., 2021), did not produce adequate results on an out-of-domain test set consisting of OCR'd archival data. This gave us an incentive to train our own NER model and test whether it would yield suitable results in an archival setting. We therefore followed the approach presented in Luoma, Oinonen, Pyykönen et al. (2020) and used FinBERT (Virtanen et al., 2019) as a base model, which we fine-tuned to recognise and classify named entities from text input. The aim of our study was to answer a set of research questions about NER performance in the context of state authority archives.
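As an illustration of this transfer learning setup, here is a minimal sketch (ours, not the authors' released code) using the Hugging Face transformers library and the public FinBERT checkpoint TurkuNLP/bert-base-finnish-cased-v1, with IOB2 tags over the entity categories described below; the example sentence is a toy stand-in and the classification head is untrained here, so real fine-tuning on word-aligned IOB2 labels is still required.

```python
# Sketch: FinBERT with a token classification head for IOB2-tagged NER.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

types = ["PERSON", "ORG", "LOC", "GPE", "PRODUCT",
         "EVENT", "DATE", "NORP", "FIBC", "JON"]
labels = ["O"] + [f"{p}-{t}" for t in types for p in ("B", "I")]

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
model = AutoModelForTokenClassification.from_pretrained(
    "TurkuNLP/bert-base-finnish-cased-v1",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# Toy forward pass; predictions are arbitrary until the head is fine-tuned
# against word-aligned IOB2 label ids inside a training loop.
enc = tokenizer("Kansallisarkisto sijaitsee Helsingissä.", return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)
print([model.config.id2label[i] for i in pred[0].tolist()])
```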
Our preliminary results clearly indicate that named entity recognition for state authority data requires dedicated model training due to its specialised formal language. Another difficulty that digitised state authority data poses for previous models is the noise caused by OCR. We approached these problems through choices made in the pre-training phase. First, we selected training data from various state authority sources to represent the complexity of formal state authority language. Second, we avoided cleaning OCR errors from the data in the annotation phase, in order to train a model that performs well despite OCR noise.

In model training we used the TurkuONE corpus (Luoma et al., 2021) as well as annotated document data from Finnish state authority records. The time span of the texts runs from the 1970s to the 2000s; the latest documents in the TurkuONE corpus are born-digital, while the text content of the state authority records was retrieved using OCR. The document and text types vary from blog posts to public administration records and legal texts.

NER models are typically trained to recognise entities such as dates, organisations, persons, and locations (see, e.g., Nadeau and Sekine, 2007). The named entity categories for our model development were based on surveys of different archive user groups (Poso et al., 2023). The following categories were included: person (PERSON), organisation (ORG), location (LOC), geopolitical location (GPE), product (PRODUCT), event (EVENT), date (DATE), nationality, religious and political group (NORP), Finnish business identity code (FIBC), and journal number (JON). The TurkuONE and NewsEye corpora were readily available with annotations but were supplemented with the missing named entity categories (i.e., FIBC and JON). The state archival dataset was annotated by seven annotators using the IOB2 annotation scheme, and to determine the level of agreement between all annotators we calculated inter-annotator agreement (Fleiss' kappa 0.84).

Our initial results show that current state-of-the-art Finnish NER models (namely, the TurkuNER model) work well with modern texts (weighted F1-score 0.9166) but show a drop in performance when tested on archival data (weighted F1-score 0.6694). Tested on modern texts and archival data, our model performs consistently in both domains (weighted F1-scores 0.9200 and 0.8710, respectively). Our model also improved F1-score, precision, and recall for all of the named entity groups (excluding the two new entity categories FIBC and JON) in comparison to the TurkuNER model when tested on the archival dataset. On the TurkuONE test dataset our model performed very similarly to the TurkuNER model. We can deduce from this that the increased diversity of the training data improved model performance; that is, even though we included archival data with OCR noise, the model still learned to detect named entities correctly in noise-free, non-archival data.

As our results show, users' needs, the complexities of archival data, and the different domains present in archival settings pose challenges for NER and possibly for other related NLP tasks. Further examination is needed to separate the respective impacts of OCR noise and of state authority language on model performance.
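For illustration, a minimal sketch (not the authors' code) of how an agreement figure of this kind can be computed with the statsmodels implementation of Fleiss' kappa; the annotations array below is hypothetical toy data in which each row is a token and each column is one of the seven annotators.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical toy data: each cell is an integer code for an IOB2 tag
# (e.g. 0 = O, 1 = B-PERSON, 2 = I-PERSON, ...).
annotations = np.array([
    [1, 1, 1, 1, 1, 1, 1],  # full agreement on B-PERSON
    [2, 2, 2, 2, 2, 0, 2],  # one annotator chose O instead of I-PERSON
    [0, 0, 0, 0, 0, 0, 0],  # full agreement on O
])

# aggregate_raters turns per-rater codes into per-category counts per token,
# the input format fleiss_kappa expects.
table, _ = aggregate_raters(annotations)
print(f"Fleiss' kappa: {fleiss_kappa(table):.2f}")
```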
3:15pm - 3:45pm

SWENER-1800: A Corpus for Named Entity Recognition in 19th Century Swedish
Uppsala University, Sweden; University of Gothenburg, Sweden; Swedish National Archives

Named entity recognition (NER) is the process of automatically identifying persons, places, organisations, and other name-like entities in text, in order to perform natural language processing tasks such as automatic extraction of metadata from text, anonymisation/pseudonymisation of sensitive personal data, or, as a preprocessing step, linking different terms that describe the same entity to a single reference. While NER is a mature language technology, it is generally lacking for historical language varieties. We describe our work on compiling SWENER-1800, a large (half a million words) reference corpus of historical Swedish texts covering the period from the first half of the 18th century until about 1900, and manually annotating it with named entity types identified as significant for this period, as well as with sentence boundaries, which are notoriously difficult to recognise automatically in historical text. The corpus can then be used to train and evaluate NER systems and sentence segmenters for historical Swedish text. An additional concrete contribution of this work is a manual for the annotation of named entities in historical Swedish.
3:45pm - 4:15pm
BERT-aided normalization of historical Estonian texts using character-level statistical machine translation
University of Tartu, Estonia

Historical texts serve as an invaluable resource for linguists, historians, and other researchers who use archives in their work. However, the automatic analysis and searchability of these writings is often hindered by the old and different spelling system used in the period when they were written. A common solution is to convert the texts from the old spelling system to the contemporary one, a process called normalization. This task poses its own difficulties, however, caused mainly by the small amount of data and the significant variation within it. This presentation gives an overview of using statistical machine translation to normalize historical Estonian texts, with BERT language models as a preprocessing step to improve the results.

The dataset used in this research comprises parish court records written in the 19th century. These texts describe court cases regarding debts, petty thefts, public fights, and other minor offences. The courts also served as a place where peasants could perform notarial acts, such as buying and selling property or writing down a will, and settle their civil disputes. These writings therefore provide insights into peasant life, thought processes, relationships, language usage, and more. The texts were originally handwritten, but owing to the variability of the handwriting and the prevalence of non-standard spelling, conventional approaches to digitizing them, such as machine learning or optical character recognition, proved infeasible. The texts therefore had to be transcribed through a crowdsourcing project by the National Archives of Estonia. There are currently over 130,000 texts in the database, of which 1,000 are used in this study.

Many of these writings use the old spelling system, which was introduced around the end of the 17th century and was heavily influenced by the German orthography of the time. The primary divergence from the contemporary spelling system lies in the way the length of a sound was denoted [1]. For instance, the term 'kooli' (the genitive, partitive, and illative form of 'kool', 'school') was historically written as 'koli'. The terms presently spelled 'koli' (junk) and 'kolli' (the genitive and partitive form of 'koll', 'monster') were both previously written as 'kolli'. Although some parish court records are written in modern Estonian orthography, most are in the older spelling, and some date from a transitional period in which people still wrote some words in the earlier spelling out of habit. What complicates matters further is that two written languages, representing North and South Estonian, were in parallel use until the end of the 19th century. Eventually, the North Estonian language and spelling standard became the single standard for the whole country. The spelling standard Estonians know and use today was introduced in 1843 and started gaining popularity in the 1870s [1].

The dataset used for the normalization experiments consists of 1,000 texts, around 330,000 words, which were manually normalized for training and evaluating the machine translation system. Due to significant dialectal variation, the texts were categorized into eight dialectal areas, with the amount of data chosen for each area varying according to how much it differs from modern Estonian.
Central, insular, western, northeast coastal, and eastern dialects belong to North Estonian, with the central dialect being the closest to standard Estonian. Võru, Tartu, and partially Mulgi dialects belong to South Estonian. The proportion of words that actually need to be normalized differs from dialect to dialect, ranging from 32% to 51%. The lengths of the texts also vary, the average being 359 words.

There are numerous normalization methods, among which character-level statistical machine translation (CSMT) holds a prominent place and is the one employed in the current research. One of the first uses of the method was normalizing historical Slovene texts [2], and a similar approach has previously been adopted for normalizing Estonian parish court records [3]. However, CSMT introduces inherent challenges, as it interprets text on a word-by-word basis, potentially leading to the erroneous normalization of words. This is compounded by the morphological richness of Estonian, which gives rise to form homonymy, a phenomenon where a single form can correspond to distinct lemmas in the contemporary writing system. The old writing system was also ambiguous: for instance, the term 'kalla' might denote the contemporary form of 'kalla' ('pour') or of 'kala' ('fish'). While a human can discern the correct form from context, the same task is considerably more complex for a machine. It is therefore necessary to determine whether a given word should in fact be normalized.

To address this, a two-step methodology was employed. First, a BERT model was fine-tuned to identify the words in sentences composed in the old writing system that require normalization. Subsequently, the Moses toolkit for character-level statistical machine translation was applied. This approach mirrors a method previously employed for normalizing old Slovene texts [4]. Training and testing were performed using the Simpletransformers library for Python and executed on the High-Performance Computing clusters of the University of Tartu [5].

For identifying words that need normalization, two pre-trained models for Estonian, EstBERT [6] and Est-RoBERTa [7], were compared. The latter was trained on a larger dataset from the media, the former on a smaller but more diverse dataset. Both models were cross-validated on the eight datasets representing the dialectal areas in five iterations. Training was performed for 50 epochs; the other hyperparameters were left unchanged. The data was divided into training and testing sets with an 80%/20% split. Despite Est-RoBERTa being trained on a larger dataset, EstBERT exhibited slightly superior results, achieving a macro average accuracy of 95% across the eight dialectal areas, compared to 93.5%; this can be attributed to its pretraining on a more diverse dataset. The predictions were also compared to a baseline, Vabamorf, a toolkit designed for morphological analysis of Estonian texts [8]: words unrecognized by Vabamorf as legitimate Estonian words were designated as requiring normalization. Vabamorf achieved an average accuracy of 90.6%. Following EstBERT model training, the same datasets were used to train and tune the Moses machine translation system.
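As a minimal sketch of the character-level setup (our illustration, not the authors' scripts): Moses operates on whitespace-separated tokens, so CSMT training data is typically prepared by exploding each word into space-separated characters on both sides of the parallel corpus. The word pairs and file names below are hypothetical.

```python
# Prepare hypothetical parallel data for character-level Moses training:
# old spelling on the source side, manually normalized form on the target side.
def to_char_level(word: str) -> str:
    """Explode a word into space-separated characters: 'koli' -> 'k o l i'."""
    return " ".join(word)

pairs = [
    ("koli", "kooli"),   # old spelling -> modern 'kooli' (school)
    ("kolli", "kolli"),  # ambiguous: may also normalize to 'koli' in context
]

with open("train.old", "w", encoding="utf-8") as src, \
     open("train.new", "w", encoding="utf-8") as tgt:
    for old, new in pairs:
        src.write(to_char_level(old) + "\n")
        tgt.write(to_char_level(new) + "\n")
```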
75% of the data was allocated for training, with an additional 5% earmarked for tuning using Minimum Error Rate Training (MERT); the remaining 20% constituted the test sets. Words not requiring normalization were excluded from the training and development sets. This ensured that training was performed only on words that actually have to be normalized, not on words that are merely homonymous with them. For comparison, the same texts were used to train the Moses system without the EstBERT preprocessing. Both trained models were evaluated on the same test sets: for Moses without EstBERT the whole test set was fed through the machine translation system, while for Moses with EstBERT only the words the model classified as needing normalization were translated. Although the work is still ongoing, the preliminary results are promising: Moses without EstBERT achieved an average accuracy across dialects of 90% on the whole dataset of approximately 330,000 words, while adding EstBERT increased accuracy to 94%.

An additional concern addressed by this research is the presence of numerous words in parish court records that in contemporary usage are compounds but were often written as separate tokens. The opposite also occurs: words now written as two distinct tokens were originally composed as compounds. For example, 'kohtumees' ('judge') was often written as 'kohtu mees', and 'ära läinud' ('gone away') as 'äraläinud'. To alleviate this issue, an additional EstBERT model was trained to determine whether a given word is regular, part of a compound, or needs to be separated into two distinct tokens. As the annotation of these distinctions began later in the study, not all of the previously described data could be used. The training dataset for this task therefore comprised the manually normalized side of the corpus, approximately 160,000 words, in which, besides normalization, annotations also marked whether a word should be joined into a compound (3,718 words) or treated as two separate tokens (2,022 words). The initial results were not very promising: precision was 65% and recall 56%. This can be attributed to the fact that compounds were annotated later during normalization, so less data is available for training; furthermore, such words are notably sparse in these texts. Efforts to improve these results are ongoing.

References
[1] M. Erelt. Estonian Language. Linguistica Uralica Supplementary Series, vol. 1. Estonian Academy Publishers, 2007.
[2] Y. Scherrer, T. Erjavec. Modernizing historical Slovene words with character-based SMT. In: Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing, 2013.
[3] G. Jaanimäe. Challenges of Using Character Level Statistical Machine Translation for Normalizing Old Estonian Texts. In: Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022).
[4] Y. Scherrer, N. Ljubešić. Sesame Street to Mount Sinai: BERT-constrained character-level Moses models for multilingual lexical normalization. In: Proceedings of the 2021 EMNLP Workshop W-NUT: The Seventh Workshop on Noisy User-generated Text, pp. 465–472.
[5] University of Tartu. UT Rocket. share.neic.no. https://doi.org/10.23673/PH6N-0144
[6] H. Tanvir, C. Kittask, S. Eiche, K. Sirts. EstBERT: A Pretrained Language-Specific BERT for Estonian. In: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa).
[7] M. Ulčar, M. Robnik-Šikonja. Training Dataset and Dictionary Sizes Matter in BERT Models: The Case of Baltic Languages. Lecture Notes in Computer Science, vol. 13217.
[8] S. Laur, S. Orasmaa, D. Särg, P. Tammo. EstNLTK 1.6: Remastered Estonian NLP Pipeline. In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 7152–7160.

4:15pm - 4:30pm
Augmenting BERT to model remediation processes in Finnish countermedia: feature comparisons for supervised text classification
University of Helsinki, Finland

This paper showcases a supervised machine learning classifier that bridges the gap between qualitative and quantitative research in media studies, leveraging recent advances in data-driven approaches. Current machine learning methods make it possible to gain insights from large datasets that would be impractical to analyze with more traditional methods, and supervised document classification offers a good platform for combining specific domain knowledge and close reading with broader quantitative analysis. The study focuses on a dataset of 37,185 articles from the Finnish countermedia publication MV-lehti, annotated into three categories based on frame analysis. Contextual sequence representations from the FinBERT language model, topic distributions from a trained topic model, and a structural, HTML-aware feature set developed in prior work are employed as classification features. The hypothesis that BERT-based embeddings can be improved by augmenting them with additional information is supported by recent promising results in natural language benchmarks and tasks (Peinelt, Nguyen, and Liakata 2020; Glazkova 2021). In our study, combining contextual embeddings with topics resulted in only marginal performance increases, observed mostly in the minority classes. Despite this, we outline potential future developments towards better classification performance. Based on the experiments, automated frame analysis with neural classifiers is possible, but the accuracy is not yet sufficient for high-certainty inferences.
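A sketch under stated assumptions (our illustration, not the paper's code) of the feature-augmentation idea: contextual embeddings are combined with topic distributions by simple concatenation before classification. Random arrays stand in for the real FinBERT embeddings and topic-model outputs, and a plain logistic regression stands in for the neural classifier used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_docs, emb_dim, n_topics = 1000, 768, 50
cls_embeddings = rng.normal(size=(n_docs, emb_dim))      # stand-in for BERT sequence representations
topic_dists = rng.dirichlet(np.ones(n_topics), n_docs)   # stand-in for per-document topic proportions
frame_labels = rng.integers(0, 3, n_docs)                # three frame-analysis categories

features = np.hstack([cls_embeddings, topic_dists])      # the augmented feature set
X_tr, X_te, y_tr, y_te = train_test_split(
    features, frame_labels, test_size=0.2,
    stratify=frame_labels, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"accuracy: {clf.score(X_te, y_te):.3f}")          # ~chance level on random stand-in data
```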