Conference Agenda

Overview and details of the sessions of this conference.

 
 
Session Overview
Session
SESSION#08: NEWSPAPER CORPORA & HISTORICAL DOCUMENTS
Time:
Thursday, 30/May/2024:
1:00pm - 2:30pm

Session Chair: Jon Carlstedt Tønnessen, National Library of Norway, Norway
Location: H-205 [2nd floor]

https://www.hi.is/sites/default/files/atli/byggingar/khi-stakkahl-2h_2.gif

Session Abstract

In this presentation funding opportunities within the European Research Council for research projects in the Digital Humanities will be introduced.


Presentations
1:00pm - 1:30pm

Analysis of Textual Complexity in Danish News Articles on Climate Change

Florian Meier, Mikkel Fugl Eskjær

Aalborg University, Denmark

Structural linguistic features are often overlooked yet potentially important aspects of journalistic practice. Especially in news reporting on climate change, these features can play a crucial role as the proper use of language is tied to message credibility, processing fluency and knowledge retention, which can positively influence the reader to take more climate action.
This article analyzes language use in Danish news articles on climate change using a sample of around 32,000 articles from four different outlet types (quality news, niche papers, tabloids, and public service broadcasters) published from 1990 to 2021. We create a machine-learning model of text complexity covering this concept's semantic and syntactic dimensions.
Our findings confirm expected differences in complexity between news outlets, highlighting tabloid articles as engaging with higher semantic complexity, while quality papers and niche papers exhibit higher syntactic complexity. We observe a significant decrease in semantic complexity and a slight increase in syntactic complexity over time, a trend towards more generic language, and an increased use of pronouns, verbs, and adverbs. Most of these changes can be attributed to the emergence of articles by public service broadcasters. Articles by public service broadcasters are characterised by high syntactic complexity, which we consider problematic due to their popularity among the general public.

Meier-Analysis of Textual Complexity in Danish News Articles-130.pdf


1:30pm - 1:45pm

Quantifying Western Discourses about Technology. An Analysis of Press Coverage using a Dataset of Multilingual Newspapers (1999-2018)

Elena Fernández Fernández1, Germans Savcisens2

1University of Zurich; 2Technical University of Denmark (DTU)

In this article we explore the ontological properties of technology in contemporary societies using quantitative methods and a dataset of multilingual newspapers in English, French, Spanish, Italian, and German as a proxy. Our observation period covers twenty years (1999-2018). We filter documents using four technological key terms: nuclear, oil, internet, and automation. Our methodology is a five-step pipeline that includes Topic Modelling (Pachinko Allocation), word embeddings, Ward’s hierarchical cluster analysis, network analysis, and sentiment analysis. We divide our analysis into two categories: information stability (which we define as low levels of semantic diversity in our data outcomes over time at the key-term level) and information homogeneity (which we understand as low levels of semantic diversity in our data outcomes across our selection of key terms). We seek to observe to what extent our selection of technology-related terms permeates press discourses similarly or differently, and whether those discourses fall into semantic categories that could be considered essential defining elements of the fabric of contemporary societies (such as finance, education, or politics). Results show, firstly, a consistent overlap of content across newspapers’ semantic fields that could be considered pillars of society. Secondly, we notice a progressive simplification of information over time, reflecting less polarizing views across countries and, therefore, demonstrating an increasing agreement on technology-related discourses in present times. We interpret these results as an indicator of the rising intrusion of technology into the essence of Western, industrialized countries.

Fernández Fernández-Quantifying Western Discourses about Technology An Analysis-103.pdf


1:45pm - 2:15pm

Using document similarity in historical patents to study value in innovation

Matti La Mela1, Jonas Frankemölle2, Fredrik Tell3

1Department of ALM, Uppsala University; 2Centre for Digital Humanities Uppsala, Department of ALM, Uppsala University; 3Department of Business Studies, Uppsala University

Patents are temporary exclusive rights granted to inventors to protect their commercial interests in new technologies they have invented. To receive this exclusive right to sell, use, and licence an invention, the inventor needs to disclose to the public how the patented invention works. In other words, published patents include a description of the invention, a claim about what is novel about it, and often also drawings to support the description. Consequently, patents offer rich information about technological change spanning almost two centuries: when and by whom were new tools, machines, or industrial processes invented? A major challenge, however, is to understand which of the inventions among the thousands of recorded patents have actually been important and valuable.

This paper investigates the problem of how to estimate the importance of a patent. The question of assessing value is particularly relevant in historical contexts, where relevant metadata used for valuation, such as patent citations, are not available. In the paper, we establish a method to build “citations” ex post between historical patents by looking at the language employed in the patent documents themselves. We apply the method of tf-idf (term frequency-inverse document frequency) to our dataset of historical Swedish patents (1890-1929), and investigate the importance of patents along two dimensions: first, by studying the novelty of the vocabulary used in the patent description, and second, the impact of this new vocabulary on subsequent patents. The idea is that an important patent uses new terms when it introduces a new technology, which are then further discussed and elaborated on by others in subsequent patents.

In previous work, the question of assessing value in historical patents has been approached from several perspectives. The most common measure of value has been the yearly patent fee payments, that is, the number of times the patentee decided to renew the patent to keep it valid (e.g. MacLeod et al. 2003). One approach that captures the societal relevance of a patent in relation to other patents has been to build indexes from contemporary sources (see Nuvolari, Tartari, and Tranchero 2021). Finally, post-WW2 patent citations have been used as indirect measures, as they can refer to historical patents too, and thus show their persisting relevance for modern technologies (Moser and Nicholas 2004; Nicholas 2011).

Examining the similarity and differences of texts and their elements is a common task in natural language processing. Related to our approach, previous work has studied the innovative use of language and language patterns in various contexts, such as measuring novelty and resonance in political speech (Barron et al. 2018) or in news reporting (Nielbo et al. 2023; Gunda 2018). In the context of patents, there is a large strand of research employing contemporary patent documents granted since the 1980s as more traditional text data (Balsmeier et al. 2018) and also through word embeddings and trained language models (Hain et al. 2022). As Hain et al. (2022) note about studying patent similarity, text-based approaches have been dominant and only recently have vector-based solutions been applied to analyses of similarity. Regarding historical patents in the U.S., the landmark is the paper by Kelly et al. (2021), which presents a measure for studying similarity between historical patents. They build on tf-idf and examine backward and forward similarity in a dataset of U.S. patents granted between 1840 and 2010 to build an importance score for each patent.

The paper builds on this previous work by Kelly et al. (2021). The novelty of our paper is that we improve their approach by including normalization to control for the variation in the number of yearly patents, and by using edit distance to tackle the OCR challenges inherent to historical texts. Moreover, we apply the method in a different language area and historical-institutional context (for Sweden, see e.g. Andersson and Tell 2019). The paper lays the ground for further study of the similarity of historical patents at a semantic level using embeddings and pre-trained language models.

The source data we use in the paper are scanned and OCR-processed patents granted between 1885 and 1945 that have been digitised and published in the Swedish Historical Patent Database (1746-1945) project (https://svenskahistoriskapatent.se/EN/). Besides the OCR-read patents, which include the description, claim, and patent drawings, the Swedish Historical Patent Database includes information about each patent and individual patentee (e.g. Andersson and La Mela 2020). In the paper, we study the importance of patents between 1890 and 1929: this allows us to have five years of data for calculating backward similarity and ten years for studying forward similarity.

Figure 1 displays the number of patents granted per year in Sweden between 1885 and 1945 (by application date). We see that the number of patents per year grows throughout the period; in contrast to Kelly et al. (2021), we have therefore decided to normalize our measures of novelty and impact, whereas they calculate them as a sum of the individual values. As Figure 2 shows, the average length of the patent documents stays rather stable over time. The structure of the patent documents also remains rather similar over the period, containing first a description of the patent and then the patent claim. The patents are typewritten throughout the period, and the OCR quality has been estimated by manual inspection as adequate for the method.

(FIG 1)

(FIG 2)

The patent data is preprocessed by removing numbers, special characters, and single letters; Spacy’s Swedish lemmatizer (Honnibal et al. 2020) is applied to all tokens, and Swedish stop words are removed using the NLTK library (Bird, Klein, and Loper 2009). To remove errors introduced by the OCR process, similar words are identified by calculating the Levenshtein distance between them, and each group of similar words is then mapped to a single canonical form. We also filter out low-frequency words as a further way to deal with potential OCR errors and to eliminate words with low influence.
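The OCR-merging step described here can be sketched in a few lines of Python. The function names, thresholds, and the strategy of attaching rare variants to their most frequent neighbour are illustrative assumptions, not the authors' exact implementation:

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def merge_ocr_variants(tokens, max_dist=1, min_count=3):
    """Map rare spelling variants (likely OCR errors) to the nearest
    frequent form, then drop tokens that remain below min_count."""
    counts = Counter(tokens)
    frequent = [w for w, c in counts.most_common() if c >= min_count]
    canonical = {}
    for word in counts:
        if counts[word] >= min_count:
            canonical[word] = word
            continue
        # attach a rare token to a frequent neighbour within max_dist edits
        for target in frequent:
            if abs(len(word) - len(target)) <= max_dist and \
                    levenshtein(word, target) <= max_dist:
                canonical[word] = target
                break
    return [canonical[t] for t in tokens if t in canonical]
```

A call like `merge_ocr_variants(["maskin"] * 4 + ["maskln", "xyz"])` would fold the OCR variant "maskln" into "maskin" and drop the isolated "xyz".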

To measure the similarity between given patents 𝑖 and 𝑗, Kelly et al. (2021) propose the tf-bidf score, a modified version of tf-idf. The bidf score measures the importance of a term in a patent based on the patents filed prior to the patent in question. As a result, terms in influential patents receive a more appropriate assessment. For a given patent 𝑝 and term 𝑤, the tf-bidf score is calculated as follows:

(EQ1)

The term frequency (tf ) is defined as:

(EQ2)

where 𝑐 counts the number of occurrences of a term in a patent and bidf is defined as

(EQ3)

To calculate the similarity 𝑝𝑖,𝑗 between patent 𝑖 and 𝑗, two vectors 𝑉𝑖 and 𝑉𝑗 with the size of the union of terms in patents 𝑖 and 𝑗 are created. These vectors store the calculated tf-bidf scores for each term 𝑤 in 𝑖 and 𝑗 respectively. The vectors are normalized and the cosine similarity between the two vectors is calculated, which is the measure for patent similarity.
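As a minimal sketch, the tf-bidf weighting and cosine similarity described above could be implemented as follows. The exact bidf formulation (the log of the ratio of prior patents to one plus the number of prior patents containing the term) is an assumption following Kelly et al. (2021), and patents are represented simply as token lists:

```python
import math

def tf(term, patent):
    # share of the patent's tokens accounted for by the term
    return patent.count(term) / len(patent)

def bidf(term, prior_patents):
    # "backward" idf over patents filed before the focal patents
    # (assumed formulation, following Kelly et al. 2021)
    n_with_term = sum(term in p for p in prior_patents)
    return math.log(len(prior_patents) / (1 + n_with_term))

def patent_similarity(p_i, p_j, prior_patents):
    # tf-bidf vectors over the union of terms, then cosine similarity
    vocab = sorted(set(p_i) | set(p_j))
    v_i = [tf(w, p_i) * bidf(w, prior_patents) for w in vocab]
    v_j = [tf(w, p_j) * bidf(w, prior_patents) for w in vocab]
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm = math.sqrt(sum(a * a for a in v_i)) * math.sqrt(sum(b * b for b in v_j))
    return dot / norm if norm else 0.0
```

Identical patents score 1.0 and patents sharing no vocabulary score 0.0, matching the cosine-similarity behaviour the text describes.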

Novel patents should be distinct from prior patents, as they propose new, disruptive technology. This means that the similarity of a novel patent to prior patents should be low. We calculate the backward similarity (BS) as a measure of novelty for a given patent, using the application date of the patent as the threshold date. The BS is defined as the average similarity between a given patent 𝑝 and all patents filed in the 5 years prior to 𝑝. By averaging rather than summing, we account for the different number of patents in these 5-year windows, which enables us to normalize novelty scores over time.

(EQ4)

ℬ𝑗,𝜏 denotes the set of patents filed 𝜏 = 5 years before patent 𝑗. Thus, a novel patent is characterized by a low BS score, as its similarity to prior patents is low. Impactful patents are those that influence future patents, which results in high similarities between those. We define the forward similarity (FS), which is the average similarity between a patent 𝑝 and all patents filed 10 years after 𝑝, as a measure for the impact of 𝑝. Again, we use the application date of the patent as the threshold date.

(EQ5)

ℱ𝑗,𝜏 denotes the set of patents filed 𝜏 = 10 years after patent 𝑗. An impactful patent is characterized by a high FS score, as its similarity to future patents is high.

An important patent is both novel and impactful. This can be described as the ratio of patent impact (FS) to novelty (BS).

(EQ6)

The importance value 𝑞 for a patent 𝑗 will be high if its FS score outweighs its BS score.
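Under the assumption, taken from the definitions above, that importance is the FS/BS ratio, the three quantities can be sketched generically with any pairwise similarity function plugged in; the function names and windowing are illustrative:

```python
def avg_similarity(focal, window, sim):
    # average pairwise similarity over a window of patents;
    # averaging (not summing) normalizes for the window's size
    return sum(sim(focal, p) for p in window) / len(window)

def importance(focal, backward_window, forward_window, sim):
    # backward_window: patents filed up to 5 years before the focal patent
    # forward_window: patents filed up to 10 years after it
    bs = avg_similarity(focal, backward_window, sim)  # low BS = novel
    fs = avg_similarity(focal, forward_window, sim)   # high FS = impactful
    return fs / bs  # high when forward similarity outweighs backward
```

With a toy similarity function `sim = lambda a, b: 1.0 if a == b else 0.5`, a patent unlike its predecessors but echoed by its successors gets an importance score above 1.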

Similar to Kelly et al. (2021), we define a “breakthrough” patent as a patent that ranks within the top 10 percent of patents based on its importance score. This definition identifies patents that made a significant contribution to technological progress. Figure 3 presents the share of breakthrough patents among total patents per year. The graph shows that the identified breakthrough patents are rather evenly distributed over the period, with a slight downward trend over time. This could relate to the growth in the absolute number of patents and the slight increase in the average length of the patents, which decreases novelty due to the broader vocabulary in the patents. We see a peak during the war years, which could indicate that specific kinds of technologies were patented during the war years. We then manually study the top breakthrough patents for the peak years, and search for known important inventions among the breakthrough patents. We find some among them, such as the refrigerator by Carl Munters and Baltzar von Platen and the inventions concerning ball bearings by Sven Wingquist. Finally, we contrast the number of yearly renewals of the breakthrough patents with the yearly payments of all patents.

To conclude, we discuss some limitations of the method and how to further validate the findings. First, does the method capture relevant technical terms rather than tracing mere language change? To answer this, we refined the data-processing steps iteratively. The thresholds for filtering out low- and high-frequency words, and for setting a minimum document range (in how many patents a word needs to appear), were chosen by examining the “lost” vocabulary at different levels. We find that the filtering enabled us to narrow the studied vocabulary to a robust core of relevant terms concerning the technologies themselves. Some challenges, however, remain that can be relevant in the Swedish case. It could be investigated whether compound words should be divided and processed as separate stems, in order to examine and group together certain technological vocabularies. Moreover, similar to Kelly et al. (2021), we use the complete patent texts in our study. It could be investigated whether the right unit of analysis would be the beginning of each patent instead of the complete document. Second, to what extent can innovation be conceptualized in terms of novel technical vocabulary? It should be discussed more thoroughly whether the amount of vocabulary change accompanying new innovations differs between technology sectors. Our proxy of breakthrough patents relates strongly to the appearance of new vocabulary, which might also indicate changes in how an existing technical area is viewed or conceptualised.

La Mela-Using document similarity in historical patents to study value-223.pdf


2:15pm - 2:45pm

Between the Arduous and the Automatic: A Comparative Approach to Transparency in the Classification of Book Reviews in Swedish Newspapers

Daniel Brodén1, Jonas Ingvarsson1, Aram Karimi1, Lina Samuelsson2, Niklas Zechner1

1University of Gothenburg, Sweden; 2Mälardalen University, Sweden

This presentation focuses on the task of identifying a particular genre of texts in newspaper collections. When performed manually it is arduous, and when done computationally it is challenging for reasons related to both computation and digitisation. To make sense of the disconnect between the hyperbole surrounding text mining and the nitty-gritty details of such a task, one element of our project The New Order of Criticism: A Mixed-Methods Study of 150 Years of Book Reviews (2020–2024, Ingvarsson et al. 2022) is to train algorithms for the classification of book reviews in the newspaper collection of the National Library of Sweden (Kungliga Biblioteket, KB), combining expertise from language technology and comparative literature. Answering the growing demand for transparency (see Bode 2018), this paper contributes to the discussion of text mining book reviews for tracing the historical discourse of literary criticism (Underwood 2019; Brottrager et al. 2022) by highlighting the methodological and contextual complexities of classifying reviews in large-scale text corpora, taking document context and OCR issues into account.

Specifically, our aim is to comparatively discuss different approaches to identifying book reviews in KB’s newspaper collection, focusing on material from the year 1906 and using the platforms of KB’s digital lab, KBLab. We present different approaches, comparing a traditional manual approach (primarily drawing upon press biographies), with different computational methods in the form of frequency-based statistical classification methods and transformer-based language models.

Our study addresses two research questions:

  • What are the affordances and limitations of these different approaches for identifying literary book reviews from a specific year in KB’s collection?

  • How can a comparative methodological approach enrich the understanding of automated classification of literary criticism, emphasising transparency?

These questions raise wider methodological issues, including the need for articulation of the complex relationships between documentary record, large-scale digitisation and historical analysis and the state of the practice in text mining KB’s newspaper data.

Context-sensitive approach

We begin by framing our study in light of the discussion about the need for context-sensitive approaches to text mining, where commentators have underscored the importance of engaging with the original context of archival data to produce robust and nuanced analyses. Katherine Bode (2018: 5) argues against the tendency to rely on data models that inadequately represent the ways in which the original texts generated meaning in the past, and Jo Guldi (2023: 48) warns about naïve assumptions about the relationship between data and the documentary record. Regarding newspaper collections, there is a dire need to pay attention to the forms and genres of newspaper text (Gooding 2017: xiii). Unless one is text mining for high-altitude patterns, it is inadequate to conflate a range of text types, including news, editorials, reviews, adverts, and TV listings, into the single category of ‘newspaper data’.

This paper touches upon the specific complexities surrounding the identification of literary book reviews when it comes to both the genre as such and the manual work involved, drawing upon a pragmatic definition of the genre (having as its core to inform about and assess a newly published literary work) and project team member Lina Samuelsson’s previous study (2013) of the discourse of literary reviews in Swedish newspapers.

In the early 20th century, large Swedish newspapers did not publish book reviews according to a predictable pattern (if they published them at all) (Forser 2002). Given these uncertainties, Samuelsson originally used a press bibliography (Våge 2015), previously available as a rudimentary database, that catalogues major newspapers and information on different article types, to identify book reviews from the year 1906. The database was, however, not organised by year, so Samuelsson used the information to build a simple database by searching for the year 1906 among the entries categorised in the bibliography as book reviews. In total, approximately 3,000 literary reviews from the whole year were found, but since this material was deemed too large for close reading, the focus was narrowed to two months. These reviews (197 in total) were identified on microfilm copies, transcribed, and organised in a spreadsheet with metadata, including the title of the book reviewed, date of publication, literary genre, and name and gender of author and reviewer (when available).

However, the material was found to be quite heterogeneous and sometimes stretching the limits of what we today perceive as a review. Notably, while the press bibliography was helpful as a short track, some reviews were not listed and it only covered 17 newspapers out of approximately 200 and some of them only partially (Tollin 1967). Furthermore, in 1906, articles were often published over multiple columns and different pages, which could make finding the full texts in the dense newspaper layouts time-consuming, even with a bibliography at hand.

Dirty newspaper data and annotation

Turning to the task of automatic classification of book reviews based on KB’s newspaper collection, the suboptimal digitization and OCR constitutes a significant problem (Jarlbrink et al. 2016: 30–31). Basically, the newspaper pages displayed on KBLab’s platforms consist of a “mishmash of text blocks: we no longer know which blocks belong together to form part of the same article, which articles comprise part of the same section, nor which texts are editorial content rather than adverts” (Börjesson et al., 2023: 5). Though KBLab is working on proper article segmentation, data cleaning and curation remain crucial issues (Sikora and Haffenden, 2024).

Thus, the paper discusses the details of OCR errors in KB’s data and their effects on different methods. First, there are noise characters, some of which might stem from misreading graphical elements. Second, at least two newspapers from the year 1906, Aftonbladet and Svenska Dagbladet, are almost entirely missing punctuation, making some types of analysis impossible and hampering others (the effects would be worse for any researcher unaware of the problem). Third, there are a large number of missing spaces, at least some of which seem to be caused by mishandling line breaks. Since most computational methods are based in some way on words, and each missing space makes two words unrecognisable, this is likely to have an effect on the efficacy of classification.

In the paper, we present our annotation process for acquiring useful data for the classification experiments, along with an evaluation of the annotation. Although the 197 book reviews originally transcribed by Samuelsson constitute a substantial bulk of text, the machine learning processes required more text data, and for this task we enlisted three annotators, who annotated reviews in 8 newspapers from 1906. We used a preliminary algorithm to find the 500 pages most likely to contain reviews in each newspaper, and annotators 1 and 2 were then given 4 newspapers each. Annotator 3 was given the same 500 pages of one newspaper as Annotator 1, as well as all pages of another newspaper of which Annotator 2 had annotated 500 pages. This way, we were able to assess both inter-annotator agreement and the efficacy of the preliminary algorithm. For each review found, the annotators marked the corresponding text blocks (a way of navigating the effects of the OCR) and noted whether the reviewed material was fiction, non-fiction, or something else. For fiction, they also noted the title, author, and reviewer, providing us with metadata in line with Samuelsson’s original study.
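Inter-annotator agreement of the kind described here is often quantified with Cohen's kappa; the abstract does not name a specific measure, so the following is an illustrative sketch, with made-up labels "r" marking review blocks and "n" non-reviews:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # observed agreement: fraction of items with identical labels
    observed = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement from each annotator's label distribution
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)
```

Kappa corrects the raw agreement rate for agreement expected by chance, which matters here because non-review blocks vastly outnumber reviews on a newspaper page.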

Classification – methods and experiments

There is a wide variety of methods for automatic text classification (Zechner 2017). Some of these can be labelled numerical feature list (NFL) methods, which consist of choosing a set of numerical features of the text (word frequencies, character frequencies, sentence lengths, etc.) and then choosing a statistical method for comparing unclassified text samples to texts whose class is known. Using a larger number of features, such as the frequencies of sequences of words, can improve accuracy but makes the computation slower. In the last decade, a technique which has gained popularity is using word embeddings (Mikolov et al. 2013) to compress this type of data into a smaller set of numbers. This makes analyses based on longer word sequences feasible, but has the downside of being less transparent, as the numbers no longer represent an overt feature of the text. Recently, techniques have emerged which use large amounts of unrelated text to improve this feature compression, creating large language models (LLMs), with a prominent LLM method being BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018). These methods have led to great improvements in many areas of natural language processing, but have the downside of requiring a very large one-time computation to create the model. The large training text may also contain hidden bias, which, due to the less transparent nature of the method, may be difficult to avoid or detect.

In the case of the LLMs, we adopt a methodology involving two primary approaches: fine-tuning the KB-BERT (Malmsten et al., 2020) model and utilising Word2Vec embeddings (https://fasttext.cc/docs/en/crawl-vectors.html) with different types of classifiers. For fine-tuning KB-BERT, we leverage annotated review data to adapt the pre-trained model to the review classification task. Simultaneously, we apply Word2Vec embeddings to represent text documents and feed them into traditional classifiers in the form of a Support Vector classifier (SVC), a Naive Bayes classifier (NBC), and Linear Regression (LR). Furthermore, we incorporate TF-IDF vectorization to examine its impact on the classification performance of these models.
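A minimal, dependency-free sketch of the TF-IDF side of such a pipeline is given below. A nearest-centroid classifier stands in for the SVC/NBC/LR models the abstract names (which in practice would come from a library such as scikit-learn), and the tokens and labels are invented examples:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists -> list of {term: tf-idf weight} dicts
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}
    return [{t: (c / len(d)) * idf[t] for t, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vecs):
    # mean tf-idf vector of one class
    c = Counter()
    for v in vecs:
        for t, w in v.items():
            c[t] += w / len(vecs)
    return c

def classify(train_docs, train_labels, test_doc):
    # vectorize jointly for illustration; assign the test document
    # to the class whose centroid it is most similar to
    vecs = tfidf_vectors(train_docs + [test_doc])
    by_label = {}
    for v, y in zip(vecs, train_labels):
        by_label.setdefault(y, []).append(v)
    cents = {y: centroid(vs) for y, vs in by_label.items()}
    return max(cents, key=lambda y: cosine(vecs[-1], cents[y]))
```

A toy run: training on two review-like documents (tokens "roman", "recension", "författare") and two advert-like ones ("annons", "pris", "krona"), a new document containing "recension" and "roman" is assigned to the review class.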

The training phase involved fine-tuning the model’s parameters using an annotated dataset exclusively comprising review texts. We have evaluated the model's performance using standard metrics, notably accuracy. In order to establish a benchmark, we also assessed traditional models such as SVC, NBC, and LR using the same dataset. Our experimental outcomes demonstrate the superior performance of fine-tuned KB-BERT, achieving a 93% accuracy in review classification. This notable advantage can be attributed to KB-BERT’s adeptness at capturing intricate semantic nuances and contextual cues inherent in review texts. This robust and effective performance underscores KB-BERT’s potential as a tool for review classification tasks. Conversely, when employing Word2Vec for text representation, the accuracy results for traditional models are as follows: SVC, 74%; NBC, 83%; and LR, 84%.

On top of choosing different methods, we also discuss variations in the experiment setup, such as which data is used for training and testing. One example is testing whether the accuracy falls when training and test data are from different publications. Results can also be presented in the form of rankings or similarity distributions instead of classification proper. The comparative analysis consists of evaluating the accuracy of each method by manually checking the results, and will include an evaluation of the type, and amount, of preprocessing (i.e. annotation and OCR cleansing) needed to perform the computational analyses. Just as we consider the accuracies and shortcomings of the computational methods, we also view the human annotation as an imperfect method. Thus, we can compare the annotations of different annotators to get an idea of their accuracy. We can also manually analyse the texts chosen by the computational methods as likely reviews, to see if there may be reviews that the human annotators did not find, or other texts which a human might also consider similar to a review.

Conclusions and Summary

We conclude by drawing together our lines of inquiry, highlighting affordances and limitations provided by the manual and different computational approaches concerning the task of identifying a particular genre of texts in a newspaper collection. We discuss results of the different computational methods, also taking the transparency of the classification processes into account. Our discussion feeds both into the general methodological understanding of text mining literary criticism and the ongoing discussion about productive analytical approaches to KB’s newspaper data.

Brodén-Between the Arduous and the Automatic-142.pdf


 
Conference: DHNB 2024