2:45pm - 3:15pm
Between and Behind the Lines: A Case Study of Strikethrough Removal and Handwritten Text Recognition on Astrid Lindgren’s Shorthand Manuscripts
Raphaela Heil1, Malin Nauwerck2
1Independent Researcher; 2The Swedish Institute for Children's Books, Stockholm, Sweden
1 Introduction
Where does a literary work begin and end, and how does its meaning emerge and transform? From a perspective of textual or genetic criticism, the literary work is not considered a finished product, but an inconstant, changeable process, determined by the author’s creativity as well as their historical, material and social context.
As has been pointed out by Van Hulle (2004), manuscript research and genetic criticism may create the feeling of “rummaging in the author’s ‘scientifically annotated waste-paper basket’”. But the fact remains that many authors, and especially modernist authors, have not thrown their handwritten manuscripts into the waste-paper basket but rather – like Lindgren – preserved them for future research.
Genetic research is based on the material evidence of the creative process, and typically involves the analysis of manuscripts, typescripts, notebooks, and other preparatory documents. It plays a natural role in an ecosystem of related fields of study, such as bibliography, book history, archive studies, filologia d’autore, variantistica, writing studies, digital humanities, and scholarly editing (Van Hulle 2022). Genetic research has been described as drawing attention to the labour, craftsmanship, and dynamic of the creative process by entering the “workshop” of the writer (Hay 2004). The process often includes the compiling and deciphering of relevant documents, establishing a chronological order, and the transcribing and editing of texts.
In our project The Astrid Lindgren Code (2020–2024) (https://www.barnboksinstitutet.se/en/forskning/astrid-lindgren-koden/), Swedish author Astrid Lindgren’s original drafts and manuscripts, preserved in 670 stenographed notepads, constitute the object of investigation. The project’s mixed-methods approach has so far resulted in approximately 60 digitised and transliterated notepads, primarily containing the novel The Brothers Lionheart (1973), which Lindgren notoriously struggled to finish and whose ending was rewritten several times. As is generally the case in genetic research, mapping out the author’s revisions through deletion, alteration, and rewriting is essential for exploring variations within the text that reflect the path of a literary work and the author’s creative process. On a material level, these revisions are often manifested in crossed-out or struck-through words, lines, and paragraphs.
When it comes to exploring the path of Lindgren’s literary work, the revisions at manuscript level are also more relevant than in many other cases. Book publishing is generally a collaborative process in which several agents are involved in bringing a manuscript to publication, so the influences on a manuscript can subsequently be traced to early readers, editors, and publishers. Belonging to the category of author-publishers, however, Astrid Lindgren assumed the roles of editor and publisher herself. Lindgren wrote and edited in shorthand, then typed up the manuscripts herself before sending them directly to the printer. Her position at the publishing house Rabén & Sjögren guaranteed her full control of the editing and publishing of her own books and contributed heavily to the “Lindgren myth” (Nauwerck 2022). The revisions Lindgren made in her shorthand drafts, including struck-through words, consequently provide the only first-hand source on the author’s creative process and editorial work at script level. To access them, it is in this case therefore necessary to be able to read, not only between, but also behind, the lines.
In our prior work (Heil and Nauwerck 2024) we have demonstrated that state-of-the-art handwritten text recognition (HTR) models can be used to automatically transliterate portions of Lindgren’s manuscripts. Our experiments have also shown that the presence of the aforementioned editorial marks, strikethrough and additions, negatively affects the recognition performance, resulting in a reduction of 30–40 percentage points (pp) with respect to the character error rate (CER). In this work, we investigate whether the application of strikethrough removal techniques, which aim to recover the original, clean words, can improve the recognition performance, allowing us to read behind the lines.
2 Case Study Design
The general design of our case study is centred around the application of strikethrough removal to affected words from Lindgren’s manuscripts, followed by a comparison of text recognition results obtained before and after applying the cleaning process. With respect to handwritten stenography recognition (HSR), we reuse the baseline models originally proposed by Sousa Neto et al. (2022) that were trained and evaluated as part of our prior work (Heil and Nauwerck 2024).
2.1 Strikethrough Removal
Several approaches for strikethrough removal have been proposed in recent years, for example by Heil, Vats, and Hast (2021) and Poddar et al. (2021). The underlying data ranges from fully uncontrolled data, collected from genuine manuscripts, to entirely synthetic strikethrough. Detailed discussions of various cleaning and data collection approaches can be found in Nisa (2023) and Heil (2023). In this work, we focus on a combination of fully synthetic and manually cleaned data, using the strikethrough removal approach proposed by Heil, Vats, and Hast (2022). The chosen approach has been demonstrated to work well for genuine and synthetic strikethrough applied to handwriting in Latin script, while requiring comparatively little computational power. It is based on paired image-to-image translation, using an autoencoder-based (Bourlard and Kamp 1988) deep neural network.
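As a rough illustration of the underlying autoencoder idea (not the cited model itself: dimensions are made up and the weights below are untrained random values), a struck word image is encoded to a low-dimensional bottleneck and decoded back to image space, with the network trained so that the decoded image matches the clean counterpart:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 32x32 greyscale word crops, flattened.
IMG = 32 * 32
HIDDEN = 64

# Random initialisation stands in for trained parameters.
W_enc = rng.normal(0.0, 0.01, (IMG, HIDDEN))
W_dec = rng.normal(0.0, 0.01, (HIDDEN, IMG))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def remove_strikethrough(struck):
    """Map a struck-through word image to a predicted clean image."""
    z = np.tanh(struck.reshape(-1) @ W_enc)  # encode to the bottleneck
    clean = sigmoid(z @ W_dec)               # decode back to image space
    return clean.reshape(32, 32)

cleaned = remove_strikethrough(rng.random((32, 32)))
```

In training, the loss would compare `cleaned` against the paired clean image, which is what makes this paired image-to-image translation.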
2.2 Data
We base our experiments on the LION dataset (Heil and Nauwerck 2023), which consists of several digitised notepads from Astrid Lindgren’s Swedish shorthand manuscripts, complete with corresponding transliterations obtained via expert crowdsourcing (Andersdotter and Nauwerck 2022). The dataset includes 2195 clean lines, which are free from any editorial marks, and 307 struck lines, which contain at least one word that has been struck through. Each word in the dataset is annotated with bounding box coordinates, its transliteration, and whether it is clean or struck through. For our experiments, we follow the previously established split into training, validation, and test sets (Heil and Nauwerck 2024).
In order to train and evaluate the strikethrough removal model, pairs of images, i.e. a struck-through word and its clean counterpart, are required. The training set for our experiments is obtained by extracting all clean words from the original training split and superimposing synthetic strikethrough in various shapes, for example straight or wavy lines, following Heil, Vats, and Hast (2021). This step allows us to create a large and visually diverse set of training images, while keeping the degree of human labour to a minimum. In order to retain some amount of genuine data in the process, the validation set is obtained by extracting the struck-through words from the original validation portion and manually erasing strikethrough strokes and related artefacts, using the image manipulation program GIMP and a graphics tablet. This approach requires considerable human involvement and is therefore only feasible at the scale of a small validation set, not for obtaining entire training sets of hundreds to thousands of images.
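The superimposition step can be sketched as follows, assuming word images are greyscale arrays with white background; the stroke shapes and parameters here are illustrative and not the generator of Heil, Vats, and Hast (2021):

```python
import numpy as np

def add_synthetic_strikethrough(word_img, kind="straight",
                                thickness=2, amplitude=3):
    """Superimpose a dark synthetic stroke onto a clean word image.

    word_img: 2D float array, 0.0 = ink, 1.0 = white background.
    The input is left untouched so the (clean, struck) pair survives.
    """
    h, w = word_img.shape
    struck = word_img.copy()
    mid = h // 2
    xs = np.arange(w)
    if kind == "wavy":
        ys = mid + (amplitude * np.sin(2 * np.pi * 2 * xs / w)).astype(int)
    else:  # straight horizontal line through the word's centre
        ys = np.full(w, mid)
    for x, y in zip(xs, ys):
        y0 = max(0, y - thickness // 2)
        y1 = min(h, y + thickness // 2 + 1)
        struck[y0:y1, x] = 0.0  # paint the stroke in ink colour
    return struck
```

Applying this to every clean training word, with randomised shape and parameters, yields the paired training set described above.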
2.3 Experimental Setting
As indicated above, our experiments focus on measuring the text recognition performance before and after applying strikethrough removal. Initial experiments, reusing the original model weights from Heil, Vats, and Hast (2022), indicated that models trained on Latin handwriting do not generalise to text written in Swedish stenography. This observation is in line with our expectations, as words in the latter writing system are considerably shorter and often contain strokes that are visually less distinct from strikethrough than Latin characters. We therefore train the chosen strikethrough removal architecture specifically for Swedish shorthand, using the combination of synthetic training and genuine validation data outlined above. Once the strikethrough removal model has reached convergence, it is applied to all struck-through images from the test portion of the LION dataset. The original, struck-through words are then replaced by their cleaned counterparts in the test line images, and the HSR performance is measured line by line.
3 Results and Discussion
On average, combining the strikethrough removal approach with the previously trained HSR model yields a CER of 53.42%. While this constitutes an improvement, compared to the original recognition performance of 56.48%, obtained on the original, struck-through lines, it is still much higher than the CER for entirely clean lines, which amounts to 25.68%.
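For reference, the CER reported here is the Levenshtein (edit) distance between the recognised and reference transcriptions, divided by the reference length; a self-contained sketch:

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance / reference length."""
    h = list(hypothesis)
    # prev[j] = edit distance between the processed prefix of the
    # reference and hypothesis[:j]
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(reference, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rc != hc)))    # substitution
        prev = cur
    return prev[-1] / max(1, len(reference))

# One substituted character in a three-character word:
print(cer("och", "ock"))  # → 0.3333333333333333
```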
A visual inspection of the strikethrough removal results reveals that several strikethrough strokes are hardly, or not at all, removed by the models. Furthermore, in some cases the wrong strokes, i.e. strokes belonging to the word itself, are partially removed, thus potentially altering the transliteration. Several plausible reasons for these results can be identified.
Firstly, as is the nature of shorthand, individual words are considerably shorter, in terms of image dimensions, than many words in Latin or other longhand scripts. This leaves comparatively little information for determining which strokes belong to the word and which should be removed. In addition, several frequently used Swedish words, such as “och” (English: and) and “en” (English: a, one), are written with strokes that also occur frequently in strikethrough, i.e. short horizontal or angled lines, arcs, and waves. When these symbols are combined with a strikethrough stroke, it is virtually impossible to identify what is a shorthand word and what is the strikethrough, unless context, in the form of surrounding words, is given.
Another aspect that may contribute to the limited improvement lies in the synthetic data, which does not fully reflect Lindgren’s actual style of striking through words. In particular, the stroke generation was implemented such that generated strokes do not extend beyond the edges of the word image; instead, a small margin is left. Lindgren, by contrast, often struck through several consecutive words with a single stroke, so that the strikethrough continues to the edge (and beyond) of the words’ bounding boxes.
Finally, as our approach only treats individual words, any strikethrough strokes that lie outside any of the words’ bounding boxes will not be processed by our model at all, leaving short stroke segments behind in those areas when replacing the processed words. As indicated earlier, short strokes, for example horizontal lines, are valid symbols in Swedish shorthand and may thus be recognised as words by the HSR model, instead of as the strikethrough artefacts that they actually are. To demonstrate the impact of the latter issue, we mask any content within the text lines that lies outside the bounding boxes of any of the words and evaluate the HSR performance on these lines, resulting in a CER of 47.92%, i.e. a further improvement of approximately 5.5 pp.
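The masking step used in this final measurement can be sketched as follows, assuming line images are 2D arrays and word bounding boxes are given in pixel coordinates (names are illustrative):

```python
import numpy as np

def mask_outside_boxes(line_img, boxes, background=1.0):
    """Replace everything outside the word bounding boxes with the
    background colour, removing stray strikethrough segments that no
    word-level model would ever process.

    line_img: 2D array; boxes: list of (x0, y0, x1, y1), exclusive ends.
    """
    keep = np.zeros(line_img.shape, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        keep[y0:y1, x0:x1] = True
    return np.where(keep, line_img, background)
```

Evaluating the HSR model on lines masked this way isolates how much of the remaining error is caused by stroke remnants between the words.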
4 Conclusions and Outlook
In this work, we have examined the impact of combining strikethrough removal with handwritten stenography recognition, using Astrid Lindgren’s shorthand manuscripts as a case study. The investigated approach has resulted in an improvement with respect to the recognition CER; however, the obtained performance is still considerably worse than that obtained on clean lines. Based on the identified challenges in the data, a sensible next step is to move from a word-based to a line-based strikethrough removal approach. This is expected to introduce the much-needed context for the strikethrough removal model to identify which strokes are words and which are strikethrough. Besides this, an interesting avenue for future work is to extend this case study to other writers who frequently use strikethrough in their works and for whom reading behind the lines could be of interest to, for example, the genetic research community.
Acknowledgements
This work is supported by Riksbankens Jubileumsfond (RJ) (Dnr P19-0103:1). The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at Chalmers Centre for Computational Science and Engineering (C3SE) partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
3:15pm - 3:45pm
Romani Literature in the Digital Era: Opportunities and Challenges
Sofiya Zahova
University of Iceland
Since the late 1990s and particularly after 2000, the development of the Romani literature field – the literary works written by Roma and/or for Romani audiences – has been influenced by the growth of digital technologies and the internet. The use of ICT among Romani communities has previously been discussed in relation to language usage (Leggio and Matras 2017; Leggio 2020), identity, appropriation and representation (Akkaya 2015; Szczepanik 2015), and migrations (Clavé-Mercier 2015; Hajská 2019). Based on the author's recent research in the framework of a broader investigation of Romani literary heritage and digital forms of Romani literature (Zahova 2020, 2021), the proposed paper discusses how digital technologies and tools have been applied in the preservation, production, access to and research of Romani literature. The paper's first part examines how digital technologies have been incorporated into the preservation of Romani literary heritage and how they have been utilised in the production of digital forms of Romani literature. To address the conference theme, this overview focuses on how collections, archives and holders of Romani literary materials have applied digital tools to facilitate the access, use and distribution of Romani literature, and what lessons can be learned from the experience of already existing collections, particularly when it comes to community engagement and activities. The second part of the paper brings in the topic of Romani digital publishing and the practices applied by authors, publishers and promoters of Romani literature. Using the example of existing practices of publishing on websites, blogs and social media, I examine how multimedia elements, hyperlinking, and options to post, comment and engage have (or have not) been utilised.
Within libraries, digital technology was first widely applied in the digitisation of old books and manuscripts in the public domain. The digitisation pattern for Romani literature repeats this model, although its public-domain corpus is not sizeable, nor is there a library that digitises only Romani literary heritage. The only corpus modelled according to this digitisation policy is the Zingarica collection on the National Library of Finland website (https://fennougrica.kansalliskirjasto.fi/handle/10024/85841), with the collection amounting to 224 books and 24 manuscripts. Other efforts have been rather unsystematic, driven by projects or by individual activities, and thus with limited access and questionable sustainability. One initiative, for instance, was undertaken by the National Library of Serbia as part of The European Library (TEL) project. The idea behind the project was to form a bibliography of Romani literature titles available at all TEL members and to digitise publications and materials (Injac 2009: 5‒6); in the end, it created only a bibliography of Romani literature available in some of the libraries. There are also cases of digitisation of materials as part of library initiatives that are not specifically related to Romani literary heritage and do not create a Romani collection. Such is the case of the scanning of Romani-language literary materials produced in the second half of the nineteenth century by Roma in the Austro-Hungarian Empire at the Hungarian Electronic Library (http://mek.niif.hu/) and as part of the Digital Library collection at the Lucian Blaga Central University Library in Cluj-Napoca. Apart from these large-scale, essentially top-down approaches, more grassroots, less systematic initiatives for the digitisation of traditionally printed literature and educational materials and the upload of this material to the internet have also been undertaken by individuals.
The materials are made available on file storage websites or as files accompanying social network posts. In these cases, there is limited access to the digital content.
Digital forms of publishing seem to provide good opportunities for minority literature in general and non-commercial organisations involved in Romani publishing, in particular, to overcome challenges and restrictions of the book publishing market and reach readers across the globe. Romani publications are going digital not only through the digitisation of public domain materials but also as digitally born texts in the forms of e-books, internet publishing and social media publishing.
The most sizeable and widely available form of digital Romani literature consists of short literary pieces published on websites and online platforms devoted to Romani culture and/or literature. These websites usually present works by authors from a common geographic, cultural, and historical space, for example, writers from the former Soviet Union, Czechoslovakia, and Yugoslavia. In some ways, the digital publishing scene mirrors the area in which other Romani cultural activities take place, including the printing and distribution of traditional publications by authors from the area. This can be explained by the shared history of the Roma (living in the same political formation), the resulting maintenance of professional contacts between Romani activists, cultural producers, and authors, the common Romani dialects, and the majority languages in which the publications appear (Czech, Russian, Serbian/Croatian/Bosnian/Montenegrin). Paradoxically, despite the unlimited opportunities for access to literature offered by internet publishing, the online platforms uniting Romani authors consist of Romani digital publications from the same cultural and geographic space.
In some cases, online forms and digital publishing mirror a rather active print literature scene in general (Czech Republic, Sweden) or an active project in publishing (Romano Kher in Romania). In other instances, however, online publishing seems to be more active than print forms and goes beyond mirroring print editions by publishing content that is available only online (Russia and Hungary for instance). Finally, in certain cases (countries formed from the former Yugoslavia), despite the comparatively active Romani literature scene in print, digital forms of Romani literature have not been developed so far by individual authors and publishers. In this respect, digital publishing is mainly determined by the tendencies in the region or country of its production but also depends on personal agency.
Although Romani authors generally do not make a profit from royalties and copyrights, many are still not willing to distribute their works through internet publishing. The authors may fear plagiarism, or hold the conservative sentiment that a proper book is one you can hold in your hands and read only in print form. Romani publishers have not yet taken advantage of self-publishing through digital formats, which is considered an essential and effective solution for publishing minority literature. At the same time, Romani digital publishing faces many of the issues related to Romani publishing in general: lack of promotion and distribution strategies; limited reading audiences due to unawareness of its existence among wider circles and limited reading ability in Romani; differences in dialect and orthography; and the prevalence of publishing in majority languages, especially in the prose genres, which are thus inaccessible for Roma from other countries.
There are still many challenges in Romani digital publishing – intermediary and multimedia forms that make digitally published texts more attractive and user-friendly are barely used. So far, there are no festivals, competitions, or promotional campaigns to highlight digital writing and publishing. Nevertheless, the peak in the digital life of Romani literature is still to come, and we can expect more authors to share their texts and reach out to the online community, educate readership, and engage in digital literary activities. As of 2023, I dare claim that digitally available Romani literature is already more successful in terms of accessibility and readership than traditional print literature.
Digital or more interactive forms of publishing may stimulate interest among younger generations who read and write in Romani on the internet in forums, social media, and in communication with their relatives and communities worldwide (Leggio 2020). Much like their peers, most of the young Roma today are “digital natives” (Palfrey and Gasser 2008), using and integrating contemporary technology in most aspects of everyday life, who are also interested in sharing digital content with Roma/Gypsy representations across the world. New ways of presenting Romani literature digitally may expand the audience for Romani literature by reaching these Romani digital natives.
3:45pm - 4:00pm
A hybrid approach in the close and distant reading of Ibsen’s plays in the light of the characters’ stylometric profiles
Keywords: stylometry, hybrid-reading, DSE, Henrik Ibsen, stylometric profiles
Sasha Rudan1,2, Eugenia Kelbert3,2,4, Linnea Eirin Timmermann Buerskogen1
1University of Oslo, Norway; 2LitTerra Foundation; 3Institute of World Literature, Slovak Academy of Sciences; 4University of East Anglia
This paper analyses Henrik Ibsen's plays in the connected world of their translations, demonstrating both the patterns in the original texts and their transfer across the translations. We demonstrate the process of collecting, analysing, and presenting the corpora and the methodology of hybrid reading: distant and close reading happening across two institutions, the Centre for Ibsen Studies, Oslo, Norway, and the LitTerra Foundation, Belgrade, Serbia. In this process, we use our original digital infrastructure: Bukvik for cross-lingual corpora analysis and LitTerra for cross-lingual corpora reading and the presentation of distant-reading findings.
We explore the plays Et dukkehjem (A Doll’s House), Fruen fra havet (The Lady from the Sea), Hedda Gabler, Gengangere (Ghosts), Vildanden (The Wild Duck), and En folkefiende (An Enemy of the People). Our corpus consists of the original texts (written in the Dano-Norwegian language/dialect) and their translations into English, Serbian, Croatian, and Russian. We are primarily interested in exploring the author’s contextual style, where the context refers to each play’s characters: we identify the textual patterns in the characters’ lines over the course of a play and in the characters’ mutual interaction, and thus eventually build each character’s stylometric profile.
Eventually, we explore the patterns and stylometric profiles across the translations, making it possible to understand and present the translators’ work: on the one hand, their grasp of and commitment to the original’s stylometric features; on the other, their creative freedom and conscious play within the framework of the generally more rigid form of a play as compared to the novel or poetry.
This work exemplifies cooperation across disparate institutions and scholars, namely between the Centre for Ibsen Studies, Oslo, Norway, which holds the competence in Ibsen studies as well as all the original texts by Ibsen in critical editions and various translations, on the one hand, and the LitTerra Foundation, Belgrade, Serbia, which holds competence in cross-lingual stylometric analysis (maintaining Bukvik, the cross-lingual corpora analysis tool) and in the hybrid reading approach (maintaining LitTerra, the cross-lingual corpora reading platform), on the other.
Our workflow is the following: we collect Ibsen’s multilingual corpus and feed it to Bukvik workflows that enable the normalisation and cross-lingual alignment of the originals and their translations at the level of sentences and sentence parts. We also identify the provenance of each line in a play, either through direct annotation (using the TEI format, for the original texts) or automatic extraction (for the translations), which helps us to classify and associate the text with the plays’ characters and later understand their contribution to the characters’ contextual styles. After that, we are ready to apply stylometric analysis (limited by the orthographic issues and old Dano-Norwegian language style of Ibsen’s original writing) to understand the uniqueness of each character’s style, its development throughout the play and interaction with other characters, together with each translator’s awareness of it and its transfer to a new text “version” in translation.
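For the TEI-based provenance step, speaker attribution might be extracted roughly as follows; the fragment below is a simplified, namespace-free stand-in for real TEI drama markup, and the function is a sketch rather than the project's actual pipeline:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A minimal TEI-like fragment; real TEI uses namespaces and richer markup.
tei = """
<text>
  <sp><speaker>NORA</speaker><p>Ja, Torvald.</p></sp>
  <sp><speaker>HELMER</speaker><p>Er det lerkefuglen?</p></sp>
  <sp><speaker>NORA</speaker><p>Ja!</p></sp>
</text>
"""

def lines_by_character(tei_xml):
    """Group each spoken line under its character, as a first step
    towards per-character stylometric profiles."""
    root = ET.fromstring(tei_xml)
    lines = defaultdict(list)
    for sp in root.iter("sp"):
        speaker = sp.findtext("speaker").strip()
        for p in sp.findall("p"):
            lines[speaker].append(p.text.strip())
    return dict(lines)

profile_input = lines_by_character(tei)
# e.g. profile_input["NORA"] == ["Ja, Torvald.", "Ja!"]
```

The per-character line lists produced this way are what the subsequent stylometric analysis would consume.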
Using the LitTerra platform, we ensure a parallel view of all the translations of each play, mutually aligned and annotated with stylometric findings, eventually providing a close-reading experience augmented with distant-reading findings. Thus, we can read the texts in a completely new way, dynamically change our focus and range of interest (for example, focusing only on female characters, or aggregating them), and integrate distant reading charts interactively with close reading and pin-pointing the excerpts of the text primarily contributing to the deviations of stylometric profiles across the translations, seamlessly fusing and navigating through the two often distinct experiences.
With this, we present an innovative model of DSE (Digital Scholarly Edition) platform and a continuous workflow for hybrid critical reading that we currently practice within the Centre for Ibsen Studies in light of Ibsen’s upcoming 200th anniversary.
4:00pm - 4:15pm
Jon Fosse and world drama: using map visualisation to interrogate the global dissemination of the fresh Nobel laureate
Jon Carlstedt Tønnessen, Jens-Morten Hanssen
National Library of Norway
During the 1990s Jon Fosse went from being a prose writer with a limited Norwegian readership to becoming a playwright with an ever-expanding international distribution. Within a decade after he wrote his first drama, Fosse held a prominent position on the global stage. This paper explores the geographical distribution of stage productions based on his works, with a point of departure in a performance dataset established by the National Library of Norway in collaboration with the performing arts archive Sceneweb. The dataset was created by extracting information from a relatively large archive of material related to Fosse’s authorship that the National Library received in 2021. A total of 870 Fosse productions worldwide have been entered into the Sceneweb database with information about theatre, venue, work, production title, performance dates, performance language, stage artists and other contributors, tour schedule, etc. (https://sceneweb.no/nb/artist/3170/Jon_Fosse). Venues are registered with geodata such as street address, city, state or province, and country, enabling the use of tools for geographical analysis.
Using Python-based tools for geoparsing and map visualisation, we employ interpretive digital tools to shed new light on the events that led to Fosse’s current standing as a world dramatist. Existing accounts indicate that the international distribution of stage productions is particularly strong in three geographical areas: Scandinavia, France, and the German-speaking parts of Europe. One specific production, Claude Régy’s staging of Someone Is Going to Come at the Festival d’Automne in Paris in the autumn of 1999, supposedly marked a turning point and laid the foundation for Fosse’s European breakthrough, initiating a strong wave of performances, particularly on the European mainland. However, new studies suggest that there would not have been a global Fosse had it not been for his long-standing strong position in Scandinavian theatre and the long-term transnational artistic collaboration initiated and upheld by Scandinavian theatre organizations with a strong ownership in Fosse.
We will present an interactive map, built with the Folium package for Python (https://python-visualization.github.io/folium), visualising the spread of Fosse over a period of three decades, from 1994, when he was presented on stage for the first time, until 2024, on the basis of the combination of precise spatial and temporal data contained in the Sceneweb dataset. Moreover, we will zoom in on France, Germany, Austria, Switzerland, Norway, Sweden, and Denmark for a closer scrutiny of events of particular significance in these areas. Fosse has been presented on stage on six continents, and we will, last but not least, explore how Fosse conquered the stage outside of Europe.
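Before such a map can be drawn, the per-venue records need to be aggregated; a minimal sketch, with hypothetical field names standing in for the actual Sceneweb schema:

```python
from collections import Counter

# Hypothetical records extracted from the Sceneweb dataset;
# field names and values are illustrative only.
productions = [
    {"city": "Bergen", "lat": 60.39, "lon": 5.32,  "year": 1994},
    {"city": "Paris",  "lat": 48.86, "lon": 2.35,  "year": 1999},
    {"city": "Paris",  "lat": 48.86, "lon": 2.35,  "year": 2000},
    {"city": "Berlin", "lat": 52.52, "lon": 13.40, "year": 2000},
]

def productions_per_city(records, until_year=2024):
    """Count productions per (city, lat, lon) up to a cut-off year,
    e.g. to animate the spread decade by decade."""
    return Counter(
        (r["city"], r["lat"], r["lon"])
        for r in records
        if r["year"] <= until_year
    )

counts = productions_per_city(productions)
```

Each resulting (city, lat, lon, count) tuple could then be passed to Folium (for instance as a `folium.CircleMarker` at the coordinates, with the radius scaled by the count) to render the interactive map.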
The paper addresses several of the overarching themes of the DHNB 2024 conference. The paper will demonstrate services and applications developed by the National Library’s DH Lab, thereby illustrating the role of such labs in collaborative projects across the GLAM sector. The paper will furthermore demonstrate the reproducibility and repurposing of a dataset initially created by extracting performance data from a cultural heritage collection.
4:15pm - 4:30pm
Norwegian writers during the Second World War. Literary studies and relational analysis combined.
Sofie Arneberg, Lars Johnsen
National Library of Norway, Norway
Introduction: How can we use text mining to identify traits of, or describe, the literature of Norwegian writers before and during World War II? How can a statistical analysis of the writers’ political and aesthetic position-taking in the 1930s and 40s be used as a stepping stone for new studies of a literary epoch?
Objective: Cultural life in the 30s and 40s can be studied as an object in which the prelude to World War II both took place and manifested itself. Most companions to European literary history mark out the period as special, almost deviant, and with a clear before-and-after perspective. Literature of the period has consequently been read and studied differently. Knowledge or assumptions about the writers’ political stance have naturally played a role. How precise or relevant are literary interpretations coloured by these assumptions or facts? Can a distant reading of the collective works of 308 writers either confirm or invalidate presumptions about their artistic production?
Case: Grouping the writers based on their aesthetic positioning was one of the tasks when the dataset was created. It was of interest to the project to test the accuracy of this method.
Methodology: The Words and Violence project (Literary intellectuals between democracy and dictatorship 1933-1952) analyses the democratic resilience and vulnerability of cultural life in the 1930s and ’40s. The project is funded by the Norwegian Research Council and consists of a research consortium with members from both universities and cultural heritage centers such as The Norwegian Center for Holocaust and Minority Studies, The Falstad Centre and The National Library of Norway. The latter contributes to the project with research, library resources and access to the digitized books from their laboratory for digital humanities (DH-LAB).
Based on prosopographical data on 308 writers active in Norway during the 1930s and 40s, researchers from the project have conducted a systematic analysis of the writers’ various position takings during WWII.
The relations between three sets of structures were analyzed: the writers’ social backgrounds, their position-takings in the literary field, and their political position-taking during the German occupation. Inspired by the works of Pierre Bourdieu and Gisèle Sapiro and based on a newly constructed database on the writers, the study demonstrates the potential of correspondence analysis for mapping relational structures in culture.
The established database offers a high number of variables. Several of these concern aspects of the writers’ literary production: what was their preferred written language (Norwegian Bokmål, Nynorsk, or dialectal variants); which genres did they write in; their affiliation with the different publishing houses; their publishing frequency; and were their books translated into other languages and published abroad? This information was gathered from detailed bibliographies of the writers, produced specifically for the project at the National Library of Norway.
Based on the new bibliographies, a corpus consisting of the literary works of the 308 writers, approximately 2000 books, was built and studied.
Case: Signal words such as "life," "inner", "strange", and "mind" were scrutinized within a subset of works by authors classified as having an interest in psychoanalysis. Subsequently, a document-term matrix was generated.
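A document-term matrix restricted to such a signal-word vocabulary can be sketched as follows; the documents below are illustrative stand-ins, not the project corpus:

```python
# The signal words named in the study; the documents are invented examples.
SIGNAL_WORDS = ["life", "inner", "strange", "mind"]

docs = {
    "novel_a": "the inner life of the mind a strange inner life",
    "play_b":  "he enters stage left and speaks",
}

def document_term_matrix(documents, vocabulary):
    """Build a simple count matrix: one row per document,
    one column per signal word."""
    matrix = {}
    for name, text in documents.items():
        tokens = text.lower().split()
        matrix[name] = [tokens.count(w) for w in vocabulary]
    return matrix

dtm = document_term_matrix(docs, SIGNAL_WORDS)
# dtm["novel_a"] → [2, 2, 1, 1], i.e. counts for life/inner/strange/mind
```

Comparing such rows across a writer's novels and plays is what makes a frequency contrast like the one reported below measurable.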
(figure)
Findings: 1) The initial classification was not deemed incorrect. 2) The authors who used the signal words most frequently in their novels almost never used them in their plays. This should be investigated more thoroughly.
Conclusions:
The combination of statistical investigation methods on one front and literary text mining on the other has the potential to yield valuable insights. This convergence allows for the testing of specific assumptions while simultaneously giving rise to new questions about the material.