Conference Agenda

Session Overview
Session
SESSION#11: CORPORA
Time:
Thursday, 30/May/2024:
3:00pm - 3:45pm

Session Chair: Mari Sarv, Estonian Literary Museum, Estonia
Location: H-207 [2nd floor]

Session Abstract

In this presentation, funding opportunities within the European Research Council for research projects in the Digital Humanities will be introduced.


Presentations
3:00pm - 3:15pm

Selecting texts for a corpus of Early Modern Icelandic

Jóhannes B. Sigtryggsson, Ellert Þór Jóhannsson

Stofnun Árna Magnússonar í íslenskum fræðum, Iceland

This presentation accounts for aspects of a recently established collaborative project which aims at digitizing Early Modern Icelandic texts and creating a text corpus from them. The project is supported by experts from three institutions: the National Library of Iceland – University Library (Lbs-Hbs), the National Archives of Iceland (ÞÍ) and the Árni Magnússon Institute for Icelandic Studies (SÁM). The project is funded with a research grant from the Icelandic Infrastructure Fund (Innviðasjóður) and coordinated by the Centre for Digital Humanities and Arts (Miðstöð stafrænna hugvísinda og lista, mshl.is).

The objective is threefold. Firstly, to digitize, through an OCR process, large volumes of manuscripts, documents and printed material from the period 1540–1850, thus making the texts available and accessible. Secondly, to mark up the texts grammatically and use them as a basis for a full-fledged language corpus. Thirdly, to facilitate, through the mark-up process and normalization of texts, access to a part of Icelandic cultural heritage that has been inaccessible to most people until now.

The beginning of the period is determined by the publication of Oddur Gottskálksson’s Icelandic translation of the New Testament (1540), which marks the beginning of printing in Iceland. The end of the period is determined by the first edition of Piltur og stúlka by Jón Thoroddsen from 1850, considered to be the first modern novel in Icelandic. All Icelandic books published during this three-hundred-year period are preserved and have been digitized and made available online through the website of the National Library, baekur.is.

Selected printed publications will be treated with OCR along with selected documents from the National Archives and manuscripts from the National Library, paying special attention to the writings of officials and scholars of this period.

The project will utilize the infrastructure that has already been created at the Centre for Digital Humanities and Arts, e.g. by using the artificial intelligence program Transkribus for the OCR processing of manuscripts and other written documents. The knowledge that has accumulated at the Árni Magnússon Institute over the last decade regarding digitization of texts as well as further processing of them will also be utilized.

The project is set to be carried out in three steps:

· OCR treatment of manuscripts, books and documents.

· Systematic production of grammatically marked-up texts (tagging and lemmatization), ensuring preservation and availability of the OCR-produced digital texts.

· Error detection and correction of the machine-readable OCR-produced text using artificial intelligence.

For this project, OCR models must be created. Experts from the aforementioned institutions will select manuscripts, documents and printed books for OCR processing that cover the entire period. Digital images of most of the texts are already available on the websites handrit.is, baekur.is, timarit.is and heimildir.is, which the participating institutions run and manage.

In this presentation we will account for the project and take a closer look at the text material that will be used in the corpus, how it will be selected and the criteria involved. We will discuss how the texts must be treated after the OCR process to be internally compatible. We will account for normalization standards, treatment of linguistically ambiguous forms and different possible representations of the digital texts in the corpus. Finally, we will mention some possible ways of enhancing the data, e.g. linking them to other resources such as dictionaries and other corpora.

The project will also have other benefits besides the text corpus. Correction models created by working with OCR texts will be useful for further development of OCR solutions for Icelandic texts and standards created at the Árni Magnússon Institute. Correction and standardisation models for OCR treated texts will be made publicly available on the research infrastructure site CLARIN (clarin.is/). Furthermore, this project will encourage the development of various technical solutions that will be useful to other institutions that host Icelandic databases in the field of humanities and art for conducting research. The project will strengthen cooperation between institutions and contribute to the development of new databases and updating of those that already exist. All technical solutions developed will adhere to international standards for metadata.

The corpus of Early Modern Icelandic will be especially useful for the study of the history of the language and the development of the lexis. The digital processing of texts and the creation of the text corpus will radically increase accessibility to the texts and their content and facilitate their use for research in fields such as history, ethnography, folklore studies and more.



3:15pm - 3:30pm

Representativity, Biases and Choices in Digital Corpora Curation: Latvian Diary Corpus

Haralds Matulis, Ilze Ļaksa-Timinska, Elvīra Žvarte

Institute of Literature, Folklore and Art of University of Latvia, Latvia

The presentation will assess the representativity of the Latvian Diary Corpus by reflecting on potential biases and describing the collection creators' choices in the formation of a digital corpus of diaries. The central question is to what degree a collection of digitized diaries can serve as a representative digital corpus to be analyzed with computational methods. When creating a collection of cultural heritage artifacts, values other than representativity are at stake, and oftentimes everything of historical value is added to the collection. Humanities researchers then study the collection with close reading and other qualitative methods, their domain competence allowing them to fill in the gaps and judge whether particular collection items are representative or accidental. However, to fully harness the possibilities of computational methods of digital humanities in corpus analysis, the corpus has to be representative, or at least its biases must be known; otherwise any statistical inference may be merely accidental.

The Latvian Diary Corpus is a part of the “Autobiographical Collection” curated by the Institute of Literature, Folklore and Art of the University of Latvia (ILFA). The Autobiography Collection, established in 2018, evolved from materials people have written during various periods to document their own lives and the times which they have experienced. Although the Autobiography Collection has been largely open to depositors of various autobiographical materials, all potential materials are assessed for their relevance to the collection. The manuscripts are then scanned, the originals are archived or returned to the owner, and the scanned materials are queued for transcription, which is done by volunteers in the digital archive's transcription tool or by specially recruited transcribers or researchers (Reinsone, Ļaksa-Timinska, Žvarte, 2024). Mostly these autobiographical materials are diaries, written life stories, memoirs, and letters, as well as various other materials providing complementary information: photographs, interviews with the authors, and their relatives’ stories about them. So far the collection contains a total of 233 autobiographical units of varying size and content, of which 118 units comprise the Latvian Diary Corpus.

Representativity of the diary corpus strongly depends on balanced gender, age, social class and regional coverage. Further methodological decisions arise from different diary writing practices, prompting the question of whether such different writings are the same phenomenon: the “daily routine” vs “event triggered” diary (Lejeune, 34) and the “journal” vs “autobiographical style diary” are all joined in a corpus under the label of diary. Representativity is further mediated by the profile of the persons who submit the diaries to the corpus. Who are the submitters of the diaries (authors, heirs, librarians, citizen historians), and how does that correlate with the content of the diaries? Which of the surviving materials are not donated to the corpus because of their content? Lynn Z. Bloom makes a distinction between “truly private diaries in contradistinction to those of private diaries intended as public documents” (Bloom, 27). Some researchers go further and note that a “truly private diary” should never be seen by other eyes and that authors would do better to destroy their diaries (Shiffman, 101). And then there is a tradition of self-writing, coming from the field of autobiographies, where the narrator functions as an “autobiographical I” and is usually cast in a good light, with the potential of publication looming in the mind of the writer (Heehs, 7). A closer look at the diary profiles in the Autobiography Collection reveals that all the diary types mentioned above are present, as well as interesting borderline cases, but not all diary types are represented in similar proportions.
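One simple way to make such coverage questions measurable is to compare the share of each category in the corpus metadata with a reference distribution. The following is a minimal sketch only: the metadata fields, the four toy records and the 50/50 reference split are hypothetical and are not taken from the Latvian Diary Corpus or its curation workflow.

```python
from collections import Counter


def category_shares(records, field):
    """Return the share of each value of `field` among the records."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def max_deviation(observed, reference):
    """Largest absolute gap between observed and reference shares."""
    keys = set(observed) | set(reference)
    return max(abs(observed.get(k, 0.0) - reference.get(k, 0.0)) for k in keys)


# Hypothetical toy metadata, invented for illustration only.
diaries = [
    {"id": "d1", "gender": "f", "type": "daily routine"},
    {"id": "d2", "gender": "f", "type": "event triggered"},
    {"id": "d3", "gender": "m", "type": "daily routine"},
    {"id": "d4", "gender": "f", "type": "daily routine"},
]

# Hypothetical reference distribution (an assumption, not a census figure).
reference_gender = {"f": 0.5, "m": 0.5}

gender_shares = category_shares(diaries, "gender")
gap = max_deviation(gender_shares, reference_gender)
print(gender_shares, gap)
```

The same two functions can be applied to any other metadata field (age group, region, diary type, submitter profile), provided a defensible reference distribution exists for it, which is itself one of the curatorial choices discussed here.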

Decisions of corpus curators on borderline self-writing items also influence the final shape of the corpus. The reluctance to decline potentially precious materials is countered by a need to maintain the conceptual integrity of the corpus. How should different cases of self-writing materials be categorized where the distinction is not clear-cut, and what should be done with materials where several autobiographical self-writing styles are mixed, e.g., memories followed by a diary? Oftentimes, due to limited resources, the digitization (scanning, transcribing) of all donated items cannot be done immediately. The prioritization of what to transcribe first, and what remains undigitized for the time being, then also influences the composition of the digital corpus, which is already used by researchers.

Although this presentation is focused on the Latvian Diary Corpus and each archival collection is unique, the assessment given in this presentation can provide some indications about the scale of biases and noise caused by different facets of the curation process which apply also to similar datasets, and could be useful to researchers in other areas of digital humanities.



3:30pm - 3:45pm

LatSenRom (1879-1940): the creation and iterations of the corpus of Latvian early novels

Anda Baklāne, Valdis Saulespurēns

National Library of Latvia, Latvia

LatSenRom (1879-1940) is a comprehensive corpus of Latvian long prose fiction, covering all novels released in Latvian as books from 1879 to 1940. The corpus was created by the National Library of Latvia, in collaboration with the Institute of Literature, Folklore and Art of the University of Latvia, and the Institute of Mathematics and Informatics of the University of Latvia. This paper marks the first exposition detailing the dataset's genesis, design, versions, and iterations.

The paper itemizes the workflows involved in creating the corpus, from digitization and optical character recognition through preprocessing and markup to version control and the iterations designed for specific use cases.

The design of data sets, although based on some general principles, is also dependent on the aims and methodologies of a particular study, hence there is more than one way for designing sound data sets. While the design of LatSenRom follows general principles, the versions or iterations are tailored to specific research needs.

The creation of LatSenRom drew inspiration from the Distant Reading COST Action, which was carried out to produce the European Literary Text Collection (ELTeC), a compilation of balanced collections of European novels in different languages, spanning publications from 1840 until 1920. Since all language collections were devised according to the same principles, the finalized datasets are methodologically rigorous and mutually comparable.

Within the context of ELTeC, the Latvian language collection could not achieve the target of 100 required novels, and the total number of works included was significantly reduced after balancing the corpus, due to the absence of novels in several categories: there were no novels published between 1840 and 1879, and only two novels were written by female authors. This situation was partly shared by collections in other smaller languages, sparking interesting discussions on the reasons why the novel genre emerged later in some regions, and how to redefine the criteria for representativeness when achieving balance among several categories is not feasible. Although the corpus of early Latvian novels, comprising 50 novels, was not representative according to ELTeC's criteria, it nonetheless encompassed every novel, providing complete coverage of publications from the first novel until 1920. The concept of full coverage inspired further development of the corpus as a collection of all works, foregoing the selection process. In its current iterations, the corpus includes all novels published as books until 1940, totaling approximately 460 works.

The question of how many Latvian novels exist proved to be one of the most challenging to answer. First, one might question what constitutes a novel. What distinguishes a novel from a very long story, a novella, a biography, or other forms of documentary literature with a significant fictional element? Second, the issue arises of where one novel ends and another begins: many works of prose fiction consist of parts, are published as trilogies, or as series. A rather formal approach was adopted to address the problem of defining the genre: items were selected based on existing bibliographies of novels. Multi-part works were separated or assembled according to specific guidelines.

Another challenge that arises when amassing corpora spanning long time periods is the change in writing tradition and grammar. At the beginning of the 20th century, Latvian publishing gradually transitioned from the old Gothic (Fraktur) to the Antiqua typeface, and the rules of orthography were changing as well.

It was decided that it was important to address the needs of different types of researchers: those interested in studying transcriptions that are as close as possible to the original texts, and those interested in the comparative analysis of works from different time periods, who therefore require normalized (equalized) versions of the texts. The concept of iteration is used to distinguish these parallel versions of texts from the versions that arise as the corpus changes over time through the correction of mistakes and the addition or removal of works.

The normalization of the corpus can be realized in (at least) two steps: normalization of the script (typeface) and normalization of grammar. The second step is not yet fully implemented at the time of writing this paper.
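The two steps, and the idea of keeping every stage as a parallel iteration of the same work, can be sketched as follows. This is an assumption-laden illustration rather than the LatSenRom implementation: the class and function names are hypothetical, and the replacement rules are abstract placeholders ("X", "xy", "z") that make no claim about Latvian script or orthography.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Normalizer = Callable[[str], str]


@dataclass
class Work:
    """One work with its parallel text iterations (hypothetical model)."""
    work_id: str
    original: str
    iterations: Dict[str, str] = field(default_factory=dict)


def replace_all(rules: Dict[str, str]) -> Normalizer:
    """Build a normalizer that applies plain string replacements in order."""
    def apply(text: str) -> str:
        for old, new in rules.items():
            text = text.replace(old, new)
        return text
    return apply


def build_iterations(work: Work, normalize_script: Normalizer,
                     normalize_grammar: Normalizer) -> Work:
    """Store the original and both normalized stages side by side."""
    work.iterations["original"] = work.original
    script_text = normalize_script(work.original)
    work.iterations["typeface_normalized"] = script_text
    # Step 2 operates on the output of step 1, not on the original.
    work.iterations["grammar_normalized"] = normalize_grammar(script_text)
    return work


# Abstract placeholder rules, chosen only to make the sketch runnable.
script_rules = replace_all({"X": "x"})
grammar_rules = replace_all({"xy": "z"})

demo = build_iterations(Work("demo", "Xy Xy"), script_rules, grammar_rules)
print(demo.iterations)
```

Keeping the original text untouched in the same record means that researchers who need the closest possible transcription and those who need comparable normalized text can work from the same release.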

In addition to the three iterations (original text, typeface-normalized text, and grammar-normalized text), LatSenRom is available in a structured form with morphological, syntactic, and NER markup. The corpora are available to users as individual datasets, or they can be explored through the interfaces provided by the National Library of Latvia: the corpus analysis platform (nosketch.lnb.lv) and the website Latvian Prose Counter (proza.lnb.lv).

A bibliographical dataset was created, encompassing records of all works featured in LatSenRom as well as records of their reprints, facilitating further analysis of the canonization of these works.



 
Conference: DHNB 2024