10:30am - 11:00am Understanding the Challenges for the Use of Web Archives in Academic Research
Sharon Healy1, Helena Byrne2, Olga Holownia3
1Independent Researcher; 2British Library, United Kingdom; 3International Internet Preservation Consortium (IIPC)
In this paper we examine the challenges for the use of web archives in academic research through a synthesis of the findings from two research studies that were published through the WARCnet research network. The aim of the WARCnet network was to promote high-quality national and transnational research to enable a better understanding of the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives. The network activities ran from 2020-2023.
This paper fits with the Education and Advocacy theme of the conference, with regard to advances in approaches to teaching digital tools and methods. To advance approaches to teaching digital tools and methods with a specific focus on the use of web archives for research, it is important first to understand the challenges faced by both non-users and users of web archives. Other studies have also done substantive work in this area, focusing on users of web archives, awareness of and engagement with web archives, and the scholarly use of web archives (Hockx-Yu, 2014; Riley and Crookston, 2015; Costea, 2018; Gooding et al., 2019). What is evident from such studies is the need to continually examine the challenges for users and non-users of web archives in order to assess and develop the skills, tools, and knowledge requirements for working with web archives for research purposes.
The first study under review is the Scholarly Use of Web Archives Across Ireland: The Past, Present & Future(s). This was a collaborative project which incorporated a review of resources and literature, informal dialogues with heritage colleagues, and an online survey. The study sought to (i) examine the causes of the loss of digital heritage and how this relates to Ireland, (ii) offer an overview of the landscape of web archives based across Ireland, and their availability and accessibility as resources for Irish-based research, and (iii) provide some insight into the awareness of, and engagement with, web archives in Irish third-level academic institutions. For this paper we focus on the section of the study that looks at scholarly engagement with web archives and the challenges perceived by users and non-users of web archives in Irish academia. We discuss scholarly awareness of and engagement with web archives in Irish academic institutions and offer some insights which may be useful when it comes to providing support and incentives to assist scholars in the use of the archived web for research and teaching. This study could be replicated in other jurisdictions to help understand user needs when using web archives in academic research.
According to the findings, participants from an Irish academic setting attributed their lack of engagement with web archives to:
- a lack of awareness of the availability of web archives,
- being unsure how relevant, useful, or beneficial a web archive would be for their research,
- not knowing how to use a web archive,
- not having the technical skills to use or process web archived content,
- not knowing how to find archived websites relevant to their research in a web archive,
- not knowing how to cite or reference an archived website from a web archive,
- being unsure of the credibility or authority of archived websites as a source,
- being unsure about the copyright implications of using archived web content for research,
- web archives not being relevant to their discipline.
Participants further outlined the perceived challenges in using web archives for research, which included:
- a lack of awareness of the existence, content, and value of web archives,
- understanding search and navigation mechanisms,
- working with large volumes of data,
- understanding access and discovery mechanisms,
- understanding the representativeness and completeness of the data in web archives,
- perceptions of web archived data being a non-established source and/or lacking source credibility,
- citing archived web content.
Overall, the findings suggest that there is limited awareness of the existence of web archives in Irish academic institutions. For an unfamiliar audience, therefore, more effort is needed to demonstrate the importance of archiving the web and to promote the value of web archives as resources for research. There is also a need to disseminate use cases from Irish-based research that demonstrate the use of web archives as a research resource. The study recommends developing multidisciplinary and interdisciplinary research networks in Irish academia to explore research models and paradigms for the use of web archives in Irish-based research that are fit for purpose across a broad spectrum of research fields.
The second study that we review is the Skills, Tools, and Knowledge Ecologies in Web Archive Research. This was a collaborative project between seven members of the WARCnet network from five institutions. Two of these institutions were academic and three were from the GLAM sector. The study sought to identify and document the skills, tools, and knowledge required to achieve a broad range of goals within the web archiving life cycle, and to explore the challenges for participation in web archive research and the overlaps of such challenges across communities of practice. The methodology for the study entailed desktop research, participation in WARCnet meeting discussions, and an online questionnaire. Respondents to the online questionnaire identified as residing in North America, Europe and Asia. In this paper we focus on the section of the study that looks at the challenges faced by researchers who work in an academic setting when using web archives.
According to the findings, participants who identified as researchers in an academic setting described several challenges in working with web archives:
- a lack of research methods, theory, and approaches for combining traditional methods with web archive research,
- having to learn new skills (e.g., programming, data sheets, etc.),
- working with large volumes of data in terms of storage, processing and analysis,
- a lack of access to more comprehensive metadata and documentation for web archive collections,
- legalities in terms of access to the data, use of the data, and storage of the data from web archives, as well as complications with legal deposit, copyright, and GDPR,
- a lack of experience in handling protected data from a web archive,
- the inability to download data from some web archives,
- citing archived web content and datasets.
Overall, the study emphasises that collaboration between the creators of web archives and end users is key. It notes how the web archive field would be enriched through the inputs of both communities for developing a better understanding of the research methods and approaches for using web archives. For example, the study indicates that there would be some value in extending introductory web archiving training to researchers, to give them a better understanding of the limitations of web archiving strategies due to technical challenges, legal constraints, and a lack of resources. The study also highlights that challenges for end users do not diminish with increasing experience, and emphasises the need for training across all levels of experience.
To end, both of these studies highlight that there is a steep learning curve in working with web archives for research. One challenge is understanding the terminology and technical language that is used within this field. To improve the accessibility of these reports, Healy et al. produced a glossary that was published through WARCnet. Towards a Glossary for Web Archive Research: Version 1.0 discusses the development of a glossary of terms and concepts for web archive research, using a novel approach which can be built on depending on user needs.
11:00am - 11:15am Diving Into the Digital Heritage: (re)Searching the Norwegian Web Archive
Jon Carlstedt Tønnessen
National Library of Norway
For more than two decades, web media have played a pivotal role in cultural and societal transformations. During that time, web archive initiatives have collected and preserved petabytes of web content that can serve as an important basis for studies of these recent and ongoing processes. However, scholars who want to study web archives have described significant obstacles to finding, exploring and analysing relevant data. A common critique is that web archives are “messy”, “chaotic” and unstructured, lacking an organisation that can make sense to humans. Further, scholars describe the organisation of archival data and the materiality of WARC (WebARChive) records as highly complex, requiring technical expertise for meaningful interaction and analysis. While web archives may be fundamental to understanding contemporary culture, phenomena, and events, the current situation does not satisfy the expectations and needs of many researchers.[1]
Learning from these lessons, the National Library of Norway has started to index the content of the Norwegian Web Archive (NWA) collection, allowing for search in full-text and rich metadata. Using SolrWayback, a bundle of technologies developed in the context of the NetarchiveSuite initiative[2], the service being built enables researchers to find and explore data and metadata, perform various analyses, and produce corpora of mono- or multimodal material for computational analysis. The paper will present how our efforts are made within existing legal and ethical commitments and in alignment with the FAIR principles, and will address some of the main opportunities and challenges associated with the current platform.
The paper will unfold in three parts. First, I will make the case that web archives are not unstructured. Rather, the data in web archives are organised according to different principles than in paper-based archives. While paper-based archives are often designed with a hierarchical structure, ordering items into a tree-like structure with nested categories and subcategories that aim to reflect the relationships and contexts of the documents in a way that often makes sense to humans, web archives have a flatter organisation, based on the way web crawlers work and retrieve data.[3] This flatter organisation increases the need for research infrastructures that provide relevant metadata with descriptions of the content, context, and other relevant attributes, making it easier to find related items, independent of the structure of the archive. It also points to the need for interfaces where scholars can search, filter, explore and interact with the archival content in a meaningful way.[4]
Second, I will present the results from indexing domain crawls from 2019 to 2022, enabling search in full-text and rich metadata for 100 million web resources. This includes a demonstration of the service, displaying the power of its advanced search syntax, examples of the richness of the metadata, and how it supports the FAIR principles. Further, I will show how export functionalities can be used to build corpora and facilitate computational analysis, such as NLP, speech-to-text, image classification and network analysis of domain relations. I will also share the main findings of a pilot study, which estimates that the HTML documents in the NWA collection contain more than 240 billion words in total. This alone is 50% more than all the printed newspapers and books digitised by the National Library, making it one of the largest text corpora in the world.
Third, I will review lessons learned from testing the service with more than 20 scholarly users. In addition to bringing forward important user experiences and observing how scholars experience the service, opening the archive for researchers has provided valuable insight into the collection, already improving harvesting. This is not only a reminder that archives gain value through their usage, but also highlights the importance of working systematically with user-orientation in designing and developing tools and services for the Digital Humanities.
Wrapping things up, I will briefly present how scholars can get access to the NWA service, and address some challenges that are currently unsolved, such as the identification of low-resource languages and scalability to infrastructures with distributed storage.
[1] Milligan, ‘Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives’; Ruest, Fritz, and Milligan, ‘Creating Order from the Mess: Web Archive Derivative Datasets and Notebooks’; Völske et al., ‘Web Archive Analytics’; Schafer and Winters, ‘The Values of Web Archives’; Vlassenroot et al., ‘Web Archives as a Data Resource for Digital Scholars’; Gomes and Costa, ‘The Importance of Web Archives for Humanities’.
[2] https://github.com/netarchivesuite/solrwayback/
[3] Webster, ‘Existing Web Archives’, 35–37.
[4] Brügger, ‘The Need for Research Infrastructures for the Study of Web Archives’, 221–22.
11:15am - 11:30am Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora
Amanda Myntti1, Veronika Laippala1, Erik Henriksson1, Elian Freyermuth2
1University of Turku, Finland; 2National Graduate School of Engineering of Caen
Web-scale corpora, automatically collected from the web and encompassing billions of words, present significant opportunities for diverse fields of research. These corpora play a pivotal role in the advancement of large language models, such as the one underpinning ChatGPT. Moreover, they include masses of texts produced in different situations with different objectives, and they host new forms of digital cultural heritage that are constantly emerging and evolving. Therefore, they open up new avenues for research within the humanities and social sciences, but also require multidisciplinary collaboration to guarantee their usability (Laippala et al. 2021; Välimäki and Aali 2022).
A notable challenge associated with web-scale corpora is the absence of metadata detailing their contents. Typically, they lack information regarding the origin and the content of the documents. Documents featuring different text varieties, ranging from legal notices to advertisements, news articles, fiction, and song lyrics, all have an equal status in the corpora. This study aims to address this challenge by exploring various approaches to classifying web corpora into specific subsections. In particular, we focus on registers, typically applied in corpus linguistics and understood as situationally defined text varieties (Biber and Conrad 2019), and genres, often utilized in literary studies when examining various forms of literary work (e.g., Goyal and Prakash 2022; Zhang et al. 2022).
In recent years, web register identification has taken leaps forward, with web register classifiers achieving nearly human-level performance (Laippala et al. 2023; Kuzman et al. 2023). However, in practice, when applied to web-scale corpora, the predicted register classes are still very broad, including a wide range of linguistic variation. Therefore, in this study, we examine if and how the available information can be deepened by combining two approaches: registers and genres.
We apply machine learning to train two text classifiers: one targeting registers and one focusing on genres. We utilize these classifiers to predict the classes for one million documents in the widely applied, web-scale Oscar dataset (Suarez et al. 2019; Laippala et al. 2022). Then, we 1) evaluate the distributions of the text classes predicted using the two classifiers, 2) analyze the intersections of the classes, and 3) examine how the combination of the two approaches extends the metadata available for the corpus.
The register classifier is trained using the CORE corpora (Biber and Egbert 2018; Laippala et al. 2022, 2023). The scheme is hierarchical and covers eight main categories with broad, functional labels such as narrative, informational explanation and opinion, and more detailed subcategories such as news report, research article and review. The training data for our genre classification model consists of books from Kindle US. The genre categories are assigned by the authors and selected from the possible categories of Kindle US, which include categories such as Children’s books, Science & Math, and Action & Adventure. The original dataset is available at https://huggingface.co/datasets/marianna13/the-eye and a cleaned version used in training is available at our Huggingface page at [deleted-for-review].
As some of the genre classes in the dataset are overlapping, we perform some pre-processing steps to improve the data quality. We choose a subset of genres that maximizes performance according to two criteria. Firstly, the chosen genres need to be present in our corpus, Oscar. Because Oscar is a web corpus, the genres most suitable for our task include Medicine & Health; Cookbooks, Food & Wine; Engineering & Transportation; and Politics & Social Sciences, which are common topics in online sources. Secondly, as categories are partially overlapping and some contain very few examples, we choose other categories based on their support in the dataset and by testing the performance with different candidate subsets.
The register model is implemented by finetuning XLM-RoBERTa-Large (Conneau et al. 2020) on the CORE corpus, with the task of register identification modeled as multilabel classification. For training, we use the Huggingface Transformers library. Preliminary results show that the register classifier is able to reach an F1-score of 0.77. We also use XLM-RoBERTa-Large as the basis of our genre classifier. Experiments are done on multiple genre subsets as described above. As with the register model, the task is framed as multilabel classification, and again we use the Huggingface Transformers library. We select the best prediction threshold based on the F1-score. Our preliminary results show an F1-score of 0.70.
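The threshold selection step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `micro_f1` and `best_threshold`, the toy probabilities, the candidate thresholds, and the choice of micro-averaged F1 are all assumptions made for illustration. In a multilabel setting each label gets its own probability, and a label is predicted whenever its probability reaches the threshold.

```python
from typing import List, Sequence, Tuple


def micro_f1(y_true: Sequence[Sequence[int]],
             y_prob: Sequence[Sequence[float]],
             threshold: float) -> float:
    """Micro-averaged F1 over all (document, label) decisions."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for truth, prob in zip(true_row, prob_row):
            predicted = prob >= threshold
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true: Sequence[Sequence[int]],
                   y_prob: Sequence[Sequence[float]],
                   candidates: List[float]) -> Tuple[float, float]:
    """Return the candidate threshold with the highest micro-F1 and that score."""
    best = max(candidates, key=lambda t: micro_f1(y_true, y_prob, t))
    return best, micro_f1(y_true, y_prob, best)


# Toy validation data (hypothetical): 3 documents, 3 labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_prob = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.4], [0.7, 0.55, 0.3]]

threshold, score = best_threshold(y_true, y_prob, [0.3, 0.5, 0.7])
print(threshold, score)
```

In practice the probabilities would come from the finetuned model's sigmoid outputs on a held-out validation set, and the candidate grid would usually be finer than the three values used here.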
We use the two classifiers to label one million documents of the Oscar corpus. In our experiments, we see some expected combinations between certain registers and genres, such as the Lyrical register and the Literature & Fiction genre often coinciding. We also see registers such as Interactive Discussion being divided into multiple genres, like Engineering & Transportation and Politics & Social Sciences, based on the topic of the discussion. Our preliminary qualitative evaluation shows that the predicted genre and register labels provide valuable auxiliary information which facilitates the use of the corpus in new ways in the study of digital cultural heritage. We will analyze the intersection and the different combinations of genre-register pairs using topic modeling and keyword analysis, and evaluate the benefits of cross-labeling a corpus as a tool for creating additional metadata.
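As a rough illustration of what analyzing the intersections of the two label sets can look like, the sketch below counts how often each register co-occurs with each genre. It is a simplified example under stated assumptions, not the authors' pipeline: the function name `cross_label_counts` and the four toy documents are hypothetical, and only the label names are taken from the abstract. Because both tasks are multilabel, a document contributes one count for every (register, genre) pair it carries.

```python
from collections import Counter
from itertools import product
from typing import Iterable, List, Tuple

Doc = Tuple[List[str], List[str]]  # (predicted registers, predicted genres)


def cross_label_counts(docs: Iterable[Doc]) -> Counter:
    """Count co-occurrences of every (register, genre) pair across documents."""
    counts: Counter = Counter()
    for registers, genres in docs:
        for pair in product(registers, genres):
            counts[pair] += 1
    return counts


# Hypothetical predictions for four documents.
docs: List[Doc] = [
    (["Lyrical"], ["Literature & Fiction"]),
    (["Lyrical"], ["Literature & Fiction"]),
    (["Interactive Discussion"], ["Engineering & Transportation"]),
    (["Interactive Discussion"], ["Politics & Social Sciences"]),
]

counts = cross_label_counts(docs)
for (register, genre), n in counts.most_common():
    print(f"{register} x {genre}: {n}")
```

The resulting pair counts form a contingency table that can be attached to the corpus as additional metadata or used to pick register-genre subsets for closer qualitative study.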
11:30am - 11:45am A test of browser-based collection of streaming services’ interfaces
Andreas Lenander Ægidius
The Royal Danish Library, Denmark
This paper presents a test of browser-based Web crawling on a sample of streaming services’ web sites and web players. We are especially interested in their graphical user interfaces since the Royal Danish Library collects most of the content by other means. In a legal deposit setting and for the purposes of this test we argue that streaming services consist of three main parts: their catalogue, metadata, and the graphical user interfaces. We find that the collection of all three parts is essential in order to preserve and play back what we could call 'the streaming experience'. The goal of the test is to see if we can capture a representative sample of the contemporary streaming experience, from the initial login to (momentary) playback of the contents, for the benefit of digital preservation and media research.
Currently, the Danish Web archive (Netarkivet) implements browser-based crawl systems to optimize its collection of the Danish Web sphere (Myrvoll et al., n.d.). The test will run on Browsertrix Cloud (Webrecorder, n.d.). Our sample includes streaming services for books, music, TV-series, and gaming, e.g. Netflix, DRTV, Spotify, and Twitch.tv.
The very thing that defines the streaming era is what threatens to impede access to important media history and cultural heritage. Streaming services are transnational and they have paywalls, while content catalogues and interfaces change constantly (Colbjørnsen et al., 2021). This makes it challenging to collect and preserve how they present and play back the available content. On a daily basis, Danes stream more TV (47 pct.) than they watch flow-TV (37 pct.) and six out of 10 Danes subscribe to Netflix (Kantar-Gallup, 2022). Streaming is a standard for many and no longer a first-mover activity, at least in the Nordic region of Europe (Lüders et al., 2021).
The Danish Web archive collects websites of streaming services as part of its quarterly cross-sectional crawls of the Danish Web sphere (The Royal Danish Library, n.d.). A recent analysis of its collection of web sites and interfaces concluded that the automated collection process provides insufficient documentation of the Danish streaming services (Aegidius and Andersen, in review).
This paper presents findings from a test of browser-based crawls of streaming services’ interfaces. We will discuss the most prominent sources of errors and how we may optimize the collection of national and international streaming services. The test will include a concurrent dialogue with the developers of the software via their GitHub. This collaborative approach highlights how digital humanities tools can be 'live' and in-the-making. We hope to discuss the quality of what we can collect and what we gain from librarians and developers collaborating: how does the collaboration, i.e., active development of tools, impact the archive that is being made, and how do we document the process? Can we capture these aspects of our test and the resulting archives in a datasheet specifically for digital cultural heritage datasets? Alkemade et al. (2023) propose a datasheet that supports the documentation of practices and procedures established in GLAM institutions that lead to establishing collections’ descriptions. Collecting and preserving the interfaces of streaming services produces digital cultural heritage datasets that are marked by specific characteristics. They provide a case for digital archives as datasets that are often the product of multiple layers of selection; they may have been created for different purposes than establishing a statistical sample according to a specific research question; they change over time and are heterogeneous (Alkemade et al., 2023).
11:45am - 12:00pm Memory in the Mediated Age: Unveiling the Dynamics of American Society's Memory through Twitter Discourse on Lynching
Feeza Vasudeva, Narges Azizi Fard, Eetu Makela
University of Helsinki, Finland
In the contemporary mediated age, the landscape of collective memory is undergoing a transformative shift, particularly evident in the realm of social media. Marked by the 'connective turn', which emphasizes the sudden surge of digital media, communication networks and online archiving, we witness an unprecedented shift in our understanding of and engagement with memory (Hoskins, 2011). Exploring this shift, the study delves into the intricate web of remembrance and oblivion, focusing on how American society remembers and forgets specific historical events, centring on the enduring historical trauma of lynching. Utilizing mixed methods and computational tools, the research explores the dynamics of memory formation and persistence, as well as memory decay, within Twitter discourse.
The legacy of lynching, a deplorable chapter in American history, continues to echo in collective memory, undergoing a transformative evolution. From concrete and literal spectacles of white supremacist violence, lynching has morphed into one of the most vivid symbols of race oppression, serving as a poignant metaphor for ongoing racial relations in the United States (Rice, 2006). Through the analysis of extensive tweet datasets, this research seeks to uncover patterns, sentiments, and discourse structures, offering insights into the evolving nature of collective memory surrounding lynching. Computational methods including Named Entity Recognition, topic modeling, network analysis, and LIWC software are employed to reveal the emotional dimensions of collective memory, identify recurring themes, and trace the social dynamics shaping the discourse (Jiang & Xu, 2023; Sumikawa, Jatowt, & Düring, 2018).
Furthermore, the ‘connective turn’ has re-engineered memory, liberating it from traditional constraints like spatial archives, organizational structures, and institutions. Instead, memory is distributed continuously through connectivity. This prompts an exploration of the temporal dimensions linking historical memory to contemporary events, investigating instances where the memory of lynching intertwines with modern occurrences, influencing and reshaping public discourse on Twitter. The findings contribute valuable insights into understanding the delicate balance between remembering and decay in the mediated age, particularly through the lens of social media platforms. Additionally, the study sheds light on the compulsive nature of contemporary connective practices within digital media content. Individuals and groups actively engage in various connective actions like posting, liking, tweeting, scrolling, and forwarding, forming a coercive multitude that eschews traditional debates in favor of digital emotive expressions, often conveyed through emoticons (Hoskins, 2011).
As the research unfolds, the connective turn introduces a crucial perspective on the ontological shift in what memory is and does (Hoskins, 2017). This paradigm shift, both arresting and unmooring the past, challenges traditional notions of historical consciousness, particularly in the light of ‘technology-mediated memory’ (Elwood & Mitchell, 2015). The insights gained from computational analyses inform discussions on the role of social media platforms in shaping historical memory, thereby also serving as a bridge between the disciplines of memory studies and computational studies.
References
Hoskins, A. (2011). Anachronisms of media, anachronisms of memory: From collective memory to a new memory ecology. In On media memory: Collective memory in a new media age (pp. 278-288). London: Palgrave Macmillan UK.
Rice, A. (2006). How We Remember Lynching. Nka: Journal of Contemporary African Art, 20(1), 32-43.
Jiang, K., & Xu, Q. (2023). Analyzing the dynamics of social media texts using coherency network analysis: a case study of the tweets with the co-hashtags of #BlackLivesMatter and #StopAsianHate. Frontiers in Research Metrics and Analytics.
Sumikawa, Y., Jatowt, A., & Düring, M. (2018). Digital History meets Microblogging: Analyzing Collective Memories in Twitter. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (JCDL '18). Association for Computing Machinery, New York, NY, USA, 213–222.
Hoskins, A. (2017). The restless past: An introduction to digital memory and media. In Digital memory studies (pp. 1-24). Routledge.
Elwood, S., & Mitchell, K. (2015). Technology, memory, and collective knowing. Cultural Geographies, 22(1), 147-154.