3:00pm - 3:15pm Title Matters: Film Metadata Analysis
Agata Hołobut1, Miłosz Stelmach1, Maciej Rapacz2
1Jagiellonian University, Poland; 2AGH University, Poland
Starting with the early attempts at statistical analysis of style (e.g., Salt, 1974; 1983; cf. Baxter, 2014), quantitative and digital approaches to film study have attracted growing scholarly attention. According to Burghardt et al. (2020), three main disciplinary strands can be distinguished within this research area: (1) the infrastructural strand, advanced by digital archivists working for cultural institutions and documentation centres, which focuses on making digital collections accessible and mineable; (2) the computational strand, advanced by computer scholars and media informaticians working on semi-automatic and automatic analysis of film and video content based on multimedia information retrieval; (3) the media strand advanced by film and media scholars adopting a quantitative approach to film study along the lines set out by numerical humanities (Roth, 2019).
Our contribution locates itself within the third subfield mentioned above. However, unlike scholars involved in the quantitative analysis of the cinematic medium, who investigate e.g. shot scale and length patterns, the cutting structure, luminosity and colour information or visual activity patterns (e.g., Cutting et al., 2011; Baxter, Khitrova and Tsivian, 2017; Heftberger, 2018; Bakels et al., 2020; Flueckiger and Halter, 2020; Hielscher, 2020; Pustu et al., 2020; Wang 2023), we mine and analyse film metadata, i.e., the verbal and numerical cultural information preserved in film databases. This approach has already proved useful in cultural evolutionary studies and cultural analytics, where scholars have studied, among other things, changes in crew structure (Tinits and Sobchuk, 2020); institutional funding patterns (Van Beek and Willems, 2022); distribution circuits in film festivals (Zemaityte et al., 2023); remaking and cultural recycling practices (Stelmach, Hołobut and Rybicki, 2022); as well as the dissemination of selected themes in mainstream cinema (Viacom 2017).
In our presentation, we wish to discuss the interdisciplinary insights we gathered from a collaborative project that brought together a film scholar, translation scholar and information technologist. Based on the analysis of film metadata, we set out to explore global trends in film distribution, focusing on title translation practices around the world (Hołobut, Stelmach and Rapacz, forthcoming).
Film titles have so far attracted relatively little attention in film studies, with major quantitative advancements coming from entertainment industry economists and marketers investigating the impact of naming practices on box-office revenues (e.g. Xiao, Cheng, and Kim, 2021; Chung and Eoh, 2019; Bae and Kim, 2019). Most insights have been offered to date by linguists exploring the persuasive power of film titles (e.g. Dengler 1975) and translation scholars exploring linguistic and cultural shifts in title translation (or transcreation) practices (e.g. Santaemilia and Soler-Pardo, 2014; Ross, 2018; Yin 2009; Shokri 2014; Fakharzadeh 2022; Iliescu-Gheorghiu, 2016; Gabrić et al. 2022). The latter studies have been numerous, yet predominantly qualitative, limited to local film markets and narrow timeframes.
Our project was designed to offer a global diachronic perspective on title localisation patterns. To this end, we sourced title translations of two corpora of feature films released between 1950 and 2022, based on the metadata retrieved from the Internet Movie Database (IMDb). Our dataset contained, respectively:
- all localised title variants (the so-called “AKAs”) of feature films of the Official Selection at Cannes Festival since 1950 (i.e., art films directed at cinephiles around the world and usually distributed by independents, 3,114 films and 59,953 localised title tokens in total);
- annual top fifty productions distributed in most language versions since 1950, according to IMDb (3,650 films and 154,277 localised title tokens, distributed by major studios).
Each record in our dataset contained the following attributes:
- original movie title
- year of release
- localised movie title – the title under which the movie was distributed in a given region
- region of distribution
- genre(s) of the movie
- country of origin.
We analysed the dataset with a desire to test four preliminary hypotheses concerning:
- the growing share of films distributed with non-translated titles (e.g. Dutch and Ecuadorian official titles of the American comedy My Big Fat Greek Wedding being exactly the same as the original version);
- the growing share of films distributed with hybrid names that combine non-translated titles with added subheadings (e.g. the German variant of the title reading My Big Fat Greek Wedding – Hochzeit auf griechisch);
- the presence of global and regional discrepancies in the ways festival and mainstream film titles are being localised for respective audiences;
- the variation of localising trends relative to genre and region of distribution.
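The distinction behind the first two hypotheses can be sketched in code. The following is a minimal illustration, not the project's actual procedure: the function names, the three labels and the matching rule (an exact match after normalisation counts as non-translated; the full original title embedded as whole words in a longer title counts as hybrid) are all assumptions made for the example.

```python
import re
import unicodedata


def normalise(title: str) -> str:
    """Unicode-normalise, case-fold and collapse whitespace (assumed rule)."""
    return " ".join(unicodedata.normalize("NFKC", title).casefold().split())


def classify_title(original: str, localised: str) -> str:
    """Label a localised title as 'non_translated', 'hybrid' or 'other'.

    Hypothetical helper: the labels and the matching rule are illustrative.
    """
    src, tgt = normalise(original), normalise(localised)
    if not src:
        return "other"
    if tgt == src:
        return "non_translated"
    # Hybrid: the complete original title occurs as whole words
    # inside a longer localised title (e.g. with an added subheading).
    if re.search(r"(?<!\w)" + re.escape(src) + r"(?!\w)", tgt):
        return "hybrid"
    return "other"
```

Under these assumptions, a localised title identical to My Big Fat Greek Wedding is labelled non-translated, while My Big Fat Greek Wedding – Hochzeit auf griechisch is labelled hybrid. Very short original titles may still produce false hybrid matches, so a real classifier would need further checks.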
For our trend analysis, we used a rolling mean with a window of size five. Based on over 200,000 localized titles, the study confirmed only a few of our preliminary hypotheses. It showed a steady global increase in the annual share of films distributed internationally under their original titles – the share started at around 20% in the 1950s and has climbed to roughly 35% in recent years. It also showed a stable but insignificant share of titles that combine non-translated original with target-language additions, thus undermining our preliminary assumption that such hybrids are gaining popularity around the world. We also discovered remarkable similarities in global trends concerning title translation for mainstream and arthouse productions.
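The smoothing step can be illustrated with a short sketch. It assumes a trailing (not centred) window and leaves the first four years without a value; the abstract does not state which window alignment was used, and the function names and record layout below are hypothetical.

```python
from collections import defaultdict


def annual_share(records, label="non_translated"):
    """Share of titles carrying `label` per year.

    `records` is an iterable of (year, label) tuples (assumed layout).
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for year, lab in records:
        totals[year] += 1
        if lab == label:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def rolling_mean(values, window=5):
    """Trailing rolling mean; positions without a full window are None."""
    out, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out
```

Applied to the yearly shares in chronological order, `rolling_mean(list(shares.values()))` yields the smoothed trend line; a centred window would shift the curve by two years but not change its overall shape.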
Since most movies included in the “top fifty” corpus are American exports (i.e., 66%, 2,402 films) accounting for two-thirds of localized title tokens (i.e., 104,378), the steady Americanization of titles used in their worldwide distribution illustrates the economic, cultural, and political hegemony of “Global Hollywood” (Miller 2007) or “Hollyworld” (Hozic 2001) and the spread of English as a lingua franca. Currently, most of the production and worldwide distribution of American films is controlled by media and entertainment conglomerates, such as Warner Bros Discovery, Sony Columbia, Paramount Global or The Walt Disney Company, which exert enormous influence on localisation trends and patterns (Ross 2018). Quite surprisingly, however, on a global scale, similar patterns are also visible in the translation of arthouse films, the production and distribution of which involve different stakeholders.
On a regional scale, however, major discrepancies are visible both in dominant localisation solutions and in approaches to festival and mainstream productions. With regard to the former, Central and Eastern Europe invariably prefer domesticating practices, while Asia Pacific or the Middle East and Africa prefer “non-translation” practices. As for the latter, in some parts of the world (Europe, South and Central America), title translation norms are similar regardless of the corpus; in others (Asia Pacific, North America, Middle East and Africa), they are diversified, possibly reflecting different stakeholders involved in the decision-making process (cf. Ross 2013; 2018).
Further research paths that we envisage for our project include explaining in depth the growth of non-translation practices in local contexts, conducting more detailed regional breakdowns, and automating a finer-grained classification of translation strategies other than verbatim transfer and hybridization, which will be of special interest to translation scholars.
3:15pm - 3:30pm Found in Translation: Sourcing parallel corpora for low-resource language pairs
Hinrik Hafsteinsson, Steinþór Steingrímsson
Árni Magnússon Institute for Icelandic Studies, Iceland
This paper describes the sourcing, processing, and application of parallel text data for Icelandic and Polish, demonstrating how a parallel corpus can be compiled for a language pair that has no available parallel data, by pivoting through a common language. We show the usefulness of the corpus by training and evaluating a machine translation (MT) model on the data. Iceland's linguistic landscape is evolving, with an increasing need for multilingual support due to the growing immigrant population. Polish, in particular, stands out as the language of the largest single minority in Iceland, underscoring the importance of this project.
Openly available parallel corpora for Icelandic are currently limited to English-Icelandic corpora, the most important being ParIce (Barkarson and Steingrímsson 2019), although that same language pair is also included in a number of multilingual data collection projects mostly containing web-scraped texts. An ongoing project, “Mikilvægur orðaforði fyrir fjöltyngi og vélþýðingar” (en. Important Vocabulary for Multilinguality and Machine Translation), leverages openly available datasets to source parallel texts and lexicons for the languages of the main immigrant communities in the country, with the goal of using them to compile bilingual dictionaries as well as for MT and general NLP tasks. The languages that have been sourced are Polish, Spanish, Tagalog, Thai, and Ukrainian. As a measure of the wider project's methodology and applicability, we illustrate the methods and source datasets used for sourcing specifically Icelandic–Polish parallel text data, as Polish speakers are the largest single minority in Iceland.
In the case of Polish, we source our parallel texts from CommonCrawl Aligned (CCAligned, El-Kishky et al. 2020), CommonCrawl Matrix (CCMatrix, Schwenk et al. 2019b, Fan et al. 2021), No Language Left Behind (NLLB, Costa-jussà 2022), OpenSubtitles (Lison and Tiedemann 2016), ParaCrawl (versions 6 through 9, Bañón et al. 2020), TildeMODEL (Rosiz et al. 2017) and WikiMatrix (Schwenk et al. 2019). All of these datasets were sourced via OPUS (Tiedemann 2012). These datasets contain pre-aligned parallel texts for multiple languages, including English and Icelandic. We employ English as a pivot language in this context, which means that we first identify sentences in English that are translations of both Icelandic and Polish sentences. This approach is based on the assumption that if an Icelandic sentence and a Polish sentence have the same English translation, they are likely to convey the same meaning. The extraction process involves automatically comparing and matching English sentences across the Icelandic and Polish datasets. This pivoting step is not performed on each dataset in isolation; instead, we iterate through the target-language datasets (here English–Polish) and compare the English sentences to a concatenation of all English–Icelandic sentence pairs found in all the datasets. This comprehensive approach aims to capture any potential Icelandic-Polish sentence pair combinations, acknowledging the method's inherent greediness. Subsequent filtering steps address any resulting redundancy or extraneous data.
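The pivoting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, datasets are modelled as in-memory lists of (English, other-language) tuples, and English sentences are matched exactly after trimming whitespace.

```python
from collections import defaultdict


def _key(english: str) -> str:
    """Matching key for an English sentence (assumption: exact match after strip)."""
    return english.strip()


def build_pivot_index(en_is_datasets):
    """Concatenate all English-Icelandic datasets into one lookup table.

    Maps each English sentence to the set of Icelandic sentences aligned with it.
    """
    index = defaultdict(set)
    for dataset in en_is_datasets:
        for english, icelandic in dataset:
            index[_key(english)].add(icelandic)
    return index


def pivot_pairs(en_pl_datasets, index):
    """Yield (Icelandic, Polish) pairs that share an English pivot sentence.

    Greedy by design: every Icelandic sentence aligned with the pivot is paired
    with the Polish sentence, leaving redundancy to later filtering.
    """
    for dataset in en_pl_datasets:
        for english, polish in dataset:
            for icelandic in index.get(_key(english), ()):
                yield icelandic, polish
```

Because the index spans every English–Icelandic dataset, a Polish sentence from one source can be paired with an Icelandic sentence from another, which is what makes the approach greedy.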
To ensure optimal coverage for the available data for each language, we combine our data with "pre-pivoted" datasets, which, in the case of Polish, are MultiCCAligned and MultiParacrawl, both of which are provided by OPUS.
Initial manual checks of the output sentence pairs were performed to ensure data quality, though a comprehensive manual review of each pair was beyond the scope of this project (automatic evaluation methods are detailed below). Given the potential overlap in the input datasets, we applied a deduplication filter to the output, ensuring each sentence pair is unique in the final dataset. Additionally, we implemented a blanket-filtering step: only sentences shorter than 2000 characters were retained, and each sentence had to have at least 60% of its characters belonging to its respective language's alphabet. This criterion serves as a basic assurance that each sentence pair accurately represents its designated languages, Icelandic and Polish, respectively. Ultimately, this methodology yielded a bilingual dataset comprising 3,138,529 sentence pairs.
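The deduplication and blanket-filtering steps can be sketched as below. The thresholds (fewer than 2000 characters, at least 60% in-alphabet characters) come from the description above; everything else is an assumption for illustration, including the function names, the exact alphabet sets, and the choice to compute the ratio over non-whitespace characters after lowercasing.

```python
# Assumed alphabets (lowercase); the authors' exact character sets are not given.
ICELANDIC = set("aábdðeéfghiíjklmnoóprstuúvxyýþæö")
POLISH = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")


def alphabet_ratio(sentence: str, alphabet: set) -> float:
    """Fraction of non-whitespace characters that belong to `alphabet`."""
    chars = [c for c in sentence.lower() if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in alphabet for c in chars) / len(chars)


def keep_pair(icelandic: str, polish: str, max_len: int = 2000,
              min_ratio: float = 0.6) -> bool:
    """Apply the length and alphabet criteria to both sides of a pair."""
    return (
        len(icelandic) < max_len
        and len(polish) < max_len
        and alphabet_ratio(icelandic, ICELANDIC) >= min_ratio
        and alphabet_ratio(polish, POLISH) >= min_ratio
    )


def filter_pairs(pairs):
    """Drop exact duplicate pairs, then pairs failing the blanket filter."""
    seen, kept = set(), []
    for pair in pairs:
        if pair in seen:
            continue
        seen.add(pair)
        if keep_pair(*pair):
            kept.append(pair)
    return kept
```

Note that the alphabet check is only a coarse signal: the two alphabets share most of their letters, so it mainly removes pairs dominated by digits, symbols or other scripts rather than sentences in the wrong Latin-script language.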
As a measure of the quality of this new bilingual dataset, we train an MT model and apply it on a standard task. To gauge the quality of the sentence pairs themselves, we use Language-agnostic BERT Sentence Embedding (LaBSE, Feng et al. 2022) to vectorize the sentences and give a rough estimate on each pair's similarity. This process scores each pair from 0 to 1 based on similarity, facilitating the creation of subsets for training MT models by removing sentence pairs likely to be incorrectly translated or detrimental for MT training in some other way:
3,138,529 pairs (all sentences)
3,094,917 pairs (LaBSE score ≥ 0.3)
2,950,339 pairs (LaBSE score ≥ 0.5)
2,505,558 pairs (LaBSE score ≥ 0.7)
979,326 pairs (LaBSE score ≥ 0.9)
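The scoring and subsetting logic can be sketched as follows. The sketch assumes the sentence embeddings have already been computed (for instance with a LaBSE encoder) and are passed in as plain lists of floats; the function names are hypothetical, and clipping negative cosine values to 0 so that scores fall between 0 and 1 is an assumption rather than a documented detail of the project.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_score(u, v):
    """Cosine similarity clipped to the range [0, 1] (assumed convention)."""
    return max(0.0, min(1.0, cosine(u, v)))


def subsets_by_threshold(scored_pairs, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Group pairs into nested subsets, one per minimum score.

    `scored_pairs` is an iterable of (pair, score) tuples.
    """
    scored_pairs = list(scored_pairs)
    return {
        t: [pair for pair, score in scored_pairs if score >= t]
        for t in thresholds
    }
```

Each higher threshold yields a subset of the one below it, which matches the shrinking pair counts listed above.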
We found that by fine-tuning the mBART-50 model (Tang et al. 2020) on the Icelandic->Polish translation direction using only sentence pairs from our dataset with LaBSE scores higher than 0.7, we obtained a BLEU score (Papineni et al. 2002) of 13.0 on the Flores (Guzmán et al. 2019) evaluation set, just below the BLEU score of 13.3 obtained for the only previously published MT model for this language pair (Símonarson et al. 2022). Furthermore, the models trained on the subsets reveal insights into the relationship between data quality and translation accuracy, showing effects similar to those demonstrated by Steingrímsson et al. (2023).
This work has significant implications for real-world applications. Effective Icelandic–Polish translation models can facilitate communication in educational, governmental, and social contexts, directly benefiting the immigrant community and promoting cultural integration. The methodology and findings also contribute to the broader field of NLP, especially in language pairings involving lesser-studied languages.
Future work includes expanding our dataset to encompass more languages and refining our methodologies based on these initial findings. Additionally, exploring advanced data processing techniques to handle linguistic nuances and idiomatic expressions more effectively remains a priority.
In conclusion, this project not only addresses a critical need in Iceland's changing linguistic landscape but also contributes to the global effort in NLP research, particularly in enhancing machine translation for minority languages. Our approach, rooted in leveraging existing datasets and innovative processing techniques, paves the way for more inclusive and effective language technology solutions.
3:30pm - 3:45pm The world in Norwegian: Glimpses from a digital exploration of bibliomigration patterns
Inger Hesjevoll Schmidt-Melbye1, Marcus Axelsson2, Siri Fürst Skogmo3
1NTNU, Norway; 2Østfold University College, Norway; 3Innland Norway University of Applied Sciences, Norway
Many important patterns in literary history are still poorly understood, because they weren’t easily grasped at the scale of individual reading (Underwood, 2017, np.)
Our project, The world in Norwegian, is positioned at the intersection of Digital Humanities, Translation Studies, Sociology of literature and Library development. Although methods and perspectives from Digital Humanities have advanced in many fields for several decades, they remain relatively unexplored within Translation Studies (Tanasescu, 2021; Wakabayashi 2019), especially in the context of smaller language cultures, as is the case for Norwegian (Horvath, 2021; Spence & Brandao, 2021). We also see a potential in strengthening the collaboration between DH and research support from university libraries (Zhang, Liu & Mathews 2015; Zhang, Xue & Xue, 2021).
This study is a collaboration between translation scholars from three different Norwegian higher education institutions and research librarians from the Norwegian National Library and the Norwegian University of Science and Technology. Through this collaboration we contribute to enhancing university libraries’ knowledge and infrastructure in Digital Humanities in Norway. The project has received funding from the Norwegian National Library from August 2023 to August 2025.
The primary aim of our study is to investigate translation flows using digital methods and tools. We use Norway as a case in point and investigate the Norwegian literary import over the last 200 years (since 1800). A central concept in the project is bibliomigrancy or bibliomigration (Mani, 2014; Lindqvist, 2015) which is “an umbrella term that describes the migration of literary works in the form of books from one part of the world to the other” (Mani 2014, p. 289).
We partly draw on theories by Heilbron (1999; see also Sapiro, 2010), suggesting that the literary world can be described in terms of centres and peripheries and of open and closed literary systems. According to these theories, Norway is a periphery and an open literary system. This means that Norwegian literature, to a large extent, consists of translated works, and that translated works have been an important literary factor ever since the country gained its independence some 200 years ago. English dominates as the most common source language in Scandinavia and is sometimes described as a hyper-central language (Lindqvist 2015, pp. 78-80; De Swaan 2001, pp. 4-5). English has traditionally been used, and still is, as an intermediate language in indirect translation from languages that are lesser known in Norway (Rindal et al. 1998).
The main material for the study consists of library catalogues, mainly the National library catalogue of Norway, but also metadata from international library catalogues. The DH lab at the Norwegian National Library has developed an app which collects information from the National library catalogue and, through searching for different bibliographic system criteria, generates an overview of translated fiction.
We developed the app to discover bibliomigration patterns for translated literature in Norway and to gain a more multifaceted picture of the source cultures and languages involved, going beyond the familiar confirmation of Anglo-American dominance.
Throughout the process, we (the researchers and the National Library) have been working closely to figure out on the one hand, which metadata we ideally want to collect from a Translation Studies perspective, and on the other hand, which metadata is actually possible to collect in large quantities through digital humanities tools.
One purpose of the app is to provide a visual representation of bibliomigration patterns by showing on a map which countries the translated literature originates from. This map will be open access and freely available to researchers, but our aim is also that the map serve more than one purpose and be useful to everyone who is interested in bibliomigration patterns and the sociology of literature – for example in public and school libraries, in the publishing sector and in the Norwegian Association of Translators.
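The kind of aggregation that could feed such a map can be sketched in a few lines. This is an illustration only and does not describe the DH lab's app: the record layout, the field names `origin_country` and `year`, and the grouping by decade are all hypothetical.

```python
from collections import Counter


def decade(year: int) -> int:
    """Return the first year of the decade, e.g. 1891 -> 1890."""
    return year // 10 * 10


def flows_by_decade(records):
    """Count translated works per (decade, country of origin).

    `records` is an iterable of dicts with the hypothetical keys
    'origin_country' and 'year'; incomplete records are skipped.
    """
    flows = Counter()
    for record in records:
        country = record.get("origin_country")
        year = record.get("year")
        if country is None or year is None:
            continue
        flows[(decade(year), country)] += 1
    return flows
```

A fixed country label per record is a simplification: it does not capture borders and state names that change over the period covered, so a real map would need a time-aware mapping from publication place to territory.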
The creation of the map poses challenges on at least two levels: First, how to collect the data which provides us with information about the publication place of the first edition of the original work, and second, how to visually represent diachronic data in a geopolitically shifting reality.
In addition to serving as the foundation of the map, the app will also be useful for example in finding and researching translations of works by a particular author, or by a particular translator. In the test stages, we have also found that the app can be used by librarians for information retrieval. The collaborative aspects of this project have sparked several outcomes that were not included in our initial proposal, and we foresee that we will experience more of this as the project progresses. In our paper, we will present the project as work in progress and demonstrate the digital resources in their current state, as well as discuss the challenges we experience.