Conference Agenda


 
 
Session Overview
Session
GOR Thesis Award II: PhD
Time:
Tuesday, 01/Apr/2025:
12:00pm - 1:15pm

Session Chair: Olaf Wenzel, Wenzel Marktforschung, Germany
Location: Max-Kade-Auditorium


Presentations

Identifying, Characterizing, and Mitigating Errors in the Automated Measurement of Social Constructs from Text Data

Indira Sen

University of Mannheim, Germany

In this abstract, I summarize the contents and contributions of my doctoral thesis “Identifying, characterizing, and mitigating errors in the automated measurement of social constructs from text data”. My doctoral research is highly interdisciplinary and sits at the intersection of Natural Language Processing (NLP) and Social Science.

Relevance: Computational Social Science (CSS) brings a transformative approach to social science by leveraging digital trace data—data generated through online platforms and digital activities [1]. This form of data is distinct from traditional social science data such as surveys, providing vast, real-time information on social constructs, that is, human behaviors and attitudes. Given the large-scale nature of digital traces, new methods of analysis are needed to draw insights from them, particularly automated techniques using NLP and Machine Learning.

Yet, despite these advantages, research with digital traces and automated methods also presents unique challenges [2]. Digital trace data lacks well-established validity measures and is subject to biases from platform-specific artifacts, incompleteness, and methodological gaps. Traditional social science emphasizes robust frameworks for minimizing biases in survey data, but such practices remain underdeveloped for digital trace data, making the rigorous study of its limitations critical. By systematically identifying and addressing these issues, we can advance the field and offer more reliable insights into complex social constructs such as political attitudes, prejudicial attitudes, and health-related behaviors. This is what I set out to do in my doctoral dissertation by exploring and answering the following research questions:

Research Questions:

1. RQ1: What errors and biases arise when using digital trace data to measure social constructs?

2. RQ2: Can survey items enhance the validity of computationally measured social constructs from digital trace data?

3. RQ3: Can both manual and automated data augmentation techniques improve the robustness and generalizability of these computational measurements?

Methods & Data

Data: I use various online data sources in this dissertation and combine them with survey data and survey scale items. For RQ1, we conduct a scoping review of research that uses Twitter/X, Wikipedia, search engine data, and others. For RQ2, we utilize data from Twitter/X to study sexism, and Glassdoor, a platform where employees review their workplaces, to study workplace depression. To incorporate social theory into our computational models for RQ2, we also use survey items for sexism [3, inter alia] and workplace depression [4], as well as survey-based estimates of depression in the US for validation. For RQ3, we use data from Twitter/X, Reddit, and Gab to study sexist and hateful attitudes.

Methods: The study combines computational techniques, specifically NLP and Machine Learning (ML), with quantitative and qualitative approaches rooted in social science. For RQ1, we use a combination of literature review, case studies, and qualitative analysis to conceptualize an error framework inspired by traditional survey methodology, particularly the Total Survey Error (TSE) framework [5]. Our framework, the “Total Error Framework for Digital Traces of Human Behavior on Online Platforms” (TED-On), adapts the TSE to address idiosyncrasies of digital traces, e.g., the effect of the online platform, and helps systematically identify errors when digital trace data are used to measure social constructs.

For RQ2, we incorporate theory into computational NLP models in two ways: using survey items to guide the creation of labeled training data, which is then used to train computational models, or using them with sentence embeddings [6] in a semi-supervised approach. For RQ3, we use manual and automated data augmentation to create better NLP models, employing Large Language Models (LLMs) such as GPT-3.5 to generate synthetic data. Finally, we also devise robust evaluation approaches that specifically account for the generalizability of computational methods.
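The semi-supervised idea of comparing posts to survey items in embedding space can be sketched as follows. This is a minimal illustration with toy vectors standing in for Sentence-BERT embeddings; the helper names and the similarity threshold are assumptions for illustration, not the dissertation's actual pipeline:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def label_by_item_similarity(post_vec, item_vecs, threshold=0.5):
    """Flag a post as expressing the construct if its embedding is
    close enough to any survey-item embedding."""
    best = max(cosine(post_vec, iv) for iv in item_vecs)
    return best >= threshold, best

# Toy 3-d "embeddings" standing in for survey-item sentence embeddings.
item_vecs = [[1.0, 0.2, 0.0], [0.9, 0.4, 0.1]]
post_close = [0.95, 0.3, 0.05]   # semantically close to the items
post_far = [0.0, 0.1, 1.0]       # semantically distant

print(label_by_item_similarity(post_close, item_vecs))
print(label_by_item_similarity(post_far, item_vecs))
```

In practice the vectors would come from a sentence encoder such as Sentence-BERT [6], and the threshold would be tuned on a small labeled sample.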

Results

RQ1: Through the TED-On framework, two main error types are highlighted: (1) measurement errors due to content misalignment with constructs and (2) representation errors due to biased or incomplete data coverage.

RQ2: Survey-inspired codebooks and models help structure the analysis of digital trace data around specific and holistic theoretical dimensions. Concretely, we find that models developed on our theory-driven codebook outperform existing state-of-the-art models, achieving F1 scores 6% higher. Our workplace depression classifier was also validated against survey-based estimates and found to correlate with state-level depression scores (r = 0.4).

RQ3: Manual and automated data augmentation techniques increase computational models' robustness, especially in cross-domain applications. For example, synthetic data created to detect sexism and hate speech yielded models that generalize better across platforms, minimizing reliance on platform-specific artifacts. Both manual and LLM-generated augmented data improve the out-of-domain generalizability of computational models, with gains of 5-12% in F1 across domains, compared to scores of around 55% F1 for previous models.

Added Value. This thesis makes significant contributions to three foundational aspects of CSS:

1. Theory: The research introduces measurement theory adapted for digital traces, creating a shared vocabulary for the interdisciplinary field. The TED-On framework aids in error documentation and model validation tailored to digital data challenges.

2. Data: This work develops several datasets: a Twitter/X dataset on sexism, synthetic training data for detecting sexism and hate speech, and a dataset detailing workplace depression rates across companies and states, offering valuable resources for CSS research.

3. Methods: The research introduces theory-driven and generalizable NLP models for identifying sexism and hate speech, and semi-supervised models for analyzing workplace depression.

These contributions provide a roadmap for future CSS studies, which could further refine social construct measurement from new digital trace data sources.

References

  1. Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A. L., Brewer, D., ... & Van Alstyne, M. (2009). Computational social science. Science.

  2. Ruths, D., & Pfeffer, J. (2014). Social media for large studies of behavior. Science.

  3. Glick, P., & Fiske, S. T. (2018). The Ambivalent Sexism Inventory: Differentiating hostile and benevolent sexism. In Social Cognition.

  4. Bianchi, R., & Schonfeld, I. S. (2021). The Occupational Depression Inventory—a solution for estimating the prevalence of job-related distress. Psychiatry Research.

  5. Groves, R. M., & Lyberg, L. (2010). Total survey error: Past, present, and future. Public Opinion Quarterly.

  6. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Conference on Empirical Methods in Natural Language Processing (EMNLP).



The Power of Language: The Use, Effect, and Spread of Gender-Inclusive Language

Anica Waldendorf

Nuffield College, University of Oxford, United Kingdom

Relevance & Research Question

This dissertation investigates a new social phenomenon: the use of gender-inclusive language (GIL). Leveraging language as a strategic research site, this paper-based dissertation contributes to understanding a current development in Germany and advances the sociological understanding of how behavioural change occurs and is experienced. It creatively combines different online research methods (web scraping, text analysis, a digital field experiment, and qualitative online interviews).

GIL refers to changing person nouns to be gender-inclusive, akin to the shift from policemen to police officers in English. In German, a researcher is a male researcher (Forscher) or a female researcher (Forscherin). A gender-inclusive alternative would be Forscher*in. The convention is to use the masculine form generically, which holds a strong and empirically documented male bias. In 2020, GIL was a relatively new and highly discussed topic in Germany. Everyday observations indicated an increase in GIL use (e.g. in iOS software and Spotify), yet academic sources viewed it as a marginal phenomenon. GIL is a potential equal opportunities tool owing to extensive research documenting the ability of GIL to mitigate the male bias in language. From a sociological perspective, it provides an opportunity to retrospectively study behavioural change using large-scale data without researcher interference.

This dissertation asks:

  • Paper 1: Has GIL been as rapidly adopted as it seems? Under what conditions can difficult behavioural change take place?
  • Paper 2: Does GIL increase the number of women who apply for a male-typed task? Do women perform better at a male-typed task when GIL is used?
  • Paper 3: How has GIL use developed since its initial increase? How do individuals experience GIL?

Methods & Data

Paper 1 uses web scraping to craft two unique datasets that were then analysed using quantitative text analysis, part-of-speech tagging, and manual annotation. Using a programme written in Python, I accessed the Deutscher Referenzkorpus (DeReKo) to collate GIL occurrences in over 4 million newspaper articles published in five different media outlets between 2000 and 2021. This measured the frequency of GIL but not of the generic masculine. Therefore, I also web-scraped the five different media outlets to gather full-length newspaper articles, which I then annotated to measure the relative use of GIL. I conducted differential analyses by the political orientation of the media outlet (left, centre, right), the type of GIL (10 different types), and the author’s gender (identified using part-of-speech tagging).
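To give a flavour of how GIL occurrences might be matched at corpus scale, here is a minimal sketch using regular expressions for three common forms (gender star, colon, underscore). The pattern set is purely illustrative; it does not reproduce the dissertation's ten-type coding scheme or its manual annotation step:

```python
import re

# Illustrative patterns for three common gender-inclusive forms in German:
# gender star (Forscher*in), colon (Forscher:in), underscore (Forscher_in).
# Matches a capitalised noun stem, a separator, then "in" or "innen".
GIL_PATTERN = re.compile(r"\b[A-ZÄÖÜ][a-zäöüß]+[*:_][Ii]n(?:nen)?\b")

def count_gil(text):
    """Count tokens matched by the illustrative GIL patterns."""
    return len(GIL_PATTERN.findall(text))

sample = "Die Forscher*innen und die Forscher:in trafen die Lehrer."
print(count_gil(sample))
```

A frequency per million words, as reported in the results, would then be `count / total_words * 1_000_000`; real corpus work would also need the other GIL types (e.g. Binnen-I, paired forms) and disambiguation of the generic masculine.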

Paper 2 is co-authored with Klarita Gërxhani and Arnout van de Rijt. It is a digital field experiment that tests the effect of GIL, specifically in job applications, where previous research has demonstrated that GIL improves girls' and women's attitudes and preferences towards stereotypically masculine jobs. We use Prolific, an online crowd-working platform, as an experimental labour market, where sign-up for our advertised task was itself an outcome variable. We advertised participation in a stereotypically masculine task (solving maths problems) and varied the use of GIL in the advertisement (on Prolific) and the task description (on Qualtrics) in a 2×2 between-subjects design, which allowed us to separate a recruitment effect from a performance effect. We then measured how many women participated under each condition (two-tailed proportion test) and how well women performed when GIL was used (two-tailed independent t-tests). The experiment was pre-registered on OSF (https://osf.io/x8ft4) with a target sample size of 2,000 based on power calculations; however, the participant pool was exhausted at 1,321 participants. It was fielded in Germany and Italy.
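The recruitment outcome can be analysed with a standard two-tailed two-sample proportion z-test (pooled variance, normal approximation). Below is a self-contained sketch; the counts are invented for illustration and are not the study's data:

```python
import math

def two_prop_ztest(x1, n1, x2, n2):
    """Two-tailed two-sample proportion z-test (pooled, normal approximation).

    x1, x2: successes (e.g. women signing up); n1, n2: group sizes.
    Returns the z statistic and the two-tailed p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p-value from the standard normal CDF via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: women signing up under a GIL ad vs. a generic-masculine ad.
z, p = two_prop_ztest(140, 330, 142, 331)
print(round(z, 3), round(p, 3))
```

With shares this similar, the test returns a p-value far above 0.05, i.e. no detectable recruitment effect, mirroring the null result reported below for the actual experiment.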

As GIL is a relatively new behaviour, Paper 3 revisits the DeReKo data for the years 2022/23 and combines it with data from Google Trends and the Dow Jones Factiva database to see not only how GIL use develops after its initial increase but also whether and how it is talked about. It then builds on the quantitative findings with prospective longitudinal qualitative interviews combined with ego-centric network data collection (21 cases). The qualitative interviews were conducted online using Microsoft Teams, which was essential because it 1) allowed access to a geographically diverse group of research participants and 2) enabled participatory research: participants used the whiteboard function to draw their networks.

Results

Paper 1: In addition to observing an unexpectedly rapid increase in GIL (reaching 800 occurrences per million words, or 16.5% of potential use), two different trends are identified: whilst non-binary inclusive forms of GIL are increasingly used in the left-leaning newspaper, GIL that adheres to a binary notion of gender is favoured in the mainstream and right-leaning media. Three conditions for difficult behavioural change are identified: having a role model, the possibility of a low-threshold adoption, and incremental adoption.

Paper 2: We find no effect of GIL on either the share of women or the performance of women. In each condition the share of women was 43%, and there was no statistically significant difference in women's mean performance (the mean number of correct answers in the maths task was 3.3 with GIL and 3.5 without). This may be because GIL influences only attitudes, not behaviours. It may also be an artefact of our research design, so we are planning a follow-up experiment.

Paper 3: The data show that since 2021, the use of GIL has stalled, and the contestation of GIL has spiked. The current situation is what I refer to as incipient change: GIL has increased but has not replaced the generic masculine. Yet, it has also not waned. Turning to the interview data, I uncover a fragmented understanding of GIL (i.e., everyone knows GIL, but not everyone correctly understands what GIL is) alongside a surprisingly clear individual understanding of the circumstances in which GIL can or sometimes even should be used. Rather than a general separation of GIL users and non-GIL users (polarisation), I argue that the use of GIL seems to be strongly tied to context, particularly the professional context, thus mapping onto a coordination-based micro-level theory.

Added value

This thesis combines multiple online research methods to study different facets of the same highly relevant social phenomenon: GIL. First, it shows the strength of using web scraping and computational tools to measure macro-level patterns and of using digital tools to access the micro level, underscoring the importance of online research not just for quantitative but also for qualitative research. Second, it demonstrates how the shift of the labour market into the digital sphere enables new research designs and, with them, a new avenue of research possibilities.



 
Conference: GOR 25