GOR 26 - Annual Conference & Workshops
Annual Conference - Rheinische Hochschule Cologne, Campus Vogelsanger Straße
26 - 27 February 2026
GOR Workshops - GESIS - Leibniz-Institut für Sozialwissenschaften in Cologne
25 February 2026
Conference Agenda
Overview and details of the sessions of this conference.
Session Overview

Session 4.2: Poster Session

Presentations
Beyond Algorithms: How to Improve Manual Classification of Visual Data Obtained in Surveys
RECSM - Universitat Pompeu Fabra, Spain

Relevance & Research Question
An increasing number of studies request visual data within web surveys, arguing that such data can enhance data quantity and quality and provide new insights. However, important challenges remain. This study focuses on one of these challenges: extracting relevant information from the visual data, a process called “classification”. Researchers are increasingly relying on automated methods using machine learning to classify visual data. Although these methods are becoming more powerful, they still cannot extract information in the same way as manual classification. Thus, the main objective of this study is to explain the challenges and solutions encountered while implementing manual classification in a complex case study on remote work homestations, where approximately 70 items must be classified based on three photos.

Methods & Data
A web survey will be conducted in the opt-in online panel Netquest in Spain (N = 1,200) in early December among remote workers. After answering conventional questions about their remote work conditions, respondents will be asked to upload two photos of their homestation and one photo of their main screen or laptop model information. All photos will be reviewed by the project ethics advisor to ensure no private information is visible. Manual classification will be implemented in accordance with detailed guidelines. The two homestation photos will be classified jointly, and the device-model photo will be classified separately. Two researchers will share the homestation classification, with approximately 10% of the photos coded by both of them to compute interrater reliability (IRR) indicators and identify potential systematic biases. One researcher will code the device-model photo, but a subsample will undergo double coding to assess IRR.

Results
We expect to find differences between classifiers, identify problematic items, and detect the types of errors most likely to occur across classifiers.

Added Value
This work in progress focuses on the challenges encountered when classifying complex visual data collected in web surveys. We aim to provide practical, user-oriented guidelines that extend beyond the explanations usually found in academic papers, which often prioritize presenting results over detailing the underlying classification process.
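For the double-coded subsample described above, interrater reliability can be checked with a few lines of analysis code. The sketch below is a minimal illustration, assuming both coders' labels for a single classification item are available as parallel lists and that Cohen's kappa is used as the IRR indicator; the abstract does not name a specific statistic, and the item and labels are hypothetical.

```python
# Minimal sketch: interrater reliability on the double-coded photo subsample.
# Assumes both coders' labels for one classification item (e.g., "desk_type")
# are available as parallel lists; Cohen's kappa is one common IRR indicator.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

coder_a = ["desk", "desk", "table", "desk", "other", "table"]  # hypothetical labels
coder_b = ["desk", "table", "table", "desk", "other", "table"]

print(f"Cohen's kappa: {cohen_kappa_score(coder_a, coder_b):.2f}")

# A confusion matrix helps spot systematic disagreements between the two coders.
labels = sorted(set(coder_a) | set(coder_b))
print(confusion_matrix(coder_a, coder_b, labels=labels))
```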
AI for Survey Design: Generating and Evaluating Survey Questions with Large Language Models
1 LMU Munich; 2 Munich Center for Machine Learning; 3 University of Maryland, College Park

Relevance & Research Question
Designing high-quality survey questions is a complex task. With the rapid development of large language models (LLMs), new possibilities have emerged for supporting this process through the automated generation of survey items. Despite growing interest in LLM tools within industry, published research in this area remains sparse, and little is known about the quality and characteristics of survey items generated by LLMs or the factors influencing their performance. This work provides the first in-depth analysis of LLM-based survey item generation and systematically evaluates how different design choices affect item quality.

Methods & Data
Five LLMs, namely GPT-4o, GPT-4o-mini, GPT-oss-20B, LLaMA 3.1 8B, and LLaMA 3.1 70B, were used to generate survey items on four substantive domains: work, living conditions, national politics, and recent politics. We additionally evaluate three prompting strategies: zero-shot, role, and chain-of-thought prompting. To assess the quality of the generated survey items, we use the Survey Quality Predictor (SQP), a tool developed by survey methodologists for estimating the quality of attitudinal survey items based on codings of their formal and linguistic characteristics. To code these characteristics, we use an LLM-assisted procedure. The analysis allows us to evaluate not only overall quality but also around 60 specific survey item characteristics, offering a detailed view of how LLM-generated questions differ.

Results
The findings show striking differences in survey item characteristics across the different models and prompting techniques. The prompting technique employed is a primary factor influencing the quality of LLM-generated survey items, with chain-of-thought prompting leading to the most reliable outputs. The topics 'work' and 'national politics' yield survey items with the highest quality. Closed-source GPT models generally produce more consistent and higher-quality items than open-source LLaMA models. Among all configurations, GPT-4o-mini combined with chain-of-thought prompting achieved the best overall results.

Added Value
For the GOR community, the study offers empirical evidence on how LLMs can (and cannot) be reliably integrated into questionnaire design workflows, providing a systematic basis for evaluating emerging AI tools in survey research and informing methodological decisions in applied settings.
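As an illustration of how the three prompting strategies compared above differ in practice, the sketch below generates one survey item per strategy through the OpenAI chat API. The prompt wording, topic, and model choice are illustrative assumptions, not the authors' actual prompts or evaluation pipeline, and the SQP-based quality coding is not reproduced here.

```python
# Minimal sketch: varying prompting strategies for LLM survey-item generation.
# Prompt texts are hypothetical examples of zero-shot, role, and chain-of-thought prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOPIC = "work"

PROMPTS = {
    "zero_shot": f"Write one closed-ended survey question measuring attitudes toward {TOPIC}.",
    "role": (
        "You are an experienced survey methodologist. "
        f"Write one closed-ended survey question measuring attitudes toward {TOPIC}."
    ),
    "chain_of_thought": (
        f"Write one closed-ended survey question measuring attitudes toward {TOPIC}. "
        "First reason step by step about the concept, target population, and response scale, "
        "then output the final question text."
    ),
}

for strategy, prompt in PROMPTS.items():
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(strategy, "->", response.choices[0].message.content)
```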
Who moves and who do we lose? Mobility-Specific Attrition in Panel Surveys
GESIS, Germany

Relevance & Research Question
Residential mobility is an important source of attrition in address-based panel surveys: panelists cannot be invited to a survey because their new address is unknown. Current research into mobility-specific attrition (MSA) is lacking in three respects: (1) because of a lack of meta-reviews and ambiguous definitions of MSA, there is limited insight into the magnitude of MSA; (2) research into the selectivity of MSA and its contribution to attrition bias is sparse and almost exclusively based on panels of special populations; and (3) almost no research exists on MSA in self-administered panel surveys, even though these continue to displace traditional face-to-face interviewing. Consequently, in this study we address two research questions: 1) How many respondents in panel surveys are mobile, and how many of those attrite due to MSA? 2) Are certain subpopulations more prone to MSA than others?

Methods & Data
To answer RQ1, we conduct a meta-review by (1) systematically sampling panel surveys from three extensive data archives (ICPSR, CESSDA, and the GESIS data archive) and then (2) analyzing their field documentation to gain a deep understanding of the prevalence of mobility and MSA in panel surveys, with special regard to self-administered surveys. For RQ2, we explore subpopulations at risk of MSA by applying machine learning algorithms (classification trees) to data from all available waves of FReDA (currently three waves; N = 42,787). FReDA is a probability-based, self-administered mixed-mode panel study with biannual surveys using both web-based and paper-based questionnaires. FReDA’s primary mode of contact is postal mail. Its sample base comprises German residents aged 18 to 49 years who were recruited by drawing a sample of 108,256 individuals from the population registers of German municipalities (Bujard et al., 2025).

Results
Analyses are planned for December 2025 and January 2026. Preliminary results will thus be available in February 2026.

Added Value
We identify the magnitude of MSA in panel surveys and whether it introduces systematic biases, thereby quantifying a potential source of error for panel researchers.

“My (22m) Girlfriend (23f) Comes Home and Does Nothing” – Gendered Perceptions of Paid and Household Labor in Reddit Relationship Discussions over Time
1 TU Dortmund University, Germany; 2 University of Mannheim, Germany; 3 Carl von Ossietzky University Oldenburg, Germany; 4 Leibniz Institute for Educational Trajectories (LIfBi), Germany

Relevance & Research Question
The COVID-19 pandemic has reignited longstanding questions about gender inequalities in paid and unpaid labor. While survey research has advanced our understanding of these disparities, it typically relies on predefined categories and is susceptible to social desirability bias, especially for sensitive topics. In contrast, online postings capture intimate relationship conflicts in great depth but rarely include demographic information. We leverage discussions in Reddit relationship communities that, due to unique community rules, include both rich descriptions of relationship conflicts around (un)paid labor and demographic details (age, gender). We first assess how well Large Language Models (LLMs) classify manifest and latent content in Reddit posts. Building on the best-performing approach, we examine how men and women discuss relationship conflicts around (un)paid labor before, during, and after the pandemic.

Methods & Data
Using GPT-family LLMs on 500,000 posts from the subreddits r/relationships and r/relationship_advice, we extract manifest demographic attributes and classify whether posts discuss romantic relationships, paid labor, and unpaid labor. We systematically vary model specifications (o3, 4o, 4.1), prompting strategies (zero-shot vs. few-shot), fine-tuned vs. base models, and context window lengths, evaluating each against human annotations. Using classifications from the best-performing approach, we apply Structural Topic Models to explore how men and women discuss (un)paid labor in romantic relationships, and how these discussions evolve over time (2011-2023).

Results
Across specifications, LLMs excel at predicting manifest categories but struggle with latent sociological constructs. Even the best-performing approach, a fine-tuned GPT-4.1 with few-shot prompting and detailed category descriptions, achieves only moderate performance when classifying paid and unpaid labor. Substantively, preliminary findings indicate that women more often discuss mental health in work-related conflicts, while men more frequently emphasize career objectives.

Added Value
We apply LLM-based classifications to core sociological questions around gender inequalities. We add to the growing body of research on LLMs’ capabilities and limitations in classifying complex social-science constructs and offer new evidence on how gender disparities in paid and unpaid labor are reflected and negotiated in relationship conflicts. By combining demographic information with highly sensitive narratives, our dataset provides an empirical resource rarely available in either survey or social-media research.

A Changing Language of Sustainability? Global online discourse analysis with a deep-dive on Germany
1 Weber Shandwick, Germany; 2 Fresenius University Koeln, Media School

References available on demand.
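As an illustration of the few-shot LLM classification step described in the Reddit relationship-discussion abstract above, the sketch below labels a single post via the OpenAI chat API. The system instruction, example posts, label set, and model name are hypothetical stand-ins for the authors' annotation guidelines and model comparison.

```python
# Minimal sketch: few-shot LLM classification of whether a post discusses
# paid and/or unpaid (household) labor. Examples and labels are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT = [
    {"role": "system",
     "content": "Label each post as 'paid', 'unpaid', 'both', or 'neither' depending on "
                "whether it discusses paid work and/or household labor."},
    {"role": "user", "content": "My partner works 60-hour weeks and we never see each other."},
    {"role": "assistant", "content": "paid"},
    {"role": "user", "content": "I cook and clean every day while he plays games."},
    {"role": "assistant", "content": "unpaid"},
]

def classify(post: str, model: str = "gpt-4.1") -> str:
    """Return the predicted label for one post using the few-shot examples above."""
    response = client.chat.completions.create(
        model=model,
        messages=FEW_SHOT + [{"role": "user", "content": post}],
    )
    return response.choices[0].message.content.strip()

print(classify("She comes home from her shift and still does all the chores."))
```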