GOR 26 - Annual Conference & Workshops
Annual Conference - Rheinische Hochschule Cologne, Campus Vogelsanger Straße
26 - 27 February 2026
GOR Workshops - GESIS - Leibniz-Institut für Sozialwissenschaften in Cologne
25 February 2026
Conference Agenda
Overview and details of the sessions of this conference.
Session Overview
| Date: Wednesday, 25/Feb/2026 |
| 11:30am - 12:30pm | Begin Workshop Check-in Location: GESIS - Leibniz-Institut für Sozialwissenschaften Köln |
| 12:30pm - 3:00pm | Workshop 1 Location: GESIS, West I |
AI-Conducted Research: Hands-On Workshop on Automated Qualitative Interviews
Userflix, Germany

Target groups: Market research professionals and agencies, UX researchers and product teams, "People Who Do Research" (PWDR) in product organizations.
Is the workshop geared at an exclusively German or an international audience? International audience.
Workshop language: English.

Description of the content of the workshop: This hands-on workshop demonstrates how autonomous voice AI moderators can conduct, transcribe, and analyze qualitative interviews at scale, bridging the traditional divide between qualitative depth and quantitative reach. Participants will experience the complete workflow: (1) co-creating a research study through AI-assisted question development and visual stimulus integration; (2) observing live autonomous voice-to-voice interviews with dynamic follow-up questions and multilingual capabilities across 50+ languages; (3) exploring real-time analysis that extracts patterns across small and large sample sizes while maintaining traceability to individual voices. Through hands-on configuration of their own study, participants will critically evaluate when autonomous voice AI maintains research rigor versus when human moderators remain essential. The workshop covers end-to-end orchestration, quality validation through full transcript access, GDPR compliance, and integration with existing research workflows.

Goals of the workshop: Participants will (1) familiarize themselves with autonomous voice AI moderation as an emerging research methodology, understanding the complete workflow from study co-creation to real-time analysis; (2) develop evaluation frameworks for assessing AI-moderated interview quality using real pilot transcripts and comparison criteria; (3) apply co-intelligence principles, learning when to leverage AI autonomy versus when human researcher judgment remains essential; (4) explore current capabilities and limitations; (5) gain implementable insights from real deployments, including cost models, change management, quality validation, and GDPR compliance; (6) envision the future of customer research, from quarterly bottleneck to continuous insight engine, while maintaining methodological rigor.

Necessary prior knowledge of participants: Basic familiarity with qualitative research methods (moderated interviews, user research, or market research). No technical or AI expertise required. The workshop is designed for research professionals, moderators, product teams conducting research, and organizations exploring research scaling.

Literature that participants need to read prior to participation: None required. The workshop is designed to be self-contained, with all context provided during the session.

Recommended additional literature: Co-Intelligence (Ethan Mollick); Research that Scales (Kate Towsey).

Information about the instructor: Bruno Recht is CEO and co-founder of Userflix, which develops autonomous voice AI moderators for qualitative interviewing. He serves as guest lecturer for Human AI Interaction at Elisava University Barcelona. Previously, he worked as a designer at Porsche. Userflix's validation includes pilot deployments with the Nielsen Norman Group, IKEA, and leading market research agencies. Recent recognition: marktforschung.de Innovation Award 2025 and RWTH Spin-off Award. Bruno brings a unique combination of design thinking, AI product development, and hands-on implementation experience with research organizations, bridging academic rigor with commercial deployment insights.

Maximum number of participants: 25-30 participants (to ensure everyone can engage hands-on with the platform and receive individual attention during the practical exercises).

Will participants need to bring their own devices in order to be able to access the Internet? Will they need to bring anything else to the workshop? Yes, participants should bring their own laptop or tablet with an internet connection to access the Userflix platform during hands-on exercises. No software installation is required (browser-based). Optional: participants may bring headphones for a better audio experience during demo interviews.
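For readers unfamiliar with how such a moderator works internally, the sketch below illustrates the basic loop of scripted questions plus LLM-generated follow-ups. It is a generic illustration, not the Userflix platform: `ask_llm` and `get_participant_answer` are placeholders for an LLM backend and the voice (speech-to-text) interface.

```python
# Minimal sketch of an autonomous interview loop with dynamic follow-ups.
# `ask_llm` and `get_participant_answer` are placeholders, not a real product API.
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str   # "moderator" or "participant"
    text: str

@dataclass
class Interview:
    guide: list                                   # scripted questions from the discussion guide
    transcript: list = field(default_factory=list)

def ask_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns a probing follow-up question."""
    return "Could you tell me more about why that matters to you?"

def get_participant_answer(question: str) -> str:
    """Placeholder for the voice interface (ASR output of the spoken answer)."""
    return input(f"{question}\n> ")

def run_interview(interview: Interview, follow_ups_per_question: int = 1) -> None:
    for question in interview.guide:
        answer = get_participant_answer(question)
        interview.transcript += [Turn("moderator", question), Turn("participant", answer)]
        for _ in range(follow_ups_per_question):
            probe = ask_llm(
                "You are a qualitative interview moderator. "
                f"The participant was asked: {question!r} and answered: {answer!r}. "
                "Ask one short, neutral follow-up question."
            )
            answer = get_participant_answer(probe)
            interview.transcript += [Turn("moderator", probe), Turn("participant", answer)]

if __name__ == "__main__":
    run_interview(Interview(guide=["How do you currently plan your grocery shopping?"]))
```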
| 12:30pm - 3:00pm | Workshop 2 Location: GESIS, West II |
Vibe Coding for Online Research: From Findings to Interactive, Open, and Participatory Portals
Universität Leipzig, Germany; Universität zu Köln

Duration of the workshop: 2.5 h.
Target groups: Applied researchers, data scientists, methodologists, and research communicators who want to convert datasets, literature syntheses, or analysis outputs into interactive dashboards and tools without deep full-stack expertise.
Is the workshop geared at an exclusively German or an international audience? International audience.
Workshop language: English.

Description of the content of the workshop: This practical workshop introduces researchers to vibe coding, a rapidly emerging AI-assisted approach that allows researchers to analyze, design, and share their work through natural language interaction rather than conventional programming.

Goals of the workshop: Understand the principle behind vibe coding and its dual role in data analysis and

Necessary prior knowledge of participants: (1) No prior programming or "coding" experience is strictly necessary. (2) Familiarity with using large language models (e.g., ChatGPT, Claude) is helpful but not required. (3) A willingness to experiment with new digital tools is essential.

Information about the instructor: Ali Reza Hussani is a PhD candidate at the Institute of Communication Science and Journalism at Leipzig University. His research investigates online behaviors on social media, operating at the intersection of policy, evidence, and social impact. His background is in political science and conflict management. He completed a focused data science bootcamp at Spiced Academy Berlin to enhance his methodological toolkit. He is a passionate enthusiast for AI in academia, continuously adopting and experimenting with new AI-assisted workflows to make research more efficient and interactive.

Maximum number of participants: 20.

Will participants need to bring their own devices in order to be able to access the Internet? Will they need to bring anything else to the workshop? Yes. Participants must bring their own laptop.
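As a concrete illustration of the kind of artifact a vibe-coding session can produce, the snippet below turns a small findings table into a shareable interactive chart. The data and file name are invented for the example; in the workshop, code like this would typically be generated from a natural-language prompt rather than written by hand.

```python
# Illustrative only: the kind of small, shareable artifact a vibe-coding session
# might produce from a prompt such as
# "turn this findings table into an interactive bar chart".
# The data are invented for the example.
import pandas as pd
import plotly.express as px

findings = pd.DataFrame({
    "age_group": ["18-29", "30-44", "45-59", "60+"],
    "share_agree": [0.62, 0.55, 0.48, 0.41],
})

fig = px.bar(
    findings, x="age_group", y="share_agree",
    labels={"age_group": "Age group", "share_agree": "Share agreeing"},
    title="Agreement by age group (illustrative data)",
)
fig.write_html("findings_portal.html")  # a self-contained page that can be shared
```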
| 3:00pm - 3:30pm | Break Location: GESIS - Leibniz-Institut für Sozialwissenschaften Köln |
| 3:30pm - 6:00pm | Workshop 3 Location: GESIS, West I |
AI-Technologies for Qualitative Data Analysis
David Ranftler, Xelper, Germany; Paul Wesendonk, Xelper, Germany

Target groups: Researchers and practitioners who are interested in exploring AI technologies for qualitative data analysis.
Is the workshop geared at an exclusively German or an international audience? English-speaking.
Workshop language: English.

Description of the content of the workshop: Artificial intelligence is rapidly transforming how qualitative research is conducted, from automating analysis to novel approaches to results presentation. In this hands-on workshop, we will explore how large language models (LLMs) and embedding technologies can be used in qualitative research.

Goals of the workshop: By the end of the workshop, participants will have gained knowledge of LLMs and embeddings.

Necessary prior knowledge of participants: No prior experience with software development or AI tools is required.

Information about the instructors: Founders of xelper, a startup providing AI solutions for qualitative market research.

Participants should bring their own laptops with internet access.
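A minimal sketch of the embedding idea the workshop builds on: represent open-ended answers numerically and group similar ones. TF-IDF vectors stand in for LLM embeddings here so the example runs without an API; the answers are invented.

```python
# Minimal sketch of embedding-based grouping of open-ended answers.
# TF-IDF vectors stand in for LLM embeddings; with an embedding API you would
# replace the vectorizer with the model's embedding call.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

answers = [
    "The app keeps crashing when I upload photos.",
    "Uploading pictures fails almost every time.",
    "I love how fast customer support replied.",
    "Support answered my question within minutes.",
]

vectors = TfidfVectorizer().fit_transform(answers)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for cluster in sorted(set(labels)):
    print(f"Theme {cluster}:")
    for text, lab in zip(answers, labels):
        if lab == cluster:
            print("  -", text)
```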
| 3:30pm - 6:00pm | Workshop 4 Location: GESIS, West II |
What matters in pictures - A hands-on guide to AI-empowered analysis of visual input
Human8 Europe, Belgium

Duration of the workshop: 2.5 h.
Workshop description: Over the past six years, Human8 has been exploring the evolving needs, desires and aspirations that shape people's lives, alongside their changing expectations of brands. Each year, our trend report returns to one guiding question: what matters to people today and tomorrow?
| Date: Thursday, 26/Feb/2026 |
| 8:00am - 9:00am | Begin Check-in Location: Rheinische Hochschule, Campus Vogelsanger Straße |
| 9:00am - 10:00am | 1: Opening and Keynote 1: Dr. Katharina Schüller Location: RH, Auditorium |
More efficiency, new problems? AI in empirical research
STAT-UP Statistical Consulting & Data Science GmbH, Germany

Artificial intelligence opens up numerous new possibilities in online research. Synthetic data, automated quality assurance, intelligent recruitment strategies, and new pattern recognition tools promise unprecedented efficiency, while also raising fundamental questions: How reliable are AI-generated samples? Can algorithmic methods really reduce nonresponse? And how can we prevent models from reproducing systematic biases on a large scale? Using concrete examples, the presentation shows how AI can improve data quality, what pitfalls lurk in automated data generation, and what skills are needed to work with it responsibly. The goal: a realistic, inspiring look at the interplay between statistics, humans, and machines, practical and future-oriented.
| 10:00am - 10:15am | Break Location: RH, Lunch Hall/ Cafeteria |
| 10:15am - 11:15am | 2.1: AI and survey research Location: RH, Seminar 01 |
Talking to Results: LLM-enabled Discussions of Quantitative Survey Findings
Verian, Germany; Inspirient; Bertelsmann Stiftung

Testing the Performance and Bias of Large Language Models in Generating Synthetic Survey Data
Utrecht University, The Netherlands

Relevance & Research Question: The idea of generating so-called silicon survey samples with large language models (LLMs) has gained broad attention in both academic and market research as a promising, timely, and cost-efficient method of data collection. However, previous research has shown that LLMs are likely to reproduce limitations and biases found in their training data, including underrepresentation of certain subgroups and imbalanced topic coverage. Using survey data from a probability-based online panel in the Netherlands, we conduct a large-scale analysis examining model performance across different item types (factual, attitudinal, behavioral) and social-science-related topics, to identify when and for whom synthetic approaches perform best. We further explore strategies to mitigate potential performance limitations, including few-shot prompting of LLMs with longitudinal survey data.

Methods & Data: We compare existing survey data with LLM-generated synthetic data, using the 17th wave of the LISS Panel fielded in 2024. We selected nine survey items that cover key social-science topics, such as health, family, work, and political values. We chose the items based on their characteristics (factual, behavioral, attitudinal) as well as outcome type. We prompt the LLM agents with two different few-shot learning tasks: a sociodemographic setup, which draws on seven individual background variables, and a panel-informed setup, which additionally incorporates previous responses. We compare the generated data across five different proprietary LLMs, including GPT-4.1, Gemini 2.5 Pro, Llama 4 Maverick, DeepSeek-V3, and Mistral Medium 3.

Results: Our first results show a severe lack of accuracy in the sociodemographic few-shot setup, with an average accuracy of 0.2 and a strongly underestimated variance across items and topics. Prediction errors vary significantly across subgroups, particularly by age, showing differences not only in the magnitude but also in the direction of errors. Adding longitudinal survey responses to the few-shot input substantially improves prediction quality, yielding a 40 to 60 percentage point increase in overall accuracy, correcting the variance underestimation, and reducing subgroup disparities.

Added Value: Our research contributes to a more responsible application of silicon survey samples by providing practical guidance for researchers and survey practitioners to evaluate and improve AI-generated datasets.

Transcribing and coding voice answers obtained in web surveys: comparing three leading automatic speech recognition tools
RECSM-UPF, Spain; DZHW, Leibniz University Hannover; University of Michigan

Relevance & Research Question: With the rise of smartphone use in web surveys, voice or oral answers have become a promising methodology for collecting rich data. Voice answers present both opportunities and challenges. This study addresses two of these challenges, labor-intensive manual transcription and coding of responses, by answering the following research questions: (RQ1) How do three leading automatic speech recognition (ASR) tools, Google Cloud Speech-to-Text, OpenAI Whisper, and Vosk, perform across various dimensions? (RQ2) How similar or different are the codes of transcribed responses generated by a human and the OpenAI GPT-4o model?

Methods & Data: We used data collected in the Netquest opt-in online panel in Spain in February/March 2024. The questionnaire included over 80 questions, mainly about citizens' perceptions of nursing homes. This study focuses on one open-ended narrative question in which respondents were asked to explain why they selected a given answer in a prior closed question on the amount of information nursing homes provide to the general public. For this question, participants were initially asked to answer through voice recording. In a follow-up, respondents skipping the question were also offered the option to type in a text box. We extracted various aspects from the transcriptions and compared them across ASR tools and human vs. GPT coding. After data cleaning, 859 panellists were included in the analyses.

Results: We found that each of the ASR tools has distinct merits and limits. Google sometimes fails to provide transcriptions, Whisper produces hallucinations (false transcriptions), and Vosk has clarity issues and high rates of incorrect words. Human and LLM-based coding also differ significantly. Thus, we recommend using several ASR tools and implementing human as well as LLM-based coding, as the latter offers additional information at minimal added cost.

Added Value: This study offers valuable insights into the feasibility of using voice answers. Depending on the most critical quality dimension for each study (e.g., maximizing the number of transcriptions or achieving the highest clarity), it provides guidance on selecting the most suitable ASR tool(s) and insights into the extent to which LLMs can assist with manual tasks like answer coding.
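One common way to quantify the "incorrect words" dimension mentioned above is the word error rate (WER) between a human reference transcript and an ASR hypothesis. The sketch below is a generic implementation for illustration, not necessarily the indicator set used in the study; the transcripts are invented.

```python
# Word error rate (WER) between a human reference transcript and an ASR
# hypothesis: one common way to quantify incorrect words when comparing tools
# such as Google STT, Whisper, and Vosk (illustrative, not the study's metric).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Edit distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

reference = "the nursing home gives very little information to families"
for tool, transcript in {
    "tool_a": "the nursing home gives very little information to families",
    "tool_b": "the nursing home gifts fairly little information to family",
}.items():
    print(tool, round(wer(reference, transcript), 2))
```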
| 10:15am - 11:15am | 2.2: Paradata and metadata Location: RH, Seminar 02 |
Metadata uplift of survey data for research discovery and provenance
University College London; Scotcen; University of Surrey; University of Essex

Relevance & Research Question: The profusion of data following the introduction of CAI has created three main problems for researchers: volume, complexity, and understanding quality. The disjointed nature of many survey data collections has fragmented this information across many organisations, during which much valuable information is lost or becomes opaque to the researcher using the data at the end of the data lifecycle.

Methods & Data: Focused and well-constructed machine learning offers the possibility of automating, at scale, the transformation of available metadata resources into standardised metadata which can be made available in repositories for discovery, and of creating the detailed granular metadata that a researcher needs to evaluate the quality of complex surveys prior to data applications or access. The collaboration between CLOSER, the University of Essex, the University of Surrey, and Scotcen has been developing machine learning models, utilising the CLOSER Discovery metadata store, to improve the timeliness and accuracy of metadata extraction and to deliver high-quality metadata.

Results: Preliminary results will be presented on the successes and challenges faced in taking complex survey instruments and rendering them into DDI-Lifecycle for ingest into repository platforms, and on the opportunities for further enhancement of these metadata resources for the reuse of questions in the survey development pipeline.

Added Value: The ability to create high-quality reusable metadata across the survey specification, collection, management, and dissemination lifecycle would bring efficiencies in terms of costs, as well as improvements in quality, discoverability, and understanding of these complex data resources.

Beyond the Questionnaire: Linking Passively Metered Platform Data with Surveys for Audience Profiling
Datapods GmbH, Germany

Relevance & Research Question: The integration of large-scale passively metered data with established survey methodologies has become a central development in contemporary market and social research. In particular, digital trace data originating from major platform operators such as Google, Meta and TikTok represents a highly promising source for enhancing sociological measurement, audience segmentation and modeling. However, substantial challenges remain with respect to obtaining continuous, consent-based access to such platform data, and to linking heterogeneous data types in a methodologically robust and privacy-compliant way.

Methods & Data: Datapods has established a novel approach with its own proprietary user panel that allows survey methodologies used to define socio-economic, value-based and demographic profiles to be combined with direct copies of the personal data from big tech companies. We first established the baseline for these profiles by utilizing common survey methodologies. These measures serve as our ground truth for subsequent validation. In a second step, we linked these baseline profiles with corresponding behavioral data streams, including web-browsing histories, YouTube viewing histories and interaction logs on Instagram, Facebook and TikTok. We identified the key indicators within each data type that are most influential for a panelist's profile and joined data across types to ensure a holistic picture of the user.

Results: Early results indicate that only a relatively small subset of survey items adds substantial incremental information beyond what is already embedded in the digital trace data. Researchers can, in practice, rely on high-quality, consent-based personal data to assign users to pre-defined socio-demographic and value-based target group profiles with high accuracy, and to identify fundamental clusters and segments within the panel. The results suggest that passively collected platform data can function as a proxy for many conventional survey indicators. Final empirical results will be available by the end of 2025 and submitted subsequently.

Added Value: This passively metered, platform-data-based approach to user profiling and segmentation substantially enhances survey-centric designs and, for certain research questions, can partially or even fully substitute conventional survey data collection. It enables more granular behavioral indicators and provides a scalable solution for continuous audience measurement and sociological analysis.

Visualizing the Answering Process: Exploring Mode Differences with Respondent-Level Paradata from the IAB Establishment Panel
Institute for Employment Research, Germany; LMU Munich

Relevance & Research Question: Understanding how respondents interact with survey instruments is crucial for facilitating the response process and improving data quality. For establishments in particular, there is still a lack of insight into response behavior. By analyzing respondent-level paradata, we aim to explore the answering process in detail. We investigate how establishments navigate through the online questionnaire of the IAB Establishment Panel, focusing on differences between survey modes (CAI versus Web) and samples (panel versus refreshment). Following this, we investigate whether distinct response patterns can be identified and whether there is a need for a tailored response process, drawing on important establishment characteristics.

Methods & Data: We analyze detailed respondent-side paradata from the IAB Establishment Panel, conducted annually by the Institute for Employment Research (IAB). Since 2018, the survey has been implemented in a mixed-mode design with computer-assisted personal interviewing (CAI) and an online mode (Web) using identical software. In 2022, we collected paradata logging every click, answer, and timestamp at the second level. After creating an audit trail for each respondent, we identify appropriate paradata indicators and apply cluster analysis to identify groups of establishments with similar navigation and response behaviors.

Results: The visualization of paradata via audit trails reveals differences in navigation behavior. Some establishments follow a straightforward sequence, while others loop back or perform multiple checks before submission. Paradata indicators reveal that Web respondents take longer, take more breaks, use the tree view more, edit more answers and drop out more often than CAI respondents. Preliminary results of the clustering analysis show that we can identify two clusters for each combination of sample and mode. Cluster 1 seems to include the linear respondents, while cluster 2 includes all other respondents with more conspicuous response behavior.

Added Value: By combining visualization and clustering of establishments' response processes, we provide an empirical approach to utilizing paradata. Our results help to understand how establishments navigate and respond to a survey. At the same time, we give recommendations for survey design and evaluate mixed-mode establishment surveys.
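The clustering step described above can be pictured with a minimal sketch: standardize a few respondent-level paradata indicators and partition establishments into two groups. Indicator names and values are invented and need not match the study's actual indicators.

```python
# Sketch of clustering respondent-level paradata indicators (indicator names
# and values are invented; the study's actual indicators may differ).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# One row per establishment: total time (min), backward navigations,
# answer edits, breaks taken.
indicators = np.array([
    [22,  1,  2, 0],
    [25,  0,  1, 0],
    [61,  9, 14, 3],
    [55, 11, 10, 2],
    [24,  2,  3, 0],
    [70,  8, 12, 4],
])

X = StandardScaler().fit_transform(indicators)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)  # e.g., "linear" respondents vs. more conspicuous navigation
```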
| 10:15am - 11:15am | 2.3: Media studies Location: RH, Seminar 03 |
Deploying Online Experiments to Investigate Content Credibility in Sensor-Based Journalism
University of Cologne, Media and Technology Management, Germany

Relevance & Research Question: The emerging field of sensor-based journalism relies on data beyond human reach collected by sensors (Diakopoulos, 2019; Loebbecke & Boboschko, 2020). Research on sensor-based journalism (Boboschko & Loebbecke, 2025) studies the impact of identity cues and outlet reputation on content credibility (Sundar, 1999). Communication and media studies (Boller et al., 1990; Wathen & Burkell, 2002) analyze how testimonial-based 'argument strength' drives journalistic content credibility. Aiming to complement both research streams, we ask how argument strength influences content credibility in the context of sensor-based journalism.

Methods & Data: This study deploys a between-subjects online experiment (N = 853) followed by multi-group covariance-based structural equation modeling. Two treatment groups read an article on traffic affecting air pollution in London, one drawing evidence from sensor data, the other from testimonials. As endogenous latent variables, we measure argument strength with four items and content credibility with five items. Measurement items, the wording of all items, descriptive statistics, standardized factor loadings, squared multiple correlations for each indicator, construct-level reliability, and convergent validity are available upon request.

Results: Sensor-based journalism fosters argument strength and credibility formation; statistical details and interpretations are available upon request. Controlled online experiments allow for realistically simulating (journalistic) media consumption.

Added Value: Promoting research in sensor-based journalism in times of AI-based hallucinations, an increasingly relevant phenomenon in today's democracies.

References:
Boboschko, I., & Loebbecke, C. (2025). Identity cues influencing article credibility in sensor-based journalism. European Conference on Information Systems (ECIS), Amman, Jordan.
Boller, G., Swasy, J., & Munch, J. (1990). Conceptualizing argument quality via argument structure. Advances in Consumer Research, 17(1), 321-328.
Diakopoulos, N. (2019). Automating the news: How algorithms are rewriting the media. Harvard University Press, Cambridge, MA, US.
Loebbecke, C., & Boboschko, I. (2020). Reflecting upon sensor-based data collection to improve decision making. Journal of Decision Systems, 29(Sup1), 18-31.
Sundar, S. (1999). Exploring receivers' criteria for perception of print and online news. Journalism & Mass Communication Quarterly, 76(2), 373-386.
Wathen, C., & Burkell, J. (2002). Believe it or not: Factors influencing credibility on the web. Journal of the American Society for Information Science and Technology, 53(2), 133-144.

War, Anxiety, and Digital Behavior: How Armed Conflict Reshapes Online Media Consumption and Social Media Engagement
The Max Stern Yezreel Valley College, Emek Yezreel, Israel; University of Washington, Seattle, WA, USA; Bar-Ilan University, Ramat Gan, Israel

Generative AI in Media 2025
Annalect/OMG Solutions GmbH

Relevance & Research Question: Generative AI has emerged in recent years as a key technology shaping both consumers' everyday lives and the media and advertising industry. It competes with major search engines such as Google in information search and opens up new, efficient and profitable opportunities for advertisers to engage consumers, for example through AI-generated advertising, synthetic brand influencers or AI shopping agents.

Methods & Data: The research applies a mixed-method design comprising three modules:

Results: Preliminary findings indicate a steady increase in familiarity and regular usage. GenAI is perceived as indispensable for information gathering. While efficiency is highly valued, concerns about data privacy, misinformation and job displacement persist but do not significantly affect the rate of adoption and usage. AI-generated advertising is considered forward-looking but evokes mixed reactions: younger, tech-savvy users show higher acceptance, whereas older cohorts remain skeptical. Synthetic influencers face the strongest resistance, while AI-generated TV commercials and shopping assistants receive comparatively higher acceptance. Final results will be available in January 2026.

Added Value: The study provides empirically grounded insights for advertisers to strategically leverage the potential of generative AI. It identifies target groups open to AI-based advertising formats and highlights acceptance barriers. These findings support the development of effective communication strategies in an increasingly AI-driven media landscape.
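The core treatment contrast in the sensor-based journalism experiment above can be illustrated in simplified form. The study itself uses multi-group covariance-based SEM; the sketch below only compares a credibility index between the sensor-evidence and testimonial-evidence groups, with invented data.

```python
# Simplified stand-in for the experimental analysis: a mean comparison of a
# credibility index between the two treatment groups (data are invented; the
# paper's actual analysis is multi-group covariance-based SEM).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
credibility_sensor = rng.normal(5.4, 1.0, size=420)        # 7-point index, sensor evidence
credibility_testimonial = rng.normal(5.0, 1.0, size=433)   # 7-point index, testimonial evidence

t, p = stats.ttest_ind(credibility_sensor, credibility_testimonial)
print(f"t = {t:.2f}, p = {p:.4f}")
```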
| 10:15am - 11:15am | 2.4: Innovation in measurement instruments Location: RH, Seminar 04 |
Improving Measurement of Migration Preferences: A Choice-Based Conjoint Approach to Studying Refugee Resettlement Decisions
Bielefeld University, Germany

Relevance & Research Question: Refugees in first host countries often face adverse conditions, including limited legal rights, restricted employment access, and inadequate social services. While resettlement programs offer long-term settlement opportunities for particularly vulnerable refugees, little is known about how these individuals evaluate and prioritize different aspects of potential destination countries. Traditional survey items fall short in capturing refugees' complex decision-making processes regarding resettlement. This study addresses this gap by employing a choice-based conjoint experiment to reveal how refugees eligible for resettlement weight different factors in their settlement decisions.

The study uses data from an online survey (n = 375) conducted jointly by BAMF-FZ and the University of Bielefeld, targeting particularly vulnerable refugees eligible for resettlement procedures. Respondents complete up to four paired comparisons of hypothetical destination countries that vary across eight key characteristics: family presence, diaspora size, crime rates, language skills, attitudes toward refugees, political systems, labor market access, and lifestyle familiarity. After each choice, respondents evaluate their perceived opportunities for building a future in both countries, enabling analysis of both relative preferences and absolute assessments of settlement potential.

The conjoint approach successfully reveals how refugees trade off different aspects of potential host countries when making settlement decisions. Preliminary findings indicate that attitudes toward refugees and political systems emerge as the most influential factors in resettlement preferences. Notably, the relative importance of these factors varies considerably across refugee groups, highlighting the heterogeneity in decision-making priorities shaped by diverse experiences and backgrounds.

This study advances the methodological and substantive understanding of refugee migration preferences. By focusing on refugees in their first host countries, the study provides reliable insights into real-world decision-making under precarious conditions. Thus, it offers valuable lessons for researchers targeting vulnerable groups through choice-based conjoint experiments.

Fear in Flight: Measuring Digital Risk Perception and Emotional Responses to Aviation Safety in Romania
University of Bucharest, Romania

Industry and occupation coding: A comparison of office-based coding and a closed-list approach
University of Southampton, United Kingdom; National Centre for Social Research (NatCen), United Kingdom; Centre for Longitudinal Studies, University College London, United Kingdom

Relevance & Research Question: Social surveys often collect information about industry and occupation. The traditional 'gold-standard' approach has been to capture this information by asking open questions about job title and duties, which are later classified into standardised coding systems by expert coders. The shift towards self-completion surveys makes this more challenging, as respondents often provide insufficient information to facilitate accurate coding. In addition, office-based coding is expensive and time-consuming. Closed-list questions, where respondents choose from pre-defined categories, could reduce response burden and coding costs and potentially also remove ambiguities inherent to open-ended responses. However, a challenge is that category labels can be difficult to interpret. This paper assesses whether closed-list questions for industry and occupation can produce classifications comparable to those derived from manually coded open-text responses.

Methods & Data: We use data from two waves of the NatCen Panel survey, a UK probability-based online panel, which employs a mixed-mode design combining online self-administration with telephone interviews. In these two waves, the panel supplemented its standard industry and occupation open-text questions (which are manually coded) with closed-list questions. We compare agreement between the two methods, using both descriptive analysis and regression models.

Results: For industry, the mean agreement rate between the closed-list and 1-digit manual codes was 64%, with variation across industries. Agreement was lower for occupation, at 56% for 1-digit codes and 46% for 2-digit codes, with significant differences across occupation types. Agreement rates were also influenced by sociodemographic characteristics and the length of open-text entries, with shorter descriptions generally leading to higher agreement rates.

Added Value: This study provides a first empirical assessment of the quality of occupation and industry data collected using a closed-list approach, offering evidence on the method's potential and limitations. The findings help identify potential improvements for collecting industry and occupation data in online surveys.
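The agreement analysis reported above can be reproduced in outline as follows. The codes are invented 1-digit categories; Cohen's kappa is included as one optional chance-corrected complement to the percentage agreement the abstract reports.

```python
# Sketch of the agreement computation between closed-list and manually coded
# occupation categories (codes are invented 1-digit groups).
from sklearn.metrics import cohen_kappa_score

manual_1digit = [1, 2, 2, 3, 5, 5, 7, 8, 9, 1, 4, 4]
closed_list   = [1, 2, 3, 3, 5, 4, 7, 8, 9, 1, 4, 2]

agreement = sum(m == c for m, c in zip(manual_1digit, closed_list)) / len(manual_1digit)
kappa = cohen_kappa_score(manual_1digit, closed_list)
print(f"per-cent agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```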
| 10:15am - 11:15am | GOR Thesis Award Master Location: RH, Auditorium |
Measuring Ambivalent Sexism in Large Language Models: A Validation Study
University of Mannheim, Germany

1 Relevance & Research Question

2 Methods & Data

2.1 Inducing Individuals Using Context Data

2.2 Ambivalent Sexism Inventory
The ASI consists of 22 items, such as "Women exaggerate problems they have at work." Answers are provided on a 6-point Likert scale ranging from 0 (disagree strongly) to 5 (agree strongly). The overall ASI score of one context is computed by averaging the answer scores of all items given that context.

2.3 Data Collection
Each item is prompted individually to mitigate the effects of item order. In addition to the item, the prompt contains the context, general instructions, and the answer scale. Answer scores are extracted directly from a model's text response. Data is collected from six state-of-the-art LLMs, including Llama 3.3 70B Instruct, Mistral 7B Instruct, and Qwen 2.5 7B Instruct.

2.4 Psychometric Quality Criteria
The systematic validation is conducted by first evaluating reliability (i.e., the consistency of a test) using three criteria: (1) internal consistency (Cronbach's alpha): How consistent are responses across all items of the ASI? (2) Alternate-form reliability (Pearson correlation): How consistent are the ASI scores when the items are rephrased without changing their meaning? (3) Option-order symmetry (Pearson correlation): How consistent are the ASI scores when the order of answer options is randomly changed? If reliability is deemed acceptable based on established psychometric interpretation thresholds, validity (i.e., the extent to which a test measures what it is supposed to measure) is evaluated in a second step based on the following three types of validity: (1) Concurrent validity (Pearson correlation): Does the ASI score align with the amount of sexist language used in a downstream task (writing reference letters)? (2) Convergent validity (Pearson correlation): Does the ASI score align with the sexism score of another established sexism scale, the Modern Sexism Scale [7]? (3) Factorial validity (confirmatory factor analysis, CFA): Do the items group together in a way that makes sense based on the underlying theory? These analyses are conducted for each of the six models and two context types.

3 Results
The findings of this thesis emphasize that tests developed and validated for humans should not automatically be assumed to be valid for LLMs. This underscores the importance of conducting validation studies before interpreting psychological test scores for LLMs, which has rarely been done in the field of LLM psychometrics [3]. However, the results also raise several important questions and issues about how to conduct such validations. What constitutes an "individual" in the context of LLMs? How should a sample of "individuals" be selected? These issues highlight the need to adapt the psychometric validation approach to the LLM domain in future studies.

References:
[1] Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., & Wallach, H. (2021). Stereotyping Norwegian Salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
[2] Pellert, M., Lechner, C. M., Wagner, C., Rammstedt, B., & Strohmaier, M. (2024). AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspectives on Psychological Science, 19(5).
[3] Löhn, L., Kiehne, N., Ljapunov, A., & Balke, W.-T. (2024). Is machine psychology here? On requirements for using human psychological tests on large language models. In Proceedings of the 17th International Natural Language Generation Conference.
[4] Glick, P., & Fiske, S. T. (1997). Hostile and benevolent sexism: Measuring ambivalent sexist attitudes toward women. Psychology of Women Quarterly, 21(1).
[5] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
[6] Ge, T., Chan, X., Wang, X., Yu, D., Mi, H., & Yu, D. (2024). Scaling synthetic data creation with 1,000,000,000 personas. arXiv:2406.20094.
[7] Swim, J. K., Aikin, K. J., Hall, W. S., & Hunter, B. A. (1995). Sexism and racism: Old-fashioned and modern prejudices. Journal of Personality and Social Psychology, 68(2).
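As an illustration of the first reliability criterion, the snippet below computes Cronbach's alpha for a simulated matrix of ASI answer scores (rows are prompted "individuals", columns the 22 items on the 0-5 scale). The data are simulated for the example; the thesis's actual scores would come from the model responses described above.

```python
# Internal consistency (Cronbach's alpha) for a simulated matrix of ASI answer
# scores: respondents (contexts) in rows, the 22 items in columns, 0-5 scale.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: respondents x items matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(42)
latent = rng.normal(2.5, 1.0, size=(100, 1))                             # per-context tendency
scores = np.clip(np.rint(latent + rng.normal(0, 0.7, size=(100, 22))), 0, 5)
print(round(cronbach_alpha(scores), 2))
```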
AI for Survey Design: Generating and Evaluating Survey Questions with Large Language Models
Ludwig-Maximilians-Universität Munich, Germany

Relevance & Research Question: Designing high-quality survey questions is a complex task. With the rapid development of large language models (LLMs), new possibilities have emerged for supporting this process, particularly in the automated generation of survey items. Despite growing interest in LLM applications from industry, published research in this area remains limited, and little is known about the quality and characteristics of survey items generated by LLMs, as well as the factors influencing their performance. This work provides the first in-depth analysis of LLM-based survey item generation and systematically evaluates how different design choices, such as prompting technique, model choice, and fine-tuning, affect item quality.

Five LLMs, namely GPT-4o, GPT-4o-mini, GPT-oss-20B, LLaMA 3.1 8B, and LLaMA 3.1 70B, generated survey items for 15 concepts across four domains: work, living conditions, national politics, and recent politics. For each concept, three prompting techniques (zero-shot, role, and chain-of-thought prompting) were applied. Additionally, the best-performing model and prompting combination, namely GPT-4o-mini combined with chain-of-thought prompting, was fine-tuned on high-quality survey items to explore the effects of fine-tuning on the quality of generated items.

Results: The findings show striking differences in survey item characteristics across the different models and prompting techniques. For example, the type of response scale differed strongly by model family. Closed-source GPT models consistently generate five-category, bipolar response scales with medium correspondence between numeric and verbal labels. They rarely include a 'don't know' option and typically begin with the negative end of the response scale. LLaMA models show greater variation: they generate a wider range of response options, show greater inconsistency between numeric and verbal labels, differ in whether scales start with the positive or negative option, and the inclusion of a 'don't know' option varies by model size. The inclusion of an introduction text in the survey item depends strongly on the prompting technique used; survey items generated with chain-of-thought prompting often included an introduction text. Regarding the quality of the generated survey items, the findings show that the prompting technique employed is a primary factor influencing the quality of LLM-generated survey items. Chain-of-thought prompting leads to the most reliable outputs. Closed-source GPT models generally produce more consistent and higher-quality items than open-source LLaMA models. The open-source GPT-oss-20B model failed to complete the given task, i.e., it did not produce a usable survey item in 68% of cases. The survey topics 'work' and 'national politics' yielded higher-quality survey items than 'living conditions' and 'recent politics'. Among all configurations, GPT-4o-mini combined with chain-of-thought prompting achieved the best overall results. Fine-tuning on high-quality survey items added variety in survey item characteristics but did not lead to noticeable improvements in item quality.

Added Value: For the GOR community, the study offers empirical evidence on how LLMs can (and cannot) be reliably integrated into questionnaire design workflows, providing a systematic basis for evaluating emerging AI tools in survey research and informing methodological decisions in applied settings. In addition to highlighting the strengths and limitations of LLMs for survey item generation, the work helped to identify concrete weaknesses within the SQP-based evaluation pipeline, particularly regarding the coding of characteristics. The development of an LLM-assisted coding procedure contributes to future research in AI-supported survey design by laying the necessary groundwork for a fully automated pipeline that can code the SQP item attributes at scale.
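A chain-of-thought prompt for item generation, the best-performing configuration above, might look like the sketch below. The OpenAI client and model name are only one possible backend and are not taken from the thesis; any chat-completion API could be substituted.

```python
# Illustrative chain-of-thought style prompt for generating one survey item.
# The OpenAI client and model name are placeholders for any chat-completion
# backend, not the exact setup used in the thesis.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = (
    "You are a survey methodologist. Concept to measure: satisfaction with "
    "one's current job (domain: work). Think step by step: (1) define the "
    "concept, (2) decide on question wording, (3) choose a response scale "
    "with labelled categories, then output the final survey item only."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```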
Adaptive Code Generation for the Analysis of Donated Data with Large Language Models
University of Mannheim, Germany

Relevance & Research Question: In an increasingly digitalized world, people generate large amounts of digital trace data daily as a result of the constant recording of their information and activity. These data contain information with the potential to facilitate studies of human behavior due to their accessibility and fine-grained nature. Data donation has emerged as a promising approach to gaining access to such data from specific online platforms, such as Instagram. In a data donation study, people are invited to answer a survey and subsequently asked to request, download, and finally donate their online platform data to research. However, as raw data, these donated Data Download Packages (DDPs) might contain highly sensitive or personal information. Previous research on data donation has addressed this issue by developing privacy-preserving methods to anonymize and aggregate data directly on participants' devices. Data donation workflows typically rely on scripts designed to extract and process relevant data from these structures. Researchers who develop these extraction scripts often face the challenge that the data structure of DDPs is undocumented for most online platforms and can be subject to modifications by the entities that issue them. As such, the processing scripts developed to extract information from these DDPs face deprecation and require the researchers' manual intervention. This thesis investigates the feasibility of employing large language models (LLMs) to automatically process and extract relevant information from these structures by generating code, as an alternative to traditional, manually maintained data analysis tools and in an effort to minimize manual script development and adjustment. It also aims to make data donation research more accessible by examining this approach as a means for researchers without technical expertise to analyze and interpret structurally complex DDPs.

Methods & Data: This study evaluates six open-source LLMs across thirteen Instagram DDPs of varying size and structural complexity to examine their capabilities and limitations in this domain. The models are asked a series of data-specific queries designed to assess their abilities in interpreting the provided data package, retrieving correct information, and processing it accordingly. Two main experimental settings are implemented and compared, in which context on the supplied information, response formatting, and data structure is provided to the models in different setups. In the first setting, this information is integrated through an external knowledge base using retrieval-augmented generation (RAG), whereas in the second, it is provided directly to the models within the prompt. The outputs of the models in each setting are assessed in terms of accuracy, common error patterns, and code generation habits.

Results: Although RAG is a methodology originally developed to reduce hallucination and improve models' responses, our findings reveal that, for this task, directly providing the context in prompts yields higher accuracy. Overall, the models' performance is unsatisfactory in both settings. Their shortcomings can be attributed to the overly complex structure of the packages used in this evaluation. Ambiguous and similar naming conventions are observed throughout the Instagram DDP structure and increase the hallucination and inaccuracy of the models' outputs. When manually designed, explicit instructions on how to navigate these packages are provided in an effort to improve performance, the LLMs perform substantially better, with some achieving near-optimal results. These instructions would, however, also be subject to change in response to structural updates to the DDPs issued by the online platforms. Ironically, the very manual intervention that this research sought to reduce is still necessary to achieve greater performance with the evaluated LLMs at the current stage.

Added Value: This research presents an evaluation of LLMs for adaptive code generation in donated data analysis tasks and explores their potential as an alternative approach in this domain. It lays the groundwork for lowering the skill-based barrier that researchers without programming skills face in navigating the structural complexities and challenges inherent in data donation studies. By identifying the strengths and current limitations of LLMs in understanding and adapting to evolving data structures, this study helps set realistic expectations for their application in this field while highlighting the considerable room for future improvement.
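The "context in the prompt" setting can be pictured as follows: embed the DDP's file structure directly in the instruction that asks a model to write extraction code. The file names and query are hypothetical stand-ins for an Instagram-style DDP, not the thesis's actual packages.

```python
# Sketch of the "context in the prompt" setting: the DDP's file structure is
# embedded directly in the prompt that asks a model to write extraction code.
# The mock package below is a hypothetical Instagram-style DDP.
import json
from pathlib import Path

# Create a tiny mock DDP so the sketch runs end-to-end.
root = Path("instagram_ddp_example")
(root / "followers_and_following").mkdir(parents=True, exist_ok=True)
(root / "followers_and_following" / "following.json").write_text(
    json.dumps({"relationships_following": [{"title": "account_a"}, {"title": "account_b"}]})
)

def build_prompt(package_root: Path, question: str) -> str:
    tree = "\n".join(str(p.relative_to(package_root)) for p in package_root.rglob("*.json"))
    return (
        "You are given a data download package containing these JSON files:\n"
        f"{tree}\n\n"
        f"Write Python code that answers: {question}\n"
        "Return only code; aggregate results so that no raw personal data is exposed."
    )

print(build_prompt(root, "How many accounts does the donor follow?"))
```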
Reinforcement Learning for Optimising the Vehicle Routing Problem
University of Mannheim, Germany

Relevance & Research Question: The Travelling Salesman Problem (TSP) requires identifying the shortest route to visit a set of locations and return to the starting depot, given the locations and the distances between them. The Vehicle Routing Problem (VRP) is similar but involves multiple vehicles, with the number of customers per route limited by vehicle capacities. It was first formulated in 1959 by Dantzig and Ramser [1] and has since been extended further, for example by restricting the time window for customer visits. The VRP is NP-hard, so finding exact solutions becomes computationally intractable as problem instances grow. As a result, heuristic and metaheuristic algorithms have replaced exact calculations. More recently, routes have been found using reinforcement learning (RL) approaches, which require less task-specific expertise than heuristic selection but come with the computational cost of deep learning. With each new RL model, the creators generally compare against either heuristics or the very earliest RL models from Nazari et al. [2] and Kool et al. [3]. It is therefore often unclear to what extent new methods achieve good performance. Additionally, whilst heuristics require computation for each new instance, RL models instead have a high up-front training cost which must be justified. This thesis evaluates a range of RL methods against heuristics for the Capacitated VRP (CVRP). Specific attention is paid to complex problem variants (the time-window variant, CVRP-TW) and complex problem instances (such as those with more customers). It aims to determine whether RL methods should be used over established and theory-informed heuristics.

Methods & Data: Six RL models for the CVRP are compared, covering a range of architectures and optimisation procedures: Nazari [2], AM [3], AM PPO, POMO [4], Sym-NCO [5] and MDAM [6]. The Nazari model is unique in using a recurrent neural network, with all others based on a Transformer architecture. The Attention Model (AM) uses a graph attention model with REINFORCE. This is further adapted to use proximal policy optimisation (AM PPO), to consider multiple possible solutions concurrently (Policy Optimisation with Multiple Optima, POMO) or to exploit problem and solution symmetries by adapting the REINFORCE rewards (Sym-NCO). The Multi-Decoder Attention Model (MDAM) is a further Transformer model with multiple decoders. Additionally, a two-step method is included, using a heuristic to group the nodes and an AM TSP model to build each route. Only AM, AM PPO, Sym-NCO and MDAM are applied to the CVRP-TW. All RL models are provided with large quantities of random problem instances for training. Initial evaluation of the models uses standard CVRP and CVRP-TW benchmarks. Most of the CVRP benchmark instances are randomly generated, whilst the CVRP-TW benchmark deliberately varies the location patterns, customers per vehicle and time windows. Further systematic test instances were created for the CVRP by varying the position of the depot, the distribution and number of customers, and the maximum demand from a single customer. This variation induces varied instance difficulties. A robust baseline is provided by generating solutions with a range of heuristic and metaheuristic algorithms. Evaluation on all datasets considers how often valid solutions are returned and compares the average solution distances. Additional considerations are the number of vehicles used in a solution and the computation time.

Results: Whilst all but one of the heuristic approaches find valid solutions for all instances, multiple RL models fail for at least one problem instance. The Nazari model implementation specifically had validity issues, which excluded it from all further evaluation. Regarding solution quality, the heuristics and metaheuristics consistently provide solutions within 4% of the optimum for CVRP benchmark problems, whilst even the best RL methods are more than 10% worse. When it provides valid solutions, the AM TSP two-step model outperforms the other RL models, likely because multiple TSP training instances are contained in a single CVRP training instance. AM and POMO deliver the best RL solutions but consistently fall behind the heuristics. A similar pattern appears with the CVRP-TW problems, although Sym-NCO is instead consistently the best RL model. With the more complex problem variant, all models are prone to using more vehicles than the optimal solution, likely due to the increased difficulty of finding any valid solution. The gap in performance between the heuristics and the RL models often increases with more complex instances, e.g., with more customers. A time limit of 60 seconds per instance is pre-specified for the heuristics. For the RL methods, producing solutions for an individual instance is almost instantaneous, but training can require large amounts of time. The 4 hours 10 minutes for training and testing the AM (10-location) model is the equivalent of finding solutions to 250 instances using a heuristic method. When training on larger instances the situation is much worse, with a 358% increase in training time for POMO with 20 customers compared to 10.

Added Value: The results demonstrate the robustness of the heuristic and metaheuristic methods, such that RL approaches could only be viable where the same model will be used extensively. RL models with even longer training periods might match heuristic performance but would only justify the computation cost after thousands of uses. The surprisingly competitive performance of the two-step AM TSP approach signals the crucial role of method design. This is already apparent in the heuristics and metaheuristics but is often overlooked for RL. In this instance, breaking the problem down into smaller steps and only using RL for the more difficult component enables the model to train more efficiently.

References:
[1] Dantzig, G. B., & Ramser, J. H. (1959). The truck dispatching problem. Management Science, 6(1), 80-91.
[2] Nazari, M., Oroojlooy, A., Snyder, L., & Takac, M. (2018). Reinforcement learning for solving the vehicle routing problem. In Proceedings of the 32nd International Conference on Neural Information Processing Systems.
[3] Kool, W., van Hoof, H., & Welling, M. (2019). Attention, learn to solve routing problems! In Proceedings of the 7th International Conference on Learning Representations.
[4] Kwon, Y.-D., Choo, J., Kim, B., Yoon, I., Gwon, Y., & Min, S. (2020). POMO: Policy optimization with multiple optima for reinforcement learning. In Proceedings of the 34th Conference on Neural Information Processing Systems.
[5] Kim, M., Park, J., & Park, J. (2022). Sym-NCO: Leveraging symmetricity for neural combinatorial optimization. In Proceedings of the 36th Conference on Neural Information Processing Systems.
[6] Xin, L., Song, W., Cao, Z., & Zhang, J. (2021). Multi-decoder attention model with embedding glimpse for solving vehicle routing problems. In Proceedings of the 35th AAAI Conference on Artificial Intelligence.
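For readers who want a feel for the heuristic baselines, the sketch below implements a minimal capacitated nearest-neighbour construction heuristic. It is an illustrative toy, not one of the specific heuristics or metaheuristics evaluated in the thesis; coordinates and demands are invented.

```python
# A minimal capacitated nearest-neighbour construction heuristic: the kind of
# simple baseline RL routing models are compared against (illustrative only,
# not one of the thesis's evaluated methods). Coordinates are invented.
import math

def nearest_neighbour_cvrp(depot, customers, demands, capacity):
    """Greedily build routes; each route ends when no remaining customer fits."""
    unvisited = set(range(len(customers)))
    routes = []
    while unvisited:
        route, load, current = [], 0, depot
        while True:
            feasible = [i for i in unvisited if load + demands[i] <= capacity]
            if not feasible:
                break
            nxt = min(feasible, key=lambda i: math.dist(current, customers[i]))
            route.append(nxt)
            load += demands[nxt]
            current = customers[nxt]
            unvisited.remove(nxt)
        routes.append(route)
    return routes

def total_distance(depot, customers, routes):
    dist = 0.0
    for route in routes:
        points = [depot] + [customers[i] for i in route] + [depot]
        dist += sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return dist

depot = (0.5, 0.5)
customers = [(0.1, 0.2), (0.9, 0.8), (0.3, 0.9), (0.8, 0.1), (0.2, 0.6)]
demands = [3, 4, 2, 5, 3]
routes = nearest_neighbour_cvrp(depot, customers, demands, capacity=8)
print(routes, round(total_distance(depot, customers, routes), 3))
```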
| 11:15am - 11:30am | Break Location: RH, Lunch Hall/ Cafeteria |
| 11:30am - 12:30pm | 3.1: App-based data collection Location: RH, Seminar 01 |
Why Are People Unwilling to Participate in Smartphone App Data Collection? Results from Qualitative In-Depth Interviews
University of Mannheim, Germany

Ready, set, go! Data collection for the household budget survey with an app
Statistics Netherlands, The Netherlands

The effect of control over data collection on willingness to participate in app-based data collection
Utrecht University, Department of Methodology and Statistics
| 11:30am - 12:30pm | 3.2: Video and images in survey research Location: RH, Seminar 02 |
A picture is worth a thousand words: Factors influencing the quality of photos received through an online survey 1Centre d'Estudis Demogràfics, Spain; 2GESIS - Leibniz Institute for the Social Sciences; 3University of Mannheim Relevance & Research Question: Photos, which can be easily captured with smartphones, offer new opportunities to improve survey data quality by replacing or complementing conventional questions. However, potential advantages may vary by respondent characteristics, including age, gender, education, photo-taking and -sharing frequency, self-assessed verbal, mathematical, and spatial skills, and comfort with new technologies. This study addresses the following research question: To what extent do such individual characteristics affect the quality of photos of books at home submitted through an online mobile survey? The topic was selected because the number of books is a well-established proxy for cultural and socioeconomic capital in social sciences, yet its measurement through conventional questions is challenging. Methods & Data: Participants were asked for information on the books in their home through conventional survey questions and/or by submitting photos. This information covered the number of books, their intended audience (illiterate children, literate children and teenagers, and general audiences), language, and storage. Quality was evaluated with indicators tailored to book information, drawing on literature about data quality in survey methodology and in computer vision. The survey, conducted in 2023, used Netquest’s opt-in panel in Spain. The target population was parents of children in primary school. Of 1,270 individuals reaching the questions on books, 703 were asked for photos, and 238 provided at least one. Results: Photo quality was not systematically affected by most of the studied variables. Although older participants submitted more photos, extracting book information from their photos was less feasible, and they experienced more capture and submission issues. The findings suggest that photo-based surveys can be collected across diverse populations; however, age may be an exception in contexts like books, where finer details are required to extract information consistently. Added Value: Evidence on the quality of photos submitted through online surveys remains limited, especially when photos address a relevant social science topic, such as counting books and using that information to characterize respondents. This study contributes to the literature by showing that photos can be requested from broad audiences with comparable quality across respondents, and it offers practical guidance for researchers collecting photos. Enhancing participation in visual data collection in online surveys: Evidence from an experimental study about remote work environments RECSM - Universitat Pompeu Fabra, Spain Relevance & Research Question Collecting visual data through web surveys offers a promising way to obtain richer and more accurate information. Yet participation in image-based tasks remains low, and evidence on how to motivate respondents while maintaining data quality is limited. 
This study examines whether different strategies can help researchers achieve higher participation when requesting photos of remote work environments in web surveys: (1) offering an extra incentive specifically for sharing photos, beyond the standard survey participation incentive, (2) adding a follow-up prompt immediately after the initial photo request emphasizing the importance of sharing the photos, and (3) sending a reminder email to respondents who do not share the photos initially. Furthermore, we investigate whether the timing of the incentive announcement (either at the initial request or only in the reminder) also affects participation. Methods & Data An experiment is being conducted in the opt-in online panel Netquest in Spain (N = 1,200) among adults who have worked remotely for at least seven hours per week in the past two months. Respondents are randomly assigned to one of three groups: (1) Control – asked to upload three photos of their home workspace without extra incentive; (2) Incentive – offered 10 extra panel points for uploading all photos; and (3) IncentiveReminder – offered the same extra incentive but only in the reminder. All groups receive the follow-up prompt. Descriptive analyses will be used to assess differences in participation. Results The survey is currently being programmed. Data collection is planned for early December. Results are expected by early February 2026. We expect that extra incentives, follow-up prompts, and reminders will all increase photo submission rates, and that announcing the incentive initially will be more effective. We will test this by comparing different participation indicators, such as break-off or item nonresponse defined in different ways, across experimental groups and across both conventional and visual formats. Added Value This study provides evidence on how different strategies might help improve participation in photo requests. Findings will help improve the design of online surveys combining textual and visual data while balancing respondent burden and data richness. Video-Interviews in Mixed-Mode Panel Surveys: Selective Feasibility and Data Quality Trade-offs 1DIW Berlin, Germany; 2GESIS Leibniz Institute for the Social Sciences, Germany; 3Humboldt University of Berlin, Germany Relevance & Research Question As survey methodologists seek innovative approaches to address declining response rates and evolving technological landscapes, Computer-assisted live video interviewing (CALVI) emerges as a promising hybrid mode combining the personal interaction of Computer-assisted personal interviews (CAPI) with the convenience of remote participation offered by Computer-assisted web interviews (CAWI). This study addresses the critical question: Is CALVI a feasible and useful add-on in a German general population mixed-mode household panel survey? Understanding CALVI's viability is essential for survey researchers considering technological adaptations to maintain data quality while accommodating respondent preferences. We implemented CALVI using a randomized controlled experimental design within the 2024 Innovation Sample of the Socio-economic Panel Study (SOEP-IS): (E1) 1,261 households from the established panel (previously using CAPI) randomly assigned to CALVI versus (C1) 1,106 households continuing with traditional face-to-face interviews; and (E2) 409 households from the 2023 refreshment sample (recruited in CAWI) assigned to CALVI versus (C2) 1,513 households continuing with web self-completion. 
Both experimental groups retained fallback options to their previous data collection modes. We analyzed participation rates, technical implementation success, and data quality indicators. Among 1,670 households invited to CALVI, 376 video interviews were conducted. Technical implementation proved largely successful, though 23% of interviews experienced connectivity issues, particularly when interviewers worked remotely. Unit nonresponse among CALVI-invited households is significantly higher compared to established modes, with pronounced effects among former CAWI participants (response rate in E1: 56% vs. C1: 61%; E2: 45% vs. C2: 74%). CALVI attracted specific demographic segments: young, highly educated, high-income, full-time employed urban participants from West Germany with reliable internet access. However, CALVI participants demonstrated superior data quality metrics, including lowest speeding rates, minimal item nonresponse, and highest data linkage consent rates. |
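The incentive experiment above plans descriptive comparisons of photo-submission rates across the Control, Incentive, and IncentiveReminder groups. A minimal sketch of such a comparison in Python is shown below; the file and column names ("photo_experiment.csv", "group", "submitted_photos") are hypothetical placeholders, not part of the study.

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical fieldwork export: one row per respondent, with the assigned
# experimental group and a 0/1 flag for whether any photo was submitted.
df = pd.read_csv("photo_experiment.csv")

# Cross-tabulate group membership against photo submission
table = pd.crosstab(df["group"], df["submitted_photos"])
print(table.div(table.sum(axis=1), axis=0))  # submission rate per group

# Simple omnibus check for differences between the three groups
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")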
| 11:30am - 12:30pm | 3.3: Market and customer research Location: RH, Seminar 03 |
|
|
Experimental Evidence on How Questionnaire Structure Affects Awareness Reporting YouGov, Switzerland Relevance & Research Question We embedded a survey experiment in a client study among 996 driving license holders in Switzerland from the nationally representative YouGov panel. Respondents were randomly assigned to one of four experimental groups. We used a 2x2 design with list length (either 8 or 24 brands) and a behavioral nudge (vs. no nudge, stating that the questionnaire length and payout were independent of respondents' awareness of the specific brands) as the experimental conditions. Results From Reflection to Intuition: Integrating System 1 and System 2 Measures in Behavioural Campaign Evaluation YouGov, Switzerland Relevance & Research Question: Behavioural campaigns – communication initiatives aimed at changing behaviours – play a vital role in both social and market research. Their evaluation requires a holistic approach integrating behavioural, cognitive, and emotional indicators. This is particularly challenging when topics are sensitive to social desirability or strategic responding, such as traffic safety or pricing research. Traditional explicit self-reports (System 2) provide valuable insights into reflective attitudes/intentions but are prone to response bias and often overlook intuitive or automatic aspects of behaviour. Implicit methods complement them by capturing unconscious processes (System 1). This presentation examines how explicit and implicit methods can be combined to obtain a more comprehensive picture of behavioural campaign effects. We explore this in the context of a Swiss road-safety campaign addressing cyclists’ “illusion of control” – the tendency to overestimate control in traffic situations and underestimate associated risks. Methods & Data: The evaluation follows a quasi-experimental repeated cross-sectional design combining explicit and implicit measures. Four online panel surveys are conducted between 2025 and 2027 in Switzerland (intervention) and Germany (control), each with 1,000 regular bicycle or e-bike users. Three outcome categories are assessed: knowledge, attitudes/intention, and behaviour. Explicit measures capture self-reported knowledge, perceived risk, and behavioural intentions. Implicit measures employ a reaction-time-based Single Association Test (SAT) with scenario vignettes covering four risk-prone situations: cycling without a helmet, mobile phone usage while cycling, running a red light, and riding under the influence of alcohol. These vignettes assess implicit risk perception and control illusions. The design enables pre/post comparisons between Switzerland and Germany to infer potential causal effects of the campaign over time. Results: Preliminary findings show that explicit and implicit measures can be integrated within one online survey. Scenario-based SATs complement self-reports by revealing intuitive risk perceptions and control beliefs. Data from two waves (pre- and first post-campaign) provide first insights into early campaign effects. Added Value: The study demonstrates a scalable framework for embedding implicit measures in large-scale online surveys. Combining scenario vignettes with reaction-time methods offers a practical blueprint for evaluating sensitive behavioural topics with higher ecological validity and depth. 
Measuring the beauty of products: The Product Aesthetics Inventory (PAI) 1University of Wuppertal, Germany; 2University of Münster, Germany; 3BSH Hausgeräte GmbH, Germany; 4Provinzial Versicherung, Germany Relevance & Research Question: Aesthetics is quickly processed, has a multitude of consequences for attitudes, recommendations, and purchase behavior, and is thus paramount for the success of products. However, differentiated and validated evaluation tools are lacking, particularly for interactive products. In order to close this gap, we aimed to develop and validate the Product Aesthetics Inventory (PAI) and a short version (PAI-S). Methods & Data: Items were developed in a pre-study (N=6 design experts, N=4 product users). The resulting item pool with 54 questions was tested in an online survey on different types of household appliances (alongside several validation measures; N=6,002). In Study 1, data from n=3,000 of these participants were used to determine the number of factors using exploratory graph analysis. In Study 2, the resulting structure was validated in a confirmatory factor analysis with the remaining n=3,002 participants. A third web-based study (N=1,028) was conducted to further determine construct validity and generalizability across different products such as IT products, home entertainment devices, and power tools. Results: The final PAI consists of 34 questions covering eight dimensions: visual aesthetics, operating elements, brand logo, feedback sounds, operating noises, haptics, interaction aesthetics, and impression. A higher-order factor of product aesthetics can be determined both with the full PAI and with the 8-item short version PAI-S. Both versions demonstrate excellent reliability, as estimated by Cronbach’s alpha. The decrease in reliability from the full scale (.97) to the short scale (.89) is acceptable. Further, all eight subscales of the PAI achieved good to excellent reliability (range: .85-.92). Validity was confirmed by corresponding correlations with other established scales, intention measures, and overall judgments. Further, Study 3 confirms and extends these results with respect to reliability and validity. Finally, in all three studies, strict measurement invariance was achieved across different product classes for the majority of subscales. Added Value: The PAI meaningfully supplements the measurement spectrum of product evaluation beyond classic usability and brand perception. Questionnaire templates, evaluation guides and interpretation aids are freely available online (https://doi.org/10.5281/zenodo.6478042). Further, we will complement this by discussing additional response-scaling options and benchmarks for the PAI. |
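The PAI abstract above reports Cronbach's alpha for the full scale, the short scale, and the eight subscales. A minimal sketch of such a reliability check in Python is shown below, assuming hypothetical item-level data; the item and subscale names are placeholders, not the published PAI wording.

import pandas as pd
import pingouin as pg

# Hypothetical item-level responses, one row per participant
df = pd.read_csv("pai_responses.csv")

subscales = {
    "visual_aesthetics": ["va1", "va2", "va3", "va4"],
    "haptics": ["ha1", "ha2", "ha3", "ha4"],
    # ... remaining subscales would follow the same pattern
}

for name, items in subscales.items():
    alpha, ci = pg.cronbach_alpha(data=df[items])
    print(f"{name}: alpha = {alpha:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")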
| 11:30am - 12:30pm | 3.4: Online research on youth and mental health Location: RH, Seminar 04 |
|
|
Ensuring the Voices of Young People Are Heard: An Innovative Application of Respondent-Driven Sampling with Probability-Based Seeds in the NatCen Panel 1University of Southampton, United Kingdom; 2GESIS – Leibniz Institute for the Social Sciences; 3National Centre for Social Research Relevance & Research Question Certain population sub-groups are consistently under-represented or excluded from survey samples, or they may appear in insufficient frequencies. In the UK, where a named sample frame is often unavailable, current strategies for boosting samples can be prohibitively costly or impractical, particularly in self-completion survey formats. As survey research in the UK strives to become more inclusive, it is essential to explore effective methods for incorporating and enhancing the representation of these under-represented subgroups. We employed respondent-driven sampling (RDS) to recruit young adults aged 18 to 24 from the UK, using the NatCen online probability-based panel. Participants completed a brief online questionnaire and could recruit up to five peers using unique digital codes, with incentives for participation and successful recruitment. This process spanned multiple waves, aiming for a sample of about 1,500 participants and allowing for an assessment of recruitment cooperation and network characteristics in a self-completion survey format. Results We examined the patterns and predictors of recruitment cooperation in RDS using probability-based seeds. The study also assessed the extent of non-response at various recruitment stages and explored how socio-demographic and network factors influence recruitment success. Intra-chain correlation revealed a strong clustering of respondents by ethnicity, while other characteristics, such as gender, had a less pronounced impact on recruitment correlation. Additionally, the study evaluated the effect of seed composition on recruitment outcomes and considered the implications of violating key RDS assumptions for the quality of the inferences made. Notably, incorporating seeds from outside the target population led to the recruitment of a significant proportion of young adults within the sample. Added Value This study aims to enhance the evidence base for using RDS as a survey data collection method, both in the UK and internationally, by integrating empirical findings with methodological diagnostics. It also introduces a novel approach designed to amplify the voices of young people, a demographic that remains under-represented in social surveys. This perspective has the potential to make surveys more inclusive and methodologically robust. Youth Loneliness Epidemic: Real Trend or Survey Artifact? 1European Commission - Joint Research Centre; 2Vrije Universiteit Brussel, Belgium Relevance & Research Question Major health organizations have raised alarms about rising youth loneliness (WHO, 2025), with reports suggesting young people experience the highest loneliness rates and saw the sharpest increases between 2018 and 2022 (OECD, 2024). However, reported prevalence varies dramatically across surveys—from 24% to 50.5% for young adults in different European studies—raising a critical question: are we witnessing a genuine loneliness epidemic or methodological artifacts? While genuine increases are plausible given documented declines in face-to-face social contact (OECD, 2025), the magnitude may be partially explained by methodological shifts from traditional face-to-face to web-based data collection. 
This study examines how survey mode influences loneliness measurement, with a particular focus on systematic variation by age, which could distort our understanding of youth loneliness trends. Methods & Data This study quantifies data collection mode effects by comparing loneliness reports across web-based, face-to-face, and telephone collection modes in the Survey on Income and Living Conditions. The analysis focuses on identifying age-specific patterns in how survey mode affects loneliness measurement, examining whether the magnitude of mode effects differs between younger and older respondents. To address potential limitations due to mode self-selection, we use a variety of techniques, including controlling for respondents’ health status and limiting the analysis to a subsample of countries. Results from other surveys are included for robustness (SOEP, Germany). Our research additionally addresses the role of stigma (national aggregates by age group from the EU Loneliness Survey) to explain observed patterns. Results The findings reveal striking age-specific mode effects: young adults report significantly more loneliness in online surveys compared to face-to-face interviews, while older adults exhibit the opposite pattern, reporting higher loneliness levels in face-to-face settings. These divergent patterns suggest that stigma associated with loneliness and social desirability bias operate differently across the lifespan. Digital Mental Health in Wartime: Age, Gender, and Socioeconomic Predictors of AI Therapy Acceptance 1Technion – Israel Institute of Technology, Haifa, Israel; 2The Max Stern Yezreel Valley College, Emek Yizrael, Israel; 3Bar-Ilan University, Ramat Gan, Israel Relevance & Research Question |
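The loneliness study above tests whether mode effects differ by age. A minimal sketch of one way to formalize this as a mode-by-age interaction in a logistic regression is shown below; the variable names are hypothetical and the EU-SILC microdata are not included here.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical extract: one row per respondent with a binary loneliness
# indicator, survey mode, age group, and controls.
df = pd.read_csv("silc_loneliness.csv")

model = smf.logit(
    "lonely ~ C(mode) * C(age_group) + health_status + C(country)",
    data=df,
).fit()
print(model.summary())
# The mode-by-age interaction terms indicate whether the web vs. face-to-face
# gap in reported loneliness is larger for younger than for older respondents.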
| 11:30am - 12:30pm | GOR Thesis Award PhD Location: RH, Auditorium |
|
|
Who Counts? Survey Data Quality in the Age of AI 1LMU Munich, Germany; 2Munich Center for Machine Learning Relevance & Research Question Large language models (LLMs) are hoped to make survey research more efficient while also improving survey data quality. However, as they are based on Internet data, LLMs may come with similar pitfalls as other digital data sources with regard to making inferences about human attitudes and behavior. As such, they not only have the potential to mitigate, but also to amplify existing biases. In my dissertation, I investigate whether and under which conditions LLMs can be leveraged in survey research by providing empirical evidence of the potentials and limits of their applications. Potential applications of LLMs span the entire survey life cycle – before, during, and after data collection – where an LLM could act as a research assistant, interviewer, or respondent. Potential challenges for data quality stem from LLMs’ training data, alignment processes, and model architecture, as well as the research design. I focus on two major applications of LLMs covering both representational and measurement challenges and test these applications in challenging, previously unexamined contexts. Methods & Data, Results Two studies address the most prominent discussion regarding LLM-based survey research: the use of LLM-generated “synthetic samples”. Coverage bias in LLM training data and alignment processes might affect the applicability of this approach. In one study, I test to what extent LLMs can estimate vote choice in Germany. To generate a synthetic sample of eligible voters in Germany, I create small profiles matching the individual characteristics of the 2017 German Longitudinal Election Study respondents. These “personas” include socio-demographic and attitudinal information known to be associated with voting behavior. Prompting GPT-3.5 with each persona in German, I ask the LLM to predict each respondent’s vote choice in the 2017 German federal elections and compare these predictions to the survey-based estimates on the aggregate and subgroup levels. I find that GPT-3.5 does not predict citizens’ vote choice accurately, exhibiting a bias towards the Green and Left parties, and making better predictions for more “typical” voter subgroups. While the LLM is able to capture broad partisan tendencies, it tends to miss out on the multifaceted factors that sway individual voters. As a consequence, not only are LLM-synthetic samples not helpful for estimating how groups likely to swing an election, such as non-partisans, will vote, they also risk underestimating the popularity of parties without a strong partisan base. Such samples thus provide little added value over survey-based estimates. Furthermore, the results suggest that GPT-3.5 might not be reliable for estimating nuanced, subgroup-specific political attitudes. In a second study, I extend the previous study to the entire European Union (EU), this time focusing on an outcome unobserved at the time of data collection: the results of the 2024 European Parliament elections. I create personas of 26,000 eligible voters in all 27 EU member states based on the Eurobarometer and compare the proprietary LLM GPT-4-Turbo with the open-source LLMs Llama-3.1 and Mistral. A week before the European elections in June 2024, I prompted the LLMs with the personas in English and asked them to predict each person’s voting behavior, once based only on socio-demographic information and once also featuring attitudinal variables. 
To investigate differences in LLMs’ bias across languages, I selected six diverse EU member states for which I prompted the LLMs in the respective country’s native language. After the elections’ conclusion, I compare the aggregate predicted party vote shares to the official national-level results for each country. LLM-based predictions of future voting behavior largely fail – they overestimate turnout and are unable to accurately predict party popularity. Only providing socio-demographic information about individual voters further worsens the results. Finally, LLMs are especially bad at predicting voting behavior for Eastern European countries and countries with Slavic native languages, suggesting systematic contextual biases. These findings emphasize the limited applicability of LLM-synthetic samples to public opinion prediction across contexts. Without further adaptation through, e.g., fine-tuning, LLMs appear infeasible for public opinion prediction not just in terms of accuracy, but also in terms of efficiency, highlighting a trade-off between the recency and level of detail of available survey data for synthetic samples. In the third study, I investigate the usability of LLMs for classifying open-ended survey responses. Due to their linguistic capacities, it is likely that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models, but it is unclear how the sparse, sometimes competing, existing findings generalize and how the quality of such classifications compares to established methods. I therefore test to what extent different LLMs can be used to code German open-ended survey responses on survey motivation from the GESIS Panel.pop Population Sample. I prompt GPT-4-Turbo, Llama-3.2, and Mistral NeMo in German with a predefined coding scheme and instruct them to classify each survey response, comparing zero- and few-shot prompting and fine-tuning. I evaluate the LLMs’ performance by contrasting their classifications with those made by human coders. Only fine-tuning achieves satisfactory levels of predictive accuracy. Performance differences between prompting approaches are conditional on the LLM used, as overall performance differs greatly between LLMs: GPT performs best in terms of accuracy, and few-shot prompting leads to the best performance. Disregarding fine-tuning, the prompting approach is not as important when using GPT, but makes a big difference for other LLMs. Further, the LLMs struggle especially with non-substantive catch-all categories, resulting in different distributions. The need for LLMs to be fine-tuned for this task implies that they are not the resource-efficient, easily accessible alternative researchers may have hoped for. I discuss my findings in the light of the ongoing developments in the rapidly evolving LLM research landscape and point to avenues for future work. Added Value This dissertation makes both methodological and applied contributions to survey research. It discusses types and sources of bias in LLM-based survey research and empirically tests their prevalence from multiple comparative angles. It also showcases concrete applications of LLMs across several steps in the research process and several substantive topics, explaining their practical implementation and highlighting their potentials and pitfalls. 
Overall, this dissertation thus provides guidance on ensuring data quality when using LLMs in survey methodology, and contributes to the larger discourse about LLMs as a social science research tool. Correcting Selection Bias in Nonprobability Samples by Pseudo Weighting Utrecht University, Netherlands, The Relevance & Research Question Statistics are often estimated from a sample rather than from the entire population. If the inclusion probabilities are unknown to the researcher (that is, in a nonprobability sample), naively treating the sample as a simple random sample may result in selection bias. Attention to correcting selection bias is increasing due to the availability of new data sources, for example, online opt-in surveys and data donation. These data are often easy to collect and may be so-called "Big Data" considering the large inclusion fraction of the population. This dissertation consists of four scientific papers, two of which are published in influential journals while the other two are under review. In the first paper, a novel framework for correcting selection bias in nonprobability samples is proposed. This is followed by a discussion of three practical challenges and possible solutions to them. Methods & Data In the framework paper, the general idea is to construct a set of unit weights for the nonprobability sample by borrowing strength from a reference probability sample. If a proper set of weights is constructed, design-based estimators can be used for population parameter estimation given the weights. To evaluate the uncertainty of the estimated population parameter, a pseudo population bootstrap procedure is proposed, given different relations between the nonprobability sample and the probability sample. Three practical challenges for pseudo-weighting are also discussed, namely model selection for bias correction, imbalanced samples, and small area estimation. Simulation studies and applications to real data are presented in each paper to illustrate the use of the proposed framework and possible solutions to the practical challenges. Results The proposed framework is flexible, and many kinds of probability estimation models can be used. The question is raised about how to select a proper model given the population parameter in question. After a series of performance measures were tested, we found that modeling the target variable when evaluating the performance of weights may be useful. The second challenge comes from the large size of the nonprobability sample. Since we often have a large nonprobability sample accompanied by a small probability sample, we end up with an imbalanced combined sample, which can cause problems when estimating model parameters. Several remedies for imbalanced samples are discussed, and the proposed framework is also adjusted accordingly. The results show that SMOTE is a promising technique for dealing with imbalanced samples. Finally, we look at the scenario where not only population-level estimates are of interest, but also subpopulation estimates. Several approaches to combine pseudo weights with small area estimation are discussed. Of all approaches, we found that combining a hierarchical Bayesian model with weights is a relatively stable estimation approach. If both population-level and area-level estimates are of interest, aligning the weighted estimates with estimated marginal totals may be a better option. 
Added Value This research provides practical suggestions on how to deal with possible selection bias in nonprobability samples, an issue of growing importance in the age of digitalization and low response rates. |
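The pseudo-weighting abstract above describes constructing unit weights for a nonprobability sample by borrowing strength from a reference probability sample. The sketch below illustrates one common propensity-based variant of that idea, not necessarily the dissertation's exact estimator; the input files and covariates are hypothetical, and the reference sample's design weights are omitted for brevity although they should enter the estimation in practice.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: an opt-in (nonprobability) sample and a reference
# probability sample sharing a set of covariates.
np_sample = pd.read_csv("nonprob_sample.csv")
ref_sample = pd.read_csv("reference_sample.csv")
covariates = ["age", "female", "education", "urban"]

# Stack both samples and model the propensity of being a nonprobability unit.
combined = pd.concat([np_sample.assign(in_np=1), ref_sample.assign(in_np=0)])
propensity = LogisticRegression(max_iter=1000).fit(
    combined[covariates], combined["in_np"]
)
p = propensity.predict_proba(np_sample[covariates])[:, 1]

# Inverse-odds pseudo weights and a weighted estimate of a target variable y
np_sample["pseudo_weight"] = (1 - p) / p
print(np.average(np_sample["y"], weights=np_sample["pseudo_weight"]))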
| 12:30pm - 1:30pm | Lunch Break Location: RH, Lunch Hall/ Cafeteria |
| 1:30pm - 2:30pm | 4.1: Poster Session Location: RH, Auditorium |
|
|
Beyond the First Questionnaire: Retaining Participants in an App-based Household Budget Survey University of Mannheim, Germany Relevance & Research Question Survey attrition is a common problem in household budget surveys (HBS) as such surveys impose a high burden on participants, asking them to report their expenses daily for a specific period. We study attrition in an app-based HBS with three preceding questionnaires and a 14-day diary. Participants can drop out at each of the questionnaires or during the diary. If respondents drop out, the data obtained from them contains missing information. The research questions are: 1. At what stage do participants drop out of an app-based HBS? 2. Does the way the study is presented in the invitation letter influence the drop-out? 3. What individual characteristics correlate with drop-out at the different stages? Methods & Data In 2024, we drew a probability sample of 7,049 individuals from the register of residents and invited them by mail. We included an experimental variation in the invitation letters to analyse whether stressing the effort associated with participation and whether mentioning a receipt-scanning function influence participation. The survey consisted of three initial questionnaires regarding personal information, income and expenses, and information about the household. After having completed them, respondents were asked to continue with an expense diary. To be eligible for an incentive, participants had to manually enter their expenses or scan their receipts on at least seven days. Results Participants drop out of the study continuously, even after they have started entering data in the diary. The poster will present results on attrition and drop-out across the stages, and whether the three experimental groups differ significantly in their attrition rates across the stages. Additionally, the poster shows whether people with specific characteristics are more likely to drop out at a particular stage. Added Value This research is part of the Smart Survey Implementation (SSI) project, funded by EUROSTAT, which aims to enhance data collection for official statistics across Europe through digital innovation. This experiment addresses attrition in app-based HBS. It informs decisions about what actions can be taken to reduce attrition. Joint Evaluation of LLM and Human Annotations with MultiTrait–MultiError Models 1University of Mannheim; 2University of Manchester; 3GESIS–Leibniz Institute for the Social Sciences; 4CSH Vienna Relevance & Research Question We propose applying MultiTrait–MultiError (MTME) models to jointly analyze imperfect responses from LLMs and humans. Our goal is to assess the extent to which MTME models can quantify measurement quality in responses from LLMs and humans, and whether they can improve estimates of underlying traits. Methods & Data In an initial study, we fit an MTME model on LLM-generated annotations of publication year and readability for 600 text excerpts from the CommonLit Ease of Readability (CLEAR) corpus. We evaluate four Qwen 3 LLMs (4B–32B). We include ground-truth data from the CLEAR corpus in our evaluation, purely to demonstrate the feasibility of the MTME approach. 
Results Added Value Lessons learned: Utilizing Social Media Influencers for Targeted Recruitment on Discrimination in the German Healthcare System 1German Centre for Integration and Migration Research (DeZIM), Germany; 2Bielefeld University, Germany; 3Humboldt-University, Germany Relevance & Research Question Classifying Moral Reasoning in Political Discourse: Demonstrating Interrater Reliability and Testing an AI-Based Classification Approach Freie Universität Berlin, Germany Relevance & Research Question Moral reasoning, whether people justify political positions through rules and duties (deontological reasoning) or through expected outcomes (consequentialist reasoning), is central to understanding how people deliberate in polarised political (online) debates. While experimental moral-dilemma research shows differences in rule-based vs outcome-based judgments depending on political ideology, it remains unclear whether such patterns manifest in real-world political communication. Capturing moral reasoning in naturalistic discourse could deepen our understanding of how polarization emerges and provide a foundation for designing interventions that adapt to people’s reasoning styles. This study asks: Can moral reasoning in political discourse be reliably classified using NLP methods, and do we find evidence supporting findings from experimental moral dilemma research? Methods & Data This study presents a validation of an approach to classify moral reasoning in political discourse using a large language model (LLM). A corpus of 576 sentences from Reddit discussions and German parliamentary speeches was pre-sampled using a novel extension of the Distributed Dictionary Representations (DDR) method (Garten et al., 2018), which identifies sentences with high cosine similarity to exemplary reasoning styles. Two expert raters then independently coded each sentence as deontological, consequentialist, or neutral and adapted the coding manual based on deliberation after each round. Results Dynamic Surveys for Dynamic Life Courses: Development of a Web-App for Self-Administered Life History Data Collection 1Leibniz Institute for Educational Trajectories (LIfBi), Germany; 2German Centre for Higher Education Research and Science Studies (DZHW), Germany Relevance & Research Question Methods & Data Results Added Value |
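The moral-reasoning study in this poster session reports interrater reliability between two expert coders. A minimal sketch of such a check with Cohen's kappa is shown below; the file and column names are hypothetical placeholders.

import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical file: one row per sentence with both raters' labels
# (deontological, consequentialist, or neutral).
codes = pd.read_csv("moral_reasoning_codes.csv")

agreement = (codes["rater_1"] == codes["rater_2"]).mean()
kappa = cohen_kappa_score(codes["rater_1"], codes["rater_2"])
print(f"percent agreement = {agreement:.2%}, Cohen's kappa = {kappa:.2f}")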
| 1:30pm - 2:30pm | 4.2: Poster Session Location: RH, Auditorium |
|
|
Beyond Algorithms: How to Improve Manual Classification of Visual Data Obtained in Surveys RECSM- Universitat Pompeu Fabra, Spain Relevance & Research Question An increasing number of studies are requesting visual data within web surveys, arguing that they can enhance data quantity and quality and provide new insights. However, important challenges remain. This study focuses on one of these aspects: extracting relevant information from the visual data, a process called “classification”. Researchers are increasingly relying on automated methods using machine learning to classify visual data. Although these methods are becoming more powerful, they still cannot extract information in the same way as manual classification. Thus, the main objective of this study is to explain the challenges and solutions encountered while implementing manual classification in a complex case study on remote work homestations, where approximately 70 items must be classified based on three photos. A web survey will be conducted in the opt-in online panel Netquest in Spain (N = 1,200) in early December among remote workers. After answering conventional questions about their remote work conditions, respondents will be asked to upload two photos of their homestation and one of their main screen or laptop model information. All photos will be reviewed by the project ethics advisor to ensure no private information is visible. Manual classification will be implemented in accordance with detailed guidelines. The two homestation photos will be classified jointly, and the device-model photo will be classified separately. Two researchers will share homestation classification, with approximately 10% of the photos coded by both of them to compute interrater reliability (IRR) indicators and identify potential systematic biases. One researcher will code the device-model photo, but a subsample will undergo double coding to assess IRR. Results We expect to find differences between classifiers, identify problematic items, and detect the types of errors most likely to occur across classifiers. Added Value This work in progress focuses on discussing the challenges encountered when dealing with the classification of complex visual data collected in web surveys. We aim to provide practical, user-oriented guidelines that extend beyond the explanations usually found in academic papers, which often prioritize presenting results over detailing the underlying classification process. AI for Survey Design: Generating and Evaluating Survey Questions with Large Language Models 1LMU Munich; 2Munich Center for Machine Learning; 3University of Maryland, College Park Relevance & Research Question: Designing high-quality survey questions is a complex task. With the rapid development of large language models (LLMs), new possibilities have emerged for supporting this process through the automated generation of survey items. Despite growing interest in LLM tools within industry, published research in this area remains sparse, and little is known about the quality and characteristics of survey items generated by LLMs or the factors influencing their performance. This work provides the first in-depth analysis of LLM-based survey item generation and systematically evaluates how different design choices affect item quality. Methods & Data: Five LLMs, namely GPT-4o, GPT-4o-mini, GPT-oss-20B, LLaMA 3.1 8B, and LLaMA 3.1 70B, were used to generate survey items on four substantive domains: work, living conditions, national politics, and recent politics. 
We additionally evaluate three prompting strategies: zero-shot, role, and chain-of-thought prompting. To assess the quality of the generated survey items, we use the Survey Quality Predictor (SQP), a tool developed by survey methodologists for estimating the quality of attitudinal survey items based on codings of their formal and linguistic characteristics. To code these characteristics, we used an LLM-assisted procedure. The analysis allows us to evaluate not only overall quality but also around 60 specific survey item characteristics, offering a detailed view of how LLM-generated questions differ. Results: The findings show striking differences in survey item characteristics across the different models and prompting techniques. The results also show that the prompting technique employed is a primary factor influencing the quality of LLM-generated survey items. Chain-of-thought prompting leads to the most reliable outputs. The topics 'work' and 'national politics' yield survey items with the highest quality. Closed-source GPT models generally produce more consistent and higher-quality items than open-source LLaMA models. Among all configurations, GPT-4o-mini combined with chain-of-thought prompting achieved the best overall results. Added Value: For the GOR community, the study offers empirical evidence on how LLMs can (and cannot) be reliably integrated into questionnaire design workflows, providing a systematic basis for evaluating emerging AI tools in survey research and informing methodological decisions in applied settings. Who moves and who do we lose? Mobility-Specific Attrition in Panel Surveys. GESIS, Germany Relevance & Research Question Residential mobility is an important source of attrition in address-based panel surveys. That is, panelists cannot be invited to a survey because their new address is unknown. Current research into mobility-specific attrition (MSA) is lacking in three aspects: (1) because of a lack of meta-reviews and ambiguous definitions for MSA, there is limited insight into the magnitude of MSA, (2) research into the selectivity of MSA and its contribution to attrition bias is sparse and almost exclusively based on panels of special populations, and (3) almost no research on MSA for self-administered panel surveys exists, which continue to displace traditional face-to-face interviewing. Consequently, in this study, we address two research questions: 1) How many respondents in panel surveys are mobile and how many of those attrite due to MSA? 2) Are certain subpopulations more prone to MSA than others? Methods & Data To answer RQ1 we conduct a meta-review by (1) systematically sampling panel surveys from three extensive data archives (ICPSR, CESSDA, and the GESIS data archive) and then (2) analyzing their field documentation to gain a deep understanding of the prevalence of mobility and MSA in panel surveys with special regard to self-administered surveys. For RQ2 we explore subpopulations at risk of MSA by employing machine learning algorithms (classification trees) on data from all available waves of FReDA (currently three waves; N = 42,787). FReDA is a probability-based self-administered mixed-mode panel study with biannual surveys using both web-based and paper-based questionnaires. FReDA’s primary mode of contact is postal mail. Its sample base is German residents aged 18 to 49 years who were recruited by drawing a sample of 108,256 individuals from population registers of German municipalities (Bujard et al., 2025). Results Analyses are planned for December 2025 and January 2026. 
Preliminary results will thus be available in February 2026. Added Value We identify the magnitude of MSA in panel surveys and whether it introduces systematic biases, thereby quantifying a potential source of error for panel researchers. “My (22m) Girlfriend (23f) Comes Home and Does Nothing” – Gendered Perceptions of Paid and Household Labor in Reddit Relationship Discussions over Time 1TU Dortmund University, Germany; 2University of Mannheim, Germany; 3Carl von Ossietzky University Oldenburg, Germany; 4Leibniz Institute for Educational Trajectories (LIfBi), Germany Relevance & Research Question The COVID-19 pandemic has reignited longstanding questions about gender inequalities in paid and unpaid labor. While survey research has advanced our understanding of these disparities, it typically relies on predefined categories and is susceptible to social desirability bias, especially for sensitive topics. In contrast, online postings capture intimate relationship conflicts in great depth but rarely include demographic information. We leverage discussions in Reddit relationship communities that, due to unique community rules, include both rich descriptions of relationship conflicts around (un)paid labor and demographic details (age, gender). We first assess how well Large Language Models (LLMs) classify manifest and latent content in Reddit posts. Building on the best-performing approach, we examine how men and women discuss relationship conflicts around (un)paid labor before, during, and after the pandemic. Methods & Data Using GPT-family LLMs on 500,000 posts from the subreddits r/relationships and r/relationship_advice, we extract manifest demographic attributes and classify whether posts discuss romantic relationships, paid, and unpaid labor. We systematically vary model specifications (o3, 4o, 4.1), prompting strategies (zero-shot vs. few-shot), fine-tuned vs. base models, and context window lengths, evaluating each against human annotations. Using classifications from the best-performing approach, we apply Structural Topic Models to explore how men and women discuss (un)paid labor in romantic relationships, and how these discussions evolve over time (2011-2023). Results Across specifications, LLMs excel at predicting manifest categories but struggle with latent sociological constructs. Even the best-performing approach of fine-tuned GPT-4.1 with few-shot prompting and detailed category descriptions achieves only moderate performance when classifying paid and unpaid labor. Substantively, preliminary findings indicate that women more often discuss mental health in work-related conflicts, while men more frequently emphasize career objectives. Added Value We apply LLM-based classifications to core sociological questions around gender inequalities. We add to the growing research body on LLMs’ capabilities and limitations in classifying complex social-science constructs and offer new evidence on how gender disparities in paid and unpaid labor are reflected and negotiated in relationship conflicts. By combining demographic information with highly sensitive narratives, our dataset provides an empirical resource rarely available in either survey or social-media research. A Changing Language of Sustainability? Global online discourse analysis with a deep-dive on Germany 1Weber Shandwick, Germany; 2Fresenius University Koeln, Media School Background Purpose Method Results Added Value References on demand |
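The mobility-specific attrition study above plans to explore at-risk subpopulations with classification trees. A minimal sketch of that approach is given below; the file name, feature set, and outcome coding are hypothetical stand-ins, since the FReDA data are not distributed with this program.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical panel extract; the outcome flags panelists lost because their
# new address after a residential move was unknown.
df = pd.read_csv("panel_waves.csv")
features = ["age", "education", "household_size", "renter", "urban", "employed"]
X, y = df[features], df["msa_attrition"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=200, random_state=42)
tree.fit(X_train, y_train)

print(f"test accuracy: {tree.score(X_test, y_test):.3f}")
print(export_text(tree, feature_names=features))  # inspect which splits define risk groups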
| 1:30pm - 2:30pm | 4.3: Poster Session Location: RH, Auditorium |
|
|
Different Measures, Different Conclusions? Evaluating Operationalizations of Non-Optimal Response Behavior 1University of Mannheim, Germany; 2GESIS – Leibniz Institute for the Social Sciences, Germany Relevance & Research Question Do Audio Buttons Matter? Evidence from a Web-Based Panel Recruitment Survey GESIS – Leibniz Institute for the Social Sciences, Germany Relevance & Research Question Ensuring accessibility and response quality across diverse respondent groups is crucial when recruiting participants for mixed-mode probability panels. Audio buttons—short spoken renditions of survey questions—are a potential tool to support respondents with low literacy skills, visual impairments, or other limitations that may hinder reading survey questions. Yet their effectiveness and actual uptake remain understudied. This paper examines the extent to which audio buttons are used in a web-based recruitment survey for a mixed-mode panel refreshment and assesses whether their provision influences respondent behavior, consent to panel participation, and sociodemographic differences in usage. Methods & Data We analyze experimental data from the FReDA Recruitment Survey 2024, in which web respondents were randomly assigned to a questionnaire with audio buttons available for all questions (n=7,490) or to a version without audio buttons (n=7,406). In the experimental group, the questionnaire contained 149 audio buttons. On average, respondents took 19.4 minutes to complete the survey. Results Overall uptake of audio buttons was low: 16.6% of respondents used an audio button at least once, and most activated them only once or twice. A smaller group (5.7%) used an audio button three or more times. Usage peaked at the first question (around 6%), while most other questions showed activation rates of 1–2%. The availability of audio buttons did not affect consent to join the panel. Initial analyses indicate that men, younger respondents, individuals born abroad or without German citizenship, those with lower educational attainment, and smartphone users were significantly more likely to use audio buttons. Respondents who activated audio buttons also spent more time completing the survey. The potential impact of audio-button use on response quality is currently being examined. Added Value Do respondents in a cross-sectional probability survey donate their Spotify or Google Search data? – Short answer: No GESIS Leibniz Institute for the Social Sciences, Germany Relevance & Research Question Data donation offers valuable insights into digital trace data and online behavior, yet little is known about the willingness of respondents in probability-based – especially cross-sectional – surveys to participate. This study examines the feasibility of implementing data donation in a cross-sectional population survey and whether participation differs between requests for Google Search and Spotify data, assuming varying levels of perceived data sensitivity. Methods & Data Results Donation flows show a substantial drop-off between initial interest and completed data donations. While willingness did not differ between the Google and Spotify groups – 350 individuals accessed the study and 232 consented after learning which platform data would be requested – actual donations were rare: 18 in the Spotify sample and 12 in the Google sample. 
Preliminary post-donation survey results indicate platform-specific sensitivity perceptions: streaming services like Netflix are seen as relatively non-sensitive, whereas social media data are viewed as more private. Telegram stands out as particularly sensitive, with very low stated willingness to donate. Linking these findings with ISSP survey data (forthcoming) may indicate whether such patterns relate to political or ideological differences. At GOR, we will present further analyses of willingness to participate by attitudes towards digital tech and privacy concerns. Added Value How Processing Decisions Influence the Measurement of News Consumption with Web-Tracking Data University of Mannheim, Germany Relevance & Research Question Web-tracking technology is increasingly adopted in social science research to study online behavior because of its potential capability to collect granular individual-level data in situ and unobtrusively. Although web-tracking is a promising approach for accurately measuring online behavior, recent research has highlighted processing error as an additional error source. This line of research demonstrated that plausible and defensible choices across various processing procedures can transform raw web-tracking data differently. However, previous research has examined the effects of these processing choices on the resulting data in isolation. To date, no study has systematically examined how processing decisions may jointly transform web-tracking data differently. Addressing this research gap is important because a series of data processing procedures is often required to convert raw web-tracking data into the measure of interest. Using online news consumption as an exemplary variable, I ask the research question: Do different combinations of web-tracking data processing decisions lead to different distributions of online news consumption? Methods & Data To address this research question, I will conduct a multiverse analysis that systematically examines the distribution of online news consumption generated from a sequence of reasonable data-processing procedures. The analysis will be based on the PINCET dataset (Bach et al., 2023) that contains multi-wave survey data and web-tracking data from German adults between July and December 2021 on both PCs (N = 1,863) and mobile devices (N = 1,708). Results Building on Clemm von Hohenberg et al. (2024), I identify a five-step processing pipeline, alongside the corresponding reasonable processing options at each decision point to transform the raw web-tracking data into measures of online news consumption: 1) defining visit duration (5 options), 2) de-duplication (4 options), 3) classification of URLs (3 options), 4) handling missing data (3 options), 5) operationalisation of news media consumption (2 options). This results in a multiverse of 5 x 4 x 3 x 3 x 2 = 360 datasets. Correlations between all online news consumption variables across the dataset multiverse will be computed. Added Value This paper aims to enhance transparency of research using digital behavioral data by providing empirical evidence on how researcher degrees of freedom affect the measurement of web-tracking data. The Instagram Reality Check: Measuring the Accuracy of Self-Reported Social Media Behavior. 
MZES - University of Mannheim, Germany Relevance & Research Question: Prior research investigating the accuracy of self-reported online behavior has focused primarily on general usage measures, like the usage frequency of a social media platform, often neglecting the accuracy of self-reports for specific activities (e.g., posting, commenting, or liking). Self-reporting accuracy may differ substantially between general usage and specific behaviors, yet this distinction remains unexplored in survey research. Filling this gap, we compare self-reports against actual behavior to examine the accuracy of self-reports for both general usage frequency and specific behaviors. Methods & Data: We test how respondents’ individual characteristics and Instagram usage influence misreporting and how question framing influences reporting accuracy through a 2×2 survey experiment that varies the reference period (last week vs. typical week) and the response scale (numeric vs. vague quantifier labels). Our pre-registered analysis draws on data from over 400 participants from a probability-based German online panel. Participants self-reported their Instagram use in a web survey and provided actual usage data through data donation. Results: Data collection finished at the end of November 2025, and we will provide our results by January 2026. We hypothesize that respondents underreport their general platform usage frequency and that misreporting will also occur for specific behaviors (e.g., posting, commenting, or liking). We further hypothesize that the accuracy of self-reports will vary depending on platform usage frequency and engagement in specific behaviors. Finally, we predict that questions regarding a typical week will yield more accurate estimates than those regarding the last week, and that vague quantifier labels will perform at least as well as numeric labels. Added Value: Our study replicates previous findings and extends them to specific online behavior. First, we help researchers assess the validity of self-reported online behaviors. Second, through our survey experiment testing different reference periods, we offer insights into how to accurately inquire about specific online behaviors. Third, we illustrate the potential of data donation to gather fine-grained data on individual behaviors that participants might be unable to report accurately.
|
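The web-tracking multiverse abstract above enumerates 5 x 4 x 3 x 3 x 2 = 360 processing pipelines. The sketch below shows how such a specification grid can be generated; the option labels are illustrative shorthand rather than the study's exact processing choices.

from itertools import product

# Illustrative shorthand for the five decision points listed above
duration_rules  = ["d1", "d2", "d3", "d4", "d5"]      # 5 ways to define visit duration
dedup_rules     = ["none", "url", "domain", "visit"]  # 4 de-duplication rules
url_classifiers = ["list", "domain", "ml"]            # 3 URL classification approaches
missing_rules   = ["drop", "zero", "impute"]          # 3 missing-data treatments
news_measures   = ["duration", "count"]               # 2 operationalisations

pipelines = list(product(duration_rules, dedup_rules, url_classifiers,
                         missing_rules, news_measures))
print(len(pipelines))  # 360 specifications

# Each tuple in `pipelines` would parameterise one pass over the raw
# web-tracking data, yielding 360 news-consumption variables whose pairwise
# correlations can then be inspected, e.g. via pandas.DataFrame(...).corr().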
| 1:30pm - 2:30pm | 4.4: Poster Session Location: RH, Auditorium |
|
|
Technostress and Burnout in Daily Academic Life: An Empirical Investigation of Study-Related Stressors within the Study Demands and Resources Model Technische Hochschule Köln, Germany Understanding Technostress in Higher Education: Insights from an Online Survey Using the Study Demands and Resources Model
Relevance and Research Question Methods & Data Results Added Value Estimating Economic Preferences from Search Queries Goethe University, Germany Relevance & Research Question Results Trait or State? Understanding Motivational Drivers of Straightlining in a Longitudinal Panel Survey GESIS - Leibniz Institute for the Social Sciences, Germany Relevance & Research Question Straightlining—providing (nearly) identical responses in multi-item batteries—is a common indicator of satisficing in survey research. Compared with task difficulty and respondent ability, respondent motivation has received less systematic attention as a driver of satisficing. In longitudinal surveys, an important open question is whether motivational constructs reflect stable, trait-like characteristics or situational, state-like fluctuations across waves. Survey research also employs diverse operationalizations of motivation (e.g., topic interest, personality traits, survey attitudes), yet these are rarely compared in terms of temporal stability or predictive power. This study therefore examines the extent to which motivational measures display trait- versus state-like variation and how these components relate to straightlining over time. Methods & Data We use data from the GESIS Panel.pop, a probability-based mixed-mode panel in Germany, drawing on nine annual waves of the Social and Political Participation Longitudinal Core Study. Repeated measures are available for political interest, Big Five traits (agreeableness, conscientiousness), survey attitudes, and straightlining. To assess stability, we estimated separate random-intercept models for each motivational indicator and computed intraclass correlation coefficients (ICCs). To predict straightlining, we applied a within–between decomposition and estimated a multilevel binomial logistic regression with respondent random intercepts, controlling for demographics, cohort, mode, and survey year. Results Motivational indicators show considerable temporal stability. Political interest is the most trait-like (ICC = 0.76), followed by conscientiousness (0.64) and agreeableness (0.58). Survey attitudes display more moderate stability, with ICC values ranging from 0.53 (perceived burden) to 0.59 (survey value) and 0.62 (enjoyment). In the predictive model, political interest is the strongest determinant of straightlining: both higher average levels and within-person increases reduce straightlining. Survey attitudes show smaller, largely trait-level associations, and personality traits have modest effects. Added Value Findings indicate that straightlining is driven mainly by stable, between-person motivational differences, with political interest standing out as the strongest factor. By comparing multiple motivation measures and separating their trait and state components, the study provides practical insights for identifying respondents at risk of satisficing and for supporting data quality in longitudinal surveys. Extensions to additional satisficing indicators (e.g., item nonresponse, speeding) are planned. AFGfluencers in Germany: Platforms, Actors, and Issues Leipzig University, Germany Relevance & Research Question The digital landscape of the Afghanistan diaspora in Germany is rapidly evolving, yet little is known about its online influencers, communication practices, and content narratives. Understanding this emerging microcosm is crucial for bridging knowledge gaps in digital media research and fostering social cohesion. 
This study investigates three core questions: Which digital platforms are most used by the Afghanistan diaspora in Germany? Who are the key influencers shaping this digital space? What topics and narratives dominate discussions and content creation among AFGfluencers? Methods & Data This research is ongoing, but preliminary work has mapped the main platforms and key influencers within the Afghanistan diaspora in Germany. Initial observations indicate active communication around cultural identity, migration experiences, and community issues. Analysis of content themes and engagement patterns is in progress, aiming to reveal how influencers connect members, shape narratives, and contribute to the formation of an online diaspora network. Added Value This study provides novel insights into an under-researched yet important digital community — the Afghanistan diaspora in Germany — enhancing understanding of online diaspora communication. By highlighting key actors, platforms, and narratives, it informs both academic research and policy discussions on integration, social cohesion, and misinformation mitigation. Methodologically, it demonstrates practical approaches for analyzing online communities, contributing to the broader field of digital social research. The findings have the potential to guide engagement with transnational publics and foster inclusive digital discourse. Comparing Probability and Nonprobability Online Surveys: Data Quality and Fieldwork Processes 1Institute for Employment Research, Germany; 2Ludwig-Maximilians-Universität, Munich, Germany; 3University of Maryland, College Park, USA Objectives - Which business question did the client want to have answered? |
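The straightlining study in this poster session derives intraclass correlation coefficients from random-intercept models for each motivational indicator. A minimal sketch of that computation, assuming hypothetical long-format panel data and variable names, is shown below.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per respondent-wave
long = pd.read_csv("panel_long.csv")

# Random intercept per respondent, no further fixed effects
fit = smf.mixedlm(
    "political_interest ~ 1", data=long, groups=long["respondent_id"]
).fit()

between_var = float(fit.cov_re.iloc[0, 0])  # between-respondent variance
within_var = fit.scale                      # within-respondent (wave-level) variance
icc = between_var / (between_var + within_var)
print(f"ICC (political interest): {icc:.2f}")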
| 1:30pm - 2:30pm | 4.5: Poster Session Location: RH, Auditorium |
|
|
Developing a measurement of masculinity norms: insights from the MEN4DEM project University of Bergamo, Italy Relevance & Research Question The literature identifies different types of masculinity. Hegemonic masculinity refers to the type of masculinity legitimating unequal gender relations (between men and women, between masculinity and femininity and among masculinities). Hegemonic masculine norms for the “ideal man” include traits and practices like being strong, successful, independent, unemotional, in control. Hypermasculinity, aggressiveness, and sexual prowess are generally glorified. By contrast, caring masculinity refers to the idea that, without rejecting masculinity, men are able to adopt what (traditionally) is seen as a feminine characteristic. Here, the emotional dimension is crucial. Recent research refers to men negotiating masculinity, e.g., not feeling challenged ‘as men’ because of their caregiving roles while reinforcing interest in ‘manly’ hobbies/sports, suggesting processes in which they negotiate which aspects of masculinity do not fit with their identity and which do. Extreme-right and manosphere ideologies tend to privilege hegemonic masculinity, which legitimizes unequal gender relations and anti-democratic stances. But how can these norms be measured to investigate their spread in public opinion? This contribution describes the development, both conceptually and empirically, of the measurement of hegemonic masculinity norms and explores their spread among the general population of six European countries. Methods & Data The contribution reports on the development of the new “Hegemonic Masculinity Norms Scale” drafted in the context of MEN4DEM (Masculinities for the future of European democracy – Horizon Europe, GA n. 101177356), on the cross-cultural comparability challenges (including the use of advance translation), and on results from the online survey experiment conducted to assess reliability and validity. The final measurement is then used to show the spread of hegemonic masculinity norms in Greece, Germany, Italy, Netherlands, Poland, and Sweden. (N=800 in each of the six countries; representative samples randomly assigned to experimental settings) Social Media Surveys in Emerging Risks: Measuring Stress During Bradyseism in Campi Flegrei, Italy 1University of Padova, Italy; 2University of Bari “Aldo Moro”, Italy; 3University of Salerno, Italy Relevance & Research Question In emerging risk contexts and situations requiring contingent data collection, social media surveys can play a crucial role in analysing population needs and stress levels, offering an innovative monitoring tool. Bradyseism is a phenomenon of slow ground deformation that occurs in the Campi Flegrei area near Naples (southern Italy). This phenomenon can cause shallow earthquakes, damage to structures, and an increase in seismic activity. The Campi Flegrei area is characterized by high population density and strong socioeconomic inequalities. This study aims to assess stress levels in the population affected by bradyseism in the Campi Flegrei area. Methods & Data An online survey was conducted during a period of increased bradyseism activity in the Campi Flegrei area. The Meta advertising platform was used to recruit participants exposed to the phenomenon. The survey, lasting 11 days, collected over 600 completed questionnaires with an investment of 720 Euros. This approach allowed for targeted spatial sampling of the population.
The study included residents in the surroundings of Pozzuoli (high-risk area) and Ottaviano (control area). Results The survey provided valuable information on stress levels among the affected population. The propensity to respond to a survey on social networks is proportional to the interest in the survey topic. This resulted in a higher number of responses from those affected by the earthquakes, despite our efforts to achieve a balanced sample. The data collected offered insights into the differential impacts of the bradyseism phenomenon across various demographic groups, with a particular focus on gender-based vulnerabilities. Moreover, the stress level, measured using a ten-item psychometric scale, showed a high correlation with both the number of perceived earthquake tremors and their intensity. Automated Political Stance Identification in Political Texts Universität Potsdam/Hasso-Plattner Institut Relevance & Research Question This study addresses the growing need for transparent, theory-based methods to analyze political text using artificial intelligence. It asks whether Large Language Models (LLMs) can reliably identify political stances—such as ideological orientation, support for liberal democratic values, and populist rhetoric—without the need for manually labeled data. Methods & Data The paper employs a Natural Language Inference approach leveraging a retrained model, DEBATE (Burnham et al. 2024). In so doing, we build on expert survey codebooks from reputable sources in political science. The model is validated by comparing its outputs with expert-assigned scores to assess accuracy. Results The findings show no statistically significant differences between the model’s classifications and those of human experts. The model also demonstrates strong multilingual capabilities across English, German, Italian, Portuguese, and Spanish. Added Value This study introduces a cost-effective, replicable, and theoretically grounded approach for stance detection in political texts. By eliminating the need for data labeling and integrating the method into the forthcoming Automated Political Stance Identification (APSI) platform, it provides an accessible tool for researchers, policymakers, civil society, and the public seeking evidence-based insights into ideological and rhetorical patterns in politics. Lessons Learned from Developing Indices for Syndicated Studies YouGov, Switzerland Relevance & Research Question For a syndicated study, data from two competing companies have been pooled to date. A new questionnaire was developed specifically for the study in which internal employees were surveyed. In addition, data from various customer satisfaction studies were integrated. Financial performance indicators from the partner companies and their suppliers were also included. From these inputs, distinct non-overlapping indicators were constructed, complemented by a single comprehensive index that synthesizes all individual measures. The development process took place in close collaboration with the syndicated study partners. Moreover, suppliers and customers were invited to contribute their perspectives and assessments, ensuring that the resulting indicators reflect a broad and balanced understanding of the market context.
Results Goodnight, Prince of Darkness: Ozzy Osbourne’s Death as a Global Facebook Event Bar-Ilan University, Israel Relevance & Research Question When news of Ozzy Osbourne’s death broke in July 2025, millions of Facebook users came together in collective grief, creating a global digital ritual. This study examines how mourning, nostalgia, and fandom unfolded on social media, asking: how does celebrity death evolve into a networked media event? Building on media event theory (Dayan & Katz, 1992) and the concept of affective publics, the research explores how emotions, algorithms, and cultural memory intersect when a legendary figure dies "live" in the networked sphere. Methods & Data We collected 46,390 public Facebook posts published between 22 and 30 July 2025 that mentioned "Ozzy Osbourne". We then preprocessed the texts through tokenization, lemmatization, and stopword removal, and extracted bigrams using the gensim Phrases model. We applied Latent Dirichlet Allocation (LDA) topic modeling (Coherence = 0.73) to identify dominant themes and conducted sentiment analysis with TextBlob to measure polarity on a –1 to +1 scale. Finally, we visualized temporal and emotional dynamics to trace the evolution of discourse over time. Results |
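A minimal Python sketch of the preprocessing, topic-modeling, and sentiment steps named in the abstract above (gensim Phrases, LDA, TextBlob). The post list is a placeholder and lemmatization/stopword removal are omitted for brevity, so this illustrates the pipeline rather than reproducing the authors' code.

from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases
from gensim.models.coherencemodel import CoherenceModel
from gensim.utils import simple_preprocess
from textblob import TextBlob

posts = ["Goodnight Ozzy, thank you for the music", "RIP to the Prince of Darkness"]  # placeholder posts

tokens = [simple_preprocess(p) for p in posts]   # tokenize and lowercase
bigrams = Phrases(tokens, min_count=5)           # detect frequent bigrams
tokens = [bigrams[doc] for doc in tokens]

dictionary = Dictionary(tokens)
corpus = [dictionary.doc2bow(doc) for doc in tokens]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=5, random_state=42)
coherence = CoherenceModel(model=lda, texts=tokens, dictionary=dictionary, coherence="c_v").get_coherence()

polarity = [TextBlob(p).sentiment.polarity for p in posts]  # -1 (negative) to +1 (positive)
print(round(coherence, 2), sum(polarity) / len(polarity))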
| 2:00pm - 3:45pm | GOR Impact and Innovation Award Location: RH, Seminar 03 |
|
|
Scaling Qualitative Depth: A Large-Scale Validation Study Comparing AI-Moderated Interviews and Conventional Surveys in OTC Pharma Research 1MCM Klosterfrau Vertriebsgesellschaft mbH; 2Q Agentur für Forschung GmbH; 3horizoom-Panel – horizoom GmbH; 4xelper UG (haftungsbeschränkt) Agentic AI in Market Research Aequitas Group, Germany Comparing AI-Moderated, Human-Moderated, and Unmoderated Usability Testing: Insights into Quality, User Perception and Practical Implementation 1Porsche AG, Germany; 2Userlutions GmbH; 3xelper UG BEYOND AI - How the DFB leveraged a new qualitative insights method that improves AI results. SUPRA, Germany From Insights to Impact: A Life-Centric, AI-Driven Approach to Modern Brand Tracking GIM Gesellschaft für innovative Marktforschung mbH, Germany |
| 2:30pm - 3:30pm | 5.1: Data quality and measurement error I Location: RH, Seminar 01 |
|
|
I misbehave, but only once in a while: how face-saving strategies can reduce socially desirable responding in online survey research University of Groningen, Netherlands, The Relevance & Research Question ‘And Yet…’: The Effectiveness of Probing Questions in Reducing Item Nonresponse to Financial Questions NRU HSE, Russian Federation Relevance & Research Question: High rates of item nonresponse to questions on income and expenditures compromise data quality, leading to sampling bias and limiting the generalizability of findings. This study investigates the effectiveness of follow-up (probing) questions in reducing nonresponse to financial questions within a Russian context and identifies the profiles of non-respondents. Methods & Data: Using data from the 6th wave of the HSE University's "Economic Behavior of Households" survey (N=6000), the analysis employs a two-stage approach combining the Random Forest method with the Boruta algorithm and logistic regressions. Results: Results indicate that probing questions successfully converted 36-41% of initial nonresponses into substantive answers, decreasing the overall nonresponse rate from 6-17% to 4-10%. Key predictors of nonresponse were found to be unawareness of household expenditures (a 3- to 5-fold increase in odds), poor health, lack of savings, and residence in small towns. The technique proved most effective for respondents with lower education levels, no savings, and those from small towns. Added Value: The findings demonstrate that while probing is a valuable tool, its primary mechanism is reducing cognitive complexity rather than mitigating question sensitivity. Based on this evidence, the paper offers practical recommendations for improving the design of surveys that include financial questions.
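A minimal Python sketch of the two-stage approach described in the probing-questions abstract above: Boruta feature selection wrapped around a random forest, followed by a logistic regression on the retained predictors. The data file and variable names are hypothetical.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("household_survey.csv")        # hypothetical file and columns
X = df.drop(columns=["income_nonresponse"])
y = df["income_nonresponse"].to_numpy()         # 1 = "don't know"/refusal, 0 = substantive answer

# Stage 1: Boruta flags every predictor that carries information about nonresponse
forest = RandomForestClassifier(n_estimators=500, max_depth=5, n_jobs=-1, random_state=42)
selector = BorutaPy(forest, n_estimators="auto", random_state=42)
selector.fit(X.to_numpy(), y)
selected = X.columns[selector.support_]

# Stage 2: logistic regression on the retained predictors, reported as odds ratios
logit = sm.Logit(y, sm.add_constant(df[selected])).fit(disp=False)
print(np.exp(logit.params))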
Having ACES Up Your Sleeve: Developing and Validating Attention Checks Embedded Subtly (ACES) to Improve Identification of Inattentive Participants Institute of Philosophy and Sociology of the Polish Academy of Sciences, Poland Relevance & Research Question: Careless or insufficient-effort responding (C/IER) is a major threat to data quality in online surveys. Existing detection approaches face substantial limitations: indicator-based methods (e.g., straightlining indices) require subjective threshold-setting, while model-based approaches rely on strong and often unrealistic assumptions and are difficult to implement in applied research. Attention checks offer an objective-to-score and easy-to-use alternative. However, traditional attention checks, such as instructed-response items (Please select strongly agree) or bogus items (Orange is a fruit), suffer from limited validity, and high respondent reactivity (Daikeler et al., 2024; Gummer et al., 2021; Silber et al., 2022). This project addresses the lack of attention checks that would mimic ordinary questionnaire items to limit reactivity while reliably identifying inattentive respondents. The central research question is: How can we design attention checks that outperform existing approaches in validity and non-reactivity? Methods & Data: First, 460 candidate ACES were developed across diverse domains (e.g., personality, technology, political attitudes), drawing on the concept of frequency/infrequency items (Kay & Saucier, 2023), and tested on 1498 respondents. Items were evaluated using distributional properties and socio-demographic invariance. The selected ACES were validated in a between-subjects experiment (N = 880) comparing four conditions: (1) ACES, (2) IRIs, (3) bogus items, and (4) a no-check control group. The questionnaire included measures of perceived clarity and seriousness, as well as open-ended feedback on attention checks. The findings were replicated in another survey (N = 1113) targeting less experienced online panel members. All studies used non-probability quota samples (age, gender, education) from Polish online research panels. Results: ACES showed stronger associations with independent indicators of inattentiveness (e.g., response times, straightlining indices) and demonstrated higher classification accuracy (careless-not careless) than IRIs and bogus items. Respondents generally did not recognize ACES as attention checks, despite being highly familiar with traditional checks. Added Value: This project delivers the first validated ACES set and provides empirical evidence that ACES improve detection of C/IER in comparison to other attention checks while minimizing respondent reactivity. This enables using ACES in probability samples or interviewer-based surveys where traditional attention checks were not used due to their coarseness and potential for reactivity. |
| 2:30pm - 3:30pm | 5.2: Online panels I Location: RH, Seminar 02 |
|
|
Comparing Probability, Opt-In, and Synthetic Panels: A Case Study from the Netherlands 1Norstat, Netherlands, The; 2Lifepanel Relevance & Research Question The growth of nonprobability online panels and the emergence of synthetic survey respondents have created new opportunities and uncertainties for social measurement. While probability samples remain the reference standard, opt-in and synthetic data sources offer faster fieldwork and lower cost but may introduce unknown biases. This study asks: How comparable are attitudes measured across a probability panel, an opt-in panel, and a synthetic dataset? Methods & Data Three parallel surveys (≈500 completes each) were administered using identical instruments.
The questionnaire measured perceptions of the national situation, attitudes toward elections, and interest in sports. Analyses include demographic comparison with CBS benchmarks, item nonresponse, variance structures, and inter-item correlations. Calibration experiments test post-stratification and raking on age, gender, education, and region to evaluate alignment potential. Results The probability panel demonstrates the expected demographic balance and serves as the comparative baseline. The opt-in panel aligns closely with probability results after weighting, although unweighted data show overrepresentation of younger, higher-education groups. Attitudinal means are largely consistent across the two empirical samples, with modest discrepancies in political trust and evaluations of the national direction. The synthetic dataset approximates mean values for several attitude items but exhibits compressed variance and weakened correlation patterns, indicating insufficient behavioral realism. Some synthetic respondents show inconsistent response structures not observed among human participants. Calibration improves demographic similarity but does not correct these structural limitations, suggesting that synthetic data are constrained more by model assumptions than by post-survey adjustment. Added Value This is one of the first empirical comparisons integrating probability, opt-in, and synthetic survey data within a single national framework. The study provides practical guidance on when synthetic respondents can complement empirical data (e.g., instrument testing) and where their limitations lie. It also clarifies the degree to which calibration can bridge differences between probability and nonprobability data but highlights fundamental constraints for synthetic datasets. The findings contribute to methodological best practices as synthetic data become increasingly visible in survey research. Optimizing Panel Consent using Repeated Requests while Experimentally varying Request Placement and Panel Consent Incentives Institute for Employment Research (IAB), Germany Relevance & Research Question High panel consent rates are essential for reducing panel attrition and limiting the risk of panel consent bias in panel surveys. This study investigates how panel consent rates can be optimized by applying three innovative survey design features covering a repeated request for panel consent within the questionnaire while experimentally varying the placement of requests (beginning vs. end) and the incentive for panel consent. Methods & Data Analyses are based on the recruitment wave of the third cohort of the "Online Panel for Labour Market Research" (OPAL) and cover about 7.200 cases (about 12% are classified as partial interviews). In our design of repeated requests for panel consent within the questionnaire, respondents who do not provide consent at the first request are followed up with a second request to reconsider their decision. The survey experiment comprises four experimental groups that differ in the placement of the first panel consent request (beginning vs. end) and the incentive for panel consent at the first (0€ vs. 5€) and the second request (5€ vs. 10€). Results Concerning the first request, an early request is more successful than a late placement: When offering no incentive, the panel consent rate is higher when asked at the beginning rather than asking at the end. When offering a 5€ incentive, the panel consent rate is higher when asking at the beginning rather than at the end. 
Due to the second request, the cumulative panel consent rate increases by 5 to 10 percentage points across experimental groups. The highest cumulative panel consent rate after two requests is achieved by the design with the first request at the beginning offering 5€ and the second request at the end of the questionnaire offering 10€. Added Value This paper provides evidence that the highest panel consent rates are realized with a placement at the beginning of the questionnaire, which questions the traditional placement at the end. The panel consent rate can be significantly improved when implementing a second request within the questionnaire. Results show that the placement of a panel consent request can be more relevant than incentivizing. You’ve got Mail: Does sending Thank-you postcards increase response in a probability-based online panel? 1GESIS, Germany; 2University of Mannheim, Germany; 3Heinrich Heine Universität Düsseldorf, Germany; 4University of Hamburg, Germany Relevance & Research Question Survey nonresponse is one of the major challenges to survey data quality. While various treatments, e.g., monetary incentives, have been tested to increase response rates, the evidence regarding differential effects of the treatments for different population groups is rather thin. For example, it is known that incentives work well overall, but it is unclear whether a less costly form of appreciation (like a postcard) would achieve the same or a greater effect for certain population groups or whether some groups do not need any treatment at all. Methods & Data An experiment to increase survey response will be fielded in mid-December 2025 among the newly recruited participants of the German Internet Panel (GIP). After the first panel wave, before the second panel wave, panelists will randomly be assigned to four experimental groups: 1.) receiving a “Thank-you” postcard from the GIP team 2.) receiving a handwritten “Thank-you” postcard with the same text as 1.) 3.) receiving a postcard that states that they have been credited an extra 5 Euro as a “Thank-you” for being in the panel 4.) control group The postcards are not connected to the invitation to the second panel wave but are a general expression of appreciation for the panelist’s participation in the study. Results Results will be presented including: 1.) the overall effect of the treatment on nonresponse in wave 2 2.) possible interaction effects of personal characteristics and the treatment on nonresponse in wave 2. The personal characteristics include basic socio-demography, the Big Five personality traits, and the motivation to participate in the survey. Added Value Looking at heterogeneous effects of the treatment on nonresponse, the findings will inform survey practitioners on developing a targeted design to increase response rates and, more specifically, to increase response rates for specific population subgroups. |
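A minimal Python sketch of raking (iterative proportional fitting), the calibration step mentioned in the probability/opt-in/synthetic panel comparison earlier in this session. The sample data and target margins are hypothetical; the study itself calibrates to CBS benchmarks on age, gender, education, and region.

import pandas as pd

def rake(df, margins, weight_col="weight", max_iter=50, tol=1e-6):
    """Adjust weights until weighted category shares match the target margins."""
    df = df.copy()
    df[weight_col] = 1.0
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in margins.items():
            weighted = df.groupby(var)[weight_col].sum()
            shares = weighted / weighted.sum()
            factors = {cat: targets[cat] / shares[cat] for cat in targets}
            old = df[weight_col].copy()
            df[weight_col] *= df[var].map(factors)
            max_change = max(max_change, (df[weight_col] - old).abs().max())
        if max_change < tol:
            break
    return df

margins = {
    "age_group": {"18-34": 0.27, "35-54": 0.33, "55+": 0.40},
    "education": {"low": 0.30, "medium": 0.40, "high": 0.30},
}
sample = pd.DataFrame({"age_group": ["18-34", "55+", "35-54"], "education": ["high", "low", "medium"]})
print(rake(sample, margins))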
| 2:30pm - 3:30pm | 5.3: AI and society Location: RH, Seminar 04 |
|
|
Information-Seeking in the Age of Generative AI: Factors That Influence the Behavioural Intention of Media Students to Use ChatGPT Hochschule Darmstadt, Germany Relevance & Research Question: Methods & Data: Results: Added Value:
Exploring Differences in ChatGPT Adoption and Usage in Spain: Contrasting Survey and Metered Data Findings RECSM-UPF, Spain Relevance & Research Question What do we talk about when we talk to LLMs? 1Université Paris Nanterre, France; 2Aalto University, Finland; 3Vrije Universiteit Amsterdam; 4Bilendi Relevance & Research Question Commercial LLMs are now part of everyday online life, but we still know strikingly little about what people actually do with them in practice. Here, we present empirical insights into the content of messages that people exchange with chatbots, such as ChatGPT. There are still few consensual results in this area. A very recent study by OpenAI (Chatterji et al., 2025) found surprising results that partly contradicted earlier research, demonstrating limited gender and education differences and very few “personal” interactions between users and ChatGPT. We use extensive GDPR-compliant data to address two questions: RQ1: Can these recent findings regarding topic distribution and gender/education differences be replicated? RQ2: How personal do conversations with LLMs get, and do they become more personal over time? Methods & Data Our data covers 5 months of conversation records from panel members in Brazil, Germany, Mexico and Spain who agreed to share their internet activity on laptop and/or mobile device (01.06.25–31.10.25; N = 45,200 participants). We collect both HTML streams and in-app contents for six major AI platforms: ChatGPT, Claude, Copilot, Gemini, Meta and Perplexity. We examine the content of the conversations using LLM-based classifiers. In particular, we reproduce the same prompts and data input as those used in Chatterji et al. (2025) for comparability (RQ1). Our data uniquely combines multiple AI system sources and reliable sociodemographic information, putting us in a good position to better understand and assess the divergences in previous studies, which were limited in terms of LLM sources and user qualification. LLM-based classifiers enable fine-grained classification on high volumes of data, allowing for new approaches to RQ2 besides mere content classification. We believe the extent to which people get personal with LLMs is underestimated when it is assessed merely through topic classification, since topics other than “self-expression” and “relationships”, such as practical guidance or multimedia topics, may also involve personal engagement with the systems. |
| 3:30pm - 4:00pm | Break Location: RH, Lunch Hall/ Cafeteria |
| 4:00pm - 5:00pm | 6.1: Data quality and measurement error II Location: RH, Seminar 01 |
|
|
Assessing Trends in Turnout Bias in Social Science Surveys: Evidence from the European Social Survey and German Survey Programs 1GESIS - Leibniz Institute for the Social Sciences, Germany; 2University of Mannheim Relevance & Research Question Social science surveys frequently overestimate voter turnout due to measurement and nonresponse errors, which undermine the validity of research on the causes and consequences of political disengagement. As turnout bias may differ across countries and over time, both cross-national and longitudinal comparisons are challenged. Despite these concerns, there is no comprehensive longitudinal and cross-national comparison of turnout bias. Consequently, it remains unclear to what extent turnout bias is shaped by contextual factors or by survey design. To close this gap, we examine (1) the prevalence and development, (2) the contextual factors and (3) the survey design features associated with turnout bias in European social science surveys since 2000. Methods & Data We analyze data from the European Social Survey (ESS) and a unique data set of German Survey Programs (GSP) conducted between 2000 and 2023. First, we run separate OLS regression models using either absolute or relative turnout bias as the dependent variable and the year of data collection as the independent variable for each country in the ESS and for each survey program in the GSP. Second, we estimate fixed effects and mixed effects models using absolute or relative turnout bias in the ESS and GSP as the dependent variable. As independent variables, we include contextual factors. Third, we add variables capturing variations in survey design as independent variables. Results Our findings reveal that the extent of turnout bias varies between countries and has increased over time, posing a significant challenge for both cross-national and longitudinal research. We identify several survey design features that could mitigate turnout bias which are in line with previous literature. Moreover, we discuss methodological innovations aimed at reducing turnout bias by targeting nonvoters before or during data collection through tailored survey designs. Added Value The persistence of measurement and nonresponse errors, along with the lack of validated turnout data, are a constraint for social science surveys. This study offers the first comprehensive longitudinal and cross-national comparison of turnout bias. Our results underscore the urgent need for methodological innovations to ensure the validity and comparability of data on political disengagement. Validating a 6-Item Scale for Measuring Perceived Response Burden in Establishment Surveys 1IAB, Germany; 2IAB, Germany; University of Munich, Germany Relevance & Research Question Response burden is a significant challenge in establishment surveys, threatening data quality and survey participation. However, the field lacks validated instruments to measure perceived response burden. This study addresses this gap by developing and validating a 6-item, binary (Yes/No) response burden scale. Our central research question is whether this scale achieves measurement equivalence across different levels of objective burden (questionnaire length) and stability over time (longitudinally). Methods & Data We utilize data from an experiment embedded within three quarterly follow-up waves (2023-2024) of the IAB Job Vacancy Survey (IAB-JVS), a large-scale German establishment survey. 
Establishments (n=3,888) were randomly assigned to receive either a short (2-page) or a longer (4-page) follow-up questionnaire. We test for measurement invariance using multi-group Confirmatory Factor Analysis (CFA) adapted for binary indicators (WLSMV estimator), following Wu and Estabrook (2016). We assess both cross-sectional invariance (between experimental groups) and longitudinal invariance (across the three waves). Results The scale demonstrates strong construct validity: respondents in the 4-page condition reported significantly higher perceived burden across all items (e.g., "High number of questions"). The analysis confirms full scalar invariance across the 2-page and 4-page experimental groups in each wave (e.g., Q1: ΔCFI < 0.01). This indicates the scale measures the same latent construct equivalently regardless of objective burden. Furthermore, the scale achieved full longitudinal scalar invariance across the three waves, demonstrating its temporal stability even as quarterly questionnaire content changed. Added Value This study provides practitioners with a validated, concise instrument to monitor perceived burden in establishment surveys. Based on the confirmation of cross-sectional and longitudinal scalar invariance, researchers can now confidently use this scale to track burden trends over time and accurately evaluate the impact of questionnaire design interventions. Hopefully, our work provides a reliable tool for comparative analysis, supporting efforts to improve data quality and respondent engagement. The effects of panel conditioning on response behavior across different cohorts: Bias in the Core Discussion Network University of Mannheim, Germany Research Question Panel conditioning, i.e., changes in response behavior caused by repeated survey participation, is a central methodological concern in online panels. Research has identified both positive and negative conditioning effects, but little is known about how these processes unfold in egocentric social network surveys, where name-generator items create opportunities for satisficing. In this study I ask: (1) How does repeated participation affect the likelihood of motivated misreporting in these filter questions? (2) To what extent is this relationship mediated by respondents’ reported network size, that is, the number of alters named in the generator? Methods I use data from the 12th wave of the online probability-based LISS panel, drawing on the Core Discussion Network module, which includes a name generator and follow-up questions on alter characteristics. Panel experience operates as the independent variable, and motivated misreporting in filter questions as the dependent variable, while network size serves as the mediator. I estimate causal mediation models using Poisson and logistic regression with 5,000 bootstrap resamples and control for sociodemographic and survey-evaluation variables associated with panel attrition (see the sketch after this session block). Results The results reveal two opposing mechanisms: a direct and an indirect effect. Indirectly, respondents with greater survey experience report larger discussion networks, which increases their likelihood of misreporting in the filter questions in order to avoid the tie-strength assessments. Directly, however, more experienced participants are less likely to engage in motivated misreporting when network size is held constant, suggesting reduced satisficing due to increased familiarity with online survey tasks.
Because these pathways counteract each other, the total effect of panel experience on misreporting is small and statistically nonsignificant. Added Value This study demonstrates that panel conditioning in online surveys operates through simultaneous, opposing mechanisms that remain hidden to conventional response-quality diagnostics. The findings highlight the need to consider how question order, task burden, and instrument structure interact with respondent experience in modules involving name generators. By applying mediation analysis, the study provides a framework for detecting hidden behavioral mechanisms and offers practical guidance for improving the design and interpretation of longitudinal online surveys. |
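A minimal Python sketch of the bootstrapped mediation analysis referenced in the panel-conditioning abstract above (panel experience -> reported network size -> misreporting). It uses a simplified product-of-coefficients approximation on the linear-predictor scale rather than a full potential-outcomes estimator, and all file and column names are hypothetical.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("liss_wave12.csv")   # hypothetical extract of the module
rng = np.random.default_rng(42)

def indirect_effect(data):
    med = smf.poisson("network_size ~ panel_experience + age + education", data=data).fit(disp=False)
    out = smf.logit("misreport ~ panel_experience + network_size + age + education", data=data).fit(disp=False)
    return med.params["panel_experience"] * out.params["network_size"]

point = indirect_effect(df)
# 5,000 resamples as in the abstract; reduce for a quick test run
boot = [indirect_effect(df.sample(n=len(df), replace=True, random_state=rng)) for _ in range(5000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect = {point:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")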
| 4:00pm - 5:00pm | 6.2: Online panels II Location: RH, Seminar 02 |
|
|
Handling the Recruitment Process for a Probability Online Panel In-House: Insights and Lessons From the 2025 German Internet Panel Recruitment University of Mannheim, Germany Relevance & Research Question: In this contribution we will report on the 2025 recruitment of new respondents for the GIP. This includes a general overview of the recruitment procedure and results, unexpected obstacles we encountered along the way, lessons learned, and things to look out for when handling sample recruitment in-house. As we observe a general trend of moving more processes in-house among academic survey projects, often to reduce costs, this contribution will be of interest for a wide audience of survey practitioners who (plan to) handle their own recruitment of respondents. Methods & Data: In September and October 2025, we sent out 7,000 invitations to prospective new respondents for the GIP. About 2/3 went to a random sample from the population registers of 135 municipalities drawn by GESIS and 1/3 to addresses sampled from a commercial database. We report on the process of handling the recruitment of a probability sample in-house, obstacles we encountered and possible solutions to these, and recruitment results such as response rates. Results: During the recruitment process, we encountered a number of obstacles, such as uncooperative municipalities, coordinating the printing and sending of invitation letters and reminders in-house, print quality, handling the prepaid cash incentives, and a higher-than-expected number of invitations being returned by the postal service as undeliverable, presumably due to incorrect addresses, particularly in the commercial address database sample. While the fieldwork is still ongoing, preliminary results indicate that about 1,900 recruitment interviews have been started and about 1,100 to 1,200 new respondents will be recruited into the panel, while about 800 of the 7,000 invitations have been returned undeliverable. Added Value: We provide a hands-on report with concrete advice for conducting a survey recruitment and avoiding potential obstacles and costly errors. In addition, we report recent information on the quality of obtained address data, response rates, and recruitment success for a probability online panel in the ever-changing survey climate in Germany, which will be of interest to any practitioners planning the recruitment of a probability sample. Methods to Maximize the Panel Consent Rate in the Recruitment Wave of a New Web Panel 1Institute for Employment Research (IAB), Germany; 2ZEW Mannheim; 3University of Bamberg; 4LMU Munich; 5University of Mannheim Relevance & Research Question Panel consent is the permission given by respondents to be re-contacted for future panel waves. The lower the panel consent rate, the larger the initial panel attrition and the higher the risk of panel consent bias, which threatens data quality. This study investigates whether incentives for panel consent and repeated requests for panel consent can increase panel consent rates. Methods & Data Results Our analyses (N≈41.000) show that 1) panel consent rates at the first request are higher if incentives are offered; 2) the second request significantly increases the cumulated panel consent rate; 3) the second-request effect is highest for group 3, where incentives are offered only at the second request; 4) the cumulative panel consent rate is highest for groups 2 and 3. 
Regarding cost effectiveness, group 3 leads to panel consent rates as high as those of group 2, while its costs per consenting respondent are close to those of group 1. We will also provide results on how the experimental design affects wave 2 response rates. Added Value This paper introduces and assesses two innovative survey design features to maximize panel consent rates. Furthermore, we analyse the costs associated with each design and derive recommendations for panel consent request designs. Between the Waves: How additional studies shape panel participation trajectories. Robert Koch-Institut, Germany Relevance & Research Question: In recent years, probability-based mixed-mode panels have become a common tool in empirical research. Most panels rely on continuous, regular surveys that ask participants about core topics at fixed intervals (in a similar way to classic longitudinal panels). In addition, some of these panels offer additional or ad hoc studies. These studies allow internal or external researchers to conduct additional and in-depth surveys. The Robert Koch Institute (RKI) Panel 'Health in Germany', set up in 2024, operates exactly in line with this logic. However, for research purposes, the question arises as to what extent additional studies (at irregular intervals on different topics) influence non-response in further (regular) surveys. In this presentation, we will share first findings from the RKI Panel. Methods & Data: The analysis draws on online survey response metrics from panel waves of the RKI Panel, complemented by information on invitations to and participation in an ad hoc survey conducted between the regular waves. A comparison is made between two randomly selected groups of panel members invited versus not invited to the analyzed ad hoc survey. Multivariate models control for sociodemographic characteristics to isolate potential effects attributable to the ad hoc survey. Results: Preliminary findings show that participants who were invited to the ad hoc survey exhibit marginally lower sub-group response rates for the following wave. However, controlling for sociodemographic characteristics in a multivariate model, we see no substantial effect regarding potential non-response for participants of the ad hoc survey. Thus, participants show stable response patterns overall. Data collection is still ongoing. Further results on more ad hoc surveys and following waves will be presented at the conference. Added Value: The findings contribute to emerging evidence on the effects of additional survey burden in mixed-mode probability panels. By investigating invitation effects, the study offers practical insights for panel management, especially regarding contact frequency and respondent burden in newly established panels such as the RKI Panel.
|
| 4:00pm - 5:00pm | 6.3: Modelling people, informing policy: new approaches in the AI era Location: RH, Seminar 03 |
|
|
The last interview - A concept to create digital twins 1HTW Berlin, Germany; 2Splendid Research; 3Xelper 1. Relevance & Research Question 2. Methods & Data 3. Results 4. Added Value Personas++ – Slicing & Dicing the Result Space of a Survey Inspirient, Germany Relevance & Research Question Advances in computational methods, in particular recent advances in Artificial Intelligence (AI), have vastly reduced the manual effort required to derive results from any given survey dataset. This applies equally to structured, quantitative, interview-level data and to qualitative data. For the former, statistical and visualization methods may now be applied automatically; for the latter, sentiments, topics and codes are now easily extracted. As an industry, we’re thus experiencing the commoditization of results. This gives rise to the new questions of how to work efficiently with this overly abundant set of results, how to focus on what matters, and how to tell signal from noise. In this talk, we introduce the concept of the result space, which we define as the set of all possible results that can be derived from a given dataset of interview-level raw survey data by (automatically) applying current analytical methods. Based on our practical work across dozens of surveys over the past years, we propose alternatives for structuring this space, e.g., by variable, by methodology, or by significance of result; we look into alternatives for sorting and ranking results; and we discuss ways of measuring relations between results. To anchor the rather theoretical aspects of our work in everyday practical use, we illustrate the specific applicability of these concepts on real-world survey datasets, and on specific questions that we can now answer: What are the Top 3 things to know among all results relating to a given sub-demographic? Of all the regression analyses, which ones stand out and why? Is there anything I overlooked in this summary that I wrote? We further demonstrate practicality by showcasing the system used to automatically derive the result spaces. Certain slices through the result space of a survey have already proven their practical value: Personas, for example, allow zeroing in on the particular wants, needs, and opinions of a sub-demographic of particular interest. With the toolkit presented in this talk, we generalize this concept, thereby providing the means to more deeply and more effectively investigate increasingly abundant survey results. The EU-ALMPO Project: Rethinking ALMPs through AI-Driven Analysis and Policy Innovation Institute for Social Research (IRS), Italy Relevance & Research Question: Digital transformation, Evidence-based policymaking Amid rapid technological innovation and digital transformation, EU-ALMPO addresses the need for labour market interventions that are more agile, inclusive, and responsive to evolving skill mismatches. Anchoring policy design in data, machine-learning analytics, and stakeholder co-creation, the project aims to create the EU Active Labour Market Policy Observatory – an AI-enabled digital hub that enhances the design, implementation, and evaluation of ALMPs across Member States. By integrating advanced AI tools into a centralised digital platform, the Observatory supports evidence-based decisions and fosters knowledge exchange among policymakers, researchers, and labour-market actors.
Methods & Data Analytical framework, Participatory validation, Comparative policy evaluation Funded under Horizon Europe, EU-ALMPO has completed WP1, which developed the analytical framework underpinning the Observatory. The framework analysed existing ALMP systems, identified structural gaps, and assessed the effectiveness of policies in addressing skills mismatches. It also provides a conceptual bridge - a translation layer - for its integration into the project’s AI-supported system for policy-makers across the EU and beyond. Methodologically, it combines a literature review, a meta-evaluation of ALMPs, and participatory validation with stakeholders from several EU countries. Results Skills mismatch analysis, Servuction model Serving both diagnostic and prescriptive functions for skill mismatch analysis, the framework also deepens reflection on the implications of generative technologies for policy innovation. Inspired by the Servuction Model - bridging front-end and back-end - it develops a user-oriented perspective and translates the knowledge and content developed in the project into actionable items that are valuable and useful for the policy-makers involved. Added Value AI-policy integration, Adaptive governance EU-ALMPO represents a pioneering intersection between labour-market policy and AI. Its analytical framework and its reflections on ways to bridge policy design and AI tools offer a data-driven, adaptive, and inclusive approach that strengthens Europe’s capacity to respond to future labour-market transformations. While the project focuses on policymaking in the area of skills and the labour market, it represents innovative ground for bridging policy and technology and is thus potentially relevant for other policy areas as well.
| 5:15pm - 6:30pm | DGOF: Member Meeting Location: RH, Seminar 01 |
| 7:00pm - 8:00pm | Early Career Speed Networking Event Location: Blue Shell Our GOR Early Career Speed Networking provides an opportunity for all early-career online researchers (i.e. PhD students) and practitioners (i.e. within their first 5 years in the online research industry) to get to know and connect with others in the field in a fun and engaging way. Especially for first-time visitors at GOR, the event offers a great informal and casual way to get to know other conference participants and to meet a diverse group of early career people from different backgrounds, disciplines, and institutions.
|
| 8:00pm - 11:59pm | GOR 26 Party Location: Blue Shell |
| Date: Friday, 27/Feb/2026 | |
| 8:00am - 9:00am | Begin Check-in Location: Rheinische Hochschule, Campus Vogelsanger Straße |
| 9:00am - 10:00am | 7.1: AI and qualitative research Location: RH, Seminar 01 |
|
|
AI-Conducted User Research: From Weeks to Hours Through Autonomous Interviewing Userflix, Germany Relevance & Research Question Methods & Data We developed Userflix, an end-to-end AI platform for qualitative research automation utilizing large language models fine-tuned for research methodology. The system implements: (1) AI-guided study setup through conversational project briefing, (2) real-time audio-to-audio interviews with dynamic follow-up questions, (3) visual stimuli presentation, (4) automatic transcription and analysis, and (5) automated insight extraction with traceability to source interviews. Evaluation pilots (Q3-Q4 2025) are ongoing with Nielsen Norman Group (UX research methodology assessment), Innofact and Skopos (agency workflow integration), and IKEA (multilingual European research). Partners are systematically comparing AI versus human interview quality, transcript depth, and participant experience. Results Early evaluation feedback demonstrates 95% time reduction (weeks to hours) and 90% cost reduction (€36/hour vs €500-750/interview). The AI successfully conducts multilingual interviews, generates contextual follow-up questions, and enables unprecedented scale (50-500 interviews vs traditional 8-12), allowing statistical pattern recognition in qualitative data. Nielsen Norman Group is assessing methodology soundness against established standards. Agency partners report high participant comfort, with some showing greater openness on sensitive topics with AI interviewers. The platform's 24/7 availability increased completion rates by 40% compared to scheduled interviews. Key advantages include parallel execution, elimination of interviewer bias, and consistent quality. Added Value This research demonstrates that AI can augment human researchers by handling routine execution, enabling focus on strategic interpretation. The "quantified qualitative" approach—conducting 50-500 interviews instead of 8-12—bridges qualitative depth with quantitative validation, addressing the longstanding trade-off between scale and depth. For the online research community, this represents making comprehensive qualitative research accessible to broader audiences while elevating professional researchers to strategic roles. Evaluation results will provide evidence-based guidance for AI research tool adoption and quality standards. Augmenting Qualitative Research with AI: Topic Modeling with Agentic RAG 1Freie Universität Berlin, Germany; 2Deutsche Hochschule, Germany; 3Lee Kong Chian School of Business, Singapore Management University, Singapore Relevance & Research Question Large Language Models (LLMs) increasingly shape qualitative and computational social science research, yet their use for text data analysis using topic modeling remains limited by low transparency, unstable outputs, and prompt sensitivity. Traditional approaches such as LDA often produce overlapping, generic topics, whereas LLM prompting lacks consistency and reproducibility. We introduce Agentic Retrieval-Augmented Generation (Agentic RAG) - a multi-step, agent based LLM pipeline designed to improve efficiency, transparency, consistency, and theoretical alignment in qualitative text analysis. Our study addresses two research questions: (1) How does Agentic RAG perform compared to LDA and LLM prompting in terms of topic validity, granularity, and reliability across datasets? and (2) How can Agentic RAG be extended to enable theory advancement through “lens-based” retrieval? 
Methods & Data We benchmark Agentic RAG against LDA and LLM prompting using three heterogeneous datasets: (i) the 20 Newsgroups corpus (online communication), (ii) the VAXX Twitter/X dataset (data on vaccine hesitancy), and (iii) a qualitative interview corpus from an organizational research context. Agentic RAG is implemented as a model-agnostic, agent-based pipeline that orchestrates retrieval, data analysis, and topic generation. In our analysis, Agentic RAG was applied to produce topics using different GPT models (GPT-3.5, GPT-4o, GPT-5). We evaluate all methods using standardized metrics: topic validity, topic overlap, and inter-round semantic reliability, computed via cosine similarity measures that extend prior topic quality metrics (an illustrative sketch follows this session block). Results Across datasets, Agentic RAG consistently yields high-validity topics with minimal redundancy compared to both LDA and LLM prompting. Whereas LDA and LLM prompting perform well only on specific datasets, Agentic RAG maintains performance across heterogeneous data architectures, while being more transparent and efficient. Based on these results, we derive a structured trade-off table that summarizes the strengths and limitations of all approaches, providing qualitative and computational scholars with clear guidance for selecting an appropriate text analysis method. Added Value Our findings demonstrate that Agentic RAG offers a scalable, transparent, and reproducible approach for qualitative text analysis. The method strengthens the rigor of LLM-based qualitative research by enabling more stable outputs, explicit retrieval reasoning, and broader options for assessing topic quality. Reinventing Online Qualitative Methods: Lessons from an AI-Assisted Study on Pathways Out of Loneliness 1Hochschule Trier, Germany; 2Bilendi&respondi Relevance & Research Question Loneliness has emerged as a growing social and public health concern that increasingly affects younger age groups. In response, the state government of North Rhine-Westphalia has launched multiple initiatives and established a competence network to counteract loneliness. Against this backdrop, the present study examines the role of digital technologies in both the emergence and alleviation of loneliness. The research focuses on three interconnected key questions: 1) How do technological environments, ranging from face-to-face communication tools to digital social platforms, shape experiences of social connectedness and emotional well-being, and to what extent may they contribute to or mitigate feelings of loneliness? 2) What role do so-called third places play in individuals’ perceptions of social belonging and connectedness? 3) How are digital media used to build and maintain social relationships, and under what conditions are digitally mediated interactions transferred into offline contexts? Methods & Data The study employs a qualitative research design based on more than 150 participants, and was conducted using BARI, the qualitative AI developed by Bilendi. Participants engaged via WhatsApp or Facebook Messenger over roughly one week. BARI supported almost the entire research process, including project flow, moderation, data analysis and reporting. The AI-based moderation is methodologically notable as the absence of a human interviewer may foster greater openness when discussing sensitive topics such as loneliness, potentially reducing social desirability bias.
This setup allowed us to collect rich narrative data while simultaneously enabling an empirical assessment of the methodological implications of AI-supported qualitative research. Results Beyond substantive insights into perceptions and experiences of loneliness, the presentation will highlight methodological findings regarding the strengths, weaknesses and challenges of AI-assisted qualitative research. The integration of participant feedback and researcher reflection will be shown to play a central role in improving the AI’s performance and refining its methodological contribution to future research. Added Value The study provides dual added value: empirically, it offers new insights into how digital media and social spaces shape loneliness; methodologically, it delivers one of the first systematic assessments of AI-moderated qualitative fieldwork, demonstrating its potential and its limitations for scalable, participant-centered online research. |
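A minimal Python sketch of cosine-similarity metrics for topic overlap (within one run) and inter-round semantic reliability (across repeated runs), as referenced in the Agentic RAG abstract above. The embedding model and the exact metric definitions are illustrative assumptions, not the authors' implementation.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def topic_overlap(topics):
    """Mean pairwise similarity among the topics of one run (lower = less redundancy)."""
    sims = cosine_similarity(model.encode(topics))
    upper = sims[np.triu_indices(len(topics), k=1)]
    return float(upper.mean())

def inter_round_reliability(run_a, run_b):
    """For each topic in run A, similarity to its best match in run B, averaged."""
    sims = cosine_similarity(model.encode(run_a), model.encode(run_b))
    return float(sims.max(axis=1).mean())

run1 = ["vaccine side effects", "government mandates", "misinformation sharing"]
run2 = ["adverse reactions to vaccines", "mandatory vaccination policy", "spread of false claims"]
print(topic_overlap(run1), inter_round_reliability(run1, run2))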
| 9:00am - 10:00am | 7.2: New insights on satisficing Location: RH, Seminar 02 |
|
|
Is ‘don’t know’ good enough? Maximizing vs satisficing decision-making tendency as a predictor of survey satisficing Technical University Darmstadt, Germany Relevance & Research Question Respondents who do not go through the question-answer process optimally and instead exhibit satisficing behavior are a longstanding problem for survey researchers. A growing body of studies examines stable, time-invariant predictors of survey satisficing behavior, such as personality traits. These predictors make it possible to measure the potential to satisfice before actual survey satisficing behavior occurs. This study aims to add to this research by introducing a potentially stable and reliable predictor of survey satisficing. Building on the notion that the question-answer process is a decision-making process and satisficing behavior is caused by low aspiration decisions, the effect of a decision-making tendency to maximize (in opposition to satisfice) on survey satisficing behavior is modelled. Data was gathered in October 2024 from 2,911 respondents within the Bilendi non-probability online access panel. Due to its short length, the generalizability of items and reduced dimensions of the scale, the modified maximizing scale by Lai (2010) is applied in this study to measure maximizing with its opposite pole satisficing. As dependent variables, “don’t know” responding and midpoint choosing in four single-choice questions were measured. To explain each of them, two multilevel models with questions on level 1 and persons on level 2 were calculated to capture all four questions in one model. Langer’s (2020) extension of the McKelvey & Zavoina (1975) Pseudo R² for multilevel logistic regression models was obtained for the baseline models. Respondents who score medium to high on the maximizing scale exhibit significantly less “don’t know” responding and midpoint choosing than those scoring lower. However, the magnitude of the effect is small. The maximizing scale explains only a small amount of between-person variance in the examined satisficing behaviors. By affecting satisficing behavior in surveys, maximizing is not only a source of bias in survey data. The short and compact scale can be integrated into surveys to measure a respondent’s potential to exhibit satisficing behavior before it occurs. With the help of LLMs, consecutive survey questions could be tailored to said potential.
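A minimal Python sketch of a latent-variable pseudo-R² for a random-intercept logistic model, in the spirit of the McKelvey & Zavoina measure cited above. It follows the common Snijders & Bosker formulation; Langer's (2020) extension may differ in detail, and the inputs are assumed to come from an already fitted two-level model with persons at level 2.

import numpy as np

def mz_pseudo_r2(linear_predictor, random_intercept_var):
    """Explained variance on the latent scale of a random-intercept logit."""
    fixed_var = np.var(linear_predictor)   # variance of the fixed part of the model
    residual_var = np.pi ** 2 / 3          # level-1 variance of the standard logistic distribution
    return fixed_var / (fixed_var + random_intercept_var + residual_var)

# Hypothetical example values for the fixed linear predictor and intercept variance
eta = np.random.default_rng(1).normal(loc=0.0, scale=0.8, size=2911)
print(round(mz_pseudo_r2(eta, random_intercept_var=1.2), 3))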
Measuring Response Effort and Satisficing with Paradata: A Process-Based Approach in the Czech GGS II Masaryk University, Czechia Relevance & Research Question A Data-Driven Approach for Detecting Speeding Behavior in Online Surveys Robert Koch Institute, Germany Relevance & Research Question Online surveys provide a great opportunity for researchers to collect response times alongside the participants’ answers. Extremely short response times—known as speeding—can indicate careless responding. Previous studies often identified speeding behavior using fixed cutoffs, for example those derived from average reading speeds reported in the literature. Although empirically motivated, these thresholds overlook other important cognitive demands of the questions and differences between respondents. This study introduces a probabilistic mixture modeling approach to identify speeding behavior in the “Health in Germany” probability-based panel of the Robert Koch Institute (RKI). We validate this approach by comparing its classifications against established methods and by analyzing correlations with other indicators of data quality. Methods & Data Response times from the CAWI participants of the 2024 regular panel wave (N≈30,000) were analyzed using a shifted lognormal–uniform mixture model. As is standard practice in response-time analysis, the lognormal component represents regular, attention-based responses. The uniform component captures implausibly short response times (“speeding”). The shift parameter models the minimal realistic answering time. Hierarchical model specifications allow for variation across survey items and respondents. Fixed effects allow us to estimate how model features such as speeding likelihood and minimal attention-based answering time vary as a function of characteristics of the participant (e.g., age) and the item (e.g., number of words). Results Preliminary results show that the model accurately reproduces the empirical distribution of response times and allows for the calculation of speeding probabilities per response, participant and question. Speeding probabilities vary substantially across participants, suggesting that individual differences are the dominant source of speeding behavior, while item-level differences are smaller but still substantial. Further analyses, to be completed before the conference, will examine how participant and item characteristics correlate with model parameters, how speeding behavior correlates with other indicators of data quality, and how the model performs in out-of-sample prediction. This study demonstrates the practical value of response-time modeling for improving data quality diagnostics in online panels. By quantifying speeding probabilities instead of applying fixed cutoffs, the method supports more data-driven cleaning and better understanding of response behavior.
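A minimal Python sketch of the shifted lognormal-uniform mixture described above, fitted by maximum likelihood for a single item on simulated response times. The hierarchical structure (item- and respondent-level parameters) of the actual model is omitted, so this only illustrates the basic mixture idea.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
t_max = 60.0
rt = np.concatenate([
    0.5 + rng.lognormal(mean=1.5, sigma=0.4, size=900),  # attentive responses (shifted lognormal)
    rng.uniform(0, t_max, size=100),                     # speeding responses (uniform)
])

def neg_log_lik(params):
    shift, mu, log_sigma, logit_p = params
    sigma, p_speed = np.exp(log_sigma), 1 / (1 + np.exp(-logit_p))
    attentive = stats.lognorm.pdf(rt, s=sigma, loc=shift, scale=np.exp(mu))
    speeding = stats.uniform.pdf(rt, loc=0, scale=t_max)
    return -np.sum(np.log(p_speed * speeding + (1 - p_speed) * attentive + 1e-300))

fit = optimize.minimize(neg_log_lik, x0=np.array([0.1, 1.0, -1.0, -2.0]), method="Nelder-Mead")
shift, mu, log_sigma, logit_p = fit.x
p_speed = 1 / (1 + np.exp(-logit_p))

# Posterior probability that each individual response comes from the speeding component
attentive = stats.lognorm.pdf(rt, s=np.exp(log_sigma), loc=shift, scale=np.exp(mu))
speeding = stats.uniform.pdf(rt, loc=0, scale=t_max)
speed_prob = p_speed * speeding / (p_speed * speeding + (1 - p_speed) * attentive + 1e-300)
print(f"estimated speeding share: {p_speed:.2f}; flagged responses: {(speed_prob > 0.5).sum()}")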
|
| 9:00am - 10:00am | 7.3: Designing inclusive and engaging surveys Location: RH, Seminar 03 |
|
|
Accessibility and inclusivity in self-completion surveys: An evidence review 1University of Southampton, United Kingdom; 2City St George’s, University of London; 3Institute for Social and Economic Research, University of Essex, United Kingdom Relevance & Research Question Survey research aims to understand social issues and inform effective public policy. For results to be accurate and equitable, surveys must inclusively represent diverse population sub-groups. Excluding these groups can lead to biased data and policies that perpetuate inequalities. Consequently, inclusivity is now a core principle for major statistical bodies in the UK like the UK Statistics Authority. This has led to a “respondent-centred design” approach, which argues that making surveys accessible for marginalised groups often benefits all respondents (Wilson and Dickinson 2021). However, achieving greater inclusivity involves practical trade-offs, as measures like targeted procedures, alternative response modes, or survey questionnaire translations or adaptations can often be resource intensive. Evidence of inclusivity practices implemented as part of probability-based self-administered surveys is scarce, and research is required to determine best practice recommendations. This evidence review highlights measures that aim to increase participation for harder-to-survey population sub-groups in self-administered surveys, while maintaining the goal of obtaining high-quality, representative data. We focus on two main population subgroups: (1) individuals with disabilities and impairments and (2) individuals with literacy and/or language limitations. Results This evidence review identifies general recommendations for recruitment practices to facilitate the inclusion of these frequently excluded sub-groups. It also highlights the cost trade-offs involved in implementing these methods, beyond the ethical imperative for inclusivity. Added Value This study addresses an under-researched area by providing evidence-based, practical recommendations for enhancing participation, accessibility and inclusivity in large-scale surveys. Effectiveness of the knock-to-nudge approach for establishing contact with respondents: Evidence from the National Readership Survey (PAMCo) and National Survey for Wales (NSW) in the UK University of Southampton, United Kingdom Relevance & Research Question Knock-to-nudge is an innovative method of household contact, first introduced during the COVID-19 pandemic when face-to-face interviewing was not possible. In this approach, interviewers visit households and encourage sampled units to participate in a survey through a remote survey mode (either web or telephone) at a later date. Interviewers can also collect contact information, such as telephone numbers or email addresses, or conduct within-household selection of individuals on the doorstep if required. This approach continued to be used post-pandemic in a number of surveys, but there remains a knowledge gap regarding its advantages and limitations. It is still unclear whether the knock-to-nudge approach leads to improvements in sample composition and data quality. Methods & Data We analysed data from two UK surveys: the National Readership Survey (PAMCo) and the National Survey for Wales (NSW), each of which employed different versions of the knock-to-nudge approach.
Our aim was to determine whether this method improves survey participation and sample composition, and to assess how incorporating participants recruited via knock-to-nudge affects data quality and responses to substantive questions. We investigate these effects using descriptive analyses, statistical tests, and logistic regression models. Results Our findings demonstrate that knock-to-nudge is associated with: (1) a significant increase in response rates, (2) improved sample composition, (3) higher item non-response, and (4) significant differences in responses to substantive survey questions. Added Value This study contributes to the under-researched area of knock-to-nudge methods. The results indicate that, when carefully designed and implemented, this approach can enhance recruitment efforts and improve the composition of the resulting samples. However, its viability as a universal solution for mixed-mode surveys depends on whether these methodological benefits outweigh the potential compromises in data quality and the additional implementation costs. How do Respondents Evaluate a Chatbot-Like Survey Design? An Experimental Comparison With a Web Survey Design Technical University of Darmstadt, Germany Relevance & Research Question: Web surveys efficiently collect data on attitudes and behaviors, but often face challenges like satisficing behavior. The increasing prevalence of respondents using smartphones to answer surveys has brought about additional design challenges. The application of a messenger design as a web survey interface offers the opportunity to mitigate some of the drawbacks of a responsive web survey design. Recent studies have demonstrated that a chatbot-like survey design may provide higher quality responses and greater engagement, albeit with longer response times. This study explores the respondents’ evaluation of using a messenger interface in a web survey setting. Methods & Data: In 2025, a sample of 2,123 members of a non-probability online access panel in Germany answered a survey on the topic of “vacation”. The sample was cross-stratified by age and gender and limited to respondents aged 18-74. In a field experiment employing a between-subjects design, respondents were randomly assigned to either a web survey design or to a chatbot design that mimics a messenger interface. In addition to survey duration, we assess the respondents’ evaluation concerning user experience, perceived social presence, perceived flow, ease of use, and general satisfaction. Results: Overall, respondents in the chatbot condition were less satisfied with the survey and it took them longer to answer the questions. They also experienced lower levels of flow and ease of use. There was no significant difference in user experience, and survey-related social presence was lower only for respondents using a mobile device. Older respondents, female respondents, and respondents with a higher education degree seem to evaluate the chatbot design more favorably than younger respondents, male respondents, and those with lower levels of education. However, the preference for the web survey design is generally confirmed for all respondent groups irrespective of age, education, and gender, and also for respondents using a desktop or a mobile device. Added Value: This study contributes to an assessment of using a messenger interface for the administration of survey questions. We discuss the results in light of the recent trend towards applying chatbot-like interfaces in AI-supported surveys. |
| 10:00am - 10:15am | Break Location: RH, Lunch Hall/ Cafeteria |
| 10:15am - 11:00am | 8: Keynote 2: TBA Location: RH, Auditorium |
| 11:00am - 11:45am | 9: Award Ceremony Location: RH, Auditorium |
| 11:45am - 12:00pm | Break Location: RH, Lunch Hall/ Cafeteria |
| 12:00pm - 1:00pm | 10.1: Smart surveys and interactive survey features Location: RH, Seminar 01 |
|
|
Providing extra incentives for open voice answers in web surveys 1DZHW; Leibniz University Hannover, Germany; 2RECSM-University Pompeu Fabra, Spain; 3University of Michigan, USA Relevance & Research Question Methods & Data Results Added Value Alexa, Start the Interview! Respondents’ Experience with Smart Speaker Interviews Compared to Web Surveys 1Technical University of Darmstadt, Germany; 2Former Postdoc at Technical University of Darmstadt, Germany Relevance & Research Question Methods & Data Do respondents show higher activity and engagement in app-based diaries compared to web-based diaries? A case study using Statistics Netherlands’ Household Budget Diary. 1Utrecht University, The Netherlands; 2Statistics Netherlands Relevance & Research Question Smartphones offer opportunities for official statistics, promising improved user experience, reduced response burden, and higher data quality. We investigate whether respondents show higher activity and engagement in app-based diaries compared to traditional web-based diaries. Methods & Data We use Statistics Netherlands’ Household Budget Survey (HBS) as a case study. The HBS is a diary survey conducted every five years to capture household expenditure on goods and services. In 2020, Statistics Netherlands conducted a 4-week web-based survey (N ≈ 3,000). In 2021, they conducted a 2-week app-based survey on a smaller sample (N ≈ 700). We compare participation and response behavior of respondents in the two modes. Among the indicators are the amount and spread of the reporting. Results First results show that initial dropout is higher in the web diary. During the first two weeks, dropout is gradual (~1% per day) and very similar across modes. The percentage of registered respondents who submit at least one purchase and/or validate at least one day is higher among app respondents. More results on objective burden (time spent in study) and reporting patterns will follow. Added Value App-based surveys are currently transitioning from the pilot phase to full implementation in panel surveys and official statistics. Our goal is to evaluate the expectation that app-based data collection enhances activity and user engagement in diary studies. |
| 12:00pm - 1:00pm | 10.2: Data donation Location: RH, Seminar 02 |
|
|
Data donations in online panels: Factors influencing donation probability 1GESIS - Leibniz Institute for the Social Sciences, Germany; 2Utrecht University Relevance & Research Question Methods & Data Results In general, donation probabilities varied considerably across studies and were strongly influenced by study design and contextual factors, such as donation method, the requested data type, and the recruitment strategy. Studies with a lower mean age generally showed higher donation probabilities. However, online panels displayed more consistent donation probabilities compared to cross-sectional recruitments. Potential moderating factors, such as participants’ familiarity with online data collections and trust in the institution receiving the donation, warrant further consideration. Added Value Motivations, Privacy, and Data Types: What Drives WhatsApp Chat Data Donation in a Probability Sample? 1GESIS, Germany; 2University of Mannheim, Germany; 3University of Michigan, USA Relevance & Research Question Data donations are increasingly discussed as a valuable source for social science research. However, little probability-based evidence exists on what drives individuals’ hypothetical willingness to donate personal communication data. We examine three components that could influence consent behavior: motivational framing (societal benefit, personal benefit, no benefit information), privacy (consent from chat partners required vs. consent not required), and the requested data type (aggregated metadata vs. full chat content). Methods & Data The study was fielded in the May 2025 wave of the German Internet Panel (GIP), a probability-based online panel of the German adult population. Respondents were randomly assigned to one of 18 conditions in a 3×2×2 experimental design. The dependent variable was a binary measure of hypothetical willingness to donate WhatsApp chats, and we controlled for characteristics such as WhatsApp usage intensity and perceived data sensitivity. Results Overall, 11% of respondents (approx. 350 individuals) reported willingness to donate their chat data. Willingness was higher when societal benefit was emphasized (13%) compared with personal benefit or no benefit information (each 10%). Requiring consent from chat partners showed no significant effect (12% vs. 10%). Participants were more willing to donate when full chat content was requested (15%) rather than aggregated metadata (8%). Multivariate analyses indicated that willingness increased among heavy WhatsApp users and decreased with higher perceived data sensitivity. No significant interaction effects across experimental factors were found. Added Value This study provides one of the first probability-based examinations of willingness to donate WhatsApp chat data—an especially sensitive and understudied data type. The results indicate that, in this hypothetical context, variation in content sensitivity influenced stated willingness less than expected. These findings offer empirically grounded guidance for implementing future data donation infrastructures and highlight which informational cues and design choices may reduce barriers, increase trust, and support responsible integration of chat-based digital trace data into social research. For example, trying to limit the content sensitivity of a data donation request may not be as promising a way to increase data donation rates as simply emphasizing the research’s societal benefit.
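As a generic illustration of how such a factorial design could be analyzed (a minimal main-effects sketch on simulated data; the variable names are assumptions and this is not the authors' analysis code):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the experimental data
rng = np.random.default_rng(0)
n = 500  # illustrative size only
df = pd.DataFrame({
    "framing": rng.choice(["societal", "personal", "none"], n),
    "partner_consent": rng.choice(["required", "not_required"], n),
    "data_type": rng.choice(["metadata", "full_content"], n),
    "usage": rng.integers(1, 6, n),        # WhatsApp usage intensity (1-5)
    "sensitivity": rng.integers(1, 6, n),  # perceived data sensitivity (1-5)
})
logit_p = -2 + 0.3 * (df["framing"] == "societal") + 0.6 * (df["data_type"] == "full_content")
df["willing"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))  # 0/1 hypothetical willingness

# Main-effects logit of willingness on the three experimental factors plus controls
model = smf.logit("willing ~ C(framing) + C(partner_consent) + C(data_type) + usage + sensitivity",
                  data=df).fit(disp=0)
print(model.summary())

Interaction terms (e.g., C(framing):C(data_type)) could be added to probe the cross-factor effects that the abstract reports as non-significant.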
Motivate and persuade: Testing strategies to increase participation in data donation studies 1University of Mannheim, Germany; 2Institute for Employment Research, Germany; 3University of Klagenfurt, Austria; 4LMU Munich, Germany Relevance & Research Question Methods & Data Results Added Value |
| 12:00pm - 1:00pm | 10.3: Social media recruitment Location: RH, Seminar 03 |
|
|
Static or Animated? How Ad Design Shapes Survey Recruitment GESIS, Germany Relevance & Research Question Social networking sites have become popular tools for recruiting survey respondents through targeted advertisements. Ad design is crucial, as it must capture users’ attention within seconds. While previous studies highlight the relevance of ad design in recruitment performance, sample composition, and data quality, they have almost exclusively focused on static images. However, static images represent only one possible design format, and the potential effects of animated visuals remain underexplored. This study extends prior research by systematically comparing the effects of static and animated ad images on two key aspects of survey recruitment: sample composition and response quality. It addresses the following research questions: How are different visual elements (static vs. animated) related to sample composition? How are different visual elements (static vs. animated) related to response quality? Methods & Data Data stem from the recruitment campaign for the new online panel GP.dbd, which combines survey data with digital behavior data (e.g., web tracking and app data). The target population comprised adults living in Germany, and recruitment was conducted in 2023 via Facebook and Instagram. Four static images and their animated counterparts were tested. Differences between the two ad formats were examined using descriptive and comparative analyses focusing on respondent demographics and data quality indicators. Results Findings show that the visual format of an ad image influences both who participates and how attentively respondents engage with the survey. Static images tend to attract women and highly educated individuals, whereas animated ads appeal more to men, those with lower or middle education, and older respondents. Analyses of extreme response times, item nonresponse, and break-offs yielded mixed findings, suggesting that animation influences attentiveness in complex and context-dependent ways. Added Value These results highlight that ad design choices can subtly shape both the composition and engagement of recruited samples. Rather than favoring one format over the other, the findings suggest that static and animated ads serve different purposes in recruitment. A practical implication is to combine both formats strategically. Together, these insights provide nuanced, evidence-based guidance for researchers and practitioners seeking to optimize recruitment on social media platforms. Is a Video Worth a Thousand Pictures? The Effect of Advertisement Design on Survey Recruitment with Social Media 1University of Mannheim, Germany; 2University of Warwick, UK Relevance & Research Question Social media platforms, such as Facebook and Instagram, are increasingly used for survey recruitment, particularly for targeting hard-to-reach populations. Previous research has shown that the visual design of advertisements plays a key role in the effectiveness and costs of recruitment and the data quality of the resulting samples (e.g., Donzowa et al., 2025; Höhne et al., 2025). Pictures are typically used in the advertisements to attract the social media users’ attention and motivate them to click on the survey invitation. However, there is still limited understanding of how other visual formats, in particular videos, influence survey recruitment and data quality. In this study, we examine the effectiveness of pictures vs.
short videos for the survey recruitment of young adults with social media. We hypothesize that videos are more engaging than pictures, resulting in a larger number of completed surveys at a lower cost, but do not expect differences in sample composition and response quality. Methods & Data We will conduct an online survey in December 2025 among young people aged 18-25 from Germany who traveled with Interrail in the last year. The survey includes questions about their travel behavior, attitudes towards the European Union, and European identity. Survey respondents are recruited through Meta (Facebook and Instagram). Using snowball sampling, respondents are also asked to forward the survey invitation to other people who fit the target criteria. In the ad campaign, we use pictures showing different contents, such as a person within a train, trains in front of different landscapes, and railway stations. We create corresponding videos with the AI software Midjourney by setting the respective pictures as the starting frame. We compare the effects of the two advertisement formats on survey recruitment regarding their effectiveness, measured by the number of completed surveys and referrals through snowball sampling, and their cost efficiency. Furthermore, we evaluate differences in sample balance across sociodemographics, in particular age and gender, and response quality, measured by completion time and item nonresponse. The findings will be highly relevant for survey practitioners who plan to recruit respondents through social media. Social Media Sampling to Reach Migrant Populations for Market and Opinion Research Bilendi Relevance & Research Question Traditional survey methods consistently face critical challenges in achieving adequate coverage and response rates among migrant populations, leading to significant sampling bias in market and opinion research. Understanding these diverse groups is vital for both commercial and public sector decision-making. Research Question: Can targeted, non-probability sampling methods utilizing social media platforms (SMs) effectively recruit demographically diverse and representative samples of specific migrant populations in European countries, and how do the resulting data quality and efficiency metrics compare to surveys via online access panels? Methods & Data We ran five online surveys between January and October 2025, focusing on first- and second-generation migrants from Turkish and Arabic origin countries residing in France, Germany and Belgium. We used stratified recruitment campaigns across Meta platforms (Facebook/Instagram), utilizing the advertising API for targeting based on age, gender and language. The surveys were run in the local languages as well as Arabic and Turkish. The collected data were compared with online access panel data regarding metadata on response behaviour (survey speed, devices used, drop-outs…) as well as survey results, including a deeper analysis and comparison of language impacts on survey results. Results The Social Media Sampling (SMS) method demonstrated clear advantages in efficiency and reach. Crucially, social media proved highly effective at accessing younger and lower-assimilation migrant cohorts who are severely underrepresented in standard frames. The results showed, for example, that participants had significantly better knowledge of Turkish and Arabic compared to panel members. In addition, the proportion of Muslims was on average 20 percentage points higher than in the online access panels.
Furthermore, many of the participants recruited via social media were first-generation migrants, and we observed an overrepresentation of people who had recently moved to the respective country. Added Value This research provides an essential, validated framework for survey practitioners, demonstrating that social media can be leveraged as a rapid and reasonably cost-effective primary recruitment tool for hard-to-reach, mobile populations. It offers a robust, tested procedure for mitigating sampling biases compared to online panels. Ultimately, this study promotes greater inclusivity and accuracy in survey results by ensuring the reliable representation of migrant voices. |
| 12:00pm - 1:00pm | 10.4: Curated Session: Collect, Share, Act: The Power of Activated Knowledge Location: RH, Auditorium This session brings together diverse perspectives on how organisations can manage consumer or audience knowledge more effectively and translate insights into action. Experts from technology, media, and in-house research share practical experiences and strategic reflections on making insights accessible, connected, and truly impactful within organisations. |
|
|
"Mind the Gap!" On the Importance of Data Literacy and Knowledge Management in the Digital Age ARD MEDIA/agma, Germany TBA From Insights to Impulses for action Deutsche Welle, Germany TBA Values-based customer targeting in the age of AI Uranos GmbH, Germany TBA |
| 1:00pm - 2:00pm | Lunch Break Location: RH, Lunch Hall/ Cafeteria |
| 2:00pm - 3:00pm | 11.1: Sampling and weighting Location: RH, Seminar 01 |
|
|
Enhancing data accuracy in KnowledgePanel Europe: Leveraging different weighting techniques and adjustment variables for optimal outcomes Ipsos Relevance & Research Question: Although online probability-based panels aim for accuracy, they can exhibit a left-leaning bias in public opinion research due to the overrepresentation of politically and civically engaged individuals. Researchers employ weighting techniques to correct sample imbalances relative to the population. This study aims to assess the extent to which diverse adjustment variables and weighting techniques can mitigate this left-leaning bias and enhance the accuracy of estimates from probability-based panels. Methods & Data: To gauge the relative advantages of various adjustment procedures and variables, each was evaluated based on its success in reducing bias for different benchmarks from high-quality, "gold-standard" surveys. These benchmarks cover a range of topics, such as civic engagement, living situation, and technology use. Besides biases, the variance or precision of estimates is crucial. The "margin of error" (MOE) describes the expected variability in survey estimates if the survey were repeated multiple times under identical circumstances. The MOE is calculated for estimates from all benchmark variables to see how different weighting procedures and variables affect variability. Results: Initial findings reveal variability in left-leaning bias in KP Europe samples. While various weighting methods effectively reduce bias and align results with population distributions, the choice of adjustment variables significantly affects the accuracy of the estimates. Additionally, incorporating political variables alongside basic demographics affects the MOE differently across KP Europe's countries. Added Value: This study highlights the critical role of adjustment variables in improving the accuracy of estimates and provides valuable insights into the effectiveness of weighting techniques for reducing bias in political and public opinion research across diverse European contexts.
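As a generic aside on the MOE under weighting (a minimal sketch using Kish's design-effect approximation; not Ipsos's internal procedure, and the weight distribution below is simulated):

import numpy as np

def weighted_moe(weights, p=0.5, z=1.96):
    # Approximate margin of error for a proportion p when weights are applied,
    # using Kish's design effect deff = 1 + cv(weights)^2.
    w = np.asarray(weights, dtype=float)
    deff = 1.0 + (w.std(ddof=0) / w.mean()) ** 2  # Kish approximation
    n_eff = len(w) / deff                         # effective sample size
    return z * np.sqrt(p * (1.0 - p) / n_eff)

# Example: 1,000 respondents with moderately variable (simulated) weights
rng = np.random.default_rng(1)
print(weighted_moe(rng.lognormal(mean=0.0, sigma=0.4, size=1000)))

Adding more adjustment variables typically increases weight variability and hence the design effect, which is one way the choice of adjustment variables feeds into the reported MOEs.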
Do sampling and stratification strategies matter for treatment effects of political and media experiments administered online? The SOM Institute, University of Gothenburg, Sweden Relevance & Research Question Exploring the representativeness of web-only surveys of the general population 1Institute for Social and Economic Research, University of Essex; 2Department of Social Statistics and Demography, University of Southampton Relevance & Research Question • RQ1: How have internet exclusion and intensity of internet use changed over time? • RQ2: What are the characteristics of different types of internet users and non-users? How representative are these groups? How has this changed over time? • RQ3: How does the representativeness of web respondents compare to the representativeness of different groups of internet users? How has this changed over time? Methods & Data We use coefficients of variation of the response propensities to estimate the representativeness of internet users and web respondents with regard to a set of auxiliary variables. The results offer valuable empirical evidence about the quality of web-only surveys in the past and present, which will assist survey practitioners in understanding the opportunities and risks of conducting web-only surveys now and in the near future. |
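Relating to the web-only representativeness study above: the coefficient of variation of estimated response propensities (closely related to the R-indicator) can be sketched generically as follows; the auxiliary variables, model formula, and simulated data are illustrative assumptions, not the authors' specification:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def propensity_cv(df, formula="responded ~ age_group + sex + region"):
    # Estimate response propensities from auxiliary variables and return their
    # coefficient of variation (a larger CV indicates a less representative response set).
    phat = smf.logit(formula, data=df).fit(disp=0).predict(df)
    return phat.std(ddof=0) / phat.mean()

# Simulated illustration
rng = np.random.default_rng(2)
n = 2000
df = pd.DataFrame({
    "age_group": rng.choice(["18-34", "35-54", "55+"], n),
    "sex": rng.choice(["female", "male"], n),
    "region": rng.choice(["north", "south"], n),
})
df["responded"] = rng.binomial(1, 0.3 + 0.2 * (df["age_group"] == "55+") + 0.05 * (df["sex"] == "female"))
print(propensity_cv(df))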
| 2:00pm - 3:00pm | 11.2: Ensuring participation Location: RH, Seminar 02 |
|
|
The effects of push-to-complete reminders The SOM Institute, University of Gothenburg, Sweden Relevance & Research Question Survey researchers often allow respondents to fill out questionnaires both online and by paper-and-pencil forms in a mixed-mode fashion. However, respondents who choose to complete questionnaires on paper tend to submit their questionnaires with fewer unanswered questions than respondents who choose to complete questionnaires online. Furthermore, many respondents who choose to fill out a questionnaire online do so without ever submitting their questionnaire. The aim of the present study is to evaluate whether respondents who have chosen to fill out a questionnaire online but have not yet submitted it are more likely to submit their questionnaires, and to submit them with fewer unanswered questions, when they receive a digital push-to-complete reminder than similar respondents who do not receive such a reminder. Methods & Data The assessment will be made on a self-administered push-to-web mixed-mode survey (web and sequential paper-and-pencil questionnaire) distributed to a random sample of 9,000 individuals residing in Gothenburg, Sweden. In the experiment, potential respondents were randomly assigned to one of two groups: one group was sent digital push-to-complete reminders a few days after they had started but not submitted the questionnaire online, whereas the other group did not receive such a reminder. Results Data collection began in August 2025 and will be completed in early January 2026. Empirical results from the experiment will not be available until January 2026. Once data collection is complete, this abstract will be amended with the results of the experiment. The data will be analyzed by comparing response rates (RR1) and data quality between the treatment group and the control group. Added Value The present study contributes to existing research on survey reminders by examining whether digital push-to-complete reminders can increase response rates and data quality in push-to-web mixed-mode questionnaires specifically, and online questionnaires generally. The Effect of Survey Burden and Interval Between Survey Waves on Panel Participation: Experimental Evidence from the GLEN Panel 1RPTU Kaiserslautern-Landau, Germany; 2LMU Munich Relevance & Research Question Methods & Data We use data from the German Longitudinal Environmental Study (GLEN), a large-scale, nationwide randomly sampled panel on environmental topics launched in 2024. The experiments were implemented in an inter-wave survey in September 2025, followed by a panel wave in November 2025 that is used to measure participation effects. Experiment 1 The first experiment investigates the effect of the time interval between survey waves by not inviting a random 10% (N = 1,727) of the eligible sample (N = 16,772) to the inter-wave survey. We expect longer intervals between panel waves to reduce participation rates, as a higher survey frequency creates habituation and increases familiarity and engagement with the panel. Experiment 2 In the second experiment, we examine the effect of complexity and thematic content of the questionnaire. In a first random split, 25% of participants received a long item battery on climate change skepticism, expected to increase burden due to its repetitive nature, while the rest answered a more diverse module on internet use.
In a second random split, 90% were assigned a complex factorial survey experiment on CO2 pricing policies, expected to increase burden through topic complexity, while the rest answered questions on cultural participation. The assumed completion time was kept constant across groups, allowing us to isolate the effect of the questions on subsequent participation. For both experiments, we analyze the effect on participation in the next panel wave. Results Data collection will be completed by the end of 2025. We will present results at the conference. Added Value We contribute to the literature on panel nonresponse and survey experience by providing experimental evidence from a nationwide randomly sampled panel. Are interviewer administered follow-ups of web non-respondents still needed to maximise data quality? Evidence from Understanding Society: the UK Household Longitudinal Study 1University of Southampton, United Kingdom; 2University of Essex, United Kingdom Relevance & Research Question Many surveys have transitioned to online data collection. To minimize the risk of nonresponse bias, surveys often adopt a web-first mode with follow-up of nonrespondents via face-to-face or telephone interviewing. Evidence suggests such designs may reduce costs and may produce datasets of higher quality than web-only designs. However, with the proportion of the population using the internet increasing markedly and people becoming less willing to welcome interviewers, the contribution of face-to-face or telephone follow-ups to minimizing non-response biases, which justifies such designs, may have changed in recent years. This paper addresses this issue. The main research questions are: Do we still need to follow up web non-respondents in a second mode to (RQ1) maximise response rates, (RQ2) maximise dataset representativeness, (RQ3) maximise response by under-represented, hard-to-reach population subgroups, and (RQ4) minimise non-response biases remaining after non-response weighting? And how has this changed over time? Methods & Data This study uses data from Understanding Society (UK Household Longitudinal Study, UKHLS). We focus on the Innovation Panel component of the study, in which a subset of sample members has been offered web interviews with face-to-face or telephone follow-ups of non-respondents. For each survey wave, we use Coefficients of Variation of response propensities to quantify the representativeness of web-only and web plus face-to-face or telephone respondents. In addition, we use the UKHLS main survey, which enables investigation of hard-to-reach population groups. Results Key findings are: 1) follow-ups are still required to maximise response rates and dataset sizes, though impacts have declined; 2) the impact of follow-ups on representativeness has declined, with web and web plus face-to-face datasets not differing; 3) impacts of follow-ups on the under-representation of hard-to-reach population subgroups have become negligible; and 4) impacts of follow-ups on non-response biases remaining after non-response weighting have similarly declined and are now negligible. Added Value We discuss the implications for survey practice. This paper is the first to investigate whether follow-ups are still needed in web surveys in the UK context. If follow-ups are no longer needed, this could potentially have large cost-saving implications for survey agencies. |
| 2:00pm - 3:00pm | 11.3: Inferential leap: from digital trace data to measuring concepts Location: RH, Seminar 03 |
|
|
Exploring Types of Masculinity in the Discourse of Fringe Online Communities GESIS Leibniz Institute for the Social Sciences, Germany Relevance & Research Question 4chan is an anonymous online forum known for ephemeral content and internet subcultures. This research examines how masculinities are constructed and contested in such spaces, where gender norms are negotiated through irony, confrontation, and subcultural slang. Combining theories of masculinity with computational text analysis, it bridges qualitative research and data-driven modeling of online discourse. It asks how different types of masculinities (hegemonic, hybrid, negotiating, caring) are constructed and circulated online and how theory-driven annotation and lexicon development can support computational identification. Methods & Data This study employs a mixed-method design integrating computational text analysis and qualitative annotation. It uses a comprehensive 2.5-year dataset of over 328M 4chan posts, encompassing textual data and metadata across all boards collected via a 4chan text collection tool. The platform’s unfiltered nature allows gender and politics to frequently intersect across diverse contexts. The empirical pipeline includes preprocessing (cleaning and fixing misspelled, broken, or repeated words), identifying masculinity-related discourse through SentenceBert (see the illustrative sketch below), and extracting candidate terms for a theory-driven annotation schema grounded in gender studies and discourse theory. The schema operationalizes masculinity types through parameters such as dominance, emotionality, caregiving, and norm stance. Human annotators use a web-based interface (Doccano) supported by automated term suggestions and inter-annotator agreement metrics. The resulting annotations form the basis for a masculinity lexicon and subsequent computational modeling and clustering to explore discursive variations across online communities. Results This study generates a dataset and lexicon capturing the linguistic construction of different masculinity types. These resources support qualitative interpretation and large-scale computational modeling, revealing patterns in identity performance and negotiation of gender norms. Methodologically, it demonstrates how theory-driven annotation can be combined with embedding-based term extraction and automated quality control, providing a scalable framework for future studies of digital gender expression. Added Value Conceptually, this study illuminates how different types of masculinities are performed and negotiated in online communities, linking these patterns to broader social and political discourses. Methodologically, it demonstrates how theory-driven annotation and embedding-based lexicon development can create scalable tools for analyzing gendered language. The resulting dataset, annotation framework, and lexicon provide reusable resources for future research across platforms and contexts. Measuring online information-seeking behavior in the context of (expectant) parenthood: A proof-of-concept study using metered data on website visits, online searches, and app usage 1DZHW; Leibniz University Hannover, Germany; 2RECSM-University Pompeu Fabra, Spain Relevance & Research Question Social Media as a Data Collection Tool and Its Impact on Body Image Perception University of Padova, Italy Relevance & Research Question Advertising campaigns were conducted on Meta and TikTok to recruit participants for an online survey about body image perception.
Throughout the campaign, various images were used for the advertisements, and their performance was compared and tested. Additionally, an experiment was conducted to assess the impact of an incentive as a motivation for engaging with the questionnaire. Data analysis focused on campaign performance metrics and survey responses, employing logistic regression and cluster analysis to examine the relationship between social media use and body satisfaction. Added Value |
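Relating to the 4chan masculinity study above: the embedding-based filtering step could, in generic terms, look like the following sketch (model name, seed terms, example posts, and the similarity threshold are illustrative assumptions, not the authors' pipeline):

from sentence_transformers import SentenceTransformer, util

# Illustrative seed terms and posts; model choice and threshold are assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")
seed_terms = ["masculinity", "being a real man", "toughness", "caregiving father"]
posts = ["real men don't cry", "I stayed home to look after my kids", "post about video games"]

seed_emb = model.encode(seed_terms, convert_to_tensor=True)
post_emb = model.encode(posts, convert_to_tensor=True)

# Keep posts whose maximum cosine similarity to any seed term exceeds a threshold.
max_sims = util.cos_sim(post_emb, seed_emb).max(dim=1).values
candidates = [p for p, s in zip(posts, max_sims) if float(s) > 0.4]
print(candidates)

Candidate posts identified this way would then feed into term extraction and the theory-driven annotation schema described in the abstract.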
| 2:00pm - 3:00pm | 11.4: DGOF KI (AI) FORUM: INSPIRATION SESSION (HELD IN ENGLISH) Location: RH, Auditorium Join us for an insightful session where industry experts put innovative solutions to the test and explore how AI can unlock new opportunities in market research. Topics will be announced soon, covering areas such as synthetic data, agentic AI, regulatory considerations, and the sharing of best practices. |
| 3:00pm - 3:15pm | Break Location: RH, Lunch Hall/ Cafeteria |
| 3:15pm - 4:15pm | 12.1: Survey recruitment Location: RH, Seminar 01 |
|
|
How Recruitment Channels Shape Data Quality: Evidence From A Multi-Source Panel GESIS Leibniz Institute for the Social Sciences, Germany Relevance & Research Question Declining response rates in traditional probability-based surveys have prompted researchers and survey practitioners to increasingly explore alternative recruitment strategies, such as via Social Networking Sites (SNS) and piggybacking (i.e., re-using respondents) from established surveys. While these strategies offer faster and more cost-efficient access to respondents, they also raise questions about the quality of the resulting data. SNS recruitment may threaten response quality through differing respondent motivations and an increased risk of satisficing. On the other hand, piggybacked samples may benefit from respondents’ experience and commitment but could suffer from conditioning effects. This research provides a comparative assessment of response quality across different recruitment strategies. Methods & Data Results will be presented at the conference in February. Preliminary analyses reveal notable sociodemographic differences (e.g., in age and education) across recruitment groups, pointing to potential disparities in response behaviors and data quality. Added Value Altogether, this study provides one of the first comparative assessments of response quality across recruitment strategies increasingly used in survey practice. By controlling for differences in sample composition, we disentangle compositional from recruitment-driven effects, offering insights into how integrated recruitment designs affect data quality. Looks great, responds poorly: lessons from ten years of invitation letter experiments Statistics Netherlands, The Netherlands Relevance & Research Question Our research combines qualitative pre-testing with large-scale field experiments. After many rounds of testing, a standard letter was designed that performs consistently well — until three new experiments challenged our assumptions. Methods & Data We examined (1) the effect of adding a QR code for easier access, (2) a shorter version of the letter, and (3) the response to a refreshed, more visually appealing layout. Each experiment used a fresh, representative sample and a corresponding control group. Results Adding a QR code had no significant effect (no code: 35.1% vs. QR code: 34.8%, n.s.). Added Value Using Text Messages (SMS) for Representative Sample Recruitment in Online Research Aristotle University of Thessaloniki, Greece Relevance & Research Question |
| 3:15pm - 4:15pm | 12.2: Push to web and mixed mode surveys Location: RH, Seminar 02 |
|
|
Introducing Web in a Telephone Employee Survey and its Impacts on Selection Bias and Costs 1IAB; 2LMU-Munich Relevance & Research Question Telephone surveys have historically been a popular form of data collection in labor market research and continue to be used to this day. Yet, telephone surveys are confronted with many challenges, including imperfect coverage of the target population, low response rates, risk of nonresponse bias, and rising data collection costs. To address these challenges, many telephone surveys have shifted to online and mixed-mode data collection to reduce costs and minimize the risk of coverage and nonresponse biases. However, empirical evaluations of the intended effects of introducing online and mixed-mode data collection in ongoing telephone surveys are lacking. Methods & Data We address this research gap by analyzing a telephone employee survey in Germany, the Linked Personnel Panel (LPP), which experimentally introduced a sequential web-to-telephone mixed-mode design in the refreshment samples of the 4th and 5th waves of the panel. By utilizing administrative data available for the sampled individuals with and without known telephone numbers, we estimate the before-and-after effects of introducing the web mode on coverage and nonresponse rates and biases. Results We show that the LPP was affected by known telephone number coverage bias for various employee subgroups prior to introducing the web mode, though many of these biases were partially offset by nonresponse bias. Introducing the web-to-telephone design improved the response rate but increased total selection bias, on average, compared to the standard telephone single-mode design. This result was driven by larger nonresponse bias in the web-to-telephone design and partial offsetting of coverage and nonresponse biases in the telephone single-mode design. Significant cost savings (up to 50% per respondent) were evident in the web-to-telephone design. Added Value Using a unique experimental design, we showed the potential for known telephone number coverage bias in telephone surveys. However, while introducing the web mode eliminates this coverage error, there is potential for other error trade-offs. For practitioners, this underscores the importance of carefully weighing the potential trade-offs between costs and multiple sources of error when designing a specific study. Examining the influence of respondents' internet-related characteristics on mode choice (paper vs. web mode) in a probabilistic mixed-mode panel with push-to-web design GESIS – Leibniz Institute for the Social Sciences, Germany Relevance & Research Question The web survey mode offers a high degree of flexibility while also being highly cost-efficient. However, in the context of web surveys, potential coverage issues are repeatedly raised as a problem, as target respondents without internet access cannot participate in an internet-based survey. In addition, there are respondents with internet access for whom participating online would be possible from a formal perspective, but who do not want to participate via the internet. I want to know: Are the reasons for abstaining from online participation also influenced by internet-related characteristics, apart from simply having or not having internet access? Methods & Data I use data from the GESIS Panel.pop Population Sample, a probability-based self-administered mixed-mode panel of the German general population, surveyed via web and paper mode (push-to-web design with paper mode as alternative mode).
I perform logistic regressions with mode selection as the dependent variable and various internet-related characteristics as independent variables (frequency of internet use, internet skills, variety of internet use, and number of internet-enabled devices); see the illustrative sketch below. I put particular emphasis on the key challenge of identifying the direction of causal effects among the internet-related characteristics in order to construct my models with appropriate control variables. For all regressors, I not only use the quasi-metric scale level frequently applied in the literature but also examine different threshold levels by varying the thresholds. With my analyses, I want to shed light on (internet-related characteristics of) those individuals who are not reached by online-only designs. Knowledge about those not reached by online-only designs also contributes to evaluating a target group-oriented use of paper questionnaires when conducting surveys. Data Quality in Push-to-Web Longitudinal Surveys: Evidence from ELSA's Transition to Sequential Mixed-Mode Design National Centre for Social Research, United Kingdom Relevance & Research Question Evidence from the ELSA study provides critical insights for longitudinal surveys implementing push-to-web among older populations. Key contributions: (1) household-level completion metrics are essential—individual response rates mask significant household non-completion, which has substantial implications for fieldwork costs; (2) structured web design reduces measurement error in complex variables despite completion burdens; (3) sequential mixed-mode designs must balance mode-specific strengths—web excels at structured data while face-to-face remains valuable for cognitively demanding tasks and panel engagement; (4) realistic duration expectations and improved re-entry protocols are crucial. Findings demonstrate push-to-web viability for ageing populations when supported by tailored strategies and hybrid approaches for mode-sensitive content. |
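Relating to the GESIS Panel.pop mode-choice analysis above: contrasting a quasi-metric specification with a thresholded one might be sketched roughly as follows (variable names, cutoffs, and the simulated data are assumptions, not the author's models):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated illustration of a mode-choice data set
rng = np.random.default_rng(3)
n = 1500
df = pd.DataFrame({
    "internet_freq": rng.integers(1, 8, n),    # e.g., 1 = never ... 7 = several times a day
    "internet_skills": rng.integers(1, 6, n),
    "n_devices": rng.integers(0, 6, n),
    "age": rng.integers(18, 80, n),
})
p_web = 1 / (1 + np.exp(-(-1.5 + 0.4 * df["internet_freq"] - 0.02 * df["age"])))
df["web_mode"] = rng.binomial(1, p_web)        # 1 = chose web, 0 = chose paper

# Quasi-metric specification
m_metric = smf.logit("web_mode ~ internet_freq + internet_skills + n_devices + age", data=df).fit(disp=0)
# Thresholded specification: daily internet use vs. less than daily (one of several cutoffs to vary)
df["daily_user"] = (df["internet_freq"] >= 6).astype(int)
m_thresh = smf.logit("web_mode ~ daily_user + internet_skills + n_devices + age", data=df).fit(disp=0)
print(m_metric.params, m_thresh.params, sep="\n")

Varying the cutoff in the thresholded specification (here 6, i.e., daily use) and comparing the resulting models is one straightforward way to implement the varying thresholding mentioned in the abstract.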
| 3:15pm - 4:15pm | 12.3: Methods, tools, and frameworks - a bird's view on data collection Location: RH, Seminar 03 |
|
|
The Methods Hub: Integrating Tools, Tutorials, and Environments for Transparent Online Research GESIS - Leibniz Institute for the Social Sciences, Germany Relevance & Research Question: As digital communication increasingly unfolds on online platforms, behavioral data have become central to understanding media exposure, polarization, and social interaction. Yet computational approaches necessary to analyze such data often remain inaccessible to many communication and social science researchers who lack extensive programming expertise or institutional resources. As a result, many research-driven tools remain scattered across personal repositories, supplementary materials, or project websites, reducing their visibility, reusability, and long-term sustainability. This presentation addresses the question of how a community-driven infrastructure can lower entry barriers for computational methods and support transparent, reproducible online research. Methods & Data: The Methods Hub is designed as an open platform that curates computational resources relevant to social science research. It integrates three core components: (1) open-source tools ranging from lightweight scripts to fully developed software packages, (2) tutorials explaining both general principles of reproducible computational workflows and concrete methodological applications, and (3) containerized interactive coding environments that can be executed directly in the browser without local installation. All contributions follow open licensing and reproducibility standards and are reviewed accordingly. The platform architecture supports interoperability with complementary infrastructures (e.g., KODAQS) to facilitate cross-linking between datasets, tools, and training materials. The development process combines community submissions, expert curation, and iterative user testing to ensure methodological relevance and usability. Results: Preliminary implementation demonstrates that the platform successfully bridges gaps between computational tooling and social science workflows. Initial contributions include tools for digital trace data collection, automated preprocessing pipelines, validation and reliability routines, and visualization templates. Tutorials and browser-based execution environments have proven effective in enabling researchers to test methods without configuring complex software environments. User feedback from pilot workshops indicates substantial reductions in setup time, increased willingness to experiment with computational approaches, and improved understanding of reproducible research practices. Added Value: The platform lowers entry barriers to behavioral data analysis, strengthens methodological knowledge transfer, and promotes long-term visibility and reuse of tools otherwise confined to fragmented project repositories. Through openness, interoperability, and executable documentation, the Methods Hub contributes to building a robust ecosystem for computational communication science. Let's Talk About Limitations: Data Quality Reporting Practices in Quantitative Social Science Research 1University of Mannheim, Germany; 2GESIS – Leibniz Institute for the Social Sciences Relevance & Research Question Clearly communicating data quality limitations is essential for transparent research. Data quality frameworks and reporting guidelines support researchers in identifying and documenting potential data quality concerns, but it is unclear how well this translates to reporting practices. 
In this project, we analyze reports of data quality limitations in substantive social science publications. Thus, we provide insights into typical limitations that recur but also highlight underrepresented areas where researchers might require additional guidance. Methods & Data We analyze the “Limitations” sections and limitation-related paragraphs in “Discussion” sections of substantive survey-based research published in journals including the American Sociological Review and Public Opinion Quarterly. We use a large language model to extract data-quality-related aspects of these sections and paragraphs and assign them to the measurement and representation sides as defined in Total Data Quality error frameworks. We then cluster the excerpts into themes and compare the themes to components of the error frameworks (see the illustrative sketch below). Based on this, we discuss which data quality dimensions are mentioned commonly and which rarely, and what the possible reasons for these differences may be. Through comparisons with reporting guidelines (e.g., the AAPOR transparency initiative, datasheets for datasets), we highlight areas where researchers might require additional support. We also analyze areas where current guidelines might be adapted to better represent researchers’ needs in reporting. Results Initial findings show a prevalence of discussions on measurement validity and on coverage of the target population, in contrast to only a few mentions of limitations related to data processing. We also find that limitations are often communicated implicitly, adding a challenge for readers from other disciplines. For example, briefly mentioning the concrete implications of using an “online non-probability sample” would facilitate interdisciplinary validity assessments. Added Value We contribute to the transparent and well-structured communication of data quality as a crucial step for validating research by providing an overview of current reporting practices and directions for improved reporting. Qualitative Research in Digital Contexts: A Systematic Review of Online Data Collection Practices FH Wiener Neustadt GmbH City Campus, Austria Relevance & Research Question We conducted a systematic literature review (Tranfield et al. 2003) of academic journal articles reporting experiences and reflections on qualitative online data collection from 2000 to 2024. Methods & Data A literature search was carried out in the databases Springer Link, Science Direct and Emerald Insight, using, among others, the following terms: “digital OR virtual OR online data collection” AND “qualitative research OR method”. Following a three-stage selection process, 44 articles were selected and systematically coded in MaxQDA, combining a deductive framework (according to “before, during and after the survey”) with an inductive coding process. Results The online context will continue to play an increasingly important role in qualitative social research. We address the need for practical guidance on conducting qualitative research projects online while maintaining quality standards. Furthermore, we relate research practices to existing debates on research quality. In doing so, the review offers not only practical guidance but also theoretical connections for developing a reflexive methodology of digital social research. |
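Relating to the data-quality reporting study above: the theme-clustering step could, in principle, be sketched with sentence embeddings and k-means (the model choice, number of clusters, and example excerpts are illustrative assumptions, not the authors' pipeline):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Illustrative limitation excerpts of the kind extracted from "Limitations" sections
excerpts = [
    "Our online non-probability sample limits generalizability to the full population.",
    "The single-item measure may not capture the construct reliably.",
    "Coverage of respondents without internet access remains a concern.",
    "Self-reports of media use are prone to recall error.",
]
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(excerpts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for label, text in zip(labels, excerpts):
    print(label, text)

The resulting clusters would then be compared against the measurement and representation components of Total Data Quality error frameworks, as described in the abstract.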
| 3:15pm - 4:15pm | 12.4: DGOF KI (AI) FORUM: WORLD CAFÉ (SESSION HELD IN GERMAN) Location: RH, Auditorium Take part in engaging World Café discussions with fellow GOR participants. Now in its third edition, this interactive format is centered on meaningful exchange. The session invites participants to share challenges and barriers, explore opportunities and limitations in applying AI to research practice, and discuss ways to overcome them in the future. Topics will be announced soon. |