GOR 26 - Annual Conference & Workshops
Annual Conference - Rheinische Hochschule Cologne, Campus Vogelsanger Straße
26 - 27 February 2026
GOR Workshops - GESIS - Leibniz-Institut für Sozialwissenschaften in Cologne
25 February 2026
Conference Agenda
Overview and details of the sessions of this conference.
Session Overview
2.1: AI and survey research
Presentations
Talking to Results: LLM-enabled Discussions of Quantitative Survey Findings
1Verian, Germany; 2Inspirient; 3Bertelsmann Stiftung

Relevance & Research Question:
Methods & Data:
Results:
Added Value:

Testing the Performance and Bias of Large Language Models in Generating Synthetic Survey Data
Utrecht University, The Netherlands

Relevance & Research Question: The idea of generating so-called silicon survey samples with large language models (LLMs) has gained broad attention in both academic and market research as a promising, timely, and cost-efficient method of data collection. However, previous research has shown that LLMs are likely to reproduce limitations and biases found in their training data, including underrepresentation of certain subgroups and imbalanced topic coverage. Using survey data from a probability-based online panel in the Netherlands, we conduct a large-scale analysis examining model performance across different item types (factual, attitudinal, behavioral) and social-science topics, to identify when and for whom synthetic approaches perform best. We further explore strategies to mitigate potential performance limitations, including few-shot prompting of LLMs with longitudinal survey data.

Methods & Data: We compare existing survey data with LLM-generated synthetic data, using the 17th wave of the LISS Panel fielded in 2024. We selected nine survey items that cover key social-science topics such as health, family, work, and political values, chosen on the basis of their characteristics (factual, behavioral, attitudinal) as well as their outcome type. We prompt the LLM agents in two different few-shot learning setups: a sociodemographic setup, which draws on seven individual background variables, and a panel-informed setup, which additionally incorporates previous responses. We compare the generated data across five proprietary LLMs: GPT-4.1, Gemini 2.5 Pro, Llama 2 Maverick, DeepSeek-V3, and Mistral Medium 3.

Results: Our first results show a severe lack of accuracy in the sociodemographic few-shot setup, with an average accuracy of 0.2 and strongly underestimated variance across items and topics. Prediction errors vary significantly across subgroups, particularly by age, differing not only in magnitude but also in direction. Adding longitudinal survey responses to the few-shot input substantially improves prediction quality: overall accuracy increases by 40 to 60 percentage points, the variance underestimation is corrected, and subgroup disparities are reduced.

Added Value: Our research contributes to a more responsible application of silicon survey samples by providing practical guidance for researchers and survey practitioners to evaluate and improve AI-generated datasets.
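To make the two few-shot setups described above concrete, the following is a minimal sketch, not the authors' code: it assumes an OpenAI-style chat completion client, and the prompt wording, the example background variables, and the 1-to-5 response scale are invented for illustration. The sociodemographic setup simply omits the prior-wave block; in practice one such call would be issued per respondent and per survey item.

```python
# Illustrative sketch (not the study's actual protocol): build a few-shot prompt
# for one synthetic respondent and one survey item, then query an LLM once.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def build_prompt(background: dict, prior_waves: list[dict], item: str) -> str:
    """Combine sociodemographic background variables and, for the
    panel-informed setup, answers from earlier panel waves."""
    lines = ["You are answering a survey as the person described below."]
    lines += [f"- {k}: {v}" for k, v in background.items()]
    if prior_waves:  # present only in the panel-informed setup
        lines.append("Your answers in earlier panel waves:")
        lines += [f"- {w['item']}: {w['answer']}" for w in prior_waves]
    lines.append(f"Question: {item}")
    lines.append("Answer with a single number from 1 (not at all) to 5 (very much).")
    return "\n".join(lines)


# Hypothetical example values; the study uses seven background variables.
background = {"age": 54, "gender": "female", "education": "vocational",
              "region": "east", "household size": 2,
              "employment": "part-time", "income bracket": "middle"}
prior_waves = [{"item": "General health (1-5)", "answer": 4}]

prompt = build_prompt(background, prior_waves, "How satisfied are you with your work?")
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,  # sampling variance matters when imitating a response distribution
)
print(response.choices[0].message.content)
```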
Transcribing and coding voice answers obtained in web surveys: comparing three leading automatic speech recognition tools
1RECSM-UPF, Spain; 2DZHW, Leibniz University Hannover; 3University of Michigan

Relevance & Research Question: With the rise of smartphone use in web surveys, voice or oral answers have become a promising methodology for collecting rich data. Voice answers present both opportunities and challenges. This study addresses two of these challenges, namely labor-intensive manual transcription and coding of responses, by answering the following research questions: (RQ1) How do three leading Automatic Speech Recognition (ASR) tools (Google Cloud Speech-to-Text, OpenAI Whisper, and Vosk) perform across various dimensions? (RQ2) How similar or different are the codes of transcribed responses generated by a human and by the OpenAI GPT-4o model?

Methods & Data: We used data collected in the Netquest opt-in online panel in Spain in February/March 2024. The questionnaire included over 80 questions, mainly about citizens' perceptions of nursing homes. This study focuses on one open-ended narrative question in which respondents were asked to explain why they selected a given answer in a prior closed question on the amount of information nursing homes provide to the general public. For this question, participants were initially asked to answer by voice recording. In a follow-up, respondents who skipped the question were also offered the option to type their answer in a text box. We extracted various aspects from the transcriptions and compared them across ASR tools and between human and GPT coding. After data cleaning, responses from 859 panellists were used for the analyses.

Results: We found that each of the ASR tools has distinct merits and limits. Google sometimes fails to provide transcriptions, Whisper produces hallucinations (false transcriptions), and Vosk has clarity issues and high rates of incorrect words. Human and LLM-based coding also differ significantly. We therefore recommend using several ASR tools and implementing both human and LLM-based coding, as the latter offers additional information at minimal added cost.

Added Value: This study offers valuable insights into the feasibility of using voice answers. Depending on the most critical quality dimension for each study (e.g., maximizing the number of transcriptions or achieving the highest clarity), it provides guidance on selecting the most suitable ASR tool(s) and insights into the extent to which LLMs can assist with manual tasks like answer coding.
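A transcription-plus-coding pipeline of the kind compared above can be sketched briefly. The snippet below is an illustration rather than the authors' pipeline: it uses open-source Whisper for ASR and GPT-4o for coding, while the study additionally compares Google Cloud Speech-to-Text and Vosk transcripts; the file path, code frame, and prompt wording are assumptions.

```python
# Illustrative sketch: transcribe one recorded voice answer with Whisper and
# assign it a code with GPT-4o. Not the study's actual code frame or prompts.
import whisper
from openai import OpenAI

ASR_MODEL = whisper.load_model("base")  # larger Whisper models trade speed for accuracy
client = OpenAI()


def transcribe(audio_path: str) -> str:
    # Responses were collected in Spain, so the language is fixed to Spanish.
    result = ASR_MODEL.transcribe(audio_path, language="es")
    return result["text"].strip()


def code_answer(transcript: str) -> str:
    # Hypothetical code frame; a real study would mirror the human coders' scheme.
    prompt = (
        "Code the following survey answer about the information nursing homes "
        "provide to the public into exactly one category: 'enough information', "
        "'too little information', 'don't know', or 'other'.\n\n"
        f"Answer: {transcript}\nCategory:"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # near-deterministic coding for comparability with human codes
    )
    return response.choices[0].message.content.strip()


transcript = transcribe("voice_answers/respondent_001.wav")  # hypothetical path
print(transcript, "->", code_answer(transcript))
```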