GOR 26 - Annual Conference & Workshops
Annual Conference - Rheinische Hochschule Cologne, Campus Vogelsanger Straße
26 - 27 February 2026
GOR Workshops - GESIS - Leibniz-Institut für Sozialwissenschaften in Cologne
25 February 2026
Conference Agenda
Overview and details of the sessions of this conference.
Session Overview

Session: GOR Thesis Award PhD

Presentations
Who Counts? Survey Data Quality in the Age of AI
LMU Munich, Germany; Munich Center for Machine Learning

Relevance & Research Question
Large language models (LLMs) have raised hopes of making survey research more efficient while also improving survey data quality. However, because they are trained on Internet data, LLMs may come with pitfalls similar to those of other digital data sources when making inferences about human attitudes and behavior. As such, they have the potential not only to mitigate but also to amplify existing biases. In my dissertation, I investigate whether and under which conditions LLMs can be leveraged in survey research by providing empirical evidence of the potentials and limits of their applications. Potential applications of LLMs span the entire survey life cycle, before, during, and after data collection, where an LLM could act as a research assistant, interviewer, or respondent. Potential challenges for data quality stem from LLMs' training data, alignment processes, and model architecture, as well as from the research design. I focus on two major applications of LLMs covering both representational and measurement challenges and test these applications in challenging, previously unexamined contexts.

Methods & Data, Results
Two studies address the most prominent discussion regarding LLM-based survey research: using LLM-generated "synthetic samples". Coverage bias in LLM training data and alignment processes might affect the applicability of this approach. In one study, I test to what extent LLMs can estimate vote choice in Germany. To generate a synthetic sample of eligible voters in Germany, I create small profiles matching the individual characteristics of the 2017 German Longitudinal Election Study respondents. These "personas" include socio-demographic and attitudinal information known to be associated with voting behavior. Prompting GPT-3.5 with each persona in German, I ask the LLM to predict each respondent's vote choice in the 2017 German federal elections and compare these predictions to the survey-based estimates at the aggregate and subgroup levels. I find that GPT-3.5 does not predict citizens' vote choice accurately, exhibiting a bias towards the Green and Left parties and making better predictions for more "typical" voter subgroups. While the LLM is able to capture broad partisan tendencies, it tends to miss the multifaceted factors that sway individual voters. As a consequence, LLM-synthetic samples are not only unhelpful for estimating how groups likely to swing an election, such as non-partisans, will vote; they also risk underestimating the popularity of parties without a strong partisan base. Such samples thus provide little added value over survey-based estimates. Furthermore, the results suggest that GPT-3.5 might not be reliable for estimating nuanced, subgroup-specific political attitudes.
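As an illustration of the persona-based prompting described above, a minimal sketch follows. It assumes the OpenAI Python chat completions client; the persona fields, prompt wording, and model name are illustrative placeholders and not the study's actual (German-language) materials.

```python
# Illustrative sketch of persona-based "synthetic sample" prompting.
# Persona fields and prompt wording are hypothetical stand-ins for the
# GLES-based personas used in the study.
from openai import OpenAI

client = OpenAI()  # assumes an API key is available in the environment

def build_persona_prompt(persona: dict) -> str:
    """Turn one respondent profile into a short persona prompt."""
    return (
        f"You are a {persona['age']}-year-old {persona['gender']} from "
        f"{persona['state']} with {persona['education']} education, who "
        f"identifies with the {persona['party_id']} party and places "
        f"yourself at {persona['left_right']} on a 0-10 left-right scale. "
        "Which party did you vote for in the 2017 German federal election? "
        "Answer with the party name only."
    )

def predict_vote(persona: dict, model: str = "gpt-3.5-turbo") -> str:
    """Query the LLM once per persona and return its predicted vote choice."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_persona_prompt(persona)}],
        temperature=0,  # keep answers comparable across personas
    )
    return response.choices[0].message.content.strip()

# Aggregate predicted vote shares can then be compared to the survey-based
# estimates, overall and within subgroups.
```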
In a second study, I extend the previous study to the entire European Union (EU), this time focusing on an outcome unobserved at the time of data collection: the results of the 2024 European Parliament elections. I create personas of 26,000 eligible voters in all 27 EU member states based on the Eurobarometer and compare the proprietary LLM GPT-4-Turbo with the open-source LLMs Llama-3.1 and Mistral. A week before the European elections in June 2024, I prompted the LLMs with the personas in English and asked them to predict each person's voting behavior, once based only on socio-demographic information and once also including attitudinal variables. To investigate differences in LLMs' bias across languages, I selected six diverse EU member states for which I prompted the LLMs in the respective country's native language. After the elections' conclusion, I compare the aggregate predicted party vote shares to the official national-level results for each country. LLM-based predictions of future voting behavior largely fail: they overestimate turnout and are unable to accurately predict party popularity. Providing only socio-demographic information about individual voters worsens the results further. Finally, the LLMs are especially bad at predicting voting behavior for Eastern European countries and countries with Slavic native languages, suggesting systematic contextual biases. These findings emphasize the limited applicability of LLM-synthetic samples to public opinion prediction across contexts. Without further adaptation through, e.g., fine-tuning, LLMs appear infeasible for public opinion prediction not just in terms of accuracy but also in terms of efficiency, highlighting a trade-off between the recency and the level of detail of available survey data for synthetic samples.

In the third study, I investigate the usability of LLMs for classifying open-ended survey responses. Given their linguistic capacities, LLMs are likely an efficient alternative to time-consuming manual coding and to the pre-training of supervised machine learning models, but it is unclear how the sparse, sometimes competing, existing findings generalize and how the quality of such classifications compares to established methods. I therefore test to what extent different LLMs can be used to code German open-ended survey responses on survey motivation from the GESIS Panel.pop Population Sample. I prompt GPT-4-Turbo, Llama-3.2, and Mistral NeMo in German with a predefined coding scheme and instruct them to classify each survey response, comparing zero-shot prompting, few-shot prompting, and fine-tuning. I evaluate the LLMs' performance by contrasting their classifications with those made by human coders. Only fine-tuning achieves satisfactory levels of predictive accuracy. Performance differences between prompting approaches are conditional on the LLM used, as overall performance differs greatly between LLMs: GPT performs best in terms of accuracy, and few-shot prompting leads to the best performance. Disregarding fine-tuning, the prompting approach matters little when using GPT but makes a big difference for the other LLMs. Further, the LLMs struggle especially with non-substantive catch-all categories, resulting in differing category distributions. The need to fine-tune LLMs for this task implies that they are not the resource-efficient, easily accessible alternative researchers may have hoped for. I discuss my findings in light of the ongoing developments in the rapidly evolving LLM research landscape and point to avenues for future work.
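A minimal sketch of what zero-shot classification of open-ended responses can look like in practice is shown below. The coding categories, prompt wording, and English-language framing are hypothetical stand-ins for the study's German coding scheme and materials, and the OpenAI client is assumed for illustration.

```python
# Illustrative zero-shot classification sketch; categories and prompt are
# hypothetical, not the study's actual coding scheme.
from openai import OpenAI

client = OpenAI()

CODING_SCHEME = [
    "interest in the survey topic",
    "monetary incentive",
    "sense of obligation / wish to help research",
    "other substantive reason",
    "non-substantive / uncodable",
]

def classify_response(answer: str, model: str = "gpt-4-turbo") -> str:
    """Ask the LLM to assign exactly one category to one open-ended answer."""
    prompt = (
        "Classify the following open-ended answer about survey participation "
        "motivation into exactly one of these categories:\n"
        + "\n".join(f"- {c}" for c in CODING_SCHEME)
        + f'\n\nAnswer: "{answer}"\nReturn only the category label.'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Few-shot prompting would prepend a handful of human-coded example answers
# to the prompt; fine-tuning would instead train on a larger coded set.
```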
Added Value
This dissertation makes both methodological and applied contributions to survey research. It discusses types and sources of bias in LLM-based survey research and empirically tests their prevalence from multiple comparative angles. It also showcases concrete applications of LLMs across several steps of the research process and several substantive topics, explaining their practical implementation and highlighting their potentials and pitfalls. Overall, this dissertation thus provides guidance on ensuring data quality when using LLMs in survey methodology and contributes to the larger discourse about LLMs as a social science research tool.

Correcting Selection Bias in Nonprobability Samples by Pseudo Weighting
Utrecht University, The Netherlands

Relevance & Research Question
Statistics are often estimated from a sample rather than from the entire population. If the sample's inclusion probabilities are unknown to the researcher, that is, if it is a nonprobability sample, naively treating it as a simple random sample may result in selection bias. Attention to correcting selection bias is increasing due to the availability of new data sources, for example, online opt-in surveys and data donation. These data are often easy to collect and may constitute so-called "Big Data", given the large fraction of the population they can cover. This dissertation consists of four scientific papers, two of which are published in influential journals and two of which are under review. In the first paper, a novel framework for correcting selection bias in nonprobability samples is proposed. This is followed by a discussion of three practical challenges, for which possible solutions are provided.

Methods & Data
In the framework paper, the general idea is to construct a set of unit weights for the nonprobability sample by borrowing strength from a reference probability sample. If a proper set of weights is constructed, design-based estimators can be used for population parameter estimation given the weights. To evaluate the uncertainty of the estimated population parameter, a pseudo-population bootstrap procedure is proposed for different relations between the nonprobability sample and the probability sample. Three practical challenges for pseudo-weighting are also discussed: model selection for bias correction, imbalanced samples, and small area estimation. Simulation studies and applications to real data are presented in each paper to illustrate the use of the proposed framework and the possible solutions to the practical challenges.

Results
The proposed framework is flexible, and many kinds of probability estimation models can be used. This raises the question of how to select a proper model given the population parameter of interest. After testing a series of performance measures, we found that modeling the target variable when evaluating the performance of weights may be useful. The second challenge stems from the large size of the nonprobability sample. Since we often have a large nonprobability sample assisted by a small probability sample, we end up with an imbalanced combined sample, which can cause problems when estimating model parameters. Several remedies for imbalanced samples are discussed, and the proposed framework is adjusted accordingly. The results show that SMOTE is a promising technique for dealing with imbalanced samples. Finally, we look at the scenario where not only population-level estimates but also subpopulation estimates are of interest. Several approaches to combining pseudo-weights with small area estimation are discussed. Of all approaches, we found that combining a hierarchical Bayesian model with weights is a relatively stable estimation approach. If both population-level and area-level estimates are of interest, aligning the weighted estimates with estimated marginal totals may be a better option.
Added Value
This research provides practical suggestions on how to deal with possible selection bias in nonprobability samples, an issue that is gaining ever more importance in the age of digitalization and low response rates.
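To make the pseudo-weighting idea more concrete, the sketch below shows one common propensity-based way to implement it: a model distinguishing nonprobability from reference-sample units is fitted on the combined data, and inverse-odds weights are derived for the nonprobability units. The variable names, the logistic model, and the inverse-odds formula are illustrative assumptions, not the dissertation's exact estimator.

```python
# Illustrative propensity-based pseudo-weighting sketch (a generic, assumed
# approach; not the dissertation's exact estimator or code).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def pseudo_weights(nonprob: pd.DataFrame,
                   reference: pd.DataFrame,
                   covariates: list,
                   ref_design_weights: np.ndarray) -> np.ndarray:
    """Estimate pseudo-weights for a nonprobability sample using a
    reference probability sample with known design weights."""
    combined = pd.concat([nonprob[covariates], reference[covariates]],
                         ignore_index=True)
    # z = 1 marks nonprobability units, z = 0 marks reference-sample units.
    z = np.concatenate([np.ones(len(nonprob)), np.zeros(len(reference))])
    # Reference units carry their survey design weights in the propensity model.
    sw = np.concatenate([np.ones(len(nonprob)), ref_design_weights])
    model = LogisticRegression(max_iter=1000)
    model.fit(combined, z, sample_weight=sw)
    p = model.predict_proba(nonprob[covariates])[:, 1]
    # Inverse-odds pseudo-weights: units overrepresented in the
    # nonprobability sample are down-weighted.
    return (1.0 - p) / p

# A Hajek-type weighted estimate of a population mean for an outcome y
# observed in the nonprobability sample is then np.sum(w * y) / np.sum(w).
```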