JavaScript is Disabled
Your browser's JavaScript functionality is disabled. It has to be enabled to use this function of ConfTool.
Here you can find information on how to enable JavaScript
If you have any problems, please contact the organizers at office@dgof.de.

Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions at that day or location. Please select a single session for detailed view (with abstracts and downloads if available).

Only Sessions at Date / Time

Session Overview

Session

GOR Thesis Award Master

Time:

Thursday, 26/Feb/2026:

10:15am - 11:15am

Location: RH, Auditorium

Presentations

Measuring Ambivalent Sexism in Large Language Models: A Validation Study

Jana Jung

University of Mannheim, Germany

1 Relevance & Research Question
Large language models (LLMs) often perpetuate gender biases and stereotypes learned from uncurated training data, underscoring the importance of reliable methods for measuring bias. Although several methods of measuring gender bias have been proposed, there are concerns about the ambiguities and inconsistencies in how these methods conceptualize and operationalize gender bias. A promising alternative to these existing methods is LLM psychometrics, which applies psychometric tests originally developed for humans to evaluate human-like characteristics of LLMs, such as personality. Psychometric tests have several advantages: they are grounded in psychological theory, have been rigorously validated, and provide standardized assessment tools. However, it remains unclear how these assessments can be meaningfully applied to LLMs. In this thesis, I explore whether the Ambivalent Sexism Inventory (ASI) [4] can be used to measure sexism in LLMs. To address this question, I propose a systematic validation approach grounded in established psychometric standards.

2 Methods & Data

2.1 Inducing Individuals Using Context data
To approximate psychometric testing conditions, we conceptualize an LLM as a representation of a population and induce individuals by prompting the model with different context information. Two types of contexts are used: human-chatbot interactions and personas. Using personas is a method that has been used in previous LLM psychometrics studies. However, this does not reflect how most users incorporate LLMs into their everyday lives. Therefore, we also use real-life interactions between users and LLM-powered chatbots as a context type. For each context type, n = 300 contexts are sampled from the Chatbot Arena Conversations dataset [5] and the Persona Hub dataset [6], respectively.

2.2 Ambivalent Sexism Inventory

The ASI consists of 22 items, such as, “Women exaggerate problems they have at work.” Answers are provided using a 6-point Likert scale ranging from 0 (disagree strongly) to 5 (agree strongly). The overall ASI score of one context is computed by averaging the answer scores of all items given that context.

2.3 Data Collection

Each item is prompted individually to mitigate the effects of item order. In addition to the item, the prompt contains the context, general instructions, and answer scale. Answer scores are extracted directly from a model’s text response. Data is collected from six state-of-the-art LLMs, including Llama 3.3 70B Instruct, Mistral 7B Instruct, and Qwen 2.5 7B Instruct.

2.4 Psychometric Quality Critera

The systematic validation is conducted by first evaluating reliability (i.e., the consistency of a test) using three criteria:

(1) Internal consistency (Cronbach’s alpha): How consistent are responses across all items of the ASI?

(2) Alternate-form reliability (Pearson correlation): How consistent are the ASI scores when rephrasing the items without changing their meaning?

(3) Option-order symmetry (Pearson correlation): How consistent are the ASI scores when randomly changing the order of answer options?

If reliability is deemed acceptable based on established psychometric interpretation thresholds, validity (i.e., the extent to which a test measures what it is supposed to measure) is evaluated in a second step based on the following three types of validity:

(1) Concurrent validity (Pearson correlation): Does the ASI score align with the amount of sexist language used in a downstream task (writing reference letters)?

(2) Convergent validity (Pearson correlation coefficient): Does the ASI score align with the sexism score of another established sexism scale, the Modern Sexism Scale [7]?

(3) Factorial Validity (Confirmatory factor analysis; CFA): Do the items group together in a way that makes sense based on the underlying theory?

These analyses are conducted for each of the six models and two context types.

3 Results
For 10 out of the 12 model-context type combinations, the ASI displays low reliability, indicating low consistency and high measurement error. Only for two models - Llama 3.3 70B Instruct and Qwen 2.5 7B Instruct - reliability is deemed acceptable when using Persona Hub contexts. However, the validity evaluation for these two cases indicates low validity. Crucially, for both Llama 70B and Qwen, ASI scores do not significantly correlate with sexist behavior in the downstream task (r = -0.1, p = .523 and r = 0.08, p = .612 respectively). Based on these findings, the ASI is not considered valid for any of the six LLMs. However, the results also indicate that the choice of context type influences evaluation outcomes.
4 Added Value

The findings of this thesis emphasize that tests developed and validated for humans should not be automatically assumed to be valid for LLMs. This underscores the importance of conducting validation studies before interpreting psychological test scores for LLMs, which has rarely been done in the field of LLM psychometrics [3].

However, the results also raise several important questions and issues on how to conduct such validations. What constitutes an “individual” in the context of LLMs? How should a sample of “individuals” be selected? These issues highlight the need to adapt the psychometric validation approach to the LLM domain in future studies.

[1] BLODGETT, S. L., LOPEZ, G., OLTEANU, A., SIM, R., AND WALLACH, H. Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Aug. 2021).

[2] PELLERT, M., LECHNER, C. M., WAGNER, C., RAMMSTEDT, B., AND STROHMAIER, M. AI Psychometrics: Assessing the Psychological Profiles of Large Language Models Through Psychometric Inventories. Perspectives on Psychological Science 19, 5 (Sept. 2024).

[3] LÖHN, L., KIEHNE, N., LJAPUNOV, A., AND BALKE, W.-T. Is Machine Psychology here? On Requirements for Using Human Psychological Tests on Large Language Models. In Proceedings of the 17th International Natural Language Generation Conference (Sept. 2024).

[4] GLICK, P., AND FISKE, S. T. Hostile and Benevolent Sexism: Measuring Ambivalent Sexist Attitudes Toward Women. Psychology of Women Quarterly 21, 1 (Mar. 1997).

[5] ZHENG, L., CHIANG, W.-L., SHENG, Y., ZHUANG, S., WU, Z., ZHUANG, Y., LIN, Z., LI, Z., LI, D., XING, E. P., ZHANG, H., GONZALEZ, J. E., AND STOICA, I. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Dec. 2023. arXiv:2306.05685.

[6] GE, T., CHAN, X., WANG, X., YU, D., MI, H., AND YU, D. Scaling Synthetic Data Creation with 1,000,000,000 Personas, Sept. 2024. arXiv:2406.20094.

[7] SWIM, J. K., AIKIN, K. J., HALL, W. S., AND HUNTER, B. A. Sexism and racism: Old-fashioned and modern prejudices. Journal of Personality and Social Psychology 68, 2 (1995).

AI for Survey Design: Generating and Evaluating Survey Questions with Large Language Models

Anna Fuchs

Ludwig-Maximilians-Universität Munich, Germany

Relevance & Research Question

Designing high-quality survey questions is a complex task. With the rapid development of large language models (LLMs), new possibilities have emerged for supporting this process, particularly in the automated generation of survey items. Despite growing interest in LLM applications from industry, published research in this area remains limited, and little is known about the quality and characteristics of survey items generated by LLMs, as well as the factors influencing their performance. This work provides the first in-depth analysis of LLM-based survey item generation and systematically evaluates how different design choices, such as prompting technique, model choice, and fine-tuning, affect item quality.
Methods & Data

Five LLMs, namely GPT-4o, GPT-4o-mini, GPT-oss-20B, LLaMA 3.1 8B, and LLaMA 3.1 70B, generated survey items for 15 concepts across four domains: work, living conditions, national politics, and recent politics. For each concept, three prompting techniques (zero-shot, role, and chain-of-thought prompting) were applied. Additionally, the best performing model and prompting combination, namely GPT-4o-mini combined with chain-of-thought prompting, was fine-tuned on high-quality survey items to explore the effects of fine-tuning on the quality of generated items.
To assess the quality of the generated survey items, we use the Survey Quality Predictor (SQP; https://sqp.gesis.org), a validated tool for estimating the quality of attitudinal survey items. SQP predicts item quality, validity, and reliability based on a range of coded formal and linguistic characteristics of the survey items, e.g., question formulations, the response scale types, and the inclusion of an introduction text. To code these characteristics, we used an LLM-assisted procedure. A GPT-4.1-nano model was fine-tuned on examples of coded characteristics from the SQP database to automate the coding process, supplemented by manual revision.
The analysis allows us to evaluate not only overall quality but also around 60 specific survey item characteristics, offering a detailed view of how LLM-generated questions differ.

Results

The findings show striking differences in survey characteristics across the different models and prompting techniques. As an example, the type of response scale strongly differed by model family. Closed-source GPT models consistently generate five-category, bipolar response scales with medium correspondence between numeric and verbal labels. They rarely include a 'don't know' option and typically begin with the negative end of the response scale. LLaMA models show greater variation: they generate a wider range of response options, show greater inconsistency between numeric and verbal labels, differ in whether scales start with the positive or negative option, and the inclusion of a 'don't know' option varies by model size. The inclusion of an introduction text in the survey item depends strongly on the type of prompting technique used. Survey items generated with chain-of-thought prompting often included an introduction text.

Regarding the quality of the generated survey items, the findings show that the prompting technique employed is a primary factor influencing the quality of LLM-generated survey items. Chain-of-thought prompting leads to the most reliable outputs. Closed-source GPT models generally produce more consistent and higher-quality items than open-source LLaMA models. The open-source GPT-oss-20B model failed to complete the given task, i.e., it did not produce a usable survey item in 68% of the cases. The survey topics `work' and `national politics' generate survey items with higher quality compared to `living conditions’ and `recent politics’. Among all configurations, GPT-4o-mini combined with chain-of-thought prompting achieved the best overall results. Fine-tuning on high-quality survey items added variety in survey item characteristics but did not lead to noticeable improvements in item quality.

Added Value

For the GOR community, the study offers empirical evidence on how LLMs can (and cannot) be reliably integrated into questionnaire design workflows, providing a systematic basis for evaluating emerging AI tools in survey research and informing methodological decisions in applied settings. In addition to highlighting the strengths and limitations of LLMs for survey item generation, the work helped to identify concrete weaknesses within the SQP-based evaluation pipeline, particularly regarding the coding of characteristics. The development of an LLM-assisted coding procedure contributes to future research in AI-supported survey design by laying the necessary groundwork for a fully automated pipeline that can code the SQP item attributes at scale.

Adaptive Code Generation for the Analysis of Donated Data with Large Language Models

Miger Shkrepa

University of Mannheim, Germany

Relevance & Research Question: In an increasingly digitalized world, people are generating large amounts of digital trace data daily as a result of the constant recording of their information and activity. These data contain information with the potential to facilitate human behavior studies due to their accessibility and fine-grained nature. Data donation has emerged as a promising approach to get access to such data from specific online platforms, such as Instagram. In a data donation study, people are invited to answer a survey and subsequently asked to request, download, and finally donate their online platform data to research. However, as raw data, these donated Data Download Packages (DDPs) might contain highly sensitive or personal information. Previous research on data donation has solved this issue by developing privacy-preserving methods to anonymize and aggregate data directly on participants' devices. Data donation workflows typically rely on scripts designed to extract and process relevant data from these structures. Researchers who develop these extraction scripts often face the challenge that the data structure of DDPs is undocumented for most online platforms and can be subject to modifications from the entities that issue them. As such, the processing scripts that are developed to extract information from these DDPs face deprecation threats and require the researchers' manual intervention. This thesis investigates the feasibility of employing Large Language Models (LLMs) in automatically processing and extracting relevant information from these structures by generating code as an alternative to traditional, manually maintained data analysis tools and in an effort to minimize manual script development and adjustment. It also aims to make data donation research more accessible by examining this approach as a means for researchers without technical expertise to analyze and interpret structurally complex DDPs.

Methods & Data: This study evaluates six open-source LLMs across thirteen Instagram DDPs of varying size and structural complexity to examine their capabilities and limitations in this domain. The models are asked a series of data-specific queries designed to assess their abilities in interpreting the provided data package, retrieving correct information, and processing it accordingly. Two main experimental settings are implemented and compared, where context on the supplied information, response formatting, and data structure are provided to the models across different setups. In the first setting, this information is integrated through an external knowledge base using Retrieval-Augmented Generation (RAG), whereas in the second, it is directly provided to the models within the prompt. The outputs of the models in each setting are assessed in terms of accuracy, common error patterns, and code generation habits.

Results: Although RAG is a methodology originally developed to reduce hallucination and improve models' responses, our findings reveal that, for this task, directly providing the context in prompts yields higher accuracy in comparison. Overall, the models' performance is unsatisfactory in both settings. Their shortcomings can be attributed to the overly complex structure of the packages used in this evaluation. Ambiguous and similar naming conventions are observed throughout the Instagram DDP structure and enhance the hallucination and inaccuracy of the models' outputs. In an effort to improve the models' performance, when manually designed and explicit instructions on how to navigate these packages are provided, the LLMs perform substantially better, with some achieving near-optimal results. These instructions would also be subject to changes in response to structural updates to the DDPs provided by the online platforms. Ironically, the very manual intervention that this research sought to reduce is necessary in achieving greater performance for the evaluated LLMs at this current stage.

Added Value: This research presents an evaluation of LLMs for adaptive code generation in donated data analysis tasks and explores their potential as an alternative approach in this domain. It lays the groundwork for easing the skill-based entry that data researchers without programming skills face in navigating the structural complexities and challenges inherent in data donation studies. By identifying the strengths and current limitations of LLMs in understanding and adapting to evolving data structures, this study helps set realistic expectations for their application in this field while highlighting the considerable room for future improvement.

Reinforcement Learning for Optimising the Vehicle Routing Problem

Abigail Hayes

University of Mannheim, Germany

Relevance & Research Question

The Travelling Salesman Problem (TSP) requires identifying the shortest route to visit a set of locations and return to the starting depot, given the locations and the distances between them. The Vehicle Routing Problem (VRP) is similar but with multiple vehicles, with the number of customers per route limited by vehicle capacities. This was first formulated in 1959 by Dantzig and Ramser [1] and has since been further extended, such as restricting the time window for customer visits.

The VRP is NP-hard, and so finding exact solutions becomes computationally intractable with increased problem instance size. As a result, heuristic and metaheuristic algorithms have replaced exact calculations. More recently, routes are found using reinforcement learning (RL) approaches, requiring less task specific expertise than heuristic selection but with the computational cost of deep learning.

With each new RL model, the creators generally compare against either heuristics or the very earliest RL models from Nazari et al. [2] and Kool et al. [3]. Therefore, it is often unclear the extent to which new methods achieve good performance. Additionally, whilst heuristics require computation for each new instance, RL models instead have a high up-front cost due to their training which must be justified. This thesis evaluates a range of RL methods against heuristics for the Capacitated VRP. Specific attention is paid to complex problem variants (the Time Window variant) and complex problem instances (such as with more customers). It aims to determine whether RL methods should be used over established and theory-informed heuristics.

Methods & Data

Six RL models for the CVRP are compared, covering a range of architectures and optimisation procedures: Nazari [2], AM [3], AM PPO, POMO [4], Sym-NCO [5] and MDAM [6]. The Nazari model is unique in using a recurrent neural network, with all others based on a Transformer architecture. The Attention Model (AM) uses a Graph Attention Model with REINFORCE. This is further adapted to use proximal policy optimisation (AM PPO), to consider multiple possible solutions concurrently (Policy Optimisation with Multiple Optima (POMO)) or to exploit problem and solution symmetries by adapting the REINFORCE rewards (Sym-NCO). The Multi-Decoder Attention Model (MDAM) is a further Transformer model with multiple decoders. Additionally, a two-step method is included using a heuristic to group the nodes and an AM TSP model to build each route. Only AM, AM PPO, Sym-NCO and MDAM are applied to the CVRP-TW. All RL models are provided with large quantities of random problem instances for training.

Initial evaluation of the models uses standard CVRP and CVRP-TW benchmarks. Most of the CVRP benchmark instances are randomly generated, whilst the CVRP-TW benchmark deliberately varies the location patterns, customers per vehicle and time windows. Further systematic test instances were created for the CVRP by varying the position of the depot, distribution and number of customers and maximum demand from a single customer. This variation induces varied instance difficulties.

A robust baseline is provided by generating solutions with a range of heuristic and metaheuristic algorithms. Evaluation on all datasets considers how often valid solutions are returned and compares the average solution distances. Additional considerations are the number of vehicles used in a solution and the computation time.

Results

Whilst all apart from one heuristic approach finds valid solutions for all instances, multiple RL models fail for at least one problem instance. The Nazari model implementation specifically had validity issues which excluded it from all further evaluation.

Regarding the quality of the solutions, the heuristics and metaheuristics consistently provide solutions within 4% of the optimum for CVRP benchmark problems, whilst even the best RL methods are more than 10% worse. When it provides valid solutions, the AM TSP two-step model outperforms other RL models, likely due to multiple TSP training instances being included in a single CVRP training instance. AM and POMO deliver the best RL solutions but consistently fall behind the heuristics.

A similar pattern appears with the CVRP-TW problems, although Syn-NCO is instead consistently the best RL model. With the more complex problem variant, all models are prone to using more vehicles than the optimal solution, likely due to the increased difficulty of finding any valid solution. The gap in performance between the heuristics and the RL models often increases with the more complex instances e.g. with more customers.

A time limit of 60 seconds per instance is pre-specified for heuristics. For the RL methods, producing solutions for an individual instance is almost instantaneous, but training can require large amounts of time. The 4 hours 10 minutes for training and testing the AM (10 location) model is the equivalent of finding solutions to 250 instances using a heuristic method. When training on larger instances the situation is much worse, with a training time increase of 358% for POMO with 20 customers compared to 10.

Added Value

The results demonstrate the robustness of the heuristic and metaheuristic methods such that RL approaches could only be viable where the same model will be used extensively. The RL models with even longer training periods might meet the heuristic performance but would only justify the computation cost after 1000s of uses.

The surprisingly competitive performance of the two-step AM TSP approach signals the crucial role of method design. This is already apparent in the heuristics and metaheuristics but often overlooked for RL. In this instance, breaking down the problem into smaller steps and only using RL for the more difficult component enables the model to train more efficiently.

[1] Dantzig, G. B. and J. H. Ramser (1959). The Truck Dispatching Problem. Management Science 6 (1), 80–91

[2] Nazari, M., A. Oroojlooy, L. Snyder, and M. Takac (2018). Reinforcement Learning for Solving the Vehicle Routing Problem. In Proceedings of the 32nd International Conference on Neural Information Processing Systems

[3] Kool, W., H. van Hoof, and M. Welling (2019). Attention, Learn to Solve Routing Problems! In Proceedings of the 7th International Conference on Learning Representations

[4] Kwon, Y.-D., J. Choo, B. Kim, I. Yoon, Y. Gwon, and S. Min (2020). POMO: Policy Optimization with Multiple Optima for Reinforcement Learning. In Proceedings of the 34th Conference on Neural Information Processing Systems

[5] Kim, M., J. Park, and J. Park (2022). Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization. In Proceedings of the 36th Conference on Neural Information Processing Systems

[6] Xin, L., W. Song, Z. Cao, and J. Zhang (2021, May). Multi-Decoder Attention Model with Embedding Glimpse for Solving Vehicle Routing Problems. In Proceedings of the 35^th AAAI Conference on Artificial Intelligence

GOR 26 - Annual Conference & Workshops

Annual Conference- Rheinische Hochschule Cologne, Campus Vogelsanger Straße
26 - 27 February 2026

GOR Workshops - GESIS - Leibniz-Institut für Sozialwissenschaften in Cologne
25 February 2026

Conference Agenda