Relevance & Research Question:
Recently there has been a growth of research on whether Large Language Models (LLMs) can be a source of high-quality synthetic survey data. However, research has shown that synthetic survey data produced by LLMs underestimates the variance and correlation patterns found in human data. Additionally, the process of creating synthetic survey data with LLMs inherently involves many researcher degrees of freedom, which can affect the distribution of the resulting data.
In this study we assess the problem of underestimated (co-)variance by systematically varying three factors and observing their impact on synthetic survey data: (1) the number and type of covariates an LLM sees before answering a question, (2) the model used to generate the synthetic survey data, and (3) the way responses are extracted from the model.
Methods & Data:
We use five socio-demographic background questions and seven substantive questions from the 2018 German General Social Survey (ALLBUS) as covariates to have the LLM predict one substantive outcome: the respondent's satisfaction with the government. To predict responses to the target question we use Llama 2 in its chat and non-chat variants, as well as two versions fine-tuned on German text data, to control for differences between LLMs.
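To make the conditioning step concrete, the sketch below illustrates one way such a persona prompt could be assembled from covariate statements before posing the target question. The variable names, German wording, and answer scale are illustrative assumptions, not the exact prompt used in this study.

```python
# Hypothetical sketch (not the study's exact prompt): build a first-person
# persona from covariate statements, then append the target question so the
# model completes the answer.
def build_prompt(covariate_statements: list[str], question: str) -> str:
    persona = " ".join(covariate_statements)       # persona built from covariates
    return f"{persona}\n{question}\nAntwort: "     # model continues after "Antwort: "

covariate_statements = [
    "Ich bin 45 Jahre alt.",   # age (illustrative)
    "Ich bin eine Frau.",      # gender (illustrative)
    "Ich habe Abitur.",        # education (illustrative)
]
question = (
    "Wie zufrieden sind Sie mit der Bundesregierung? "
    "(1 = sehr unzufrieden, 7 = sehr zufrieden)"   # answer scale assumed for illustration
)
print(build_prompt(covariate_statements, question))
```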
Results:
First results show that the (co-)variance in synthetic survey data changes depending on (1) the type and quantity of covariates the model sees, (2) the model used to generate the responses, and (3) whether we simulate from the model-implied probability distribution or only take the most likely response option. In particular, (3), simulating from the model-implied probability distribution, improves the estimation of standard deviations. Covariance estimates, however, remain underestimated.
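To illustrate the two extraction strategies in (3), the following sketch contrasts taking only the most likely response option with sampling from the model-implied distribution over answer options, using next-token logits from an open-weight model via Hugging Face transformers. The checkpoint name, prompt, and seven-point answer scale are assumptions for illustration, not the study's exact pipeline.

```python
# Hypothetical sketch: deterministic vs. stochastic extraction of a survey
# response from next-token logits of an open-weight LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "...persona prompt with covariates...\nAntwort (1-7): "
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token

# Token ids for the answer options "1"..."7" (assumes single-token answers)
option_ids = [tokenizer.encode(str(k), add_special_tokens=False)[-1] for k in range(1, 8)]
option_probs = torch.softmax(logits[option_ids], dim=0)   # model-implied distribution

# (a) deterministic extraction: keep only the most likely response option
modal_answer = int(torch.argmax(option_probs)) + 1

# (b) stochastic extraction: simulate a response from the model-implied distribution
sampled_answer = int(torch.multinomial(option_probs, num_samples=1)) + 1
```

Note that strategy (b) relies on access to the full output distribution, which is one reason the study uses open-source models rather than closed-source APIs.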
Added Value:
We add value in three ways: (1) We provide information on which factors impact the variance of synthetic survey data. (2) By creating German synthetic survey data, we can compare our findings with results from research that has mostly focused on survey data from the US. (3) We show that using open-source LLMs enables researchers to obtain more information from the models than closed-source APIs provide.