Relevance & Research Question:
Recently there has been a growth of research on whether Large Language Models (LLMs) can be a source of high-quality synthetic survey data. However, research has shown that synthetic survey data produced by LLMs underestimates the variance and correlation patterns found in human data. Additionally, the process of creating synthetic survey data with LLMs inherently involves many researcher degrees of freedom, which can affect the distribution of the resulting data.
In this study we assess the problem of underestimated (co-)variance by systematically varying three factors and observing their impact on synthetic survey data: (1) the number and type of covariates an LLM sees before answering a question, (2) the model used to generate the synthetic survey data, and (3) the way responses are extracted from the model.
Methods & Data:
We use five socio-demographic background questions and seven substantive questions from the 2018 German General Social Survey (ALLBUS) as covariates to have the LLM predict one substantive outcome: the respondent's satisfaction with the government. To predict responses to the target question we use Llama 2 in its chat and non-chat variants, as well as two versions fine-tuned on German text data, to control for differences between LLMs.
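To make the conditioning step concrete, the sketch below illustrates one way such a persona prompt could be assembled from covariate statements before posing the target question. The variable names, German wording, and answer scale are illustrative assumptions, not the exact prompt used in this study.

```python
# Hypothetical sketch (not the study's exact prompt): build a first-person
# persona from covariate statements, then append the target question so the
# model completes the answer.
def build_prompt(covariate_statements: list[str], question: str) -> str:
    persona = " ".join(covariate_statements)       # persona built from covariates
    return f"{persona}\n{question}\nAntwort: "     # model continues after "Antwort: "

covariate_statements = [
    "Ich bin 45 Jahre alt.",   # age (illustrative)
    "Ich bin eine Frau.",      # gender (illustrative)
    "Ich habe Abitur.",        # education (illustrative)
]
question = (
    "Wie zufrieden sind Sie mit der Bundesregierung? "
    "(1 = sehr unzufrieden, 7 = sehr zufrieden)"   # answer scale assumed for illustration
)
print(build_prompt(covariate_statements, question))
```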
Results:
First results show that the (co-)variance in synthetic survey data changes depending on (1) the type and quantity of covariates the model sees, (2) the model used to generate the responses, and (3) whether we simulate from the model-implied probability distribution or only take the most likely response option. In particular, (3), simulating from the model-implied probability distribution, improves the estimation of standard deviations. Covariance estimates, however, remain underestimated.
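To illustrate the two extraction strategies in (3), the following sketch contrasts taking only the most likely response option with sampling from the model-implied distribution over answer options, using next-token logits from an open-weight model via Hugging Face transformers. The checkpoint name, prompt, and seven-point answer scale are assumptions for illustration, not the study's exact pipeline.

```python
# Hypothetical sketch: deterministic vs. stochastic extraction of a survey
# response from next-token logits of an open-weight LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "...persona prompt with covariates...\nAntwort (1-7): "
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token

# Token ids for the answer options "1"..."7" (assumes single-token answers)
option_ids = [tokenizer.encode(str(k), add_special_tokens=False)[-1] for k in range(1, 8)]
option_probs = torch.softmax(logits[option_ids], dim=0)   # model-implied distribution

# (a) deterministic extraction: keep only the most likely response option
modal_answer = int(torch.argmax(option_probs)) + 1

# (b) stochastic extraction: simulate a response from the model-implied distribution
sampled_answer = int(torch.multinomial(option_probs, num_samples=1)) + 1
```

Note that strategy (b) relies on access to the full output distribution, which is one reason the study uses open-source models rather than closed-source APIs.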
Added Value:
We add value in three ways: (1) We provide information on which factors impact the variance of synthetic survey data. (2) By creating German synthetic survey data, we can compare our findings with results from research that has mostly focused on survey data from the US. (3) We show that using open-source LLMs enables researchers to obtain more information from the models than closed-source APIs provide.