Session
B6.1: Automatic analysis of answers to open-ended questions in surveys

Presentations
Using the Large Language Model BERT to categorize open-ended responses to the "most important political problem" in the German Longitudinal Election Study (GLES)

GESIS, Germany

Relevance & Research Question

Open-ended survey questions are crucial, e.g., for capturing unpredictable trends, but the resulting unstructured text data poses challenges. Quantitative use requires categorization, which is costly and time-consuming, especially with large datasets. The German Longitudinal Election Study (GLES) surveys conducted between 2018 and 2022 produced nearly 400,000 uncoded mentions, which prompted us to explore new ways of coding. Our objective was to test various machine learning approaches to determine the most efficient and cost-effective method for a long-term coding solution that also ensures high quality. Which approach is best suited for the long-term coding of open-ended mentions regarding the "most important political problem" in the GLES?

Methods & Data

Before 2018, GLES data was coded manually. Shifting to a (partially) automated process involved revising the codebook. We then used the dataset of nearly 400,000 open responses to the "most important political problem" question from the GLES surveys conducted between 2018 and 2022. Coding was carried out with the Large Language Model BERT (Bidirectional Encoder Representations from Transformers). Before arriving at the final implementation, we tested a range of aspects: hyperparameter fine-tuning, downsizing of the “other” category, simulations of different amounts of training data, quality control across survey modes, and the use of training data from 2017. The "new" codebook already demonstrates high quality and consistency, evident from its Fleiss' kappa value of 0.90 for the matching of individual codes.
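For readers unfamiliar with the agreement measure cited above, the following is a minimal sketch of how Fleiss' kappa is computed from the labels that several coders assign to the same mentions. The abstract does not state how many coders were involved or which software was used, so the function name `fleiss_kappa`, the three-coder setup, and the toy category labels are all hypothetical and serve only to illustrate the formula.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per coder).

    Every item must be rated by the same number of coders (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing coder pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_expected == 1.0:
        # Only one category was ever used; kappa is undefined in this case.
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: four mentions, each coded by three coders.
toy_ratings = [
    ["migration", "migration", "migration"],
    ["climate", "climate", "climate"],
    ["economy", "economy", "climate"],
    ["migration", "migration", "migration"],
]
kappa = fleiss_kappa(toy_ratings)
print(round(kappa, 3))
```

Values close to 1 indicate that coders agree far more often than chance alone would produce, which is the sense in which the reported 0.90 signals a consistent codebook.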
Utilizing this refined codebook as a foundation, 43,000 mentions were manually coded, serving as the training dataset for BERT. The final implementation of coding for the extensive dataset of almost 400,000 mentions using BERT yields excellent results, with a 0/1 loss of 0.069, a micro F1 score of 0.946 and a macro F1 score of 0.878. The outcomes show that the (partially) automated approach codes accurately, owing to the refined codebook and BERT's robust performance. The shift towards advanced language models departs from traditional manual methods and makes the coding process more efficient.

The Genesis of Systematic Analysis Methods Using AI: An Explorative Case Study

TU Dresden, Germany

Relevance & Research Question

The analysis of open-ended questions in large-scale surveys can provide detailed insights into respondents' views that often cannot be assessed with closed-ended questions. However, given the large number of respondents, reviewing the answers to open-ended questions and reporting them as research results takes considerable resources. This contribution aims to show the potential benefits and limitations of using AI-based tools (e.g., ChatGPT) for analyzing open-ended questions in large-scale surveys. It also aims to highlight the challenge of applying systematic analysis methods with AI.

Methods & Data

Results

Added Value

Insights from the Hypersphere - Embedding Analytics in Market Research

SPLENDID Research, Germany

Relevance & Research Question: At the intersection of qualitative and quantitative research, analyzing open-ended questions remains a significant challenge for data analysts. The incorporation of AI language models introduces the complex embedding space: a realm where semantics intertwine with mathematical principles.
This paper explores how Embedding Analytics, a subset of explainable AI, can be utilized to decode and analyze open-ended questions effectively.

Methods & Data: Our approach used the ada_V2 encoder to transform market research responses into spatial representations on the surface of a 1,536-dimensional hypersphere. This process enabled us to analyze semantic similarities using traditional statistics as well as advanced machine learning techniques. We employed K-Means clustering for text grouping and respondent segmentation, and Gaussian Mixture Models for overarching topic analysis across numerous responses. Dimensionality reduction through t-SNE transformed these complex data sets into more comprehensible 2D or 3D visual representations.

Results: Using OpenAI’s ada_V2 encoder, we successfully generated text embeddings that can be plausibly clustered based on semantic content, regardless of language and text length. These clusters, formed via K-Means and Gaussian Mixture Models, yield insightful, automated analyses of qualitative data. The two-dimensional “cognitive constellations” created through t-SNE offer clear and accessible visualizations of intricate knowledge domains, such as brand perception or public opinion.

Added Value: This methodology allows for a precise numerical analysis of verbatim responses without the need for labor-intensive manual coding. It facilitates automated segmentation and simplification of complex data, and even enables qualitative data to drive prediction tasks. The rich, nuanced datasets derived from semantic complexity are suitable for robust analysis using a wide range of statistical methods, thereby enhancing the efficacy and depth of market research analysis.
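The clustering step described in the last abstract can be illustrated with a small sketch. It assumes embeddings that are scaled to unit length, so that they lie on a hypersphere, and groups them with a cosine-based ("spherical") variant of k-means. This variant is a stand-in chosen for clarity and is not necessarily the authors' configuration; the Gaussian Mixture Model and t-SNE steps are omitted. The function names, the deterministic initialization, and the three-dimensional toy vectors (in place of real 1,536-dimensional embeddings) are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, n_iter=20):
    """Cluster vectors by cosine similarity.

    Toy initialization: the first k points serve as starting centroids.
    """
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    points = [normalize(v) for v in vectors]
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assign each point to the centroid with the highest cosine similarity.
        new_labels = [
            max(range(k), key=lambda j: dot(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Move each centroid to the re-normalized mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                continue  # keep the old centroid for an empty cluster
            mean = [sum(col) / len(members) for col in zip(*members)]
            if any(mean):
                centroids[j] = normalize(mean)
    return labels, centroids


# Hypothetical toy "embeddings"; real ones would come from an encoder.
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 1.0, 0.1],
    [1.0, 0.0, 0.0],
]
labels, centroids = spherical_kmeans(toy_embeddings, k=2)
print(labels)
```

For unit vectors a and b, the squared Euclidean distance is 2 - 2(a·b), so picking the nearest centroid by Euclidean distance and picking the one with the highest cosine similarity give the same assignment. This is why distance-based clustering of normalized embeddings reflects semantic similarity.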