Conference Agenda

Session

MSE2: Methodology of Statistical Surveys 2

Time:

Wednesday, 03/Sept/2025:

11:00am - 12:40pm

Session Chair: Florian Dumpert, Statistisches Bundesamt, Germany

Location: E.03.112

Presentations

11:00am - 11:25am

Imputing Missing Values in Official Statistics: Assessing Imputation Accuracy across Various Imputation Methods.

Florian Dumpert¹, Markus Pauly²,³, Maria Thurow²,³, Inken Veips²

¹Statistisches Bundesamt (Destatis), Germany; ²TU Dortmund University, Germany; ³UA Ruhr, Research Center Trustworthy Data Science and Security, Germany

Handling missing values is an important part of data preparation that may affect later research outcomes. We conduct a comparative simulation study of imputation methods. It covers widely used methods such as missRanger (Random Forest based imputation) and MICE (Multiple Imputation by Chained Equations), as well as two approaches developed specifically for use in official statistics: CANCEIS (Canadian Census Edit and Imputation System) from Statistics Canada, which offers the option of adding plausibility rules to the imputation, and VIM (Visualization and Imputation of Missing Values) from Statistics Austria.

Using the German Structure of Earnings Survey from 2010 to 2018, we show how to assess imputation methods based on their (multivariate) imputation accuracy. Since the term “imputation accuracy” is not uniquely defined in theory or practice, we use several measures in our analyses: besides commonly used accuracy measures such as the normalized root mean squared error (NRMSE) and the proportion of false classification (PFC), we also consider distribution distance measures.
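The two named accuracy measures can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the NRMSE is normalized, so dividing by the variance of the true values is an assumption (one common convention), and the function names and sample values are hypothetical.

```python
import math

def nrmse(true_vals, imputed_vals):
    """RMSE over the imputed cells, normalized by the variance of the true values.

    Assumption: normalization by the (population) variance of the true values;
    other normalizations (range, mean) are also in use.
    """
    n = len(true_vals)
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    mean_true = sum(true_vals) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_true)

def pfc(true_labels, imputed_labels):
    """Proportion of imputed categorical cells that differ from the true category."""
    mismatches = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return mismatches / len(true_labels)

# Hypothetical values for the cells that were missing and then imputed.
true_earnings = [10.0, 20.0, 30.0, 40.0]
imputed_earnings = [12.0, 18.0, 33.0, 37.0]
true_sector = ["a", "b", "a", "c"]
imputed_sector = ["a", "b", "c", "c"]

print(nrmse(true_earnings, imputed_earnings))
print(pfc(true_sector, imputed_sector))
```

Both measures are computed only on cells whose true value is known, which is why they are typically used in simulation studies where values are deleted artificially.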

Because data plausibility is another crucial aspect of data preparation in official statistics, we additionally compare the imputation methods by their ability to produce imputed data that satisfy predefined edit rules.
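One way to quantify this ability is the share of imputed records that pass every edit rule. The sketch below assumes that setup; the rules, variable names, and records are hypothetical and are not taken from the survey or from any of the tools named above.

```python
# Hypothetical edit rules: each maps a record to True (plausible) or False.
EDIT_RULES = {
    "non_negative_earnings": lambda r: r["gross_earnings"] >= 0,
    "hours_within_week": lambda r: 0 < r["weekly_hours"] <= 80,
    "minors_not_senior_staff": lambda r: not (r["age"] < 18 and r["position"] == "senior"),
}

def failed_rules(record):
    """Names of all edit rules that the record violates."""
    return [name for name, rule in EDIT_RULES.items() if not rule(record)]

def plausibility_rate(records):
    """Share of records that satisfy every edit rule."""
    return sum(not failed_rules(r) for r in records) / len(records)

# Hypothetical records after imputation.
imputed = [
    {"gross_earnings": 3200.0, "weekly_hours": 38.5, "age": 41, "position": "senior"},
    {"gross_earnings": 900.0, "weekly_hours": 20.0, "age": 17, "position": "senior"},
    {"gross_earnings": -50.0, "weekly_hours": 95.0, "age": 30, "position": "junior"},
    {"gross_earnings": 2100.0, "weekly_hours": 40.0, "age": 25, "position": "junior"},
]

print(plausibility_rate(imputed))
for record in imputed:
    print(failed_rules(record))
```

Computing this rate per imputation method gives a plausibility score that can be reported next to the accuracy measures.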

11:25am - 11:50am

From Practice to Methodology: Evaluating Imputations in Official Statistics

Steffen Moritz, Florian Dumpert

Statistisches Bundesamt, Germany

High data quality is a basic prerequisite for reliable official statistics. Because surveys regularly contain erroneous or missing responses, imputation plays a central role. Suitable imputation methods can reduce potential bias and safeguard the informative value of statistical results.

In practice, however, choosing a suitable method is often highly complex. The true "missing" value is usually unknown and therefore cannot be used to assess quality. This lack of ground truth makes it harder to evaluate imputation accuracy objectively and to compare methods. There is a risk of favoring methods that appear suitable for certain analysis purposes but cause systematic bias in other contexts.

Evaluation methods currently used in practice include explanatory graphics for comparing distributions, model fit measures, and simulation studies. The available metrics range from RMSE, MAE, and MAPE to a wide variety of distribution-based measures. Each of these methods and metrics has specific strengths and weaknesses, and none is universally suitable. This variety hinders consistent application in practice, because official statistics still lacks an overarching evaluation framework. Accordingly, methods are often chosen ad hoc, and the approaches used vary widely both internationally and within individual statistical offices.

The talk discusses the need for a structured and robust evaluation framework for imputations that systematizes existing methods, supports consistent application, and increases comparability. The aim is to show ways to strengthen transparency and reproducibility in quality assurance for missing data, nationally and internationally.

11:50am - 12:15pm

Options for ML-Based Imputation Methods for Item Nonresponse, Using the Manufacturing Statistics (Statistiken des Verarbeitenden Gewerbes, StatVG) as an Example

Elena Stäger, Muhammet Akman, Richard Bündgens, Christian Borgs

IT.NRW, Statistisches Landesamt Nordrhein-Westfalen, Germany

In the manufacturing statistics (StatVG), domestic turnover is a key figure. When an enterprise does not report, missing values are currently replaced by the last known value of the establishment concerned (LOCF, Last Observation Carried Forward, imputation). Imputation of missing values with CANCEIS, which is based on a k-NN algorithm, has also already been tested.
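As a minimal sketch of the LOCF idea for a single establishment's series (the function name and figures are hypothetical, and the handling of gaps before the first report is an assumption, since the abstract does not cover that case):

```python
def locf(series):
    """Replace each None with the most recent observed value.

    Assumption: values missing before the first observation stay None,
    because there is nothing to carry forward.
    """
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical monthly domestic turnover of one establishment; None marks a missing report.
turnover = [120.0, None, 131.5, None, None, 140.2]
print(locf(turnover))  # [120.0, 120.0, 131.5, 131.5, 131.5, 140.2]
```

LOCF ignores trend and seasonality within the series, which is one reason it serves as a baseline to improve on.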

The aim of the project is to outperform both LOCF and CANCEIS for individual missing values with a better method. Machine learning methods are also to be tested: time series imputations (imputeTS) are compared with tree-based methods (missRanger, missForest) and neural networks (LSTM NN).

To this end, turnover values in the StatVG dataset are randomly set to missing, so that the imputed values can be compared with the known values. The central aim is to substantially reduce the deviation between imputed and actual values (evaluated via the RMSE) relative to the existing methods.
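The evaluation design can be sketched as follows: hide known values at random, impute them, and compute the RMSE on the hidden positions only. This is an illustration under stated assumptions, not the project's code. The masking share, the seed, the sample data, and the mean imputer (a stand-in for whichever method is being tested) are all hypothetical.

```python
import math
import random

def mask_at_random(values, share, rng):
    """Return a copy with a random share of entries set to None, plus the masked indices."""
    k = max(1, round(share * len(values)))
    masked_idx = sorted(rng.sample(range(len(values)), k))
    masked = list(values)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx

def rmse_on_masked(true_values, imputed, masked_idx):
    """RMSE between true and imputed values, restricted to the masked positions."""
    squared = [(true_values[i] - imputed[i]) ** 2 for i in masked_idx]
    return math.sqrt(sum(squared) / len(squared))

def mean_impute(values):
    """Stand-in imputer (hypothetical): fill gaps with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def evaluate(imputer, true_values, share, seed):
    """Mask, impute, and score one imputer; a fixed seed makes runs comparable."""
    masked, masked_idx = mask_at_random(true_values, share, random.Random(seed))
    return rmse_on_masked(true_values, imputer(masked), masked_idx)

# Hypothetical complete turnover values.
turnover = [100.0, 110.0, 95.0, 120.0, 130.0, 105.0, 98.0, 115.0, 125.0, 102.0]
print(evaluate(mean_impute, turnover, share=0.2, seed=0))
```

Passing each candidate imputer to evaluate with the same seed scores all methods on the same masked cells, so their RMSE values are directly comparable.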

The results show that an ML-based imputation method (missRanger) substantially reduces the deviations between true and imputed values for individual missing values compared with the reference methods. This makes the method a promising use case for further developing the imputation methodology for individual missing values, both for the StatVG and for the German statistical network (Statistischer Verbund).

12:15pm - 12:40pm

Handling constraints in automated statistical data editing via full conditional distributions

Christian Aßmann¹,², Ariane Würbach¹, Katja-Verena Bürk³, Florian Dumpert³

¹Leibniz Institute for Educational Trajectories Bamberg, Germany; ²Chair of Survey Statistics and Data Analysis, Otto-Friedrich-University Bamberg, Germany; ³German Federal Statistical Office

Reported survey data are prone to inaccuracies due to respondent error: reported values may be missing or implausible, i.e., they do not satisfy logical constraints. When such logical constraints arise from the interaction of multiple variables, it is also unclear which variable or variables are actually erroneous. A standard method used by statistical offices to correct data and ensure data consistency is the use of edit-imputation routines following the Fellegi-Holt paradigm. Such an easily computable heuristic does not necessarily exploit all the information available in the observed data. An alternative that incorporates all available information is to apply Bayesian methods in the form of full conditional distributions of missing values, which properly accounts for the uncertainty that arises when erroneous values are replaced. While Bayesian approaches based on parametric models are available in the literature for categorical and continuous data, this paper presents a method for specifying full conditional distributions using classification and regression trees instead, while taking into account nested balance constraints, i.e., linked constraints involving multiple variables. The CART algorithm was chosen because it provides flexible univariate approximations to the full conditional distributions of the variables while reducing the computational intensity of the overall Bayesian approach. Simulation results suggest that, compared to complete case analysis, the average root mean squared error of moment estimates can typically be reduced by 20 to 30 percent when using the nonparametric Bayesian approach and the corresponding specification of full conditional distributions using the CART algorithm.
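To make the notion of nested balance constraints concrete, the sketch below checks two linked constraints in which one total is itself a component of another. It only illustrates the constraint structure and is not the Bayesian method presented in the talk. The variable names, values, and tolerance are hypothetical.

```python
import math

# Hypothetical nested balance constraints: each total must equal the sum of its parts.
# personnel_costs is both a total and a part of total_costs, which links the two.
BALANCE_CONSTRAINTS = {
    "total_costs": ["personnel_costs", "material_costs"],
    "personnel_costs": ["wages", "social_contributions"],
}

def violated_balances(record, tol=1e-6):
    """Names of the totals whose balance constraint the record does not satisfy."""
    return [
        total
        for total, parts in BALANCE_CONSTRAINTS.items()
        if not math.isclose(record[total], sum(record[p] for p in parts), abs_tol=tol)
    ]

consistent = {"wages": 70.0, "social_contributions": 30.0, "personnel_costs": 100.0,
              "material_costs": 50.0, "total_costs": 150.0}
# Same record with a single changed value: personnel_costs is 110 instead of 100.
inconsistent = dict(consistent, personnel_costs=110.0)

print(violated_balances(consistent))    # []
print(violated_balances(inconsistent))  # ['total_costs', 'personnel_costs']
```

A single changed value breaks both constraints here, which illustrates why the violated constraints alone do not reveal which variable is erroneous.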

Statistical Week 2025

2-5 September 2025
Wiesbaden, Germany
