23-meta-science-1: 1
Quantifying the variation in effects due to multiplicity of analysis strategies – A conceptual perspective
Susanne Strohmaier, S. Necdet Cervirme, Georg Heinze, Michael Kammer, Moritz Pamminger, Daniela Dunkler
Medical University of Vienna, Austria
When addressing a particular research question using observational data, researchers must make many decisions during the conceptualization of the statistical analysis plan (SAP). This garden of forking paths is a well-known problem leading to low replicability of research findings, as each decision can lead to different results, even if each decision on its own is scientifically justifiable.
In our ongoing project “Towards precise statistical analysis plans facilitating microdata analyses to advance health research” (TOPSTATS), we explore the variation in the relevant estimates in three case studies utilizing routinely collected Austrian data. These real-world applications originate from occupational epidemiology, nephrology, and pharmacoepidemiology, and rely on distinct time-to-event methodologies to address potentially causal research questions.
Several concepts have been proposed to tackle the consequences of the multiplicity of analysis strategies. For example, the social science literature promotes a multi-model approach to present a preferred model estimate in the context of results from other plausible models. This idea is similar to the concept of sensitivity analysis often used in epidemiological research, but focuses on the analysis models. The data science community is more concerned with multiverse-style methods aiming to neutrally present the results of ‘all’ possible analysis decisions. Additionally, multi-analyst approaches examine the variation due to analytic choices deemed appropriate by independent analysts. We discuss the advantages and limitations of these existing concepts and suggest a multi-analysis approach that integrates elements from all these approaches, while maintaining the overarching goal of statistics: supporting decision-making in the face of uncertainty.
The core idea of TOPSTATS is to develop, together with international methodological experts, a consensus state-of-the-art (SOTA) SAP for a relevant target estimand in each case study, as well as a meta-SAP comprising plausible alternative decisions at each step of the SAP. Data analysis then follows the SOTA-SAP and all sensible pathways through the meta-SAP. We suggest presenting the result of our preferred (i.e., SOTA) analysis strategy in the context of a distribution of plausible estimates, while highlighting how decisions at different stages of the analysis path affect the estimate of interest.
Motivated by our case studies, we present a pragmatic strategy for identifying worthwhile paths through the landscape of meta-SAPs (a selection is necessary given the sheer number of possible paths), following ideas of a principled multiverse, and discuss possible measures to empirically quantify the influence of particular decisions.
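A minimal sketch of how a meta-SAP could be enumerated and summarised is given below. The decision nodes, their options, the estimate_effect routine, and the chosen SOTA path are hypothetical placeholders rather than the TOPSTATS case studies; the point is only to illustrate enumerating plausible paths, placing the preferred estimate within the distribution of plausible estimates, and contrasting the options at a single decision node.

```python
from itertools import product

import numpy as np
import pandas as pd

# Hypothetical decision nodes of a meta-SAP, each with plausible options.
meta_sap = {
    "eligibility": ["strict", "broad"],
    "exposure_def": ["ever_use", "cumulative"],
    "confounder_set": ["minimal", "extended"],
    "censoring": ["administrative", "competing_risk"],
}

def estimate_effect(path, data):
    """Placeholder for running one fully specified SAP on the data.

    In practice this would fit the chosen time-to-event model and return
    the estimate of the target estimand (e.g., a hazard ratio).
    """
    rng = np.random.default_rng(abs(hash(path)) % 2**32)
    return float(np.exp(rng.normal(loc=0.1, scale=0.05)))  # dummy hazard ratio

data = None  # stands in for the study dataset

# Enumerate all paths through the meta-SAP and collect one estimate per path.
paths = list(product(*meta_sap.values()))
results = pd.DataFrame(paths, columns=list(meta_sap.keys()))
results["estimate"] = [estimate_effect(p, data) for p in paths]

# Place the preferred (SOTA) estimate in the distribution of plausible estimates.
sota_path = ("strict", "cumulative", "extended", "competing_risk")
print(results["estimate"].describe())
print(f"SOTA estimate: {estimate_effect(sota_path, data):.3f}")

# Influence of a single decision node: compare estimates by option.
print(results.groupby("confounder_set")["estimate"].agg(["mean", "std"]))
```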
This work was supported through the ÖAW project DATA_2023-32_TopstatsMicrodata.
23-meta-science-1: 2
Uncertainty in individual predictions: sample size versus model and modeler choices
Toby Hackmann1, Ben van Calster1,2,3, Liesbeth C de Wreede1, Ewout W Steyerberg1,4
1Department of Biomedical Data Sciences, LUMC, The Netherlands; 2Department of Development and Regeneration, KU Leuven, Belgium; 3EPI-Center, KU Leuven, Belgium; 4Julius Center for Health Sciences and Primary Care, UMCU, The Netherlands
Background/Introduction: Epistemic uncertainty in clinical prediction models is the uncertainty that can be estimated. It consists of sampling/approximation uncertainty and model-related uncertainty; the latter comprises uncertainty about the true model (model uncertainty) and about the knowledge, preferences, and choices of the model developer (modeler uncertainty). We aim to quantify model and modeler uncertainty separately from sampling uncertainty.
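One way to write down this decomposition, anticipating the nested structure described in the Methods below, is as follows. The notation is ours and is only a minimal formalisation, not necessarily the authors' exact model.

```latex
% Minimal formalisation (our notation): prediction p_{icmb} for patient i from
% model category c, within-category choice m, and bootstrap replicate b.
\[
  p_{icmb} = \mu_i + u_{ic} + v_{icm} + \varepsilon_{icmb},
  \qquad
  u_{ic} \sim N\big(0, \sigma^2_{\mathrm{model}}\big), \quad
  v_{icm} \sim N\big(0, \sigma^2_{\mathrm{modeler}}\big), \quad
  \varepsilon_{icmb} \sim N\big(0, \sigma^2_{\mathrm{sampling}}\big),
\]
\[
  \operatorname{Var}\big(p_{icmb} \mid \mu_i\big)
  = \sigma^2_{\mathrm{model}} + \sigma^2_{\mathrm{modeler}} + \sigma^2_{\mathrm{sampling}}.
\]
```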
Methods: As a case study, we developed prediction models for 30-day mortality after acute myocardial infarction as a binary outcome, based on data from the GUSTO-I trial. Models contained various subsets of potential predictors in different model classes. We quantified variability with a random-effects model with model-based predictions as the outcome; multiple model-based predictions, arising from model and modeler choices, are available for each patient. Sampling uncertainty was estimated through bootstrapping. Model categories (GLM, random forest, neural network) were modelled by the highest-level random effect, and within-category choices (hyperparameters, variable selection, regularization) by a second-level random effect, for a total of 96 combinations. Combined with 100 bootstrap replications, 9,600 predictions were made per patient. Model choices were based on a scoping review of common modeling choices, and within-category choices were optimized using common performance metrics. Sample sizes were 400, 2,000, and 10,000 patients, with 7% experiencing the event of interest.
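The following sketch illustrates, on simulated toy data, how the spread of per-patient predictions could be separated into model/modeler and sampling components. Column names, the simulated data, and the variance magnitudes are made up; the authors fit a two-level random-effects model, whereas this sketch uses a simple groupby-based decomposition rather than their approach or the GUSTO-I data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_categories, n_choices, n_boot, n_patients = 3, 4, 100, 50

rows = []
for pat in range(n_patients):
    base = rng.beta(2, 25)                       # patient-specific baseline risk
    for cat in range(n_categories):
        cat_shift = rng.normal(0, 0.01)          # model-category effect
        for cho in range(n_choices):
            cho_shift = rng.normal(0, 0.005)     # within-category (modeler) effect
            for b in range(n_boot):
                noise = rng.normal(0, 0.02)      # sampling (bootstrap) variation
                pred = np.clip(base + cat_shift + cho_shift + noise, 0, 1)
                rows.append((pat, cat, cho, b, pred))

preds = pd.DataFrame(rows, columns=["patient", "category", "choice",
                                    "bootstrap", "prediction"])

# Average over bootstrap replicates to get one point prediction per model spec.
point = preds.groupby(["patient", "category", "choice"])["prediction"].mean()

# Spread across all model specifications (model + modeler uncertainty) ...
model_sd = point.groupby(level="patient").std()
# ... spread across within-category choices only (modeler uncertainty) ...
within_sd = (point.groupby(level=["patient", "category"]).std()
                  .groupby(level="patient").mean())
# ... and spread across bootstrap replicates (sampling uncertainty).
sampling_sd = (preds.groupby(["patient", "category", "choice"])["prediction"]
                    .std().groupby(level="patient").mean())

print(pd.DataFrame({"model_plus_modeler_sd": model_sd,
                    "within_category_sd": within_sd,
                    "sampling_sd": sampling_sd}).mean().round(3))
```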
Results: The variance between predictions for an individual patient based on different model categories and/or within-category choices was substantial. Preliminary results showed that the within-category standard deviation for predictions based on logistic regression models was 0.005, compared with standard errors due to sampling of 0.024 for 1,000 patients and 0.011 for 5,000 patients.
Conclusions: Model and modeler uncertainty are major sources of uncertainty beyond the sampling uncertainty that can be estimated by bootstrapping. Following guidance on good practices could reduce the variability between model choices and make predictions for individual patients more stable across different modelers.
Acknowledgements: We would like to thank Napsugar Forró, MSc for her help in developing the mixed-model approach for the estimation of between-model uncertainty.
23-meta-science-1: 3
Quantifying reproducibility - Results from a scoping review on reproducibility metrics and simulation studies into their real-world applicability
Rachel Heyard1, Samuel Pawel1, Joris Frese2, Bernhard Voelkl3, Hanno Würbel3, Sarah K McCann4, Kimberley E Wever5, Helena Hartmann6, Louise Townsin7, Stephanie Zellers8, Leonhard Held1
1University of Zurich, Switzerland; 2European University Institute, Florence, Italy; 3University of Bern, Bern, Switzerland; 4QUEST Center, BIH, Berlin, Germany; 5Radboud University Medical Center, Nijmegen, the Netherlands; 6University Hospital Essen, Essen, Germany; 7Torrens University Australia, Australia; 8University of Helsinki, Helsinki, Finland
Background - Replication studies and large-scale replication projects (such as the Reproducibility Project: Cancer Biology) aiming to quantify different aspects of reproducibility are increasingly common. The iRISE (improving Reproducibility In SciencE) consortium defines reproducibility as “the extent to which the results of a study agree with those of replication studies”. Currently, no standardized approach to measuring reproducibility exists and a diverse set of metrics is in use. Further, little is known about the applicability and performance of reproducibility metrics under various conditions and in real-world contexts.
Methods - To identify reproducibility metrics, we conducted a scoping review of large-scale replication projects that used metrics and of methodological papers that suggested or discussed them. A list of 49 large-scale projects was compiled by the research team, and 97 methodological papers were identified through a search in Scopus, MEDLINE, PsycINFO, and EconLit. To study the metrics’ applicability in various real-world contexts, simulation studies and real-world data analyses were performed. We were specifically interested in whether metrics initially developed to assess the success of direct replication studies can also be applied to the translation of results from preclinical animal studies to human trials. To ensure the practical relevance of this translation simulation study, we used real-world data to select simulation parameters.
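For illustration, the sketch below implements two commonly used replication-success metrics, significance agreement and a 95% prediction-interval criterion, on hypothetical effect estimates. These are examples only; they do not reproduce the 50 metrics identified in the review or the translation simulation study itself.

```python
import numpy as np
from scipy import stats

def significance_agreement(theta_o, theta_r, se_r, alpha=0.05):
    """Success: replication effect significant and in the original direction."""
    p_r = 2 * stats.norm.sf(abs(theta_r / se_r))
    return bool(p_r < alpha and np.sign(theta_r) == np.sign(theta_o))

def prediction_interval_success(theta_o, se_o, theta_r, se_r, level=0.95):
    """Success: replication estimate lies within the prediction interval
    implied by the original estimate and both standard errors."""
    z = stats.norm.ppf(1 - (1 - level) / 2)
    return bool(abs(theta_r - theta_o) <= z * np.sqrt(se_o**2 + se_r**2))

# Hypothetical original (e.g., preclinical) and replication (e.g., clinical)
# results on the log-odds-ratio scale.
theta_o, se_o = 0.45, 0.15
theta_r, se_r = 0.20, 0.12
print("significance agreement:", significance_agreement(theta_o, theta_r, se_r))
print("prediction interval:", prediction_interval_success(theta_o, se_o, theta_r, se_r))
```

As the example shows, the two metrics can disagree on the same pair of studies, which is exactly why the choice of metric matters for the conclusions of a replication project.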
Results - We identified 50 reproducibility metrics and characterized them based on their type (e.g. formulas and/or statistical models, graphical representations, algorithms), input required, and appropriate application scenarios. We found that each metric addresses a distinct research question. Preliminary results from our simulation study indicate that specific metrics are more useful in some contexts compared to others, showing the importance of the choice of the metric for the validity of (large-scale) replication projects.
Conclusion - To support future replication teams and meta-researchers, we provide a comprehensive and interactive “live” table to guide the selection of the most appropriate metrics aligned with the goals of a study or large-scale project. We present assumptions and limitations of some commonly used metrics and give tangible recommendations for their application in various contexts, including the translation of results from preclinical animal studies to human trials.
23-meta-science-1: 4
Data quality, a blind spot in study and reporting guidelines
Carsten Oliver Schmidt
University Medicine Greifswald, Germany
Background: Ensuring that data is ‘fit for purpose’ is fundamental to credible, replicable and reproducible scientific research. Numerous works discuss concepts and tools related to data quality. However, does this translate into explicit requirements for transparent data quality reporting in reporting guidelines, appraisal tools, and journal author instructions? This work critically examines how such guidance documents address data quality.
Methods: A comprehensive review of key reporting guidelines, appraisal tools [2], and journal author instructions was conducted to assess the extent to which data quality is explicitly considered. The analysis included, for example, major guidelines such as CONSORT, SPIRIT, STROBE, PRISMA, STARD, TRIPOD, and TRIPOD-AI, as well as journal submission guidelines from high-impact medical journals. The review focused on explicit mentions of data quality as well as on the coverage of its key constituent elements, including data integrity, completeness, correctness, and variability.
Results: Findings indicate a striking lack of emphasis on data quality within existing guidance documents. Among the major reporting guidelines, only limited and vague references to ‘data quality’ or related processes exist, often without concrete definitions or structured reporting requirements. There was only a single mention of ‘data quality’ across hundreds of evaluation criteria in 49 appraisal tools. Key data quality components such as measurement error, misclassification, and heterogeneity are inconsistently addressed, with most guidelines omitting them entirely. A review of journal author instructions revealed similarly limited references to data quality, with most guidance focusing on study design and statistical methods rather than on empirical evidence about the quality of the underlying data.
Conclusion: Transparency regarding data quality should be the norm, not the exception. However, the absence of any systematic coverage of data quality-related aspects represents a major shortcoming in current guidance practice; consequently, the reverse is the case, and transparency remains the exception rather than the norm [1]. To address this fundamental flaw, structured and transparent data quality reporting should be a key component of all relevant guidelines and should be routinely enforced by journals and funding agencies.
[1] Huebner et al., on behalf of the Topic Group “Initial Data Analysis” of the STRATOS Initiative. Hidden analyses: a review of reporting practice and recommendations for more transparent reporting of initial data analyses. BMC Medical Research Methodology 2020; 20(1): 61. doi:10.1186/s12874-020-00942-y
[2] Jiu et al. Tools for assessing quality of studies investigating health interventions using real-world data: a literature review and content analysis. BMJ Open 2024; 14(2): e075173.
23-meta-science-1: 5
To adjust or not: it is not the tests performed that count but how they are reported and interpreted
Sabine Hoffmann, Simon Lemster, Juliane Wilcke, Anne-Laure Boulesteix
LMU Munich, Germany
Most original articles published in the medical literature report the results of multiple statistical tests. In a few simple cases, there is general agreement on whether one should adjust for multiple testing. For many cases encountered in practice, however, this is less clear, and the recommendations in the literature are contradictory along different dimensions or otherwise confusing. This lack of clear guidance may hinder the conduct of analyses and encourage questionable research practices, ultimately jeopardizing the credibility of medical research. In this project, we present a unifying criterion that supports researchers in deciding whether to adjust for multiple testing and, if so, over which set of hypotheses. We relate the criterion to previous rules proposed in the literature and illustrate its use in two complex multiple testing situations. In addition, we show that the criterion also addresses the multiple testing situation resulting from the multiplicity of possible analysis strategies for a given research question and dataset, which, if not handled properly, opens the door to fishing for significance and false-positive findings.
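The unifying criterion itself is not reproduced here. As a purely illustrative sketch with assumed p-values, the code below only shows the mechanics that such a decision governs once the set of hypotheses has been chosen: adjusting over one set, another, or all tests jointly (here with Holm's procedure via statsmodels).

```python
from statsmodels.stats.multitest import multipletests

primary_p = 0.012                            # primary endpoint (hypothetical)
secondary_p = [0.004, 0.031, 0.049, 0.120]   # family of secondary endpoints (hypothetical)

# Option A: interpret the primary endpoint on its own, unadjusted.
print(f"primary endpoint, unadjusted: p = {primary_p}")

# Option B: control the family-wise error rate over the secondary endpoints only.
reject, p_adj, _, _ = multipletests(secondary_p, alpha=0.05, method="holm")
print("secondary endpoints (Holm):", list(zip(p_adj.round(3), reject)))

# Option C: adjust jointly over all reported tests.
reject_all, p_all, _, _ = multipletests([primary_p] + secondary_p,
                                        alpha=0.05, method="holm")
print("all tests (Holm):", list(zip(p_all.round(3), reject_all)))
```

Which of these options is appropriate, and how the adjusted or unadjusted results are then reported and interpreted, is exactly the question the proposed criterion is meant to answer.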