Conference Agenda

Overview and details of the sessions of this conference.

Session Overview
Session: Model selection and simulations
Time: Monday, 25/Aug/2025, 4:00pm - 5:30pm
Location: Biozentrum U1.101 (Biozentrum, 122 seats)

Presentations
15-model-selection-simulations: 1

A systematic review of variable and functional form selection methods used in Covid-19 prognostic models

Michael Kammer1, Marc Y. R. Henrion2,3, Gregor Buch4, Georg Heinze1

1Medical University of Vienna (Austria); 2Malawi Liverpool Wellcome Research Programme (Malawi); 3Liverpool School of Tropical Medicine (UK); 4University Medical Center of the Johannes Gutenberg University Mainz (Germany)

Background

The Covid-19 pandemic created a pressing need for accurate predictions of health outcomes related to the disease. Numerous statistical and machine-learning models were developed in response. A study by Wynants et al. (2020) found that nearly all were at high risk of bias. Several members of the STRATOS topic group on variable and function selection (TG2) hypothesized that, in response to this public health emergency, researchers relied on modeling strategies familiar to them, or on those they perceived as trustworthy for producing robust results. Consequently, the published models offer a valuable opportunity to examine current practices in variable selection and the specification of functional forms in statistical regression models. On behalf of STRATOS TG2, we systematically reviewed the model-building approaches used in these papers.

Methods

A systematic re-review of the published models in the existing database from Wynants et al. (2020) was conducted. A detailed protocol, comprising inclusion criteria and a structured questionnaire to extract precise information about the modeling strategy used, was prespecified and preregistered. The primary focus of our study was on regression-based models and, specifically, on the methods used for variable selection and the incorporation of functional forms. Data extraction, based on a full-text review, was performed independently by two reviewers per paper, followed by consensus-based consolidation.

Results

A total of 20 reviewers extracted data from 181 regression-based prognostic models. We observed considerable variability in approaches to variable selection and functional forms, with researchers frequently combining multiple methods. Univariable selection was widely used, often within multi-stage variable selection strategies. Only very few studies accounted for non-linear functional forms or interactions, mostly via splines or multivariable fractional polynomials. Many papers also reported statistical inference, e.g., confidence intervals for model coefficients, but failed to account for the additional uncertainty due to model selection. Notably, the existing, if limited, best-practice recommendations for model building were rarely cited. Overall, reporting quality was often poor, making it challenging to determine precisely which modeling strategies were applied.

Conclusion

Our review demonstrates that many study authors relied on simplified modeling strategies, or on combinations of strategies that were never properly tested and have unknown statistical properties. These findings underscore the need for clearer, more comprehensive and more accessible guidance on modeling strategies to support practitioners in developing robust, reliable prediction models.

References

Wynants L, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ. 2020;369:m1328.



15-model-selection-simulations: 2

Robust and Flexible Extension of M-quantile Regression with Adaptive Class of Regularization

Iftikhar Ahmed Shovon, Md Hasinur Rahaman Khan

Institute of Statistical Research and Training, University of Dhaka, Bangladesh

Introduction

Quantile regression is widely used to model the conditional distribution of a dependent variable given a set of predictors. However, its degree of robustness is fixed, which limits its flexibility in various applications. M-quantile regression overcomes this by leveraging influence functions that allow adjustable robustness levels. Building on Ranalli et al. (2023), where Lasso and Elastic Net were used for regularization, this study extends M-quantile regression by incorporating Adaptive Lasso, Adaptive Elastic Net, and penalized B-splines. Traditional regularization methods lack the oracle property and impose uniform penalties, which may result in inconsistent variable selection and suboptimal handling of complex datasets.
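
To fix ideas, the penalized objective can be sketched in generic form (our notation, not necessarily the authors': it assumes a Huber-type influence function and adaptive weights derived from an initial consistent fit):

    \min_{\beta}\ \sum_{i=1}^{n} \rho_{\tau}\!\left( \frac{y_i - x_i^{\top}\beta}{\hat\sigma} \right)
        + \lambda_1 \sum_{j=1}^{p} \hat w_j \lvert \beta_j \rvert
        + \lambda_2 \sum_{j=1}^{p} \beta_j^{2},
    \qquad \hat w_j = \lvert \hat\beta_j^{\mathrm{init}} \rvert^{-\gamma},

where \rho_{\tau}(u) = 2 \lvert \tau - I(u < 0) \rvert \rho(u) tilts a Huber loss \rho by the M-quantile order \tau; setting \lambda_2 = 0 recovers the Adaptive Lasso, and constant weights \hat w_j = 1 recover the Elastic Net penalty used by Ranalli et al. (2023).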

Methods

The proposed framework enhances M-quantile regression using adaptive regularization to achieve sparsity, robustness, and flexibility, making it suitable for high-dimensional and environmentally relevant data. The Adaptive Elastic Net combines Elastic Net’s grouping effect with the oracle property of Adaptive Lasso, improving variable selection and predictive modeling. Additionally, penalized B-splines are integrated to capture non-linear relationships, further enhancing the model’s adaptability.
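
As a minimal illustration of the adaptive penalty idea, the sketch below applies the standard column-rescaling trick to a least-squares loss; it is not the authors' implementation (their loss is the M-quantile loss, and the rescaling trick also reweights the L2 term, a common simplification). All names and tuning values are illustrative.

    import numpy as np
    from sklearn.linear_model import Ridge, ElasticNet

    def adaptive_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, gamma=1.0):
        # Initial consistent estimate, used only to build the adaptive weights.
        beta_init = Ridge(alpha=1.0).fit(X, y).coef_
        w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)  # large weight => heavy shrinkage
        # Absorb the weights into the design: a uniform penalty on b_j in the
        # rescaled problem equals a penalty of w_j * |beta_j| in the original one.
        enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X / w, y)
        return enet.coef_ / w  # map coefficients back to the original scale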

Results

The proposed method was implemented with simulated and real-life data. Simulation studies show that, compared to traditional methods, the proposed method more effectively identifies and shrinks irrelevant coefficients to zero, while also exhibiting lower bias for true zero coefficients. Analysis of PMetro data collected in Perugia, Italy, reveals that the proposed method retains the time-of-day effect while shrinking nearly all coefficients of lagged vehicular traffic counts to zero.

Conclusion

The proposed framework provides a more robust and flexible extension of M-quantile regression, achieving improved sparsity, predictive accuracy, and resilience to multicollinearity. Its ability to adaptively regulate robustness and capture non-linearity makes it particularly valuable for high-dimensional applications such as environmental studies. Future work will focus on computational efficiency and on extensions to spatiotemporal modeling.

Keywords: M-quantile Regression, Adaptive Elastic Net, B-splines, Air pollution



15-model-selection-simulations: 3

Robust standard errors for coefficients of selected and unselected predictors after variable selection for binary outcomes

Nilufar Akbari1, Ulrike Grittner1, Georg Heinze2

1Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, Berlin, 10117, Germany; 2Institute of Clinical Biometrics, Center for Medical Data Science, Medical University of Vienna, Spitalgasse 23, Vienna 1090, Austria

This study aimed to investigate whether a modified sandwich estimator of the covariance could yield valid standard errors for regression coefficients after variable selection. In regression models, predictors can either be pre-specified or selected through data-driven methods. For pre-specified models, valid standard errors for regression coefficients can be estimated based on established theory. For data-driven methods, however, there is no widely accepted way to compute standard errors that encompass the uncertainty of selection. Most commonly, selection uncertainty is ignored and standard errors for non-selected predictors are treated as zero. The bootstrap may be used to investigate the variability of selections; however, it is computationally intensive. This highlights the need for a new, simpler approach.

Having obtained promising results in a previous simulation with a continuous outcome, we now investigated regression with binary outcomes. We performed a simulation study in the setting of logistic regression with five true and five noise candidate predictors, using three correlation structures between those predictors and three different marginal event rates. We compared four methods for obtaining standard errors of the regression coefficients of all candidate predictors, regardless of selection: the coefficients and model-based variances of the full model; the final selected model with model-based variances and collapsed (zero) variances for non-selected predictors; the final selected model combined with the sandwich method, modified to supply standard errors for all candidate predictors; and the bootstrap variance method.
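
For reference, the unmodified sandwich covariance for logistic regression, the building block that the studied modification extends to supply standard errors for non-selected predictors, can be sketched as follows (a generic HC0-type sketch, not the modified estimator itself; names are illustrative):

    import numpy as np
    import statsmodels.api as sm

    def sandwich_se(X, y):
        # X: design matrix including an intercept column; y: 0/1 outcomes.
        fit = sm.Logit(y, X).fit(disp=0)
        p = fit.predict(X)                                   # fitted probabilities
        bread = np.linalg.inv(X.T @ (X * (p * (1 - p))[:, None]))  # inverse information
        score = X * (y - p)[:, None]                         # per-observation scores
        cov = bread @ (score.T @ score) @ bread              # the sandwich
        return np.sqrt(np.diag(cov))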

Results showed that the full model attained standard errors similar to the true sampling standard errors, whereas the selected model with model-based variances underestimated them, particularly for correlated predictors. The bootstrap method performed well for non-predictors and strong predictors but slightly worse for weak predictors. The sandwich method's standard errors were close to the bootstrap results for non-predictors and slightly underestimated for true predictors. For correlated predictors, the sandwich method performed slightly better than the bootstrap.

Overall, the modified sandwich estimator improved on the common practice of ignoring the uncertainty induced by variable selection, but more research is needed to refine the method. In addition to the simulation, we applied the proposed method to a real data example, which produced results consistent with the simulation findings. This work was supported by DFG grants RA-2347/8-1 and BE-2056/22-1 and FWF grant I-4739-B.



15-model-selection-simulations: 4

The “multi-performance plot” in simulation studies: a compact visualisation of up to seven performance measures comparing multiple statistical methods

Wang Pok Lo

Centre for Population Health Sciences, Usher Institute, University of Edinburgh, UK

Background: Simulation studies can be used to evaluate the performance of statistical methods, for example to determine how accurately or precisely an estimand is estimated by each method. Commonly used performance measures include bias, empirical standard error (EmpSE), mean squared error (MSE), average model standard error (ModSE), and coverage (Morris et al. [2019]). Tables are a natural way to present these measures. However, pattern identification from such tabular presentations may be hindered when they become large. This occurs when (1) many performance measures are assessed; (2) many combinations of simulation parameters are evaluated, such as in full factorial designs; and/or (3) many statistical methods are compared.
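
For concreteness, the Monte Carlo versions of these measures can be computed from per-repetition estimates roughly as follows (a sketch following Morris et al. [2019]; array names are illustrative):

    import numpy as np
    from scipy.stats import norm

    def performance(theta_hat, se_hat, theta, level=0.95):
        # theta_hat, se_hat: estimates and model-based SEs across repetitions;
        # theta: the true value of the estimand.
        z = norm.ppf(0.5 + level / 2)
        return {
            "bias": theta_hat.mean() - theta,
            "emp_se": theta_hat.std(ddof=1),                 # EmpSE
            "mse": ((theta_hat - theta) ** 2).mean(),
            "mod_se": np.sqrt((se_hat ** 2).mean()),         # ModSE
            "coverage": (np.abs(theta_hat - theta) <= z * se_hat).mean(),
        }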

Methods: A new two-dimensional plot, termed a “multi-performance plot”, is proposed to address all three scenarios simultaneously. In scenario (1), the plot is generated as follows. For each combination of parameters simulated, the estimated bias and estimated EmpSE are plotted on the horizontal and vertical axes, respectively. Points closer to the origin are more desirable; the squared distance from a point to the origin is approximately the estimated MSE. Next, a vertical line, whose length is the estimated ModSE, is drawn downwards from the point. Finally, each point is shaded using an appropriate colour gradient to visualise the estimated coverage. The plot additionally allows the visualisation of two relative performance measures described in Morris et al. [2019], namely the relative error in ModSE and relative precision. If scenario (2) and/or scenario (3) arise in combination with scenario (1), the shapes of the points, and the colours and types of the vertical lines, can be varied to reflect the different simulation parameters and statistical methods.
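
A minimal matplotlib sketch of this construction for scenario (1) (data, names, and colour choices are illustrative, not prescribed by the proposal):

    import matplotlib.pyplot as plt

    def multi_performance_plot(bias, emp_se, mod_se, coverage, labels):
        # One point per method: position (bias, EmpSE), a downward tail of
        # length ModSE, and a colour gradient encoding coverage.
        fig, ax = plt.subplots()
        pts = ax.scatter(bias, emp_se, c=coverage, cmap="viridis",
                         vmin=0.80, vmax=1.00, zorder=3)
        for b, e, m, lab in zip(bias, emp_se, mod_se, labels):
            ax.plot([b, b], [e, e - m], color="grey", zorder=2)  # ModSE tail
            ax.annotate(lab, (b, e), xytext=(4, 4), textcoords="offset points")
        ax.axvline(0, lw=0.5, color="black")  # zero-bias reference line
        ax.set_xlabel("Bias")
        ax.set_ylabel("Empirical SE")
        fig.colorbar(pts, ax=ax, label="Coverage")
        return fig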

Results: The utility of the multi-performance plot is illustrated with two examples: the outperformance of a non-parametric method for surrogate endpoint evaluation (Parast et al. [2016]), and the good performance of a new model accounting for dependent censoring in survival analysis (Deresa and Van Keilegom [2020]).

Conclusion: The multi-performance plot displays up to seven performance measures and can complement tabular presentations for easier identification of patterns, especially in large simulation studies.

Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Statistics in Medicine. 2019 May 20;38(11):2074-102.

Parast L, McDermott MM, Tian L. Robust estimation of the proportion of treatment effect explained by surrogate marker information. Statistics in Medicine. 2016 May 10;35(10):1637-53.

Deresa NW, Van Keilegom I. Flexible parametric model for survival data subject to dependent censoring. Biometrical Journal. 2020 Jan;62(1):136-56.



15-model-selection-simulations: 5

What is the impact of increasing numbers of auxiliary variables for an explanatory variable in imputation models?

Paul Madley-Dowd1, Rheanna Mainzer2, Samantha Ip3, Alexia Sampri3, Carmen Petitjean3, Jonathan Sterne1, Katherine Lee2, Tim Morris4, Angela Wood3, Kate Tilling1

1University of Bristol, United Kingdom; 2Murdoch Children's Research Institute, Australia; 3University of Cambridge, United Kingdom; 4University College London, United Kingdom

Background: When performing multiple imputation to explore an exposure-outcome association, auxiliary variables are included in the imputation models, but not in the analysis model, to reduce bias and/or improve statistical efficiency. Recent work has provided evidence for an inclusive strategy (i.e., including all available auxiliary variables in the imputation model) when there are missing data in the outcome variable. Such a finding is not expected to hold for missing data in other variables, such as the exposure or confounders. The aim of this work is to explore the impact of increasing numbers of auxiliary variables when imputing an exposure or confounder variable under different missingness mechanisms. Given the increasing sample sizes available in electronic healthcare records, and the different sampling strategies employed to ensure computational viability, it is also important to explore these impacts at different sample sizes.

Methods: We simulated a complete dataset including an exposure, an outcome, a confounder, and between 1 and 100 auxiliary variables that collectively explained between 10% and 60% of the variance in a missing variable. We explored 50% missing data in either the exposure or the confounder variable. We repeated our simulation at sample sizes of 1000, 10,000, and 1,000,000 and explored different missingness mechanisms. We explored scenarios where the exposure, outcome, and confounder were continuous or binary, using multivariable linear regression as the analysis model when the outcome was continuous and multivariable logistic regression when the outcome was binary.
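
A single simulation replicate along these lines might look as follows (a hypothetical sketch, not the study code: it performs one stochastic imputation via scikit-learn rather than full multiple imputation with Rubin's rules, and all effect sizes are illustrative):

    import numpy as np
    import statsmodels.api as sm
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(0)
    n, k = 1000, 50                        # sample size, number of auxiliaries
    c = rng.normal(size=n)                 # confounder
    aux = rng.normal(size=(n, k))          # weakly predictive auxiliaries
    x = 0.5 * c + 0.05 * aux.sum(axis=1) + rng.normal(size=n)  # exposure
    y = 1.0 * x + 0.5 * c + rng.normal(size=n)                 # outcome
    x_obs = np.where(rng.random(n) < 0.5, np.nan, x)           # 50% missing (MCAR)

    # Imputation model for the exposure: outcome, confounder, all auxiliaries.
    full = np.column_stack([x_obs, y, c, aux])
    imputed = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(full)

    # Analysis model excludes the auxiliaries; the true exposure effect is 1.0.
    design = sm.add_constant(np.column_stack([imputed[:, 0], c]))
    print(sm.OLS(y, design).fit().params)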

Results: At a sample size of n=1000, large numbers of auxiliary variables that only weakly predict the incomplete variable can introduce substantial bias into estimates of exposure-outcome associations. This bias is often reduced when using variables that more strongly predict the incomplete variable, though this depends on the variable type (continuous or binary) of each analysis model variable and on the missingness mechanism. Increasing the sample size also reduces the bias.

Conclusions: We caution against using an inclusive strategy of all available auxiliary variables when imputing an incomplete exposure or confounder variable. Our results suggest that careful consideration needs to be given to 1) how predictive an auxiliary variable is of an incomplete exposure or confounder variable that is to be imputed, 2) the variable type (continuous/binary) for each variable in the analysis model, and 3) the assumed missing data mechanism. Our work provides guidance on the importance of these factors at different sample sizes.