46-prediction-prognostic-mod-4: 1
Adapting existing sample size calculations for developing risk prediction models to control for model stability
Gareth Ambler, Menelaos Pavlou, Rumana Z Omar
University College London, United Kingdom
Background
The use of recently proposed sample size calculations can lead to more reliable risk prediction models. These calculations can determine the required sample size to ensure that the expected calibration slope (CS), a measure of model overfitting, will meet a target value (often 0.9) when models are fitted using maximum likelihood estimation (MLE); a perfectly calibrated model has a CS of 1. These calculations require information on the number of predictors, the outcome prevalence and the model strength.
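For reference, the existing criterion (Riley et al. 2020) targeting an expected uniform shrinkage of S (approximately the expected CS) can be written as n = p / ((S − 1) ln(1 − R²CS/S)), where R²CS is the anticipated Cox–Snell R². A minimal Python sketch with purely illustrative inputs (in practice R²CS would be taken from a published model or derived from the anticipated c-statistic and prevalence):

```python
import numpy as np

def n_for_target_shrinkage(p, r2_cs, S=0.9):
    # Criterion from Riley et al. (2020) targeting an expected uniform
    # shrinkage (approximately the expected calibration slope) of S,
    # based on the Van Houwelingen heuristic shrinkage estimate.
    return int(np.ceil(p / ((S - 1) * np.log(1 - r2_cs / S))))

# Illustrative inputs only: 10 predictor parameters, anticipated
# Cox-Snell R-squared of 0.2, target expected calibration slope 0.9.
print(n_for_target_shrinkage(p=10, r2_cs=0.2, S=0.9))  # -> 398
```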
Methods
In practice, the observed CS will vary around the target value. This aspect of model performance, model stability, is not accounted for in existing calculations. We use simulation to investigate model stability when varying the number of predictors (p), outcome prevalence and model strength (c-statistic). We quantify model stability using the Probability of Acceptable Calibration (PAC), defined here as the probability of achieving an observed CS within [0.85, 1.15]. We also investigate the performance of a simple (post-estimation) uniform shrinkage approach (using bootstrapping), which may be useful in some scenarios. Finally, we propose an adaptation of existing sample size calculations to control model stability and ensure that PAC is sufficiently high; here we aim for PAC=75%.
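As a point of reference for the shrinkage approach examined here, a minimal sketch (Python with statsmodels, assuming NumPy arrays X and y; illustrative only) of bootstrap-based uniform shrinkage, where the shrinkage factor is the average calibration slope of bootstrap-refitted models evaluated on the original data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def calibration_slope(y, lp):
    # Slope from regressing the outcome on the linear predictor (logit scale).
    return sm.Logit(y, sm.add_constant(lp)).fit(disp=0).params[1]

def bootstrap_uniform_shrinkage(X, y, n_boot=200):
    # Average calibration slope of bootstrap-refitted models, each evaluated
    # on the original data; used as a post-estimation uniform shrinkage factor.
    Xc = sm.add_constant(X)
    slopes = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        fit_b = sm.Logit(y[idx], Xc[idx]).fit(disp=0)
        slopes.append(calibration_slope(y, Xc @ fit_b.params))
    return float(np.mean(slopes))

# The regression coefficients (not the intercept) are multiplied by this
# factor, and the intercept is then re-estimated so that the mean predicted
# risk matches the observed outcome prevalence.
```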
Results
When adhering to existing sample size recommendations, the variability in the observed CS increased substantially with decreasing p (although CS=0.9 was achieved on average). Consequently, PAC was often low, particularly for p<10. Applying simple uniform shrinkage led to much higher PAC unless there were very few predictors (p≤5). Our proposed adaptation resulted in higher sample sizes than those currently recommended for p<15, and similar sizes for higher p.
Conclusions
Sample size calculations for the development of prediction models should take account of model stability. Applying post-estimation shrinkage at the existing recommended sizes may be beneficial unless the number of predictors is very small.
References
Riley et al. 2020. Calculating the sample size required for developing a clinical prediction model. British Medical Journal, https://doi.org/10.1136/bmj.m441.
Pavlou et al. 2024. An evaluation of sample size requirements for developing risk prediction models with binary outcomes. BMC Medical Research Methodology, https://doi.org/10.1186/s12874-024-02268-5.
Riley & Collins 2023. Stability of clinical prediction models developed using statistical or machine learning methods. Biometrical Journal, https://doi.org/10.1002/bimj.202200302.
46-prediction-prognostic-mod-4: 2
Optimal Methods for Handling Continuous Predictors in Clinical Prediction Model Development: Balancing Flexibility and Stability
Phichayut Phinyo1, Pakpoom Wongyikul1, Noraworn Jirattikanwong1, Natthanaphop Isaradech2, Wachiranun Sirikul2, Wuttipat Kiratipaisarl2, Noppadon Seesuwan3, Suppachai Lawanaskol4
1Department of Biomedical Informatics and Clinical Epidemiology (BioCE), Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand; 2Department of Community Medicine, Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand; 3Department of Emergency Medicine, Lampang Hospital, Muang District, Lampang, Thailand; 4Chaiprakarn Hospital, Chaiprakarn, Chiang Mai, Thailand
Background: Prediction stability is increasingly recognised as essential for ensuring reliable and reproducible model development. While dataset size and algorithm choices affect stability, the impact of specific modelling decisions, particularly in handling continuous predictors, is less well understood. This study examines how different methods for handling continuous predictors influence stability.
Methods: A dataset of 19,418 patients, previously used to develop a prediction model for hospital admission, was randomly sampled to create five datasets of different sizes (0.2, 0.5, 1, 2, and 5 times the base size). The base size represented the minimum sufficient sample size for developing the models. We defined six continuous candidate predictors, most showing non-linear associations with the endpoint. Six approaches for handling continuous predictors were compared: (1) DICHO – dichotomisation, (2) CAT – categorisation into tertiles, (3) LINEAR – assuming a linear relationship, (4) QUAD – assuming a quadratic relationship, (5) MFP – multivariable fractional polynomial transformations, and (6) XGBoost – extreme gradient boosting. Logistic regression was used for methods (1) to (5). Prediction stability was assessed using the bootstrap procedure proposed by Riley and Collins. Optimism-corrected areas under the curve (AUCs) and calibration slopes were estimated to assess model performance. A modelling approach was considered highly stable if ≥90% of predictions had a mean absolute prediction error (MAPE) of ≤5%.
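A minimal sketch (Python with scikit-learn; simplified relative to the full study) of the bootstrap stability check used for the regression-based methods, returning the proportion of individuals whose MAPE is within the 5% limit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def stable_prediction_proportion(X, y, n_boot=200, mape_limit=0.05):
    # penalty=None gives plain maximum likelihood (scikit-learn >= 1.2).
    base = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)
    p_orig = base.predict_proba(X)[:, 1]
    abs_diff = np.empty((n_boot, len(y)))
    for b in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        m = LogisticRegression(penalty=None, max_iter=1000).fit(X[idx], y[idx])
        abs_diff[b] = np.abs(m.predict_proba(X)[:, 1] - p_orig)
    mape = abs_diff.mean(axis=0)               # individual-level MAPE
    return float((mape <= mape_limit).mean())  # proportion deemed stable

# A modelling approach would be flagged as highly stable here if the
# returned proportion is at least 0.90.
```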
Results: At the base size, DICHO, LINEAR, and QUAD produced highly stable predictions with similar levels of calibration. However, DICHO exhibited lower AUCs than the other two. While MFP and XGBoost demonstrated higher AUCs than the others, their predictions lacked stability, and their calibration was poor. At larger sample sizes (2×Base and 5×Base), all methods achieved high stability. LINEAR, QUAD, and MFP outperformed DICHO and CAT in AUCs. Although XGBoost produced stable predictions with high AUCs, significant miscalibration persisted. With smaller-than-sufficient sample sizes (0.2×Base and 0.5×Base), LINEAR and DICHO demonstrated better stability than more complex methods (i.e. QUAD, MFP and XGBoost).
Conclusions: Methods for modelling continuous predictors should be chosen based on sample size and the trade-off between discrimination, calibration, and stability. For sufficiently large sample sizes, LINEAR and QUAD are preferred. MFP can yield higher AUCs but requires a substantially larger sample size for stability. XGBoost may not be an optimal choice even with a large sample size if calibration is a priority. For small datasets, LINEAR—and even DICHO—may be preferable to more complex, flexible methods to ensure maximal stability.
46-prediction-prognostic-mod-4: 3
Do Stable Performance Metrics Guarantee Stable Model Predictions?
Natthanaphop Isaradech1, Phichayut Phinyo2, Wuttipat Kiratipaisarl1, Pakpoom Wongyikul2, Noraworn Jirattikanwong2, Wachiranun Sirikul1
1Department of Community Medicine, Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand; 2Department of Biomedical Informatics and Clinical Epidemiology (BioCE), Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand
Introduction: Clinical prediction models are generally developed using either statistical models or machine learning algorithms to support clinical decision-making in critical situations. These models are typically evaluated based on their predictive performance, particularly in terms of discrimination and calibration. Concerns about the stability of these performance metrics have led to the need for internal validation and the estimation of model optimism. More recently, attention has shifted towards the stability of model predictions, which is more relevant to clinical practice, as unstable predictions can lead to inconsistent risk estimates and ultimately impact patient outcomes. This study aims to determine the association between the optimism of performance metrics and prediction stability metrics (e.g., the classification instability index, CII).
Methods: We replicated the development of a previously published prediction model for mortality using the GUSTO-I dataset. Ten sample-size scenarios (500, 1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000, and 40830) were drawn through stratified random sampling. Seven candidate predictors were defined, and logistic regression was used for model derivation. Each dataset underwent the same development process and was evaluated for predictive performance and stability using 200 bootstrap resamples. We estimated the mean absolute prediction error (MAPE) and the classification instability index (CII) and compared them with the estimated optimism of performance metrics, such as the area under the ROC curve (AuROC).
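A minimal sketch (Python with scikit-learn) of one common formulation of the CII computed alongside the bootstrap, with an illustrative 10% decision threshold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def classification_instability_index(X, y, threshold=0.1, n_boot=200):
    # For each individual: the proportion of bootstrap-refitted models whose
    # classification (risk above/below the threshold) disagrees with the
    # original model's classification; the CII averages this over individuals.
    base = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)
    class_orig = base.predict_proba(X)[:, 1] >= threshold
    disagreements = np.zeros(len(y))
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        m = LogisticRegression(penalty=None, max_iter=1000).fit(X[idx], y[idx])
        disagreements += (m.predict_proba(X)[:, 1] >= threshold) != class_orig
    return float((disagreements / n_boot).mean())
```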
Results: We found that models trained on different sample sizes exhibited similar discrimination performance, with a good AuROC (mean 0.73, 95% CI: 0.70–0.76) and favorable optimism (mean 0.017, 95% CI: 0.009–0.025). However, mean AuROC optimism showed a clear downward trend, with decreasing standard deviations, as the sample size increased. This pattern was also observed in stability metrics such as the CII, MAPE, estimated risk (relative to the original model), and calibration, despite the models generally having low optimism. MAPE was unstable, with values exceeding 5%, and calibration was unstable, especially at high predicted probabilities, for sample sizes of 1,000–5,000. MAPE began to stabilize and remained low at 10,000 samples.
Conclusion: Stable discriminative performance metrics and low optimism from internal validation do not necessarily indicate stability in model prediction and calibration. Model developers should be cautious about relying solely on consistent discrimination performance and low optimism, particularly when the model is trained on a small sample size.
46-prediction-prognostic-mod-4: 4
The Imbalance Dilemma: Can Class Imbalance Corrections Stabilize Clinical Prediction Models?
Wachiranun Sirikul1,2, Natthanaphop Isaradech1, Wuttipat Kiratipaisarl1, Phichayut Phinyo2, Pakpoom Wongyikul2, Noraworn Jirattikanwong2
1Department of Community Medicine, Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand; 2Department of Biomedical Informatics and Clinical Epidemiology (BioCE), Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand
Background
Class imbalance is a common problem in developing clinical prediction models (CPMs), often leading to a prediction paradox. A variety of methods for correcting data imbalance have been proposed and implemented to improve the development of CPMs. However, correcting for imbalance could potentially degrade model performance and exacerbate bias by increasing overfitting. In this study, we investigated how imbalance correction influenced the performance of logistic regression models, focusing on prediction stability, using the GUSTO dataset.
Methods
Model development and internal validation were performed using different methods to correct for class imbalance (none, SMOTENC, BorderlineSMOTE, and ADASYN), implemented with the imbalanced-learn library in Python (version 3.12), at sample sizes of 500, 1000, 2000, 3000, and 40830. The smallest sample scenario was determined from the minimum sample size required for developing a multivariable classification model according to Riley et al. CPMs were developed using penalised logistic regression, with hyperparameters tuned by grid search with 10-fold cross-validation via scikit-learn. Model performance and prediction stability were evaluated using 200 bootstrap samples to obtain optimism-corrected areas under the receiver operating characteristic curve (AuROCs), calibration, mean absolute prediction error (MAPE), and the classification instability index (CII).
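A minimal sketch (Python with imbalanced-learn and scikit-learn; the categorical column indices and the tuning grid are illustrative assumptions) of one arm of this setup, SMOTENC feeding a penalised logistic regression tuned by 10-fold grid search:

```python
import numpy as np
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical layout: columns 0 and 1 of X hold categorical predictors;
# SMOTENC needs their indices so it only interpolates the continuous columns.
categorical_idx = [0, 1]

pipe = Pipeline(steps=[
    ("smotenc", SMOTENC(categorical_features=categorical_idx, random_state=42)),
    ("clf", LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000)),
])

# The imblearn pipeline applies the resampler to the training folds only,
# so synthetic cases never leak into the validation folds during tuning.
grid = GridSearchCV(pipe, param_grid={"clf__C": np.logspace(-3, 2, 11)},
                    cv=10, scoring="roc_auc")
# grid.fit(X_dev, y_dev)  # X_dev, y_dev: the development data (not shown here)
```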
Results
In the full sample scenario, the optimism-corrected AuROCs for the models without imbalance correction and with SMOTENC, BorderlineSMOTE, and ADASYN were 0.768, 0.802, 0.803, and 0.789, respectively. All models using imbalance corrections showed better prediction stability in discrimination, calibration, MAPE, and CII compared to the model without imbalance correction. In the minimum required sample scenario (n=500), the models with imbalance corrections, except for ADASYN, improved model discrimination compared to the model using the original data. Although the limited sample size affected the calibration and prediction stability of all models, the original-data model had the most stable predictions. At the larger sample size (n=3000), the findings on performance and stability were consistent with the full sample scenario.
Conclusion
Our study demonstrated that using oversampling techniques for imbalance correction in simple clinical data and standard statistical models not only improved model discrimination but also enhanced the calibration and stability of CPMs in large sample sizes. However, applying imbalance corrections in small samples, such as the minimal required sample, should be approached with caution. The effects of imbalance corrections can be attributed to the trade-off between introducing optimal bias to make the model more suitable for the minority class and applying optimal model penalisation to mitigate overfitting.
46-prediction-prognostic-mod-4: 5
Prediction with Logistic Regression in Binary Class Imbalance: Comparing Re-sampling Techniques with Threshold Probability Assignment under Varying Predictive Covariates
Henk van der Pol1,2, Ragnhild Sørum Falk3, Marta Fiocco2,4,5, Arnoldo Frigessi6, Euloge Clovis Kenne Pagui3,6
1Department of Medical Oncology, Leiden University Medical Center, the Netherlands; 2Mathematical Institute, Leiden University, the Netherlands; 3Oslo Centre for Biostatistics and Epidemiology, Oslo University Hospital, Oslo, Norway; 4Princess Máxima Centre for Paediatric Oncology, Utrecht, the Netherlands; 5Department of Biomedical Data Science, Section Medical Statistics, Leiden University Medical Centre, Leiden, the Netherlands; 6Oslo Centre for Biostatistics and Epidemiology, University of Oslo, Oslo, Norway
Background:
Prediction of rare events in binary classification is a well-studied topic in biostatistics that has gained renewed interest in the context of machine learning (ML). Class imbalance is often addressed in the ML community through re-sampling techniques such as random oversampling, random undersampling, and the synthetic minority oversampling technique (SMOTE). However, recent research has shown that re-sampling techniques are not always necessary and may even be harmful to clinical prediction [1]. Nevertheless, little attention has been paid to how a proper threshold probability assignment strategy may overcome this issue. Importantly, this study explores how the strength of the covariates and the correct specification of the model affect prediction performance in an imbalanced setting. The aim of this research is to investigate the impact of class imbalance correction strategies on the performance of logistic classifiers when covariates have varying degrees of predictive power.
Methods:
We conducted a Monte Carlo simulation based on a logistic regression model under various settings, including disease prevalence, re-sampling technique, model specification and covariate strength. Predictive performance was mainly assessed using precision and recall (sensitivity). We compared the re-sampling techniques with several threshold probability strategies.
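A minimal sketch (Python with scikit-learn and imbalanced-learn) of one such comparison; the data-generating values, the prevalence-based threshold, and the single SMOTE comparator are illustrative simplifications of the full simulation design:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(3)

def simulate(n, beta, intercept):
    # Logistic data-generating mechanism: the intercept controls the
    # prevalence, beta the predictive strength of the covariates.
    X = rng.normal(size=(n, len(beta)))
    p = 1 / (1 + np.exp(-(intercept + X @ beta)))
    return X, rng.binomial(1, p)

beta = np.array([1.0, 0.5])
X, y = simulate(5000, beta, intercept=-3.5)      # imbalanced development data
Xt, yt = simulate(5000, beta, intercept=-3.5)    # independent test data

# Strategy A: no correction, classify at a lowered threshold
# (here the observed prevalence; other threshold strategies exist).
fit_a = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)
pred_a = fit_a.predict_proba(Xt)[:, 1] >= y.mean()

# Strategy B: SMOTE re-sampling, then the conventional 0.5 threshold.
Xs, ys = SMOTE(random_state=0).fit_resample(X, y)
fit_b = LogisticRegression(penalty=None, max_iter=1000).fit(Xs, ys)
pred_b = fit_b.predict_proba(Xt)[:, 1] >= 0.5

for name, pred in [("threshold only", pred_a), ("SMOTE + 0.5", pred_b)]:
    print(name, precision_score(yt, pred), recall_score(yt, pred))
```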
Results:
In all simulation scenarios, a proper threshold probability assignment strategy had comparable or better prediction performance than re-sampling techniques. Specifically, the threshold that maximizes the area under the ROC curve (AUC) returned the same level of precision as the SMOTE re-sampling technique, while the recall was, on average, 10% greater. Moreover, these differences widened as the strength of a predictive variable increased. Lastly, we reinforce current findings by showing that leaving the dataset uncorrected returns a higher AUC and area under the precision-recall curve in imbalanced data compared to re-sampling techniques.
Conclusion:
A proper threshold probability assignment strategy outperforms re-sampling techniques in the imbalanced setting. Future work should focus on optimal threshold probabilities based on the outcome of interest and on further investigation of the predictive variables.
Reference:
[1] van den Goorbergh, R., van Smeden, M., Timmerman, D., & Van Calster, B. (2022). The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression. Journal of the American Medical Informatics Association, 29(9), 1522-1531.