36-1 Biomarker: 1
Quantifying the Clinical Usefulness of Novel Biomarkers and Tests: Beyond Traditional Statistics
Frank Doornkamp1, Jelle J Goeman1, Ewout W Steyerberg1,2
1LUMC, The Netherlands; 2UMCU, The Netherlands
Background: New prognostic tests and biomarkers are often described with sensitivity and specificity, but evaluating their clinical usefulness requires moving beyond traditional performance metrics. We aim to evaluate how new biomarkers improve health outcomes through better treatment allocation, compared to current standard care.
Methods: We developed a decision analytic model comparing treatment decision making based on a reference prediction model versus the reference model combined with a new biomarker. Risk distributions for a clinical outcome were simulated based on the reference model alone or with the new biomarker. Individual treatment benefit was estimated assuming a constant relative effect. Treatment was recommended if the absolute risk reduction exceeded a defined treatment threshold. Traditional statistical performance measures included sensitivity, specificity, and the area under the ROC curve (AUC). The clinical usefulness of the biomarker was quantified with Net Benefit: a weighted sum of the reduction in events and the number of treatments given, per 10,000 patients. Uncertainty in the improvement in Net Benefit was assessed by subsampling from the simulated data set. As an illustrative example, we assessed the clinical usefulness of the genomic MammaPrint test (assumed sensitivity 64%, specificity 67%) in addition to standard clinical risk assessment (PREDICT model, https://breast.predict.cam/tool) for early breast cancer patients. Systematic sensitivity analyses assessed the drivers of clinical usefulness.
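A minimal sketch of this decision-analytic comparison (illustrative Python, not the authors' implementation; the risk distributions, relative effect, and treatment threshold are assumed values):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000            # cohort size, so Net Benefit is "per 10,000 patients"
rel_risk = 0.7        # assumed constant relative treatment effect
threshold = 0.03      # treat if predicted absolute risk reduction > 3%

# Assumed risk distributions: the "true" risk incorporates the biomarker;
# the reference model sees a noisier version of it.
risk_true = rng.beta(2, 18, n)                    # risk with biomarker
logit = np.log(risk_true / (1 - risk_true))
risk_ref = 1 / (1 + np.exp(-(logit + rng.normal(0, 0.5, n))))

def net_benefit(risk_used):
    """Net Benefit: events prevented minus treatments given, the latter
    weighted by the treatment threshold (the harm-benefit exchange rate)."""
    treat = risk_used * (1 - rel_risk) > threshold          # predicted ARR
    prevented = np.sum(risk_true[treat] * (1 - rel_risk))   # events avoided
    return prevented - threshold * np.sum(treat)

print("NB, reference model       :", round(net_benefit(risk_ref), 1))
print("NB, reference + biomarker :", round(net_benefit(risk_true), 1))
```

Subsampling the simulated cohort, as described above, would quantify the uncertainty around the difference between the two Net Benefit values.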
Results: For early breast cancer, adding the MammaPrint to the PREDICT model increased the AUC from 0.69 to 0.72. Net Benefit increased from 13 to 15 net distant metastases prevented per 10,000 patients. Substantial uncertainty was noted when drawing samples equal in size to the MINDACT trial (n=6693), the study that provided key evidence on the incremental value of the MammaPrint test, suggesting that further evidence is needed to claim clinical usefulness. Sensitivity analysis showed that the clinical usefulness of a new test was larger when the event rate was higher, the treatment effect larger, the test's own accuracy better, or the quality of the reference model lower. These patterns were not consistently reflected in increases in AUC. We present test and context characteristics in a Shiny app to facilitate early assessments of the potential clinical usefulness of novel tests and biomarkers.
Conclusions: Decision analytic modeling provides insights into how the sensitivity and specificity of a new biomarker translate to clinical usefulness within its clinical context. We found that a biomarker's clinical usefulness depends not only on its prognostic strength but also on key contextual factors, which are not captured by traditional statistical performance measures.
36-1 Biomarker: 2
The underlap coefficient as a measure of a biomarker's discriminatory ability in a multi-class disease setting
Zhaoxi Zhang, Vanda Inácio, Miguel de Carvalho
University of Edinburgh (United Kingdom)
Background: The first step in evaluating a potential diagnostic biomarker is to examine the variation in its values across different disease groups. Employing appropriate metrics in this evaluation phase is essential. The most commonly used metrics for this purpose are predominantly Receiver Operating Characteristic (ROC) based. However, these measures rely on a stochastic ordering assumption for the distributions of the biomarker's outcomes across groups. This assumption can be restrictive, particularly when covariates are involved, and its violation may lead to incorrect conclusions about a biomarker's ability to distinguish between disease classes. Even when a stochastic ordering exists, the order may vary across different biomarkers in discovery studies, complicating automated ranking.
Methods: To address these challenges and complement existing measures, we propose the underlap coefficient (UNL), a novel summary index of a biomarker's ability to distinguish between multiple disease groups, and study its properties particularly in the three-class case. We establish a direct analytical link between the UNL and the three-class Youden index (YI), as well as between the UNL and Weitzman's two-class overlap coefficient (OVL). These relationships can easily be generalized to settings with more than three disease classes. We also numerically explore the relationship between the UNL and the volume under the ROC surface (VUS) in a proper trinormal framework. Additionally, we introduce Bayesian nonparametric estimators for both the unconditional underlap coefficient and its covariate-specific counterpart. Furthermore, we illustrate the proposed approach through an application to an Alzheimer's disease (AD) dataset, aimed at assessing how four potential AD biomarkers distinguish between individuals with normal cognition, mild impairment, and dementia, and whether and how age and gender affect this discriminatory ability.
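For intuition, a numerical sketch under an assumed formalization of the underlap coefficient as the integral of the pointwise maximum of the group densities, which reproduces the two-class link to Weitzman's OVL mentioned above (illustrative Python; the normal densities are hypothetical):

```python
import numpy as np
from scipy.stats import norm

# Three hypothetical disease-group densities (normals, for illustration).
grid = np.linspace(-8, 12, 20001)
dx = grid[1] - grid[0]
dens = np.array([norm(0, 1).pdf(grid),
                 norm(2, 1).pdf(grid),
                 norm(5, 1.5).pdf(grid)])

# UNL taken as the integral of the pointwise maximum of the K densities
# (assumed definition): 1 when all groups coincide, K when fully separated.
unl3 = np.sum(dens.max(axis=0)) * dx              # simple Riemann sum
print("UNL, 3 classes:", round(unl3, 3))

# Two-class sanity check against Weitzman's overlap coefficient OVL:
# max(f1, f2) = f1 + f2 - min(f1, f2), hence UNL = 2 - OVL.
ovl = np.sum(np.minimum(dens[0], dens[1])) * dx
unl2 = np.sum(np.maximum(dens[0], dens[1])) * dx
print("UNL, 2 classes:", round(unl2, 3), "  2 - OVL:", round(2 - ovl, 3))
```

Note that no ordering of the three densities is used anywhere in this computation, which is the advantage over ROC-based summaries discussed in the abstract.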
Results: A simulation study shows good performance of the proposed estimators across a range of conceivable scenarios. The results of the application study indicate a moderate age and gender effect on the biomarkers' discriminatory ability. Moreover, comparing the UNL with the three-class YI in the application study, we find that the YI may underestimate some biomarkers' discriminatory capability at certain covariate values.
Conclusion: We discuss the underlap coefficient as a measure of diagnostic accuracy in a multi-class disease framework. It offers advantages over ROC-based summary measures for evaluating the diagnostic potential of a biomarker during its discovery phase, as it does not require an assumed order of classes and is better suited for multi-modal density settings.
36-1 Biomarker: 3
Time-dependent accuracy for continuous biomarkers using copula modelling
Adina Najwa Kamarudin1, Ahmad Faiz Mohd Azhar1, Nurain Ibrahim2
1Universiti Teknologi Malaysia, Malaysia; 2Universiti Teknologi MARA, Malaysia
In biomedical research, a key interest lies in building classification models to analyse patient survival based on biomarkers, discriminating patients into different groups. If a biomarker is time-dependent, the accuracy of these models can be assessed through the time-dependent receiver operating characteristic (ROC) curve. This curve measures the sensitivity and specificity of the classification model in detecting which patients have long or short survival times. The accuracy trend is produced by computing the area under this curve (AUC) at each time point. Several methodologies have been proposed for estimating the accuracy trend, revolving around nonparametric or semiparametric estimators. These methods may, however, limit researchers' ability to explain why accuracy is high or low at certain time points. In this paper, we demonstrate how the accuracy trend of a classification model can be estimated parametrically from a time-dependent ROC curve. A simulation study on copula functions shows that the accuracy value and trend may be influenced by the dependence measurements or the selection of copulas, which can improve the understanding of the accuracy trend of a classification model. In a real application, the statistical information of a single biomarker/score derived from the primary biliary cholangitis (PBC) dataset was linked to its time-to-event distribution using the Gaussian copula. The biomarker/score derived from five covariates gave the highest accuracy performance, as it showed the strongest negative dependence structure compared to the other markers.
Keywords: time-dependent, AUC, copula, sensitivity, specificity.
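A rough sketch of the copula-based setting described in this abstract (illustrative Python; the Gaussian-copula dependence, margins, and evaluation times are assumed values, not those of the PBC analysis):

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(7)
n, rho = 5000, -0.7          # assumed Gaussian-copula dependence parameter

# Draw (marker, survival time) linked by a Gaussian copula: higher marker
# values are associated with shorter survival (negative dependence).
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], n)
marker = z[:, 0]                                   # standard-normal margin
surv_time = expon(scale=5).ppf(norm.cdf(z[:, 1]))  # exponential margin

def cd_auc(t):
    """Cumulative/dynamic time-dependent AUC at time t: probability that
    a subject failing by t has a higher marker than one surviving past t."""
    cases, controls = marker[surv_time <= t], marker[surv_time > t]
    return (cases[:, None] > controls[None, :]).mean()

for t in (1.0, 3.0, 5.0, 8.0):
    print(f"AUC(t={t}): {cd_auc(t):.3f}")
```

Re-running with a weaker dependence parameter pulls the AUC toward 0.5 at every time point, illustrating how the dependence measure and the copula choice drive the accuracy trend.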
36-1 Biomarker: 4
Sample size determination for hypothesis testing of the intraclass correlation coefficient for agreement in two-way ANOVA models
Dipro Mondal1, Alberto Cassese2, Math JJM Candel1, Sophie Vanbelle1
1Department of Methodology and Statistics, Care and Public Health Research Institute (CAPHRI), Maastricht University, The Netherlands; 2Department of Statistics, Computer Science, Applications "Giuseppe Parenti", University of Florence, Italy
Introduction: Reliability assessment is essential in medical domains to ensure accurate patient diagnosis. When multiple raters evaluate the same patients using quantitative measurements, a two-way ANOVA model may be appropriate, with the intraclass correlation coefficient for agreement (ICCa) serving as the reliability metric.
Designing such reliability studies requires determining the number of patients and raters. While sample size procedures exist based on the expected width of confidence intervals for ICCa, procedures based on hypothesis testing remain underdeveloped. Such procedures utilise the lower limit of the confidence interval for ICCa [1, 2] and determine sample sizes that ensure adequate power for testing whether ICCa exceeds a predefined threshold. We propose sample size procedures for hypothesis testing that build on available confidence interval methods for ICCa.
Methods: We identify seven classes of confidence interval methods for ICCa and compare their empirical type-I error rates. Focusing on the best performing methods, we propose simulation-based sample size determination procedures. These procedures are evaluated by assessing the empirical power of the hypothesis test at the calculated sample size, and are made accessible through an interactive R/Shiny app.
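A simplified Monte Carlo sketch of simulation-based power evaluation for ICCa (illustrative Python; it calibrates the rejection rule by simulation under the null rather than via the confidence interval methods the proposed procedures actually build on, and all design values are assumed):

```python
import numpy as np

rng = np.random.default_rng(42)

def icc_a1(y):
    """ICC for agreement, ICC(A,1), from two-way ANOVA mean squares
    (McGraw & Wong); y has shape (n patients, k raters)."""
    n, k = y.shape
    msr = k * np.var(y.mean(axis=1), ddof=1)       # between patients
    msc = n * np.var(y.mean(axis=0), ddof=1)       # between raters
    resid = y - y.mean(1, keepdims=True) - y.mean(0, keepdims=True) + y.mean()
    mse = np.sum(resid**2) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def estimates(n, k, icc, var_r=0.5, var_e=1.0, reps=2000):
    """Sampling distribution of ICC(A,1) when the true ICCa equals `icc`,
    holding the rater and error variances fixed."""
    var_p = icc / (1 - icc) * (var_r + var_e)      # solves ICCa = icc
    out = np.empty(reps)
    for i in range(reps):
        y = (rng.normal(0, np.sqrt(var_p), (n, 1))    # patient effects
             + rng.normal(0, np.sqrt(var_r), (1, k))  # rater effects
             + rng.normal(0, np.sqrt(var_e), (n, k))) # measurement error
        out[i] = icc_a1(y)
    return out

# H0: ICCa = 0.6 vs H1: ICCa = 0.8 (assumed design values), alpha = 0.05.
n, k = 30, 3
crit = np.quantile(estimates(n, k, 0.6), 0.95)     # simulated H0 cut-off
power = np.mean(estimates(n, k, 0.8) > crit)
print(f"Monte Carlo power at n={n} patients, k={k} raters: {power:.2f}")
```

Wrapping this power evaluation in a search over n (and k) yields the smallest design reaching the target power, which is the kind of computation the R/Shiny app exposes.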
Results: Comparison of the type-I error rates of the confidence interval methods indicates that the rater-to-error variance ratio influences which method emerges as the best-performing in maintaining the type-I error rate close to the nominal value. Evaluation of our proposed sample size procedures shows that they provide adequate power across most parameter configurations.
Conclusion: The rater-to-error variance ratio should guide practitioners in selecting an appropriate confidence interval method for ICCa. Our proposed sample size procedures, along with the R/Shiny app implementation, provide a practical framework for designing reliability studies.
1. Mondal, D., et al., Review of sample size determination methods for the intraclass correlation coefficient in the one-way analysis of variance model. Statistical Methods in Medical Research, 2024. 33(3): p. 532-553.
2. Zou, G.Y., Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Statistics in Medicine, 2012. 31(29): p. 3972-3981.
36-1 Biomarker: 5
Accounting for misclassification of binary outcomes in external control arm studies for unanchored indirect comparisons: simulations and applied example
Mikail Nourredine1,2, Antoine Gavoille1,2, Côme Lepage3,4, Behrouz Kassai-Koupai2,5, Michel Cucherat6, Fabien Subtil1,2
1Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, F-69003 Lyon, France; 2Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, F-69100 Villeurbanne, France; 3Fédération Francophone de Cancérologie Digestive, EPICAD INSERM UMR CTM 1231, University of Burgundy and Franche Comté, Dijon, France; 4Department of Digestive Oncology, University Hospital Dijon, University of Burgundy and Franche Comté, Dijon, France; 5Service hospitalo-universitaire de pharmaco-toxicologie de Lyon, Centre d'Investigation Clinique CIC 1407, Inserm-Hospices Civils de Lyon, Lyon, France; 6Service de pharmacologie et de toxicologie (metaEvidence.org), Hospices Civils de Lyon, Lyon, France
Introduction: Statistical methods for single-arm trials with an External Control Arm (ECA) usually assume no difference in outcome measurement between arms. However, ECA data may measure only proxy outcomes, leading to potential misclassification and biased estimates. This study aimed to quantify bias from ignoring binary outcome misclassification and propose a likelihood-based correction method.
Methods: The proposed model relies on a validation study in which both the proxy and the reference outcome are measured, to overcome the misclassification problem, combined with joint likelihood estimation across the validation, ECA, and prospective single-arm data. In addition to standard assumptions in indirect treatment comparisons, it requires a correctly specified outcome measurement error model that accounts for all variables contributing to non-differential measurement error. We performed simulations varying sample size, specificity, and sensitivity, evaluating relative bias, empirical standard error (SE), root mean square error (RMSE), and 95% confidence interval (CI) coverage. In an applied example, we compared sorafenib with a proxy outcome (PRODIGE-11 trial) versus placebo with the reference outcome (SHARP trial), using the SHARP trial's gold standard treatment effect estimate.
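A minimal sketch of such a joint likelihood correction (illustrative Python with made-up counts; it assumes constant sensitivity and specificity, i.e. no covariates in the measurement error model, unlike the full model described above):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Assumed illustrative counts (not trial data).
# Validation study: cross-classification of proxy vs reference outcome.
val = np.array([[80, 15],    # reference = 1: proxy = 1, proxy = 0
                [10, 95]])   # reference = 0: proxy = 1, proxy = 0
n_proxy, y_proxy = 200, 72   # arm observed only with the proxy outcome
n_ref,   y_ref   = 400, 180  # arm observed with the reference outcome

def negloglik(theta):
    """Joint log-likelihood in (logit p_proxy_arm, logit p_ref_arm,
    logit Se, logit Sp); the proxy arm's true event probability enters
    only through q = Se*p + (1 - Sp)*(1 - p)."""
    p1, p0, se, sp = expit(theta)
    q = se * p1 + (1 - sp) * (1 - p1)          # P(proxy = 1) in proxy arm
    ll = (y_proxy * np.log(q) + (n_proxy - y_proxy) * np.log(1 - q)
          + y_ref * np.log(p0) + (n_ref - y_ref) * np.log(1 - p0)
          + val[0, 0] * np.log(se) + val[0, 1] * np.log(1 - se)
          + val[1, 0] * np.log(1 - sp) + val[1, 1] * np.log(sp))
    return -ll

fit = minimize(negloglik, np.zeros(4))
p1, p0, se, sp = expit(fit.x)
print(f"Se={se:.2f} Sp={sp:.2f}  corrected OR="
      f"{(p1 / (1 - p1)) / (p0 / (1 - p0)):.2f}")
```

Because Se and Sp are estimated rather than fixed, their uncertainty propagates from the validation study into the corrected treatment effect, which is what widens the confidence interval in the applied example below.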
Results: Simulations showed that ignoring misclassification in binary outcomes leads to substantial bias in the estimation of indirect treatment effects. Even with a specificity and sensitivity of 0.9, the uncorrected method had a relative bias of 67%. The proposed model reduced bias in all simulation sets, with a relative bias below 5% and a 95% CI coverage between 95% and 96.5%. Across varying levels of specificity and sensitivity, the proposed method achieved approximately half the RMSE of the uncorrected method. Additionally, increasing the ECA sample size had a greater impact on reducing the proposed method's RMSE than enlarging the validation study sample size. The gold standard effect of sorafenib compared with placebo was OR=0.52 (SHARP). Ignoring outcome misclassification resulted in an overestimation of the indirect treatment effect (OR=0.36), whereas the proposed model estimated OR=0.55. However, with only 161 patients in the sorafenib arm of the PRODIGE-11 trial, the 95% CI estimated by the proposed model was wide. This is conservative, as it transfers measurement uncertainty into decision uncertainty. These findings align with simulation results, where the empirical SE of the proposed model was twice that of the reference outcome regression for a sample size of 200 patients.
Conclusions: The findings underscore the importance of addressing outcome misclassification in indirect comparisons. The proposed correction method may improve reliability in unanchored indirect treatment comparisons.