Session
Machine learning 2

Presentations

33-1 ML 2: 1
Asymmetric Shapley values to quantify the importance of genomic variables in clinical prediction studies
Amsterdam UMC, The Netherlands

In many clinical prediction studies, genomics data are combined with other variables, such as gender, sex, and disease state. In such models, the additional predictive strength of genes is often disappointingly small. This may lead to the premature conclusion that genomics is irrelevant for the disease or its progression. Quantifying the 'importance' of genomic variables purely by assessing the decrease in prediction accuracy after removing these variables from the model has two shortcomings. First, it does not account well for correlations. If the genomic variables are highly correlated with the other (low-dimensional) variables, the latter act as a buffer when genomics is removed from the model, leading to little decline in performance. Still, the two types of variables may be equally important for the outcome. Shapley values, either for single variables or for groups of variables, provide a solution in the sense that importance is spread out over correlated variables. A second shortcoming is that causality and temporal ordering are completely ignored. This is particularly relevant, as genomic aberrations are often conceived to be close to the root cause, which means they may have indirect effects on the outcome via some of the other variables. Motivated by an application to the prediction of relapse-free survival for colorectal cancer patients, we study asymmetric Shapley values to quantify the importance of genomic variables when disease state acts as a mediator and when several confounders are present. We show that accounting for the ordering matters for quantifying the importance of genomics. Moreover, we address regularization to deal with the dependency of the mediator (and confounders) on the high-dimensional genomics variable; these dependencies are crucial for the calculation of Shapley values. Finally, we illustrate how to perform inference with the asymmetric Shapley values to compare the importance of genomics with that of other variables. Extensions to pre-defined gene sets will also be discussed. To summarize, this work helps researchers obtain a better quantification of the (relative) importance of genomic variables in clinical prediction settings.
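For readers unfamiliar with the distinction, the sketch below contrasts symmetric and asymmetric Shapley values for groups of variables: symmetric values average marginal contributions over all orderings, while asymmetric values average only over orderings consistent with an assumed causal order (here, genomics before its mediator). The value function (cross-validated R² of a ridge model) and the toy data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of group-wise symmetric vs. asymmetric Shapley values.
# Assumption: value function v(S) = cross-validated R^2 of a ridge model
# refit on the variable groups in S (the authors' value function may differ).
from itertools import permutations

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def value(X_groups, y, subset):
    """v(S): CV R^2 of a ridge model using only the groups in `subset`."""
    if not subset:
        return 0.0
    X = np.hstack([X_groups[g] for g in subset])
    return cross_val_score(Ridge(), X, y, scoring="r2", cv=5).mean()

def shapley(X_groups, y, orderings):
    """Average marginal contribution of each group over the given orderings."""
    phi = {g: 0.0 for g in X_groups}
    for order in orderings:
        seen = []
        for g in order:
            phi[g] += value(X_groups, y, seen + [g]) - value(X_groups, y, seen)
            seen.append(g)
    return {g: phi[g] / len(orderings) for g in phi}

# Toy data: high-dimensional genomics, a mediator (disease state), a confounder.
rng = np.random.default_rng(0)
n = 200
genomics = rng.normal(size=(n, 50))
confounder = rng.normal(size=(n, 1))
mediator = genomics[:, :1] + confounder + rng.normal(size=(n, 1))
y = (mediator + 0.5 * confounder).ravel() + rng.normal(size=n)
X_groups = {"genomics": genomics, "mediator": mediator, "confounder": confounder}

all_orders = list(permutations(X_groups))
# Asymmetric: keep only orderings where genomics enters before its mediator.
causal_orders = [o for o in all_orders if o.index("genomics") < o.index("mediator")]

print("symmetric :", shapley(X_groups, y, all_orders))
print("asymmetric:", shapley(X_groups, y, causal_orders))
```

Because the mediator carries much of the genomic signal, the symmetric attribution splits credit between the two; restricting to causally consistent orderings shifts credit back toward genomics, which is the effect the abstract describes.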
33-1 ML 2: 2
Fused Estimation of Varying Omics Effects for Clinico-genomic Data
1Amsterdam University Medical Centers, The Netherlands; 2Vrije Universiteit Amsterdam, The Netherlands

Background

Methods

Results and Conclusion
33-1 ML 2: 3
Random forests using longitudinal predictors
1Univ. Bordeaux, INSERM, INRIA, BPH, U1219, France; 2Univ. Bordeaux, INSERM, BPH, U1219, France

Introduction and Objectives
Random Forests are an effective predictive tool, particularly in high-dimensional settings. However, they are not well suited for longitudinal data collected over time. To address this limitation, Fréchet Random Forests [1] were proposed. They can handle any type of data within a metric space by using a distance tailored to each data type (e.g., images, trajectories). This work aimed to implement the Fréchet Random Forest for trajectory data, fully exploiting the flexibility of the generalized Fréchet distance, and to evaluate the performance of the Fréchet Random Forest in predicting a continuous outcome from longitudinal inputs.

Methods
The generalized discrete Fréchet distance depends on a time-shifting parameter, called the timescale, which modifies its behavior. We propose two implementations: the timescale defined as a hyperparameter, or the timescale randomly drawn at each tree node to explore all time-sensitivity behaviors. A simulation study was conducted to illustrate the flexibility of the Fréchet Random Forest in capturing different scenarios of association: (i) time-sensitive association, (ii) shape-sensitive association, and (iii) a mix of both. We then applied the method to data from a population-based cohort to predict the risk of dementia from clinical marker trajectories.

Results
The simulations illustrated the flexibility of Fréchet Random Forests to adapt to different types of associations through timescale tuning. The Fréchet forests also demonstrated better predictive performance (MSE) across all three scenarios compared to classical Random Forests with pre-determined features. On the application data, the Fréchet forests outperformed classical forests, even with more irregular and sparse data, while identifying similar predictive markers.

Conclusion
Thanks to its tunable timescale parameter, which can adapt to different structures of association, the Fréchet Random Forest constitutes a flexible tool for prediction based on longitudinal data.

[1] Capitaine L. et al. Fréchet Random Forests for Metric Space Valued Regression with Non-Euclidean Predictors. JMLR. 2024.
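The abstract does not give the exact form of the generalized distance, so the sketch below shows one plausible parameterisation to illustrate the role of the timescale: a pointwise cost mixing time shifts and value differences with a weight `omega`, combined by the standard discrete Fréchet dynamic program. The weighting scheme is an assumption for illustration only.

```python
# Sketch of a discrete Fréchet distance between trajectories (t, x) and (s, y)
# with a timescale weight `omega` in [0, 1]: omega -> 1 is time-sensitive,
# omega -> 0 is shape-sensitive. Illustrative parameterisation, not the
# authors' exact generalized Fréchet distance.
import numpy as np

def frechet(t, x, s, y, omega=0.5):
    n, m = len(t), len(s)
    # Pointwise ground cost mixing time distance and value distance.
    d = np.sqrt(
        omega * (t[:, None] - s[None, :]) ** 2
        + (1.0 - omega) * (x[:, None] - y[None, :]) ** 2
    )
    # Standard dynamic program: minimise, over couplings, the maximum cost.
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                ca[i - 1, j] if i > 0 else np.inf,
                ca[i, j - 1] if j > 0 else np.inf,
                ca[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            ca[i, j] = max(d[i, j], best)
    return ca[-1, -1]

t = np.linspace(0.0, 1.0, 20)
x = np.sin(2 * np.pi * t)
s = t + 0.1                      # same shape, shifted in time
y = np.sin(2 * np.pi * t)
print(frechet(t, x, s, y, omega=0.0))  # shape-sensitive: near-zero distance
print(frechet(t, x, s, y, omega=1.0))  # time-sensitive: penalises the shift
```

Tuning `omega` (or drawing it at random at each tree node, as the abstract proposes) is what lets the forest cover both time-sensitive and shape-sensitive associations.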
33-1 ML 2: 4
Digital Twins you can 'count' on: a novel application of digital-twin prognostic scores in Negative-Binomial models
GlaxoSmithKline, United Kingdom

Introduction
Methods for including prognostic information in clinical trials have seen a resurgence via the development of so-called 'digital twins' (DT), whereby an ensemble of machine-learning models is trained on historical data and then adopted within the analysis of a current trial. The potential to control statistical uncertainty while preserving asymptotic unbiasedness of the marginal treatment estimator may be viewed favourably by regulators. Still, applications to non-linear outcomes such as binary, count, or time-to-event endpoints require more understanding, and issues around non-collapsibility must be addressed. More recently, Conner et al. [1] evaluated the collapsibility property of the rate ratio (RR) treatment effect, concluding that the marginal RR equals the conditional RR when covariates are purely prognostic (i.e., when there is no heterogeneity of the treatment effect). The aim of this work is to evaluate the use of DTs as a novel application in the analysis of count data, and their utility in trials for respiratory diseases such as COPD, where the primary analysis is frequently a Negative-Binomial or Poisson regression model.

Methods
We evaluate the performance of DT models for count or recurrent endpoints in an extensive simulation study. We take inspiration from COPD trial data to motivate a diverse set of plausible scenarios. We vary the design parameters of the current clinical trial, while additionally exploring how a realistic drift in performance (i.e., historical data versus current trial data) may impact key operating characteristics (e.g., power, type I error, bias).

Results
A full set of results will be presented to illustrate the expected effect of applying DT on sample size, and what realistic expectations might be under each simulation scenario. Additionally, we will present whether the expected consistency between the conditional and marginal treatment effect estimates holds, and where further investigation may be needed.

Conclusion
Through this work we demonstrate a novel application of DT methodologies for count or recurrent endpoints. Our work highlights new areas of expansion for DTs and indicates possible 'quick wins' for leveraging prognostic information in COPD trials.
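A hypothetical end-to-end sketch of the workflow described above, assuming the DT prognostic score simply enters the trial's Negative-Binomial regression as a covariate; the model names, simulated COPD-like data, and the gradient-boosting prognostic learner are all illustrative, not GSK's implementation.

```python
# Assumed DT workflow: train a prognostic model on historical control data,
# score current-trial patients, and adjust the NB analysis for that score.
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

# Historical (control) data: covariates and exacerbation counts.
n_hist, n_trial = 2000, 300
X_hist = rng.normal(size=(n_hist, 5))
mu_hist = np.exp(0.2 + 0.5 * X_hist[:, 0] - 0.3 * X_hist[:, 1])
y_hist = rng.negative_binomial(n=2, p=2 / (2 + mu_hist))  # mean mu_hist

# Digital-twin prognostic model trained on history only.
dt_model = GradientBoostingRegressor().fit(X_hist, y_hist)

# Current trial: randomised treatment, same outcome model plus a drug effect.
X_trial = rng.normal(size=(n_trial, 5))
treat = rng.integers(0, 2, size=n_trial)
mu = np.exp(0.2 + 0.5 * X_trial[:, 0] - 0.3 * X_trial[:, 1] - 0.4 * treat)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

# Prognostic score: DT-predicted count under control, log-transformed.
score = np.log1p(dt_model.predict(X_trial))

# NB regression adjusting for the prognostic score
# (dispersion alpha fixed here for simplicity; in practice it is estimated).
design = sm.add_constant(np.column_stack([treat, score]))
fit = sm.GLM(y, design, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(fit.summary())  # coefficient on the treatment column ~ log rate ratio
```

Under the collapsibility result cited in the abstract, the adjusted (conditional) log-RR on the treatment coefficient should agree with the marginal log-RR when the score is purely prognostic.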
33-1 ML 2: 5
Calibrating machine learning approaches for probability estimation in the absence of calibration data
1Cardio-CARE, Switzerland; 2BDH-Klinik Elzach, Elzach, Germany; 3Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; 4Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; 5School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa

Statistical prediction models are gaining popularity in applied research. One challenge is the transfer of a prediction model to a population that may be structurally different from the one for which the model was developed. If a cohort is available from the target population, the model can be calibrated to the characteristics of that population. Elkan proposed a closed formula for calibration when a calibration cohort is lacking; his method relies on the covariate distributions among affected and unaffected individuals being equal in both the existing and the calibration cohort. In this presentation, we propose a novel method that uses synthetic data generated from an existing cohort in conjunction with marginal statistics from the calibration cohort. A "recalibration" logistic model is fitted in the synthetic data and used to recalibrate the predicted probabilities in the calibration cohort. We illustrate the novel approach in a simulation study and with two real data sets. The simulation studies and the illustrations demonstrate the potential of this approach for calibration in the absence of calibration data, provided the marginals are correctly specified and the correlations between variables are identical in the model-development population and the population for which calibration is required.
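A minimal sketch of the two ingredients discussed above: `prior_shift` is a prevalence adjustment in the spirit of Elkan's closed formula, and the Platt-style logistic model fitted on a synthetic cohort is our illustrative reading of the proposed recalibration step (the generation of synthetic data matching the target marginals is omitted). Neither is the authors' exact implementation.

```python
# Illustrative calibration sketch: closed-formula prior shift, then a
# logistic "recalibration" model fitted in an assumed synthetic cohort.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prior_shift(p, b, b_new):
    """Map probabilities from development base rate b to target base rate b_new,
    assuming identical within-class covariate distributions in both cohorts."""
    num = p * b_new / b
    den = num + (1.0 - p) * (1.0 - b_new) / (1.0 - b)
    return num / den

def logit(p):
    return np.log(p / (1.0 - p))

# Closed-formula calibration: prevalence 30% at development, 10% in target.
p_dev = np.array([0.10, 0.30, 0.70])
print(prior_shift(p_dev, b=0.30, b_new=0.10))  # probabilities shrink toward 0

# Recalibration model: outcome ~ logit(p) fitted in a synthetic cohort
# (p_syn, y_syn stand in for synthetic data matched to the target marginals),
# then applied to the predictions needing recalibration.
rng = np.random.default_rng(2)
p_syn = rng.uniform(0.05, 0.95, size=500)
y_syn = rng.binomial(1, p_syn ** 1.5)        # placeholder synthetic outcome
recal = LogisticRegression().fit(logit(p_syn)[:, None], y_syn)
p_recal = recal.predict_proba(logit(p_dev)[:, None])[:, 1]
print(p_recal)
```

The closed formula needs only the base rates, which is what makes it usable without a calibration cohort; the synthetic-data route additionally exploits marginal statistics of the target population, as the abstract proposes.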