Session
Machine learning 2

Presentations

33-1 ML 2: 1
Asymmetric Shapley values to quantify the importance of genomic variables in clinical prediction studies
Amsterdam UMC, The Netherlands

In many clinical prediction studies, genomics data are combined with other variables, such as gender, sex, and disease state. In such models, the additional predictive strength of genes is often disappointingly small. This may lead to the premature conclusion that genomics is irrelevant for the disease or its progression. Quantifying the 'importance' of genomic variables purely by assessing the decrease in prediction accuracy after removing these variables from the model has two shortcomings. First, it does not account well for correlations. If the genomic variables are highly correlated with the other (low-dimensional) variables, the latter act as a buffer when genomics is removed from the model, leading to little decline in performance. Still, the two types of variables may be equally important for the outcome. Shapley values, either for single variables or for groups of variables, provide a solution in the sense that importance is spread out over correlated variables. A second shortcoming is that causality and temporal ordering are completely ignored. This is particularly relevant, as genomic aberrations are often conceived to be close to the root cause, which means they may have indirect effects on the outcome via some of the other variables. Motivated by an application to the prediction of relapse-free survival for colorectal cancer patients, we study asymmetric Shapley values to quantify the importance of genomic variables when disease state acts as a mediator and when several confounders are present. We show that accounting for the ordering matters for quantifying the importance of genomics. Moreover, we address regularization to deal with the dependency of the mediator (and confounders) on the high-dimensional genomics variable; these dependencies are crucial for the calculation of Shapley values. Finally, we illustrate how to perform inference with the asymmetric Shapley values to compare the importance of genomics with that of other variables. Extensions to pre-defined gene sets will also be discussed. To summarize, this work helps researchers obtain a better quantification of the (relative) importance of genomic variables in clinical prediction settings.
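For readers unfamiliar with the distinction, the sketch below contrasts symmetric and asymmetric Shapley values for groups of variables: symmetric values average marginal contributions over all orderings, while asymmetric values average only over orderings consistent with an assumed causal order (here, genomics before its mediator). The value function (cross-validated R² of a ridge model) and the toy data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of group-wise symmetric vs. asymmetric Shapley values.
# Assumption: value function v(S) = cross-validated R^2 of a ridge model
# refit on the variable groups in S (the authors' value function may differ).
from itertools import permutations

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def value(X_groups, y, subset):
    """v(S): CV R^2 of a ridge model using only the groups in `subset`."""
    if not subset:
        return 0.0
    X = np.hstack([X_groups[g] for g in subset])
    return cross_val_score(Ridge(), X, y, scoring="r2", cv=5).mean()

def shapley(X_groups, y, orderings):
    """Average marginal contribution of each group over the given orderings."""
    phi = {g: 0.0 for g in X_groups}
    for order in orderings:
        seen = []
        for g in order:
            phi[g] += value(X_groups, y, seen + [g]) - value(X_groups, y, seen)
            seen.append(g)
    return {g: phi[g] / len(orderings) for g in phi}

# Toy data: high-dimensional genomics, a mediator (disease state), a confounder.
rng = np.random.default_rng(0)
n = 200
genomics = rng.normal(size=(n, 50))
confounder = rng.normal(size=(n, 1))
mediator = genomics[:, :1] + confounder + rng.normal(size=(n, 1))
y = (mediator + 0.5 * confounder).ravel() + rng.normal(size=n)
X_groups = {"genomics": genomics, "mediator": mediator, "confounder": confounder}

all_orders = list(permutations(X_groups))
# Asymmetric: keep only orderings where genomics enters before its mediator.
causal_orders = [o for o in all_orders if o.index("genomics") < o.index("mediator")]

print("symmetric :", shapley(X_groups, y, all_orders))
print("asymmetric:", shapley(X_groups, y, causal_orders))
```

Because the mediator carries much of the genomic signal, the symmetric attribution splits credit between the two; restricting to causally consistent orderings shifts credit back toward genomics, which is the effect the abstract describes.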
33-1 ML 2: 2
Fused Estimation of Varying Omics Effects for Clinico-genomic Data
1Amsterdam University Medical Centers, The Netherlands; 2Vrije Universiteit Amsterdam, The Netherlands

Background

Methods

Results and Conclusion
33-1 ML 2: 3
Random forests using longitudinal predictors
1Univ. Bordeaux, INSERM, INRIA, BPH, U1219, France; 2Univ. Bordeaux, INSERM, BPH, U1219, France

Introduction and Objectives
Random Forests are an effective predictive tool, particularly in high-dimensional settings. However, they are not well suited for longitudinal data collected over time. To address this limitation, Fréchet Random Forests [1] were proposed. They can handle any type of data within a metric space by using a distance tailored to each data type (e.g., images, trajectories). This work aimed to implement the Fréchet Random Forest for trajectory data, fully exploiting the flexibility of the generalized Fréchet distance, and to evaluate the performance of the Fréchet Random Forest in predicting a continuous outcome from longitudinal inputs.

Methods
The generalized discrete Fréchet distance depends on a time-shifting parameter, called the timescale, which modifies its behavior. We propose two implementations: the timescale defined as a hyperparameter, or the timescale randomly drawn at each tree node to explore all time-sensitivity behaviors. A simulation study was conducted to illustrate the flexibility of the Fréchet Random Forest in capturing different scenarios of association: (i) time-sensitive association, (ii) shape-sensitive association, and (iii) a mix of both. We then applied the method to data from a population-based cohort to predict the risk of dementia from clinical marker trajectories.

Results
The simulations illustrated the flexibility of Fréchet Random Forests to adapt to different types of associations through timescale tuning. The Fréchet forests also demonstrated better predictive performance (MSE) across all three scenarios compared to classical Random Forests with pre-determined features. On the application data, the Fréchet forests outperformed classical forests, even with more irregular and sparse data, while identifying similar predictive markers.

Conclusion
Thanks to its tunable timescale parameter, which can adapt to different structures of association, the Fréchet Random Forest constitutes a flexible tool for prediction based on longitudinal data.

[1] Capitaine L. et al. Fréchet Random Forests for Metric Space Valued Regression with Non-Euclidean Predictors. JMLR. 2024.
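The abstract does not give the exact form of the generalized distance, so the sketch below shows one plausible parameterisation to illustrate the role of the timescale: a pointwise cost mixing time shifts and value differences with a weight `omega`, combined by the standard discrete Fréchet dynamic program. The weighting scheme is an assumption for illustration only.

```python
# Sketch of a discrete Fréchet distance between trajectories (t, x) and (s, y)
# with a timescale weight `omega` in [0, 1]: omega -> 1 is time-sensitive,
# omega -> 0 is shape-sensitive. Illustrative parameterisation, not the
# authors' exact generalized Fréchet distance.
import numpy as np

def frechet(t, x, s, y, omega=0.5):
    n, m = len(t), len(s)
    # Pointwise ground cost mixing time distance and value distance.
    d = np.sqrt(
        omega * (t[:, None] - s[None, :]) ** 2
        + (1.0 - omega) * (x[:, None] - y[None, :]) ** 2
    )
    # Standard dynamic program: minimise, over couplings, the maximum cost.
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                ca[i - 1, j] if i > 0 else np.inf,
                ca[i, j - 1] if j > 0 else np.inf,
                ca[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            ca[i, j] = max(d[i, j], best)
    return ca[-1, -1]

t = np.linspace(0.0, 1.0, 20)
x = np.sin(2 * np.pi * t)
s = t + 0.1                      # same shape, shifted in time
y = np.sin(2 * np.pi * t)
print(frechet(t, x, s, y, omega=0.0))  # shape-sensitive: near-zero distance
print(frechet(t, x, s, y, omega=1.0))  # time-sensitive: penalises the shift
```

Tuning `omega` (or drawing it at random at each tree node, as the abstract proposes) is what lets the forest cover both time-sensitive and shape-sensitive associations.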
33-1 ML 2: 4
Digital Twins you can 'count' on: a novel application of digital-twin prognostic scores in Negative-Binomial models
GlaxoSmithKline, United Kingdom

Introduction
Methods for including prognostic information in clinical trials have seen a resurgence via the development of so-called 'digital twins' (DT), whereby an ensemble of machine-learning models is trained on historical data and then adopted within the analysis of a current trial. The potential to control statistical uncertainty while preserving asymptotic unbiasedness of the marginal treatment estimator may be viewed favourably by regulators. Still, applications to non-linear outcomes such as binary, count, or time-to-event endpoints require more understanding, and issues around non-collapsibility must be addressed. More recently, Conner et al. [1] evaluated the collapsibility property of the rate ratio (RR) treatment effect, concluding that the marginal RR equals the conditional RR when covariates are purely prognostic (i.e., when there is no heterogeneity of the treatment effect). The aim of this work is to evaluate the use of DTs as a novel application in the analysis of count data, and their utility in trials for respiratory diseases such as COPD, where the primary analysis is frequently a Negative-Binomial or Poisson regression model.

Methods
We evaluate the performance of DT models for count or recurrent endpoints in an extensive simulation study. We take inspiration from COPD trial data to motivate a diverse set of plausible scenarios. We vary the design parameters of the current clinical trial, while additionally exploring how a realistic drift in performance (i.e., historical data versus current trial data) may impact key operating characteristics (e.g., power, type I error, bias).

Results
A full set of results will be presented to illustrate the expected effect of applying DT on sample size, and what realistic expectations might be under each simulation scenario. Additionally, we will present whether the expected consistency between the conditional and marginal treatment effect estimates holds, and where further investigation may be needed.

Conclusion
Through this work we demonstrate a novel application of DT methodologies for count or recurrent endpoints. Our work highlights new areas of expansion for DTs and indicates possible 'quick wins' for leveraging prognostic information in COPD trials.
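A hypothetical end-to-end sketch of the workflow described above, assuming the DT prognostic score simply enters the trial's Negative-Binomial regression as a covariate; the model names, simulated COPD-like data, and the gradient-boosting prognostic learner are all illustrative, not GSK's implementation.

```python
# Assumed DT workflow: train a prognostic model on historical control data,
# score current-trial patients, and adjust the NB analysis for that score.
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

# Historical (control) data: covariates and exacerbation counts.
n_hist, n_trial = 2000, 300
X_hist = rng.normal(size=(n_hist, 5))
mu_hist = np.exp(0.2 + 0.5 * X_hist[:, 0] - 0.3 * X_hist[:, 1])
y_hist = rng.negative_binomial(n=2, p=2 / (2 + mu_hist))  # mean mu_hist

# Digital-twin prognostic model trained on history only.
dt_model = GradientBoostingRegressor().fit(X_hist, y_hist)

# Current trial: randomised treatment, same outcome model plus a drug effect.
X_trial = rng.normal(size=(n_trial, 5))
treat = rng.integers(0, 2, size=n_trial)
mu = np.exp(0.2 + 0.5 * X_trial[:, 0] - 0.3 * X_trial[:, 1] - 0.4 * treat)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

# Prognostic score: DT-predicted count under control, log-transformed.
score = np.log1p(dt_model.predict(X_trial))

# NB regression adjusting for the prognostic score
# (dispersion alpha fixed here for simplicity; in practice it is estimated).
design = sm.add_constant(np.column_stack([treat, score]))
fit = sm.GLM(y, design, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(fit.summary())  # coefficient on the treatment column ~ log rate ratio
```

Under the collapsibility result cited in the abstract, the adjusted (conditional) log-RR on the treatment coefficient should agree with the marginal log-RR when the score is purely prognostic.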
33-1 ML 2: 5
Calibrating machine learning approaches for probability estimation in the absence of calibration data
1Cardio-CARE, Switzerland; 2BDH-Klinik Elzach, Elzach, Germany; 3Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; 4Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; 5School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa

Statistical prediction models are gaining popularity in applied research. One challenge is the transfer of a prediction model to a population that may be structurally different from the one for which the model was developed. If a cohort is available from the target population, the model can be calibrated to the characteristics of that population. Elkan proposed a closed formula for calibration when a calibration cohort is lacking; his method relies on the covariate distributions among affected and unaffected individuals being equal in both the existing and the calibration cohort. In this presentation, we propose a novel method that uses synthetic data generated from an existing cohort in conjunction with marginal statistics from the calibration cohort. A "recalibration" logistic model is fitted in the synthetic data and used to recalibrate the predicted probabilities in the calibration cohort. We illustrate the novel approach in a simulation study and with two real data sets. The simulation studies and the illustrations demonstrate the potential of this approach for calibration in the absence of calibration data, provided the marginals are correctly specified and the correlations between variables are identical in the model-development population and the population for which calibration is required.
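A minimal sketch of the two ingredients discussed above: `prior_shift` is a prevalence adjustment in the spirit of Elkan's closed formula, and the Platt-style logistic model fitted on a synthetic cohort is our illustrative reading of the proposed recalibration step (the generation of synthetic data matching the target marginals is omitted). Neither is the authors' exact implementation.

```python
# Illustrative calibration sketch: closed-formula prior shift, then a
# logistic "recalibration" model fitted in an assumed synthetic cohort.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prior_shift(p, b, b_new):
    """Map probabilities from development base rate b to target base rate b_new,
    assuming identical within-class covariate distributions in both cohorts."""
    num = p * b_new / b
    den = num + (1.0 - p) * (1.0 - b_new) / (1.0 - b)
    return num / den

def logit(p):
    return np.log(p / (1.0 - p))

# Closed-formula calibration: prevalence 30% at development, 10% in target.
p_dev = np.array([0.10, 0.30, 0.70])
print(prior_shift(p_dev, b=0.30, b_new=0.10))  # probabilities shrink toward 0

# Recalibration model: outcome ~ logit(p) fitted in a synthetic cohort
# (p_syn, y_syn stand in for synthetic data matched to the target marginals),
# then applied to the predictions needing recalibration.
rng = np.random.default_rng(2)
p_syn = rng.uniform(0.05, 0.95, size=500)
y_syn = rng.binomial(1, p_syn ** 1.5)        # placeholder synthetic outcome
recal = LogisticRegression().fit(logit(p_syn)[:, None], y_syn)
p_recal = recal.predict_proba(logit(p_dev)[:, None])[:, 1]
print(p_recal)
```

The closed formula needs only the base rates, which is what makes it usable without a calibration cohort; the synthetic-data route additionally exploits marginal statistics of the target population, as the abstract proposes.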