27-1 Machine Learning: 1
Performance evaluation of dimensionality reduction techniques on high-dimensional DNA methylation data
Kuldeep Kumar Sharma1, Binukumar B1, Binu V. S1, Thirumoorthy Chinnasamy1, Gokulakrishnan K1, Saravanan P2, Mohan V3
1National institute of mental health and neurosciences, India; 2Warwick Medical School, University of Warwick, UK; 3Madras Diabetes Research Foundation (MDRF), Chennai, India
Introduction: In recent years, many biomedical researchers have begun recording thousands to millions of features simultaneously on each object or individual; such data are described as high-dimensional. One field that produces high-dimensional datasets is epigenetics, a subbranch of genetics that focuses on heritable changes in gene activity or function that occur without alterations to the DNA sequence. DNA methylation (DNAm) data are a prominent example. Methylation datasets include a beta value for each cytosine-phosphate-guanine (CpG) site, indicating the degree of methylation. A statistical framework for handling such high-dimensional data involves the use of various dimension reduction (DR) techniques. Knowing how each technique performs on DNAm datasets can help in deciding which DR technique to use.
Objective: This communication aims to reduce the dimensionality of DNAm data via various DR techniques: principal component analysis (PCA), partial least squares discriminant analysis (PLS-DA), multidimensional scaling (MDS), isometric mapping (ISOMAP) and uniform manifold approximation and projection (UMAP). The techniques are also compared in terms of the amount of information retained, local neighborhood preservation, and global structure preservation.
Methodology: Data for the current study were obtained from the ongoing METBIOWIN study. The dataset comprised 862,927 CpG sites and a grouping variable for gestational diabetes status (yes/no). Dimension reduction was performed via PCA, PLS-DA, MDS, ISOMAP, and UMAP. The amount of information retained, local neighborhood preservation, and global structure preservation were then assessed via measures such as Shannon’s entropy, Spearman’s rho, König’s measure, trustworthiness and continuity, Kruskal’s stress score, Sammon’s score, and residual variance.
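To make two of the listed measures concrete, the following is a minimal sketch, not the authors' implementation, of a Kruskal-type stress score and of trustworthiness computed from pairwise distances. The function names, the toy coordinates, and the stress normalisation (sum of squared original distances) are assumptions made for illustration; other normalisations are in use.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length coordinate lists."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def kruskal_stress(d_high, d_low):
    """sqrt(sum (d_high - d_low)^2 / sum d_high^2) over pairs i < j (assumed normalisation)."""
    n = len(d_high)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    num = sum((d_high[i][j] - d_low[i][j]) ** 2 for i, j in pairs)
    den = sum(d_high[i][j] ** 2 for i, j in pairs)
    return math.sqrt(num / den)

def neighbour_ranks(dist_row, i):
    """Rank of every other point by distance from point i (1 = nearest)."""
    others = [j for j in range(len(dist_row)) if j != i]
    order = sorted(others, key=lambda j: (dist_row[j], j))  # ties broken by index
    return {j: rank for rank, j in enumerate(order, start=1)}

def trustworthiness(d_high, d_low, k):
    """Penalise points that are k-nearest neighbours in the embedding but not in
    the original space; 1.0 means no such intruders. Intended for k < n / 2."""
    n = len(d_high)
    penalty = 0
    for i in range(n):
        r_high = neighbour_ranks(d_high[i], i)
        r_low = neighbour_ranks(d_low[i], i)
        for j, rank_low in r_low.items():
            if rank_low <= k and r_high[j] > k:
                penalty += r_high[j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))

# Toy data (hypothetical): six points in 3-D, "embedded" by dropping the third axis.
high = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 5], [3, 3, 0]]
low = [p[:2] for p in high]
d_high = pairwise_distances(high)
d_low = pairwise_distances(low)
print(kruskal_stress(d_high, d_low), trustworthiness(d_high, d_low, k=1))
```

Lower stress indicates better preservation of global distances, while trustworthiness closer to 1 indicates better preservation of local neighbourhoods.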
Result: PCA performed well in terms of information retention, followed by PLS-DA and MDS, while UMAP consistently showed the weakest performance. For local structure preservation, MDS and PCA excelled with high König’s measure and trustworthiness, whereas UMAP underperformed significantly. Regarding global structure preservation, MDS and UMAP showed the lowest Kruskal stress scores, indicating a strong fit, while ISOMAP performed the worst. Overall, MDS and PCA were the most effective, while UMAP lagged across all criteria.
Conclusion: Overall, MDS, PCA, and PLS-DA emerged as the most robust techniques across multiple metrics, whereas UMAP was consistently less effective.
Keywords: DNA methylation data, PCA, PLS-DA, MDS, ISOMAP, UMAP.
27-1 Machine Learning: 2
Deep Generalised Mixed Effects Models: a Novel General Neural Network Structure for Analysing Hierarchical Data
Nina van Gerwen1,2, Dimitris Rizopoulos1,2, Manon Hillegers3, Loes Keijsers4, Sten Willemsen1,2
1Department of Biostatistics, Erasmus University Medical Center, the Netherlands; 2Department of Epidemiology, Erasmus University Medical Center, the Netherlands; 3Department of Child Psychiatry, Erasmus University Medical Center, the Netherlands; 4Department of Psychology, Education and Child Studies, Erasmus University Rotterdam, the Netherlands
Background: The Experience Sampling Method (ESM) is an intensive longitudinal research design in which participants report their thoughts, emotional states and behaviours multiple times a day. ESM designs have become increasingly popular for investigating individuals’ daily experiences. Our work is motivated by ESM data collected by the GrowIt! app. During the COVID-19 pandemic, the app was released to investigate daily mood changes among young adults, give users insight into their emotions and enhance users’ resilience. Current procedures for analysing ESM data face various challenges. In particular, ESM data are high-dimensional and exhibit complex correlation structures, which standard statistical techniques (e.g., marginal and mixed effects models) cannot adequately capture. Alternatively, machine learning procedures, such as recurrent neural networks, can be used to model ESM data and accommodate these correlations. However, these procedures face a problem with missing data. In our motivating dataset, adolescents often stopped using the app after previously reporting strong negative emotions. Hence, the implied missing data are of the missing-at-random type, which standard machine learning procedures cannot accommodate.
Methods: To overcome these challenges, we develop a novel neural network (NN) architecture that generalises mixed effects models to deep learning. Our Deep Generalised Mixed Model (DGMM) allows semiparametric and highly flexible modelling of the data’s mean and correlation structure with NNs. Classical estimation of mixed models requires integration over the random effects distribution, which is intractable when we estimate the random effects with an NN. Therefore, we use an adaptation of variational autoencoders to estimate the DGMM. By specifying a tractable variational distribution to sample from, we bound the marginal log-likelihood from below by the expected log-likelihood under the variational distribution minus the Kullback-Leibler divergence between the variational distribution and the marginal distribution of the random effects; this bound is known as the Evidence Lower Bound. The variational distribution can also be seen as a nonlinear function of the data, which we estimate with another NN. Through this approach, the DGMM is able to accommodate longitudinal outcomes following any generic distribution, scale well to high-dimensional settings and provide valid inference when data are missing-at-random.
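The Evidence Lower Bound can be illustrated on a toy model in which everything is available in closed form. The sketch below is not the DGMM: it assumes a single Gaussian random intercept b ~ N(0, 1) with outcomes y_j | b ~ N(b, 1), a Gaussian variational distribution q(b) = N(mu, var), and hypothetical data values. It shows that the bound equals the exact marginal log-likelihood when q is the true posterior and falls below it otherwise.

```python
import math

def log_marginal(y):
    """Exact log p(y) for b ~ N(0, 1), y_j | b ~ N(b, 1) (toy model, assumed)."""
    m = len(y)
    s = sum(y)
    ss = sum(v * v for v in y)
    quad = ss - s * s / (m + 1)  # y' (I + 11')^{-1} y
    return -0.5 * m * math.log(2 * math.pi) - 0.5 * math.log(m + 1) - 0.5 * quad

def elbo(y, mu, var):
    """E_q[log p(y | b)] - KL(q || N(0, 1)) for q(b) = N(mu, var)."""
    m = len(y)
    expected_loglik = (-0.5 * m * math.log(2 * math.pi)
                       - 0.5 * sum((v - mu) ** 2 + var for v in y))
    kl = 0.5 * (var + mu * mu - 1.0 - math.log(var))
    return expected_loglik - kl

# Hypothetical repeated measurements for one individual.
y = [0.8, 1.1, 0.3, 1.6]

# In this conjugate toy model the exact posterior of b is Gaussian.
post_mean = sum(y) / (len(y) + 1)
post_var = 1.0 / (len(y) + 1)

print("log marginal      :", log_marginal(y))
print("ELBO at posterior :", elbo(y, post_mean, post_var))
print("ELBO at prior     :", elbo(y, 0.0, 1.0))
```

In a model where the random effects enter through an NN, neither quantity is available in closed form, which is why the expectation is approximated by sampling from the variational distribution.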
Results: In the GrowIt! app data, the DGMM showed good predictive performance for the multivariate analysis of longitudinal outcomes. A simulation study of the DGMM also showed good performance in various settings.
Conclusion: We have implemented the DGMM in Python using Keras, and are developing a wrapper function for R users.
27-1 Machine Learning: 3
Adapting transformer neural networks for longitudinal data with few time points
Kiana Farhadyar, Harald Binder
Institute of Medical Biometry and Statistics, Germany
When simultaneously assessing several characteristics of individuals over time, there might be a complex pattern of relations among them. While autoregressive models can be useful in such a setting when there is a smooth continuous temporal pattern, they rely on fixed lag structures and struggle with a small number of time points. The attention-based weighting scheme in transformer neural network architectures might be better suited for discontinuous patterns, as it can assign weights to different time points, but so far, it has been limited to rather large datasets because of its large number of parameters. Therefore, we created a considerably simplified architecture that still maintains the key characteristics of transformers. This also allowed us to design a statistical testing approach for identifying context characteristics that steer the effect of other characteristics. This is complemented by a visualization approach for illustrating the pairwise relevance of characteristics. We illustrate our proposed technique using both simulated data and real-world data comprising self-reported stressors from a longitudinal resilience assessment study. There, prediction performance is seen to improve over classical regression approaches. In addition, the statistical testing approach uncovers the underlying patterns in the simulation study and highlights significant features in the real data that align with the mental health dynamics.
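As an illustration of how attention assigns weights to different time points, the following is a minimal sketch of generic scaled dot-product attention for a single query over a short sequence. It is not the simplified architecture proposed in the abstract; the function names and the numeric key, value and query vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention: one weight per time point, then a weighted
    average of the value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Three time points (hypothetical), each with a 2-D key and a 2-D value.
keys = [[0.1, 0.9], [1.0, 0.0], [0.2, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]

weights, context = attend(query, keys, values)
print(weights, context)
```

Unlike a fixed lag structure, the weights depend on the content of the query and keys, so a distant time point can receive more weight than an adjacent one.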
27-1 Machine Learning: 4
Distinguishing subgroup and site-specific heterogeneity in multi-site prognostic models using a neural network representation
Max Behrens1,2, Daiana Stolz3, Eleni Papakonstantinou3, Janis M. Nolde4, Gabriele Bellerino5, Moritz Hess1,2, Harald Binder1,2
1Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Stefan-Meier-Straße 26, 79104 Freiburg, Germany; 2Freiburg Center for Data Analysis and Modeling, University of Freiburg, Ernst-Zermelo-Straße 1, 79104 Freiburg, Germany; 3Clinic of Pneumology, Medical Center – University of Freiburg, Faculty of Medicine, University of Freiburg, Killianstrasse 5, 79106 Freiburg, Germany; 4Department of Nephrology, Faculty of Medicine and Medical Center, University of Freiburg, Hugstetter Strasse 55, 79106 Freiburg, Germany; 5University of Freiburg, Department of Mathematical Stochastics, Ernst-Zermelo-Straße 1, 79104 Freiburg, Germany
Introduction: Multi-site clinical studies often struggle with heterogeneity in the effects of patient characteristics on outcomes. This variability may arise not only from inherent differences between (unknown) patient subgroups but also from systematic variations across sites, including differing subgroup compositions or clinical practices. Often, we assume homogeneity or rely on site-level adjustments through interaction terms or random effects models, not covering the full spectrum of heterogeneity. We propose a more flexible approach leveraging a low-dimensional representation, obtained via neural networks, to quantify different sources of heterogeneity. Specifically, the aim is to distinguish between patient subgroup differences that are invariant across sites and site-specific differences.
Methods: Our approach employs an autoencoder to learn a low-dimensional representation of the high-dimensional patient characteristic space, preserving essential patterns while reducing noise and redundancy. We integrate prognostic modelling directly within this learned latent representation. For each patient, we fit a localized regression model—weighting nearby patients based on their proximity in the latent space—to obtain patient-specific coefficient estimates that capture local prognostic variations. By examining the distribution of these patient-specific coefficients, we quantify overall heterogeneity for each latent dimension. Further, we can disentangle site effects by analysing patterns in these coefficients across sites. The entire method, from autoencoder training to localized regression, is trained end-to-end. This joint optimization ensures the latent representation not only preserves patterns in patient characteristics but also local variations in prognostic effects, making heterogeneity patterns more discernible.
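The localized regression step can be sketched as a kernel-weighted least-squares fit centred on each patient. The sketch below makes several assumptions that are not taken from the abstract: the latent coordinates are one-dimensional and already given (the autoencoder and the end-to-end training are omitted), the weighting uses a Gaussian kernel with a fixed bandwidth, there is a single covariate, and all data values are hypothetical.

```python
import math

def local_slope(target, latent, x, y, bandwidth=1.0):
    """Weighted least-squares slope of y on x, with weights that decay with
    latent-space distance from patient `target` (Gaussian kernel, assumed)."""
    w = [math.exp(-0.5 * ((z - latent[target]) / bandwidth) ** 2) for z in latent]
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

# Hypothetical data: two well-separated groups in latent space with opposite
# covariate effects (slope 2 in the first group, slope -1 in the second).
latent = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0, -1.0, -2.0, -3.0]

slopes = [local_slope(i, latent, x, y) for i in range(len(latent))]
print(slopes)
```

The spread of these patient-specific slopes is one way to summarise heterogeneity, and comparing their distribution across sites is one way to look for site-specific patterns.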
Results: We illustrate our method with a multi-site dataset of patients with chronic obstructive pulmonary disease, investigating the heterogeneity in prognostic factors for disease progression. Within individual sites, distinct subgroups were found that differed in their coefficient estimates. Across sites, we identified both site-invariant subgroups and site-specific subgroups, likely reflecting differences in patient demographics or clinical practices. To aid interpretations, we provide a visualization approach for translating these patterns back to the level of patient characteristics.
Conclusion: Our autoencoder-based approach provides a tool for quantifying and decomposing heterogeneity in multi-site clinical data. By learning a latent representation optimized for both data reconstruction and preservation of local prognostic effects, we can effectively identify and differentiate site-specific and site-invariant sources of heterogeneity. This method allows for a more nuanced understanding of prognostic factors, moving beyond global estimates and facilitating the development of more robust and personalized models.
27-1 Machine Learning: 5
Unraveling Breast Cancer Genetic Risk in Chinese Women: Integrating GWAS, Fine-Mapping, and Machine Learning in the China Kadoorie Biobank
Shizhe Xu1, Christiana Kartsonaki1, Kuang Lin1, Kyriaki Michailidou2
1University of Oxford, United Kingdom; 2The Cyprus Institute of Neurology and Genetics, Cyprus
Background / Introduction: Genome-wide association studies (GWAS) have identified approximately 200 genomic regions containing common genetic variants associated with breast cancer risk. However, their target genes remain uncertain mainly due to linkage disequilibrium (LD) and the prevalence of variants in non-coding regions. To address this, fine-mapping methods have been introduced to pinpoint the most likely causal variants from a set of credible candidate variants and identify target genes. Most previous GWAS and fine-mapping studies have primarily focused on European-ancestry individuals. Given differences in genetic architecture and environmental exposures between Asian and European populations, our study aims to conduct GWAS on Chinese women and perform fine-mapping with summary statistics to uncover additional association signals and candidate susceptibility genes for breast cancer. Furthermore, we integrate machine learning models into the fine-mapping process.
Methods: First, we performed a GWAS on 57,660 Chinese women from the China Kadoorie Biobank using two software packages, SAIGE and REGENIE. Second, to understand how these packages handle complex living regions in China, we analysed specific loci and explicitly compared their approaches to computing relatedness and performing association testing by a Firth logistic regression model or a linear mixed model. Third, to distinguish true causal variants from significant signals, we applied fine-mapping methods such as SuSiE-RSS and PolyFun to our generated summary statistics. Fourth, to investigate computational trade-offs, we conducted fine-mapping on individual-level data and summary statistics to evaluate loss in accuracy. Finally, we applied a sequence-based deep learning model to assign functional annotations to variants in non-coding regions and incorporated a supervised learning approach, such as random forest.
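To illustrate how summary statistics can be turned into a credible set, the sketch below uses Wakefield's approximate Bayes factors under a single-causal-variant assumption with equal prior weights. This is a much simpler technique than SuSiE-RSS or PolyFun and is not the pipeline described in the abstract; the variant identifiers, effect sizes, standard errors and the prior variance of 0.04 are hypothetical choices for illustration.

```python
import math

def approx_bayes_factor(beta, se, prior_var=0.04):
    """Wakefield approximate Bayes factor in favour of association,
    from an effect estimate and its standard error."""
    v = se * se
    z2 = (beta / se) ** 2
    shrink = prior_var / (v + prior_var)
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z2 * shrink)

def credible_set(variants, coverage=0.95):
    """variants: list of (id, beta, se) within one locus.
    Returns posterior probabilities (assuming exactly one causal variant)
    and the smallest set of variants reaching the requested coverage."""
    bfs = {vid: approx_bayes_factor(beta, se) for vid, beta, se in variants}
    total = sum(bfs.values())
    pips = {vid: bf / total for vid, bf in bfs.items()}
    chosen, cumulative = [], 0.0
    for vid in sorted(pips, key=pips.get, reverse=True):
        chosen.append(vid)
        cumulative += pips[vid]
        if cumulative >= coverage:
            break
    return pips, chosen

# Hypothetical summary statistics for four variants in one locus.
locus = [
    ("variant_1", 0.12, 0.02),
    ("variant_2", 0.10, 0.02),
    ("variant_3", 0.04, 0.02),
    ("variant_4", 0.01, 0.02),
]

pips, selected = credible_set(locus)
print(pips, selected)
```

Methods such as SuSiE-RSS additionally use the LD matrix and allow several causal variants per locus, which this sketch ignores.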
Results: Summary statistics, Manhattan plots, QQ plots and LD score regression files were generated. Among Chinese women, several genetic loci associated with breast cancer were identified. The signals detected by SAIGE were slightly more significant than those identified by REGENIE due to differences in their underlying algorithms and correction thresholds. A comparison was conducted between fine-mapping results derived from individual-level data and summary statistics. A systematic evaluation was conducted to assess whether functional annotations and supervised learning enhance fine-mapping accuracy.
Conclusion: This study provides new insights into breast cancer genetics based on data from Chinese women. The comparison between REGENIE and SAIGE enhances our understanding of their strengths and limitations. This study also visualises the differences between fine-mapping using individual-level data and summary statistics. Finally, a machine learning-based framework in fine-mapping paves the way for more explicit analysis.