Novel Approaches for Cancer Subtypes Discovery and Pathway Analysis
Abstract
Complex diseases, particularly cancer, encompass a wide range of disorders, from aggressive and lethal to indolent lesions with low or delayed potential for progression to death. Treatment options and success depend heavily on the disease subtype of each patient, which is often determined from molecular features. The advent of high-throughput platforms in the past decade has generated a wealth of molecular data, not only for gene expression but also for other data types, including DNA methylation and non-coding microRNA. This has significantly increased the number of samples available to cover the heterogeneity of these diseases and has allowed subtyping from a more holistic perspective, considering phenomena at different molecular levels in a single analysis. However, the stochastic nature of omics data and its high dimensionality have hindered consensus among different omic levels and the interpretability of the discovered subtypes, necessitating a powerful integrative technique that can handle noise, high dimensionality, and large sample sizes for improved subtyping.
Following subtyping, understanding the biological mechanisms driving subtype differences in complex diseases remains crucial for developing effective treatments and therapies. Pathway analysis and gene set enrichment analysis are widely used to determine which biological processes are significantly impacted between conditions. However, current pathway analysis methods are biased towards well-studied diseases, are sensitive to noise, and have had limited validation across diverse datasets and conditions, which leaves their effectiveness unclear when analyzing new diseases, complex etiologies, or data with weak signals compared to controls. Moreover, inconsistencies among different methods hinder interpretation of, and confidence in, the results used for downstream analyses.
This dissertation addresses these challenges by investigating a wide range of techniques for integrating multi-omics data to subtype cancer patients, including matrix factorization, genetic algorithms, and similarity-based methods. We introduce several novel subtyping frameworks: the Multi-objective Genetic K-means clustering Algorithm (MGKA), Disease Subtyping using Community detection from Consensus networks (DSCC), PINSPlus, and Subtyping Multi-omics using a Randomized Transformation (SMRT). MGKA uses a multi-objective genetic algorithm to refine the k-means clustering algorithm and automatically determine the optimal number of subtypes. DSCC employs a consensus network approach, building patient similarity networks from individual data types and using community detection to identify robust subtypes. PINSPlus extends the original PINS method, integrating multiple data types and providing a more accurate and efficient subtyping analysis. SMRT is capable of integrating a large number of omics data types to subtype cancer patients. Through an extensive analysis of over 11,000 patients across 37 cancer types, we demonstrate the ability of these methods to detect cancer subtypes with significant differences in patient risk and survival. Notably, these methods easily handle a large number of data types and patients, are robust against noise and missing data, and gain accuracy as more data types are integrated.
For the second challenge, we first introduce a web interface that offers pathway analysis using multiple methods and datasets in a single session with rich visualization features, allowing life scientists to easily conduct pathway analysis, compare results from different methods and datasets, and reach better consensus for downstream analyses.
We then introduce a novel consensus pathway analysis approach, Perturbation-based Gene Set Analysis (PGSA), which efficiently determines significantly impacted pathways across a wide range of diseases. We analyzed 421 datasets from more than 30 diseases, demonstrating PGSA’s superior performance compared to state-of-the-art methods in identifying significantly impacted pathways. This marks the first time a pathway analysis method has been tested on such a large number of datasets and diseases, a scale intended to prevent bias toward, and overfitting to, well-studied diseases.