Abstract: It has long been recognized that genomic variants contribute to human cancer. Two main types of variants, single nucleotide polymorphisms (SNPs) and copy number alterations (CNAs), have been widely explored in recent genome-wide association studies (GWASs). Characterizing these variants helps us understand the genesis and progression of tumors and thus provides valuable information for the diagnosis and treatment of human cancer. To this end, we simulate genomic variants and identify genomic patterns (i.e., significant genomic variants and the structures among them) with respect to SNPs and CNAs. The five key contributions of the thesis are summarized below.
1. To give a clear picture of the development of genome simulation, we comprehensively compare existing simulators with respect to evolutionary and demographic scenarios, computational efficiency, and applicability in genome studies. This comparison will help researchers make an informed choice of simulator and will support progress on new simulation methods.
2. An important limitation of existing genome simulators is that efficiency and flexibility cannot both be achieved at once. We propose a new algorithm, SIMLD, to simulate realistic linkage disequilibrium (LD) patterns and case-control samples. SIMLD has two main features: (1) fewer evolutionary generations are required to converge to real LD patterns; and (2) various disease models can be flexibly incorporated to produce phenotypes.
3. To search for susceptibility SNPs and epistatic models that underlie human cancer, we propose a novel SNP association study method based on probability theory, called ProbSNP. The experimental results show that ProbSNP performs well on both simulated and real data when compared with other methods. ProbSNP has three main features: (1) the joint probability between SNPs and phenotypes is modelled to assess the importance of SNPs; (2) the stability of the SNP selection is validated through a resampling process; and (3) the search space for epistatic models is reduced by the preceding step of individual SNP selection.
4. In addition to SNPs, somatic copy number alterations (CNAs) in genomes underlie almost all human cancers. To distinguish significant consensus events (SCEs) from random background CNAs, we develop a novel algorithm, called iSCE, which uses a permutation test to determine significance based on a new statistic. The experimental results show that iSCE outperforms other methods, achieving a larger area under the receiver operating characteristic (ROC) curve. iSCE has three novel features: (1) it accounts for the strong correlation among neighboring probes and thus assigns a score to each region instead of to each single probe; (2) it conducts permutations on ensemble CNA segments rather than on single probes across samples; and (3) it iteratively performs significance assessment and SCE-exclusive permutations.
5. To identify subtype-specific SCEs in heterogeneous diseases, we analyze two types of ovarian cancer, primary-recurrent ovarian cancer and high-grade ovarian cancer, with respect to CNAs, using clustering and the iSCE algorithm. The identified patterns show biological significance when compared with regions known to be associated with oncogenes (EGFR, KRAS, MYC, etc.) and tumor suppressor genes (CDKN2A/B, PTEN, etc.). The results will be helpful for exploring subtype-specific diagnosis and treatment.
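
Contribution 2 refers to LD patterns between loci. As a reminder of what such a pattern measures, the following minimal sketch computes the standard pairwise statistic r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)), with D = p_AB - p_A p_B, from phased haplotypes. It is not part of SIMLD; the function name `ld_r2` and the input layout (a list of 0/1 allele pairs) are assumptions made only for illustration.

```python
def ld_r2(haplotypes):
    """Return r^2 between two biallelic loci.

    `haplotypes` is a list of (a, b) pairs, where a and b are the 0/1
    alleles carried by one phased haplotype at locus A and locus B.
    (Hypothetical input layout, chosen for this sketch.)
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")

    # Allele frequencies of allele 1 at each locus.
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    # Frequency of haplotypes carrying allele 1 at both loci.
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b  # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # A monomorphic locus leaves r^2 undefined; this sketch reports 0.
        return 0.0
    return d * d / denom
```

A simulator in the spirit of contribution 2 could compare the matrix of such values in simulated data against the matrix observed in a reference panel to judge convergence; how SIMLD itself does this is not specified in this abstract.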
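
Contribution 4 relies on a permutation test that keeps CNA segments intact instead of shuffling single probes. The sketch below shows one generic way to do this and is not the iSCE algorithm or its statistic: each sample's binary alteration profile is cyclically shifted by a random offset, which preserves segment lengths, and the observed per-probe recurrence is compared with the maximum recurrence under the shifted null. All names (`recurrence`, `cyclic_shift`, `permutation_pvalues`) and the binary-profile input are assumptions for illustration.

```python
import random


def recurrence(profiles):
    """Fraction of samples altered at each probe (profiles are 0/1 lists)."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def cyclic_shift(profile, offset):
    """Rotate a profile to the right by `offset` probes.

    Rotation keeps runs of altered probes contiguous (apart from
    wrap-around at the ends), unlike a probe-wise shuffle.
    """
    offset %= len(profile)
    return profile[-offset:] + profile[:-offset]


def permutation_pvalues(profiles, n_perm=1000, seed=0):
    """Per-probe p-values against a max-statistic null distribution."""
    rng = random.Random(seed)
    observed = recurrence(profiles)
    n_probes = len(observed)
    exceed = [0] * n_probes

    for _ in range(n_perm):
        # Shift every sample independently to break cross-sample alignment.
        shifted = [cyclic_shift(p, rng.randrange(n_probes)) for p in profiles]
        null_max = max(recurrence(shifted))
        for i, obs in enumerate(observed):
            if null_max >= obs:
                exceed[i] += 1

    # Add-one correction so that no p-value is exactly zero.
    return [(count + 1) / (n_perm + 1) for count in exceed]
```

Using the maximum over probes in each permutation controls the family-wise error rate across the genome. An iterative scheme like the one described for iSCE would additionally exclude already-detected events before re-permuting, which this sketch does not attempt.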