
Statistical Inference Methods for Genome-Wide Region-Based Association Analysis Using Data Dimension-Reduction Techniques

高青松  
Abstract: Many common human diseases, such as cancer, schizophrenia, essential hypertension, type 2 diabetes, and cardiovascular disease, are complex diseases. Complex diseases, also known as multifactorial diseases, are controlled by multiple genetic and environmental factors. Although they often aggregate in families, they do not follow a clear-cut pattern of inheritance, which makes it difficult to determine a person's risk of inheriting or passing on these disorders. With recent rapid improvements in high-throughput genotyping techniques and the growing number of available markers, genome-wide association studies (GWAS), which genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) in thousands of participants, have emerged as a promising approach for identifying SNPs that are marginally associated with complex diseases. In addition, research on gene-gene interactions (epistasis) in GWAS has shed light on some disease-associated pathways and networks and has improved our understanding of the genetic basis of complex diseases, despite the computational challenge. However, many analytic and interpretation challenges remain in GWAS. It is customary to run SNP-based association or interaction tests across the whole genome to identify causal or associated SNPs with strong marginal or joint epistatic effects on diseases or traits. In other words, the unit of association is the SNP. However, such SNP-based analysis usually leads to a heavy computational burden and the well-known multiplicity problem, with a highly inflated risk of type I error and decreased ability to detect modest effects. In the present study, higher-level units, such as genes or genome regions, were considered to deal with these and related challenges.
Under this framework, we proposed four methods to detect disease-associated genes or gene-gene interactions in the genome, presented in four chapters as follows.

Chapter 1: A new method to test the nonlinear feature in nonlinear principal component analysis

Given the SNPs allocated into genes or regions, the issue of how to evaluate genetic association for each candidate gene or genome region remains. As powerful multi-marker analysis methods, PCA-based methods are often applied in gene- or region-based association studies. PCA can capture linkage disequilibrium information and avoid multicollinearity between SNPs within a candidate gene or region. However, it extracts only the linear relationships between SNPs. In nonlinear situations, PCA-based methods lose power, and a nonlinear PCA model should be used. Therefore, in the present study, we introduced a nonlinearity measure to determine whether the underlying relationship within a given variable set can be described by a linear PCA model or whether a nonlinear PCA model must be used for further study. Applications to two simulated data sets and to data from Genetic Analysis Workshop 16 (GAW16) are described to demonstrate its performance. In the two simulated examples, as expected, no violations of the accuracy bounds arise in the linear example, while some of the residual variances fall outside the accuracy bounds in the nonlinear example. For the real data, at least one of the residual variances falls outside the accuracy bounds, implying that a nonlinear PCA model is required for this data set. These results show that the new nonlinearity measure is effective in detecting the type of relationship between variables in a given data set. With this measure, we can choose a more suitable model to make optimal use of all the information available in the data set.
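As a rough illustration of the linear PCA step that Chapter 1 starts from, the sketch below computes principal component scores for a block of SNPs and the fraction of variance left unexplained after keeping a few components. It is not the thesis's nonlinearity measure, whose accuracy bounds are not given in this abstract. The function name `pca_scores`, the 0/1/2 genotype coding, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def pca_scores(genotypes, n_components):
    """Return PC scores and the fraction of variance left unexplained."""
    X = np.asarray(genotypes, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each SNP column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    residual = 1.0 - np.sum(s[:n_components] ** 2) / np.sum(s ** 2)
    return scores, residual

# Hypothetical block of 4 SNPs in strong linkage disequilibrium:
# three noisy copies of one base SNP (about 5% of entries re-drawn).
rng = np.random.default_rng(0)
base = rng.integers(0, 3, size=200)              # genotypes coded 0/1/2
cols = [base]
for _ in range(3):
    redraw = rng.random(200) < 0.05
    cols.append(np.where(redraw, rng.integers(0, 3, size=200), base))
block = np.column_stack(cols)

scores, residual = pca_scores(block, n_components=1)
print(scores.shape, round(residual, 3))
```

When the SNPs in a block are strongly and linearly correlated, one or two components leave little residual variance; a residual that stays large is only a loose hint that a linear model may be inadequate.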
Chapter 2: Gene- or region-based association study via kernel principal component analysis

For linear data, PCA-based methods are the better choice for the subsequent association study, while nonlinear approaches should be applied to nonlinear data. Among the nonlinear extensions of PCA, kernel PCA (KPCA) is the best known and most widely adopted. In this study, we proposed combining KPCA with a logistic regression test (LRT) to detect the association between multiple SNPs in a candidate gene or genome region and diseases or traits. The algorithm first conducts KPCA to account for between-SNP relationships in a candidate region, and then applies the LRT to test the association between kernel principal component (KPC) scores and disease. Simulation results showed that KPCA-LRT was always more powerful than principal component analysis combined with the logistic regression test (PCA-LRT) across different sample sizes, significance levels, and relative risks, especially at the gene-wide level (1E-5) and at lower relative risks (RR=1.2, 1.3). Application to the four regions of the rheumatoid arthritis (RA) data from Genetic Analysis Workshop 16 (GAW16) indicated that KPCA-LRT performed better than the single-locus test and PCA-LRT. KPCA-LRT is a valid and powerful gene- or region-based method for the analysis of GWAS data sets, especially under lower relative risks and stricter significance levels.

Chapter 3: Exhaustive sliding-window scan approach for genome-wide association study via PCA-based logistic model

The gene- or region-based approaches mentioned above, including our newly proposed KPCA-based method, improve our understanding of the genetic basis of complex diseases. However, all of these approaches can handle only a gene or genome region of several to tens of markers. For a large number of SNPs across a candidate region or the human genome, the performance of these methods is not satisfactory.
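The two-step KPCA-LRT procedure described in Chapter 2 (kernel PCA on the SNPs of a region, then a logistic regression test on the KPC scores) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation: the RBF kernel, the value of `gamma`, the use of two components, and the reading of the logistic regression test as a likelihood-ratio comparison against an intercept-only model are all assumptions, and every function name is hypothetical.

```python
import math
import numpy as np

def rbf_kernel_pca(X, n_components, gamma):
    """Kernel PCA scores of the samples in X under an RBF kernel (assumed)."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared distances
    K = np.exp(-gamma * np.maximum(d2, 0.0))
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one          # center in feature space
    vals, vecs = np.linalg.eigh(Kc)                     # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]         # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def logistic_loglik(Z, y, n_iter=50, ridge=1e-8):
    """Maximized log-likelihood of a logistic model of y on Z plus intercept."""
    A = np.column_stack([np.ones(len(y)), Z])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):                             # Newton-Raphson updates
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))
        H = A.T @ (A * (p * (1.0 - p))[:, None]) + ridge * np.eye(A.shape[1])
        step = np.linalg.solve(H, A.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-(A @ beta))), 1e-12, 1.0 - 1e-12)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def kpca_lrt(genotypes, y, n_components=2, gamma=0.1):
    """Likelihood-ratio statistic for the KPC scores; df equals n_components."""
    y = np.asarray(y, dtype=float)
    scores = rbf_kernel_pca(genotypes, n_components, gamma)
    ll_full = logistic_loglik(scores, y)
    ll_null = logistic_loglik(np.empty((len(y), 0)), y)  # intercept only
    return max(0.0, 2.0 * (ll_full - ll_null)), n_components

# Hypothetical region of 5 SNPs (coded 0/1/2) in which the first SNP raises risk.
rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(300, 5))
risk = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * G[:, 0])))
y = (rng.random(300) < risk).astype(float)

stat, df = kpca_lrt(G, y)
p_value = math.exp(-stat / 2.0)   # exact chi-square tail only because df == 2
print(stat, df, p_value)
```

The closed-form p-value holds only for two degrees of freedom; with a different number of components, a general chi-square tail function would be needed.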
In recent years, sliding-window methods, in which several neighboring SNPs are grouped together in a "window", have become a popular strategy for automated GWAS data analysis. In these approaches, the candidate region or the whole genome is divided into many contiguous overlapping windows, and gene- or region-based multi-locus association methods are then applied in each window. A sliding-window approach can be implemented with a fixed window size or with variable sizes. However, it is uncertain whether window sizes set in advance or chosen by specific methods are statistically sufficient to achieve optimal detection power. Lin et al. proposed that an exhaustive search of all possible windows of SNPs at the genome level is not only computationally practical but also statistically sufficient to detect common or rare genetic-risk alleles. With the development and extensive application of multiprocessor and multithreading computational techniques, "exhaustive" methods have become more feasible in practice. In the present study, under the framework of "exhaustive" search, we first conducted simulations to assess statistical power with different window sizes, and then evaluated performance on real data to test whether the exhaustive strategy can be extended to GWAS data analysis. Results from both simulation and real data analysis indicated that the powers and p-values obtained with different window sizes were quite different. Furthermore, combined with cluster computing, the proposed exhaustive strategy is computationally efficient and feasible for analyzing GWAS data, so it should be more widely adopted in GWAS data analysis.
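To make the size of an exhaustive scan concrete, the sketch below enumerates every contiguous window of SNPs in a region and applies a per-window test to each. The function names and the placeholder test are hypothetical; in the setting of Chapter 3, the per-window test would be a region-based association test such as the PCA-based logistic model.

```python
def all_windows(n_snps, max_size=None):
    """Yield (start, end) index pairs, end exclusive, for every contiguous window."""
    if max_size is None:
        max_size = n_snps
    for size in range(1, min(max_size, n_snps) + 1):
        for start in range(0, n_snps - size + 1):
            yield start, start + size

def exhaustive_scan(n_snps, window_test, max_size=None):
    """Apply window_test(start, end) to every window and collect the results."""
    return {w: window_test(*w) for w in all_windows(n_snps, max_size)}

# With 5 SNPs there are 5 + 4 + 3 + 2 + 1 = 15 windows in total.
windows = list(all_windows(5))
print(len(windows))  # 15

# Placeholder test that just returns the window length (illustration only).
results = exhaustive_scan(5, lambda start, end: end - start, max_size=3)
print(len(results))  # 12
```

For m SNPs the number of windows is m(m+1)/2, so the work grows quadratically with region length. Because the windows can be tested independently of one another, the scan distributes naturally across processors, consistent with the abstract's emphasis on multiprocessor and cluster computing.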
Chapter 4: A new gene- or region-based method for detecting gene-gene interactions between two unlinked loci via kernel canonical correlation analysis

For GWAS data sets, it is often of interest to identify SNPs that jointly have an epistatic (interaction) effect on complex diseases. However, most current methods take the SNP as the unit of association, which leads to several well-known limitations such as multiple testing. Under the gene- or region-based framework, our group has previously proposed a gene-based statistic (the CCU statistic) for detecting gene-gene co-association based on canonical correlation analysis (CCA). When the two genes of interest are unlinked, the co-association between them is the same as their interaction effect. The CCU statistic has been shown to perform well in detecting gene-gene co-associations or interactions. However, CCA can detect only linear structure in a data set; if the genomic data contain nonlinear structure, CCA will not be able to detect it. In recent years, kernel CCA (KCCA), a generalization of CCA, has been studied intensively in machine learning, face recognition, and data classification, and has been reported to succeed in many applications. We therefore proposed using KCCA rather than CCA to construct a revised version of the CCU statistic, the kernel CCU (KCCU) statistic, for detecting gene-gene interaction in association studies. Simulation results showed that the power of the KCCU statistic was higher than that of the CCU statistic at all the given significance levels, sample sizes, and relative risks. Application to the RA data in GAW16 Problem 1 showed that the CCU statistic detected only the interaction between the PTPN22 and C5 genes, while the KCCU statistic identified all the pairwise interactions among the four genes. In summary, the KCCU statistic performed better than the CCU statistic.
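The linear building block behind Chapter 4, canonical correlation between the SNP blocks of two genes, can be sketched as follows. This is ordinary linear CCA only. It is not the CCU or KCCU statistic, whose formulas are not given in this abstract, and it does not implement the kernel extension. The function name, the simulated genotypes, and the assumption that both centered blocks have full column rank are made for illustration.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two blocks (assumes full column rank)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))   # orthonormal basis of centered X
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))   # orthonormal basis of centered Y
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)                # sorted from largest to smallest

# Hypothetical genes: gene A has 3 SNPs, gene B has 2 SNPs (coded 0/1/2).
# The first SNP of gene B copies the first SNP of gene A in about 80% of samples.
rng = np.random.default_rng(2)
gene_a = rng.integers(0, 3, size=(250, 3))
copy = rng.random(250) < 0.8
b0 = np.where(copy, gene_a[:, 0], rng.integers(0, 3, size=250))
gene_b = np.column_stack([b0, rng.integers(0, 3, size=250)])

rho = canonical_correlations(gene_a, gene_b)
print(rho)
```

For two single-SNP blocks, the first canonical correlation reduces to the absolute Pearson correlation, which gives a quick sanity check. A kernel version would replace the centered genotype blocks with centered kernel matrices and would typically need regularization.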


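Thesis record: 高青松 (Gao Qingsong); Statistical Inference Methods for Genome-Wide Region-Based Association Analysis Using Data Dimension-Reduction Techniques [D] (master's thesis); Shandong University; 2011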