Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/36782
標題: 利用適應性遺傳演算法結合k最近鄰法於基因表現資料進行基因篩選與樣本分類之研究
The Study on Gene Selection and Sample Classification Based on Gene Expression Data Using Adaptive Genetic Algorithms / k-Nearest Neighbors Method
作者: 李建邦
Lee, Chien-Pang
關鍵字: Microarray
微陣列
Gene Selection
Genetic Algorithms
k-Nearest Neighbors
基因篩選
遺傳演算法
k最近鄰法
出版社: 農藝學系所
引用: 林汶鑫、郭寶錚、何兆銓。2003。遺傳演算法的理論與應用。科學農業51(7,8) : 186-195。 林汶鑫。2003。遺傳演算法應用在雙線性模式之研究。碩士論文。台中。國立中興大學農藝學系碩士班。 李欣怡。2005。Affymetrix高密度寡具核苷酸晶片試驗統計分析方法之比較。碩士論文。台北。國立台灣大學農藝研究所生物統計組。 馬立人、蔣中華。2003。生物晶片 二版。九州圖書文物有限公司。 陳奕仁。2001。適應性基因演算法結合精英政策於線性馬達定位機台之主動式振動控制器設計。碩士論文。高雄。國立中山大學機械工程研究所。 陳智豪。2002。cDNA生物晶片資料的統計分析方法之研究。碩士論文。台北。國立台灣大學農藝研究所生物統計組。 盧建銘。2004。不同變數選取法在近紅外光光譜資料上之比較。碩士論文。台中。國立中興大學農藝學系碩士班。 Alizadeh, A. A., M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson Jr, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, R. Levy, W. Wilson, M. R. Grever, H. C. Byrd, D. Botstein, P. O. Brown and L. M. Staudt. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403(6769): 503 - 511. Alon, U., N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack and A. J. Levine. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probe by oligonucleotide arrays. Proc. Natl. Acad. Sci. 96: 6745 - 6750. Bozinov, D. and J. Rahnenführer. 2002. Unsupervised technique for robust target separation and analysis of DNA microarray spots through adaptive pixel clustering. Bioinformatics 18(5): 747 - 756. Callow, M. J., S. Dudoit, E. L. Gong, T. P. Speed and E. M. Rubin. 2000. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res. 10: 2022 - 2029. Chen, Y., E. R. Dougherty and M. L. Bittner. 1997. Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Opt. 2(4): 364 - 374. Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74: 829 - 836. De Jong, K. A. 1975. An analysis of the behavior of a class of genetic adaptive system. Ph. D. Dissertation, Department of Computer and Communication Sciences. University of Michigan. Dobbin, K. and R. Simon. 2002.Comparison of microarray designs for class comparison and class discovery. Bioinf. 18: 1438-1445. Duggan, D. J., M. Bittner, Y. Chen, P. Meltzer and J. Trent. 1999. Expression profiling using cDNA microarrays. Nature Genet. 21: 10-14. Dudoit, S., Y. H. Yang, M. J. Callow and T. P. Speed. 2002. Statistical method for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12: 111 - 139. Efron, B., R. Tibshirani, V. Goss and G. Chu. 2000. Microarrays and their use in a comparative experiment. Tech. report. Department of Biochemistry, Stanford University. Enas, G. G. and S. C. Choi. 1986. Choice of the smoothing parameter and efficiency of k-nearest neighbor classification. Comp. & Maths. with Appls. 12A(2): 235 - 244. Fix, E. and J. L. Hodges, 1951. Nonparametric discrimination: consistency properties. Project No. 21-49-004. Report 4. U.S. Air Force School of Aviation Medicine. Randolph Field. TX. Goldberg, D. E. 1985. Optimal initial population size for binary-coded genetic algorithms (TCGA report No.85001) The clearinghouse for genetic algorithms, department of engineering mechanics. Tuscaloosa: University of Alabama. Goldberg, D. E. 1989. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Publishing Company. Gottardo, R., J. Besag, M. Stephens and A. Murua. 2006. Probabilistic segmentation and intensity estimation for microarray images. Biostatistics 7(1): 85 - 99. Holland, J. H. 1975. Adaptation in natural and artificial systems. Ann Arbor: The University of Michigan Press. Ho, S. Y., C. C. Liu and S. Liu. 2002. Design of an optimal nearest neighbor classifier using an intelligent genetic algorithm. Pattern Recogn. Lett. 23: 1495 - 1503. Joachims, T. 1998. Text categorization with support vector machines: learning with many relevant features. Proceedings of the European Conference on Machine Learning. Berlin. 37-142. Kerr, N. K., M. Martin and G. A. Churchill. 2000. Analysis of variance for gene expression microarray data. J. Comput. Biol. 7: 819-837. Kerr, M. K. and G. A. Churchill. 2001. Statistical design and the analysis of gene expression microarray data. Genet. Res. 77: 123-128. Kuncheva, L. I. 1995. Editing for the k-nearest neighbors rule by a genetic algorithm. Pattern Recogn. Lett. 16: 809 - 814. Kuncheva, L. I. and J. C. Bezdek. 1998. Nearest prototype classification: clustering, genetic algorithms or random search. IEEE Trans. Systems Man Cybernet. C. 28(1): 160-164. Kuncheva, L. I. and L. C. Jain. 1999. Nearest neighbor classifier: Simultaneous editing and feature selection. Pattern Recogn. Lett. 20: 1149-1156. Lee, M. L. T., 2004. Analysis of microarray gene expression data. New York. Kluwer Academic Publishers. Li, L., T. A. Darden, C. R. Weinberg, A. J. Levine and L. G. Pedersen. 2001a. Gene Assessment and Sample Classification for Gene Expression Data Using a Genetic Algorithm / k-Nearest Neighbor Method. Comb. Chem. High Throughput Screen. 4: 727 - 739. Li, L., C. R. Weinberg, T. A. Darden and L. G. Pedersen. 2001b. Gene selection for sample classification based on gene expression data : study of sensitivity to choice of parameters of the GA / KNN method. Bioinformatics 17(12): 1131 - 1142. Li, L., D. M. Umbach, P. Terry and J. A. Taylor. 2004. Application of the GA/KNN method to SELDI proteomics data. Bioinformatics 20(10): 1638 - 1640. Liu, D., T. Shi, J. A. DiDonato, J. D. Carpten, J. Zhu and Z. H. Duan. 2004. Application of genetic algorithm/k-nearest neighbor method to the classification of renal cell carcinoma. 2004 IEEE Computational Systems Bioinformatics Conference (CSB''04): 558 - 559. Lönnstedt, I and T. Speed. 2002. Replicated mircoarray data. Statistica Sinica 12: 31 - 46. Raymer, M. L., W. F. Punch, E. D. Goodman, P. C. Sanschagrin and L. A. Kuhn. 1997. Simultaneous Feature Extraction and Selection Using a Masking Genetic Algorithm. In Proceedings of ICGA-97: 561 - 567. Ruan, W., T. C. Giras, Z. Lin and Y. Ou. 2003. ASCAP parameter determination by an intelligent genetic algorithm. Proceedings of the 2003 IEEE/ASME Joint Rail Conference: 133 - 141. Shannon, W., R. Culverhouse and J. Duncan. 2003. Analyzing microarray data using cluster analysis. Pharmacogenomics 4(1): 41 - 51. Simon, R. M., E. L. Korn, L. M. McShane, M. D. Radmacher, G. W. Wright and Y. Zhao. 2003. Design and Analysis of DNA Microarray Investigations. New York. Springer. Smyth, G. K. and T. P. Speed. 2003. Normalization of cDNA microarray data. Methods 31: 265 - 273. Srinivas, M. and L. M. Patnaik. 1994. Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Transactions on System, Man and Cybernetics 24(4): 656 - 666. Tsai, C. A., H. M. Hsueh and J. J. Chen. 2004. A generalized additive model for microarray gene expression data analysis. J Biopharm Stat. 14(3): 553-573. Westfall, P. H. and S. S. Young. 1993. Resampling-based multiple testing: Examples and methods for p-value adjustment. New York: Wiley. Yang, Y. and X. Liu. 1999.A re-examination of text categorization method. Proceedings of the 22nd Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval: 42-49 Yang, Y. H., S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai and T. P. Speed. 2002. Normalization for cDNA microarray data : a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30(4): e15.1 - e15.10. Yao, L. M. IEEE and W. A. Sethares. 1994. Nonlinear parameter estimation via the genetic algorithm. IEEE transactions on signal processing 42(4): 927-935
摘要: 近年來微陣列技術(microarray technology)已成為研究基因表現的重要工具之一。而和過去研究基因表現方法最大的差異在於,微陣列技術可同時偵測數以千計基因的表現情形。過去研究者常藉母數統計方法試圖搜尋出具有顯著差異表現的基因,但微陣列資料常無法滿足母數統計方法中的前提假設,且對每一基因單獨進行顯著性的檢定時,可能會遇到第一型錯誤(type I error)過度膨脹的問題。因此為了解決上述的問題,本研究期望能利用不須前提假設的變數選取方法來降低基因維度,再經由生物學家尋找出真正重要的基因。 本研究根據Li et al.(2001a)所提出的遺傳演算法(genetic algorithms;GA)結合k最近鄰法(k-nearest neighbors;KNN)(GA / KNN法)為基礎並進行修正,另行利用適應性遺傳演算法(adaptive genetic algorithms;AGA)結合k最近鄰法(AGA / KNN法)縮減微陣列資料的範圍。AGA與KNN都是發展已久的方法,但本研究卻是首度結合兩者以應用於分析微陣列資料。由於適應性遺傳演算法是一種機器學習的搜尋工具而k最近鄰法為無母數的判別分析方法,因此在使用時並不須受限於一些前提假設。AGA / KNN法和GA / KNN法最大的不同處為:編碼方式改為二進位編碼,每一數串包含了全部的基因,並加入機率適應性交換及突變機制(adaptive probabilities of crossover and mutation)與汰換政策(extinction and immigration strategy)。由於遺傳演算法的特色為能找出近似最適解,而每次所產生的最佳數串通常都不一樣。因此AGA / KNN法利用重複抽樣的觀念,反覆執行AGA / KNN法,先找出眾多表現較佳的數串,再根據基因累積出現次數進行排序,依此排序對基因進行縮減。 本研究中先以AGA / KNN法應用於Alon et al.(1999)對結腸癌病患所進行的寡核苷酸晶片資料進行實例研究。再以Callow et al.(2000)研究老鼠apo AI基因之cDNA晶片資料比較AGA / KNN法與GA / KNN法選取基因的能力。根據分析結果,AGA / KNN與GA / KNN都能縮減基因的範圍,並可對樣本進行正確分類。但AGA / KNN法正確篩選基因的能力優於GA / KNN法,且AGA / KNN法所消耗CPU時間僅有GA / KNN法的一半。因此若依據本研究所分析的資料,AGA / KNN法整體表現應不劣於GA / KNN法。最後經本研究的分析與比較後,建議當使用AGA / KNN法時,先大約進行100次試行(runs)後,再選取累積出現次數最高的前50至前100個基因,所搜尋出的基因應可涵蓋重要的基因,並可對樣本進行正確的分類。
Microarray technology has become a valuable tool for studying gene expression in recent years. The main difference between microarray and traditional methods is that microarray can measure thousands of genes at the same time. In the past, researchers always used parametric statistical methods to find the significant genes. However, microarray data often cannot obey some assumptions of parametric statistical methods, and type I error would be over expanded while each gene was tested for significance. Therefore, this research was expected to find a variable selection method without assumptions restriction to reduce the dimension of the data set. After using the proposed method, biologists can select the relevant genes according to the sub-gene set. In this study, adaptive genetic algorithms / k-nearest neighbors (AGA / KNN) was used to reduce the dimension of the data set, and it was based on genetic algorithms / k-nearest neighbors (GA / KNN) which was first described by Li et al.(2001a). Although AGA and KNN were well-developed, AGA / KNN was first used to analyze the microarray data. Since AGA was a machine learning tool and KNN was a nonparametric discrimination analysis, both of them could be used without assumptions restriction. There are three main differences between AGA/KNN and GA / KNN. Firstly, the encoding has become binary code, and each string included all genes. Secondly, the adaptive probabilities of crossover and mutation were added. Finally, the extinction and immigration strategy was added. Since GA can just find the near optimal solution, the best string of each run is often not the same. Here, AGA / KNN was repeated by many runs to solve that problem. Thus, lots of the best strings were saved. The frequency of gene was computed by those strings to reduce the dimension of the data set. In this study, an original colon data which is a high-density oligonucleotide chip (Alon et al., 1999) was analyzed. In addition, mice apo AI data which is a cDNA chip (Callow et al., 2000) was also used to compare the ability of gene selection of AGA / KNN and GA / KNN. Based on the results, it was found that AGA / KNN and GA / KNN could reduce the dimension of the data set and all samples could be classified correctly. But the accuracy of AGA / KNN was higher than that of GA / KNN, and it only took half CPU time of GA / KNN. Therefore, it was claimed that the performance of AGA / KNN should not be worse than that of GA / KNN. Finally, we suggested that when AGA / KNN was employed to analyze the microarray data, the top 50 and up to 100 most frequent genes were selected after AGA / KNN were repeated about 100 runs. Those selected genes should include relevant genes, and those selected genes could classify sample correctly.
URI: http://hdl.handle.net/11455/36782
其他識別: U0005-1108200614481000
文章連結: http://www.airitilibrary.com/Publication/alDetailedMesh1?DocID=U0005-1108200614481000
Appears in Collections:農藝學系

文件中的檔案:

取得全文請前往華藝線上圖書館



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.