Title: REALoc: Reliable and effective methods to assist predicting human protein subcellular localization
Author: Sun, Han-Hao (孫翰豪)
Keywords: human protein
subcellular localization
machine learning
Publisher: Institute of Genomics and Bioinformatics (基因體暨生物資訊學研究所)
Abstract: Protein subcellular localization is an important part of biological research: localization information supports both drug development and the study of protein function. Many subcellular localization prediction tools have been developed, but most were trained on eukaryotic or prokaryotic data in general, and predictors dedicated to human proteins are rare. We developed REALoc, a system that predicts the subcellular localization of both singleplex (single-location) and multiplex (multiple-location) human proteins. It has a two-layer architecture that integrates two different machine learning approaches, one-to-one and many-to-many. The system uses many sequence-based and function-based features. Besides amino acid composition and surface accessibility, these include features we developed: weighted sign AAindex, a sequence similarity profile, and Gene Ontology information selected by regular-mRMR feature selection. REALoc predicts six location sites (cell membrane, cytoplasm, endoplasmic reticulum/Golgi apparatus, mitochondrion, nucleus, and secreted/extracellular) and was compared with four related prediction web servers. Five-fold cross-validation was performed with iLoc-Hum on the training dataset; the overall absolute true success rate of REALoc is 75.34%, and on the independent testing dataset it is 57.14%, about 10% or more above the other four prediction systems. Finally, we analyzed the performance of the two decision mechanisms, Vote and GANN, in predicting single and multiple locations, and used motifs to investigate the relationship between protein-protein interaction and subcellular localization.
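Two quantities named in the abstract can be illustrated with a minimal sketch. It assumes the usual textbook definitions: amino acid composition as the 20 relative residue frequencies of a sequence, and the absolute true success rate as the fraction of proteins whose predicted set of locations matches the annotated set exactly. The function names and the example data are hypothetical, and the thesis may define or compute these differently.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order (an assumption of this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard residue, in AMINO_ACIDS order."""
    counts = Counter(sequence.upper())
    # Non-standard symbols (e.g. X) are ignored when normalizing.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def absolute_true_rate(true_sets, predicted_sets):
    """Fraction of proteins whose predicted location set equals the true set exactly."""
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true_sets and predicted_sets must have the same length")
    if not true_sets:
        raise ValueError("at least one protein is required")
    exact = sum(1 for t, p in zip(true_sets, predicted_sets) if set(t) == set(p))
    return exact / len(true_sets)


# Hypothetical example: the second (multiplex) protein is only partly predicted,
# so it does not count as a success under the exact-match criterion.
example_true = [{"Nucleus"}, {"Cytoplasm", "Nucleus"}]
example_pred = [{"Nucleus"}, {"Cytoplasm"}]
example_rate = absolute_true_rate(example_true, example_pred)
```

Because a multiplex protein counts as correct only when every location is predicted and no extra one is added, this metric is stricter than per-label accuracy.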
Other identifiers: U0005-0808201302161600
Appears in Collections: Institute of Genomics and Bioinformatics (基因體暨生物資訊學研究所)




Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.