Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/97778
Title: Prediction of protein quaternary structural attributes through hybrid feature encoding method by using machine learning approach
Author: Yu-Nan Liu (劉猷楠)
Keywords: protein quaternary structural prediction
machine learning
support vector machine
dipeptide feature composition
hybrid feature encoding method
Abstract: The quaternary structure of a protein is closely tied to its biological function, so predicting the subunit composition of proteins is an important task in computational biology and proteomics. However, most existing methods consider neither the integration of heterogeneous encodings nor the accuracy on subunit classes with few training examples. To address this, we propose QUATgo, a prediction tool that can also handle protein oligomers with more than 12 subunits. QUATgo uses three kinds of sequence encoding: dipeptide composition, applied here to the prediction of protein quaternary structural attributes for the first time; protein half-life characteristics (Half Life Prediction); and a modified version of the previously proposed Functional Domain Composition encoding, changed to solve the problem of very large feature vectors.
QUATgo uses a two-stage architecture to address the shortage of data for individual subunit classes, and the predictive accuracy of the classifiers is evaluated throughout with 10-fold cross-validation. The first-stage model uses a random forest algorithm to assign sequences to sixteen classes covering homo-oligomers, hetero-oligomers, and monomers; its accuracy is 63.4%. Because the training data for hetero-10mers are insufficient, the hetero-10mers and the hetero-oligomers with more than 12 subunits are treated as a single class X. If the first-stage classifier outputs class X, the sequence is sent to a second-stage classifier built with support vector machines, which distinguishes hetero-10mers from hetero-oligomers with more than 12 subunits with an accuracy of 97.5%. Overall, QUATgo achieves 61.4% cross-validation accuracy and 63.4% independent-test accuracy. In a case study, QUATgo accurately predicts the variable complex structure of the MERS-CoV ectodomains.
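As an illustration of two ideas named in the abstract, the following is a minimal sketch of a 400-dimensional dipeptide composition encoding and of routing a merged class X to a second classifier. It is not the thesis code. The function names, the normalization by the number of valid residue pairs, and the skipping of non-standard residues are assumptions of this sketch, and the two stage classifiers are stand-in callables rather than the random forest and support vector machine models described above.

```python
from itertools import product
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence: str) -> List[float]:
    """Return the fraction of adjacent residue pairs equal to each dipeptide.

    Assumption of this sketch: pairs containing a non-standard letter
    (for example 'X') are skipped and do not count toward the total.
    """
    seq = sequence.upper()
    counts = [0] * len(DIPEPTIDES)
    total = 0
    for a, b in zip(seq, seq[1:]):  # sliding window over adjacent residues
        idx = INDEX.get(a + b)
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [c / total for c in counts]


def two_stage_predict(
    features: List[float],
    stage1: Callable[[List[float]], str],
    stage2: Callable[[List[float]], str],
    merged_label: str = "X",
) -> str:
    """Run stage 1; only if it returns the merged class, defer to stage 2."""
    label = stage1(features)
    if label == merged_label:
        return stage2(features)
    return label


if __name__ == "__main__":
    vec = dipeptide_composition("ACAC")  # pairs: AC, CA, AC
    print(len(vec), vec[INDEX["AC"]], vec[INDEX["CA"]])
    # Hypothetical stand-in classifiers, used only to show the routing.
    print(two_stage_predict(vec, lambda f: "X", lambda f: "hetero-10mer"))
```

In this design the second classifier only ever sees sequences that the first stage assigned to the merged class, which is how the abstract describes handling the scarce hetero-10mer data.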
URI: http://hdl.handle.net/11455/97778
Date of public release: 2021-08-28
Appears in Collections: Graduate Institute of Biotechnology (生物科技學研究所)

Files in This Item:

For the full text, please visit Airiti Library (華藝線上圖書館).


