Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/92366
Title: Designing a new feature and imbalanced learning model for improving lysine acetylation sites prediction
Author: Ning Lee (李寧)
Keywords: Lysine acetylation
imbalanced learning model
K-value
Gas
Abstract: Lysine acetylation is a crucial type of protein post-translational modification that is involved in many important cellular processes, such as gene regulation, cytokine signal transduction, DNA replication and repair, protein stability, and metabolic regulation. A system for predicting protein lysine acetylation sites is therefore useful for future acetylation research. In this study, we developed a new prediction system, Kace, based on lysine acetylation samples from multiple species. For training-data processing, a two-layer imbalanced learning model was used, in which all the negative data were distributed across 24 sample datasets. In addition to sequence, structural, and physicochemical features, the sequence composition of lysine was encoded as a feature for the first time. After feature integration, the accuracy of this imbalanced learning model was 3% higher than that of the Bagging sampling method. Kace also used the feature selection tool Gas to select the important feature encodings for training, and 45 classifiers were evaluated, from which Libsvm was chosen as the best-performing algorithm. On an independent test set not used in training, Kace outperformed three existing lysine acetylation prediction websites: EnsemblePail, PHOSIDA, and LAceP. Besides the all-species comparison, we analyzed prediction performance separately for three species categories (human, mouse, and other species) and for different protein families. Kace thus provides researchers with more accurate lysine acetylation prediction, and a freely available web server was constructed at http://predictor.nchu.edu.tw/Kace.
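The abstract states only that the negative data were distributed across 24 sample datasets in a two-layer model; it does not describe how the layers are combined. As a rough illustration of the general idea, the sketch below deals the negatives into 24 disjoint subsets, trains one first-layer model per subset together with all positives, and combines the models by majority vote. The majority vote, the nearest-centroid learner (a stand-in for Libsvm), and all function names are assumptions made for this example, not the thesis's actual implementation.

```python
import random
from collections import Counter

def split_negatives(negatives, n_subsets=24, seed=0):
    """Shuffle the negative samples and deal them into n_subsets disjoint groups."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

def train_centroid(positives, negatives):
    """Toy first-layer learner (assumed stand-in for an SVM): class mean vectors."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return mean(positives), mean(negatives)

def predict_centroid(model, x):
    """Return 1 if x is closer to the positive centroid, else 0."""
    pos_c, neg_c = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos_c))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg_c))
    return 1 if d_pos < d_neg else 0

def train_ensemble(positives, negatives, n_subsets=24, seed=0):
    """First layer: one model per negative subset, each trained with all positives."""
    subsets = split_negatives(negatives, n_subsets, seed)
    return [train_centroid(positives, s) for s in subsets if s]

def predict_ensemble(models, x):
    """Second layer (assumed): majority vote over the first-layer predictions."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return 1 if votes[1] > votes[0] else 0

# Hypothetical toy data: 2 positive and 48 negative two-dimensional feature vectors.
positives = [[2.0, 2.0], [3.0, 3.0]]
negatives = [[-2.0 - 0.01 * i, -2.0] for i in range(48)]
models = train_ensemble(positives, negatives)
label = predict_ensemble(models, [2.5, 2.5])
```

Because each first-layer model sees all positives but only 1/24 of the negatives, every model trains on a far more balanced set than the full data would provide, while no negative sample is discarded.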
URI: http://hdl.handle.net/11455/92366
Other identifiers: U0005-3007201515071500
Public release date: 2018-08-04
Appears in Collections: Institute of Genomics and Bioinformatics

Files in this item:

For the full text, please visit Airiti Library (華藝線上圖書館).



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.