Title: Designing a new feature and imbalanced learning model for improving lysine acetylation site prediction
Abstract: Lysine acetylation is an important type of protein post-translational modification. It is involved in many important cellular processes, such as gene regulation, cytokine signal transduction, DNA replication and repair, protein stability, and metabolic regulation. A prediction system for protein lysine acetylation sites is therefore useful for future research on acetylation. In this study, we developed a new prediction system, Kace, using lysine acetylation samples from multiple species. To handle the training data, we used a two-layer imbalanced learning model in which all the negative data were distributed across 24 sample datasets. In addition to sequence, structural, and physicochemical features, Kace encodes, for the first time, the sequence composition of lysine as a feature. After feature integration, the imbalanced learning model achieved 3% higher accuracy than the Bagging sampling method. Kace then used the feature selection tool Gas to select the most important feature encodings for training. To choose the classification algorithm, we evaluated 45 classifiers and selected LIBSVM, which performed best. On an independent test set not used for training, Kace outperformed three existing lysine acetylation prediction websites: EnsemblePail, PHOSIDA, and LAceP. Beyond the comparison across all species, we also analyzed prediction performance separately for three categories (human, mouse, and other species) and for different protein families. Kace provides researchers with more accurate lysine acetylation prediction, and a web server was constructed and is freely available at
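The negative-data split described in the abstract can be illustrated with a minimal sketch. The abstract states only that all negative data were assigned to 24 sample datasets; the random shuffle, the round-robin assignment, the pairing of each subset with the full positive set, and the names `partition_negatives` and `build_training_sets` are assumptions made for illustration, not the method used in Kace. How the two layers of the model combine the resulting subsets is not specified in the abstract and is not shown here.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def partition_negatives(
    negatives: Sequence[T], n_subsets: int = 24, seed: int = 0
) -> List[List[T]]:
    """Split the negative samples into n_subsets disjoint groups.

    Assumption: samples are shuffled, then dealt round-robin so that
    group sizes differ by at most one. The abstract does not say how
    the 24 datasets were actually formed.
    """
    if n_subsets <= 0:
        raise ValueError("n_subsets must be positive")
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def build_training_sets(
    positives: Sequence[T],
    negatives: Sequence[T],
    n_subsets: int = 24,
    seed: int = 0,
) -> List[Tuple[List[T], List[T]]]:
    """Pair the full positive set with each negative subset (hypothetical scheme)."""
    subsets = partition_negatives(negatives, n_subsets, seed)
    return [(list(positives), subset) for subset in subsets]


if __name__ == "__main__":
    # Toy data: 100 acetylated sites versus 2400 non-acetylated sites.
    pos = [f"pos_{i}" for i in range(100)]
    neg = [f"neg_{i}" for i in range(2400)]
    training_sets = build_training_sets(pos, neg)
    print(len(training_sets), len(training_sets[0][1]))
```

With a split like this, each training set is far less skewed than the full data, which is the usual motivation for partitioning the majority class instead of resampling it with replacement as Bagging does.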
2018-08-04
