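dc.description.abstractA mutation can alter a protein's structure, which in turn affects its function and may even cause disease. In protein engineering, drug design, and industrial optimization, mutations are often introduced to increase protein stability, or to change a protein's properties while maintaining its stability. Many tools exist for predicting protein stability after mutation, but they are often built on different algorithms and features and can produce contradictory predictions, leaving users uncertain when making decisions. This study therefore uses machine learning to integrate the results of 11 prediction tools, adds protein sequence properties encoded as features, and selects the best model through six combinations of feature selection methods, improving model accuracy and reducing the time complexity of model training. The system comprises three modules: the Website Module, the Sequence Module, and the Stand-alone Module. The stand-alone and sequence modules maintain the system's prediction performance when the integrated online tools are unavailable. In the final results, the MCC of the structural classification model rises from 0.547 to 0.708 and the PCC of the regression model reaches 0.697. The sequence model is also more accurate than single-method prediction tools that take structural information as input, with an MCC higher than 0.105. The system thus not only integrates existing prediction tools successfully but also improves on the accuracy of the integrated tools. In the stand-alone test, the MCC of the classification model differs by only 0.019 and the PCC by only 0.04, so the system maintains stable performance even when the integrated online tools are unavailable.zh_TW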
dc.description.abstractMutation of a single amino acid residue may change protein structure, which can affect protein function and lead to disease. Increasing protein stability, or keeping a protein stable while changing its properties, is often a goal in protein engineering, drug design, and industrial optimization. A variety of methods and features have been proposed to predict the stability of protein mutations, but conflicting predictions from different tools can confuse users. Therefore, this study integrates 11 prediction tools with machine learning and adds protein sequence information as features. The best model is selected through six combined feature selection methods to improve accuracy and reduce the time complexity of model training. The system includes three modules: the website module, the sequence module, and the stand-alone module. When the integrated online tools are not working, the stand-alone and sequence modules can maintain prediction accuracy. The MCC (Matthews correlation coefficient) of the structural classification model increases from 0.547 to 0.708, and the regression model reaches a PCC (Pearson correlation coefficient) of 0.697. The sequence model is also more accurate than the prediction tool that takes structural information as input, with an MCC higher than 0.105. The system therefore not only integrates existing predictors successfully but also improves on the accuracy of the integrated tools. In the stand-alone test, the MCC of the classification model differs by only 0.019 and the PCC by only 0.04. Therefore, the system's performance remains stable even when the integrated online tools are not working.en_US
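The two metrics reported in the abstract, MCC for the classification models and PCC for the regression models, can be illustrated with a minimal sketch. It uses only the textbook definitions of both coefficients. The function names, the label convention (1 for the positive class, 0 for the negative class), and the sample values are assumptions made for illustration and are not taken from the thesis.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (assumed 1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By common convention, MCC is reported as 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def pcc(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical labels and stability-change values, for illustration only.
    print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
    print(pcc([-1.2, 0.3, 0.8, -0.5], [-0.9, 0.1, 1.1, -0.2]))
```

MCC ranges from -1 to 1 and accounts for all four cells of the confusion matrix, which makes it a common choice when positive and negative classes are imbalanced. PCC measures the linear agreement between predicted and measured values.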
dc.description.tableofcontentsContent
Acknowledgements
Abstract (Chinese)
Abstract
Content of Figures
Content of Tables
1 Introduction
2 Related Works
  2.1 Cross-validation
  2.2 NetSurfp
  2.3 Weka
  2.4 Support Vector Machine
  2.5 mRMR
  2.6 XGBoost
  2.7 Evaluation classification
3 Materials and Methods
  3.1 Dataset
    3.1.1 Collection of training and testing data
    3.1.2 Data processing
    3.1.3 Definitions of positive and negative data
  3.2 Integration of prediction tools
    3.2.1 Element predictors
  3.3 Feature encoding
    3.3.1 Predictor result features
    3.3.2 Sequence based features
      Binary
      Physicochemical and biochemical properties
    3.3.3 Structure based features
      Relative/Absolute surface accessibility (RSA/ASA)
      Secondary structure (SS)
  3.4 Input module
    3.4.1 Website module (WM)
    3.4.2 Stand-alone module (SAM)
    3.4.3 Sequence module (SM)
  3.5 Feature selection
  3.6 Stand-alone
  3.7 Learning model construction
4 Result and Discussion
  4.1 Comparison of machine learning algorithm
  4.2 Comparison of feature selection
    4.2.1 Evaluation of classifiers for structural model
    4.2.2 Evaluation of classifiers for sequential model
    4.2.3 Performance of regression model for structural
    4.2.4 Performance of regression model for sequence
  4.3 Performance of different threshold
  4.4 Evaluation of stand-alone module
    4.4.1 Performance of stand-alone classification model
    4.4.2 Performance of stand-alone regression model
  4.5 Performance of independent test
    4.5.1 Performance classification model of independent test
    4.5.2 Performance regression model of independent test
  4.6 Case study
  4.7 Performance with different experimental conditions
  4.8 Perspectives
5 Conclusion
6 Reference
7 Supplementary Materialszh_TW
Integrated off-the-shelf Predictor for Protein Stability Changes upon Single Mutation by Various Modules Using Machine Learning
