Chinese Physics B ›› 2025, Vol. 34 ›› Issue (1): 018901-018901. doi: 10.1088/1674-1056/ad8db2
Dezhi Lu(鲁德志)1,†, Hao Wu(吴淏)1,†, Yutong Hou(侯俞彤)2, Yuncheng Wu(吴云成)3, Yuanyuan Liu(刘媛媛)1,‡, and Jinwu Wang(王金武)1,2,§
Abstract: Essential proteins are crucial for biological processes and can be identified through both experimental and computational methods. While experimental approaches are highly accurate, they often demand extensive time and resources. To address these challenges, we present a computational ensemble learning framework designed to identify essential proteins more efficiently. Our method begins by using node2vec to transform proteins in the protein-protein interaction (PPI) network into continuous, low-dimensional vectors. We also extract a range of features from protein sequences, including graph-theory-based, information-based, compositional, and physicochemical attributes. Additionally, we leverage deep learning techniques to analyze high-dimensional position-specific scoring matrices (PSSMs) and capture evolutionary information. We then combine these features for classification using various machine learning algorithms. To enhance performance, we integrate the outputs of these algorithms through ensemble methods such as voting, weighted averaging, and stacking. This approach effectively addresses data imbalance and improves both robustness and accuracy. Our ensemble learning framework achieves an AUC of 0.948 and an accuracy of 0.9252, outperforming other computational methods. These results demonstrate the effectiveness of our approach in accurately identifying essential proteins and highlight its superior feature extraction capabilities.
Classification number: (Complex systems)
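The ensemble step named in the abstract (voting and weighted averaging over the outputs of several classifiers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three hypothetical base models, their probabilities, and the weights are all assumptions made for illustration, and stacking is omitted because it requires training a meta-classifier.

```python
from typing import List, Sequence


def weighted_average(probs: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Combine per-model probabilities into one score per protein.

    probs[m][i] is the (hypothetical) probability from model m that
    protein i is essential; weights[m] is the trust placed in model m.
    """
    if len(probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_proteins = len(probs[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probs)) / total
        for i in range(n_proteins)
    ]


def majority_vote(probs: Sequence[Sequence[float]],
                  threshold: float = 0.5) -> List[int]:
    """Hard voting: label a protein essential (1) if more than half
    of the models score it at or above the threshold."""
    n_models = len(probs)
    n_proteins = len(probs[0])
    labels = []
    for i in range(n_proteins):
        votes = sum(1 for p in probs if p[i] >= threshold)
        labels.append(1 if 2 * votes > n_models else 0)
    return labels


# Illustrative values only: three assumed base models, four proteins.
model_probs = [
    [0.9, 0.2, 0.60, 0.4],
    [0.8, 0.3, 0.40, 0.7],
    [0.7, 0.1, 0.55, 0.3],
]
weights = [0.5, 0.3, 0.2]  # assumed, e.g. derived from validation scores

scores = weighted_average(model_probs, weights)
labels = majority_vote(model_probs)
```

Weighted averaging keeps a continuous score, which suits threshold-free metrics such as AUC, while hard voting yields class labels directly. In practice the weights would be chosen on held-out data rather than fixed by hand.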