Chinese Physics B, 2025, Vol. 34, Issue (1): 018901. doi: 10.1088/1674-1056/ad8db2

Accurate prediction of essential proteins using ensemble machine learning

Dezhi Lu(鲁德志)1,†, Hao Wu(吴淏)1,†, Yutong Hou(侯俞彤)2, Yuncheng Wu(吴云成)3, Yuanyuan Liu(刘媛媛)1,‡, and Jinwu Wang(王金武)1,2,§   

  1 School of Medicine, Shanghai University, Shanghai 200444, China;
  2 Shanghai Key Laboratory of Orthopaedic Implants, Department of Orthopaedic Surgery, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200011, China;
  3 University of Shanghai for Science and Technology, Shanghai 200093, China
  • Received: 2024-09-15 Revised: 2024-10-21 Accepted: 2024-11-01 Published: 2024-12-06
  • Corresponding authors: Yuanyuan Liu, Jinwu Wang E-mail: yuanyuan_liu@shu.edu.cn; wangjw@shsmu.edu.cn
  • Supported by:
    This work was financially supported by the National Key R&D Program of China (Grant No. 2022YFF1202600), the National Natural Science Foundation of China (Grant No. 82301158), Science and Technology Innovation Action Plan of Shanghai Science and Technology Committee (Grant No. 22015820100), Two-hundred Talent Support (Grant No. 20152224), Translational Medicine Innovation Project of Shanghai Jiao Tong University School of Medicine (Grant No. TM201915), Clinical Research Project of Multi-Disciplinary Team, Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine (Grant No. 201914), and China Postdoctoral Science Foundation (Grant No. 2023M742332).

Abstract: Essential proteins are crucial for biological processes and can be identified through both experimental and computational methods. Experimental approaches are highly accurate but often demand extensive time and resources. To address this, we present a computational ensemble learning framework designed to identify essential proteins more efficiently. Our method begins by using node2vec to transform proteins in the protein-protein interaction (PPI) network into continuous, low-dimensional vectors. We also extract a range of features from protein sequences, including graph-theory-based, information-based, compositional, and physicochemical attributes. Additionally, we apply deep learning techniques to high-dimensional position-specific scoring matrices (PSSMs) to capture evolutionary information. We then combine these features for classification using various machine learning algorithms. To enhance performance, we integrate the outputs of these algorithms through ensemble methods such as voting, weighted averaging, and stacking. This approach addresses data imbalance and improves both robustness and accuracy. Our ensemble learning framework achieves an AUC of 0.948 and an accuracy of 0.9252, outperforming other computational methods. These results demonstrate the effectiveness of our approach in accurately identifying essential proteins and highlight its feature extraction capabilities.

Key words: protein-protein interaction (PPI), essential proteins, deep learning, ensemble learning
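
As an illustration of two of the ensemble strategies named in the abstract (voting and weighted averaging), the following minimal Python sketch combines positive-class probabilities from several base classifiers. The model names (`model_a`, `model_b`, `model_c`), the probability values, the 0.5 threshold, and the weights are hypothetical placeholders chosen for the example; they are not taken from the paper, and the authors' actual implementation may differ. Stacking, which additionally trains a meta-learner on the base-classifier outputs, is not shown.

```python
# Hypothetical positive-class probabilities ("essential") from three base
# classifiers for five proteins. Illustrative values only, not paper results.
base_probs = {
    "model_a": [0.90, 0.40, 0.20, 0.70, 0.55],
    "model_b": [0.80, 0.60, 0.10, 0.40, 0.45],
    "model_c": [0.95, 0.30, 0.30, 0.60, 0.65],
}


def hard_vote(probs_by_model, threshold=0.5):
    """Majority voting: each model casts one vote per protein."""
    columns = list(zip(*probs_by_model.values()))  # one tuple per protein
    n_models = len(probs_by_model)
    votes = [sum(p >= threshold for p in column) for column in columns]
    # A protein is labelled essential (1) if more than half the models agree.
    return [int(2 * v > n_models) for v in votes]


def weighted_average(probs_by_model, weights):
    """Weighted averaging: combine probabilities using per-model weights."""
    names = list(probs_by_model)
    total = sum(weights[name] for name in names)
    n_samples = len(probs_by_model[names[0]])
    return [
        sum(weights[name] * probs_by_model[name][i] for name in names) / total
        for i in range(n_samples)
    ]


if __name__ == "__main__":
    print(hard_vote(base_probs))  # -> [1, 0, 0, 1, 1]
    # Assumed weights, e.g. proportional to each model's validation score.
    weights = {"model_a": 0.2, "model_b": 0.3, "model_c": 0.5}
    print([round(p, 3) for p in weighted_average(base_probs, weights)])
```

Voting discards how confident each model is, whereas weighted averaging keeps the probabilities and lets stronger models contribute more, which is one reason the two can disagree on borderline proteins.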

Classification (PACS): 89.75.-k (Complex systems)