Accurate prediction of essential proteins using ensemble machine learning

doi:10.1088/1674-1056/ad8db2

Chinese Physics B, 2025, Vol. 34, Issue (1): 018901. doi: 10.1088/1674-1056/ad8db2

Special topic: Computational programs in complex systems

Dezhi Lu (鲁德志)^1,†, Hao Wu (吴淏)^1,†, Yutong Hou (侯俞彤)^2, Yuncheng Wu (吴云成)^3, Yuanyuan Liu (刘媛媛)^1,‡, and Jinwu Wang (王金武)^1,2,§

1 School of Medicine, Shanghai University, Shanghai 200444, China;
2 Shanghai Key Laboratory of Orthopaedic Implants, Department of Orthopaedic Surgery, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200011, China;
3 University of Shanghai for Science and Technology, Shanghai 200093, China

Received: 2024-09-15 Revised: 2024-10-21 Accepted: 2024-11-01 Online: 2025-01-15 Published: 2024-12-06
Contact: Yuanyuan Liu, Jinwu Wang. E-mail: yuanyuan_liu@shu.edu.cn; wangjw@shsmu.edu.cn
Supported by:
This work was financially supported by the National Key R&D Program of China (Grant No. 2022YFF1202600), the National Natural Science Foundation of China (Grant No. 82301158), the Science and Technology Innovation Action Plan of Shanghai Science and Technology Committee (Grant No. 22015820100), Two-hundred Talent Support (Grant No. 20152224), the Translational Medicine Innovation Project of Shanghai Jiao Tong University School of Medicine (Grant No. TM201915), the Clinical Research Project of Multi-Disciplinary Team, Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine (Grant No. 201914), and the China Postdoctoral Science Foundation (Grant No. 2023M742332).

Abstract

Essential proteins are crucial for biological processes and can be identified through both experimental and computational methods. While experimental approaches are highly accurate, they often demand extensive time and resources. To address these challenges, we present a computational ensemble learning framework designed to identify essential proteins more efficiently. Our method begins by using node2vec to transform proteins in the protein-protein interaction (PPI) network into continuous, low-dimensional vectors. We also extract a range of features from protein sequences, including graph-theory-based, information-based, compositional, and physicochemical attributes. Additionally, we use deep learning techniques to analyze high-dimensional position-specific scoring matrices (PSSMs) and capture evolutionary information. We then combine these features for classification using various machine learning algorithms. To enhance performance, we integrate the outputs of these algorithms through ensemble methods such as voting, weighted averaging, and stacking. This approach effectively addresses data imbalance and improves both robustness and accuracy. Our ensemble learning framework achieves an AUC of 0.960 and an accuracy of 0.9252, outperforming other computational methods. These results demonstrate the effectiveness of our approach in accurately identifying essential proteins and highlight its superior feature extraction capabilities.

Key words: protein-protein interaction (PPI), essential proteins, deep learning, ensemble learning
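
The three ensemble strategies named in the abstract (voting, weighted averaging, and stacking) can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration: the toy probabilities, labels, and weights are invented rather than taken from the paper, the function names are hypothetical, and a small logistic-regression meta-learner trained by gradient descent stands in for whatever stacking model the authors actually used.

```python
import math

# Toy inputs (invented): predicted probabilities of "essential" from three
# hypothetical base classifiers for six proteins, plus the true labels.
LABELS = [1, 1, 0, 0, 1, 0]
MODEL_PROBS = [
    [0.9, 0.6, 0.2, 0.4, 0.7, 0.1],  # base model A
    [0.8, 0.4, 0.3, 0.6, 0.6, 0.2],  # base model B
    [0.7, 0.7, 0.4, 0.3, 0.4, 0.3],  # base model C
]


def hard_vote(model_probs, threshold=0.5):
    """Majority voting: each model casts one binary vote per protein."""
    n_models = len(model_probs)
    preds = []
    for i in range(len(model_probs[0])):
        votes = sum(1 for probs in model_probs if probs[i] >= threshold)
        preds.append(1 if 2 * votes > n_models else 0)
    return preds


def weighted_average(model_probs, weights):
    """Weighted averaging: combine probabilities, trusting some models more."""
    total = sum(weights)
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total
        for i in range(len(model_probs[0]))
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def stack_predict(model_probs, coef, bias):
    """Stacking: a meta-learner maps base-model probabilities to a final score."""
    return [
        sigmoid(bias + sum(c * probs[i] for c, probs in zip(coef, model_probs)))
        for i in range(len(model_probs[0]))
    ]


def log_loss(labels, probs):
    """Mean binary cross-entropy of predicted probabilities."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(labels, probs)
    ) / len(labels)


def fit_stacker(model_probs, labels, lr=0.5, epochs=2000):
    """Fit the logistic-regression meta-learner with batch gradient descent.

    In practice the meta-learner should be fit on out-of-fold predictions;
    fitting on the same toy data here only keeps the sketch short.
    """
    n = len(labels)
    coef, bias = [0.0] * len(model_probs), 0.0
    for _ in range(epochs):
        preds = stack_predict(model_probs, coef, bias)
        errors = [p - y for p, y in zip(preds, labels)]
        coef = [
            c - lr * sum(e * x for e, x in zip(errors, probs)) / n
            for c, probs in zip(coef, model_probs)
        ]
        bias -= lr * sum(errors) / n
    return coef, bias


print(hard_vote(MODEL_PROBS))                    # [1, 1, 0, 0, 1, 0]
print(weighted_average(MODEL_PROBS, [2, 1, 1]))  # model A weighted twice
coef, bias = fit_stacker(MODEL_PROBS, LABELS)
print(stack_predict(MODEL_PROBS, coef, bias))
```

Voting discards the confidence of each model, weighted averaging keeps it but needs the weights chosen by hand or by validation, and stacking learns the combination from data at the cost of an extra training step.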

PACS number: 89.75.-k (Complex systems)

Dezhi Lu (鲁德志), Hao Wu (吴淏), Yutong Hou (侯俞彤), Yuncheng Wu (吴云成), Yuanyuan Liu (刘媛媛), and Jinwu Wang (王金武). Accurate prediction of essential proteins using ensemble machine learning[J]. Chinese Physics B, 2025, 34(1): 018901.
