Chin. Phys. B, 2025, Vol. 34(1): 018901    DOI: 10.1088/1674-1056/ad8db2
SPECIAL TOPIC: Computational programs in complex systems

Accurate prediction of essential proteins using ensemble machine learning

Dezhi Lu(鲁德志)1,†, Hao Wu(吴淏)1,†, Yutong Hou(侯俞彤)2, Yuncheng Wu(吴云成)3, Yuanyuan Liu(刘媛媛)1,‡, and Jinwu Wang(王金武)1,2,§
1 School of Medicine, Shanghai University, Shanghai 200444, China;
2 Shanghai Key Laboratory of Orthopaedic Implants, Department of Orthopaedic Surgery, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200011, China;
3 University of Shanghai for Science and Technology, Shanghai 200093, China
Abstract  Essential proteins are crucial for biological processes and can be identified through both experimental and computational methods. Experimental approaches are highly accurate, but they often demand extensive time and resources. To address these challenges, we present a computational ensemble learning framework designed to identify essential proteins more efficiently. Our method first uses node2vec to embed the proteins in the protein-protein interaction (PPI) network as continuous, low-dimensional vectors. We also extract a range of features from protein sequences, including graph-theory-based, information-based, compositional, and physicochemical attributes. In addition, we apply deep learning to high-dimensional position-specific scoring matrices (PSSMs) to capture evolutionary information. We then combine these features and classify them with several machine learning algorithms. To enhance performance, we integrate the outputs of these algorithms through ensemble methods such as voting, weighted averaging, and stacking. This approach addresses data imbalance and improves both robustness and accuracy. Our ensemble learning framework achieves an AUC of 0.960 and an accuracy of 0.9252, outperforming other computational methods. These results demonstrate the effectiveness of our approach in accurately identifying essential proteins and highlight its superior feature extraction capabilities.
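The ensemble step named in the abstract (voting and weighted averaging over the outputs of several classifiers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `hard_vote` and `weighted_average`, the protein labels, the per-model probabilities, and the weights are all hypothetical values chosen for illustration, and only the Python standard library is assumed.

```python
from typing import Dict, List, Sequence


def hard_vote(probs: Sequence[float], threshold: float = 0.5) -> int:
    """Majority vote over per-model binary decisions (1 = essential).

    Each model votes 1 if its probability reaches the threshold.
    A tie is resolved as 0 (non-essential).
    """
    votes = sum(1 for p in probs if p >= threshold)
    return 1 if votes * 2 > len(probs) else 0


def weighted_average(probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-model probabilities, normalised by the weight sum."""
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probs, weights)) / total


# Hypothetical outputs of three base classifiers for four proteins
# (illustrative numbers only, not results from the paper).
model_probs: Dict[str, List[float]] = {
    "P1": [0.90, 0.80, 0.40],
    "P2": [0.20, 0.60, 0.30],
    "P3": [0.55, 0.45, 0.70],
    "P4": [0.10, 0.20, 0.05],
}
# Hypothetical weights, e.g. proportional to each model's validation score.
weights = [0.5, 0.3, 0.2]

hard = {name: hard_vote(p) for name, p in model_probs.items()}
soft = {name: weighted_average(p, weights) for name, p in model_probs.items()}

for name in model_probs:
    print(f"{name}: vote={hard[name]} weighted_prob={soft[name]:.2f}")
```

Stacking, the third method listed, is not shown here: it would train a separate meta-classifier on the base models' outputs instead of combining them with a fixed rule.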
Keywords:  protein-protein interaction (PPI), essential proteins, deep learning, ensemble learning
Received:  15 September 2024      Revised:  21 October 2024      Accepted manuscript online:  01 November 2024
PACS:  89.75.-k (Complex systems)  
Fund: This work was financially supported by the National Key R&D Program of China (Grant No. 2022YFF1202600), the National Natural Science Foundation of China (Grant No. 82301158), the Science and Technology Innovation Action Plan of Shanghai Science and Technology Committee (Grant No. 22015820100), Two-hundred Talent Support (Grant No. 20152224), the Translational Medicine Innovation Project of Shanghai Jiao Tong University School of Medicine (Grant No. TM201915), the Clinical Research Project of Multi-Disciplinary Team, Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine (Grant No. 201914), and the China Postdoctoral Science Foundation (Grant No. 2023M742332).
Corresponding Authors:  Yuanyuan Liu, Jinwu Wang     E-mail:  yuanyuan_liu@shu.edu.cn; wangjw@shsmu.edu.cn

Cite this article: 

Dezhi Lu(鲁德志), Hao Wu(吴淏), Yutong Hou(侯俞彤), Yuncheng Wu(吴云成), Yuanyuan Liu(刘媛媛), and Jinwu Wang(王金武). Accurate prediction of essential proteins using ensemble machine learning. Chin. Phys. B, 2025, 34(1): 018901.
