SPECIAL TOPIC — Computational programs in complex systems |
Accurate prediction of essential proteins using ensemble machine learning |
Dezhi Lu(鲁德志)1,†, Hao Wu(吴淏)1,†, Yutong Hou(侯俞彤)2, Yuncheng Wu(吴云成)3, Yuanyuan Liu(刘媛媛)1,‡, and Jinwu Wang(王金武)1,2,§ |
1 School of Medicine, Shanghai University, Shanghai 200444, China; 2 Shanghai Key Laboratory of Orthopaedic Implants, Department of Orthopaedic Surgery, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200011, China; 3 University of Shanghai for Science and Technology, Shanghai 200093, China |
|
|
Abstract Essential proteins are crucial for biological processes and can be identified through both experimental and computational methods. Experimental approaches are highly accurate, but they often demand extensive time and resources. To address this cost, we present a computational ensemble learning framework designed to identify essential proteins more efficiently. Our method begins by using node2vec to transform proteins in the protein-protein interaction (PPI) network into continuous, low-dimensional vectors. We also extract a range of features from protein sequences, including graph-theory-based, information-based, compositional, and physicochemical attributes. Additionally, we use deep learning techniques to analyze high-dimensional position-specific scoring matrices (PSSMs) and capture evolutionary information. We then combine these features for classification using various machine learning algorithms. To enhance performance, we integrate the outputs of these algorithms through ensemble methods such as voting, weighted averaging, and stacking. This approach addresses data imbalance and improves both robustness and accuracy. Our ensemble learning framework achieves an AUC of 0.948 and an accuracy of 0.9252, outperforming other computational methods. These results demonstrate the effectiveness of our approach in accurately identifying essential proteins and highlight its feature extraction capabilities.
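The three ensemble strategies named in the abstract (voting, weighted averaging, and stacking) can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation. The synthetic imbalanced data stands in for the fused feature matrix, and the base learners, the weights `[1, 1, 2]`, and the logistic-regression meta-learner are all assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the combined network/sequence/PSSM features:
# a synthetic, imbalanced binary problem (about 20% positive class).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def base_learners():
    """Return fresh base classifiers (an assumed, illustrative choice)."""
    return [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ]


ensembles = {
    # Soft voting: unweighted mean of predicted class probabilities.
    "voting": VotingClassifier(base_learners(), voting="soft"),
    # Weighted averaging: same, but with assumed per-model weights.
    "weighted": VotingClassifier(
        base_learners(), voting="soft", weights=[1, 1, 2]
    ),
    # Stacking: a meta-learner is trained on cross-validated base outputs.
    "stacking": StackingClassifier(
        base_learners(),
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
}

auc_scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    auc_scores[name] = roc_auc_score(y_test, proba)
    print(f"{name}: AUC = {auc_scores[name]:.3f}")
```

The scores printed here describe the synthetic data only and are unrelated to the AUC reported in the abstract. The stratified split keeps the class ratio equal in the training and test sets, which matters when the positive class is rare.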
|
Received: 15 September 2024
Revised: 21 October 2024
Accepted manuscript online: 01 November 2024
|
PACS: 89.75.-k (Complex systems)
|
|
Fund: This work was financially supported by the National Key R&D Program of China (Grant No. 2022YFF1202600), the National Natural Science Foundation of China (Grant No. 82301158), Science and Technology Innovation Action Plan of Shanghai Science and Technology Committee (Grant No. 22015820100), Two-hundred Talent Support (Grant No. 20152224), Translational Medicine Innovation Project of Shanghai Jiao Tong University School of Medicine (Grant No. TM201915), Clinical Research Project of Multi-Disciplinary Team, Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine (Grant No. 201914), and China Postdoctoral Science Foundation (Grant No. 2023M742332). |
Corresponding Authors: Yuanyuan Liu, Jinwu Wang
E-mail: yuanyuan_liu@shu.edu.cn; wangjw@shsmu.edu.cn
|
Cite this article:
Dezhi Lu(鲁德志), Hao Wu(吴淏), Yutong Hou(侯俞彤), Yuncheng Wu(吴云成), Yuanyuan Liu(刘媛媛), and Jinwu Wang(王金武) Accurate prediction of essential proteins using ensemble machine learning 2025 Chin. Phys. B 34 018901
|