Text-guided diverse-expression diffusion model for molecule generation
Wenchao Weng(翁文超)1,†, Hanyu Jiang(蒋涵羽)2,†, Xiangjie Kong(孔祥杰)1,‡, and Giovanni Pau3
1 College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310014, China; 2 Hangzhou Dianzi University ITMO Joint Institute, Hangzhou Dianzi University, Hangzhou 310018, China; 3 Faculty of Engineering and Architecture, Kore University of Enna, Italy
Abstract Molecule generation guided by text descriptions aims to produce molecules that match a given natural-language input. Mainstream methods typically represent molecules with the simplified molecular-input line-entry system (SMILES) and model them with diffusion models or autoregressive architectures. However, SMILES is a one-to-many representation: a single molecule maps to many valid strings, so existing methods require complex model architectures and larger training datasets to perform well, which reduces the efficiency of both training and generation. In this paper, we propose a text-guided diverse-expression diffusion (TGDD) model for molecule generation. TGDD combines SMILES and self-referencing embedded strings (SELFIES) into a novel diverse-expression molecular representation, enabling precise molecule mapping from natural language. By leveraging this representation, TGDD simplifies the segmented diffusion generation process, trains faster, consumes less memory, and aligns more strongly with natural language. TGDD outperforms both TGM-LDM and the autoregressive model MolT5-Base on most evaluation metrics.
Fund: Project supported in part by the National Natural Science Foundation of China (Grant Nos. 62476247 and 62072409), the “Pioneer” and “Leading Goose” R&D Program of Zhejiang (Grant No. 2024C01214), and the Zhejiang Provincial Natural Science Foundation (Grant No. LR21F020003).
Corresponding Authors:
Xiangjie Kong
E-mail: xjkong@ieee.org
Cite this article:
Wenchao Weng(翁文超), Hanyu Jiang(蒋涵羽), Xiangjie Kong(孔祥杰), and Giovanni Pau 2025 Text-guided diverse-expression diffusion model for molecule generation Chin. Phys. B 34 050701