COMPUTATIONAL PROGRAMS FOR PHYSICS
Literature classification and its applications in condensed matter physics and materials science by natural language processing
Siyuan Wu(吴思远)1,2,4, Tiannian Zhu(朱天念)1,2,4, Sijia Tu(涂思佳)1,3, Ruijuan Xiao(肖睿娟)1,4,†, Jie Yuan(袁洁)1,4, Quansheng Wu(吴泉生)1,4, Hong Li(李泓)1,‡, and Hongming Weng(翁红明)1,4,§
1 Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
2 School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100190, China
3 College of Materials Science and Optoelectronic Technology, University of Chinese Academy of Sciences, Beijing 100049, China
4 Condensed Matter Physics Data Center of Chinese Academy of Sciences, Beijing 100190, China
Abstract The exponential growth of the literature is constraining researchers' access to comprehensive information in related fields. While natural language processing (NLP) may offer an effective solution to literature classification, it remains hindered by the lack of labelled datasets. In this article, we introduce a novel method for generating literature classification models through semi-supervised learning, which builds labelled datasets iteratively with limited human input. We apply this method to train NLP models that classify literature related to several research directions, i.e., batteries, superconductors, topological materials, and artificial intelligence (AI) in materials science. The trained NLP 'battery' model, applied to a larger dataset distinct from the training and testing datasets, achieves an F1 score of 0.738, which indicates the accuracy and reliability of this scheme. Furthermore, our approach demonstrates that, even with insufficient data, the not-yet-well-trained models obtained in the first few cycles can identify the relationships among different research fields and facilitate the discovery and understanding of interdisciplinary directions.
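The abstract describes an iterative, semi-supervised workflow: a small manually labelled seed set trains a classifier, the classifier labels further abstracts with limited human checking, and the enlarged dataset retrains the model, with performance measured by the F1 score. The sketch below illustrates that general idea only; the TF-IDF plus logistic-regression pipeline, the confidence threshold, and the toy abstracts are illustrative assumptions and are not the authors' implementation (the paper's models were reportedly built with TensorFlow [29] and later migrated to PyTorch).

# Minimal sketch of an iterative semi-supervised labelling loop for literature
# classification. Not the authors' code: the TF-IDF + logistic-regression pipeline,
# the 0.6 confidence threshold, and the toy abstracts are illustrative assumptions
# standing in for the paper's NLP model and its literature corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# A few manually labelled abstracts (1 = battery-related, 0 = not) act as the seed set.
texts = [
    "solid state electrolyte for lithium ion battery cathodes",
    "garnet electrolyte enables high energy density lithium batteries",
    "topological insulator surface states observed by ARPES",
    "superconducting gap measured in an iron based superconductor",
]
labels = [1, 1, 0, 0]

# Unlabelled abstracts to be labelled by the model itself over several cycles.
unlabelled = [
    "sodium ion battery anode with improved ionic conductivity",
    "Dirac semimetal band structure from first principles",
    "electrolyte additives improve battery cycling stability",
    "Majorana modes in topological superconductor nanowires",
]

vectorizer = TfidfVectorizer()
for cycle in range(3):  # a few self-training cycles
    X = vectorizer.fit_transform(texts)
    clf = LogisticRegression().fit(X, labels)

    # Label the pool and keep only confident predictions; in the paper's workflow
    # this is the step where the limited human checking would enter.
    remaining = []
    for doc in unlabelled:
        prob = clf.predict_proba(vectorizer.transform([doc]))[0]
        if prob.max() >= 0.6:  # assumed confidence threshold
            texts.append(doc)
            labels.append(int(clf.classes_[prob.argmax()]))
        else:
            remaining.append(doc)
    unlabelled = remaining

# Evaluate on a small held-out set (toy data, not the paper's benchmark).
test_texts = ["lithium battery electrolyte interface", "quantum anomalous Hall effect"]
test_labels = [1, 0]
pred = clf.predict(vectorizer.transform(test_texts))
print("F1 score:", f1_score(test_labels, pred))

In the paper's workflow the confidently labelled pool would grow the training set for the next cycle, so human effort is spent only on the low-confidence remainder; the threshold and number of cycles here are placeholders.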
Received: 27 December 2023
Revised: 27 March 2024
Accepted manuscript online: 09 April 2024
PACS: 07.05.Kf (Data analysis: algorithms and implementation; data management)
Fund: This research was funded by the Informatization Plan of Chinese Academy of Sciences (Grant No. CASWX2021SF-0102), the National Key R&D Program of China (Grant Nos. 2022YFA1603903, 2022YFA1403800, and 2021YFA0718700), the National Natural Science Foundation of China (Grant Nos. 11925408, 11921004, and 12188101), and the Chinese Academy of Sciences (Grant No. XDB33000000).
Corresponding Authors:
Ruijuan Xiao, Hong Li, Hongming Weng
E-mail: rjxiao@iphy.ac.cn; hli@iphy.ac.cn; hmweng@iphy.ac.cn
Cite this article:
Siyuan Wu(吴思远), Tiannian Zhu(朱天念), Sijia Tu(涂思佳), Ruijuan Xiao(肖睿娟), Jie Yuan(袁洁), Quansheng Wu(吴泉生), Hong Li(李泓), and Hongming Weng(翁红明) Literature classification and its applications in condensed matter physics and materials science by natural language processing 2024 Chin. Phys. B 33 050704
[1] Tshitoyan V, Dagdelen J, Weston L, Dunn A, Rong Z, Kononova O, Persson K A, Ceder G and Jain A 2019 Nature 571 95
[2] Swain M C and Cole J M 2016 J. Chem. Inf. Model. 56 1894
[3] Devlin J, Chang M W, Lee K and Toutanova K 2018 arXiv 1810.04805
[4] Brown T B, Mann B, Ryder N, et al. 2020 arXiv 2005.14165
[5] Song Y, Miret S and Liu B 2023 arXiv 2305.08264
[6] Liu Y, Yang Z W, Zou X X, Ma S C, Liu D H, Avdeev M and Shi S Q 2023 National Science Review 10 nwad125
[7] Mikolov T, Chen K, Corrado G and Dean J 2013 arXiv 1301.3781
[8] Levy O, Goldberg Y and Dagan I 2015 Transactions of the Association for Computational Linguistics 3 211
[9] Schnabel T, Labutov I, Mimno D and Joachims T 2015 Conference on Empirical Methods in Natural Language Processing pp. 298-307
[10] Arora S, Li Y Z, Liang Y Y, Ma T Y and Risteski A 2016 Transactions of the Association for Computational Linguistics 4 385
[11] Yin Z and Shen Y 2018 arXiv 1812.04224
[12] Hopfield J J 1982 Proc. Natl. Acad. Sci. USA 79 2554
[13] Hochreiter S and Schmidhuber J 1997 Neural Computation 9 1735
[14] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I 2017 arXiv 1706.03762
[15] Park Y J, Jerng S E, Park J S, Kwon C, Hsu C W, Ren Z, Yoon S and Li J 2023 arXiv 2308.13687
[16] Yang S J, Li S, Venugopalan S, Tshitoyan V, Aykol M, Merchant A, Cubuk E D and Cheon G 2023 arXiv 2311.13778
[17] Zhang X, Zhou Z C, Chen M and Sun Y Y 2023 J. Phys. Chem. Lett. 14 11342
[18] Rubungo A N, Arnold C, Rand B P and Dieng A B 2023 arXiv 2310.14029
[19] Sagawa T and Kojima R 2023 arXiv 2311.06708
[20] Buehler M J 2023 arXiv 2310.19998
[21] Zheng Z, Zhang O, Borgs C, Chayes J T and Yaghi O M 2023 J. Am. Chem. Soc. 145 18048
[22] Zheng Z L, Alawadhi A H, Chheda S, Neumann S E, Rampal N, Liu S C, Nguyen H L, Lin Y, Rong Z, Siepmann J I, Gagliardi L, Anandkumar A, Borgs C, Chayes J T and Yaghi O M 2023 J. Am. Chem. Soc. 145 28284
[23] Zheng Z L, Rong Z C, Rampal N, Borgs C, Chayes J T and Yaghi O M 2023 Angewandte Chemie 135 e202311983
[24] Zheng Z L, Zhang O F, Nguyen H L, Rampal N, Alawadhi A H, Rong Z C, Head-Gordon T, Borgs C, Chayes J T and Yaghi O M 2023 ACS Cent. Sci. 9 2161
[25] Boiko D A, MacKnight R, Kline B and Gomes G 2023 Nature 624 570
[26] Yang X J, Wilson S D and Petzold L 2024 arXiv 2401.01089
[27] Chen Z Y, Xie F K, Wan M, Yuan Y, Liu M, Wang Z G, Meng S and Wang Y G 2023 Chin. Phys. B 32 118104
[28] Deb J, Saikia L, Dihingia K D and Sastry G N 2024 J. Chem. Inf. Model. 64 799
[29] TensorFlow https://tensorflow.google.cn/
[30] Web of Science https://www.webofscience.com/wos/woscc/basicsearch
[31] Wu S Y, Xiao R J, Li H and Chen L Q 2023 arXiv 2304.08728
[32] Inaguma Y, Chen L Q, Itoh M, Nakamura T, Uchida T, Ikuta H and Wakihara M 1993 Solid State Commun. 86 689
[33] Chang C, Zhang H P, Zhao R, Li F C, Luo P, Li M Z and Bai H Y 2022 Nat. Mater. 21 1240
[34] Li F, Wang H, Li J and Geng H 2023 Chin. Phys. B 32 106103
[35] The personalized model has been under internal testing at the Materials Science Electronic Laboratory Platform https://in.iphy.ac.cn/eln/#/recusertype
[36] Data Center for Condensed Matter Physics, Chinese Academy of Sciences http://condmatt.iphy.ac.cn/. The model results are presented weekly at http://condmatt.iphy.ac.cn/literature trends/newest, and the data on energy density and ionic conductivity will be merged into it step by step.
The authors state that the model has been changed from TensorFlow to PyTorch, based on a transformer architecture, since August 2023 at http://condmatt.iphy.ac.cn/literature trends/newest. This website will be discarded; see Ref. [35] or wait for the new website.