Chinese Physics B ›› 2024, Vol. 33 ›› Issue (5): 50704-050704. doi: 10.1088/1674-1056/ad3c30

Literature classification and its applications in condensed matter physics and materials science by natural language processing

Siyuan Wu(吴思远)1,2,4, Tiannian Zhu(朱天念)1,2,4, Sijia Tu(涂思佳)1,3, Ruijuan Xiao(肖睿娟)1,4,†, Jie Yuan(袁洁)1,4, Quansheng Wu(吴泉生)1,4, Hong Li(李泓)1,‡, and Hongming Weng(翁红明)1,4,§   

    1 Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China;
    2 School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100190, China;
    3 College of Materials Science and Optoelectronic Technology, University of Chinese Academy of Sciences, Beijing 100049, China;
    4 Condensed Matter Physics Data Center of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2023-12-27 Revised: 2024-03-27 Accepted: 2024-04-09 Online: 2024-05-20 Published: 2024-05-20
  • Contact: Ruijuan Xiao, Hong Li, Hongming Weng E-mail: rjxiao@iphy.ac.cn; hli@iphy.ac.cn; hmweng@iphy.ac.cn
  • Supported by:
    This research was funded by the Informatization Plan of Chinese Academy of Sciences (Grant No. CASWX2021SF-0102), the National Key R&D Program of China (Grant Nos. 2022YFA1603903, 2022YFA1403800, and 2021YFA0718700), the National Natural Science Foundation of China (Grant Nos. 11925408, 11921004, and 12188101), and the Chinese Academy of Sciences (Grant No. XDB33000000).

Abstract: The exponential growth of literature is constraining researchers' access to comprehensive information in related fields. While natural language processing (NLP) may offer an effective solution to literature classification, it remains hindered by the lack of labelled datasets. In this article, we introduce a novel method for generating literature classification models through semi-supervised learning, which builds a labelled dataset iteratively with limited human input. We apply this method to train NLP models for classifying literature related to several research directions, i.e., battery, superconductor, topological material, and artificial intelligence (AI) in materials science. The trained NLP 'battery' model, applied to a larger dataset distinct from the training and testing datasets, achieves an F1 score of 0.738, which indicates the accuracy and reliability of this scheme. Furthermore, our approach demonstrates that even with insufficient data, the under-trained models of the first few cycles can identify the relationships among different research fields and facilitate the discovery and understanding of interdisciplinary directions.

Key words: natural language processing, text mining, materials science
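
The abstract names the ingredients (a small human-labelled seed, iterative growth of the labelled set, evaluation by F1 score) but not the model or the selection rule. The following Python sketch shows one generic way such a loop can look, using plain self-training with a toy word-count classifier. It is not the authors' implementation: the function names (`train`, `score`, `predict`, `self_training`, `f1_score`), the confidence rule, and the sample titles are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def train(labelled):
    """Toy model (assumption): per-class word counts instead of a real NLP model."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in labelled:
        counts[label].update(text.lower().split())
    return counts


def score(model, text):
    """Positive when the words were seen more often in class 1 than in class 0."""
    return sum(model[1][w] - model[0][w] for w in text.lower().split())


def predict(model, text):
    return 1 if score(model, text) > 0 else 0


def self_training(seed, pool, cycles=3, batch=2):
    """Grow the labelled set from a small human-labelled seed.

    Each cycle retrains the model, then moves the `batch` most confidently
    scored unlabelled texts into the labelled set with the model's own label.
    """
    labelled = list(seed)
    pool = list(pool)
    for _ in range(cycles):
        if not pool:
            break
        model = train(labelled)
        # Most confident predictions (largest absolute score) come first.
        pool.sort(key=lambda text: abs(score(model, text)), reverse=True)
        chosen, pool = pool[:batch], pool[batch:]
        labelled.extend((text, predict(model, text)) for text in chosen)
    return train(labelled), labelled


# Hypothetical seed labels (1 = battery-related, 0 = not) and unlabelled pool.
seed = [
    ("lithium ion battery cathode", 1),
    ("superconductor critical temperature", 0),
]
pool = [
    "solid electrolyte battery",
    "high temperature superconductor",
    "battery anode lithium",
    "topological superconductor",
]
model, labelled = self_training(seed, pool)
```

In this sketch the only human input is the seed list; a practical variant would route low-confidence texts to a human reviewer rather than trusting the model's label, and would report `f1_score` on a held-out, human-labelled test set.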

Classification code (PACS): 07.05.Kf (Data analysis: algorithms and implementation; data management)