Integrating Data Mining and Natural Language Processing to Construct Melting Points Database for Organometallic Compounds

Journal
J. Chem. Inf. Model. (Journal of Chemical Information and Modeling)
Date
2024.10.01
Abstract

As semiconductor devices miniaturize, atomic layer deposition (ALD) technology is becoming increasingly important. When designing ALD precursors, the melting point must be considered because precursors should have melting points lower than the process temperature. However, melting point data are difficult to obtain because of experimental sensitivity and high computational costs. As a result, a comprehensive and well-organized melting point database for organometallic compounds (OMCs) has not yet been fully reported. In this study, we therefore constructed a database of melting points for 1,845 OMCs, covering 58 metal and 6 metalloid elements. The database contains CAS numbers, molecular formulas, and structural information, and was constructed through automatic extraction and systematic curation. The melting point information was extracted using two methods: 1) 1,434 materials from 11 chemical vendor databases, and 2) 411 materials identified through natural language processing (NLP) techniques with an accuracy of 86.3%, based on 2,096 scientific papers published over the past 29 years. In our database, the OMCs contain up to around 250 atoms and have melting points that range from -170 °C to 1610 °C. The main source is the Chemsrc database, accounting for 607 materials (32.9%), and Fe is the most common central metal or metalloid element (15.0%), followed by Si (11.6%) and B (6.7%). To validate the utility of the constructed database, we developed a multi-modal neural network model that integrates graph-based and feature-based information as descriptors to predict the melting points of OMCs, although it achieved only moderate performance. We believe the current approach reduces the time and cost associated with manual data collection and processing, contributing to effective screening of potentially promising ALD precursors and providing crucial information for the advancement of the semiconductor industry.
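
As a quick consistency check on the figures quoted in the abstract, the short sketch below recomputes the database size and the Chemsrc share. The variable names are illustrative only and are not taken from the paper or its code.

```python
# Counts quoted in the abstract (variable names are hypothetical).
vendor_entries = 1434   # materials from 11 chemical vendor databases
nlp_entries = 411       # materials extracted from papers via NLP
chemsrc_entries = 607   # materials sourced from Chemsrc

# The two extraction routes should add up to the reported database size.
total = vendor_entries + nlp_entries

# Share of the database contributed by Chemsrc, in percent (one decimal).
chemsrc_share = round(100 * chemsrc_entries / total, 1)

print(total, chemsrc_share)
```

The two routes sum to the reported 1,845 entries, and 607 of 1,845 corresponds to the stated 32.9%.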

Reference
JCIM. 2024
DOI
https://doi.org/10.1021/acs.jcim.4c01254