M.L. Hildner, K. Johnson, et al.
Surface Science
We present SMI-TED (SMILES Transformer Encoder Decoder), a large-scale foundation model for materials and chemistry, trained with self-supervised learning on 91 million SMILES samples (4 billion molecular tokens) from PubChem. Its encoder-decoder architecture supports a wide range of complex tasks, including the prediction of quantum chemical properties and reaction yields. We offer two model variants, with 289M and 8 × 289M parameters, respectively, to accommodate different use cases. The model achieves state-of-the-art results across multiple benchmark datasets, demonstrating its versatility and effectiveness. Notably, its latent space exhibits compositionality and separability, properties essential for higher-level reasoning tasks and few-shot learning. To facilitate further research and applications, we make the model weights and source code publicly available on HuggingFace and GitHub, respectively.
Jerng-Sik Song, Chin-An Chang
Journal of Vacuum Science and Technology A: Vacuum, Surfaces and Films
Paul H. Kasai, Patrick Wheeler
Applied Surface Science
L.K. Wang, A. Acovic, et al.
MRS Spring Meeting 1993