Abstract

Molecular property prediction is of great significance to drug discovery, human health, and environmental protection. Despite considerable efforts, quantitative prediction of various molecular properties remains a challenge. Although some machine learning models, such as the bidirectional encoder from transformers, can incorporate massive unlabeled molecular data into molecular representations via a self-supervised learning strategy, they neglect three-dimensional (3D) stereochemical information. Algebraic graphs, specifically element-specific multiscale weighted colored algebraic graphs, embed complementary 3D molecular information into graph invariants. We propose an algebraic graph-assisted bidirectional transformer (AGBT) framework that fuses representations generated by algebraic graphs and bidirectional transformers with a variety of machine learning algorithms, including decision trees, multitask learning, and deep neural networks. We validate the proposed AGBT framework on eight molecular datasets, involving quantitative toxicity, physical chemistry, and physiology datasets. Extensive numerical experiments show that AGBT is a state-of-the-art framework for molecular property prediction.
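The fusion step mentioned in the abstract can be pictured with a minimal sketch. It assumes the simplest possible scheme: each block of features is standardized column by column, and the two blocks are then concatenated per molecule. The function names and the toy numbers are hypothetical and are not taken from the article or its repository; the released AGBT code should be consulted for the authors' actual pipeline.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Scale each column to zero mean and unit variance.

    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    scaled_columns = []
    for column in zip(*rows):
        mu, sigma = mean(column), pstdev(column)
        scaled_columns.append(
            [(value - mu) / sigma if sigma > 0 else 0.0 for value in column]
        )
    # Transpose back so that each row again describes one molecule.
    return [list(row) for row in zip(*scaled_columns)]


def fuse_representations(graph_features, transformer_features):
    """Concatenate per-molecule vectors from two feature sources.

    Assumption: plain concatenation after standardization; this is an
    illustrative stand-in, not the authors' published procedure.
    """
    if len(graph_features) != len(transformer_features):
        raise ValueError("both inputs must describe the same molecules")
    graph_scaled = standardize_columns(graph_features)
    transformer_scaled = standardize_columns(transformer_features)
    return [g + t for g, t in zip(graph_scaled, transformer_scaled)]


# Toy, made-up numbers for three molecules (not data from the article).
graph_features = [[0.2, 1.5], [0.4, 1.1], [0.9, 0.7]]
transformer_features = [[0.1, 0.3, 0.5], [0.2, 0.1, 0.4], [0.6, 0.2, 0.9]]
fused = fuse_representations(graph_features, transformer_features)
```

The fused vectors could then be passed to any downstream learner of the kinds the abstract lists, such as a decision-tree ensemble or a deep neural network.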

Document Type

Article

Publication Date

6-10-2021

Notes/Citation Information

Published in Nature Communications, v. 12, issue 1, article no. 3521.

© The Author(s) 2021

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/.

Digital Object Identifier (DOI)

https://doi.org/10.1038/s41467-021-23720-w

Funding Information

The work of Dong Chen, Xin Chen, Yi Jiang, and Feng Pan was supported in part by the National Key R&D Program of China (2016YFB0700600). The work of Kaifu Gao and Guo-Wei Wei was supported in part by NSF grants DMS-2052983, DMS-1761320, and IIS-1900473, NIH grants GM126189 and GM129004, Bristol-Myers Squibb, and Pfizer. The work of Duc Nguyen was supported in part by NSF grant DMS-2053284 and a University of Kentucky start-up fund. The work of Dong Chen was also partly supported by Michigan State University.

Related Content

The pre-training dataset used in this work is ChEMBL26, which is available at https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_26/. To ensure the reproducibility of this work, the eight datasets used, namely four quantitative toxicity datasets (LD50, IGC50, LC50, and LC50DM), a partition coefficient dataset, the FreeSolv dataset, the Lipophilicity dataset, and the BBBP dataset, are available at https://weilab.math.msu.edu/Database/.

The models and related code have been released as open source and are available in the GitHub repository: https://github.com/ChenDdon/AGBTcode.

41467_2021_23720_MOESM1_ESM.pdf (3759 kB)
Supplementary information

41467_2021_23720_MOESM2_ESM.xlsx (26 kB)
Supplementary data 1

41467_2021_23720_MOESM3_ESM.pdf (39 kB)
Description of additional supplementary files

41467_2021_23720_MOESM4_ESM.pdf (278 kB)
Reporting summary

41467_2021_23720_MOESM5_ESM.pdf (1049 kB)
Peer review file
