Framework and data for generative molecular design — Infrastructure et données pour la conception moléculaire générative

nrc-cnrc/VNFlow

VNFlow: Integration of Variational Autoencoders and Normalizing Flows for Novel Molecular Design

This repository accompanies the first report on integrating variational autoencoders with normalizing flows into a comprehensive molecular design workflow. Our approach has generated novel molecules with performance metrics exceeding those found in the ChEMBL database, underscoring its potential for identifying promising drug candidates. The report has been published as an open-access article in the Journal of Cheminformatics.

Schema

This work is the result of an ongoing, multi-year collaboration between the National Research Council Canada and Defence Research and Development Canada. It was inspired by the work of Alán Aspuru-Guzik's lab (Molecular VAE) and Nathan C. Frey (FastFlows). Relevant portions of code from their libraries have been reused with appropriate attribution and are clearly identified within this repository.

Schema

This repository includes scripts for training and validation, analysis tools, training data, generated data, and trained models, including variational autoencoders (VAEs), normalizing flows, and their combinations.

Files

Data

The data/ folder contains the following files:

  1. zinc_vocabulary_aromatics.txt Aromatic ring fragments extracted from the Zinc250k database using the RDKit fragmentation tool.

  2. ChEMBL22-50k-random.zip 50,000 random samples from the ChEMBL22 dataset, taken from the Molecular VAE GitHub repository (if you use this file, please cite the original work).

  3. OP_dataset_training.csv The initial dataset of 510 organo-phosphate molecules. A portion of this dataset (175 molecules whose SMILES representation ends with the string "OP(C)(=O)F") was used as the training dataset for one of our tests.

  4. OP_dataset_generated.csv The dataset of organo-phosphate molecules generated in the first organo-phosphate test.

  5. 1_AutoregressiveRationalQuadraticSpline-4-32.csv, 1_MaskedAffineAutoregressive-4-32.csv, 1_RealNVP-4-32.csv, 1_RealNVP-4-32_more.csv, 1_condMaskedAffineAutoregressive-4-32.csv, 1_condRealNVP-4-32.csv, 1_random_VAE.csv contain the datasets generated by different types of flows reported in the manuscript (the file 1_RealNVP-4-32_more.csv has been used for chemical space visualisation above).

  6. ChemBL-35-cleaned.csv contains the first 10,000 rows of the dataset, for test purposes.
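
The subset described for OP_dataset_training.csv (rows whose SMILES string ends with "OP(C)(=O)F") can be selected with a plain suffix test. The sketch below is only an illustration: the column name `smiles` is an assumption about the CSV header, and the rows are placeholder strings rather than entries from the dataset.

```python
import csv
import io

SUFFIX = "OP(C)(=O)F"  # suffix quoted in the dataset description


def select_by_suffix(rows, smiles_column="smiles", suffix=SUFFIX):
    """Return the rows whose SMILES string ends with the given suffix.

    `smiles_column` is a hypothetical column name; adjust it to the
    actual header of the CSV file.
    """
    return [row for row in rows if row[smiles_column].strip().endswith(suffix)]


# In-memory stand-in for the CSV file. The strings are placeholders
# ("<A>", "<B>", "<C>"), not real SMILES and not taken from the dataset.
demo_csv = io.StringIO(
    "smiles\n"
    "<A>OP(C)(=O)F\n"
    "<B>OP(C)(=O)F\n"
    "<C>OP(=O)(O)O\n"
)

rows = list(csv.DictReader(demo_csv))
subset = select_by_suffix(rows)
print(len(subset), "of", len(rows), "rows selected")
```

To run this against the real file, replace `demo_csv` with an open file handle and check the header for the correct column name first.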

Documents

The documents/ folder contains the following reports:

  1. Published_work.pdf The final report that has undergone peer review. It contains details about the training, the model settings, an overview of the datasets used, and steps to reproduce our work. The report also includes the applications and findings of this work.
  2. Supplementary_information.pdf Additional materials, including tables and figures.
  3. Pre-print.pdf The version of the report that has not undergone peer review.

Usage

The main tools provided by this repository are in the scripts/ folder:

| Script | Brief Description |
| --- | --- |
| initial_tests_with_FastFlows.ipynb | standalone notebook, workflow similar to Nathan C. Frey's FastFlows applied for organo-phosphate molecules, results not included in our report |
| nflows_directly_and_analysis.ipynb | standalone notebook, affine flows combined with reverse permutation flow and used for generation of organo-phosphate molecules |
| OP_dataset_graphs_and_analysis.ipynb | standalone notebook, analysis of the generated dataset of organo-phosphate molecules |
| ORCA_header | settings used for DFT calculations done for organo-phosphates in the software ORCA |
| molecules/model.py | tools for data handling and an example of a VAE model definition |
| molecules/dataset_loading.py | tools for data loading for a variational autoencoder model |
| 0.0-file-prep.ipynb | initial file conversion |
| 1.0-VAE_training.ipynb | script for training a variational autoencoder model |
| 1.1-random_sampling_VAE.ipynb | random sampling from a previously trained variational autoencoder model |
| 1.1-glasflow-RealNVP.ipynb | training of flows using Glasflow library and generation of samples using a variational autoencoder model |
| 1.1-nflows.ipynb | training of flows using nflows library and generation of samples using a variational autoencoder model |
| 1.1-normflows-MaskedRatQuad.ipynb | training of Masked Rational Quadratic flows using normflows library and generation of samples using a variational autoencoder model |
| 1.2.analysis.ipynb | the main analysis file for the workflow combining VAE and Flow models |
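
For readers unfamiliar with the phrase "affine flows combined with reverse permutation flow", the following is a minimal, dependency-free sketch of that idea: an affine coupling step (RealNVP-style) followed by a permutation that reverses the feature order. It is not the code used in the notebooks, which rely on the flow libraries listed under Key Prerequisites. The conditioner functions `toy_scale` and `toy_shift` are hypothetical fixed functions standing in for trained neural networks, and the input is assumed to have an even number of features.

```python
import math


def affine_coupling_forward(x, scale_fn, shift_fn):
    """Keep the first half of x; scale and shift the second half
    using values computed from the first half."""
    d = len(x) // 2  # assumes an even number of features
    x1, x2 = x[:d], x[d:]
    s, t = scale_fn(x1), shift_fn(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    log_det = sum(s)  # log|det J| of the affine coupling
    return x1 + y2, log_det


def affine_coupling_inverse(y, scale_fn, shift_fn):
    """Exact inverse of affine_coupling_forward."""
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    s, t = scale_fn(y1), shift_fn(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2


def reverse_permutation(v):
    """Reverse the feature order; volume-preserving (log|det J| = 0)."""
    return v[::-1]


# Hypothetical fixed conditioners; in a trained flow these are learned.
def toy_scale(h):
    return [math.tanh(0.5 * a) for a in h]


def toy_shift(h):
    return [0.1 * a + 1.0 for a in h]


def flow_forward(x, n_layers=2):
    total_log_det = 0.0
    for _ in range(n_layers):
        x, log_det = affine_coupling_forward(x, toy_scale, toy_shift)
        x = reverse_permutation(x)
        total_log_det += log_det
    return x, total_log_det


def flow_inverse(y, n_layers=2):
    for _ in range(n_layers):
        y = reverse_permutation(y)  # reversal is its own inverse
        y = affine_coupling_inverse(y, toy_scale, toy_shift)
    return y


x = [0.3, -1.2, 0.8, 2.0]
z, log_det = flow_forward(x)
x_back = flow_inverse(z)
print("round-trip error:", max(abs(a - b) for a, b in zip(x, x_back)))
```

The reversal between coupling steps matters because each coupling leaves half of the features untouched; permuting ensures that every feature is transformed after two layers.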

Key Prerequisites

This project uses the following packages:

babel==2.14.0
flowcon==0.0.4
h5py==3.11.0
nflows==0.14
normflows==1.7.3
pandas==2.2.2
scikit-learn==1.4.2
scipy==1.13.0
selfies==2.1.1
tensorflow==2.16.1
torch==2.3.0
rdkit==2023.09.6
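
A quick way to compare an existing environment against these pins is sketched below using only the standard library. The helper names `normalize` and `check_versions` are illustrative and not part of this repository, and the sketch assumes each package is registered under the distribution name shown in the list, which may not hold for every installation method.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above.
PINNED = {
    "babel": "2.14.0",
    "flowcon": "0.0.4",
    "h5py": "3.11.0",
    "nflows": "0.14",
    "normflows": "1.7.3",
    "pandas": "2.2.2",
    "scikit-learn": "1.4.2",
    "scipy": "1.13.0",
    "selfies": "2.1.1",
    "tensorflow": "2.16.1",
    "torch": "2.3.0",
    "rdkit": "2023.09.6",
}


def normalize(v):
    """Drop leading zeros in numeric components so that
    '2023.09.6' and '2023.9.6' compare equal."""
    return ".".join(str(int(p)) if p.isdigit() else p for p in v.split("."))


def check_versions(pinned):
    """Map each package to (wanted, installed or None, matches)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        ok = installed is not None and normalize(installed) == normalize(wanted)
        report[name] = (wanted, installed, ok)
    return report


report = check_versions(PINNED)
for name, (wanted, installed, ok) in report.items():
    status = "ok" if ok else "differs or missing"
    print(f"{name}: wanted {wanted}, found {installed} ({status})")
```

The comparison is deliberately simple: local build suffixes such as `+cpu` will be reported as a difference even when the base version matches.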

Citing this work

If you find this repository useful in your work, please consider citing our paper:

@article{vnflow,
  title        = {VNFlow: Integration of variational autoencoders and normalizing flows for novel molecular design},
  author       = {Hostaš, Jiří and Ghaemi, Mohammad S. and Hu, Hang and Lin, Junan and Hu, Anguang and Ooi, Hsu K.},
  journal      = {Journal of Cheminformatics},
  volume       = {17},
  pages        = {161},
  year         = {2025},
  doi          = {10.1186/s13321-025-01104-2}
}

Support

For technical support, please email your question to the corresponding author.

License

Published under the MIT License (see LICENSE).

Copyright

Centre de recherche en technologies numériques / Digital Technologies Research Centre

Conseil national de recherches Canada / National Research Council Canada

Defence Research and Development Canada / Recherche et développement pour la défense Canada

Copyright 2025, Sa Majesté le Roi du Chef du Canada / His Majesty the King in Right of Canada

