VNFlow: Integration of Variational Autoencoders and Normalizing Flows for Novel Molecular Design

This repository accompanies the first report on integrating variational autoencoders with normalizing flows into a comprehensive molecular design workflow. Our approach has generated novel molecules with performance metrics exceeding those found in the ChEMBL database, underscoring its potential for identifying promising drug candidates. The work has been published as an open-access article in the Journal of Cheminformatics.

This work is the result of an ongoing, multi-year collaboration between the National Research Council Canada and Defence Research and Development Canada. It was inspired by the work of Alán Aspuru-Guzik's lab (Molecular VAE) and Nathan C. Frey (FastFlows). Relevant portions of code from their libraries have been reused with appropriate attribution and are clearly identified within this repository.

This repository includes scripts for training and validation, analysis tools, training data, generated data, and trained models, including variational autoencoders (VAEs), normalizing flows, and their combinations.

Files

Data

The data/ folder contains the following files:

zinc_vocabulary_aromatics.txt Aromatic ring fragments extracted from the Zinc250k database using the RDKit fragmentation tool.
ChEMBL22-50k-random.zip 50,000 random samples from the ChEMBL22 dataset, taken from the Molecular VAE GitHub repository (if you use it, please cite the original work).
OP_dataset_training.csv The initial dataset of 510 organo-phosphate molecules. A subset of 175 molecules whose SMILES representation ends with the string "OP(C)(=O)F" was used as the training dataset for one of our tests.
OP_dataset_generated.csv The dataset of organo-phosphate molecules generated in the first organo-phosphate test.
1_AutoregressiveRationalQuadraticSpline-4-32.csv, 1_MaskedAffineAutoregressive-4-32.csv, 1_RealNVP-4-32.csv, 1_RealNVP-4-32_more.csv, 1_condMaskedAffineAutoregressive-4-32.csv, 1_condRealNVP-4-32.csv, 1_random_VAE.csv The datasets generated by the different types of flows reported in the manuscript (1_RealNVP-4-32_more.csv was used for the chemical space visualisation).
ChemBL-35-cleaned.csv The first 10,000 rows of the dataset, provided for test purposes.
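
The suffix-based selection described for OP_dataset_training.csv can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the column name `smiles` and the helper names `load_smiles` and `select_by_suffix` are assumptions, so check the actual CSV header before use.

```python
import csv
from typing import Iterable, List

# Suffix quoted in the dataset description above.
OP_SUFFIX = "OP(C)(=O)F"


def select_by_suffix(smiles: Iterable[str], suffix: str = OP_SUFFIX) -> List[str]:
    """Return the SMILES strings that end with `suffix`, with whitespace stripped."""
    return [s.strip() for s in smiles if s.strip().endswith(suffix)]


def load_smiles(csv_path: str, column: str = "smiles") -> List[str]:
    """Read one column of SMILES strings from a CSV file that has a header row.

    The default column name "smiles" is hypothetical; adjust it to the real header.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [row[column] for row in csv.DictReader(handle)]


# Example usage (path relative to the repository root):
# subset = select_by_suffix(load_smiles("data/OP_dataset_training.csv"))
# print(len(subset))
```

If the file matches the description above, the selected subset should contain the 175 molecules mentioned there. A plain string comparison is used here on purpose: it mirrors the textual criterion and does not canonicalize the SMILES.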

Documents

The documents/ folder contains the following reports:

Published_work.pdf The final report that has undergone peer review. It contains details about the training, the model settings, an overview of the datasets used, and steps to reproduce our work. The report also includes the applications and findings of this work.
Supplementary_information.pdf Additional materials, including tables and figures.
Pre-print.pdf The version of the report that has not undergone peer review.

Usage

The main tools provided by this repository are in the scripts/ folder:

Script	Brief Description
`initial_tests_with_FastFlows.ipynb`	standalone notebook, workflow similar to Nathan C. Frey's FastFlows applied to organo-phosphate molecules, results not included in our report
`nflows_directly_and_analysis.ipynb`	standalone notebook, affine flows combined with reverse permutation flow and used for generation of organo-phosphate molecules
`OP_dataset_graphs_and_analysis.ipynb`	standalone notebook, analysis of the generated dataset of organo-phosphate molecules
`ORCA_header`	settings used for DFT calculations done for organo-phosphates in the software ORCA
`molecules/model.py`	tools for data handling and an example of a VAE model definition
`molecules/dataset_loading.py`	tools for data loading for a variational autoencoder model
`0.0-file-prep.ipynb`	initial file conversion
`1.0-VAE_training.ipynb`	script for training a variational autoencoder model
`1.1-random_sampling_VAE.ipynb`	random sampling from a previously trained variational autoencoder model
`1.1-glasflow-RealNVP.ipynb`	training of flows using Glasflow library and generation of samples using a variational autoencoder model
`1.1-nflows.ipynb`	training of flows using nflows library and generation of samples using a variational autoencoder model
`1.1-normflows-MaskedRatQuad.ipynb`	training of Masked Rational Quadratic flows using normflows library and generation of samples using a variational autoencoder model
`1.2.analysis.ipynb`	the main analysis file for the workflow combining VAE and Flow models
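
The numbered notebooks appear to form a staged workflow (file preparation, VAE training, flow training or sampling, analysis). Assuming the numeric prefixes indicate the intended run order, which the table does not state explicitly, a small helper can list them in that order. The function names below are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Matches prefixes such as "0.0-", "1.1-" or "1.2." at the start of a filename.
_PREFIX = re.compile(r"^(\d+)\.(\d+)[-.]")


def stage_key(filename: str) -> Tuple[int, int, str]:
    """Sort key: (major stage, minor stage, filename as tie-breaker)."""
    match = _PREFIX.match(filename)
    if match is None:
        raise ValueError(f"no numeric stage prefix in {filename!r}")
    return int(match.group(1)), int(match.group(2)), filename


def numbered_notebooks(filenames: Iterable[str]) -> List[str]:
    """Keep only files with a stage prefix and return them in stage order."""
    return sorted((f for f in filenames if _PREFIX.match(f)), key=stage_key)


# Example usage:
# import os
# for name in numbered_notebooks(os.listdir("scripts")):
#     print(name)
```

The three `1.1-*` flow notebooks share a stage number, so they are ordered alphabetically here; they look like alternatives rather than sequential steps.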

Key Prerequisites

This project uses the following packages:

babel==2.14.0
flowcon==0.0.4
h5py==3.11.0
nflows==0.14
normflows==1.7.3
pandas==2.2.2
scikit-learn==1.4.2
scipy==1.13.0
selfies==2.1.1
tensorflow==2.16.1
torch==2.3.0
rdkit==2023.09.6
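
To check whether an environment matches these pins, a short script along the following lines may help. It is a sketch using only the standard library; `PINS` holds just a few of the entries above for brevity, and `parse_pin` and `check_pins` are hypothetical helper names.

```python
from importlib import metadata
from typing import Dict, Iterable, Tuple

# A few of the pins listed above; extend with the remaining entries as needed.
PINS = [
    "nflows==0.14",
    "normflows==1.7.3",
    "selfies==2.1.1",
    "torch==2.3.0",
]


def parse_pin(line: str) -> Tuple[str, str]:
    """Split a 'name==version' pin into (name, version)."""
    name, _, version = line.partition("==")
    return name.strip(), version.strip()


def check_pins(pins: Iterable[str]) -> Dict[str, str]:
    """Compare each pin with the installed distribution, if any."""
    report = {}
    for line in pins:
        name, wanted = parse_pin(line)
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Exact string comparison is a simplification; some installers
        # normalize version strings (for example, dropping leading zeros).
        report[name] = "ok" if found == wanted else f"found {found}, expected {wanted}"
    return report


# Example usage:
# for package, status in check_pins(PINS).items():
#     print(f"{package}: {status}")
```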

Citing this work

If you find this repository useful in your work, please consider citing our paper:

@article{vnflow,
  title        = {VNFlow: Integration of variational autoencoders and normalizing flows for novel molecular design},
  author       = {Hostaš, Jiří and Ghaemi, Mohammad S. and Hu, Hang and Lin, Junan and Hu, Anguang and Ooi, Hsu K.},
  journal      = {Journal of Cheminformatics},
  volume       = {17},
  pages        = {161},
  year         = {2025},
  doi          = {10.1186/s13321-025-01104-2}
}

Support

For technical support, please email your question to the corresponding author.

License

Published under the MIT License (see LICENSE).

Copyright

Centre de recherche en technologies numériques / Digital Technologies Research Centre

Conseil national de recherches Canada / National Research Council Canada

Defence Research and Development Canada / Recherche et développement pour la défense Canada
