This is the repository for the synthetic data generation pipeline of MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures.
- Create a virtual environment.
python3.10 -m venv markushgenerator-env
source markushgenerator-env/bin/activate
- Install MarkushGenerator.
PIP_USE_PEP517=0 pip install -e .
- Install Java 17.
sudo apt-get install openjdk-17-jdk
sudo update-alternatives --config 'java'
- Download the CDK library (version cdk-2.9.jar) from the CDK GitHub releases and move it to MarkushGenerator/lib/.
wget https://github.com/cdk/cdk/releases/download/cdk-2.9/cdk-2.9.jar -P ./lib/
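The setup steps above can be sanity-checked with a short script. This is a minimal sketch and not part of the repository: the function name `check_prerequisites` is hypothetical, and it assumes the jar was saved as `lib/cdk-2.9.jar` relative to the repository root, as in the `wget` command. It only checks that a `java` executable is on the PATH and does not verify that the version is 17.

```python
import shutil
from pathlib import Path


def check_prerequisites(repo_root=".", jar_name="cdk-2.9.jar"):
    """Return a list of human-readable problems; the list is empty if none are found.

    Hypothetical helper for illustration only, not part of MarkushGenerator.
    """
    problems = []

    # A `java` executable must be reachable on the PATH.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")

    # The CDK jar is expected under <repo_root>/lib/ (assumed location).
    jar_path = Path(repo_root) / "lib" / jar_name
    if not jar_path.is_file():
        problems.append(f"missing CDK jar: {jar_path}")

    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Java and the CDK jar were found.")
```

Run it from the repository root after completing the installation steps.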
The notebook MarkushGenerator/markushgenerator/draw.ipynb shows how to:
- Draw an image from a CXSMILES.
- Draw a textual definition associated with the CXSMILES.
Each generated sample contains:
- CXSMILES.
- Optimized CXSMILES.
- Markush structure image.
- OCR cells, containing the position and content of the text written in the image. Some characters, such as explicit carbons and implicit hydrogens, are currently omitted. Atoms with charges are formatted as "atom, charge, number of charges". Superscripts and subscripts are ignored.
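To make the sample contents listed above concrete, here is a minimal sketch of how one sample could be held in memory. All class and field names (`OCRCell`, `MarkushSample`, `box`, `image_path`) are hypothetical, the bounding-box layout is an assumption, and the string values are placeholders; the format actually written by the pipeline may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OCRCell:
    """One piece of text written in the image (hypothetical layout)."""
    text: str  # content of the text
    # Position of the text; (x_min, y_min, x_max, y_max) in pixels is an assumption.
    box: Tuple[float, float, float, float]


@dataclass
class MarkushSample:
    """The four items contained in a generated sample (hypothetical field names)."""
    cxsmiles: str
    optimized_cxsmiles: str
    image_path: str
    ocr_cells: List[OCRCell] = field(default_factory=list)


# Placeholder values only; no real generated data is shown here.
sample = MarkushSample(
    cxsmiles="<CXSMILES string>",
    optimized_cxsmiles="<optimized CXSMILES string>",
    image_path="<path to the Markush structure image>",
    ocr_cells=[OCRCell(text="R1", box=(10.0, 20.0, 30.0, 35.0))],
)

# List the text of every OCR cell in the sample.
cell_texts = [cell.text for cell in sample.ocr_cells]
print(cell_texts)
```

Keeping the OCR cells as a list of small records keeps each text fragment paired with its position, which is how the sample description presents them.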

