WaveDNA

A Deep Learning Tool for DNA Sequence Classification Using Wavelet Transform and Convolutional Neural Networks

Overview

WaveDNA is a novel computational tool that transforms DNA sequences into image-like representations for deep learning-based classification. By leveraging the power of wavelet transforms and convolutional neural networks, WaveDNA enables accurate identification of transcription factor binding sites (TFBSs) and other sequence patterns in genomic data.

Key Features

Wavelet-based Encoding: Converts DNA sequences into two-dimensional frequency representations using Continuous Wavelet Transform (CWT)
Deep Learning Classification: Utilizes ResNet50 architecture for robust sequence classification
Interpretability studies: Captures biologically relevant sequence parts leveraging Grad-CAM

How It Works

WaveDNA operates through a two-stage pipeline:

1. Sequence Encoding

Numerical Mapping: DNA nucleotides are converted to numerical values (A=1, C=2, G=3, T=4)
Wavelet Transform: The numerical signal undergoes Continuous Wavelet Transform using the Morlet wavelet
Image Generation: Wavelet coefficients are visualized as 2D images with color-coded intensity patterns

2. Deep Learning Classification

CNN Processing: Generated wavelet images are fed into a ResNet50 neural network
Feature Extraction: The 50-layer architecture with residual blocks extracts hierarchical features
Binary Output: Final classification into TFBS (positive) or non-TFBS (negative) categories

Data for reproducing the analysis

This project uses ChIP-seq data from the ENCODE Project to train and evaluate models, as well as to perform explainability analyses.

The following table lists the 13 transcription factors used in this project, along with their corresponding BED and BigWig file identifiers from ENCODE. The BED files are used to extract positive sequences for training WaveDNA, while bigWig files are used for conducting explainability analysis. For negative sequences, we used Meuleman et al. DHS index.

Transcription Factor (TF)	BED ID	BigWig ID
SETDB1	ENCFF812YDP	ENCFF413AIX
GATA1	ENCFF657CTC	ENCFF331URE
GATA2	ENCFF409YKQ	ENCFF184LTK
JUND	ENCFF658HLG	ENCFF292YDK
RFX5	ENCFF438ZEN	ENCFF482YCH
HOXA3	ENCFF753OCA	ENCFF631HAF
TCF7L2	ENCFF726SWQ	ENCFF406WNX
ZBED1	ENCFF753UAX	ENCFF606OSZ
NFKB2	ENCFF687TJX	ENCFF918FFX
MEF2B	ENCFF884QQW	ENCFF710GGG
LCORL	ENCFF606UCO	ENCFF227FSQ
MBD2	ENCFF788YHU	ENCFF510BMM
GMEB1	ENCFF365ETH	ENCFF398IIR

Example

To download the ENCFF812YDP BED file, use: https://www.encodeproject.org/files/ENCFF812YDP/@@download/ENCFF812YDP.bed.gz

Please make sure to unzip your BED file before running the analysis.

To download the ENCFF413AIX bigWig file, use: https://www.encodeproject.org/files/ENCFF812YDP/@@download/ENCFF812YDP.bigWig

Usage

1. Environment Setup

First, create a conda/mamba environment with Python 3.10 and install the required dependencies:

# Create a new environment with Python 3.10
conda create -n wavedna python=3.10
# or using mamba (faster)
mamba create -n wavedna python=3.10

# Activate the environment
conda activate wavedna

# Install dependencies from requirements.txt
pip install -r requirements.txt

2. Data Preparation

Download Reference Genome

Download the human reference genome (hg38) from UCSC:

# Create data directory
mkdir -p data/genome

# Download hg38 reference genome
cd data/genome
wget https://hgdownload.soe.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz

# Extract the genome file
gunzip hg38.fa.gz

Prepare Input Data

WaveDNA requires two types of input data:

Positive sequences: BED file from ChIP-seq experiments (ENCODE database) (see Section Data to download data used in the study)
- Contains genomic coordinates of transcription factor binding sites
- Format: chr, start, end, name, score, strand
Negative sequences: BED file containing DNase hypersensitive sites
- We recommend using data from Meuleman et al. (available at [https://www.encodeproject.org/annotations/ENCSR857UZV/])
- These represent accessible chromatin regions without specific TF binding

# Create directories for input data
mkdir -p data/bed_files

# Example file structure:
# data/bed_files/positive_sequences.bed  # ChIP-seq peaks from ENCODE
# data/bed_files/negative_sequences.bed  # DNase hypersensitive sites

BED File Format Requirements:

Your BED files should contain at least 10 columns with peak summit information:

chr1    1000    1200    peak1    100    +    1.5    10.2    50    100

Columns 1-3: chromosome, start, end
Column 7: enrichment score (for duplicate removal)
Column 10: peak summit position (for centering)

3. Dataset Splitting

Run the dataset split script to create training and test sets from your ChIP-seq data:

python dataset_split.py positive_bed_folder negative_bed_folder data/genome/hg38.fa data/processed

Input Structure:

Organize your input data as follows:

positive_bed_folder/
├── TF1_peaks.bed
├── TF2_peaks.bed
└── TF3_peaks.bed

negative_bed_folder/
└── DNase_sites.bed  # Single negative dataset file

Output Structure:

The script generates the following directory structure:

data/processed/
├── traindata/
│   ├── positive/
│   │   ├── TF1_train.fa
│   │   ├── TF2_train.fa
│   │   └── TF3_train.fa
│   └── negative/
│       ├── TF1_neg_train.fa
│       ├── TF2_neg_train.fa
│       └── TF3_neg_train.fa
└── testdata/
    ├── positive/
    │   ├── TF1_test.fa
    │   ├── TF2_test.fa
    │   └── TF3_test.fa
    └── negative/
        ├── TF1_neg_test.fa
        ├── TF2_neg_test.fa
        └── TF3_neg_test.fa

Command Arguments:

positive_bed_folder: Directory containing ChIP-seq BED files (one per transcription factor)
negative_bed_folder: Directory containing a single BED file with DNase hypersensitive sites
genome_file: Path to reference genome FASTA file (e.g., hg38.fa)
output_directory: Directory where processed datasets will be saved

4. Wavelet Transform and Colormap Generation

Convert DNA sequences to wavelet-based image representations using the continuous wavelet transform:

# Process positive training sequences
python create_colormaps.py \
    data/processed/traindata/positive/TF1_train.fa \
    data/processed/images/train/positive \
    0 \
    TF1_pos \
    1000 \
    8

# Process negative training sequences  
python create_colormaps.py \
    data/processed/traindata/negative/TF1_neg_train.fa \
    data/processed/images/train/negative \
    0 \
    TF1_neg \
    1000 \
    8

# Process positive test sequences
python create_colormaps.py \
    data/processed/testdata/positive/TF1_test.fa \
    data/processed/images/test/positive \
    0 \
    TF1_pos \
    500 \
    8

# Process negative test sequences
python create_colormaps.py \
    data/processed/testdata/negative/TF1_neg_test.fa \
    data/processed/images/test/negative \
    0 \
    TF1_neg \
    500 \
    8

Command Arguments:

fasta_file: Path to input FASTA file containing DNA sequences
output_directory: Directory where colormap images will be saved
synchrosqueeze: Use synchrosqueezing transform (1) or standard CWT (0)
prefix: Prefix for output image filenames
threshold: Maximum number of sequences to process (for limiting dataset size)
num_workers: Number of CPU cores for parallel processing

Transform Options:

Standard CWT (synchrosqueeze=0): Uses continuous wavelet transform coefficients
Synchrosqueezed CWT (synchrosqueeze=1): Enhanced frequency localization using synchrosqueezing

5. ResNet50 Training and Classification

Train the ResNet50 convolutional neural network on the generated wavelet images:

Prerequisites:

Before training, ensure your image directory structure matches the expected format:

data/processed/images/
├── TF_NAME/
│   ├── train/
│   │   ├── positive/
│   │   │   ├── TF1_pos_cmap_0.png
│   │   │   └── ...
│   │   └── negative/
│   │       ├── TF1_neg_cmap_0.png
│   │       └── ...
│   └── test/
│       ├── positive/
│       │   ├── TF1_pos_cmap_0.png
│       │   └── ...
│       └── negative/
│           ├── TF1_neg_cmap_0.png
│           └── ...

Training Command:

# Train ResNet50 for a specific transcription factor
python train_resnet.py --tf_name PATZ1 --batch_size 16

# Example for different transcription factors
python train_resnet.py --tf_name CTCF --batch_size 32
python train_resnet.py --tf_name MAX --batch_size 8

Command Arguments:

--tf_name: Name of the transcription factor (must match directory name)
--batch_size: Training batch size (adjust based on GPU memory)

Training Configuration:

The model uses the following default settings:

Architecture: ResNet50 with pre-trained ImageNet weights
Fine-tuning: Full model fine-tuning with adaptive learning rate
Input size: Images resized to 224×224 pixels
Normalization: ImageNet standard normalization
Optimizer: AdamW with learning rate 3e-5
Scheduler: ReduceLROnPlateau based on validation accuracy
Loss function: CrossEntropyLoss for binary classification
Epochs: 3 (default, can be modified in code)

Output:

After training, the following files are generated:

models/ResNet/TF_NAME/
└── ResNet_TF_NAME_1bs_fullft.pth  # Best model checkpoint

The script uses wandb to log the metrics during model's training. Please make sure to log in into your wandb account to track the run.

6. Model Interpretation with Grad-CAM

After training, analyze and interpret model predictions using Grad-CAM (Gradient-weighted Class Activation Mapping) to understand which regions of the wavelet images contribute most to the classification decisions and correlate them with biological ChIP-seq signals:

XAI Grad-CAM Analysis

The xai_grad_cam.py script provides comprehensive model interpretability by generating activation heatmaps and correlating them with experimental ChIP-seq data to validate biological relevance of learned features.

Prerequisites:

Before running Grad-CAM analysis, ensure you have:

Trained model: A saved ResNet50 model checkpoint from the training step
Test images: Wavelet-transformed images from your test dataset
ChIP-seq data: BigWig file containing experimental binding signals
Genomic coordinates: BED file with coordinates matching your test sequences

Command Usage:

# Basic Grad-CAM analysis for a specific transcription factor
python xai_grad_cam.py \
    -indir data/processed/images/test/positive \
    -outdir results/gradcam/CTCF \
    -model models/ResNet/CTCF/ResNet_CTCF_1bs_fullft.pth \
    -bigwig data/chipseq/CTCF_signal.bw \
    -bedfile data/bed_files/CTCF_test_sequences.bed \
    -layer 4 \
    -convIndex 1 \
    -blockIndex 2

Command Arguments:

-indir: Directory containing test images (wavelet-transformed sequences)
-outdir: Output directory for Grad-CAM visualizations and results
-model: Path to trained ResNet50 model checkpoint
-bigwig: Path to BigWig file containing ChIP-seq signal values
-bedfile: Path to BED file with genomic coordinates of test sequences
-layer: ResNet layer to analyze (1-4, corresponding to layer1-layer4)
-convIndex: Convolutional layer index within the block (1-3)
-blockIndex: Block index within the specified layer (0-based)
-max_sequences: Optional limit on number of sequences to process

Layer Selection Guide:

ResNet50 architecture provides different levels of feature abstraction (from Layer1 to Layer4) Each layer contains multiple blocks with convolutional filters that can be analyzed independently.

7. End-to-End Pipeline Script

To simplify the entire workflow — from data acquisition to model training — WaveDNA provides a streamlined script for launching the full ResNet-based classification pipeline with minimal manual intervention.

Script: `ResNet_pipeline.py`

This script automates the following steps:

Download ChIP-seq BED file from ENCODE based on the provided accession code.
Split the dataset into training and testing FASTA files.
Generate wavelet-based images from the FASTA sequences.
Train the ResNet50 model on the generated images.

Usage

python run_pipeline.py --tf_name CTCF --bedCode ENCFF001ABC --run_model 1

Arguments:

--tf_name: Name of the transcription factor (used for folder naming and labeling)

--bedCode: ENCODE accession ID for the ChIP-seq BED file

--run_model: Set to 1 to train the model after preprocessing, 0 to skip training

Comparisons

WaveDNA has been benchmarked against several state-of-the-art DNA sequence modeling approaches. For reproducibility, the finetune folder includes scripts and configurations to replicate the experiments reported in our paper:

NT (Nucleotide Transformer)
Hyena
DNABERT-2

Each subfolder contains a dedicated bash script to run the corresponding model. Please note that some models require specific software environments or dependencies. Refer to the respective official GitHub repositories for setup instructions and requirements. Using the provided scripts with the recommended environments will allow you to reproduce our comparative results efficiently.

Citation, Contact, and License

If you use WaveDNA in your research, please cite: Improving DNA Modeling with WaveDNA: Enhancing Speed, Generalizability, and Interpretability through Wavelet Transformation

WaveDNA is released under the MIT License, allowing free use, modification, and distribution with proper attribution.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
finetune		finetune
src		src
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

WaveDNA

Overview

Key Features

How It Works

1. Sequence Encoding

2. Deep Learning Classification

Data for reproducing the analysis

Example

Usage

1. Environment Setup

2. Data Preparation

Download Reference Genome

Prepare Input Data

BED File Format Requirements:

3. Dataset Splitting

Input Structure:

Output Structure:

Command Arguments:

4. Wavelet Transform and Colormap Generation

Command Arguments:

Transform Options:

5. ResNet50 Training and Classification

Prerequisites:

Training Command:

Command Arguments:

Training Configuration:

Output:

6. Model Interpretation with Grad-CAM

XAI Grad-CAM Analysis

Prerequisites:

Command Usage:

Command Arguments:

Layer Selection Guide:

7. End-to-End Pipeline Script

Script: ResNet_pipeline.py

Usage

Comparisons

Citation, Contact, and License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Script: `ResNet_pipeline.py`

Packages