This repository provides an end-to-end containerized environment for Next-Generation Sequencing (NGS) data analysis, specifically focusing on Variant Calling using the GATK (Genome Analysis Toolkit) best practices.
By utilizing Docker for containerization (and Pixi under the hood), this project guarantees reproducibility, ease of deployment, and a hassle-free setup of bioinformatics tools.
The container is pre-configured with the standard variant calling stack:
- Quality Control: `fastqc`, `multiqc`
- Read Trimming & Filtering: `fastp`
- Alignment: `bwa`
- BAM Manipulation: `samtools`
- VCF Manipulation: `bcftools`
- Metrics & Duplicates: `picard`
- Variant Calling: `gatk4`
When you clone this repository, you will see a few core files. Here is what they do:
- `README.md`: The file you are reading right now! It contains the documentation and instructions for the project.
- `Dockerfile`: The "recipe" that Docker uses to build your isolated container. It tells Docker to start with Ubuntu, install `pixi`, and copy your project files inside. (See the "For Students" section at the bottom for a line-by-line breakdown!)
- `pixi.toml`: The configuration file for the Pixi package manager. This is where we list all the bioinformatics tools we want to use (like `bwa`, `samtools`, `gatk4`) and their specific versions.
- `pixi.lock`: A generated file that locks down the exact versions of every single sub-dependency. This ensures that the environment builds exactly the same way every time, completely eliminating the "it works on my machine" problem.
Follow these steps to clone the repository, set up your environment, and run a complete Variant Calling pipeline.
First, you need to download this project to your computer. Open your terminal (or Command Prompt/PowerShell on Windows) and run:
```bash
git clone https://github.com/deepbioacademy/ngs_workflow.git
cd ngs_workflow
```

Bioinformatics tools often have complex dependencies. Installing them directly on your computer can cause conflicts and errors.
Docker solves this by creating a "Container": a lightweight, standalone, and completely isolated mini-computer running inside your actual computer.
- What it does: It packages all our necessary tools into a single blueprint called an Image (defined by the `Dockerfile`).
- Why it's great: When you start a Container from this Image, you get an environment that works exactly the same way on every machine. No installation headaches!
Installation: Before you begin, install Docker Desktop on your machine:
- Windows / Mac: Download and install Docker Desktop.
- Linux: Follow the instructions for Docker Engine.
Once installed, open Docker Desktop and make sure it is running. Verify it in your terminal: `docker --version`.
Think of this step as downloading and installing all the bioinformatics tools into your isolated mini-computer. You only need to do this once.
Make sure you are inside the ngs_workflow folder, then run:
```bash
docker build -t ngs-workflow .
```

What does this command mean?
- `docker build`: Create a new image.
- `-t ngs-workflow`: Name (tag) our image "ngs-workflow".
- `.`: Look in the current folder for the build instructions (the `Dockerfile`).
Now we start the Container and "mount" (connect) our local folders so the tools inside can read our raw data and save results to our actual computer.
Run the following command:
```bash
docker run -it \
  -v $(pwd)/data:/workspace/data \
  -v $(pwd)/results:/workspace/results \
  ngs-workflow
```

(Windows users: Replace `$(pwd)` with `%cd%` in Command Prompt, or `${PWD}` in PowerShell.)
You are now inside the Docker container! Your terminal prompt will change, and every tool is ready to be used.
Now that you are inside the container, let's run a complete analysis. We will use sample data from the 1000 Genomes Project.
Before running the tools, we need to create the required output folders and download our sample data.
```bash
# Create all necessary folders
mkdir -p data/raw data/reference results/qc results/trimmed results/alignment results/variants results/multiqc

# Download sample data
cd data/raw
curl -O ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00096/sequence_read/SRR062634_1.filt.fastq.gz
curl -O ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00096/sequence_read/SRR062634_2.filt.fastq.gz
cd ../..   # return to the project root; the commands below use paths relative to it
```

Evaluate the quality of your raw reads, trim adapters, and re-evaluate the trimmed reads.
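Before running the QC tools, you can quickly sanity-check the downloads. A FASTQ record is always exactly 4 lines (header, sequence, `+`, qualities), so the line count divided by 4 gives the read count. A minimal sketch (the `count_reads` helper is ours, not part of the pipeline):

```shell
# Count reads in a (possibly gzipped) FASTQ file.
# zcat -f passes plain-text files through unchanged, so this
# works on both .fastq and .fastq.gz inputs.
count_reads() {
  zcat -f "$1" | wc -l | awk '{ print $1 / 4 }'
}

# Toy demonstration with a 2-read FASTQ written on the spot:
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' > toy.fastq
count_reads toy.fastq   # prints 2

# Real usage (sketch): count_reads data/raw/SRR062634_1.filt.fastq.gz
```

A truncated download usually shows up as a read count that is not a whole number.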
- `fastqc`: Analyzes sequence data and generates an HTML report detailing read quality. We run this before and after trimming to see the improvement.
  - `-o`: Specifies the output directory to save the HTML and ZIP reports.
- `fastp`: Automatically filters out low-quality reads and trims adapter sequences so they don't interfere with alignment.
  - `-i` / `-I`: The input forward (R1) and reverse (R2) FASTQ files.
  - `-o` / `-O`: The output (trimmed) forward and reverse FASTQ files.
  - `--json` / `--html`: Saves the fastp quality reports in JSON and HTML formats so MultiQC can read them.
- `multiqc`: Scans the `results/qc/` directory and aggregates all FastQC and fastp reports into a single, easy-to-read dashboard.
  - `-o`: Specifies the output directory for the final MultiQC dashboard.
```bash
# 1. Run FastQC on raw data (before trimming)
fastqc data/raw/SRR062634_1.filt.fastq.gz data/raw/SRR062634_2.filt.fastq.gz -o results/qc/

# 2. Trim reads with fastp (and save reports to the qc folder)
fastp -i data/raw/SRR062634_1.filt.fastq.gz -I data/raw/SRR062634_2.filt.fastq.gz \
  -o results/trimmed/SRR062634_1_trimmed.fastq.gz -O results/trimmed/SRR062634_2_trimmed.fastq.gz \
  --json results/qc/fastp.json --html results/qc/fastp.html

# 3. Run FastQC on trimmed data (after trimming)
fastqc results/trimmed/SRR062634_1_trimmed.fastq.gz results/trimmed/SRR062634_2_trimmed.fastq.gz -o results/qc/

# 4. Generate a combined report with MultiQC
multiqc results/qc/ -o results/multiqc/
```

Before alignment, index your reference genome (this only needs to be done once per reference).
(Make sure you have a reference genome `ref.fa` in `data/reference/` before running this.)
- `bwa index`: Creates an FM-index for the Burrows-Wheeler Aligner, enabling fast mapping.
- `gatk CreateSequenceDictionary`: Creates a `.dict` file containing contig names and lengths, required by GATK.
- `samtools faidx`: Creates a `.fai` index, allowing quick extraction of any sequence from the reference.
```bash
# Download the reference genome
cd data/reference/
curl -O https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
mv hg38.fa ref.fa   # rename so the indexing commands below can find it
cd ../..            # return to the project root

# Index with BWA
bwa index data/reference/ref.fa

# Create sequence dictionary for GATK
gatk CreateSequenceDictionary -R data/reference/ref.fa

# Create samtools index
samtools faidx data/reference/ref.fa
```

Align the trimmed reads to the reference genome.
- `bwa mem`: The algorithm that maps our short reads to the reference genome.
- `-t 4`: Uses 4 CPU threads to speed up the alignment process.
- `-R`: Adds "Read Group" information (e.g., `SM` for Sample Name, `PL` for Platform). This is strictly required by GATK for downstream processing!
- `>`: A bash redirect that saves the console output into a `.sam` file instead of printing it to the screen.
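The read-group string is easy to mistype, so it can help to build it from variables. The `make_rg` helper below is our own sketch (not part of the repository); the sample values mirror the actual bwa mem command.

```shell
# Build a bwa-compatible @RG string from its parts.
# The "\\t" in the format string emits a literal backslash-t;
# bwa itself turns those into real tab characters.
make_rg() {
  local id=$1 sample=$2 platform=$3
  printf '@RG\\tID:%s\\tSM:%s\\tPL:%s' "$id" "$sample" "$platform"
}

RG=$(make_rg SRR062634 HG00096 ILLUMINA)
printf '%s\n' "$RG"   # prints @RG\tID:SRR062634\tSM:HG00096\tPL:ILLUMINA

# Sketch of usage:
#   bwa mem -t 4 -R "$RG" data/reference/ref.fa R1.fastq.gz R2.fastq.gz > out.sam
```

Parameterizing the sample ID this way also makes it easy to loop the same command over many samples later.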
```bash
bwa mem -t 4 -R "@RG\tID:SRR062634\tSM:HG00096\tPL:ILLUMINA" \
  data/reference/ref.fa \
  results/trimmed/SRR062634_1_trimmed.fastq.gz \
  results/trimmed/SRR062634_2_trimmed.fastq.gz > results/alignment/SRR062634.sam
```

Convert SAM to BAM and sort it.
- `samtools view`: A tool to read and convert SAM/BAM files.
  - `-S`: Tells samtools the input is a SAM file.
  - `-b`: Tells samtools to output a BAM file.
- `samtools sort`: Sorts the reads by their coordinate position on the genome (required for variant calling).
  - `-o`: Specifies the output filename for the sorted BAM file.
- `samtools index`: Creates a `.bai` index so tools can randomly access the BAM file quickly.
```bash
samtools view -Sb results/alignment/SRR062634.sam | samtools sort -o results/alignment/SRR062634_sorted.bam
samtools index results/alignment/SRR062634_sorted.bam
```

Mark PCR duplicates so they don't skew variant calling.
- `gatk MarkDuplicates`: Identifies read pairs that likely originated from a single DNA molecule (PCR duplicates) and flags them so they are ignored by the variant caller. This prevents false positive variant calls caused by amplification biases.
  - `-I`: The input sorted BAM file.
  - `-O`: The output BAM file with duplicates marked.
  - `-M`: A required text file where GATK will save metrics/statistics about the duplicates it found.
```bash
gatk MarkDuplicates \
  -I results/alignment/SRR062634_sorted.bam \
  -O results/alignment/SRR062634_marked_dup.bam \
  -M results/alignment/marked_dup_metrics.txt

samtools index results/alignment/SRR062634_marked_dup.bam
```

Finally, call variants using GATK HaplotypeCaller.
- `gatk HaplotypeCaller`: The core tool that actually identifies SNPs (Single Nucleotide Polymorphisms) and Indels (Insertions/Deletions) by assembling haplotypes in active regions. It outputs a VCF (Variant Call Format) file containing all identified mutations.
  - `-R`: The reference genome FASTA file.
  - `-I`: The input BAM file (after sorting and duplicate marking).
  - `-O`: The output VCF file where the final variants will be saved.
```bash
gatk HaplotypeCaller \
  -R data/reference/ref.fa \
  -I results/alignment/SRR062634_marked_dup.bam \
  -O results/variants/SRR062634_raw_variants.vcf
```

If you prefer not to use Docker, you can install the tools directly on your system using Pixi:
- Install Pixi: `curl -fsSL https://pixi.sh/install.sh | bash`
- Install the tools: `pixi install` (run this inside the `ngs_workflow` folder)
- Enter the environment: `pixi shell`
All dependencies are defined in pixi.toml.
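For orientation, a `pixi.toml` for this tool stack might look roughly like the sketch below. The channels and wildcard version specifiers are illustrative assumptions; the repository's actual `pixi.toml` is authoritative and pins specific versions.

```toml
# Illustrative sketch only; see the repository's pixi.toml for the real pins.
[project]
name = "ngs_workflow"
channels = ["conda-forge", "bioconda"]
platforms = ["linux-64"]

[dependencies]
fastqc = "*"
multiqc = "*"
fastp = "*"
bwa = "*"
samtools = "*"
bcftools = "*"
picard = "*"
gatk4 = "*"
```

The bioinformatics tools come from the bioconda channel, which is why it appears alongside conda-forge.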
If you are learning how to build your own Docker containers, here is a step-by-step breakdown of how we created the Dockerfile for this project.
The Dockerfile is essentially a recipe. Docker reads this recipe from top to bottom to build the image.
```dockerfile
FROM ubuntu:22.04
```

Every Dockerfile must start with a `FROM` instruction. This tells Docker what operating system to start with. We are using a clean, bare-bones Ubuntu Linux system.
```dockerfile
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
    curl \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*
```

- `ENV`: Sets an environment variable. We set `DEBIAN_FRONTEND` to `noninteractive` so that Ubuntu doesn't ask us any "Yes/No" prompts while installing software (since no human is there to click "Yes" during the build process).
- `RUN`: Executes a command inside the container. We use the `apt-get` package manager to update the system and install `curl` (a tool to download things from the internet) along with `ca-certificates`.
```dockerfile
RUN curl -fsSL https://pixi.sh/install.sh | bash
ENV PATH="/root/.pixi/bin:${PATH}"
```

We use curl to download and run the installation script for Pixi. Then, we update the system's `PATH` so that the computer knows where to find the `pixi` command.
```dockerfile
WORKDIR /workspace
```

`WORKDIR` creates a folder named `/workspace` inside the container and navigates into it (like running `mkdir /workspace` and then `cd /workspace`). Everything we do from now on happens inside this folder.
```dockerfile
COPY pixi.toml pixi.lock* ./
RUN pixi install
COPY . .
```

- `COPY`: Takes files from your actual computer and copies them into the container. We first copy `pixi.toml` (which lists `bwa`, `gatk4`, etc.).
- `RUN pixi install`: This tells Pixi to read the `.toml` file and download all the bioinformatics tools.
- `COPY . .`: Finally, we copy the rest of our project files (like scripts or READMEs) into the container. (We do this in two steps to make rebuilding the image faster! If we only change a script, Docker doesn't have to re-download the heavy bioinformatics tools.)
```dockerfile
ENTRYPOINT ["pixi", "run"]
CMD ["bash"]
```

- `ENTRYPOINT`: This is the main command that runs when the container starts. We set it to `pixi run`, meaning everything executed in this container will automatically use the Pixi environment where our tools are installed.
- `CMD`: This provides the default argument to the Entrypoint. If the user just runs `docker run ngs-workflow`, it will execute `pixi run bash`, dropping them into an interactive terminal!
