g is a GPU-accelerated GWAS engine for BGEN-backed REGENIE step 2 association scans. The current package is a Python 3.14 API/CLI backed by JAX compute code and a Rust/PyO3 native extension for BGEN parsing and output persistence.
The active public surface is intentionally narrow:
- Python API: g.regenie(config) and g.regenie.from_options({...})
- CLI: g regenie ..., g-regenie ..., and g config init|validate|explain
- Inputs: BGEN 1.2 genotype data, optional .sample, phenotype/covariate tables, and REGENIE step 1 _pred.list files
- Outputs: resumable Arrow chunk run directories, with optional compressed final.parquet finalization
Legacy direct linear and logistic entrypoints are no longer public.
Active development targets biobank-scale REGENIE step 2 workflows.
- Quantitative REGENIE step 2 is the primary supported workflow.
- Binary REGENIE step 2 is public but still partial/evolving.
- REGENIE step 1 is not implemented in g; use original regenie to generate prediction lists.
- The default REGENIE step 2 chunk size is 8192 variants.
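To make the chunking concrete, the following is a minimal sketch of how a variant index range could be split into fixed-size chunks. The helper name `chunk_ranges` is hypothetical and is not part of the g API; only the 8192-variant default comes from the text above.

```python
from collections.abc import Iterator

# Default chunk size stated in this README.
DEFAULT_CHUNK_SIZE = 8192


def chunk_ranges(
    variant_count: int, chunk_size: int = DEFAULT_CHUNK_SIZE
) -> Iterator[tuple[int, int]]:
    """Yield half-open (start, stop) variant index ranges, one per chunk.

    Hypothetical helper for illustration only; g's internal chunking
    may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, variant_count, chunk_size):
        yield start, min(start + chunk_size, variant_count)


# 20,000 variants split into two full chunks and one partial chunk.
ranges = list(chunk_ranges(20_000))
```

The last chunk is simply shorter than the others, so no padding is needed at this level.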
Binary mode currently supports score-test-only output by default and approximate Firth fallback with --firth --approx. SPA and exact Firth without --approx are exposed as REGENIE-style flags but are not implemented yet.
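As a rough illustration of the fallback idea: in upstream REGENIE, --pThresh is the p-value threshold below which a correction such as Firth is applied. The sketch below assumes g follows the same convention; the function name `select_firth_candidates` is hypothetical and the p-values are made up for the example.

```python
def select_firth_candidates(score_p_values: list[float], p_thresh: float) -> list[int]:
    """Return indices of variants whose score-test p-value is below p_thresh.

    Hypothetical helper: it only shows the thresholding step, not the
    approximate Firth regression itself.
    """
    return [index for index, p in enumerate(score_p_values) if p < p_thresh]


# Illustrative score-test p-values for four variants.
score_p_values = [0.5, 0.009, 0.01, 1e-8]
candidates = select_firth_candidates(score_p_values, p_thresh=0.01)
```

Only the selected variants would be passed to the more expensive fallback; the rest keep their score-test result.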
- src/g/ - Python package, CLI, API, JAX setup, I/O, and compute orchestration
- src/g/compute/ - REGENIE step 2 quantitative and binary kernels
- src/g/engine/ - BGEN-backed pipeline orchestration and cache warming
- src/g/io/ - input source handling and output run management
- src/*.rs - Rust native extension modules for BGEN, sample, pipeline, and output paths
- benches/ - Rust Criterion benchmarks
- tests/ - pytest coverage for API, CLI, I/O, Rust architecture, and REGENIE pipelines
- scripts/ - data setup, benchmark, profiling, and server bootstrap utilities
- docs/ - development notes, roadmaps, style guide, and Ubuntu/SLURM instructions
- archive/ - archived reference/experimental code, not the active package
Local generated state lives in data/, .tools/, .venv/, and target/; these are git-ignored.
The project is managed with uv, just, Rust/Cargo, and maturin.
- Python: >=3.14,<3.15
- Python runtime dependencies: JAX, NumPy, Click
- Native extension: Rust 2024, PyO3 ABI3 for Python 3.14
- Benchmark/data tools: plink, plink2, regenie, zstd
- Optional GPU workflow: CUDA-capable JAX environment and SLURM access
On systems with Nix, the flake provides the expected development tools:
nix develop
On the Ubuntu/SLURM server, bootstrap repo-local tools first:
UV_CACHE_DIR=/tmp/g-uv-cache uv run --no-project python scripts/bootstrap_server_tools.py
source scripts/server_env.sh
CPU-oriented development environment:
just bootstrap
just doctor
GPU-capable development environment:
just bootstrap-gpu
just doctor-jax
Server-specific checks:
just doctor-server
just doctor-baselines
Prepare the local 1000 Genomes chromosome 22 benchmark data and simulated phenotypes:
just setup-data
Generate binary REGENIE step 1 baseline predictions for binary step 2:
just setup-binary-baseline
Quantitative REGENIE step 2:
uv run g \
  regenie \
  --step 2 \
  --qt \
  --bgen data/1kg_chr22_full.bgen \
  --sample data/1kg_chr22_full.sample \
  --phenoFile data/pheno_cont.txt \
  --phenoCol phenotype_continuous \
  --covarFile data/covariates.txt \
  --covarColList age,sex \
  --pred data/baselines/regenie_step1_qt_pred.list \
  --out data/example_regenie2
Binary traits with approximate Firth fallback:
uv run g \
  regenie \
  --step 2 \
  --bt \
  --bgen data/1kg_chr22_full.bgen \
  --sample data/1kg_chr22_full.sample \
  --phenoFile data/pheno_bin.txt \
  --phenoCol phenotype_binary \
  --covarFile data/covariates.txt \
  --covarColList age,sex \
  --pred data/baselines/regenie_step1_pred.list \
  --firth \
  --approx \
  --pThresh 0.01 \
  --out data/example_regenie2_binary
Config files use the same option names under TOML sections:
uv run g config init --out regenie.toml
uv run g config validate regenie.toml
uv run g regenie --config regenie.toml --g-device gpu
Runtime configuration is resolved in this order: packaged defaults from
src/g/config.default.toml, values in --config, then explicit CLI flags. The
packaged default file sets safe runtime defaults for trait mode, binary fallback
knobs, compute, output, and diagnostics, but it intentionally omits
workload-specific required inputs such as bgen, phenoFile, phenoCol,
pred, and out.
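The precedence described above can be sketched as a layered merge in which later layers override earlier ones. This is an illustration under stated assumptions, not g's implementation: `resolve_options` and `missing_required` are hypothetical names, options are shown as a flat dictionary rather than TOML sections, and the default values in the example are placeholders.

```python
from typing import Any

# Workload-specific inputs that the packaged defaults intentionally omit.
REQUIRED_INPUTS = ("bgen", "phenoFile", "phenoCol", "pred", "out")


def resolve_options(
    packaged_defaults: dict[str, Any],
    config_file: dict[str, Any],
    cli_flags: dict[str, Any],
) -> dict[str, Any]:
    """Merge option layers: defaults, then --config values, then CLI flags."""
    resolved = dict(packaged_defaults)
    resolved.update(config_file)
    # A CLI flag that was not passed (None) must not mask a lower layer.
    resolved.update({k: v for k, v in cli_flags.items() if v is not None})
    return resolved


def missing_required(resolved: dict[str, Any]) -> list[str]:
    """List required inputs that no layer supplied."""
    return [name for name in REQUIRED_INPUTS if name not in resolved]


# Placeholder values for illustration; not the real packaged defaults.
defaults = {"bsize": 8192, "g-device": "cpu"}
from_config = {"bgen": "data/1kg_chr22_full.bgen", "g-device": "gpu"}
from_cli = {"g-device": None, "bsize": 4096}

resolved = resolve_options(defaults, from_config, from_cli)
still_missing = missing_required(resolved)
```

Here the config file wins over the default for g-device, the explicit CLI value wins for bsize, and the required inputs that no layer provided are reported rather than silently defaulted.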
Useful execution flags include --g-device cpu|gpu, --bsize, --g-variant-limit, --g-staging-depth, --g-resume, --g-trusted-no-missing-diploid, --g-writer-threads, --g-writer-queue-depth, JAX cache settings, BGEN decode tiling, and binary/Firth solver settings.
from pathlib import Path
import g
artifacts = g.regenie.from_options(
    {
        "step": 2,
        "qt": True,
        "bgen": Path("data/1kg_chr22_full.bgen"),
        "sample": Path("data/1kg_chr22_full.sample"),
        "phenoFile": Path("data/pheno_cont.txt"),
        "phenoCol": "phenotype_continuous",
        "covarFile": Path("data/covariates.txt"),
        "covarColList": "age,sex",
        "pred": Path("data/baselines/regenie_step1_qt_pred.list"),
        "out": Path("data/example_regenie2"),
        "g-device": "cpu",
    }
)
The API returns g.RunArtifacts with the output run directory and finalized Parquet path when Parquet finalization is enabled.
Given --out data/example_regenie2, g writes Arrow chunks under a g run directory and finalizes Parquet by default:
data/example_regenie2.g/phenotype_continuous.regenie2_linear.run/
  chunks/
    chunk_000000000.arrow
    chunk_000000001.arrow
  effective_config.toml
  run_manifest.json
  final.parquet
Binary runs use the .regenie2_binary.run suffix. Arrow chunks are written incrementally, and an interrupted run can be resumed with --g-resume. --g-output-format selects parquet or arrow.
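The naming pattern in the example above can be reproduced with a few lines of pathlib, which may help when locating results from scripts. This sketch is inferred from the example layout only: `expected_run_dir` and `chunk_file_name` are hypothetical helpers, not functions provided by g, and the actual naming rules may have cases the example does not show.

```python
from pathlib import Path


def expected_run_dir(out: Path, pheno_col: str, binary: bool = False) -> Path:
    """Build the run directory path implied by --out and the phenotype column.

    Assumes the pattern <out>.g/<phenoCol>.regenie2_<kind>.run seen in the
    example layout.
    """
    kind = "regenie2_binary" if binary else "regenie2_linear"
    return out.with_name(out.name + ".g") / f"{pheno_col}.{kind}.run"


def chunk_file_name(index: int) -> str:
    """Format a chunk file name with a nine-digit, zero-padded index."""
    return f"chunk_{index:09d}.arrow"


run_dir = expected_run_dir(Path("data/example_regenie2"), "phenotype_continuous")
first_chunk = run_dir / "chunks" / chunk_file_name(0)
```

Zero-padded indices keep chunk files in numeric order under a plain lexicographic sort, which is convenient when checking how far an interrupted run progressed.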
Common local commands:
just format
just lint
just typecheck
just check
just test
No-Nix or reduced-toolchain lanes:
just check-local
just test-local
just test-local-focused
Native extension and Rust benchmarks:
just install-perf-extension
just benchmark-rust
just benchmark-bgen-reader
REGENIE comparison and profiling:
just benchmark-regenie-comparison
just benchmark-regenie-comparison-gpu
just profile-regenie-comparison
just profile-regenie-comparison-gpu
just profile-regenie2-deep-smoke
Binary GPU smoke and full runs:
just setup-regenie2-binary-gpu-inputs
just verify-regenie2-binary-gpu-inputs
just regenie2-binary-gpu-smoke
just regenie2-binary-gpu
Use the login node for dependency sync, formatting, linting, tests, data preparation, and baseline generation. Use SLURM recipes for GPU work:
just slurm-gpu-shell
just slurm-gpu-just doctor-jax
just slurm-regenie2-binary-gpu-smoke
just verify-regenie2-binary-gpu-smoke-output
just slurm-regenie2-binary-gpu
just verify-regenie2-binary-gpu-output
The default GPU node is landau. Override cluster settings with GWAS_ENGINE_GPU_NODE, GWAS_ENGINE_SLURM_PARTITION, GWAS_ENGINE_SLURM_ACCOUNT, GWAS_ENGINE_SLURM_CPUS_PER_TASK, GWAS_ENGINE_SLURM_MEMORY, GWAS_ENGINE_SLURM_TIME, and GWAS_ENGINE_SLURM_GPUS_PER_TASK.
Full server notes live in docs/UBUNTU_SLURM_DEVELOPMENT.md. Reduced-toolchain notes live in docs/NO_NIX_DEVELOPMENT.md. Coding rules live in docs/STYLEGUIDE.md.