A reproducible proteomics pipeline that identifies and ranks cerebrospinal fluid (CSF)-detectable markers of autophagy and lysosomal function with rodent-to-human translation, integrating 13 mass-spectrometry datasets across ~3,600 samples.
The pipeline scores ~9,500 proteins across five evidence axes:
| Component | Weight | Source |
|---|---|---|
| Mouse CSF evidence | 0.25 | 7 mouse CSF datasets (D1-D5, D7-D8) |
| Human CSF evidence | 0.30 | Human Astral CSF discovery (D11) + validation (D10, D12) |
| EV support | 0.10 | Human cell-line EV secretome reference |
| Brain plausibility | 0.10 | Mouse brain lysate (D6) |
| Autophagy membership | 0.25 | Curated autophagy/lysosome gene list (R1, 599 genes) |
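
The weights above sum to 1.0, which suggests a weighted linear combination of per-axis scores. The actual aggregation is defined in 03_evidence_scoring.py and config.yaml; the following is only a minimal sketch. It assumes each axis score is already normalised to [0, 1] and that a missing axis contributes 0. The axis key names and the `composite_score` function are illustrative, not taken from the repository.

```python
from __future__ import annotations

# Weights as listed in the table above; the key names are illustrative.
WEIGHTS = {
    "mouse_csf": 0.25,
    "human_csf": 0.30,
    "ev_support": 0.10,
    "brain_plausibility": 0.10,
    "autophagy_membership": 0.25,
}


def composite_score(axis_scores: dict[str, float]) -> float:
    """Weighted sum of per-axis scores.

    Assumption: scores lie in [0, 1] and a missing axis counts as 0.
    """
    for axis, value in axis_scores.items():
        if axis not in WEIGHTS:
            raise KeyError(f"unknown evidence axis: {axis}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{axis} score {value} is outside [0, 1]")
    return sum(w * axis_scores.get(axis, 0.0) for axis, w in WEIGHTS.items())


# Hypothetical protein with strong mouse CSF and autophagy evidence,
# moderate human CSF evidence, and no EV or brain lysate support.
example = {"mouse_csf": 1.0, "human_csf": 0.5, "autophagy_membership": 1.0}
print(round(composite_score(example), 3))
```
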
Hard gates (D11 tier A/B, mouse CSF tier A/B, R1 membership, plasma exclusion) filter candidates down to a 122-protein core panel and a top-80 shortlist.
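
The gating step can be pictured as a conjunction of boolean checks followed by a ranking. The sketch below is an illustration under stated assumptions rather than the logic of 04_autophagy_filter.py: the `Candidate` fields are hypothetical, "plasma exclusion" is read as dropping proteins flagged as likely plasma-derived, and ranking is assumed to be by descending composite score.

```python
from dataclasses import dataclass

PASSING_TIERS = {"A", "B"}


@dataclass
class Candidate:
    # All field names are hypothetical; the real column names may differ.
    gene: str
    d11_tier: str         # tier in the human Astral CSF discovery set (D11)
    mouse_csf_tier: str   # tier across the mouse CSF datasets
    in_r1: bool           # member of the curated autophagy/lysosome list (R1)
    plasma_flagged: bool  # assumed meaning: likely plasma-derived, so excluded
    score: float          # composite evidence score


def passes_hard_gates(c: Candidate) -> bool:
    """A candidate must clear every gate to enter the core panel."""
    return (
        c.d11_tier in PASSING_TIERS
        and c.mouse_csf_tier in PASSING_TIERS
        and c.in_r1
        and not c.plasma_flagged
    )


def shortlist(candidates, top_n=80):
    """Keep gate-passing candidates, ranked by descending score."""
    core = [c for c in candidates if passes_hard_gates(c)]
    core.sort(key=lambda c: c.score, reverse=True)
    return core[:top_n]


# Placeholder entries, not real results.
candidates = [
    Candidate("GENE_A", "A", "B", True, False, 0.82),
    Candidate("GENE_B", "A", "C", True, False, 0.91),  # fails mouse CSF tier
    Candidate("GENE_C", "B", "A", True, True, 0.77),   # plasma-flagged
    Candidate("GENE_D", "B", "A", True, False, 0.88),
]
print([c.gene for c in shortlist(candidates)])
```

Note that a high score alone is not enough here: GENE_B has the highest score in the example but is dropped because it fails one gate.
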
| Step | Script | Description |
|---|---|---|
| 00 | 00_setup.py | Validate inputs, create directory structure |
| 01 | 01_extract_and_qc.py | Parse 13 datasets into standardised format |
| 02 | 02_orthology_mapping.py | Mouse-to-human orthology via g:Profiler |
| 03 | 03_evidence_scoring.py | Compute 5-axis evidence scores for all proteins |
| 04 | 04_autophagy_filter.py | Apply hard gates, rank candidates |
| 05 | 05_peptide_feasibility.py | Cross-species peptide conservation & assay feasibility |
| 06 | 06_module_validation.py | Co-abundance module analysis on Astral data |
| 07 | 07_validate_and_report.py | QC, sensitivity analyses, figures, methods draft |
| 08 | 08_ad_model_crosscheck.py | Cross-check against AppNL-G-F AD mouse model CSF |
| 09 | 09_ev_gate_analysis.py | EV hard-gate sensitivity analysis |
| 10 | 10_supplementary_gene_scoring.py | Score supplementary curated gene lists (R1b) |
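
Because every step script carries a two-digit prefix, the run order can be derived from the filenames. The sketch below is a hypothetical convenience wrapper, not part of the repository: `ordered_steps` and `run_pipeline` are illustrative names, and it assumes each step can be run from inside the scripts directory and should halt the run if it exits with an error.

```python
import re
import subprocess
import sys
from pathlib import Path

# Matches step scripts such as "03_evidence_scoring.py".
STEP_PATTERN = re.compile(r"^(\d{2})_.+\.py$")


def ordered_steps(filenames):
    """Return step scripts sorted by their two-digit numeric prefix."""
    steps = []
    for name in filenames:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def run_pipeline(script_dir: Path) -> None:
    """Run each step in order; check=True stops at the first failure."""
    names = [p.name for p in script_dir.iterdir() if p.is_file()]
    for name in ordered_steps(names):
        print(f"Running {name}")
        subprocess.run([sys.executable, name], cwd=script_dir, check=True)


# Non-step files (config, environment spec) are ignored by the pattern.
demo = ["10_supplementary_gene_scoring.py", "config.yaml", "00_setup.py",
        "02_orthology_mapping.py", "environment.yml"]
print(ordered_steps(demo))
```

With the conda environment active, calling `run_pipeline(Path("DataProc/Scripts"))` from the repository root would then execute steps 00 through 10 in sequence.
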
- Conda or Mamba
- Raw data files (see raw/README.md for sourcing instructions)
# Clone the repository
git clone https://github.com/Sigray-Lab/CSF-Panel-Discovery.git
cd CSF-Panel-Discovery
# Create conda environment
conda env create -f DataProc/Scripts/environment.yml
conda activate csf_panel
# Place raw data files in raw/ (see raw/README.md)
# Run the pipeline
cd DataProc/Scripts
python 00_setup.py
python 01_extract_and_qc.py
python 02_orthology_mapping.py
python 03_evidence_scoring.py
python 04_autophagy_filter.py
python 05_peptide_feasibility.py
python 06_module_validation.py
python 07_validate_and_report.py
python 08_ad_model_crosscheck.py
python 09_ev_gate_analysis.py
python 10_supplementary_gene_scoring.py

CSF_panel_project/
├── DataProc/
│ ├── Scripts/ # Pipeline code (11 steps + utilities)
│ │ ├── config.yaml # Central configuration (all paths, weights, thresholds)
│ │ ├── environment.yml # Conda environment specification
│ │ ├── 00_setup.py ... 10_supplementary_gene_scoring.py
│ │ └── utils/ # Shared modules (parsers, scoring, orthology, QC, viz)
│ ├── Outputs/ # Final deliverables (ranked lists, figures, reports)
│ ├── DerivedData/ # Intermediate files (regenerated by pipeline)
│ ├── QC/ # Quality control artifacts (regenerated)
│ ├── Log/ # Timestamped execution logs (regenerated)
│ └── project_plan.md # Full pipeline specification (v3)
└── raw/ # Source data (not included; see raw/README.md)
- Outputs/candidates_ranked.tsv: All ~9,500 scored proteins with full evidence breakdown
- Outputs/core_panel_shortlist.tsv: Top 80 shortlisted proteins (pass all hard gates)
- Outputs/master_pipeline_list.tsv: Master list of 139 proteins (122 core + 17 human-only additions)
- Outputs/figures/: Publication-ready PDF figures
- Outputs/methods_draft.md: Methods section draft
All tuneable parameters (weights, gate thresholds, tier definitions, file paths) are centralised in DataProc/Scripts/config.yaml. The pipeline is fully config-driven with no hardcoded paths.
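
When editing the weights, a quick sanity check that they still form a valid weighting can catch typos before a full run. The following is a minimal sketch under assumptions: the `weights` key and the axis names are hypothetical (check config.yaml for the real layout), and the dictionary stands in for whatever a YAML loader such as PyYAML's `yaml.safe_load` would return.

```python
import math


def validate_weights(config: dict) -> dict:
    """Check that evidence weights are each in [0, 1] and sum to 1.

    The "weights" key is a hypothetical name for illustration.
    """
    weights = config["weights"]
    out_of_range = {k: v for k, v in weights.items() if not 0.0 <= v <= 1.0}
    if out_of_range:
        raise ValueError(f"weights outside [0, 1]: {out_of_range}")
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return weights


# Stand-in for a parsed config; values mirror the weights table.
config = {
    "weights": {
        "mouse_csf": 0.25,
        "human_csf": 0.30,
        "ev_support": 0.10,
        "brain_plausibility": 0.10,
        "autophagy_membership": 0.25,
    }
}
print(sorted(validate_weights(config)))
```
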
TBD
TBD