Add STPath: spatial transcriptomics gene expression prediction from histology #233

Open
simonschindler wants to merge 13 commits into main from STPath

@simonschindler
Contributor

simonschindler commented Apr 9, 2026

Summary

This PR integrates STPath (Huang et al., 2025) into LazySlide as a new model of type spatial_transcriptomics. STPath predicts gene expression across the full transcriptome (~38 000 genes) from tile-level image embeddings and spatial coordinates, using a spatial transformer foundation model (STFM).

  • Adds src/lazyslide/models/tile_prediction/stpath.py (~1 900 lines): all tokenizers, the STFM architecture, and the STPath wrapper class that returns a fully annotated AnnData object
  • Registers the model as "stpath" in MODEL_REGISTRY via @register
  • Adds the paper citation to docs/source/references.bib
  • Adds torch-geometric as the [stpath] optional extra in pyproject.toml (needed for sparse gene expression encoding)
  • Adds tests/models/test_stpath_equivalence.py: clones the original STPath repository from GitHub at test time and asserts numerically equivalent outputs (within atol=1e-5) between the reimplementation and upstream, for the same weights and inputs. Set STPATH_REPO to a local clone to skip the network step.
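
The equivalence test in the last bullet could be structured roughly as follows. This is a minimal sketch, not the code in the PR: the helper names `resolve_upstream_repo` and `max_abs_diff` are hypothetical, the upstream URL is passed in as a parameter rather than assumed, and plain Python lists stand in for the model outputs that the real test would compare as tensors.

```python
import os
import subprocess
import tempfile
from pathlib import Path
from typing import Mapping, Optional, Sequence


def resolve_upstream_repo(
    repo_url: str, env: Mapping[str, str] = os.environ
) -> Optional[Path]:
    """Return a path to an upstream checkout, or None if none can be obtained."""
    local = env.get("STPATH_REPO")
    if local:
        # A local clone was supplied, so no network access is needed.
        return Path(local)
    target = Path(tempfile.mkdtemp()) / "upstream"
    try:
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, str(target)],
            check=True,
            capture_output=True,
            timeout=300,
        )
    except (OSError, subprocess.SubprocessError):
        # git missing, network unavailable, or timeout: the caller should skip.
        return None
    return target


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length outputs."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


ATOL = 1e-5  # tolerance quoted in the PR description
```

A test body would then skip when `resolve_upstream_repo(...)` returns `None`, and otherwise assert `max_abs_diff(ours, upstream) <= ATOL`. An absolute tolerance checks numerical closeness, which is a weaker guarantee than bitwise identity.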

Points needing reviewer input

1. License and commercial use are unspecified (None)
Neither the GitHub repository nor the HuggingFace model card carries a license file or declaration. The value previously present (CC BY-NC-ND 4.0) was an incorrect assumption and has been removed. Both registry fields, license and commercial, are now explicitly set to None. The assertion in test_models_general.py was relaxed from is not None to hasattr() to accommodate models with genuinely unspecified licenses. If the authors clarify the license, this should be updated.
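
To make the effect of the relaxed assertion concrete, here is a small sketch. `ModelCardSketch` is a hypothetical stand-in for a registry entry, not LazySlide's actual class, and the two check functions only illustrate the old and new assertion styles.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelCardSketch:
    """Hypothetical registry entry; field names follow the PR description."""

    name: str
    license: Optional[str] = None
    commercial: Optional[bool] = None


def check_strict(card: object) -> bool:
    # Old-style check: both fields must carry a value.
    return (
        getattr(card, "license", None) is not None
        and getattr(card, "commercial", None) is not None
    )


def check_relaxed(card: object) -> bool:
    # Relaxed check: the fields must exist, but may be None.
    return hasattr(card, "license") and hasattr(card, "commercial")


stpath_card = ModelCardSketch(name="stpath")
assert not check_strict(stpath_card)  # an unspecified license fails the old check
assert check_relaxed(stpath_card)  # and passes the relaxed one
```

One consequence worth weighing: the relaxed check also passes for any other model whose license was simply forgotten, so it no longer guards against missing metadata.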

2. No LazySlide dispatch function for spatial_transcriptomics models
The contributing guide states that new models should be wired into a corresponding LazySlide function (e.g. zs.tl.encode_tiles() for vision models). There is currently no such function for spatial transcriptomics prediction — STPath must be used by instantiating it directly. This is a known gap; opening a separate issue to add a zs.tl.predict_gene_expression() or similar function is suggested.
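
As a starting point for that follow-up issue, a dispatch function could look up the model by its registry key and forward the inputs. Everything below is a hypothetical sketch: `MODEL_REGISTRY_SKETCH`, `register_sketch`, `ToySTModel`, and the `predict_gene_expression` signature are illustrative names, not LazySlide's API, and the toy model returns zeros instead of real predictions.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical stand-in for a model registry keyed by name.
MODEL_REGISTRY_SKETCH: Dict[str, type] = {}


def register_sketch(key: str):
    """Class decorator that records the class under `key` when it is defined."""

    def decorator(cls: type) -> type:
        MODEL_REGISTRY_SKETCH[key] = cls
        return cls

    return decorator


@register_sketch("toy_st_model")
class ToySTModel:
    """Placeholder model: returns a zero vector per spot."""

    gene_names = ["GENE_A", "GENE_B"]

    def predict(
        self,
        embeddings: Sequence[Sequence[float]],
        coords: Sequence[Tuple[float, float]],
    ) -> List[List[float]]:
        if len(embeddings) != len(coords):
            raise ValueError("need one coordinate pair per embedding")
        return [[0.0] * len(self.gene_names) for _ in embeddings]


def predict_gene_expression(
    model_name: str,
    embeddings: Sequence[Sequence[float]],
    coords: Sequence[Tuple[float, float]],
) -> List[List[float]]:
    """Hypothetical dispatcher: resolve the model by name, then predict."""
    try:
        model_cls = MODEL_REGISTRY_SKETCH[model_name]
    except KeyError:
        raise ValueError(f"unknown model: {model_name!r}") from None
    return model_cls().predict(embeddings, coords)
```

Because the decorator runs when the class body is executed, a model only appears in such a registry once its module has been imported.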

3. New ModelTask.spatial_transcriptomics enum value
feature_prediction was renamed to spatial_transcriptomics to be more descriptive and consistent with the established field name. The contributing guide currently only documents vision, segmentation, multimodal, and tile_prediction model types — should a spatial_transcriptomics section be added to the guide?

Test plan

  • pytest tests/models/test_stpath_equivalence.py — numerical equivalence against upstream (requires network or STPATH_REPO)
  • pytest tests/test_model_registry.py: stpath appears in registry with correct metadata
  • pytest tests/models/test_models_general.py -m large_runner -k stpath — model initialises, param count and FLOPs estimation work
  • @Mr-Milk — no gating on HuggingFace so no access grant needed

simon and others added 13 commits February 3, 2026 11:16
- Import STPath and STFM in tile_prediction/__init__.py so the @register
  decorator fires and the model appears in MODEL_REGISTRY
- Remove two scratch dev scripts (test_stpath.py, comp_predictions.py)
  that contained hardcoded local paths and Desktop output paths
- Remove debug print(self.config) from STFM.__init__
- Replace print() with warnings.warn() for missing gene symbols and
  torch_geometric fallback
- Add torch-geometric as the [stpath] optional extra in pyproject.toml
- Add stpath entry to MODEL_INPUT_ARGS in test_models_general.py
- Correct registry metadata: license and commercial set to None
  (no license file in upstream GitHub repo or HuggingFace model card;
  the previous CC BY-NC-ND 4.0 value was an incorrect assumption)
- Update paper_url to https://doi.org/10.64898/2026.03.17.711896
- Relax test_models_general assertions on license/commercial from
  `is not None` to hasattr(), so models with genuinely unspecified
  licenses do not cause a hard test failure

FLOPs scale with spot count (N) rather than being fixed, so the stored
305.9 GFLOPs, which was computed on the INT2 dev dataset, is not meaningful
as a registry constant. param_size (~50M, verified at 49.2M) is kept.

test_stpath_equivalence.py clones the original STPath repo from GitHub
and verifies that the LazySlide STFM reimplementation produces numerically
equivalent outputs (atol=1e-5) to the original for the same weights and inputs.
The clone is skipped gracefully when the network is unavailable; set
STPATH_REPO to a local clone to skip the network step during development.

Also removes the leftover test_stpath.ipynb dev notebook.
@Mr-Milk
Member

Mr-Milk commented Apr 9, 2026

Thanks, @simonschindler, for adding the new model. Here are a few comments:

  1. Should we make a Protocol or ABC for spatial_transcriptomics models?
  2. If it's a new type of model, we usually place it in a new folder; currently it's placed in tile_prediction.
  3. For docstrings, LazySlide uses the NumPy docstring style, but you are using the Google style.
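
For point 1, a structural interface could be sketched as below. This is only an illustration of the Protocol option: the class name `SpatialTranscriptomicsModelSketch` and the `predict` signature are assumptions, not an agreed LazySlide interface.

```python
from typing import List, Protocol, Sequence, Tuple, runtime_checkable


@runtime_checkable
class SpatialTranscriptomicsModelSketch(Protocol):
    """Hypothetical interface: any object with a matching `predict` method."""

    def predict(
        self,
        embeddings: Sequence[Sequence[float]],
        coords: Sequence[Tuple[float, float]],
    ) -> List[List[float]]:
        ...


class DummyModel:
    # No inheritance needed: conformance to a Protocol is structural.
    def predict(self, embeddings, coords):
        return [[0.0] for _ in embeddings]


assert isinstance(DummyModel(), SpatialTranscriptomicsModelSketch)
assert not isinstance(object(), SpatialTranscriptomicsModelSketch)
```

A Protocol needs no inheritance, and with `runtime_checkable` the `isinstance` check only verifies that the method exists, not its signature. An ABC requires explicit subclassing but can carry shared implementation and refuses instantiation when abstract methods are missing.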
