Support binary datatype of the columns#21
Support binary datatype of the columns#21Tobiaspk wants to merge 5 commits intointegration/peerlabfrom
Conversation
|
I pulled the update and have a new error: Traceback (most recent call last): |
|
Branch now up to date. Dataset mentioned above runs through. Git history on top of pr #22 . Ready to merge from my side. |
There was a problem hiding this comment.
Pull request overview
This PR updates the Xenium ingest/preprocessing path to handle Xenium Ranger outputs where some Parquet columns are stored as binary instead of string, and introduces separate handling for Xenium analysis software v1 vs v2+.
Changes:
- Cast Xenium transcript
feature_name/cell_idtoUtf8during Polars ingestion to avoidexpected String type, got: binary. - Add Xenium analysis software version detection via
experiment.xeniumand introduce aXeniumPreprocessorV1variant with a differentnull_cell_id. - Refactor Xenium preprocessor to use class-level field definitions (
tx_fields,bd_fields) for easier specialization.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
src/segger/io/preprocessor.py |
Adds Xenium v1/v2+ detection, type casting for binary Parquet columns, and a new Xenium v1 preprocessor class. |
src/segger/io/fields.py |
Adds XeniumTranscriptFieldsV1 to define Xenium v1-specific transcript conventions. |
pyproject.toml |
Adds json to dependencies (but json is stdlib). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # get version | ||
| path_meta = data_dir / "experiment.xenium" | ||
| with open(path_meta) as f: | ||
| meta = json.load(f) | ||
| # version can be xenium-x.y.z or Xenium-x.y.z, ... | ||
| version = meta["analysis_sw_version"].split("-")[-1].split(".") | ||
| version = [int(v) for v in version] | ||
| return version | ||
|
|
||
| @classmethod | ||
| def _validate_directory(cls, data_dir: Path): | ||
|
|
||
| # Apply xenium software version 2 or higher (when cell id "Unassigned" was introduced. Previously -1) | ||
| version = XeniumPreprocessor._get_analysis_sw_version(data_dir) | ||
| if not cls.sw_version(version): |
There was a problem hiding this comment.
_validate_directory reads experiment.xenium before checking required files. If the file is missing/corrupt, this will raise a low-level exception during platform inference rather than a clear Xenium-specific message. Consider including experiment.xenium in the required-file checks (and raising an IOError with context) before attempting to parse it.
| sw_version = lambda version: version[0] > 1 | ||
|
|
||
| @staticmethod |
There was a problem hiding this comment.
sw_version is defined as a plain function attribute (lambda) on the class. When accessed as cls.sw_version(...) it will be bound and receive cls as an extra first argument, causing a TypeError during platform inference/validation. Make this a @staticmethod (or a normal def wrapped with staticmethod) so it accepts only version.
| sw_version = lambda version: version[0] > 1 | |
| @staticmethod | |
| @staticmethod | |
| def sw_version(version): | |
| return version[0] > 1 | |
| @staticmethod |
| def _get_analysis_sw_version(data_dir: Path) -> str: | ||
| """ | ||
| Get 10x xenium analysis software version. Example experiment.xenium file: | ||
| { | ||
| ..., | ||
| "analysis_sw_version": "xenium-3.3.1.1" | ||
| } | ||
| Return: | ||
| version : list of ints representing major, minor, and patch version numbers (e.g. [3, 3, 1, 1]) | ||
| """ | ||
|
|
||
| # get version | ||
| path_meta = data_dir / "experiment.xenium" | ||
| with open(path_meta) as f: | ||
| meta = json.load(f) | ||
| # version can be xenium-x.y.z or Xenium-x.y.z, ... | ||
| version = meta["analysis_sw_version"].split("-")[-1].split(".") | ||
| version = [int(v) for v in version] | ||
| return version | ||
|
|
||
| @classmethod | ||
| def _validate_directory(cls, data_dir: Path): | ||
|
|
||
| # Apply xenium software version 2 or higher (when cell id "Unassigned" was introduced. Previously -1) | ||
| version = XeniumPreprocessor._get_analysis_sw_version(data_dir) | ||
| if not cls.sw_version(version): | ||
| raise IOError( | ||
| f"Xenium analysis software version must be 2.0.0 or higher, " | ||
| f"but found version {'.'.join(version)}." | ||
| ) |
There was a problem hiding this comment.
The version type/format is inconsistent here: _get_analysis_sw_version returns list[int], but the return annotation is str. Also, in the error message '.'.join(version) will raise because version contains ints. Update the annotation and convert version parts to strings when formatting (e.g., '.'.join(map(str, version))).
| # cell_id is string in later 10x versions, but int in earlier versions. | ||
| bd.index = bd[std.id].astype(str) + '_' + bd[std.boundary_type].map({ |
There was a problem hiding this comment.
boundaries() only casts bd[std.id] to str when building bd.index, but leaves the cell_id column itself potentially as bytes/int (older Xenium). Downstream setup_anndata joins on the cell_id column, so this can cause missing joins / assertion failures. Cast bd[std.id] to str (or decode bytes) before the join/index construction so the column matches transcript cell_id strings.
| # cell_id is string in later 10x versions, but int in earlier versions. | |
| bd.index = bd[std.id].astype(str) + '_' + bd[std.boundary_type].map({ | |
| # cell_id is string in later 10x versions, but int/bytes in earlier versions. | |
| # Normalize the column itself so downstream joins on `cell_id` match | |
| # transcript `cell_id` values, then build the index from the normalized | |
| # column. | |
| bd[std.id] = bd[std.id].map( | |
| lambda value: value.decode() if isinstance(value, bytes) else str(value) | |
| ) | |
| bd.index = bd[std.id] + '_' + bd[std.boundary_type].map({ |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Closes #20 .
☑️ Now supports this dataset: https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast
Adding the following changes:
cell_ids are allUtf8in transcripts, cell_boundaries and nucleus_bandaries filesfeature_nameisUtf8in transcripts fileXeniumPreprocessorV1preprocessor, which uses a differentnull_cell_id. Added detection logic using the metadata.Testing
--> Tested on a xenium v3 dataset. Fixes don't affect outputs, results match previous results.
--> Tested on outdated xenium dataset (https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast) and fails!