
Support binary datatype of the columns#21

Open
Tobiaspk wants to merge 5 commits into integration/peerlab from bugfix/cast_binary_feature_names

Conversation

@Tobiaspk
Collaborator

@Tobiaspk Tobiaspk commented Mar 4, 2026

Closes #20.

☑️ Now supports this dataset: https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast

Adding the following changes:

  • Ensures that cell_ids are all Utf8 in transcripts, cell_boundaries and nucleus_boundaries files
  • Ensures that feature_name is Utf8 in transcripts file
  • Adds XeniumPreprocessorV1 preprocessor, which uses a different null_cell_id. Added detection logic using the metadata.

Testing

--> Tested on a xenium v3 dataset. Fixes don't affect outputs, results match previous results.
--> Tested on outdated xenium dataset (https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast) and fails!

@erikla

erikla commented Mar 4, 2026

I pulled the update and have a new error:

Traceback (most recent call last):
  File "/home/ladewie1/micromamba/envs/segger/bin/segger", line 6, in <module>
    sys.exit(app())
             ^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/site-packages/cyclopts/core.py", line 1869, in __call__
    result = _run_maybe_async_command(command, bound, resolved_backend)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/site-packages/cyclopts/_run.py", line 50, in _run_maybe_async_command
    return command(*bound.args, **bound.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ladewie1/bin/github/segger/src/segger/cli/segment.py", line 313, in segment
    datamodule = ISTDataModule(
                 ^^^^^^^^^^^^^^
  File "<string>", line 27, in __init__
  File "/home/ladewie1/bin/github/segger/src/segger/data/data_module.py", line 160, in __post_init__
    self.load()
  File "/home/ladewie1/bin/github/segger/src/segger/data/data_module.py", line 172, in load
    tx = self.tx = pp.transcripts
                   ^^^^^^^^^^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/ladewie1/bin/github/segger/src/segger/io/preprocessor.py", line 439, in transcripts
    .collect()
     ^^^^^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 97, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/site-packages/polars/lazyframe/opt_flags.py", line 326, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2440, in collect
    return wrap_df(ldf.collect(engine, callback))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: cannot compare string with numeric type (i32)

@Tobiaspk
Collaborator Author

The branch is now up to date, and the dataset mentioned above runs through. The Git history sits on top of PR #22. Ready to merge from my side.

@Tobiaspk Tobiaspk changed the base branch from main to integration/peerlab April 16, 2026 20:47

Copilot AI left a comment


Pull request overview

This PR updates the Xenium ingest/preprocessing path to handle Xenium Ranger outputs where some Parquet columns are stored as binary instead of string, and introduces separate handling for Xenium analysis software v1 vs v2+.

Changes:

  • Cast Xenium transcript feature_name / cell_id to Utf8 during Polars ingestion to avoid the error "expected String type, got: binary".
  • Add Xenium analysis software version detection via experiment.xenium and introduce a XeniumPreprocessorV1 variant with a different null_cell_id.
  • Refactor Xenium preprocessor to use class-level field definitions (tx_fields, bd_fields) for easier specialization.
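The class-level field pattern in the last bullet can be sketched roughly as follows. This is a hypothetical, stdlib-only illustration and not segger's actual code: the names `TranscriptFields`, `PreprocessorV2Plus`, `PreprocessorV1`, `supports`, `is_unassigned` and `pick_preprocessor` are stand-ins, and the `-1` versus `"Unassigned"` null ids are taken from the code comment quoted later in this review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptFields:
    """Hypothetical stand-in for a transcript field definition."""
    cell_id: str = "cell_id"
    feature: str = "feature_name"
    null_cell_id: object = "Unassigned"  # assumed v2+ convention


class PreprocessorV2Plus:
    # Class-level definition: a subclass only needs to override this attribute.
    tx_fields = TranscriptFields()

    @staticmethod
    def supports(version: list[int]) -> bool:
        return version[0] > 1

    @classmethod
    def is_unassigned(cls, cell_id) -> bool:
        # Compare as strings so int and str ids are handled uniformly.
        return str(cell_id) == str(cls.tx_fields.null_cell_id)


class PreprocessorV1(PreprocessorV2Plus):
    tx_fields = TranscriptFields(null_cell_id=-1)  # assumed v1 convention

    @staticmethod
    def supports(version: list[int]) -> bool:
        return version[0] == 1


def pick_preprocessor(version: list[int]):
    """Return the first hypothetical preprocessor class that accepts `version`."""
    for cls in (PreprocessorV2Plus, PreprocessorV1):
        if cls.supports(version):
            return cls
    raise ValueError(f"Unsupported version: {'.'.join(map(str, version))}")
```

Comparing ids as strings in `is_unassigned` is one way to sidestep a string-versus-integer comparison like the one in the traceback above; whether the PR takes that route is not stated here.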

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File | Description
src/segger/io/preprocessor.py | Adds Xenium v1/v2+ detection, type casting for binary Parquet columns, and a new Xenium v1 preprocessor class.
src/segger/io/fields.py | Adds XeniumTranscriptFieldsV1 to define Xenium v1-specific transcript conventions.
pyproject.toml | Adds json to dependencies (but json is stdlib).


Comment on lines +394 to +408
    # get version
    path_meta = data_dir / "experiment.xenium"
    with open(path_meta) as f:
        meta = json.load(f)
    # version can be xenium-x.y.z or Xenium-x.y.z, ...
    version = meta["analysis_sw_version"].split("-")[-1].split(".")
    version = [int(v) for v in version]
    return version

@classmethod
def _validate_directory(cls, data_dir: Path):

    # Apply xenium software version 2 or higher (when cell id "Unassigned" was introduced. Previously -1)
    version = XeniumPreprocessor._get_analysis_sw_version(data_dir)
    if not cls.sw_version(version):

Copilot AI Apr 16, 2026


_validate_directory reads experiment.xenium before checking required files. If the file is missing/corrupt, this will raise a low-level exception during platform inference rather than a clear Xenium-specific message. Consider including experiment.xenium in the required-file checks (and raising an IOError with context) before attempting to parse it.
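A minimal sketch of the ordering suggested here, assuming a hypothetical `REQUIRED_FILES` tuple and helper name `read_sw_version_string` (neither is taken from the PR, and the real list of required Xenium files may differ): check for the files first, then wrap parsing failures in an `IOError` that names the offending path.

```python
import json
from pathlib import Path

# Hypothetical list; the real set of required Xenium files may differ.
REQUIRED_FILES = ("experiment.xenium", "transcripts.parquet")


def read_sw_version_string(data_dir: Path) -> str:
    """Check required files first, then parse the metadata with context."""
    missing = [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]
    if missing:
        raise IOError(f"Xenium directory {data_dir} is missing: {', '.join(missing)}")

    path_meta = data_dir / "experiment.xenium"
    try:
        with open(path_meta) as f:
            meta = json.load(f)
        return meta["analysis_sw_version"]
    except (json.JSONDecodeError, KeyError, TypeError) as err:
        # Re-raise with the file path so the failure is clearly Xenium-specific.
        raise IOError(f"Could not read 'analysis_sw_version' from {path_meta}") from err
```

Chaining with `from err` keeps the low-level cause available in the traceback while the top-level message stays readable.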

Comment thread pyproject.toml Outdated
Comment on lines +380 to 382
sw_version = lambda version: version[0] > 1

@staticmethod

Copilot AI Apr 16, 2026


sw_version is defined as a plain function attribute (lambda) on the class. When accessed as cls.sw_version(...) it will be bound and receive cls as an extra first argument, causing a TypeError during platform inference/validation. Make this a @staticmethod (or a normal def wrapped with staticmethod) so it accepts only version.

Suggested change
- sw_version = lambda version: version[0] > 1
- @staticmethod
+ @staticmethod
+ def sw_version(version):
+     return version[0] > 1
+ @staticmethod
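Whether the extra argument is actually passed depends on how the attribute is looked up. In Python 3, a plain function stored on a class is bound only when accessed through an instance; accessed through the class itself (as with `cls.` inside a classmethod) it comes back as an ordinary function. A small sketch with a hypothetical `Demo` class shows both cases, and why `@staticmethod` is the safer choice because it behaves the same either way:

```python
class Demo:
    # Plain function stored as a class attribute (as in the original line).
    as_lambda = lambda version: version[0] > 1

    @staticmethod
    def as_static(version):
        return version[0] > 1

    @classmethod
    def check_via_class(cls, version):
        # Attribute access on the class returns the plain function: no binding.
        return cls.as_lambda(version)

    def check_via_instance(self, version):
        # Attribute access on an instance binds `self` as the first argument,
        # so the lambda receives two arguments and raises TypeError.
        return self.as_lambda(version)
```

Here `Demo.check_via_class([3, 3, 1])` evaluates the lambda normally, while `Demo().check_via_instance([3, 3, 1])` raises `TypeError`; `as_static` works from both the class and an instance.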

Comment on lines +383 to +412
def _get_analysis_sw_version(data_dir: Path) -> str:
    """
    Get 10x xenium analysis software version. Example experiment.xenium file:
    {
        ...,
        "analysis_sw_version": "xenium-3.3.1.1"
    }
    Return:
        version : list of ints representing major, minor, and patch version numbers (e.g. [3, 3, 1, 1])
    """

    # get version
    path_meta = data_dir / "experiment.xenium"
    with open(path_meta) as f:
        meta = json.load(f)
    # version can be xenium-x.y.z or Xenium-x.y.z, ...
    version = meta["analysis_sw_version"].split("-")[-1].split(".")
    version = [int(v) for v in version]
    return version

@classmethod
def _validate_directory(cls, data_dir: Path):

    # Apply xenium software version 2 or higher (when cell id "Unassigned" was introduced. Previously -1)
    version = XeniumPreprocessor._get_analysis_sw_version(data_dir)
    if not cls.sw_version(version):
        raise IOError(
            f"Xenium analysis software version must be 2.0.0 or higher, "
            f"but found version {'.'.join(version)}."
        )

Copilot AI Apr 16, 2026


The version type/format is inconsistent here: _get_analysis_sw_version returns list[int], but the return annotation is str. Also, in the error message '.'.join(version) will raise because version contains ints. Update the annotation and convert version parts to strings when formatting (e.g., '.'.join(map(str, version))).
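The two fixes named in this comment can be sketched as a pair of small helpers. The function names `parse_sw_version` and `format_sw_version` are hypothetical; the parsing expression mirrors the snippet under review, with the return annotation changed to `list[int]`.

```python
def parse_sw_version(raw: str) -> list[int]:
    """Turn e.g. 'xenium-3.3.1.1' into [3, 3, 1, 1]."""
    return [int(part) for part in raw.split("-")[-1].split(".")]


def format_sw_version(version: list[int]) -> str:
    """Join integer parts back into a dotted string for error messages."""
    # str.join only accepts strings, so each int is converted first.
    return ".".join(map(str, version))
```

Calling `".".join([3, 3, 1, 1])` directly raises `TypeError`, which would mask the intended `IOError` message.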

Comment on lines +539 to +540
# cell_id is string in later 10x versions, but int in earlier versions.
bd.index = bd[std.id].astype(str) + '_' + bd[std.boundary_type].map({

Copilot AI Apr 16, 2026


boundaries() only casts bd[std.id] to str when building bd.index, but leaves the cell_id column itself potentially as bytes/int (older Xenium). Downstream setup_anndata joins on the cell_id column, so this can cause missing joins / assertion failures. Cast bd[std.id] to str (or decode bytes) before the join/index construction so the column matches transcript cell_id strings.

Suggested change
- # cell_id is string in later 10x versions, but int in earlier versions.
- bd.index = bd[std.id].astype(str) + '_' + bd[std.boundary_type].map({
+ # cell_id is string in later 10x versions, but int/bytes in earlier versions.
+ # Normalize the column itself so downstream joins on `cell_id` match
+ # transcript `cell_id` values, then build the index from the normalized
+ # column.
+ bd[std.id] = bd[std.id].map(
+     lambda value: value.decode() if isinstance(value, bytes) else str(value)
+ )
+ bd.index = bd[std.id] + '_' + bd[std.boundary_type].map({
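The per-value normalization in this suggestion can be illustrated without pandas. This is a stdlib-only sketch; the helper name `normalize_cell_id` and the sample ids are hypothetical.

```python
def normalize_cell_id(value) -> str:
    """Return a str id for bytes, int, or str input."""
    if isinstance(value, bytes):
        # str(b"abc") would give "b'abc'", so bytes must be decoded instead.
        return value.decode("utf-8")
    return str(value)


# Hypothetical ids as they might appear across Xenium versions.
raw_ids = [b"aaaabbbb-1", "aaaabbbb-1", 42]
normalized = [normalize_cell_id(v) for v in raw_ids]
```

After normalization the bytes and string forms of the same id compare equal, which is what a join on `cell_id` needs.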

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Segger fails for data generated with Xenium Ranger v1: expected String type, got: binary

3 participants