Support binary datatype of the columns by Tobiaspk · Pull Request #21 · dpeerlab/segger

Tobiaspk · 2026-03-04T16:44:19Z

Closes #20.

☑️ Now supports this dataset: https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast

Adding the following changes:

Ensures that cell_ids are all Utf8 in the transcripts, cell_boundaries and nucleus_boundaries files
Ensures that feature_name is Utf8 in the transcripts file
Adds a XeniumPreprocessorV1 preprocessor, which uses a different null_cell_id, and adds version-detection logic based on the metadata.

Testing

--> Tested on a Xenium v3 dataset. The fixes don't affect outputs; results match the previous results.
--> Tested on the outdated Xenium dataset (https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast), and it fails!

erikla · 2026-03-04T17:10:57Z

I pulled the update and have a new error:

Traceback (most recent call last):
  File "/home/ladewie1/micromamba/envs/segger/bin/segger", line 6, in <module>
    sys.exit(app())
             ^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/site-packages/cyclopts/core.py", line 1869, in __call__
    result = _run_maybe_async_command(command, bound, resolved_backend)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/site-packages/cyclopts/_run.py", line 50, in _run_maybe_async_command
    return command(*bound.args, **bound.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ladewie1/bin/github/segger/src/segger/cli/segment.py", line 313, in segment
    datamodule = ISTDataModule(
                 ^^^^^^^^^^^^^^
  File "<string>", line 27, in __init__
  File "/home/ladewie1/bin/github/segger/src/segger/data/data_module.py", line 160, in __post_init__
    self.load()
  File "/home/ladewie1/bin/github/segger/src/segger/data/data_module.py", line 172, in load
    tx = self.tx = pp.transcripts
                   ^^^^^^^^^^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/ladewie1/bin/github/segger/src/segger/io/preprocessor.py", line 439, in transcripts
    .collect()
     ^^^^^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 97, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/site-packages/polars/lazyframe/opt_flags.py", line 326, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ladewie1/micromamba/envs/segger/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2440, in collect
    return wrap_df(ldf.collect(engine, callback))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: cannot compare string with numeric type (i32)

Tobiaspk · 2026-03-12T22:09:37Z

The branch is now up to date, and the dataset mentioned above runs through. The Git history sits on top of PR #22. Ready to merge from my side.

Copilot

Pull request overview

This PR updates the Xenium ingest/preprocessing path to handle Xenium Ranger outputs where some Parquet columns are stored as binary instead of string, and introduces separate handling for Xenium analysis software v1 vs v2+.

Changes:

Cast Xenium transcript feature_name / cell_id to Utf8 during Polars ingestion to avoid the error "expected String type, got: binary".
Add Xenium analysis software version detection via experiment.xenium and introduce a XeniumPreprocessorV1 variant with a different null_cell_id.
Refactor Xenium preprocessor to use class-level field definitions (tx_fields, bd_fields) for easier specialization.
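
The class-level specialization described above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the class names `BasePreprocessor` and `PreprocessorV1`, the helper `pick_preprocessor`, and both sentinel values are hypothetical stand-ins chosen for illustration (the PR discussion only states that v2+ uses an "Unassigned" cell ID where v1 used -1).

```python
from typing import Sequence


class BasePreprocessor:
    """Hypothetical stand-in for the v2+ Xenium preprocessor."""

    # Class-level field that subclasses override instead of re-implementing logic.
    # Assumed spelling; the PR comment writes it as "Unassigned".
    null_cell_id: str = "UNASSIGNED"

    @staticmethod
    def sw_version(version: Sequence[int]) -> bool:
        # Accept analysis software major version 2 or higher.
        return version[0] > 1


class PreprocessorV1(BasePreprocessor):
    """Hypothetical stand-in for the v1 variant."""

    # v1 used -1; kept as a string here on the assumption that cell_id
    # has already been cast to Utf8.
    null_cell_id: str = "-1"

    @staticmethod
    def sw_version(version: Sequence[int]) -> bool:
        return version[0] == 1


def pick_preprocessor(version: Sequence[int]) -> type:
    """Return the first class whose version predicate accepts `version`."""
    for cls in (BasePreprocessor, PreprocessorV1):
        if cls.sw_version(version):
            return cls
    raise IOError(f"Unsupported version: {'.'.join(map(str, version))}")
```

With this layout, supporting another software generation would only need a new subclass that overrides the class-level fields and the version predicate.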

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `src/segger/io/preprocessor.py` | Adds Xenium v1/v2+ detection, type casting for binary Parquet columns, and a new Xenium v1 preprocessor class. |
| `src/segger/io/fields.py` | Adds `XeniumTranscriptFieldsV1` to define Xenium v1-specific transcript conventions. |
| `pyproject.toml` | Adds `json` to dependencies (but `json` is stdlib). |


Copilot · 2026-04-16T20:51:27Z

+        # get version
+        path_meta = data_dir / "experiment.xenium"
+        with open(path_meta) as f:
+            meta = json.load(f)
+        # version can be xenium-x.y.z or Xenium-x.y.z, ...
+        version = meta["analysis_sw_version"].split("-")[-1].split(".")
+        version = [int(v) for v in version]
+        return version
+
+    @classmethod
+    def _validate_directory(cls, data_dir: Path):
+
+        # Apply xenium software version 2 or higher (when cell id "Unassigned" was introduced. Previously -1)
+        version = XeniumPreprocessor._get_analysis_sw_version(data_dir)
+        if not cls.sw_version(version):


_validate_directory reads experiment.xenium before checking required files. If the file is missing/corrupt, this will raise a low-level exception during platform inference rather than a clear Xenium-specific message. Consider including experiment.xenium in the required-file checks (and raising an IOError with context) before attempting to parse it.
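
One way to act on this comment is to check for the metadata file and wrap parsing errors before anything else touches it. The following is only a sketch under stated assumptions: the function name `load_xenium_metadata` and the error messages are hypothetical and are not part of the PR.

```python
import json
from pathlib import Path


def load_xenium_metadata(data_dir: Path) -> dict:
    """Read experiment.xenium, failing with a Xenium-specific message."""
    path_meta = Path(data_dir) / "experiment.xenium"

    # Check for the file first so a missing file gives a clear error.
    if not path_meta.is_file():
        raise IOError(f"Xenium metadata file not found: {path_meta}")

    # Wrap low-level JSON errors in an IOError that names the file.
    try:
        with open(path_meta) as f:
            meta = json.load(f)
    except json.JSONDecodeError as err:
        raise IOError(f"Could not parse Xenium metadata {path_meta}: {err}") from err

    # The version key is required for the later version check.
    if "analysis_sw_version" not in meta:
        raise IOError(f"'analysis_sw_version' missing from {path_meta}")
    return meta
```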

Copilot · 2026-04-16T20:51:28Z

+    sw_version = lambda version: version[0] > 1
+
    @staticmethod


sw_version is defined as a plain function attribute (lambda) on the class. Called through the class, as in cls.sw_version(...) inside a classmethod, it receives only version; called through an instance, it would be bound and receive the instance as an extra first argument, causing a TypeError. Making this a @staticmethod (or a normal def wrapped with staticmethod) ensures it accepts only version in both cases.

Suggested change

-    sw_version = lambda version: version[0] > 1
-
-    @staticmethod
+    @staticmethod
+    def sw_version(version):
+        return version[0] > 1
+
+    @staticmethod
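
How a function stored as a class attribute gets called depends on how it is looked up. In Python 3, looking it up on the class object returns the plain function, while looking it up on an instance returns a bound method that passes the instance as the first argument. The sketch below uses two hypothetical classes, `WithLambda` and `WithStatic`, to show the difference and why `@staticmethod` is the safer spelling.

```python
class WithLambda:
    # Plain function stored as a class attribute.
    sw_version = lambda version: version[0] > 1


class WithStatic:
    @staticmethod
    def sw_version(version):
        return version[0] > 1


version = [3, 3, 1, 1]

# Through the class object, the lambda is a plain function: no extra argument.
via_class = WithLambda.sw_version(version)

# Through an instance, the lambda is bound and the instance is passed as well,
# so the one-parameter lambda receives two arguments.
try:
    WithLambda().sw_version(version)
    instance_call_failed = False
except TypeError:
    instance_call_failed = True

# A staticmethod behaves the same from the class and from an instance.
static_ok = WithStatic.sw_version(version) and WithStatic().sw_version(version)
```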

Copilot · 2026-04-16T20:51:28Z

+    def _get_analysis_sw_version(data_dir: Path) -> str:
+        """
+        Get 10x xenium analysis software version. Example experiment.xenium file:
+        {
+            ...,
+            "analysis_sw_version": "xenium-3.3.1.1"
+        }
+        Return:
+            version : list of ints representing major, minor, and patch version numbers (e.g. [3, 3, 1, 1])
+        """

+        # get version
+        path_meta = data_dir / "experiment.xenium"
+        with open(path_meta) as f:
+            meta = json.load(f)
+        # version can be xenium-x.y.z or Xenium-x.y.z, ...
+        version = meta["analysis_sw_version"].split("-")[-1].split(".")
+        version = [int(v) for v in version]
+        return version
+
+    @classmethod
+    def _validate_directory(cls, data_dir: Path):
+
+        # Apply xenium software version 2 or higher (when cell id "Unassigned" was introduced. Previously -1)
+        version = XeniumPreprocessor._get_analysis_sw_version(data_dir)
+        if not cls.sw_version(version):
+            raise IOError(
+                f"Xenium analysis software version must be 2.0.0 or higher, "
+                f"but found version {'.'.join(version)}."
+            )


The version type/format is inconsistent here: _get_analysis_sw_version returns list[int], but the return annotation is str. Also, in the error message '.'.join(version) will raise because version contains ints. Update the annotation and convert version parts to strings when formatting (e.g., '.'.join(map(str, version))).
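
A small sketch of the consistent version handling suggested here: parse to a list of ints, annotate it as such, and convert back to strings only when formatting. The function names `parse_sw_version` and `format_sw_version` are hypothetical; the parsing expression is the one from the diff above.

```python
from typing import List


def parse_sw_version(raw: str) -> List[int]:
    """Turn e.g. 'xenium-3.3.1.1' into [3, 3, 1, 1]."""
    return [int(part) for part in raw.split("-")[-1].split(".")]


def format_sw_version(version: List[int]) -> str:
    """Join integer version parts back into a dotted string."""
    return ".".join(map(str, version))


version = parse_sw_version("xenium-3.3.1.1")
print(version)                     # [3, 3, 1, 1]
print(format_sw_version(version))  # 3.3.1.1
```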

Copilot · 2026-04-16T20:51:28Z

+        # cell_id is string in later 10x versions, but int in earlier versions.
+        bd.index = bd[std.id].astype(str) + '_' + bd[std.boundary_type].map({


boundaries() only casts bd[std.id] to str when building bd.index, but leaves the cell_id column itself potentially as bytes/int (older Xenium). Downstream setup_anndata joins on the cell_id column, so this can cause missing joins / assertion failures. Cast bd[std.id] to str (or decode bytes) before the join/index construction so the column matches transcript cell_id strings.

Suggested change

-        # cell_id is string in later 10x versions, but int in earlier versions.
-        bd.index = bd[std.id].astype(str) + '_' + bd[std.boundary_type].map({
+        # cell_id is string in later 10x versions, but int/bytes in earlier versions.
+        # Normalize the column itself so downstream joins on `cell_id` match
+        # transcript `cell_id` values, then build the index from the normalized
+        # column.
+        bd[std.id] = bd[std.id].map(
+            lambda value: value.decode() if isinstance(value, bytes) else str(value)
+        )
+        bd.index = bd[std.id] + '_' + bd[std.boundary_type].map({
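
The explicit decode in this suggestion matters because calling `str()` on a bytes value in Python returns its repr (for example `"b'abc'"`) rather than the decoded text. A minimal standalone sketch of the same normalization, using a hypothetical helper `normalize_cell_id` and made-up example IDs:

```python
def normalize_cell_id(value) -> str:
    """Return a cell ID as str, decoding bytes and stringifying everything else."""
    if isinstance(value, bytes):
        return value.decode()
    return str(value)


# Made-up IDs covering the three cases: bytes, int, and str.
raw_ids = [b"aaaabbbb-1", 42, "ccccdddd-1"]
print([normalize_cell_id(v) for v in raw_ids])
# ['aaaabbbb-1', '42', 'ccccdddd-1']

# Without the explicit decode, bytes would keep their b'...' wrapper.
print(str(b"aaaabbbb-1"))  # b'aaaabbbb-1'
```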

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Support binary datatype of the columns

7b64dd8

Tobiaspk added 2 commits March 4, 2026 12:28

Support xenium software version 1

461a752

Debug datatypes

70c31fd

Tobiaspk mentioned this pull request Mar 4, 2026

Segger fails for data generated with Xenium Ranger v1: expected String type, got: binary #20

Open

more casting

56fd20e

Tobiaspk changed the base branch from main to integration/peerlab April 16, 2026 20:47

Tobiaspk requested review from andrewmoorman and Copilot April 16, 2026 20:48

Copilot started reviewing on behalf of Tobiaspk April 16, 2026 20:48

Update pyproject.toml

37252de

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Tobiaspk wants to merge 5 commits into integration/peerlab from bugfix/cast_binary_feature_names


