Create a combined annotation table per sample and per reference. by mariiabilous · Pull Request #108 · bdsc-tds/xenium_analysis_pipeline

mariiabilous · 2026-02-16T11:41:18Z

For each reference, all available annotation results are merged into a single file (combined_labels.parquet), stored under cell_type_annotation/.../<reference_name>/.

The output contains one row per cell and multiple annotation result columns. Column names follow the format:
{level}/{method}/{granularity}/{old_column_name}

This provides a unified view of all annotation results while preserving their provenance in the column structure.

Closes #98

coderabbitai · 2026-02-16T11:41:53Z

Warning

Rate limit exceeded

@senbaikang has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 2 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d3f093c-bb74-44a4-9aa2-04692ccc1677

📥 Commits

Reviewing files that changed from the base of the PR and between 4a71190 and 67a972b.

📒 Files selected for processing (4)

workflow/Snakefile
workflow/rules/_cell_type_annotation/combine_annotation_results.smk
workflow/rules/cell_type_annotation.smk
workflow/scripts/_cell_type_annotation/combine_annotation_results.R

Walkthrough

Adds a new Snakemake rule and R script to discover and merge reference-based labels.parquet files into per-reference combined_labels.parquet outputs; exposes get_combined_label_outputs() in workflow/Snakefile and includes the rule in the cell-type annotation workflow.

Changes

Cohort / File(s)	Summary
Top-level workflow `workflow/Snakefile`	Added `def get_combined_label_outputs() -> list[str]` to construct `combined_labels.parquet` output paths and extended `get_input2all(...)` to include those outputs when `cell_type_annotation` is enabled; minor formatting adjustments.
New Snakemake rule `workflow/rules/_cell_type_annotation/combine_annotation_results.smk`	Added `combineAnnotationLabels` rule that computes `input_root_dir` from config and wildcards, declares `combined_labels.parquet` and a log file as outputs, and runs an R container to perform the merge.
Workflow include `workflow/rules/cell_type_annotation.smk`	Added `include: "_cell_type_annotation/combine_annotation_results.smk"` to load the new combine rule into the cell-type annotation ruleset.
R script `workflow/scripts/_cell_type_annotation/combine_annotation_results.R`	New script: recursively finds `labels.parquet` under the provided input root, errors if none found, reads and renames non-`cell_id` columns by prefixing `level/tool/mode` inferred from paths, full-joins tables on `cell_id`, and writes `combined_labels.parquet`.

sequenceDiagram
    participant Snakemake
    participant Config as Config/FS
    participant Files as FileSystem
    participant Rcont as R container
    Snakemake->>Config: read config (output_path, containers)
    Snakemake->>Files: compute input_root_dir from wildcards
    Snakemake->>Files: locate `labels.parquet` files under input_root_dir
    Snakemake->>Rcont: launch combine_annotation_results.R with inputs/log
    Rcont->>Files: read discovered `labels.parquet` files
    Rcont->>Rcont: parse paths, prefix columns, full_join on `cell_id`
    Rcont->>Files: write `combined_labels.parquet` and log
    Snakemake->>Files: register outputs in DAG

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through folders, one by one,

Found each label and stitched them as one,
I prefixed names so sources stay clear,
Saved a parquet bundle — carrot cheer! 🥕

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: creating combined annotation tables per sample and per reference, which is the primary objective of the PR.
Description check	✅ Passed	The description is directly related to the changeset, explaining the purpose of the combined_labels.parquet file and the column naming convention used.
Linked Issues check	✅ Passed	The PR addresses issue `#98` by implementing combined annotation tables per reference, enabling simpler inclusion of cell-type annotation metadata as requested.
Out of Scope Changes check	✅ Passed	All changes are directly related to implementing combined annotation output: new functions, rules, and R script for merging annotation results per reference.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Merge Conflict Detection	✅ Passed	✅ No merge conflicts detected when merging into `main`

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch combine-annotation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Fix all issues with AI agents

In `@workflow/rules/_cell_type_annotation/combine_annotation_results.smk`:
- Around line 9-18: The rule currently uses the directory variable root_dir as
an input which creates implicit, untracked dependencies; change the rule to
include explicit inputs for the individual label files produced by the upstream
annotation rule (e.g., the per-sample labels.parquet files) by adding a Python
function (or lambda) that computes and returns the list of expected label file
paths for the given wildcards and include that list alongside root_dir in the
rule's input. Update references in this file (combine_annotation_results.smk) to
use that function instead of relying solely on root_dir, and ensure this aligns
with how get_input2all() and the annotation rule name/outputs compute those
per-sample label paths so Snakemake will always materialize upstream annotation
outputs before combining.

In `@workflow/scripts/_cell_type_annotation/combine_annotation_results.R`:
- Around line 37-39: The pipeline will error when label_files is empty because
purrr::reduce has no .init; change the logic around merged (the pipeline using
label_files %>% map(read_and_annotate) %>% reduce(full_join, by = "cell_id")) to
handle the empty case by either providing a suitable .init to reduce (e.g., an
empty data frame/tibble with the cell_id column) or short-circuiting when
length(label_files) == 0 to return an empty/tidy tibble with the cell_id column;
update references to merged, label_files, read_and_annotate, and
reduce/full_join accordingly.
- Around line 13-18: The current list.files call that populates label_files uses
pattern "labels\\.parquet$" which also matches "combined_labels.parquet"; update
the list.files invocation (the pattern argument used to build label_files) to
only match filenames exactly named "labels.parquet" or explicitly exclude
"combined_labels.parquet" (e.g., tighten the regex to anchor the filename
component to ^labels\\.parquet$ or filter out any path/basename equal to
"combined_labels.parquet" after listing) so the script does not read its own
output.

In `@workflow/Snakefile`:
- Around line 558-592: get_combined_label_outputs() currently builds paths using
only the first two components of annotation (annotation.split("/")[0] and [1])
which collapses different annotation methods under the same path; update the
path construction to include the full annotation hierarchy (e.g., use the entire
annotation split into path segments or the full annotation string as extra path
components) so that distinct annotation_ids like reference_based/ref1/rctd and
reference_based/ref1/singler map to distinct directories; reference symbols to
change: function get_combined_label_outputs, the list comprehension iterating
for (sample, annotation) in samples_annotations, and the use of
annotation.split("/")[0]/[1]; optionally, if you still want to ensure
deterministic uniqueness, post-process the returned list with dict.fromkeys(...)
to deduplicate while preserving order.

🧹 Nitpick comments (1)

workflow/scripts/_cell_type_annotation/combine_annotation_results.R (1)

20-35: Path-based metadata extraction is fragile; consider documenting the expected directory layout.

The function relies on fixed positional offsets from the end of the path (n-1, n-2, n-3) to derive mode, level, and tool. Any change to the directory structure (e.g. an extra nesting level) will silently produce wrong column prefixes. A brief comment or assertion validating the expected structure would make this more robust.

mariiabilous · 2026-02-16T12:33:51Z

@coderabbitai review

coderabbitai · 2026-02-16T12:34:04Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

🧹 Nitpick comments (2)

workflow/scripts/_cell_type_annotation/combine_annotation_results.R (2)
25-40: Positional path parsing is fragile — consider a defensive check.

Lines 31–33 assume the directory structure is always …/<tool>/<level>/<mode>/labels.parquet. If the path layout ever changes (or an unexpected nested directory appears), the extracted components will silently be wrong, producing misleading column prefixes with no error.

A lightweight assertion would catch this early:
🛡️ Suggested defensive guard
   parts <- str_split(path, .Platform$file.sep)[[1]]
   n <- length(parts)
+  if (n < 4) {
+    stop("Unexpected path structure (too few components): ", path)
+  }
   mode <- parts[n - 1]
   level <- parts[n - 2]
   tool  <- parts[n - 3]
38-38: Dense double-evaluation of the column filter — consider simplifying for readability.

The same subsetting expression colnames(df)[colnames(df) != "cell_id"] is evaluated on both LHS and RHS. A small refactor improves clarity:
♻️ Optional readability improvement
-  colnames(df)[colnames(df) != "cell_id"] <- paste0(prefix, "/", colnames(df)[colnames(df) != "cell_id"])
+  cols_to_rename <- setdiff(colnames(df), "cell_id")
+  colnames(df)[colnames(df) %in% cols_to_rename] <- paste0(prefix, "/", cols_to_rename)

coderabbitai

🧹 Nitpick comments (2)

workflow/scripts/_cell_type_annotation/combine_annotation_results.R (2)
25-45: Path-component extraction is positional and fragile — consider documenting the expected layout.

The function silently assumes the directory structure is .../tool/level/mode/labels.parquet (3 directories above the file). If an upstream rule changes the nesting depth or inserts/removes a directory level, the extracted tool, level, and mode will be silently wrong rather than failing visibly.

A lightweight improvement: add a comment documenting the expected path layout, or validate the extracted components against known values.
📝 Suggested documentation/validation
 read_and_annotate <- function(path) {
   df <- read_parquet(path)
   
-  # Extract tool/level/mode from path
+  # Expected path suffix: .../<tool>/<level>/<mode>/labels.parquet
   parts <- str_split(path, .Platform$file.sep)[[1]]
   n <- length(parts)
   
   if (n < 4) {
     stop("Unexpected path structure (too few components): ", path)
   }
   
   mode <- parts[n - 1]
   level <- parts[n - 2]
   tool  <- parts[n - 3]
47-51: full_join can silently inflate rows if any labels.parquet contains duplicate cell_ids.

If an upstream annotation tool accidentally produces duplicate cell_id values, the full_join will create a cartesian product for those IDs, silently inflating the merged table. Since this is a data-pipeline script where silent row multiplication could be hard to debug, a post-merge sanity check would be cheap insurance.
🛡️ Proposed post-merge assertion
 merged <- label_files %>%
   map(read_and_annotate) %>%
   reduce(full_join, by = "cell_id")
 
+if (anyDuplicated(merged$cell_id)) {
+  warning("Duplicate cell_id entries detected after merge — check input labels.parquet files.")
+}
+
 write_parquet(merged, snakemake@output[["combined"]])

mariiabilous · 2026-02-16T15:07:43Z

@senbaikang , ready from my side

mariiabilous · 2026-02-17T15:12:11Z

TODO: fix input

coderabbitai Bot reviewed Feb 16, 2026

View reviewed changes

Comment thread workflow/rules/_cell_type_annotation/combine_annotation_results.smk

Comment thread workflow/scripts/_cell_type_annotation/combine_annotation_results.R

Comment thread workflow/scripts/_cell_type_annotation/combine_annotation_results.R

Comment thread workflow/Snakefile

mariiabilous force-pushed the combine-annotation branch from 134b489 to 0161d28 Compare February 16, 2026 11:59

coderabbitai Bot reviewed Feb 16, 2026

View reviewed changes

mariiabilous requested a review from senbaikang February 16, 2026 15:08

mariiabilous added 7 commits March 9, 2026 12:08

feat: combined annotation results per reference

431de00

fix: combine annotation results per reference

46cba4f

fix: restore correct default params of get_input2all

9795978

fix: restore default params

6237dc8

fix: restore default params

793d4b3

fix: caderabbitai

a0c7c22

fix: nitpick comment

67a972b

senbaikang force-pushed the combine-annotation branch from 4a71190 to 67a972b Compare March 9, 2026 11:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create a combined annotation table per sample and per reference.#108

Create a combined annotation table per sample and per reference.#108
mariiabilous wants to merge 7 commits into
mainfrom
combine-annotation

mariiabilous commented Feb 16, 2026

Uh oh!

coderabbitai Bot commented Feb 16, 2026 •

edited

Loading

Rate limit exceeded

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mariiabilous commented Feb 16, 2026

Uh oh!

coderabbitai Bot commented Feb 16, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

mariiabilous commented Feb 16, 2026

Uh oh!

mariiabilous commented Feb 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mariiabilous commented Feb 16, 2026

Uh oh!

coderabbitai Bot commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mariiabilous commented Feb 16, 2026

Uh oh!

coderabbitai Bot commented Feb 16, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

mariiabilous commented Feb 16, 2026

Uh oh!

mariiabilous commented Feb 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Feb 16, 2026 •

edited

Loading