Skip to content

Create a combined annotation table per sample and per reference.#108

Open
mariiabilous wants to merge 7 commits into
mainfrom
combine-annotation
Open

Create a combined annotation table per sample and per reference.#108
mariiabilous wants to merge 7 commits into
mainfrom
combine-annotation

Conversation

@mariiabilous
Copy link
Copy Markdown
Collaborator

For each reference, all available annotation results are merged into a single file (combined_labels.parquet), stored under cell_type_annotation/.../<reference_name>/.

The output contains one row per cell and multiple annotation result columns. Column names follow the format:
{level}/{method}/{granularity}/{old_column_name}

This provides a unified view of all annotation results while preserving their provenance in the column structure.

Closes #98

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 16, 2026

Warning

Rate limit exceeded

@senbaikang has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 2 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d3f093c-bb74-44a4-9aa2-04692ccc1677

📥 Commits

Reviewing files that changed from the base of the PR and between 4a71190 and 67a972b.

📒 Files selected for processing (4)
  • workflow/Snakefile
  • workflow/rules/_cell_type_annotation/combine_annotation_results.smk
  • workflow/rules/cell_type_annotation.smk
  • workflow/scripts/_cell_type_annotation/combine_annotation_results.R

Walkthrough

Adds a new Snakemake rule and R script to discover and merge reference-based labels.parquet files into per-reference combined_labels.parquet outputs; exposes get_combined_label_outputs() in workflow/Snakefile and includes the rule in the cell-type annotation workflow.

Changes

Cohort / File(s) Summary
Top-level workflow
workflow/Snakefile
Added def get_combined_label_outputs() -> list[str] to construct combined_labels.parquet output paths and extended get_input2all(...) to include those outputs when cell_type_annotation is enabled; minor formatting adjustments.
New Snakemake rule
workflow/rules/_cell_type_annotation/combine_annotation_results.smk
Added combineAnnotationLabels rule that computes input_root_dir from config and wildcards, declares combined_labels.parquet and a log file as outputs, and runs an R container to perform the merge.
Workflow include
workflow/rules/cell_type_annotation.smk
Added include: "_cell_type_annotation/combine_annotation_results.smk" to load the new combine rule into the cell-type annotation ruleset.
R script
workflow/scripts/_cell_type_annotation/combine_annotation_results.R
New script: recursively finds labels.parquet under the provided input root, errors if none found, reads and renames non-cell_id columns by prefixing level/tool/mode inferred from paths, full-joins tables on cell_id, and writes combined_labels.parquet.
sequenceDiagram
    participant Snakemake
    participant Config as Config/FS
    participant Files as FileSystem
    participant Rcont as R container
    Snakemake->>Config: read config (output_path, containers)
    Snakemake->>Files: compute input_root_dir from wildcards
    Snakemake->>Files: locate `labels.parquet` files under input_root_dir
    Snakemake->>Rcont: launch combine_annotation_results.R with inputs/log
    Rcont->>Files: read discovered `labels.parquet` files
    Rcont->>Rcont: parse paths, prefix columns, full_join on `cell_id`
    Rcont->>Files: write `combined_labels.parquet` and log
    Snakemake->>Files: register outputs in DAG
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through folders, one by one,

Found each label and stitched them as one,
I prefixed names so sources stay clear,
Saved a parquet bundle — carrot cheer! 🥕

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: creating combined annotation tables per sample and per reference, which is the primary objective of the PR.
Description check ✅ Passed The description is directly related to the changeset, explaining the purpose of the combined_labels.parquet file and the column naming convention used.
Linked Issues check ✅ Passed The PR addresses issue #98 by implementing combined annotation tables per reference, enabling simpler inclusion of cell-type annotation metadata as requested.
Out of Scope Changes check ✅ Passed All changes are directly related to implementing combined annotation output: new functions, rules, and R script for merging annotation results per reference.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Merge Conflict Detection ✅ Passed ✅ No merge conflicts detected when merging into main

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch combine-annotation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@workflow/rules/_cell_type_annotation/combine_annotation_results.smk`:
- Around line 9-18: The rule currently uses the directory variable root_dir as
an input which creates implicit, untracked dependencies; change the rule to
include explicit inputs for the individual label files produced by the upstream
annotation rule (e.g., the per-sample labels.parquet files) by adding a Python
function (or lambda) that computes and returns the list of expected label file
paths for the given wildcards and include that list alongside root_dir in the
rule's input. Update references in this file (combine_annotation_results.smk) to
use that function instead of relying solely on root_dir, and ensure this aligns
with how get_input2all() and the annotation rule name/outputs compute those
per-sample label paths so Snakemake will always materialize upstream annotation
outputs before combining.

In `@workflow/scripts/_cell_type_annotation/combine_annotation_results.R`:
- Around line 37-39: The pipeline will error when label_files is empty because
purrr::reduce has no .init; change the logic around merged (the pipeline using
label_files %>% map(read_and_annotate) %>% reduce(full_join, by = "cell_id")) to
handle the empty case by either providing a suitable .init to reduce (e.g., an
empty data frame/tibble with the cell_id column) or short-circuiting when
length(label_files) == 0 to return an empty/tidy tibble with the cell_id column;
update references to merged, label_files, read_and_annotate, and
reduce/full_join accordingly.
- Around line 13-18: The current list.files call that populates label_files uses
pattern "labels\\.parquet$" which also matches "combined_labels.parquet"; update
the list.files invocation (the pattern argument used to build label_files) to
only match filenames exactly named "labels.parquet" or explicitly exclude
"combined_labels.parquet" (e.g., tighten the regex to anchor the filename
component to ^labels\\.parquet$ or filter out any path/basename equal to
"combined_labels.parquet" after listing) so the script does not read its own
output.

In `@workflow/Snakefile`:
- Around line 558-592: get_combined_label_outputs() currently builds paths using
only the first two components of annotation (annotation.split("/")[0] and [1])
which collapses different annotation methods under the same path; update the
path construction to include the full annotation hierarchy (e.g., use the entire
annotation split into path segments or the full annotation string as extra path
components) so that distinct annotation_ids like reference_based/ref1/rctd and
reference_based/ref1/singler map to distinct directories; reference symbols to
change: function get_combined_label_outputs, the list comprehension iterating
for (sample, annotation) in samples_annotations, and the use of
annotation.split("/")[0]/[1]; optionally, if you still want to ensure
deterministic uniqueness, post-process the returned list with dict.fromkeys(...)
to deduplicate while preserving order.
🧹 Nitpick comments (1)
workflow/scripts/_cell_type_annotation/combine_annotation_results.R (1)

20-35: Path-based metadata extraction is fragile; consider documenting the expected directory layout.

The function relies on fixed positional offsets from the end of the path (n-1, n-2, n-3) to derive mode, level, and tool. Any change to the directory structure (e.g. an extra nesting level) will silently produce wrong column prefixes. A brief comment or assertion validating the expected structure would make this more robust.

Comment thread workflow/rules/_cell_type_annotation/combine_annotation_results.smk
Comment thread workflow/scripts/_cell_type_annotation/combine_annotation_results.R
Comment thread workflow/scripts/_cell_type_annotation/combine_annotation_results.R
Comment thread workflow/Snakefile
@mariiabilous
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
workflow/scripts/_cell_type_annotation/combine_annotation_results.R (2)

25-40: Positional path parsing is fragile — consider a defensive check.

Lines 31–33 assume the directory structure is always …/<tool>/<level>/<mode>/labels.parquet. If the path layout ever changes (or an unexpected nested directory appears), the extracted components will silently be wrong, producing misleading column prefixes with no error.

A lightweight assertion would catch this early:

🛡️ Suggested defensive guard
   parts <- str_split(path, .Platform$file.sep)[[1]]
   n <- length(parts)
+  if (n < 4) {
+    stop("Unexpected path structure (too few components): ", path)
+  }
   mode <- parts[n - 1]
   level <- parts[n - 2]
   tool  <- parts[n - 3]

38-38: Dense double-evaluation of the column filter — consider simplifying for readability.

The same subsetting expression colnames(df)[colnames(df) != "cell_id"] is evaluated on both LHS and RHS. A small refactor improves clarity:

♻️ Optional readability improvement
-  colnames(df)[colnames(df) != "cell_id"] <- paste0(prefix, "/", colnames(df)[colnames(df) != "cell_id"])
+  cols_to_rename <- setdiff(colnames(df), "cell_id")
+  colnames(df)[colnames(df) %in% cols_to_rename] <- paste0(prefix, "/", cols_to_rename)

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
workflow/scripts/_cell_type_annotation/combine_annotation_results.R (2)

25-45: Path-component extraction is positional and fragile — consider documenting the expected layout.

The function silently assumes the directory structure is .../tool/level/mode/labels.parquet (3 directories above the file). If an upstream rule changes the nesting depth or inserts/removes a directory level, the extracted tool, level, and mode will be silently wrong rather than failing visibly.

A lightweight improvement: add a comment documenting the expected path layout, or validate the extracted components against known values.

📝 Suggested documentation/validation
 read_and_annotate <- function(path) {
   df <- read_parquet(path)
   
-  # Extract tool/level/mode from path
+  # Expected path suffix: .../<tool>/<level>/<mode>/labels.parquet
   parts <- str_split(path, .Platform$file.sep)[[1]]
   n <- length(parts)
   
   if (n < 4) {
     stop("Unexpected path structure (too few components): ", path)
   }
   
   mode <- parts[n - 1]
   level <- parts[n - 2]
   tool  <- parts[n - 3]

47-51: full_join can silently inflate rows if any labels.parquet contains duplicate cell_ids.

If an upstream annotation tool accidentally produces duplicate cell_id values, the full_join will create a cartesian product for those IDs, silently inflating the merged table. Since this is a data-pipeline script where silent row multiplication could be hard to debug, a post-merge sanity check would be cheap insurance.

🛡️ Proposed post-merge assertion
 merged <- label_files %>%
   map(read_and_annotate) %>%
   reduce(full_join, by = "cell_id")
 
+if (anyDuplicated(merged$cell_id)) {
+  warning("Duplicate cell_id entries detected after merge — check input labels.parquet files.")
+}
+
 write_parquet(merged, snakemake@output[["combined"]])

@mariiabilous
Copy link
Copy Markdown
Collaborator Author

@senbaikang , ready from my side

@mariiabilous
Copy link
Copy Markdown
Collaborator Author

TODO: fix input

@senbaikang senbaikang force-pushed the combine-annotation branch from 4a71190 to 67a972b Compare March 9, 2026 11:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Output metadata along with UMAP

1 participant