Create a combined annotation table per sample and per reference.#108
Create a combined annotation table per sample and per reference.#108mariiabilous wants to merge 7 commits into
Conversation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
WalkthroughAdds a new Snakemake rule and R script to discover and merge reference-based Changes
sequenceDiagram
participant Snakemake
participant Config as Config/FS
participant Files as FileSystem
participant Rcont as R container
Snakemake->>Config: read config (output_path, containers)
Snakemake->>Files: compute input_root_dir from wildcards
Snakemake->>Files: locate `labels.parquet` files under input_root_dir
Snakemake->>Rcont: launch combine_annotation_results.R with inputs/log
Rcont->>Files: read discovered `labels.parquet` files
Rcont->>Rcont: parse paths, prefix columns, full_join on `cell_id`
Rcont->>Files: write `combined_labels.parquet` and log
Snakemake->>Files: register outputs in DAG
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@workflow/rules/_cell_type_annotation/combine_annotation_results.smk`:
- Around line 9-18: The rule currently uses the directory variable root_dir as
an input which creates implicit, untracked dependencies; change the rule to
include explicit inputs for the individual label files produced by the upstream
annotation rule (e.g., the per-sample labels.parquet files) by adding a Python
function (or lambda) that computes and returns the list of expected label file
paths for the given wildcards and include that list alongside root_dir in the
rule's input. Update references in this file (combine_annotation_results.smk) to
use that function instead of relying solely on root_dir, and ensure this aligns
with how get_input2all() and the annotation rule name/outputs compute those
per-sample label paths so Snakemake will always materialize upstream annotation
outputs before combining.
In `@workflow/scripts/_cell_type_annotation/combine_annotation_results.R`:
- Around line 37-39: The pipeline will error when label_files is empty because
purrr::reduce has no .init; change the logic around merged (the pipeline using
label_files %>% map(read_and_annotate) %>% reduce(full_join, by = "cell_id")) to
handle the empty case by either providing a suitable .init to reduce (e.g., an
empty data frame/tibble with the cell_id column) or short-circuiting when
length(label_files) == 0 to return an empty/tidy tibble with the cell_id column;
update references to merged, label_files, read_and_annotate, and
reduce/full_join accordingly.
- Around line 13-18: The current list.files call that populates label_files uses
pattern "labels\\.parquet$" which also matches "combined_labels.parquet"; update
the list.files invocation (the pattern argument used to build label_files) to
only match filenames exactly named "labels.parquet" or explicitly exclude
"combined_labels.parquet" (e.g., tighten the regex to anchor the filename
component to ^labels\\.parquet$ or filter out any path/basename equal to
"combined_labels.parquet" after listing) so the script does not read its own
output.
In `@workflow/Snakefile`:
- Around line 558-592: get_combined_label_outputs() currently builds paths using
only the first two components of annotation (annotation.split("/")[0] and [1])
which collapses different annotation methods under the same path; update the
path construction to include the full annotation hierarchy (e.g., use the entire
annotation split into path segments or the full annotation string as extra path
components) so that distinct annotation_ids like reference_based/ref1/rctd and
reference_based/ref1/singler map to distinct directories; reference symbols to
change: function get_combined_label_outputs, the list comprehension iterating
for (sample, annotation) in samples_annotations, and the use of
annotation.split("/")[0]/[1]; optionally, if you still want to ensure
deterministic uniqueness, post-process the returned list with dict.fromkeys(...)
to deduplicate while preserving order.
🧹 Nitpick comments (1)
workflow/scripts/_cell_type_annotation/combine_annotation_results.R (1)
20-35: Path-based metadata extraction is fragile; consider documenting the expected directory layout.The function relies on fixed positional offsets from the end of the path (
n-1,n-2,n-3) to derive mode, level, and tool. Any change to the directory structure (e.g. an extra nesting level) will silently produce wrong column prefixes. A brief comment or assertion validating the expected structure would make this more robust.
134b489 to
0161d28
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
workflow/scripts/_cell_type_annotation/combine_annotation_results.R (2)
25-40: Positional path parsing is fragile — consider a defensive check.Lines 31–33 assume the directory structure is always
…/<tool>/<level>/<mode>/labels.parquet. If the path layout ever changes (or an unexpected nested directory appears), the extracted components will silently be wrong, producing misleading column prefixes with no error.A lightweight assertion would catch this early:
🛡️ Suggested defensive guard
parts <- str_split(path, .Platform$file.sep)[[1]] n <- length(parts) + if (n < 4) { + stop("Unexpected path structure (too few components): ", path) + } mode <- parts[n - 1] level <- parts[n - 2] tool <- parts[n - 3]
38-38: Dense double-evaluation of the column filter — consider simplifying for readability.The same subsetting expression
colnames(df)[colnames(df) != "cell_id"]is evaluated on both LHS and RHS. A small refactor improves clarity:♻️ Optional readability improvement
- colnames(df)[colnames(df) != "cell_id"] <- paste0(prefix, "/", colnames(df)[colnames(df) != "cell_id"]) + cols_to_rename <- setdiff(colnames(df), "cell_id") + colnames(df)[colnames(df) %in% cols_to_rename] <- paste0(prefix, "/", cols_to_rename)
There was a problem hiding this comment.
🧹 Nitpick comments (2)
workflow/scripts/_cell_type_annotation/combine_annotation_results.R (2)
25-45: Path-component extraction is positional and fragile — consider documenting the expected layout.The function silently assumes the directory structure is
.../tool/level/mode/labels.parquet(3 directories above the file). If an upstream rule changes the nesting depth or inserts/removes a directory level, the extractedtool,level, andmodewill be silently wrong rather than failing visibly.A lightweight improvement: add a comment documenting the expected path layout, or validate the extracted components against known values.
📝 Suggested documentation/validation
read_and_annotate <- function(path) { df <- read_parquet(path) - # Extract tool/level/mode from path + # Expected path suffix: .../<tool>/<level>/<mode>/labels.parquet parts <- str_split(path, .Platform$file.sep)[[1]] n <- length(parts) if (n < 4) { stop("Unexpected path structure (too few components): ", path) } mode <- parts[n - 1] level <- parts[n - 2] tool <- parts[n - 3]
47-51:full_joincan silently inflate rows if anylabels.parquetcontains duplicatecell_ids.If an upstream annotation tool accidentally produces duplicate
cell_idvalues, thefull_joinwill create a cartesian product for those IDs, silently inflating the merged table. Since this is a data-pipeline script where silent row multiplication could be hard to debug, a post-merge sanity check would be cheap insurance.🛡️ Proposed post-merge assertion
merged <- label_files %>% map(read_and_annotate) %>% reduce(full_join, by = "cell_id") +if (anyDuplicated(merged$cell_id)) { + warning("Duplicate cell_id entries detected after merge — check input labels.parquet files.") +} + write_parquet(merged, snakemake@output[["combined"]])
|
@senbaikang , ready from my side |
|
TODO: fix input |
4a71190 to
67a972b
Compare
For each reference, all available annotation results are merged into a single file (combined_labels.parquet), stored under
cell_type_annotation/.../<reference_name>/.The output contains one row per cell and multiple annotation result columns. Column names follow the format:
{level}/{method}/{granularity}/{old_column_name}This provides a unified view of all annotation results while preserving their provenance in the column structure.
Closes #98