Skip to content

geo_sub fix#110

Open
j-bac wants to merge 5 commits into
mainfrom
geo_fix
Open

geo_sub fix#110
j-bac wants to merge 5 commits into
mainfrom
geo_fix

Conversation

@j-bac
Copy link
Copy Markdown
Collaborator

@j-bac j-bac commented Mar 16, 2026

No description provided.

@j-bac j-bac requested a review from senbaikang March 16, 2026 15:20
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 16, 2026

Warning

Rate limit exceeded

@coderabbitai[bot] has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 17 minutes and 57 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da7a4e16-2efe-4fa1-a371-c846eed3b3f0

📥 Commits

Reviewing files that changed from the base of the PR and between a135435 and 784524f.

📒 Files selected for processing (2)
  • workflow/rules/geo_sub.smk
  • workflow/scripts/_segmentation/clean_segger_predict_results.py

Walkthrough

Adds an optional CSV name-mapping for geo_sub, exposes precomputed TARGET_GEO_IDS, target_id_to_path, and brcode_to_ptcode mappings, renames rule wildcards/outputs from geo_sub_sample_id to geo_sub_id, and replaces getopt parsing with manual argument parsing in run.sh.

Changes

Cohort / File(s) Summary
Documentation
config/README.md
Added name_mapping_csv section describing a two-column CSV to map donor/sample names to replacement names for geo_sub folder creation.
Workflow Rules
workflow/rules/geo_sub.smk
Precompute TARGET_GEO_IDS, target_id_to_path, and brcode_to_ptcode (optional CSV); switch rule wildcards/outputs/logs from geo_sub_sample_idgeo_sub_id; resolve original paths via target_id_to_path; update gatherFilesPerSampleForGeoSub and gatherFilesForGeoSub to expand using geo_sub_id and TARGET_GEO_IDS and to generate _all_files.txt with absolute paths.
Shell Wrapper
run.sh
Replaced getopt-based parsing with manual argument parsing; added bind-dir processing loop to construct SINGULARITY_BIND; preserved snakemake invocation and runtime behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Snakemake as Snakemake Workflow
    participant CSV as name_mapping_csv (optional)
    participant Parser as Mapping Parser
    participant Matcher as ID Matcher
    participant PathLookup as target_id_to_path
    participant Rules as GeoSub Rules

    Snakemake->>CSV: load optional CSV
    CSV-->>Parser: provide rows
    Parser->>Matcher: build brcode_to_ptcode (longest-substring)
    Matcher->>PathLookup: produce target_id_to_path & TARGET_GEO_IDS
    Rules->>PathLookup: request original_path for geo_sub_id
    PathLookup-->>Rules: return original_path
    Rules->>Rules: expand gatherFilesPerSampleForGeoSub / gatherFilesForGeoSub using TARGET_GEO_IDS
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit found a CSV in the glen, 🐇
Hopped names from old to new again,
IDs lined up in tidy rows,
Rules gathered files where the pathway goes,
The workflow hops along — follow when! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'geo_sub fix' is vague and generic, using non-descriptive language that doesn't convey the specific nature or scope of the changes made to the codebase. Provide a more specific title that describes the actual fix or change, such as 'Add optional name mapping CSV for geo_sub sample ID replacement' or 'Refactor geo_sub workflow to support sample name mapping'.
Description check ❓ Inconclusive No pull request description was provided by the author, making it impossible to assess whether the description relates to the changeset. Add a description explaining what the geo_sub fix addresses, what changes were made, and why these changes are necessary.
✅ Passed checks (1 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch geo_fix
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
config/README.md (1)

229-232: Documentation could be more complete.

The documentation is functional but could benefit from a few clarifications:

  1. Indicate whether this configuration is optional (the implementation treats it as optional).
  2. Clarify whether the CSV should have headers or not.
  3. Mention what happens if the file doesn't exist or is invalid.
  4. Minor grammar: "with specified str" → "with new names" or "with specified strings".
📝 Suggested documentation improvement
 ## name_mapping_csv
-Used to replace donor and sample names with specified str (when creating geo_sub folder). 
-Path to csv file with first column containing donor/sample names and second column new names.
+(Optional) Used to replace donor and sample names with new names when creating the geo_sub folder.
+Path to a CSV file (without headers) where the first column contains original donor/sample names and the second column contains the replacement names.
+If the file does not exist or cannot be loaded, the workflow proceeds without name mapping.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/README.md` around lines 229 - 232, Update the name_mapping_csv
documentation to state that the name_mapping_csv configuration is optional,
clarify that the CSV should contain two columns (original name, new name) and
indicate whether headers are allowed or required (e.g., "no header expected" or
"header row ignored"), replace "with specified str" with "with new names" (or
"with specified strings"), and add a short sentence describing runtime behavior
when the file is missing or invalid (e.g., the system skips renaming and logs a
warning/error) and mention the context where it is used (creating geo_sub
folder).
workflow/rules/geo_sub.smk (1)

80-86: Only the last cp command logs errors.

The current shell command only redirects stderr from the final cp to the log file. If an earlier cp fails, the error won't be captured in the log, making debugging harder.

📝 Suggested improvement to capture all errors
     shell:
-        "cp {input.r_img} {output.r_img} && "
-        "cp {input.r_ts} {output.r_ts} && "
-        "cp {input.p_cnt} {output.p_cnt} && "
-        "cp {input.p_cells} {output.p_cells} && "
-        "cp {input.p_cell_boundaries} {output.p_cell_boundaries} && "
-        "cp {input.p_nuc_boundaries} {output.p_nuc_boundaries} &> {log}"
+        "("
+        "cp {input.r_img} {output.r_img} && "
+        "cp {input.r_ts} {output.r_ts} && "
+        "cp {input.p_cnt} {output.p_cnt} && "
+        "cp {input.p_cells} {output.p_cells} && "
+        "cp {input.p_cell_boundaries} {output.p_cell_boundaries} && "
+        "cp {input.p_nuc_boundaries} {output.p_nuc_boundaries}"
+        ") &> {log}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workflow/rules/geo_sub.smk` around lines 80 - 86, The shell block in
workflow/rules/geo_sub.smk currently only redirects stderr from the final cp, so
failures in earlier cp commands won't be logged; modify the shell invocation in
the rule's shell block (the sequence of cp {input.r_img} ... cp
{input.p_nuc_boundaries}) to ensure stderr from all cp commands is captured —
either redirect each cp's stderr to {log} or group the entire sequence (e.g.,
with (...) or { ...; } ) and redirect the group's stderr to {log} so every cp
error is written to the log.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@workflow/rules/geo_sub.smk`:
- Around line 32-36: The loop over sorted_brcodes can cause cascading
replacements because after replacing final_id with brcode_to_ptcode[brcode] it
continues searching and may replace the new text; update the loop in the block
that iterates over sorted_brcodes to stop after the first successful replacement
by adding a break immediately after final_id = final_id.replace(brcode,
brcode_to_ptcode[brcode]) so only one replacement occurs before assigning
target_id_to_path[final_id] = rel_path; use the existing variables
sorted_brcodes, final_id, brcode_to_ptcode, target_id_to_path, and rel_path to
locate and change the code.
- Around line 7-15: The code directly indexes config['name_mapping_csv'] which
can raise KeyError; change to use mapping_file = config.get('name_mapping_csv')
(or default None/''), then guard the load with "if mapping_file and
os.path.exists(mapping_file):" before attempting to read into df_mapping and
build brcode_to_ptcode; keep error handling around pd.read_csv as-is and ensure
brcode_to_ptcode stays an empty dict when mapping_file is unset or missing.

---

Nitpick comments:
In `@config/README.md`:
- Around line 229-232: Update the name_mapping_csv documentation to state that
the name_mapping_csv configuration is optional, clarify that the CSV should
contain two columns (original name, new name) and indicate whether headers are
allowed or required (e.g., "no header expected" or "header row ignored"),
replace "with specified str" with "with new names" (or "with specified
strings"), and add a short sentence describing runtime behavior when the file is
missing or invalid (e.g., the system skips renaming and logs a warning/error)
and mention the context where it is used (creating geo_sub folder).

In `@workflow/rules/geo_sub.smk`:
- Around line 80-86: The shell block in workflow/rules/geo_sub.smk currently
only redirects stderr from the final cp, so failures in earlier cp commands
won't be logged; modify the shell invocation in the rule's shell block (the
sequence of cp {input.r_img} ... cp {input.p_nuc_boundaries}) to ensure stderr
from all cp commands is captured — either redirect each cp's stderr to {log} or
group the entire sequence (e.g., with (...) or { ...; } ) and redirect the
group's stderr to {log} so every cp error is written to the log.
🪄 Autofix (Beta)

✅ Autofix completed


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9b3dff0-205d-4f5b-bbee-18665fe428a5

📥 Commits

Reviewing files that changed from the base of the PR and between e7901ad and b50bab5.

📒 Files selected for processing (2)
  • config/README.md
  • workflow/rules/geo_sub.smk

Comment thread workflow/rules/geo_sub.smk Outdated
Comment thread workflow/rules/geo_sub.smk
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
workflow/rules/geo_sub.smk (1)

32-35: ⚠️ Potential issue | 🟡 Minor

Stop after first match to prevent chained remapping.

At Line 32-Line 35, the loop keeps scanning after a replacement, so a newly substituted token can be replaced again in the same pass.

Proposed fix
     for brcode in sorted_brcodes:
         if brcode in final_id:
             final_id = final_id.replace(brcode, brcode_to_ptcode[brcode])
+            break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workflow/rules/geo_sub.smk` around lines 32 - 35, The loop over
sorted_brcodes can cause chained remapping because after replacing a match in
final_id you continue scanning and may replace the newly substituted token; in
the block that iterates sorted_brcodes and does final_id =
final_id.replace(brcode, brcode_to_ptcode[brcode]) add logic to stop scanning
after the first successful replacement (e.g., break out of the loop or return
immediately) so only the first matching brcode is substituted for final_id;
refer to sorted_brcodes, final_id, and brcode_to_ptcode in the change.
🧹 Nitpick comments (1)
workflow/rules/geo_sub.smk (1)

39-39: Make target expansion order deterministic.

At Line 39, TARGET_GEO_IDS depends on glob.glob discovery order, which can vary and change _all_files.txt ordering between runs.

Proposed fix
-TARGET_GEO_IDS = list(target_id_to_path.keys())
+TARGET_GEO_IDS = sorted(target_id_to_path.keys())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workflow/rules/geo_sub.smk` at line 39, TARGET_GEO_IDS is built from
target_id_to_path keys which inherit nondeterministic glob.glob discovery order;
make expansion deterministic by sorting the keys before assigning TARGET_GEO_IDS
(e.g., replace the current assignment with one that uses sorted(...) on
target_id_to_path.keys()) so subsequent users of TARGET_GEO_IDS get a stable,
reproducible order.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@workflow/rules/geo_sub.smk`:
- Around line 36-39: Before assigning into target_id_to_path, check whether
final_id already exists and handle the collision: in the block that sets
target_id_to_path[final_id] = rel_path, replace the blind assignment with a
guard that tests "if final_id in target_id_to_path" and then either raise an
error (e.g., ValueError including the existing path and the new rel_path) or
otherwise resolve the conflict (e.g., append to a list or include a
disambiguating suffix). Ensure TARGET_GEO_IDS is still derived from the keys
after the guard so collisions are not silently overwritten.

---

Duplicate comments:
In `@workflow/rules/geo_sub.smk`:
- Around line 32-35: The loop over sorted_brcodes can cause chained remapping
because after replacing a match in final_id you continue scanning and may
replace the newly substituted token; in the block that iterates sorted_brcodes
and does final_id = final_id.replace(brcode, brcode_to_ptcode[brcode]) add logic
to stop scanning after the first successful replacement (e.g., break out of the
loop or return immediately) so only the first matching brcode is substituted for
final_id; refer to sorted_brcodes, final_id, and brcode_to_ptcode in the change.

---

Nitpick comments:
In `@workflow/rules/geo_sub.smk`:
- Line 39: TARGET_GEO_IDS is built from target_id_to_path keys which inherit
nondeterministic glob.glob discovery order; make expansion deterministic by
sorting the keys before assigning TARGET_GEO_IDS (e.g., replace the current
assignment with one that uses sorted(...) on target_id_to_path.keys()) so
subsequent users of TARGET_GEO_IDS get a stable, reproducible order.
🪄 Autofix (Beta)

❌ Autofix failed (check again to retry)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 26f2e3d5-fbd8-405e-975e-b856f7284ceb

📥 Commits

Reviewing files that changed from the base of the PR and between b50bab5 and b933e56.

📒 Files selected for processing (1)
  • workflow/rules/geo_sub.smk

Comment thread workflow/rules/geo_sub.smk Outdated
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 16, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

An unexpected error occurred while generating fixes: Update is not a fast forward - https://docs.github.com/rest/git/refs#update-a-reference

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 16, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 2 file(s) based on 1 unresolved review comment.

Files modified:

  • run.sh
  • workflow/rules/geo_sub.smk

Commit: a5ecf94332f528bc089e2c207be4607de10a1c80

The changes have been pushed to the geo_fix branch.

Time taken: 2m 9s

Fixed 2 file(s) based on 1 unresolved review comment.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 16, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 1 file(s) based on 1 unresolved review comment.

Files modified:

  • workflow/rules/geo_sub.smk

Commit: a1354352f89e7460054652585e9ccff9258936b0

The changes have been pushed to the geo_fix branch.

Time taken: 2m 26s

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 16, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

An unexpected error occurred while generating fixes: Update is not a fast forward - https://docs.github.com/rest/git/refs#update-a-reference

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@run.sh`:
- Around line 58-59: The while-loop's case statement in run.sh falls through to
the *) handler when $1 is empty; add an explicit empty-argument guard at the top
of the case (e.g., a '') pattern that does a break so the loop exits cleanly
when arguments are exhausted. Locate the while ... case ... esac block (the loop
that checks $1) and insert the empty-string case before the other patterns to
prevent the "Unknown option" *) branch from running when there are no more args.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 85ebecc8-4f37-4c49-9652-a39ec52f03c7

📥 Commits

Reviewing files that changed from the base of the PR and between b933e56 and a5ecf94.

📒 Files selected for processing (2)
  • run.sh
  • workflow/rules/geo_sub.smk

Comment thread run.sh
Fixed 1 file(s) based on 1 unresolved review comment.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
workflow/rules/geo_sub.smk (1)

86-91: Redirect logs for the full copy chain, not only the last command.

Current &> {log} applies only to the final cp, so diagnostics from earlier copy steps are missed. Wrap the chain in a grouped command and redirect once.

Proposed fix
-    shell:
-        "cp {input.r_img} {output.r_img} && "
-        "cp {input.r_ts} {output.r_ts} && "
-        "cp {input.p_cnt} {output.p_cnt} && "
-        "cp {input.p_cells} {output.p_cells} && "
-        "cp {input.p_cell_boundaries} {output.p_cell_boundaries} && "
-        "cp {input.p_nuc_boundaries} {output.p_nuc_boundaries} &> {log}"
+    shell:
+        "( "
+        "cp {input.r_img} {output.r_img} && "
+        "cp {input.r_ts} {output.r_ts} && "
+        "cp {input.p_cnt} {output.p_cnt} && "
+        "cp {input.p_cells} {output.p_cells} && "
+        "cp {input.p_cell_boundaries} {output.p_cell_boundaries} && "
+        "cp {input.p_nuc_boundaries} {output.p_nuc_boundaries} "
+        ") &> {log}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workflow/rules/geo_sub.smk` around lines 86 - 91, The sequence of multiple
"cp" commands in the rule currently only redirects output of the final cp to
{log}; group the entire chain (the lines building the shell command string that
include "cp {input.r_img} {output.r_img}" through "cp {input.p_nuc_boundaries}
{output.p_nuc_boundaries}") into a single grouped shell command (e.g., wrap with
parentheses or braces) and apply a single redirection to {log} so all commands'
stdout/stderr are captured; locate the command string in geo_sub.smk and update
it to perform the grouping and single &> {log} redirect.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@workflow/rules/geo_sub.smk`:
- Around line 24-45: The discovery is nondeterministic because found_dirs comes
from glob.glob and dict key order may vary; make the process deterministic by
iterating over a sorted list of found_dirs (use sorted(found_dirs)) when
building target_id_to_path and then export TARGET_GEO_IDS as a sorted list
(e.g., TARGET_GEO_IDS = sorted(target_id_to_path.keys())) so final_id generation
(final_id, target_id_to_path) and downstream outputs are stable across runs.

---

Nitpick comments:
In `@workflow/rules/geo_sub.smk`:
- Around line 86-91: The sequence of multiple "cp" commands in the rule
currently only redirects output of the final cp to {log}; group the entire chain
(the lines building the shell command string that include "cp {input.r_img}
{output.r_img}" through "cp {input.p_nuc_boundaries} {output.p_nuc_boundaries}")
into a single grouped shell command (e.g., wrap with parentheses or braces) and
apply a single redirection to {log} so all commands' stdout/stderr are captured;
locate the command string in geo_sub.smk and update it to perform the grouping
and single &> {log} redirect.
🪄 Autofix (Beta)

✅ Autofix completed


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ebb447a2-298d-47dc-8491-4f0af945e582

📥 Commits

Reviewing files that changed from the base of the PR and between a5ecf94 and a135435.

📒 Files selected for processing (1)
  • workflow/rules/geo_sub.smk

Comment thread workflow/rules/geo_sub.smk Outdated
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 16, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 2 file(s) based on 1 unresolved review comment.

Files modified:

  • workflow/rules/geo_sub.smk
  • workflow/scripts/_segmentation/clean_segger_predict_results.py

Commit: 784524fb26ee1a11ce5682018a9e24403de7d50a

The changes have been pushed to the geo_fix branch.

Time taken: 2m 23s

Fixed 2 file(s) based on 1 unresolved review comment.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant