Skip to content

Deterministic extraction, deduplication hardening, portability, and expanded test coverage#780

Open
hypnwtykvmpr wants to merge 3 commits into
safishamsi:v7from
hypnwtykvmpr:pr-upstream-rollup
Open

Deterministic extraction, deduplication hardening, portability, and expanded test coverage#780
hypnwtykvmpr wants to merge 3 commits into
safishamsi:v7from
hypnwtykvmpr:pr-upstream-rollup

Conversation

@hypnwtykvmpr
Copy link
Copy Markdown

This roll-up bundles several independent lines of work that touch extraction reliability, graph determinism, cross-machine portability, and defense against edge-case regressions.

Deterministic extraction pipeline

  • A source-location pre-pass in deduplication prevents entities sharing the same (source_file, source_location) from being treated as independent, with token-level label containment to avoid wrongly merging same-line symbols.
  • A prune pipeline removes stale references to nodes that no longer exist in the working tree after incremental rebuilds.
  • Deterministic output: manifest.json uses sort_keys; cluster sorting is stabilized so graphify update produces repeatable output on unchanged inputs.

Portability between machines

  • Manifest keys are stored root-relative instead of absolute, so graphify-out/ directories committed to git remain usable when cloned on a different machine.
  • Legacy absolute keys in manifests are normalized at load time, including macOS /var to /private/var symlink resolution.

Edge-case hardening

Eight defects found through systematic review of the build, dedup, detect, extract, watch, and symbol_resolution modules were resolved:

  • Stale nodes from deleted or modified source files no longer survive incremental graph merges.
  • Watch rebuilds now filter preserved nodes whose source files have been removed.
  • build_from_json guards against collapsing parallel edges with different relation types between the same endpoints.
  • Intra-file call resolution detects ambiguous duplicate labels and prefers scope-qualified labels over bare ones.
  • Cross-file call resolvers now track (source, target, relation) triples instead of only (source, target) pairs.
  • Source-location dedup uses token-level rather than character-level substring containment.
  • Incremental detection no longer skips content re-hashing when a file's mtime hasn't changed within the second-resolution filesystem clock.
  • Normalized endpoint ID collision detection with overlap threshold and suffix disambiguation.

Additional file format support

  • .json and .jsonc are now indexed as document types alongside .yaml and .yml for document extraction.

Test coverage

  • Added unit tests spanning cache, dedup, export, ingest, llm_backends, security, tree_html, cluster, detect, manifest, and watch modules.
  • pytest.importorskip guards added for optional dependency imports to keep the suite green on GitHub CI.

69 files changed, +21,176 / -725 lines

…hardening, portability, and expanded test coverage

This roll-up bundles several independent lines of work that touch
extraction reliability, graph determinism, cross-machine portability, and
defense against edge-case regressions.

## Deterministic extraction pipeline

- A source-location pre-pass in deduplication prevents entities sharing
  the same (source_file, source_location) from being treated as
  independent, with token-level label containment to avoid wrongly merging
  same-line symbols like `user` and `userId`.
- A prune pipeline removes stale references to nodes that no longer exist
  in the working tree after incremental rebuilds.
- Deterministic output: manifest.json uses sort_keys; cluster sorting is
  stabilized in _node_key and _edge_key so `graphify update` produces
  repeatable output on unchanged inputs.

## Portability between machines

- Manifest keys are stored root-relative instead of absolute, so
  graphify-out/ directories committed to git remain usable when cloned
  on a different machine.
- Legacy absolute keys in manifests are normalized at load time, including
  macOS /var → /private/var symlink resolution.

## Edge-case hardening

Eight defects found through systematic review of the build, dedup,
detect, extract, watch, and symbol_resolution modules were resolved:

  - Stale nodes from deleted or modified source files no longer survive
    incremental graph merges.
  - Watch rebuilds now filter preserved nodes whose source files have
    been removed.
  - build_from_json guards against collapsing parallel edges with
    different relation types between the same endpoints.
  - Intra-file call resolution detects ambiguous duplicate labels and
    prefers scope-qualified labels over bare ones.
  - Cross-file call resolvers now track (source, target, relation)
    triples instead of only (source, target) pairs.
  - Source-location dedup uses token-level rather than character-level
    substring containment.
  - Incremental detection no longer skips content re-hashing when a
    file's mtime hasn't changed within the second-resolution filesystem
    clock.
  - Normalized endpoint ID collision detection with overlap threshold
    and suffix disambiguation.

## Additional file format support

- .json and .jsonc are now indexed as document types alongside .yaml
  and .yml for document extraction.

## Test coverage

- Added unit tests spanning cache, dedup, export, ingest, llm_backends,
  security, tree_html, cluster, detect, manifest, and watch modules.
- pytest.importorskip guards added for optional dependency imports to
  keep the suite green on GitHub CI.
Copilot AI review requested due to automatic review settings May 8, 2026 11:40
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Jonathan Dirks added 2 commits May 8, 2026 08:41
## Bash/shell script support

Adds shell script (.sh, .bash, .zsh) extraction to the AST pipeline:

- detect.py: shell to _LANG_FAMILY mapping; watched extensions
  include .sh/.bash/.zsh; _is_generated_or_bundle skips
  minified/bundled shell artifacts
- extract.py: full bash extractor using tree-sitter-bash —
  function extraction, call resolution (command_name + arguments),
  source/sourcing edges, variable references, here-doc detection,
  external command suppression (133 common builtins/tools)
- symbol_resolution.py: bash_call_resolver with shlex-aware
  argument parsing; env -S, sudo, exec, xargs chaining support;
  external command lookup in static map
- analyze.py: shell family entry in _LANG_FAMILY for
  cross-language false-positive suppression
- watch.py: shell extensions in watched suffix checks

Test fixtures: sample.sh (functions, calls), lib.sh (library),
imperative.sh (pure commands, no functions), broken.sh (syntax
edge cases)

## ForgeCode install platform

- __main__.py: `forgecode` entry in _PLATFORM_CONFIG using
  generic skill.md; `graphify forgecode install/uninstall`
  CLI handler; platform listed in help text
- test_install.py: install/uninstall test; PLATFORMS dict
  entry; defensive regression assertion

## Other fixes

- test_serve.py: fix coroutine-never-awaited warnings by
  closing coroutine objects after mock asyncio.run patch

1987 tests pass (81% coverage), graphify update . successful
…view

Findings addressed from security review (.dox/2026-05-08-bug-hunt-security-review.md):

F1: Fix graph edge/data loss — reverse-direction pairs in build.py
- build_from_json: seen edge set includes (reversed_source, reversed_target)
  for each edge in undirected mode, preventing distinct relations between
  the same node pair from being dropped when traversed in reverse order

F2: Fix Bash symlink path mismatch in extract.py
- Resolve input paths with os.path.realpath() before remapping to graph labels
- Prevents edge-to-wrong-node when symlinks cause double-resolved path mismatch

F3: Fix unpinned CDN script in generated HTML
- vis-network script tag now includes integrity hash (sha384) and
  crossorigin='anonymous' attributes

F4: Add size cap to MCP/CLI graph loaders
- serve.py _load_graph caps files at 512MB before JSON parsing
- __main__.py _load_graph_json caps files at 512MB before parsing
- Merge driver _load_graph uses separate 50MB cap for report comparison

F5: Wire validate_graph_path into MCP/CLI graph loaders
- serve.py and __main__.py graph loaders now route through security module
- Base parameter passed as parent directory for path-traversal protection
- FileNotFoundError surfaced to callers with clear exit code 1

F6: Validate GitHub clone URL owner/repo against naming rules
- _clone_repo validates owner (max 39 chars) and repo (max 100 chars)
  against alphanumeric+hyphen pattern; rejects '..' and '.' as names

F7: Atomic writes for graph.json and manifest.json
- export.py to_json: writes to .tmp_ file, renames with os.replace
- detect.py save_manifest: writes to .tmp_ file, renames with os.replace
- Prevents partial/corrupt reads during concurrent graphify operations

F8: Node/edge traversal caps in MCP BFS/DFS
- serve.py _MAX_TRAVERSAL_NODES=2000, _MAX_TRAVERSAL_EDGES=5000
- Prevents runaway traversal on pathological graphs

F9: Fix Obsidian color tag to use _obsidian_tag consistently
- export.py uses _obsidian_tag throughout instead of inline label.replace

Additional:
- test_symbol_resolution.py: +6 bash call/source resolver tests
- test_install.py: +5 ForgeCode CLI/uninstall/version/path tests
- test_extract.py: bash extractor edge-case coverage

2004 tests pass, git diff --check clean, graphify update . successful
Copy link
Copy Markdown
Owner

@safishamsi safishamsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the large rollup — there are real improvements here (manifest portability, dedup pre-pass, Leiden seed, community sort stability). However the PR as submitted has a few blockers and is too large to merge safely as a single unit.

Blockers

1. Breaking API change in deduplicate_entities()

The return type changed from (nodes, edges) to (nodes, edges, hyperedges, stats) without being flagged. Any external caller (downstream scripts, integrations, forks) will break silently. This needs to be documented as a breaking change, and all call sites in the repo verified.

2. Edge collapse fix is incomplete

The "first relation wins" logic in build_from_json uses ordered (src, tgt) pairs. For an undirected nx.Graph, a B→A edge arriving after A→B with a different relation will be silently overwritten by NetworkX — the new collision counter won't catch it. The fix doesn't achieve determinism without guaranteed upstream edge ordering.

3. Fork is inaccessible (404)

The source repo is currently returning 404, which prevents full-file context verification for the review. Please ensure the branch is publicly accessible.

4. No CI

No checks ran on this PR. All other merged PRs in this repo have CI green before merge.

Recommended: split into focused PRs

At 21,176 lines in a single commit this exceeds review tooling limits. Each of these is independently mergeable and reviewable:

  • Portability: detect.py / cache.py / watch.py relative-path machinery
  • Dedup pre-pass: _source_location_dedup + prune_graph_references in dedup.py
  • Build determinism: build_from_json edge collapse + sort_keys
  • New modules: symbol_resolution.py, scip_ingest.py, semantic_facts.py, deterministic_docs.py — each deserves its own PR with rationale

Happy to merge each piece once it comes in separately with CI green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants