Deterministic extraction, deduplication hardening, portability, and expanded test coverage#780
Deterministic extraction, deduplication hardening, portability, and expanded test coverage#780hypnwtykvmpr wants to merge 3 commits into
Conversation
…hardening, portability, and expanded test coverage
This roll-up bundles several independent lines of work that touch
extraction reliability, graph determinism, cross-machine portability, and
defense against edge-case regressions.
## Deterministic extraction pipeline
- A source-location pre-pass in deduplication prevents entities sharing
the same (source_file, source_location) from being treated as
independent, with token-level label containment to avoid wrongly merging
same-line symbols like `user` and `userId`.
- A prune pipeline removes stale references to nodes that no longer exist
in the working tree after incremental rebuilds.
- Deterministic output: manifest.json uses sort_keys; cluster sorting is
stabilized in _node_key and _edge_key so `graphify update` produces
repeatable output on unchanged inputs.
## Portability between machines
- Manifest keys are stored root-relative instead of absolute, so
graphify-out/ directories committed to git remain usable when cloned
on a different machine.
- Legacy absolute keys in manifests are normalized at load time, including
macOS /var → /private/var symlink resolution.
## Edge-case hardening
Eight defects found through systematic review of the build, dedup,
detect, extract, watch, and symbol_resolution modules were resolved:
- Stale nodes from deleted or modified source files no longer survive
incremental graph merges.
- Watch rebuilds now filter preserved nodes whose source files have
been removed.
- build_from_json guards against collapsing parallel edges with
different relation types between the same endpoints.
- Intra-file call resolution detects ambiguous duplicate labels and
prefers scope-qualified labels over bare ones.
- Cross-file call resolvers now track (source, target, relation)
triples instead of only (source, target) pairs.
- Source-location dedup uses token-level rather than character-level
substring containment.
- Incremental detection no longer skips content re-hashing when a
file's mtime hasn't changed within the second-resolution filesystem
clock.
- Normalized endpoint ID collision detection with overlap threshold
and suffix disambiguation.
## Additional file format support
- .json and .jsonc are now indexed as document types alongside .yaml
and .yml for document extraction.
## Test coverage
- Added unit tests spanning cache, dedup, export, ingest, llm_backends,
security, tree_html, cluster, detect, manifest, and watch modules.
- pytest.importorskip guards added for optional dependency imports to
keep the suite green on GitHub CI.
## Bash/shell script support Adds shell script (.sh, .bash, .zsh) extraction to the AST pipeline: - detect.py: shell to _LANG_FAMILY mapping; watched extensions include .sh/.bash/.zsh; _is_generated_or_bundle skips minified/bundled shell artifacts - extract.py: full bash extractor using tree-sitter-bash — function extraction, call resolution (command_name + arguments), source/sourcing edges, variable references, here-doc detection, external command suppression (133 common builtins/tools) - symbol_resolution.py: bash_call_resolver with shlex-aware argument parsing; env -S, sudo, exec, xargs chaining support; external command lookup in static map - analyze.py: shell family entry in _LANG_FAMILY for cross-language false-positive suppression - watch.py: shell extensions in watched suffix checks Test fixtures: sample.sh (functions, calls), lib.sh (library), imperative.sh (pure commands, no functions), broken.sh (syntax edge cases) ## ForgeCode install platform - __main__.py: `forgecode` entry in _PLATFORM_CONFIG using generic skill.md; `graphify forgecode install/uninstall` CLI handler; platform listed in help text - test_install.py: install/uninstall test; PLATFORMS dict entry; defensive regression assertion ## Other fixes - test_serve.py: fix coroutine-never-awaited warnings by closing coroutine objects after mock asyncio.run patch 1987 tests pass (81% coverage), graphify update . successful
…view Findings addressed from security review (.dox/2026-05-08-bug-hunt-security-review.md): F1: Fix graph edge/data loss — reverse-direction pairs in build.py - build_from_json: seen edge set includes (reversed_source, reversed_target) for each edge in undirected mode, preventing distinct relations between the same node pair from being dropped when traversed in reverse order F2: Fix Bash symlink path mismatch in extract.py - Resolve input paths with os.path.realpath() before remapping to graph labels - Prevents edge-to-wrong-node when symlinks cause double-resolved path mismatch F3: Fix unpinned CDN script in generated HTML - vis-network script tag now includes integrity hash (sha384) and crossorigin='anonymous' attributes F4: Add size cap to MCP/CLI graph loaders - serve.py _load_graph caps files at 512MB before JSON parsing - __main__.py _load_graph_json caps files at 512MB before parsing - Merge driver _load_graph uses separate 50MB cap for report comparison F5: Wire validate_graph_path into MCP/CLI graph loaders - serve.py and __main__.py graph loaders now route through security module - Base parameter passed as parent directory for path-traversal protection - FileNotFoundError surfaced to callers with clear exit code 1 F6: Validate GitHub clone URL owner/repo against naming rules - _clone_repo validates owner (max 39 chars) and repo (max 100 chars) against alphanumeric+hyphen pattern; rejects '..' and '.' as names F7: Atomic writes for graph.json and manifest.json - export.py to_json: writes to .tmp_ file, renames with os.replace - detect.py save_manifest: writes to .tmp_ file, renames with os.replace - Prevents partial/corrupt reads during concurrent graphify operations F8: Node/edge traversal caps in MCP BFS/DFS - serve.py _MAX_TRAVERSAL_NODES=2000, _MAX_TRAVERSAL_EDGES=5000 - Prevents runaway traversal on pathological graphs F9: Fix Obsidian color tag to use _obsidian_tag consistently - export.py uses _obsidian_tag throughout instead of inline label.replace Additional: - test_symbol_resolution.py: +6 bash call/source resolver tests - test_install.py: +5 ForgeCode CLI/uninstall/version/path tests - test_extract.py: bash extractor edge-case coverage 2004 tests pass, git diff --check clean, graphify update . successful
safishamsi
left a comment
There was a problem hiding this comment.
Thanks for the large rollup — there are real improvements here (manifest portability, dedup pre-pass, Leiden seed, community sort stability). However the PR as submitted has a few blockers and is too large to merge safely as a single unit.
Blockers
1. Breaking API change in deduplicate_entities()
The return type changed from (nodes, edges) to (nodes, edges, hyperedges, stats) without being flagged. Any external caller (downstream scripts, integrations, forks) will break silently. This needs to be documented as a breaking change, and all call sites in the repo verified.
2. Edge collapse fix is incomplete
The "first relation wins" logic in build_from_json uses ordered (src, tgt) pairs. For an undirected nx.Graph, a B→A edge arriving after A→B with a different relation will be silently overwritten by NetworkX — the new collision counter won't catch it. The fix doesn't achieve determinism without guaranteed upstream edge ordering.
3. Fork is inaccessible (404)
The source repo is currently returning 404, which prevents full-file context verification for the review. Please ensure the branch is publicly accessible.
4. No CI
No checks ran on this PR. All other merged PRs in this repo have CI green before merge.
Recommended: split into focused PRs
At 21,176 lines in a single commit this exceeds review tooling limits. Each of these is independently mergeable and reviewable:
- Portability:
detect.py/cache.py/watch.pyrelative-path machinery - Dedup pre-pass:
_source_location_dedup+prune_graph_referencesindedup.py - Build determinism:
build_from_jsonedge collapse +sort_keys - New modules:
symbol_resolution.py,scip_ingest.py,semantic_facts.py,deterministic_docs.py— each deserves its own PR with rationale
Happy to merge each piece once it comes in separately with CI green.
This roll-up bundles several independent lines of work that touch extraction reliability, graph determinism, cross-machine portability, and defense against edge-case regressions.
Deterministic extraction pipeline
graphify updateproduces repeatable output on unchanged inputs.Portability between machines
Edge-case hardening
Eight defects found through systematic review of the build, dedup, detect, extract, watch, and symbol_resolution modules were resolved:
Additional file format support
Test coverage
69 files changed, +21,176 / -725 lines