feat: port 5 sqz-inspired features (SimHash, verifier, dedup cache, delta encoder, TOON, init --only/--skip)#1493
FlorianBruniaux wants to merge 8 commits into develop
Signed-off-by: Florian BRUNIAUX <florian@bruniaux.com>
📊 Automated PR Analysis
Summary
Ports 6 token-efficiency features inspired by competitive analysis of sqz: SimHash fingerprinting, a two-pass compression verifier, a session-level SHA-256 dedup cache with SQLite persistence, a SimHash+LCS delta encoder, TOON lossless compact JSON encoding, and --only/--skip flags for rtk init. Each feature is implemented as a standalone module with unit tests.
Analyzed automatically by wshm · This is an automated analysis, not a human review.
Pull request overview
Ports several sqz-inspired token-efficiency features into RTK (new core modules + CLI surface) to improve compression safety (verifier), JSON compaction (TOON), and enable dedup/delta-style workflows.
Changes:
- Added new core modules: SimHash, delta encoder, dedup cache (SQLite), TOON JSON encoding, and a post-compression verifier.
- Extended CLI with `init --only/--skip`, plus `expand` and `dedup-compact` commands.
- Integrated TOON into `rtk json` and verifier-based fallback into `rtk test` output filtering.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| tests/fixtures/near_duplicate_a.txt | Fixture for near-duplicate delta tests. |
| tests/fixtures/near_duplicate_b.txt | Fixture for near-duplicate delta tests. |
| tests/fixtures/api_response.json | Fixture for TOON savings tests. |
| src/main.rs | Adds new CLI flags/subcommands and handlers. |
| src/hooks/init.rs | Adds agent selection helpers (KNOWN_AGENTS, aliasing, resolver) + unit tests. |
| src/core/verifier.rs | New post-compression invariant verifier with fallback. |
| src/core/toon.rs | New TOON encoder + null stripping utility + tests. |
| src/core/simhash.rs | New SimHash implementation + tests. |
| src/core/mod.rs | Exposes new core modules. |
| src/core/delta_encoder.rs | New SimHash-gated LCS delta encoder + tests. |
| src/core/dedup_cache.rs | New SQLite-backed dedup cache + tests. |
| src/cmds/system/json_cmd.rs | Attempts TOON pipeline before compact JSON output. |
| src/cmds/rust/runner.rs | Adds verifier-based fallback for rtk test filtering. |

src/core/toon.rs

    //! Lossless compact JSON encoding: simple alphanumeric keys lose their quotes,
    //! no whitespace around separators, null fields stripped upstream.
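
To make the module doc comment above concrete, here is a minimal sketch of that kind of encoding. It is not the PR's implementation: the `Json` enum, `toon_encode`, and the choice to strip nulls inside the encoder (the PR strips them upstream) are assumptions made for illustration, and the `TOON:` prefix is taken from the `starts_with("TOON:")` check in `json_cmd.rs`.

```rust
use std::collections::BTreeMap;

/// Tiny JSON value used only for this sketch (hypothetical; not the PR's type).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// A key is "simple" when it is non-empty and purely ASCII alphanumeric.
fn is_simple_key(k: &str) -> bool {
    !k.is_empty() && k.chars().all(|c| c.is_ascii_alphanumeric())
}

/// Quote a string, escaping only the characters this sketch cares about.
fn quote(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out.push('"');
    out
}

fn encode(v: &Json, out: &mut String) {
    match v {
        Json::Null => out.push_str("null"),
        Json::Bool(b) => out.push_str(if *b { "true" } else { "false" }),
        Json::Num(n) => out.push_str(&n.to_string()),
        Json::Str(s) => out.push_str(&quote(s)),
        Json::Arr(items) => {
            out.push('[');
            for (i, item) in items.iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                encode(item, out);
            }
            out.push(']');
        }
        Json::Obj(map) => {
            out.push('{');
            let mut first = true;
            for (k, val) in map {
                // Assumption: null fields are dropped here rather than upstream.
                if matches!(val, Json::Null) {
                    continue;
                }
                if !first {
                    out.push(',');
                }
                first = false;
                // Simple keys lose their quotes; anything else stays quoted.
                if is_simple_key(k) {
                    out.push_str(k);
                } else {
                    out.push_str(&quote(k));
                }
                out.push(':');
                encode(val, out);
            }
            out.push('}');
        }
    }
}

/// Encode with no whitespace around separators, prefixed with `TOON:`.
fn toon_encode(v: &Json) -> String {
    let mut out = String::from("TOON:");
    encode(v, &mut out);
    out
}
```

Under these assumptions, an object with keys `id`, `note` (null) and `user name` encodes with `id` unquoted, `note` dropped, and `"user name"` still quoted because of the space.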

src/core/verifier.rs

    // Check 2: critical diagnostic lines must be preserved
    // Use "error:" with colon to avoid false positive from clippy "1 errors"
    let error_lines: Vec<&str> = original
        .lines()
        .filter(|l| {
            let lo = l.to_lowercase();
            lo.contains("error:")
                || lo.contains("warning:")
                || lo.contains("fatal:")
                || lo.contains("panic:")
                || lo.contains("exception:")
        })
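
The hard-blocker behaviour of this check can be sketched as follows. This is an illustration only, assuming a hypothetical `verify_or_fallback` helper that compares diagnostic lines by substring; the PR's actual verifier runs six checks and may match lines differently.

```rust
/// True when a line carries one of the diagnostic markers from the snippet above.
fn is_diagnostic(line: &str) -> bool {
    let lo = line.to_lowercase();
    ["error:", "warning:", "fatal:", "panic:", "exception:"]
        .iter()
        .any(|marker| lo.contains(marker))
}

/// Hypothetical helper: keep `compressed` only if every diagnostic line of
/// `original` still appears in it; otherwise fall back to `original`.
fn verify_or_fallback<'a>(original: &'a str, compressed: &'a str) -> &'a str {
    let all_kept = original
        .lines()
        .filter(|l| is_diagnostic(l))
        .all(|l| compressed.contains(l.trim()));
    if all_kept {
        compressed
    } else {
        original
    }
}
```

With this shape, a filter that drops `error: mismatched types` gets its output replaced by the unfiltered text, while a line such as `1 errors emitted` does not count as a diagnostic because it lacks the colon.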

src/core/delta_encoder.rs

    fn lcs_indices(a: &[&str], b: &[&str]) -> Vec<(usize, usize)> {
        let m = a.len();
        let n = b.len();
        let mut dp = vec![vec![0usize; n + 1]; m + 1];
        for i in 1..=m {
            for j in 1..=n {
                if a[i - 1] == b[j - 1] {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = dp[i - 1][j].max(dp[i][j - 1]);
                }
            }
        }
        let mut result = Vec::new();
        let (mut i, mut j) = (m, n);
        while i > 0 && j > 0 {
            if a[i - 1] == b[j - 1] {
                result.push((i - 1, j - 1));
                i -= 1;
                j -= 1;
            } else if dp[i - 1][j] > dp[i][j - 1] {
                i -= 1;
            } else {
                j -= 1;
            }
        }
        result.reverse();
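
For context, here is one way the matched index pairs could be turned into the `§delta:HASH§\n-removed\n+added` format named in the PR description. `render_delta` is a hypothetical name and the line ordering within a gap is an assumption; `lcs_indices` is repeated (completed with its return value) only so the block compiles on its own.

```rust
/// Same DP as the snippet above: matched (index_in_a, index_in_b) pairs.
fn lcs_indices(a: &[&str], b: &[&str]) -> Vec<(usize, usize)> {
    let (m, n) = (a.len(), b.len());
    let mut dp = vec![vec![0usize; n + 1]; m + 1];
    for i in 1..=m {
        for j in 1..=n {
            dp[i][j] = if a[i - 1] == b[j - 1] {
                dp[i - 1][j - 1] + 1
            } else {
                dp[i - 1][j].max(dp[i][j - 1])
            };
        }
    }
    let mut result = Vec::new();
    let (mut i, mut j) = (m, n);
    while i > 0 && j > 0 {
        if a[i - 1] == b[j - 1] {
            result.push((i - 1, j - 1));
            i -= 1;
            j -= 1;
        } else if dp[i - 1][j] > dp[i][j - 1] {
            i -= 1;
        } else {
            j -= 1;
        }
    }
    result.reverse();
    result
}

/// Hypothetical renderer: lines outside the LCS become `-old` / `+new` entries.
fn render_delta(hash: &str, old: &str, new: &str) -> String {
    let a: Vec<&str> = old.lines().collect();
    let b: Vec<&str> = new.lines().collect();
    let mut out = format!("§delta:{}§", hash);
    let (mut i, mut j) = (0usize, 0usize);
    // A sentinel pair past the end flushes the trailing unmatched lines.
    let sentinel = std::iter::once((a.len(), b.len()));
    for (pi, pj) in lcs_indices(&a, &b).into_iter().chain(sentinel) {
        for line in &a[i..pi] {
            out.push_str("\n-");
            out.push_str(line);
        }
        for line in &b[j..pj] {
            out.push_str("\n+");
            out.push_str(line);
        }
        i = pi + 1;
        j = pj + 1;
    }
    out
}
```

The sketch omits the 5000-line guard and the SimHash gate mentioned in the description; both would sit in front of the `lcs_indices` call.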

src/core/dedup_cache.rs

    }

    impl DedupCache {
        pub fn new(db_path: PathBuf) -> Result<Self> {

src/main.rs

    let db_path = core::config::Config::load()
        .ok()
        .and_then(|c| c.tracking.database_path)
        .unwrap_or_else(|| {
            dirs::data_local_dir()
                .unwrap_or_else(|| std::path::PathBuf::from("."))
                .join("rtk/history.db")
        });

src/main.rs

    /// Recover original content referenced by a §ref:HASH§ dedup token
    #[command(about = "Recover original content from a §ref:HASH§ dedup token")]
    Expand {
        /// The hash prefix (first 8 chars from the §ref:HASH§ token)
        hash: String,
    },

    /// Evict stale dedup cache entries and show cache stats
    #[command(about = "Evict stale dedup cache entries and show stats")]
    DedupCompact,

src/main.rs

    // --only / --skip: validate agent list early
    let _ = hooks::init::resolve_agents(only.as_deref(), skip.as_deref())?;
    if show {
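
A rough sketch of what a resolver like `resolve_agents` might do, for readers without the diff open. Everything here is assumed for illustration: the contents of `KNOWN_AGENTS`, comma-separated input, the `String` error type, and the `parse_list` helper. Only the alias pairs and the `Option<&str>` arguments come from the PR text.

```rust
/// Assumed agent list; the real `KNOWN_AGENTS` in src/hooks/init.rs may differ.
const KNOWN_AGENTS: &[&str] = &["claude", "cline", "cursor", "gemini"];

/// Map the aliases named in the PR description to canonical agent names.
fn canonical(name: &str) -> &str {
    match name {
        "claude-code" => "claude",
        "roo" => "cline",
        "gemini-cli" => "gemini",
        other => other,
    }
}

/// Hypothetical helper: split a comma-separated list, resolve aliases,
/// and reject anything not in `KNOWN_AGENTS`.
fn parse_list(raw: &str) -> Result<Vec<String>, String> {
    raw.split(',')
        .map(|s| s.trim().to_lowercase())
        .filter(|s| !s.is_empty())
        .map(|s| {
            let c = canonical(&s).to_string();
            if KNOWN_AGENTS.contains(&c.as_str()) {
                Ok(c)
            } else {
                Err(format!("unknown agent: {}", s))
            }
        })
        .collect()
}

/// Return the agents to install hooks for, in `KNOWN_AGENTS` order.
/// `only` keeps just the listed agents; `skip` removes the listed agents.
fn resolve_agents(only: Option<&str>, skip: Option<&str>) -> Result<Vec<String>, String> {
    let only = only.map(parse_list).transpose()?;
    let skip = skip.map(parse_list).transpose()?.unwrap_or_default();
    Ok(KNOWN_AGENTS
        .iter()
        .map(|a| a.to_string())
        .filter(|a| only.as_ref().map_or(true, |o| o.contains(a)))
        .filter(|a| !skip.contains(a))
        .collect())
}
```

Validating both lists before any hook is written, as the call site above does, means a typo in an agent name fails fast instead of leaving a partial install.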

src/cmds/system/json_cmd.rs

    if toon.starts_with("TOON:") {
        toon
    } else {
        filter_json_compact(&content, max_depth)?
test_custom_db_path_env and test_default_db_path both mutate RTK_DB_PATH. On parallel test runs (Windows CI), test_default_db_path could call remove_var while the other test had just called set_var, causing get_db_path() to return the default path instead of the custom one.
Added a static ENV_MUTEX and acquired a guard in both tests so they run exclusively on all platforms.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Florian BRUNIAUX <florian@bruniaux.com>
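
The locking pattern this commit describes looks roughly like the sketch below. The `get_db_path` body and the two check functions are stand-ins written for illustration (in a real suite they would be `#[test]` functions); only `ENV_MUTEX` and `RTK_DB_PATH` are names from the commit message.

```rust
use std::env;
use std::sync::Mutex;

/// Serializes every test that touches the process-wide RTK_DB_PATH variable.
static ENV_MUTEX: Mutex<()> = Mutex::new(());

/// Hypothetical stand-in for the real `get_db_path()`.
fn get_db_path() -> String {
    env::var("RTK_DB_PATH").unwrap_or_else(|_| "rtk/history.db".to_string())
}

/// Would be `#[test] fn test_custom_db_path_env()` in a test module.
fn check_custom_db_path() {
    // Recover the guard even if another test panicked while holding it.
    let _guard = ENV_MUTEX.lock().unwrap_or_else(|e| e.into_inner());
    // `set_var`/`remove_var` are `unsafe` in the 2024 edition; on older
    // editions the block is redundant but harmless.
    unsafe { env::set_var("RTK_DB_PATH", "/tmp/custom.db") };
    assert_eq!(get_db_path(), "/tmp/custom.db");
    unsafe { env::remove_var("RTK_DB_PATH") };
}

/// Would be `#[test] fn test_default_db_path()` in a test module.
fn check_default_db_path() {
    let _guard = ENV_MUTEX.lock().unwrap_or_else(|e| e.into_inner());
    unsafe { env::remove_var("RTK_DB_PATH") };
    assert_eq!(get_db_path(), "rtk/history.db");
}
```

Because the guard is held for the whole body, the set, read, and remove steps of one test can no longer interleave with the other, regardless of how many test threads the harness uses.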
Summary
Ports 6 token-efficiency features from competitive analysis of sqz into RTK. Each module is standalone and tested.
- SimHash fingerprinting (`src/core/simhash.rs`): 64-bit locality-sensitive hash via 3-gram shingles. Used as a gate for delta encoding: skip LCS when files are dissimilar.
- Compression verifier (`src/core/verifier.rs`): 6 invariant checks (`min_retention`, `error_lines`, `file_paths`, `json_keys`, `diff_hunks`, `numeric_values`). `error_lines` is a hard blocker: returns the original if any error lines are dropped. Applied in `rtk test` output.
- Dedup cache (`src/core/dedup_cache.rs`): SHA-256 fingerprint + SQLite persistence. A cache hit returns `§ref:XXXXXXXX§` (13 tokens instead of full content). 7-day TTL. New commands: `rtk expand <hash>` and `rtk dedup-compact`.
- Delta encoder (`src/core/delta_encoder.rs`): O(n×m) LCS DP with a 5000-line guard. Only runs when SimHash distance ≤ 20 (near-duplicates). Format: `§delta:HASH§\n-removed\n+added`.
- TOON (`src/core/toon.rs`): Lossless compact JSON: simple alphanumeric keys drop quotes, null fields stripped. Applied automatically in `rtk json`. Falls back to standard compact if TOON doesn't help.
- `rtk init --only/--skip` (`src/hooks/init.rs`): Filter which agents get hooks installed. Supports aliases (claude-code→claude, roo→cline, gemini-cli→gemini). Validated against the `KNOWN_AGENTS` list.
Test plan
- `cargo test --all` passes (all new modules have unit tests)
- `cargo clippy --all-targets`: zero warnings
- `rtk json tests/fixtures/api_response.json`: TOON encoding applied
- `rtk init --only claude`: only Claude hook installed
- `rtk init --skip cursor`: all agents except Cursor
- `rtk init --only claude --skip cursor`: rejected (conflicts_with)
- `rtk dedup-compact`: shows cache stats
- `rtk expand <hash>`: expands a cached ref
🤖 Generated with Claude Code
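
For readers unfamiliar with SimHash, the gate described in the summary (64-bit hash over 3-gram shingles, delta encoding only when the distance is ≤ 20) can be sketched as below. This is not the PR's code: character-level shingles, the use of the standard library's `DefaultHasher`, and the function names are all assumptions; only the 64-bit width, the 3-gram shingling, and the threshold of 20 come from the description.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one shingle to 64 bits (assumption: std's DefaultHasher).
fn hash64(shingle: &[char]) -> u64 {
    let mut h = DefaultHasher::new();
    shingle.hash(&mut h);
    h.finish()
}

/// 64-bit SimHash over character 3-gram shingles.
fn simhash(text: &str) -> u64 {
    let chars: Vec<char> = text.chars().collect();
    let mut votes = [0i32; 64];
    for shingle in chars.windows(3) {
        let h = hash64(shingle);
        // Each shingle votes +1 or -1 on every bit position.
        for (bit, vote) in votes.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *vote += 1;
            } else {
                *vote -= 1;
            }
        }
    }
    // A bit is set in the fingerprint when its votes are net positive.
    votes
        .iter()
        .enumerate()
        .fold(0u64, |acc, (bit, &v)| if v > 0 { acc | (1u64 << bit) } else { acc })
}

/// Number of differing bits between two fingerprints.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Gate for the delta encoder: only near-duplicates go through LCS.
fn is_near_duplicate(a: &str, b: &str) -> bool {
    hamming(simhash(a), simhash(b)) <= 20
}
```

The point of the gate is cost: comparing two 64-bit fingerprints is a single XOR and popcount, so dissimilar inputs never reach the O(n×m) LCS table.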