feat: port 5 sqz-inspired features (SimHash, verifier, dedup cache, delta encoder, TOON, init --only/--skip)#1493

Open
FlorianBruniaux wants to merge 8 commits into develop from feat/sqz-features
Conversation

@FlorianBruniaux
Collaborator

Summary

Ports 6 token-efficiency features from competitive analysis of sqz into RTK. Each module is standalone and tested.

  • SimHash fingerprinting (src/core/simhash.rs) — 64-bit locality-sensitive hash via 3-gram shingles. Used as gate for delta encoding: skip LCS when files are dissimilar.
  • Two-pass compression verifier (src/core/verifier.rs) — 6 invariant checks (min_retention, error_lines, file_paths, json_keys, diff_hunks, numeric_values). error_lines is a hard blocker: returns original if any error lines are dropped. Applied in rtk test output.
  • Session dedup cache (src/core/dedup_cache.rs) — SHA-256 fingerprint + SQLite persistence. Cache hit returns §ref:XXXXXXXX§ (13 tokens instead of full content). 7-day TTL. New commands: rtk expand <hash> and rtk dedup-compact.
  • SimHash+LCS delta encoder (src/core/delta_encoder.rs) — O(n×m) DP with 5000-line guard. Only runs when SimHash distance ≤ 20 (near-duplicates). Format: §delta:HASH§\n-removed\n+added.
  • TOON JSON encoding (src/core/toon.rs) — Lossless compact JSON: simple alphanumeric keys drop quotes, null fields stripped. Applied automatically in rtk json. Falls back to standard compact if TOON doesn't help.
  • rtk init --only/--skip (src/hooks/init.rs) — Filter which agents get hooks installed. Supports aliases (claude-code→claude, roo→cline, gemini-cli→gemini). Validated against KNOWN_AGENTS list.
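
For readers unfamiliar with the SimHash gate, here is a minimal sketch of the idea, not the code in src/core/simhash.rs. It assumes character-level 3-gram shingles and uses the standard library's `DefaultHasher` as a stand-in hash; the names `simhash`, `hamming_distance`, and `is_near_duplicate` are hypothetical. Only the 64-bit width, the 3-gram shingles, and the distance threshold of 20 come from the description above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit SimHash over character 3-gram shingles (assumed shingle unit).
fn simhash(text: &str) -> u64 {
    let chars: Vec<char> = text.chars().collect();
    let mut weights = [0i32; 64];
    for shingle in chars.windows(3) {
        // Stand-in hash; a persisted fingerprint would need a hash that is
        // stable across Rust versions.
        let mut hasher = DefaultHasher::new();
        shingle.hash(&mut hasher);
        let bits = hasher.finish();
        // Each shingle votes +1 or -1 on every bit position.
        for (i, w) in weights.iter_mut().enumerate() {
            if (bits >> i) & 1 == 1 {
                *w += 1;
            } else {
                *w -= 1;
            }
        }
    }
    // A bit is set in the fingerprint when its votes are positive.
    weights
        .iter()
        .enumerate()
        .fold(0u64, |acc, (i, &w)| if w > 0 { acc | (1u64 << i) } else { acc })
}

/// Number of differing bits between two fingerprints.
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Gate for the delta encoder: only near-duplicates (distance <= 20) pass.
fn is_near_duplicate(a: &str, b: &str) -> bool {
    hamming_distance(simhash(a), simhash(b)) <= 20
}
```

Identical inputs always have distance 0, so they pass the gate. The gate matters because the LCS step is quadratic, while computing and comparing two fingerprints is linear in the input size.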

Test plan

  • cargo test --all passes (all new modules have unit tests)
  • cargo clippy --all-targets — zero warnings
  • rtk json tests/fixtures/api_response.json — TOON encoding applied
  • rtk init --only claude — only Claude hook installed
  • rtk init --skip cursor — all agents except Cursor
  • rtk init --only claude --skip cursor — rejected (conflicts_with)
  • rtk dedup-compact — shows cache stats
  • rtk expand <hash> — expands a cached ref
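
The --only/--skip behaviour in the test plan can be sketched as follows. This is an illustration under assumptions, not the code in src/hooks/init.rs: the `KNOWN_AGENTS` contents below are a hypothetical subset built from the agents named in this description, `select_agents` and `canonical` are made-up names, and the conflict check is done by hand here whereas the PR relies on clap's `conflicts_with`.

```rust
/// Hypothetical subset of the agent list; the real KNOWN_AGENTS may differ.
const KNOWN_AGENTS: &[&str] = &["claude", "cline", "cursor", "gemini"];

/// Map an alias to its canonical name and validate it against KNOWN_AGENTS.
fn canonical(name: &str) -> Result<&'static str, String> {
    let lowered = name.trim().to_lowercase();
    let resolved = match lowered.as_str() {
        "claude-code" => "claude",
        "roo" => "cline",
        "gemini-cli" => "gemini",
        other => other,
    };
    KNOWN_AGENTS
        .iter()
        .copied()
        .find(|known| *known == resolved)
        .ok_or_else(|| format!("unknown agent: {name}"))
}

/// Return the agents that should get hooks. Empty slices mean "flag not given".
fn select_agents(only: &[&str], skip: &[&str]) -> Result<Vec<&'static str>, String> {
    if !only.is_empty() && !skip.is_empty() {
        return Err("--only and --skip are mutually exclusive".to_string());
    }
    let only: Vec<&'static str> = only
        .iter()
        .map(|n| canonical(n))
        .collect::<Result<_, _>>()?;
    let skip: Vec<&'static str> = skip
        .iter()
        .map(|n| canonical(n))
        .collect::<Result<_, _>>()?;
    Ok(KNOWN_AGENTS
        .iter()
        .copied()
        .filter(|a| (only.is_empty() || only.contains(a)) && !skip.contains(a))
        .collect())
}
```

With this sketch, `select_agents(&["claude-code"], &[])` resolves the alias and yields only `claude`, and an unknown name is rejected before any hook is installed.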

🤖 Generated with Claude Code

Signed-off-by: Florian BRUNIAUX <florian@bruniaux.com>
Copilot AI review requested due to automatic review settings April 24, 2026 09:54
@pszymkowiak added the effort-large (several days, new module) and enhancement (new feature or request) labels Apr 24, 2026
@pszymkowiak
Collaborator

[w] wshm · Automated triage by AI

📊 Automated PR Analysis

Type: feature
Risk: 🔴 high

Summary

Ports 6 token-efficiency features inspired by competitive analysis of sqz: SimHash fingerprinting, a two-pass compression verifier, a session-level SHA-256 dedup cache with SQLite persistence, a SimHash+LCS delta encoder, TOON lossless compact JSON encoding, and --only/--skip flags for rtk init. Each feature is implemented as a standalone module with unit tests.

Review Checklist

  • Tests present
  • Breaking change
  • Docs updated

Analyzed automatically by wshm · This is an automated analysis, not a human review.

Copilot AI left a comment

Pull request overview

Ports several sqz-inspired token-efficiency features into RTK (new core modules + CLI surface) to improve compression safety (verifier), JSON compaction (TOON), and enable dedup/delta-style workflows.

Changes:

  • Added new core modules: SimHash, delta encoder, dedup cache (SQLite), TOON JSON encoding, and a post-compression verifier.
  • Extended CLI with init --only/--skip, plus expand and dedup-compact commands.
  • Integrated TOON into rtk json and verifier-based fallback into rtk test output filtering.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| tests/fixtures/near_duplicate_a.txt | Fixture for near-duplicate delta tests. |
| tests/fixtures/near_duplicate_b.txt | Fixture for near-duplicate delta tests. |
| tests/fixtures/api_response.json | Fixture for TOON savings tests. |
| src/main.rs | Adds new CLI flags/subcommands and handlers. |
| src/hooks/init.rs | Adds agent selection helpers (KNOWN_AGENTS, aliasing, resolver) + unit tests. |
| src/core/verifier.rs | New post-compression invariant verifier with fallback. |
| src/core/toon.rs | New TOON encoder + null stripping utility + tests. |
| src/core/simhash.rs | New SimHash implementation + tests. |
| src/core/mod.rs | Exposes new core modules. |
| src/core/delta_encoder.rs | New SimHash-gated LCS delta encoder + tests. |
| src/core/dedup_cache.rs | New SQLite-backed dedup cache + tests. |
| src/cmds/system/json_cmd.rs | Attempts TOON pipeline before compact JSON output. |
| src/cmds/rust/runner.rs | Adds verifier-based fallback for rtk test filtering. |

Comment thread src/core/toon.rs
Comment on lines +2 to +3
//! Lossless compact JSON encoding: simple alphanumeric keys lose their quotes,
//! no whitespace around separators, null fields stripped upstream.
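
To make the doc comment above concrete, here is a small self-contained sketch of the encoding it describes. It is an assumption-laden illustration, not the code in src/core/toon.rs: the `Json` enum, `to_toon`, `is_simple_key`, and the escaping rules are hypothetical, and the `TOON:` prefix is inferred from the `starts_with("TOON:")` check elsewhere in this PR. Note that once null fields are dropped, a reader of the output cannot tell an explicit null from a missing key.

```rust
/// Minimal JSON value type for the sketch (the PR most likely uses a JSON crate).
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

/// Assumed definition of a "simple" key: non-empty ASCII alphanumerics or '_'.
fn is_simple_key(key: &str) -> bool {
    !key.is_empty() && key.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Append a double-quoted string with minimal escaping.
fn push_quoted(s: &str, out: &mut String) {
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            _ => out.push(c),
        }
    }
    out.push('"');
}

fn encode(value: &Json, out: &mut String) {
    match value {
        Json::Null => out.push_str("null"),
        Json::Bool(b) => out.push_str(if *b { "true" } else { "false" }),
        Json::Num(n) => out.push_str(&n.to_string()),
        Json::Str(s) => push_quoted(s, out),
        Json::Arr(items) => {
            out.push('[');
            for (i, item) in items.iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                encode(item, out);
            }
            out.push(']');
        }
        Json::Obj(fields) => {
            out.push('{');
            let mut first = true;
            for (key, val) in fields {
                // Null object fields are stripped; nulls inside arrays are kept.
                if matches!(val, Json::Null) {
                    continue;
                }
                if !first {
                    out.push(',');
                }
                first = false;
                if is_simple_key(key) {
                    out.push_str(key);
                } else {
                    push_quoted(key, out);
                }
                out.push(':');
                encode(val, out);
            }
            out.push('}');
        }
    }
}

/// Encode with the assumed "TOON:" prefix.
fn to_toon(value: &Json) -> String {
    let mut out = String::from("TOON:");
    encode(value, &mut out);
    out
}
```

For an object with fields `status = 200`, `user-id = "a1"`, and `next = null`, this sketch produces `TOON:{status:200,"user-id":"a1"}`: the simple key loses its quotes, the hyphenated key keeps them, and the null field disappears.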
Comment thread src/core/verifier.rs
Comment on lines +54 to +65
// Check 2: critical diagnostic lines must be preserved
// Use "error:" with colon to avoid false positive from clippy "1 errors"
let error_lines: Vec<&str> = original
    .lines()
    .filter(|l| {
        let lo = l.to_lowercase();
        lo.contains("error:")
            || lo.contains("warning:")
            || lo.contains("fatal:")
            || lo.contains("panic:")
            || lo.contains("exception:")
    })
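
The excerpt above is cut off before the comparison step, so here is a hedged sketch of how a hard-blocker check of this kind could fall back to the original output. The function names and the substring-based comparison are assumptions; the real verifier may match lines differently.

```rust
/// Same markers as the excerpt above; the trailing colon is part of the marker.
fn is_diagnostic(line: &str) -> bool {
    let lo = line.to_lowercase();
    ["error:", "warning:", "fatal:", "panic:", "exception:"]
        .iter()
        .any(|marker| lo.contains(*marker))
}

/// Return the compressed text only if every diagnostic line of the original
/// still appears in it; otherwise fall back to the original.
/// Assumption: "appears" means the trimmed line occurs as a substring.
fn verify_or_fallback<'a>(original: &'a str, compressed: &'a str) -> &'a str {
    let all_kept = original
        .lines()
        .filter(|line| is_diagnostic(line))
        .all(|line| compressed.contains(line.trim()));
    if all_kept {
        compressed
    } else {
        original
    }
}
```

A compressed output that keeps `error: mismatched types` passes, while one that keeps only `Finished` is rejected and the caller gets the uncompressed text back.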
Comment thread src/core/delta_encoder.rs
Comment on lines +56 to +82
fn lcs_indices(a: &[&str], b: &[&str]) -> Vec<(usize, usize)> {
    let m = a.len();
    let n = b.len();
    let mut dp = vec![vec![0usize; n + 1]; m + 1];
    for i in 1..=m {
        for j in 1..=n {
            if a[i - 1] == b[j - 1] {
                dp[i][j] = dp[i - 1][j - 1] + 1;
            } else {
                dp[i][j] = dp[i - 1][j].max(dp[i][j - 1]);
            }
        }
    }
    let mut result = Vec::new();
    let (mut i, mut j) = (m, n);
    while i > 0 && j > 0 {
        if a[i - 1] == b[j - 1] {
            result.push((i - 1, j - 1));
            i -= 1;
            j -= 1;
        } else if dp[i - 1][j] > dp[i][j - 1] {
            i -= 1;
        } else {
            j -= 1;
        }
    }
    result.reverse();
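
A note on cost, assuming the 5000-line guard applies to each input: the full table holds 5001 × 5001 = 25,010,001 `usize` cells, which is 200,080,008 bytes (about 200 MB) on a 64-bit target. The sketch below shows how the matched index pairs could be turned into the `-removed` / `+added` lines of the delta format from the PR description. `render_delta` is a hypothetical name, `lcs_indices` is repeated so the block compiles on its own, and the output carries no line positions, so on its own it is not enough to reconstruct the new file.

```rust
/// Full-table LCS, as in the excerpt above, returning matched (old, new) indices.
fn lcs_indices(a: &[&str], b: &[&str]) -> Vec<(usize, usize)> {
    let (m, n) = (a.len(), b.len());
    let mut dp = vec![vec![0usize; n + 1]; m + 1];
    for i in 1..=m {
        for j in 1..=n {
            dp[i][j] = if a[i - 1] == b[j - 1] {
                dp[i - 1][j - 1] + 1
            } else {
                dp[i - 1][j].max(dp[i][j - 1])
            };
        }
    }
    let mut result = Vec::new();
    let (mut i, mut j) = (m, n);
    while i > 0 && j > 0 {
        if a[i - 1] == b[j - 1] {
            result.push((i - 1, j - 1));
            i -= 1;
            j -= 1;
        } else if dp[i - 1][j] > dp[i][j - 1] {
            i -= 1;
        } else {
            j -= 1;
        }
    }
    result.reverse();
    result
}

/// Emit a header plus "-line" for removed and "+line" for added lines.
fn render_delta(base_hash: &str, old: &str, new: &str) -> String {
    let a: Vec<&str> = old.lines().collect();
    let b: Vec<&str> = new.lines().collect();
    let mut out = format!("§delta:{base_hash}§");
    let (mut i, mut j) = (0, 0);
    // A sentinel match past the end flushes the trailing unmatched lines.
    let sentinel = std::iter::once((a.len(), b.len()));
    for (mi, mj) in lcs_indices(&a, &b).into_iter().chain(sentinel) {
        for line in &a[i..mi] {
            out.push_str("\n-");
            out.push_str(line);
        }
        for line in &b[j..mj] {
            out.push_str("\n+");
            out.push_str(line);
        }
        i = mi + 1;
        j = mj + 1;
    }
    out
}
```

If memory turns out to be a concern, keeping only two rows of the table gives the LCS length in O(n) space, but recovering the index pairs then needs a different traceback strategy such as Hirschberg's algorithm.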
Comment thread src/core/dedup_cache.rs
}

impl DedupCache {
    pub fn new(db_path: PathBuf) -> Result<Self> {
Comment thread src/main.rs
Comment on lines +2399 to +2406
let db_path = core::config::Config::load()
    .ok()
    .and_then(|c| c.tracking.database_path)
    .unwrap_or_else(|| {
        dirs::data_local_dir()
            .unwrap_or_else(|| std::path::PathBuf::from("."))
            .join("rtk/history.db")
    });
Comment thread src/main.rs
Comment on lines +662 to +671
/// Recover original content referenced by a §ref:HASH§ dedup token
#[command(about = "Recover original content from a §ref:HASH§ dedup token")]
Expand {
    /// The hash prefix (first 8 chars from the §ref:HASH§ token)
    hash: String,
},

/// Evict stale dedup cache entries and show cache stats
#[command(about = "Evict stale dedup cache entries and show stats")]
DedupCompact,
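
To illustrate what these two commands operate on, here is an in-memory stand-in for the dedup cache. It deliberately swaps in simpler parts: a `HashMap` instead of SQLite, the standard library's `DefaultHasher` instead of SHA-256, and no 7-day TTL. `MemDedup` and its methods are hypothetical names; only the `§ref:XXXXXXXX§` token shape and the 8-character prefix come from this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// In-memory stand-in: fingerprint (hex) -> original content.
struct MemDedup {
    entries: HashMap<String, String>,
}

impl MemDedup {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// 16 hex chars from a 64-bit hash (stand-in for a SHA-256 digest).
    fn fingerprint(content: &str) -> String {
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        format!("{:016x}", hasher.finish())
    }

    /// First sighting returns the content itself; repeats return a short ref token.
    fn store_or_ref(&mut self, content: &str) -> String {
        let fp = Self::fingerprint(content);
        if self.entries.contains_key(&fp) {
            format!("§ref:{}§", &fp[..8])
        } else {
            self.entries.insert(fp, content.to_string());
            content.to_string()
        }
    }

    /// Look up by hash prefix; None if nothing matches or the prefix is ambiguous.
    fn expand(&self, prefix: &str) -> Option<&str> {
        let mut hits = self.entries.iter().filter(|(k, _)| k.starts_with(prefix));
        match (hits.next(), hits.next()) {
            (Some((_, content)), None) => Some(content.as_str()),
            _ => None,
        }
    }
}
```

One design point the sketch makes visible: since `expand` receives only a prefix, two entries can share it, and refusing to answer in that case is safer than returning an arbitrary match.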
Comment thread src/main.rs
Comment on lines +1781 to 1783
// --only / --skip: validate agent list early
let _ = hooks::init::resolve_agents(only.as_deref(), skip.as_deref())?;
if show {
Comment thread src/main.rs
Comment on lines +2420 to +2427
let db_path = core::config::Config::load()
    .ok()
    .and_then(|c| c.tracking.database_path)
    .unwrap_or_else(|| {
        dirs::data_local_dir()
            .unwrap_or_else(|| std::path::PathBuf::from("."))
            .join("rtk/history.db")
    });
if toon.starts_with("TOON:") {
    toon
} else {
    filter_json_compact(&content, max_depth)?
test_custom_db_path_env and test_default_db_path both mutate RTK_DB_PATH.
On parallel test runs (Windows CI), test_default_db_path could call
remove_var while the other test had just called set_var, causing
get_db_path() to return the default path instead of the custom one.

Added a static ENV_MUTEX and acquired a guard in both tests so they
run exclusively on all platforms.
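
The locking pattern described in this commit message looks roughly like the sketch below. It is an illustration only: `resolve_db_path` and its `default.db` fallback are hypothetical stand-ins for the real `get_db_path()`, and the two functions mirror what the two tests do rather than reproducing them. The `unsafe` blocks are there because `set_var` and `remove_var` are unsafe in the 2024 edition; on earlier editions they are redundant, hence the lint allowance.

```rust
use std::env;
use std::sync::{Mutex, MutexGuard};

/// Serializes every test that touches RTK_DB_PATH.
static ENV_MUTEX: Mutex<()> = Mutex::new(());

/// Take the lock, recovering it if another test panicked while holding it.
fn lock_env() -> MutexGuard<'static, ()> {
    ENV_MUTEX
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Hypothetical stand-in for get_db_path(): env override, else a default.
fn resolve_db_path() -> String {
    env::var("RTK_DB_PATH").unwrap_or_else(|_| "default.db".to_string())
}

/// Mirrors the custom-path test: set, read, then clean up, all under the lock.
#[allow(unused_unsafe)]
fn with_custom_path() -> String {
    let _guard = lock_env();
    unsafe { env::set_var("RTK_DB_PATH", "/tmp/custom.db") };
    let path = resolve_db_path();
    unsafe { env::remove_var("RTK_DB_PATH") };
    path
}

/// Mirrors the default-path test: clear the variable and read, under the lock.
#[allow(unused_unsafe)]
fn with_default_path() -> String {
    let _guard = lock_env();
    unsafe { env::remove_var("RTK_DB_PATH") };
    resolve_db_path()
}
```

Because the guard lives until the end of each function, a `remove_var` in one test can no longer land between the `set_var` and the read in the other. Recovering from a poisoned lock keeps one failing test from cascading into the next.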

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Florian BRUNIAUX <florian@bruniaux.com>