Skip to content

fix(acp): propagate agent_key through MCP bridge so tools resolve identity#1094

Open
codebit0 wants to merge 14 commits intonextlevelbuilder:devfrom
codebit0:refactor/agent-scope-adapter
Open

fix(acp): propagate agent_key through MCP bridge so tools resolve identity#1094
codebit0 wants to merge 14 commits intonextlevelbuilder:devfrom
codebit0:refactor/agent-scope-adapter

Conversation

@codebit0
Copy link
Copy Markdown
Contributor

@codebit0 codebit0 commented May 4, 2026

Summary

  • Bug: Tools that key off agent_key (sessions_history, sessions_send, session ownership checks) failed with agent context required when invoked through the Gemini ACP path. The MCP bridge headers signed only the agent UUID, so ToolAgentKeyFromCtx returned empty downstream.
  • Fix: bridgeContextMiddleware already loads the agent row to populate ShellDenyGroups. Reuse that lookup to also inject store.WithAgentKey and tools.WithToolAgentKey from ag.AgentKey. No header / HMAC schema change.
  • Defensive fallback: ToolAgentKeyFromCtx adds a third fallback to store.AgentKeyFromContext so any future caller that sets only the store-level key still resolves correctly.
  • Adapter scaffold: Add internal/agent/scope.go with an AgentScope value object + ScopeFromContext adapter. Composes from existing store.* readers + RunContext so it is purely additive — no existing reader is modified. Provides a single read surface for new code (especially across process boundaries) without touching the legacy ctx-key fan-out.

Why this is OO-shaped, not a one-line patch

The recurring class of bugs is Domain Primitive Obsession: agent identity is fanned out across store.AgentKey, tools.ctxAgentKey, RunContext.AgentToolKey, and (for ACP) the MCP-bridge headers — so any boundary forgets one of them. The bridge fix is the immediate symptom fix; AgentScope gives new code a single value-object surface to read from, with the legacy readers as the fallback. Existing call sites are untouched, keeping the upstream merge surface tiny.

Files

  • internal/agent/scope.go (new) — AgentScope value object + adapter
  • internal/agent/scope_test.go (new) — 6 unit tests (zero-value / identity / inject / legacy composition / RunContext fallback / explicit-Inject precedence)
  • internal/gateway/server.go — bridge middleware injects agent_key from the agent row already loaded for shell deny groups
  • internal/tools/context_keys.goToolAgentKeyFromCtx adds store.AgentKeyFromContext as a third fallback

Test plan

  • go build ./... and go build -tags sqliteonly ./... clean
  • go vet ./... clean
  • go test ./internal/agent/ ./internal/tools/ -count=1 pass
  • Smoke: invoke sessions_history from a Gemini ACP agent and confirm it no longer errors agent context required

codebit0 and others added 14 commits May 2, 2026 10:26
- ACP MCP 브리지 노출: Gemini가 skill_search 등 goclaw 빌트인 툴 사용 가능
  - NewBridgeServer가 BuiltinToolStore에서 활성화된 툴 목록을 동적으로 로딩
  - bridgeAlwaysExcluded(spawn, create_forum_topic, heartbeat) 제외 처리
- Gemini McpServer 헤더 포맷 수정: 스펙(object) 대신 Gemini CLI 0.36.x 실제
  구현에 맞춰 []McpServerKV 배열 형식으로 변경
- ACP 퍼-세션 cwd 격리: cli-workspaces와 동일하게 세션별 독립 워크스페이스 생성
- enrichGeminiACPArgs: skills-store 등 5개 스킬 소스 디렉토리를 --include-directories로 추가
- buildACPMcpServersFunc: context 헤더 + HMAC 인증 + UI 등록 MCP 서버 통합
- is_system 시맨틱 정리: UpsertSystemSkill이 is_system을 강제하지 않음
  - INSERT: is_system=false (운영자가 명시적으로 설정)
  - UPDATE: is_system 컬럼 유지 (시더 재실행이 운영자 설정 덮어쓰지 않음)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Python dependency check used a 5s context deadline. Importing a
typical skill bundle (numpy + vectorbt + pandas + psycopg2 + dotenv)
takes ~4.4s on idle and easily crosses 5s under load. On context
deadline, the previous code marked every package as missing and
re-triggered pip install for the whole set, which produced more load
and made the next check time out again — an install loop visible in
journalctl as the same packages being installed every ~30 seconds.

- Raise depCheckTimeout from 5s to 30s.
- On context.DeadlineExceeded, log a warning and return nil (assume
  present) instead of treating every import as missing.
- On other exec errors, surface the captured stderr so future failures
  are diagnosable instead of opaque.
ACP-mode agents (Atlas, Cortana, etc.) reach goclaw tools only through
the /mcp/bridge MCP server, which exposes the subset of tools listed in
builtin_tools with enabled=true. Several tools registered in the Go
tool registry were never seeded into builtin_tools, so ACP agents could
not call them and reported errors like 'datetime tool is not available'.
Native-provider agents read tools.Registry directly and were unaffected,
which masked the gap.

- Seed datetime, delegate, mcp_tool_search, list_group_members,
  vault_read, vault_search in builtinToolSeedData. The seed reconcile
  step (DELETE WHERE name != ALL) means manual INSERTs were wiped on
  every restart; the seed function is the single source of truth.
- Always construct an mcp.Manager (empty when no servers configured)
  and register mcp_tool_search. This lets the BM25 discovery tool be
  available to LLMs immediately; it returns 'no tools found' until
  external MCP servers are added, then becomes useful without a
  redeploy.
…or default args

End-to-end Gemini CLI 0.40.1 ACP fixes, verified live with `datetime` tool
returning correct time text through the full ACP→MCP bridge path.

Changes:

1. ACP spec method name aliases (tool_bridge.go)
   The ACP spec uses snake_case (`fs/read_text_file`,
   `terminal/wait_for_exit`, `session/request_permission`); only Claude CLI
   ever emitted camelCase. Each switch case now accepts both, so newer
   Gemini builds no longer fall through to "unknown method" — the JSON-RPC
   error that Gemini's catch path stringifies as `[object Object]` in
   tool_call_update content because the rejection isn't a JS Error.

2. Kind-based session permission (tool_bridge.go, types.go)
   Replace the legacy `RequestPermissionRequest`/`Response` (flat outcome)
   with new `SessionRequestPermissionRequest`/`Response` that match the
   spec nested-outcome shape: `{outcome: {outcome: "selected", optionId}}`
   or `{outcome: {outcome: "cancelled"}}`. handleSessionPermission selects
   by `PermissionOption.Kind` (allow_once / allow_always / reject_once /
   reject_always) since OptionID is agent-defined and unstable across
   versions. Approve-all prefers allow_once over allow_always.

3. Provider-layer vendor-arg + include-directories injection
   (acp_provider.go, gateway_providers.go)
   Removes the cmd-level `enrichGeminiACPArgs` in favor of two cleaner
   primitives in the providers package, mirroring the architecture in
   merge/2026-04-29:
   - `WithIncludeDirectories(dirs []string)` option: callers pass
     candidate skill directories; NewACPProvider stat-filters and emits
     `--include-directories <dir>` pairs only for gemini.
   - `applyVendorDefaultArgs(binary, args)` helper: appends per-binary
     CLI defaults that goclaw's deployment requires regardless of user
     state. For gemini, injects `--skip-trust` so MCP discovery runs
     even when the per-session cwd inherits DO_NOT_TRUST from
     `~/.gemini/trustedFolders.json` (ACP sessions always run inside a
     goclaw-managed sandbox, so the user-facing trust gate is moot).
   The cmd layer now only enumerates candidate skill dirs via
   `acpSkillCandidateDirs(workspace)` and passes them as an option;
   binary gating and vendor flag emission live in the provider.

4. ACP ClientInfo.Name reset to empty
   Restored to "" (no ClientInfo identity broadcast); aligns with the
   pre-debug state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the ACP process is respawned, resolveSession asks the agent to
restore the previous session via session/load. Some agents (notably
Gemini CLI 0.39.x with gemini-3.1-pro-preview) reply with success but
an empty sessionId — likely because their internal session store was
cleared with the process restart and they have no actual session to
restore.

Previously we stored that empty string as the new session ID, then
sent session/prompt with sid="" which produces a JSON-RPC -32603
"Internal error" on the agent side. The error was suppressed for
external channels (telegram), so users just saw "no response."

Now treat empty sid as a soft failure equivalent to an explicit
load error: fall through to NewSession to get a fresh, valid session
ID. Conversation history is lost on this path (same as a real load
failure), but at least the agent can respond instead of hanging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Restores the per-session watchdog that fires session/cancel after
prompt_timeout (default 10 min) of no session/update activity. Without
this, an ACP agent that hangs silently (network glitch, model stall)
keeps the goroutine blocked indefinitely.

Three-piece port from 1e940b1 onto current source:
- internal/providers/acp/session.go: promptInactivityTimeout package var,
  watchdog goroutine in Prompt() that polls lastActivity and cancels
  when stale
- internal/providers/acp/process.go: promptTimeout field on ACPProcess
  + ProcessPool, SetPromptTimeout/getPromptTimeout for runtime config
- internal/providers/acp_provider.go: WithACPSessionTTL,
  WithACPPromptTimeout options

Also restores per-server logging in session/new (logs each MCP server
URL/command at INFO so misconfig is visible).

Cherry-pick of 1e940b1 — auto-merged cleanly with current acp_provider.go.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ports 36e13bf from nextlevelbuilder#1069. Two related improvements that share the same
plumbing:

1. GEMINI.md as out-of-band system prompt
   writeGeminiMD writes the system prompt to <session-cwd>/GEMINI.md
   once per session. Gemini CLI reads this automatically as project
   context, so we no longer need to repeat the (often >30KB) system
   prompt on every prompt turn — only the current user message is sent.
   When the system prompt content changes, the file is rewritten and
   the live ACP session is invalidated so the next request starts a
   fresh session that loads the updated instructions.

2. isNew flag for session reset history restore
   resolveSession now returns (sid, isNew, err). isNew=true means the
   session was just created via session/new (no prior context known to
   the agent). extractACPContent uses this to decide:
   - isNew=false: send only the current user message (system prompt
     comes from GEMINI.md, history is in agent state)
   - isNew=true:  serialise full conversation transcript (system
     excluded) so episodic summaries + recent turns are restored
     even after a process respawn or session invalidation

Auto-merged cleanly with the empty-sid fallback (18f57c8) — empty
sid path now correctly produces isNew=true.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Generalises the previous gemini-only writeGeminiMD into per-binary
context-file dispatch, fixing two correctness issues for non-gemini
ACP providers:

1. Wrong filename written: every ACP provider was producing GEMINI.md in
   its session workspace — fine for Gemini CLI, ignored by Claude CLI
   (looks for CLAUDE.md) and Codex (AGENTS.md). The file was wasted disk.

2. System prompt loss for non-gemini binaries: extractACPContent's
   isNew=false short-circuit ("only send the current user message —
   GEMINI.md handles system prompt") fired for ALL binaries, including
   ones that don't read GEMINI.md. Continuation turns ran without any
   system prompt, leaving claude/codex agents instruction-less.

Changes:
- ACPProvider gains `binary` and `contextFileName` fields, populated
  from the new `contextFileNameForBinary(binary)` mapping
  (claude→CLAUDE.md, gemini→GEMINI.md, codex→AGENTS.md, else "").
- writeGeminiMD → writeContextFile: writes the binary-appropriate
  filename, skips entirely when contextFileName is empty.
- extractACPContent gains `hasContextFile bool`. When false (unknown
  binary), every turn embeds the system prompt as a [System] block in
  front of the current user message — preserves pre-feature behaviour.
  When true, the existing isNew-aware optimised paths run.
- Chat / ChatStream pass `hasContextFile = p.contextFileName != ""`.

Verified live with atlas (gemini) — GEMINI.md still produced, datetime
tool round-trip works. Non-gemini binaries exercised through unit test
+ vet only (no live claude/codex deployment to test against).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ered runs

Cron jobs / heartbeat ticks / any run path that doesn't carry a request-scoped
locale (no Accept-Language header, no WS connect param) used to default
straight to "en", which forced English output even when the operator and the
agent's persona had a different language preference. atlas with
"Language: 한국어" in its IDENTITY.md was getting English cron meta-prompts
that nudged the LLM into English replies.

Resolution cascade (centralised in internal/i18n):

  request ctx (HTTP Accept-Language, WS connect param)
        ↓
  GOCLAW_DEFAULT_LOCALE env override
        ↓
  POSIX system locale (LC_ALL > LC_MESSAGES > LANG)
        ↓
  "en"

Changes:
- internal/i18n/locale_default.go (new): Default() with resolveDefault()
  cascade, normalizePOSIXLocale() that strips ".UTF-8" / "@modifier" / "_REGION"
  suffixes (ko_KR.UTF-8 → ko). Result is sync.Once-cached;
  ResetDefaultForTest exposed for hermetic env-driven tests.
- internal/i18n/i18n.go: register LocaleKO so the cascade can return "ko".
  Catalog entries themselves arrive separately in nextlevelbuilder#1081; until then i18n.T
  for missing ko keys falls back to English via existing lookup logic.
- internal/store/context.go: LocaleFromContext now delegates to i18n.Default()
  when ctx has no value, instead of hardcoding "en". All existing call sites
  (loop_pipeline_callbacks, loop_finalize, etc) automatically pick up the
  cascade — no callsite changes needed.
- cmd/gateway_cron.go: extract extraPrompt-building into buildCronExtraPrompt.
  Switch on store.LocaleFromContext(ctx) to return ko / vi / zh / en text.
  cron handler itself has no if/else for locale — the cascade lives in i18n.
- Tests: locale_default_test.go covers env priority + POSIX parsing edge
  cases (C, POSIX, unsupported langs). gateway_cron_test.go covers the
  buildCronExtraPrompt locale matrix (ctx wins, env fallback, no-delivery
  variant).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- 백엔드: LocaleKO 상수, IsSupported에 ko 추가, catalog_ko.go 185개 문자열
- 웹 UI: 38개 네임스페이스 JSON 파일 (locales/ko/)
- constants.ts SUPPORTED_LANGUAGES에 ko 추가
- index.ts에 ko 리소스 등록 및 getInitialLanguage에 ko 감지
- en/topbar.json languages에 한국어 옵션 추가
- webui handler: index.html에 no-cache 헤더 추가하여 배포 후 캐시 문제 방지
… levels

Three small operational improvements ported from nextlevelbuilder#1069's 00a817f,
selecting only the parts that don't depend on the channel watchdog
infrastructure (which is a separate feature in nextlevelbuilder#1069 and not yet on dev).

- channels: export MaskBotToken(err) — regex replaces "bot<id>:<secret>/"
  → "bot***/" in error strings. Apply to "failed to start channel" log
  in manager.go and "failed to sync telegram menu commands" log in the
  telegram channel. Without this, journalctl can leak the bot token
  whenever Telegram returns an error containing the upstream URL.

- consolidation: episodic summarize timeout 30s → 120s. ACP-mode
  providers (Gemini CLI etc.) need 5~10s for spawn + initialize +
  session/new before the LLM call begins; 30s often timed out during
  the model response itself, leaving sessions without summaries.

- tools/filesystem: rebalance security.path_escape log levels.
  resolvePathWithAllowed now emits warn (was silent) when the allow-list
  fallback rejects a path; the inner resolvePath drops to debug because
  resolvePathWithAllowed re-emits warn at its boundary, avoiding
  double-warns for routine deny cases.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Strip <current_reply_target> and <extra_context> before persisting to
  per-binary context file (CLAUDE.md / GEMINI.md / AGENTS.md). Those tags
  are turn-specific dynamic content; they're injected inline via
  extractACPContent each turn.

- Compare stable content via sha256 sidecar (.sha256) instead of byte-exact.
  Skip-if-same now triggers no session invalidation when only dynamic
  portions of the system prompt change.

- Net effect: ACP sessions stay alive across chat-target / extra-prompt
  changes, preserving CLI-side context accumulation. Avoids spurious
  session/new round-trips that previously dropped the agent's working memory.

- Add 4 unit tests for stable-content extraction and write idempotence.
…ntity

Symptom: tools that key off agent_key (sessions_history, sessions_send,
session ownership checks) failed with "agent context required" when invoked
through Gemini ACP. The MCP bridge headers signed only the agent UUID, so
ToolAgentKeyFromCtx returned empty downstream.

Bridge fix (gateway/server.go):
- bridgeContextMiddleware already loads the agent row to populate
  ShellDenyGroups. Reuse that lookup to also inject store.WithAgentKey
  and tools.WithToolAgentKey when ag.AgentKey is non-empty. No header /
  HMAC schema change.

Defensive fallback (tools/context_keys.go):
- ToolAgentKeyFromCtx adds a third fallback to store.AgentKeyFromContext
  so any future caller that sets store.WithAgentKey but not the
  tools-package key still resolves correctly.

Adapter scaffold (agent/scope.go):
- Introduce AgentScope value object grouping agent_key, agent_id,
  tenant_id, user_id, session_key, agent_type into one immutable
  snapshot. Inject / ScopeFromContext form a single read surface for
  new code (especially across process boundaries like ACP) without
  modifying existing per-key readers.
- ScopeFromContext composes from existing readers (store.* + RunContext)
  as the legacy fallback, so the value object is purely additive.
- 6 unit tests cover IsZero / HasAgentIdentity / Inject /
  legacy-key composition / RunContext fallback / explicit Inject
  beats legacy.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant