feat(crawler): queue-backed manager revalidation fan-out #4210

Open
bokelley wants to merge 1 commit into main from claude/issue-4200-manager-revalidation-queue


@bokelley bokelley commented May 8, 2026

Closes #4200 item 2. When a manager rotates its `adagents.json`, every publisher delegating via ads.txt `MANAGERDOMAIN` needs re-validation so its `authorized_agents` view stays in sync. Inline fan-out at managed-network scale (Raptive ≈ 6K publishers) would saturate crawler concurrency, so this PR adds a persistent queue and a bounded worker tick.

What lands

  • Migration 471: `manager_revalidation_queue` table mirroring the shape of `catalog_crawl_queue` (367), with idempotent insert, `next_attempt_after` for backoff, a partial index for the worker's due-row scan, and a second index on `manager_domain` for ops-side per-manager status.
  • `cacheAdagentsManifest` change-detect: reads the previously-cached body before the upsert and compares the contributory subset (`authorized_agents`, `properties`) via recursive stable-key canonicalization. Only actual content drift triggers the fan-out — `$schema` / `last_updated` / trailing-comment noise is intentionally ignored.
  • `enqueueManagerRevalidation` walks `publishers WHERE manager_domain = $1` via the partial index added in feat(registry): persist managerdomain discovery provenance on publisher rows #4204, so a Raptive-scale rotation enumerates 6K delegating publishers via an index-only scan. Insert is idempotent and resets attempts/backoff on re-enqueue so a fresh manager change supersedes any in-flight retry window.
  • `processManagerRevalidationQueue` worker tick drains up to 50 rows per 5-minute interval at concurrency 10. Success deletes the row; failure advances exponential backoff (1h / 6h / 1d / 3d) and stores `last_error` truncated to 500 chars.
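
The queue semantics in the list above (idempotent enqueue with reset, due-row selection, backoff on failure, deletion on success) can be sketched as a small in-memory model. This is an illustration only, not the PR's SQL: the class name, the field names, the `(manager, publisher)` uniqueness key, ordering by `nextAttemptAfter`, clearing `lastError` on re-enqueue, and clamping at the last backoff step after the fourth failure are all assumptions. Concurrency is not modeled.

```typescript
// Hypothetical in-memory model of the queue semantics; not the PR's SQL.
interface QueueRow {
  managerDomain: string;
  publisherDomain: string;
  attempts: number;
  nextAttemptAfter: number; // epoch milliseconds
  lastError: string | null;
}

const HOUR_MS = 60 * 60 * 1000;
// Backoff steps quoted in the PR text: 1h / 6h / 1d / 3d.
const BACKOFF_MS = [1, 6, 24, 72].map((hours) => hours * HOUR_MS);
const MAX_ERROR_CHARS = 500;

class RevalidationQueueModel {
  private rows = new Map<string, QueueRow>();

  // Assumed uniqueness key: one row per (manager, publisher) pair.
  private key(managerDomain: string, publisherDomain: string): string {
    return `${managerDomain}|${publisherDomain}`;
  }

  // Idempotent insert: a repeat enqueue keeps one row and resets attempts/backoff.
  enqueue(managerDomain: string, publisherDomain: string, now: number): void {
    this.rows.set(this.key(managerDomain, publisherDomain), {
      managerDomain,
      publisherDomain,
      attempts: 0,
      nextAttemptAfter: now,
      lastError: null,
    });
  }

  // Due rows only, oldest first (ordering column is an assumption), capped at the batch limit.
  dueRows(now: number, limit = 50): QueueRow[] {
    return Array.from(this.rows.values())
      .filter((row) => row.nextAttemptAfter <= now)
      .sort((a, b) => a.nextAttemptAfter - b.nextAttemptAfter)
      .slice(0, limit);
  }

  // Success removes the row from the queue.
  recordSuccess(row: QueueRow): void {
    this.rows.delete(this.key(row.managerDomain, row.publisherDomain));
  }

  // Failure advances the backoff step and stores a truncated error message.
  recordFailure(row: QueueRow, error: string, now: number): void {
    const step = BACKOFF_MS[Math.min(row.attempts, BACKOFF_MS.length - 1)];
    row.attempts += 1;
    row.nextAttemptAfter = now + step;
    row.lastError = error.slice(0, MAX_ERROR_CHARS);
  }

  get size(): number {
    return this.rows.size;
  }
}
```

In this model a row that fails at time 0 becomes due again one hour later, and a second failure at that point pushes it out a further six hours. Re-enqueueing the same pair in between makes it due immediately with `attempts` back at 0.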

At those bounds, a 6K-publisher manager rotation propagates within ~10 hours (6,000 rows at 50 per tick is 120 ticks, at 5 minutes each), comfortably ahead of the organic re-crawl cadence that any single row would catch on its own.
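
The ~10 hour figure follows directly from the batch size and tick interval. A minimal sketch of that arithmetic, assuming every row succeeds on its first attempt and the queue holds only this manager's rows (`drainTimeMinutes` is a hypothetical helper, not part of the PR):

```typescript
// Hypothetical helper: time to drain a queue at a fixed batch size per tick.
function drainTimeMinutes(rows: number, batchSize: number, intervalMinutes: number): number {
  const ticks = Math.ceil(rows / batchSize); // 6000 / 50 = 120 ticks
  return ticks * intervalMinutes; // 120 * 5 = 600 minutes
}

const minutes = drainTimeMinutes(6000, 50, 5);
console.log(minutes / 60); // 10 (hours)
```

Rows that fail and enter backoff would extend the tail beyond this estimate.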

Tests

  • Integration (`manager-revalidation-queue.test.ts`): queue idempotency, attempts/backoff reset on re-enqueue, due-row filtering, batch limit, oldest-first ordering, success deletion, exponential backoff (1h → 6h), `last_error` truncation.
  • Unit (`manifest-content-changed.test.ts`): null previous, identical manifests, `$schema`/`last_updated` noise ignored, `authorized_agents`/`properties` change detected, order-sensitivity on `authorized_agents`, missing fields treated as empty arrays.
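
The behaviors in the unit-test list can be illustrated with a sketch of stable-key canonicalization over the contributory subset. The function names here (`canonicalize`, `contributorySubset`, `manifestContentChanged`) are hypothetical, and two choices are assumptions rather than confirmed behavior of the PR's helper: a null previous manifest is treated as changed, and a null field is treated the same as a missing one.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type Manifest = { [key: string]: Json };

// Recursively rebuild objects with sorted keys; arrays keep their order,
// so reordering authorized_agents still counts as a change.
function canonicalize(value: Json): Json {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = canonicalize(value[key] as Json);
    }
    return sorted;
  }
  return value;
}

// Keep only the fields that contribute to authorization; missing fields become [].
function contributorySubset(manifest: Manifest): Json {
  return {
    authorized_agents: manifest["authorized_agents"] ?? [],
    properties: manifest["properties"] ?? [],
  };
}

// Assumption: with no previously cached manifest, report a change.
function manifestContentChanged(previous: Manifest | null, next: Manifest): boolean {
  if (previous === null) return true;
  const before = JSON.stringify(canonicalize(contributorySubset(previous)));
  const after = JSON.stringify(canonicalize(contributorySubset(next)));
  return before !== after;
}
```

Comparing serialized canonical forms keeps the check independent of key order inside objects, while `$schema` and `last_updated` never enter the comparison because they are outside the subset.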

Out of scope

  • Item 5 (`/api/registry/managers/:domain/recrawl` endpoint): now possible because the queue + reverse-lookup are in place. Thin wrapper that calls `enqueueManagerRevalidation` and returns the count. Separate PR.
  • The `source` enum extension to `adagents_json_via_manager` for per-agent rows in `agent_property_authorizations` (still tracked under AAO crawler/API: persist managerdomain discovery provenance and reverse index #4200 follow-ups).

Refs #4200, #4173, #4204.

Closes #4200 item 2. When a manager rotates its adagents.json, every
publisher delegating via ads.txt MANAGERDOMAIN needs re-validation so
its authorized_agents view stays in sync. Inline fan-out at managed-
network scale would saturate crawler concurrency, so this PR adds a
persistent queue and a bounded worker tick.

- Migration 471: manager_revalidation_queue table mirroring the shape
  of catalog_crawl_queue (idempotent insert, next_attempt_after for
  backoff, partial index for the worker's due-row scan, second index
  on manager_domain for ops-side per-manager status).

- cacheAdagentsManifest reads the previously-cached body before the
  upsert and compares the contributory subset (authorized_agents,
  properties) via recursive stable-key canonicalization. Only actual
  content drift triggers the fan-out — $schema / last_updated /
  trailing-comment noise is intentionally ignored.

- enqueueManagerRevalidation walks publishers WHERE manager_domain = $1
  via the partial index added in #4204, so a Raptive-scale rotation
  enumerates 6K delegating publishers via an index-only scan. Insert
  is idempotent and resets attempts/backoff on re-enqueue so a fresh
  manager change supersedes any in-flight retry window.

- New crawler tick processManagerRevalidationQueue drains up to 50
  rows per 5-minute interval at concurrency 10. Success deletes the
  row; failure advances exponential backoff (1h / 6h / 1d / 3d) and
  stores last_error truncated to 500 chars. At those bounds, a 6K-
  publisher manager rotation propagates within ~10 hours (120 ticks of
  50 rows), comfortably ahead of the organic re-crawl cadence for any
  single row.

- Tests: integration coverage for queue idempotency, due-row filtering,
  oldest-first ordering, success deletion, and exponential backoff. Unit
  coverage for the change-detection helper, including order-sensitivity
  on authorized_agents and ignored noise fields.
* insert, exponential backoff on failure, deletion on success.
*/
import { describe, it, expect, beforeAll, beforeEach, afterAll } from 'vitest';
import { initializeDatabase, closeDatabase, query } from '../../src/db/client.js';
Development

Successfully merging this pull request may close these issues.

AAO crawler/API: persist managerdomain discovery provenance and reverse index