feat(crawler): queue-backed manager revalidation fan-out #4210
Open
Closes #4200 item 2. When a manager rotates its `adagents.json`, every publisher delegating via ads.txt `MANAGERDOMAIN` needs re-validation so its `authorized_agents` view stays in sync. Inline fan-out at managed-network scale would saturate crawler concurrency, so this PR adds a persistent queue and a bounded worker tick.

- Migration 471: `manager_revalidation_queue` table mirroring the shape of `catalog_crawl_queue` (idempotent insert, `next_attempt_after` for backoff, partial index for the worker's due-row scan, second index on `manager_domain` for ops-side per-manager status).
- `cacheAdagentsManifest` reads the previously-cached body before the upsert and compares the contributory subset (`authorized_agents`, `properties`) via recursive stable-key canonicalization. Only actual content drift triggers the fan-out; `$schema` / `last_updated` / trailing-comment noise is intentionally ignored.
- `enqueueManagerRevalidation` walks `publishers WHERE manager_domain = $1` via the partial index added in #4204, so a Raptive-scale rotation enumerates 6K delegating publishers via an index-only scan. Insert is idempotent and resets attempts/backoff on re-enqueue so a fresh manager change supersedes any in-flight retry window.
- New crawler tick `processManagerRevalidationQueue` drains up to 50 rows per 5-minute interval at concurrency 10. Success deletes the row; failure advances exponential backoff (1h / 6h / 1d / 3d) and stores `last_error` truncated to 500 chars. At those bounds, a 6K-publisher manager rotation propagates within ~10 hours, comfortably ahead of the 60-minute organic re-crawl cadence for any single row.
- Tests: integration coverage for queue idempotency, due-row filtering, oldest-first ordering, success deletion, and exponential backoff. Unit coverage for the change-detection helper, including order-sensitivity on `authorized_agents` and ignored noise fields.
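For reviewers who want a concrete picture of the change-detection step, here is a minimal sketch of comparing the contributory subset through stable-key canonicalization. The names `canonicalize`, `contributorySubset`, and `manifestContentChanged` are hypothetical and not taken from the PR; only the field names `authorized_agents` and `properties` come from the description. The sketch assumes array order is significant, which is one reading of the "order-sensitivity" unit coverage mentioned above.

```typescript
// Hypothetical helpers for illustration; not the PR's actual implementation.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type Manifest = { [key: string]: Json };

// Serialize with object keys sorted at every depth, so key order never matters.
// Array order is preserved, so reordering array entries counts as a change.
function canonicalize(value: Json): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`).join(',')}}`;
  }
  return JSON.stringify(value);
}

// Keep only the fields that affect publisher authorization; everything else
// ($schema, last_updated, ...) is treated as noise and dropped.
function contributorySubset(manifest: Manifest): Json {
  return {
    authorized_agents: manifest.authorized_agents ?? null,
    properties: manifest.properties ?? null,
  };
}

// True only when the contributory subset differs between the two bodies.
function manifestContentChanged(previous: Manifest, next: Manifest): boolean {
  return canonicalize(contributorySubset(previous)) !== canonicalize(contributorySubset(next));
}
```

Under these assumptions, a manifest that differs only in `last_updated` or in object key order compares as unchanged, while any edit to an entry in `authorized_agents` compares as changed.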
 * insert, exponential backoff on failure, deletion on success.
 */
import { describe, it, expect, beforeAll, beforeEach, afterAll } from 'vitest';
import { initializeDatabase, closeDatabase, query } from '../../src/db/client.js';
Closes #4200 item 2. When a manager rotates its `adagents.json`, every publisher delegating via ads.txt `MANAGERDOMAIN` needs re-validation so its `authorized_agents` view stays in sync. Inline fan-out at managed-network scale (Raptive ≈ 6K publishers) would saturate crawler concurrency, so this PR adds a persistent queue and a bounded worker tick.
What lands
At those bounds, a 6K-publisher manager rotation propagates within ~10 hours — comfortably ahead of the 60-minute organic re-crawl cadence that any single row would catch on its own.
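The ~10 hour figure follows directly from the stated bounds of 50 rows per 5-minute tick. A small sketch of the arithmetic, assuming every row succeeds on its first attempt and the queue holds only this manager's rows (the function name `drainMinutes` is illustrative):

```typescript
// Minutes needed to drain a queue at a fixed batch size per tick.
// Assumes no retries and no competing rows from other managers.
function drainMinutes(queuedRows: number, rowsPerTick: number, tickMinutes: number): number {
  const ticks = Math.ceil(queuedRows / rowsPerTick);
  return ticks * tickMinutes;
}

// 6000 rows / 50 per tick = 120 ticks; 120 ticks * 5 min = 600 min.
const minutes = drainMinutes(6000, 50, 5);
console.log(minutes / 60); // 10
```

Rows that fail and enter backoff would extend the tail beyond this estimate.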
Tests
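As a reference point for the backoff coverage, the failure path described above (1h / 6h / 1d / 3d, with `last_error` truncated to 500 chars) could be sketched as below. The function and field names are hypothetical, and clamping at the final 3d step after the fourth failure is an assumption; the description does not say what happens beyond that.

```typescript
// Hypothetical sketch of the failure-path bookkeeping; names are illustrative.
const HOUR_MS = 60 * 60 * 1000;
const BACKOFF_STEPS_MS = [1 * HOUR_MS, 6 * HOUR_MS, 24 * HOUR_MS, 72 * HOUR_MS];
const MAX_ERROR_CHARS = 500;

interface FailureUpdate {
  attempts: number;
  nextAttemptAfter: Date;
  lastError: string;
}

// Compute the row update after a failed revalidation attempt.
// Assumption: once the schedule is exhausted, the delay stays at the last step.
function planRetry(previousAttempts: number, error: unknown, now: Date): FailureUpdate {
  const attempts = previousAttempts + 1;
  const stepIndex = Math.min(attempts, BACKOFF_STEPS_MS.length) - 1;
  const message = error instanceof Error ? error.message : String(error);
  return {
    attempts,
    nextAttemptAfter: new Date(now.getTime() + BACKOFF_STEPS_MS[stepIndex]),
    lastError: message.slice(0, MAX_ERROR_CHARS),
  };
}
```

Passing `now` in explicitly keeps the helper deterministic, which makes the schedule easy to assert on without mocking the clock.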
Out of scope
Refs #4200, #4173, #4204.