
docs(sitemap-author): schema v1.1 — 12 patches from twitter+hackernews PoC#1822

Merged: jackwener merged 3 commits into main from docs/sitemap-schema-v1.1 on Jun 1, 2026
@jackwener (Owner)

Summary

Sitemap schema reference v1.1, based on two parallel PoCs: twitter (12 files, @opencli-user) and hackernews (10 files, @opencli-质量官). The 12 patches fall into 3 groups; all surfaced as findings during PoC content writing and were cross-reviewed in thread.

Group 1 — Scope/boundary (6 clarifications, no format change)

| § | Patch | Rationale |
|---|---|---|
| 1.1 | CJK token-per-char 30-50% higher than English; split sub-file rather than relaxing 800 limit | Twitter F3: `pitfalls.md` trimmed from 4113 → 2726 bytes still hit ~1800 token; relaxing target = drift |
| 2.1 | `auth_strategy` = primary, not union | Twitter F4: most sites mix PUBLIC + COOKIE; union form adds complexity for marginal expressiveness |
| 2.5 | `pitfalls.md` task-executor-level only; adapter-internal → `notes.md` | Twitter F2: 4 of 10 initial pitfalls were adapter-implementer view (queryId parse / envelope unwrap) |
| 2.5 | Pitfall id/trigger/workaround written from task-executor 1st-person | codex catch: even when workaround is correct, naming "queryid_rotation_breaks_adapter" reads adapter-internal |
| 2.4 | `apis.md` entry adds optional `notes:` field | Twitter PoC need: GraphQL queryId path / known schema drift / special headers |
| 2.2 | Page Linked APIs may be empty when `endpoints.json` incomplete | Twitter F1: 34+ GraphQL operations referenced by adapters, only 2 in `endpoints.json`; strict ref forces empty (correct, surfaces real gap) |
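
The §1.1 rule (split into sub-files rather than raising the 800-token limit) can be sketched as a greedy packer. This is only an illustration, not part of the schema: `crude_token_count` is a stand-in counter and `split_sections` is a hypothetical helper name. A real check would plug in whatever tokenizer the consuming agent actually uses.

```python
from typing import Callable, List

TOKEN_BUDGET = 800  # hard per-file limit named in this PR


def _is_cjk(ch: str) -> bool:
    # CJK Unified Ideographs block only; enough for a rough illustration.
    return "\u4e00" <= ch <= "\u9fff"


def crude_token_count(text: str) -> int:
    """Stand-in counter (assumption): one token per whitespace-separated
    word plus one token per CJK character. Not a real tokenizer."""
    cjk = sum(1 for ch in text if _is_cjk(ch))
    non_cjk = "".join(" " if _is_cjk(ch) else ch for ch in text)
    return cjk + len(non_cjk.split())


def split_sections(
    sections: List[str],
    count: Callable[[str], int] = crude_token_count,
    budget: int = TOKEN_BUDGET,
) -> List[List[str]]:
    """Greedily pack sections into sub-files so each stays within budget.

    A single section that is itself over budget still ends up alone in
    its own sub-file; trimming it is left to the author.
    """
    files: List[List[str]] = []
    current: List[str] = []
    used = 0
    for section in sections:
        cost = count(section)
        if current and used + cost > budget:
            files.append(current)
            current, used = [], 0
        current.append(section)
        used += cost
    if current:
        files.append(current)
    return files
```

Keeping the budget fixed and moving the overflow into a new sub-file is what keeps the limit from drifting; the counter is the only part that needs to change for CJK-heavy sites.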

Group 2 — Reuse/compactness (3 structural)

| § | Patch | Rationale |
|---|---|---|
| 2.2 + 4 | Partial pages: `page_id` with `_` prefix + `url_patterns: []` | Twitter F5: tweet card UI on 5 pages (home/profile/status/notifications/bookmarks); without partials, duplicate or arbitrary owner pick |
| 3 | Form B compact YAML for actions (~80 token vs Form A markdown ~250) | Twitter F6: 11/12 PoC files exceeded 800 token under markdown; structural fix, not author trim |
| 3 | Drop action-level `verified_at` / `source` (inherit file-level) | codex catch: YAML compression eaten by 3-line Evidence; file `last_verified` already covers |
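
The partial-page convention could be checked mechanically. The sketch below assumes the page frontmatter has already been parsed into a dict; `check_partial_rule` is a hypothetical helper, and the requirement that non-partial pages list at least one URL pattern is an assumption of this sketch, not something the PR states.

```python
from typing import Dict, List


def is_partial(frontmatter: Dict[str, object]) -> bool:
    """A partial page is marked by a leading underscore in page_id."""
    return str(frontmatter.get("page_id", "")).startswith("_")


def check_partial_rule(frontmatter: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the page passes."""
    problems: List[str] = []
    patterns = frontmatter.get("url_patterns")
    if is_partial(frontmatter):
        # Partials describe cross-page UI, so they own no URL.
        if patterns != []:
            problems.append("partial page must declare url_patterns: []")
    elif not patterns:
        # Assumption: a routable page needs at least one pattern.
        problems.append("non-partial page has no url_patterns")
    return problems
```

For example, `check_partial_rule({"page_id": "_tweet_card", "url_patterns": []})` reports no problems, while the same `page_id` with a non-empty pattern list is flagged.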

Group 3 — Execution health/anchors (3 action-level semantics)

| § | Patch | Rationale |
|---|---|---|
| 3.3 | Cross-page UI primitives may write adapter-first + DOM fallback within action | Twitter PoC `_tweet_card.md` `like_tweet`: forcing UI-primitive to workflow level makes "click like" its own workflow, mismatched granularity |
| 3.4 | `adapter_health_update: <adapter> -> suspect` directive in Recovery + consumption skill writes workflow health | Twitter `_tweet_card.md` Recovery had this as text; formalize as directive so next agent sees the suspect mark and skips broken Best path |
| 2.2 | `testid` optional; `selector_pattern` first-class with 5 forms + discouraged list | Hackernews F9: no testid on old sites; codex catch on `nth-child(<rank>)` instability validated the discouraged-anchor list |
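
A consumer of the §3.4 directive might parse it roughly as follows. This is a sketch under assumptions: the overlay is modelled as a plain adapter-to-state dict, the adapter name `twitter/like` is made up, and `parse_health_update` / `apply_recovery` are hypothetical names rather than the opencli-browser-sitemap skill's actual interface.

```python
import re
from typing import Dict, Optional, Tuple

# Matches "adapter_health_update: <adapter> -> <state>", optionally as a bullet.
DIRECTIVE = re.compile(
    r"^\s*(?:-\s*)?adapter_health_update:\s*(?P<adapter>\S+)\s*->\s*(?P<state>\w+)\s*$"
)


def parse_health_update(line: str) -> Optional[Tuple[str, str]]:
    """Return (adapter, state) if the line is a directive, else None."""
    match = DIRECTIVE.match(line)
    if match is None:
        return None
    return match.group("adapter"), match.group("state")


def apply_recovery(recovery_text: str, overlay: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the overlay with every directive in the text applied."""
    updated = dict(overlay)
    for line in recovery_text.splitlines():
        parsed = parse_health_update(line)
        if parsed is not None:
            adapter, state = parsed
            updated[adapter] = state
    return updated
```

Returning a new dict instead of mutating the input keeps the write step explicit, so the caller decides when the local overlay is actually persisted.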

v1 → v1.1 migration

Both PoCs are currently local at ~/.opencli/sites/{twitter,hackernews}/sitemap/ and written in v1 markdown form. After this PR lands:

  1. @opencli-user normalizes twitter PoC to v1.1 (Form B YAML / partial spec / adapter_health_update directives)
  2. @opencli-质量官 normalizes hackernews PoC to v1.1
  3. Both promote together in a follow-up PR to skills/opencli-sitemap-author/references/site-memory/{twitter,hackernews}/sitemap/

Open question for PR review

@opencli-user proposed a two-tier file size limit (hard 800 for simple sites / soft 2000 for dense sites) — should this land in v1.1, or stick with hard 800 + the CJK split-file note? My current take: hard 800 is correct, soft tiers drift. But happy to revisit.

Reviewers

Per agreed split: @opencli-user + @codex-coder cross-review (this PR consolidates content from both their findings).

Test plan

  • npm run docs:build clean
  • CI green

jackwener added 3 commits June 2, 2026 02:19
…s PoC

Cross-validated against two PoCs (twitter 12 files / hackernews 10 files).
v1.1 changelog at top of file. 12 patches in 3 groups:

Group 1 — Scope/boundary (6 clarifications):
- §1.1 CJK token-per-char 30-50% higher than English; split sub-file rather
  than relaxing 800-token limit (which would drift).
- §2.1 auth_strategy = primary strategy, not union; per-page contract_strength
  expresses exceptions.
- §2.5 pitfalls.md is task-executor-level only; adapter-internal pitfalls
  (queryId parsing, envelope unwrap) move to ~/.opencli/sites/<site>/notes.md.
- §2.5 pitfall id / trigger / workaround written from task-executor 1st-person
  view ("when agent does X, ..."), not adapter-implementer view.
- §2.4 apis.md entry adds optional `notes:` field for GraphQL queryId path and
  other meta info (still no URL / method / params / response — those stay in
  endpoints.json).
- §2.2 page Linked APIs may be empty when endpoints.json is still being
  collected; do not insert fake placeholder ids.
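
The §2.4 and §2.2 boundaries above lend themselves to a small lint. The following is a sketch only: it assumes apis.md entries and endpoints.json ids are already loaded into Python objects, and the function names and the sample ids in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Fields that, per the note above, stay in endpoints.json and not in apis.md.
FORBIDDEN_IN_APIS_MD = {"url", "method", "params", "response"}


def misplaced_fields(entry: Dict[str, object]) -> List[str]:
    """Fields in an apis.md entry that belong in endpoints.json instead."""
    return sorted(FORBIDDEN_IN_APIS_MD & set(entry))


def unknown_api_refs(linked_apis: Iterable[str], endpoint_ids: Set[str]) -> List[str]:
    """Linked API ids with no matching endpoints.json entry.

    An empty linked_apis list is valid and yields no findings, which is
    what lets a page ship before endpoints.json is complete.
    """
    return [api_id for api_id in linked_apis if api_id not in endpoint_ids]
```

With a made-up endpoint set `{"HomeTimeline"}`, `unknown_api_refs(["HomeTimeline", "TODO_placeholder"], {"HomeTimeline"})` flags only the placeholder, which is the case the rule is meant to catch.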

Group 2 — Reuse/compactness (3 structural):
- §2.2 + §4 partial pages: `page_id` with `_` prefix and `url_patterns: []`
  for cross-page UI (e.g. _tweet_card.md). Referenced by other pages via the
  existing `action:<id> in pages/_<name>.md` form. Eliminates duplication and
  arbitrary "which page owns the like button" calls.
- §3 introduces Form B compact YAML for actions (~80 token each vs Form A
  markdown ~250). Both forms remain valid; Form B is recommended when page
  density would otherwise blow the 800-token budget.
- §3 drops action-level `verified_at` and `source` — file-level frontmatter
  already covers both, repeated copies just drift.

Group 3 — Execution health/anchors (3 action-level):
- §3.3 cross-page UI primitive actions (the kind that live in partials)
  may write Best/Fallback inline as adapter-first + DOM fallback within a
  single action, rather than being forced up into a workflow Best/Fallback
  pair. Decouples UI-primitive routing from task-level routing.
- §3.4 Recovery may include `adapter_health_update: <adapter> -> suspect`
  directive. Consumption skill (opencli-browser-sitemap) writes the matching
  workflow's adapter_health on the local overlay so the next agent skips the
  broken Best path instead of re-running it. Write-side closure for the
  failure → next-agent-avoidance loop.
- §2.2 testid marked optional; selector_pattern promoted to first-class
  anchor with 5 acceptable shapes (id-anchored / sibling traversal / attribute
  boundary / form name / ARIA) and explicit discouraged-anchor list
  (nth-child, single-class grabs, text-content selectors). Old sites without
  testid (HN, forums) are no longer second-class.
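
The discouraged-anchor list could be approximated with a few regular expressions. This is a rough sketch: the commit message names the three categories, but the concrete patterns below (including treating `:contains(`, `:has-text(` and a `text=` prefix as "text-content selectors") are assumptions, and `discouraged_reasons` is a hypothetical helper.

```python
import re
from typing import List


def discouraged_reasons(selector: str) -> List[str]:
    """Return the discouraged-anchor categories a selector falls into."""
    reasons: List[str] = []
    s = selector.strip()
    # Positional anchors break when list order or rank changes.
    if re.search(r":nth-child\(", s):
        reasons.append("nth-child")
    # A bare ".classname" with nothing else to scope it.
    if re.fullmatch(r"\.[A-Za-z_][\w-]*", s):
        reasons.append("single-class")
    # Assumed spellings of text-based matching.
    if re.search(r":contains\(|:has-text\(|^text=", s):
        reasons.append("text-content")
    return reasons
```

An attribute- or ARIA-anchored selector such as `button[aria-label="Like"]` passes with an empty list, while `tr:nth-child(3) .votearrow` is reported as `["nth-child"]`.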

No code changes — pure schema reference. Both PoCs remain local; promotion to
references/site-memory/{twitter,hackernews}/sitemap/ comes once this lands.
- Form B delimiter table (`|` enum / `||` fallback / `;` sequential) to
  disambiguate `do:` and `recover:` parsing.
- §3.3 like_tweet example updated to `||` fallback form.
- §3.4 explicit note: adapter_health recovery (suspect → healthy) is read
  side, deferred to opencli-browser-sitemap skill spec.
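
One way to read the three Form B delimiters is as nested splits. The sketch below assumes a precedence the commit message does not spell out (`;` binds loosest, then `||`, then `|`), ignores quoting and escaping, and uses a made-up `do:` value; `parse_form_b` is a hypothetical name.

```python
from typing import List


def parse_form_b(value: str) -> List[List[List[str]]]:
    """Parse a `do:` / `recover:` value into steps -> alternatives -> options.

    Assumed precedence: `;` separates sequential steps, `||` separates
    fallback alternatives within a step, `|` separates enum options
    within an alternative. Values that contain these characters
    literally are not handled.
    """
    steps: List[List[List[str]]] = []
    for step in value.split(";"):
        alternatives: List[List[str]] = []
        # Splitting on "||" first leaves only single "|" for the enum split.
        for alternative in step.split("||"):
            options = [opt.strip() for opt in alternative.split("|") if opt.strip()]
            if options:
                alternatives.append(options)
        if alternatives:
            steps.append(alternatives)
    return steps
```

Under these assumptions, `parse_form_b("adapter:like || click [data-testid=like]; wait 1s")` yields two steps, the first with two fallback alternatives and the second with one.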
@jackwener merged commit dc67023 into main on Jun 1, 2026
11 checks passed
@jackwener deleted the docs/sitemap-schema-v1.1 branch on June 1, 2026 18:30