fix(server): surface actionable guidance when embedding rebuild blocks startup #2618

Open
fancyboi999 wants to merge 1 commit into volcengine:main from fancyboi999:fix/2273-embedding-rebuild-startup-guidance

@fancyboi999

Summary

Upgrading OpenViking (e.g. v0.3.15 → v0.3.19+) with an existing workspace can crash openviking-server at startup with a bare EmbeddingRebuildRequiredError traceback and no recovery guidance, pushing operators toward deleting business data to get the server back up.

This PR turns that expected-but-fatal upgrade condition into an actionable, operator-facing recovery runbook, and cleans up the stale deploy guidance that contributed to the "server won't start after upgrade" reports.

Root Cause

  • On startup, init_context_collection() (openviking/storage/collection_schemas.py:338,344) raises EmbeddingRebuildRequiredError when an existing vector collection's embedding metadata (provider/model/dimension) no longer matches the current config — expected after a default-embedding-model change between versions.
  • That error is raised inside service.initialize(), called from the server lifespan via the shared choke point _initialize_runtime_state (openviking/server/app.py). It was never caught, so it propagated out of the FastAPI lifespan and uvicorn aborted startup with a raw traceback and zero remediation guidance.
  • The legacy python -m openviking.console.bootstrap command (the standalone console, removed in v0.3.18) still appeared in the OpenClaw-plugin setup docs, so users following them hit ModuleNotFoundError — the second half of the original report.
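
As a rough illustration of the mismatch check described in the first bullet, the sketch below compares the metadata stored with a collection against the current config. Everything here is hypothetical: the `EmbeddingMeta` dataclass, the `check_embedding_metadata` helper, and the local `EmbeddingRebuildRequiredError` stand-in are illustrative names, and the real logic in init_context_collection() may be structured differently.

```python
from dataclasses import dataclass


class EmbeddingRebuildRequiredError(RuntimeError):
    """Local stand-in for the real OpenViking exception (assumption)."""


@dataclass(frozen=True)
class EmbeddingMeta:
    # Hypothetical shape of the metadata stored with a vector collection.
    provider: str
    model: str
    dimension: int


def check_embedding_metadata(
    stored: EmbeddingMeta,
    current: EmbeddingMeta,
    allow_metadata_override: bool = False,
) -> None:
    """Raise if the stored collection was built with a different embedding."""
    # A dimension change can never be overridden: the vectors are incompatible.
    if stored.dimension != current.dimension:
        raise EmbeddingRebuildRequiredError(
            f"dimension changed: {stored.dimension} -> {current.dimension}"
        )
    # A provider/model change with the same dimension can be overridden,
    # after which the operator is expected to reindex.
    if (stored.provider, stored.model) != (current.provider, current.model):
        if not allow_metadata_override:
            raise EmbeddingRebuildRequiredError(
                f"provider/model changed: {stored.provider}/{stored.model} -> "
                f"{current.provider}/{current.model}"
            )
```

The two branches mirror the two cases the PR distinguishes: a same-dimension provider/model change that an override can unblock, and a dimension change that always requires rebuilding the vector index.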

Solution & Trade-offs

  • Catch EmbeddingRebuildRequiredError at the single shared startup choke point _initialize_runtime_state, log a clear, case-aware recovery runbook, then re-raise so startup still aborts — the server must never serve vectors from a mismatched embedding space.
    • Provider/model changed, dimension unchanged → set embedding.allow_metadata_override=true, restart, then ov reindex viking:// --mode vectors_only --sudo --wait true.
    • Dimension changed → the runbook makes clear that allow_metadata_override does not bypass this case and that the vector index must be rebuilt; no risky destructive command is suggested.
  • The storage-layer raise sites and messages are unchanged (they are already correct and distinguish the two cases); the runbook lives at the server/operator boundary where deployment context belongs. The vestigial test-only _on_deferred_init_done path is untouched, and no second error handler is introduced.
  • Docs: add an upgrade-troubleshooting FAQ entry (EN + ZH) covering both recovery cases; replace the removed console.bootstrap command with Web Studio at /studio in the OpenClaw-plugin setup docs. The historical references in the internal design doc are intentionally left as-is.
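
The catch, log, and re-raise pattern described above can be sketched as follows. This is a simplified, synchronous approximation and not the code in openviking/server/app.py: the `dimension_changed` attribute, the `build_guidance` helper, the logger name, and the exact wording of the messages are assumptions made for illustration. Only the config key and the `ov reindex` command are taken from the PR description.

```python
import logging

logger = logging.getLogger("startup_sketch")  # hypothetical logger name


class EmbeddingRebuildRequiredError(RuntimeError):
    """Local stand-in for the real OpenViking exception (assumption)."""

    def __init__(self, message: str, dimension_changed: bool) -> None:
        super().__init__(message)
        # Hypothetical flag; the real error may expose the case differently.
        self.dimension_changed = dimension_changed


def build_guidance(exc: EmbeddingRebuildRequiredError) -> str:
    """Return a case-aware recovery runbook for the operator."""
    if exc.dimension_changed:
        return (
            "Embedding dimension changed. embedding.allow_metadata_override "
            "does not bypass this; the vector index must be rebuilt."
        )
    return (
        "Embedding provider/model changed (dimension unchanged).\n"
        "1. Set embedding.allow_metadata_override=true\n"
        "2. Restart the server\n"
        "3. Run: ov reindex viking:// --mode vectors_only --sudo --wait true"
    )


def initialize_runtime_state(service) -> None:
    """Single startup choke point: log guidance, then abort as before."""
    try:
        service.initialize()
    except EmbeddingRebuildRequiredError as exc:
        logger.error("%s\n%s", exc, build_guidance(exc))
        raise  # never serve vectors from a mismatched embedding space
```

The bare `raise` keeps the original traceback intact, so the only behavioral change is the extra log entry emitted before startup aborts.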

Validation Evidence

# TDD: new unit tests (red → green)
pytest tests/server/test_server_health.py::test_embedding_rebuild_guidance_is_actionable \
       tests/server/test_server_health.py::test_initialize_runtime_state_surfaces_embedding_rebuild_guidance
# + sibling init tests → 4 passed

ruff format --check  → Passed
ruff check           → All checks passed
pre-commit (ruff + ruff-format) → Passed

End-to-end verification driving the real FastAPI lifespan (create_app → lifespan_context → _initialize_runtime_state → service.initialize()) with a service that raises EmbeddingRebuildRequiredError:

STARTUP_ABORTED: True
ABORT_CAUSED_BY_EmbeddingRebuildRequiredError: True
GUIDANCE_LOGGED_DURING_REAL_STARTUP: True

Note: in the real lifespan the error is wrapped in an ExceptionGroup by the MCP task group as it propagates; catching at the inner choke point logs the guidance before that wrapping, so the operator always sees it.

Type of Change

  • Bug fix (fix)
  • Documentation (docs)

Testing

  • Unit tests pass
  • Manual/real-path verification completed (real lifespan startup abort + guidance logged)

Related Issues

Affected Areas

Primary: Platform / Server (openviking/server/app.py). Supporting docs span docs/*/faq and examples/openclaw-plugin.

Checklist

  • Code follows project style guidelines (ruff/ruff-format)
  • Tests added for new behavior
  • Documentation updated
  • All applicable local checks pass (full suite needs the native AGFS build; CI Test Lite covers it)

…s startup

On upgrade, an existing vector collection whose embedding metadata no longer
matches the current config makes init_context_collection raise
EmbeddingRebuildRequiredError. This propagated out of the lifespan uncaught, so
openviking-server crashed at startup with a bare traceback and no recovery path
— pushing operators toward deleting business data to get the server back up.

Catch the error at the shared startup choke point (_initialize_runtime_state),
log an operator-facing, case-aware recovery runbook (provider/model change vs.
dimension change, with the verified `ov reindex` command and the
allow_metadata_override caveat), then re-raise so startup still aborts and the
server never serves vectors from a mismatched embedding space.

Docs:
- Add an upgrade-troubleshooting FAQ entry (en + zh) covering both recovery cases.
- Replace the removed `python -m openviking.console.bootstrap` (gone since
  v0.3.18) with Web Studio at /studio in the openclaw-plugin setup docs.

Closes volcengine#2273
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

2273 - Fully compliant

Compliant requirements:

  • Added actionable recovery guidance for EmbeddingRebuildRequiredError
  • Updated docs to remove references to openviking.console.bootstrap
  • Added tests for the new error handling
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add actionable guidance for embedding rebuild startup errors

Relevant files:

  • openviking/server/app.py
  • tests/server/test_server_health.py

Sub-PR theme: Update docs for upgrade troubleshooting and remove stale console references

Relevant files:

  • docs/en/faq/faq.md
  • docs/zh/faq/faq.md
  • examples/openclaw-plugin/README.md
  • examples/openclaw-plugin/README_CN.md
  • examples/openclaw-plugin/docs/openviking-openclaw-plugin-guide.md

⚡ Recommended focus areas for review

Documentation Gap
The Japanese FAQ is not updated with the new embedding rebuild guidance, while the English and Chinese FAQs are. This leaves the documentation inconsistent across languages.

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Bug]: Docker cannot be upgraded from V0.3.15 to V0.3.19 or later
