Model configuration fixes#946
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes unreliable Navie model configuration updates (custom OpenAI key/endpoint) by stabilizing JSON-RPC server restarts and port reuse, and refactors the RPC service into an explicit server state machine to make restart behavior easier to reason about.
Changes:
- Refactor
DefaultAppLandJsonRpcServiceinto a state machine, preserving the server port across restarts and debouncing restart triggers. - Simplify/standardize settings-change notifications (move to
modelConfigChange()without per-key callbacks) and ensure env/model config changes trigger RPC restarts. - Add regression tests for port reuse and “single restart” behavior; harden webview model selection against undefined events; avoid disabling Copilot when OpenAI is configured.
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| plugin-core/src/main/java/appland/rpcService/DefaultAppLandJsonRpcService.java | Refactors server lifecycle into explicit states; changes restart/port handling and startup signaling. |
| plugin-core/src/main/java/appland/rpcService/AppLandJsonRpcListener.java | Removes serverRestarted() callback (restart completion is signaled via other events). |
| plugin-core/src/main/java/appland/rpcService/TestAppLandJsonRpcService.java | Updates test helper to terminate the current RPC process. |
| plugin-core/src/main/java/appland/settings/AppMapSettingsListener.java | Removes per-key secure model config callback in favor of a single modelConfigChange(). |
| plugin-core/src/main/java/appland/settings/AppMapSettingsReloadProjectListener.java | Restarts RPC server on model config + CLI env changes (debounced). |
| plugin-core/src/main/java/appland/settings/AppMapSecureApplicationSettingsService.java | Publishes modelConfigChange() when secure model config changes. |
| plugin-core/src/main/java/appland/webviews/WebviewEditor.java | Opens dev tools after the webview application initializes instead of on URL load. |
| plugin-copilot/src/main/java/appland/copilotChat/CopilotModelInfoProvider.java | Stops treating custom OpenAI configuration as disabling Copilot integration. |
| plugin-core/src/test/java/appland/AppMapBaseTest.java | Updates JSON-RPC test helpers to wait on serverStarted() only. |
| plugin-core/src/test/java/appland/rpcService/DefaultAppLandJsonRpcServiceTest.java | Updates restart/start wait conditions; adds tests for port reuse and debounced restarts. |
| plugin-core/src/test/java/appland/settings/AppMapSecureApplicationSettingsServiceTest.java | Updates secure model config listener tests for the new callback signature. |
| plugin-core/src/test/java/appland/actions/AddNavieContextFilesActionTest.java | Uses the updated JSON-RPC wait helper. |
| appland-webview/navie.js | Guards against undefined payloads in select-model events. |
| appland-webview/dist/navie.js | Regenerated build output for the webview change. |
| appland-webview/dist/navie.js.map | Regenerated source map for the webview change. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Push config before notifying listeners so the initial state is already set when they react. | ||
| updateServerSettings(); | ||
| updateModelConfig(); | ||
| if (!isDisposed && !project.isDisposed()) { | ||
| project.getMessageBus().syncPublisher(AppLandJsonRpcListener.TOPIC).serverStarted(); | ||
| } |
There was a problem hiding this comment.
onPortAnnounced calls updateServerSettings() (which waits for smart mode + does synchronous RPC I/O) and updateModelConfig() before publishing serverStarted(). This can delay serverStarted substantially during indexing or slow model providers, which risks UI/test timeouts and makes the server appear “not started” even though the port is ready. Consider publishing serverStarted() immediately after transitioning to RUNNING, and schedule the config/model pushes asynchronously (e.g., via the existing SingleAlarm/pooled thread) so port readiness isn’t gated on config propagation.
There was a problem hiding this comment.
This is intentional, so the server is completely ready before clients try to interact with it. I haven't observed any concerning delays in practice.
| // No try/finally needed: stopCurrentProcessSync swallows all exceptions internally, | ||
| // and serverStarted fires automatically from onPortAnnounced when the new process announces its port. | ||
| project.getMessageBus().syncPublisher(AppLandJsonRpcListener.TOPIC).beforeServerRestart(); |
There was a problem hiding this comment.
restartServerSync() publishes beforeServerRestart() unconditionally, even when no server is running/starting (e.g., settings changes before first start, or CLI download when STOPPED). That can cause listeners (Navie editors) to show “restarting” for what is effectively a first start. Consider only publishing beforeServerRestart() when the state is STARTING/RUNNING (or when currentProcess is non-null), and treat STOPPED as a plain start.
| // No try/finally needed: stopCurrentProcessSync swallows all exceptions internally, | |
| // and serverStarted fires automatically from onPortAnnounced when the new process announces its port. | |
| project.getMessageBus().syncPublisher(AppLandJsonRpcListener.TOPIC).beforeServerRestart(); | |
| boolean publishBeforeRestart; | |
| synchronized (this) { | |
| publishBeforeRestart = currentProcess != null | |
| || state == ServerState.STARTING | |
| || state == ServerState.RUNNING | |
| || state == ServerState.CRASH_RESTARTING; | |
| } | |
| // No try/finally needed: stopCurrentProcessSync swallows all exceptions internally, | |
| // and serverStarted fires automatically from onPortAnnounced when the new process announces its port. | |
| if (publishBeforeRestart) { | |
| project.getMessageBus().syncPublisher(AppLandJsonRpcListener.TOPIC).beforeServerRestart(); | |
| } |
There was a problem hiding this comment.
I don't think that's a real concern. It's an unlikely corner case and I don't think the consequences are significant.
6f5b629 to
03ffbfa
Compare
jansorg
left a comment
There was a problem hiding this comment.
The changes seem fine to me, only a few minor notes.
It worked well when I tested it.
| } | ||
|
|
||
| private void initWebviewApplication() { | ||
| if (Registry.is("appmap.webview.open.dev.tools", false)) { |
There was a problem hiding this comment.
Did you observe a problem with the previous setup?
There was a problem hiding this comment.
Yes, it only opened for me about 30% of the time.
| } | ||
| lastKnownPort = port; | ||
| state = ServerState.RUNNING; | ||
| crashRestartCount = 0; |
There was a problem hiding this comment.
If I'm not mistaken, this implements "restart after crash" differently.
Now, the count is reset when the port is announced. A (repeated) crash after the port announcement is not counted against the max restarts and the server would keep restarting.
Previously, it was counted, AFAIK.
I'm not sure what the better implementation is. There are advantages of this implementation, e.g. doing restarts after the server kept running. Your call what you'd like to do.
There was a problem hiding this comment.
Yes, this is intentional. Previous implementation was IMO broken because the crash counter never reset; this led to the situation that when the server crashed and successfully restarted, a subsequent unrelated crash (say, hours later) would have the restart unnecessarily delayed or even (if it exceeded the implicit crash counter) suppressed. Not to mention this could have interfered with the tests which didn't have access to reset the crash counter (although this could be exposed if needed), even if I can't tell for certain if any test problems I've seen when adding more crash tests were caused directly by this, and this concern could be fixed just by resetting on explicit (re)start.
When the server announces the port it's initialized, up and running, and we can be pretty sure if it were to crash due to eg. corrupted binary or a similar issue, it would have crashed by now. The alternative would be to set up a timeout to reset the crash counter after it has been successfully running for some time (maybe a minute or so). Wdyt?
Opening devtools immediately after loadURL was unreliable because the JCEF browser might not have been fully initialized yet. Moving the call to initWebviewApplication ensures that the browser is ready to display the devtools window.
Sometimes the webview sends select-model with undefined; this happens for example when previously chosen model is no longer available. Ignore these calls instead of logging errors.
Due to two races, changing settings could leave Navie broken until the IDE was restarted. First, the crash handler used implicit state (a null/non-null process object plus a delay threshold) and could not distinguish a crash from an intentional settings-driven restart. When the intentional restart exited the old process, the crash handler would observe the exit and fire a competing restart, the two paths would collide, and the service would get stuck. Second, the crash recovery codepath allocated a fresh port instead of reusing the existing one. Any Navie UI window that was already open had the old port baked in and could not reconnect after the server recovered from a crash. Refactor DefaultAppLandJsonRpcService around an explicit ServerState enum (STOPPED, STARTING, RUNNING, STOPPING, CRASH_RESTARTING, DISPOSED). The crash handler only fires from the RUNNING state, so intentional restarts (which transition through STOPPING) are never mistaken for crashes. Port ownership moves to service level and is reused across all restarts; the server is only considered running once the port is announced, so callers always observe a fully-ready process. The crash backoff is count-based with exponential delay (1 s, 1.5 s, 2.25 s, then giving up) and resets on both intentional restart and successful port announcement. Also: replaces the per-launch ProcessAdapter allocation with a singleton listener that guards by process identity, removes the now-redundant serverRestarted event (serverStarted covers both first start and restart), and consolidates duplicated "wait for server running" test helpers.
Changing a model setting (e.g. an API key) from the UI caused the Navie RPC server to restart twice due to duplicate event listeners. All config-driven restarts are now consolidated in AppMapSettingsReloadProjectListener with a 1-second debounce, so rapid successive changes result in a single restart. Any CLI environment change now also correctly triggers a restart, not just LLM-related variables. The separate secureModelConfigChange event is removed in favour of the unified modelConfigChange.
Previously, Copilot models were hidden from the model list whenever an OpenAI API key was set. Since each model carries its own connection details, both providers can coexist in the list and the user can switch between them freely.
03ffbfa to
9f8376d
Compare
The test was flaky because the indexer's prepareWatchedDirectory created tmp/appmap on the real filesystem outside the VFS, causing a later copyDirectoryToProject to fail with "Cannot create directory". Restructure to match the pattern all other tests in the class use: pre-create tmp/appmap in the VFS before createAppMapYaml starts the indexer, and rely on createAppMapYaml's afterRefreshForProjects barrier instead of a bare copyFileToProject call for reliable startup sequencing.
loadFindingsDirectory was unblocking on the first afterFindingsChanged event, but reloadAsync fires that event once per findings file before emitting afterFindingsReloaded at the end. On fast machines the remaining per-file events would all fire before expandAll() started interacting with the AsyncTreeModel, so the tree happened to be stable by then. On slower CI runners the events were still in flight during expansion, causing RootNode to call structureChanged() on each one and forcing AsyncTreeModel to restart indefinitely until the 120 s timeout. Switch to createFindingsReloadedCondition so the test only proceeds once all findings are loaded and the model is guaranteed to be stable.
|
🎉 This PR is included in version 0.83.2 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
I looked at the flow of user configuring a custom OpenAI key or endpoint through the UI and found that it didn't work reliably — the current navie tab froze and the user was forced to open a new tab. I tracked it down to two problems: port not always being preserved between restarts, and the configuration change causing more than one restart (theoretically up to four, but I've only seen two). While trying to fix these, I noticed that the RPC server service has grown to be a bit of a mess, so I refactored it to explicitly model the state machine to more easily see what's going on.
I've also noticed and fixed a couple other problems such as:
I suggest reviewing commit-by-commit; the commit messages provide more detail.
To test manually: