Make transparent proxy health check parameters configurable#4108
Make transparent proxy health check parameters configurable#4108gkatz2 wants to merge 5 commits intostacklok:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4108 +/- ##
==========================================
+ Coverage 68.68% 68.71% +0.03%
==========================================
Files 454 458 +4
Lines 46027 46263 +236
==========================================
+ Hits 31612 31789 +177
- Misses 11977 12027 +50
- Partials 2438 2447 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
The health check ping timeout, retry delay, and failure threshold were hardcoded, making it impossible to tune resilience for environments with transient network disruptions. Only the check interval was configurable via TOOLHIVE_HEALTH_CHECK_INTERVAL. Add three new environment variables following the same pattern: TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, and TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. Raise the default failure threshold from 3 to 5 to extend the tolerance window from ~40s to ~60s, reducing false positives without significantly delaying detection of genuinely dead servers. Fixes stacklok#4084 Signed-off-by: Greg Katz <gkatz@indeed.com>
0656ee6 to
7d879e7
Compare
Signed-off-by: Greg Katz <gkatz@indeed.com>
|
@claude could you review this PR? |
|
I'll analyze this and get back to you. |
There was a problem hiding this comment.
Pull request overview
This PR makes the transparent proxy health check behavior configurable via environment variables (ping timeout, retry delay, failure threshold), improving resilience in environments with transient network disruptions and increasing the default failure threshold from 3 to 5.
Changes:
- Add env var constants + resolution helpers for transparent proxy health check parameters, and wire resolved values into
TransparentProxy. - Replace the package-level retry-count constant with a per-proxy
healthCheckFailureThresholdfield (options + env var). - Add unit tests for env var resolution and constructor wiring, and document the new tuning parameters.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pkg/transport/proxy/transparent/transparent_proxy.go | Adds env var resolution for health-check parameters and stores the failure threshold on the proxy instance. |
| pkg/transport/proxy/transparent/transparent_test.go | Adds table-driven tests for the getter functions and constructor/option wiring. |
| docs/arch/03-transport-architecture.md | Documents the new transparent proxy health check tuning environment variables. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
|
|
||
| Duration values use Go's `time.ParseDuration` format (e.g., `10s`, `500ms`, `1m30s`). Invalid values are ignored with a warning log, and the default is used instead. | ||
|
|
||
| **Failure window**: With the defaults, the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (5 intervals × 10s + retry delays). This is designed to survive transient network disruptions without prematurely killing healthy backends. |
There was a problem hiding this comment.
The failure-window math in parentheses is unclear/inconsistent with the described ~60s tolerance. Given the current retry behavior, consider documenting the window as roughly (failureThreshold-1)×(interval+retryDelay) plus the initial failure tick, so readers can reason about it accurately.
| **Failure window**: With the defaults, the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (5 intervals × 10s + retry delays). This is designed to survive transient network disruptions without prematurely killing healthy backends. | |
| **Failure window**: With the defaults, the proxy tolerates approximately 70 seconds of consecutive health check failures before shutting down. In general, the window is roughly `(failureThreshold - 1) × (interval + retryDelay) + interval` (an initial failure tick plus additional failures with retries). This is designed to survive transient network disruptions without prematurely killing healthy backends. |
| t.Run(tt.name, func(t *testing.T) { | ||
| if tt.envValue != "" { | ||
| t.Setenv(HealthCheckIntervalEnvVar, tt.envValue) | ||
| } | ||
| assert.Equal(t, tt.expected, getHealthCheckInterval()) | ||
| }) |
There was a problem hiding this comment.
The table test only calls t.Setenv when envValue is non-empty. If TOOLHIVE_HEALTH_CHECK_INTERVAL is set in the ambient environment, the "default when unset" case will read it and fail. Set the env var for every subtest (including the empty-string case) so the test is hermetic.
| t.Run(tt.name, func(t *testing.T) { | ||
| if tt.envValue != "" { | ||
| t.Setenv(HealthCheckPingTimeoutEnvVar, tt.envValue) | ||
| } | ||
| assert.Equal(t, tt.expected, getHealthCheckPingTimeout()) | ||
| }) |
There was a problem hiding this comment.
Same issue as TestGetHealthCheckInterval: the "default when unset" case doesn’t clear TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, so external env configuration can make this test flaky. Ensure each subtest explicitly sets the env var (empty string for the default case).
| t.Run(tt.name, func(t *testing.T) { | ||
| if tt.envValue != "" { | ||
| t.Setenv(HealthCheckRetryDelayEnvVar, tt.envValue) | ||
| } | ||
| assert.Equal(t, tt.expected, getHealthCheckRetryDelay()) | ||
| }) |
There was a problem hiding this comment.
Same hermeticity issue: when envValue is empty, the test doesn’t clear TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, so the default-case assertion can be affected by the process environment. Set the env var for every subtest (including empty string).
| t.Run(tt.name, func(t *testing.T) { | ||
| if tt.envValue != "" { | ||
| t.Setenv(HealthCheckFailureThresholdEnvVar, tt.envValue) | ||
| } | ||
| assert.Equal(t, tt.expected, getHealthCheckFailureThreshold()) | ||
| }) |
There was a problem hiding this comment.
Same hermeticity issue: the "default when unset" case doesn’t clear TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. If a developer/CI sets it, this test will fail. Set the env var for every subtest (empty string for the default case) to isolate from the external environment.
| func TestNewTransparentProxyDefaultValues(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| proxy := newMinimalProxy() | ||
|
|
||
| assert.Equal(t, DefaultPingerTimeout, proxy.healthCheckPingTimeout) | ||
| assert.Equal(t, DefaultHealthCheckRetryDelay, proxy.healthCheckRetryDelay) | ||
| assert.Equal(t, DefaultHealthCheckFailureThreshold, proxy.healthCheckFailureThreshold) | ||
| assert.Equal(t, DefaultHealthCheckInterval, proxy.healthCheckInterval) |
There was a problem hiding this comment.
This test expects default health-check values, but NewTransparentProxyWithOptions reads from env vars at construction time. If any of the TOOLHIVE_HEALTH_CHECK_* env vars are set in the process environment, this can fail. Consider explicitly clearing those env vars in this test before calling newMinimalProxy().
| export TOOLHIVE_HEALTH_CHECK_RETRY_DELAY=10s | ||
| ``` | ||
|
|
||
| > **Note**: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy has a separate `TOOLHIVE_PROXY_REQUEST_TIMEOUT` setting. The vMCP server uses its own circuit breaker pattern. |
There was a problem hiding this comment.
Documentation references TOOLHIVE_PROXY_REQUEST_TIMEOUT, but that env var doesn’t appear to exist in the codebase. Either update this to the correct setting name (if there is one) or remove the reference to avoid misleading users.
| > **Note**: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy has a separate `TOOLHIVE_PROXY_REQUEST_TIMEOUT` setting. The vMCP server uses its own circuit breaker pattern. | |
| > **Note**: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy uses separate timeout settings, and the vMCP server uses its own circuit breaker pattern. |
- Always set env vars in getter tests (including empty string for default cases) so tests are hermetic against ambient environment - Clear all health check env vars in TestNewTransparentProxyDefaultValues - Fix failure window formula in docs to match actual behavior - Remove reference to not-yet-merged env var name in docs Signed-off-by: Greg Katz <gkatz@indeed.com>
|
@claude Review this PR focusing on the new TOOLHIVE_HEALTH_CHECK_* env vars and their interactions, edge cases, and documentation. |
|
I'll analyze this and get back to you. |
amirejaz
left a comment
There was a problem hiding this comment.
Good, focused PR. The env var pattern is consistent and the tests cover the happy/sad paths well. I have four findings: one test reliability bug, a docs inaccuracy, an undocumented edge case, and a now-redundant guard worth cleaning up.
|
|
||
| Duration values use Go's `time.ParseDuration` format (e.g., `10s`, `500ms`, `1m30s`). Invalid values are ignored with a warning log, and the default is used instead. | ||
|
|
||
| **Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` plus the initial failure tick before shutting down — approximately 70 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends. |
There was a problem hiding this comment.
Inaccurate estimate: The formula in this sentence gives 60s with defaults, not ~70s.
(threshold-1) × (interval + retryDelay) = (5-1) × (10s + 5s) = 60s
The "plus the initial failure tick" phrase is ambiguous — it's not a separate additive term, it's already captured by the (threshold-1) cycles in the formula. The 70s figure doesn't follow from the math.
| **Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` plus the initial failure tick before shutting down — approximately 70 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends. | |
| **Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` before shutting down — approximately 60 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends. |
| Duration values use Go's `time.ParseDuration` format (e.g., `10s`, `500ms`, `1m30s`). Invalid values are ignored with a warning log, and the default is used instead. | ||
|
|
||
| **Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` plus the initial failure tick before shutting down — approximately 70 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends. | ||
|
|
There was a problem hiding this comment.
Undocumented interaction worth calling out: when TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT > TOOLHIVE_HEALTH_CHECK_INTERVAL, each retry blocks the monitor goroutine longer than one interval tick. The next tick fires into a buffered channel, so the effective cycle time becomes pingTimeout + retryDelay + pingTimeout rather than interval + retryDelay. This silently extends the failure window beyond what the formula predicts.
Consider adding a note here:
Note: If
TOOLHIVE_HEALTH_CHECK_PING_TIMEOUTis larger thanTOOLHIVE_HEALTH_CHECK_INTERVAL, the effective failure window will exceed the formula above, because each retry blocks the health monitor goroutine for the full ping timeout duration.
| "status", status) | ||
|
|
||
| if consecutiveFailures < healthCheckRetryCount { | ||
| if consecutiveFailures < p.healthCheckFailureThreshold { |
There was a problem hiding this comment.
Edge case: threshold=1 silently disables retries. After consecutiveFailures++ the count is 1. 1 < 1 is false, so execution falls straight through to shutdown — performHealthCheckRetry is never called. A user setting TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD=1 would get "instant shutdown on first failure" behaviour with no retry, which is almost certainly not what they expect given the name.
The getter already accepts 1 as valid (the "valid minimum" test case). Options:
- Raise the minimum to 2 in
getHealthCheckFailureThreshold(validaten >= 2), or - Document this clearly in the env var table in the arch doc (e.g. "Note: a value of 1 disables retries").
At minimum this should be noted in the docs.
| ) | ||
|
|
||
| // performHealthCheckRetry performs a retry health check after a delay | ||
| // Returns true if the retry was successful (health check recovered), false otherwise |
There was a problem hiding this comment.
Redundant guard: The retryDelay == 0 branch below (line 682) can never be true anymore. healthCheckRetryDelay is now always initialised to a positive value by getHealthCheckRetryDelay() in the constructor — either from the env var or DefaultHealthCheckRetryDelay. The withHealthCheckRetryDelay option also guards against non-positive values.
This was meaningful when the struct field could be left at the zero-value, but it's now dead code. Consider removing it to avoid confusion for future readers.
| assert.Equal(t, 8, proxy.healthCheckFailureThreshold) | ||
| } | ||
|
|
||
| func TestWithHealthCheckFailureThresholdOption_IgnoresNonPositive(t *testing.T) { |
There was a problem hiding this comment.
Test is env-var sensitive and can produce a spurious failure. This test calls t.Parallel() but doesn't isolate itself from TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. If that env var is set in the test environment (e.g. to 10), newMinimalProxy() picks it up. The option withHealthCheckFailureThreshold(0) is then a no-op (0 is not > 0), so proxy.healthCheckFailureThreshold becomes 10 — and the assertion against DefaultHealthCheckFailureThreshold (5) fails.
Fix: drop t.Parallel() and add t.Setenv for isolation, matching the pattern used in TestNewTransparentProxyDefaultValues:
| func TestWithHealthCheckFailureThresholdOption_IgnoresNonPositive(t *testing.T) { | |
| func TestWithHealthCheckFailureThresholdOption_IgnoresNonPositive(t *testing.T) { | |
| t.Setenv(HealthCheckFailureThresholdEnvVar, "") | |
| proxy := newMinimalProxy(withHealthCheckFailureThreshold(0)) | |
| assert.Equal(t, DefaultHealthCheckFailureThreshold, proxy.healthCheckFailureThreshold) | |
| } |
- Fix failure window estimate: 70s → 60s to match the formula - Document threshold=1 behavior (immediate shutdown, no retries) - Note that pingTimeout > interval extends the failure window - Remove pre-existing dead code: retryDelay == 0 guard was unreachable before this PR (constructor always set a positive value via DefaultHealthCheckRetryDelay) - Fix test hermeticity: isolate env vars in option function test Signed-off-by: Greg Katz <gkatz@indeed.com>
|
Hi @amirejaz — thanks for the thorough review. Your 5 inline findings plus the Copilot bot's suggestions have all been addressed in the last two commits (
Would you have time for a re-review? Thanks! |

Summary
TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT,TOOLHIVE_HEALTH_CHECK_RETRY_DELAY,TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD) following the existingTOOLHIVE_HEALTH_CHECK_INTERVALpattern.Fixes #4084
Type of change
Test plan
task test)task test-e2e)task lint-fix)Built a test binary and started an MCP server with invalid env vars (
TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD=banana,TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT=not-a-duration). Verified warning logs appeared in the server log file with the env var name, invalid value, and default fallback. Confirmeddefault=5in the threshold warning, showing the new default is active.Changes
pkg/transport/proxy/transparent/transparent_proxy.gohealthCheckRetryCountconstant with struct fieldpkg/transport/proxy/transparent/transparent_test.gonewMinimalProxy()docs/arch/03-transport-architecture.mdDoes this introduce a user-facing change?
Yes. Three new environment variables for tuning health check behavior:
TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT5sTOOLHIVE_HEALTH_CHECK_RETRY_DELAY5sTOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD5Behavioral change: The default failure threshold increases from 3 to 5. This means the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (previously ~40 seconds). This improves resilience to transient network disruptions without significantly delaying detection of genuinely dead servers.
Special notes for reviewers
getHealthCheckInterval()was also updated to addslog.Warnon invalid values andslog.Debugon custom values, for consistency with the new getter functions.healthCheckRetryCountpackage-level constant was replaced with ahealthCheckFailureThresholdstruct field to allow per-proxy configuration via the options pattern.--health-check-failure-thresholdonthv run) would be a natural follow-up, following the precedence pattern suggested in Make proxy response timeout configurable #4063. Happy to include that in this PR if you'd prefer it bundled, or as a separate follow-up.