Skip to content

Make transparent proxy health check parameters configurable#4108

Open
gkatz2 wants to merge 5 commits intostacklok:mainfrom
gkatz2:fix/configurable-health-check-4084
Open

Make transparent proxy health check parameters configurable#4108
gkatz2 wants to merge 5 commits intostacklok:mainfrom
gkatz2:fix/configurable-health-check-4084

Conversation

@gkatz2
Copy link
Contributor

@gkatz2 gkatz2 commented Mar 11, 2026

Summary

  • The transparent proxy health check has mostly hardcoded parameters (ping timeout, retry delay, failure threshold), making it impossible to tune resilience for environments with transient network disruptions. When the proxy kills itself due to false positives, the client's MCP session is invalidated, requiring a full client restart to recover.
  • Add three new environment variables (TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD) following the existing TOOLHIVE_HEALTH_CHECK_INTERVAL pattern.
  • Raise the default failure threshold from 3 to 5, extending the tolerance window from ~40s to ~60s out of the box.
  • Add warning logs when invalid env var values are provided, and debug logs when custom values are successfully applied.

Fixes #4084

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Built a test binary and started an MCP server with invalid env vars (TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD=banana, TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT=not-a-duration). Verified warning logs appeared in the server log file with the env var name, invalid value, and default fallback. Confirmed default=5 in the threshold warning, showing the new default is active.

Changes

File Change
pkg/transport/proxy/transparent/transparent_proxy.go Add env var constants, getter functions, struct field for failure threshold, wire getters in constructor, replace package-level healthCheckRetryCount constant with struct field
pkg/transport/proxy/transparent/transparent_test.go Table-driven tests for all 4 getters, constructor wiring tests, option function tests, test helper newMinimalProxy()
docs/arch/03-transport-architecture.md Add "Health Check Tuning Parameters" section documenting all 4 env vars

Does this introduce a user-facing change?

Yes. Three new environment variables for tuning health check behavior:

Environment Variable Description Default Type
TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT Timeout for each health check ping 5s duration
TOOLHIVE_HEALTH_CHECK_RETRY_DELAY Delay between retry attempts after a failure 5s duration
TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD Consecutive failures before proxy shutdown 5 integer

Behavioral change: The default failure threshold increases from 3 to 5. This means the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (previously ~40 seconds). This improves resilience to transient network disruptions without significantly delaying detection of genuinely dead servers.

Special notes for reviewers

  • The existing getHealthCheckInterval() was also updated to add slog.Warn on invalid values and slog.Debug on custom values, for consistency with the new getter functions.
  • The healthCheckRetryCount package-level constant was replaced with a healthCheckFailureThreshold struct field to allow per-proxy configuration via the options pattern.
  • Per-server flags (e.g., --health-check-failure-threshold on thv run) would be a natural follow-up, following the precedence pattern suggested in Make proxy response timeout configurable #4063. Happy to include that in this PR if you'd prefer it bundled, or as a separate follow-up.

@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 11, 2026
@codecov
Copy link

codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.71%. Comparing base (c474326) to head (9e89dc3).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4108      +/-   ##
==========================================
+ Coverage   68.68%   68.71%   +0.03%     
==========================================
  Files         454      458       +4     
  Lines       46027    46263     +236     
==========================================
+ Hits        31612    31789     +177     
- Misses      11977    12027      +50     
- Partials     2438     2447       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The health check ping timeout, retry delay, and failure threshold
were hardcoded, making it impossible to tune resilience for
environments with transient network disruptions. Only the check
interval was configurable via TOOLHIVE_HEALTH_CHECK_INTERVAL.

Add three new environment variables following the same pattern:
TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, TOOLHIVE_HEALTH_CHECK_RETRY_DELAY,
and TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. Raise the default failure
threshold from 3 to 5 to extend the tolerance window from ~40s to ~60s,
reducing false positives without significantly delaying detection of
genuinely dead servers.

Fixes stacklok#4084

Signed-off-by: Greg Katz <gkatz@indeed.com>
@gkatz2 gkatz2 force-pushed the fix/configurable-health-check-4084 branch from 0656ee6 to 7d879e7 Compare March 11, 2026 18:54
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026
Signed-off-by: Greg Katz <gkatz@indeed.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026
@amirejaz
Copy link
Contributor

@claude could you review this PR?

@claude
Copy link
Contributor

claude bot commented Mar 11, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes the transparent proxy health check behavior configurable via environment variables (ping timeout, retry delay, failure threshold), improving resilience in environments with transient network disruptions and increasing the default failure threshold from 3 to 5.

Changes:

  • Add env var constants + resolution helpers for transparent proxy health check parameters, and wire resolved values into TransparentProxy.
  • Replace the package-level retry-count constant with a per-proxy healthCheckFailureThreshold field (options + env var).
  • Add unit tests for env var resolution and constructor wiring, and document the new tuning parameters.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
pkg/transport/proxy/transparent/transparent_proxy.go Adds env var resolution for health-check parameters and stores the failure threshold on the proxy instance.
pkg/transport/proxy/transparent/transparent_test.go Adds table-driven tests for the getter functions and constructor/option wiring.
docs/arch/03-transport-architecture.md Documents the new transparent proxy health check tuning environment variables.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.


Duration values use Go's `time.ParseDuration` format (e.g., `10s`, `500ms`, `1m30s`). Invalid values are ignored with a warning log, and the default is used instead.

**Failure window**: With the defaults, the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (5 intervals × 10s + retry delays). This is designed to survive transient network disruptions without prematurely killing healthy backends.
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure-window math in parentheses is unclear/inconsistent with the described ~60s tolerance. Given the current retry behavior, consider documenting the window as roughly (failureThreshold-1)×(interval+retryDelay) plus the initial failure tick, so readers can reason about it accurately.

Suggested change
**Failure window**: With the defaults, the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (5 intervals × 10s + retry delays). This is designed to survive transient network disruptions without prematurely killing healthy backends.
**Failure window**: With the defaults, the proxy tolerates approximately 70 seconds of consecutive health check failures before shutting down. In general, the window is roughly `(failureThreshold - 1) × (interval + retryDelay) + interval` (an initial failure tick plus additional failures with retries). This is designed to survive transient network disruptions without prematurely killing healthy backends.

Copilot uses AI. Check for mistakes.
Comment on lines +597 to +602
t.Run(tt.name, func(t *testing.T) {
if tt.envValue != "" {
t.Setenv(HealthCheckIntervalEnvVar, tt.envValue)
}
assert.Equal(t, tt.expected, getHealthCheckInterval())
})
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table test only calls t.Setenv when envValue is non-empty. If TOOLHIVE_HEALTH_CHECK_INTERVAL is set in the ambient environment, the "default when unset" case will read it and fail. Set the env var for every subtest (including the empty-string case) so the test is hermetic.

Copilot uses AI. Check for mistakes.
Comment on lines +621 to +626
t.Run(tt.name, func(t *testing.T) {
if tt.envValue != "" {
t.Setenv(HealthCheckPingTimeoutEnvVar, tt.envValue)
}
assert.Equal(t, tt.expected, getHealthCheckPingTimeout())
})
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as TestGetHealthCheckInterval: the "default when unset" case doesn’t clear TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, so external env configuration can make this test flaky. Ensure each subtest explicitly sets the env var (empty string for the default case).

Copilot uses AI. Check for mistakes.
Comment on lines +645 to +650
t.Run(tt.name, func(t *testing.T) {
if tt.envValue != "" {
t.Setenv(HealthCheckRetryDelayEnvVar, tt.envValue)
}
assert.Equal(t, tt.expected, getHealthCheckRetryDelay())
})
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same hermeticity issue: when envValue is empty, the test doesn’t clear TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, so the default-case assertion can be affected by the process environment. Set the env var for every subtest (including empty string).

Copilot uses AI. Check for mistakes.
Comment on lines +670 to +675
t.Run(tt.name, func(t *testing.T) {
if tt.envValue != "" {
t.Setenv(HealthCheckFailureThresholdEnvVar, tt.envValue)
}
assert.Equal(t, tt.expected, getHealthCheckFailureThreshold())
})
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same hermeticity issue: the "default when unset" case doesn’t clear TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. If a developer/CI sets it, this test will fail. Set the env var for every subtest (empty string for the default case) to isolate from the external environment.

Copilot uses AI. Check for mistakes.
Comment on lines +693 to +701
func TestNewTransparentProxyDefaultValues(t *testing.T) {
t.Parallel()

proxy := newMinimalProxy()

assert.Equal(t, DefaultPingerTimeout, proxy.healthCheckPingTimeout)
assert.Equal(t, DefaultHealthCheckRetryDelay, proxy.healthCheckRetryDelay)
assert.Equal(t, DefaultHealthCheckFailureThreshold, proxy.healthCheckFailureThreshold)
assert.Equal(t, DefaultHealthCheckInterval, proxy.healthCheckInterval)
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test expects default health-check values, but NewTransparentProxyWithOptions reads from env vars at construction time. If any of the TOOLHIVE_HEALTH_CHECK_* env vars are set in the process environment, this can fail. Consider explicitly clearing those env vars in this test before calling newMinimalProxy().

Copilot uses AI. Check for mistakes.
export TOOLHIVE_HEALTH_CHECK_RETRY_DELAY=10s
```

> **Note**: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy has a separate `TOOLHIVE_PROXY_REQUEST_TIMEOUT` setting. The vMCP server uses its own circuit breaker pattern.
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation references TOOLHIVE_PROXY_REQUEST_TIMEOUT, but that env var doesn’t appear to exist in the codebase. Either update this to the correct setting name (if there is one) or remove the reference to avoid misleading users.

Suggested change
> **Note**: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy has a separate `TOOLHIVE_PROXY_REQUEST_TIMEOUT` setting. The vMCP server uses its own circuit breaker pattern.
> **Note**: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy uses separate timeout settings, and the vMCP server uses its own circuit breaker pattern.

Copilot uses AI. Check for mistakes.
- Always set env vars in getter tests (including empty string for
  default cases) so tests are hermetic against ambient environment
- Clear all health check env vars in TestNewTransparentProxyDefaultValues
- Fix failure window formula in docs to match actual behavior
- Remove reference to not-yet-merged env var name in docs

Signed-off-by: Greg Katz <gkatz@indeed.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026
@amirejaz
Copy link
Contributor

@claude Review this PR focusing on the new TOOLHIVE_HEALTH_CHECK_* env vars and their interactions, edge cases, and documentation.

@claude
Copy link
Contributor

claude bot commented Mar 12, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

Copy link
Contributor

@amirejaz amirejaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, focused PR. The env var pattern is consistent and the tests cover the happy/sad paths well. I have four findings: one test reliability bug, a docs inaccuracy, an undocumented edge case, and a now-redundant guard worth cleaning up.


Duration values use Go's `time.ParseDuration` format (e.g., `10s`, `500ms`, `1m30s`). Invalid values are ignored with a warning log, and the default is used instead.

**Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` plus the initial failure tick before shutting down — approximately 70 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inaccurate estimate: The formula in this sentence gives 60s with defaults, not ~70s.

(threshold-1) × (interval + retryDelay) = (5-1) × (10s + 5s) = 60s

The "plus the initial failure tick" phrase is ambiguous — it's not a separate additive term, it's already captured by the (threshold-1) cycles in the formula. The 70s figure doesn't follow from the math.

Suggested change
**Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` plus the initial failure tick before shutting down — approximately 70 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends.
**Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` before shutting down — approximately 60 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends.

Duration values use Go's `time.ParseDuration` format (e.g., `10s`, `500ms`, `1m30s`). Invalid values are ignored with a warning log, and the default is used instead.

**Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` plus the initial failure tick before shutting down — approximately 70 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undocumented interaction worth calling out: when TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT > TOOLHIVE_HEALTH_CHECK_INTERVAL, each retry blocks the monitor goroutine longer than one interval tick. The next tick fires into a buffered channel, so the effective cycle time becomes pingTimeout + retryDelay + pingTimeout rather than interval + retryDelay. This silently extends the failure window beyond what the formula predicts.

Consider adding a note here:

Note: If TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT is larger than TOOLHIVE_HEALTH_CHECK_INTERVAL, the effective failure window will exceed the formula above, because each retry blocks the health monitor goroutine for the full ping timeout duration.

"status", status)

if consecutiveFailures < healthCheckRetryCount {
if consecutiveFailures < p.healthCheckFailureThreshold {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edge case: threshold=1 silently disables retries. After consecutiveFailures++ the count is 1. 1 < 1 is false, so execution falls straight through to shutdown — performHealthCheckRetry is never called. A user setting TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD=1 would get "instant shutdown on first failure" behaviour with no retry, which is almost certainly not what they expect given the name.

The getter already accepts 1 as valid (the "valid minimum" test case). Options:

  1. Raise the minimum to 2 in getHealthCheckFailureThreshold (validate n >= 2), or
  2. Document this clearly in the env var table in the arch doc (e.g. "Note: a value of 1 disables retries").

At minimum this should be noted in the docs.

)

// performHealthCheckRetry performs a retry health check after a delay
// Returns true if the retry was successful (health check recovered), false otherwise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant guard: The retryDelay == 0 branch below (line 682) can never be true anymore. healthCheckRetryDelay is now always initialised to a positive value by getHealthCheckRetryDelay() in the constructor — either from the env var or DefaultHealthCheckRetryDelay. The withHealthCheckRetryDelay option also guards against non-positive values.

This was meaningful when the struct field could be left at the zero-value, but it's now dead code. Consider removing it to avoid confusion for future readers.

assert.Equal(t, 8, proxy.healthCheckFailureThreshold)
}

func TestWithHealthCheckFailureThresholdOption_IgnoresNonPositive(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test is env-var sensitive and can produce a spurious failure. This test calls t.Parallel() but doesn't isolate itself from TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. If that env var is set in the test environment (e.g. to 10), newMinimalProxy() picks it up. The option withHealthCheckFailureThreshold(0) is then a no-op (0 is not > 0), so proxy.healthCheckFailureThreshold becomes 10 — and the assertion against DefaultHealthCheckFailureThreshold (5) fails.

Fix: drop t.Parallel() and add t.Setenv for isolation, matching the pattern used in TestNewTransparentProxyDefaultValues:

Suggested change
func TestWithHealthCheckFailureThresholdOption_IgnoresNonPositive(t *testing.T) {
func TestWithHealthCheckFailureThresholdOption_IgnoresNonPositive(t *testing.T) {
t.Setenv(HealthCheckFailureThresholdEnvVar, "")
proxy := newMinimalProxy(withHealthCheckFailureThreshold(0))
assert.Equal(t, DefaultHealthCheckFailureThreshold, proxy.healthCheckFailureThreshold)
}

- Fix failure window estimate: 70s → 60s to match the formula
- Document threshold=1 behavior (immediate shutdown, no retries)
- Note that pingTimeout > interval extends the failure window
- Remove pre-existing dead code: retryDelay == 0 guard was
  unreachable before this PR (constructor always set a positive
  value via DefaultHealthCheckRetryDelay)
- Fix test hermeticity: isolate env vars in option function test

Signed-off-by: Greg Katz <gkatz@indeed.com>
@gkatz2 gkatz2 closed this Mar 15, 2026
@gkatz2 gkatz2 reopened this Mar 15, 2026
@gkatz2
Copy link
Contributor Author

gkatz2 commented Mar 18, 2026

Hi @amirejaz — thanks for the thorough review. Your 5 inline findings plus the Copilot bot's suggestions have all been addressed in the last two commits (9e89dc30 and a41f8a87):

  • Test hermeticity: all getter tests and the default-values test now explicitly set env vars (including empty string for default cases)
  • Failure window formula corrected to ~60s
  • pingTimeout > interval interaction documented
  • threshold=1 behavior (immediate shutdown, no retries) documented
  • Dead retryDelay == 0 guard removed

Would you have time for a re-review? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make transparent proxy health check parameters configurable

4 participants