Make transparent proxy health check parameters configurable by gkatz2 · Pull Request #4108 · stacklok/toolhive

gkatz2 · 2026-03-11T16:36:17Z

Summary

The transparent proxy health check has mostly hardcoded parameters (ping timeout, retry delay, failure threshold), making it impossible to tune resilience for environments with transient network disruptions. When the proxy kills itself due to false positives, the client's MCP session is invalidated, requiring a full client restart to recover.
Add three new environment variables (TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD) following the existing TOOLHIVE_HEALTH_CHECK_INTERVAL pattern.
Raise the default failure threshold from 3 to 5, extending the tolerance window from ~40s to ~60s out of the box.
Add warning logs when invalid env var values are provided, and debug logs when custom values are successfully applied.

Type of change

Test plan

Unit tests (task test)
E2E tests (task test-e2e)
Linting (task lint-fix)
Manual testing (describe below)

Built a test binary and started an MCP server with invalid env vars (TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD=banana, TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT=not-a-duration). Verified warning logs appeared in the server log file with the env var name, invalid value, and default fallback. Confirmed default=5 in the threshold warning, showing the new default is active.

Changes

File	Change
`pkg/transport/proxy/transparent/transparent_proxy.go`	Add env var constants, getter functions, struct field for failure threshold, wire getters in constructor, replace package-level `healthCheckRetryCount` constant with struct field
`pkg/transport/proxy/transparent/transparent_test.go`	Table-driven tests for all 4 getters, constructor wiring tests, option function tests, test helper `newMinimalProxy()`
`docs/arch/03-transport-architecture.md`	Add "Health Check Tuning Parameters" section documenting all 4 env vars

Does this introduce a user-facing change?

Yes. Three new environment variables for tuning health check behavior:

Environment Variable	Description	Default	Type
`TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT`	Timeout for each health check ping	`5s`	duration
`TOOLHIVE_HEALTH_CHECK_RETRY_DELAY`	Delay between retry attempts after a failure	`5s`	duration
`TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD`	Consecutive failures before proxy shutdown	`5`	integer

Behavioral change: The default failure threshold increases from 3 to 5. This means the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (previously ~40 seconds). This improves resilience to transient network disruptions without significantly delaying detection of genuinely dead servers.

Special notes for reviewers

The existing getHealthCheckInterval() was also updated to add slog.Warn on invalid values and slog.Debug on custom values, for consistency with the new getter functions.
The healthCheckRetryCount package-level constant was replaced with a healthCheckFailureThreshold struct field to allow per-proxy configuration via the options pattern.
Per-server flags (e.g., --health-check-failure-threshold on thv run) would be a natural follow-up, following the precedence pattern suggested in Make proxy response timeout configurable #4063. Happy to include that in this PR if you'd prefer it bundled, or as a separate follow-up.

codecov · 2026-03-11T16:39:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.71%. Comparing base (c474326) to head (9e89dc3).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4108      +/-   ##
==========================================
+ Coverage   68.68%   68.71%   +0.03%     
==========================================
  Files         454      458       +4     
  Lines       46027    46263     +236     
==========================================
+ Hits        31612    31789     +177     
- Misses      11977    12027      +50     
- Partials     2438     2447       +9

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The health check ping timeout, retry delay, and failure threshold were hardcoded, making it impossible to tune resilience for environments with transient network disruptions. Only the check interval was configurable via TOOLHIVE_HEALTH_CHECK_INTERVAL. Add three new environment variables following the same pattern: TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, and TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. Raise the default failure threshold from 3 to 5 to extend the tolerance window from ~40s to ~60s, reducing false positives without significantly delaying detection of genuinely dead servers. Fixes stacklok#4084 Signed-off-by: Greg Katz <gkatz@indeed.com>

Signed-off-by: Greg Katz <gkatz@indeed.com>

amirejaz · 2026-03-11T23:25:29Z

@claude could you review this PR?

claude · 2026-03-11T23:26:09Z

Claude Code is working…

I'll analyze this and get back to you.

View job run

Copilot

Pull request overview

This PR makes the transparent proxy health check behavior configurable via environment variables (ping timeout, retry delay, failure threshold), improving resilience in environments with transient network disruptions and increasing the default failure threshold from 3 to 5.

Changes:

Add env var constants + resolution helpers for transparent proxy health check parameters, and wire resolved values into TransparentProxy.
Replace the package-level retry-count constant with a per-proxy healthCheckFailureThreshold field (options + env var).
Add unit tests for env var resolution and constructor wiring, and document the new tuning parameters.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File	Description
pkg/transport/proxy/transparent/transparent_proxy.go	Adds env var resolution for health-check parameters and stores the failure threshold on the proxy instance.
pkg/transport/proxy/transparent/transparent_test.go	Adds table-driven tests for the getter functions and constructor/option wiring.
docs/arch/03-transport-architecture.md	Documents the new transparent proxy health check tuning environment variables.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copilot · 2026-03-11T23:44:38Z

docs/arch/03-transport-architecture.md

+
+Duration values use Go's `time.ParseDuration` format (e.g., `10s`, `500ms`, `1m30s`). Invalid values are ignored with a warning log, and the default is used instead.
+
+**Failure window**: With the defaults, the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (5 intervals × 10s + retry delays). This is designed to survive transient network disruptions without prematurely killing healthy backends.


The failure-window math in parentheses is unclear/inconsistent with the described ~60s tolerance. Given the current retry behavior, consider documenting the window as roughly (failureThreshold-1)×(interval+retryDelay) plus the initial failure tick, so readers can reason about it accurately.

Suggested change

**Failure window**: With the defaults, the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (5 intervals × 10s + retry delays). This is designed to survive transient network disruptions without prematurely killing healthy backends.

**Failure window**: With the defaults, the proxy tolerates approximately 70 seconds of consecutive health check failures before shutting down. In general, the window is roughly `(failureThreshold - 1) × (interval + retryDelay) + interval` (an initial failure tick plus additional failures with retries). This is designed to survive transient network disruptions without prematurely killing healthy backends.

Copilot · 2026-03-11T23:44:39Z

pkg/transport/proxy/transparent/transparent_test.go

+		t.Run(tt.name, func(t *testing.T) {
+			if tt.envValue != "" {
+				t.Setenv(HealthCheckIntervalEnvVar, tt.envValue)
+			}
+			assert.Equal(t, tt.expected, getHealthCheckInterval())
+		})


The table test only calls t.Setenv when envValue is non-empty. If TOOLHIVE_HEALTH_CHECK_INTERVAL is set in the ambient environment, the "default when unset" case will read it and fail. Set the env var for every subtest (including the empty-string case) so the test is hermetic.

Copilot · 2026-03-11T23:44:39Z

pkg/transport/proxy/transparent/transparent_test.go

+		t.Run(tt.name, func(t *testing.T) {
+			if tt.envValue != "" {
+				t.Setenv(HealthCheckPingTimeoutEnvVar, tt.envValue)
+			}
+			assert.Equal(t, tt.expected, getHealthCheckPingTimeout())
+		})


Same issue as TestGetHealthCheckInterval: the "default when unset" case doesn’t clear TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, so external env configuration can make this test flaky. Ensure each subtest explicitly sets the env var (empty string for the default case).

Copilot · 2026-03-11T23:44:39Z

pkg/transport/proxy/transparent/transparent_test.go

+		t.Run(tt.name, func(t *testing.T) {
+			if tt.envValue != "" {
+				t.Setenv(HealthCheckRetryDelayEnvVar, tt.envValue)
+			}
+			assert.Equal(t, tt.expected, getHealthCheckRetryDelay())
+		})


Same hermeticity issue: when envValue is empty, the test doesn’t clear TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, so the default-case assertion can be affected by the process environment. Set the env var for every subtest (including empty string).

Copilot · 2026-03-11T23:44:39Z

pkg/transport/proxy/transparent/transparent_test.go

+		t.Run(tt.name, func(t *testing.T) {
+			if tt.envValue != "" {
+				t.Setenv(HealthCheckFailureThresholdEnvVar, tt.envValue)
+			}
+			assert.Equal(t, tt.expected, getHealthCheckFailureThreshold())
+		})


Same hermeticity issue: the "default when unset" case doesn’t clear TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. If a developer/CI sets it, this test will fail. Set the env var for every subtest (empty string for the default case) to isolate from the external environment.

Copilot · 2026-03-11T23:44:39Z

pkg/transport/proxy/transparent/transparent_test.go

+func TestNewTransparentProxyDefaultValues(t *testing.T) {
+	t.Parallel()
+
+	proxy := newMinimalProxy()
+
+	assert.Equal(t, DefaultPingerTimeout, proxy.healthCheckPingTimeout)
+	assert.Equal(t, DefaultHealthCheckRetryDelay, proxy.healthCheckRetryDelay)
+	assert.Equal(t, DefaultHealthCheckFailureThreshold, proxy.healthCheckFailureThreshold)
+	assert.Equal(t, DefaultHealthCheckInterval, proxy.healthCheckInterval)


This test expects default health-check values, but NewTransparentProxyWithOptions reads from env vars at construction time. If any of the TOOLHIVE_HEALTH_CHECK_* env vars are set in the process environment, this can fail. Consider explicitly clearing those env vars in this test before calling newMinimalProxy().

Copilot · 2026-03-11T23:44:40Z

docs/arch/03-transport-architecture.md

+export TOOLHIVE_HEALTH_CHECK_RETRY_DELAY=10s
+```
+
+> **Note**: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy has a separate `TOOLHIVE_PROXY_REQUEST_TIMEOUT` setting. The vMCP server uses its own circuit breaker pattern.


Documentation references TOOLHIVE_PROXY_REQUEST_TIMEOUT, but that env var doesn’t appear to exist in the codebase. Either update this to the correct setting name (if there is one) or remove the reference to avoid misleading users.

Suggested change

> **Note**: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy has a separate `TOOLHIVE_PROXY_REQUEST_TIMEOUT` setting. The vMCP server uses its own circuit breaker pattern.

> **Note**: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy uses separate timeout settings, and the vMCP server uses its own circuit breaker pattern.

- Always set env vars in getter tests (including empty string for default cases) so tests are hermetic against ambient environment - Clear all health check env vars in TestNewTransparentProxyDefaultValues - Fix failure window formula in docs to match actual behavior - Remove reference to not-yet-merged env var name in docs Signed-off-by: Greg Katz <gkatz@indeed.com>

amirejaz · 2026-03-12T00:34:55Z

@claude Review this PR focusing on the new TOOLHIVE_HEALTH_CHECK_* env vars and their interactions, edge cases, and documentation.

claude · 2026-03-12T00:35:39Z

Claude Code is working…

I'll analyze this and get back to you.

View job run

amirejaz

Good, focused PR. The env var pattern is consistent and the tests cover the happy/sad paths well. I have four findings: one test reliability bug, a docs inaccuracy, an undocumented edge case, and a now-redundant guard worth cleaning up.

amirejaz · 2026-03-12T04:34:44Z

docs/arch/03-transport-architecture.md

+
+Duration values use Go's `time.ParseDuration` format (e.g., `10s`, `500ms`, `1m30s`). Invalid values are ignored with a warning log, and the default is used instead.
+
+**Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` plus the initial failure tick before shutting down — approximately 70 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends.


Inaccurate estimate: The formula in this sentence gives 60s with defaults, not ~70s.

(threshold-1) × (interval + retryDelay) = (5-1) × (10s + 5s) = 60s

The "plus the initial failure tick" phrase is ambiguous — it's not a separate additive term, it's already captured by the (threshold-1) cycles in the formula. The 70s figure doesn't follow from the math.

Suggested change

**Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` plus the initial failure tick before shutting down — approximately 70 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends.

**Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` before shutting down — approximately 60 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends.

amirejaz · 2026-03-12T04:34:44Z

docs/arch/03-transport-architecture.md

+Duration values use Go's `time.ParseDuration` format (e.g., `10s`, `500ms`, `1m30s`). Invalid values are ignored with a warning log, and the default is used instead.
+
+**Failure window**: With the defaults, the proxy tolerates roughly `(threshold-1) × (interval + retryDelay)` plus the initial failure tick before shutting down — approximately 70 seconds with default values. This is designed to survive transient network disruptions without prematurely killing healthy backends.
+


Undocumented interaction worth calling out: when TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT > TOOLHIVE_HEALTH_CHECK_INTERVAL, each retry blocks the monitor goroutine longer than one interval tick. The next tick fires into a buffered channel, so the effective cycle time becomes pingTimeout + retryDelay + pingTimeout rather than interval + retryDelay. This silently extends the failure window beyond what the formula predicts.

Consider adding a note here:

Note: If TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT is larger than TOOLHIVE_HEALTH_CHECK_INTERVAL, the effective failure window will exceed the formula above, because each retry blocks the health monitor goroutine for the full ping timeout duration.

amirejaz · 2026-03-12T04:34:44Z

pkg/transport/proxy/transparent/transparent_proxy.go

 		"status", status)

-	if consecutiveFailures < healthCheckRetryCount {
+	if consecutiveFailures < p.healthCheckFailureThreshold {


Edge case: threshold=1 silently disables retries. After consecutiveFailures++ the count is 1. 1 < 1 is false, so execution falls straight through to shutdown — performHealthCheckRetry is never called. A user setting TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD=1 would get "instant shutdown on first failure" behaviour with no retry, which is almost certainly not what they expect given the name.

The getter already accepts 1 as valid (the "valid minimum" test case). Options:

Raise the minimum to 2 in getHealthCheckFailureThreshold (validate n >= 2), or

Document this clearly in the env var table in the arch doc (e.g. "Note: a value of 1 disables retries").

At minimum this should be noted in the docs.

amirejaz · 2026-03-12T04:34:44Z

pkg/transport/proxy/transparent/transparent_proxy.go

-)
-
 // performHealthCheckRetry performs a retry health check after a delay
 // Returns true if the retry was successful (health check recovered), false otherwise


Redundant guard: The retryDelay == 0 branch below (line 682) can never be true anymore. healthCheckRetryDelay is now always initialised to a positive value by getHealthCheckRetryDelay() in the constructor — either from the env var or DefaultHealthCheckRetryDelay. The withHealthCheckRetryDelay option also guards against non-positive values.

This was meaningful when the struct field could be left at the zero-value, but it's now dead code. Consider removing it to avoid confusion for future readers.

amirejaz · 2026-03-12T04:34:44Z

pkg/transport/proxy/transparent/transparent_test.go

+	assert.Equal(t, 8, proxy.healthCheckFailureThreshold)
+}
+
+func TestWithHealthCheckFailureThresholdOption_IgnoresNonPositive(t *testing.T) {


Test is env-var sensitive and can produce a spurious failure. This test calls t.Parallel() but doesn't isolate itself from TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. If that env var is set in the test environment (e.g. to 10), newMinimalProxy() picks it up. The option withHealthCheckFailureThreshold(0) is then a no-op (0 is not > 0), so proxy.healthCheckFailureThreshold becomes 10 — and the assertion against DefaultHealthCheckFailureThreshold (5) fails.

Fix: drop t.Parallel() and add t.Setenv for isolation, matching the pattern used in TestNewTransparentProxyDefaultValues:

Suggested change

func TestWithHealthCheckFailureThresholdOption_IgnoresNonPositive(t *testing.T) {

func TestWithHealthCheckFailureThresholdOption_IgnoresNonPositive(t *testing.T) {

t.Setenv(HealthCheckFailureThresholdEnvVar, "")

proxy := newMinimalProxy(withHealthCheckFailureThreshold(0))

assert.Equal(t, DefaultHealthCheckFailureThreshold, proxy.healthCheckFailureThreshold)

}

- Fix failure window estimate: 70s → 60s to match the formula - Document threshold=1 behavior (immediate shutdown, no retries) - Note that pingTimeout > interval extends the failure window - Remove pre-existing dead code: retryDelay == 0 guard was unreachable before this PR (constructor always set a positive value via DefaultHealthCheckRetryDelay) - Fix test hermeticity: isolate env vars in option function test Signed-off-by: Greg Katz <gkatz@indeed.com>

gkatz2 · 2026-03-18T06:01:21Z

Hi @amirejaz — thanks for the thorough review. Your 5 inline findings plus the Copilot bot's suggestions have all been addressed in the last two commits (9e89dc30 and a41f8a87):

Test hermeticity: all getter tests and the default-values test now explicitly set env vars (including empty string for default cases)
Failure window formula corrected to ~60s
pingTimeout > interval interaction documented
threshold=1 behavior (immediate shutdown, no retries) documented
Dead retryDelay == 0 guard removed

Would you have time for a re-review? Thanks!

gkatz2 requested review from ChrisJBurns, JAORMX, amirejaz, blkt, jhrozek, rdimitrov and yrobla as code owners March 11, 2026 16:36

github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 11, 2026

gkatz2 force-pushed the fix/configurable-health-check-4084 branch from 0656ee6 to 7d879e7 Compare March 11, 2026 18:54

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026

Retrigger CI

f0ec550

Signed-off-by: Greg Katz <gkatz@indeed.com>

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026

Merge branch 'main' into fix/configurable-health-check-4084

2fd71b6

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026

amirejaz requested a review from Copilot March 11, 2026 23:41

Copilot started reviewing on behalf of amirejaz March 11, 2026 23:41 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026

amirejaz reviewed Mar 12, 2026

View reviewed changes

gkatz2 closed this Mar 15, 2026

gkatz2 reopened this Mar 15, 2026


		Duration values use Go's `time.ParseDuration` format (e.g., `10s`, `500ms`, `1m30s`). Invalid values are ignored with a warning log, and the default is used instead.

		Failure window: With the defaults, the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (5 intervals × 10s + retry delays). This is designed to survive transient network disruptions without prematurely killing healthy backends.

	> Note: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy has a separate `TOOLHIVE_PROXY_REQUEST_TIMEOUT` setting. The vMCP server uses its own circuit breaker pattern.
	> Note: These parameters only affect the transparent proxy (used by SSE and streamable HTTP transports). The stdio transport's streamable HTTP proxy uses separate timeout settings, and the vMCP server uses its own circuit breaker pattern.

Conversation

gkatz2 commented Mar 11, 2026

Summary

Type of change

Test plan

Changes

Does this introduce a user-facing change?

Special notes for reviewers

Uh oh!

codecov bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

amirejaz commented Mar 11, 2026

Uh oh!

claude bot commented Mar 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

amirejaz commented Mar 12, 2026

Uh oh!

claude bot commented Mar 12, 2026

Uh oh!

amirejaz left a comment

Choose a reason for hiding this comment

Uh oh!

amirejaz Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

amirejaz Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

amirejaz Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

amirejaz Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

amirejaz Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

gkatz2 commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Mar 11, 2026 •

edited

Loading