
Conversation

@tbuchaillot
Contributor

Description

Related Issue

Motivation and Context

How This Has Been Tested

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Refactoring or add test (improvements in base code or adds test coverage to functionality)

Checklist

  • I ensured that the documentation is up to date
  • I explained why this PR updates go.mod in detail with reasoning why it's required
  • I would like a code coverage CI quality gate exception and have explained why

@probelabs

probelabs bot commented Nov 13, 2025

🔍 Code Analysis Results

This PR introduces comprehensive, native Prometheus metrics support to Tyk Gateway, enhancing its observability. It adds a new, configurable HTTP endpoint (defaulting to :9090/metrics) that exposes a wide range of metrics covering system health, gateway request processing (RED metrics), Redis connection pools, and middleware performance.

The implementation is self-contained and follows existing instrumentation patterns, running alongside StatsD and NewRelic integrations without interference. It includes detailed documentation on configuration, available metrics, and best practices for cardinality management.

Files Changed Analysis

The changes introduce a significant new feature, reflected in the addition of several key files and modifications to core gateway components:

  • New Documentation (IMPLEMENTATION_PLAN.md, PROMETHEUS_METRICS.md): Provides a detailed implementation strategy and comprehensive user-facing documentation for the new feature.
  • New Core Logic (gateway/instrumentation_prometheus.go): This is the heart of the feature, defining and managing all Prometheus metric collectors, including gauges, counters, and histograms for system, gateway, and Redis stats. It also handles the periodic collection of metrics that are not event-driven.
  • New Integration Sink (gateway/instrumentation_prometheus_sink.go): Implements the health.Sink interface, allowing the Prometheus module to subscribe to existing internal gateway events (e.g., middleware execution timing) without modifying the core middleware processing loop.
  • Configuration (config/config.go, tyk.conf.example): A new prometheus section is added to the gateway configuration, allowing users to enable the feature and customize settings like the listen address, path, and metric prefix; an illustrative sketch of such a block follows this list.
  • Gateway Lifecycle (gateway/server.go, gateway/instrumentation_handlers.go): The Prometheus metrics server is integrated into the gateway's startup and graceful shutdown procedures.
  • Request Path Integration (gateway/handler_success.go): Hooks into the success handler to record critical per-request metrics, including latency (total, upstream, gateway) and status codes.
  • Storage Layer (storage/connection_handler.go, storage/redis_cluster.go): The Redis connection handler is updated to expose connection pool statistics, which are then consumed by the new metrics module.
  • Testing Utility (generate-traffic.sh): A new script is included to help developers and users generate load for testing and validating the metrics output.
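
For orientation, here is a hedged Go sketch of what the new prometheus configuration block could look like; the struct and field names below are assumptions drawn from the settings named above, not the actual definitions in config/config.go:

// Illustrative sketch only: the real field names in config/config.go may differ.
type PrometheusConfig struct {
    Enabled             bool   `json:"enabled"`
    ListenAddress       string `json:"listen_address"` // e.g. ":9090"
    Path                string `json:"path"`           // e.g. "/metrics"
    MetricPrefix        string `json:"metric_prefix"`  // e.g. "tyk_gateway"
    EnablePerAPIMetrics bool   `json:"enable_per_api_metrics"` // opt-in, high cardinality
}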

Architecture & Impact Assessment

This PR introduces a new observability component to the Tyk Gateway architecture.

  • What this PR accomplishes: It provides a robust, industry-standard solution for monitoring the gateway's health and performance using Prometheus. This allows operators to gain deeper insights, set up sophisticated alerting, and diagnose issues more effectively.
  • Key technical changes:
    • A new, independent HTTP server is started on a dedicated port to serve the /metrics endpoint, ensuring separation from the main API traffic.
    • It leverages the prometheus/client_golang library for all metric types.
    • A background goroutine is launched for periodic polling of system metrics (CPU, memory, goroutines) and Redis stats, while request-specific metrics are pushed in real time from the request handler; a minimal sketch of this pattern follows the list.
    • It introduces a distinction between low-cardinality overall gateway metrics (enabled by default) and high-cardinality per-API metrics (opt-in) to give users control over performance impact.
  • Affected system components:
    • Gateway Core: The main server lifecycle is updated to manage the new Prometheus server.
    • Configuration: A new prometheus block is added.
    • Request Handling: The success handler now records metrics for every request.
    • Storage: The Redis client layer now exposes pool statistics.
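
To make the dedicated-server and background-polling pattern concrete, here is a minimal sketch using the public prometheus/client_golang API; the metric name, poll interval, and listen address are illustrative and not taken from the PR:

// Sketch only: a standalone illustration of the pattern, not the PR's code.
package main

import (
    "log"
    "net/http"
    "runtime"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    registry := prometheus.NewRegistry()

    goroutines := prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "tyk_gateway_system_goroutines",
        Help: "Number of goroutines currently running.",
    })
    registry.MustRegister(goroutines)

    // Background goroutine: periodically refresh system-level gauges,
    // independently of the request path.
    go func() {
        ticker := time.NewTicker(10 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            goroutines.Set(float64(runtime.NumGoroutine()))
        }
    }()

    // Dedicated HTTP server on its own port, kept separate from API traffic.
    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
    if err := http.ListenAndServe("localhost:9090", mux); err != nil {
        log.Printf("metrics server stopped: %v", err) // non-fatal by design
    }
}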

The following diagram illustrates how the new Prometheus component interacts with the rest of the gateway:

graph TD
    subgraph Tyk Gateway
        A[Gateway Server] -- initializes --> B(Instrumentation Handlers)
        B -- creates --> C[PrometheusMetrics Service]
        A -- handles --> D(API Request Handlers)
        D -- on completion, records --> C
        C -- periodically collects from --> E(System Stats)
        C -- periodically collects from --> F(Storage Layer / Redis)
    end
    subgraph "Metrics Endpoint (:9090)"
        H(HTTP Server) -- serves --> I(/metrics)
        C -- provides metrics to --> H
    end
    J[Prometheus Scraper] -- scrapes --> I

Scope Discovery & Context Expansion

The changes are well-encapsulated, but their impact touches several key areas of the gateway's operation:

  • Observability: This is a major enhancement to the gateway's observability stack, providing a modern, pull-based monitoring system alongside existing push-based integrations.
  • Performance: The introduction of a new metrics collection system has performance implications. The design thoughtfully addresses this by making high-cardinality metrics optional and performing periodic updates in a separate goroutine. The impact on request latency should be minimal as the RecordRequest call is lightweight.
  • Operations: Operators will need to update their deployment configurations to expose the new metrics port (:9090) and configure their Prometheus instances to scrape the new endpoint. The provided documentation is crucial for this.
  • Dependencies: The PR adds a new dependency on github.com/prometheus/client_golang, which will be reflected in go.mod and go.sum.

Based on the implementation, further areas to investigate would be:

  • Testing: The PR is a "POC" and lacks unit or integration tests for the new functionality. These would be essential before merging to a main branch to validate metric accuracy and server stability.
  • Security: The documentation mentions securing the metrics endpoint. In a production environment, it would be critical to ensure this endpoint is not publicly exposed, typically by binding it to a local or internal interface and using firewall rules.

Metadata

  • Review Effort: 5 / 5
  • Primary Label: feature

Powered by Visor from Probelabs

Last updated: 2025-11-13T17:23:24.865Z | Triggered by: opened | Commit: f0e00dd

💡 TIP: You can chat with Visor using /visor ask <your question>

@probelabs

probelabs bot commented Nov 13, 2025

🔍 Code Analysis Results

Security Issues (2)

Severity Location Issue
🟡 Warning config/config.go:63
The default listen address for the Prometheus endpoint is set to `:9090`, which binds to all network interfaces. This could unintentionally expose internal gateway metrics to external networks if the host has a public IP address and is not explicitly firewalled.
💡 Suggestion: Change the default `ListenAddress` to `localhost:9090`. This would make the metrics endpoint accessible only from the local machine by default, which is a more secure starting point. Users requiring remote access can then explicitly configure it to `0.0.0.0:9090` or a specific interface IP.
🟡 Warning gateway/instrumentation_handlers.go:70
The application calls `log.Fatal` if the Prometheus metrics server fails to start, for example, due to a port conflict. This terminates the entire Tyk gateway process, creating a denial-of-service vulnerability that can be triggered by a local user or process occupying the metrics port. The failure of an auxiliary service like metrics should not cause the main application to crash.
💡 Suggestion: Replace `log.WithError(err).Fatal(...)` with a non-fatal error log, such as `log.WithError(err).Error(...)`. This will record the failure to start the metrics server while allowing the gateway to continue its primary function of proxying API traffic (see the sketch after this table).
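
As a rough illustration of the non-fatal handling suggested above, here is a minimal Go sketch; `metricsServer` and the logrus-style logger are stand-ins, not the PR's actual identifiers:

// Sketch only: metricsServer and log are illustrative names.
go func() {
    if err := metricsServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
        // Log and keep the gateway running: a failed metrics listener must not be fatal.
        log.WithError(err).Error("Prometheus metrics server failed to start")
    }
}()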

Architecture Issues (4)

Severity Location Issue
🟠 Error gateway/instrumentation_prometheus.go:367-382
The `tyk_gateway_system_cpu_usage_percent` metric is calculated based only on garbage collection pause times (`m.PauseTotalNs`). This is not representative of the total process CPU usage and can be highly misleading, as the metric's name implies it tracks overall CPU usage.
💡 Suggestion: Remove this custom calculation. The optional process collector (`collectors.NewProcessCollector`), which is already part of this PR, provides a `process_cpu_seconds_total` counter. Users can calculate the rate of this counter in their Prometheus queries (e.g., `rate(process_cpu_seconds_total[5m])`) to get accurate CPU usage. The documentation should be updated to reflect this.
🟠 Error gateway/instrumentation_handlers.go:88
The gateway will exit with a fatal error if the Prometheus metrics server fails to start (e.g., due to the port being in use). An auxiliary component like the metrics server should not be able to crash the entire gateway, which is a critical process.
💡 Suggestion: Replace `log.WithError(err).Fatal(...)` with `log.WithError(err).Error(...)`. This will log the error and allow the gateway to continue running without the Prometheus metrics endpoint, improving the gateway's resilience.
🟡 Warning gateway/instrumentation_prometheus.go:240-246
The `redisLatency` histogram metric is defined and registered, but its `Observe` method is never called. This results in an empty metric being exposed, which can be confusing for users.
💡 Suggestion: Either implement the logic to measure and record Redis operation latency or remove the unused metric. If this is planned for future work, a `// TODO` comment would be appropriate.
🟡 Warning storage/connection_handler.go:309-314
The new `RedisPoolStats` struct is a subset of `redis.PoolStats` from the `go-redis` library. The project already re-exports `redis.PoolStats` via the `internal/redis` package. Introducing a new, almost identical struct is redundant and inconsistent with the existing pattern of reusing types from the Redis client library.
💡 Suggestion: Remove the `storage.RedisPoolStats` struct. Modify the `storage.ConnectionHandler.GetRedisStats` function to return the `*redis.PoolStats` type directly (using the internal package alias). This will reduce code duplication and improve consistency; a rough sketch of the consuming side follows this table.
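
To make the suggestion above concrete, here is a hedged sketch of how the metrics module could consume `*redis.PoolStats` directly; the gauge field names are assumptions, while `UpdateRedisMetrics` is the updater named elsewhere in this review:

// Sketch only: the pm.redisPool* gauge fields are illustrative names.
// redis.PoolStats is the go-redis type re-exported via the internal/redis package.
func (pm *PrometheusMetrics) UpdateRedisMetrics(stats *redis.PoolStats) {
    if stats == nil {
        return
    }
    pm.redisPoolHits.Set(float64(stats.Hits))
    pm.redisPoolMisses.Set(float64(stats.Misses))
    pm.redisPoolTimeouts.Set(float64(stats.Timeouts))
    pm.redisPoolTotalConns.Set(float64(stats.TotalConns))
    pm.redisPoolIdleConns.Set(float64(stats.IdleConns))
    pm.redisPoolStaleConns.Set(float64(stats.StaleConns))
}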

Performance Issues (3)

Severity Location Issue
🟡 Warning gateway/instrumentation_prometheus.go:385-418
The `RecordRequest` function, which is on a hot path, allocates multiple `prometheus.Labels` maps on every invocation. This can lead to increased memory usage and garbage collection pressure under high load.
💡 Suggestion: Use `WithLabelValues` instead of `With(prometheus.Labels{...})` to avoid map allocations for each metric update. This is a more performant way to work with labeled metrics in the Prometheus client library.
🔧 Suggested Fix
if pm.enablePerAPIMetrics {
    pm.requestsTotal.WithLabelValues(apiID, method, statusClass).Inc()
    pm.requestDuration.WithLabelValues(apiID, method).Observe(totalSeconds)
    if statusCode >= 400 {
        pm.requestErrors.WithLabelValues(apiID, method, statusClass).Inc()
    }
}

// Overall Gateway RED metrics (always recorded)
pm.gatewayRequestsTotal.WithLabelValues(method, statusClass).Inc()
pm.gatewayRequestDuration.WithLabelValues(method, statusClass).Observe(totalSeconds)
pm.gatewayRequestUpstreamLatency.WithLabelValues(method, statusClass).Observe(upstreamSeconds)
pm.gatewayRequestGatewayLatency.WithLabelValues(method, statusClass).Observe(gatewaySeconds)

🟡 Warning gateway/instrumentation_prometheus.go:423-426
The `RecordMiddlewareExecution` function allocates a `prometheus.Labels` map on every call. Since middleware execution is tracked, this function can be on a hot path, leading to unnecessary memory allocations and GC pressure.
💡 Suggestion: Use `WithLabelValues` instead of `With(prometheus.Labels{...})` to avoid map allocations. The label order for `WithLabelValues` should be `middlewareName`, `apiID`.
🔧 Suggested Fix
pm.middlewareExecTime.WithLabelValues(middlewareName, apiID).Observe(duration)
🟡 Warning gateway/instrumentation_prometheus.go:315-336
The CPU usage metric `cpu_usage_percent` is calculated based on garbage collection pause times (`m.PauseTotalNs`). This is not a true representation of the process's CPU usage and can be highly misleading. A CPU-intensive process with low GC activity would report near-zero CPU usage.
💡 Suggestion: Since `gopsutil` is already a dependency in the project, use it to get an accurate process CPU utilization. Alternatively, if that's not desired, rename the metric to `cpu_gc_pause_percent` to accurately reflect what it measures. A minimal sketch of the `gopsutil` approach follows this table.
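
As one possible shape for the `gopsutil`-based measurement, here is a hedged sketch assuming the v3 `process` package; the `pm.cpuUsagePercent` gauge and the surrounding update function are illustrative names:

// Sketch only: pm.cpuUsagePercent is an assumed gauge field.
// Uses github.com/shirou/gopsutil/v3/process; the Process handle would normally be cached.
proc, err := process.NewProcess(int32(os.Getpid()))
if err == nil {
    if pct, cpuErr := proc.CPUPercent(); cpuErr == nil {
        pm.cpuUsagePercent.Set(pct) // whole-process CPU %, not GC pause time
    }
}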

Quality Issues (3)

Severity Location Issue
🔴 Critical gateway/instrumentation_prometheus.go:1
The new Prometheus instrumentation feature, which includes several new files and modifications to existing ones, is being introduced without any automated tests. The implementation plan explicitly mentions the need for unit and integration tests (e.g., in `gateway/instrumentation_prometheus_test.go`), but these are absent from the pull request. This lack of testing introduces a significant risk of undetected bugs, panics (especially with metric label handling), and future regressions, making the feature difficult to maintain or refactor safely.
💡 Suggestion: Add comprehensive unit tests for the new Prometheus functionality. Tests should cover metric registration, correct label application, the periodic update logic in `UpdateSystemMetrics`, `UpdateGatewayMetrics`, and `UpdateRedisMetrics`, and the behavior of the `PrometheusSink`. Integration tests should also be added to verify that the metrics endpoint is correctly exposed and serves valid metrics. A sketch of one such unit test follows this table.
🟠 Error gateway/instrumentation_prometheus.go:332-358
The metric `tyk_gateway_system_cpu_usage_percent` is calculated using garbage collection pause durations (`runtime.MemStats.PauseTotalNs`). This does not represent the overall CPU usage of the Tyk process and is therefore highly misleading. Users will likely misinterpret this metric for total CPU load, leading to incorrect monitoring dashboards and alerting rules. The optional process collector, which can be enabled via configuration, already provides the standard and accurate `process_cpu_seconds_total` metric.
💡 Suggestion: Either remove this custom CPU metric entirely and advise users to enable the standard process collector for accurate CPU monitoring, or rename the metric to accurately reflect what it measures, such as `tyk_gateway_system_gc_cpu_impact_percent`.
🟠 Error gateway/instrumentation_handlers.go:70
The use of `log.WithError(err).Fatal()` in the `startPrometheusServer` function will cause an immediate and non-graceful termination of the entire gateway process if the Prometheus server fails to start (e.g., due to a port conflict). This bypasses the gateway's graceful shutdown procedures for other components, which could lead to lost analytics data, inconsistent state, or other side effects.
💡 Suggestion: Refactor `startPrometheusServer` and `setupPrometheusInstrumentation` to return an error on failure instead of calling `log.Fatal`. The calling function (`initSystem` in `gateway/server.go`) should then handle this error by initiating a proper graceful shutdown of the entire gateway.
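
As one example of what such a unit test could look like, here is a hedged sketch using `prometheus/client_golang`'s `testutil` helpers; the constructor, the `RecordRequest` signature, and the `gatewayRequestsTotal` field are assumptions inferred from this review, not the PR's actual code:

// Sketch only: newPrometheusMetrics, RecordRequest's signature, and gatewayRequestsTotal
// are assumed names inferred from this review, not verified against the PR.
package gateway

import (
    "testing"

    "github.com/prometheus/client_golang/prometheus/testutil"
)

func TestRecordRequestIncrementsGatewayCounter(t *testing.T) {
    pm := newPrometheusMetrics() // assumed constructor
    pm.RecordRequest("api-1", "GET", 200, 0.05, 0.03, 0.02)

    got := testutil.ToFloat64(pm.gatewayRequestsTotal.WithLabelValues("GET", "2xx"))
    if got != 1 {
        t.Fatalf("expected gateway request counter to be 1, got %v", got)
    }
}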

✅ Dependency Check Passed

No dependency issues found – changes LGTM.

✅ Connectivity Check Passed

No connectivity issues found – changes LGTM.


Powered by Visor from Probelabs

Last updated: 2025-11-13T17:23:25.636Z | Triggered by: opened | Commit: f0e00dd

💡 TIP: You can chat with Visor using /visor ask <your question>
