
Fixes #XXXXX - Add cache logging, health endpoint, and concurrency limiter to registration module #1

Draft
pablomh wants to merge 6 commits into develop from registration-observability

Conversation


@pablomh pablomh commented Apr 4, 2026

Summary

Builds on smart-proxy#935 (in-memory registration script cache) with:

  • Cache HIT/MISS logging — read_registration_cache logs registration_script cache=HIT age=Xs or cache=MISS, with source=shared|local when Redis is in use
  • GET /register/health endpoint — returns 200 OK if the capsule can reach Foreman's /api/status, 503 if not; used by HAProxy health checks to route registration traffic away from degraded capsules
  • with_concurrency_limit(&block) — block-based abstraction that wraps the POST / handler; owns semaphore acquire/release and returns 503 Retry-After: 30 when the limit is exhausted. The handler expresses what to do; the helper owns how to limit concurrency.
  • Shared Redis/Valkey cache — optional :cache_url setting (redis://host:6379/0); when set, all capsule nodes in an LB pool share one cache so a single warm request benefits all nodes. Falls back to in-memory cache if Redis is unreachable.
  • :max_concurrent_registrations setting — configurable via registration.yml; default unlimited (backward compatible)

Depends on

  • smart-proxy#935 — in-memory registration script cache (base for this PR)

Design notes

with_concurrency_limit(&block) — top-down design

The POST / handler now reads as a statement of intent:

post '/' do
  with_concurrency_limit do
    resp = Proxy::Registration::ProxyRequest.new.host_register(request)
    handle_response(resp)
  end
end

The mechanical concern (semaphore acquire/release, 503 response, Retry-After header) lives entirely in with_concurrency_limit. The handler has no knowledge of how concurrency limiting works.
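
For reference, a minimal sketch of that helper, assuming a class-level Concurrent::Semaphore exposed as registration_semaphore (names here are illustrative, not the final API):

def with_concurrency_limit
  semaphore = self.class.registration_semaphore   # nil when no limit is configured
  return yield if semaphore.nil?

  if semaphore.try_acquire
    begin
      yield
    ensure
      semaphore.release
    end
  else
    halt 503, { 'Retry-After' => '30' }, 'Too many concurrent registrations, retry later'
  end
end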

Shared cache architecture

One Redis/Valkey instance on the LB host, shared by all capsule nodes in the pool. Each capsule connects over the private network. If Redis is unreachable, the capsule falls back to its in-memory cache — registration continues working with per-node warm-up.

# /etc/foreman-proxy/settings.d/registration.yml
:cache_url: redis://lb-host-private-ip:6379/0
:max_concurrent_registrations: 50

Companion PRs

Repo          PR                     What
smart-proxy   #935                   In-memory cache (base)
smart-proxy   this PR                HIT/MISS logging, health endpoint, concurrency limiter, shared cache
satperf       registration-metrics   registration_cache Ansible role + HAProxy health check config

Test plan

  • ruby test/registration/registration_api_test.rb
  • Verify cache=HIT / cache=MISS lines appear in proxy log
  • GET /register/health returns 200 when Foreman is reachable, 503 when not
  • POST / returns 503 with Retry-After: 30 when semaphore is exhausted
  • With :cache_url set, verify Redis keyspace_hits increments on cache hits
  • Without :cache_url, verify fallback to in-memory cache works

🤖 Generated with Claude Code

pablomh and others added 6 commits April 2, 2026 13:35
GET /register returns the same shell script for all hosts sharing the
same registration parameters (org, location, hostgroup, activation keys).
Under bulk registration — 100+ hosts hitting the same capsule simultaneously
— this endpoint is called once per host, each time proxying to Foreman and
waiting for the ERB template to render (~103ms in profiling).

Cache the rendered script in memory (5-minute TTL) keyed on a canonical
form of the request query string. The cache key is computed by parsing the
query string, sorting parameters alphabetically, and rebuilding — so
requests that differ only in parameter order share the same cache entry
and the same Foreman response, regardless of how the client ordered them.
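
Roughly, the canonical key can be built with Rack::Utils (sketch only; the helper name is made up):

  require 'rack/utils'

  def cache_key_for(query_string)
    params = Rack::Utils.parse_query(query_string)   # "b=2&a=1" => {"b"=>"2", "a"=>"1"}
    Rack::Utils.build_query(params.sort.to_h)        # => "a=1&b=2"
  end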

The implementation uses three clearly separated layers:

  get '/'             — handles errors only; delegates to registration_script
  registration_script — owns the cache key and the business logic (what
                        to cache and what to do on a miss); raises
                        ScriptFetchError on non-200 so errors never reach
                        the cache write
  cache(key, &block)  — owns the mechanism: per-key double-checked locking
                        via a block abstraction that keeps the caller free
                        of locking concerns

Per-key locking allows concurrent requests for genuinely different keys
(e.g. different activation keys) to fetch from Foreman in parallel, while
serialising only threads competing for the same key. The per-key Mutex is
evicted from KEY_MUTEXES immediately after caching — once a key is hot,
all future requests take the lock-free fast path and KEY_MUTEXES is empty
under steady state.

Only HTTP 200 responses are cached. Non-200 responses raise ScriptFetchError
out of the cache block, which is rescued in get '/' and rendered via
handle_response without poisoning the cache.

Both KEY_MUTEXES and SCRIPT_CACHE use Concurrent::Map (already a
smart-proxy dependency) for lock-free, thread-safe access on all Ruby
VMs without relying on MRI's GIL.
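
Condensed sketch of the mechanism (fresh? is shorthand for the TTL check; the real code may structure this differently):

  SCRIPT_CACHE = Concurrent::Map.new   # key => [script, cached_at]
  KEY_MUTEXES  = Concurrent::Map.new   # key => Mutex, only while a fetch is in flight
  CACHE_TTL    = 300                   # seconds

  def cache(key)
    entry = SCRIPT_CACHE[key]
    return entry.first if fresh?(entry)                       # lock-free fast path

    mutex = KEY_MUTEXES.compute_if_absent(key) { Mutex.new }
    mutex.synchronize do
      entry = SCRIPT_CACHE[key]                               # double-check under the per-key lock
      return entry.first if fresh?(entry)

      script = yield                                          # raises ScriptFetchError on non-200
      SCRIPT_CACHE[key] = [script, Time.now]
      KEY_MUTEXES.delete(key)                                 # evict: later requests take the fast path
      script
    end
  end

  def fresh?(entry)
    entry && Time.now - entry.last < CACHE_TTL
  end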

Tests added:
- Cache hit: Foreman called once for repeated identical requests
- Per-key isolation: different parameter sets cached independently
- Parameter order independence: requests differing only in param order
  share the same cache entry
- TTL expiry: expired entries are not served; Foreman is re-called
- Mutex eviction: KEY_MUTEXES is empty after a successful cache write
- Error non-caching: Foreman is called on every request when it errors
- setup clears both SCRIPT_CACHE and KEY_MUTEXES between tests
Makes the script cache added in #39208 observable at debug log level:

  registration_script cache=HIT age=42s key_prefix=org_id=1&location_id=...
  registration_script cache=MISS key_prefix=org_id=1&location_id=...

The key is truncated to 40 characters to avoid flooding the log while
still distinguishing between different parameter sets.  Log level is
debug so it is silent in production by default and available on demand
via the smart-proxy log level setting.
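
The lines above correspond roughly to (sketch; entry layout and names illustrative):

  entry = SCRIPT_CACHE[key]
  if entry
    logger.debug "registration_script cache=HIT age=#{(Time.now - entry.last).round}s key_prefix=#{key[0, 40]}"
  else
    logger.debug "registration_script cache=MISS key_prefix=#{key[0, 40]}"
  end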

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Load balancers that front multiple capsules currently have no way to
distinguish a capsule that is up but cannot reach Foreman (and will
therefore fail every registration) from a healthy capsule.

Adds GET /register/health that:
- Returns 200 {"status":"ok"} if the capsule can reach Foreman's
  /api/status endpoint (any HTTP response = reachable)
- Returns 503 {"status":"error",...} if the connection fails
- Any unexpected error also returns 503 so the LB removes the capsule

The check uses the existing ForemanRequest infrastructure already used
by the registration proxy, so no new configuration is needed.
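
Sketch of the handler, assuming the module is mounted under /register so the route is declared as get '/health'; foreman_reachable? stands in for the existing ForemanRequest call to /api/status:

  get '/health' do
    content_type :json
    if foreman_reachable?                            # any HTTP response counts as reachable
      {status: 'ok'}.to_json
    else
      status 503
      {status: 'error', message: 'Foreman API unreachable'}.to_json
    end
  rescue StandardError => e
    status 503                                       # unexpected errors also remove the capsule
    {status: 'error', message: e.message}.to_json
  end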

LB configuration example (HAProxy):
  option httpchk GET /register/health
  http-check expect status 200

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…deployments

When multiple capsule nodes sit behind a load balancer, the existing
per-node in-memory cache (PR #39208) means each node must independently
fetch the registration script from Foreman on its first request —
N capsule nodes = N warming requests. Under a burst of concurrent
registrations, each node experiences its own thundering herd.

This adds an optional shared Redis cache so that one warm request on
any node benefits all nodes in the pool. Configuration:

  # config/settings.d/registration.yml
  :redis_url: redis://lb-host:6379/0

Cache lookup order:
  1. Redis (shared, populated by whichever node fetched first)
  2. Per-node in-memory cache (fast path for repeat requests to same node)
  3. Foreman (miss — fetches and populates both caches)

A Redis HIT from one node also warms that node's local in-memory cache,
so subsequent requests to the same node skip Redis entirely.

Failure handling: Redis errors (connection refused, timeout, LoadError)
are caught and logged at warn level; the in-memory cache takes over
transparently. This means a Redis outage does not break registration —
it only reverts to per-node caching.
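
Sketch of the read path (helper names illustrative); the local fast path is what lets a node skip Redis once it has seen one hit:

  def read_registration_cache(key)
    local = SCRIPT_CACHE[key]
    return local.first if fresh?(local)               # per-node in-memory fast path

    script = redis_client&.get(key)                   # shared cache on the LB host
    SCRIPT_CACHE[key] = [script, Time.now] if script  # warm the local cache too
    script                                            # nil => caller fetches from Foreman
  rescue StandardError, LoadError => e
    logger.warn "shared cache unavailable, falling back to in-memory: #{e.message}"
    nil
  end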

The 'redis' gem is added with require: false so it is only loaded when
:redis_url is configured. Without configuration, zero overhead.

Note: standalone capsules (single node) gain no benefit from Redis;
the existing in-memory cache already provides full hit rate after one
warm request. Redis is recommended only for LB deployments with 2+
capsule nodes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Both Redis (RHEL 9) and Valkey (RHEL 10+, the Redis fork that ships
in RHEL 10 due to Redis license change to SSPL) use the same redis://
wire protocol and are supported by the same Ruby redis gem. The setting
name should not imply a specific package.

Renames:
  :redis_url         → :cache_url       (plugin setting)
  redis_client()     → registration_cache_client()  (class method)
  cache=HIT source=redis → cache=HIT source=shared  (log message)

The redis:// URI scheme in the setting value is unchanged — it is the
wire protocol identifier accepted by both Redis and Valkey.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Under bulk registration (500+ hosts), all POST /register requests arrive
at the capsule simultaneously and are forwarded to Foreman without any
throttling, creating the same pile-up at POST /rhsm/consumers that the
Katello caching PRs are trying to reduce.

Adds an optional :max_concurrent_registrations setting that limits how
many host-register requests the capsule forwards to Foreman in parallel:

  # /etc/foreman-proxy/settings.d/registration.yml
  :max_concurrent_registrations: 50

When all permits are taken the capsule returns 503 + Retry-After: 30
immediately (no in-capsule wait queue). The 503 propagates to the
registration script, and the orchestration layer (Ansible retry_failed,
satperf wave batching) decides when to retry.

Uses Concurrent::Semaphore from the concurrent-ruby gem (already a
smart-proxy dependency). The semaphore is lazy-initialised at class
level and persists for the lifetime of the process so the permit count
is shared across all concurrent Sinatra handler threads.

Default: unset (unlimited) — backward-compatible with existing deploys.
Sizing guide: start at 50-80% of Foreman's Rails thread pool size, then
tune based on observed queue depth in the HAProxy stats page.
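
Sketch of the lazy initialisation (class and setting names illustrative):

  SEMAPHORE_LOCK = Mutex.new

  def self.registration_semaphore
    limit = Proxy::Registration::Plugin.settings.max_concurrent_registrations
    return nil unless limit                              # unset => unlimited, no semaphore

    SEMAPHORE_LOCK.synchronize do
      @semaphore ||= Concurrent::Semaphore.new(limit.to_i)  # shared by all handler threads
    end
  end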

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>