Add language detection to auto-monitor to inject only the relevant SDK by Miqueasher · Pull Request #380 · aws/amazon-cloudwatch-agent-operator

Miqueasher · 2026-05-05T20:05:28Z

Summary

When monitorAllServices: true is set, auto-monitor currently injects all four language SDK init containers (Java, Python, Node.js, .NET) into every pod regardless of its runtime. This causes liveness/readiness probe failures, restart loops, and deployment instability.

This PR adds a registry-based language detector that inspects the container image config (ENV, CMD, ENTRYPOINT) via google/go-containerregistry without pulling layers (~100-500ms, 5s timeout).
If the registry lookup fails or is inconclusive, detection falls back through image name patterns → pod-spec env vars → pod-spec commands → all languages (current behavior), so the worst case is identical to today's behavior and nothing regresses.
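To illustrate what inspecting ENV, CMD, and ENTRYPOINT could look like once the image config has been fetched, here is a minimal sketch. It is not the PR's code: the `imageConfig` type, the `classify` function, and the hint tables are hypothetical names and values chosen for illustration, and the sketch assumes ENV entries arrive in the usual `KEY=value` form.

```go
package main

import "strings"

// imageConfig is a hypothetical, trimmed-down view of the fields the
// detector reads from an image config. ENV entries use "KEY=value".
type imageConfig struct {
	Env        []string
	Entrypoint []string
	Cmd        []string
}

// envPrefixes is an illustrative hint table, not the PR's actual list.
var envPrefixes = []struct{ prefix, lang string }{
	{"JAVA_", "java"},
	{"PYTHON", "python"},
	{"NODE_", "nodejs"},
	{"DOTNET_", "dotnet"},
}

// classify returns a language name, or "" when the config is inconclusive
// and the caller should move on to the next detection layer.
func classify(cfg imageConfig) string {
	// First pass: look at the variable names in ENV.
	for _, kv := range cfg.Env {
		key, _, _ := strings.Cut(kv, "=")
		for _, h := range envPrefixes {
			if strings.HasPrefix(key, h.prefix) {
				return h.lang
			}
		}
	}
	// Second pass: ENTRYPOINT followed by CMD forms the effective command line.
	argv := append(append([]string{}, cfg.Entrypoint...), cfg.Cmd...)
	for _, arg := range argv {
		name := arg[strings.LastIndex(arg, "/")+1:] // strip any directory part
		switch {
		case name == "java":
			return "java"
		case strings.HasPrefix(name, "python"): // also matches python3
			return "python"
		case name == "node":
			return "nodejs"
		case name == "dotnet":
			return "dotnet"
		}
	}
	return ""
}
```

Returning an empty string for an inconclusive config keeps this step composable with the fallback chain: a base image such as Nginx simply yields no answer here.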

Detection Layers

Registry image config — fetch ENV/CMD/ENTRYPOINT from manifest (public registries + ECR with IRSA)
Image name patterns — match language keywords in image reference
Pod-spec env vars — check for JAVA_HOME, PYTHONPATH, NODE_ENV, DOTNET_ROOT, etc.
Pod-spec commands — match runtime binaries (java, python, node, dotnet)
Fallback — all configured languages
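The layering above can be sketched as an ordered list of detector functions where the first conclusive answer wins and the final fallback is every configured language. This is a hedged illustration, not the PR's implementation: `Container`, `layer`, `Detect`, and the keyword tables are hypothetical, and the registry layer is left out because it needs network access (it would be passed as the first `layer`).

```go
package main

import (
	"path"
	"strings"
)

// Language identifies an SDK that auto-monitor could inject.
type Language string

const (
	Java   Language = "java"
	Python Language = "python"
	NodeJS Language = "nodejs"
	DotNet Language = "dotnet"
)

// allLanguages is the final fallback, matching current behavior.
var allLanguages = []Language{Java, Python, NodeJS, DotNet}

// Container is a hypothetical, trimmed-down view of a pod-spec container.
type Container struct {
	Image   string
	EnvVars []string // variable names only
	Command []string
}

// layer returns a language, or "" when it cannot decide.
type layer func(c Container) Language

// imageKeywords is an illustrative table; order decides ties.
var imageKeywords = []struct {
	keyword string
	lang    Language
}{{"java", Java}, {"python", Python}, {"node", NodeJS}, {"dotnet", DotNet}}

// byImageName matches language keywords in the image reference.
func byImageName(c Container) Language {
	ref := strings.ToLower(c.Image)
	for _, k := range imageKeywords {
		if strings.Contains(ref, k.keyword) {
			return k.lang
		}
	}
	return ""
}

var envHints = map[string]Language{
	"JAVA_HOME": Java, "PYTHONPATH": Python, "NODE_ENV": NodeJS, "DOTNET_ROOT": DotNet,
}

// byEnvVar checks pod-spec env var names against known hints.
func byEnvVar(c Container) Language {
	for _, name := range c.EnvVars {
		if lang, ok := envHints[name]; ok {
			return lang
		}
	}
	return ""
}

// byCommand looks for a runtime binary anywhere in the command.
func byCommand(c Container) Language {
	for _, arg := range c.Command {
		switch bin := path.Base(arg); {
		case bin == "java":
			return Java
		case strings.HasPrefix(bin, "python"):
			return Python
		case bin == "node":
			return NodeJS
		case bin == "dotnet":
			return DotNet
		}
	}
	return ""
}

// Detect tries each layer in order and falls back to all languages.
func Detect(c Container, layers ...layer) []Language {
	for _, l := range layers {
		if lang := l(c); lang != "" {
			return []Language{lang}
		}
	}
	return allLanguages
}
```

Calling `Detect(c, byImageName, byEnvVar, byCommand)` on an Alpine container that sets JAVA_HOME would resolve at the env var layer, while a plain Nginx container would fall through to all four languages.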

Dependencies added

Test Plan

Unit tests: 92/92 passing (65 new + 27 existing, zero regressions)
Live EKS validation (us-east-1): Custom operator build deployed with monitorAllServices=true, no manual annotations
Java, Python, Node.js, .NET public images → 1 init container each (detected via registry config)
Alpine, Nginx, Busybox → 4 init containers (correct fallback, no false positives)
Private ECR image with language keyword → detected via image name pattern
Pod-spec JAVA_HOME on Alpine → detected via env var fallback
Multi-container (Python + Nginx sidecar) → correctly detected Python only
Failure mode: When all detection fails, behavior is identical to current production
Operator image digest confirmed matching between running pod and ECR (sha256:5f74a89f76b3...), zero pod restarts observed

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mxiamxia · 2026-05-12T23:08:41Z

+	desc, err := remote.Get(ref,
+		remote.WithAuthFromKeychain(d.keychain),
+		remote.WithContext(ctx),
+	)


I think we need to cache this for the same kind of pod. Otherwise, in a large cluster with thousands of pods, these registry calls will either be throttled or hurt performance.

Agreed — we'll add an image-level cache so that the same image reference is only looked up once from the registry. If 200 workloads use the same image, the registry call fires once and subsequent lookups return the cached result.
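One possible shape for such an image-level cache is sketched below. It is an assumption about how the follow-up might look, not code from this PR: `imageCache`, `newImageCache`, and `lookupFunc` are hypothetical names, and `lookupFunc` stands in for the registry call (`remote.Get` in the diff above).

```go
package main

import "sync"

// lookupFunc is a hypothetical stand-in for the registry-backed detection.
type lookupFunc func(imageRef string) (lang string, err error)

// imageCache memoizes detection results per image reference.
type imageCache struct {
	mu      sync.Mutex
	entries map[string]string
	lookup  lookupFunc
}

func newImageCache(lookup lookupFunc) *imageCache {
	return &imageCache{entries: make(map[string]string), lookup: lookup}
}

// Detect returns the cached language for imageRef, calling lookup only on
// the first request for that reference.
func (c *imageCache) Detect(imageRef string) (string, error) {
	// Holding the lock across the lookup keeps the sketch simple and
	// guarantees a single registry call per reference, at the cost of
	// serializing lookups for different images.
	c.mu.Lock()
	defer c.mu.Unlock()

	if lang, ok := c.entries[imageRef]; ok {
		return lang, nil
	}
	lang, err := c.lookup(imageRef)
	if err != nil {
		// Errors are not cached, so a transient registry failure is retried
		// on the next request.
		return "", err
	}
	c.entries[imageRef] = lang // an inconclusive "" result is cached too
	return lang, nil
}
```

A production version would probably also want an expiry or size bound, since mutable tags can point at a new image over time, and per-key locking so that lookups for unrelated images do not wait on each other.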

Add language detection to auto-monitor to inject only the relevant SDK

cd169e4