feat: add CoreDNS hosts plugin support for LocalDNS #8165
feat: add CoreDNS hosts plugin support for LocalDNS #8165
Conversation
|
The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).
|
There was a problem hiding this comment.
Pull request overview
This PR extends AKS LocalDNS to optionally include a CoreDNS hosts plugin backed by a periodically refreshed /etc/localdns/hosts file, and wires the feature through VHD build, CSE/provisioning scripts, datamodel/config, and e2e/spec tests.
Changes:
- Add
aks-hosts-setupscript + systemd service/timer, and copy them into Linux VHDs. - Add
EnableHostsPluginconfig plumbed through datamodel, CSE env, corefile generation (with/without hosts variants), and localdns runtime selection/annotation. - Add/extend ShellSpec + e2e coverage for hosts-plugin behavior and related cleanup utilities.
Reviewed changes
Copilot reviewed 44 out of 44 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| vhdbuilder/packer/vhd-image-builder-mariner.json | Copies new aks-hosts-setup artifacts into Mariner VHD build staging. |
| vhdbuilder/packer/vhd-image-builder-mariner-cvm.json | Copies new aks-hosts-setup artifacts into Mariner CVM VHD build staging. |
| vhdbuilder/packer/vhd-image-builder-mariner-arm64.json | Copies new aks-hosts-setup artifacts into Mariner ARM64 VHD build staging. |
| vhdbuilder/packer/vhd-image-builder-flatcar.json | Copies new aks-hosts-setup artifacts into Flatcar VHD build staging. |
| vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json | Copies new aks-hosts-setup artifacts into Flatcar ARM64 VHD build staging. |
| vhdbuilder/packer/vhd-image-builder-cvm.json | Copies new aks-hosts-setup artifacts into CVM VHD build staging. |
| vhdbuilder/packer/vhd-image-builder-base.json | Copies new aks-hosts-setup artifacts into base (Ubuntu/AzureLinux) VHD build staging. |
| vhdbuilder/packer/vhd-image-builder-arm64-gen2.json | Copies new aks-hosts-setup artifacts into ARM64 Gen2 VHD build staging. |
| vhdbuilder/packer/vhd-image-builder-acl.json | Copies new aks-hosts-setup artifacts into ACL VHD build staging. |
| vhdbuilder/packer/vhd-image-builder-acl-arm64.json | Copies new aks-hosts-setup artifacts into ACL ARM64 VHD build staging. |
| vhdbuilder/packer/packer_source.sh | Installs aks-hosts-setup script + systemd units onto the VHD filesystem. |
| spec/shellspec.Dockerfile | Adds dnsutils to enable nslookup/dig in ShellSpec container. |
| spec/parts/linux/cloud-init/artifacts/localdns_spec.sh | Updates expectations for new corefile variant behavior and adds tests for node annotation helper. |
| spec/parts/linux/cloud-init/artifacts/cse_main_spec.sh | New ShellSpec coverage for corefile selection logic sourced from localdns.sh. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Extends tests for localdns backward-compat guards + env file variant writing + aks-hosts-setup enablement. |
| spec/parts/linux/cloud-init/artifacts/aks_hosts_setup_spec.sh | New ShellSpec coverage for aks-hosts-setup DNS resolution, cloud selection, atomic writes, and error handling. |
| pkg/agent/datamodel/types_test.go | Adds unit tests for ShouldEnableHostsPlugin; simplifies ShouldEnableLocalDNS test. |
| pkg/agent/datamodel/types.go | Adds EnableHostsPlugin to LocalDNSProfile and ShouldEnableHostsPlugin() helper; extends corefile data. |
| pkg/agent/baker_test.go | Updates LocalDNS corefile generation tests for new GenerateLocalDNSCoreFile signature and hosts/no-hosts variants. |
| pkg/agent/baker.go | Plumbs hosts-plugin enablement + generates both corefile variants; template now conditionally emits hosts blocks. |
| parts/linux/cloud-init/artifacts/localdns.sh | Adds dynamic corefile variant selection, best-effort node annotation, and startup regeneration behavior. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Enables aks-hosts-setup when hosts plugin enabled; starts localdns with the no-hosts variant. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Writes both corefile variants + SHOULD_ENABLE_HOSTS_PLUGIN to env file; adds enableAKSHostsSetup; adds backward-compat guards. |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Exposes SHOULD_ENABLE_HOSTS_PLUGIN and LOCALDNS_GENERATED_COREFILE_NO_HOSTS env vars to provisioning. |
| parts/linux/cloud-init/artifacts/aks-hosts-setup.timer | New timer to periodically refresh /etc/localdns/hosts. |
| parts/linux/cloud-init/artifacts/aks-hosts-setup.sh | New script to resolve critical FQDNs and atomically write /etc/localdns/hosts. |
| parts/linux/cloud-init/artifacts/aks-hosts-setup.service | New oneshot service to run the hosts population script (used by timer and manual refresh). |
| e2e/vmss.go | Adds MockUnknownCloud injection to validate unsupported cloud handling for aks-hosts-setup. |
| e2e/validators.go | Adds validators for aks-hosts-setup service/timer, hosts file content, and AA-flag/IP-match hosts-plugin behavior. |
| e2e/validation.go | Runs new hosts-plugin validators when the feature is enabled. |
| e2e/types.go | Adds scenario helpers to detect hosts-plugin enablement and choose default FQDNs per cloud. |
| e2e/scenario_localdns_hosts_test.go | New end-to-end scenarios for hosts plugin (legacy + scriptless) across multiple distros. |
| e2e/cluster.go | Adds cleanup of tagged Private DNS zones that can interfere with DNS-sensitive tests. |
| e2e/aks_model.go | Tags created Private DNS zones; adds cleanup helpers for zones and VNET links. |
| aks-node-controller/proto/aksnodeconfig/v1/localdns_config.proto | Adds enable_hosts_plugin to scriptless LocalDnsProfile. |
| aks-node-controller/pkg/gen/aksnodeconfig/v1/localdns_config.pb.go | Regenerates protobuf bindings for enable_hosts_plugin. |
| aks-node-controller/parser/testdata/AKSUbuntu2204+LocalDNS/generatedCSECommand | Adds parser golden output for LocalDNS (no hosts plugin). |
| aks-node-controller/parser/testdata/AKSUbuntu2204+LocalDNS+HostsPlugin/generatedCSECommand | Adds parser golden output for LocalDNS with hosts plugin enabled. |
| aks-node-controller/parser/templates/localdns.toml.gtpl | Adds conditional hosts plugin blocks and updates template data shape ($.Config, $.IncludeHostsPlugin). |
| aks-node-controller/parser/parser_test.go | Adds parser tests covering hosts-plugin env var output. |
| aks-node-controller/parser/parser.go | Adds SHOULD_ENABLE_HOSTS_PLUGIN + dual corefile env vars to generated CSE environment. |
| aks-node-controller/parser/helper_test.go | Updates corefile base64 generation tests to new helper signature. |
| aks-node-controller/parser/helper.go | Implements hosts-plugin enablement + dual corefile generation for scriptless path. |
| .pipelines/scripts/verify_shell.sh | Marks aks-hosts-setup.sh as intentionally bash-only. |
|
Could you add descriptions in the PR |
printf '%s\n' is POSIX-portable and won't interpret escape sequences, unlike echo which is shell-implementation-dependent.
- Fix comment in parser/helper.go: selection happens in localdns.sh not cse_main.sh - Use LOCALDNS_HOSTS_FILE override in select_localdns_corefile() for testability - Rewrite cse_main_spec.sh to use temp dir instead of /etc/localdns/hosts - Add actual permissions check to cloud-env file test in cse_config_spec.sh - Replace brittle tail -n +10 with sed in aks_hosts_setup_spec.sh
On first boot, localdns and aks-hosts-setup start concurrently. localdns often wins the race and selects the base corefile (without hosts plugin) because the hosts file isn't populated yet. Restarting localdns after the validator confirms the hosts file is ready lets it regenerate its corefile with the hosts plugin variant.
The aks-hosts-setup.sh script previously hardcoded cloud→FQDN mappings in a case statement covering only 3 clouds (Public, China, USGov). This meant 5+ sovereign clouds were unsupported and every new cloud required a VHD release. Have the RP pass LOCALDNS_CRITICAL_FQDNS (comma-separated) through the CSE env pipeline. The script becomes a simple resolver loop with no cloud-specific logic. If the env var is empty (old RP), the script exits gracefully and the corefile falls back to the no-hosts variant. Changes: - Add CriticalFQDNs field to LocalDNSProfile (types.go, proto) - Add GetLocalDNSCriticalFQDNs template func (baker.go) - Add LOCALDNS_CRITICAL_FQDNS to CSE env (cse_cmd.sh, parser.go) - Replace TARGET_CLOUD case statement in enableAKSHostsSetup with LOCALDNS_CRITICAL_FQDNS check (cse_config.sh) - Replace cloud→FQDN case statement in aks-hosts-setup.sh with comma-separated FQDN list parsing - Update all shell spec and Go unit tests
…t clarity - Remove Flatcar from hosts plugin e2e tests and VHD packer configs since Flatcar doesn't support aks-hosts-setup artifacts - Use Runtime.Cluster.Model.Location instead of NBC-specific fields in GetDefaultFQDNsForValidation/GetContainerRegistryFQDN so both legacy and scriptless bootstrap paths work - Add localdns and aks-hosts-setup journal log collection to aks-log-collector.sh - Improve comments: IPv6 filter explanation, timer refresh rationale, logging/journald documentation in aks-hosts-setup.sh - Switch spec from fragile tail -n +10 to sed-based script preparation
…P FQDNs The RP passes CriticalFQDNs to the NBC/AKSNodeConfig, which flows into LOCALDNS_CRITICAL_FQDNS for aks-hosts-setup.sh. Without this, the e2e hosts plugin tests had empty FQDNs and aks-hosts-setup.sh would exit without populating /etc/localdns/hosts. Adds the public-cloud FQDNs (mcr.microsoft.com, login.microsoftonline.com, acs-mirror.azureedge.net) to all three base LocalDNSProfile configs: - nbcToAKSNodeConfigV1 (scriptless path) - baseTemplateLinux AgentPoolProfile (legacy path) - baseTemplateLinux NBC root (legacy path)
ValidateLocalDNSHostsPluginBypass was hardcoding packages.microsoft.com as the test FQDN, but that FQDN is not in the NBC's CriticalFQDNs list so it never appears in /etc/localdns/hosts. Use the first entry from GetDefaultFQDNsForValidation() (mcr.microsoft.com) instead. ACL subtests were missing the VMConfigMutator to enable TrustedLaunch, causing ARM to reject the VMSS creation with a 400 BadRequest. Also simplify GetDefaultFQDNsForValidation to only return public cloud FQDNs since AgentBaker e2e only runs in public cloud.
Add tests that verify the hosts plugin can be activated on an already-running VM without reprovisioning — the brownfield scenario that occurs during feature flag rollout or cluster upgrade. Two test functions cover both bootstrap paths: - Test_LocalDNSHostsPlugin_Brownfield (legacy bash CSE) - Test_LocalDNSHostsPlugin_Brownfield_Scriptless (aks-node-controller) Each test follows a three-phase flow: 1. Baseline: VM boots with hosts plugin off, validateNoHostsPlugin confirms SHOULD_ENABLE_HOSTS_PLUGIN=false, no cloud-env, no hosts directive in active corefile 2. SSH mutation: enableHostsPluginOnRunningVM patches environment file, creates cloud-env, starts aks-hosts-setup timer/service 3. Validation: hosts file populated, service healthy, localdns restarted, AA flag proves authoritative response from hosts plugin New helpers in validators.go: - validateNoHostsPlugin: asserts hosts plugin is not active - enableHostsPluginOnRunningVM: generates experimental corefile in Go, injects it via SSH along with environment/cloud-env changes
Add Test_LocalDNSHostsPlugin_Rollback and _Rollback_Scriptless tests that verify disabling the hosts plugin on a running VM without reprovisioning. This covers the production rollback scenario where operators need to turn off the hosts plugin mid-lifecycle. Tests boot with EnableHostsPlugin=true (Phase 1 validates it works via ValidateCommonLinux), then disable via SSH (Phase 2), then comprehensively validate the disable took effect (Phase 3: environment file, removed state files, inactive timer, base corefile active, AA flag absent, DNS still resolves through localdns).
… consolidation Spec: docs/superpowers/specs/2026-03-31-hosts-plugin-disable-and-cloud-env-consolidation-design.md Plan: docs/superpowers/plans/2026-03-31-hosts-plugin-disable-and-cloud-env-consolidation.md
Move LOCALDNS_CRITICAL_FQDNS from a separate /etc/localdns/cloud-env file into the existing /etc/localdns/environment file. This eliminates a confusing extra file — all localdns state now lives in one place. - generateLocalDNSFiles() now writes LOCALDNS_CRITICAL_FQDNS to environment - enableAKSHostsSetup() no longer creates cloud-env - aks-hosts-setup.service reads from /etc/localdns/environment
When AKS-RP re-runs CSE with EnableHostsPlugin=false on a node that previously had it enabled, the else branch now actively cleans up: stops aks-hosts-setup timer and removes /etc/localdns/hosts. Also fixes call order bug: generateLocalDNSFiles() now runs before enableAKSHostsSetup() so the environment file (with FQDNs) is written before the timer starts.
Required for CSE re-run capability — allows updating the vmssCSE extension on an existing VMSS to trigger a new CSE execution.
Regenerates CSE from a modified NBC and pushes it via Azure extension update, triggering re-execution. Used by rollback tests to exercise the production disable path in cse_config.sh.
Rollback tests now exercise the actual production code path: RerunCSE pushes a new CSE with EnableHostsPlugin=false to the VMSS, which triggers enableLocalDNS() -> disableAKSHostsSetup().
- DisableNeverEnabled: verifies else branch is idempotent on fresh nodes - BackwardCompat_NewCSEOldServiceUnit: verifies graceful degradation when CriticalFQDNs are empty (simulates new-CSE/old-VHD scenario)
…oud-env refs - Remove disableHostsPluginOnRunningVM (replaced by production code path) - Update validateNoHostsPlugin: remove cloud-env existence check - Update enableHostsPluginOnRunningVM: write FQDNs to environment file - Update validateHostsPluginDisabled: remove LOCALDNS_COREFILE_EXPERIMENTAL and cloud-env checks (production RerunCSE path handles these differently)
This is the initial hosts plugin PR — no released VHD or CSE has hosts plugin code yet. The BackwardCompat_NewCSEOldServiceUnit test was testing a scenario (old VHD with cloud-env service file) that can never happen because cloud-env was never shipped. Remove it. Update the spec's backward compatibility section to clarify this is forward-looking design rationale for future changes, not a current concern.
RerunCSE is a no-op because cse_main.sh exits early when provision.complete exists. Replace with ReimageVMSSInstance which wipes the OS disk via BeginReimage API, so CSE runs from scratch with SHOULD_ENABLE_HOSTS_PLUGIN=false — exercising the actual production node image upgrade disable path.
Review fixes: - Fix unescaped FQDN dots in grep -E patterns (validators.go) - Update helper.go function comment to describe what it does vs what caller does - Add includeHostsPlugin=false test case in helper_test.go - Use upstream DNS from /etc/localdns/upstream-dns in aks-hosts-setup.sh - Persist upstream DNS servers in localdns.sh replace_azurednsip_in_corefile() - Add Case 2.5 legacy fallback in select_localdns_corefile() - Add upstream DNS server selection tests in aks_hosts_setup_spec.sh Removed (not ready for this PR): - Brownfield/rollback test functions and supporting VMSS helpers - Brownfield/rollback validators - docs/superpowers planning docs for disable/cloud-env consolidation
- Extract normalizeCorefileString() and assertCorefileBase64Contains() helpers from Test_getLocalDNSCorefileBase64 to reduce cognitive complexity from 27 to under 20 (gocognit lint threshold). - Update enableAKSHostsSetup shellspec tests: the function no longer writes a cloud-env file (FQDNs are persisted via generateLocalDNSFiles to /etc/localdns/environment instead). Tests now verify hosts file creation and timer enablement.
The AA flag alone is insufficient to prove the hosts plugin served the response — a forwarded authoritative response also has AA but includes the RA (recursion available) flag. Hosts plugin responses have "flags: qr aa rd" without "ra". This check catches false positives where a forwarded response passes the AA-only check.
…, replace EOL acs-mirror URL, bump TimeoutStartSec
- Rename aks-hosts-setup.{sh,service,timer} to aks-localdns-hosts-setup
to clarify the service is DNS-related (djsly feedback)
- Replace acs-mirror.azureedge.net with packages.aks.azure.com in e2e
CriticalFQDNs lists (EOL URL)
- Increase TimeoutStartSec from 60 to 180 to handle worst-case DNS
resolution time under transient failures
- Add backward-compat comment on duplicate LOCALDNS_GENERATED_COREFILE
getter in parser.go
- Add clarifying comments on base64-encoded env vars in cse_config.sh
- Add mkdir -p before writing /etc/localdns/upstream-dns in localdns.sh
- Fix misleading comment in validation.go about scriptless support
- Add permissions assertion to cse_config_spec.sh hosts file test
DNS flags (AA, RA) are unreliable signals for verifying whether a response came from the CoreDNS hosts plugin — CoreDNS can set these flags regardless of which plugin served the response. Rely on the stronger checks already present: node annotation, corefile verification, and resolved IPs matching /etc/localdns/hosts entries.
Cover gaps in upstream DNS server selection tests: - empty file (falls back to system resolver) - whitespace-only file (word-splits to zero servers, resolves nothing) - newline-separated server IPs (tr normalisation) - all upstream servers fail (loop exhausts, graceful exit) - @server passed to AAAA queries (not just A) - no @server when upstream-dns file is absent
| return "", fmt.Errorf("failed generate corefile for localdns using template: %w", err) | ||
| } | ||
| return base64.StdEncoding.EncodeToString([]byte(output)), nil | ||
| }, | ||
| "GetGeneratedLocalDNSCoreFileBase": func() (string, error) { | ||
| output, err := GenerateLocalDNSCoreFile(config, profile, false) | ||
| if err != nil { | ||
| return "", fmt.Errorf("failed generate base corefile for localdns using template: %w", err) | ||
| } | ||
| return base64.StdEncoding.EncodeToString([]byte(output)), nil | ||
| }, | ||
| "GetGeneratedLocalDNSCoreFileExperimental": func() (string, error) { | ||
| output, err := GenerateLocalDNSCoreFile(config, profile, true) | ||
| if err != nil { | ||
| return "", fmt.Errorf("failed generate experimental corefile for localdns using template: %w", err) |
There was a problem hiding this comment.
These error messages still say "using template", but GenerateLocalDNSCoreFile no longer takes a template argument and always uses the built-in localDNSCoreFileTemplateString. Updating the wording would make debugging clearer (e.g., "failed to generate localdns corefile" / "failed to generate base corefile").
| return "", fmt.Errorf("failed generate corefile for localdns using template: %w", err) | |
| } | |
| return base64.StdEncoding.EncodeToString([]byte(output)), nil | |
| }, | |
| "GetGeneratedLocalDNSCoreFileBase": func() (string, error) { | |
| output, err := GenerateLocalDNSCoreFile(config, profile, false) | |
| if err != nil { | |
| return "", fmt.Errorf("failed generate base corefile for localdns using template: %w", err) | |
| } | |
| return base64.StdEncoding.EncodeToString([]byte(output)), nil | |
| }, | |
| "GetGeneratedLocalDNSCoreFileExperimental": func() (string, error) { | |
| output, err := GenerateLocalDNSCoreFile(config, profile, true) | |
| if err != nil { | |
| return "", fmt.Errorf("failed generate experimental corefile for localdns using template: %w", err) | |
| return "", fmt.Errorf("failed to generate localdns corefile: %w", err) | |
| } | |
| return base64.StdEncoding.EncodeToString([]byte(output)), nil | |
| }, | |
| "GetGeneratedLocalDNSCoreFileBase": func() (string, error) { | |
| output, err := GenerateLocalDNSCoreFile(config, profile, false) | |
| if err != nil { | |
| return "", fmt.Errorf("failed to generate base localdns corefile: %w", err) | |
| } | |
| return base64.StdEncoding.EncodeToString([]byte(output)), nil | |
| }, | |
| "GetGeneratedLocalDNSCoreFileExperimental": func() (string, error) { | |
| output, err := GenerateLocalDNSCoreFile(config, profile, true) | |
| if err != nil { | |
| return "", fmt.Errorf("failed to generate experimental localdns corefile: %w", err) |
What this PR does / why we need it:
Adds CoreDNS hosts plugin support for LocalDNS. When enabled, critical AKS FQDNs (mcr.microsoft.com, packages.aks.azure.com, etc.) are resolved and cached in
/etc/localdns/hostsby a systemd timer (aks-hosts-setup), then served authoritatively by the CoreDNS hosts plugin — eliminating upstream DNS lookups for these endpoints.Key changes:
aks-hosts-setup.sh/.service/.timer— resolves AKS FQDNs and writes/etc/localdns/hostshosts /etc/localdns/hosts { fallthrough }block in both VnetDNS and KubeDNS listenersLOCALDNS_COREFILE_BASE(vanilla) andLOCALDNS_COREFILE_FULL(with all optional plugins).localdns.shdynamically selects the full variant once the hosts file is populated.enableLocalDNS()is the single entry point for all localdns setup including hosts pluginWhich issue(s) this PR fixes: