Skip to content

feat: add CoreDNS hosts plugin support for LocalDNS #8165

Open
saewoni wants to merge 48 commits intomainfrom
sakwa/localdns_poc_clean
Open

feat: add CoreDNS hosts plugin support for LocalDNS #8165
saewoni wants to merge 48 commits intomainfrom
sakwa/localdns_poc_clean

Conversation

@saewoni
Copy link
Copy Markdown
Contributor

@saewoni saewoni commented Mar 25, 2026

Prior review history: This PR continues the work from #7639, which contains review comments and testing notes. That PR could not be reopened after a branch force-push. Please refer to #7639 for prior context.

What this PR does / why we need it:

Adds CoreDNS hosts plugin support for LocalDNS. When enabled, critical AKS FQDNs (mcr.microsoft.com, packages.aks.azure.com, etc.) are resolved and cached in /etc/localdns/hosts by a systemd timer (aks-hosts-setup), then served authoritatively by the CoreDNS hosts plugin — eliminating upstream DNS lookups for these endpoints.

Key changes:

  • aks-hosts-setup.sh / .service / .timer — resolves AKS FQDNs and writes /etc/localdns/hosts
  • CoreDNS corefile template gains hosts /etc/localdns/hosts { fallthrough } block in both VnetDNS and KubeDNS listeners
  • Two corefile variants generated at provisioning time: LOCALDNS_COREFILE_BASE (vanilla) and LOCALDNS_COREFILE_FULL (with all optional plugins). localdns.sh dynamically selects the full variant once the hosts file is populated.
  • enableLocalDNS() is the single entry point for all localdns setup including hosts plugin
  • E2E: table-driven tests across 7 distros (legacy) and 5 distros (scriptless), validating dig AA flag + IP match against hosts file
  • VHD: packer configs updated to include aks-hosts-setup files on all Linux images

Which issue(s) this PR fixes:

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 25, 2026

The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedApr 8, 2026, 10:06 PM

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends AKS LocalDNS to optionally include a CoreDNS hosts plugin backed by a periodically refreshed /etc/localdns/hosts file, and wires the feature through VHD build, CSE/provisioning scripts, datamodel/config, and e2e/spec tests.

Changes:

  • Add aks-hosts-setup script + systemd service/timer, and copy them into Linux VHDs.
  • Add EnableHostsPlugin config plumbed through datamodel, CSE env, corefile generation (with/without hosts variants), and localdns runtime selection/annotation.
  • Add/extend ShellSpec + e2e coverage for hosts-plugin behavior and related cleanup utilities.

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
vhdbuilder/packer/vhd-image-builder-mariner.json Copies new aks-hosts-setup artifacts into Mariner VHD build staging.
vhdbuilder/packer/vhd-image-builder-mariner-cvm.json Copies new aks-hosts-setup artifacts into Mariner CVM VHD build staging.
vhdbuilder/packer/vhd-image-builder-mariner-arm64.json Copies new aks-hosts-setup artifacts into Mariner ARM64 VHD build staging.
vhdbuilder/packer/vhd-image-builder-flatcar.json Copies new aks-hosts-setup artifacts into Flatcar VHD build staging.
vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json Copies new aks-hosts-setup artifacts into Flatcar ARM64 VHD build staging.
vhdbuilder/packer/vhd-image-builder-cvm.json Copies new aks-hosts-setup artifacts into CVM VHD build staging.
vhdbuilder/packer/vhd-image-builder-base.json Copies new aks-hosts-setup artifacts into base (Ubuntu/AzureLinux) VHD build staging.
vhdbuilder/packer/vhd-image-builder-arm64-gen2.json Copies new aks-hosts-setup artifacts into ARM64 Gen2 VHD build staging.
vhdbuilder/packer/vhd-image-builder-acl.json Copies new aks-hosts-setup artifacts into ACL VHD build staging.
vhdbuilder/packer/vhd-image-builder-acl-arm64.json Copies new aks-hosts-setup artifacts into ACL ARM64 VHD build staging.
vhdbuilder/packer/packer_source.sh Installs aks-hosts-setup script + systemd units onto the VHD filesystem.
spec/shellspec.Dockerfile Adds dnsutils to enable nslookup/dig in ShellSpec container.
spec/parts/linux/cloud-init/artifacts/localdns_spec.sh Updates expectations for new corefile variant behavior and adds tests for node annotation helper.
spec/parts/linux/cloud-init/artifacts/cse_main_spec.sh New ShellSpec coverage for corefile selection logic sourced from localdns.sh.
spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh Extends tests for localdns backward-compat guards + env file variant writing + aks-hosts-setup enablement.
spec/parts/linux/cloud-init/artifacts/aks_hosts_setup_spec.sh New ShellSpec coverage for aks-hosts-setup DNS resolution, cloud selection, atomic writes, and error handling.
pkg/agent/datamodel/types_test.go Adds unit tests for ShouldEnableHostsPlugin; simplifies ShouldEnableLocalDNS test.
pkg/agent/datamodel/types.go Adds EnableHostsPlugin to LocalDNSProfile and ShouldEnableHostsPlugin() helper; extends corefile data.
pkg/agent/baker_test.go Updates LocalDNS corefile generation tests for new GenerateLocalDNSCoreFile signature and hosts/no-hosts variants.
pkg/agent/baker.go Plumbs hosts-plugin enablement + generates both corefile variants; template now conditionally emits hosts blocks.
parts/linux/cloud-init/artifacts/localdns.sh Adds dynamic corefile variant selection, best-effort node annotation, and startup regeneration behavior.
parts/linux/cloud-init/artifacts/cse_main.sh Enables aks-hosts-setup when hosts plugin enabled; starts localdns with the no-hosts variant.
parts/linux/cloud-init/artifacts/cse_config.sh Writes both corefile variants + SHOULD_ENABLE_HOSTS_PLUGIN to env file; adds enableAKSHostsSetup; adds backward-compat guards.
parts/linux/cloud-init/artifacts/cse_cmd.sh Exposes SHOULD_ENABLE_HOSTS_PLUGIN and LOCALDNS_GENERATED_COREFILE_NO_HOSTS env vars to provisioning.
parts/linux/cloud-init/artifacts/aks-hosts-setup.timer New timer to periodically refresh /etc/localdns/hosts.
parts/linux/cloud-init/artifacts/aks-hosts-setup.sh New script to resolve critical FQDNs and atomically write /etc/localdns/hosts.
parts/linux/cloud-init/artifacts/aks-hosts-setup.service New oneshot service to run the hosts population script (used by timer and manual refresh).
e2e/vmss.go Adds MockUnknownCloud injection to validate unsupported cloud handling for aks-hosts-setup.
e2e/validators.go Adds validators for aks-hosts-setup service/timer, hosts file content, and AA-flag/IP-match hosts-plugin behavior.
e2e/validation.go Runs new hosts-plugin validators when the feature is enabled.
e2e/types.go Adds scenario helpers to detect hosts-plugin enablement and choose default FQDNs per cloud.
e2e/scenario_localdns_hosts_test.go New end-to-end scenarios for hosts plugin (legacy + scriptless) across multiple distros.
e2e/cluster.go Adds cleanup of tagged Private DNS zones that can interfere with DNS-sensitive tests.
e2e/aks_model.go Tags created Private DNS zones; adds cleanup helpers for zones and VNET links.
aks-node-controller/proto/aksnodeconfig/v1/localdns_config.proto Adds enable_hosts_plugin to scriptless LocalDnsProfile.
aks-node-controller/pkg/gen/aksnodeconfig/v1/localdns_config.pb.go Regenerates protobuf bindings for enable_hosts_plugin.
aks-node-controller/parser/testdata/AKSUbuntu2204+LocalDNS/generatedCSECommand Adds parser golden output for LocalDNS (no hosts plugin).
aks-node-controller/parser/testdata/AKSUbuntu2204+LocalDNS+HostsPlugin/generatedCSECommand Adds parser golden output for LocalDNS with hosts plugin enabled.
aks-node-controller/parser/templates/localdns.toml.gtpl Adds conditional hosts plugin blocks and updates template data shape ($.Config, $.IncludeHostsPlugin).
aks-node-controller/parser/parser_test.go Adds parser tests covering hosts-plugin env var output.
aks-node-controller/parser/parser.go Adds SHOULD_ENABLE_HOSTS_PLUGIN + dual corefile env vars to generated CSE environment.
aks-node-controller/parser/helper_test.go Updates corefile base64 generation tests to new helper signature.
aks-node-controller/parser/helper.go Implements hosts-plugin enablement + dual corefile generation for scriptless path.
.pipelines/scripts/verify_shell.sh Marks aks-hosts-setup.sh as intentionally bash-only.

@saewoni saewoni changed the title Sakwa/localdns poc clean feat: add CoreDNS hosts plugin support for LocalDNS Mar 25, 2026
Copilot AI review requested due to automatic review settings March 25, 2026 21:27
@Devinwong
Copy link
Copy Markdown
Collaborator

Could you add descriptions in the PR What this PR does / why we need it: for reviewers to understand what issues you are trying to solve?

kwaksaewon and others added 29 commits April 9, 2026 23:56
printf '%s\n' is POSIX-portable and won't interpret escape sequences,
unlike echo which is shell-implementation-dependent.
- Fix comment in parser/helper.go: selection happens in localdns.sh not cse_main.sh
- Use LOCALDNS_HOSTS_FILE override in select_localdns_corefile() for testability
- Rewrite cse_main_spec.sh to use temp dir instead of /etc/localdns/hosts
- Add actual permissions check to cloud-env file test in cse_config_spec.sh
- Replace brittle tail -n +10 with sed in aks_hosts_setup_spec.sh
On first boot, localdns and aks-hosts-setup start concurrently.
localdns often wins the race and selects the base corefile (without
hosts plugin) because the hosts file isn't populated yet. Restarting
localdns after the validator confirms the hosts file is ready lets it
regenerate its corefile with the hosts plugin variant.
The aks-hosts-setup.sh script previously hardcoded cloud→FQDN mappings
in a case statement covering only 3 clouds (Public, China, USGov).
This meant 5+ sovereign clouds were unsupported and every new cloud
required a VHD release.

Have the RP pass LOCALDNS_CRITICAL_FQDNS (comma-separated) through
the CSE env pipeline. The script becomes a simple resolver loop with
no cloud-specific logic. If the env var is empty (old RP), the script
exits gracefully and the corefile falls back to the no-hosts variant.

Changes:
- Add CriticalFQDNs field to LocalDNSProfile (types.go, proto)
- Add GetLocalDNSCriticalFQDNs template func (baker.go)
- Add LOCALDNS_CRITICAL_FQDNS to CSE env (cse_cmd.sh, parser.go)
- Replace TARGET_CLOUD case statement in enableAKSHostsSetup with
  LOCALDNS_CRITICAL_FQDNS check (cse_config.sh)
- Replace cloud→FQDN case statement in aks-hosts-setup.sh with
  comma-separated FQDN list parsing
- Update all shell spec and Go unit tests
…t clarity

- Remove Flatcar from hosts plugin e2e tests and VHD packer configs since
  Flatcar doesn't support aks-hosts-setup artifacts
- Use Runtime.Cluster.Model.Location instead of NBC-specific fields in
  GetDefaultFQDNsForValidation/GetContainerRegistryFQDN so both legacy
  and scriptless bootstrap paths work
- Add localdns and aks-hosts-setup journal log collection to aks-log-collector.sh
- Improve comments: IPv6 filter explanation, timer refresh rationale,
  logging/journald documentation in aks-hosts-setup.sh
- Switch spec from fragile tail -n +10 to sed-based script preparation
…P FQDNs

The RP passes CriticalFQDNs to the NBC/AKSNodeConfig, which flows into
LOCALDNS_CRITICAL_FQDNS for aks-hosts-setup.sh. Without this, the e2e
hosts plugin tests had empty FQDNs and aks-hosts-setup.sh would exit
without populating /etc/localdns/hosts.

Adds the public-cloud FQDNs (mcr.microsoft.com, login.microsoftonline.com,
acs-mirror.azureedge.net) to all three base LocalDNSProfile configs:
- nbcToAKSNodeConfigV1 (scriptless path)
- baseTemplateLinux AgentPoolProfile (legacy path)
- baseTemplateLinux NBC root (legacy path)
ValidateLocalDNSHostsPluginBypass was hardcoding packages.microsoft.com
as the test FQDN, but that FQDN is not in the NBC's CriticalFQDNs list
so it never appears in /etc/localdns/hosts. Use the first entry from
GetDefaultFQDNsForValidation() (mcr.microsoft.com) instead.

ACL subtests were missing the VMConfigMutator to enable TrustedLaunch,
causing ARM to reject the VMSS creation with a 400 BadRequest.

Also simplify GetDefaultFQDNsForValidation to only return public cloud
FQDNs since AgentBaker e2e only runs in public cloud.
Add tests that verify the hosts plugin can be activated on an
already-running VM without reprovisioning — the brownfield scenario
that occurs during feature flag rollout or cluster upgrade.

Two test functions cover both bootstrap paths:
- Test_LocalDNSHostsPlugin_Brownfield (legacy bash CSE)
- Test_LocalDNSHostsPlugin_Brownfield_Scriptless (aks-node-controller)

Each test follows a three-phase flow:
1. Baseline: VM boots with hosts plugin off, validateNoHostsPlugin
   confirms SHOULD_ENABLE_HOSTS_PLUGIN=false, no cloud-env, no hosts
   directive in active corefile
2. SSH mutation: enableHostsPluginOnRunningVM patches environment file,
   creates cloud-env, starts aks-hosts-setup timer/service
3. Validation: hosts file populated, service healthy, localdns restarted,
   AA flag proves authoritative response from hosts plugin

New helpers in validators.go:
- validateNoHostsPlugin: asserts hosts plugin is not active
- enableHostsPluginOnRunningVM: generates experimental corefile in Go,
  injects it via SSH along with environment/cloud-env changes
Add Test_LocalDNSHostsPlugin_Rollback and _Rollback_Scriptless tests
that verify disabling the hosts plugin on a running VM without
reprovisioning. This covers the production rollback scenario where
operators need to turn off the hosts plugin mid-lifecycle.

Tests boot with EnableHostsPlugin=true (Phase 1 validates it works via
ValidateCommonLinux), then disable via SSH (Phase 2), then comprehensively
validate the disable took effect (Phase 3: environment file, removed
state files, inactive timer, base corefile active, AA flag absent, DNS
still resolves through localdns).
… consolidation

Spec: docs/superpowers/specs/2026-03-31-hosts-plugin-disable-and-cloud-env-consolidation-design.md
Plan: docs/superpowers/plans/2026-03-31-hosts-plugin-disable-and-cloud-env-consolidation.md
Move LOCALDNS_CRITICAL_FQDNS from a separate /etc/localdns/cloud-env file
into the existing /etc/localdns/environment file. This eliminates a
confusing extra file — all localdns state now lives in one place.

- generateLocalDNSFiles() now writes LOCALDNS_CRITICAL_FQDNS to environment
- enableAKSHostsSetup() no longer creates cloud-env
- aks-hosts-setup.service reads from /etc/localdns/environment
When AKS-RP re-runs CSE with EnableHostsPlugin=false on a node that
previously had it enabled, the else branch now actively cleans up:
stops aks-hosts-setup timer and removes /etc/localdns/hosts.

Also fixes call order bug: generateLocalDNSFiles() now runs before
enableAKSHostsSetup() so the environment file (with FQDNs) is written
before the timer starts.
Required for CSE re-run capability — allows updating the vmssCSE
extension on an existing VMSS to trigger a new CSE execution.
Regenerates CSE from a modified NBC and pushes it via Azure extension
update, triggering re-execution. Used by rollback tests to exercise
the production disable path in cse_config.sh.
Rollback tests now exercise the actual production code path:
RerunCSE pushes a new CSE with EnableHostsPlugin=false to the VMSS,
which triggers enableLocalDNS() -> disableAKSHostsSetup().
- DisableNeverEnabled: verifies else branch is idempotent on fresh nodes
- BackwardCompat_NewCSEOldServiceUnit: verifies graceful degradation
  when CriticalFQDNs are empty (simulates new-CSE/old-VHD scenario)
…oud-env refs

- Remove disableHostsPluginOnRunningVM (replaced by production code path)
- Update validateNoHostsPlugin: remove cloud-env existence check
- Update enableHostsPluginOnRunningVM: write FQDNs to environment file
- Update validateHostsPluginDisabled: remove LOCALDNS_COREFILE_EXPERIMENTAL
  and cloud-env checks (production RerunCSE path handles these differently)
This is the initial hosts plugin PR — no released VHD or CSE has
hosts plugin code yet. The BackwardCompat_NewCSEOldServiceUnit test
was testing a scenario (old VHD with cloud-env service file) that
can never happen because cloud-env was never shipped. Remove it.

Update the spec's backward compatibility section to clarify this is
forward-looking design rationale for future changes, not a current
concern.
RerunCSE is a no-op because cse_main.sh exits early when
provision.complete exists. Replace with ReimageVMSSInstance which
wipes the OS disk via BeginReimage API, so CSE runs from scratch
with SHOULD_ENABLE_HOSTS_PLUGIN=false — exercising the actual
production node image upgrade disable path.
Review fixes:
- Fix unescaped FQDN dots in grep -E patterns (validators.go)
- Update helper.go function comment to describe what it does vs what caller does
- Add includeHostsPlugin=false test case in helper_test.go
- Use upstream DNS from /etc/localdns/upstream-dns in aks-hosts-setup.sh
- Persist upstream DNS servers in localdns.sh replace_azurednsip_in_corefile()
- Add Case 2.5 legacy fallback in select_localdns_corefile()
- Add upstream DNS server selection tests in aks_hosts_setup_spec.sh

Removed (not ready for this PR):
- Brownfield/rollback test functions and supporting VMSS helpers
- Brownfield/rollback validators
- docs/superpowers planning docs for disable/cloud-env consolidation
- Extract normalizeCorefileString() and assertCorefileBase64Contains()
  helpers from Test_getLocalDNSCorefileBase64 to reduce cognitive
  complexity from 27 to under 20 (gocognit lint threshold).

- Update enableAKSHostsSetup shellspec tests: the function no longer
  writes a cloud-env file (FQDNs are persisted via generateLocalDNSFiles
  to /etc/localdns/environment instead). Tests now verify hosts file
  creation and timer enablement.
The AA flag alone is insufficient to prove the hosts plugin served the
response — a forwarded authoritative response also has AA but includes
the RA (recursion available) flag. Hosts plugin responses have
"flags: qr aa rd" without "ra". This check catches false positives
where a forwarded response passes the AA-only check.
…, replace EOL acs-mirror URL, bump TimeoutStartSec

- Rename aks-hosts-setup.{sh,service,timer} to aks-localdns-hosts-setup
  to clarify the service is DNS-related (djsly feedback)
- Replace acs-mirror.azureedge.net with packages.aks.azure.com in e2e
  CriticalFQDNs lists (EOL URL)
- Increase TimeoutStartSec from 60 to 180 to handle worst-case DNS
  resolution time under transient failures
- Add backward-compat comment on duplicate LOCALDNS_GENERATED_COREFILE
  getter in parser.go
- Add clarifying comments on base64-encoded env vars in cse_config.sh
- Add mkdir -p before writing /etc/localdns/upstream-dns in localdns.sh
- Fix misleading comment in validation.go about scriptless support
- Add permissions assertion to cse_config_spec.sh hosts file test
DNS flags (AA, RA) are unreliable signals for verifying whether a
response came from the CoreDNS hosts plugin — CoreDNS can set these
flags regardless of which plugin served the response. Rely on the
stronger checks already present: node annotation, corefile
verification, and resolved IPs matching /etc/localdns/hosts entries.
Cover gaps in upstream DNS server selection tests:
- empty file (falls back to system resolver)
- whitespace-only file (word-splits to zero servers, resolves nothing)
- newline-separated server IPs (tr normalisation)
- all upstream servers fail (loop exhausts, graceful exit)
- @server passed to AAAA queries (not just A)
- no @server when upstream-dns file is absent
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 45 out of 45 changed files in this pull request and generated 1 comment.

Comment on lines 1235 to +1249
return "", fmt.Errorf("failed generate corefile for localdns using template: %w", err)
}
return base64.StdEncoding.EncodeToString([]byte(output)), nil
},
"GetGeneratedLocalDNSCoreFileBase": func() (string, error) {
output, err := GenerateLocalDNSCoreFile(config, profile, false)
if err != nil {
return "", fmt.Errorf("failed generate base corefile for localdns using template: %w", err)
}
return base64.StdEncoding.EncodeToString([]byte(output)), nil
},
"GetGeneratedLocalDNSCoreFileExperimental": func() (string, error) {
output, err := GenerateLocalDNSCoreFile(config, profile, true)
if err != nil {
return "", fmt.Errorf("failed generate experimental corefile for localdns using template: %w", err)
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These error messages still say "using template", but GenerateLocalDNSCoreFile no longer takes a template argument and always uses the built-in localDNSCoreFileTemplateString. Updating the wording would make debugging clearer (e.g., "failed to generate localdns corefile" / "failed to generate base corefile").

Suggested change
return "", fmt.Errorf("failed generate corefile for localdns using template: %w", err)
}
return base64.StdEncoding.EncodeToString([]byte(output)), nil
},
"GetGeneratedLocalDNSCoreFileBase": func() (string, error) {
output, err := GenerateLocalDNSCoreFile(config, profile, false)
if err != nil {
return "", fmt.Errorf("failed generate base corefile for localdns using template: %w", err)
}
return base64.StdEncoding.EncodeToString([]byte(output)), nil
},
"GetGeneratedLocalDNSCoreFileExperimental": func() (string, error) {
output, err := GenerateLocalDNSCoreFile(config, profile, true)
if err != nil {
return "", fmt.Errorf("failed generate experimental corefile for localdns using template: %w", err)
return "", fmt.Errorf("failed to generate localdns corefile: %w", err)
}
return base64.StdEncoding.EncodeToString([]byte(output)), nil
},
"GetGeneratedLocalDNSCoreFileBase": func() (string, error) {
output, err := GenerateLocalDNSCoreFile(config, profile, false)
if err != nil {
return "", fmt.Errorf("failed to generate base localdns corefile: %w", err)
}
return base64.StdEncoding.EncodeToString([]byte(output)), nil
},
"GetGeneratedLocalDNSCoreFileExperimental": func() (string, error) {
output, err := GenerateLocalDNSCoreFile(config, profile, true)
if err != nil {
return "", fmt.Errorf("failed to generate experimental localdns corefile: %w", err)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants