Skip to content

fix: localdns readiness curl must bypass HTTP proxy on http-proxy clusters#8834

Open
yewmsft wants to merge 2 commits into
mainfrom
yew/localdns-httpproxy-readiness-proxy-bypass
Open

fix: localdns readiness curl must bypass HTTP proxy on http-proxy clusters#8834
yewmsft wants to merge 2 commits into
mainfrom
yew/localdns-httpproxy-readiness-proxy-bypass

Conversation

@yewmsft

@yewmsft yewmsft commented Jul 4, 2026

Copy link
Copy Markdown
Member

Incident repair item for ICM 51000001093225.

Summary

On AKS clusters configured with an HTTP proxy, enabling LocalDNS causes node bootstrap (CSE) to hang until timeout, failing node provisioning (create / scale / node-image upgrade / surge). Root cause is a shell-quoting bug in the LocalDNS readiness health check.

  • localdns.service inherits the system-wide proxy env (via systemd DefaultEnvironment on http-proxy clusters). The readiness probe curls the node-local listener 169.254.10.10:8181/ready, which must not be sent through the proxy — a link-local address is not routable by an external HTTP proxy.
  • CURL_COMMAND was a string invoked unquoted ($($CURL_COMMAND)). Word-splitting happens without quote-removal, so any attempt to pass --noproxy '*' reached curl as a literal 3-char argument '*', leaving the proxy bypass ineffective. The probe was proxied, never returned OK, and CSE blocked until timeout.
  • Reproduces regardless of outbound type (UDR or LoadBalancer) — the only conditions are "HTTP proxy configured" + "LocalDNS enabled".

Fix

  • Declare CURL_COMMAND as a bash array and expand as "${CURL_COMMAND[@]}" at both call sites (wait_for_localdns_ready, start_localdns_watchdog) so arguments are passed verbatim with correct quoting.
  • Add an explicit --noproxy "${LOCALDNS_NODE_LISTENER_IP}" so the node-local readiness request always bypasses the proxy, plus --connect-timeout/--max-time so a probe can't hang.
  • Update the wait_for_localdns_ready ShellSpec stubs to array form (CURL_COMMAND=(echo OK)), required by the new expansion.

Validation

  • Reproduced the failure and confirmed the fix end-to-end on http-proxy clusters (both UDR and LoadBalancer outbound): with the fix, the surge node bootstraps and joins with LocalDNS enabled — without adding link-local IPs to noProxy.
  • make generate-testdata — all pkg/agent/... tests pass, no generated-file diff (localdns.sh is not embedded in the CSE snapshot).
  • ShellSpec localdns_spec.sh: 85 examples, 0 failures.
  • bash -n syntax check on both files.

Test plan

  • ShellSpec localdns_spec.sh passes (85/0)
  • pkg/agent/... go tests pass, no snapshot drift
  • e2e: LocalDNS-enabled nodepool on an http-proxy cluster bootstraps successfully

🤖 Generated with Claude Code

…sters

On clusters configured with an HTTP proxy, the proxy environment is inherited
by localdns.service via systemd DefaultEnvironment. The LocalDNS readiness
health check curls the node-local listener (169.254.10.10:8181/ready), which
must not go through the proxy since a link-local address is not routable by an
external proxy.

CURL_COMMAND was a string invoked unquoted ($CURL_COMMAND), so any attempt to
pass --noproxy '*' underwent word-splitting without quote-removal and curl
received a literal 3-char arg '*', leaving the proxy bypass ineffective. The
readiness probe was then proxied, never returned OK, and node bootstrap (CSE)
blocked until timeout -- failing node provisioning whenever LocalDNS was enabled
behind an HTTP proxy. Reproduces regardless of outbound type (UDR or LB).

Fix: declare CURL_COMMAND as a bash array and expand it as "${CURL_COMMAND[@]}"
so arguments are passed verbatim with correct quoting, and add an explicit
--noproxy for the node-local listener IP plus connect/max timeouts. Update the
ShellSpec stubs to the array form accordingly.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings July 4, 2026 07:29
@yewmsft yewmsft requested a review from SriHarsha001 as a code owner July 4, 2026 07:29

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes LocalDNS readiness/health checks hanging on HTTP-proxy clusters by ensuring the curl probe to the node-local readiness endpoint bypasses proxy settings and that curl arguments are passed with correct shell quoting.

Changes:

  • Convert CURL_COMMAND from a string to a bash array and expand it as "${CURL_COMMAND[@]}" to preserve argument boundaries.
  • Add --noproxy "${LOCALDNS_NODE_LISTENER_IP}" plus curl timeouts to prevent proxying and avoid hangs during readiness checks.
  • Update ShellSpec stubs to set CURL_COMMAND as an array to match the new calling convention.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
parts/linux/cloud-init/artifacts/localdns.sh Switch curl invocation to array-based execution and force proxy bypass + timeouts for the link-local readiness probe.
spec/parts/linux/cloud-init/artifacts/localdns_spec.sh Update tests to stub CURL_COMMAND as an array to align with the new implementation.

Replace @SriHarsha001 with @saewoni as the localdns code owner across
localdns.sh, localdns.service, localdns-delegate.conf, and localdns_spec.sh.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@yewmsft

yewmsft commented Jul 4, 2026

Copy link
Copy Markdown
Member Author

/azp run AKS Linux VHD Build - PR check-in gate

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants