feat: add windows-log-analysis Copilot skill (LLM sub-skills) #8214

Open
timmy-wright wants to merge 3 commits into main from timmy/windows-log-analysis-skill

@timmy-wright timmy-wright commented Apr 1, 2026

feat: add windows-log-analysis Copilot skill (LLM sub-skills)

Summary

Adds a Copilot CLI skill for diagnosing Windows AKS node issues from log bundles produced by collect-windows-logs.ps1. This skill uses LLM sub-skill markdown files that instruct AI agents how to analyze each log category.

It also adds a skill for saving markdown to disk, because my agent repeatedly ran into problems with that task.

Why LLM sub-skills instead of scripts?

  • Resilient to format changes — LLM reads raw files instead of brittle regex/column parsing
  • Discovers novel issues — not limited to hard-coded patterns
  • Parallel dispatch — each sub-skill runs as an independent sub-agent
  • Domain knowledge preserved — HCS error codes, HNS failure modes, CSE error codes, known bugs from GitHub issues all encoded as analyst instructions

Architecture

SKILL.md (orchestrator)
├── common-reference.md (encoding, thresholds, error codes, dispatch guidance)
└── sub-skills/
    ├── analyze-containers.md    # Pod restarts, crash-loops, readiness
    ├── analyze-termination.md   # Stuck Terminating pods, zombie HCS, Defender file locks
    ├── analyze-images.md        # Dangling images, mutable tags, GC failures, snapshot bloat
    ├── analyze-disk.md          # C: drive free space trends
    ├── analyze-hcs.md           # Host Compute Service: lifecycle tracking, error codes, vmcompute health
    ├── analyze-hns.md           # Host Network Service: endpoints, LBs, CNI, DNS, WFP/VFP
    ├── analyze-kubeproxy.md     # kube-proxy: HNS policy sync, DSR, port range conflicts, SNAT
    ├── analyze-kubelet.md       # Node conditions, lease renewal, clock skew, cert rotation
    ├── analyze-memory.md        # Physical RAM, pagefile, OOM, process memory
    ├── analyze-crashes.md       # WER reports, minidumps, BSODs, unexpected reboots
    ├── analyze-csi.md           # CSI proxy, SMB/Azure Files mounts, Azure Disk
    ├── analyze-gmsa.md          # gMSA/CCG authentication, Kerberos, credential specs
    ├── analyze-gpu.md           # nvidia-smi, DirectX device plugin, Xid errors
    ├── analyze-bootstrap.md     # CSE flow, WINDOWS_CSE_ERROR codes (0-83), bootstrap config
    ├── analyze-extensions.md    # Azure VM extension execution errors
    └── analyze-services.md      # Windows service health, node versions, OS info

What the skill detects

| Sub-Skill | Key Detections |
|-----------|----------------|
| containers | Crash-looping containers (≥10 restarts), pods not Ready |
| termination | Zombie HCS containers, orphaned shims, containerd reinstall without drain, Defender file lock interference |
| images | Dangling images from mutable tags, containerd GC failure (k8s#116020), snapshot accumulation |
| disk | C: drive free space with cross-snapshot trend analysis |
| hcs | vmcompute memory/handle leaks, HCS operation duration degradation, creation storms, 30+ error codes |
| hns | Endpoint leaks (IP exhaustion), LB count drops after HNS reset, stale LB rules, WFP filter accumulation |
| kubeproxy | DSR degraded policies, excluded port range conflicts with NodePort, stale LB rules, SNAT exhaustion |
| kubelet | NotReady/DiskPressure/MemoryPressure, lease renewal failures, clock skew, certificate rotation |
| memory | Physical RAM exhaustion, pagefile misconfiguration, per-process working set analysis |
| crashes | Application crashes (kubelet, containerd, shim), BSODs, WER correlation with service events |
| csi | CSI proxy crashes, stale SMB global mappings, credential rotation failures, named pipe version mismatches |
| gmsa | CCG plugin errors, Kerberos ticket failures, domain controller connectivity, credential spec validation |
| gpu | nvidia-smi parsing, Xid error classification, ECC memory errors, DirectX device plugin scheduling |
| bootstrap | CSE execution timeline, 83 WINDOWS_CSE_ERROR codes, bootstrap config validation, service startup ordering |
| extensions | VM extension exit codes with curl progress false-positive filtering |
| services | 12 critical AKS service health checks, service PID cross-reference, start type validation |

Orchestrator features

  • Symptom-based dispatch: common-reference.md includes a dispatch table so agents pick the right 3-5 sub-skills instead of running all 16
  • Synthesis decision tree: SKILL.md provides a full decision tree for combining findings across sub-skills
  • 17 root cause chains — maps symptoms → checks → root causes (e.g., disk pressure → images → mutable tags)
  • Timeline correlation — instructions for building cross-sub-skill event timelines from anchor events
  • Consistent structure — all sub-skills use identical sections: Purpose, Input Files, Analysis Steps, Findings Format, Known Patterns, Cross-References

Key research that informed the sub-skills

Files changed

  • .github/skills/windows-log-analysis/SKILL.md — orchestrator with decision tree and root cause chains
  • .github/skills/windows-log-analysis/sub-skills/*.md — 16 sub-skills + common reference (3,337 lines total)
  • .github/skills/windows-log-analysis/.gitignore

Copilot AI left a comment

Pull request overview

Adds a new GitHub Copilot skill under .github/skills/windows-log-analysis/ to help diagnose Windows AKS node issues from log bundles produced by staging/cse/windows/debug/collect-windows-logs.ps1, including an accompanying Python analyzer script.

Changes:

  • Introduces SKILL.md with a log-bundle reference guide and troubleshooting playbooks.
  • Adds analyze-windows-logs.py to scan multi-snapshot bundles, trend key metrics, and emit prioritized findings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| .github/skills/windows-log-analysis/SKILL.md | Skill definition and reference guide for interpreting collected Windows node logs |
| .github/skills/windows-log-analysis/analyze-windows-logs.py | Python 3 analyzer for automated triage of collected Windows log bundles |

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

@timmy-wright timmy-wright changed the title from "Add windows-log-analysis Copilot skill" to "feat: add windows-log-analysis Copilot skill" Apr 1, 2026
Copilot AI review requested due to automatic review settings April 1, 2026 01:04
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

@timmy-wright timmy-wright force-pushed the timmy/windows-log-analysis-skill branch from b63d0ec to b6818e8 Compare April 2, 2026 09:16
Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

| `kubelet.log` | UTF-8 | Kubelet stdout logs (if present) |
| `kubelet.err.log` | UTF-8 | Kubelet stderr logs (if present) |
| `<ts>-cri-containerd-pods.txt` | UTF-16-LE with BOM | `crictl pods` — cross-reference pod state |
| `*_services.csv` | UTF-8 | Service status timeline used for kubelet crash/restart and clock skew checks |

Copilot AI Apr 2, 2026

*_services.csv is exported by collect-windows-logs.ps1 via Export-Csv without -Encoding, which defaults to UTF-16LE on Windows PowerShell. Marking it as UTF-8 here will cause parsers to mis-decode the file; update the encoding (and ideally the pattern to <ts>_services.csv for consistency with other entries).

Suggested change
| `*_services.csv` | UTF-8 | Service status timeline used for kubelet crash/restart and clock skew checks |
| `<ts>_services.csv` | UTF-16-LE with BOM | Service status timeline used for kubelet crash/restart and clock skew checks |

Copilot uses AI. Check for mistakes.
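
Since the encoding of `<ts>_services.csv` depends on how the collector's PowerShell host writes it, a parser can check for a byte-order mark instead of hard-coding one encoding. The following is a minimal sketch of that idea, not code from this PR: the helper names `decode_log_bytes` and `read_services_csv` and the `Name`/`Status` columns in the usage note are hypothetical.

```python
import codecs
import csv


def decode_log_bytes(raw: bytes) -> str:
    """Decode a collected log file, trusting a BOM over any assumed encoding."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec consumes the BOM and picks the matching byte order.
        return raw.decode("utf-16")
    # No BOM: assume UTF-8 and replace undecodable bytes rather than failing.
    return raw.decode("utf-8", errors="replace")


def read_services_csv(raw: bytes) -> list:
    """Parse a services CSV into a list of row dicts (hypothetical helper)."""
    text = decode_log_bytes(raw)
    # Windows PowerShell's Export-Csv can emit a leading "#TYPE ..." line
    # unless -NoTypeInformation is passed; drop it if present.
    lines = [ln for ln in text.splitlines() if not ln.startswith("#TYPE")]
    return list(csv.DictReader(lines))
```

Calling `read_services_csv(path.read_bytes())` then yields one dict per service row, whichever of the two encodings the file turns out to use.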
Comment on lines +337 to +340
| `bootstrap-config` | analyze-bootstrap |
| `*-ccg-*.evtx` or CCG event logs | analyze-gmsa |
| `gmsa-*.log` or gMSA credential spec files | analyze-gmsa |
| `kubectl-describe-nodes.log` | analyze-gpu, analyze-kubelet |

Copilot AI Apr 2, 2026

File-dispatch mapping seems incomplete: kubectl-describe-nodes.log is consumed by analyze-kubelet.md (node conditions/taints/events) as well as GPU analysis, but the table only routes it to analyze-gpu. Add analyze-kubelet here to avoid skipping kubelet triage when this file is present.
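
One way to picture the file-dispatch table is as an ordered list of filename patterns, each mapped to one or more sub-skills. The sketch below is illustrative only: it covers just the four rows quoted above (with the kubelet route included), and `FILE_DISPATCH` and `sub_skills_for` are hypothetical names, not part of the skill.

```python
from fnmatch import fnmatchcase

# Only the rows quoted in this review thread; the real table is longer.
FILE_DISPATCH = [
    ("bootstrap-config", ["analyze-bootstrap"]),
    ("*-ccg-*.evtx", ["analyze-gmsa"]),
    ("gmsa-*.log", ["analyze-gmsa"]),
    ("kubectl-describe-nodes.log", ["analyze-gpu", "analyze-kubelet"]),
]


def sub_skills_for(filenames):
    """Return the sub-skills to dispatch, in first-seen order, without duplicates."""
    selected = []
    for name in filenames:
        for pattern, skills in FILE_DISPATCH:
            # Lower-case the name so matching behaves the same on every platform.
            if fnmatchcase(name.lower(), pattern):
                for skill in skills:
                    if skill not in selected:
                        selected.append(skill)
    return selected
```

Because a single file can map to several sub-skills, a missing entry in a row silently skips a whole analysis, which is the failure mode this comment describes.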

timmy-wright and others added 2 commits April 7, 2026 09:57
…e.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Comment on lines +245 to +249
| 2003 | Resource Exhaustion Detector | Resource exhaustion detected |
| 2004 | System | Low memory condition |
| 6008 | System | Unexpected shutdown (preceding crash/BSOD) |

**Note:** Windows Event IDs are **not globally unique** — the same numeric ID (e.g., `2003`) can appear under different providers with different meanings. When writing analysis logic, always key on **Event ID + Source** together.

Copilot AI Apr 6, 2026

The Windows Event ID table looks inconsistent with how collect-windows-logs.ps1 exports events: the collector comments that Event ID 2004 is used for resource exhaustion/low memory events, but this table lists Resource Exhaustion Detector as 2003 and assigns 2004 to the System provider. Please align this table with the actual provider+event IDs emitted in the collected bundle (or explicitly note the exact ProviderName for the 2004 events).

Suggested change
| 2003 | Resource Exhaustion Detector | Resource exhaustion detected |
| 2004 | System | Low memory condition |
| 6008 | System | Unexpected shutdown (preceding crash/BSOD) |
**Note:** Windows Event IDs are **not globally unique** — the same numeric ID (e.g., `2003`) can appear under different providers with different meanings. When writing analysis logic, always key on **Event ID + Source** together.
| 2004 | Microsoft-Windows-Resource-Exhaustion-Detector | Resource exhaustion detected / low memory condition |
| 6008 | System | Unexpected shutdown (preceding crash/BSOD) |
**Note:** Windows Event IDs are **not globally unique** — the same numeric ID can appear under different providers with different meanings. When writing analysis logic, always key on **Event ID + Source** together.
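
The "Event ID + Source" rule maps naturally onto a lookup keyed by a tuple rather than by the numeric ID alone. A minimal sketch, assuming the two rows from the suggested table are correct for the collected bundle; `KNOWN_EVENTS` and `describe_event` are hypothetical names.

```python
from typing import Optional

# Keys are (event_id, provider) pairs, never the bare numeric ID.
KNOWN_EVENTS = {
    (2004, "Microsoft-Windows-Resource-Exhaustion-Detector"):
        "Resource exhaustion detected / low memory condition",
    (6008, "System"):
        "Unexpected shutdown (preceding crash/BSOD)",
}


def describe_event(event_id: int, provider: str) -> Optional[str]:
    """Return the known meaning for this ID under this provider, else None."""
    return KNOWN_EVENTS.get((event_id, provider))
```

With this shape, an event with ID 2004 from some other provider returns `None` instead of being misreported as a low-memory condition.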

Comment on lines +24 to +72
**WINDOWS_CSE_ERROR codes** (from AgentBaker `windowscsehelper.ps1`):

| Code | Name | Meaning |
|------|------|---------|
| 0 | SUCCESS | CSE completed successfully |
| 1 | UNKNOWN | Unexpected error in catch block |
| 2 | DOWNLOAD_FILE_WITH_RETRY | File download failed after retries |
| 3 | INVOKE_EXECUTABLE | Executable invocation failed |
| 4 | FILE_NOT_EXIST | Required file missing |
| 5 | CHECK_API_SERVER_CONNECTIVITY | Cannot reach API server |
| 6 | PAUSE_IMAGE_NOT_EXIST | Pause container image missing |
| 7 | GET_SUBNET_PREFIX | Failed to get subnet prefix |
| 8 | GENERATE_TOKEN_FOR_ARM | ARM token generation failed |
| 9 | NETWORK_INTERFACES_NOT_EXIST | No network interfaces found |
| 10 | NETWORK_ADAPTER_NOT_EXIST | Network adapter missing |
| 11 | MANAGEMENT_IP_NOT_EXIST | Management IP not found |
| 12 | CALICO_SERVICE_ACCOUNT_NOT_EXIST | Calico SA missing |
| 13 | CONTAINERD_NOT_INSTALLED | containerd binary not found |
| 14 | CONTAINERD_NOT_RUNNING | containerd service not running |
| 15 | OPENSSH_NOT_INSTALLED | OpenSSH not installed |
| 16 | OPENSSH_FIREWALL_NOT_CONFIGURED | OpenSSH firewall rule missing |
| 17 | INVALID_PARAMETER_IN_AZURE_CONFIG | Bad azure.json parameter |
| 19 | GET_CA_CERTIFICATES | CA cert retrieval failed |
| 20 | DOWNLOAD_CA_CERTIFICATES | CA cert download failed |
| 21 | EMPTY_CA_CERTIFICATES | CA certs empty |
| 22 | ENABLE_SECURE_TLS | Secure TLS enablement failed |
| 23–28 | GMSA_* | gMSA setup failures |
| 29 | NOT_FOUND_MANAGEMENT_IP | Management IP lookup failed |
| 30 | NOT_FOUND_BUILD_NUMBER | Windows build number not found |
| 31 | NOT_FOUND_PROVISIONING_SCRIPTS | Provisioning scripts missing |
| 32 | START_NODE_RESET_SCRIPT_TASK | Node reset task failed to start |
| 33–40 | DOWNLOAD_*_PACKAGE | Package download failures (CSE, K8s, CNI, HNS, Calico, gMSA, CSI proxy, containerd) |
| 41 | SET_TCP_DYNAMIC_PORT_RANGE | TCP port range configuration failed |
| 43 | PULL_PAUSE_IMAGE | Pause image pull failed |
| 45 | CONTAINERD_BINARY_EXIST | containerd binary check failed |
| 46–48 | SET_*_PORT_RANGE | Port range exclusion failures |
| 49 | NO_CUSTOM_DATA_BIN | CustomData.bin missing (very early failure) |
| 50 | NO_CSE_RESULT_LOG | CSE did not produce result log |
| 52 | RESIZE_OS_DRIVE | OS drive resize failed |
| 53–61 | GPU_* | GPU driver installation failures |
| 62 | UPDATING_KUBE_CLUSTER_CONFIG | Kube cluster config update failed |
| 64 | GET_CONTAINERD_VERSION | containerd version detection failed |
| 65–67 | CREDENTIAL_PROVIDER_* | Credential provider install/config failures |
| 68 | ADJUST_PAGEFILE_SIZE | Pagefile resize failed |
| 70–71 | SECURE_TLS_BOOTSTRAP_* | Secure TLS bootstrap client failures |
| 72 | CILIUM_NETWORKING_INSTALL_FAILED | Cilium install failed |
| 73 | EXTRACT_ZIP | Zip extraction failed |
| 74–75 | LOAD/PARSE_METADATA | Metadata failures |
| 76–83 | ORAS_* | Network-isolated cluster artifact pull failures |

Copilot AI Apr 6, 2026

This section says the table is sourced from windowscsehelper.ps1 and later references a “full code table”, but the table omits several defined codes (e.g., 18, 42, 44, 51, 63, 69). To avoid misdiagnosis, either (a) include the missing codes/ranges, or (b) label this as a partial list of common codes and link readers to parts/windows/windowscsehelper.ps1 for the authoritative set.
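
If the table stays partial, a lookup built from it should say so explicitly when a code is not listed, rather than guessing. The sketch below uses only a subset of the rows quoted above; `CSE_CODES`, `CSE_RANGES`, and `lookup_cse_code` are hypothetical names, and the authoritative set remains `windowscsehelper.ps1`.

```python
from typing import Optional

# A subset of the single-code rows from the quoted table.
CSE_CODES = {
    0: "SUCCESS",
    1: "UNKNOWN",
    5: "CHECK_API_SERVER_CONNECTIVITY",
    13: "CONTAINERD_NOT_INSTALLED",
    14: "CONTAINERD_NOT_RUNNING",
    49: "NO_CUSTOM_DATA_BIN",
    50: "NO_CSE_RESULT_LOG",
}

# Range rows from the quoted table: (first, last, family name).
CSE_RANGES = [
    (23, 28, "GMSA_*"),
    (33, 40, "DOWNLOAD_*_PACKAGE"),
    (53, 61, "GPU_*"),
    (76, 83, "ORAS_*"),
]


def lookup_cse_code(code: int) -> Optional[str]:
    """Return the code or family name, or None when the table has no entry."""
    if code in CSE_CODES:
        return CSE_CODES[code]
    for first, last, family in CSE_RANGES:
        if first <= code <= last:
            return family
    # Unlisted (e.g. 18, 42, 44): defer to windowscsehelper.ps1 instead of guessing.
    return None
```

Returning `None` for gaps such as 18 or 42 lets the caller report "code not in the reference table" and point at the source script.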

Comment on lines +15 to +19
Writing markdown to files using bash HEREDOC (`cat << 'EOF'`) breaks when content contains:
- Backticks (`` ` ``) — interpreted as command substitution even in some HEREDOC forms
- `$variable` — interpreted as shell expansion in unquoted HEREDOCs
- The HEREDOC delimiter appearing in the content itself
- Nested quotes and backslashes causing silent corruption

Copilot AI Apr 6, 2026

The first bullet claims backticks are interpreted as command substitution when using a quoted heredoc example (cat << 'EOF'), but in bash a single-quoted heredoc delimiter prevents command substitution/expansion. Consider rewording this list to distinguish unquoted vs quoted heredocs (the real pitfalls are unquoted delimiters, delimiter collisions, and tool-specific filtering), otherwise readers may avoid a safe approach unnecessarily.

Suggested change
Writing markdown to files using bash HEREDOC (`cat << 'EOF'`) breaks when content contains:
- Backticks (`` ` ``) — interpreted as command substitution even in some HEREDOC forms
- `$variable` — interpreted as shell expansion in unquoted HEREDOCs
- The HEREDOC delimiter appearing in the content itself
- Nested quotes and backslashes causing silent corruption
Writing markdown to files via shell can break depending on how the HEREDOC is written:
- In unquoted HEREDOCs, backticks (`` ` ``) and `$variable` can be interpreted by the shell
- The HEREDOC delimiter can appear in the content itself and terminate the document early
- Nested quotes, backslashes, or tool-specific filtering can still cause silent corruption
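
When a shell is not required at all, the pitfalls listed above can be avoided by writing the bytes directly, so nothing in the content is ever parsed as shell syntax. A minimal sketch, assuming the agent can run Python; `save_markdown` is a hypothetical helper, not the skill added in this PR.

```python
from pathlib import Path


def save_markdown(path, content: str) -> Path:
    """Write markdown verbatim: no shell expansion, no delimiter to collide with."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # newline="" disables newline translation so line endings are preserved as given.
    with open(target, "w", encoding="utf-8", newline="") as handle:
        handle.write(content)
    return target
```

Backticks, `$variable`, a literal `EOF` line, and backslashes all pass through unchanged because the content is treated as data rather than as part of a command.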

Comment on lines +14 to +16
**Zip structure**: `Extension-Logs-<ts>/<plugin-name>/<version>/CommandExecution*.log`

Both zips and extracted directories should be scanned. Track seen `(plugin_name, filename)` pairs to avoid duplicate findings.

Copilot AI Apr 6, 2026

The documented zip layout doesn’t match how collect-windows-logs.ps1 builds Extension-Logs.zip: it copies C:\WindowsAzure\Logs\Plugins\* into a temp folder and zips that whole directory, so paths typically include a Plugins/ prefix and publisher/extension name segments (not just <plugin-name>/<version>). Consider updating this to reflect the actual on-disk structure so the analyzer can reliably locate CommandExecution*.log files.

Suggested change
**Zip structure**: `Extension-Logs-<ts>/<plugin-name>/<version>/CommandExecution*.log`
Both zips and extracted directories should be scanned. Track seen `(plugin_name, filename)` pairs to avoid duplicate findings.
**Zip structure**: The collector zips the copied `C:\WindowsAzure\Logs\Plugins\*` tree, so paths typically look like `Extension-Logs-<ts>/Plugins/<publisher>.<extension-type>/<version>/CommandExecution*.log` (or the equivalent extracted directory layout).
Both zips and extracted directories should be scanned recursively for `CommandExecution*.log` files under the `Plugins/` tree. Do not assume a fixed `<plugin-name>/<version>` depth; derive the plugin name from the extension directory path and track seen `(plugin_name, filename)` pairs to avoid duplicate findings.
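
The recursive scan with de-duplication described in the suggestion could look roughly like the following. This is a sketch under stated assumptions: the plugin name is taken as the directory right after `Plugins/` when that segment exists, and otherwise as the grandparent directory of the log file; `plugin_key` and `find_command_logs` are hypothetical names.

```python
import zipfile
from fnmatch import fnmatchcase
from pathlib import Path, PurePosixPath

LOG_PATTERN = "CommandExecution*.log"


def plugin_key(parts):
    """Derive (plugin_name, filename) from path components of any depth."""
    filename = parts[-1]
    dirs = list(parts[:-1])
    lowered = [d.lower() for d in dirs]
    if "plugins" in lowered and lowered.index("plugins") + 1 < len(dirs):
        plugin = dirs[lowered.index("plugins") + 1]
    elif len(dirs) >= 2:
        plugin = dirs[-2]  # fallback layout: <plugin>/<version>/<file>
    else:
        plugin = dirs[-1] if dirs else ""
    return plugin, filename


def find_command_logs(bundle_dir):
    """Collect unique (plugin_name, filename) pairs from directories and zips."""
    root = Path(bundle_dir)
    seen = set()
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if fnmatchcase(path.name, LOG_PATTERN):
            seen.add(plugin_key(path.relative_to(root).parts))
        elif path.suffix.lower() == ".zip" and zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as archive:
                for member in archive.namelist():
                    # Archives created on Windows may use backslash separators.
                    parts = PurePosixPath(member.replace("\\", "/")).parts
                    if parts and fnmatchcase(parts[-1], LOG_PATTERN):
                        seen.add(plugin_key(parts))
    return seen
```

Because the key ignores how deep the file sits, the same log found once in a zip and once in its extracted directory produces a single entry.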

Labels

ignore-for-release (This pull request will not be included within official release notes), windows
