feat: add windows-log-analysis Copilot skill (LLM sub-skills) #8214
timmy-wright wants to merge 3 commits into main
Pull request overview
Adds a new GitHub Copilot skill under .github/skills/windows-log-analysis/ to help diagnose Windows AKS node issues from log bundles produced by staging/cse/windows/debug/collect-windows-logs.ps1, including an accompanying Python analyzer script.
Changes:
- Introduces SKILL.md with a log-bundle reference guide and troubleshooting playbooks.
- Adds analyze-windows-logs.py to scan multi-snapshot bundles, trend key metrics, and emit prioritized findings.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| .github/skills/windows-log-analysis/SKILL.md | Skill definition and reference guide for interpreting collected Windows node logs |
| .github/skills/windows-log-analysis/analyze-windows-logs.py | Python 3 analyzer for automated triage of collected Windows log bundles |
b63d0ec to b6818e8
| `kubelet.log` | UTF-8 | Kubelet stdout logs (if present) |
| `kubelet.err.log` | UTF-8 | Kubelet stderr logs (if present) |
| `<ts>-cri-containerd-pods.txt` | UTF-16-LE with BOM | `crictl pods` output, used to cross-reference pod state |
| `*_services.csv` | UTF-8 | Service status timeline used for kubelet crash/restart and clock skew checks |
*_services.csv is exported by collect-windows-logs.ps1 via Export-Csv without -Encoding, which defaults to UTF-16LE on Windows PowerShell. Marking it as UTF-8 here will cause parsers to mis-decode the file; update the encoding (and ideally the pattern to <ts>_services.csv for consistency with other entries).
-| `*_services.csv` | UTF-8 | Service status timeline used for kubelet crash/restart and clock skew checks |
+| `<ts>_services.csv` | UTF-16-LE with BOM | Service status timeline used for kubelet crash/restart and clock skew checks |
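Because the bundle mixes UTF-8 and UTF-16-LE files, a parser can avoid hard-coding the encoding per file and check the byte-order mark instead. The following is a minimal sketch of that idea, not code from this PR; the function name `decode_log_bytes` is hypothetical, and the UTF-8 fallback for BOM-less files is an assumption.

```python
import codecs


def decode_log_bytes(raw: bytes) -> str:
    """Decode collected log bytes by checking for a byte-order mark."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No BOM found: assume UTF-8 (an assumption) and replace undecodable
    # bytes instead of raising, so one bad file does not stop the triage.
    return raw.decode("utf-8", errors="replace")
```

A caller would pass `Path(...).read_bytes()` for each file; the encoding column in the table then serves as documentation rather than as parser input.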
| `bootstrap-config` | analyze-bootstrap |
| `*-ccg-*.evtx` or CCG event logs | analyze-gmsa |
| `gmsa-*.log` or gMSA credential spec files | analyze-gmsa |
| `kubectl-describe-nodes.log` | analyze-gpu, analyze-kubelet |
File-dispatch mapping seems incomplete: kubectl-describe-nodes.log is consumed by analyze-kubelet.md (node conditions/taints/events) as well as GPU analysis, but the table only routes it to analyze-gpu. Add analyze-kubelet here to avoid skipping kubelet triage when this file is present.
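The routing described here can be sketched as a pattern-to-sub-skill lookup in which one file may select several sub-skills. This is an illustrative sketch only: the four entries are copied from the rows quoted above (not the full dispatch table), and `DISPATCH` and `select_sub_skills` are hypothetical names.

```python
from fnmatch import fnmatch

# Pattern -> sub-skills, copied from the quoted rows only (partial table).
DISPATCH = [
    ("bootstrap-config", ["analyze-bootstrap"]),
    ("*-ccg-*.evtx", ["analyze-gmsa"]),
    ("gmsa-*.log", ["analyze-gmsa"]),
    ("kubectl-describe-nodes.log", ["analyze-gpu", "analyze-kubelet"]),
]


def select_sub_skills(filenames):
    """Return the sorted set of sub-skills whose pattern matches any file name."""
    selected = set()
    for name in filenames:
        for pattern, skills in DISPATCH:
            if fnmatch(name, pattern):
                selected.update(skills)
    return sorted(selected)
```

Mapping each pattern to a list rather than a single name is what keeps kubelet triage from being skipped when `kubectl-describe-nodes.log` is present.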
…e.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| 2003 | Resource Exhaustion Detector | Resource exhaustion detected |
| 2004 | System | Low memory condition |
| 6008 | System | Unexpected shutdown (preceding crash/BSOD) |

**Note:** Windows Event IDs are **not globally unique**: the same numeric ID (e.g., `2003`) can appear under different providers with different meanings. When writing analysis logic, always key on **Event ID + Source** together.
The Windows Event ID table looks inconsistent with how collect-windows-logs.ps1 exports events: the collector comments that Event ID 2004 is used for resource exhaustion/low memory events, but this table lists Resource Exhaustion Detector as 2003 and assigns 2004 to the System provider. Please align this table with the actual provider+event IDs emitted in the collected bundle (or explicitly note the exact ProviderName for the 2004 events).
-| 2003 | Resource Exhaustion Detector | Resource exhaustion detected |
-| 2004 | System | Low memory condition |
-| 6008 | System | Unexpected shutdown (preceding crash/BSOD) |
-**Note:** Windows Event IDs are **not globally unique**: the same numeric ID (e.g., `2003`) can appear under different providers with different meanings. When writing analysis logic, always key on **Event ID + Source** together.
+| 2004 | Microsoft-Windows-Resource-Exhaustion-Detector | Resource exhaustion detected / low memory condition |
+| 6008 | System | Unexpected shutdown (preceding crash/BSOD) |
+**Note:** Windows Event IDs are **not globally unique**: the same numeric ID can appear under different providers with different meanings. When writing analysis logic, always key on **Event ID + Source** together.
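Keying on Event ID and Source together can be sketched with a dictionary whose keys are `(event_id, source)` tuples. The two entries below come from the suggested rows above; `EVENT_MEANINGS` and `classify_event` are hypothetical names, and a real analyzer would need the complete table.

```python
# Keys are (event ID, source) pairs, so the same numeric ID under a
# different provider never matches by accident.
EVENT_MEANINGS = {
    (2004, "Microsoft-Windows-Resource-Exhaustion-Detector"):
        "Resource exhaustion detected / low memory condition",
    (6008, "System"): "Unexpected shutdown (preceding crash/BSOD)",
}


def classify_event(event_id: int, source: str):
    """Look up an event by (ID, source); return None when the pair is unknown."""
    return EVENT_MEANINGS.get((event_id, source))
```

With this shape, an event carrying ID 2004 from any other provider returns `None` instead of being reported as a low-memory condition.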
**WINDOWS_CSE_ERROR codes** (from AgentBaker `windowscsehelper.ps1`):

| Code | Name | Meaning |
|------|------|---------|
| 0 | SUCCESS | CSE completed successfully |
| 1 | UNKNOWN | Unexpected error in catch block |
| 2 | DOWNLOAD_FILE_WITH_RETRY | File download failed after retries |
| 3 | INVOKE_EXECUTABLE | Executable invocation failed |
| 4 | FILE_NOT_EXIST | Required file missing |
| 5 | CHECK_API_SERVER_CONNECTIVITY | Cannot reach API server |
| 6 | PAUSE_IMAGE_NOT_EXIST | Pause container image missing |
| 7 | GET_SUBNET_PREFIX | Failed to get subnet prefix |
| 8 | GENERATE_TOKEN_FOR_ARM | ARM token generation failed |
| 9 | NETWORK_INTERFACES_NOT_EXIST | No network interfaces found |
| 10 | NETWORK_ADAPTER_NOT_EXIST | Network adapter missing |
| 11 | MANAGEMENT_IP_NOT_EXIST | Management IP not found |
| 12 | CALICO_SERVICE_ACCOUNT_NOT_EXIST | Calico SA missing |
| 13 | CONTAINERD_NOT_INSTALLED | containerd binary not found |
| 14 | CONTAINERD_NOT_RUNNING | containerd service not running |
| 15 | OPENSSH_NOT_INSTALLED | OpenSSH not installed |
| 16 | OPENSSH_FIREWALL_NOT_CONFIGURED | OpenSSH firewall rule missing |
| 17 | INVALID_PARAMETER_IN_AZURE_CONFIG | Bad azure.json parameter |
| 19 | GET_CA_CERTIFICATES | CA cert retrieval failed |
| 20 | DOWNLOAD_CA_CERTIFICATES | CA cert download failed |
| 21 | EMPTY_CA_CERTIFICATES | CA certs empty |
| 22 | ENABLE_SECURE_TLS | Secure TLS enablement failed |
| 23–28 | GMSA_* | gMSA setup failures |
| 29 | NOT_FOUND_MANAGEMENT_IP | Management IP lookup failed |
| 30 | NOT_FOUND_BUILD_NUMBER | Windows build number not found |
| 31 | NOT_FOUND_PROVISIONING_SCRIPTS | Provisioning scripts missing |
| 32 | START_NODE_RESET_SCRIPT_TASK | Node reset task failed to start |
| 33–40 | DOWNLOAD_*_PACKAGE | Package download failures (CSE, K8s, CNI, HNS, Calico, gMSA, CSI proxy, containerd) |
| 41 | SET_TCP_DYNAMIC_PORT_RANGE | TCP port range configuration failed |
| 43 | PULL_PAUSE_IMAGE | Pause image pull failed |
| 45 | CONTAINERD_BINARY_EXIST | containerd binary check failed |
| 46–48 | SET_*_PORT_RANGE | Port range exclusion failures |
| 49 | NO_CUSTOM_DATA_BIN | CustomData.bin missing (very early failure) |
| 50 | NO_CSE_RESULT_LOG | CSE did not produce result log |
| 52 | RESIZE_OS_DRIVE | OS drive resize failed |
| 53–61 | GPU_* | GPU driver installation failures |
| 62 | UPDATING_KUBE_CLUSTER_CONFIG | Kube cluster config update failed |
| 64 | GET_CONTAINERD_VERSION | containerd version detection failed |
| 65–67 | CREDENTIAL_PROVIDER_* | Credential provider install/config failures |
| 68 | ADJUST_PAGEFILE_SIZE | Pagefile resize failed |
| 70–71 | SECURE_TLS_BOOTSTRAP_* | Secure TLS bootstrap client failures |
| 72 | CILIUM_NETWORKING_INSTALL_FAILED | Cilium install failed |
| 73 | EXTRACT_ZIP | Zip extraction failed |
| 74–75 | LOAD/PARSE_METADATA | Metadata failures |
| 76–83 | ORAS_* | Network-isolated cluster artifact pull failures |
This section says the table is sourced from windowscsehelper.ps1 and later references a “full code table”, but the table omits several defined codes (e.g., 18, 42, 44, 51, 63, 69). To avoid misdiagnosis, either (a) include the missing codes/ranges, or (b) label this as a partial list of common codes and link readers to parts/windows/windowscsehelper.ps1 for the authoritative set.
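One way to keep a partial table from causing misdiagnosis is to have any lookup report unlisted codes explicitly instead of guessing. The sketch below uses only a handful of codes and ranges from the excerpt above; `SINGLE_CODES`, `CODE_RANGES`, and `describe_cse_code` are hypothetical names, and the authoritative set remains `windowscsehelper.ps1`.

```python
# Partial data: a few single codes and family ranges from the quoted table.
SINGLE_CODES = {
    0: "SUCCESS",
    1: "UNKNOWN",
    49: "NO_CUSTOM_DATA_BIN",
    50: "NO_CSE_RESULT_LOG",
}
CODE_RANGES = [
    (range(23, 29), "GMSA_*"),              # 23-28 inclusive
    (range(33, 41), "DOWNLOAD_*_PACKAGE"),  # 33-40 inclusive
    (range(53, 62), "GPU_*"),               # 53-61 inclusive
    (range(76, 84), "ORAS_*"),              # 76-83 inclusive
]


def describe_cse_code(code: int) -> str:
    """Return the code name or family, flagging codes this table does not list."""
    if code in SINGLE_CODES:
        return SINGLE_CODES[code]
    for codes, family in CODE_RANGES:
        if code in codes:
            return family
    return f"unlisted code {code}; check windowscsehelper.ps1"
```

Codes such as 18 or 42 then surface as "unlisted" rather than being silently mapped to a neighbouring entry.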
Writing markdown to files using bash HEREDOC (`cat << 'EOF'`) breaks when content contains:
- Backticks (`` ` ``): interpreted as command substitution even in some HEREDOC forms
- `$variable`: interpreted as shell expansion in unquoted HEREDOCs
- The HEREDOC delimiter appearing in the content itself
- Nested quotes and backslashes causing silent corruption
The first bullet claims backticks are interpreted as command substitution when using a quoted heredoc example (cat << 'EOF'), but in bash a single-quoted heredoc delimiter prevents command substitution/expansion. Consider rewording this list to distinguish unquoted vs quoted heredocs (the real pitfalls are unquoted delimiters, delimiter collisions, and tool-specific filtering), otherwise readers may avoid a safe approach unnecessarily.
-Writing markdown to files using bash HEREDOC (`cat << 'EOF'`) breaks when content contains:
-- Backticks (`` ` ``): interpreted as command substitution even in some HEREDOC forms
-- `$variable`: interpreted as shell expansion in unquoted HEREDOCs
-- The HEREDOC delimiter appearing in the content itself
-- Nested quotes and backslashes causing silent corruption
+Writing markdown to files via shell can break depending on how the HEREDOC is written:
+- In unquoted HEREDOCs, backticks (`` ` ``) and `$variable` can be interpreted by the shell
+- The HEREDOC delimiter can appear in the content itself and terminate the document early
+- Nested quotes, backslashes, or tool-specific filtering can still cause silent corruption
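For comparison, writing the file from Python avoids shell parsing entirely, so none of the pitfalls above apply to the content. This is a minimal sketch and not the approach taken by the skill in this PR; `save_markdown`, the sample text, and the `reports/findings.md` path are all hypothetical.

```python
import tempfile
from pathlib import Path


def save_markdown(path: Path, content: str) -> int:
    """Write markdown verbatim and return the number of characters written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" disables newline translation, so the file matches the string.
    with path.open("w", encoding="utf-8", newline="") as handle:
        return handle.write(content)


# The sample holds backticks, a $variable, and a line that would end a HEREDOC.
sample = "Run `kubectl get nodes` with $HOME unset\nEOF\n"
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "reports" / "findings.md"
    save_markdown(target, sample)
    round_trip = target.read_text(encoding="utf-8")
```

Since no shell ever sees `sample`, the backticks, `$HOME`, and the bare `EOF` line are written as ordinary text.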
**Zip structure**: `Extension-Logs-<ts>/<plugin-name>/<version>/CommandExecution*.log`

Both zips and extracted directories should be scanned. Track seen `(plugin_name, filename)` pairs to avoid duplicate findings.
The documented zip layout doesn’t match how collect-windows-logs.ps1 builds Extension-Logs.zip: it copies C:\WindowsAzure\Logs\Plugins\* into a temp folder and zips that whole directory, so paths typically include a Plugins/ prefix and publisher/extension name segments (not just <plugin-name>/<version>). Consider updating this to reflect the actual on-disk structure so the analyzer can reliably locate CommandExecution*.log files.
-**Zip structure**: `Extension-Logs-<ts>/<plugin-name>/<version>/CommandExecution*.log`
-Both zips and extracted directories should be scanned. Track seen `(plugin_name, filename)` pairs to avoid duplicate findings.
+**Zip structure**: The collector zips the copied `C:\WindowsAzure\Logs\Plugins\*` tree, so paths typically look like `Extension-Logs-<ts>/Plugins/<publisher>.<extension-type>/<version>/CommandExecution*.log` (or the equivalent extracted directory layout).
+Both zips and extracted directories should be scanned recursively for `CommandExecution*.log` files under the `Plugins/` tree. Do not assume a fixed `<plugin-name>/<version>` depth; derive the plugin name from the extension directory path and track seen `(plugin_name, filename)` pairs to avoid duplicate findings.
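The depth-independent scan suggested here could look roughly like the sketch below. It assumes the plugin name is the path segment directly after `Plugins/`, which is an interpretation of the suggestion rather than confirmed collector behaviour; `plugin_and_file` and `find_command_logs` are hypothetical names.

```python
import zipfile
from fnmatch import fnmatch
from pathlib import Path, PurePosixPath


def plugin_and_file(parts):
    """Derive (plugin_name, filename) from path parts, or None if no match."""
    if not parts or "Plugins" not in parts:
        return None
    filename = parts[-1]
    if not fnmatch(filename, "CommandExecution*.log"):
        return None
    idx = parts.index("Plugins")
    # Assumption: the segment right after Plugins/ names the extension.
    if idx + 1 >= len(parts) - 1:
        return None
    return parts[idx + 1], filename


def find_command_logs(root: Path):
    """Scan zips and extracted trees under root, de-duplicating matches."""
    seen = set()
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() == ".zip":
            with zipfile.ZipFile(path) as archive:
                # Normalise separators in case entries use backslashes.
                candidates = [PurePosixPath(name.replace("\\", "/")).parts
                              for name in archive.namelist()]
        else:
            candidates = [path.relative_to(root).parts]
        for parts in candidates:
            key = plugin_and_file(parts)
            if key:
                seen.add(key)
    return sorted(seen)
```

Because matches are collected into a set of `(plugin_name, filename)` pairs, a log that exists both inside a zip and in an extracted copy is reported once. The sketch does not handle corrupt archives; a real analyzer would likely catch `zipfile.BadZipFile`.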
feat: add windows-log-analysis Copilot skill (LLM sub-skills)
Summary
Adds a Copilot CLI skill for diagnosing Windows AKS node issues from log bundles produced by collect-windows-logs.ps1. This skill uses LLM sub-skill markdown files that instruct AI agents how to analyze each log category.
It also adds a skill to save markdown to disk, because my agent repeatedly ran into problems with that task.
Why LLM sub-skills instead of scripts?
Architecture
What the skill detects
Orchestrator features
- common-reference.md includes a dispatch table so agents pick the right 3-5 sub-skills instead of running all 16
- SKILL.md provides a full decision tree for combining findings across sub-skills
Key research that informed the sub-skills
- Update-DefenderPreferences missing path exclusions for C:\ProgramData\containerd and containerd-shim-runhcs-v1.exe
Files changed
- .github/skills/windows-log-analysis/SKILL.md: orchestrator with decision tree and root cause chains
- .github/skills/windows-log-analysis/sub-skills/*.md: 16 sub-skills + common reference (3,337 lines total)
- .github/skills/windows-log-analysis/.gitignore