skill-up is a CLI evaluation framework for Agent Skill developers. Declare your eval environment, dependencies, test cases, and grading strategy in evals/eval.yaml and evals/cases/*.yaml, then run evaluations locally or in CI to generate structured reports.
Warning
The core business logic of this repository is implemented, but the project is still at an early stage: the code is not yet fully stable, and some CLI commands, configuration fields, and public APIs may change in future releases. Review the CHANGELOG and verify compatibility before using it in production.
- Declarative Eval Config: Define evaluation environment, engine, model, and cases through YAML (eval.yaml + cases/*.yaml).
- Multi-Engine Support: Works with Qoder CLI, Claude Code, and Codex as Agent Engines.
- Flexible Judging: Supports rule_based, script, and agent_judge evaluation strategies.
- Structured Reports: Outputs Anthropic-compatible grading.json, benchmark.json, benchmark.md, plus result.json, JUnit XML, and HTML reports.
- Anthropic Compatible: Import evals.json via skill-up import, or auto-detect with --auto.
- CI-Ready: Designed for local development and continuous integration pipelines.
- Go 1.25 or later: required for building and running the CLI.
From source:
go install github.com/alibaba/skill-up/cmd/skill-up@latest

Prebuilt binaries: Download from GitHub Releases.

Build locally:
make build
# or
go build -o bin/skill-up ./cmd/skill-up

In your Skill directory, create evals/eval.yaml:
schema_version: v1alpha1
environment:
  type: none
skills:
  - source: local_path
    path: .
engine:
  name: claude_code
cases:
  files:
    - evals/cases/hello-world.yaml
  defaults:
    timeout_seconds: 120
    max_turns: 5
report:
  formats: [json]

Create evals/cases/hello-world.yaml:
id: hello-world
title: Skill should respond to basic requests
input:
  prompt: |
    Please generate a Hello World program
expect:
  must_contain:
    - "Hello"
    - "World"
judge:
  type: rule_based
  success:
    - output_contains:
        all: ["Hello", "World"]

Validate the config:
skill-up validate ./evals/eval.yaml

Run the evaluation:
skill-up run ./evals/eval.yaml

Results are written to <skill-name>-workspace/iteration-1/.
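The rule_based judge in hello-world.yaml passes only when the output contains every string listed under output_contains.all. The Go sketch below illustrates that idea; it is not skill-up's actual judge code. The function name containsAll is hypothetical, and case-sensitive substring matching is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAll reports whether output contains every required substring.
// It also returns the substrings that were not found, which is useful
// when explaining a failed case. Matching is case-sensitive (assumption).
func containsAll(output string, required []string) (bool, []string) {
	var missing []string
	for _, s := range required {
		if !strings.Contains(output, s) {
			missing = append(missing, s)
		}
	}
	return len(missing) == 0, missing
}

func main() {
	// Stand-in for the agent's final output in the hello-world case.
	output := "fmt.Println(\"Hello, World!\")"
	ok, missing := containsAll(output, []string{"Hello", "World"})
	fmt.Println("pass:", ok, "missing:", missing)
}
```

Returning the missing strings alongside the verdict keeps the check easy to report on: a failing case can say which expectation was not met instead of only "failed".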
For engineering conventions (Conventional Commits, Git hooks, golangci-lint), see CONTRIBUTING.md.
skill-up auto-loads an optional user-level config that supplies default OpenTelemetry env vars and per-environment runtime kwargs. The embedded defaults are empty; downstream consumers maintain their own config file.
embed (empty) < user (~/.config/skill-up/config.yaml) < project ($PWD/.skill-up.yaml) < explicit (--config)
| Source | Path |
|---|---|
| embed | empty Config{} (no vendor defaults baked in) |
| user | $SKILL_EVAL_CONFIG, else $XDG_CONFIG_HOME/skill-up/config.yaml, else ~/.config/skill-up/config.yaml |
| project | $PWD/.skill-up.yaml |
| explicit | --config <path> (must exist) |
Missing files at the user and project layers are silently skipped; a missing --config path is a hard error. A corrupt config at any layer also fails the run.
skill-up init # writes ~/.config/skill-up/config.yaml (XDG-aware)
skill-up init --local # writes $PWD/.skill-up.yaml
skill-up init --print # writes the template to stdout
skill-up init --force # overwrite an existing file

Example config:
schema_version: v1alpha1
kind: SkillEvalConfig
telemetry:
  service_name: skill-up # OTEL_SERVICE_NAME
  traces_exporter: otlp # OTEL_TRACES_EXPORTER
  traces:
    endpoint: http://localhost:4317 # OTEL_EXPORTER_OTLP_TRACES_ENDPOINT (4317 for grpc, 4318/v1/traces for http/protobuf)
    protocol: grpc # OTEL_EXPORTER_OTLP_TRACES_PROTOCOL (grpc | http/protobuf); skill-up defaults to grpc
  resource_attributes: # serialized into OTEL_RESOURCE_ATTRIBUTES
    deployment.environment: local
  verbose: false # if true, also enables OTEL_LOG_* payload capture
env: # arbitrary defaults, applied only-if-unset
  OTEL_EXPORTER_OTLP_HEADERS: authorization=${OTLP_TOKEN}
runtime_kwargs: # keyed by environment.type
  opensandbox:
    base_url: http://localhost:8080
    # extensions: '{}'

For environment variables: any value already set in the process environment wins; the config only fills in missing keys.
For runtime_kwargs: explicit --runtime-kwarg on run > eval.yaml environment.kwargs > user-config runtime_kwargs[environment.type].
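Both precedence rules above come down to "higher-priority sources win". A minimal Go sketch, assuming string-valued kwargs and per-key merging (the text states only the ordering, so both are assumptions); applyEnvDefaults and mergeKwargs are hypothetical helper names, not part of skill-up's API.

```go
package main

import (
	"fmt"
	"os"
)

// applyEnvDefaults returns the config-supplied variables that should be
// set: any key already present in the process environment is left alone.
func applyEnvDefaults(defaults map[string]string, lookup func(string) (string, bool)) map[string]string {
	toSet := make(map[string]string)
	for k, v := range defaults {
		if _, ok := lookup(k); !ok {
			toSet[k] = v
		}
	}
	return toSet
}

// mergeKwargs merges maps given from lowest to highest precedence;
// for each key, the last map that defines it wins.
func mergeKwargs(layers ...map[string]string) map[string]string {
	out := make(map[string]string)
	for _, l := range layers {
		for k, v := range l {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Environment defaults: only fill in what the process has not set.
	defaults := map[string]string{"OTEL_SERVICE_NAME": "skill-up"}
	for k, v := range applyEnvDefaults(defaults, os.LookupEnv) {
		os.Setenv(k, v)
	}

	// runtime_kwargs: user config < eval.yaml environment.kwargs < --runtime-kwarg.
	userCfg := map[string]string{"base_url": "http://localhost:8080"}
	evalYAML := map[string]string{"base_url": "http://sandbox.example:8080"} // illustrative value
	cliFlags := map[string]string{}
	fmt.Println(mergeKwargs(userCfg, evalYAML, cliFlags)["base_url"])
}
```

Passing the lookup function in, rather than calling os.LookupEnv directly inside the helper, keeps the only-if-unset rule testable without mutating the real environment.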
Prefer ${ENV_VAR} references inside the config file over hard-coded secret literals. The redaction mechanism (userconfig.Redact) masks fields tagged secret:"true" when printing; no Config field currently carries the tag, but the mechanism is in place for future fields.
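To illustrate how tag-driven masking of this kind can work, here is a small Go sketch. It is not the userconfig.Redact implementation: exampleConfig, its Token field, and the redact function are hypothetical, and expanding ${ENV_VAR} references with os.Expand is an assumption about load-time behavior.

```go
package main

import (
	"fmt"
	"os"
	"reflect"
)

// exampleConfig is hypothetical; no real Config field carries the tag yet.
type exampleConfig struct {
	Endpoint string
	Token    string `secret:"true"`
}

// redact returns a copy of cfg in which every non-empty string field
// tagged secret:"true" is replaced by a fixed mask. Only top-level
// string fields are handled in this sketch.
func redact(cfg exampleConfig) exampleConfig {
	v := reflect.ValueOf(&cfg).Elem() // cfg is already a copy; make it settable
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.Tag.Get("secret") != "true" || f.Type.Kind() != reflect.String {
			continue
		}
		if v.Field(i).String() != "" {
			v.Field(i).SetString("***")
		}
	}
	return cfg
}

func main() {
	// Resolve a ${VAR} reference from a map standing in for the environment.
	env := map[string]string{"OTLP_TOKEN": "s3cr3t"}
	header := os.Expand("authorization=${OTLP_TOKEN}", func(k string) string { return env[k] })

	cfg := exampleConfig{Endpoint: "http://localhost:4317", Token: header}
	fmt.Printf("%+v\n", redact(cfg)) // the Token field is masked before printing
}
```

Because redact works on a copy, the original value stays usable for the actual request while anything printed or logged goes through the masked version.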
Use skill-up import to migrate an Anthropic-compatible evals.json into the YAML layout used by this repo:
skill-up import ./evals/evals.json --output ./evals

| Command | Description |
|---|---|
| skill-up run [path] | Run evaluation cases and produce reports |
| skill-up validate [path] | Validate eval.yaml and case files |
| skill-up list-cases [path] | List all cases referenced by the config |
| skill-up report <result.json> | Generate reports from a previous run |
| skill-up import <evals.json> | Import Anthropic evals.json to YAML cases |
| skill-up debug judge <input.json> | Debug judge module with a JSON input |
| skill-up debug report <input.json> | Debug report module with a JSON input |
skill-up/
├── cmd/skill-up/          # CLI entrypoint
├── internal/              # Private implementation
│   ├── cli/               # Cobra commands
│   ├── config/            # YAML config loader & validator
│   ├── credential/        # API key & credential resolution
│   ├── runtime/           # Workspace runtime (none / opensandbox)
│   ├── agent/             # Agent Engine adapters
│   ├── judge/             # Evaluation judges
│   ├── report/            # Report generators (JSON / JUnit / HTML)
│   └── runner/            # End-to-end orchestration
├── pkg/transcript/        # Public transcript parsing API
├── docs/                  # VitePress documentation site
│   ├── .vitepress/        # VitePress config
│   ├── guide/             # English user guide
│   ├── zh/                # Chinese user guide
│   └── public/            # Static assets (logo, etc.)
├── e2e/                   # End-to-end tests
├── examples/              # Example fixtures and scripts
├── Makefile               # Build & quality targets
├── go.mod / go.sum        # Go module dependencies
└── README.md              # This file
Apache License 2.0; see LICENSE.
