skill-up


Overview

skill-up is a CLI evaluation framework for Agent Skill developers. Declare your eval environment, dependencies, test cases, and grading strategy in evals/eval.yaml and evals/cases/*.yaml, then run evaluations locally or in CI to generate structured reports.

Warning

The core business logic of this repository is implemented, but the project is still at an early stage of development: the code is not yet fully stable, and some CLI commands, configuration fields, and public APIs may change in future releases. Review the CHANGELOG and verify compatibility before using it in production.

Features

Declarative Eval Config: Define evaluation environment, engine, model, and cases through YAML (eval.yaml + cases/*.yaml).
Multi-Engine Support: Works with Qoder CLI, Claude Code, and Codex as Agent Engines.
Flexible Judging: Supports rule_based, script, and agent_judge evaluation strategies.
Structured Reports: Outputs Anthropic-compatible grading.json, benchmark.json, benchmark.md, plus result.json, JUnit XML, and HTML reports.
Anthropic Compatible: Import evals.json via skill-up import, or auto-detect with --auto.
CI-Ready: Designed for local development and continuous integration pipelines.

Requirements

Go 1.25 or later — required for building and running the CLI.

Installation

From source:

go install github.com/alibaba/skill-up/cmd/skill-up@latest

Prebuilt binaries: Download from GitHub Releases.

Build locally:

make build
# or
go build -o bin/skill-up ./cmd/skill-up

Quick Start

1. Create Eval Config

In your Skill directory, create evals/eval.yaml:

schema_version: v1alpha1

environment:
  type: none

skills:
  - source: local_path
    path: .

engine:
  name: claude_code

cases:
  files:
    - evals/cases/hello-world.yaml
  defaults:
    timeout_seconds: 120
    max_turns: 5

report:
  formats: [json]

2. Write a Test Case

Create evals/cases/hello-world.yaml:

id: hello-world
title: Skill should respond to basic requests

input:
  prompt: |
    Please generate a Hello World program

expect:
  must_contain:
    - "Hello"
    - "World"

judge:
  type: rule_based
  success:
    - output_contains:
        all: ["Hello", "World"]
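The rule_based judge in the case above succeeds when the output contains every listed string. A minimal sketch of that check follows, assuming case-sensitive substring matching (the matching semantics are not stated here). `containsAll` is a hypothetical helper for illustration, not skill-up's judge code.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAll reports whether output contains every substring in want.
// Hypothetical helper: matching is assumed to be case-sensitive.
func containsAll(output string, want []string) bool {
	for _, s := range want {
		if !strings.Contains(output, s) {
			return false
		}
	}
	return true
}

func main() {
	output := `fmt.Println("Hello, World!")`
	fmt.Println(containsAll(output, []string{"Hello", "World"}))   // true
	fmt.Println(containsAll(output, []string{"Hello", "Goodbye"})) // false
}
```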

3. Validate Config

skill-up validate ./evals/eval.yaml

4. Run Evaluation

skill-up run ./evals/eval.yaml

Results are written to <skill-name>-workspace/iteration-1/.

For engineering conventions (Conventional Commits, Git hooks, golangci-lint), see CONTRIBUTING.md.

User config

skill-up auto-loads an optional user-level config that supplies default OpenTelemetry env vars and per-environment runtime kwargs. The embedded defaults are empty; downstream consumers maintain their own config file.

Discovery chain (lowest to highest precedence)

embed (empty) < user (~/.config/skill-up/config.yaml) < project ($PWD/.skill-up.yaml) < explicit (--config)

| Source | Path |
| --- | --- |
| `embed` | empty `Config{}` (no vendor defaults baked in) |
| `user` | `$SKILL_EVAL_CONFIG`, else `$XDG_CONFIG_HOME/skill-up/config.yaml`, else `~/.config/skill-up/config.yaml` |
| `project` | `$PWD/.skill-up.yaml` |
| `explicit` | `--config <path>` (must exist) |

Missing files at the user and project layers are silently skipped; a missing --config path is a hard error. A corrupt config at any layer also fails the run.

Quickstart

skill-up init              # writes ~/.config/skill-up/config.yaml (XDG-aware)
skill-up init --local      # writes $PWD/.skill-up.yaml
skill-up init --print      # writes the template to stdout
skill-up init --force      # overwrite an existing file

Schema

schema_version: v1alpha1
kind: SkillEvalConfig

telemetry:
  service_name: skill-up                              # OTEL_SERVICE_NAME
  traces_exporter: otlp                                 # OTEL_TRACES_EXPORTER
  traces:
    endpoint: http://localhost:4317                     # OTEL_EXPORTER_OTLP_TRACES_ENDPOINT (4317 for grpc, 4318/v1/traces for http/protobuf)
    protocol: grpc                                      # OTEL_EXPORTER_OTLP_TRACES_PROTOCOL (grpc | http/protobuf); skill-up defaults to grpc
  resource_attributes:                                  # serialized into OTEL_RESOURCE_ATTRIBUTES
    deployment.environment: local
  verbose: false                                        # if true, also enables OTEL_LOG_* payload capture

env:                                                    # arbitrary defaults, applied only-if-unset
  OTEL_EXPORTER_OTLP_HEADERS: authorization=${OTLP_TOKEN}

runtime_kwargs:                                         # keyed by environment.type
  opensandbox:
    base_url: http://localhost:8080
    # extensions: '{}'
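The comments in the schema name the OTEL_* variable each telemetry field feeds. A rough sketch of that mapping is shown below. The `Telemetry` struct and `otelEnv` function are hypothetical, the `verbose` flag is left out, and resource attributes are joined as comma-separated `key=value` pairs without any escaping.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Telemetry mirrors the telemetry section of the schema. The type and its
// field names are hypothetical, not skill-up's actual types.
type Telemetry struct {
	ServiceName        string
	TracesExporter     string
	TracesEndpoint     string
	TracesProtocol     string
	ResourceAttributes map[string]string
}

// otelEnv maps telemetry fields to the OTEL_* variables named in the schema
// comments. Empty fields are skipped.
func otelEnv(t Telemetry) map[string]string {
	env := make(map[string]string)
	put := func(key, value string) {
		if value != "" {
			env[key] = value
		}
	}
	put("OTEL_SERVICE_NAME", t.ServiceName)
	put("OTEL_TRACES_EXPORTER", t.TracesExporter)
	put("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", t.TracesEndpoint)
	put("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL", t.TracesProtocol)

	// Sort attribute keys so the serialized value is deterministic.
	keys := make([]string, 0, len(t.ResourceAttributes))
	for k := range t.ResourceAttributes {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+"="+t.ResourceAttributes[k])
	}
	put("OTEL_RESOURCE_ATTRIBUTES", strings.Join(pairs, ","))
	return env
}

func main() {
	env := otelEnv(Telemetry{
		ServiceName:        "skill-up",
		TracesExporter:     "otlp",
		TracesEndpoint:     "http://localhost:4317",
		TracesProtocol:     "grpc",
		ResourceAttributes: map[string]string{"deployment.environment": "local"},
	})
	fmt.Println(env["OTEL_SERVICE_NAME"])
	fmt.Println(env["OTEL_RESOURCE_ATTRIBUTES"])
}
```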

Precedence

For environment variables: any value already set in the process environment wins; the config only fills in missing keys.

For runtime_kwargs, precedence from highest to lowest is: an explicit --runtime-kwarg flag on run, then environment.kwargs in eval.yaml, then runtime_kwargs[environment.type] in the user config.
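One way to picture the runtime_kwargs ordering is an overlay of maps from lowest to highest precedence. The sketch below assumes a per-key merge with string values; whether skill-up merges key by key or replaces a whole layer is not stated here. `mergeKwargs` and the sample values are hypothetical.

```go
package main

import "fmt"

// mergeKwargs overlays layers in the order given, so later layers win.
// Assumption: merging is per key, and values are plain strings.
func mergeKwargs(layers ...map[string]string) map[string]string {
	out := make(map[string]string)
	for _, layer := range layers {
		for key, value := range layer {
			out[key] = value
		}
	}
	return out
}

func main() {
	userConfig := map[string]string{"base_url": "http://localhost:8080", "extensions": "{}"}
	evalYAML := map[string]string{"base_url": "http://sandbox.example:8080"}
	cliFlags := map[string]string{"extensions": `{"debug":true}`}

	// Lowest precedence first: user config, then eval.yaml, then CLI flags.
	merged := mergeKwargs(userConfig, evalYAML, cliFlags)
	fmt.Println(merged["base_url"], merged["extensions"])
}
```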

Secrets

Prefer ${ENV_VAR} references inside the config file over hard-coded secret literals. The redaction mechanism (userconfig.Redact) masks fields tagged secret:"true" when the config is printed. No Config field carries the tag yet, but the mechanism is in place for future fields.
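Tag-driven redaction of this kind is commonly built on Go's reflect package. The sketch below illustrates the general technique only. It is not the code behind userconfig.Redact, and `demoConfig`, `redact`, and the `***` mask are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

// demoConfig is a hypothetical struct used only for this illustration.
type demoConfig struct {
	Endpoint string
	Token    string `secret:"true"`
}

// redact returns a copy of cfg in which every non-empty string field tagged
// secret:"true" is replaced by a mask. cfg is passed by value, so the
// caller's struct is left unchanged.
func redact(cfg demoConfig) demoConfig {
	v := reflect.ValueOf(&cfg).Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		field := t.Field(i)
		if field.Tag.Get("secret") != "true" || field.Type.Kind() != reflect.String {
			continue
		}
		if v.Field(i).String() != "" {
			v.Field(i).SetString("***")
		}
	}
	return cfg
}

func main() {
	cfg := demoConfig{Endpoint: "http://localhost:4317", Token: "s3cr3t"}
	// Prints: {Endpoint:http://localhost:4317 Token:***}
	fmt.Printf("%+v\n", redact(cfg))
}
```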

Importing `evals.json`

Use skill-up import to migrate an Anthropic-compatible evals.json into the YAML layout used by this repo:

skill-up import ./evals/evals.json --output ./evals

CLI Overview

| Command | Description |
| --- | --- |
| `skill-up run [path]` | Run evaluation cases and produce reports |
| `skill-up validate [path]` | Validate `eval.yaml` and case files |
| `skill-up list-cases [path]` | List all cases referenced by the config |
| `skill-up report <result.json>` | Generate reports from a previous run |
| `skill-up import <evals.json>` | Import Anthropic `evals.json` to YAML cases |
| `skill-up debug judge <input.json>` | Debug judge module with a JSON input |
| `skill-up debug report <input.json>` | Debug report module with a JSON input |

Project Structure

skill-up/
├── cmd/skill-up/          # CLI entrypoint
├── internal/              # Private implementation
│   ├── cli/               # Cobra commands
│   ├── config/            # YAML config loader & validator
│   ├── credential/        # API key & credential resolution
│   ├── runtime/           # Workspace runtime (none / opensandbox)
│   ├── agent/             # Agent Engine adapters
│   ├── judge/             # Evaluation judges
│   ├── report/            # Report generators (JSON / JUnit / HTML)
│   └── runner/            # End-to-end orchestration
├── pkg/transcript/        # Public transcript parsing API
├── docs/                  # VitePress documentation site
│   ├── .vitepress/        # VitePress config
│   ├── guide/             # English user guide
│   ├── zh/                # Chinese user guide
│   └── public/            # Static assets (logo, etc.)
├── e2e/                   # End-to-end tests
├── examples/              # Example fixtures and scripts
├── Makefile               # Build & quality targets
├── go.mod / go.sum        # Go module dependencies
└── README.md              # This file

License

Apache License 2.0 — see LICENSE.
