feat(workflows): add beval behavioral evaluation workflow for dt-coach agent #1129
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -24,7 +24,8 @@ | |
| "**/Cargo.lock", | ||
| "CHANGELOG.md", | ||
| "logs/**", | ||
| "docs/docusaurus/build/**" | ||
| "docs/docusaurus/build/**", | ||
|
Collaborator
🔍 Missing The Suggested addition to # Beval evaluation results
beval/**/results/ |
||
| "beval/**/results/**" | ||
| ], | ||
| "ignoreRegExpList": [ | ||
| "/#.*/g", | ||
|
|
@@ -62,22 +63,25 @@ | |
| "general-technical" | ||
| ], | ||
| "words": [ | ||
| "agentic", | ||
| "atheris", | ||
| "behaviour", | ||
| "behavioural", | ||
| "beval", | ||
| "brainwriting", | ||
| "clusterfuzzlite", | ||
| "collab", | ||
| "easyops", | ||
| "figjam", | ||
| "hideable", | ||
| "learning", | ||
| "parseable", | ||
| "smol", | ||
| "subcat", | ||
| "whiteboarding", | ||
| "wireframes", | ||
| "ˈpræksɪs", | ||
| "πρᾶξις", | ||
| "agentic" | ||
| "πρᾶξις" | ||
| ], | ||
| "reporters": [ | ||
| "default", | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| name: Behavioral Evaluation (beval) | ||
|
|
||
| on: | ||
| workflow_call: | ||
| secrets: | ||
| COPILOT_TOKEN: | ||
| required: true | ||
| workflow_dispatch: | ||
|
|
||
| permissions: | ||
| contents: read | ||
|
|
||
| concurrency: | ||
| group: ${{ github.workflow }}-${{ github.ref }} | ||
| cancel-in-progress: false | ||
|
|
||
| jobs: | ||
| evaluate: | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 30 | ||
|
|
||
| env: | ||
| AGENT_REPO_ROOT: ${{ github.workspace }} | ||
|
|
||
| steps: | ||
| - name: Checkout repository | ||
|
Collaborator
This branch has drifted from
Both will likely fail the |
||
| uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v4.2.2 | ||
| with: | ||
| persist-credentials: false | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0 | ||
| with: | ||
| python-version: "3.12" | ||
|
|
||
| - name: Install GitHub Copilot CLI | ||
| run: | | ||
| npm ci --prefix beval | ||
| echo "${{ github.workspace }}/beval/node_modules/.bin" >> "$GITHUB_PATH" | ||
|
|
||
| - name: Install beval | ||
| # beval is hosted under a personal account (vyta) while an org-owned | ||
| # home is evaluated. The install is pinned to a specific commit SHA to | ||
| # mitigate supply-chain risk in the interim. | ||
| run: pip install --no-cache-dir "beval[all] @ git+https://github.com/vyta/beval.git@a2effa10cec1b06c394811587fede0070174d589#subdirectory=python" | ||
|
|
||
| - name: Start agent (TCP) | ||
| env: | ||
| COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_TOKEN }} | ||
| run: | | ||
| copilot --acp --port 3000 & | ||
| for i in $(seq 1 30); do | ||
| nc -z 127.0.0.1 3000 && break | ||
| echo "Waiting for agent to start ($i)..." | ||
| sleep 2 | ||
| done | ||
| nc -z 127.0.0.1 3000 || { echo "Agent failed to start"; exit 1; } | ||
|
|
||
| - name: Start judge (TCP) | ||
| env: | ||
| COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_TOKEN }} | ||
| run: | | ||
| copilot --acp --port 3001 & | ||
| for i in $(seq 1 30); do | ||
| nc -z 127.0.0.1 3001 && break | ||
| echo "Waiting for judge to start ($i)..." | ||
| sleep 2 | ||
| done | ||
| nc -z 127.0.0.1 3001 || { echo "Judge failed to start"; exit 1; } | ||
|
|
||
| - name: Run evaluations | ||
| run: | | ||
| beval \ | ||
| -c beval/dt-coach/eval.config.yaml \ | ||
| run \ | ||
| --cases beval/dt-coach/cases/ \ | ||
| --agent beval/dt-coach/agent.yaml \ | ||
| -m validation \ | ||
| -o beval/dt-coach/results/results.json | ||
|
|
||
| - name: Upload results | ||
| uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v4.4.3 | ||
| if: always() | ||
| with: | ||
| name: beval-results-${{ github.run_id }} | ||
| path: beval/dt-coach/results/ | ||
| retention-days: 30 | ||
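
The "Start agent (TCP)" and "Start judge (TCP)" steps repeat the same poll-then-verify loop around `nc -z`. As a rough illustration of that readiness check, here is a minimal Python sketch. The helper name `wait_for_port` and its parameters are hypothetical; it is not part of beval or the Copilot CLI, and the workflow itself uses the shell loop shown above.

```python
import socket
import time


def wait_for_port(host: str, port: int, attempts: int = 30, delay: float = 2.0) -> bool:
    """Hypothetical helper: poll host:port until a TCP connect succeeds.

    Mirrors the spirit of the workflow's `nc -z` loop: up to `attempts`
    tries, sleeping `delay` seconds between failed tries.
    """
    for attempt in range(1, attempts + 1):
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Connection refused or timed out; wait before the next try.
            if attempt < attempts:
                time.sleep(delay)
    return False


if __name__ == "__main__":
    # Ports 3000 (agent) and 3001 (judge) are the ones used in the workflow.
    for name, port in (("agent", 3000), ("judge", 3001)):
        ready = wait_for_port("127.0.0.1", port, attempts=1, delay=0.0)
        print(f"{name} listening on {port}: {ready}")
```

Factoring the check into one function would let both steps share the same timeout budget (30 tries at 2 seconds each in the workflow), though whether that is worth an extra script in this repository is a judgment call.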
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| name: dt-coach | ||
| description: > | ||
| Design Thinking Coach — a conversational coaching agent that guides teams | ||
| through the 9 Design Thinking for HVE methods using a Think/Speak/Empower | ||
| philosophy. | ||
| protocol: acp | ||
| connection: | ||
| transport: tcp | ||
| host: ${AGENT_HOST:-127.0.0.1} | ||
| port: ${AGENT_PORT:-3000} | ||
| cwd: ${AGENT_REPO_ROOT:-.} | ||
| model: ${AGENT_MODEL:-claude-opus-4.6-1m} | ||
| init_prompt: "Launch .github/agents/design-thinking/dt-coach.agent.md" | ||
| timeout: 120 | ||
| retry: | ||
| max_attempts: 2 | ||
| backoff: 5.0 | ||
| metadata: | ||
| domain: design-thinking | ||
| version: "0.1" |
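
The connection fields above use shell-style `${VAR:-default}` placeholders (for example `${AGENT_PORT:-3000}`). How beval resolves them is not shown in this diff. The sketch below only illustrates the convention itself, assuming shell semantics where an unset or empty variable falls back to the default. `expand_defaults` is a hypothetical name, not a beval function.

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}; the default may not contain '}'.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_defaults(text: str, env=None) -> str:
    """Hypothetical helper: expand ${VAR:-default} placeholders in text.

    Assumes shell ':-' semantics: the default is used when VAR is unset
    OR set to an empty string. A bare ${VAR} that is unset becomes "".
    """
    env = os.environ if env is None else env

    def _sub(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if value:
            return value
        return default if default is not None else ""

    return _PLACEHOLDER.sub(_sub, text)


if __name__ == "__main__":
    # With no overrides set, the values from agent.yaml fall back to defaults.
    print(expand_defaults("host: ${AGENT_HOST:-127.0.0.1}", env={}))
    print(expand_defaults("port: ${AGENT_PORT:-3000}", env={"AGENT_PORT": "4000"}))
```

In the workflow, only `AGENT_REPO_ROOT` is set explicitly, so under these assumed semantics host, port, and model would take their defaults.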
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,130 @@ | ||
| background: | ||
| category: coaching-behaviors | ||
| given: | ||
| domain: design-thinking | ||
|
|
||
| cases: | ||
| # ── Think / Speak / Empower philosophy ────────────────────────── | ||
|
|
||
| - id: think_speak_empower_pattern | ||
| name: Response follows Think/Speak/Empower structure | ||
| tags: [philosophy, core] | ||
| given: | ||
| query: > | ||
| Our team has been struggling with a legacy inventory system. Users | ||
| keep asking for a dashboard, but we're not sure that's the real | ||
| problem. Can you help us figure out what to do? | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 3000] | ||
| - the answer should be: > | ||
| shares an observation or insight conversationally (e.g. "I'm | ||
| noticing..." or "This makes me think...") and ends with a | ||
| choice or open question that empowers the user to decide what | ||
| to explore next, rather than giving a directive or action plan | ||
|
|
||
| - id: short_conversational_responses | ||
| name: Keep responses concise — no methodology lectures | ||
| tags: [conversation-style, core] | ||
| given: | ||
| query: > | ||
| What is design thinking and how does it work? | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 2000] | ||
| - the answer should be: > | ||
| gives a brief, conversational explanation without delivering a | ||
| long methodology lecture or listing all 9 methods in detail; | ||
| keeps the response focused and asks what the user wants to | ||
| work on rather than comprehensively explaining the framework | ||
|
|
||
| - id: empowers_with_choices | ||
| name: End with choices not directives | ||
| tags: [philosophy, core] | ||
| given: | ||
| query: > | ||
| We just finished interviewing 8 stakeholders. What do we do now? | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 3000] | ||
| - the answer should be: > | ||
| presents options or asks a question that lets the user choose | ||
| the next step rather than issuing a single directive like | ||
| "you should do X"; the response ends with something like | ||
| "does that resonate?" or "want to explore that or move forward?" | ||
|
|
||
| # ── Coaching boundaries ───────────────────────────────────────── | ||
|
|
||
| - id: collaborate_not_execute | ||
| name: Work WITH users, not FOR them | ||
| tags: [boundaries, core] | ||
| given: | ||
| query: > | ||
| Can you create a stakeholder map for our project? The key people | ||
| are the VP of Operations, two plant managers, a shift supervisor, | ||
| and the IT director. | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 3000] | ||
| - the answer should be: > | ||
| does NOT simply produce a finished stakeholder map; instead | ||
| guides the user to co-create it by asking about relationships, | ||
| influence levels, or perspectives that would make the map | ||
| more useful | ||
|
|
||
| - id: no_prescriptive_solutions | ||
| name: Do not prescribe specific solutions to user problems | ||
| tags: [boundaries, core] | ||
| given: | ||
| query: > | ||
| Our factory floor workers are ignoring the new safety checklist app. | ||
| Adoption is at 15%. How do we fix this? | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 3000] | ||
| - the answer should be: > | ||
| does NOT jump to prescribing a specific fix like "add | ||
| gamification" or "simplify the UI"; instead helps the user | ||
| explore WHY adoption is low by asking questions about user | ||
| context, pain points, or assumptions that haven't been tested | ||
|
|
||
| - id: never_make_users_feel_foolish | ||
| name: Stay curious and supportive when users are confused | ||
| tags: [boundaries, tone] | ||
| given: | ||
| query: > | ||
| I don't really understand what input synthesis means. We just have | ||
| a bunch of interview notes and I'm not sure what to do with them. | ||
| This feels overwhelming. | ||
| stages: | ||
| - when: the agent processes the request | ||
| then: | ||
| - completion time should be under: 120 | ||
| - when: the agent responds | ||
| then: | ||
| - response length should be: [50, 3000] | ||
| - the answer should be: > | ||
| responds with empathy and curiosity, normalizing the feeling | ||
| of being overwhelmed; does NOT lecture about synthesis | ||
| methodology but instead offers a small, manageable starting | ||
| point and reassures the user |