Major refactoring by bastoica · Pull Request #107 · sys-intelligence/system-intelligence-benchmark

bastoica · 2026-01-31T07:23:55Z

Description

This PR implements shared base classes and reusable primitives for agent validation oracles. It also refactors the artifact-specific agent validation oracles for Acto, Anvil, EgWalker, and Wasabi to use the same standardized code structure. The goal is to reduce duplicated logic, make the oracle code easier to read and maintain, and ensure agent validation runs consistently across artifacts in ArtEvalBench.

Changes

Implemented four base oracle classes, one for each of the four canonical stages of the artifact evaluation process (environment setup, benchmark preparation, artifact build, and experiment runs), along with shared requirement/check primitives.
Refactored the Acto, Anvil, EgWalker, and Wasabi artifact-specific agent validation oracles to use the standardized primitives and the base orchestrator flow.
Created a README that documents the base classes and primitives, explains how to derive new oracle classes, and shows how to use the helper functions. It serves as a step-by-step guide for implementing the agent validation oracle for any new artifact added to ArtEvalBench.
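
For readers unfamiliar with the pattern, the structure described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `Requirement`, `CheckResult`, `BaseOracle`, `EnvSetupOracle`, and `DemoEnvSetupOracle` are all hypothetical, and the real base classes and primitives are the ones documented in the README.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Requirement:
    """Hypothetical primitive: a named check that returns True on success."""
    name: str
    check: Callable[[], bool]


@dataclass
class CheckResult:
    """Hypothetical primitive: the outcome of evaluating one requirement."""
    name: str
    passed: bool


class BaseOracle(ABC):
    """Hypothetical base class holding the logic shared by every stage."""

    stage = "base"

    @abstractmethod
    def requirements(self) -> List[Requirement]:
        """Return the requirements the artifact must satisfy at this stage."""

    def run(self) -> List[CheckResult]:
        results = []
        for req in self.requirements():
            try:
                passed = bool(req.check())
            except Exception:
                # A check that raises is treated as a failed requirement.
                passed = False
            results.append(CheckResult(req.name, passed))
        return results

    def ok(self) -> bool:
        return all(result.passed for result in self.run())


class EnvSetupOracle(BaseOracle):
    """Hypothetical stage class for environment setup (one of four stages)."""

    stage = "environment setup"


class DemoEnvSetupOracle(EnvSetupOracle):
    """Hypothetical artifact-specific oracle: it only declares requirements."""

    def requirements(self) -> List[Requirement]:
        return [
            Requirement("always satisfied", lambda: True),
            Requirement("raises, so counted as failed", lambda: 1 / 0 > 0),
        ]
```

Under this assumed design, an artifact-specific oracle contains only its own requirement list, while iteration, error handling, and result collection live in the base class.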

Testing

Installed the refactored artifacts locally and ran the agent validation oracles standalone using a main.py runner.
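
A standalone runner of this kind might look roughly like the sketch below. It is an assumption-laden illustration rather than the main.py used here: the function names `run_stages` and `main`, the placeholder checks, and the choice to stop at the first failing stage are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: a stage is a name plus a zero-argument check.
Stage = Tuple[str, Callable[[], bool]]


def run_stages(stages: List[Stage]) -> int:
    """Run the stages in order and return a process-style exit code."""
    for name, check in stages:
        passed = check()
        print(f"[{'PASS' if passed else 'FAIL'}] {name}")
        if not passed:
            # Assumption: later stages depend on earlier ones, so stop here.
            return 1
    return 0


def main() -> int:
    # Placeholder checks; a real runner would call the artifact's oracles.
    stages: List[Stage] = [
        ("environment setup", lambda: True),
        ("benchmark preparation", lambda: True),
        ("artifact build", lambda: True),
        ("experiment runs", lambda: True),
    ]
    return run_stages(stages)
```

In a script, the return value of `main()` would typically be passed to `sys.exit` so that a failed stage yields a non-zero exit status.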

Checklist

[x] Tests pass locally
[x] Code follows project style guidelines
[x] Documentation updated (if needed)

bastoica added 6 commits January 31, 2026 01:21

feat: implemented base classes and interfaces to standardize the agent checker/oracle scripts

f846f70

chore: some cleaning and restructuring

ce1caac

refactor: adapt egwalker's oracles to use the standardized interface

ac5ebe2

refactor: adapt anvil's oracles to use the standardized interface

a65d740

refactor: adapt wasabi's oracles to use the standardized interface

03c8392

refactor: adapt acto's oracles to use the standardized interface

f3b2b1c

bastoica self-assigned this Jan 31, 2026

bastoica requested review from Couen and xuafeng January 31, 2026 07:24

bastoica added the enhancement (New feature or request) and feature (new feature required) labels Jan 31, 2026

bastoica linked an issue Jan 31, 2026 that may be closed by this pull request

Fix the eval scripts issue of existing artifacts #103

Closed

bastoica marked this pull request as ready for review February 3, 2026 21:52

xuafeng merged commit 7ef9288 into sys-intelligence:main Feb 4, 2026
4 checks passed

tareknaser pushed a commit that referenced this pull request Feb 5, 2026

Merge pull request #107 from bastoica/main

b716f57

Major refactoring
