Port Artifact of EMT (in OSDI'25) to the ArtEvalBench of System Intelligence Benchmark #108
EscapistArcadia wants to merge 11 commits into sys-intelligence:main
Conversation
@EscapistArcadia Hi Shanbo, thanks a lot for your contribution. Let us review this and provide feedback soon.
Pull request overview
This PR adds the EMT artifact evaluation (from OSDI 2025) to the ArtEvalBench of the System Intelligence Benchmark. The implementation provides Oracle scripts to automate the evaluation of EMT's experimental results by comparing reproduced metrics against reference values.
Changes:
- Added EMT artifact entry to the ArtEvalBench schema file
- Implemented Oracle scripts for environment setup, experiment execution, and result validation (a sketch of the validation step follows this list)
- Created utilities for logging and path configuration
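As a rough illustration of the result-validation step, here is a minimal sketch (the `compare_to_reference` name, a CSV layout with `metric`/`value` columns, and the 5% relative tolerance are assumptions for illustration, not the PR's actual logic):

```python
import csv
from pathlib import Path

def compare_to_reference(result_csv: Path, reference_csv: Path,
                         tolerance: float = 0.05) -> tuple[bool, str]:
    """Return (ok, reason); ok is True when every reproduced metric is
    within `tolerance` (relative) of its reference value."""
    def load(path: Path) -> dict[str, float]:
        with path.open(newline="") as f:
            return {row["metric"]: float(row["value"]) for row in csv.DictReader(f)}

    results, reference = load(result_csv), load(reference_csv)
    for metric, ref in reference.items():
        got = results.get(metric)
        if got is None:
            return False, f"metric '{metric}' missing from reproduced results"
        if abs(got - ref) > tolerance * abs(ref):
            return False, f"metric '{metric}': got {got}, expected ~{ref}"
    return True, "all metrics within tolerance"
```

A relative tolerance keeps the check meaningful across metrics of very different magnitudes.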
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl | Adds EMT artifact entry to the benchmark registry |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/utils.py | Defines constants, paths, and logging configuration for EMT evaluation |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_experiment_runs.py | Implements experiment execution and result validation logic |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_env_setup.py | Handles dependency checking and environment setup |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_benchmark_prep.py | Placeholder for benchmark preparation |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_artifact_build.py | Placeholder for artifact build |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/main.py | Entry point for orchestrating the evaluation process |
```python
HOME = Path.cwd().parent
REPO_DIRS = {"emt": HOME / "emt"}

FIG18_REFERENCE_PATH = HOME / "_agent_eval" / "refs" / "emt-figure16.ref.csv"  # TODO: Fill the paper data to the reference
```
The variable name FIG18_REFERENCE_PATH refers to Figure 18 but the path includes 'emt-figure16.ref.csv'. This inconsistency should be resolved by either renaming the variable to FIG16_REFERENCE_PATH or updating the filename to 'emt-figure18.ref.csv'.
```suggestion
FIG18_REFERENCE_PATH = HOME / "_agent_eval" / "refs" / "emt-figure18.ref.csv"  # TODO: Fill the paper data to the reference
```
```python
if not ok:
    return False, why
```
This check on line 158 will always evaluate the result from the already-completed 'never' check on line 153, not the commented-out 'always' check on line 157. Either remove this dead code or uncomment line 157 and assign its result to 'ok, why' before this check.
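The intended structure might look like this (a sketch only; the `check` method name and the "never"/"always" modes are inferred from the review comment, not the actual code):

```python
def checks_pass(self) -> tuple[bool, str]:
    # First mode: "never". Assign and test immediately.
    ok, why = self.check("never")
    if not ok:
        return False, why

    # Second mode: "always". Re-assign before re-testing, so the check
    # below no longer sees the stale result of the "never" check.
    ok, why = self.check("always")
    if not ok:
        return False, why

    return True, "all checks passed"
```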
```python
if not Path(self.go_root).exists():
    problems.append("$HOME/go directory missing (install golang and configure GOPATH)")
if not Path(self.go_bin).exists():
    problems.append("$HOME/go/bin directory missing (ensure Go tools are installed)")

# Check PATH contains Go path
path_env = os.environ.get("PATH", "")
go_root_str = str(self.go_root)
go_bin_str = str(self.go_bin)
if go_root_str not in path_env or go_bin_str not in path_env:
    problems.append("PATH missing $HOME/go or $HOME/go/bin "
                    "(export PATH=$HOME/go:$HOME/go/bin:$PATH)")
```
The Go-related checks appear unnecessary for the EMT artifact evaluation. The code comment on line 80 questions whether this is needed, and no Go dependencies are listed in DEPENDENCIES. These checks should either be removed or the TODO comment on line 80 should be resolved to clarify if Go is required.
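If Go does turn out to be required, one way to resolve the TODO is to gate these checks on an explicit dependency declaration (a sketch; the `Dependency.name` attribute and a `"go"` entry in `DEPENDENCIES` are assumptions):

```python
# Run the Go environment checks only when Go is actually declared as a
# dependency, so paths_check() reflects what the artifact really needs.
if any(dep.name == "go" for dep in DEPENDENCIES):
    if not Path(self.go_root).exists():
        problems.append("$HOME/go directory missing (install golang and configure GOPATH)")
    if not Path(self.go_bin).exists():
        problems.append("$HOME/go/bin directory missing (ensure Go tools are installed)")
```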
```python
def run(self) -> bool:
    # DEBUG PERPOSES ONLY, WILL REMOVE LATER
```
Corrected spelling of 'PERPOSES' to 'PURPOSES'.
```suggestion
    # DEBUG PURPOSES ONLY, WILL REMOVE LATER
```
```python
def paths_check(self):
    """
    Check that Python virtual environment is succesfully created
```
Corrected spelling of 'succesfully' to 'successfully'.
```suggestion
    Check that Python virtual environment is successfully created
```
```python
if not Path(self.venv_dir).exists():
    problems.append(".venv virtual environment missing (run 'python3 -m venv .venv')")

# Check Go directories exit
```
Corrected spelling of 'exit' to 'exist'.
```suggestion
# Check Go directories exist
```
```python
def check_dependency(self, dep: Dependency) -> Optional[str]:
    """
    Core method that checks whether a certain dependency of a version
    equal or greather than a reference version is installed.
```
Corrected spelling of 'greather' to 'greater'.
```suggestion
    equal or greater than a reference version is installed.
```
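For context on what such a version gate typically does, here is a generic, self-contained sketch (this is not the PR's implementation; the `--version` convention, the regex, and the function shape are assumptions):

```python
import re
import shutil
import subprocess

def version_at_least(tool: str, minimum: tuple[int, ...]) -> str | None:
    """Return None if `tool` is installed at version >= `minimum`,
    otherwise a human-readable problem string."""
    if shutil.which(tool) is None:
        return f"{tool} not found on PATH"
    out = subprocess.run([tool, "--version"], capture_output=True, text=True).stdout
    m = re.search(r"(\d+(?:\.\d+)+)", out)
    if m is None:
        return f"could not parse a version from {tool!r} output"
    found = tuple(int(p) for p in m.group(1).split("."))
    if found < minimum:  # tuples compare element-wise, matching version order
        return f"{tool} {m.group(1)} older than required {'.'.join(map(str, minimum))}"
    return None
```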
| {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | ||
| {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | ||
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} No newline at end of file | ||
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | ||
| {"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docer_env": ""} No newline at end of file |
The field name 'docer_env' appears to be a typo. It should likely be 'docker_env' to match the pattern from other entries in this file.
| {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| \ No newline at end of file | |
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docer_env": ""} | |
| {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docker_env": ""} |
Description
This PR adds the artifact of EMT, presented at OSDI '25, to the ArtEvalBench of the System Intelligence Benchmark.
Changes
Testing
I tested the full AE pipeline locally by running `python main.py`, together with `git clone`, `git submodule update`, and a local image build. These commands are kept in the Oracle scripts but commented out; a sketch of them follows.
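For reference, the commented-out setup steps could be driven from Python roughly like this (a sketch only; the use of `subprocess` and the submodule flags are assumptions, not the Oracle scripts' actual code):

```python
import subprocess

REPO_URL = "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark"

# Clone the artifact, then initialize submodules inside the checkout.
subprocess.run(["git", "clone", REPO_URL], check=True)
subprocess.run(["git", "submodule", "update", "--init", "--recursive"],
               cwd="emt-system-intelligence-benchmark", check=True)
```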
Checklist