Port Artifact of EMT (in OSDI'25) to the ArtEvalBench of System Intelligence Benchmark #108
EscapistArcadia wants to merge 11 commits into sys-intelligence:main
Conversation
@EscapistArcadia Hi Shanbo, thanks a lot for your contribution. Let us review this and provide feedback soon.
Pull request overview
This PR adds the EMT artifact evaluation (from OSDI 2025) to the ArtEvalBench of the System Intelligence Benchmark. The implementation provides Oracle scripts to automate the evaluation of EMT's experimental results by comparing reproduced metrics against reference values.
Changes:
- Added EMT artifact entry to the ArtEvalBench schema file
- Implemented Oracle scripts for environment setup, experiment execution, and result validation (a sketch of the validation step follows this list)
- Created utilities for logging and path configuration
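As a rough illustration of the result-validation step, here is a minimal sketch (the `compare_to_reference` name, a CSV layout with `metric`/`value` columns, and the 5% relative tolerance are assumptions for illustration, not the PR's actual logic):

```python
import csv
from pathlib import Path

def compare_to_reference(result_csv: Path, reference_csv: Path,
                         tolerance: float = 0.05) -> tuple[bool, str]:
    """Return (ok, reason); ok is True when every reproduced metric is
    within `tolerance` (relative) of its reference value."""
    def load(path: Path) -> dict[str, float]:
        with path.open(newline="") as f:
            return {row["metric"]: float(row["value"]) for row in csv.DictReader(f)}

    results, reference = load(result_csv), load(reference_csv)
    for metric, ref in reference.items():
        got = results.get(metric)
        if got is None:
            return False, f"metric '{metric}' missing from reproduced results"
        if abs(got - ref) > tolerance * abs(ref):
            return False, f"metric '{metric}': got {got}, expected ~{ref}"
    return True, "all metrics within tolerance"
```

A relative tolerance keeps the check meaningful across metrics of very different magnitudes.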
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl | Adds EMT artifact entry to the benchmark registry |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/utils.py | Defines constants, paths, and logging configuration for EMT evaluation |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_experiment_runs.py | Implements experiment execution and result validation logic |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_env_setup.py | Handles dependency checking and environment setup |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_benchmark_prep.py | Placeholder for benchmark preparation |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/oracle_artifact_build.py | Placeholder for artifact build |
| benchmarks/arteval_bench/data/benchmark/osdi25_emt/_agent_eval/main.py | Entry point for orchestrating the evaluation process |
```python
HOME = Path.cwd().parent
REPO_DIRS = {"emt": HOME / "emt"}

FIG18_REFERENCE_PATH = HOME / "_agent_eval" / "refs" / "emt-figure16.ref.csv"  # TODO: Fill the paper data to the reference
```
The variable name FIG18_REFERENCE_PATH refers to Figure 18 but the path includes 'emt-figure16.ref.csv'. This inconsistency should be resolved by either renaming the variable to FIG16_REFERENCE_PATH or updating the filename to 'emt-figure18.ref.csv'.
```suggestion
FIG18_REFERENCE_PATH = HOME / "_agent_eval" / "refs" / "emt-figure18.ref.csv"  # TODO: Fill the paper data to the reference
```
```python
if not ok:
    return False, why
```
This check on line 158 will always evaluate the result from the already-completed 'never' check on line 153, not the commented-out 'always' check on line 157. Either remove this dead code or uncomment line 157 and assign its result to 'ok, why' before this check.
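The intended structure might look like this (a sketch only; the `check` method name and the "never"/"always" modes are inferred from the review comment, not the actual code):

```python
def checks_pass(self) -> tuple[bool, str]:
    # First mode: "never". Assign and test immediately.
    ok, why = self.check("never")
    if not ok:
        return False, why

    # Second mode: "always". Re-assign before re-testing, so the check
    # below no longer sees the stale result of the "never" check.
    ok, why = self.check("always")
    if not ok:
        return False, why

    return True, "all checks passed"
```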
```python
if not Path(self.go_root).exists():
    problems.append("$HOME/go directory missing (install golang and configure GOPATH)")
if not Path(self.go_bin).exists():
    problems.append("$HOME/go/bin directory missing (ensure Go tools are installed)")

# Check PATH contains Go path
path_env = os.environ.get("PATH", "")
go_root_str = str(self.go_root)
go_bin_str = str(self.go_bin)
if go_root_str not in path_env or go_bin_str not in path_env:
    problems.append("PATH missing $HOME/go or $HOME/go/bin "
                    "(export PATH=$HOME/go:$HOME/go/bin:$PATH)")
```
The Go-related checks appear unnecessary for the EMT artifact evaluation. The code comment on line 80 questions whether this is needed, and no Go dependencies are listed in DEPENDENCIES. These checks should either be removed or the TODO comment on line 80 should be resolved to clarify if Go is required.
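If Go does turn out to be required, one way to resolve the TODO is to gate these checks on an explicit dependency declaration (a sketch; the `Dependency.name` attribute and a `"go"` entry in `DEPENDENCIES` are assumptions):

```python
# Run the Go environment checks only when Go is actually declared as a
# dependency, so paths_check() reflects what the artifact really needs.
if any(dep.name == "go" for dep in DEPENDENCIES):
    if not Path(self.go_root).exists():
        problems.append("$HOME/go directory missing (install golang and configure GOPATH)")
    if not Path(self.go_bin).exists():
        problems.append("$HOME/go/bin directory missing (ensure Go tools are installed)")
```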
```python
def run(self) -> bool:
    # DEBUG PERPOSES ONLY, WILL REMOVE LATER
```
Corrected spelling of 'PERPOSES' to 'PURPOSES'.
```suggestion
    # DEBUG PURPOSES ONLY, WILL REMOVE LATER
```
```python
def paths_check(self):
    """
    Check that Python virtual environment is succesfully created
```
Corrected spelling of 'succesfully' to 'successfully'.
```suggestion
    Check that Python virtual environment is successfully created
```
```python
if not Path(self.venv_dir).exists():
    problems.append(".venv virtual environment missing (run 'python3 -m venv .venv')")

# Check Go directories exit
```
Corrected spelling of 'exit' to 'exist'.
```suggestion
# Check Go directories exist
```
```python
def check_dependency(self, dep: Dependency) -> Optional[str]:
    """
    Core method that checks whether a certain dependency of a version
    equal or greather than a reference version is installed.
```
Corrected spelling of 'greather' to 'greater'.
```suggestion
    equal or greater than a reference version is installed.
```
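For context on what such a version gate typically does, here is a generic, self-contained sketch (this is not the PR's implementation; the `--version` convention, the regex, and the function shape are assumptions):

```python
import re
import shutil
import subprocess

def version_at_least(tool: str, minimum: tuple[int, ...]) -> str | None:
    """Return None if `tool` is installed at version >= `minimum`,
    otherwise a human-readable problem string."""
    if shutil.which(tool) is None:
        return f"{tool} not found on PATH"
    out = subprocess.run([tool, "--version"], capture_output=True, text=True).stdout
    m = re.search(r"(\d+(?:\.\d+)+)", out)
    if m is None:
        return f"could not parse a version from {tool!r} output"
    found = tuple(int(p) for p in m.group(1).split("."))
    if found < minimum:  # tuples compare element-wise, matching version order
        return f"{tool} {m.group(1)} older than required {'.'.join(map(str, minimum))}"
    return None
```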
| {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | ||
| {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | ||
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} No newline at end of file | ||
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | ||
| {"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docer_env": ""} No newline at end of file |
The field name 'docer_env' appears to be a typo. It should likely be 'docker_env' to match the pattern from other entries in this file.
| {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| \ No newline at end of file | |
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docer_env": ""} | |
| {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/anvil/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/acto/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"} | |
| {"artifact_id": "osdi25_emt", "artifact_dir": "osdi25_emt", "artifact_readme": "osdi25_emt/emt/README.md", "artifact_url": "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark", "evaluator": "osdi25_emt/emt/_agent_eval/main.py", "expected_score": 4, "docker_env": ""} |
Description
This PR adds the artifact of EMT, presented at OSDI '25, to the ArtEvalBench of the System Intelligence Benchmark.
Changes
Testing
I tested the full AE pipeline locally by running `python main.py`, together with `git clone`, `git submodule update`, and a local image build. These commands are kept in the Oracle scripts but commented out; a sketch of them follows.
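For reference, the commented-out setup steps could be driven from Python roughly like this (a sketch only; the use of `subprocess` and the submodule flags are assumptions, not the Oracle scripts' actual code):

```python
import subprocess

REPO_URL = "https://github.com/xlab-uiuc/emt-system-intelligence-benchmark"

# Clone the artifact, then initialize submodules inside the checkout.
subprocess.run(["git", "clone", REPO_URL], check=True)
subprocess.run(["git", "submodule", "update", "--init", "--recursive"],
               cwd="emt-system-intelligence-benchmark", check=True)
```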
Checklist