swe-bench #280

shreymodi1 · 2025-10-15T21:35:04Z

SWE-bench integration into eval-protocol, run swe-bench locally and use our RemoteRolloutProcessor to interact with the server.py

...ounts__pyroworks__models__swe-1-mtp#accounts__pyroworks__deployments__r5dfiiwp.eval-run.json

…we-1-mtp#accounts__pyroworks__deployments__r5dfiiwp.eval-run.json

xzrderek

same as other pr, let dylan / benny take a look before merging

examples/swebench/server.py

examples/swebench/tracing_model.py

examples/swebench/server.py

examples/swebench/tests/test_swebench.py

…-sdk into swebenchintegration

examples/swebench/README.md

examples/swebench/server.py

examples/swebench/tests/test_swebench.py

dphuang2 · 2025-10-17T18:05:08Z

examples/swebench/tests/test_swebench.py

+        remote_base_url="http://127.0.0.1:3000",
+        model_base_url="https://tracing.fireworks.ai",
+        timeout_seconds=1800,
+        output_data_loader=default_fireworks_output_data_loader,


why do you need to specify this? I thought there is a default one

dphuang2 · 2025-10-17T18:09:08Z

examples/swebench/tracing_model.py

@@ -0,0 +1,159 @@
+"""
+TracingFireworksModel - Routes through tracing using OpenAI SDK.


can you give a more comprehensive explanation for this file. current this is confusing for somebody who doesn't know SWE-bench

Why do we need it?

What is it doing?

dphuang2

LGTM besides some other nits! Do you mind sharing a screenshot of the local UI running in this PR if its easy to collect. I am just curious what the output looks like.

xzrderek · 2025-10-20T18:17:43Z

examples/swebench/tests/test_swebench.py

+    max_dataset_rows=2,
+    rollout_processor=RemoteRolloutProcessor(
+        remote_base_url="http://127.0.0.1:3000",
+        model_base_url="https://tracing.fireworks.ai",


don't think you need this

xzrderek · 2025-10-20T18:20:49Z

examples/swebench/tests/test_swebench.py

+                if message.startswith("EVAL_RESULT:"):
+                    result_json = message.replace("EVAL_RESULT:", "")
+                    row.evaluation_result = EvaluateResult.model_validate_json(result_json)
+                    break


hmmm i don't quite get this logic here. i thought we should be reading from the tracing.fireworks.ai, check out default_fireworks_output_data_loader.

maybe i am misunderstanding?

examples/swebench/server.py

…nchintegration

Shrey Modi added 2 commits October 15, 2025 14:33

swe-bench

71f4165

linterrors

2dad518

mayinghan reviewed Oct 16, 2025

View reviewed changes

...ounts__pyroworks__models__swe-1-mtp#accounts__pyroworks__deployments__r5dfiiwp.eval-run.json Outdated Show resolved Hide resolved

Delete examples/swebench/fireworks_ai__accounts__pyroworks__models__s…

9ffbf9e

…we-1-mtp#accounts__pyroworks__deployments__r5dfiiwp.eval-run.json

xzrderek reviewed Oct 16, 2025

View reviewed changes

Shrey Modi added 4 commits October 16, 2025 14:47

Merge branch 'main' into swebenchintegration

e38f117

addressing dereks comments

0d12311

Merge branch 'swebenchintegration' of github.com:eval-protocol/python…

541cbe6

…-sdk into swebenchintegration

pyproject removal due to dependancy issue

b16bd50

dphuang2 reviewed Oct 17, 2025

View reviewed changes

examples/swebench/README.md Show resolved Hide resolved

dphuang2 reviewed Oct 17, 2025

View reviewed changes

examples/swebench/server.py Outdated Show resolved Hide resolved

dphuang2 reviewed Oct 17, 2025

View reviewed changes

examples/swebench/tests/test_swebench.py Outdated Show resolved Hide resolved

dphuang2 reviewed Oct 17, 2025

View reviewed changes

dphuang2 approved these changes Oct 17, 2025

View reviewed changes

Shrey Modi added 4 commits October 17, 2025 11:58

changepyproject.toml

14c6f46

added sandboxing of runs and remote server support

47ef37b

remote server changes

e447ad6

addressed comments

e08ca9a

xzrderek reviewed Oct 20, 2025

View reviewed changes

dphuang2 reviewed Oct 20, 2025

View reviewed changes

examples/swebench/server.py Show resolved Hide resolved

xzrderek reviewed Oct 20, 2025

View reviewed changes

examples/swebench/server.py Show resolved Hide resolved

Shrey Modi added 2 commits October 20, 2025 15:10

Merge branch 'main' of github.com:eval-protocol/python-sdk into swebe…

0b38ca4

…nchintegration

porting to fireworks tracing

867d947

		@@ -0,0 +1,159 @@
		"""
		TracingFireworksModel - Routes through tracing using OpenAI SDK.

swe-bench #280

Are you sure you want to change the base?

swe-bench #280

Uh oh!

Conversation

shreymodi1 commented Oct 15, 2025

Uh oh!

Uh oh!

xzrderek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dphuang2 Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dphuang2 Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

dphuang2 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xzrderek Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

xzrderek Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

xzrderek Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

dphuang2 Oct 17, 2025 •

edited

Loading

dphuang2 left a comment •

edited

Loading