Bug Description
Running RFOptuna over the RAG/evals path with num_shards > 1 (a sharded, prune-and-relaunch
Optuna search) fails at four distinct points. Each is independently reproducible and has a small,
local fix. With all four applied, the sharded Optuna RAG search runs end-to-end: trials complete, the
median pruner prunes after shard boundaries, and replacement pipelines are suggested, launched,
and complete normally (verified at num_shards = 1, 4, and 8 on a 4×L4 box).
The four issues, in the order they surface during a run:
1. _object_labels crashes on dict/None categorical knob choices (vars() on a value without __dict__).
2. _template_to_leaf_evals omits the pipeline key, so the controller raises KeyError when
launching a config.
3. query_actor.py deletes the inference engine and then trusts a stale hash on reuse, causing AttributeError under sharding.
4. The controller's replacement-pipeline path doesn't inherit the parent's rate limiter /
max-completion-tokens, so relaunched trials fail to initialize their inference engine.
To Reproduce
Steps to reproduce the behavior:
Configure an RFOptuna search over a RAG eval experiment with a categorical knob whose choices
are dicts (e.g. a reranker spec) or include None, set num_shards = 4, and run experiment.run_evals(...).
The run fails at bug 1 during Optuna label construction; fixing each
bug in turn surfaces the next. (Single-shard num_shards=1 partially masks bugs 3 and 4 because
no relaunch/sharded reuse occurs.)
Expected Behavior
With all four fixes applied, a sharded RFOptuna RAG eval search runs end-to-end on a 4×L4 box at num_shards = 1, 4, and 8:
trials complete and log metrics,
the median pruner prunes pipelines after shard boundaries,
replacement pipelines are suggested, launched, and complete (no init failures).
Environment
main@aa7f4c736d0de0582dc086ed69c231fed15a6190 (merged PR Optuna Integration for RapidFire AI #234; past v0.15.2)
Path exercised: run_evals (RAG/context-eng), RFOptuna sampler with median pruner, num_shards ∈ {4, 8} (i.e. >1)
Generator: a closed-model API (Gemini 2.5 Pro via OpenAI-compatible endpoint); RAG eval, no
local model weights
Reproducible regardless of generator; the failures are in orchestration, not the model calls
Additional Context
Bug 1 — _object_labels crashes on dict/None categorical choices
Symptom
TypeError: vars() argument must have __dict__ attribute
raised from the Optuna label-construction helper when a categorical knob's choices are dicts or
include None.
Root cause
_object_labels (in the automl/Optuna integration) calls vars(choice) to build a human-readable
label for each categorical option. That assumes every choice is an object with a __dict__. RAG
knobs routinely pass plain dict choices (e.g. a reranker config) or None (knob disabled), and vars() raises on both.
Fix
Handle dict and None (and any non-__dict__ value) explicitly instead of assuming an object.
We applied this as a wrapper, _safe_object_labels, used in place of the raw _object_labels: it
returns a dict choice as-is, substitutes a sentinel for None, and falls back to str(choice) for
anything else without __dict__. In-tree, the equivalent is to guard the vars() call.
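The guarded call described above can be sketched as follows. This is a minimal illustration, not the in-tree code: the function name safe_choice_label, the NONE_LABEL sentinel value, and the per-choice signature are assumptions made for the example, since the real signature of _object_labels is not shown in this report.

```python
from typing import Any

# Hypothetical sentinel; the sentinel used by the real wrapper is not shown in this report.
NONE_LABEL = "<none>"


def safe_choice_label(choice: Any) -> Any:
    """Build a label for one categorical choice without assuming it has __dict__."""
    if choice is None:
        return NONE_LABEL          # knob disabled
    if isinstance(choice, dict):
        return choice              # plain dict choices are returned as-is
    if hasattr(choice, "__dict__"):
        return vars(choice)        # original behaviour for ordinary objects
    return str(choice)             # ints, strings, tuples, ...


class Reranker:
    """Toy object-style choice, used only to exercise the vars() branch."""

    def __init__(self, top_k: int) -> None:
        self.top_k = top_k


choices = [{"model": "example-reranker", "top_k": 5}, None, Reranker(3), 7]
labels = [safe_choice_label(c) for c in choices]
```

The order of the checks matters: None and dict are handled before the hasattr test, so the vars() call is only reached for values that are known to carry a __dict__.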
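Bug 2 — _template_to_leaf_evals omits the pipeline key → controller KeyError
Symptom
A KeyError on the missing pipeline key, raised in the controller when it tries to launch a leaf
eval config produced by the Optuna template expansion.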
Root cause
_template_to_leaf_evals builds each leaf config dict with an api_config key but no pipeline
key. Downstream, the controller indexes the config by ["pipeline"] when scheduling/launching, so
the missing key is fatal. (The grid/random expansion paths populate pipeline; the Optuna
template path does not.)
Fix
Ensure the leaf config carries a pipeline entry. The minimal change is to set pipeline from the api_config (rename/duplicate) when constructing the leaf dict in _template_to_leaf_evals, so the
config shape matches what the controller expects from the other samplers.
# in _template_to_leaf_evals, when assembling each leaf config:
leaf["pipeline"] = leaf.pop("api_config")  # or: leaf["pipeline"] = leaf["api_config"]
(Match whichever key the grid/random paths use so all samplers produce identically-shaped configs.)
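As a standalone illustration of the same idea, the sketch below normalizes a leaf config so that it always carries a pipeline entry. The helper name ensure_pipeline_key and the sample config contents are hypothetical; the example uses the "duplicate" variant and assumes the value stored under api_config is what the controller expects under pipeline, as the fix above states.

```python
def ensure_pipeline_key(leaf: dict) -> dict:
    """Return a copy of a leaf config that is guaranteed to have a "pipeline" entry."""
    normalized = dict(leaf)  # shallow copy, so the caller's dict is left untouched
    if "pipeline" not in normalized:
        if "api_config" not in normalized:
            raise KeyError("leaf config has neither 'pipeline' nor 'api_config'")
        # Duplicate rather than rename, so code that still reads api_config keeps working.
        normalized["pipeline"] = normalized["api_config"]
    return normalized


# Hypothetical leaf config, shaped like the Optuna template path's output (no "pipeline" key).
leaf = {"api_config": {"model": "example-model"}, "top_k": 5}
fixed = ensure_pipeline_key(leaf)
```

Failing loudly when neither key is present keeps a malformed config from reaching the controller, where the error would be harder to trace back to the expansion step.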
Bug 3 — query_actor.py deletes the inference engine, then reuses a stale hash → AttributeError under sharding
Symptom
AttributeError: 'QueryActor' object has no attribute 'inference_engine'
on the second and later shards of a config (i.e. only manifests with num_shards > 1).
Root cause
Two interacting lines in rapidfireai/evals/actors/query_actor.py:
On teardown of a shard, the actor does del self.inference_engine (≈ line 149). After the
attribute is deleted, any later access raises AttributeError rather than being recognized as
"no engine present".
The reuse branch (≈ line 139) decides whether to rebuild the engine purely from a config
hash: if the new shard's config hash matches the previous one, it assumes the engine is still
there and skips rebuilding. But the engine was deleted, so the hash matches while the attribute
is gone, and the next use of self.inference_engine raises AttributeError.
Fix
Two small changes:
Set the attribute to None instead of deleting it, so its absence is representable:
# query_actor.py ~line 149
self.inference_engine = None  # was: del self.inference_engine
Make the reuse branch also check that the engine actually exists, not just that the hash matches.
With both, a config spanning multiple shards correctly reuses its engine when present and rebuilds
when it was torn down.
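A minimal, self-contained sketch of both changes, using a toy actor rather than the real QueryActor. Only the inference_engine attribute comes from this report; ensure_engine, teardown, _engine_config_hash, and the hashing scheme are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Optional


class QueryActorSketch:
    """Toy stand-in for the actor; it models only the engine reuse logic."""

    def __init__(self, build_engine: Callable[[dict], Any]) -> None:
        self._build_engine = build_engine
        self.inference_engine: Optional[Any] = None
        self._engine_config_hash: Optional[int] = None

    def ensure_engine(self, config: dict) -> Any:
        """Reuse the engine only if it exists AND the config hash still matches."""
        config_hash = hash(frozenset(config.items()))  # assumes hashable config values
        reusable = (
            self.inference_engine is not None
            and config_hash == self._engine_config_hash
        )
        if not reusable:
            self.inference_engine = self._build_engine(config)
            self._engine_config_hash = config_hash
        return self.inference_engine

    def teardown(self) -> None:
        """Release the engine but keep the attribute, so absence is representable."""
        self.inference_engine = None  # instead of: del self.inference_engine


builds = []


def make_engine(config: dict) -> object:
    builds.append(config)  # record every rebuild
    return object()


actor = QueryActorSketch(make_engine)
cfg = {"model": "example-model", "temperature": 0.0}
first = actor.ensure_engine(cfg)    # builds the engine
same = actor.ensure_engine(cfg)     # same hash, engine present: reused
actor.teardown()                    # shard boundary
rebuilt = actor.ensure_engine(cfg)  # same hash, engine gone: rebuilt
```

Keeping the stored hash across teardown is harmless here because the existence check runs first; the hash alone can no longer trigger reuse.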
Bug 4 — replacement (relaunched) pipelines don't inherit the parent's rate limiter / max-completion-tokens → init failure
Symptom
After the Optuna pruner prunes a pipeline and the controller suggests a replacement, the
relaunched pipeline fails to initialize its inference engine:
APIInferenceEngine missing rate_limiter_actor (replacement pipelines fail-init under sharding)
All replacement trials fail to start; only the initial wave of trials runs.
Root cause
In the controller's replacement-pipeline loop (≈ line 1673, after scheduler.add_pipeline(new_pid, ...)), the new pipeline id is registered with the scheduler but
is not added to the maps that hold per-pipeline API resources — specifically the rate limiter
actor and the max-completion-tokens setting. The initial pipelines get these at experiment setup;
replacements created mid-run inherit neither, so when the worker builds the APIInferenceEngine
for a replacement it has no rate_limiter_actor and init fails.
Fix
When creating a replacement pipeline, copy the parent's API-resource entries to the new pipeline
id. We added, immediately after scheduler.add_pipeline(new_pid, ...):
# controller.py ~line 1673, in the replacement-pipeline loop
pipeline_to_rate_limiter[new_pid] = pipeline_to_rate_limiter[pipeline_id]
pipeline_to_max_completion_tokens[new_pid] = pipeline_to_max_completion_tokens[pipeline_id]
(Use whatever the parent's id variable is named in that scope — pipeline_id / parent_pid.)
With this, replacement pipelines initialize their inference engine and complete their shards
normally. Verified: with the fix, runs at num_shards = 4 and 8 report init failures: 0, and
replacement pipelines (ids 13+) complete all shards.
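If more per-pipeline maps are added later, the two assignments in the fix could be factored into one helper so a replacement cannot silently miss a resource. The sketch below is an illustration under assumptions: inherit_pipeline_resources is a hypothetical name, and the maps are modeled as plain dicts with placeholder values rather than real actor handles.

```python
from typing import Any, Hashable, MutableMapping


def inherit_pipeline_resources(
    parent_id: Hashable,
    child_id: Hashable,
    *resource_maps: MutableMapping[Hashable, Any],
) -> None:
    """Copy the parent's entry to the child in every per-pipeline resource map."""
    for resource_map in resource_maps:
        if parent_id not in resource_map:
            # Fail at registration time rather than later, at engine init on the worker.
            raise KeyError(f"parent pipeline {parent_id!r} has no entry to inherit")
        resource_map[child_id] = resource_map[parent_id]


# Placeholder values standing in for the real rate limiter handle and token limit.
pipeline_to_rate_limiter = {3: "rate-limiter-handle"}
pipeline_to_max_completion_tokens = {3: 1024}

inherit_pipeline_resources(
    3, 13, pipeline_to_rate_limiter, pipeline_to_max_completion_tokens
)
```

The child shares the parent's entry by reference, which matches the two-line fix above: the replacement uses the same rate limiter as the pipeline it replaces rather than a fresh one.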