[BUG] RFOptuna eval path (PR #234): 4 bugs prevent end-to-end sharded Optuna RAG runs (categorical labels, missing pipeline key, query-actor engine reuse, replacement-pipeline rate limiter) #268

@AdamRolander

Bug Description

Running RFOptuna over the RAG/evals path with num_shards > 1 (sharded, prune-and-relaunch
Optuna search) fails at four distinct points. Each is independently reproducible and has a small,
local fix. With all four applied, sharded Optuna RAG search runs end-to-end: trials complete, the
median pruner prunes after shard boundaries, and replacement pipelines are suggested, launched,
and complete normally (verified at num_shards = 1, 4, and 8 on a 4×L4 box).

The four issues, in the order they surface during a run:

  1. _object_labels crashes on dict/None categorical knob choices (vars() on a value without __dict__).
  2. _template_to_leaf_evals omits the pipeline key, so the controller KeyErrors when
    launching a config.
  3. query_actor.py deletes the inference engine and then trusts a stale hash on reuse, causing
    AttributeError under sharding.
  4. The controller's replacement-pipeline path doesn't inherit the parent's rate limiter /
    max-completion-tokens, so relaunched trials fail to initialize their inference engine.

To Reproduce

Steps to reproduce the behavior:
Configure an RFOptuna search over a RAG eval experiment with a categorical knob whose choices
are dicts (e.g. a reranker spec) or include None, set num_shards = 4, and run
experiment.run_evals(...).

The run fails at bug 1 during Optuna label construction; fixing each
bug in turn surfaces the next. (Single-shard num_shards=1 partially masks bugs 3 and 4 because
no relaunch/sharded reuse occurs.)

Expected Behavior

With all four fixes applied, a sharded RFOptuna RAG eval search runs end-to-end on a 4×L4 box at
num_shards = 1, 4, and 8:

  • trials complete and log metrics,
  • the median pruner prunes pipelines after shard boundaries,
  • replacement pipelines are suggested, launched, and complete (no init failures),
  • a 24-config search (12 initial + 12 relaunched) finishes at each shard setting.

Environment

  • RapidFire AI: main @ aa7f4c736d0de0582dc086ed69c231fed15a6190 (includes merged PR #234, "Optuna Integration for RapidFire AI"; past v0.15.2)
  • Python 3.12, Ubuntu 24.04
  • 4× NVIDIA L4 (24 GB), CUDA 12.x
  • Path exercised: run_evals (RAG/context-eng), RFOptuna sampler with median pruner,
    num_shards ∈ {4, 8} (i.e. >1)
  • Generator: a closed-model API (Gemini 2.5 Pro via OpenAI-compatible endpoint); RAG eval, no
    local model weights
  • Reproducible regardless of generator; the failures are in orchestration, not the model calls

Additional Context

Bug 1 — _object_labels crashes on dict/None categorical choices

Symptom

TypeError: vars() argument must have __dict__ attribute

raised from the Optuna label-construction helper when a categorical knob's choices are dicts or
include None.

Root cause

_object_labels (in the automl/Optuna integration) calls vars(choice) to build a human-readable
label for each categorical option. That assumes every choice is an object with a __dict__. RAG
knobs routinely pass plain dict choices (e.g. a reranker config) or None (knob disabled), and
vars() raises on both.

Fix

Handle dict and None (and any non-__dict__ value) explicitly instead of assuming an object.
We applied this as a wrapper, _safe_object_labels, that returns the dict as-is / a sentinel for
None / falls back to str(choice) for anything without __dict__, and is used in place of the
raw _object_labels. In-tree, the equivalent is to guard the vars() call:

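def _object_labels(obj):
    # None marks a disabled knob; vars(None) would raise TypeError.
    if obj is None:
        return "None"
    # Plain dict choices (e.g. a reranker config) are returned as-is.
    if isinstance(obj, dict):
        return obj
    # Ordinary objects keep the original behavior.
    if hasattr(obj, "__dict__"):
        return vars(obj)
    # Anything else without __dict__ falls back to its string form.
    return str(obj)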
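
As a quick sanity check, the sketch below compares the raw vars() call with the guarded helper on the kinds of choices described above. RerankerSpec and raw_vars_raises are hypothetical names made up for this illustration; they are not RapidFire AI code, and the helper body is repeated only so the snippet runs standalone.

```python
def _object_labels(obj):
    # Guarded helper from this issue, repeated so the snippet runs standalone.
    if obj is None:
        return "None"
    if isinstance(obj, dict):
        return obj
    if hasattr(obj, "__dict__"):
        return vars(obj)
    return str(obj)


class RerankerSpec:
    """Hypothetical object-valued knob choice (not a RapidFire AI class)."""

    def __init__(self, top_k):
        self.top_k = top_k


def raw_vars_raises(choice):
    """Return True if the unguarded vars(choice) call raises TypeError."""
    try:
        vars(choice)
    except TypeError:
        return True
    return False


choices = [RerankerSpec(top_k=5), {"top_k": 5}, None, 0.7]
raw_failures = [raw_vars_raises(c) for c in choices]
labels = [_object_labels(c) for c in choices]

for choice, failed, label in zip(choices, raw_failures, labels):
    print(f"{type(choice).__name__}: raw vars() raises={failed}, guarded label={label!r}")
```

Only the object-valued choice survives the raw call; the guarded helper returns a label for all four.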

Bug 2 — _template_to_leaf_evals omits the pipeline key → controller KeyError

Symptom

KeyError: 'pipeline'

in the controller when it tries to launch a leaf eval config produced by the Optuna template
expansion.

Root cause

_template_to_leaf_evals builds each leaf config dict with an api_config key but no pipeline
key. Downstream, the controller indexes the config by ["pipeline"] when scheduling/launching, so
the missing key is fatal. (The grid/random expansion paths populate pipeline; the Optuna
template path does not.)

Fix

Ensure the leaf config carries a pipeline entry. The minimal change is to set pipeline from the
api_config (rename/duplicate) when constructing the leaf dict in _template_to_leaf_evals, so the
config shape matches what the controller expects from the other samplers.

# in _template_to_leaf_evals, when assembling each leaf config:
leaf["pipeline"] = leaf.pop("api_config")   # or: leaf["pipeline"] = leaf["api_config"]

(Match whichever key the grid/random paths use so all samplers produce identically-shaped configs.)
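
A minimal sketch of the shape mismatch, assuming the controller does nothing more exotic than index the config by "pipeline". launch and with_pipeline_key are hypothetical stand-ins written for this illustration, and the leaf contents are made-up example values.

```python
def launch(config):
    # Stand-in for the controller's lookup; the real controller does more.
    return config["pipeline"]


def with_pipeline_key(leaf):
    """Return a copy of the leaf config that also carries 'pipeline'.

    This is the 'duplicate' variant: api_config is kept alongside pipeline.
    """
    fixed = dict(leaf)
    fixed.setdefault("pipeline", fixed["api_config"])
    return fixed


# Leaf config shaped like the Optuna template output described above.
leaf = {"api_config": {"model": "example-model"}, "batch_size": 8}

try:
    launch(leaf)
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

launched = launch(with_pipeline_key(leaf))
print("missing key before fix:", missing_key)
print("launched after fix:", launched)
```

The copy-based variant leaves the original leaf untouched, which is convenient if anything else still reads api_config.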


Bug 3 — query_actor.py deletes the inference engine, then reuses a stale hash → AttributeError under sharding

Symptom

AttributeError: 'QueryActor' object has no attribute 'inference_engine'

on the second and later shards of a config (i.e. only manifests with num_shards > 1).

Root cause

Two interacting lines in rapidfireai/evals/actors/query_actor.py:

  • On teardown of a shard, the actor does del self.inference_engine (≈ line 149). After the
    attribute is deleted, any later access raises AttributeError rather than being recognized as
    "no engine present".
  • The reuse branch (≈ line 139) decides whether to rebuild the engine purely from a config
    hash: if the new shard's config hash matches the previous one, it assumes the engine is still
    there and skips rebuilding. But the engine was del'd, so the hash matches while the attribute
    is gone, and the next access to self.inference_engine raises AttributeError.

Fix

Two small changes:

  1. Set the attribute to None instead of deleting it, so its absence is representable:
    # query_actor.py ~line 149
    self.inference_engine = None     # was: del self.inference_engine
  2. Make the reuse branch also check the engine actually exists, not just that the hash matches:
    # query_actor.py ~line 139
    if config_hash == self._last_config_hash and self.inference_engine is not None:
        ...  # reuse the existing engine
    else:
        ...  # rebuild the engine

With both, a config spanning multiple shards correctly reuses its engine when present and rebuilds
when it was torn down.
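
The interaction can be reproduced with a toy actor. This is a sketch under assumptions, not the real QueryActor: ToyQueryActor, run_shard, teardown, _build_engine, and the fixed flag are all hypothetical names, and only the two behaviors described above (hash-only reuse plus del on teardown) are modeled.

```python
class ToyQueryActor:
    """Toy model of the reuse/teardown interaction; not the real QueryActor."""

    def __init__(self, fixed):
        self.fixed = fixed  # True applies both changes from the fix
        self.inference_engine = None
        self._last_config_hash = None
        self.builds = 0

    def _build_engine(self, config_hash):
        self.builds += 1
        return f"engine-for-{config_hash}"

    def teardown(self):
        if self.fixed:
            self.inference_engine = None  # absence stays representable
        else:
            del self.inference_engine  # original behavior

    def run_shard(self, config_hash):
        same_config = config_hash == self._last_config_hash
        if self.fixed:
            reuse = same_config and self.inference_engine is not None
        else:
            reuse = same_config  # original behavior: trust the hash alone
        if not reuse:
            self.inference_engine = self._build_engine(config_hash)
            self._last_config_hash = config_hash
        engine = self.inference_engine  # raises AttributeError if it was del'd
        self.teardown()
        return engine


buggy = ToyQueryActor(fixed=False)
buggy.run_shard("h1")
try:
    buggy.run_shard("h1")  # second shard of the same config
    buggy_error = None
except AttributeError as exc:
    buggy_error = str(exc)

patched = ToyQueryActor(fixed=True)
engines = [patched.run_shard("h1"), patched.run_shard("h1")]
print("buggy second shard:", buggy_error)
print("patched engines:", engines, "builds:", patched.builds)
```

In this toy the engine is torn down after every shard, so the patched actor rebuilds each time; the point is only that a matching hash no longer implies a live engine.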


Bug 4 — replacement (relaunched) pipelines don't inherit the parent's rate limiter / max-completion-tokens → init failure

Symptom

After the Optuna pruner prunes a pipeline and the controller suggests a replacement, the
relaunched pipeline fails to initialize its inference engine:

APIInferenceEngine missing rate_limiter_actor   (replacement pipelines fail-init under sharding)

All replacement trials fail to start; only the initial wave of trials runs.

Root cause

In the controller's replacement-pipeline loop (≈ line 1673, after
scheduler.add_pipeline(new_pid, ...)), the new pipeline id is registered with the scheduler but
is not added to the maps that hold per-pipeline API resources — specifically the rate limiter
actor and the max-completion-tokens setting. The initial pipelines get these at experiment setup;
replacements created mid-run inherit neither, so when the worker builds the APIInferenceEngine
for a replacement it has no rate_limiter_actor and init fails.

Fix

When creating a replacement pipeline, copy the parent's API-resource entries to the new pipeline
id. We added, immediately after scheduler.add_pipeline(new_pid, ...):

# controller.py ~line 1673, in the replacement-pipeline loop
pipeline_to_rate_limiter[new_pid] = pipeline_to_rate_limiter[pipeline_id]
pipeline_to_max_completion_tokens[new_pid] = pipeline_to_max_completion_tokens[pipeline_id]

(Use whatever the parent's id variable is named in that scope — pipeline_id / parent_pid.)
With this, replacement pipelines initialize their inference engine and complete their shards
normally. Verified: with the fix, runs at num_shards = 4 and 8 report 0 init failures, and
replacement pipelines (ids 13+) complete all shards.
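
A minimal sketch of the inheritance step, assuming the per-pipeline resources live in plain dicts keyed by pipeline id as the snippet above suggests. register_replacement and can_init_engine are hypothetical helpers written for this illustration, and the ids and values are made up.

```python
def register_replacement(new_pid, parent_pid, rate_limiters, max_tokens):
    """Copy the parent's per-pipeline API resources to the replacement id."""
    rate_limiters[new_pid] = rate_limiters[parent_pid]
    max_tokens[new_pid] = max_tokens[parent_pid]


def can_init_engine(pid, rate_limiters, max_tokens):
    # Stand-in for the worker-side check: both entries must exist for the id.
    return pid in rate_limiters and pid in max_tokens


# Populated at experiment setup for the initial pipelines only.
pipeline_to_rate_limiter = {3: "rate-limiter-handle"}
pipeline_to_max_completion_tokens = {3: 1024}

parent_pid, new_pid = 3, 13
before = can_init_engine(new_pid, pipeline_to_rate_limiter, pipeline_to_max_completion_tokens)
register_replacement(new_pid, parent_pid, pipeline_to_rate_limiter, pipeline_to_max_completion_tokens)
after = can_init_engine(new_pid, pipeline_to_rate_limiter, pipeline_to_max_completion_tokens)

print("replacement can init before fix:", before)
print("replacement can init after fix:", after)
```

The replacement shares the parent's rate limiter handle rather than getting a new one, so both pipelines draw from the same rate budget.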
