Skip to content

Fall back to __piloRefMap when [data-pilo-ref] selector returns 0 in validateElementRef #449

@lmorchard

Description

@lmorchard

Problem

When the agent fails an action with Invalid element reference 'EN'. The element does not exist on the current page., there are two distinct root causes that the current code can't distinguish:

  1. LLM hallucination — the model invented a ref that was never in any snapshot.
  2. DOM mutation between snapshot and click — the snapshot promised a ref, set data-pilo-ref on the DOM, but a subsequent re-render (commonly React reconciliation) stripped the attribute before the click executed.

For (1), failing loudly is correct. For (2), we have the information to recover — ariaSnapshot.ts:55-57 already maintains globalThis.__piloRefMap (a Map<string, Element>) precisely because of "DOM re-renders that strip data-pilo-ref attributes, e.g. React reconciliation" — but the current validateElementRef path doesn't consult it.

This was identified during the investigation of #447 (closed). Per-task event traces from React-heavy sites (Booking, Google Flights, Apple) repeatedly show valid refs failing lookup because the attribute was stripped between snapshot and click; the Element reference held in __piloRefMap is still valid in those cases.

The proposed fallback is independent of the ref scheme — it would equally improve the current E1/E2/... behavior. No PR #447 dependency.

Current behavior

packages/core/src/browser/playwrightBrowser.ts:770-789 (validateElementRef):

private async validateElementRef(ref: string): Promise<Locator> {
  if (!this.page) throw new Error("Browser not started");

  const locator = this.page.locator(`[data-pilo-ref="${ref}"]`);
  const count = await locator.count();

  if (count === 0) {
    throw new InvalidRefException(ref);
  }
  // ...
}

If count === 0, the call fails immediately with InvalidRefException. The Element reference stored in __piloRefMap (which Playwright cannot see directly, since it lives in the page context) is unused.

Equivalent code in packages/extension/src/background/ExtensionBrowser.ts:356-358 does a single document.querySelector and has no fallback either.

Proposed fix

When the [data-pilo-ref="..."] selector returns 0, fall through to evaluating in the page context and checking globalThis.__piloRefMap.get(ref). If the map has a live Element, re-attach the data-pilo-ref attribute and return a locator scoped to that element.

Sketch for playwrightBrowser.ts:

private async validateElementRef(ref: string): Promise<Locator> {
  if (!this.page) throw new Error("Browser not started");

  const locator = this.page.locator(`[data-pilo-ref="${ref}"]`);
  const count = await locator.count();
  if (count === 1) return locator;
  if (count > 1) throw new InvalidRefException(ref, "Multiple elements...");

  // Fallback: __piloRefMap survives attribute-strip re-renders.
  const reattached = await this.page.evaluate((r) => {
    const map = (globalThis as any).__piloRefMap as Map<string, Element> | undefined;
    const el = map?.get(r);
    if (!el || !el.isConnected) return false;
    el.setAttribute("data-pilo-ref", r);
    return true;
  }, ref);

  if (!reattached) throw new InvalidRefException(ref);
  return this.page.locator(`[data-pilo-ref="${ref}"]`);
}

el.isConnected guards against the element having been removed from the DOM entirely (genuine DOM mutation, not just attribute strip — in that case the ref really is dead).

ExtensionBrowser.ts gets an analogous patch but simpler because it's already running in the page context — direct __piloRefMap.get(ref) + setAttribute + re-query.

Why this matters

From PR #447's eval traces:

  • Booking task: agent's click E_6b28 succeeded via locator resolution but timed out at click time (<input id=":R9qmr5:" value="flight" type="...">). That suggests the attribute was still there for the lookup. But other Booking actions failed with "ref does not exist" — likely the same React-reconciliation stripping attributes between snapshot rendering and the action 1-2 seconds later.
  • Google Flights similarly: refs that were in the snapshot the LLM saw came back as "does not exist" at click time.

Some of these failures the trace shows were hallucinations (good — should fail loud). But some were genuine attribute-strip cases where __piloRefMap would have recovered the element.

Acceptance criteria

  • validateElementRef (playwright) and the extension's equivalent fall back to __piloRefMap when the attribute-selector lookup fails.
  • Fallback respects Element.isConnected so detached elements still fail loudly.
  • New unit test simulates the React-strip scenario: snapshot sets attribute → test code strips it → click succeeds via fallback.
  • Existing tests still pass.
  • Manual eval spot-check on a known React-heavy site (Booking or Google Flights) shows fewer stale_element_refs-classified failures from attribute-strip cases.

Effort estimate

~half a day. Two surface-level patches, two unit tests, one spot-check.

Out of scope

  • Distinguishing hallucination vs attribute-strip in error messages or failure classification (separate concern, would let the eval judge bucket these correctly).
  • Re-running ref-scheme experiments (feat(ariaTree): content-derived element refs (E_<hex>) #447) on top of this. That's a separate decision after this lands and runs in eval.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions