Fall back to __piloRefMap when [data-pilo-ref] selector returns 0 in validateElementRef

## Problem

When the agent fails an action with `Invalid element reference 'EN'. The element does not exist on the current page.`, there are two distinct root causes that the current code can't distinguish:

1. **LLM hallucination** — the model invented a ref that was never in any snapshot.
2. **DOM mutation between snapshot and click** — the snapshot promised a ref, set `data-pilo-ref` on the DOM, but a subsequent re-render (commonly React reconciliation) stripped the attribute before the click executed.

For (1), failing loudly is correct. For (2), we have the information to recover — `ariaSnapshot.ts:55-57` already maintains `globalThis.__piloRefMap` (a `Map<string, Element>`) precisely because of "DOM re-renders that strip data-pilo-ref attributes, e.g. React reconciliation" — but the current `validateElementRef` path doesn't consult it.

This was identified during the investigation of #447 (closed). Per-task event traces from React-heavy sites (Booking, Google Flights, Apple) repeatedly show valid refs failing lookup because the attribute was stripped between snapshot and click; the `Element` reference held in `__piloRefMap` is still valid in those cases.

The proposed fallback is independent of the ref scheme — it would equally improve the current `E1/E2/...` behavior. No PR #447 dependency.

## Current behavior

`packages/core/src/browser/playwrightBrowser.ts:770-789` (`validateElementRef`):

```ts
private async validateElementRef(ref: string): Promise<Locator> {
  if (!this.page) throw new Error("Browser not started");

  const locator = this.page.locator(`[data-pilo-ref="${ref}"]`);
  const count = await locator.count();

  if (count === 0) {
    throw new InvalidRefException(ref);
  }
  // ...
}
```

If `count === 0`, the call fails immediately with `InvalidRefException`. The Element reference stored in `__piloRefMap` (which Playwright cannot see directly, since it lives in the page context) is unused.

Equivalent code in `packages/extension/src/background/ExtensionBrowser.ts:356-358` does a single `document.querySelector` and has no fallback either.

## Proposed fix

When the `[data-pilo-ref="..."]` selector returns 0, fall through to evaluating in the page context and checking `globalThis.__piloRefMap.get(ref)`. If the map has a live `Element`, re-attach the `data-pilo-ref` attribute and return a locator scoped to that element.

Sketch for `playwrightBrowser.ts`:

```ts
private async validateElementRef(ref: string): Promise<Locator> {
  if (!this.page) throw new Error("Browser not started");

  const locator = this.page.locator(`[data-pilo-ref="${ref}"]`);
  const count = await locator.count();
  if (count === 1) return locator;
  if (count > 1) throw new InvalidRefException(ref, "Multiple elements...");

  // Fallback: __piloRefMap survives attribute-strip re-renders.
  const reattached = await this.page.evaluate((r) => {
    const map = (globalThis as any).__piloRefMap as Map<string, Element> | undefined;
    const el = map?.get(r);
    if (!el || !el.isConnected) return false;
    el.setAttribute("data-pilo-ref", r);
    return true;
  }, ref);

  if (!reattached) throw new InvalidRefException(ref);
  return this.page.locator(`[data-pilo-ref="${ref}"]`);
}
```

`el.isConnected` guards against the element having been removed from the DOM entirely (genuine DOM mutation, not just attribute strip — in that case the ref really is dead).

`ExtensionBrowser.ts` gets an analogous patch but simpler because it's already running in the page context — direct `__piloRefMap.get(ref)` + `setAttribute` + re-query.

## Why this matters

From PR #447's eval traces:

- Booking task: agent's `click E_6b28` succeeded via locator resolution but timed out at click time (`<input id=":R9qmr5:" value="flight" type="...">`). That suggests the attribute was still there for the lookup. But other Booking actions failed with "ref does not exist" — likely the same React-reconciliation stripping attributes between snapshot rendering and the action 1-2 seconds later.
- Google Flights similarly: refs that *were* in the snapshot the LLM saw came back as "does not exist" at click time.

Some of these failures the trace shows were hallucinations (good — should fail loud). But some were genuine attribute-strip cases where `__piloRefMap` would have recovered the element.

## Acceptance criteria

- `validateElementRef` (playwright) and the extension's equivalent fall back to `__piloRefMap` when the attribute-selector lookup fails.
- Fallback respects `Element.isConnected` so detached elements still fail loudly.
- New unit test simulates the React-strip scenario: snapshot sets attribute → test code strips it → click succeeds via fallback.
- Existing tests still pass.
- Manual eval spot-check on a known React-heavy site (Booking or Google Flights) shows fewer `stale_element_refs`-classified failures from attribute-strip cases.

## Effort estimate

~half a day. Two surface-level patches, two unit tests, one spot-check.

## Out of scope

- Distinguishing hallucination vs attribute-strip in error messages or failure classification (separate concern, would let the eval judge bucket these correctly).
- Re-running ref-scheme experiments (#447) on top of this. That's a separate decision after this lands and runs in eval.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fall back to __piloRefMap when [data-pilo-ref] selector returns 0 in validateElementRef #449

Problem

Current behavior

Proposed fix

Why this matters

Acceptance criteria

Effort estimate

Out of scope

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Fall back to __piloRefMap when [data-pilo-ref] selector returns 0 in validateElementRef #449

Description

Problem

Current behavior

Proposed fix

Why this matters

Acceptance criteria

Effort estimate

Out of scope

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions