Skip to content

Action vocabulary additions (send_keys, screenshot, upload_file, dropdown_options) #436

@lmorchard

Description

@lmorchard

Current state

Pilo's web action vocabulary (packages/core/src/tools/webActionTools.ts) covers the core interactions: click, fill, select, hover, check, uncheck, focus, enter, wait, goto, back, forward, extract, done, abort. Plus webSearch, Tabstack tools, and request_user_data are gated tools.

Common interaction patterns the agent cannot currently perform without workarounds:

  • Keyboard shortcuts: Escape to dismiss modals, Tab/Shift+Tab to navigate focus, Ctrl+A to select all, ArrowDown/ArrowUp for autocomplete navigation, etc. The only existing keyboard tool is enter which presses Enter on a specific ref.
  • On-demand screenshot: screenshots are only captured at snapshot time when vision: true. The agent has no way to request a fresh screenshot mid-step (e.g., "did that click cause a visual change I should look at?").
  • File upload: form file inputs are common. Pilo has no tool to populate them.
  • Dropdown option enumeration: when selecting from a <select>, the agent currently has to guess the value or click-to-open the dropdown, see the options, then click an option. A direct "what options does this select have?" tool is cheaper.

The gap

Each missing tool has real-world cases:

  • send_keys — sites that don't expose a "close modal" button (rely on Escape), forms that require keyboard navigation (combobox + ArrowDown + Enter to select), Ctrl+S to save in web editors. The system prompt currently mentions Escape (prompts.ts:339) and arrow keys but the agent can't use them.
  • screenshot — diagnostic tool for the agent itself ("the click didn't seem to do anything, let me see"), and a debug surface for users watching task replays. Currently the screenshot is bundled with each snapshot (vision mode) and is wasted on snapshots where vision isn't useful.
  • upload_file — any task involving "upload your resume", "attach a screenshot", "import a CSV" requires a file upload tool.
  • dropdown_options — closes a common multi-step pattern (open dropdown, read options, close, then select) into a single zero-cost tool call.

Proposed scope

Four small tool additions. Each can ship independently.

A. send_keys tool

send_keys: tool({
  description:
    "Press a keyboard key or combination on the currently focused element. " +
    "Examples: 'Escape' to dismiss a modal, 'Tab' to move focus, 'Control+a' " +
    "to select all, 'ArrowDown' for combobox navigation.",
  inputSchema: z.object({
    keys: z.string().describe("Key name or combination (e.g., 'Escape', 'Control+a', 'Shift+Tab')"),
    ref: z.string().optional()
      .describe("If provided, focus this element first before pressing keys"),
  }),
  execute: async ({ keys, ref }) => {
    return performActionWithValidation(PageAction.SendKeys, context, ref, keys);
  },
}),

Implementation in playwrightBrowser.ts: page.keyboard.press(keys) (Playwright understands Control+a syntax natively).

B. screenshot tool

screenshot: tool({
  description:
    "Take a screenshot of the current page. The image will be included in the next " +
    "page snapshot. Use sparingly — most decisions can be made from the accessibility " +
    "tree alone.",
  inputSchema: z.object({
    fullPage: z.boolean().default(false)
      .describe("Capture the entire page including content below the fold"),
    ref: z.string().optional()
      .describe("If provided, screenshot only this element"),
  }),
  execute: async ({ fullPage, ref }) => {
    const buf = await context.browser.getScreenshot({ withMarks: true, fullPage, ref });
    // Emit BROWSER_SCREENSHOT_CAPTURED event
    // Attach to next user message? Or include in the action result?
    // Probably: include in the next snapshot, and emit immediately.
  },
}),

The exact threading (where the screenshot data ends up in messages) needs design. Simplest: the action result includes a screenshotAvailable: true flag; the next snapshot user message includes the captured image. Vision mode behavior is unchanged.

C. upload_file tool

upload_file: tool({
  description:
    "Upload a file to a file input element. The file path must be a local filesystem path " +
    "or a URL that the agent has been authorized to fetch.",
  inputSchema: z.object({
    ref: z.string().describe("Element reference of the file input (or its container)"),
    path: z.string().describe("Local file path or pre-authorized URL"),
  }),
  execute: async ({ ref, path }) => {
    return performActionWithValidation(PageAction.UploadFile, context, ref, path);
  },
}),

Implementation via Playwright's locator.setInputFiles(path). If the ref isn't a file input directly, find the nearest descendant file input. Validate the file exists and is non-empty before passing to Playwright.

Security considerations: this tool can read arbitrary local files. Gate behind a WebAgentOptions.allowFileUpload?: { allowedPaths?: string[] } option so deployments can constrain it. Default: disabled.

D. dropdown_options tool

dropdown_options: tool({
  description:
    "List the options available in a dropdown (<select>) or ARIA menu. " +
    "Free and fast — use before select() to verify the option you want is present.",
  inputSchema: z.object({
    ref: z.string().describe("Element reference of the dropdown"),
  }),
  execute: async ({ ref }) => {
    return performActionWithValidation(PageAction.DropdownOptions, context, ref);
  },
}),

Implementation: locate the element, check if it's a <select> (return its <option> text + value) or has role="combobox"/role="menu" (return descendants with role="option" or role="menuitem"). Return up to N options (cap at 200 to prevent runaway).

Implementation notes

  • All four tools follow the existing performActionWithValidation pattern for consistency.
  • Each requires a new PageAction.* enum value and corresponding handler in playwrightBrowser.ts.
  • Each needs a tool example added to buildToolExamples in prompts.ts.
  • upload_file security is non-trivial. Default to disabled; opt-in via config; document the security model.
  • screenshot adds latency and image-content to the conversation. Keep its description discouraging overuse ("Use sparingly").
  • These can ship as four separate PRs or one. The simpler ones (send_keys, dropdown_options) are 2-4 hours each. screenshot and upload_file are 1 day each.

Acceptance criteria

  • All four tools exist in webActionTools.ts (upload_file gated by config), have tool descriptions in prompts, and are exercised in tests.
  • send_keys works for single keys and combinations (Control+a, Shift+Tab).
  • screenshot produces an image attached to the next snapshot user message; emits the right events.
  • upload_file validates the file exists; respects the allow-list when configured; gracefully rejects when not enabled.
  • dropdown_options returns options for both <select> and ARIA menus.
  • The system prompt now references send_keys for autocomplete/escape patterns (where the current text says "use keyboard navigation" without a tool to do it).

Effort estimate

3-4 days for all four, including tests and prompt updates. Each is independent enough to split into separate PRs if useful.

Related issues

Pairs with the zero-LLM exploration tools issue (dropdown_options is conceptually similar to find_elements). The screenshot tool benefits from the scroll-position context work (knowing what's visible helps decide whether to screenshot).

Files likely affected

  • packages/core/src/tools/webActionTools.ts
  • packages/core/src/browser/ariaBrowser.ts (PageAction enum)
  • packages/core/src/browser/playwrightBrowser.ts (handlers)
  • packages/core/src/prompts.ts (tool examples)
  • packages/core/src/config/defaults.ts (upload allow-list)
  • packages/core/test/

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions