The agent loop
Every task follows the same cycle: perceive the page, plan the next step, act on it, and verify the result. Each step is grounded in what is actually on screen.
Perceive
The agent takes a visual snapshot, reads the rendered DOM, and builds a structured page outline.
Plan
Based on the task and page state, the agent decides the next action to take.
Act
The agent executes the action: click, type, navigate, extract, or wait.
Verify
The agent checks that the action had the expected effect before continuing.
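The four steps above can be sketched as a small loop. This is a minimal illustration, not the agent's actual implementation: the names Action, run_task, perceive, plan, act, and verify are hypothetical, and a plain dictionary stands in for a real browser page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str          # "click", "type", "navigate", "extract", or "wait"
    target: str        # hypothetical element identifier
    value: str = ""


def run_task(perceive, plan, act, verify, max_steps: int = 10):
    """Run perceive -> plan -> act -> verify until the planner returns None."""
    history = []
    for _ in range(max_steps):
        before = perceive()
        action = plan(before)
        if action is None:        # planner considers the task finished
            break
        act(action)
        after = perceive()
        ok = verify(before, after, action)
        history.append((action, ok))
        if not ok:                # stop instead of building on a failed step
            break
    return history


# Toy stand-in for a browser page: a search box and a submit button.
page = {"search": "", "submitted": False}


def perceive():
    return dict(page)             # snapshot copy of the current state


def plan(state):
    if state["search"] == "":
        return Action("type", "search", "weather")
    if not state["submitted"]:
        return Action("click", "submit")
    return None


def act(action):
    if action.kind == "type":
        page[action.target] = action.value
    elif action.kind == "click":
        page["submitted"] = True


def verify(before, after, action):
    return before != after        # assumed check: the action changed the page


history = run_task(perceive, plan, act, verify)
```

In this sketch the loop stops as soon as a verification fails, so later steps never build on an action that did not take effect.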
What the agent sees
The agent combines three inputs to build its picture of each page:
Rendered DOM
The full document structure with interactive elements identified.
Visual snapshot
A screenshot of the page as rendered, used for spatial reasoning.
Structured outline
A semantic summary of the page: headings, forms, buttons, links, tables.
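As a rough illustration of the structured outline, the sketch below derives a summary of headings, forms, buttons, links, and tables from an HTML string using Python's standard html.parser module. The OutlineBuilder class and the shape of the outline dictionary are assumptions made for this example; the agent's real outline format is not specified here.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    # Tags whose text is collected, mapped to the outline key they feed.
    TEXT_TAGS = {"h1": "headings", "h2": "headings", "h3": "headings",
                 "button": "buttons", "a": "links"}
    # Tags that are only counted.
    COUNT_TAGS = {"form": "forms", "table": "tables"}

    def __init__(self):
        super().__init__()
        self.outline = {"headings": [], "buttons": [], "links": [],
                        "forms": 0, "tables": 0}
        self._open_tag = None     # text-bearing tag currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TEXT_TAGS and self._open_tag is None:
            self._open_tag = tag
            self._text = []
        elif tag in self.COUNT_TAGS:
            self.outline[self.COUNT_TAGS[tag]] += 1

    def handle_data(self, data):
        if self._open_tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            label = "".join(self._text).strip()
            self.outline[self.TEXT_TAGS[tag]].append(label)
            self._open_tag = None


html = (
    "<h1>Flights</h1>"
    "<form><input name='from'><button>Search</button></form>"
    "<a href='/help'>Help</a>"
    "<table><tr><td>1</td></tr></table>"
)

builder = OutlineBuilder()
builder.feed(html)
outline = builder.outline
```

A summary like this is much smaller than the full DOM, which is one plausible reason to keep it as a separate input alongside the DOM and the screenshot.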
What the agent can do
Navigate & click
Go to URLs, click buttons, links, and interactive elements.
Type & fill
Enter text in forms, search bars, and text fields.
Scroll & read
Scroll through content and extract visible information.
Download
Download files with explicit user approval.
Upload
Attach files to forms and upload interfaces.
Multi-tab
Work across multiple tabs simultaneously.
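The capabilities above could be modeled as a fixed set of action kinds, with downloads gated on user approval. This is a sketch under assumptions: ActionKind, NEEDS_APPROVAL, and is_permitted are hypothetical names, and the only approval rule encoded is the one stated above for downloads.

```python
from enum import Enum


class ActionKind(Enum):
    NAVIGATE = "navigate"
    CLICK = "click"
    TYPE = "type"
    SCROLL = "scroll"
    EXTRACT = "extract"
    DOWNLOAD = "download"
    UPLOAD = "upload"
    SWITCH_TAB = "switch_tab"


# Only downloads are described as needing explicit approval.
NEEDS_APPROVAL = {ActionKind.DOWNLOAD}


def is_permitted(kind: ActionKind, approved_by_user: bool = False) -> bool:
    """Allow an action unless it needs approval that has not been given."""
    if kind in NEEDS_APPROVAL:
        return approved_by_user
    return True
```

For example, is_permitted(ActionKind.CLICK) is allowed outright, while is_permitted(ActionKind.DOWNLOAD) is refused until approved_by_user is set to True.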
Grounded, not hallucinated
Every action the agent takes is grounded in the live page: it selects only from elements that actually exist in the DOM, so there are no hallucinated clicks or phantom interactions.
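One way to picture this grounding is a lookup that refuses any target missing from the current DOM snapshot. The function resolve_target and the element ids below are hypothetical, used only to illustrate the idea.

```python
from typing import Dict


def resolve_target(target_id: str, live_elements: Dict[str, str]) -> str:
    """Return the live element for target_id, or raise if it is not on the page."""
    if target_id not in live_elements:
        raise LookupError(f"element {target_id!r} is not in the current DOM snapshot")
    return live_elements[target_id]


# Hypothetical snapshot: element ids mapped to a short description.
snapshot = {"btn-search": "button 'Search'", "input-from": "text field 'From'"}

resolved = resolve_target("btn-search", snapshot)

try:
    resolve_target("btn-checkout", snapshot)   # not on this page
    rejected = False
except LookupError:
    rejected = True
```

Because the action is resolved against the snapshot before anything is executed, a planned click on an element that does not exist fails early instead of being sent to the page.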
Run transcripts
Every step the agent takes is logged: the page state, the action chosen, the reasoning, and the result. You can replay any task step by step.
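A transcript of this kind could be stored as one record per step and replayed in order. The sketch below is an assumption about how such a log might look, not the product's actual format: Step, Transcript, and the JSON Lines encoding are all hypothetical choices.

```python
import json
from dataclasses import dataclass, asdict
from typing import Iterator, List


@dataclass
class Step:
    index: int
    page_state: str    # e.g. a short description or hash of the page
    action: str
    reasoning: str
    result: str


class Transcript:
    def __init__(self) -> None:
        self.steps: List[Step] = []

    def record(self, page_state: str, action: str, reasoning: str, result: str) -> Step:
        step = Step(len(self.steps), page_state, action, reasoning, result)
        self.steps.append(step)
        return step

    def to_jsonl(self) -> str:
        """Serialize the transcript as one JSON object per line."""
        return "\n".join(json.dumps(asdict(s)) for s in self.steps)

    @classmethod
    def from_jsonl(cls, text: str) -> "Transcript":
        transcript = cls()
        transcript.steps = [Step(**json.loads(line))
                            for line in text.splitlines() if line]
        return transcript

    def replay(self) -> Iterator[str]:
        """Yield a one-line summary of each step, in order."""
        for s in self.steps:
            yield f"{s.index}: {s.action} -> {s.result}"


log = Transcript()
log.record("search page", "type 'weather' into search",
           "the search box is empty", "search box filled")
log.record("search page", "click submit",
           "the query is ready to send", "results page loaded")

restored = Transcript.from_jsonl(log.to_jsonl())
lines = list(restored.replay())
```

Keeping the reasoning next to the action and its result in each record is what makes a step-by-step replay useful for debugging a failed task.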