The agent loop

Every task follows the same cycle, grounded in what is actually on screen: perceive the page, plan the next step, act on it, and verify the result.

1. Perceive

The agent takes a visual snapshot, reads the rendered DOM, and builds a structured page outline.

2. Plan

Based on the task and page state, the agent decides the next action to take.

3. Act

The agent executes the action: click, type, navigate, extract, or wait.

4. Verify

The agent checks that the action had the expected effect before continuing.
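The four steps above can be sketched as a small loop. This is only an illustration under stated assumptions, not the product's implementation: `run_task`, `Step`, and the four callables are hypothetical names, and the "page" is a toy counter standing in for a real browser.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    observation: str   # what the agent perceived before acting
    action: str        # what it decided to do
    verified: bool     # whether the expected effect was observed


def run_task(
    perceive: Callable[[], str],
    plan: Callable[[str], Optional[str]],
    act: Callable[[str], None],
    verify: Callable[[str, str, str], bool],
    max_steps: int = 20,
) -> List[Step]:
    """Run perceive -> plan -> act -> verify until plan returns None."""
    steps: List[Step] = []
    for _ in range(max_steps):
        before = perceive()
        action = plan(before)
        if action is None:          # the planner considers the task done
            break
        act(action)
        after = perceive()          # perceive again to check the effect
        ok = verify(action, before, after)
        steps.append(Step(before, action, ok))
        if not ok:                  # stop rather than build on a failed step
            break
    return steps


# Toy "page" (hypothetical): a counter the agent must raise to 3.
page = {"count": 0}

steps = run_task(
    perceive=lambda: f"count={page['count']}",
    plan=lambda obs: "increment" if obs != "count=3" else None,
    act=lambda action: page.update(count=page["count"] + 1),
    verify=lambda action, before, after: before != after,
)
```

Stopping on the first unverified step is one possible design choice; a real agent might instead retry or re-plan from the new page state.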

What the agent sees

Three inputs give the agent a complete picture of every page:

Rendered DOM

The full document structure with interactive elements identified.

Visual snapshot

A screenshot of the page as rendered, used for spatial reasoning.

Structured outline

A semantic summary of the page: headings, forms, buttons, links, tables.
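To make the structured outline more concrete, here is a minimal sketch that derives one from static HTML using Python's standard `html.parser`. It is an assumption-laden illustration: `Outline`, `OutlineBuilder`, and `build_outline` are hypothetical names, it covers only headings, buttons, and links (not forms or tables), and it parses an HTML string rather than a live rendered DOM.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class Outline:
    headings: List[str] = field(default_factory=list)
    buttons: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)


class OutlineBuilder(HTMLParser):
    """Collects the visible text of headings, buttons, and links."""

    # Maps an HTML tag to the Outline field its text belongs in.
    TRACKED = {"h1": "headings", "h2": "headings", "h3": "headings",
               "button": "buttons", "a": "links"}

    def __init__(self) -> None:
        super().__init__()
        self.outline = Outline()
        self._current: Optional[str] = None  # field being filled, if any
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._current = self.TRACKED[tag]
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is not None and self.TRACKED.get(tag) == self._current:
            # Collapse runs of whitespace so labels are single-line.
            label = " ".join("".join(self._text).split())
            if label:
                getattr(self.outline, self._current).append(label)
            self._current = None


def build_outline(html: str) -> Outline:
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.outline


html = """
<h1>Checkout</h1>
<p>Review your order.</p>
<a href="/cart">Back to cart</a>
<button>Place order</button>
"""
outline = build_outline(html)
```

The sketch does not handle tracked tags nested inside one another (for example a link inside a heading); a fuller version would keep a stack of open elements.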

What the agent can do

Navigate & click

Go to URLs, click buttons, links, and interactive elements.

Type & fill

Enter text in forms, search bars, and text fields.

Scroll & read

Scroll through content and extract visible information.

Download

Download files with explicit user approval.

Upload

Attach files to forms and upload interfaces.

Multi-tab

Work across multiple tabs simultaneously.
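One way to picture the capabilities above is as a small set of action kinds, with downloads gated behind an approval callback. The sketch below is hypothetical throughout: `Action`, `NEEDS_APPROVAL`, `execute`, and the `ask_user` callback are illustrative names, and the only rule taken from the text is that downloads require explicit user approval.

```python
from dataclasses import dataclass
from typing import Callable, List

# Action kinds that must not run without explicit user approval.
NEEDS_APPROVAL = {"download"}


@dataclass(frozen=True)
class Action:
    kind: str     # e.g. "navigate", "click", "type", "scroll", "download", "upload"
    target: str   # element label, URL, or file name


def execute(
    action: Action,
    perform: Callable[[Action], None],
    ask_user: Callable[[Action], bool],
) -> str:
    """Perform the action, asking the user first when approval is required."""
    if action.kind in NEEDS_APPROVAL and not ask_user(action):
        return "skipped"
    perform(action)
    return "done"


performed: List[Action] = []

# A click runs without asking, even though this user would say no.
click_result = execute(Action("click", "Export"), performed.append, lambda a: False)

# A download is skipped when the user declines and runs when they approve.
declined = execute(Action("download", "report.csv"), performed.append, lambda a: False)
approved = execute(Action("download", "report.csv"), performed.append, lambda a: True)
```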

Grounded, not hallucinated

Every action the agent takes is grounded in the live page: it selects only from elements that actually exist in the DOM, so there are no hallucinated clicks or phantom interactions.
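A minimal sketch of what such a grounding check could look like: a proposed target is accepted only if it matches an element collected from the current page. The names `Element`, `UngroundedActionError`, and `ground`, and the idea of addressing elements by an ID string, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Element:
    element_id: str   # hypothetical per-page identifier
    role: str         # e.g. "button", "link", "textbox"
    label: str


class UngroundedActionError(Exception):
    """Raised when an action targets an element that is not on the page."""


def ground(target_id: str, elements: Iterable[Element]) -> Element:
    """Return the element with target_id, or refuse if it does not exist."""
    by_id = {e.element_id: e for e in elements}
    try:
        return by_id[target_id]
    except KeyError:
        raise UngroundedActionError(
            f"no element {target_id!r} on the current page"
        ) from None


# Elements as they might be extracted from the current DOM.
elements = [
    Element("e1", "link", "Back to cart"),
    Element("e2", "button", "Place order"),
]

target = ground("e2", elements)   # exists, so the click may proceed
```

Calling `ground("e9", elements)` raises `UngroundedActionError`, so an action aimed at a non-existent element is rejected before anything is clicked.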

Run transcripts

Every step the agent takes is logged: the page state, the action chosen, the reasoning, and the result. You can replay any task step by step.
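As an illustration of a replayable transcript, the sketch below stores one record per step with the four logged fields named above and serializes them as JSON Lines. The actual transcript format is not specified here, so `TranscriptEntry`, `to_jsonl`, and `replay` are hypothetical, and the sample entries are made up for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class TranscriptEntry:
    step: int
    page_state: str   # summary of what the agent saw
    action: str       # the action chosen
    reasoning: str    # why it was chosen
    result: str       # what happened


def to_jsonl(entries: Iterable[TranscriptEntry]) -> str:
    """Serialize entries as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e)) for e in entries)


def replay(jsonl: str) -> Iterator[TranscriptEntry]:
    """Yield entries back in their original order, one step at a time."""
    for line in jsonl.splitlines():
        if line.strip():
            yield TranscriptEntry(**json.loads(line))


# Made-up sample entries for illustration only.
log = [
    TranscriptEntry(1, "Cart page, 2 items", "click 'Checkout'",
                    "Task is to place the order", "Checkout page loaded"),
    TranscriptEntry(2, "Checkout page", "click 'Place order'",
                    "Order details match the task", "Confirmation shown"),
]

saved = to_jsonl(log)
replayed = list(replay(saved))
```

One line per step keeps the log appendable while a task is still running, and replaying is just reading the lines back in order.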