2026-01-12 · 6 min read

How the agent sees the web

One of the most common questions we get is how the agent actually sees a web page. The answer involves three complementary inputs: the rendered DOM, a visual snapshot, and a structured outline.

First, the rendered DOM. The agent reads the full document structure with interactive elements identified: buttons, links, inputs, and select menus. This gives it a structural map of what is on the page and what can be interacted with.
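To make the idea of "interactive elements identified" concrete, here is a minimal sketch in Python. It is not the agent's actual code: it parses a static HTML string with the standard library rather than a live rendered DOM, and the tag list, the `InteractiveElementCollector` class, and the `find_interactive` helper are all names we made up for illustration.

```python
from html.parser import HTMLParser

# Assumption: which tags count as "interactive" here is our own simplification.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementCollector(HTMLParser):
    """Hypothetical helper: records interactive elements in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names and passes attrs as (name, value) pairs.
        if tag in INTERACTIVE_TAGS:
            self.elements.append(
                {"index": len(self.elements), "tag": tag, "attrs": dict(attrs)}
            )


def find_interactive(html):
    """Return a list of dicts describing the interactive elements in `html`."""
    collector = InteractiveElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


sample = (
    '<form><input name="q"><select name="lang"></select>'
    '<button id="go">Search</button></form><a href="/help">Help</a>'
)
elements = find_interactive(sample)
for element in elements:
    print(element["index"], element["tag"], element["attrs"])
```

Each entry carries an index, so a later step can refer to "element 2" instead of guessing at a selector. A real rendered DOM would also reflect script-driven changes, which this static parse does not.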

Second, a visual snapshot. The agent takes a screenshot of the page as rendered and uses it for spatial reasoning: layout, position, and visual context that the DOM alone can't convey.

Third, a structured outline. This is a semantic summary of the page: headings, forms, buttons, links, and tables. The compressed representation helps the agent reason about the page's purpose and content hierarchy.
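As a rough illustration of what such an outline could look like, the sketch below condenses an HTML string into indented headings plus a count of forms, buttons, links, and tables. The output format, the `OutlineBuilder` class, and the `build_outline` function are assumptions for this example, not a description of the agent's real outline.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: these are the only non-heading tags this sketch summarizes.
COUNTED_TAGS = {"form", "button", "a", "table"}


class OutlineBuilder(HTMLParser):
    """Hypothetical helper: gathers heading text and counts of key elements."""

    def __init__(self):
        super().__init__()
        self.headings = []  # (level, text) pairs in document order
        self.counts = {tag: 0 for tag in sorted(COUNTED_TAGS)}
        self._open_heading = None  # level of the heading being read, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._open_heading = int(tag[1])
            self._text_parts = []
        elif tag in COUNTED_TAGS:
            self.counts[tag] += 1

    def handle_data(self, data):
        if self._open_heading is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._open_heading is not None:
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self._text_parts).split())
            self.headings.append((self._open_heading, text))
            self._open_heading = None


def build_outline(html):
    """Return a short text outline: indented headings, then element counts."""
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    lines = ["  " * (level - 1) + text for level, text in builder.headings]
    summary = ", ".join(f"{count} {tag}" for tag, count in builder.counts.items())
    return "\n".join(lines + [summary])


page = (
    "<h1>Checkout</h1><h2>Shipping</h2>"
    "<form><button>Continue</button></form>"
    '<h2>Payment</h2><table></table><a href="/cart">Back</a>'
)
outline = build_outline(page)
print(outline)
```

The result is a few lines of text no matter how large the markup is, which is the sense in which the outline is a compressed view of the page.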

Together, these three inputs allow the agent to take grounded actions: clicking elements that actually exist, typing in fields that are actually visible, and navigating based on the real state of the page.
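One way to picture "grounded" is as a check that runs before any action: does the target exist in what was just observed, is it visible, and does it accept the kind of input proposed? The sketch below is a simplified illustration under those assumptions. The `Element` record, the `check_action` function, and the reason strings are hypothetical and do not describe the agent's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical record for one element observed on the page."""
    element_id: str
    tag: str
    visible: bool


def check_action(action, target_id, elements):
    """Return (ok, reason) for a proposed action against the observed elements."""
    by_id = {element.element_id: element for element in elements}
    target = by_id.get(target_id)
    if target is None:
        return False, "target not on page"
    if not target.visible:
        return False, "target not visible"
    # Assumption: only inputs and textareas accept typed text in this sketch.
    if action == "type" and target.tag not in {"input", "textarea"}:
        return False, "target does not accept text"
    return True, "ok"


observed = [
    Element("search-box", "input", visible=True),
    Element("submit", "button", visible=True),
    Element("promo-code", "input", visible=False),
]

print(check_action("click", "submit", observed))
print(check_action("type", "promo-code", observed))
print(check_action("click", "checkout", observed))
```

Rejecting an action with a reason, rather than attempting it blindly, gives the caller something to act on, such as observing the page again before retrying.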