
feat: add GUI click-model browser mode #910

Open
Neel Gupta (neel04) wants to merge 13 commits into dev from feat/click-model

Conversation

@neel04

Summary

  • Add GUI click-only browser mode backed by Molmo point prediction for click/hover coordinates.
  • Restrict agent browser tools to GUI-safe actions, page navigation, screenshots, scroll, and focused text entry.
  • Add GUI click logging, coordinate scaling, ACL checks, and configurable Molmo endpoint settings.
  • Expose eval one-off task overrides and set AGI SDK suite workers to 20.
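
The coordinate-scaling step named in the summary can be illustrated with a small sketch. It assumes the point model returns pixel coordinates in the screenshot's own space and that the screenshot covers the whole viewport; `Size`, `Point`, and `scalePointToViewport` are hypothetical names for illustration, not identifiers taken from this PR.

```typescript
// Hypothetical types for illustration; the PR's resolver may use different shapes.
interface Size {
  width: number
  height: number
}

interface Point {
  x: number
  y: number
}

// Maps a point from screenshot (image) pixel space into viewport (CSS pixel) space.
// Assumption: the screenshot covers exactly the viewport, so a linear scale suffices.
function scalePointToViewport(point: Point, image: Size, viewport: Size): Point {
  if (image.width <= 0 || image.height <= 0) {
    throw new Error('image dimensions must be positive')
  }
  const x = (point.x / image.width) * viewport.width
  const y = (point.y / image.height) * viewport.height
  // Clamp so a prediction on the image edge still lands inside the viewport.
  return {
    x: Math.min(Math.max(x, 0), viewport.width - 1),
    y: Math.min(Math.max(y, 0), viewport.height - 1),
  }
}
```

Under these assumptions, a point at the centre of a 1920×1080 screenshot maps to the centre of a 1280×720 viewport.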

Run AGI SDK Eval

Use Fireworks for Kimi:

cd packages/browseros-agent/apps/eval

FIREWORKS_API_KEY=... \
BROWSEROS_EVAL_PYTHON=.venv/bin/python \
bun run eval run --config configs/legacy/agisdk-real.json

This runs the full agisdk-real suite with 20 workers using accounts/fireworks/models/kimi-k2p5 on Fireworks.

@github-actions

github-actions Bot commented May 1, 2026

❌ Tests failed — 3/1159 failed

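| Suite | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| agent | 76/76 | 0 | 0 |
| build | 9/9 | 0 | 0 |
| eval | 101/103 | 2 | 0 |
| server-agent | 263/264 | 1 | 0 |
| server-api | 202/202 | 0 | 0 |
| server-browser | 4/4 | 0 | 0 |
| server-integration | 9/10 | 0 | 1 |
| server-lib | 161/161 | 0 | 0 |
| server-root | 60/63 | 0 | 3 |
| server-skills | 31/31 | 0 | 0 |
| server-tools | 236/236 | 0 | 0 |
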
Failed tests
  • eval: adaptEvalConfigFile > adapts BrowserOS AGI SDK comparison configs
  • eval: EvalSuiteSchema > validates the daily AGISDK 10-task suite
  • server-agent: mode-aware framing > GUI click-only mode exposes only GUI click and page-opening guidance


@greptile-apps

greptile-apps Bot commented May 1, 2026

Greptile Summary

This PR replaces the element-ID-based click/hover tools with a Molmo visual-model backend that resolves coordinates from a natural-language prompt, adds a type_text tool for typing into the focused element, and gates the whole flow behind a GUI_CLICK_ONLY_MODE flag that is currently hardcoded to true. The eval CLI gains --query/--start-url/--output-dir pass-through flags and the AGI SDK worker count is bumped to 20.

  • P1 — hardcoded ephemeral endpoint: MOLMO_POINT_ENDPOINT is baked to a specific RunPod proxy URL; when the pod restarts every click/hover will hang for 60 s before throwing. It must be read from an env var.
  • P1 — no kill switch: GUI_CLICK_ONLY_MODE = true is a compile-time constant with no env-var override; it silently makes the chatMode tool-filter branch dead code and removes all element-based input from every agent session.
  • P2 — excessive logging: every click/hover emits three logger.info calls with large payloads, violating the project's debug-logging rule.

Confidence Score: 3/5

Not safe to merge as-is — the hardcoded ephemeral RunPod URL will break production click/hover when the pod restarts, and the always-on mode flag silently disables chatMode restrictions.

Two independent P1 issues: an ephemeral infrastructure endpoint with no env-var escape hatch, and a compile-time constant that globally overrides all agent modes with no kill switch.

molmo-point-config.ts (hardcoded endpoint), gui-click-only.ts (hardcoded mode flag), ai-sdk-agent.ts (dead chatMode branch)

Important Files Changed

| Filename | Overview |
| --- | --- |
| `packages/browseros-agent/apps/server/src/tools/molmo-point-config.ts` | New config file with a hardcoded ephemeral RunPod endpoint and no env-var override; will break when the pod restarts |
| `packages/browseros-agent/apps/server/src/agent/gui-click-only.ts` | New mode file with `GUI_CLICK_ONLY_MODE` hardcoded to `true`; no env/config kill switch, affects all agent sessions globally |
| `packages/browseros-agent/apps/server/src/agent/ai-sdk-agent.ts` | Wires GUI click-only mode into agent setup; the `chatMode` tool-filter branch is now unreachable dead code because `GUI_CLICK_ONLY_MODE` is always `true` |
| `packages/browseros-agent/apps/server/src/tools/input.ts` | `click`/`hover` globally replaced with GUI-prompt versions; `type_text` added; scroll `element` param removed; excessive per-action debug logging |
| `packages/browseros-agent/apps/server/src/tools/molmo-point-client.ts` | New Molmo HTTP client with good error handling, response truncation, and PNG dimension parsing; verbose info-level logging on every request/response cycle |
| `packages/browseros-agent/apps/server/src/tools/gui-click-resolver.ts` | New coordinate resolver: takes a screenshot, queries Molmo, and scales the point from image to viewport space; scaling logic is correct and validated by tests |
| `packages/browseros-agent/apps/server/src/tools/acl/acl-guard.ts` | Adds `type_text` to guarded tools and resolves the focused element for ACL checks; integrates cleanly with the framework-level `checkAcl` |
| `packages/browseros-agent/apps/server/src/browser/browser.ts` | Adds `resolveFocusedElement`, `viewportSize`, and `typeText` helpers; fixes the scroll center calculation to use `cssVisualViewport` |
| `packages/browseros-agent/apps/server/src/agent/prompt.ts` | Adds `guiClickOnly` prompt sections for all prompt builders; cleanly isolated with early-return guards |
| `packages/browseros-agent/apps/eval/src/cli/args.ts` | Adds `--query`, `--start-url`, `--output-dir` CLI args that thread through to existing `RunEvalOptions` fields |

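
The PNG dimension parsing mentioned for `molmo-point-client.ts` can be done without decoding the image, because the PNG format places the `IHDR` chunk first and stores width and height there as big-endian 32-bit integers. The following is a minimal sketch of that technique; `readPngDimensions` is a hypothetical name and is not taken from the PR.

```typescript
// 8-byte PNG file signature, followed by the first chunk, which must be IHDR.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]
// ASCII codes for "IHDR", the chunk type found at byte offsets 12..15.
const IHDR_TYPE = [0x49, 0x48, 0x44, 0x52]

// Reads width and height from the IHDR chunk header of a PNG byte buffer.
function readPngDimensions(bytes: Uint8Array): { width: number; height: number } {
  if (bytes.length < 24) {
    throw new Error('buffer too short to contain a PNG header')
  }
  if (!PNG_SIGNATURE.every((byte, i) => bytes[i] === byte)) {
    throw new Error('missing PNG signature')
  }
  if (!IHDR_TYPE.every((byte, i) => bytes[12 + i] === byte)) {
    throw new Error('first chunk is not IHDR')
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  // Width lives at offset 16 and height at offset 20, both big-endian.
  return {
    width: view.getUint32(16, false),
    height: view.getUint32(20, false),
  }
}
```

Knowing the screenshot's pixel dimensions is what allows a predicted point to be scaled from image space into viewport space.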
Prompt To Fix All With AI
Fix the following 5 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 5
packages/browseros-agent/apps/server/src/tools/molmo-point-config.ts:1-5
**Hardcoded ephemeral RunPod endpoint**

`MOLMO_POINT_ENDPOINT` is set to a specific RunPod proxy URL. RunPod proxy hostnames are ephemeral — they become invalid as soon as the pod is stopped or restarted. When that happens every `click` and `hover` call will hang for the full `MOLMO_POINT_TIMEOUT_MS` (60 s) before throwing, making the browser completely unusable. There is no environment variable or config override to change it at runtime.

### Issue 2 of 5
packages/browseros-agent/apps/server/src/tools/molmo-point-config.ts:1-2
Read the endpoint from an environment variable so the RunPod pod can be rotated without a code deploy. The hardcoded URL will become invalid the moment the pod restarts.

```suggestion
export const MOLMO_POINT_ENDPOINT =
  process.env.MOLMO_POINT_ENDPOINT ??
  'https://gseb9k0a2n2vhl-8000.proxy.runpod.net/'
```
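
A stricter variant of this suggestion is sketched here for comparison. It drops the hardcoded fallback so that a missing or malformed endpoint fails at startup instead of surfacing as a 60 s hang on the first click. This is an assumption about what the project might prefer, not part of the review; `resolveMolmoEndpoint` and the `Env` type are hypothetical names.

```typescript
// Hypothetical env shape; Node's and Bun's process.env are compatible with it.
type Env = Record<string, string | undefined>

// Reads and validates the Molmo endpoint, throwing early on misconfiguration.
function resolveMolmoEndpoint(env: Env): string {
  const raw = env['MOLMO_POINT_ENDPOINT']?.trim()
  if (!raw) {
    throw new Error('MOLMO_POINT_ENDPOINT is not set')
  }
  let url: URL
  try {
    url = new URL(raw)
  } catch {
    throw new Error(`MOLMO_POINT_ENDPOINT is not a valid URL: ${raw}`)
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    throw new Error(`MOLMO_POINT_ENDPOINT must use http or https, got ${url.protocol}`)
  }
  // URL serialization normalizes the value, e.g. adds a trailing slash to a bare origin.
  return url.toString()
}
```

A caller would pass `process.env` once at startup and reuse the returned string, so rotating the pod only requires changing the environment variable.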

### Issue 3 of 5
packages/browseros-agent/apps/server/src/agent/gui-click-only.ts:1
**`GUI_CLICK_ONLY_MODE` is hardcoded `true` with no runtime kill switch**

`GUI_CLICK_ONLY_MODE = true` unconditionally puts every agent session into GUI-click-only mode. There is no environment variable, config flag, or per-session toggle to disable it. The consequence is that the existing chat-mode tool restriction in `ai-sdk-agent.ts` is now dead code (the `else if (config.resolvedConfig.chatMode)` branch can never execute). The old element-based `click` and `hover` schemas are also permanently gone from the registry, so any caller that depended on `element` IDs will silently get wrong behaviour. A guard like `process.env.GUI_CLICK_ONLY_MODE === 'true'` would give an operational kill switch without a code deploy.
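
One way to express the suggested guard is a small helper that defaults to off and only enables the mode on an explicit `true`. This is a sketch; `isGuiClickOnlyModeEnabled` and `Env` are hypothetical names, and the whitespace and case tolerance goes slightly beyond the strict `=== 'true'` comparison proposed above.

```typescript
// Hypothetical env shape; process.env can be passed in directly.
type Env = Record<string, string | undefined>

// Returns true only when GUI_CLICK_ONLY_MODE is explicitly set to "true".
// Any other value, or no value at all, leaves the mode disabled.
function isGuiClickOnlyModeEnabled(env: Env): boolean {
  return env['GUI_CLICK_ONLY_MODE']?.trim().toLowerCase() === 'true'
}
```

Defaulting to disabled means an unset variable restores the previous behaviour, which is what makes the flag usable as a kill switch.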

### Issue 4 of 5
packages/browseros-agent/apps/server/src/agent/ai-sdk-agent.ts:115-133
**Dead `chatMode` tool-filter branch**

Because `GUI_CLICK_ONLY_MODE` is always `true`, the `else if (config.resolvedConfig.chatMode)` branch filtering tools by `CHAT_MODE_ALLOWED_TOOLS` can never be reached. The chat-mode tool restriction silently no longer applies, which may expose write tools in chat sessions. This violates the project rule to remove dead code.
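
One possible restructuring, sketched under the assumption that the two modes should compose, is to apply each active filter independently so that the chat-mode restriction still holds when GUI click-only mode is on. `selectBrowserTools`, `ModeFlags`, and `ToolMap` are hypothetical names; the predicate and the allow-list stand in for `isGuiClickOnlyBrowserToolAllowed` and `CHAT_MODE_ALLOWED_TOOLS`.

```typescript
type ToolMap<T> = Record<string, T>

interface ModeFlags {
  guiClickOnly: boolean
  chatMode: boolean
}

// Keeps a tool only if it passes every filter whose mode is active.
// With both modes on, the result is the intersection of the two allow-lists.
function selectBrowserTools<T>(
  allTools: ToolMap<T>,
  flags: ModeFlags,
  isGuiAllowed: (name: string) => boolean,
  chatAllowed: ReadonlySet<string>,
): ToolMap<T> {
  return Object.fromEntries(
    Object.entries(allTools).filter(([name]) => {
      if (flags.guiClickOnly && !isGuiAllowed(name)) return false
      if (flags.chatMode && !chatAllowed.has(name)) return false
      return true
    }),
  )
}
```

Whether intersection is the intended semantics is a design decision for the PR author; the sketch only shows that neither branch needs to be unreachable.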

### Issue 5 of 5
packages/browseros-agent/apps/server/src/tools/input.ts:68-115
**Excessive debug logging per project rule**

Every `click` and `hover` call emits three `logger.info` calls (`'GUI click dispatching'`, `'GUI click dispatched'`, and the same for hover), each carrying large structured payloads including screenshot dimensions, hit-element properties, and model responses. The project rule explicitly asks to remove excessive logging statements after debugging. This pattern also appears in `molmo-point-client.ts` (`'Molmo point request started'`, `'Molmo point response received'`). The request-started and dispatched/dispatching pairs should be removed or gated behind a `DEBUG`-level guard.
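
The level gating suggested here can be sketched with a minimal logger that drops messages below a configured level and builds payloads lazily, so large structured payloads cost nothing when debug output is off. `createGatedLogger` is a hypothetical stand-in; the project's real `logger` API is not shown in this review and may differ.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error'

const LEVEL_ORDER: Record<LogLevel, number> = { debug: 10, info: 20, warn: 30, error: 40 }

interface GatedLogger {
  debug(message: string, payload?: () => unknown): void
  info(message: string, payload?: () => unknown): void
}

// Creates a logger that ignores messages below minLevel.
// Payloads are passed as thunks and only evaluated when the message is emitted.
function createGatedLogger(minLevel: LogLevel, sink: (line: string) => void): GatedLogger {
  const emit = (level: LogLevel, message: string, payload?: () => unknown): void => {
    if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return
    const details = payload ? ` ${JSON.stringify(payload())}` : ''
    sink(`[${level}] ${message}${details}`)
  }
  return {
    debug: (message, payload) => emit('debug', message, payload),
    info: (message, payload) => emit('info', message, payload),
  }
}
```

With the minimum level set to `info`, a `debug` call for `'GUI click dispatching'` is skipped and its payload thunk is never invoked.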

Reviews (1): Last reviewed commit: "fix: evals & timeoue"

Comment on lines +1 to +5
export const MOLMO_POINT_ENDPOINT =
'https://gseb9k0a2n2vhl-8000.proxy.runpod.net/'

export const MOLMO_POINT_MAX_NEW_TOKENS = 64
export const MOLMO_POINT_TIMEOUT_MS = 60_000

P1 Hardcoded ephemeral RunPod endpoint

MOLMO_POINT_ENDPOINT is set to a specific RunPod proxy URL. RunPod proxy hostnames are ephemeral — they become invalid as soon as the pod is stopped or restarted. When that happens every click and hover call will hang for the full MOLMO_POINT_TIMEOUT_MS (60 s) before throwing, making the browser completely unusable. There is no environment variable or config override to change it at runtime.


Comment on lines +1 to +2
export const MOLMO_POINT_ENDPOINT =
'https://gseb9k0a2n2vhl-8000.proxy.runpod.net/'

P1 Read the endpoint from an environment variable so the RunPod pod can be rotated without a code deploy. The hardcoded URL will become invalid the moment the pod restarts.

Suggested change
- export const MOLMO_POINT_ENDPOINT =
-   'https://gseb9k0a2n2vhl-8000.proxy.runpod.net/'
+ export const MOLMO_POINT_ENDPOINT =
+   process.env.MOLMO_POINT_ENDPOINT ??
+   'https://gseb9k0a2n2vhl-8000.proxy.runpod.net/'

@@ -0,0 +1,18 @@
export const GUI_CLICK_ONLY_MODE = true

P1 GUI_CLICK_ONLY_MODE is hardcoded true with no runtime kill switch

GUI_CLICK_ONLY_MODE = true unconditionally puts every agent session into GUI-click-only mode. There is no environment variable, config flag, or per-session toggle to disable it. The consequence is that the existing chat-mode tool restriction in ai-sdk-agent.ts is now dead code (the else if (config.resolvedConfig.chatMode) branch can never execute). The old element-based click and hover schemas are also permanently gone from the registry, so any caller that depended on element IDs will silently get wrong behaviour. A guard like process.env.GUI_CLICK_ONLY_MODE === 'true' would give an operational kill switch without a code deploy.


Comment on lines 115 to 133
       toolContext,
       config.resolvedConfig.toolApprovalConfig,
     )
-    const browserTools = config.resolvedConfig.chatMode
-      ? Object.fromEntries(
-          Object.entries(allBrowserTools).filter(([name]) =>
-            CHAT_MODE_ALLOWED_TOOLS.has(name),
-          ),
-        )
-      : allBrowserTools
+    let browserTools = allBrowserTools
+    if (GUI_CLICK_ONLY_MODE) {
+      browserTools = Object.fromEntries(
+        Object.entries(allBrowserTools).filter(([name]) =>
+          isGuiClickOnlyBrowserToolAllowed(name),
+        ),
+      )
+    } else if (config.resolvedConfig.chatMode) {
+      browserTools = Object.fromEntries(
+        Object.entries(allBrowserTools).filter(([name]) =>
+          CHAT_MODE_ALLOWED_TOOLS.has(name),
+        ),
+      )
+    }
     if (config.resolvedConfig.chatMode) {
       logger.info('Chat mode enabled, restricting to read-only browser tools', {

P1 Dead chatMode tool-filter branch

Because GUI_CLICK_ONLY_MODE is always true, the else if (config.resolvedConfig.chatMode) branch filtering tools by CHAT_MODE_ALLOWED_TOOLS can never be reached. The chat-mode tool restriction silently no longer applies, which may expose write tools in chat sessions. This violates the project rule to remove dead code.

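**Rule Used:** Remove unused/dead code rather than leaving it in ... ([source](https://app.greptile.com/review/custom-context?memory=9b045db4-2630-428c-95b7-ccf048d34547))

**Learned From**
[browseros-ai/BrowserOS-agent#126](https://github.com/browseros-ai/BrowserOS-agent/pull/126)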


Neel Gupta (neel04) force-pushed the feat/click-model branch 4 times, most recently from 8a21b97 to 4955b48, on May 5, 2026 14:52