Give AI agents control of any Windows app — no vision model, no screenshots, no cloud.
ApexComputerUse reads the Windows accessibility tree (the same data the OS exposes to screen readers) and serves it over a plain HTTP REST API. Any AI agent — in any language, on any machine — can find, inspect, and control any desktop app or browser by making simple HTTP requests. No screenshots. No pixel coordinates. No cloud dependency.
5–20 tokens per action instead of 1,000–3,500 for a screenshot. A full browser page in onscreen-only mode is ~126 elements of compact JSON — less than the cost of a single screenshot of the same page.
Works on Win32, WPF, UWP, WinForms, and browsers. Controlled via HTTP REST, named pipes, cmd.exe, and Telegram.
Requirements: Windows 10/11 · .NET 10 SDK
git clone https://github.com/your-org/ApexComputerUse
cd ApexComputerUse
dotnet build
dotnet run --project ApexComputerUse
- The app opens. In the Remote Control tab, click Start HTTP.
- Open http://localhost:8080/ in a browser — the interactive console appears.
- Pick any open window from the Windows panel on the left.
- Browse its element tree, click an action button, see the result.
Or go straight to curl:
# Confirm the server is up
curl http://localhost:8080/ping
# Find Notepad and read its text editor content
curl -X POST http://localhost:8080/find -H "Content-Type: application/json" -d '{"window":"Notepad"}'
curl "http://localhost:8080/exec?action=gettext"
OCR: requires eng.traineddata — download from github.com/tesseract-ocr/tessdata and place it in tessdata\ next to the executable.
AI Vision: requires a GGUF vision model and projector — see Usage — AI.
Most AI computer-use tools — Claude Computer Use, OpenAI CUA, UI-TARS, OmniParser — work by sending a screenshot to a vision model and guessing pixel coordinates to click. This approach has compounding costs:
- Screenshot token costs scale with resolution and vary by provider. A 1024×768 image runs ~765 tokens (OpenAI) to ~1,050 tokens (Anthropic). At 1920×1080 that rises to ~1,840 tokens (Anthropic) or ~2,125 tokens (OpenAI). At 2048×2048, OpenAI charges ~2,765 tokens and Anthropic ~2,500–3,500 tokens. Gemini is the exception, typically staying under 1,000 tokens even for ~4K images. And this cost is paid on every single step.
- Screenshots stack in conversation history — a 20-step task accumulates 20+ images in context.
- Coordinate grounding is fragile: it breaks on window resize, DPI scaling, and multi-monitor setups.
- Published benchmarks confirm the accuracy ceiling: even specialist 7B vision models score only 18.9% on real professional UIs (ScreenSpot-Pro, 2025). GPT-4o scores below 2% on unscaled professional screens.
ApexComputerUse reads the accessibility tree the OS already maintains — the same tree used by screen readers and test automation. This gives every element a name, control type, and AutomationId, without rendering a pixel.
Interacting with an element by name costs 5–20 tokens. The element map for a full browser page in onscreen-only mode is typically 100–200 elements of compact JSON — compared to ~1,050 tokens for a single screenshot of the same page, with none of the coordinate fragility.
This is the same direction taken by the most efficient browser-only tools: browser-use claims 50% fewer tokens than screenshot alternatives; Vercel's agent-browser returns 200–400 tokens per page snapshot and uses 82–93% fewer tokens than Playwright MCP. ApexComputerUse brings the same approach to the entire Windows desktop.
| Tool | Coverage | HTTP API | Stable element IDs | Onscreen filter | Status |
|---|---|---|---|---|---|
| ApexComputerUse | Windows desktop + browsers | ✅ REST | ✅ SHA-256 hash | ✅ ?onscreen=true | Active |
| UFO2 (Microsoft) | Windows desktop + browsers | ❌ research agent | ❌ bounding-box | Partial | Research only |
| UI Automata | Windows desktop + browsers | MCP only | Selector-based | Shadow DOM cache | Active |
| Windows-Use | Windows desktop | ❌ Python lib | ❌ | Partial | Active |
| WinAppDriver | Windows desktop | WebDriver | XPath / selectors | ❌ | Paused by Microsoft |
| browser-use | Browser only | ❌ Python lib | Element hash | ✅ | Active |
| Playwright MCP | Browser only | MCP | Session-scoped refs | Partial | Active |
| Claude Computer Use | Any (screenshot) | Cloud API | ❌ coordinates | ❌ | Active |
No other tool combines: Windows UIA3 coverage, SHA-256 stable element IDs, a language-agnostic HTTP REST API, and an onscreen visibility filter — in a single deployable binary.
Every element is assigned a SHA-256 hash-based numeric ID derived from its control type, name, AutomationId, and position in the tree. These IDs are stable across sessions — an agent can reference the same element in turn 1 and turn 20 without re-querying the tree. No other tool in the Windows desktop automation space publishes this property.
GET /elements?onscreen=true prunes any element where IsOffscreen = true during the tree scan, skipping entire offscreen subtrees. On a live Chewy.com product page this reduces 634 elements to 126 — an 80% reduction — putting token cost per step in the same range as the best browser-only tools while covering all desktop apps too.
The filter composes with the existing type filter: ?onscreen=true&type=Button.
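The combination of stable IDs and the onscreen filter is easy to consume from any language. Below is a minimal Python client sketch (stdlib only); it assumes the server is running on localhost:8080, that /elements returns the documented nested element JSON in the response's data field, and uses 42 as a placeholder window ID — adjust these to your environment.

```python
# Minimal client sketch: fetch the pruned onscreen-button map, locate a
# button by name, then click it via its stable element ID.
# Assumptions: server on localhost:8080; the nested tree lives in "data".
import json
import urllib.request

BASE = "http://localhost:8080"

def get_json(path):
    with urllib.request.urlopen(BASE + path) as r:
        return json.load(r)

def post_json(path, body):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)

def find_by_name(node, name):
    """Depth-first search of a nested element tree for a matching name."""
    if node.get("name") == name:
        return node
    for child in node.get("children", []):
        hit = find_by_name(child, name)
        if hit is not None:
            return hit
    return None

def click_named_button(window_id, button_name):
    """Fetch onscreen buttons only, then click the match by stable ID."""
    tree = get_json("/elements.json?onscreen=true&type=Button")["data"]
    target = find_by_name(tree, button_name)
    if target is None:
        raise LookupError(button_name)
    post_json("/find", {"window": window_id, "id": target["id"]})
    post_json("/execute", {"action": "click"})
```

Because the IDs are stable across sessions, the tree walk only needs to happen once per task; subsequent actions reuse the cached ID.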
- Find any window and element by name or AutomationId (exact or fuzzy match)
- Filter element search by ControlType
- Persistent, hash-based stable element and window IDs (survive app restarts)
- Onscreen-only element map (?onscreen=true) — prunes offscreen subtrees at scan time
- Element nodes include boundingRectangle (x, y, width, height) for spatial context and visual rendering
- Execute all common UI actions: click, type, select, toggle, scroll, drag & drop, etc.
- OCR any UI element using Tesseract
- Multimodal AI: describe UI elements, ask questions about them, analyse image/audio files using a local vision LLM (LLamaSharp MTMD)
- Remote control via HTTP REST API (curl-friendly JSON)
- Remote control via named pipe (PowerShell module included)
- Remote control via cmd.exe batch helper (apex.cmd)
- Remote control via Telegram bot
- Screenshot capture of elements, windows, and full screen (returned as base64 PNG)
- Interactive HTTP test console — served at GET /, includes live windows list, element tree browser, grouped command builder covering every action, inline capture/OCR/AI vision/UI map buttons, format selector (JSON/HTML/Text/PDF), format demo links, and a response log
- UI Map Renderer — renders the element tree as a colour-coded overlay drawn directly on screen, and optionally exports a PNG image; accessible via Tools → Render UI Map or GET /uimap (returns base64 PNG)
- Format-adaptive responses — every endpoint serves HTML, plain text, JSON, or PDF via URL extension (.json, .html, .txt, .pdf), ?format= parameter, or Accept header; default is an HTML page with embedded JSON readable by any AI that can fetch a URL
- System utility routes — /ping, /sysinfo, /env, /ls, /run for AI agents that need OS-level context without a separate tool
git clone https://github.com/your-org/ApexComputerUse
cd ApexComputerUse
dotnet run --project ApexComputerUse
Download eng.traineddata (and any other language files) from github.com/tesseract-ocr/tessdata and place them in a tessdata\ folder next to the executable:
tessdata\
eng.traineddata
(other languages...)
- Message @BotFather on Telegram and create a bot with /newbot.
- Copy the token (format: 123456789:ABC-DEF...).
- Paste it into the Bot Token field in the app and click Start Telegram.
| Field | Description |
|---|---|
| Window Name | Partial title of the target window. Fuzzy-matched if no exact match found. |
| AutomationId | The element's AutomationId (checked first). |
| Element Name | The element's Name property (fallback if AutomationId is blank). |
| Search Type | Filter the element search to a specific ControlType; All searches every type. |
| Control Type | Selects the action group (Button, TextBox, etc.). |
| Action | The action to perform on the found element. |
| Value / Index | Input for actions that need it (text to type, index, row,col, x,y, etc.). |
Find Element — locates the window and element, logs what was found. Execute Action — runs the selected action against the last found element.
| Item | Description |
|---|---|
| Run AI Computer Use Mode | Launches the interactive multimodal AI agent loop (requires model loaded on the Model tab). |
| Output UI Map | Scans the current window's element tree and logs it as nested JSON to the console tab. |
| Render UI Map | Scans the current window's element tree, draws a colour-coded bounding-box overlay on screen for 5 seconds, and offers to save the overlay as a PNG image. |
Every window and element is assigned a stable numeric ID (SHA-256 hash-based) that persists across sessions. These IDs can be used in find commands instead of titles or AutomationIds.
# 1. Get windows with their IDs
curl http://localhost:8080/windows
# Returns: [{"id":42,"title":"Notepad"},{"id":107,"title":"Calculator"},...]
# 2. Get elements with their IDs for the current window
curl http://localhost:8080/elements
# Onscreen elements only (prunes offscreen subtrees — 80% fewer elements on browser pages)
curl "http://localhost:8080/elements?onscreen=true"
# Combine with type filter
curl "http://localhost:8080/elements?onscreen=true&type=Button"
# Returns nested JSON including bounding rectangles:
# {
# "id": 105,
# "controlType": "Edit",
# "name": "Text Editor",
# "automationId": "15",
# "boundingRectangle": { "x": 0, "y": 30, "width": 800, "height": 600 },
# "children": [...]
# }
# 3. Find using numeric IDs (no fuzzy matching, direct map lookup)
curl -X POST http://localhost:8080/find \
-H "Content-Type: application/json" \
  -d '{"window":42,"id":105}'
Using numeric IDs is faster and unambiguous — the element is resolved directly from the in-memory map without any search or fuzzy logic. Every find call also auto-focuses the matched window.
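An agent does not need the full nested JSON in its context; the tree can be flattened into a compact one-line-per-element map. A sketch, assuming the documented id, controlType, name, and children fields:

```python
# Sketch: flatten the nested element JSON (as returned by /elements) into
# a compact one-line-per-element text map for an agent's context window.
# Indentation encodes tree depth; stable IDs allow direct /find lookups.
def flatten(node, depth=0, out=None):
    if out is None:
        out = []
    out.append("{}{} #{} '{}'".format(
        "  " * depth, node.get("controlType", "?"),
        node.get("id", "?"), node.get("name", "")))
    for child in node.get("children", []):
        flatten(child, depth + 1, out)
    return out

# Example tree using the documented field names:
sample = {
    "id": 105, "controlType": "Edit", "name": "Text Editor",
    "children": [{"id": 106, "controlType": "Button", "name": "OK", "children": []}],
}
for line in flatten(sample):
    print(line)
```

Each line costs only a handful of tokens, and the agent can quote the #id directly back into a find call.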
Map rendering isn't just a debugging convenience — it has compounding implications for token consumption at scale.
With screenshot-based AI automation, every interaction requires sending a fresh image to the model. At typical desktop resolutions that's 1,000–3,500 tokens per screenshot depending on the provider and resolution — every single step, accumulating in conversation history. With ApexComputerUse's map approach, the UI is rendered once as a structured, text-based representation. After that initial render, each individual interaction references elements by name, costing 5–20 tokens on average.
The ?onscreen=true filter further reduces the element map to only what is visible in the current viewport. On a real browser page this produces 126 elements of compact JSON — well under the cost of a single screenshot of the same page.
| Per step | 20-step task | |
|---|---|---|
| Screenshot (1024×768) | ~765–1,050 tokens | ~15,000–21,000 tokens in images alone |
| Screenshot (1920×1080) | ~1,840–2,125 tokens | ~37,000–43,000 tokens in images alone |
| Screenshot (2048×2048) | ~2,765–3,500 tokens | ~55,000–70,000 tokens in images alone |
| ApexComputerUse (full map) | 400–1,800 tokens (one-time) + ~10 per action | ~1,000 tokens total |
| ApexComputerUse (?onscreen=true) | 200–600 tokens (one-time) + ~10 per action | ~400 tokens total |
Provider breakdown: at 1024×768, Anthropic ≈ 1,050 tokens / OpenAI ≈ 765 tokens. At 1920×1080, Anthropic ≈ 1,840 / OpenAI ≈ 2,125. At 2048×2048, OpenAI ≈ 2,765 / Anthropic ≈ 2,500–3,500. Gemini is notably more efficient — typically under 1,000 tokens even for ~4K images. All providers compound costs across steps: every screenshot remains in context for the life of the conversation.
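The 20-step arithmetic above can be checked directly; the mid-range one-time map cost of 800 tokens below is an assumption chosen from the 400–1,800 range quoted in the table:

```python
# Back-of-envelope check of the 20-step comparison, using the approximate
# provider token counts quoted in this section.
STEPS = 20

screenshot_1080p = (1840, 2125)   # tokens per step: Anthropic / OpenAI
images_total = [t * STEPS for t in screenshot_1080p]
print(images_total)               # in the ~37,000-43,000 range, images alone

# Map approach: one-time map render plus ~10 tokens per action.
full_map, per_action = 800, 10    # 800 = assumed mid-range one-time cost
print(full_map + per_action * STEPS)   # ~1,000 tokens total
```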
| Screen Capture | Map Approach | |
|---|---|---|
| Per-interaction cost | 2,500–10,000 tokens (image) | 5–20 tokens (text reference) |
| Session setup cost | none — image sent every time | 400–1,800 tokens (one-time map render) |
| Interactions per person/day | 100 | 100 |
Screenshot: 2,500 tokens each · Initial map: 400 tokens · Per-action after map: 8 tokens
By time period — 1 person:
| Timeframe | Screen Capture | Map Approach | Tokens Saved |
|---|---|---|---|
| 1 day | 250,000 | 1,192 | 248,808 |
| 1 week | 1,750,000 | 8,344 | 1,741,656 |
| 1 year | 91,250,000 | 435,080 | 90,814,920 |
Annual totals — by team size:
| Team Size | Screen Capture | Map Approach | Reduction Factor |
|---|---|---|---|
| 1 person | 91,250,000 | 435,080 | ~210x |
| 10 people | 912,500,000 | 4,350,800 | ~210x |
| 50 people | 4,562,500,000 | 21,754,000 | ~210x |
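The tables above follow from the stated constants. Note one implicit assumption needed to reproduce the published figures exactly: the one-time map render stands in for the day's first interaction, so a day costs 400 + 99 × 8 = 1,192 tokens rather than 400 + 100 × 8:

```python
# Reproducing the savings tables from their stated constants.
SCREENSHOT = 2500   # tokens per screenshot
MAP_RENDER = 400    # one-time map render per day
PER_ACTION = 8      # tokens per subsequent action
PER_DAY = 100       # interactions per person per day

screenshot_daily = PER_DAY * SCREENSHOT              # 250,000
map_daily = MAP_RENDER + (PER_DAY - 1) * PER_ACTION  # 1,192

for days, label in [(1, "1 day"), (7, "1 week"), (365, "1 year")]:
    print(label, days * screenshot_daily, days * map_daily)

# Reduction factor is the same for any team size (both sides scale linearly).
print(round(365 * screenshot_daily / (365 * map_daily)))   # ~210
```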
Start the HTTP server from the Remote Control group box, then use curl or open http://localhost:8080/ in a browser to access the interactive test console.
Opening the root URL in any browser launches a dark-themed console with:
- Windows panel — live list of all open windows; click to select and auto-load its element tree
- Elements panel — nested element tree flattened with indentation; onscreen-only toggle; ControlType filter; click any element to select it
- Command builder — grouped action buttons covering every action: Click, Text, Keys, State, Scroll, Toggle, Select, Window, Range/Slider, Grid/Table, Transform, Wait, Capture, AI Vision; Value input with context-sensitive hints; ▶ Execute button
- AI Vision buttons — status, describe, ask, file; requires model loaded on the Model tab
- Format selector — dropdown in the header (JSON / HTML / Text / PDF); all requests use the selected format; format demo links (help, status, windows) open directly in a new tab in the chosen format
- Response log — newest result at top; captures rendered as inline images (click to zoom); PDF responses shown as an "Open PDF" link (browser-native rendering)
Every endpoint adapts its response to whatever format the caller can consume, selected by priority:
- URL file extension — append .json, .html, .txt, or .pdf to any path
- ?format= query parameter — html, text, json, or pdf
- Accept request header — text/html, text/plain, application/json, or application/pdf
- Default: html
# URL extension (highest priority — works even if the AI cannot set headers or query params)
curl http://localhost:8080/status.json
curl http://localhost:8080/help.txt
curl http://localhost:8080/windows.html
curl http://localhost:8080/status.pdf --output status.pdf
# ?format= query parameter
curl "http://localhost:8080/ping?format=text"
curl "http://localhost:8080/ping?format=json"
# Accept header
curl -H "Accept: application/json" http://localhost:8080/ping
curl -H "Accept: application/pdf" http://localhost:8080/help --output help.pdf
# HTML response (default — works in any browser or AI that can fetch a page)
curl http://localhost:8080/ping
HTML includes a <pre> block for human readability and an embedded <script type="application/json" id="apex-result"> block containing the full result as JSON — allowing any AI that can fetch a webpage to extract structured data without a vision model.
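A caller can pull the embedded JSON out of the default HTML response without an HTML parser. A sketch, assuming a script tag with the documented id="apex-result" (attribute order in a real response may differ, which the regex tolerates):

```python
# Sketch: extract the embedded JSON payload from a default HTML response.
import json
import re

def extract_result(html):
    m = re.search(
        r'<script[^>]*id="apex-result"[^>]*>(.*?)</script>',
        html, re.DOTALL)
    return json.loads(m.group(1)) if m else None

# Illustrative HTML in the documented shape:
page = '''<html><body><pre>ok</pre>
<script type="application/json" id="apex-result">
{"success": true, "action": "ping", "data": {"pong": true}, "error": null}
</script></body></html>'''
print(extract_result(page)["action"])   # ping
```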
PDF is a valid A4 document using the built-in Courier font (no external dependencies). Useful for AI systems that can only accept PDF attachments.
All command endpoints accept both POST (JSON body) and GET (query string parameters), so any command can be expressed as a plain URL — no request body required:
# Find a window via GET
curl "http://localhost:8080/find?window=Notepad"
# Execute an action via GET
curl "http://localhost:8080/exec?action=gettext"
# Combine with URL extension for full URL-only access
curl "http://localhost:8080/find.json?window=Notepad&id=15"
curl "http://localhost:8080/exec.pdf?action=describe" --output result.pdf
GET parameter names match the JSON body field names: window, id / automationId, name / elementName, type / searchType, action, value, onscreen, prompt, model, proj.
All endpoints return the same canonical structure:
{
"success": true,
"action": "ping",
"data": { "key": "value", ... },
"error": null
}
HTTP status: 200 on success, 400 on error.
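Because every endpoint shares this envelope, client code can unwrap it in one place. A minimal sketch:

```python
# Sketch: unwrap the canonical {success, action, data, error} envelope.
# Error responses (HTTP 400) carry the same shape with success=false.
def unwrap(envelope):
    if not envelope.get("success"):
        raise RuntimeError(envelope.get("error") or "unknown error")
    return envelope.get("data")

print(unwrap({"success": True, "action": "ping",
              "data": {"pong": 1}, "error": None}))
```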
# Health check
curl http://localhost:8080/ping
# System information (OS, machine, user, CPU, CLR)
curl http://localhost:8080/sysinfo
# All environment variables
curl http://localhost:8080/env
# Directory listing (defaults to current working directory)
curl http://localhost:8080/ls
curl "http://localhost:8080/ls?path=C:\Users"
# Run a shell command (cmd.exe /c); 30-second timeout
curl "http://localhost:8080/run?cmd=whoami"
curl -X POST http://localhost:8080/run \
-H "Content-Type: application/json" \
  -d '{"value":"dir C:\\"}'
/run response data fields: cmd, stdout, stderr, exit_code.
# List all open windows (with stable IDs)
curl http://localhost:8080/windows
# Get current state
curl http://localhost:8080/status
# List all elements in the current window (nested JSON with IDs and bounding rectangles)
curl http://localhost:8080/elements
# Onscreen elements only — prunes offscreen subtrees for maximum token efficiency
curl "http://localhost:8080/elements?onscreen=true"
# Filter by ControlType
curl "http://localhost:8080/elements?type=Button"
# Both filters combined
curl "http://localhost:8080/elements?onscreen=true&type=Button"
# Render the current window's UI element tree as a colour-coded PNG (returns base64)
curl http://localhost:8080/uimap
# Help
curl http://localhost:8080/help
# Find a window and element by title/name
curl -X POST http://localhost:8080/find \
-H "Content-Type: application/json" \
-d '{"window":"Notepad","id":"15"}'
# Find by element name with ControlType filter
curl -X POST http://localhost:8080/find \
-H "Content-Type: application/json" \
-d '{"window":"Notepad","name":"Text Editor","type":"Edit"}'
# Find by numeric window/element IDs (fast, no fuzzy search)
curl -X POST http://localhost:8080/find \
-H "Content-Type: application/json" \
-d '{"window":42,"id":105}'
# Type text into the found element
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"action":"type","value":"Hello World"}'
# Click a button
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"action":"click"}'
# Read text from element
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"action":"gettext"}'
# Capture current element (returns base64 PNG in data field)
curl -X POST http://localhost:8080/capture
# Capture full screen
curl -X POST http://localhost:8080/capture \
-H "Content-Type: application/json" \
-d '{"action":"screen"}'
# Capture multiple elements stitched into one image
curl -X POST http://localhost:8080/capture \
-H "Content-Type: application/json" \
-d '{"action":"elements","value":"42,105,106"}'
# OCR the found element
curl -X POST http://localhost:8080/ocr
# OCR a region (x,y,width,height) within the element
curl -X POST http://localhost:8080/ocr \
-H "Content-Type: application/json" \
-d '{"value":"0,0,300,50"}'
# Check AI model status
curl http://localhost:8080/ai/status
# Load a vision/audio LLM (run once; model stays loaded until the server restarts)
curl -X POST http://localhost:8080/ai/init \
-H "Content-Type: application/json" \
-d '{"model":"C:\\models\\vision.gguf","proj":"C:\\models\\mmproj.gguf"}'
# Describe the currently selected UI element using the vision model
# Captures the element as an image and sends it to the LLM
curl -X POST http://localhost:8080/ai/describe
# Describe with a custom prompt
curl -X POST http://localhost:8080/ai/describe \
-H "Content-Type: application/json" \
-d '{"prompt":"List every button you can see."}'
# Ask a specific question about the current element
curl -X POST http://localhost:8080/ai/ask \
-H "Content-Type: application/json" \
-d '{"prompt":"Is there an error message visible?"}'
# Describe an image file on disk
curl -X POST http://localhost:8080/ai/file \
-H "Content-Type: application/json" \
  -d '{"value":"C:\\screenshots\\app.png","prompt":"What dialog is shown?"}'
| Field | Aliases | Description |
|---|---|---|
| window | — | Window title (partial match) or numeric ID from /windows |
| automationId | id | Element AutomationId string or numeric ID from /elements |
| elementName | name | Element Name property (fallback if id not given) |
| searchType | type | ControlType filter (All or e.g. Button) |
| action | — | Action name (see list below) |
| value | — | Value/input for the action |
| model | modelPath | AI: path to LLM .gguf file |
| proj | mmProjPath | AI: path to multimodal projector .gguf file |
| prompt | — | AI: question or instruction text |
After starting the bot, send commands to it in any Telegram chat:
/find window=Notepad id=15
/find window=Calculator name=Equals type=Button
/exec action=type value="Hello from Telegram"
/exec action=click
/exec action=gettext
/ocr
/ocr value=0,0,300,50
/status
/windows
/elements
/elements type=Button
/help
Key=value pairs support quoted values for multi-word strings:
/find window="My Application" name="Save Button"
/exec action=type value="some text with spaces"
AI commands work the same way:
/ai action=status
/ai action=init model=C:\models\vision.gguf proj=C:\models\mmproj.gguf
/ai action=describe
/ai action=describe prompt="List every button you can see."
/ai action=ask prompt="Is there an error message visible?"
/ai action=file value=C:\screenshots\app.png prompt="What dialog is shown?"
The app exposes a named pipe server (default name ApexComputerUse). Start it from the Remote Control group box, then use the bundled ApexComputerUse.psm1 module:
# Import the module
Import-Module .\Scripts\ApexComputerUse.psm1
# Connect to the pipe (must be started in the app first)
Connect-FlaUI # default pipe name: ApexComputerUse
Connect-FlaUI -PipeName MyPipe -TimeoutMs 10000
# Discovery
Get-FlaUIWindows # list all open window titles
Get-FlaUIStatus # current window/element state
Get-FlaUIHelp # command reference
Get-FlaUIElements # list all elements in current window
Get-FlaUIElements -Type Button # filter by ControlType
# Find
Find-FlaUIElement -Window 'Notepad'
Find-FlaUIElement -Window 'Notepad' -Name 'Text Editor' -Type Edit
Find-FlaUIElement -Window 'Calculator' -Id 'num5Button'
# Execute actions
Invoke-FlaUIAction -Action click
Invoke-FlaUIAction -Action type -Value 'Hello from PowerShell'
Invoke-FlaUIAction -Action gettext
Invoke-FlaUIAction -Action screenshot
# OCR
Invoke-FlaUIOcr
Invoke-FlaUIOcr -Region '0,0,300,50'
# AI
Invoke-FlaUIAi -SubCommand init -Model 'C:\models\v.gguf' -Proj 'C:\models\p.gguf'
Invoke-FlaUIAi -SubCommand status
Invoke-FlaUIAi -SubCommand describe -Prompt 'What buttons are visible?'
Invoke-FlaUIAi -SubCommand ask -Prompt 'Is there an error message?'
Invoke-FlaUIAi -SubCommand file -Value 'C:\screen.png' -Prompt 'Describe this.'
# Send raw JSON (advanced)
Send-FlaUICommand @{ command='find'; window='Notepad'; elementName='Text Editor' }
# Disconnect
Disconnect-FlaUI
| Cmdlet | Key Parameters | Description |
|---|---|---|
| Connect-FlaUI | PipeName, TimeoutMs | Connect to the pipe server |
| Disconnect-FlaUI | — | Close the connection |
| Send-FlaUICommand | Request (hashtable) | Send a raw JSON command |
| Get-FlaUIWindows | — | List open window titles |
| Get-FlaUIStatus | — | Show current window/element |
| Get-FlaUIHelp | — | Server command reference |
| Get-FlaUIElements | Type | List elements in current window |
| Find-FlaUIElement | Window, Id, Name, Type | Find a window and element |
| Invoke-FlaUIAction | Action, Value | Execute action on current element |
| Invoke-FlaUIOcr | Region | OCR current element or region |
| Invoke-FlaUICapture | Target, Value | Capture screen/window/element(s); returns base64 PNG in data |
| Invoke-FlaUIAi | SubCommand, Model, Proj, Prompt, Value | Multimodal AI sub-commands |
The pipe connection is session-based: window and element state are preserved across calls within a single Connect-FlaUI / Disconnect-FlaUI session. Use Find-FlaUIElement to select a target, then call Invoke-FlaUIAction as many times as needed without re-finding.
Use Scripts\apex.cmd — a batch helper that wraps the HTTP server with simpler positional syntax. Requires the HTTP server to be started first and curl (built-in on Windows 10+).
:: Optional: override port
set APEX_PORT=8080
:: Discovery
apex windows
apex status
apex elements
apex elements Button
apex help
:: Find a window and element
apex find Notepad
apex find "My App" id=btnOK
apex find Notepad name="Text Editor" type=Edit
:: Execute actions
apex exec click
apex exec type value=Hello
apex exec gettext
apex exec screenshot
:: Capture
apex capture
apex capture action=screen
apex capture action=window
apex capture action=elements value=42,105,106
:: OCR
apex ocr
apex ocr 0,0,300,50
:: AI
apex ai status
apex ai init model=C:\models\v.gguf proj=C:\models\p.gguf
apex ai describe
apex ai describe prompt="What do you see?"
apex ai ask prompt="Is there an error message?"
apex ai file value=C:\screen.png prompt="Describe this."
Add Scripts\ to your PATH (or copy apex.cmd next to your scripts) to use it from any directory.
The AI command set is backed by MtmdHelper, which uses LLamaSharp to run a local multimodal (vision + audio) LLM. No cloud API is required.
Download a vision-capable GGUF model and its multimodal projector (e.g. LFM2-VL from LM Studio) and note the paths to both .gguf files. Then call ai init before any inference commands.
| Sub-action | Required params | Optional params | Description |
|---|---|---|---|
| init | model=<path> proj=<path> | — | Load the LLM and projector into memory |
| status | — | — | Report whether the model is loaded and which modalities it supports |
| describe | — (uses current element) | prompt=<text> | Capture the current UI element as an image and ask the vision model to describe it |
| ask | prompt=<text> | — | Ask a specific question about the current UI element (captures element image) |
| file | value=<file path> | prompt=<text> | Send an image or audio file from disk to the model |
Note: describe, ask, and file require a prior find command to select a window/element. The model must be initialized with init before any inference call. Each inference call starts completely fresh — no chat history is retained between calls.
The HTTP test console (GET /) has a dedicated AI Vision button group (purple-tinted):
| Button | Endpoint | Value field |
|---|---|---|
| status | GET /ai/status | — |
| describe | POST /ai/describe | Optional prompt (e.g. list all buttons) |
| ask | POST /ai/ask | Required question (e.g. what number is shown?) |
Select an element in the Elements panel first, then click describe or ask. The console shows a "Running vision model…" notice immediately and updates with the result when inference completes.
The UI Map Renderer scans the current window's accessibility tree and renders every element's bounding rectangle as a colour-coded overlay. Each control type gets a deterministic, visually distinct colour. Element names are drawn inside the bounding box.
# Returns base64-encoded PNG of the current window's element tree
curl http://localhost:8080/uimap
Requires a prior find call to select a window. The response data.result field contains the base64 PNG — identical format to the /capture endpoints. In the interactive test console, the UI map button (in the Capture group) renders the result inline in the response log.
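The "deterministic, visually distinct colour" per control type can be approximated in a few lines. This is only an illustration of the idea — the app's actual palette logic lives inside UiMapRenderer and may differ:

```python
# Illustrative sketch: derive a stable RGB colour from a control-type name
# by hashing it. Same input always yields the same colour; different
# control types almost always get visibly different colours.
import hashlib

def colour_for(control_type):
    digest = hashlib.sha256(control_type.encode()).digest()
    return tuple(digest[:3])   # (r, g, b), stable across runs

print(colour_for("Button") == colour_for("Button"))   # deterministic
print(colour_for("Button") != colour_for("Edit"))     # distinct types differ
```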
Tools → Render UI Map draws the overlay directly on screen for 5 seconds (press Escape to dismiss early) and offers to save it as a PNG file. This also triggers a live screen overlay, which is not available via the HTTP API.
Tools → Output UI Map logs the raw nested JSON element tree to the console tab — useful for inspecting the tree structure or copying it for use with an AI agent.
Element JSON includes bounding rectangles:
{
"id": 105,
"controlType": "Button",
"name": "OK",
"automationId": "btn_ok",
"boundingRectangle": { "x": 120, "y": 340, "width": 80, "height": 30 },
"children": []
}
| Action | Aliases | Value | Description |
|---|---|---|---|
| click | — | — | Smart click: Invoke → Toggle → SelectionItem → mouse fallback |
| mouse-click | mouseclick | — | Force mouse left-click (bypasses smart chain) |
| middle-click | middleclick | — | Middle-mouse-button click |
| invoke | — | — | Invoke pattern directly |
| right-click | rightclick | — | Right-click |
| double-click | doubleclick | — | Double-click |
| click-at | clickat | x,y | Click at pixel offset from element top-left |
| drag | — | x,y | Drag element to screen coordinates |
| hover | — | — | Move mouse over element |
| highlight | — | — | Draw orange highlight around element for 1 second |
| focus | — | — | Set keyboard focus |
| keys | — | text | Send keystrokes; supports {CTRL}, {ALT}, {SHIFT}, {F5}, Ctrl+A, Alt+F4, etc. |
| screenshot | capture | — | Save element image to Desktop\Apex_Captures |
| describe | — | — | Return full element property description (UIA properties — not AI vision) |
| patterns | — | — | List automation patterns supported by the element |
| bounds | — | — | Return bounding rectangle |
| isenabled | — | — | Returns True or False |
| isvisible | — | — | Returns True or False |
| wait | — | automationId | Wait for element with given AutomationId to appear |
| Action | Aliases | Value | Description |
|---|---|---|---|
| type | enter | text | Enter text (smart: Value pattern → keyboard) |
| insert | — | text | Type at current caret position |
| gettext | text | — | Smart read: Text pattern → Value → LegacyIAccessible → Name |
| getvalue | value | — | Smart read: Value → Text → LegacyIAccessible → Name |
| setvalue | — | text | Smart set: Value pattern (if writable) → RangeValue (if numeric) → keyboard |
| clearvalue | — | — | Set value to empty string via Value pattern |
| appendvalue | — | text | Append text to current value |
| getselectedtext | — | — | Get selected text via Text pattern |
| selectall | — | — | Ctrl+A |
| copy | — | — | Ctrl+C |
| cut | — | — | Ctrl+X |
| paste | — | — | Ctrl+V |
| undo | — | — | Ctrl+Z |
| clear | — | — | Select all and delete |
| Action | Aliases | Value | Description |
|---|---|---|---|
| setrange | — | number | Set RangeValue pattern |
| getrange | — | — | Read current RangeValue |
| rangeinfo | — | — | Min / max / smallChange / largeChange |
| Action | Aliases | Value | Description |
|---|---|---|---|
| toggle | — | — | Toggle CheckBox (cycles state) |
| toggle-on | toggleon | — | Set toggle to On |
| toggle-off | toggleoff | — | Set toggle to Off |
| gettoggle | — | — | Read current toggle state (On / Off / Indeterminate) |
| Action | Aliases | Value | Description |
|---|---|---|---|
| expand | — | — | Expand via ExpandCollapse pattern |
| collapse | — | — | Collapse via ExpandCollapse pattern |
| expandstate | — | — | Read current ExpandCollapse state |
| Action | Aliases | Value | Description |
|---|---|---|---|
| select | — | item text | Select ComboBox/ListBox item by text |
| select-item | selectitem | — | Select current element via SelectionItem pattern |
| addselect | — | — | Add element to multi-selection |
| removeselect | — | — | Remove element from selection |
| isselected | — | — | Returns True or False |
| getselection | — | — | Get selected items from a Selection container |
| select-index | selectindex | n | Select ComboBox/ListBox item by zero-based index |
| getitems | — | — | List all items in a ComboBox or ListBox (newline-separated) |
| getselecteditem | — | — | Get currently selected item text |
| Action | Aliases | Value | Description |
|---|---|---|---|
| minimize | — | — | Minimize window |
| maximize | — | — | Maximize window |
| restore | — | — | Restore window to normal state |
| windowstate | — | — | Read current window visual state (Normal / Maximized / Minimized) |
| Action | Aliases | Value | Description |
|---|---|---|---|
| move | — | x,y | Move element via Transform pattern |
| resize | — | w,h | Resize element via Transform pattern |
| Action | Aliases | Value | Description |
|---|---|---|---|
| scroll-up | scrollup | n (optional) | Scroll up n clicks (default 3) |
| scroll-down | scrolldown | n (optional) | Scroll down n clicks (default 3) |
| scroll-left | scrollleft | n (optional) | Horizontal scroll left n clicks (default 3) |
| scroll-right | scrollright | n (optional) | Horizontal scroll right n clicks (default 3) |
| scrollinto | scrollintoview | — | Scroll element into view |
| scrollpercent | — | h,v | Scroll to h%/v% position via Scroll pattern (0–100) |
| getscrollinfo | — | — | Scroll position and scrollable flags |
| Action | Aliases | Value | Description |
|---|---|---|---|
| griditem | — | row,col | Get element description at grid cell |
| gridinfo | — | — | Row and column counts |
| griditeminfo | — | — | Row / column / span for a GridItem element |
Returns a screen capture inline as a base64-encoded PNG in the data field. Supports four targets.
| Target | Description |
|---|---|
| element (default) | Current element (requires a prior find) |
| window | Current window (requires a prior find) |
| screen | Full display |
| elements | Multiple elements by ID, stitched vertically into one image |
For elements, provide comma-separated numeric IDs from a prior elements scan in the value field.
# Current element
curl -X POST http://localhost:8080/capture
# Full screen
curl -X POST http://localhost:8080/capture \
-H "Content-Type: application/json" \
-d '{"action":"screen"}'
# Current window
curl -X POST http://localhost:8080/capture \
-H "Content-Type: application/json" \
-d '{"action":"window"}'
# Multiple elements stitched into one image
curl -X POST http://localhost:8080/capture \
-H "Content-Type: application/json" \
  -d '{"action":"elements","value":"42,105,106"}'
Response data field contains the base64 PNG. Decode it to get the image:
curl -s -X POST http://localhost:8080/capture -d '{"action":"screen"}' \
  | python -c "import sys,json,base64; d=json.load(sys.stdin)['data']; open('screen.png','wb').write(base64.b64decode(d))"
Telegram: /capture sends the image as a photo message (not text).
/capture
/capture action=screen
/capture action=window
/capture action=elements value=42,105,106
PowerShell:
$r = Send-FlaUICommand @{ command='capture'; action='screen' }
[IO.File]::WriteAllBytes('screen.png', [Convert]::FromBase64String($r.data))
Note: This is distinct from the screenshot exec action, which saves to Desktop\Apex_Captures and returns only the file path.
OCR uses Tesseract. Download language files from github.com/tesseract-ocr/tessdata and place them in a tessdata\ folder next to the executable (e.g. tessdata\eng.traineddata). Additional languages work the same way.
Captures saved by OCR Element + Save go to Desktop\Apex_Captures\.
The AI command set is backed by MtmdHelper using LLamaSharp's multimodal (MTMD) API. Supports vision and audio modalities depending on the model. Every inference call is fully stateless — no chat history is retained between calls.
Download a vision-capable GGUF model and its multimodal projector (e.g. LFM2-VL from LM Studio) and note the paths to both .gguf files. Then call ai init before any inference commands.
ApexComputerUse/
├── Form1.cs / Form1.Designer.cs — Main UI (tabs: Console, Find & Execute, Remote Control, Model)
├── FlaUIHelper.cs — All FlaUI automation wrappers
├── ElementIdGenerator.cs — Stable SHA-256 hash-based element ID mapping
├── CommandProcessor.cs — Shared remote command logic (used by all server types)
├── HttpCommandServer.cs — HTTP REST server (System.Net.HttpListener)
│ ├── ApexResult — Canonical {success, action, data, error} result type
│ ├── FormatAdapter — Format negotiation (HTML / JSON / text / PDF)
│ └── PdfWriter — Minimal PDF generator (no external dependencies)
├── PipeCommandServer.cs — Named-pipe server
├── TelegramController.cs — Telegram bot (Telegram.Bot)
├── OcrHelper.cs — Tesseract OCR wrapper
├── MtmdHelper.cs — Stateless multimodal LLM wrapper (LLamaSharp MTMD)
├── MtmdInteractiveModeExecute.cs — Interactive AI computer use mode (Tools menu)
├── UiMapRenderer.cs — Renders element trees as colour-coded screen overlays and PNG images
└── Scripts/
├── ApexComputerUse.psm1 — PowerShell module (pipe-based)
└── apex.cmd — cmd.exe helper (HTTP-based)
OCR: place Tesseract language files in a tessdata\ folder next to the executable. Not included in the repo — download from github.com/tesseract-ocr/tessdata.



