The Qwen Synthetic Login Benchmark is a reproducible experiment demonstrating how LoRA fine-tuning improves GUI grounding for vision-language models (VLMs) on structured UI automation tasks. This benchmark trains and evaluates Qwen3-VL models on a synthetic login scenario with challenging layout variations.
Key Achievement: Fine-tuned Qwen3-VL-2B (46.9% action accuracy) outperforms both Claude Sonnet 4.5 (12.1%) and GPT-5.1 (18.3%) on this benchmark, demonstrating that specialized fine-tuning can beat general-purpose frontier models.
100% Accuracy with Set-of-Marks (SoM): When using element-based actions (CLICK([1])) instead of coordinates, fine-tuned Qwen3-VL-2B achieves perfect 100% accuracy on both login and registration scenarios.
The benchmark uses a procedurally generated login UI with the following characteristics:
- Username field: Text input for username
- Password field: Text input for password
- Login button: Submit button
- Decoy element: "Help" button that should be ignored
A standard login episode consists of 7 steps:
- Step 0: Initial screen observation, action: `WAIT()`
- Step 1: Click username field, action: `CLICK(x=0.35, y=0.31)`
- Step 2: Type username, action: `TYPE(text="demo")`
- Step 3: Click password field, action: `CLICK(x=0.35, y=0.45)`
- Step 4: Type password, action: `TYPE(text="pass")`
- Step 5: Click login button, action: `CLICK(x=0.50, y=0.68)`
- Step 6: Task complete, action: `DONE()`
To prevent overfitting and test robustness, the synthetic generator includes:
- Layout jitter: UI elements shift up to ±10 pixels per episode
- Decoy elements: "Help" button that the agent should ignore
- Randomized styling: Colors, fonts, and spacing vary slightly
- Deterministic seeds: Reproducible evaluation sets
These features ensure that models must learn semantic understanding rather than memorizing pixel-perfect coordinates.
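As an illustration of how seeded jitter keeps layouts varied yet reproducible, here is a minimal sketch. The names (`BASE_LAYOUT`, `jitter_layout`) and the base coordinates are hypothetical, not the actual `openadapt_ml.ingest.synthetic` API:

```python
import random

# Hypothetical base layout: element name -> normalized (x, y) center
BASE_LAYOUT = {
    "username_field": (0.35, 0.31),
    "password_field": (0.35, 0.45),
    "login_button": (0.50, 0.68),
}

def jitter_layout(seed: int, max_px: int = 10, width: int = 480, height: int = 360):
    """Shift each element by up to +/-max_px pixels, deterministically per seed."""
    rng = random.Random(seed)  # per-episode RNG -> reproducible eval sets
    jittered = {}
    for name, (x, y) in BASE_LAYOUT.items():
        dx = rng.randint(-max_px, max_px) / width
        dy = rng.randint(-max_px, max_px) / height
        jittered[name] = (round(x + dx, 4), round(y + dy, 4))
    return jittered
```

The same seed always yields the same layout, so an evaluation set generated with a fixed seed is stable across runs, while different episodes still see different pixel positions.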
The benchmark uses YAML configuration files in configs/:
Standard coordinate-based training (`configs/qwen3vl_synthetic_dev.yaml`):

```yaml
model:
  name: Qwen/Qwen3-VL-2B-Instruct
  load_in_4bit: false

lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  bias: none
  target_modules:
    - q_proj
    - v_proj
  task_type: CAUSAL_LM

weights_path: checkpoints/qwen3vl2b_login_lora_epjit_v2

synthetic_data:
  num_sessions: 32
  seed: 123
  output_dir: synthetic_train_dev

training:
  num_train_epochs: 4
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 1
  learning_rate: 1.0e-4
  warmup_ratio: 0.03
  weight_decay: 0.0
  max_grad_norm: 1.0
  logging_steps: 1
```

Set-of-Marks (SoM) training (`configs/qwen3vl_synthetic_som.yaml`):
```yaml
model:
  name: Qwen/Qwen3-VL-2B-Instruct
  load_in_4bit: false

lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  bias: none
  target_modules:
    - q_proj
    - v_proj
  task_type: CAUSAL_LM

weights_path: checkpoints/qwen3vl2b_login_lora_som

synthetic_data:
  num_sessions: 32
  seed: 123
  output_dir: synthetic_train_som
  use_som: true  # Enable Set-of-Marks overlays

training:
  num_train_epochs: 2
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 1
  learning_rate: 5.0e-5
  warmup_ratio: 0.1
  weight_decay: 0.01
  max_grad_norm: 0.5
  logging_steps: 10
```

Full benchmark (train + eval + plot):
```shell
# Standard coordinate-based training
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
    --config configs/qwen3vl_synthetic_dev.yaml \
    --out-dir experiments/qwen_login/2b_dev

# Include API model comparison (Claude Sonnet 4.5 + GPT-5.1)
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
    --config configs/qwen3vl_synthetic_dev.yaml \
    --out-dir experiments/qwen_login/2b_dev \
    --include-all-apis
```

Training only:
```shell
# Train a LoRA adapter
uv run python -m openadapt_ml.scripts.train \
    --config configs/qwen3vl_synthetic_dev.yaml
```

SoM mode training:
```shell
# Train with Set-of-Marks visual prompting
uv run python -m openadapt_ml.scripts.train \
    --config configs/qwen3vl_synthetic_som.yaml
```

- Synthetic Data Generation: 32 sessions (32 episodes) are generated with jittered layouts
- SFT Dataset Building: Each step is converted to a chat-style sample with system prompt + user query + assistant action
- LoRA Fine-tuning: Only the `q_proj` and `v_proj` layers are trained with rank-8 LoRA adapters
- Checkpoint Saving: Trained adapters are saved to the `checkpoints/` directory
Training Time: ~10-15 minutes on Apple Silicon M1/M2, ~5 minutes on A10 GPU
Memory Usage: ~8GB VRAM for 2B model, ~16GB for 8B model
The benchmark tracks four primary metrics and several auxiliary metrics:
1. Action Type Accuracy
- Percentage of steps where predicted action type matches ground truth
- Types: `CLICK`, `TYPE`, `WAIT`, `DONE`
- Example: If the model predicts `CLICK` when the ground truth is `TYPE`, the step counts as incorrect
- Target: >90% for production use
2. Mean Coordinate Error
- Average normalized L2 distance between predicted and GT click coordinates
- Only computed for `CLICK` actions with valid coordinates
- Normalized to [0, 1] range (0 = perfect match, 1 = diagonal distance)
- Target: <0.05 (5% of screen diagonal)
3. Click Hit Rate
- Percentage of clicks within 5% radius of ground truth center
- Point-based evaluation: treats click as hitting a circular target
- Target: >95% for reliable automation
4. Episode Success Rate
- Percentage of episodes where ALL steps match exactly (strict evaluation)
- Requires perfect action type AND coordinate accuracy for entire episode
- Most challenging metric; sensitive to any single mistake
- Target: >80% for production deployment
5. Mean Episode Progress
- Average percentage of correct steps per episode (partial credit)
- More forgiving than episode success rate
- Useful for diagnosing where models typically fail
6. Mean Episode Step Score
- Average "full step correctness" (action type match + click hit for clicks)
- Stricter than progress, more forgiving than episode success
7. Weak Episode Success Rate
- Semantic milestone-based evaluation (typed username, typed password, clicked login, emitted done)
- Allows some mistakes as long as key milestones are achieved
8. Element Accuracy (SoM mode only)
- Percentage of steps where predicted element ID matches ground truth
- Only applicable when using Set-of-Marks overlays
9. Bbox Hit Rate
- Percentage of clicks landing anywhere within element bounding box
- More forgiving than point-based click hit rate
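The per-step scoring behind the first three metrics can be sketched in a few lines. This is an illustrative approximation (plain L2 distance in normalized coordinates, 5% hit radius); the real implementation lives in `openadapt_ml/evals/trajectory_matching.py` and may differ in details such as diagonal normalization:

```python
import math

def step_metrics(pred: dict, gt: dict, hit_radius: float = 0.05):
    """Score one step: action-type match, coordinate error, click hit.

    pred and gt are dicts like {"type": "CLICK", "x": 0.35, "y": 0.31}.
    coord_error and click_hit are None when the step is not a matched CLICK.
    """
    type_match = pred["type"] == gt["type"]
    coord_error = None
    click_hit = None
    if type_match and gt["type"] == "CLICK":
        # L2 distance between predicted and ground-truth normalized coords
        coord_error = math.hypot(pred["x"] - gt["x"], pred["y"] - gt["y"])
        click_hit = coord_error <= hit_radius
    return type_match, coord_error, click_hit
```

Aggregating `type_match` over all steps gives action type accuracy; averaging `coord_error` over matched clicks gives mean coordinate error; the fraction of `click_hit` gives click hit rate.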
Evaluate base model (no fine-tuning):
```shell
uv run python -m openadapt_ml.scripts.eval_policy \
    --config configs/qwen3vl_synthetic_dev.yaml \
    --backend qwen3 \
    --ignore-lora \
    --output-json experiments/qwen_login/eval_base.json
```

Evaluate fine-tuned model:
```shell
uv run python -m openadapt_ml.scripts.eval_policy \
    --config configs/qwen3vl_synthetic_dev.yaml \
    --backend qwen3 \
    --output-json experiments/qwen_login/eval_ft.json
```

Evaluate API models (requires API keys in `.env`):
```shell
# Claude Sonnet 4.5
uv run python -m openadapt_ml.scripts.eval_policy \
    --config configs/qwen3vl_synthetic_dev.yaml \
    --backend claude \
    --output-json experiments/qwen_login/eval_claude.json

# GPT-5.1
uv run python -m openadapt_ml.scripts.eval_policy \
    --config configs/qwen3vl_synthetic_dev.yaml \
    --backend openai \
    --output-json experiments/qwen_login/eval_gpt.json
```

SoM mode evaluation:
```shell
uv run python -m openadapt_ml.scripts.eval_policy \
    --config configs/qwen3vl_synthetic_som.yaml \
    --backend qwen3 \
    --dsl-mode som \
    --output-json experiments/qwen_login/eval_som.json
```

All evaluation runs produce a consistent JSON schema:
```jsonc
{
  "config_path": "configs/qwen3vl_synthetic_dev.yaml",
  "backend": "qwen3",
  "dsl_mode": "coord",              // or "som"
  "metrics": {
    "num_episodes": 32,
    "num_steps": 224,
    "action_type_accuracy": 0.469,
    "mean_coord_error": 0.051,
    "coord_error_count": 19,
    "episode_success_rate": 0.0,
    "click_hit_rate": 0.850,
    "bbox_hit_rate": 0.900,
    "mean_episode_progress": 0.532,
    "mean_episode_step_score": 0.489,
    "weak_episode_success_rate": 0.125,
    "state_success_rate": null,
    "element_accuracy": null        // only for SoM mode
  }
}
```

This stable schema enables reproducible comparisons across model versions and training runs.
| Model | Type | Action Accuracy | Coord Error | Click Hit Rate | Episode Success |
|---|---|---|---|---|---|
| Qwen3-VL-2B base | Offline | 14.3% | N/A | N/A | 0% |
| Qwen3-VL-2B FT | Offline | 46.9% | 0.051 | 85.0% | 0% |
| Qwen3-VL-8B base | Offline | 14.3% | N/A | N/A | 0% |
| Qwen3-VL-8B FT | Offline | 28.6% | 0.004 | 100% | 0% |
| Claude Sonnet 4.5 | API | 12.1% | 0.757 | 0% | 0% |
| GPT-5.1 | API | 18.3% | 0.057 | 60.0% | 0% |
- Fine-tuning delivers massive gains: Both 2B and 8B models show 2-3x improvement in action accuracy after fine-tuning
- Small fine-tuned models beat large APIs: Qwen3-VL-2B FT (46.9%) outperforms both Claude Sonnet 4.5 (12.1%) and GPT-5.1 (18.3%)
- Precision matters: Fine-tuned models have excellent click precision (85-100% hit rate, <0.05 coord error) while API models struggle
- Size vs specialization: The fine-tuned 2B model outperforms the general-purpose Claude Sonnet 4.5, showing that domain-specific fine-tuning trumps raw model size
The benchmark automatically generates comparison plots using plot_eval_metrics.py:
Color coding:
- Light blue (#4A90E2): Qwen3-VL-2B
- Dark blue (#2E5C8A): Qwen3-VL-8B
- Orange (#FF6B35): Claude API
- Red (#C1121F): GPT API
Hatch patterns:
- Solid fill: Base/pretrained models
- Diagonal stripes (///): Fine-tuned models
When using Set-of-Marks visual prompting, fine-tuned models achieve 100% accuracy:
| Scenario | Steps | Elements | Action Acc | Element Acc | Episode Success |
|---|---|---|---|---|---|
| Login | 6 | 3 | 100% | 100% | 100% |
| Registration | 12 | 6 | 100% | 100% | 100% |
Instead of predicting precise coordinates (CLICK(x=0.42, y=0.31)), the model selects numbered UI elements (CLICK([1])). This reduces spatial reasoning to element selection, which small models handle perfectly.
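One way to see why this works: with SoM, resolving a click becomes an index lookup into the overlay registry rather than a spatial prediction. A hypothetical sketch (the `OVERLAYS` table and `resolve_som_click` helper are illustrative, not the repo's actual overlay code):

```python
import re

# Hypothetical overlay registry: overlay number -> normalized bbox (x0, y0, x1, y1)
OVERLAYS = {
    0: (0.25, 0.27, 0.55, 0.35),  # username field
    1: (0.25, 0.41, 0.55, 0.49),  # password field
    2: (0.40, 0.64, 0.60, 0.72),  # login button
}

_SOM_CLICK = re.compile(r"CLICK\(\[(\d+)\]\)")

def resolve_som_click(action: str):
    """Map CLICK([n]) to the center of element n's bounding box, else None."""
    match = _SOM_CLICK.fullmatch(action.strip())
    if match is None:
        return None
    x0, y0, x1, y1 = OVERLAYS[int(match.group(1))]
    return ((x0 + x1) / 2, (y0 + y1) / 2)
```

The model only has to emit the right integer; the overlay registry supplies the pixel-accurate target, which is why small models can be perfect here.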
Example SoM Actions:
- `CLICK([0])`: Click the element with overlay number [0]
- `TYPE(text="demo")`: Type text (unchanged)
- `DONE()`: Task complete (unchanged)
| Approach | Login Acc | Registration Acc | Cost | Latency |
|---|---|---|---|---|
| Claude API + SoM | 100% | 100%* | ~$0.01/step | ~500ms |
| GPT-5.1 API + SoM | 100% | 100%* | ~$0.01/step | ~500ms |
| Qwen 2B + SoM | 100% | 100% | Free (local) | ~50ms |
*API results on registration pending full evaluation
Advantages:
- Perfect accuracy on structured UIs
- 10x faster inference (50ms vs 500ms)
- Zero API costs
- Works on small models (2B parameters)
Limitations:
- Requires element detection system (SoM overlay generation)
- Not applicable to free-form image interactions
- Additional engineering complexity
Use SoM when:
- Working with web UIs (HTML DOM available)
- Using accessibility trees (desktop apps)
- Need guaranteed accuracy for production automation
- Latency and cost are critical
Use coordinates when:
- Working with arbitrary images (screenshots, PDFs, games)
- No structured element information available
- Exploring generalization to novel UI types
The plotting system supports flexible multi-model comparisons:
```shell
# Compare base vs fine-tuned
python -m openadapt_ml.evals.plot_eval_metrics \
    --files experiments/qwen_login/eval_base.json \
            experiments/qwen_login/eval_ft.json \
    --labels "Qwen3-2B base" "Qwen3-2B FT" \
    --output experiments/qwen_login/base_vs_ft.png

# Comprehensive comparison (6 models)
python -m openadapt_ml.evals.plot_eval_metrics \
    --files experiments/qwen_login/eval_2b_base.json \
            experiments/qwen_login/eval_2b_ft.json \
            experiments/qwen_login/eval_8b_base.json \
            experiments/qwen_login/eval_8b_ft.json \
            experiments/qwen_login/eval_claude.json \
            experiments/qwen_login/eval_gpt.json \
    --labels "Qwen3-2B base" "Qwen3-2B FT" \
             "Qwen3-8B base" "Qwen3-8B FT" \
             "Claude Sonnet 4.5" "GPT-5.1" \
    --output experiments/qwen_login/comprehensive_comparison.png
```

Automatic color coding:
- Model type detection from label text ("2b", "8b", "claude", "gpt")
- Consistent colors across all plots
- Clear visual distinction between model families
Hatch patterns:
- Base/pretrained models: solid fill
- Fine-tuned models: diagonal stripes (///)
- Automatically detected from "ft", "fine", or "finetuned" in label
Layout:
- Multi-panel figure with one subplot per metric
- Grouped bars for easy comparison
- Rotated x-axis labels for readability
- Comprehensive legend explaining color coding and patterns
Customization:
- Supports arbitrary number of models
- Accepts any combination of eval JSON files
- Scales automatically to data range
- 150 DPI output for publication quality
- Python 3.12 with the `uv` package manager
- GPU (optional but recommended):
  - CUDA GPU with 8GB+ VRAM for the 2B model, 16GB+ for the 8B model
  - Apple Silicon (M1/M2/M3) with 16GB+ unified memory
  - CPU-only training is possible but slow
- API keys (optional, for API model comparison):
  - `ANTHROPIC_API_KEY` for Claude Sonnet 4.5
  - `OPENAI_API_KEY` for GPT-5.1
```shell
# Clone repository
git clone https://github.com/OpenAdaptAI/openadapt-ml.git
cd openadapt-ml

# Install dependencies
uv sync

# Optional: Install API dependencies
uv sync --extra api

# Configure API keys (if using --include-all-apis)
cp .env.example .env
# Edit .env with your API keys
```

```shell
# Standard coordinate mode (2B model, 4 epochs)
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
    --config configs/qwen3vl_synthetic_dev.yaml \
    --out-dir experiments/qwen_login/2b_dev

# With API comparison
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
    --config configs/qwen3vl_synthetic_dev.yaml \
    --out-dir experiments/qwen_login/2b_dev \
    --include-all-apis

# SoM mode (2B model, 2 epochs)
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
    --config configs/qwen3vl_synthetic_som.yaml \
    --out-dir experiments/qwen_login/som_eval
```

After running the benchmark, `experiments/qwen_login/2b_dev/` will contain:
```
experiments/qwen_login/2b_dev/
├── eval/
│   ├── eval_qwen_base.json    # Base model metrics
│   ├── eval_qwen_ft.json      # Fine-tuned metrics
│   ├── eval_claude.json       # Claude API metrics (if --include-all-apis)
│   └── eval_gpt51.json        # GPT API metrics (if --include-all-apis)
└── plots/
    └── qwen_base_vs_ft.png    # Comparison plot
```
- Training: 10-15 minutes (Apple Silicon), 5-10 minutes (CUDA GPU)
- Evaluation per model: 2-3 minutes (local), 5-10 minutes (API due to rate limits)
- Total benchmark: ~20-30 minutes without APIs, ~40-60 minutes with APIs
Check that your results match the published benchmark:
Qwen3-VL-2B FT should achieve:
- Action type accuracy: ~40-50%
- Click hit rate: ~80-95%
- Mean coord error: <0.06
Qwen3-VL-2B SoM should achieve:
- Action type accuracy: 100%
- Element accuracy: 100%
- Episode success rate: 100%
If results differ significantly, check:
- Random seed is set correctly (seed: 123 in config)
- Jitter is enabled (default: true)
- Training completed without errors
- Checkpoint was saved and loaded correctly
The synthetic login UI is generated by openadapt_ml/ingest/synthetic.py:
Key functions:
- `_compute_login_layout()`: Samples a layout with optional jitter
- `_render_login_ui()`: Draws UI elements to a PIL image
- `_script_login_episode()`: Creates the 7-step episode with actions
- `generate_synthetic_sessions()`: Top-level API for dataset generation
Rendering pipeline:
- Sample layout (with jitter if enabled)
- Render background
- Draw text labels ("Username:", "Password:")
- Draw input boxes (white rectangles)
- Draw buttons (login button, decoy help button)
- Add text content based on step (typed username/password)
- Save frame as PNG
- Record action (CLICK/TYPE/WAIT/DONE)
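The drawing steps above can be sketched with PIL. This is illustrative only, with made-up coordinates and colors; the actual rendering code in `synthetic.py` differs:

```python
from PIL import Image, ImageDraw

def render_login_frame(width: int = 480, height: int = 360) -> Image.Image:
    """Draw a minimal login frame: labels, input boxes, and two buttons."""
    img = Image.new("RGB", (width, height), (230, 232, 236))  # background
    draw = ImageDraw.Draw(img)
    # Text labels
    draw.text((60, 95), "Username:", fill="black")
    draw.text((60, 145), "Password:", fill="black")
    # Input boxes (white rectangles with a border)
    draw.rectangle([140, 90, 340, 115], fill="white", outline="black")
    draw.rectangle([140, 140, 340, 165], fill="white", outline="black")
    # Login button and the decoy "Help" button
    draw.rectangle([190, 230, 290, 260], fill=(70, 120, 220))
    draw.text((218, 238), "Login", fill="white")
    draw.rectangle([360, 20, 430, 45], fill=(200, 200, 200))
    draw.text((378, 26), "Help", fill="black")
    return img
```

Each episode step re-renders the frame (adding typed text as the episode progresses), saves it as a PNG, and records the corresponding action.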
The training loop is implemented in openadapt_ml/training/trainer.py:
Key steps:
- Generate synthetic sessions
- Flatten to episodes
- Build SFT samples (chat format)
- Load VLM adapter with LoRA
- Run supervised training loop:
- Forward pass with mixed precision
- Compute loss on assistant tokens only
- Backward pass with gradient accumulation
- Optimizer step with gradient clipping
- Log loss and learning rate
- Save LoRA adapter to checkpoint
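"Loss on assistant tokens only" is implemented by label masking: positions outside the assistant reply get the ignore index so they contribute no gradient. A framework-agnostic sketch (the real version operates on tokenizer output inside the Qwen adapter):

```python
IGNORE_INDEX = -100  # conventional ignore label for cross-entropy losses

def mask_labels(token_ids, assistant_start, assistant_end):
    """Copy token_ids to labels, ignoring everything outside the assistant span.

    assistant_start/assistant_end delimit the assistant reply as a half-open
    interval of token indices; all other positions are set to IGNORE_INDEX.
    """
    return [
        tok if assistant_start <= i < assistant_end else IGNORE_INDEX
        for i, tok in enumerate(token_ids)
    ]
```

This way the system prompt and user query condition the model but are never themselves training targets.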
Offline evaluation is implemented in openadapt_ml/evals/trajectory_matching.py:
Key steps:
- Generate fresh synthetic episodes (different seed)
- Build SFT samples
- Load policy (base or fine-tuned)
- For each step:
- Run policy inference
- Parse predicted action
- Compare to ground truth
- Compute metrics (type match, coord error, click hit)
- Aggregate metrics across episodes
- Save JSON output
The action parser in openadapt_ml/runtime/policy.py uses regex patterns:
```python
_CLICK_RE = re.compile(
    r"CLICK\(x=([\d.]+),\s*y=([\d.]+)\)|CLICK\(\[([\d]+)\]\)"
)
_TYPE_RE = re.compile(r'TYPE\(text="([^"\\]*(?:\\.[^"\\]*)*)"\)')
_WAIT_RE = re.compile(r"\bWAIT\s*\(\s*\)")
_DONE_RE = re.compile(r"\bDONE\s*\(\s*\)")
```

The parser handles:
- Coordinate clicks: `CLICK(x=0.42, y=0.31)`
- Element clicks: `CLICK([1])`
- Type with escaping: `TYPE(text="hello \"world\"")`
- Wait: `WAIT()`
- Done: `DONE()`
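Putting the patterns to work, a compact dispatcher might look like the sketch below (the regexes are the ones shown above; the structure of the actual parser in `policy.py` may differ):

```python
import re

_CLICK_RE = re.compile(r"CLICK\(x=([\d.]+),\s*y=([\d.]+)\)|CLICK\(\[([\d]+)\]\)")
_TYPE_RE = re.compile(r'TYPE\(text="([^"\\]*(?:\\.[^"\\]*)*)"\)')
_WAIT_RE = re.compile(r"\bWAIT\s*\(\s*\)")
_DONE_RE = re.compile(r"\bDONE\s*\(\s*\)")

def parse_action(text: str) -> dict:
    """Parse one generated action string into a structured dict."""
    if m := _CLICK_RE.search(text):
        if m.group(3) is not None:  # CLICK([n]) element form
            return {"type": "CLICK", "element": int(m.group(3))}
        return {"type": "CLICK", "x": float(m.group(1)), "y": float(m.group(2))}
    if m := _TYPE_RE.search(text):
        return {"type": "TYPE", "text": m.group(1)}
    if _WAIT_RE.search(text):
        return {"type": "WAIT"}
    if _DONE_RE.search(text):
        return {"type": "DONE"}
    return {"type": "INVALID", "raw": text}
```

Note that `_CLICK_RE` covers both action forms in one alternation: groups 1-2 carry the coordinates, group 3 carries the element ID, and exactly one side matches.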
VLM adapters implement a common interface in openadapt_ml/models/base_adapter.py:
```python
class BaseVLMAdapter:
    def prepare_inputs(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Convert SFT samples to model inputs."""
        raise NotImplementedError

    def compute_loss(self, batch: List[Dict[str, Any]], model_outputs: Any) -> torch.Tensor:
        """Compute loss on assistant tokens only."""
        raise NotImplementedError

    def generate(self, sample: Dict[str, Any]) -> str:
        """Generate action text for a single sample."""
        raise NotImplementedError
```

Implemented adapters:

- `QwenVLAdapter`: Qwen3-VL and Qwen2.5-VL support
- `ApiVLMAdapter`: Claude Sonnet 4.5 and GPT-5.1 API wrappers
- `DummyAdapter`: Lightweight mock for testing
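To illustrate how little is needed to satisfy the interface, here is a mock in the spirit of `DummyAdapter` (the class below, including the re-stated base class, is a hypothetical sketch; the shipped implementation differs):

```python
from typing import Any, Dict, List

class BaseVLMAdapter:
    """Minimal re-statement of the interface for this sketch."""
    def prepare_inputs(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
        raise NotImplementedError
    def compute_loss(self, batch: List[Dict[str, Any]], model_outputs: Any):
        raise NotImplementedError
    def generate(self, sample: Dict[str, Any]) -> str:
        raise NotImplementedError

class ScriptedDummyAdapter(BaseVLMAdapter):
    """Returns canned actions in order -- handy for testing eval plumbing
    without loading any model weights."""
    def __init__(self, script: List[str]):
        self._script = script
        self._step = 0

    def generate(self, sample: Dict[str, Any]) -> str:
        action = self._script[self._step % len(self._script)]
        self._step += 1
        return action
```

Because the evaluation loop only calls `generate()`, a scripted mock like this exercises parsing, metric computation, and JSON output end to end.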
Out of memory (OOM):
- Reduce `num_sessions` in the config (try 16 or 8)
- Enable 4-bit quantization: `load_in_4bit: true`
- Use a smaller model (2B instead of 8B)
- Reduce the LoRA rank: `r: 4` instead of `r: 8`
Loss not decreasing:
- Check learning rate (try 5e-5 to 2e-4)
- Increase warmup: `warmup_ratio: 0.1`
- Check gradient clipping: `max_grad_norm: 1.0`
- Verify data is being shuffled
Checkpoint not loading:
- Check that `weights_path` in the config matches the saved checkpoint
- Verify the LoRA config (r, alpha, target_modules) matches training
- Try loading base model first to rule out adapter issues
Poor base model performance:
- Expected! Base models typically get 10-20% action accuracy
- Fine-tuning is necessary for good performance
API models failing:
- Check API keys are set in `.env`
- Verify the API key environment variables: `echo $ANTHROPIC_API_KEY`
- Check API rate limits and quotas
- Try `--skip-train` to isolate API issues
Metrics are null/missing:
- `coord_error` is null when no CLICK actions are predicted
- `click_hit_rate` is null when no clicks have coordinates
- `element_accuracy` is null unless using SoM mode
- This is expected behavior, not a bug
Results don't match published numbers:
- Verify the random seed: `seed: 123` in the config
- Check the training epochs (4 for standard, 2 for SoM)
- Ensure jitter is enabled (default)
- Confirm checkpoint loaded correctly
- Check model version (Qwen3-VL vs Qwen2.5-VL)
Evaluation set differs:
- Use a different seed for eval than for training
- Set `eval_on_training_data: false` (default)
- Check that `output_dir` ends with the `_eval` suffix
This benchmark demonstrates that small fine-tuned models can outperform large general-purpose APIs on structured tasks. Potential next steps:
- More scenarios: Add settings panel, form filling, multi-step workflows
- Real UI testing: Test on actual web pages and desktop apps
- Larger training sets: Scale to 100s or 1000s of episodes
- Architecture improvements: Try different LoRA ranks, target modules, learning rates
- Multi-task learning: Train on multiple scenarios simultaneously
- Transfer learning: Evaluate on held-out scenario types
See docs/roadmap.md for the full prioritized roadmap.
If you use this benchmark in your research, please cite:
```bibtex
@software{openadapt_ml_2024,
  title  = {OpenAdapt-ML: Model-Agnostic GUI Automation with Vision-Language Models},
  author = {OpenAdapt AI Team},
  year   = {2024},
  url    = {https://github.com/OpenAdaptAI/openadapt-ml}
}
```

MIT License. See the LICENSE file for details.
For questions, issues, or contributions:
- GitHub Issues: https://github.com/OpenAdaptAI/openadapt-ml/issues
- Documentation: https://github.com/OpenAdaptAI/openadapt-ml/tree/main/docs
