Qwen Synthetic Login Benchmark

Overview

The Qwen Synthetic Login Benchmark is a reproducible experiment demonstrating how LoRA fine-tuning improves GUI grounding for vision-language models (VLMs) on structured UI automation tasks. This benchmark trains and evaluates Qwen3-VL models on a synthetic login scenario with challenging layout variations.

Key Achievement: Fine-tuned Qwen3-VL-2B (46.9% action accuracy) outperforms both Claude Sonnet 4.5 (12.1%) and GPT-5.1 (18.3%) on this benchmark, demonstrating that specialized fine-tuning can beat general-purpose frontier models.

100% Accuracy with Set-of-Marks (SoM): When using element-based actions (CLICK([1])) instead of coordinates, fine-tuned Qwen3-VL-2B achieves perfect 100% accuracy on both login and registration scenarios.


Synthetic Login Scenario

The benchmark uses a procedurally generated login UI with the following characteristics:

UI Elements

  • Username field: Text input for username
  • Password field: Text input for password
  • Login button: Submit button
  • Decoy element: "Help" button that should be ignored

Episode Structure

A standard login episode consists of 7 steps:

  1. Step 0: Initial screen observation, action: WAIT()
  2. Step 1: Click username field, action: CLICK(x=0.35, y=0.31)
  3. Step 2: Type username, action: TYPE(text="demo")
  4. Step 3: Click password field, action: CLICK(x=0.35, y=0.45)
  5. Step 4: Type password, action: TYPE(text="pass")
  6. Step 5: Click login button, action: CLICK(x=0.50, y=0.68)
  7. Step 6: Task complete, action: DONE()

Hardening Features

To prevent overfitting and test robustness, the synthetic generator includes:

  • Layout jitter: UI elements shift up to ±10 pixels per episode
  • Decoy elements: "Help" button that the agent should ignore
  • Randomized styling: Colors, fonts, and spacing vary slightly
  • Deterministic seeds: Reproducible evaluation sets

These features ensure that models must learn semantic understanding rather than memorizing pixel-perfect coordinates.
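
For illustration, here is a minimal sketch of how deterministic per-episode layout jitter can be sampled. The element names, base coordinates, and helper function below are hypothetical; the actual logic lives in _compute_login_layout() (see Implementation Details).

# Minimal sketch of per-episode layout jitter; element names and base
# coordinates are illustrative, not the real generator's values.
import random

BASE_LAYOUT = {"username": (120, 90), "password": (120, 130), "login": (160, 200)}

def jitter_layout(seed: int, max_jitter: int = 10) -> dict:
    """Shift each element's (x, y) by up to +/-max_jitter pixels, deterministically."""
    rng = random.Random(seed)  # per-episode seed keeps eval sets reproducible
    return {
        name: (x + rng.randint(-max_jitter, max_jitter),
               y + rng.randint(-max_jitter, max_jitter))
        for name, (x, y) in BASE_LAYOUT.items()
    }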


Training Setup

Configuration Files

The benchmark uses YAML configuration files in configs/:

Standard coordinate-based training (configs/qwen3vl_synthetic_dev.yaml):

model:
  name: Qwen/Qwen3-VL-2B-Instruct
  load_in_4bit: false

lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  bias: none
  target_modules:
    - q_proj
    - v_proj
  task_type: CAUSAL_LM
  weights_path: checkpoints/qwen3vl2b_login_lora_epjit_v2

synthetic_data:
  num_sessions: 32
  seed: 123
  output_dir: synthetic_train_dev

training:
  num_train_epochs: 4
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 1
  learning_rate: 1.0e-4
  warmup_ratio: 0.03
  weight_decay: 0.0
  max_grad_norm: 1.0
  logging_steps: 1

Set-of-Marks (SoM) training (configs/qwen3vl_synthetic_som.yaml):

model:
  name: Qwen/Qwen3-VL-2B-Instruct
  load_in_4bit: false

lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  bias: none
  target_modules:
    - q_proj
    - v_proj
  task_type: CAUSAL_LM
  weights_path: checkpoints/qwen3vl2b_login_lora_som

synthetic_data:
  num_sessions: 32
  seed: 123
  output_dir: synthetic_train_som
  use_som: true  # Enable Set-of-Marks overlays

training:
  num_train_epochs: 2
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 1
  learning_rate: 5.0e-5
  warmup_ratio: 0.1
  weight_decay: 0.01
  max_grad_norm: 0.5
  logging_steps: 10

Training Commands

Full benchmark (train + eval + plot):

# Standard coordinate-based training
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --out-dir experiments/qwen_login/2b_dev

# Include API model comparison (Claude Sonnet 4.5 + GPT-5.1)
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --out-dir experiments/qwen_login/2b_dev \
  --include-all-apis

Training only:

# Train a LoRA adapter
uv run python -m openadapt_ml.scripts.train \
  --config configs/qwen3vl_synthetic_dev.yaml

SoM mode training:

# Train with Set-of-Marks visual prompting
uv run python -m openadapt_ml.scripts.train \
  --config configs/qwen3vl_synthetic_som.yaml

Training Process

  1. Synthetic Data Generation: 32 sessions (one episode per session) are generated with jittered layouts
  2. SFT Dataset Building: Each step is converted to a chat-style sample with system prompt + user query + assistant action
  3. LoRA Fine-tuning: Only q_proj and v_proj layers are trained with rank-8 LoRA adapters
  4. Checkpoint Saving: Trained adapters are saved to checkpoints/ directory
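
To make step 2 concrete, a rough sketch of what one chat-style SFT sample might look like follows; the exact prompt wording, field names, and file path are assumptions for illustration, not the dataset builder's actual output.

# Hypothetical shape of a single SFT sample built from one episode step.
sample = {
    "messages": [
        {"role": "system", "content": "You are a GUI automation agent. Respond with one DSL action."},
        {"role": "user", "content": "Goal: log in as 'demo'. Current screen: <image>"},
        {"role": "assistant", "content": "CLICK(x=0.35, y=0.31)"},
    ],
    # Illustrative path only; real frames are written under the configured output_dir.
    "image_path": "synthetic_train_dev/session_000/step_001.png",
}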

Training Time: ~10-15 minutes on Apple Silicon M1/M2, ~5 minutes on A10 GPU

Memory Usage: ~8GB VRAM for 2B model, ~16GB for 8B model


Evaluation Metrics

The benchmark tracks four primary metrics and several auxiliary metrics:

Primary Metrics

1. Action Type Accuracy

  • Percentage of steps where predicted action type matches ground truth
  • Types: CLICK, TYPE, WAIT, DONE
  • Example: If the model predicts CLICK when the ground truth is TYPE, the step counts as incorrect
  • Target: >90% for production use

2. Mean Coordinate Error

  • Average normalized L2 distance between predicted and GT click coordinates
  • Only computed for CLICK actions with valid coordinates
  • Normalized to the [0, 1] range (0 = perfect match, 1 = the full screen diagonal)
  • Target: <0.05 (5% of screen diagonal)

3. Click Hit Rate

  • Percentage of clicks within 5% radius of ground truth center
  • Point-based evaluation: treats click as hitting a circular target
  • Target: >95% for reliable automation
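
To make metrics 2 and 3 concrete, here is a minimal sketch of both click metrics for a single predicted CLICK. It assumes coordinates are already normalized to [0, 1] and that error is normalized by the unit-square diagonal; the authoritative definitions live in trajectory_matching.py.

# Sketch of coordinate error and point-based hit test for one predicted click.
import math

def coord_error(pred: tuple, gt: tuple) -> float:
    """L2 distance between prediction and ground truth, divided by the
    unit-square diagonal so 1.0 corresponds to the full screen diagonal."""
    return math.dist(pred, gt) / math.sqrt(2)

def click_hit(pred: tuple, gt: tuple, radius: float = 0.05) -> bool:
    """Hit if the prediction lands within 5% of the diagonal from the GT center."""
    return coord_error(pred, gt) <= radius

print(coord_error((0.36, 0.30), (0.35, 0.31)))  # ~0.01 -> counts as a hit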

4. Episode Success Rate

  • Percentage of episodes where ALL steps match exactly (strict evaluation)
  • Requires perfect action type AND coordinate accuracy for entire episode
  • Most challenging metric; sensitive to any single mistake
  • Target: >80% for production deployment

Auxiliary Metrics

5. Mean Episode Progress

  • Average percentage of correct steps per episode (partial credit)
  • More forgiving than episode success rate
  • Useful for diagnosing where models typically fail

6. Mean Episode Step Score

  • Average "full step correctness" (action type match + click hit for clicks)
  • Stricter than progress, more forgiving than episode success

7. Weak Episode Success Rate

  • Semantic milestone-based evaluation (typed username, typed password, clicked login, emitted done)
  • Allows some mistakes as long as key milestones are achieved

8. Element Accuracy (SoM mode only)

  • Percentage of steps where predicted element ID matches ground truth
  • Only applicable when using Set-of-Marks overlays

9. Bbox Hit Rate

  • Percentage of clicks landing anywhere within element bounding box
  • More forgiving than point-based click hit rate

Evaluation Commands

Evaluate base model (no fine-tuning):

uv run python -m openadapt_ml.scripts.eval_policy \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --backend qwen3 \
  --ignore-lora \
  --output-json experiments/qwen_login/eval_base.json

Evaluate fine-tuned model:

uv run python -m openadapt_ml.scripts.eval_policy \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --backend qwen3 \
  --output-json experiments/qwen_login/eval_ft.json

Evaluate API models (requires API keys in .env):

# Claude Sonnet 4.5
uv run python -m openadapt_ml.scripts.eval_policy \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --backend claude \
  --output-json experiments/qwen_login/eval_claude.json

# GPT-5.1
uv run python -m openadapt_ml.scripts.eval_policy \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --backend openai \
  --output-json experiments/qwen_login/eval_gpt.json

SoM mode evaluation:

uv run python -m openadapt_ml.scripts.eval_policy \
  --config configs/qwen3vl_synthetic_som.yaml \
  --backend qwen3 \
  --dsl-mode som \
  --output-json experiments/qwen_login/eval_som.json

Evaluation JSON Schema

All evaluation runs produce a consistent JSON schema:

{
  "config_path": "configs/qwen3vl_synthetic_dev.yaml",
  "backend": "qwen3",
  "dsl_mode": "coord",  // or "som"
  "metrics": {
    "num_episodes": 32,
    "num_steps": 224,
    "action_type_accuracy": 0.469,
    "mean_coord_error": 0.051,
    "coord_error_count": 19,
    "episode_success_rate": 0.0,
    "click_hit_rate": 0.850,
    "bbox_hit_rate": 0.900,
    "mean_episode_progress": 0.532,
    "mean_episode_step_score": 0.489,
    "weak_episode_success_rate": 0.125,
    "state_success_rate": null,
    "element_accuracy": null  // only for SoM mode
  }
}

This stable schema enables reproducible comparisons across model versions and training runs.
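
Because the schema is stable, comparing runs is a matter of loading the JSON files and reading the metrics block. A small example using the eval outputs produced by the commands above:

# Load two eval JSON files (schema shown above) and compare key metrics.
import json

def load_metrics(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["metrics"]

base = load_metrics("experiments/qwen_login/eval_base.json")
ft = load_metrics("experiments/qwen_login/eval_ft.json")

for key in ("action_type_accuracy", "click_hit_rate", "episode_success_rate"):
    print(f"{key}: base={base[key]} ft={ft[key]}")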


Results: Standard Coordinate Mode

Comprehensive Model Comparison

| Model | Type | Action Accuracy | Coord Error | Click Hit Rate | Episode Success |
| --- | --- | --- | --- | --- | --- |
| Qwen3-VL-2B base | Offline | 14.3% | N/A | N/A | 0% |
| Qwen3-VL-2B FT | Offline | 46.9% | 0.051 | 85.0% | 0% |
| Qwen3-VL-8B base | Offline | 14.3% | N/A | N/A | 0% |
| Qwen3-VL-8B FT | Offline | 28.6% | 0.004 | 100% | 0% |
| Claude Sonnet 4.5 | API | 12.1% | 0.757 | 0% | 0% |
| GPT-5.1 | API | 18.3% | 0.057 | 60.0% | 0% |

Key Findings

  1. Fine-tuning delivers massive gains: Both 2B and 8B models show 2-3x improvement in action accuracy after fine-tuning
  2. Small fine-tuned models beat large APIs: Qwen3-VL-2B FT (46.9%) outperforms both Claude Sonnet 4.5 (12.1%) and GPT-5.1 (18.3%)
  3. Precision matters: Fine-tuned models have excellent click precision (85-100% hit rate, <0.05 coord error) while API models struggle
  4. Size vs specialization: The fine-tuned 2B model outperforms the general-purpose Claude Sonnet 4.5, showing that domain-specific fine-tuning trumps raw model size

Visualization

The benchmark automatically generates comparison plots using plot_eval_metrics.py:

(Figure: comprehensive model comparison plot generated by plot_eval_metrics.py.)

Color coding:

  • Light blue (#4A90E2): Qwen3-VL-2B
  • Dark blue (#2E5C8A): Qwen3-VL-8B
  • Orange (#FF6B35): Claude API
  • Red (#C1121F): GPT API

Hatch patterns:

  • Solid fill: Base/pretrained models
  • Diagonal stripes (///): Fine-tuned models

Results: Set-of-Marks (SoM) Mode

Perfect Accuracy Achieved

When using Set-of-Marks visual prompting, fine-tuned models achieve 100% accuracy:

| Scenario | Steps | Elements | Action Acc | Element Acc | Episode Success |
| --- | --- | --- | --- | --- | --- |
| Login | 6 | 3 | 100% | 100% | 100% |
| Registration | 12 | 6 | 100% | 100% | 100% |

How SoM Works

Instead of predicting precise coordinates (CLICK(x=0.42, y=0.31)), the model selects numbered UI elements (CLICK([1])). This reduces spatial reasoning to element selection, which small models handle perfectly.

Example SoM Actions:

  • CLICK([0]) - Click element with overlay number [0]
  • TYPE(text="demo") - Type text (unchanged)
  • DONE() - Task complete (unchanged)

Cost/Latency Comparison

| Approach | Login Acc | Registration Acc | Cost | Latency |
| --- | --- | --- | --- | --- |
| Claude API + SoM | 100% | 100%* | ~$0.01/step | ~500ms |
| GPT-5.1 API + SoM | 100% | 100%* | ~$0.01/step | ~500ms |
| Qwen 2B + SoM | 100% | 100% | Free (local) | ~50ms |

*API results on registration pending full evaluation

When to Use SoM Mode

Advantages:

  • Perfect accuracy on structured UIs
  • 10x faster inference (50ms vs 500ms)
  • Zero API costs
  • Works on small models (2B parameters)

Limitations:

  • Requires element detection system (SoM overlay generation)
  • Not applicable to free-form image interactions
  • Additional engineering complexity

Use SoM when:

  • Working with web UIs (HTML DOM available)
  • Using accessibility trees (desktop apps)
  • Need guaranteed accuracy for production automation
  • Latency and cost are critical

Use coordinates when:

  • Working with arbitrary images (screenshots, PDFs, games)
  • No structured element information available
  • Exploring generalization to novel UI types

Plotting System

Using plot_eval_metrics.py

The plotting system supports flexible multi-model comparisons:

# Compare base vs fine-tuned
python -m openadapt_ml.evals.plot_eval_metrics \
  --files experiments/qwen_login/eval_base.json \
          experiments/qwen_login/eval_ft.json \
  --labels "Qwen3-2B base" "Qwen3-2B FT" \
  --output experiments/qwen_login/base_vs_ft.png

# Comprehensive comparison (6 models)
python -m openadapt_ml.evals.plot_eval_metrics \
  --files experiments/qwen_login/eval_2b_base.json \
          experiments/qwen_login/eval_2b_ft.json \
          experiments/qwen_login/eval_8b_base.json \
          experiments/qwen_login/eval_8b_ft.json \
          experiments/qwen_login/eval_claude.json \
          experiments/qwen_login/eval_gpt.json \
  --labels "Qwen3-2B base" "Qwen3-2B FT" \
           "Qwen3-8B base" "Qwen3-8B FT" \
           "Claude Sonnet 4.5" "GPT-5.1" \
  --output experiments/qwen_login/comprehensive_comparison.png

Plot Features

Automatic color coding:

  • Model type detection from label text ("2b", "8b", "claude", "gpt")
  • Consistent colors across all plots
  • Clear visual distinction between model families

Hatch patterns:

  • Base/pretrained models: solid fill
  • Fine-tuned models: diagonal stripes (///)
  • Automatically detected from "ft", "fine", or "finetuned" in label

Layout:

  • Multi-panel figure with one subplot per metric
  • Grouped bars for easy comparison
  • Rotated x-axis labels for readability
  • Comprehensive legend explaining color coding and patterns

Customization:

  • Supports arbitrary number of models
  • Accepts any combination of eval JSON files
  • Scales automatically to data range
  • 150 DPI output for publication quality

Reproducing the Benchmark

Prerequisites

  1. Python 3.12 with uv package manager
  2. GPU (optional but recommended):
    • CUDA GPU with 8GB+ VRAM for 2B model, 16GB+ for 8B
    • Apple Silicon (M1/M2/M3) with 16GB+ unified memory
    • CPU-only training is possible but slow
  3. API keys (optional, for API model comparison):
    • ANTHROPIC_API_KEY for Claude Sonnet 4.5
    • OPENAI_API_KEY for GPT-5.1

Installation

# Clone repository
git clone https://github.com/OpenAdaptAI/openadapt-ml.git
cd openadapt-ml

# Install dependencies
uv sync

# Optional: Install API dependencies
uv sync --extra api

# Configure API keys (if using --include-all-apis)
cp .env.example .env
# Edit .env with your API keys

Running the Full Benchmark

# Standard coordinate mode (2B model, 4 epochs)
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --out-dir experiments/qwen_login/2b_dev

# With API comparison
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --out-dir experiments/qwen_login/2b_dev \
  --include-all-apis

# SoM mode (2B model, 2 epochs)
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
  --config configs/qwen3vl_synthetic_som.yaml \
  --out-dir experiments/qwen_login/som_eval

Expected Outputs

After running the benchmark, experiments/qwen_login/2b_dev/ will contain:

experiments/qwen_login/2b_dev/
├── eval/
│   ├── eval_qwen_base.json    # Base model metrics
│   ├── eval_qwen_ft.json      # Fine-tuned metrics
│   ├── eval_claude.json       # Claude API metrics (if --include-all-apis)
│   └── eval_gpt51.json        # GPT API metrics (if --include-all-apis)
└── plots/
    └── qwen_base_vs_ft.png    # Comparison plot

Expected Runtime

  • Training: 10-15 minutes (Apple Silicon), 5-10 minutes (CUDA GPU)
  • Evaluation per model: 2-3 minutes (local), 5-10 minutes (API due to rate limits)
  • Total benchmark: ~20-30 minutes without APIs, ~40-60 minutes with APIs

Verifying Results

Check that your results match the published benchmark:

Qwen3-VL-2B FT should achieve:

  • Action type accuracy: ~40-50%
  • Click hit rate: ~80-95%
  • Mean coord error: <0.06

Qwen3-VL-2B SoM should achieve:

  • Action type accuracy: 100%
  • Element accuracy: 100%
  • Episode success rate: 100%

If results differ significantly, check:

  1. Random seed is set correctly (seed: 123 in config)
  2. Jitter is enabled (default: true)
  3. Training completed without errors
  4. Checkpoint was saved and loaded correctly

Implementation Details

Synthetic Data Generation

The synthetic login UI is generated by openadapt_ml/ingest/synthetic.py:

Key functions:

  • _compute_login_layout(): Samples layout with optional jitter
  • _render_login_ui(): Draws UI elements to PIL image
  • _script_login_episode(): Creates 7-step episode with actions
  • generate_synthetic_sessions(): Top-level API for dataset generation

Rendering pipeline:

  1. Sample layout (with jitter if enabled)
  2. Render background
  3. Draw text labels ("Username:", "Password:")
  4. Draw input boxes (white rectangles)
  5. Draw buttons (login button, decoy help button)
  6. Add text content based on step (typed username/password)
  7. Save frame as PNG
  8. Record action (CLICK/TYPE/WAIT/DONE)
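
A hedged usage sketch of the top-level API follows; the argument names mirror the synthetic_data section of the YAML config and the exact signature and return value are assumptions.

# Assumed usage of the top-level generator; argument names follow the config
# fields (num_sessions, seed, output_dir) and are not verified against the code.
from openadapt_ml.ingest.synthetic import generate_synthetic_sessions

sessions = generate_synthetic_sessions(
    num_sessions=32,
    seed=123,
    output_dir="synthetic_train_dev",
)
print(f"generated {len(sessions)} sessions")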

Training Pipeline

The training loop is implemented in openadapt_ml/training/trainer.py:

Key steps:

  1. Generate synthetic sessions
  2. Flatten to episodes
  3. Build SFT samples (chat format)
  4. Load VLM adapter with LoRA
  5. Run supervised training loop:
    • Forward pass with mixed precision
    • Compute loss on assistant tokens only
    • Backward pass with gradient accumulation
    • Optimizer step with gradient clipping
    • Log loss and learning rate
  6. Save LoRA adapter to checkpoint
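
The "loss on assistant tokens only" part of step 5 is typically implemented by setting the labels of all non-assistant tokens to -100, the index that PyTorch's cross-entropy ignores. A minimal sketch under that assumption (token IDs and the assistant span are illustrative):

# Sketch: mask labels so only assistant tokens contribute to the loss.
import torch

IGNORE_INDEX = -100  # label value ignored by torch.nn.CrossEntropyLoss by default

def mask_labels(input_ids: torch.Tensor, assistant_start: int) -> torch.Tensor:
    """Copy input_ids as labels, ignoring everything before the assistant span."""
    labels = input_ids.clone()
    labels[:assistant_start] = IGNORE_INDEX
    return labels

ids = torch.tensor([101, 7, 8, 9, 42, 43, 44])  # prompt tokens + assistant tokens
labels = mask_labels(ids, assistant_start=4)    # loss only on positions 4..6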

Evaluation Pipeline

Offline evaluation is implemented in openadapt_ml/evals/trajectory_matching.py:

Key steps:

  1. Generate fresh synthetic episodes (different seed)
  2. Build SFT samples
  3. Load policy (base or fine-tuned)
  4. For each step:
    • Run policy inference
    • Parse predicted action
    • Compare to ground truth
    • Compute metrics (type match, coord error, click hit)
  5. Aggregate metrics across episodes
  6. Save JSON output
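
The aggregation in step 5 boils down to means over per-step flags plus an all-steps-correct check per episode. A sketch under those assumptions (the real implementation also tracks coordinate error and the weak/semantic metrics):

# Sketch of aggregating per-step correctness into episode-level metrics.
def aggregate(episodes: list[list[bool]]) -> dict:
    """episodes: one list of per-step 'fully correct' flags per episode."""
    num_steps = sum(len(ep) for ep in episodes)
    num_correct = sum(sum(ep) for ep in episodes)
    return {
        "mean_episode_progress": sum(sum(ep) / len(ep) for ep in episodes) / len(episodes),
        "episode_success_rate": sum(all(ep) for ep in episodes) / len(episodes),
        "step_accuracy": num_correct / num_steps,
    }

print(aggregate([[True, True, False], [True, True, True]]))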

DSL Action Parsing

The action parser in openadapt_ml/runtime/policy.py uses regex patterns:

import re

_CLICK_RE = re.compile(
    r"CLICK\(x=([\d.]+),\s*y=([\d.]+)\)|CLICK\(\[([\d]+)\]\)"
)
_TYPE_RE = re.compile(r'TYPE\(text="([^"\\]*(?:\\.[^"\\]*)*)"\)')
_WAIT_RE = re.compile(r"\bWAIT\s*\(\s*\)")
_DONE_RE = re.compile(r"\bDONE\s*\(\s*\)")

The parser handles:

  • Coordinate clicks: CLICK(x=0.42, y=0.31)
  • Element clicks: CLICK([1])
  • Type with escaping: TYPE(text="hello \"world\"")
  • Wait: WAIT()
  • Done: DONE()
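
For example, the patterns above can be applied to a raw model completion as follows. The parse_action helper here is a sketch for illustration; the real parser in policy.py also handles whitespace and fallback cases.

# Using the regexes above to turn a model completion into an action dict.
def parse_action(text: str) -> dict:
    m = _CLICK_RE.search(text)
    if m:
        if m.group(3) is not None:  # CLICK([n]) element form (SoM mode)
            return {"type": "CLICK", "element": int(m.group(3))}
        return {"type": "CLICK", "x": float(m.group(1)), "y": float(m.group(2))}
    m = _TYPE_RE.search(text)
    if m:
        return {"type": "TYPE", "text": m.group(1)}
    if _DONE_RE.search(text):
        return {"type": "DONE"}
    if _WAIT_RE.search(text):
        return {"type": "WAIT"}
    return {"type": "UNKNOWN", "raw": text}

print(parse_action("CLICK(x=0.35, y=0.31)"))  # {'type': 'CLICK', 'x': 0.35, 'y': 0.31}
print(parse_action("CLICK([1])"))             # {'type': 'CLICK', 'element': 1}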

Model Adapters

VLM adapters implement a common interface in openadapt_ml/models/base_adapter.py:

from typing import Any, Dict, List

import torch


class BaseVLMAdapter:
    def prepare_inputs(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Convert SFT samples to model inputs."""
        raise NotImplementedError

    def compute_loss(self, batch: List[Dict[str, Any]], model_outputs: Any) -> torch.Tensor:
        """Compute loss on assistant tokens only."""
        raise NotImplementedError

    def generate(self, sample: Dict[str, Any]) -> str:
        """Generate action text for a single sample."""
        raise NotImplementedError

Implemented adapters:

  • QwenVLAdapter: Qwen3-VL and Qwen2.5-VL support
  • ApiVLMAdapter: Claude Sonnet 4.5 and GPT-5.1 API wrappers
  • DummyAdapter: Lightweight mock for testing
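
A minimal mock in the spirit of DummyAdapter, useful for exercising the training and eval plumbing without a GPU. The class name and return conventions below are assumptions, not the actual implementation.

# Hypothetical test double implementing the interface shown above.
import torch

class EchoAdapter(BaseVLMAdapter):
    """Ignores images and always emits WAIT(); handy for smoke tests."""

    def prepare_inputs(self, batch):
        return {"batch_size": len(batch)}  # no real tensors needed

    def compute_loss(self, batch, model_outputs):
        return torch.tensor(0.0, requires_grad=True)  # constant dummy loss

    def generate(self, sample):
        return "WAIT()"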

Troubleshooting

Training Issues

Out of memory (OOM):

  • Reduce num_sessions in config (try 16 or 8)
  • Enable 4-bit quantization: load_in_4bit: true
  • Use smaller model (2B instead of 8B)
  • Reduce LoRA rank: r: 4 instead of r: 8

Loss not decreasing:

  • Check learning rate (try 5e-5 to 2e-4)
  • Increase warmup: warmup_ratio: 0.1
  • Check gradient clipping: max_grad_norm: 1.0
  • Verify data is being shuffled

Checkpoint not loading:

  • Check weights_path in config matches saved checkpoint
  • Verify LoRA config (r, alpha, target_modules) matches training
  • Try loading base model first to rule out adapter issues

Evaluation Issues

Poor base model performance:

  • Expected! Base models typically get 10-20% action accuracy
  • Fine-tuning is necessary for good performance

API models failing:

  • Check API keys are set in .env
  • Verify API key environment variables: echo $ANTHROPIC_API_KEY
  • Check API rate limits and quotas
  • Try with --skip-train to isolate API issues

Metrics are null/missing:

  • coord_error is null when no CLICK actions are predicted
  • click_hit_rate is null when no clicks have coordinates
  • element_accuracy is null unless using SoM mode
  • This is expected behavior, not a bug

Reproducibility Issues

Results don't match published numbers:

  • Verify random seed: seed: 123 in config
  • Check training epochs (4 for standard, 2 for SoM)
  • Ensure jitter is enabled (default)
  • Confirm checkpoint loaded correctly
  • Check model version (Qwen3-VL vs Qwen2.5-VL)

Evaluation set differs:

  • Use different seed for eval vs training
  • Set eval_on_training_data: false (default)
  • Check output_dir ends with _eval suffix

Next Steps

This benchmark demonstrates that small fine-tuned models can outperform large general-purpose APIs on structured tasks. Potential next steps:

  1. More scenarios: Add settings panel, form filling, multi-step workflows
  2. Real UI testing: Test on actual web pages and desktop apps
  3. Larger training sets: Scale to 100s or 1000s of episodes
  4. Architecture improvements: Try different LoRA ranks, target modules, learning rates
  5. Multi-task learning: Train on multiple scenarios simultaneously
  6. Transfer learning: Evaluate on held-out scenario types

See docs/roadmap.md for the full prioritized roadmap.


Citation

If you use this benchmark in your research, please cite:

@software{openadapt_ml_2024,
  title = {OpenAdapt-ML: Model-Agnostic GUI Automation with Vision-Language Models},
  author = {OpenAdapt AI Team},
  year = {2024},
  url = {https://github.com/OpenAdaptAI/openadapt-ml}
}

License

MIT License. See LICENSE file for details.


Contact

For questions, issues, or contributions: