Qwen Synthetic Login Benchmark

Overview

The Qwen Synthetic Login Benchmark is a reproducible experiment demonstrating how LoRA fine-tuning improves GUI grounding for vision-language models (VLMs) on structured UI automation tasks. This benchmark trains and evaluates Qwen3-VL models on a synthetic login scenario with challenging layout variations.

Key Achievement: Fine-tuned Qwen3-VL-2B (46.9% action accuracy) outperforms both Claude Sonnet 4.5 (12.1%) and GPT-5.1 (18.3%) on this benchmark, demonstrating that specialized fine-tuning can beat general-purpose frontier models.

100% Accuracy with Set-of-Marks (SoM): When using element-based actions (CLICK([1])) instead of coordinates, fine-tuned Qwen3-VL-2B achieves perfect 100% accuracy on both login and registration scenarios.


Synthetic Login Scenario

The benchmark uses a procedurally generated login UI with the following characteristics:

UI Elements

  • Username field: Text input for username
  • Password field: Text input for password
  • Login button: Submit button
  • Decoy element: "Help" button that should be ignored

Episode Structure

A standard login episode consists of 7 steps:

  1. Step 0: Initial screen observation, action: WAIT()
  2. Step 1: Click username field, action: CLICK(x=0.35, y=0.31)
  3. Step 2: Type username, action: TYPE(text="demo")
  4. Step 3: Click password field, action: CLICK(x=0.35, y=0.45)
  5. Step 4: Type password, action: TYPE(text="pass")
  6. Step 5: Click login button, action: CLICK(x=0.50, y=0.68)
  7. Step 6: Task complete, action: DONE()

Hardening Features

To prevent overfitting and test robustness, the synthetic generator includes:

  • Layout jitter: UI elements shift up to ±10 pixels per episode
  • Decoy elements: "Help" button that the agent should ignore
  • Randomized styling: Colors, fonts, and spacing vary slightly
  • Deterministic seeds: Reproducible evaluation sets

These features ensure that models must learn semantic understanding rather than memorizing pixel-perfect coordinates.
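
For illustration, here is a minimal sketch of how deterministic per-episode layout jitter can be sampled. The element names, base coordinates, and helper function below are hypothetical; the actual logic lives in _compute_login_layout() (see Implementation Details).

# Minimal sketch of per-episode layout jitter; element names and base
# coordinates are illustrative, not the real generator's values.
import random

BASE_LAYOUT = {"username": (120, 90), "password": (120, 130), "login": (160, 200)}

def jitter_layout(seed: int, max_jitter: int = 10) -> dict:
    """Shift each element's (x, y) by up to +/-max_jitter pixels, deterministically."""
    rng = random.Random(seed)  # per-episode seed keeps eval sets reproducible
    return {
        name: (x + rng.randint(-max_jitter, max_jitter),
               y + rng.randint(-max_jitter, max_jitter))
        for name, (x, y) in BASE_LAYOUT.items()
    }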


Training Setup

Configuration Files

The benchmark uses YAML configuration files in configs/:

Standard coordinate-based training (configs/qwen3vl_synthetic_dev.yaml):

model:
  name: Qwen/Qwen3-VL-2B-Instruct
  load_in_4bit: false

lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  bias: none
  target_modules:
    - q_proj
    - v_proj
  task_type: CAUSAL_LM
  weights_path: checkpoints/qwen3vl2b_login_lora_epjit_v2

synthetic_data:
  num_sessions: 32
  seed: 123
  output_dir: synthetic_train_dev

training:
  num_train_epochs: 4
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 1
  learning_rate: 1.0e-4
  warmup_ratio: 0.03
  weight_decay: 0.0
  max_grad_norm: 1.0
  logging_steps: 1

Set-of-Marks (SoM) training (configs/qwen3vl_synthetic_som.yaml):

model:
  name: Qwen/Qwen3-VL-2B-Instruct
  load_in_4bit: false

lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  bias: none
  target_modules:
    - q_proj
    - v_proj
  task_type: CAUSAL_LM
  weights_path: checkpoints/qwen3vl2b_login_lora_som

synthetic_data:
  num_sessions: 32
  seed: 123
  output_dir: synthetic_train_som
  use_som: true  # Enable Set-of-Marks overlays

training:
  num_train_epochs: 2
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 1
  learning_rate: 5.0e-5
  warmup_ratio: 0.1
  weight_decay: 0.01
  max_grad_norm: 0.5
  logging_steps: 10

Training Commands

Full benchmark (train + eval + plot):

# Standard coordinate-based training
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --out-dir experiments/qwen_login/2b_dev

# Include API model comparison (Claude Sonnet 4.5 + GPT-5.1)
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --out-dir experiments/qwen_login/2b_dev \
  --include-all-apis

Training only:

# Train a LoRA adapter
uv run python -m openadapt_ml.scripts.train \
  --config configs/qwen3vl_synthetic_dev.yaml

SoM mode training:

# Train with Set-of-Marks visual prompting
uv run python -m openadapt_ml.scripts.train \
  --config configs/qwen3vl_synthetic_som.yaml

Training Process

  1. Synthetic Data Generation: 32 sessions (one episode per session) are generated with jittered layouts
  2. SFT Dataset Building: Each step is converted to a chat-style sample with system prompt + user query + assistant action
  3. LoRA Fine-tuning: Only q_proj and v_proj layers are trained with rank-8 LoRA adapters
  4. Checkpoint Saving: Trained adapters are saved to checkpoints/ directory
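
To make step 2 concrete, a rough sketch of what one chat-style SFT sample might look like follows; the exact prompt wording, field names, and file path are assumptions for illustration, not the dataset builder's actual output.

# Hypothetical shape of a single SFT sample built from one episode step.
sample = {
    "messages": [
        {"role": "system", "content": "You are a GUI automation agent. Respond with one DSL action."},
        {"role": "user", "content": "Goal: log in as 'demo'. Current screen: <image>"},
        {"role": "assistant", "content": "CLICK(x=0.35, y=0.31)"},
    ],
    # Illustrative path only; real frames are written under the configured output_dir.
    "image_path": "synthetic_train_dev/session_000/step_001.png",
}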

Training Time: ~10-15 minutes on Apple Silicon M1/M2, ~5 minutes on A10 GPU

Memory Usage: ~8GB VRAM for 2B model, ~16GB for 8B model


Evaluation Metrics

The benchmark tracks four primary metrics and several auxiliary metrics:

Primary Metrics

1. Action Type Accuracy

  • Percentage of steps where predicted action type matches ground truth
  • Types: CLICK, TYPE, WAIT, DONE
  • Example: If the model predicts CLICK when the ground truth is TYPE, the step counts as incorrect
  • Target: >90% for production use

2. Mean Coordinate Error

  • Average normalized L2 distance between predicted and GT click coordinates
  • Only computed for CLICK actions with valid coordinates
  • Normalized to the [0, 1] range (0 = perfect match, 1 = the full screen diagonal)
  • Target: <0.05 (5% of screen diagonal)

3. Click Hit Rate

  • Percentage of clicks within 5% radius of ground truth center
  • Point-based evaluation: treats click as hitting a circular target
  • Target: >95% for reliable automation
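
To make metrics 2 and 3 concrete, here is a minimal sketch of both click metrics for a single predicted CLICK. It assumes coordinates are already normalized to [0, 1] and that error is normalized by the unit-square diagonal; the authoritative definitions live in trajectory_matching.py.

# Sketch of coordinate error and point-based hit test for one predicted click.
import math

def coord_error(pred: tuple, gt: tuple) -> float:
    """L2 distance between prediction and ground truth, divided by the
    unit-square diagonal so 1.0 corresponds to the full screen diagonal."""
    return math.dist(pred, gt) / math.sqrt(2)

def click_hit(pred: tuple, gt: tuple, radius: float = 0.05) -> bool:
    """Hit if the prediction lands within 5% of the diagonal from the GT center."""
    return coord_error(pred, gt) <= radius

print(coord_error((0.36, 0.30), (0.35, 0.31)))  # ~0.01 -> counts as a hit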

4. Episode Success Rate

  • Percentage of episodes where ALL steps match exactly (strict evaluation)
  • Requires perfect action type AND coordinate accuracy for entire episode
  • Most challenging metric; sensitive to any single mistake
  • Target: >80% for production deployment

Auxiliary Metrics

5. Mean Episode Progress

  • Average percentage of correct steps per episode (partial credit)
  • More forgiving than episode success rate
  • Useful for diagnosing where models typically fail

6. Mean Episode Step Score

  • Average "full step correctness" (action type match + click hit for clicks)
  • Stricter than progress, more forgiving than episode success

7. Weak Episode Success Rate

  • Semantic milestone-based evaluation (typed username, typed password, clicked login, emitted done)
  • Allows some mistakes as long as key milestones are achieved

8. Element Accuracy (SoM mode only)

  • Percentage of steps where predicted element ID matches ground truth
  • Only applicable when using Set-of-Marks overlays

9. Bbox Hit Rate

  • Percentage of clicks landing anywhere within element bounding box
  • More forgiving than point-based click hit rate

Evaluation Commands

Evaluate base model (no fine-tuning):

uv run python -m openadapt_ml.scripts.eval_policy \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --backend qwen3 \
  --ignore-lora \
  --output-json experiments/qwen_login/eval_base.json

Evaluate fine-tuned model:

uv run python -m openadapt_ml.scripts.eval_policy \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --backend qwen3 \
  --output-json experiments/qwen_login/eval_ft.json

Evaluate API models (requires API keys in .env):

# Claude Sonnet 4.5
uv run python -m openadapt_ml.scripts.eval_policy \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --backend claude \
  --output-json experiments/qwen_login/eval_claude.json

# GPT-5.1
uv run python -m openadapt_ml.scripts.eval_policy \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --backend openai \
  --output-json experiments/qwen_login/eval_gpt.json

SoM mode evaluation:

uv run python -m openadapt_ml.scripts.eval_policy \
  --config configs/qwen3vl_synthetic_som.yaml \
  --backend qwen3 \
  --dsl-mode som \
  --output-json experiments/qwen_login/eval_som.json

Evaluation JSON Schema

All evaluation runs produce a consistent JSON schema:

{
  "config_path": "configs/qwen3vl_synthetic_dev.yaml",
  "backend": "qwen3",
  "dsl_mode": "coord",  // or "som"
  "metrics": {
    "num_episodes": 32,
    "num_steps": 224,
    "action_type_accuracy": 0.469,
    "mean_coord_error": 0.051,
    "coord_error_count": 19,
    "episode_success_rate": 0.0,
    "click_hit_rate": 0.850,
    "bbox_hit_rate": 0.900,
    "mean_episode_progress": 0.532,
    "mean_episode_step_score": 0.489,
    "weak_episode_success_rate": 0.125,
    "state_success_rate": null,
    "element_accuracy": null  // only for SoM mode
  }
}

This stable schema enables reproducible comparisons across model versions and training runs.
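
Because the schema is stable, comparing runs is a matter of loading the JSON files and reading the metrics block. A small example using the eval outputs produced by the commands above:

# Load two eval JSON files (schema shown above) and compare key metrics.
import json

def load_metrics(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["metrics"]

base = load_metrics("experiments/qwen_login/eval_base.json")
ft = load_metrics("experiments/qwen_login/eval_ft.json")

for key in ("action_type_accuracy", "click_hit_rate", "episode_success_rate"):
    print(f"{key}: base={base[key]} ft={ft[key]}")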


Results: Standard Coordinate Mode

Comprehensive Model Comparison

| Model | Type | Action Accuracy | Coord Error | Click Hit Rate | Episode Success |
| --- | --- | --- | --- | --- | --- |
| Qwen3-VL-2B base | Offline | 14.3% | N/A | N/A | 0% |
| Qwen3-VL-2B FT | Offline | 46.9% | 0.051 | 85.0% | 0% |
| Qwen3-VL-8B base | Offline | 14.3% | N/A | N/A | 0% |
| Qwen3-VL-8B FT | Offline | 28.6% | 0.004 | 100% | 0% |
| Claude Sonnet 4.5 | API | 12.1% | 0.757 | 0% | 0% |
| GPT-5.1 | API | 18.3% | 0.057 | 60.0% | 0% |

Key Findings

  1. Fine-tuning delivers massive gains: Both 2B and 8B models show 2-3x improvement in action accuracy after fine-tuning
  2. Small fine-tuned models beat large APIs: Qwen3-VL-2B FT (46.9%) outperforms both Claude Sonnet 4.5 (12.1%) and GPT-5.1 (18.3%)
  3. Precision matters: Fine-tuned models have excellent click precision (85-100% hit rate, <0.05 coord error) while API models struggle
  4. Size vs specialization: The fine-tuned 2B model outperforms the general-purpose Claude Sonnet 4.5, showing that domain-specific fine-tuning trumps raw model size

Visualization

The benchmark automatically generates comparison plots using plot_eval_metrics.py:

(Figure: comprehensive model comparison plot generated by plot_eval_metrics.py.)

Color coding:

  • Light blue (#4A90E2): Qwen3-VL-2B
  • Dark blue (#2E5C8A): Qwen3-VL-8B
  • Orange (#FF6B35): Claude API
  • Red (#C1121F): GPT API

Hatch patterns:

  • Solid fill: Base/pretrained models
  • Diagonal stripes (///): Fine-tuned models

Results: Set-of-Marks (SoM) Mode

Perfect Accuracy Achieved

When using Set-of-Marks visual prompting, fine-tuned models achieve 100% accuracy:

| Scenario | Steps | Elements | Action Acc | Element Acc | Episode Success |
| --- | --- | --- | --- | --- | --- |
| Login | 6 | 3 | 100% | 100% | 100% |
| Registration | 12 | 6 | 100% | 100% | 100% |

How SoM Works

Instead of predicting precise coordinates (CLICK(x=0.42, y=0.31)), the model selects numbered UI elements (CLICK([1])). This reduces spatial reasoning to element selection, which small models handle perfectly.

Example SoM Actions:

  • CLICK([0]) - Click element with overlay number [0]
  • TYPE(text="demo") - Type text (unchanged)
  • DONE() - Task complete (unchanged)

Cost/Latency Comparison

| Approach | Login Acc | Registration Acc | Cost | Latency |
| --- | --- | --- | --- | --- |
| Claude API + SoM | 100% | 100%* | ~$0.01/step | ~500ms |
| GPT-5.1 API + SoM | 100% | 100%* | ~$0.01/step | ~500ms |
| Qwen 2B + SoM | 100% | 100% | Free (local) | ~50ms |

*API results on registration pending full evaluation

When to Use SoM Mode

Advantages:

  • Perfect accuracy on structured UIs
  • 10x faster inference (50ms vs 500ms)
  • Zero API costs
  • Works on small models (2B parameters)

Limitations:

  • Requires element detection system (SoM overlay generation)
  • Not applicable to free-form image interactions
  • Additional engineering complexity

Use SoM when:

  • Working with web UIs (HTML DOM available)
  • Using accessibility trees (desktop apps)
  • Need guaranteed accuracy for production automation
  • Latency and cost are critical

Use coordinates when:

  • Working with arbitrary images (screenshots, PDFs, games)
  • No structured element information available
  • Exploring generalization to novel UI types

Plotting System

Using plot_eval_metrics.py

The plotting system supports flexible multi-model comparisons:

# Compare base vs fine-tuned
python -m openadapt_ml.evals.plot_eval_metrics \
  --files experiments/qwen_login/eval_base.json \
          experiments/qwen_login/eval_ft.json \
  --labels "Qwen3-2B base" "Qwen3-2B FT" \
  --output experiments/qwen_login/base_vs_ft.png

# Comprehensive comparison (6 models)
python -m openadapt_ml.evals.plot_eval_metrics \
  --files experiments/qwen_login/eval_2b_base.json \
          experiments/qwen_login/eval_2b_ft.json \
          experiments/qwen_login/eval_8b_base.json \
          experiments/qwen_login/eval_8b_ft.json \
          experiments/qwen_login/eval_claude.json \
          experiments/qwen_login/eval_gpt.json \
  --labels "Qwen3-2B base" "Qwen3-2B FT" \
           "Qwen3-8B base" "Qwen3-8B FT" \
           "Claude Sonnet 4.5" "GPT-5.1" \
  --output experiments/qwen_login/comprehensive_comparison.png

Plot Features

Automatic color coding:

  • Model type detection from label text ("2b", "8b", "claude", "gpt")
  • Consistent colors across all plots
  • Clear visual distinction between model families

Hatch patterns:

  • Base/pretrained models: solid fill
  • Fine-tuned models: diagonal stripes (///)
  • Automatically detected from "ft", "fine", or "finetuned" in label

Layout:

  • Multi-panel figure with one subplot per metric
  • Grouped bars for easy comparison
  • Rotated x-axis labels for readability
  • Comprehensive legend explaining color coding and patterns

Customization:

  • Supports arbitrary number of models
  • Accepts any combination of eval JSON files
  • Scales automatically to data range
  • 150 DPI output for publication quality

Reproducing the Benchmark

Prerequisites

  1. Python 3.12 with uv package manager
  2. GPU (optional but recommended):
    • CUDA GPU with 8GB+ VRAM for 2B model, 16GB+ for 8B
    • Apple Silicon (M1/M2/M3) with 16GB+ unified memory
    • CPU-only training is possible but slow
  3. API keys (optional, for API model comparison):
    • ANTHROPIC_API_KEY for Claude Sonnet 4.5
    • OPENAI_API_KEY for GPT-5.1

Installation

# Clone repository
git clone https://github.com/OpenAdaptAI/openadapt-ml.git
cd openadapt-ml

# Install dependencies
uv sync

# Optional: Install API dependencies
uv sync --extra api

# Configure API keys (if using --include-all-apis)
cp .env.example .env
# Edit .env with your API keys

Running the Full Benchmark

# Standard coordinate mode (2B model, 4 epochs)
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --out-dir experiments/qwen_login/2b_dev

# With API comparison
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
  --config configs/qwen3vl_synthetic_dev.yaml \
  --out-dir experiments/qwen_login/2b_dev \
  --include-all-apis

# SoM mode (2B model, 2 epochs)
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
  --config configs/qwen3vl_synthetic_som.yaml \
  --out-dir experiments/qwen_login/som_eval

Expected Outputs

After running the benchmark, experiments/qwen_login/2b_dev/ will contain:

experiments/qwen_login/2b_dev/
├── eval/
│   ├── eval_qwen_base.json    # Base model metrics
│   ├── eval_qwen_ft.json      # Fine-tuned metrics
│   ├── eval_claude.json       # Claude API metrics (if --include-all-apis)
│   └── eval_gpt51.json        # GPT API metrics (if --include-all-apis)
└── plots/
    └── qwen_base_vs_ft.png    # Comparison plot

Expected Runtime

  • Training: 10-15 minutes (Apple Silicon), 5-10 minutes (CUDA GPU)
  • Evaluation per model: 2-3 minutes (local), 5-10 minutes (API due to rate limits)
  • Total benchmark: ~20-30 minutes without APIs, ~40-60 minutes with APIs

Verifying Results

Check that your results match the published benchmark:

Qwen3-VL-2B FT should achieve:

  • Action type accuracy: ~40-50%
  • Click hit rate: ~80-95%
  • Mean coord error: <0.06

Qwen3-VL-2B SoM should achieve:

  • Action type accuracy: 100%
  • Element accuracy: 100%
  • Episode success rate: 100%

If results differ significantly, check:

  1. Random seed is set correctly (seed: 123 in config)
  2. Jitter is enabled (default: true)
  3. Training completed without errors
  4. Checkpoint was saved and loaded correctly

Implementation Details

Synthetic Data Generation

The synthetic login UI is generated by openadapt_ml/ingest/synthetic.py:

Key functions:

  • _compute_login_layout(): Samples layout with optional jitter
  • _render_login_ui(): Draws UI elements to PIL image
  • _script_login_episode(): Creates 7-step episode with actions
  • generate_synthetic_sessions(): Top-level API for dataset generation

Rendering pipeline:

  1. Sample layout (with jitter if enabled)
  2. Render background
  3. Draw text labels ("Username:", "Password:")
  4. Draw input boxes (white rectangles)
  5. Draw buttons (login button, decoy help button)
  6. Add text content based on step (typed username/password)
  7. Save frame as PNG
  8. Record action (CLICK/TYPE/WAIT/DONE)
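
A hedged usage sketch of the top-level API follows; the argument names mirror the synthetic_data section of the YAML config and the exact signature and return value are assumptions.

# Assumed usage of the top-level generator; argument names follow the config
# fields (num_sessions, seed, output_dir) and are not verified against the code.
from openadapt_ml.ingest.synthetic import generate_synthetic_sessions

sessions = generate_synthetic_sessions(
    num_sessions=32,
    seed=123,
    output_dir="synthetic_train_dev",
)
print(f"generated {len(sessions)} sessions")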

Training Pipeline

The training loop is implemented in openadapt_ml/training/trainer.py:

Key steps:

  1. Generate synthetic sessions
  2. Flatten to episodes
  3. Build SFT samples (chat format)
  4. Load VLM adapter with LoRA
  5. Run supervised training loop:
    • Forward pass with mixed precision
    • Compute loss on assistant tokens only
    • Backward pass with gradient accumulation
    • Optimizer step with gradient clipping
    • Log loss and learning rate
  6. Save LoRA adapter to checkpoint
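
The "loss on assistant tokens only" part of step 5 is typically implemented by setting the labels of all non-assistant tokens to -100, the index that PyTorch's cross-entropy ignores. A minimal sketch under that assumption (token IDs and the assistant span are illustrative):

# Sketch: mask labels so only assistant tokens contribute to the loss.
import torch

IGNORE_INDEX = -100  # label value ignored by torch.nn.CrossEntropyLoss by default

def mask_labels(input_ids: torch.Tensor, assistant_start: int) -> torch.Tensor:
    """Copy input_ids as labels, ignoring everything before the assistant span."""
    labels = input_ids.clone()
    labels[:assistant_start] = IGNORE_INDEX
    return labels

ids = torch.tensor([101, 7, 8, 9, 42, 43, 44])  # prompt tokens + assistant tokens
labels = mask_labels(ids, assistant_start=4)    # loss only on positions 4..6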

Evaluation Pipeline

Offline evaluation is implemented in openadapt_ml/evals/trajectory_matching.py:

Key steps:

  1. Generate fresh synthetic episodes (different seed)
  2. Build SFT samples
  3. Load policy (base or fine-tuned)
  4. For each step:
    • Run policy inference
    • Parse predicted action
    • Compare to ground truth
    • Compute metrics (type match, coord error, click hit)
  5. Aggregate metrics across episodes
  6. Save JSON output
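
The aggregation in step 5 boils down to means over per-step flags plus an all-steps-correct check per episode. A sketch under those assumptions (the real implementation also tracks coordinate error and the weak/semantic metrics):

# Sketch of aggregating per-step correctness into episode-level metrics.
def aggregate(episodes: list[list[bool]]) -> dict:
    """episodes: one list of per-step 'fully correct' flags per episode."""
    num_steps = sum(len(ep) for ep in episodes)
    num_correct = sum(sum(ep) for ep in episodes)
    return {
        "mean_episode_progress": sum(sum(ep) / len(ep) for ep in episodes) / len(episodes),
        "episode_success_rate": sum(all(ep) for ep in episodes) / len(episodes),
        "step_accuracy": num_correct / num_steps,
    }

print(aggregate([[True, True, False], [True, True, True]]))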

DSL Action Parsing

The action parser in openadapt_ml/runtime/policy.py uses regex patterns:

import re

_CLICK_RE = re.compile(
    r"CLICK\(x=([\d.]+),\s*y=([\d.]+)\)|CLICK\(\[([\d]+)\]\)"
)
_TYPE_RE = re.compile(r'TYPE\(text="([^"\\]*(?:\\.[^"\\]*)*)"\)')
_WAIT_RE = re.compile(r"\bWAIT\s*\(\s*\)")
_DONE_RE = re.compile(r"\bDONE\s*\(\s*\)")

The parser handles:

  • Coordinate clicks: CLICK(x=0.42, y=0.31)
  • Element clicks: CLICK([1])
  • Type with escaping: TYPE(text="hello \"world\"")
  • Wait: WAIT()
  • Done: DONE()
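
For example, the patterns above can be applied to a raw model completion as follows. The parse_action helper here is a sketch for illustration; the real parser in policy.py also handles whitespace and fallback cases.

# Using the regexes above to turn a model completion into an action dict.
def parse_action(text: str) -> dict:
    m = _CLICK_RE.search(text)
    if m:
        if m.group(3) is not None:  # CLICK([n]) element form (SoM mode)
            return {"type": "CLICK", "element": int(m.group(3))}
        return {"type": "CLICK", "x": float(m.group(1)), "y": float(m.group(2))}
    m = _TYPE_RE.search(text)
    if m:
        return {"type": "TYPE", "text": m.group(1)}
    if _DONE_RE.search(text):
        return {"type": "DONE"}
    if _WAIT_RE.search(text):
        return {"type": "WAIT"}
    return {"type": "UNKNOWN", "raw": text}

print(parse_action("CLICK(x=0.35, y=0.31)"))  # {'type': 'CLICK', 'x': 0.35, 'y': 0.31}
print(parse_action("CLICK([1])"))             # {'type': 'CLICK', 'element': 1}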

Model Adapters

VLM adapters implement a common interface in openadapt_ml/models/base_adapter.py:

from typing import Any, Dict, List

import torch


class BaseVLMAdapter:
    def prepare_inputs(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Convert SFT samples to model inputs."""
        raise NotImplementedError

    def compute_loss(self, batch: List[Dict[str, Any]], model_outputs: Any) -> torch.Tensor:
        """Compute loss on assistant tokens only."""
        raise NotImplementedError

    def generate(self, sample: Dict[str, Any]) -> str:
        """Generate action text for a single sample."""
        raise NotImplementedError

Implemented adapters:

  • QwenVLAdapter: Qwen3-VL and Qwen2.5-VL support
  • ApiVLMAdapter: Claude Sonnet 4.5 and GPT-5.1 API wrappers
  • DummyAdapter: Lightweight mock for testing
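
A minimal mock in the spirit of DummyAdapter, useful for exercising the training and eval plumbing without a GPU. The class name and return conventions below are assumptions, not the actual implementation.

# Hypothetical test double implementing the interface shown above.
import torch

class EchoAdapter(BaseVLMAdapter):
    """Ignores images and always emits WAIT(); handy for smoke tests."""

    def prepare_inputs(self, batch):
        return {"batch_size": len(batch)}  # no real tensors needed

    def compute_loss(self, batch, model_outputs):
        return torch.tensor(0.0, requires_grad=True)  # constant dummy loss

    def generate(self, sample):
        return "WAIT()"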

Troubleshooting

Training Issues

Out of memory (OOM):

  • Reduce num_sessions in config (try 16 or 8)
  • Enable 4-bit quantization: load_in_4bit: true
  • Use smaller model (2B instead of 8B)
  • Reduce LoRA rank: r: 4 instead of r: 8

Loss not decreasing:

  • Check learning rate (try 5e-5 to 2e-4)
  • Increase warmup: warmup_ratio: 0.1
  • Check gradient clipping: max_grad_norm: 1.0
  • Verify data is being shuffled

Checkpoint not loading:

  • Check weights_path in config matches saved checkpoint
  • Verify LoRA config (r, alpha, target_modules) matches training
  • Try loading base model first to rule out adapter issues

Evaluation Issues

Poor base model performance:

  • Expected! Base models typically get 10-20% action accuracy
  • Fine-tuning is necessary for good performance

API models failing:

  • Check API keys are set in .env
  • Verify API key environment variables: echo $ANTHROPIC_API_KEY
  • Check API rate limits and quotas
  • Try with --skip-train to isolate API issues

Metrics are null/missing:

  • coord_error is null when no CLICK actions are predicted
  • click_hit_rate is null when no clicks have coordinates
  • element_accuracy is null unless using SoM mode
  • This is expected behavior, not a bug

Reproducibility Issues

Results don't match published numbers:

  • Verify random seed: seed: 123 in config
  • Check training epochs (4 for standard, 2 for SoM)
  • Ensure jitter is enabled (default)
  • Confirm checkpoint loaded correctly
  • Check model version (Qwen3-VL vs Qwen2.5-VL)

Evaluation set differs:

  • Use different seed for eval vs training
  • Set eval_on_training_data: false (default)
  • Check output_dir ends with _eval suffix

Next Steps

This benchmark demonstrates that small fine-tuned models can outperform large general-purpose APIs on structured tasks. Potential next steps:

  1. More scenarios: Add settings panel, form filling, multi-step workflows
  2. Real UI testing: Test on actual web pages and desktop apps
  3. Larger training sets: Scale to 100s or 1000s of episodes
  4. Architecture improvements: Try different LoRA ranks, target modules, learning rates
  5. Multi-task learning: Train on multiple scenarios simultaneously
  6. Transfer learning: Evaluate on held-out scenario types

See docs/roadmap.md for the full prioritized roadmap.


Citation

If you use this benchmark in your research, please cite:

@software{openadapt_ml_2024,
  title = {OpenAdapt-ML: Model-Agnostic GUI Automation with Vision-Language Models},
  author = {OpenAdapt AI Team},
  year = {2024},
  url = {https://github.com/OpenAdaptAI/openadapt-ml}
}

License

MIT License. See LICENSE file for details.


Contact

For questions, issues, or contributions: