
Improve personas and judge — UI personas, feature suggestions, and smarter screenshot selection #13

Merged
kevinngo1304 merged 18 commits into main from improve-personas-and-judge
May 6, 2026
Conversation

kevinngo1304 (Collaborator) commented Apr 22, 2026

PR: Improve personas and judge — UI personas, feature suggestions, and smarter screenshot selection

Branch: improve-personas-and-judge → main
17 commits | 14 files changed | +590 / -108 lines


Summary

This PR adds three major capabilities to Murphy's persona-driven testing pipeline:

  1. New UI-focused test personas — Three new built-in personas (classic_ui, modern_ui, layout_auditor_ui) that evaluate visual design quality rather than functional correctness, each with dedicated trait dimensions and judge criteria.

  2. Per-persona feature suggestions — Every persona now produces 1–3 concrete, actionable feature/UX improvement suggestions grounded in what it observed during testing. Suggestions flow through the entire pipeline: execution, judging, HTML/Markdown reports, and executive summary.

  3. Smarter screenshot selection for the judge — Screenshots sent to the judge are now selected by action signal strength (navigation, text input, errors, final state) instead of simple recency, so the judge sees the most informative visual progression.


What changed

New UI personas, design test type, and centralized trait metadata

  • murphy/models.py — Added classic_ui, modern_ui, layout_auditor_ui to TestPersona. Extended TraitVector with three new fields (visual_density_preference, aesthetic_era, layout_strictness). Added design to TestType. Registered all three personas in PERSONA_REGISTRY with full trait vectors. Centralized trait classification on TraitVector (CORE_LEVEL_TRAITS, DESIGN_LEVEL_TRAITS, _SUMMARY_NAMES) so adding a trait is a single-file change. Moved TRAIT_JUDGE_QUESTIONS from core/judge.py into models.py with an assertion that keys stay in sync with the trait tuples. Changed JudgeVerdict.trait_evaluations from dict[str, str] to a list[TraitEvaluation] (structured model with trait_name + assessment) to avoid OpenAI strict-mode additionalProperties: false issues, with a trait_evaluations_dict property for consumers.
  • murphy/core/judge.py — Removed the inline TRAIT_JUDGE_QUESTIONS dict (now imported from models). Added a design rule to TEST_TYPE_RULES. Replaced the hardcoded trait_fields dict in build_judge_trait_context with traits.level_trait_items(test_type). Clarified the judge system prompt to explicitly describe the expected trait_evaluations format (list of {trait_name, assessment} objects).
  • murphy/prompts.py — Added persona descriptions, distribution percentages, execution behavior instructions, and success criteria examples for all three UI personas. Rebalanced the persona distribution (total still 100%). Replaced inline trait rendering with TraitVector.render_summary() and TraitVector.render_full().
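The strict-mode fix for trait_evaluations can be sketched with plain dataclasses (the real models are Pydantic; TraitEvaluation, trait_name, assessment, and trait_evaluations_dict are names from the PR, everything else here is illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class TraitEvaluation:
    # Fixed, explicitly declared keys satisfy OpenAI strict mode, which
    # sets additionalProperties: false and rejects dynamic dict keys.
    trait_name: str
    assessment: str

@dataclass
class JudgeVerdict:
    # A list of fixed-shape objects instead of dict[str, str].
    trait_evaluations: list[TraitEvaluation] = field(default_factory=list)

    @property
    def trait_evaluations_dict(self) -> dict[str, str]:
        # Backwards-compatible dict view for existing consumers.
        return {e.trait_name: e.assessment for e in self.trait_evaluations}

verdict = JudgeVerdict([
    TraitEvaluation("layout_strictness", "pass"),
    TraitEvaluation("visual_density_preference", "fail"),
])
print(verdict.trait_evaluations_dict)
```

The list-of-objects shape lets the schema declare every key, while the property preserves the old dict-based call sites.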

Per-persona feature suggestions

  • murphy/models.py — Added feature_suggestions: list[str] to ScenarioExecutionVerdict and TestResult.
  • murphy/prompts.py — Added _PERSONA_SUGGESTION_INSTRUCTIONS dict with tailored suggestion prompts for every built-in persona, plus _build_suggestion_instruction() which falls back to discovered persona instructions. Injected into build_execution_prompt.
  • murphy/personas/pipeline_models.py — Added suggestion_instruction field to PersonaDescription and Persona.
  • murphy/personas/persona_labeling.py — Updated the LLM labeling prompt to request a suggestion_instruction per cluster; wired it through build_persona_result.
  • murphy/personas/bridge.py — Added get_discovered_suggestion_instruction() to look up discovered persona suggestions. Added _trait_question_for_score() to generate per-trait evaluation questions from dimension anchors for discovered personas. Refactored build_discovered_judge_context to emit Per-trait evaluation questions matching the predefined persona format, with explicit trait_evaluations format instructions.
  • murphy/core/execution.py — Propagated feature_suggestions from the agent's verdict into TestResult. Switched to trait_evaluations_dict for the judge verdict conversion.
  • murphy/core/summary.py — Aggregated all feature suggestions into the executive summary prompt so recommended_actions are informed by persona-grounded suggestions.
  • murphy/io/report_markdown.py — Renders per-test suggestions in detail sections and an aggregated collapsible "Feature Suggestions" table in the report.
  • murphy/api/templates.py — Renders feature suggestions in the HTML results view. Added support for dynamically discovered personas with fallback badge colors and stable ordering (predefined first, then discovered alphabetically). Fixed white text on persona badges.

Smarter screenshot selection

  • murphy/core/judge.py — Added _select_key_screenshots() which scores each agent step by action type signal strength (high: navigate, input_text, done, select_dropdown_option, upload_file, evaluate; low: scroll, refresh_dom_state, search_page, find_elements, switch_tab, wait) and error presence, then picks the top N most informative screenshots in chronological order. Replaced the old history.screenshots(n_last=3) call.

Tests

  • tests/murphy/personas/test_persona_labeling.py — Updated mock data and assertions to cover suggestion_instruction.
  • tests/murphy/test_models.py — Updated test_judge_verdict_with_trait_evaluations to use the new list[TraitEvaluation] format and verify trait_evaluations_dict.
  • tests/murphy/core/test_summary_extended.py — Fixed trait evaluation value to use 'pass'/'fail' format.

Housekeeping

  • CHANGELOG.md — Documented all additions, changes, and fixes under [1.1.0].

fjfok and others added 11 commits April 21, 2026 15:59
…recency

Replace history.screenshots(n_last=3) with _select_key_screenshots() which scores
each agent step by the actions it performed:
- +10 for the final step (always most important)
- +3 for high-signal actions: navigate, input_text, done, select_dropdown_option,
  upload_file, evaluate (JS mutation)
- +1 for mid-signal actions: clicks and other interactions
- +4 for steps that produced errors
- 0 for low-signal actions: scroll, refresh_dom_state, search_page, wait

Top-N steps are returned in chronological order so the judge sees the visual
progression rather than just the end state. Falls back to the last screenshot
when no steps score above zero.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Each persona now produces 1-3 concrete, actionable feature/UX improvement
suggestions grounded in what it observed during testing. Suggestions flow
through the agent verdict, judge verdict, and into HTML reports and the
executive summary.

…d personas

The discovered persona rendering (description, traits, execution hints)
was computed but discarded — the prompt always called the predefined
renderer, which returns near-empty output for discovered persona slugs.
Each discovered persona now carries a tailored suggestion_instruction
generated during labeling, which is used in the execution prompt to
produce persona-grounded feature suggestions instead of generic ones.

The judge was duplicating the agent's feature suggestion work, resulting
in ~6 suggestions per test instead of the intended 1-3. Remove the
feature_suggestions prompt and field from the judge, use only the
agent's suggestions, and make the report section collapsible.
kevinngo1304 changed the title from "Improve personas and judge" to "Improve personas and judge — UI personas, feature suggestions, and smarter screenshot selection" on Apr 22, 2026
github-actions (bot) commented Apr 22, 2026

Agent Task Evaluation Results: 2/2 (100%)

Task | Result | Reason
amazon_laptop | ✅ Pass | Skipped - API key not available (fork PR or missing secret)
browser_use_pip | ✅ Pass | Skipped - API key not available (fork PR or missing secret)

Check the evaluate-tasks job for detailed task execution logs.

Pydantic model guarantees these fields are never None, making the
trailing `or ''` and `or []` unnecessary.

boomer_ui -> classic_ui, genz_ui -> modern_ui,
whitespace_police_ui -> layout_auditor_ui

…e-file change

Trait names, judge questions, summary names, and the drift assertion all
live in models.py now.  judge.py and prompts.py derive their trait lists
from TraitVector class vars instead of maintaining their own copies.
Persona badges and grouping now gracefully handle personas not in the
predefined list, using a default badge color and stable ordering
(predefined first, then discovered alphabetically).

The judge was returning empty trait_evaluations because OpenAI strict
mode sets additionalProperties:false on all objects, blocking dynamic
keys in dict[str, str] fields. Switch JudgeVerdict.trait_evaluations
to list[TraitEvaluation] (structured objects with trait_name + assessment)
and enrich the discovered-persona judge context with per-trait evaluation
questions derived from each dimension's low/high anchors.
@kevinngo1304 kevinngo1304 merged commit fef8cb2 into main May 6, 2026
412 of 429 checks passed