Problem
The product needs an official evaluation foundation for response quality, tool use, and safety rather than ad hoc scoring.
Scope
- Track where the Microsoft.Extensions.AI.Evaluation core package and its Quality, NLP, Safety, and Reporting companion packages each fit (see the sketch after this list)
- Cover evaluation targets such as relevance, groundedness, completeness, task adherence, tool-call accuracy, and safety
- Keep the evaluation foundation broad enough to serve both coding and non-coding agents
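
A minimal sketch of the default quality path, assuming the published Microsoft.Extensions.AI.Evaluation and Microsoft.Extensions.AI.Evaluation.Quality APIs; the judge client, grounding text, and metric lookup shown here are illustrative, not a mandated shape:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

public static class QualityEvaluationSketch
{
    public static async Task RunAsync(
        IChatClient judgeClient,                // any IChatClient used as the LLM judge
        IEnumerable<ChatMessage> conversation,  // prompt/history under evaluation
        ChatResponse response,                  // model response being scored
        string groundingContext)                // source text groundedness is judged against
    {
        // Compose the LLM-judged quality evaluators; TaskAdherenceEvaluator and
        // ToolCallAccuracyEvaluator from the same package follow the same pattern.
        IEvaluator evaluators = new CompositeEvaluator(
            new RelevanceEvaluator(),
            new CompletenessEvaluator(),
            new GroundednessEvaluator());

        // Quality evaluators use an LLM as judge, so they need a ChatConfiguration.
        var chatConfiguration = new ChatConfiguration(judgeClient);

        EvaluationResult result = await evaluators.EvaluateAsync(
            conversation,
            response,
            chatConfiguration,
            additionalContext: [new GroundednessEvaluatorContext(groundingContext)]);

        // Each evaluator reports named metrics (typically numeric scores on a 1-5 scale).
        NumericMetric relevance = result.Get<NumericMetric>(RelevanceEvaluator.RelevanceMetricName);
        Console.WriteLine($"Relevance: {relevance.Value}");
    }
}
```

The Safety evaluators slot into the same IEvaluator shape but are backed by the Azure AI Foundry Evaluation service rather than a caller-supplied LLM judge, which is part of why the package split in this list matters.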
Out of scope
- Vendor-specific evaluation stacks outside the approved default
- Final scorecard UI behavior beyond the evaluation foundation itself
Implementation notes
- Prefer the official Microsoft evaluation libraries as the default path
- Keep evaluation compatible with transcript replay and telemetry (see the replay sketch after this list)
- Separate evaluation foundation from scorecard presentation concerns
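
One way the replay requirement can hold together, sketched against the Microsoft.Extensions.AI.Evaluation.NLP and Microsoft.Extensions.AI.Evaluation.Reporting packages; the storage path, scenario and execution names are placeholders, and the context constructors reflect a reading of the published API rather than a verified contract:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.NLP;
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage;

public static class TranscriptReplaySketch
{
    public static async Task RunAsync(
        IEnumerable<ChatMessage> transcriptMessages, // conversation replayed from a stored transcript
        ChatResponse recordedResponse,               // the recorded model response
        string referenceAnswer)                      // expected answer for the NLP metrics
    {
        // NLP evaluators (BLEU, F1) score text against references algorithmically,
        // so replayed transcripts can be evaluated without any live model calls.
        ReportingConfiguration reportingConfiguration = DiskBasedReportingConfiguration.Create(
            storageRootPath: "./eval-results",                    // assumption: local results folder
            evaluators: [new BLEUEvaluator(), new F1Evaluator()],
            executionName: "transcript-replay");                  // groups runs in the generated report

        // Each scenario run is persisted under storageRootPath for later reporting.
        await using ScenarioRun scenarioRun =
            await reportingConfiguration.CreateScenarioRunAsync("replay.sample-scenario");

        EvaluationResult result = await scenarioRun.EvaluateAsync(
            transcriptMessages,
            recordedResponse,
            additionalContext:
            [
                new BLEUEvaluatorContext(referenceAnswer),
                new F1EvaluatorContext(referenceAnswer)
            ]);
    }
}
```

Because results land on disk per scenario run, scorecard presentation can be generated separately (for example with the library's companion reporting tool), which preserves the foundation/presentation split above.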
Definition of Done
- The issue defines which official evaluation packages belong in the product and why
- The issue makes the approved evaluation direction explicit for implementers
Verification
- Review the issue against the feature spec and the repo's AGENTS updates
Dependencies