FEAT: Score partial content from content-filtered responses by jsong468 · Pull Request #1689 · microsoft/PyRIT

jsong468 · 2026-05-04T22:53:57Z

Description

When OpenAI/Azure Chat Completions content filters trigger mid-generation (HTTP 200 with finish_reason=content_filter), the model may have already produced partial content before being cut off. Currently, PyRIT discards this partial content and treats the response identically to a full block (HTTP 400), so scorers return hardcoded failures and attacks backtrack. From an adversary's perspective, partial harmful content may constitute a successful attack even though the output was eventually filtered.

This PR introduces a score_blocked_content attribute on the Scorer base class that allows scorers to opt in to evaluating partial content from blocked responses instead of automatically treating them as failures.

Target layer — partial content extraction

Added _extract_partial_content(response) template method to OpenAITarget base class, returning None by default
OpenAIChatTarget and OpenAIResponsesTarget override it.
_handle_content_filter_response calls the hook and attaches the result to prompt_metadata["partial_content"] on the blocked MessagePiece before it is persisted to the DB
No changes to response_error="blocked" — full backward compatibility
Partial content extraction is currently only supported for the Chat Completions API and Responses API (mainly for Foundry deployments with content filters on); the _extract_partial_content hook can be overridden if this changes in the future.
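
A minimal sketch of what such an extraction hook might look like for a Chat Completions payload, assuming the response has already been parsed into a plain dict. The function name `extract_partial_content` and the dict-based input are illustrative assumptions, not the code in this PR; only the `choices[].finish_reason` and `choices[].message.content` fields follow the Chat Completions response shape.

```python
from typing import Any, Optional


def extract_partial_content(response: dict[str, Any]) -> Optional[str]:
    """Return text generated before a content filter cut the response off.

    Hypothetical helper: returns None when there is nothing usable.
    """
    for choice in response.get("choices", []):
        if choice.get("finish_reason") != "content_filter":
            continue
        content = (choice.get("message") or {}).get("content")
        if content:  # skip None and empty strings
            return content
    return None


# Example payloads (made up for illustration).
filtered = {
    "choices": [
        {"finish_reason": "content_filter", "message": {"content": "Step 1: first you"}}
    ]
}
complete = {"choices": [{"finish_reason": "stop", "message": {"content": "Hello"}}]}

print(extract_partial_content(filtered))  # Step 1: first you
print(extract_partial_content(complete))  # None
```

Returning None for both "no filter triggered" and "filter triggered with no text" mirrors the default described above, so the caller only attaches metadata when there is something to score.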

Scorer layer — instance attribute on `Scorer`

Added score_blocked_content: bool = False attribute on Scorer (a class-level default that is set per scorer instance)
When True, score_async creates a modified message where blocked pieces with partial_content metadata are replaced with text-type substitutes (response_error="none", converted_value_data_type="text", converted_value=<partial text>) before passing to _score_async
The substitute has response_error="none" so scorer short-circuits (e.g., refusal scorer's if response_error == "blocked" check) do not fire, and the LLM evaluates the actual content
When skip_on_error_result=True and self.score_blocked_content=True, blocked messages with partial content are not skipped.
The substitution happens in score_async (not _score_async) so that _score_async's signature remains (self, message, *, objective=None) — preserving full backward compatibility for external subclasses
The substitute is never persisted to the DB. The resulting Score references the original blocked piece's ID
Unique case: ConversationScorer reads self.score_blocked_content directly when building conversation text from DB history, using partial_content from metadata instead of converted_value for blocked pieces
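
The substitution step can be sketched roughly as follows. `Piece`, `text_piece_from_blocked`, and `pieces_to_score` are hypothetical stand-ins written for this example, not PyRIT's MessagePiece or the helpers added in this PR; the field names are taken from the description above, and the `"error"` data type and payload string are placeholders.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class Piece:
    """Stand-in for a message piece; not PyRIT's MessagePiece."""

    id: str
    converted_value: str
    converted_value_data_type: str = "text"
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)


def text_piece_from_blocked(piece: Piece) -> Optional[Piece]:
    """Build an in-memory text substitute for a blocked piece, or None."""
    partial = piece.prompt_metadata.get("partial_content")
    if piece.response_error != "blocked" or not partial:
        return None
    # replace() copies every other field, so the substitute keeps the original id.
    return replace(
        piece,
        converted_value=partial,
        converted_value_data_type="text",
        response_error="none",
    )


def pieces_to_score(pieces: list[Piece], score_blocked_content: bool) -> list[Piece]:
    """Swap in substitutes when the scorer opted in; otherwise pass through."""
    if not score_blocked_content:
        return list(pieces)
    return [text_piece_from_blocked(p) or p for p in pieces]


blocked = Piece(
    id="p1",
    converted_value="<error payload>",  # placeholder, not a real error format
    converted_value_data_type="error",
    response_error="blocked",
    prompt_metadata={"partial_content": "Step 1: first you"},
)
out = pieces_to_score([blocked], score_blocked_content=True)[0]
print(out.id, out.response_error, out.converted_value)
```

Because the substitute is a copy, the original blocked piece is left untouched and remains the only version that would be persisted.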

Design decision — instance attribute, not call-site parameter or `AttackScoringConfig` field

Previous design considerations:

Why not a call-site parameter on score_async? Earlier iterations added score_blocked_content as a parameter on score_async, score_response_async, and score_response_multiple_scorers_async, threaded through from AttackScoringConfig via each attack's scoring calls. This had two problems:

Forwarding burden: Every new attack, scorer wrapper, or static method that calls score_async must remember to forward the parameter.
Backward compatibility: Adding a score_blocked_content parameter to _score_async broke external subclasses that override it with the original (self, message, *, objective=None) signature. Moving the substitution before _score_async solved this, but the forwarding burden through public APIs remained.

Why not on AttackScoringConfig? Putting the flag on the config centralizes it but still requires attacks to thread it through scoring calls. Adding it at the config level introduces complexity without significant benefit to the user.

Why an instance attribute? score_blocked_content is a scorer behavioral policy — "should this scorer evaluate partial content?" — similar to _validator or _score_aggregator. Setting it on the instance:

Zero forwarding: score_async reads self.score_blocked_content. No parameters to thread through score_response_async, attacks, or scorer wrappers.
Zero forgotten forwarding: New attacks, scorer wrappers, and static methods automatically inherit the behavior without code changes.
Full backward compatibility: No changes to score_async, _score_async, score_response_async, or score_response_multiple_scorers_async signatures.
Clean ConversationScorer handling: Reads self.score_blocked_content directly instead of inferring substitution state from message piece metadata.
Mutation concern: Since score_blocked_content is mutable instance state, a shared scorer could theoretically be mutated between calls. In practice, this is not a concern because:
The attribute is set once before the attack runs and never changed during execution
If a user did share a scorer between an attack (wants True) and external code (wants False), the mutation would be explicit and visible (scorer.score_blocked_content = True), unlike a call-site parameter where the absence of the flag silently defaults to False.
Between two attacks, the user can instantiate new scorers if the previous score_blocked_content policy no longer applies. This is reasonable: different scoring policies call for different scorers.

The alternative — a call-site parameter — would allow the same scorer instance to behave differently per call. However, this theoretical benefit doesn't materialize in practice (scorers aren't often shared) and comes at the cost of the forwarding burden described above.
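
The forwarding argument can be illustrated with a toy example. `SketchScorer` and `some_attack_step` are hypothetical names invented for this sketch, and the string results stand in for real Score objects; the point is only that the intermediate caller never mentions the flag.

```python
import asyncio
from typing import Optional


class SketchScorer:
    """Toy scorer; the real Scorer base class has a much larger surface."""

    score_blocked_content: bool = False  # class-level default

    async def score_async(self, blocked: bool, partial: Optional[str]) -> str:
        # Policy is read from the instance, so callers never pass it in.
        if blocked and self.score_blocked_content and partial:
            return f"scored partial: {partial}"
        if blocked:
            return "hardcoded failure"
        return "scored full response"


async def some_attack_step(scorer: SketchScorer) -> str:
    """Stand-in for an attack or wrapper: it knows nothing about the flag."""
    return await scorer.score_async(blocked=True, partial="Step 1: first you")


default_scorer = SketchScorer()
opted_in = SketchScorer()
opted_in.score_blocked_content = True  # explicit, visible opt-in

print(asyncio.run(some_attack_step(default_scorer)))  # hardcoded failure
print(asyncio.run(some_attack_step(opted_in)))  # scored partial: Step 1: first you
```

Setting the attribute on one instance leaves the class default and other instances unchanged, which is why instantiating a second scorer is enough to run a different policy.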

TAP `error_score_map` interaction

TAP's error_score_map (default {"blocked": 0.0}) runs before the scorer and assigns fixed scores for error types. When "blocked" is in the map, the scorer is never invoked, so score_blocked_content has no effect. To evaluate partial content with TAP, pass error_score_map={} to disable the early-return. This interaction is documented in both the node and TreeOfAttacksWithPruningAttack error_score_map docstrings.
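
A simplified sketch of why the map shadows the scorer, assuming the early-return is a plain membership check. `score_node` and `partial_content_scorer` are hypothetical names, and 0.8 is a made-up score; the real TAP node logic is more involved.

```python
from typing import Callable, Mapping


def score_node(
    response_error: str,
    error_score_map: Mapping[str, float],
    scorer: Callable[[], float],
) -> float:
    """Return a fixed score for mapped error types; otherwise call the scorer."""
    if response_error in error_score_map:
        return error_score_map[response_error]  # scorer never runs
    return scorer()


calls: list[str] = []


def partial_content_scorer() -> float:
    calls.append("invoked")
    return 0.8  # made-up score for illustration


print(score_node("blocked", {"blocked": 0.0}, partial_content_scorer))  # 0.0
print(score_node("blocked", {}, partial_content_scorer))  # 0.8
print(calls)  # ['invoked']
```

With the default map the scorer is skipped entirely, so any opt-in on the scorer is irrelevant; with an empty map the blocked response reaches the scorer.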

Interaction between `skip_on_error_result` and `score_blocked_content`

These flags serve different purposes:

skip_on_error_result is a performance optimization — don't waste LLM calls on broken responses (processing errors, empty responses)
score_blocked_content is a red teaming policy — content-filtered responses may contain useful partial content

They overlap because message.is_error() returns True for blocked responses. The implementation handles this: when both are active and the blocked message has partial content, scoring is not skipped.

| `skip_on_error_result` | `score_blocked_content` | Behavior |
| --- | --- | --- |
| `True` | `False` | Skip all errors including blocks (`PromptSendingAttack` default) |
| `False` | `False` | Score everything; blocked pieces get hardcoded False (`CrescendoAttack` default) |
| `True` | `True` | Skip real errors, but evaluate partial content from blocks |
| `False` | `True` | Score everything; evaluate partial content from blocks |
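
The skip decision for the four combinations can be expressed as a small predicate. This is a sketch under the assumptions stated in this section; `should_skip` and its keyword names are hypothetical and do not correspond to a function in the PR.

```python
def should_skip(
    *,
    is_error: bool,
    is_blocked: bool,
    has_partial_content: bool,
    skip_on_error_result: bool,
    score_blocked_content: bool,
) -> bool:
    """Decide whether scoring is skipped for one response."""
    if not (skip_on_error_result and is_error):
        return False
    # Blocked responses count as errors, but an opted-in scorer
    # still evaluates any partial text they carry.
    if score_blocked_content and is_blocked and has_partial_content:
        return False
    return True


# Walk the four flag combinations for a blocked response with partial content.
for skip in (True, False):
    for score_blocked in (False, True):
        skipped = should_skip(
            is_error=True,
            is_blocked=True,
            has_partial_content=True,
            skip_on_error_result=skip,
            score_blocked_content=score_blocked,
        )
        print(f"skip_on_error_result={skip} score_blocked_content={score_blocked} skipped={skipped}")
```

Only the first combination skips the blocked response. A blocked response without partial content, or a non-block error, is still skipped whenever skip_on_error_result is True.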

Printing

ConsoleAttackResultPrinter now shows blocked partial content when printing a result or conversation.

Tests

Tested via notebook against endpoints with content filters enabled and triggered, confirming the expected printing and scoring behavior. Unit tests:

TestCreateTextPieceFromBlocked (6 tests in test_scorer.py) — substitute piece creation, field preservation, response_error="none", None when no partial content
TestScoreAsyncWithBlockedContent (7 tests in test_scorer.py) — text-only scorer filtering, substitute scoring, refusal scorer short-circuit bypass, no-partial-content handling, mixed pieces, normal piece unaffected
TestSkipOnErrorWithBlockedContent (3 tests in test_scorer.py) — interaction between skip_on_error_result=True and score_blocked_content=True
TestScoreResponseAsyncBlockedContent (3 tests in test_scorer.py) — flag flows through from scorer instance via score_response_async and score_response_multiple_scorers_async
TestExtractPartialContentChatTarget (4 tests in test_openai_chat_target.py) — extraction from Chat Completions responses, None edge cases
TestContentFilterPreservesPartialContent (2 tests in test_openai_chat_target.py) — end-to-end: 200 + content_filter preserves metadata, no metadata when no content
test_conversation_scorer_uses_partial_content_when_score_blocked_content_enabled (in test_conversation_history_scorer.py) — verifies partial content is used in conversation text when flag is on
test_conversation_scorer_uses_error_json_when_score_blocked_content_disabled (in test_conversation_history_scorer.py) — verifies default behavior uses error JSON

fdubut · 2026-05-05T00:02:10Z

Thanks for implementing this feature! I did only a cursory review of the code but read the thorough PR description and the overall design looks good to me.

…et changes

jsong468 added 3 commits May 4, 2026 15:24

score blocked content

668512d

docstring

e6fae92

merge conflicts

3beff4a

jsong468 changed the title ~~score blocked content~~ FEAT: Score partial content from content-filtered responses May 4, 2026

fix unit tests

9a4505a

adrian-gavrila reviewed May 5, 2026

Comment thread pyrit/score/scorer.py Outdated

Comment thread pyrit/score/scorer.py Outdated

rlundeen2 reviewed May 5, 2026

Comment thread pyrit/score/true_false/true_false_scorer.py Outdated

rlundeen2 reviewed May 5, 2026

Comment thread pyrit/score/conversation_scorer.py

rlundeen2 reviewed May 5, 2026

Comment thread pyrit/executor/attack/core/attack_config.py Outdated

jsong468 added 2 commits May 5, 2026 16:53

fix conversation_scorer bug and score_async

fc6c7e7

minor truthiness change

c49debc

rlundeen2 reviewed May 6, 2026

Comment thread pyrit/score/scorer.py Outdated