FEAT: Score partial content from content-filtered responses #1689
Merged
Contributor
Thanks for implementing this feature! I did only a cursory review of the code but read the thorough PR description and the overall design looks good to me.
rlundeen2
reviewed
May 5, 2026
rlundeen2
reviewed
May 6, 2026
romanlutz
reviewed
May 6, 2026
rlundeen2
approved these changes
May 13, 2026
Description
When OpenAI/Azure Chat Completions content filters trigger mid-generation (HTTP 200 with `finish_reason=content_filter`), the model may have already produced partial content before being cut off. Currently, PyRIT discards this partial content and treats the response identically to a full block (HTTP 400), so scorers return hardcoded failures and attacks backtrack. From an adversary's perspective, partial harmful content may constitute a successful attack even though the output was eventually filtered.

This PR introduces a `score_blocked_content` attribute on the `Scorer` base class that allows scorers to opt in to evaluating partial content from blocked responses instead of automatically treating them as failures.

Target layer: partial content extraction

- Adds an `_extract_partial_content(response)` template method to the `OpenAITarget` base class, returning `None` by default
- `OpenAIChatTarget` and `OpenAIResponsesTarget` override it.
- `_handle_content_filter_response` calls the hook and attaches the result to `prompt_metadata["partial_content"]` on the blocked `MessagePiece` before it is persisted to the DB
- The blocked piece keeps `response_error="blocked"`, for full backward compatibility
- […] The `_extract_partial_content` hook can be overridden if this changes in the future.

Scorer layer: instance attribute on `Scorer`

- `score_blocked_content: bool = False` class attribute on `Scorer`
- When `True`, `score_async` creates a modified message where blocked pieces with `partial_content` metadata are replaced with text-type substitutes (`response_error="none"`, `converted_value_data_type="text"`, `converted_value=<partial text>`) before passing to `_score_async`
- Substitutes use `response_error="none"` so scorer short-circuits (e.g., the refusal scorer's `if response_error == "blocked"` check) do not fire, and the LLM evaluates the actual content
- When `skip_on_error_result=True` and `self.score_blocked_content=True`, blocked messages with partial content are not skipped.
- Substitution happens in `score_async` (not `_score_async`) so that `_score_async`'s signature remains `(self, message, *, objective=None)`, preserving full backward compatibility for external subclasses
- The resulting `Score` references the original blocked piece's ID
- `ConversationScorer` reads `self.score_blocked_content` directly when building conversation text from DB history, using `partial_content` from metadata instead of `converted_value` for blocked pieces

Design decision: instance attribute, not call-site parameter or `AttackScoringConfig` field

Previous design considerations:

Why not a call-site parameter on `score_async`? Earlier iterations added `score_blocked_content` as a parameter on `score_async`, `score_response_async`, and `score_response_multiple_scorers_async`, threaded through from `AttackScoringConfig` via each attack's scoring calls. This had two problems:

- Every caller of `score_async` must remember to forward the parameter.
- Adding a `score_blocked_content` parameter to `_score_async` broke external subclasses that override it with the original `(self, message, *, objective=None)` signature. Moving the substitution before `_score_async` solved this, but the forwarding burden through public APIs remained.

Why not on `AttackScoringConfig`? Putting the flag on the config centralizes it but still requires attacks to thread it through scoring calls. Adding it at the config level introduces complexity without significant benefit to the user.

Why an instance attribute? `score_blocked_content` is a scorer behavioral policy ("should this scorer evaluate partial content?"), similar to `_validator` or `_score_aggregator`. Setting it on the instance:

- `score_async` reads `self.score_blocked_content`. No parameters to thread through `score_response_async`, attacks, or scorer wrappers.
- No changes to the `score_async`, `_score_async`, `score_response_async`, or `score_response_multiple_scorers_async` signatures.
- `ConversationScorer` handling: reads `self.score_blocked_content` directly instead of inferring substitution state from message piece metadata.

Because `score_blocked_content` is mutable instance state, a shared scorer could theoretically be mutated between calls. In practice, this is not a concern because:

- […] `True`) and external code (wants `False`), the mutation would be explicit and visible (`scorer.score_blocked_content = True`), unlike a call-site parameter where the absence of the flag silently defaults to `False`.
- […] `score_blocked_content` policy does not apply anymore. This is reasonable since if you want different scoring policies, you need different scorers.

The alternative, a call-site parameter, would allow the same scorer instance to behave differently per call. However, this theoretical benefit doesn't materialize in practice (scorers aren't often shared) and comes at the cost of the forwarding burden described above.
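
The scorer-layer substitution and its interaction with `skip_on_error_result` can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `Piece` is a simplified, hypothetical stand-in for `MessagePiece`, and `text_piece_from_blocked` and `prepare_for_scoring` are illustrative names for the steps that `score_async` is described as performing before it calls `_score_async`.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class Piece:
    """Simplified, hypothetical stand-in for PyRIT's MessagePiece."""

    id: str
    converted_value: str
    converted_value_data_type: str = "text"
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)


def text_piece_from_blocked(piece: Piece) -> Optional[Piece]:
    """Return a text substitute for a blocked piece, or None without partial content."""
    partial = piece.prompt_metadata.get("partial_content")
    if piece.response_error != "blocked" or not partial:
        return None
    # The substitute keeps the original ID so a score can reference the blocked piece.
    return replace(
        piece,
        converted_value=partial,
        converted_value_data_type="text",
        response_error="none",
    )


def prepare_for_scoring(
    pieces: list[Piece],
    *,
    score_blocked_content: bool,
    skip_on_error_result: bool,
) -> Optional[list[Piece]]:
    """Return the pieces to score, or None when scoring should be skipped."""
    if score_blocked_content:
        # Swap in substitutes where partial content exists; leave other pieces as-is.
        pieces = [text_piece_from_blocked(p) or p for p in pieces]
    # Substituted pieces no longer count as errors, so they are not skipped.
    if skip_on_error_result and any(p.response_error != "none" for p in pieces):
        return None
    return pieces
```

In this sketch a blocked piece without `partial_content` is still treated as an error, so with `skip_on_error_result=True` it is skipped regardless of `score_blocked_content`.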
TAP `error_score_map` interaction

TAP's `error_score_map` (default `{"blocked": 0.0}`) runs before the scorer and assigns fixed scores for error types. When `"blocked"` is in the map, the scorer is never invoked, so `score_blocked_content` has no effect. To evaluate partial content with TAP, pass `error_score_map={}` to disable the early-return. This interaction is documented in both the node and `TreeOfAttacksWithPruningAttack` `error_score_map` docstrings.

Interaction between `skip_on_error_result` and `score_blocked_content`

These flags serve different purposes:

- `skip_on_error_result` is a performance optimization: don't waste LLM calls on broken responses (processing errors, empty responses)
- `score_blocked_content` is a red teaming policy: content-filtered responses may contain useful partial content

They overlap because `message.is_error()` returns `True` for blocked responses. The implementation handles this: when both are active and the blocked message has partial content, scoring is not skipped.

| `skip_on_error` | `score_blocked_content` | |
|---|---|---|
| `True` | `False` | […] (`PromptSendingAttack` default) |
| `False` | `False` | […] (`CrescendoAttack` default) |
| `True` | `True` | […] |
| `False` | `True` | […] |

Printing
`ConsoleAttackResultPrinter` now shows blocked partial content when printing a result or conversation. See below:

Tests
Tested via notebook to ensure that endpoints with filters on and triggered would produce the correct printing and scoring behavior. Further unit tests below:
- `TestCreateTextPieceFromBlocked` (6 tests in `test_scorer.py`): substitute piece creation, field preservation, `response_error="none"`, `None` when no partial content
- `TestScoreAsyncWithBlockedContent` (7 tests in `test_scorer.py`): text-only scorer filtering, substitute scoring, refusal scorer short-circuit bypass, no-partial-content handling, mixed pieces, normal piece unaffected
- `TestSkipOnErrorWithBlockedContent` (3 tests in `test_scorer.py`): interaction between `skip_on_error_result=True` and `score_blocked_content=True`
- `TestScoreResponseAsyncBlockedContent` (3 tests in `test_scorer.py`): flag flows through from scorer instance via `score_response_async` and `score_response_multiple_scorers_async`
- `TestExtractPartialContentChatTarget` (4 tests in `test_openai_chat_target.py`): extraction from Chat Completions responses, `None` edge cases
- `TestContentFilterPreservesPartialContent` (2 tests in `test_openai_chat_target.py`): end-to-end, 200 + content_filter preserves metadata, no metadata when no content
- `test_conversation_scorer_uses_partial_content_when_score_blocked_content_enabled` (in `test_conversation_history_scorer.py`): verifies partial content is used in conversation text when flag is on
- `test_conversation_scorer_uses_error_json_when_score_blocked_content_disabled` (in `test_conversation_history_scorer.py`): verifies default behavior uses error JSON
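
For readers who want a feel for the extraction behavior that the Chat Completions tests above exercise, here is a rough sketch. It is an assumption-laden illustration rather than the PR's implementation: the response is modeled with `SimpleNamespace` objects shaped like a Chat Completions result (`choices[0].message.content`), and `extract_partial_content` and `blocked_piece_metadata` are hypothetical free functions standing in for the `_extract_partial_content` hook and the metadata step in `_handle_content_filter_response`.

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_partial_content(response: Any) -> Optional[str]:
    """Return text generated before the filter fired, or None if there is none."""
    choices = getattr(response, "choices", None) or []
    if not choices:
        return None
    message = getattr(choices[0], "message", None)
    content = getattr(message, "content", None)
    # Treat an empty string the same as missing content.
    return content or None


def blocked_piece_metadata(response: Any) -> dict:
    """Build metadata for the blocked piece, adding partial content only when present."""
    metadata: dict = {}
    partial = extract_partial_content(response)
    if partial is not None:
        metadata["partial_content"] = partial
    return metadata


# Hypothetical HTTP 200 response that was cut off by the content filter.
filtered = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="content_filter",
            message=SimpleNamespace(content="Step 1: ..."),
        )
    ]
)
print(blocked_piece_metadata(filtered))
```

When no text was produced before the filter fired, the sketch returns an empty metadata dict, which mirrors the "no metadata when no content" case listed above.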