test: harden MCP shell approval continuation #18038
Draft: starr-openai wants to merge 1 commit into main.
Summary
What was wrong
The affected MCP test covers shell-command approval continuation.
The intended behavior is that, after the user approves the shell command, the agent runs it and posts a `function_call_output` back to the model, which then issues a follow-up request.

The old test tried to accommodate this multi-request flow by relaxing the mock response server's request-count expectations too broadly. That avoided some timing failures, but it weakened the harness in places where strict request-count checking is valuable.
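A minimal sketch of the approval-continuation flow described above, written as a toy state machine. The state names and the `call_id` are illustrative, not the actual protocol types in the codebase:

```rust
// Illustrative states of the shell-approval continuation flow (hypothetical
// names; the real harness does not use this enum).
#[derive(Debug, PartialEq)]
enum FlowState {
    /// The model emitted a shell function call; the test elicits user approval.
    AwaitingApproval { call_id: String },
    /// Approval granted; the agent is running the command.
    RunningCommand { call_id: String },
    /// The function_call_output was posted back; a second model request is due.
    AwaitingFinalResponse,
    /// The model's second response completed the turn.
    Done,
}

fn advance(state: FlowState) -> FlowState {
    match state {
        // Approval moves the flow into command execution.
        FlowState::AwaitingApproval { call_id } => FlowState::RunningCommand { call_id },
        // The command finishes and its output goes back to the model,
        // triggering the continuation (second) model request.
        FlowState::RunningCommand { .. } => FlowState::AwaitingFinalResponse,
        FlowState::AwaitingFinalResponse | FlowState::Done => FlowState::Done,
    }
}
```

The point of the sketch is that the flow legitimately spans two model requests, so any strict single-request expectation is wrong by construction.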
Why it was flaky
The approval flow legitimately spans more than one model request. A test that expects only one request, or that does not explicitly wait for the continuation request, can race the real protocol.
Depending on timing, it could assert before command output was posted back to the model, miss the second request, or fail because a legitimate extra request arrived. Broadly relaxing request counting made this less noisy but also made it easier to miss real regressions.
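One way to avoid that race is to block until the continuation request has actually arrived rather than asserting on a count immediately. A self-contained sketch (not the harness's actual code) using a `Mutex`/`Condvar` pair:

```rust
// Hypothetical request counter a mock server could expose so tests can wait
// for the Nth request instead of racing it.
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

#[derive(Clone)]
struct RequestCounter {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl RequestCounter {
    fn new() -> Self {
        RequestCounter {
            inner: Arc::new((Mutex::new(0), Condvar::new())),
        }
    }

    /// Called by the mock server for each incoming model request.
    fn record_request(&self) {
        let (lock, cvar) = &*self.inner;
        *lock.lock().unwrap() += 1;
        cvar.notify_all();
    }

    /// Block until at least `n` requests have arrived, or the timeout expires.
    /// Returns true if the count was reached in time.
    fn wait_for(&self, n: usize, timeout: Duration) -> bool {
        let (lock, cvar) = &*self.inner;
        let guard = lock.lock().unwrap();
        let (guard, result) = cvar
            .wait_timeout_while(guard, timeout, |count| *count < n)
            .unwrap();
        !result.timed_out() && *guard >= n
    }
}
```

With this shape, the test asserts only after `wait_for(2, …)` succeeds, so the continuation request can never be "missed" by an early assertion.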
How this fixes it
The change models the real multi-request protocol explicitly and keeps the relaxation local to this one test:

- `create_mock_responses_server` is strict by default again.
- The `create_mock_responses_server_without_expected_count` helper is used only where the approval flow needs it.
- The test waits for the `function_call_output` for `call1234` before expecting the final response.

This preserves strictness for the broader suite while making this test deterministic against the real approval-continuation behavior.
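A sketch of the strict-by-default/relaxed-opt-out split the bullets describe. The struct and constructor names mirror the PR's helpers, but the implementation here is purely illustrative:

```rust
// Illustrative mock: strict servers pin an exact request count, the relaxed
// variant (used only for the approval flow) skips the count check entirely.
struct MockResponsesServer {
    expected_requests: Option<usize>, // Some(n) => strict, None => relaxed
    observed_requests: usize,
}

impl MockResponsesServer {
    /// Strict by default: the test declares exactly how many model requests
    /// it expects to see.
    fn with_expected_count(n: usize) -> Self {
        MockResponsesServer {
            expected_requests: Some(n),
            observed_requests: 0,
        }
    }

    /// Relaxed variant for flows whose request count is timing-dependent.
    fn without_expected_count() -> Self {
        MockResponsesServer {
            expected_requests: None,
            observed_requests: 0,
        }
    }

    fn handle_request(&mut self) {
        self.observed_requests += 1;
    }

    /// Verified at the end of the test; a strict server fails on any mismatch.
    fn verify(&self) -> Result<(), String> {
        match self.expected_requests {
            Some(n) if n != self.observed_requests => Err(format!(
                "expected {n} requests, observed {}",
                self.observed_requests
            )),
            _ => Ok(()),
        }
    }
}
```

Keeping the relaxed constructor as a separate, explicitly named helper means a reader can grep for every place the suite opted out of strict counting.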
Validation
`//codex-rs/mcp-server:mcp-server-all-test --test_filter=suite::codex_tool::test_shell_command_approval_triggers_elicitation` passed on `codex-flaky-mcp-0415`.

Co-authored-by: Codex <noreply@openai.com>