test: harden MCP shell approval continuation#18038

Draft
starr-openai wants to merge 1 commit into main from starr/flaky-mcp-approval-20260415

Conversation

Contributor

@starr-openai starr-openai commented Apr 16, 2026

Summary

  • keep the shared mock responses server strict by default
  • add a relaxed mock helper only for the shell approval test with explicit request-count assertions
  • assert the continuation request contains structured shell function output before expecting the final response

What was wrong

The affected MCP test covers shell-command approval continuation.

The intended behavior is:

  • MCP invokes a shell command.
  • The shell command requires approval/elicitation.
  • After approval, Codex runs the command.
  • Codex sends structured function_call_output back to the model.
  • The model continues to the final response.
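As a rough illustration, the intended flow can be modeled as a small state machine over protocol events. The `Phase` enum and `advance` function below are purely hypothetical sketches and are not part of the real codex-rs harness.

```rust
// Hypothetical model of the approval-continuation flow; none of these
// names come from the actual codex-rs test harness.

#[derive(Debug, PartialEq)]
enum Phase {
    AwaitingFirstRequest,  // model asked to run the shell command
    AwaitingApproval,      // elicitation is pending user approval
    AwaitingContinuation,  // command ran; output must be posted back
    Done,                  // model produced the final response
}

fn advance(phase: Phase, event: &str) -> Phase {
    match (phase, event) {
        (Phase::AwaitingFirstRequest, "shell_call") => Phase::AwaitingApproval,
        (Phase::AwaitingApproval, "approved") => Phase::AwaitingContinuation,
        (Phase::AwaitingContinuation, "function_call_output") => Phase::Done,
        // Unexpected events leave the phase unchanged.
        (p, _) => p,
    }
}

fn main() {
    let mut p = Phase::AwaitingFirstRequest;
    for ev in ["shell_call", "approved", "function_call_output"] {
        p = advance(p, ev);
    }
    assert_eq!(p, Phase::Done);
    println!("final phase: {:?}", p);
}
```

The key point the sketch captures is that `Done` is only reachable after the continuation event, so a test asserting the final response must first observe the second model request.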

The old test tried to accommodate this multi-request flow by relaxing mock response-server request-count expectations too broadly. That avoided some timing failures, but it weakened the harness in places where strict request-count checking is valuable.

Why it was flaky

The approval flow legitimately spans more than one model request. A test that expects only one request, or that does not explicitly wait for the continuation request, can race the real protocol.

Depending on timing, it could assert before command output was posted back to the model, miss the second request, or fail because a legitimate extra request arrived. Broadly relaxing request counting made this less noisy but also made it easier to miss real regressions.
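A deterministic alternative to asserting the request count immediately is to poll the recorded count against a deadline. The `wait_for_requests` helper below is a hypothetical sketch of that pattern, not the real harness API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical sketch: rather than racing the approval flow, wait until
// the expected number of model requests has arrived or a deadline passes.
fn wait_for_requests(counter: &AtomicUsize, expected: usize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if counter.load(Ordering::SeqCst) >= expected {
            return true;
        }
        thread::sleep(Duration::from_millis(5));
    }
    counter.load(Ordering::SeqCst) >= expected
}

fn main() {
    let requests = Arc::new(AtomicUsize::new(0));
    let server_side = Arc::clone(&requests);
    // Simulate the mock server receiving the initial and continuation requests.
    let handle = thread::spawn(move || {
        server_side.fetch_add(1, Ordering::SeqCst); // first model request
        thread::sleep(Duration::from_millis(20));   // approval + command run
        server_side.fetch_add(1, Ordering::SeqCst); // continuation request
    });
    assert!(wait_for_requests(&requests, 2, Duration::from_secs(1)));
    handle.join().unwrap();
    println!("saw both requests");
}
```

Waiting for `>= expected` with a bounded deadline removes the timing dependence while still failing loudly if the continuation request never arrives.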

How this fixes it

The change models the real multi-request protocol explicitly and keeps the relaxation local to this one test.

  • The shared create_mock_responses_server is strict by default again.
  • A narrowly named create_mock_responses_server_without_expected_count helper is used only where the approval flow needs it.
  • The test asserts the shell side effect occurred.
  • The test waits for the expected two model requests.
  • It asserts the continuation request contains structured function_call_output for call1234 before expecting the final response.

This preserves strictness for the broader suite while making this test deterministic against the real approval-continuation behavior.
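The split between a strict default and a narrowly scoped relaxed constructor could look roughly like the sketch below. `RequestRecorder` and its methods are illustrative stand-ins invented for this example, not the actual codex-rs helpers.

```rust
// Hypothetical sketch of a strict-by-default request recorder with a
// narrowly scoped relaxed constructor; not the real codex-rs mock server.

struct RequestRecorder {
    expected_count: Option<usize>, // None = relaxed (no count check)
    requests: Vec<String>,         // recorded request bodies, in order
}

impl RequestRecorder {
    // Strict default: callers must state exactly how many requests to expect.
    fn strict(expected: usize) -> Self {
        Self { expected_count: Some(expected), requests: Vec::new() }
    }

    // Relaxed variant, used only where a flow legitimately spans a
    // variable number of requests.
    fn without_expected_count() -> Self {
        Self { expected_count: None, requests: Vec::new() }
    }

    fn record(&mut self, body: &str) {
        self.requests.push(body.to_string());
    }

    // True if the second (continuation) request carries structured
    // function output for the given call id.
    fn continuation_has_output(&self, call_id: &str) -> bool {
        self.requests.get(1).map_or(false, |body| {
            body.contains("function_call_output") && body.contains(call_id)
        })
    }

    fn verify(&self) -> bool {
        self.expected_count.map_or(true, |n| self.requests.len() == n)
    }
}

fn main() {
    let mut rec = RequestRecorder::strict(2);
    rec.record(r#"{"input":[{"type":"message"}]}"#);
    rec.record(r#"{"input":[{"type":"function_call_output","call_id":"call1234"}]}"#);
    assert!(rec.verify());
    assert!(rec.continuation_has_output("call1234"));
    println!("strict recorder verified");
}
```

Keeping the relaxed constructor separately named (as the PR does with `create_mock_responses_server_without_expected_count`) makes any future use of the weaker check visible at the call site.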

Validation

  • Remote Bazel before final review loop: //codex-rs/mcp-server:mcp-server-all-test --test_filter=suite::codex_tool::test_shell_command_approval_triggers_elicitation passed on codex-flaky-mcp-0415
  • Read-only subagent review loop converged with no remaining findings

Co-authored-by: Codex <noreply@openai.com>
