security: spotlight untrusted content + ship an honest threat model#158
Open
Shawnaldinho wants to merge 1 commit into
Open
security: spotlight untrusted content + ship an honest threat model#158Shawnaldinho wants to merge 1 commit into
Shawnaldinho wants to merge 1 commit into
Conversation
Closed PR willchen96#154 was cosmetic: it sanitised `<` to a homoglyph and wrapped one filename list in a static <available_documents> tag, and I had not actually tested it. The reviewer was right. This PR replaces it with something honest about what it can and cannot do. What changes: 1. New backend/src/lib/promptFence.ts - makeFenceNonce() returns a fresh 64-bit hex string per request. - fenceLabel/fenceBody wrap an untrusted span as «UNTRUSTED:<nonce>:<kind>»...payload...«END:<nonce>» The closing marker uses the same per-request nonce, so an attacker who controls the payload (filename, doc body text) cannot guess the close marker. - fenceInstructions(nonce) produces the boilerplate that goes into the system prompt exactly once per turn, telling the model to treat fenced content as data, not instructions, and that the nonce rotates per request and cannot be forged. - Hygiene: strips C0 control bytes; caps label-shaped fields at 512 chars. No XML angle-bracket substitution — that was the PR willchen96#154 mistake. The security comes from the unguessable nonce. 2. backend/src/lib/chatTools.ts - buildMessages() takes a fenceNonce, weaves fenceInstructions into the system prompt, and wraps every filename/folder/workflow title that flows from user-controlled state. - enrichWithPriorEvents() also takes the nonce and wraps filenames and workflow titles inside its prior-turn summary lines. - runToolCalls() now fences the high-leverage surface — the actual attack vector PR willchen96#154 missed: document body text returned by read_document, fetch_documents (per-doc), search excerpts from find_in_document, the JSON payloads of list_documents / list_workflows, and workflow prompt_md from read_workflow. 3. Per-request nonce generation in chat.ts and projectChat.ts; the same nonce is threaded into buildMessages and runLLMStream so the system-prompt convention matches the tool-result markers. 4. Adversarial corpus + structural test runner - backend/tests/promptFence/corpus.json: 20 attacks across the real surface (naive override in filenames, newline-break, XML-fence-break, oversize label, homoglyph, role-play in doc body, fence-close forgery with a guessed nonce, base64 payload, tool-call injection, exfiltration prompts, multi-turn drift, workflow title/prompt overrides, search hit injection, list_documents filename injection, folder path injection, instruction-shaped-but-benign, plus a clean control case). - backend/tests/promptFence/runStructural.ts walks every entry through the real fenceLabel/fenceBody/buildMessages code paths and asserts: per-request nonce shape; the legitimate close marker appears exactly once even when the payload tries to forge it with a different nonce; control bytes are stripped; oversize labels truncated; system prompt carries the matching fenceInstructions block. - 91 assertions pass on the current code. - npm run test:prompt-fence --prefix backend. 5. docs/SECURITY-MODEL.md - Lists threat actors and surfaces. - Documents what the codebase does today (spotlighting, hygiene, structural tests) and — explicitly — what it does NOT do: no behavioural validation against live models, no output classification, no capability containment between read tools and write tools, no context-window-crowding defence. - "The LLM is not a security boundary" stated up front. 6. README.md: a Security model section pointing at SECURITY-MODEL.md with the same explicit caveat and the GitHub private-advisories link. Honest scope: this raises the bar on casual prompt injection by filenames and by document body text reaching the model through tool results. It does not prevent a determined attacker. Real prevention needs output classification + capability containment + an adversarial behavioural test against the model itself; those are several-week follow-ups and are listed in SECURITY-MODEL.md as gaps.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
Replaces the closed #154. That PR was cosmetic — homoglyph substitution + a static
<available_documents>tag — and I had not actually tested it. The reviewer was right to call it out. This PR is the honest version.Scope statement up front: this raises the bar on casual prompt injection. It does not prevent a determined attacker. Real prevention needs output classification + capability containment + behavioural validation against the model, and those are several-week follow-ups documented as gaps in
docs/SECURITY-MODEL.md. The LLM is not a security boundary; the docs and README now say so explicitly.What changes
1. Per-request spotlighting fence —
backend/src/lib/promptFence.tsEach request generates a fresh 64-bit hex nonce. Untrusted spans are wrapped as
```
«UNTRUSTED::»...payload...«END:»
```
The system prompt carries a
fenceInstructions(nonce)block exactly once per turn telling the model that fenced content is data, not instructions, and that the nonce rotates per request and cannot be forged. Security comes from the unguessable nonce, not from sanitising the payload — that was the PR #154 mistake. Light hygiene strips C0 control bytes and caps label-shaped fields at 512 chars; no XML angle-bracket substitution.2. Apply it to the actual attack surface —
backend/src/lib/chatTools.tsThe real prompt-injection vector isn't filenames in the system prompt (the only thing #154 touched) — it's document body text returned by tool calls. This PR fences:
buildMessages— filenames, folder paths, workflow titles in the system prompt and on user messages.enrichWithPriorEvents— filenames and workflow titles in the prior-turn summary attached to the last assistant message.runToolCallsforread_document,find_in_document,fetch_documents,list_documents,list_workflows,read_workflow— document body text, search excerpts, the JSON payloads of list tools, and workflowprompt_md.3. Thread the nonce through the routes —
backend/src/routes/{chat,projectChat}.tsSame nonce is generated at the top of the POST handler and passed into
buildMessages,enrichWithPriorEvents, andrunLLMStreamso the system-prompt convention matches the tool-result markers.4. Adversarial corpus + structural test runner —
backend/tests/promptFence/20 attacks across the real surface: naive override in filenames, newline-break, XML-fence-break, oversize label, homoglyph, role-play in doc body, fence-close forgery with a guessed nonce, base64 payload, tool-call injection, exfiltration prompts, multi-turn drift, workflow title/prompt overrides, search-hit injection,
list_documentsfilename injection, folder path injection, instruction-shaped-but-benign, plus a clean control case.runStructural.tswalks every entry through the realfenceLabel/fenceBody/buildMessagescode paths and asserts:fenceInstructionsblock;```
$ npm run test:prompt-fence --prefix backend
OK 91 structural assertions passed across 20 corpus entries.
```
I actually ran this. Output included verbatim above.
5. Threat model —
docs/SECURITY-MODEL.md6. README — security section
One paragraph + a link to the threat model + the GitHub private-advisories link.
What this PR explicitly does NOT claim
read_document→edit_documentwithout user confirmation.Testing
```
npm run build --prefix backend # passes
npm run test:prompt-fence --prefix backend # 91/91 assertions pass
```
Closes the prompt-injection concern from https://insights.flank.ai/where-mikeoss-falls-short.html (gap 12) in honest scope: structural defence-in-depth + threat-model writeup, not a claim of prevention.