Skip to content

security: spotlight untrusted content + ship an honest threat model#158

Open
Shawnaldinho wants to merge 1 commit into
willchen96:mainfrom
Shawnaldinho:feat/prompt-injection-defense-rung1
Open

security: spotlight untrusted content + ship an honest threat model#158
Shawnaldinho wants to merge 1 commit into
willchen96:mainfrom
Shawnaldinho:feat/prompt-injection-defense-rung1

Conversation

@Shawnaldinho
Copy link
Copy Markdown

Background

Replaces the closed #154. That PR was cosmetic — homoglyph substitution + a static <available_documents> tag — and I had not actually tested it. The reviewer was right to call it out. This PR is the honest version.

Scope statement up front: this raises the bar on casual prompt injection. It does not prevent a determined attacker. Real prevention needs output classification + capability containment + behavioural validation against the model, and those are several-week follow-ups documented as gaps in docs/SECURITY-MODEL.md. The LLM is not a security boundary; the docs and README now say so explicitly.

What changes

1. Per-request spotlighting fence — backend/src/lib/promptFence.ts

Each request generates a fresh 64-bit hex nonce. Untrusted spans are wrapped as

```
«UNTRUSTED::»...payload...«END:»
```

The system prompt carries a fenceInstructions(nonce) block exactly once per turn telling the model that fenced content is data, not instructions, and that the nonce rotates per request and cannot be forged. Security comes from the unguessable nonce, not from sanitising the payload — that was the PR #154 mistake. Light hygiene strips C0 control bytes and caps label-shaped fields at 512 chars; no XML angle-bracket substitution.

2. Apply it to the actual attack surface — backend/src/lib/chatTools.ts

The real prompt-injection vector isn't filenames in the system prompt (the only thing #154 touched) — it's document body text returned by tool calls. This PR fences:

  • buildMessages — filenames, folder paths, workflow titles in the system prompt and on user messages.
  • enrichWithPriorEvents — filenames and workflow titles in the prior-turn summary attached to the last assistant message.
  • runToolCalls for read_document, find_in_document, fetch_documents, list_documents, list_workflows, read_workflow — document body text, search excerpts, the JSON payloads of list tools, and workflow prompt_md.

3. Thread the nonce through the routes — backend/src/routes/{chat,projectChat}.ts

Same nonce is generated at the top of the POST handler and passed into buildMessages, enrichWithPriorEvents, and runLLMStream so the system-prompt convention matches the tool-result markers.

4. Adversarial corpus + structural test runner — backend/tests/promptFence/

20 attacks across the real surface: naive override in filenames, newline-break, XML-fence-break, oversize label, homoglyph, role-play in doc body, fence-close forgery with a guessed nonce, base64 payload, tool-call injection, exfiltration prompts, multi-turn drift, workflow title/prompt overrides, search-hit injection, list_documents filename injection, folder path injection, instruction-shaped-but-benign, plus a clean control case.

runStructural.ts walks every entry through the real fenceLabel/fenceBody/buildMessages code paths and asserts:

  • per-request nonce shape (16 lowercase hex);
  • two consecutive nonces differ (replay protection);
  • the legitimate close marker appears exactly once even when the payload tries to forge it with a different static nonce;
  • control bytes are stripped;
  • oversize labels truncate to 512 chars + ellipsis;
  • the system prompt carries the matching fenceInstructions block;
  • the evil filename's embedded "SYSTEM:" string lands inside an unclosed fence, not bare in the system prompt.

```
$ npm run test:prompt-fence --prefix backend
OK 91 structural assertions passed across 20 corpus entries.
```

I actually ran this. Output included verbatim above.

5. Threat model — docs/SECURITY-MODEL.md

  • Threat actors and surfaces table.
  • What the codebase does today (this PR's mechanism).
  • What the codebase does NOT do — explicitly: no behavioural validation against live models, no output classification, no capability containment between read and write tools in a single turn, no context-window-crowding defence, multi-turn carry-over of a compromised prior turn is unmitigated.
  • What operators should do (don't upload from untrusted sources without reviewing tool calls; report via GitHub security advisories).

6. README — security section

One paragraph + a link to the threat model + the GitHub private-advisories link.

What this PR explicitly does NOT claim

  • It does not prevent prompt injection. Plausibly-shaped requests inside a fence (no fence-break attempt) will still reach the model and may still be obeyed.
  • It does not test behavioural compliance against the model. The structural tests prove the wrapping is correct; whether a given model honours the spotlighting convention requires running the corpus against a live API and judging responses. Documented as a follow-up.
  • It does not add capability containment. A turn can still chain read_documentedit_document without user confirmation.

Testing

```
npm run build --prefix backend # passes
npm run test:prompt-fence --prefix backend # 91/91 assertions pass
```

Closes the prompt-injection concern from https://insights.flank.ai/where-mikeoss-falls-short.html (gap 12) in honest scope: structural defence-in-depth + threat-model writeup, not a claim of prevention.

Closed PR willchen96#154 was cosmetic: it sanitised `<` to a homoglyph and
wrapped one filename list in a static <available_documents> tag, and
I had not actually tested it. The reviewer was right. This PR
replaces it with something honest about what it can and cannot do.

What changes:

1. New backend/src/lib/promptFence.ts
   - makeFenceNonce() returns a fresh 64-bit hex string per request.
   - fenceLabel/fenceBody wrap an untrusted span as
       «UNTRUSTED:<nonce>:<kind>»...payload...«END:<nonce>»
     The closing marker uses the same per-request nonce, so an
     attacker who controls the payload (filename, doc body text)
     cannot guess the close marker.
   - fenceInstructions(nonce) produces the boilerplate that goes
     into the system prompt exactly once per turn, telling the model
     to treat fenced content as data, not instructions, and that the
     nonce rotates per request and cannot be forged.
   - Hygiene: strips C0 control bytes; caps label-shaped fields at
     512 chars. No XML angle-bracket substitution — that was the
     PR willchen96#154 mistake. The security comes from the unguessable nonce.

2. backend/src/lib/chatTools.ts
   - buildMessages() takes a fenceNonce, weaves fenceInstructions
     into the system prompt, and wraps every filename/folder/workflow
     title that flows from user-controlled state.
   - enrichWithPriorEvents() also takes the nonce and wraps filenames
     and workflow titles inside its prior-turn summary lines.
   - runToolCalls() now fences the high-leverage surface — the actual
     attack vector PR willchen96#154 missed: document body text returned by
     read_document, fetch_documents (per-doc), search excerpts from
     find_in_document, the JSON payloads of list_documents /
     list_workflows, and workflow prompt_md from read_workflow.

3. Per-request nonce generation in chat.ts and projectChat.ts; the
   same nonce is threaded into buildMessages and runLLMStream so the
   system-prompt convention matches the tool-result markers.

4. Adversarial corpus + structural test runner
   - backend/tests/promptFence/corpus.json: 20 attacks across the
     real surface (naive override in filenames, newline-break,
     XML-fence-break, oversize label, homoglyph, role-play in doc
     body, fence-close forgery with a guessed nonce, base64 payload,
     tool-call injection, exfiltration prompts, multi-turn drift,
     workflow title/prompt overrides, search hit injection,
     list_documents filename injection, folder path injection,
     instruction-shaped-but-benign, plus a clean control case).
   - backend/tests/promptFence/runStructural.ts walks every entry
     through the real fenceLabel/fenceBody/buildMessages code paths
     and asserts: per-request nonce shape; the legitimate close
     marker appears exactly once even when the payload tries to
     forge it with a different nonce; control bytes are stripped;
     oversize labels truncated; system prompt carries the matching
     fenceInstructions block.
   - 91 assertions pass on the current code.
   - npm run test:prompt-fence --prefix backend.

5. docs/SECURITY-MODEL.md
   - Lists threat actors and surfaces.
   - Documents what the codebase does today (spotlighting, hygiene,
     structural tests) and — explicitly — what it does NOT do:
     no behavioural validation against live models, no output
     classification, no capability containment between read tools
     and write tools, no context-window-crowding defence.
   - "The LLM is not a security boundary" stated up front.

6. README.md: a Security model section pointing at SECURITY-MODEL.md
   with the same explicit caveat and the GitHub private-advisories
   link.

Honest scope: this raises the bar on casual prompt injection by
filenames and by document body text reaching the model through tool
results. It does not prevent a determined attacker. Real prevention
needs output classification + capability containment + an adversarial
behavioural test against the model itself; those are several-week
follow-ups and are listed in SECURITY-MODEL.md as gaps.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant