security: spotlight untrusted content + ship an honest threat model by Shawnaldinho · Pull Request #158 · willchen96/mike

Shawnaldinho · 2026-05-20T10:43:19Z

Background

Replaces the closed #154. That PR was cosmetic — homoglyph substitution + a static <available_documents> tag — and I had not actually tested it. The reviewer was right to call it out. This PR is the honest version.

Scope statement up front: this raises the bar on casual prompt injection. It does not prevent a determined attacker. Real prevention needs output classification + capability containment + behavioural validation against the model, and those are several-week follow-ups documented as gaps in docs/SECURITY-MODEL.md. The LLM is not a security boundary; the docs and README now say so explicitly.

What changes

1. Per-request spotlighting fence — `backend/src/lib/promptFence.ts`

Each request generates a fresh 64-bit hex nonce. Untrusted spans are wrapped as

```
«UNTRUSTED::»...payload...«END:»
```

The system prompt carries a fenceInstructions(nonce) block exactly once per turn telling the model that fenced content is data, not instructions, and that the nonce rotates per request and cannot be forged. Security comes from the unguessable nonce, not from sanitising the payload — that was the PR #154 mistake. Light hygiene strips C0 control bytes and caps label-shaped fields at 512 chars; no XML angle-bracket substitution.

2. Apply it to the actual attack surface — `backend/src/lib/chatTools.ts`

The real prompt-injection vector isn't filenames in the system prompt (the only thing #154 touched) — it's document body text returned by tool calls. This PR fences:

buildMessages — filenames, folder paths, workflow titles in the system prompt and on user messages.
enrichWithPriorEvents — filenames and workflow titles in the prior-turn summary attached to the last assistant message.
runToolCalls for read_document, find_in_document, fetch_documents, list_documents, list_workflows, read_workflow — document body text, search excerpts, the JSON payloads of list tools, and workflow prompt_md.

3. Thread the nonce through the routes — `backend/src/routes/{chat,projectChat}.ts`

Same nonce is generated at the top of the POST handler and passed into buildMessages, enrichWithPriorEvents, and runLLMStream so the system-prompt convention matches the tool-result markers.

4. Adversarial corpus + structural test runner — `backend/tests/promptFence/`

20 attacks across the real surface: naive override in filenames, newline-break, XML-fence-break, oversize label, homoglyph, role-play in doc body, fence-close forgery with a guessed nonce, base64 payload, tool-call injection, exfiltration prompts, multi-turn drift, workflow title/prompt overrides, search-hit injection, list_documents filename injection, folder path injection, instruction-shaped-but-benign, plus a clean control case.

runStructural.ts walks every entry through the real fenceLabel/fenceBody/buildMessages code paths and asserts:

per-request nonce shape (16 lowercase hex);
two consecutive nonces differ (replay protection);
the legitimate close marker appears exactly once even when the payload tries to forge it with a different static nonce;
control bytes are stripped;
oversize labels truncate to 512 chars + ellipsis;
the system prompt carries the matching fenceInstructions block;
the evil filename's embedded "SYSTEM:" string lands inside an unclosed fence, not bare in the system prompt.

```
$ npm run test:prompt-fence --prefix backend
OK 91 structural assertions passed across 20 corpus entries.
```

I actually ran this. Output included verbatim above.

5. Threat model — `docs/SECURITY-MODEL.md`

Threat actors and surfaces table.
What the codebase does today (this PR's mechanism).
What the codebase does NOT do — explicitly: no behavioural validation against live models, no output classification, no capability containment between read and write tools in a single turn, no context-window-crowding defence, multi-turn carry-over of a compromised prior turn is unmitigated.
What operators should do (don't upload from untrusted sources without reviewing tool calls; report via GitHub security advisories).

6. README — security section

One paragraph + a link to the threat model + the GitHub private-advisories link.

What this PR explicitly does NOT claim

It does not prevent prompt injection. Plausibly-shaped requests inside a fence (no fence-break attempt) will still reach the model and may still be obeyed.
It does not test behavioural compliance against the model. The structural tests prove the wrapping is correct; whether a given model honours the spotlighting convention requires running the corpus against a live API and judging responses. Documented as a follow-up.
It does not add capability containment. A turn can still chain read_document → edit_document without user confirmation.

Testing

```
npm run build --prefix backend # passes
npm run test:prompt-fence --prefix backend # 91/91 assertions pass
```

Closes the prompt-injection concern from https://insights.flank.ai/where-mikeoss-falls-short.html (gap 12) in honest scope: structural defence-in-depth + threat-model writeup, not a claim of prevention.

Closed PR willchen96#154 was cosmetic: it sanitised `<` to a homoglyph and wrapped one filename list in a static <available_documents> tag, and I had not actually tested it. The reviewer was right. This PR replaces it with something honest about what it can and cannot do. What changes: 1. New backend/src/lib/promptFence.ts - makeFenceNonce() returns a fresh 64-bit hex string per request. - fenceLabel/fenceBody wrap an untrusted span as «UNTRUSTED:<nonce>:<kind>»...payload...«END:<nonce>» The closing marker uses the same per-request nonce, so an attacker who controls the payload (filename, doc body text) cannot guess the close marker. - fenceInstructions(nonce) produces the boilerplate that goes into the system prompt exactly once per turn, telling the model to treat fenced content as data, not instructions, and that the nonce rotates per request and cannot be forged. - Hygiene: strips C0 control bytes; caps label-shaped fields at 512 chars. No XML angle-bracket substitution — that was the PR willchen96#154 mistake. The security comes from the unguessable nonce. 2. backend/src/lib/chatTools.ts - buildMessages() takes a fenceNonce, weaves fenceInstructions into the system prompt, and wraps every filename/folder/workflow title that flows from user-controlled state. - enrichWithPriorEvents() also takes the nonce and wraps filenames and workflow titles inside its prior-turn summary lines. - runToolCalls() now fences the high-leverage surface — the actual attack vector PR willchen96#154 missed: document body text returned by read_document, fetch_documents (per-doc), search excerpts from find_in_document, the JSON payloads of list_documents / list_workflows, and workflow prompt_md from read_workflow. 3. Per-request nonce generation in chat.ts and projectChat.ts; the same nonce is threaded into buildMessages and runLLMStream so the system-prompt convention matches the tool-result markers. 4. Adversarial corpus + structural test runner - backend/tests/promptFence/corpus.json: 20 attacks across the real surface (naive override in filenames, newline-break, XML-fence-break, oversize label, homoglyph, role-play in doc body, fence-close forgery with a guessed nonce, base64 payload, tool-call injection, exfiltration prompts, multi-turn drift, workflow title/prompt overrides, search hit injection, list_documents filename injection, folder path injection, instruction-shaped-but-benign, plus a clean control case). - backend/tests/promptFence/runStructural.ts walks every entry through the real fenceLabel/fenceBody/buildMessages code paths and asserts: per-request nonce shape; the legitimate close marker appears exactly once even when the payload tries to forge it with a different nonce; control bytes are stripped; oversize labels truncated; system prompt carries the matching fenceInstructions block. - 91 assertions pass on the current code. - npm run test:prompt-fence --prefix backend. 5. docs/SECURITY-MODEL.md - Lists threat actors and surfaces. - Documents what the codebase does today (spotlighting, hygiene, structural tests) and — explicitly — what it does NOT do: no behavioural validation against live models, no output classification, no capability containment between read tools and write tools, no context-window-crowding defence. - "The LLM is not a security boundary" stated up front. 6. README.md: a Security model section pointing at SECURITY-MODEL.md with the same explicit caveat and the GitHub private-advisories link. Honest scope: this raises the bar on casual prompt injection by filenames and by document body text reaching the model through tool results. It does not prevent a determined attacker. Real prevention needs output classification + capability containment + an adversarial behavioural test against the model itself; those are several-week follow-ups and are listed in SECURITY-MODEL.md as gaps.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

security: spotlight untrusted content + ship an honest threat model#158

security: spotlight untrusted content + ship an honest threat model#158
Shawnaldinho wants to merge 1 commit into
willchen96:mainfrom
Shawnaldinho:feat/prompt-injection-defense-rung1

Shawnaldinho commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Shawnaldinho commented May 20, 2026

Background

What changes

1. Per-request spotlighting fence — backend/src/lib/promptFence.ts

2. Apply it to the actual attack surface — backend/src/lib/chatTools.ts

3. Thread the nonce through the routes — backend/src/routes/{chat,projectChat}.ts

4. Adversarial corpus + structural test runner — backend/tests/promptFence/

5. Threat model — docs/SECURITY-MODEL.md

6. README — security section

What this PR explicitly does NOT claim

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Per-request spotlighting fence — `backend/src/lib/promptFence.ts`

2. Apply it to the actual attack surface — `backend/src/lib/chatTools.ts`

3. Thread the nonce through the routes — `backend/src/routes/{chat,projectChat}.ts`

4. Adversarial corpus + structural test runner — `backend/tests/promptFence/`

5. Threat model — `docs/SECURITY-MODEL.md`