Commit 4150c3c

Enable prompt tuning overrides and document alignment
1 parent 966b7ba commit 4150c3c

18 files changed

Lines changed: 637 additions & 22 deletions

README.md

Lines changed: 15 additions & 3 deletions
@@ -86,13 +86,25 @@ graphrag/
8686

8787
## Integration Testing Strategy
8888

89-
- **No fakes.** We removed the legacy fake Postgres store. Every graph operation in tests uses real services orchestrated by Testcontainers.
90-
- **Security coverage.** `Integration/PostgresGraphStoreIntegrationTests.cs` includes payloads that mimic SQL/Cypher injection attempts to ensure values remain literals and labels/types are strictly validated.
91-
- **Cross-backend validation.** `Integration/GraphStoreIntegrationTests.cs` exercises Postgres, Neo4j, and Cosmos (when available) through the shared `IGraphStore` abstraction.
89+
- **No fakes.** We removed the legacy fake Postgres store. Every graph operation in tests uses real services orchestrated by Testcontainers.
90+
- **Security coverage.** `Integration/PostgresGraphStoreIntegrationTests.cs` includes payloads that mimic SQL/Cypher injection attempts to ensure values remain literals and labels/types are strictly validated.
91+
- **Cross-backend validation.** `Integration/GraphStoreIntegrationTests.cs` exercises Postgres, Neo4j, and Cosmos (when available) through the shared `IGraphStore` abstraction.
9292
- **Workflow smoke tests.** Pipelines (e.g., `IndexingPipelineRunnerTests`) and finalization steps run end-to-end with the fixture-provisioned infrastructure.
9393

9494
---
9595

96+
## Indexing, Querying, and Prompt Tuning Alignment
97+
98+
The .NET port mirrors the [GraphRAG indexing architecture](https://microsoft.github.io/graphrag/index/overview/) and its query workflows so downstream applications retain parity with the Python reference implementation.
99+
100+
- **Indexing overview.** Workflows such as `extract_graph`, `create_communities`, and `community_summaries` map 1:1 to the [default data flow](https://microsoft.github.io/graphrag/index/default_dataflow/) and persist the same tables (`text_units`, `entities`, `relationships`, `communities`, `community_reports`, `covariates`). The new prompt template loader honours manual or auto-tuned prompts before falling back to the stock templates in `prompts/`.
101+
- **Query capabilities.** The query pipeline retains global search, local search, drift search, and question generation semantics described in the [GraphRAG query overview](https://microsoft.github.io/graphrag/query/overview/). Each orchestrator continues to assemble context from the indexed tables, so [global](https://microsoft.github.io/graphrag/query/global_search/) and [local](https://microsoft.github.io/graphrag/query/local_search/) search can run against the same index.
102+
- **Prompt tuning.** GraphRAG’s [manual](https://microsoft.github.io/graphrag/prompt_tuning/manual_prompt_tuning/) and [auto](https://microsoft.github.io/graphrag/prompt_tuning/auto_prompt_tuning/) strategies are surfaced through `GraphRagConfig.PromptTuning`. Store custom templates under `prompts/` or point `PromptTuning.Manual.Directory`/`PromptTuning.Auto.Directory` at your tuning outputs. Files follow the stage keys documented in `docs/indexing-and-query.md` (for example, `index/community_reports/system.txt` overrides the community summary system prompt). Templates can use placeholders such as `{{max_length}}`, `{{max_entities}}`, and `{{entities}}`.
103+
104+
See [`docs/indexing-and-query.md`](docs/indexing-and-query.md) for a deeper mapping between the .NET workflows and the research publications underpinning GraphRAG.
105+
106+
---
107+
96108
## Local Cosmos Testing
97109

98110
1. Install and start the [Azure Cosmos DB Emulator](https://learn.microsoft.com/azure/cosmos-db/local-emulator).

docs/indexing-and-query.md

Lines changed: 67 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,67 @@
1+
# Indexing, Querying, and Prompt Tuning in GraphRAG for .NET
2+
3+
GraphRAG for .NET keeps feature parity with the Python reference project described in the [Microsoft Research blog](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/) and the [GraphRAG paper](https://arxiv.org/pdf/2404.16130). This document explains how the .NET workflows map to the concepts documented on [microsoft.github.io/graphrag](https://microsoft.github.io/graphrag/), highlights the supported query modes, and shows how to customise prompts via manual or auto tuning outputs.
4+
5+
## Indexing Architecture
6+
7+
- **Workflow parity.** Each indexing stage matches the Python pipeline and the [default data flow](https://microsoft.github.io/graphrag/index/default_dataflow/):
8+
- `load_input_documents` → `create_base_text_units` → `summarize_descriptions`
9+
- `extract_graph` persists `entities` and `relationships`
10+
- `create_communities` produces `communities`
11+
- `community_summaries` writes `community_reports`
12+
- `extract_covariates` stores `covariates`
13+
- **Storage schema.** Tables share the column layout described under [index outputs](https://microsoft.github.io/graphrag/index/outputs/). The new strongly-typed records (`CommunityRecord`, `CovariateRecord`, etc.) mirror the JSON representation used by the Python implementation.
14+
- **Cluster configuration.** `GraphRagConfig.ClusterGraph` exposes the same knobs as the Python `cluster_graph` settings, enabling largest-component filtering and deterministic seeding.
15+
16+
## Query Capabilities
17+
18+
The query layer ports the orchestrators documented in the [GraphRAG query overview](https://microsoft.github.io/graphrag/query/overview/):
19+
20+
- **Global search** ([docs](https://microsoft.github.io/graphrag/query/global_search/)) traverses community summaries and graph context to craft answers spanning the corpus.
21+
- **Local search** ([docs](https://microsoft.github.io/graphrag/query/local_search/)) anchors on the entities most relevant to the query and their graph neighbourhood when you need focused context.
22+
- **DRIFT search** ([docs](https://microsoft.github.io/graphrag/query/drift_search/)) combines both approaches: it primes the query with community reports, then refines the answer through follow-up local searches.
23+
- **Question generation** ([docs](https://microsoft.github.io/graphrag/query/question_generation/)) produces follow-up questions to extend an investigation.
24+
25+
Every orchestrator consumes the same indexed tables as the Python project, so the .NET stack interoperates with BYOG scenarios described in the [index architecture guide](https://microsoft.github.io/graphrag/index/architecture/).
26+
27+
## Prompt Tuning
28+
29+
Manual and auto prompt tuning are both available without code changes:
30+
31+
1. **Manual overrides** follow the rules from [manual prompt tuning](https://microsoft.github.io/graphrag/prompt_tuning/manual_prompt_tuning/).
32+
- Place custom templates under a directory referenced by `GraphRagConfig.PromptTuning.Manual.Directory` and set `Enabled = true`.
33+
- Filenames follow the stage key pattern `section/workflow/kind.txt` (see table below).
34+
2. **Auto tuning** integrates the outputs documented in [auto prompt tuning](https://microsoft.github.io/graphrag/prompt_tuning/auto_prompt_tuning/).
35+
- Point `GraphRagConfig.PromptTuning.Auto.Directory` at the folder containing the generated prompts and set `Enabled = true`.
36+
- The runtime prefers explicit paths from workflow configs, then manual overrides, then auto-tuned files, and finally the built-in defaults in `prompts/`.
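
The precedence described above (explicit path, then manual override, then auto-tuned file, then built-in default) can be sketched as follows. This is a minimal illustration, not the repository's loader: `PromptResolutionSketch`, `ResolvePath`, and the parameter names are hypothetical, and the sketch assumes that a candidate only counts when the file actually exists on disk.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not the library's API): picks the first existing file
// in the documented order explicit -> manual -> auto -> built-in default.
public static class PromptResolutionSketch
{
    public static string? ResolvePath(
        string stageKey,          // e.g. "index/extract_graph/system.txt"
        string? explicitPath,     // path set directly on a workflow config, if any
        string? manualDirectory,  // pass null when manual tuning is disabled
        string? autoDirectory,    // pass null when auto tuning is disabled
        string? defaultPath)      // stock template shipped under prompts/
    {
        var candidates = new List<string?>
        {
            explicitPath,
            Combine(manualDirectory, stageKey),
            Combine(autoDirectory, stageKey),
            defaultPath,
        };

        foreach (var candidate in candidates)
        {
            // Assumption: a missing file silently falls through to the next source.
            if (!string.IsNullOrWhiteSpace(candidate) && File.Exists(candidate))
            {
                return candidate;
            }
        }

        return null; // nothing found; the caller decides how to react
    }

    private static string? Combine(string? directory, string stageKey) =>
        string.IsNullOrWhiteSpace(directory) ? null : Path.Combine(directory, stageKey);
}
```

Passing `null` for a disabled strategy keeps the `Enabled` check outside the resolver, so the lookup order stays a plain ordered list that is easy to test against a temporary directory.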
37+
38+
### Stage Keys and Placeholders
39+
40+
| Workflow | Stage key | Purpose | Supported placeholders |
41+
|----------|-----------|---------|------------------------|
42+
| `extract_graph` (system) | `index/extract_graph/system.txt` | System prompt that instructs the extractor. | _N/A_ |
43+
| `extract_graph` (user) | `index/extract_graph/user.txt` | User prompt template for individual text units. | `{{max_entities}}`, `{{text}}` |
44+
| `community_summaries` (system) | `index/community_reports/system.txt` | System guidance for cluster summarisation. | _N/A_ |
45+
| `community_summaries` (user) | `index/community_reports/user.txt` | User prompt template for entity lists. | `{{max_length}}`, `{{entities}}` |
46+
47+
Placeholders are replaced at runtime with values drawn from workflow configuration:
48+
49+
- `{{max_entities}}` → `ExtractGraphConfig.EntityTypes.Count + 5` (minimum 1)
50+
- `{{text}}` → the original text unit content
51+
- `{{max_length}}` → `CommunityReportsConfig.MaxLength`
52+
- `{{entities}}` → bullet list of entity titles and descriptions
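
The substitutions listed above can be illustrated with a small renderer. This is a sketch under stated assumptions rather than the repository's implementation: the class and method names are hypothetical, unknown placeholders are assumed to be left in place, and the `- title: description` bullet format is a guess at what "bullet list of entity titles and descriptions" looks like.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not the library's API) for {{placeholder}} templates.
public static class PromptTemplateSketch
{
    private static readonly Regex Placeholder = new Regex(@"\{\{(\w+)\}\}");

    // Single-pass replacement: inserted values are never scanned again, so text
    // that itself contains "{{...}}" cannot trigger a second substitution.
    public static string Render(string template, IReadOnlyDictionary<string, string> values) =>
        Placeholder.Replace(
            template,
            match => values.TryGetValue(match.Groups[1].Value, out var value) ? value : match.Value);

    // Mirrors the documented rule: entity type count + 5, never below 1.
    public static int MaxEntities(int entityTypeCount) => Math.Max(1, entityTypeCount + 5);

    // Assumed bullet format; the real layout is not shown in the document.
    public static string FormatEntities(IEnumerable<(string Title, string Description)> entities) =>
        string.Join("\n", entities.Select(e => $"- {e.Title}: {e.Description}"));
}
```

A regular expression is used instead of chained `string.Replace` calls so that the content of a text unit cannot accidentally be treated as a template.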
53+
54+
If a template is omitted, the runtime falls back to the built-in prompts stored under `prompts/` and bundled with the repository.
55+
56+
## Integration Tests
57+
58+
`tests/ManagedCode.GraphRag.Tests/Integration/CommunitySummariesIntegrationTests.cs` exercises the new prompt loader end-to-end using the file-backed pipeline storage. Combined with the existing Aspire-powered suites, the tests demonstrate how indexing, community detection, and summarisation behave with tuned prompts while remaining faithful to the [GraphRAG BYOG guidance](https://microsoft.github.io/graphrag/index/byog/).
59+
60+
## Further Reading
61+
62+
- [GraphRAG prompt tuning overview](https://microsoft.github.io/graphrag/prompt_tuning/overview/)
63+
- [GraphRAG index methods](https://microsoft.github.io/graphrag/index/methods/)
64+
- [GraphRAG query overview](https://microsoft.github.io/graphrag/query/overview/)
65+
- [GraphRAG default dataflow](https://microsoft.github.io/graphrag/index/default_dataflow/)
66+
67+
These resources underpin the .NET implementation and provide broader context for customising or extending the library.

prompts/community_graph.txt

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
You are an investigative analyst. Produce concise, neutral summaries that describe the shared theme binding the supplied entities.
2+
Highlight how they relate, why the cluster matters, and any notable signals the reader should know. Do not invent facts.

prompts/community_text.txt

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
Summarise the key theme that connects the following entities in no more than {{max_length}} characters. Focus on what unites them and why the group matters. Avoid bullet lists.
2+
3+
Entities:
4+
{{entities}}
5+
6+
Provide a single paragraph answer.
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
You are a precise information extraction engine. Analyse the supplied text and return structured JSON describing:
2+
- distinct entities (people, organisations, locations, products, events, concepts, technologies, dates, other)
3+
- relationships between those entities
4+
5+
Rules:
6+
- Only use information explicitly stated or implied in the text.
7+
- Prefer short, human-readable titles.
8+
- Use snake_case relationship types (e.g., "works_with", "located_in").
9+
- Always return valid JSON adhering to the response schema.
Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
Extract up to {{max_entities}} of the most important entities and their relationships from the following text.
2+
3+
Text (between <BEGIN_TEXT> and <END_TEXT> markers):
4+
<BEGIN_TEXT>
5+
{{text}}
6+
<END_TEXT>
7+
8+
Respond with JSON matching this schema:
9+
{
10+
"entities": [
11+
{
12+
"title": "string",
13+
"type": "person | organization | location | product | event | concept | technology | date | other",
14+
"description": "short description",
15+
"confidence": 0.0 - 1.0
16+
}
17+
],
18+
"relationships": [
19+
{
20+
"source": "entity title",
21+
"target": "entity title",
22+
"type": "relationship_type",
23+
"description": "short description",
24+
"weight": 0.0 - 1.0,
25+
"bidirectional": true | false
26+
}
27+
]
28+
}
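
A response shaped like the schema above could be read back with `System.Text.Json`. The following is a minimal sketch: the DTO and parser names are hypothetical and are not taken from the repository, and returning `null` on malformed JSON is an assumed policy, not documented behaviour.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical DTOs shaped after the prompt's response schema.
public sealed class ExtractedEntity
{
    [JsonPropertyName("title")] public string Title { get; set; } = "";
    [JsonPropertyName("type")] public string Type { get; set; } = "other";
    [JsonPropertyName("description")] public string? Description { get; set; }
    [JsonPropertyName("confidence")] public double Confidence { get; set; }
}

public sealed class ExtractedRelationship
{
    [JsonPropertyName("source")] public string Source { get; set; } = "";
    [JsonPropertyName("target")] public string Target { get; set; } = "";
    [JsonPropertyName("type")] public string Type { get; set; } = "";
    [JsonPropertyName("description")] public string? Description { get; set; }
    [JsonPropertyName("weight")] public double Weight { get; set; }
    [JsonPropertyName("bidirectional")] public bool Bidirectional { get; set; }
}

public sealed class ExtractionResponse
{
    [JsonPropertyName("entities")] public List<ExtractedEntity> Entities { get; set; } = new();
    [JsonPropertyName("relationships")] public List<ExtractedRelationship> Relationships { get; set; } = new();
}

public static class ExtractionParserSketch
{
    // Returns null when the model output is not valid JSON for this shape,
    // so the caller can retry or skip the text unit.
    public static ExtractionResponse? TryParse(string json)
    {
        try
        {
            return JsonSerializer.Deserialize<ExtractionResponse>(json);
        }
        catch (JsonException)
        {
            return null;
        }
    }
}
```

Explicit `JsonPropertyName` attributes keep the lowercase keys from the prompt independent of any serializer naming policy.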

src/ManagedCode.GraphRag/Community/CommunityBuilder.cs

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -48,10 +48,7 @@ public static IReadOnlyList<CommunityRecord> Build(
4848
while (queue.Count > 0)
4949
{
5050
var current = queue.Dequeue();
51-
if (!component.Contains(current))
52-
{
53-
component.Add(current);
54-
}
51+
component.Add(current);
5552

5653
if (!adjacency.TryGetValue(current, out var neighbors) || neighbors.Count == 0)
5754
{

src/ManagedCode.GraphRag/Config/ExtractGraphConfig.cs

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,8 +4,11 @@ public sealed class ExtractGraphConfig
44
{
55
public string ModelId { get; set; } = "default_chat_model";
66

7+
public string? SystemPrompt { get; set; }
8+
= "prompts/index/extract_graph.system.txt";
9+
710
public string? Prompt { get; set; }
8-
= "prompts/index/extract_graph.txt";
11+
= "prompts/index/extract_graph.user.txt";
912

1013
public List<string> EntityTypes { get; set; } = new() { "person", "organization", "location" };
1114

src/ManagedCode.GraphRag/Config/GraphRagConfig.cs

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -54,6 +54,8 @@ public sealed class GraphRagConfig
5454

5555
public CommunityReportsConfig CommunityReports { get; set; } = new();
5656

57+
public PromptTuningConfig PromptTuning { get; set; } = new();
58+
5759
public SnapshotsConfig Snapshots { get; set; } = new();
5860

5961
public Dictionary<string, object?> Extensions { get; set; } = new(StringComparer.OrdinalIgnoreCase);
Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
namespace GraphRag.Config;
2+
3+
public sealed class PromptTuningConfig
4+
{
5+
public ManualPromptTuningConfig Manual { get; set; } = new();
6+
7+
public AutoPromptTuningConfig Auto { get; set; } = new();
8+
}
9+
10+
public sealed class ManualPromptTuningConfig
11+
{
12+
public bool Enabled { get; set; }
13+
14+
public string? Directory { get; set; }
15+
}
16+
17+
public sealed class AutoPromptTuningConfig
18+
{
19+
public bool Enabled { get; set; }
20+
21+
public string? Directory { get; set; }
22+
23+
public string? Strategy { get; set; }
24+
}
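
As a usage illustration, the configuration classes above might be populated as shown below. The three config classes are copied from this file so the snippet compiles on its own. `PromptTuningUsageSketch` and its methods are hypothetical, and the rule that a directory only applies when `Enabled` is true and the path is non-empty is an assumption drawn from the documentation's "set `Enabled = true`" instruction.

```csharp
#nullable enable
using System;

// Copied from the file above so the sketch is self-contained.
public sealed class PromptTuningConfig
{
    public ManualPromptTuningConfig Manual { get; set; } = new();
    public AutoPromptTuningConfig Auto { get; set; } = new();
}

public sealed class ManualPromptTuningConfig
{
    public bool Enabled { get; set; }
    public string? Directory { get; set; }
}

public sealed class AutoPromptTuningConfig
{
    public bool Enabled { get; set; }
    public string? Directory { get; set; }
    public string? Strategy { get; set; }
}

// Hypothetical helpers, not part of the repository.
public static class PromptTuningUsageSketch
{
    // Enables manual overrides only; Auto keeps its defaults (Enabled = false).
    public static PromptTuningConfig ManualOnly(string directory) => new PromptTuningConfig
    {
        Manual = new ManualPromptTuningConfig { Enabled = true, Directory = directory },
    };

    // Assumption: a directory is honoured only when its strategy is enabled
    // and the path is not blank.
    public static string? ActiveDirectory(bool enabled, string? directory) =>
        enabled && !string.IsNullOrWhiteSpace(directory) ? directory : null;
}
```

Because `Enabled` is a plain `bool`, both strategies are off until explicitly switched on, so setting `Directory` alone would have no effect under this assumption.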
