# GraphRAG .NET Porting Plan

This working note documents the mapping between the Python implementation that lives in `submodules/graphrag-python` and the forthcoming .NET port. It exists purely as a checklist for the migration effort and will be removed once parity has been achieved.

## High-Level Architecture

- **Configuration** – `GraphRagConfig` and companion models will be introduced under `GraphRag.Config`. They mirror the Pydantic models (`graphrag.config.models`) and keep JSON/YAML compatibility with the original schema.
- **Indexing Pipeline** – `GraphRag.Indexing` provides:
  - `PipelineBuilder`, `PipelineRunContext`, `PipelineRunResult`, and `WorkflowDelegate`.
  - Workflow implementations translated from `graphrag.index.workflows.*`.
  - Operation helpers from `graphrag.index.operations.*`, rewritten against .NET primitives (`List<T>`, `ImmutableArray<T>`, and `DataFrame` where necessary).
- **Query Pipeline** – `GraphRag.Query` mirrors `graphrag.query.*` with orchestrators for question generation, context assembly, and answer synthesis.
- **Storage** – `GraphRag.Storage` offers a provider model equivalent to `PipelineStorage` (file, memory, Blob, Cosmos). A JSON-backed table serializer is in place while the Parquet implementation is ported.
- **Language Models & Tokenizers** – `GraphRag.LanguageModel` wraps Azure OpenAI/LiteLLM equivalents. Configuration, retry, and rate-limiting concepts are ported.
- **Vector Stores** – `GraphRag.VectorStores` provides adapters for local FAISS-like embeddings, Azure Cognitive Search, and Postgres pgvector, matching the Python `vector_stores` package.
- **Callbacks & Telemetry** – `GraphRag.Callbacks` contains workflow lifecycle hooks, tracing, and instrumentation mirroring `WorkflowCallbacks`.

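The indexing runtime pieces listed above can be sketched as follows. The type names (`PipelineBuilder`, `PipelineRunContext`, `PipelineRunResult`, `WorkflowDelegate`) come from this plan, but every signature and member below is an assumption about how the port might shape them, not a committed API.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Placeholder: in the real port this would carry storage, callbacks, and
// shared state between workflows.
public sealed class PipelineRunContext { }

// A workflow is an async step that may emit an output table (or nothing).
public delegate Task<object?> WorkflowDelegate(
    PipelineRunContext context, CancellationToken cancellationToken);

// One entry per executed workflow, mirroring the Python PipelineRunResult.
public sealed record PipelineRunResult(
    string Workflow, object? Result, IReadOnlyList<Exception> Errors);

public sealed class PipelineBuilder
{
    private readonly List<(string Name, WorkflowDelegate Run)> _workflows = new();

    // Workflows execute in registration order, like the Python workflow list.
    public PipelineBuilder AddWorkflow(string name, WorkflowDelegate run)
    {
        _workflows.Add((name, run));
        return this;
    }

    // Stream one result per workflow; a failure is captured rather than
    // aborting the run, so callbacks can observe every step.
    public async IAsyncEnumerable<PipelineRunResult> RunAsync(
        PipelineRunContext context,
        [System.Runtime.CompilerServices.EnumeratorCancellation] CancellationToken ct = default)
    {
        foreach (var (name, run) in _workflows)
        {
            object? result = null;
            var errors = new List<Exception>();
            try { result = await run(context, ct); }
            catch (Exception ex) { errors.Add(ex); }
            yield return new PipelineRunResult(name, result, errors);
        }
    }
}
```

Whether errors should be captured per workflow or should fail the whole run is an open design question for the port; the sketch shows the capture-and-continue variant.
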
## Data Model Mapping

| Python Table | Python Module | .NET Type | Notes |
|--------------|---------------|-----------|-------|
| `documents` | `index/workflows/create_final_documents.py` | `DocumentRecord` | Stored as Parquet; includes metadata dictionary. |
| `text_units` | `index/workflows/create_base_text_units.py` | `TextUnitRecord` | Chunk metadata + document ids. |
| `entities` | `index/workflows/extract_graph.py` | `EntityRecord` | Already partially ported; will be extended with raw view support. |
| `relationships` | `index/workflows/extract_graph.py` | `RelationshipRecord` | Already present; to be aligned with the Python schema. |
| `communities` | `index/workflows/create_communities.py` | `CommunityRecord` | Requires a Louvain modularity implementation. |
| `community_reports` | `index/workflows/create_community_reports.py` | `CommunityReportRecord` | Needs summarization prompts and structured output. |
| `covariates` | `index/workflows/extract_covariates.py` | `CovariateRecord` | Includes temporal fields and subject/object ids. |

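To make the table-to-record mapping concrete, a `TextUnitRecord` might look like the sketch below. The property set is inferred from the Python `create_base_text_units` output (chunk text, token count, parent document ids); treat the exact names and types as assumptions until the schema work lands.

```csharp
using System.Collections.Generic;

// Assumed shape only: columns inferred from the Python text_units table;
// the final port may add fields (e.g. entity/relationship id lists).
public sealed record TextUnitRecord(
    string Id,                           // stable id of the chunk
    string Text,                         // chunk text
    int NTokens,                         // token count of the chunk
    IReadOnlyList<string> DocumentIds);  // documents this chunk was cut from
```

Positional records give value equality for free, which is convenient when diffing .NET output tables against the Python golden datasets.
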
## Testing Strategy

- Translate the Python unit/integration suites under `submodules/graphrag-python/tests`.
- Use xUnit with Aspire-powered fixtures (Neo4j, Postgres, Cosmos emulator) to run end-to-end indexing + query scenarios.
- For LLM-dependent steps, rely on configurable providers with live credentials; tests are skipped only when mandatory environment variables are absent.
- Golden datasets from `tests/fixtures` are copied into the .NET test resources to validate data transformations.

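The "skip only when mandatory environment variables are absent" rule can be implemented with a small custom fact attribute. The attribute name, variable name, and test below are placeholders for illustration, not committed API.

```csharp
using System;
using Xunit;

// Marks a test that requires a live credential; when the environment
// variable is missing, the test is reported as skipped rather than failed.
public sealed class RequiresEnvFactAttribute : FactAttribute
{
    public RequiresEnvFactAttribute(string variable)
    {
        if (string.IsNullOrEmpty(Environment.GetEnvironmentVariable(variable)))
            Skip = $"Environment variable '{variable}' is not set.";
    }
}

public class IndexingSmokeTests
{
    [RequiresEnvFact("AZURE_OPENAI_API_KEY")]   // placeholder variable name
    public void Pipeline_runs_end_to_end_with_live_model()
    {
        // end-to-end indexing + query scenario would go here
    }
}
```
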
## Immediate TODOs

1. Implement the configuration model layer (`GraphRag.Config`).
2. Port the pipeline runtime (`GraphRag.Indexing.Runtime`), including the callback chain, run loop, and benchmarking.
3. Recreate the storage adapters (File, Memory) and the Parquet serializer.
4. Start translating workflows, beginning with ingestion (`load_input_documents`, `create_base_text_units`, `create_final_documents`).
5. Migrate the vector store and embedding interfaces and integrate them into the indexing pipeline.
6. Recreate the query orchestrator and evaluation pipelines.
7. Port tests iteratively, ensuring coverage parity with Python.

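For the configuration work in TODO 1, JSON/YAML compatibility means a settings file written for the Python implementation should load unchanged in .NET. The fragment below is illustrative only; the key names loosely follow the Python graphrag settings schema and should be verified against `graphrag.config.models` during the port.

```yaml
# Illustrative only: key names to be verified against graphrag.config.models.
llm:
  type: azure_openai_chat
  api_key: ${GRAPHRAG_API_KEY}
  model: gpt-4o
chunks:
  size: 1200
  overlap: 100
storage:
  type: file
  base_dir: output
```
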
> This file is intentionally temporary; it guides the phased port while the codebase is in flux.