Commit 499d47d

refactor converter pipeline and harden azure video indexer flow
1 parent 82a3462 commit 499d47d

99 files changed

Lines changed: 5595 additions & 671 deletions

AGENTS.md

Lines changed: 167 additions & 62 deletions
Large diffs are not rendered by default.

Directory.Build.props

Lines changed: 2 additions & 2 deletions
@@ -22,8 +22,8 @@
     <PackageLicenseExpression>MIT</PackageLicenseExpression>
     <PackageReadmeFile>README.md</PackageReadmeFile>
     <Product>Managed Code - MarkItDown</Product>
-    <Version>10.0.2</Version>
-    <PackageVersion>10.0.2</PackageVersion>
+    <Version>10.0.3</Version>
+    <PackageVersion>10.0.3</PackageVersion>
   </PropertyGroup>
 
   <PropertyGroup Condition="'$(GITHUB_ACTIONS)' == 'true'">

README.md

Lines changed: 34 additions & 1 deletion
@@ -416,6 +416,7 @@ The `AzureIntelligenceOptions`, `GoogleIntelligenceOptions`, and `AwsIntelligenc
 
 - **Managed identity**: omit the `ApiKey`/`ArmAccessToken` properties and the providers automatically fall back to `DefaultAzureCredential`. Assign the managed identity the *Cognitive Services User* role for Document Intelligence and Vision, and follow the [Video Indexer managed identity instructions](https://learn.microsoft.com/azure/azure-video-indexer/video-indexer-use-azure-ad) to authorize uploads.
 - **Video Indexer tips**: Video uploads require both the Video Indexer account (ID + region) and either the full resource ID or the trio of subscription id/resource group/account name, plus an ARM token or Azure AD identity with `Contributor` access on the Video Indexer resource. The interactive CLI exposes dedicated prompts for these values under “Configure cloud providers”.
+- **Video Indexer polling controls**: `AzureMediaIntelligenceOptions` supports `PollingInterval` and `MaxProcessingTime` to control how long conversion waits for Azure Video Indexer processing.
 
 ```csharp
 var azureOptions = new AzureIntelligenceOptions
@@ -432,11 +433,43 @@ The `AzureIntelligenceOptions`, `GoogleIntelligenceOptions`, and `AwsIntelligenc
     {
         AccountId = "<video-indexer-account-id>",
         AccountName = "<video-indexer-account-name>",
-        Location = "trial"
+        Location = "eastus",
+        ResourceId = "/subscriptions/<subscription-guid>/resourcegroups/<resource-group>/providers/Microsoft.VideoIndexer/accounts/<account-name>/",
+        ArmAccessToken = "<video-indexer-arm-token>",
+        PollingInterval = TimeSpan.FromSeconds(10),
+        MaxProcessingTime = TimeSpan.FromMinutes(15)
     }
 };
 ```
 
+#### Azure Video Indexer quick-start checklist
+
+1. Create/identify a Video Indexer account in Azure and copy:
+   - `AccountId`
+   - `Location` (for example `eastus`)
+   - full `ResourceId`
+2. Get an ARM access token for Video Indexer (or configure managed identity with proper access).
+3. Set `AzureIntelligenceOptions.Media` with those values.
+4. Convert an `.mp4` with `MediaTranscriptionRequest(PreferredProvider: Azure)` and verify the result contains:
+   - `### Video Transcript` with time ranges and speaker metadata
+   - `### Video Analysis` with sentiment/topics/keywords and Video Indexer state metadata
+
+#### Live integration test credentials (safe defaults)
+
+The live test `VideoIndexer_MarkItDownClient_LiveMp4ToMarkdown` in
+`tests/MarkItDown.Tests/Intelligence/Integration/AzureIntelligenceIntegrationTests.cs`
+uses hardcoded placeholders by default:
+
+```csharp
+private const string HardcodedVideoIndexerArmAccessToken = "TOKEN";
+private const string HardcodedVideoIndexerAccountId = "ACCOUNT_GUID";
+private const string HardcodedVideoIndexerResourceId =
+    "/subscriptions/SUBSCRIPTION-GUID/resourcegroups/AzureAI/providers/Microsoft.VideoIndexer/accounts/ACCOUNT_NAME/";
+```
+
+When placeholders are present, that test exits early (no external call), so CI/local runs stay green without secrets.
+To execute the real live path, replace those placeholders with valid values.
+
 #### Google Cloud setup
 
 - **Docs**: [Document AI](https://cloud.google.com/document-ai/docs), [Vision API](https://cloud.google.com/vision/docs), [Speech-to-Text](https://cloud.google.com/speech-to-text/docs).
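
The `PollingInterval` and `MaxProcessingTime` options added to the README in this commit bound how long a conversion waits for Azure Video Indexer. The diff does not show how the provider implements that wait, so the following is only a minimal sketch of what a bounded polling loop of this kind could look like. `PollingSketch`, `WaitForTerminalStateAsync`, the `getState` delegate, and the state strings `Processed` and `Failed` are assumptions made for illustration and are not taken from the MarkItDown source.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of MarkItDown.
public static class PollingSketch
{
    // Calls getState every pollingInterval until it reports a terminal state
    // ("Processed" or "Failed", assumed here) or until waiting one more
    // interval would exceed maxProcessingTime.
    public static async Task<string> WaitForTerminalStateAsync(
        Func<CancellationToken, Task<string>> getState,
        TimeSpan pollingInterval,
        TimeSpan maxProcessingTime,
        CancellationToken cancellationToken = default)
    {
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            var state = await getState(cancellationToken).ConfigureAwait(false);
            if (state is "Processed" or "Failed")
            {
                return state;
            }

            // Give up before sleeping if the next poll would land past the budget.
            if (stopwatch.Elapsed + pollingInterval > maxProcessingTime)
            {
                throw new TimeoutException(
                    $"Video was still '{state}' after {stopwatch.Elapsed}.");
            }

            await Task.Delay(pollingInterval, cancellationToken).ConfigureAwait(false);
        }
    }
}
```

With the README values (`TimeSpan.FromSeconds(10)` and `TimeSpan.FromMinutes(15)`), a loop like this would query the state at most about 90 times before raising `TimeoutException`. Surfacing the timeout as an exception rather than returning a partial result is a design choice of this sketch only.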
Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
+# ADR-0001: Disk-First Workspace Pipeline
+
+Status: Implemented
+Date: 2026-02-19
+Related Features: `docs/Features/disk-first-conversion-pipeline.md`, `docs/Features/structured-docx-pdf-conversion.md`
+Supersedes: none
+Superseded by: none
+
+---
+
+## Implementation plan (step-by-step)
+
+- [x] Analyze existing conversion source materialization and workspace persistence
+- [x] Record decision and trade-offs for disk-first contract
+- [x] Map decision invariants to tests/docs
+- [x] Link architecture and feature documentation
+
+---
+
+## Context
+
+- The project converts potentially large documents (PDF/Office/archives/media) and must avoid memory pressure and hidden buffering.
+- Existing project rules explicitly prohibit `MemoryStream`-based conversion paths.
+- Conversion also needs stable artifact persistence and deterministic cleanup.
+- Goal: enforce a file-backed conversion path so all converters/middleware operate on consistent workspace resources.
+- Non-goal: replacing every in-memory helper in unrelated utility paths outside conversion flow.
+
+---
+
+## Stakeholders (who needs this to be clear)
+
+| Role | What they need to know | Questions this ADR must answer |
+| --- | --- | --- |
+| Product / Owner | Stable conversion for large files | Can large files be processed reliably? |
+| Engineering | Required pipeline invariants | Where must disk materialization happen? |
+| DevOps / SRE | Temp storage/cleanup implications | How do we avoid leaked workspaces? |
+| QA | Testable behavior guarantees | Which tests prove disk-first behavior? |
+
+---
+
+## Decision
+
+The conversion pipeline uses disk-backed workspace materialization as the primary contract for source and artifact handling.
+
+Key points:
+
+- Converters materialize input via shared base abstractions before extraction.
+- Artifacts are persisted through workspace/storage abstractions and surfaced in metadata.
+- Conversion result composition happens after extraction/middleware on persisted resources.
+
+---
+
+## Diagram
+
+```mermaid
+flowchart LR
+    A["Input stream/path/url"] --> B["Materialize source to workspace"]
+    B --> C["Converter extraction"]
+    C --> D["Persist artifacts"]
+    D --> E["Middleware enrichment"]
+    E --> F["Markdown composition"]
+    F --> G["Result + workspace metadata"]
+```
+
+---
+
+## Alternatives considered
+
+### Option A: In-memory buffering during conversion
+
+- Pros: fewer temp files, simpler initial implementation
+- Cons: poor behavior on large payloads, higher GC/LOH pressure, unstable memory profile
+- Rejected because: violates repository rule and does not scale safely.
+
+### Option B: Hybrid memory-first with fallback spill-to-disk
+
+- Pros: can be fast for tiny files
+- Cons: hidden mode switches, inconsistent behavior, harder diagnostics
+- Rejected because: increases complexity and invites silent fallback behavior.
+
+---
+
+## Consequences
+
+### Positive
+
+- Predictable resource profile for large conversions.
+- Consistent artifact paths for downstream enrichment/debugging.
+- Easier enforcement of workspace persistence rules.
+
+### Negative / risks
+
+- More disk I/O and temporary file management complexity.
+- Risk of stale workspace data if cleanup paths break.
+- Mitigation: centralize workspace lifecycle and test factory/workspace behavior.
+
+---
+
+## Impact
+
+### Code
+
+- Affected modules / services: `Core`, `Conversion`, converter base classes.
+- New boundaries / responsibilities: converter extraction assumes persisted sources.
+- Feature flags / toggles: pipeline/workspace options in `MarkItDownOptions` and `ConversionRequest`.
+
+### Data / configuration
+
+- Data model / schema changes: none.
+- Config changes: workspace/storage options determine artifact persistence location.
+- Backwards compatibility strategy: keep public conversion APIs unchanged.
+
+### Documentation
+
+- Feature docs to update: disk-first and structured conversion feature specs.
+- Testing docs to update: conversion and workspace tests mapping.
+- Architecture docs to update: module/contracts map in architecture overview.
+- `docs/Architecture/Overview.md` updates: include workspace module and links.
+- Notes for `AGENTS.md`: keep explicit prohibition of memory-stream conversion paths.
+
+---
+
+## Verification
+
+### Objectives
+
+- Prove conversion entry points work with file-backed materialization.
+- Prove workspace factory persists artifacts and handles path policies.
+- Prove failures still return actionable conversion errors.
+
+### Test environment
+
+- Environment: local .NET SDK and in-repo test assets.
+- Data/reset strategy: deterministic fixture files and generated catalog.
+- External dependencies: not required for core disk-first behavior tests.
+
+### Test commands
+
+- build: `dotnet build MarkItDown.slnx`
+- test: `dotnet test MarkItDown.slnx`
+- format: `dotnet format MarkItDown.slnx`
+- coverage: `dotnet test MarkItDown.slnx --collect:"XPlat Code Coverage"`
+
+### New or changed tests
+
+| ID | Scenario | Level (Unit / Int / API / UI) | Expected result | Notes / Data |
+| --- | --- | --- | --- | --- |
+| TST-001 | Workspace factory creates and resolves artifact directories | Integration | Files/artifacts persisted with expected policy | `tests/MarkItDown.Tests/Conversion/ArtifactWorkspaceFactoryTests.cs` |
+| TST-002 | Non-seekable stream conversion still succeeds | Integration | Pipeline handles buffered disk path | `tests/MarkItDown.Tests/MarkItDownTests.cs` |
+
+### Regression and analysis
+
+- Regression suites: `tests/MarkItDown.Tests/MarkItDownIntegrationTests.cs`, converter suites.
+- Static analysis: analyzer-enforced build in CI/local.
+- Monitoring during rollout: conversion failure counters and duration telemetry.
+
+---
+
+## Rollout and migration
+
+- Migration steps: keep converters aligned with base materialization pattern.
+- Backwards compatibility: no public API break required.
+- Rollback: revert converter/base changes that violate prior behavior.
+
+---
+
+## References
+
+- `docs/DocumentProcessingPipeline.md`
+- `src/MarkItDown/Converters/Base/DocumentPipelineConverterBase.cs`
+- `src/MarkItDown/Conversion/ArtifactWorkspaceFactory.cs`
+- `tests/MarkItDown.Tests/Conversion/ArtifactWorkspaceFactoryTests.cs`
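
The ADR above requires that sources be materialized to a disk-backed workspace before extraction and that cleanup be deterministic. The actual `ArtifactWorkspaceFactory` and `DocumentPipelineConverterBase` code is not part of the visible diff, so the following is only a rough sketch of the idea under stated assumptions. `TempWorkspace`, `RootPath`, and `MaterializeAsync` are hypothetical names and do not reflect the repository's real API.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical workspace type for illustration; not the repository's implementation.
public sealed class TempWorkspace : IDisposable
{
    public string RootPath { get; }

    public TempWorkspace()
    {
        // Each workspace gets its own directory under the system temp path.
        RootPath = Path.Combine(Path.GetTempPath(), "workspace-" + Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(RootPath);
    }

    // Copies the source stream (seekable or not) straight to a file in the
    // workspace, chunk by chunk, without holding the whole payload in memory.
    public async Task<string> MaterializeAsync(
        Stream source, string fileName, CancellationToken cancellationToken = default)
    {
        // Path.GetFileName drops any directory part so the file stays inside RootPath.
        var path = Path.Combine(RootPath, Path.GetFileName(fileName));
        await using var target = new FileStream(
            path, FileMode.CreateNew, FileAccess.Write, FileShare.None,
            bufferSize: 81920, useAsync: true);
        await source.CopyToAsync(target, cancellationToken).ConfigureAwait(false);
        return path;
    }

    // Deterministic cleanup: disposing the workspace removes everything it created.
    public void Dispose()
    {
        if (Directory.Exists(RootPath))
        {
            Directory.Delete(RootPath, recursive: true);
        }
    }
}
```

A caller would wrap the conversion in `using var workspace = new TempWorkspace();` and hand the path returned by `MaterializeAsync` to the extraction step. Tying cleanup to `Dispose` is one way to address the stale-workspace risk listed under "Negative / risks"; the repository may centralize the lifecycle differently.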