| 1 | +# Foundry Quota Preflight |
| 2 | + |
| 3 | +> **Applies to**: Plan Forge v2.92.1-dev+ |
| 4 | +> **Source**: Phase-FOUNDRY-QUOTA-PREFLIGHT (enterprise-fleet-readiness.md §11.6) |
| 5 | + |
| 6 | +Before Plan Forge sends tokens to your Azure OpenAI / Azure AI Foundry deployment, it can |
| 7 | +check the deployment's tokens-per-minute (TPM) capacity and compare it against the slice's |
| 8 | +token estimate. This **quota preflight** keeps plan execution from hitting a rate-limit wall mid-run. |
| 9 | + |
| 10 | +--- |
| 11 | + |
| 12 | +## How It Works |
| 13 | + |
| 14 | +1. `forge_run_plan` calls `getDeploymentQuota()` at the start of each slice (before the |
| 15 | + worker is dispatched). |
| 16 | +2. The result is passed to `compareSliceEstimate()`, which classifies headroom as |
| 17 | + **safe / warning / critical / unknown**. |
| 18 | +3. A `[foundry-quota]` annotation is injected into the slice log. If the status is |
| 19 | + `critical`, the orchestrator emits a `quota-warning` event and, when |
| 20 | + `PFORGE_FOUNDRY_QUOTA_PREFLIGHT=block` is set, halts execution with an actionable error. |
| 21 | + |
| 22 | +``` |
| 23 | +[foundry-quota] safe — 68.3% headroom (eastus-prod-gpt-4.1). |
| 24 | +Cap=100,000 tpm, used=0 tpm, slice est=31,700 tokens. |
| 25 | +``` |
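The numbers in the annotation above are consistent with subtracting current usage and the slice estimate from the cap, then dividing by the cap. A minimal sketch of that arithmetic, assuming this is the formula in use; `headroomPercent` and `formatHeadroom` are hypothetical names for illustration, not functions exported by `foundry-quota.mjs`:

```typescript
// Hypothetical helper: percentage of TPM capacity left after current usage
// and the slice estimate are subtracted. Returns null when the capacity is
// missing or not positive, because no percentage can be computed.
function headroomPercent(
  tpmCapacity: number | null,
  tpmUsage: number | null,
  sliceEstimate: number,
): number | null {
  if (tpmCapacity === null || tpmCapacity <= 0) return null;
  const used = tpmUsage ?? 0; // assumption: unreported usage is treated as 0
  return ((tpmCapacity - used - sliceEstimate) / tpmCapacity) * 100;
}

// Hypothetical formatter that mirrors the numeric part of the log line above.
function formatHeadroom(cap: number, used: number, est: number): string {
  const pct = headroomPercent(cap, used, est);
  const shown = pct === null ? "n/a" : `${pct.toFixed(1)}%`;
  const n = (v: number) => v.toLocaleString("en-US");
  return `${shown} headroom. Cap=${n(cap)} tpm, used=${n(used)} tpm, slice est=${n(est)} tokens.`;
}

console.log(formatHeadroom(100000, 0, 31700));
```

With the example values (cap 100,000, usage 0, estimate 31,700) the sketch yields 68.3 % headroom, matching the annotation. A negative result means the slice would exceed the cap.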
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | +## Prerequisites |
| 30 | + |
| 31 | +- An Azure OpenAI Service or Azure AI Foundry deployment already configured per |
| 32 | + `docs/integrations/byo-azure-openai.md`. |
| 33 | +- A credential that satisfies `credential.getToken("https://management.azure.com/.default")`: |
| 34 | + - **Entra / Managed Identity** — set `AZURE_AUTH_MODE=entra` (requires `@azure/identity`). |
| 35 | + - **Service Principal** — `AZURE_AUTH_MODE=managed-identity` with env vars |
| 36 | + `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`. |
| 37 | + |
| 38 | +> **Required Azure RBAC role**: The identity used must hold the |
| 39 | +> **Cognitive Services Usages Reader** role (built-in) on the Azure OpenAI account or its |
| 40 | +> resource group. This role grants read-only access to the control-plane quota endpoint |
| 41 | +> (`Microsoft.CognitiveServices/accounts/deployments/read`) without allowing any data-plane |
| 42 | +> or model-serving operations. |
| 43 | + |
| 44 | +--- |
| 45 | + |
| 46 | +## Activation |
| 47 | + |
| 48 | +### Warn-only mode (default) |
| 49 | + |
| 50 | +Set the feature flag — quota checks run, headroom is logged, but execution never blocks: |
| 51 | + |
| 52 | +```bash |
| 53 | +export PFORGE_FOUNDRY_QUOTA_PREFLIGHT=warn # any other non-empty value except "block", "false", or "off" also means warn |
| 54 | +``` |
| 55 | + |
| 56 | +Or in `.forge/secrets.json`: |
| 57 | + |
| 58 | +```json |
| 59 | +{ |
| 60 | + "PFORGE_FOUNDRY_QUOTA_PREFLIGHT": "warn" |
| 61 | +} |
| 62 | +``` |
| 63 | + |
| 64 | +### Block mode |
| 65 | + |
| 66 | +Stop the run before a slice that would exceed quota: |
| 67 | + |
| 68 | +```bash |
| 69 | +export PFORGE_FOUNDRY_QUOTA_PREFLIGHT=block |
| 70 | +``` |
| 71 | + |
| 72 | +With `block` mode, execution halts on `critical` status and the following structured error |
| 73 | +is returned: |
| 74 | + |
| 75 | +```json |
| 76 | +{ |
| 77 | + "ok": false, |
| 78 | + "reason": "quota_preflight_critical", |
| 79 | + "message": "[foundry-quota] critical — -3.2% headroom …", |
| 80 | + "deployment": "eastus-prod-gpt-4.1" |
| 81 | +} |
| 82 | +``` |
| 83 | + |
| 84 | +### Disable |
| 85 | + |
| 86 | +```bash |
| 87 | +unset PFORGE_FOUNDRY_QUOTA_PREFLIGHT # or set to empty string / "false" / "off" |
| 88 | +``` |
| 89 | + |
| 90 | +--- |
| 91 | + |
| 92 | +## Threshold Reference |
| 93 | + |
| 94 | +| Status | Headroom after subtracting current usage + slice estimate | |
| 95 | +|---|---| |
| 96 | +| `safe` | ≥ 30 % | |
| 97 | +| `warning` | ≥ 10 % and < 30 % |
| 98 | +| `critical` | < 10 % (including negative — over-budget) | |
| 99 | +| `unknown` | Quota unavailable (fail-open; execution continues) | |
| 100 | + |
| 101 | +**Fail-open guarantee**: any error fetching quota (`timeout`, `rate_limited`, `forbidden`, |
| 102 | +`network_error`, etc.) returns `status: "unknown"` and never blocks execution, regardless of |
| 103 | +the `PFORGE_FOUNDRY_QUOTA_PREFLIGHT` mode. |
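The thresholds and the fail-open rule can be expressed in a few lines. The sketch below is one possible reading of the table, not the body of `compareSliceEstimate()`; `classifyHeadroom` and `shouldHalt` are hypothetical names:

```typescript
type QuotaStatus = "safe" | "warning" | "critical" | "unknown";
type Mode = "off" | "warn" | "block";

// Maps a headroom percentage to a status using the thresholds in the table.
// null stands for "quota unavailable" and always maps to "unknown".
function classifyHeadroom(headroomPct: number | null): QuotaStatus {
  if (headroomPct === null || Number.isNaN(headroomPct)) return "unknown";
  if (headroomPct >= 30) return "safe";
  if (headroomPct >= 10) return "warning";
  return "critical"; // below 10 %, including negative (over-budget) values
}

// Only block mode combined with a critical status halts the run.
// "unknown" never halts, which is the fail-open guarantee.
function shouldHalt(mode: Mode, status: QuotaStatus): boolean {
  return mode === "block" && status === "critical";
}

for (const pct of [68.3, 30, 29.9, 10, 9.9, -3.2, null]) {
  const status = classifyHeadroom(pct);
  console.log(pct, status, shouldHalt("block", status));
}
```

Because `shouldHalt` checks for `critical` explicitly, a failed quota lookup can never stop a run, even in block mode.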
| 104 | + |
| 105 | +--- |
| 106 | + |
| 107 | +## Cache Behaviour |
| 108 | + |
| 109 | +Quota values are cached in-process for **5 minutes** (configurable via the `ttlMs` |
| 110 | +parameter in `foundry-quota.mjs`). This means: |
| 111 | + |
| 112 | +- A plan with 10 slices hitting the same deployment makes **at most 1** control-plane call |
| 113 | + per 5-minute window, not 10. |
| 114 | +- If you resize a deployment mid-run, the preflight can keep using the old capacity for up to 5 minutes, until the cached entry expires. |
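The caching behaviour described above can be illustrated with a small TTL cache. This is a sketch under stated assumptions, not the cache in `foundry-quota.mjs`: the `TtlCache` class, its injectable clock, and the synchronous `load` callback are all hypothetical (a real control-plane lookup would be asynchronous).

```typescript
// Hypothetical in-process TTL cache keyed by deployment name.
// The clock is injectable so expiry can be exercised without waiting.
class TtlCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = 5 * 60 * 1000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value while it is fresh; otherwise calls load() and stores the result.
  get(key: string, load: () => T): T {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) return hit.value;
    const value = load();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

// Usage: ten slices against one deployment trigger a single load.
let clock = 0;
let controlPlaneCalls = 0;
const cache = new TtlCache<number>(5 * 60 * 1000, () => clock);
const loadCapacity = (): number => {
  controlPlaneCalls += 1;
  return 100000; // stand-in for a control-plane lookup
};

for (let slice = 0; slice < 10; slice++) cache.get("eastus-prod-gpt-4.1", loadCapacity);
const callsAfterTenSlices = controlPlaneCalls;

clock += 5 * 60 * 1000; // advance the fake clock past the TTL
cache.get("eastus-prod-gpt-4.1", loadCapacity);
const callsAfterExpiry = controlPlaneCalls;

console.log(callsAfterTenSlices, callsAfterExpiry);
```

The second lookup after the clock advances shows why a resized deployment is only picked up once the cached entry has expired.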
| 115 | + |
| 116 | +--- |
| 117 | + |
| 118 | +## Required Azure Permissions |
| 119 | + |
| 120 | +| Action | Required role | |
| 121 | +|---|---| |
| 122 | +| Read deployment quota (`GET /deployments/{name}`) | **Cognitive Services Usages Reader** | |
| 123 | +| Acquire token for `management.azure.com` | Any Entra identity / service principal | |
| 124 | + |
| 125 | +The `Cognitive Services Usages Reader` role is a built-in Azure role that grants |
| 126 | +`Microsoft.CognitiveServices/*/read` without any write or data-plane permissions. Assign it |
| 127 | +at the **resource-group** or **subscription** level to cover all AOAI accounts in scope. |
| 128 | + |
| 129 | +```bash |
| 130 | +az role assignment create \ |
| 131 | + --assignee "<service-principal-client-id>" \ |
| 132 | + --role "Cognitive Services Usages Reader" \ |
| 133 | + --scope "/subscriptions/<sub-id>/resourceGroups/<rg-name>" |
| 134 | +``` |
| 135 | + |
| 136 | +--- |
| 137 | + |
| 138 | +## Quota Response Shape |
| 139 | + |
| 140 | +`getDeploymentQuota()` returns either a success object or a fail-open error: |
| 141 | + |
| 142 | +```ts |
| 143 | +// Success |
| 144 | +{ |
| 145 | + ok: true, |
| 146 | + deploymentName: string, |
| 147 | + model: string, // e.g. "gpt-4.1" |
| 148 | + tpmCapacity: number | null, // tokens-per-minute capacity from control plane |
| 149 | + tpmUsage: number | null, // current usage (null = not reported by this endpoint) |
| 150 | + ptuCapacity: number | null, // provisioned throughput capacity (future) |
| 151 | + ptuUsage: number | null, |
| 152 | + sku: string | null, |
| 153 | + fetchedAt: string, // ISO 8601 timestamp |
| 154 | +} |
| 155 | + |
| 156 | +// Fail-open |
| 157 | +{ |
| 158 | + ok: false, |
| 159 | + reason: "missing_required_params" | "no_credential" | "no_token" | "token_error" |
| 160 | + | "rate_limited" | "forbidden" | "service_unavailable" | "timeout" |
| 161 | + | "network_error" | "http_<code>", |
| 162 | +} |
| 163 | +``` |
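Because both shapes carry an `ok` flag, a caller can branch on it before reading any other field. A minimal sketch of that pattern, using a trimmed-down copy of the shapes above; `summariseQuota` is a hypothetical helper, the `reason` type is widened to `string` for brevity, and treating a null `tpmUsage` as 0 is an assumption:

```typescript
// Trimmed-down copies of the shapes above; only the fields used here are kept.
type QuotaSuccess = {
  ok: true;
  deploymentName: string;
  tpmCapacity: number | null;
  tpmUsage: number | null;
};
type QuotaFailure = { ok: false; reason: string };
type QuotaResult = QuotaSuccess | QuotaFailure;

// Hypothetical helper: summarises a result for logging. Checking `ok` first
// lets TypeScript narrow the union before other fields are read.
function summariseQuota(result: QuotaResult): string {
  if (result.ok === false) return `unknown (${result.reason})`;
  if (result.tpmCapacity === null) return "unknown (no TPM capacity reported)";
  const used = result.tpmUsage ?? 0; // assumption: unreported usage counts as 0
  return `${result.deploymentName}: cap=${result.tpmCapacity} tpm, used=${used} tpm`;
}

console.log(summariseQuota({ ok: false, reason: "timeout" }));
console.log(
  summariseQuota({
    ok: true,
    deploymentName: "eastus-prod-gpt-4.1",
    tpmCapacity: 100000,
    tpmUsage: null,
  }),
);
```

Handling the `tpmCapacity === null` case separately keeps a successful response with no TPM figure from being mistaken for a usable capacity.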
| 164 | + |
| 165 | +--- |
| 166 | + |
| 167 | +## Troubleshooting |
| 168 | + |
| 169 | +| Symptom | Likely cause | Fix | |
| 170 | +|---|---|---| |
| 171 | +| `reason: "no_credential"` | `credential` not provided | Set `AZURE_AUTH_MODE=entra` or `managed-identity` | |
| 172 | +| `reason: "forbidden"` | Missing RBAC role | Assign **Cognitive Services Usages Reader** to the identity | |
| 173 | +| `reason: "rate_limited"` | Too many control-plane calls | Cache TTL is already 5 min; check for multiple concurrent workers | |
| 174 | +| `reason: "timeout"` | Control-plane slow or unreachable | Check network connectivity to `management.azure.com`; quota check fails open | |
| 175 | +| `status: "unknown"` on every slice | Any of the above | Execution continues; review the `[foundry-quota]` log annotation for the `reason` field | |
| 176 | +| `tpmCapacity: null` | Deployment uses PTU (provisioned) | PTU capacity is not reported on the same endpoint; status will be `unknown` | |
| 177 | + |
| 178 | +--- |
| 179 | + |
| 180 | +## Related Docs |
| 181 | + |
| 182 | +- `docs/integrations/byo-azure-openai.md` — BYO AOAI / Foundry provider setup |
| 183 | +- `docs/integrations/foundry-toolbox-mcp.md` — Foundry Toolbox MCP server wiring |
| 184 | +- `pforge-mcp/foundry-quota.mjs` — Implementation (`getDeploymentQuota`, `compareSliceEstimate`, cache) |
| 185 | +- `pforge-mcp/tests/foundry-quota.test.mjs` — 20 unit tests covering all error codes and threshold boundaries |