Azure Switchboard

Batteries-included, coordination-free client loadbalancing for Azure OpenAI and OpenAI.

uv add azure-switchboard

Overview

azure-switchboard is a Python 3 asyncio library that provides an API-compatible client loadbalancer for Chat Completions. You instantiate a Switchboard with one or more Deployments, and requests are distributed across healthy deployments using the power of two random choices method. Deployments can point at Azure OpenAI (base_url=.../openai/v1/) or OpenAI (base_url=None).
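As a rough illustration of the selection strategy, the sketch below samples two candidates at random and keeps the less utilized one. It is not the library's implementation: the `utilization` mapping and the `pick_two_random_choices` name are assumptions made for this example.

```python
import random


def pick_two_random_choices(utilization: dict[str, float], rng: random.Random) -> str:
    """Sample two distinct deployments and return the less utilized one."""
    names = list(utilization)
    if len(names) == 1:
        return names[0]
    a, b = rng.sample(names, 2)
    return a if utilization[a] <= utilization[b] else b


# Hypothetical utilization snapshot (0.0 = idle, 1.0 = saturated).
snapshot = {"east": 0.90, "west": 0.20, "south": 0.50}
rng = random.Random(0)
picks = [pick_two_random_choices(snapshot, rng) for _ in range(3000)]
counts = {name: picks.count(name) for name in snapshot}
print(counts)
```

Because the more loaded member of any sampled pair always loses, "east" is never chosen in this snapshot, and each client can make the decision from its own view of utilization without talking to other clients.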

Features

API Compatibility: Switchboard.create is a transparently-typed proxy for OpenAI.chat.completions.create.
Coordination-Free: The default Two Random Choices algorithm does not require coordination between client instances to achieve excellent load distribution characteristics.
Utilization-Aware: TPM/RPM utilization is tracked per model per deployment for use during selection.
Batteries Included:
- Session Affinity: Provide a session_id to route requests in the same session to the same deployment.
- Automatic Failover: Retries are controlled by a tenacity AsyncRetrying policy (failover_policy).
- Pluggable Selection: Custom selection algorithms can be provided by passing a callable to the selector parameter on the Switchboard constructor.
- OpenTelemetry Integration: Built-in metrics for request routing and healthy deployment counts.
Lightweight: Small codebase with minimal dependencies: openai, tenacity, wrapt, and opentelemetry-api.
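Session affinity with a bounded session map could look roughly like the following. This is a sketch under assumptions, not the library's code: `SessionMap`, `pin`, and `lookup` are hypothetical names, and only the idea of an LRU-capped mapping from session_id to deployment (the max_sessions parameter) comes from this README.

```python
from collections import OrderedDict
from typing import Optional


class SessionMap:
    """LRU-bounded mapping of session_id -> deployment name (hypothetical helper)."""

    def __init__(self, max_sessions: int = 1024) -> None:
        self.max_sessions = max_sessions
        self._sessions = OrderedDict()

    def lookup(self, session_id: str) -> Optional[str]:
        """Return the pinned deployment, refreshing its recency, or None."""
        if session_id not in self._sessions:
            return None
        self._sessions.move_to_end(session_id)
        return self._sessions[session_id]

    def pin(self, session_id: str, deployment: str) -> None:
        """Pin a session to a deployment, evicting the least recently used entry if full."""
        self._sessions[session_id] = deployment
        self._sessions.move_to_end(session_id)
        if len(self._sessions) > self.max_sessions:
            self._sessions.popitem(last=False)


sessions = SessionMap(max_sessions=2)
sessions.pin("alice", "east")
sessions.pin("bob", "west")
sessions.lookup("alice")        # touch "alice" so "bob" becomes the oldest entry
sessions.pin("carol", "south")  # the map is full, so "bob" is evicted
print(sessions.lookup("alice"), sessions.lookup("bob"), sessions.lookup("carol"))
```

A session whose entry has been evicted simply behaves like a new session: the next request selects a deployment again and pins it.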

Runnable Example

#!/usr/bin/env python3
#
# To run this, use:
#   uv run --env-file .env tools/readme_example.py
#
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "azure-switchboard",
# ]
# ///

import asyncio
import os

from azure_switchboard import Deployment, Model, Switchboard

azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")

deployments = []
if azure_openai_endpoint and azure_openai_api_key:
    # create 3 deployments. reusing the endpoint
    # is fine for the purposes of this demo
    for name in ("east", "west", "south"):
        deployments.append(
            Deployment(
                name=name,
                base_url=f"{azure_openai_endpoint.rstrip('/')}/openai/v1/",
                api_key=azure_openai_api_key,
                models=[Model(name="gpt-4o-mini")],
            )
        )

if openai_api_key:
    deployments.append(
        Deployment(
            name="openai",
            api_key=openai_api_key,
            models=[Model(name="gpt-4o-mini")],
        )
    )

if not deployments:
    raise RuntimeError(
        "Set AZURE_OPENAI_ENDPOINT/AZURE_OPENAI_API_KEY or OPENAI_API_KEY to run this example."
    )


async def main():
    async with Switchboard(deployments=deployments) as sb:
        print("Basic functionality:")
        await basic_functionality(sb)

        print("Session affinity (should warn):")
        await session_affinity(sb)


async def basic_functionality(switchboard: Switchboard):
    # Make a completion request (non-streaming)
    response = await switchboard.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, world!"}],
    )

    print("completion:", response.choices[0].message.content)

    # Make a streaming completion request
    stream = await switchboard.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, world!"}],
        stream=True,
    )

    print("streaming: ", end="")
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

    print()


async def session_affinity(switchboard: Switchboard):
    session_id = "anything"

    # First message will select a random healthy
    # deployment and associate it with the session_id
    r = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Who won the World Series in 2020?"}],
    )

    d1 = switchboard.select_deployment(model="gpt-4o-mini", session_id=session_id)
    print("deployment 1:", d1)
    print("response 1:", r.choices[0].message.content)

    # Follow-up requests with the same session_id will route to the same deployment
    r2 = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "Who won the World Series in 2020?"},
            {"role": "assistant", "content": r.choices[0].message.content},
            {"role": "user", "content": "Who did they beat?"},
        ],
    )

    print("response 2:", r2.choices[0].message.content)

    # Simulate a failure by marking down the deployment
    d1.models["gpt-4o-mini"].mark_down()

    # A new deployment will be selected for this session_id
    r3 = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Who won the World Series in 2021?"}],
    )

    d2 = switchboard.select_deployment(model="gpt-4o-mini", session_id=session_id)
    print("deployment 2:", d2)
    print("response 3:", r3.choices[0].message.content)
    assert d2 != d1


if __name__ == "__main__":
    asyncio.run(main())

Benchmarks

just bench
uv run --env-file .env tools/bench.py -v -r 1000 -d 10 -e 500
Distributing 1000 requests across 10 deployments
Max inflight requests: 1000

Request 500/1000 completed
Utilization Distribution:
0.000 - 0.200 |   0
0.200 - 0.400 |  10 ..............................
0.400 - 0.600 |   0
0.600 - 0.800 |   0
0.800 - 1.000 |   0
Avg utilization: 0.339 (0.332 - 0.349)
Std deviation: 0.006

{
    'bench_0': {'gpt-4o-mini': {'util': 0.361, 'tpm': '10556/30000', 'rpm': '100/300'}},
    'bench_1': {'gpt-4o-mini': {'util': 0.339, 'tpm': '9819/30000', 'rpm': '100/300'}},
    'bench_2': {'gpt-4o-mini': {'util': 0.333, 'tpm': '9405/30000', 'rpm': '97/300'}},
    'bench_3': {'gpt-4o-mini': {'util': 0.349, 'tpm': '10188/30000', 'rpm': '100/300'}},
    'bench_4': {'gpt-4o-mini': {'util': 0.346, 'tpm': '10210/30000', 'rpm': '99/300'}},
    'bench_5': {'gpt-4o-mini': {'util': 0.341, 'tpm': '10024/30000', 'rpm': '99/300'}},
    'bench_6': {'gpt-4o-mini': {'util': 0.343, 'tpm': '10194/30000', 'rpm': '100/300'}},
    'bench_7': {'gpt-4o-mini': {'util': 0.352, 'tpm': '10362/30000', 'rpm': '102/300'}},
    'bench_8': {'gpt-4o-mini': {'util': 0.35, 'tpm': '10362/30000', 'rpm': '102/300'}},
    'bench_9': {'gpt-4o-mini': {'util': 0.365, 'tpm': '10840/30000', 'rpm': '101/300'}}
}

Utilization Distribution:
0.000 - 0.100 |   0
0.100 - 0.200 |   0
0.200 - 0.300 |   0
0.300 - 0.400 |  10 ..............................
0.400 - 0.500 |   0
0.500 - 0.600 |   0
0.600 - 0.700 |   0
0.700 - 0.800 |   0
0.800 - 0.900 |   0
0.900 - 1.000 |   0
Avg utilization: 0.348 (0.333 - 0.365)
Std deviation: 0.009

Distribution overhead: 926.14ms
Average response latency: 5593.77ms
Total latency: 17565.37ms
Requests per second: 1079.75
Overhead per request: 0.93ms

Distribution overhead scales ~linearly with the number of deployments.
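The summary figures in the benchmark output can be reproduced from the numbers it prints. The short check below copies the final per-deployment utilization values and the distribution overhead from the run above. That the tool reports a population standard deviation, and that requests per second is derived from the distribution overhead rather than the total latency, are inferences from the arithmetic, not something the output states.

```python
import statistics

# Final per-deployment utilization values copied from the benchmark output.
utils = [0.361, 0.339, 0.333, 0.349, 0.346, 0.341, 0.343, 0.352, 0.35, 0.365]

avg = statistics.mean(utils)
spread = statistics.pstdev(utils)  # population standard deviation

# Timing figures copied from the same run.
requests = 1000
distribution_overhead_ms = 926.14

overhead_per_request_ms = distribution_overhead_ms / requests
requests_per_second = requests / (distribution_overhead_ms / 1000)

print(f"Avg utilization: {avg:.3f} ({min(utils):.3f} - {max(utils):.3f})")
print(f"Std deviation: {spread:.3f}")
print(f"Overhead per request: {overhead_per_request_ms:.2f}ms")
print(f"Requests per second: {requests_per_second:.2f}")
```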

Configuration Reference

switchboard.Model Parameters

Parameter	Description	Default
`name`	Model name as sent to Chat Completions	Required
`tpm`	Tokens-per-minute budget used for utilization tracking and routing	0 (unlimited)
`rpm`	Requests-per-minute budget used for utilization tracking and routing	0 (unlimited)
`default_cooldown`	Cooldown duration (seconds) after a deployment/model failure mark-down	10.0
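One plausible way to turn the tpm and rpm budgets into a single utilization figure is sketched below. The formula (the larger of the two usage ratios), the `UsageWindow` class, and its methods are assumptions for illustration only; the library's actual computation may differ. A budget of 0 is treated as unlimited, as in the table above.

```python
from dataclasses import dataclass


@dataclass
class UsageWindow:
    """Tracks token and request usage against per-minute budgets (hypothetical)."""

    tpm: int = 0  # 0 means unlimited
    rpm: int = 0  # 0 means unlimited
    tokens_used: int = 0
    requests_used: int = 0

    def record(self, tokens: int) -> None:
        """Account for one completed request that consumed `tokens` tokens."""
        self.tokens_used += tokens
        self.requests_used += 1

    def utilization(self) -> float:
        """Return the larger of the token and request usage ratios."""
        token_ratio = self.tokens_used / self.tpm if self.tpm else 0.0
        request_ratio = self.requests_used / self.rpm if self.rpm else 0.0
        return max(token_ratio, request_ratio)

    def reset(self) -> None:
        """Clear the counters, for example once per rate-limit window."""
        self.tokens_used = 0
        self.requests_used = 0


window = UsageWindow(tpm=30000, rpm=300)
for _ in range(100):
    window.record(tokens=90)
print(round(window.utilization(), 3))

window.reset()
print(window.utilization())
```

In this sketch an unlimited model always reports zero utilization, so it would never look busier than a budgeted one during selection.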

switchboard.Deployment Parameters

Parameter	Description	Default
`name`	Unique identifier for the deployment	Required
`base_url`	API base URL. Azure example: `https://<resource>.openai.azure.com/openai/v1/`. OpenAI: leave `None`.	None
`api_key`	API key for the deployment	None
`timeout`	Per-request timeout in seconds. Override per deployment for batch jobs that need longer budgets.	30.0
`models`	Models available on this deployment	Built-in model name defaults

Timeout vs. Rate-Limit Cooldown

azure-switchboard distinguishes between two categories of API errors:

- RateLimitError / APIConnectionError: These errors are correlated with a specific deployment, which is either saturated or unreachable. The affected model is marked down for the configured default_cooldown (default 10s) so that the load balancer avoids it.
- APITimeoutError: Timeouts during an upstream-wide slowdown are not correlated with any particular deployment. Marking a deployment down in this case wastes capacity without fixing anything, because every deployment would cycle through cooldown in turn. Timeouts are re-raised without triggering a cooldown.

If your workload has a longer latency budget (e.g. batch structured-output jobs), set timeout on the relevant Deployment rather than relying on the 30-second default.
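The distinction between the two error categories can be sketched as follows. To stay free of third-party imports, the example defines stand-in exception classes. In the openai package, APITimeoutError is a subclass of APIConnectionError, which is why the sketch checks for the timeout first. The `ModelState` class and the `handle_failure` and `healthy_after` functions are hypothetical and do not reflect the library's internals.

```python
import time


class APIConnectionError(Exception):
    """Stand-in for openai.APIConnectionError."""


class APITimeoutError(APIConnectionError):
    """Stand-in for openai.APITimeoutError (a subclass of the connection error)."""


class RateLimitError(Exception):
    """Stand-in for openai.RateLimitError."""


class ModelState:
    """Hypothetical per-model health state with a cooldown deadline."""

    def __init__(self, default_cooldown: float = 10.0) -> None:
        self.default_cooldown = default_cooldown
        self._down_until = float("-inf")

    def mark_down(self) -> None:
        self._down_until = time.monotonic() + self.default_cooldown

    def is_healthy(self) -> bool:
        return time.monotonic() >= self._down_until


def handle_failure(state: ModelState, error: Exception) -> None:
    """Apply the cooldown policy, then re-raise the original error."""
    # Check the timeout first: it would otherwise match APIConnectionError.
    if isinstance(error, APITimeoutError):
        raise error  # upstream-wide slowness: no cooldown
    if isinstance(error, (RateLimitError, APIConnectionError)):
        state.mark_down()  # deployment-specific failure: start the cooldown
    raise error


def healthy_after(error: Exception) -> bool:
    """Report whether a fresh model is still healthy after handling `error`."""
    state = ModelState()
    try:
        handle_failure(state, error)
    except Exception:
        pass
    return state.is_healthy()


print(healthy_after(APITimeoutError()), healthy_after(RateLimitError()))
```

Reversing the two isinstance checks would silently put timed-out deployments into cooldown, which is the behavior this section argues against.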

switchboard.Switchboard Parameters

Parameter	Description	Default
`deployments`	List of deployment configs	Required
`selector`	Deployment selection function `(model, eligible_deployments) -> deployment`	`two_random_choices`
`failover_policy`	Tenacity `AsyncRetrying` policy used around each `create` call	`AsyncRetrying(stop=stop_after_attempt(2), retry=retry_if_not_exception_type(SwitchboardError), reraise=True)`
`ratelimit_window`	How often usage counters reset (seconds). Set `0` to disable periodic reset.	60.0
`max_sessions`	LRU capacity for session affinity map	1024
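A custom selector only needs to match the `(model, eligible_deployments) -> deployment` shape listed in the table. The sketch below builds a round-robin selector and exercises it against stand-in objects. `FakeDeployment` and `make_round_robin_selector` are hypothetical, and the assumption that the eligible deployments arrive as an indexable, non-empty sequence is not confirmed by this README.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class FakeDeployment:
    """Stand-in for whatever deployment object the switchboard passes in."""

    name: str


Selector = Callable[[str, Sequence[FakeDeployment]], FakeDeployment]


def make_round_robin_selector() -> Selector:
    """Build a selector that cycles through the eligible deployments per model."""
    counters: dict[str, int] = defaultdict(int)

    def select(model: str, eligible_deployments: Sequence[FakeDeployment]) -> FakeDeployment:
        index = counters[model] % len(eligible_deployments)
        counters[model] += 1
        return eligible_deployments[index]

    return select


selector = make_round_robin_selector()
pool = [FakeDeployment("east"), FakeDeployment("west"), FakeDeployment("south")]
order = [selector("gpt-4o-mini", pool).name for _ in range(4)]
print(order)
```

Under the same assumptions, such a function would be passed as the selector argument when constructing the Switchboard. Unlike the default, round-robin ignores utilization, so it is mainly useful as a template for the callable's shape.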

Development

This project uses uv for package management and just for task automation. See the justfile for the available commands.

git clone https://github.com/arini-ai/azure-switchboard
cd azure-switchboard

just install

Running tests

just test

Release

This library uses CalVer for versioning. On push to master, if tests pass, a package is automatically built, released, and uploaded to PyPI.

Locally, the package can be built with uv:

uv build

OpenTelemetry Integration

azure-switchboard uses OpenTelemetry metrics via the meter azure_switchboard.switchboard.

Metrics emitted on the request path include:

healthy_deployments_count (gauge)
requests (counter, with deployment + model attributes)

To run with local OTEL instrumentation:

just otel-run

Contributing

Fork/clone repo
Make changes
Run tests with just test
Lint with just lint
Commit and make a PR

License

MIT
