An LLM API gateway circuit breaker stops an application from repeatedly sending traffic into a route that is already failing. Without that guardrail, a timeout can trigger retries, retries can trigger fallback attempts, fallback attempts can trigger more provider errors, and the app can turn one upstream incident into a provider failure loop.
The goal is not to replace retries or model fallback. The goal is to decide when a route is unhealthy enough that the gateway should stop trying it for a short window, send a controlled probe later, and choose a safer outcome while the breaker is open: fallback, queue, degrade, or fail closed.
Flatkey is relevant because flatkey.ai publicly positions the product around one API key, an OpenAI-compatible base URL at https://router.flatkey.ai/v1, routing, unified billing, usage analytics, dashboard controls, automatic switching, load balancing, and quota limits. Those are useful central points for reliability work. They do not remove the need to define a clear LLM API gateway circuit breaker policy for your own application workflows.
Quick Answer: What An LLM API Gateway Circuit Breaker Should Do
A practical LLM API gateway circuit breaker has three route states and one fail-closed path. Keep the policy simple enough that on-call engineers can explain it during an incident.
| State | Gateway Behavior | What Moves It | Evidence To Log |
|---|---|---|---|
| Closed | Traffic can use the provider, model, endpoint family, account, or route group. | Error rate, timeout rate, latency, overload responses, or failed health probes cross the threshold. | Route policy ID, selected model, provider, endpoint family, latency, status code, retry count, and cost. |
| Open | The gateway stops sending normal traffic to the unhealthy route for a cooldown window. | Cooldown expires, or an operator manually permits a probe. | Breaker reason, opened-at time, blocked attempt count, fallback route, queue decision, or fail-closed reason. |
| Half-open | The gateway allows a limited number of probe requests before restoring traffic. | Successful probes close the breaker; failed probes open it again. | Probe sample size, probe workflow, probe result, latency, usage, and route owner approval. |
| Fail closed | The gateway refuses to route the request because the problem is not a provider-health issue. | Auth, policy, quota, safety, data-boundary, invalid request, or tool side-effect risk. | Stop reason, owner, user-visible message, and remediation path. |
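The three route states in the table can be sketched as a small state machine. This is a minimal illustration, not a Flatkey feature or API: the `RouteBreaker` class, its method names, and the default cooldown and probe values are all assumptions chosen for the example. The fail-closed path is deliberately absent because those requests are rejected before they reach the breaker.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class RouteBreaker:
    """Three-state breaker for one route; names and defaults are illustrative."""

    def __init__(self, cooldown_seconds=90.0, probe_limit=3, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self.probe_limit = probe_limit
        self.clock = clock
        self.state = State.CLOSED
        self.opened_at = None
        self.probes_sent = 0
        self.probe_successes = 0

    def trip(self):
        """Open the breaker; call this when health thresholds are crossed."""
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.probes_sent = 0
        self.probe_successes = 0

    def allow_request(self):
        """Return True if the route may receive this request."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                return False
            self.state = State.HALF_OPEN  # cooldown expired: start probing
        if self.state is State.HALF_OPEN:
            if self.probes_sent >= self.probe_limit:
                return False  # wait for outstanding probe results
            self.probes_sent += 1
        return True

    def record_result(self, success):
        """Feed back a probe result while the breaker is half-open."""
        if self.state is not State.HALF_OPEN:
            return
        if not success:
            self.trip()  # a failed probe reopens the breaker
            return
        self.probe_successes += 1
        if self.probe_successes >= self.probe_limit:
            self.state = State.CLOSED
```

Injecting the clock keeps the cooldown testable without sleeping, which helps when the policy is exercised in staging.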
Why Retries Create Provider Failure Loops
Retries are useful when a request fails for a transient reason. They become dangerous when every user request creates several more upstream calls, especially during a provider outage or overload window. A retry loop can consume rate limits, spend quota, increase latency, and hide the original failure behind a final error.
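A rough upper bound makes the amplification concrete. The sketch below assumes the worst case, where every attempt on every route fails; the function name and the example numbers are hypothetical.

```python
def worst_case_upstream_calls(user_requests, retries_per_route, fallback_routes):
    """Upper bound on upstream calls when every attempt on every route fails."""
    attempts_per_route = 1 + retries_per_route  # first try plus retries
    routes_tried = 1 + fallback_routes          # primary plus fallbacks
    return user_requests * attempts_per_route * routes_tried


# 1,000 user requests, 2 retries per route, 2 fallback routes
print(worst_case_upstream_calls(1000, retries_per_route=2, fallback_routes=2))  # 9000
```

Under those assumptions, a modest retry and fallback configuration turns each user request into nine upstream calls during an outage.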
A circuit breaker changes the retry question. Instead of asking, "Should this one request retry again?", the gateway asks, "Is this route healthy enough to receive more traffic right now?" That route-level view matters for LLM workloads because each request can be expensive, long-running, streaming, tool-using, and customer visible.
Microsoft's cloud design pattern for circuit breakers describes the same core idea for remote services: after repeated failures, the circuit opens so the application does not keep trying an operation that is likely to fail. For an AI route, the same pattern needs LLM-specific boundaries: model behavior, endpoint family, token spend, streaming state, tool side effects, data class, and fallback approval.
Classify Errors Before They Reach The Breaker
The fastest way to build a bad AI API circuit breaker is to count every failure as provider health. That creates false positives. It can also hide problems the app owner must fix.
| Error Or Event | Breaker Decision | Reason | Default Outcome |
|---|---|---|---|
| Provider 500, 503, overload, unavailable, connection failure, repeated upstream timeout | Count toward route health. | These are plausible provider, route, capacity, or network-health signals. | Retry within a tight budget, then open the route breaker if thresholds are crossed. |
| 429 request-rate limit | Classify carefully. | A provider-wide overload signal and an app-created burst need different handling. | Throttle, back off, or open only the scoped route that is actually saturated. |
| 429 monthly quota, exhausted credits, or spend limit | Do not count as provider health. | This is a budget or account-owner condition. | Fail closed, alert the budget owner, or route only if a pre-approved budget exists. |
| 401 auth, incorrect key, organization membership, IP allowlist, unsupported region | Do not count as provider health. | The request is not allowed to use the route. | Fail closed and fix credentials, account, IP, or region policy. |
| Invalid request, unsupported parameter, unsupported model, malformed schema | Do not count as provider health. | The app sent a request shape the route cannot serve. | Fix the request or choose a compatible model before routing. |
| Safety, moderation, DLP, compliance, or unapproved data-class block | Never bypass with fallback. | Routing to another model could cross a policy boundary. | Fail closed and record the policy decision. |
| Tool already executed, partial stream already shown, user cancelled request | Do not silently replay. | The app may create duplicate side effects or merge two model outputs. | Mark incomplete, require explicit user retry, or use an idempotent recovery path. |
OpenAI's error-code guide is a useful example of why this taxonomy matters: it separates authentication and IP allowlist problems, unsupported-region problems, rate limits, quota exhaustion, server errors, overload, and sudden request-rate slowdowns. Anthropic and Google Gemini docs make similar distinctions between rate limits, overload/unavailable conditions, invalid requests, and permission or quota problems. Your LLM API gateway circuit breaker should keep those classes separate before it opens a route.
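One way to keep those classes separate is a small classifier that runs before the breaker sees any result. The sketch below assumes the gateway has already normalized provider responses into labels; the label strings and the `classify` function are hypothetical, and mapping each provider's real status codes onto them is left to your integration. Treating unknown labels as fail-closed is also an assumption, chosen so that unrecognized errors never count as provider health.

```python
from enum import Enum


class ErrorClass(Enum):
    COUNT_TOWARD_BREAKER = "count_toward_breaker"
    THROTTLE = "throttle"
    FAIL_CLOSED = "fail_closed"
    NO_SILENT_REPLAY = "no_silent_replay"


# Hypothetical normalized labels; map each provider's real codes onto these.
PROVIDER_HEALTH = {"upstream_5xx", "provider_overloaded", "provider_unavailable",
                   "upstream_timeout", "connection_reset"}
APP_OWNED = {"auth_error", "ip_allowlist_error", "unsupported_region",
             "quota_exhausted", "invalid_request", "schema_incompatible",
             "safety_or_policy_block", "unapproved_data_class"}
REPLAY_RISK = {"tool_side_effect_already_committed", "partial_output_shown",
               "user_cancelled"}


def classify(error_label):
    """Decide how one failed attempt should be treated by the gateway."""
    if error_label in PROVIDER_HEALTH:
        return ErrorClass.COUNT_TOWARD_BREAKER
    if error_label == "rate_limited_429":
        return ErrorClass.THROTTLE  # back off; open only the saturated scope
    if error_label in REPLAY_RISK:
        return ErrorClass.NO_SILENT_REPLAY
    if error_label in APP_OWNED:
        return ErrorClass.FAIL_CLOSED
    # Assumption: unknown labels fail closed instead of poisoning route health.
    return ErrorClass.FAIL_CLOSED
```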
Scope The Breaker To The Smallest Route That Explains The Failure
A breaker that is too broad causes unnecessary downtime. A breaker that is too narrow lets the same provider failure loop continue through nearby paths. Scope the breaker to the smallest route dimension that explains the incident.
| Scope | Use When | Risk If Scoped Wrong |
|---|---|---|
| Provider | Multiple models from the same provider are unavailable or overloaded. | Too broad if only one model, account, or endpoint family is failing. |
| Model | One model family has repeated 5xx, timeout, or unsupported-route errors. | Too narrow if the upstream account or provider is saturated. |
| Endpoint family | Chat works but Responses, image, video, Anthropic Messages, or Gemini routes behave differently. | Mixing endpoint families can hide protocol-specific failures. |
| Account, group, region, or vendor path | Only one upstream account, routing group, region, or vendor path is failing. | Failing to isolate can burn healthy capacity elsewhere. |
| Workflow | Tool calls, streaming, batch jobs, or customer-facing chat have different safety and replay rules. | A route that is safe for batch enrichment may be unsafe for live user streams. |
For Flatkey users, this means you should test from the workflow and route you actually plan to use. The current Flatkey pricing API snapshot for this article returned 638 model rows, 23 vendors, and endpoint families for OpenAI chat completions, OpenAI Responses, Anthropic messages, Gemini generateContent, image generation, and OpenAI video. Treat that as dated evidence from June 18, 2026, not as a permanent route contract.
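Scoping can be expressed as a partial match: a breaker names only the dimensions that explain the failure, and it applies to any request that matches all of them. The `RouteScope` type and its field names below are illustrative assumptions, not a Flatkey schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteScope:
    provider: str
    model: Optional[str] = None
    endpoint_family: Optional[str] = None
    account: Optional[str] = None
    workflow: Optional[str] = None

    def covers(self, request_route: "RouteScope") -> bool:
        """A breaker scope covers a request when every field it sets matches."""
        for name in ("provider", "model", "endpoint_family", "account", "workflow"):
            wanted = getattr(self, name)
            if wanted is not None and wanted != getattr(request_route, name):
                return False
        return True


# A model-scoped breaker leaves other models from the same provider untouched.
breaker_scope = RouteScope(provider="primary-provider", model="primary-approved-model")
```

Leaving a field unset widens the breaker, so the narrowest useful scope is the one with the most fields filled in that still matches every failing request.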
Set Thresholds That Match LLM Traffic
An LLM API gateway circuit breaker should not open because of one isolated failure. It also should not wait until every customer request is failing. Use thresholds that combine minimum traffic volume, failure ratio, latency, and cooldown.
| Threshold | Practical Starting Point | Why It Matters |
|---|---|---|
| Minimum sample size | Open only after enough requests or probes have been observed. | Prevents a single expensive completion from opening a global route. |
| Failure ratio | Track retryable upstream failures separately from app-owned failures. | Stops auth, quota, and malformed request errors from poisoning route health. |
| Latency or timeout threshold | Use endpoint-specific timeout budgets for chat, streaming, image, and video paths. | A good threshold for chat may be wrong for video or batch generation. |
| Open cooldown | Keep the breaker open long enough to stop retry storms, then probe. | Protects both the provider and your own request queue. |
| Half-open probe limit | Allow a small, controlled number of test requests before closing. | Prevents a full traffic surge when a provider is only partially recovered. |
| Cost ceiling | Set a maximum estimated spend for retries, fallback, and probes. | Prevents reliability recovery from becoming a billing incident. |
Rate limits are part of the threshold discussion. OpenAI's rate-limit guide explains that rate limits protect against abuse, ensure fair access, and help manage aggregate load. If your app keeps retrying into a rate-limited route, your own traffic pattern can become the incident. The LLM API gateway circuit breaker should work with client-side pacing, queueing, and quota controls, not fight them.
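The minimum-sample and failure-ratio thresholds can be combined in a rolling window. The sketch below is one possible shape, assuming events are recorded in time order and that only retryable upstream failures are passed in as `failed=True`; the `HealthWindow` class and its default values are illustrative, not recommended settings.

```python
from collections import deque


class HealthWindow:
    """Rolling window of attempt outcomes for one route scope."""

    def __init__(self, window_seconds=60, minimum_requests=20, failure_ratio_to_open=0.5):
        self.window_seconds = window_seconds
        self.minimum_requests = minimum_requests
        self.failure_ratio_to_open = failure_ratio_to_open
        self.events = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, timestamp, failed):
        """Record one attempt; pass failed=True only for provider-health errors."""
        self.events.append((timestamp, failed))

    def should_open(self, now):
        """Return True when the window has enough traffic and too many failures."""
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()  # drop events that fell out of the window
        total = len(self.events)
        if total < self.minimum_requests:
            return False  # not enough evidence to open the breaker
        failures = sum(1 for _, failed in self.events if failed)
        return failures / total >= self.failure_ratio_to_open
```

The minimum-sample check runs first so that a handful of failures during quiet traffic cannot open the route on their own.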
Decide What Happens While The Breaker Is Open
Opening a breaker is only useful if the gateway has a defined next action. Do not let every open route automatically fall back to any available model.
| Open-State Action | Use When | Required Guardrail |
|---|---|---|
| Fallback route | The backup model or provider is already approved for the workflow. | Run the same eval, schema, tool, data-boundary, and cost checks before production. |
| Queue | The job is asynchronous or the user experience can tolerate delay. | Preserve owner, customer, model, cost, and retry metadata. |
| Degrade | A lower-risk partial result is acceptable, such as a cached response or reduced feature. | Make the degraded state visible to the app and logs. |
| Fail closed | The request has policy, budget, safety, auth, region, or side-effect risk. | Return a clear error and alert the right owner instead of trying another model. |
Public Vercel AI Gateway docs describe model fallbacks as a gateway pattern with ordered backup models. Treat that as evidence that the pattern is common across gateways, not as approval for any route in your own stack, where fallback is a separate approval decision. The breaker decides whether a route is currently healthy; fallback decides whether another route is allowed to serve the same request.
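The open-state table can be reduced to a small decision function. This is a sketch under stated assumptions: the request is a plain dictionary, the key names are hypothetical, and the order of preference (fallback, then queue, then degrade) is a design choice for the example, not a rule from any gateway.

```python
def open_state_action(request, approved_fallback_workflows):
    """Pick an outcome for a request whose primary route breaker is open."""
    if request.get("fail_closed_reason"):
        return "fail_closed"  # policy, budget, safety, auth, or side-effect risk
    if request["workflow"] in approved_fallback_workflows:
        return "fallback"     # a backup route was approved ahead of time
    if request.get("asynchronous"):
        return "queue"        # the job can tolerate delay
    if request.get("degraded_result_available"):
        return "degrade"      # for example a cached response
    return "fail_closed"      # no defined next action: refuse, do not improvise


approved = {"customer-support-chat"}
print(open_state_action({"workflow": "nightly-enrichment", "asynchronous": True}, approved))  # queue
```

Defaulting to `fail_closed` keeps an open breaker from falling back to whatever model happens to be available.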
Streaming And Tool Calls Need Extra Stops
Streaming makes a provider failure loop easier to hide. If the app silently restarts a request after partial output, a user can see a merged answer from two attempts. Tool calls add a second problem: a retry or fallback can duplicate a refund, ticket update, email, database write, or external action.
Use these rules in the LLM API gateway circuit breaker policy:
- Before first output: retry or fallback can be allowed if the route is approved and the breaker is closed or half-open.
- After first output: mark the stream incomplete and require explicit user retry instead of silent fallback.
- After tool execution: do not replay unless the tool is idempotent and the operation has a replay key.
- After a policy block: fail closed. Do not route to a different model to bypass the block.
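The four rules above can be sketched as one decision function. The `AttemptState` fields and the returned labels are hypothetical names for this example. The check order is an assumption that favors the most restrictive outcome when several conditions hold at once.

```python
from dataclasses import dataclass


@dataclass
class AttemptState:
    policy_blocked: bool = False
    first_output_sent: bool = False
    tool_executed: bool = False
    tool_idempotent: bool = False
    replay_key: str = ""
    route_approved: bool = False
    breaker_state: str = "closed"  # "closed", "open", or "half_open"


def recovery_decision(a):
    """Decide what the gateway may do after a failed attempt."""
    if a.policy_blocked:
        return "fail_closed"        # never route around a policy block
    if a.tool_executed and not (a.tool_idempotent and a.replay_key):
        return "do_not_replay"      # replay could duplicate a side effect
    if a.first_output_sent:
        return "mark_incomplete"    # require an explicit user retry
    if a.tool_executed:
        return "replay_with_key"    # idempotent tool with a replay key
    if a.route_approved and a.breaker_state in ("closed", "half_open"):
        return "retry_or_fallback"  # nothing visible has happened yet
    return "blocked"
```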
This pairs with the AI API retry strategy, model fallback checklist, and AI API load balancing and failover guides. The breaker should share the same failure taxonomy as those playbooks.
Observability Fields For Circuit Breaker Review
If a request succeeds only because the gateway silently skipped a broken route, the incident still needs to be visible. Cloudflare's AI Gateway docs provide a public example of AI gateway observability patterns: request logs can include provider, status, tokens, cost, and duration, and custom metadata can tag requests for later filtering. Your gateway logs should give the same level of route evidence for breaker decisions.
| Field | Why Operators Need It |
|---|---|
| Breaker policy ID and version | Shows which rule opened, closed, or bypassed the route. |
| Breaker state at decision time | Explains whether the route was closed, open, half-open, or fail-closed. |
| Requested model, selected model, provider, account, group, and endpoint family | Separates user intent from the gateway route decision. |
| Error class per attempt | Distinguishes upstream failures from auth, quota, invalid request, policy, and tool errors. |
| Latency, timeout, retry count, and probe result | Shows whether the route failed slowly, failed quickly, or recovered during half-open probing. |
| Partial-output flag and tool side-effect status | Prevents hidden mixed-output or duplicate-action incidents. |
| Usage, cost, quota owner, and final disposition | Connects reliability recovery to spend, budgets, and accountability. |
The companion AI API observability logs article goes deeper on incident logging. For circuit breakers, prioritize the route state and the exact reason a request was blocked, probed, routed, queued, or failed closed.
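A structured log record is one way to keep those fields together for each breaker decision. The `BreakerDecisionLog` type below is a sketch: its field names are assumptions derived from the table above, not a Flatkey or Cloudflare log schema, and the sample values are invented.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class BreakerDecisionLog:
    policy_id: str
    policy_version: str
    breaker_state: str              # closed, open, half_open, or fail_closed
    requested_model: str
    selected_model: str
    provider: str
    endpoint_family: str
    error_class_per_attempt: List[str]
    latency_ms: int
    retry_count: int
    partial_output: bool
    tool_side_effect: bool
    estimated_cost_usd: float
    quota_owner: str
    final_disposition: str          # routed, probed, queued, degraded, or failed closed

    def to_json(self):
        """Serialize with stable key order so log lines are easy to diff."""
        return json.dumps(asdict(self), sort_keys=True)


record = BreakerDecisionLog(
    policy_id="support-chat-provider-breaker-v1", policy_version="1",
    breaker_state="open", requested_model="primary-approved-model",
    selected_model="fallback-approved-model", provider="secondary-provider",
    endpoint_family="openai-chat-completions",
    error_class_per_attempt=["upstream_timeout"], latency_ms=1840, retry_count=1,
    partial_output=False, tool_side_effect=False, estimated_cost_usd=0.004,
    quota_owner="support-team", final_disposition="routed_to_fallback",
)
print(record.to_json())
```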
A Flatkey Rollout Plan For Circuit Breaker Policies
Use this staged approach before relying on an LLM API gateway circuit breaker for customer traffic through Flatkey or any OpenAI-compatible gateway.
- Create a staging key: keep breaker tests separate from production customer traffic.
- Confirm the base route: point an OpenAI-compatible client at https://router.flatkey.ai/v1 and verify the model, endpoint family, usage row, and dashboard visibility.
- Snapshot the route catalog: save the Flatkey pricing page and pricing API response on the rollout date so route and pricing assumptions are auditable.
- Define the error taxonomy: decide which errors count toward route health and which fail closed before the breaker sees them.
- Start with one workflow: apply the breaker to one model route, one endpoint family, and one traffic class before expanding.
- Force failure tests: simulate provider timeout, 500, 503, request-rate 429, quota exhaustion, auth failure, malformed request, partial stream, and tool side effect.
- Verify open-state behavior: confirm fallback, queue, degrade, or fail-closed actions match the approval matrix.
- Review logs and billing: confirm breaker state, selected route, usage, cost, and quota owner are visible after each test.
- Set rollback: disable the policy if it opens too broadly, hides app-owned errors, or drives unexpected spend.
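The forced-failure step in the plan above can start with a scripted fake upstream before any real traffic is involved. The sketch below is a hypothetical test harness: `ScriptedUpstream`, `call_with_budget`, and the outcome labels are assumptions, and it makes no network calls.

```python
RETRYABLE = {"timeout", "http_500", "http_503"}  # assumed provider-health outcomes


class ScriptedUpstream:
    """Fake upstream that returns a scripted sequence of outcomes, then "ok"."""

    def __init__(self, script):
        self.script = list(script)
        self.calls = 0

    def call(self):
        self.calls += 1
        return self.script.pop(0) if self.script else "ok"


def call_with_budget(upstream, max_total_attempts=2):
    """Retry only retryable outcomes, and never exceed the attempt budget."""
    outcome = None
    for _ in range(max_total_attempts):
        outcome = upstream.call()
        if outcome not in RETRYABLE:
            break  # success or an app-owned error: stop immediately
    return outcome


outage = ScriptedUpstream(["http_503", "http_503", "http_503"])
print(call_with_budget(outage), outage.calls)  # http_503 2
```

The same harness can script an auth failure or quota exhaustion to confirm that app-owned errors are never retried.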
Circuit Breaker Policy Template
This template is not a Flatkey API contract. Treat it as a review artifact for engineering, product, finance, and security owners.
{
  "policy_id": "support-chat-provider-breaker-v1",
  "workflow": "customer-support-chat",
  "environment": "production",
  "route_scope": {
    "provider": "primary-provider",
    "model": "primary-approved-model",
    "endpoint_family": "openai-chat-completions",
    "traffic_class": "customer-visible-stream"
  },
  "count_toward_breaker": [
    "upstream_5xx",
    "provider_overloaded",
    "provider_unavailable",
    "upstream_timeout",
    "connection_reset"
  ],
  "fail_closed_before_breaker": [
    "auth_error",
    "ip_allowlist_error",
    "unsupported_region",
    "quota_exhausted",
    "invalid_request",
    "schema_incompatible",
    "safety_or_policy_block",
    "unapproved_data_class",
    "tool_side_effect_already_committed"
  ],
  "thresholds": {
    "window_seconds": 60,
    "minimum_requests": 20,
    "failure_ratio_to_open": 0.5,
    "timeout_ratio_to_open": 0.4,
    "open_cooldown_seconds": 90,
    "half_open_probe_requests": 3,
    "max_total_attempts_per_request": 2
  },
  "open_state_action": {
    "default": "fail_closed",
    "allowed_fallback_policy_ids": [
      "support-chat-fallback-v1"
    ],
    "allow_after_partial_output": false,
    "allow_after_tool_side_effect": false
  },
  "logging": {
    "record_breaker_state": true,
    "record_route_scope": true,
    "record_error_class_per_attempt": true,
    "record_probe_results": true,
    "record_usage_cost_and_quota_owner": true
  }
}
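A policy like this is easier to review when basic consistency checks run before anyone approves it. The sketch below assumes the template has been parsed into a dictionary with the key names shown above. The `validate_policy` function and its specific rules are illustrative, not a required schema.

```python
def validate_policy(policy):
    """Return a list of problems; an empty list means no issues were found."""
    problems = []

    # An error class must not both count toward the breaker and fail closed.
    counted = set(policy.get("count_toward_breaker", []))
    fail_closed = set(policy.get("fail_closed_before_breaker", []))
    overlap = sorted(counted & fail_closed)
    if overlap:
        problems.append(f"error classes in both lists: {overlap}")

    thresholds = policy.get("thresholds", {})
    for key in ("failure_ratio_to_open", "timeout_ratio_to_open"):
        value = thresholds.get(key)
        if value is None or not 0 < value <= 1:
            problems.append(f"{key} must be in (0, 1]")
    for key in ("window_seconds", "minimum_requests", "open_cooldown_seconds",
                "half_open_probe_requests", "max_total_attempts_per_request"):
        value = thresholds.get(key)
        if not isinstance(value, int) or value < 1:
            problems.append(f"{key} must be a positive integer")
    return problems
```

Running a check like this in CI gives engineering, finance, and security reviewers a version of the policy that is at least internally consistent.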
FAQ
What is an LLM API gateway circuit breaker?
An LLM API gateway circuit breaker is a route-health policy that stops normal traffic from hitting an unhealthy model, provider, account, or endpoint family after repeated retryable failures. It opens for a cooldown window, allows limited half-open probes, and closes only after the route looks healthy again.
Which LLM API errors should open a circuit breaker?
Provider-side 5xx errors, overload, unavailable responses, connection failures, and repeated upstream timeouts are typical candidates. Auth errors, IP allowlist failures, unsupported regions, exhausted quota, malformed requests, policy blocks, and tool side effects should usually fail closed instead of opening a provider-health breaker.
How is a circuit breaker different from retry or fallback?
Retry decides whether one request should be attempted again. Fallback decides whether another approved route can serve the request. A circuit breaker decides whether a route should receive normal traffic at all while it appears unhealthy.
Should circuit breakers apply to streaming LLM responses?
Yes, but with stricter boundaries. A breaker can protect the route before the first visible token. After partial output or a tool side effect, the app should not silently replay or fall back unless the workflow is explicitly designed for idempotent recovery.
Final Check Before You Enable The Breaker
Before you enable an LLM API gateway circuit breaker, ask one question: if this route opens during a provider incident, can the team explain what failed, why normal traffic stopped, where traffic went next, what it cost, and how to close or roll back the policy?
If the answer is no, keep the breaker in staging. If the answer is yes, use Flatkey's centralized model access, routing, usage visibility, billing, and quota controls as part of the review loop. When you are ready to validate routes behind one OpenAI-compatible gateway, get a key and start with one workflow, one model route, and one breaker policy.



