AI Gateway Architecture | July 2, 2026 | Big Y

Model Routing Policy Design: Match Model, Cost, Latency, and Risk to Each Workflow

Design a model routing policy that maps AI workflows to model routes, fallback rules, cost ceilings, latency SLOs, risk gates, and evidence checks.

Every AI product eventually outgrows a single default model. A support bot, code reviewer, invoice extractor, image prompt helper, and internal research agent do not need the same latency target, context budget, reasoning depth, tool behavior, fallback rule, or approval trail. A model routing policy turns those tradeoffs into an operating contract instead of a pile of ad hoc model= strings.

The goal is not to choose one "best" model. The goal is to make model choice reviewable. A good model routing policy tells engineers which model class to use, when to spend more, when to optimize for latency, when to block fallback, and what evidence must exist before a workflow moves to production.

Flatkey matters in this discussion because routing is easier to govern when model access, keys, request logs, usage analytics, pricing review, and operational controls live in one place. Use the policy below as the design layer, then validate the current Flatkey model catalog and pricing page before production rollout.

Model routing policy design in one table

Start with workflow classes, not provider names. The table below is the practical first pass for a model routing policy.

Workflow class	Primary route	Cost rule	Latency rule	Risk rule	Fallback rule	Required evidence
Fast classification	Small or low-reasoning text model	Lowest cost that passes evals	Tight p95 target	Low business risk	Safe to retry or downgrade	Accuracy sample, p95 latency, cost per 1,000 calls
Customer chat	Balanced model with tool support	Cap by conversation or account	Low p95 and stable streaming	Medium risk	Fallback only to models with tested tone and tool behavior	Conversation evals, refusal checks, tool-call success, transcript QA
Code and technical reasoning	Strong reasoning model	Spend more only on hard tasks	Looser latency budget	Medium to high risk	Fallback to reviewed peer route, not to a weak model	Task evals, diff correctness, tool trace, rollback path
Structured extraction	Model with schema support	Optimize per valid record	Batch or near-real-time	Medium risk	Retry with same or stricter route before fallback	Schema-valid rate, field accuracy, error taxonomy
Procurement or finance review	Pinned reviewed model route	Cost secondary to auditability	Async acceptable	High risk	No automatic fallback without approval	Source trace, model version, request log, reviewer sign-off
Background summarization	Lower-cost route or batch-friendly route	Minimize cost per accepted summary	Async	Low to medium risk	Fallback after retry budget is exhausted	Sample quality, retry rate, cached-token metrics

This table is not the final policy. It is the decision surface. Each row needs measurable gates before it becomes production routing.

What a model routing policy must decide

A model routing policy is a written rule that maps a workflow to model capabilities, cost ceilings, latency SLOs, fallback behavior, and evidence requirements. It should answer six questions for every production workflow:

1. What is the workflow trying to optimize: speed, quality, cost, reliability, safety, modality, context length, or auditability?
2. Which capabilities are required: tool calling, structured output, long context, image input, streaming, low reasoning, high reasoning, or provider-specific endpoint support?
3. What is the failure budget: retry, fallback, degrade, queue, ask a human, or stop?
4. What can change automatically: provider, model size, reasoning effort, timeout, route group, or nothing?
5. What must be logged: requested model, served model, route, status, usage units, cost, fallback attempt, and owner?
6. What proof is needed before launch: eval score, latency sample, quota test, billing trace, security review, or procurement approval?

OpenAI's current GPT-5.5 guidance is useful here because it treats API configuration as part of model performance, not as an afterthought. The docs call out Responses API state handling, reasoning effort, verbosity, structured outputs, prompt caching, tool design, and hosted tools as factors that affect intelligence, reliability, latency, and cost. Those are exactly the kinds of dimensions a model routing policy should preserve.

Policy dimension 1: workflow risk

Risk is the first routing split because it controls how much automation is allowed.

Low-risk workflows can usually tolerate retries, cheaper routes, and broad fallback. Examples include internal tagging, lightweight summaries, draft suggestions, and non-critical classification. These are good candidates for aggressive cost controls because an occasional retry or review sample is acceptable.

Medium-risk workflows need stronger acceptance tests. Customer support, workflow automation, code suggestions, and sales-assist tools may not require human review every time, but they do require tone checks, tool-call checks, and route evidence when errors occur.

High-risk workflows should be pinned more tightly. Procurement reviews, legal summaries, finance approvals, security decisions, and regulated workflows should not silently fall back to a different model or provider just because the primary route is slow. The model routing policy should require explicit approval before fallback changes the risk posture.

The simple rule: if a human would ask "which model actually answered this?" after a bad outcome, the route needs stronger logging and weaker automatic fallback.
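One way to keep the three risk tiers from drifting is to encode them as data that the gateway layer reads. The sketch below is a minimal illustration, not a prescribed schema: the `Risk` and `Automation` names and the `fallback_scope` values are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Automation:
    auto_retry: bool     # may the gateway retry the same route unattended?
    fallback_scope: str  # "broad", "tested_peers_only", or "approval_required"


# Hypothetical mapping of the three tiers described above; adjust to your own review.
AUTOMATION_BY_RISK = {
    Risk.LOW: Automation(auto_retry=True, fallback_scope="broad"),
    Risk.MEDIUM: Automation(auto_retry=True, fallback_scope="tested_peers_only"),
    Risk.HIGH: Automation(auto_retry=True, fallback_scope="approval_required"),
}


def may_fallback_automatically(risk: Risk) -> bool:
    """Return False when a route change needs explicit approval first."""
    return AUTOMATION_BY_RISK[risk].fallback_scope != "approval_required"
```

Keeping the mapping in one table means a change to the high-risk rule is a reviewable diff rather than a scattered code change.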

Policy dimension 2: latency and user experience

Latency belongs in the policy because the same model can be acceptable for an async workflow and unacceptable for an interactive product.

For interactive chat, set p50, p95, timeout, and streaming expectations. If time-to-first-token matters, measure it separately from total completion time. For background tasks, define maximum queue time and completion deadline instead.

Do not set a vague rule like "use the fast model." Write the model routing policy as a testable target:

workflow: support_chat_triage
latency:
  p95_first_token_ms: 1200
  p95_complete_ms: 7000
  timeout_ms: 10000
fallback:
  on_timeout: use_reviewed_balanced_route
  on_schema_error: retry_same_route_once
  on_safety_or_policy_error: stop_and_escalate
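Targets like these only help if something checks them against measured samples. The following sketch assumes you already collect first-token and completion times in milliseconds; the function names are illustrative, and it uses the nearest-rank percentile definition, which is one of several common conventions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples to measure")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


# Targets copied from the support_chat_triage policy above.
TARGETS = {"p95_first_token_ms": 1200, "p95_complete_ms": 7000}


def check_latency(first_token_ms, complete_ms, targets=TARGETS):
    """Return (measured, failures), where failures holds every metric over its target."""
    measured = {
        "p95_first_token_ms": percentile(first_token_ms, 95),
        "p95_complete_ms": percentile(complete_ms, 95),
    }
    failures = {name: value for name, value in measured.items() if value > targets[name]}
    return measured, failures
```

Measuring first-token and completion time separately matters because a route can stream quickly and still miss the completion deadline.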

OpenAI's prompt caching docs are another reminder that latency depends on more than model selection. Stable prompt prefixes, consistent cache keys, and cache-hit monitoring can materially change latency and input-token cost for repeated workloads. If caching is part of the plan, make it a policy requirement and log cached-token metrics.

Policy dimension 3: cost ceilings

Cost controls should be expressed per workflow outcome, not only per token. A cheap route that fails often can cost more than a stronger route that succeeds on the first attempt.
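A quick way to see this is to compute cost per accepted outcome. The sketch below assumes each attempt succeeds independently with the same probability and that the workflow retries until one attempt is accepted, so the expected number of attempts is 1 divided by the success rate. The prices and success rates are made-up numbers for illustration, not catalog values.

```python
def cost_per_accepted_outcome(cost_per_attempt, success_rate):
    """Expected spend to get one accepted result, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Hypothetical routes: a cheap route that often fails evals, a stronger route that rarely does.
cheap = cost_per_accepted_outcome(cost_per_attempt=0.002, success_rate=0.30)
strong = cost_per_accepted_outcome(cost_per_attempt=0.005, success_rate=0.95)

print(f"cheap route:  {cheap:.4f} per accepted outcome")
print(f"strong route: {strong:.4f} per accepted outcome")
```

With these assumed numbers the cheap route costs more per accepted outcome even though it costs less per attempt, which is the comparison the policy should record.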

Use three cost limits:

Limit	Example	Why it matters
Per request	Maximum cost for one request, image, video job, or turn	Prevents a single call from surprising finance
Per workflow	Maximum cost for a completed ticket, extraction, answer, or document	Accounts for retries and fallback
Per owner	Budget by app, team, customer, environment, or key	Keeps spend tied to accountability
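The three limits can be checked together before a request is admitted. This is a minimal sketch under assumed names (`CostLimits`, `violated_limits`) and assumed units (one currency for every value); a real gateway would read the running totals from its own metering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostLimits:
    per_request: float   # ceiling for a single call
    per_workflow: float  # ceiling for one completed outcome, retries included
    per_owner: float     # budget for the owning app, team, or key


def violated_limits(limits, request_cost, workflow_cost_so_far, owner_spend_so_far):
    """Return the names of every limit this request would exceed if admitted."""
    violations = []
    if request_cost > limits.per_request:
        violations.append("per_request")
    if workflow_cost_so_far + request_cost > limits.per_workflow:
        violations.append("per_workflow")
    if owner_spend_so_far + request_cost > limits.per_owner:
        violations.append("per_owner")
    return violations
```

Returning every violated limit, rather than stopping at the first, gives the workflow owner a complete picture in one log line.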

Flatkey's pricing page is useful at this stage because it gives teams a current path for reviewing models and pricing, and the product surface emphasizes usage metering, request logs, usage analytics, and cost controls. Before production routing, check the current /pricing page and confirm the model row, endpoint family, and usage unit for the actual workflow.

Policy dimension 4: capability fit

Model routing policy design should start from required capabilities. Price and latency only matter after the route can do the job.

For each workflow, score these capabilities:

Capability	Route question	Acceptance test
Tool use	Does the route call required tools correctly?	Tool-call success rate and argument validation
Structured output	Does the route satisfy the schema?	Valid JSON/schema rate plus field-level accuracy
Long context	Does the route preserve instructions and relevant context?	Long-document eval with expected citations or fields
Vision or files	Does the route handle the actual input modality?	Real sample set with size and format coverage
Streaming	Does the route preserve event shape and recovery behavior?	SSE/streaming smoke test and timeout handling
Safety behavior	Does the route refuse or escalate correctly?	Red-team prompts and refusal/override checks
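The "capabilities first, price second" ordering can be sketched as a filter followed by a sort. The route names, capability labels, costs, and latencies below are hypothetical placeholders, not entries from any real catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    capabilities: frozenset
    cost_per_1k_calls: float  # hypothetical unit, measured on your own workload
    p95_ms: int


def eligible_routes(routes, required, max_p95_ms):
    """Drop routes that miss a capability or the latency target, then rank the rest by cost."""
    required = frozenset(required)
    fit = [r for r in routes if required <= r.capabilities and r.p95_ms <= max_p95_ms]
    return sorted(fit, key=lambda r: r.cost_per_1k_calls)


routes = [
    Route("small_route", frozenset({"streaming"}), 1.0, 500),
    Route("balanced_route", frozenset({"streaming", "tool_calling", "structured_outputs"}), 4.0, 900),
    Route("strong_route", frozenset({"streaming", "tool_calling", "structured_outputs"}), 12.0, 2500),
]
choices = eligible_routes(routes, {"tool_calling", "streaming"}, max_p95_ms=1200)
print([r.name for r in choices])
```

In this example the cheapest route is excluded because it lacks tool calling, and the strongest one because it misses the latency target, so cost never overrides capability fit.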

OpenAI's Structured Outputs docs recommend schema-backed output when the application needs reliable structure. For a routing policy, that means a route that cannot satisfy the output contract should not be used for structured extraction just because it is cheaper.

Policy dimension 5: fallback boundaries

Fallback is not automatically good. It can rescue uptime, but it can also change behavior, context handling, price, data boundary, tool support, or output shape.

Write fallback rules as explicit transitions:

workflow: invoice_extraction
primary_route: extraction_schema_route
allowed_fallbacks:
  - extraction_schema_route_backup
blocked_fallbacks:
  - general_chat_route
  - creative_writing_route
fallback_triggers:
  retry_same_route_once:
    - transient_5xx
    - rate_limit
  stop_and_queue:
    - schema_invalid_after_retry
    - unsupported_file_type
    - compliance_flag
evidence:
  log_requested_model: true
  log_served_model: true
  log_fallback_reason: true
  log_usage_units: true

A mature model routing policy separates retry from fallback. Retry means "try the same contract again." Fallback means "change the route." That change should be visible in logs and accepted by the workflow owner.
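The retry-versus-fallback split can be made explicit in a small decision function. This sketch mirrors the invoice_extraction policy above, with two assumptions that the policy text does not state: once the single retry is spent, the request moves to the first allowed fallback, and any error class the policy does not name fails closed to stop_and_queue.

```python
POLICY = {
    "primary_route": "extraction_schema_route",
    "allowed_fallbacks": ["extraction_schema_route_backup"],
    "blocked_fallbacks": ["general_chat_route", "creative_writing_route"],
    "retry_same_route_once": {"transient_5xx", "rate_limit"},
    "stop_and_queue": {"schema_invalid_after_retry", "unsupported_file_type", "compliance_flag"},
}


def next_action(policy, error, retries_used, current_route):
    """Return (action, route) after a failed attempt on current_route."""
    if error in policy["stop_and_queue"]:
        return "stop_and_queue", None
    if error in policy["retry_same_route_once"]:
        if retries_used == 0:
            return "retry", current_route  # same contract, same route
        # Retry budget spent (assumption): change route, but only to an allowed fallback.
        for candidate in policy["allowed_fallbacks"]:
            if candidate != current_route and candidate not in policy["blocked_fallbacks"]:
                return "fallback", candidate
    # Unknown error classes, or no fallback left: fail closed.
    return "stop_and_queue", None
```

Because the function returns the action name as well as the route, the caller can log the fallback reason at the moment the route changes.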

Policy dimension 6: observability and billing evidence

Routing without evidence is guesswork. Your policy should define the fields that every production request must expose.

Evidence field	Why it belongs in the policy
Workflow name	Connects spend and errors to a business process
App, team, key, or customer owner	Enables chargeback and incident ownership
Requested model and served model	Shows whether fallback or route substitution happened
Endpoint family	Separates chat, responses, image, video, Anthropic, Gemini, and other route shapes
Status and error class	Distinguishes provider errors, gateway errors, policy stops, and validation errors
Usage units	Lets finance reconcile text, cache, image, audio, and video usage
Cost or balance impact	Converts engineering traces into reviewable spend
Fallback reason	Explains why the policy changed route
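A lightweight audit over request logs can enforce this table. The field names below are assumptions chosen to match the table, not the log schema of any particular gateway, and the model names in the usage note are placeholders.

```python
REQUIRED_FIELDS = (
    "workflow", "owner", "requested_model", "served_model", "endpoint_family",
    "status", "usage_units", "cost", "fallback_reason",
)


def audit_record(record):
    """Return a list of problems found in one request log record (empty means clean)."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if field not in record]
    substituted = record.get("requested_model") != record.get("served_model")
    if substituted and not record.get("fallback_reason"):
        problems.append("served model differs from requested model without a fallback reason")
    return problems
```

For example, a record with `requested_model` set to `model-a`, `served_model` set to `model-b`, and `fallback_reason` set to `None` would be flagged as an unexplained route substitution.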

Flatkey's current public positioning fits this evidence need: one gateway for model access, routing, billing, usage analytics, and operational controls. For this article, the live pricing API check on July 2, 2026 returned success: true and exposed endpoint families including openai, openai-response, anthropic, gemini, image-generation, openai-video, and video. Treat that as dated source evidence, not a promise that every route, model row, or price unit will remain unchanged.

A practical model routing policy template

Use this template as the first version of your internal routing standard.

policy_name: customer_support_v1
owner:
  team: support_platform
  approver: product_and_finance
workflow:
  description: classify, answer, and escalate support requests
  environment: production
  data_sensitivity: customer_content
route_selection:
  primary_route: balanced_support_route
  required_capabilities:
    - tool_calling
    - structured_outputs
    - streaming
  blocked_routes:
    - experimental_models
    - unreviewed_provider_routes
latency:
  p95_first_token_ms: 1200
  p95_complete_ms: 7000
cost:
  max_cost_per_conversation: approved_budget
  owner_key: support_platform_prod
risk:
  human_review_required_when:
    - refund_exception
    - legal_or_policy_question
    - confidence_below_threshold
fallback:
  retry_same_route_once:
    - transient_error
    - rate_limit
  fallback_to_backup_route:
    - primary_route_unavailable
  stop_without_fallback:
    - safety_refusal
    - schema_invalid_after_retry
    - unapproved_data_region
evidence:
  required_logs:
    - workflow
    - requested_model
    - served_model
    - endpoint_family
    - route_status
    - usage_units
    - cost_or_balance
    - fallback_reason
acceptance_tests:
  min_eval_pass_rate: 0.95
  max_schema_error_rate: 0.01
  max_unreviewed_fallback_rate: 0

The exact route names will differ by team. The important part is that the policy makes model choice, fallback, cost, latency, and proof reviewable.
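The acceptance_tests block in the template can be turned into a release gate. The sketch below assumes a test run is summarized as raw counts under hypothetical key names; the thresholds are copied from the template.

```python
GATES = {
    "min_eval_pass_rate": 0.95,
    "max_schema_error_rate": 0.01,
    "max_unreviewed_fallback_rate": 0,
}


def gate_failures(results, gates=GATES):
    """Compare counts from a test run against the policy gates; return the failed gate names."""
    total = results["total"]
    if total == 0:
        return ["no_requests_run"]  # an empty run proves nothing, so treat it as a failure
    failures = []
    if results["eval_passed"] / total < gates["min_eval_pass_rate"]:
        failures.append("min_eval_pass_rate")
    if results["schema_errors"] / total > gates["max_schema_error_rate"]:
        failures.append("max_schema_error_rate")
    if results["unreviewed_fallbacks"] / total > gates["max_unreviewed_fallback_rate"]:
        failures.append("max_unreviewed_fallback_rate")
    return failures
```

A run of 200 requests with 192 eval passes, 3 schema errors, and no unreviewed fallbacks would pass the eval gate at 0.96 but fail the schema gate at 0.015.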

Acceptance tests before production

Do not ship a model routing policy without running tests that simulate normal and failed paths.

1. Run a golden dataset through the primary route and record quality, schema validity, latency, and usage.
2. Trigger a rate-limit or transient-error path and verify retry behavior.
3. Trigger a schema failure and confirm the policy retries, stops, or escalates as written.
4. Trigger a blocked fallback and confirm the gateway does not silently change route.
5. Compare requested model, served model, endpoint family, and usage units in logs.
6. Check whether finance can reconcile the same sample requests to cost, prepaid balance, invoice, or export rows.
7. Run a rollback test that pins the previous route or disables fallback.
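The reconciliation check in this list can be sketched as a join between request logs and billing export rows. The `request_id` and `cost` field names are assumptions for illustration; costs are passed as strings and compared as `Decimal` values to avoid floating-point drift.

```python
from decimal import Decimal


def reconcile(log_rows, billing_rows):
    """Match request logs to billing rows by request_id; return (request_id, issue) pairs."""
    billed = {row["request_id"]: Decimal(row["cost"]) for row in billing_rows}
    logged_ids = {row["request_id"] for row in log_rows}
    issues = []
    for row in log_rows:
        request_id = row["request_id"]
        if request_id not in billed:
            issues.append((request_id, "logged but not billed"))
        elif billed[request_id] != Decimal(row["cost"]):
            issues.append((request_id, "cost mismatch"))
    issues += [(rid, "billed but not logged") for rid in billed if rid not in logged_ids]
    return issues
```

An empty result for the sample requests is the evidence finance needs; any non-empty result should block the rollout until the discrepancy is explained.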

For deeper gateway architecture context, pair this checklist with the Flatkey guides on AI API gateways, LLM API gateway architecture, and AI API load balancing and failover.

Where Flatkey fits

Flatkey should not replace the policy. It should make the policy easier to enforce and review.

Use Flatkey when the team wants one key for connected AI models, a current pricing and model catalog review path, usage visibility, request-level evidence, quotas, and a simpler billing conversation than scattered provider accounts. The model routing policy still needs owners, acceptance tests, route constraints, fallback rules, and rollback plans.

A practical Flatkey proof run looks like this:

1. Pick one production-like workflow and one owner key.
2. Confirm the current model row and endpoint family on Flatkey pricing.
3. Send normal, streaming, structured, and controlled-failure requests if the workflow uses them.
4. Review request logs for requested model, served model, status, usage units, and fallback evidence.
5. Confirm quota behavior and cost or balance review with the workflow owner.
6. Move only the tested route to production, then expand the policy row by row.

This keeps the model routing policy grounded in real evidence instead of an architecture diagram.

FAQ

What is a model routing policy?

A model routing policy is a written rule that maps each AI workflow to an approved model route, capability requirements, cost ceiling, latency target, fallback behavior, and evidence checklist.

Why not route every request to the strongest model?

The strongest route is often slower and more expensive than a workflow needs. A model routing policy lets low-risk workflows use efficient routes while preserving stronger routes for hard reasoning, sensitive decisions, or high-value work.

When should fallback be blocked?

Block fallback when a route change could alter data handling, compliance posture, output schema, tool behavior, user-facing quality, or billing ownership. In those cases, queue, retry, or escalate instead of silently changing the model route.

How often should teams update a model routing policy?

Review it whenever model catalogs, pricing units, endpoint behavior, risk requirements, or eval results change. At minimum, review active production policies quarterly and after any major model migration.

What is the first metric to watch?

Watch cost per accepted outcome, not only cost per token. Then pair it with p95 latency, fallback rate, schema-valid rate, and request-level billing evidence.

How does Flatkey help with model routing policy design?

Flatkey can provide one gateway surface for model access, pricing review, routing, usage analytics, request logs, quotas, and billing review. That gives teams a practical place to validate whether the model routing policy is behaving as written.

Start with Flatkey pricing, choose one workflow, then get a key and run a small proof that checks route behavior, logs, quotas, cost, fallback, and rollback before production rollout.