AI Gateway Architecture | July 2, 2026 | Big Y

Context Window Routing: Pick Models by Prompt Size Without Losing Cost Control

Use context window routing to choose retrieval, compaction, caching, or long-context models by prompt size while keeping AI gateway costs reviewable.

Long prompts are not automatically a long-context problem. Some prompts should be trimmed, some should use retrieval, some should reserve more output space, and some really do need a larger context window. Context window routing is the policy that decides which path a request should take before it burns budget on the wrong model.

The goal is simple: route by the actual prompt size, the required answer shape, and the evidence you need for cost control. A 6,000-token support conversation, a 70,000-token contract review, and a 900,000-token codebase scan should not share one default route just because they all fit behind the same API key.

Flatkey is useful in this design because model access, routing, usage review, billing, and operational controls are easier to manage from one gateway surface than from scattered provider accounts. Use the framework below to design context window routing rules, then validate the current model row, endpoint family, and usage unit against the Flatkey pricing page before production rollout.

Context window routing starts with a token budget

Start every route decision with a budget, not a model name.

required_context =
  system_and_policy_tokens
+ user_input_tokens
+ retrieved_or_attached_context_tokens
+ tool_schema_and_tool_result_tokens
+ conversation_history_tokens
+ reserved_output_tokens
+ reserved_reasoning_tokens
+ safety_margin_tokens

The route is eligible only if the required context fits the model's usable context window after you reserve output and reasoning space. OpenAI's reasoning model guidance is a good reminder here: when generated tokens reach the context window or max_output_tokens, the response can become incomplete, and teams should leave room for reasoning and output while they calibrate the workload.

That reserve matters for cost control. If a request barely fits the context window, the model may still fail, truncate, or spend heavily on input tokens before returning an unusable answer. Good context window routing protects against that by routing oversized requests to the right path before the call is made.
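
The budget formula above can be sketched in Python. This is a minimal illustration, not a Flatkey or provider API: the `TokenBudget` class, the `fits_window` helper, and all token counts are hypothetical, and the counts would come from whatever tokenizer matches your target model.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TokenBudget:
    """One field per term of the budget formula; all values are estimates."""
    system_and_policy_tokens: int
    user_input_tokens: int
    retrieved_or_attached_context_tokens: int
    tool_schema_and_tool_result_tokens: int
    conversation_history_tokens: int
    reserved_output_tokens: int
    reserved_reasoning_tokens: int
    safety_margin_tokens: int

    def required_context(self) -> int:
        # Sum every term, including the reserves and the safety margin.
        return sum(getattr(self, f.name) for f in fields(self))


def fits_window(budget: TokenBudget, usable_context_window: int) -> bool:
    """A route is eligible only if the full budget fits the window."""
    return budget.required_context() <= usable_context_window


# Hypothetical numbers for a document-review request.
budget = TokenBudget(
    system_and_policy_tokens=1_500,
    user_input_tokens=400,
    retrieved_or_attached_context_tokens=60_000,
    tool_schema_and_tool_result_tokens=3_000,
    conversation_history_tokens=5_000,
    reserved_output_tokens=4_000,
    reserved_reasoning_tokens=8_000,
    safety_margin_tokens=2_000,
)
print(budget.required_context())     # 83900
print(fits_window(budget, 128_000))  # True
print(fits_window(budget, 64_000))   # False
```

Keeping the reserves inside the same sum as the input terms means a request cannot pass the check on input size alone.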

A practical routing matrix

Use this matrix as the first pass for context window routing. Tune the thresholds against your real token counts, model catalog, latency SLOs, and quality evaluations.

Prompt class	Typical signal	Recommended route	Cost-control rule	Required evidence
Short task	Small prompt, small answer, no long history	Fast low-cost route	Avoid long-context models unless evals require them	Prompt tokens, output tokens, success rate
Normal chat	Moderate history, tools, or structured answer	Balanced route with tool and schema support	Cap by conversation or owner	Served model, tool result size, schema-valid rate
Long document	Large file, transcript, policy, or contract	Long-context route or retrieval route	Compare full-context cost against retrieval cost	Input tokens, cited spans, answer quality
Huge corpus	Many files, codebase, logs, or archive	Retrieval, chunking, compaction, then selective long-context route	Do not stuff the corpus by default	Retrieved chunks, dropped context, cache hit rate
Reasoning-heavy prompt	Long task plus planning, tools, or code reasoning	Route with explicit output and reasoning reserve	Reserve output space before sending the prompt	Incomplete rate, reasoning/output tokens, p95 latency
Compliance or finance review	Sensitive content and audit requirements	Pinned reviewed route	Block automatic fallback unless approved	Requested model, served model, owner, cost trace

This is context window routing in operational form: each class has a route, a cost rule, and proof that the route worked.

Do not use the biggest context window as the default

Large context windows are useful. They are not a free replacement for routing.

Google's Gemini long-context docs describe 1M-token context windows and explain how long context can unlock workflows that previously needed summarization, retrieval, or filtering. Anthropic's context-window docs describe context as the working memory that includes request content, tool results, documents, tool definitions, and output. Both points matter: larger windows expand what is possible, but everything you place into the window still needs to be paid for, validated, and logged.

The safest default is not "send everything." It is:

Keep short prompts on efficient routes.
Use retrieval when the answer depends on a small slice of a large corpus.
Use long context when the model must compare many parts of the source at once.
Reserve output and reasoning budget before calling a reasoning model.
Log enough usage detail to compare cost per accepted outcome.

That is the cost-control core of context window routing.

When retrieval beats long context

Retrieval is usually better when the task has a narrow evidence need. Examples include "find the renewal clause," "summarize this incident from the three relevant log lines," or "answer from the current API docs." In those cases, sending the entire contract, log archive, or documentation site may increase cost without improving accuracy.

Use retrieval when:

The answer should cite a small number of passages.
Most of the corpus is irrelevant to the user question.
The same corpus is queried repeatedly by many users.
You need to restrict data exposure by tenant, project, team, or permission.
The cost of full-context input would dominate the value of the answer.

Context window routing should send the request through retrieval first, then pass only the selected chunks, metadata, and instructions to the model. Log the retrieved source IDs, token count, and answer acceptance result. If the answer fails because too much context was missing, promote that workflow to a larger context route and record the reason.
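
One way to make the promotion rule concrete is to log each retrieval-first call and promote the workflow only when missing-context failures pass a threshold. The sketch below is an assumption-heavy illustration: the `RetrievalLog` structure, the route names, the 25% threshold, and the minimum sample size are all hypothetical and should be replaced with values from your own evaluations.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalLog:
    """Per-workflow record of retrieval-first calls (hypothetical structure)."""
    workflow: str
    records: list = field(default_factory=list)

    def record(self, source_ids, context_tokens, accepted, missing_context=False):
        # Keep the evidence named in the text: sources, token count, acceptance.
        self.records.append({
            "source_ids": list(source_ids),
            "context_tokens": context_tokens,
            "accepted": accepted,
            "missing_context": missing_context,
        })

    def missing_context_rate(self) -> float:
        if not self.records:
            return 0.0
        misses = sum(1 for r in self.records if r["missing_context"])
        return misses / len(self.records)


def promotion_decision(log, max_missing_rate=0.25, min_samples=4):
    """Promote to a larger context route only with enough evidence."""
    rate = log.missing_context_rate()
    promote = len(log.records) >= min_samples and rate > max_missing_rate
    return {
        "workflow": log.workflow,
        "route": "long_context_route" if promote else "retrieval_first_route",
        "reason": f"missing-context rate {rate:.0%} over {len(log.records)} calls",
    }


log = RetrievalLog("contract_clause_lookup")
log.record(["doc-12#p4"], 1_800, accepted=True)
log.record(["doc-12#p9"], 2_100, accepted=False, missing_context=True)
log.record(["doc-31#p2"], 1_500, accepted=True)
log.record(["doc-31#p7"], 2_400, accepted=False, missing_context=True)

decision = promotion_decision(log)
print(decision["route"])   # long_context_route
print(decision["reason"])  # missing-context rate 50% over 4 calls
```

The returned reason string is what gets recorded alongside the route change, so a reviewer can see why the workflow left the retrieval path.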

When long context beats retrieval

Long context is stronger when the task needs broad comparison. Examples include reviewing a full policy set for contradiction, analyzing a complete transcript, comparing sections across a large contract, or using an entire repository as a reference set for a planning task.

Use a long-context route when:

The task depends on relationships across many distant sections.
The model needs the whole document structure, not just isolated passages.
Retrieval quality is hard to verify before generation.
The source is a single bounded artifact, such as one PDF, one transcript, or one code bundle.
The expected value of the answer justifies the larger input cost.

Even then, context window routing should not skip cost checks. Measure full input tokens, cached tokens if available, output tokens, latency, retry rate, and accepted-answer rate. The routing policy should prove that the long-context route was better than retrieval, not just easier to implement.

Prompt caching belongs in the route decision

Prompt caching can change the economics of repeated long prompts. OpenAI's prompt caching docs explain that eligible long prompts can benefit when static content appears first and variable content appears later; they also expose cached_tokens in usage details so teams can monitor cache behavior.

Context window routing should treat cacheability as a first-class signal:

Prompt pattern	Routing implication
Stable system policy plus many user questions	Put stable content first and measure cached-token share
Repeated large documentation bundle	Consider cache-aware long-context route
Highly dynamic user-specific data	Do not assume cache savings
Shared tool definitions across many calls	Keep tool schemas stable where possible
Short prompt below cache threshold	Optimize route/model first; caching may not help

Cached tokens may lower cost or latency depending on provider behavior, but they do not make the context window infinite. Anthropic's docs make the important distinction directly: cached prompt prefixes can still occupy the context window. The routing policy should record cache hits as cost evidence, not as permission to ignore token limits.
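
A small helper can record cache behavior as cost evidence while still measuring window usage against the full prompt. This sketch assumes a usage dictionary shaped like the OpenAI Chat Completions usage object (`prompt_tokens` with a nested `prompt_tokens_details.cached_tokens`); other providers and endpoints name these fields differently, and the `cache_summary` function itself is hypothetical.

```python
def cache_summary(usage: dict, context_window_tokens: int) -> dict:
    """Summarize cache share and window usage for one response."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    return {
        "prompt_tokens": prompt_tokens,
        "cached_tokens": cached_tokens,
        # Share of the prompt served from cache: cost evidence only.
        "cached_share": cached_tokens / prompt_tokens if prompt_tokens else 0.0,
        # Cached tokens still occupy the window, so use the full prompt size.
        "window_used": prompt_tokens / context_window_tokens,
    }


# Hypothetical usage payload and context window size.
usage = {"prompt_tokens": 12_000, "prompt_tokens_details": {"cached_tokens": 9_000}}
summary = cache_summary(usage, context_window_tokens=128_000)
print(summary["cached_share"])  # 0.75
print(summary["window_used"])   # 0.09375
```

Note that `window_used` does not subtract cached tokens, which keeps a high cache hit rate from hiding a prompt that is close to the limit.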

Reserve output, reasoning, and tool space

Context window routing often fails because teams count only input tokens. The model still needs room to answer.

For each route, define:

Maximum input tokens: the largest request the route may accept.
Reserved output tokens: room for the visible answer, JSON, citations, or tool arguments.
Reserved reasoning tokens: extra room for reasoning models or hard tasks.
Tool overhead: tool definitions, tool calls, and tool results.
Safety margin: a buffer for tokenizer variance and prompt growth.

Use a route guard like this:

route: contract_review_long_context
max_context_window_tokens: provider_model_limit
max_input_tokens: 180000
reserved_output_tokens: 12000
reserved_reasoning_tokens: 25000
tool_overhead_tokens: 5000
safety_margin_tokens: 8000
on_over_budget:
  first: summarize_or_retrieve
  second: ask_for_scope_reduction
  blocked: send_anyway

The numbers above are placeholders, not universal limits. The important part is the shape of the guardrail: the route has an input ceiling, answer reserve, reasoning reserve, and explicit over-budget behavior.
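
The guard above could be enforced in code along these lines. This is a sketch under stated assumptions: `RouteGuard` and `check` are hypothetical names, the 256,000-token window stands in for `provider_model_limit`, and the other numbers are the same placeholders as in the configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteGuard:
    route: str
    max_context_window_tokens: int
    max_input_tokens: int
    reserved_output_tokens: int
    reserved_reasoning_tokens: int
    tool_overhead_tokens: int
    safety_margin_tokens: int

    def reserved_total(self) -> int:
        # Everything that must fit in the window besides the input itself.
        return (
            self.reserved_output_tokens
            + self.reserved_reasoning_tokens
            + self.tool_overhead_tokens
            + self.safety_margin_tokens
        )


def check(guard: RouteGuard, input_tokens: int, already_reduced: bool = False) -> str:
    """Return the action for a request; 'send_anyway' is never an option."""
    total = input_tokens + guard.reserved_total()
    within_budget = (
        input_tokens <= guard.max_input_tokens
        and total <= guard.max_context_window_tokens
    )
    if within_budget:
        return "send"
    # First over-budget step is to shrink the prompt; the second asks the user.
    return "ask_for_scope_reduction" if already_reduced else "summarize_or_retrieve"


guard = RouteGuard(
    route="contract_review_long_context",
    max_context_window_tokens=256_000,  # hypothetical provider limit
    max_input_tokens=180_000,
    reserved_output_tokens=12_000,
    reserved_reasoning_tokens=25_000,
    tool_overhead_tokens=5_000,
    safety_margin_tokens=8_000,
)
print(check(guard, 150_000))                        # send
print(check(guard, 190_000))                        # summarize_or_retrieve
print(check(guard, 190_000, already_reduced=True))  # ask_for_scope_reduction
```

Checking both the input ceiling and the total against the window catches the case where the input ceiling is set too high for the model actually serving the route.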

Cost controls for context window routing

Do not measure cost only per token. Measure cost per accepted outcome.

Cost metric	Why it matters
Cost per request	Catches oversized single calls
Cost per accepted answer	Accounts for retries, bad retrieval, and failed long-context calls
Cost per workflow	Shows the true cost of a ticket, review, extraction, or report
Cost per owner	Connects usage to app, team, customer, or environment
Cache-adjusted input cost	Separates repeated stable prefixes from dynamic context
Fallback cost	Shows whether fallback is rescuing reliability or hiding a bad primary route
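
To show why cost per accepted answer can differ from cost per request, here is a small worked example. The call records, the `cost_units` field, and all numbers are hypothetical; the unit could be cents, credits, or whatever your billing export uses.

```python
from typing import Optional


def cost_per_request(calls: list) -> float:
    """Average cost of a single call, ignoring whether it was useful."""
    return sum(c["cost_units"] for c in calls) / len(calls)


def cost_per_accepted_answer(calls: list) -> Optional[float]:
    """Total spend, retries and failures included, per accepted answer."""
    accepted = sum(1 for c in calls if c["accepted"])
    if accepted == 0:
        return None  # No accepted answers: the metric is undefined.
    return sum(c["cost_units"] for c in calls) / accepted


# Hypothetical evaluation runs: retrieval is cheap per call but retries often.
retrieval_calls = [
    {"cost_units": 2, "accepted": False},
    {"cost_units": 2, "accepted": True},
    {"cost_units": 2, "accepted": False},
    {"cost_units": 2, "accepted": True},
]
long_context_calls = [
    {"cost_units": 6, "accepted": True},
    {"cost_units": 6, "accepted": True},
]

print(cost_per_request(retrieval_calls))             # 2.0
print(cost_per_accepted_answer(retrieval_calls))     # 4.0
print(cost_per_request(long_context_calls))          # 6.0
print(cost_per_accepted_answer(long_context_calls))  # 6.0
```

In this made-up data the retrieval route looks three times cheaper per request but only 1.5 times cheaper per accepted answer, which is the gap the table's second row is meant to expose.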

Flatkey's public product surface is relevant because it positions the platform around unified model access, routing, billing, usage analytics, and operational controls. The live pricing API check for this article on July 2, 2026 returned success: true and exposed endpoint families including openai, anthropic, gemini, image-generation, openai-video, and video. Treat that as dated evidence for route planning, not a promise that every model, price, or endpoint will remain unchanged.

A context window routing policy template

Put the rules in a format engineering, finance, and procurement can review.

policy_name: context_window_routing_v1
owner:
  team: ai_platform
  approvers:
    - engineering
    - finance
workflow_classes:
  short_task:
    max_input_tokens: 8000
    route: efficient_text_route
    fallback: retry_same_route_once
  normal_chat:
    max_input_tokens: 32000
    route: balanced_tool_route
    fallback: reviewed_balanced_backup
  long_document_review:
    max_input_tokens: 180000
    route: long_context_route
    fallback: summarize_then_retry
  huge_corpus_question:
    route: retrieval_first_route
    fallback: scoped_long_context_route
budget_rules:
  reserve_output_tokens: required_by_workflow
  reserve_reasoning_tokens: required_by_model_class
  block_when_over_budget: true
  require_cache_metrics_when_prompt_repeats: true
evidence:
  required_fields:
    - workflow_class
    - requested_model
    - served_model
    - endpoint_family
    - input_tokens
    - cached_tokens
    - output_tokens
    - reasoning_tokens
    - route_decision
    - fallback_reason
    - owner_key
    - cost_or_balance_impact
acceptance_tests:
  max_incomplete_rate: agreed_threshold
  max_over_budget_rate: zero_for_production
  min_answer_acceptance_rate: workflow_eval_threshold
  finance_reconciliation_sample: required

This template makes context window routing testable. If the route changes, the owner can see why. If the prompt grows, the guardrail can block it. If the request repeats, cache metrics become part of the review.
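
As an illustration of how the `workflow_classes` thresholds might be applied, the sketch below picks a class from the input token count alone. That is a simplifying assumption: a real policy would also consider signals such as compliance requirements or reasoning load, and the `select_route` function is hypothetical. The thresholds and route names are the placeholders from the template.

```python
# (workflow_class, max_input_tokens, route), ordered from smallest to largest.
WORKFLOW_CLASSES = [
    ("short_task", 8_000, "efficient_text_route"),
    ("normal_chat", 32_000, "balanced_tool_route"),
    ("long_document_review", 180_000, "long_context_route"),
]
# Anything larger than the last ceiling goes through retrieval first.
OVERFLOW_CLASS = ("huge_corpus_question", "retrieval_first_route")


def select_route(input_tokens: int) -> dict:
    """Pick the smallest workflow class whose input ceiling fits the request."""
    for workflow_class, max_input_tokens, route in WORKFLOW_CLASSES:
        if input_tokens <= max_input_tokens:
            return {"workflow_class": workflow_class, "route": route}
    workflow_class, route = OVERFLOW_CLASS
    return {"workflow_class": workflow_class, "route": route}


print(select_route(6_000)["route"])    # efficient_text_route
print(select_route(70_000)["route"])   # long_context_route
print(select_route(900_000)["route"])  # retrieval_first_route
```

The three sample sizes match the support conversation, contract review, and codebase scan mentioned earlier, and each lands on a different route.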

Acceptance tests before production

Run these tests before you let context window routing handle production traffic:

Send a short prompt and confirm it stays off the long-context route.
Send a normal chat prompt with tools and confirm tool definitions and results are counted.
Send a long document prompt and verify reserved output space remains available.
Send an over-budget prompt and confirm the route summarizes, retrieves, or asks for scope reduction instead of sending blindly.
Trigger a reasoning-heavy task and check incomplete-response handling.
Repeat a stable long prompt and confirm cached-token metrics are recorded when the provider exposes them.
Compare retrieval-first and full-context answers on the same evaluation set.
Review requested model, served model, endpoint family, usage units, fallback reason, and cost or balance impact in logs.
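
The last check can be partly automated by verifying that each request log record carries the evidence fields from the policy template. The sketch below is hypothetical: the record layout and the `missing_evidence` helper are assumptions, and the field names are the ones listed under `required_fields`.

```python
REQUIRED_FIELDS = (
    "workflow_class",
    "requested_model",
    "served_model",
    "endpoint_family",
    "input_tokens",
    "cached_tokens",
    "output_tokens",
    "reasoning_tokens",
    "route_decision",
    "fallback_reason",
    "owner_key",
    "cost_or_balance_impact",
)


def missing_evidence(record: dict) -> list:
    """Return required fields absent from a log record.

    Only presence is checked; a field such as fallback_reason may
    legitimately be None when no fallback happened.
    """
    return [name for name in REQUIRED_FIELDS if name not in record]


# Hypothetical log record with placeholder values and two fields left out.
record = {
    "workflow_class": "long_document_review",
    "requested_model": "example-model-a",
    "served_model": "example-model-a",
    "endpoint_family": "openai",
    "input_tokens": 70_000,
    "output_tokens": 3_200,
    "reasoning_tokens": 0,
    "route_decision": "long_context_route",
    "fallback_reason": None,
    "owner_key": "team-legal",
}
print(missing_evidence(record))  # ['cached_tokens', 'cost_or_balance_impact']
```

Running a check like this over a sample of proof-run logs gives finance and engineering a quick signal about which evidence the gateway or application is not yet recording.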

For broader architecture, pair these checks with Flatkey's guides to AI API gateways, LLM API gateway architecture, AI API load balancing and failover, and model routing policy design.

Where Flatkey fits

Flatkey should not be the only place where the policy exists. It should be where the policy becomes easier to run and review.

Use Flatkey to centralize model access, route review, current pricing checks, usage visibility, quotas, request logs, and billing review. Then keep the context window routing policy in code or configuration so route decisions are repeatable. The gateway gives finance and operations a clearer place to inspect usage; the policy tells engineering what route is allowed.

A practical Flatkey proof run looks like this:

Choose one workflow with known prompt-size ranges.
Check the current model and endpoint options on the Flatkey pricing page.
Run short, normal, long, over-budget, and repeated-cacheable prompts.
Review request logs for route decision, served model, usage, cache fields where available, fallback reason, and owner key.
Confirm quota and cost-review behavior with the workflow owner.
Move only the tested routes to production, then expand context window routing row by row.

When the proof passes, get a Flatkey key and keep the first rollout narrow. The point of context window routing is not to add complexity; it is to stop prompt growth from silently turning into runaway cost, incomplete answers, and unreviewable model choices.

FAQ

What is context window routing?

Context window routing is a policy that chooses the model route, retrieval path, compaction path, or rejection behavior based on prompt size, output reserve, reasoning reserve, tool overhead, cost controls, and required evidence.

How is context window routing different from model routing?

Model routing can choose by quality, price, latency, modality, region, or provider. Context window routing focuses on whether the request fits the usable context budget and whether a smaller, retrieval-first, cached, or long-context route is the right cost-control choice.

When should a team use retrieval instead of a long-context model?

Use retrieval when the answer depends on a small part of a large corpus, when permissions matter, or when repeated full-context input would be expensive. Use long context when the task needs broad comparison across many distant parts of the source.

Why reserve output and reasoning tokens?

A prompt can fit the input side of the context window and still fail because there is not enough room left for reasoning or the visible answer. Reserving output and reasoning tokens reduces incomplete responses and wasted spend.

Does prompt caching remove the need for context window routing?

No. Prompt caching can reduce latency or input cost for repeated prefixes, but cached tokens still need to be considered in the context window. Context window routing should log cached-token metrics while still enforcing budget limits.

How does Flatkey help with context window routing?

Flatkey gives teams one gateway surface for model access, route review, pricing checks, usage analytics, request logs, quotas, and billing review. That makes it easier to validate whether context window routing is controlling prompt size and cost as designed.