BYO-LLM adapter pattern: how to write an agent that runs on any model
The AI agent you build today will outlive the model you launch with. Models change rapidly: in 18 months, today’s dominant model may be obsolete, uneconomical or retired. Moreover, different clients, different jurisdictions and different cost requirements may force you to run the same agent on multiple models simultaneously. The solution is not a rewrite; it is an architecture from the start that treats the model as a pluggable component.
This article describes the Bring-Your-Own-LLM (BYO-LLM) pattern with minimal adapters that we use internally: ~150 lines of code per provider, a uniform signature for the application, model swap without re-engineering.
TL;DR
- Frameworks tightly integrating a single LLM produce architectural lock-in that is hard to break.
- The BYO-LLM pattern exposes a uniform signature (generate(messages, tools, config) -> response) and wraps each provider in a ~150-line adapter.
- Adapters handle: message format translation, tool format translation, response normalization, error normalization, retry/backoff handling.
- The application stays agnostic: you change the provider via a config variable, not via refactoring.
- Initial cost is 1-2 days per provider; the gain is the freedom to choose the model based on real need.
Why a single LLM is a trap
When you start an AI project, there is a temptation to use the official SDK of a single provider. The syntax is idiomatic, examples are rich, support is available. Short term, this is the fast choice. Medium term, it becomes problematic for four reasons:
Model changes within your own vendor. Anthropic, OpenAI and Google each ship new model generations that change APIs, parameters or behavior. Code tightly bound to a specific model breaks when that happens.
Vendor change. The ranking of the most capable models shifts quickly. The leading model at launch may not be the leading model 12 months later. Lock-in to an SDK makes migration expensive.
Geographic / regulatory requirements. A client with strict GDPR requirements may demand an EU-hosted model; a client with sensitive data may demand a self-hosted model; a public sector client may have specific vendor restrictions. An agent supporting only one vendor loses these contracts.
Cost-aware routing. As argued in the routing article, at scale you want to use different models for different tasks. That is impossible if your code is tangled with a specific SDK.
The BYO-LLM pattern
The pattern has three pieces:
1. Uniform signature. The application calls a function that looks identical for any provider:
    response = llm.generate(
        messages=[{"role": "user", "content": "..."}],
        tools=[...],
        config={"model": "alias_X", "temperature": 0.2, "max_tokens": 4000},
    )
The application does not know which provider serves the request. The router resolves the alias_X alias to a concrete model (see the cost-aware routing article).
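One way to pin this signature down is a small response type plus a structural interface. The sketch below is an assumption about how `LLMResponse` and `LLMAdapter` could be declared, not the article's actual definitions; the field names mirror the ones used elsewhere in this article, and `EchoAdapter` is a hypothetical stand-in that only shows that any object with a matching `generate` satisfies the interface.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@dataclass
class LLMResponse:
    """Provider-independent result returned by every adapter."""
    text: str = ""
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    stop_reason: str = "stop"  # e.g. "stop", "tool_call", "length"
    usage: dict[str, int] = field(default_factory=dict)


@runtime_checkable
class LLMAdapter(Protocol):
    """Anything exposing this generate() method can serve the application."""

    def generate(
        self,
        messages: list[dict[str, Any]],
        tools: list[dict[str, Any]] | None = None,
        config: dict[str, Any] | None = None,
    ) -> LLMResponse: ...


class EchoAdapter:
    """Hypothetical adapter for tests: echoes the last user message."""

    def generate(self, messages, tools=None, config=None):
        last_user = next(m for m in reversed(messages) if m["role"] == "user")
        return LLMResponse(text=last_user["content"])


llm: LLMAdapter = EchoAdapter()
response = llm.generate(messages=[{"role": "user", "content": "ping"}])
```

A fake adapter like this is also handy in unit tests: the application can be exercised end to end without any network call.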
2. Adapter per provider. Each provider has a small adapter that:
- Translates messages from internal format to provider format
- Translates tools (unified JSON Schema → provider-specific format)
- Calls the provider’s official SDK
- Normalizes the response (text, tool calls, stop reason, usage tokens)
- Normalizes errors (rate limit, auth, server error, content filter)
- Handles retry with exponential backoff
3. Central registry. A dict mapping provider name (string) to adapter class:
    PROVIDERS = {
        "claude": ClaudeAdapter,
        "openai": OpenAIAdapter,
        "gemini": GeminiAdapter,
        "claude_cli": ClaudeCLIAdapter,
        "openai_compat": OpenAICompatAdapter,  # for local servers
    }

    def get_adapter(provider: str) -> LLMAdapter:
        return PROVIDERS[provider]()
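To make the "change the provider via a config variable" idea concrete, the registry lookup can read its default from the environment. This is a minimal sketch under assumptions: the variable name `LLM_PROVIDER` is hypothetical, and the two stub classes stand in for real adapters so the example runs on its own.

```python
import os


class StubClaudeAdapter:
    """Placeholder standing in for a real adapter class."""
    name = "claude"


class StubOpenAICompatAdapter:
    """Placeholder standing in for a real adapter class."""
    name = "openai_compat"


PROVIDERS = {
    "claude": StubClaudeAdapter,
    "openai_compat": StubOpenAICompatAdapter,
}


def get_adapter(provider=None):
    """Instantiate the adapter named by the argument, else by LLM_PROVIDER."""
    # LLM_PROVIDER is a hypothetical variable name chosen for this sketch.
    name = provider or os.environ.get("LLM_PROVIDER", "claude")
    try:
        adapter_cls = PROVIDERS[name]
    except KeyError:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"Unknown provider {name!r}; expected one of: {known}") from None
    return adapter_cls()


os.environ["LLM_PROVIDER"] = "openai_compat"
adapter = get_adapter()  # the provider is picked from the environment
```

Raising a `ValueError` that lists the known names, rather than letting a bare `KeyError` escape, makes a typo in deployment config easier to diagnose.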
Adapter example (~150 lines)
Simplified sketch for a Claude adapter:
    import anthropic

    # LLMResponse and the LLM*Error classes are the shared internal types.

    class ClaudeAdapter:
        def __init__(self):
            self.client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

        def generate(self, messages, tools=None, config=None):
            config = config or {}
            # Message translation: internal format is similar to OpenAI;
            # Claude wants system separate
            system = ""
            chat_messages = []
            for m in messages:
                if m["role"] == "system":
                    system += m["content"] + "\n"
                else:
                    chat_messages.append(m)
            # Tool translation
            claude_tools = self._translate_tools(tools or [])
            # Optional fields are sent only when they are non-empty
            optional = {}
            if system:
                optional["system"] = system
            if claude_tools:
                optional["tools"] = claude_tools
            # SDK call
            try:
                resp = self.client.messages.create(
                    model=config["model"],  # required: KeyError if missing
                    messages=chat_messages,
                    temperature=config.get("temperature", 0.2),
                    max_tokens=config.get("max_tokens", 4000),
                    **optional,
                )
            except anthropic.RateLimitError as e:
                raise LLMRateLimitError(str(e)) from e
            except anthropic.APIStatusError as e:
                if e.status_code >= 500:
                    raise LLMUpstreamError(str(e)) from e
                raise LLMClientError(str(e)) from e
            # Response normalization
            text = ""
            tool_calls = []
            for block in resp.content:
                if block.type == "text":
                    text += block.text
                elif block.type == "tool_use":
                    tool_calls.append({
                        "id": block.id,
                        "name": block.name,
                        "arguments": block.input,
                    })
            return LLMResponse(
                text=text,
                tool_calls=tool_calls,
                stop_reason=self._normalize_stop(resp.stop_reason),
                usage={
                    "prompt_tokens": resp.usage.input_tokens,
                    "completion_tokens": resp.usage.output_tokens,
                },
            )

        def _translate_tools(self, tools):
            # Unified JSON Schema → Claude tools format
            return [
                {
                    "name": t["name"],
                    "description": t["description"],
                    "input_schema": t["parameters"],
                }
                for t in tools
            ]

        def _normalize_stop(self, claude_stop):
            # Claude uses 'end_turn', 'tool_use', 'stop_sequence', 'max_tokens'
            return {
                "end_turn": "stop",
                "tool_use": "tool_call",
                "max_tokens": "length",
                "stop_sequence": "stop",
            }.get(claude_stop, "stop")
This adapter is under 80 lines without retry. With retry plus logging, it reaches ~150 lines.
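The retry layer that accounts for part of the remaining lines could look roughly like the sketch below. It is an illustration rather than the adapter's actual code: the helper name `with_retry`, its parameters and the 10% jitter are assumptions, and the two exception classes are redefined locally so the example is self-contained.

```python
import random
import time


class LLMRateLimitError(Exception):
    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds suggested by the provider, if any


class LLMUpstreamError(Exception):
    """Provider-side failure (HTTP 5xx)."""


def with_retry(call, max_attempts=3, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call(); retry rate-limit and upstream errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except (LLMRateLimitError, LLMUpstreamError) as err:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the normalized error
            # 1s, 2s, 4s, ... capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Honor the provider's own hint when it asks for a longer wait.
            hinted = getattr(err, "retry_after", None)
            if hinted is not None:
                delay = max(delay, hinted)
            # A little jitter keeps concurrent workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Client errors (bad request, auth) are deliberately not caught here: repeating them cannot succeed. The `sleep` parameter exists only so tests can inject a fake and run instantly.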
The adapter for OpenAI-compatible servers (local runtimes such as vLLM, Ollama and LM Studio) is even shorter: the OpenAI API is the one most widely implemented by local runtimes, and it is nearly identical to the internal signature.
Differences that must be isolated in the adapter
Differences between providers are not cosmetic. The adapter must hide them:
Message format. OpenAI and most local LLMs use a message array with roles including system. Claude uses system separately. Gemini uses user / model roles. The adapter translates.
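As an illustration of the Gemini case, the translation can be a pure function that needs no SDK. The sketch assumes text-only messages in the internal format (tool results and images are left out), and the function name `to_gemini_contents` is hypothetical.

```python
def to_gemini_contents(messages):
    """Split internal messages into (system_instruction, contents) for Gemini."""
    system_parts = []
    contents = []
    for m in messages:
        if m["role"] == "system":
            # Gemini takes the system prompt separately from the conversation.
            system_parts.append(m["content"])
            continue
        # Gemini names the assistant role "model"; everything else is "user".
        role = "model" if m["role"] == "assistant" else "user"
        contents.append({"role": role, "parts": [{"text": m["content"]}]})
    return "\n".join(system_parts), contents


system_instruction, contents = to_gemini_contents([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
])
```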
Tool format. OpenAI: {"type": "function", "function": {...}}. Claude: {"name": ..., "input_schema": ...}. Gemini: function_declarations array. The adapter translates from a unified format (we chose OpenAI-style JSON Schema as the pivot).
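The three shapes can be produced from the pivot format with a few lines each. This is a sketch under the assumption that the pivot is a flat dict with `name`, `description` and `parameters` keys, as in the Claude adapter above; the function names are hypothetical, and real Gemini schemas may need extra pruning because that API accepts only a subset of JSON Schema.

```python
def to_openai_tools(tools):
    """Pivot format -> OpenAI Chat Completions tools."""
    return [
        {
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t["description"],
                "parameters": t["parameters"],
            },
        }
        for t in tools
    ]


def to_claude_tools(tools):
    """Pivot format -> Claude tools (schema lives under input_schema)."""
    return [
        {"name": t["name"], "description": t["description"], "input_schema": t["parameters"]}
        for t in tools
    ]


def to_gemini_tools(tools):
    """Pivot format -> Gemini tools (one entry holding all declarations)."""
    if not tools:
        return []
    declarations = [
        {"name": t["name"], "description": t["description"], "parameters": t["parameters"]}
        for t in tools
    ]
    return [{"function_declarations": declarations}]


weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
```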
Stop reason. Each provider uses different vocabulary. The adapter maps to a small unified enum.
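A unified enum with one lookup table per provider is enough for this mapping. The sketch below is illustrative: the `StopReason` name and the `CONTENT_FILTER` member are assumptions added here, the tables list only the common raw values, and unknown values fall back to `STOP` as in the Claude adapter above.

```python
from enum import Enum


class StopReason(str, Enum):
    STOP = "stop"
    TOOL_CALL = "tool_call"
    LENGTH = "length"
    CONTENT_FILTER = "content_filter"  # assumption: not part of the article's enum


# Raw provider vocabulary -> unified enum (common values only).
_STOP_MAPS = {
    "claude": {
        "end_turn": StopReason.STOP,
        "stop_sequence": StopReason.STOP,
        "tool_use": StopReason.TOOL_CALL,
        "max_tokens": StopReason.LENGTH,
    },
    "openai": {
        "stop": StopReason.STOP,
        "tool_calls": StopReason.TOOL_CALL,
        "length": StopReason.LENGTH,
        "content_filter": StopReason.CONTENT_FILTER,
    },
    "gemini": {
        "STOP": StopReason.STOP,
        "MAX_TOKENS": StopReason.LENGTH,
        "SAFETY": StopReason.CONTENT_FILTER,
    },
}


def normalize_stop(provider, raw):
    """Map a provider-specific stop value to the unified enum."""
    return _STOP_MAPS.get(provider, {}).get(raw, StopReason.STOP)
```

Because `StopReason` subclasses `str`, existing code that compares against plain strings such as `"tool_call"` keeps working.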
Usage tokens. Field names differ (input_tokens vs prompt_tokens). Numbers can differ by a few tokens due to tokenization.
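The renaming can be kept in one table. In this sketch the usage payload is assumed to arrive as a plain dict, `normalize_usage` is a hypothetical helper name, and the `total_tokens` field is an addition for convenience rather than something the article specifies.

```python
# (input field, output field) as each provider names them.
_USAGE_FIELDS = {
    "claude": ("input_tokens", "output_tokens"),
    "openai": ("prompt_tokens", "completion_tokens"),
    "gemini": ("prompt_token_count", "candidates_token_count"),
}


def normalize_usage(provider, usage):
    """Return usage with unified field names; missing counts become 0."""
    in_key, out_key = _USAGE_FIELDS[provider]
    prompt = int(usage.get(in_key) or 0)
    completion = int(usage.get(out_key) or 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }


claude_usage = normalize_usage("claude", {"input_tokens": 12, "output_tokens": 34})
```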
Tool calling streaming. Some providers offer streaming for tool calls (incremental parsing), others only complete responses. For simplicity, our adapters internally use non-streaming and expose streaming only at top level if the provider supports it.
Rate limit shape. Per-minute, per-day, tokens-per-minute limits. Rate limit errors are detected differently. The adapter normalizes to a single LLMRateLimitError type with optional retry_after.
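When the provider answers over HTTP, the optional `retry_after` can often be taken from the standard `Retry-After` header, which carries either a number of seconds or an HTTP date. The sketch below assumes the adapter has access to the response headers as a dict; the helper name `parse_retry_after` is hypothetical, and the error class is redefined locally to keep the example self-contained.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


class LLMRateLimitError(Exception):
    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds to wait, or None if unknown


def parse_retry_after(headers, now=None):
    """Return the Retry-After delay in seconds, or None if absent or unparsable."""
    value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if value is None:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))  # delta-seconds form
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


error = LLMRateLimitError("rate limited", retry_after=parse_retry_after({"Retry-After": "12"}))
```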
Special types: claude_cli and CLI subprocess
One adapter we use internally is for Claude CLI as subprocess. The pattern is described in another article; here we just mention that the adapter has the same public signature, but the implementation calls the claude CLI via subprocess instead of the HTTP SDK. The application does not know the difference.
The benefit: an agent can use a premium model via personal subscription (fixed monthly cost) and a cloud model via API (cost per token), treating them identically.
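The shape of such a subprocess adapter can be sketched without committing to any particular CLI. Everything here is an assumption for illustration: the class name `CLIAdapter`, the way messages are flattened into a single prompt, and the convention that the prompt is passed as the last argument. The actual `claude` command line is not shown in this article, so the demo uses the Python interpreter as a stand-in command.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class LLMResponse:
    text: str = ""
    tool_calls: list = field(default_factory=list)
    stop_reason: str = "stop"
    usage: dict = field(default_factory=dict)


class LLMUpstreamError(Exception):
    """The CLI failed or did not answer in time."""


class CLIAdapter:
    """Same public signature as the HTTP adapters, backed by a subprocess."""

    def __init__(self, command, timeout=120):
        self.command = list(command)  # caller supplies the executable and flags
        self.timeout = timeout

    def generate(self, messages, tools=None, config=None):
        # Flatten the conversation into one prompt string (a simplification).
        prompt = "\n\n".join(f"{m['role']}: {m['content']}" for m in messages)
        try:
            proc = subprocess.run(
                self.command + [prompt],
                capture_output=True,
                text=True,
                timeout=self.timeout,
            )
        except subprocess.TimeoutExpired as e:
            raise LLMUpstreamError(f"CLI timed out after {self.timeout}s") from e
        if proc.returncode != 0:
            raise LLMUpstreamError(proc.stderr.strip() or f"exit code {proc.returncode}")
        return LLMResponse(text=proc.stdout.strip())


# Stand-in command: upper-cases its last argument instead of calling a model.
demo = CLIAdapter([sys.executable, "-c", "import sys; print(sys.argv[1].upper())"])
result = demo.generate([{"role": "user", "content": "hi"}])
```

Passing the command as a list avoids shell interpolation of the prompt, which matters when the prompt contains user-supplied text.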
Common mistakes
Mistake 1: too rich a signature. The temptation to expose every feature of every provider produces a signature that cannot be implemented uniformly. Solution: minimal signature, specific features exposed via an optional extra field that only supporting providers use.
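One possible reading of the optional extra field: the core keys are always forwarded, and each adapter declares which extras it understands and ignores the rest. The names `CORE_KEYS`, `SUPPORTED_EXTRAS` and `build_request_options` are hypothetical, and the per-provider sets below are examples rather than complete lists.

```python
CORE_KEYS = {"model", "temperature", "max_tokens"}

# Illustrative only: which extras each adapter chooses to forward.
SUPPORTED_EXTRAS = {
    "claude": {"top_k"},
    "openai": {"seed"},
}


def build_request_options(provider, config):
    """Return (options to send, names of extras this provider ignores)."""
    options = {k: v for k, v in config.items() if k in CORE_KEYS}
    extra = config.get("extra", {})
    supported = SUPPORTED_EXTRAS.get(provider, set())
    options.update({k: v for k, v in extra.items() if k in supported})
    ignored = sorted(set(extra) - supported)
    return options, ignored


config = {"model": "alias_X", "temperature": 0.2, "extra": {"top_k": 40, "seed": 7}}
claude_options, claude_ignored = build_request_options("claude", config)
```

Returning the ignored names lets the adapter log them, so a silently dropped option does not turn into a debugging session.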
Mistake 2: retry at adapter level and at application level. Result: multiplied retries. With 3 attempts at each level, a single failing request can trigger up to 9 provider calls. Clear decision: retry on rate limit and server errors at adapter level (max 3); retry on business logic at application level.
Mistake 3: lack of timeouts. A slow non-responding provider can hang an agent. The adapter sets a default timeout (60-120 seconds) overridable via config.
Mistake 4: assuming tools work identically. Different models have different tool-use quality. A smaller model can struggle with complex schemas. The adapter does not solve this; it is the router’s responsibility not to send complex tasks to insufficient models. But the adapter must report errors clearly.
Initial cost vs gain
Implementing one adapter takes 1-2 days for a developer familiar with the provider. Three adapters cover 95% of practical cases (Anthropic, OpenAI, OpenAI-compatible for local). Total initial cost: 5-10 person-days.
Gain in 12 months:
- Migrating to a new model (better or cheaper) via one config line
- Support for clients with specific hosting requirements (self-hosted, EU-only)
- Cost-aware routing without restructuring
- Risk reduction: automatic fallback to another provider if one has an outage or changes its terms
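The fallback item can be built on top of the uniform signature with a thin wrapper. The sketch below is one possible shape, not the article's implementation: `FallbackAdapter` is a hypothetical name, the error classes are redefined locally, and it assumes the same config (including the model alias) is valid for every adapter in the chain.

```python
class LLMRateLimitError(Exception):
    pass


class LLMUpstreamError(Exception):
    pass


class LLMClientError(Exception):
    pass


class FallbackAdapter:
    """Tries adapters in order; moves on only for provider-side failures."""

    def __init__(self, adapters):
        self.adapters = list(adapters)
        if not self.adapters:
            raise ValueError("FallbackAdapter needs at least one adapter")

    def generate(self, messages, tools=None, config=None):
        last_error = None
        for adapter in self.adapters:
            try:
                return adapter.generate(messages, tools, config)
            except (LLMRateLimitError, LLMUpstreamError) as err:
                last_error = err  # provider trouble: try the next one
            # LLMClientError is not caught: a bad request fails everywhere.
        raise LLMUpstreamError("all providers failed") from last_error


class DownAdapter:
    """Test double simulating an outage."""

    def generate(self, messages, tools=None, config=None):
        raise LLMUpstreamError("503")


class UpAdapter:
    """Test double that always answers."""

    def generate(self, messages, tools=None, config=None):
        return "answer from backup"


llm = FallbackAdapter([DownAdapter(), UpAdapter()])
reply = llm.generate([{"role": "user", "content": "hi"}])
```

Because the wrapper exposes the same `generate` method, the application cannot tell a fallback chain from a single adapter.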
Conclusion
Lock-in to a single LLM is an architectural decision you will regret in 12-18 months. The BYO-LLM pattern with minimal adapters is a few days’ investment that pays off permanently. The application stays agnostic, the model becomes pluggable, the provider decision becomes a configuration decision rather than a code decision.
Bonus: writing an adapter for a new provider forced us to understand in depth how that provider works, which made us better consultants. Knowing multiple models is, by itself, a competitive advantage.
Related articles
- Cost-aware LLM routing
- Claude Code CLI as agent runtime
Next step
If your team is building an AI agent and would like to discuss the BYO-LLM pattern applied to your stack, we offer a 30-minute technical consultation at no cost.