CAI Technology
iris · · 12 min read

BYO-LLM adapter pattern: how to avoid lock-in on a single model

Bring-Your-Own-LLM with minimal ~150-line adapters per provider. Why single-LLM frameworks are rigid, and how to reduce every provider to a uniform signature.

CAI Technology · Last reviewed: 4/30/2026
BYO-LLM adapter pattern: how to write an agent that runs on any model

The AI agent you build today will outlive the model you launch with. Models change rapidly: in 18 months, today’s dominant model may be obsolete, uneconomical or retired. Moreover, different clients, different jurisdictions and different cost requirements may force you to run the same agent on multiple models simultaneously. The solution is not a rewrite; it is an architecture from the start that treats the model as a pluggable component.

This article describes the Bring-Your-Own-LLM (BYO-LLM) pattern with minimal adapters that we use internally: ~150 lines of code per provider, a uniform signature for the application, model swap without re-engineering.

TL;DR

Why a single LLM is a trap

When you start an AI project, there is a temptation to use the official SDK of a single provider. The syntax is idiomatic, examples are rich, support is available. Short term, this is the fast choice. Medium term, it becomes problematic for four reasons:

Model changes within your own vendor. Anthropic, OpenAI, and Google move through model generations that change APIs, parameters, or behavior. Code tightly bound to a specific model breaks.

Vendor change. The vendor with the most capable model changes quickly: the leading model at launch may not be the leading model 12 months later. Lock-in to an SDK makes migration expensive.

Geographic / regulatory requirements. A client with strict GDPR requirements may demand an EU-hosted model; a client with sensitive data may demand a self-hosted model; a public sector client may have specific vendor restrictions. An agent supporting only one vendor loses these contracts.

Cost-aware routing. As argued in the routing article, at scale you want to use different models for different tasks. That is impossible if your code is tangled with a specific SDK.

The BYO-LLM pattern

The pattern has three pieces:

1. Uniform signature. The application calls a function that looks identical for any provider:

response = llm.generate(
    messages=[{"role": "user", "content": "..."}],
    tools=[...],
    config={"model": "alias_X", "temperature": 0.2, "max_tokens": 4000}
)

The application does not know which provider runs. The alias alias_X is resolved by the router (see the cost-aware routing article) to a concrete model.
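The uniform signature can be pinned down with a few shared types. The following is a minimal sketch, not our production code: the field names of LLMResponse follow the adapter example in this article, while the Protocol definition and the EchoAdapter toy class are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol


@dataclass
class LLMResponse:
    """Provider-independent result returned by every adapter."""
    text: str = ""
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    stop_reason: str = "stop"  # "stop", "tool_call" or "length"
    usage: dict[str, int] = field(default_factory=dict)


class LLMAdapter(Protocol):
    """Structural type: any class with this generate() method qualifies."""

    def generate(
        self,
        messages: list[dict[str, Any]],
        tools: Optional[list[dict[str, Any]]] = None,
        config: Optional[dict[str, Any]] = None,
    ) -> LLMResponse: ...


class EchoAdapter:
    """Toy adapter (hypothetical) that echoes the last message; no network."""

    def generate(self, messages, tools=None, config=None):
        last = messages[-1]["content"]
        return LLMResponse(
            text=last,
            usage={"prompt_tokens": 0, "completion_tokens": 0},
        )


llm: LLMAdapter = EchoAdapter()
response = llm.generate(
    messages=[{"role": "user", "content": "ping"}],
    config={"model": "alias_X", "temperature": 0.2, "max_tokens": 4000},
)
print(response.text, response.stop_reason)
```

Using a Protocol rather than a base class means an adapter needs no import from the core package to conform; it only has to expose a matching generate() method.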

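2. Adapter per provider. Each provider has a small adapter that translates messages and tools from the unified format into the provider's own, calls the provider's SDK, maps provider errors to shared exception types, and normalizes the response (text, tool calls, stop reason, token usage).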

3. Central registry. A dict mapping provider name (string) to adapter class:

PROVIDERS = {
    "claude": ClaudeAdapter,
    "openai": OpenAIAdapter,
    "gemini": GeminiAdapter,
    "claude_cli": ClaudeCLIAdapter,
    "openai_compat": OpenAICompatAdapter,  # for local servers
}

def get_adapter(provider: str) -> LLMAdapter:
    return PROVIDERS[provider]()
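To show how the registry and alias resolution fit together, here is a small offline sketch. StubAdapter, the ALIASES table and the model names in it are hypothetical placeholders; a real router would choose the target by cost or task, as the routing article describes.

```python
class StubAdapter:
    """Stand-in for a real adapter so the example runs offline."""

    def __init__(self, name):
        self.name = name

    def generate(self, messages, tools=None, config=None):
        return {"provider": self.name, "model": config["model"]}


# Provider name -> zero-argument factory, mirroring the PROVIDERS dict.
PROVIDERS = {
    "claude": lambda: StubAdapter("claude"),
    "openai_compat": lambda: StubAdapter("openai_compat"),
}

# Hypothetical alias table: alias -> (provider, concrete model name).
ALIASES = {
    "alias_X": ("claude", "example-large-model"),
    "alias_local": ("openai_compat", "example-local-model"),
}


def get_adapter(provider):
    """Look up a provider and fail with a readable message if it is unknown."""
    try:
        factory = PROVIDERS[provider]
    except KeyError:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"unknown provider {provider!r}; known: {known}") from None
    return factory()


def generate(messages, tools=None, config=None):
    """Resolve the alias, then delegate to the matching adapter."""
    provider, model = ALIASES[config["model"]]
    resolved = {**config, "model": model}
    return get_adapter(provider).generate(messages, tools, resolved)


result = generate(
    [{"role": "user", "content": "hi"}],
    config={"model": "alias_local"},
)
print(result)
```

Switching alias_local to another provider is then a one-line change in the alias table, with no change in the calling code.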

Adapter example (~150 lines)

Simplified sketch for a Claude adapter:

import anthropic  # official Anthropic Python SDK


class ClaudeAdapter:
    def __init__(self):
        self.client = anthropic.Anthropic()  # reads API key from env

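    def generate(self, messages, tools=None, config=None):
        # config is required in practice: it carries at least the model name
        if not config or "model" not in config:
            raise ValueError("config must include a 'model' entry")

        # Message translation: the internal format is similar to OpenAI's;
        # Claude takes the system prompt as a separate parameter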
        system = ""
        chat_messages = []
        for m in messages:
            if m["role"] == "system":
                system += m["content"] + "\n"
            else:
                chat_messages.append(m)

        # Tool translation
        claude_tools = self._translate_tools(tools or [])

        # SDK call
        try:
            resp = self.client.messages.create(
                model=config["model"],
                system=system,
                messages=chat_messages,
                tools=claude_tools,
                temperature=config.get("temperature", 0.2),
                max_tokens=config.get("max_tokens", 4000),
            )
        except anthropic.RateLimitError as e:
            # RateLimitError is a subclass of APIStatusError, so it is caught first
            raise LLMRateLimitError(str(e)) from e
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                raise LLMUpstreamError(str(e)) from e
            raise LLMClientError(str(e)) from e

        # Response normalization
        text = ""
        tool_calls = []
        for block in resp.content:
            if block.type == "text":
                text += block.text
            elif block.type == "tool_use":
                tool_calls.append({
                    "id": block.id,
                    "name": block.name,
                    "arguments": block.input,
                })

        return LLMResponse(
            text=text,
            tool_calls=tool_calls,
            stop_reason=self._normalize_stop(resp.stop_reason),
            usage={
                "prompt_tokens": resp.usage.input_tokens,
                "completion_tokens": resp.usage.output_tokens,
            },
        )

    def _translate_tools(self, tools):
        # Unified JSON Schema → Claude tools format
        return [
            {
                "name": t["name"],
                "description": t["description"],
                "input_schema": t["parameters"],
            }
            for t in tools
        ]

    def _normalize_stop(self, claude_stop):
        # Claude uses 'end_turn', 'tool_use', 'stop_sequence', 'max_tokens'
        return {
            "end_turn": "stop",
            "tool_use": "tool_call",
            "max_tokens": "length",
            "stop_sequence": "stop",
        }.get(claude_stop, "stop")

This adapter is roughly 80 lines without retry. With retry plus logging, it reaches ~150 lines.

The adapter for OpenAI-compatible endpoints (used for local servers such as vLLM, Ollama, and LM Studio) is even shorter: the OpenAI chat API is the one most widely implemented by local runtimes, and it is nearly identical to the internal signature.
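Most of an OpenAI-compatible adapter is response normalization. The sketch below works on a plain dict in the Chat Completions JSON shape so it runs without any SDK or server; the function name and the sample payload are assumptions made for illustration, and individual local runtimes may omit some fields.

```python
import json


def normalize_chat_completion(payload):
    """Convert a Chat Completions-style JSON dict to the unified shape."""
    choice = payload["choices"][0]
    message = choice["message"]
    tool_calls = [
        {
            "id": call["id"],
            "name": call["function"]["name"],
            # This API returns arguments as a JSON string, not a dict.
            "arguments": json.loads(call["function"]["arguments"] or "{}"),
        }
        for call in message.get("tool_calls") or []
    ]
    stop_map = {"stop": "stop", "tool_calls": "tool_call", "length": "length"}
    usage = payload.get("usage") or {}
    return {
        "text": message.get("content") or "",
        "tool_calls": tool_calls,
        "stop_reason": stop_map.get(choice.get("finish_reason"), "stop"),
        "usage": {
            "prompt_tokens": usage.get("prompt_tokens", 0),
            "completion_tokens": usage.get("completion_tokens", 0),
        },
    }


# Made-up payload in the documented response shape.
sample = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {"name": "get_weather", "arguments": '{"city": "Cluj"}'},
            }],
        },
    }],
    "usage": {"prompt_tokens": 42, "completion_tokens": 7, "total_tokens": 49},
}

normalized = normalize_chat_completion(sample)
print(normalized["stop_reason"], normalized["tool_calls"][0]["arguments"])
```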

Differences that must be isolated in the adapter

Differences between providers are not cosmetic. The adapter must hide them:

Message format. OpenAI and most local LLMs use a message array with roles including system. Claude uses system separately. Gemini uses user / model roles. The adapter translates.
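As an illustration of the message translation, the sketch below converts the internal OpenAI-style list into the two other shapes. It handles plain text turns only (tool results are out of scope), and the function names are ours; the Gemini layout with "parts" is given as we understand the API and should be checked against the provider documentation.

```python
def split_system(messages):
    """Claude-style: system text is passed separately from the chat turns."""
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    chat = [m for m in messages if m["role"] != "system"]
    return system, chat


def to_gemini_contents(messages):
    """Gemini-style: roles are 'user' and 'model', text lives in 'parts'."""
    system, chat = split_system(messages)
    contents = [
        {
            "role": "model" if m["role"] == "assistant" else "user",
            "parts": [{"text": m["content"]}],
        }
        for m in chat
    ]
    return system, contents


messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Bye"},
]

system, contents = to_gemini_contents(messages)
print(system)
print([c["role"] for c in contents])
```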

Tool format. OpenAI: {"type": "function", "function": {...}}. Claude: {"name": ..., "input_schema": ...}. Gemini: function_declarations array. The adapter translates from a unified format (we chose OpenAI-style JSON Schema as the pivot).
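The three tool formats can be produced from one unified definition. This is a sketch under the assumption that the unified format is a flat dict with name, description and parameters (as used by _translate_tools in the adapter example); the Gemini wrapper follows the function_declarations shape mentioned above and should be verified against the SDK version in use.

```python
def to_openai_tool(tool):
    """OpenAI style: the definition is nested under a 'function' key."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"],
        },
    }


def to_claude_tool(tool):
    """Claude style: flat dict, with the schema under 'input_schema'."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["parameters"],
    }


def to_gemini_tools(tools):
    """Gemini style: one wrapper holding an array of declarations."""
    return [{
        "function_declarations": [
            {
                "name": t["name"],
                "description": t["description"],
                "parameters": t["parameters"],
            }
            for t in tools
        ]
    }]


# Hypothetical tool in the unified format.
weather_tool = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

print(to_openai_tool(weather_tool)["function"]["name"])
print(sorted(to_claude_tool(weather_tool)))
```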

Stop reason. Each provider uses different vocabulary. The adapter maps to a small unified enum.
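One possible shape for the unified enum is sketched here. The Claude values are the ones listed in the adapter example; the OpenAI values are the finish_reason strings of the Chat Completions API. The StopReason name and the per-provider table layout are assumptions, and unknown values fall back to STOP as in the adapter above.

```python
from enum import Enum


class StopReason(str, Enum):
    STOP = "stop"
    TOOL_CALL = "tool_call"
    LENGTH = "length"


STOP_TABLES = {
    "claude": {
        "end_turn": StopReason.STOP,
        "stop_sequence": StopReason.STOP,
        "tool_use": StopReason.TOOL_CALL,
        "max_tokens": StopReason.LENGTH,
    },
    "openai": {
        "stop": StopReason.STOP,
        "tool_calls": StopReason.TOOL_CALL,
        "length": StopReason.LENGTH,
    },
}


def normalize_stop(provider, raw):
    """Map a provider-specific stop value to the unified enum."""
    return STOP_TABLES[provider].get(raw, StopReason.STOP)


print(normalize_stop("claude", "tool_use").value)
print(normalize_stop("openai", "length").value)
```

Because StopReason mixes in str, existing comparisons against plain strings such as "tool_call" keep working.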

Usage tokens. Field names differ (input_tokens vs prompt_tokens). Token counts for the same text also differ between providers, because each uses its own tokenizer.

Tool calling streaming. Some providers offer streaming for tool calls (incremental parsing), others only complete responses. For simplicity, our adapters internally use non-streaming and expose streaming only at top level if the provider supports it.

Rate limit shape. Per-minute, per-day, tokens-per-minute limits. Rate limit errors are detected differently. The adapter normalizes to a single LLMRateLimitError type with optional retry_after.
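A sketch of the normalization, assuming the provider signals rate limiting with HTTP status 429 and an optional Retry-After header given in seconds. Both are common but not universal, and the helper names are ours.

```python
class LLMRateLimitError(Exception):
    """Unified rate-limit error; retry_after is in seconds or None."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def parse_retry_after(headers):
    """Read a Retry-After header given in seconds; return None otherwise."""
    value = None
    for key, val in headers.items():
        if key.lower() == "retry-after":  # header names are case-insensitive
            value = val
            break
    if value is None:
        return None
    try:
        seconds = float(value)
    except (TypeError, ValueError):
        return None  # the HTTP-date form is not handled in this sketch
    return max(seconds, 0.0)


def to_rate_limit_error(status_code, headers, body):
    """Return a unified error for a 429 response, or None for anything else."""
    if status_code != 429:
        return None
    return LLMRateLimitError(body, retry_after=parse_retry_after(headers))


err = to_rate_limit_error(429, {"Retry-After": "7"}, "slow down")
print(err, err.retry_after)
```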

Special types: claude_cli and CLI subprocess

One adapter we use internally wraps the Claude CLI as a subprocess. The pattern is described in another article; here we only note that the adapter has the same public signature, but its implementation calls the claude CLI via subprocess instead of the HTTP SDK. The application does not know the difference.

The benefit: an agent can use a premium model via personal subscription (fixed monthly cost) and a cloud model via API (cost per token), treating them identically.
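The general shape of a subprocess-backed adapter can be sketched as follows. This is not our claude_cli adapter: the command line is left as a parameter because the exact flags of any given CLI are not covered here, and sending the prompt on stdin is an assumption to verify for the tool in question. A Python one-liner stands in for the model CLI so the example runs anywhere.

```python
import subprocess
import sys


class CLIAdapter:
    """Same generate() signature, but the model is reached via a subprocess."""

    def __init__(self, command, timeout=120):
        self.command = list(command)  # full command line, supplied by the caller
        self.timeout = timeout        # seconds before the call is abandoned

    def generate(self, messages, tools=None, config=None):
        # Flatten the conversation into a single prompt sent on stdin.
        prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        proc = subprocess.run(
            self.command,
            input=prompt,
            capture_output=True,
            text=True,
            timeout=self.timeout,
        )
        if proc.returncode != 0:
            raise RuntimeError(
                f"CLI exited with {proc.returncode}: {proc.stderr.strip()}"
            )
        return {"text": proc.stdout.strip(), "tool_calls": [], "stop_reason": "stop"}


# Stand-in command: upper-cases whatever arrives on stdin.
fake_cli = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]

adapter = CLIAdapter(fake_cli)
result = adapter.generate([{"role": "user", "content": "hello"}])
print(result["text"])
```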

Common mistakes

Mistake 1: too rich a signature. The temptation to expose every feature of every provider produces a signature that cannot be implemented uniformly. Solution: minimal signature, specific features exposed via an optional extra field that only supporting providers use.
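One way to keep the signature minimal is to let each adapter pick out only the extras it understands. In this sketch the SUPPORTED_EXTRAS table and the split_extra helper are hypothetical, and the option names are examples only.

```python
# Hypothetical table: which optional settings each provider understands.
SUPPORTED_EXTRAS = {
    "claude": {"stop_sequences"},
    "openai_compat": {"seed"},
}


def split_extra(provider, config):
    """Return (core settings, extras this provider supports, ignored keys)."""
    extra = dict(config.get("extra") or {})
    core = {k: v for k, v in config.items() if k != "extra"}
    supported = SUPPORTED_EXTRAS.get(provider, set())
    used = {k: v for k, v in extra.items() if k in supported}
    ignored = sorted(set(extra) - supported)
    return core, used, ignored


config = {
    "model": "alias_X",
    "temperature": 0.2,
    "extra": {"stop_sequences": ["END"], "seed": 7},
}

core, used, ignored = split_extra("claude", config)
print(used, ignored)
```

Returning the ignored keys lets the adapter log them, so a silently dropped option does not turn into a debugging session later.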

Mistake 2: retry at adapter level and at application level. Result: multiplied retries. Clear decision: retry on rate limit and server errors at adapter level (max 3); retry on business logic at application level.
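Adapter-level retry can be as small as the sketch below. It reads "max 3" as three retries after the first attempt, which is our interpretation, and it uses exponential backoff with a little jitter; the exception classes are local stand-ins for the shared error types named in the adapter example.

```python
import random
import time


class LLMRateLimitError(Exception):
    pass


class LLMUpstreamError(Exception):
    pass


class LLMClientError(Exception):
    pass


# Only transient failures are retried; client errors surface immediately.
RETRYABLE = (LLMRateLimitError, LLMUpstreamError)


def call_with_retry(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); retry retryable errors up to max_retries times with backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RETRYABLE as err:
            if attempt == max_retries:
                raise
            # Prefer the provider's hint when the error carries one.
            hinted = getattr(err, "retry_after", None)
            delay = hinted if hinted is not None else base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))


attempts = {"n": 0}


def flaky():
    """Fails twice with a server error, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise LLMUpstreamError("temporary failure")
    return "ok"


delays = []
outcome = call_with_retry(flaky, sleep=delays.append)  # record instead of sleeping
print(outcome, attempts["n"])
```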

Mistake 3: lack of timeouts. A slow or unresponsive provider can hang an agent. The adapter sets a default timeout (60-120 seconds) that can be overridden via config.

Mistake 4: assuming tools work identically. Different models have different tool-use quality. A smaller model can struggle with complex schemas. The adapter does not solve this; it is the router’s responsibility not to send complex tasks to insufficient models. But the adapter must report errors clearly.

Initial cost vs gain

Implementing one adapter takes 1-2 days for a developer familiar with the provider. Three adapters cover 95% of practical cases (Anthropic, OpenAI, OpenAI-compatible for local). Total initial cost: 5-10 person-days.

Gain in 12 months:

Conclusion

Lock-in to a single LLM is an architectural decision you will regret in 12-18 months. The BYO-LLM pattern with minimal adapters is a few days’ investment that pays off permanently. The application stays agnostic, the model becomes pluggable, the provider decision becomes a configuration decision rather than a code decision.

Bonus: writing an adapter for a new provider forced us to understand in depth how that provider works, which made us better consultants. Knowing multiple models is, by itself, a competitive advantage.

Next step

If your team is building an AI agent and would like to discuss the BYO-LLM pattern applied to your stack, we offer a 30-minute technical consultation at no cost.
