How I Built My Own LLM Gateway

By Erik Anderson · Tech & Automation · ~9 min read

I run sixty-plus active projects through one Claude account. Most of them call Claude on a schedule — a content pipeline at 5 a.m., a trading scanner every fifteen minutes, an email reactor whenever a client writes in, a podcast generator at 7 a.m. They don't coordinate with each other. They just fire when their cron tells them to.

The result, predictably, is rate limits. Claude's 5-hour rolling window doesn't care that the contract scanner accidentally went into a retry loop at 3 a.m. and burned the budget the email reactor needed at 9. By the time a client email lands, the account is already saturated. The most important call of the day fails because the least important one already happened.

I tried the obvious things. Cron schedules spread out. Per-project rate limiters in code. A spreadsheet that mapped which scripts ran when. None of it worked, because the problem isn't scheduling — it's that no single piece of software had a global view of the budget. Every script was making local decisions about a global resource.

So I built one piece of software that does. It's called PrimeRouter. This post is what it does, why it's shaped the way it is, and what I'd tell someone thinking about building one.

The Shape of the Problem

Before the gateway, every service called Claude through a thin wrapper that did one thing: catch a 429, sleep, retry. That wrapper had four failure modes and they all bit me:

  1. No global budget view. Each caller detected rate limits on its own. There was no single place that could say "the account is at 82% of the 5-hour window — stop sending non-critical traffic."
  2. No priority. A speculative blog draft and an in-flight client email got the same shot at the budget. The cheap call won the race more often than the important one.
  3. No provider diversity. When Claude rate-limited, everything just queued behind the wait. I had Codex, Ollama, and a local agentic backend running on a separate machine, and none of it picked up the slack.
  4. No accounting. I had no idea which projects were burning the most tokens until something obviously broke. Post-incident debugging meant reading thirty service logs.

The fix had to live one layer deeper than any individual service. A gateway. Every /opt/ service calls it instead of claude -p directly. The gateway makes the routing, priority, and accounting decisions in one place.
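From a service's point of view, the change is small: instead of shelling out to claude -p, it makes one call to the gateway and declares what it is and how much it matters. The endpoint, field names, and priority numbering below are my guesses at the shape, not PrimeRouter's actual API:

```python
# Sketch of a service-side call once everything routes through the
# gateway. Endpoint URL and field names are hypothetical.
import json
import urllib.request

GATEWAY = "http://localhost:8080/v1/complete"  # hypothetical endpoint

def build_request(prompt: str, workflow: str, priority: int) -> urllib.request.Request:
    """Every call names its workflow and its priority so the gateway can
    route, shed, and account for it in one place."""
    body = json.dumps({
        "prompt": prompt,
        "workflow": workflow,   # e.g. "code_review", "knowledge_response"
        "priority": priority,   # smaller number = more important
    }).encode()
    return urllib.request.Request(
        GATEWAY, data=body, headers={"Content-Type": "application/json"}
    )
```

The point is that the caller only states facts about itself; every decision about budget and provider happens on the other side of that request.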

13 Priority Tiers

The single most important design choice was that not all calls are equal, and the gateway has to know which is which. PrimeRouter has thirteen priority tiers, declared in a YAML file. They look something like this:
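(The tier names, levels, and thresholds below are illustrative stand-ins, not the real config; the actual file is longer.)

```yaml
# Illustrative sketch only -- names and numbers differ in the real file.
tiers:
  - {level: 1,  name: client_email,    shed_above: 0.95}  # almost never shed
  - {level: 2,  name: trading_alerts,  shed_above: 0.90}
  - {level: 6,  name: code_review,     shed_above: 0.75}
  - {level: 10, name: blog_drafts,     shed_above: 0.60}
  - {level: 13, name: overnight_batch, shed_above: 0.40}  # first to go
```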

When the global 5-hour budget thins, the gateway closes off the lower tiers first. Critical calls keep getting through. Background calls get a 503-style "try later." Overnight calls get pushed to the actual overnight drain.
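The shedding decision itself is simple once the tiers exist. This is my reconstruction of the idea, not PrimeRouter's actual code; the linear spacing of thresholds is illustrative, since the real ceilings live in the YAML file:

```python
# Minimal sketch of tier shedding: as utilization of the 5-hour window
# rises, lower-priority tiers get refused first. Thresholds here are
# illustrative, evenly spaced from 1.0 (tier 1) down to 0.4 (tier 13).
def admit(tier: int, utilization: float) -> bool:
    """tier 1 is most critical, tier 13 least. False means the caller
    gets the 503-style 'try later'."""
    ceiling = 1.0 - (tier - 1) * (0.6 / 12)  # tier 1 -> 1.0, tier 13 -> 0.4
    return utilization < ceiling
```

At 82% of the window, a tier-1 client email still goes through while a tier-13 overnight batch is told to come back later.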

This sounds obvious in retrospect. It wasn't obvious at the time. Most rate-limit advice you read on the internet treats every call as equally precious. Mine aren't. The blog post that publishes at 6 a.m. can survive being a few hours late. The email a paying customer sent at 10 a.m. cannot.


Multi-Provider Failover

The gateway speaks to four different backends today: Claude (the default for almost everything), Codex over SSH (for code generation when the local sandbox is healthy), Hermes (a local agentic backend running qwen3-coder on an M3 Mac with 70 tokens/sec throughput), and Ollama (small models for mechanical work).

Each tier has a chain — the gateway tries one provider, and if that backend is unavailable or returns a known-bad signal, it fails over to the next. The "known-bad signal" detection took longer to get right than I expected. Codex sandboxes can return exit code zero with a sandbox-failure body in stdout, so a naïve check would record success on a call that actually never ran. The gateway parses for those bodies explicitly and treats them as failures.
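The chain logic reduces to a few lines. A sketch of it, with made-up signal strings and function shapes — the provider names are real, everything else here is mine:

```python
# Sketch of per-tier failover with known-bad detection: exit code 0 is
# not trusted on its own. The KNOWN_BAD patterns are illustrative.
KNOWN_BAD = ("sandbox unavailable", "sandbox denied")

def looks_bad(exit_code: int, stdout: str) -> bool:
    """A call failed if it exited nonzero OR its body matches a
    known sandbox-failure signature despite exit code zero."""
    if exit_code != 0:
        return True
    return any(sig in stdout.lower() for sig in KNOWN_BAD)

def call_with_failover(prompt, chain):
    """chain: list of (name, fn), where fn returns (exit_code, stdout).
    Tries each provider in order; returns the first real success."""
    for name, fn in chain:
        code, out = fn(prompt)
        if not looks_bad(code, out):
            return name, out
    raise RuntimeError("all providers in the chain failed")
```

So a Codex call that exits zero but prints a sandbox-failure body falls through to the next provider instead of being recorded as a success.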

The provider diversity also matters because retries on the same model produce the same wrong answer. If a code-review run is blocked because Claude misread the diff, asking Claude again with a slightly different prompt usually doesn't help. Asking Codex or a local model often does. The fixer-attempt chain encodes this — three retries, three different providers, three different reasoning traces.

Fleet Telemetry

Both my servers run Claude CLI sessions all day. So does my Mac. They all push OpenTelemetry traces to a central collector on the third box, which aggregates token usage across the fleet. The gateway reads from that aggregate.

The reason this matters: Claude's 5-hour budget is per-account, not per-host. A naïve rate limiter on each server would think it had its own budget. The gateway sees one bucket. When the account is at 82%, every host knows it.

The telemetry is also how I answer questions like "which project burned 40% of yesterday's budget" without grepping logs. The dashboard surfaces it directly, broken down by service and call class.

Per-Workflow Context Injection

Here's where the gateway pays for itself in tokens, not just in priority. Every call carries a workflow field that names what the caller is doing — website_fix, app_change, code_review, knowledge_response. The gateway consults a per-workflow profile that says: for this kind of call, here are the GAMEPLAN sections to inject as system context, here's the tool list to allow, and here's the byte budget for the preamble.

A code-review call gets the project's GAMEPLAN, the relevant fix-guides, and an empty tool list (review is read-only). A knowledge call gets a different slice and read-only tools. An app-change call gets the full surface plus write tools. None of them get the kitchen sink.

The byte budget is the part that surprised me. The naïve version of "inject context" is to load every relevant doc into the system prompt — and that's where you find out you've blown 30K tokens before the user's actual prompt is even read. The gateway's context profiles cap the preamble at a configurable byte count and prefer the most-relevant sections within that cap. The cap is observable as a Prometheus histogram, so I can see when it's getting tight.
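Put together, a profile plus the cap looks something like this. The profile fields, workflow names, and numbers are assumptions on my part; the packing strategy (most-relevant sections first, skip what doesn't fit) is the shape described above:

```python
# Sketch of preamble assembly under a per-workflow byte cap. Profiles
# and caps here are illustrative, not the real config.
PROFILES = {
    "code_review": {"max_preamble_bytes": 8_000,  "tools": []},        # read-only
    "app_change":  {"max_preamble_bytes": 24_000, "tools": ["write"]},
}

def build_preamble(workflow: str, ranked_sections: list[str]) -> str:
    """ranked_sections is most-relevant-first. Pack sections until the
    workflow's byte cap would be exceeded; skip ones that don't fit."""
    cap = PROFILES[workflow]["max_preamble_bytes"]
    out, used = [], 0
    for section in ranked_sections:
        size = len(section.encode("utf-8"))
        if used + size > cap:
            continue  # too big for the remaining budget; try smaller ones
        out.append(section)
        used += size
    return "\n\n".join(out)
```

The observable part matters as much as the cap: emitting the final byte count to a histogram is what tells you the cap is getting tight before callers start losing context they needed.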

What I'd Tell Someone Building One

The article that I think nailed the user-side discipline is Pawel Huryn's Stop Hitting Claude Code Limits — twenty-two concrete techniques for the human at the keyboard. Cache management, model locking, effort tuning, lean tool loading. If you read one piece on this topic, read that one.

The gateway is the layer below those techniques. Everything in Huryn's list is something you do yourself, in your own session, with your own discipline. The gateway is what you build when you have a fleet of services that can't sit at a keyboard and exercise discipline.

The lessons I'd flag if you're building one:

  1. Put the budget view in one place first. Priority, shedding, and failover only work once a single process can see the whole 5-hour window across every host.
  2. Make priority explicit. If every call looks equally precious, the cheap calls keep winning the race against the important ones.
  3. Don't trust exit codes. A provider can exit zero and still have done nothing; parse for known-bad bodies and treat them as failures.
  4. Cap injected context and make the cap observable. The naïve "load every relevant doc" approach quietly burns tens of thousands of tokens before the real prompt starts.
  5. Retry across providers, not within one. The same model re-asked the same question tends to produce the same wrong answer.

What's Next

The gateway is in production and handling thousands of calls per day across the fleet. The next round of work is closing three specific gaps from the Huryn list — disabling 1M-context fallback to save cache writes, passing through Claude's --effort flag so background work can opt down to medium reasoning, and wiring up skill-based model routing so mechanical tasks get a Haiku instead of an Opus. None of those are big diffs. All of them are real money.

The longer-term move is bringing in OpenRouter as a fifth backend so I can route ultra-low-stakes work to GLM-5.1 at roughly a twelfth of Opus cost. And eventually a Tree-sitter-based code-review graph that lets the reviewer load only the functions a diff touches, instead of loading whole files. That's claimed to reduce review tokens by a factor of seven or more. I'm skeptical of the specific number, but the direction is right.

If you're building something similar, I'd love to compare notes. Email me. The full design doc is publicly visible in the project's README — it's the kind of doc I wished I'd been able to read before I started.

The Technical Blueprint

The Autonomous Engineer — Book 2

The complete guide to running your own automation empire — including infrastructure architecture, AI tooling, monitoring, and the design patterns behind systems like the one in this post.
