throttlekit
TALE · the cost axis

Govern what your LLM spends.

An LLM completion's true cost is its output tokens — known only as it streams. TALE is a token-budget escrow that meters that cost as it's produced and stops at the boundary, with overshoot independent of max_tokens and of concurrency. The only limiter that governs spend, not just requests.

Why cost is different
A request's true cost is revealed after it runs.

Rate and concurrency are known at admission — you can price them before you let a request in. An LLM completion can't be: its cost is the number of tokens it generates, and you don't know that until it has streamed. The two obvious ways to handle that each fail at one end.

Option A · reserve max_tokens

Never overshoots. Wastes almost everything.

Hold the worst case up front and you're safe — but on a heavy-tailed length distribution you sterilize most of every reservation. Utilization collapses as the cap grows; you reject traffic you could have served.

Option B · admit, then count

Fully utilized. Blows the budget.

Let requests in and tally after, and overshoot grows with concurrency × max_tokens — every in-flight stream can run past the limit at once. LiteLLM was measured overshooting its budget by 6.6×.
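Both failure modes show up in a toy simulation. This is a sketch under simplifying assumptions, not throttlekit code: the function names are hypothetical, requests arrive in waves of `concurrency` streams, and admit-then-count only updates its tally when a wave finishes.

```python
def reserve_max(budget, max_tokens, lengths):
    """Option A: hold max_tokens per request up front; admit while holds fit."""
    held = produced = 0
    for n in lengths:
        if held + max_tokens > budget:
            break  # reject: the worst-case hold no longer fits
        held += max_tokens
        produced += min(n, max_tokens)
    return produced


def admit_then_count(budget, max_tokens, lengths, concurrency):
    """Option B: admit while the tally is under budget; tally after each wave."""
    produced = 0
    for i in range(0, len(lengths), concurrency):
        if produced >= budget:
            break  # the tally finally shows the budget is spent
        wave = lengths[i:i + concurrency]
        produced += sum(min(n, max_tokens) for n in wave)
    return produced


# Short completions under a large cap: only two holds fit in the budget.
wasted = reserve_max(budget=1000, max_tokens=500, lengths=[20] * 100)
# Long completions, eight at once: every stream runs to the cap.
blown = admit_then_count(budget=1000, max_tokens=500, lengths=[500] * 100, concurrency=8)
print(wasted, blown)  # 40 4000
```

In this model reserve-max serves 40 of 1000 budgeted tokens, while admit-then-count produces 4000 against the same budget, and doubling the concurrency doubles that figure.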

reserve → meter → reconcile
Meter the budget, don't reserve the cap.

TALE takes reserve-max's safety at admit-then-count's utilization. It doesn't pre-commit the cap; it debits actual tokens as they're produced and stops the instant the budget is reached — so overshoot depends only on how coarsely you meter, never on the cap.

1 · reserve

A small, learned hold.

Commit a reservation before the cost is known, only to decide when to return a 429 and to pace concurrency, never as the safety mechanism.

2 · meter

Debit real tokens.

Count what's actually produced, one atomic check-and-increment per debit, and refuse the moment the budget is spent.

3 · reconcile

Settle to the truth.

The reservation is released against the metered actual; utilization stays ~1 because nothing was held hostage to the cap.
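One way to picture the three steps together is the toy escrow below. It is a minimal sketch, not the library's implementation: the `Escrow` class, its method names, and the rule that a hold shrinks as its stream is metered are all assumptions made for illustration.

```python
import threading


class Escrow:
    """Toy reserve -> meter -> reconcile lifecycle for one budget window."""

    def __init__(self, limit):
        self.limit = limit
        self.served = 0      # tokens actually produced so far
        self.holds = {}      # stream id -> [hold, metered]
        self._lock = threading.Lock()

    def _outstanding(self):
        # The part of each hold not yet consumed by metered tokens.
        return sum(max(0, hold - metered) for hold, metered in self.holds.values())

    def reserve(self, stream_id, hold):
        """Admission: False means 'answer 429'. Paces concurrency only."""
        with self._lock:
            if self.served + self._outstanding() + hold > self.limit:
                return False
            self.holds[stream_id] = [hold, 0]
            return True

    def meter(self, stream_id, tokens):
        """Stop-at-boundary debit: the only step that enforces the limit."""
        with self._lock:
            if self.served >= self.limit:
                return False
            self.served += tokens
            entry = self.holds.get(stream_id)
            if entry is not None:
                entry[1] += tokens
            return True

    def reconcile(self, stream_id):
        """Release whatever part of the hold the stream never used."""
        with self._lock:
            self.holds.pop(stream_id, None)
```

Note that `meter` never consults the holds: in this sketch a wrong reservation can only cost utilization or cause an early 429, never an overshoot.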

The meter rule — src/admission/index.ts
if (served >= L) return deny;  // stop at the boundary
served += tokens;  // count the real, post-hoc cost
return allow(Math.max(0, L - served));  // report the remaining budget, floored at 0

A debit is admitted iff budget remained before it; the single debit that crosses L is counted in full, then every later debit is refused. So worst-case overshoot Δ ≤ (largest single debit) − 1 — and per-token metering (g = 1) overshoots by exactly zero. Because the check-and-increment is one atomic step, that holds no matter how many streams meter at once: independent of concurrency, independent of max_tokens.
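The concurrency claim can be checked with a few threads. The sketch below is an illustrative stand-in, not the code in src/admission/index.ts: `TokenMeter` and `run_streams` are hypothetical names, and a `threading.Lock` plays the role of the atomic check-and-increment.

```python
import threading


class TokenMeter:
    """Stop-at-boundary meter: check and increment under one lock."""

    def __init__(self, limit):
        self.limit = limit
        self.served = 0
        self._lock = threading.Lock()

    def debit(self, tokens):
        with self._lock:
            if self.served >= self.limit:
                return False      # budget spent: refuse this and every later debit
            self.served += tokens  # the crossing debit is counted in full
            return True


def run_streams(meter, streams, debits_per_stream, g):
    """Run `streams` threads, each debiting g tokens at a time; return overshoot."""

    def stream():
        for _ in range(debits_per_stream):
            if not meter.debit(g):
                return  # stop generating

    threads = [threading.Thread(target=stream) for _ in range(streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return meter.served - meter.limit


# 16 streams want 8000 tokens in total against a budget of 1000.
print(run_streams(TokenMeter(1000), streams=16, debits_per_stream=500, g=1))
print(run_streams(TokenMeter(1000), streams=16, debits_per_stream=100, g=7))
```

With g = 1 the counter stops exactly at the limit; with g = 7 it can pass the limit by at most 6, whatever the thread count.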

Three layers — safety, then efficiency
The meter holds the bound. The rest just spends it well.

tokenBudget

The streaming meter.

The genuinely new result: stop-at-boundary metering that caps production at L for any reservation — learned, maximal, zero, or adversarial. Safety lives here, and nowhere else.

Δ = 0 at g = 1 · utilization ≈ 1

learnedReservation

The right hold, online.

Over- vs under-reserving is the asymmetric newsvendor loss; its minimizer is the critical-fractile quantile. Projected online gradient descent (OGD) learns it with O(√T) regret: no tuning, no model.

pinball loss · R_T ≤ (3/2)·D·G·√T
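The learner can be sketched in a few lines. The sketch below assumes the textbook form of projected OGD on the pinball loss, with step size D / (G·√t); `learn_quantile`, its parameters, and the exponential length distribution in the demo are illustrative choices, not the library's API or data.

```python
import math
import random


def learn_quantile(lengths, tau, lo, hi):
    """Projected OGD on the pinball loss; returns the final hold.

    tau is the critical fractile, i.e. the share of the per-token cost that
    comes from under-reserving. [lo, hi] is the feasible range of holds.
    """
    D = hi - lo                      # diameter of the feasible set
    G = max(tau, 1.0 - tau)          # bound on the subgradient magnitude
    r = lo
    for t, y in enumerate(lengths, start=1):
        # Subgradient of the pinball loss at r for an observed length y.
        grad = -tau if y > r else (1.0 - tau) if y < r else 0.0
        eta = D / (G * math.sqrt(t))
        r = min(hi, max(lo, r - eta * grad))  # gradient step, then projection
    return r


random.seed(0)
lengths = [random.expovariate(1 / 100) for _ in range(50_000)]
hold = learn_quantile(lengths, tau=0.9, lo=0.0, hi=1000.0)
coverage = sum(y <= hold for y in lengths) / len(lengths)
print(f"hold={hold:.0f} tokens, coverage={coverage:.2f}")
```

The hold should settle near the length that covers about 90% of completions. It needs no length model, only the observed lengths.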
predictiveReservation

Use a hint, safely.

A Hedge meta-learner blends "follow the length prediction" with the robust learner. Good hints ⇒ near-clairvoyant cost; adversarial hints ⇒ it falls back to the no-regret quantile — with the hard bound still intact.

consistency + robustness
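A two-expert Hedge blend shows the fallback behaviour. This is a minimal sketch under stated assumptions: the function names, the learning rate `eta`, and the `scale` that normalizes losses are hypothetical, and the robust learner's hold is passed in as a plain number.

```python
import math


def pinball(r, y, tau):
    """Pinball (quantile) loss of holding r tokens when y were produced."""
    return tau * (y - r) if y >= r else (1.0 - tau) * (r - y)


def hint_weight(cum_hint, cum_robust, eta):
    """Hedge weight on the hint expert, given both cumulative losses."""
    m = min(cum_hint, cum_robust)  # shift for numerical stability
    w_hint = math.exp(-eta * (cum_hint - m))
    w_robust = math.exp(-eta * (cum_robust - m))
    return w_hint / (w_hint + w_robust)


def hedge_blend(rounds, tau, scale, eta):
    """rounds: (hint, robust_hold, actual_length) triples.

    Returns the blended hold used in each round and the final hint weight.
    """
    cum_hint = cum_robust = 0.0
    holds = []
    for hint, robust, actual in rounds:
        p = hint_weight(cum_hint, cum_robust, eta)
        holds.append(p * hint + (1.0 - p) * robust)
        # Losses are revealed only after the stream finishes.
        cum_hint += pinball(hint, actual, tau) / scale
        cum_robust += pinball(robust, actual, tau) / scale
    return holds, hint_weight(cum_hint, cum_robust, eta)


_, good = hedge_blend([(100, 200, 100)] * 50, tau=0.9, scale=1000.0, eta=5.0)
_, bad = hedge_blend([(1000, 100, 100)] * 50, tau=0.9, scale=1000.0, eta=5.0)
print(f"weight on hint: accurate={good:.2f}, misleading={bad:.2g}")
```

The blend only chooses the hold. The hard bound on L still comes from the meter, not from anything in this function.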
From your gateway
One debit per stream — in Python.

Most LLM traffic is Python, so the cost axis is reachable from it directly. debit meters the actual tokens a stream produces against a windowed budget; a denial is a normal decision, not an exception to catch — when the budget is spent, you stop generating.

# pip install throttlekit-py
from throttlekit import ServiceBackend

# `stream` is your provider's streaming response; this example assumes each
# chunk exposes its token count as `chunk.tokens`.
with ServiceBackend("localhost:50051") as rl:
    for chunk in stream:
        d = rl.debit("llm-budget", "tenant:42", tokens=chunk.tokens)
        if not d.allowed:
            break  # budget reached: stop generating

Verified against the throttlekit-py API (ServiceBackend.debit(policy, key, tokens)).

Across a fleet
One budget, many gateways — still bounded.

distributedTokenBudget runs the same stop-at-boundary rule as one atomic read-modify-write against a shared counter, and a single server-side key rolls the window so gateway clock skew can never split one budget into two. Only the one crossing debit per window can exceed L, independent of the gateway count.

It is, exactly, GALE window-coupled leasing at B = 1 with the token as the unit — and the test suite proves the produced series is byte-identical to GALE's. The cost axis and the placement axis are the same proof, twice.
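The fleet version can be pictured with an in-memory stand-in for the shared store. This is a sketch, not distributedTokenBudget itself: `SharedStore` is a hypothetical class, a lock models the store's atomic read-modify-write, and the window index is derived from the store's own clock so that no gateway clock is ever consulted.

```python
import threading


class SharedStore:
    """Stand-in for a shared counter store that executes each debit atomically."""

    def __init__(self, limit, window_seconds, clock):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock    # the store's clock, the only one that matters
        self.counters = {}    # (key, window index) -> tokens served
        self._lock = threading.Lock()

    def debit(self, key, tokens):
        with self._lock:      # models one atomic read-modify-write
            window = int(self.clock() // self.window_seconds)
            slot = (key, window)
            served = self.counters.get(slot, 0)
            if served >= self.limit:
                return False
            self.counters[slot] = served + tokens
            return True


now = [0.0]  # a fake store clock so the example is deterministic
store = SharedStore(limit=100, window_seconds=60, clock=lambda: now[0])


def gateway():
    # Each gateway just sends debits; it never computes a window itself.
    for _ in range(50):
        if not store.debit("tenant:42", 1):
            return


gateways = [threading.Thread(target=gateway) for _ in range(8)]
for g in gateways:
    g.start()
for g in gateways:
    g.join()
print(store.counters)
```

Eight gateways ask for 400 tokens and the window's counter stops at 100. A real store would also expire old windows, which this sketch omits.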

What proves it · the honest edges
Bounds you can re-derive — and the tunable you should know.
  • Δ = 0 for every max_tokens — streaming (g = 1) gives overshoot 0 and utilization 1 at every cap; reserve-max's utilization collapses and admit-then-count's overshoot grows with the cap. test/cost/token-budget.test.ts
  • Sublinear regret, machine-checked — the critical-fractile identity and the explicit (3/2)·D·G·√T envelope, plus overshoot 0 for every reservation policy. test/cost/learned-reservation.test.ts
  • Safe even following a bad predictor — overshoot 0 while blindly following an anti-correlated hint. test/cost/predicted-reservation.test.ts
  • Fleet bound, every gateway count — windowCoupled overshoot 0 for C ∈ {1…32}, byte-identical to GALE. test/cost/distributed-budget.test.ts

The tunable: per-token metering (g = 1) gives Δ = 0; chunking by g tokens to amortize meter calls trades a bounded Δ ≤ g − 1 for less overhead. The choice is yours.

The boundary: tokenBudget's check-and-increment is atomic only within one process; a fleet needs distributedTokenBudget, whose bound holds because the store makes the debit atomic.

Layers 2–3 (learnedReservation and predictiveReservation) are efficiency, not safety: a reservation alone guarantees nothing about L; it must be paired with the meter.
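The g trade-off can be tried with a small batching wrapper. This is a sketch with hypothetical names (`metered`, `Budget`). `debit` is any callable that returns True when a debit is allowed, for example `lambda n: rl.debit("llm-budget", "tenant:42", tokens=n).allowed` with the Python client.

```python
def metered(stream, debit, g):
    """Yield tokens from `stream`, debiting once per chunk of g tokens.

    A chunk is delivered only after its debit is allowed, so delivered tokens
    always equal counted tokens. The price is buffering up to g tokens.
    """
    chunk = []
    for token in stream:
        chunk.append(token)
        if len(chunk) == g:
            if not debit(len(chunk)):
                return  # budget spent: drop the chunk and stop generating
            yield from chunk
            chunk = []
    if chunk and debit(len(chunk)):
        yield from chunk  # settle the final partial chunk


class Budget:
    """Local stand-in for the meter that also counts how often it is called."""

    def __init__(self, limit):
        self.limit = limit
        self.served = 0
        self.calls = 0

    def debit(self, tokens):
        self.calls += 1
        if self.served >= self.limit:
            return False
        self.served += tokens
        return True


for g in (1, 4):
    budget = Budget(10)
    delivered = list(metered(iter(range(100)), budget.debit, g))
    print(g, len(delivered), budget.served - budget.limit, budget.calls)
```

Against a budget of 10 tokens, g = 1 delivers exactly 10 tokens in 11 meter calls, while g = 4 delivers 12 tokens (overshoot 2, within g − 1 = 3) in 4 calls.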