ThrottleKit — govern rate, concurrency & cost, provably

Why switch

What only ThrottleKit does.

Capability	express-rate-limit	rate-limiter-flexible	@upstash/ratelimit	ThrottleKit
LLM token-budget escrow — the cost axis (TALE)	–	–	–	✓
Fleet-size-independent overshoot bound, TLA⁺-checked (GALE)	–	–	–	✓
Unified rate × concurrency × cost in one decision	–	–	–	✓
One algorithm, proven bit-identical across backends	–	–	–	✓ (6 stores)
Two-tier leasing — amortized round trips, bounded overshoot	–	–	–	✓
Synchronous, allocation-free check	–	–	–	✓ 169 ns
Polyglot from one verified core (Python today)	–	–	–	✓
Zero runtime dependencies	–	–	–	✓

The incumbents are good at what they do; this is what ThrottleKit adds on top. Every row is a shipped, tested feature, and the benchmarks (including the rows an incumbent wins) are reproducible on your hardware. Full comparison →

Two engines

TALE

Govern the cost.

A token-budget escrow built for LLM gateways: meter output-token cost post hoc, as it streams, with overshoot bounded by the debit granularity (Δ=0 when debiting per token), independent of concurrency.

tokenBudget · distributedTokenBudget
learned / predictive reservation (online newsvendor)
reachable from Python — pip install throttlekit-py

How TALE works →
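
To make the granularity bound concrete, here is a minimal single-stream sketch. It is not ThrottleKit's implementation: `TokenBudgetSketch`, `debit`, and `streamUntilSpent` are hypothetical names chosen for illustration, and the sketch does not model concurrent streams. It assumes a post-hoc debit always records the tokens already produced and tells the caller to stop once the budget is reached, so the worst overshoot is less than one debit's worth.

```typescript
// Hypothetical sketch: names are illustrative, not the ThrottleKit API.
interface DebitResult {
  allowed: boolean; // false once the budget is spent: stop generating
  spent: number; // total tokens recorded so far
  overshoot: number; // tokens spent beyond the limit
}

class TokenBudgetSketch {
  private spent = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Post-hoc debit: the tokens already exist, so they are always recorded.
  debit(tokens: number): DebitResult {
    this.spent += tokens;
    return {
      allowed: this.spent < this.limit,
      spent: this.spent,
      overshoot: Math.max(0, this.spent - this.limit),
    };
  }
}

// Stream until the budget says stop, debiting `chunk` tokens at a time.
function streamUntilSpent(limit: number, chunk: number): DebitResult {
  const budget = new TokenBudgetSketch(limit);
  let result = budget.debit(chunk);
  while (result.allowed) {
    result = budget.debit(chunk);
  }
  return result;
}

const perToken = streamUntilSpent(100, 1); // debit every token
const perChunk = streamUntilSpent(100, 8); // debit every 8 tokens
console.log(perToken.overshoot, perChunk.overshoot);
```

With a limit of 100, debiting per token stops at exactly 100 tokens, while debiting in chunks of 8 stops at 104: the overshoot is always smaller than the chunk size.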

GALE

Prove the bound.

Provable distributed leasing: lease a batch of credits in one round trip, serve them locally at in-process speed — and window-coupling collapses worst-case global admissions to exactly the limit, independent of fleet size.

window-coupled leasing · adaptive lease sizing
weighted-fair escrow · distributed adaptive concurrency
machine-checked in TLA⁺, re-checked in CI

How GALE works →
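
The leasing idea can be sketched in a few lines. This is one plausible reading of window coupling, not the verified GALE algorithm: `CentralStore` and `LeasingNode` are hypothetical names, the store is an in-memory stand-in for Redis or Postgres, and the sketch assumes every node shares the store's clock. Credits are leased in batches, spent locally, and discarded when their window ends, so total admissions in a window cannot exceed the limit however many nodes there are.

```typescript
// Hypothetical sketch of window-coupled leasing; not the ThrottleKit API.
class CentralStore {
  roundTrips = 0; // how many times any node had to call the store
  private windowId = -1;
  private remaining = 0;
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // One round trip: grant up to `batch` credits from the current window's pool.
  lease(nowMs: number, batch: number): number {
    this.roundTrips++;
    const id = Math.floor(nowMs / this.windowMs);
    if (id !== this.windowId) {
      this.windowId = id;
      this.remaining = this.limit; // fresh window, fresh pool
    }
    const credits = Math.min(batch, this.remaining);
    this.remaining -= credits;
    return credits;
  }
}

class LeasingNode {
  private windowId = -1;
  private credits = 0;
  private exhausted = false; // store had nothing left in this window
  private readonly store: CentralStore;
  private readonly batch: number;
  private readonly windowMs: number;

  constructor(store: CentralStore, batch: number, windowMs: number) {
    this.store = store;
    this.batch = batch;
    this.windowMs = windowMs;
  }

  // Local fast path: spend a leased credit; call the store only when out.
  admit(nowMs: number): boolean {
    const id = Math.floor(nowMs / this.windowMs);
    if (id !== this.windowId) {
      // Leftover credits die with their window: this is the coupling.
      this.windowId = id;
      this.credits = 0;
      this.exhausted = false;
    }
    if (this.credits === 0 && !this.exhausted) {
      this.credits = this.store.lease(nowMs, this.batch);
      this.exhausted = this.credits === 0;
    }
    if (this.credits === 0) return false;
    this.credits--;
    return true;
  }
}

// 8 nodes, limit 1000 per 1 s window, batch 100, 4000 attempts in one window.
const store = new CentralStore(1000, 1000);
const nodes: LeasingNode[] = [];
for (let i = 0; i < 8; i++) nodes.push(new LeasingNode(store, 100, 1000));

let admitted = 0;
for (let round = 0; round < 500; round++) {
  for (const node of nodes) {
    if (node.admit(0)) admitted++;
  }
}
console.log(admitted, store.roundTrips);
```

In this run the fleet admits exactly 1000 of 4000 attempts using 18 store round trips. The trade-off of the sketch is that credits leased by a node that then goes idle are stranded until the window ends, so admissions can fall below the limit but never above it.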

Measured, not claimed — BENCH.md

169 ns

A full GCRA checkSync decision, in-process — 5.9M ops/s, ~0 B/op.

bench/run.ts

66.4k

ops/sec via two-tier leasing over Redis, batch 100 — ~85× the strict path.

bench/run.ts --redis

≤ Limit

Global admissions under window-coupled leasing — independent of fleet size, TLA⁺-checked.

spec/ · CI

In-process figures: single hot key, Node 24 / Ryzen AI 9 HX 370. Absolute Redis/Postgres latency reflects the local Docker network rather than the database, so the relative shape holds but the absolute p50 does not transfer. Run them on your hardware. Full methodology →

30-second quickstart

// Sync fast path — allocation-free, 169 ns.
import { rateLimit, gcra } from "throttlekit";

const limiter = rateLimit({
  strategy: gcra({ limit: 100, periodMs: 60_000 }),
});

// userId is whatever key you limit on, e.g. the authenticated user's id.
const d = limiter.checkSync(userId);
if (!d.allowed) throw new Error(`retry ${d.retryAfterMs}ms`);
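
For readers new to GCRA (the generic cell rate algorithm), the decision behind a check like the one above can be sketched as follows. This is a textbook formulation, not ThrottleKit's code: `GcraSketch` and its `check` method are hypothetical, time is passed in explicitly to keep the example deterministic, and the sketch allocates a result object per call, unlike the allocation-free path described on this page.

```typescript
// Textbook GCRA sketch; names are illustrative, not the ThrottleKit API.
interface Decision {
  allowed: boolean;
  retryAfterMs: number; // 0 when allowed
}

class GcraSketch {
  // Theoretical arrival time (TAT) per key: when the next request is "due".
  private readonly tat = new Map<string, number>();
  private readonly emissionMs: number; // spacing between conforming requests
  private readonly toleranceMs: number; // how far ahead of schedule a burst may run

  constructor(limit: number, periodMs: number) {
    this.emissionMs = periodMs / limit;
    this.toleranceMs = periodMs - this.emissionMs; // permits a burst of `limit`
  }

  check(key: string, nowMs: number): Decision {
    // An idle key's schedule never lags behind the present.
    const tat = Math.max(this.tat.get(key) ?? nowMs, nowMs);
    const allowAt = tat - this.toleranceMs;
    if (allowAt > nowMs) {
      return { allowed: false, retryAfterMs: allowAt - nowMs };
    }
    // Conforming request: push the schedule forward by one emission interval.
    this.tat.set(key, tat + this.emissionMs);
    return { allowed: true, retryAfterMs: 0 };
  }
}

// Same parameters as the quickstart: 100 requests per 60 s.
const sketch = new GcraSketch(100, 60_000);
let allowedCount = 0;
let last: Decision = { allowed: true, retryAfterMs: 0 };
for (let i = 0; i < 101; i++) {
  last = sketch.check("user:42", 0);
  if (last.allowed) allowedCount++;
}
console.log(allowedCount, last.retryAfterMs);
```

A burst of 101 requests at the same instant admits 100 and rejects the last with a 600 ms retry hint, which is one emission interval (60 000 ms / 100). The only state per key is a single timestamp, which is what makes GCRA cheap to evaluate.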

# Cost axis — bound LLM spend, not just calls.
from throttlekit import ServiceBackend

with ServiceBackend("localhost:50051") as rl:
    # Debit real output tokens as they stream; overshoot stays bounded.
    # output_tokens and stop_generating() come from your own streaming loop.
    d = rl.debit("llm-budget", "tenant:42", tokens=output_tokens)
    if not d.allowed:
        stop_generating()  # budget spent

Meter what your LLM spends.
Prove what your fleet admits.

Counting requests is the easy 10%.

Meter the axis everyone ignores: cost.

A distributed bound you can prove.
