| Capability | express-rate-limit | rate-limiter-flexible | @upstash/ratelimit | ThrottleKit |
|---|---|---|---|---|
| LLM token-budget escrow — the cost axis (TALE) | – | – | – | ✓ |
| Fleet-size-independent overshoot bound, TLA⁺-checked (GALE) | – | – | – | ✓ |
| Unified rate × concurrency × cost in one decision | – | – | – | ✓ |
| One algorithm, proven bit-identical across backends | – | – | – | ✓ (6 stores) |
| Two-tier leasing — amortized round trips, bounded overshoot | – | – | – | ✓ |
| Synchronous, allocation-free check | – | – | – | ✓ 169 ns |
| Polyglot from one verified core (Python today) | – | – | – | ✓ |
| Zero runtime dependencies | – | – | – | ✓ |

The incumbents are good at what they do; this is what ThrottleKit adds on top. Every row is a shipped, tested feature, and the benchmarks (including the rows an incumbent wins) are reproducible on your hardware. Full comparison →

Govern the cost.
A token-budget escrow built for LLM gateways: it meters output-token cost post hoc, as the response streams, with overshoot bounded by the debit granularity (Δ = 0 when debiting per token), independent of concurrency.
- tokenBudget · distributedTokenBudget
- learned / predictive reservation (online newsvendor)
- reachable from Python — pip install throttlekit-py
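
To make the "overshoot bounded by debit granularity" claim concrete, here is a minimal single-process sketch. It is not ThrottleKit's implementation: `TokenEscrow`, `debit`, and `meterStream` are hypothetical names, and the sketch does not model concurrency or a distributed store. Each chunk is debited after it is produced, and generation stops at the first refused debit or when the budget reaches zero.

```typescript
// Hypothetical in-process escrow. Names are illustrative, not ThrottleKit's API.
interface Debit {
  allowed: boolean;
  remaining: number;
}

class TokenEscrow {
  private spent = 0;
  private readonly budget: number;

  constructor(budget: number) {
    this.budget = budget;
  }

  // Spend `tokens` if the budget covers them; otherwise refuse and spend nothing.
  debit(tokens: number): Debit {
    if (tokens <= 0 || Math.floor(tokens) !== tokens) {
      throw new RangeError("tokens must be a positive integer");
    }
    if (this.spent + tokens > this.budget) {
      return { allowed: false, remaining: this.budget - this.spent };
    }
    this.spent += tokens;
    return { allowed: true, remaining: this.budget - this.spent };
  }
}

// Meter a stream chunk by chunk, debiting each chunk after it is produced.
// Stops at the first refused debit or as soon as the budget reaches zero.
function meterStream(
  escrow: TokenEscrow,
  chunkSizes: number[],
): { generated: number; billed: number } {
  let generated = 0;
  let billed = 0;
  for (const size of chunkSizes) {
    generated += size; // the model has already emitted this chunk
    const d = escrow.debit(size); // post-hoc debit of the real cost
    if (!d.allowed) break; // budget cannot cover the chunk: stop generating
    billed += size;
    if (d.remaining === 0) break; // budget exactly spent: stop before the next chunk
  }
  return { generated, billed };
}
```

In this sketch, with chunks of `g` tokens the stream runs past the budget by at most `g - 1` tokens, so debiting token by token (`g = 1`) leaves no overshoot. Larger chunks mean fewer debits at the price of a looser bound.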
Prove the bound.
Provable distributed leasing: lease a batch of credits in one round trip, serve them locally at in-process speed — and window-coupling collapses worst-case global admissions to exactly the limit, independent of fleet size.
- window-coupled leasing · adaptive lease sizing
- weighted-fair escrow · distributed adaptive concurrency
- machine-checked in TLA⁺, re-checked in CI
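
The leasing idea above can be illustrated with a small in-memory sketch. It is an assumption-laden toy, not ThrottleKit's algorithm: `LeaseStore` and `LeasedLimiter` are hypothetical names, the window is a plain fixed window, and the store and all nodes are assumed to share one monotonic clock. The store never grants more than `limit` credits per window, and each node discards leftover credits when the window rolls over, so admissions per window cannot exceed `limit` however many nodes there are.

```typescript
// Hypothetical central store (stands in for a shared backend): grants credits
// out of a fixed-window budget, never more than `limit` per window.
class LeaseStore {
  private windowId = -1;
  private granted = 0;
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // One "round trip": request up to `want` credits for the current window.
  lease(want: number, nowMs: number): number {
    const id = Math.floor(nowMs / this.windowMs);
    if (id !== this.windowId) {
      this.windowId = id; // new window: reset the grant counter
      this.granted = 0;
    }
    const credits = Math.min(want, this.limit - this.granted);
    this.granted += credits;
    return credits;
  }
}

// Hypothetical per-node limiter: serves admissions from its local lease.
class LeasedLimiter {
  private credits = 0;
  private windowId = -1;
  private readonly store: LeaseStore;
  private readonly batch: number;
  private readonly windowMs: number;

  constructor(store: LeaseStore, batch: number, windowMs: number) {
    this.store = store;
    this.batch = batch;
    this.windowMs = windowMs;
  }

  check(nowMs: number): boolean {
    const id = Math.floor(nowMs / this.windowMs);
    if (id !== this.windowId) {
      this.windowId = id;
      this.credits = 0; // window coupling: leftover credits expire with their window
    }
    if (this.credits === 0) {
      this.credits = this.store.lease(this.batch, nowMs); // refill in one round trip
    }
    if (this.credits === 0) return false; // window budget exhausted fleet-wide
    this.credits -= 1; // local, in-process decision
    return true;
  }
}
```

The trade-off this sketch exposes is that credits leased to an idle node are stranded until the window ends; a larger `batch` saves round trips but strands more.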
checkSync decision, in-process: 5.9M ops/s, ~0 B/op.
In-process, single hot key, Node 24 / Ryzen AI 9 HX 370. Redis/Postgres absolute latency reflects the local Docker network, not the database: the relative shape holds, the absolute p50 does not transfer. Run the benchmarks on your hardware. Full methodology →
// Sync fast path: allocation-free, 169 ns.
import { rateLimit, gcra } from "throttlekit";

const limiter = rateLimit({
  strategy: gcra({ limit: 100, periodMs: 60_000 }),
});
// userId is the key to limit on, supplied by the caller.
const d = limiter.checkSync(userId);
if (!d.allowed) throw new Error(`retry ${d.retryAfterMs}ms`);

# Cost axis: bound LLM spend, not just calls.
from throttlekit import ServiceBackend

# output_tokens and stop_generating() are supplied by the calling code.
with ServiceBackend("localhost:50051") as rl:
    # Debit real output tokens as they stream; overshoot stays bounded.
    d = rl.debit("llm-budget", "tenant:42", tokens=output_tokens)
    if not d.allowed:
        stop_generating()  # budget spent
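
For readers unfamiliar with the `gcra` strategy named in the sync fast-path snippet, here is a sketch of the textbook generic cell rate algorithm for a single key. It is offered as background under stated assumptions, not as ThrottleKit's internals: the `Gcra` class and its `check(nowMs)` method are hypothetical, and the burst tolerance is assumed to equal one full period, so up to `limit` requests may arrive at once.

```typescript
// Textbook GCRA for a single key. `Gcra` and `check` are illustrative names,
// not ThrottleKit's internals.
class Gcra {
  // Theoretical arrival time (ms) of the next request at the steady rate.
  private tat = 0;
  // Spacing between requests at the steady rate.
  private readonly emissionMs: number;
  // Burst tolerance: how far `tat` may run ahead of the clock.
  private readonly periodMs: number;

  constructor(limit: number, periodMs: number) {
    this.emissionMs = periodMs / limit;
    this.periodMs = periodMs;
  }

  check(nowMs: number): { allowed: boolean; retryAfterMs: number } {
    const newTat = Math.max(this.tat, nowMs) + this.emissionMs;
    const allowAt = newTat - this.periodMs; // earliest time this request conforms
    if (nowMs < allowAt) {
      return { allowed: false, retryAfterMs: allowAt - nowMs };
    }
    this.tat = newTat; // state changes only on admission
    return { allowed: true, retryAfterMs: 0 };
  }
}
```

The whole decision is a handful of arithmetic operations on one stored number per key, which is why GCRA suits a synchronous check. A caller would keep one `Gcra` instance per key and pass the current time, for example `new Gcra(100, 60_000).check(Date.now())`.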