throttlekit
Beyond rate limiting

Meter what your LLM spends.
Prove what your fleet admits.

Every other limiter just counts requests. ThrottleKit governs rate, concurrency, and cost — each behind a bound you can prove.

Counting requests is the easy 10%.

ThrottleKit ships the other 90% — two engines no other limiter has, each a checked guarantee instead of a hope.

TALE · cost

Meter the axis everyone ignores: cost.

An LLM token-budget escrow. Output tokens — known only as they stream — bounded anyway. The only limiter that governs spend, not just requests.
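To make the idea concrete, here is a minimal sketch of debiting a budget while output tokens stream. It is not the TALE algorithm or ThrottleKit's API: the `TokenBudget` class, its `debit` method, and the chunk sizes are hypothetical, chosen only to show why overshoot stays smaller than one streamed chunk when generation stops at the first refused debit.

```typescript
// Hypothetical illustration; not ThrottleKit's API or the TALE algorithm.
interface Debit {
  allowed: boolean;  // false once the budget is exhausted
  remaining: number; // tokens left; a negative value is the overshoot
}

class TokenBudget {
  private spent = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Record tokens that were already generated, then report whether
  // the caller may keep generating.
  debit(tokens: number): Debit {
    this.spent += tokens;
    const remaining = this.limit - this.spent;
    return { allowed: remaining > 0, remaining };
  }
}

// Simulated stream: each entry is the token count of one streamed chunk.
function streamWithBudget(budget: TokenBudget, chunks: number[]): number {
  let emitted = 0;
  for (const tokens of chunks) {
    emitted += tokens;
    const d = budget.debit(tokens);
    if (!d.allowed) break; // stop generating: budget spent
  }
  return emitted;
}

const budget = new TokenBudget(100);
const emitted = streamWithBudget(budget, [30, 30, 30, 30, 30]);
console.log(emitted); // 120: the fourth chunk crosses the limit, the fifth is never generated
```

Because a debit is only attempted while some budget remains, the total can exceed the limit by less than the size of the last chunk. Smaller chunks give a tighter bound at the price of more debit calls.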

GALE · the bound

A distributed bound you can prove.

Lease credits, serve locally — one round trip per ~100 requests, overshoot still bounded, independent of fleet size. Machine-checked in TLA⁺.
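The leasing idea can be sketched in a few lines. This is an illustrative simulation under simplifying assumptions, not GALE itself: `CentralStore`, `LocalNode`, and `simulate` are hypothetical names, and the sketch ignores window resets, lease expiry, and node failure entirely. It shows only why handing out credits in batches cuts round trips while the global total cannot exceed the limit.

```typescript
// Hypothetical sketch of credit leasing; not the GALE protocol.
class CentralStore {
  private remaining: number;
  roundTrips = 0;

  constructor(limit: number) {
    this.remaining = limit;
  }

  // Grant up to `want` credits, never more than what is left.
  lease(want: number): number {
    this.roundTrips += 1;
    const granted = Math.min(want, this.remaining);
    this.remaining -= granted;
    return granted;
  }
}

class LocalNode {
  private credits = 0;
  private readonly store: CentralStore;
  private readonly batch: number;

  constructor(store: CentralStore, batch: number) {
    this.store = store;
    this.batch = batch;
  }

  // Serve from local credits; contact the store only when they run out.
  admit(): boolean {
    if (this.credits === 0) this.credits = this.store.lease(this.batch);
    if (this.credits === 0) return false;
    this.credits -= 1;
    return true;
  }
}

// Spread `requests` round-robin over `nodeCount` nodes sharing one limit.
function simulate(limit: number, nodeCount: number, batch: number, requests: number) {
  const store = new CentralStore(limit);
  const nodes: LocalNode[] = [];
  for (let i = 0; i < nodeCount; i++) nodes.push(new LocalNode(store, batch));

  let admitted = 0;
  for (let i = 0; i < requests; i++) {
    const node = nodes[i % nodeCount];
    if (node !== undefined && node.admit()) admitted += 1;
  }
  return { admitted, roundTrips: store.roundTrips };
}

const result = simulate(1000, 5, 100, 1000);
console.log(result); // { admitted: 1000, roundTrips: 10 }
```

In this toy version the store deducts credits at lease time, so the fleet can never admit more than the limit no matter how many nodes join. The cost is that credits leased to an idle node are stranded until they are used.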

Admitted ≤ Limit.

Proven — at any fleet size. Machine-checked in TLA⁺.

Why switch
What only ThrottleKit does.
Capability | express-rate-limit | rate-limiter-flexible | @upstash/ratelimit | ThrottleKit
LLM token-budget escrow — the cost axis (TALE)
Fleet-size-independent overshoot bound, TLA⁺-checked (GALE)
Unified rate × concurrency × cost in one decision
One algorithm, proven bit-identical across backends | ✓ (6 stores)
Two-tier leasing — amortized round trips, bounded overshoot
Synchronous, allocation-free check | ✓ 169 ns
Polyglot from one verified core (Python today)
Zero runtime dependencies

The incumbents are good at what they do — this is what ThrottleKit adds on top. Every row is a shipped, tested feature; benchmarks (incl. the rows an incumbent wins) are reproducible on your hardware. Full comparison →

Measured, not claimed — BENCH.md
169 ns
A full GCRA checkSync decision, in-process — 5.9M ops/s, ~0 B/op.
bench/run.ts
66.4k
ops/sec via two-tier leasing over Redis, batch 100 — ~85× the strict path.
bench/run.ts --redis
≤ Limit
Global admissions under window-coupled leasing — independent of fleet size, TLA⁺-checked.
spec/ · CI

In-process, single hot key, Node 24 / Ryzen AI 9 HX 370. Absolute Redis/Postgres latency reflects the local Docker network rather than the database, so the relative shape holds but the absolute p50 does not transfer. Run the benchmarks on your hardware. Full methodology →
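For readers new to GCRA (the generic cell rate algorithm named in the benchmark above), here is a textbook single-key sketch. It is not ThrottleKit's implementation: the `Gcra` class and its `check` method are hypothetical, and the burst tolerance of one full period minus one interval is an assumption chosen so that the whole limit can be spent in a single burst.

```typescript
// Textbook GCRA for one key. Hypothetical names; not ThrottleKit's code.
interface Decision {
  allowed: boolean;
  retryAfterMs: number; // 0 when allowed
}

class Gcra {
  private tat = 0; // theoretical arrival time of the next request, in ms
  private readonly intervalMs: number;  // spacing between requests at the steady rate
  private readonly toleranceMs: number; // how far ahead of schedule a burst may run

  constructor(limit: number, periodMs: number) {
    this.intervalMs = periodMs / limit;
    this.toleranceMs = periodMs - this.intervalMs; // assumption: full-limit burst
  }

  // Pure arithmetic on one number of state: no queue, no per-request record.
  check(nowMs: number): Decision {
    const earliest = this.tat - this.toleranceMs;
    if (nowMs < earliest) {
      return { allowed: false, retryAfterMs: earliest - nowMs };
    }
    this.tat = Math.max(nowMs, this.tat) + this.intervalMs;
    return { allowed: true, retryAfterMs: 0 };
  }
}

// Same parameters as the quickstart: 100 requests per 60 seconds.
const g = new Gcra(100, 60_000);
let burst = 0;
while (g.check(0).allowed) burst += 1;
const next = g.check(0);
console.log(burst, next.retryAfterMs); // 100 600
```

The state per key is a single timestamp, which is why a decision of this kind can be synchronous and cheap. Rejected calls leave the state untouched, so retrying early does not push the next slot further out.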

30-second quickstart
// Sync fast path — allocation-free, 169 ns.
import { rateLimit, gcra } from "throttlekit";

const limiter = rateLimit({
  strategy: gcra({ limit: 100, periodMs: 60_000 }),
});

const d = limiter.checkSync(userId);
if (!d.allowed) throw new Error(`retry ${d.retryAfterMs}ms`);

# Cost axis: bound LLM spend, not just calls.
from throttlekit import ServiceBackend

with ServiceBackend("localhost:50051") as rl:
    # Debit real output tokens as they stream; overshoot stays bounded.
    # output_tokens and stop_generating() are placeholders for your own code.
    d = rl.debit("llm-budget", "tenant:42", tokens=output_tokens)
    if not d.allowed:
        stop_generating()  # budget spent