1 point by vidur2 4 hours ago | 1 comment
  • vidur2 4 hours ago
    I built a proof-of-concept that streams LLM tokens as Huffman-compressed binary over WebSocket instead of JSON text over SSE.

    The Problem: Current LLM APIs (OpenAI, Anthropic, self-hosted) send decoded text wrapped in JSON. For every token, you get something like: `data: {"choices":[{"delta":{"content":"hello"}}]}`. This is verbose, wastes bandwidth, and forces the server to decode tokens to text (CPU cost).
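    To make the overhead concrete, here is a rough sketch comparing the wire size of one SSE frame (using the frame shape quoted above) against a raw binary token ID; the 2-byte figure is an assumption based on typical vocabulary sizes:

```typescript
// Byte cost of one SSE frame carrying a single token, vs. a raw token ID.
const token = "hello";
const sseFrame = `data: ${JSON.stringify({ choices: [{ delta: { content: token } }] })}\n\n`;
const sseBytes = new TextEncoder().encode(sseFrame).length;

// Most LLM vocabularies fit in 16 bits, so a fixed-width ID is 2 bytes;
// entropy coding pushes the average lower for frequent tokens.
const rawIdBytes = 2;

console.log(`SSE frame: ${sseBytes} bytes, raw ID: ${rawIdBytes} bytes`);
```

    The JSON envelope alone is an order of magnitude larger than the payload it carries.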

    The Solution: Stream raw token IDs as binary. The server sends Huffman-compressed token IDs over WebSocket, and the client decodes them locally using WASM. This offloads token decoding from server to client.
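    A toy version of the client-side decode step, assuming a made-up prefix-free code table and vocabulary (in the real project the tokenizer is baked into the WASM build; the token IDs and strings below are illustrative only):

```typescript
// Hypothetical Huffman code table: frequent tokens get shorter codes.
type Code = { bits: string; tokenId: number };
const codes: Code[] = [
  { bits: "0", tokenId: 15339 },  // most frequent token → 1 bit
  { bits: "10", tokenId: 1917 },
  { bits: "11", tokenId: 0 },
];

// Hypothetical local vocabulary mapping token IDs to text.
const vocab: Record<number, string> = { 15339: "hello", 1917: " world", 0: "!" };

function decodeBits(bitstream: string): string {
  // Because the code is prefix-free, greedy matching is unambiguous.
  let out = "";
  let buf = "";
  for (const b of bitstream) {
    buf += b;
    const hit = codes.find((c) => c.bits === buf);
    if (hit) { out += vocab[hit.tokenId]; buf = ""; }
  }
  return out;
}

console.log(decodeBits("01011")); // → "hello world!"
```

    Five bits decode to three tokens here; the server never touches the vocabulary at all.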

    Results from mock benchmarks:

    - 30% faster for inline completions (the critical vibecoding use case)
    - 25% faster for small completions (100 tokens)
    - 12% faster overall average
    - ~60% bandwidth savings (3 bytes/token vs 8 bytes/token)
    - Client-side decoding means servers can handle more concurrent users
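    The bandwidth figure is consistent with the per-token numbers quoted above:

```typescript
// 3 bytes/token (Huffman-coded ID) vs 8 bytes/token (amortized JSON/SSE cost).
const savings = 1 - 3 / 8;
console.log(`${(savings * 100).toFixed(1)}% fewer bytes per token`);
```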

    Architecture:

    New:     LLM → Token IDs → Huffman encode → WebSocket (binary) → WASM decode → Text
    Current: LLM → Token IDs → Decode to text → JSON → SSE (HTTP) → Parse → Text
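    A minimal sketch of the binary framing on the server side of that pipeline, using fixed-width little-endian 16-bit IDs as a stand-in for the Huffman coding the project actually uses (function names and token IDs are made up for illustration):

```typescript
// Pack token IDs into the bytes of one WebSocket binary frame.
function packTokenIds(ids: number[]): Uint8Array {
  const out = new Uint8Array(ids.length * 2);
  const view = new DataView(out.buffer);
  ids.forEach((id, i) => view.setUint16(i * 2, id, true)); // little-endian
  return out;
}

// Client-side inverse: recover token IDs, then look up text locally.
function unpackTokenIds(bytes: Uint8Array): number[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const ids: number[] = [];
  for (let i = 0; i < bytes.byteLength; i += 2) ids.push(view.getUint16(i, true));
  return ids;
}

// Round trip: three tokens cost 6 bytes on the wire, before entropy coding.
const frame = packTokenIds([15339, 1917, 0]);
console.log(frame.byteLength, unpackTokenIds(frame));
```

    Swapping the fixed-width codec for a Huffman codec changes only the pack/unpack pair; the framing and transport stay the same.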

    Tech Stack: Rust (WASM for encoder/decoder), TypeScript (test harness), Node.js (mock servers). Includes comprehensive benchmarks comparing both protocols on identical workloads.

    Limitations:

    - Requires modifying the LLM server to expose token IDs (standard APIs don't do this)
    - Tokenizer is baked in at build time (`./build.sh <tokenizer_name>`), so models can't be switched dynamically
    - Mock server only; no real LLM integration yet
    - VS Code extension is non-functional (command registration issues)
    - Best for self-hosted deployments where you control the stack

    The VS Code extension code is included but doesn't work yet; the benchmarks and Node.js examples are what demonstrate the approach.

    Why it matters:

    - Protocol-level thinking for LLM APIs (not just server scaling)
    - Suggests binary protocols plus client-side decoding can beat traditional HTTP/JSON streaming
    - Opens a discussion about whether LLM APIs should expose token IDs

    Built this in ~3K LOC. Fully open source (MIT).

    Try it: https://github.com/vidur2/token_entropy_encoder

    Looking for feedback on the approach, potential issues, and whether this is worth pursuing further!