I built a two-stage prompt compressor that runs entirely locally before your prompt hits any frontier model API.
How it works:
1. llama3.2:1b (via Ollama) compresses the prompt to its semantic minimum
2. nomic-embed-text validates that the compressed version preserves the original meaning (cosine ≥ 0.85)
3. If validation fails → original is returned unchanged. No silent corruption.
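The three steps above can be sketched in Python. This is a minimal illustration, not the repo's code: the Ollama calls are left as injectable functions, and the 80-token threshold and whitespace token estimate are assumptions.

```python
import math
from typing import Callable, List

SIM_THRESHOLD = 0.85  # cosine floor for the validation stage
MIN_TOKENS = 80       # prompts below this are skipped (illustrative threshold)

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def compress_prompt(
    prompt: str,
    compress: Callable[[str], str],       # e.g. a llama3.2:1b call via Ollama
    embed: Callable[[str], List[float]],  # e.g. nomic-embed-text via Ollama
) -> str:
    """Two-stage compression: LLM rewrite, then an embedding-similarity gate."""
    if len(prompt.split()) < MIN_TOKENS:  # crude word-count token estimate
        return prompt                     # short prompt: skip entirely
    candidate = compress(prompt)
    # Validation: accept the rewrite only if it stays semantically close.
    if cosine(embed(prompt), embed(candidate)) >= SIM_THRESHOLD:
        return candidate
    return prompt  # validation failed: return original, no silent corruption
```

Keeping the model calls injectable means the gate logic can be exercised with stubs, independent of a running Ollama instance.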
When it actually helps:
The effect is meaningful only on longer inputs. Short prompts are skipped entirely — no cost, no risk.
┌─────────────────────────────────┬────────────┬────────┐
│ Input                           │ Tokens     │ Saving │
├─────────────────────────────────┼────────────┼────────┤
│ Short command (4t)              │ skipped    │ 0%     │
├─────────────────────────────────┼────────────┼────────┤
│ < 80 tokens                     │ skipped    │ 0%     │
├─────────────────────────────────┼────────────┼────────┤
│ Academic abstract (207t)        │ 207 → 78   │ 62%    │
├─────────────────────────────────┼────────────┼────────┤
│ Structured research doc (1116t) │ 1116 → 275 │ 75%    │
└─────────────────────────────────┴────────────┴────────┘
If you're sending short one-liners, this won't help. If you're injecting long context, research text, or system prompts — it pays off from the first call.
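For reference, the savings column follows directly from the token counts (a trivial helper, not code from the repo):

```python
def saving_pct(original_tokens: int, compressed_tokens: int) -> int:
    """Percent of tokens removed, rounded to the nearest whole percent."""
    return round((1 - compressed_tokens / original_tokens) * 100)

# figures from the table: 207 → 78 gives 62%, 1116 → 275 gives 75%
```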
Known limitation:
Cosine similarity is blind to negation: "way smaller" vs. "way larger" scores 0.985. The LLM stage mitigates this by explicitly preserving negations and conditionals, but it remains an open research question (tracked in issue #1).
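As a sketch of that mitigation, the compression stage's instruction can name negations and conditionals explicitly. This prompt text is illustrative only; the repo's actual prompt may differ.

```python
# Illustrative only: not the repo's real compression prompt.
COMPRESS_INSTRUCTION = (
    "Rewrite the following text using as few tokens as possible while "
    "preserving every fact. Never drop, weaken, or flip negations "
    "(not, no, never, without) or conditionals (if, unless, only when)."
)
```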
Install as MCP (Claude Code):
{
  "mcpServers": {
    "token-compressor": {
      "command": "python3",
      "args": ["/path/to/token-compressor/mcp_server.py"]
    }
  }
}
Requires: Ollama + llama3.2:1b + nomic-embed-text
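Assuming Ollama is already installed, fetching the two models is a one-time setup step:

```shell
# one-time: pull the compression model and the embedding model
ollama pull llama3.2:1b
ollama pull nomic-embed-text
```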
Repo: https://github.com/base76-research-lab/token-compressor-