I built this after noticing that most LLM apps re-query the API for semantically identical questions. The architecture uses three layers: exact match (O(1)), then normalized match (O(1)), then HNSW-indexed semantic search (O(log n)) as a fallback. Local embeddings keep everything private. In my benchmarks this reduced API calls by 50-80% on typical workloads. Happy to go deep on any of the tradeoffs: HNSW vs. flat search, why SQLite over pgvector, how the Matryoshka cascade works, etc.
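To make the three-layer idea concrete, here's a minimal sketch of the lookup cascade. Everything in it is illustrative, not the actual implementation: `CascadeCache`, `normalize`, and `embed` are hypothetical names, the embedding is a toy character-frequency vector standing in for a real local embedding model, and the semantic layer does a flat cosine scan where the real system uses an HNSW index.

```python
import math
import re

def normalize(q: str) -> str:
    # Collapse case/whitespace and drop trailing punctuation (layer-2 key).
    return re.sub(r"\s+", " ", q.strip().lower()).rstrip("?!. ")

def embed(text: str) -> list[float]:
    # Toy stand-in for a local embedding model: unit-normalized
    # letter-frequency vector. Real systems use a sentence embedder.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class CascadeCache:
    def __init__(self, threshold: float = 0.9):
        self.exact = {}        # layer 1: verbatim query -> answer
        self.normalized = {}   # layer 2: normalized query -> answer
        self.vectors = []      # layer 3: flat list; swap in HNSW for O(log n)
        self.threshold = threshold

    def put(self, query: str, answer: str) -> None:
        self.exact[query] = answer
        self.normalized[normalize(query)] = answer
        self.vectors.append((embed(query), answer))

    def get(self, query: str):
        if query in self.exact:               # layer 1: O(1) exact hit
            return self.exact[query]
        key = normalize(query)
        if key in self.normalized:            # layer 2: O(1) normalized hit
            return self.normalized[key]
        qv = embed(query)                     # layer 3: semantic fallback
        best, best_sim = None, self.threshold
        for vec, answer in self.vectors:
            sim = cosine(qv, vec)
            if sim >= best_sim:
                best, best_sim = answer, sim
        return best  # None = cache miss, caller hits the API
```

On a miss the caller queries the API and calls `put`, so all three layers warm up together; only queries that fall through every layer cost an API call.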