Three independent papers have proven that LLM hallucination is mathematically inevitable (Xu et al. 2024, Banerjee et al. 2024, Karpowicz 2025). You can't train it away. You can't prompt it away. So I built a verification layer instead.
How it works: Assay extracts every implicit claim the code makes (e.g., "this function handles null input," "this query is injection-safe"), then verifies each one: first an adversarial LLM pass, then a deterministic formal verifier that can override the LLM's verdict.
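To make the two-stage idea concrete, here is a minimal TypeScript sketch of how the verdicts could be combined. Every name and type in it (Claim, Verdict, llmJudge, formalVerify) is an illustrative placeholder, not Assay's actual API or internals:

    // Illustrative sketch only -- not Assay's real implementation.
    type Verdict = "PASS" | "FAIL" | "UNKNOWN";

    interface Claim {
      text: string;    // e.g. "this query is injection-safe"
      snippet: string; // the code the claim is about
    }

    // Stage 1: adversarial LLM pass (can itself hallucinate).
    async function llmJudge(claim: Claim): Promise<Verdict> {
      // call a model and parse its verdict; stubbed out here
      return "PASS";
    }

    // Stage 2: deterministic pattern checks, no LLM involved.
    function formalVerify(claim: Claim): Verdict {
      // run regex/pattern rules against claim.snippet; stubbed out here
      return "UNKNOWN";
    }

    // A conclusive deterministic verdict always overrides the LLM.
    async function verify(claim: Claim): Promise<Verdict> {
      const llm = await llmJudge(claim);
      const formal = formalVerify(claim);
      return formal === "UNKNOWN" ? llm : formal;
    }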
We ran it on 4 popular open-source projects. Live results:
- LiteLLM (18K stars): 1,381 claims, 185 bugs, 30 critical - https://tryassay.ai/reports/0bccf817-1cb6-43ff-b724-866f1453...
- Chatbot UI (28K stars): 476 claims, 41 bugs, 12 critical - https://tryassay.ai/reports/cc8c0c61-9b5a-4774-aed1-f99cc4f6...
- LobeChat (50K stars): 205 claims, 14 bugs, 1 critical - https://tryassay.ai/reports/915dfc1a-64ec-483d-b4b5-effb53a8...
- Open Interpreter (55K stars): 12 claims, 4 bugs, 2 critical - https://tryassay.ai/reports/347aa2bb-4249-468a-a835-12da3472...
"But can't the verifier hallucinate too?" Yes. That's why we added a formal verifier underneath — pure regex/pattern-matching, no LLM, can't hallucinate. On its first production call, the LLM judge said PASS on code with SQL injection. The formal verifier overrode it to FAIL.
Benchmarks (validated against real test suites, not LLM judgment):
- HumanEval: 86.6% baseline to 100% pass@5 with Assay (164/164 problems)
- SWE-bench: 18.3% baseline to 30.3% with Assay (+65.5% relative)
Try it:
npx tryassay assess /path/to/your/project
npm: https://www.npmjs.com/package/tryassay
Paper: https://doi.org/10.5281/zenodo.18522644

Drop a repo link in the comments and I'll run it for free.