I'm not sure this looks credible: 0% for one of the frontier models versus 100% for your home-grown "triad engine"?
*Sample 20q* = the hardest edge cases (47 Rome anachronisms Claude fails completely). Public on GitHub, so you can run it yourself; a sketch of the record shape follows below.
*Full 222q* = the broader test (Claude scores 45%, still poor). Gated to prevent benchmark contamination.
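For a sense of what a question looks like, here's a hypothetical record in the `sample_20q.jsonl` style. Every field name here is invented for illustration; the real schema is whatever's in the repo:

```python
import json

# Hypothetical record for illustration only; NOT the benchmark's actual schema.
record = json.loads(
    '{"id": "rome-012", '
    '"persona": "wine merchant in Ostia, 110 CE", '
    '"prompt": "What do you make of your household slaves?", '
    '"fail_on": ["modern moral framing", "post-antique concepts"]}'
)
print(record["persona"])  # -> wine merchant in Ostia, 110 CE
```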
Why 0% on the sample set? Claude 4.6 injects modern moralizing ("slavery is immoral") into characters set in 110 CE Rome. The Triad's λ/μ/ν agents plus the Sand Spreader catch exactly that kind of cultural hallucination.
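To make "catch cultural hallucination" concrete, here's a toy, keyword-level stand-in for the kind of check involved. This is a sketch: the phrase list is invented, and the real λ/μ/ν agents and Sand Spreader do far more than pattern matching:

```python
import re

# Invented phrase list; a crude stand-in for the real detection pass.
MODERN_MORALIZING = [
    r"slavery is (?:wrong|immoral)",
    r"human rights",
    r"as an ai",
]

def flags_moralizing(reply: str) -> bool:
    """True if an in-character reply leaks modern moral framing."""
    text = reply.lower()
    return any(re.search(pat, text) for pat in MODERN_MORALIZING)

print(flags_moralizing("Fine workers, though slavery is immoral."))  # True
```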
The eval code is reproducible: `python eval_framework.py samples/sample_20q.jsonl`
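If you want the shape of the harness before cloning, the core loop is roughly this (a sketch built on the record shape guessed above; the stub and grader here are placeholders, and the actual scoring lives in `eval_framework.py`):

```python
import json
import sys

def get_model_response(prompt: str) -> str:
    # Stub: swap in your actual model API call.
    return "I am but a humble merchant of the Subura."

def grade(response: str) -> bool:
    # Toy pass/fail check standing in for the framework's real grader.
    return "immoral" not in response.lower()

def run(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    passed = sum(grade(get_model_response(r["prompt"])) for r in records)
    print(f"pass rate: {passed}/{len(records)}")

if __name__ == "__main__":
    run(sys.argv[1])  # e.g. samples/sample_20q.jsonl
```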
Try it → you'll see Claude fall for basic anachronisms that our multi-agent system catches.