1 point by anulum 5 hours ago | 1 comment
  • soletta 5 hours ago
    Sounds interesting. What makes DeBERTa + RAG any better at detecting contradictions in the context than a frontier LLM, and why? I see that the NLI scorer itself was evaluated, but I’d love to see data on how the full system performs vs. SotA if you have any on hand.
    • anulum 5 hours ago
      @soletta Great question — this is exactly why we built it this way.

      *Short answer*: frontier LLMs are excellent at static self-critique, but terrible for *real-time token-by-token streaming guardrails* because of latency, cost, and lack of persistent custom memory.

      *Why DeBERTa + RAG wins here*:
      - *Latency*: DeBERTa-v3-base + Rust kernel scores every ~4 tokens in ~220 ms (AggreFact eval). A frontier LLM call (GPT-4o/Claude 3.5) is 400–2000 ms per check. You can’t do that mid-stream without killing UX.
      - *Cost*: Frontier self-checking at scale = real money. This runs fully local/offline after the one-time model download.
      - *Custom knowledge*: The 0.4× RAG weight pulls from your GroundTruthStore (ChromaDB). Frontier models don’t have a live, updatable external fact base unless you keep stuffing context (expensive + context-window limited).
      - *Determinism & auditability*: Small fine-tunable NLI model + fixed vector DB = reproducible decisions. LLMs-as-judges are stochastic and hard to debug in prod.
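
To make the streaming idea concrete, here is a minimal Python sketch of a check that runs every few tokens and halts the stream when a blended score drops too low. It is an illustration under stated assumptions, not the project's actual code: `guarded_stream`, `nli_score`, `rag_score`, the 0.6 NLI weight (taken as the complement of the 0.4 RAG weight), and the 0.5 halt threshold are all hypothetical, and both scorers are assumed to return a support score in [0, 1] where higher is better.

```python
from typing import Callable, Iterable, Iterator

RAG_WEIGHT = 0.4               # RAG weight quoted in the comment above
NLI_WEIGHT = 1.0 - RAG_WEIGHT  # assumption: the two weights sum to 1
CHECK_EVERY = 4                # "every ~4 tokens"
HALT_THRESHOLD = 0.5           # assumption: illustrative cut-off only

Scorer = Callable[[str], float]  # hypothetical: text -> support score in [0, 1]


def guarded_stream(
    tokens: Iterable[str],
    nli_score: Scorer,
    rag_score: Scorer,
    check_every: int = CHECK_EVERY,
    threshold: float = HALT_THRESHOLD,
) -> Iterator[str]:
    """Yield tokens until the blended score falls below the threshold."""
    seen = []
    for i, tok in enumerate(tokens, start=1):
        seen.append(tok)
        if i % check_every == 0:
            text = " ".join(seen)
            score = NLI_WEIGHT * nli_score(text) + RAG_WEIGHT * rag_score(text)
            if score < threshold:
                return  # halt before emitting the token that triggered the check
        yield tok


if __name__ == "__main__":
    # Toy scorer standing in for the NLI model and the vector-store lookup.
    flag_green = lambda text: 0.0 if "green" in text else 1.0
    stream = "the sky is green today and warm now".split()
    print(list(guarded_stream(stream, flag_green, flag_green)))
```

In this sketch the tokens emitted before a failed check are already on the user's screen, so a real system would still have to decide whether to buffer a window of tokens or retract them.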

      We’re completely transparent: the NLI scorer alone is *not SOTA* (66.2% balanced accuracy on the 29k-sample LLM-AggreFact benchmark; see the full table vs. MiniCheck/Bespoke/HHEM in the repo). The value is the live system: NLI + user KB + an actual streaming halt that no one else ships today.
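
For readers unfamiliar with the metric: balanced accuracy is the mean of per-class recall, so a scorer that always predicts the majority class gets 50% on a two-class task no matter how skewed the labels are. A small self-contained Python sketch follows; the labels are made-up toy data, not AggreFact results.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Toy data: 8 "supported" (1) and 2 "unsupported" (0) claims,
    # scored by a model that always answers "supported".
    y_true = [1] * 8 + [0] * 2
    y_pred = [1] * 10
    plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(plain)                              # plain accuracy: 0.8
    print(balanced_accuracy(y_true, y_pred))  # balanced accuracy: 0.5
```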

      Full end-to-end comparisons vs. LLM-as-judge in streaming setups are next on the roadmap (happy to run them on any dataset you care about).

      Have you tried frontier self-critique in real streaming agents? What broke for you?

      Repo benchmarks: https://github.com/anulum/director-ai#benchmarks