Same harness, same prompts, same playbooks, baseline vs VerifiedX.
Current result:
- baseline executed 18 unjustified high-impact action points
- with VerifiedX that dropped to 0
- false blocks in the current suite: 0
- surviving-goal completion improved from 41.7% to 100%

The repo includes methodology, raw artifacts, and repro steps.
This is a public proxy eval based on legal workflow classes Luminance publicly markets. It is not a claim about their internal system.
Legal is the first public instance. The same method applies to support, healthcare RCM, procurement, and finance too.
Happy to answer questions on methodology, false blocks, overhead, or how to design domain-specific action-boundary evals.