This non-Preview release scored 16/25 — probably the same model as the Preview, or at least not meaningfully improved if you care about agentic performance.
Good to see more options for large open models though!
It's hard to point to a definitive reason it underperforms, but models that do well at agentic tasks were generally either trained on very large numbers of tokens (Qwen, frontier models) or heavily post-trained for reasoning (see e.g. Nemotron-Cascade-2-30B-A3B at 21/25 vs. the base model Nemotron-3-Nano-30B-A3B-Base at 12/25).