  • xdotli 2 hours ago
    We collected 86 tasks from 105 domain experts across 11 domains. Every task is verifiable, human-created, and comes with verified Skills. SOTA models score ~30% without Skills.

    We found a few interesting things:

    1. Skills substitute for model scale: Haiku 4.5 with Skills (27.7%) beats Opus 4.5 without (22.0%). The right procedural knowledge can be worth more than a bigger model.

    2. Skills' improvement doesn't come from the LLM's internal knowledge. We ran an ablation where the agent gets no Skills but is prompted to generate relevant procedural knowledge itself before solving the task, which isolates the contribution of the model's latent domain knowledge. The result: curated Skills give +16.2pp average improvement across all 7 agent configs, while self-generated Skills give -1.3pp. Models can't write their own useful procedural knowledge before seeing any trajectory feedback.
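
    For concreteness, here is a minimal sketch of the two conditions in that ablation, assuming an Anthropic-style chat API; run_task, the model id, and the prompt wording are hypothetical illustrations, not our actual harness:

        import anthropic  # assumed SDK; any chat API works the same way

        client = anthropic.Anthropic()
        MODEL = "claude-haiku-4-5"  # hypothetical model id

        def run_task(task: str, skill_doc: str | None = None) -> str:
            if skill_doc is None:
                # Ablation condition: no curated Skill. Ask the model to
                # write its own procedural knowledge first, then solve
                # the task using that self-generated draft.
                skill_doc = client.messages.create(
                    model=MODEL,
                    max_tokens=1024,
                    messages=[{"role": "user", "content":
                        f"Write step-by-step procedural guidance for this task:\n{task}"}],
                ).content[0].text
            # Both conditions end the same way: solve the task with whatever
            # procedural knowledge (curated or self-generated) is in context.
            return client.messages.create(
                model=MODEL,
                max_tokens=2048,
                system=f"Follow this procedure:\n{skill_doc}",
                messages=[{"role": "user", "content": task}],
            ).content[0].text

    The curated condition simply passes the expert-written Skill as skill_doc; the -1.3pp result means the self-generated branch is, on average, slightly worse than providing no procedural knowledge at all.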