We found a few interesting things:

1. Skills substitute for model scale. Haiku 4.5 with Skills (27.7%) beats Opus 4.5 without (22.0%). The right procedural knowledge can be worth more than a bigger model.

2. The gains from Skills do not come from simply eliciting the LLM's internal knowledge. In an ablation, we provide no Skills but prompt the agent to generate relevant procedural knowledge itself before solving the task, which isolates the contribution of the model's latent domain knowledge (sketched below).
- Curated Skills: +16.2pp average improvement across all 7 agent configs.
- Self-generated Skills: -1.3pp. Models can't write useful procedural knowledge for themselves before receiving any trajectory feedback.
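
To make the ablation concrete, here is a minimal sketch of the two conditions, assuming a generic chat-completion callable `llm(messages) -> str`. The function names and prompt wording are illustrative assumptions, not the exact experimental harness.

```python
# Hypothetical sketch of the Skills ablation. `llm` is any chat-completion
# callable taking a list of {"role", "content"} messages and returning text.

def solve_with_curated_skill(llm, task: str, skill: str) -> str:
    """Condition 1: inject a curated Skill (procedural knowledge) into the prompt."""
    return llm([
        {"role": "system", "content": f"Follow this procedure:\n{skill}"},
        {"role": "user", "content": task},
    ])

def solve_with_self_generated_skill(llm, task: str) -> str:
    """Ablation: no Skill is provided; the model first writes its own
    procedural knowledge from latent domain knowledge, then solves with it."""
    # Step 1: elicit procedural knowledge before any trajectory feedback.
    self_skill = llm([
        {"role": "user", "content": (
            "Before solving, write the step-by-step procedural knowledge "
            f"you would need for this task:\n{task}"
        )},
    ])
    # Step 2: solve the task using only the self-generated procedure.
    return llm([
        {"role": "system", "content": f"Follow this procedure:\n{self_skill}"},
        {"role": "user", "content": task},
    ])
```

Comparing scores across these two conditions (against a no-Skill baseline) separates what curated procedural knowledge adds from what the model could have written down for itself.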