Introducing *How2Everything*—an open framework for benchmarking & improving how LLMs generate step-by-step procedures.
LLMs constantly produce step-by-step instructions, from how to file taxes to plans for AI agents, yet improving this capability is challenging. Outputs can sound fluent while describing steps that don't actually work, surface-level metrics miss critical mistakes like omitted prerequisites or contradictory instructions, and manual verification doesn't scale.
How2Everything closes this gap with a practical loop: mine real procedures from the web → benchmark LLM outputs → detect critical failures (missing steps, wrong ordering, omitted prerequisites) → use that signal to train better models.
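To make that loop concrete, here is a minimal sketch of what a standardized procedure record and a benchmark pass rate over it might look like. The field names and the `generate`/`judge` callables are illustrative assumptions for this post, not the actual How2Mine schema or How2Bench harness.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical record for one mined, standardized procedure;
# field names are assumptions, not the real How2Mine schema.
@dataclass
class Procedure:
    goal: str          # e.g. "Patch a bicycle inner tube"
    topic: str         # one of the 14 topic categories
    steps: list[str]   # ordered reference steps extracted from the page

def pass_rate(
    benchmark: list[Procedure],
    generate: Callable[[str], list[str]],           # model under test: goal -> steps
    judge: Callable[[Procedure, list[str]], bool],  # True if no critical failure found
) -> float:
    """Fraction of benchmark goals whose generated procedure the judge accepts."""
    passed = sum(judge(ref, generate(ref.goal)) for ref in benchmark)
    return passed / len(benchmark)
```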
It has three main components:
*How2Mine*—a pipeline that extracts & standardizes procedures from web pages covering 14 topics
*How2Bench*—a 7,000-procedure benchmark built from How2Mine
*How2Score*—an evaluation protocol powered by How2Judge, an open 8B judge model trained to flag critical failures
How2Judge agrees with human judgments ~80% of the time and is cheap enough for large-scale eval, making it practical as both a benchmark scorer and an RL reward signal.
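As a rough sketch of how a judge like this can serve as both a scorer and an RL reward, the snippet below loads a judge checkpoint with Hugging Face `transformers` and maps its verdict to a scalar. The model id, prompt wording, and PASS/FAIL parsing are assumptions for illustration; see the repo and HF collection for the actual How2Judge interface.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id for illustration; the real How2Judge checkpoint may differ.
JUDGE_ID = "how2everything/How2Judge-8B"

tokenizer = AutoTokenizer.from_pretrained(JUDGE_ID)
judge = AutoModelForCausalLM.from_pretrained(JUDGE_ID, device_map="auto")

def judge_reward(goal: str, candidate_steps: list[str]) -> float:
    """Return 1.0 if the judge flags no critical failures, else 0.0.
    The prompt wording and verdict parsing here are illustrative assumptions."""
    numbered = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(candidate_steps))
    prompt = (
        f"Goal: {goal}\n"
        f"Candidate procedure:\n{numbered}\n"
        "Judge the candidate procedure for critical failures "
        "(missing steps, wrong order, omitted prerequisites). "
        "Answer PASS if there are none, otherwise FAIL."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(judge.device)
    output_ids = judge.generate(**inputs, max_new_tokens=8)
    verdict = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return 0.0 if "FAIL" in verdict.upper() else 1.0
```

A cheap binary signal like this is what makes large-scale scoring and sparse RL rewards feasible.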
RL training with How2Score as the reward yields gains of more than 10 points for Qwen3 4B, Qwen3 8B, and Olmo 3 7B Think, with no regressions across 12 standard benchmarks covering knowledge, reasoning, chat, math, and code. How2Bench also scales cleanly, staying informative from early 1B pretraining checkpoints through frontier LLMs. And we stress-tested two shortcut explanations (format compliance and memorization); neither accounts for the improvements, pointing to real gains in procedure generation.
The full How2Everything framework, including How2Judge, is available now.
Blog: https://allenai.org/blog/how2everything
Paper: https://arxiv.org/pdf/2602.08808
Code: https://github.com/lilakk/how2everything
HF: https://huggingface.co/collections/how2everything/how2everyt...