- Instruction files: Anthropic recommends keeping CLAUDE.md under 200 lines. Research on the “lost in the middle” problem shows accuracy drops of 30% or more for information buried in the middle of the context window.
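A minimal sketch of what a concise instruction file might look like. The commands, paths, and conventions below are hypothetical placeholders, not recommendations from Anthropic:

```markdown
# CLAUDE.md

## Commands
- Build: `make build`
- Test: `make test` (run after every change)

## Conventions
- TypeScript strict mode; avoid `any`
- New modules go in `src/features/<name>/`

## Gotchas
- Run `scripts/codegen.sh` before the first build
```

Keeping the file this short means critical rules sit near the edges of the context rather than in the middle, where recall is weakest.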
- Project structure: Independent benchmarks consistently show that 60–80% of an agent’s tokens are spent figuring out where things are in the codebase.
- Session length: There’s a strong intuition that longer sessions are better — the agent “already knows” our codebase, so we don’t need to re-explain anything. In practice, the opposite is true.
- Self-verification: Anthropic calls giving the agent runnable tests “the single highest-leverage thing” you can do for agent performance.
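A minimal sketch of what “runnable tests” means in practice: a self-contained check the agent can execute after each edit and read the result of. The `slugify` function here is a hypothetical example, inlined so the script runs on its own:

```python
# Hypothetical sketch of a self-verification loop target: a tiny,
# fast check the agent can run after every change it makes.

def slugify(title: str) -> str:
    """Lowercase the title and collapse whitespace runs into single hyphens."""
    return "-".join(title.lower().split())

def test_slugify() -> None:
    # Plain asserts keep the check runnable with bare `python`, no test runner needed.
    assert slugify("  Hello   World ") == "hello-world"
    assert slugify("Already-slugged") == "already-slugged"

if __name__ == "__main__":
    test_slugify()
    print("ok")  # an unambiguous pass signal the agent can parse
```

The point is less the test framework than the feedback loop: a binary pass/fail signal turns the agent’s guesses into a checkable iteration.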
- Scaffolding: When an agent produces bad output, our first instinct is usually “the model is dumb.” But almost every time, the problem is in the scaffolding.