Two findings that surprised us:
1. The same model has completely different internal representations of "difficulty" depending on decoding settings. What GPT-oss thinks is hard under greedy decoding ≠ what it thinks is hard under sampling.
2. Model difficulty and human difficulty are orthogonal. The problems models struggle with aren't the ones we struggle with, and the gap widens with extended reasoning. A rough sketch of how you might check both findings is below.
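If you want to poke at this yourself, here's a minimal sketch of the kind of comparison involved, not the exact pipeline from the repo. It assumes the probe is a linear map over a last-layer hidden state; the model id, probe weights (probe_w, probe_b), prompts, and human ratings are all placeholders:

```python
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "openai/gpt-oss-20b"  # assumption: any GPT-oss checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto")
model.eval()

# Placeholder probe: the released probes have their own format and layer choice.
probe_w = torch.randn(model.config.hidden_size)
probe_b = 0.0

@torch.no_grad()
def difficulty_score(prompt: str, do_sample: bool) -> float:
    """Generate under one decoding setting, then probe the last hidden state."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(
        ids,
        max_new_tokens=64,
        do_sample=do_sample,  # False = greedy, True = sampling
        output_hidden_states=True,
        return_dict_in_generate=True,
    )
    # Last-layer hidden state of the final generated token.
    h = out.hidden_states[-1][-1][0, -1].float()
    return float(h @ probe_w + probe_b)

prompts = ["2+2=?", "Integrate x*sin(x).", "Prove the four color theorem."]
human = [0.05, 0.4, 0.95]  # made-up human difficulty ratings

greedy = [difficulty_score(p, do_sample=False) for p in prompts]
sampled = [difficulty_score(p, do_sample=True) for p in prompts]

# Finding 1: do the two decoding settings rank difficulty the same way?
print("greedy vs sampled:", spearmanr(greedy, sampled))
# Finding 2: does either ranking track human difficulty?
print("sampled vs human:", spearmanr(sampled, human))
```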
Code: https://github.com/KabakaWilliam/llms_know_difficulty
Probes: https://huggingface.co/CoffeeGitta/pika-probes
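The probe weights can be fetched with huggingface_hub; "probe.pt" below is a placeholder filename, so check the repo's file list for the real names:

```python
import torch
from huggingface_hub import hf_hub_download

# "probe.pt" is a placeholder; see the repo's file list for actual filenames.
path = hf_hub_download(repo_id="CoffeeGitta/pika-probes", filename="probe.pt")
probe = torch.load(path, map_location="cpu")
```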
Happy to answer questions.