2 points by harman2607 | 10 hours ago | 1 comment
  • harman2607 | 10 hours ago
    Hi HN, I’m one of the authors.

    This paper studies how LLMs self-verify candidate solutions when doing test-time scaling (parallel reasoning / Best-of-N style generation).

    We found that models are often much better at pairwise comparison, where the model judges two candidate solutions (A and B) jointly, than at assigning an absolute score to each of its own solutions independently.
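    To make the distinction concrete, here is a minimal sketch in Python. `pointwise_score` and `pairwise_prefer` are hypothetical stand-ins for verifier calls (in practice each would be a prompt to the model), not the paper's actual prompts or scoring scheme:

    ```python
    import random

    random.seed(0)

    def pointwise_score(solution: str) -> float:
        # Hypothetical stand-in: a real verifier would prompt the model to
        # score this one solution in isolation and parse a number back.
        return random.random()

    def pairwise_prefer(a: str, b: str) -> str:
        # Hypothetical stand-in: a real verifier would show the model both
        # candidates jointly and parse which one it prefers. Here we use a
        # toy rule (shorter wins) just so the sketch runs deterministically.
        return a if len(a) <= len(b) else b

    candidates = ["solution_one", "solution_2", "soln_3"]

    # Pointwise selection: score each candidate independently, keep the max.
    best_pointwise = max(candidates, key=pointwise_score)

    # Pairwise selection: reduce the pool with joint A-vs-B comparisons.
    best_pairwise = candidates[0]
    for c in candidates[1:]:
        best_pairwise = pairwise_prefer(best_pairwise, c)

    print(best_pointwise, best_pairwise)
    ```

    The point of the contrast: the pointwise path never lets the verifier see two solutions side by side, while the pairwise path always does.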

    The paper introduces:

    • Pairwise self-verification instead of pointwise scoring

    • V1-Infer, a ranking algorithm that selects good candidates efficiently

    • V1-PairRL, RL training where generation and verification co-evolve to produce stronger self-verifiers
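    For intuition about how pairwise judgments can select one candidate out of N, here is a generic single-elimination (knockout) loop. This is only an illustration of the general idea, not the actual V1-Infer algorithm, and `prefer` stands in for a pairwise verifier call:

    ```python
    def knockout(candidates, prefer):
        """Select one winner from `candidates` using pairwise judgments.

        `prefer(a, b)` returns the preferred of two candidates. The loop
        uses N-1 comparisons in ceil(log2(N)) rounds; comparisons within a
        round are independent of each other, so they can run in parallel.
        """
        pool = list(candidates)
        while len(pool) > 1:
            next_pool = []
            # Pair up the current pool; each pair is judged independently.
            for i in range(0, len(pool) - 1, 2):
                next_pool.append(prefer(pool[i], pool[i + 1]))
            if len(pool) % 2 == 1:  # odd one out advances on a bye
                next_pool.append(pool[-1])
            pool = next_pool
        return pool[0]

    # Toy preference for demonstration: pick the numerically larger candidate.
    print(knockout([3, 1, 4, 1, 5, 9, 2, 6], max))  # prints 9
    ```

    With an LLM in place of `max`, each `prefer` call is one pairwise verification prompt.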

    Across coding and reasoning benchmarks, we observe improved verification accuracy, and performance continues to improve as the verification compute budget increases.

    One motivation is that many recent test-time scaling approaches (for example RSA: https://arxiv.org/abs/2509.26626 ) rely on sequential aggregation loops, which serialize the selection step. Pairwise verification enables a more parallel form of selection, which may reduce latency in deep-thinking pipelines and scaffolds.
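    As a back-of-envelope illustration of the latency argument (assuming each verifier call takes one unit of wall-clock time and calls within the same round run fully in parallel; the specific round counts are mine, not figures from the paper):

    ```python
    import math

    def sequential_rounds(n: int) -> int:
        # A sequential aggregation loop over N candidates: each step depends
        # on the previous one, so latency grows linearly with N.
        return n - 1

    def knockout_rounds(n: int) -> int:
        # Parallel pairwise knockout: each round halves the pool, so latency
        # grows only logarithmically with N.
        return math.ceil(math.log2(n))

    for n in (4, 16, 64):
        print(n, sequential_rounds(n), knockout_rounds(n))
    ```

    Both schemes make O(N) verifier calls in total; the difference is how many of them sit on the critical path.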

    Happy to answer questions.