2 points by muchomuchach0 4 hours ago | 1 comment
  • muchomuchach0 4 hours ago
    I’m involved with this project.

    A recurring issue in on-policy RL for LLMs is GPU under-utilization while actors wait for weight syncs from the learner. PipelineRL uses in-flight weight updates: actors keep sampling while the learner updates weights, which reduces policy lag without stalling the pipeline.
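
    To illustrate the idea (this is a toy sketch, not PipelineRL's actual API; names like WeightStore, actor_loop, and learner_loop are made up for the example): the learner publishes new weight versions to a shared store, and actors keep sampling, always reading the latest version without ever blocking on a sync. Each rollout records which weight version produced it, which is what lets you track policy lag.

        # Toy sketch of in-flight weight updates (illustrative only;
        # WeightStore / actor_loop / learner_loop are hypothetical names).
        import threading, time, queue

        class WeightStore:
            """Latest weights plus a version counter; actors read without blocking the learner."""
            def __init__(self, weights):
                self._lock = threading.Lock()
                self._weights, self._version = weights, 0

            def publish(self, weights):
                with self._lock:
                    self._weights = weights
                    self._version += 1

            def snapshot(self):
                with self._lock:
                    return self._weights, self._version

        def actor_loop(store, rollouts, stop):
            # Actors never wait for a sync: they sample continuously and tag each
            # rollout with the weight version that generated it.
            while not stop.is_set():
                weights, version = store.snapshot()
                rollout = f"sampled with w={weights}"   # stand-in for token generation
                rollouts.put((rollout, version))
                time.sleep(0.01)

        def learner_loop(store, rollouts, steps, stop):
            for step in range(steps):
                rollout, used_version = rollouts.get()
                lag = step - used_version      # how stale the behaviour policy was
                new_weights = step + 1         # stand-in for a gradient update
                store.publish(new_weights)     # actors pick this up on their next snapshot
                print(f"step {step}: rollout from v{used_version} (lag {lag})")
            stop.set()

        store, rollouts, stop = WeightStore(0), queue.Queue(), threading.Event()
        threads = [threading.Thread(target=actor_loop, args=(store, rollouts, stop)),
                   threading.Thread(target=learner_loop, args=(store, rollouts, 5, stop))]
        [t.start() for t in threads]
        [t.join() for t in threads]

    The real system does this across GPUs with an inference server and a distributed trainer rather than Python threads, but the scheduling idea is the same: generation never stalls waiting for the learner.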

    In practice this gives roughly 2× wall-clock speedups on large models compared to stalling generation for full weight syncs.

    A paper on the approach was recently accepted to TMLR and discusses policy-lag bounds in more detail.