2 points by ottoselymesi 2 days ago | 1 comment
  • ottoselymesi 2 days ago
    OP here. I wrote this to handle ragged/irregular data without padding or sorting. Instead of "one thread per stream" (divergence hell when stream lengths vary), it uses a holistic grid-stride traversal across the whole dataset.
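
    For anyone unfamiliar with the pattern, here is a rough CPU-side sketch of a grid-stride traversal over ragged data. It is only an illustration under assumptions, not the library's code: it assumes the streams are concatenated into one flat array with a CSR-style offsets array, and the function name `ragged_sum_grid_stride` is made up. Each virtual thread strides over the flat elements and looks up the owning stream by binary search, so the work per thread does not depend on how long any single stream is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of the posted library.
// values  : all streams concatenated back to back.
// offsets : stream s occupies values[offsets[s] .. offsets[s+1]),
//           with offsets.front() == 0 and offsets.back() == values.size().
// num_threads : number of virtual threads emulated on the host.
std::vector<long long> ragged_sum_grid_stride(const std::vector<int>& values,
                                              const std::vector<std::size_t>& offsets,
                                              std::size_t num_threads) {
    assert(offsets.empty() || offsets.back() == values.size());
    const std::size_t num_streams = offsets.empty() ? 0 : offsets.size() - 1;
    std::vector<long long> sums(num_streams, 0);
    const std::size_t total = values.size();

    // Outer loop stands in for the threads of a GPU grid.
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        // Grid-stride loop: thread tid handles elements tid, tid + T, tid + 2T, ...
        for (std::size_t i = tid; i < total; i += num_threads) {
            // First offset strictly greater than i; the stream is the one before it.
            // Empty streams are skipped naturally because their offsets repeat.
            auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
            const std::size_t s = static_cast<std::size_t>(it - offsets.begin()) - 1;
            // On a GPU this update would need an atomic or a segmented reduction.
            sums[s] += values[i];
        }
    }
    return sums;
}
```

    With values {1,2,3,4,5,6,7} and offsets {0,3,3,7}, this describes three streams of lengths 3, 0 and 4; the middle one is empty and needs no padding.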

    Benchmarks on a GTX 1070 (Pascal): ragged reduction is ~2.45x faster than baseline; nested analytics is ~1.98x faster (single-pass).

    Header-only C++17. Happy to answer questions.