OP here.
Wrote this to handle ragged/irregular data without padding or sorting.
Instead of "one thread per stream" (divergence hell when stream lengths vary), it uses a holistic grid-stride traversal.
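
For anyone unfamiliar with the idea, here is a rough CPU-side sketch of what a grid-stride traversal over flattened ragged data can look like. To be clear, this is not the library's code: the `Ragged` struct, the offsets layout, and `ragged_row_sums` are hypothetical names I'm using for illustration, and the binary search for the owning row is just one simple way to map a flat index back to its stream. The point is only that work is split by element, not by stream, so a long stream doesn't stall its neighbours.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout (an assumption, not the library's format):
// all streams are concatenated in `values`, and stream r occupies
// the half-open range [offsets[r], offsets[r + 1]).
struct Ragged {
    std::vector<float> values;
    std::vector<std::size_t> offsets;  // size = streams + 1, offsets[0] == 0
};

// CPU emulation of a grid-stride loop with `num_threads` logical threads.
// Logical thread t visits flat indices t, t + num_threads, t + 2 * num_threads, ...
// so every thread gets roughly the same amount of work regardless of
// how uneven the stream lengths are.
std::vector<float> ragged_row_sums(const Ragged& data, std::size_t num_threads) {
    assert(num_threads > 0 && !data.offsets.empty());
    const std::size_t streams = data.offsets.size() - 1;
    const std::size_t total = data.values.size();
    std::vector<float> sums(streams, 0.0f);

    for (std::size_t t = 0; t < num_threads; ++t) {
        for (std::size_t i = t; i < total; i += num_threads) {
            // Owning stream = index of the last offset that is <= i.
            // upper_bound also skips empty streams correctly.
            auto it = std::upper_bound(data.offsets.begin(), data.offsets.end(), i);
            const std::size_t row =
                static_cast<std::size_t>(it - data.offsets.begin()) - 1;
            // On a GPU this accumulation would need an atomic add or a
            // segmented reduction; the sequential emulation needs neither.
            sums[row] += data.values[i];
        }
    }
    return sums;
}
```

In a real kernel the outer loop disappears (each hardware thread is one `t`), and the stride is the total number of launched threads. The per-element binary search is the naive part of this sketch; it costs O(log streams) per element.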
Benchmarks on GTX 1070 (Pascal):
Ragged Reduction: ~2.45x faster than baseline.
Nested Analytics: ~1.98x faster (single-pass).
Header-only C++17. Happy to answer questions.