1 point by heggenhougen 3 hours ago | 2 comments
  • heggenhougen 3 hours ago
    rolv.ai is publishing reproducible benchmarks showing that ROLV outperforms vendor dense kernels and vendor sparse libraries across GPUs, CPUs, and TPUs while producing the same hash/artifact on every platform. Highlights: on a commodity Intel Xeon, a Kimi K2.5 expert FFN slice runs 228.38 ms → 6.36 ms per iteration (35.9×), token throughput rises from 2,240 to 80,500 t/s, and kernel energy drops 97.8%. Nsight‑validated traces and the full methodology are in the PDF below; results cover LLM FFNs, recommendation GEMMs, FEM, GNNs, mobile and EV proxies, and production subsamples. We welcome technical critique of the methodology, requests for additional baselines, and requests for the exact scripts to reproduce any specific run.
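    As a quick sanity check, the headline ratios follow directly from the raw numbers quoted above:

```python
# Recompute the headline Kimi K2.5 figures from the raw numbers in the post.
dense_ms, rolv_ms = 228.38, 6.36      # per-iteration wall time (dense baseline vs ROLV)
dense_J, rolv_J = 16_283.97, 350.74   # kernel energy per run

speedup = dense_ms / rolv_ms          # ratio of iteration times
energy_saved = 1 - rolv_J / dense_J   # fraction of kernel energy eliminated

print(f"{speedup:.1f}x speedup, {energy_saved:.1%} energy saved")
# → 35.9x speedup, 97.8% energy saved
```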
  • heggenhougen 3 hours ago
    I’m on the Rolv.ai team — below is a detailed expansion on our benchmarks and what to inspect in the PDF. This focuses on artifact parity, methodology, standout results, and the exact verification checks reviewers can run to validate our claims.

    Reproducibility and artifact parity
    We publish reproducible artifact hashes and full environment manifests for NVIDIA, AMD, Intel CPU, AMD CPU, Apple M4, and Google TPU. We do not distribute proprietary binaries or IP; instead the PDF lists the ROLV artifact hash (identical across platforms), container manifests, and the exact command lines and verification tests you can run to confirm matching outputs, checksums, Nsight/perf traces, and power logs.
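    The parity check itself is just hash comparison. A minimal sketch of the idea (the artifact name and hash here are placeholders; the PDF lists the real paths and the published hash):

```shell
# Hypothetical parity check: the artifact name below is a placeholder.
printf 'demo artifact bytes' > rolv_artifact.bin   # stand-in for the real artifact

# On each platform, hash the locally built artifact...
LOCAL_HASH="$(sha256sum rolv_artifact.bin | awk '{print $1}')"
# ...and compare against the hash published for every other platform.
PUBLISHED_HASH="$LOCAL_HASH"                       # would be copied from the PDF

if [ "$LOCAL_HASH" = "$PUBLISHED_HASH" ]; then
    echo "artifact parity: OK"
else
    echo "artifact parity: MISMATCH" >&2
    exit 1
fi
```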

    What we validated and why it matters
    Cross‑platform parity — identical outputs and checksums across vendor GPUs, CPUs, and TPUs to eliminate measurement drift from build differences.

    Vendor comparisons — benchmarks against vendor dense kernels and vendor sparse libraries (cuBLAS/cuSPARSE, ROCm sparse, vendor BLAS on CPUs, TPU sparse primitives where available) with per‑kernel wall time, memory transfer time, and conversion overheads.

    Energy and throughput — kernel energy where measurable, end‑to‑end token throughput for LLM slices, and iteration times for non‑LLM workloads; Nsight traces and power logs are referenced.

    Standout, independently validated numbers (March 2026)
    Kimi K2.5 expert FFN (7168×2048, batch=512, ~87% sparsity) on a commodity Intel Xeon (13 GB usable RAM): dense baseline 228.38 ms → ROLV 6.36 ms per iteration (35.9×); token throughput 2,240 → 80,500 t/s; kernel energy 16,283.97 J → 350.74 J (97.8% saved).

    Finite Element Solver (mobile phone chassis drop test): 193.16× speedup; 99.5% energy saved (multi‑CPU).

    LLM proxy matrix (4096×5120, 50% sparsity) on NVIDIA B200: 158.72× speedup; 99.37% energy saved; 40.5M t/s with Nsight‑validated tolerance harness.

    Large recommendation GEMM (Meta‑style ranking): 98.76× speedup; 99.0% energy saved.

    Additional production and research workloads (GNNs, ViT attention, MusicGen, Llama shapes) are listed with per‑run sparsities and exact matrix shapes in the PDF.

    Methodology highlights (what to inspect in the PDF)
    Exact shapes and sparsities — matrix dimensions, sparsity pattern (random/pruned/structured), and batch sizes.

    Baseline definitions — vendor dense kernel and vendor sparse library baselines include conversion costs; we report raw kernel times and end‑to‑end times.
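    The conversion-cost distinction matters because a sparse baseline pays a one-time format conversion before the kernel runs. A toy sketch of how that accounting works, with pure-Python stand-ins rather than the actual kernels:

```python
import time

def to_coo(dense):
    """Convert a dense row-major matrix to COO (row, col, value) triples."""
    return [(i, j, v) for i, row in enumerate(dense)
                      for j, v in enumerate(row) if v != 0.0]

def spmv(coo, x, n_rows):
    """Sparse matrix-vector product over COO triples."""
    y = [0.0] * n_rows
    for i, j, v in coo:
        y[i] += v * x[j]
    return y

dense = [[1.0, 0.0, 2.0], [0.0, 0.0, 3.0]]
x = [1.0, 1.0, 1.0]

t0 = time.perf_counter()
coo = to_coo(dense)              # conversion cost: charged to the sparse baseline
t1 = time.perf_counter()
y = spmv(coo, x, len(dense))     # raw kernel time: reported separately
t2 = time.perf_counter()

print(f"convert {t1 - t0:.6f}s + kernel {t2 - t1:.6f}s = end-to-end {t2 - t0:.6f}s")
```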

    Measurement rig — wall‑clock timing, Nsight kernel timelines, and device power sampling points; CPU runs include perf counters and the exact kernel invocation sequence.
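    For the wall-clock side, a minimal rig in the same spirit (warmup iterations discarded, median of repeated runs reported; the kernel here is a toy stand-in):

```python
import time
import statistics

def bench(kernel, warmup=3, repeats=10):
    """Wall-clock a kernel: discard warmup runs, return the median of repeats."""
    for _ in range(warmup):
        kernel()                 # prime caches / JIT / clocks before measuring
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        kernel()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

# Toy kernel standing in for a GEMM invocation.
med = bench(lambda: sum(i * i for i in range(10_000)))
print(f"median iteration time: {med * 1e3:.3f} ms")
```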

    Tolerance and correctness — numerical tolerance checks, output checksums, and unit tests used to validate functional equivalence.
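    A minimal sketch of that kind of tolerance-plus-checksum check (names and tolerances here are illustrative, not the ones in the PDF): outputs pass an elementwise tolerance test, then are rounded to a fixed precision before hashing so that runs within tolerance on different platforms produce the same checksum.

```python
import hashlib

def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Elementwise tolerance check in the style of numpy.allclose."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

def output_checksum(values, digits=6):
    """Checksum of outputs rounded to fixed precision, so runs that agree
    within tolerance hash identically across platforms."""
    canon = ",".join(f"{v:.{digits}f}" for v in values)
    return hashlib.sha256(canon.encode()).hexdigest()

ref  = [1.000000, 2.500000, -0.333333]
test = [1.0000001, 2.5000002, -0.3333329]   # tiny cross-platform drift

print(allclose(test, ref))                               # → True
print(output_checksum(ref) == output_checksum(test))     # → True
```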

    Repro scripts — container manifests and run_benchmark verification commands are referenced so reviewers can rerun each test and compare hashes and checksums against the published values.