
      I benchmarked 6 TPC-H analytical queries on an Apple M4 across three execution paths: DuckDB SQL, NumPy CPU kernels, and MLX GPU kernels. The goal was to quantify whether unified memory
      actually matters for GPU-accelerated analytics.
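
      A minimal sketch of what the three paths look like for a TPC-H Q6-style filter + aggregate (the column names and synthetic data here are illustrative, not the repo's actual harness):

          import duckdb
          import numpy as np
          import pandas as pd
          import mlx.core as mx

          n = 6_000_000
          df = pd.DataFrame({
              "l_quantity": np.random.uniform(1, 50, n).astype(np.float32),
              "l_extendedprice": np.random.uniform(1, 100_000, n).astype(np.float32),
              "l_discount": np.random.uniform(0, 0.10, n).astype(np.float32),
          })

          # Path 1: DuckDB's C++ vectorized engine scans the DataFrame directly.
          revenue_sql = duckdb.sql(
              "SELECT SUM(l_extendedprice * l_discount) FROM df WHERE l_quantity < 24"
          ).fetchone()[0]

          # Path 2: NumPy CPU kernel -- same arithmetic with a boolean mask.
          q = df["l_quantity"].to_numpy()
          p = df["l_extendedprice"].to_numpy()
          d = df["l_discount"].to_numpy()
          revenue_np = np.sum(p[q < 24] * d[q < 24])

          # Path 3: MLX GPU kernel -- mx.array() copies into MLX buffers (not
          # zero-copy); no boolean indexing, so multiply by the mask instead.
          qm, pm, dm = mx.array(q), mx.array(p), mx.array(d)
          revenue_mx = mx.sum(pm * dm * (qm < 24).astype(mx.float32))
          mx.eval(revenue_mx)  # MLX is lazy; force evaluation on the GPU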
    
      What I found:
    
      - MLX GPU kernels are 1.3x-3.1x faster than identical NumPy CPU kernels on compute-heavy queries (Q1, Q6). The advantage scales with data size.
      - DuckDB's optimized SQL engine beats hand-written GPU kernels on every standard TPC-H query. A C++ vectorized engine with a query optimizer is in a different performance class from
      Python-orchestrated GPU kernels.
      - A custom GPU-favorable query (pure parallel arithmetic, no joins) showed MLX beating DuckDB by 1.6x and NumPy by 16x -- confirming that the GPU wins when the workload fits. A sketch of that kind of workload follows this list.
      - If the M4 GPU were behind a PCIe 4.0 bus, data transfer would add 10-36% overhead. Unified memory eliminates this entirely.
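
      For context on the GPU-favorable case, here's a hedged sketch of that kind of workload -- pure per-row arithmetic with one reduction, no joins (the exact expression in the paper's custom query differs):

          import numpy as np
          import mlx.core as mx

          n = 12_000_000
          price = np.random.uniform(1, 100_000, n).astype(np.float32)
          disc = np.random.uniform(0, 0.10, n).astype(np.float32)
          tax = np.random.uniform(0, 0.08, n).astype(np.float32)

          # CPU version: every intermediate materializes a full temporary array.
          cpu = np.sum(np.log1p(price) * (1 - disc) * (1 + tax) * np.sqrt(price))

          # GPU version: mx.array() copies into MLX's unified-memory buffers, then
          # the elementwise math and the reduction run on the M-series GPU with
          # no PCIe transfer of the inputs or the result.
          p, d, t = mx.array(price), mx.array(disc), mx.array(tax)
          gpu = mx.sum(mx.log1p(p) * (1 - d) * (1 + t) * mx.sqrt(p))
          mx.eval(gpu)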
    
      Honest takeaway: Unified memory removes the transfer bottleneck, but the engine's software stack matters more than the hardware for typical analytical queries. GPU analytics needs
      workloads heavy on parallel arithmetic and light on joins to beat an optimized CPU engine.
    
      MLX limitations I worked around: no boolean indexing (used an overflow-bin pattern, sketched below), float32 only (~0.08% precision loss over millions of rows), and mx.array(numpy) is a copy, not zero-copy.
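
      Roughly, the overflow-bin idea looks like this (a simplified illustration, not the exact kernel from the repo): since MLX can't do arr[mask], route rows that fail the predicate to one extra bin during a scatter-add group-by and drop that bin at the end.

          import numpy as np
          import mlx.core as mx

          n, num_groups = 1_000_000, 4
          group = mx.array(np.random.randint(0, num_groups, n).astype(np.int32))
          value = mx.array(np.random.uniform(0, 100, n).astype(np.float32))
          keep = value > 50  # the predicate we'd normally use as a boolean index

          # Rows failing the predicate go to bin `num_groups` (the overflow bin).
          overflow = mx.array(num_groups, dtype=mx.int32)
          bin_idx = mx.where(keep, group, overflow)
          sums = mx.zeros(num_groups + 1).at[bin_idx].add(value)  # scatter-add
          per_group_sum = sums[:num_groups]  # discard the overflow bin
          mx.eval(per_group_sum)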
    
      Full paper: https://github.com/sadopc/unified-db-2/blob/main/PAPER.md
    
      All code is MIT. Runs end-to-end with one command.