2 points by arun99-99 6 hours ago | 1 comment
  • arun99-99 6 hours ago
    If you're preparing for systems or performance-engineering roles, this repo shows how a simple matmul evolves into a high-performance kernel.

    It demonstrates:

    why loop order matters

    how cache locality dominates performance

    how tiling + register reuse cut memory traffic

    how multithreading scales
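
    To make the first three points concrete, here is a minimal sketch in C. It is not taken from the repo: the function names, the row-major n x n layout, and TILE = 32 are my own assumptions for illustration. It shows the naive i-j-k order, the cache-friendlier i-k-j order, and a tiled variant.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only; names, layout, and TILE are assumptions.
   All matrices are n x n, row-major, stored in flat arrays. */

/* Naive i-j-k order: the inner loop reads B down a column (stride n),
   so consecutive iterations touch different cache lines. */
void matmul_ijk(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* i-k-j order: the inner loop walks a row of B and a row of C
   (stride 1), and A[i][k] is held in a scalar for the whole row. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C) {
    memset(C, 0, n * n * sizeof(double));
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}

/* Assumed tile edge: three 32x32 double tiles take 24 KB, which fits
   a 32 KB L1 data cache. Tune for the actual machine. */
#define TILE 32

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled i-k-j: work on TILE x TILE blocks so each block of A, B, and C
   is reused while it is still in cache. Handles n not divisible by TILE. */
void matmul_tiled(size_t n, const double *A, const double *B, double *C) {
    memset(C, 0, n * n * sizeof(double));
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_sz(ii + TILE, n);
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_sz(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_sz(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
        }
    }
}
```

    All three compute the same product; only the memory access pattern differs. Timing them at a size that spills out of cache (say n in the low thousands) is a quick way to see the loop-order and tiling effects on your own machine.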

    You can run all benchmarks with one script and see a ~100× speedup going from the naive version to the optimized one.

    Good practice for:

    low-level optimization

    ML systems

    HPC

    performance interviews

    Repo: https://github.com/arun-reddy-a/matmul-cpu