5 pointsby subset8 hours ago1 comment
  • ttd8 hours ago
    I love these optimization tales. Memory throughput bottlenecks (extremely common, perhaps moreso than they seem) are my favorite to tackle - there are frequently some juicy optimizations that can apply there.

    Do model weights have any spatial locality that can be exploited? If so, there are some more general pre-compression techniques that might be interesting to try, e.g. bitshuffle is one I've worked with (https://github.com/kiyo-masui/bitshuffle).

    Another fun fact: in some scenarios (depends a lot on CPU and memory characteristics), gzip+memcpy+gunzip can be faster end-to-end than just memcpy. I forget where I first heard this but my familiarity comes from the blosc compression library.