My experience has been that getting over the daunting feeling of a big, wide world full of noise and marketing, and simply committing to a problem, learning it, and slowly bootstrapping it over time, tends to yield phenomenal results in the long run for most applications. And if not, there's often an applicable adjacent or side field you can pivot to and still make immense progress.
The big players may have the advantage of scale, but there is still so, so much that can be done if you look around and keep a good feel for it. <3 :)
and it is so wonderful for it :)
https://openalex.org/works?page=1&filter=title_and_abstract....
Without numbers, I am left wondering whether they omitted CUDA graph benchmarks due to a lack of effort, or because they actually did the benchmarks and did not want to admit that their approach was not as much of a performance advance as they portray it to be.
I’m surprised the reduction in overhead for graphs vs. streams alone was so small. I feel like I’ve observed larger gains, but maybe I’m conflating CPU overhead with launch latency.
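(For anyone who wants to untangle those two numbers, here's a rough sketch of how I'd measure them separately; the empty kernel and iteration count are arbitrary, and none of this is from the article itself.)

    // launch_overhead.cu: separates the two numbers people tend to conflate.
    // "CPU overhead" here = host time spent inside each (asynchronous) launch call.
    // "GPU time" = event-measured time for the same launches on the stream.
    // Empty kernel and iteration count are arbitrary illustration values.
    #include <cuda_runtime.h>
    #include <chrono>
    #include <cstdio>

    __global__ void empty_kernel() {}

    int main() {
        const int iters = 10000;
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Warm up so first-launch module-load cost doesn't skew the numbers.
        empty_kernel<<<1, 1, 0, stream>>>();
        cudaStreamSynchronize(stream);

        // Host-side cost per launch call. Note: if the launch queue backs up,
        // the calls start to block, which is exactly where the two numbers blur.
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i)
            empty_kernel<<<1, 1, 0, stream>>>();
        auto t1 = std::chrono::steady_clock::now();
        cudaStreamSynchronize(stream);

        // GPU-side time for the same sequence, bracketed by events on the stream.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, stream);
        for (int i = 0; i < iters; ++i)
            empty_kernel<<<1, 1, 0, stream>>>();
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);

        float gpu_ms = 0.0f;
        cudaEventElapsedTime(&gpu_ms, start, stop);
        double cpu_us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
        printf("host time per launch call: %.2f us\n", cpu_us);
        printf("GPU time per empty kernel:  %.2f us\n", gpu_ms * 1000.0 / iters);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaStreamDestroy(stream);
        return 0;
    }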
They should mention whether they did the graph uploads up front and whether they needed to change parameters within the graph.
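For context, "upload up front" just means calling cudaGraphUpload on the instantiated graph before the serving loop. Here's a minimal sketch of the capture/instantiate/upload/replay flow, with a made-up kernel and sizes (my illustration, not the paper's code):

    // graph_replay.cu: capture once, upload up front, then replay cheaply.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void axpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));  // allocate before capture; plain cudaMalloc can't be captured
        cudaMalloc(&y, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Capture the launch sequence once. Real inference captures many kernels;
        // one is enough to show the mechanics.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        axpy<<<(n + 255) / 256, 256, 0, stream>>>(2.0f, x, y, n);
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature; 11.x uses a 5-argument overload

        // "Upload up front": push the executable graph to the device now, so the
        // first cudaGraphLaunch in the timed/serving path doesn't pay that cost.
        cudaGraphUpload(exec, stream);

        // Replay: one launch call per iteration replaces the whole per-kernel launch stream.
        // If kernel arguments change between replays (new pointers, new sizes), patch them
        // with cudaGraphExecKernelNodeSetParams or cudaGraphExecUpdate instead of rebuilding.
        for (int iter = 0; iter < 1000; ++iter)
            cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);

        printf("replayed graph 1000 times\n");
        cudaGraphExecDestroy(exec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
        cudaFree(x); cudaFree(y);
        return 0;
    }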
Having said that, a 1B model is an extreme example, hence the 1.5x speedup. For regular models and batch sizes this would probably buy you a few percent.
And of course there's the effect on throughput at larger batch sizes, which they allude to at the end.
Overall a very interesting result!
I wonder if we'll see OS-level services/daemons that try to lower the time to first token as these things get used more. And the interface for application developers would be a simple system prompt.
In some ways the idea sounds nice, but there would be a lot of downsides:
- Memory eaten up by potentially unused models
- Less compute available to software running specialized models for specific tasks
I keep devstral (~15 GB) in memory at all times, since I have so much extra.
I can’t wait for a few years from now, when I can have triple the memory bandwidth with this much RAM.