They explicitly say they want to avoid running layers on the CPU for performance reasons, and they include benchmarks, but none that compare against offloading to the CPU. That comparison would be really helpful for verifying that this is actually worth doing. (I suspect it is, but intuition is no substitute for empirical evidence.)