So I started writing my own quantization tool in Rust, and while peeling back the layers and running tests I found that the ONNX Runtime (ORT) quantizer places the Dequantize node between the Conv and the ReLU, which breaks kernel fusion. If you move the Dequantize after the activation instead, the rewrite is mathematically identical, since max(0, x · scale) == max(0, x) · scale for scale > 0, but the runtime can now fuse the kernels, so you get a large speedup at the same accuracy. I've tested it in production environments as well. It doesn't work well on YOLO yet, but tbh I'm not in a rush to tune it for YOLO at the moment.
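A minimal sketch of the identity above, to show why the two node orders give the same result. It is not code from the tool itself: the function names, the sample accumulator values, and the scale are hypothetical, and it assumes symmetric quantization with a zero point of 0, which the post does not state.

```rust
/// ReLU on a single float value.
fn relu(x: f32) -> f32 {
    x.max(0.0)
}

/// Original order: dequantize (multiply by scale) first, then apply ReLU.
fn dequant_then_relu(acc: i32, scale: f32) -> f32 {
    relu(acc as f32 * scale)
}

/// Rewritten order: apply ReLU to the integer accumulator, then dequantize.
/// Assumes scale > 0 and a zero point of 0 (illustrative assumption).
fn relu_then_dequant(acc: i32, scale: f32) -> f32 {
    acc.max(0) as f32 * scale
}

fn main() {
    let scale = 0.0173_f32; // hypothetical per-tensor scale, must be > 0
    for acc in [-1200, -1, 0, 1, 37, 4096] {
        let a = dequant_then_relu(acc, scale);
        let b = relu_then_dequant(acc, scale);
        // Both orders agree for every accumulator value when scale > 0.
        assert_eq!(a, b);
        println!("{acc:>6}: {a} == {b}");
    }
}
```

Both orders agree only because the scale is positive: a negative scale would flip the sign before the max and the identity would no longer hold.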
Some quick benchmarks (i5-13420H CPU, single-threaded ORT):
SqueezeNet: 2.32x faster than FP32 (ORT quantizer made it slower than FP32)
ResNet50: 2.46x faster than FP32, 40% faster than ORT's quantizer
Zero Python dependencies, single native binary, self-calibrating. Apache-2.0.
Technical write-up: https://coreepoch.dev/research/kenosis-activation-aware-quan...
cargo install kenosis-cli
kenosis quantize model.onnx -o model_int8.onnx --static-int8