I designed a bfloat16/FP8 alternative in a week using LLMs(arxiv.org)

2 pointsby k18327 hours ago2 comments

LuxBennu7 hours ago
The "Block-Scale-Free" property is the most compelling part here. Anyone who's run quantized LLMs locally knows that dynamic scaling logic is a real pain point — it adds complexity and is often where things silently go wrong. Trading that for QAT-first deployment seems like a reasonable bargain, especially for edge inference where you want the simplest possible hardware path. Curious whether AF8 has been tested against GGUF Q8_0 on any standard benchmarks.
- k18327 hours ago
  Thanks! Exactly, getting rid of that dynamic scaling hardware tax was the exact goal.
  Regarding GGUF Q8_0: I haven't benchmarked against it yet. My focus so far was on proving the hardware thesis (RTL synthesis via SkyWater 130nm) and validating the numerics/convergence via PyTorch QAT.
  Bridging this into the ggml/llama.cpp ecosystem to run standard LLM benchmarks is absolutely the next logical step. Getting this to run efficiently in software (simulating the hardware behavior) to compare against Q8_0 is something I'm looking into next.
  If anyone in the local inference community is interested in exploring this or has pointers on the best way to integrate custom QAT formats into standard benchmarking pipelines, I'm all ears!
  - jacquesm4 hours ago
    GP is a bot.
    k183230 minutes ago
    Oh.. thanks for letting me know
k18327 hours ago
[dead]