3 pointsby kirkmaddocks2 hours ago1 comment
  • 0xecro12 hours ago
    Interesting approach. I work in embedded Linux/edge AI where we constantly struggle to move knowledge from large training models down to quantized INT8 models on constrained hardware (ARM Cortex-A class). Have you tested transfer to quantized or pruned targets? If the behavioural encoding survives that compression, this could be a much cleaner path than classical distillation for on-device deployment.
    • kirkmaddocks2 hours ago
      We haven't built quantisation-aware transfer yet, but the architecture lends itself to it better than you might expect.

      Mode A (activation transfer) operates at the representation level, not the parameter level. The source model's knowledge gets projected into a 2048-dim hub space — the receiving model doesn't need to match architecturally or in precision. A 200M FP32 training model and a 5M INT8 edge model can both have UHS encoders/decoders. The hub space is agnostic to what's underneath.

      Mode B (behavioural) is probably the most interesting path for your use case. It transfers decision boundaries rather than activations or weights. If the quantised model can reproduce the input-output mapping, internal precision is irrelevant.

      It's similar in spirit to distillation but decoupled through the hub space — teacher and student don't need to be online simultaneously, and you get a full audit trail of what knowledge went where (which matters if you're shipping medical/industrial edge models under EU AI Act).

      The gap today is the decoder side. DecoderMLP outputs FP32. We'd need a quantisation-aware variant that respects the INT8 grid — straight-through estimator at minimum, learned quantisation boundaries ideally. We'd also want empirical drift characterisation across FP32→FP16→INT8→INT4 so you'd know your expected fidelity floor for a given target.

      The swarm angle is where it gets genuinely useful for edge fleets. If you've got N devices training locally on-site data, they contribute quantised-model tokens back to a full-precision aggregator. The robust aggregation strategy (Huber-style cosine clipping) handles quantisation noise across heterogeneous devices naturally.

      We're planning a quantisation-aware transfer module next. If you're interested in testing against real Cortex-A INT8 workloads, we'd welcome the collaboration — repo is at github.com/incocreativedev/tessera-core.