Mode A (activation transfer) operates at the representation level, not the parameter level. The source model's knowledge gets projected into a 2048-dim hub space, so the receiving model doesn't need to match the source architecturally or in precision. A 200M FP32 training model and a 5M INT8 edge model can both have UHS encoders/decoders. The hub space is agnostic to what's underneath.
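To make Mode A concrete, here's a toy NumPy sketch. The 2048 hub width comes from the post; everything else (the native dims, the random-linear codecs) is a hypothetical stand-in for trained UHS encoders/decoders:

```python
import numpy as np

rng = np.random.default_rng(0)
HUB_DIM = 2048  # hub width from the post; other dims here are illustrative

def make_codec(native_dim, hub_dim=HUB_DIM):
    # Stand-in for a trained UHS encoder/decoder pair: a random linear
    # projection into hub space and its pseudo-inverse back out.
    enc = rng.standard_normal((native_dim, hub_dim)) / np.sqrt(native_dim)
    dec = np.linalg.pinv(enc)
    return enc, dec

# A wide "training model" and a narrow "edge model", different native widths.
enc_big, _ = make_codec(1024)
_, dec_small = make_codec(64)

# Transfer: big model's activation -> hub -> small model's native space.
# Neither side sees the other's architecture or numeric precision.
act_big = rng.standard_normal(1024).astype(np.float32)
hub_vec = act_big @ enc_big        # native (1024,) -> hub (2048,)
act_small = hub_vec @ dec_small    # hub (2048,) -> native (64,)
```

The point of the sketch is the shapes: both sides only ever agree on the 2048-dim interface.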
Mode B (behavioural) is probably the most interesting path for your use case. It transfers decision boundaries rather than activations or weights. If the quantised model can reproduce the input-output mapping, internal precision is irrelevant.
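A minimal sketch of the Mode B idea, assuming a linear toy teacher (names and setup are mine, not the tessera-core API): probe the teacher's input-output mapping, fit a student, quantise the student's weights to an INT8 grid, and check that the decisions still match.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical full-precision teacher: a fixed linear decision boundary.
w_teacher = rng.standard_normal(16).astype(np.float32)
def teacher(x):
    return (x @ w_teacher > 0).astype(np.int8)

# Record the teacher's behaviour on probe inputs, fit a student to it.
X = rng.standard_normal((512, 16)).astype(np.float32)
y = np.where(teacher(X) > 0, 1.0, -1.0)
w_student, *_ = np.linalg.lstsq(X, y, rcond=None)

# Quantise the student's weights to an INT8 grid: only the mapping matters,
# not the internal precision it's stored at.
scale = np.abs(w_student).max() / 127
w_q = np.round(w_student / scale).astype(np.int8)
def student(x):
    return (x @ (w_q.astype(np.float32) * scale) > 0).astype(np.int8)

agreement = (student(X) == teacher(X)).mean()
```

`agreement` lands near 1.0 because the decision boundary survives quantisation even though the weight values don't match bit-for-bit.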
It's similar in spirit to distillation but decoupled through the hub space — teacher and student don't need to be online simultaneously, and you get a full audit trail of what knowledge went where (which matters if you're shipping medical/industrial edge models under EU AI Act).
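On the audit-trail point, here's a sketch of what one transfer record might capture; the field names and hashing scheme are hypothetical, not the tessera-core schema:

```python
from dataclasses import dataclass, field
import hashlib
import json
import time

def fingerprint(payload) -> str:
    # Content hash so a record can be checked against the data it describes.
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

@dataclass
class TransferRecord:
    """Hypothetical audit entry for one behavioural transfer: enough to answer
    'which teacher taught which student what, and when' after the fact."""
    teacher_id: str
    student_id: str
    probe_set_hash: str    # hash of the inputs used to elicit behaviour
    response_hash: str     # hash of the teacher's recorded outputs
    timestamp: float = field(default_factory=time.time)

rec = TransferRecord(
    teacher_id="trainer-200M-fp32",
    student_id="edge-5M-int8",
    probe_set_hash=fingerprint([[0.1, 0.2], [0.3, 0.4]]),
    response_hash=fingerprint([1, 0]),
)
```

Because teacher and student never meet, this record is the only place the lineage lives, which is what makes it auditable.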
The gap today is the decoder side. DecoderMLP outputs FP32. We'd need a quantisation-aware variant that respects the INT8 grid — straight-through estimator at minimum, learned quantisation boundaries ideally. We'd also want empirical drift characterisation across FP32→FP16→INT8→INT4 so you'd know your expected fidelity floor for a given target.
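The straight-through-estimator variant would look roughly like this (a sketch, not the DecoderMLP code; symmetric per-tensor INT8 with a fixed scale is an assumption):

```python
import numpy as np

def fake_quant_int8(x, scale):
    # Forward pass: snap values to the INT8 grid the edge model will use.
    q = np.clip(np.round(x / scale), -128, 127)
    return q * scale

def fake_quant_ste_grad(x, scale, upstream_grad):
    # Straight-through estimator: round() has zero gradient almost everywhere,
    # so pass the upstream gradient through unchanged wherever the input landed
    # inside the representable INT8 range, and zero it outside (clipped region).
    inside = np.abs(np.round(x / scale)) <= 127
    return upstream_grad * inside
```

Training the decoder head through `fake_quant_int8` with this gradient rule is the minimum version; learned quantisation boundaries would make `scale` (and possibly per-channel offsets) trainable parameters.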
The swarm angle is where it gets genuinely useful for edge fleets. If you've got N devices training locally on on-site data, they contribute quantised-model tokens back to a full-precision aggregator. The robust aggregation strategy (Huber-style cosine clipping) handles quantisation noise across heterogeneous devices naturally.
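One plausible reading of "Huber-style cosine clipping", sketched below; this is my interpretation, not the actual aggregator: cap each device update's influence at the median norm (the Huber-like part), then weight it by cosine similarity to the coordinate-wise median direction, flooring at zero so badly misaligned updates contribute nothing.

```python
import numpy as np

def robust_aggregate(updates, cos_floor=0.0):
    # Hypothetical robust aggregation sketch for heterogeneous device updates.
    U = np.stack(updates)
    med = np.median(U, axis=0)                       # robust reference direction
    med_dir = med / (np.linalg.norm(med) + 1e-12)
    norms = np.linalg.norm(U, axis=1)
    cap = np.median(norms)
    # Huber-like influence cap: shrink oversized updates to the median norm.
    clipped = U * np.minimum(1.0, cap / (norms + 1e-12))[:, None]
    # Cosine weighting: downweight updates pointing away from consensus.
    cos = clipped @ med_dir / (np.linalg.norm(clipped, axis=1) + 1e-12)
    w = np.maximum(cos, cos_floor)
    return (w[:, None] * clipped).sum(axis=0) / (w.sum() + 1e-12)
```

Quantisation noise mostly perturbs update magnitudes and directions mildly, so honest INT8 devices pass through nearly untouched, while one wildly wrong contributor gets clipped and zero-weighted.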
We're planning a quantisation-aware transfer module next. If you're interested in testing against real Cortex-A INT8 workloads, we'd welcome the collaboration — repo is at github.com/incocreativedev/tessera-core.