The short version: AMD shipped a chip with three processors (CPU, GPU, NPU) sharing one memory bus, and every ML runtime only uses one at a time. This project coordinates all three with learned routing — it profiles each operation, assigns it to the fastest device, and adapts over time.
Key findings:

- Vulkan outperforms ROCm by ~60% on the Radeon 890M (hipMallocManaged is broken on gfx1150; Vulkan sees the full memory pool)
- The NPU kernel driver had an init-order bug; we patched it to get Llama 3.2 1B running at 40-46 tok/s at 2W
- The personality database correctly learns routing after ~5 runs (embed→NPU, matmul→GPU, tokenize→CPU)
- PyTorch had to be built from source, since no pre-built wheels exist for gfx1150 (RDNA 3.5)
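To make the routing idea concrete, here is a minimal sketch of profile-based device routing. It assumes a simple "personality database" that keeps an exponentially weighted average latency per (op, device) pair, explores unprofiled devices first, then routes each op to its fastest known device. The device and op names mirror the ones above; the class, API, and latency numbers are illustrative, not the project's actual implementation.

```python
import random
from collections import defaultdict

DEVICES = ["cpu", "gpu", "npu"]

class PersonalityDB:
    """Toy per-operation device router (hypothetical sketch)."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha                 # EWMA smoothing factor
        self.latency = defaultdict(dict)   # op -> {device: avg latency in ms}

    def route(self, op):
        known = self.latency[op]
        # Explore: profile any device we have not tried for this op yet.
        untried = [d for d in DEVICES if d not in known]
        if untried:
            return random.choice(untried)
        # Exploit: pick the device with the lowest average latency.
        return min(known, key=known.get)

    def record(self, op, device, ms):
        # Update the running average so routing adapts over time.
        prev = self.latency[op].get(device, ms)
        self.latency[op][device] = (1 - self.alpha) * prev + self.alpha * ms

# Usage: after a few runs the router settles on the fastest device per op.
db = PersonalityDB()
for _ in range(10):
    dev = db.route("matmul")
    ms = {"cpu": 12.0, "gpu": 1.5, "npu": 6.0}[dev]  # fake profiled latencies
    db.record("matmul", dev, ms)
print(db.route("matmul"))  # prints "gpu" once all three devices are profiled
```

In this toy version exploration is exhaustive-then-greedy; the real system presumably weighs more than raw latency (power, memory placement), but the learn-then-exploit loop is the same shape.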
Pre-alpha: architecture plus a working PoC, not production code.

Blog posts with more detail:

- https://dev.to/peterc3dev/your-amd-apu-has-three-processors-...
- https://dev.to/peterc3dev/i-got-all-three-processors-talking...