The short version: AMD shipped a chip with three processors (CPU, GPU, NPU) sharing one memory bus, and every ML runtime only uses one at a time. This project coordinates all three with learned routing — it profiles each operation, assigns it to the fastest device, and adapts over time.
Key findings:

- Vulkan outperforms ROCm by ~60% on the Radeon 890M (hipMallocManaged is broken on gfx1150; Vulkan sees the full memory pool)
- The NPU kernel driver had an init-order bug; we patched it to get Llama 3.2 1B running at 40-46 tok/s at 2W
- The personality database correctly learns routing after ~5 runs (embed→NPU, matmul→GPU, tokenize→CPU)
- PyTorch had to be built from source, since no pre-built wheels exist for gfx1150 (RDNA 3.5)
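To make the routing idea concrete, here is a minimal sketch of profile-based device routing. It assumes a simple "personality database" that keeps an exponentially weighted average latency per (op, device) pair, explores unprofiled devices first, then routes each op to its fastest known device. The device and op names mirror the ones above; the class, API, and latency numbers are illustrative, not the project's actual implementation.

```python
import random
from collections import defaultdict

DEVICES = ["cpu", "gpu", "npu"]

class PersonalityDB:
    """Toy per-operation device router (hypothetical sketch)."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha                 # EWMA smoothing factor
        self.latency = defaultdict(dict)   # op -> {device: avg latency in ms}

    def route(self, op):
        known = self.latency[op]
        # Explore: profile any device we have not tried for this op yet.
        untried = [d for d in DEVICES if d not in known]
        if untried:
            return random.choice(untried)
        # Exploit: pick the device with the lowest average latency.
        return min(known, key=known.get)

    def record(self, op, device, ms):
        # Update the running average so routing adapts over time.
        prev = self.latency[op].get(device, ms)
        self.latency[op][device] = (1 - self.alpha) * prev + self.alpha * ms

# Usage: after a few runs the router settles on the fastest device per op.
db = PersonalityDB()
for _ in range(10):
    dev = db.route("matmul")
    ms = {"cpu": 12.0, "gpu": 1.5, "npu": 6.0}[dev]  # fake profiled latencies
    db.record("matmul", dev, ms)
print(db.route("matmul"))  # prints "gpu" once all three devices are profiled
```

In this toy version exploration is exhaustive-then-greedy; the real system presumably weighs more than raw latency (power, memory placement), but the learn-then-exploit loop is the same shape.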
Pre-alpha: architecture plus a working PoC, not production code.

Blog posts with more detail:

- https://dev.to/peterc3dev/your-amd-apu-has-three-processors-...
- https://dev.to/peterc3dev/i-got-all-three-processors-talking...