On the one hand, large hand-written assembly programs usually don't get many optimizations: neither compiler-driven micro-optimizations, nor LTO, nor the high-level optimizations that come from complex algorithms. So normally I'd expect a large hand-written assembly program to be slower.
On the other hand, despite the large amount of code, most of the time is spent in the main loop, so maybe that part is tightly optimized? It doesn't look super tight at first sight, but I didn't look too closely.
It would be very interesting to see a benchmark of this vs. regular CPython vs. PyPy.
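For anyone who wants to try, here is a rough sketch of what such a comparison could start from: a tiny pure-Python workload that mostly stresses the interpreter's main loop, to be run unchanged under each interpreter. The workloads, sizes, and function names (`fib`, `loop_sum`, `bench`) are my own assumptions, and I haven't checked whether the assembly implementation supports the `time` module or f-strings.

```python
import time


def fib(n):
    # Naive recursion: dominated by function-call and dispatch overhead.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def loop_sum(n):
    # Simple arithmetic loop: dominated by bytecode dispatch and integer ops.
    total = 0
    for i in range(n):
        total += i * i
    return total


def bench(func, arg, repeats=3):
    # Return the best wall-clock time (in seconds) over several runs.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


if __name__ == "__main__":
    for name, func, arg in (("fib", fib, 25), ("loop_sum", loop_sum, 1_000_000)):
        print(f"{name}: {bench(func, arg):.4f} s")
```

Taking the best of several runs favors PyPy once its JIT has warmed up, so first-run times might be worth recording separately.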