The hardest part wasn't the AD itself, but managing memory safety during the "growth" phase. Since NOMA compiles to native code (LLVM), I had to ensure that when a weight buffer gets realloc'd (moved in memory):
- The gradient tape updates its pointers.
- The optimizer state (Adam moments) is correctly mapped to the new indices.
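To make those two requirements concrete, here's a minimal Python sketch. It is not NOMA's actual code: `GrowableParam`, `Tape`, and the choice to zero-initialize the moments of new weights are all assumptions for illustration. Rebuilding the lists in `grow` stands in for a realloc that moves the buffer, and the tape stores a (parameter handle, index) pair instead of a raw address so it still resolves after the move.

```python
import math

class GrowableParam:
    """Hypothetical container: weights plus per-weight Adam moments."""

    def __init__(self, values):
        self.w = list(values)
        self.m = [0.0] * len(self.w)  # first moment
        self.v = [0.0] * len(self.w)  # second moment
        self.t = 0                    # Adam step counter

    def grow(self, extra, init=0.0, preserve=True):
        # Building new lists mimics a realloc that moves the buffer.
        n = len(self.w)
        self.w = self.w + [init] * extra
        if preserve:
            # Old moments keep their indices; new slots start at zero.
            # Simplification: new slots share the old step counter.
            self.m = self.m + [0.0] * extra
            self.v = self.v + [0.0] * extra
        else:
            # "Resetting": throw away all optimizer state.
            self.m = [0.0] * (n + extra)
            self.v = [0.0] * (n + extra)
            self.t = 0

    def adam_step(self, grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.t += 1
        for i, g in enumerate(grads):
            self.m[i] = b1 * self.m[i] + (1 - b1) * g
            self.v[i] = b2 * self.v[i] + (1 - b2) * g * g
            m_hat = self.m[i] / (1 - b1 ** self.t)
            v_hat = self.v[i] / (1 - b2 ** self.t)
            self.w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

class Tape:
    """Hypothetical tape that records (param, index), never a raw buffer."""

    def __init__(self):
        self.entries = []

    def record(self, param, index):
        self.entries.append((param, index))

    def read(self, k):
        param, i = self.entries[k]
        return param.w[i]  # resolved through the handle at read time

p = GrowableParam([1.0, 2.0])
p.adam_step([0.5, -0.5])
tape = Tape()
tape.record(p, 1)
stale = p.w           # a raw reference taken before growth
p.grow(2)             # buffer "moves"; moments are preserved
```

After `grow`, `stale` still points at the old two-element buffer, while `tape.read(0)` goes through `p` and sees the live one. That's the dangling-pointer problem in miniature.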
The benchmark I linked shows the result: "Preserving" this state allows the model to continue converging immediately after resizing, whereas "Resetting" it causes a massive performance regression.
I'm specifically curious whether anyone here has experience handling SSA phi nodes during reverse-mode AD on the control flow graph. That's my next big hurdle for supporting complex control flow.
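To pin down what I mean, here's a toy Python sketch of the one approach I'm aware of: record which predecessor block was taken on the forward pass, then route the phi's adjoint only to the incoming value from that block on the reverse pass. The function names, block labels, and list-based tape are made up for illustration, and this says nothing about loops or how it should look at the LLVM IR level.

```python
def forward(x, tape):
    # Entry block: branch on x.
    if x > 0:
        a1 = 2.0 * x            # block B1
        tape.append(("B1", x))  # remember which edge reached the merge
        y = a1
    else:
        a2 = x * x              # block B2
        tape.append(("B2", x))
        y = a2
    # Merge block: y = phi([a1, B1], [a2, B2]); out = 3 * y
    return 3.0 * y

def backward(tape, d_out=1.0):
    d_y = 3.0 * d_out           # adjoint of the phi result
    pred, x = tape.pop()
    if pred == "B1":
        d_a1 = d_y              # adjoint flows only to a1
        return 2.0 * d_a1       # d(2x)/dx = 2
    d_a2 = d_y                  # adjoint flows only to a2
    return 2.0 * x * d_a2       # d(x^2)/dx = 2x

tape = []
out = forward(1.5, tape)
grad = backward(tape)
```

This works for a single diamond, but it's the general case (nested branches, loop-carried phis) where I'm unsure what the cleanest design is.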