The goal was to explore the performance limits of:
- Jacobian mixed-add
- Batch inversion using Montgomery's trick (sketched just below this list)
- Large-scale scalar stepping
- GPU memory coalescing strategies
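For reference, here's a minimal sketch of Montgomery's trick: one real field inversion per batch, plus three multiplications per element. A toy 32-bit prime field with Fermat inversion stands in for the project's 256-bit field, and all names here are illustrative, not taken from the repo:

```cuda
#include <cstdint>

constexpr uint32_t P = 4294967291u; // 2^32 - 5, a toy prime standing in for the real field

__host__ __device__ inline uint32_t mul_mod(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) % P);
}

// Fermat inversion a^(P-2) mod P: the expensive step Montgomery's trick
// amortizes, paying for it once per batch instead of once per element.
__host__ __device__ uint32_t inv_mod(uint32_t a) {
    uint32_t r = 1, e = P - 2;
    while (e) {
        if (e & 1) r = mul_mod(r, a);
        a = mul_mod(a, a);
        e >>= 1;
    }
    return r;
}

// Invert n nonzero elements in place: 1 inversion + 3(n-1) multiplications.
__host__ __device__ void batch_invert(uint32_t *a, uint32_t *prefix, int n) {
    prefix[0] = a[0];                                 // forward pass:
    for (int i = 1; i < n; ++i)                       // prefix[i] = a[0]*...*a[i]
        prefix[i] = mul_mod(prefix[i - 1], a[i]);
    uint32_t acc = inv_mod(prefix[n - 1]);            // the one real inversion
    for (int i = n - 1; i > 0; --i) {                 // backward pass peels off
        uint32_t inv_i = mul_mod(acc, prefix[i - 1]); // one element at a time:
        acc = mul_mod(acc, a[i]);                     // inv_i = a[i]^-1, acc now
        a[i] = inv_i;                                 // inverts prefix[i-1]
    }
    a[0] = acc;
}

__global__ void batch_invert_kernel(uint32_t *a, uint32_t *prefix,
                                    int batch, int batches) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b < batches)
        batch_invert(a + (size_t)b * batch, prefix + (size_t)b * batch, batch);
}
```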
On an RTX 5060 I'm getting ~2.5B Jacobian mixed-add operations per second.
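To make concrete what that figure counts, here's a sketch of one Jacobian mixed addition (Q affine, so Z2 = 1) over the same kind of toy 32-bit field; the 8M + 3S operation count in the comments is the standard one for these formulas, and none of these names come from the actual code:

```cuda
#include <cstdint>

constexpr uint32_t FP = 4294967291u; // 2^32 - 5, a toy prime

__host__ __device__ inline uint32_t fadd(uint32_t a, uint32_t b) {
    uint64_t s = (uint64_t)a + b;
    return (uint32_t)(s >= FP ? s - FP : s);
}
__host__ __device__ inline uint32_t fsub(uint32_t a, uint32_t b) {
    return a >= b ? a - b : (uint32_t)((uint64_t)a + FP - b);
}
__host__ __device__ inline uint32_t fmul(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) % FP);
}

struct Jac { uint32_t X, Y, Z; }; // Jacobian coords: affine point is (X/Z^2, Y/Z^3)
struct Aff { uint32_t x, y; };    // affine point, i.e. Z = 1

// Mixed add P + Q with Q affine costs 8M + 3S, vs ~12M + 4S for a general
// Jacobian add, which is why mixed-add is the operation worth counting.
// The H == 0 cases (doubling / point at infinity) are omitted in this sketch.
__host__ __device__ Jac mixed_add(Jac p, Aff q) {
    uint32_t z1z1 = fmul(p.Z, p.Z);              // Z1^2            1S
    uint32_t u2   = fmul(q.x, z1z1);             // X2*Z1^2         1M
    uint32_t s2   = fmul(q.y, fmul(p.Z, z1z1));  // Y2*Z1^3         2M
    uint32_t h    = fsub(u2, p.X);               // U2 - X1
    uint32_t r    = fsub(s2, p.Y);               // S2 - Y1
    uint32_t hh   = fmul(h, h);                  // H^2             1S
    uint32_t hhh  = fmul(h, hh);                 // H^3             1M
    uint32_t v    = fmul(p.X, hh);               // X1*H^2          1M
    Jac o;
    o.X = fsub(fsub(fmul(r, r), hhh), fadd(v, v));     // R^2 - H^3 - 2V  1S
    o.Y = fsub(fmul(r, fsub(v, o.X)), fmul(p.Y, hhh)); //                 2M
    o.Z = fmul(p.Z, h);                                // Z1*H            1M
    return o;
}
```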
Key design decisions:
- Little-endian limb layout for hardware efficiency (see the sketch after this list)
- Big-endian only for visualization
- Deterministic memory layout
- No dynamic allocation in hot paths
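To illustrate how those decisions can fit together on the GPU (a sketch under my own assumptions; the repo's actual layout may well differ), here is a struct-of-arrays layout of 256-bit values as eight little-endian 32-bit limbs, so a warp's loads coalesce and the hot loop touches only registers:

```cuda
#include <cstdint>

constexpr int LIMBS = 8; // 8 x 32-bit little-endian limbs = 256 bits

// Struct-of-arrays: limb k of element i lives at limbs[k * n + i], so when
// thread i of a warp reads limb k, the 32 loads hit consecutive addresses
// and coalesce into full transactions. The buffer is allocated once at
// startup, keeping the layout deterministic and the hot path allocation-free.
struct U256Soa {
    uint32_t *limbs; // n * LIMBS words
    int n;
};

__global__ void add_one_kernel(U256Soa v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= v.n) return;
    uint32_t x[LIMBS];                   // registers only: no dynamic
    for (int k = 0; k < LIMBS; ++k)      // allocation in the hot path
        x[k] = v.limbs[k * v.n + i];     // coalesced load of limb k
    uint64_t carry = 1;                  // add 1; little-endian order means
    for (int k = 0; k < LIMBS; ++k) {    // the carry walks up from limb 0
        carry += x[k];
        x[k] = (uint32_t)carry;
        carry >>= 32;
    }
    for (int k = 0; k < LIMBS; ++k)
        v.limbs[k * v.n + i] = x[k];     // coalesced store
}
```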
I'd love feedback from people working on ECC or GPU math.