The goal was to explore the performance limits of:
- Jacobian mixed-add
- Batch inversion using Montgomery's trick (sketched just below this list)
- Large-scale scalar stepping
- GPU memory coalescing strategies
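For reference, here's a minimal sketch of Montgomery's trick: one real field inversion per batch, plus three multiplications per element. A toy 32-bit prime field with Fermat inversion stands in for the project's 256-bit field, and all names here are illustrative, not taken from the repo:

```cuda
#include <cstdint>

constexpr uint32_t P = 4294967291u; // 2^32 - 5, a toy prime standing in for the real field

__host__ __device__ inline uint32_t mul_mod(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) % P);
}

// Fermat inversion a^(P-2) mod P: the expensive step Montgomery's trick
// amortizes, paying for it once per batch instead of once per element.
__host__ __device__ uint32_t inv_mod(uint32_t a) {
    uint32_t r = 1, e = P - 2;
    while (e) {
        if (e & 1) r = mul_mod(r, a);
        a = mul_mod(a, a);
        e >>= 1;
    }
    return r;
}

// Invert n nonzero elements in place: 1 inversion + 3(n-1) multiplications.
__host__ __device__ void batch_invert(uint32_t *a, uint32_t *prefix, int n) {
    prefix[0] = a[0];                                 // forward pass:
    for (int i = 1; i < n; ++i)                       // prefix[i] = a[0]*...*a[i]
        prefix[i] = mul_mod(prefix[i - 1], a[i]);
    uint32_t acc = inv_mod(prefix[n - 1]);            // the one real inversion
    for (int i = n - 1; i > 0; --i) {                 // backward pass peels off
        uint32_t inv_i = mul_mod(acc, prefix[i - 1]); // one element at a time:
        acc = mul_mod(acc, a[i]);                     // inv_i = a[i]^-1, acc now
        a[i] = inv_i;                                 // inverts prefix[i-1]
    }
    a[0] = acc;
}

__global__ void batch_invert_kernel(uint32_t *a, uint32_t *prefix,
                                    int batch, int batches) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b < batches)
        batch_invert(a + (size_t)b * batch, prefix + (size_t)b * batch, batch);
}
```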
On an RTX 5060 I'm getting ~2.5B Jacobian mixed-add operations per second.
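To make concrete what that figure counts, here's a sketch of one Jacobian mixed addition (Q affine, so Z2 = 1) over the same kind of toy 32-bit field; the 8M + 3S operation count in the comments is the standard one for these formulas, and none of these names come from the actual code:

```cuda
#include <cstdint>

constexpr uint32_t FP = 4294967291u; // 2^32 - 5, a toy prime

__host__ __device__ inline uint32_t fadd(uint32_t a, uint32_t b) {
    uint64_t s = (uint64_t)a + b;
    return (uint32_t)(s >= FP ? s - FP : s);
}
__host__ __device__ inline uint32_t fsub(uint32_t a, uint32_t b) {
    return a >= b ? a - b : (uint32_t)((uint64_t)a + FP - b);
}
__host__ __device__ inline uint32_t fmul(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) % FP);
}

struct Jac { uint32_t X, Y, Z; }; // Jacobian coords: affine point is (X/Z^2, Y/Z^3)
struct Aff { uint32_t x, y; };    // affine point, i.e. Z = 1

// Mixed add P + Q with Q affine costs 8M + 3S, vs ~12M + 4S for a general
// Jacobian add, which is why mixed-add is the operation worth counting.
// The H == 0 cases (doubling / point at infinity) are omitted in this sketch.
__host__ __device__ Jac mixed_add(Jac p, Aff q) {
    uint32_t z1z1 = fmul(p.Z, p.Z);              // Z1^2            1S
    uint32_t u2   = fmul(q.x, z1z1);             // X2*Z1^2         1M
    uint32_t s2   = fmul(q.y, fmul(p.Z, z1z1));  // Y2*Z1^3         2M
    uint32_t h    = fsub(u2, p.X);               // U2 - X1
    uint32_t r    = fsub(s2, p.Y);               // S2 - Y1
    uint32_t hh   = fmul(h, h);                  // H^2             1S
    uint32_t hhh  = fmul(h, hh);                 // H^3             1M
    uint32_t v    = fmul(p.X, hh);               // X1*H^2          1M
    Jac o;
    o.X = fsub(fsub(fmul(r, r), hhh), fadd(v, v));     // R^2 - H^3 - 2V  1S
    o.Y = fsub(fmul(r, fsub(v, o.X)), fmul(p.Y, hhh)); //                 2M
    o.Z = fmul(p.Z, h);                                // Z1*H            1M
    return o;
}
```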
Key design decisions:
- Little-endian limb layout for hardware efficiency (see the sketch after this list)
- Big-endian only for visualization
- Deterministic memory layout
- No dynamic allocation in hot paths
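To illustrate how those decisions can fit together on the GPU (a sketch under my own assumptions; the repo's actual layout may well differ), here is a struct-of-arrays layout of 256-bit values as eight little-endian 32-bit limbs, so a warp's loads coalesce and the hot loop touches only registers:

```cuda
#include <cstdint>

constexpr int LIMBS = 8; // 8 x 32-bit little-endian limbs = 256 bits

// Struct-of-arrays: limb k of element i lives at limbs[k * n + i], so when
// thread i of a warp reads limb k, the 32 loads hit consecutive addresses
// and coalesce into full transactions. The buffer is allocated once at
// startup, keeping the layout deterministic and the hot path allocation-free.
struct U256Soa {
    uint32_t *limbs; // n * LIMBS words
    int n;
};

__global__ void add_one_kernel(U256Soa v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= v.n) return;
    uint32_t x[LIMBS];                   // registers only: no dynamic
    for (int k = 0; k < LIMBS; ++k)      // allocation in the hot path
        x[k] = v.limbs[k * v.n + i];     // coalesced load of limb k
    uint64_t carry = 1;                  // add 1; little-endian order means
    for (int k = 0; k < LIMBS; ++k) {    // the carry walks up from limb 0
        carry += x[k];
        x[k] = (uint32_t)carry;
        carry >>= 32;
    }
    for (int k = 0; k < LIMBS; ++k)
        v.limbs[k * v.n + i] = x[k];     // coalesced store
}
```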
I'd love feedback from people working on ECC or GPU math.