1 point by shrecshrec 7 hours ago | 3 comments
  • shrecshrec 7 hours ago
    I implemented a full secp256k1 engine from scratch in C++ and CUDA with zero external dependencies (no GMP, no OpenSSL).

    The goal was to explore performance limits of:

    Jacobian mixed-add

    Batch inversion using Montgomery’s trick

    Large-scale scalar stepping

    GPU memory coalescing strategies
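
    For readers unfamiliar with Montgomery's trick: it replaces n field inversions with one inversion plus roughly 3n multiplications. The sketch below is only an illustration, not the code from this project. It assumes a toy 30-bit prime (`kP`) in place of the 256-bit secp256k1 field prime so it fits in plain `uint64_t`, and the names `mul_mod`, `inv_mod` and `batch_invert` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime modulus standing in for the secp256k1 field prime (assumption).
constexpr std::uint64_t kP = 1000000007ULL;

// Operands are < 2^30, so the product fits in 64 bits.
std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) { return (a * b) % kP; }

// Square-and-multiply exponentiation mod kP.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp) {
    std::uint64_t result = 1;
    base %= kP;
    while (exp > 0) {
        if (exp & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        exp >>= 1;
    }
    return result;
}

// Single inversion via Fermat's little theorem: a^(p-2) = a^-1 mod p.
std::uint64_t inv_mod(std::uint64_t a) { return pow_mod(a, kP - 2); }

// Inverts every element in place with one call to inv_mod.
// Precondition: every element is nonzero and already reduced mod kP.
void batch_invert(std::vector<std::uint64_t>& xs) {
    const std::size_t n = xs.size();
    if (n == 0) return;

    // Forward pass: prefix[i] = xs[0] * ... * xs[i-1].
    std::vector<std::uint64_t> prefix(n);
    std::uint64_t acc = 1;
    for (std::size_t i = 0; i < n; ++i) {
        prefix[i] = acc;
        acc = mul_mod(acc, xs[i]);
    }

    // The only real inversion: inverse of the product of all elements.
    std::uint64_t inv = inv_mod(acc);

    // Backward pass: peel off one element at a time.
    for (std::size_t i = n; i-- > 0;) {
        const std::uint64_t xi = xs[i];
        xs[i] = mul_mod(inv, prefix[i]);  // (x0..xi)^-1 * (x0..x(i-1)) = xi^-1
        inv = mul_mod(inv, xi);           // now (x0..x(i-1))^-1
    }
}
```

    In a point-addition pipeline the elements would typically be the denominators needed to bring a batch of results back to affine form, so the one expensive inversion is shared across the whole batch.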

    On an RTX 5060 I’m getting ~2.5B mixed-add operations/sec.
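
    To make "mixed-add" concrete: it adds an affine point (Z = 1 implied) to a Jacobian point, which saves several field multiplications compared with a full Jacobian add. Below is a minimal sketch of the textbook formulas, not the kernel measured above. It assumes a toy field with prime 17 and the curve y^2 = x^3 + 7 over that field so the arithmetic fits in `uint64_t`; all names (`fmul`, `fsub`, `mixed_add`, `to_affine`) are hypothetical, and the special cases (P = ±Q, point at infinity) are deliberately left out.

```cpp
#include <cstdint>

// Toy field prime (assumption); the curve is y^2 = x^3 + 7 over this field.
constexpr std::uint64_t kP = 17;

struct Affine   { std::uint64_t x, y; };
struct Jacobian { std::uint64_t X, Y, Z; };  // represents (X/Z^2, Y/Z^3)

// All inputs are expected to be reduced mod kP.
std::uint64_t fmul(std::uint64_t a, std::uint64_t b) { return (a * b) % kP; }
std::uint64_t fsub(std::uint64_t a, std::uint64_t b) { return (a + kP - b) % kP; }

std::uint64_t fpow(std::uint64_t base, std::uint64_t exp) {
    std::uint64_t result = 1;
    while (exp > 0) {
        if (exp & 1) result = fmul(result, base);
        base = fmul(base, base);
        exp >>= 1;
    }
    return result;
}

// Inversion via Fermat's little theorem.
std::uint64_t finv(std::uint64_t a) { return fpow(a, kP - 2); }

// Jacobian + affine addition.
// Precondition: p != +/-q and neither operand is the point at infinity.
Jacobian mixed_add(const Jacobian& p, const Affine& q) {
    const std::uint64_t z1z1 = fmul(p.Z, p.Z);
    const std::uint64_t u2 = fmul(q.x, z1z1);              // q.x scaled to p's Z
    const std::uint64_t s2 = fmul(q.y, fmul(p.Z, z1z1));   // q.y scaled to p's Z
    const std::uint64_t h = fsub(u2, p.X);
    const std::uint64_t r = fsub(s2, p.Y);
    const std::uint64_t h2 = fmul(h, h);
    const std::uint64_t h3 = fmul(h2, h);
    const std::uint64_t v = fmul(p.X, h2);
    const std::uint64_t x3 = fsub(fsub(fmul(r, r), h3), fmul(2, v));
    const std::uint64_t y3 = fsub(fmul(r, fsub(v, x3)), fmul(p.Y, h3));
    const std::uint64_t z3 = fmul(p.Z, h);
    return {x3, y3, z3};
}

// Converts back to affine; this is the step that needs an inversion.
Affine to_affine(const Jacobian& p) {
    const std::uint64_t zi = finv(p.Z);
    const std::uint64_t zi2 = fmul(zi, zi);
    return {fmul(p.X, zi2), fmul(p.Y, fmul(zi2, zi))};
}
```

    On this toy curve, adding the affine point (2, 7) to the Jacobian point (9, 16, 3), which represents (1, 5), and converting back gives (1, 12), matching the plain affine chord formula. The addition itself never inverts; only `to_affine` does.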

    Key design decisions:

    Little-endian limb layout for hardware efficiency

    Big-endian only for visualization

    Deterministic memory layout

    No dynamic allocation in hot paths
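
    As an illustration of the first two points (not the project's actual types): a 256-bit value can be stored as four 64-bit limbs with the least significant limb first, so carries propagate in increasing index order, and only converted to big-endian when printed. The `U256` struct and the `add` and `to_hex` helpers below are hypothetical names for this sketch, which also uses a fixed-size array so nothing is allocated in the arithmetic path.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <string>

// 256-bit unsigned integer; limb[0] is the least significant 64 bits.
struct U256 {
    std::array<std::uint64_t, 4> limb{};
};

// out = a + b mod 2^256; returns the carry out of the top limb (0 or 1).
std::uint64_t add(U256& out, const U256& a, const U256& b) {
    std::uint64_t carry = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint64_t s = a.limb[i] + carry;
        const std::uint64_t c1 = s < carry;        // wrapped when adding the carry
        s += b.limb[i];
        const std::uint64_t c2 = s < b.limb[i];    // wrapped when adding b
        out.limb[i] = s;
        carry = c1 | c2;                           // at most one of them can be set
    }
    return carry;
}

// Big-endian hex string for display only: most significant limb first.
std::string to_hex(const U256& x) {
    char buf[65];
    std::snprintf(buf, sizeof buf, "%016llx%016llx%016llx%016llx",
                  static_cast<unsigned long long>(x.limb[3]),
                  static_cast<unsigned long long>(x.limb[2]),
                  static_cast<unsigned long long>(x.limb[1]),
                  static_cast<unsigned long long>(x.limb[0]));
    return std::string(buf);
}
```

    Keeping the storage order and the display order separate like this means the hot loops never have to reverse limbs; the byte swap is paid only when a value is shown to a human.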

    Would love feedback from people working on ECC or GPU math.

  • shrecshrec 4 hours ago
    I would be glad to hear any suggestions about future improvements and ideas.