Some of you may remember the purego discussion from 2023 (https://news.ycombinator.com/item?id=34763681, 268 points). It proved that calling C from Go without CGO is viable. We built on that foundation.
Why another FFI library?
We needed to call wgpu-native (WebGPU) from Go — thousands of FFI calls per frame. purego's reflect-based dispatch (RegisterFunc → reflect.MakeFunc → sync.Pool per call) was too much overhead for our use case. We also needed struct passing by value and callback float returns, which purego doesn't support.
So we took a libffi-style approach:
  cif := &types.CallInterface{}
  ffi.PrepareCallInterface(cif, types.DefaultCall, retType, argTypes) // once
  ffi.CallFunction(cif, fnPtr, &result, args) // many times, zero alloc
Type classification is pre-computed at prepare time, not at call time. The call path is: Go → runtime.cgocall → hand-written asm → C function. The asm loads GP/FP registers per ABI from a flat argument buffer, with no interpretation at call time.
The assembly:
Three hand-written stubs: System V AMD64 (RDI,RSI,RDX,RCX,R8,R9 + XMM0-7), Win64 (RCX,RDX,R8,R9 + XMM0-3, 32-byte shadow), AAPCS64 (X0-X7, D0-D7, HFA support). Each is ~100 lines of Plan 9 asm.
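One difference between the two amd64 ABIs that a stub has to respect: System V counts integer and float arguments independently, while Win64 assigns registers by argument position. The sketch below illustrates that rule in Go for scalar arguments only. The function names are made up for this example, and structs, varargs, and stack layout are left out.

```go
package main

import "fmt"

var (
	sysvGP  = []string{"RDI", "RSI", "RDX", "RCX", "R8", "R9"}
	sysvFP  = []string{"XMM0", "XMM1", "XMM2", "XMM3", "XMM4", "XMM5", "XMM6", "XMM7"}
	win64GP = []string{"RCX", "RDX", "R8", "R9"}
	win64FP = []string{"XMM0", "XMM1", "XMM2", "XMM3"}
)

// assignSysV keeps separate counters for integer and float arguments.
func assignSysV(isFloat []bool) []string {
	out := make([]string, len(isFloat))
	gp, fp := 0, 0
	for i, f := range isFloat {
		switch {
		case f && fp < len(sysvFP):
			out[i] = sysvFP[fp]
			fp++
		case !f && gp < len(sysvGP):
			out[i] = sysvGP[gp]
			gp++
		default:
			out[i] = "stack"
		}
	}
	return out
}

// assignWin64 picks the register from the argument's position:
// argument i uses either the i-th GP or the i-th XMM register.
func assignWin64(isFloat []bool) []string {
	out := make([]string, len(isFloat))
	for i, f := range isFloat {
		switch {
		case i >= len(win64GP):
			out[i] = "stack"
		case f:
			out[i] = win64FP[i]
		default:
			out[i] = win64GP[i]
		}
	}
	return out
}

func main() {
	sig := []bool{false, true, false} // f(int, double, int)
	fmt.Println("System V:", assignSysV(sig))
	fmt.Println("Win64:   ", assignWin64(sig))
}
```

For `f(int, double, int)` this gives RDI, XMM0, RSI under System V and RCX, XMM1, R8 under Win64, which is why a single stub cannot serve both.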
What goffi does that purego doesn't:
- Struct passing and return by value (returned in RAX up to 8B, in RAX+RDX for 9-16B, via sret pointer above 16B).
- Callback float returns via XMM0. purego panics on float/double return from callbacks.
- Typed errors: 5 error types usable with errors.As() instead of generic errors.
- Context support for timeouts and cancellation.
What purego does better (being honest):
- purego supports 8 GOARCHes and 20+ OS×ARCH combinations; we cover 6 targets (amd64×4 + arm64×2).
- purego auto-marshals strings, bools, and slices; we work with raw unsafe.Pointer.
- purego has a much simpler API: one line to bind a function.
- purego has full Darwin ARM64 variadic stack packing; we don't yet.
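To show what "raw unsafe.Pointer" means in practice, here is roughly what a caller has to do by hand for a string argument when nothing auto-marshals it. The helper name `cString` is hypothetical and not part of goffi. It uses only the standard library.

```go
package main

import (
	"fmt"
	"runtime"
	"unsafe"
)

// cString returns a NUL-terminated copy of s, as C expects for char*.
func cString(s string) []byte {
	b := make([]byte, len(s)+1) // the extra byte stays zero and acts as the terminator
	copy(b, s)
	return b
}

func main() {
	buf := cString("hello")
	ptr := unsafe.Pointer(&buf[0]) // this is what would be passed as the char* argument

	fmt.Println(len(buf), buf[len(buf)-1] == 0, ptr != nil)

	// The backing array must stay reachable until the foreign call returns.
	runtime.KeepAlive(buf)
}
```

The trade-off is the usual one: more boilerplate at the call site, but no reflection or hidden allocation inside the FFI layer.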
Related projects worth knowing:
JupiterRider/ffi (https://github.com/JupiterRider/ffi) — pure Go binding for native libffi via purego. Supports variadic and struct pass/return, but requires libffi.so at runtime. If you need variadic calls today, that's a good option.
Where we use it:
goffi powers gogpu (https://github.com/gogpu), a pure Go GPU computing platform with WebGPU bindings and zero CGO. It's also used in Born (https://github.com/born-ml/born), an ML framework for Go with a PyTorch-like API, type-safe tensors, and automatic differentiation. Both projects ship as single binaries with no C toolchain required. 89% test coverage, CI on Linux/Windows/macOS, MIT license.
We wrote a deep dive on the architecture, assembly, and callback mechanism: https://dev.to/kolkov/goffi-zero-cgo-foreign-function-interf...
Happy to discuss FFI internals, ABI details, or the trade-offs between the different approaches!