Under those same conditions, you can quite readily do ~100 Gb/core-second in software with feature parity, given proper protocol design and implementation. (That ignores encryption, which will bottleneck you to 30-50 Gb/core-second on modern chips with AES acceleration instructions.)
We measured:
1. Association state + per-path CC/RTO, timers, RTT tracking, cwnd, etc.
2. Selective ACKs and retransmit logic.
3. Chunk framing + TSN sequences.
4. Ordered vs. unordered delivery, and fragmentation/reassembly.
...and much more.
Also, our vnet-based implementation isn't just a dumb buffer: we have on-the-wire packet validation, SCTP parsing, CRC32c validation, and a deterministic network-conditions emulator that runs under real-time conditions.
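To give a feel for what the per-packet CRC32c validation costs, here is a minimal sketch using only Go's standard library. It is not Pion's actual code: the function names are hypothetical, and storing the checksum little-endian in bytes 8-11 of the common header is an assumption you should check against whatever implementation you interoperate with.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32c (Castagnoli) table from the standard library.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

const (
	commonHeaderLen = 12 // src port, dst port, verification tag, checksum
	checksumOffset  = 8  // checksum occupies bytes 8..11 of the common header
)

// sctpChecksum computes CRC32c over the whole packet, treating the
// checksum field as four zero bytes. Hypothetical helper name.
func sctpChecksum(pkt []byte) uint32 {
	var zeros [4]byte
	sum := crc32.Update(0, castagnoli, pkt[:checksumOffset])
	sum = crc32.Update(sum, castagnoli, zeros[:])
	return crc32.Update(sum, castagnoli, pkt[checksumOffset+4:])
}

// setChecksum writes the checksum into the header.
// Assumption: the value is stored little-endian.
func setChecksum(pkt []byte) {
	binary.LittleEndian.PutUint32(pkt[checksumOffset:], sctpChecksum(pkt))
}

// validChecksum reports whether the packet is long enough to hold a
// common header and carries a matching checksum.
func validChecksum(pkt []byte) bool {
	if len(pkt) < commonHeaderLen {
		return false
	}
	got := binary.LittleEndian.Uint32(pkt[checksumOffset:])
	return got == sctpChecksum(pkt)
}

func main() {
	// 12-byte common header followed by 4 bytes of made-up payload.
	pkt := []byte{0x13, 0x88, 0x13, 0x89, 0, 0, 0, 1, 0, 0, 0, 0, 0xde, 0xad, 0xbe, 0xef}
	setChecksum(pkt)
	fmt.Println("valid after setChecksum:", validChecksum(pkt))

	pkt[13] ^= 0x01 // simulate a single bit flip on the wire
	fmt.Println("valid after corruption:", validChecksum(pkt))
}
```

Every packet pays for a full pass over its bytes here, which is exactly the kind of work a raw ceiling benchmark would skip.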
Sure, you can get 100 Gb/core-second if you bypass all of that and just do huge batching.
The blog post's claim is only that, under the same SCTP semantics and the same test harness, enabling RACK is a huge win. It is not about the absolute ceiling of in-process "virtual network" sockets :)
If we ever want a true ceiling number, we could add a separate fast path (e.g., a dump-writer / sink that skips most validation) or validate after the run, but that's not in scope right now. Our scope was: (1) validate Pion/SCTP PRs and (2) compare performance against other branches and versions. So this is a relative benchmark under identical conditions.
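For anyone wondering what "relative" means in practice, here is a rough sketch of the arithmetic. The `runResult` type, its fields, and the numbers in `main` are all made up for illustration; they are not output from our benchmark tool.

```go
package main

import "fmt"

// runResult is a hypothetical summary of one benchmark run.
type runResult struct {
	Name       string
	Bytes      uint64  // payload bytes delivered to the receiver
	CPUSeconds float64 // CPU time consumed, summed over all cores
}

// gbitPerCoreSecond converts a run into gigabits per core-second.
func gbitPerCoreSecond(r runResult) float64 {
	if r.CPUSeconds <= 0 {
		return 0
	}
	return float64(r.Bytes) * 8 / 1e9 / r.CPUSeconds
}

// speedup returns how many times faster candidate is than baseline.
// It only means something if both runs used the same harness and the
// same emulated network conditions.
func speedup(baseline, candidate runResult) float64 {
	b := gbitPerCoreSecond(baseline)
	if b == 0 {
		return 0
	}
	return gbitPerCoreSecond(candidate) / b
}

func main() {
	// Illustrative numbers only, not measured results.
	base := runResult{Name: "baseline", Bytes: 1_000_000_000, CPUSeconds: 2}
	cand := runResult{Name: "candidate", Bytes: 1_000_000_000, CPUSeconds: 1}

	fmt.Printf("%s: %.2f Gb/core-second\n", base.Name, gbitPerCoreSecond(base))
	fmt.Printf("%s: %.2f Gb/core-second\n", cand.Name, gbitPerCoreSecond(cand))
	fmt.Printf("speedup: %.2fx\n", speedup(base, cand))
}
```

The ratio is what we care about; the absolute Gb/core-second figures include all the validation overhead in both runs.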
On head-of-line blocking: we have a pending RFC 8260 message interleaving (I-DATA) implementation, and we've tested with it; it helps reduce HoL blocking on the sender side (especially around fragmentation). Our benchmark tool has a flag to run with interleaving, and we tested it quite a bit. We plan to release it in Jan.