10-phase optimization of a UDP FEC tunnel — SIMD, io_uring, GSO, 5 ISAs, Win64
GitHub Actions runner — UDP loopback, 1400B packets, median of 3 runs
Higher is better (Mbps)
GSO (UDP_SEGMENT cmsg, kernel 4.18+) packs 30 FEC shards into one sendmsg call.
The kernel then handles one sk_buff and one route lookup instead of 30. Falls back to sendmmsg on older kernels.
io_uring multishot recv (kernel 5.19+) pre-posts recv buffers. Falls back to recvfrom when unavailable.
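
The GSO send path described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `gso_attach` and `send_shards_gso` and the `gso_ctrl` union are hypothetical, and the sketch assumes all shards sit back to back in one buffer and share the same length. `UDP_SEGMENT` and `SOL_UDP` are the real Linux constants; the fallback defines only cover older libc headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/udp.h>

#ifndef SOL_UDP
#define SOL_UDP 17          /* same value as IPPROTO_UDP */
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103     /* from linux/udp.h; missing in older libc headers */
#endif

/* Control buffer sized and aligned for exactly one UDP_SEGMENT cmsg. */
union gso_ctrl {
    char buf[CMSG_SPACE(sizeof(uint16_t))];
    struct cmsghdr align;
};

/* Hypothetical helper: attach a UDP_SEGMENT cmsg so the kernel splits the
 * payload into datagrams of seg_size bytes (the last one may be shorter). */
static void gso_attach(struct msghdr *msg, union gso_ctrl *ctrl, uint16_t seg_size)
{
    assert(seg_size > 0);
    memset(ctrl, 0, sizeof *ctrl);
    msg->msg_control = ctrl->buf;
    msg->msg_controllen = sizeof ctrl->buf;

    struct cmsghdr *cm = CMSG_FIRSTHDR(msg);
    cm->cmsg_level = SOL_UDP;
    cm->cmsg_type = UDP_SEGMENT;
    cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
    memcpy(CMSG_DATA(cm), &seg_size, sizeof seg_size);
}

/* Hypothetical helper: send n_shards * shard_len contiguous bytes as
 * n_shards datagrams with a single sendmsg call. Returns sendmsg's result;
 * a caller would switch to sendmmsg if this fails on an older kernel. */
static ssize_t send_shards_gso(int fd, const struct sockaddr_in *dst,
                               const void *shards, size_t n_shards,
                               uint16_t shard_len)
{
    union gso_ctrl ctrl;
    struct iovec iov = { .iov_base = (void *)shards,
                         .iov_len = n_shards * shard_len };
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    msg.msg_name = (void *)dst;
    msg.msg_namelen = sizeof *dst;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    gso_attach(&msg, &ctrl, shard_len);
    return sendmsg(fd, &msg, 0);
}
```

With 30 shards of 1400 bytes the payload is 42,000 bytes, which stays under the 65,507-byte UDP payload limit that a single GSO send must respect.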
CI-validated on every push — 5 ISAs, 32 cross-architecture interop tests
| Architecture | SIMD Tier | CRC32C | Cook XOR | io_uring | GSO | Binary |
|---|---|---|---|---|---|---|
| x86_64 | AVX2/SSSE3 | SSE4.2 HW | SSE2 128b | ✓ | ✓ | Linux static |
| aarch64 | NEON 2x | ARMv8 HW | NEON 128b | ✓ | ✓ | Linux static |
| mips (big-endian) | scalar | SW slice8 | scalar | — | ✓ | Linux static |
| PowerPC e500v2 | SPE 64b | SW slice8 | SPE evxor | — | ✓ | OpenWrt musl |
| RISC-V 64 | scalar | SW slice8 | scalar | — | ✓ | OpenWrt musl |
| Windows x86_64 | AVX2/SSSE3 | SSE4.2 HW | SSE2 128b | IOCP | ✗ | wepoll .exe |
SIMD dispatch happens at runtime (CPUID or HWCAP, depending on the architecture). A scalar fallback is always available.
32 cross-architecture interop tests (4 arch pairs × 4 FEC/key configs) pass on every CI push.
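
Runtime dispatch with a scalar fallback, as noted under the table, might look like the sketch below. It is an assumption-laden illustration: the names `xor_fn`, `xor_scalar`, `xor_sse2` and `pick_xor` are hypothetical, only the x86_64 SSE2 tier is shown, and the CPUID check uses the GCC/Clang builtin `__builtin_cpu_supports`. SSE2 is always present on x86_64, so the check here only demonstrates the pattern that wider tiers would follow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

typedef void (*xor_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Portable fallback: dst[i] ^= src[i], one byte at a time. */
static void xor_scalar(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

#if defined(__x86_64__)
/* 128-bit tier: XOR 16 bytes per step, then finish the tail in scalar code. */
static void xor_sse2(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
    xor_scalar(dst + i, src + i, n - i);
}
#endif

/* Pick an implementation once at startup; callers keep the returned pointer. */
static xor_fn pick_xor(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return xor_sse2;
#endif
    return xor_scalar;
}
```

Resolving the function pointer once keeps the per-packet hot path free of feature checks, and every tier can be verified against `xor_scalar` for identical output.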
GitHub Actions CI runner — AVX2 + SSE4.2 vs scalar — lower is better (ns/op)
Baseline: commit 0a87d34 (scalar) · Optimized: commit 2f0ac6a (AVX2/SSSE3 + CRC32C SSE4.2 + pre-alloc)
qemu-ppc-static -cpu e500v2 on GitHub Actions — target: TP-Link TL-WDR4900
PPC numbers come from QEMU emulation, so absolute timings do not reflect real hardware performance.
Baseline: commit 66df8ad · Current: commit 2f0ac6a
SPE (Signal Processing Engine): 64-bit evldd/evxor/evstdd for the Cook XOR.
No AltiVec, no hardware CRC — addmul1 and CRC32C are software-only.
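
For targets without a hardware CRC instruction, a software CRC32C can be sketched as below. This is a simplified illustration rather than the project's routine: it processes one byte per table lookup, whereas the "SW slice8" entries in the table refer to slice-by-8, which uses eight tables to consume eight bytes per step and produces the same checksum. The function name `crc32c_sw` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_table[256];
static int crc32c_table_ready;

/* Build the 256-entry lookup table, one bit at a time per entry. */
static void crc32c_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0x82F63B78u : c >> 1;
        crc32c_table[i] = c;
    }
    crc32c_table_ready = 1;
}

/* Byte-at-a-time CRC32C. The lazy table init is not thread-safe;
 * call crc32c_init() once at startup in threaded code. */
static uint32_t crc32c_sw(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (!crc32c_table_ready)
        crc32c_init();

    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = crc32c_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}
```

A quick sanity check is the standard CRC32C test vector: the ASCII string "123456789" yields 0xE3069283. A hardware path using the SSE4.2 or ARMv8 CRC instructions has to return the same value for the same input, which makes the software routine a convenient reference in interop tests.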
Windows 10/11 — wepoll + IOCP pre-posted recv — no UDP GSO equivalent
Windows has no UDP_SEGMENT (GSO) equivalent. Each FEC shard requires a separate WSASendTo call.
IOCP pre-posted recv (Phase 9): GetQueuedCompletionStatusEx batch drain, +25% over blocking recvfrom.
RIO slab send (Phase 10): prototyped but blocked. Sockets created with WSA_FLAG_REGISTERED_IO reject overlapped
WSARecvFrom (error 10038, WSAENOTSOCK), so a hybrid of IOCP recv and RIO send is not possible.
No path to match Linux GSO performance without a Winsock API change.