UDPspeeder Optimization Results

10-phase optimization of a UDP FEC tunnel — SIMD, io_uring, GSO, 5 ISAs, Win64

970 Mbps no-FEC (Linux)
690 Mbps FEC+GSO (Linux)
15x addmul1 SIMD speedup
5 architectures
6 release binaries

Tunnel Throughput (CI loopback)

GitHub Actions runner — UDP loopback, 1400B packets, median of 3 runs

+49% no-FEC vs baseline
+21% FEC with GSO
+85% FEC vs baseline
30:1 GSO send batching

Linux — io_uring recv + GSO send slab

Higher is better (Mbps)

GSO (UDP_SEGMENT cmsg, kernel 4.18+) packs 30 FEC shards into one sendmsg call: one sk_buff and one route lookup instead of 30. Falls back to sendmmsg on older kernels.
io_uring multishot recv (kernel 5.19+) pre-posts receive buffers, so one submission completes many datagrams. Falls back to recvfrom on older kernels.
Throughput trend dashboard →


Platform & Architecture Support

CI-validated on every push — 5 ISAs, 32 cross-architecture interop tests

Architecture      | SIMD Tier  | CRC32C    | Cook XOR  | io_uring      | GSO | Binary
x86_64            | AVX2/SSSE3 | SSE4.2 HW | SSE2 128b | ✓             | ✓   | Linux static
aarch64           | NEON 2x    | ARMv8 HW  | NEON 128b | ✓             | ✓   | Linux static
mips (big-endian) | scalar     | SW slice8 | scalar    | ✓             | ✓   | Linux static
PowerPC e500v2    | SPE 64b    | SW slice8 | SPE evxor | ✓             | ✓   | OpenWrt musl
RISC-V 64         | scalar     | SW slice8 | scalar    | ✓             | ✓   | OpenWrt musl
Windows x86_64    | AVX2/SSSE3 | SSE4.2 HW | SSE2 128b | IOCP (wepoll) | —   | .exe

SIMD dispatch is runtime (CPUID/HWCAP); a scalar fallback is always available. 32 cross-architecture interop tests (4 arch pairs × 4 FEC/key configs, both directions) pass on every CI push.
Download release binaries →


x86_64: SIMD Optimized vs Scalar Baseline

GitHub Actions CI runner — AVX2 + SSE4.2 vs scalar — lower is better (ns/op)

15x addmul1/1500B
14x rs_encode k10/15
9x rs_decode k10/15
2.7x crc32c vs crc32_old

Baseline: commit 0a87d34 (scalar) — Optimized: commit 2f0ac6a (AVX2/SSSE3 + CRC32C SSE4.2 + pre-alloc)
Microbench trend dashboard →


PowerPC e500v2: CRC32C + SPE XOR vs Baseline (QEMU)

qemu-ppc-static -cpu e500v2 on GitHub Actions — target: TP-Link TL-WDR4900

1.4x crc32c vs crc32_old
1.5x rs_encode k10/15
1.3x rs_decode k10/15
Baseline (byte-at-a-time XOR, old CRC32) vs current (SPE XOR, CRC32C, pre-alloc)

PPC numbers via QEMU emulation — absolute timings do not reflect real hardware performance.
Baseline: commit 66df8ad — Current: commit 2f0ac6a
SPE (Signal Processing Extension): 64-bit evldd/evxor/evstdd for XOR cook. No AltiVec, no hardware CRC — addmul1 and CRC32C are software-only.
PPC trend (current) · PPC trend (baseline)


Windows x86_64

Windows 10/11 — wepoll + IOCP pre-posted recv — no UDP GSO equivalent

193 Mbps no-FEC
122 Mbps FEC 20:10
+25% IOCP recv gain

Linux vs Windows Throughput

Windows has no UDP_SEGMENT (GSO) equivalent. Each FEC shard requires a separate WSASendTo call.
IOCP pre-posted recv (Phase 9): GetQueuedCompletionStatusEx batch drain, +25% over blocking recvfrom.
RIO slab send (Phase 10): prototyped but blocked. WSA_FLAG_REGISTERED_IO sockets reject overlapped WSARecvFrom (WSAENOTSOCK, error 10038), so a hybrid IOCP-recv + RIO-send design on a single socket is not possible.
No path to match Linux GSO performance without a Winsock API change.