CVPN-2346 Implement GSO offload on lightway-server (#413)
Open
kp-samuel-tam wants to merge 7 commits into
Code coverage summary for 431f4c3: ✅ Region coverage 66% passes
Add the `gso` module to lightway-core with VirtioNetHdr definition, checksum helpers, and segment build/count functions for splitting GSO superpackets into individual segments with correct per-segment header fixups (IP ID, TCP seq, checksums). Also add tun-rs workspace dependency to lightway-core and lightway-server Cargo.toml.
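The per-segment header fixups described above are mostly small integer arithmetic plus an RFC 1071 one's-complement checksum. A self-contained sketch of that core (the helper names here are hypothetical illustrations, not the actual `gso` module code):

```rust
/// RFC 1071 one's-complement checksum over a byte slice, as used for
/// IPv4/TCP/UDP header checksums.
fn checksum(data: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    let mut chunks = data.chunks_exact(2);
    for c in &mut chunks {
        sum += u32::from(u16::from_be_bytes([c[0], c[1]]));
    }
    // An odd trailing byte is padded with zero on the right.
    if let [b] = chunks.remainder() {
        sum += u32::from(u16::from_be_bytes([*b, 0]));
    }
    // Fold carries back into the low 16 bits.
    while sum > 0xffff {
        sum = (sum >> 16) + (sum & 0xffff);
    }
    !(sum as u16)
}

/// Per-segment fixups for segment `i` of a GSO superpacket: the IP ID
/// increments by one per segment and the TCP sequence number advances by
/// one MSS (`gso_size`) per segment.
fn segment_fixup(base_ip_id: u16, base_seq: u32, gso_size: u32, i: u32) -> (u16, u32) {
    (
        base_ip_id.wrapping_add(i as u16),
        base_seq.wrapping_add(i * gso_size),
    )
}

fn main() {
    // Segment 2 of a 1460-byte-MSS superpacket: IP ID +2, seq +2920.
    assert_eq!(segment_fixup(100, 1000, 1460, 2), (102, 3920));
    // An all-zero buffer checksums to 0xffff (one's complement of 0).
    assert_eq!(checksum(&[0u8; 8]), 0xffff);
}
```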
Add the `send_gso` method to the OutsideIOSendCallback trait for sending concatenated wire packets via kernel GSO (UDP_SEGMENT). Include todo!() stub implementations in client TCP/UDP, server TCP, and test harnesses to satisfy the trait contract.
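A minimal sketch of what such a trait extension could look like, with a default `todo!()` stub so non-UDP transports still satisfy the contract (the trait name matches the PR text, but the signature and the toy implementation are assumptions, not the crate's actual API):

```rust
use std::io::{IoSlice, Result};

// Hypothetical shape of the vectored-send extension described above.
trait OutsideIOSendCallback {
    fn send(&self, buf: &[u8]) -> Result<usize>;

    /// Send several concatenated wire packets as one kernel-GSO
    /// (UDP_SEGMENT) sendmsg. The default stub lets TCP transports and
    /// test harnesses satisfy the trait without implementing GSO.
    fn send_gso(&self, _bufs: &[IoSlice<'_>], _segment_size: u16) -> Result<usize> {
        todo!("GSO send not supported on this transport")
    }
}

// Toy implementation that just counts the bytes it would have sent.
struct CountingSink;

impl OutsideIOSendCallback for CountingSink {
    fn send(&self, buf: &[u8]) -> Result<usize> {
        Ok(buf.len())
    }

    fn send_gso(&self, bufs: &[IoSlice<'_>], _segment_size: u16) -> Result<usize> {
        Ok(bufs.iter().map(|b| b.len()).sum())
    }
}

fn main() {
    let sink = CountingSink;
    let a = [0u8; 1400];
    let b = [0u8; 1400];
    let bufs = [IoSlice::new(&a), IoSlice::new(&b)];
    assert_eq!(sink.send_gso(&bufs, 1400).unwrap(), 2800);
    assert_eq!(sink.send(&a).unwrap(), 1400);
}
```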
Add gso_buf/gso_size fields to TlsIOAdapter so the wolfssl send() callback can buffer raw encrypted segments during GSO processing. Add udp_send_gso to wrap buffered segments with wire headers and send as one sendmsg via the vectored send_gso callback. The implementation uses a zero-copy fast path when no outside plugins are configured: scatter-gather via iovec with a shared header buffer and borrowed slices of the encrypted segment buffer. The plugin path builds each segment as its own BytesMut and enforces the uniform-stride requirement of UDP_SEGMENT.
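The uniform-stride requirement mentioned above means every segment handed to UDP_SEGMENT must be exactly `gso_size` bytes, except possibly the last. A sketch of that invariant check (a hypothetical helper, not code from the PR):

```rust
/// Check the UDP_SEGMENT uniform-stride invariant: every segment must be
/// exactly `gso_size` bytes long except possibly the final one, which may
/// be shorter but not empty.
fn is_uniform_stride(segments: &[&[u8]], gso_size: usize) -> bool {
    match segments.split_last() {
        None => false,
        Some((last, head)) => {
            head.iter().all(|s| s.len() == gso_size)
                && !last.is_empty()
                && last.len() <= gso_size
        }
    }
}

fn main() {
    let full = vec![0u8; 1400];
    let tail = vec![0u8; 300];
    // Full segments followed by one short tail: valid.
    assert!(is_uniform_stride(&[&full, &full, &tail], 1400));
    // A short segment in the middle breaks the invariant.
    assert!(!is_uniform_stride(&[&full, &tail, &full], 1400));
}
```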
Add inside_data_received_gso and send_to_outside_gso methods to Connection. These process a GSO superpacket as a single packet through plugins/encoder, then split into per-segment encrypted frames and collect into a wire buffer for batch send via UDP_SEGMENT.
Add offload config field to TunConfig to enable IFF_VNET_HDR on TUN devices. Add recv_gso for raw reads that include the virtio_net_hdr prefix, and prepend a zeroed virtio header on try_send when offload is enabled.
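The prefix in question is the standard 10-byte Linux `struct virtio_net_hdr` that `IFF_VNET_HDR` adds to every TUN read and write. A sketch of framing an outgoing packet with a zeroed header, as the `try_send` path described above would (the struct layout is the kernel's; the helper names are hypothetical):

```rust
/// Layout of Linux's `struct virtio_net_hdr` (10 bytes), prepended to TUN
/// I/O when IFF_VNET_HDR is enabled.
#[derive(Default)]
struct VirtioNetHdr {
    flags: u8,
    gso_type: u8,
    hdr_len: u16,
    gso_size: u16,
    csum_start: u16,
    csum_offset: u16,
}

impl VirtioNetHdr {
    const SIZE: usize = 10;

    fn to_bytes(&self) -> [u8; Self::SIZE] {
        let mut b = [0u8; Self::SIZE];
        b[0] = self.flags;
        b[1] = self.gso_type;
        b[2..4].copy_from_slice(&self.hdr_len.to_le_bytes());
        b[4..6].copy_from_slice(&self.gso_size.to_le_bytes());
        b[6..8].copy_from_slice(&self.csum_start.to_le_bytes());
        b[8..10].copy_from_slice(&self.csum_offset.to_le_bytes());
        b
    }
}

/// Prepend a zeroed virtio header; an all-zero header tells the kernel no
/// offload (GSO/csum) is requested for this packet.
fn frame_for_tun(pkt: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(VirtioNetHdr::SIZE + pkt.len());
    out.extend_from_slice(&VirtioNetHdr::default().to_bytes());
    out.extend_from_slice(pkt);
    out
}

fn main() {
    let framed = frame_for_tun(&[0x45, 0x00]);
    assert_eq!(framed.len(), 12);
    assert!(framed[..10].iter().all(|&b| b == 0));
}
```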
Extend send_to_socket to accept an optional gso_size parameter and build UDP_SEGMENT cmsg for kernel-level segmentation. Implement the real send_gso on UdpSocket using this path.
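For illustration, here is what that UDP_SEGMENT ancillary data looks like when built by hand, assuming the 64-bit Linux `cmsghdr` layout (in real code you would use `libc`'s `CMSG_*` macros or a wrapper like `nix` rather than hand-rolling offsets; this sketch is not the PR's implementation):

```rust
// Constants from Linux uapi: SOL_UDP = IPPROTO_UDP = 17, UDP_SEGMENT = 103.
const SOL_UDP: i32 = 17;
const UDP_SEGMENT: i32 = 103;

/// Build the raw control-message buffer that tells sendmsg(2) to slice the
/// payload into `gso_size`-byte UDP datagrams. The layout assumed here is
/// a 64-bit Linux `struct cmsghdr` (8-byte cmsg_len, two 4-byte ints)
/// followed by the u16 payload, padded to 8-byte alignment.
fn udp_segment_cmsg(gso_size: u16) -> Vec<u8> {
    let cmsg_hdr = 16usize; // sizeof(struct cmsghdr) on 64-bit Linux
    let cmsg_len = cmsg_hdr + 2; // header + u16 gso_size payload
    let space = (cmsg_len + 7) & !7; // CMSG_SPACE: pad to 8-byte alignment
    let mut buf = vec![0u8; space];
    buf[0..8].copy_from_slice(&(cmsg_len as u64).to_ne_bytes()); // cmsg_len
    buf[8..12].copy_from_slice(&SOL_UDP.to_ne_bytes()); // cmsg_level
    buf[12..16].copy_from_slice(&UDP_SEGMENT.to_ne_bytes()); // cmsg_type
    buf[16..18].copy_from_slice(&gso_size.to_ne_bytes()); // segment size
    buf
}

fn main() {
    let cmsg = udp_segment_cmsg(1400);
    assert_eq!(cmsg.len(), 24); // CMSG_SPACE(sizeof(u16)) on 64-bit
    assert_eq!(u64::from_ne_bytes(cmsg[0..8].try_into().unwrap()), 18);
    assert_eq!(&cmsg[16..18], &1400u16.to_ne_bytes());
}
```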
Add enable_tun_offload config option and wire it through ServerConfig to main. Extract the default inside IO loop into its own function and add inside_io_loop_gso that reads virtio-framed superpackets from TUN, dispatches GSO vs single-packet paths, and sets gso_max_size on the TUN device.
```rust
impl PluginList {
    #[cfg(target_os = "linux")]
```
Contributor
This need not be a Linux-specific method; it looks generic.
```rust
// Expose the full slab to `recv_gso` as `&mut [u8]`.
// SAFETY: every byte of the slab was zero-initialized at
// construction; subsequent iters only ever shrunk `len` or
// overwrote bytes. We never hand out uninitialized memory.
```
Contributor
This does not sound safe: since `pkt` is mutable, we could also create a new `BytesMut` and replace it. So if you then `reserve`, the memory might be uninitialized. I think what you want is https://docs.rs/bytes/latest/bytes/struct.BytesMut.html#method.spare_capacity_mut, which gives a pointer to the spare buffer that you can pass to `recv_gso`.
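The suggested pattern, sketched with std's `Vec` analogue of the same API (`spare_capacity_mut` exists on both `Vec` and `BytesMut` with the same shape; the `fill_spare` helper simulating `recv_gso` is hypothetical):

```rust
/// Fill up to `n` bytes of `buf`'s spare capacity, standing in for a
/// recv_gso-style read into uninitialized memory.
fn fill_spare(buf: &mut Vec<u8>, n: usize) {
    // spare_capacity_mut hands out the uninitialized tail as
    // &mut [MaybeUninit<u8>] without ever claiming it is initialized.
    let spare = buf.spare_capacity_mut();
    let filled = n.min(spare.len());
    for (i, slot) in spare.iter_mut().take(filled).enumerate() {
        slot.write(i as u8);
    }
    // SAFETY: exactly `filled` bytes past the old len were written above.
    unsafe { buf.set_len(buf.len() + filled) };
}

fn main() {
    let mut buf: Vec<u8> = Vec::with_capacity(16);
    fill_spare(&mut buf, 4);
    assert_eq!(buf, vec![0, 1, 2, 3]);
}
```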
Comment on lines +471 to +503
```rust
#[cfg(target_os = "linux")]
if config.enable_tun_offload {
    // TODO: derive from a proper inside-MTU server config field.
    const INSIDE_MTU: u32 = 1350;
    // Cap gso_max_size so a single UDP_SEGMENT sendmsg of the
    // re-segmented superpacket stays within one IPv4 UDP
    // datagram (65535 − 20 IP − 8 UDP = 65507 bytes payload)
    // and ≤ UDP_MAX_SEGMENTS (64) segments. Each wire segment
    // is: inside packet (IP+TCP hdr 40 + MSS) + worst-case
    // crypto overhead (~40, Expresslane is the larger of
    // DTLS-37 and Expresslane-40) + Lightway hdr (16).
    const UDP_MAX_GSO_PAYLOAD: u32 = 65507;
    const UDP_MAX_SEGMENTS: u32 = 64;
    const CRYPTO_OVERHEAD: u32 = 40;
    const WIRE_HDR: u32 = 16;
    const IP_TCP_HDR: u32 = 40;
    let wire_per_seg = INSIDE_MTU + CRYPTO_OVERHEAD + WIRE_HDR;
    let n_max = (UDP_MAX_GSO_PAYLOAD / wire_per_seg).min(UDP_MAX_SEGMENTS);
    let mss = INSIDE_MTU - IP_TCP_HDR;
    let max_gso_buf_size = IP_TCP_HDR + n_max * mss;
    if let Ok(name) = tun.name() {
        let _ = std::process::Command::new("ip")
            .args([
                "link",
                "set",
                &name,
                "gso_max_size",
                &max_gso_buf_size.to_string(),
            ])
            .output();
    }
}
```
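Working the arithmetic above through with the constants as given (assuming the same integer division), the cap comes out well under both limits:

```rust
fn main() {
    const INSIDE_MTU: u32 = 1350;
    const UDP_MAX_GSO_PAYLOAD: u32 = 65507;
    const UDP_MAX_SEGMENTS: u32 = 64;
    const CRYPTO_OVERHEAD: u32 = 40;
    const WIRE_HDR: u32 = 16;
    const IP_TCP_HDR: u32 = 40;

    let wire_per_seg = INSIDE_MTU + CRYPTO_OVERHEAD + WIRE_HDR;
    let n_max = (UDP_MAX_GSO_PAYLOAD / wire_per_seg).min(UDP_MAX_SEGMENTS);
    let mss = INSIDE_MTU - IP_TCP_HDR;
    let max_gso_buf_size = IP_TCP_HDR + n_max * mss;

    assert_eq!(wire_per_seg, 1406); // 1350 + 40 + 16
    assert_eq!(n_max, 46); // 65507 / 1406 = 46, under the 64-segment cap
    assert_eq!(max_gso_buf_size, 60_300); // 40 + 46 * 1310
    // The resulting wire superpacket fits in one IPv4 UDP datagram.
    assert!(n_max * wire_per_seg <= UDP_MAX_GSO_PAYLOAD);
}
```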
Contributor
Better to make this a config and move the responsibility of setting the GSO size outside of the server, to avoid needing the ADMIN capability.
Description
Implement GSO on the server side for DTLS and Expresslane, specifically for bulk server->client traffic. This consistently halves the total syscalls used during bulk transfers, and also improves aggregate server throughput by 2x when multiple clients are doing transfers.
When `--enable-tun-offload` is set, the server reads TSO superpackets from the TUN with `IFF_VNET_HDR`, segments them in userspace, and emits each superpacket as a single `sendmsg(UDP_SEGMENT)` instead of N per-segment syscalls. On a single-flow iperf3 reverse test the kernel UDP send path collapses near-completely: `udp_sendmsg` 0.71% → ~0.05%, `sock_alloc_send_pskb` 2.61% → ~0.13%, `mlx5e_xmit` 1.88% → ~0%.

Trade-off: kernel work is replaced with userspace work (per-segment IP/TCP/UDP checksum recomputation, segment assembly). The kernel-side wins are clear and measurable; userspace cost is now the dominant factor.
Pacing: each `sendmsg(UDP_SEGMENT)` produces a NIC burst of up to N segments. This can exceed the receiver's socket buffer depth and increase tail drops at peak rates. We will need to revisit better TX pacing under congested links.

Future work (not in this PR) will focus on compatibility with TUN backends like io_uring, GRO on the server side, and full GSO/GRO on the client side, where single-flow workloads should see the biggest visible speedup.
Motivation and Context
See ticket CVPN-2346.