Taking a course in networking (网络原理, “principles of networking”) prepares one for other difficulties in life. Here are some notes on writing a TUN/TAP device in Linux.

MTU

Since both Ethernet and IP are “datagram-y”, the outer connection benefits from using a datagram protocol as well: we avoid the retransmission and head-of-line latency that in-order delivery would bring. WireGuard and tinc both do this and use UDP.

But the Internet™ is a dangerous place: the MTU across multiple hops can be hard to predict. While WireGuard just sends the encapsulated packet and hopes for the best, tinc performs Path MTU discovery, so it knows the (approximate) MTU towards each destination.

This MTU is not necessarily the same as the MTU set on the TUN/TAP device: the detected PMTU may lag behind real network conditions, and device drivers are not expected to change their MTU on the fly. Therefore, tinc’s dataplane has to do a lot of MTU-related work that the Linux kernel would normally do based on the device MTU, with a lot of arithmetic on hardcoded header sizes. This includes:

  • fragmenting packets that exceed the discovered PMTU,
  • clamping the MSS of TCP connections going through the tunnel, and
  • generating ICMP Fragmentation Needed / Packet Too Big errors back to the sender.
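
A minimal sketch of that arithmetic. The constants are assumptions for illustration; the real overhead depends on the encapsulation format:

// Assumed overhead of a UDP-based tunnel; format-specific in reality.
const val OUTER_IPV4_HEADER = 20 // outer IPv4 header, no options
const val OUTER_UDP_HEADER = 8   // outer UDP header
const val TUNNEL_OVERHEAD = 32   // assumed: nonce + AEAD tag

// Largest inner packet that fits the detected path MTU without fragmentation.
fun innerMtu(pathMtu: Int): Int =
    pathMtu - OUTER_IPV4_HEADER - OUTER_UDP_HEADER - TUNNEL_OVERHEAD

// MSS to clamp inner TCP SYNs to: inner MTU minus inner IPv4 + TCP headers.
fun clampedMss(pathMtu: Int): Int = innerMtu(pathMtu) - 20 - 20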

Double routing

For a TUN device with multiple peers, when it’s fed a packet, there is no way for the program behind the device to know which peer the kernel routed the packet to. Asking the kernel is time-consuming and may race. So the tunnel has to do the routing again, and this second routing table is not necessarily identical to the kernel’s. Sticking to their respective philosophies, WireGuard routes by the CIDR ranges (“allowed IPs”) in its configuration file, while tinc routes by the subnets advertised by its peers.
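
A minimal sketch of this source-side routing in the WireGuard style, i.e. longest-prefix match over configured CIDR ranges (the names here are illustrative, not any real API):

data class Peer(val name: String)
data class Route(val network: Int, val prefixLen: Int, val peer: Peer)

// "10.0.0.0" -> 0x0A000000
fun ipv4ToInt(addr: String): Int =
    addr.split(".").fold(0) { acc, octet -> (acc shl 8) or octet.toInt() }

fun mask(prefixLen: Int): Int =
    if (prefixLen == 0) 0 else -1 shl (32 - prefixLen)

// Longest-prefix match: the most specific route containing dst wins.
fun routeTo(routes: List<Route>, dst: Int): Peer? =
    routes
        .filter { (dst and mask(it.prefixLen)) == (it.network and mask(it.prefixLen)) }
        .maxByOrNull { it.prefixLen }
        ?.peer

For example, with routes for 10.0.0.0/8 → A and 10.1.0.0/16 → B, a packet to 10.1.2.3 goes to B. tinc does the same kind of lookup, just with the table populated from peer advertisements rather than a static config.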

For a P2P tunnel, this is not a problem. For a TAP device, it is also mostly not a problem, because we assume MAC addresses can be used as unique identifiers and simply drop frames to unrecognized unicast MACs. But for TAP to work, we need broadcasting.
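
In code, the TAP side amounts to a tiny switch keyed on the destination MAC; a sketch, where broadcastFrame() and sendUnicast() stand in for the tunnel’s actual forwarding path:

// ByteArray lacks structural equality, so MAC keys are stored as List<Byte>.
typealias Mac = List<Byte>

fun isGroupAddress(dst: Mac): Boolean =
    (dst[0].toInt() and 0x01) != 0 // I/G bit set: broadcast or multicast

fun forwardFrame(frame: ByteArray, macTable: Map<Mac, String>) {
    val dst: Mac = frame.copyOfRange(0, 6).toList() // destination MAC, bytes 0..5
    if (isGroupAddress(dst)) {
        broadcastFrame(frame) // fan out; see the next section
    } else {
        // Unknown unicast destinations are silently dropped.
        macTable[dst]?.let { peerId -> sendUnicast(peerId, frame) }
    }
}

// Placeholders for the tunnel's forwarding path.
fun broadcastFrame(frame: ByteArray) {}
fun sendUnicast(peerId: String, frame: ByteArray) {}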

Speaking of broadcasting…

Broadcasting in a Mesh

Broadcasting in a mesh network can take a few approaches:

  1. Flood fill + filtering. This is the “easy way”, but it relies on the assumption that packet delivery time is bounded, because the records of already-seen packets must eventually be discarded.
  2. Spanning tree. Although the most elegant solution, this is just a nightmare to implement, and requires a meta-protocol. It also risks creating forwarding loops, so a lot of care has to be taken.
  3. Duplicated unicast. If the ingest node knows the internal ID of each reachable node in the broadcast domain, it can duplicate a broadcast packet into multiple unicast packets. This works best when:
  • We’re already doing IP routing at the source, so the entire internal forwarding plane within the mesh operates on internal node IDs.
  • Each node knows the internal ID of, and a route to, every other node, but does not know much else (e.g. the entire network topology). This is common for mesh networks using a distance-vector routing protocol.

tinc is a serious mesh VPN, so it took on the challenge of implementing both 2 & 3. I’m lazy, so only option 3 for me.
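
A sketch of option 3, assuming each node already knows the set of node IDs in the broadcast domain and has a unicast path to each:

// Fan a broadcast frame out as one unicast per known node.
fun broadcastFrame(frame: ByteArray, self: String, knownNodes: Set<String>) {
    for (node in knownNodes) {
        if (node == self) continue // don't echo the frame back to ourselves
        sendUnicast(node, frame)   // delivered by internal node ID
    }
}

// Placeholder: encapsulate the frame and route it to the node's address.
fun sendUnicast(nodeId: String, frame: ByteArray) {}

The cost is one copy per node per broadcast, which is acceptable for ARP/ND-sized traffic on a modest mesh.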

Android-specific problems

  1. After the VPN comes up, the traffic of the VPN connection itself would be routed back into the VPN. The VPN application has to add itself to the VPN’s exclusion list (see the sketch after this list). In contrast, common practice on Linux is to use an fwmark with ip rule to route the tunnel traffic through a separate routing table; we don’t have that on Android.
  2. The application cannot close the fd of the TUN device. The relevant control logic has to deliberately leak the fd before asking VpnService to terminate.
  3. Android apps can only create TUN devices, never TAP.
  4. You cannot actually control the name of the TUN device. You also cannot easily listen for changes on the device (i.e. routes and addresses, via rtnetlink). We can poll, but since the user cannot easily change a TUN device’s configuration, we can assume it stays the same after creation.
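
Items 1 and 2 together look roughly like this inside a VpnService subclass. addDisallowedApplication(), establish() and detachFd() are the real Android APIs; the address, the route, and the start()/stop() structure around them are illustrative:

import android.net.VpnService
import android.os.ParcelFileDescriptor

class MeshVpnService : VpnService() {
    private var tun: ParcelFileDescriptor? = null

    fun start() {
        tun = Builder()
            .addAddress("10.0.0.2", 24) // placeholder tunnel address
            .addRoute("0.0.0.0", 0)     // placeholder: route everything
            // Item 1: exclude our own package, so the encapsulated
            // UDP traffic is not routed back into the tunnel.
            .addDisallowedApplication(packageName)
            .establish()
    }

    fun stop() {
        // Item 2: deliberately leak the raw fd instead of closing it,
        // then let the service shut down.
        tun?.detachFd()
        tun = null
        stopSelf()
    }
}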

Local tunnel loop on Linux

Linux really hates receiving an inbound packet that apparently originates from a local address. This messes up our testing setup, since we want to loop the tunnel back onto the same machine.

The objectively correct way to solve this problem is network namespaces (netns). But we’re lazy. Setting the following sysctl parameter allows IPv4 packets to loop back to the same machine:

echo "1" > /proc/sys/net/ipv4/conf/all/accept_local

Surprisingly, we don’t need any rp_filter settings beyond this: accept_local apparently short-circuits the reverse-path check for local source addresses. IPv6 reverse-path filtering is implemented in netfilter, so as long as it isn’t explicitly enabled, it won’t disturb us.

Additional resources