Overview

  • Introduction to TCP
  • TUN/TAP devices
  • TCP implementation
  • Modelling TCP in Rust

Introduction to TCP

Layer 3 - IP packets

  • Connection-less
  • Unreliable
  • Header verification via Internet Checksum
  • Packet fragmentation & re-assembly based on MTU
0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
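
The Header Checksum field above uses the Internet checksum (RFC 1071): the one's complement of the one's-complement sum of all 16-bit words. A minimal sketch follows; the function name `internet_checksum` is chosen here for illustration and is not taken from the deck's code.

```rust
/// Computes the Internet checksum (RFC 1071) over `data`.
///
/// When computing a header checksum, the checksum field itself must be
/// zeroed first. Running the function over a header that already carries
/// a valid checksum yields 0.
pub fn internet_checksum(data: &[u8]) -> u16 {
    // A u32 accumulator cannot overflow for inputs up to 64 KiB,
    // which covers any single IPv4 packet.
    let mut sum: u32 = 0;

    // Add the data as big-endian 16-bit words.
    let mut chunks = data.chunks_exact(2);
    for pair in &mut chunks {
        sum += u32::from(u16::from_be_bytes([pair[0], pair[1]]));
    }

    // An odd trailing byte is padded with a zero byte on the right.
    if let [last] = chunks.remainder() {
        sum += u32::from(u16::from_be_bytes([*last, 0]));
    }

    // Fold the carries back into the low 16 bits (end-around carry).
    while (sum >> 16) != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }

    !(sum as u16)
}
```

The same routine can be reused for the TCP checksum, which additionally covers a pseudo-header and the payload.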

Layer 4 - TCP

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |       |C|E|U|A|P|R|S|F|                               |
| Offset| Rsrvd |W|C|R|C|S|S|Y|I|            Window             |
|       |       |R|E|G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           [Options]                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               :
:                             Data                              :
:                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  • Connection-oriented
  • Reliable
  • Verification via checksum
  • Data ordered by sequence number (each byte is numbered)
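
Sequence numbers are 32-bit and wrap around, so they cannot be compared with plain `<`. A small sketch of wrap-aware comparison; the helper names `seq_lt` and `seq_in_window` are hypothetical and only serve this illustration.

```rust
/// Returns true if sequence number `a` comes before `b`, modulo 2^32.
/// Only meaningful when the two numbers are less than 2^31 apart.
pub fn seq_lt(a: u32, b: u32) -> bool {
    (a.wrapping_sub(b) as i32) < 0
}

/// Returns true if `seq` lies in the window `[start, start + len)`,
/// taking wrap-around into account.
pub fn seq_in_window(start: u32, seq: u32, len: u32) -> bool {
    seq.wrapping_sub(start) < len
}
```

With these, a segment that starts just before the wrap point still compares as older than one that starts just after it.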

Connection Flow

    TCP Peer A                                           TCP Peer B

1.  CLOSED                                               LISTEN

2.  SYN-SENT    --> <SEQ=100><CTL=SYN>               --> SYN-RECEIVED

3.  ESTABLISHED <-- <SEQ=300><ACK=101><CTL=SYN,ACK>  <-- SYN-RECEIVED

4.  ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK>       --> ESTABLISHED

5.  ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK><DATA> --> ESTABLISHED
  • Stack assigns an ephemeral source port so that each connection has a unique 4-tuple / quad
  • E.g. (192.168.1.4:50100, 1.1.1.1:443)
  • Initial sequence number is random
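
The client side of the flow above (Peer A) can be sketched as a tiny state machine. This is a simplified illustration, not the deck's implementation: the `ActiveOpen` and `Segment` types are hypothetical, only the SYN and ACK fields are modelled, and an unacceptable segment is dropped instead of being answered with RST.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    SynSent,
    Established,
}

/// Only the header fields needed for the handshake.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Segment {
    pub seq: u32,
    pub ack: Option<u32>,
    pub syn: bool,
}

pub struct ActiveOpen {
    pub state: State,
    pub snd_nxt: u32, // next sequence number we will send
    pub rcv_nxt: u32, // next sequence number we expect from the peer
}

impl ActiveOpen {
    /// Starts a connection with initial sequence number `iss` and
    /// returns the SYN segment to send.
    pub fn connect(iss: u32) -> (Self, Segment) {
        let syn = Segment { seq: iss, ack: None, syn: true };
        // The SYN flag consumes one sequence number.
        let conn = Self {
            state: State::SynSent,
            snd_nxt: iss.wrapping_add(1),
            rcv_nxt: 0,
        };
        (conn, syn)
    }

    /// Handles an incoming segment; returns the ACK to send if the
    /// segment is an acceptable SYN,ACK.
    pub fn on_segment(&mut self, seg: &Segment) -> Option<Segment> {
        let acceptable =
            self.state == State::SynSent && seg.syn && seg.ack == Some(self.snd_nxt);
        if !acceptable {
            return None;
        }
        self.rcv_nxt = seg.seq.wrapping_add(1);
        self.state = State::Established;
        Some(Segment { seq: self.snd_nxt, ack: Some(self.rcv_nxt), syn: false })
    }
}
```

Feeding it the numbers from the diagram (ISS 100, peer SYN,ACK with SEQ=300 and ACK=101) produces the final ACK with SEQ=101 and ACK=301.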

TUN/TAP devices

Creating TUN device

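use std::mem::MaybeUninit;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

use nix::fcntl::OFlag;
use nix::ioctl_write_int;
use nix::sys::stat::Mode;

// TUNSETIFF is defined as _IOW('T', 202, int) in <linux/if_tun.h>
ioctl_write_int!(tunsetiff, b'T', 202);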

fn create_ifreq(devname: &str, ifru_flags: i16) -> libc::ifreq {
    let mut ifreq = unsafe { MaybeUninit::<libc::ifreq>::zeroed().assume_init() };
    // Copy at most IFNAMSIZ - 1 (15) bytes so the name stays NUL-terminated
    for (left, right) in ifreq.ifr_name[..15].iter_mut().zip(devname.bytes()) {
        *left = right as _;
    }

    ifreq.ifr_ifru.ifru_flags = ifru_flags;
    ifreq
}

impl TunDevice {
    pub fn new(devname: &str) -> Result<Self, std::io::Error> {
        let tap_fd = unsafe {
            OwnedFd::from_raw_fd(nix::fcntl::open(
                "/dev/net/tun",
                OFlag::O_RDWR,
                Mode::empty(),
            )?)
        };

        let ifreq = create_ifreq(devname, (libc::IFF_TUN | libc::IFF_NO_PI) as i16);
        unsafe {
            tunsetiff(tap_fd.as_raw_fd(), &ifreq as *const _ as u64)?;
        }

        std::process::Command::new("ip")
            .arg("link")
            .arg("set")
            .arg(devname)
            .arg("up")
            .spawn()?
            .wait()?;

        std::process::Command::new("ip")
            .arg("route")
            .arg("add")
            .arg("dev")
            .arg(devname)
            .arg("10.0.0.0/24")
            .spawn()?
            .wait()?;

        std::process::Command::new("ip")
            .arg("addr")
            .arg("add")
            .arg("dev")
            .arg(devname)
            .arg("local")
            .arg("10.0.0.2/24")
            .spawn()?
            .wait()?;

        Ok(Self { tap_fd })
    }
}
$ ip a
...
7: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether b4:8c:9d:d4:77:f7 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.4/24 brd 192.168.1.255 scope global dynamic noprefixroute wlan0
       valid_lft 65278sec preferred_lft 54478sec
...
9: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 500
    link/none 
    inet 10.0.0.2/24 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::6705:55f2:d2f5:b3ab/64 scope link stable-privacy proto kernel_ll 
       valid_lft forever preferred_lft forever
sysctl -w net.ipv4.ip_forward=1
iptables -I INPUT --source 10.0.0.0/24 -j ACCEPT
iptables -t nat -I POSTROUTING --out-interface wlan0 -j MASQUERADE
iptables -I FORWARD --in-interface wlan0 --out-interface tun0 -j ACCEPT
iptables -I FORWARD --in-interface tun0 --out-interface wlan0 -j ACCEPT

Routing Packets

Handling Packets

  • read() / write() allow reading/writing raw IP packets
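fn read_packets(&self) -> Result<(), std::io::Error> {
    // Allocate the buffer once; an IPv4 packet is at most 65535 bytes
    let mut buf = vec![0_u8; 65536];
    loop {
        let size = nix::unistd::read(self.tap_fd.as_raw_fd(), &mut buf[..])?;
        // Parse only the bytes that were actually read
        match etherparse::Ipv4HeaderSlice::from_slice(&buf[..size]) {
            Ok(ip) => match ip.protocol() {
                etherparse::IpNumber::TCP => {
                    // The TCP segment starts right after the IP header
                    match etherparse::TcpSlice::from_slice(&buf[ip.slice().len()..size]) {
                        Ok(tcp) => {
                            /* handle TCP packet */
                        },
                        Err(e) => eprintln!("invalid TCP segment: {e:?}"),
                    }
                },
                _ => { /* ignore non-TCP protocols */ },
            },
            Err(e) => eprintln!("invalid IPv4 packet: {e:?}"),
        }
    }
}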
$ tcpdump -v -i tun0
tcpdump: listening on tun0, link-type RAW (Raw IP), snapshot length 262144 bytes
00:20:04.624733 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    10.0.0.1.51737 > tcpbin.com.4242: Flags [S], cksum 0x5cc1 (correct), seq 3502768018, win 65535, length 0
00:20:04.832437 IP (tos 0x0, ttl 51, id 0, offset 0, flags [DF], proto TCP (6), length 44)
    tcpbin.com.4242 > 10.0.0.1.51737: Flags [S.], cksum 0x2327 (correct), seq 3004300668, ack 3502768019, win 29200, options [mss 1250], length 0
00:20:04.832896 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    10.0.0.1.51737 > tcpbin.com.4242: Flags [.], cksum 0x3a12 (correct), seq 1, ack 1, win 29200, length 0
00:20:04.832971 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 46)

TCP implementation

3-way handshake

    TCP Peer A                                           TCP Peer B

1.  CLOSED                                               LISTEN

2.  SYN-SENT    --> <SEQ=100><CTL=SYN>               ...

3.  (duplicate) ... <SEQ=90><CTL=SYN>               --> SYN-RECEIVED

4.  SYN-SENT    <-- <SEQ=300><ACK=91><CTL=SYN,ACK>  <-- SYN-RECEIVED

5.  SYN-SENT    --> <SEQ=91><CTL=RST>               --> LISTEN

6.              ... <SEQ=100><CTL=SYN>               --> SYN-RECEIVED

7.  ESTABLISHED <-- <SEQ=400><ACK=101><CTL=SYN,ACK>  <-- SYN-RECEIVED

8.  ESTABLISHED --> <SEQ=101><ACK=401><CTL=ACK>      --> ESTABLISHED
  • Random initial sequence numbers help detect stale segments from old connections
TcpSlice {
    header: TcpHeader {
        source_port: 4242,
        destination_port: 41171,
        sequence_number: 2188365908,
        acknowledgment_number: 1116455958,
        ns: false,
        fin: false,
        syn: true,
        rst: false,
        psh: false,
        ack: true,
        urg: false,
        ece: false,
        cwr: false,
        window_size: 29200,
        checksum: 29631,
        urgent_pointer: 0,
        options: [
            MaximumSegmentSize(
                1300,
            ),
        ],
    },
    payload: [],
}
  • Each peer includes window size and options in headers (e.g. SACK, window scale)

Sending Data

Window size = 5

1. Initial packet being sent out

  0   1   2   3   4   5   6   7   8   9
+---------------------------------------+
| h | e | l | l | o | w | o | r | l | d |
+---------------------------------------+
<--------------------------------------->
   buffer
<------------------->
   window
 (SND.UNA = 0,
  SND.NXT = 5)

2. First 2 bytes acknowledged by receiver, window moves forward

  0   1   2   3   4   5   6   7   8   9
+---------------------------------------+
| h | e | l | l | o | w | o | r | l | d |
+---------------------------------------+
        <------------------------------->
           buffer
        <------------------->
           window
         (SND.UNA = 2,
          SND.NXT = 7)
  • A sliding window limits the amount of unacknowledged in-flight data, to avoid overwhelming the receiver
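
The two steps above can be sketched with a small send buffer. This is an illustration under simplifying assumptions: `SendBuffer` is a hypothetical type, offsets are plain `usize` indices into the buffer rather than wrapping 32-bit sequence numbers, and acknowledged bytes are not dropped from the buffer.

```rust
pub struct SendBuffer {
    pub data: Vec<u8>,
    pub snd_una: usize, // oldest unacknowledged byte (SND.UNA)
    pub snd_nxt: usize, // next byte to send (SND.NXT)
    pub window: usize,  // window size advertised by the receiver
}

impl SendBuffer {
    pub fn new(data: &[u8], window: usize) -> Self {
        Self { data: data.to_vec(), snd_una: 0, snd_nxt: 0, window }
    }

    /// Returns the bytes that may be sent right now and advances SND.NXT.
    /// Nothing beyond SND.UNA + window is ever put in flight.
    pub fn next_segment(&mut self) -> &[u8] {
        let start = self.snd_nxt;
        let limit = (self.snd_una + self.window).min(self.data.len());
        let end = limit.max(start);
        self.snd_nxt = end;
        &self.data[start..end]
    }

    /// Processes a cumulative ACK; the window slides forward when
    /// SND.UNA advances. ACKs for data not yet sent are ignored.
    pub fn on_ack(&mut self, ack: usize) {
        if ack > self.snd_una && ack <= self.snd_nxt {
            self.snd_una = ack;
        }
    }
}
```

With `helloworld` and a window of 5, the first call to `next_segment` yields `hello`; after an ACK for 2, the next call yields `wo`, leaving SND.UNA = 2 and SND.NXT = 7 as in step 2.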

Receiving Data

  • ACKs are cumulative (unless SACK is used)
  • Each in-order segment is ACKed, out-of-order segments generate duplicate ACKs
1. Receiving initial segment (2 bytes), send ACK for `2` (next expected sequence number)

  0   1   2   3   4
+-------------------+
| h | e | ? | ? | ? |
+-------------------+
<----------------->
 window
<------->
 received (RCV.NXT = 2)

2. Received more data (`llowo`) in separate TCP segments, out of order

a. `ow` (bytes 4-5), stored but can't be ACKed because bytes 2-3 are missing, ACK `2` again

  0   1   2   3   4   5   6
+---------------------------+
| h | e | ? | ? | o | w | ? |
+---------------------------+
         <----------------->
          window
<------->
 received (RCV.NXT = 2)

b. `o` (byte 6), stored but still out of order, ACK `2` again

  0   1   2   3   4   5   6
+---------------------------+
| h | e | ? | ? | o | w | o |
+---------------------------+
         <----------------->
          window
<------->
 received (RCV.NXT = 2)

c. `ll` (bytes 2-3) fills the gap, ACK sent out for `7`

  0   1   2   3   4   5   6
+---------------------------+
| h | e | l | l | o | w | o |
+---------------------------+
<------------------------->
 received (RCV.NXT = 7)
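
The receive side of this example can be sketched with an ordered map of pending segments. This is a simplified illustration: `RecvBuffer` is a hypothetical type, offsets are plain `usize` values instead of wrapping sequence numbers, the window limit is not enforced, and a segment arriving with the same start offset as a stored one simply replaces it.

```rust
use std::collections::BTreeMap;

pub struct RecvBuffer {
    pub rcv_nxt: usize,     // next in-order byte expected (RCV.NXT)
    pub assembled: Vec<u8>, // in-order data ready for the application
    pending: BTreeMap<usize, Vec<u8>>, // out-of-order segments by start offset
}

impl RecvBuffer {
    pub fn new() -> Self {
        Self { rcv_nxt: 0, assembled: Vec::new(), pending: BTreeMap::new() }
    }

    /// Stores a segment and returns the cumulative ACK number to send.
    /// Out-of-order segments leave RCV.NXT unchanged, which results in
    /// a duplicate ACK.
    pub fn on_segment(&mut self, seq: usize, payload: &[u8]) -> usize {
        // Ignore segments that contain only already-received data.
        if seq + payload.len() > self.rcv_nxt {
            self.pending.insert(seq, payload.to_vec());
        }

        // Move every segment that is now contiguous into `assembled`.
        loop {
            let next = self.pending.keys().next().copied();
            match next {
                Some(start) if start <= self.rcv_nxt => {
                    let data = self.pending.remove(&start).unwrap();
                    // Skip any prefix that overlaps data we already have.
                    let skip = self.rcv_nxt - start;
                    if skip < data.len() {
                        self.assembled.extend_from_slice(&data[skip..]);
                        self.rcv_nxt += data.len() - skip;
                    }
                }
                _ => break,
            }
        }
        self.rcv_nxt
    }
}
```

Replaying the example (`he` at 0, `ow` at 4, `o` at 6, `ll` at 2) returns ACK numbers 2, 2, 2 and 7.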

Data retransmission

  • RTT is measured per segment; a segment is re-sent if it is not ACKed within the retransmission timeout (RTO)
  • RTT is not measured for re-transmitted segments (Karn's algorithm), and the timeout doubles on each retransmission (max 60s)
srtt - smoothed round-trip time
rttvar - round-trip time variation
rto - retransmission timeout

rto = 1000 (initial value, before any measurement)
G (clock granularity) = 0.01

On first measurement R

srtt = R
rttvar = R/2
rto = srtt + max(G, 4*rttvar)

On subsequent measurements

rttvar = 0.75 * rttvar + 0.25 * |srtt - R|
srtt = 0.875 * srtt + 0.125 * R
rto = srtt + max(G, 4 * rttvar)
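
The update rules above translate almost directly into code. A minimal sketch follows; `RttEstimator` is a hypothetical type, and the initial RTO, clock granularity and maximum RTO are passed in by the caller so that the sketch makes no assumption about units, only that all values use the same one.

```rust
pub struct RttEstimator {
    srtt: Option<f64>, // None until the first measurement
    rttvar: f64,
    rto: f64,
    granularity: f64, // G
    max_rto: f64,
}

impl RttEstimator {
    pub fn new(initial_rto: f64, granularity: f64, max_rto: f64) -> Self {
        Self { srtt: None, rttvar: 0.0, rto: initial_rto, granularity, max_rto }
    }

    /// Feeds one RTT measurement `r`. Must not be called for
    /// re-transmitted segments.
    pub fn on_measurement(&mut self, r: f64) {
        let srtt = match self.srtt {
            // First measurement
            None => {
                self.rttvar = r / 2.0;
                r
            }
            // Subsequent measurements: rttvar is updated first, using
            // the old value of srtt
            Some(srtt) => {
                self.rttvar = 0.75 * self.rttvar + 0.25 * (srtt - r).abs();
                0.875 * srtt + 0.125 * r
            }
        };
        self.srtt = Some(srtt);
        self.rto = (srtt + self.granularity.max(4.0 * self.rttvar)).min(self.max_rto);
    }

    /// Doubles the timeout after a retransmission, up to the maximum.
    pub fn on_timeout(&mut self) {
        self.rto = (self.rto * 2.0).min(self.max_rto);
    }

    pub fn rto(&self) -> f64 {
        self.rto
    }
}
```

For example, starting from an RTO of 1000, measurements of 100 and then 200 give RTOs of 300 and 362.5.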

Closing connections

  • Cleanly closing involves a 4-way "handshake"
  • Connections can be abruptly closed by sending an RST
    TCP Peer A                                           TCP Peer B

1.  ESTABLISHED                                          ESTABLISHED

2.  (Close)
    FIN-WAIT-1  --> <SEQ=100><ACK=300><CTL=FIN,ACK>  --> CLOSE-WAIT

3.  FIN-WAIT-2  <-- <SEQ=300><ACK=101><CTL=ACK>      <-- CLOSE-WAIT

4.                                                       (Close)
    TIME-WAIT   <-- <SEQ=300><ACK=101><CTL=FIN,ACK>  <-- LAST-ACK

5.  TIME-WAIT   --> <SEQ=101><ACK=301><CTL=ACK>      --> CLOSED

6.  (2 MSL)
    CLOSED
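
The state changes on both sides of this sequence can be written as a transition function. This is a sketch for illustration only: the `State` and `Event` enums are hypothetical, simultaneous close (the CLOSING state) is left out, and any event not listed leaves the state unchanged.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Established,
    FinWait1,
    FinWait2,
    CloseWait,
    LastAck,
    TimeWait,
    Closed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Close,           // local application closes the connection
    RecvFin,         // FIN received from the peer
    RecvAckOfFin,    // peer acknowledged our FIN
    TimeWaitExpired, // 2 MSL timer fired
}

pub fn next_state(state: State, event: Event) -> State {
    use Event::*;
    use State::*;
    match (state, event) {
        // Side that closes first (Peer A)
        (Established, Close) => FinWait1,
        (FinWait1, RecvAckOfFin) => FinWait2,
        (FinWait2, RecvFin) => TimeWait,
        (TimeWait, TimeWaitExpired) => Closed,
        // Side that closes second (Peer B)
        (Established, RecvFin) => CloseWait,
        (CloseWait, Close) => LastAck,
        (LastAck, RecvAckOfFin) => Closed,
        // Everything else is ignored in this sketch
        (s, _) => s,
    }
}
```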

TODO

  • Selective Acknowledgment (SACK) - allows only actual missing segments to be re-transmitted
  • Fast retransmit - Re-transmits a segment after multiple duplicate ACKs, without waiting for the timeout
  • Nagle's algorithm - Coalesces small writes to avoid sending out many small packets
  • Delayed ACKs - Hold ACKs briefly so they can be batched or sent along with outgoing segments
  • Timestamps
  • Congestion control - Congestion window, ECN, etc.

Modelling TCP in Rust

  • TunDevice manages socket creation & packet input/output
  • Each TcpSocket instance is shared: TunDevice accesses it to pass on TCP packets & run timers, while user code concurrently reads & writes
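
One possible way to model this sharing is a table of `Arc<Mutex<TcpSocket>>` keyed by the connection's 4-tuple. This is only a sketch of the idea, not the deck's design: `Quad`, `SocketTable`, the `recv_buf` field and the methods shown are hypothetical, and a real socket would also need a way to wake up blocked readers (e.g. a `Condvar`).

```rust
use std::collections::HashMap;
use std::net::SocketAddrV4;
use std::sync::{Arc, Mutex};

/// Identifies a connection: (local address:port, remote address:port).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Quad {
    pub local: SocketAddrV4,
    pub remote: SocketAddrV4,
}

#[derive(Default)]
pub struct TcpSocket {
    recv_buf: Vec<u8>,
}

impl TcpSocket {
    /// Called by user code: takes all data received so far.
    pub fn read(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.recv_buf)
    }
}

/// Owned by the packet loop; hands out shared handles to user code.
#[derive(Default)]
pub struct SocketTable {
    sockets: Mutex<HashMap<Quad, Arc<Mutex<TcpSocket>>>>,
}

impl SocketTable {
    pub fn new() -> Self {
        Self::default()
    }

    /// Registers a socket for `quad` and returns a handle for user code.
    pub fn open(&self, quad: Quad) -> Arc<Mutex<TcpSocket>> {
        let socket = Arc::new(Mutex::new(TcpSocket::default()));
        self.sockets.lock().unwrap().insert(quad, Arc::clone(&socket));
        socket
    }

    /// Called by the packet loop: passes payload to the matching socket.
    /// Returns false if no socket is registered for `quad`.
    pub fn deliver(&self, quad: &Quad, payload: &[u8]) -> bool {
        // Clone the handle so the table lock is not held while the
        // socket itself is locked.
        let socket = self.sockets.lock().unwrap().get(quad).cloned();
        match socket {
            Some(socket) => {
                socket.lock().unwrap().recv_buf.extend_from_slice(payload);
                true
            }
            None => false,
        }
    }
}
```

The packet loop and user code each hold their own `Arc` to the same socket, so either side can lock it without owning the other.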

By git-bruh
