Overview
- Introduction to TCP
- TUN/TAP devices
- TCP implementation
- Modelling TCP in Rust
Introduction to TCP
Layer 3 - IP packets
- Connection-less
- Unreliable
- Verification via Internet Checksum
- Packet fragmentation & re-assembly based on MTU
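The Internet Checksum mentioned above is the 16-bit one's-complement sum described in RFC 1071. The following is a minimal sketch of it; the function name `internet_checksum` is a hypothetical helper, not something from the slides.

```rust
/// Internet checksum (RFC 1071): one's-complement of the one's-complement
/// sum of all 16-bit big-endian words in `data`.
/// `internet_checksum` is an illustrative name, not from the original deck.
fn internet_checksum(data: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    let mut words = data.chunks_exact(2);
    for w in &mut words {
        sum += u32::from(u16::from_be_bytes([w[0], w[1]]));
    }
    // An odd trailing byte is treated as the high byte of a zero-padded word.
    if let [last] = words.remainder() {
        sum += u32::from(*last) << 8;
    }
    // Fold the carries back into the low 16 bits.
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}
```

To build a header, compute the checksum with the checksum field zeroed. To verify a received header, run the same function over it unchanged and expect 0.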
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Layer 4 - TCP
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |       |C|E|U|A|P|R|S|F|                               |
| Offset| Rsrvd |W|C|R|C|S|S|Y|I|            Window             |
|       |       |R|E|G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           [Options]                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               :
:                             Data                              :
:                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- Connection-oriented
- Reliable
- Verification via checksum
- Packets identified by sequence number
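Sequence numbers are 32-bit and wrap around, so comparing them with plain `<` breaks near the wrap point. A small sketch of wrap-aware comparison follows; the helper names `seq_lt` and `seq_in_window` are hypothetical.

```rust
/// True if sequence number `a` comes before `b` modulo 2^32.
/// Only meaningful while the two are less than 2^31 apart.
/// (`seq_lt` is an illustrative helper name.)
fn seq_lt(a: u32, b: u32) -> bool {
    (a.wrapping_sub(b) as i32) < 0
}

/// True if `x` lies in the half-open window `[start, start + len)`,
/// where the window itself may wrap past u32::MAX.
fn seq_in_window(start: u32, x: u32, len: u32) -> bool {
    x.wrapping_sub(start) < len
}
```

The wrapping subtraction turns "distance from the window start" into an ordinary unsigned number, which is why a single comparison is enough.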
Connection Flow
    TCP Peer A                                            TCP Peer B
1.  CLOSED                                                LISTEN
2.  SYN-SENT    --> <SEQ=100><CTL=SYN>                --> SYN-RECEIVED
3.  ESTABLISHED <-- <SEQ=300><ACK=101><CTL=SYN,ACK>   <-- SYN-RECEIVED
4.  ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK>       --> ESTABLISHED
5.  ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK><DATA> --> ESTABLISHED
- The stack assigns an ephemeral source port so that each connection has a unique 4-tuple / quad
  - e.g. (192.168.1.4:50100, 1.1.1.1:443)
- Initial sequence number is random
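One way to picture the quad and ephemeral port assignment is a connection table keyed by the 4-tuple. The sketch below is only an illustration: `Quad`, `Connections` and `connect` are hypothetical names, the port range 49152-65535 is the IANA dynamic range (real stacks may use a different one), and the caller is assumed to supply a random initial sequence number.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// (source, destination) endpoints identifying one connection.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Quad {
    src: (Ipv4Addr, u16),
    dst: (Ipv4Addr, u16),
}

/// Start of the IANA dynamic/private port range (an assumption for this sketch).
const EPHEMERAL_START: u16 = 49152;

struct Connections {
    local_ip: Ipv4Addr,
    next_port: u16,
    /// quad -> initial sequence number chosen for that connection
    table: HashMap<Quad, u32>,
}

impl Connections {
    fn new(local_ip: Ipv4Addr) -> Self {
        Self { local_ip, next_port: EPHEMERAL_START, table: HashMap::new() }
    }

    /// Picks the next ephemeral port that yields an unused quad for `dst`.
    /// `iss` is the (random) initial sequence number supplied by the caller.
    fn connect(&mut self, dst: (Ipv4Addr, u16), iss: u32) -> Option<Quad> {
        // Try every port in the range at most once.
        for _ in EPHEMERAL_START..=u16::MAX {
            let port = self.next_port;
            self.next_port = if port == u16::MAX { EPHEMERAL_START } else { port + 1 };
            let quad = Quad { src: (self.local_ip, port), dst };
            if !self.table.contains_key(&quad) {
                self.table.insert(quad, iss);
                return Some(quad);
            }
        }
        None // every ephemeral port is already in use towards `dst`
    }
}
```

Two connections to the same destination end up with different source ports, so their quads (and therefore their state) never collide.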
TUN/TAP devices
Creating TUN device
use std::mem::MaybeUninit;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

use nix::{fcntl::OFlag, ioctl_write_int, sys::stat::Mode};

// TUNSETIFF is defined as _IOW('T', 202, int)
ioctl_write_int!(tunsetiff, b'T' as u8, 202 as u32);

fn create_ifreq(devname: &str, ifru_flags: i16) -> libc::ifreq {
    // An all-zero ifreq is a valid starting value
    let mut ifreq = unsafe { MaybeUninit::<libc::ifreq>::zeroed().assume_init() };
    // Copy at most 15 characters so the name stays NUL-terminated
    for (left, right) in ifreq.ifr_name[..15].iter_mut().zip(devname.chars()) {
        *left = right as _;
    }
    ifreq.ifr_ifru.ifru_flags = ifru_flags;
    ifreq
}

impl TunDevice {
    pub fn new(devname: &str) -> Result<Self, std::io::Error> {
        let tap_fd = unsafe {
            OwnedFd::from_raw_fd(nix::fcntl::open(
                "/dev/net/tun",
                OFlag::O_RDWR,
                Mode::empty(),
            )?)
        };
        // IFF_NO_PI: no extra packet-information header, just raw IP packets
        let ifreq = create_ifreq(devname, (libc::IFF_TUN | libc::IFF_NO_PI) as i16);
        unsafe {
            tunsetiff(tap_fd.as_raw_fd(), &ifreq as *const _ as u64)?;
        }
        std::process::Command::new("ip")
            .arg("link")
            .arg("set")
            .arg(devname)
            .arg("up")
            .spawn()?
            .wait()?;
        std::process::Command::new("ip")
            .arg("route")
            .arg("add")
            .arg("dev")
            .arg(devname)
            .arg("10.0.0.0/24")
            .spawn()?
            .wait()?;
        std::process::Command::new("ip")
            .arg("addr")
            .arg("add")
            .arg("dev")
            .arg(devname)
            .arg("local")
            .arg("10.0.0.2/24")
            .spawn()?
            .wait()?;
        // Assumes `tap_fd` is the only field of TunDevice
        Ok(Self { tap_fd })
    }
}
$ ip a
...
7: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether b4:8c:9d:d4:77:f7 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.4/24 brd 192.168.1.255 scope global dynamic noprefixroute wlan0
       valid_lft 65278sec preferred_lft 54478sec
...
9: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 500
    link/none
    inet 10.0.0.2/24 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::6705:55f2:d2f5:b3ab/64 scope link stable-privacy proto kernel_ll
       valid_lft forever preferred_lft forever
sysctl -w net.ipv4.ip_forward=1
iptables -I INPUT --source 10.0.0.0/24 -j ACCEPT
iptables -t nat -I POSTROUTING --out-interface wlan0 -j MASQUERADE
iptables -I FORWARD --in-interface wlan0 --out-interface tun0 -j ACCEPT
iptables -I FORWARD --in-interface tun0 --out-interface wlan0 -j ACCEPT
Routing Packets
Handling Packets
- read() / write() allow reading/writing raw IP packets
fn read_packets(&self) -> Result<(), std::io::Error> {
    // Large enough for any IPv4 packet (total length is at most 65535 bytes)
    let mut buf = vec![0_u8; 65536];
    loop {
        let size = nix::unistd::read(self.tap_fd.as_raw_fd(), &mut buf[..])?;
        // Only parse the bytes that were actually read
        match etherparse::Ipv4HeaderSlice::from_slice(&buf[..size]) {
            Ok(ip) => match ip.protocol() {
                etherparse::IpNumber::TCP => {
                    // The TCP segment starts right after the IP header
                    match etherparse::TcpSlice::from_slice(&buf[ip.slice().len()..size]) {
                        Ok(tcp) => {
                            /* handle TCP packet */
                        },
                        Err(e) => ...,
                    }
                },
                _ => ...,
            },
            Err(e) => ...,
        }
    }
}
$ tcpdump -v -i tun0
tcpdump: listening on tun0, link-type RAW (Raw IP), snapshot length 262144 bytes
00:20:04.624733 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    10.0.0.1.51737 > tcpbin.com.4242: Flags [S], cksum 0x5cc1 (correct), seq 3502768018, win 65535, length 0
00:20:04.832437 IP (tos 0x0, ttl 51, id 0, offset 0, flags [DF], proto TCP (6), length 44)
    tcpbin.com.4242 > 10.0.0.1.51737: Flags [S.], cksum 0x2327 (correct), seq 3004300668, ack 3502768019, win 29200, options [mss 1250], length 0
00:20:04.832896 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    10.0.0.1.51737 > tcpbin.com.4242: Flags [.], cksum 0x3a12 (correct), seq 1, ack 1, win 29200, length 0
00:20:04.832971 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 46)
TCP implementation
3-way handshake
    TCP Peer A                                          TCP Peer B
1.  CLOSED                                              LISTEN
2.  SYN-SENT    --> <SEQ=100><CTL=SYN>              ...
3.  (duplicate) ... <SEQ=90><CTL=SYN>               --> SYN-RECEIVED
4.  SYN-SENT    <-- <SEQ=300><ACK=91><CTL=SYN,ACK>  <-- SYN-RECEIVED
5.  SYN-SENT    --> <SEQ=91><CTL=RST>               --> LISTEN
6.              ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED
7.  ESTABLISHED <-- <SEQ=400><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED
8.  ESTABLISHED --> <SEQ=101><ACK=401><CTL=ACK>     --> ESTABLISHED
- Random initial sequence numbers help identify segments from stale connections
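Peer A's side of the exchange above can be sketched as a check on the acknowledgment number while in SYN-SENT: a SYN,ACK that does not acknowledge our own SYN is answered with an RST whose sequence number is the offending ACK. The types `Seg`, `State` and `Reply` are hypothetical simplifications, and simultaneous open (a SYN without ACK) is ignored here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    SynSent,
    Established,
}

#[derive(Debug, PartialEq, Eq)]
enum Reply {
    Ack { seq: u32, ack: u32 },
    Rst { seq: u32 },
    Nothing,
}

/// Hypothetical minimal view of an incoming segment.
struct Seg {
    seq: u32,
    ack: Option<u32>, // None when the ACK flag is not set
    syn: bool,
}

/// Reaction of the connecting peer while in SYN-SENT.
/// `iss` is the initial sequence number we put in our SYN.
fn on_segment_in_syn_sent(iss: u32, seg: &Seg) -> (State, Reply) {
    match seg.ack {
        // The ACK does not cover our SYN: it answers an old duplicate.
        // Reset the other side and keep waiting in SYN-SENT.
        Some(ack) if ack != iss.wrapping_add(1) => (State::SynSent, Reply::Rst { seq: ack }),
        // Valid SYN,ACK: acknowledge the peer's SYN and finish the handshake.
        Some(ack) if seg.syn => (
            State::Established,
            Reply::Ack { seq: ack, ack: seg.seq.wrapping_add(1) },
        ),
        _ => (State::SynSent, Reply::Nothing),
    }
}
```

With `iss = 100`, a segment `<SEQ=300><ACK=91><CTL=SYN,ACK>` produces `Rst { seq: 91 }`, while `<SEQ=400><ACK=101><CTL=SYN,ACK>` produces `Ack { seq: 101, ack: 401 }`, matching steps 5 and 8 of the diagram.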
TcpSlice {
    header: TcpHeader {
        source_port: 4242,
        destination_port: 41171,
        sequence_number: 2188365908,
        acknowledgment_number: 1116455958,
        ns: false,
        fin: false,
        syn: true,
        rst: false,
        psh: false,
        ack: true,
        urg: false,
        ece: false,
        cwr: false,
        window_size: 29200,
        checksum: 29631,
        urgent_pointer: 0,
        options: [
            MaximumSegmentSize(
                1300,
            ),
        ],
    },
    payload: [],
}
- Each peer includes window size and options in headers (e.g. SACK, window scale)
Sending Data
Window size = 5

1. Initial packet being sent out

     0   1   2   3   4   5   6   7   8   9
   +---------------------------------------+
   | h | e | l | l | o | w | o | r | l | d |
   +---------------------------------------+
   <--------------------------------------->
                    buffer
   <------------------->
          window
   (SND.UNA = 0,
    SND.NXT = 5)

2. First 2 bytes acknowledged by receiver, window moves forward

     0   1   2   3   4   5   6   7   8   9
   +---------------------------------------+
   | h | e | l | l | o | w | o | r | l | d |
   +---------------------------------------+
           <------------------------------->
                        buffer
           <------------------->
                  window
           (SND.UNA = 2,
            SND.NXT = 7)

- Sliding window is used to control in-flight packets to avoid overwhelming the receiver
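The sender side of the sliding window can be sketched with the two pointers from the diagrams, SND.UNA and SND.NXT. This is a simplified illustration: `SendBuffer`, `take_sendable` and `on_ack` are hypothetical names, the window is fixed, and segmentation by MSS is left out.

```rust
struct SendBuffer {
    /// Bytes not yet acknowledged; data[0] has sequence number `una`.
    data: Vec<u8>,
    una: u32, // SND.UNA: oldest unacknowledged sequence number
    nxt: u32, // SND.NXT: next sequence number to send
    wnd: u32, // window size advertised by the receiver
}

impl SendBuffer {
    fn new(data: &[u8], wnd: u32) -> Self {
        Self { data: data.to_vec(), una: 0, nxt: 0, wnd }
    }

    /// Returns the bytes that may be sent now, i.e. those between
    /// SND.NXT and SND.UNA + window, and advances SND.NXT past them.
    fn take_sendable(&mut self) -> Vec<u8> {
        let in_flight = self.nxt.wrapping_sub(self.una) as usize;
        let limit = (self.wnd as usize).min(self.data.len());
        if in_flight >= limit {
            return Vec::new(); // window is full (or nothing left to send)
        }
        let out = self.data[in_flight..limit].to_vec();
        self.nxt = self.una.wrapping_add(limit as u32);
        out
    }

    /// Handles a cumulative ACK: drops acknowledged bytes and moves SND.UNA.
    fn on_ack(&mut self, ack: u32) {
        let acked = ack.wrapping_sub(self.una);
        let in_flight = self.nxt.wrapping_sub(self.una);
        if acked == 0 || acked > in_flight {
            return; // duplicate ACK, or ACK for data that was never sent
        }
        self.data.drain(..acked as usize);
        self.una = ack;
    }
}
```

For `helloworld` with a window of 5, the first call sends `hello` (SND.NXT = 5). After an ACK for 2, the buffer shrinks to `lloworld` and the next call sends `wo` (SND.NXT = 7).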
Receiving Data
- ACKs are cumulative (unless SACK is used)
- Each in-order segment is ACKed, out-of-order segments generate duplicate ACKs
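Cumulative ACKs with out-of-order buffering can be sketched as follows. `Receiver` and `on_segment` are hypothetical names, and the sketch deliberately ignores sequence-number wraparound, overlapping segments and the receive-window check.

```rust
use std::collections::BTreeMap;

struct Receiver {
    /// RCV.NXT: next sequence number expected in order.
    rcv_nxt: u32,
    /// Segments that arrived ahead of RCV.NXT, keyed by sequence number.
    out_of_order: BTreeMap<u32, Vec<u8>>,
    /// In-order bytes ready for the application.
    delivered: Vec<u8>,
}

impl Receiver {
    fn new() -> Self {
        Self { rcv_nxt: 0, out_of_order: BTreeMap::new(), delivered: Vec::new() }
    }

    /// Processes one segment and returns the ACK number to send back,
    /// which is always the current RCV.NXT.
    fn on_segment(&mut self, seq: u32, payload: &[u8]) -> u32 {
        if seq == self.rcv_nxt {
            self.delivered.extend_from_slice(payload);
            self.rcv_nxt += payload.len() as u32;
            // The gap may now be closed: pull in buffered segments that follow.
            while let Some(buf) = self.out_of_order.remove(&self.rcv_nxt) {
                self.rcv_nxt += buf.len() as u32;
                self.delivered.extend(buf);
            }
        } else if seq > self.rcv_nxt {
            // Ahead of what we expect: store it, the ACK stays unchanged
            // (this is what the sender sees as a duplicate ACK).
            self.out_of_order.insert(seq, payload.to_vec());
        }
        self.rcv_nxt
    }
}
```

Feeding the first seven bytes of `helloworld` as `he` (seq 0), `wo` (seq 5), `o` (seq 4) and `ll` (seq 2) yields the ACKs 2, 2, 2 and 7.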
1. Receiving initial segment (2 bytes), send ACK for `2` (next expected sequence number)

     0   1   2   3   4
   +-------------------+
   | h | e | ? | ? | ? |
   +-------------------+
   <------------------->
          window
   <------->
   received (RCV.NXT = 2)

2. Received more packets (`llowo`) in different TCP segments, out-of-order

   1. `wo`, stored but can't be ACKed yet because it is out-of-order, ACK `2` again

                2   3   4   5   6
   +---------------------------+
   | h | e | ? | ? | ? | w | o |
   +---------------------------+
           <------------------->
                  window
   <------->
   received (RCV.NXT = 2)

   2. `o`, stored but can't be ACKed yet because it is out-of-order, ACK `2` again

                2   3   4   5   6
   +---------------------------+
   | h | e | ? | ? | o | w | o |
   +---------------------------+
           <------------------->
                  window
   <------->
   received (RCV.NXT = 2)

   3. `ll`, fills the gap, so everything is in order and an ACK is sent out for `7`

                2   3   4   5   6
   +---------------------------+
   | h | e | l | l | o | w | o |
   +---------------------------+
           <------------------->
                  window
   <--------------------------->
   received (RCV.NXT = 7)

Data retransmission
- RTT is measured for each segment, and a segment is re-sent if it is not ACKed within the retransmission timeout
- RTT is not measured for re-transmitted segments, and the timeout doubles on each retransmission (max 60s)

srtt   - smoothed round-trip time
rttvar - round-trip time variation
rto    - retransmission timeout

rto = 1000 (initial value)
G (clock granularity) = 0.01

On first measurement R
    srtt = R
    rttvar = R/2
    rto = srtt + max(G, 4 * rttvar)

On subsequent measurements
    rttvar = 0.75 * rttvar + 0.25 * |srtt - R|
    srtt = 0.875 * srtt + 0.125 * R
    rto = srtt + max(G, 4 * rttvar)
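The formulas above translate almost line by line into code. The sketch below works in seconds throughout, which assumes the slide's `rto = 1000` is in milliseconds and `G = 0.01` is in seconds. `RttEstimator` and its method names are hypothetical.

```rust
const G: f64 = 0.01; // clock granularity, in seconds (assumed unit)
const INITIAL_RTO: f64 = 1.0; // 1000 ms, expressed in seconds
const MAX_RTO: f64 = 60.0; // cap from the "max 60s" note

struct RttEstimator {
    srtt: Option<f64>, // None until the first measurement
    rttvar: f64,
    rto: f64,
}

impl RttEstimator {
    fn new() -> Self {
        Self { srtt: None, rttvar: 0.0, rto: INITIAL_RTO }
    }

    /// Feeds one RTT measurement `r` (seconds). Must not be called for
    /// segments that were retransmitted, since their RTT is ambiguous.
    fn on_measurement(&mut self, r: f64) {
        let srtt = match self.srtt {
            None => {
                self.rttvar = r / 2.0;
                r
            }
            Some(srtt) => {
                // rttvar is updated first, using the old srtt.
                self.rttvar = 0.75 * self.rttvar + 0.25 * (srtt - r).abs();
                0.875 * srtt + 0.125 * r
            }
        };
        self.srtt = Some(srtt);
        self.rto = (srtt + G.max(4.0 * self.rttvar)).min(MAX_RTO);
    }

    /// Called when the retransmission timer fires: exponential backoff.
    fn on_timeout(&mut self) {
        self.rto = (self.rto * 2.0).min(MAX_RTO);
    }
}
```

For measurements of 0.1 s and then 0.2 s, this gives an RTO of 0.3 s and then 0.3625 s. A subsequent timeout doubles it to 0.725 s.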
Closing connections
- Cleanly closing a connection involves a 4-way "handshake"
- Connections can be abruptly closed by sending an RST
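The states involved in closing can be sketched as a small transition function. This is a simplified illustration with hypothetical `State` and `Event` types. It leaves out simultaneous close (the CLOSING state) and treats any RST as an immediate jump to CLOSED.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Established,
    FinWait1,
    FinWait2,
    CloseWait,
    LastAck,
    TimeWait,
    Closed,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    Close,           // local application closes the socket (we send FIN)
    RecvFin,         // peer's FIN arrived (we ACK it)
    RecvAckOfFin,    // peer ACKed our FIN
    RecvRst,         // abrupt close
    TimeWaitExpired, // 2 MSL timer ran out
}

fn next_state(state: State, event: Event) -> State {
    use Event::*;
    use State::*;
    match (state, event) {
        (_, RecvRst) => Closed,
        // Active closer (Peer A in the diagram)
        (Established, Close) => FinWait1,
        (FinWait1, RecvAckOfFin) => FinWait2,
        (FinWait2, RecvFin) => TimeWait,
        (TimeWait, TimeWaitExpired) => Closed,
        // Passive closer (Peer B in the diagram)
        (Established, RecvFin) => CloseWait,
        (CloseWait, Close) => LastAck,
        (LastAck, RecvAckOfFin) => Closed,
        // Anything else leaves the state unchanged in this sketch.
        (s, _) => s,
    }
}
```

Keeping the transitions in one pure function makes each path through the diagram easy to check in isolation.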
    TCP Peer A                                          TCP Peer B
1.  ESTABLISHED                                         ESTABLISHED
2.  (Close)
    FIN-WAIT-1  --> <SEQ=100><ACK=300><CTL=FIN,ACK> --> CLOSE-WAIT
3.  FIN-WAIT-2  <-- <SEQ=300><ACK=101><CTL=ACK>     <-- CLOSE-WAIT
4.                                                      (Close)
    TIME-WAIT   <-- <SEQ=300><ACK=101><CTL=FIN,ACK> <-- LAST-ACK
5.  TIME-WAIT   --> <SEQ=101><ACK=301><CTL=ACK>     --> CLOSED
6.  (2 MSL)
    CLOSED
TODO
- Selective Acknowledgment (SACK) - allows only actual missing segments to be re-transmitted
- Fast retransmits - Re-transmits segments on multiple duplicate ACKs
- Nagle's algorithm - Prevents sending out several small packets
- Delayed ACKs - Batch ACKs on new outgoing segments
- Timestamps
- Congestion control - Congestion window, ECN, etc.
Modelling TCP in Rust
- `TunDevice` manages socket creation & packet input/output
- Each `TcpSocket` instance needs to be accessed by `TunDevice` for passing on TCP packets & executing timers, and also by user code to concurrently read & write
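One common way to satisfy this shared-access requirement in Rust is to wrap each socket in `Arc<Mutex<...>>`, with the device keeping one handle and user code another. The sketch below illustrates that option and is not necessarily the design used in the talk. The fields, the `open_socket` and `deliver` methods, and keying sockets by local port (rather than by quad) are all assumptions made for brevity.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical, heavily reduced socket state.
#[derive(Default)]
struct TcpSocket {
    recv_buf: Vec<u8>, // filled by the device, drained by user code
    send_buf: Vec<u8>, // filled by user code, drained by the device
}

type SharedSocket = Arc<Mutex<TcpSocket>>;

#[derive(Default)]
struct TunDevice {
    /// Keyed by local port here; a real stack would key by the full quad.
    sockets: HashMap<u16, SharedSocket>,
}

impl TunDevice {
    /// Creates a socket and returns the handle given to user code,
    /// while the device keeps its own clone for packet delivery.
    fn open_socket(&mut self, port: u16) -> SharedSocket {
        let sock = Arc::new(Mutex::new(TcpSocket::default()));
        self.sockets.insert(port, Arc::clone(&sock));
        sock
    }

    /// Called from the packet loop: hands a payload to the matching socket.
    /// Returns false if no socket is bound to `port`.
    fn deliver(&self, port: u16, payload: &[u8]) -> bool {
        match self.sockets.get(&port) {
            Some(sock) => {
                sock.lock().unwrap().recv_buf.extend_from_slice(payload);
                true
            }
            None => false,
        }
    }
}

/// User-side write: queue bytes for the device to send later.
fn write(sock: &SharedSocket, data: &[u8]) {
    sock.lock().unwrap().send_buf.extend_from_slice(data);
}
```

The lock is held only for the duration of a buffer copy, so the packet loop and user threads block each other only briefly. A blocking `read` would additionally need something like a `Condvar` to wait for data.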
By git-bruh