TCP/IP Deep Dive for Engineers

TCP/IP is the backbone of modern networking, but most engineers interact with it only through abstractions (HTTP libraries, ORMs, load balancers). When things break, or when you need to optimize, understanding what is happening on the wire is irreplaceable.

1. The Three-Way Handshake

Every TCP connection starts with SYN, SYN-ACK, ACK:

Client                           Server
  ├───────────── SYN ──────────►│  Step 1
  │◄────────── SYN-ACK ────────┤  Step 2
  ├───────────── ACK ──────────►│  Step 3
  │◄══════════ Data ══════════►│

# tcpdump view
sudo tcpdump -i eth0 -nn 'tcp port 80'

# Flags: [S] = SYN, [S.] = SYN-ACK, [.] = ACK, [F] = FIN, [R] = RST
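
The flag notation above maps directly onto bits in byte 13 of the TCP header. As a minimal sketch (the name `tcpdump_flags` is hypothetical, and only the six classic flag bits are handled), the mapping can be decoded like this:

```python
# Bit values of the six classic TCP flags in byte 13 of the TCP header,
# paired with the single-character codes tcpdump prints for them.
FLAG_CODES = [
    (0x01, "F"),  # FIN
    (0x02, "S"),  # SYN
    (0x04, "R"),  # RST
    (0x08, "P"),  # PSH
    (0x10, "."),  # ACK
    (0x20, "U"),  # URG
]


def tcpdump_flags(flags_byte: int) -> str:
    """Render a TCP flags byte in tcpdump's bracketed notation."""
    codes = "".join(code for bit, code in FLAG_CODES if flags_byte & bit)
    return f"[{codes or 'none'}]"


print(tcpdump_flags(0x02))  # SYN only
print(tcpdump_flags(0x12))  # SYN + ACK
print(tcpdump_flags(0x10))  # ACK only
```

The same bit values are what byte-offset filters such as `tcp[13] & 2 != 0` test against.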

The cost: the handshake adds one RTT before any data flows. On a path with a 100 ms RTT, that is 100 ms of overhead for every new connection. This is why connection reuse (HTTP keep-alive, connection pools) is critical.
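
To put numbers on that cost, here is a rough sketch. Both function names and the request count are hypothetical, and the model assumes exactly one RTT per handshake while ignoring TLS, DNS, and slow start:

```python
import socket
import time


def handshake_overhead_ms(rtt_ms: float, requests: int, reuse: bool) -> float:
    """Total time spent in TCP handshakes for a series of requests."""
    connections = 1 if reuse else requests
    return connections * rtt_ms


def measure_connect_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Time a real TCP connect(), which returns once the handshake completes.

    Pass an IP address rather than a hostname to exclude DNS lookup time.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0


rtt_ms = 100   # round-trip time from the example above
requests = 50  # hypothetical number of sequential requests

print(handshake_overhead_ms(rtt_ms, requests, reuse=False))  # one handshake per request
print(handshake_overhead_ms(rtt_ms, requests, reuse=True))   # a single reused connection
```

Calling `measure_connect_ms` against a server you control gives a quick approximation of the RTT to plug into the model.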

2. TCP Window and Flow Control

TCP uses a sliding window to limit how much unacknowledged data can be in flight:

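# Check the initial window size and wscale option (both appear in the SYN and SYN-ACK)
sudo tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0'

# Window scaling (RFC 7323, which obsoletes RFC 1323): win 65535 with wscale 7
# = 65535 * 2^7 bytes, ~8 MB. The window field in the SYN itself is never scaled.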

Zero Window: If the receiver's buffer is full, it advertises win 0. The sender must stop sending data and falls back to periodic zero-window probes until a window update arrives.
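
The window also caps throughput, because a sender can have at most one window of data in flight per round trip. The following sketch works through the scaling arithmetic; the function names and the 100 ms RTT are assumptions chosen for illustration:

```python
def effective_window_bytes(win: int, wscale: int) -> int:
    """Advertised window after applying the window scale shift."""
    return win << wscale


def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on throughput in Mbit/s: one window per round trip."""
    return window_bytes * 8 * 1000 / rtt_ms / 1_000_000


scaled = effective_window_bytes(65535, 7)    # win 65535, wscale 7
unscaled = effective_window_bytes(65535, 0)  # no window scaling

print(scaled)
print(f"{max_throughput_mbps(scaled, 100):.1f} Mbit/s with scaling")
print(f"{max_throughput_mbps(unscaled, 100):.1f} Mbit/s without scaling")
```

At a 100 ms RTT, the scaled window allows roughly 671 Mbit/s, while an unscaled 64 KB window tops out near 5.2 Mbit/s regardless of link speed.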

3. Congestion Control

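Slow Start: start cwnd at 10 segments (~14 KB with a 1460-byte MSS) and double it every RTT until a loss occurs or cwnd reaches the slow-start threshold (ssthresh).

Congestion Avoidance: above ssthresh, classic (Reno-style) TCP increases cwnd by about 1 segment per RTT. On loss, cwnd is cut multiplicatively (Reno halves it; Cubic reduces it to 70%).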
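
The two growth phases can be sketched with a toy model. This is a simplified Reno-style simulation, not how Cubic or BBR behave; the function names and the ssthresh of 64 segments are assumptions:

```python
def cwnd_trace(initial: int = 10, ssthresh: int = 64, rtts: int = 8) -> list[int]:
    """cwnd (in segments) at the start of each RTT, assuming no loss."""
    cwnd = initial
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double per RTT
        else:
            cwnd += 1  # congestion avoidance: +1 segment per RTT
    return trace


def on_loss_reno(cwnd: int) -> int:
    """Reno-style multiplicative decrease: halve cwnd, floor of 2 segments."""
    return max(cwnd // 2, 2)


trace = cwnd_trace()
print(trace)
print(on_loss_reno(trace[-1]))
```

The trace shows why short transfers rarely leave slow start: it takes three round trips just to grow from 10 to 64 segments.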

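# Check current algorithm
sysctl net.ipv4.tcp_congestion_control  # default on most Linux distributions: cubic

# Switch to BBR (the tcp_bbr module must be available; load it first if needed)
sudo modprobe tcp_bbr
sudo sysctl net.ipv4.tcp_congestion_control=bbr

# Compare with iperf3 (-C selects the algorithm for that test; Linux and FreeBSD only)
iperf3 -c server-ip -t 30 -C cubic
iperf3 -c server-ip -t 30 -C bbr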

Algorithm        Pros                                             Cons
Cubic (default)  Good for long-fat pipes                          Slow recovery on lossy links
BBR              Model-based, excellent throughput, low latency   Can be unfair to Cubic flows
Reno             Conservative, well-understood                    Poor on high-BDP links
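
The sysctl changes the system-wide default. On Linux an application can also pick an algorithm for a single socket through the `TCP_CONGESTION` socket option. The sketch below assumes a Linux host; the helper names are hypothetical, and unprivileged processes can only select algorithms listed in `net.ipv4.tcp_allowed_congestion_control`:

```python
import socket


def get_congestion_control(sock: socket.socket) -> str:
    """Name of the congestion control algorithm used by this socket (Linux only)."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    return raw.split(b"\x00", 1)[0].decode()


def set_congestion_control(sock: socket.socket, name: str) -> None:
    """Select an algorithm for this socket only; the kernel must have it loaded."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, name.encode())


if __name__ == "__main__" and hasattr(socket, "TCP_CONGESTION"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # A fresh socket reports the system default.
        print(get_congestion_control(s))
```

This makes it possible to trial BBR for one service without changing the behavior of everything else on the host.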

4. MTU, MSS, and Fragmentation

Link Type      MTU
Ethernet       1500 bytes
PPPoE          1492 bytes
Jumbo frames   9000 bytes
Loopback       65536 bytes

MSS = MTU - 20 (IPv4 header) - 20 (TCP header) = 1460 bytes for standard Ethernet.
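
Applying the formula to the link types above gives the following sketch. The IPv6 header size is an addition not covered in the text, and the function name `mss` is just illustrative:

```python
IPV4_HEADER = 20  # bytes, without IP options
IPV6_HEADER = 40  # bytes, fixed IPv6 header (an addition, not from the text)
TCP_HEADER = 20   # bytes, without TCP options


def mss(mtu: int, ip_header: int = IPV4_HEADER) -> int:
    """Maximum segment size for a given link MTU."""
    return mtu - ip_header - TCP_HEADER


LINK_MTUS = {"Ethernet": 1500, "PPPoE": 1492, "Jumbo frames": 9000}

for link, mtu in LINK_MTUS.items():
    print(f"{link:<13} MTU {mtu:>5}  MSS {mss(mtu):>5}")

print(f"Ethernet over IPv6: MSS {mss(1500, IPV6_HEADER)}")
```

Note that TCP options eat into the payload of each segment: with timestamps enabled, 12 bytes of every segment go to the option, leaving 1448 bytes of data on standard Ethernet.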

# Path MTU Discovery
tracepath -n example.com

# MSS clamping: rewrite the MSS option in forwarded SYNs to fit the path MTU
sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

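Common MTU issues: encapsulation overhead shrinks the usable MTU inside VPN tunnels (GRE: 24 bytes, IPsec: roughly 50-70 bytes) and Docker VXLAN overlays (50 bytes). Fix: lower the MTU on the tunnel interface by the overhead, or clamp the MSS as shown above.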
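
A quick way to find the right tunnel MTU is to subtract the encapsulation overhead from the underlay MTU. This sketch uses the overhead figures quoted above; the function names are hypothetical, and the IPsec entry takes the upper bound because the real value depends on mode and cipher:

```python
# Encapsulation overhead in bytes, taken from the figures in the text.
OVERHEAD = {"GRE": 24, "IPsec (worst case)": 70, "VXLAN": 50}


def tunnel_mtu(underlay_mtu: int, overhead: int) -> int:
    """Largest inner packet that fits without fragmenting the outer packet."""
    return underlay_mtu - overhead


def tunnel_mss(underlay_mtu: int, overhead: int) -> int:
    """TCP MSS inside the tunnel, assuming IPv4 and no TCP options."""
    return tunnel_mtu(underlay_mtu, overhead) - 20 - 20


for name, overhead in OVERHEAD.items():
    print(f"{name:<20} MTU {tunnel_mtu(1500, overhead)}  MSS {tunnel_mss(1500, overhead)}")
```

If the tunnel interface keeps the default 1500-byte MTU instead, full-size packets either fragment or get dropped, which typically shows up as connections that establish fine but stall on large transfers.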

5. TCP Timeout and Retransmission

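# Check per-connection TCP info (rtt, rto, cwnd, retrans)
ss -ti | head -20

# Monitor retransmissions
netstat -s | grep -i retransmit

# TCP tuning (both are already enabled by default on modern Linux kernels)
sudo sysctl net.ipv4.tcp_timestamps=1
sudo sysctl net.ipv4.tcp_sack=1
# net.ipv4.tcp_fack is a legacy option with no effect on current kernels (FACK was removed in Linux 4.15)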

The initial RTO is 1 second (RFC 6298). It doubles on each retransmission (exponential backoff), up to a cap of 120 s on Linux.
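
The backoff is easy to tabulate. The sketch below models only the numbers stated above (1 s initial, doubling, 120 s cap); the function name and the retry count of 9 are assumptions, and an established connection would start from an RTO derived from measured RTT instead:

```python
def rto_schedule(initial: float = 1.0, cap: float = 120.0, retries: int = 9) -> list[float]:
    """Timeout (seconds) used before each successive retransmission."""
    rto = initial
    schedule = []
    for _ in range(retries):
        schedule.append(rto)
        rto = min(rto * 2, cap)  # exponential backoff, capped
    return schedule


schedule = rto_schedule()
print(schedule)
print(f"total wait before giving up: {sum(schedule):.0f} s")
```

This is why a connection to a silently unreachable peer can hang for minutes before the application sees an error.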

6. Practical tcpdump Analysis

# Connection lifecycle
tcpdump -i eth0 -nn 'tcp port 443'

# Only SYNs (new connection attempts)
tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'

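# Retransmissions: a stateless capture filter cannot detect them
# ('tcp[13] & 8 != 0' only matches the PSH flag), so use tshark's TCP analysis
tshark -i eth0 -Y 'tcp.analysis.retransmission'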

Metric                 What It Tells You           Healthy Range
Average RTT            Network latency             Depends on geography
Retransmission rate    Packet loss / congestion    < 1%
Out-of-order packets   Network reordering          < 0.1%
Zero window events     Slow consumer               Should be 0
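
On Linux, a host-wide retransmission rate can be approximated from the `RetransSegs` and `OutSegs` counters in `/proc/net/snmp`. The following is a sketch under that assumption; the function names are hypothetical and the sample text is made up and trimmed to a few fields:

```python
def tcp_counters(snmp_text: str) -> dict[str, int]:
    """Parse the two 'Tcp:' lines (header row, value row) of /proc/net/snmp."""
    rows = [line.split()[1:] for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    names, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(names, values)}


def retransmission_rate(counters: dict[str, int]) -> float:
    """Retransmitted segments relative to OutSegs (a common approximation)."""
    if counters["OutSegs"] == 0:
        return 0.0
    return counters["RetransSegs"] / counters["OutSegs"]


# Illustrative sample only; on a real host, read /proc/net/snmp instead.
SAMPLE = """\
Tcp: ActiveOpens PassiveOpens InSegs OutSegs RetransSegs
Tcp: 120 80 500000 400000 2000
"""

rate = retransmission_rate(tcp_counters(SAMPLE))
print(f"{rate:.2%}")
```

The counters are cumulative since boot, so for a current rate, sample twice and compute the ratio of the deltas.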

7. TCP Tuning for High Performance

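# Increase buffer sizes (datacenter); 134217728 bytes = 128 MiB
sudo sysctl net.core.rmem_max=134217728
sudo sysctl net.core.wmem_max=134217728
sudo sysctl net.ipv4.tcp_rmem='4096 87380 134217728'   # min, default, max (bytes)
sudo sysctl net.ipv4.tcp_wmem='4096 65536 134217728'   # min, default, max (bytes)

# Enable TCP Fast Open (3 = client and server)
sudo sysctl net.ipv4.tcp_fastopen=3

# Start keepalive probes after 300 s of idle time (default: 7200 s)
sudo sysctl net.ipv4.tcp_keepalive_time=300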
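
Whether a 128 MiB buffer maximum is enough depends on the bandwidth-delay product (BDP) of the path. The sketch below is a rough check only: the function name and the example links are assumptions, and the kernel reserves part of each buffer for bookkeeping, so the usable window is smaller than the configured size:

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes in flight needed to keep the pipe full."""
    return bandwidth_bps * rtt_ms // 1000 // 8


BUFFER_MAX = 134_217_728  # bytes (128 MiB), the maximum set above

# Hypothetical paths: (link speed in Gbit/s, RTT in ms)
for gbps, rtt_ms in [(1, 1), (10, 1), (10, 100)]:
    need = bdp_bytes(gbps * 1_000_000_000, rtt_ms)
    print(f"{gbps:>2} Gbit/s, {rtt_ms:>3} ms RTT: BDP {need:>11,} bytes, fits: {need <= BUFFER_MAX}")
```

Inside a single datacenter the BDP is small and the defaults are usually adequate; the large maximum matters for fast long-distance paths.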

Summary

When debugging performance issues:

  1. Check RTT (network latency is the #1 factor)
  2. Check retransmission rate (packet loss kills throughput)
  3. Check window scaling and buffer sizes
  4. Check MTU restrictions (VPNs, overlays)
  5. Check congestion control algorithm (Cubic vs BBR)

Understanding these fundamentals turns "the network is slow" into a specific, actionable diagnosis.