TCP/IP is the backbone of modern networking, but most engineers interact with it only through abstractions (HTTP libraries, ORMs, load balancers). When things break, or when you need to optimize, understanding what's happening on the wire is irreplaceable.
1. The Three-Way Handshake
Every TCP connection starts with SYN, SYN-ACK, ACK:
Client                      Server
├───────────── SYN ──────────►│ Step 1
│◄────────── SYN-ACK ────────┤ Step 2
├───────────── ACK ──────────►│ Step 3
│◄══════════ Data ══════════►│
# tcpdump view
sudo tcpdump -i eth0 -nn 'tcp port 80'
# Flags: [S] = SYN, [S.] = SYN-ACK, [.] = ACK, [F] = FIN, [R] = RST
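The flag letters tcpdump prints correspond to individual bits of the TCP flags byte (offset 13 of the TCP header, which is what `tcp[13]` and `tcp[tcpflags]` read in a filter). As a minimal sketch, the helper below turns a numeric flags byte into flag names. `decode_tcp_flags` is a hypothetical name used for illustration and is not part of tcpdump.

```shell
# decode_tcp_flags BYTE
# Hypothetical helper: print the names of the TCP flags set in BYTE.
# Bit values follow the TCP header layout: FIN=1, SYN=2, RST=4, PSH=8, ACK=16, URG=32.
decode_tcp_flags() {
    out=""
    for pair in 1:FIN 2:SYN 4:RST 8:PSH 16:ACK 32:URG; do
        bit=${pair%%:*}     # numeric bit value before the colon
        name=${pair#*:}     # flag name after the colon
        if [ $(( $1 & bit )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"         # strip the leading space
}

decode_tcp_flags 2     # prints: SYN      (tcpdump shows [S])
decode_tcp_flags 18    # prints: SYN ACK  (tcpdump shows [S.])
decode_tcp_flags 17    # prints: FIN ACK  (tcpdump shows [F.])
```

Knowing the bit values makes it easier to read or write raw filters such as `tcp[13] & 2 != 0`.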
The cost: the handshake adds 1 RTT before any data flows. On a path with a 100 ms RTT, that's 100 ms of overhead per new connection. This is why connection reuse (HTTP keep-alive, connection pools) is critical.
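To make the cost concrete, here is a rough sketch of the arithmetic. It assumes each new connection pays exactly one RTT of handshake delay and ignores TLS and DNS. `handshake_overhead_ms` is a hypothetical helper name.

```shell
# handshake_overhead_ms RTT_MS NEW_CONNECTIONS
# Hypothetical helper: total time spent in TCP handshakes, in milliseconds,
# assuming one RTT per new connection.
handshake_overhead_ms() {
    echo $(( $1 * $2 ))
}

# 50 requests over a 100 ms RTT path
handshake_overhead_ms 100 50   # one connection per request: prints 5000
handshake_overhead_ms 100 1    # one reused connection:      prints 100
```

Under these assumptions, reusing a single connection saves almost five seconds of cumulative waiting across the 50 requests.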
2. TCP Window and Flow Control
TCP uses a sliding window to limit how much unacknowledged data can be in flight:
# Check initial window size (in SYN packet)
tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0'
# Window scaling (RFC 7323, which obsoletes RFC 1323): win 65535 with wscale 7 = 65535 * 128 bytes, about 8 MB
Zero Window: If the receiver's buffer is full, it advertises win 0. The sender must stop until a window update arrives.
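The window arithmetic can be sketched as follows. The advertised window is shifted left by the scale factor, and since at most one window of data can be in flight per round trip, the window also bounds throughput. This is a simplification that ignores loss and congestion control, and both function names are hypothetical.

```shell
# scaled_window WIN WSCALE
# Effective receive window in bytes: the 16-bit win field shifted left by wscale.
scaled_window() {
    echo $(( $1 << $2 ))
}

# window_limited_bps WINDOW_BYTES RTT_MS
# Upper bound on throughput in bits per second: one window per round trip.
window_limited_bps() {
    echo $(( $1 * 8 * 1000 / $2 ))
}

scaled_window 65535 7              # prints 8388480 (about 8 MB)
window_limited_bps 8388480 100     # prints 671078400 (about 671 Mbit/s at 100 ms RTT)
window_limited_bps 65535 100       # prints 5242800 (about 5 Mbit/s without scaling)
```

The last line shows why window scaling matters on long paths: an unscaled 64 KB window caps a 100 ms connection at roughly 5 Mbit/s regardless of link speed.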
3. Congestion Control
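Slow Start: cwnd starts at about 14 KB (10 segments of 1460 bytes) and doubles every RTT until it reaches the slow-start threshold (ssthresh) or a loss occurs. Congestion Avoidance: past ssthresh, cwnd grows by about 1 segment per RTT. On loss, cwnd is reduced (halved in Reno, cut to roughly 70% in CUBIC).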
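As a rough illustration of how quickly slow start ramps up, the sketch below counts the RTTs needed to grow from the initial 10 segments to a target window. It assumes no loss and that ssthresh is not reached first. `slow_start_rtts` is a hypothetical helper name.

```shell
# slow_start_rtts TARGET_SEGMENTS
# Hypothetical helper: number of RTTs for cwnd to reach TARGET_SEGMENTS,
# starting at 10 segments and doubling once per RTT (idealized slow start).
slow_start_rtts() {
    cwnd=10
    rtts=0
    while [ "$cwnd" -lt "$1" ]; do
        cwnd=$(( cwnd * 2 ))
        rtts=$(( rtts + 1 ))
    done
    echo "$rtts"
}

slow_start_rtts 10      # prints 0 (already at the initial window)
slow_start_rtts 640     # prints 6 (10, 20, 40, 80, 160, 320, 640)
slow_start_rtts 5746    # prints 10 (about 8 MB of 1460-byte segments)
```

On a 100 ms path, those 10 round trips mean a fresh connection needs about a second before it can fill an 8 MB window, which is another reason short-lived connections rarely reach full speed.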
# Check current algorithm
sysctl net.ipv4.tcp_congestion_control # default: cubic
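# Switch to BBR (it must be listed in net.ipv4.tcp_available_congestion_control; load the tcp_bbr module if it is not)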
sudo sysctl net.ipv4.tcp_congestion_control=bbr
# Test with iperf3 (-C selects the congestion control algorithm; Linux and FreeBSD only)
iperf3 -c server-ip -t 30 -C cubic
iperf3 -c server-ip -t 30 -C bbr
| Algorithm | Pros | Cons |
|---|---|---|
| Cubic (Linux default) | Good for long fat pipes (high bandwidth-delay product) | Slow recovery on lossy links |
| BBR | Model-based, excellent throughput, low latency | Can be unfair to Cubic flows |
| Reno | Conservative, well-understood | Poor on high-BDP links |
4. MTU, MSS, and Fragmentation
| Link Type | MTU |
|---|---|
| Ethernet | 1500 bytes |
| PPPoE | 1492 bytes |
| Jumbo frames | 9000 bytes |
| Loopback | 65536 bytes |
MSS = MTU - 20 (IP) - 20 (TCP) = 1460 bytes for standard Ethernet.
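Applying the same formula to the other link types in the table gives the values below. This is a minimal sketch that assumes IPv4 and no IP or TCP options, exactly as in the formula. `mss_for_mtu` is a hypothetical helper name.

```shell
# mss_for_mtu MTU
# Hypothetical helper: MSS = MTU - 20 (IPv4 header) - 20 (TCP header).
mss_for_mtu() {
    echo $(( $1 - 20 - 20 ))
}

mss_for_mtu 1500    # Ethernet:     prints 1460
mss_for_mtu 1492    # PPPoE:        prints 1452
mss_for_mtu 9000    # Jumbo frames: prints 8960
```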
# Path MTU Discovery
tracepath -n example.com
# MSS clamping
sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
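Common MTU issues: tunnels add per-packet header overhead (GRE: +24 bytes, IPsec: +50-70 bytes, Docker VXLAN overlays: +50 bytes), so full-size 1500-byte packets no longer fit the path and are fragmented or dropped. Fix: lower the MTU on the tunnel interface, or clamp the MSS as shown above.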
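The MTU to configure on a tunnel interface can be estimated by subtracting the encapsulation overhead from the underlying link's MTU. The sketch below uses the overhead figures quoted above and assumes a 1500-byte underlay. Both function names are hypothetical.

```shell
# tunnel_mtu UNDERLAY_MTU OVERHEAD_BYTES
# Largest inner packet that still fits after encapsulation.
tunnel_mtu() {
    echo $(( $1 - $2 ))
}

# tunnel_mss UNDERLAY_MTU OVERHEAD_BYTES
# Matching TCP MSS inside the tunnel (IPv4, no options: 40 bytes of headers).
tunnel_mss() {
    echo $(( $1 - $2 - 40 ))
}

tunnel_mtu 1500 24    # GRE:   prints 1476
tunnel_mtu 1500 50    # VXLAN: prints 1450
tunnel_mss 1500 50    # VXLAN: prints 1410
```

For IPsec the overhead varies with the cipher and mode, so it is safer to plan for the upper end of the quoted range or to measure the path with `tracepath`.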
5. TCP Timeout and Retransmission
# Check per-connection TCP info (rtt, rto, cwnd); -t limits output to TCP sockets
ss -ti | head -20
# Monitor retransmissions
netstat -s | grep -i retransmit
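# TCP tuning (timestamps and SACK are already enabled by default on modern Linux)
sudo sysctl net.ipv4.tcp_timestamps=1
sudo sysctl net.ipv4.tcp_sack=1
# tcp_fack is a legacy no-op on Linux 4.15 and later (FACK was removed in favor of RACK)
sudo sysctl net.ipv4.tcp_fack=1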
The initial RTO is 1 second (RFC 6298) and doubles on each retransmission; Linux caps it at 120 s.
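The backoff schedule can be sketched as below. It assumes the RTO stays at its initial 1 second value before the first retransmission; on a live connection the starting RTO is derived from measured RTT. `rto_schedule` is a hypothetical helper name.

```shell
# rto_schedule COUNT
# Hypothetical helper: print the timeout (in seconds) used before each of
# COUNT successive retransmissions, starting at 1 s, doubling, capped at 120 s.
rto_schedule() {
    rto=1
    i=0
    out=""
    while [ "$i" -lt "$1" ]; do
        out="$out $rto"
        rto=$(( rto * 2 ))
        if [ "$rto" -gt 120 ]; then
            rto=120
        fi
        i=$(( i + 1 ))
    done
    echo "${out# }"     # strip the leading space
}

rto_schedule 9    # prints: 1 2 4 8 16 32 64 120 120
```

The first seven timeouts alone add up to 127 seconds, which is why a connection to a silently dead peer can appear to hang for minutes.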
6. Practical tcpdump Analysis
# Connection lifecycle
tcpdump -i eth0 -nn 'tcp port 443'
# Only SYNs (new connection attempts)
tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'
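# Segments with PSH set (tcp[13] & 8 tests the PSH bit, not retransmissions)
tcpdump -i eth0 -nn 'tcp[13] & 8 != 0'
# Retransmissions need per-stream state, which tcpdump filters lack; tshark can flag them
tshark -i eth0 -Y 'tcp.analysis.retransmission'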
| Metric | What It Tells You | Healthy Range |
|---|---|---|
| Average RTT | Network latency | Depends on geography |
| Retransmission rate | Packet loss / congestion | < 1% |
| Out-of-order packets | Network reordering | < 0.1% |
| Zero window events | Slow consumer | Ideally 0 |
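The retransmission rate in the table is simply retransmitted segments divided by segments sent. As a minimal sketch, the helper below computes it from two counters, such as the sent and retransmitted segment totals reported by `netstat -s`. `retrans_rate_pct` is a hypothetical name, and the sample numbers are made up for illustration.

```shell
# retrans_rate_pct RETRANSMITTED_SEGMENTS SEGMENTS_SENT
# Hypothetical helper: retransmission rate as a percentage with two decimals.
retrans_rate_pct() {
    awk -v r="$1" -v o="$2" 'BEGIN {
        if (o == 0) { print "0.00"; exit }   # avoid dividing by zero
        printf "%.2f\n", 100 * r / o
    }'
}

retrans_rate_pct 250 100000     # prints 0.25 (within the < 1% range)
retrans_rate_pct 3000 100000    # prints 3.00 (worth investigating)
```

The counters are cumulative since boot, so comparing two samples taken a known interval apart gives a more useful picture than a single reading.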
7. TCP Tuning for High Performance
# Increase buffer sizes (datacenter)
sysctl net.core.rmem_max=134217728
sysctl net.core.wmem_max=134217728
sysctl net.ipv4.tcp_rmem='4096 87380 134217728'
sysctl net.ipv4.tcp_wmem='4096 65536 134217728'
# Enable fast open
sysctl net.ipv4.tcp_fastopen=3
# Start keepalive probes after 300 s of idle time (default 7200 s); the interval between probes is tcp_keepalive_intvl
sysctl net.ipv4.tcp_keepalive_time=300
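Buffer ceilings like the 134217728 bytes (128 MiB) used above are usually sized from the bandwidth-delay product (BDP), the amount of data needed to keep a path full. The sketch below shows the arithmetic. `bdp_bytes` is a hypothetical helper name, and the link speeds are illustrative assumptions.

```shell
# bdp_bytes BANDWIDTH_MBIT_PER_S RTT_MS
# Hypothetical helper: bandwidth-delay product in bytes.
# (Mbit/s * 1,000,000 / 8) bytes per second * (RTT_MS / 1000) seconds
#   = Mbit/s * 125 * RTT_MS
bdp_bytes() {
    echo $(( $1 * 125 * $2 ))
}

bdp_bytes 10000 100    # 10 Gbit/s at 100 ms RTT: prints 125000000 (fits in 128 MiB)
bdp_bytes 1000 20      # 1 Gbit/s at 20 ms RTT:   prints 2500000 (about 2.5 MB)
```

If the BDP of your paths is far below the configured maximum, the large ceiling does little harm, but it also will not improve throughput.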
Summary
When debugging performance issues:
- Check RTT (network latency is usually the #1 factor)
- Check retransmission rate (packet loss kills throughput)
- Check window scaling and buffer sizes
- Check MTU restrictions (VPN overlays)
- Check congestion control algorithm (Cubic vs BBR)
Understanding these fundamentals turns "the network is slow" into a specific, actionable diagnosis.