TCP/IP Deep Dive for Engineers

June 5, 2026 Systems

TCP/IP is the backbone of modern networking, but most engineers interact with it only through abstractions (HTTP libraries, ORMs, load balancers). When things break, or when you need to optimize, understanding what's happening on the wire is irreplaceable.

1. The Three-Way Handshake

Every TCP connection starts with SYN, SYN-ACK, ACK:

Client                           Server
  ├───────────── SYN ──────────►│  Step 1
  │◄────────── SYN-ACK ────────┤  Step 2
  ├───────────── ACK ──────────►│  Step 3
  │◄══════════ Data ══════════►│

# tcpdump view
sudo tcpdump -i eth0 -nn 'tcp port 80'

# Flags: [S] = SYN, [S.] = SYN-ACK, [.] = ACK, [F] = FIN, [R] = RST

The cost: The handshake adds 1 RTT before any data flows. On a 100ms link, that's 100ms overhead per new connection. This is why connection reuse (HTTP keep-alive, connection pools) is critical.
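
To make that cost concrete, here is a minimal sketch of the arithmetic. The helper name `handshake_overhead_ms` and the request counts are assumptions for illustration, and the model counts only the TCP handshake (TLS would add further round trips).

```shell
# Sketch: total handshake overhead, assuming 1 RTT per new TCP connection.
# handshake_overhead_ms is a hypothetical helper, not part of any tool.
handshake_overhead_ms() {
    rtt_ms=$1        # round-trip time in milliseconds
    connections=$2   # number of new TCP connections opened
    echo $(( rtt_ms * connections ))
}

# 50 requests on a 100 ms link, one new connection per request
handshake_overhead_ms 100 50   # prints 5000
# The same 50 requests over a single reused connection
handshake_overhead_ms 100 1    # prints 100
```

With these assumed numbers, connection reuse turns five seconds of cumulative handshake waiting into a tenth of a second.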

2. TCP Window and Flow Control

TCP uses a sliding window to control data in-flight:

# Check initial window size and window scale option (both visible in SYN packets)
sudo tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0'

# Window scaling (RFC 7323, which obsoletes RFC 1323): win 65535 with wscale 7 = 65535 << 7 = ~8 MB

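Zero Window: If the receiver's buffer is full, it advertises win 0. The sender must stop sending data until a window update arrives; in the meantime it sends periodic window probes so that a lost update cannot stall the connection forever.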
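
The window-scaling arithmetic, and the throughput ceiling a window implies, can be sketched as follows. The function names are hypothetical, and the ceiling assumes the sender is limited only by the receive window (one full window per RTT, no loss).

```shell
# Sketch: effective receive window and the throughput ceiling it implies.
# Both function names are hypothetical helpers.
effective_window() {
    win=$1      # 16-bit window field from the TCP header
    wscale=$2   # shift count negotiated in the SYN / SYN-ACK
    echo $(( win << wscale ))
}

# Upper bound in bits per second: at most one window per round trip.
max_throughput_bps() {
    window_bytes=$1
    rtt_ms=$2
    echo $(( window_bytes * 8 * 1000 / rtt_ms ))
}

effective_window 65535 7          # prints 8388480 (about 8 MB)
max_throughput_bps 65535 100      # prints 5242800 (about 5 Mbit/s without scaling)
max_throughput_bps 8388480 100    # prints 671078400 (about 671 Mbit/s)
```

This is why an unscaled 64 KB window is a hard bottleneck on high-latency links no matter how much bandwidth is available.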

3. Congestion Control

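Slow Start: Start cwnd at 10 segments (~14.6 KB with a 1460-byte MSS) and double it every RTT until loss occurs or cwnd reaches ssthresh. Congestion Avoidance: From there, grow cwnd slowly (Reno adds about 1 segment per RTT; Cubic follows a cubic curve). On loss, reduce cwnd (Reno halves it; Cubic cuts it by about 30%).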
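
A rough sketch of what slow start means for short transfers: the number of round trips needed to deliver a payload when cwnd doubles every RTT. The helper `slow_start_rtts` is hypothetical, and it assumes no loss, no ssthresh limit, and no receive-window limit.

```shell
# Sketch: round trips needed to deliver a payload during slow start.
# Assumes an idealized connection: no loss, no ssthresh or rwnd limit.
slow_start_rtts() {
    bytes=$1          # payload size in bytes
    mss=${2:-1460}    # segment size (assumed standard Ethernet MSS)
    cwnd=${3:-10}     # initial congestion window in segments
    sent=0
    rtts=0
    while [ "$sent" -lt "$bytes" ]; do
        sent=$(( sent + cwnd * mss ))   # one full window per round trip
        cwnd=$(( cwnd * 2 ))            # slow start doubles cwnd each RTT
        rtts=$(( rtts + 1 ))
    done
    echo "$rtts"
}

slow_start_rtts 14600      # prints 1
slow_start_rtts 100000     # prints 3
slow_start_rtts 1000000    # prints 7
```

Under these assumptions a 1 MB response needs seven round trips of data transfer regardless of link bandwidth, which is why latency dominates small transfers.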

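# Check current algorithm
sysctl net.ipv4.tcp_congestion_control  # default on most Linux distributions: cubic

# Switch to BBR (load the tcp_bbr module first if bbr is not listed as available)
sysctl net.ipv4.tcp_available_congestion_control
sudo modprobe tcp_bbr
sudo sysctl net.ipv4.tcp_congestion_control=bbr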

# Test with iperf3
iperf3 -c server-ip -t 30 -C cubic
iperf3 -c server-ip -t 30 -C bbr

Algorithm	Pros	Cons
Cubic (default)	Good for long-fat pipes	Slow recovery on lossy links
BBR	Model-based, excellent throughput, low latency	Can be unfair to Cubic flows
Reno	Conservative, well-understood	Poor on high-BDP links

4. MTU, MSS, and Fragmentation

Link Type	MTU
Ethernet	1500 bytes
PPPoE	1492 bytes
Jumbo frames	9000 bytes
Loopback	65536 bytes

MSS = MTU - 20 (IPv4 header) - 20 (TCP header) = 1460 bytes for standard Ethernet. With IPv6 the IP header is 40 bytes, so the MSS drops to 1440.
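
The same formula as a small sketch, applied to the link types in the table above. The helper `mss_for_mtu` is hypothetical and assumes headers without options.

```shell
# Sketch: MSS for a given MTU, assuming no IP or TCP options.
mss_for_mtu() {
    mtu=$1
    ip_header=${2:-20}   # 20 for IPv4, 40 for IPv6 (no extension headers)
    tcp_header=20        # base TCP header, no options
    echo $(( mtu - ip_header - tcp_header ))
}

mss_for_mtu 1500       # Ethernet, IPv4: prints 1460
mss_for_mtu 1492       # PPPoE, IPv4:    prints 1452
mss_for_mtu 9000       # Jumbo frames:   prints 8960
mss_for_mtu 1500 40    # Ethernet, IPv6: prints 1440
```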

# Path MTU Discovery
tracepath -n example.com

# MSS clamping
sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

Common MTU issues: VPN tunnels (GRE: +24 bytes, IPsec: +50-70 bytes) and Docker VXLAN overlays (+50 bytes) add encapsulation overhead, so full-size packets no longer fit the path. Fix: lower the MTU on the tunnel interface.
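
To pick the value to set, subtract the encapsulation overhead from the underlay MTU. A minimal sketch, using the overhead figures quoted above; the helper `tunnel_mtu` and its keyword arguments are assumptions for illustration.

```shell
# Sketch: largest MTU that fits inside a tunnel, given the underlay MTU.
# Overhead values are the ones quoted in this article.
tunnel_mtu() {
    underlay_mtu=$1
    case $2 in
        gre)   overhead=24 ;;
        vxlan) overhead=50 ;;
        *)     overhead=$2 ;;   # numeric overhead, e.g. 50 to 70 for IPsec
    esac
    echo $(( underlay_mtu - overhead ))
}

tunnel_mtu 1500 gre      # prints 1476
tunnel_mtu 1500 vxlan    # prints 1450
tunnel_mtu 1500 70       # worst-case IPsec figure: prints 1430
```

The result is the value you would then apply to the tunnel interface, for example with `ip link set dev <interface> mtu 1450`.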

5. TCP Timeout and Retransmission

# Check per-connection TCP info (rtt, rto, cwnd)
ss -ti | head -20

# Monitor retransmissions
netstat -s | grep -i retransmit

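# TCP tuning (timestamps and SACK are enabled by default on modern Linux)
sudo sysctl net.ipv4.tcp_timestamps=1
sudo sysctl net.ipv4.tcp_sack=1
# net.ipv4.tcp_fack is a no-op since Linux 4.15 (FACK was replaced by RACK), so it is omitted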

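Initial RTO is 1 second (RFC 6298). Each retransmission timeout doubles the RTO (exponential backoff), up to a cap of 120 s on Linux. Once RTT samples exist, the RTO is derived from the smoothed RTT and its variance instead.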
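
The backoff schedule for a connection that never gets an ACK can be sketched like this. The helper `rto_schedule` is hypothetical; the 1 s start and 120 s cap are the figures from the text, and the RTT-based RTO calculation is deliberately left out.

```shell
# Sketch: successive retransmission timeouts under exponential backoff.
rto_schedule() {
    retries=$1
    rto=${2:-1}      # initial RTO in seconds
    max=${3:-120}    # cap in seconds
    i=0
    out=""
    while [ "$i" -lt "$retries" ]; do
        out="$out$rto "
        rto=$(( rto * 2 ))                          # double after each timeout
        if [ "$rto" -gt "$max" ]; then rto=$max; fi # clamp to the cap
        i=$(( i + 1 ))
    done
    echo "${out% }"
}

rto_schedule 9    # prints 1 2 4 8 16 32 64 120 120
```

The cumulative wait grows quickly: with this schedule, the first seven timeouts alone add up to more than two minutes.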

6. Practical tcpdump Analysis

# Connection lifecycle
sudo tcpdump -i eth0 -nn 'tcp port 443'

# Only SYNs (new connection attempts)
sudo tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'

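# Retransmissions: BPF capture filters cannot detect them (tcp[13] & 8 only matches the PSH flag),
# so use tshark's TCP sequence analysis instead
sudo tshark -i eth0 -Y 'tcp.analysis.retransmission'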

Metric	What It Tells You	Healthy Range
Average RTT	Network latency	Depends on geography
Retransmission rate	Packet loss / congestion	< 1%
Out-of-order packets	Network reordering	< 0.1%
Zero window events	Slow consumer	Should be 0
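
One way to get a host-wide retransmission rate on Linux is to divide the `RetransSegs` counter by `OutSegs` from `/proc/net/snmp`. The sketch below is an assumption-laden helper, not a standard tool: `retrans_rate` is a hypothetical name, it reads the file format from stdin so it can be tried on sample data, and the counters in the sample are made up. The real counters are cumulative since boot, so compare two readings to get a current rate.

```shell
# Sketch: retransmission rate (percent) from /proc/net/snmp-style input.
# The first "Tcp:" line holds column names, the second holds the values.
retrans_rate() {
    awk '
        $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        $1 == "Tcp:" {
            sent = $(col["OutSegs"]); retrans = $(col["RetransSegs"])
            if (sent > 0) printf "%.2f\n", 100 * retrans / sent
        }
    '
}

# Made-up sample counters for illustration: 5 retransmits out of 1000 segments
retrans_rate <<'EOF'
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 10 5 0 0 3 2000 1000 5 0 0 0
EOF
# prints 0.50

# On a Linux host: retrans_rate < /proc/net/snmp
```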

7. TCP Tuning for High Performance

# Increase buffer sizes (datacenter; 134217728 bytes = 128 MiB)
sudo sysctl net.core.rmem_max=134217728
sudo sysctl net.core.wmem_max=134217728
sudo sysctl net.ipv4.tcp_rmem='4096 87380 134217728'   # min default max
sudo sysctl net.ipv4.tcp_wmem='4096 65536 134217728'   # min default max

# Enable fast open (3 = client and server)
sudo sysctl net.ipv4.tcp_fastopen=3

# Reduce keepalive idle time before the first probe (default: 7200 s)
sudo sysctl net.ipv4.tcp_keepalive_time=300
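
A reasonable way to choose the maximum buffer size is the bandwidth-delay product (BDP): the number of bytes that must be in flight to keep the link full. A minimal sketch, with a hypothetical helper `bdp_bytes` and example link figures that are assumptions:

```shell
# Sketch: bandwidth-delay product in bytes.
bdp_bytes() {
    bandwidth_mbps=$1   # link bandwidth in Mbit/s
    rtt_ms=$2           # round-trip time in milliseconds
    echo $(( bandwidth_mbps * 1000000 / 8 * rtt_ms / 1000 ))
}

bdp_bytes 10000 100   # 10 Gbit/s at 100 ms: prints 125000000
bdp_bytes 1000 10     # 1 Gbit/s at 10 ms:   prints 1250000
```

By this estimate, the 134217728-byte maximum above is just enough to fill a 10 Gbit/s path with a 100 ms RTT, while a 1 Gbit/s path inside one region needs only about 1.25 MB.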

Summary

When debugging performance issues:

Check RTT (network latency is #1 factor)
Check retransmission rate (packet loss kills throughput)
Check window scaling and buffer sizes
Check MTU restrictions (VPN overlays)
Check congestion control algorithm (Cubic vs BBR)

Understanding these fundamentals turns "the network is slow" into a specific, actionable diagnosis.