"Bufferbloat is an epidemic of bad latency caused by excessive buffering in network equipment. It is not a fundamental problem with the Internet's design. It is a consequence of cheap RAM and the mistaken belief that large buffers are always good." -- Jim Gettys, Bufferbloat: Dark Buffers in the Internet (CACM, 2011)
Lecture (100 min, two 50-min blocks)
11.1 The Latency Problem: Bufferbloat
Why buffers exist: routers and switches buffer packets when the output link is momentarily overloaded. Without buffers, any burst of traffic exceeding link capacity results in packet drops and TCP retransmissions. Buffers absorb bursts and reduce drop rates.
Why large buffers cause problems: TCP's congestion control depends on packet loss (or ECN marks) as a signal that the network is congested. It increases sending rate until it detects loss, then backs off. If the buffer is large enough to absorb the entire congestion buildup before dropping, packets are queued for hundreds of milliseconds -- or more. TCP does not back off until it sees loss; loss happens only when the buffer overflows; the buffer takes a long time to overflow because it is large. Meanwhile, interactive traffic (VoIP, gaming, SSH keystrokes) is stuck behind the queue.
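To put numbers on "hundreds of milliseconds", the worst-case delay added by a full FIFO buffer is simply its size divided by the link rate. The sketch below is a minimal illustration; the function name `queue_delay_ms` and the 256 KiB buffer size are assumptions chosen for the example, not values from any particular device.

```shell
#!/bin/sh
# Worst-case queueing delay of a full FIFO buffer, in milliseconds.
# Arguments: buffer size in bytes, link rate in Mbit/s (both illustrative).
queue_delay_ms() {
    awk -v bytes="$1" -v mbit="$2" \
        'BEGIN { printf "%.1f\n", bytes * 8 / (mbit * 1e6) * 1000 }'
}

queue_delay_ms 262144 1     # 256 KiB buffer draining at 1 Mbit/s   -> 2097.2
queue_delay_ms 262144 100   # same buffer draining at 100 Mbit/s    -> 21.0
```

The same buffer that is harmless on a fast link adds about two seconds of delay on a slow uplink, which is why bufferbloat shows up most clearly at the slowest hop.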
Measuring bufferbloat:
# Flent: run RRUL (Real-time Response Under Load) test
# Simultaneously downloads and uploads at full speed while measuring latency
flent rrul -p all_scaled -t "Bufferbloat test" -H TARGET_HOST -o /tmp/rrul.png
# Simple manual test: start a large download while pinging the gateway
ping -i 0.1 GATEWAY_IP &
PING_PID=$!
wget -q http://TARGET/largefile -O /dev/null
kill "$PING_PID"   # stop the background ping once the download finishes
# If ping RTT jumps from 2ms to 200ms during the download, bufferbloat is present
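Comparing idle and loaded latency by eye is error-prone, so it can help to reduce each ping run to a single number. The helper below is a sketch that assumes the usual Linux `ping` output format with a `time=` field; the function name `mean_rtt_ms` and the file name in the comment are hypothetical.

```shell
#!/bin/sh
# Print the mean of the "time=" values found in ping output read from stdin.
mean_rtt_ms() {
    grep -o 'time=[0-9.]*' | cut -d= -f2 |
        awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Illustrative input; in practice redirect each ping run to a file and use
#   mean_rtt_ms < loaded.txt      (hypothetical file name)
mean_rtt_ms <<'EOF'
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=2.0 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=4.0 ms
EOF
```

Run it once on an idle capture and once on a capture taken during the download; the difference between the two means is the latency added by the queue.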
DSLReports Bufferbloat test: visit dslreports.com/speedtest (accessible via browser). The test simultaneously measures download speed, upload speed, and latency-under-load. A grade below B indicates problematic bufferbloat.
11.2 CoDel and FQ-CoDel: Fixing Bufferbloat
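The fix for bufferbloat is Active Queue Management (AQM): the router drops or marks packets before the buffer is full, so senders receive a congestion signal early. Older AQMs such as RED act on queue depth; CoDel and its successors act on sojourn time (how long a packet has waited in the queue).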
CoDel (Controlled Delay, RFC 8289): tracks the minimum sojourn time of packets leaving the queue. If that minimum stays above a target (5ms by default) for longer than an interval (100ms by default), CoDel starts dropping packets, spacing successive drops at interval/sqrt(count) so the drop rate rises until the delay falls. TCP detects the drops and backs off. Once the sojourn time falls back below the target, CoDel stops dropping.
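The target/interval logic can be sketched as a small state machine. What follows is a simplified illustration, not the full pseudocode from RFC 8289: it reads one line per dequeued packet (`time_ms sojourn_ms`), uses the default 5 ms target and 100 ms interval, spaces successive drops at interval/sqrt(count), and prints the times at which it would drop. The function name `codel_drops` and the input format are assumptions made for this example, and the input is fixed, so drops do not feed back into the sojourn times as they would on a real link.

```shell
#!/bin/sh
# Simplified CoDel drop decision. stdin: "<now_ms> <sojourn_ms>" per packet.
codel_drops() {
    awk -v target=5 -v interval=100 '
    BEGIN { first_above = -1; dropping = 0; count = 0; drop_next = 0 }
    {
        now = $1; sojourn = $2; ok = 0
        if (sojourn < target) {
            first_above = -1                 # delay is fine: reset the timer
        } else if (first_above < 0) {
            first_above = now + interval     # start timing the bad period
        } else if (now >= first_above) {
            ok = 1                           # above target for a full interval
        }
        if (dropping && !ok) {
            dropping = 0                     # queue recovered: leave drop state
        } else if (dropping && now >= drop_next) {
            count++
            drop_next += interval / sqrt(count)   # drops get closer together
            print now
        } else if (!dropping && ok) {
            dropping = 1; count = 1
            drop_next = now + interval / sqrt(count)
            print now
        }
    }'
}

# A packet every 10 ms, each having waited 20 ms in the queue:
seq 0 10 400 | awk '{ print $1, 20 }' | codel_drops
```

With this input the first drop comes only after the delay has persisted for a full interval, and later drops arrive at shrinking gaps, which is the behavior that lets CoDel tolerate short bursts while still pushing back on a standing queue.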
FQ-CoDel (Flow Queue CoDel, RFC 8290): extends CoDel by adding per-flow fair queuing. Traffic is hashed into separate queues by 5-tuple (src IP, dst IP, src port, dst port, protocol). Each flow gets its own CoDel-managed queue. No single bulk flow can fill the shared buffer and starve interactive flows.
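The flow-to-queue mapping can be illustrated with any stable hash. The sketch below uses `cksum` as a stand-in; the Linux implementation uses its own kernel hash with a random perturbation, so the queue numbers here will not match a real system. The function name `flow_queue` and the addresses are illustrative, and 1024 is the default number of flow queues in `fq_codel`.

```shell
#!/bin/sh
# Map a 5-tuple to one of 1024 queues. cksum (a CRC) is only a stand-in hash.
# Arguments: src IP, dst IP, src port, dst port, protocol.
flow_queue() {
    printf '%s' "$1,$2,$3,$4,$5" | cksum | awk '{ print $1 % 1024 }'
}

flow_queue 10.0.0.2 192.0.2.34 51000 443 tcp   # bulk download
flow_queue 10.0.0.2 192.0.2.34 51001 443 tcp   # second connection, same hosts
flow_queue 10.0.0.2 192.0.2.53 40000 53 udp    # DNS lookup
```

Every packet of a given flow lands in the same queue because the hash input never changes, while distinct flows usually land in different queues; collisions are possible, since the number of queues is finite.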
# Apply FQ-CoDel to an outbound interface (Linux tc)
sudo tc qdisc add dev eth0 root fq_codel
# View current qdisc
tc qdisc show dev eth0
# See FQ-CoDel statistics
tc -s qdisc show dev eth0
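# Alternative: CAKE (FQ-CoDel successor with a built-in shaper)
# "replace" swaps out the existing root qdisc; "add" would fail because fq_codel is already attached
sudo tc qdisc replace dev eth0 root cake bandwidth 100mbit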
CAKE (Common Applications Kept Enhanced): the practical successor to FQ-CoDel in consumer routers and modern Linux systems. CAKE adds a built-in bandwidth shaper, DiffServ-based traffic classification (separating bulk from interactive), and optional ACK filtering. It has been in the mainline kernel since Linux 4.19.
11.3 TCP Congestion Control Variants
TCP's congestion control algorithm determines how fast a sender transmits data and how it responds to congestion signals.
CUBIC (RFC 8312, obsoleted by RFC 9438 in 2023): the default TCP congestion control in Linux (since kernel 2.6.19) and the dominant algorithm on the Internet. CUBIC sets the congestion window as a cubic function of the time since the last congestion event. It grows aggressively in high-bandwidth-delay-product environments (fast, long-distance links) but is still loss-based: it needs packet loss (or ECN marks) to detect congestion.
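The cubic growth curve is easier to picture with numbers. RFC 8312 defines the window as W(t) = C(t - K)^3 + W_max, with K = cbrt(W_max(1 - beta)/C), C = 0.4 and beta = 0.7. The sketch below evaluates only that formula; it leaves out slow start, the TCP-friendly region, and everything else a real implementation does. The function name `cubic_window` and the W_max of 100 segments are illustrative.

```shell
#!/bin/sh
# Evaluate the CUBIC window function from RFC 8312.
# Arguments: W_max in segments, t in seconds since the last loss event.
cubic_window() {
    awk -v wmax="$1" -v t="$2" 'BEGIN {
        C = 0.4; beta = 0.7
        K = exp(log(wmax * (1 - beta) / C) / 3)   # cube root via exp/log
        d = t - K
        printf "%.1f\n", C * d * d * d + wmax
    }'
}

for t in 0 2 4 6 8; do
    echo "t=${t}s cwnd=$(cubic_window 100 "$t")"
done
```

At t = 0 the window is beta times W_max (70 segments here); it then flattens as it approaches the old maximum around t = K and accelerates again beyond it, probing for new capacity.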
BBR (Bottleneck Bandwidth and RTT, Google 2016 / BBRv3 2023): a fundamentally different approach. BBR does not react to packet loss as a primary congestion signal; it models the network's bottleneck bandwidth and RTT, then sends at a rate that keeps the bottleneck busy without filling the buffer.
BBR advantages:
- Higher throughput on links with non-congestion loss (wireless, satellite)
- Much lower buffering (does not fill queues before detecting congestion)
- Better behavior in deep-buffer environments where loss-based control experiences bufferbloat
BBR limitations:
- Can be unfair to loss-based flows in shared networks (BBRv1 does not back off on loss the way CUBIC does)
- BBRv1 had fairness problems; BBRv3 (2023) improves this substantially by also responding to loss and ECN signals
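BBR's model revolves around the bandwidth-delay product (BDP): the estimated bottleneck bandwidth multiplied by the minimum RTT gives the amount of data that fills the pipe without building a queue. A minimal calculation, with the function name `bdp_bytes` and the example figures chosen purely for illustration:

```shell
#!/bin/sh
# Bandwidth-delay product in bytes.
# Arguments: bottleneck bandwidth in Mbit/s, round-trip time in ms.
bdp_bytes() {
    awk -v mbit="$1" -v rtt="$2" \
        'BEGIN { printf "%d\n", mbit * 1e6 * rtt / 1000 / 8 }'
}

bdp_bytes 100 40   # 100 Mbit/s, 40 ms RTT -> 500000 bytes
bdp_bytes 10 20    # 10 Mbit/s, 20 ms RTT  -> 25000 bytes
```

Data in flight beyond roughly one BDP cannot increase throughput; it can only sit in a buffer, which is the excess a loss-based sender keeps adding until the buffer overflows.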
# Enable BBR on Linux (requires kernel 4.9+)
sudo modprobe tcp_bbr
echo "tcp_bbr" | sudo tee -a /etc/modules-load.d/modules.conf
echo "net.core.default_qdisc=fq" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control=bbr" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Verify
sysctl net.ipv4.tcp_congestion_control
RENO: the classic loss-based TCP congestion control algorithm (it added fast recovery to the original Tahoe design); simple, well-understood; slow-start + additive increase / multiplicative decrease (AIMD). Still used in some embedded systems. Performs poorly on high-BDP links.
QUIC congestion control: QUIC uses pluggable congestion control; RFC 9002 describes a NewReno-style controller, but most implementations default to CUBIC or BBR. Because QUIC is usually implemented in user space, congestion control can be updated by shipping a new library or application build rather than a kernel patch -- a significant operational advantage.
11.4 QUIC as a Transport Protocol
Week 6 introduced QUIC in the context of TLS. This week examines it as a transport-layer protocol competing with TCP.
QUIC design goals:
- Reduced handshake latency: 1 RTT for new connections; 0-RTT for resumed connections; combines transport + TLS handshake
- Head-of-line blocking elimination: independent byte streams within a QUIC connection; a lost packet in stream N does not block stream M (contrast: in HTTP/2 over TCP, all streams are blocked by a single lost TCP packet)
- Connection migration: connections are identified by a Connection ID (variable length, up to 20 bytes in QUIC v1), not by (src_IP, src_port); handover between WiFi and cellular without reconnection
- Encrypted transport headers: sequence numbers, acknowledgment information, and most header fields are encrypted; middleboxes (firewalls, NAT boxes) cannot inspect them
- Unreliable datagrams: QUIC DATAGRAM extension (RFC 9221) allows unreliable, unordered delivery within a QUIC session; enables games and real-time media over QUIC
QUIC vs TCP performance:
| Scenario | TCP (with TLS 1.3) | QUIC |
|---|---|---|
| New connection to new server | 1 RTT (TCP) + 1 RTT (TLS 1.3) = 2 RTT before data | 1 RTT total |
| Resumed connection | 1 RTT (TCP) + 0 RTT (TLS session ticket) = 1 RTT | 0 RTT (first packet can contain data) |
| Multiple parallel streams, 1 packet loss | All streams stall until loss recovered | Only affected stream stalls |
| Mobile handover | Reconnect required (new (IP, port) = new TCP conn) | Connection migrates transparently |
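The round-trip counts in the table translate directly into setup delay once a path RTT is fixed. The arithmetic below is a sketch using a hypothetical 80 ms RTT; the function name `setup_delay_ms` is made up for the example, and the RTT counts are the ones from the table.

```shell
#!/bin/sh
# Delay before the first request byte can be sent.
# Arguments: path RTT in ms, number of handshake round trips.
setup_delay_ms() {
    echo $(( $1 * $2 ))
}

rtt=80   # hypothetical mobile-network RTT in ms
echo "TCP + TLS 1.3, new connection: $(setup_delay_ms "$rtt" 2) ms"
echo "QUIC, new connection:          $(setup_delay_ms "$rtt" 1) ms"
echo "TCP + TLS 1.3, resumed:        $(setup_delay_ms "$rtt" 1) ms"
echo "QUIC, resumed (0-RTT):         $(setup_delay_ms "$rtt" 0) ms"
```

The saving is one RTT per connection in each case, so it matters most on high-latency paths and for workloads that open many short connections.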
11.5 Measuring Network Performance
# iperf3: bandwidth measurement
# Server
iperf3 -s
# Client: TCP throughput test
iperf3 -c SERVER_IP -t 30
# Client: UDP loss test
iperf3 -c SERVER_IP -u -b 100M -t 30
# Client: multiple parallel streams (simulates more realistic traffic)
iperf3 -c SERVER_IP -P 10 -t 30
# Netperf: latency-focused measurement
netperf -H SERVER_IP -t TCP_RR # request/response: reports transactions/s; mean RTT is about 1/rate
netperf -H SERVER_IP -t TCP_STREAM # bulk throughput
# Flent: comprehensive test suite
flent rrul -H SERVER_IP -p all_scaled -o result.png # RRUL latency-under-load test
flent tcp_download -H SERVER_IP -p totals # download throughput
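Because a netperf TCP_RR run keeps one small request/response transaction outstanding at a time, its transaction rate converts to a mean round-trip time by taking the reciprocal. A one-line sketch, with a hypothetical function name and input value:

```shell
#!/bin/sh
# Convert a TCP_RR transaction rate (transactions per second) to mean RTT in microseconds.
rr_to_rtt_us() {
    awk -v rate="$1" 'BEGIN { printf "%.0f\n", 1e6 / rate }'
}

rr_to_rtt_us 2500   # 2500 transactions/s -> 400 microseconds per round trip
```

This gives a mean only; for the latency distribution under load, the Flent RRUL plots are the better tool.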
ss: per-socket TCP statistics:
# Detailed TCP socket state with congestion control info
ss -tin # -t=TCP, -i=internal info, -n=numeric
# Output includes:
# cwnd (congestion window), ssthresh (slow-start threshold)
# rtt/rttvar (measured RTT), pacing_rate, retrans
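When comparing runs it is convenient to pull individual fields out of the `ss -tin` output rather than read them by eye. The helper below is a sketch that assumes the `name:value` token format `ss` prints for these fields; the function name `ss_field` and the sample line are illustrative, not captured output.

```shell
#!/bin/sh
# Print the value of one "name:value" field from ss -tin output on stdin.
# Argument: field name, e.g. cwnd or rtt.
ss_field() {
    tr ' \t' '\n\n' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Illustrative line in the style of ss -tin output:
sample='	 cubic wscale:7,7 rto:204 rtt:1.25/0.5 mss:1448 cwnd:10 ssthresh:7'
printf '%s\n' "$sample" | ss_field cwnd   # 10
printf '%s\n' "$sample" | ss_field rtt    # 1.25/0.5  (rtt/rttvar in ms)
```

On a live system the same function can be fed directly, for example `ss -tin dst SERVER_IP | ss_field cwnd`, which prints one value per matching socket.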
Lab Preview
Lab 10 measures and fixes bufferbloat:
- Set up a network with an artificially constrained outbound link (using tc tbf to limit to 10 Mbps)
- Run Flent RRUL test; document baseline latency-under-load (expected: 200-500ms for large buffers)
- Apply FQ-CoDel to the constraining interface; re-run Flent; document improved latency (expected: under 20ms)
- Switch TCP congestion control from CUBIC to BBR; re-run; compare throughput and latency under load
- Generate plots comparing all three configurations; write a 1-page interpretation
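# Lab setup: artificial 10 Mbps bottleneck with a large (400 ms) FIFO buffer, using tbf
sudo tc qdisc add dev eth1 root handle 1: tbf rate 10mbit burst 32kbit latency 400ms
# Add FQ-CoDel fix: attach it as the child of tbf so the 10 Mbps bottleneck stays in place
# (deleting the root qdisc would remove the rate limit along with the deep buffer)
sudo tc qdisc add dev eth1 parent 1:1 handle 10: fq_codel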
# Measure with iperf3 + parallel ping
iperf3 -c TARGET -t 60 &
ping -i 0.1 -c 600 GATEWAY | tee ping_during_iperf.txt
Homework
Reading (45 min): Kurose-Ross 9e Ch 3.7 (TCP Congestion Control). Focus on the AIMD algorithm, slow start, and the congestion avoidance state machine. Then read the Jim Gettys article "Bufferbloat: Dark Buffers in the Internet" (freely available; ~15 minutes; the clearest accessible explanation of the problem).
Hands-on (60 min): Using iperf3 and a remote server (or two VMs), measure TCP throughput with CUBIC vs. BBR:
# Start iperf3 server on VM2
iperf3 -s -D
# From VM1: test CUBIC
sudo sysctl net.ipv4.tcp_congestion_control=cubic
iperf3 -c VM2_IP -t 30 -J | jq '.end.sum_received.bits_per_second'
# Switch to BBR
sudo sysctl net.ipv4.tcp_congestion_control=bbr
iperf3 -c VM2_IP -t 30 -J | jq '.end.sum_received.bits_per_second'
Document: did BBR improve throughput on your test path? Check ss -tin during the iperf3 run; compare cwnd values for CUBIC vs. BBR.
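To report the comparison as a single figure, the two `bits_per_second` values printed by `jq` can be turned into a percentage change. The sketch below uses made-up throughput numbers and a hypothetical function name; substitute the values from your own runs.

```shell
#!/bin/sh
# Percentage change from a baseline throughput to a new throughput.
# Arguments: baseline (e.g. CUBIC) and new (e.g. BBR) values in bits per second.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f\n", (b - a) / a * 100 }'
}

pct_change 940000000 987000000   # hypothetical CUBIC vs. BBR results -> +5.0
```

On a clean, low-latency path between two VMs the difference is often small; repeat each run a few times before treating a difference of a few percent as meaningful.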
Toolchain Diary Entry
First introduced this week: Flent; tc qdisc; iperf3 advanced usage
flent TESTNAME -H HOST -p PLOT -t TITLE -o OUTPUTFILE.png: run a Flent test and save the selected plot (for example -p all_scaled). Standard tests: rrul, tcp_download, tcp_upload, tcp_bidirectional.
tc qdisc show dev IFACE: show the active queuing discipline on an interface.
tc qdisc add dev IFACE root fq_codel: apply FQ-CoDel AQM to an interface.
tc qdisc del dev IFACE root: remove the root qdisc (restore kernel default).
tc qdisc add dev IFACE root tbf rate RATE burst SIZE latency DELAY: add Token Bucket Filter to artificially limit bandwidth (useful for lab bottleneck simulation).
iperf3 -c HOST -P N -t 30: parallel stream iperf3 test; -P N runs N simultaneous TCP connections.
iperf3 -c HOST -u -b RATE -t 30: UDP bandwidth test; measures loss and jitter.
ss -tin: TCP socket info including congestion window (cwnd), RTT estimate, and retransmission count.
sysctl net.ipv4.tcp_congestion_control=bbr: switch the kernel's default TCP congestion control (effective immediately for new connections).
Key Terms
- Bufferbloat: pathological latency caused by excessively large network buffers that allow TCP to fill queues before detecting congestion; identified and named by Jim Gettys (2010); common in consumer routers and DSL/cable modems
- AQM: Active Queue Management; dropping or marking packets before the buffer overflows so that senders slow down early; CoDel and FQ-CoDel, the AQM algorithms covered this week, act on sojourn time (queue delay) rather than queue depth
- CoDel (RFC 8289): Controlled Delay AQM; drops packets when minimum sojourn time exceeds target (5ms) for longer than an interval (100ms); drains queues to near-zero latency
- FQ-CoDel (RFC 8290): Flow Queue CoDel; combines per-flow fair queuing with CoDel AQM; isolates bulk flows from interactive traffic; standard in OpenWRT and Linux
- CAKE: Common Applications Kept Enhanced; AQM + traffic shaping + flow isolation; practical successor to FQ-CoDel; standard in modern OpenWRT; in mainline Linux since 4.19
- CUBIC (RFC 8312): default TCP congestion control in Linux; cubic window growth function; loss-based; optimized for high-bandwidth-delay-product links
- BBR: Bottleneck Bandwidth and RTT (Google); model-based TCP congestion control; measures bottleneck bandwidth and RTT to set sending rate; does not react to loss as primary signal; higher throughput on lossy links; fairer in BBRv3
- Head-of-line blocking: condition where one lost or delayed packet holds up all data queued behind it; in TCP a single lost segment stalls every HTTP/2 stream sharing the connection; QUIC's independent stream multiplexing confines the stall to the affected stream
- BDP: Bandwidth-Delay Product; the number of bytes "in flight" in a full-speed pipe; BDP = bandwidth × RTT; congestion window must be at least BDP to saturate a high-speed link