"In a Clos network, any input port can reach any output port through two intermediate stages -- and the bisection bandwidth is full. This is the property that drove every major hyperscaler to the spine-leaf architecture in the 2010s." -- Dinesh G. Dutt, Cloud Native Data Center Networking, O'Reilly, 2019
Lecture (100 min)
2.1 Why Datacenter Topology Is a Security and Architecture Problem
A network operator who understands only the routing layer misses half the datacenter picture. Physical and logical topology shapes which paths traffic can take, how lateral movement propagates through an enterprise network, and how a monitoring system achieves coverage. The shift from three-tier to spine-leaf Clos topology in the 2010s was not just an efficiency improvement; it was an architectural decision with direct implications for traffic visibility, segmentation enforcement, and east-west threat propagation.
NET-201's switching module covered VLANs, STP, and basic layer-2 segmentation at enterprise scale. Week 2 scales this to datacenter fabric design and the protocol stack -- VXLAN + EVPN -- that hyperscalers and modern enterprise datacenters run.
2.2 The Three-Tier Legacy and Its Limits
The traditional datacenter topology was a three-tier hierarchy:
Internet
|
[Core layer] -- 1-2 routers; high-speed WAN uplinks; BGP peering
|
[Distribution layer] -- per-building or per-floor aggregation; HSRP/VRRP gateways
|
[Access layer] -- top-of-rack switches; servers connect here
The bisection bandwidth bottleneck: in a three-tier design, east-west traffic (server-to-server, which dominates modern datacenter traffic) between servers on different access switches must traverse the distribution layer and, when the servers sit under different distribution blocks, the core as well. As server counts grew and server-to-server traffic patterns (distributed databases, MapReduce, microservices) became dominant, the oversubscription ratio at the distribution and core layers became the binding constraint.
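The oversubscription arithmetic can be sketched in a few lines. The port counts and speeds below are hypothetical, chosen only to make the numbers easy to follow; the point is that the per-tier ratios compound along the path.

```python
def oversubscription_ratio(down_ports: int, down_gbps: float,
                           up_ports: int, up_gbps: float) -> float:
    """Downstream (server-facing) capacity divided by upstream capacity."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Hypothetical access switch: 48 x 10G server ports, 2 x 40G uplinks.
access = oversubscription_ratio(48, 10, 2, 40)

# Hypothetical distribution switch: 8 x 40G downlinks, 2 x 40G uplinks to core.
distribution = oversubscription_ratio(8, 40, 2, 40)

# Worst case for traffic that must cross both tiers: the ratios multiply.
end_to_end = access * distribution

print(f"access {access}:1, distribution {distribution}:1, "
      f"end-to-end {end_to_end}:1")
```

With these assumed numbers a server that nominally has a 10G link is competing for a small fraction of that once its traffic has to climb through two oversubscribed tiers.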
The spanning tree problem: VLAN-based segmentation across three tiers requires Spanning Tree Protocol (STP) to prevent loops. STP blocks redundant links, so a switch with two uplinks forwards on only one of them and half the installed uplink bandwidth sits idle. Rapid Spanning Tree Protocol (RSTP) shortens failover time but does not recover the bandwidth lost to blocked links.
2.3 Clos Topology and the Spine-Leaf Architecture
The Clos network was described by Charles Clos at Bell Labs in 1953 for telephone switching. The key property: a multistage network built from many small, identical switching elements provides non-blocking connectivity between any input and any output port, without requiring a single large crossbar. Whether the network is strictly non-blocking or only rearrangeably non-blocking depends on how many middle-stage switches are provisioned.
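The two-tier spine-leaf fabric is the datacenter instantiation. It is a three-stage Clos folded back on itself, so each leaf acts as both the input and the output stage and the spines form the middle stage: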
[Spine 1]   [Spine 2]   [Spine 3]   [Spine 4]
    |  \     /  |  \     /  |  \     /  |       Full mesh: every leaf
    |     X     |     X     |     X     |       connects to every spine
    |  /     \  |  /     \  |  /     \  |       (not every link is drawn)
[Leaf 1]    [Leaf 2]    [Leaf 3]    [Leaf 4]
  ||||        ||||        ||||        ||||
[Servers]   [Servers]   [Servers]   [Servers]
Bisection bandwidth: in a spine-leaf topology with N spines and M leaves, every leaf has N uplinks, one to each spine. A server on Leaf 1 sending to a server on Leaf 4 can take any of the N available paths. With uniform link speeds, bisection bandwidth scales linearly with the number of spines. This is the structural reason spine-leaf won: as long as the leaves have free uplink ports, adding spines adds bandwidth without redesigning the topology.
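The linear scaling can be checked with a short calculation. This is a sketch under simplifying assumptions: an even number of leaves, one uplink per leaf per spine, and the same speed on every fabric link. The function name and the example figures are illustrative.

```python
def bisection_gbps(spines: int, leaves: int, link_gbps: float) -> float:
    """Capacity between two equal halves of the leaves.

    Each leaf in one half can push spines * link_gbps toward the spines,
    and the spines have the same capacity down to the other half.
    Assumes an even leaf count and uniform link speed.
    """
    if leaves % 2 != 0:
        raise ValueError("sketch assumes an even number of leaves")
    return (leaves // 2) * spines * link_gbps

# Hypothetical fabric: 4 leaves, 100G fabric links.
four_spines = bisection_gbps(spines=4, leaves=4, link_gbps=100)
eight_spines = bisection_gbps(spines=8, leaves=4, link_gbps=100)

print(four_spines, eight_spines)  # doubling the spines doubles the result
```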
ECMP (Equal-Cost Multi-Path): the spine-leaf fabric uses ECMP at every layer. A packet from Leaf 1 to Leaf 4 has N equal-cost paths (one via each spine). ECMP hashes the flow (5-tuple: src IP, dst IP, src port, dst port, protocol) to one of N paths, distributing traffic across all spines.
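A minimal sketch of per-flow ECMP selection follows. Real switches use vendor-specific hardware hash functions and seeds; CRC32 here is only a stand-in, and the addresses and ports are made up. What the sketch preserves is the property that matters: every packet of a flow hashes to the same spine, while different flows spread across spines.

```python
import zlib

def ecmp_pick(flow: tuple, n_paths: int) -> int:
    """Map a 5-tuple to one of n_paths next hops (stand-in hash: CRC32)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_paths

N_SPINES = 4

# (src IP, dst IP, src port, dst port, protocol) -- hypothetical values.
flow = ("10.1.1.10", "10.4.1.20", 49152, 443, "tcp")
first = ecmp_pick(flow, N_SPINES)
again = ecmp_pick(flow, N_SPINES)   # same flow, same spine: no reordering

# Many flows that differ only in source port spread over the spines.
used = {ecmp_pick(("10.1.1.10", "10.4.1.20", sport, 443, "tcp"), N_SPINES)
        for sport in range(49152, 50152)}

print(first, again, sorted(used))
```

Because the choice is per flow rather than per packet, a single large flow cannot use more than one path's worth of bandwidth.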
2.4 VXLAN: Virtual Extensible LAN
Spine-leaf topologies run IP routing at every hop (including within the fabric). This creates a problem for workloads that require Layer 2 adjacency: virtual machines in a cloud environment must be able to communicate as if on the same VLAN, even when they are on different physical leaf switches.
VXLAN (RFC 7348) solves this by encapsulating Layer 2 Ethernet frames inside UDP packets, creating a Layer 2 overlay over a Layer 3 underlay.
VXLAN frame format:
Outer Ethernet Header
Outer IP Header (src=VTEP, dst=VTEP)
UDP Header (dst port 4789)
VXLAN Header (8 bytes):
| Flags (8b) | Reserved (24b) | VNI (24b) | Reserved (8b) |
Inner Ethernet Header (original Layer 2 frame)
Inner IP Header (original)
Original Payload
VNI (VXLAN Network Identifier): 24-bit field; analogous to a VLAN ID but with about 16.7 million (2^24) possible segment IDs versus 4,096 (2^12) VLAN IDs. Each tenant or workload group gets its own VNI.
VTEP (VXLAN Tunnel Endpoint): a virtual or physical switch that performs VXLAN encapsulation/decapsulation. Each leaf switch in the spine-leaf fabric is a VTEP.
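The 8-byte VXLAN header laid out above can be built and parsed with the standard `struct` module. This is a sketch of the header only, not of a full VTEP; the function names are illustrative. The flag value 0x08 is the "VNI valid" (I) bit from RFC 7348.

```python
import struct

VXLAN_UDP_PORT = 4789
VXLAN_FLAG_VNI_VALID = 0x08  # the I flag in the 8-bit flags field

# Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8), no outer VLAN tag.
VXLAN_OVERHEAD_IPV4 = 14 + 20 + 8 + 8

def pack_vxlan_header(vni: int) -> bytes:
    """Flags (8b) | Reserved (24b) | VNI (24b) | Reserved (8b)."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def unpack_vni(header: bytes) -> int:
    """Return the VNI, rejecting headers whose I flag is not set."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return word2 >> 8

header = pack_vxlan_header(1000)
print(header.hex(), unpack_vni(header), VXLAN_OVERHEAD_IPV4)
```

The 50 bytes of encapsulation overhead are why the underlay MTU is normally raised above the MTU presented to tenants.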
2.5 EVPN: The VXLAN Control Plane
Early VXLAN implementations used flood-and-learn: VTEPs flooded BUM (Broadcast, Unknown unicast, Multicast) traffic to all other VTEPs to discover MAC-to-VTEP mappings. This scales poorly -- a datacenter with 1,000 VTEPs generates massive BUM flooding.
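A back-of-the-envelope model shows why this scales poorly. It assumes ingress (head-end) replication, where the sending VTEP makes one unicast copy per remote VTEP, and assumes every VTEP participates in the VNI; the per-VTEP BUM rate is a made-up figure.

```python
def bum_copies_per_frame(vteps_in_vni: int) -> int:
    """Copies the ingress VTEP sends for one BUM frame (one per remote VTEP)."""
    return max(vteps_in_vni - 1, 0)

def fabric_bum_pps(vteps_in_vni: int, bum_pps_per_vtep: int) -> int:
    """Total replicated BUM packets per second across the fabric."""
    return vteps_in_vni * bum_pps_per_vtep * bum_copies_per_frame(vteps_in_vni)

copies = bum_copies_per_frame(1000)
# Hypothetical: each VTEP originates 10 BUM frames per second.
total = fabric_bum_pps(1000, 10)

print(copies, total)
```

The total grows with the square of the VTEP count, which is the scaling problem a control plane is meant to remove.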
EVPN (Ethernet VPN, RFC 7432; applied to VXLAN overlays by RFC 8365) provides a BGP-based control plane for VXLAN. Instead of flooding, VTEPs advertise their MAC and ARP information via BGP EVPN route types:
| EVPN Route Type | Name | Carries |
|---|---|---|
| Type 1 | Ethernet Auto-Discovery | Per-Ethernet-segment reachability for multi-homing: fast failover (mass withdrawal) and aliasing |
| Type 2 | MAC/IP Advertisement | MAC address + optionally IP address (enables ARP suppression) |
| Type 3 | Inclusive Multicast Ethernet Tag | Per-VNI VTEP membership for BUM traffic delivery |
| Type 4 | Ethernet Segment | Multi-homing: Ethernet segment discovery and designated forwarder election |
| Type 5 | IP Prefix | IP routes (inter-VNI / external prefix advertisement) |
ARP suppression: a VTEP that has received a Type 2 route for a MAC/IP pair answers local ARP requests on behalf of the remote host. In a well-deployed EVPN fabric this removes most ARP flooding; requests for hosts the fabric has not yet learned (silent hosts, for example) are still flooded.
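The lookup a VTEP performs for ARP suppression can be sketched as a small cache keyed by VNI and IP. The class and method names are hypothetical and do not mirror any particular implementation; in this sketch a target with no Type 2 route falls back to flooding.

```python
from typing import Optional, Tuple

class ArpSuppressionCache:
    """MAC/IP bindings learned from EVPN Type 2 routes, per VNI."""

    def __init__(self) -> None:
        # (vni, ip) -> (mac, remote VTEP address)
        self._entries: dict = {}

    def install_type2(self, vni: int, mac: str, ip: Optional[str], vtep: str) -> None:
        # A Type 2 route may carry only a MAC; then there is nothing to proxy.
        if ip is not None:
            self._entries[(vni, ip)] = (mac, vtep)

    def withdraw_type2(self, vni: int, ip: str) -> None:
        self._entries.pop((vni, ip), None)

    def handle_arp_request(self, vni: int, target_ip: str) -> Tuple[str, Optional[str]]:
        entry = self._entries.get((vni, target_ip))
        if entry is None:
            return ("flood", None)          # unknown host: flood as BUM
        return ("proxy-reply", entry[0])    # answer locally with the known MAC

cache = ArpSuppressionCache()
cache.install_type2(1000, "aa:bb:cc:00:00:01", "10.10.0.5", "192.0.2.14")

known = cache.handle_arp_request(1000, "10.10.0.5")
unknown = cache.handle_arp_request(1000, "10.10.0.99")
other_vni = cache.handle_arp_request(2000, "10.10.0.5")
print(known, unknown, other_vni)
```

Keying on the VNI as well as the IP keeps tenants with overlapping address space from answering for each other.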
Symmetric vs Asymmetric IRB (Integrated Routing and Bridging):
In asymmetric IRB, routing between VNIs happens at the ingress leaf. The packet arrives at the ingress leaf, is routed to the destination VNI, and then switched to the egress leaf. The egress leaf performs only switching. This is simple, but every leaf must be configured with every VNI it might route to, including each VNI's L3 gateway, whether or not it hosts workloads in that VNI.
In symmetric IRB (the modern default), both the ingress and egress leaf perform routing. A special transit VNI (L3VNI) carries routed traffic between leaves. Each leaf needs only the VNIs for its own locally attached workloads.
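The configuration footprint of the two models can be compared with a small set calculation. The VNI numbers are hypothetical, and the function captures only the rule stated above: asymmetric IRB needs every tenant VNI on every leaf, symmetric IRB needs the local VNIs plus the L3VNI.

```python
def vnis_to_configure(local_l2vnis: set, tenant_l2vnis: set,
                      l3vni: int, mode: str) -> set:
    """VNIs a leaf must carry for one tenant under each IRB model."""
    if mode == "asymmetric":
        # The ingress leaf routes straight into the destination VNI,
        # so it needs every VNI in the tenant.
        return set(tenant_l2vnis)
    if mode == "symmetric":
        # Routed traffic rides the transit L3VNI between leaves.
        return set(local_l2vnis) | {l3vni}
    raise ValueError(f"unknown IRB mode: {mode}")

TENANT_L2VNIS = {1000, 1001, 1002, 1003}   # hypothetical tenant segments
L3VNI = 5000                               # hypothetical transit VNI
leaf1_local = {1000}                       # Leaf 1 hosts only VNI 1000

asym = vnis_to_configure(leaf1_local, TENANT_L2VNIS, L3VNI, "asymmetric")
sym = vnis_to_configure(leaf1_local, TENANT_L2VNIS, L3VNI, "symmetric")
print(sorted(asym), sorted(sym))
```

The gap widens as the tenant grows: the asymmetric footprint tracks the tenant's total VNI count, the symmetric one tracks only what is attached locally.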
2.6 BGP in the Fabric: iBGP with Route Reflectors vs eBGP Unnumbered
EVPN requires BGP sessions between VTEPs. Two dominant models:
iBGP with Route Reflectors: leaves form iBGP sessions to spine switches acting as route reflectors. Simpler to reason about when the underlay and overlay run in the same AS. Requires careful RR placement to avoid scalability bottlenecks.
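eBGP Unnumbered (the hyperscaler model): AS numbers are assigned per switch, not per link. Each leaf gets its own private AS number; the spines either share one AS number (the layout recommended in RFC 7938 and by Dutt, which limits path hunting) or each get their own. eBGP sessions are established over the IPv6 link-local addresses of the point-to-point fabric links, with the neighbor discovered through IPv6 router advertisements, so no per-link IPv4 addressing is needed. RFC 5549 (since obsoleted by RFC 8950) lets those sessions advertise IPv4 routes with IPv6 next hops. Because every session is eBGP, the standard AS_PATH check provides loop prevention. This is the Arista / Cumulus / FRR-native model documented in Dutt's Cloud Native Data Center Networking.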
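An AS numbering plan for the eBGP model, and the AS_PATH check that gives loop prevention, can be sketched as follows. The base AS number and device names are hypothetical. Whether the spines share one AS number or each get their own is a design choice, so the sketch supports both through a flag.

```python
PRIVATE_ASN_2BYTE = range(64512, 65535)   # RFC 6996: 64512-65534 inclusive

def assign_asns(spines: list, leaves: list, base: int = 65000,
                shared_spine_asn: bool = True) -> dict:
    """Give every leaf a unique ASN; spines share one or get one each."""
    asns, next_asn = {}, base
    if shared_spine_asn:
        asns.update({s: next_asn for s in spines})
        next_asn += 1
    else:
        for s in spines:
            asns[s] = next_asn
            next_asn += 1
    for leaf in leaves:
        asns[leaf] = next_asn
        next_asn += 1
    if any(a not in PRIVATE_ASN_2BYTE for a in asns.values()):
        raise ValueError("plan exceeds the 2-byte private ASN range")
    return asns

def accepts(receiver_asn: int, as_path: list) -> bool:
    """eBGP loop prevention: drop routes whose AS_PATH contains our own ASN."""
    return receiver_asn not in as_path

plan = assign_asns(["spine1", "spine2"], ["leaf1", "leaf2", "leaf3", "leaf4"])

# A leaf1 prefix as seen after spine1 has re-advertised it.
path_via_spine1 = [plan["spine1"], plan["leaf1"]]

print(plan)
print(accepts(plan["leaf2"], path_via_spine1))   # another leaf accepts it
print(accepts(plan["leaf1"], path_via_spine1))   # the originator rejects it
```

With a shared spine AS number, a route that has already crossed one spine is also rejected by the other spine, which is how that layout limits path hunting.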
2.7 Architecture Comparison Sidebar: Clos vs Three-Tier vs Collapsed-Core
| Property | Three-Tier | Collapsed-Core | Spine-Leaf Clos |
|---|---|---|---|
| Bisection bandwidth | Oversubscribed at distribution | Better; fewer hops | Full; scales linearly with spines |
| Scale limit | ~5,000 servers before distribution choke | ~1,000-2,000 servers | 100,000+ servers (Facebook Fabric) |
| STP dependency | Yes (blocks 50% links) | Yes | No (L3 routing everywhere) |
| East-west path length | 2-4 hops (via distribution/core) | 2-3 hops | 2 hops (leaf → spine → leaf) between any two leaves |
| Failure domain | Distribution node failure = large blast radius | Smaller | Single spine or leaf; others absorb traffic |
| Traffic visibility | Hard (asymmetric paths) | Moderate | Good (predictable ECMP paths) |
| Cost | High (proprietary chassis) | Medium | Commodity whitebox switches |
| Hyperscaler adoption | Pre-2010 Facebook, Google | Never at hyperscale | Facebook (2009), Google (2012+), AWS |
The historical turning point: Facebook's 2009 "Data Center Network Architecture at Scale" paper (and subsequent talks at NANOG) publicly described the move to spine-leaf Clos. Google's Jupiter network (published 2015) revealed that Google had been running Clos fabrics at petabit scale since 2012. By 2020, spine-leaf was the default for any new datacenter of more than a few hundred servers.
Dutt / Cloud Native Data Center Networking Weave
Dutt's Cloud Native Data Center Networking is the practitioner's reference for the eBGP-unnumbered spine-leaf fabric. The opening chapters lay the architectural case for Clos topology -- the bisection-bandwidth argument -- and then walk through the FRR + EVPN configuration in detail. NET-301 Week 2 follows Dutt's framing for the Clos/EVPN portion; the security practitioner's addition is that a properly segmented EVPN fabric with per-tenant VNIs limits east-west blast radius at the Layer 2 boundary. An attacker who compromises a workload on VNI 1000 does not have Layer 2 adjacency to workloads on VNI 2000; the VTEP enforces the VNI boundary. Layer 3 reachability between the two VNIs is a separate question, governed by VRF membership and routing policy on the leaves.
Lab 2 Introduction
Lab 2 builds on the SR-MPLS underlay from Lab 1. You will deploy a 2-spine / 4-leaf Containerlab VXLAN-EVPN fabric with FRR 9.x using eBGP unnumbered, configure two tenant VNIs with symmetric IRB, and demonstrate VM mobility: spin up two Docker containers on Leaf 1 and Leaf 2 with the same VNI; move one container from Leaf 1 to Leaf 3; observe that the EVPN Type 2 advertisement updates and the container's IP address remains reachable from Leaf 2 without any manual reconfiguration.
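Before running the lab, it can help to model what the mobility step should do to a remote leaf's table. RFC 7432 defines a MAC Mobility extended community carrying a sequence number, and a route with a higher sequence number replaces the older one. The sketch below models only that rule (tie-breaking between equal sequence numbers is omitted), and the MAC, IP, and VTEP addresses are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MacIpRoute:
    """The fields of an EVPN Type 2 route that matter for mobility."""
    mac: str
    ip: str
    vtep: str
    seq: int = 0   # MAC Mobility sequence number

class RemoteLeafTable:
    """Best Type 2 route per MAC, as a remote leaf (e.g. Leaf 2) sees it."""

    def __init__(self) -> None:
        self.best: dict = {}

    def receive(self, route: MacIpRoute) -> bool:
        current = self.best.get(route.mac)
        if current is None or route.seq > current.seq:
            self.best[route.mac] = route
            return True
        return False   # stale or duplicate advertisement

leaf2 = RemoteLeafTable()
MAC, IP = "aa:bb:cc:00:00:01", "10.10.0.5"

# Container first comes up behind Leaf 1.
leaf2.receive(MacIpRoute(MAC, IP, vtep="192.0.2.11", seq=0))
# Container moves to Leaf 3, which advertises with a higher sequence number.
moved = leaf2.receive(MacIpRoute(MAC, IP, vtep="192.0.2.13", seq=1))
# A late copy of the old Leaf 1 route must not win.
stale = leaf2.receive(MacIpRoute(MAC, IP, vtep="192.0.2.11", seq=0))

print(leaf2.best[MAC].vtep, moved, stale)
```

The IP address never changes in this model; only the VTEP behind it does, which is the behavior the lab asks you to observe from Leaf 2.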
Independent Practice (5 hr)
- Dutt Cloud Native Data Center Networking Ch 1-3 (topology motivation, BGP in the datacenter, EVPN)
- Kurose-Ross 9e §6.4 (VLAN and VXLAN overview)
- RFC 7432 §§1-4 (EVPN architecture and route types; skip the ATM/FR legacy context)
- Lab 2 -- Part A (Containerlab spine-leaf topology bootstrap)
- Supplemental: read the Facebook "Fabric" NANOG 53 slides (2011; publicly archived); 20 min