Cisco SD-WAN Multi-Region Fabric: A Resilient Design for 2026
Deploying Cisco Catalyst SD-WAN beyond a few hundred sites without a Multi-Region Fabric (MRF) architecture is a recipe for control plane collapse. While a single-region fabric appears simpler, its full-mesh nature drives quadratic growth in tunnels and BFD sessions, plus a volume of OMP updates that neither vSmart controllers nor edge routers can sustain at enterprise scale. MRF, formerly Hierarchical SD-WAN, is not an optional add-on; it is the mandatory design paradigm for any network exceeding 1,000 sites or spanning multiple continents. However, a successful MRF deployment hinges entirely on a flawlessly architected Region 0 and correctly sized transport gateways. Get this wrong, and you build a distributed bottleneck instead of a scalable fabric.
Understanding the Multi-Region Fabric Control Plane
A default, single-region Catalyst SD-WAN deployment is a flat control plane. Every edge router (cEdge/vEdge) establishes OMP peering with the vSmart controllers, which reflect TLOCs (Transport Locators) and service routes to every other edge; edge routers do not run OMP with one another. By default, each edge then builds an IPsec tunnel to every remote TLOC it learns, and a BFD session verifies path liveness on each tunnel. At 100 sites, this is manageable. At 2,000 sites, each with two transports, a single router might need to maintain thousands of BFD sessions and hold routes for every site in the fabric. This consumes significant CPU and memory, not just on the edge routers but critically on the vSmart controllers, which act as the central route reflectors. A single vSmart pair, even on robust hardware, hits practical limits around 2,500 OMP peering sessions.
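To make the scaling argument concrete, here is a minimal Python sketch of the session arithmetic. It assumes one router per site, the same number of transports everywhere, and no color restrictions, so every local TLOC builds a tunnel to every remote TLOC. The function name and these simplifications are illustrative assumptions, not Cisco sizing guidance.

```python
def flat_fabric_sessions(sites: int, transports_per_site: int) -> tuple[int, int]:
    """Return (sessions_per_router, total_tunnels) for a flat full mesh.

    Assumption: every local TLOC builds a tunnel (and one BFD session) to
    every TLOC on every other site; one router per site.
    """
    if sites < 1 or transports_per_site < 1:
        raise ValueError("sites and transports_per_site must be positive")
    per_router = (sites - 1) * transports_per_site ** 2
    # Each tunnel is shared by two routers, so halve the sum over all routers.
    total = sites * per_router // 2
    return per_router, total


for n in (100, 500, 2000):
    per_router, total = flat_fabric_sessions(n, transports_per_site=2)
    print(f"{n:>5} sites: {per_router:>5} BFD sessions per router, "
          f"{total:>8} tunnels fabric-wide")
```

Under these assumptions a router in a 2,000-site, dual-transport fabric would hold 7,996 sessions, against 396 at 100 sites. Real deployments usually land below this ceiling because of restricted colors or hub-and-spoke policies, but the quadratic trend is the point.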
MRF solves this by introducing a hierarchical control plane. The fabric is partitioned into a core region (Region 0) and multiple access regions (Regions 1-N).
- Access Regions: These contain the actual user sites—branches, campuses, and datacenter application environments. Routers within an access region operate in a full mesh among themselves.
- Region 0: This special region contains no user sites. Its sole purpose is to interconnect the access regions. It contains high-throughput border routers that act as transport gateways.
The magic is in the control plane abstraction. An edge router in Region 1 (e.g., London) does not peer with an edge router in Region 2 (e.g., Tokyo). Instead, the London router learns a summarized path to the Tokyo region via its local border router. The vSmarts enforce this segmentation, dramatically reducing the number of OMP and BFD sessions each device must maintain. This is not just a suggestion; for global networks running on Catalyst SD-WAN 20.9 or newer, it is the only stable architecture.
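A rough way to see the effect is to compare how many peers one edge router meshes with in each design. The sketch below assumes a 2,000-site fabric, a 600-site access region served by two border routers, and two transports per router. All of these figures and function names are illustrative assumptions.

```python
def sessions_for_peers(peer_routers: int, transports: int = 2) -> int:
    """BFD sessions one router holds toward `peer_routers` peers,
    assuming a full TLOC-to-TLOC mesh with each of them."""
    return peer_routers * transports ** 2


def compare(total_sites: int, region_sites: int, border_routers: int,
            transports: int = 2) -> dict[str, int]:
    """Per-edge session count in a flat fabric versus inside one access region."""
    flat = sessions_for_peers(total_sites - 1, transports)
    # Inside a region, an edge meshes only with the other regional sites
    # plus the region's border routers (counted separately from the sites).
    regional = sessions_for_peers(region_sites - 1 + border_routers, transports)
    return {"flat": flat, "regional": regional}


result = compare(total_sites=2000, region_sites=600, border_routers=2)
print(result)  # {'flat': 7996, 'regional': 2404}
```

With these inputs the per-edge load drops by roughly 70%, and it no longer grows when sites are added in other regions.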
Core Architecture: Region 0, Border Routers, and Transport Gateways
The terminology here is precise. A router that provides an exit point for a region is a border router. When these border routers are used to interconnect regions, they function as transport gateways. In an MRF design, these terms are often used interchangeably for the same device.
Designing the Core: Region 0
Region 0 is the heart of the fabric; its design determines the stability, latency, and throughput of all inter-region communication. It is a transit-only region. Under no circumstances should service-side VPNs for user branches or datacenters terminate directly in Region 0. Its only members are the transport gateways themselves. For maximum stability, Region 0 transport gateways should be deployed in at least two, preferably three or more, geographically dispersed, carrier-neutral facilities with high-speed connectivity. For a global network, think Equinix locations in Ashburn, London, and Singapore. The transport connecting these core sites should not be public internet; it must be a private, high-performance backbone (e.g., 100 Gbps DWDM, dedicated carrier Ethernet, or a premium MPLS service).
Hardware Selection: No Cutting Corners
For the critical role of transport gateways in Region 0 and high-density access regions, hardware selection is paramount. Do not attempt to use entry-level branch routers. The required IPsec crypto performance and session scalability demand high-end platforms. The workhorse for this role is the Catalyst 8500 Series, specifically the C8500-12X, which provides up to 197 Gbps of IPsec throughput. For virtual deployments in a private cloud or colocation, a Catalyst 8000V (Cat8kV) instance provisioned with sufficient CPU cores (e.g., 16+ vCPUs on a UCS C220 M7) and SR-IOV for NIC performance is a viable alternative. For access region border routers in smaller regions, a pair of Catalyst 8300s can suffice, but performance must be carefully validated against aggregate throughput requirements.
Sizing Transport Gateways and Control Plane Components
Undersizing transport gateways is the most common and costly mistake in MRF design. The calculation requires an honest assessment of inter-region traffic flows and an understanding of IPsec overhead.
A Real Sizing Example
Let's model a transport gateway for a European access region (Region 1) with 600 branch sites, which needs to communicate with an American region (Region 2).
- Aggregate Branch Throughput: Assume each of the 600 branches has a 100 Mbps DIA circuit, with an average peak utilization of 40%, so 40 Mbps per site. The theoretical aggregate egress throughput is 600 * 40 Mbps = 24 Gbps.
- Estimate Inter-Region Traffic: Not all traffic will leave the region. Based on application analysis, let's say 30% of traffic is destined for the AMER region. This means the transport gateway must handle 24 Gbps * 0.30 = 7.2 Gbps of inter-region traffic.
- Calculate Crypto Overhead: IPsec (ESP in tunnel mode with AES-256-GCM) adds encapsulation overhead. A conservative estimate is a 25% performance impact on raw throughput. So, the required crypto performance is 7.2 Gbps * 1.25 = 9.0 Gbps.
- Factor in Failover: You will deploy at least two transport gateways for redundancy (e.g., one in London, one in Frankfurt). Each gateway must be sized to handle the entire 9.0 Gbps load if the other fails. Sizing them for 4.5 Gbps each (50/50 load) guarantees a massive performance degradation during a failure.
- Select the Platform: A single Catalyst 8300 (C8300-2N2S-4T2X) maxes out around 10-15 Gbps of aggregate IPsec throughput under ideal conditions. Pushing 9 Gbps during a failover is risky and leaves no room for growth. The correct choice here is a pair of Catalyst 8500-12X routers or high-performance Cat8kV instances. While a competitor like a PA-5440 from Palo Alto Networks might offer ~40 Gbps of IPsec throughput, staying within the Catalyst ecosystem simplifies management under vManage.
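The sizing steps above can be reproduced with a short Python sketch, which makes it easy to rerun the numbers for other regions. The class and function names are hypothetical, and the 25% crypto overhead is the article's conservative estimate rather than a measured figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionLoad:
    sites: int
    circuit_mbps: float         # access circuit size per site
    peak_utilization: float     # fraction of the circuit used at peak
    inter_region_share: float   # fraction of traffic leaving the region
    crypto_overhead: float      # 0.25 means a 25% IPsec performance penalty


def required_gateway_gbps(load: RegionLoad) -> float:
    """Crypto throughput ONE gateway needs to carry the whole region alone."""
    aggregate_mbps = load.sites * load.circuit_mbps * load.peak_utilization
    inter_region_mbps = aggregate_mbps * load.inter_region_share
    with_crypto_mbps = inter_region_mbps * (1 + load.crypto_overhead)
    return with_crypto_mbps / 1000


emea = RegionLoad(sites=600, circuit_mbps=100, peak_utilization=0.40,
                  inter_region_share=0.30, crypto_overhead=0.25)
need = required_gateway_gbps(emea)
print(f"Each gateway must sustain {need:.1f} Gbps on its own")
print(f"A 50/50 split would size each for only {need / 2:.1f} Gbps")
```

Sizing to the full figure rather than the split is what keeps a single gateway failure from becoming a region-wide slowdown.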
TLOC Design and Path Control
The elegance of MRF lies in its use of TLOCs. A border router in an access region performs a crucial function: it re-originates the routes it learns from other regions, substituting its own TLOCs as the next hop. (This is distinct from the TLOC-extension feature, which shares a transport between two routers at the same site.) When a branch cEdge in Frankfurt needs to send traffic to a branch cEdge in Dallas, it does not see the Dallas cEdge's TLOCs directly. It sees the TLOC of its local transport gateway (e.g., in London), which has a path to the AMER region.
The control plane flow is as follows:
- The Dallas cEdge advertises its local TLOCs and service-side prefixes via OMP to the vSmarts serving its region, which reflect them to the local (AMER) border router.
- The AMER border router advertises these prefixes to the Region 0 vSmarts, but crucially, it advertises them with its own TLOC as the next hop.
- The Region 0 vSmarts pass this summary to the EMEA border router.
- The EMEA border router passes the reachability information to the Frankfurt cEdge.
The result: the Frankfurt cEdge simply forwards inter-region traffic to the London transport gateway's TLOC. The complex inter-continental pathing is handled by the structured hierarchy, not by individual branch routers. This allows for powerful policy application. You can create centralized control policies in vManage that dictate, for example, that all traffic from Region 1 to Region 2 with a specific DSCP marking must use the MPLS transport through Region 0, while all other traffic can use the internet transport.
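The next-hop rewriting in this flow can be modeled in a few lines of Python. This is a toy model of the idea, not of OMP itself; the prefix, TLOC labels, and function name are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop_tloc: str   # TLOC the receiver should send traffic to
    origin_region: int


def reoriginate(route: Route, border_tloc: str) -> Route:
    """Model a border router re-advertising a route with its own TLOC."""
    return replace(route, next_hop_tloc=border_tloc)


# Hypothetical example values.
dallas = Route("10.20.0.0/16", "dallas-edge:mpls", origin_region=2)
in_region0 = reoriginate(dallas, "amer-border:core")
at_frankfurt = reoriginate(in_region0, "london-border:inet")

for where, route in (("AMER region", dallas),
                     ("Region 0", in_region0),
                     ("EMEA region", at_frankfurt)):
    print(f"{where:<12} {route.prefix} via {route.next_hop_tloc}")
```

The prefix and its origin region survive each hop, while the Dallas TLOC never leaves the AMER region. That is why the Frankfurt edge needs no tunnel or BFD session toward Dallas.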
Common Pitfall: Creating Backdoor Inter-Region Links
A fatal design flaw is establishing out-of-band connectivity between access regions that bypasses Region 0. For example, an engineer might directly link two datacenters, one in access Region 1 and one in access Region 2, with a dedicated Layer 3 connection for a specific application and then redistribute routes into OMP. This creates a "backdoor" path.
This completely undermines the MRF architecture. It introduces the risk of asymmetric routing, where traffic from Region 1 to 2 takes the backdoor link, but return traffic from 2 to 1 attempts to use the official Region 0 path. This wreaks havoc on stateful services like firewalls. It violates the control plane hierarchy, making troubleshooting with vManage's tools nearly impossible, as the real traffic path does not match the logically configured one. All inter-region traffic must transit the transport gateways via Region 0. There are no exceptions.
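A design review can catch this class of mistake with a simple inventory check. The sketch below assumes a hand-maintained map of sites to regions and a list of Layer 3 links. Site names and the data model are hypothetical, and transport gateways are modeled simply as members of Region 0.

```python
CORE_REGION = 0


def find_backdoor_links(site_region: dict[str, int],
                        links: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return links that join two different access regions directly."""
    backdoors = []
    for a, b in links:
        region_a, region_b = site_region[a], site_region[b]
        # Legitimate links stay inside one region or have one end in Region 0.
        if region_a != region_b and CORE_REGION not in (region_a, region_b):
            backdoors.append((a, b))
    return backdoors


site_region = {
    "gw-london": 0, "gw-ashburn": 0,      # transport gateways
    "dc-frankfurt": 1, "br-munich": 1,    # EMEA access region
    "dc-dallas": 2,                       # AMER access region
}
links = [
    ("dc-frankfurt", "gw-london"),
    ("gw-london", "gw-ashburn"),
    ("gw-ashburn", "dc-dallas"),
    ("br-munich", "dc-frankfurt"),
    ("dc-frankfurt", "dc-dallas"),        # dedicated DC-to-DC circuit
]
print(find_backdoor_links(site_region, links))
```

Only the direct Frankfurt-to-Dallas circuit is flagged; every other link either stays within a region or terminates on a Region 0 gateway.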
When NOT to Use Multi-Region Fabric
MRF adds a layer of design and operational complexity. It is not always the right answer. A well-scaled single region is preferable to a poorly implemented multi-region design.
You should not use MRF if:
- Your network has fewer than 500 sites. The control plane overhead is manageable on modern hardware. A pair of vSmarts can handle the OMP load, and edge routers like the Catalyst 8200 or 8300 can handle the BFD sessions for a network of this size.
- Your network is geographically contained. If all your sites are within a single continent (e.g., North America), the latency benefits of regionalization are minimal. A single region with controllers placed in geographically central datacenters (e.g., Chicago and Dallas) is more efficient.
- You lack the core network backbone for a reliable Region 0. If you cannot provision dedicated, high-speed, low-latency transport between your core Region 0 sites, MRF will not perform well. Trying to build Region 0 over the public internet introduces too much unpredictability and defeats the purpose of creating a stable core.
The primary trigger for MRF is scaling beyond the OMP/BFD limits of a single control plane domain, typically seen past the 1,000-1,500 site mark, or the need to enforce strict segmentation and optimized pathing across a global, multi-continent deployment.
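The criteria in this section can be condensed into a small decision helper. The sketch below is one possible reading of the thresholds given here (under 500 sites, past 1,000 sites, multi-continent reach, a private core backbone); the function name and the handling of the 500 to 1,000 site gray zone are assumptions.

```python
SINGLE = "single region"
BLOCKED = "single region until a private Region 0 backbone exists"
MRF = "multi-region fabric"


def mrf_recommendation(sites: int, multi_continent: bool,
                       private_core_backbone: bool) -> str:
    """Apply the rules of thumb above; thresholds are guidance, not limits."""
    if sites < 500:
        return SINGLE
    if sites <= 1000 and not multi_continent:
        # Gray zone: scale alone does not yet force a hierarchy.
        return SINGLE
    if not private_core_backbone:
        return BLOCKED
    return MRF


for case in ((300, True, True), (800, False, True),
             (800, True, True), (2000, False, False)):
    print(case, "->", mrf_recommendation(*case))
```

Treat the output as a starting point for the design conversation; measured OMP and BFD load on the existing fabric should carry more weight than site count alone.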
Mastering the Multi-Region Fabric is essential for building a resilient, planet-scale Catalyst SD-WAN. Its hierarchical nature is the only way to overcome the inherent scaling limitations of flat network designs. By focusing on a robust, private Region 0, correctly sizing transport gateways for failover, and preserving the integrity of the control plane hierarchy, you can build a fabric that provides stable, policy-driven connectivity for thousands of sites. For expert guidance on designing, implementing, and managing your large-scale SD-WAN deployment, explore consulting services at techleague.io. To further your expertise, read our analyses on Catalyst 8000 vs. ISR 4000 platform selection and the intersection of SASE with fabric design in our ZTNA vs. VPN integration guide.
Frequently asked questions
Can I use the public internet for Region 0 transport connectivity?
While technically possible by running tunnels over the internet, it is a fundamentally flawed design. Region 0 is your core backbone; its stability dictates the entire fabric's performance. Using the unpredictable public internet introduces variable latency and jitter, undermining the reliability MRF is meant to provide. Always use dedicated private transport like DWDM, carrier Ethernet, or premium MPLS for Region 0.
How many access regions should I create?
Start with continental boundaries: AMER, EMEA, and APJC are common starting points. A good rule of thumb is to keep region sizes between 500-1000 sites to stay well within the control plane limits of border routers. Avoid creating dozens of small, granular regions, as this increases management complexity without providing significant scaling benefits.
Do I need separate vSmart controller clusters for each region?
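Not necessarily. vSmart controllers are assigned the regions they serve, so a dedicated cluster per region is a design choice rather than a requirement. Larger fabrics commonly dedicate controllers to Region 0 and to each access region (or group of regions), while smaller ones let a controller pair serve several access regions. Either way, you assign routers and their sites to specific region numbers during configuration, and the controllers enforce the hierarchical control plane boundaries based on these assignments.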
Which Catalyst SD-WAN software version is required for MRF?
The feature was introduced as Hierarchical SD-WAN in Cisco vManage 20.7 with IOS XE 17.7 and was renamed Multi-Region Fabric shortly afterwards. In modern Cisco Catalyst SD-WAN software (e.g., version 20.9 and later), it is a stable and mature feature. For any production MRF deployment, it is critical to run a long-lived release that Cisco currently recommends rather than a short-lived one.
Can a single branch site belong to more than one access region?
No, an edge router (cEdge/vEdge) is explicitly assigned to a single access region via its system configuration. Its OMP sessions are established with the vSmart controllers serving that region, and its data plane tunnels reach only other edges in the region and the region's designated border routers. Border routers are the exception: each one belongs to both its access region and Region 0.
How do routing policies and QoS work with MRF?
Policies and QoS are applied hierarchically. You can apply specific data policies, control policies, or application-aware routing policies that only affect traffic within an access region. Separately, you can apply policies at the transport gateways to govern traffic flowing through Region 0. This allows for granular control within a region and high-level control over inter-region backbone traffic.
Is Multi-Region Fabric the same as Cisco SD-WAN Cloud OnRamp?
No, they solve different problems but are complementary. Cloud OnRamp for SaaS/IaaS is a feature that optimizes the path from a branch site to a specific cloud application or IaaS provider. MRF is a foundational architecture for scaling the entire WAN itself across many sites and geographies. You would typically use Cloud OnRamp *within* an access region of your MRF deployment.