Wednesday 28 February 2024

Another Ethernet VPN (EVPN) Introduction

Ethernet VPN (EVPN) Introduction

Rather than a single protocol, EVPN is a solution that uses the Multi-Protocol Border Gateway Protocol (MP-BGP) as the control plane of an overlay network. For the overlay data plane, EVPN employs Virtual eXtensible Local Area Network (VXLAN) encapsulation.


Multi-Protocol BGP (MP-BGP) is an extension of BGP-4 that allows BGP speakers to encode Network Layer Reachability Information (NLRI) of various address types, including IPv4/6, VPNv4, and MAC addresses, into BGP Update messages. 

The MP_REACH_NLRI path attribute (PA) carried within MP-BGP update messages includes Address Family Identifier (AFI) and Subsequent Address Family Identifier (SAFI) attributes. The combination of AFI and SAFI determines the semantics of the carried Network Layer Reachability Information (NLRI). For example, AFI-25 (L2VPN) with SAFI-70 (EVPN) defines an MP-BGP-based L2VPN solution, which extends a broadcast domain in a multipoint manner over a routed IPv4 infrastructure using an Ethernet VPN (EVPN) solution. 

BGP EVPN Route Types carried in BGP Update messages describe the type of the advertised EVPN NLRI (Network Layer Reachability Information). Besides publishing IP prefix information with the IP Prefix Route (EVPN RT 5), BGP EVPN uses the MAC Advertisement Route (EVPN RT 2) for advertising hosts’ MAC/IP address reachability information. The Virtual Network Identifier (VNI) describes the VXLAN segment of the advertised MAC/IP addresses. 

In addition to these two fundamental route types, BGP EVPN can create a shared delivery tree for Layer 2 Broadcast, Unknown unicast, and Multicast (BUM) traffic using the Inclusive Multicast Route (EVPN RT 3) for joining an Ingress Replication tunnel. This solution does not require a multicast-enabled underlay network. The other option for BUM traffic is a multicast-capable underlay network.

While EVPN RT 3 is used for building a delivery tree for BUM traffic, the Tenant Routed Multicast (TRM) solution provides tenant-specific multicast forwarding between senders and receivers. TRM is based on Multicast VPN (BGP AFI 1/SAFI 5 – IPv4/MCAST-VPN). TRM uses the MVPN Source Active A-D Route (MVPN RT 5) to publish the multicast stream's source address and group. 

Using BGP EVPN's native multihoming solution, we can establish a port channel between Tenant Systems (TS) and two or more VTEP switches. From the perspective of the TS, a traditional port channel is deployed by bundling a set of Ethernet links into a single logical link. On the multihoming VTEP switches, these links are associated with a logical Port-Channel interface called an Ethernet Segment (ES).

EVPN utilizes the EVPN Ethernet Segment Route (EVPN RT 4) as a signaling mechanism between member units to indicate which Ethernet Segments they are connected to. Additionally, VTEP switches use this EVPN RT 4 for selecting a Designated Forwarder (DF) for Broadcast, Unknown unicast, and Multicast (BUM) traffic.

When EVPN Multihoming is enabled on a set of VTEP switches, all local MAC/IP Advertisement Routes include the ES Type and ES Identifier (ESI). The EVPN multihoming solution employs the EVPN Ethernet A-D Route (EVPN RT 1) for rapid convergence. Leveraging EVPN RT 1, a VTEP switch can withdraw all MAC/IP addresses learned via a failed ES at once by describing the ESI value in the MP_UNREACH_NLRI Path Attribute.

An EVPN fabric employs a proactive Control Plane learning model, while networks based on Spanning Tree Protocol (STP) rely on a reactive flood-and-learn-based Data Plane learning model. In an EVPN fabric, data paths between Tenant Systems are established before data exchange. It's worth noting that without enabling ARP suppression, local VTEP switches flood ARP Request messages. However, remote VTEP switches do not learn the source MAC address from the VXLAN encapsulated frames.

BGP EVPN provides various methods for filtering reachability information. For instance, we can establish an import/export policy based on BGP Route Targets (BGP RT). Additionally, we can deploy ingress/egress filters using elements such as prefix lists or BGP path attributes, like BGP Autonomous System numbers. In addition, BGP, OSPF, and IS-IS all support peer authentication.

EVPN Data Plane: VXLAN Introduction

Virtual eXtensible LAN (VXLAN) is an encapsulation scheme enabling Broadcast Domain/VLAN stretching over a Layer 3 network. Switches or hosts performing encapsulation/decapsulation are called VXLAN Tunnel End Points (VTEP). VTEPs encapsulate the Ethernet frames originated by local Tenant Systems (TS) within outer MAC and IP headers, followed by a UDP header whose destination port is 4789 and whose source port is calculated as a hash of the inner payload. Between the UDP header and the original Ethernet frame sits the VXLAN header, which identifies the VXLAN segment with a VXLAN Network Identifier (VNI). A VNI is a 24-bit field, theoretically allowing over 16 million unique VXLAN segments. 

VTEP devices allocate a Layer 2 VNI (L2VNI) for Intra-VN connections and a Layer 3 VNI (L3VNI) for Inter-VN connections. There is a unique L2VNI for each VXLAN segment but one common L3VNI for tenant-specific Inter-VN communication. Besides, the VXLAN Group Policy Option (VXLAN-GPO) enables leaf switches to add Group Policy information to data packets. 

EVPN Building Blocks

I have divided Figure 1-2 into four domains: 1) Service Abstraction – Broadcast Domain, 2) Overlay Control Plane, 3) Overlay Data Plane, and 4) Route Propagation. These domains consist of several components which have cross-domain dependencies. 

Service Abstraction - Broadcast Domain: Virtual LAN: 

A Broadcast Domain (BD) is a logical network segment where all connected devices share the same subnet and can reach each other with Broadcast and Unicast messages. A Virtual LAN (VLAN) can be considered an abstraction of a BD. When we create a new VLAN and associate access/trunk interfaces with it, a switch starts building an address table of source MAC addresses from received frames originated by local Tenant Systems. With TS, I am referring to physical or virtual hosts. A Tenant System can also be a forwarding component, such as a firewall or load balancer, attached to one or more tenant-specific Virtual Networks. 

Service Abstraction - Broadcast Domain: EVPN Instance: 

An EVPN Instance (EVI) is identified by a Layer 2 Virtual Network Identifier (L2VNI). Besides the L2VNI, each EVPN Instance has a unique Route Distinguisher (RD), allowing overlapping addresses between different tenants, and BGP Route Targets (BGP RT) for BGP import and export policies. Before deploying an EVI, we must configure the VLAN and associate it with the VN segment (EVPN Instance). This is because the autogenerated Route Distinguisher associated with the EVI requires a VLAN identifier in the RD local administrator part (a base value of 32767 + the associated VLAN ID). When we deploy an EVPN Instance, the Layer 2 Forwarding Manager (L2FM) starts encoding local MAC address information from the MAC table into the EVI-specific MAC-VRF (L2RIB) and the other way around. 
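As a sketch of the deployment described above, on Cisco NX-OS an EVPN Instance with auto-derived RD and Route Targets could be configured along these lines (the L2VNI value 10000 is taken from the later examples in this text):

```
evpn
  vni 10000 l2
    rd auto
    route-target import auto
    route-target export auto
```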

Overlay Control Plane

VTEP switches use BGP EVPN for publishing Tenant Systems’ (TS) reachability information. The BGP Routing Information Base (BRIB) consists of the Local RIB (Loc-RIB) and the Adjacency RIB In/Out (Adj-RIB-In and Adj-RIB-Out) tables. The BGP process stores all valid local and remote Network Layer Reachability Information (NLRI) in the Loc-RIB, while the Adj-RIB-Out is a peer-specific table into which NLRIs are installed through the BGP Policy Engine. The Policy Engine executes our deployed BGP peer policy. An example of Policy Engine operation in a Single-AS fabric is the peer-specific route-reflector-client definition deployed on spine switches. By setting a peered leaf switch as a Route-Reflector (RR) client, we allow spine switches to publish NLRIs received from one iBGP peer to another iBGP peer, which the default iBGP policy does not permit. Local Tenant Systems' MAC addresses and source interfaces are encoded into the BGP Loc-RIB from the L2RIB, with the encapsulation type and source IP address obtained from the NVE interface configuration. 
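As an illustrative sketch, the route-reflector-client definition mentioned above might be deployed on an NX-OS spine switch roughly as follows; the AS number 65000 matches the later examples, while the neighbor address is a hypothetical leaf loopback:

```
router bgp 65000
  neighbor remote-as 65000
    update-source loopback0
    address-family l2vpn evpn
      send-community extended
      route-reflector-client
```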

When a VTEP receives an EVPN NLRI from a remote VTEP with importable Route Targets, it validates the route by checking that it was received from a configured BGP peer with the correct remote ASN and a reachable source IP address. Then, it installs the NLRI information (RD, encapsulation type, next hop, other standard and extended communities, and VNIs) into the BGP Loc-RIB. Note that the local administrator part of the RD may change during the process if the VN segment is associated with a different VLAN than on the remote VTEP. Remember that VLANs are locally significant, while EVPN Instances have fabric-wide meaning. Next, the best MAC route is encoded into the L2RIB with the topology information (VLAN ID associated with the VXLAN segment) and the next-hop information. The L2RIB also describes the route source as BGP. Finally, L2FM programs the information into the MAC address table and sets the NVE peer interface ID as the next hop. Note that the VXLAN Manager learns VXLAN peers from the data plane based on the source IP address. 

Overlay Data Plane: Network Virtualization Edge (NVE) Interface:

The configuration of the logical NVE interface dictates the encapsulation type and tunnel IP address for VXLAN tunnels. The VXLAN tunnel source IP address is obtained from a logical loopback interface, which must be reachable across the fabric switches. The IP address of the NVE interface is carried in BGP Update messages as the next-hop address in the MP_REACH_NLRI. The VXLAN encapsulation type is published as a BGP EXTENDED_COMMUNITY Path Attribute along with the Route Targets (L2VNI and L3VNI) and the system MAC (if an IP address is included).

EVPN Instances (EVIs) are associated with the NVE interface as member VNs. We must define the Layer 2 BUM traffic forwarding mode (Ingress Replication or Multicast Group) under each member VN. The VXLAN Manager is responsible for the data plane encapsulation and decapsulation processes.
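A minimal NX-OS-style sketch of the NVE interface configuration described above; loopback1 as the tunnel source and Ingress Replication for member VNI 10000 are assumptions for illustration:

```
interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback1
  member vni 10000
    ingress-replication protocol bgp
```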

MAC Route Propagation: Local VTEP

The previous sections provided an overview of the MAC Route propagation process. This section recaps the operation. Tenant Systems can verify the uniqueness of their IP address by sending a Gratuitous ARP (GARP), which is an unsolicited ARP Reply. The VTEP switch learns the source MAC address from the incoming frame and adds it to the MAC address table. The VLAN ID associated with the MAC entry is derived from the configuration of the Attachment Circuit (incoming interface) or the 802.1Q tag in the Ethernet header. The Attachment Circuit serves as the next hop.

The Layer 2 Forwarding Manager (L2FM) transfers information from the MAC address table to the L2RIB of the MAC-VRF. Subsequently, the MAC route is encoded into the BGP Loc-RIB. The BGP process attaches the EVPN Instance-specific Route Distinguisher to the EVPN NLRI. EVI-specific Route Targets are attached as EXTENDED_COMMUNITY path attributes, along with the VXLAN encapsulation type defined in the NVE interface configuration. The next hop for the EVPN NLRI is the IP address associated with the local NVE interface. Finally, the MAC route is sent from the Loc-RIB through the BGP Policy Engine to the Adj-RIB-Out and forwarded to the BGP EVPN peer.

Figure 1-2

Thursday 25 January 2024

BGP EVPN Part IV: MAC-VRF L2RIB Update: Local MAC Address

In Figure 1-3, we have VLAN 10 mapped to EVI/MAC-VRF L2VNI 10000. TS-A1 (MAC: 1000.0010.beef) is connected to VLAN 10 via Attachment Circuit (AC) Ethernet 1/2 (ifindex: 0x1a000200). 

Figure 1-3: MAC-VRF: L2RIB Local Learning Process.

Example 1-1 shows the VLAN to L2VNI mapping information. 

Leaf-101# show vlan id 10 vn-segment
VLAN Segment-id
---- -----------
10   10000       

Example 1-1: VLAN to EVPN Instance Mapping Information.

Step-1 and 2: MAC Table Update 

During the startup process, TS-A1 sends a Gratuitous ARP (GARP) message to announce its presence on the network and validate the uniqueness of its IP address. It uses its IP address in the Target IP field (Example 1-2). If another host responds to this unsolicited ARP reply, it indicates a potential IP address conflict. 

Ethernet II, Src: 10:00:00:10:be:ef, Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (reply/gratuitous ARP)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: reply (2)
    [Is gratuitous: True]
    Sender MAC address: 10:00:00:10:be:ef (10:00:00:10:be:ef)
    Sender IP address:
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address:

Example 1-2: Gratuitous ARP from TS-A1.

Leaf-101 learns the MAC address of TS-A1 from the ingress frame and encodes the source MAC address 1000.0010.beef into the VLAN 10 MAC address table (Example 1-3). The type is Dynamic, and the egress port (next hop) is interface Ethernet 1/2. The default MAC address aging time in Cisco Nexus 9000 series switches is 1800 seconds (30 minutes). 

Leaf-101# show system internal l2fwder mac
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link,
        (T) - True, (F) - False, C - ControlPlane MAC
   VLAN     MAC Address      Type      age     Secure NTFY Ports
*    10    1000.0010.beef   dynamic   00:03:43   F     F     Eth1/2

Example 1-3: Leaf-101 MAC Address Table.

Step-3: MAC-VRF’s L2RIB Update

The Layer 2 Forwarder (L2FWDER) component detects the new MAC address in the VLAN 10 Bridge Table. The L2FWDER registers this as a MAC move event, prompting it to program the MAC address reachability information into the Layer 2 Routing Information Base (L2RIB) associated with the MAC-VRF of tenant TENANT77. Example 1-4 shows, starting from the bottom, how the L2FWDER component first detects a new MAC address, 1000.0010.beef, associated with VLAN 10 over the Attachment Circuit 0x1a000200 (Ethernet1/2, Example 1-5). It then adds a new local MAC route to topology 10 (L2VNI 10000) with the next-hop interface ID 0x1a000200 (Ethernet1/2).

l2fwder_dbg_ev, 690 l2fwder_l2rib_add_delete_local_mac_routes,
154Adding route  topo-id: 10, macaddr: 1000.0010.beef, nhifindx: 0x1a000200
l2fwder_dbg_ev, 690 l2fwder_l2rib_mac_update,
739MAC move 1000.0010.beef (10) 0x0 -> 0x1a000200

Example 1-4: L2RIB Update by L2FWDER.

Example 1-5 verifies the snmp-ifindex 0x1a000200 mapping to physical interface Ethernet1/2.

Leaf-101# show interface snmp-ifindex | i 0x1a000200
Eth1/2          436208128  (0x1a000200)

Example 1-5: SNMP-ifindex to Interface Mapping Verification.

Example 1-6 details the update events from the L2RIB perspective. The L2RIB receives the MAC route 1000.0010.beef (topology 10) and creates a new MAC route after a MAC mobility check. The route is then added to L2VNI 10000 and marked as a local route (rt_flags=L) with the next-hop interface Ethernet1/2.

Leaf-101# sh system internal l2rib event-history mac | i beef
Rcvd MAC ROUTE msg: (10, 1000.0010.beef), vni 0, admin_dist 0, seq 0, soo 0, 
(10,1000.0010.beef):Mobility check for new rte from prod: 3
(10,1000.0010.beef):Current non-del-pending route local:no, remote:no, linked mac-ip count:1
(10,1000.0010.beef):Clearing routelist flags: Del_Pend, 
(10,1000.0010.beef,3):Is local route. is_mac_remote_at_the_delete: 0
(10,1000.0010.beef,3):MAC route created with seq 0, flags L, (), 
(10,1000.0010.beef,3): soo 0, peerid 0, pc-ifindex 0
(10,1000.0010.beef,3):Encoding MAC best route (ADD, client id 5)
(10,1000.0010.beef,3):vni:10000 rt_flags:L, admin_dist:6, seq_num:0 ecmp_label:0 soo:0(--)
(10,1000.0010.beef,3):res:Regular esi:(F) peerid:0 nve_ifhdl:1224736769 mh_pc_ifidx:0 nh_count:1

Example 1-6: L2RIB Update from the L2RIB Perspective.

Example 1-7 shows that the MAC address of TS-A1 is installed into the L2RIB associated with topology 10 (VN segment 10000). The entry is marked as a locally learned route (Prod=Local, Flag=L), with the interface Ethernet 1/2 set as the next hop for the MAC address.

Leaf-101# show l2route evpn mac evi 10
Flags -(Rmac): Router MAC (Stt):Static (L):Local (R):Remote (V):vPC link
(Dup):Duplicate (Spl):Split (Rcv):Recv (AD):Auto-Delete (D):Del Pending
(S):Stale (C):Clear, (Ps):Peer Sync (O):Re-Originated (Nho):NH-Override
(Pf):Permanently-Frozen, (Orp): Orphan
Topology    Mac Address    Prod   Flags         Seq No     Next-Hops
----------- -------------- ------ ------------- ---------- -----------------
10          1000.0010.beef Local  L,            0          Eth1/2

Example 1-7: Updated L2RIB.

Next, L2RIB Update, MAC-IP binding. 

Friday 19 January 2024

BGP EVPN Part III: BGP EVPN Local Learning Fundamentals

Multi-Protocol BGP (MP-BGP) is a BGP-4 extension that enables BGP speakers to encode Network Layer Reachability Information (NLRI) of various address types, such as IPv4/6, VPNv4, and MAC addresses, into BGP Update messages. MP-BGP features an MP_REACH_NLRI Path-Attribute (PA), which utilizes an Address Family Identifier (AFI) to describe service categories. Subsequent Address Family Identifier (SAFI), in turn, defines the solution used for providing the service. For example, L2VPN (AFI 25) is a primary category for Layer-2 VPN services, and the Ethernet Virtual Private Network (EVPN: SAFI 70) provides the service. Another L2VPN service is Virtual Private LAN Service (VPLS: SAFI 65). The main differences between these two L2VPN services are that only EVPN supports active/active multihoming, has a control-plane-based MAC address learning mechanism, and operates over an IP-routed infrastructure.

EVPN utilizes various Route Types (EVPN RT) to describe the Network Layer Reachability Information (NLRI) associated with Unicast, BUM (Broadcast, Unknown unicast, and Multicast) traffic, as well as ESI Multihoming. The following sections explain how EVPN RT 2 (MAC Advertisement Route) is employed to distribute MAC and IP address information of Tenant Systems enabling the expansion of VLAN over routed infrastructure. 

The Tenant System refers to a host, virtual machine, or an intra-tenant forwarding component attached to one or more Tenant-specific Virtual Networks. Examples of TS forwarding components include firewalls, load balancers, switches, and routers.

TS’s MAC Address Local Processing - Basics

Switch Leaf-101 in Figure 1-2 serves as our VXLAN Tunnel End Point (VTEP) device, supporting the BGP L2VPN EVPN address family. By configuring an EVPN Instance (EVI) on the VTEP, we deploy an instance-specific MAC-VRF, a Virtual Routing and Forwarding table for MAC addresses (L2RIB). Each EVI is identified by a Layer-2 Virtual Network Identifier (L2VNI). On the data plane, a remote VTEP uses the L2VNI in the VXLAN header to identify the EVPN Instance on the local VTEP. Besides the L2VNI, each EVI has a unique Route Target (RT), an extended community path attribute for import/export policies, and a Route Distinguisher (RD) to facilitate inter-tenant overlapping MAC and IP addresses. 

In Figure 1-2, we have an EVPN Instance (MAC-VRF) to which we have given L2VNI 10000. The RT associated with the EVI is AS-specific, with the auto-derived service ID (L2VNI) as the local administrator (AS:L2VNI = 65000:10000). The attached RD, in turn, is auto-derived from the IP address bound to interface NVE1 and the VLAN ID added to the base number 32767 (32767 + 10 = 32777).

Tenant System (TS-A1) in Figure 1-2 is connected to VLAN 10 via an Attachment Circuit (AC), Ethernet 1/2 (an AC can be either a physical or logical interface). Leaf-101, serving as a VTEP device, must employ local MAC address learning from the data plane. For instance, when TS-A1 goes live, it may validate the uniqueness of its IP address using Gratuitous ARP (GARP) to check if its IP address has already been assigned to another TS. Leaf-101 then records the source MAC address from the ingress GARP message's Ethernet header into the Bridge Table of VLAN 10. The Next-Hop for the MAC address is AC E1/2. In addition to saving the information into the Bridge Table, Leaf-101 initiates the MAC entry aging timer, which defaults to 1800 seconds (30 minutes).

Once the MAC address information is saved in the Bridge Table, the Layer-2 Forwarder (L2FWDER) component replicates this information into the EVI's L2RIB as a Local route. Mapping a VLAN to an EVI is done under the Layer 2 VLAN configuration. We are using the VLAN-Based Service Interface solution with a one-to-one mapping between VLAN and EVI, so there is only one VLAN per EVI. 
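On NX-OS, the VLAN-Based Service Interface mapping described above boils down to associating the VLAN with the VN segment, matching the output of Example 1-1:

```
vlan 10
  vn-segment 10000
```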

In addition to MAC address information, the VTEP device Leaf-101 learns IP address details from the ingress GARP message sent by TS-A1 and stores this information in the tenant-specific ARP table. It is important to note that the tenant must have a VRF Context (IP-VRF) configured. Furthermore, a Layer 3 VLAN interface is necessary for the VLAN (Broadcast Domain); without one, there is no ARP table. The Host Mobility Manager (HMM) detects the ARP table update event and subsequently replicates the ARP entry to the tenant-specific local host database. Following this, the HMM updates the L2RIB by binding the IP address to the MAC address and adding the L3VNI assigned to the IP-VRF to the TS-A1 MAC address entry.
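A hedged configuration sketch of the prerequisites mentioned above on NX-OS: a VRF Context for tenant TENANT77 (with L3VNI 10077, as used later in this text) and a Layer 3 VLAN interface for VLAN 10. The anycast gateway IP address is a hypothetical example value:

```
vrf context TENANT77
  vni 10077
  rd auto
  address-family ipv4 unicast
    route-target both auto
    route-target both auto evpn
!
interface Vlan10
  no shutdown
  vrf member TENANT77
  ip address
  fabric forwarding mode anycast-gateway
```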

HMM also encodes the IP address of TS-A1 into the IP-VRF as a tenant-specific local IP route. It's important to note that host routes are utilized for inter-VN routing and are not advertised as an IP Prefix Route (EVPN RT 5).

Next, Leaf-101 constructs a BGP routing entry in the BGP Loc-RIB describing an NLRI related to TS-A1. Two BGP Route Target (RT) path attributes are assigned to the BGP routing table entry: RT 65000:10000 (MAC-VRF) for the MAC address and RT 65000:10077 (IP-VRF) for the IP address. By utilizing these RTs, remote VTEP devices can import the received EVPN NLRI into their BGP tables. 

The EVPN Instance with L2VNI 10000 is a member VNI of interface NVE1. The primary task of the NVE1 interface is to encapsulate egress packets with VXLAN headers and remove the encapsulation from ingress frames/packets. The BGP Extended Community Encapsulation type VXLAN (type 8) is set based on the interface NVE1 settings.

The system MAC, in turn, belongs to VTEP device Leaf-101. But why do we need a system MAC address extended community? VXLAN is a “MAC in IP/UDP” encapsulation model. Since inter-VN packets, routed over an IRB service interface, do not utilize the MAC address of the Tenant System, the NVE’s system MAC address is used as the source MAC address in the inner Ethernet header. Therefore, the IRB service interface must have a Layer 2 VLAN and a non-IP address Layer 3 VLAN interface. This configuration allows the local VTEP to use the system MAC address as the source in the inner Ethernet frame. The remote VTEP, in turn, uses it as a destination MAC address in the inner Ethernet frame.

The MP_REACH_NLRI Path Attribute carries the EVPN Network Layer Reachability Information (NLRI) about TS-A1. The next hop is the IP address of interface NVE1, and the EVPN Route Type is MAC Advertisement Route (RT 2). The Route Distinguisher's global administrator is the BGP Router-ID, and its local administrator is 32777. The MAC and IP addresses, as well as the VNIs, are taken from the L2RIB. From the BGP Loc-RIB, the NLRI information is sent through the BGP Policy Engine to the Adj-RIB-Out and on to the BGP peers, the spine switches, which, after processing the received BGP Update message, forward the information to the other VTEP switches.

The upcoming post details the local VTEP operation and explains how the remote VTEP processes the received BGP Update message.

Figure 1-2: BGP EVPN with VXLAN –Building Blocks.

Wednesday 3 January 2024

BGP EVPN Part II: Network Virtualization Overlay with BGP EVPN and VXLAN - Introduction

In Figure 1-1, we have a routed 3-stage Clos fabric, where all inter-switch links are routed point-to-point Layer 3 connections. As explained in previous sections, a switched Layer 2 network with an STP control plane allows only one active path per VLAN/instance and VLAN-based traffic load sharing. Due to the Equal-Cost Multi-Path (ECMP) routing supported by routing protocols, a routed Clos fabric enables flow-based traffic load balancing using all links from the ingress leaf via the spine layer down to the egress leaf. The convergence time of routing protocols is faster and less disruptive than an STP topology change. Besides, a routed Clos fabric architecture allows horizontal bandwidth scaling: we can increase the overall bandwidth capacity between switches by adding a new spine switch. Dynamic routing protocols also allow lossless In-Service Software Upgrades (ISSU) on standalone and virtualized devices by advertising infinite metrics or withdrawing all advertised routes.

But how do we stretch layer-2 segments over layer-3 infrastructure in a Multipoint-to-Multipoint manner, allowing tenant isolation and routing between segments? The answer relies on the Network Virtualization Overlay (NVO3) framework. 

BGP EVPN, as an NVO3 control plane protocol, uses EVPN Route Types (RT) in Update messages for identifying the type of the advertised EVPN NLRIs (Network Layer Reachability Information). Besides publishing prefix information with RT-5 (IP Prefix Route), BGP EVPN uses RT-2 (MAC Advertisement Route) for publishing hosts' MAC/IP address NLRI. In addition to these two fundamental route types, BGP EVPN can create a shared delivery tree for Layer 2 Broadcast traffic, such as ARP Request messages, without using a multicast-enabled underlay network. Besides, BGP EVPN allows us to implement a Tenant Routed Multicast (TRM) solution. We can use a vPC for device multihoming, but BGP EVPN has a built-in ESI multihoming option utilizing RT-1 (Ethernet A-D Route) and RT-4 (Ethernet Segment Route). This solution uses proactive control plane learning, where leaf switches publish reachability information when a host joins the network.

Virtual eXtensible LAN (VXLAN) encapsulation allows switches to add a Layer-2 Virtual Network Identifier (L2VNI) for Intra-VLAN traffic and an L3VNI for tenant-specific/VRF Inter-VLAN connections. The VXLAN Group Policy Option (VXLAN-GPO) enables leaf switches to add Group Policy information to data packets. 

Finally, adding a new Layer-2 segment to a BGP EVPN fabric requires configuration only in leaf switches. We don’t have to touch Inter-Switch links or spine switches, like we must do in Layer-2 switched infrastructure. 

In the upcoming chapter, we delve deeper into the implementation and advantages of the BGP EVPN with VXLAN data center fabric solution.

Figure 1-1: Routed 3-Stage Clos Fabric with BGP EVPN and VXLAN.

Monday 1 January 2024

BGP EVPN Part-I: Challenges in Traditional Switched Datacenter Networks

Inefficient Link Utilization

The default Layer 2 Control Plane protocol in Cisco NX-OS is Rapid Per-VLAN Spanning Tree Plus (Rapid PVST+), which runs an 802.1w standard Rapid Spanning Tree Protocol (RSTP) instance per VLAN. Rapid PVST+ builds a VLAN-specific, loop-free Layer 2 data path from the STP root switch to all non-root switches. Spanning Tree Protocol, no matter which mode we use, allows only one active path at a time and blocks all redundant links. One general solution for activating all inter-switch links is placing the STP root for odd and even VLANs on different switches. However, STP allows only VLAN-based traffic load balancing.

CPU and Memory Usage

After building a loop-free data path, switches running Rapid PVST+ monitor the state of the network using Spanning Tree instance-based Bridge Protocol Data Units (BPDU). By default, each switch sends instance-based BPDU messages from its designated ports at two-second intervals. If we have 2000 VLANs, all switches must process 2000 BPDUs. To reduce the CPU and memory consumption caused by BPDU processing, we can use Multiple Spanning Tree – MSTP (802.1s), where VLANs are associated with instances. For example, we can attach VLANs 1-999 to one instance and VLANs 1000-1999 to another, reducing the instance count from 2000 to 2. Though MSTP reduces the CPU burden, it does not allow flow-based frame load balancing. Besides, MSTP brings increased complexity, especially when we must connect an MSTP region to a non-MSTP region.
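As a sketch of the VLAN-to-instance mapping described above, an MSTP configuration on NX-OS could look roughly like this; the region name and revision number are illustrative values:

```
spanning-tree mode mst
!
spanning-tree mst configuration
  name DC1
  revision 1
  instance 1 vlan 1-999
  instance 2 vlan 1000-1999
```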

Bandwidth Scaling

A Multi-Chassis Link Aggregation Group (MLAG) allows the bundling of separate Ethernet links from distinct devices into a unified logical port (Port-Channel). This solution enables a flow-based traffic load sharing. Cisco's solution for NX-OS is a virtual Port Channel (vPC). 

Scaling up Port-Channel bandwidth requires adding new links to the bundle on both vPC pair switches. For example, to increase a 20 Gbps Port-Channel, we must add a 10 Gbps interface to both vPC pair spine switches and the upstream leaf switch. This link bundling gives us a 40 Gbps logical interface. So, what is the problem with this approach? Let us say that the Port-Channel utilization is approximately 30 Gbps. If one of the vPC devices fails, the available Port-Channel bandwidth decreases from 40 Gbps to 20 Gbps. This over-subscription could lead to link congestion and packet drops.

In-Service Software Upgrade

We can remove devices from the data path in a Layer 3 network without disturbing the data plane. For example, Cisco's vPC enables In-Service Software Upgrade (ISSU) or hardware maintenance tasks using the Graceful Insertion and Removal (GIR) method. We may remove an NX-OS switch from service using System Maintenance Mode. In reaction to this mode, the switch advertises its OSPF point-to-point links with the infinite metric 65535, withdraws all BGP routes, and shuts down the vPC peer-link, keepalive link, and member ports. Besides, in a BGP EVPN/VXLAN fabric, the switch also shuts down the loopback interface associated with the NVE interface. However, the process does not disable the loopback interfaces used by OSPF and BGP, so OSPF adjacencies and BGP peerings stay up. When the device has completed the protocol isolation processes, we can safely do the maintenance tasks without disturbing application traffic. 
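On NX-OS, entering and leaving the maintenance mode described above is done with a single configuration command pair (a sketch; behavior details vary by platform and profile):

```
Leaf-101(config)# system mode maintenance
! ... perform the maintenance tasks ...
Leaf-101(config)# no system mode maintenance
```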

This kind of signaling is not supported in the Layer 2 switched networks, and maintenance tasks requiring device removal will cause some level of data flow disruptions.

Flood Reduction

Hosts resolve each other's MAC-IP address-binding information using the Address Resolution Protocol (ARP). ARP Requests are transmitted as Layer 2 Broadcast messages. When switches receive these requests, they learn the requester's MAC address from the source MAC address in the incoming frame. Because of the Broadcast destination MAC address, switches flood ARP requests out of interfaces where the VLAN is permitted. Due to this reactive data plane learning process, all switches must process ARP messages.

When a non-edge port state on a switch transitions from Discarding/Learning to Forwarding, the switch recognizes this change as a Spanning Tree topology event and initiates a TcWhile timer (twice the STP Hello time). While the TcWhile timer runs, the switch sets the TC (Topology Change) bit in each Bridge Protocol Data Unit (BPDU) message. In response to receiving a BPDU message with the TC bit set, switches start their own TcWhile timer, set the TC bit in their outgoing BPDU messages, and clear their MAC address tables. This process propagates throughout the switching infrastructure. While hosts may retain the MAC-IP address binding information, switches must re-learn MAC addresses through a flood-and-learn process.


Most modern-day applications utilize server virtualization, commonly known as Virtual Machines (VMs). The VM live migration process allows for the seamless transfer of a running VM from one host to another without causing significant downtime or disrupting the services running on that VM. From a network perspective, VM live migration requires the availability of Layer 2 segments and gateways through the data center switches. 

From an operational standpoint, VLANs must be configured not only on each switch but also on every inter-switch link in a switched network. Automation can reduce deployment time and the likelihood of configuration errors. However, a notable drawback is the necessity of deploying configuration changes to the spine switches as well.

VLAN-Id Limitation

Layer 2 segmentation with any Spanning Tree Protocol variant relies on the 12-bit VLAN identifier in the 802.1Q tag of the Ethernet frame, giving 2^12 = 4096 VLAN IDs (4094 of them usable, since IDs 0 and 4095 are reserved). This limitation can pose challenges in environments requiring more segments.
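The arithmetic behind the limitation, contrasted with the 24-bit VXLAN VNI:

```python
# 802.1Q carries a 12-bit VLAN ID; the VXLAN header carries a 24-bit VNI.
vlan_bits, vni_bits = 12, 24

vlan_ids = 2 ** vlan_bits        # 4096 identifier values
usable_vlans = vlan_ids - 2      # VLAN IDs 0 and 4095 are reserved
vni_ids = 2 ** vni_bits          # VXLAN segment identifiers

print(vlan_ids, usable_vlans)    # 4096 4094
print(vni_ids)                   # 16777216
```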

MAC Address Overlapping

In a multi-tenant environment where server administrators are allowed to allocate MAC addresses for application servers manually, there is a risk of MAC address conflicts. Such conflicts can lead to application disruptions and usability issues.
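As a preview of how a VXLAN fabric sidesteps this conflict: MAC learning is scoped per segment, with entries keyed by VNI, so identical addresses in different tenants' segments coexist. A minimal sketch; the VNIs, MAC address, port, and VTEP address are all made up for illustration:

```python
# Sketch: per-segment MAC learning, keyed by (VNI, MAC).
mac_table = {}   # (vni, mac) -> next hop (local port or remote VTEP)

def learn(vni, mac, next_hop):
    mac_table[(vni, mac)] = next_hop

# Two tenants happen to use the same manually assigned MAC address:
learn(10010, "00:50:56:00:00:01", "Eth1/1")          # tenant A, local port
learn(20010, "00:50:56:00:00:01", "VTEP 10.0.0.2")   # tenant B, remote VTEP

print(len(mac_table))                              # 2 - both entries coexist
print(mac_table[(10010, "00:50:56:00:00:01")])     # Eth1/1
```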

Next, How BGP EVPN with VXLAN Responds to these Challenges.

Wednesday 22 November 2023

Cisco Intent-Based Networking: Part II - Cisco ISE and Catalyst Center Migration

Cisco Identity Services Engine (ISE) and Catalyst Center Integration

Before you can add Cisco ISE to Catalyst Center's global network settings as an Authentication, Authorization, and Accounting (AAA) server for clients and manage the Group-Based Access policy implemented in Cisco ISE, you must integrate the two systems.

This post starts by explaining how to activate the pxGrid service on ISE, which ISE uses for pushing policy changes to Catalyst Center (steps 1a-f). Next, it illustrates how to enable External RESTful Services (ERS) read/write on Cisco ISE, which allows external clients to perform Create, Read, Update, and Delete (CRUD) operations on ISE. Catalyst Center uses ERS for pushing configuration to ISE. After starting the pxGrid service and enabling ERS, this post discusses how to initiate the connection between ISE and Catalyst Center (steps 2a-h and 3a-b). The last part depicts the Group-Based Access Control migration process (steps 4a-b).
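Catalyst Center drives these CRUD operations itself, but the same ERS interface can be exercised directly. A hedged sketch, assuming ERS on its default TCP port 9060 and the Security Group Tag (`sgt`) resource; the host name and credentials are placeholders, and the request is only constructed, not sent:

```python
# Sketch: building an ISE ERS read request with the standard library.
# ISE_HOST and the credentials below are placeholders, not real values.
import base64
import urllib.request

ISE_HOST = "ise.example.com"                      # placeholder host name
url = f"https://{ISE_HOST}:9060/ers/config/sgt"   # ERS listens on TCP 9060
token = base64.b64encode(b"ers-admin:secret").decode()  # ERS uses HTTP Basic auth

req = urllib.request.Request(url, headers={
    "Accept": "application/json",
    "Authorization": f"Basic {token}",
})
print(req.full_url)   # https://ise.example.com:9060/ers/config/sgt
# urllib.request.urlopen(req)   # sending requires a reachable ISE node
```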

Step-1: Start the pxGrid Service and Enable ERS on ISE

Open the Administration tab in the main view of Cisco ISE. Then, under the System tab, select the Deployment option. The Deployment Nodes section displays the Cisco ISE node along with its personas. In Figure 1-3, the standalone ISE node comprises three personas: Policy Administration Node (PAN), Monitoring and Troubleshooting Node (MnT), and Policy Service Node (PSN). To initiate the pxGrid service, click the ISE standalone node (1d) and check the pxGrid checkbox (1e) in the General Settings window. After saving the changes, pxGrid appears in the persona section alongside PAN, PSN, and MnT.

A brief note about Cisco ISE terminology: The term "Node" refers to an ISE node that may have one or multiple personas (PAN, PSN, MnT, pxGrid). These personas define the services provided by the node. For instance, pxGrid facilitates the distribution of context-specific data from Cisco ISE to various network systems, including ISE ecosystem partner systems and different Cisco platforms such as Catalyst Center.

To enable Catalyst Center to push configurations to ISE, activate the ERS in the Settings section under the System tab. 

Step-2: Add Cisco ISE on Catalyst Center

In Catalyst Center, you can reach the same configuration window through various paths. In Figure 1-3, we begin by clicking the System icon and selecting the Settings option. Then, under External Services, choose Authentication and Policy Servers. First, enter the server IP address, then the Shared Secret. Note that the Shared Secret defines the password for the AAA configuration that Catalyst Center pushes to network devices using the AAA service. The Username and Password fields are the credentials used for accessing the ISE Graphical User Interface (GUI) and Command Line Interface (CLI); the GUI and CLI passwords must be identical. In addition, enter the Fully Qualified Domain Name (FQDN) and the Subscriber Name. After applying these changes, Catalyst Center performs the following actions: 2e) initiates the CLI and GUI connections to ISE, 2f) starts the certificate import/export process to establish a trusted and secure connection with ISE, 2g) discovers the primary and secondary PAN nodes as well as the pxGrid nodes, and 2h) connects to the pxGrid service.

To finalize the connection, accept the Catalyst Center connection request in ISE. Navigate to the pxGrid Services tab under the Cisco ISE Administration tab. In our example, DNAC is a pxGrid client awaiting approval from the ISE admin. Approve the connection by clicking the Approve button.

Step-3: Migrate Group-Based Access Control Policies to Catalyst Center

To utilize Catalyst Center as the administration point for Group-Based Access Control, you need to migrate the policies from Cisco ISE. Start the process by selecting the 'Group-Based Access Control' option under the Policy icon. Then, choose the 'Start Migration' hyperlink. Once the migration is complete, the policy matrix appears in the Policy tab. From there, you can define micro-segmentation rules between groups in Catalyst Center, which are then pushed to Cisco ISE over the REST API. The following section demonstrates how you can add Cisco ISE as an AAA service.

Figure 1-3: Integrating Cisco ISE and Catalyst Center.

Sunday 12 November 2023

Cisco Intent-Based Networking: Part I - Introduction


This chapter introduces Cisco's approach to Intent-Based Networking (IBN) through its centralized SDN controller, Cisco DNA Center, rebranded as Cisco Catalyst Center (from now on, I use the abbreviation C3 for Cisco Catalyst Center). We focus on a greenfield network installation, showing workflows, configuration parameters, and the relationships and dependencies between building blocks. The C3 workflow is divided into four main entities: 1) Design, 2) Policy, 3) Provision, and 4) Assurance, each having its own sub-processes. This chapter introduces the Design phase, focusing on the Network Hierarchy, Network Settings, and Network Profiles with Configuration Templates.

This post deprecates the previous post, "Cisco Intent-Based Networking: Part I, Overview."

Network Hierarchy

Network Hierarchy is a logical structure for organizing network devices. At the root of this hierarchy is the Global Area, where you establish your desired network structure. In our example, the hierarchy consists of four layers: Area (country - Finland), Sub-area (city - Joensuu), Building (JNS01), and Floor (JNS01-FLR01). Areas and Buildings indicate the location, while Floors provide environmental information relevant to wireless networks, such as floor type, measurements, and wall properties.

Network Settings

Network settings define device credentials (CLI, HTTP(S), SNMP, and NETCONF) required for accessing devices during the discovery process. Additionally, network settings describe global configurations (DHCP, DNS, NTP, AAA, and Telemetry) applied to devices during provisioning at a site. We also configure a global IP pool, which we can later break down into site-specific subnets.
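The global-pool-to-site-subnet breakdown is plain prefix arithmetic; a small illustration with Python's `ipaddress` module, where the prefixes are made up for this example:

```python
# Sketch: carving site-specific subnets out of a global IP pool,
# as done when reserving per-site pools from a global pool in C3.
import ipaddress

global_pool = ipaddress.ip_network("172.16.0.0/16")    # illustrative global pool
site_pools = list(global_pool.subnets(new_prefix=24))  # per-site /24 subnets

print(site_pools[0])    # 172.16.0.0/24
print(site_pools[1])    # 172.16.1.0/24
print(len(site_pools))  # 256
```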

To use the Cisco Identity Services Engine (ISE) for device/client AAA services (Authentication, Authorization, and Accounting), C3-ISE integration is required. To integrate ISE with C3, enable the pxGrid persona and External RESTful Services (ERS) in Cisco ISE. Subsequently, connect C3 to pxGrid as an XMPP client. As the final step, migrate the ISE Group-Based Access Control policies to C3. Through the ISE-C3 integration, you can utilize C3 not only as an AAA server but also for configuring Scalable Group Tag (SGT) policies between groups.

Configuration Templates and Network Profiles

Next, we build site- and device-type-specific configuration templates. As a first step, we create a Project, a folder for our templates. In Figure 1-1, we have a Composite template to which we attach two Regular templates. Regular templates include CLI configuration parameters and variables. Then, we create a Network Profile with which we associate our templates. In Figure 1-1, we have attached the Composite template to the Profile. We make the templates available to devices that we later provision to the site using a profile-to-site association. Note that we are using Day-N templates; Day-0 templates are for the Plug-and-Play provisioning process.

Figure 1-1: Design – Network Hierarchy, Global Network Settings, and Network Profiles.