Friday 19 July 2024

AI/ML Networking: Part-III: Basics of Neural Networks Training Process

Neural Network Architecture Overview

Deep Neural Networks (DNN) leverage various architectures for training, with one of the simplest and most fundamental being the Feedforward Neural Network (FNN). Figure 2-1 illustrates our simple, three-layer FNN.

Input Layer: 

The first layer receives the initial data, consisting of parameters X1, X2, and X3. Each neuron in the input layer passes these data parameters to the next hidden layer.

Hidden Layer: 

The neurons in the hidden layer calculate a weighted sum of the input data, which is then passed through an activation function. In our example, we are using the Rectified Linear Unit (ReLU) activation function. These calculations produce an activation value for each neuron. The activation value is the transformed representation of the input data received from the input layer, and it is published to the next layer.

Output Layer: 

Neurons in this layer calculate the weighted sum in the same manner as neurons in the hidden layer, but the result of the activation function is the final output.


The process described above is known as the forward pass operation. Once the forward pass is completed, the result is passed through a loss function, where the received value is compared to the expected value. The difference between these two values triggers the backpropagation process. The loss calculation is the initial phase of the backpropagation process. During backpropagation, the network fine-tunes the weight values, neuron by neuron, from the output layer back through the hidden layers. The neurons in the input layer do not participate in the backpropagation process because they do not have weight values to be adjusted.


After the backpropagation process, a new iteration of the forward pass begins from the first hidden layer. This loop continues until the received value is close enough to the expected value, indicating that the training is complete.

Figure 2-1: Deep Neural Network Basic Structure and Operations.

Forwarding Pass 


Next, let's examine the operation of a Neural Network in more detail. Figure 2-2 illustrates a simple, three-layer Feedforward Neural Network (FNN) data model. The input layer has two neurons, H1 and H2, each receiving one input data value: a value of one (1), input X1, is fed to neuron H1, and a value of zero (0), input X2, is fed to neuron H2. The neurons in the input layer do not calculate a weighted sum or an activation value but instead pass the data to the next layer, which is the first hidden layer.

The hidden layer in our example consists of two neurons. These neurons use the ReLU activation function to calculate the activation value. During the initialization phase, the weight values for these neurons are assigned using the He Initialization method, which is often used with the ReLU function. The He Initialization method calculates the variance as 2/n, where n is the number of neurons in the previous layer. In this example, with two input neurons, this gives a variance of 1 (= 2/2). The weights are then drawn from a normal distribution with mean zero and standard deviation √variance, which in this case is ~N(0, 1). Basically, this means that the randomly generated weight values are centered around zero with a standard deviation of one.
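As a quick illustration (a minimal sketch assuming NumPy is available), He Initialization for a layer with n inputs can be written as:

import numpy as np

def he_init(n_in: int, n_out: int, seed: int = 42) -> np.ndarray:
    """Draw weights from a normal distribution with mean 0 and std sqrt(2 / n_in)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / n_in)              # variance = 2 / n_in
    return rng.normal(loc=0.0, scale=std, size=(n_out, n_in))

# Two input neurons feeding the hidden layer -> variance 2/2 = 1, std = 1
print(he_init(n_in=2, n_out=2))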

In Figure 2-2, the weight value for neuron H3 in the hidden layer is 0.5 for both input sources X1 (input data 1) and X2 (input data 0). Similarly, for the hidden layer neuron H4, the weight value is 1 for both input sources X1 (input data 1) and X2 (input data 0). Neurons in the hidden and output layers also have a bias variable. If the input to a neuron is zero, the output would also be zero if there were no bias. The bias ensures that a neuron can still produce a meaningful output even when the input is zero (i.e., the neuron is inactive). Neurons H4 and O5 have a bias value of 0.5, while neuron H3 has a bias value of 0 (I am using zero to simplify the calculation).

Let's start the forward pass process from neuron H3 in the hidden layer. First, we calculate the weighted sum using the formula below, where Z3 represents the weighted sum of the inputs. Here, Xn is the actual input data value received from the input layer's neuron, and Wn is the weight associated with that particular input neuron.

The weighted sum calculation (Z3) for neuron H3:

Z3 = (X1 ⋅ W31) + (X2 ⋅ W32) + b3
Given:
Z3 = (1 ⋅ 0.5) + (0 ⋅ 0.5) + 0
Z3 = 0.5 + 0 + 0
Z3 = 0.5

To get the activation value a3 (shown as H3 = 0.5 in the figure), we apply the ReLU function. The ReLU function outputs zero (0) if the calculated weighted sum Z is less than or equal to zero; otherwise, it outputs the value of the weighted sum Z itself.

The activation value a3 for H3 is:

ReLU (Z3) = ReLU (0.5) = 0.5

The weighted sum calculation for neuron H4:

Z4 = (X1 ⋅ W41) + (X2 ⋅ W42) + b4
Given:
Z4 = (1 ⋅ 1) + (0 ⋅1) + 0.5
Z4 = 1 + 0 + 0.5
Z4 = 1.5

The activation value using ReLU for Z4 is:

ReLU (Z4) = ReLU (1.5) = 1.5

 


Figure 2-2: Forwarding Pass on Hidden Layer.

After neurons H3 and H4 publish their activation values to neuron O5 in the output layer, O5 calculates the weighted sum Z5 for its inputs with weights W53 = 1 and W54 = 1. Using Z5, it calculates the output using the ReLU function. The difference between the received output value (Yr) and the expected value (Ye) triggers the backpropagation process. In our example, Yr − Ye = 0.5.
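The whole forward pass can be verified with a few lines of plain Python; the weights and biases below follow the worked calculations above, and the code is only an illustrative sketch:

def relu(z: float) -> float:
    # ReLU returns 0 for z <= 0, otherwise z itself
    return max(0.0, z)

x1, x2 = 1.0, 0.0                      # input data

# Hidden layer: H3 (weights 0.5, bias 0) and H4 (weights 1, bias 0.5)
h3 = relu(x1 * 0.5 + x2 * 0.5 + 0.0)   # Z3 = 0.5 -> H3 = 0.5
h4 = relu(x1 * 1.0 + x2 * 1.0 + 0.5)   # Z4 = 1.5 -> H4 = 1.5

# Output layer: W53 = 1, W54 = 1, b5 = 0.5
yr = relu(h3 * 1.0 + h4 * 1.0 + 0.5)   # Z5 = 2.5 -> Yr = 2.5
ye = 2.0                               # expected value
print(h3, h4, yr, yr - ye)             # 0.5 1.5 2.5 0.5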

Backpropagation process

The loss function measures the difference between the predicted output and the actual expected output. The loss function value indicates how well the neural network is performing. A high loss value means the network's predictions are far from the actual values, while a low loss value means the predictions are close.

After calculating the loss, backpropagation is initiated to minimize this loss. Backpropagation involves calculating the gradient of the loss function with respect to each weight and bias in the network. This step is crucial for adjusting the weights and biases to reduce the loss in subsequent forwarding pass iterations.

Loss function is calculated using the formula below:

Loss (L) = (H3 x W53 + H4 x W54 + b5 – Ye)^2
Given:
L = (0.5 x 1 + 1.5 x 1 + 0.5 - 2)^2
L = (0.5 + 1.5 + 0.5 - 2)^2
L = 0.5^2
L = 0.25
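The same arithmetic can be checked in a couple of lines of Python (continuing the sketch above):

h3, h4, w53, w54, b5, ye = 0.5, 1.5, 1.0, 1.0, 0.5, 2.0
loss = (h3 * w53 + h4 * w54 + b5 - ye) ** 2   # (2.5 - 2)^2
print(loss)                                   # 0.25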


Figure 2-3: Forwarding Pass on Output Layer.

The result of the loss function is then fed into the gradient calculation process, where we compute the gradient of the loss function with respect to each weight and bias in the network. The gradient calculation result is then used to fine-tune the old weight values. The Eta hyper-parameter η (the learning rate) controls the step size during weight updates in the backpropagation process, balancing the speed of convergence with the stability of training. In our example, we are using a learning rate of 1/100 = 0.01. The term hyper-parameter refers to a setting that is chosen before training rather than learned from the data, and it affects both the training process and the final result.

First, we compute the partial derivative of the loss function (gradient calculation) with respect to the old weight values. The following example shows the gradient calculation for weight W53. The same computation applies to W54 and b5.

Gradient Calculation:

∂L/∂W53 = 2 x H3 x (Yr – Ye)

Given:
∂L/∂W53 = 2 x 0.5 x (2.5 - 2)
∂L/∂W53 = 1 x 0.5
∂L/∂W53 = 0.5

New weight value calculation.

W53 (new) = W53(old) – η x ∂L/∂W53
Given:
W53 (new) = 1 – 0.01 x 0.5
W53 (new) = 0.995
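A minimal sketch of this single update step, using the example's numbers:

h3, yr, ye = 0.5, 2.5, 2.0       # activation, received and expected outputs
w53, eta = 1.0, 0.01             # old weight and learning rate

grad_w53 = 2 * h3 * (yr - ye)    # dL/dW53 = 2 * H3 * (Yr - Ye) = 0.5
w53_new = w53 - eta * grad_w53   # 1 - 0.01 * 0.5 = 0.995
print(grad_w53, w53_new)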

 


Figure 2-4: Backpropagation - Gradient Calculation and New Weight Value Computation.


Figure 2-5 shows the formulas for calculating the new bias b5. The process is the same as the one used for updating the weight values.



Figure 2-5: Backpropagation - Gradient Calculation and New Bias Computation.

After updating the weights and biases, the backpropagation process moves to the hidden layer. Gradient computation in the hidden layer is more complex because the loss function does not directly contain the hidden layer's weights; it only includes weights from the output layer, as you can see from the loss function formula below:

Loss (L) = (H3 x W53 + H4 x W54 + b5 – Ye)^2

The formula for computing the weights and biases of neurons in the hidden layers uses the chain rule. The mathematical formula is shown below, but the actual computation is beyond the scope of this chapter.

∂L/∂W31 = (∂L/∂H3) x (∂H3/∂W31)
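To make the chain rule concrete, the sketch below expands ∂L/∂W31 numerically for the example network. It adds one intermediate factor, the derivative of ReLU, which equals 1 here because Z3 = 0.5 is positive; the expansion is only an illustration of the principle, not the full computation:

x1 = 1.0                               # input feeding W31
h3_grad = 2 * (2.5 - 2.0) * 1.0        # dL/dH3 = 2 * (Yr - Ye) * W53 = 1.0
relu_grad = 1.0                        # dH3/dZ3 = 1 because Z3 = 0.5 > 0
z3_grad = x1                           # dZ3/dW31 = X1

dL_dW31 = h3_grad * relu_grad * z3_grad
print(dL_dW31)                         # 1.0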

After the backpropagation process is completed, the next iteration of the forward pass starts. This loop continues until the received result is close enough to the expected result.

If the size of the input data exceeds the GPU’s memory capacity or if the computing power of one GPU is insufficient for the data model, we need to decide on a parallelization strategy. This strategy defines how the training workload is distributed across several GPUs. Parallelization impacts network load if we need more GPUs than are available on one server. Dividing the workload among GPUs within a single GPU-server or between multiple GPU-servers triggers synchronization of calculated gradients between GPUs. When the gradient is calculated, the GPUs synchronize the results and compute the average gradient, which is then used to update the weight values.
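Conceptually, the synchronization step boils down to averaging each weight's gradient across all GPUs before the update is applied. A toy sketch with hypothetical per-GPU gradient values (no real communication library involved):

# Gradients for the same weight computed by four GPUs on different mini-batches
local_grads = [0.50, 0.42, 0.61, 0.47]          # hypothetical values

avg_grad = sum(local_grads) / len(local_grads)  # the "all-reduce" average

eta, w_old = 0.01, 1.0
w_new = w_old - eta * avg_grad                  # every GPU applies the same update
print(avg_grad, w_new)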

The upcoming chapter introduces pipeline parallelization and synchronization processes in detail. We will also discuss why a lossless connection is required for AI/ML workloads.



Tuesday 16 July 2024

AI/ML Networking: Part-II: Introduction of Deep Neural Networks

Machine Learning (ML) is a subset of Artificial Intelligence (AI). ML is based on algorithms that allow learning, predicting, and making decisions based on data rather than pre-programmed tasks. ML leverages Deep Neural Networks (DNNs), which have multiple layers, each consisting of neurons that process information from the preceding layers as part of the training process. Large Language Models (LLMs), such as OpenAI’s GPT (Generative Pre-trained Transformers), utilize ML and Deep Neural Networks.

For network engineers, it is crucial to understand the fundamental operations and communication models used in ML training processes. To emphasize the importance of this, I quote the Chinese philosopher and strategist Sun Tzu, who lived around 600 BCE, from his work The Art of War.

If you know the enemy and know yourself, you need not fear the result of a hundred battles.

We don’t have to be data scientists to design a network for AI/ML, but we must understand the operational fundamentals and communication patterns of ML. Additionally, we must have a deep understanding of network solutions and technologies to build a lossless and cost-effective network for enabling efficient training processes.

In the upcoming two posts, I will explain the basics of: 

a) Data Models: Layers and neurons, forward and backward passes, and algorithms. 

b) Parallelization Strategies: How training times can be reduced by dividing the model into smaller entities, batches, and even micro-batches, which are processed by several GPUs simultaneously.

The number of parameters, the selected data model, and the parallelization strategy affect the network traffic that crosses the data center switch fabric.

After these two posts, we will be ready to jump into the network part. 

At this stage, you may need to read (or re-read) my previous post about Remote Direct Memory Access (RDMA), a solution that enables GPUs to write data from local memory to remote GPUs' memory.



Thursday 27 June 2024

AI/ML Networking Part I: RDMA Basics

Remote Direct Memory Access - RDMA Basics


Introduction

Remote Direct Memory Access (RDMA) architecture enables efficient data transfer between Compute Nodes (CN) in a High-Performance Computing (HPC) environment. RDMA over Converged Ethernet version 2 (RoCEv2) utilizes a routed IP Fabric as a transport network for RDMA messages. Due to the nature of RDMA packet flow, the transport network must provide lossless, low-latency packet transmission. The RoCEv2 solution uses UDP in the transport layer, which does not handle packet losses caused by network congestion (buffer overflow on switches or on a receiving Compute Node). To avoid buffer overflow issues, Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) are used as signaling mechanisms to react to buffer threshold violations by requesting a lower packet transfer rate.
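As a rough illustration of the idea (the threshold values and function name are purely hypothetical, not taken from any real switch operating system), the signaling decision can be sketched as follows:

ECN_THRESHOLD_KB = 400   # hypothetical: start marking packets
PFC_THRESHOLD_KB = 800   # hypothetical: start pausing the sender

def congestion_action(queue_depth_kb: int) -> str:
    """Return the signaling action for the current egress queue depth."""
    if queue_depth_kb >= PFC_THRESHOLD_KB:
        return "send PFC pause frame for the lossless priority class"
    if queue_depth_kb >= ECN_THRESHOLD_KB:
        return "mark packet ECN CE; receiver asks the sender to slow down"
    return "forward normally"

for depth in (100, 500, 900):
    print(depth, "KB ->", congestion_action(depth))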

Before moving to RDMA processes, let's take a brief look at our example Compute Nodes. Figure 1-1 illustrates our example Compute Nodes (CN). Both Client and Server CNs are equipped with one Graphics Processing Unit (GPU). The GPU has a Network Interface Card (NIC) with one interface. Additionally, the GPU has Device Memory Units to which it has a direct connection, bypassing the CPU. In real life, a CN may have several GPUs, each with multiple memory units. Inter-GPU communication within the CN happens over high-speed NVLinks. The connection to remote CNs occurs over the NIC, which has at least one high-speed uplink port/interface.

Figure 1-1 also shows the basic idea of a stacked Fine-Grained 3D DRAM (FG-DRAM) solution. In our example, there are four vertically interconnected DRAM dies, each divided into eight Banks. Each Bank contains four memory arrays, each consisting of rows and columns that contain memory units (transistors whose charge indicates whether a bit is set to 1 or 0). FG-DRAM enables cross-DRAM grouping into Ranks, increasing memory capacity and bandwidth.

The upcoming sections introduce the required processes and operations when the Client Compute Node wants to write data from its device memory to the Server Compute Node’s device memory. I will discuss the design models and requirements for lossless IP Fabric in later chapters.



Figure 1-1: Fine-Grained DRAM High-Level Architecture.

Friday 24 May 2024

BGP EVPN Fabric - Remote Leaf MAC Learning Process

Remote VTEP Leaf-102: Low-Level Control Plane Analysis


In this section, we will first examine the update process of the BGP tables on the VTEP switch Leaf-102 when it receives a BGP Update message from Spine-11. After that, we will go through the update processes for the MAC-VRF and the MAC Address Table. Finally, we will examine how the VXLAN manager on Leaf-102 learns the IP address of Leaf-10's NVE interface and creates a unidirectional NVE peer record in the NVE Peer Database based on this information.


Remote Learning: BGP Processes

We have configured switches Leaf-101 and Leaf-102 as Route Reflector Clients on the Spine-11 switch. Spine-11 has stored the content of the BGP Update message sent by Leaf-101 in the neighbor-specific Adj-RIB-In of Leaf-101. Spine-11 does not import this information into its local BGP Loc-RIB because we have not defined a BGP import policy. Since Leaf-102 is an RR Client, the BGP process on Spine-11 copies this information into the neighbor-specific Adj-RIB-Out table for Leaf-102 and sends it to Leaf-102 in a BGP Update message. The BGP process on Leaf-102 moves the received information from the Adj-RIB-In table into the BGP Loc-RIB according to the import policy of EVPN Instance 10010 (import RT 65000:10010). During the import process, the Route Distinguisher value is also modified to match the configuration of Leaf-102: the RD value changes from 192.168.10.101:32777 (received RD) to 192.168.10.102:32777 (local RD).

Figure 3-13: MAC Address Propagation Process – From BGP Adj-RIB-Out.

Tuesday 14 May 2024

EVPN Instance Deployment Scenario 1: L2-Only EVPN Instance

In this scenario, we are building a protected Broadcast Domain (BD), which we extend to the VXLAN Tunnel Endpoint (VTEP) switches of the EVPN Fabric, Leaf-101 and Leaf-102. Note that the VTEP operates in the Network Virtualization Edge (NVE) role for the VXLAN segment. The term NVE refers to devices that encapsulate data packets to transport them over routed IP infrastructure. Another example of an NVE device is the MPLS Provider Edge (MPLS-PE) router at the edge of the MPLS network, doing MPLS labeling. The term “Tenant System” (TS) refers to a physical host, virtual machine, or an intra-tenant forwarding component attached to one or more Tenant-specific Virtual Networks. Examples of TS forwarding components include firewalls, load balancers, switches, and routers. 

We begin by configuring L2 VLAN 10 on Leaf-101 and Leaf-102 and associating it with vn-segment 10010. From the NVE perspective, this constitutes an L2-Only network segment, meaning we do not configure an Anycast Gateway (AGW) for the segment, and it does not have any VRF association.

Next, we deploy a Layer 2 EVPN Instance (EVI) with VXLAN Network Identifier (VNI) 10010. We utilize the 'auto' option to generate the Route Distinguisher (RD) and the Route Target (RT) import and export values for the EVI. The RD value is derived from the BGP Router ID and the VLAN Identifier (VLAN 10) associated with the EVI, the latter added to the base value 32767 (e.g., 192.168.100.101:32777). The use of the VLAN ID as part of the automatically generated RD value is the reason why the VLAN is configured before the EVPN Instance. Similarly, the RT values are derived from the BGP ASN and the VNI (e.g., 65000:10010).
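The derivation logic can be illustrated with a short sketch (a simplified model of the arithmetic, not NX-OS code; the router ID below is just the example value used above):

def auto_rd(router_id: str, vlan_id: int) -> str:
    # RD local administrator = base value 32767 + VLAN ID
    return f"{router_id}:{32767 + vlan_id}"

def auto_rt(asn: int, vni: int) -> str:
    # RT = BGP ASN : VNI
    return f"{asn}:{vni}"

print(auto_rd("192.168.100.101", 10))   # 192.168.100.101:32777
print(auto_rt(65000, 10010))            # 65000:10010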

As the final step for EVPN Instance deployment, we add EVI 10010 under the NVE interface configuration as a member vni with the Multicast Group 239.1.1.1 we are using for Broadcast, Unknown Unicast, and Multicast (BUM) traffic. 

For connecting TS1 and TS2 to the Broadcast domain, we will configure Leaf-101's interface Eth 1/5 and Leaf-102's interface Eth1/3 as access ports for VLAN 10.

A few words regarding the terminology utilized in Figure 3-2. '3-Stage Routed Clos Fabric' denotes both the physical topology of the network and the model for forwarding data packets. The 3-Stage Clos topology has three switches (ingress, spine, and egress) between the attached Tenant Systems. Routed, in turn, means that switches forward packets based on the destination IP address.

With the term VXLAN Segment, I refer to a stretched Broadcast Domain, identified by the VXLAN Network Identifier value defined under the EVPN Instance on Leaf switches.



Figure 3-2: L2-Only Intra VN Connection.

Wednesday 8 May 2024

Deploying and Analyzing EVPN Instances: Deployment Scenarios

In the previous section, we built a Single-AS EVPN Fabric with OSPF-enabled Underlay Unicast routing and PIM-SM for Multicast routing using Any Source Multicast service. In this section, we configure two L2-Only EVPN Instances (L2-EVI) and two L2/L3 EVPN Instances (L2/3-EVI) in the EVPN Fabric. We examine their operations in six scenarios depicted in Figure 3-1.

Scenario 1 (L2-Only EVI, Intra-VN): 

In the Deployment section, we configure an L2-Only EVI with a Layer 2 VXLAN Network Identifier (L2VNI) of 10010. The Default Gateway for the VLAN associated with the EVI is a firewall. In the Analyze section, we observe the Control Plane and Data Plane operation when a) connecting Tenant Systems TS1 and TS2 to the segment, and b) TS1 communicates with TS2 (Intra-VN Communication).

Scenario 2 (L2-Only EVI, Inter-VN): 

In the Deployment section, we configure another L2-Only EVI with L2VNI 10020, to which we attach TS3 and TS4. In the Analyze section, we examine EVPN Fabric's Control Plane and Data Plane operations when TS2 (L2VNI 10010) sends data to TS3 (L2VNI 10020), Inter-VN Communication.

Scenario 3 (L2/L3 EVI, Intra-VN): 

In the Deployment section, we configure a Virtual Routing and Forwarding (VRF) Instance named VRF-NWKT with L3VNI 10077. Next, we configure the EVI with L2VNI 10030. We attach VLAN 10 to this segment, whose Anycast Gateway (AGW) we bind to the routing domain VRF-NWKT. In the Analyze section, we study the Control Plane process when TS5 joins the network, focusing mainly on the propagation of TS5's host IP address.

Scenario 4 (Intra-VN, Silent Host): 

In the Deployment section, we configure an EVI with L2VNI 10040 in the EVPN Fabric, where the VLAN attached to it belongs to the same routing domain VRF-NWKT as EVI 10030. This EVI includes a "Silent Host" TS8, which generates no data traffic unless requested. Besides, we publish the segment-specific subnetwork within the routing domain VRF-NWKT. In the Analyze section, we focus on examining the Control Plane aspect of the EVPN Route Type 5 (IP Prefix Route) process.

Scenario 5 (Inter-VN, Symmetric IRB): 

In this section, we examine the Integrated Routing and Bridging (IRB) Symmetric routing model between two EVPN Instances. We analyze Control Plane and Data Plane functionality by studying Inter-VN communication from the perspective of TS6 to destinations TS7 and TS8 (silent host).

Scenario 6 (Inter-VN between protected and unprotected VNs): 

In this final scenario's Deployment section, we configure the firewall to advertise the subnetworks of protected L2-Only EVPN instances to the routing domain VRF-NWKT. Then, in the Analyze section, we examine how these networks appear to unprotected EVPN Instances attached to the VRF-NWKT routing domain. We also investigate Data Plane packet forwarding concerning traffic between TS5 and TS1.

We will go through each scenario in detail in the upcoming chapters.

Figure 3-1: EVPN Instance Deploying and Analyzing Scenarios.


Thursday 2 May 2024

Configuration of BGP afi/safi L2VPN EVPN and NVE Tunnel Interface

Overlay Network Routing: MP-BGP L2VPN/EVPN



EVPN Fabric Control Plane – MP-BGP


Instead of being a protocol, EVPN is a solution that utilizes the Multi-Protocol Border Gateway Protocol (MP-BGP) for its control plane in an overlay network. Besides, EVPN employs Virtual eXtensible Local Area Network (VXLAN) encapsulation for the data plane of the overlay network.

Multi-Protocol BGP (MP-BGP) is an extension of BGP-4 that allows BGP speakers to encode Network Layer Reachability Information (NLRI) of various address types, including IPv4/6, VPNv4, and MAC addresses, into BGP Update messages. The MP_REACH_NLRI path attribute (PA) carried within MP-BGP update messages includes Address Family Identifier (AFI) and Subsequent Address Family Identifier (SAFI) attributes. The combination of AFI and SAFI determines the semantics of the carried Network Layer Reachability Information (NLRI). For example, AFI-25 (L2VPN) with SAFI-70 (EVPN) defines an MP-BGP-based L2VPN solution, which extends a broadcast domain in a multipoint manner over a routed IPv4 infrastructure using an Ethernet VPN (EVPN) solution.

BGP EVPN Route Types (BGP RT) carried in BGP update messages describe the type of the advertised EVPN NLRI (Network Layer Reachability Information). Besides publishing IP Prefix information with the IP Prefix Route (EVPN RT 5), BGP EVPN uses the MAC Advertisement Route (EVPN RT 2) for advertising hosts’ MAC/IP address reachability information. The Virtual Network Identifier (VNI) describes the VXLAN segment of the advertised MAC/IP addresses. 

In addition to these two fundamental route types, BGP EVPN can create a shared delivery tree for Layer 2 Broadcast, Unknown Unicast, and Multicast (BUM) traffic using the Inclusive Multicast Route (EVPN RT 3) for joining an Ingress Replication tunnel. This solution does not require a Multicast-enabled Underlay Network. The other option for BUM traffic is a Multicast-capable Underlay Network.

While EVPN RT 3 is used for building a multicast tree for BUM traffic, the Tenant Routed Multicast (TRM) solution provides tenant-specific multicast forwarding between senders and receivers. TRM is based on Multicast VPN (BGP AFI:1/SAFI:5 – IPv4/Mcast-VPN). TRM uses the MVPN Source Active A-D Route (MVPN RT 5) for publishing the multicast stream's source address and group. 

Using BGP EVPN's native multihoming solution, we can establish a Port-Channel between Tenant Systems (TS) and two or more VTEP switches. From the perspective of the TS, a traditional Port-Channel is deployed by bundling a set of Ethernet links into a single logical link. On the multihoming VTEP switches, these links are associated with a logical Port-Channel interface referred to as an Ethernet Segment (ES).

EVPN utilizes the EVPN Ethernet Segment Route (EVPN RT 4) as a signaling mechanism between member units to indicate which Ethernet Segments they are connected to. Additionally, VTEP switches use this EVPN RT 4 for selecting a Designated Forwarder (DF) for Broadcast, Unknown unicast, and Multicast (BUM) traffic.

When EVPN Multihoming is enabled on a set of VTEP switches, all local MAC/IP Advertisement Routes include the ES Type and ES Identifier. The EVPN multihoming solution employs the EVPN Ethernet A-D Route (EVPN RT 1) for rapid convergence. Leveraging EVPN RT 1, a VTEP switch can withdraw all MAC/IP addresses learned via a failed ES at once by describing the ESI value in the MP_UNREACH_NLRI Path Attribute. 

Note! ESI multi-homing is supported only on the first-generation Cisco Nexus 9300 switches. Nexus 9200, 9300-EX, and newer models do not support ESI multi-homing. 

An EVPN fabric employs a proactive Control Plane learning model, while networks based on Spanning Tree Protocol (STP) rely on a reactive flood-and-learn-based Data Plane learning model. In an EVPN fabric, data paths between Tenant Systems are established prior to data exchange. It's worth noting that without enabling ARP suppression, local VTEP switches flood ARP Request messages. However, remote VTEP switches do not learn the source MAC address from the VXLAN encapsulated frames.

BGP EVPN provides various methods for filtering reachability information. For instance, we can establish an import/export policy based on BGP Route Targets (BGP RT). Additionally, we can deploy ingress/egress filters using elements such as prefix-lists or BGP path attributes, like BGP Autonomous System numbers. Besides, BGP, OSPF, and IS-IS all support peer authentication.

EVPN Fabric Data Plane –VXLAN


The Virtual eXtensible LAN (VXLAN) is an encapsulation schema that enables Broadcast Domain/VLAN stretching over a Layer 3 network. Switches or hosts performing encapsulation/decapsulation are called VXLAN Tunnel End Points (VTEP). VTEPs encapsulate the Ethernet frames, originated by local Tenant Systems (TS), within outer MAC and IP headers, followed by a UDP header whose destination port is 4789 and whose source port is calculated from the payload. Between the UDP header and the original Ethernet frame is the VXLAN header, which describes the VXLAN segment with a VXLAN Network Identifier (VNI). A VNI is a 24-bit field, theoretically allowing for over 16 million unique VXLAN segments. 
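A rough sketch of the encapsulation layout (simplified, with a placeholder inner frame; the field handling is only illustrative):

import struct

VXLAN_UDP_DPORT = 4789

def vxlan_header(vni: int) -> bytes:
    """8-byte VXLAN header: flags with the I-bit set, 24-bit VNI, reserved bits zero."""
    flags = 0x08 << 24                   # I flag = 1 -> VNI field is valid
    return struct.pack("!II", flags, vni << 8)

def encapsulate(inner_frame: bytes, vni: int):
    """Return (udp_sport, udp_dport, udp_payload) for the outer UDP datagram."""
    # The source port is typically derived from a hash of the inner frame (simplified here).
    sport = 49152 + (hash(inner_frame) % 16384)
    return sport, VXLAN_UDP_DPORT, vxlan_header(vni) + inner_frame

sport, dport, payload = encapsulate(b"\x00" * 64, vni=10010)
print(sport, dport, payload[:8].hex())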

VTEP devices allocate a Layer 2 VNI (L2VNI) for Intra-VN connections and a Layer 3 VNI (L3VNI) for Inter-VN connections. There is a unique L2VNI for each VXLAN segment but one common L3VNI for tenant-specific Inter-VN communication. Besides, the Generic Protocol Extension for VXLAN (VXLAN-GPE) enables leaf switches to add Group Policy information to data packets. 

When a VTEP receives an EVPN NLRI from a remote VTEP with importable Route Targets, it validates the route by checking that it was received from a configured BGP peer with the correct remote ASN and a reachable source IP address. Then, it installs the NLRI information (RD, Encapsulation Type, Next Hop, other standard and extended communities, and VNIs) into the BGP Loc-RIB. Note that the local administrator part of the RD may change during the process if the VN segment is associated with a different VLAN than on the remote VTEP. Remember that VLANs are locally significant, while EVPN Instances have fabric-wide significance. Next, the best MAC route (or routes, if ECMP is enabled) is encoded into the L2RIB with the topology information (the VLAN ID associated with the VXLAN segment) and the next-hop information. Besides, the L2RIB describes the route source as BGP. Finally, the L2FM programs the information into the MAC address table and sets the NVE peer interface ID as the next hop. Note that the VXLAN Manager learns VXLAN peers from the data plane based on the source IP address. 
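The import steps described above can be modeled roughly as follows (a simplified, hypothetical sketch of the decision logic, not actual NX-OS behavior):

def import_evpn_route(route, local):
    """Return the route as it would be installed locally, or None if it is not imported."""
    # 1. Accept only routes whose Route Targets match the configured import RTs.
    if not set(route["route_targets"]) & set(local["import_rts"]):
        return None
    # 2. Rewrite the RD local administrator to match the local VLAN mapped to the VNI.
    local_vlan = local["vni_to_vlan"][route["vni"]]
    installed = dict(route)
    installed["rd"] = f'{local["router_id"]}:{32767 + local_vlan}'
    installed["vlan"] = local_vlan
    return installed

local_cfg = {"router_id": "192.168.10.102",
             "import_rts": ["65000:10010"],
             "vni_to_vlan": {10010: 10}}
received = {"rd": "192.168.10.101:32777", "vni": 10010,
            "route_targets": ["65000:10010"],
            "mac": "1000.0010.beef", "next_hop": "192.168.20.101"}
print(import_evpn_route(received, local_cfg))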

Our EVPN Fabric is a Single-AS solution, where Leaf and Spine switches are in the same BGP AS, making the Leaf-Spine switches iBGP neighbors. We assign BGP AS number 65000 to all switches and configure both Spine switches as BGP Route Reflectors, as shown in Figure 2-6. We reserve the IP subnet 192.168.10.0/24 for the Overlay network's BGP process, from which we take IP addresses for the logical interface Loopback 10. We use these addresses as a) BGP Router Identifiers (BRIDs), b) BGP neighbor addresses, and c) source addresses for BGP Update messages.

Leaf switches act as VXLAN Tunnel Endpoints (VTEPs), responsible for encapsulating/decapsulating data packets to/from Customer networks on the Fabric's Transport network side. The logical Network Virtualization Edge (NVE) interfaces of Leaf switches use VXLAN tunneling, where the tunnel source IP address is the IP address of Loopback 20. We reserve the subnet 192.168.20.0/24 for this purpose, as shown in Figure 2-6. 

In Figure 2-6, I have listed the VTEP Loopback identifier and IP address sections belonging to the Underlay network. The reason is that the source/destination IP addresses used for tunneling between VTEP devices must be routable by the devices in the Transport network (Underlay Network). In the context of BGP EVPN, the term "Overlay" refers to the fact that it advertises only the MAC and IP addresses and subnets required for IP communication among devices connected to EVPN segments.

The following image also lists mandatory NX-OS features that we must enable to configure both the BGP EVPN Control Plane and the Data Plane.



Figure 2-6: EVPN Fabric Overlay Network Control Plane and Data Plane.


Figure 2-7 depicts our implementation of a Single-AS EVPN Fabric. The Spine switch serves as a BGP Route Reflector, forwarding BGP Update messages from Leaf switches to other Leaf switches. The BGP process on Leaf switches sets the IP address of the Loopback 20 interface (the NVE source) as the Next-hop in the MP_REACH_NLRI Path Attribute for all advertised EVPN NLRI Route Types.

The Network Virtualization Edge (NVE) interfaces use the IP address of Loopback 20 as the source address for VXLAN tunneling. The NVE interface sub-command "host-reachability protocol bgp" instructs the NVE interface to use the Control Plane learning model based on the received BGP Updates about EVPN NLRIs.




Figure 2-7: EVPN Fabric Overlay Network Control Plane and Data Plane Building Blocks.



BGP EVPN Configuration


Example 2-18 shows the BGP configuration of Spine-12. The first two commands enable BGP EVPN. In the actual BGP configuration, we first specify the BGP AS number as 65000. Then, we attach the IP address we defined for Loopback 10 as the BGP Router ID. The command address-family l2vpn evpn with the subcommand maximum-paths 2 enables flow-based load sharing across two BGP peers if their EVPN NLRI AS_PATH attributes are identical. The commonly used term for this is Equal Cost Multi-Pathing (ECMP). 

Using the neighbor command, we define the BGP neighbor's IP address. For each BGP neighbor, we define the BGP AS number and the source IP address for the locally generated BGP Update messages. With the neighbor-specific command address-family l2vpn evpn, we indicate that we want to exchange EVPN NLRI information with this neighbor. 

Depending on the advertised EVPN Route Type, a set of BGP Extended Community attributes are carried with advertised EVPN NLRIs. Hence, we need the command send-community extended. By default, the BGP loop prevention mechanism prevents iBGP peers from advertising NLRI information learned from other iBGP peers. We bypass this mechanism by configuring the Spine switches as BGP Route Reflectors using the neighbor-specific route-reflector-client command.


feature bgp
nv overlay evpn
!
router bgp 65000
  router-id 192.168.10.12
  address-family l2vpn evpn
    maximum-paths 2
  neighbor 192.168.10.101
    remote-as 65000
    update-source loopback10
    address-family l2vpn evpn
      send-community
      send-community extended
      route-reflector-client
!
  neighbor 192.168.10.102
    remote-as 65000
    update-source loopback10
    address-family l2vpn evpn
      send-community
      send-community extended
      route-reflector-client
!
  neighbor 192.168.10.103
    remote-as 65000
    update-source loopback10
    address-family l2vpn evpn
      send-community
      send-community extended
      route-reflector-client
!
  neighbor 192.168.10.104
    remote-as 65000
    update-source loopback10
    address-family l2vpn evpn
      send-community
      send-community extended
      route-reflector-client

Example 2-18: Spine Switches BGP Configuration.
Example 2-19 illustrates the BGP configuration of switch Leaf-101. The BGP configurations of all Leaf switches are identical except for the BGP router ID.

feature bgp
nv overlay evpn
!
router bgp 65000
  router-id 192.168.10.101
  address-family l2vpn evpn
    maximum-paths 2
  neighbor 192.168.10.11
    remote-as 65000
    update-source loopback10
    address-family l2vpn evpn
      send-community
      send-community extended

  neighbor 192.168.10.12
    remote-as 65000
    update-source loopback10
    address-family l2vpn evpn
      send-community
      send-community extended

Example 2-19: Leaf Switches BGP Configuration.

BGP EVPN Verification

From Example 2-20, we can see the BGP commands we have associated with the BGP neighbor Leaf-101 on Spine-11.


Spine-11# sh bgp l2vpn evpn neighbors 192.168.10.101 commands
Command information for 192.168.10.101
                 Update Source: locally configured
                     Remote AS: locally configured

 Address Family: L2VPN EVPN
                Send Community: locally configured
            Send Ext-community: locally configured
        Route Reflector Client: locally configured
Spine-11#

Example 2-20: BGP Neighbor Commands for Leaf-101 on Spine-11.

Example 2-21 shows the BGP neighbors of Spine-11 with their AS numbers and statistics regarding received and sent BGP messages (Open, Keepalive, Update, and Notification). All EVPN Route Type counters are zero because we haven't yet deployed EVPN instances.


Spine-11# sh bgp l2vpn evpn summary
BGP summary information for VRF default, address family L2VPN EVPN
BGP router identifier 192.168.10.12, local AS number 65000
BGP table version is 6, L2VPN EVPN config peers 4, capable peers 4
0 network entries and 0 paths using 0 bytes of memory
BGP attribute entries [0/0], BGP AS path entries [0/0]
BGP community entries [0/0], BGP clusterlist entries [0/0]

Neighbor        V    AS    MsgRcvd    MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.168.10.101  4 65000         14         17        0    0    0 00:00:02 0
192.168.10.102  4 65000         19         20        0    0    0 00:00:02 0
192.168.10.103  4 65000          6          4        0    0    0 00:00:06 0
192.168.10.104  4 65000         14         17        0    0    0 00:00:02 0

Neighbor        T    AS PfxRcd     Type-2     Type-3     Type-4     Type-5     Type-12
192.168.10.101  I 65000 0          0          0          0          0          0
192.168.10.102  I 65000 0          0          0          0          0          0
192.168.10.103  I 65000 0          0          0          0          0          0
192.168.10.104  I 65000 0          0          0          0          0          0
Spine-11#

Example 2-21: BGP L2VPN EVPN Summary on Spine-11.


Example 2-21 shows information and statistics about the BGP neighborship between switches Spine-11 and Leaf-101. Leaf-101 belongs to the same BGP Autonomous System (AS) 65000 as Spine-11, making Leaf-101 an iBGP neighbor. I have highlighted the parts that confirm the functionality of our configuration. The neighborship state is "Established", indicating that the switches are ready to send and receive BGP Update messages. Spine-11 uses the logical interface Loopback10 as its source address in BGP Update messages. The Capabilities and Graceful Restart sections show that the switches support the BGP address family L2VPN EVPN. At the end of the output, we see that Leaf-101 is configured as a Route-Reflector Client.
Spine-11# sh bgp l2vpn evpn neighbors 192.168.10.101
BGP neighbor is 192.168.10.101, remote AS 65000, ibgp link, Peer index 3
  BGP version 4, remote router ID 192.168.10.101
  Neighbor previous state = OpenConfirm
  BGP state = Established, up for 00:02:40
  Neighbor vrf: default
  Using loopback10 as update source for this peer
  Using iod 71 (loopback10) as update source
  Last read 00:00:35, hold time = 180, keepalive interval is 60 seconds
  Last written 00:00:35, keepalive timer expiry due 00:00:24
  Received 18 messages, 0 notifications, 0 bytes in queue
  Sent 21 messages, 1 notifications, 0(0) bytes in queue
  Enhanced error processing: On
    0 discarded attributes
  Connections established 2, dropped 1
  Last update recd 00:02:35, Last update sent  = never
   Last reset by us 00:02:51, due to router-id configuration change
  Last error length sent: 0
  Reset error value sent: 0
  Reset error sent major: 6 minor: 107
  Notification data sent:
  Last reset by peer never, due to No error
  Last error length received: 0
  Reset error value received 0
  Reset error received major: 0 minor: 0
  Notification data received:

  Neighbor capabilities:
  Dynamic capability: advertised (mp, refresh, gr) received (mp, refresh, gr)
  Dynamic capability (old): advertised received
  Route refresh capability (new): advertised received
  Route refresh capability (old): advertised received
  4-Byte AS capability: advertised received
  Address family L2VPN EVPN: advertised received
  Graceful Restart capability: advertised received

  Graceful Restart Parameters:
  Address families advertised to peer:
    L2VPN EVPN
  Address families received from peer:
    L2VPN EVPN
  Forwarding state preserved by peer for:
  Restart time advertised to peer: 120 seconds
  Stale time for routes advertised by peer: 300 seconds
  Restart time advertised by peer: 120 seconds
  Extended Next Hop Encoding Capability: advertised received
  Receive IPv6 next hop encoding Capability for AF:
    IPv4 Unicast  VPNv4 Unicast

  Message statistics:
                              Sent               Rcvd
  Opens:                         4                  2
  Notifications:                 1                  0
  Updates:                       2                  2
  Keepalives:                   12                 12
  Route Refresh:                 0                  0
  Capability:                    2                  2
  Total:                        21                 18
  Total bytes:                 327                306
  Bytes in queue:                0                  0

  For address family: L2VPN EVPN
  BGP table version 10, neighbor version 10
  0 accepted prefixes (0 paths), consuming 0 bytes of memory
  0 received prefixes treated as withdrawn
  0 sent prefixes (0 paths)
  Community attribute sent to this neighbor
  Extended community attribute sent to this neighbor
  Third-party Nexthop will not be computed.
  Advertise GW IP is enabled
  Route reflector client
  Last End-of-RIB received 00:00:05 after session start
  Last End-of-RIB sent 00:00:05 after session start
  First convergence 00:00:05 after session start with 0 routes sent

  Local host: 192.168.10.11, Local port: 33940
  Foreign host: 192.168.10.101, Foreign port: 179
  fd = 90
Example 2-21: BGP Neighbor Details for Leaf-101 on Spine-11.

Overlay Network Data Plane: VXLAN 



NVE Interface Configuration


Example 2-22 shows the configuration of the NVE interface and the required feature configuration for client overlay networks. The "feature nv overlay" enables VXLAN overlay networks. The "feature vn-segment-vlan-based" specifies that only the MAC addresses of the VLAN associated with the respective EVPN instance (EVI) are stored in the MAC-VRF's Layer2 RIB (L2RIB). In other words, the EVPN instance forms a single broadcast domain. Under the NVE interface, we define the logical interface Loopback20's IP address as the tunnel source address. Additionally, we specify that the NVE interface implements the Control Plane learning model, meaning the switch learns remote MAC addresses from BGP Update messages, not from the data traffic received through the tunnel interface (Data Plane learning).

feature nv overlay
feature interface-vlan
feature vn-segment-vlan-based
!
interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback20

Example 2-22: Leaf Switches NVE Interface Configuration.

NVE Interface Verification


Example 2-23 shows the summary information about the settings of interface NVE 1. Leaf-101 uses Loopback20 as the source interface when sending traffic over interface NVE1. Besides, Leaf-101 uses the Control Plane learning model. Leaf-101 encodes its router MAC address into BGP Update messages as the "Router MAC" Extended Community associated with EVPN Route Type 2 (MAC/IP Advertisement Route) when the update carries both MAC and IP addresses. The remote leaf switches use it as the source MAC address in the inner Ethernet frame when forwarding Inter-VN traffic.

Leaf-101# show nve interface nve 1
Interface: nve1, State: Up, encapsulation: VXLAN
 VPC Capability: VPC-VIP-Only [not-notified]
 Local Router MAC: 5003.0000.1b08
 Host Learning Mode: Control-Plane
 Source-Interface: loopback20 (primary: 192.168.20.101, secondary: 0.0.0.0)
Example 2-23: NVE Interface Verification on Leaf-101.

Example 2-24 demonstrates that Leaf-101 currently lacks any NVE peers because its VXLAN manager initiates an NVE peer relationship with other VTEPs upon receiving the first data packet over the NVE interface.


Leaf-101# show nve peers detail
Leaf-101#
Example 2-24: NVE Peer Verification on Leaf-101.

At this stage, we have configured the EVPN Fabric to the point where we can deploy our first EVPN instances and test and analyze both the Intra-VN and Inter-VN Control Plane and Data Plane perspectives.


Sunday 28 April 2024

Single-AS EVPN Fabric with OSPF Underlay: Underlay Network Multicast Routing: Any-Source Multicast - ASM

 Underlay Network Multicast Routing: PIM-SM

In a traditional Layer 2 network, switches forward Intra-VLAN data traffic based on the destination MAC address of Ethernet frames. Therefore, hosts within the same VLAN must resolve each other's MAC-IP address bindings using Address Resolution Protocol (ARP). When a host wants to open a new IP connection with a device in the same subnet and the destination MAC address is unknown, the connection initiator generates an ARP Request message. In the message, the sender provides its own MAC-IP binding information and queries the MAC address of the owner of the target IP. The ARP Request messages are Layer 2 Broadcast messages with the destination MAC address FF:FF:FF:FF:FF:FF. 

The EVPN Fabric is a routed network and requires a solution for handling Layer 2 Broadcast messages. We can either select the BGP EVPN-based Ingress Replication (IR) solution or enable Multicast routing in the Underlay network. This chapter introduces the latter model. As in the previous Unicast Routing section, we follow the Multicast deployment workflow of the Nexus Dashboard Fabric Controller (NDFC) graphical user interface. 

Figure 2-4 depicts the components needed to deploy the Multicast service in the Underlay network. The default option for "RP mode" is ASM (Any-Source Multicast). ASM is a multicast service model where receivers join a multicast group by sending PIM Join messages to the multicast group-specific Rendezvous Point(s) (RP). The RP is a "meeting point" to which the multicast source sends its traffic and from which the RP forwards it down the shared tree. This process creates a shared multicast tree from the RP to the receivers. The multicast-enabled routers, in turn, use the Protocol Independent Multicast – Sparse Mode (PIM-SM) multicast routing protocol for forwarding multicast traffic from senders to receivers. In the default operation mode, PIM-SM allows receivers to switch from the shared multicast tree to a source-specific multicast tree. The other option for RP mode is Bidirectional PIM (BiDir). It is a variant of PIM-SM where multicast traffic always goes from the sender to the RP and from the RP down to the receivers over the shared multicast tree. In an EVPN Fabric, Leaf switches are both multicast senders (they forward local TS ARP messages) and receivers (they want to receive ARP messages generated by TSs connected to remote Leaf switches).

In our example, we create multicast group 239.1.1.0/24 using Any-Source Multicast (ASM) on both spine switches. We publish our Anycast-RPs to Leaf switches using IP address 192.168.254.1 (Loopback 251). Finally, we enable Protocol Independent Multicast (PIM) Sparse Mode on all Inter-Switch links and Loopback Interfaces. 


Figure 2-4: EVPN Fabric Protocol and Resources – Broadcast and Unknown Unicast.


Figure 2-5 illustrates the multicast configuration of our example EVPN Fabric's underlay network. In this setup, spine switches serve as Rendezvous Points (RPs) for the multicast group range 239.1.1.0/24. Spine-11 and Spine-12 publish the configured RP IP address 192.168.254.1/32 (Loopback 251) to the Leaf switches. These spine switches form part of the same RP-Set group, identifying themselves using their Loopback 0 interface addresses. In the Leaf switch setup, we define that Multicast Groups in 239.1.1.0/24 will use the Rendezvous Point 192.168.254.1. 

Leaf switches act as both senders and receivers of multicast traffic. They indicate their willingness to receive traffic for multicast groups in 239.1.1.0/24 by sending a PIM Join message towards the RP using the destination IP address 224.0.0.13 (All PIM Routers). In the message, they specify the group they want to join. Leaf switches register themselves with the Rendezvous Point as multicast traffic sources using a PIM Register message. They send PIM Register messages to the configured group-specific Rendezvous Point IP address.



Figure 2-5: EVPN Fabric Underlay Network Multicast Replication.


Configuration


Example 2-6 demonstrates the multicast configurations of the Spine switches. We enable the PIM protocol with the command feature pim. Then, we configure Loopback interface 251 and define the IP address as 192.168.254.1/32. We add this loopback to the Unicast (OSPF) and Multicast (PIM-SM) routing processes. Besides, we enable Multicast routing on Loopback 0 and the Inter-Switch interfaces. After the interface configurations, we bind the RP address 192.168.254.1 to the Multicast Group List 239.1.1.0/24 and create an Anycast-RP Set list in which we list the Spine switches sharing the RP address 192.168.254.1. Note that the switches attached to the Rendezvous Point group synchronize the multicast sources registered to them. Synchronization information is accepted only from devices whose IP address is listed in the Anycast-RP set.

feature pim
!
interface loopback251
  description Anycast-RP-Shared
  ip address 192.168.254.1/32
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
!
interface loopback 0
  ip pim sparse-mode
!
interface ethernet 1/1-4
  ip pim sparse-mode
!
ip pim rp-address 192.168.254.1 group-list 239.1.1.0/24
ip pim anycast-rp 192.168.254.1 192.168.0.11
ip pim anycast-rp 192.168.254.1 192.168.0.12

Example 2-6: Multicast Configuration - Spine-11 and Spine-12.


In Leaf switches, we first enable the PIM feature. Then, we include Loopback interfaces 0, 10, and 20 in multicast routing, as well as the Inter-Switch interfaces. Afterward, we specify the IP address of the multicast group-specific Rendezvous Point.


feature pim
!
ip pim rp-address 192.168.254.1 group-list 239.1.1.0/24
!
interface loopback 0
  ip pim sparse-mode
!
interface loopback 10
  ip pim sparse-mode
!
interface loopback 20
  ip pim sparse-mode
!
interface ethernet 1/1-2
  ip pim sparse-mode

Example 2-7: Multicast Configuration – Leaf-101 - 104.

In Example 2-8, we can see that both Spine switches belong to the Anycast-RP 192.168.254.1 cluster. The RP-Set identifier IP address of Spine-11 is marked with an asterisk symbol (*). The command output also verifies that we have associated the Rendezvous Point with the Multicast Group Range 239.1.1.0/24. Example 2-9 verifies the RP-to-Multicast Group information from the Spine-12 perspective, and Example 2-10 from the Leaf-101 perspective. 
Spine-11# show ip pim rp vrf default
PIM RP Status Information for VRF "default"
BSR disabled
Auto-RP disabled
BSR RP Candidate policy: None
BSR RP policy: None
Auto-RP Announce policy: None
Auto-RP Discovery policy: None

Anycast-RP 192.168.254.1 members:
  192.168.0.11*  192.168.0.12

RP: 192.168.254.1*, (0),
 uptime: 00:06:24   priority: 255,
 RP-source: (local),
 group ranges:
 239.1.1.0/24
 
Example 2-8: RP-to-Multicast Group Mapping – Spine-11.


Spine-12# show ip pim rp vrf default
PIM RP Status Information for VRF "default"
BSR disabled
Auto-RP disabled
BSR RP Candidate policy: None
BSR RP policy: None
Auto-RP Announce policy: None
Auto-RP Discovery policy: None

Anycast-RP 192.168.254.1 members:
  192.168.0.11  192.168.0.12*

RP: 192.168.254.1*, (0),
 uptime: 00:05:51   priority: 255,
 RP-source: (local),
 group ranges:
 239.1.1.0/24

Example 2-9: RP-to-Multicast Group Mapping – Spine-12.


Leaf-101# show ip pim rp vrf default
PIM RP Status Information for VRF "default"
BSR disabled
Auto-RP disabled
BSR RP Candidate policy: None
BSR RP policy: None
Auto-RP Announce policy: None
Auto-RP Discovery policy: None

RP: 192.168.254.1, (0),
 uptime: 00:05:18   priority: 255,
 RP-source: (local),
 group ranges:
 239.1.1.0/24

Example 2-10: RP-to-Multicast Group Mapping – Leaf-101

Example 2-11 confirms that we have enabled PIM-SM on all necessary interfaces. Additionally, the example verifies that Spine-11 has established four PIM adjacencies over the Inter-Switch links Eth1/1-4. Example 2-12 presents the same information from the viewpoint of Leaf-101.

Spine-11# show ip pim interface brief
PIM Interface Status for VRF "default"
Interface            IP Address      PIM DR Address  Neighbor  Border
                                                     Count     Interface
Ethernet1/1          192.168.0.11    192.168.0.101   1         no
Ethernet1/2          192.168.0.11    192.168.0.102   1         no
Ethernet1/3          192.168.0.11    192.168.0.103   1         no
Ethernet1/4          192.168.0.11    192.168.0.104   1         no
loopback0            192.168.0.11    192.168.0.11    0         no
loopback251          192.168.254.1   192.168.254.1   0         no

Example 2-11: Verification of PIM Interfaces – Spine-11.


Leaf-101# show ip pim interface brief
PIM Interface Status for VRF "default"
Interface            IP Address      PIM DR Address  Neighbor  Border
                                                     Count     Interface
Ethernet1/1          192.168.0.101   192.168.0.101   1         no
Ethernet1/2          192.168.0.101   192.168.0.101   1         no
loopback0            192.168.0.101   192.168.0.101   0         no

Example 2-12: Verification of PIM Interfaces – Leaf-101.

Example 2-13 provides more detailed information about the PIM neighbors of Spine-11.

Spine-11# show ip pim neighbor vrf default
PIM Neighbor Status for VRF "default"
Neighbor        Interface    Uptime    Expires   DR       Bidir-  BFD    ECMP Redirect
                                                 Priority Capable State     Capable
192.168.0.101   Ethernet1/1  00:11:29  00:01:41  1        yes     n/a     no
192.168.0.102   Ethernet1/2  00:10:39  00:01:35  1        yes     n/a     no
192.168.0.103   Ethernet1/3  00:10:16  00:01:29  1        yes     n/a     no
192.168.0.104   Ethernet1/4  00:09:58  00:01:18  1        yes     n/a     no
Example 2-13: Spine-11’s PIM Neighbors.

The "Mode" column in Example 2-14 is the initial evidence that we have deployed the Any-Source Multicast service.

Spine-11# show ip pim group-range
PIM Group-Range Configuration for VRF "default"
Group-range        Action Mode  RP-address      Shared-tree-range Origin
232.0.0.0/8        Accept SSM   -               -                 Local
239.1.1.0/24       -      ASM   192.168.254.1   -                 Static
Example 2-14: PIM Group Ranges.
The following three examples show that Multicast Group 239.1.1.0/24 is not active yet. We will get back to this after the EVPN Fabric is deployed and we have implemented our first EVPN segment.

Spine-11# show ip mroute
IP Multicast Routing Table for VRF "default"

(*, 232.0.0.0/8), uptime: 00:08:38, pim ip
  Incoming interface: Null, RPF nbr: 0.0.0.0
  Outgoing interface list: (count: 0)

Example 2-15: Multicast Routing Information Base (MRIB) – Spine-11.

Spine-12# show ip mroute
IP Multicast Routing Table for VRF "default"

(*, 232.0.0.0/8), uptime: 00:07:33, pim ip
  Incoming interface: Null, RPF nbr: 0.0.0.0
  Outgoing interface list: (count: 0)

Example 2-16: Multicast Routing Information Base (MRIB) – Spine-12.

Leaf-101# show ip mroute
IP Multicast Routing Table for VRF "default"

(*, 232.0.0.0/8), uptime: 00:06:29, pim ip
  Incoming interface: Null, RPF nbr: 0.0.0.0
  Outgoing interface list: (count: 0)

Example 2-17: Multicast Routing Information Base (MRIB) – Leaf-101.

Next, we configure the Border Gateway Protocol (BGP) as the control plane protocol for the EVPN Fabric's Overlay Network.  

Thursday 25 April 2024

Single-AS EVPN Fabric with OSPF Underlay: Underlay Network Unicast Routing

 Introduction


Figure 2-1 illustrates the components essential for designing a Single-AS, Multicast-enabled OSPF Underlay EVPN Fabric. These components need to be established before constructing the EVPN fabric. I've grouped them into five categories based on their function.

  • General: Defines the IP addressing scheme for Spine-Leaf Inter-Switch links, sets the BGP AS number and the number of BGP Route Reflectors, and sets the MAC address of the Anycast Gateway for client-side VLAN routing interfaces.
  • Replication: Specifies the replication mode for Broadcast, Unknown Unicast, and Multicast (BUM) traffic generated by Tenant Systems. The options are Ingress Replication and Multicast (ASM or BiDir).
  • vPC: Describes vPC multihoming settings such as the vPC Peer Link VLAN ID and Port-Channel ID, the vPC Auto-recovery and Delay Restore timers, and defines the vPC Peer Keepalive interface.
  • Protocol: Defines the numbering schema for Loopback interfaces and sets the OSPF Area identifier and the OSPF process name.
  • Resources: Reserves IP address ranges for the Loopback interfaces defined in the Protocol category and for the Rendezvous Point specified in the Replication category. Besides, in this section, we reserve Layer 2 and Layer 3 VXLAN and VLAN ranges for overlay network segments.

The model presented in Figure 2-1 outlines the steps for configuring an EVPN fabric using the Nexus Dashboard Fabric Controller (NDFC) “Create Fabric” tool. Each category in the image corresponds to a tab in the NDFC's Easy_Fabric_11_1 Fabric Template.


Figure 2-1: EVPN Fabric Network Side Building Blocks.


Underlay Network Unicast Routing


Let's start the deployment process of the EVPN Fabric with the definitions of the General, Protocol, and Resources categories for the Underlay network. We won't define a separate subnet for Spine-Leaf Inter-Switch links; instead, we'll use unnumbered interfaces. For the routing protocol in the Underlay network, we'll choose OSPF and define the process name (UNDERLAY-NET) and Area Identifier (0.0.0.0) in the Protocol category. In the Protocol category, we also define the numbering schema for Loopback addresses. The Underlay Routing Loopback ID will be 0 (for the OSPF Router ID and the unnumbered Inter-Switch interfaces), the Overlay Network Loopback ID will be 10 (for BGP EVPN peering), and the Loopback ID for VXLAN tunneling will be 20 (the outer source and destination IP addresses for VXLAN tunnel encapsulation). In the Resources category, we'll reserve IP address ranges and assign addresses to each loopback interface as follows: Loopback 0: 192.168.0.0/24, Loopback 10: 192.168.10.0/24, and Loopback 20: 192.168.20.0/24.



Figure 2-2: EVPN Fabric General, Protocol, and Resources Definitions.


Figure 2-3 illustrates the Loopback addresses we have chosen for the Leaf and Spine switches. Let's take the Leaf-101 switch as an example. We have assigned the IP address 192.168.0.101/32 to the Loopback 0 interface, which Leaf-101 uses as both the OSPF Router ID and the Inter-Switch link IP address. For the Loopback 10 interface, we've assigned the IP address 192.168.10.101/32, which Leaf-101 uses as both the BGP Router ID and the BGP EVPN peering address. For the Loopback 20 interface, we have assigned the IP address 192.168.20.101/32, which Leaf-101 uses as the outermost source/destination IP address in VXLAN tunneling. Note that the Loopback 20 address is configured only on Leaf switches. The OSPF process advertises all three Loopback addresses in LSA (Link State Advertisement) messages to all its OSPF neighbors, which then process and forward them to their own OSPF neighbors.



Figure 2-3: EVPN Fabric Loopback Interface IP Addressing.

CLI Configuration


Example 2-1 shows the underlay network configuration of the EVPN Fabric for Leaf-101. Enable the OSPF feature and create the OSPF process. Then, configure the Loopback interfaces, assign them IP addresses, and associate them with the OSPF process. After that, configure the Inter-Switch Link (ISL) interfaces Eth1/1 and Eth1/2 as unnumbered interfaces that borrow the IP address of the Loopback 0 interface (192.168.0.101/32). Specify the interface medium and OSPF network type as point-to-point and attach the interfaces to the OSPF process.

The global "ip host" commands allow pinging the defined IP addresses by name, and the "name-lookup" command under the OSPF process makes the "show ip ospf neighbors" command display the OSPF neighbors' names instead of their IP addresses. These commands are optional.

conf t
!
hostname Leaf-101
!
feature ospf 
!
router ospf UNDERLAY-NET
  router-id 192.168.0.101
  name-lookup
!
ip host Leaf-101 192.168.0.101
ip host Leaf-102 192.168.0.102
ip host Leaf-103 192.168.0.103
ip host Leaf-104 192.168.0.104
ip host Spine-11 192.168.0.11
ip host Spine-12 192.168.0.12
!
interface loopback 0
 description ** OSPF RID & Inter-Sw links IP addressing **
 ip address 192.168.0.101/32
 ip router ospf UNDERLAY-NET area 0.0.0.0
!
interface loopback 10
 description ** Overlay ControlPlane - BGP EVPN **
 ip address 192.168.10.101/32
 ip router ospf UNDERLAY-NET area 0.0.0.0
!
interface loopback 20
 description ** Overlay DataPlane - VTEP **
 ip address 192.168.20.101/32
 ip router ospf UNDERLAY-NET area 0.0.0.0
!
interface Ethernet1/1-2
  no switchport
  medium p2p
  ip unnumbered loopback0
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  no shutdown

Example 2-1: Leaf-101 - Underlay Network Configuration.
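
For comparison, below is a minimal sketch of what the corresponding Spine-11 underlay configuration could look like, assuming the addressing from Figure 2-3 (Loopback 0: 192.168.0.11/32, Loopback 10: 192.168.10.11/32) and assuming that the four Inter-Switch links toward the Leaf switches are Eth1/1-4. Spine switches are not VTEPs in this design, so there is no Loopback 20 interface; the exact NDFC-generated configuration may differ in detail.

conf t
!
hostname Spine-11
!
feature ospf
!
router ospf UNDERLAY-NET
  router-id 192.168.0.11
  name-lookup
!
interface loopback 0
 description ** OSPF RID & Inter-Sw links IP addressing **
 ip address 192.168.0.11/32
 ip router ospf UNDERLAY-NET area 0.0.0.0
!
interface loopback 10
 description ** Overlay ControlPlane - BGP EVPN **
 ip address 192.168.10.11/32
 ip router ospf UNDERLAY-NET area 0.0.0.0
!
interface Ethernet1/1-4
  no switchport
  medium p2p
  ip unnumbered loopback0
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  no shutdown

As on Leaf-101, the optional name-lookup and ip host commands can be added to make the verification outputs easier to read.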

Verifications

Example 2-2 shows that the Leaf-101 switch's Ethernet interfaces 1/1 and 1/2, as well as all three Loopback interfaces, belong to the OSPF process UNDERLAY-NET in OSPF area 0.0.0.0. The OSPF network type for the Ethernet interfaces is point-to-point. The example also verifies that the Leaf-101 switch has two OSPF neighbors, Spine-11 and Spine-12.


Leaf-101# show ip ospf interface brief ; show ip ospf neighbors ;
--------------------------------------------------------------------------------
 OSPF Process ID UNDERLAY-NET VRF default
 Total number of interface: 5
 Interface               ID     Area            Cost   State    Neighbors Status
 Eth1/1                  4      0.0.0.0         40     P2P      1         up
 Eth1/2                  5      0.0.0.0         40     P2P      1         up
 Lo0                     1      0.0.0.0         1      LOOPBACK 0         up
 Lo10                    2      0.0.0.0         1      LOOPBACK 0         up
 Lo20                    3      0.0.0.0         1      LOOPBACK 0         up
--------------------------------------------------------------------------------
 OSPF Process ID UNDERLAY-NET VRF default
 Total number of neighbors: 2
 Neighbor ID     Pri State            Up Time  Address         Interface
 Spine-11          1 FULL/ -          00:00:30 192.168.0.11    Eth1/1
 Spine-12          1 FULL/ -          00:00:30 192.168.0.12    Eth1/2

Example 2-2: Leaf-101 – OSPF interfaces and neighbors.


Example 2-3 displays the OSPF Link-State Database (LSDB) of the Leaf-101 switch. The first section shows that all switches in the EVPN Fabric have advertised descriptions of their OSPF links. Each Spine switch has six OSPF links (2 x Loopback interfaces and 4 x Ethernet interfaces), while each Leaf switch has five OSPF links (3 x Loopback interfaces and 2 x Ethernet interfaces). The second section provides detailed OSPF link descriptions for the Spine-11 switch.

Leaf-101# sh ip ospf database ; show ip ospf database 192.168.0.11 detail
--------------------------------------------------------------------------------
        OSPF Router with ID (Leaf-101) (Process ID UNDERLAY-NET VRF default)
                Router Link States (Area 0.0.0.0)
Link ID         ADV Router      Age        Seq#       Checksum Link Count
192.168.0.11    Spine-11        51         0x8000012c 0x3fcd   6
192.168.0.12    Spine-12        51         0x8000012c 0x4fb9   6
192.168.0.101   Leaf-101        50         0x8000012e 0x9adf   5
192.168.0.102   Leaf-102        615        0x8000012c 0xd0a6   5
192.168.0.103   Leaf-103        607        0x8000012c 0x036f   5
192.168.0.104   Leaf-104        599        0x8000012c 0x3538   5
--------------------------------------------------------------------------------
        OSPF Router with ID (Leaf-101) (Process ID UNDERLAY-NET VRF default)
                Router Link States (Area 0.0.0.0)
   LS age: 51
   Options: 0x2 (No TOS-capability, No DC)
   LS Type: Router Links
   Link State ID: 192.168.0.11
   Advertising Router: Spine-11
   LS Seq Number: 0x8000012c
   Checksum: 0x3fcd
   Length: 96
    Number of links: 6

     Link connected to: a Stub Network
      (Link ID) Network/Subnet Number: 192.168.0.11
      (Link Data) Network Mask: 255.255.255.255
       Number of TOS metrics: 0
         TOS   0 Metric: 1

     Link connected to: a Stub Network
      (Link ID) Network/Subnet Number: 192.168.10.11
      (Link Data) Network Mask: 255.255.255.255
       Number of TOS metrics: 0
         TOS   0 Metric: 1

     Link connected to: a Router (point-to-point)
     (Link ID) Neighboring Router ID: 192.168.0.101
     (Link Data) Router Interface address: 0.0.0.3
       Number of TOS metrics: 0
         TOS   0 Metric: 40

     Link connected to: a Router (point-to-point)
     (Link ID) Neighboring Router ID: 192.168.0.102
     (Link Data) Router Interface address: 0.0.0.4
       Number of TOS metrics: 0
         TOS   0 Metric: 40

     Link connected to: a Router (point-to-point)
     (Link ID) Neighboring Router ID: 192.168.0.103
     (Link Data) Router Interface address: 0.0.0.5
       Number of TOS metrics: 0
         TOS   0 Metric: 40

     Link connected to: a Router (point-to-point)
     (Link ID) Neighboring Router ID: 192.168.0.104
     (Link Data) Router Interface address: 0.0.0.6
       Number of TOS metrics: 0
         TOS   0 Metric: 40

Example 2-3: Leaf-101 – OSPF Link-State Database.


Example 2-4 confirms that the Leaf-101 switch has run the Dijkstra algorithm against the LSDB and installed the best routes into the unicast routing table. Note that for the Loopback IP addresses of every remote Leaf switch, there are two equal-cost paths, one via each Spine switch.


Leaf-101# show ip route ospf
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>
192.168.0.11/32, ubest/mbest: 1/0
    *via 192.168.0.11, Eth1/1, [110/41], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.0.12/32, ubest/mbest: 1/0
    *via 192.168.0.12, Eth1/2, [110/41], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.0.102/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.0.103/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.0.104/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.10.11/32, ubest/mbest: 1/0
    *via 192.168.0.11, Eth1/1, [110/41], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.10.12/32, ubest/mbest: 1/0
    *via 192.168.0.12, Eth1/2, [110/41], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.10.102/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.10.103/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.10.104/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.20.102/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.20.103/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.20.104/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra

Example 2-4: Leaf-101 – Unicast Routing Table.
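
Since every remote VTEP loopback is reachable over two equal-cost paths, it can be useful to check which of the two uplinks the switch would actually pick for a given flow. One way to do this on NX-OS is sketched below; the exact options of the command vary by software release, and the addresses are simply the Leaf-101 and Leaf-102 VTEP loopbacks from Figure 2-3.

Leaf-101# show routing hash 192.168.20.101 192.168.20.102

Because the outer UDP source port of the VXLAN encapsulation is derived from a hash of the inner flow, different tenant flows between the same pair of VTEPs can land on different uplinks even though the destination prefix, and thus the pair of equal-cost next-hops, is the same.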

Example 2-5 confirms that the Leaf-101 switch has IP connectivity to all Fabric switches' Loopback 0 interfaces. Note that I've added dashes between the ping outputs for clarity.


Leaf-101#ping Spine-11 ; ping Spine-12 ; ping Leaf-102 ; ping Leaf-103 ; ping Leaf-104
PING Spine-11 (192.168.0.11): 56 data bytes
64 bytes from 192.168.0.11: icmp_seq=0 ttl=254 time=4.715 ms
64 bytes from 192.168.0.11: icmp_seq=1 ttl=254 time=4.909 ms
<3 x ICMP replies have been removed to fit the entire output on one page>
--- Spine-11 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 1.849/3.369/4.909 ms
-----------------------------------------------------------------------
PING Spine-12 (192.168.0.12): 56 data bytes
64 bytes from 192.168.0.12: icmp_seq=0 ttl=254 time=3.14 ms
64 bytes from 192.168.0.12: icmp_seq=1 ttl=254 time=2.486 ms
<3 x ICMP replies have been removed to fit the entire output on one page>
--- Spine-12 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 1.896/2.279/3.14 ms
-----------------------------------------------------------------------
PING Leaf-102 (192.168.0.102): 56 data bytes
64 bytes from 192.168.0.102: icmp_seq=0 ttl=253 time=6.124 ms
64 bytes from 192.168.0.102: icmp_seq=1 ttl=253 time=4.663 ms
<3 x ICMP replies have been removed to fit the entire output on one page>
--- Leaf-102 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 4.663/5.56/6.794 ms
-----------------------------------------------------------------------
PING Leaf-103 (192.168.0.103): 56 data bytes
64 bytes from 192.168.0.103: icmp_seq=0 ttl=253 time=6.601 ms
64 bytes from 192.168.0.103: icmp_seq=1 ttl=253 time=7.512 ms
<3 x ICMP replies have been removed to fit the entire output on one page>
--- Leaf-103 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 3.674/5.892/7.512 ms
-----------------------------------------------------------------------
PING Leaf-104 (192.168.0.104): 56 data bytes
64 bytes from 192.168.0.104: icmp_seq=0 ttl=253 time=7.109 ms
64 bytes from 192.168.0.104: icmp_seq=1 ttl=253 time=7.777 ms
<3 x ICMP replies have been removed to fit the entire output on one page>
--- Leaf-104 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 5.869/6.822/7.777 ms
Leaf-101#

Example 2-5: Pinging to all Fabric switches Loopback 0 interfaces from Leaf-101.


In the next post, we will configure IP PIM Any-Source Multicast (ASM) routing in the Underlay network.
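
As a rough preview, that Underlay multicast configuration will revolve around commands of the following kind. This is only a sketch with placeholder values (the RP address 192.168.238.1 and the group range 239.1.1.0/25 are illustrative, not the values used in the next post); the actual Rendezvous Point design and group ranges follow from the Replication and Resources categories discussed earlier.

feature pim
!
ip pim rp-address 192.168.238.1 group-list 239.1.1.0/25
!
interface loopback 0
 ip pim sparse-mode
!
interface Ethernet1/1-2
  ip pim sparse-mode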