Monday, 1 January 2024

BGP EVPN Part-I: Challenges in Traditional Switched Datacenter Networks

Inefficient Link Utilization

The default Layer 2 control plane protocol in Cisco NX-OS is Rapid Per-VLAN Spanning Tree Plus (Rapid PVST+), which runs one instance of the IEEE 802.1w Rapid Spanning Tree Protocol (RSTP) per VLAN. Rapid PVST+ builds a VLAN-specific, loop-free Layer 2 data path from the STP root switch to all non-root switches. Whichever mode we use, Spanning Tree Protocol allows only one active path at a time within an instance and blocks all redundant links. A common way to put all inter-switch links to use is to place the STP root for odd and even VLANs on different switches. Even then, STP offers only VLAN-based traffic load balancing.
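The effect of root placement can be illustrated with a minimal Python sketch. It assumes a hypothetical three-switch triangle (SW1, SW2, SW3) with equal link costs and breaks ties by switch name, which is a simplification: real STP compares bridge IDs, path costs, and port IDs.

```python
from collections import deque

# Hypothetical three-switch triangle; the names are illustrative only.
LINKS = [("SW1", "SW2"), ("SW1", "SW3"), ("SW2", "SW3")]


def forwarding_links(root, links=LINKS):
    """Return the links left forwarding by a simplified spanning tree.

    Simplification: all links have equal cost and ties are broken by
    switch name, so a breadth-first tree from the root is sufficient.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    visited = {root}
    tree = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors[node]):
            if peer not in visited:
                visited.add(peer)
                tree.add(frozenset((node, peer)))
                queue.append(peer)
    return tree


def blocked_links(root, links=LINKS):
    """Return the links that are not part of the tree for this root."""
    tree = forwarding_links(root, links)
    return [link for link in links if frozenset(link) not in tree]


# Odd VLANs rooted on SW1, even VLANs rooted on SW2.
for vlan in (10, 11):
    root = "SW1" if vlan % 2 else "SW2"
    print(f"VLAN {vlan}: root={root}, blocked={blocked_links(root)}")
```

Each VLAN still has one blocked link, but the blocked link differs between the odd and even VLANs, so every link carries traffic for at least one of them. A single VLAN never uses more than its own tree.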


CPU and Memory Usage

After building a loop-free data path, switches running Rapid PVST+ monitor the state of the network by using per-instance Bridge Protocol Data Units (BPDUs). By default, each switch sends a BPDU for every instance from its designated ports at two-second intervals. With 2000 VLANs, a switch must therefore process 2000 BPDUs per interval on a link that carries all of them. To reduce the CPU and memory consumption caused by BPDU processing, we can use the Multiple Spanning Tree Protocol (MSTP, IEEE 802.1s), where VLANs are mapped to instances. For example, we can map VLANs 1-1000 to one instance and VLANs 1001-2000 to another, reducing the instance count from 2000 to 2. Although MSTP reduces the CPU burden, it does not provide flow-based frame load balancing. MSTP also adds complexity, especially when an MSTP region must connect to a non-MSTP region.
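The arithmetic behind these numbers can be sketched in a few lines of Python. The function names, the trunk-port parameter, and the split of 2000 VLANs into two equal ranges are assumptions made for illustration, not NX-OS behavior or configuration.

```python
HELLO_TIME_S = 2  # default BPDU interval mentioned in the text


def pvst_bpdus_per_second(vlan_count, trunk_ports=1, hello_time_s=HELLO_TIME_S):
    """Rapid PVST+: one BPDU per VLAN instance, per port, per hello interval."""
    return vlan_count * trunk_ports / hello_time_s


def mst_vlan_map(instance_ranges):
    """Expand {instance: (first_vlan, last_vlan)} into {vlan: instance}."""
    vlan_to_instance = {}
    for instance, (first, last) in instance_ranges.items():
        for vlan in range(first, last + 1):
            if vlan in vlan_to_instance:
                raise ValueError(f"VLAN {vlan} is mapped to two instances")
            vlan_to_instance[vlan] = instance
    return vlan_to_instance


# Assumed mapping: 2000 VLANs split evenly across two MST instances.
vlan_map = mst_vlan_map({1: (1, 1000), 2: (1001, 2000)})
print(len(vlan_map), "VLANs in", len(set(vlan_map.values())), "instances")
print(pvst_bpdus_per_second(2000), "BPDUs/s per trunk port with Rapid PVST+")
```

The sketch only counts instances and per-VLAN BPDUs. It does not model how MSTP encodes its instances inside a BPDU.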


Bandwidth Scaling

A Multi-Chassis Link Aggregation Group (MLAG) bundles Ethernet links from two distinct devices into one logical port (Port-Channel). This solution enables flow-based traffic load sharing. Cisco's MLAG solution for NX-OS is the virtual Port Channel (vPC).

Scaling up Port-Channel bandwidth requires adding new links to the bundle on both vPC peer switches. For example, to increase a 20 Gbps Port-Channel, we must add a 10 Gbps interface on each of the two vPC peer spine switches, with matching interfaces on the attached leaf switch. The resulting bundle is a 40 Gbps logical interface. So, what is the problem with this approach? Let us say that the Port-Channel utilization is approximately 30 Gbps. If one of the vPC peers fails, the available Port-Channel bandwidth drops from 40 Gbps to 20 Gbps. The 30 Gbps load then exceeds the remaining capacity, which can lead to link congestion and packet drops.
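The numbers in this example can be checked with a short calculation. The helper names below are hypothetical, and the sketch assumes the 40 Gbps bundle consists of two 10 Gbps links to each vPC peer.

```python
def port_channel_capacity_gbps(links_per_peer, link_gbps, peers_up):
    """Aggregate Port-Channel bandwidth across the vPC peers that are up."""
    return links_per_peer * link_gbps * peers_up


def load_ratio(offered_gbps, capacity_gbps):
    """Offered load divided by capacity; a value above 1.0 means congestion."""
    return offered_gbps / capacity_gbps


OFFERED_GBPS = 30  # approximate utilization from the example

both_peers = port_channel_capacity_gbps(links_per_peer=2, link_gbps=10, peers_up=2)
one_peer = port_channel_capacity_gbps(links_per_peer=2, link_gbps=10, peers_up=1)

print(f"Both peers up: {both_peers} Gbps, load ratio {load_ratio(OFFERED_GBPS, both_peers):.2f}")
print(f"One peer down: {one_peer} Gbps, load ratio {load_ratio(OFFERED_GBPS, one_peer):.2f}")
```

With both peers up the load ratio is 0.75. After one peer fails it rises to 1.5, so roughly a third of the offered traffic cannot be carried.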


In-Service Software Upgrade

In a Layer 3 network, we can remove a device from the data path without disturbing the data plane. For example, a Cisco vPC switch can undergo an In-Service Software Upgrade (ISSU) or hardware maintenance using the Graceful Insertion and Removal (GIR) method. We take the NX-OS switch out of service by placing it in System Maintenance Mode. In this mode, the switch advertises its OSPF point-to-point links with the maximum metric of 65535, withdraws all BGP routes, and shuts down the vPC peer-link, keepalive link, and member ports. In a BGP EVPN/VXLAN fabric, the switch also shuts down the loopback interface associated with the NVE interface. However, the process does not disable the loopback interfaces used by OSPF and BGP, so OSPF adjacencies and BGP peerings stay up. Once the device has completed the protocol isolation processes, we can safely perform the maintenance tasks without disturbing application traffic.

This kind of signaling is not supported in Layer 2 switched networks, so maintenance tasks that require removing a device cause some level of data flow disruption.
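The OSPF part of this isolation can be illustrated with a small shortest-path sketch. The two-leaf, two-spine topology, the link cost of 40, and the function names are assumptions for illustration. The sketch models only the metric change, not vPC or BGP behavior.

```python
import heapq

OSPF_MAX_METRIC = 65535


def shortest_path(graph, source, target):
    """Plain Dijkstra over {node: {neighbor: cost}}; returns (cost, path)."""
    queue = [(0, source, [source])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for peer, metric in graph[node].items():
            if peer not in seen:
                heapq.heappush(queue, (cost + metric, peer, path + [peer]))
    return None


def isolate(graph, node):
    """Copy the graph with every link advertised by `node` set to max metric."""
    isolated = {n: dict(links) for n, links in graph.items()}
    for peer in isolated[node]:
        isolated[node][peer] = OSPF_MAX_METRIC
    return isolated


# Hypothetical fabric: two leaf switches, two spine switches, cost 40 per link.
FABRIC = {
    "Leaf-1": {"Spine-1": 40, "Spine-2": 40},
    "Leaf-2": {"Spine-1": 40, "Spine-2": 40},
    "Spine-1": {"Leaf-1": 40, "Leaf-2": 40},
    "Spine-2": {"Leaf-1": 40, "Leaf-2": 40},
}

maintenance = isolate(FABRIC, "Spine-1")
print(shortest_path(maintenance, "Leaf-1", "Leaf-2"))
```

Spine-1 stays in the topology with its adjacencies intact, yet no leaf-to-leaf shortest path crosses it, because any path through it costs at least 65535 more than the path through Spine-2.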


Flood Reduction

Hosts resolve each other's IP-to-MAC address bindings using the Address Resolution Protocol (ARP). ARP Requests are transmitted as Layer 2 broadcast frames. When a switch receives such a request, it learns the requester's MAC address from the source MAC address of the incoming frame. Because the destination MAC address is the broadcast address, the switch floods the request out of every other interface where the VLAN is permitted. Due to this reactive data plane learning process, all switches in the VLAN must process ARP messages.

When a non-edge port on a switch transitions from Discarding/Learning to Forwarding, the switch treats this change as a Spanning Tree topology event and starts a TcWhile timer (twice the STP Hello time). While the TcWhile timer runs, the switch sets the TC (Topology Change) bit in each Bridge Protocol Data Unit (BPDU) it sends. On receiving a BPDU with the TC bit set, a switch starts its own TcWhile timer, sets the TC bit in its outgoing BPDUs, and clears its MAC address table. This process propagates throughout the switching infrastructure. While hosts may retain their MAC-IP address bindings, switches must re-learn MAC addresses through the flood-and-learn process.
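The flood-and-learn behavior, and the cost of a topology change, can be sketched with a toy switch model. The class, port names, and MAC addresses are hypothetical, and the model ignores VLANs, aging timers, and the TcWhile timer itself.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy single-VLAN switch that learns source MACs and floods unknowns."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was learned on

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source address, then return the list of egress ports."""
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if dst_mac == BROADCAST or out_port is None:
            # Broadcast or unknown unicast: flood out of all other ports.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []  # destination sits behind the ingress port: filter
        return [out_port]

    def topology_change(self):
        """Flush learned addresses, as on receipt of a BPDU with the TC bit set."""
        self.mac_table.clear()


HOST_A, HOST_B = "00:00:00:00:00:0a", "00:00:00:00:00:0b"
sw = LearningSwitch(["Eth1/1", "Eth1/2", "Eth1/3"])

print(sw.receive("Eth1/1", HOST_A, BROADCAST))  # ARP Request is flooded
print(sw.receive("Eth1/2", HOST_B, HOST_A))     # ARP Reply follows the learned entry
sw.topology_change()
print(sw.receive("Eth1/2", HOST_B, HOST_A))     # unicast is flooded again after the flush
```

After the flush, the hosts still hold their ARP entries and keep sending unicast frames, so the switch floods them as unknown unicast until it re-learns the addresses.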


Configuration

Most modern applications run on virtualized servers, commonly known as Virtual Machines (VMs). VM live migration moves a running VM from one host to another without significant downtime or disruption to the services running on that VM. From a network perspective, VM live migration requires the same Layer 2 segments and gateways to be available throughout the data center switches.

From an operational standpoint, VLANs must be configured not only on each switch but also on every inter-switch link in a switched network. Automation can help reduce the deployment time and the likelihood of configuration errors. A notable drawback remains, however: configuration changes must be deployed to the spine switches as well.


VLAN-Id Limitation

Layer 2 segmentation with any Spanning Tree Protocol relies on the 12-bit VLAN identifier in the 802.1Q tag of the Ethernet frame, giving 2^12 = 4096 VLAN IDs, of which 0 and 4095 are reserved. This limitation can pose challenges in environments that require more VLANs for segmentation.
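The 12-bit limit follows directly from the layout of the 802.1Q tag: a 16-bit TPID (0x8100) followed by a 16-bit TCI made up of a 3-bit priority (PCP), a 1-bit drop-eligible indicator (DEI), and the 12-bit VLAN ID. A minimal Python sketch of packing and unpacking the tag follows; the function names are illustrative.

```python
import struct

TPID_8021Q = 0x8100   # EtherType value that identifies an 802.1Q tag
VLAN_ID_BITS = 12


def pack_dot1q_tag(vlan_id, pcp=0, dei=0):
    """Build the 4-byte 802.1Q tag: TPID followed by PCP | DEI | VLAN ID."""
    if not 0 <= vlan_id < (1 << VLAN_ID_BITS):
        raise ValueError("VLAN ID does not fit in 12 bits")
    if not 0 <= pcp <= 7 or dei not in (0, 1):
        raise ValueError("PCP must be 0-7 and DEI must be 0 or 1")
    tci = (pcp << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)


def unpack_vlan_id(tag):
    """Extract the 12-bit VLAN ID from a 4-byte 802.1Q tag."""
    _tpid, tci = struct.unpack("!HH", tag)
    return tci & 0x0FFF


tag = pack_dot1q_tag(100)
print(tag.hex(), unpack_vlan_id(tag))
print("VLAN ID values:", 1 << VLAN_ID_BITS)
```

Any identifier of 4096 or above is rejected because it does not fit in the field, which is why a larger segment space requires a different encapsulation rather than a configuration change.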


MAC Address Overlapping

In a multi-tenant environment, where server team administrators are allowed to allocate MAC addresses for application servers manually, there's a risk of MAC address conflicts. Such conflicts can lead to application disruptions and usability issues.

Next: How BGP EVPN with VXLAN Responds to these Challenges.
