Sunday 28 April 2024

Single-AS EVPN Fabric with OSPF Underlay: Underlay Network Multicast Routing: Any-Source Multicast - ASM

 Underlay Network Multicast Routing: PIM-SM

In a traditional Layer 2 network, switches forward Intra-VLAN data traffic based on the destination MAC address of Ethernet frames. Therefore, hosts within the same VLAN must resolve each other's MAC-IP address bindings using Address Resolution Protocol (ARP). When a host wants to open a new IP connection with a device in the same subnet and the destination MAC address is unknown, the connection initiator generates an ARP Request message. In the message, the sender provides its own MAC-IP binding information and queries the MAC address of the owner of the target IP. The ARP Request messages are Layer 2 Broadcast messages with the destination MAC address FF:FF:FF:FF:FF:FF. 

EVPN Fabric is a routed network and requires a solution for Layer 2 Broadcast messages. We can either select a BGP EVPN-based Ingress-Replication (IR) solution or enable Multicast routing in the Underlay network. This chapter introduces the latter model. As in the previous Unicast Routing section, we follow the Multicast deployment workflow of the Nexus Dashboard Fabric Controller (NDFC) graphical user interface.

Figure 2-4 depicts the components needed to deploy the Multicast service in the Underlay network. The default option for "RP mode" is ASM (Any-Source Multicast). ASM is a multicast service model where receivers join a multicast group by sending PIM Join messages towards a group-specific Rendezvous Point (RP). The RP is a "meeting point": multicast sources send their traffic to the RP, which forwards it down the shared multicast tree towards the receivers. This process creates a shared tree from the RP to each receiver. The multicast-enabled routers, in turn, use the Protocol Independent Multicast – Sparse Mode (PIM-SM) routing protocol for forwarding multicast traffic from senders to receivers. In its default operation mode, PIM-SM allows receivers to switch from the shared multicast tree to a source-specific multicast tree. The other option for RP mode is Bidirectional PIM (BiDir), a variant of PIM-SM where multicast traffic always flows from the sender to the RP and from the RP down to the receivers over the shared multicast tree. In an EVPN Fabric, Leaf switches are both multicast senders (they forward ARP messages generated by local Tenant Systems) and receivers (they want to receive ARP messages generated by Tenant Systems connected to remote Leaf switches).

In our example, we define the multicast group range 239.1.1.0/24 using Any-Source Multicast (ASM) and configure both spine switches as Anycast-RPs, published to the Leaf switches with the shared IP address 192.168.254.1 (Loopback 251). Finally, we enable Protocol Independent Multicast (PIM) Sparse Mode on all Inter-Switch links and Loopback interfaces.


Figure 2-4: EVPN Fabric Protocol and Resources - Broadcast and Unknown Unicast.


Figure 2-5 on the next page illustrates the multicast configuration of our example EVPN Fabric's Underlay network. In this setup, the spine switches serve as Rendezvous Points (RPs) for the multicast group range 239.1.1.0/24. Spine-11 and Spine-12 publish the configured RP IP address 192.168.254.1/32 (Loopback 251) to the Leaf switches. These spine switches belong to the same RP-Set group, identifying themselves by their Loopback 0 interface addresses. On the Leaf switches, we define that the Multicast Group range 239.1.1.0/24 uses the Rendezvous Point 192.168.254.1.

Leaf switches act as both senders and receivers of multicast traffic. They indicate their willingness to receive traffic for the multicast group range 239.1.1.0/24 by sending a PIM Join message towards the RP using the destination IP address 224.0.0.13 (All PIM Routers). In the message, they specify the group they want to join. Leaf switches also register themselves with the Rendezvous Point as multicast traffic sources using a PIM Register message, which they send to the configured group-specific Rendezvous Point IP address.



Figure 2-5: EVPN Fabric Underlay Network Multicast Replication.


Configuration


Example 2-6 demonstrates the multicast configuration of the Spine switches. We enable the PIM protocol with the command feature pim. Then, we configure Loopback interface 251 with the IP address 192.168.254.1/32 and add it to the Unicast (OSPF) and Multicast (PIM-SM) routing processes. Besides, we enable Multicast routing on Loopback 0 and the Inter-Switch interfaces. After the interface configurations, we bind the RP address 192.168.254.1 to the Multicast Group List 239.1.1.0/24 and create an Anycast-RP set that lists the Spine switches sharing the RP address 192.168.254.1. Note that the switches belonging to the same Anycast-RP set synchronize the multicast sources registered to them. Synchronization information is accepted only from devices whose RP-Set identifier is listed as a member of the set.

feature pim
!
interface loopback251
  description Anycast-RP-Shared
  ip address 192.168.254.1/32
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
!
interface loopback 0
  ip pim sparse-mode
!
interface ethernet 1/1-4
  ip pim sparse-mode
!
ip pim rp-address 192.168.254.1 group-list 239.1.1.0/24
ip pim anycast-rp 192.168.254.1 192.168.0.11
ip pim anycast-rp 192.168.254.1 192.168.0.12

Example 2-6: Multicast Configuration - Spine-11 and Spine-12.


In Leaf switches, we first enable the PIM feature. Then, we include Loopback interfaces 0, 10, and 20 in multicast routing, as well as the Inter-Switch interfaces. Afterward, we specify the IP address of the multicast group-specific Rendezvous Point.


feature pim
!
ip pim rp-address 192.168.254.1 group-list 239.1.1.0/24
!
interface loopback 0
  ip pim sparse-mode
!
interface loopback 10
  ip pim sparse-mode
!
interface loopback 20
  ip pim sparse-mode
!
interface ethernet 1/1-2
  ip pim sparse-mode

Example 2-7: Multicast Configuration – Leaf-101 - 104.

In Example 2-8, we can see that both Spine switches belong to the Anycast-RP 192.168.254.1 set. The RP-Set identifier IP address of Spine-11 is marked with an asterisk (*). The command output also verifies that we have associated the Rendezvous Point with the Multicast Group Range 239.1.1.0/24. Example 2-9 verifies the RP-to-Multicast Group information from the Spine-12 perspective and Example 2-10 from the Leaf-101 perspective.
Spine-11# show ip pim rp vrf default
PIM RP Status Information for VRF "default"
BSR disabled
Auto-RP disabled
BSR RP Candidate policy: None
BSR RP policy: None
Auto-RP Announce policy: None
Auto-RP Discovery policy: None

Anycast-RP 192.168.254.1 members:
  192.168.0.11*  192.168.0.12

RP: 192.168.254.1*, (0),
 uptime: 00:06:24   priority: 255,
 RP-source: (local),
 group ranges:
 239.1.1.0/24
 
Example 2-8: RP-to-Multicast Group Mapping – Spine-11.


Spine-12# show ip pim rp vrf default
PIM RP Status Information for VRF "default"
BSR disabled
Auto-RP disabled
BSR RP Candidate policy: None
BSR RP policy: None
Auto-RP Announce policy: None
Auto-RP Discovery policy: None

Anycast-RP 192.168.254.1 members:
  192.168.0.11  192.168.0.12*

RP: 192.168.254.1*, (0),
 uptime: 00:05:51   priority: 255,
 RP-source: (local),
 group ranges:
 239.1.1.0/24

Example 2-9: RP-to-Multicast Group Mapping – Spine-12.


Leaf-101# show ip pim rp vrf default
PIM RP Status Information for VRF "default"
BSR disabled
Auto-RP disabled
BSR RP Candidate policy: None
BSR RP policy: None
Auto-RP Announce policy: None
Auto-RP Discovery policy: None

RP: 192.168.254.1, (0),
 uptime: 00:05:18   priority: 255,
 RP-source: (local),
 group ranges:
 239.1.1.0/24

Example 2-10: RP-to-Multicast Group Mapping – Leaf-101.

Example 2-11 confirms that we have enabled PIM-SM on all necessary interfaces. Additionally, the example verifies that Spine-11 has established four PIM adjacencies over the Inter-Switch links Eth1/1-4. Example 2-12 presents the same information from the viewpoint of Leaf-101.

Spine-11# show ip pim interface brief
PIM Interface Status for VRF "default"
Interface            IP Address      PIM DR Address  Neighbor  Border
                                                     Count     Interface
Ethernet1/1          192.168.0.11    192.168.0.101   1         no
Ethernet1/2          192.168.0.11    192.168.0.102   1         no
Ethernet1/3          192.168.0.11    192.168.0.103   1         no
Ethernet1/4          192.168.0.11    192.168.0.104   1         no
loopback0            192.168.0.11    192.168.0.11    0         no
loopback251          192.168.254.1   192.168.254.1   0         no

Example 2-11: Verification of PIM Interfaces – Spine-11.


Leaf-101# show ip pim interface brief
PIM Interface Status for VRF "default"
Interface            IP Address      PIM DR Address  Neighbor  Border
                                                     Count     Interface
Ethernet1/1          192.168.0.101   192.168.0.101   1         no
Ethernet1/2          192.168.0.101   192.168.0.101   1         no
loopback0            192.168.0.101   192.168.0.101   0         no

Example 2-12: Verification of PIM Interfaces – Leaf-101.

Example 2-13 provides more detailed information about the PIM neighbors of Spine-11.

Spine-11# show ip pim neighbor vrf default
PIM Neighbor Status for VRF "default"
Neighbor        Interface    Uptime    Expires   DR       Bidir-  BFD    ECMP Redirect
                                                 Priority Capable State     Capable
192.168.0.101   Ethernet1/1  00:11:29  00:01:41  1        yes     n/a     no
192.168.0.102   Ethernet1/2  00:10:39  00:01:35  1        yes     n/a     no
192.168.0.103   Ethernet1/3  00:10:16  00:01:29  1        yes     n/a     no
192.168.0.104   Ethernet1/4  00:09:58  00:01:18  1        yes     n/a     no
Example 2-13: Spine-11’s PIM Neighbors.

The "Mode" column in Example 2-14 is the initial evidence that we have deployed the Any-Source Multicast service.

Spine-11# show ip pim group-range
PIM Group-Range Configuration for VRF "default"
Group-range        Action Mode  RP-address      Shared-tree-range Origin
232.0.0.0/8        Accept SSM   -               -                 Local
239.1.1.0/24       -      ASM   192.168.254.1   -                 Static
Example 2-14: PIM Group Ranges.
The following three examples show that Multicast Group 239.1.1.0/24 is not yet active. We will get back to this after the EVPN Fabric is deployed and we have implemented our first EVPN segment.

Spine-11# show ip mroute
IP Multicast Routing Table for VRF "default"

(*, 232.0.0.0/8), uptime: 00:08:38, pim ip
  Incoming interface: Null, RPF nbr: 0.0.0.0
  Outgoing interface list: (count: 0)

Example 2-15: Multicast Routing Information Base (MRIB) – Spine-11.

Spine-12# show ip mroute
IP Multicast Routing Table for VRF "default"

(*, 232.0.0.0/8), uptime: 00:07:33, pim ip
  Incoming interface: Null, RPF nbr: 0.0.0.0
  Outgoing interface list: (count: 0)

Example 2-16: Multicast Routing Information Base (MRIB) – Spine-12.

Leaf-101# show ip mroute
IP Multicast Routing Table for VRF "default"

(*, 232.0.0.0/8), uptime: 00:06:29, pim ip
  Incoming interface: Null, RPF nbr: 0.0.0.0
  Outgoing interface list: (count: 0)

Example 2-17: Multicast Routing Information Base (MRIB) – Leaf-101.

Next, we configure the Border Gateway Protocol (BGP) as the control plane protocol for the EVPN Fabric's Overlay Network.  

Thursday 25 April 2024

Single-AS EVPN Fabric with OSPF Underlay: Underlay Network Unicast Routing

 Introduction


Figure 2-1 illustrates the components essential for designing a Single-AS, Multicast-enabled OSPF Underlay EVPN Fabric. These components need to be established before constructing the EVPN fabric. I've grouped them into five categories based on their function.

  • General: Defines the IP addressing scheme for Spine-Leaf Inter-Switch links, sets the BGP AS number and the number of BGP Route-Reflectors, and sets the MAC address of the Anycast Gateway used by client-side VLAN routing interfaces.
  • Replication: Specifies the replication mode for Broadcast, Unknown Unicast, and Multicast (BUM) traffic generated by Tenant Systems. The options are Ingress-Replication and Multicast (ASM or BiDir).
  • vPC: Describes vPC multihoming settings, such as the vPC Peer Link VLAN ID and Port-Channel ID, the vPC Auto-recovery and Delay Restore timers, and the vPC Peer Keepalive interface.
  • Protocol: Defines the numbering schema for Loopback interfaces, the OSPF Area identifier, and the OSPF process name.
  • Resources: Reserves IP address ranges for the Loopback interfaces defined in the Protocol category and for the Rendezvous Point specified in the Replication category. Besides, in this category, we reserve Layer 2 and Layer 3 VXLAN and VLAN ranges for Overlay network segments.

The model presented in Figure 2-1 outlines the steps for configuring an EVPN fabric using the Nexus Dashboard Fabric Controller (NDFC) “Create Fabric” tool. Each category in the image corresponds to a tab in the NDFC's Easy_Fabric_11_1 Fabric Template.


Figure 2-1: EVPN Fabric Network Side Building Blocks.


Underlay Network Unicast Routing


Let's start the EVPN Fabric deployment process with the General, Protocol, and Resources definitions for the Underlay network. We won't define a separate subnet for Spine-Leaf Inter-Switch links; instead, we'll use unnumbered interfaces. For the Underlay network routing protocol, we'll choose OSPF and define the process name (UNDERLAY-NET) and Area identifier (0.0.0.0) in the Protocol category. In the Protocol category, we also define the numbering schema for Loopback interfaces. The Underlay Routing Loopback ID will be 0 (OSPF Router ID and unnumbered Inter-Switch interfaces), the Overlay Network Loopback ID will be 10 (for BGP EVPN peering), and the Loopback ID for VXLAN tunneling will be 20 (outer source and destination IP addresses for VXLAN tunnel encapsulation). In the Resources category, we'll reserve the IP address ranges for the Loopback interfaces as follows: Loopback 0: 192.168.0.0/24, Loopback 10: 192.168.10.0/24, and Loopback 20: 192.168.20.0/24.



Figure 2-2: EVPN Fabric General, Protocol, and Resources Definitions.


Figure 2-3 illustrates the Loopback addresses we have chosen for the Leaf and Spine switches. Let's take the Leaf-101 switch as an example. We have assigned the IP address 192.168.0.101/32 to the Loopback 0 interface, which Leaf-101 uses as both the OSPF Router ID and the Inter-Switch link IP address. To the Loopback 10 interface, we've assigned the IP address 192.168.10.101/32, which Leaf-101 uses as both the BGP Router ID and the BGP EVPN peering address. To the Loopback 20 interface, we have assigned the IP address 192.168.20.101/32, which Leaf-101 uses as the outer source/destination IP address in VXLAN tunneling. Note that the Loopback 20 address is configured only on Leaf switches. The OSPF process advertises all three Loopback addresses in LSA (Link State Advertisement) messages to all its OSPF neighbors, which then process and forward them to their own OSPF neighbors.



Figure 2-3: EVPN Fabric Loopback Interface IP Addressing.

CLI Configuration


Example 2-1 shows the Underlay network configuration of the EVPN Fabric for Leaf-101. We enable the OSPF feature and create the OSPF process. Then, we configure the Loopback interfaces, assign them IP addresses, and associate them with the OSPF process. After that, we configure the Inter-Switch Link (ISL) interfaces Eth1/1 and Eth1/2 to borrow the IP address assigned to the Loopback 0 interface: 192.168.0.101/32. We specify the interface medium and OSPF network type as point-to-point and connect the interfaces to the OSPF process.

The commands "name-lookup" under the OSPF process and global "ip host" commands allow pinging the defined IP addresses by name. Additionally, the "show ip ospf neighbor" command displays OSPF neighbors' names instead of IP addresses. These commands are optional.

conf t
!
hostname Leaf-101
!
feature ospf 
!
router ospf UNDERLAY-NET
  router-id 192.168.0.101
  name-lookup
!
ip host Leaf-101 192.168.0.101
ip host Leaf-102 192.168.0.102
ip host Leaf-103 192.168.0.103
ip host Leaf-104 192.168.0.104
ip host Spine-11 192.168.0.11
ip host Spine-12 192.168.0.12
!
interface loopback 0
 description ** OSPF RID & Inter-Sw links IP addressing **
 ip address 192.168.0.101/32
 ip router ospf UNDERLAY-NET area 0.0.0.0
!
interface loopback 10
 description ** Overlay ControlPlane - BGP EVPN **
 ip address 192.168.10.101/32
 ip router ospf UNDERLAY-NET area 0.0.0.0
!
interface loopback 20
 description ** Overlay DataPlane - VTEP **
 ip address 192.168.20.101/32
 ip router ospf UNDERLAY-NET area 0.0.0.0
!
interface Ethernet1/1-2
  no switchport
  medium p2p
  ip unnumbered loopback0
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  no shutdown

Example 2-1: Leaf-101 - Underlay Network Configuration.
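
The Spine switches follow the same pattern. The block below is a sketch of what the corresponding Spine-11 underlay configuration could look like, reconstructed from the addressing in Figure 2-3 and the LSDB output in Example 2-3 (two Loopback interfaces and four Inter-Switch links). It is illustrative rather than a capture from the device.

conf t
!
hostname Spine-11
!
feature ospf
!
router ospf UNDERLAY-NET
  router-id 192.168.0.11
  name-lookup
!
interface loopback 0
 description ** OSPF RID & Inter-Sw links IP addressing **
 ip address 192.168.0.11/32
 ip router ospf UNDERLAY-NET area 0.0.0.0
!
interface loopback 10
 description ** Overlay ControlPlane - BGP EVPN **
 ip address 192.168.10.11/32
 ip router ospf UNDERLAY-NET area 0.0.0.0
!
interface Ethernet1/1-4
  no switchport
  medium p2p
  ip unnumbered loopback0
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  no shutdown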

Verifications

Example 2-2 shows that the Leaf-101 switch's Ethernet interfaces 1/1 and 1/2, and all three Loopback interfaces, belong to the OSPF process UNDERLAY-NET in OSPF area 0.0.0.0. The OSPF network type of the Ethernet interfaces is point-to-point. The example also verifies that the Leaf-101 switch has two OSPF neighbors, Spine-11 and Spine-12.


Leaf-101# show ip ospf interface brief ; show ip ospf neighbors ;
--------------------------------------------------------------------------------
 OSPF Process ID UNDERLAY-NET VRF default
 Total number of interface: 5
 Interface               ID     Area            Cost   State    Neighbors Status
 Eth1/1                  4      0.0.0.0         40     P2P      1         up
 Eth1/2                  5      0.0.0.0         40     P2P      1         up
 Lo0                     1      0.0.0.0         1      LOOPBACK 0         up
 Lo10                    2      0.0.0.0         1      LOOPBACK 0         up
 Lo20                    3      0.0.0.0         1      LOOPBACK 0         up
--------------------------------------------------------------------------------
 OSPF Process ID UNDERLAY-NET VRF default
 Total number of neighbors: 2
 Neighbor ID     Pri State            Up Time  Address         Interface
 Spine-11          1 FULL/ -          00:00:30 192.168.0.11    Eth1/1
 Spine-12          1 FULL/ -          00:00:30 192.168.0.12    Eth1/2

Example 2-2: Leaf-101 show ip ospf neighbors.


Example 2-3 on the next page displays the OSPF Link State Database (LSDB) for the Leaf-101 switch. The first section shows that all switches in the EVPN Fabric have sent descriptions of their OSPF links. Each Spine switch has six OSPF interfaces (2 x Loopback interfaces and 4 x Ethernet interfaces), while each Leaf switch has five OSPF interfaces (3 x Loopback interfaces and 2 x Ethernet interfaces). The second section provides detailed OSPF link descriptions for the Spine-11 switch.

Leaf-101# sh ip ospf database ; show ip ospf database 192.168.0.11 detail
--------------------------------------------------------------------------------
        OSPF Router with ID (Leaf-101) (Process ID UNDERLAY-NET VRF default)
                Router Link States (Area 0.0.0.0)
Link ID         ADV Router      Age        Seq#       Checksum Link Count
192.168.0.11    Spine-11        51         0x8000012c 0x3fcd   6
192.168.0.12    Spine-12        51         0x8000012c 0x4fb9   6
192.168.0.101   Leaf-101        50         0x8000012e 0x9adf   5
192.168.0.102   Leaf-102        615        0x8000012c 0xd0a6   5
192.168.0.103   Leaf-103        607        0x8000012c 0x036f   5
192.168.0.104   Leaf-104        599        0x8000012c 0x3538   5
--------------------------------------------------------------------------------
        OSPF Router with ID (Leaf-101) (Process ID UNDERLAY-NET VRF default)
                Router Link States (Area 0.0.0.0)
   LS age: 51
   Options: 0x2 (No TOS-capability, No DC)
   LS Type: Router Links
   Link State ID: 192.168.0.11
   Advertising Router: Spine-11
   LS Seq Number: 0x8000012c
   Checksum: 0x3fcd
   Length: 96
    Number of links: 6

     Link connected to: a Stub Network
      (Link ID) Network/Subnet Number: 192.168.0.11
      (Link Data) Network Mask: 255.255.255.255
       Number of TOS metrics: 0
         TOS   0 Metric: 1

     Link connected to: a Stub Network
      (Link ID) Network/Subnet Number: 192.168.10.11
      (Link Data) Network Mask: 255.255.255.255
       Number of TOS metrics: 0
         TOS   0 Metric: 1

     Link connected to: a Router (point-to-point)
     (Link ID) Neighboring Router ID: 192.168.0.101
     (Link Data) Router Interface address: 0.0.0.3
       Number of TOS metrics: 0
         TOS   0 Metric: 40

     Link connected to: a Router (point-to-point)
     (Link ID) Neighboring Router ID: 192.168.0.102
     (Link Data) Router Interface address: 0.0.0.4
       Number of TOS metrics: 0
         TOS   0 Metric: 40

     Link connected to: a Router (point-to-point)
     (Link ID) Neighboring Router ID: 192.168.0.103
     (Link Data) Router Interface address: 0.0.0.5
       Number of TOS metrics: 0
         TOS   0 Metric: 40

     Link connected to: a Router (point-to-point)
     (Link ID) Neighboring Router ID: 192.168.0.104
     (Link Data) Router Interface address: 0.0.0.6
       Number of TOS metrics: 0
         TOS   0 Metric: 40

Example 2-3: Leaf-101 – OSPF Links State Database.


Example 2-4 confirms that the Leaf-101 switch has run the Dijkstra algorithm against the LSDB and installed the best routes into the Unicast routing table. Note that for all Leaf switch Loopback IP addresses, there are two equal-cost paths via both Spine switches.


Leaf-101# show ip route ospf
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>
192.168.0.11/32, ubest/mbest: 1/0
    *via 192.168.0.11, Eth1/1, [110/41], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.0.12/32, ubest/mbest: 1/0
    *via 192.168.0.12, Eth1/2, [110/41], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.0.102/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.0.103/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.0.104/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.10.11/32, ubest/mbest: 1/0
    *via 192.168.0.11, Eth1/1, [110/41], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.10.12/32, ubest/mbest: 1/0
    *via 192.168.0.12, Eth1/2, [110/41], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.10.102/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.10.103/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.10.104/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.20.102/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.20.103/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
192.168.20.104/32, ubest/mbest: 2/0
    *via 192.168.0.11, Eth1/1, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra
    *via 192.168.0.12, Eth1/2, [110/81], 00:06:40, ospf-UNDERLAY-NET, intra

Example 2-4: Leaf-101 – Unicast Routing Table.

Example 2-5 confirms that the Leaf-101 switch has IP connectivity to all Fabric switches' Loopback 0 interfaces. Note that I've added the dashed separator lines for clarity.


Leaf-101#ping Spine-11 ; ping Spine-12 ; ping Leaf-102 ; ping Leaf-103 ; ping Leaf-104
PING Spine-11 (192.168.0.11): 56 data bytes
64 bytes from 192.168.0.11: icmp_seq=0 ttl=254 time=4.715 ms
64 bytes from 192.168.0.11: icmp_seq=1 ttl=254 time=4.909 ms
<3 x ICMP replies have been removed to fit the entire output on one page>
--- Spine-11 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 1.849/3.369/4.909 ms
-----------------------------------------------------------------------
PING Spine-12 (192.168.0.12): 56 data bytes
64 bytes from 192.168.0.12: icmp_seq=0 ttl=254 time=3.14 ms
64 bytes from 192.168.0.12: icmp_seq=1 ttl=254 time=2.486 ms
<3 x ICMP replies have been removed to fit the entire output on one page>
--- Spine-12 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 1.896/2.279/3.14 ms
-----------------------------------------------------------------------
PING Leaf-102 (192.168.0.102): 56 data bytes
64 bytes from 192.168.0.102: icmp_seq=0 ttl=253 time=6.124 ms
64 bytes from 192.168.0.102: icmp_seq=1 ttl=253 time=4.663 ms
<3 x ICMP replies have been removed to fit the entire output on one page>
--- Leaf-102 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 4.663/5.56/6.794 ms
-----------------------------------------------------------------------
PING Leaf-103 (192.168.0.103): 56 data bytes
64 bytes from 192.168.0.103: icmp_seq=0 ttl=253 time=6.601 ms
64 bytes from 192.168.0.103: icmp_seq=1 ttl=253 time=7.512 ms
<3 x ICMP replies have been removed to fit the entire output on one page>
--- Leaf-103 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 3.674/5.892/7.512 ms
-----------------------------------------------------------------------
PING Leaf-104 (192.168.0.104): 56 data bytes
64 bytes from 192.168.0.104: icmp_seq=0 ttl=253 time=7.109 ms
64 bytes from 192.168.0.104: icmp_seq=1 ttl=253 time=7.777 ms
<3 x ICMP replies have been removed to fit the entire output on one page>
--- Leaf-104 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 5.869/6.822/7.777 ms
Leaf-101#

Example 2-5: Pinging to all Fabric switches Loopback 0 interfaces from Leaf-101.


In the next post, we configure IP-PIM Any-Source Multicast (ASM) routing in the Underlay network. 


Monday 22 April 2024

BGP EVPN with VXLAN: Fabric Overview

 




The figure illustrates the simplified operation model of an EVPN Fabric. At the bottom of the figure are four devices, Tenant Systems (TS), connected to the network. When speaking about a TS, I am referring to a physical or virtual host. Besides, a Tenant System can be a forwarding component attached to one or more Tenant-specific Virtual Networks. Examples of TS forwarding components include firewalls, load balancers, switches, and routers.

We have connected TS1 and TS2 to VLAN 10 and TS3-4 to VLAN 20. VLAN 10 is associated with EVPN Instance (EVI) 10010 and VLAN 20 with EVI 10020. Note that the VLAN ID is switch-specific, while the EVI is Fabric-wide. Thus, subnet A can have VLAN ID XX on one Leaf switch and VLAN ID YY on another. However, we must map both VLAN XX and VLAN YY to the same EVPN Instance.
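
On Cisco NX-OS, this VLAN-to-EVI mapping is done with the vn-segment command under the VLAN. The snippet below is a minimal sketch of the mapping described above; the VLAN and VNI values match the figure, but the block itself is illustrative rather than a copy of a lab configuration.

feature vn-segment-vlan-based
!
vlan 10
  vn-segment 10010
vlan 20
  vn-segment 10020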

When a TS connected to the Fabric sends its first Ethernet frame, the Leaf switch stores the source MAC address in the MAC address table, from where it is copied to the Layer 2 routing table (L2RIB) of the EVPN Instance. Then, the BGP process of the Leaf switch advertises the MAC address with its reachability information to its BGP EVPN peers, essentially the Spine switches. The Spine switches propagate the BGP Update message to their own BGP peers, essentially the other Leaf switches. The Leaf switches install the received MAC address into the L2RIB of the EVI, from which the MAC address is copied to the MAC address table of the VLAN associated with the EVPN Instance. Before TS1 and TS2 in the same VLAN can start communicating, the same process must also happen in the opposite direction (the TS2 MAC learning process). The operation described above is a Control Plane operation.

The traffic between TS1 and TS2 passes through switches Leaf-101, Spine, and Leaf-102. Leaf-101 encapsulates the Ethernet frame sent by TS1 with outer MAC (Spine) / IP (Leaf-102) / UDP (port 4789) headers and a VXLAN header that identifies the EVPN Instance using the Layer 2 Virtual Network Identifier (L2VNI). After verifying that the destination address of the outer IP header belongs to it, Leaf-102 removes the tunnel encapsulation and forwards only the original Ethernet frame to TS2.

EVPN Instances associated with the same Tenant/VRF Context share a common L3VNI, over which Ethernet frames from different segments are sent using the L3VNI identifier. To route traffic between two EVPN segments, each VLAN naturally must have a routing interface. The VLAN routing interface is configured on each Leaf switch and associated with the same Anycast Gateway MAC address. In an EVPN Fabric, gateway redundancy does not rely on HSRP, VRRP, or GLBP. Instead, the gateway is configured on every Leaf switch where we have deployed the VLAN. The EVPN routing solution between EVPN segments is called Integrated Routing and Bridging (IRB). Cisco Nexus switches use Symmetric IRB (I will explain its operation in upcoming chapters).
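
As a rough illustration of the distributed Anycast Gateway model, the sketch below shows how the same gateway MAC and VLAN routing interface could be configured on every Leaf switch where VLAN 10 is deployed. The gateway MAC address, VRF name, and subnet are my own placeholders, not values from the example fabric.

feature interface-vlan
feature fabric forwarding
!
fabric forwarding anycast-gateway-mac 2020.0000.00aa
!
interface Vlan10
  no shutdown
  vrf member TENANT1
  ip address 192.168.11.1/24
  fabric forwarding mode anycast-gateway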

I am constantly trying to develop simpler and clearer ways to describe the EVPN Fabric operating model. Here is yet another one. This time I am publishing the article only on LinkedIn, not on my own blog (at least not yet). In the next article, I will apply the same model when presenting the configuration of an EVPN Fabric.

Sunday 24 March 2024

Azure Networking: Cloud Scale Load Balancing

 Introduction


During the load balancer deployment process, we define a virtual IP (a.k.a. front-end IP) for our published service. As a next step, we create a backend (BE) pool to which we attach Virtual Machines using either their associated vNIC or Direct IP (DIP). Then, we bind the VIP to the BE pool using an Inbound rule. Besides, in this phase, we create health probes and associate them with the Inbound rules to monitor the VMs' service availability. If the VMs in the backend pool also initiate outbound connections, we build an outbound policy, which states the source Network Address Translation (SNAT) rule (DIP, src port > VIP, src port).

This chapter provides an overview of the components of the Azure load balancer service: Centralized SDN Controller, Virtual Load balancer pools, and Host Agents. In this chapter, we discuss control plane and data plane operation.  


Management & Control Plane – External Connections

Figure 20-1 depicts our example diagram. The top-most box, Load balancer deployment, shows our LB settings. We intend to forward HTTP traffic from the Internet to VIP 1.2.3.4, to either DIP 10.0.0.4 (vm-beetle) or DIP 10.0.0.5 (vm-bailey). The health probe associated with the inbound rule uses TCP port 80 for the availability check.

The Azure LB service control plane is implemented as a centralized, highly available SDN controller. The system consists of several controller instances, one of which is elected as the active instance. When the active controller receives our LB configuration, it distributes it to the other replicas. The active controller creates VIP-to-DIP mapping entries with the destination protocol/port combination and programs them, together with the VIP, to the load balancers in the load balancer pool. Besides programming the load balancers, the SDN controller monitors their health.

The load balancer pool consists of several instances, which all advertise their configured VIPs to the upstream routers via BGP, defining themselves as the next hop. When one of the upstream routers receives an ingress packet to a VIP, it uses Equal Cost Multi-Path (ECMP) for selecting the next-hop load balancer. Therefore, packets belonging to the same flow may not always end up at the same load balancer. However, this is not a problem: the load balancer units use the same hashing algorithm for selecting the DIP from the BE members, so they all choose the same DIP. Upstream routers and load balancers also use BGP as a failure detection mechanism. When a load balancer goes out of service, the BGP peering goes down. As a reaction, the upstream routers exclude the failed load balancer from the ECMP process.

The Host Agent (HA) is the third piece of the load balancer service puzzle. The SDN controller sends the VIP-to-DIP destination NAT (DNAT) policies to the HA, which programs them to the Virtual Filtering Platform's (VFP) NAT/SLB layer. Besides, the HA monitors the BE member VM's availability using our configured health probe. When the service in the VM stops responding, the host agent reports this to the SDN controller, which removes the failed DIP from the load balancers' VIP-to-DIP mapping table. The HA also adjusts a Network Security Group (NSG) rule in the VFP Security layer to allow the monitoring traffic.



Figure 20-1: Cloud Scale LB Management & Control Plane Operation.

Data Plane - External Connections


Figure 20-2 depicts the data plane processes when an external host starts the TCP three-way handshake with VIP 1.2.3.4. I have excluded redundant components from the figure to keep it simple. The data packet with the TCP SYN flag set arrives at the Edge router Ro-1, which has two equal-cost next hops for 1.2.3.4 installed in its RIB. Therefore, it runs its ECMP hashing algorithm and selects the next hop via 10.1.1.1 (LB-1). Next, it forwards the packet toward LB-1.

Based on the TCP SYN flag, LB-1 notices that this is a new data flow. The destination IP address and transport layer information (TCP/80) match the inbound rule programmed into its VIP mapping table. The hashing algorithm, calculated over the 5-tuple, selects the DIP 10.0.0.4 for this packet. After choosing the destination DIP, the load balancer updates its flow table (not shown in the figure). It leaves the original packet intact and adds tunnel headers (IP/UDP/VXLAN), using its own IP address as the source and the DIP as the destination address. Note that the load balancers first check ingress non-SYN TCP and UDP packets against the flow table. If no matching entries are found, the packets are processed via the VIP-to-DIP mapping table.

The encapsulated packet arrives at the host. The tenant is identified based on the Virtual Network Identifier (VNI) carried in the VXLAN header. The VNet layer decapsulates the packet, and based on the SYN flag in the TCP header, it is recognized as the first packet of a new flow. Therefore, the L3/L4 header information is sent through the VFP layers (see details in Chapter 14). The Header Transposition engine then encodes the result into the Unified Flow Table (UFT) with the related actions: decapsulate, DNAT 1.2.3.4 > 10.0.0.4, and Allow (an NSG allows TCP/80 traffic). It also creates a paired flow entry, since the 5-tuple hash run against the reply message gives an equal flow ID.

After processing the TCP SYN packet, VM 10.0.0.4 replies with a TCP SYN-ACK packet. It uses its own IP address as the source and the original IP address 172.16.1.4 as the destination. The UFT lookup for the flow ID matches the paired entry. Therefore, the packet is processed (Allow and SNAT: DIP > VIP) and forwarded without running it through the VFP layers. The unencapsulated packet is sent directly to Ro-1, bypassing LB-1 (Direct Server Return - DSR). Therefore, the load balancer must process ingress flows only. This is one of the load balancer service optimization solutions.

The last packet of the TCP three-way handshake (and the subsequent packets of the flow) from the external host may end up in LB-2, but as it uses the same hashing algorithm as LB-1, it selects the same DIP.



Figure 20-2: Cloud Scale LB Data Plane Operation.

Data Plane and Control Plane for Outbound Traffic


A virtual machine with a customer-assigned public IP address can establish outbound connections to the Internet. Furthermore, such VMs are exposed and accessible on the Internet. VMs lacking a public IP address are automatically assigned an Azure-assigned IP address for egress-only Internet connections. Besides, VMs without public IP addresses can also utilize the Azure NAT Gateway for outbound connectivity. In addition to these three options, we can use the front-end IP address assigned to the load balancer for private IP-only VMs' outbound Internet access by creating an outbound rule with source NAT.

In Figure 20-3, we have an outbound rule where VIP 1.2.3.4 is associated with backend pool members (vm-beetle with the DIP 10.0.0.4). This rule is programmed to the SDN controller’s VIP mapping table.

Next, vm-beetle initiates a TCP three-way handshake process with an external host at the IP address 172.16.1.4. The TCP SYN flag indicates the start of a new data flow. Before packet forwarding, the host agent requests a Virtual IP (VIP) and source ports for the outbound connection's source Network Address Translation (SNAT). To accommodate the possibility of multiple connections from the VM, the controller allocates eight source ports for vm-beetle (we can adjust the port count). This allocation strategy eliminates the need for the host agent to request a new source port each time the VM initiates a new connection.

Subsequently, the controller synchronizes the DIP-to-VIP and source port mapping information across standby controllers. It then proceeds to program the mapping tables of the load balancer because the return traffic goes via load balancers. After these steps, the controller responds to the host agent's request. The host agent, in turn, updates the Source NAT (SNAT) policy within the Virtual Forwarding Plane (VFP) layer. 

After the parser component has transmitted the original packet's header group metadata through the Virtual Forwarding Plane (VFP) layers, the header transposition engine updates the Unified Flow Table. Finally, the packet is directed toward the Internet, circumventing the load balancer.



Figure 20-3: Outbound Traffic.

Fast Path

The Azure Load Balancer service must manage an extensive volume of network traffic. Therefore, Azure has developed several load balancer optimization solutions. For example, the source and destination NAT are offloaded from the load balancers to the hosts (to the VFP's NAT layer). That enables the Direct Server Return (DSR) solution, where the return traffic from the VM is routed without tunnel encapsulation towards the destination, bypassing the load balancers.

This section introduces another optimization solution known as Fastpath. Once the TCP three-way handshake between load-balanced virtual machines is complete, the data flow is redirected straight between the VMs, bypassing the load balancers in both directions. The solution uses a redirect message in which the load balancer conveys its VIP, DIP, and port mapping information to the source side. The Fastpath solution has many similarities with Dynamic Multipoint VPN (DMVPN).

Figure 20-4 depicts the three-way handshake process between two load-balanced VMs. The VM named 'vm-beetle' hosts a service accessible through VIP 1.1.1.1 (VIP-A). Conversely, the service running on 'vm-bailey' (DIP 10.2.2.4) is reached using VIP 2.2.2.2 (VIP-B). Vm-beetle starts a TCP three-way handshake with vm-bailey by sending a TCP SYN message to VIP 2.2.2.2. The TCP SYN packet is the first packet of this connection. Therefore, the packet's L3/L4 header information is forwarded through the VFP layers to create a new flow entry with associated actions.

An NSG allows the connection. The destination IP address is public, so the header group information is sent to the NAT layer. The NAT layer rewrites the source IP to 1.1.1.1, as defined in our load balancer's outbound NAT policy. Because the destination IP address is from the public IP address space, the TCP SYN message is sent unencapsulated toward the destination (Direct Server Return). Therefore, the VNet layer doesn't create any encapsulation action. After the header group information has passed all the layers, the header transposition engine rewrites the changed source IP address, and the flow is encoded into the Unified Flow Table (UFT) with the actions (allow, snat). The subsequent packets of this flow are forwarded based on the UFT. Then the TCP SYN message is sent to the destination VIP.

The load balancer VIP-B receives the packet. It checks its VIP mapping table, selects the destination DIP using the hash algorithm, and programs the three-tuple (VIP, DIP, port) into its flow table. Then it encapsulates the packet and forwards it to the host of vm-bailey. The host agent intercepts the ingress TCP SYN packet. The parser component takes the header group information and runs it through the VFP layers. The VNet layer action removes the outer VXLAN header, the NAT layer action rewrites the destination VIP 2.2.2.2 to DIP 10.2.2.4 based on the load balancer's inbound policy, and the NSG layer allows the packet. Then, after updating the UFT, the packet is forwarded to vm-bailey. Note that the UFT entries created in vm-beetle's and vm-bailey's UFT tables also have paired, reversed-direction flow entries.

The VM vm-bailey accepts the connection by sending a TCP SYN-ACK message back to vm-beetle. Because of the paired flow entry, this message can be forwarded based on the UFT. The source NAT from DIP 10.2.2.4 to VIP 2.2.2.2 is the only rewrite action for the packet. Then it is sent towards VIP 1.1.1.1. The load balancer VIP-A receives the TCP SYN-ACK message and selects the DIP from its VIP mapping table. Then it encapsulates the packet and sends it towards the host of vm-beetle. When the host of vm-beetle receives the message, it processes it based on the UFT. First, the VXLAN header is removed. Then the destination VIP 1.1.1.1 is changed to DIP 10.1.1.4. Finally, the packet is passed to vm-beetle because of the allow action. The TCP ACK sent by vm-beetle is processed in the same way.


Figure 20-4: Fast Path: TCP Three-Way Handshake.

After the VMs have completed the TCP three-way handshake, the load balancer VIP-B sends a redirect message to VIP-A (VIP 1.1.1.1), telling it that the rest of the packets of the flow '1.1.1.1, 2.2.2.2, TCP/80' can be sent directly to DIP 10.2.2.4. Based on its flow table (1.1.1.1 mapped to 10.1.1.4), VIP-A forwards the redirect message toward DIP 10.1.1.4. Besides, VIP-A sends a redirect message to VIP-B describing its own VIP mapping information in the same way as VIP-B did. When the redirect process is completed and the UFTs are updated with the new flow entries, the packets between vm-beetle and vm-bailey are sent over the VXLAN tunnel without a source NAT action, using the DIPs as both source and destination IP addresses.

Figure 20-5: Fast Path: Redirect.




Tuesday 12 March 2024

Adding TS’s IP Address to MAC-VRF (L2RIB) and IP-VRF (L3RIB)

In the previous chapter, we discussed how a VTEP learns the local TS's MAC address and the process through which the MAC address is programmed into BGP tables. An example VTEP device was configured with a Layer 2 VLAN and an EVPN Instance without deploying a VRF Context or VLAN routing interface. This chapter introduces, at a theoretical level, how the VTEP device, besides the TS's MAC address, learns the TS's IP address information after we have configured the VRF Context and routing interface for our example VLAN.


Figure 1-3: MAC-VRF Tenant System’s IP Address Propagation.

I have divided Figure 1-3 into three sections. The section on the top left, Integrated Routing and Bridging - IRB, illustrates the components required for intra-tenant routing and their interdependencies. By configuring a Virtual Routing and Forwarding Context (VRF Context), we create a closed routing environment with a per-tenant IP-VRF L3 Routing Information Base (L3RIB). Within the VRF Context, we define the Layer 3 Virtual Network Identifier (L3VNI) along with the Route Distinguisher (RD) and Route Target (RT) values. The RD of the VRF Context enables the use of overlapping IP addresses across different tenants. Based on the RT value of the VRF Context, remote VTEP devices can import the IP address information into the correct BGP tables, from which it is installed into the IP-VRF's L3RIB. For the VRF Context, we configure a Layer 2 VLAN (VLAN 50 in the figure), which we associate with the L3VNI. Besides, we create an IP address-less routing interface for the VLAN and bind it to the VRF Context. These configurations are necessary because VXLAN is a MAC-in-IP/UDP encapsulation and requires an inner Ethernet header with source/destination MAC addresses for routed inter-VN traffic.

By deploying the VLAN with its routing interface, besides reserving the required hardware resources, the VTEP device gains a system MAC address it can use in the inner Ethernet header of routed packets. To enable Data Plane tunneling, we must associate the VRF Context with the NVE interface. After setting up the VRF Context with its components, we can attach VLANs requiring inter-VN or external connections to the VRF Context. The upcoming chapters show how we deploy the configuration and verify the Control Plane and Data Plane operation.
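
The sketch below summarizes these dependencies as a minimal NX-OS configuration: a VRF Context with an L3VNI, the Layer 2 VLAN (VLAN 50) mapped to that L3VNI, an IP address-less VLAN routing interface bound to the VRF, and the VRF association under the NVE interface. The VRF name and VNI value are assumptions used for illustration only.

vlan 50
  vn-segment 10050
!
vrf context TENANT1
  vni 10050
  rd auto
  address-family ipv4 unicast
    route-target both auto
    route-target both auto evpn
!
interface Vlan50
  no shutdown
  vrf member TENANT1
  ip forward
!
interface nve1
  member vni 10050 associate-vrf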

The section at the bottom left, MAC-VRF Update Process – IP Address, describes the process where the IP address is associated with the MAC address of TS in the EVPN Instance's MAC-VRF L2RIB.

When we power on the TS, it may send a Gratuitous ARP (GARP) message to ensure the uniqueness of its IP address. The TS may also send an ARP Request to resolve the IP-MAC binding of its configured Default Gateway. In our example, the VTEP switch receives the GARP message from TS1 through the Attachment Circuit (AC) Eth1/10. The Host Mobility Manager (HMM) encodes the Sender MAC and IP address information carried in the GARP message, together with the VLAN-specific routing interface (Gateway - GW), into the ARP table. Note that the VLAN must have a routing interface for the MAC-IP bindings of devices connected to the VLAN to be stored in the ARP table.

Next, HMM programs the ARP table with MAC/IP/GW information and stores it in the Attachment Circuit's Local Host Database (LHDB). Then, HMM encodes MAC/IP/Next-Hop information into the MAC-VRF's L2RIB along with the VLAN Identifier and L3VNI (Layer 3 Virtual Network Identifier). In addition, HMM encodes the IP address information as a host route in the IP-VRF. 

The BGP process programs the information into the Loc-RIB. EVPN NLRI includes the Route Distinguisher (RD), MAC and IP address information, and corresponding L2VNI and L3VNI identifiers. Route Targets, Encapsulation Type, and Router MAC are encoded as Extended Path Attributes. Finally, the routing information about EVPN NLRI is sent through the BGP Policy Engine to the BGP Adj-RIB-Out table and eventually to BGP EVPN peers.

Wednesday 28 February 2024

Another Ethernet VPN (EVPN) Introduction

Ethernet VPN (EVPN) Introduction


Instead of being a protocol, EVPN is a solution that utilizes the Multi-Protocol Border Gateway Protocol (MP-BGP) for its control plane in an overlay network. Besides, EVPN employs Virtual extensible Local Area Network (VXLAN) encapsulation for the data plane of the overlay network.


EVPN Control Plane: MP-BGP AFI: L2VPN, SAFI: EVPN


Multi-Protocol BGP (MP-BGP) is an extension of BGP-4 that allows BGP speakers to encode Network Layer Reachability Information (NLRI) of various address types, including IPv4/6, VPNv4, and MAC addresses, into BGP Update messages. 

The MP_REACH_NLRI path attribute (PA) carried within MP-BGP update messages includes Address Family Identifier (AFI) and Subsequent Address Family Identifier (SAFI) attributes. The combination of AFI and SAFI determines the semantics of the carried Network Layer Reachability Information (NLRI). For example, AFI-25 (L2VPN) with SAFI-70 (EVPN) defines an MP-BGP-based L2VPN solution, which extends a broadcast domain in a multipoint manner over a routed IPv4 infrastructure using an Ethernet VPN (EVPN) solution. 

BGP EVPN Route Types (BGP RT) carried in BGP update messages describe the advertised EVPN NLRIs (Network Layer Reachability Information) type. Besides publishing IP Prefix information with IP Prefix Route (EVPN RT 5), BGP EVPN uses MAC Advertisement Route (EVPN RT 2) for advertising hosts’ MAC/IP address reachability information. The Virtual Network Identifiers (VNI) describe the VXLAN segment of the advertised MAC/IP addresses. 

Besides these two fundamental route types, BGP EVPN can create a shared delivery tree for Layer 2 Broadcast, Unknown Unicast, and Multicast (BUM) traffic by using the Inclusive Multicast Route (EVPN RT 3) for joining an Ingress Replication tunnel. This solution does not require a Multicast-enabled Underlay network. The other option for BUM traffic is a Multicast-capable Underlay network.

While EVPN RT 3 is used for building a delivery tree for BUM traffic, the Tenant Routed Multicast (TRM) solution provides tenant-specific multicast forwarding between senders and receivers. TRM is based on Multicast VPN (BGP AFI:1/SAFI:5 – IPv4/Mcast-VPN). TRM uses the MVPN Source Active A-D Route (MVPN RT 5) to publish the multicast stream's source address and group.

Using BGP EVPN's native multihoming solution, we can establish a port channel between a Tenant System (TS) and two or more VTEP switches. From the perspective of the TS, a traditional port channel is deployed by bundling a set of Ethernet links into a single logical link. On the multihoming VTEP switches, these links are associated with a logical Port-Channel interface called an Ethernet Segment (ES).

EVPN utilizes the EVPN Ethernet Segment Route (EVPN RT 4) as a signaling mechanism between member units to indicate which Ethernet Segments they are connected to. Additionally, VTEP switches use this EVPN RT 4 for selecting a Designated Forwarder (DF) for Broadcast, Unknown unicast, and Multicast (BUM) traffic.

When EVPN Multihoming is enabled on a set of VTEP switches, all local MAC/IP Advertisement Routes include the ES Type and ES Identifier. The EVPN multihoming solution employs the EVPN Ethernet A-D Route (EVPN RT 1) for rapid convergence. Leveraging EVPN RT 1, a VTEP switch can withdraw all MAC/IP addresses learned via a failed ES at once by describing the ESI value in the MP_UNREACH_NLRI Path Attribute.

An EVPN fabric employs a proactive Control Plane learning model, while networks based on Spanning Tree Protocol (STP) rely on a reactive flood-and-learn-based Data Plane learning model. In an EVPN fabric, data paths between Tenant Systems are established before data exchange. It's worth noting that without enabling ARP suppression, local VTEP switches flood ARP Request messages. However, remote VTEP switches do not learn the source MAC address from the VXLAN encapsulated frames.

BGP EVPN provides various methods for filtering reachability information. For instance, we can establish an import/export policy based on BGP Route Targets (BGP RT). Additionally, we can deploy ingress/egress filters using elements such as prefix lists or BGP path attributes, like BGP Autonomous System numbers. Besides, BGP, OSPF, and IS-IS all support peer authentication.


EVPN Data Plane: VXLAN Introduction


The Virtual Extensible LAN (VXLAN) is an encapsulation schema enabling Broadcast Domain/VLAN stretching over a Layer 3 network. Switches or hosts performing the encapsulation/decapsulation are called VXLAN Tunnel End Points (VTEP). VTEPs encapsulate the Ethernet frames, originated by local Tenant Systems (TS), within outer MAC and IP headers, followed by a UDP header with the destination port 4789 and a source port calculated from the payload. Between the UDP header and the original Ethernet frame is the VXLAN header, which describes the VXLAN segment with a VXLAN Network Identifier (VNI). The VNI is a 24-bit field, allowing (theoretically) over 16 million unique VXLAN segments.

VTEP devices allocate a Layer 2 VNI (L2VNI) for Intra-VN connections and a Layer 3 VNI (L3VNI) for Inter-VN connections. There is a unique L2VNI for each VXLAN segment but one common L3VNI for tenant-specific Inter-VN communication. Besides, the Generic Protocol Extension for VXLAN (VXLAN-GPE) enables leaf switches to add Group Policy information to data packets.


EVPN Building Blocks


I have divided Figure 1-2 into four domains: 1) Service Abstraction – Broadcast Domain, 2) Overlay Control Plane, 3) Overlay Data Plane, and 4) Route Propagation. These domains consist of several components which have cross-domain dependencies. 

Service Abstraction - Broadcast Domain: Virtual LAN: 

A Broadcast Domain (BD) is a logical network segment where all connected devices share the same subnet and can reach each other with Broadcast and Unicast messages. A Virtual LAN (VLAN) can be considered an abstraction of a BD. When we create a new VLAN and associate access/trunk interfaces with it, a switch starts building an address table of source MAC addresses from received frames originated by local Tenant Systems. With TS, I am referring to physical or virtual hosts. Besides, a Tenant System can be a forwarding component, such as a firewall or load balancer, attached to one or more Tenant-specific Virtual Networks.

Service Abstraction - Broadcast Domain: EVPN Instance: 

An EVPN Instance is identified by a Layer 2 Virtual Network Identifier (L2VNI). Besides the L2VNI, EVPN Instances have a unique Route Distinguisher (RD), allowing overlapping addresses between different Tenants, and BGP Route Targets (BGP RT) for BGP import and export policies. Before deploying an EVI, we must configure the VLAN and associate it with the VN segment (EVPN Instance). This is because the auto-generated Route Distinguisher associated with the EVI requires a VLAN identifier in the RD's local administrator part (a base value of 32767 + the associated VLAN ID). When we deploy an EVPN Instance, the Layer 2 Forwarding Manager (L2FM) starts encoding local MAC address information from the MAC table into the EVI-specific MAC-VRF (L2RIB) and the other way around.
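
A minimal NX-OS sketch of this dependency chain could look like the following: the VLAN is first mapped to the VN segment, after which the EVPN Instance is created with auto-derived RD and Route Targets. The VLAN and VNI values are illustrative only.

nv overlay evpn
feature vn-segment-vlan-based
!
vlan 10
  vn-segment 10000
!
evpn
  vni 10000 l2
    rd auto
    route-target import auto
    route-target export auto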

Overlay Control Plane

VTEP switches use BGP EVPN for publishing Tenant Systems' (TS) reachability information. The BGP Routing Information Base (BRIB) consists of the Local RIB (Loc-RIB) and the Adjacency RIB In/Out (Adj-RIB-In and Adj-RIB-Out) tables. The BGP process stores all valid local and remote Network Layer Reachability Information (NLRI) into the Loc-RIB, while the Adj-RIB-Out is a peer-specific table into which NLRIs are installed through the BGP Policy Engine. The Policy Engine executes our deployed BGP peer policy. An example of Policy Engine operation in a Single-AS Fabric is the peer-specific route-reflector-client definition deployed on the Spine switches. By setting a peered Leaf switch as a Route-Reflector (RR) client, we allow the Spine switches to publish NLRIs received from one iBGP peer to another iBGP peer, which the default BGP policy does not permit. Local Tenant Systems' MAC addresses and source interfaces are encoded into the BGP Loc-RIB from the L2RIB, with the encapsulation type and source IP address obtained from the NVE interface configuration.
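
As an example of the Route-Reflector policy mentioned above, the sketch below shows how a Spine switch could peer with a Leaf switch over the L2VPN EVPN address family and set it as an RR client. The AS number is a placeholder; the peering addresses follow the Loopback 10 scheme used earlier in this fabric, but the block is illustrative rather than the lab configuration.

router bgp 65000
  router-id 192.168.10.11
  address-family l2vpn evpn
  neighbor 192.168.10.101
    remote-as 65000
    update-source loopback10
    address-family l2vpn evpn
      send-community extended
      route-reflector-client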

When a VTEP receives an EVPN NLRI from a remote VTEP with importable Route Targets, it validates the route by checking that it has been received from a configured BGP peer with the correct remote ASN and a reachable source IP address. Then, it installs the NLRI information (RD, Encapsulation Type, Next Hop, other standard and extended communities, and VNIs) into the BGP Loc-RIB. Note that the local administrator part of the RD may change during the process if the VN segment is associated with a different VLAN than on the remote VTEP. Remember that VLANs are locally significant, while EVPN Instances have fabric-wide meaning. Next, the best MAC route is encoded into the L2RIB with the topology information (the VLAN ID associated with the VXLAN segment) and the next-hop information. Besides, the L2RIB describes the route source as BGP. Finally, the L2FM programs the information into the MAC address table and sets the NVE peer interface ID as the next hop. Note that the VXLAN Manager learns VXLAN peers from the data plane based on the source IP address.

Overlay Data Plane: Network Virtualization Edge (NVE) Interface:

The configuration of the logical NVE interface dictates the encapsulation type and the tunnel IP address for VXLAN tunnels. The VXLAN tunnel source IP address is obtained from a logical Loopback interface, which must be reachable across the fabric switches. The IP address of the NVE interface is carried in the BGP MP_REACH_NLRI of BGP Update messages as the Next Hop, i.e., the VXLAN tunnel source address. The VXLAN encapsulation type is published as a BGP EXTENDED_COMMUNITY Path Attribute along with the Route Targets (L2VNI and L3VNI) and the Router MAC (if an IP address is included).

EVPN Instances (EVIs) are associated with the NVE interface as member VNIs. We must define the Layer 2 BUM (Broadcast, Unknown unicast, and Multicast) traffic forwarding mode (Ingress Replication or Multicast Group) under each member VNI. The VXLAN Manager is responsible for the data plane encapsulation and decapsulation processes.
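
A minimal NVE interface sketch illustrating these associations is shown below. Loopback1 as the VTEP source and the L2BUM multicast group 239.1.1.1 are assumed example values; with Ingress Replication, the mcast-group line would be replaced by ingress-replication protocol bgp under the member VNI.

feature nv overlay

interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback1
  member vni 10000
    mcast-group 239.1.1.1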

MAC Route Propagation: Local VTEP

The previous sections provided an overview of the MAC Route propagation process. This section recaps the operation. Tenant Systems can verify the uniqueness of their IP address by sending a Gratuitous ARP (GARP), which is an unsolicited ARP Reply. The VTEP switch learns the source MAC address from the incoming frame and adds it to the MAC address table. The VLAN ID associated with the MAC entry is derived from the configuration of the Attachment Circuit (incoming interface) or the 802.1Q tag in the Ethernet header. The Attachment Circuit serves as the next hop.

The Layer 2 Forwarding Manager (L2FM) transfers the information from the MAC address table to the L2RIB of the MAC-VRF. Subsequently, the MAC route is encoded into the BGP Loc-RIB. The BGP process attaches the EVPN Instance-specific Route Distinguisher to the EVPN NLRI. In addition, the EVI-specific Route Targets are attached as EXTENDED_COMMUNITY attributes, along with the VXLAN encapsulation type defined in the NVE interface configuration. The Next Hop of the EVPN NLRI is the IP address associated with the local NVE interface. Finally, the MAC route is sent from the Loc-RIB through the BGP Policy Engine to the Adj-RIB-Out and forwarded to the BGP EVPN peer.
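
On the Leaf side, the corresponding BGP peering toward the Spine Route Reflector could be sketched as follows. The ASN 65000 and the Spine loopback address 10.0.0.11 are hypothetical values; send-community extended ensures that the Route Targets and other extended communities pass through the Policy Engine to the Adj-RIB-Out.

router bgp 65000
  neighbor 10.0.0.11
    remote-as 65000
    update-source loopback0
    address-family l2vpn evpn
      send-community extended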

Figure 1-2


Thursday 25 January 2024

BGP EVPN Part IV: MAC-VRF L2RIB Update: Local MAC Address

In Figure 1-3 we have VLAN 10 mapped to EVI/MAC-VRF L2VNI 10000. TS-A1 (IP: 192.168.11.12, MAC: 1000.0010.beef) is connected to VLAN 10 via Attachment Circuit (AC) Ethernet 1/2 (ifindex: 0x1a000200). 

Figure 1-3: MAC-VRF: L2RIB Local Learning Process.


Example 1-1 shows the VLAN to L2VNI mapping information. 


Leaf-101# show vlan id 10 vn-segment
VLAN Segment-id
---- -----------
10   10000       

Example 1-1: VLAN to EVPN Instance Mapping Information.


Step-1 and 2: MAC Table Update 


During the startup process, TS-A1 sends a Gratuitous ARP (GARP) message to announce its presence on the network and to validate the uniqueness of its IP address. It places its own IP address in both the Sender and Target IP fields (Example 1-2). If another host responds to this unsolicited ARP reply, it indicates a potential IP address conflict. 

Ethernet II, Src: 10:00:00:10:be:ef, Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (reply/gratuitous ARP)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: reply (2)
    [Is gratuitous: True]
    Sender MAC address: 10:00:00:10:be:ef (10:00:00:10:be:ef)
    Sender IP address: 192.168.11.12
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address: 192.168.11.12

Example 1-2: Gratuitous ARP from TS-A1.


Leaf-101 learns the MAC address of TS-A1 from the ingress frame and encodes the source MAC address 1000.0010.beef into the VLAN 10 MAC address table (Example 1-3). The entry type is dynamic, and the egress port (next hop) is interface Ethernet 1/2. The default MAC address aging time in Cisco Nexus 9000 series switches is 1800 seconds (30 minutes). 


Leaf-101# show system internal l2fwder mac
Legend: 
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link,
        (T) - True, (F) - False, C - ControlPlane MAC
   VLAN     MAC Address      Type      age     Secure NTFY Ports
---------+-----------------+--------+---------+------+----+------------------
*    10    1000.0010.beef   dynamic   00:03:43   F     F     Eth1/2

Example 1-3: Leaf-101 MAC Address Table.
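
If the default 30-minute timer needs tuning, the global aging time can be adjusted with a command along these lines; this is a sketch, and the supported range depends on the platform and software release.

Leaf-101(config)# mac address-table aging-time 1800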


Step-3: MAC-VRF’s L2RIB Update

The Layer 2 Forwarder (L2FWDER) component detects the new MAC address in the VLAN 10 bridge table. The L2FWDER registers this as a MAC move event, prompting it to program the MAC address reachability information into the Layer 2 Routing Information Base (L2RIB) associated with the MAC-VRF of tenant TENANT77. Example 1-4 shows, reading from the bottom up, how the L2FWDER component first detects the new MAC address 1000.0010.beef associated with VLAN 10 on Attachment Circuit 0x1a000200 (Ethernet1/2, Example 1-5). It then adds a new local MAC route to topology 10 (L2VNI 10000) with the next-hop interface ID 0x1a000200 (Ethernet1/2).


l2fwder_dbg_ev, 690 l2fwder_l2rib_add_delete_local_mac_routes,
154Adding route  topo-id: 10, macaddr: 1000.0010.beef, nhifindx: 0x1a000200
l2fwder_dbg_ev, 690 l2fwder_l2rib_mac_update,
739MAC move 1000.0010.beef (10) 0x0 -> 0x1a000200

Example 1-4: L2RIB Update by L2FWDER.


Example 1-5 verifies the snmp-ifindex 0x1a000200 mapping to physical interface Ethernet1/2.

Leaf-101# show interface snmp-ifindex | i 0x1a000200
Eth1/2          436208128  (0x1a000200)

Example 1-5: SNMP-ifindex to Interface Mapping Verification.


While Example 1-4 demonstrated the process from the L2FWDER component's perspective, Example 1-6 below details the same update events from the L2RIB perspective. The L2RIB receives the MAC route 1000.0010.beef (topology 10) and creates a new MAC route after a MAC mobility check. The route is then added to L2VNI 10000 and marked as a local route (rt_flags=L) with the next-hop interface Ethernet1/2.


Leaf-101# sh system internal l2rib event-history mac | i beef
Rcvd MAC ROUTE msg: (10, 1000.0010.beef), vni 0, admin_dist 0, seq 0, soo 0, 
(10,1000.0010.beef):Mobility check for new rte from prod: 3
(10,1000.0010.beef):Current non-del-pending route local:no, remote:no, linked mac-ip count:1
(10,1000.0010.beef):Clearing routelist flags: Del_Pend, 
(10,1000.0010.beef,3):Is local route. is_mac_remote_at_the_delete: 0
(10,1000.0010.beef,3):MAC route created with seq 0, flags L, (), 
(10,1000.0010.beef,3): soo 0, peerid 0, pc-ifindex 0
(10,1000.0010.beef,3):Encoding MAC best route (ADD, client id 5)
(10,1000.0010.beef,3):vni:10000 rt_flags:L, admin_dist:6, seq_num:0 ecmp_label:0 soo:0(--)
(10,1000.0010.beef,3):res:Regular esi:(F) peerid:0 nve_ifhdl:1224736769 mh_pc_ifidx:0 nh_count:1
(10,1000.0010.beef,3):NH[0]:Eth1/2

Example 1-6: L2RIB Update from the L2RIB Perspective.

Example 1-7 shows that the MAC address of TS-A1 is installed into the L2RIB associated with topology 10 (VN segment 10000). The entry is marked as a locally learned route (Prod=Local, Flag=L), with the interface Ethernet 1/2 set as the next hop for the MAC address.


Leaf-101# show l2route evpn mac evi 10
Flags -(Rmac): Router MAC (Stt):Static (L):Local (R):Remote (V):vPC link
(Dup):Duplicate (Spl):Split (Rcv):Recv (AD):Auto-Delete (D):Del Pending
(S):Stale (C):Clear, (Ps):Peer Sync (O):Re-Originated (Nho):NH-Override
(Pf):Permanently-Frozen, (Orp): Orphan
Topology    Mac Address    Prod   Flags         Seq No     Next-Hops
----------- -------------- ------ ------------- ---------- -----------------
10          1000.0010.beef Local  L,            0          Eth1/2

Example 1-7: Updated L2RIB.


Next up: the L2RIB update for the MAC-IP binding.