Wednesday 7 August 2019

VXLAN EVPN Multi-Site



Now you can also download my VXLAN book from Leanpub.com: "Virtual Extensible LAN VXLAN - A Practical guide to VXLAN Solution Part 1" (373 pages).

This chapter introduces the VXLAN EVPN Multi-Site (EVPN-MS) architecture for interconnecting EVPN Domains. The first section discusses the limitations of flat VXLAN EVPN fabric and the improvements that can be achieved with EVPN-MS. The second section focuses on the technical details of EVPN-MS solutions by using various configuration examples and packet captures.


Figure 1-1: Characteristics of Super-Spine VXLAN fabric.



Shared EVPN domain limitations

Figure 1-1 depicts an example BGP EVPN implementation that includes three Datacenters in three different locations. Each DC has seven Leaf switches and two Spine switches. For the DC interconnect, there is a pair of Super-Spine switches. All VLANs/VNIs have to be available on each Leaf switch regardless of location. This means that a full mesh of NVE peerings between all Leaf switches is required.
Even though the physical Underlay Network in this solution is hierarchical, the Overlay Network on top of it is flat, i.e. there is one shared, geographically dispersed EVPN domain (one L2 flooding domain). From the Underlay Network perspective, this means that the routing design and the routing protocol choice should be consistent throughout the EVPN domain; otherwise, IP prefixes have to be redistributed from one protocol to another, which is complex and hard to manage. The same design requirement applies to multi-destination traffic: BUM traffic forwarding has to be based on the same solution throughout the EVPN domain. Ingress-Replication (IR) does not scale well in a large-scale VXLAN EVPN fabric. In this example network, there are 21 Leaf switches and each switch has 20 NVE peers, so if IR is used for BUM traffic forwarding, a copy of every multi-destination frame/packet has to be sent individually to all 20 NVE peers. This might lead to a situation where BUM traffic flows disturb the actual application data traffic on the uplinks of the sending switch. This is why a Multicast-enabled Underlay Network is preferred in large-scale solutions. In summary, a large-scale VXLAN EVPN fabric can’t rely on an IP-only Underlay Network.

From the Overlay Network perspective, the number of bi-directional VXLAN tunnels in a large-scale solution also has its challenges. Even though the example here consists of only 21 Leaf switches, there are 20 NVE peerings per switch and 210 bi-directional VXLAN tunnels [n x (n-1)/2]. If the count of Leaf switches is doubled from 21 to 42 (41 NVE peers per Leaf), the bi-directional tunnel count rises from 210 to 861. If each switch has 41 NVE peers, it also means 41 possible next-hop addresses per Leaf switch. When a VM moves inside one location, every single Leaf has to update its next-hop table.
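The arithmetic above can be checked with a short Python sketch. The function name and the dictionary keys are illustrative assumptions, not part of any tool; the sketch only applies the n x (n-1)/2 formula and the "one copy per NVE peer" rule for Ingress-Replication described in the text.

```python
def full_mesh_stats(leaf_count: int) -> dict:
    """Peer and tunnel counts for a flat fabric where every Leaf peers with every other Leaf."""
    peers_per_leaf = leaf_count - 1
    return {
        "peers_per_leaf": peers_per_leaf,
        # n x (n-1)/2 bi-directional tunnels in a full mesh
        "bidirectional_tunnels": leaf_count * peers_per_leaf // 2,
        # with Ingress-Replication, one copy of each BUM frame per NVE peer
        "ir_copies_per_bum_frame": peers_per_leaf,
    }


for leaves in (21, 42):
    print(leaves, full_mesh_stats(leaves))
```

Doubling the Leaf count roughly quadruples the tunnel count, which is the scaling problem the flat design runs into.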

There are no real plug-and-play capabilities in this solution. When devices or a whole new site are added to the infrastructure, each existing Leaf switch builds an NVE peering with every added device. The opposite happens when devices are removed from the infrastructure: each remaining Leaf switch tears down its tunnels to the removed devices.
From the administrative perspective, this solution is managed as one entity. This rules out designs where, for example, the customer wants to manage one DC while a service provider manages another DC owned by the same customer.


EVPN Multi-Site Architecture Introduction

Figure 1-2 includes the same physical topology used in the previous example with an additional pair of Border Gateways (BGWs) in each site. The one big VXLAN EVPN fabric is now divided into a set of smaller fabrics, which are connected through the BGWs to the DC Core routers/switches in a shared Common EVPN domain. This brings the hierarchy back into the Overlay Network.

Each fabric forms an individual management domain that has a dedicated underlay routing architecture (routing protocol, interface IP addressing and so on). In addition, either Multicast or Ingress-Replication (IR) can be used independently: one fabric can use a Multicast-based solution while another fabric uses IR. Site-local Leaf switches form bi-directional NVE peerings (VXLAN tunnels) only with the other intra-site Leaf switches and with the BGW switches. In addition, local BGW switches form NVE peerings with each other and with the site-external BGW switches.
The intra-site Underlay Network can use any IGP or BGP for routing exchange, while the Overlay Network routing uses BGP (L2VPN EVPN afi). The Underlay Network routing protocol in the Common EVPN domain between the BGW switches and the DC Core switch/router is eBGP (IPv4 unicast afi), while eBGP (L2VPN EVPN afi) is used in the Overlay Network.

Because each site operates as an individual fabric, there are no Control Plane relationship requirements between sites. Connecting a new site to the DC Core routers does not generate any major Control Plane changes from the protocol perspective (such as new NVE or routing protocol peerings) in the intra-site Leaf switches of the remote sites. The new BGWs only establish Underlay and Overlay Network eBGP peerings with the DC Core routers and form NVE peerings with the existing BGW switches. After that, the BGW switches can exchange routing information. In this manner, the EVPN Multi-Site solution is plug-and-play capable.

There are two BGWs per site in figure 1-2, but this is not a limitation. NX-OS 9.3.x supports a maximum of six BGWs per site.

The next sections introduce the VXLAN EVPN Multi-Site solution in detail.


Figure 1-2: Characteristics of VXLAN EVPN Multi-Site fabric.


Intra-Site EVPN Domain (Fabric)


This section briefly introduces the intra-site example solution used in sites 12 and 34 (figure 1-3). Both sites use the same Underlay and Overlay Network design. OSPF (RID 192.168.0.dev-number/32) is used for IP connectivity between nodes and all Loopback address information is advertised internally. PIM BiDir is enabled in the Underlay Network and the Spine switches are defined as a Pseudo Rendezvous Point. BGP L2VPN EVPN peering is done between Loopback 77 interfaces (192.168.77.dev-number/32). NVE interfaces use the IP address of Loopback 100 (192.168.100.dev-number/32) and NVE peering is established between these addresses. In addition to the unique switch Physical IP (PIP), intra-site BGW switches BGW-1 and BGW-2 share the same Virtual IP (VIP) that is taken from Loopback interface 88 (192.168.88.12 on both devices).

The BGW-1 system MAC address is 5000.0002.0007, the BGW-2 system MAC address is 5000.0003.0007 and the BGW-3 system MAC address is 5000.0004.0007. The left-hand site uses BGP AS 65012 and Site-Id 12. The right-hand site uses BGP AS 65034 and Site-Id 34. The complete device configuration can be found in Appendix A at the end of this chapter.

For the sake of simplicity, the device count is minimized in this example network. Host Abba in VLAN 10 (IP: 172.16.10.101/MAC: 1000.0010.abba) is connected to Leaf-101 and host Beef (IP: 172.16.10.102/MAC: 1000.0010.beef) is connected to Leaf-102. VLAN 10 is mapped to VNI 10000.




Figure 1-3: Example EVPN Multi-Site topology.
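The per-device addressing convention described above (the last octet equals the device number) can be expressed as a small Python sketch. The dictionary keys and the function name are hypothetical labels chosen for this illustration; only the three address ranges come from the text.

```python
import ipaddress

# Base addresses taken from the text; the key names are illustrative only.
LOOPBACK_PLAN = {
    "ospf_rid": "192.168.0.0",
    "bgp_evpn_peering_lo77": "192.168.77.0",
    "nve_pip_lo100": "192.168.100.0",
}


def loopbacks(dev_number: int) -> dict:
    """Derive the /32 Loopback addresses of a device from its device number."""
    if not 1 <= dev_number <= 254:
        raise ValueError("device number must fit in the last octet")
    return {
        role: f"{ipaddress.IPv4Address(base) + dev_number}/32"
        for role, base in LOOPBACK_PLAN.items()
    }


# Device numbers as used in the example outputs: BGW-1 = 1, Spine-11 = 11, Leaf-101 = 101
for name, number in (("BGW-1", 1), ("Spine-11", 11), ("Leaf-101", 101)):
    print(name, loopbacks(number))
```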


Intra-Site NVE peering and VXLAN tunnels

This section explains the intra-site architecture. Example 1-1 shows that Leaf-101 has three NVE peers: one with the shared BGW Virtual IP (VIP) and two with the Physical IPs (PIP) of the BGW switches. BGW switches advertise the VIP address as the next-hop address for all Route-Type 2 and 5 updates received from the remote BGW. The Physical IP address is used for three purposes. First, if a BGW switch has directly connected hosts (only the routing model is supported), the host prefix is advertised with the PIP as the next-hop. Second, if a BGW switch is connected to an external network, the networks received from the external site are advertised with the PIP as the next-hop. Third, BGW switches use the PIP for BUM traffic replication. This means that the Ingress-Replication (IR) tunnel end-point address advertised within the BGP L2VPN EVPN Route-Type 3 (Inclusive Multicast Route) is the PIP.


Leaf-101# sh nve peers detail
Details of nve Peers:
----------------------------------------
Peer-Ip: 192.168.88.12
    NVE Interface       : nve1
    Peer State          : Up
    Peer Uptime         : 01:29:16
    Router-Mac          : n/a
    Peer First VNI      : 10000
    Time since Create   : 01:29:16
    Configured VNIs     : 10000,10077
    Provision State     : peer-add-complete
    Learnt CP VNIs      : 10000
    vni assignment mode : SYMMETRIC
    Peer Location       : N/A
Peer-Ip: 192.168.100.1
    NVE Interface       : nve1
    Peer State          : Up
    Peer Uptime         : 01:07:26
    Router-Mac          : n/a
    Peer First VNI      : 10000
    Time since Create   : 01:07:26
    Configured VNIs     : 10000,10077
    Provision State     : peer-add-complete
    Learnt CP VNIs      : 10000
    vni assignment mode : SYMMETRIC
    Peer Location       : N/A
Peer-Ip: 192.168.100.2
    NVE Interface       : nve1
    Peer State          : Up
    Peer Uptime         : 01:50:10
    Router-Mac          : n/a
    Peer First VNI      : 10000
    Time since Create   : 01:50:11
    Configured VNIs     : 10000,10077
    Provision State     : peer-add-complete
    Learnt CP VNIs      : 10000
    vni assignment mode : SYMMETRIC
    Peer Location       : N/A
Example 1-1: show nve peers detail.
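The next-hop rules described before Example 1-1 can be summarized in a few lines of Python. This is only an illustrative model of the rules as stated in the text, not NX-OS logic; the Route class, its field names and the function name are assumptions made for the example, and the addresses are those of BGW-1.

```python
from dataclasses import dataclass

PIP = "192.168.100.1"  # BGW-1 Loopback100, unique per BGW
VIP = "192.168.88.12"  # Loopback88, shared by BGW-1 and BGW-2


@dataclass
class Route:
    route_type: int                # EVPN Route-Type: 2, 3 or 5
    learned_from_remote_bgw: bool  # True if received from a BGW of another site


def next_hop_towards_fabric(route: Route) -> str:
    """Pick the next-hop a BGW advertises into its own fabric (simplified model)."""
    if route.route_type == 3:
        # Ingress-Replication tunnel end-point is always the PIP
        return PIP
    if route.route_type in (2, 5) and route.learned_from_remote_bgw:
        # Inter-site MAC/IP and prefix routes are reachable via the shared VIP
        return VIP
    # Locally connected hosts and external networks use the PIP
    return PIP


print(next_hop_towards_fabric(Route(2, True)))   # remote-site host route
print(next_hop_towards_fabric(Route(5, False)))  # prefix from a directly attached external network
```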


BGW switches generate BGP L2VPN EVPN Route-Type 2 (MAC Advertisement Route) advertisements about their system-MAC address with the NVE interface address (PIP) as the next-hop. Example 1-3 shows that Leaf-101 has received updates from Spine-11 concerning all three BGW switches used in this example. Note that the next-hop address towards the system-MAC of the intra-site BGW switches is the Physical IP (PIP) of each BGW switch, while the next-hop address towards the system-MAC of the inter-site BGW-3 switch is the shared Virtual IP address (VIP) used by the intra-site BGW switches BGW-1 and BGW-2. Note that both the VIP and the PIP have to be advertised by the Underlay Network routing protocol. Even though the route origin is not visible in example 1-3, the BGP RID of the advertising BGW can be seen from the Route-Distinguisher.

Leaf-101# sh bgp l2vpn evpn
<snipped>
   Network            Next Hop            Metric     LocPrf     Weight Path
Route Distinguisher: 192.168.77.1:32777
*>i[2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216
                      192.168.100.1                     100          0 i

Route Distinguisher: 192.168.77.2:32777
*>i[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                     100          0 i

Route Distinguisher: 192.168.77.3:32777
*>i[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.88.12                     100          0 65088 65034 i

Route Distinguisher: 192.168.77.101:32777    (L2VNI 10000)
*>i[2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216
                      192.168.100.1                     100          0 i
*>i[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                     100          0 i
*>i[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.88.12                     100          0 65088 65034 i
Example 1-3: show bgp l2vpn evpn.

The system-MAC attached to the NVE interface can be verified by using the show nve interface command.

BGW-1# sh nve interface
Interface: nve1, State: Up, encapsulation: VXLAN
 VPC Capability: VPC-VIP-Only [not-notified]
 Local Router MAC: 5000.0002.0007
 Host Learning Mode: Control-Plane
 Source-Interface: loopback100 (primary: 192.168.100.1, secondary: 0.0.0.0)
Example 1-4: show nve interface

Example 1-5 shows the BGP table entry on Leaf-101 concerning the system-MAC address of BGW-1. The route is imported into the BGP table based on Route-Target 65012:10000. The encapsulation type is VXLAN (type 8) and the advertised next-hop address is 192.168.100.1 (PIP). Based on both the VXLAN encapsulation type and the next-hop IP address, Leaf-101 knows that the switch with IP address 192.168.100.1 has to be a VXLAN tunnel end-point. Note that the system-MAC address is advertised as a sticky MAC address with the MAC Mobility Extended Community, where the static flag is set to one (1) and the sequence number is set to zero. The captured packet is shown in Capture 1-1 after Example 1-6.

Leaf-101# sh bgp l2vpn evpn 5000.0002.0007
BGP routing table information for VRF default, address family L2VPN EVPN
Route Distinguisher: 192.168.77.1:32777
BGP routing table entry for [2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216, version 19
Paths: (1 available, best #1)
Flags: (0x000202) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported to 1 destination(s)
  AS-Path: NONE, path sourced internal to AS
    192.168.100.1 (metric 81) from 192.168.77.11 (192.168.77.11)
      Origin IGP, MED not set, localpref 100, weight 0
      Received label 10000
      Extcommunity: RT:65012:10000 SOO:192.168.77.1:512 ENCAP:8
          MAC Mobility Sequence:01:0
      Originator: 192.168.77.1 Cluster list: 192.168.77.11

  Path-id 1 not advertised to any peer

Route Distinguisher: 192.168.77.101:32777    (L2VNI 10000)
BGP routing table entry for [2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216, version 20
Paths: (1 available, best #1)
Flags: (0x000202) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported from 192.168.77.1:32777:[2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216
  AS-Path: NONE, path sourced internal to AS
    192.168.100.1 (metric 81) from 192.168.77.11 (192.168.77.11)
      Origin IGP, MED not set, localpref 100, weight 0
      Received label 10000
      Extcommunity: RT:65012:10000 SOO:192.168.77.1:512 ENCAP:8
          MAC Mobility Sequence:01:0
      Originator: 192.168.77.1 Cluster list: 192.168.77.11

  Path-id 1 not advertised to any peer
Example 1-5: show bgp l2vpn evpn 5000.0002.0007

Before bringing up the tunnel, Leaf-101 has to verify that the IP address 192.168.100.1 is reachable through the Underlay Network.

Leaf-101# sh ip route 192.168.100.1
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.100.1/32, ubest/mbest: 1/0
    *via 10.101.11.11, Eth1/1, [110/81], 02:58:17, ospf-UNDERLAY-NET, intra
Example 1-6: show ip route 192.168.100.1

Internet Protocol Version 4, Src: 192.168.77.11, Dst: 192.168.77.101
Transmission Control Protocol, Src Port: 57069, Dst Port: 179, Seq: 110, Ack: 39, Len: 242
Border Gateway Protocol - UPDATE Message
<snipped>
        Path Attribute - EXTENDED_COMMUNITIES
<snipped>
            Type Code: EXTENDED_COMMUNITIES (16)
            Length: 32
            Carried extended communities: (4 communities)
                Route Target: 65012:10000 [Transitive 2-Octet AS-Specific]
                Route Origin: 192.168.77.1:512 [Transitive IPv4-Address-Specific]
                    Type: Transitive IPv4-Address-Specific (0x01)
                    Subtype (IPv4): Route Origin (0x03)
                    IPv4 address: 192.168.77.1
                    2-Octet AN: 512
                Encapsulation: VXLAN Encapsulation [Transitive Opaque]
                    Type: Transitive Opaque (0x03)
                    Subtype (Opaque): Encapsulation (0x0c)
                    Tunnel type: VXLAN Encapsulation (8)
                MAC Mobility: Sticky MAC [Transitive EVPN]
                    Type: Transitive EVPN (0x06)
                    Subtype (EVPN): MAC Mobility (0x00)
                    Flags: 0x01
                        .... ...1 = Sticky/Static MAC: Yes
                    Sequence number: 0
        Path Attribute - ORIGINATOR_ID: 192.168.77.1
        Path Attribute - CLUSTER_LIST: 192.168.77.11
        Path Attribute - MP_REACH_NLRI
            Flags: 0x90, Optional, Extended-Length, Non-transitive, Complete
            Type Code: MP_REACH_NLRI (14)
            Length: 44
            Address family identifier (AFI): Layer-2 VPN (25)
            Subsequent address family identifier (SAFI): EVPN (70)
            Next hop network address (4 bytes)
            Number of Subnetwork points of attachment (SNPA): 0
            Network layer reachability information (35 bytes)
                EVPN NLRI: MAC Advertisement Route
                    Route Type: MAC Advertisement Route (2)
                    Length: 33
                    Route Distinguisher: 0001c0a84d018009 (192.168.77.1:32777)
                    ESI: 00 00 00 00 00 00 00 00 00
                    Ethernet Tag ID: 0
                    MAC Address Length: 48
                    MAC Address: 50:00:00:02:00:07 (50:00:00:02:00:07)
                    IP Address Length: 0
                    IP Address: NOT INCLUDED
                    MPLS Label Stack 1: 625, (BOGUS: Bottom of Stack NOT set!)
Capture 1-1: BGP Update message carrying the system-MAC address of BGW-1.
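Two fields in Capture 1-1 are easier to read once decoded by hand. The Python sketch below decodes the 8-byte Route Distinguisher shown in the capture, and it offers a likely explanation for the "MPLS Label Stack 1: 625" line: assuming the dissector reads the 3-byte label field as an MPLS label (20-bit label followed by 4 further bits), the VNI 10000 seen as "Received label 10000" in Example 1-5 shows up as 625 with the bottom-of-stack bit not set. The function names are hypothetical helpers written for this illustration.

```python
import ipaddress


def decode_rd_type1(hex_rd: str) -> str:
    """Decode a Type 1 Route Distinguisher (IPv4 address : 2-byte number)."""
    raw = bytes.fromhex(hex_rd)
    rd_type = int.from_bytes(raw[0:2], "big")
    if rd_type != 1:
        raise ValueError("only Type 1 RDs are handled in this sketch")
    address = ipaddress.IPv4Address(raw[2:6])
    number = int.from_bytes(raw[6:8], "big")
    return f"{address}:{number}"


def label_field_views(vni: int) -> dict:
    """Show one 24-bit label field read as a VNI and as an MPLS label."""
    field = int.from_bytes(vni.to_bytes(3, "big"), "big")
    return {
        "as_vni": field,                         # all 24 bits
        "as_mpls_label": field >> 4,             # upper 20 bits only
        "bottom_of_stack": bool(field & 0x01),   # lowest bit
    }


print(decode_rd_type1("0001c0a84d018009"))
print(label_field_views(10000))
```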

Figure 1-4 summarizes the NVE peering from the Leaf-101 perspective. BGW-1 sends a BGP L2VPN EVPN Update that includes its system-MAC address. This way intra-site Leaf-101 learns the information needed for NVE peering. The shared NVE Anycast-BGW address 192.168.88.12 is learned from the BGP L2VPN EVPN MAC Advertisement Route originated by BGW-3 and forwarded by both intra-site BGW switches. When the DC Core switch (Route Server) receives the Update message, it changes the AS part of the Route-Target (RT) to its own AS due to the rt-rewrite definition. When BGW-1 receives this update, it also modifies the RT Extended Community to its own AS and imports the NLRI information. When sending an Update to Leaf-101, BGW-1 sets the Next-Hop to 192.168.88.12, which is the shared Anycast BGW address. Based on the RT 65012:10000, Leaf-101 is able to import the BGP L2VPN EVPN MAC Advertisement Route originated by BGW-3 and learn the IP address of the intra-site Anycast BGW from the Next-Hop field. This learning process is Control Plane learning and it is also used for establishing NVE peering between the intra-site BGW switches.



Figure 1-4: NVE peer learning process.

Example 1-7 shows that even though Leaf-101 has established NVE peering with BGW-1, the tunnel is still unidirectional. BGW-1 only has NVE peerings with the fabric-internal BGW-2 and with BGW-3, but not with Leaf-101. This is because only BGWs advertise their system MAC addresses as Route-Type 2 MAC Advertisement Routes. A Leaf is a normal VTEP switch, so it does not advertise its system MAC.

BGW-1# sh nve peer control-plane
Interface Peer-IP          State LearnType Uptime   Router-Mac
--------- ---------------  ----- --------- -------- -----------------
nve1      192.168.100.2    Up    CP        03:05:49 n/a
nve1      192.168.100.3    Up    CP        02:44:54 n/a
Example 1-7: show nve peers control-plane

Now host Abba joins the network and pings the Anycast GW used in VLAN 10. This way Leaf-101 learns the MAC address of host Abba and sends a BGP L2VPN EVPN Route-Type 2 advertisement to Route-Reflector Spine-11, which in turn forwards the message to the BGW switches. Example 1-8 shows the resulting BGP table on BGW-1, and Example 1-9 shows that after receiving the BGP Update related to host Abba, BGW-1 has also established NVE peering with Leaf-101. Now there is a bi-directional VXLAN tunnel between these two switches and data can flow over it.

BGW-1# sh bgp l2vpn evpn
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 29, Local Router ID is 192.168.77.1
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

   Network            Next Hop            Metric     LocPrf     Weight Path
Route Distinguisher: 192.168.77.1:27001   (ES [0300.0000.0000.0c00.0309 0])
*>l[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.1]/136
                      192.168.100.1                     100      32768 i
*>i[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
                      192.168.100.2                     100          0 i

Route Distinguisher: 192.168.77.1:32777    (L2VNI 10000)
*>i[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.100.101                   100          0 i
*>l[2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216
                      192.168.100.1                     100      32768 i
*>i[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                     100          0 i
*>e[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.100.3                                  0 65088 65034 i
*>i[2]:[0]:[0]:[48]:[1000.0010.abba]:[32]:[172.16.10.101]/272
                      192.168.100.101                   100          0 i
*>l[3]:[0]:[32]:[192.168.100.1]/88
                      192.168.100.1                     100      32768 i
*>e[3]:[0]:[32]:[192.168.100.3]/88
                      192.168.100.3                                  0 65088 65034 i

Route Distinguisher: 192.168.77.2:27001
*>i[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
                      192.168.100.2                     100          0 i

Route Distinguisher: 192.168.77.2:32777
*>i[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                     100          0 i

Route Distinguisher: 192.168.77.3:32777
*>e[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.100.3                                  0 65088 65034 i
*>e[3]:[0]:[32]:[192.168.100.3]/88
                      192.168.100.3                                  0 65088 65034 i

Route Distinguisher: 192.168.77.101:32777
*>i[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.100.101                   100          0 i
*>i[2]:[0]:[0]:[48]:[1000.0010.abba]:[32]:[172.16.10.101]/272
                      192.168.100.101                   100          0 i
Example 1-8: sh bgp l2vpn evpn


BGW-1# sh nve peers control-plane
Interface Peer-IP          State LearnType Uptime   Router-Mac
--------- ---------------  ----- --------- -------- -----------------
nve1      192.168.100.2    Up    CP        03:14:42 n/a
nve1      192.168.100.3    Up    CP        02:53:47 n/a
nve1      192.168.100.101  Up    CP        00:01:07 n/a
Example 1-9: show nve peers control-plane.

Summary

In conclusion, the intra-site NVE peering from Leaf to BGW is based on information carried within the auto-generated Route-Type 2 advertisement that describes the system MAC address of the BGW. The NVE peering from BGW to Leaf is based on information carried within the first Route-Type 2 MAC Advertisement Route that describes one of the hosts behind the Leaf switch. The result is a bi-directional VXLAN tunnel.
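The two-step learning process summarized above can be modelled with a toy Python simulation. It is a deliberately simplified sketch: the MacRoute and Vtep classes are assumptions made for this illustration, and the only rule modelled is "a VTEP adds the next-hop of an imported VXLAN-encapsulated Route-Type 2 as an NVE peer". Route-Targets, the Route-Reflector and the VIP are left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MacRoute:
    """Minimal stand-in for an EVPN Route-Type 2 advertisement."""
    mac: str
    next_hop: str
    encap: str = "vxlan"


@dataclass
class Vtep:
    name: str
    ip: str
    nve_peers: set = field(default_factory=set)

    def receive(self, route: MacRoute) -> None:
        # Control Plane learning: the next-hop of a VXLAN route becomes an NVE peer
        if route.encap == "vxlan" and route.next_hop != self.ip:
            self.nve_peers.add(route.next_hop)


leaf101 = Vtep("Leaf-101", "192.168.100.101")
bgw1 = Vtep("BGW-1", "192.168.100.1")

# Step 1: BGW-1 advertises its system MAC, so Leaf-101 learns BGW-1 as a peer
leaf101.receive(MacRoute("5000.0002.0007", bgw1.ip))
unidirectional = leaf101.ip not in bgw1.nve_peers
print("after step 1, tunnel is unidirectional:", unidirectional)

# Step 2: Leaf-101 advertises host Abba, so BGW-1 learns Leaf-101 as a peer
bgw1.receive(MacRoute("1000.0010.abba", leaf101.ip))
print("Leaf-101 peers:", leaf101.nve_peers)
print("BGW-1 peers:", bgw1.nve_peers)
```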

Shared Common EVPN Domain Connections

Figure 1-5 illustrates the overall topology and the eBGP peering between the Border Gateways and the DC Core switch. For simplicity, only one DC Core switch is used. The DC Core switch has its own dedicated BGP AS 65088, meaning that external BGP peering is used. The DC Core switch also has the role of Route Server (RS). In this example, the RS is in the data path, but in real-life scenarios it does not have to be. The complete configuration of the RS can be found in Appendix A at the end of this chapter. The DC Core switch and all three BGW switches belong to the Common EVPN Domain used for Datacenter Interconnect (DCI). This means that each BGW belongs to both the intra-site EVPN Domain and the Common EVPN Domain. All Unicast and Multicast traffic from one site to another goes through the BGWs.

The eBGP IPv4 unicast afi is used for IPv4 NLRI exchange in the Underlay Network. BGW switches advertise their unique NVE interface IP address (PIP) and the shared Virtual IP address (VIP), as well as the IP address of the external interface connected to the DC Core switch, which in turn forwards these BGP Updates to the other site. PIP addresses are used as the outer IP header source and destination addresses when BUM traffic is sent over an Ingress-Replication tunnel. The VIP address, in turn, is used in the outer IP header for Unicast traffic. Physical interface IP addresses (Underlay Network) are used in the recursive route lookup to find the next hop for the PIP/VIP address.

The eBGP L2VPN EVPN afi is used for exchanging EVPN NLRI in the Overlay Network. The information includes intra-site host MAC and MAC/IP information (Route-Type 2) and IP Prefix information (Route-Type 5). Note that BGW switches also exchange their system-MAC address information by using Route-Type 2 MAC Advertisement Routes for NVE peering. In addition, BGW switches advertise Inclusive Multicast Routes (Route-Type 3) to exchange NLRI concerning the Ingress-Replication tunnel. BGW switches also advertise the Ethernet Segment Routes (Route-Type 4) used for intra-site BGW DF election over the Common EVPN Domain, but those are ignored by the remote BGW switches because they do not match the route-target import policy.




Figure 1-5: Common EVPN Domain Underlay and Overlay eBGP peering.


Border Gateway setup

This section explains the Border Gateway configuration.

Define Site-Id

Configure the device role as an EVPN Multi-Site Border Gateway and assign a Site-Id to it. The Site-Id has to be identical on all BGWs belonging to the local site. Optionally, the advertisement of the shared Virtual IP address can be delayed after recovery. This way the Underlay and Overlay Network Control Plane protocols of the BGW switch have sufficient time to do their job, such as building the BGP peerings and establishing both the VXLAN and Ingress-Replication tunnels, before the switch introduces itself as a possible next-hop for inter-site destinations by advertising the Virtual IP address.

evpn multisite border-gateway 12
  delay-restore time 300
Example 1-10: enabling EVPN MS BGW on BGW1.

Define source IP for VIP under NVE Interface and BUM method for DCI

The NVE interface of BGW-1 uses IP address 192.168.100.1 (Loopback 100) as its Physical IP (PIP) and IP address 192.168.88.12 (Loopback 88) as the Virtual IP address (VIP). Ingress Replication is used for inter-site BUM traffic for VNI 10000, while intra-site BUM traffic uses Multicast.

interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback100
  multisite border-gateway interface loopback88
  member vni 10000
    multisite ingress-replication
    mcast-group 238.0.0.10
  member vni 10077 associate-vrf
Example 1-11: configuring NVE interface on BGW1.


Configure BGP Peering and redistribution

Configure eBGP IPv4 Unicast afi peering for the Underlay Network between the physical link interface IP addresses. Advertise the BGW switch Loopback interface IP addresses used by the NVE interface (VIP/PIP) and by the BGP peering to the DC Core switch. Configure eBGP L2VPN EVPN afi peering for the Overlay Network between the Loopback 77 IP addresses and define the peer-type as fabric-external. When eBGP peering is configured between Loopback interfaces, the TTL value also has to be adjusted by using the “ebgp-multihop <value>” command. In this example scenario, the value is set to five.

BGP L2VPN EVPN Updates sent by BGW-1 carry a site-specific Route-Target Extended Community per VNI. This community uses the format BGP-AS:VNI-Id. This is why there is a command “rewrite-evpn-rt-asn” under the L2VPN EVPN address-family. It modifies the BGP AS part of the Route-Target from the received AS number to the local AS number. When BGW-1 sends an eBGP L2VPN EVPN Update to the DC Core switch, the original RT for VNI 10000 is 65012:10000. When the DC Core switch receives the update message, it changes the RT to 65088:10000. It uses this RT value when sending the update message to BGW-3 on the other site. When BGW-3 receives the update message, it changes the RT to 65034:10000 before installing it into Adj-RIB-In. This way it is able to import the NLRI information carried in an update originated by a remote-site BGW. Also adjust the BGP maximum paths for load-balancing.

router bgp 65012
  router-id 192.168.77.1
  no enforce-first-as
  address-family ipv4 unicast
    redistribute direct route-map REDIST-TO-SITE-EXT-DCI
  address-family l2vpn evpn
    maximum-paths 2
    maximum-paths ibgp 2
  neighbor 10.1.88.88
    remote-as 65088
    update-source Ethernet1/2
    address-family ipv4 unicast
  neighbor 192.168.77.11
    remote-as 65012
    description ** Spine-11 BGP-RR **
    update-source loopback77
    address-family l2vpn evpn
      send-community extended
  neighbor 192.168.77.88
    remote-as 65088
    update-source loopback77
    ebgp-multihop 5
    peer-type fabric-external
    address-family l2vpn evpn
      send-community
      send-community extended
      rewrite-evpn-rt-asn
!
route-map REDIST-TO-SITE-EXT-DCI permit 10
  match tag 1234
!
interface loopback88
  description ** VIP for DCI-Inter-connect **
  ip address 192.168.88.12/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0
Example 1-12: configuring eBGP IPv4 Unicast and L2VPN EVPN peering route redistribution on BGW1.
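The Route-Target rewrite chain described before Example 1-12 can be traced with a small Python sketch. The function below is a hypothetical helper that only mimics the effect described in the text (replace the AS part of an AS:VNI Route-Target with the local AS); it is not how NX-OS implements the rewrite-evpn-rt-asn command.

```python
def rewrite_rt_asn(route_target: str, local_asn: int) -> str:
    """Replace the AS part of an 'AS:VNI' Route-Target with the local AS number."""
    _received_asn, vni = route_target.split(":")
    return f"{local_asn}:{vni}"


# RT attached by BGW-1 (AS 65012) for VNI 10000
rt = "65012:10000"
print("BGW-1 sends  ", rt)

# Each eBGP speaker on the path rewrites the RT on receipt
for device, asn in (("DC Core", 65088), ("BGW-3", 65034)):
    rt = rewrite_rt_asn(rt, asn)
    print(f"{device} holds", rt)
```

After the last rewrite the Route-Target matches the local AS:VNI value of site 34, which is what allows BGW-3 to import the route.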

Configure DCI and Fabric Interface Tracking

The BGW sits on the border between the intra-site EVPN Domain (Fabric EVPN) and the Common EVPN Domain (DCI). All inter-site traffic goes through the BGW switches, so it is extremely important to have a mechanism for tracking the state of both the Fabric and DCI interfaces. The configuration is shown in the example below. Link failure events are discussed in detail in the “Failure Scenario” section.

interface Ethernet1/1
  description **Fabric Internal **
  no switchport
  mtu 9216
  mac-address b063.0001.1e11
  medium p2p
  ip address 10.1.11.1/24
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
  evpn multisite fabric-tracking
  no shutdown
!
interface Ethernet1/2
  description ** DCI Interface **
  no switchport
  mtu 9216
  mac-address b063.0001.1e12
  medium p2p
  ip address 10.1.88.1/24 tag 1234
  ip pim sparse-mode
  evpn multisite dci-tracking
  no shutdown
Example 1-13: DCI and fabric interface tracking on BGW1.

This is the basic EVPN Multi-Site related configuration. The complete configuration of all BGW switches and the DC Core switch can be found in Appendix A at the end of this chapter.

BGP peering Verification on BGW

Example 1-14 shows that BGW-1 has established iBGP L2VPN EVPN session with 192.168.77.11 (Spine-11) and it has received three Route-Type 2 (MAC Advertisement Route) and one Route-Type 4 (Ethernet Segment Route). It also has established an eBGP L2VPN EVPN session with 192.168.77.88 (DC Core switch) from where it has received one Route-Type 2 (MAC Advertisement Route) and one Route-Type 3 (Inclusive Multicast Ethernet Tag Route).

BGW-1# sh bgp l2vpn evpn summary
BGP summary information for VRF default, address family L2VPN EVPN
BGP router identifier 192.168.77.1, local AS number 65012
BGP table version is 251, L2VPN EVPN config peers 2, capable peers 2
15 network entries and 15 paths using 2616 bytes of memory
BGP attribute entries [11/1804], BGP AS path entries [1/10]
BGP community entries [0/0], BGP clusterlist entries [2/8]

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.168.77.11   4 65012     609     486      251    0    0 07:14:11 4
192.168.77.88   4 65088     487     531      251    0    0 07:12:39 2

Neighbor        T    AS PfxRcd     Type-2     Type-3     Type-4     Type-5
192.168.77.11   I 65012 4          3          0          1          0
192.168.77.88   E 65088 2          1          1          0          0
Example 1-14: sh bgp l2vpn evpn summary on BGW1.

NVE peering Verification on BGW1

Example 1-15 shows that BGW-1 has established NVE peerings with the intra-site BGW-2 and with Leaf-101. The Peer Location shown in the output verifies that these are fabric peers. In addition, BGW-1 has established an NVE peering with BGW-3, whose location is described as DCI. The NVE peering process between the inter-site BGW switches uses the same mechanism as the NVE peering between intra-site BGW switches. The trigger for the NVE peer learning process is the auto-generated system MAC address advertisement (Route-Type 2, MAC Advertisement Route).

BGW-1# sh nve peers detail
Details of nve Peers:
----------------------------------------
Peer-Ip: 192.168.100.2
    NVE Interface       : nve1
    Peer State          : Up
    Peer Uptime         : 08:13:09
    Router-Mac          : n/a
    Peer First VNI      : 10000
    Time since Create   : 08:13:09
    Configured VNIs     : 10000,10077
    Provision State     : peer-add-complete
    Learnt CP VNIs      : 10000
    vni assignment mode : SYMMETRIC
    Peer Location       : FABRIC
Peer-Ip: 192.168.100.3
    NVE Interface       : nve1
    Peer State          : Up
    Peer Uptime         : 07:52:14
    Router-Mac          : n/a
    Peer First VNI      : 10000
    Time since Create   : 07:52:14
    Configured VNIs     : 10000,10077
    Provision State     : peer-add-complete
    Learnt CP VNIs      : 10000
    vni assignment mode : SYMMETRIC
    Peer Location       : DCI
Peer-Ip: 192.168.100.101
    NVE Interface       : nve1
    Peer State          : Up
    Peer Uptime         : 04:59:34
    Router-Mac          : n/a
    Peer First VNI      : 10000
    Time since Create   : 04:59:34
    Configured VNIs     : 10000,10077
    Provision State     : peer-add-complete
    Learnt CP VNIs      : 10000
    vni assignment mode : SYMMETRIC
    Peer Location       : FABRIC
Example 1-15: sh nve peers detail on BGW1.
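
As a small illustration, the Peer Location field of Example 1-15 can be used to separate fabric peers from DCI peers. The Python sketch below works on a shortened copy of the output. The function name is hypothetical and the parsing assumes the "key : value" layout shown above.

```python
# Shortened copy of the output in Example 1-15 (only the relevant fields).
NVE_PEERS = """\
Peer-Ip: 192.168.100.2
    Peer State          : Up
    Peer Location       : FABRIC
Peer-Ip: 192.168.100.3
    Peer State          : Up
    Peer Location       : DCI
Peer-Ip: 192.168.100.101
    Peer State          : Up
    Peer Location       : FABRIC
"""


def peers_by_location(text):
    """Group NVE peer IP addresses by their Peer Location value."""
    groups = {}
    current_peer = None
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Peer-Ip":
            current_peer = value
        elif key == "Peer Location" and current_peer is not None:
            groups.setdefault(value, []).append(current_peer)
    return groups


print(peers_by_location(NVE_PEERS))
```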

Example 1-16, taken from BGW-1, shows the BGW NVE related information. The output shows, among other things, the NVE source interface, which holds the Physical IP address (PIP), and the shared Virtual IP address (VIP) used as a next-hop for ingress and egress inter-site traffic. Note that the operational state of the VIP interface (Loopback88) is “down”. This is because the output was taken when there was neither IP connectivity nor NVE peering between BGW-1 and BGW-2 (Spine-11 was turned off).

BGW-1# sh nve interface nve1 detail
Interface: nve1, State: Up, encapsulation: VXLAN
 VPC Capability: VPC-VIP-Only [not-notified]
 Local Router MAC: 5000.0002.0007
 Host Learning Mode: Control-Plane
 Source-Interface: loopback100 (primary: 192.168.100.1, secondary: 0.0.0.0)
 Source Interface State: Up
 Virtual RMAC Advertisement: No
 NVE Flags:
 Interface Handle: 0x49000001
 Source Interface hold-down-time: 180
 Source Interface hold-up-time: 30
 Remaining hold-down time: 0 seconds
 Virtual Router MAC: N/A
 Virtual Router MAC Re-origination: 0200.c0a8.580c
 Interface state: nve-intf-add-complete
 Multisite delay-restore time: 300 seconds
 Multisite delay-restore time left: 22 seconds
 Multisite bgw-if: loopback88 (ip: 192.168.88.12, admin: Up, oper: Down)
 Multisite bgw-if oper down reason:
Example 1-16: sh nve interface nve1 detail on BGW1.


When Spine-11 boots up and the IP connectivity and NVE peering are established between BGW-1 and BGW-2, the operational state of the Loopback88 interface on BGW-1 changes to Up.

BGW-1# sh nve interface nve1 detail | i bgw-if
 Multisite bgw-if: loopback88 (ip: 192.168.88.12, admin: Up, oper: Up)
 Multisite bgw-if oper down reason:
Example 1-17: sh nve interface nve1 detail on BGW1.

BGP NLRI information verification.

Host Abba, connected to Leaf-101, joins the network. It pings the Anycast-GW IP address 172.16.10.1 (SVI for VLAN 10). Leaf-101 learns the MAC address information from the ingress frame. It stores the MAC information into the MAC address table and the L2RIB of the MAC-VRF, from where the MAC address information is exported into the BGP Loc-RIB and sent through the Adj-RIB-Out to Spine-11. The BGP Route-Reflector Spine-11 forwards the BGP Update to both BGW-1 and BGW-2. The BGW switches forward the BGP Update to the DC Core switch after local processing. The example below shows that the DC Core switch has learned the MAC address 1000.0010.abba of the host from both BGW-1 and BGW-2 with the same Next-Hop address 192.168.88.12 (VIP).

Route Distinguisher: 192.168.77.101:32777
*>e[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.88.12         2000                     0 65012 i
* e                   192.168.88.12         2000                     0 65012 i
  e[2]:[0]:[0]:[48]:[1000.0010.abba]:[32]:[172.16.10.101]/248
                      192.168.88.12         2000                     0 65012 i
  e                   192.168.88.12         2000                     0 65012 i
Example 1-18: sh bgp l2vpn evpn on DC Core switch (RouteServer).

The example below shows that the DC Core switch has installed a route to 192.168.88.12 into the RIB from the BGP Loc-RIB with two equal-cost next-hop IP addresses (Underlay Network addresses) and will use both of them for ECMP load-balancing toward the destination.

RouteServer-1# sh ip route 192.168.88.12
<snipped>
192.168.88.12/32, ubest/mbest: 2/0
    *via 10.1.88.1, [20/0], 00:00:04, bgp-65088, external, tag 65012
    *via 10.2.88.2, [20/0], 00:00:04, bgp-65088, external, tag 65012
Example 1-19: sh ip route 192.168.88.12 on DC Core switch.
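
The idea behind ECMP load-balancing over the two next hops in Example 1-19 can be sketched in Python. This is only a conceptual illustration: the real hash function and the header fields it uses are platform-specific, and the function name, the SHA-256 hash and the sample flow tuple are assumptions made for this sketch.

```python
import hashlib

# The two equal-cost next hops from Example 1-19.
NEXT_HOPS = ["10.1.88.1", "10.2.88.2"]


def pick_next_hop(flow, next_hops):
    """Hash the flow tuple and map it onto one of the equal-cost next hops.

    All packets of the same flow yield the same hash, so they follow the
    same path, while different flows are spread over the next hops.
    """
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical outer header fields: (source IP, destination IP, protocol,
# source port, destination port). 4789 is the VXLAN UDP port.
flow = ("192.168.100.3", "192.168.88.12", 17, 49152, 4789)
print(pick_next_hop(flow, NEXT_HOPS))
```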

The example below shows that BGW-3 has received the BGP Update about the MAC address information of host Abba from the DC Core switch. BGW-3 has changed the AS part of the Route-Target to its own BGP AS before importing the route from the Adj-RIB-In (pre) into the Adj-RIB-In (post). From the Adj-RIB-In (post), the route is imported into the Loc-RIB.

BGW-3# show bgp l2vpn evpn 1000.0010.abba
BGP routing table information for VRF default, address family L2VPN EVPN
!----------------< COMMENT: This entry is in BGP Loc-RIB >-------------
Route Distinguisher: 192.168.77.3:32777    (L2VNI 10000)
BGP routing table entry for [2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216, version 289
Paths: (1 available, best #1)
Flags: (0x000212) (high32 00000000) on xmit-list, is in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop, in rib
             Imported from 192.168.77.101:32777:[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
  AS-Path: 65088 65012 , path sourced external to AS
    192.168.88.12 (metric 0) from 192.168.77.88 (192.168.77.88)
      Origin IGP, MED not set, localpref 100, weight 0
      Received label 10000
      Extcommunity: RT:65034:10000 ENCAP:8

  Path-id 1 not advertised to any peer
!----------------< COMMENT: This entry is in BGP Adj-RIB-In >-------------
Route Distinguisher: 192.168.77.101:32777
BGP routing table entry for [2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216, version 288
Paths: (1 available, best #1)
Flags: (0x000202) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop
             Imported to 1 destination(s)
  AS-Path: 65088 65012 , path sourced external to AS
    192.168.88.12 (metric 0) from 192.168.77.88 (192.168.77.88)
      Origin IGP, MED not set, localpref 100, weight 0
      Received label 10000
      Extcommunity: RT:65034:10000 ENCAP:8

  Path-id 1 not advertised to any peer
Example 1-20: show bgp l2vpn evpn 1000.0010.abba on BGW3.
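
The Route-Target handling described above can be sketched as follows. This is a simplified Python model, not the NX-OS implementation. It assumes Route-Targets in AS:VNI form, and both function names are hypothetical.

```python
def rewrite_route_target(rt, local_as):
    """Replace the AS part of an 'AS:VNI' Route-Target with the local AS."""
    _remote_as, vni = rt.split(":")
    return f"{local_as}:{vni}"


def is_imported(rt, local_as, local_vnis):
    """Return True if the rewritten RT matches a locally configured AS:VNI."""
    import_rts = {f"{local_as}:{vni}" for vni in local_vnis}
    return rewrite_route_target(rt, local_as) in import_rts


# BGW-1 and BGW-2 (AS 65012) originate the route with RT 65012:10000.
# BGW-3 (AS 65034) rewrites the AS part before the import decision.
print(rewrite_route_target("65012:10000", 65034))  # 65034:10000
print(is_imported("65012:10000", 65034, [10000]))  # True
```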

L2RIB Verification on remote BGW

BGW-3 has installed MAC address information from the BGP Loc-RIB into L2RIB.

BGW-3# show l2route mac all

Flags -(Rmac):Router MAC (Stt):Static (L):Local (R):Remote (V):vPC link
(Dup):Duplicate (Spl):Split (Rcv):Recv (AD):Auto-Delete (D):Del Pending
(S):Stale (C):Clear, (Ps):Peer Sync (O):Re-Originated (Nho):NH-Override
(Pf):Permanently-Frozen, (Orp): Orphan

Topology    Mac Address    Prod   Flags         Seq No     Next-Hops
----------- -------------- ------ ------------- ---------- ----------------
10          1000.0010.abba BGP    Rcv           0          192.168.88.12
10          5000.0004.0007 VXLAN  Stt,Nho,      0          192.168.100.3
Example 1-21: show l2route mac all on BGW3.

MAC Address-Table Verification on remote BGW switch

BGW-3 has also installed the MAC information into the MAC address-table. The L2RIB and the MAC address-table include almost identical information. The difference between these two tables lies in their usage. The Data Plane uses the MAC address-table for switching, while the Control Plane uses the L2RIB for exporting/importing information to and from BGP processes.

BGW-3# show system internal l2fwder mac
Legend:
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link,
        (T) - True, (F) - False, C - ControlPlane MAC
   VLAN     MAC Address      Type      age     Secure NTFY Ports
---------+-----------------+--------+---------+------+----+------------------
G     -    b063:0003:1e12    static   -          F     F   sup-eth1(R)
G     -    b063:0003:1e11    static   -          F     F   sup-eth1(R)
*    10    1000.0010.abba    static   -          F     F  nve-peer2 192.168.88.12
G     -    b063:0003:1e14    static   -          F     F   sup-eth1(R)
G     -    b063:0003:1e13    static   -          F     F   sup-eth1(R)
G     -    0200:c0a8:5822    static   -          F     F   sup-eth1(R)
    1           1         -00:01:00:01:00:01         -             1
Example 1-22: show system internal l2fwder mac on BGW3.


BGP NLRI Next-Hop verification on remote BGW

When BGW-3 forwards a BGP Update message received from the Common EVPN Domain to intra-site devices, it changes the next-hop IP address to its VIP address (even though it is the only BGW on-site).

Leaf-102# sh bgp l2vpn evpn
Route Distinguisher: 192.168.77.101:32777
*>i[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.88.34                     100          0 65088 65012 i

Route Distinguisher: 192.168.77.102:32777    (L2VNI 10000)
*>i[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.88.34                     100          0 65088 65012 i
Leaf-102#
Example 1-23: show bgp l2vpn evpn on Leaf-102.

The same process is done by BGW-1 and BGW-2. They both change the next-hop address to the shared VIP address when sending BGP L2VPN EVPN Updates received from the Common EVPN Domain to intra-site devices.

Leaf-101# sh bgp l2vpn evpn
<snipped>
Route Distinguisher: 192.168.77.102:32777
*>i[2]:[0]:[0]:[48]:[1000.0010.beef]:[0]:[0.0.0.0]/216
                      192.168.88.12                     100          0 65088 65034 i

Leaf-101#
Example 1-24: show bgp l2vpn evpn on Leaf-101.

A simple ping test verifies that there is a connection between host Beef connected to Leaf-102 and host Abba connected to Leaf-101.

Beef#ping 172.16.10.101
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 172.16.10.101, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 96/137/224 ms
Example 1-25: ping from host Beef to host Abba within VLAN 10.

Multi-Destination traffic forwarding

There are two important considerations related to inter-site BUM traffic. First, if there is more than one intra-site BGW switch, the role of Designated Forwarder (DF) is elected per VLAN/VNI among all intra-site BGW switches. The DF is the switch that is responsible for inter-site ingress/egress BUM traffic forwarding. Second, when the DF election is done, BGW switches need to know to whom to forward inter-site BUM traffic over the Common EVPN Domain. This means that the BGWs in each location need to build a Multi-Destination Tree between themselves. The next two sections explain the DF election process, which uses BGP L2VPN EVPN Route-Type 4 (Ethernet Segment Route), and the Multi-Destination Tree building process, which uses BGP L2VPN EVPN Route-Type 3 (Inclusive Multicast Route) for finding Ingress-Replication peers.

Designated Forwarder

BGW switches send a BGP L2VPN EVPN Route-Type 4 (Ethernet Segment Route) update to all of their BGP L2VPN EVPN peers. Switches use this information for selecting the DF per VLAN/VNI. The first part of the NLRI [4] describes the EVPN Route-Type. The second part [0300.0000.0000.0c00.0309] includes the ESI Type (03 = MAC-based ESI) and the ESI system MAC 0000.0000.000c (formed from Site-Id 12 = hex 0c). It also contains the auto-generated ESI local discriminator 000309. The value [32] describes the length of the following IP address, which is the sender IP address [192.168.100.1] used in the DF election process (explained later). In addition, the Update message carries an ES-Import Route-Target Extended Community BGP Path Attribute that is generated automatically based on the local Site-Id (0c = 12). Only the intra-site BGW switches later import these Updates.
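
The ESI layout described above can be decoded with a few lines of Python. The sketch below is only an illustration of the byte layout (1 byte type, 6 bytes system MAC, 3 bytes local discriminator). The function name and the returned field names are hypothetical.

```python
def decode_esi(esi):
    """Split a dotted 10-byte ESI into type, system MAC and discriminator."""
    raw = bytes.fromhex(esi.replace(".", ""))
    if len(raw) != 10:
        raise ValueError("an ESI is 10 bytes long")
    system_mac = raw[1:7]
    mac_hex = system_mac.hex()
    return {
        "type": raw[0],  # 3 = MAC-based ESI
        "system_mac": ".".join(mac_hex[i:i + 4] for i in (0, 4, 8)),
        # In this chapter the system MAC is formed from the Site-Id.
        "site_id": int.from_bytes(system_mac, "big"),
        "local_discriminator": raw[7:10].hex(),
    }


print(decode_esi("0300.0000.0000.0c00.0309"))
# {'type': 3, 'system_mac': '0000.0000.000c', 'site_id': 12,
#  'local_discriminator': '000309'}
```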

All BGP L2VPN EVPN peers receive the Route-Type 4 BGP Update. Leaf-101 (through the Route-Reflector Spine-11) and BGW-3 also receive it, but they ignore it because they do not have a matching import clause for the RT.



Figure 1-6: Route-Type 4 sent by BGW-1 and BGW-2.

Example 1-26 shows that BGW-1 has installed the Ethernet Segment Route learned from the site-local peer BGW-2 into its BGP table. The Route-Target Extended Community is based on the Site-Id, meaning that only intra-site BGW switches are able to import Ethernet Segment Routes from each other.

BGW-1# sh bgp l2vpn evpn route-type 4
BGP routing table information for VRF default, address family L2VPN EVPN
Route Distinguisher: 192.168.77.1:27001   (ES [0300.0000.0000.0c00.0309 0])
BGP routing table entry for [4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.1]/136, version 7
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: local, path is valid, is best path, no labeled nexthop
  AS-Path: NONE, path locally originated
    192.168.100.1 (metric 0) from 0.0.0.0 (192.168.77.1)
      Origin IGP, MED not set, localpref 100, weight 32768
      Extcommunity: ENCAP:8 RT:0000.0000.000c

  Path-id 1 advertised to peers:
    192.168.77.11      192.168.77.88
BGP routing table entry for [4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136, version 9
Paths: (1 available, best #1)
Flags: (0x000012) (high32 00000000) on xmit-list, is in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported from 192.168.77.2:27001:[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
  AS-Path: NONE, path sourced internal to AS
    192.168.100.2 (metric 81) from 192.168.77.11 (192.168.77.11)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: ENCAP:8 RT:0000.0000.000c
      Originator: 192.168.77.2 Cluster list: 192.168.77.11

  Path-id 1 not advertised to any peer

Route Distinguisher: 192.168.77.2:27001
BGP routing table entry for [4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136, version 8
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported to 1 destination(s)
  AS-Path: NONE, path sourced internal to AS
    192.168.100.2 (metric 81) from 192.168.77.11 (192.168.77.11)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: ENCAP:8 RT:0000.0000.000c
      Originator: 192.168.77.2 Cluster list: 192.168.77.11

  Path-id 1 advertised to peers:
    192.168.77.88
Example 1-26: BGP L2VPN EVPN Ethernet Segment Route in BGW-1 BGP table.

Example 1-27 shows that BGW-2 has installed the Ethernet Segment Route learned from the site-local peer BGW-1 into its BGP table.

BGW-2# sh bgp l2vpn evpn route-type 4
BGP routing table information for VRF default, address family L2VPN EVPN
Route Distinguisher: 192.168.77.1:27001
BGP routing table entry for [4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.1]/136, version 8
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported to 1 destination(s)
  AS-Path: NONE, path sourced internal to AS
    192.168.100.1 (metric 81) from 192.168.77.11 (192.168.77.11)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: ENCAP:8 RT:0000.0000.000c
      Originator: 192.168.77.1 Cluster list: 192.168.77.11

  Path-id 1 advertised to peers:
    192.168.77.88

Route Distinguisher: 192.168.77.2:27001   (ES [0300.0000.0000.0c00.0309 0])
BGP routing table entry for [4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.1]/136, version 9
Paths: (1 available, best #1)
Flags: (0x000012) (high32 00000000) on xmit-list, is in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: internal, path is valid, is best path, no labeled nexthop
             Imported from 192.168.77.1:27001:[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.1]/136
  AS-Path: NONE, path sourced internal to AS
    192.168.100.1 (metric 81) from 192.168.77.11 (192.168.77.11)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: ENCAP:8 RT:0000.0000.000c
      Originator: 192.168.77.1 Cluster list: 192.168.77.11

  Path-id 1 not advertised to any peer
BGP routing table entry for [4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136, version 7
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: local, path is valid, is best path, no labeled nexthop
  AS-Path: NONE, path locally originated
    192.168.100.2 (metric 0) from 0.0.0.0 (192.168.77.2)
      Origin IGP, MED not set, localpref 100, weight 32768
      Extcommunity: ENCAP:8 RT:0000.0000.000c

  Path-id 1 advertised to peers:
    192.168.77.11      192.168.77.88
Example 1-27: BGP L2VPN EVPN Ethernet Segment Route in BGW-2 BGP table.

The capture below shows the BGP L2VPN EVPN Route-Type 4 (Ethernet Segment Route) sent by BGW-1.

Border Gateway Protocol - UPDATE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 93
    Type: UPDATE Message (2)
    Withdrawn Routes Length: 0
    Total Path Attribute Length: 70
    Path attributes
        Path Attribute - ORIGIN: IGP
        Path Attribute - AS_PATH: 65012
        Path Attribute - EXTENDED_COMMUNITIES
            Flags: 0xc0, Optional, Transitive, Complete
            Type Code: EXTENDED_COMMUNITIES (16)
            Length: 16
            Carried extended communities: (2 communities)
                Encapsulation: VXLAN Encapsulation [Transitive Opaque]
                    Type: Transitive Opaque (0x03)
                    Subtype (Opaque): Encapsulation (0x0c)
                    Tunnel type: VXLAN Encapsulation (8)
                ES Import: RT: 00:00:00:00:00:0c [Transitive EVPN]
                    Type: Transitive EVPN (0x06)
                    Subtype (EVPN): ES Import (0x02)
                    ES-Import Route Target: 00:00:00_00:00:0c (00:00:00:00:00:0c)
        Path Attribute - MP_REACH_NLRI
            Flags: 0x90, Optional, Extended-Length, Non-transitive, Complete
            Type Code: MP_REACH_NLRI (14)
            Length: 34
            Address family identifier (AFI): Layer-2 VPN (25)
            Subsequent address family identifier (SAFI): EVPN (70)
            Next hop network address (4 bytes)
            Number of Subnetwork points of attachment (SNPA): 0
            Network layer reachability information (25 bytes)
                EVPN NLRI: Ethernet Segment Route
                    Route Type: Ethernet Segment Route (4)
                    Length: 23
                    Route Distinguisher: 0001c0a84d016979 (192.168.77.1:27001)
                    ESI: 00:00:00:00:00:0c, Discriminator: 00 03
                        ESI Type: ESI MAC address defined (3)
                        ESI system MAC: 00:00:00_00:00:0c (00:00:00:00:00:0c)
                        ESI system mac discriminator: 00 03
                        Remaining bytes: 09
                    IP Address Length: 32
                    IPv4 address: 192.168.100.1
Capture 1-2: BGP L2VPN EVPN Route-Type 4
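
The Route Distinguisher in Capture 1-2 is shown both as raw hex (0001c0a84d016979) and in decoded form (192.168.77.1:27001). The Python sketch below shows how the two relate for a Type 1 RD (2 bytes type, 4 bytes IPv4 address, 2 bytes assigned number). The function name is hypothetical, and other RD types are deliberately not handled.

```python
import ipaddress


def decode_rd(rd_hex):
    """Decode an 8-byte Type 1 Route Distinguisher into 'IPv4:number'."""
    raw = bytes.fromhex(rd_hex)
    if len(raw) != 8:
        raise ValueError("a Route Distinguisher is 8 bytes long")
    rd_type = int.from_bytes(raw[0:2], "big")
    if rd_type != 1:
        raise ValueError("this sketch only handles Type 1 (IPv4:number)")
    ip = ipaddress.IPv4Address(raw[2:6])
    number = int.from_bytes(raw[6:8], "big")
    return f"{ip}:{number}"


print(decode_rd("0001c0a84d016979"))  # 192.168.77.1:27001
```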

BGW switches choose a Designated Forwarder (DF) among themselves to forward BUM (Broadcast, Unknown Unicast and Multicast) traffic to and from the intra-site EVPN Domain. If the site has more than one VLAN, the DF roles are load-balanced between the BGW nodes, e.g. the DF for VLAN 10 is BGW-1 and the DF for VLANs 1 and 77 is BGW-2. The selection process uses the formula “i = V mod N”, where V represents the VLAN Id and N represents the number of BGW switches in the redundancy group. The “i” is the ordinal of a BGW switch in the redundancy group. When BGW-1 and BGW-2 exchange BGP L2VPN EVPN Route-Type 4 (Ethernet Segment Route) updates, their IP addresses are included in the NLRI. Each switch sorts the IP addresses learned from the BGP Updates in numerical order from lowest to highest. In the case of BGW-1 and BGW-2, the order is 192.168.100.1, 192.168.100.2. The lowest IP, i.e. 192.168.100.1, gets ordinal zero (0), the next one gets ordinal one (1), and so on.
The formula to calculate the DF for VLAN 10 is:

V mod N = i
V = 10 (VLAN Id)
N = 2 (number of BGW switches)
10 mod 2 = 0 > BGW-1
(The remainder is zero (0) when 10 is divided by 2)
Ordinal zero is used by BGW-1, so it will be the DF for VLAN 10.

The formula to calculate the DF for VLAN 77 is:

V mod N = i
V = 77 (VLAN Id)
N = 2 (number of BGW switches)
77 mod 2 = 1 > BGW-2
(The remainder is one (1) when 77 is divided by 2)
Ordinal one is used by BGW-2, so it will be the DF for VLAN 77.

This procedure is the same as the one introduced in “EVPN ESI Multihoming - Part I: EVPN Ethernet Segment (ES) DF election section”.
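
The DF election described in this section can be sketched in Python. This is a minimal model of the “V mod N” rule with ordinals assigned in ascending IP order. The function name is hypothetical, and the sketch does not model timers or any other part of the real election process.

```python
import ipaddress


def elect_df(vlan_id, bgw_ips):
    """Return the IP of the Designated Forwarder for a VLAN.

    The BGW addresses learned from Route-Type 4 are sorted numerically,
    which gives each BGW its ordinal (0, 1, ...). The DF is the BGW whose
    ordinal equals VLAN Id mod number of BGWs.
    """
    ordered = sorted(bgw_ips, key=ipaddress.IPv4Address)
    return ordered[vlan_id % len(ordered)]


bgws = ["192.168.100.2", "192.168.100.1"]  # BGW-2, BGW-1
for vlan in (1, 10, 77):
    print(f"VLAN {vlan}: DF is {elect_df(vlan, bgws)}")
# VLAN 1: DF is 192.168.100.2
# VLAN 10: DF is 192.168.100.1
# VLAN 77: DF is 192.168.100.2
```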

Example 1-28 shows that BGW-1 is the DF for VLAN 10, and Example 1-29 shows that BGW-2 is the DF for VLANs 1 and 77.

BGW-1# sh nve ethernet-segment

ESI: 0300.0000.0000.0c00.0309
   Parent interface: nve1
  ES State: Up
  Port-channel state: N/A
  NVE Interface: nve1
   NVE State: Up
   Host Learning Mode: control-plane
  Active Vlans: 1,10,77
   DF Vlans: 10
   Active VNIs: 10000
  CC failed for VLANs:
  VLAN CC timer: 0
  Number of ES members: 2
  My ordinal: 0
  DF timer start time: 00:00:00
  Config State: N/A
  DF List: 192.168.100.1 192.168.100.2
  ES route added to L2RIB: True
  EAD/ES routes added to L2RIB: False
  EAD/EVI route timer age: not running
Example 1-28: DF election verification on BGW-1.

BGW-2# sh nve ethernet-segment

ESI: 0300.0000.0000.0c00.0309
   Parent interface: nve1
  ES State: Up
  Port-channel state: N/A
  NVE Interface: nve1
   NVE State: Up
   Host Learning Mode: control-plane
  Active Vlans: 1,10,77
   DF Vlans: 1,77
   Active VNIs: 10000
  CC failed for VLANs:
  VLAN CC timer: 0
  Number of ES members: 2
  My ordinal: 1
  DF timer start time: 00:00:00
  Config State: N/A
  DF List: 192.168.100.1 192.168.100.2
  ES route added to L2RIB: True
  EAD/ES routes added to L2RIB: False
  EAD/EVI route timer age: not running
----------------------------------------
Example 1-29: DF election verification on BGW-2.

Ingress-Replication

In order to forward inter-site Multi-Destination traffic, BGW switches form a Multi-Destination tree with the remote-site BGW switches. Switches use BGP L2VPN EVPN Route-Type 3 (Inclusive Multicast Route) to describe the Tunnel-Id used with the VNI and the tunnel type, which is Ingress-Replication. By using this information, switches are able to form the Multi-Destination tree over a unicast-only Underlay Network.

In figure 1-6, BGW-1 sends a BGP L2VPN EVPN Update to BGW-3. The EVPN NLRI describes the Route-Type (Inclusive Multicast Route) and the sender IP (192.168.100.1). The PMSI Tunnel Attribute describes the tunnel type (Ingress-Replication), the VNI whose BUM traffic should be sent over the tunnel, and the Tunnel-Id used by BGW-1. This attribute is discussed in a later section. The Route-Target Extended Community is set based on local values (65012:10000), which the receiving switches change to correspond to their own AS:VNI. The same process applies to BGW-2 and BGW-3.

Note that the DC Core switch does not forward the BGP L2VPN EVPN Inclusive Multicast Route sent by BGW-1 to BGW-2 because both use the same AS number. This is the normal BGP loop prevention mechanism. Also, the BGP L2VPN EVPN Inclusive Multicast Route is only sent out from the DCI interface, and the receiving BGW switch does not forward it to local BGP speakers.
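
The AS_PATH test behind this loop prevention can be sketched in Python. The function name is hypothetical, and the sketch models only the test itself, not which device applies it.

```python
def as_path_has_loop(as_path, peer_as):
    """Return True if the given AS already appears in the AS_PATH."""
    return peer_as in as_path


# Route-Type 3 originated by BGW-1 (AS 65012), as seen after the
# DC Core switch (AS 65088) has prepended its own AS.
as_path = [65088, 65012]

print(as_path_has_loop(as_path, 65012))  # True  -> not usable by BGW-2
print(as_path_has_loop(as_path, 65034))  # False -> accepted by BGW-3
```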


Figure 1-6: BGP L2VPN EVPN Route-Type 3 (Inclusive Multicast Ethernet Tag).


Example 1-30 shows that BGW-1 has received the BGP L2VPN EVPN Route-Type 3 NLRI information originated by BGW-3.

BGW-1# sh bgp l2vpn evpn route-type 3
!---------------< Comment: This is the local information advertised to BGW-3 >---------------
BGP routing table information for VRF default, address family L2VPN EVPN
Route Distinguisher: 192.168.77.1:32777    (L2VNI 10000)
BGP routing table entry for [3]:[0]:[32]:[192.168.100.1]/88, version 3
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: local, path is valid, is best path, no labeled nexthop
  AS-Path: NONE, path locally originated
    192.168.100.1 (metric 0) from 0.0.0.0 (192.168.77.1)
      Origin IGP, MED not set, localpref 100, weight 32768
      Origin flag 0x2
      Extcommunity: RT:65012:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.1

  Path-id 1 advertised to peers:
    192.168.77.88
!---------------< Comment: This is the information installed into BGP Loc-RIB >---------------
BGP routing table entry for [3]:[0]:[32]:[192.168.100.3]/88, version 26
Paths: (1 available, best #1)
Flags: (0x000012) (high32 00000000) on xmit-list, is in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop
             Imported from 192.168.77.3:32777:[3]:[0]:[32]:[192.168.100.3]/88
  AS-Path: 65088 65034 , path sourced external to AS
    192.168.100.3 (metric 0) from 192.168.77.88 (192.168.77.88)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: RT:65012:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.3

  Path-id 1 not advertised to any peer

!-----------< Comment: This is the information installed into Adj-RIB-In >-------
Route Distinguisher: 192.168.77.3:32777
BGP routing table entry for [3]:[0]:[32]:[192.168.100.3]/88, version 24
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop
             Imported to 1 destination(s)
  AS-Path: 65088 65034 , path sourced external to AS
    192.168.100.3 (metric 0) from 192.168.77.88 (192.168.77.88)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: RT:65012:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.3

  Path-id 1 not advertised to any peer
Example 1-30: sh bgp l2vpn evpn route-type 3 on BGW-1.

Example 1-31 shows that BGW-2 has also received the BGP L2VPN EVPN Route-Type 3 NLRI information originated by BGW-3.

BGW-2# sh bgp l2vpn evpn route-type 3
BGP routing table information for VRF default, address family L2VPN EVPN
Route Distinguisher: 192.168.77.2:32777    (L2VNI 10000)
BGP routing table entry for [3]:[0]:[32]:[192.168.100.2]/88, version 3
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: local, path is valid, is best path, no labeled nexthop
  AS-Path: NONE, path locally originated
    192.168.100.2 (metric 0) from 0.0.0.0 (192.168.77.2)
      Origin IGP, MED not set, localpref 100, weight 32768
      Origin flag 0x2
      Extcommunity: RT:65012:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.2

  Path-id 1 advertised to peers:
    192.168.77.88
BGP routing table entry for [3]:[0]:[32]:[192.168.100.3]/88, version 26
Paths: (1 available, best #1)
Flags: (0x000012) (high32 00000000) on xmit-list, is in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop
             Imported from 192.168.77.3:32777:[3]:[0]:[32]:[192.168.100.3]/88
  AS-Path: 65088 65034 , path sourced external to AS
    192.168.100.3 (metric 0) from 192.168.77.88 (192.168.77.88)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: RT:65012:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.3

  Path-id 1 not advertised to any peer

Route Distinguisher: 192.168.77.3:32777
BGP routing table entry for [3]:[0]:[32]:[192.168.100.3]/88, version 24
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop
             Imported to 1 destination(s)
  AS-Path: 65088 65034 , path sourced external to AS
    192.168.100.3 (metric 0) from 192.168.77.88 (192.168.77.88)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: RT:65012:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.3

  Path-id 1 not advertised to any peer
Example 1-31: sh bgp l2vpn evpn route-type 3 on BGW-2.

Example 1-32 shows that BGW-3 has received the BGP L2VPN EVPN Route-Type 3 NLRI information originated by BGW-1 and BGW-2.

BGW-3# sh bgp l2vpn evpn route-type 3
BGP routing table information for VRF default, address family L2VPN EVPN
Route Distinguisher: 192.168.77.1:32777
BGP routing table entry for [3]:[0]:[32]:[192.168.100.1]/88, version 7
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop
             Imported to 1 destination(s)
  AS-Path: 65088 65012 , path sourced external to AS
    192.168.100.1 (metric 0) from 192.168.77.88 (192.168.77.88)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: RT:65034:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.1

  Path-id 1 not advertised to any peer

Route Distinguisher: 192.168.77.2:32777
BGP routing table entry for [3]:[0]:[32]:[192.168.100.2]/88, version 15
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop
             Imported to 1 destination(s)
  AS-Path: 65088 65012 , path sourced external to AS
    192.168.100.2 (metric 0) from 192.168.77.88 (192.168.77.88)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: RT:65034:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.2

  Path-id 1 not advertised to any peer

Route Distinguisher: 192.168.77.3:32777    (L2VNI 10000)
BGP routing table entry for [3]:[0]:[32]:[192.168.100.1]/88, version 11
Paths: (1 available, best #1)
Flags: (0x000012) (high32 00000000) on xmit-list, is in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop
             Imported from 192.168.77.1:32777:[3]:[0]:[32]:[192.168.100.1]/88
  AS-Path: 65088 65012 , path sourced external to AS
    192.168.100.1 (metric 0) from 192.168.77.88 (192.168.77.88)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: RT:65034:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.1

  Path-id 1 not advertised to any peer
BGP routing table entry for [3]:[0]:[32]:[192.168.100.2]/88, version 17
Paths: (1 available, best #1)
Flags: (0x000012) (high32 00000000) on xmit-list, is in l2rib/evpn, is not in HW
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop
             Imported from 192.168.77.2:32777:[3]:[0]:[32]:[192.168.100.2]/88
  AS-Path: 65088 65012 , path sourced external to AS
    192.168.100.2 (metric 0) from 192.168.77.88 (192.168.77.88)
      Origin IGP, MED not set, localpref 100, weight 0
      Extcommunity: RT:65034:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.2

  Path-id 1 not advertised to any peer
BGP routing table entry for [3]:[0]:[32]:[192.168.100.3]/88, version 3
Paths: (1 available, best #1)
Flags: (0x000002) (high32 00000000) on xmit-list, is not in l2rib/evpn
Multipath: eBGP iBGP

  Advertised path-id 1
  Path type: local, path is valid, is best path, no labeled nexthop
  AS-Path: NONE, path locally originated
    192.168.100.3 (metric 0) from 0.0.0.0 (192.168.77.3)
      Origin IGP, MED not set, localpref 100, weight 32768
      Origin flag 0x2
      Extcommunity: RT:65034:10000 ENCAP:8
      PMSI Tunnel Attribute:
        flags: 0x00, Tunnel type: Ingress Replication
        Label: 10000, Tunnel Id: 192.168.100.3

  Path-id 1 advertised to peers:
    192.168.77.88
Example 1-32: DF selection verification on BGW-2.

The P-Multicast Service Interface (PMSI) Tunnel path attribute shown in Capture 1-3 describes the PMSI tunnel end-point of the Multi-Destination tree over the Common EVPN domain for VNI 10000. The BGW acts as a kind of PE device and offers the PMSI service to site-local devices. This means that the BGW switch has to be able to forward Multi-Destination traffic received from CE devices, which from the intra-site perspective are the Leaf and Spine switches, over the Common EVPN Domain to the BGW switches located at the remote site, and the other way around. The binary figures in front of “MPLS label” carry the Virtual Network Identifier (VNI) of this Multi-Destination tree. Wireshark decodes only the 20 high-order bits of the 3-byte label field as an MPLS label (0000.0000.0010.0111.0001 = 625). Together with the four masked low-order bits, which are zero for this VNI, the full 24-bit field is 0x002710, which is 10000 in decimal notation (the VNI used with VLAN 10).

Border Gateway Protocol - UPDATE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 99
    Type: UPDATE Message (2)
    Withdrawn Routes Length: 0
    Total Path Attribute Length: 76
    Path attributes
        Path Attribute - ORIGIN: IGP
        Path Attribute - AS_PATH: 65012
        Path Attribute - EXTENDED_COMMUNITIES
            Flags: 0xc0, Optional, Transitive, Complete
            Type Code: EXTENDED_COMMUNITIES (16)
            Length: 16
            Carried extended communities: (2 communities)
                Route Target: 65012:10000 [Transitive 2-Octet AS-Specific]
                Encapsulation: VXLAN Encapsulation [Transitive Opaque]
        Path Attribute - PMSI_TUNNEL_ATTRIBUTE
            Flags: 0xc0, Optional, Transitive, Complete
            Type Code: PMSI_TUNNEL_ATTRIBUTE (22)
            Length: 9
            Flags: 0
            Tunnel Type: Ingress Replication (6)
            0000 0000 0010 0111 0001 .... = MPLS Label: 625
            Tunnel ID: tunnel end point -> 192.168.100.1
        Path Attribute - MP_REACH_NLRI
            Flags: 0x90, Optional, Extended-Length, Non-transitive, Complete
            Type Code: MP_REACH_NLRI (14)
            Length: 28
            Address family identifier (AFI): Layer-2 VPN (25)
            Subsequent address family identifier (SAFI): EVPN (70)
            Next hop network address (4 bytes)
            Number of Subnetwork points of attachment (SNPA): 0
            Network layer reachability information (19 bytes)
                EVPN NLRI: Inclusive Multicast Route
                    Route Type: Inclusive Multicast Route (3)
                    Length: 17
                    Route Distinguisher: 0001c0a84d018009 (192.168.77.1:32777)
                    Ethernet Tag ID: 0
                    IP Address Length: 32
                    IPv4 address: 192.168.100.1
Capture 1-3: BGP L2VPN EVPN Route-Type 3 – Inclusive Multicast Route (captured from BGW-1).
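
The two numeric fields of Capture 1-3 that are easiest to misread can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the label bytes 0x002710 are an assumption, because the capture masks the four low-order bits of the field (they are taken to be zero, which matches the "Label: 10000" shown in the BGP table output).

```python
import ipaddress
import struct


def decode_pmsi_label(field: bytes) -> dict:
    """Interpret the 3-byte PMSI label field as a VNI and as an MPLS label."""
    if len(field) != 3:
        raise ValueError("label field must be exactly 3 bytes")
    value = int.from_bytes(field, "big")
    return {
        "vni": value,              # VXLAN view: all 24 bits carry the VNI
        "mpls_label": value >> 4,  # MPLS view: only the 20 high-order bits
    }


def decode_rd_type1(rd: bytes) -> str:
    """Decode a type 1 Route Distinguisher (IPv4 address : 16-bit number)."""
    rd_type, addr, number = struct.unpack("!H4sH", rd)
    if rd_type != 1:
        raise ValueError("only RD type 1 is handled in this sketch")
    return f"{ipaddress.IPv4Address(addr)}:{number}"


# Assumed field content: 0x002710 (low-order nibble not visible in the capture).
print(decode_pmsi_label(bytes.fromhex("002710")))
# RD bytes exactly as shown in the capture.
print(decode_rd_type1(bytes.fromhex("0001c0a84d018009")))
```

The same 24 bits therefore read as 625 when treated as an MPLS label and as 10000 when treated as a VNI, which explains the difference between the capture and the CLI output.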

Figure 1-7 illustrates the Multi-Destination forwarding path. PIM BiDir is used to build a Bidirectional Multicast tree inside both sites. Spine switches are defined as the Pseudo Rendezvous Point (Pseudo RP) for the Multicast tree. Site-local Leaf switches and BGW switches join the Multicast tree. On the Common EVPN Domain side, BGW switches located in different sites form an Ingress-Replication path between each other. The BGP L2VPN EVPN Route-Type 3 (Inclusive Multicast Route) NLRI is used for signaling.

When Leaf-101 receives an ingress L2 BUM frame from a connected host in VLAN 10 (VNI 10000), it checks the Multicast Group attached to VNI 10000 (238.0.0.10) and sends the frame to Spine-11, which is the Pseudo-RP for group 238.0.0.10. Spine-11 forwards the L2 BUM frame out of the interfaces found in the Outgoing Interface List (OIL) for Multicast Group 238.0.0.10. The OIL is built based on received PIM Join messages. Both BGW-1 and BGW-2 have joined the Mcast Group, so Spine-11 forwards the L2 BUM frame to both of them. BGW-1 is selected as the Designated Forwarder (DF) for VLAN 10, so it sends the L2 BUM frame with VXLAN encapsulation to BGW-3 over the Ingress-Replication tunnel formed over the Common EVPN domain. BGW-2 does not forward the frame. The source IP address used in the outer tunnel IP header is the PIP of BGW-1. When BGW-3 receives the frame, it checks the VNI found in the VXLAN header and decapsulates the frame. It then forwards the frame to Mcast Group 238.0.0.10 (the MG for VNI 10000 in this site as well), whose RP is Spine-12. Spine-12 checks the OIL and forwards the frame to Leaf-102.



Figure 1-7: Overall Multi-Destination delivery path.
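
The DCI part of this path can be summarized as a small Python model. It is a sketch only: the dictionaries and the function name are hypothetical, the DF is given as an input rather than elected, and the remote tunnel end-point is the one learned from the Route-Type 3 route of BGW-3.

```python
# Hypothetical model of the DCI leg of Figure 1-7; not a device API.
# PIP addresses (Loopback 100) of the BGWs in the local site.
LOCAL_BGWS = {"BGW-1": "192.168.100.1", "BGW-2": "192.168.100.2"}

# Tunnel end-points learned from remote Route-Type 3 routes, per VNI.
REMOTE_IMET_ROUTES = {10000: ["192.168.100.3"]}


def dci_copies(bgw: str, vni: int, designated_forwarder: str) -> list:
    """Return (outer source IP, outer destination IP) for each copy sent to the DCI."""
    if bgw != designated_forwarder:
        # A non-DF BGW receives the frame from the Spine but does not forward it.
        return []
    source_pip = LOCAL_BGWS[bgw]
    # Ingress-Replication: one unicast copy per remote tunnel end-point.
    return [(source_pip, remote) for remote in REMOTE_IMET_ROUTES.get(vni, [])]


for name in LOCAL_BGWS:
    print(name, dci_copies(name, 10000, designated_forwarder="BGW-1"))
```

Only the DF produces copies, one per remote tunnel end-point, and each copy uses the PIP of the DF as the outer source address.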

Fabric Link Failure


When a BGW switch loses all of its intra-site links, it stops advertising the Shared Virtual IP (VIP) to its DCI Underlay Network BGP IPv4 Unicast peer. This way it makes sure that other switches do not consider it a valid next hop in the ECMP decision process. In addition, it withdraws all intra-site host-related MAC address information (Route-Type 2). It also stops advertising itself as an Ingress-Replication tunnel end-point by withdrawing the Inclusive Multicast Route (Route-Type 3). It withdraws the Ethernet Segment Routes (Route-Type 4) as well, even though they are not used outside the local site. Finally, it withdraws the learned IP prefix routes (Route-Type 5), excluding locally connected prefixes from either connected hosts or external IPv4 peers.



Figure 1-8: Intra-site fabric-link failure on BGW-1.
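
The withdraw behavior described above can be written down as a simple decision table. The following Python sketch is an illustration under assumptions: the function and key names are hypothetical, and the only input is the up/down state of the fabric links.

```python
def advertised_to_dci(fabric_links: dict) -> dict:
    """Return what a BGW still advertises towards the DCI for given fabric-link states.

    fabric_links maps an interface name to True (up) or False (down).
    """
    # The BGW is isolated from the fabric only when ALL fabric links are down.
    isolated = not any(fabric_links.values())
    return {
        "vip_ipv4_unicast": not isolated,               # Shared VIP (Loopback 88)
        "route_type_2_site_hosts": not isolated,        # intra-site MAC information
        "route_type_3_imet": not isolated,              # Ingress-Replication end-point
        "route_type_4_es": not isolated,                # Ethernet Segment routes
        "route_type_5_learned_prefixes": not isolated,  # prefixes learned from the site
        "route_type_5_local_prefixes": True,            # locally connected prefixes stay
    }


# BGW-1 in this lab has a single fabric link, Ethernet1/1.
print(advertised_to_dci({"Ethernet1/1": True}))
print(advertised_to_dci({"Ethernet1/1": False}))
```

With more than one fabric link, a single link failure would leave every entry set to True in this model, because the BGW is not isolated until the last link goes down.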


Normal State

Example 1-33 shows the BGP IPv4 Unicast entries installed into the DC Core switch BGP table before the fabric-link failure. The DC Core switch has learned the Shared VIP address used in Site-12 from its BGP IPv4 Unicast peer switches BGW-1 and BGW-2.

RouteServer-1# sh ip bgp
BGP routing table information for VRF default, address family IPv4 Unicast
BGP table version is 21, Local Router ID is 192.168.77.88
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

   Network            Next Hop            Metric     LocPrf     Weight Path
*>e10.1.88.0/24       10.1.88.1                0                     0 65012 ?
*>e10.2.88.0/24       10.2.88.2                0                     0 65012 ?
*>e10.3.88.0/24       10.3.88.3                0                     0 65034 ?
*>e10.88.1.0/24       10.1.88.1                0                     0 65012 ?
*>e10.88.2.0/24       10.2.88.2                0                     0 65012 ?
*>e10.88.3.0/24       10.3.88.3                0                     0 65034 ?
*>e192.168.0.1/32     10.1.88.1                0                     0 65012 ?
*>e192.168.0.2/32     10.2.88.2                0                     0 65012 ?
*>e192.168.0.3/32     10.3.88.3                0                     0 65034 ?
*>e192.168.77.1/32    10.1.88.1                0                     0 65012 ?
*>e192.168.77.2/32    10.2.88.2                0                     0 65012 ?
*>e192.168.77.3/32    10.3.88.3                0                     0 65034 ?
*>r192.168.77.88/32   0.0.0.0                  0        100      32768 ?
*>e192.168.88.12/32   10.1.88.1                0                     0 65012 ?
*|e                   10.2.88.2                0                     0 65012 ?
*>e192.168.88.34/32   10.3.88.3                0                     0 65034 ?
*>r192.168.88.88/32   0.0.0.0                  0        100      32768 ?
*>e192.168.100.1/32   10.1.88.1                0                     0 65012 ?
*>e192.168.100.2/32   10.2.88.2                0                     0 65012 ?
*>e192.168.100.3/32   10.3.88.3                0                     0 65034 ?
Example 1-33 BGP IPv4 Unicast entries in DC Core switch.

Example 1-34 shows the BGP L2VPN EVPN entries installed into the BGW-3 BGP table before the fabric-link failure. There is one Route-Type 4 entry (Ethernet Segment Route), one Route-Type 3 entry (Inclusive Multicast Route) and two Route-Type 2 entries (MAC Advertisement Route), the first one for the System MAC and the second one for host Abba.

BGW-3# sh bgp l2vpn evpn
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 47, Local Router ID is 192.168.77.3
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

   Network            Next Hop            Metric     LocPrf     Weight Path
Route Distinguisher: 192.168.77.1:27001
*>e[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.1]/136
                      192.168.100.1                                  0 65088 65012 i

Route Distinguisher: 192.168.77.1:32777
*>e[2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216
                      192.168.100.1                                  0 65088 65012 i
*>e[3]:[0]:[32]:[192.168.100.1]/88
                      192.168.100.1                                  0 65088 65012 i

Route Distinguisher: 192.168.77.2:27001
*>e[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
                      192.168.77.1                                   0 65088 65012 i

Route Distinguisher: 192.168.77.2:32777
*>e[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                                  0 65088 65012 i
*>e[3]:[0]:[32]:[192.168.100.2]/88
                      192.168.100.2                                  0 65088 65012 i

Route Distinguisher: 192.168.77.3:27001   (ES [0300.0000.0000.0c00.0309 0])
*>e[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.1]/136
                      192.168.100.1                                  0 65088 65012 i
*>e[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
                      192.168.77.1                                   0 65088 65012 i
*>l[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.3]/136
                      192.168.100.3                     100      32768 i

Route Distinguisher: 192.168.77.3:32777    (L2VNI 10000)
*>e[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.88.12                                  0 65088 65012 i
*>e[2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216
                      192.168.100.1                                  0 65088 65012 i
*>e[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                                  0 65088 65012 i
*>l[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.100.3                     100      32768 i
*>e[3]:[0]:[32]:[192.168.100.1]/88
                      192.168.100.1                                  0 65088 65012 i
*>e[3]:[0]:[32]:[192.168.100.2]/88
                      192.168.100.2                                  0 65088 65012 i
*>l[3]:[0]:[32]:[192.168.100.3]/88
                      192.168.100.3                     100      32768 i

Route Distinguisher: 192.168.77.101:32777
*>e[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.88.12                                  0 65088 65012 i
Example 1-34 BGP L2VPN EVPN entries in BGW-3.

The Protocol, Link and Admin status of Loopback 88 (VIP) is UP.

BGW-1(config-if)# sh ip int bri

IP Interface Status for VRF "default"(1)
Interface            IP Address      Interface Status
Lo0                  192.168.0.1     protocol-up/link-up/admin-up
Lo77                 192.168.77.1    protocol-up/link-up/admin-up
Lo88                 192.168.88.12   protocol-up/link-up/admin-up
Lo100                192.168.100.1   protocol-up/link-up/admin-up
Eth1/1               10.1.11.1       protocol-up/link-up/admin-up
Eth1/2               10.1.88.1       protocol-up/link-up/admin-up
Eth1/3               10.11.1.1       protocol-up/link-up/admin-up
Eth1/4               10.88.1.1       protocol-up/link-up/admin-up
Example 1-35 Interface Loopback 88 UP on BGW-1.
Fabric-Link Failure
The fabric-link failure is simulated by shutting down the fabric-link Interface e1/1. When BGW-1 notices this, it changes the Interface Loopback 88 link-state to down.
BGW-1(config-if)# sh ip int bri

IP Interface Status for VRF "default"(1)
Interface            IP Address      Interface Status
Lo0                  192.168.0.1     protocol-up/link-up/admin-up
Lo77                 192.168.77.1    protocol-up/link-up/admin-up
Lo88                 192.168.88.12   protocol-down/link-down/admin-up
Lo100                192.168.100.1   protocol-up/link-up/admin-up
Eth1/1               10.1.11.1       protocol-down/link-down/admin-down
Eth1/2               10.1.88.1       protocol-up/link-up/admin-up
Eth1/3               10.11.1.1       protocol-up/link-up/admin-up
Eth1/4               10.88.1.1       protocol-up/link-up/admin-up
Example 1-36 Interface Loopback 88 DOWN on BGW-1.

Example 1-37 verifies that the Fabric-Link is also down.

BGW-1# sh nve multisite fabric-links
Interface      State
---------      -----
Ethernet1/1    Down
Example 1-37 sh nve multisite fabric-links on BGW-1.

Capture 1-4 shows that BGW-1 sends an MP_UNREACH_NLRI for the IP address of Loopback 88 to the DC Core switch over the BGP IPv4 Unicast peering.

Internet Protocol Version 4, Src: 10.1.88.1, Dst: 10.1.88.88
Border Gateway Protocol - UPDATE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 35
    Type: UPDATE Message (2)
    Withdrawn Routes Length: 0
    Total Path Attribute Length: 12
    Path attributes
        Path Attribute - MP_UNREACH_NLRI
            Flags: 0x90, Optional, Extended-Length, Non-transitive, Complete
            Type Code: MP_UNREACH_NLRI (15)
            Length: 8
            Address family identifier (AFI): IPv4 (1)
            Subsequent address family identifier (SAFI): Unicast (1)
            Withdrawn routes (5 bytes)
                192.168.88.12/32
Capture 1-4: BGP IPv4 Unicast MP_UNREACH_NLRI withdrawing 192.168.88.12/32 (captured from BGW-1).
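
The lengths shown in Capture 1-4 can be verified with a short Python sketch. The attribute bytes below are reconstructed from the decoded fields (AFI 1, SAFI 1, prefix 192.168.88.12/32) and not copied from the capture, and the function name is hypothetical.

```python
import ipaddress


def decode_mp_unreach_ipv4(value: bytes) -> list:
    """Decode the value of an MP_UNREACH_NLRI attribute for IPv4 Unicast."""
    afi = int.from_bytes(value[0:2], "big")
    safi = value[2]
    if (afi, safi) != (1, 1):
        raise ValueError("this sketch only handles AFI 1 / SAFI 1")
    prefixes = []
    pos = 3
    while pos < len(value):
        prefix_len = value[pos]                       # prefix length in bits
        nbytes = (prefix_len + 7) // 8                # bytes actually carried
        raw = value[pos + 1:pos + 1 + nbytes]
        padded = raw + bytes(4 - nbytes)              # pad to a full IPv4 address
        prefixes.append(f"{ipaddress.IPv4Address(padded)}/{prefix_len}")
        pos += 1 + nbytes
    return prefixes


# AFI (2 bytes) + SAFI (1 byte) + prefix length (1 byte) + prefix (4 bytes)
attr_value = bytes.fromhex("0001" "01" "20" "c0a8580c")
print(decode_mp_unreach_ipv4(attr_value))

# Length check against the capture.
attr_total = 1 + 1 + 2 + len(attr_value)  # flags + type code + extended length + value
message_length = 16 + 2 + 1 + 2 + 2 + attr_total  # marker, length, type, two length fields
print(len(attr_value), attr_total, message_length)
```

The attribute value is 8 bytes, the path attributes add up to 12 bytes and the whole UPDATE message to 35 bytes, which matches the Length fields in the capture.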

As a result, the DC Core switch removes the path via BGW-1 (10.1.88.1) from its BGP IPv4 table, so the Shared VIP 192.168.88.12/32 remains reachable only via BGW-2 (Example 1-38).

RouteServer-1# sh ip bgp
BGP routing table information for VRF default, address family IPv4 Unicast
BGP table version is 22, Local Router ID is 192.168.77.88
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

   Network            Next Hop            Metric     LocPrf     Weight Path
*>e10.1.88.0/24       10.1.88.1                0                     0 65012 ?
*>e10.2.88.0/24       10.2.88.2                0                     0 65012 ?
*>e10.3.88.0/24       10.3.88.3                0                     0 65034 ?
*>e10.88.1.0/24       10.1.88.1                0                     0 65012 ?
*>e10.88.2.0/24       10.2.88.2                0                     0 65012 ?
*>e10.88.3.0/24       10.3.88.3                0                     0 65034 ?
*>e192.168.0.1/32     10.1.88.1                0                     0 65012 ?
*>e192.168.0.2/32     10.2.88.2                0                     0 65012 ?
*>e192.168.0.3/32     10.3.88.3                0                     0 65034 ?
*>e192.168.77.1/32    10.1.88.1                0                     0 65012 ?
*>e192.168.77.2/32    10.2.88.2                0                     0 65012 ?
*>e192.168.77.3/32    10.3.88.3                0                     0 65034 ?
*>r192.168.77.88/32   0.0.0.0                  0        100      32768 ?
*>e192.168.88.12/32   10.2.88.2                0                     0 65012 ?
*>e192.168.88.34/32   10.3.88.3                0                     0 65034 ?
*>r192.168.88.88/32   0.0.0.0                  0        100      32768 ?
*>e192.168.100.1/32   10.1.88.1                0                     0 65012 ?
*>e192.168.100.2/32   10.2.88.2                0                     0 65012 ?
*>e192.168.100.3/32   10.3.88.3                0                     0 65034 ?
Example 1-38 Loopback88 of BGW-1 removed from the BGP IPv4 table of DC Core switch.

BGW-1 has also withdrawn all of its Route-Type 2-5 routes. Example 1-39 shows that the routes originated by BGW-1 have been removed from the BGW-3 BGP L2VPN EVPN table.

BGW-3# sh bgp l2vpn evpn
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 87, Local Router ID is 192.168.77.3
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

   Network            Next Hop            Metric     LocPrf     Weight Path
Route Distinguisher: 192.168.77.2:27001
*>e[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
                      192.168.100.2                                  0 65088 65012 i

Route Distinguisher: 192.168.77.2:32777
*>e[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                                  0 65088 65012 i
*>e[3]:[0]:[32]:[192.168.100.2]/88
                      192.168.100.2                                  0 65088 65012 i

Route Distinguisher: 192.168.77.3:27001   (ES [0300.0000.0000.0c00.0309 0])
*>e[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
                      192.168.100.2                                  0 65088 65012 i
*>l[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.3]/136
                      192.168.100.3                     100      32768 i

Route Distinguisher: 192.168.77.3:32777    (L2VNI 10000)
*>e[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.88.12                                  0 65088 65012 i
*>e[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                                  0 65088 65012 i
*>l[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.100.3                     100      32768 i
*>e[3]:[0]:[32]:[192.168.100.2]/88
                      192.168.100.2                                  0 65088 65012 i
*>l[3]:[0]:[32]:[192.168.100.3]/88
                      192.168.100.3                     100      32768 i

Route Distinguisher: 192.168.77.101:32777
*>e[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.88.12                                  0 65088 65012 i
Example 1-39 BGP table of BGW-3 after fabric-link failure in BGW-1.

Fabric-Link Recovery
When the fabric-link is brought back up on BGW-1, the Admin state of Interface Loopback 88 is UP while its Operational state is still kept DOWN. BGW-1 starts the Delay-Restore Timer, as can be seen from Examples 1-40 and 1-41.

BGW-1# show nve interface nve 1 detail
Interface: nve1, State: Up, encapsulation: VXLAN
 VPC Capability: VPC-VIP-Only [not-notified]
 Local Router MAC: 5000.0002.0007
 Host Learning Mode: Control-Plane
 Source-Interface: loopback100 (primary: 192.168.100.1, secondary: 0.0.0.0)
 Source Interface State: Up
 Virtual RMAC Advertisement: No
 NVE Flags:
 Interface Handle: 0x49000001
 Source Interface hold-down-time: 180
 Source Interface hold-up-time: 30
 Remaining hold-down time: 0 seconds
 Virtual Router MAC: N/A
 Virtual Router MAC Re-origination: 0200.c0a8.580c
 Interface state: nve-intf-add-complete
 Multisite delay-restore time: 300 seconds
 Multisite delay-restore time left: 236 seconds
 Multisite bgw-if: loopback88 (ip: 192.168.88.12, admin: Up, oper: Down)
 Multisite bgw-if oper down reason:
Example 1-40 Delay Restore Timer on BGW-1.

BGW-1# show nve interface nve 1 detail | i  Multisite
 Multisite delay-restore time: 300 seconds
 Multisite delay-restore time left: 20 seconds
 Multisite bgw-if: loopback88 (ip: 192.168.88.12, admin: Up, oper: Down)
 Multisite bgw-if oper down reason:
Example 1-41 Delay Restore Timer on BGW-1.

After 300 seconds, BGW-1 changes the Operational state of Interface Loopback 88 to UP, as shown in Examples 1-42 and 1-43.

BGW-1# show nve interface nve 1 detail | i  Multisite
 Multisite delay-restore time: 300 seconds
 Multisite delay-restore time left: 0 seconds
 Multisite bgw-if: loopback88 (ip: 192.168.88.12, admin: Up, oper: Up)
 Multisite bgw-if oper down reason:
Example 1-42 Delay Restore Timer on BGW-1.
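
The timer behavior seen in Examples 1-40 to 1-42 can be modeled in a few lines of Python. This is a simplified sketch that covers only the recovery case and assumes nothing else holds the interface down; the function name is hypothetical and the 300-second default is the value shown in the output.

```python
def bgw_if_oper_state(seconds_since_recovery: int, delay_restore: int = 300) -> tuple:
    """Return (oper state, time left) of the Multi-Site loopback during recovery."""
    time_left = max(delay_restore - seconds_since_recovery, 0)
    # The VIP loopback stays operationally Down until the timer has expired.
    state = "Up" if time_left == 0 else "Down"
    return state, time_left


# 64 s, 280 s and 300 s after the fabric-link came back up.
for elapsed in (64, 280, 300):
    print(elapsed, bgw_if_oper_state(elapsed))
```

The three sample values reproduce the "time left" readings of 236, 20 and 0 seconds from the examples above.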


BGW-1# sh ip int bri

IP Interface Status for VRF "default"(1)
Interface            IP Address      Interface Status
Lo0                  192.168.0.1     protocol-up/link-up/admin-up
Lo77                 192.168.77.1    protocol-up/link-up/admin-up
Lo88                 192.168.88.12   protocol-up/link-up/admin-up
Lo100                192.168.100.1   protocol-up/link-up/admin-up
Eth1/1               10.1.11.1       protocol-up/link-up/admin-up
Eth1/2               10.1.88.1       protocol-up/link-up/admin-up
Eth1/3               10.11.1.1       protocol-up/link-up/admin-up
Eth1/4               10.88.1.1       protocol-up/link-up/admin-up
Example 1-43 Loopback 88 status after recovery on BGW-1.

The network has recovered, as can be seen from Examples 1-44 and 1-45.

RouteServer-1# sh ip bgp
BGP routing table information for VRF default, address family IPv4 Unicast
BGP table version is 23, Local Router ID is 192.168.77.88
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

   Network            Next Hop            Metric     LocPrf     Weight Path
*>e10.1.88.0/24       10.1.88.1                0                     0 65012 ?
*>e10.2.88.0/24       10.2.88.2                0                     0 65012 ?
*>e10.3.88.0/24       10.3.88.3                0                     0 65034 ?
*>e10.88.1.0/24       10.1.88.1                0                     0 65012 ?
*>e10.88.2.0/24       10.2.88.2                0                     0 65012 ?
*>e10.88.3.0/24       10.3.88.3                0                     0 65034 ?
*>e192.168.0.1/32     10.1.88.1                0                     0 65012 ?
*>e192.168.0.2/32     10.2.88.2                0                     0 65012 ?
*>e192.168.0.3/32     10.3.88.3                0                     0 65034 ?
*>e192.168.77.1/32    10.1.88.1                0                     0 65012 ?
*>e192.168.77.2/32    10.2.88.2                0                     0 65012 ?
*>e192.168.77.3/32    10.3.88.3                0                     0 65034 ?
*>r192.168.77.88/32   0.0.0.0                  0        100      32768 ?
*|e192.168.88.12/32   10.1.88.1                0                     0 65012 ?
*>e                   10.2.88.2                0                     0 65012 ?
*>e192.168.88.34/32   10.3.88.3                0                     0 65034 ?
*>r192.168.88.88/32   0.0.0.0                  0        100      32768 ?
*>e192.168.100.1/32   10.1.88.1                0                     0 65012 ?
*>e192.168.100.2/32   10.2.88.2                0                     0 65012 ?
*>e192.168.100.3/32   10.3.88.3                0                     0 65034 ?
Example 1-44: BGP IPv4 table on DC Core Switch after recovery.


BGW-3# sh bgp l2vpn evpn
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 103, Local Router ID is 192.168.77.3
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

   Network            Next Hop            Metric     LocPrf     Weight Path
Route Distinguisher: 192.168.77.1:27001
*>e[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.1]/136
                      192.168.100.1                                  0 65088 65012 i

Route Distinguisher: 192.168.77.1:32777
*>e[2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216
                      192.168.100.1                                  0 65088 65012 i
*>e[3]:[0]:[32]:[192.168.100.1]/88
                      192.168.100.1                                  0 65088 65012 i

Route Distinguisher: 192.168.77.2:27001
*>e[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
                      192.168.100.2                                  0 65088 65012 i

Route Distinguisher: 192.168.77.2:32777
*>e[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                                  0 65088 65012 i
*>e[3]:[0]:[32]:[192.168.100.2]/88
                      192.168.100.2                                  0 65088 65012 i

Route Distinguisher: 192.168.77.3:27001   (ES [0300.0000.0000.0c00.0309 0])
*>e[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.1]/136
                      192.168.100.1                                  0 65088 65012 i
*>e[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
                      192.168.100.2                                  0 65088 65012 i
*>l[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.3]/136
                      192.168.100.3                     100      32768 i

Route Distinguisher: 192.168.77.3:32777    (L2VNI 10000)
*>e[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.88.12                                  0 65088 65012 i
*>e[2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216
                      192.168.100.1                                  0 65088 65012 i
*>e[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                                  0 65088 65012 i
*>l[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.100.3                     100      32768 i
*>e[3]:[0]:[32]:[192.168.100.1]/88
                      192.168.100.1                                  0 65088 65012 i
*>e[3]:[0]:[32]:[192.168.100.2]/88
                      192.168.100.2                                  0 65088 65012 i
*>l[3]:[0]:[32]:[192.168.100.3]/88
                      192.168.100.3                     100      32768 i

Route Distinguisher: 192.168.77.101:32777
*>e[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.88.12                                  0 65088 65012 i
Example 1-45: BGP L2VPN EVPN table on BGW-3 after recovery.

DCI-Link Failure

When all of the DCI links of a BGW are down, it stops advertising the VIP address to its Intra-Site peers, just as in the previously discussed Fabric-Link failure. Naturally, it also stops advertising routes learned via the DCI links. It does, however, continue to act as a regular Leaf switch. If it has connected hosts or external peers, it continues to advertise the prefixes attached to or learned from those.



Figure 1-9: Inter-Site DCI-link failure on BGW-1.

Normal State
Example 1-46 shows that Spine-11 has learned Site-12 Shared VIP from both Intra-Site BGW switches via OSPF (Underlay Network).
Spine-11# sh ip route 192.168.88.12
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.88.12/32, ubest/mbest: 2/0
    *via 10.1.11.1, Eth1/1, [110/41], 01:28:36, ospf-UNDERLAY-NET, intra
    *via 10.2.11.2, Eth1/2, [110/41], 01:28:36, ospf-UNDERLAY-NET, intra
Example 1-46: RIB on Spine-11 in normal situation.

Example 1-47 shows that Spine-11 uses both BGW-1 and BGW-2 for load sharing Inter-Site data traffic. Note that the MAC address 5000.0004.0007 is the System MAC address of BGW-3 on Site-34.
Spine-11# sh bgp l2vpn evpn
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 122, Local Router ID is 192.168.77.11
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

   Network            Next Hop            Metric     LocPrf     Weight Path
Route Distinguisher: 192.168.77.1:27001
*>i[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.1]/136
                      192.168.100.1                     100          0 i

Route Distinguisher: 192.168.77.1:32777
*>i[2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216
                      192.168.100.1                     100          0 i

Route Distinguisher: 192.168.77.2:27001
*>i[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
                      192.168.100.2                     100          0 i

Route Distinguisher: 192.168.77.2:32777
*>i[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                     100          0 i

Route Distinguisher: 192.168.77.3:32777
*>i[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.88.12                     100          0 65088 65034 i
* i                   192.168.88.12                     100          0 65088 65034 i

Route Distinguisher: 192.168.77.101:32777
*>i[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.100.101                   100          0 i
*>i[2]:[0]:[0]:[48]:[1000.0010.abba]:[32]:[172.16.10.101]/272
                      192.168.100.101                   100          0 i
Example 1-47: BGP L2VPN EVPN table on Spine-11 in normal situation.

DCI Link Failure
The DCI link failure is demonstrated by shutting down the DCI interface e1/2. The state of the link is verified in Examples 1-48 and 1-49.
BGW-1# sh nve multisite dci-links
Interface      State
---------      -----
Ethernet1/2    Down
Example 1-48: sh nve multisite dci-links on BGW-1.

BGW-1# sh ip int bri

IP Interface Status for VRF "default"(1)
Interface            IP Address      Interface Status
Lo0                  192.168.0.1     protocol-up/link-up/admin-up
Lo77                 192.168.77.1    protocol-up/link-up/admin-up
Lo88                 192.168.88.12   protocol-down/link-down/admin-up
Lo100                192.168.100.1   protocol-up/link-up/admin-up
Eth1/1               10.1.11.1       protocol-up/link-up/admin-up
Eth1/2               10.1.88.1       protocol-down/link-down/admin-down
Eth1/3               10.11.1.1       protocol-up/link-up/admin-up
Eth1/4               10.88.1.1       protocol-up/link-up/admin-up
Example 1-49: sh ip int bri on BGW-1.

Example 1-50 below shows that the reason for the Down state of the Multi-Site interface Loopback 88 is “DCI isolated”.
BGW-1# show nve interface nve 1 detail
Interface: nve1, State: Up, encapsulation: VXLAN
 VPC Capability: VPC-VIP-Only [not-notified]
 Local Router MAC: 5000.0002.0007
 Host Learning Mode: Control-Plane
 Source-Interface: loopback100 (primary: 192.168.100.1, secondary: 0.0.0.0)
 Source Interface State: Up
 Virtual RMAC Advertisement: No
 NVE Flags:
 Interface Handle: 0x49000001
 Source Interface hold-down-time: 180
 Source Interface hold-up-time: 30
 Remaining hold-down time: 0 seconds
 Virtual Router MAC: N/A
 Virtual Router MAC Re-origination: 0200.c0a8.580c
 Interface state: nve-intf-add-complete
 Multisite delay-restore time: 300 seconds
 Multisite delay-restore time left: 0 seconds
 Multisite bgw-if: loopback88 (ip: 192.168.88.12, admin: Up, oper: Down)
 Multisite bgw-if oper down reason:  DCI isolated.
Example 1-50: show nve interface nve 1 detail on BGW-1.
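
The behaviour seen in Examples 1-48 to 1-50 can be summarised with a small model. The Python sketch below is only an illustration, not NX-OS code. The function name bgw_if_state and the "fabric isolated" label are assumptions made for this example, while the "DCI isolated" reason and the interface names are taken from the outputs above. It assumes that the Multi-Site loopback stays up only while at least one dci-tracking link and at least one fabric-tracking link are up.

```python
def bgw_if_state(dci_links, fabric_links):
    """Return (oper_state, reason) for the Multi-Site loopback (bgw-if).

    Both arguments map an interface name to True (link up) or False (down).
    """
    if not any(dci_links.values()):
        # No tracked DCI link is up: the BGW cannot reach remote sites.
        return "Down", "DCI isolated"
    if not any(fabric_links.values()):
        # Label assumed for this sketch, not copied from NX-OS output.
        return "Down", "fabric isolated (assumed label)"
    return "Up", ""


# BGW-1 in this lab: e1/2 is the only dci-tracking interface and it is shut.
state, reason = bgw_if_state({"Ethernet1/2": False}, {"Ethernet1/1": True})
print(state, reason)
```

With the tracked DCI link down the model returns the same state and reason as Example 1-50. As soon as one tracked DCI link is up again, it returns Up, which in the real switch is additionally delayed by the Delay-Restore timer.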

BGW-1 has withdrawn the VIP address, so Spine-11 now has only one path to the shared Multi-Site VIP address 192.168.88.12, via BGW-2.
Spine-11# sh ip route 192.168.88.12
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.88.12/32, ubest/mbest: 1/0
    *via 10.2.11.2, Eth1/2, [110/41], 00:00:45, ospf-UNDERLAY-NET, intra
Example 1-51: show ip route 192.168.88.12 on Spine-11.

BGW-1 has also withdrawn all the routes (Route-Types 2-5) that it received via the DCI link. Spine-11 now learns Inter-Site routes only via BGW-2.
Spine-11# sh bgp l2vpn evpn
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 95, Local Router ID is 192.168.77.11
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup, 2 - best2

   Network            Next Hop            Metric     LocPrf     Weight Path
Route Distinguisher: 192.168.77.1:32777
*>i[2]:[0]:[0]:[48]:[5000.0002.0007]:[0]:[0.0.0.0]/216
                      192.168.100.1                     100          0 i

Route Distinguisher: 192.168.77.2:27001
*>i[4]:[0300.0000.0000.0c00.0309]:[32]:[192.168.100.2]/136
                      192.168.100.2                     100          0 i

Route Distinguisher: 192.168.77.2:32777
*>i[2]:[0]:[0]:[48]:[5000.0003.0007]:[0]:[0.0.0.0]/216
                      192.168.100.2                     100          0 i

Route Distinguisher: 192.168.77.3:32777
*>i[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.88.12                     100          0 65088 65034 i

Route Distinguisher: 192.168.77.101:32777
*>i[2]:[0]:[0]:[48]:[1000.0010.abba]:[0]:[0.0.0.0]/216
                      192.168.100.101                   100          0 i
*>i[2]:[0]:[0]:[48]:[1000.0010.abba]:[32]:[172.16.10.101]/272
                      192.168.100.101                   100          0 i
Example 1-52: sh bgp l2vpn evpn on Spine-11.
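
The difference between Examples 1-47 and 1-52 is easy to miss by eye, so it can help to count the paths per route. The following Python sketch is a minimal, hedged example: the parser paths_per_route is a helper written for this post, not a Cisco tool, and it only understands the simple table layout shown above. The two text samples are excerpts copied from Examples 1-47 and 1-52.

```python
import re
from collections import defaultdict

IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

NORMAL = """\
Route Distinguisher: 192.168.77.3:32777
*>i[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.88.12                     100          0 65088 65034 i
* i                   192.168.88.12                     100          0 65088 65034 i
"""

DCI_DOWN = """\
Route Distinguisher: 192.168.77.3:32777
*>i[2]:[0]:[0]:[48]:[5000.0004.0007]:[0]:[0.0.0.0]/216
                      192.168.88.12                     100          0 65088 65034 i
"""


def paths_per_route(output):
    """Map (RD, prefix) to the list of next hops, one entry per path."""
    routes = defaultdict(list)
    rd = prefix = None
    for line in output.splitlines():
        if line.startswith("Route Distinguisher:"):
            rd, prefix = line.split(":", 1)[1].strip(), None
        elif "]:[" in line:
            # Prefix line: drop the status codes in front of the first '['.
            prefix = line[line.index("["):].strip()
        elif rd and prefix:
            # Path line: the first IPv4 address on the line is the next hop.
            match = IPV4.search(line)
            if match:
                routes[(rd, prefix)].append(match.group(0))
    return dict(routes)


for name, text in (("normal", NORMAL), ("DCI link down", DCI_DOWN)):
    for (rd, prefix), hops in paths_per_route(text).items():
        print(f"{name}: RD {rd} has {len(hops)} path(s) via {sorted(set(hops))}")
```

Both paths point to the same VIP next hop 192.168.88.12, so the next-hop address alone does not reveal the failure. Only the number of paths drops from two to one.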

DCI Link Recovery
The recovery process is the same as in the case of a Fabric-Link failure. BGW-1 starts the Delay-Restore timer, which is set to 300 seconds.
BGW-1# show nve interface nve 1 detail
Interface: nve1, State: Up, encapsulation: VXLAN
 VPC Capability: VPC-VIP-Only [not-notified]
 Local Router MAC: 5000.0002.0007
 Host Learning Mode: Control-Plane
 Source-Interface: loopback100 (primary: 192.168.100.1, secondary: 0.0.0.0)
 Source Interface State: Up
 Virtual RMAC Advertisement: No
 NVE Flags:
 Interface Handle: 0x49000001
 Source Interface hold-down-time: 180
 Source Interface hold-up-time: 30
 Remaining hold-down time: 0 seconds
 Virtual Router MAC: N/A
 Virtual Router MAC Re-origination: 0200.c0a8.580c
 Interface state: nve-intf-add-complete
 Multisite delay-restore time: 300 seconds
 Multisite delay-restore time left: 295 seconds
 Multisite bgw-if: loopback88 (ip: 192.168.88.12, admin: Up, oper: Down)
 Multisite bgw-if oper down reason:
Example 1-53: Delay-restore timer start on BGW-1.

BGW-1 changes the status of interface Loopback 88 to Up after 300 seconds and resumes normal operation.
BGW-1# show nve interface nve 1 detail | i Multisite
 Multisite delay-restore time: 300 seconds
 Multisite delay-restore time left: 0 seconds
 Multisite bgw-if: loopback88 (ip: 192.168.88.12, admin: Up, oper: Up)
 Multisite bgw-if oper down reason:
Example 1-54: Delay-restore timer stop on BGW-1.
Note that during the failure, BGW-2 takes over the Designated Forwarder role for all Intra-Site VNIs.
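
The Delay-Restore behaviour shown in Examples 1-53 and 1-54 can be expressed as a simple countdown. The sketch below is a simplified model under the assumption that the timer starts when the tracked DCI link comes back up and that Loopback 88 is brought up exactly when the timer reaches zero. The function name bgw_if_after_recovery is hypothetical.

```python
def bgw_if_after_recovery(seconds_since_recovery, delay_restore=300):
    """Return (oper_state, seconds_left) for the Multi-Site loopback."""
    remaining = max(delay_restore - seconds_since_recovery, 0)
    oper_state = "Up" if remaining == 0 else "Down"
    return oper_state, remaining


# Five seconds after the link recovery (compare with Example 1-53).
print(bgw_if_after_recovery(5))
# After the configured 300 seconds (compare with Example 1-54).
print(bgw_if_after_recovery(300))
```

The default value 300 matches the "delay-restore time 300" command under "evpn multisite border-gateway 12" in the BGW configurations of Appendix A.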


Author: Toni Pasanen CCIE#28158
Published: 6 - August 2019

References

Building Data Centers with VXLAN BGP EVPN – A Cisco NX-OS Perspective
ISBN-10: 1-58714-467-0 – Lukas Krattiger, Shyam Kapadia, and David Jansen

Internet Engineering Task Force (IETF): Multicast in MPLS/BGP IP VPNs. 2012

Internet Engineering Task Force (IETF): BGP Encodings and Procedures for Multicast in MPLS/BGP IP VPNs. 2012

Internet Engineering Task Force (IETF): BGP MPLS-Based Ethernet VPN. 2015

BESS Working Group: Multi-site EVPN based VXLAN using Border Gateways. 2018

Cisco.com: VXLAN EVPN Multi-Site Design and Deployment:




Appendix A.
Configuration files for BGW switches and DC Core switch
BGW-1 Configuration
BGW-1# sh run

!Command: show running-config
!Running configuration last done at: Wed Aug  7 09:19:25 2019
!Time: Wed Aug  7 09:21:43 2019

version 9.2(3) Bios:version
hostname BGW-1
vdc BGW-1 id 1
  limit-resource vlan minimum 16 maximum 4094
  limit-resource vrf minimum 2 maximum 4096
  limit-resource port-channel minimum 0 maximum 511
  limit-resource u4route-mem minimum 248 maximum 248
  limit-resource u6route-mem minimum 96 maximum 96
  limit-resource m4route-mem minimum 58 maximum 58
  limit-resource m6route-mem minimum 8 maximum 8

nv overlay evpn
feature ospf
feature bgp
feature pim
feature fabric forwarding
feature interface-vlan
feature vn-segment-vlan-based
feature lacp
feature nv overlay

username admin password 5 $5$YTfyrnCx$D0BEzwcJJWm/PRjj/ykdkAySBr/9B6dsou/NWEAm6D4  role network-admin
ip domain-lookup
copp profile strict
evpn multisite border-gateway 12
  delay-restore time 300
snmp-server user admin network-admin auth md5 0x42cd35684f49b26fca133253a1e0519d priv 0x42cd35684f49b26fca133253a1e0519d localizedkey
rmon event 1 description FATAL(1) owner PMON@FATAL
rmon event 2 description CRITICAL(2) owner PMON@CRITICAL
rmon event 3 description ERROR(3) owner PMON@ERROR
rmon event 4 description WARNING(4) owner PMON@WARNING
rmon event 5 description INFORMATION(5) owner PMON@INFO

fabric forwarding anycast-gateway-mac 0001.0001.0001
ip pim rp-address 192.168.238.1 group-list 238.0.0.0/24 bidir
ip pim ssm range 232.0.0.0/8
vlan 1,10,30,40,50,77
vlan 10
  name L2VNI-for-VLAN10
  vn-segment 10000
vlan 30
  vn-segment 30000
vlan 40
  vn-segment 40000
vlan 50
  vn-segment 50000
vlan 77
  name TENANT77
  vn-segment 10077

route-map REDIST-TO-SITE-EXT-DCI permit 10
  match tag 1234
vrf context TENANT77
  vni 10077
  rd auto
  address-family ipv4 unicast
    route-target both auto
    route-target both auto evpn
vrf context management
hardware access-list tcam region racl 512
hardware access-list tcam region vpc-convergence 256
hardware access-list tcam region arp-ether 256 double-wide


interface Vlan1

interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback100
  multisite border-gateway interface loopback88
  member vni 10000
    multisite ingress-replication
    mcast-group 238.0.0.10
  member vni 10077 associate-vrf
  member vni 30000
    mcast-group 238.0.0.10
  member vni 40000
    mcast-group 238.0.0.10
  member vni 50000
    mcast-group 238.0.0.10

interface Ethernet1/1
  description **Fabric Internal **
  no switchport
  mac-address b063.0001.1e11
  medium p2p
  ip address 10.1.11.1/24
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
  evpn multisite fabric-tracking
  no shutdown

interface Ethernet1/2
  description ** DCI Interface **
  no switchport
  mac-address b063.0001.1e12
  medium p2p
  ip address 10.1.88.1/24 tag 1234
  ip pim sparse-mode
  evpn multisite dci-tracking
  no shutdown

interface Ethernet1/3
  description **Fabric Internal **
  no switchport
  mac-address b063.0001.1e13
  medium p2p
  ip address 10.11.1.1/24
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
  no shutdown

interface Ethernet1/4
  description ** DCI Interface **
  no switchport
  mac-address b063.0001.1e14
  medium p2p
  ip address 10.88.1.1/24 tag 1234
  no shutdown

interface mgmt0
  vrf member management

interface loopback0
  description ** RID/Underlay **
  ip address 192.168.0.1/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode

interface loopback77
  description ** BGP peering **
  ip address 192.168.77.1/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode

interface loopback88
  description ** VIP for DCI-Inter-connect **
  ip address 192.168.88.12/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0

interface loopback100
  description ** VTEP/Overlay **
  ip address 192.168.100.1/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
line console
line vty
boot nxos bootflash:/nxos.9.2.3.bin
router ospf UNDERLAY-NET
  router-id 192.168.0.1
router bgp 65012
  router-id 192.168.77.1
  no enforce-first-as
  address-family ipv4 unicast
    redistribute direct route-map REDIST-TO-SITE-EXT-DCI
  address-family l2vpn evpn
  neighbor 10.1.88.88
    remote-as 65088
    update-source Ethernet1/2
    address-family ipv4 unicast
  neighbor 192.168.77.11
    remote-as 65012
    description ** Spine-11 BGP-RR **
    update-source loopback77
    address-family l2vpn evpn
      send-community extended
  neighbor 192.168.77.88
    remote-as 65088
    update-source loopback77
    ebgp-multihop 5
    peer-type fabric-external
    address-family l2vpn evpn
      send-community
      send-community extended
      rewrite-evpn-rt-asn
  vrf TENANT77
    address-family ipv4 unicast
      advertise l2vpn evpn
evpn
  vni 10000 l2
    rd auto
    route-target import auto
    route-target export auto
  vni 30000 l2
    rd auto
    route-target import auto
    route-target export auto
  vni 40000 l2
    rd auto
    route-target import auto
    route-target export auto
  vni 50000 l2
    rd auto
    route-target import auto
    route-target export auto
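
In the configuration above, the route-map REDIST-TO-SITE-EXT-DCI matches tag 1234 and is applied to "redistribute direct" under the BGP IPv4 unicast address family. The following Python sketch shows which connected subnets that selects. It is only an illustration: the helper tagged_direct_routes is a hypothetical name, the CONFIG text is a shortened excerpt of the BGW-1 configuration, and the sketch assumes that a direct route is the connected subnet of the tagged interface address.

```python
import ipaddress
import re

CONFIG = """\
interface Ethernet1/1
  ip address 10.1.11.1/24
interface Ethernet1/2
  ip address 10.1.88.1/24 tag 1234
interface loopback88
  ip address 192.168.88.12/32 tag 1234
interface loopback100
  ip address 192.168.100.1/32 tag 1234
"""

ADDRESS = re.compile(r"ip address (\S+)(?: tag (\d+))?$")


def tagged_direct_routes(config, tag):
    """Map interface name to its connected subnet if the address has the tag."""
    routes = {}
    interface = None
    for line in config.splitlines():
        stripped = line.strip()
        if stripped.startswith("interface "):
            interface = stripped.split(None, 1)[1]
            continue
        match = ADDRESS.match(stripped)
        if match and interface and match.group(2) == str(tag):
            network = ipaddress.ip_interface(match.group(1)).network
            routes[interface] = str(network)
    return routes


for name, subnet in tagged_direct_routes(CONFIG, 1234).items():
    print(f"{name}: {subnet}")
```

Ethernet1/1 is a Fabric-internal link without the tag, so its subnet is not selected. It is advertised inside the site by OSPF only.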

BGW-2 Configuration
BGW-2# sh run

!Command: show running-config
!Running configuration last done at: Wed Aug  7 09:19:31 2019
!Time: Wed Aug  7 09:24:10 2019

version 9.2(3) Bios:version
hostname BGW-2
vdc BGW-2 id 1
  limit-resource vlan minimum 16 maximum 4094
  limit-resource vrf minimum 2 maximum 4096
  limit-resource port-channel minimum 0 maximum 511
  limit-resource u4route-mem minimum 248 maximum 248
  limit-resource u6route-mem minimum 96 maximum 96
  limit-resource m4route-mem minimum 58 maximum 58
  limit-resource m6route-mem minimum 8 maximum 8

nv overlay evpn
feature ospf
feature bgp
feature pim
feature fabric forwarding
feature interface-vlan
feature vn-segment-vlan-based
feature lacp
feature nv overlay

username admin password 5 $5$6O5Ozded$6G9z9ZYJnto10KgJSqYou0dZilxI2abRLQOgpBTzu8A  role network-admin
ip domain-lookup
copp profile strict
evpn multisite border-gateway 12
  delay-restore time 300
snmp-server user admin network-admin auth md5 0x9bcc18427d4176f2aec8419a200a8bbf priv 0x9bcc18427d4176f2aec8419a200a8bbf localizedkey
rmon event 1 description FATAL(1) owner PMON@FATAL
rmon event 2 description CRITICAL(2) owner PMON@CRITICAL
rmon event 3 description ERROR(3) owner PMON@ERROR
rmon event 4 description WARNING(4) owner PMON@WARNING
rmon event 5 description INFORMATION(5) owner PMON@INFO

fabric forwarding anycast-gateway-mac 0001.0001.0001
ip pim rp-address 192.168.238.1 group-list 238.0.0.0/24 bidir
ip pim ssm range 232.0.0.0/8
vlan 1,10,30,40,50,77
vlan 10
  name L2VNI-for-VLAN10
  vn-segment 10000
vlan 30
  vn-segment 30000
vlan 40
  vn-segment 40000
vlan 50
  vn-segment 50000
vlan 77
  name TENANT77
  vn-segment 10077

route-map REDIST-TO-SITE-EXT-DCI permit 10
  match tag 1234
vrf context TENANT77
  vni 10077
  rd auto
  address-family ipv4 unicast
    route-target both auto
    route-target both auto evpn
vrf context management
hardware access-list tcam region racl 512
hardware access-list tcam region vpc-convergence 256
hardware access-list tcam region arp-ether 256 double-wide


interface Vlan1

interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback100
  multisite border-gateway interface loopback88
  member vni 10000
    multisite ingress-replication
    mcast-group 238.0.0.10
  member vni 10077 associate-vrf
  member vni 30000
    mcast-group 238.0.0.10
  member vni 40000
    mcast-group 238.0.0.10
  member vni 50000
    mcast-group 238.0.0.10

interface Ethernet1/1
  description **Fabric Internal **
  no switchport
  mac-address b063.0002.1e11
  medium p2p
  ip address 10.2.11.2/24
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
  evpn multisite fabric-tracking
  no shutdown

interface Ethernet1/2
  description ** DCI Interface **
  no switchport
  mac-address b063.0002.1e12
  medium p2p
  ip address 10.2.88.2/24 tag 1234
  ip ospf network point-to-point
  ip pim sparse-mode
  evpn multisite dci-tracking
  no shutdown

interface Ethernet1/3
  description **Fabric Internal **
  no switchport
  mac-address b063.0002.1e13
  medium p2p
  ip address 10.11.2.2/24
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
  no shutdown

interface Ethernet1/4
  description ** DCI Interface **
  no switchport
  mac-address b063.0002.1e14
  medium p2p
  ip address 10.88.2.2/24 tag 1234
  no shutdown
interface mgmt0
  vrf member management

interface loopback0
  description ** RID/Underlay **
  ip address 192.168.0.2/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode

interface loopback77
  description ** BGP peering **
  ip address 192.168.77.2/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode

interface loopback88
  description ** VIP for DCI-Inter-connect **
  ip address 192.168.88.12/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0

interface loopback100
  description ** VTEP/Overlay **
  ip address 192.168.100.2/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
line console
line vty
boot nxos bootflash:/nxos.9.2.3.bin
router ospf UNDERLAY-NET
  router-id 192.168.0.2
router bgp 65012
  router-id 192.168.77.2
  no enforce-first-as
  address-family ipv4 unicast
    redistribute direct route-map REDIST-TO-SITE-EXT-DCI
  address-family l2vpn evpn
  neighbor 10.2.88.88
    remote-as 65088
    update-source Ethernet1/2
    peer-type fabric-external
    address-family ipv4 unicast
  neighbor 192.168.77.11
    remote-as 65012
    description ** Spine-11 BGP-RR **
    update-source loopback77
    address-family l2vpn evpn
      send-community extended
  neighbor 192.168.77.88
    remote-as 65088
    update-source loopback77
    ebgp-multihop 5
    peer-type fabric-external
    address-family l2vpn evpn
      send-community
      send-community extended
      rewrite-evpn-rt-asn
  vrf TENANT77
    address-family ipv4 unicast
      advertise l2vpn evpn
evpn
  vni 10000 l2
    rd auto
    route-target import auto
    route-target export auto
  vni 30000 l2
    rd auto
    route-target import auto
    route-target export auto
  vni 40000 l2
    rd auto
    route-target import auto
    route-target export auto
  vni 50000 l2
    rd auto
    route-target import auto
    route-target export auto

BGW-3 Configuration
BGW-3# sh run

!Command: show running-config
!No configuration change since last restart
!Time: Wed Aug  7 09:36:25 2019

version 9.2(3) Bios:version
hostname BGW-3
vdc BGW-3 id 1
  limit-resource vlan minimum 16 maximum 4094
  limit-resource vrf minimum 2 maximum 4096
  limit-resource port-channel minimum 0 maximum 511
  limit-resource u4route-mem minimum 248 maximum 248
  limit-resource u6route-mem minimum 96 maximum 96
  limit-resource m4route-mem minimum 58 maximum 58
  limit-resource m6route-mem minimum 8 maximum 8

nv overlay evpn
feature ospf
feature bgp
feature pim
feature fabric forwarding
feature interface-vlan
feature vn-segment-vlan-based
feature lacp
feature nv overlay

username admin password 5 $5$O9jHouJ4$gMMf.hMYXJRamUNys17VtdztzLMNq1PdMQDIc1xPZu9  role network-admin
ip domain-lookup
copp profile strict
evpn multisite border-gateway 12
  delay-restore time 300
snmp-server user admin network-admin auth md5 0x423cb9002003f0f3c3acb917bba00bf8 priv 0x423cb9002003f0f3c3acb917bba00bf8 localizedkey
rmon event 1 description FATAL(1) owner PMON@FATAL
rmon event 2 description CRITICAL(2) owner PMON@CRITICAL
rmon event 3 description ERROR(3) owner PMON@ERROR
rmon event 4 description WARNING(4) owner PMON@WARNING
rmon event 5 description INFORMATION(5) owner PMON@INFO

fabric forwarding anycast-gateway-mac 0001.0001.0001
ip pim rp-address 192.168.238.1 group-list 238.0.0.0/24 bidir
ip pim ssm range 232.0.0.0/8
vlan 1,10,77
vlan 10
  name L2VNI-for-VLAN10
  vn-segment 10000
vlan 77
  name TENANT77
  vn-segment 10077

route-map REDIST-TO-SITE-EXT-DCI permit 10
  match tag 1234
vrf context TENANT77
  vni 10077
  rd auto
  address-family ipv4 unicast
    route-target both auto
    route-target both auto evpn
vrf context management
hardware access-list tcam region racl 512
hardware access-list tcam region vpc-convergence 256
hardware access-list tcam region arp-ether 256 double-wide


interface Vlan1

interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback100
  multisite border-gateway interface loopback88
  member vni 10000
    multisite ingress-replication
    mcast-group 238.0.0.10
  member vni 10077 associate-vrf

interface Ethernet1/1
  description **Fabric Internal **
  no switchport
  mac-address b063.0003.1e11
  medium p2p
  ip address 10.3.12.3/24
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
  evpn multisite fabric-tracking
  no shutdown

interface Ethernet1/2
  description ** DCI Interface **
  no switchport
  mac-address b063.0003.1e12
  medium p2p
  ip address 10.3.88.3/24 tag 1234
  evpn multisite dci-tracking
  no shutdown

interface Ethernet1/3
  description **Fabric Internal **
  no switchport
  mac-address b063.0003.1e13
  medium p2p
  ip address 10.12.3.3/24
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
  no shutdown

interface Ethernet1/4
  description ** DCI Interface **
  no switchport
  mac-address b063.0003.1e14
  medium p2p
  ip address 10.88.3.3/24 tag 1234
  no shutdown
interface mgmt0
  vrf member management

interface loopback0
  description ** RID/Underlay **
  ip address 192.168.0.3/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode

interface loopback77
  description ** BGP peering **
  ip address 192.168.77.3/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0

interface loopback88
  description ** VIP for DCI-Inter-connect **
  ip address 192.168.88.34/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0

interface loopback100
  description ** VTEP/Overlay **
  ip address 192.168.100.3/32 tag 1234
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
line console
line vty
boot nxos bootflash:/nxos.9.2.3.bin
router ospf UNDERLAY-NET
  router-id 192.168.0.3
router bgp 65034
  router-id 192.168.77.3
  no enforce-first-as
  address-family ipv4 unicast
    redistribute direct route-map REDIST-TO-SITE-EXT-DCI
    maximum-paths 5
    maximum-paths ibgp 5
  address-family l2vpn evpn
  neighbor 10.3.88.88
    remote-as 65088
    update-source Ethernet1/2
    address-family ipv4 unicast
  neighbor 10.88.3.88
    remote-as 65088
    update-source Ethernet1/4
    address-family ipv4 unicast
  neighbor 192.168.77.12
    remote-as 65034
    description ** Spine-11 BGP-RR **
    update-source loopback77
    address-family l2vpn evpn
      send-community extended
  neighbor 192.168.77.88
    remote-as 65088
    update-source loopback77
    ebgp-multihop 5
    peer-type fabric-external
    address-family l2vpn evpn
      send-community
      send-community extended
      rewrite-evpn-rt-asn
  vrf TENANT77
    address-family ipv4 unicast
      advertise l2vpn evpn
evpn
  vni 10000 l2
    rd auto
    route-target import auto
    route-target export auto

DC Core switch (RouteServer) Configuration
RouteServer-1# sh run

!Command: show running-config
!No configuration change since last restart
!Time: Wed Aug  7 09:38:18 2019

version 9.2(3) Bios:version
hostname RouteServer-1
vdc RouteServer-1 id 1
  limit-resource vlan minimum 16 maximum 4094
  limit-resource vrf minimum 2 maximum 4096
  limit-resource port-channel minimum 0 maximum 511
  limit-resource u4route-mem minimum 128 maximum 128
  limit-resource u6route-mem minimum 96 maximum 96
  limit-resource m4route-mem minimum 58 maximum 58
  limit-resource m6route-mem minimum 8 maximum 8

nv overlay evpn
feature bgp
feature nv overlay

username admin password 5 $5$SAAwN66P$OSzsu5lztjirsP.UM0bkhSXhjkAqAnymcN0jNUwNc38  role network-admin
ip domain-lookup
copp profile strict
snmp-server user admin network-admin auth md5 0x842c130e837d0182abbfc3c8010e25f1 priv 0x842c130e837d0182abbfc3c8010e25f1 localizedkey
rmon event 1 description FATAL(1) owner PMON@FATAL
rmon event 2 description CRITICAL(2) owner PMON@CRITICAL
rmon event 3 description ERROR(3) owner PMON@ERROR
rmon event 4 description WARNING(4) owner PMON@WARNING
rmon event 5 description INFORMATION(5) owner PMON@INFO

vlan 1

route-map REDIST-TO-SITE-EXT-DCI permit 10
  match tag 1234
route-map RETAIN-NEXT-HOP permit 10
  set ip next-hop unchanged
vrf context abba
  address-family ipv4 unicast
    route-target import 65088:1
    route-target export 65088:1
    route-target both auto
vrf context beef
  address-family ipv4 unicast
    route-target import 65088:2
    route-target export 65088:2
vrf context management
hardware access-list tcam region racl 512
hardware access-list tcam region vpc-convergence 256
hardware access-list tcam region arp-ether 256 double-wide


interface Ethernet1/1
  description ** to BGW-1 **
  no switchport
  ip address 10.1.88.88/24
  no shutdown

interface Ethernet1/2
  description ** to BGW-2 **
  no switchport
  ip address 10.2.88.88/24
  no shutdown

interface Ethernet1/3
  description ** to BGW-3 **
  no switchport
  ip address 10.3.88.88/24
  no shutdown

interface Ethernet1/4
  description ** to BGW-4 **
  no switchport
  ip address 10.4.88.88/24
  no shutdown

interface mgmt0
  vrf member management

interface loopback77
  ip address 192.168.77.88/32 tag 1234

interface loopback88
  ip address 192.168.88.88/32 tag 1234
line console
line vty
boot nxos bootflash:/nxos.9.2.3.bin
router bgp 65088
  router-id 192.168.77.88
  address-family ipv4 unicast
    redistribute direct route-map REDIST-TO-SITE-EXT-DCI
    maximum-paths 2
  address-family l2vpn evpn
    maximum-paths 2
    maximum-paths ibgp 2
    retain route-target all
  template peer MULTI-SITE-OVERLAY-PEERING
    update-source loopback77
    ebgp-multihop 5
    address-family l2vpn evpn
      send-community
      send-community extended
      route-map RETAIN-NEXT-HOP out
  neighbor 10.1.88.1
    remote-as 65012
    address-family ipv4 unicast
  neighbor 10.2.88.2
    remote-as 65012
    address-family ipv4 unicast
  neighbor 10.3.88.3
    remote-as 65034
    address-family ipv4 unicast
  neighbor 10.4.88.4
    remote-as 65034
    address-family ipv4 unicast
  neighbor 192.168.77.1
    inherit peer MULTI-SITE-OVERLAY-PEERING
    remote-as 65012
    address-family l2vpn evpn
      rewrite-evpn-rt-asn
  neighbor 192.168.77.2
    inherit peer MULTI-SITE-OVERLAY-PEERING
    remote-as 65012
    address-family l2vpn evpn
      rewrite-evpn-rt-asn
  neighbor 192.168.77.3
    inherit peer MULTI-SITE-OVERLAY-PEERING
    remote-as 65034
    address-family l2vpn evpn
      rewrite-evpn-rt-asn
  neighbor 192.168.77.4
    inherit peer MULTI-SITE-OVERLAY-PEERING
    remote-as 65034
    address-family l2vpn evpn
      rewrite-evpn-rt-asn



23 comments:

  1. Hi Toni,
    welcome back!

    Michael

  2. Toni,
    these days I have been thinking about some questions, and I hope you can help answer them.
    1. Apart from Layer 2 extension and real physical redundancy, VXLAN actually helps to get rid of the L3 routing process, so I believe it makes VXLAN faster than a traditional three-level network. But how much faster can it be compared to traditional routing? Or what kind of delay or traffic volume requirements push us to change a traditional network into VXLAN?
    2. I went through some reading and noticed that someone mentions that if a VXLAN network has an MTU of 1500 + VXLAN header, the actual throughput will be half of what a VLAN can achieve, regardless of the physical redundancy VXLAN has, while by increasing the MTU to 9000, VXLAN shows its power and can reach almost twice what the VLAN can achieve. In your experience, have you noticed this?
    3. VXLAN is really CPU consuming. I notice you are using N7K, so what is the CPU consumption of each VTEP tunnel on N7K, and how many can it support? Let us leave EVPN out and consider only the values for pure VXLAN.

    Really glad to see you again in the topic.

    Yours Sincerely
    Michael

    Replies
    1. Hi Michael,

      I think that the driver for moving from traditional network solution (Spanning-Tree as an L2 Control Plane protocol) to VXLAN solution is more related to flexibility and reliability offered by VXLAN than speed. Here are just a few reasons why I like VXLAN:

      A) No need for virtualized devices with common Control-/Data Plane for gateway function.
      B) Multi-destination traffic over IP-Only Underlay
      C) VLAN available where needed (by stitching not stretching)
      D) Could be also the future solution for SD-WAN???

      But as a tradeoff, from the Control Plane perspective, BGP EVPN VXLAN solution is more complex than STP-based solution (my personal opinion).

      Unfortunately, I do not have any figures concerning forwarding rate or CPU usage. However, after Control Plane is converged (Underlay/Overlay/NVE peering...) and Forwarding tables are up to date, then the actual data forwarding is easy, just forwarding table lookup and encapsulation.

      In this post, I am using NX-OSv 9.2(3).

      And by the way, it is really nice to see that you are still reading my posts :-)

      Cheers - Toni

  3. Another great post, thanks. (yes, still reading it :))

    I wonder how many "super-spine" architectures are actually implemented in production networks. It has a lot of boxes :)

    The Multi-site BGW function can run on the spine switches effectively collapsing the two layers. Also, N7k, N3600-R and N9500-R support MPLS hand-off, a neat way to marry VxLAN/EVPN with MPLS/VPNv4 in a single box. This way the Multi-Site architecture can have only three layers - Leafs, Spines(with BGW) and BorderPEs.

  4. Hi Toni, thanks for your excellent and in-depth posts on this blog.

    I labbed this with Nexus 9000v switches, in my lab I replaced the core switch/route server with a small MPLS L3VPN and direct EBGP peerings between BGWs for the EVPN.

    An issue I see, is that even though the BGW1 and BGW2 shared VIP is seen as the next hop in BGW3's "show l2route mac all" output for MACs on the BGW1/2 site, the traffic is replicated and duplicate packets sent to the BGW1 and BGW2 PIPs. It's being forwarded as though it was BUM traffic, and BGW3 is forwarding it as an unknown unicast. This is also resulting in duplicate packets arriving at the devices connected to the BGW1/2 leaf switches. This seems to be the same for both 9.3.1 and 9.2.3 images.

    This only impacts intra-VLAN traffic, inter-VLAN traffic sent between sites is forwarded towards the VIP as expected.

    I'm wondering if you, or any of your readers have seen similar behaviour when capturing on the DCI interface of BGW3? I'm trying to figure out if this is misconfiguration on my part or a limitation of the virtual switches.

    Replies
    1. I am also doing a similar lab: basically no RS, just 4 back-to-back BGWs in a square configuration. I am seeing the same problem, except that in my case packets are blocked at the BGW when it is not the DF for that VLAN, so ping has only a 25% success rate. On the DCI link of the DF BGW, which didn't block the unicast traffic, I can see that the source and destination IPs are PIPs instead of VIPs.

    2. As I answered Matt, this might be related to NX-OSv. If I remember correctly, the preferred Back-to-Back design also requires a cross connection between BGWs.

    3. I tested with 9.3.3 images and saw the same issue. From a packet capture I found that unicast traffic is sent to both BGW1 and BGW2 because they use the same VIP. The underlay uses OSPF equal-cost paths, so it sends one packet to BGW1 and the next one to BGW2. Because BGW1 is primary, BGW2 discards the packet, which is why some packets get through and some do not.
      The workaround I used is to set the OSPF cost of the VIP loopback on BGW2 to 2 with "ip ospf cost 2", so the underlay always chooses BGW1 as the primary path.

    4. I found that changing the OSPF cost is not the solution. This is a bug in the Nexus 9000v simulator: because the Nexus 9000v does not have an ASIC chipset, the DF rules and the split-horizon rule do not take effect. Refer to the link below:
      https://www.reddit.com/r/networking/comments/cruk37/nexus_9000v_vxlan_evpn_multisite_duplicate_looped/

    5. Thanks Andy for sharing the link.

  5. Hi Matt, The problem might be related to NX-OSv. Have you tested it with physical devices?

  6. Hi Toni,
    Just came across your blog recently. A lot of VXLAN-EVPN concepts became clearer after going through it.
    Do you have any plans to do a write-up on TRM?

    Replies
    1. I am happy that you found this blog informative. I am working on a TRM document and it is 95% ready. It should be out within a couple of days. I usually announce new posts on LinkedIn, so if we are not already connected, just send me an invitation.

    2. I have added a short description of TRM to this blog. The whole chapter (36 pages) is included in the VXLAN book available via Leanpub.com and soon also via Amazon (eBook and hard copy).

    3. Thanks a lot Toni.
      I was expecting to get a notification of your reply. Just got back here to see you had replied a while back.
      I've sent you a LinkedIn request and just got a copy of your book as well.

  10. Hi Guys,

    Can I use the distributed anycast feature (SVI, port mode access/trunk) on an Anycast BGW switch?

    Thank you,
    Daniel Lima

  11. I really learned a lot from your posts, thanks Toni. Got a copy of your book to support your great work.
    I'm still new to the Multi-Site configuration. After having tested your scenario with N9Kv (and it worked great), I tried to lab another scenario where I have a simple fabric: Leaf --- Spine --- BL --- ext router, and I want to configure BGW on the BL. I noticed that as soon as I configure "evpn multisite border-gateway ___" on the BL, I lose connectivity to the external networks (I can no longer ping from the Leaf to the ext router). I don't know whether BGW and BL technically cannot be configured on the same node, whether I missed a specific configuration to make it work, or whether it's simply an N9Kv bug. I don't have any real switches to test with and I probably should learn more, but if you or any readers have already tested this scenario and can share some experience, it would be great. Thanks a lot.

  12. Hi Toni,
    How is the RD generated for Route-Type 4, especially the value after the colon?
    For example 192.168.77.1:27001: where does 27001 come from?

