Thursday 11 April 2019

VXLAN Underlay Routing - Part III: Internal BGP

Now you can also download my VXLAN book from Leanpub.com:
"Virtual Extensible LAN VXLAN - A Practical Guide to VXLAN Solution, Part 1" (373 pages).

BGP as an Underlay Network Routing Protocol


Using BGP instead of OSPF or IS-IS for Underlay Network routing in a BGP EVPN VXLAN fabric simplifies the Control Plane operation because there is only one routing protocol running on the fabric switches. However, there are some tradeoffs too. A BGP-only solution requires at least two BGP Address-Families (afi) per switch: one for the Underlay (IPv4 Unicast) and one for the Overlay (L2VPN EVPN). In addition, if the Border Leaf switches are connected to an MPLS network, there is a third BGP afi for VPNv4. In some cases, multi-afi BGP makes troubleshooting a bit more complex compared to a single-afi solution where BGP is used only in the Overlay Network. The focus of this chapter is the VXLAN fabric Underlay Network with iBGP routing.
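
To give an idea of the multi-afi nature of a BGP-only fabric, the sketch below shows how the Underlay and Overlay address-families could coexist under a single BGP process on a Leaf switch. This is a simplified NX-OS-style sketch rather than a complete configuration; the AS number and peer addresses follow the example topology of this chapter.

router bgp 65000
  router-id 192.168.0.102
  ! Underlay: afi IPv4 Unicast peering over the physical link to Spine-11
  neighbor 10.102.11.11
    remote-as 65000
    address-family ipv4 unicast
  ! Overlay: afi L2VPN EVPN peering between Loopback 100 interfaces
  neighbor 192.168.100.11
    remote-as 65000
    update-source loopback100
    address-family l2vpn evpn
      send-community extended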


Figure 1-1: High-Level operation of VXLAN Fabric


Figure 1-1 illustrates the high-level Control and Data Plane operation of a VXLAN fabric. First, there is an address-family (afi) IPv4 Unicast BGP peering between the physical interfaces. The switches exchange BGP Update packets, which carry IPv4 Network Layer Reachability Information (IPv4 NLRI) about their directly connected Loopback interfaces 100 (Overlay Control Plane - BGP) and 50 (Overlay Data Plane - VXLAN). This is part of the Underlay Network operation. Second, all switches have an afi L2VPN EVPN peering between their Loopback 100 interfaces. Leaf switches advertise their local hosts' MAC/IP addresses together with both the L2 and L3 Virtual Network Identifiers (VNI) over this peering. Third, data between hosts that belong to the same Layer-2 segment but are connected to different Leaf switches is sent over the interface NVE1 (Network Virtualization Edge) with VXLAN encapsulation. The IP address of the NVE1 interface is taken from the interface Loopback 50, and it is used for VXLAN tunneling between VTEP switches.
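
As a point of reference, the interface NVE1 could be tied to Loopback 50 roughly as shown in the following NX-OS-style sketch. The Multicast Group 238.0.0.10 appears later in this chapter; the VNI value 10000 is an illustrative assumption.

feature nv overlay
!
interface loopback50
  ip address 192.168.50.102/32
!
interface nve1
  no shutdown
  ! the source-interface defines the VTEP address used in the outer IP header
  source-interface loopback50
  ! BGP EVPN acts as the Overlay Control Plane
  host-reachability protocol bgp
  member vni 10000
    ! replication group for BUM traffic
    mcast-group 238.0.0.10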

Route Propagation

Figure 1-2 describes the BGP processes from the Underlay Network perspective focusing on the propagation of Loopback 100 and 50 interfaces connected to Leaf-102.

Figure 1-2: Underlay BGP process.

Step-1: Leaf-102 injects the IP addresses of Loopback 100 (192.168.100.102) and Loopback 50 (192.168.50.102) into the BGP table called Loc-RIB. This can be done, for example, by redistributing connected networks through a route-map or by using the network clause under the BGP process.
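
Both options could look roughly like the following sketch on Leaf-102 (the route-map and prefix-list names are made up for illustration):

! Option 1: network clauses under afi IPv4 Unicast
router bgp 65000
  address-family ipv4 unicast
    network 192.168.100.102/32
    network 192.168.50.102/32
!
! Option 2: redistributing connected routes through a route-map
ip prefix-list LOOPBACKS seq 5 permit 192.168.100.102/32
ip prefix-list LOOPBACKS seq 10 permit 192.168.50.102/32
route-map REDIST-LOOPBACKS permit 10
  match ip address prefix-list LOOPBACKS
router bgp 65000
  address-family ipv4 unicast
    redistribute direct route-map REDIST-LOOPBACKS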

Leaf-102# sh ip bgp | be Network
   Network            Next Hop   Metric LocPrf   Weight Path
*>i192.168.50.101/32  10.102.11.11       100     0       i
*>l192.168.50.102/32  0.0.0.0            100 32768       i
*>i192.168.100.11/32  10.102.11.11       100     0       i
*>i192.168.100.101/32 10.102.11.11       100     0       i
*>l192.168.100.102/32 0.0.0.0            100 32768       i
*>i192.168.238.0/29   10.102.11.11       100     0       i
Table 1-1: Loc-RIB (IPv4 Unicast) table on Leaf-102

Step-2: Leaf-102 installs the routes from the Loc-RIB into the Adj-RIB-Out (Pre-Policy) table. Note that there are dedicated Pre-Policy and Post-Policy Adj-RIB-Out tables towards each BGP peer. The Adj-RIB-Out (Pre-Policy) table is a peer-type-specific table (iBGP, eBGP, RR-Client, Confederation, and Route-Server) and includes peer-type-specific routes. The IPv4 Unicast peering between Leaf-102 and Spine-11 is internal BGP (iBGP), so routes learned from other iBGP peers are not eligible for this Adj-RIB-Out (Pre-Policy) table. Leaf-102 attaches the iBGP-specific Path Attributes to each installed route, excluding the Next-Hop Path Attribute, which is not set in this phase.

Step-3: Leaf-102 sends the routes from the Adj-RIB-Out (Pre-Policy) table through the Policy Engine into the Adj-RIB-Out (Post-Policy) table. In this example, Leaf-102 does not filter any routes or change the attached BGP Path Attributes; it just adds the Next-Hop Path Attribute, introducing itself as the Next-Hop. Leaf-102 then sends the BGP Update to Spine-11 over the IPv4 Unicast peering.

Leaf-102# show ip bgp neighbors 10.102.11.11 advertised-routes
   Network            Next Hop   Metric LocPrf   Weight Path
*>l192.168.50.102/32  0.0.0.0          100     32768    i
*>l192.168.100.102/32 0.0.0.0          100     32768    i
Table 1-2: Adj-RIB-Out (Post-Policy) table on Leaf-102

Step-4: Spine-11 receives the BGP Update message from Leaf-102 and installs the routes into the peer-specific Adj-RIB-In (Pre-Policy) table without any modification.


Step-5: Spine-11 installs the routes from the Adj-RIB-In (Pre-Policy) table through the Policy Engine into the Adj-RIB-In (Post-Policy) table without filtering prefixes or changing the Path Attributes carried in the received BGP Update.

Step-6: All Adj-RIB tables (In/Out, Pre/Post) are peer-specific, and a switch might have several BGP peers. Only the best route is installed into the Loc-RIB based on the BGP Path Selection procedure. Spine-11 installs the routes received from Leaf-102 into the Loc-RIB since there is no better path available via the other BGP peer, Leaf-101. In addition, it increases the BGP table version by one.

Spine-11# sh ip bgp 192.168.100.102
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 192.168.100.102/32, version 14
<snipped>
  Advertised path-id 1
  Path type: internal, path is valid, is best path, in rib
  AS-Path: NONE, path sourced internal to AS
    10.102.11.102 (metric 0) from 10.102.11.102 (192.168.0.102)
      Origin IGP, MED not set, localpref 100, weight 0

  Path-id 1 advertised to peers:
    10.101.11.101 
Example 1-2: BGP Loc-RIB in Spine-11.

Step-7: Because there is no better route source than iBGP and the routes are valid (reachable Next-Hop), Spine-11 also installs the routes from the Loc-RIB into the RIB.


Step-8: Leaf-101 and Leaf-102 are Route-Reflector clients of Spine-11. This way Spine-11 is able to forward BGP Update messages received from the iBGP peer Leaf-102 to the other iBGP peer Leaf-101, and the other way around. As a first step, Spine-11 installs the routes into the Adj-RIB-Out (Pre-Policy) table specific to the Leaf-101 IPv4 Unicast peering.

Step-9: Spine-11 installs the routes from the Adj-RIB-Out (Pre-Policy) table into the Adj-RIB-Out (Post-Policy) table through the BGP Policy Engine. Because there is no policy defined in the BGP Policy Engine, the routes are installed without modification (Example 1-3).

Spine-11# sh ip bgp neigh 10.101.11.101 advertised-routes

Peer 10.101.11.101 routes for address family IPv4 Unicast:
BGP table version is 10, Local Router ID is 192.168.0.11
<snipped>

   Network           Next Hop        Metric   LocPrf  Weight Path
*>i192.168.50.102/32  10.102.11.102          100      0      i
*>l192.168.100.11/32  0.0.0.0                100      32768  i
*>i192.168.100.102/32 10.102.11.102          100      0      i
*>l192.168.238.0/29   0.0.0.0                100      32768  i
Example 1-3: Adj-RIB-Out (Post-Policy) table on Spine-11.

When Spine-11 sends the routes to Leaf-101, the next-hop-self command under the afi IPv4 Unicast section of the Leaf-101 neighbor configuration changes the Next-Hop Path Attribute to 10.101.11.11 (interface E1/1 of Spine-11). Example 1-4 illustrates the relevant part of the BGP configuration on Spine-11.

Note: there is a dedicated section concerning next-hop-self after this one.


  neighbor 10.101.11.101
    remote-as 65000
    description ** BGP Underlay to Leaf-101 **
    address-family ipv4 unicast
      route-reflector-client
      next-hop-self
Example 1-4: Spine-11 BGP configuration.

Note! This solution breaks the recommendation of RFC 4456 (Section 10): "In addition, when a RR reflects a route, it SHOULD NOT modify the following path attributes: NEXT_HOP, AS_PATH, LOCAL_PREF, and MED. Their modification could potentially result in routing loops."

In addition to the Next-Hop modification, Spine-11 adds the BGP Path Attributes ORIGINATOR_ID (value: 192.168.0.102) and CLUSTER_LIST (value: 192.168.0.11) to the BGP Update message (Capture 1-1).

Ethernet II,
Src: c0:8e:00:11:1e:11, Dst: 1e:af:01:01:1e:11

Internet Protocol Version 4,
Src: 10.101.11.11, Dst: 10.101.11.101

Transmission Control Protocol,
Src Port: 26601, Dst Port: 179, Seq: 58, Ack: 39, Len: 69

Border Gateway Protocol - UPDATE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 69
    Type: UPDATE Message (2)
    Withdrawn Routes Length: 0
    Total Path Attribute Length: 46
    Path attributes
        Path Attribute - ORIGIN: IGP
        Path Attribute - AS_PATH: empty
        Path Attribute - LOCAL_PREF: 100
        Path Attribute - ORIGINATOR_ID: 192.168.0.102
        Path Attribute - CLUSTER_LIST: 192.168.0.11
        Path Attribute - MP_REACH_NLRI
         Address family identifier: IPv4 (1)
         Subsequent Address family identifier: Unicast (1)
         Next hop network address
           Next Hop: 10.101.11.11
         Network layer reachability information
           192.168.100.102/32
Capture 1-1: BGP Update sent to Leaf-101 by Spine-11.

Steps 10-13: Leaf-101 receives the BGP Update from Spine-11. The import process follows the same principles that were described in Steps 4-7: the route is installed into the Adj-RIB-In (Pre-Policy) table and from there through the Policy Engine into the Adj-RIB-In (Post-Policy) table. After the Best Path decision process, the route is installed via the Loc-RIB into the RIB.

Next-Hop-Self Considerations

As a recap, Figure 1-3 illustrates the three different peerings in a BGP EVPN VXLAN fabric. The BGP IPv4 Unicast peering between the switches is used in the Underlay Network for advertising the Loopback addresses, which in turn are used in the Overlay Network. Loopback 100 is used for the BGP L2VPN EVPN peering, over which host-related information such as IP/MAC addresses and the VNIs used in Data Plane VXLAN encapsulation is advertised. The Network Virtualization Edge (NVE) interfaces use the Loopback 50 address for NVE tunnel peering, and also as the source IP address in the outer IP header of all VXLAN encapsulated packets. VXLAN encapsulated Ethernet frames are sent over the VXLAN tunnel between NVE peers if the destination host MAC address is known. If the destination MAC address is not known, the traffic belongs to the BUM category (Broadcast, Unknown Unicast, and Multicast). In case of BUM traffic, the destination IP address is the Multicast Group address attached to the Virtual Network for BUM forwarding.
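
As a reminder of where the advertised host information comes from, a VLAN could be mapped to an L2 VNI and exported into EVPN roughly as in the sketch below. VLAN 10 and VNI 10000 are illustrative assumptions, and auto-derived RD/Route-Target values are used for brevity.

feature vn-segment-vlan-based
!
vlan 10
  ! maps VLAN 10 to L2 VNI 10000
  vn-segment 10000
!
evpn
  vni 10000 l2
    rd auto
    route-target import auto
    route-target export auto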


Figure 1-3: Underlay/Overlay BGP peering and NVE peering.

Case-1: Next-Hop is changed by RR Spine-11 (next-hop-self).

In this case, Spine-11 changes the Next-Hop address when it forwards the BGP Update message received from Leaf-102 to Leaf-101. Example 1-5 illustrates the BGP Loc-RIB of Leaf-101. It shows that the Leaf-102 Loopback address 192.168.50.102 is reachable via 10.101.11.11 (Spine-11).

Leaf-101# sh ip bgp | i Net|192.168.50.102
   Network            Next Hop       Metric LocPrf Weight Path
*>i192.168.50.102/32  10.101.11.11          100         0 i
Example 1-5: BGP Loc-RIB on Leaf-101.

Example 1-6 shows that the Next-Hop address is also installed into the RIB.

Leaf-101# sh ip route 10.101.11.11 | b 10.1
10.101.11.11/32, ubest/mbest: 1/0, attached
    *via 10.101.11.11, Eth1/1, [250/0], 00:35:30, am
Example 1-6: Leaf-101 RIB.

Example 1-7 shows that the state of the connection between NVE1 interfaces of Leaf-101 and Leaf-102 is Up.

Leaf-101# sh nve peers
Interface Peer-IP     State LearnType Uptime   Router-Mac      
---- ---------------  ----- ----      -------- --------------
nve1 192.168.50.102   Up    CP        00:28:39 5e00.0002.0007  
Example 1-7: NVE peers on Leaf-101.

Example 1-8 shows that host Café, connected to Leaf-101, is able to ping host Abba, which is connected to Leaf-102.

Cafe#ping 172.16.10.102
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 172.16.10.102, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 19/28/36 ms
Example 1-8: Ping from 172.16.10.101 (Café) to 172.16.10.102 (Abba).

Case-2: Next-Hop is not changed by RR Spine-11 (no next-hop-self).

Now Spine-11 does not change the Next-Hop when it forwards the BGP Update received from Leaf-102 to Leaf-101. Example 1-9 shows that Leaf-101 receives the route, but since the Next-Hop is not reachable, the route is not valid (no * in front of the entry).

Leaf-101# sh ip bgp | i Net|192.168.50.102
   Network             Next Hop      Metric LocPrf  Weight Path
  i192.168.50.102/32  10.102.11.102         100          0 i
Example 1-9: BGP Loc-RIB on Leaf-101 (no next-hop-self on Spine-11).

Example 1-10 shows that there is no routing information in the RIB of Leaf-101 concerning the IP address 192.168.50.102 or its Next-Hop address 10.102.11.102 (interface E1/1 on Leaf-102).

Leaf-101# sh ip route 192.168.50.102
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

Route not found

Leaf-101# sh ip route 10.102.11.102
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

Route not found
Leaf-101#
Example 1-10: RIB of Leaf-101.

Also, the state of the NVE peering is now Down, as can be seen in Example 1-11.

Leaf-101# sh nve peers
Interface Peer-IP         State LearnType Uptime   Router-Mac      
--------- --------------  ----- --------- -------- -----------
nve1      192.168.50.102   Down  CP        0.000000 n/a             
Example 1-11: NVE peers on Leaf-101.

However, ping still works between Café and Abba (Example 1-12).

Cafe#ping 172.16.10.102
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 172.16.10.102, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 26/36/54 ms
Example 1-12: Ping from 172.16.10.101 (Café) to 172.16.10.102 (Abba).

The reason for this can be found by comparing captures taken from interface E1/1 of Leaf-101 while pinging from Café to Abba.

Capture 1-2, taken from interface E1/1 of Leaf-101, shows the encapsulated ICMP Request packet sent by host Café (172.16.10.101) to host Abba (172.16.10.102) in the solution where Spine-11 changes the Next-Hop address. The destination MAC address in the outer header belongs to interface E1/1 of Spine-11. The outer destination IP address is the IP address of interface NVE1 of Leaf-102 (Figure 1-4). When Spine-11 receives this packet, it routes the packet based on the outer IP address.

Ethernet II,
Src: 1e:af:01:01:1e:11 ,Dst: c0:8e:00:11:1e:11
Internet Protocol Version 4,
Src: 192.168.50.101, Dst: 192.168.50.102
User Datagram Protocol, Src Port: 57251, Dst Port: 4789
Virtual eXtensible Local Area Network
Ethernet II, Src: 10:00:00:10:ca:fe, Dst: 10:00:00:10:ab:ba
Internet Protocol Version 4,
Src: 172.16.10.101, Dst: 172.16.10.102
Internet Control Message Protocol
Capture 1-2: Capture when Next-Hop-Self is used in Spine-11

Figure 1-4: Spine-11 changes the Next-Hop from 10.102.11.102 to 10.101.11.11.

Capture 1-3, taken from interface E1/1 of Leaf-101, shows the encapsulated ICMP Request packet sent by host Café (172.16.10.101) to host Abba (172.16.10.102) when Spine-11 does NOT change the Next-Hop address. Example 1-11 showed that when the Next-Hop is not reachable, the NVE peering state is Down. Note that the BGP EVPN peerings between Leaf-101 and Spine-11, as well as between Leaf-102 and Spine-11, are still up, and Leaf-101 receives the BGP Update originated by Leaf-102, but now with an unreachable Next-Hop address. In this case, the traffic falls into the Unknown Unicast category and is forwarded towards the Multicast RP (Spine-11) of the Multicast Group 238.0.0.10.

Ethernet II,
Src: 1e:af:01:01:1e:11 , Dst: IPv4mcast_0a (01:00:5e:00:00:0a)
Internet Protocol Version 4,
Src: 192.168.50.101, Dst: 238.0.0.10
User Datagram Protocol, Src Port: 57528, Dst Port: 4789
Virtual eXtensible Local Area Network
Ethernet II,
Src: 10:00:00:10:ca:fe, Dst: 10:00:00:10:ab:ba
Internet Protocol Version 4,
Src: 172.16.10.101, Dst: 172.16.10.102
Internet Control Message Protocol
Capture 1-3: Capture when Next-Hop-Self is not used in Spine-11

Figure 1-5: Spine-11 does not change the Next-Hop address 10.102.11.102.

Spine-11 forwards the packet based on its mroute table, shown in Example 1-13. Both Leaf-101 and Leaf-102 have joined the Multicast Group 238.0.0.10, and the interfaces towards them are listed in the OIL (Outgoing Interface List). This way the ICMP Request reaches Leaf-102, and this is how the Data Plane still works, though not as intended!

Spine-11# sh ip mroute
IP Multicast Routing Table for VRF "default"
<snipped>
(*, 238.0.0.10/32), bidir, uptime: 01:57:43, pim ip
  Incoming interface: loopback238, RPF nbr: 192.168.238.1
  Outgoing interface list: (count: 3)
    Ethernet1/2, uptime: 01:57:36, pim
    loopback238, uptime: 01:57:43, pim, (RPF)
    Ethernet1/1, uptime: 01:57:43, pim
Example 1-13: Mroute table on Spine-11.
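
For completeness, the bidir RP seen in Example 1-13 could be defined roughly as in the sketch below on the fabric switches. The RP address 192.168.238.1 matches the RPF neighbor above; the group range is an assumption.

feature pim
ip pim rp-address 192.168.238.1 group-list 238.0.0.0/24 bidir
!
interface Ethernet1/1
  ! enable PIM on the fabric links
  ip pim sparse-mode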

If the hardware of the Spine-11 switch does not support Next-Hop address modification by the BGP Route-Reflector, the Inter-Switch links used as the original Next-Hops can be advertised by BGP instead. This way the Next-Hop addresses are reachable, the state of the NVE peering remains Up, and the data is sent as known Unicast. Example 1-14 shows that the route to 192.168.50.102 is now valid and the Next-Hop address 10.102.11.102 is reachable when the Inter-Switch links are also advertised.

Leaf-101# sh ip bgp | i Net|192.168.50.102
   Network            Next Hop        Metric LocPrf Weight Path
*>i192.168.50.102/32  10.102.11.102         100          0 i

Leaf-101# sh ip route 10.102.11.102
<snipped>
10.102.11.0/24, ubest/mbest: 1/0
    *via 10.101.11.11, [200/0], 00:02:05, bgp-65000, internal, tag 65000
Example 1-14: BGP Loc-RIB and RIB on Leaf-101.
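
One way to advertise the Inter-Switch link, as seen in Example 1-14, would be a network clause on Leaf-102 (a sketch; the prefix follows the example addressing):

router bgp 65000
  address-family ipv4 unicast
    ! Inter-Switch link between Leaf-102 and Spine-11
    network 10.102.11.0/24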

References:

RFC 4456: BGP Route Reflection: An Alternative to Full Mesh Internal BGP (IBGP)
