Wednesday 15 July 2020

BGP EVPN Underlay Network with BGP (Multi-AS)


Introduction


This chapter explains the BGP Multi-AS Underlay Network design in a BGP EVPN/VXLAN fabric. It starts with the BGP configuration so that the later explanations can rely on show and debug commands as well as packet captures. The next section discusses the BGP adjacency process and its states (Idle, Connect/Active, OpenSent, OpenConfirm, and Established). After that, the chapter explains BGP routing: how connected routes are sent from the RIB to the Loc-RIB and from there to the Adj-RIB-Out (Pre/Post). This section also shows how NLRIs received in a BGP Update eventually end up in the RIB of the receiving BGP speaker. In addition, the chapter briefly introduces the MRAI timer and a non-disruptive device maintenance solution. The last section tries to answer which protocol best fits the Underlay Network of a BGP EVPN fabric.



Infrastructure AS Numbering and IP Addressing Scheme


The AS-numbering scheme used in this chapter is the same as in chapter 1, but instead of unnumbered interfaces, each inter-switch interface now has an IP address assigned to it. Unnumbered interfaces can also be used with BGP by using IPv6 Link-Local addressing [RFC 5549]. However, this solution is not supported by all vendors.


Figure 2-1: IP addressing Scheme.


BGP Configuration



Leaf Switches

Example 2-1 shows the BGP configuration of Leaf-101. It has BGP IPv4 peering with S-11 and S-12. BGP is allowed to install eight next-hops with an equal AS-Path length per destination into the BGP Loc-RIB table. In our example, two would fulfill the requirements, but with eight paths there is no need to change the value when additional Spine switches are added to the network. The loopback interface whose address is used in the VXLAN tunnel header is redistributed into BGP by using a route-map. It could also be advertised by using a network clause under the IPv4 address-family, but with a route-map there are fewer unique configuration parameters per VTEP switch, which simplifies automation.

To be able to install multiple paths to one destination, the paths must have (a) an equal AS-Path length and (b) the same AS numbers listed in the AS-Path. By default, this means that if the AS-Path length is the same but the AS numbers differ, the paths are not used to load-balance traffic to the destination. This can be relaxed with the command bestpath as-path multipath-relax under the BGP process. In our example this is not necessary, but I have used it because chapter 3 also demonstrates the design where each Spine has its own unique BGP AS number (which may not be the best design model; I will explain why later).

feature bgp
!
route-map VTEP-TO-BGP permit 10
  match interface loopback30
!
router bgp 65101
  bestpath as-path multipath-relax
  address-family ipv4 unicast
    redistribute direct route-map VTEP-TO-BGP
    maximum-paths 8
  neighbor 10.10.1.1
    remote-as 65100
    address-family ipv4 unicast
  neighbor 10.10.1.3
    remote-as 65100
    address-family ipv4 unicast
Example 2-1: Leaf-101 BGP configuration.
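
The AS-Path condition for multipath described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the NX-OS implementation: it models nothing but the AS-Path comparison, and the function name and the per-spine AS numbers 65011 and 65012 are hypothetical.

```python
from typing import List


def multipath_eligible(best: List[int], candidate: List[int], relax: bool = False) -> bool:
    """Return True if candidate may be installed next to best as an ECMP path.

    Only the AS-Path rule is modelled; other attributes are ignored.
    """
    if len(best) != len(candidate):
        return False           # the AS-Path length must always be equal
    if relax:
        return True            # multipath-relax: only the length is compared
    return best == candidate   # default: the AS numbers must match as well


# Hypothetical case where each spine has its own AS number.
via_first_spine = [65011, 65001, 65200, 65202]
via_second_spine = [65012, 65001, 65200, 65202]

print(multipath_eligible(via_first_spine, via_second_spine))              # False
print(multipath_eligible(via_first_spine, via_second_spine, relax=True))  # True
```

With a shared spine AS, as in this chapter, both paths carry the same AS-Path and pass the default check without the relax option.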


Spine Switches

Example 2-2 shows the BGP configuration of S-11 and S-12. Instead of statically configured BGP peerings with Leaf or Super-Spine switches, S-11 passively waits for BGP connections from BGP speakers whose source address belongs to the network 10.10.0.0/16 and whose BGP AS is listed in the route-map. This shortens the BGP configuration of the Spine switches because there is no need to build an individual BGP peering configuration for every Leaf switch within the pod or for the Super-Spine switches. The commands bestpath as-path multipath-relax and maximum-paths 8 are also used in the Spine and Super-Spine switches.

feature bgp
!
route-map Dynamic-BGP-AS-List permit 10
  match as-number 65001, 65101, 65102
!
router bgp 65100
  bestpath as-path multipath-relax
  address-family ipv4 unicast
    maximum-paths 8
  neighbor 10.10.0.0/16 remote-as route-map Dynamic-BGP-AS-List
    address-family ipv4 unicast
Example 2-2: Spine-11 BGP configuration.
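
The acceptance rule of the dynamic neighbor configuration can be expressed as a small Python check. It is a sketch of the logic only (source address inside the prefix and peer AS in the list); the function name is hypothetical, and the values are copied from Example 2-2.

```python
import ipaddress

LISTEN_RANGE = ipaddress.ip_network("10.10.0.0/16")   # neighbor prefix in Example 2-2
ALLOWED_AS = {65001, 65101, 65102}                    # route-map Dynamic-BGP-AS-List


def accept_dynamic_peer(source_ip: str, peer_as: int) -> bool:
    """Accept a peer only if both the source address and the AS number match."""
    in_range = ipaddress.ip_address(source_ip) in LISTEN_RANGE
    return in_range and peer_as in ALLOWED_AS


print(accept_dynamic_peer("10.10.1.0", 65101))      # True: L-101 link address and AS
print(accept_dynamic_peer("10.10.1.0", 65999))      # False: AS not in the list
print(accept_dynamic_peer("192.168.0.101", 65101))  # False: source outside the range
```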


Super-Spine Switches

Example 2-3 shows the BGP configuration of SS-1. It has BGP IPv4 peering with all Spine switches. It also has the same BGP ECMP-related commands that were already discussed.

feature bgp

router bgp 65001
  bestpath as-path multipath-relax
  address-family ipv4 unicast
    maximum-paths 8
  neighbor 10.10.10.0
    remote-as 65100
    address-family ipv4 unicast
  neighbor 10.10.10.4
    remote-as 65100
    address-family ipv4 unicast
  neighbor 10.10.20.0
    remote-as 65200
    address-family ipv4 unicast
  neighbor 10.10.20.4
    remote-as 65200
    address-family ipv4 unicast
Example 2-3: Super-Spine-1 BGP configuration.

BGP Neighbor Process


BGP Adjacency negotiation goes through the Idle, Connect/Active, OpenSent, OpenConfirm, and Established states. This section describes each of these states and the events that trigger changes from one state to another. Figure 2-2 illustrates the processes up to TCP-session establishment.

Idle
In the Idle state, the local BGP speaker waits for a Start event. It does not accept incoming BGP connections from the peer, nor does it allocate any BGP resources to the peer. In reaction to the Start event, the local BGP speaker starts initializing the TCP connection either actively or passively.

Connect
ManualStart (event 1): An administrator starts the BGP peer connection manually. After that, the local system initiates a TCP connection to port 179 of the peer (it sends a TCP SYN) and also listens for a TCP SYN from the peer. This event can be triggered, for example, by a basic BGP neighbor configuration under the BGP process. This is what happens in L-101.

AutomaticStart (event 3): Same as event 1, except that the event starts automatically. This could happen, for example, if the administrator of the remote BGP speaker clears the BGP peering with the local system.

In reaction to event 1 (ManualStart) or event 3 (AutomaticStart), the BGP FSM changes state from Idle to Connect, where the local system waits for the TCP connection to be completed.

Active
ManualStart_with_PassiveTcpEstablishment (event 4): BGP connection started manually by an administrator, but the local system waits for a TCP SYN packet to port 179 from the remote BGP speaker. This can be done e.g. by setting transport mode to passive (1) or by using BGP Dynamic Neighbor solution (2). S-11 uses the BGP Dynamic Neighbor solution.

(1)  neighbor 2001:DB8::915:9 transport connection-mode passive
(2)  bgp listen range 10.10.0.0/16 remote-as route-map Dynamic-BGP-AS-List

AutomaticStart_with_PassiveTcpEstablishment (event 5): The local system passively waits for the TCP connection from a remote peer as in event 4. The start event happens automatically as in event 3.

In reaction to either event 4 (ManualStart_with_PassiveTcpEstablishment) or event 5 (AutomaticStart_with_PassiveTcpEstablishment), the state changes from Idle to Active, where the local system passively waits for a TCP connection to be completed.

Finalizing negotiation of the TCP connection
Finalizing the TCP connection follows the same procedure from both the Connect and the Active state. In our example, the TCP three-way handshake is started by the active peer L-101. The four events related to the handshake are described below:
TcpConnection_Valid (Event 14): This event occurs when the local system receives a TCP Connection Request (TCP SYN) from the peer BGP speaker with a valid source address (configured as a neighbor) and destined to the address that the local system uses as a source in the BGP negotiation. In addition, the destination TCP port should be 179.
Tcp_CR_Invalid (Event 15): This event occurs when the local system receives a TCP Connection Request (TCP SYN) from the remote BGP speaker and the validity check does not pass.
Tcp_CR_Acked (Event 16): This event indicates that the local system has sent an ACK message to the remote peer as a confirmation of the SYN-ACK message sent by the remote peer.
TcpConnectionConfirmed (Event 17): This event indicates that the local system has received the final ACK message from the peer BGP speaker.



Figure 2-2: BGP Adjacency: Idle – Connect/Active (Establishing TCP Connection).

Capture 2-1 shows the TCP SYN message sent by L-101. The destination TCP port is 179, and the source port is the randomly generated number 20376. L-101 generates the sequence number 1716573727. This number increased by one (the relative Next sequence number) is the value that L-101 expects to see in the Acknowledgment number field of the SYN-ACK message from S-11.

Internet Protocol Version 4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control Protocol, Src Port: 20376, Dst Port: 179, Seq: 0, Len: 0
    Source Port: 20376
    Destination Port: 179
    [Stream index: 1]
    [TCP Segment Len: 0]
    Sequence number: 0    (relative sequence number)
    Sequence number (raw): 1716573727
    [Next sequence number: 1    (relative sequence number)]
    Acknowledgment number: 0
    Acknowledgment number (raw): 0
    1010 .... = Header Length: 40 bytes (10)
    Flags: 0x002 (SYN)
    Window size value: 29200
    [Calculated window size: 29200]
    Checksum: 0x3cca [unverified]
    <snipped for brevity>
Capture 2-1: BGP Adjacency – TCP SYN from L101 to S-11.

Capture 2-2 shows the SYN-ACK message sent by S-11. It uses TCP source port 179, while the destination port is set to 20376. S-11 sets its Acknowledgment number field to the Sequence number that L-101 used in its SYN message increased by one (1716573728).

Internet Protocol Version 4, Src: 10.10.1.1, Dst: 10.10.1.0
Transmission Control Protocol, Src Port: 179, Dst Port: 20376, Seq: 0, Ack: 1, Len: 0
    Source Port: 179
    Destination Port: 20376
    [Stream index: 1]
    [TCP Segment Len: 0]
    Sequence number: 0    (relative sequence number)
    Sequence number (raw): 1088502133
    [Next sequence number: 1    (relative sequence number)]
    Acknowledgment number: 1    (relative ack number)
    Acknowledgment number (raw): 1716573728
    1010 .... = Header Length: 40 bytes (10)
    Flags: 0x012 (SYN, ACK)
    Window size value: 28960
    [Calculated window size: 28960]
    Checksum: 0xd01c [unverified]
    <snipped for brevity>
Capture 2-2: BGP Adjacency – TCP SYN-ACK from S-11 to L-101.

When L-101 receives the SYN-ACK from S-11, it finalizes the connection by sending an ACK message back to S-11 (Capture 2-3). As in the previous step, the Acknowledgment number is the Sequence number received in the SYN-ACK increased by one.

Internet Protocol Version 4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control Protocol, Src Port: 20376, Dst Port: 179, Seq: 1, Ack: 1, Len: 0
    Source Port: 20376
    Destination Port: 179
    [Stream index: 1]
    [TCP Segment Len: 0]
    Sequence number: 1    (relative sequence number)
    Sequence number (raw): 1716573728
    [Next sequence number: 1    (relative sequence number)]
    Acknowledgment number: 1    (relative ack number)
    Acknowledgment number (raw): 1088502134
    1000 .... = Header Length: 32 bytes (8)
    Flags: 0x010 (ACK)
    Window size value: 7300
    [Calculated window size: 29200]
    [Window size scaling factor: 4]
    Checksum: 0x537e [unverified]
        <snipped for brevity>
Capture 2-3: BGP Adjacency – TCP ACK from L-101 to S-11.
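
The sequence and acknowledgment numbers in Captures 2-1 to 2-3 can be checked with a little arithmetic. The Python sketch below uses the raw values from the captures; the helper name expected_ack is hypothetical.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def expected_ack(seq: int, payload_len: int = 0, syn: bool = False) -> int:
    """Acknowledgment number the receiver should send back.

    A SYN flag consumes one sequence number even though it carries no data.
    """
    return (seq + payload_len + (1 if syn else 0)) % MOD


# Raw values copied from Captures 2-1, 2-2 and 2-3.
l101_syn_seq = 1716573727
s11_synack_seq = 1088502133
s11_synack_ack = 1716573728
l101_ack_ack = 1088502134

print(expected_ack(l101_syn_seq, syn=True) == s11_synack_ack)   # True
print(expected_ack(s11_synack_seq, syn=True) == l101_ack_ack)   # True

# Capture 2-3: window size value 7300 with a scaling factor of 4.
print(7300 * 4)  # 29200, the calculated window size
```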



OpenSent and OpenConfirm
After successful TCP negotiation, the BGP speakers exchange BGP OPEN messages (Event 19 - BGPOpen) to ensure that they use the same version of BGP, that their BGP RIDs don't overlap, that the peer's BGP AS number is the one configured for it on this BGP speaker, and that they support a compatible set of capabilities. They also compare their HoldTime values (HoldTime is also used when calculating KeepAliveTime) and choose the smaller one if the values differ. If the checks pass, the state changes from OpenSent to OpenConfirm, and the local system sends a KEEPALIVE message to the remote peer. After receiving a KEEPALIVE message from the remote peer (event 26, KeepAliveMsg), the local system finalizes the BGP connection and changes the BGP state from OpenConfirm to Established.
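
The timer comparison can be sketched as follows. The source only states that the smaller HoldTime wins and that KeepAliveTime is derived from it; the one-third ratio and the minimum of three seconds used here follow the suggestions in RFC 4271 and are assumptions about the implementation, as is the function name.

```python
from typing import Tuple


def negotiate_timers(local_hold: int, peer_hold: int) -> Tuple[int, int]:
    """Return (hold_time, keepalive_time) in seconds for a BGP session."""
    hold = min(local_hold, peer_hold)      # the smaller proposal is used
    if hold != 0 and hold < 3:
        raise ValueError("Hold Time must be zero or at least three seconds")
    keepalive = hold // 3                  # assumption: one third of Hold Time
    return hold, keepalive


# Both OPEN messages in Captures 2-4 and 2-5 propose Hold Time 180.
print(negotiate_timers(180, 180))  # (180, 60)
print(negotiate_timers(180, 90))   # (90, 30)
```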

Established
In this state, the BGP peering is up and running, and the BGP neighbors can send and receive UPDATE, KEEPALIVE, and NOTIFICATION messages. All received UPDATE messages are validated. One example of the validation is the next-hop check. The next-hop address must not be an address of the receiving BGP speaker itself. If the peers are directly connected eBGP neighbors, the next-hop address has to be either the sender's IP address that is used in the BGP neighbor negotiation, or it has to belong to the same network segment as the receiver's IP address (e.g. in the case of BGP redirect). The last rule is relaxed in the Overlay Network of a BGP EVPN fabric, where spine switches are configured to retain the original next-hop of the L2VPN EVPN NLRIs advertised within BGP Updates. The reason is that the next-hop is used as the destination IP address in the VXLAN tunnel header, which the spine switches use when they route packets between VTEPs.


Figure 2-3: BGP Adjacency – OpenSent, OpenConfirm, and Established.
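
The state changes described in this section can be summarized as a small transition table. The Python sketch below covers only the events named in the text (1, 3, 4, 5, 16, 17, 19, and 26); error handling, timers, and all other events of the BGP FSM are left out, and unmodelled events simply leave the state unchanged, which is a simplification.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPEN_SENT = "OpenSent"
    OPEN_CONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"


# (current state, event number) -> next state
TRANSITIONS = {
    (State.IDLE, 1): State.CONNECT,      # ManualStart
    (State.IDLE, 3): State.CONNECT,      # AutomaticStart
    (State.IDLE, 4): State.ACTIVE,       # ManualStart_with_PassiveTcpEstablishment
    (State.IDLE, 5): State.ACTIVE,       # AutomaticStart_with_PassiveTcpEstablishment
    (State.CONNECT, 16): State.OPEN_SENT,    # Tcp_CR_Acked
    (State.CONNECT, 17): State.OPEN_SENT,    # TcpConnectionConfirmed
    (State.ACTIVE, 16): State.OPEN_SENT,
    (State.ACTIVE, 17): State.OPEN_SENT,
    (State.OPEN_SENT, 19): State.OPEN_CONFIRM,      # BGPOpen
    (State.OPEN_CONFIRM, 26): State.ESTABLISHED,    # KeepAliveMsg
}


def run(events, state=State.IDLE):
    """Feed a sequence of event numbers into the simplified FSM."""
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state


print(run([1, 16, 19, 26]).value)  # active side such as L-101: Established
print(run([4, 17, 19, 26]).value)  # passive side such as S-11: Established
```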


Captures 2-4 to 2-9 show the BGP OPEN, KEEPALIVE, and UPDATE messages exchanged between L-101 and S-11.

Internet Protocol Version 4, Src: 10.10.1.1, Dst: 10.10.1.0
Transmission Control Protocol, Src Port: 179, Dst Port: 20376, Seq: 1, Ack: 1, Len: 70
<snipped>
Border Gateway Protocol - OPEN Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 70
    Type: OPEN Message (1)
    Version: 4
    My AS: 65100
    Hold Time: 180
    BGP Identifier: 192.168.0.11
    Optional Parameters Length: 41
    Optional Parameters
Capture 2-4: BGP Adjacency – BGP Open Message sent by S-11.



Internet Protocol Version 4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control Protocol, Src Port: 20376, Dst Port: 179, Seq:1, Ack:71, Len: 70
<snipped>
Border Gateway Protocol - OPEN Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 70
    Type: OPEN Message (1)
    Version: 4
    My AS: 65101
    Hold Time: 180
    BGP Identifier: 192.168.0.101
    Optional Parameters Length: 41
    Optional Parameters
Capture 2-5: BGP Adjacency – BGP Open Message sent by L-101.

Internet Protocol Version 4, Src: 10.10.1.1, Dst: 10.10.1.0
Transmission Control Protocol, Src Port: 179, Dst Port: 20376, Seq:71, Ack:90, Len: 19
    <snipped>
Border Gateway Protocol - KEEPALIVE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 19
    Type: KEEPALIVE Message (4)
Capture 2-6: BGP Adjacency – BGP KeepAlive Message sent by S-11.

Internet Protocol Version 4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control Protocol, Src Port: 20376, Dst Port: 179, Seq:71, Ack:71, Len: 19
    <snipped>
Border Gateway Protocol - KEEPALIVE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 19
    Type: KEEPALIVE Message (4)
Capture 2-7: BGP Adjacency – BGP KeepAlive Message sent by L-101.

Transmission Control Protocol, Src Port: 179, Dst Port: 20376, Seq:90, Ack:90, Len: 48
    <snipped>
Border Gateway Protocol - UPDATE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 29
    Type: UPDATE Message (2)
    Withdrawn Routes Length: 0
    Total Path Attribute Length: 6
    Path attributes
        Path Attribute - MP_UNREACH_NLRI
            Flags: 0x80, Optional, Non-transitive, Complete
            Type Code: MP_UNREACH_NLRI (15)
            Length: 3
            Address family identifier (AFI): IPv4 (1)
            Subsequent address family identifier (SAFI): Unicast (1)
            Withdrawn routes (0 bytes)
Border Gateway Protocol - KEEPALIVE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 19
    Type: KEEPALIVE Message (4)
Capture 2-8: BGP Adjacency – BGP Update and KeepAlive Message sent by S-11.


Internet Protocol Version 4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control Protocol, Src Port:20376, Dst Port: 179, Seq:90, Ack:138, Len:109
    <snipped>
Border Gateway Protocol - UPDATE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 61
    Type: UPDATE Message (2)
    Withdrawn Routes Length: 0
    Total Path Attribute Length: 38
    Path attributes
        Path Attribute - ORIGIN: INCOMPLETE
            Flags: 0x40, Transitive, Well-known, Complete
            Type Code: ORIGIN (1)
            Length: 1
            Origin: INCOMPLETE (2)
        Path Attribute - AS_PATH: 65101
            Flags: 0x40, Transitive, Well-known, Complete
            Type Code: AS_PATH (2)
            Length: 6
            AS Path segment: 65101
        Path Attribute - MULTI_EXIT_DISC: 0
            Flags: 0x80, Optional, Non-transitive, Complete
            Type Code: MULTI_EXIT_DISC (4)
            Length: 4
            Multiple exit discriminator: 0
        Path Attribute - MP_REACH_NLRI
            Flags: 0x90, Optional, Extended-Length, Non-transitive, Complete
            Type Code: MP_REACH_NLRI (14)
            Length: 14
            Address family identifier (AFI): IPv4 (1)
            Subsequent address family identifier (SAFI): Unicast (1)
            Next hop network address (4 bytes)
            Number of Subnetwork points of attachment (SNPA): 0
            Network layer reachability information (5 bytes)
                192.168.31.101/32
                    MP Reach NLRI prefix length: 32
                    MP Reach NLRI IPv4 prefix: 192.168.31.101
Border Gateway Protocol - UPDATE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 29
    Type: UPDATE Message (2)
    Withdrawn Routes Length: 0
    Total Path Attribute Length: 6
    Path attributes
        Path Attribute - MP_UNREACH_NLRI
            Flags: 0x80, Optional, Non-transitive, Complete
            Type Code: MP_UNREACH_NLRI (15)
            Length: 3
            Address family identifier (AFI): IPv4 (1)
            Subsequent address family identifier (SAFI): Unicast (1)
            Withdrawn routes (0 bytes)
Border Gateway Protocol - KEEPALIVE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 19
    Type: KEEPALIVE Message (4)
Capture 2-9: BGP Adjacency – BGP Update and KeepAlive Message sent by L-101.
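
Several BGP messages can share one TCP segment, as Captures 2-8 and 2-9 show. The Python sketch below rebuilds the two messages of Capture 2-8 from the field values in the capture and splits the payload by using the 19-byte BGP header (16-byte marker, 2-byte length, 1-byte type). The function and variable names are hypothetical.

```python
import struct

MARKER = b"\xff" * 16
HEADER = struct.Struct("!16sHB")   # marker, length, type: 19 bytes in total
TYPE_NAMES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def split_messages(payload: bytes):
    """Return a list of (type name, length) for each BGP message in payload."""
    messages = []
    offset = 0
    while offset < len(payload):
        marker, length, msg_type = HEADER.unpack_from(payload, offset)
        if marker != MARKER or length < HEADER.size:
            raise ValueError("not a valid BGP message header")
        messages.append((TYPE_NAMES.get(msg_type, "UNKNOWN"), length))
        offset += length
    return messages


# KEEPALIVE: header only.
keepalive = HEADER.pack(MARKER, 19, 4)

# UPDATE of Capture 2-8: no withdrawn routes, one 6-byte path attribute
# (MP_UNREACH_NLRI, flags 0x80, type 15, length 3, AFI 1, SAFI 1).
update = (HEADER.pack(MARKER, 29, 2)
          + struct.pack("!HH", 0, 6)
          + bytes([0x80, 15, 3])
          + struct.pack("!HB", 1, 1))

payload = update + keepalive
print(len(payload))             # 48, the TCP Len value in Capture 2-8
print(split_messages(payload))  # [('UPDATE', 29), ('KEEPALIVE', 19)]
```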


BGP NLRI Update Process

Figure 2-4 illustrates how the BGP Network Layer Reachability Information (NLRI) about Loopback 30 of L-101 is propagated from L-101 to S-11.

RIB to Adj-RIB-Out (Pre-Policy)
The IP address 192.168.31.101/32 of Loopback 30 is redistributed from the RIB to the BGP Loc-RIB by using a route-map. When the route is redistributed from the RIB to the Loc-RIB, the route itself is encoded as an MP_REACH_NLRI Path Attribute together with other IPv4 Unicast peer-specific BGP Path Attributes, such as ORIGIN, AS_PATH, and MED. The NLRI 192.168.31.101/32 is then sent from the Loc-RIB to the Adj-RIB-Out (Pre-Policy) of all BGP peers with which L-101 has an IPv4 Unicast (AFI 1/SAFI 1) peering and which are either eBGP, iBGP, RR-Client, or Confederation peers. I mention all four peer types because there are cases where an IPv4 Unicast NLRI is not placed into the Adj-RIB-Out (Pre-Policy) even though the peering is IPv4 Unicast. One simple example is that an NLRI received from an iBGP peer is not advertised to other iBGP peers. Some implementations also do not forward an NLRI from one eBGP peer to another eBGP peer if the AS_PATH Path Attribute in the ingress BGP Update includes the BGP AS number of the egress eBGP peer. BGP Path Attributes are not modified when the NLRI is programmed from the Loc-RIB into the Adj-RIB-Out (Pre-Policy). In our example, L-101 sends the information to the Adj-RIB-Out (Pre) of the eBGP peers S-11 and S-12 (not shown in the figure).

Adj-RIB-Out (Pre) to Adj-RIB-Out (Post)
The Adj-RIB-Out (Pre-Policy) equals the Loc-RIB, but it only includes the routes that are eligible for each neighbor. L-101 sends the NLRI about 192.168.31.101/32 from the Adj-RIB-Out (Pre-Policy) to the Adj-RIB-Out (Post-Policy) through the BGP Policy-Engine (Outbound Policy). During this process, the NLRI might be included in an aggregate address, its BGP Path Attributes might be modified, or new Path Attributes could be added (e.g. communities). The NLRI might even be filtered out from the BGP Update. In our case, L-101 doesn't modify or filter routes in any way, so both Adj-RIB-Out tables are equal. The Next-Hop address in the MP_REACH_NLRI Path Attribute is set to the IP address of the egress interface when L-101 sends the BGP Update to S-11.

Adj-RIB-In (Pre) to Adj-RIB-In (Post)
When S-11 receives the BGP Update about 192.168.31.101/32 from L-101, it installs the NLRI into the Adj-RIB-In (Pre-Policy) without any modification. Then the NLRI is sent through the Policy-Engine (Inbound Policy) to the Adj-RIB-In (Post-Policy). During this process, there might be some modifications, such as setting Local-Preference or Weight, or filtering based on some Path Attribute.

Adj-RIB-In (Post) to Loc-RIB
The Loc-RIB contains the NLRIs received from the Adj-RIB-In (Post-Policy). The same NLRI might be received from several BGP peers, and all of them are installed into the Loc-RIB, but only one of them is selected as the best route. With BGP ECMP, several BGP peers might be installed as next-hops (multipathing), and traffic to the destination is load-balanced per flow between the next-hops. The BGP Best Path Selection process considers only NLRIs that have a valid Next-Hop (found in the RIB) and that do not contain the local AS in the AS_PATH Path Attribute. The latter is the BGP loop prevention mechanism. After selecting the eligible NLRIs, the best path selection process compares their Path Attributes in this order: (1) highest Weight, (2) highest Local-Preference, (3) locally originated prefixes, (4) shortest AS_PATH, (5) lowest ORIGIN (IGP < EGP < Incomplete), and (6) lowest MED. If all of these are equal, the decision process prefers (7) eBGP over iBGP, (8) the smallest IGP metric to the Next-Hop, and as a last step (9) the path through the neighbor that has the lowest BGP RID. From the S-11 perspective, there is only one path to 192.168.31.101/32, via L-101.
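
The comparison order above maps naturally onto a sort key. The following Python sketch is a simplified model, not a vendor implementation: it assumes all candidate paths are already eligible, compares MED unconditionally, and skips steps such as preferring the oldest eBGP path. The Path class and its field names are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass
class Path:
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: int = 0            # 0 = IGP, 1 = EGP, 2 = Incomplete
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0
    peer_rid: str = "0.0.0.0"


def preference_key(p: Path):
    """Smaller tuples are preferred; 'highest wins' fields are negated."""
    return (
        -p.weight,
        -p.local_pref,
        not p.locally_originated,
        len(p.as_path),
        p.origin,
        p.med,
        not p.ebgp,
        p.igp_metric,
        int(IPv4Address(p.peer_rid)),
    )


def best_path(paths):
    return min(paths, key=preference_key)


short = Path(as_path=(65100, 65101), peer_rid="192.168.0.12")
long_ = Path(as_path=(65100, 65100, 65100, 65101), peer_rid="192.168.0.11")
preferred = Path(local_pref=200, as_path=(65100, 65100, 65100, 65101))

print(best_path([short, long_]) is short)              # True: shorter AS_PATH
print(best_path([short, preferred]) is preferred)      # True: Local-Preference first
```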

Loc-RIB to RIB
This process is simple: routes installed into the Loc-RIB are installed into the RIB if there is no better route source for the specific route. Note! BGP does not flood BGP Updates to adjacent BGP peers. Instead, it constructs BGP Updates only from routes installed into its RIB, no matter how they ended up there. This means that ingress BGP Updates are processed as described in this section and then reconstructed when they are sent to adjacent BGP speakers.
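
The "better route source" decision can be sketched as a comparison of route preference values. In the Python example below, the values 0 (direct) and 20 (eBGP) are the ones visible as [0/0] and [20/0] in the show outputs of this chapter; the dictionary layout and the function name are hypothetical, and equal-preference ECMP is not modelled.

```python
def offer_route(rib: dict, prefix: str, source: str, preference: int, next_hop: str) -> bool:
    """Install the route only if no source with an equal or lower preference owns the prefix."""
    current = rib.get(prefix)
    if current is not None and current["preference"] <= preference:
        return False
    rib[prefix] = {"source": source, "preference": preference, "next_hop": next_hop}
    return True


# S-11 has no other source for the prefix, so the eBGP route is installed.
s11_rib = {}
print(offer_route(s11_rib, "192.168.31.101/32", "bgp", 20, "10.10.1.0"))   # True

# L-101 owns the prefix as a direct route, which beats any BGP-learned copy.
l101_rib = {"192.168.31.101/32": {"source": "direct", "preference": 0, "next_hop": "Lo30"}}
print(offer_route(l101_rib, "192.168.31.101/32", "bgp", 20, "10.10.1.1"))  # False
```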


Figure 2-4: BGP NLRI Advertisement Process.

Example 2-4 shows the RIB entry about 192.168.31.101 in L-101.

L-101# sh ip route 192.168.31.101 | sec 192
192.168.31.101/32, ubest/mbest: 2/0, attached
    *via 192.168.31.101, Lo30, [0/0], 01:43:30, local
    *via 192.168.31.101, Lo30, [0/0], 01:43:30, direct
Example 2-4: The RIB of L-101.

Example 2-5 shows BGP Loc-RIB about the same IP address. The NLRI is advertised to both spine switches: S-11 (10.10.1.1), and S-12 (10.10.1.3).

L-101# sh ip bgp 192.168.31.101
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 192.168.31.101/32, version 2
Paths: (1 available, best #1)
Flags: (0x080002) (high32 00000000) on xmit-list, is not in urib
Multipath: eBGP

  Advertised path-id 1
  Path type: redist, path is valid, is best path, no labeled nexthop
  AS-Path: NONE, path locally originated
    0.0.0.0 (metric 0) from 0.0.0.0 (192.168.0.101)
      Origin incomplete, MED 0, localpref 100, weight 32768

  Path-id 1 advertised to peers:
    10.10.1.1          10.10.1.3
Example 2-5: The BGP Loc-RIB of L-101.

Capture 2-10 illustrates the BGP Update sent by L-101 to S-11. The ORIGIN Path Attribute is Incomplete because the route is redistributed into BGP. The AS_PATH Path Attribute is set to 65101. The MULTI_EXIT_DISC Path Attribute is set to zero. The MP_REACH_NLRI Path Attribute carries the actual routing information: it describes the destination network/host IP address and its next-hop. Because BGP runs over TCP, all messages are also expected to be acknowledged by the receiver.

Internet Protocol Version 4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control Protocol, Src Port:29243, Dst Port:179, Seq:90, Ack:328, Len: 109
Border Gateway Protocol - UPDATE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 61
    Type: UPDATE Message (2)
    Withdrawn Routes Length: 0
    Total Path Attribute Length: 38
    Path attributes
        Path Attribute - ORIGIN: INCOMPLETE
            Flags: 0x40, Transitive, Well-known, Complete
            Type Code: ORIGIN (1)
            Length: 1
            Origin: INCOMPLETE (2)
        Path Attribute - AS_PATH: 65101
            Flags: 0x40, Transitive, Well-known, Complete
            Type Code: AS_PATH (2)
            Length: 6
            AS Path segment: 65101
        Path Attribute - MULTI_EXIT_DISC: 0
            Flags: 0x80, Optional, Non-transitive, Complete
            Type Code: MULTI_EXIT_DISC (4)
            Length: 4
            Multiple exit discriminator: 0
        Path Attribute - MP_REACH_NLRI
            Flags: 0x90, Optional, Extended-Length, Non-transitive, Complete
            Type Code: MP_REACH_NLRI (14)
            Length: 14
            Address family identifier (AFI): IPv4 (1)
            Subsequent address family identifier (SAFI): Unicast (1)
            Next hop network address (4 bytes)
                Next Hop: 10.10.1.0
            Number of Subnetwork points of attachment (SNPA): 0
            Network layer reachability information (5 bytes)
                192.168.31.101/32
Capture 2-10: BGP Update Message sent by L-101.

Example 2-6 shows that S-11 has installed the NLRI in its Loc-RIB. Note that the Weight attribute is zero by default on S-11, while on the originating router L-101 it is 32768. The NLRI is also advertised to all eBGP peers of S-11: L-102 (10.10.1.4), SS-1 (10.10.10.1), and SS-2 (10.10.10.3).

S-11# sh ip bgp 192.168.31.101
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 192.168.31.101/32, version 9
Paths: (1 available, best #1)
Flags: (0x8008001a) (high32 00000000) on xmit-list, is in urib, is best urib route, is in HW
Multipath: eBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop, in rib
  AS-Path: 65101 , path sourced external to AS
    10.10.1.0 (metric 0) from 10.10.1.0 (192.168.0.101)
      Origin incomplete, MED 0, localpref 100, weight 0

  Path-id 1 advertised to peers:
    10.10.1.4          10.10.10.1         10.10.10.3
Example 2-6: BGP table entry about 192.168.31.101 on S-11 Loc-RIB.

Example 2-7 shows that S-11 has installed routing information about 192.168.31.101/32 into the RIB.

S-11# sh ip route 192.168.31.101
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.31.101/32, ubest/mbest: 1/0
    *via 10.10.1.0, [20/0], 00:05:06, bgp-65100, external, tag 65101
Example 2-7: RIB entry about 192.168.31.101 on S-11.

Example 2-8 shows that L-202, located in the remote Pod-2, has two equal-cost paths to the destination 192.168.31.101/32.

L-202# sh ip route 192.168.31.101
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.31.101/32, ubest/mbest: 2/0
    *via 10.10.2.5, [20/0], 02:17:03, bgp-65202, external, tag 65200
    *via 10.10.2.7, [20/0], 02:17:03, bgp-65202, external, tag 65200
Example 2-8: RIB entry about 192.168.31.101 on L-202.


BGP Update: Unreachable Destination


Figure 2-5 illustrates an Inter-Switch link failure between S-11 and L-101. The reaction time depends on the type of link failure and on how fast S-11 notices and reacts to it. If the failure affects, for example, only the transmit strand of a fiber pair, or it is some odd SFP failure, and the reaction is based on the Hold Time and Keepalive timers, detecting the failure might take some time. With Bidirectional Forwarding Detection (BFD) between the BGP neighbors, however, the failure is noticed almost immediately. Using BFD is the best practice. When S-11 notices that the link over which the route to 192.168.31.101/32 was received is down, it checks the BGP Loc-RIB to see whether the destination is available through some other BGP peer. In our case, L-101 was the only route source, so the BGP process notifies the RIB to remove the route learned from BGP. When the route is removed from the RIB and the Loc-RIB, S-11 withdraws the route from all of its peers. When the BGP neighbors receive the BGP Update with the MP_UNREACH_NLRI Path Attribute, they remove the destination listed in its "Withdrawn routes" field.



Figure 2-5: S-11 Reaction to Link-Failure Event.


Capture 2-11 shows the BGP Update message with the MP_UNREACH_NLRI Path Attribute sent by S-11 to SS-1.

Internet Protocol Version 4, Src: 10.10.10.0, Dst: 10.10.10.1
Transmission Control Protocol, Src Port: 179, Dst Port: 21728, Seq: 20, Ack: 1, Len:35
Border Gateway Protocol - UPDATE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 35
    Type: UPDATE Message (2)
    Withdrawn Routes Length: 0
    Total Path Attribute Length: 12
    Path attributes
        Path Attribute - MP_UNREACH_NLRI
            Flags: 0x90, Optional, Extended-Length, Non-transitive, Complete
            Type Code: MP_UNREACH_NLRI (15)
            Length: 8
            Address family identifier (AFI): IPv4 (1)
            Subsequent address family identifier (SAFI): Unicast (1)
            Withdrawn routes (5 bytes)
                192.168.31.101/32
                    MP Unreach NLRI prefix length: 32
                    MP Unreach NLRI IPv4 prefix: 192.168.31.101
Capture 2-11: The IP address 192.168.31.101/32 withdrawn by S-11.
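
The length fields of Capture 2-11 can be reproduced by building the withdraw message by hand. The Python sketch below encodes an IPv4 Unicast MP_UNREACH_NLRI attribute with the extended-length flag (0x90), as seen in the capture, and wraps it in an UPDATE message. The function name is hypothetical, and only a single prefix is handled.

```python
import ipaddress
import struct

MARKER = b"\xff" * 16


def mp_unreach_update(prefix: str) -> bytes:
    """Build a BGP UPDATE that withdraws one IPv4 Unicast prefix via MP_UNREACH_NLRI."""
    net = ipaddress.ip_network(prefix)
    prefix_bytes = (net.prefixlen + 7) // 8
    nlri = bytes([net.prefixlen]) + net.network_address.packed[:prefix_bytes]

    value = struct.pack("!HB", 1, 1) + nlri                    # AFI 1, SAFI 1, withdrawn routes
    attr = struct.pack("!BBH", 0x90, 15, len(value)) + value   # flags, type 15, 2-byte length
    body = struct.pack("!H", 0) + struct.pack("!H", len(attr)) + attr
    header = MARKER + struct.pack("!HB", 19 + len(body), 2)    # type 2 = UPDATE
    return header + body


message = mp_unreach_update("192.168.31.101/32")
print(len(message))                               # 35, the Length field in Capture 2-11
print(struct.unpack("!H", message[21:23])[0])     # 12, Total Path Attribute Length
print(struct.unpack("!H", message[25:27])[0])     # 8, MP_UNREACH_NLRI attribute length
```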


MRAI Timer


The MinRouteAdvertisementInterval (MRAI) timer defines how often an NLRI advertisement or withdrawal received from one peer about the same destination can be sent to another peer. This timer is peer-specific. RFC 4271 states that the iBGP timer should be shorter than the eBGP timer because inside an AS we need faster convergence. However, in the modern Datacenter, MRAI should be set to zero for both peering types [BGP MRAI]. Nexus switches use the MRAI value zero by default and it can't be changed.
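
The effect of the timer can be illustrated with a small per-peer gate. This Python sketch is a simplified model with a hypothetical class name: real implementations queue the suppressed update and send it when the timer expires, whereas the sketch only reports whether an update may be sent right now. The 30-second value is the eBGP value suggested in RFC 4271.

```python
class MraiGate:
    """Tracks, for one peer, when each destination was last advertised."""

    def __init__(self, interval: float):
        self.interval = interval
        self.last_sent = {}

    def may_send(self, prefix: str, now: float) -> bool:
        last = self.last_sent.get(prefix)
        if last is not None and now - last < self.interval:
            return False            # still inside the MRAI window
        self.last_sent[prefix] = now
        return True


ebgp_default = MraiGate(interval=30)
print(ebgp_default.may_send("192.168.31.101/32", now=0))    # True
print(ebgp_default.may_send("192.168.31.101/32", now=10))   # False: held back
print(ebgp_default.may_send("192.168.31.101/32", now=30))   # True

datacenter = MraiGate(interval=0)
print(datacenter.may_send("192.168.31.101/32", now=0))      # True
print(datacenter.may_send("192.168.31.101/32", now=0.1))    # True: no delay
```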
  

BGP AS-Path Prepend


There are several options to remove a BGP speaker from the data path in a controlled manner: the neighbors can be put in shutdown mode, the BGP process can be isolated, or the AS-Path advertised by the router can be prepended. The focus of this section is on the AS-Path prepending solution. Example 2-9 shows that in the normal situation SS-1 has two equal paths to 192.168.31.101/32.


SS-1# sh ip bgp 192.168.31.101
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 192.168.31.101/32, version 12
Paths: (2 available, best #1)
Flags: (0x8008001a) (high32 00000000) on xmit-list, is in urib, is best urib route, is in HW
Multipath: eBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop, in rib
  AS-Path: 65100 65101 , path sourced external to AS
    10.10.10.4 (metric 0) from 10.10.10.4 (192.168.0.12)
      Origin incomplete, MED not set, localpref 100, weight 0

  Path type: external, path is valid, not best reason: newer EBGP path, multipath, no labeled nexthop, in rib
  AS-Path: 65100 65101 , path sourced external to AS
    10.10.10.0 (metric 0) from 10.10.10.0 (192.168.0.11)
      Origin incomplete, MED not set, localpref 100, weight 0

  Path-id 1 advertised to peers:
    10.10.20.0         10.10.20.4
Example 2-9: BGP Loc-RIB about 192.168.31.101 on SS-1.

Both routes are also installed into the RIB.

SS-1# sh ip route 192.168.31.101
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.31.101/32, ubest/mbest: 2/0
    *via 10.10.10.0, [20/0], 00:01:01, bgp-65001, external, tag 65100
    *via 10.10.10.4, [20/0], 00:19:16, bgp-65001, external, tag 65100
Example 2-11: RIB of SS-1.

Example 2-11 shows the AS-Path prepend related configuration on S-11.

route-map AS-PATH-PREP-GIR permit 10
  set as-path prepend 65100 65100
!
router bgp 65100
  router-id 192.168.0.11
  bestpath as-path multipath-relax
  address-family ipv4 unicast
    maximum-paths 8
  neighbor 10.10.0.0/16 remote-as route-map Dynamic-BGP-AS-List
    address-family ipv4 unicast
      route-map AS-PATH-PREP-GIR out
Example 2-11: AS-Path Prepend Configuration on S-11.

In reaction to this configuration, S-11 generates a new BGP Update for all of its BGP-learned routes, with 65100 65100 prepended to the AS-Path attribute.



Figure 2-6: AS-Path prepend on S-11.

Now the route received from S-11 has AS-Path: 65100 65100 65100 65101, while the route received from S-12 has AS-Path: 65100 65101. The route from S-12 is selected as the best path because of its shorter AS-Path.

SS-1# sh ip bgp 192.168.31.101
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 192.168.31.101/32, version 14
Paths: (2 available, best #1)
Flags: (0x8008001a) (high32 00000000) on xmit-list, is in urib, is best urib route, is in HW
Multipath: eBGP

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop, in rib
  AS-Path: 65100 65101 , path sourced external to AS
    10.10.10.4 (metric 0) from 10.10.10.4 (192.168.0.12)
      Origin incomplete, MED not set, localpref 100, weight 0

  Path type: external, path is valid, not best reason: AS Path, no labeled nexthop
  AS-Path: 65100 65100 65100 65101 , path sourced external to AS
    10.10.10.0 (metric 0) from 10.10.10.0 (192.168.0.11)
      Origin incomplete, MED not set, localpref 100, weight 0

  Path-id 1 advertised to peers:
    10.10.20.0         10.10.20.4
Example 2-12: BGP Loc-RIB about 192.168.31.101 in SS-1.

The route via S-11 is removed from the RIB, and the IP address 192.168.31.101 is reachable only through S-12.

SS-1# sh ip route 192.168.31.101
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.31.101/32, ubest/mbest: 1/0
    *via 10.10.10.4, [20/0], 00:14:59, bgp-65001, external, tag 65100
Example 2-13: RIB about 192.168.31.101 in SS-1.
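
The effect of the prepend on SS-1 can be reproduced with a small Python sketch. It is a deliberate simplification, not the NX-OS best-path algorithm: it assumes that all other attributes are equal (as they are in the outputs above, where weight, local preference, and origin match) and compares only the AS-Path length. The function name ecmp_next_hops is hypothetical; the next hops and AS-Paths are taken from the SS-1 outputs.

```python
from typing import List, Tuple

Path = Tuple[str, Tuple[int, ...]]  # (next hop, AS-Path)


def ecmp_next_hops(paths: List[Path]) -> List[str]:
    """Return the next hops whose AS-Path is as short as the shortest one.

    Assumption: every other path attribute is equal, so AS-Path length
    is the only criterion that decides which paths are installed.
    """
    shortest = min(len(as_path) for _, as_path in paths)
    return sorted(nh for nh, as_path in paths if len(as_path) == shortest)


# Before the prepend: both Spines advertise the same AS-Path.
before = [
    ("10.10.10.4", (65100, 65101)),                # via S-12
    ("10.10.10.0", (65100, 65101)),                # via S-11
]

# After the prepend on S-11: its path is two AS hops longer.
after = [
    ("10.10.10.4", (65100, 65101)),                # via S-12
    ("10.10.10.0", (65100, 65100, 65100, 65101)),  # via S-11
]

installed_before = ecmp_next_hops(before)  # both next hops
installed_after = ecmp_next_hops(after)    # only the next hop via S-12
```

The path via S-11 is not discarded from the BGP table. It only stops qualifying for installation, which is why it is still listed in the BGP output with the reason "AS Path".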

Example 2-14 shows that L-202 in Pod-2 still uses ECMP to the destination because S-21 and S-22 receive the update from both Super Spines. L-202 can therefore keep sending data to 192.168.31.101 at full rate even though half of the capacity toward the destination has been taken out of the data path, which gives a 1:2 oversubscription ratio. In the worst case, this might result in 50% packet loss. With three switches in the Spine and Super Spine layers, the ratio improves to 2:3, which limits the worst-case packet loss to 33%.

L-202# sh ip bgp 192.168.31.101
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 192.168.31.101/32, version 7
Paths: (2 available, best #2)
Flags: (0x8008001a) (high32 00000000) on xmit-list, is in urib, is best urib route, is in HW
Multipath: eBGP

  Path type: external, path is valid, not best reason: newer EBGP path, multipath, no labeled nexthop, in rib
  AS-Path: 65200 65001 65100 65101 , path sourced external to AS
    10.10.2.7 (metric 0) from 10.10.2.7 (192.168.0.22)
      Origin incomplete, MED not set, localpref 100, weight 0

  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop, in rib
  AS-Path: 65200 65001 65100 65101 , path sourced external to AS
    10.10.2.5 (metric 0) from 10.10.2.5 (192.168.0.21)
      Origin incomplete, MED not set, localpref 100, weight 0

  Path-id 1 not advertised to any peer
Example 2-14: BGP Loc-RIB about 192.168.31.101 in L-202.
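
The ratios mentioned above follow from simple arithmetic, sketched here in Python. The function names remaining_capacity and worst_case_loss are hypothetical, and the sketch assumes that all switches in a layer offer equal capacity and that the sender keeps offering the full original load.

```python
from fractions import Fraction


def remaining_capacity(switches: int, removed: int = 1) -> Fraction:
    """Share of the original capacity left when switches are taken out."""
    if not 0 <= removed < switches:
        raise ValueError("removed must be between 0 and switches - 1")
    return Fraction(switches - removed, switches)


def worst_case_loss(switches: int, removed: int = 1) -> Fraction:
    """Share of traffic that cannot be carried if the full load is still offered."""
    return 1 - remaining_capacity(switches, removed)


two_spines = remaining_capacity(2)    # Fraction(1, 2), the 1:2 ratio
three_spines = remaining_capacity(3)  # Fraction(2, 3), the 2:3 ratio
loss_two = worst_case_loss(2)         # Fraction(1, 2), i.e. 50%
loss_three = worst_case_loss(3)       # Fraction(1, 3), i.e. about 33%
```

Each additional switch in the layer reduces the share of capacity that is lost when one of them is taken out for maintenance.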




OSPF or BGP In Underlay?

To answer the title question, we need to think about the intent of an Underlay Network. It should offer reliable IP connectivity between VTEP switches. By reliable I mean (a) enough redundant bandwidth (ECMP), (b) fast failure detection and recovery, and (c) non-disruptive maintenance work. Both OSPF as a Link-State protocol and BGP as a Path-Vector protocol fulfill these requirements, so in that sense, the answer to the question is “both are suitable for the Underlay Network”.

If that answer is not good enough, we can try to find the tiebreaker by comparing the properties and operations of the protocols shown in Table 2-1. OSPF doesn’t use a transport layer protocol as BGP does. This means that one layer of complexity is removed from OSPF. Both protocols have a reliable, somewhat complex adjacency process. My opinion is that the BGP adjacency process is a bit more complex because of the TCP three-way handshake.

OSPF routers within an area have a synchronized LSDB, and they all make individual decisions about the best paths based on the metric. BGP, in turn, trusts the NLRI information received from an adjacent BGP peer, and the best path is selected using a 13-step comparison based on the Path Attributes carried within the BGP Update. In that sense, the BGP routing decision is not based on metrics but on administrative policy. Note that RFC 7311 describes how the IGP metric can also be carried within the BGP Update. Also, there is a draft, “draft-ietf-lsvr-bgp-spf-09” [LSVR-BGP], that describes how Link-State distribution and the SPF algorithm can be used with BGP.

The convergence process of BGP is simpler than that of OSPF. In case of a link failure, a BGP withdrawal does not affect the whole fabric as it does in a single-area OSPF design. Both OSPF and BGP use reliable information exchange: OSPF LSAs are acknowledged, and because BGP uses TCP as a transport protocol, all its messages, including BGP Updates, are acknowledged by adjacent BGP speakers. The biggest concern about OSPF is its flooding process, where Link-State information is re-flooded when the Link-State Age reaches 1800 seconds. LSA Group Pacing and the area structure relax this, so the flooding is not a problem.




Table 2-1: OSPF and BGP comparison.



References


[RFC 4271]                     Y. Rekhter et al., “A Border Gateway Protocol 4 (BGP-4)”, RFC 4271, January 2006.

[RFC 5549]                     F. Le Faucheur and E. Rosen, “Advertising IPv4 Network Layer Reachability Information with an IPv6 Next Hop”, RFC 5549, May 2009.

[RFC 7311]                     P. Mohapatra et al., “The Accumulated IGP Metric Attribute for BGP”, RFC 7311, August 2014.

[RFC 7938]                     P. Lapukhov et al., “Use of BGP for Routing in Large-Scale Data Centers”, RFC 7938, August 2016.

[RFC 8671]                     T. Evens et al., “Support for Adj-RIB-Out in the BGP Monitoring Protocol (BMP)”, RFC 8671, November 2019.

                           
[BGP-MRAI]                    P. Jakma, “Revisions to the BGP ’Minimum Route Advertisement Interval’”, draft-ietf-idr-mrai-dep-04, September 20, 2011.

[LSVR-BGP]                   K. Patel et al., “Shortest Path Routing Extensions for BGP Protocol”, draft-ietf-lsvr-bgp-spf-09, May 15, 2020.

