Introduction
This chapter explains the BGP Multi-AS Underlay Network design in a BGP
EVPN/VXLAN fabric. It starts with the BGP configuration, so that the later
explanations can rely on show and debug commands as well as packet captures.
The next section discusses the BGP adjacency process and its related states
(Idle, Connect/Active, OpenSent, OpenConfirm, and Established). After that, the
chapter explains BGP routing: how connected routes are sent from the RIB to the
Loc-RIB and from there to the Adj-RIB-Out (Pre/Post). That section also shows
how NLRIs received within a BGP Update eventually end up in the RIB of the
receiving BGP speaker. In addition, the chapter briefly introduces the MRAI
timer as well as a non-disruptive device maintenance solution. The last section
tries to answer which protocol best fits the Underlay Network of a BGP EVPN
fabric.
Infrastructure AS Numbering and IP Addressing Scheme
The AS-numbering scheme used in this chapter is the same as in chapter 1, but
instead of unnumbered interfaces, each inter-switch interface now has an IP
address assigned to it. Unnumbered interfaces can also be used with BGP by
means of IPv6 Link-Local addressing [RFC 5549]. However, this solution is not
supported by all vendors.
Figure 2-1: IP Addressing Scheme.
BGP Configuration
Leaf Switches
Example 2-1 shows the BGP configuration of Leaf-101. It has BGP IPv4 peering
with S-11 and S-12. BGP is allowed to install eight next-hops with equal
AS-Path length per destination into the BGP Loc-RIB table. In our example, two
would fulfill the requirements, but with eight paths there is no need to change
the value when additional Spine switches are added to the network. The loopback
interface that is used in the VXLAN header is redistributed by using a
route-map. It could also be advertised into BGP with a network clause under the
IPv4 address-family, but with a route-map there are fewer unique configuration
parameters per VTEP switch, which simplifies automation. To install multiple
paths to one destination, (a) the AS-Path length of the paths has to be equal
and (b) the AS numbers listed in the AS-Path of each path have to be identical.
This default behavior means that if the AS-Path length is the same but the AS
numbers differ, the paths are not used to load-balance traffic to the
destination. The second requirement can be relaxed with the command bestpath
as-path multipath-relax under the BGP process. In our example this is not
necessary, but I have used it because chapter 3 also demonstrates a design
where each Spine has its own unique BGP AS number (which may not be the best
design model; I will explain why later).
feature bgp
!
route-map VTEP-TO-BGP permit 10
  match interface loopback30
!
router bgp 65101
  bestpath as-path multipath-relax
  address-family ipv4 unicast
    redistribute direct route-map VTEP-TO-BGP
    maximum-paths 8
  neighbor 10.10.1.1
    remote-as 65100
    address-family ipv4 unicast
  neighbor 10.10.1.3
    remote-as 65100
    address-family ipv4 unicast
Example 2-1: Leaf-101 BGP configuration.
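The multipath rule described before Example 2-1 can be illustrated with a few lines of Python. This is only a minimal sketch of the AS-Path part of the decision: the function name multipath_eligible and the AS number 65199 are made up for the illustration, and a real implementation also requires the other best-path attributes to match.

```python
from typing import List


def multipath_eligible(best: List[int], candidate: List[int], relax: bool = False) -> bool:
    """Return True if 'candidate' may be installed next to 'best' (AS-Path check only)."""
    # (a) The AS-Path length always has to be equal.
    if len(best) != len(candidate):
        return False
    # (b) The AS numbers have to be identical unless multipath-relax is enabled.
    return relax or best == candidate


via_a = [65100, 65101]
via_b = [65199, 65101]  # hypothetical path through a different AS

print(multipath_eligible(via_a, via_b))              # default behavior
print(multipath_eligible(via_a, via_b, relax=True))  # with multipath-relax
```

With the default behavior the two paths are not used for load-balancing, while the relaxed check accepts them because only the AS-Path length is compared.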
Spine Switches
Example 2-2 shows the BGP configuration of S-11 and S-12. Instead of statically
configured BGP peerings with each Leaf and Super-Spine switch, S-11 passively
waits for BGP connections from BGP speakers that use a source address from the
network 10.10.0.0/16 and that are located in one of the BGP ASes listed in the
route-map. This shortens the BGP configuration of the Spine switches because
there is no need to build an individual BGP peering configuration for every
Leaf switch within the pod or for the Super-Spine switches. The commands
bestpath as-path multipath-relax and maximum-paths 8 are also used in the Spine
and Super-Spine switches.
feature bgp
!
route-map Dynamic-BGP-AS-List permit 10
  match as-number 65001, 65101, 65102
!
router bgp 65100
  bestpath as-path multipath-relax
  address-family ipv4 unicast
    maximum-paths 8
  neighbor 10.10.0.0/16 remote-as route-map Dynamic-BGP-AS-List
    address-family ipv4 unicast
Example 2-2: Spine-11 BGP configuration.
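The acceptance logic that the dynamic neighbor configuration expresses can be sketched as follows. This is an illustration of the two conditions only (source address within 10.10.0.0/16 and peer AS in the route-map list), not a description of how the switch implements it; the function name accept_dynamic_peer and the AS number 65999 are assumptions made for the example.

```python
import ipaddress

# Values taken from Example 2-2.
LISTEN_RANGE = ipaddress.ip_network("10.10.0.0/16")
ALLOWED_AS = {65001, 65101, 65102}


def accept_dynamic_peer(source_ip: str, peer_as: int) -> bool:
    """Accept a peer only if its source address and AS number both match."""
    in_range = ipaddress.ip_address(source_ip) in LISTEN_RANGE
    return in_range and peer_as in ALLOWED_AS


print(accept_dynamic_peer("10.10.1.0", 65101))    # L-101: address and AS match
print(accept_dynamic_peer("10.10.1.0", 65999))    # hypothetical AS not in the list
print(accept_dynamic_peer("192.168.0.1", 65101))  # source outside 10.10.0.0/16
```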
Super-Spine Switches
Example 2-3 shows the BGP configuration of SS-1. It has BGP IPv4 peering with
all Spine switches. It also has the same BGP ECMP-related commands that have
already been discussed.
feature bgp
router bgp 65001
  bestpath as-path multipath-relax
  address-family ipv4 unicast
    maximum-paths 8
  neighbor 10.10.10.0
    remote-as 65100
    address-family ipv4 unicast
  neighbor 10.10.10.4
    remote-as 65100
    address-family ipv4 unicast
  neighbor 10.10.20.0
    remote-as 65200
    address-family ipv4 unicast
  neighbor 10.10.20.4
    remote-as 65200
    address-family ipv4 unicast
Example 2-3: Super-Spine-1 BGP configuration.
BGP Neighbor Process
BGP adjacency negotiation goes through the Idle, Connect/Active, OpenSent,
OpenConfirm, and Established states. This section describes each of these
states and the events that trigger changes from one state to another. Figure
2-2 illustrates the process up to TCP session establishment.
Idle
In the Idle state, the local BGP speaker is waiting for a Start event. It does
not accept incoming BGP connections from the peer, nor does it allocate any BGP
resources to the peer. In reaction to the Start event, the local BGP speaker
starts initializing the TCP connection either actively or passively.
Connect
ManualStart (event 1): An administrator starts the BGP peer connection
manually. After that, the local system initiates a TCP connection to the peer
(a TCP SYN to destination port 179) and also listens for a connection initiated
by the peer. This event can be triggered, for example, by a basic BGP neighbor
configuration under the BGP process. This is what happens in L-101.
AutomaticStart (event 3): Same as event 1, except that the event starts
automatically. This could happen, for example, if the administrator of the
remote BGP speaker clears the BGP peering with the local system.
In reaction to event 1 (ManualStart) or event 3 (AutomaticStart), the BGP FSM
changes state from Idle to Connect, where the local system waits for the TCP
connection to be completed.
Active
ManualStart_with_PassiveTcpEstablishment (event 4): The BGP connection is
started manually by an administrator, but the local system waits for a TCP SYN
packet to port 179 from the remote BGP speaker. This can be done, for example,
by setting the transport mode to passive (1) or by using the BGP Dynamic
Neighbor solution (2). S-11 uses the BGP Dynamic Neighbor solution.
(1) neighbor 2001:DB8::915:9 transport connection-mode passive
(2) neighbor 10.10.0.0/16 remote-as route-map Dynamic-BGP-AS-List
AutomaticStart_with_PassiveTcpEstablishment (event 5): The local system waits
passively for the TCP connection from a remote peer as in event 4. The start
event happens automatically as in event 3.
In reaction to either event 4 (ManualStart_with_PassiveTcpEstablishment) or
event 5 (AutomaticStart_with_PassiveTcpEstablishment), the state changes from
Idle to Active, where the local system passively waits for a TCP connection to
be completed.
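The four start events and the resulting state changes can be summarized as a small lookup table. The sketch below covers only the transitions out of the Idle state described above; the names State, START_TRANSITIONS, and on_start_event are chosen for this illustration, and the rest of the BGP FSM is intentionally left out.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"


# Start event number -> state entered from Idle.
START_TRANSITIONS = {
    1: State.CONNECT,  # ManualStart
    3: State.CONNECT,  # AutomaticStart
    4: State.ACTIVE,   # ManualStart_with_PassiveTcpEstablishment
    5: State.ACTIVE,   # AutomaticStart_with_PassiveTcpEstablishment
}


def on_start_event(state: State, event: int) -> State:
    """Apply a start event; in this sketch anything else leaves the state unchanged."""
    if state is State.IDLE and event in START_TRANSITIONS:
        return START_TRANSITIONS[event]
    return state


print(on_start_event(State.IDLE, 1).value)  # L-101, configured with a static neighbor
print(on_start_event(State.IDLE, 4).value)  # S-11, waiting for dynamic neighbors
```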
Finalizing negotiation of the TCP connection
Finalizing the TCP connection follows the same procedure from both the Connect
and the Active state. In our example, there are four events related to the TCP
three-way handshake started by the active peer L-101. These events are
described below:
TcpConnection_Valid (event 14): This event occurs when the local system
receives a TCP connection request (TCP SYN) from the peer BGP speaker with a
valid source address (configured as a neighbor) and destined to the address
that the local system uses as a source in the BGP negotiation. In addition, the
destination TCP port has to be 179.
Tcp_CR_Invalid (event 15): This event occurs when the local system receives a
TCP connection request (TCP SYN) from the remote BGP speaker and the validity
check does not pass.
Tcp_CR_Acked (event 16): This event indicates that the local system has sent an
ACK message to the remote peer as a confirmation of the SYN-ACK message sent by
the remote peer.
TcpConnectionConfirmed (event 17): This event indicates that the local system
has received the ACK message from the peer BGP speaker.
Figure 2-2: BGP Adjacency: Idle – Connect/Active (Establishing TCP Connection).
Capture 2-1 shows the TCP SYN message sent by L-101. The destination TCP port
is 179, and the source port is the randomly generated number 20376. L-101
generates the sequence number 1716573727. This number increased by one (the
relative Next sequence number 1, raw value 1716573728) is the value that L-101
expects to see in the Acknowledgment number field of the SYN-ACK message from
S-11.
Internet Protocol Version
4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control
Protocol, Src Port: 20376, Dst Port: 179, Seq: 0, Len: 0
Source Port: 20376
Destination Port: 179
[Stream index: 1]
[TCP Segment Len: 0]
Sequence number: 0 (relative sequence number)
Sequence number (raw): 1716573727
[Next sequence number: 1 (relative sequence number)]
Acknowledgment number: 0
Acknowledgment number (raw): 0
1010 .... = Header Length: 40 bytes (10)
Flags: 0x002 (SYN)
Window size value: 29200
[Calculated window size: 29200]
Checksum: 0x3cca [unverified]
<snipped for brevity>
Capture 2-1: BGP Adjacency – TCP SYN from L-101 to S-11.
Capture 2-2 shows the SYN-ACK message sent by S-11. It uses TCP source port 179
while the destination port is set to 20376. S-11 sets its Acknowledgment number
field to the Sequence number that L-101 used in its SYN message increased by
one (1716573728), and it generates its own sequence number 1088502133.
Internet
Protocol Version 4, Src: 10.10.1.1, Dst: 10.10.1.0
Transmission
Control Protocol, Src Port: 179, Dst Port: 20376, Seq: 0, Ack: 1, Len: 0
Source Port: 179
Destination Port: 20376
[Stream index: 1]
[TCP Segment Len: 0]
Sequence number: 0 (relative sequence number)
Sequence number (raw): 1088502133
[Next sequence number: 1 (relative sequence number)]
Acknowledgment number: 1 (relative ack number)
Acknowledgment number (raw): 1716573728
1010 .... = Header Length: 40 bytes (10)
Flags: 0x012 (SYN, ACK)
Window size value: 28960
[Calculated window size: 28960]
Checksum: 0xd01c [unverified]
<snipped for brevity>
Capture 2-2: BGP Adjacency – TCP SYN-ACK from S-11 to L-101.
When L-101 receives the SYN-ACK from S-11, it finalizes the connection by
sending an ACK message back to S-11. As in the previous step, the
Acknowledgment number is the Sequence number received in the SYN-ACK increased
by one.
Internet
Protocol Version 4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission
Control Protocol, Src Port: 20376, Dst Port: 179, Seq: 1, Ack: 1, Len: 0
Source Port: 20376
Destination Port: 179
[Stream index: 1]
[TCP Segment Len: 0]
Sequence number: 1 (relative sequence number)
Sequence number (raw): 1716573728
[Next sequence number: 1 (relative sequence number)]
Acknowledgment number: 1 (relative ack number)
Acknowledgment number (raw): 1088502134
1000 .... = Header Length: 32 bytes (8)
Flags: 0x010 (ACK)
Window size value: 7300
[Calculated window size: 29200]
[Window size scaling factor: 4]
Checksum: 0x537e [unverified]
<snipped for brevity>
Capture 2-3: BGP Adjacency – TCP ACK from L-101 to S-11.
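The raw sequence and acknowledgment numbers in Captures 2-1 to 2-3 can be checked with simple arithmetic: a SYN consumes one sequence number, so the acknowledgment is the received sequence number plus one, modulo 2^32. The helper name expected_ack below is only chosen for this sketch.

```python
def expected_ack(seq: int, payload_len: int = 0, syn_or_fin: bool = True) -> int:
    """Acknowledgment number for a received segment.

    A SYN (or FIN) flag consumes one sequence number in addition to the payload.
    """
    return (seq + payload_len + (1 if syn_or_fin else 0)) % 2**32


L101_ISN = 1716573727  # raw sequence number in Capture 2-1
S11_ISN = 1088502133   # raw sequence number in Capture 2-2

# Matches "Acknowledgment number (raw)" in Capture 2-2 and Capture 2-3.
assert expected_ack(L101_ISN) == 1716573728
assert expected_ack(S11_ISN) == 1088502134
```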
OpenSent and OpenConfirm
After successful TCP negotiation, the BGP speakers exchange BGP OPEN messages
to ensure that they use the same version of BGP, that their BGP RIDs don’t
overlap, that the BGP AS number announced by the peer is the one configured for
it on the local BGP speaker, and that they support the same set of capabilities
(event 19, BGPOpen). They also compare their HoldTime values (which are also
used when calculating the KeepAliveTime) and choose the smaller one if the
values differ. If the checks pass, the state changes from OpenSent to
OpenConfirm, and the local system sends a KEEPALIVE message to the remote peer.
After receiving a KEEPALIVE message from the remote peer (event 26,
KeepAliveMsg), the local system finalizes the BGP connection and changes the
BGP state from OpenConfirm to Established.
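The HoldTime comparison can be sketched as below. The smaller of the two proposed values is used, as described above. The keepalive interval of one third of the hold time is the commonly used ratio suggested by RFC 4271 and is treated here as an assumption rather than a fixed rule; the function name negotiate_timers is made up for the example.

```python
from typing import Tuple


def negotiate_timers(local_hold: int, remote_hold: int) -> Tuple[int, int]:
    """Return (hold_time, keepalive_time) in seconds for a BGP session."""
    hold = min(local_hold, remote_hold)
    if hold == 0:
        # A hold time of zero means that no KEEPALIVE messages are exchanged.
        return 0, 0
    # Assumption: keepalive interval is one third of the negotiated hold time.
    return hold, hold // 3


print(negotiate_timers(180, 180))  # both peers propose 180 s, as in the captures
print(negotiate_timers(180, 90))   # the smaller value wins
```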
Established
In this state, the BGP peering is up and running, and the BGP neighbors can
send and receive UPDATE, KEEPALIVE, and NOTIFICATION messages. All received
UPDATE messages are validated. One example of the validation is the next-hop
check. The next-hop address must not be the receiving BGP speaker's own
address. If the peers are directly connected eBGP neighbors, the next-hop
address has to be either the sender's IP address that is used in the BGP
neighbor negotiation or an address from the same network segment as the
receiver's IP address (e.g. in the case of BGP redirect). The last rule is
relaxed in the Overlay Network of a BGP EVPN fabric, where spine switches are
configured to retain the original next-hop for the L2VPN EVPN AFI NLRIs
advertised within BGP Updates. The reason is that the next-hop is used as the
destination IP address in the VXLAN tunnel header, which the spine switches use
when they route packets between VTEPs.
Figure 2-3: BGP Adjacency – OpenSent, OpenConfirm, and Established.
Captures 2-4 to 2-9 show the BGP OPEN, KEEPALIVE, and UPDATE messages exchanged
between L-101 and S-11.
Internet Protocol Version
4, Src: 10.10.1.1, Dst: 10.10.1.0
Transmission Control
Protocol, Src Port: 179, Dst Port: 20376, Seq: 1, Ack: 1, Len: 70
<snipped>
Border Gateway Protocol -
OPEN Message
Marker: ffffffffffffffffffffffffffffffff
Length: 70
Type: OPEN Message (1)
Version: 4
My AS: 65100
Hold Time: 180
BGP Identifier: 192.168.0.11
Optional Parameters Length: 41
Optional Parameters
Capture 2-4: BGP Adjacency – BGP Open Message sent by S-11.
Internet Protocol Version
4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control
Protocol, Src Port: 20376, Dst Port: 179, Seq:1, Ack:71, Len: 70
<snipped>
Border Gateway Protocol -
OPEN Message
Marker: ffffffffffffffffffffffffffffffff
Length: 70
Type: OPEN Message (1)
Version: 4
My AS: 65101
Hold Time: 180
BGP Identifier: 192.168.0.101
Optional Parameters Length: 41
Optional Parameters
Capture 2-5: BGP Adjacency – BGP Open Message sent by L-101.
Internet Protocol Version
4, Src: 10.10.1.1, Dst: 10.10.1.0
Transmission Control
Protocol, Src Port: 179, Dst Port: 20376, Seq:71, Ack:90, Len: 19
<snipped>
Border Gateway Protocol -
KEEPALIVE Message
Marker: ffffffffffffffffffffffffffffffff
Length: 19
Type: KEEPALIVE Message (4)
Capture 2-6: BGP Adjacency – BGP KeepAlive Message sent by S-11.
Internet Protocol Version
4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control
Protocol, Src Port: 20376, Dst Port: 179, Seq:71, Ack:71, Len: 19
<snipped>
Border Gateway Protocol -
KEEPALIVE Message
Marker: ffffffffffffffffffffffffffffffff
Length: 19
Type: KEEPALIVE Message (4)
Capture 2-7: BGP Adjacency – BGP KeepAlive Message sent by L-101.
Transmission Control
Protocol, Src Port: 179, Dst Port: 20376, Seq:90, Ack:90, Len: 48
<snipped>
Border Gateway Protocol -
UPDATE Message
Marker: ffffffffffffffffffffffffffffffff
Length: 29
Type: UPDATE Message (2)
Withdrawn Routes Length: 0
Total Path Attribute Length: 6
Path attributes
Path Attribute - MP_UNREACH_NLRI
Flags: 0x80, Optional,
Non-transitive, Complete
Type Code: MP_UNREACH_NLRI (15)
Length: 3
Address family identifier (AFI):
IPv4 (1)
Subsequent address family
identifier (SAFI): Unicast (1)
Withdrawn routes (0 bytes)
Border Gateway Protocol -
KEEPALIVE Message
Marker: ffffffffffffffffffffffffffffffff
Length: 19
Type: KEEPALIVE Message (4)
Capture 2-8: BGP Adjacency – BGP Update and KeepAlive Message sent by S-11.
Internet Protocol Version
4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control
Protocol, Src Port:20376, Dst Port: 179, Seq:90, Ack:138, Len:109
<snipped>
Border Gateway Protocol -
UPDATE Message
Marker: ffffffffffffffffffffffffffffffff
Length: 61
Type: UPDATE Message (2)
Withdrawn Routes Length: 0
Total Path Attribute Length: 38
Path attributes
Path Attribute - ORIGIN: INCOMPLETE
Flags: 0x40, Transitive,
Well-known, Complete
Type Code: ORIGIN (1)
Length: 1
Origin: INCOMPLETE (2)
Path Attribute - AS_PATH: 65101
Flags: 0x40, Transitive,
Well-known, Complete
Type Code: AS_PATH (2)
Length: 6
AS Path segment: 65101
Path Attribute - MULTI_EXIT_DISC: 0
Flags: 0x80, Optional,
Non-transitive, Complete
Type Code: MULTI_EXIT_DISC (4)
Length: 4
Multiple exit discriminator: 0
Path Attribute - MP_REACH_NLRI
Flags: 0x90, Optional,
Extended-Length, Non-transitive, Complete
Type Code: MP_REACH_NLRI (14)
Length: 14
Address family identifier (AFI):
IPv4 (1)
Subsequent address family identifier
(SAFI): Unicast (1)
Next hop network address (4 bytes)
Number of Subnetwork points of
attachment (SNPA): 0
Network layer reachability
information (5 bytes)
192.168.31.101/32
MP Reach NLRI prefix
length: 32
MP Reach NLRI IPv4 prefix:
192.168.31.101
Border Gateway Protocol -
UPDATE Message
Marker: ffffffffffffffffffffffffffffffff
Length: 29
Type: UPDATE Message (2)
Withdrawn Routes Length: 0
Total Path Attribute Length: 6
Path attributes
Path Attribute - MP_UNREACH_NLRI
Flags: 0x80, Optional,
Non-transitive, Complete
Type Code: MP_UNREACH_NLRI (15)
Length: 3
Address family identifier (AFI): IPv4
(1)
Subsequent address family
identifier (SAFI): Unicast (1)
Withdrawn routes (0 bytes)
Border Gateway Protocol -
KEEPALIVE Message
Marker: ffffffffffffffffffffffffffffffff
Length: 19
Type: KEEPALIVE Message (4)
Capture 2-9: BGP Adjacency – BGP Update and KeepAlive Message sent by L-101.
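Every BGP message in the captures above starts with the same 19-byte header: a 16-byte Marker, a 2-byte Length, and a 1-byte Type. The following sketch parses that header and splits one TCP payload into its BGP messages, which is why a single segment can carry an UPDATE and a KEEPALIVE as in Capture 2-8. The function names are chosen for this illustration, and the test payload is rebuilt by hand from the field values shown in the capture.

```python
import struct

MARKER = b"\xff" * 16
TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def parse_header(data: bytes):
    """Return (length, type name) from the 19-byte BGP message header."""
    marker, length, msg_type = struct.unpack("!16sHB", data[:19])
    if marker != MARKER or length < 19:
        raise ValueError("not a valid BGP header")
    return length, TYPES.get(msg_type, "UNKNOWN")


def split_messages(payload: bytes):
    """Split one TCP payload into a list of (type name, length) tuples."""
    messages, offset = [], 0
    while offset < len(payload):
        length, name = parse_header(payload[offset:offset + 19])
        messages.append((name, length))
        offset += length
    return messages


# KEEPALIVE: header only (Length 19, Type 4).
keepalive = MARKER + struct.pack("!HB", 19, 4)
# UPDATE with an empty MP_UNREACH_NLRI for AFI 1 / SAFI 1 (Length 29, Type 2).
update = (MARKER + struct.pack("!HB", 29, 2)
          + struct.pack("!HH", 0, 6)                       # withdrawn len, attr len
          + bytes([0x80, 15, 3]) + struct.pack("!HB", 1, 1))  # flags, type, len, AFI, SAFI

segment = update + keepalive
print(len(segment), split_messages(segment))
```

The 48-byte payload corresponds to the TCP Len: 48 value shown in Capture 2-8.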
BGP NLRI Update Process
Figure 2-4 illustrates how BGP Network Layer Reachability Information (NLRI)
about L-101 Loopback 31 is propagated from L-101 to S-11.
RIB to Adj-RIB-Out (Pre-Policy)
The IP address 192.168.31.101/32 of Loopback 31 is redistributed from the RIB
to the BGP Loc-RIB using a route-map. When the route is redistributed from the
RIB to the Loc-RIB, the route itself is encoded as an MP_REACH_NLRI Path
Attribute together with other IPv4 Unicast peer-specific BGP Path Attributes,
such as ORIGIN, AS_PATH, and MED. The IPv4 address 192.168.31.101/32 is then
sent from the Loc-RIB to the Adj-RIB-Out (Pre-Policy) of all BGP peers with
which L-101 has an IPv4 Unicast (AFI 1/SAFI 1) peering and which are either
eBGP, iBGP, RR-Client, or Confederation peers. The reason for mentioning all
four peer types is that in some cases an IPv4 Unicast NLRI is not sent to the
Adj-RIB-Out (Pre-Policy) even though the peering is IPv4 Unicast. One simple
example is that an NLRI received from an iBGP peer is not advertised to other
iBGP peers. Some implementations also do not forward NLRIs from one eBGP peer
to another eBGP peer if the AS_PATH Path Attribute in the ingress BGP Update
includes the BGP AS number of the egress eBGP peer. BGP Path Attributes are not
modified when programmed from the Loc-RIB into the Adj-RIB-Out (Pre-Policy). In
our example, L-101 sends the information to the Adj-RIB-Out (Pre) of the eBGP
peers S-11 and S-12 (not shown in the figure).
Adj-RIB-Out (Pre) to Adj-RIB-Out (Post)
The Adj-RIB-Out (Pre-Policy) equals the Loc-RIB, but it only includes the
routes that are eligible for each neighbor. L-101 sends the NLRI about
192.168.31.101/32 from the Adj-RIB-Out (Pre-Policy) to the Adj-RIB-Out
(Post-Policy) through the BGP Policy Engine (Outbound Policy). During this
process the NLRI itself might be included in an aggregate address, its BGP Path
Attributes might be modified, or new Path Attributes could be added (e.g.
communities). The NLRI might even be filtered out from the BGP Update. In our
case, L-101 doesn’t modify or filter routes in any way, so both Adj-RIB-Out
tables are equal. The Next-Hop address in the MP_REACH_NLRI Path Attribute is
set to the IP address of the egress interface when L-101 sends the BGP Update
to S-11.
Adj-RIB-In (Pre) to Adj-RIB-In (Post)
When S-11 receives the BGP Update about 192.168.31.101/32 from L-101, it
installs the NLRI into the Adj-RIB-In (Pre-Policy) without any modification.
Then the NLRI is sent through the Policy Engine (Inbound Policy) to the
Adj-RIB-In (Post-Policy). During this process, there might be some
modifications, such as setting Local-Preference or Weight, or filtering based
on some Path Attribute.
Adj-RIB-In (Post) to Loc-RIB
The Loc-RIB contains the NLRIs received from the Adj-RIB-In (Post-Policy). The
same NLRI might be received from several BGP peers; all of them are installed
into the Loc-RIB, but only one of them is selected as the best route. In the
case of BGP ECMP, several BGP peers might be installed as next-hops
(multipathing), and traffic to the destination is load-balanced per flow
between the next-hops. The BGP Best Path Selection process considers only NLRIs
that have a valid Next-Hop (found in the RIB) and that don’t have the local AS
in their AS_PATH Path Attribute. The latter is the BGP loop prevention
mechanism. After selecting the eligible NLRIs, the best path selection process
compares the Path Attributes of each NLRI in this order: (1) highest Weight,
(2) highest Local-Preference, (3) locally originated prefixes, (4) shortest
AS_PATH, (5) lowest Origin type (IGP < EGP < Incomplete), and (6) lowest MED.
If all of these Path Attributes are equal, the decision process prefers (7)
eBGP over iBGP, then (8) the smallest IGP metric to the Next-Hop, and as the
last step the path through the neighbor that has the lowest BGP RID. From the
S-11 perspective, there is only one path to 192.168.31.101/32, via L-101.
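The comparison order listed above maps naturally to a sort key. The sketch below is a simplified model, not the algorithm of any particular implementation: the Path class and its field names are invented for the example, next-hop validation is assumed to have been done already, and additional tie-breakers that real implementations use (such as preferring the older eBGP path, visible as "newer EBGP path" in the show outputs of this chapter) are left out.

```python
from dataclasses import dataclass, field
from ipaddress import IPv4Address
from typing import List, Optional

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Path:
    neighbor_rid: str
    as_path: List[int] = field(default_factory=list)
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    origin: str = "incomplete"
    med: int = 0
    is_ebgp: bool = True
    igp_metric: int = 0


def preference_key(p: Path):
    """Smaller tuple = more preferred; the order follows the list in the text."""
    return (
        -p.weight,                         # (1) highest Weight
        -p.local_pref,                     # (2) highest Local-Preference
        not p.locally_originated,          # (3) locally originated first
        len(p.as_path),                    # (4) shortest AS_PATH
        ORIGIN_RANK[p.origin],             # (5) IGP < EGP < Incomplete
        p.med,                             # (6) lowest MED
        not p.is_ebgp,                     # (7) eBGP over iBGP
        p.igp_metric,                      # (8) lowest IGP metric to Next-Hop
        int(IPv4Address(p.neighbor_rid)),  # last step: lowest BGP RID
    )


def best_path(paths: List[Path], local_as: int) -> Optional[Path]:
    # Loop prevention: ignore paths that already contain the local AS.
    candidates = [p for p in paths if local_as not in p.as_path]
    return min(candidates, key=preference_key) if candidates else None


# SS-1 (AS 65001) comparing a normal path and a prepended path.
via_s12 = Path("192.168.0.12", as_path=[65100, 65101])
via_s11 = Path("192.168.0.11", as_path=[65100, 65100, 65100, 65101])
print(best_path([via_s11, via_s12], local_as=65001).neighbor_rid)
```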
Loc-RIB to RIB
This process is simple: routes installed into the Loc-RIB are installed into
the RIB if there is no better route source for the specific route. Note! BGP
does not flood BGP Updates to adjacent BGP peers. Instead, it constructs BGP
Updates only from routes installed into its RIB, no matter how they ended up
there. This means that ingress BGP Updates are processed as described in this
section and then reconstructed when they are sent to adjacent BGP speakers.
Figure 2-4: BGP NLRI Advertisement Process.
Example 2-4 shows the RIB entry about 192.168.31.101 in L-101.
L-101# sh ip route 192.168.31.101 | sec 192
192.168.31.101/32,
ubest/mbest: 2/0, attached
*via 192.168.31.101, Lo30, [0/0], 01:43:30,
local
*via 192.168.31.101, Lo30, [0/0], 01:43:30,
direct
Example 2-4: The RIB of L-101.
Example 2-5 shows the BGP Loc-RIB entry for the same IP address. The NLRI is
advertised to both spine switches: S-11 (10.10.1.1) and S-12 (10.10.1.3).
L-101# sh ip bgp 192.168.31.101
BGP routing table
information for VRF default, address family IPv4 Unicast
BGP routing table entry
for 192.168.31.101/32, version 2
Paths: (1 available, best
#1)
Flags: (0x080002) (high32
00000000) on xmit-list, is not in urib
Multipath: eBGP
Advertised path-id 1
Path type: redist, path is valid, is best
path, no labeled nexthop
AS-Path: NONE, path locally originated
0.0.0.0 (metric 0) from 0.0.0.0
(192.168.0.101)
Origin incomplete, MED 0, localpref 100,
weight 32768
Path-id 1 advertised to peers:
10.10.1.1 10.10.1.3
Example 2-5: The BGP Loc-RIB of L-101.
Capture 2-10 illustrates the BGP Update sent by L-101 to S-11. The ORIGIN Path
Attribute is Incomplete because the route is redistributed into BGP. The
AS_PATH Path Attribute is set to 65101. The MULTI_EXIT_DISC Path Attribute is
set to zero. The MP_REACH_NLRI Path Attribute carries the actual routing
information. It describes the destination network/host IP address and its
next-hop. Because BGP runs over TCP, all messages are also expected to be
acknowledged by the receiver.
Internet Protocol Version
4, Src: 10.10.1.0, Dst: 10.10.1.1
Transmission Control
Protocol, Src Port:29243, Dst Port:179, Seq:90, Ack:328, Len: 109
Border Gateway Protocol -
UPDATE Message
Marker: ffffffffffffffffffffffffffffffff
Length: 61
Type: UPDATE Message (2)
Withdrawn Routes Length: 0
Total Path Attribute Length: 38
Path attributes
Path Attribute - ORIGIN: INCOMPLETE
Flags: 0x40, Transitive,
Well-known, Complete
Type Code: ORIGIN (1)
Length: 1
Origin: INCOMPLETE (2)
Path Attribute - AS_PATH: 65101
Flags: 0x40, Transitive,
Well-known, Complete
Type Code: AS_PATH (2)
Length: 6
AS Path segment: 65101
Path Attribute - MULTI_EXIT_DISC: 0
Flags: 0x80, Optional,
Non-transitive, Complete
Type Code: MULTI_EXIT_DISC (4)
Length: 4
Multiple exit discriminator: 0
Path Attribute - MP_REACH_NLRI
Flags: 0x90, Optional, Extended-Length,
Non-transitive, Complete
Type Code: MP_REACH_NLRI (14)
Length: 14
Address family identifier (AFI):
IPv4 (1)
Subsequent address family
identifier (SAFI): Unicast (1)
Next hop network address (4 bytes)
Next Hop: 10.10.1.0
Number of Subnetwork points of
attachment (SNPA): 0
Network layer reachability
information (5 bytes)
192.168.31.101/32
Capture 2-10: BGP Update Message sent by L-101.
Example 2-6 shows that S-11 has installed the NLRI in its Loc-RIB. Note that
the Weight is set to zero by default, while in the originating router L-101 it
was set to 32768. The NLRI is also advertised to all other eBGP peers of S-11:
L-102 (10.10.1.4), SS-1 (10.10.10.1), and SS-2 (10.10.10.3).
S-11# sh ip bgp 192.168.31.101
BGP routing table
information for VRF default, address family IPv4 Unicast
BGP routing table entry
for 192.168.31.101/32, version 9
Paths: (1 available, best
#1)
Flags: (0x8008001a) (high32 00000000) on xmit-list, is in urib, is best urib route, is in HW
Multipath: eBGP
Advertised path-id 1
Path type: external, path is valid, is best
path, no labeled nexthop, in rib
AS-Path: 65101 , path sourced external to AS
10.10.1.0 (metric 0) from 10.10.1.0
(192.168.0.101)
Origin incomplete, MED 0, localpref 100,
weight 0
Path-id 1 advertised to peers:
10.10.1.4 10.10.10.1 10.10.10.3
Example 2-6: BGP table entry about 192.168.31.101 in the Loc-RIB of S-11.
Example 2-7 shows that S-11 has installed the routing information about
192.168.31.101/32 into the RIB.
S-11# sh ip route 192.168.31.101
IP Route Table for VRF
"default"
'*' denotes best ucast next-hop
'**' denotes best mcast
next-hop
'[x/y]' denotes
[preference/metric]
'%<string>' in via
output denotes VRF <string>
192.168.31.101/32,
ubest/mbest: 1/0
*via 10.10.1.0, [20/0], 00:05:06,
bgp-65100, external, tag 65101
Example 2-7: RIB entry about 192.168.31.101 in the RIB of S-11.
Example 2-8 shows that L-202, located in the remote Pod-2, has two equal-cost
paths to the destination 192.168.31.101/32.
L-202# sh ip route 192.168.31.101
IP Route Table for VRF
"default"
'*' denotes best ucast
next-hop
'**' denotes best mcast
next-hop
'[x/y]' denotes
[preference/metric]
'%<string>' in via
output denotes VRF <string>
192.168.31.101/32,
ubest/mbest: 2/0
*via 10.10.2.5, [20/0], 02:17:03,
bgp-65202, external, tag 65200
*via 10.10.2.7, [20/0], 02:17:03,
bgp-65202, external, tag 65200
Example 2-8: RIB entry about 192.168.31.101 in the RIB of L-202.
BGP Update: Unreachable Destination
Figure 2-5 illustrates an inter-switch link failure between S-11 and L-101. The
reaction time depends on the type of link failure and on how fast S-11 notices
and reacts to it. If the failure affects, for example, only the transmit side
of a fiber pair, or is some odd SFP failure, and the reaction is based on the
Hold and Keepalive timers, it can take up to the Hold Time (180 seconds in the
OPEN messages shown earlier) before the failure is detected. However, with
Bidirectional Forwarding Detection (BFD) between the BGP neighbors, the failure
is noticed almost immediately. Using BFD is the best practice. When S-11
notices that the link over which the route to 192.168.31.101/32 was received is
down, it checks from the BGP Loc-RIB whether the destination is available
through some other BGP peer. In our case, L-101 was the only route source, so
the BGP process notifies the RIB to remove the route learned from BGP. When the
route is removed from the RIB and the Loc-RIB, S-11 withdraws the route from
all of its peers. When the BGP neighbors receive the BGP Update with the
MP_UNREACH_NLRI Path Attribute, they remove the destination described in the
“Withdrawn routes” field.
Figure 2-5: S-11 Reaction to Link-Failure Event.
Capture 2-11 shows the BGP Update message with the MP_UNREACH_NLRI Path
Attribute sent by S-11 to SS-1.
Internet
Protocol Version 4, Src: 10.10.10.0, Dst: 10.10.10.1
Transmission
Control Protocol, Src Port: 179, Dst Port: 21728, Seq: 20, Ack: 1, Len:35
Border
Gateway Protocol - UPDATE Message
Marker: ffffffffffffffffffffffffffffffff
Length: 35
Type: UPDATE Message (2)
Withdrawn Routes Length: 0
Total Path Attribute Length: 12
Path attributes
Path Attribute - MP_UNREACH_NLRI
Flags: 0x90, Optional,
Extended-Length, Non-transitive, Complete
Type Code: MP_UNREACH_NLRI (15)
Length: 8
Address family identifier (AFI):
IPv4 (1)
Subsequent address family identifier
(SAFI): Unicast (1)
Withdrawn routes (5 bytes)
192.168.31.101/32
MP Unreach NLRI prefix
length: 32
MP Unreach NLRI IPv4
prefix: 192.168.31.101
Capture 2-11: The IP address 192.168.31.101/32 withdrawn by S-11.
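The length fields in Capture 2-11 can be reproduced by building the same withdraw message by hand. The sketch below encodes a single IPv4 Unicast prefix in an MP_UNREACH_NLRI Path Attribute with the Extended-Length flag set, as in the capture. The function names are invented for the illustration, and only this one message layout is covered.

```python
import struct
from ipaddress import ip_network


def encode_prefix(prefix: str) -> bytes:
    """Encode a prefix as <length in bits><minimum number of address bytes>."""
    net = ip_network(prefix)
    nbytes = (net.prefixlen + 7) // 8
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]


def build_withdraw(prefix: str) -> bytes:
    """Build a BGP UPDATE that withdraws one IPv4 Unicast prefix."""
    value = struct.pack("!HB", 1, 1) + encode_prefix(prefix)  # AFI 1, SAFI 1, prefix
    # Flags 0x90 (Optional, Extended-Length), Type 15, 2-byte attribute length.
    attr = struct.pack("!BBH", 0x90, 15, len(value)) + value
    # Withdrawn Routes Length (0) + Total Path Attribute Length + attributes.
    body = struct.pack("!HH", 0, len(attr)) + attr
    header = b"\xff" * 16 + struct.pack("!HB", 19 + len(body), 2)
    return header + body


msg = build_withdraw("192.168.31.101/32")
total_attr_len = struct.unpack("!H", msg[21:23])[0]
print(len(msg), total_attr_len)  # 35 and 12, as in Capture 2-11
```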
MRAI Timer
The MinRouteAdvertisementInterval (MRAI) timer defines how often an NLRI
advertisement or withdrawal about the same destination, received from one peer,
can be sent on to another peer. The timer is peer-specific. RFC 4271 suggests a
shorter default value for iBGP (5 seconds) than for eBGP (30 seconds) because
faster convergence is needed inside an AS. However, in a modern datacenter,
MRAI should be set to zero for both peering types [BGP MRAI]. Nexus switches
use the MRAI value zero by default and it can’t be changed.
BGP AS-Path Prepend
There are several options for removing a BGP speaker from the data path in a
controlled manner: the neighbors can be put in shutdown mode, the BGP process
can be isolated, or the AS-Path advertised by the router can be prepended. This
section focuses on the AS-Path prepending solution. Example 2-9 shows that in
the normal situation SS-1 has two equal paths to 192.168.31.101/32.
SS-1# sh ip bgp 192.168.31.101
BGP routing table
information for VRF default, address family IPv4 Unicast
BGP routing table entry
for 192.168.31.101/32, version 12
Paths: (2 available, best
#1)
Flags: (0x8008001a)
(high32 00000000) on xmit-list, is in urib, is best urib route, is in HW
Multipath: eBGP
Advertised path-id 1
Path type: external, path is valid, is best
path, no labeled nexthop, in rib
AS-Path: 65100 65101 , path sourced external
to AS
10.10.10.4 (metric 0) from 10.10.10.4
(192.168.0.12)
Origin incomplete, MED not set, localpref
100, weight 0
Path type: external, path is valid, not best
reason: newer EBGP path, multipath, no
labeled nexthop, in rib
AS-Path: 65100 65101 , path sourced external
to AS
10.10.10.0 (metric 0) from 10.10.10.0
(192.168.0.11)
Origin incomplete, MED not set, localpref
100, weight 0
Path-id 1 advertised to peers:
10.10.20.0 10.10.20.4
Example 2-9: BGP Loc-RIB entry about 192.168.31.101 in SS-1.
Both routes are also installed into the RIB.
SS-1# sh ip route 192.168.31.101
IP Route Table for VRF
"default"
'*' denotes best ucast
next-hop
'**' denotes best mcast
next-hop
'[x/y]' denotes
[preference/metric]
'%<string>' in via
output denotes VRF <string>
192.168.31.101/32,
ubest/mbest: 2/0
*via 10.10.10.0, [20/0], 00:01:01,
bgp-65001, external, tag 65100
*via 10.10.10.4, [20/0], 00:19:16,
bgp-65001, external, tag 65100
Example 2-10: RIB of SS-1.
Example 2-11 shows the AS-Path prepend related configuration on S-11.
route-map AS-PATH-PREP-GIR permit 10
  set as-path prepend 65100 65100
!
router bgp 65100
  router-id 192.168.0.11
  bestpath as-path multipath-relax
  address-family ipv4 unicast
    maximum-paths 8
  neighbor 10.10.0.0/16 remote-as route-map Dynamic-BGP-AS-List
    address-family ipv4 unicast
      route-map AS-PATH-PREP-GIR out
Example 2-11: AS-Path Prepend Configuration on S-11.
In reaction to the configuration, S-11 generates new BGP Updates about all of
its BGP-learned routes with 65100 65100 prepended to the AS-Path attribute.
Figure 2-6: AS-Path prepend on S-11.
Now the route received from S-11 has AS-Path 65100 65100 65100 65101, while the
route received from S-12 has AS-Path 65100 65101. The latter is selected as the
best path due to its shorter AS-Path.
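The resulting AS-Paths can be reproduced with a short calculation: the outbound route-map prepends 65100 twice, and the normal eBGP advertisement then adds the local AS once more. The function name advertise_ebgp is an assumption made for this sketch.

```python
from typing import List


def advertise_ebgp(as_path: List[int], local_as: int, prepend_count: int = 0) -> List[int]:
    """AS-Path as seen by an eBGP peer after optional prepending."""
    prepended = [local_as] * prepend_count + list(as_path)  # outbound route-map
    return [local_as] + prepended                           # normal eBGP behavior


learned_from_l101 = [65101]
via_s11 = advertise_ebgp(learned_from_l101, 65100, prepend_count=2)
via_s12 = advertise_ebgp(learned_from_l101, 65100)

print(via_s11)  # [65100, 65100, 65100, 65101]
print(via_s12)  # [65100, 65101]
print("best:", "S-12" if len(via_s12) < len(via_s11) else "S-11")
```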
SS-1# sh ip bgp 192.168.31.101
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 192.168.31.101/32, version 14
Paths: (2 available, best #1)
Flags: (0x8008001a) (high32 00000000) on xmit-list, is in urib, is best urib route, is in HW
Multipath: eBGP
  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop, in rib
  AS-Path: 65100 65101 , path sourced external to AS
    10.10.10.4 (metric 0) from 10.10.10.4 (192.168.0.12)
      Origin incomplete, MED not set, localpref 100, weight 0
  Path type: external, path is valid, not best reason: AS Path, no labeled nexthop
  AS-Path: 65100 65100 65100 65101 , path sourced external to AS
    10.10.10.0 (metric 0) from 10.10.10.0 (192.168.0.11)
      Origin incomplete, MED not set, localpref 100, weight 0
  Path-id 1 advertised to peers:
    10.10.20.0 10.10.20.4
Example 2-12: BGP Loc-RIB about 192.168.31.101 in SS-1.
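The way the longer AS-Path drops one path out of the multipath set can be sketched as follows. This is a simplified model under stated assumptions: every attribute other than AS-Path length is treated as equal, paths of equal length are multipath-eligible (as with `bestpath as-path multipath-relax`), and the `Path` class, the `select_paths` function, and the `max_paths` default of 8 are hypothetical illustration names and values, not taken from the SS-1 configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str
    as_path: tuple


def select_paths(paths, max_paths=8):
    """Keep only the paths with the shortest AS-Path, up to max_paths.

    Simplification: all other path attributes are assumed to be equal.
    """
    shortest = min(len(p.as_path) for p in paths)
    eligible = [p for p in paths if len(p.as_path) == shortest]
    return eligible[:max_paths]


# Before the prepend: both Spines advertise the same AS-Path length.
before = [
    Path("10.10.10.0", (65100, 65101)),
    Path("10.10.10.4", (65100, 65101)),
]
# After the prepend on S-11: the path via 10.10.10.0 is longer.
after = [
    Path("10.10.10.0", (65100, 65100, 65100, 65101)),
    Path("10.10.10.4", (65100, 65101)),
]

print([p.next_hop for p in select_paths(before)])  # ['10.10.10.0', '10.10.10.4']
print([p.next_hop for p in select_paths(after)])   # ['10.10.10.4']
```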
The route via S-11 is removed from the RIB, and the IP address 192.168.31.101 is
now reachable only through S-12.
SS-1# sh ip route 192.168.31.101
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>
192.168.31.101/32, ubest/mbest: 1/0
    *via 10.10.10.4, [20/0], 00:14:59, bgp-65001, external, tag 65100
Example 2-13: RIB about 192.168.31.101 in SS-1.
Example 2-14 shows that L-202 in Pod-2 still uses ECMP toward the destination
because S-21 and S-22 receive the update from both Super Spines. L-202 therefore
spreads the traffic to 192.168.31.101 over two uplinks, while only one Spine (S-12)
forwards it toward the destination. This gives a 1:2 oversubscription ratio, which
means that in the worst case, with both uplinks fully loaded, there might be 50%
packet loss. With three switches in the Spine and Super Spine layers, the ratio
improves to 2:3, which limits the worst-case packet loss to about 33%.
L-202# sh ip bgp 192.168.31.101
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 192.168.31.101/32, version 7
Paths: (2 available, best #2)
Flags: (0x8008001a) (high32 00000000) on xmit-list, is in urib, is best urib route, is in HW
Multipath: eBGP
  Path type: external, path is valid, not best reason: newer EBGP path, multipath, no labeled nexthop, in rib
  AS-Path: 65200 65001 65100 65101 , path sourced external to AS
    10.10.2.7 (metric 0) from 10.10.2.7 (192.168.0.22)
      Origin incomplete, MED not set, localpref 100, weight 0
  Advertised path-id 1
  Path type: external, path is valid, is best path, no labeled nexthop, in rib
  AS-Path: 65200 65001 65100 65101 , path sourced external to AS
    10.10.2.5 (metric 0) from 10.10.2.5 (192.168.0.21)
      Origin incomplete, MED not set, localpref 100, weight 0
  Path-id 1 not advertised to any peer
Example 2-14: BGP Loc-RIB about 192.168.31.101 in L-202.
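The 50% and 33% figures mentioned before Example 2-14 follow from simple arithmetic. The sketch below assumes equal-capacity links that are all fully loaded, which is the worst case; the function name `worst_case_loss` and its parameters are hypothetical.

```python
def worst_case_loss(total_paths, paths_in_maintenance=1):
    """Fraction of traffic lost when fully loaded ECMP paths funnel into
    the paths that remain after some switches are taken out of service.

    Assumes equal-capacity links and full load on every ingress path.
    """
    remaining = total_paths - paths_in_maintenance
    if total_paths <= 0 or remaining < 0 or paths_in_maintenance < 0:
        raise ValueError("invalid path counts")
    return 1 - remaining / total_paths


print(f"{worst_case_loss(2):.0%}")  # 50%
print(f"{worst_case_loss(3):.0%}")  # 33%
```

Under lighter load the loss is smaller or zero; the function gives only the upper bound for the stated assumptions.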
OSPF or BGP In Underlay?
To answer the title question, we need to think about the intent of an Underlay
Network. It should offer reliable IP connectivity between VTEP switches. By
reliable I mean (a) enough redundant bandwidth (ECMP), (b) fast failure detection
and recovery, and (c) non-disruptive maintenance work. Both OSPF as a Link-State
protocol and BGP as a Path-Vector protocol fulfill these requirements, so in that
sense, the answer to the question is “both are suitable for the Underlay Network”.
If that answer is not good enough, we can look for a tiebreaker by comparing the
properties and operations of the protocols shown in Table 2-1. OSPF runs directly
over IP and does not use a transport layer protocol as BGP does, which removes one
layer of complexity from OSPF. Both protocols have a reliable, somewhat complex
adjacency process. In my opinion, the BGP adjacency process is a bit more complex
because of the TCP three-way handshake. OSPF routers within an area have a
synchronized LSDB, and each of them independently calculates the best paths based
on the metric. BGP, in turn, trusts the NLRI information received from an adjacent
BGP peer and selects the best path using a 13-step comparison based on the Path
Attributes carried within the BGP Update. In that sense, the BGP routing decision
is based on administrative policy rather than metrics. Note that RFC 7311 describes
how the IGP metric can also be carried within the BGP Update. There is also a
draft, “draft-ietf-lsvr-bgp-spf-09” [LSVR-BGP], that describes how Link-State
distribution and the SPF algorithm can be used with BGP.
The convergence process of BGP is simpler than that of OSPF. In case of a link
failure, a BGP withdrawal does not affect the whole fabric as it does in a
single-area OSPF design. Both OSPF and BGP use reliable information exchange: OSPF
LSAs are acknowledged, and because BGP uses TCP as its transport protocol, all its
messages, including BGP Updates, are acknowledged by the adjacent BGP speaker. The
biggest concern about OSPF is its flooding process, where Link-State information is
re-flooded when the Link-State Age reaches 1800 seconds. LSA Group Pacing and the
area structure alleviate this, so the flooding is not a problem.
Table 2-1: OSPF and BGP comparison.
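The point that BGP selects paths by policy rather than by metric can be illustrated with a small sketch. It models only a simplified subset of the comparison steps (weight, local preference, AS-Path length, origin, MED), in the order commonly documented for Cisco platforms; the actual step list and its order are vendor-specific. The `BgpPath` class, the `preference_key` and `best_path` functions, and treating an unset MED as 0 are assumptions made for illustration.

```python
from dataclasses import dataclass

# Lower rank is preferred: IGP over EGP over incomplete.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass(frozen=True)
class BgpPath:
    next_hop: str
    as_path: tuple
    weight: int = 0
    local_pref: int = 100
    origin: str = "incomplete"
    med: int = 0  # assumption: an unset MED is treated as 0


def preference_key(path):
    """Sort key for a simplified subset of the best-path comparison.

    Higher weight and local preference win, so they are negated;
    a shorter AS-Path, a lower origin rank, and a lower MED win.
    """
    return (
        -path.weight,
        -path.local_pref,
        len(path.as_path),
        ORIGIN_RANK[path.origin],
        path.med,
    )


def best_path(paths):
    return min(paths, key=preference_key)


short = BgpPath("10.10.10.4", (65100, 65101))
long_but_preferred = BgpPath(
    "10.10.10.0", (65100, 65100, 65100, 65101), local_pref=200
)

# Policy (local preference) outranks the shorter AS-Path.
print(best_path([short, long_but_preferred]).next_hop)  # 10.10.10.0
```

With equal local preference, the same function would fall through to the AS-Path length and pick 10.10.10.4, which matches the behavior seen during the maintenance example.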
References
[RFC 4271] Y. Rekhter et al., “A Border Gateway Protocol 4 (BGP-4)”, RFC 4271,
January 2006.
[RFC 5549] F. Le Faucheur and E. Rosen, “Advertising IPv4 Network Layer
Reachability Information with an IPv6 Next Hop”, RFC 5549, May 2009.
[RFC 7311] P. Mohapatra et al., “The Accumulated IGP Metric Attribute for
BGP”, RFC 7311, August 2014.
[RFC 7938] P. Lapukhov et al., “Use of BGP for Routing in Large-Scale Data
Centers”, RFC 7938, August 2016.
[RFC 8671] T. Evens et al., “Support for Adj-RIB-Out in the BGP Monitoring Protocol
(BMP)”, RFC 8671, November 2019.
[BGP-MRAI] P. Jakma, “Revisions to the BGP ‘Minimum Route Advertisement
Interval’”, draft-ietf-idr-mrai-dep-04, September 20, 2011.
[LSVR-BGP] K. Patel et al., “Shortest Path Routing Extensions for BGP Protocol”,
draft-ietf-lsvr-bgp-spf-09, May 15, 2020.