The Network Times: VXLAN Part IV: The Underlay Network – Multidestination Traffic: PIM BiDir

My last post, VXLAN Part III, introduced the VXLAN fabric L2VNI service with Anycast-RP PIM (RFC 4610 and RFC 7761). In this chapter, I will show how PIM BiDir (RFC 5015) with Phantom-RP can be used for the same purpose. I will use configurations, show commands and Wireshark captures to explain the theory.

Figure 1: Example VIRL topology

Configuration

I am going to use the same topology that I used in the Anycast-RP lab, running on Cisco VIRL. We have two Leaf switches (101 and 102) and two Spine switches (Spine-11 and Spine-12). Over the next couple of pages I am going to implement PIM BiDir with Phantom-RP in a preconfigured VXLAN fabric (the Underlay and IP addressing configurations, as well as the basic NVE 1 interface configuration, can be found in Part II).

Here is our task list:

1) Spine-11: Configure the Loopback interface 238 and assign the IP address 192.168.238.6 with mask /29. Attach the interface to OSPF area 0.0.0.0 and enable PIM-SM on it.

2) Spine-12: Configure the Loopback interface 238 with the same IP address but use mask /28. Attach the interface to OSPF area 0.0.0.0 and enable PIM-SM on it.

3) All switches: Define the IP address 192.168.238.1 as the RP for multicast groups 238.0.0.0/24.

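Tasks 1 and 2: Configure the loopback address 192.168.238.6 on both Spine switches. Use mask /29 on Spine-11 and mask /28 on Spine-12. Attach both interfaces to OSPF area 0.0.0.0 and enable PIM-SM. Note that the OSPF network type has to be point-to-point, otherwise the loopback IP will be advertised as a host route with mask /32. The RP address itself (192.168.238.1) is not assigned to any interface; it only has to fall inside the advertised subnets, which is why it is called a Phantom-RP.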

Spine-11 loopback interface:

interface loopback238
  description ** random IP in Phantom-RP network **
  ip address 192.168.238.6/29
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode

Spine-12 loopback interface:

interface loopback238
  description ** random IP in Phantom-RP network **
  ip address 192.168.238.6/28
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode

Now if we take a look at the RIB of VTEP-101, we can see that both networks are installed in it.

Leaf-101# sh ip route | b 192.168.238
192.168.238.0/28, ubest/mbest: 1/0
    *via 192.168.0.12, Eth1/2, [110/41], 00:40:14, ospf-UNDERLAY-NET, intra
192.168.238.0/29, ubest/mbest: 1/0
    *via 192.168.0.11, Eth1/1, [110/41], 00:40:03, ospf-UNDERLAY-NET, intra

Because the /29 route is more specific (longest match wins), VTEP-101 will use Spine-11 as the next hop for 192.168.238.1 (our Phantom-RP).

Leaf-101# sh ip route 192.168.238.1
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.238.0/29, ubest/mbest: 1/0
    *via 192.168.0.11, Eth1/1, [110/41], 00:38:37, ospf-UNDERLAY-NET, intra
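The route selection seen in the output above can be reproduced with a small Python sketch. It is only an illustration of longest-prefix matching, not how NX-OS implements its RIB lookup; the `ROUTES` table and the function name are assumptions made for this example, with the prefixes and next hops copied from the Leaf-101 output.

```python
import ipaddress

# Routes as seen on Leaf-101 (prefix -> next hop), copied from the show output.
ROUTES = {
    "192.168.238.0/28": "192.168.0.12",  # advertised by Spine-12
    "192.168.238.0/29": "192.168.0.11",  # advertised by Spine-11
}


def longest_prefix_match(destination, routes):
    """Return (prefix, next_hop) of the most specific covering route, or None."""
    dest = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None
    # The route with the longest mask is the most specific one.
    network, next_hop = max(candidates, key=lambda item: item[0].prefixlen)
    return str(network), next_hop


if __name__ == "__main__":
    # The Phantom-RP address is covered by both prefixes; the /29 wins.
    print(longest_prefix_match("192.168.238.1", ROUTES))
```

Both prefixes cover 192.168.238.1, but only the mask length decides which Spine attracts the traffic sent towards the RP address.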

Task 3:

Define the IP address 192.168.238.1 as the RP for multicast groups 238.0.0.0/24 on all switches.

ip pim rp-address 192.168.238.1 group-list 238.0.0.0/24 bidir

And that’s all for the configuration part.

Operation

Now it is time to see how this actually works. I will first shut down the NVE 1 interfaces on both VTEPs and then bring them up again. This way, both switches will join the multicast group 238.0.0.10 (which is attached to their NVE 1 interface for VNI 10000) by sending PIM join messages towards the RP. As a recap, here is the interface NVE 1 configuration from VTEP-101.

interface nve1
  no shutdown
  source-interface loopback100
  member vni 10000
    mcast-group 238.0.0.10

Figure-2 and Capture-1 show the join process.

Figure-2: PIM Join from VTEP-101.

Capture-1: PIM Join from VTEP-101.

As can be seen from Capture-1, VTEP-101 joins multicast group 238.0.0.10 by sending a PIM join message to multicast address 224.0.0.13 (all PIM routers) out of its E1/1 interface, using the Underlay IP address as the source. Note that the PIM join message is sent upstream towards the RPA (RP Address). The same process happens on VTEP-102. Based on these received PIM join messages, Spine-11 adds interfaces E1/1 and E1/2 to the OIL (Outgoing Interface List) for group 238.0.0.10.

Spine-11# sh ip mroute
IP Multicast Routing Table for VRF "default"

(*, 232.0.0.0/8), uptime: 01:00:27, pim ip
  Incoming interface: Null, RPF nbr: 0.0.0.0
  Outgoing interface list: (count: 0)

(*, 238.0.0.0/24), bidir, uptime: 00:58:01, pim ip
  Incoming interface: loopback238, RPF nbr: 192.168.238.1
  Outgoing interface list: (count: 1)
    loopback238, uptime: 00:08:11, pim, (RPF)

(*, 238.0.0.10/32), bidir, uptime: 00:54:51, pim ip
  Incoming interface: loopback238, RPF nbr: 192.168.238.1
  Outgoing interface list: (count: 3)
    Ethernet1/2, uptime: 00:08:10, pim
    Ethernet1/1, uptime: 00:08:10, pim
    loopback238, uptime: 00:08:11, pim, (RPF)

The bidirectional Multicast Tree, where Spine-11 acts as the routing vector, is now ready. Spine-12 is not participating in the Shared Tree at this moment. If Spine-11 fails, Spine-12 still advertises the RPA network with mask /28 (remember that Spine-11 uses mask /29), so the route towards the RPA then points to Spine-12.
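The failover behaviour can be sketched in Python as follows. This is a simplified model under one stated assumption: when Spine-11 fails, OSPF withdraws its /29 and only the /28 from Spine-12 remains in the Leaf RIB. The function and variable names are made up for the illustration.

```python
import ipaddress


def rp_next_hop(rp_address, routes):
    """Pick the Spine towards rp_address using longest-prefix match."""
    rp = ipaddress.ip_address(rp_address)
    matching = [
        (ipaddress.ip_network(prefix), spine)
        for prefix, spine in routes.items()
        if rp in ipaddress.ip_network(prefix)
    ]
    if not matching:
        return None
    return max(matching, key=lambda item: item[0].prefixlen)[1]


# RPA network as advertised by the two Spine switches in this lab.
routes = {"192.168.238.0/29": "Spine-11", "192.168.238.0/28": "Spine-12"}
before = rp_next_hop("192.168.238.1", routes)

# Assumption: a Spine-11 failure removes its /29 from the Leaf RIB.
routes_after_failure = {p: s for p, s in routes.items() if s != "Spine-11"}
after = rp_next_hop("192.168.238.1", routes_after_failure)

if __name__ == "__main__":
    print(before, "->", after)
```

No RP reconfiguration is needed on the Leaf switches in this model: the RP address stays the same and only the unicast route towards it changes.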

Summary

I have now shown how multi-destination traffic is forwarded in the Underlay network by using either PIM-ASM (Anycast-RP) or PIM BiDir (Phantom-RP).

In addition to these Multicast modes, we could use “Ingress Replication” (Unicast mode), where each VTEP replicates locally received BUM traffic to all other VTEPs. The IP addresses of the other VTEPs in the VNI can be configured statically on each switch, or the VNI/VTEP IP address information can be advertised by using BGP EVPN (Route-type 3: Inclusive Multicast Ethernet Tag Route). Instead of showing the Ingress Replication configurations, I am going to briefly compare the differences and pros/cons of these three multi-destination traffic forwarding options.

Unicast Mode: Ingress Replication

In unicast mode, each BUM packet is replicated to all other VTEPs belonging to the same VNI (Figure 3): ingress VTEP-1 replicates the ingress BUM traffic to all other VTEPs. Each replicated packet is VXLAN encapsulated and forwarded like any other VXLAN-encapsulated unicast data. In smaller installations this is a valid solution because of its simplicity: there is no need for a Multicast protocol in the Underlay network. We can define remote peer addresses statically or we can use BGP EVPN for advertising peer information with route-type 3 (Inclusive Multicast Ethernet Tag Route) advertisements. If Unicast mode is used, BGP EVPN is the best-practice model.

Figure-3: Unicast Mode: Ingress Replication
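To put a number on the scalability limit of Ingress Replication, here is a minimal Python sketch of the replication load on the ingress VTEP. The function name and the constant are assumptions for this illustration; the arithmetic simply follows from "one unicast copy per remote VTEP" versus a single copy sent to the multicast group.

```python
# In multicast mode the ingress VTEP sends one copy to the group address
# and the Underlay network takes care of the replication.
MULTICAST_COPIES_AT_INGRESS = 1


def ingress_replication_copies(vteps_in_vni):
    """Unicast copies the ingress VTEP sends per BUM frame: one per remote VTEP."""
    if vteps_in_vni < 1:
        raise ValueError("a VNI needs at least one VTEP")
    return vteps_in_vni - 1


if __name__ == "__main__":
    for vteps in (2, 10, 100):
        print(
            f"{vteps} VTEPs: {ingress_replication_copies(vteps)} unicast copies "
            f"vs {MULTICAST_COPIES_AT_INGRESS} multicast copy"
        )
```

With the two VTEPs of this lab the difference is invisible, but the ingress load grows linearly with the number of VTEPs in the VNI.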

Multicast Mode: PIM-ASM (Anycast-RP)

Each VTEP uses the same RP address in PIM-ASM with Anycast RP (Figure 4). In our example, both Spine switches are active RPs (same RP IP address). VTEP switches choose which one to use based on a hash algorithm, so we end up in a situation where VTEPs use different Spines as an RP. This gives us automatic load balancing between the two active RPs. In PIM-ASM each VTEP is both a source and a receiver for the Multicast traffic, and since we have ten VTEP switches we will have ten source trees (S,G) in each switch.

Figure-4: Multicast Mode: PIM-ASM with Anycast RP

Multicast Mode: PIM BiDir (Phantom RP)

In PIM BiDir with Phantom RP (Figure 5), the selection of the RP is based on the longest match. In our example, Spine-11 has mask /28 and Spine-12 has mask /27. If load balancing is needed, we could place the RPs of different Multicast Groups on different Spines: the RP for Mcast Group 239.0.0.10 active on Spine-11, the RP for Mcast Group 239.0.0.11 active on Spine-12, and so on. In PIM-BiDir each Multicast distribution tree is rooted at the RP (there is no shortest path switchover operation) and there is only one group-based entry (*,G) in each switch (excluding Spine-12).

Figure-5: Multicast Mode: PIM-BiDir with Phantom RP
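The manual load balancing idea can be sketched in Python like this. The addressing is hypothetical and not taken from the lab: a second Phantom-RP subnet (192.168.239.0) is invented for the example, and each Spine advertises both RP subnets with the longer mask on the Spine that should be active for that RP.

```python
import ipaddress

# Hypothetical advertisements: (prefix, advertising Spine).
ADVERTISED = [
    ("192.168.238.0/29", "Spine-11"),  # Spine-11 is active for the first RP
    ("192.168.238.0/28", "Spine-12"),
    ("192.168.239.0/28", "Spine-11"),
    ("192.168.239.0/29", "Spine-12"),  # Spine-12 is active for the second RP
]

# Hypothetical group-to-RP mapping, one Phantom-RP address per group.
GROUP_TO_RP = {
    "239.0.0.10": "192.168.238.1",
    "239.0.0.11": "192.168.239.1",
}


def active_spine(group):
    """Return the Spine that attracts traffic for the group's Phantom-RP."""
    rp = ipaddress.ip_address(GROUP_TO_RP[group])
    covering = [
        (ipaddress.ip_network(prefix), spine)
        for prefix, spine in ADVERTISED
        if rp in ipaddress.ip_network(prefix)
    ]
    return max(covering, key=lambda item: item[0].prefixlen)[1]


if __name__ == "__main__":
    for group in GROUP_TO_RP:
        print(group, "->", active_spine(group))
```

Each Spine still backs up the other one, because it also advertises the other RP subnet with the shorter mask.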

From the complexity point of view, the Ingress Replication model is the simplest since we do not have to run Multicast routing in the Underlay network, but it has its scalability limitations. If we compare PIM-ASM and PIM-BiDir, we have to decide which is more important: automatic load balancing or the number of Multicast distribution trees. My choice is PIM BiDir since it has its own (even though not automatic) load balancing method and it uses a single bidirectional shared tree for each Mcast Group.
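The difference in the number of distribution trees can be expressed as simple arithmetic. The Python sketch below is an upper-bound estimate, assuming every VTEP has sent traffic to every group; the function names are made up for the illustration.

```python
def asm_source_trees(source_vteps, groups=1):
    """PIM-ASM: up to one (S,G) source tree per source VTEP and group."""
    return source_vteps * groups


def bidir_shared_trees(groups=1):
    """PIM BiDir: a single (*,G) shared tree per group, no (S,G) state."""
    return groups


if __name__ == "__main__":
    # Ten VTEPs and one Mcast Group, as in the comparison above.
    print("PIM-ASM (S,G) entries:", asm_source_trees(10))
    print("PIM BiDir (*,G) entries:", bidir_shared_trees())
```

The PIM-ASM state grows with both the number of VTEPs and the number of groups, while the PIM BiDir state grows only with the number of groups.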

My next article will be about VXLAN Flood and Learn.

Edited: Aug 30.2018 | Toni Pasanen CCIE#28158

References:

RFC 5015: Bidirectional Protocol Independent Multicast (BIDIR-PIM)

Building Data Center with VXLAN BGP EVPN – A Cisco NX-OS Perspective

ISBN-10: 1-58714-467-0

17 comments:

Unknown, 12 May 2018 at 16:23
Excellent work ... this section might need pros/cons for selecting PIM-SM vs PIM-BiDi vs IR in Ethernet Fabrics. Both explained in details but it might be beneficial to add why PIM-BiDi is preferred over PIM-SM.

Unknown, 27 August 2018 at 03:49
great post Toni, i have a question here:- why you configure the RP on all devices 192.168.238.1, however the loopback is 192.168.238.6, i think we can configure this:-

ip pim rp-address 192.168.238.6 group-list 238.0.0.0/24 bidir

thanks in advance, and really all of your posts regarding VXLAN is very useful. Thanks

Yuriy, 30 August 2018 at 16:04
hi. Is there any reason to use specifically /29 and /28 for Lo238? Maybe it's just me, but I'd rather use /31 and /32

Andrei Voinovich, 18 March 2019 at 07:46
Cool articles - many things are unclear while reading 'Building Data Centers with VXLAN BGP EVPN A Cisco NX-OS Perspective', but after your examples all the bricks are stuck together.

Rajesh, 23 November 2019 at 18:47
Is this is Safari ?

Toni Pasanen, 25 November 2019 at 22:49
No, at the moment the paperback version of the book is available at Amazon and the pdf version at Leanpub.com.

red1adn, 2 December 2022 at 13:13
Hey , i was struggling in the process of learning VXLAN, but i've to say you made it clear and easy for me to understand it. thank you a lot and may god bless you.

Tuesday, 20 March 2018
