Tuesday 20 March 2018

VXLAN Part IV: The Underlay Network – Multidestination Traffic: PIM BiDir

My last post, VXLAN Part III, introduced the VXLAN fabric L2VNI service with Anycast-RP PIM (RFC 4610 and RFC 7761). In this chapter, I will show how PIM BiDir (RFC 5015) with Phantom-RP can be used for the same purpose. I will use configurations, show commands and Wireshark captures to explain the theory.

Figure-1: Example VIRL topology


Configuration

I am going to use the same topology that I used in the Anycast-RP lab. Note that I am using Cisco VIRL in this lab. We have two Leaf switches (101 & 102) and two Spine switches (11 & 12). For the next couple of pages I am going to implement PIM BiDir with Phantom-RP in a preconfigured VXLAN fabric (the Underlay and IP addressing configurations, as well as the basic NVE 1 interface configuration, can be found in Part II).

Here is our task list:

1) Spine-11: Configure the Loopback interface 238 and assign the IP address 192.168.238.6 with mask /29. Attach the interface to OSPF area 0.0.0.0 and enable PIM-SM on it.

2) Spine-12: Configure the Loopback interface 238 with the same IP address but use mask /28. Attach the interface to OSPF area 0.0.0.0 and enable PIM-SM on it.

3) All switches: Define the IP address 192.168.238.1 as the RP for multicast groups 238.0.0.0/24.


Step-1:
Configure the loopback address 192.168.238.6 on both Spine switches. Use mask /29 on Spine-11 and mask /28 on Spine-12. Attach both interfaces to OSPF area 0.0.0.0 and enable PIM-SM on them. Note that the OSPF network type has to be point-to-point; otherwise the loopback address is advertised as a host route with mask /32, and the Phantom-RP address 192.168.238.1 would not be covered by any advertised route.

Spine-11 loopback interface:
interface loopback238
  description ** random IP in Phantom-RP network **
  ip address 192.168.238.6/29
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode

Spine-12 loopback interface:
interface loopback238
  description ** random IP in Phantom-RP network **
  ip address 192.168.238.6/28
  ip ospf network point-to-point
  ip router ospf UNDERLAY-NET area 0.0.0.0
  ip pim sparse-mode
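
The addressing above can be sanity-checked with a few lines of Python. This is only a sketch using the standard ipaddress module; the variable names are mine and are not part of the lab. It shows that both loopbacks resolve to overlapping networks of different lengths, and that the Phantom-RP address sits inside both networks without being assigned to either interface.

```python
import ipaddress

# Loopback238 addressing as configured on the two Spine switches.
spine11 = ipaddress.ip_interface("192.168.238.6/29")
spine12 = ipaddress.ip_interface("192.168.238.6/28")

# Phantom-RP address from the task list; it is not assigned to any interface.
phantom_rp = ipaddress.ip_address("192.168.238.1")

for name, intf in (("Spine-11", spine11), ("Spine-12", spine12)):
    net = intf.network  # the prefix OSPF advertises with network type point-to-point
    print(name, "advertises", net,
          "| RP inside prefix:", phantom_rp in net,
          "| RP equals interface address:", phantom_rp == intf.ip)
```

Both lines report that the RP is inside the prefix and that it is not the interface address, which is what makes the RP a "phantom".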

Now if we take a look at the RIB of VTEP-101, we can see that it has both networks installed.
Leaf-101# sh ip route | b 192.168.238
192.168.238.0/28, ubest/mbest: 1/0
    *via 192.168.0.12, Eth1/2, [110/41], 00:40:14, ospf-UNDERLAY-NET, intra
192.168.238.0/29, ubest/mbest: 1/0
    *via 192.168.0.11, Eth1/1, [110/41], 00:40:03, ospf-UNDERLAY-NET, intra

Because of the more specific route, VTEP-101 will use Spine-11 as the next hop for 192.168.238.1 (our Phantom-RP).
Leaf-101# sh ip route 192.168.238.1
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.238.0/29, ubest/mbest: 1/0
    *via 192.168.0.11, Eth1/1, [110/41], 00:38:37, ospf-UNDERLAY-NET, intra
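
The route selection above is plain longest-prefix matching. The sketch below models it in Python with the two routes from the Leaf-101 RIB; the lookup function and the dictionary layout are illustrative assumptions, not how NX-OS implements its RIB.

```python
import ipaddress

# Routes from the Leaf-101 output above: prefix -> next hop.
rib = {
    ipaddress.ip_network("192.168.238.0/28"): "192.168.0.12",  # learned via Eth1/2
    ipaddress.ip_network("192.168.238.0/29"): "192.168.0.11",  # learned via Eth1/1
}

def lookup(rib, destination):
    """Return (prefix, next_hop) for the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [net for net in rib if dest in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return best, rib[best]

prefix, next_hop = lookup(rib, "192.168.238.1")
print(prefix, next_hop)  # 192.168.238.0/29 192.168.0.11
```

An address such as 192.168.238.9 is only covered by the /28, so the same lookup would return the next hop 192.168.0.12 for it.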

Step-2:
Define the IP address 192.168.238.1 as the RP for multicast groups 238.0.0.0/24 on all switches.
ip pim rp-address 192.168.238.1 group-list 238.0.0.0/24 bidir

And that’s all for the configuration part.


Operation

Now it is time to see how this actually works. I will do it by first shutting down the NVE 1 interfaces on both VTEPs and then bringing them up again. This way, both switches will join the multicast group 238.0.0.10 (which is attached to their NVE 1 interface for VNI 10000) by sending PIM join messages towards the RP. As a recap, here is the interface NVE 1 configuration from VTEP-101.

interface nve1
  no shutdown
  source-interface loopback100
  member vni 10000
    mcast-group 238.0.0.10
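
A quick way to confirm that a VNI's multicast group is actually served by the BiDir RP is to check it against the group-list of the rp-address command. The sketch below does this in Python; the vni_groups dictionary is a hypothetical helper structure holding the VNI-to-group mapping from the NVE configuration above.

```python
import ipaddress

# Group range from: ip pim rp-address 192.168.238.1 group-list 238.0.0.0/24 bidir
group_list = ipaddress.ip_network("238.0.0.0/24")

# VNI -> multicast group, as configured under interface nve1.
vni_groups = {10000: ipaddress.ip_address("238.0.0.10")}

for vni, group in vni_groups.items():
    covered = group.is_multicast and group in group_list
    print(f"VNI {vni} group {group} covered by the BiDir RP: {covered}")
```

A group outside 238.0.0.0/24 would report False, which means the switches would have no BiDir RP mapping for it.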

In Figure-2 and Capture-1 we can see the join process.

Figure-2: PIM Join from VTEP-101.

Capture-1: PIM Join from VTEP-101.
As can be seen from Capture-1, VTEP-101 joins multicast group 238.0.0.10 by sending a PIM join message to multicast group 224.0.0.13 (all PIM routers) out of its E1/1 interface, using its Underlay IP address as the source. Note that the PIM join message is sent upstream towards the RPA (RP Address). The same process happens on VTEP-102. Based on these received PIM join messages, Spine-11 adds interfaces E1/1 and E1/2 to the OIL (Outgoing Interface List) for group 238.0.0.10.

Spine-11# sh ip mroute
IP Multicast Routing Table for VRF "default"

(*, 232.0.0.0/8), uptime: 01:00:27, pim ip
  Incoming interface: Null, RPF nbr: 0.0.0.0
  Outgoing interface list: (count: 0)


(*, 238.0.0.0/24), bidir, uptime: 00:58:01, pim ip
  Incoming interface: loopback238, RPF nbr: 192.168.238.1
  Outgoing interface list: (count: 1)
    loopback238, uptime: 00:08:11, pim, (RPF)


(*, 238.0.0.10/32), bidir, uptime: 00:54:51, pim ip
  Incoming interface: loopback238, RPF nbr: 192.168.238.1
  Outgoing interface list: (count: 3)
    Ethernet1/2, uptime: 00:08:10, pim
    Ethernet1/1, uptime: 00:08:10, pim
    loopback238, uptime: 00:08:11, pim, (RPF)

The bidirectional multicast tree, where Spine-11 is working as a routing vector, is now ready. Spine-12 is not participating in the shared tree at this moment. If Spine-11 fails, Spine-12 still advertises the RPA network with mask /28 (remember that Spine-11 uses mask /29), so the Leaf switches fall back to the /28 route and the path towards the RPA moves to Spine-12.
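
The failover behaviour can be illustrated with a small Python model. It is a sketch under simplifying assumptions: it only looks at which Spine advertises the longest prefix covering the RPA, and it ignores OSPF convergence time and PIM state rebuilding. The function name active_rp_path is hypothetical.

```python
import ipaddress

RPA = ipaddress.ip_address("192.168.238.1")

# Phantom-RP prefix advertised by each Spine (from the loopback238 configs).
advertised = {
    "Spine-11": ipaddress.ip_network("192.168.238.0/29"),
    "Spine-12": ipaddress.ip_network("192.168.238.0/28"),
}

def active_rp_path(alive):
    """Return the live Spine advertising the longest prefix that covers the RPA."""
    candidates = [(net.prefixlen, spine) for spine, net in advertised.items()
                  if spine in alive and RPA in net]
    return max(candidates)[1] if candidates else None

print(active_rp_path({"Spine-11", "Spine-12"}))  # Spine-11
print(active_rp_path({"Spine-12"}))              # Spine-12
```

This also shows why the two masks must differ: with equal prefix lengths there would be no deterministic primary and backup.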


Summary

I have now shown how multi-destination traffic is forwarded in the Underlay network by using either PIM-ASM (Anycast-RP) or PIM BiDir (Phantom-RP).
In addition to these multicast modes, we could use “Ingress Replication” (unicast mode), where each VTEP replicates BUM traffic received locally to all other VTEPs. The IP addresses of the other VTEPs under the VNI can be configured statically on each switch, or the VNI/VTEP IP address information can be advertised by using BGP EVPN (Route-type 3 – Inclusive Multicast Ethernet Tag Route). Instead of showing the Ingress Replication configuration, I am going to briefly compare the differences and the pros and cons of these three multi-destination traffic forwarding options.


Unicast Mode: Ingress Replication


In unicast mode, each BUM packet is replicated to all other VTEPs belonging to the same VNI (Figure 3): the ingress VTEP-1 replicates the ingress BUM traffic to all other VTEPs. Each replicated packet is VXLAN encapsulated and forwarded like any other VXLAN-encapsulated unicast data. In smaller installations this is a valid solution because of its simplicity: there is no need for a multicast protocol in the Underlay network. We can define remote peer addresses statically, or we can use BGP EVPN to advertise peer information with route-type 3 (Inclusive Multicast Ethernet Tag Route) advertisements. If unicast mode is used, BGP EVPN is the best-practice model.
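
The scalability cost of unicast mode is easy to quantify. The sketch below counts how many copies of a single BUM frame the ingress VTEP itself has to send into the Underlay; the function name and mode strings are hypothetical, and the model assumes every VTEP is a member of the VNI.

```python
def underlay_copies(vtep_count, mode):
    """Copies of one BUM frame that the ingress VTEP sends into the Underlay."""
    if vtep_count < 1:
        raise ValueError("vtep_count must be at least 1")
    if mode == "ingress-replication":
        return vtep_count - 1  # one unicast copy per remote VTEP in the VNI
    if mode == "multicast":
        return 1               # one copy to the group; the Underlay replicates it
    raise ValueError(f"unknown mode: {mode}")

for n in (2, 10, 100):
    print(n, underlay_copies(n, "ingress-replication"), underlay_copies(n, "multicast"))
```

With ten VTEPs the ingress VTEP sends nine unicast copies instead of one multicast packet, and the gap grows linearly with the number of VTEPs.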



Figure-3: Unicast Mode: Ingress Replication

Multicast Mode: PIM-ASM (Anycast-RP)

Each VTEP uses the same RP address in PIM-ASM with Anycast RP (Figure 4). In our example, both Spine switches are active RPs (same RP IP address). VTEP switches choose which one to use based on a hash algorithm, so we end up in a situation where VTEPs use different Spines as an RP. This way we have automatic load balancing between the two active RPs. In PIM-ASM each VTEP is both a source and a receiver for the multicast traffic, and since we have ten VTEP switches, we will have ten source trees (S,G) in each switch.



Figure-4: Multicast Mode: PIM-ASM with Anycast RP

Multicast Mode: PIM BiDir (Phantom RP)

In PIM BiDir with Phantom RP (Figure 5), the selection of the RP is based on the longest match. In our example, Spine-11 has mask /28 and Spine-12 has mask /27. If load balancing is needed, we can place the RPs of different multicast groups on different Spines: the RP for Mcast Group 239.0.0.10 active on Spine-11, the RP for Mcast Group 239.0.0.11 active on Spine-12, and so on. In PIM BiDir each multicast distribution tree is rooted at the RP (there is no shortest-path switchover operation) and there is only one group-based entry (*,G) in each switch (excluding Spine-12).
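
The difference in multicast state between the two PIM modes can be put into numbers. The sketch below counts the entries for one group on a switch that sits on the tree. It assumes that every VTEP is an active source for the group and that PIM-ASM also keeps the (*,G) shared-tree entry next to the source trees; the function name is hypothetical.

```python
def entries_per_group(vtep_count, mode):
    """Multicast route entries for one group, split into (S,G) and (*,G)."""
    if mode == "pim-asm":
        # One source tree per sending VTEP, plus the shared tree (assumption).
        return {"(S,G)": vtep_count, "(*,G)": 1}
    if mode == "pim-bidir":
        # A single bidirectional shared tree, regardless of the number of sources.
        return {"(S,G)": 0, "(*,G)": 1}
    raise ValueError(f"unknown mode: {mode}")

print(entries_per_group(10, "pim-asm"))    # {'(S,G)': 10, '(*,G)': 1}
print(entries_per_group(10, "pim-bidir"))  # {'(S,G)': 0, '(*,G)': 1}
```

The PIM-ASM state grows with the number of VTEPs and groups, while the PIM BiDir state grows only with the number of groups.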

Figure-5: Multicast Mode: PIM-BiDir with Phantom RP


From the complexity point of view, the Ingress Replication model is the simplest since we do not have to run multicast routing in the Underlay network, but it has its scalability limitations. If we compare PIM-ASM and PIM BiDir, we have to decide which is more important: automatic load balancing or the number of multicast distribution trees. My choice is PIM BiDir since it has its own (even though not automatic) load balancing method and it uses a bidirectional shared tree for each Mcast Group.

My next article will be about VXLAN Flood and Learn.

Edited: Aug 30.2018 | Toni Pasanen CCIE#28158

References:
RFC 5015: Bidirectional Protocol Independent Multicast (BIDIR-PIM)

Building Data Center with VXLAN BGP EVPN – A Cisco NX-OS Perspective
ISBN-10: 1-58714-467-0

17 comments:

  1. Excellent work ... this section might need pros/cons for selecting PIM-SM vs PIM-BiDi vs IR in Ethernet Fabrics. Both explained in details but it might be beneficial to add why PIM-BiDi is preferred over PIM-SM.

    1. Good point, I will add pros/cons section. Thanks for your valuable comment.

    2. Hi Toni,

      I hope your day is going well.
      My name is Bryan Bales, and I wanted to post here, to say that your work is excellent.

    3. Hi Bryan, thanks for your kind words.

  2. great post Toni, i have a question here:- why you configure the RP on all devices 192.168.238.1, however the loopback is 192.168.238.6, i think we can configure this:-

    ip pim rp-address 192.168.238.6 group-list 238.0.0.0/24 bidir

    thanks in advance, and really all of your posts regarding VXLAN is very useful. Thanks

    1. Thanks for the very good question! I have used the last unicast address from the block as an interface address and the first one as a Phantom RP. RFC 5015 states that "The RPA does not need to correspond to an address for an interface of a real router". This is just what "Phantom" means: the address of the RP is an address that is not assigned to any interface. Check this link out: https://learningnetwork.cisco.com/thread/120150. I really like the way Juan explains this concept.

  3. hi. Is there any reason to use specifically /29 and /28 for Lo238? Maybe it's just me, but I'd rather use /31 and /32

    1. Instead of assigning the Phantom-RP address to the router interface by using /32 mask, I wanted to let “the phantom” be free and float somewhere in between the Spine switches. Anyway, your solution is perfectly fine as long as you use a different mask length.

    2. By the way, I used the Phantom-RP image in figure 4 even though it should be the Anycast RP image. Now the picture has been changed to correct one.

    3. thank you for the reply :)

      BWT, awesome pictures in this series, I mean it!

    4. Thanks! A picture is worth a thousand words :)

  4. Cool articles - many things are unclear while reading 'Building Data Centers with VXLAN BGP EVPN A Cisco NX-OS Perspective', but after your examples all the bricks are stuck together.

    1. Nice to hear that these posts actually help. The book you mention is by the way one of my favorite book published by Cisco Press :)

  5. Is this is Safari ?

  6. No, at the moment the paperback version of the book is available at Amazon and the pdf version at Leanpub.com.

  7. Hey , i was struggling in the process of learning VXLAN, but i've to say you made it clear and easy for me to understand it. thank you a lot and may god bless you.
