Yet another long delay between posts, but this one is worth the wait! I got to assist my super bad ass co-worker on a Nexus 9000 VXLAN EVPN deployment this past week, and what an adventure it was… there were ups and downs, and long nights in the data center (I feel bad since it was much worse for my co-worker!), far too much Cisco TAC hold music, and even some beer! Without further ado, here we go….
As you may know, recent (very recent) Nexus 9000 code supports leveraging the EVPN address-family in BGP as a control plane for VXLAN. That’s cool by itself, but the really cool part about these recent releases is we can now effectively do VXLAN routing. Prior to this VXLAN was only useful for stretching layer 2 around, then at some point you would have to bridge that VXLAN (bridge domain) into a VLAN, trunk it out to an SVI for routing, and then if your packet was destined for another VXLAN the device would trunk back down in that second VLAN, and then bridge you back into the second VXLAN…. so basically you had to hair pin all traffic like crazy to route between VXLANs… hopefully this drawing will help to illustrate what I’m talking about:
In the above drawing we have a basic Spine/Leaf network with two bridge-domains, 4097 and 4098. Our lovely user in 4097 has a flow that is destined for the other user in 4098. Here’s how this would have worked before:
- The host sends a frame to the leaf node(s) that he’s connected to — from the host perspective this is just Ethernet traffic like normal. The frame hits the leaf(s) and the VLAN is associated to a VXLAN (VNI). The destination for this frame would be the default gateway for the domain — in our case that’s in the ‘Core Block.’ The leaf(s) encapsulate the frame with a L3 destination of the Border leaf(s) VTEP address and the packet (L3 encapsulated at this point) is sent along its way via some ECMP routing to the spines.
- The Spines are just big dumb routers, so they receive the encapsulated packet, do a lookup for the destination IP address and see that it must go to the Border leaf(s), and so off they go, routing the packet away.
- The Border leaf(s) see that they are the destination for the packet, realize it’s a VXLAN packet, and de-encapsulate the frame, so now we are back to a L2 frame. They investigate the frame and see that the destination MAC address is that of the default gateway, which they have learned exists on the ‘Core Block.’ The frame is now switched up to the default gateway like normal.
- The Core Block receives the frame, realizes it’s the destination, and then investigates what to do with the packet. It sees that the destination IP address lives on another VLAN that it owns (remember, this is just normal VLAN/Ethernet at this point; the Core Block doesn’t know VXLAN exists at all). The packet information stays the same, since it’s still looking for the same destination IP address in VXLAN 4098, but the destination Ethernet address is rewritten to the MAC of the lovely lady in VXLAN 4098.
- Now that we know that the frame needs to go to the destination MAC address of the lady in 4098, the Core switches trunk the frame along to where they have learned that the MAC address lives — which happens to be back to the Border leafs.
- The Border leafs get the frame from the Core Block and perform a MAC lookup — they see that the MAC is associated to the VTEP address of some other leaf switch, and encapsulate the frame again. The packet gets the destination IP address of the leaf that owns the destination MAC address for the frame…. again off we go to the spine switches.
- Again, the spine switches are just doing basic routing — they see they have a packet that is destined for some leaf switch and route the packet along its way.
- Finally, the destination leaf switch gets the packet, removes the VXLAN encapsulation, sees the destination MAC, bridges the VXLAN into the VLAN, and switches the frame to its final destination.
So, as you can see, lots going on, and if you had a bunch of VXLAN to VXLAN traffic flows you would have a ton of hair-pinning going on through the Core Block. Not exactly the picture of efficiency! Being able to route between bridge-domains is obviously pretty critical to get this to scale appropriately… we can now do that on 9Ks with EVPN! So, let’s move on.
Very quickly, here is an overview of the topology that we will be talking about — host names and IPs have been changed to protect the innocent. This is a very simple Spine/Leaf design with Nexus 9396 Leaf nodes and 9332 Spine nodes. There also exists a strictly L3 routed block outside of the Spine/Leaf environment — we’ll talk more about that later — these two switches are also 9396s.
Before we can go crazy with VXLAN, we have some basics that we need to get configured first to support this. The first and most obvious is that we need to have an IGP in the fabric. The IGP can be whatever you’d like, we just need to have IP reachability across the fabric. My personal choice would be IS-IS; in this case, however, we’re using OSPF. I won’t go into details of configuring that since it’s quite simple, but common sense stuff like point-to-point network types on the routed links to speed things up would be good. We also need to ensure that we have loopbacks on all of our devices, and that these loopbacks are being advertised into the IGP.
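As a rough sketch of that underlay, a leaf might carry something like the following. The router-id, addresses, and port number are placeholders picked for illustration (they are not from this deployment), and only the OSPF-related lines are shown:

```
! Sketch only: addresses and port numbers are placeholders
router ospf 1
  router-id 192.0.2.11

interface loopback0
  ip address 192.0.2.11/32
  ip router ospf 1 area 0

interface Ethernet1/49
  description Routed uplink to Spine-1
  no switchport
  ip address 192.0.2.129/31
  ip ospf network point-to-point
  ip router ospf 1 area 0
  no shutdown
```

The point-to-point network type skips DR/BDR election on the routed links, which is the "speed things up" part.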
Features — because this is on a Nexus box, we must enable features, the features we used in this deployment were:
- nv overlay
Sort of feature-specific — it is also required to configure “nv overlay evpn” to enable the EVPN overlay.
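Pulled together, the feature set on a leaf would look roughly like the sketch below. Most of these lines appear in the sanitized configurations later in this post; `feature vn-segment-vlan-based` and `feature nv overlay` do not, and are included here on the assumption that they are what enable the `vn-segment` command under a VLAN and the `interface nve1` configuration, so verify against your release:

```
! Sketch of a leaf feature set (spines would not need interface-vlan or vpc)
nv overlay evpn
feature ospf
feature bgp
feature pim
feature interface-vlan
feature vn-segment-vlan-based
feature nv overlay
feature vpc
```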
The next foundational piece that we need to have is Multicast. Even though we are now going to have a BGP control-plane for VXLAN, we still need multicast to support BUM traffic — that is Broadcast, Unknown Unicast, and Multicast replication at layer 2. This has been a key piece of how VXLAN supports L2 over L3 since the initial draft, so I won’t delve any further into that here. Cisco also recommends using anycast rendezvous point, and using the spine switches as the RPs for the environment. Pretty straight forward configuration — we’ll cover specifics in a bit.
Finally for the base configs, we need to stand up the skeleton of the EVPN configuration. If you’ve done BGP, you can knock this out nice and easy! The spine switches are route-reflectors for the fabric (and clients of each other), and all leafs peer to each spine. You do not need any v4 BGP, simply activate the EVPN address family… again, configs coming in a bit.
Next up, deploying the EVPN control-plane requires the use of the Anycast-Gateway feature, meaning all leafs act as the default gateway for every bridge-domain. This is pretty much the same as FabricPath anycast in theory. All leaf nodes are configured with a single command that defines the fabric anycast gateway MAC address. I’m not sure if there is a “best practice” here, so we simply selected a MAC address from an IANA reserved range; since this MAC doesn’t exist anywhere but in the fabric, this should be okay. In addition to this command, every VXLAN enabled VLAN SVI is configured with the “fabric forwarding mode anycast-gateway” command.
Next up is where we start to get into the real meat of it! One critical piece of this EVPN puzzle is that every tenant in the fabric requires an “overlay” VLAN. My understanding of this is that MP-BGP is essentially used to loop packets through the internal switching architecture at the leaf nodes as opposed to shipping all traffic out to a remote device for routing functionality as explained above. So to that end, we have a VRF and a VLAN that are assigned as an “overlay” VLAN. This VLAN is tied to a vn-segment, a VRF, and basically defines a tenant. All other VXLANs that live within this tenant basically roll up to this VLAN, so it’s kind of a container for the tenant. This will probably make more sense later in the configuration section!
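A minimal sketch of the two anycast gateway pieces is below. The MAC address is a made-up placeholder, not the value used in this deployment; the SVI mirrors the example bridge-domain used in the configuration section of this post:

```
! Global command, identical on every leaf (placeholder MAC)
fabric forwarding anycast-gateway-mac 0000.2222.3333

! Per-SVI: same gateway IP on every leaf that hosts this bridge-domain
interface Vlan1001
  vrf member VRF-Tenant-1
  ip address 192.168.1.1/24
  fabric forwarding mode anycast-gateway
  no shutdown
```

Because every leaf answers for the same gateway IP and MAC, a host keeps the same default gateway no matter which leaf it lands on.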
As mentioned, the overlay VLAN is associated to a VRF — the VRF is then configured with the usual route-distinguisher and route-targets — the route-targets in this case have the additional key-word “evpn” which I believe associates them to the EVPN address-family in BGP.
For “host” VXLANs, or bridge-domains that will have devices in there, the configuration is much simpler. The layer 2 VLAN is simply assigned a VN-ID — VXLAN Network ID, and the layer 3 interface (SVI) is attached to the tenant VRF previously discussed.
Now, we have some vPC specific configurations. The “overlay” VLAN must be allowed over the peer-link. I believe that this is because it is referenced in the NVE configuration (up next) and is basically tied to the VTEP to associate each tenant, so this makes sense that the VLAN would have to traverse the peer-link. Additionally, and kind of related to the vPC, each vPC member must have its own unique loopback address, however, for purposes of anycast across the domain, they both must have the SAME secondary IP address configured on the loopback that is used as the NVE source.
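To make the vPC pieces a bit more concrete, here is a hedged sketch. The port-channel number and all loopback addresses are placeholders; VLAN 1000 is the overlay VLAN from the leaf example in this post:

```
! On both vPC peers: allow the overlay VLAN across the peer-link
interface port-channel1
  switchport mode trunk
  switchport trunk allowed vlan add 1000
  vpc peer-link

! vPC member A: unique primary, shared secondary
interface loopback0
  ip address 192.0.2.11/32
  ip address 192.0.2.100/32 secondary

! vPC member B: unique primary, SAME secondary as member A
interface loopback0
  ip address 192.0.2.12/32
  ip address 192.0.2.100/32 secondary
```

The shared secondary address is what the rest of the fabric sees as the VTEP for anything behind the vPC pair.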
Finally, we have the NVE interface, or Network Virtualization Endpoint interface. This is really the VTEP, so the name is sort of confusing, but I believe it’s also used for other layer 2 virtualization/encapsulation protocols on other devices so I suppose it fits. This interface is very similar to any other logical interface: we tie it to our loopback source, and then associate VNIs to it. Critically, the “overlay” VNI is associated and has an “associate-vrf” command. My understanding is that you could associate multiple tenants (or overlay VLANs/VXLANs) to a single NVE, but for purposes of this deployment we only had a single tenant so I can’t confirm that. In addition to associating the overlay VNI, all other VNIs are associated and assigned a multicast group (recall that multicast is still used to handle BUM traffic).
VXLAN to the Real World:
Up till now we’ve talked about how to make VXLAN work generically, and the EVPN portion allows VXLAN to VXLAN routing (within a tenant at least — didn’t get to try/test between tenants although I assume it’s only a matter of route-target tweaking), but how does all of this magic escape the overlay and get into the real world?! Turns out it’s a bit confusing, but I’ll do my best to make sense of it (assuming I understand it completely, which I’m not entirely sure I get all the details).
First and foremost, it seems that there are several ways to skin the cat with routing out from a VXLAN EVPN tenant to the rest of the world. For our purposes, we created sub-interfaces on the routed links between the border leafs and the ‘core’ and associated each sub-interface with our tenant VRF. Once on the core, you could choose to continue running VRF-lite, perhaps up to a firewall where you could maintain policy between tenants, or you could pipe that traffic out to the tenant over some other connection… totally up to you. For our purposes, we can simply assume that from the core we received routes from the rest of the tenant and we advertised our SVIs for the VXLANs into the core via OSPF.
This isn’t quite enough though! It works to get your border leafs talking out to the rest of the world, but the other leafs wouldn’t have enough information to route out at this point. So in the OSPF process under the VRF, BGP gets redistributed into OSPF, and in BGP under the VRF, OSPF is redistributed into BGP; additionally, the “advertise l2vpn evpn” command is configured. This is where things get a bit fuzzy for me… what I believe is happening here is that OSPF prefixes are getting piped into BGP and the “advertise l2vpn evpn” command basically takes those prefixes and dumps them into the EVPN address-family. This essentially makes the border leafs show up as the VTEPs (next hops) for any external routes.
I’m less sure what the redistribution into OSPF is for. I believe that it would make more sense in larger scale environments. In our deployment the border leafs maintained ALL VXLANs (they were anycast gateways for all bridge-domains) and therefore already advertised all of these subnets up to the core via OSPF. For deployments where not every VXLAN exists on the border leafs, I think the redistribution from BGP to OSPF is meant to announce the segments that are not on the border leafs to the rest of the world.
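The border leaf pieces described above could be sketched roughly like this. The port, dot1q tag, addressing, and route-map name are all hypothetical, and the wide-open route-map is only there because NX-OS wants a route-map on redistribution; a real deployment would want proper filtering to avoid redistribution loops:

```
! Parent routed link toward the core (placeholder port)
interface Ethernet1/48
  no switchport
  no shutdown

! Sub-interface placed in the tenant VRF
interface Ethernet1/48.100
  encapsulation dot1q 100
  vrf member VRF-Tenant-1
  ip address 198.51.100.1/30
  ip router ospf 1 area 0
  no shutdown

! Permit-everything route-map, for illustration only
route-map PERMIT-ALL permit 10

router ospf 1
  vrf VRF-Tenant-1
    redistribute bgp 65535 route-map PERMIT-ALL

router bgp 65535
  vrf VRF-Tenant-1
    address-family ipv4 unicast
      advertise l2vpn evpn
      redistribute ospf 1 route-map PERMIT-ALL
```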
So, without further ado, let’s move on to some configuration examples. These are sanitized versions of real life working configurations, so they should be pretty spot on 🙂
We’ll start with the relevant spine configurations since they’re the simplest piece of all of this. This is an example configuration for Spine-1 (it would be very similar on Spine-2):
    # Enable Features
    #
    nv overlay evpn
    feature ospf
    feature bgp
    feature pim
    #
    # /Enable Features
    # Configure loopbacks
    #
    interface loopback0
      ip address 126.96.36.199/32
      ip router ospf 1 area 0
      ip pim sparse-mode
    interface loopback1
      ip address 188.8.131.52/32
      ip router ospf 1 area 0
      ip pim sparse-mode
    #
    # /Configure loopbacks
    # Required PIM/RP Anycast Configuration
    # Note: loopback1 is the dedicated anycast loopback
    # Note: the multicast ranges in the group-lists should encompass the multicast ranges used for the VNIs
    # Note: 184.108.40.206 is our example loopback0 on spine-1, 220.127.116.11 is our example loopback on spine-2
    # Note: 18.104.22.168 is used as our anycast RP, and loopback1 IP address
    #
    ip pim rp-address 22.214.171.124 group-list 126.96.36.199/8
    ip pim rp-candidate loopback1 group-list 188.8.131.52/8
    ip pim anycast-rp 184.108.40.206 220.127.116.11
    ip pim anycast-rp 18.104.22.168 22.214.171.124
    #
    # /Required PIM/RP Anycast Configuration
    # BGP Configuration
    #
    router bgp 65535
      router-id 126.96.36.199
      address-family l2vpn evpn
        retain route-target all
      neighbor 188.8.131.52 remote-as 65535
        update-source loopback0
        address-family l2vpn evpn
          send-community both
          route-reflector-client
      neighbor [copy above neighbor entry for all remaining leaf nodes]
        update-source loopback0
        address-family l2vpn evpn
          send-community both
          route-reflector-client
      neighbor 184.108.40.206 remote-as 65535
        update-source loopback0
        address-family l2vpn evpn
          send-community both
          route-reflector-client
    #
    # /BGP Configuration
That’s pretty much it for the spine nodes, nice and simple. Of course they require all the fabric links to be routed, along with all the associated IP address configuration, but from a VXLAN perspective these are just dumb transit boxes.
Moving on, here is a configuration example for a leaf node:
    # Enable Features
    #
    nv overlay evpn
    feature ospf
    feature bgp
    feature pim
    feature interface-vlan
    feature vpc
    #
    # /Enable Features
    # Configure loopback
    #
    interface loopback0
      ip address 220.127.116.11/32
      ip address 18.104.22.168/32 secondary
      ip router ospf 1 area 0
      ip pim sparse-mode
    #
    # /Configure loopback
    # Required PIM/RP Anycast Configuration
    #
    ip pim rp-address 22.214.171.124 group-list 126.96.36.199/8
    #
    # /Required PIM/RP Anycast Configuration
    # Overlay VLAN Configuration
    #
    vlan 1000
      vn-segment 5096
    interface Vlan1000
      no ip address
      vrf member VRF-Tenant-1
      ip forward
      no shutdown
    vrf context VRF-Tenant-1
      vni 5096
      rd 5096:5096
      address-family ipv4 unicast
        route-target both 5096:5096
        route-target both 5096:5096 evpn
    #
    # /Overlay VLAN Configuration
    # Example "Access" Bridge Domain Configuration
    #
    vlan 1001
      vn-segment 5097
    interface Vlan1001
      vrf member VRF-Tenant-1
      ip address 192.168.1.1/24
      fabric forwarding mode anycast-gateway
      no shut
    #
    # /Example "Access" Bridge Domain Configuration
    # NVE Interface Configuration
    #
    interface nve1
      no shutdown
      source-interface loopback0
      host-reachability protocol bgp
      member vni 5096 associate-vrf
      member vni 5097
        suppress-arp
        mcast-group 188.8.131.52
    #
    # /NVE Interface Configuration
    # BGP Configuration
    # Note: The tenant must be configured in BGP to advertise l2vpn
    # Note: The evpn configuration stanza is new -- this associates the overlay VLAN/VRF to MP-BGP
    #
    router bgp 65535
      router-id 184.108.40.206
      address-family l2vpn evpn
        retain route-target all
      neighbor 220.127.116.11 remote-as 65535
        update-source loopback0
        address-family l2vpn evpn
          send-community both
      neighbor [copy above neighbor entry for all remaining spine nodes]
        update-source loopback0
        address-family l2vpn evpn
          send-community both
      vrf VRF-Tenant-1
        address-family ipv4 unicast
          advertise l2vpn evpn
    evpn
      vni 5096 l2
        rd auto
        route-target both auto
    #
    # /BGP Configuration
That is a lot to digest!
I’ll wrap this up now and hopefully follow up shortly with some verification and troubleshooting, but one last piece of advice… make sure you run 7.0(3)I1(1b) on the Nexus 9000s. We can’t be sure yet, but it seems like we had fully functional configurations on 7.0(3)I1(1) but we were not operational. There were some weird quirks with devices encapsulating things but not de-encapsulating things. It’s highly possible we had some misconfigurations but I can’t confirm that as of now. Lastly, the configuration guides for making this happen right now are in a pretty shabby state. They’ve got most of the meat, and if you combine them with the latest release notes you may get most of the way there, but probably not all the way. So be careful, follow this doc, follow the guides, and work with TAC; they were able to help work through this with us and make it successful. I’m very excited to have been able to roll this out though. Despite the challenges, this is probably the coolest deployment I’ve been a part of. I really think that all data centers in five years’ time will be running Spine/Leaf designs and VXLAN at some level (ACI, NSX, or CLI configured VXLAN), so this was pretty sweet to get things rolling early!
Finally, huge thanks to my super awesome post-sales co-worker who spent a LONG time on the phone with TAC getting everything dialed in properly and fully functional!