Nexus 9000 VXLAN – EVPN, vPC, and VXLAN Routing

Yet another long delay between posts, but this one is worth the wait! I got to assist my super badass co-worker on a Nexus 9000 VXLAN EVPN deployment this past week, and what an adventure it was… there were ups and downs, and long nights in the data center (I feel bad since it was much worse for my co-worker!), far too much Cisco TAC hold music, and even some beer! Without further ado, here we go…

As you may know, recent (very recent) Nexus 9000 code supports leveraging the EVPN address-family in BGP as a control plane for VXLAN. That’s cool by itself, but the really cool part about these recent releases is that we can now effectively do VXLAN routing. Prior to this, VXLAN was only useful for stretching layer 2 around; at some point you would have to bridge that VXLAN (bridge domain) into a VLAN, trunk it out to an SVI for routing, and then, if your packet was destined for another VXLAN, the device would trunk back down into that second VLAN and bridge you back into the second VXLAN… so basically you had to hairpin all traffic like crazy to route between VXLANs. Hopefully this drawing will help to illustrate what I’m talking about:

Non-Routed VXLAN Annotated

In the above drawing we have a basic Spine/Leaf network with two bridge-domains, 4097 and 4098. Our lovely user in 4097 has a flow that is destined for the other user in 4098. Here’s how this would have worked before:

  1. The host sends a frame to the leaf node(s) he’s connected to — from the host’s perspective this is just normal Ethernet traffic. The frame hits the leaf(s) and the VLAN is associated to a VXLAN (VNI). The destination for this frame would be the default gateway for the domain — in our case that’s in the ‘Core Block.’ The leaf(s) encapsulate the frame with an L3 destination of the Border leafs’ VTEP address, and the packet (L3 encapsulated at this point) is sent along its way via some ECMP routing to the spines.
  2. The Spines are just big dumb routers, so they receive the encapsulated packet, do a lookup for the destination IP address and see that it must go to the Border leaf(s), and so off they go, routing the packet away.
  3. The Border leaf(s) see that they are the destination for the packet, realize it’s a VXLAN packet, and de-encapsulate the frame — now we are back to an L2 frame… they investigate the frame and see that the destination MAC address is that of the default gateway, which they have learned exists on the ‘Core Block.’ The frame is now switched up to the default gateway like normal.
  4. The Core Block receives the frame, realizes it’s the destination, and then investigates what to do with the packet… it sees that the destination IP address exists on another VLAN that it owns (remember, this is just normal VLAN/Ethernet at this point — the Core Block doesn’t know VXLAN exists at all). The packet information stays the same — it’s still looking for the same destination IP address in VXLAN 4098 — but the destination Ethernet address is changed to that of the lovely lady in VXLAN 4098.
  5. Now that we know that the frame needs to go to the destination MAC address of the lady in 4098, the Core switches trunk the frame along to where they have learned that MAC address lives — which happens to be back to the Border leafs.
  6. The Border leafs get the frame from the Core Block and perform a MAC lookup — they see that the MAC is associated to the VTEP address of some other leaf switch, and encapsulate the frame again. The packet gets the destination IP address of the leaf that owns the destination MAC address for the frame… and again off we go to the spine switches.
  7. Again, the spine switches are just doing basic routing — they see they have a packet that is destined for some leaf switch and route the packet along its way.
  8. Finally, the destination leaf switch gets the packet, removes the VXLAN encapsulation, sees the destination MAC, bridges the VXLAN into the VLAN, and switches the frame to its final destination.

So, as you can see, there is a lot going on, and if you had a bunch of VXLAN-to-VXLAN traffic flows you would have a ton of hairpinning going on through the Core Block. Not exactly the picture of efficiency! Being able to route between bridge-domains is obviously pretty critical to getting this to scale appropriately… we can now do that on 9ks with EVPN! So, let’s move on.

Very quickly, here is an overview of the topology that we will be talking about — host names and IPs have been changed to protect the innocent. This is a very simple Spine/Leaf design with Nexus 9396 Leaf nodes and 9332 Spine nodes. There also exists a strictly L3 routed block outside of the Spine/Leaf environment — we’ll talk more about that later — these two switches are also 9396s.

Overview


Base Config:

Before we can go crazy with VXLAN, we have some basics that we need to get configured first to support this. The first and most obvious is that we need to have an IGP in the fabric. The IGP can be whatever you’d like; we just need IP reachability across the fabric — my personal choice would be IS-IS, but in this case we’re using OSPF. I won’t go into the details of configuring that since it’s quite simple, but common-sense stuff like point-to-point network types on the routed links to speed things up would be good. We also need to ensure that we have loopbacks on all of our devices, and that these loopbacks are being advertised into the IGP.
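For illustration, here is a minimal sketch of that base IGP configuration on a single node; the interface numbers and addressing are made up for this example, not taken from the actual deployment:

# Example IGP Base Configuration
# Note: interface numbers and addressing are illustrative only
# Note: PIM is enabled here because we will need multicast shortly (see below)
#
interface Ethernet1/1
  description Fabric link
  no switchport
  ip address 10.0.0.1/30
  ip ospf network point-to-point
  ip router ospf 1 area 0
  ip pim sparse-mode
  no shutdown
interface loopback0
  ip address 5.5.5.5/32
  ip router ospf 1 area 0
  ip pim sparse-mode
#
# /Example IGP Base Configuration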

Features — because this is on a Nexus box, we must enable features; the features we used in this deployment were:

  • ospf
  • bgp
  • pim
  • interface-vlan
  • vn-segment-vlan-based
  • nv overlay

Sort of feature-specific — it is also required to configure “nv overlay evpn” to enable the EVPN overlay.

The next foundational piece that we need to have is multicast. Even though we now have a BGP control-plane for VXLAN, we still need multicast to support BUM traffic — that is, Broadcast, Unknown unicast, and Multicast replication at layer 2. This has been a key piece of how VXLAN supports L2 over L3 since the initial draft, so I won’t delve any further into it here. Cisco also recommends using an anycast rendezvous point, with the spine switches acting as the RPs for the environment. Pretty straightforward configuration — we’ll cover specifics in a bit.

Finally, for the base configs, we need to stand up the skeleton of the EVPN configuration. If you’ve done BGP, you can knock this out nice and easy! The spine switches are route-reflectors for the fabric (and clients of each other), and all leafs peer with each spine. You do not need any v4 BGP; simply activate the EVPN address-family… again, configs coming in a bit.

VXLAN Config:

Next up: deploying the EVPN control-plane requires the use of the Anycast Gateway feature — meaning all leafs act as the default gateway for every bridge-domain — this is pretty much the same as FabricPath anycast in theory. All leaf nodes are configured with a single command that defines the fabric anycast gateway MAC address. I’m not sure if there is a “best practice” here, so we simply selected a MAC address from an IANA reserved range — since this MAC doesn’t exist anywhere but in the fabric, this should be okay. In addition to this command, every VXLAN-enabled VLAN SVI is configured with the fabric forwarding mode anycast-gateway command.
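As a hedged example of what those two commands look like (the MAC address below is just an example value from the IANA documentation range, not a recommendation, and the SVI is the one from the examples later in this post):

# Example Anycast Gateway Configuration
# Note: the MAC address is an example value only -- pick your own
#
fabric forwarding anycast-gateway-mac 0000.5e00.5301
interface Vlan1001
  fabric forwarding mode anycast-gateway
#
# /Example Anycast Gateway Configuration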

Next up is where we start to get into the real meat of it! One critical piece of this EVPN puzzle is that every tenant in the fabric requires an “overlay” VLAN. My understanding is that MP-BGP essentially uses this to loop packets through the internal switching architecture at the leaf nodes, as opposed to shipping all traffic out to a remote device for routing functionality as explained above. So to that end, we have a VRF and a VLAN that are assigned as an “overlay” VLAN. This VLAN is tied to a vn-segment and a VRF, and basically defines a tenant. All other VXLANs that live within this tenant basically roll up to this VLAN — it’s kind of a container for the tenant. This will probably make more sense later in the configuration section!

As mentioned, the overlay VLAN is associated with a VRF — the VRF is then configured with the usual route-distinguisher and route-targets — the route-targets in this case carry the additional keyword “evpn,” which I believe associates them with the EVPN address-family in BGP.

For “host” VXLANs — bridge-domains that will actually have devices in them — the configuration is much simpler. The layer 2 VLAN is simply assigned a VNI (VXLAN Network Identifier), and the layer 3 interface (SVI) is attached to the tenant VRF previously discussed.

Now, we have some vPC-specific configurations. The “overlay” VLAN must be allowed over the peer-link. I believe this is because it is referenced in the NVE configuration (up next) and is basically tied to the VTEP to associate each tenant, so it makes sense that the VLAN would have to traverse the peer-link. Additionally, and kind of related to the vPC, each vPC member must have its own unique loopback address; however, for purposes of anycast across the domain, both members must have the SAME secondary IP address configured on the loopback that is used as the NVE source.
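Here is a rough sketch of those vPC-specific pieces on one leaf; the port-channel number is hypothetical, and the loopback values match the leaf example later in this post:

# Example vPC-Specific Configuration
# Note: the port-channel number is illustrative; VLAN 1000 is the "overlay" VLAN from the examples below
# Note: the primary loopback IP is unique per vPC member; the secondary is shared by both members
#
interface port-channel10
  switchport mode trunk
  switchport trunk allowed vlan add 1000
  vpc peer-link
interface loopback0
  ip address 5.5.5.5/32
  ip address 5.5.5.255/32 secondary
#
# /Example vPC-Specific Configuration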

Finally, we have the NVE — Network Virtualization Edge — interface. This is really the VTEP, so the name is sort of confusing, but I believe it’s also used for other layer 2 virtualization/encapsulation protocols on other devices, so I suppose it fits. This interface is very similar to any other logical interface: we tie it to our loopback source and then associate VNIs to it. Critically, the “overlay” VNI is associated with the “associate-vrf” keyword. My understanding is that you could associate multiple tenants (or overlay VLANs/VXLANs) to a single NVE, but for purposes of this deployment we only had a single tenant, so I can’t confirm that. In addition to associating the overlay, all other VNIs are associated and assigned a multicast group — recall that multicast is still used to handle BUM traffic.

VXLAN to the Real World:

Up till now we’ve talked about how to make VXLAN work generically, and the EVPN portion allows VXLAN-to-VXLAN routing (within a tenant at least — I didn’t get to try/test between tenants, although I assume it’s only a matter of route-target tweaking), but how does all of this magic escape the overlay and get into the real world?! Turns out it’s a bit confusing, but I’ll do my best to make sense of it (assuming I understand it completely, which I’m not entirely sure I do).

First and foremost — it seems there are several ways to skin the cat when routing out from a VXLAN EVPN tenant to the rest of the world. For our purposes, we created sub-interfaces on the routed links between the border leafs and the ‘core’ and associated those sub-interfaces with our tenant VRF. Once on the core, you could choose to continue running VRF-lite, perhaps up to a firewall where you could maintain policy between tenants, or you could pipe that traffic out to the tenant over some other connection… totally up to you. For our purposes, we can simply assume that from the core we received routes for the rest of the tenant, and we advertised our SVIs for the VXLANs into the core via OSPF.

This isn’t quite enough though! It works to get your border leafs talking out to the rest of the world, but the other leafs wouldn’t have enough information to route out at this point. In the OSPF process under the VRF, BGP gets redistributed into OSPF, and in BGP under the VRF, OSPF is redistributed into BGP — additionally, the “advertise l2vpn evpn” command is configured. This is where things get a bit fuzzy for me… what I believe is happening here is that OSPF prefixes are getting piped into BGP, and the “advertise l2vpn evpn” command basically takes those prefixes and dumps them into the EVPN address-family… this essentially makes the border leafs get mapped as the VTEPs for any external routes.

I’m less sure what the redistribution into OSPF is for — I believe that in larger-scale environments it would make more sense. In our deployment the border leafs maintained ALL VXLANs (anycast gateway for all bridge-domains) and therefore were already advertising all of these subnets up to the core via OSPF. For deployments where all VXLANs don’t exist on the border leafs, I think the redistribution from BGP to OSPF is meant to announce the segments not on the border leafs to the rest of the world.
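To make that concrete, here is a hedged sketch of what that border leaf configuration could look like. The sub-interface numbering, dot1q tag, OSPF instance name, and route-map name are all made up for illustration (also note that NX-OS wants a route-map attached when redistributing):

# Example Border Leaf External Routing Configuration
# Note: interface, dot1q tag, OSPF instance, and route-map names are illustrative
#
interface Ethernet2/1
  no switchport
  no shutdown
interface Ethernet2/1.100
  encapsulation dot1q 100
  vrf member VRF-Tenant-1
  ip address 172.16.0.1/30
  ip router ospf EXTERNAL area 0
  no shutdown
route-map RM-PERMIT-ALL permit 10
router ospf EXTERNAL
  vrf VRF-Tenant-1
    redistribute bgp 65535 route-map RM-PERMIT-ALL
router bgp 65535
  vrf VRF-Tenant-1
    address-family ipv4 unicast
      redistribute ospf EXTERNAL route-map RM-PERMIT-ALL
      advertise l2vpn evpn
#
# /Example Border Leaf External Routing Configuration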

Configuration Examples:

So, without further ado, let’s move on to some configuration examples. These are sanitized versions of real-life working configurations, so they should be pretty spot on 🙂

We’ll start with the relevant spine configuration since it’s the simplest piece of all of this. This is an example configuration for Spine-1 — it would be very similar on Spine-2:

# Enable Features
#
nv overlay evpn
feature ospf
feature bgp
feature pim
#
# /Enable Features

# Configure loopbacks
#
interface loopback0
 ip address 1.1.1.1/32
 ip router ospf 1 area 0
 ip pim sparse-mode
interface loopback1
 ip address 3.3.3.3/32
 ip router ospf 1 area 0
 ip pim sparse-mode
#
# /Configure loopbacks

# Required PIM/RP Anycast Configuration
# Note: loopback1 is the dedicated anycast loopback
# Note: the multicast ranges in the group-lists should encompass the multicast ranges used for the VNIs
# Note: 1.1.1.1 is our example loopback0 on spine-1, 2.2.2.2 is our example loopback0 on spine-2
# Note: 3.3.3.3 is used as our anycast RP, and loopback1 IP address
#
ip pim rp-address 3.3.3.3 group-list 239.0.0.0/8
ip pim anycast-rp 3.3.3.3 1.1.1.1
ip pim anycast-rp 3.3.3.3 2.2.2.2
#
# /Required PIM/RP Anycast Configuration
# BGP Configuration
#
router bgp 65535
  router-id 1.1.1.1
  address-family l2vpn evpn
    retain route-target all
  neighbor 5.5.5.5 remote-as 65535
    update-source loopback0
    address-family l2vpn evpn
      send-community both
      route-reflector client
  neighbor [copy above neighbor entry for all remaining leaf nodes]
    update-source loopback0
    address-family l2vpn evpn
      send-community both
      route-reflector client
  neighbor 2.2.2.2 remote-as 65535
    update-source loopback0
    address-family l2vpn evpn
      send-community both
      route-reflector client
#
# /BGP Configuration

That’s pretty much it for the spine nodes — nice and simple. Of course they require all the fabric links to be routed and all the other associated IP address configuration, but from a VXLAN perspective these are just dumb transit boxes.

Moving on, here is a configuration example for a leaf node:

# Enable Features
#
nv overlay evpn
feature ospf
feature bgp
feature pim
feature interface-vlan
feature vn-segment-vlan-based
feature nv overlay
feature vpc
#
# /Enable Features

# Configure loopback
#
interface loopback0
 ip address 5.5.5.5/32
 ip address 5.5.5.255/32 secondary
 ip router ospf 1 area 0
 ip pim sparse-mode
#
# /Configure loopback

# Required PIM/RP Anycast Configuration
#
ip pim rp-address 3.3.3.3 group-list 239.0.0.0/8
#
# /Required PIM/RP Anycast Configuration
# Overlay VLAN Configuration
#
vlan 1000
  vn-segment 5096
interface Vlan1000
  no ip address
  vrf member VRF-Tenant-1
  ip forward
  no shutdown
vrf context VRF-Tenant-1
  vni 5096
  rd 5096:5096
  address-family ipv4 unicast
    route-target both 5096:5096
    route-target both 5096:5096 evpn
#
# /Overlay VLAN Configuration

# Example "Access" Bridge Domain Configuration
# Note: the global anycast gateway MAC below is an example value -- it is defined once per leaf
#
fabric forwarding anycast-gateway-mac 0000.5e00.5301
vlan 1001
  vn-segment 5097
interface Vlan1001
  vrf member VRF-Tenant-1
  ip address 192.168.1.1/24
  fabric forwarding mode anycast-gateway
  no shutdown
#
# /Example "Access" Bridge Domain Configuration

# NVE Interface Configuration
#
interface nve1
  no shutdown
  source-interface loopback0
  host-reachability protocol bgp
  member vni 5096 associate-vrf
  member vni 5097
    suppress-arp
    mcast-group 239.1.1.1
#
# /NVE Interface Configuration

# BGP Configuration
# Note: The tenant must be configured in BGP to advertise l2vpn
# Note: The evpn configuration stanza is new -- each layer 2 VNI gets an entry here, tying it into MP-BGP
#
router bgp 65535
  router-id 5.5.5.5
  address-family l2vpn evpn
    retain route-target all
  neighbor 1.1.1.1 remote-as 65535
    update-source loopback0
    address-family l2vpn evpn
      send-community both
  neighbor [copy above neighbor entry for all remaining spine nodes]
    update-source loopback0
    address-family l2vpn evpn
      send-community both
  vrf VRF-Tenant-1
    address-family ipv4 unicast
      advertise l2vpn evpn
  evpn
    vni 5097 l2
      rd auto
      route-target both auto
#
# /BGP Configuration

That is a lot to digest!

I’ll wrap this up now and hopefully follow up shortly with some verification and troubleshooting, but one last piece of advice… make sure you run 7.0(3)I1(1b) on the Nexus 9000s. We can’t be sure yet, but it seems like we had fully functional configurations on 7.0(3)I1(1) but we were not operational. There were some weird quirks with devices encapsulating things but not de-encapsulating things. It’s highly possible we had some misconfigurations, but I can’t confirm that as of now. Lastly, the configuration guides for making this happen are in a pretty shabby state right now. They’ve got most of the meat, and if you combine them with the latest release notes you may get most of the way there, but probably not all the way — so be careful, follow this doc, follow the guides, and work with TAC; they were able to help work through this with us and make it successful. I’m very excited to have been able to roll this out, though — despite the challenges this is probably the coolest deployment I’ve been a part of. I really think that in five years’ time all data centers will be running Spine/Leaf designs and VXLAN at some level (ACI, NSX, or CLI-configured VXLAN), so this was pretty sweet to get things rolling early!

Finally, huge thanks to my super awesome post-sales co-worker who spent a LONG time on the phone with TAC getting everything dialed in properly and fully functional!


Why is Clos (Spine/Leaf) the thing to do in the Data Center?

I’ve been talking to lots of customers about Data Center refreshes and greenfield Data Center builds lately. A lot of these folks have a very ‘traditional’ Data Center design — Nexus 7k/5k/2k in a ‘fat tree’ type topology. This is a completely valid design, obviously, and has worked in the Data Center for quite a while, but there is a better way! (Maybe!?)

First let’s talk about why/how/when this whole Clos thing came about. Turns out there was this smart dude back in the 1950s named Charles Clos. He created this magical topology for the internal architecture of phone switching equipment. There’s lots of stuff on the intertubes about this dude and his design; for our purposes we just need to know that he essentially created a non-blocking internal fabric. This was all well and good, and I presume it stuck around in phone-land and did awesome things there. Where it gets interesting (to me at least) is that this architecture reappeared in the land of networking sometime in the 90s, when some smart people (I assume, at least) decided that they could leverage this design in internal Ethernet switching infrastructure. All of this is basically to say that this whole topology has been around a while.

Things get even more interesting when people started realizing that they could solve a lot of problems modern data centers have by implementing this very same topology in the physical layout of the data center. So what problems does this new topology really solve for us? I’m sure there are varying opinions on the matter, but here are my top three problems that a Spine/Leaf network helps solve in the DC:

  1. Explosion of East/West traffic — As applications have become more and more complex, a single web server may need to query a whole bunch of app servers/databases in order to fill out a web page. Consider a utility company — you want to log onto their website to pay your bill. You head on over to the company’s webpage, generic website things start to load, no big deal. You log in — now the webpage has to query some database containing user information, then it’s got to go and pull information for your electricity usage, gas usage, perhaps water/sewer as well. This is perhaps a terrible example, but you get my point I hope! Essentially you have a very small amount of traffic coming in to the web server (North/South traffic) — essentially just the initial query — but potentially a LOT of stuff happening on the backend between the web server and other servers/services in the data center (East/West traffic). In a traditional 7k/5k/2k data center design, you may have these servers/services physically located in different rows or racks, causing the inter-server communication to go from its 2k, up to its 5k (going to happen anyway unless you’re running Nexus 2300 series FEX, I suppose), then up to the 7k, back down to another pair of 5k EoR/ToR switches, down to a 2k and back… rinse/repeat for as many transactions as the website requires. This creates a lot of traffic that’s going over (probably) vPC links… vPCs can of course only be run across a PAIR of switches, so at some point you will end up not being able to add additional bandwidth, or you will run out of ports between 5k/7k (obviously this would be at a pretty crazy scale, but still!). In a Spine/Leaf design, adding bandwidth is as simple as adding another Spine switch (because of reasons outlined in problems 2 and 3). You can eliminate the ‘vPC problem’ of only having a pair of devices up-stream, and you can also eliminate the middle ‘distribution’ tier of 5ks if you want (you can still run 2ks off of ToR/EoR Leafs in a Leaf/Spine design if you are so inclined). Because of these last two points, we can more efficiently leverage the bandwidth available, and do so without having to have a million ports on a pair of ‘core’ type devices, since we can just add (cheaper, see next item) Spine devices at will.
  2. The ‘God Box’ Problem(s) — I have opinions. They’re always right. Always. Some people disagree. Those people are wrong! (Kidding… sort of…) Anyway, one of the central tenets of network design to me is always to build distributed systems. I always point to the Internet itself as the ultimate case in point that distributed systems work, and work well (all Internet problems aside, look at that thing — it’s huge, and it’s even working!). Why am I babbling about this? Because I feel strongly that a pair of Nexus 7ks represents a kind of ‘God Box’ in the Data Center. These boxes (not just 7ks obviously, but any DC ‘cores’) become so completely intertwined in the Data Center that they become hard to perform maintenance on: they probably house the SVIs for ALL your DC VLANs, they probably have load balancers connected to them, they’ve probably got other critical devices connected to them, and ALL your traffic flows through them. In a Spine/Leaf topology, there isn’t really a concept of a ‘core’; there are just Spines and Leafs. Spines don’t do anything but provide ECMP paths for traffic between your Leafs. This makes them essentially just ‘dumb’ boxes performing L2 or L3 magic (Trill/FabricPath/ECMP routing). It’s very easy to add or remove Spines because traffic can simply swing between the nodes by costing out links on a Spine you would like to perform maintenance on. Leaf nodes are perhaps a little more complex, but they are usually deployed in pairs and have a much smaller zone of impact. Leafs tend to be ToR switches, so even if you totally bugger up both Leafs in a pair, you are only directly impacting a single rack’s worth of end hosts — not a whole DC. So, basically what I’m saying is that I believe a Leaf/Spine topology promotes a ‘God Box’-free data center… I’m a big fan of that! The other great thing that Leaf/Spine networks promote, simply by the nature of the topology, is commoditized infrastructure in terms of both features and hardware. In the tree topology, we have to funnel everything up to the DC cores with a bunch of fancy features (expensive!) — adding bandwidth means adding links, and adding links maybe means adding cards… all this leads to having to get lots of density into those complicated God boxes. Each Spine in a Spine/Leaf design needs only enough ports to have ONE connection to each Leaf node (you can add more if you’d like, though), which allows us to have less dense (and potentially cheaper overall — less modular, maybe single-SUP boxes instead of dual, etc.) Spines. Adding bandwidth is much simpler, because we can just add physical Spine nodes. Consider a DC that wants to have 32 ToR switches connected at 40g, with 160g going to each Leaf switch. In our 7k example, we would need to have 64 40g ports available on each 7k, but in a Spine/Leaf topology, we can have 4x Nexus 9332 switches. Lastly, not only can we have smaller, cheaper hardware, we are moving toward a simpler overall design feature-wise.
  3. Failure Domains — while a classical tree topology data center doesn’t necessarily mean there are large layer 2 domains, in my experience that tends to be the case. We’ve tried to eliminate spanning-tree, but it doesn’t die easily. vPCs are a pretty solid band-aid to cover up spanning-tree issues, but they still don’t eliminate spanning-tree completely. While a Spine/Leaf topology doesn’t automatically eliminate L2 in the data center (in fact, L2 Spine/Leaf designs with FabricPath/Trill are totally viable), I feel that it does promote an L3 ECMP fabric. In most vendor Spine/Leaf reference architectures, VXLAN is used extensively to provide L2 adjacency across the fabric. This of course facilitates a totally routed fabric, which means our failure domains are pushed all the way to the edge of the fabric!

At this point, oftentimes people will say: “Well, we will never need this type of scalability” and/or “Our traditional design has been working just fine.” Well… maybe in your environment a traditional topology is just fine, but for the vast majority of customers that I see, the explosion of East/West traffic and the desire to move toward a ‘software defined data center’ (a la Spine/Leaf and VXLAN, oftentimes) is driving this topology no matter what. I also don’t think that leveraging a Spine/Leaf topology means you automatically need to build it out to support 32 leafs — a single pair of Spine switches and two pairs of Leafs is often what I’m seeing deployed. The commoditized nature of the platforms that fit into this type of topology means that buying those 6 switches often costs less (sometimes by a BIG margin) than buying a pair of traditional monolithic data center core switches. The last issue that inevitably arises is: “I need to vMotion — what about my L2?” I think we all hate this statement… build better-designed applications and you probably wouldn’t need to vMotion! Since that’s not happening fast enough, though, we have VXLAN. We can still provide L2 anywhere in the fabric, and yet we don’t need to rely on spanning-tree kludges.

All in all, I’m a fan and have been promoting Spine/Leaf topologies and their benefits wherever I can! I’d be interested to hear others’ thoughts — are you seeing customers adopt a Spine/Leaf data center design, or are people still holding out, and if so, for what reasons?