Nexus 9000 VXLAN – EVPN, vPC, and VXLAN Routing

Yet another long delay between posts, but this one is worth the wait! I got to assist my super bad ass co-worker on a Nexus 9000 VXLAN EVPN deployment this past week, and what an adventure it was… there were ups and downs, and long nights in the data center (I feel bad since it was much worse for my co-worker!), far too much Cisco TAC hold music, and even some beer! Without further ado, here we go….

As you may know, recent (very recent) Nexus 9000 code supports leveraging the EVPN address-family in BGP as a control plane for VXLAN. That’s cool by itself, but the really cool part about these recent releases is that we can now effectively do VXLAN routing. Prior to this, VXLAN was only useful for stretching layer 2 around; at some point you would have to bridge that VXLAN (bridge domain) into a VLAN, trunk it out to an SVI for routing, and then, if your packet was destined for another VXLAN, trunk back down into that second VLAN and bridge back into the second VXLAN… so basically you had to hairpin all traffic like crazy to route between VXLANs. Hopefully this drawing will help to illustrate what I’m talking about:

Non-Routed VXLAN Annotated

 
In the above drawing we have a basic Spine/Leaf network with two bridge-domains, 4097 and 4098. Our lovely user in 4097 has a flow that is destined for the other user in 4098. Here’s how this would have worked before:

  1. The host sends a frame to the leaf node(s) that he’s connected to — from the host perspective this is just Ethernet traffic like normal. The frame hits the leaf(s) and the VLAN is associated to a VXLAN (VNI). The destination for this frame would be the default gateway for the domain — in our case that’s in the ‘Core Block.’ The leaf(s) encapsulate the frame with an L3 destination of the Border leaf(s) VTEP address and the packet (L3 encapsulated at this point) is sent along its way via some ECMP routing to the spines.
  2. The Spines are just big dumb routers, so they receive the encapsulated packet, do a lookup for the destination IP address and see that it must go to the Border leaf(s), and so off they go, routing the packet away.
  3. The Border leaf(s) see that they are the destination for the packet, realize it’s a VXLAN packet, and de-encapsulate the frame — now we are back to an L2 frame… they investigate the frame and see that the destination MAC address is that of the default gateway, which they have learned exists on the ‘Core Block.’ The frame is now switched up to the default gateway like normal.
  4. The Core Block receives the frame, realizes it’s the destination, and then investigates what to do with the packet… it sees a destination IP address that exists on another VLAN that it owns (remember, this is just normal VLAN/Ethernet at this point — the Core Block doesn’t know VXLAN exists at all)… the packet information stays the same — it’s still looking for the same destination IP address in VXLAN 4098 — but the destination Ethernet address is changed to that of the lovely lady in VXLAN 4098.
  5. Now that we know that the frame needs to go to the destination MAC address of the lady in 4098, the Core switches trunk the frame along to where they have learned that the MAC address lives — which happens to be back to the Border leafs.
  6. The Border leafs get the frame from the Core Block and perform a MAC lookup — they see that the MAC is associated to the VTEP address of some other leaf switch, and encapsulate the frame again. The packet gets the destination IP address of the leaf that owns the destination MAC address for the frame…. again off we go to the spine switches.
  7. Again, the spine switches are just doing basic routing — they see they have a packet that is destined for some leaf switch and route the packet along its way.
  8. Finally, the destination leaf switch gets the packet, removes the VXLAN encapsulation, sees the destination MAC, bridges the VXLAN into the VLAN, and switches the frame to its final destination.

So, as you can see, there’s a lot going on, and if you had a bunch of VXLAN-to-VXLAN traffic flows you would have a ton of hair-pinning going on through the Core Block. Not exactly the picture of efficiency! Being able to route between bridge-domains is obviously pretty critical to getting this to scale appropriately… and we can now do that on 9ks with EVPN! So, let’s move on.

Very quickly, here is an overview of the topology that we will be talking about — host names and IPs have been changed to protect the innocent. This is a very simple Spine/Leaf design with Nexus 9396 Leaf nodes and 9332 Spine nodes. There also exists a strictly L3 routed block outside of the Spine/Leaf environment — we’ll talk more about that later — these two switches are also 9396s.

Overview

 

Base Config:

Before we can go crazy with VXLAN, we have some basics that we need to get configured first to support this. The first and most obvious is that we need to have an IGP in the fabric. The IGP can be whatever you’d like; we just need to have IP reachability across the fabric — my personal choice would be ISIS, but in this case we’re using OSPF. I won’t go into the details of configuring that since it’s quite simple, but common sense stuff like point-to-point network types on the routed links to speed things up would be good. We also need to ensure that we have loopbacks on all of our devices, and that these loopbacks are being advertised into the IGP.
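
To make that concrete, here is a minimal sketch of what a fabric-facing routed link might look like; the interface, addressing, MTU, and OSPF process tag below are illustrative placeholders, not pulled from the actual deployment:

# Hypothetical fabric uplink -- interface, IPs, and process tag are examples only
interface Ethernet1/49
 description Uplink to Spine-1
 no switchport
 mtu 9216
 ip address 10.0.0.1/31
 ip ospf network point-to-point
 ip router ospf 1 area 0.0.0.0
 ip pim sparse-mode
 no shutdown

The jumbo MTU is there because VXLAN adds roughly 50 bytes of encapsulation overhead, so the underlay links need headroom for it.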

Features — because this is on a Nexus box, we must enable features, the features we used in this deployment were:

  • ospf
  • bgp
  • pim
  • interface-vlan
  • vn-segment-vlan-based
  • nv overlay

Sort of feature-specific — it is also required to configure “nv overlay evpn” to enable the EVPN overlay.

The next foundational piece that we need to have is multicast. Even though we now have a BGP control-plane for VXLAN, we still need multicast to support BUM traffic — that is, Broadcast, Unknown Unicast, and Multicast replication at layer 2. This has been a key piece of how VXLAN supports L2 over L3 since the initial draft, so I won’t delve any further into that here. Cisco also recommends using an anycast rendezvous point, with the spine switches acting as the RPs for the environment. Pretty straightforward configuration — we’ll cover specifics in a bit.

Finally for the base configs, we need to stand up the skeleton of the EVPN configuration. If you’ve done BGP, you can knock this out nice and easy! The spine switches are route-reflectors for the fabric (and clients of each other), and all leafs peer to each spine. You do not need any v4 BGP; simply activate the EVPN address-family… again, configs coming in a bit.

VXLAN Config:

Next up, deploying the EVPN control-plane requires the use of the Anycast-Gateway feature — meaning all leafs act as the default gateway for every bridge-domain — this is pretty much the same as FabricPath anycast in theory. All leaf nodes are configured with a single command that defines the fabric anycast gateway MAC address. I’m not sure if there is a “best practice” here, so we simply selected a MAC address from an IANA reserved range — since this MAC doesn’t exist anywhere but in the fabric, this should be okay. In addition to this command, every VXLAN-enabled VLAN SVI is configured with the fabric forwarding mode anycast-gateway command.
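
As a quick sketch, the two pieces look something like this (the MAC address below is just a placeholder, and VLAN 1001 matches the example bridge-domain used later in this post):

# Fabric-wide anycast gateway MAC -- configured identically on every leaf (example value)
fabric forwarding anycast-gateway-mac 0000.2222.3333
# ...and on every VXLAN-enabled SVI:
interface Vlan1001
  fabric forwarding mode anycast-gateway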

Next up is where we start to get into the real meat of it! One critical piece of this EVPN puzzle is that every tenant in the fabric requires an “overlay” VLAN. My understanding of this is that MP-BGP is essentially used to loop packets through the internal switching architecture at the leaf nodes, as opposed to shipping all traffic out to a remote device for routing as explained above. So to that end, we have a VRF and a VLAN that are assigned as an “overlay” VLAN. This VLAN is tied to a vn-segment, a VRF, and basically defines a tenant. All other VXLANs that live within this tenant basically roll up to this VLAN — it’s kind of a container for the tenant. This will probably make more sense later in the configuration section!

As mentioned, the overlay VLAN is associated to a VRF — the VRF is then configured with the usual route-distinguisher and route-targets — the route-targets in this case have the additional key-word “evpn” which I believe associates them to the EVPN address-family in BGP.

For “host” VXLANs, or bridge-domains that will have devices in them, the configuration is much simpler. The layer 2 VLAN is simply assigned a VN-ID (VXLAN Network ID), and the layer 3 interface (SVI) is attached to the tenant VRF previously discussed.

Now, we have some vPC-specific configurations. The “overlay” VLAN must be allowed over the peer-link. I believe this is because it is referenced in the NVE configuration (up next) and is basically tied to the VTEP to associate each tenant, so it makes sense that the VLAN would have to traverse the peer-link. Additionally, and kind of related to the vPC, each vPC member must have its own unique loopback address; however, for purposes of anycast across the domain, both must have the SAME secondary IP address configured on the loopback that is used as the NVE source.
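
Since the vPC plumbing doesn’t appear in the full config examples below, here is a rough sketch of the relevant pieces; the domain ID, port-channel number, and keepalive addressing are placeholders, while VLAN 1000 matches the overlay VLAN used in the examples later:

# Hypothetical vPC sketch -- the overlay VLAN (1000) must be in the peer-link allowed list
vpc domain 10
  peer-keepalive destination 192.168.100.2 source 192.168.100.1 vrf management
interface port-channel1
  description vPC Peer-Link
  switchport mode trunk
  switchport trunk allowed vlan 1000-1001
  spanning-tree port type network
  vpc peer-link
# Each vPC peer keeps a unique primary loopback IP but shares the same secondary IP
# (the NVE source) -- see the leaf loopback0 example further down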

Finally, we have the NVE interface — the Network Virtualization Edge interface. This is really the VTEP, so the name is sort of confusing, but I believe it’s also used for other layer 2 virtualization/encapsulation protocols on other devices, so I suppose it fits. This interface is very similar to any other logical interface: we tie it to our loopback source and then associate VNIs to it. Critically, the “overlay” VNI is associated and has an “associate-vrf” keyword. My understanding is that you could associate multiple tenants (or overlay VLANs/VXLANs) to a single NVE, but for purposes of this deployment we only had a single tenant, so I can’t confirm that. In addition to associating the overlay, all other VNIs are associated and assigned a multicast group — recall that multicast is still used to handle BUM traffic.

 VXLAN to the Real World:

Up till now we’ve talked about how to make VXLAN work generically, and the EVPN portion allows VXLAN-to-VXLAN routing (within a tenant at least — we didn’t get to try/test between tenants, although I assume it’s only a matter of route-target tweaking), but how does all of this magic escape the overlay and get into the real world?! Turns out it’s a bit confusing, but I’ll do my best to make sense of it (assuming I understand it completely, which I’m not entirely sure of).

First and foremost — it seems there are several ways to skin the cat when routing out from a VXLAN EVPN tenant to the rest of the world. For our purposes, we created sub-interfaces on the routed links between the border leafs and the ‘core’ and associated each sub-interface with our tenant VRF. Once on the core, you could choose to continue running VRF-lite, perhaps up to a firewall where you could maintain policy between tenants, or you could pipe that traffic out to the tenant over some other connection… totally up to you. For our purposes, we can simply assume that from the core we received routes for the rest of the tenant, and we advertised our SVIs for the VXLANs into the core via OSPF.

This isn’t quite enough though! It works to get your border leafs talking out to the rest of the world, but traffic on the other leafs wouldn’t have enough information to route out at this point. In the OSPF process under the VRF, BGP gets redistributed into OSPF, and in BGP under the VRF, OSPF is redistributed into BGP — additionally, the “advertise l2vpn evpn” command is configured. This is where things get a bit fuzzy for me… what I believe is happening here is that OSPF prefixes are getting piped into BGP, and the “advertise l2vpn evpn” command basically takes those prefixes and dumps them into the EVPN address-family… this essentially makes the border leafs get mapped as the VTEPs for any external routes. I’m less sure what the redistribution into OSPF is for — I believe that in larger scale environments this would make more sense. For our deployment, the border leafs maintained ALL VXLANs (anycast gateway for all bridge-domains) and therefore advertised all of these subnets up to the core via OSPF anyway. For deployments where not all VXLANs exist on the border leafs, I think the redistribution from BGP into OSPF is meant to announce the segments that aren’t on the border leafs to the rest of the world.
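
The border-leaf hand-off isn’t in the config examples below either, so here is a rough sketch of what that piece could look like; the sub-interface, dot1q tag, addressing, OSPF process tag, and route-map names are all made up for illustration (NX-OS requires a route-map on redistribution, hence the permit-everything maps):

# Hypothetical border-leaf hand-off to the core inside the tenant VRF
interface Ethernet1/47.100
  encapsulation dot1q 100
  vrf member VRF-Tenant-1
  ip address 172.16.0.1/30
  ip router ospf 2 area 0.0.0.0
  no shutdown
#
route-map RM-BGP-TO-OSPF permit 10
route-map RM-OSPF-TO-BGP permit 10
#
router ospf 2
  vrf VRF-Tenant-1
    redistribute bgp 65535 route-map RM-BGP-TO-OSPF
#
router bgp 65535
  vrf VRF-Tenant-1
    address-family ipv4 unicast
      advertise l2vpn evpn
      redistribute ospf 2 route-map RM-OSPF-TO-BGP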

Configuration Examples:

So, without further ado, let’s move on to some configuration examples. These are sanitized versions of real-life working configurations, so they should be pretty spot on 🙂

We’ll start with the relevant spine configuration since it’s the simplest piece of all of this. This is an example configuration for Spine-1 — it would be very similar on Spine-2:

# Enable Features
#
nv overlay evpn
feature ospf
feature bgp
feature pim
#
# /Enable Features

# Configure loopbacks
#
interface loopback0
 ip address 1.1.1.1/32
 ip router ospf 1 area 0
 ip pim sparse-mode
interface loopback1
 ip address 3.3.3.3/32
 ip router ospf 1 area 0
 ip pim sparse-mode
#
# /Configure loopbacks

# Required PIM/RP Anycast Configuration
# Note: loopback1 is the dedicated anycast loopback
# Note: the multicast ranges in the group-lists should encompass the multicast ranges used for the VNIs
# Note: 1.1.1.1 is our example loopback0 on spine-1, 2.2.2.2 is our example loopback on spine-2
# Note: 3.3.3.3 is used as our anycast RP, and loopback1 IP address
#
ip pim rp-address 3.3.3.3 group-list 239.0.0.0/8
ip pim rp-candidate loopback1 group-list 239.0.0.0/8
ip pim anycast-rp 3.3.3.3 1.1.1.1
ip pim anycast-rp 3.3.3.3 2.2.2.2
#
# /Required PIM/RP Anycast Configuration
# BGP Configuration
#
router bgp 65535
  router-id 1.1.1.1
  address-family l2vpn evpn
    retain route-target all
  neighbor 5.5.5.5 remote-as 65535
    update-source loopback0
    address-family l2vpn evpn
      send-community both
      route-reflector client
  neighbor [copy above neighbor entry for all remaining leaf nodes]
    update-source loopback0
    address-family l2vpn evpn
      send-community both
      route-reflector client
  neighbor 2.2.2.2 remote-as 65535
    update-source loopback0
    address-family l2vpn evpn
      send-community both
      route-reflector client
#
# /BGP Configuration

That’s pretty much it for the spine nodes, nice and simple. Of course they require all the fabric links to be routed and all the other associated IP address configuration, but from a VXLAN perspective these are just dumb transit boxes.

Moving on, here is a configuration example for a leaf node:

# Enable Features
#
nv overlay evpn
feature ospf
feature bgp
feature pim
feature interface-vlan
feature vn-segment-vlan-based
feature nv overlay
feature vpc
#
# /Enable Features

# Configure loopback
#
interface loopback0
 ip address 5.5.5.5/32
 ip address 5.5.5.255/32 secondary
 ip router ospf 1 area 0
 ip pim sparse-mode
#
# /Configure loopback

# Required PIM/RP Anycast Configuration
#
ip pim rp-address 3.3.3.3 group-list 239.0.0.0/8
#
# /Required PIM/RP Anycast Configuration
# Overlay VLAN Configuration
#
vlan 1000
  vn-segment 5096
interface Vlan1000
  no ip address
  vrf member VRF-Tenant-1
  ip forward
  no shutdown
vrf context VRF-Tenant-1
  vni 5096
  rd 5096:5096
  address-family ipv4 unicast
    route-target both 5096:5096
    route-target both 5096:5096 evpn
#
# /Overlay VLAN Configuration

# Example "Access" Bridge Domain Configuration
#
vlan 1001
  vn-segment 5097
interface Vlan1001
  vrf member VRF-Tenant-1
  ip address 192.168.1.1/24
  fabric forwarding mode anycast-gateway
  no shut
#
# /Example "Access" Bridge Domain Configuration

# NVE Interface Configuration
#
interface nve1
  no shutdown
  source-interface loopback0
  host-reachability protocol bgp
  member vni 5096 associate-vrf
  member vni 5097
    suppress-arp
    mcast-group 239.1.1.1
#
# /NVE Interface Configuration

# BGP Configuration
# Note: The tenant must be configured in BGP to advertise l2vpn
# Note: The evpn configuration stanza is new -- this associates the overlay VLAN/VRF to MP-BGP
#
router bgp 65535
  router-id 5.5.5.5
  address-family l2vpn evpn
    retain route-target all
  neighbor 1.1.1.1 remote-as 65535
    update-source loopback0
    address-family l2vpn evpn
      send-community both
  neighbor [copy above neighbor entry for all remaining spine nodes]
    update-source loopback0
    address-family l2vpn evpn
      send-community both
  vrf VRF-Tenant-1
    address-family ipv4 unicast
       advertise l2vpn evpn
  evpn
    vni 5096 l2
       rd auto
       route-target both auto
#
# /BGP Configuration

That is a lot to digest!

I’ll wrap this up now and hopefully follow up shortly with some verification and troubleshooting, but one last piece of advice… make sure you run 7.0(3)I1(1b) on the Nexus 9000s. We can’t be sure yet, but it seems like we had fully functional configurations on 7.0(3)I1(1) yet were not operational. There were some weird quirks with devices encapsulating things but not de-encapsulating them. It’s highly possible we had some misconfigurations, but I can’t confirm that as of now. Lastly, the configuration guides for making this happen right now are in a pretty shabby state. They’ve got most of the meat, and if you combine them with the latest release notes you may get most of the way there, but probably not all the way – so be careful, follow this doc, follow the guides, and work with TAC; they were able to help work through this with us and make it successful. I’m very excited to have been able to roll this out though; despite the challenges, this is probably the coolest deployment I’ve been a part of. I really think that all data centers in five years’ time will be running Spine/Leaf designs and VXLAN at some level (ACI, NSX, or CLI-configured VXLAN), so it was pretty sweet to get things rolling early!

Finally, huge thanks to my super awesome post-sales co-worker who spent a LONG time on the phone with TAC getting everything dialed in properly and fully functional!

 


16 thoughts on “Nexus 9000 VXLAN – EVPN, vPC, and VXLAN Routing”

  1. This is a great post and I share your enthusiasm about how cool such a project is. However, I’m speaking from the perspective of a network geek. If I put my business-focused hat on, I can’t help but think we need to be given technology that requires less setup time, freeing up time for actually delivering cool applications and services over the network, which is ultimately what we’re paid to do.
    My hope is that Cisco can perhaps utilise the APIC to deliver zero touch EVPN deployments similar to how it does with ACI zero risk provisioning.

  2. Hey Paul, thanks for the comment! I 100% agree that this was quite a bit of work/time to get things rolling. Of course this is totally new technology, so that complicated things quite a bit. I suspect as things progress this will get substantially easier to deploy…. or at least that would be the hope 🙂

    ACI (APIC) of course abstracts all of the configuration of VXLAN, so that you can indeed spend time delivering applications instead of hammering away configuring VXLAN all day long!

    Outside of ACI, I would put some money on the Tail-F acquisition eventually bringing some controller-driven automation of VXLAN configuration (or perhaps APIC-EM, and/or maybe even Prime picking up some slack as it matures).

    Carl

  3. Hi, so glad to find someone else trying this out!

    One comment though about the underlay – from what I can see of the doco, and from trying it myself (still have some issues), multicast is not required in the underlay for BUM traffic as this is done by MP-BGP.

    My config also differs slightly from the examples, and yours, in that it’s a 4-node deployment without leaves or spines as those roles are collapsed onto each of the 4. They’re connected in a ring at the physical layer but form a full mesh for iBGP.

    • Hi Brenden,

      Do you have this working w/out multicast? Reading through the doc before doing this deployment I was convinced it was not required… the documentation was kind of awful and just replaced a lot of the multicast language in the release notes/config guides prior to 7.x with EVPN stuff. During the deployment though it became pretty clear that we needed it — TAC seemed to think so too.

      My understanding (subject to being totally wrong!) is that multicast is still used for BUM, but BGP is used to store MACs for normal traffic flows (basically the normal MAC to VTEP mapping for unicast flows), and also — and this is the important part I believe — used to loop the packet back through the internal architecture so that you can do VXLAN routing on your anycast leaf nodes.

      Let me know if you can get it working without multicast, and if you learn anything new and cool!

      Carl

      • Multicast in the underlay is required (unless you use Head End Replication I suppose) to allow for BUM traffic in the overlay. Can’t say I’ve done any multicast over VXLAN yet, but in the overlay it would just be normal multicast stuff — packets would get replicated and sent to whatever L2 destination they needed to reach (based on joins as normal I would suspect) — furthermore I suspect (but again, never done this yet!) that leaf switches doing anycast gateway could run PIM for each of those SVIs that are mapped to a VXLAN… could be fun to lab that up… I’ll try and find the time!

  4. Hi Carl,

    Yes I do now have it working. Not sure exactly what it was that got me over the line as I changed a few things recently while playing around and reading various doc’s.

    I did have one issue with using vPC VTEP – I had the secondary IP set for all 4 nodes the same, even though only 2 were in a vpc domain together. I was confusing the vPC VTEP anycast address with the anycast gateway feature… still don’t 100% understand it but there’s more doco to read 🙂

    One part I suspect was to do with the route-map. In the design guide it doesn’t mention using a route-map at all so I took it off. That may have had a part in getting it working.

    I can shoot you full configs if you’d like. email is brendan.franklin at gmredail.com (remove the red)

  5. Nice post. Regarding a controller (e.g. APIC), look at Cisco Virtual Topology System (VTS). Uses technology from the Tail-F acquisition and an IOS-XRv instance for overlay provisioning, open northbound API, integration with hypervisor/orchestration platforms etc.

    I would hate to be configuring an environment like this at large scale without it.

    • Thanks Nik!

      VTS for sure looks cool! I hadn’t heard about it when we did this deployment… it was for sure a learning curve, but you could pretty easily templatize it even without a controller… maybe even use the pyAPI to some extent (maybe?! I think you can do what you would need with that but not sure off hand).

      Is VTS available for download for dorking about in the lab? Would be cool to play with for sure!

      Carl

  6. Great post and I learned so much. Consider a fabric with VXLAN EVPN where the VTEPs are used as ToR switches and there are a lot of hosts behind them. As mentioned above, I set the anycast gateway and the SVI IP address as the hosts’ gateway. How does the hosts’ traffic reach the router or NAT device that forwards it to the internet? Thanks.

  7. How is this any more scalable than VLANs, if you have to tie a VLAN to a VXLAN in order for this to work at layer 3? Don’t we still effectively have the same limit of 4096 tenants then (instead of millions with VXLANs)?

    • On a single switch, sure. Across a “fabric” though you can map VLAN 10 on every switch to a different VNID (overlay basically), so we can scale up to the millions in that manner. As for the “tenancy” — that’s just a limit of VRFs basically — I think I described it as the “tenant” VRF/overlay in the post (been a while!), so you can add a bunch of VRFs/overlays/tenants and map them to whatever VNIDs/VLANs you want. Hopefully that makes sense.

  8. I’m new to the Nexus 9000 BGP EVPN/VXLAN setup, but I have a question regarding ACI. Do I need to finish setting up BGP/EVPN/VXLAN to make the network connection between leafs first before I can use ACI?

