1000v BGP VXLAN Control Plane

The latest software release for the Cisco 1000v dropped early August to much fanfare and applause. Oh wait… no it definitely didn’t do that. Other than one or two people on Twitter talking about it, you could have easily missed it. Turns out that this release is actually pretty interesting and worth a bit of time to investigate!

Things that look cool:

  • Support for VM VXLAN Gateway – up till now you had to have the 1110x appliance to do this in VMware 1kv
  • VSUM — magical software thing to do installs and upgrades on the 1000v. I used this in a demo and it was pretty slick
  • Distributed Netflow  — VEMs can send Netflow information directly — no need to pipe things back to the VSM
  • BPDU Guard — whoa, about time? No more bridges in VMs causing problems
  • Possibly interesting TrustSec stuff… not something I’m very familiar with, but it seems like this is another good one
  • Storm Control! Yay! I’ve been wanting this for a while. VXLAN lets us do sometimes less than intelligent things with stretching bridge-domains all around… I feel like adding Storm Control is a bit of a safety net to prevent your bridge-domain from falling over from excessive broadcasts or multicast (or I guess unicast too)

Annnnnd the one I’m most interested in: BGP Control Plane for VXLAN. VXLAN has been around for some time now, but still doesn’t have a control plane. We’ve been relying on multicast, or proprietary black magic unicast VXLAN, to figure out which MACs live at which VTEPs. This obviously works, but scale will likely become a serious consideration for any reasonably large deployment. What do we as network engineers know that scales pretty well? RIPv1!! Oh wait, no not that… BGP seems to do okay though. This latest release essentially supports doing unicast VXLAN in the 1000v, and also extends that functionality across multiple (up to 8 for now) VSMs.

Why does BGP matter at all for this? Strictly in the realm of Cisco and the 1000v, it means that we can now scale much, much better. Each VSM gets tied to a single vCenter and a single data center within that vCenter, and it also has limits on the total number of VEMs supported per VSM (or HA VSM pair). By adding BGP, we now have a non-multicast way (yay!) of sharing VXLAN information across multiple VSMs (multicast VXLAN does work across multiple VSMs though, so that’s still an option).
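To make the scaling argument a bit more concrete, here is a toy Python model of what a shared control plane buys us. It is only a sketch: the `Vsm` class and the `advertise`, `receive` and `full_mesh_exchange` names are made up for illustration and say nothing about how the 1000v actually implements BGP. The VNI and VTEP addresses are the ones from my lab.

```python
from collections import defaultdict


class Vsm:
    """Toy stand-in for a VSM: tracks which VTEPs serve each VNI."""

    def __init__(self, name):
        self.name = name
        self.local = defaultdict(set)    # VNI -> VTEP IPs on this VSM's VEMs
        self.learned = defaultdict(set)  # VNI -> VTEP IPs learned from peers

    def add_local_vtep(self, vni, vtep_ip):
        self.local[vni].add(vtep_ip)

    def advertise(self):
        """Return the (VNI, VTEP) pairs this VSM would announce to peers."""
        return {(vni, ip) for vni, ips in self.local.items() for ip in ips}

    def receive(self, routes):
        """Install (VNI, VTEP) pairs announced by a peer."""
        for vni, ip in routes:
            self.learned[vni].add(ip)

    def vteps(self, vni):
        """All VTEPs known for a VNI, local and learned."""
        return self.local[vni] | self.learned[vni]


def full_mesh_exchange(vsms):
    """Every VSM sends its local routes to every other VSM."""
    for sender in vsms:
        routes = sender.advertise()
        for receiver in vsms:
            if receiver is not sender:
                receiver.receive(routes)


main = Vsm("main")
main.add_local_vtep(500666, "10.10.10.183")
main.add_local_vtep(500666, "10.10.11.254")
other = Vsm("other")
other.add_local_vtep(500666, "10.10.13.253")

full_mesh_exchange([main, other])
print(sorted(main.vteps(500666)))
```

After the exchange each VSM knows every VTEP in the segment without a single multicast group being involved, which is the whole point.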

From a more holistic standpoint, I believe this is an important step forward in the maturity of VXLAN as a technology. It’s interesting that there seems to be so much support and desire by vendors to implement BGP with VXLAN, yet it seems that nobody is actually doing it. Other than a few slide decks and IETF drafts, I haven’t seen any vendors implementing this – please tell me if I am missing something here?! NSX and ACI seem to have stolen (at least in the Enterprise type segments I work in normally) a lot of the thunder from VXLAN in general by covering up the underpinnings and replacing it all with proprietary software and shiny GUIs.

In any case, BGP Control Plane is here, and it even works! After getting my two VSMs/vCenters upgraded to the latest code I jumped right in. The configuration is almost exactly what you would expect; enable the feature, utilize BGP off the control interface of the VSM, use an L2VPN address family, and advertise VXLANs (kind of). Here is a complete working config off of my ‘main’ VSM:

interface control0
 ip address
N1kv# sh run bgp
!Command: show running-config bgp
!Time: Sat Oct 4 11:03:28 2014
version 5.2(1)SV3(1.1)
feature bgp
router bgp 65535
 address-family l2vpn evpn
 neighbor remote-as 65535
  address-family l2vpn evpn
   send-community extended
bridge-domain VXLAN_666
 segment id 500666
 segment control-protocol bgp

Obviously the other VSM looks pretty similar. The ‘segment control-protocol’ command can be configured globally or on a per bridge-domain basis. Since I’m primarily using multicast at home, I configured it on the individual bridge-domain. All that’s needed other than the above configuration is, of course, a port-profile to present your bridge-domain to VMware.

Verification is also about what you would expect. BGP commands are pretty much the same as BGP on any other Cisco device, and will show VTEP information for each bridge-domain:

N1kv# show bgp l2vpn evpn evi all vtep
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 12, local router ID is
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: (EVI 500666)
*>l10.10.10.183 100 32768 i
*>l10.10.11.254 100 32768 i
*>i10.10.13.253 100 0 i

All other ‘normal’ VXLAN show commands still do basically the same thing. It’s not a very sexy thing to look at configurations for, and it’s not very difficult to get going, but it is certainly a welcome development for VXLAN as a technology.

You can check out the release notes here: http://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus1000/sw/5_2_1_s_v_3_1_1/release/notes/n1000v_rn.html

1000v VxLAN

I wrote a bit about basic 1000v information a while ago with a promise to write about some misadventures with VxLAN on the platform. Here I go!

As mentioned in the previous post, the 1000v essentials licensing (freemium) is all that is required to get some VxLAN stuff going. This is huge since it lets us do some VxLAN at home in the lab without requiring any hardware outside of compute (Note: CSR 1000v and Arista vEOS also allow you to do some VxLAN stuff totally in software!). As with the 1000v in general though, there is still the requirement for vCenter, which is kind of a bummer for a home lab. Now that we have that out of the way, let’s take a look at the topology that we are working with:


In the above topology we essentially have two subnets, VLAN 10 and VLAN 11. VLAN 10 is the ‘primary’ subnet containing two ESX hosts. ESX-1 is a management type box; it holds the 1000v VSM and vCenter (I suspect this is not really best practice, and personally I prefer 1000v deployments on the 1010 or 1110 appliances). ESX-2 is a production type box; its vSwitch/dVS has been replaced with the 1000v VEM. Finally, in the VLAN 11 subnet a third ESX box is deployed, and this box has also had its vSwitch/dVS replaced with the 1000v VEM.

I won’t beat a dead horse about what VxLAN is and how it works since there are plenty of articles and blog posts that say it better than I can. What it means to us, however, is that we can stretch layer 2 domains across routed domains. In the context of the Cisco 1000v, we can do this via the standard (not yet ratified standard that is…) multicast configuration, or the Cisco unicast implementation. In the standard, multicast is used as a pseudo control plane: VTEPs join the multicast groups associated with the VNIs that they care about, and in this way VxLAN is able to maintain the look and feel of Ethernet for hosts in bridge domains. The unicast implementation forgoes multicast by utilizing the VSM as basically a control node that maintains mappings of which VNIs need to go where.
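If the multicast-as-control-plane idea is a bit abstract, here is a tiny Python sketch of the flood-and-learn behaviour it gives you. Everything in it is hypothetical: the `Vtep` class, its method names, the 239.1.1.1 group and the MAC address are just illustration, and this is not how any real VEM is written.

```python
class Vtep:
    """Toy VTEP doing flood-and-learn for the VNIs it cares about."""

    def __init__(self, ip):
        self.ip = ip
        self.groups = {}     # VNI -> multicast group joined for that VNI
        self.mac_table = {}  # (VNI, MAC) -> IP of the VTEP that MAC is behind

    def join(self, vni, group):
        """Join the multicast group associated with a VNI."""
        self.groups[vni] = group

    def learn(self, vni, src_mac, src_vtep):
        """Record which VTEP a source MAC was seen behind."""
        self.mac_table[(vni, src_mac)] = src_vtep

    def outer_destination(self, vni, dst_mac):
        """Pick the outer IP destination for an encapsulated frame."""
        known = self.mac_table.get((vni, dst_mac))
        if known is not None:
            return known         # known unicast: straight to that VTEP
        return self.groups[vni]  # broadcast or unknown: flood to the group


vtep = Vtep("10.10.10.183")
vtep.join(500666, "239.1.1.1")

before = vtep.outer_destination(500666, "00:50:56:aa:bb:cc")
vtep.learn(500666, "00:50:56:aa:bb:cc", "10.10.11.254")
after = vtep.outer_destination(500666, "00:50:56:aa:bb:cc")
print(before, after)
```

Before the MAC is learned the frame goes to the group; afterwards it goes directly to the remote VTEP. Unicast mode replaces the "flood to the group" branch with mappings handed out by the VSM.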

Unicast VxLAN is certainly an attractive proposition. I’ve yet to meet a network guy/gal that wants to deal with multicast when they don’t have to. It is SUPER easy to get things rolling with unicast mode, but it does come with some drawbacks. Prime amongst those is that, at least for now, you are stuck entirely with the 1000v when using unicast mode. Multicast mode, being the ‘standard’ (again, not ratified), allows for interoperability between different VTEP platforms (in theory! I’ve tested 1000v with CSR multicast VxLAN, and that does indeed work, but who knows about inter-vendor interoperability). So for the purposes of this post, we are going to run with multicast since it seems, at least to me, to be the smart choice.

As we are going with multicast, there is of course the prerequisite that multicast routing be enabled on the routed infrastructure. The routed backbone across which VxLAN will run must be configured with PIM Bidir, and unicast reachability between VTEPs is required as well. In the 1000v scenario, this means the management IP address must have reachability to the vmkernel IP of the remote ESX hosts (basically to the VEM). When using CSR 1000v routers, a loopback is required as this is used as a source interface for the NVE interfaces.

Enough talk, onto the configurations.

As with other Nexus things, this is of course a feature that has to be turned on. Enable VxLAN with the segmentation feature:

feature segmentation

Now we can create our bridge-domain. This is like creating an L2 VLAN, just for VxLAN instead.

bridge-domain [BRIDGE DOMAIN NAME]
  segment id [VNI]
  group [MCAST GROUP]

The name is just a human friendly name and is arbitrary as far as I can tell. The VNI, however, is important. The VNI is the VxLAN Network Identifier. It’s like a VLAN tag, but for VxLAN (all things old are new again). It is a 24-bit field, which is what allows us to have 16 million-ish (2^24 = 16,777,216) unique VxLANs, compared with the 4,096 IDs a 12-bit VLAN tag gives us. Next, of course, we need to define the multicast group address that is associated with this VNI.
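As a quick sanity check on those numbers, here is a small Python sketch that validates a VNI and a multicast group before you type them into a bridge-domain. The `check_segment` helper and the 239.1.1.1 group are my own inventions for the example, and it only checks the generic 24-bit limit; the platform may restrict the usable segment ID range further.

```python
import ipaddress

VNI_BITS = 24
MAX_VNI = 2**VNI_BITS - 1  # largest value that fits in the 24-bit field


def check_segment(vni, group):
    """Raise ValueError if the VNI or multicast group is not plausible."""
    if not 0 <= vni <= MAX_VNI:
        raise ValueError(f"VNI {vni} does not fit in {VNI_BITS} bits")
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return vni, group


print(2**VNI_BITS)  # number of possible VNIs
print(2**12)        # number of possible VLAN IDs, for comparison
print(check_segment(500666, "239.1.1.1"))
```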

Next up, we just need to create a port-profile that is associated with this bridge-domain. This is almost identical to a ‘normal’ port-profile:

port-profile type vethernet [PORT PROFILE NAME]
  vmware port-group
  switchport mode access
  switchport access bridge-domain [BRIDGE DOMAIN NAME]
  no shutdown
  state enabled

That’s basically all there is to it. This will get our new port-profile into all the ESX hosts that are using the 1000v VEM, and allow us to assign it to guests. Guests in this bridge-domain will now be in the same L2 segment no matter where in the data center they are (in our example, on either of the hosts). Doesn’t sound very cool, but we’ve just allowed L2 adjacency across an entirely routed backbone! This means that we can now build totally routed data center fabrics while still maintaining the ability for guests to vMotion to any other host without the requirement to re-IP.
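If you end up creating more than a couple of these, the two templates above are easy to stamp out with a few lines of Python. This is just a sketch: `render_segment` is a hypothetical helper, it assumes you are happy naming the port-profile after the bridge-domain, and the 239.1.1.1 group is a made-up example value.

```python
BRIDGE_DOMAIN = """\
bridge-domain {name}
  segment id {vni}
  group {group}
"""

PORT_PROFILE = """\
port-profile type vethernet {name}
  vmware port-group
  switchport mode access
  switchport access bridge-domain {name}
  no shutdown
  state enabled
"""


def render_segment(name, vni, group):
    """Return the bridge-domain and matching port-profile config as text."""
    return (BRIDGE_DOMAIN.format(name=name, vni=vni, group=group)
            + PORT_PROFILE.format(name=name))


config = render_segment("VXLAN_666", 500666, "239.1.1.1")
print(config)
```

Paste the output into the VSM in config mode, or loop over a list of (name, VNI, group) tuples to build a whole batch.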

Verification is a bit all over the place in my opinion… on the VSM you can take a look at the bridge-domains and see which MAC addresses are being mapped into which domain with the “show bridge-domain [BRIDGE DOMAIN NAME] mac” command. This will output the MAC address of course, as well as the module (basically VEM number), virtual ethernet port, and the VTEP IP address for a given bridge-domain. The VTEP IP will be the vmkernel IP address as mentioned before. For a bit more digging, you can SSH to the ESX host and run some vemcmd commands:

~ # vemcmd show vxlan interfaces
LTL     VSM Port      IP       Seconds since Last   Vem Port
                               IGMP Query Received
(* = IGMP Join Interface/Designated VTEP)
 49        Veth2         6             vmk0         *

I feel the above command/output is the most helpful, but you can also look at which VSM ports are mapped to which VEM ports, and which are actually doing VxLAN things, with “vemcmd show port”:

~ # vemcmd show port
  LTL   VSM Port  Admin Link  State  PC-LTL  SGID  Vem Port  Type
   17     Eth6/1     UP   UP    FWD       0          vmnic0
   49      Veth2     UP   UP    FWD       0            vmk0  VXLAN

I removed the other ports from the above output, but the important piece is that the vmk0 interface is the VxLAN ‘port’, meaning that it’s a VTEP.

Lastly, we can view VxLAN statistics:

~ # vemcmd show vxlan-stats
  LTL  Ucast   Mcast/Repl   Ucast   Mcast    Total
       Encaps  Encaps       Decaps  Decaps   Drops
   49     800           68    1113      70       0
   54     800           68    1113      70       1

This one piqued my interest a bit. All those unicast encaps and decaps caught my eye. Isn’t this supposed to happen over multicast? Well, the answer is yes and no. VTEP discovery and mapping essentially take place over multicast, and multicast is also needed for broadcast traffic to be sent to all VTEPs that have hosts in a particular VNI. Once a VTEP has learned which remote VTEP a MAC lives behind, though, traffic to that MAC is encapsulated and sent unicast straight to that VTEP, which is where all those unicast counters come from. If you, like me, want to see this actually happening and bust out a packet capture, you may notice something else interesting. The 1000v utilizes UDP port 8472… which also happens to be the same port that OTV utilizes. IANA has since allocated 4789 for VxLAN, however the 1000v has not been updated to reflect this. Interestingly, the CSR 1000v does indeed use 4789, so keep this in mind when interoperating the two!
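If you are staring at a capture and want to convince yourself that the UDP payload really is VxLAN regardless of which port it showed up on, the header is simple enough to pick apart by hand. Here is a rough Python sketch based on the 8-byte header layout (a flags byte, 24 reserved bits, the 24-bit VNI, 8 reserved bits). The function and constant names are my own and this is not taken from any Cisco tooling.

```python
import struct

IANA_VXLAN_PORT = 4789    # what the CSR 1000v uses
LEGACY_VXLAN_PORT = 8472  # what the 1000v uses (same port as OTV)
VNI_VALID_FLAG = 0x08     # "I" bit: set when the VNI field is valid


def pack_vxlan_header(vni):
    """Build the 8-byte VxLAN header for a given VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # First word: flags in the top byte, then 24 reserved bits.
    # Second word: VNI in the top 24 bits, then 8 reserved bits.
    return struct.pack("!II", VNI_VALID_FLAG << 24, vni << 8)


def unpack_vni(header):
    """Return the VNI from a VxLAN header, or None if the I bit is clear."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VNI_VALID_FLAG:
        return None
    return word2 >> 8


def looks_like_vxlan(udp_dst_port):
    """True if the UDP destination port is one of the two VxLAN ports."""
    return udp_dst_port in (IANA_VXLAN_PORT, LEGACY_VXLAN_PORT)


header = pack_vxlan_header(500666)
print(header.hex())
print(unpack_vni(header))
print(looks_like_vxlan(8472), looks_like_vxlan(4789))
```

The header is identical either way; only the UDP destination port differs between the two platforms, which is exactly what bites you when trying to get them talking.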

Now that we have 1000v VxLAN working, there are some big bummers from a lab perspective (and possibly some real world bummers too). Our VxLAN is essentially trapped within the bridge-domain. There is no virtual L3 gateway (that I’m aware of) offered by Cisco. There is, however, a VxLAN gateway that can be utilized with the 1110 platform. This VxLAN gateway is basically a bridge that will bridge VxLANs into VLANs. Once the VxLAN is bridged onto a VLAN, things continue as normal; upstream switches can have a default gateway and do normal routing stuff. In VMware this functionality requires the 1110 appliance, however in KVM it is supported entirely in software (hopefully I’ll get that working soonish!).

In the meantime, for lab purposes, you can do a bit of duct tape and baling wire shenanigans to get your VxLAN piped out onto your network. I placed an Ubuntu guest with two network interfaces on one of my ESX hosts. Put one interface into an access port on the VLAN you want to bridge the VxLAN onto, and the other into the bridge-domain itself. Then you can simply use normal Linux bridging to get things flowing between the two interfaces. Here’s what my /etc/network/interfaces config looks like:

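# bridge_ports needs the bridge-utils package installed
# one port sits in the VLAN, the other in the bridge-domain
auto br0
iface br0 inet dhcp
   bridge_ports eth0 eth1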

Nice and simple!

One last point about VxLAN. It is NOT a DCI technology… Unicast mode does do some of the things that a DCI technology should do, but it has big scalability concerns: it requires the 1000v VSM, supports a limited number of bridge-domains, the VxLAN gateway functionality (as of now) won’t scale very large, etc. Please refer to these two excellent posts that go into more detail about this… the TL;DR is: just because we can doesn’t mean we should.