1000v VxLAN

I wrote up some basic 1000v information a while ago, with a promise to follow up on some misadventures with VxLAN on the platform. Here I go!

As mentioned in the previous post, the 1000v Essentials licensing (freemium) is all that is required to get some VxLAN stuff going. This is huge since it lets us do some VxLAN at home in the lab without requiring any hardware outside of compute (Note: CSR 1000v and Arista vEOS also allow you to do some VxLAN stuff totally in software!). As with the 1000v in general though, there is still the requirement for vCenter, which is kind of a bummer for a home lab. Now that we have that out of the way, let’s take a look at the topology that we are working with:


In the above topology we essentially have two subnets, VLAN 10 and VLAN 11. VLAN 10 is the ‘primary’ subnet containing two ESX hosts. ESX-1 is a management type box; it holds the 1000v VSM and vCenter (I suspect this is not really best practice; personally I prefer 1000v deployments on the 1010 or 1110 appliances). ESX-2 is a production type box; its vSwitch/dVS has been replaced with the 1000v VEM. Finally, a third ESX box is deployed in the VLAN 11 subnet, and it too has had its vSwitch/dVS replaced with the 1000v VEM.

I won’t beat a dead horse about what VxLAN is and how it works, since there are plenty of articles and blog posts that say it better than I can. What it means to us, however, is that we can stretch layer 2 domains across routed domains. In the context of the Cisco 1000v, we can do this via the standard (not yet ratified standard, that is…) multicast configuration, or the Cisco unicast implementation. In the standard, multicast is used as a pseudo control plane: VTEPs join the multicast groups associated with the VNIs that they care about. In this way, VxLAN is able to maintain the look and feel of Ethernet for hosts in bridge domains. The unicast implementation forgoes multicast by utilizing the VSM as basically a control node that maintains mappings of which VNIs need to go where.
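To picture the multicast "pseudo control plane", here is a toy Python model. It is only a sketch of the idea, not how the 1000v implements anything: the class name, the VNI numbers, the group addresses and the VTEP IPs are all made up for illustration. Each VNI maps to a multicast group, VTEPs join the groups for the VNIs they host, and a broadcast frame for a VNI reaches every other VTEP that joined that group.

```python
from collections import defaultdict


class MulticastFabric:
    """Toy model: one multicast group per VNI, VTEPs join per VNI."""

    def __init__(self, vni_to_group):
        # Hypothetical mapping, e.g. {5000: "239.1.1.1"}
        self.vni_to_group = dict(vni_to_group)
        # group address -> set of VTEP IPs that joined it
        self.members = defaultdict(set)

    def join(self, vtep_ip, vni):
        """A VTEP joins the group for a VNI it has guests in."""
        self.members[self.vni_to_group[vni]].add(vtep_ip)

    def flood_targets(self, src_vtep, vni):
        """VTEPs that would receive a broadcast sent into this VNI."""
        group = self.vni_to_group[vni]
        return sorted(self.members[group] - {src_vtep})


if __name__ == "__main__":
    fabric = MulticastFabric({5000: "239.1.1.1", 5001: "239.1.1.2"})
    fabric.join("10.0.10.2", 5000)  # made-up VTEP in VLAN 10
    fabric.join("10.0.11.2", 5000)  # made-up VTEP in VLAN 11
    fabric.join("10.0.11.2", 5001)
    print(fabric.flood_targets("10.0.10.2", 5000))
    print(fabric.flood_targets("10.0.11.2", 5001))
```

The point of the model is just that nobody has to configure a list of remote VTEPs; group membership in the routed network does that job.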

Unicast VxLAN is certainly an attractive proposition. I’ve yet to meet a network guy/gal who wants to deal with multicast when they don’t have to. It is SUPER easy to get things rolling with unicast mode, but it does come with some drawbacks. Prime amongst those is that, at least for now, you are stuck entirely with the 1000v when using unicast mode. Multicast mode, being the ‘standard’ (again, not ratified), allows for interoperability between different VTEP platforms (in theory! I’ve tested 1000v with CSR multicast VxLAN, and that does indeed work, but who knows about inter-vendor operability). So for the purposes of this post, we are going to run with multicast since it seems, at least to me, to be the smart choice.

As we are going with multicast, there is of course the prerequisite that multicast routing be enabled on the routed infrastructure. The routed backbone across which VxLAN will run must be configured with PIM Bidir, and unicast reachability between VTEPs is required as well. In the 1000v scenario, this means the management IP address must have reachability to the vmkernel IP of the remote ESX hosts (basically to the VEM). When using CSR 1000v routers, a loopback is required, as this is used as the source interface for the NVE interfaces.

Enough talk, onto the configurations.

As with other Nexus things, VxLAN is a feature that has to be enabled first. On the 1000v it goes by the name of the segmentation feature:

feature segmentation

Now we can create our bridge-domain. This is like creating a L2 VLAN, just for VxLAN instead.

bridge-domain [BRIDGE DOMAIN NAME]
  segment id [VNI]
  group [MCAST GROUP]

The name is just a human friendly name and is arbitrary as far as I can tell. The VNI, however, is important. The VNI is the VxLAN Network ID. It’s like a VLAN tag, but it’s for VxLAN (all things old are new again), and it is a 24 bit field. This field is what allows us to have 16 million-ish unique VxLANs. Next, of course, we need to define the multicast group address that is associated with this VNI.
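For the curious, here is where the "16 million-ish" number comes from, as a small Python sketch. It assumes the 8-byte header layout from the VXLAN draft (later published as RFC 7348): 8 bits of flags, 24 reserved bits, the 24 bit VNI, and 8 more reserved bits. The function name and the example VNI of 5000 are mine, purely for illustration.

```python
import struct

VNI_BITS = 24
MAX_VNI = 2**VNI_BITS - 1  # largest value that fits in the field


def pack_vxlan_header(vni):
    """Build the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni <= MAX_VNI:
        raise ValueError(f"VNI {vni} does not fit in {VNI_BITS} bits")
    flags = 0x08  # the "I" bit: the VNI field is valid
    # Word 1: flags in the top byte, 24 reserved bits below it.
    # Word 2: VNI in the top 24 bits, 8 reserved bits below it.
    return struct.pack("!II", flags << 24, vni << 8)


if __name__ == "__main__":
    print(2**VNI_BITS)                    # number of possible VNIs
    print(pack_vxlan_header(5000).hex())  # header bytes for VNI 5000
```

Compare that with the 12 bit VLAN ID (4096 values) and it is obvious why people got excited.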

Next up, we just need to create a port-profile that is associated with this bridge-domain. This is almost identical to a ‘normal’ port-profile:

port-profile type vethernet [PORT PROFILE NAME]
  vmware port-group
  switchport mode access
  switchport access bridge-domain [BRIDGE DOMAIN NAME]
  no shutdown
  state enabled

That’s basically all there is to it. This will get our new port-profile into all the ESX hosts that are using the 1000v VEM, and allow us to assign it to guests. Guests in this bridge-domain will now be in the same L2 segment no matter where in the data center they are (in our example, on either of the VEM hosts). It doesn’t sound very cool, but we’ve just allowed L2 adjacency across an entirely routed backbone! This means that we can now build totally routed data center fabrics while still maintaining the ability for guests to vMotion to any other host without the requirement to re-IP.
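If you want to fill in the bracketed placeholders from the two config snippets above without fat-fingering anything, a few lines of Python will do. This is just string templating around the exact lines shown earlier, not a Cisco tool; the function name and the example values (lab-bd, 5000, 239.1.1.1, lab-vxlan) are hypothetical. The only checks are that the VNI fits in 24 bits and that the group is really a multicast address; the 1000v may well enforce a narrower segment ID range than this.

```python
import ipaddress


def render_vxlan_config(bd_name, vni, mcast_group, profile_name):
    """Render the bridge-domain and port-profile config as text."""
    if not 0 < vni < 2**24:
        raise ValueError(f"VNI {vni} does not fit in 24 bits")
    if not ipaddress.ip_address(mcast_group).is_multicast:
        raise ValueError(f"{mcast_group} is not a multicast address")
    return "\n".join([
        f"bridge-domain {bd_name}",
        f"  segment id {vni}",
        f"  group {mcast_group}",
        f"port-profile type vethernet {profile_name}",
        "  vmware port-group",
        "  switchport mode access",
        f"  switchport access bridge-domain {bd_name}",
        "  no shutdown",
        "  state enabled",
    ])


if __name__ == "__main__":
    print(render_vxlan_config("lab-bd", 5000, "239.1.1.1", "lab-vxlan"))
```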

Verification is a bit all over the place in my opinion… on the VSM you can take a look at the bridge-domains and see which MAC addresses are being mapped into which domain with the “show bridge-domain [BRIDGE DOMAIN NAME] mac” command. This will output the MAC address of course, as well as the module (basically the VEM number), virtual ethernet port, and the VTEP IP address for a given bridge-domain. The VTEP IP will be the vmkernel IP address as mentioned before. For a bit more digging, you can SSH to the ESX host and run some vemcmd commands:

~ # vemcmd show vxlan interfaces
LTL     VSM Port      IP       Seconds since Last   Vem Port
                               IGMP Query Received
(* = IGMP Join Interface/Designated VTEP)
 49        Veth2         6             vmk0         *

I feel the above command/output is the most helpful, but you can also look at which VSM ports map to which VEM ports, and which are actually doing VxLAN things, with “vemcmd show port”:

~ # vemcmd show port
  LTL   VSM Port  Admin Link  State  PC-LTL  SGID  Vem Port  Type
   17     Eth6/1     UP   UP    FWD       0          vmnic0
   49      Veth2     UP   UP    FWD       0            vmk0  VXLAN

I removed the other ports from the above output, but the important piece is that the vmk0 interface is the VxLAN ‘port’, which basically means that it’s a VTEP.

Lastly, we can view VxLAN statistics:

~ # vemcmd show vxlan-stats
  LTL  Ucast   Mcast/Repl   Ucast   Mcast    Total
       Encaps  Encaps       Decaps  Decaps   Drops
   49     800           68    1113      70       0
   54     800           68    1113      70       1

This one piqued my interest a bit. All those unicast encaps and decaps caught my eye. Isn’t this supposed to happen over multicast? Well, the answer is yes and no. VTEP discovery and mapping essentially takes place over multicast, and multicast is also needed for broadcast traffic to be sent to all VTEPs that have hosts in a particular VNI. Once a remote MAC has been learned behind a VTEP, though, known unicast traffic is encapsulated and sent straight to that VTEP’s IP, which is why the unicast counters dwarf the multicast ones. If you, like me, want to see this actually happening and bust out a packet capture, you may notice something else interesting. The 1000v utilizes UDP port 8472… which also happens to be the same port that OTV utilizes. IANA has since allocated 4789 for VxLAN; however, the 1000v has not been updated to reflect this. Interestingly, the CSR 1000v does indeed use 4789, so keep this in mind when interoperating the two!
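If you are staring at a capture and wondering whether a given UDP packet is VxLAN at all, a rough Python check looks something like this. It is a sketch under a few assumptions: you have already pulled the UDP destination port and payload out of the capture yourself, the header follows the draft/RFC 7348 layout, and the function name and sample bytes are mine. Since OTV shares port 8472, a match on that port is a hint rather than proof.

```python
import struct

# Ports mentioned in this post; the descriptions are my own labels.
VXLAN_PORTS = {
    8472: "pre-IANA port (what the 1000v sends on, shared with OTV)",
    4789: "IANA-assigned VxLAN port (what the CSR 1000v uses)",
}


def describe_vxlan(udp_dst_port, udp_payload):
    """Return port note and VNI if this looks like VxLAN, else None."""
    if udp_dst_port not in VXLAN_PORTS or len(udp_payload) < 8:
        return None
    word1, word2 = struct.unpack("!II", udp_payload[:8])
    if not (word1 >> 24) & 0x08:  # "I" flag must be set for a valid VNI
        return None
    return {
        "port": udp_dst_port,
        "note": VXLAN_PORTS[udp_dst_port],
        "vni": word2 >> 8,  # VNI lives in the top 24 bits of word 2
    }


if __name__ == "__main__":
    # Hand-built sample: VxLAN header for VNI 5000 plus a dummy inner frame.
    sample = bytes.fromhex("0800000000138800") + b"\x00" * 14
    print(describe_vxlan(8472, sample))
    print(describe_vxlan(53, sample))
```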

Now that we have 1000v VxLAN working, there are some big bummers from a lab perspective (and possibly some real world bummers too). Our VxLAN is essentially trapped within the bridge-domain. There is no virtual L3 gateway (that I’m aware of) offered by Cisco. There is, however, a VxLAN gateway that can be utilized with the 1110 platform. This VxLAN gateway is basically a bridge that will bridge VxLANs into VLANs. Once the VxLAN is bridged onto a VLAN, things continue as normal; upstream switches can have a default gateway and do normal routing stuff. This functionality in VMware requires the 1110 appliance, however in KVM it is supported entirely in software (hopefully I’ll get that working soonish!). In the meantime, for lab purposes, you can do a bit of duct tape and baling wire shenanigans to get your VxLAN piped out onto your network. I placed an Ubuntu guest on one of my ESX hosts with two network interfaces. Put one interface into an access port on the VLAN you want to bridge the VxLAN onto, and the other into the bridge-domain itself. Then you can simply use normal Linux bridging to get things flowing between the two interfaces. Here’s what my /etc/network/interfaces config looks like:

auto br0
iface br0 inet dhcp
   bridge_ports eth0 eth1

Nice and simple!

One last point about VxLAN: it is NOT a DCI technology… Unicast mode does do some of the things that a DCI technology should do, but it has big scalability concerns: it requires the 1000v VSM, the number of bridge-domains is limited, VxLAN gateway functionality (as of now) won’t scale very large, etc. Please refer to these two excellent posts that go into more detail about this… the TL;DR is: just because we can doesn’t mean we should.




2 thoughts on “1000v VxLAN”

  1. What is the difference between 2000 hosts in 50 racks in one data center and 1000 hosts in 25 racks in each of 2 different data centers, if they are 10 km apart?
    Instead of there being a handful of SR optics there are LH.

    • I suppose there isn’t much difference. However, extending a failure domain across data centers is probably not a good idea. My supposition is that if you have 2000 hosts in a data center, you likely have a DR site, or perhaps a second active data center. If that’s the case, then hopefully your business would be able to operate if a data center goes down. To that end, making your two data centers one big failure domain is probably not the best idea. You are correct though, you could easily do it. You could even stretch a VxLAN out to AWS or some other public cloud if you were feeling frisky!
