Cisco ACI Bootcamp: Notes and Thoughts Pt.1

This week I got to spend some time at the Cisco office in Seattle getting all sorts of learned up on ACI and the APIC.  This post is a recap of the notes I took (slightly cleaned up, but don’t expect awesome notes, I’m no project manager), answers to some questions I had going in, and other thoughts I had during the course. It’s a LOT of notes… the first section is the high-level TL;DR if you don’t feel like reading the rest.

Overall Takeaways (TL;DR):

  • ACI is far more polished than I was anticipating at this stage. I love me some Cisco, but let’s be real… I wasn’t expecting ACI to be polished at this point. The interface felt smooth and intuitive, it was snappy, it made sense, overall it just seemed pretty awesome
  • The Nexus 9k hardware is super freakin’ sexy (ASR9ks are arguably even sexier, but in this post I’m talking strictly Nexus 9k). I felt this way before, but it REALLY is good. If you don’t want ACI, then don’t get ACI, but don’t overlook the 9k line. Bang for your buck is crazy good here, plus you get Cisco support which is first class, you can do VxLAN, you can route, you get tons of 40g… it’s hard to beat for what it is.
  • Cisco wants to be ‘open’ — and they don’t just say it (as far as I can tell); ACI is meant to be controlled by other things. While not everything is there at FCS, Azure will be able to do tons of stuff out of the box, everything is open to use Python against (SDK already out there), and Puppet and Chef libraries are coming. They even specifically said that Github will be a very important place for ACI/APIC stuff going forward (there’s a quick REST API sketch after this list).
  • ACI sucks at multicast. This was a bit weird to me as VxLAN requires multicast (in the IETF draft at least), but the ACI fabric doesn’t support PIM. It does support IGMP, and you can attach a router running PIM to the fabric to handle things, but it just seemed weird to me.
  • Cisco proprietary ASICs for the ACI magic, plus Broadcom T2 for all the other stuff; NX-OS only mode uses just the Broadcom chips.
  • ACI does NOT support multi-tier Clos. Not a big deal for almost any customer I’ve been at, but worth knowing.
  • Uses IETF VxLAN, just uses the reserved bits for some extra magic. It sounds like, in theory, you can use a remote VTEP on something like a 1000v/9k/Arista(?)/Brocade(?)/etc. to extend a VxLAN outside of the ACI fabric. Not sure why you would want to do this, but it’s interesting at the least!
  • 1000v is NOT dead. AVS (Application Virtual Switch) is going to basically be an ACI-enabled version of the 1000v. The 1000v was never really on the ACI team’s radar; since Insieme wasn’t part of Cisco while this was being developed, the 1000v’s limited market share wasn’t appealing enough for them to target it. For the same reason there is no integration with the VSG. Both these products will keep on keepin’ on though.
  • As of now there is no inter-VM (on a single hypervisor) security offered by the APIC. There is work on integrating APIC/ACI with vShield… seems silly when the Cisco VSG would serve that purpose.
  • May as well ignore FC/FCoE for now. FCoE is supported only locally to a single leaf.
  • In the modular platform (9500) you are either a leaf or a spine — and the line cards don’t mix — kind of a bummer but makes sense.
  • I got the impression that Cisco Inter-cloud will be very …. important… going forward, and that the ‘cloud’ track of certifications could be pretty interesting (whatever they end up calling it officially)
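
Since the openness point above is worth a taste, here’s a minimal sketch of poking the APIC over its REST API with curl (the Python SDK wraps this same API). The hostname and credentials are placeholders, so treat it as a shape rather than a recipe:

# log in and save the auth cookie (hostname/credentials are placeholders)
curl -sk -c cookie.txt https://apic.example.com/api/aaaLogin.json \
  -d '{"aaaUser":{"attributes":{"name":"admin","pwd":"password"}}}'

# read back all tenants as JSON, reusing the cookie from the login call
curl -sk -b cookie.txt https://apic.example.com/api/class/fvTenant.json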

Speeds/Feeds/Hardware:

  • Three basic chip types in use:
    • The ‘NFE’ or Network Forwarding Engine is the Trident T2 chip
    • The ‘ASE’ – Application Spine Engine is a custom Cisco ASIC for Spine switches (duh)
    • The ‘ALE’ – Application Leaf Engine is a custom Cisco ASIC for…. guess what? the Leaf switches
  • All the backplane stuff uses the Trident chips
  • NX-OS mode can only use the Tridents
  • The modular 9500 switches can be either a leaf OR a spine, not both (that part is relatively obvious)
    • You can NOT mix and match line card types in the 9500 — what I mean by that is that some line cards are meant for leaf nodes and some for spine; you cannot mix these flavors
  • Blades for the 9500:
    • 9400 blades are oversubscribed blades for NX-OS mode ONLY
    • 9500 blades are for NX-OS OR Leaf nodes
    • 9600 blades are NON-oversubscribed blades for NX-OS ONLY — they are NOT supported in the 9516 as there is not enough backplane speed to keep it non-oversubscribed
    • 9700 blades are Spine ONLY blades
  • Nexus 9336 is the ‘baby spine’ switch. Incredible value for the port speed/density. It can NOT be run in ‘normal’ NX-OS mode which is sad (no Trident chips, just the ASE).
  • In ACI mode (soft) reboot time is ~20 seconds! This can happen because there’s essentially a separate set of ‘supervisors’ for low-level things like power, fans, POST etc. You can reboot the ACI part of the switches without rebooting the low-level sup. Not sure this will be needed most times, but it is cool.
  • APIC comes in two flavors: small or large. Small = <1000 edge ports, Large = >1000 edge ports
  • Some of the fixed chassis 9ks (93128TX comes to mind, but there may be others) can’t use all the ‘uplink’ ports and/or utilize the 40g->4x10g breakout cable options due to bandwidth limitations
  • ACI mode scales pretty crazy big
    • 1 million v4 and/or v6 routes — this is pretty cool actually; turns out there is some magic that makes v6 routes not take up two TCAM entries as in ‘normal’ boxes (e.g. a 6509)
    • 8000 mcast groups PER leaf
    • 64000 tenants… that seems like enough… at least I think so
    • NX-OS mode does not scale like this — it’s limited by local hardware like ‘normal’ — ACI can scale since there is controller magic going on I think

ACI Overview:

  • Network industry is cyclical — VxLAN/NVGRE/other L2 tunnels allow us to have a big ‘routed’ fabric while maintaining L2 adjacency
  • Cisco seems to think (as do I) that VxLAN has won the ‘overlay wars’ (vs NVGRE)
  • Strong focus on the fact that overlays are fantastic, but if you can’t understand where the magical tunnel is taking traffic, and don’t have visibility into the underlay, then you are missing out on a lot of information that could be valuable, especially in troubleshooting. This is no doubt a response to NSX… I can’t argue with Cisco on this front. Overlays are cool, but gotta know what’s going on in the hardware too.
  • Cisco and Microsoft are partnering a lot it seems — not just in integration of Hyper V and Azure, but also in creating application profiles that the ACI fabric can act on.
    • Applications are pretty obviously the trend (it is called Application Centric Infrastructure… ) — application profiles can help the fabric understand what apps are running on devices automagically — i.e. an Exchange server comes on-line; the fabric can understand that due to the application profile and automatically treat that server with pre-defined security policies/load balancing/service chaining/etc.
  • I’m not a UCS guy, but it sounds like there are a lot of parallels between ACI and UCS — profiling things and GUI and just the overall way things work
  • At FCS there are not a ton of canned application profiles so we as engineers will have to create them.
    • The idea here is that we create them once and export and re-use for the next customer/data center
  • A big theme was not tying applications to subnets or VLANs. I think this is a pretty common theme in the big vendors’ ‘SDN’ strategies — ACI is saying we don’t care what subnet/VLAN a thing is in, we just care about the application.
    • In a lot of ways I think we’ve been able to do this already for quite some time with vCNS/vShield/1000v/VSG, but obviously as the industry progresses it’s becoming more and more of a theme
  • The ACI fabric has built-in extended ACL like functionality. It’s not a ‘real’ firewall, but it can do some cool stuff.
    • There is already some service-insertion type capabilities baked into the way ACI handles traffic flows (contracts/end point groups/etc.), but there is going to be further integration with other vendor firewalls it sounds like.
  • ACI is vendor agnostic — it doesn’t care what hypervisor you use, or if you want to use physical servers. Obviously there are different levels of integration, but you could do totally physical gear and still have some pretty powerful capabilities with ACI.
    • At FCS there seems to be only VMware support — it has hooks into vCenter day 1 and integrates with the distributed virtual switch right away.
    • Hyper V support should be available very soon (if not already — wasn’t clear on timing)
    • Since Hyper V is more about NVGRE, the leaf nodes will ‘translate’ between VxLAN and NVGRE — allowing L2 extension between VMware and Hyper V environments — pretty slick!
  • ACI is all ISIS and MP-BGP under the covers — but much like OTV/FabricPath you don’t ever touch/see this. All internal L2 adjacency is performed via VxLAN — then if required leaf nodes can translate between VxLAN/VLAN/NVGRE/etc.
  • EVERYTHING you can do in the CLI you can do in the GUI (and the GUI is NOT Java! Hooray!)
  • The APIC controller as well as the 9ks in ACI mode have a CLI. The 9k’s CLI is only for troubleshooting.
  • ACI/APIC is NOT a DCI technology. This should be apparent, but must be called out just in case!

More to come very soon. I have another several pages of notes!! All in all I’m pretty impressed. Now I just need customers to buy it so I can play with it more!

1000v VxLAN

I wrote about basic 1000v information a while back with a promise to write about some misadventures with VxLAN on the platform. Here I go!

As mentioned in the previous post, the 1000v essentials licensing (freemium) is all that is required to get some VxLAN stuff going. This is huge since it enables us to do some VxLAN at home in the lab without requiring any hardware outside of compute (Note: CSR 1000v and Arista vEOS also allow you to do some VxLAN stuff totally in software!). As with the 1000v in general though, there is still the requirement for vCenter, which is kind of a bummer for a home lab. Now that we have that out of the way, let’s take a look at the topology that we are working with:

[Topology diagram: vxlan1]

In the above topology we essentially have two subnets, VLAN 10 and VLAN 11. VLAN 10 is the ‘primary’ subnet containing two ESX hosts. ESX-1 is a management-type box; it holds the 1000v VSM and vCenter (this, I suspect, is not really best practice; personally I prefer 1000v deployments on the 1010 or 1110 appliances). ESX-2 is a production-type box; its vSwitch/dVS has been replaced with the 1000v VEM. Finally, in the VLAN 11 subnet a third ESX box is deployed; this box has also had its vSwitch/dVS replaced with the 1000v VEM.

I won’t beat a dead horse about what VxLAN is and how it works since there are plenty of articles and blog posts that say it better than I can. What it means to us, however, is that we can stretch layer 2 domains across routed domains. In the context of the Cisco 1000v, we can do this via the standard (not yet ratified standard, that is…) multicast configuration, or the Cisco unicast implementation. In the standard, multicast is used as a pseudo control plane — VTEPs join multicast groups that are associated with the VNIs that the VTEP cares about — in this way, VxLAN is able to maintain the look and feel of Ethernet for hosts in bridge domains. The unicast implementation forgoes multicast by utilizing the VSM as basically a control node that maintains mappings of which VNIs need to go where.

Unicast VxLAN is certainly an attractive proposition. I’ve yet to meet a network guy/gal that wants to deal with multicast when they don’t have to. It is SUPER easy to get things rolling with unicast mode, but it does come with some drawbacks. Prime amongst those is that, at least for now, you are stuck entirely with the 1000v when using unicast mode. Multicast mode being the ‘standard’ (again, not ratified) allows for interoperability between different VTEP platforms (in theory! I’ve tested 1000v with CSR multicast VxLAN, and that does indeed work, but who knows about inter-vendor operability). So for the purposes of this post, we are going to run with multicast since it seems, at least to me, to be the smart choice.

As we are going with multicast, there is of course the prerequisite that multicast routing be enabled on the routed infrastructure. The routed backbone across which VxLAN will run must be configured with PIM Bidir, and unicast reachability between VTEPs is of course required as well. In the 1000v scenario, this means the management IP address must have reachability to the vmkernel IP of the remote ESX hosts (basically to the VEM). When using CSR 1000v routers, a loopback is required as this is used as the source interface for the NVE interfaces.
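
Before getting to the 1000v itself, here is a rough sketch of what that backbone prerequisite could look like. Consider it a hedged example rather than a tested config: the RP address, group range, VNI, and interface names are all made up, and the first half assumes an NX-OS backbone (adjust for whatever is actually routing between your hosts):

! NX-OS backbone: enable PIM and define a bidir RP for the VxLAN group range
feature pim
ip pim rp-address 10.0.0.1 group-list 239.1.1.0/24 bidir

interface Ethernet1/1
  ip pim sparse-mode

! CSR 1000v (IOS-XE): loopback used as the NVE source interface
interface Loopback0
 ip address 10.0.0.10 255.255.255.255
 ip pim sparse-mode
!
interface nve1
 source-interface Loopback0
 member vni 5000 mcast-group 239.1.1.1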

Enough talk, onto the configurations.

As with other Nexus things, this of course is a feature, enable VxLAN with the segmentation feature:

feature segmentation

Now we can create our bridge-domain. This is like creating an L2 VLAN, just for VxLAN instead.

bridge-domain [BRIDGE DOMAIN NAME]
  segment id [VNI]
  group [MCAST GROUP]

The name is just a human-friendly name and is arbitrary as far as I can tell. The VNI, however, is important. The VNI is the VxLAN Network ID — it’s like a VLAN tag, but for VxLAN (all things old are new again) — and it is a 24-bit field, which is what allows us to have 16 million-ish unique VxLANs. Next, of course, we need to define the multicast group address that is associated with this VNI.
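
To make that concrete, a filled-in bridge-domain might look like the below. The name, VNI, and group are made-up values (I’m reusing the group from the PIM sketch earlier):

bridge-domain vxlan-web
  segment id 5000
  group 239.1.1.1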

Next up, we just need to create a port-profile that is associated with this bridge-domain. This is almost identical to a ‘normal’ port-profile:

port-profile type vethernet [PORT PROFILE NAME]
  vmware port-group
  switchport mode access
  switchport access bridge-domain [BRIDGE DOMAIN NAME]
  no shutdown
  state enabled

That’s basically all there is to it. This will get our new port-profile into all the ESX hosts that are using the 1000v VEM, and allow us to assign it to guests. Guests in this bridge-domain will now be in the same L2 segment no matter where in the data center they are (in our example, in either of the hosts). Doesn’t sound very cool, but we’ve just allowed L2 adjacency across an entirely routed backbone! This means that we can now build totally routed data center fabrics while still maintaining the ability for guests to vMotion to any other host without the requirement to re-IP.
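
And for completeness, here’s what the port-profile might look like when pointed at that example bridge-domain (the names again are just illustrative):

port-profile type vethernet vxlan-web-pp
  vmware port-group
  switchport mode access
  switchport access bridge-domain vxlan-web
  no shutdown
  state enabled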

Verification is a bit all over the place in my opinion… on the VSM you can take a look at the bridge-domains and see which MAC addresses are being mapped into which domain with the “show bridge-domain [BRIDGE DOMAIN NAME] mac” command. This will output the MAC address of course, as well as the module (basically the VEM number), virtual ethernet port, and the VTEP IP address for a given bridge-domain. The VTEP IP will be the vmkernel IP address as mentioned before. For a bit more digging, you can SSH to the ESX host and run some vemcmd commands from the ESXi shell:

~ # vemcmd show vxlan interfaces
LTL     VSM Port      IP       Seconds since Last   Vem Port
                               IGMP Query Received
(* = IGMP Join Interface/Designated VTEP)
-----------------------------------------------------------
 49        Veth2  10.10.11.254         6             vmk0         *

I feel the above command/output is the most helpful, but you can also look at how VSM ports map to VEM ports, and see which ones are actually doing VxLAN things, with “vemcmd show port”:

~ # vemcmd show port
  LTL   VSM Port  Admin Link  State  PC-LTL  SGID  Vem Port  Type
   17     Eth6/1     UP   UP    FWD       0          vmnic0
   49      Veth2     UP   UP    FWD       0            vmk0  VXLAN

I removed the other ports from the above output, but the important piece above is that the vmk0 interface is the VxLAN ‘port’ — basically that it’s a VTEP.

Lastly, we can view VxLAN statistics:

~ # vemcmd show vxlan-stats
  LTL  Ucast   Mcast/Repl   Ucast   Mcast    Total
       Encaps  Encaps       Decaps  Decaps   Drops
   49     800           68    1113      70       0
   54     800           68    1113      70       1

This one piqued my interest a bit. All those unicast encaps and decaps caught my eye. Isn’t this supposed to happen over multicast? Well, the answer is yes and no. VTEP discovery and mapping essentially take place over multicast, and multicast is also needed for broadcast traffic to be sent to all VTEPs that have hosts in a particular VNI. Actual host-to-host traffic, however, is encapsulated as unicast directly between VTEPs once the mappings are known. If you, like me, want to see this actually happening and bust out a packet capture, you may notice something else interesting: the 1000v utilizes UDP port 8472… which also happens to be the same port that OTV utilizes. IANA has since allocated port 4789; however, the 1000v has not been updated to reflect this. Interestingly, the CSR 1000v does indeed use 4789 — keep this in mind when interoperating the two!
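
If you want to see those port numbers for yourself, a simple capture filter along these lines does the trick (the interface name is whatever NIC can see the VTEP traffic in your setup):

# 1000v VxLAN (and, incidentally, OTV) on the pre-IANA port
tcpdump -nn -i eth0 udp port 8472

# CSR 1000v VxLAN on the IANA-assigned port
tcpdump -nn -i eth0 udp port 4789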

Now that we have 1000v VxLAN working, there are some big bummers from a lab perspective (and possibly some real world bummers too). Our VxLAN is essentially trapped within the bridge-domain. There is no virtual L3 gateway (that I’m aware of) offered by Cisco. There is, however, a VxLAN gateway that can be utilized with the 1110 platform. This VxLAN gateway is basically a bridge that will bridge VxLANs into VLANs. Once the VxLAN is bridged onto a VLAN things continue as normal; upstream switches can have a default gateway and do normal routing stuff. This functionality in VMware requires the 1110 appliance; in KVM, however, it is supported entirely in software (hopefully I’ll get that working soonish!).

In the meantime, for lab purposes, you can do a bit of duct tape and baling wire shenanigans to get your VxLAN piped out onto your network. I placed an Ubuntu host in one of my ESX hosts with two network interfaces. Put one interface into an access port on the VLAN you want to bridge the VxLAN onto, and the other into the bridge-domain itself. Then you can simply use normal Linux bridging to get things flowing between the two interfaces. Here’s what my /etc/network/interfaces config looks like:

auto br0
iface br0 inet dhcp
   bridge_ports eth0 eth1

Nice and simple!
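
Two quick notes on that config: the bridge_ports directive relies on the bridge-utils package, so you may need to install it first, and once the config is in place you can bring the bridge up and confirm both interfaces joined it. Something like the below should do it (assuming an Ubuntu box):

# install the bridging helpers if they are not already present
sudo apt-get install bridge-utils

# bring up br0 per /etc/network/interfaces, then check membership
sudo ifup br0
brctl show br0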

One last point about VxLAN. It is NOT a DCI technology… Unicast mode does do some of the things that a DCI technology should do, but it has big scalability concerns: it requires the 1000v VSM, there is a limited number of bridge-domains, the VxLAN gateway functionality (as of now) won’t scale very large, etc. Please refer to these two excellent posts that go into more detail about this… the TL;DR is: just because we can doesn’t mean we should.

http://yves-louis.com/DCI/?p=648

http://blog.ipspace.net/2012/11/vxlan-is-not-data-center-interconnect.html