Why is Clos (Spine/Leaf) the thing to do in the Data Center?

I’ve been talking to lots of customers about Data Center refreshes and greenfield Data Center builds lately. A lot of these folks have a very ‘traditional’ Data Center design: Nexus 7k/5k/2k in a ‘fat tree’ type topology. This is obviously a completely valid design and has worked in the Data Center for quite a while, but there is a better way! (maybe!?)

First let’s talk about why/how/when this whole Clos thing came about. Turns out there was this smart dude back in the 1950s named Charles Clos. He created this magical topology for the internal architecture of phone switching equipment. There’s lots of stuff on the intertubes about this dude and his design, but for our purposes we just need to know that he essentially created a non-blocking internal fabric. This was all well and good, and I presume it stuck around in phone-land and did awesome things there. Where it gets interesting (to me at least) is that this architecture reappears in the land of networking sometime in the 90s. Some smart people (I assume at least) decided that they could leverage this design for the internal fabric of Ethernet switches. All of this is basically to say that this whole topology has been around a while.

Things got even more interesting when people started realizing that they could solve a lot of the problems modern data centers have by implementing this very same topology in the physical layout of the data center. So what problems does this topology really solve for us? I’m sure there are varying opinions on the matter, but here are my top three problems that a Spine/Leaf network helps solve in the DC:

  1. Explosion of East/West traffic: As applications have become more and more complex, a single web server may need to query a whole bunch of app servers/databases in order to fill out a web page. Consider a utility company: you want to log onto their website to pay your bill. You head on over to the company’s webpage, generic website things start to load, no big deal. You log in, and now the webpage has to query some database containing user information, then it’s got to go and pull information for your electricity usage, gas usage, perhaps water/sewer as well. This is perhaps a terrible example, but you get my point I hope! Essentially you have a very small amount of traffic coming in to the web server (North/South traffic), just the initial query, but potentially a LOT of stuff happening on the backend between the web server and other servers/services in the data center (East/West traffic). In a traditional 7k/5k/2k data center design, you may have these servers/services physically located in different rows or racks, causing the inter-server communication to go from its 2k, up to its 5k (going to happen anyway, I guess, unless you’re running Nexus 2300 series FEX), then up to the 7k, then back down to another pair of 5k EoR/ToR switches, down to a 2k and back…. rinse/repeat for as many transactions as the website requires. This creates a lot of traffic that’s going over (probably) vPC links… vPCs can of course only be run across a PAIR of switches, so at some point you will end up not being able to add additional bandwidth, or you will run out of ports between the 5k and 7k (obviously this would be at a pretty crazy scale, but still!). In a Spine/Leaf design, adding bandwidth is as simple as adding another Spine switch (because of reasons outlined in problems 2 and 3).
     You can eliminate the ‘vPC problem’ of only having a pair of devices upstream, and you can also eliminate the middle ‘distribution’ tier of 5ks if you want (you can still run 2ks off of ToR/EoR switches in a Leaf/Spine design if you are so inclined). Because of these last two points, we can more efficiently leverage the bandwidth available, and do so without needing a million ports on a pair of ‘core’ type devices, since we can just add (cheaper, see next item) Spine devices at will.
  2. The ‘God Box’ Problem(s): I have opinions. They’re always right. Always. Some people disagree. Those people are wrong! (Kidding… sort of…) Anyway, one of the central tenets of network design to me is always to build distributed systems. I always point to the Internet itself as the ultimate case in point that distributed systems work, and work well (all Internet problems aside, look at that thing: it’s huge, and it’s even working!). Why am I babbling about this? Because I feel strongly that a pair of Nexus 7ks represents a kind of ‘God Box’ in the Data Center. These boxes (not just 7ks obviously, but any DC ‘cores’) become so completely intertwined in the Data Center that they become hard to perform maintenance on: they probably house the SVIs for ALL your DC VLANs, they probably have load balancers connected to them, they’ve probably got other critical devices connected to them, and ALL your traffic flows through them. In a Spine/Leaf topology, there isn’t really a concept of a ‘core,’ there are just Spines and Leafs. Spines don’t do anything but provide ECMP paths for traffic between your Leafs. This makes them essentially just ‘dumb’ boxes performing L2 or L3 magic (TRILL/FabricPath/ECMP routing). It’s very easy to add or remove Spines, because you can cost out the links on a Spine you would like to perform maintenance on and traffic simply swings over to the other nodes. Leaf nodes are perhaps a little more complex, but they are usually deployed in pairs and have a much smaller zone of impact. Leafs tend to be ToR switches, so even if you totally bugger up both Leafs in a pair, you are only directly impacting a single rack’s worth of end hosts, not a whole DC. So, basically what I’m saying is that I believe a Leaf/Spine topology promotes a ‘God Box’ free data center… I’m a big fan of that! The other great thing that Leaf/Spine networks promote, simply by the nature of the topology, is commoditized infrastructure in terms of both features and hardware.
     In the tree topology, we have to funnel everything up to the DC cores with a bunch of fancy features (expensive!). Adding bandwidth means adding links, and adding links maybe means adding cards… all this leads to having to get lots of density into those complicated God boxes. Each Spine in a Spine/Leaf design needs only enough ports to have ONE connection to each Leaf node (you can add more if you’d like though). This allows us to have less dense (and potentially cheaper overall, because they are less modular, maybe we run single SUP boxes instead of dual, etc.) Spines. Adding bandwidth is much simpler, because we can just add physical Spine nodes. Consider a DC that wants to have 32 ToR switches connected at 40g, with 160g going to each Leaf switch. In our 7k example, we would need to have 64 40g ports available on each 7k (each Leaf’s four 40g uplinks split two per 7k, times 32 Leafs), but in a Spine/Leaf topology, we can have 4x Nexus 9332 switches, each with a single 40g link to every Leaf (32 ports per Spine). Lastly, not only can we have smaller, cheaper hardware, we are also moving toward a simpler overall design feature-wise.
  3. Failure Domains: While a classical tree topology data center doesn’t necessarily mean there are large layer 2 domains, in my experience that tends to be the case. We’ve tried to eliminate spanning-tree, but it doesn’t die easily. vPCs are a pretty solid band-aid to cover up spanning-tree issues, but they still don’t eliminate spanning-tree completely. While a Spine/Leaf topology doesn’t automatically eliminate L2 in the data center (in fact, L2 Spine/Leaf designs with FabricPath/TRILL are totally viable), I feel that it does promote an L3 ECMP fabric. In most vendor Spine/Leaf reference architectures, VXLAN is used extensively to provide L2 adjacency across the fabric. This of course lets us have a totally routed fabric, which means our failure domains are pushed all the way to the edge of the fabric!
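
To make the ‘Spines are just ECMP paths’ idea from the list above a bit more concrete, here is a rough Python sketch. It is a toy model, not how any particular switch ASIC hashes traffic: the `pick_spine` and `spread` helpers, the spine names, and the flow tuples are all made up for illustration. It hashes each flow’s 5-tuple onto one of the available Spines, then ‘costs out’ one Spine by simply removing it from the ECMP set.

```python
import hashlib

def pick_spine(flow, spines):
    """Pick one spine for a flow by hashing its 5-tuple.

    This is a stand-in for hardware ECMP hashing, not a real vendor algorithm.
    """
    if not spines:
        raise ValueError("no spines available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return spines[int.from_bytes(digest[:8], "big") % len(spines)]

def spread(flows, spines):
    """Count how many flows land on each spine."""
    counts = {spine: 0 for spine in spines}
    for flow in flows:
        counts[pick_spine(flow, spines)] += 1
    return counts

# Hypothetical East/West flows: (src_ip, dst_ip, protocol, src_port, dst_port)
flows = [("10.0.1.10", "10.0.2.20", "tcp", 40000 + i, 443) for i in range(1000)]

all_spines = ["spine1", "spine2", "spine3", "spine4"]
print(spread(flows, all_spines))

# Costing out spine2 for maintenance is modelled by dropping it from the ECMP set;
# every flow still finds a path through the remaining spines.
drained = [spine for spine in all_spines if spine != "spine2"]
print(spread(flows, drained))
```

The point of the sketch is only that no single Spine is special: take one out and the same flows spread across whatever is left, and adding a fifth Spine would just be a longer list.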

At this point people will oftentimes say: “Well we will never need this type of scalability” and/or “Our traditional design has been working just fine.” Well… maybe in your environment a traditional topology is just fine, but for the vast majority of customers that I see, the explosion of East/West traffic and the desire to move toward a ‘software defined data center’ (a la Spine/Leaf and VXLAN, oftentimes) are driving this topology no matter what. I also don’t think that just because you are leveraging a Spine/Leaf topology you automatically need to build it out to support 32 Leafs: a single pair of Spine switches and 2 pairs of Leafs is often what I’m seeing deployed. The commoditized nature of the platforms that fit into this type of topology means that buying those 6 switches often costs less (sometimes by a BIG margin) than buying a pair of traditional monolithic data center core switches. The last issue that inevitably arises is: “I need to vMotion, what about my L2?” I think we all hate this statement… build better designed applications and you probably wouldn’t need to vMotion! Since that’s not happening fast enough though, we have VXLAN. We can still provide L2 anywhere in the fabric, and yet we don’t need to rely on spanning-tree kludges.
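
If you want to play with the sizing yourself, the port math is simple enough to sketch in a few lines of Python. This covers both the 32-Leaf example from earlier and the small six-switch build. The function names are mine, and the sketch assumes each Leaf splits its uplinks evenly across a two-box core in the traditional design and runs exactly one link to every Spine in the Spine/Leaf design.

```python
import math

def uplinks_per_leaf(target_gbps, link_gbps):
    """Number of uplinks a leaf needs to reach the target uplink bandwidth."""
    return math.ceil(target_gbps / link_gbps)

def ports_per_core(leaves, uplinks):
    """Ports needed on EACH box of a two-box core.

    Assumes every leaf splits its uplinks evenly across the pair.
    """
    return leaves * math.ceil(uplinks / 2)

def spine_design(leaves, uplinks):
    """Spine count and ports per spine.

    Assumes one link from every leaf to every spine, so the number of
    spines equals the number of uplinks per leaf.
    """
    return {"spines": uplinks, "ports_per_spine": leaves}

# 32 leaves, 160g of uplink per leaf, built from 40g links
uplinks = uplinks_per_leaf(160, 40)
print(uplinks)                    # 4
print(ports_per_core(32, uplinks))  # 64
print(spine_design(32, uplinks))    # {'spines': 4, 'ports_per_spine': 32}

# The small build: 2 spines and 2 pairs of leaves (4 leaves)
print(spine_design(4, 2))           # {'spines': 2, 'ports_per_spine': 4}
```

Notice that in the Spine/Leaf case the per-box port count only depends on how many Leafs you have, while in the core-pair case it grows with both Leaf count and bandwidth per Leaf.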

All in all, I’m a fan and have been promoting Spine/Leaf topologies and their benefits wherever I can! I’d be interested to hear others’ thoughts: are you seeing customers adopt a Spine/Leaf data center design, or are people still holding out? And if they are, what are the reasons?