The Mess that is Micro-segmentation

In the last few years, with the advent of “micro-segmentation,” there has been a recurring challenge in data center design and deployment… I think I can summarize that challenge pretty concisely: “WHAT THE HELL DO MY APPS ACTUALLY DO!?” Okay, okay, so maybe we know *what* they do, but who the hell knows how they work… seriously… if you know who knows please let me know.

This was never really an issue for networking folk until recently and all this fuss about microseg, so we never really figured out how to address it. We’ve generally asked the server guys what VLAN they want the ports to be in, and sometimes we have to ask about ports for public-facing things (stuff in the DMZ perhaps), but that has been more or less the totality of our involvement with the applications on our networks. Now you can obviously look at that and say, well, that’s maybe not the best thing… maybe we should have had more involvement and a more thorough understanding of what rides atop our (hopefully) super awesome networks. Yeah… probably, but let’s be honest, it’s really hard.

So while we never cared enough (or insert excuse/reason here) to make the effort to understand our applications before, we are now being forced to. Why? Well, it’s probably for the best to start, but the obvious trend toward micro-segmentation is the big driver here. I won’t attempt to define “micro-segmentation” as I think it’s yet another buzzy buzzword, but regardless, the reality is that in a modern DC design you really can’t (and arguably shouldn’t) avoid it. So how do we go about understanding what our application flows look like when 90%+ of companies really don’t understand their own apps?

The way I see it you really only have a handful of options, and none of them are terribly great. You could be cheap and rely on packet captures off of SPAN ports or via ERSPAN (yay Wireshark!). I actually kinda love this (also hate it though…) and I am working on a little side project w/ Python+Tshark+ACI to do some cool stuff (stay tuned for that), so this is a *viable*, but not great, option. The limitations are of course SPAN/ERSPAN support on devices, and the fact that you still have to SPAN the traffic to *something*, which is likely an old pizza box you found in the corner of the data center and shook the dust out of, or a VM. Either the dusty pizza box or the venerable VM has the same limitation — capacity to consume data. You’ll never be able to SPAN a busy EPG in ACI to a single 10G-attached box… that one EPG could be strewn across your data center with tens or hundreds of devices attached to it. In a non-Clos topology it *may* be a little easier as you can SPAN at “choke points,” but the fact remains that you are throughput limited and can’t scale SPAN/ERSPAN.
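As a taste of what that Python+Tshark tinkering looks like, here is a minimal sketch (the tshark invocation in the docstring is real, but the function itself is purely illustrative, not my actual side project): feed it tshark’s field output and it tallies unique flow tuples, which is the raw material for any ADM exercise.

```python
from collections import Counter


def summarize_flows(tshark_lines):
    """Tally (src, dst, dst-port, proto) tuples from tshark field output.

    Expects lines like those produced by:
      tshark -r capture.pcap -T fields -E separator=/t \
             -e ip.src -e ip.dst -e tcp.dstport -e ip.proto
    """
    flows = Counter()
    for line in tshark_lines:
        parts = line.strip().split("\t")
        # Skip non-IP or malformed lines (missing fields come out empty)
        if len(parts) != 4 or not all(parts):
            continue
        src, dst, dport, proto = parts
        flows[(src, dst, dport, proto)] += 1
    return flows
```

The point isn’t the twenty lines of Python; it’s that once the flows are reduced to tuples and counts, you’re working with kilobytes instead of the firehose you’d otherwise have to SPAN somewhere.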

The other primary option in my mind is the use of TAPs and specialized tooling. This option is REALLY good in a traditional environment (again, choke points are important), but only *if* you want to pony up! I don’t really have anything against this particular option other than the cost. That being said, any modern data center will likely be a Clos topology, which makes taking advantage of TAPs rather difficult (as in, you would miss all traffic local to a leaf due to anycast gateway functionality).

So, all that being said, how are we supposed to do application dependency mapping (ADM) in order to actually figure out what this whole micro-segmentation config should look like? I have no good answer! I think it will ultimately be a combination of more educated network folk, better cross-team communication (i.e. app guys talk to network guys and vice versa), and better tooling/analytics within the network itself.

Tooling is perhaps easier than the layer eight stuff (people!), but there is still no simple solution. In my mind there are a few critical aspects to address in whatever the tooling ends up looking like:

  • Distributed Systems work.
    • As I’ve said before the only way to do things at scale is to do things in a distributed nature. With respect to visibility/ADM this probably mostly means that the flow data needs to come from the EDGE of the network. This is even more critical in the modern spine/leaf style data center… you can no longer rely on stuffing a TAP into a choke point, or SPAN’ing all traffic at that choke point — you will lose all visibility into traffic local to a leaf or a pair of leafs.
    • Not having a choke point drives a need for distributing capture points by itself, but there is another just-as-critical reason for distributing the effort of flow mapping — 40/100G pipes… yeah… you can’t really SPAN a 100G link to a VM. I mean, I guess you could, but I don’t think that VM would last long before turning into a smoldering pile of computer bits. What I’m getting at here is that you simply can’t keep up with the amount of data modern DCs are capable of pushing, so the only way to handle that is to parse smaller amounts of data, and the only way to do that is to do it at the edge of the DC.
  • But Centralized Management is key.
    • All that hullabaloo about Distributed Systems and why we have to investigate app flows at the edge, but the reality is nobody has time to manage a bajillion SPANs or TAPs, so we still need to have some sort of central management plane where we can distill all the info we are gathering at the edge into something actually consumable by humans.
  • Flexibility counts.
    • SPANs are great because pretty much every vendor supports SPAN, but there are definitely limitations. TAPs are great, but again they’re not very flexible. Hardware sensors (thinking about things like Cisco Tetration here), are great, but often costly and still not super flexible. Software sensors are cool (and like ALL THE WAY at the edge which is good!), but agents are sometimes a PITA at a minimum and sometimes not viable in all environments. Moral of the story here is that you likely need to support multiple flavors of these options in order to be dynamic enough to adapt to multiple environments.
  • API all the things!
    • This is just table stakes now. I know that not everyone cares, and I know that not everyone needs things to be all magic and programmable, but dear lord, just have an API… it’s not that hard and it will make things better for some people. I want to be able to consume that pared-down data in some easy-to-consume fashion; that really means an API, because I am so freaking done with spreadsheets and CSV files 🙂
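To make that last point concrete, here is a hedged sketch of what I mean by consumable: assume some hypothetical API hands you flow records as JSON (the field names here are entirely invented), and a few lines of Python distill them into a provider-centric view of who consumes what, instead of yet another CSV.

```python
import json


def flows_to_policy(flow_json):
    """Collapse raw flow records into {(provider, "port/proto"): set(consumers)}.

    The record format is invented for illustration; whatever your tooling's
    API actually returns, the goal is the same -- distill per-flow data into
    something a human (or a policy engine) can act on.
    """
    policy = {}
    for rec in json.loads(flow_json):
        key = (rec["dst"], f'{rec["port"]}/{rec["proto"]}')
        policy.setdefault(key, set()).add(rec["src"])
    return policy
```

That provider/consumer shape also happens to map very naturally onto contract-style policy models, which matters later in this post.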

One possible tool to assist with the Mess that is Micro-segmentation is Illumio’s Adaptive Security Platform. At the recent NFD12 Illumio presented their take on not just ADM, but also how they take that data and turn it into real-life security policies. From what we saw they have a pretty slick solution that uses endpoint agents on Linux or Windows guests. These agents report information back up to a central controller (the Policy Compute Engine), from which you can choose to enforce security policies via the guests’ native firewall tooling (iptables or Windows Firewall). Take a look at the video from Illumio’s presentation (and the other presenters from NFD12!) on the TFD NFD12 YouTube channel. Or dive right into the Illumio demo right here:

Selfishly, I can see Illumio ASP being really useful for me — working predominantly with ACI, I often work with customers who need exactly the kind of information Illumio can provide. Moreover, I found it extremely convenient that the language and structure Illumio employs is very similar to that of ACI, with a focus on a provide/consume relationship between tiers of applications (not that provide/consume is some mythical/odd thing, it just seemed nice and handy). All of that is wrapped up with an API which, based on the presentations, should hold all the data I would need to build out contracts in ACI; you could even employ both Illumio and ACI from a security perspective if you were so inclined. The possibilities really are quite interesting — I’ve been envisioning building out net-new servers/services with the Illumio agent already installed, plopping them into a “dev” type tenant and letting them run for some amount of time, then (all programmatically) promoting that workload into “prod” and automatically generating contracts based on the information Illumio has gathered for you. Exciting possibilities indeed!
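To illustrate the kind of glue I’m imagining, here is a rough sketch of building ACI contract and filter payloads from an observed flow. The object and attribute names follow my reading of the ACI object model (vzFilter/vzEntry, vzBrCP/vzSubj); verify against the APIC REST documentation before actually posting any of this.

```python
def build_aci_contract(name, proto, port):
    """Build filter + contract payload dicts for one observed flow.

    Returns (filter_payload, contract_payload). Names and structure are a
    sketch of the ACI object model, not copied from any real integration.
    """
    port = str(port)
    filt = {
        "vzFilter": {
            "attributes": {"name": f"{name}-flt"},
            "children": [{
                "vzEntry": {"attributes": {
                    "name": f"{name}-e",
                    "etherT": "ip",
                    "prot": proto,
                    "dFromPort": port,
                    "dToPort": port,
                }}
            }],
        }
    }
    contract = {
        "vzBrCP": {
            "attributes": {"name": name},
            "children": [{
                "vzSubj": {
                    "attributes": {"name": f"{name}-subj"},
                    # Subject references the filter by name
                    "children": [{
                        "vzRsSubjFiltAtt": {"attributes": {
                            "tnVzFilterName": f"{name}-flt",
                        }}
                    }],
                }
            }],
        }
    }
    return filt, contract
```

Pair something like this with the flow data an ADM tool hands you, POST the results to APIC, and the “promote from dev to prod with contracts auto-generated” workflow above stops being hand-waving.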

All of my selfish ACI musings aside, Illumio really ticks a lot of the boxes for me in terms of what an ADM tool should be, and then it goes above and beyond by actually providing a mechanism to enforce the policies. Illumio may not be a magical ADM panacea (if it is, figure out when/if they’re going to IPO!), but it sure looks intriguing!

 

Disclaimer: I was at Networking Field Day 12 as a delegate. The event was sponsored by vendors, including Illumio. But all the garbage I write here is my own opinion. I’m bad at disclaimers.

Post-TFD Segment Routing Roundtable Thoughts

My brain has now had a bit of time to recover from the information overload that was the Tech Field Day Segment Routing Round Table, so it is most definitely time to write a bit about what I learned. You may want to get a listen in on the Software Gone Wild podcast with Ivan Pepelnjak for a solid foundation of what SR is before jumping into things. After that, head over to the TFD YouTube channel to check out the recordings from the event. We had some really great presentations from Walmart, Microsoft, and Comcast; each of these companies explained how Segment Routing is helping them in their particular environment. I would start with the presentation from Mark Pagan of Walmart as it goes over a lot of the real-world day-one benefits of SR. Then take a listen to the Microsoft and Comcast presentations; they really kicked it up a lot in terms of the complexity of their overall solutions, but also really highlighted a lot of what is possible with Segment Routing.

I’m not going to try to write anything too technical about SR because I am definitely not up to speed enough on it to talk about it at that level. What I am going to do is jot down my view on it as a technology, and its applicability (in my mind I guess) in the day-to-day networking world. I also want to respond to my own thoughts/questions from my previous post before the TFD event.

I’ll try to address my own previous points first:

– Whatever happened to NSH: Guess I didn’t really get a solid answer here. As far as I can tell NSH is still technically a thing, but it really seems to be fading away. I think ultimately it’s too big of a problem (or I guess solution) to really successfully implement. Somebody please chime in if there’s something new/interesting happening w/ NSH that I should be reading about. In any case, as compared to SR, they really are different beasts. I think there is some overlap in terms of what NSH was promising and what SR can do. Sure, SR can direct traffic through a network, and maybe even to or through some devices on the network, but it’s not intended to do “service-chaining” in the same way that NSH was/is.
– Config nightmare: Nope — I think that was my biggest takeaway: SR is pretty much MPLS 3.0. I say 3.0 because it is just that much simpler, not only to configure, but to troubleshoot as well. I bring up troubleshooting since I think this is/was the biggest and most important part of the whole event – SIDs (Segment IDs) are globally significant. Sounds not very exciting/important by itself, eh? Well, the reason I think that is so huge is that if you’ve ever worked w/ MPLS and tried to understand the end-to-end label switched path (LSP) while troubleshooting, then you will know that the labels are all over the place and only locally significant to each router — now they are unified across the whole LSP… that’s pretty badass! I should also note that instead of using LDP, SR distributes label/SID data via TLVs in OSPF or IS-IS, kinda sorta like LDP auto-config.
– Granularity/Service Chaining: I think you can do some of this with SR, but it’s really not its intended use case — a bit more on this later.
– Isn’t MPLS dead?: Heh… yes? No? Obviously it’s not dead in the service provider world, and likely won’t be for a long, long time. In the data center… mostly dead is maybe fair? I can say I personally don’t see much/any MPLS in the DC at least. I think part of why I was bringing this up before was because I was thinking more about an Enterprise DC (as that’s my day-to-day focus). I think you could absolutely use SR in an enterprise DC, but I don’t think it’s really the best tool for that job. If you take a look at who presented, though, you’ll see that while these are “Enterprises” (well, plus Comcast as a Service Provider), they’re freaking huge, and they’re really their own SPs doing SP-type things. (Microsoft is using this in their DCs, but in a very hyperscale/SP type way.)
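The globally significant SID point is easy to show with a little arithmetic. With LDP, each hop hands out whatever label it likes, so an LSP reads like a ransom note; with SR, the label for a prefix-SID is just the SRGB base plus the SID index, so if every node uses the same SRGB the label is identical end to end. A tiny sketch (16000 is the common default SRGB base on Cisco gear; the node names are made up):

```python
def sr_label(srgb_base, sid_index):
    """Prefix-SID label: SRGB base plus the globally significant SID index."""
    return srgb_base + sid_index


# With a uniform SRGB base of 16000, every node along the path computes
# the same label for prefix-SID index 101 -- which is exactly why reading
# an end-to-end SR LSP is so much easier than chasing LDP labels hop by hop.
path_labels = [sr_label(16000, 101) for _node in ("P1", "P2", "PE2")]
```

(If nodes run different SRGB bases the labels diverge again, which is why most designs push hard for a consistent SRGB everywhere.)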

Alright so I guess that addresses the points from my previous post, now on to a bit more wordy words to recap my thoughts on SR.

I feel like SR is kind of a no-brainer in the SP/WAN world; it really does just seem like a way better way to do MPLS. You’ll still have to layer “stuff” on top of the SR bits (vpnv4/6 address family type stuff, or whatever it is you’re running atop your MPLS), but SR just makes the rest seem so trivial. TE just got owned too… it seems like there is basically no point to TE as we know it today if you can just use SR-TE to make your life so much easier. All that is well and good, but I don’t really live in a provider-centric world; I focus on data centers so…

While I am now a fan of SR, I feel like it doesn’t have a place in the data center. I know that the folks working on it will probably disagree, and I would like to agree with them, but I can’t at this point. The current biggest challenge in the data center (at least at normal enterprise scale) is that we still have to have L2 in some capacity. This is a super super super lame requirement, but it is what it is. This requirement is the reason we have janky spanning-tree kludges (MLAG, vPC, VSS, etc.), FabricPath/TRILL, OTV, VPLS, and now VxLAN. Now, from what I understand, there isn’t technically any reason you couldn’t use SR w/ some AToM or VPLS (maybe PBB?) to provide L2 over L3 in the data center, but that sounds like a freaking headache. VxLAN has pretty much won the DC overlay wars, and I don’t see any reason to introduce SR into the DC. Between data centers SR certainly could have a role in providing transport services, or even L2 extension; however, even then, as VxLAN continues to mature and grow into that role, it doesn’t feel like it’s worth it to tack on another protocol/feature to support that requirement. If SR was the panacea for service chaining that I was kinda hoping it would be, then perhaps I’d feel differently. So at this point, given our stupid “requirements” for L2, I think SR should/will likely stick to the WAN/hyperscale folks. There’s nothing wrong with that of course, but I do feel like it’s important to delineate where SR is best suited (at least in my mind!).

 

PS – Go watch Paul Mattes’ presentation (Microsoft); they’re using link-bandwidth in BGP, which has always seemed to me to be the best kept secret of BGP. I was very excited to hear they’re really taking advantage of it in production; I rarely see it, so that was fun. /end nerdgasm