NFD16 – Gigamon and Splunk (with a Dash of Phantom)

I had the opportunity to take a trip to the Gigamon mothership (no, not like the Apple Mothership, it’s a normal HQ) last week at Networking Field Day 16. I was pretty excited as I’ve not had a ton of hands-on time with Gigamon, but I’ve run into their products at a good portion of the customers I’ve worked with over the years.

If you’d like to take a peek at some Gigamon presentations before continuing, you can check out a ton of them here on Tech Field Day’s YouTube channel.

This event’s presentations were focused on leveraging Gigamon’s, and some of its partners’ (Splunk and Phantom), ability to react to security incidents. The core idea in their presentation is that security is a very hard problem (they ain’t wrong!), and that many organizations spend a large quantity of time and money on it, generally across a broad set of tools. To make security simpler, the idea is that you can use Gigamon and Splunk in concert to detect things that you would prefer not happen on your network, things that would require tons of other tools to catch without these platforms’ combined power. The integration between the platforms allows Splunk to fire off triggers which Gigamon can then react to by dropping or alerting on the offending traffic. Taking things a step further, via the integration with Phantom these events as seen in Splunk could fire off a whole host of mitigation tactics/processes to automate away a lot of the manual, tedious work that SOC personnel may have to deal with. All told, this is a pretty cool story. The integration with both of these other platforms seemed, from the demo, to be pretty smooth; flexible, but not super open-source-y (i.e. your average Joe/Jane could probably figure it out without pulling out too many hairs).

In a perfect world, I can certainly see Gigamon (and team) supplanting many other products by consolidating functionality. Gigamon itself is a pretty powerful platform; couple that with Splunk, which is a beast and can provide very interesting data correlation/insights, and finally wrap it all up with Phantom to put the sexy bow of automation on things, and it all looks interesting (unfortunately time was limited for the Phantom portion of the presentation so I don’t have much insight there, but it really did look awesome!). That being said, there are some challenges…

Gigamon at its core relies on being inline with traffic, or at least receiving traffic (of course if you’re not inline you can’t drop things, so keep that in mind). This has historically been more or less a non-issue — data centers have always had choke points, so go ahead and plop your Gigamon appliance right there and you’re in business. That whole Christmas-tree-type topology where we had easily defined choke points is not really a thing anymore (at least in data centers being deployed now — of course they still exist). Most data centers, and certainly the ones I’m involved in, are opting to build out in a Clos topology. In a Clos topology we can have crazy things — like 128-way ECMP!! Not that 128-way ECMP is common, but even in small 2-4 spine node topologies there aren’t any especially good places to put a device like a Gigamon. You can of course put Gigamon/IPS/whatever inline between leaf and spine nodes, however this is an atrociously expensive proposition — for several reasons: firstly, the sheer number of links that may entail (even a modest 20-leaf, 4-spine fabric has 80 leaf-spine links to cover), and secondly from a capacity perspective — if you’re going 40G, 100G, or looking toward the crazy 400G, you’re going to have to pay to play to run that kind of throughput through a device (Gigamon or otherwise). Depending on the topology, it may be easy to snag “north/south” traffic (border leaf nodes -> whatever is “northbound” for example), but with an ever-increasing focus on micro-segmentation within the data center this is *probably* not sufficient for most orgs.

One option to address some of this that was not mentioned, or was mentioned only briefly, at the Gigamon presentation is the GigaVUE-VM. The idea here is that this is a user-space virtual machine that can either be inline with VM traffic, or sit “listening” in promiscuous mode. Because this lives in user space there are no hypervisor requirements/caveats; it just kind of hangs out. If used in “inline mode” (which I’ve actually not seen, so maybe that’s not a thing?) there is the potential for this to replace the big iron hardware appliances, and fit more neatly into a Clos topology. I would have liked to see/hear more about this… more on why at the end…

I had two major takeaways from the Gigamon presentation. Firstly — Splunk is like a magical glue to tie things together! The data being fed into Splunk could have come from any number of sources (syslog of course, agents on clients, HTTP events, etc.); in this case it was from Gigamon, and Gigamon performed drop actions based on the rules created in Splunk. I suspect that Splunk could be (relatively?) easily configured to make an API call to a firewall or other device/platform to react to data being fed into it. Splunk without data, though, is not really all that useful — and here is where Gigamon showed its value. Being able to capture LARGE amounts of data, and then do something with it (really just drop, but that’s an important thing) is very valuable.
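To make that “Splunk fires off an API call” idea concrete, here’s a minimal sketch of what a Splunk custom alert action script could look like. Everything firewall-side here is hypothetical — the endpoint URL, payload shape, and field names are mine, not any vendor’s actual API — but the stdin/`--execute` invocation pattern is how Splunk hands alert payloads to custom alert action scripts.

```python
#!/usr/bin/env python3
"""Sketch: a Splunk custom alert action that asks a firewall to drop a source IP.

The firewall endpoint and payload below are HYPOTHETICAL placeholders --
substitute whatever your firewall/Gigamon node actually exposes.
"""
import json
import sys
import urllib.request

FIREWALL_API = "https://firewall.example.com/api/v1/block"  # hypothetical endpoint


def build_block_request(event):
    """Turn one Splunk result row into a (url, body) pair for the firewall API."""
    payload = {
        "action": "drop",
        "src_ip": event["src_ip"],                      # field name depends on your search
        "reason": event.get("signature", "splunk-alert"),
    }
    return FIREWALL_API, json.dumps(payload).encode()


def main():
    # Splunk invokes custom alert actions with --execute and passes the alert
    # payload (including the triggering search result) as JSON on stdin.
    settings = json.loads(sys.stdin.read())
    url, body = build_block_request(settings["result"])
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)  # fire-and-forget; add auth/retries in real life


if __name__ == "__main__" and "--execute" in sys.argv:
    main()
```

Swap the `urlopen` call for whatever your actual enforcement point speaks and the same skeleton covers the “react to data being fed into Splunk” pattern for pretty much any device with an API.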

That being said, my second takeaway was that this felt largely out of sync with what most customers I see are doing, at least in the data center. When pressed for how to practically adapt this into a Clos topology the answers were thin at best: (paraphrasing) “just tap all the links to leaf/spine”, “tap at chokepoints”, etc. This is all well and good, and depending on requirements and budget may be just fine, however I didn’t exactly get warm fuzzies that Gigamon knows how to play nice in these Clos data centers. Obviously tapping everything is a non-starter financially, and chokepoints are well and good, but that means that the substantial investment in Gigamon/Splunk (because it really does seem like they need to be deployed in unison to justify the expenditures) doesn’t actually do you much/any good for securing east/west traffic.

Having run into Gigamon in several Cisco ACI deployments I’ve been a part of, I can say that customers really love — or at least have invested so much that they feel the need to continue to get value out of — Gigamon, but each time I’ve seen this there has been a big struggle to find a good home for the appliances. This is why I really would have liked to have seen and heard more about the GigaVUE-VM — my knowledge of it is quite limited, but it certainly seems to be a possible workaround for the challenge of finding choke points in a Clos fabric. The big caveat is that the Gigamon folks did mention that the VM does NOT have feature parity with the HC hardware appliances. It sounds like they are investing in adding these features though, which would obviously be helpful.

One final note: as I have very data-center-focused goggles on, I’ve more or less ignored the campus/WAN, but I definitely think this could be useful in those areas, perhaps much more so than in the data center.


The Mess that is Micro-segmentation

In the last few years, with the advent of “micro-segmentation,” there has been a recurring challenge in data center design and deployment… I think I can summarize that challenge pretty concisely: “WHAT THE HELL DO MY APPS ACTUALLY DO!?” Okay, okay, so maybe we know *what* they do, but who the hell knows how they work… seriously… if you know who knows please let me know.

This was never really an issue for networking folk before the last little bit and all this fuss about microseg, so we never really figured out how to address it. We’ve generally asked the server guys what VLAN they want the ports to be in, and then sometimes we have to ask about ports for public facing things (stuff in the DMZ perhaps), but that has been more or less the totality of our involvement with the applications on our networks. Now you can obviously look at that and say well that’s maybe not the best thing… maybe we should have had more involvement and a more thorough understanding about what rides atop our (hopefully) super awesome networks. Yeah… probably, but let’s be honest, it’s really hard.

So while we never cared enough (or insert excuse/reason here) to make the effort to understand our applications before, we are now being forced to. Why? Well, it’s probably for the best to start, but the obvious trend toward micro-segmentation is the big driver here. I won’t presume to try to define “micro-segmentation” as I think it’s yet another buzzy buzzword, but regardless, the reality is that in a modern DC design you really can’t (and arguably shouldn’t) avoid it. So how do we go about understanding what our application flows look like when 90%+ of companies really don’t understand their own apps?

The way I see it, you really only have a handful of options, and none of them are terribly great. You could be cheap and rely on packet captures off of SPAN ports or via ERSPAN (yay Wireshark!). I actually kinda love this (also hate it though…) and I am working on a little side project w/ Python+Tshark+ACI to do some cool stuff (stay tuned for that), so this is a *viable*, but not great, option. The limitations are of course SPAN/ERSPAN support on devices, and the fact that you still have to SPAN the traffic to *something* — likely an old pizza box you found in the corner of the data center and shook the dust out of, or a VM. Either the dusty pizza box or the venerable VM has the same limitation: capacity to consume data. You’ll never be able to SPAN a busy EPG in ACI to a single 10G-attached box… that one EPG could be strewn across your data center with tens or hundreds of devices attached to it. In a non-Clos topology it *may* be a little easier, as you can SPAN at “choke points”, but the fact remains that you are throughput limited and can’t scale SPAN/ERSPAN.
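For the curious, the skeleton of that Python+Tshark idea is pretty small: shell out to tshark to flatten a SPAN/ERSPAN capture into (src, dst, dstport) tuples, then count them up into flows. This is a rough sketch, not the actual side project — the field selection is just enough to illustrate the shape of it.

```python
import csv
import subprocess
from collections import Counter


def capture_flows(pcap_path):
    """Use tshark to flatten a capture file into (src, dst, dstport) tuples.

    Requires tshark on PATH; -T fields + -e pulls out just the columns we want.
    """
    out = subprocess.run(
        ["tshark", "-r", pcap_path, "-T", "fields", "-E", "separator=,",
         "-e", "ip.src", "-e", "ip.dst", "-e", "tcp.dstport"],
        capture_output=True, text=True, check=True).stdout
    # Drop non-TCP/partial rows so every tuple is a complete 3-field flow key.
    return [tuple(row) for row in csv.reader(out.splitlines()) if len(row) == 3]


def summarize(flow_tuples):
    """Collapse raw per-packet tuples into per-flow packet counts."""
    return Counter(flow_tuples)
```

Feed `summarize(capture_flows("span.pcap"))` into anything — a report, a diff against expected flows, or (eventually) contract generation — and you have a poor man’s ADM pipeline, with all the throughput caveats described above.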

The other primary option in my mind is the use of TAPs and specialized tooling. This option is REALLY good in a traditional environment (again, choke points are important), but only *if* you want to pony up! I don’t really have anything against this particular option other than the cost. That being said, any modern data center will likely be a Clos topology, which makes taking advantage of TAPs rather difficult (as in, you would miss all traffic local to a leaf due to anycast gateway functionality).

So all that being said, how are we supposed to do application dependency mapping (ADM) in order to actually figure out what this whole micro-segmentation config should look like? I have no good answer! I think it will ultimately be a combination of more educated network folk, better cross-team communication (i.e. app guys talk to network guys and vice versa), and better tooling/analytics within the network itself.

Tooling is perhaps easier than the layer eight stuff (people!), but there is still no simple solution. In my mind there are a few critical aspects to address in whatever the tooling ends up looking like:

  • Distributed Systems work.
    • As I’ve said before, the only way to do things at scale is to do them in a distributed fashion. With respect to visibility/ADM this mostly means that the flow data needs to come from the EDGE of the network. This is even more critical in the modern spine/leaf style data center… you can no longer rely on stuffing a TAP into a choke point, or SPAN’ing all traffic at that choke point — you will lose all visibility into traffic local to a leaf or a pair of leafs.
    • Not having a choke point drives a need for distributed capture points by itself, but there is another just-as-critical reason for distributing the flow-mapping effort — 40/100G pipes… yeah… you can’t really SPAN a 100G link to a VM. I mean, I guess you could, but don’t think that VM would last long before turning into a smoldering pile of computer bits. What I’m getting at is that you simply can’t keep up with the amount of data modern DCs are capable of pushing, so the only way to handle it is to parse smaller amounts of data, and the only way to do that is to do it at the edge of the DC.
  • But Centralized Management is key.
    • All that hullabaloo about Distributed Systems and why we have to investigate app flows at the edge, but the reality is nobody has time to manage a bajillion SPANs or TAPs, so we still need to have some sort of central management plane where we can distill all the info we are gathering at the edge into something actually consumable by humans.
  • Flexibility counts.
    • SPANs are great because pretty much every vendor supports SPAN, but there are definitely limitations. TAPs are great, but again they’re not very flexible. Hardware sensors (thinking about things like Cisco Tetration here), are great, but often costly and still not super flexible. Software sensors are cool (and like ALL THE WAY at the edge which is good!), but agents are sometimes a PITA at a minimum and sometimes not viable in all environments. Moral of the story here is that you likely need to support multiple flavors of these options in order to be dynamic enough to adapt to multiple environments.
  • API all the things!
    • This is just table stakes now. I know that not everyone cares, and I know that not everyone needs things to be all magic and programmable, but dear lord, just have an API… it’s not that hard and it will make things better for some people. I want to be able to consume that pared-down data in some easy-to-consume fashion, and that really means an API, because I am so freaking done with spreadsheets and CSV files 🙂
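To illustrate why “just have an API” beats spreadsheets: once the flow data comes back as structured records, questions like “who are my top talkers?” become one-liners instead of pivot-table archaeology. The endpoint and field names below are completely made up — a stand-in for whatever visibility tool you end up with — but the shape of the consumption code is the point.

```python
import json
import urllib.request


def top_talkers(flow_records, n=3):
    """Rank flow records by byte count -- trivial once data is structured."""
    return sorted(flow_records, key=lambda r: r["bytes"], reverse=True)[:n]


def fetch_flows(base_url, token):
    """Pull flow records from a hypothetical visibility-tool REST API.

    base_url/token and the /api/flows path are illustrative placeholders.
    """
    req = urllib.request.Request(
        f"{base_url}/api/flows",
        headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # expect a list of {"src": ..., "dst": ..., "bytes": ...}
```

Compare that to exporting a CSV, opening it in Excel, and sorting by hand every time the question changes — the API version is scriptable, schedulable, and feeds straight into whatever automation sits downstream.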

One possible tool to assist with the Mess that is Micro-segmentation is Illumio’s Adaptive Security Platform. At the recent NFD12, Illumio presented their take on not just ADM, but also how they take that data and turn it into real-life security policies. From what we saw they have a pretty slick solution that uses endpoint agents on Linux or Windows guests. These agents report information back up to a central controller (the Policy Compute Engine), from which you can choose to enforce security policies via the guest’s native firewall tooling (iptables or Windows Firewall). Take a look at the video from Illumio’s presentation (and the other presenters from NFD12!) on the TFD NFD12 YouTube channel, or dive right into the Illumio demo right here:

Selfishly, I can see Illumio ASP being really useful for me — working predominantly with ACI, I often work with customers who need exactly the kind of information Illumio can provide. Moreover, I found it extremely convenient that the language and structure Illumio employs is very similar to that of ACI, with a focus on a provide/consume relationship between tiers of applications (not that provide/consume is some mythical/odd thing, it just seemed nice and handy). All of that is wrapped up with an API which, based on the presentations, should hold all the data I would need to build out contracts in ACI; you could even employ both Illumio and ACI from a security perspective if you were so inclined. The possibilities really are quite interesting — I’ve been envisioning building out net-new servers/services with the Illumio agent already installed, plopping them into a “dev” type tenant and letting them run for some amount of time, then (all programmatically) promoting that workload into “prod” and automatically generating contracts based on the information Illumio has gathered for you. Exciting possibilities indeed!
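The ACI half of that “automatically generate contracts” daydream is mostly just building the right JSON for the APIC. Here’s a rough sketch of constructing a contract (vzBrCP) object that references an existing filter — in real life you’d POST this to the APIC REST API under the target tenant, and the Illumio-to-filter mapping that feeds it is left entirely as an exercise (I’m making no claims about Illumio’s API shape here).

```python
def contract_payload(contract_name, filter_name):
    """Build the vzBrCP (contract) JSON body the APIC object model expects.

    Assumes a vzFilter named filter_name already exists in the tenant; the
    subject naming convention (<contract>-subj) is just my own convention.
    """
    return {
        "vzBrCP": {
            "attributes": {"name": contract_name},
            "children": [{
                "vzSubj": {
                    "attributes": {"name": f"{contract_name}-subj"},
                    "children": [{
                        # Relation object tying the subject to the filter
                        "vzRsSubjFiltAtt": {
                            "attributes": {"tnVzFilterName": filter_name}
                        }
                    }],
                }
            }],
        }
    }
```

From there it’s one authenticated POST per contract (to `/api/mo/uni/tn-<tenant>.json` on the APIC), so “promote from dev to prod and generate the contracts” really could be a single script once the flow data is trustworthy.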

All of my selfish ACI musings aside, Illumio really ticks a lot of the boxes for me in terms of what an ADM tool should be, then it goes above and beyond by actually providing a mechanism to enforce the policies. Illumio may not be a magical ADM panacea (if it is, figure out when/if they’re going to IPO!), but it sure looks intriguing!


Disclaimer: I was at Networking Field Day 12 as a delegate. The event was sponsored by vendors, including Illumio. But all the garbage I write here is my own opinion. I’m bad at disclaimers.