Collecting and analyzing the data
The agents installed in our customers’ environments allow us to see what’s happening on their networks: which applications, services, and hosts are present; how they’re communicating; and when they’re communicating. The data we collect allows us to baseline normal communication patterns and assess how overexposed the network is. Using the observed traffic and a statistical analysis of what’s necessary for ideal-but-secure workload communication, we create “nearly optimal” policies. Edgewise has several goals for what makes a generated policy set optimal:
- We want the smallest possible number of policies, to keep policy management from becoming onerous.
- We want policies that account for as much of the observed data as possible, for accuracy.
- We want policies to allow as little unobserved data as possible. In other words, we don’t want to guess at what’s happening on a customer’s network. This is complicated, though, because unobserved data is exactly what we can't see: we know what should be allowed to communicate, and with what other entities, but we have no explicit information about what shouldn't be allowed.
- We want relatively "trusted" applications to talk relatively freely, but for more suspicious or potentially dangerous applications to be more constrained—and for malicious applications not to be allowed to communicate at all.
- We want interpretable policies that users can understand, which rules out "black box" models. There are several reasons for this:
- First, most cybersecurity professionals don't blindly trust computers—if they did, they probably wouldn't be working in cybersecurity in the first place.
- Second, by using computers to process huge amounts of data into suggested policies, we want to allow users to integrate their knowledge of their networks into those policies, and gain insight about their networks.
Clearly, these goals cannot all be optimized simultaneously because they're in conflict with each other. However, we can define all these goals, and others, mathematically. Then, we use Bayesian techniques to look for sets of policies that satisfy all of these goals as well as possible.
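As a rough illustration of how conflicting goals can be folded into a single objective, here is a toy scoring function in Python. The weights, the policy format, and the score terms are illustrative assumptions, not Edgewise's actual model:

```python
# Toy objective combining three of the goals above: coverage of observed
# traffic, a penalty for allowing unobserved traffic, and a penalty for
# policy-set size (management burden). All weights are made up.

def score_policy_set(policies, observed_flows):
    allowed = set()
    for policy in policies:
        allowed |= policy["covers"]  # flows this policy would permit

    coverage = len(allowed & observed_flows) / len(observed_flows)
    overreach = len(allowed - observed_flows)  # permitted but never seen
    size_penalty = len(policies)

    return 3.0 * coverage - 0.5 * overreach - 0.1 * size_penalty

observed = {("web", "api"), ("api", "db")}

# Two precise low-level policies: full coverage, nothing extra.
precise = [{"covers": {("web", "api")}}, {"covers": {("api", "db")}}]
# One broad policy: full coverage, but it also permits an unobserved flow.
broad = [{"covers": {("web", "api"), ("api", "db"), ("web", "db")}}]

print(score_policy_set(precise, observed))  # higher: no overreach
print(score_policy_set(broad, observed))    # lower: allows unseen traffic
```

A Bayesian search over candidate policy sets would, in effect, explore the trade-off surface that a function like this defines.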
Building optimal policy sets through machine learning
The tricky part is creating the set of candidate policies from which we choose the final policy set. The candidate set needs both "low-level" policies (like "application X on host 1 can talk to application Y on host 2") and "higher-level" policies based on groups of hosts and groups of applications. If we see hosts acting alike in some meaningful way (e.g., regularly communicating with similar entities, or having similar identity profiles), we create a "host segment" from those hosts, and then protect them with an overarching segment policy. Likewise, if we see a group of applications that all act in similar ways, we create an "application collection" from those applications, and then build a segmentation policy around that collection. These abstractions provide the raw material for the higher-level candidate policies. Once we have identified that set, we discover the most "nearly optimal" subset of policies that we can, and we suggest those policies to the customer.
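One simple way to form host segments is to group hosts whose sets of communication peers look alike. The sketch below uses Jaccard similarity with a made-up threshold and made-up host data; it's a greedy stand-in for whatever clustering is actually used, meant only to show the idea:

```python
# Map each host to the set of peers it has been observed talking to.
# These hosts and peers are hypothetical.
peers = {
    "host1": {"db1", "cache1"},
    "host2": {"db1", "cache1"},
    "host3": {"db1", "cache1", "lb1"},
    "host4": {"mail1"},
}

def jaccard(a, b):
    """Similarity of two peer sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def segment_hosts(peers, threshold=0.6):
    """Greedily group hosts whose peer sets are similar enough."""
    segments = []  # each segment is a list of host names
    for host, peer_set in sorted(peers.items()):
        for segment in segments:
            # Join a segment if similar enough to its first member.
            if jaccard(peer_set, peers[segment[0]]) >= threshold:
                segment.append(host)
                break
        else:
            segments.append([host])  # no match: start a new segment
    return segments

print(segment_hosts(peers))  # host1-3 cluster together; host4 stands alone
```

Each resulting segment would then be a candidate for a single overarching segment policy, rather than many per-host policies.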
Of course, networks are always changing, so we can't generate policies once and never again. We regenerate policies constantly, using the data we've gathered about the environment and our machine learning, and we take the customer's own or adopted policies into account to ensure that the proposed policies won't conflict with the customer's actions.
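To make one kind of conflict check concrete, here is a minimal sketch that filters newly regenerated suggestions against flows the customer has explicitly forbidden. The policy representation (simple application pairs) is a deliberate simplification:

```python
# Hypothetical flows the customer has explicitly denied.
customer_denies = {("web", "db")}

# Freshly regenerated policy suggestions, as (source app, dest app) pairs.
proposed = [("web", "api"), ("web", "db"), ("api", "db")]

# Keep only suggestions that don't contradict the customer's own rules.
compatible = [flow for flow in proposed if flow not in customer_denies]
print(compatible)  # the denied web -> db suggestion is dropped
```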
This is how Edgewise applies machine learning: we discover and suggest an accurate, interpretable set of policies so our customers can gain insight into their networks, and use that insight to secure their applications and services easily, quickly, flexibly, and as transparently as possible.