There is no doubt that, as security professionals, we continually strive to communicate better with the organizations we support. Personally, I realized very early that communication is a skill that sets the effective security professional apart, whether it is through our use of words, selecting the right and relevant data, or visualizing that data. It is in the last part that I see a lot of opportunity to improve. Through this post, I attempt to share what I have learned over the past 5+ years of focusing on data visualization, but at the same time I'd love to invite discussion on this topic. Where have you seen failure? Where have you seen excellence? How can we do better?
I have structured this post in 3 major topics, each of which I think can be read separately. First, I will discuss metrics in general. Having a good grasp of what metrics are and are not determines whether data visualizations are effective and fit-for-purpose. Second, I delve into tooling. While there are many tools available that support data visualization efforts, not all of them are the same and not all of them are right for you. Understanding their limitations and how to apply them is the second pillar of your data visualization strategy. In the last section, I will discuss the most common pitfalls in creating data visualizations. It's as much a piece of critique as it is intended to be a checklist for aspiring "visualizators".
Metrics
Metrics, measurements, data models, big data, machine learning, artificial intelligence, ... I shouldn't be surprised that so many either try to run before they can walk or get stuck in the basics. Reality teaches us that not understanding the basics around metrics will result in failure no matter which path you choose. They're pretty darn important.
In essence, a metric is anything you can count. How many incident response tickets do we get? How many security analysts do we have? How many times do people call the helpdesk to get their password reset? Any raw cardinal number (remember this, it will come back) that tells us something about what our processes and infrastructure generate is a metric. The number of 10pm phone calls from that guy in finance after our pre-approved server reset was carried out? It's a metric. More importantly, it is a base metric: a metric that doesn't have any math applied to it. Compound metrics, on the other hand, are the result of applying a mathematical formula to one or more base metrics. Let's try this with a simple example...
There are general truths, like "there are 7 days in a week" and "there are 24 hours in a day". We might want to have more hours in a day but unless we move to another planet, that is not going to happen. So let's assume we take a weekly count of the number of events generated by our SIEM: 15234 for this week. That's your first base metric right there. We also know that over a 24-hour period there is an average of 7 security analysts present in the SOC. We can break it down like this:
Base Metrics
A. # of events : 15234
B. # of analysts : 7
Compound Metrics
1. Events per analyst, per week, formula : (A/B) -> 2176 (rounded down)
2. Events per analyst, per day, formula : (A/B)/7 -> 311 (rounded up)
That's pretty much the basic metrics identification process for you:
A. Identify base metrics
B. Define formulas for compound metrics
C. Calculate
D. Report
It doesn't get much harder than that, but if you deviate from this process or forget about the basics, your metrics program will end in utter disarray. That is just the way it is, and we haven't even started on visualization yet.
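To make the four steps above a bit more tangible, here is a minimal sketch in Python using the numbers from the example. The dictionary keys and the `report` helper are names I made up for illustration; how you store and publish your metrics will depend on your own tooling.

```python
# Step A: identify base metrics (raw counts, no math applied).
base = {"events_per_week": 15234, "analysts": 7}
DAYS_PER_WEEK = 7

# Step B: define compound metrics as formulas over the base metrics.
formulas = {
    "events_per_analyst_per_week":
        lambda m: m["events_per_week"] / m["analysts"],
    "events_per_analyst_per_day":
        lambda m: m["events_per_week"] / m["analysts"] / DAYS_PER_WEEK,
}

# Step C: calculate every compound metric from the same base metrics.
compound = {name: formula(base) for name, formula in formulas.items()}


# Step D: report base and compound metrics side by side.
def report(base_metrics, compound_metrics):
    lines = [f"{name}: {value}" for name, value in base_metrics.items()]
    lines += [f"{name}: {value:.1f}" for name, value in compound_metrics.items()]
    return "\n".join(lines)


print(report(base, compound))
```

Keeping the formulas separate from the base numbers means next week's report is just a matter of swapping in new counts.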
Let's move on.
Tooling
This is a very loaded topic, actually. For data visualization there is a lot out there. Let's separate the tools into categories.
Hardcore
If you have the inclination to get your hands dirty and really control all the nuts and bolts of your data visualization efforts (or you just love making your life hard), this is where you want to be.
Personally I've done a lot of data visualization work in Python. Using the matplotlib library you can get a lot done. There are other libraries like Seaborn, which builds on top of matplotlib, that provide similar functionality. In the end it comes down to how comfortable you are with coding. With this, you really start from scratch.
Then there is R (yes, just the letter). It's a language and environment focused on statistical computing and graphics. I'm still categorizing this as hardcore because again it takes a very code-centric approach. You can't get around having a good grasp of the code behind your models with R.
Jupyter Notebooks ... I only recently started toying around with this. It's an IPython-based framework that provides an extremely flexible environment to perform data analysis and visualization in. You can use various programming languages to interact with it and it facilitates the creation of interactive web-based dashboards. I think I like it but I need to do more with it.
No Warranty
There are a bunch of open source data visualization tools out there that, depending on your need and experience, might come in handy. You can explore the DAVIX live CD if you want to. It's probably a bit hit and miss but there is no doubt you can get some great stuff done here.
The Big Easy
Ok, ok. None of the tooling around data visualization is very easy. You still need to understand the mechanics behind the scenes, choose your visualizations wisely, and communicate them with the necessary context to your audience. However, the tools in this category provide at least a thin veneer of abstraction that allows you to focus on the data.
Tableau and Qlik are great data visualization suites that are used by all types of industries. If you see interactive visualizations on news websites, they were probably built in Tableau or Qlik. Super powers come at a price though. If your data visualization practice is nascent, you probably don't want to shell out the $$$ necessary for these suites.
One tool that I've been impressed by is Microsoft's Power BI. Geared towards business intelligence (all data visualization tools are, make no mistake about that), it's extremely powerful and makes it super easy to build graphics, publish dashboards, and massage data in all types of ways. The kicker? If you already have Office 365, its entry-level version comes at a mere $10/month per active user.
Just Don't
Excel, or any other spreadsheet application. Sure, you can create graphs but it just doesn't scale. If you're serious about data visualization, Excel is not your friend.
Common Pitfalls
Ha! This is the fun part, since I get to call out some of the things we are almost always doing wrong when it comes to data visualization. To get through this quicker, I'm going to link you to a presentation that I gave a while ago. There I have covered the "13 mistakes you are no longer allowed to make" as documented by Stephen Few, a data visualization and business intelligence guru. Read them, remember them, never make them again. I'll pick out a few of the key pitfalls that I have encountered and discuss them here.
Don't apply math to ordinal numbers
Omg! I can't stress this enough. Remember when I stressed that our base metrics are cardinal numbers? This is why. If you have 5 apples, and John has 5 apples, together you have 2 x 5 = 10 apples. That makes total sense. Now, say you run a race and you end up 5th. Then you run another race and you end up 5th again. 2 x 5 does not equal 10 in this case; you did not finish 10th in anything. We see it in risk management all the time. We rank the probability of a risk on a scale of 1 to 5, then we rank its impact on a similar scale, and suddenly 2 x 5 = 10 represents the associated risk. Do.Not.Do.That! We can argue all day about how you mitigate the inaccuracies by reversing the scale, or weighting the ranks for whatever your business needs are, but my answer is still no. Ordinal numbers are to be left alone. Do not add them together, do not multiply or divide them. Leave them alone.
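If you want to see what multiplying ranks actually does, here is a small Python sketch. It enumerates every probability/impact pair on a 1 to 5 scale and groups them by their product, then shows one alternative that only uses counting. The risk names and ratings in `register` are made up for illustration, and counting per cell is just one option I am assuming here, not the only valid one.

```python
from collections import Counter, defaultdict
from itertools import product

SCALE = range(1, 6)  # ordinal ranks 1..5

# Group every (probability, impact) pair by the "risk score" you would
# get from multiplying the two ranks.
pairs_by_score = defaultdict(list)
for probability, impact in product(SCALE, SCALE):
    pairs_by_score[probability * impact].append((probability, impact))

# 25 different rank combinations collapse into far fewer scores, and the
# score scale has gaps (there is no 7, 11, 13, ...).
print(len(pairs_by_score), "distinct scores for 25 combinations")
print("score 4 hides:", pairs_by_score[4])

# Alternative: keep the ranks as labels and only count how many risks
# fall in each cell. Counts are cardinal, so math on them is fine.
register = [  # hypothetical risks: (name, probability rank, impact rank)
    ("unpatched VPN appliance", 4, 5),
    ("shared admin password", 4, 5),
    ("stale firewall rule", 2, 2),
]
risks_per_cell = Counter((p, i) for _, p, i in register)
print(risks_per_cell)
```

A rare-but-severe risk and a frequent-but-minor one end up with the same score, which is exactly the information your audience needed you to keep apart.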
Think about time
This is a fun one and its associated errors are as subtle as they are grave. The time frame within which you can gather metrics is your primary guide on how you want to communicate them to your audiences. Imagine that you are looking to build metrics around your vulnerability management program. Total count of vulnerabilities makes a lot of sense as a base metric. So let's say we pick that one and we have to report it to the line managers. So we start. Month 1: 2459 ... Month 2: 2459 ... Month 3: 2459. Your audience is looking for a trend and they see that nothing is changing, so you have 2 possible outcomes:
(1) They lose interest in the metric and its effectiveness is lost. This is not what you wanted because you wanted their attention and support for your patching activities. It's a real shame!
(2) They question why they are assigning budget to you. Nothing is changing? What are you doing with your money? What is it worth?
What really happened here is that you are actually doing quarterly scans of your infrastructure, so the count can only change once every three months no matter how often you report it. Now that isn't a practice I would recommend but this post is long enough already. I'll save that for another day.
If the base metrics that you rely on come in at a certain frequency, your reporting of the metrics should never exceed that frequency. That's the gist of it.
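As a rough sketch of that rule in Python: the first function compares the two intervals, the second one flags reports that would go out without any new scan behind them. The function names and the calendar dates are hypothetical, chosen to mirror the quarterly scan and monthly report from the example.

```python
from datetime import date, timedelta


def reporting_too_frequent(collection_interval, reporting_interval):
    """True when reports go out more often than new data comes in."""
    return reporting_interval < collection_interval


def stale_reports(scan_dates, report_dates):
    """Return the report dates that had no new scan since the previous report."""
    stale = []
    previous_report = None
    for report in sorted(report_dates):
        has_new_scan = any(
            (previous_report is None or scan > previous_report) and scan <= report
            for scan in scan_dates
        )
        if not has_new_scan:
            stale.append(report)
        previous_report = report
    return stale


# Hypothetical calendar: quarterly scans, monthly reports.
scans = [date(2024, 1, 5), date(2024, 4, 5)]
reports = [date(2024, 1, 31), date(2024, 2, 29), date(2024, 3, 31), date(2024, 4, 30)]

print(reporting_too_frequent(timedelta(days=90), timedelta(days=30)))
print(stale_reports(scans, reports))
```

The February and March reports come back as stale here, which is the "2459 ... 2459 ... 2459" problem in code form.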
Keep it simple
Once you start toying around with a variety of visualization tools, you will see that it is very easy to get lost in the different types of graphs you can use. The colors and effects you can apply! So pretty! No. Don't!
Visualization is a language. You are communicating in a non-verbal way to an audience you want to influence. Visualizations are, when done right, highly dense with information so you don't want to throw an avalanche of unstructured information in your audience's face.
- Don't use 3D, ever. It is unnecessary and often skews information.
- Feel free to use raw numbers. Executives are good with numbers. If you can avoid a chart by just putting a number on the dashboard, do it.
- Choose grey scale over different colors. If your chart looks like someone went wild with a box of Crayolas, think hard about what you want to communicate. Maybe you want to change the bars that are less relevant for your message to grey, while the bars that matter remain in color to make them pop? Consider this before sending out another graph that will only raise more questions.
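The grey-versus-color idea from the last bullet could look something like this in matplotlib. This is only a sketch: `bar_colors` and `plot_highlighted` are helper names I made up, and the axis label and default output file name are placeholders.

```python
def bar_colors(labels, highlight, accent="tab:red", muted="lightgrey"):
    """Return one color per bar: the accent for highlighted labels, grey otherwise."""
    highlight = set(highlight)
    return [accent if label in highlight else muted for label in labels]


def plot_highlighted(labels, values, highlight, path="highlight.png"):
    """Draw a flat 2D bar chart where only the highlighted bars carry color."""
    # Imported here so bar_colors stays usable without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render straight to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(labels, values, color=bar_colors(labels, highlight))
    ax.set_ylabel("Open vulnerabilities")
    # Remove the top and right borders; they carry no information.
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    fig.savefig(path)
    plt.close(fig)
```

With some made-up numbers, `plot_highlighted(["Finance", "HR", "Engineering", "Sales"], [12, 7, 31, 9], highlight=["Engineering"])` would write a chart where only the Engineering bar is in color and everything else fades into the background.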
Consider your audience
One of the worst things you can do is take the dashboard that the security team uses and send it to your board or IT Operations team. Dashboards are highly specific, so do spend some time with your audiences to understand what matters to them. After that you can develop dashboards specifically tailored to their needs and enter into a meaningful and engaging conversation about security with all stakeholders throughout your organization. Before you talk, listen.
Wrapping it up
Phew ... Here I was, thinking I was gonna whip up a quick post about data visualization. That became a little longer than expected, didn't it? I hope that there are a few useful nuggets in there that help people get better at communicating data through visualization. While you could always choose to explore this subject with someone experienced like myself (shameless plug, I know), I am confident that you will rock it just as well on your own. Why don't we continue the conversation? Let's hear your lessons about metrics, data visualization, etc. What are the pitfalls you have encountered?