Detection Engineering Metric Scoring Framework Pt. 2

Matthew OBrien
15 min read · Apr 28, 2024


While it was not my intention to turn this into a series, my previous article Detection Engineering Metric Scoring Framework was fundamentally incomplete, and it requires some revision to reflect a few key changes and updates I made to the framework. There seemed to be a good amount of interest in the original article and DE metrics in general, so I didn’t want to leave anyone with an incomplete solution.

Why The Changes?

I had finally gotten my metrics framework all squared away. I had come up with a formula for the “Magic Number”, gotten my documentation in order, and found satisfactory answers to every edge case I could think of. I was discussing the framework with my boss, and he asked a question that threw a massive monkey wrench into all my carefully laid plans. It was something along the lines of:

“How does the framework account for and score detections created by our security tools?”

It was an issue I was aware of, but I couldn’t give an answer because I had become so hyper-focused on custom detections that I had pretty much forgotten to consider detections created by our various security tools. The reason this broke things is that the previous iteration of the framework made assumptions that don’t necessarily apply to non-custom detections, and those non-custom detections had properties that the old framework couldn’t accommodate. So, I had to go back to the drawing board in a big way.

After he asked the question, my boss and I began to brainstorm potential solutions. He suggested a third level of metrics: the procedure level. NIST defines a procedure as “a lower-level, highly detailed description of the behavior in the context of a technique”. This level would be in addition to the metrics that exist at the technique and detection levels. My task became figuring out which metrics should be moved to the procedure level, determining how that shift changes what those metrics are trying to define, and figuring out what effect all this would have on the formula for the Magic Number.

Technique Coverage

Technique Coverage still exists at the technique level. The formula for it is still the same as well, but our interpretation of the metric has changed slightly. The description is as follows: “Technique Coverage measures the percentage of procedures encompassing a technique that are able to be detected by existing detections. It is a validation of whether or not functional coverage exists for each procedure in a technique. A detection cannot merely exist to be considered as providing coverage for a procedure; it must also work correctly.” So in comparison to our previous understanding of Technique Coverage, “exists” now implies that the detection is functional too.

If the procedure cannot be validated for any reason (usually because the detection or test cases fails or can’t execute for some reason), no coverage is assumed.
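To make the calculation concrete, here is a minimal sketch in Python of how Technique Coverage could be computed from per-procedure validation results. The function name and data shape are my own illustration, not part of any tooling described here.

def technique_coverage(validation_results):
    """
    Compute Technique Coverage for one technique.

    validation_results: list of per-procedure Validation Scores, where each
    entry is 100 (pass), 0 (fail), or None (N/A, could not be validated).
    A procedure only counts as covered if a detection exists AND it worked,
    i.e. its Validation Score is 100. N/A procedures are assumed uncovered.
    """
    total = len(validation_results)
    if total == 0:
        return 0.0
    covered = sum(1 for score in validation_results if score == 100)
    return covered / total * 100

# Example: ten procedures, six validated successfully -> 60% coverage
print(technique_coverage([100, 100, 100, 100, 100, 100, 0, 0, None, None]))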

A Brief Word On Operating Systems

I’m still trying to work out how to handle the issue of a technique being applicable to more than one operating system. If you have a technique that applies to both Windows and Linux, but you only have coverage for Linux, that would obviously impact your Technique Coverage score. My current plan assumes that if a procedure is applicable to multiple operating systems, that procedure will be split off into multiple operating-system-specific procedures so that the ratio of procedures to test cases remains one to one. There’s no reason it has to be one to one other than I think it’s cleaner and simpler that way. An alternative plan would be to give the procedure a separate test case for each operating system. The difference comes down to how specifically you want to define what a procedure is, the former suggestion being more specific (since each procedure is specific to an OS) than the latter.

I can’t think of any other scenarios off the top of my head where a procedure could have multiple test cases. In my opinion, if that were the case, then the procedure likely needs to be more narrowly defined and split out into several procedures. However, I don’t want to completely discount the possibility that a situation could arise where it would be necessary to have multiple test cases for a single procedure.

I toyed with the idea of separating the procedures of a technique by operating system, coming up with the Technique Coverage score for each group, and then applying a weight to them. The weight would be the share that each operating system makes up of all the operating systems currently running in our environment. For example, let’s say we had 80% Windows servers and 20% Linux, and there was a technique that applies to both with ten procedures. Six are for Windows and four are for Linux. Five of the Windows procedures have coverage while only one of the Linux procedures does. Normally, this would result in 60% Technique Coverage:

(5+1)/10 * 100 = 60%

Instead, with the scheme I just mentioned we could do:
((.8 * 5/6) + (.2 * 1/4)) * 100 ≈ 72%
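As a quick illustration of the difference between the two approaches, here is a small sketch (my own, purely to reproduce the arithmetic above) computing both the plain and the OS-weighted Technique Coverage:

# Example counts from the scenario above
windows_covered, windows_total = 5, 6
linux_covered, linux_total = 1, 4

# Environment share of each OS (assumed 80/20 split of servers)
windows_weight, linux_weight = 0.8, 0.2

# Plain Technique Coverage: covered procedures over all procedures
plain = (windows_covered + linux_covered) / (windows_total + linux_total) * 100

# OS-weighted variant: per-OS coverage scaled by that OS's share of the environment
weighted = (windows_weight * windows_covered / windows_total
            + linux_weight * linux_covered / linux_total) * 100

print(f"Plain:    {plain:.0f}%")     # 60%
print(f"Weighted: {weighted:.0f}%")  # ~72%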

Initially, I liked this solution better. However, I don’t think I’m going to go this route because it fundamentally violates our definition of Technique Coverage by introducing the weighting. I feel like it starts to move away from answering “what coverage do we have for this technique?”. Trying too hard to tailor Technique Coverage to our environment ends up diluting its broader meaning. For example, imagine a scenario where there is a technique that has many Linux procedures but only a few Windows ones. If we only have coverage for Windows, then we would have low Technique Coverage. However, assuming that same 80/20 split of Windows and Linux servers, that small amount of Windows coverage would become greatly inflated when the weight is applied to it. This could give the wrong impression that there is more coverage for a technique than there actually is. Perhaps this could be a separate metric in its own right. Maybe call it OS Coverage. I’d have to think about what its definition would be and how it differs from Technique Coverage.

Data Quality

Data Quality is still at the detection level, and it has changed very little. I added that there will not be a Data Quality score for detections that aren’t custom made, such as those thrown by a security tool. This is mainly due to lack of insight into how the tool functions under the hood.

Sophistication

Sophistication is still at the detection level, and it has changed very little. I added that there will not be a Sophistication score for detections that aren’t custom made, such as those thrown by a security tool. This is mainly due to lack of insight into how the tool functions under the hood.

Risk Score

Risk Score hasn’t changed all that much either. It is still at the detection level. However, it can be difficult to calculate a Risk Score for procedures that are detected by non-custom detections. This is mainly due to a lack of visibility into the detection logic employed by our security tools. To get around this, we decided that the Risk, Impact, and Confidence Scores for these procedures (ones that are detected by non-custom detections) should correspond to the severity that the security tool originally assigned to the detection. The scale will be: Informational = 5, Low = 25, Medium = 50, High = 75, Critical = 100. If there are multiple detections, the highest severity takes precedence. If there are no detections, be they custom or tool based, for a procedure, then it should not have a Risk/Impact/Confidence Score assigned to it. The procedure would simply be assigned N/A for those metrics. The reason being, if there is no detection for a procedure, then I don’t really care what its Risk/Impact/Confidence Score is. The point becomes moot.
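Here is a minimal sketch of that mapping, assuming severities arrive as simple strings from the tool (the function and dictionary names are my own, not from any particular product):

# Severity-to-score mapping for tool-generated (non-custom) detections
SEVERITY_SCORES = {
    "informational": 5,
    "low": 25,
    "medium": 50,
    "high": 75,
    "critical": 100,
}

def tool_detection_score(severities):
    """
    Return the Risk/Impact/Confidence Score for a procedure covered only by
    tool-based detections. With multiple detections, the highest severity
    wins; with no detections at all, the metrics are N/A (None here).
    """
    if not severities:
        return None  # N/A - no detection, nothing to score
    return max(SEVERITY_SCORES[s.lower()] for s in severities)

print(tool_detection_score(["Low", "High"]))  # 75
print(tool_detection_score([]))               # None (N/A)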

The values for Risk, Impact, and Confidence Score for a procedure do not increase or decrease based on the outcome of validation testing for that procedure. However, whether or not the Confidence Score for that procedure is counted in the final averaged Confidence Score for a technique does depend on the outcome; that is described more below.

Impact

There have been no changes to Impact. It is still at the detection level. We do, however, have a new additional place to store the score: it is now kept as an annotation on the detection in our SIEM, similar to Data Quality or Sophistication. To be clear, this only applies to custom detections. Non-custom, tool-based detections aren’t created in our SIEM, so they don’t have a place to store Impact. That is why Risk/Impact/Confidence Score are all equal to the severity of the detection the tool created. Confidence Score will be stored separately in a lookup table, but that is described more below.

Confidence

Confidence hasn’t undergone many changes per se, but a lot of thought has been put towards it. The definition of Confidence is the same, and it is still technically at the detection level. Since Confidence Score feeds into the Magic Number, each procedure for a technique needs to have its Confidence Score considered. In that sense, it exists at the procedure level. It isn’t formally placed there for several reasons.

First, Confidence Score is part of the Risk Score, which is at the detection level. To have Risk and Confidence at different levels while one exists within the other is messy and logically incoherent. Second, a procedure could have multiple detections, each with a different level of confidence. Therefore, it wouldn’t make sense to have Confidence Score at the procedure level, where each procedure would only be able to have a single Confidence Score. Confidence fundamentally describes a quality of our detections; it just so happens we need to know it for each procedure, which brings us to our next issue.

As Confidence is at the detection level, what happens if there is no detection, or the detection was non-custom? The latter issue we’ve already answered. If a procedure doesn’t have a custom detection made for it, but a security tool catches the test case for it, then the Confidence Score would correspond to the severity of the detection thrown by the tool: Informational = 5, Low = 25, Medium = 50, High = 75, Critical = 100. If there are multiple detections, the highest severity takes precedence. As for the former issue (which was also sort of already answered), if neither a security tool nor a custom detection catches the test case for the procedure (either they don’t work or aren’t in place), it would get a Confidence Score of N/A. This would mean it would not be counted in the final Confidence Score. The reason being that if there is no detection, then there is nothing to score in regards to confidence.

The final Confidence Score is determined by taking the average of all of the Confidence Scores for the technique. If two or more procedures in a technique are covered by the same detection, and consequently have the same Confidence Score, they would still each be included in the average. Procedures with a value of N/A would simply not be included when calculating the average. In other words, the average that produces the final Confidence Score only includes the Confidence Scores for procedures that have existing detections which work. To reiterate a previous point, if multiple detections trigger for a procedure, there might be a group of multiple Confidence Scores to consider. The highest Confidence Score in the group should be used for the procedure.
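Here is a small sketch of that aggregation, assuming each procedure carries a list of Confidence Scores from whatever detections fired for it (the names and data shape are my own illustration):

def technique_confidence(per_procedure_scores):
    """
    per_procedure_scores: one entry per procedure, where each entry is a list
    of Confidence Scores from the detections that fired for that procedure,
    or an empty list / None if nothing fired (N/A).

    Per procedure, the highest Confidence Score wins; N/A procedures are
    excluded from the final average entirely.
    """
    effective = [max(scores) for scores in per_procedure_scores if scores]
    if not effective:
        return None  # no working detections -> no technique Confidence Score
    return sum(effective) / len(effective)

# Three procedures: one with two detections (75 wins), one with a single
# custom detection scored 60, one with no detections (N/A, excluded).
print(technique_confidence([[50, 75], [60], []]))  # 67.5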

Since each procedure needs a Confidence Score, I thought it best to include them in the lookup table stored in our SIEM that tracks Validation Score. This is a good segue into the changes made to Validation Score.

Validation Score

Validation Score is now at the procedure level. This means the metric represents a measurable aspect that is unique to one of the procedures for a technique in the MITRE ATT&CK Framework. I like to think of it now as whether or not a procedure or test case (they are essentially synonymous given that currently each procedure has only one test case) “passes”. That is, was it caught by one of our custom or security-tool-based detections? The definition of the metric hasn’t really changed, but like Technique Coverage, our interpretation of it has shifted some. The reason both shifted is that they are closely connected. This connection was pointed out to me when my boss asked: “Is having Validation Score at the procedure level essentially telling us the same thing that Technique Coverage is?” After some back and forth, I realized that they are effectively the same thing. Technique Coverage has the old Validation Score baked into it now. The new Validation Score is more specific and granular, as it is calculated for all procedures regardless of whether or not a detection caught the activity produced by their test cases.

Could we completely get rid of Validation Score, as it currently stands, in favor of just using Technique Coverage? Probably. But if you did that, I think some “grey area” would remain that Technique Coverage wouldn’t quite cover. I also feel like it is easier to deal with Technique Coverage once Validation Score is calculated (sort of akin to “showing your math” so others can see how you arrived at a solution), and that it’s just overall cleaner and more convenient to keep it around.

If a procedure can be validated, the score will likely be either 0% or 100%. The reason being that the outcome when evaluating a detection for the test case can either be pass (100%) or fail (0%). There is the potential for a partial pass/fail and therefore a value between 0% and 100%, but this will have to be clearly justified.

A scenario could arise where a procedure has multiple test cases. If a procedure had two test cases and one passed and one failed, it would have a Validation Score of 50%. As I mentioned above, I’m not sure how commonly this could occur, but the possibility should be considered. It would also create an issue for how to deal with Confidence Score. Each procedure should have one, but if it has multiple test cases, then it could have multiple Confidence Scores, one for each of those test cases. Picking the highest one would be most in line with our current scheme. But that assumes all the test cases are relatively related (and they should be, otherwise why are they grouped together under the same procedure?), and that the query logic for the detection with the highest severity is in some way representative of the logic (which the Confidence Score is a direct consequence of) required to detect the other test cases. Averaging them would be a potential alternative. If you go that route, and one or more test cases fail for the procedure, you have to decide whether or not they would count towards the average. The way we have things now, they would get a score of N/A and not count. But everything described in this paragraph is hypothetical at this point.

However, if, for example, a procedure had a custom detection that failed but a security tool caught it, the Validation Score would not be 50% (one catches it but the other doesn’t) but rather 100%. In other words, only one detection method has to succeed in order for the procedure to pass. Although, a case can be made for some sort of metric that measures defense in depth, which is sort of what I think that 50% would represent. That is, a metric that answers: can more than one detection catch this activity? But that is a metric that requires a fundamentally more mature DE program and is therefore a little outside the scope of our framework at the moment. If a detection spans multiple procedures, the Validation Score would be determined, per procedure, by whether or not the detection successfully caught the activity produced by that procedure’s test case.

If a procedure did not have a custom detection made for it and a tool did not pick up the activity from the test case, it would receive a score of N/A. If a procedure cannot be validated because the test cases can’t be run, the procedure would get a score of N/A. The Validation Score primarily exists to inform the analyst of whether or not the procedure should be “included” in the Technique Coverage metric. Procedures that failed wouldn’t be counted as having coverage.
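Pulling the rules above together, here is a rough sketch of how the Validation Score for a single procedure could be derived (the function and flag names are mine, and the partial pass/fail case is omitted for simplicity):

def validation_score(test_ran, custom_detection_exists,
                     custom_detection_caught, tool_caught):
    """
    Derive the Validation Score for a single procedure.

    Returns 100 if any detection (custom or tool) caught the test case,
    0 if a custom detection exists but nothing caught the activity, and
    None (N/A) if the test case could not be run or no detection of any
    kind is in place.
    """
    if not test_ran:
        return None  # N/A - procedure could not be validated
    if custom_detection_caught or tool_caught:
        return 100   # pass - at least one detection method worked
    if custom_detection_exists:
        return 0     # fail - a detection exists but did not fire
    return None      # N/A - no custom detection and nothing from the tools

# Custom detection failed but a security tool caught the activity -> 100, not 50
print(validation_score(True, True, False, True))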

As mentioned above, Confidence and Validation Score will be stored in the same lookup table. The reason is that each procedure needs those two metrics. They represent the bare minimum needed to calculate the Magic Number for a technique.

The Magic Number

Previously, we decided that the Magic Number was going to be made up of Technique Coverage, Confidence Score, and Validation Score. However, Validation Score, based on our new understanding of it, is actually used to come up with Technique Coverage. So, there is no need to include it a second time in the Magic Number. Therefore, the Magic Number is just made up of Technique Coverage and Confidence Score now. As a reminder, the Confidence Score is an average of all of the Confidence Scores for the procedures of a technique, at least, all the ones that weren’t N/A.

One big question last time was what the formula was going to be, and how we could come up with one. I had a few initial ideas, like the one mentioned in the previous article. I even tried explaining the situation to ChatGPT to see what answers it would give. It was the first time I had used it, and I was initially skeptical, but it gave some surprisingly good answers. Ultimately, it came to the same conclusions I had, which was validating.

One exercise I found very useful was to come up with a set of hypothetical metrics for a technique. Then I would try to figure out what I thought the Magic Number should be based on my “gut reaction”. After all, the Magic Number is just supposed to answer how well we can detect a certain technique. So, based on the made-up metrics, you can sort of ballpark what coverage you think you might have. From there, I tried to reverse engineer the thought process I used to arrive at my Magic Number. I would think about: how am I combining the metrics I have, and what are their effects on each other? I ended up with around 10 or so formulas. So, I would plug my fake metrics into each of them, see what they spit out, and compare to the number I had thought of to see which one was closest. That whole process could be a post in and of itself, but let’s skip to the good part. Here is what I came up with (assume all values are in their decimal form, i.e. 70% would be .7):

(Technique Coverage - ((Technique Coverage * Weight) * (1 - Confidence Score))) * 100

The weight for Confidence Score is 30%, or .30. The decision on what the weight should be is very subjective and ultimately up to the organization. In my head, when considering the effect Confidence Score would have on Technique Coverage when trying to divine the Magic Number from a set of metrics, I could never really see Confidence Score dinging Technique Coverage by more than 20 or 30 points. As a team, after some discussion, we agreed that 30% seemed like a good starting point.

Basically, we take Confidence Score, apply a weight to it, and then subtract the result from Technique Coverage. If you think about it, Technique Coverage represents the basis of the coverage you have in place for a technique. But the detections that make up that coverage aren’t perfect. How accurate they are, which is basically what Confidence tries to measure, must be taken into account as well.

Originally we tried:

(Technique Coverage - (Weight * Confidence Score)) * 100

But a high Confidence Score would result in a larger value when multiplied by the weight than a lower Confidence Score would. As a result, a higher Confidence Score would decrease Technique Coverage, and consequently the Magic Number, more. This is the opposite of what we want to happen, because a high Confidence Score should have a smaller impact on the score than a low Confidence Score would. So we tried subtracting Confidence Score from 1:

(Technique Coverage - (Weight * (1 - Confidence Score))) * 100

However, if the Technique Coverage is less than the weight (30%), it could result in a negative Magic Number. This is because the product of the weight and one minus the Confidence Score could be greater than the Technique Coverage if the Confidence Score was low enough. To get around this, I multiplied the weight by the Technique Coverage. In this way, one minus the Confidence Score is multiplied by 30% of Technique Coverage, instead of a flat 30%. So as Technique Coverage decreases, so does the 30% share of it that the weight represents. Another way of putting it would be that the “damage” that Confidence Score can do to Technique Coverage is proportional to Technique Coverage’s value.
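To tie the derivation together, here is a small sketch (my own illustration, using the agreed 30% weight) that computes the final Magic Number alongside the two rejected variants, so the problems described above are easy to see:

WEIGHT = 0.30  # agreed-upon weight for Confidence Score

def magic_number(tc, cs, weight=WEIGHT):
    """Final formula: the Confidence penalty is scaled by Technique Coverage,
    so it can never push the result below zero."""
    return (tc - (tc * weight) * (1 - cs)) * 100

def rejected_v1(tc, cs, weight=WEIGHT):
    """First attempt: penalizes a HIGH Confidence Score more, which is backwards."""
    return (tc - weight * cs) * 100

def rejected_v2(tc, cs, weight=WEIGHT):
    """Second attempt: can go negative when Technique Coverage is below the weight."""
    return (tc - weight * (1 - cs)) * 100

tc, cs = 0.20, 0.10  # low coverage, low confidence
print(round(magic_number(tc, cs), 1))  # 14.6 -> penalty proportional to coverage
print(round(rejected_v2(tc, cs), 1))   # -7.0 -> negative Magic Number

# rejected_v1 scores the 0.9-confidence case lower than the 0.1 case:
print(round(rejected_v1(0.8, 0.9), 1), round(rejected_v1(0.8, 0.1), 1))  # 53.0 77.0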

Conclusion

After all the changes mentioned in this article, we have finally been able to put the framework into practice scoring techniques. However, I think there is still room for improvement. Specifically, I think the definitions of some of the metrics can be tightened up and better explained so that they are easier to use. I also think an emphasis needs to be placed on keeping the framework flexible. That way, it can accommodate all of the edge cases that I couldn’t think of, as well as the ever-changing needs of our DE program as it matures. Maybe in a future post I can provide a concrete example of how a technique is scored from beginning to end using the framework to better illustrate the metrics in action.
