\documentclass[10pt, titlepage, twocolumn]{article}
\usepackage{titlesec}
\usepackage{fancyhdr}
\usepackage{lipsum}
\usepackage{tikz}
\usepackage{pgfplots}
\usepackage{siunitx}
\usepackage{pgfplotstable}
\usepackage{graphicx}
\usepackage{mathtools}
\usepackage[printwatermark]{xwatermark}
\usepackage{xcolor}
\usepackage[labelfont=bf]{caption}
\usepackage[T1]{fontenc}
\usepackage{blindtext}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{algpseudocode,algorithm,algorithmicx}
\usepackage{fancyvrb}
\usepackage{geometry}
\geometry{a4paper, portrait, margin=1.5in}
\usepackage[utf8]{inputenc}
\newtheorem{theorem}{Theorem}
\usepackage[hidelinks]{hyperref}
\DeclarePairedDelimiter\ceil{\lceil}{\rceil}
\DeclarePairedDelimiter\floor{\lfloor}{\rfloor}
\graphicspath{ {/} }
\titleformat{\section}
{\normalfont\Large\bfseries}{\thesection}{1em}{}[{\titlerule[0.8pt]}]
\titlespacing\section{0pt}{12pt plus 4pt minus 2pt}{10pt plus 2pt minus 2pt}
\setlength{\parskip}{0.8ex}
\setlength{\parindent}{2em}
\setlength{\textheight}{225mm}
\setlength{\topmargin}{-8mm}
\newcommand{\ii}{\indent \indent}
\date{}
\pagestyle{fancy}
\fancyhf{}
\fancyhead[C]{OpenUBA: A SIEM-agnostic, Open Source Framework for Modeling User Behavior}
\fancyfoot[C]{\thepage}
\renewcommand{\headrulewidth}{0.4pt}
\renewcommand{\footrulewidth}{0.4pt}
\makeatletter
\renewcommand\thesection{}
\renewcommand\thesubsection{\@arabic\c@section.\@arabic\c@subsection}
\makeatother
\begin{document}
\begin{titlepage}
\protect\parbox{.9\textwidth}
{\protect\centering
\huge OpenUBA: A SIEM-agnostic, Open Source Framework for Modeling User Behavior
}
\begin{center}
{\LARGE
\today}
\end{center}
\vfill
{
\Large
\begin{center}
\texttt{Jovonni L. Pharr} \\
\texttt{[email protected]} \\
\texttt{Atlanta, GA, USA} \\
\texttt{[v0.0.1]} \\
\end{center}
}
\vfill
\section*{Abstract}
\noindent
This project describes a system for analytic modeling of user \& entity behavior, built on tools from the scientific computing community. We demonstrate the conditions under which model invocation takes place, the handling of model results, and the components that constitute a model. Version control is described for models that change over time, and feedback is described for both rules and models, including the different scenarios in which each accepts feedback. Efficient system storage is demonstrated for anomalies and events of interest, and various risk calculation approaches are described. The system takes advantage of Turing completeness for model development, rather than limiting development freedom. Together, the components described construct a UBA system designed to be extensible, powerful, and open.
\end{titlepage}
%%%%%%%%%%%%%%%%%%
\section{Motivation}
Several security monitoring tools focus on providing user and entity behavior monitoring capabilities. UBA tools often attempt to replace a SIEM solution, but the two should work together: a Security Operations Center (SOC) should not be forced to choose between a SIEM solution and a UBA solution. Some SIEM vendors are releasing their own UBA features; however, there is a difference between a flexible, expressive UBA tool and a UBA feature added to a SIEM.
\subsection{Problems}
\hspace*{15pt}
Many UBA platforms take a ``black box'' approach to modeling. The very concept of modeling varies across UBA tools, despite being concretely defined in mathematics. UBA vendors often claim their models are intellectual property, which is an understandable business decision. However, the ramifications of proprietary models include vendor lock-in, decreased model developer-friendliness, and hindered community growth \& participation.
Also, some SOCs must use multiple tools to gain the benefits of UBA, which can be expensive depending on the vendor. OpenUBA \cite{ouba} aims to solve these problems by providing:
\begin{enumerate}
\item An open, ``white box'' model standard
\item SIEM-agnostic
\item Turing-complete, expressive modeling
\item Model feedback from cases
\item Model versioning
\item Lightweight architecture
\item Alerting
\end{enumerate}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Modeling}
UBA tools share an overarching concept of a model
\( \boldsymbol{ \mathcal{M} } \). A model can be an arbitrary set of commands which change the state of the system. Models may execute both stochastic and deterministic sets of computations. Given the depth and breadth of security use cases, model development in any modern UBA tool should be Turing complete and fully expressive. Existing UBA tools attempt to offer a flexible model development practice, but because the result typically falls short of Turing completeness, model development is inherently limited, however expressive it may be.
\subsection{Mapping Mitre Att\&ck}
\hspace*{15pt}
Mapping to the Mitre Att\&ck matrix is important for SOCs to monitor their own performance with respect to coverage and security posture. Models from the model library are mapped to Mitre Att\&ck techniques and tactics. This enables coverage with respect to Mitre Att\&ck to be calculated across the model population. Such mapping is becoming common practice for security tools, as Mitre maintains an industry-leading position on standardizing attack vectors and surfaces so that organizations can use a ``common language'' -- the mapping itself is arbitrary.
\subsection{Defining models}
We define an abstraction of a model,
\begin{equation}
\boldsymbol{\mathcal{M}} = \{(\boldsymbol{m}, \lambda, s) \, \vert \,\,S_{ \texttt{min} } \leq s \leq S_{ \texttt{max} } \}
\end{equation}
Where \(\boldsymbol{m}\) is a Mitre map, \(\lambda\) is the Merkle root hash of the model component tree, \(s\) is the contributing risk score for the model, and \(S\) is the system's global risk score limits. Each model has a relation to a \(m_{\texttt{technique}}\), and a \(m_{\texttt{tactic}}\).
Models run in model groups, which may contain \(\geq 1\) model. We define the model execution function,
\begin{equation}
\Xi = \texttt{execute}( \boldsymbol{\mathcal{M}} )
\end{equation}
which maps the Cartesian product of the set of models and the set of datasets in the system to a result:
\begin{equation}
\Xi : \boldsymbol{\mathcal{M}} \times D \rightarrow r
\end{equation}
Where \(r\) is a model result, \(D\) is the set of data in the system, and \(d \in D\) is the subset of data used by the \(\texttt{execute}\) function. A model result can be a set of relations between a user \(U\) and \(s\), given that \(s\) is between \(S_{ \texttt{min} }\) and \(S_{ \texttt{max} }\). Such a relation can be ``\(U\) has a new risk score of \(s\)'', ``\(U\) has their risk score increased/decreased by \(s\)'', or another arbitrary risk score calculation.
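As a minimal illustration, consider the following Python sketch of this abstraction. The class, field, and function names are hypothetical, not part of any OpenUBA API; they simply mirror the tuple \((\boldsymbol{m}, \lambda, s)\) and the mapping \(\Xi\) defined above.
\begin{verbatim}
# Hypothetical sketch of the model tuple
# (m, lambda, s) and the mapping Xi.
from dataclasses import dataclass
from typing import Callable, Dict, List

S_MIN, S_MAX = 0.0, 100.0  # global limits S

@dataclass
class Model:
    mitre_map: Dict[str, str]  # m
    merkle_root: str           # lambda
    score: float               # s
    logic: Callable[[List[dict]],
                    Dict[str, float]]

    def __post_init__(self):
        # enforce S_min <= s <= S_max
        assert S_MIN <= self.score <= S_MAX

def execute(model: Model,
            dataset: List[dict]):
    """Xi: (model, dataset) -> result r."""
    return model.logic(dataset)
\end{verbatim}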
%%%%%%%%%%%%%%%%%%
\subsection{Model Library}
We make use of a model registry, from which models are installed for use in the system. A model registry can be arbitrary, but it must be trusted by the user, even though models are verified before installation and use.
\subsubsection{Components}
Models are constructed from \(\geq 1\) component. At a high level, components are files. Model library entries contain the file hash and encoded-data hash of each component.
\begin{equation}
\{ (c_{type}, f, d) \, : \, c \ne \emptyset, \forall c \in C \}
\end{equation}
Where \(f\) is the file, and \(d\) is the encoded data. A model component contains a set of hashes,
\begin{equation}
\boldsymbol{H} := \{ \texttt{H}(f), \texttt{H}(d) \, \vert \, f \ne \emptyset , d \ne \emptyset \}
\end{equation}
where \(\texttt{H}\) is a hash function, and \(f\), and \(d\) must be present. Together, these form the model component.
\begin{equation}
c = \{ (c_{type}, \boldsymbol{H} ) \}
\end{equation}
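As a sketch, the hash set \(\boldsymbol{H}\) for a component might be computed as follows, assuming SHA-256 as the hash function \(\texttt{H}\); the function name and dictionary keys are illustrative only.
\begin{verbatim}
# Sketch: compute H(f) and H(d) for one
# component. SHA-256 is assumed for H.
import hashlib

def component_hashes(path, encoded_data):
    with open(path, "rb") as fh:
        file_bytes = fh.read()
    # f and d must both be non-empty
    assert file_bytes and encoded_data
    return {
      "H(f)":
        hashlib.sha256(file_bytes).hexdigest(),
      "H(d)":
        hashlib.sha256(encoded_data).hexdigest(),
    }
\end{verbatim}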
\subsection{Common Modules}
For a given model, the model author may otherwise be required to rewrite a basic preprocessing step, which can differ depending on how the model is used. For this reason, we extend common model modules to be used by models. For example, if a model needs to parse incoming data a certain way, the logic can be encapsulated inside an external module and invoked by the model, rather than written within the model itself, as sketched below.
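A minimal sketch of such a shared module, with hypothetical module and function names:
\begin{verbatim}
# common/parsing.py -- hypothetical
# shared preprocessing module.
import json

def parse_lines(lines):
    """Parse raw JSON log lines,
    skipping malformed rows."""
    records = []
    for line in lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return records

# Inside a model, invoke, don't rewrite:
# from common.parsing import parse_lines
# records = parse_lines(raw_lines)
\end{verbatim}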
%%%%%%%%%%%%%%%%%%
\subsubsection{Verification}
Model verification takes one of two forms: the system either verifies the contents from the model library before installation, or verifies the files the model uses after installation, prior to each execution of the model.
Both verification functions, verifying the encoded data of the model
\begin{equation}
\text{V}_d : C _{data} \rightarrow \{0,1\}
\end{equation}
and verifying the files the model use
\begin{equation}
\text{V}_f : C_{file} \rightarrow \{0,1\}
\end{equation}
map an element of the component to a truth value. Overall verification is the logical conjunction of both verification types, represented by the boolean expression
\begin{equation}
\texttt{V} := \text{V}_d \wedge \text{V}_f
\end{equation}
We represent the model invocation function as
\begin{equation}
\Phi = \texttt{invoke}(\dots)
\end{equation}
\begin{equation}
\Phi( \Gamma, \boldsymbol{\mathcal{M}} ) =
\begin{cases}
\texttt{install}( \boldsymbol{\mathcal{M}} ),
& \Gamma = d\\
\Xi ( \boldsymbol{\mathcal{M}} ),
& \Gamma = f\\
\texttt{Err},
& \text{otherwise}
\end{cases}
\end{equation}
Using the data and file verifications, we construct the verification condition to be satisfied before model invocation. Invocation occurs only if \(\texttt{V}\) holds:
\begin{equation}
V( \boldsymbol{\mathcal{M}} ) =
\begin{cases}
\Phi( \Gamma, \boldsymbol{\mathcal{M}} ),
& \text{if } \texttt{V}\\
\texttt{Err},
& \text{otherwise}
\end{cases}
\end{equation}
In addition to installation verification, the same process is executed upon each model execution. This ensures the integrity of a model is checked prior to running it.
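A minimal sketch of this guard in Python, assuming SHA-256 hashes and trusted reference values taken from the model library entry; all names are illustrative:
\begin{verbatim}
# Sketch: V = V_d AND V_f gates invocation.
import hashlib

def H(data):
    return hashlib.sha256(data).hexdigest()

def verify(comp, trusted):
    v_d = H(comp["data"]) == trusted["H(d)"]
    v_f = H(comp["file"]) == trusted["H(f)"]
    return v_d and v_f   # conjunction V

def invoke(ctx, model, trusted):
    # ctx plays the role of Gamma
    ok = all(verify(c, t) for c, t in
             zip(model["components"], trusted))
    if not ok:
        raise RuntimeError("Err: bad hashes")
    if ctx == "d":
        return "install"  # install(M)
    if ctx == "f":
        return "execute"  # Xi(M)
    raise RuntimeError("Err")
\end{verbatim}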
%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Model Groups}
Model groups are sets of models to run with a shared context, and shared datasets. We define executing over model groups by
\begin{equation}
r_j = \Xi (M_{\texttt{group}_i}, \boldsymbol{\mathcal{M}}_j), \quad \forall \boldsymbol{\mathcal{M}}_j \in \boldsymbol{\mathcal{M}}_{\texttt{enabled}}
\end{equation}
given that \(\boldsymbol{\mathcal{M}}_j\) is enabled, where \(M\) is the model configuration that defines the group.
\subsubsection{Data Loaders}
A data loader provides models with the data needed to execute, and any contextual information needed for the model. Context for a model may be the host information if remotely connecting to a data source, or the location of the data source.
If a system has \(n\) models enabled for proxy data, that data should be loaded at most once for the model group. The number of times the same data is loaded, \(D_{M_{\texttt{group}}}\), is not proportional to the number of models, \(\vert \boldsymbol{\mathcal{M}} \vert\).
\begin{equation}
D_{M_{\texttt{group}}} \not \propto \vert \boldsymbol{\mathcal{M}} \vert , \boldsymbol{\mathcal{M}} \in M_{\texttt{group}}
\end{equation}
Data loaders enable model groups to share the data used to model.
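One way to realize this property is to cache each data source on first load, so that every model in the group shares a single copy. A sketch under that assumption, with hypothetical names:
\begin{verbatim}
# Sketch: load each source at most once
# per model group.
class DataLoader:
    def __init__(self, sources):
        # e.g. {"proxy": "/data/proxy.log"}
        self.sources = sources
        self._cache = {}

    def load(self, name):
        if name not in self._cache:
            with open(self.sources[name]) as fh:
                self._cache[name] = fh.readlines()
        return self._cache[name]

# All models in a group share one loader,
# so |M| models cause exactly one read.
\end{verbatim}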
\subsection{Model Return}
Model return types are defined at model configuration time.
\begin{equation}
\rho = \texttt{result}_{type}
\end{equation}
and the set of result types is
\begin{equation}
\rho := \{ \texttt{U}_{risk}, \texttt{U}_{anom}, \ldots, \texttt{U}_{n} \}
\end{equation}
For a return type of \(\texttt{U}_{risk}\),
\begin{equation}
\boldsymbol{\mathcal{M}}_{\texttt{res}} =
\begin{cases}
\{u_{1_{new}}, \ldots, u_{n_{new}} \},
& \text{if } \rho = \texttt{U}_{risk}\\
\{u_1, \ldots, u_n \},
& \text{if } \rho = \texttt{U}_{anom}\\
\texttt{Err},
& \text{otherwise}
\end{cases}
\end{equation}
where \(u_{n_{new}}\) is either the new risk score for user \(u\), or the quantity of risk to be used in the new risk score calculation.
\subsection{Sequence Models}
\(\boldsymbol{\mathcal{M}}\) can be a single-fire model, or a sequence of individual models. Single-fire models perform their logic on the input data, and return a result. Single-fire models are Markovian, meaning that the state of the system after \( \Xi ( \boldsymbol{\mathcal{M}} ) \) depends only on the current state, and the state transition initiated by invoking \(\Xi\).
Sequence models depend on outputs from previous models. One approach is to feed the output of one model directly into the input of another. However, we abstract the connection between models by memoizing model outputs.
Sequence models do not always execute immediately after one another, so it is efficient to memoize prior model output. This approach allows the time window used during model training/inference to extend up to the length of the system's historical window. In other words, we can store memoized model outputs, and condition upon them at model execution time.
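A minimal sketch of such memoization, where later models in a sequence condition on stored outputs; the class and method names are hypothetical:
\begin{verbatim}
# Sketch: memoize model outputs so later
# models can condition on them.
import time

class ResultStore:
    def __init__(self):
        self._store = {}  # model id -> results

    def put(self, model_id, result):
        rows = self._store.setdefault(
            model_id, [])
        rows.append({"t": time.time(),
                     "result": result})

    def history(self, model_id, window_s):
        """Outputs within the historical
        window, oldest first."""
        cutoff = time.time() - window_s
        return [r for r in
                self._store.get(model_id, [])
                if r["t"] >= cutoff]
\end{verbatim}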
%%%%%%%%%%%%%%%%%%
\subsection{Model Versioning}
By default, the system infers with the model whose version number is the maximum of the set of model versions. Given the set of all model versions, \(\boldsymbol {\mathcal{M}}_{\texttt{vers}}\),
\begin{equation}
\boldsymbol {\mathcal{M}}_{\texttt{vers}} := \{v_1, v_2, \ldots, v_n \, \vert \, \boldsymbol{\mathcal{M}}_v \geq 0, \forall v \}
\end{equation}
\begin{equation}
\exists \boldsymbol {\mathcal{M}} \ni \texttt{max}( \boldsymbol {\mathcal{M}}_{\texttt{vers}} ) = \boldsymbol {\mathcal{M}}_v \Rightarrow \Xi(\boldsymbol {\mathcal{M}}_v)
\end{equation}
This prevents multiple versions of the same model from executing, especially while running on the same data.
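In code, the default selection reduces to taking the maximum key of a version map; a sketch with hypothetical names:
\begin{verbatim}
# Sketch: run only the highest version.
def select_version(versions):
    """versions: {version number: model}."""
    if not versions:
        raise LookupError("no versions")
    return versions[max(versions)]

# max(M_vers) = M_v  =>  Xi(M_v)
\end{verbatim}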
\subsection{Rules}
Traditionally, a rule engine within a SIEM contains a set of conditions, and the engine checks event data against these rules. With OpenUBA, an entire rule engine can be defined within a standalone model. This rule engine executes within a model group, and can output risk scores like a traditional model. This approach enables rule engines to be used in the same way models are used.
\subsection{Storage}
Another issue with current-state UBA tools is the requirement to store a copy of production data where the tool can access it, forcing UBA users to copy large amounts of data. Although UBA tools tend to delete data beyond a specified time window, copying the data at all can be challenging for users with large amounts of it.
To stay lightweight and ``SIEM-agnostic'', we store only the anomaly data, and supporting events, within a specified time window. This way, the system can still execute on production data, but without the requirement to retain all of the data on disk, or in memory.
If an enterprise proxy log source produces \(1\texttt{B}\) records on a given day, which in turn generate \(50{,}000\) proxy anomalies, retaining the \(50{,}000\) records on disk is far more efficient than copying the full dataset. This constraint makes the system more lightweight than a traditional, bloated SIEM or UBA system. The number of stored events, \(\texttt{e}_{stored}\), is much less than the number of observed events, \(\texttt{e}_{observed}\):
\begin{equation}
\texttt{e}_{stored} \lll \texttt{e}_{observed}
\end{equation}
If specific models call for a sequence of events to occur within an interval of time, \([t^*,t]\), we can efficiently retain the potentially anomalous event for up to \(t - \Delta\), where \(\Delta\) is how far into the past the system can observe an event -- typically 90 days in traditional SIEM products.
However, this data storage configuration is arbitrary because at any point in time, we can store the anomaly data in an external abstract data store, and a model can simply retrieve it from outside the system, similar to how models retrieve the data used during computation.
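The retention policy itself can be expressed as a simple filter over the observed event stream; a sketch, with the anomaly predicate left abstract:
\begin{verbatim}
# Sketch: persist only anomalous events,
# so e_stored <<< e_observed.
def retain(events, is_anomalous, store):
    for e in events:   # streamed, not kept
        if is_anomalous(e):
            store.append(e)
    return store
\end{verbatim}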
%%%%%%%%%%%%%%%%%%%
\section{Logs}
We process logs both to create users and entities, and to execute models. Although the data is processed the same way in each case, the intent differs between the two.
\begin{equation}
\texttt{process}(i, j), \forall i,j
\end{equation}
\subsection{ID Feature}
For each log source, a feature to be used as an identifier is configured. The identifier should serve as a common key across all log sources, or must be a subset of another key. This identifier makes it less costly to attribute any event \(\texttt{e}\) to an account, though attribution is possible without it.
\subsection{Dormancy}
Dormancy is determined by the absence of a preconfigured subset of events \(\texttt{e}\) in \(D_u\); the dormancy period is arbitrary. If exactly zero of the preconfigured events, \(\texttt{e}_{dorm}\), are found within the logs of \(u\), we say \(u\) is dormant.
\begin{equation}
U_{dorm} = \{u \in U \, \vert \, \texttt{e}_{dorm} \notin D_u \}
\end{equation}
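A sketch of this dormancy check over per-user logs; the data layout is hypothetical:
\begin{verbatim}
# Sketch: u is dormant if none of the
# preconfigured events e_dorm appear in D_u.
def dormant_users(user_logs, dorm_events):
    """user_logs: {user: [event types]}."""
    dorm = set(dorm_events)
    return {u for u, evts in user_logs.items()
            if not dorm.intersection(evts)}

# Example:
# dormant_users(
#   {"alice": ["login"], "bob": ["proxy"]},
#   ["login", "vpn"])  ->  {"bob"}
\end{verbatim}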
%%%%%%%%%%%%%%%%%%%
\section{Risk}
Risk scores can not only be arbitrary, but can also mislead or confuse stakeholders. ``Black box'' risk calculations prevent SOCs from fully understanding how case-generating anomalies are scored, and from concretely explaining the calculations to third parties. There is an inherent tradeoff between interpretable risk scores and the determinism of risk score calculation: the more a model uses stochastic and complex processes to calculate risk scores, the more SOC analysts lose granular explainability of risk results.
The philosophical concept of a risk score is subjective as well, and is designed for human intuition. This goal can be hindered by arbitrary vendor-driven risk score calculations, which decrease the interpretability of model outputs. In data science, there exists an accuracy-versus-interpretability tradeoff: as more complex models are invented, with increasing performance, we lose human intuition about the inner workings of the model. This justifies enabling SOC teams to define their risk score calculation logic as transparently as possible -- hiding concrete risk score logic only harms day-to-day SOC decision making, and requires a great deal of trust in the vendor from the SOC.
\subsection{Risk Calculation}
\hspace*{15pt}
We purposefully abstract the risk calculation, and demonstrate basic risk score calculations for an individual, and a group. If a model results in a new user risk score, we can simply overwrite the user's risk.
\begin{equation}
u_{\texttt{risk}_{t+1}} = r_u
\end{equation}
If a model results in a quantity of risk \(s\) to be used in the final calculation of risk for \(U\), then the abstracted risk score calculation process is invoked using that quantity as an input variable. For example, if the risk calculation for users is the sum of the previous risk score of \(u\) and the model output with respect to \(u\), then
\begin{equation}
u_{\texttt{risk}_{t+1}} = u_{\texttt{risk}_{t}} + r_u
\end{equation}
We define an abstracted risk calculation function, and map over all users
\begin{equation}
u_{\texttt{risk}} = \gamma(u, \texttt{risk}(u, t), \boldsymbol {\mathcal{M}}_{ r }^u), \forall u \in U
\end{equation}
\(\gamma\) is a function on the Cartesian product of the user set and two real numbers, yielding a real number that represents the new risk score for user \(u\).
\begin{equation}
\gamma : U \times \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}
\end{equation}
For simple risk score calculations, the function does not need \(u\); however, the system can consider a user's profile as context during risk calculation, as opposed to a risk score being independent of \(u\).
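A sketch of one concrete \(\gamma\), using the additive rule above with the result clamped to the global limits; the additive choice is only an example:
\begin{verbatim}
# Sketch: gamma(u, risk(u, t), M_r^u) with
# an additive rule, clamped to [S_MIN, S_MAX].
S_MIN, S_MAX = 0.0, 100.0

def gamma(user, current_risk, model_risk):
    # `user` allows profile-aware variants
    new_risk = current_risk + model_risk
    return max(S_MIN, min(S_MAX, new_risk))

# Map over all users:
risks = {"alice": 10.0, "bob": 40.0}
out = {"alice": 5.0, "bob": -3.0}
risks = {u: gamma(u, risks[u],
                  out.get(u, 0.0))
         for u in risks}
\end{verbatim}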
%%%%%%%%%%%%%%%
\section{Case Feedback}
When a SOC analyst provides feedback to a model, the model adjusts its internal logic in some meaningful way. Feedback into a model differs depending on the properties of the model. However, feedback on a rule must be treated separately from feedback on a model.
\subsection{Model Feedback}
Model feedback usually does not involve a change in semantic or syntactic form, because it is typically model parameters that change after feedback -- although feature selection, or preprocessing, may also change. Feedback takes one of two forms:
\begin{enumerate}
\item Tuning
\begin{enumerate}
\item changing parameters in a model
\end{enumerate}
\item Remodeling
\begin{enumerate}
\item rewriting, or editing the underlying model structure
\end{enumerate}
\end{enumerate}
The feedback process is used mainly for parameter tuning; more work is to be done on automatically remodeling based on feedback. For a simple statistical model such as
\begin{verbatim}
if user a triggers rule R
S standard deviations from 'normal'
\end{verbatim}
if a result is a false positive, the model includes the false positive in its calculation of the standard deviation, \(\sigma(x)\).
\subsubsection{Weighting}
Model feedback can also be weighted. Instead of \(\sigma(x)\) simply including the case-generating event in its calculation, the function can apply a weight in \([0,1]\) that damps the feedback. Damping can depend on several variables, including the number of times the case has been deemed a false positive, or accepted behavior, by the SOC, as sketched below.
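A minimal sketch of a weighted \(\sigma\), assuming a damping weight of \(1/(1+k)\) after \(k\) false-positive markings; both choices are illustrative:
\begin{verbatim}
# Sketch: weighted standard deviation with
# damped contributions in [0, 1].
import math

def weight(fp_count):
    return 1.0 / (1.0 + fp_count)

def weighted_std(values, weights):
    total = sum(weights)
    mean = sum(v * w for v, w in
               zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2
              for v, w in
              zip(values, weights)) / total
    return math.sqrt(var)
\end{verbatim}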
\subsection{Rule Feedback}
Rule feedback may look different from model feedback. It may require metaprogramming of the actual rule, or changing a portion of it. Feedback for a rule such as
\begin{verbatim}
if eventtype = A
\end{verbatim}
may become
\begin{verbatim}
if eventtype == 0
and if eventtype == 1
\end{verbatim}
after feedback. In other words, feedback on rules can change the syntactic form of the rule, or its semantic form, such as its variables. Each model defines its own process for interpreting the feedback given on the cases it generates.
%%%%%%%%%%%%%
\section{Future Work}
Herein, we have described an ``open model'' UBA solution for monitoring users and entities. Once systems like OpenUBA are deployed, macro-level SOC activities can be performed using both the system's data and community data. Several threat feeds exist today, but few focus on analytics and modeling insights from current threats. Sharing hyperparameters to fine-tune a shared security model on the most recent botnet activity becomes possible when ``open model'' solutions are more widely used.
\subsection{Shared SOC}
A SuperSOC consists of a set of small SOCs, each with its own people, processes, and data. SuperSOCs enable multiple distributed SOCs to function as one organism with respect to security threat modeling, and security analytics.
\subsection{Distributed Intel}
Community-driven intel is not uncommon in several security circles. Isolated, analytics-driven security tools hinder SOCs from building truly knowledge-driven analytic solutions by removing community participation. More work is to be done in this space.
\subsection{Cross-SOC models}
SOCs can also begin to model security use cases collaboratively. For example, knowing how anomalous a user's proxy traffic is within your organization is different from knowing how anomalous it is compared to other companies in the same industry. Inter-SOC collaboration is still an active area of research.
\begin{thebibliography}{9}
\bibitem{ouba} Georgia Cyber Warfare Range. \emph{OpenUBA}. \url{http://openuba.org}
\end{thebibliography}
\end{document}