From d1b178157df74a66870bbf2835d681d9b4ddac6a Mon Sep 17 00:00:00 2001
From: Dennis Devey
Date: Thu, 30 Nov 2017 23:59:06 -0500
Subject: [PATCH] Update enrichedDataset.md

---
 enrichedDataset.md | 41 ++++++++++++++++++++++-------------------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/enrichedDataset.md b/enrichedDataset.md
index 42ea1a2..1a62ce2 100644
--- a/enrichedDataset.md
+++ b/enrichedDataset.md
@@ -1,28 +1,31 @@
-## I will use this to describe the csv. It's kinda stream of consciousness now, but it will become prettier.
+## I will use this to describe the csv.
-##### [0]: domainName
-##### [1]: count, count # just in general useful for all of this... if you use total values for things like bytes or packets io, should be used to scale results.
+##### domainName:
+##### count:
 ## Word Magic: return([countUnique, percentageUnique, modeCount, percentageMode])
 ### For every item below there are 4 columns.
-##### [2-6] temp0, subdomain array #super important for DNS, less so for http
-##### [6-11] temp1, agent array #unlikely, ignore
+##### temp0 = subdomain array: Super important for DNS, less likely to be used for HTTP because there are so many other places to hide data.
+##### temp1 = user agent array: Unlikely to be used by anyone, but it could happen.
 temp2, uri array #super important for http, encoded in URI
 ## Math Magic: (return([countUnique, percentageUnique, average, minimum, maximum, entropyStat, variationStat, skewStat, kurtosisStat])
 ### For every item in this list, there are 9 columns for each statistics function returned
-##### temp_0, delta time list # very important, periodicity?
-##### magicDurationArray, durations #possibly important
-##### magicOrigBytesArray, bytes sent #yes * maybe something can be done with ratios here
-##### magicRespBytesArray, bytes received #yes
-##### magicOrigPacketsArray, packets sent #yes
-##### magicOrigIpBytesArray, ip bytes sent #yes
-##### magicRespPacketsArray, packets recieved #yes
-##### magicRespIpBytesArray, ip bytes recieved #yes * maybe something can be done with ratios here
-##### temp_2, uri length #important
-##### temp_3, uri depth #important
-##### temp_4, uri entropy #important
-##### temp_5, agent length #unlikely to matter, #unlikely to matter
-##### temp_6, agent depth #unlikely to matter, #unlikely to matter
-##### temp_7, agent entropy #unlikely to matter, recommend ignore
+##### temp_0 = delta time list: Stats from an array of the time differences between connections... a poor man's time series analysis. There are much better ways to do this most likely, for now, most likely effective.
+##### magicDurationArray = connection durations: Stats from an array of the connection lengths. File under, possibly important.
+### TIME TO DO: Actual time series analysis
+##### magicOrigBytesArray = bytes sent: Important
+##### magicRespBytesArray = bytes received: Ditto and #yes
+##### magicOrigPacketsArray = packets sent: Ditto and #yes
+##### magicOrigIpBytesArray = ip bytes sent: Ditto and #yes
+##### magicRespPacketsArray = packets recieved: Ditto and #yes
+##### magicRespIpBytesArray = ip bytes recieved: Ditto and #yes
+#### Bytes To Do: Various Producer/Consumer Ratios
+##### temp_2 = uri length: Length of the URI, longer = sketchier.
+##### temp_3 = uri depth: Stats from array of directory depths in URI.
+##### temp_4 = uri entropy: Stats from array of uri entropy, can be significantly optimized.
+### URI TO DO: Longest common substring stuff, URI hexadecimal count, entropy in final subdirectory.
+##### temp_5 = agent length: #unlikely to matter, #unlikely to matter
+##### temp_6 = agent depth: #unlikely to matter, #unlikely to matter
+##### temp_7 = agent entropy: #unlikely to matter, recommend ignore
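
The patch names the two summary functions ("Word Magic" with 4 columns, "Math Magic" with 9 columns) but does not show how they are computed. The sketch below is one possible reading, not the project's actual code. The function names `word_magic`, `math_magic`, `delta_times`, and `shannon_entropy` are hypothetical. It also assumes that percentages are fractions in the range 0 to 1, that `entropyStat` is Shannon entropy (in bits) over value frequencies, that `variationStat` is the coefficient of variation, and that skew and kurtosis are population moments with excess kurtosis.

```python
import math
from collections import Counter


def word_magic(values):
    """Assumed layout: [countUnique, percentageUnique, modeCount, percentageMode]."""
    total = len(values)
    if total == 0:
        return [0, 0.0, 0, 0.0]
    counts = Counter(values)
    count_unique = len(counts)
    mode_count = counts.most_common(1)[0][1]  # frequency of the most common value
    return [count_unique, count_unique / total, mode_count, mode_count / total]


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical value distribution (assumed definition)."""
    total = len(values)
    counts = Counter(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def math_magic(values):
    """Assumed layout: [countUnique, percentageUnique, average, minimum, maximum,
    entropyStat, variationStat, skewStat, kurtosisStat]."""
    n = len(values)
    if n == 0:
        return [0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    mean = sum(values) / n
    # Population central moments of order 2, 3, and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    variation = math.sqrt(m2) / mean if mean else 0.0  # coefficient of variation
    skew = m3 / m2 ** 1.5 if m2 else 0.0
    kurtosis = m4 / m2 ** 2 - 3.0 if m2 else 0.0  # excess kurtosis
    unique = len(set(values))
    return [unique, unique / n, mean, min(values), max(values),
            shannon_entropy(values), variation, skew, kurtosis]


def delta_times(timestamps):
    """Time differences between consecutive connections, after sorting."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    # A perfectly periodic connection pattern: every delta is the same,
    # so countUnique is 1 and the variation statistic is 0.
    beacon = delta_times([0, 10, 20, 30])
    print(math_magic(beacon))
    print(word_magic(["a", "b", "a", "c"]))
```

Under these assumptions, a domain contacted at regular intervals gives a delta-time list with one unique value and zero variation. That is the kind of periodicity signal the `temp_0` columns appear intended to expose.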