This is a log of findings during development of the slsa heuristic
Terms
-----
log message
a string of printable characters, delimited by the operating system
line terminator.
word a substring inside a log message that is delimited by specific
delimiters, usually whitespace [this definition may need to be
changed]
subword a sequence inside a word that is not delimited by the usual word
delimiters
MSA multiple sequence alignment (as used in bioinformatics)
motif a substring inside a log message that is frequently
used and has a specific syntax and semantics (e.g. an IPv4 address).
The term is based on the idea of "sequence motif" in bioinformatics.
parser (also "motif parser") extracts actual data matching a given motif
from a log message.
Open Problems
-------------
[P00001] How to detect TAG?
syslog TAG does not work well with common prefix/suffix
detection. The problem is that e.g. PIX tags have integers, which
are detected and Linux logs have process IDs. The PIX integers
are actually part of the constant text, whereas in Linux it is fine
to detect them as variable part.
This problem gets worse with more aggressive word
delimiting (see [I00002]).
[P00002] subword detection with optional parts
Sometimes, subwords may contain optional parts, so not all words may
contain the common delimiter (e.g. Cisco PIX). In this case, subword
detection is not triggered.
Open TODO Items
---------------
Note: TODO items regarding motif parsers can also be found at
https://github.com/rsyslog/liblognorm/issues?utf8=%E2%9C%93&q=is%3Aissue+label%3A%22motif+parser%22
[T00001] - Dictionary motif
We need to add dictionary motifs, where one can e.g. query users or common words.
It may make sense to ship some "common-word dictionaries". Common word (sets)
are like ("TCP", "UDP", "ICMP", ...) which are often found in firewall messages.
With such words, subword detection does not prove to be helpful, because it
detects a common suffix like [["TC", "UD"] "P"].
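The effect described above can be reproduced with a minimal sketch. It is not the slsa implementation: `common_suffix`, `PROTOCOL_WORDS` and `is_dictionary_word` are hypothetical names, and the dictionary contents are only the examples given in the text.

```python
import os


def common_suffix(words):
    """Return the longest suffix shared by all words."""
    reversed_words = [w[::-1] for w in words]
    return os.path.commonprefix(reversed_words)[::-1]


# Hypothetical "common-word dictionary"; contents are illustrative only.
PROTOCOL_WORDS = {"TCP", "UDP", "ICMP"}


def is_dictionary_word(word, dictionary=PROTOCOL_WORDS):
    """A dictionary motif matches whole words only."""
    return word in dictionary


words = ["TCP", "UDP"]
suffix = common_suffix(words)
stems = [w[: len(w) - len(suffix)] for w in words]
print(stems, suffix)                           # ['TC', 'UD'] P
print([is_dictionary_word(w) for w in words])  # [True, True]
```

Checking the dictionary before running suffix detection would keep "TCP" and "UDP" intact as whole words, which is the behaviour this TODO item asks for.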
[T00002] - URL motif
Logs often contain URLs (and file paths), just like this:
193.45.10.119:/techsupp/css/default.css
We need to find a proper way to identify them. This motif is probably very
broad and hard to specify, especially when thinking about multiple platforms.
[T00003] - name=value motif
name=value is very common and justified to be treated as a motif.
[T00004] - quote characters in n=v pairs
We need to find out which quote characters actually occur, especially in
values, across a wide variety of Linux logs (at least the typical ones). This can
be used to better describe the n=v motif.
Recommended further testing
---------------------------
This section lists some tests that would be good to conduct as part of the
project. However, it is NOT thought that these experiments are vital for
the results.
- test slsa on iptables logs with the regular n=v parser
We want to see here if there is a way to still properly detect the structure.
Maybe we need to shuffle the "nondetected" word elements.
- test slsa without cisco-interface-spec parser
Same question as above, can the generic heuristic sufficiently well detect
structure?
The core goal of these tests is to find ways to improve the slsa algorithm.
Indicators in first column of actual log
----------------------------------------
- lesson learned, general info
! idea
+ result of idea
Ideas are referenced by [Innnnn] at start of log entry
Actual Log
----------
2015-04-14
- Cisco ASA has many slightly different formats for specifying interface
addresses (the faddr, gaddr, laddr parts of message). It looks like the
log strings are written directly in code, at least we have a lot of
small inconsistencies which suggest so. This seems to make it impractical
to generate a single motif for this type.
- Based on visual review, it looks like pattern detection on a subword-level
may make sense (e.g. detect IP address, then you'll find a common slash,
then a port number). This boils down to doing a full MSA on the word level,
which I tried to avoid for performance reasons.
! [I00001] We may do a subword alignment on some common delimiters like slash,
comma, colon etc. This can probably be done much faster than a full MSA.
! [I00002] We may also experiment with additional "word-delimiters", optionally
enabling those from the same set as [I00001]. When this is done, we need to make
sure the delimiter can be stored inside the rulebase (an additional step to be
taken only if this idea proves to be useful).
2015-04-15
+ [I00002] First rough tests indicate that additional word delimiters seem to
work well, at least on Cisco messages.
A problem is that TAG values are now intermingled, as a TAG typically is
"name:" and the colon is now a word delimiter. This leads to all TAGs becoming
one big group, and only after the colon differentiation occurs. That, however,
is too late, especially for things like fixed-width structured formats
(e.g. squid). This handling of the TAG is a big problem, because the TAG is
usually the primary identifier for a message class. So it should be preserved
as unique.
The same problem shows up when processing Linux logs. The TAG becomes
effectively unusable as a way to identify the message. I also have problems
interpreting postfix logs correctly, if "[]" is part of the delimiter set.
I have not tried yet to trace the root cause down, as the approach in general
seems to be problematic in regard to TAG. I suspect that this "postfix problem"
is actually related to the TAG.
Looks like [P00001] must be looked at with priority in order to continue with
useful analysis.
Another problem that manual review brings up is that colon is often used e.g.
in time formats ("12:04:11"). If we use colon as word delimiter, we are no
longer able to detect such time formats. This suggests that more aggressive
word delimiting is probably not a good thing to have. However, it looks like
we could do the same during subword detection stage without big problems.
Ideally, this would be a kind of alignment like proposed in [I00001].
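The time-format problem can be illustrated with a small sketch. The `HH:MM:SS` pattern and the helper names `split_words` and `find_times` are assumptions made for illustration; they are not taken from slsa.

```python
import re

TIME_RE = re.compile(r"^\d{2}:\d{2}:\d{2}$")  # assumed HH:MM:SS time motif


def split_words(message, extra_delimiters=""):
    """Split on whitespace plus any extra delimiter characters (delimiters are dropped)."""
    pattern = "[" + re.escape(extra_delimiters) + r"\s]+"
    return [w for w in re.split(pattern, message) if w]


def find_times(words):
    """Return the words that match the time motif."""
    return [w for w in words if TIME_RE.match(w)]


msg = "sshd: login at 12:04:11"
plain = split_words(msg)            # whitespace only
aggressive = split_words(msg, ":")  # colon added as word delimiter

print(find_times(plain))       # ['12:04:11']
print(find_times(aggressive))  # []
```

With the colon as a word delimiter, the timestamp is broken into three integers before any motif parser sees it, and the TAG "sshd:" loses its colon as well.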
2015-04-16
+ [I00001][I00002] Tried subword detection with colon and slash as delimiters.
Works fine on Cisco logs and does not have any of the problems we had with
[I00002] (as described yesterday).
One problem is if we have something like this:
7l: connection {5}
8l: %posint% {5}
9l: for {5}
10l: outside:192.168.66.144/80 {3}
11l: (192.168.66.144/80) {3}
12l: to {3}
13l: inside:192.168.12.154/56839
14l: (217.17.249.222/56839) [nterm 1]
13l: inside:192.168.12.154/56842
14l: (217.17.249.222/56842) [nterm 1]
13l: inside:192.168.12.154/56845
14l: (217.17.249.222/31575) [nterm 1]
The algo will explode the lower "inside:" parts individually because each
node is of course processed individually. So the result will be:
7l: connection {5}
8l: %posint% {5}
9l: for {5}
10l: outside {subword} {3}
11l: : {subword}
12l: %ipv4% {subword}
13l: / {subword}
14l: %posint% {subword}
15l: (192.168.66.144 / 80) to {subword} {3}
16l: inside {subword}
17l: : {subword}
18l: %ipv4% {subword}
19l: / {subword}
20l: %posint% {subword}
21l: (217.17.249.222 / 56839) {subword}
16l: inside {subword}
17l: : {subword}
18l: %ipv4% {subword}
19l: / {subword}
20l: %posint% {subword}
21l: (217.17.249.222 / 56842) {subword}
16l: inside {subword}
17l: : {subword}
18l: %ipv4% {subword}
19l: / {subword}
20l: %posint% {subword}
21l: (217.17.249.222 / 31575) {subword}
However, we expect that this will not affect the final rule generation. But
needs to be proven. [I00003] Once we have split subwords, we may do another
"alignment run" on the tree and check if we now can find additional things
to combine. Needs to be seen. In any case, we need to split braces, which
requires slight changes to the split algo.
We also see that the subword split algo does not work properly if we have
words of different formats. Cisco PIX, for example, has interface specifiers
which may either be "IP" or "IP/PORT" (among others). In that case, the
delimiter "/" is not detected as common delimiter and so subword detection
is not triggered. Now tracked as [P00002].
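A minimal sketch of the behaviour described in this entry, assuming that subword detection is triggered only by delimiters present in every word of a node. The function names and the candidate set ":/" are assumptions for illustration, not the actual split algo.

```python
import re


def common_delimiters(words, candidates=":/"):
    """Return the candidate delimiters that occur in every word."""
    return [d for d in candidates if all(d in w for w in words)]


def split_subwords(word, delimiters):
    """Split a word at the delimiters, keeping each delimiter as its own subword."""
    if not delimiters:
        return [word]  # subword detection is not triggered
    pattern = "([" + re.escape("".join(delimiters)) + "])"
    return [part for part in re.split(pattern, word) if part]


uniform = ["inside:192.168.12.154/56839", "inside:192.168.12.154/56842"]
mixed = ["192.168.66.144/80", "192.168.66.144"]  # "IP/PORT" next to plain "IP"

print(common_delimiters(uniform))  # [':', '/']
print(split_subwords(uniform[0], common_delimiters(uniform)))
# ['inside', ':', '192.168.12.154', '/', '56839']
print(common_delimiters(mixed))    # []
```

The `mixed` case is the [P00002] situation: because one word lacks "/", no common delimiter is found and both words stay unsplit.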
- I had a first try at using "=" as a subword delimiter. This works for
name=value fields, but only if they are all in the same sequence. It
looks like it is a much better idea, for real N=V type log messages,
to generate a parser that works on the complete line (things like
iptables) [T00003]. It may still make sense to have individual N=V parsers if these
constructs are seen within otherwise non-structured messages.
2015-04-17
- as expected, the Linux kernel timestamp motif parser proved to be
useful.
2015-04-29
- some interim entries are missing, as I was focussed on some other work,
including support for some structured formats.
- I added cee-syslog and pure JSON motifs for structured formats. As
expected, this permits slsa to process log files that contain both
structured and unstructured formats (a common use case) much more
rapidly and also improves correct detection.
Note that pure JSON is seen for example in GELF, whereas cee-syslog is
seen in ossec. Both are frequently used.
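As a rough sketch of how the two structured motifs might be told apart from unstructured text: it assumes that cee-syslog payloads start with the "@cee:" cookie followed by a JSON object, and that pure JSON payloads are a bare JSON object. `classify_payload` is a hypothetical helper, not the liblognorm parser.

```python
import json

CEE_COOKIE = "@cee:"  # assumed cookie that introduces a cee-syslog payload


def classify_payload(text):
    """Return 'cee-syslog', 'json' or 'unstructured' for a message payload."""
    body = text.strip()
    kind = "json"
    if body.startswith(CEE_COOKIE):
        body = body[len(CEE_COOKIE):].lstrip()
        kind = "cee-syslog"
    if not body.startswith("{"):
        return "unstructured"
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return "unstructured"
    # Only a JSON object counts as a structured payload in this sketch.
    return kind if isinstance(parsed, dict) else "unstructured"


print(classify_payload('@cee: {"user": "root"}'))       # cee-syslog
print(classify_payload('{"version": "1.1"}'))           # json
print(classify_payload("session opened for user root")) # unstructured
```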
- I also worked on a specific motif parser for interface specifiers found
in Cisco logs. This, too, improved detection, but at a later stage we
may want to check whether we can get similar results without a specific
parser. For practical cases, however, a specific parser is much more
user-friendly.
- I have also continued to work on a specific Name=Value (N=V) parser [T00003].
A first test some days ago showed that iptables needs a special parser
because it also has name "flags", that is, fields without a value and
WITHOUT an equal sign (e.g. "DF"). This is not something we want in the
regular n=v motif as it would match much too broadly. For iptables, we can
apply other restrictions (e.g. names need to consist of uppercase letters A..Z).
So an iptables motif would not match too broadly. Unfortunately, iptables is
an example of where a specialised parser is needed for a single application.
At least from the slsa point of view, this is bad, because it shows that
the tool has limited potential. From a practical perspective, that iptables
motif is definitely frequent enough to justify the (small) effort to
implement a specific parser.
- on the n=v parser: it turns out to be somewhat problematic to find good
sets of characters permitted in name and value. Value is not so much a
problem for quoted strings, but unquoted strings are very common. For
example
"session opened for user root by (uid=0)"
will lead to
"session opened for user root by (%name-value-list%"
This shows
a) that we must not insist on whitespace before the name, as quoted forms
(like the brace here) are common
b) but the terminating brace is treated as part of the value if we only
use whitespace delimiting
Especially b) is a bit of a hard problem, because we may also see things
like "name=(value)", so it seems we cannot forbid braces in values.
HOWEVER, we may actually forbid them if inside the string and not being
escaped - so think of them as just different types of quoting, which they
are most probably like. This should be verified by experiments. --> [T00004]
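One way to experiment with this idea is sketched below. It treats parentheses as another form of quoting and forbids unescaped braces inside unquoted values. The permitted name characters and the helper name `find_name_values` are assumptions; the real character sets are exactly what [T00004] is meant to determine.

```python
import re

NV_RE = re.compile(
    r"""(?P<name>[A-Za-z_][A-Za-z0-9_.-]*)=   # name; this character set is an assumption
        (?P<value>
            "(?:[^"\\]|\\.)*"                 # double-quoted value
          | \((?:[^()\\]|\\.)*\)              # value enclosed in braces, treated as quoting
          | (?:[^\s()\\]|\\.)*                # unquoted: ends at whitespace or unescaped brace
        )""",
    re.VERBOSE,
)


def find_name_values(message):
    """Return (name, value) pairs; the name need not be preceded by whitespace."""
    return [(m.group("name"), m.group("value")) for m in NV_RE.finditer(message)]


print(find_name_values("session opened for user root by (uid=0)"))
# [('uid', '0')]
print(find_name_values("name=(value)"))
# [('name', '(value)')]
```

In the first message the closing brace is no longer swallowed by the value, while "name=(value)" still keeps its braces because they act as the quoting.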
2015-05-03
- did some experiments with the v2-iptables parser I had newly written. It turned
out that the parser is matching too broadly (e.g. a single uppercase letter will
qualify as an iptables entry, because it looks like a single flag value). While
this is fine with a manually crafted rulebase, it doesn't work with tools like
slsa, which try to detect what the motif is. Also, it may cause misdetection
even in the manually crafted case. So this is a very good proof that motif
parsers (and thus the motifs) need to be very well defined and as specific as
possible. We will sort this out with the idea that an iptables entry always has
more than a single word.
+ fixing the v2-iptables parser as proposed above did solve the issue with
misdetection
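The restriction can be sketched as follows. This is not the v2-iptables parser itself: `parse_iptables`, `ENTRY_RE` and the `min_words` parameter are hypothetical, and the sketch only assumes what the log states, namely that names and flags consist of uppercase letters A..Z, that flags carry no equal sign, and that an entry has more than a single word.

```python
import re

# Assumption: a word is either NAME=value or a bare flag, both in uppercase A..Z.
ENTRY_RE = re.compile(r"^[A-Z]+(?:=\S*)?$")


def parse_iptables(text, min_words=2):
    """Return a dict of fields, or None if text does not look like an iptables entry."""
    words = text.split()
    if len(words) < min_words:
        return None  # a single word (e.g. one uppercase letter) must not match
    fields = {}
    for word in words:
        if not ENTRY_RE.match(word):
            return None
        name, sep, value = word.partition("=")
        fields[name] = value if sep else None  # flags such as "DF" carry no value
    return fields


print(parse_iptables("A"))
# None
print(parse_iptables("SRC=192.168.12.154 DST=217.17.249.222 DF PROTO=TCP"))
# {'SRC': '192.168.12.154', 'DST': '217.17.249.222', 'DF': None, 'PROTO': 'TCP'}
```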