File: lablog

package info (click to toggle)
liblognorm 2.0.9-1
links: PTS, VCS
area: main
in suites: forky, sid
size: 1,696 kB
sloc: ansic: 11,634; sh: 2,639; makefile: 255; python: 34
file content (251 lines) | stat: -rw-r--r-- 13,881 bytes
This is a log of findings during development of the slsa heuristic

Terms
-----
log message
	a string of printable characters, delimited by the operating system
	line terminator.
word	a substring inside a log message that is delimited by specific 
	delimiters, usually whitespace [this definition may need to be
	changed]
subword	a sequence inside a word that is not delimited by the usual word
	delimiters
MSA     multiple sequence alignment (like used in bioinformatics)
motif   a substring inside a log message that is frequently being
	used and has a specific syntax and semantic (e.g. an IPv4 address).
	The term is based on the idea of "sequence motif" in bioinformatics.
parser	(also "motif parser") extracts actual data matching a given motif
	from a log message.

Open Problems
-------------
[P00001] How to detect TAG?
syslog TAG does not work well with common prefix/suffix
detection. The problem is that e.g. PIX tags have integers, which
are detected and Linux logs have process IDs. The PIX integers
are actually part of the constant text, whereas in Linux it is fine
to detect them as variable part.

This problem becomes even more problematic with more agressive word
delimition (see [I00002]).

[P00002] subword detection with optional parts
Sometimes, subwords may contain optional parts, so not all words may
contain the common delimiter (e.g. Cisco PIX). In this case, subword
detection is not triggered.

Open TODO Items
---------------
Note: TODO items regarding motif parsers can also be found at
https://github.com/rsyslog/liblognorm/issues?utf8=%E2%9C%93&q=is%3Aissue+label%3A%22motif+parser%22
[T00001] - Dictionary motif
We need to add dictionary motifs, where one can e.g. query users or common words.
It may make sense to ship some "common-word dictionaries". Common word (sets)
are like ("TCP", "UDP", "ICMP", ...) which are often found in firewall messages.
With such words, subword detection does not prove to be helpful, because it
detects a common suffix like [["TC", "UD"] "P"].

[T00002] - URL motif
Logs often contain URLs (and file pathes), just like this:
    193.45.10.119:/techsupp/css/default.css 
We need to find a proper way to identify them. This motif is probably very
broad and hard to specify, especially when thinking about multiple platforms.

[T00003] - name=value motif
name=value is very common and justifed to be treated as motif.

[T00004] - quote characters in n=v pairs
We need to find out which quote characters happen to be present in especially
values of a wide variety of Linux logs (at least the typical ones). This can 
be used to better describe the n=v motif.

Recommended further testing
---------------------------
This section lists some tests that would be good to conduct as part of the
project. However, it is NOT thought that these experiments are vital for
the results.

- test slsa on iptables logs with the regular n=v parser
We want to see here if there is a way to still properly detect the structure.
Maybe we need to shuffle the "nondetected" word elements.

- test slsa without cisco-interface-spec parser
Same question as above, can the generic heuristic sufficiently well detect
structure?

The core goal of these tests is to find ways to improve the slsa algorithm.

Indicators in first column of actual log
----------------------------------------
- lesson learned, general info
! idea
+ result of idea
Ideas are referenced by [Innnnn] at start of log entry

Actual Log 09:55:01
----------
2015-04-14
- Cisco ASA has ample slighly different formats for specifying interface
  addresses (the faddr, gaddr, laddr parts of message). It looks like the
  log strings are written directly in code, at least we have a lot of
  small inconsistencies which suggest so. This seems to makes it impractical
  to generate a single motif for this type.
- Based on visual review, it looks like pattern detection on a subword-level
  may make sense (e.g. detect IP address, then you'll find a common slash,
  then a port number). This boils down to doing a full MSA on the word level,
  which I tried to avoid for performance reasons.
! [I00001] We may do an subword alignment on some common delimiters like slash,
  comma, colon etc. This can probably be done much faster than a full MSA.
! [I00002] We may also experiment with additional "word-delimiters", optionally
  enable those from the same set as [I00001]. When this is done, we need to make
  sure the delimiter can be stored inside the rulebase (an additional step to be
  taken only of this idea proves to be useful).

2015-04-15
+ [I0002] First rought tests indicate that additional word delimiters seem to
  work well, at least on Cisco messages.
  A problem is that TAG values are now intermangled, as a TAG typically is
  "name:" and the colon is now a word delimiter. This leads to all TAGs becoming
  one big group, and only after the colon differentiation occurs. That, however,
  is too late, especially for things like fixed-width structured formats
  (e.g. squid). This handling of the TAG is a big problem, because the TAG is
  usually the primary identifier for a message class. So it should be preserved
  as unique.
  The same problem shows up when processing Linux logs. The TAG becomes
  effectively unusable as a way to identify the message. I also have problems
  interpreting postfix logs correctly, if "[]" is part of the delimiter set.
  I have not tried yet to trace the root cause down, as the approach in general
  seems to be problematic in regard to TAG. I suspect that this "postfix problem"
  is actually related to the TAG.
  Looks like [P00001] must be looked at with priority in order to continue with
  useful analysis.
  Another problem that manual review brings up is that colon is often used e.g.
  in time formats ("12:04:11"). If we use colon as word delimiter, we are no
  longer able to detect such time formats. This suggest that more agressive
  word delimition is probably not a good thing to have. However, it looks like
  we could do the same during subword detection stage without big problems.
  Ideally, this would be a kind of alignment like proposed in [I00001].

2015-04-16
+ [I00001][I00002] Tried subword detection with colon and slash as delimiters.
  Works fine on Cisco logs and does not have any of the problem we had with
  [I0002] (as described yesterday).
  One problem is if we have something like this:
   7l:                     connection {5}
   8l:                        %posint% {5}
   9l:                           for {5}
  10l:                              outside:192.168.66.144/80 {3}
  11l:                                 (192.168.66.144/80) {3}
  12l:                                    to {3}
  13l:                                       inside:192.168.12.154/56839
  14l:                                          (217.17.249.222/56839) [nterm 1]
  13l:                                       inside:192.168.12.154/56842
  14l:                                          (217.17.249.222/56842) [nterm 1]
  13l:                                       inside:192.168.12.154/56845
  14l:                                          (217.17.249.222/31575) [nterm 1]
  The algo will explode the lower "inside:" parts individually because each
  node is of course processed individually. So the result will be:
   7l:                     connection {5}
   8l:                        %posint% {5}
   9l:                           for {5}
  10l:                              outside {subword} {3}
  11l:                                 : {subword}
  12l:                                    %ipv4% {subword}
  13l:                                       / {subword}
  14l:                                          %posint% {subword}
  15l:                                             (192.168.66.144 / 80) to {subword} {3}
  16l:                                                inside {subword}
  17l:                                                   : {subword}
  18l:                                                      %ipv4% {subword}
  19l:                                                         / {subword}
  20l:                                                            %posint% {subword}
  21l:                                                               (217.17.249.222 / 56839) {subword}
  16l:                                                inside {subword}
  17l:                                                   : {subword}
  18l:                                                      %ipv4% {subword}
  19l:                                                         / {subword}
  20l:                                                            %posint% {subword}
  21l:                                                               (6.79.249.222 / 56842) {subword}
  16l:                                                inside {subword}
  17l:                                                   : {subword}
  18l:                                                      %ipv4% {subword}
  19l:                                                         / {subword}
  20l:                                                            %posint% {subword}
  21l:                                                               (217.17.249.222 / 31575) {subword}
  However, we expect that this will not affect the final rule generation. But
  needs to be proven. [I00003] Once we have split subwords, we may do another
  "alignment run" on the tree and check if we now can find additional things
  to combine. Needs to be seen. In any case, we need to split braces, which
  requires slight changes to the split algo.
  We also see that the subword split algo does not work properly if we have
  words of different formats. Cisco PIX, for example, has interface specifiers
  which may either be "IP" or "IP/PORT" (among others). In that case, the
  delimiter "/" is not detected as common delimiter and so subword detection
  is not triggered. Now tracked as [P00002].

- I had a frist try at using "=" as a subword delimiter. This works for
  name=value fields, but only if they are all in the same sequence. It
  looks like it is a much better idea, for real N=V type of log messages
  to generate an parser that works on the complete line (things like
  iptables) [T00003]. It may still make sense to have individual N=V parsers if these
  constructs are seen within otherwise non-structured messages.

2015-04-17
- as expected, the Linux kernel timestamp motif parser proved to be
  useful.

2015-04-29
- some interim entries are missing, as I was focussed on some other work,
  including support for some structured formats.
- I added cee-syslog and pure JSON motifs for structured formats. As
  expected, this permits slsa to process log files that contain both
  structured and unstructured formats (a common use case) much more
  rapidly and also improved the correct detection.
  Note that pure JSON is seen for example in GELF, wheres cee-syslog is
  seen in ossec. Both are frequently used.
- I also worked on a specifc motif parser for interface specifiers found
  in Cisco logs. This, too, improved detection, but at a later stage we
  may want to try if we can gain to similar results without a specific
  parser. For practical cases, however, a specific parser is much more
  user-friendly.
- I have also continued to work on a specific Name=Value (N=V) parser [T00003].
  A first test some days ago showed that iptables needs a special parser
  because it also has name "flags", that is a field without a value and
  WITHOUT an equal sign (e.g. "DF"). This is not something we want in the
  regular n=v motif as it would match much to broadly. For iptables, we can
  do other restrictions (e.g. names need to consist of uppercase letters A..Z).
  So an iptable motif would not match too broad. Unfortunately, iptables is
  an example of where a specialised parser is needed for a single application.
  At least from the slsa point of view, this is bad, because it shows that
  the tool has limited potential. From a practical perspective, that iptables
  motif is definitely frequent enough to justify the (small) effort to
  implement a specific parser.
- on the n=v parser: it turns out to be somewhat problematic to find good
  sets of characters permitted in name and value. Value is not so much a 
  problem for quoted strings, but unquoted strings are very common. For
  example
  "session opened for user root by (uid=0)"
  will lead to 
  "session opened for user root by (%name-value-list%"
  This shows
  a) the we must not insist on whitespace before the name, as quoted forms
     (like the brace here) are common
  b) but the terminating brace is treated as part of the value if we only
     use whitespace delimition
  Especially b) is a bit hard of a problem, because we may also see things
  like "name=(value)", so it seems we cannot forbid braces in values.
  HOWEVER, we may actually forbid them if inside the string and not being
  escaped - so think of them as just different types of quoting, which they
  are most probably like. This should be verified by experiments. --> [T00004]
2015-05-03
- did some experiments with the v2-iptables parser I had newly written. It turned
  out that the parser is matching too broad (e.g a single upper case latter will
  qualify as an iptables entry, because it looks like a single flag value). While
  this is fine with a manually crafted rulebase, it doesn't work with tools like
  slsa, which try to detect what the motif is. Also, it may cause misdetection
  even in the manually crafted case. So this is a very good proof that motif
  parsers (and thus the motifs) need to be very well defined and as specific as
  possible. We will sort this out with the idea that an iptables entry always has
  more than a single word.
+ fixing the v2-iptables parser as proposed above did solve the issue with
  misdetection