Stenographer/Stenotype Design
=============================


Introduction
------------

This document is meant to give an overview of the design of stenographer and
stenotype at a medium/high level.  For low-level stuff, look at the code :).
The architecture described in this document has changed relatively little over
the course of the project, and we doubt it will change much in the future.


High-Level Design
-----------------

Stenographer consists of a `stenographer` server, which serves user requests and
manages disk, and which runs a `stenotype` child process.  `stenotype` sniffs
packet data and writes it to disk, communicating with `stenographer` simply by
un-hiding files when they're ready for consumption.  The user scripts `stenocurl`
and `stenoread` provide simple wrappers around `curl`, which allow analysts to
request packet data from the `stenographer` server simply and easily.


Detailed Design
---------------

Stenographer is actually a few separate processes.


### Stenographer ###

Stenographer is a long-running server, the binary that you start up if you want
to "run stenographer" on your system.  It manages the `stenotype` binary as a
child process, watches disk usage and cleans up old files, and serves data to
analysts based on their queries.


#### Running Stenotype ####

First off, stenographer is in charge of making sure that `stenotype` (discussed
momentarily) starts and keeps running.  It starts stenotype as a subprocess,
watching for failures and restarting as necessary.  It also watches stenotype's
output (the files it creates) and may kill/restart stenotype itself if it feels
it is misbehaving or not generating files fast enough.

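To illustrate the supervision loop, here's a sketch in Python (not the actual Go
code; the real server also watches stenotype's output files and applies its own
restart policy):

```py
import subprocess
import time

def run_stenotype_forever(argv):
    """Keep a stenotype child process running, restarting it whenever it exits."""
    while True:
        child = subprocess.Popen(argv)
        child.wait()                  # returns when stenotype dies or is killed
        time.sleep(1)                 # hypothetical back-off before restarting
```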

#### Managing Disk(s) ####

Stenographer watches the disks that stenotype uses and tries to keep them tidy
and usable.  This includes deleting old files when free disk space drops below a
threshold, and deleting old temporary files left behind if stenotype crashes
before it can clean up after itself.

Stenographer handles disk management in two ways.  First, it runs checks
whenever it starts up a new stenotype instance to make sure files from an old,
possibly crashed instance are no longer around and causing issues.  Second, it
periodically checks disk state for out-of-disk issues (currently every 15
seconds).  During that periodic check, it also looks for new files stenotype may
have generated that it can use to serve analyst requests (described
momentarily).

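As an illustration only (the real logic lives in stenographer's Go code, and all
paths and thresholds come from `steno.conf`), the free-space part of that
periodic check amounts to something like:

```py
import os
import shutil

PACKET_DIR = "/path/to/packets"   # hypothetical; the real path comes from steno.conf
MIN_FREE_BYTES = 10 * 2**30       # hypothetical free-space threshold (10 GiB)

def cleanup_once():
    """One pass of the periodic disk check: delete oldest files until space frees up."""
    # Filenames are microsecond timestamps, so sorting them gives oldest-first order.
    files = sorted(f for f in os.listdir(PACKET_DIR) if not f.startswith("."))
    while files and shutil.disk_usage(PACKET_DIR).free < MIN_FREE_BYTES:
        os.remove(os.path.join(PACKET_DIR, files.pop(0)))
```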

#### Serving Data ####

Stenographer is also in charge of serving any analyst requests for packet data.
It watches the data generated by stenotype, and when analysts request packets it
looks up their requests in the generated data and returns them.

Stenographer provides data to analysts over TLS.  Queries are POST'd to the
`/query` HTTP handler, and responses are streamed back as PCAP files (MIME type
application/octet-stream).

Currently, stenographer only binds to localhost, so it doesn't accept remote
user requests.

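For illustration, a direct query against that handler might look like the sketch
below.  Normally `stenoread`/`stenocurl` (described later) do this for you; the
port number and the use of the third-party `requests` library are assumptions,
and the certificate paths assume a default `stenokeys.sh` install:

```py
import requests  # third-party; stenocurl uses curl for the same job

CERTS = "/etc/stenographer/certs"
resp = requests.post(
    "https://localhost:1234/query",      # hypothetical port; check your steno.conf
    data="port 80 and after 3h ago",     # the query string is the POST body
    cert=(CERTS + "/client_cert.pem", CERTS + "/client_key.pem"),
    verify=CERTS + "/ca_cert.pem",
    stream=True,
)
resp.raise_for_status()
with open("out.pcap", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 16):
        f.write(chunk)
```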

#### Access Control ####

Access to the server is controlled with client certificates.  On install, a
script, `stenokeys.sh`, is run to generate a CA certificate and use it to
create/sign a client and server certificate.  The client and server authenticate
each other on every request using the CA certificate as a source of truth.
POSIX permissions are used locally to control access to the certs... the
`stenographer` user which runs steno has read access to the server key
(`steno:root -r--------`).  The `stenographer` group has read access to the
client key (`root:steno ----r-----`).  Key usage extensions specify that the
server key must be used as a TLS server, and the client key must be used as a
TLS client.

Due to the file permissions mentioned above, giving steno access to a local user
simply requires adding that user to the local `stenographer` group, thus giving
them access to `client_key.pem`.

Once keys are created on install, they're currently NEVER REVOKED.  Thus, if
someone gets access to a client cert, they'll have access to the server ad
infinitum.  Should you have problems with a key being leaked, the current best
way to handle this is by deleting all data in the `/etc/stenographer/certs`
directory and rerunning `stenokeys.sh` to generate an entirely new set of keys
rooted to a new CA.

`stenokeys.sh` will not modify keys/certs that already exist in
`/etc/stenographer/certs`.  Thus, if you have more complex topologies, you can
overwrite these values and they'll happily be used by Stenographer.  If, for
example, you already have a CA in your organization, you can copy its cert into
the `ca_cert.pem` file, then create `{client,server}_{key,cert}.pem` files
rooted in that CA and copy them in.  This also allows folks to use a single CA
cert over multiple stenographer instances, allowing a single client cert to
access multiple servers over the network.

### Stenotype ###

Stenotype's sole purpose is to read packet data off the wire, index it, and
write it to disk.  It uses a multi-threaded architecture, while trying to limit
context switching by having most processing on a single core stay within a
single thread.


#### Packet Sniffing/Writing ####

Stenotype tries to be as performant as possible by allowing the kernel to do the
vast majority of the work.  It uses AF_PACKET, which asks the kernel to place
packets into blocks in a shared memory region, then notify stenotype when blocks
are available.  After indexing the packets in each block, it passes the block
directly back to the kernel as an O_DIRECT asynchronous write operation.

Besides indexing, then, stenotype's main job is to wait for the kernel to put
packets in a memory region, then immediately ask the kernel to take that region
back and write it.  An important benefit of this design is that packets are
never copied out of the kernel's shared memory space.  The kernel writes them
from the NIC to shared memory, then the kernel uses that same shared memory for
O_DIRECT writes to disk.  The packets transit the bus twice and are never copied
from RAM to RAM.


#### Packet File Format ####

As detailed above, the "file format" used by stenotype is actually to directly
dump data as it's presented by AF_PACKET.  Thus, data is written as blocks, with
each block containing a small header followed by a linked list of packets.
Blocks are large (1MB), and are dumped regularly (every 10s), so there's a good
chance that for slow networks we use far more disk than we need.  However, as
network speed increases past 1MB/minute/thread, this format becomes quite
efficient.  That said, there will always be some overhead.

Stenotype guarantees that a packet file will not exceed 4GB, by rotating files
if they reach that size.  It also rotates files older than 1 minute.  Files are
named for the microsecond timestamp they were created at.  While a file is being
written, it will be hidden (.1422693160230282).  When rotating, the file will be
renamed to no longer be hidden (.1422693160230282 -> 1422693160230282).  This
rename only occurs after all data has been successfully flushed to disk, so
external processes which see this rename happen (like stenographer) can
immediately start to use the newly renamed file.

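The hide-then-rename handoff can be sketched as follows (illustrative Python;
stenotype itself is C++):

```py
import os
import time

def write_packet_file(out_dir, write_blocks):
    """Sketch of stenotype's hidden-file handoff for one output file."""
    ts = int(time.time() * 1_000_000)             # microsecond timestamp used as the name
    hidden = os.path.join(out_dir, "." + str(ts))
    final = os.path.join(out_dir, str(ts))
    with open(hidden, "wb") as f:
        write_blocks(f)                           # append blocks until rotation time
        f.flush()
        os.fsync(f.fileno())                      # make sure every byte is on disk...
    os.rename(hidden, final)                      # ...and only then unhide the file
```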

#### Packet Load Balancing ####

Stenotype takes advantage of AF_PACKET's excellent load-balancing options to
split up the work of processing packets across many CPUs.  It uses AF_PACKET's
PACKET_FANOUT to create a separate memory region for each of N threads, then
requests that the kernel split incoming packets across these regions.  One
stenotype packet reading/writing thread is created for each of these regions.
Within that single thread, block processing (reading in a block, indexing it,
starting an async write, reading the next block, etc...) happens serially.

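For reference, turning on fanout for a group of sockets looks roughly like the
sketch below.  This shows only the fanout option itself (the real code also maps
a ring of packet blocks per socket, which is omitted here); the constants are
copied from `<linux/if_packet.h>` because Python's socket module doesn't export
all of them, and opening an AF_PACKET socket requires root or CAP_NET_RAW:

```py
import socket
import struct

# Constants from <linux/if_packet.h> and <linux/if_ether.h>.
SOL_PACKET = 263
PACKET_FANOUT = 18
PACKET_FANOUT_HASH = 0        # hash-based load balancing; other modes exist
ETH_P_ALL = 0x0003

NUM_THREADS = 4               # hypothetical; one socket (and one thread) per region
FANOUT_GROUP_ID = 42          # arbitrary id shared by every socket in the group

def fanout_socket():
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    # The fanout argument packs the group id in the low 16 bits and the mode in
    # the high 16 bits; all sockets with the same id share the incoming packets.
    arg = FANOUT_GROUP_ID | (PACKET_FANOUT_HASH << 16)
    s.setsockopt(SOL_PACKET, PACKET_FANOUT, struct.pack("I", arg))
    return s

sockets = [fanout_socket() for _ in range(NUM_THREADS)]
```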

#### Indexing ####

After getting a block of packets from the kernel but before passing them back to
be written out, stenotype reads through each packet and creates a small number
of indexes in memory.  These indexes are very simple, mapping a packet attribute
to a file seek offset.  Attributes we use include ports (src and dst), protocols
(udp/tcp/etc) and IPs (v4 and v6).  Indexes are dumped to disk when file
rotation happens, with a corresponding index file created for each packet file,
of the same name but in a different directory.  Given the example above, when
the .1422693160230282 -> 1422693160230282 file rotation happens, an index also
named .1422693160230282 will be created and written, then renamed to
1422693160230282 when the index has been fully flushed to disk.  Once both the
packets directory and index directory have a 1422693160230282 file, stenographer
can read both in and use the index to look up packets.


#### Index File Format ####

Indexes are leveldb SSTables, a simple, compressed file format that stores
key-value pairs sorted by key and provides simple, efficient mechanisms to query
individual keys or key ranges.  Among other things, leveldb tables give us great
compression capabilities, keeping our indexes small while still providing fast
reads.

We store each attribute (port number, protocol number, IP, etc) and its
associated packet positions in the blockfile using the format:

   Key: [type (1 byte)][value (? bytes)]
   Value: [position 0 (4 bytes)][position 1 (4 bytes)] ...

The type specifies the type of attribute being indexed (1 == protocol, 2 ==
port, 4 == IPv4, 6 == IPv6).  The value is 1 byte for protocol, 2 for ports, 4
and 16 respectively for IPv4 and IPv6 addresses.  Each position is a seek
offset into a packet file (which is guaranteed not to exceed 4GB) and is always
exactly 4 bytes long.  All values (ports, protocols, positions) are big endian.
Looking up packets involves reading the key for a specific attribute
to get all positions for that value, then seeking into the packet files to find
the packets in question and returning them.  For example, to find all packets
with port 80, you'd read in the positions for key:

   [\x02 (type=port) \x00\x50 (value=80)]

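The encoding is easy to sketch in Python (illustrative only; stenographer reads
the SSTables with its own Go/leveldb code):

```py
import socket
import struct

# Type bytes from the key format above.
TYPE_PROTOCOL, TYPE_PORT, TYPE_IPV4, TYPE_IPV6 = 1, 2, 4, 6

def port_key(port):
    return bytes([TYPE_PORT]) + struct.pack(">H", port)    # e.g. b"\x02\x00\x50" for 80

def protocol_key(proto):
    return bytes([TYPE_PROTOCOL, proto])                   # e.g. b"\x01\x06" for TCP

def ipv4_key(addr):
    return bytes([TYPE_IPV4]) + socket.inet_aton(addr)

def decode_positions(value):
    """A value is just a concatenation of big-endian 4-byte file offsets."""
    return [struct.unpack(">I", value[i:i + 4])[0] for i in range(0, len(value), 4)]

assert port_key(80) == b"\x02\x00\x50"
```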

#### Index Writing ####

The main stenotype packet sniffing thread tries to very quickly read in packet
blocks, index them, then pass them back to the kernel.  It does all disk
operations asynchronously, in order to keep its CPU busy with indexing, by far
the most time-intensive part of the whole operation.  It would be extremely
detrimental to performance to have this thread block on each file rotation to
convert in-memory indexes to on-disk indexes and write out index files.  Because
of this, index writing is relegated to a separate thread.  For each
reading/writing thread, an index-writing thread is created, along with a
thread-safe producer-consumer queue to link them up.  When the
reader/writer wants to rotate a file, it simply passes a pointer to its
in-memory index over the queue, then creates a new empty index and starts
populating it with packet data for its new file.

The index-writing thread sits in an endless loop, watching the queue for new
indexes.  When it gets a new index, it creates a leveldb table, iterates
through the index to populate that table, and flushes that table to disk.  Since
index writing takes (in our experience) far less time/energy than packet
writing, the index-writing thread does all of its operations serially, blocking
while the index is flushed to disk, then moving that index into its usable
(non-hidden) location.

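That handoff is a standard producer/consumer pattern; a Python sketch of it
(the real code is C++, and `write_table` here stands in for building and
flushing the leveldb table):

```py
import queue
import threading

index_queue = queue.Queue()   # one queue per reader/writer + index-writer pair

def index_writer_loop(write_table):
    """Endless loop run by an index-writing thread."""
    while True:
        in_memory_index = index_queue.get()   # blocks until a rotation hands one over
        write_table(in_memory_index)          # build the table, flush it, then unhide it

def rotate(current_index):
    """Called by the packet reading/writing thread when it rotates a file."""
    index_queue.put(current_index)            # hand the finished index over...
    return {}                                 # ...and start populating a fresh one

# 'print' stands in for the real writer in this sketch.
threading.Thread(target=index_writer_loop, args=(print,), daemon=True).start()
```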

### Stenoread/Stenocurl ###

As detailed above in Stenographer's "Access Control" section, we require TLS
handshakes in order to verify that clients are indeed allowed access to packet
data.  To aid in this, the simple shell script `stenocurl` wraps the `curl`
utility, adding the various flags necessary to use the correct client
certificate and verify against the correct server certificate.  `stenoread` is a
simple addition to stenocurl, which takes in a query string, passes the query to
stenocurl as a POST request, then passes the resulting PCAP file through tcpdump
in order to allow for additional filtering, writing to disk, printing in a
human-readable format, etc.


#### How Queries Work ####

An analyst that wants to query stenographer calls the `stenoread` script,
passing in a query string (see README.md for the query language format).  This
string is then POST'd (via stenocurl, using TLS certs/keys) to stenographer.
Stenographer parses the query into a Query object, which allows it to decide:

   * which index files it should read
   * which keys it should read from each index file
   * how it should combine packet file positions it gets from each key

To illustrate, for the query string

   (port 1 or ip proto 2) and after 3h ago

Stenographer would translate:

   * `after 3h ago` -> only read index files with microsecond names greater
     than (now() - 3h)
   * within these files, compute the union (because of the `or`) of position
     sets from
      * key `\x02\x00\x01` (port == 1)
      * key `\x01\x02` (protocol == 2)

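Sketching that key construction and combination in Python (hypothetical helper
names; the real parsing and index reads happen in stenographer's Go code):

```py
import struct

def port_key(port):
    return b"\x02" + struct.pack(">H", port)     # type=port, 2-byte big-endian value

def proto_key(proto):
    return bytes([0x01, proto])                  # type=protocol, 1-byte value

def matching_positions(index, key):
    """index: a mapping of key bytes -> set of packet-file offsets."""
    return index.get(key, set())

def lookup(index):
    # (port 1 or ip proto 2): the `or` becomes a union of the two position sets.
    return matching_positions(index, port_key(1)) | matching_positions(index, proto_key(2))

# The `after 3h ago` clause never touches the index contents at all; it only
# restricts which index files (by their microsecond-timestamp names) are searched.
```
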
Once it has computed a set of packet positions for each index file, it then
seeks in the corresponding packet files, reads the packets out, and merges them
into a single PCAP file which it serves back to the analyst.

This PCAP file comes back via stenocurl as a stream to STDOUT, where stenoread
passes it through tcpdump.  With no additional options, tcpdump just prints the
packet data out in a nice format.  With various options, tcpdump could do
further filtering (by TCP flags, etc), write its input to disk (-w out.pcap), or
do all the other things tcpdump is so good at.

### gRPC ###

Stenographer has gRPC support that enables secure, remote interactions with the program. Given the sensitive nature of packet data and the requirements of many users to manage a fleet of servers running Stenographer, the gRPC channel only supports encryption with client authentication and expects the administrator to use certificates that are managed separately from those generated by `stenokeys.sh` (for easily generating certificates, take a look at Square's [certstrap](https://github.com/square/certstrap) utility). The protobuf that defines Stenographer's gRPC service can be found in `protobuf/steno.proto`.

gRPC support is optional and can be enabled by adding an `Rpc` dictionary of settings to `steno.conf`. An example configuration is shown below:
```json
, "Rpc": { "CaCert": "/path/to/rpc/ca/cert"
         , "ServerKey": "/path/to/rpc/key"
         , "ServerCert": "/path/to/rpc/cert"
         , "ServerPort": 8443
         , "ServerPcapPath": "/path/to/rpc/pcap/directory"
         , "ServerPcapMaxSize": 1000000000
         , "ClientPcapChunkSize": 1000
         , "ClientPcapMaxSize": 5000000
  }
```

#### RetrievePcap ####

This call allows clients to remotely retrieve PCAP via `stenoread`. To retrieve PCAP, clients send the service a unique identifier, the size of PCAP file chunks to stream in return, the maximum size of the PCAP file to return, and the `stenoread` query used to select packet data. In response, clients receive streams of messages containing the unique identifier and PCAP file chunks (which need to be reassembled client-side). Below is a minimalist example (shown in Python) of how a client can request PCAP and save it to local disk:
```py
import os
import uuid

import grpc

import steno_pb2        # generated from protobuf/steno.proto
import steno_pb2_grpc   # generated from protobuf/steno.proto

# `server` is the host:port of the gRPC endpoint; `creds` are channel credentials
# built with grpc.ssl_channel_credentials() from the RPC CA cert and client key/cert.
with grpc.secure_channel(server, creds) as channel:
    stub = steno_pb2_grpc.StenographerStub(channel)
    pb = steno_pb2.PcapRequest()
    pb.uid = str(uuid.uuid4())
    pb.chunk_size = 1000
    pb.max_size = 500000
    pb.query = 'after 5m ago and tcp'
    pcap_file = os.path.join('.', '{}.pcap'.format(pb.uid))
    with open(pcap_file, 'wb') as fout:
        for response in stub.RetrievePcap(pb):
            fout.write(response.pcap)
```

`RetrievePcap` requires the gRPC server to be configured with the following fields (in addition to the fields required for the server to start up):
- `ServerPcapPath`: local path to the directory where `stenoread` PCAP is temporarily stored
- `ServerPcapMaxSize`: upper limit on how much PCAP a client is allowed to receive (used to restrict clients from receiving excessively large PCAPs)
- `ClientPcapChunkSize`: size of the PCAP chunks to stream to the client (used if the client has not specified a size in the request)
- `ClientPcapMaxSize`: upper limit on how much PCAP a client will receive (used if the client has not specified a size in the request)

### Defense In Depth ###

#### Stenotype ####

We're pretty scared of stenotype, because:

   1. We're processing untrusted data: packets
   2. We've got very strong permissions: the ability to read packets
   3. It's written in a memory-unsafe language: C++
   4. We're not perfect.

Because of this, we've tried to use security best practices to minimize the risk
of running these binaries with the following methods:

   * Running as an unprivileged user `stenographer`
      * We `setcap` the stenotype binary to just have the ability to read
        raw packets.
      * If you DON'T want to use `setcap`, we also offer the ability to drop
        privileges with `setuid/setgid` after starting `stenotype`... you can
        start it as `root`, then drop privs to an untrusted user (that user
        must still be able to open/write files in the index/packet
        directories).
   * `seccomp` sandboxing:  `stenotype` sandboxes itself after opening up
     sockets for packet reading.  This sandbox isn't particularly granular,
     but it should stop us from doing anything too crazy if the `stenotype`
     binary is compromised.
   * Fuzzing:  We've extracted the most concerning bit of code (the indexing
     code that processes packet data) and fuzzed it as best we can, using the
     excellent [AFL](http://lcamtuf.coredump.cx/afl/) fuzzer.  If you'd like to
     run your own fuzzing, install AFL, then run `make fuzz` in the `stenotype/`
     subdirectory, and watch your CPUs become forced-air heaters.
   * We're considering AppArmor, and may add some configs to use it for locking
     down stenotype as well.

#### Stenographer ####

We're slightly less concerned about `stenographer`, since it doesn't actually
process packet information.  It also has a smaller attack surface, especially
when bound to localhost.  Our major attack vector in `stenographer` is queries
coming in over TLS.  However, TLS certificate handling is all done with the
Go standard library (which we trust pretty well ;), so our code only ever
touches queries that come from a user in the `stenographer` group.  Since we run
it as user `stenographer`, if someone in the `stenographer` group does achieve a
shell, they'll be able to... read packets.  The big concern here is that they'll
be able to read more packets than allowed by default (let's say that we've
passed in a BPF filter to stenotype, for example).  Our primary defenses, then,
are:

   * Running as an unprivileged user `stenographer`
   * Using Go's standard library TLS to reject requests not coming from
     relatively trusted users
   * Using Go, which is much more memory-safe (runtime array bounds checks, etc)
   * We're considering AppArmor here, too, and will update this doc if we come
     up with good configs.


Design Limitations
------------------

Some of Stenographer's design decisions make it perform poorly in certain
environments or give it strange performance characteristics.  This section aims
to point these out in advance, so folks have a better understanding of some
of the idiosyncrasies they may see when deploying Stenographer.

### Slow Links, Large Files ###

Stenographer is optimized for fast links, and some of those optimizations
give it strange behavior on slow links.  The first of these is file size.  You
may notice that on a network link that's REALLY slow, you'll still see 6MB
files created every minute.  This is because currently, Stenographer will:

   * Store packets in 1MB _blocks_
   * Flush one _block_ every 10 seconds

Of course, if your link generates over 1MB every 10 seconds, this doesn't
matter to you at all.  If it generates less than that, though, you're going to
waste disk space (six mostly-empty 1MB blocks per one-minute file).  We're
considering flushing one block a minute or every thirty seconds.

### Packets Don't Show Up Immediately ###

With `stenotype` writing files and `stenographer` reading them, a packet
won't show up in a request's response until it's on disk, its index is on
disk, and `stenographer` has noticed both of these things occurring.  This
means that packets are generally 1-2 minutes behind real-time, since:

   * Packets are stored by the kernel for up to 10 seconds before being
     written to disk
   * Packet files flush every minute
   * Index files are created/flushed starting when packet files are written
   * `stenographer` looks for new files on disk every 15 seconds

Altogether, this means that there's a maximum 100-120 second delay between
`stenotype` seeing a packet and `stenographer` being able to serve that
packet based on analyst requests.

Note that for fast links, this time is reduced slightly, since:

   * Stenotype flushes a block whenever it gets 1MB of packets, reducing
     the initial 10-second wait for the kernel.
   * `stenotype` flushes at 1 minute OR at 4GB, whichever comes first, so
     if you get over 4GB/min, you'll flush files/indexes faster than once
     a minute.