  snowdrop - text watermarking and watermark recovery (ver. 0.02b)
  ----------------------------------------------------------------
  == http://lcamtuf.coredump.cx/snowdrop.tgz ==

  ***************************************
  **** THIS IS A BETA STAGE MATERIAL ****
  ***************************************

  Copyright (C) 2002, 2003 by Michal Zalewski <lcamtuf@coredump.cx>

  Submit bug reports, complaints, ideas, patches, ports and chocolate
  to Michal Zalewski at <lcamtuf@coredump.cx>.

  Contents:

  [1] Why would I possibly want to watermark a document?
  [2] What should I know about this approach?
  [3] Text watermarking: what to expect
  [4] C code watermarking: what to expect
  [5] General usage rules / bugs
  [6] Developing your own modules


[1] Why would I possibly want to watermark a document?
------------------------------------------------------

  Traditional watermarking relies on embedding some information in
  a binary file (such as a proprietary format document - Adobe PDF, MS Word -
  or a multimedia file) to identify the origin of a particular copy.
  Watermarking can be combined with steganography to hide this data from
  a casual viewer. 

  Snowdrop is intended to bring (relatively) invisible and
  modification-proof watermarking to a new realm of "source material" - 
  the written word and computer source code. The information is not being 
  embedded in the least significant portions of some binary output, as it would
  be with a traditional low-level steganography, but into the source itself. 
  The idea, at least for English, isn't new - there was some serious work done 
  by Mikhail Atallah from Purdue University. Snowdrop is merely an attempt to 
  provide a reliable, useful tool to implement those source-level watermarking 
  / steganography capabilities. Because of some tricks, such as using specially 
  crafted MD5 shortcuts, it gives certain additional advantages to its 
  potential users, such as integrity and privacy of embedded information, or an 
  ability to demonstrate the origin of a document to the public (see section 2 
  for more details). Separate logical channels are used to carry highly 
  redundant watermark to ensure it is extremely difficult to remove this 
  information by accident, simple reformatting, etc.

  I am under the impression that both the computer community in general, and 
  security researchers in particular, would benefit from having such a tool 
  for many reasons. There are two main scenarios where watermarking 
  capabilities provided by Snowdrop are particularly useful. One is protecting 
  limited distribution work, such as advisories, exploits, licensed or
  closed source code, confidential research, internal corporate memos and
  other information that could eventually leak to the public. In such cases,
  embedding a unique watermark in every copy of the document would enable
  you to track down the leak - at the same time, only you would be able
  to decode or modify this information. This knowledge can be later 
  demonstrated and documented.

  The second scenario is enforcing copyright. In case of plagiarism or
  copyright violation, it can be demonstrated that another person used text
  or source code originating from you. This procedure does not prove who
  authored the document; it merely demonstrates that party A published
  portions of a document received from party B, as opposed to publishing
  original work. Once again, the information can be recovered only by
  you, and cannot be altered in a meaningful way by third parties.

  While it is perfectly possible to intentionally remove a watermark from
  a document, the idea of using steganography makes it much more difficult
  to realize you are actually dealing with a watermarked document. In other
  words, unless you run every document routinely thru a "watermark remover"
  utility, there is very little chance you'd be aware of the watermark and
  thus attempt to remove it. At the same time, as mentioned above, it is
  difficult to remove the watermark by accident or simple modifications.

  Snowdrop, in its current version, supports three operating modes:
  draft quality English language document, fine quality English language
  document, and C source code. It is relatively simple to implement 
  new language modules (see section 6), and I encourage you to do so,
  both for programming and spoken languages.


[2] What should I know about this approach?
-------------------------------------------

  More detailed information about how documents are being modified is
  provided in the next two sections. First, I'd like to focus on some
  basic mechanisms implemented in the language-independent kernel.

  Embedded in the document are MD5 shortcuts of a specific 64-bit one-time
  magic value and the name of a recipient. This makes it practically 
  impossible to guess who the recipient was, or to modify this information
  and mislead you. You can, however, verify the MD5 watermark embedded in a 
  file - you will need the original file without a watermark, and a
  database of magic values and recipient names. Having this information,
  you can demonstrate to others what MD5 value was actually embedded in
  the file (this cannot be determined without seeing the original), and
  how this MD5 value was created (by disclosing recipient's name and
  one-time magic). By doing that, you do not compromise the security of
  other signatures, because magic values are different for each copy.
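
  As an illustration of the scheme above, the following sketch derives a
  short tag from a one-time magic value and a recipient name, and later
  checks a disclosed (magic, recipient) pair against an embedded tag. It is
  not Snowdrop's code: FNV-1a is swapped in for MD5 to keep the example free
  of library dependencies, and the names toy_shortcut() and toy_verify(), as
  well as the way the two fields are concatenated, are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in 64-bit hash (FNV-1a). Snowdrop itself uses MD5; this is only
   a placeholder so the sketch builds without a crypto library. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t n)
{
    const unsigned char *p = data;

    while (n--) {
        h ^= *p++;
        h *= 0x100000001b3ULL;              /* FNV 64-bit prime */
    }
    return h;
}

/* Hypothetical: tag = low 32 bits of hash(magic || recipient). */
uint32_t toy_shortcut(uint64_t magic, const char *recipient)
{
    unsigned char m[8];
    size_t len = 0;
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */
    int i;

    for (i = 0; i < 8; i++)                 /* fixed byte order for magic */
        m[i] = (unsigned char)(magic >> (8 * i));
    while (recipient[len] != '\0')
        len++;
    h = fnv1a(h, m, sizeof m);
    h = fnv1a(h, recipient, len);
    return (uint32_t)h;
}

/* Returns 1 if the disclosed pair reproduces the embedded tag. */
int toy_verify(uint32_t embedded, uint64_t magic, const char *recipient)
{
    return toy_shortcut(magic, recipient) == embedded;
}
```

  Because every copy gets its own magic value, disclosing one pair lets a
  third party recompute that one tag without learning anything about the
  tags embedded in other copies.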

  The advantages of using Snowdrop:

  - "Source-level" watermarks ensure medium-independent protection.
    You can store information into many files that otherwise couldn't be 
    controlled at all.

  - Watermarks are relatively short and redundant, so even 5-10 lines 
    of text can be sufficient to recover the watermark, and even severe
    modifications in medium-sized document will not affect the ability
    to recover the watermark.

  - Several separate channels are used to carry the information. This
    means that simple reformatting, edits, spell-checking or any other
    operation alone will most likely not destroy the watermark.

  - Watermarks can only be analyzed by you, unless you decide to publish
    the information, in which case, the watermark can be confirmed by
    third parties. Otherwise, no third party can read, swap or spoof 
    watermarks in your name.

  - Watermark presence is not evident. In a typical text document or
    almost every C source, there is very little to indicate 
    steganography on all levels supported.

  Note that a Snowdrop signature does not prove that you actually authored
  the material - it only demonstrates that you had it, you passed it to
  another party, and the party did something with it. Snowdrop also does
  not prove you actually disclosed the code to the person you claim, as the
  "recipient" field can be set arbitrarily. Your intent can be questioned.

  NOTE: Snowdrop, by default, uses 32-bit watermarks. This alone has a
  relatively low value as a "public proof". Simply put, it is 
  feasible for you to use brute-force to find such values that would
  give an identical watermark to what you've found in some file watermarked
  by somebody else, and claim it is your file. If you plan on using
  watermarks to document leaks to the public, please use -6 option (64-bit
  watermarks) when possible. 64 bit is reasonably strong in that, on a box 
  that is around 50,000 times faster than an average PC, it would still 
  take a year to crack it, approximately. This is far beyond the limit
  of computational feasibility for typical users, and I expect that it
  won't be disputed whether you actually watermarked the file in question,
  unless, of course, you happen to be an administrator of a cluster of
  supercomputers ;-) 96-bit signatures will be coming in future versions,
  but I don't think this is critical at this point.
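
  The estimate above can be turned around to see what it assumes. A rough
  back-of-the-envelope sketch (the function names are made up for this
  example, and a year is taken as 365 days):

```c
/* 2^bits as a double, without needing libm. */
static double two_to_the(int bits)
{
    double v = 1.0;

    while (bits-- > 0)
        v *= 2.0;
    return v;
}

/* Trials per second a single machine must sustain so that `speedup`
   such machines exhaust a `bits`-bit space in `seconds`. */
double implied_rate(int bits, double speedup, double seconds)
{
    return two_to_the(bits) / (speedup * seconds);
}

/* Seconds needed to exhaust a `bits`-bit space at `rate` trials/second. */
double exhaust_seconds(int bits, double rate)
{
    return two_to_the(bits) / rate;
}
```

  With bits = 64, speedup = 50000 and one year (31,536,000 seconds),
  implied_rate() comes out at roughly 11.7 million trials per second for the
  "average PC" of the estimate. At that rate, exhaust_seconds(32, ...) is
  about six minutes on a single such machine, which is why 32-bit watermarks
  make a weak public proof.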

  That said, 32-bit signatures are still suitable for watermarking any code 
  or text for your own purposes - that is, just tracing leaks or other
  information. Another advantage of sticking with 32-bit is that such
  watermarks take less space, can be embedded more times, and thus, are
  more reliable and easier to recover from even a short chunk of text.


[3] Text watermarking: what to expect
-------------------------------------

  Text watermarking component for technical English: sd-eng
  Higher quality output, lower capacity: sd-engf

  There are two versions of the same module available. First of them, sd-eng,
  works fine to generate a readable equivalent of input text in technical
  English. Use it with e-mails and other information that does not have
  to look nice, can have typos, etc. It is NOT supposed to preserve ASCII 
  layout, equations, to preserve the meaning of poetry, and so on. It is 
  suitable for short drafts, memos and other documents that do not have 
  to deliver highest possible quality, where the amount of information 
  stored in each file is more important than the quality. This documentation
  wouldn't look pretty after being run thru sd-eng.

  The other version, sd-engf, generates higher quality output at the price
  of capacity: it offers only 30% of the storage capacity of sd-eng. It
  should preserve tables, equations and other ASCII artifacts, preserve
  indentation, introduce far fewer typos, etc. This documentation should
  still look very good after being processed with this tool.

  Both modules use a synonym database to replace certain words with others.
  If you feel that your wording is being hurt by the synonym substitution
  used by the program, create a copy of /usr/share/snowdrop/synonyms
  file and make necessary edits to remove the offending rules. NEVER
  EDIT THE MASTER COPY. You will need the exact copy of 'synonyms'
  file you used to watermark files in order to recover watermarks later.
  If you want to use a new file, set the SD_SYNONYMS environment variable
  to point to the new copy, and don't forget to keep the copy for
  further reference. 

  Unfortunately, the current version of sd-eng does not support watermark 
  recovery using multiple 'synonyms' files at once, so please make a note 
  of the file used and be sure to pass it in SD_SYNONYMS at run time.
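
  For module authors who want to honor the same convention, the lookup can
  be as small as the sketch below. The helper name and the choice to treat
  an empty variable as unset are assumptions for this example, not a
  description of what sd-eng does internally.

```c
#include <stdlib.h>

/* Default location named in this README. */
#define DEFAULT_SYNONYMS "/usr/share/snowdrop/synonyms"

/* Hypothetical helper: pick the synonyms file for this run.
   SD_SYNONYMS wins if it is set and non-empty. */
const char *synonyms_path(void)
{
    const char *p = getenv("SD_SYNONYMS");

    return (p != NULL && *p != '\0') ? p : DEFAULT_SYNONYMS;
}
```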
  
  Minimum length of a document that can be successfully watermarked is
  around 5-10 lines of normal, reasonably formatted English text. This
  is also probably the shortest length from which the watermark can
  be successfully recovered. Redundancy in all channels is achieved at
  around 40-70 lines of text (multiply those figures by three to get
  an estimate for sd-engf utility). This amount, even if modified, should 
  be sufficient to recover the watermark.

  The four channels used are:

  - whitespaces

  - typos (rarely used but high capacity)

  - notation (various types of quotes and other punctuation marks)

  - synonymous words 

  Unfortunately, unlike with C code, channel distribution is far from
  proportional, with the weakest channel (whitespaces) dominant in
  sd-eng. In sd-engf, distribution should be a bit more fair, but
  only because of whitespace channel capacity being severely reduced.

  Usage of the module is pretty straightforward - please run it with
  no parameters to get help.


[4] C code watermarking: what to expect
---------------------------------------

  C code watermarking component: sd-c

  *********************************************************
  * NOTE: This code still needs some work. Use with care. *
  *********************************************************

  C support is a bit rough in that it generates ugly code and, at least for
  now, non-customizable variable names that can be a bit suspicious to
  a paranoid person. Also, some features are still missing, making the
  watermark easier to remove; your contributions, fixes, etc are welcome
  - for now, consider this code pretty close to being only a proof of
  concept.

  But you can expect very good results. Even around three lines of
  code can carry enough information to store a watermark, ten to fifteen lines
  should provide very good redundancy in all channels. The code is optimized
  for standalone executables. Because of that, function names and variable
  names will be modified. If this file is supposed to provide exported
  symbols (e.g. is a library), you might consider restoring symbols later
  by hand. Future versions should include an option to limit this functionality
  to static variables and functions and local variable names.

  Currently, some specific constructions (for example, preprocessor macros
  that refer to global or local variables that are not yet defined and not
  passed as a parameter) can result in invalid code being generated.
  This and other minor glitches can be corrected by hand without losing
  the ability to retrieve data from the file.

  Four channels are:

  - whitespaces, line breaks and other non-essential data.

  - code logic (e.g. 'if A then B else C' becomes 'if not A then C else B');
    this channel is not used in this version, but reserved for the future.

  - code notation (usage of ;, !sth versus sth==0, etc); not used in this
    version.

  - substituted variable names.

  Usage of the module is pretty straightforward - please run it with
  no parameters to get help.

  I took the GOBBLES Apache exploit, watermarked it, then removed all
  comments and copyrights, ran it thru 'indent' and, well...

  [+] This document matches entry 65 (channel offset 456):
  Source file : /home/lcamtuf/apache-scalp.c
  Time        : Fri Aug  2 19:38:28 2002
  Recipient   : Evil Hacker
  Comment     : no comment
  Source MD5  : 871921b3-9b9239f5
  Magic value : 21903813-164c4f42
  Watermark   : 000000002767bdc1


[5] General usage rules / bugs
------------------------------

  This tool has been tested on Linux and FreeBSD. It should be fairly
  easy to port to any system; it might, however, depend on a little
  endian architecture. That is to say, I have not tested it on big
  endian. Feel free to do it, fix any problems you find, and mail
  me back. This tool requires either OpenSSL development libraries
  or RSA MD5 libraries installed in the system.

  Before starting, make sure to read the paragraph on 32- versus 64-bit 
  watermarks in the section [2], and other parts of this documentation
  in general. Make sure you use options suitable for your needs.

  There is no detailed installation and usage tutorial, because the
  author assumes that all potential users would have some basic
  knowledge of Unix and C. Similarly, generated output should be
  readable to everyone with basic understanding of this write-up. 
 
  There are some general considerations when using this program:

  First of all, NEVER DELETE INPUT FILES OR SYNONYM DATABASES. Those files
  are essential for watermark recovery later. This is what makes watermarks
  impossible for others to spoof or read - consider those two files a part
  of your private key. Snowdrop will keep MD5 checksums of both files for 
  your reference and will refuse to run if you don't have the exact same
  files while trying to recover watermarks. If you have difficulties
  managing this information, use "comment" field provided by the tool
  to store this data. Note that disclosing those files alone does not
  compromise your watermark integrity - it would make it possible for
  others to read MD5 shortcut embedded in the file, but it would be
  pretty useless without the database you have.

  It would be possible for Snowdrop to archive those files for you
  in a safe location, but since Snowdrop can be theoretically run against,
  say, 10000 identical messages, it would be a waste of storage space
  to copy the configuration and input every time. Besides, in some 
  cases, creating additional copies of the material would not be
  desirable. If you are forgetful, don't mind wasting some space or don't 
  expect such a massive fingerprinting, a very simple shell script would do.

  Snowdrop database is located in ~/.snowdrop/database. It stores MD5
  sums of input files, unique per-file magic values, watermarks, recipient 
  names, comments and other information. You should protect this file
  (default permissions are safe).

  If a portion of your code or text is used in another program, Snowdrop
  will try to automatically synchronize input and output files. If this
  mechanism fails, please let me know - it is pretty experimental.

  If you plan on sending a watermarked file via e-mail, it is best to do it
  as a text attachment. Otherwise, try to insert the file in the mail,
  do not copy-and-paste (as, in most clients, this would enable auto line
  wrapping).

  As to bugs...

  Quite frankly, this code should be written in a functional / declarative
  language. Simple as that - there are a lot of tasks that require some
  recursion, retracting from certain choices, language processing. Prolog
  would be better. But, for obvious reasons, I decided to go with C. The
  code wasn't carefully designed from the very beginning (or, more
  precisely, it was, but many of my assumptions turned out to be incorrect).
  As a result of that, this is pretty much a hack to achieve reasonable
  results and test the concept, but the next version should be pretty much
  reworked from scratch.

  That said... There is very little atom length checking while most of the
  time, static buffers are used. So run it only against normal text files
  you've looked over, with no words or lines over 4 kB, etc. You probably
  don't want to report this to BUGTRAQ ;-)

  Another problem is that the code is awfully slow on resynchronization 
  (if two files have some sections grossly different). It typically takes
  up to a few seconds per line. Your options are to be patient when you
  try to recover the watermark, or to watermark smaller portions of relevant
  text instead of a huge document.

  There are three main issues that cause this problem:

     - resynchronization algorithm is not very optimal because of
       the irreversible nature of get_orig_atom(); that is, to recover from
       five deleted words, the *whole* document is being processed five
       times. Because of that, recovery from added words is much faster.
       It is not a big deal to make get_orig_atom() reversible, just
       add a stack, store all flags and pointers.

     - string manipulation functions make way too many copies; neater
       string management and comparison (for example, the code I
       used in Catty 2) should be implemented at some point; the point is
       that it'd have to be used everywhere, so this requires code review.
 
     - get_value() essentially tries to brute force and compare instead
       of reversing set_value(). This causes some performance loss, but
       also saved me some serious debugging. get_value() has to be 
       modified eventually to deliver better reliability anyway.
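
  The stack suggested in the first item could look roughly like this. The
  fields of struct parser_state are placeholders, since the real parser
  keeps its own flags and pointers; the point is only that every
  get_orig_atom() call would save enough state to step back one atom
  instead of re-reading the whole document.

```c
#include <stddef.h>

#define MAX_DEPTH 1024           /* assumed bound for this sketch */

struct parser_state {            /* hypothetical snapshot of the parser */
    size_t   pos;                /* offset into the input buffer */
    unsigned flags;              /* whatever per-atom flags the module keeps */
};

struct state_stack {
    struct parser_state item[MAX_DEPTH];
    size_t depth;
};

/* Save the state before consuming an atom; returns 0 if the stack is full. */
int state_push(struct state_stack *s, struct parser_state st)
{
    if (s->depth >= MAX_DEPTH)
        return 0;
    s->item[s->depth++] = st;
    return 1;
}

/* Step back one atom; returns 0 if there is nothing to undo. */
int state_pop(struct state_stack *s, struct parser_state *out)
{
    if (s->depth == 0)
        return 0;
    *out = s->item[--s->depth];
    return 1;
}
```

  A zero-initialized struct state_stack is an empty stack. Recovering from
  five deleted words would then cost five state_pop() calls rather than
  five passes over the input.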

  As of today, the watermark recovery algorithm is not really perfect. If
  streams are non-continuous (frequent changes were introduced), long enough
  segments of the watermark may not be present in the reconstructed stream.
  The ideal reconstruction algorithm would keep track of 'gaps' in the
  stream and fill them with newly acquired data instead of always simply
  appending it.
 
  The current algorithm, however, requires that segments of at least eight
  bits are continuous in any of the domains and otherwise fails to find a
  matching key. 
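
  One way to picture the gap-tracking reconstruction described above: keep
  one slot per watermark bit and mark which slots have been seen, so that a
  recovered fragment fills holes rather than being appended. This is a
  sketch of the idea only, not the current recovery code; WM_BITS, struct
  wm_acc, and the assumptions that the watermark repeats back to back and
  that each fragment's bit offset is known are all made up for the example.

```c
#include <string.h>

#define WM_BITS 32                      /* default watermark width */

struct wm_acc {
    unsigned char bit[WM_BITS];         /* recovered bit values */
    unsigned char known[WM_BITS];       /* 1 if the slot has been filled */
};

void wm_init(struct wm_acc *a)
{
    memset(a, 0, sizeof *a);
}

/* Store n bits recovered at stream offset `off`. The watermark is assumed
   to repeat, so positions wrap modulo WM_BITS. Known slots are kept. */
void wm_fill(struct wm_acc *a, unsigned off, const unsigned char *bits,
             unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++) {
        unsigned slot = (off + i) % WM_BITS;

        if (!a->known[slot]) {
            a->bit[slot] = bits[i] & 1;
            a->known[slot] = 1;
        }
    }
}

/* Number of slots still missing; 0 means the watermark is complete. */
unsigned wm_gaps(const struct wm_acc *a)
{
    unsigned i, gaps = 0;

    for (i = 0; i < WM_BITS; i++)
        gaps += !a->known[i];
    return gaps;
}
```

  With this layout, two short fragments from different parts of a heavily
  edited document can complete a watermark that neither carries alone.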

  Another watermark recovery problem is that domains don't have equal 
  priorities. While it is possible to recover synonym data from an atom if
  there is a change in the number of whitespaces preceding it, it might
  be impossible to recover whitespace data if the synonym was changed.
  This is another weakness of the brute force get_value(). This shouldn't
  be a real problem, as it shouldn't happen too often that the document
  is completely reworded but whitespaces are preserved - but it is an issue.



[6] Developing your own modules
-------------------------------

  This is a draft section, expect slight differences between this
  description and the real behavior. I plan on making the API simpler
  in the future.

  Module API is pretty trivial. To add a new module, you have to include it
  in the LANG= line in Makefile, and create a lang-nnn.c file, where nnn
  is your module name. The module should include "language.h" file, and
  have the following exports:

    void  set_original(const char* buf);

  This should set a new input buffer, a contiguous memory region starting
  at 'buf' and NUL-terminated, as the input data. It should also 
  reinitialize the module to its original state (which is critical, since
  every module is called twice during the watermarking process).

    void  set_watermarked(const char* buf);

  This works like set_original, except for the reinitialization part, and
  sets the buffer used for watermark comparison. Essentially, the main code
  needs language atom extraction done on both the original and the compared
  file to compare elements.

    char* get_orig_atom(void);
    char* get_water_atom(void);

  Those two guys return language atom from two different sources. Those
  functions should be identical, except that get_water_atom() should
  not modify returned data or internal state of the parser. It is
  guaranteed that get_orig_atom() will be called only once, when we really
  mean to work on this particular atom, while get_water_atom() can be
  called a zillion times.
 
    int   get_storage(const char* orig, const int domain);

  This function, called on an atom from the original file (and only
  from this file), is supposed to return its storage capacity, in
  bits. Note that if you can store six values in an atom, you effectively
  can store only two bits (four values) in there. Unless you want to
  re-work the code, half-bits are not allowed ;-) Domain stands for
  modification domain. This is because every atom can be modified 
  independently in four different ways. Domains are as follows:

  #define DOMAIN_WHITE     0      // Whitespaces, comments
  #define DOMAIN_GRAMMAR   1      // Grammar / logic changes
  #define DOMAIN_FORMAT    2      // Formatting, notation
  #define DOMAIN_SYNONYMS  3      // Synonyms / name substitution

  You have to be sure there is no combination of modifications in every
  domain that would be ambiguous. Also, capacity of each domain should
  not depend on what happened in other domains. In other words, if for 
  term "foo" in this particular location in the file, DOMAIN_GRAMMAR
  returns capacity 2, it should ALWAYS return this capacity, even if
  the word has been replaced by DOMAIN_SYNONYMS or modified in other
  way. 

  If multiple modifications are possible in a domain, get_storage() has to
  report the best match (with most storage capacity; if the same storage
  capacity is shared by several modifications, pick the one that is least
  vulnerable to edits or least intrusive).

  get_storage() should NOT change any parameters or internal state of the 
  module.
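
  The rounding rule mentioned above for get_storage() (six possible
  variants still give only two bits) is simply floor(log2(n)). A minimal
  helper, with a name invented for this example:

```c
/* Whole bits that fit in an atom offering `values` distinct variants:
   floor(log2(values)), and 0 when there is no real choice. */
int bits_for_values(unsigned values)
{
    int bits = 0;

    while (values > 1) {
        values >>= 1;
        bits++;
    }
    return bits;
}
```

  bits_for_values(6) and bits_for_values(4) both give 2, so the two extra
  variants of a six-way choice are simply never used.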

    char* set_value(const char* orig,int value, const int domain);

  This call should return a modified copy of the original atom 'orig'
  set to the value 'value' in the domain 'domain', using the modification
  previously agreed with get_storage(). It is called from domain 3 down to
  domain 0; every subsequent call gets a copy of the previous result as its
  input. This function should commit to all choices. In other words, if
  we decided to store something in indentation, this call should modify
  internal state so that indentation is copied to next lines.
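
  The calling order described above can be sketched as a small driver loop.
  Everything here is illustrative: apply_domains() is not part of the real
  kernel, toy_set_value() only tags the atom so the order is visible, and
  the sketch assumes set_value() returns malloc'ed memory that the caller
  frees, which the real modules may not do.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef char *(*set_value_fn)(const char *orig, int value, const int domain);

/* Toy stand-in for a module's set_value(): appends "<domain>=<value>;"
   to a malloc'ed copy of the atom. A real module would alter the atom. */
char *toy_set_value(const char *orig, int value, const int domain)
{
    size_t len = strlen(orig) + 32;
    char *out = malloc(len);

    if (out != NULL)
        snprintf(out, len, "%s%d=%d;", orig, domain, value);
    return out;
}

/* Apply one value per domain, 3 down to 0, feeding each result into the
   next call. Returns NULL if any step fails. */
char *apply_domains(set_value_fn set_value, const char *atom,
                    const int values[4])
{
    char *cur = NULL;
    int domain;

    for (domain = 3; domain >= 0; domain--) {
        char *next = set_value(cur ? cur : atom, values[domain], domain);

        free(cur);                      /* previous intermediate copy */
        if (next == NULL)
            return NULL;
        cur = next;
    }
    return cur;
}
```

  Starting with synonyms and ending with whitespace means the most
  intrusive change is made first and each later domain sees the atom as it
  will actually appear in the output.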

    int get_value(const char* orig,const char* water,int* sc,int* va,int test);

  This function should retrieve a value stored in atom 'water' in comparison
  to atom 'orig' in all four domains, 3 to 0. Storage capacities in each
  domain should be stored in four consecutive integers pointed to by 'sc',
  read values - in four consecutive integers pointed to by 'va'. Unless
  'test' is non-zero, this function should also advance all internal state
  counters and such just the way set_value would do it.

  This function should try to retrieve as much information as possible
  from domains. If nothing can be retrieved, return 0. Otherwise, return 1.
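
  To make the contract between get_storage(), set_value() and get_value()
  concrete, here is a deliberately tiny toy module that stores one bit per
  atom in DOMAIN_WHITE as one or two leading spaces. The ws_ prefix, the
  assumption that atoms carry their own leading whitespace, and the
  malloc'ed return value are choices made for this sketch only; none of
  the shipped modules work this way.

```c
#include <stdlib.h>
#include <string.h>

#define DOMAIN_WHITE 0                  /* same value as in the list above */

static const char *skip_spaces(const char *s)
{
    while (*s == ' ')
        s++;
    return s;
}

/* One bit in the whitespace domain, nothing elsewhere. */
int ws_get_storage(const char *orig, const int domain)
{
    (void)orig;
    return domain == DOMAIN_WHITE ? 1 : 0;
}

/* Encode value 0 as " word" and value 1 as "  word". */
char *ws_set_value(const char *orig, int value, const int domain)
{
    const char *word = skip_spaces(orig);
    size_t pad = 1 + (size_t)(value & 1);
    char *out;

    if (domain != DOMAIN_WHITE) {       /* other domains: plain copy */
        out = malloc(strlen(orig) + 1);
        if (out != NULL)
            strcpy(out, orig);
        return out;
    }
    out = malloc(pad + strlen(word) + 1);
    if (out != NULL) {
        memset(out, ' ', pad);
        strcpy(out + pad, word);
    }
    return out;
}

/* Fill sc[0..3] and va[0..3]; return 1 if anything was recovered. */
int ws_get_value(const char *orig, const char *water, int *sc, int *va,
                 int test)
{
    const char *w = skip_spaces(water);
    size_t pad = (size_t)(w - water);
    int d;

    (void)test;                         /* the toy keeps no internal state */
    for (d = 0; d < 4; d++)
        sc[d] = va[d] = 0;
    if (strcmp(w, skip_spaces(orig)) != 0 || pad < 1 || pad > 2)
        return 0;                       /* word changed or odd padding */
    sc[DOMAIN_WHITE] = 1;
    va[DOMAIN_WHITE] = (int)pad - 1;
    return 1;
}
```

  Note that this toy gives up as soon as the word itself differs, which is
  exactly the weakness described in section 5 for whitespace data when the
  synonym has been changed.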

  Helper functions:

    char* get_langdesc(void);

  Return a module language description of up to approx. 25 characters.

    unsigned int md5_importantstuff(void);

  Generate an MD5 sum of all files this module depends on. This is
  important for detecting and reporting changes in those files that could
  affect functionality. Return 0 if this module does not rely on files
  and other variables of this kind.

    void  module_help(void);

  Write short module help.

    void  md5_wrong(void);

  Write a warning message. This function is called if no matches were
  found for signatures with the MD5 shortcut equal to what
  md5_importantstuff() returned, but some signatures were found with
  other MD5 values. This should essentially display "perhaps you used
  different input data to watermark this file?"

  Macros:

     fatal(x...)

  printf-alike macro that displays a message and terminates.

     debug(x...)

  printf-alike macro that should be used for all status output 
  (all kinds of non-fatal messages).