snowdrop - text watermarking and watermark recovery (ver. 0.02b)
----------------------------------------------------------------
== http://lcamtuf.coredump.cx/snowdrop.tgz ==
***************************************
**** THIS IS A BETA STAGE MATERIAL ****
***************************************
Copyright (C) 2002, 2003 by Michal Zalewski <lcamtuf@coredump.cx>
Submit bug reports, complaints, ideas, patches, ports and chocolate
to Michal Zalewski at <lcamtuf@coredump.cx>.
Contents:
[1] Why would I possibly want to watermark a document?
[2] What should I know about this approach?
[3] Text watermarking: what to expect
[4] C code watermarking: what to expect
[5] General usage rules / bugs
[6] Developing your own modules
[1] Why would I possibly want to watermark a document?
------------------------------------------------------
Traditional watermarking relies on embedding some information in
a binary file (such as a proprietary-format document - Adobe PDF, MS Word -
or a multimedia file) to identify the origin of a particular copy.
Watermarking can be combined with steganography to hide this data from
a casual viewer.
Snowdrop is intended to bring (relatively) invisible and
modification-proof watermarking to a new realm of "source material" -
written word and computer source code. The information is not
embedded in the least significant portions of some binary output, as it would
be with traditional low-level steganography, but in the source itself.
The idea, at least for English, isn't new - there was some serious work done
by Mikhail Atallah from Purdue University. Snowdrop is merely an attempt to
provide a reliable, useful tool to implement those source-level watermarking
/ steganography capabilities. Because of some tricks, such as using specially
crafted MD5 shortcuts, it gives certain additional advantages to its
potential users, such as integrity and privacy of embedded information, or an
ability to demonstrate the origin of a document to the public (see section 2
for more details). Separate logical channels are used to carry a highly
redundant watermark, to ensure it is extremely difficult to remove this
information by accident, simple reformatting, etc.
I am under the impression that both the computer community in general, and
security researchers in particular, would benefit from having such a tool
for many reasons. There are two main scenarios where watermarking
capabilities provided by Snowdrop are particularly useful. One is protecting
limited distribution work, such as advisories, exploits, licensed or
closed source code, confidential research, internal corporate memos and
other information that could eventually leak to the public. In such cases,
embedding a unique watermark in every copy of the document would enable
you to track down the leak - at the same time, only you would be able
to decode or modify this information. This knowledge can be later
demonstrated and documented.
The second scenario is enforcing copyright. In case of plagiarism or copyright
violation, it can be demonstrated that another person used a text or source
code originating from you. This procedure does not prove who authored the
document; it merely demonstrates that party A published portions of a document
received from party B, as opposed to publishing original work. Once again,
the information can be recovered only by you, and cannot be altered in a
meaningful way by third parties.
While it is perfectly possible to intentionally remove a watermark from
a document, the idea of using steganography makes it much more difficult
to realize you are actually dealing with a watermarked document. In other
words, unless you run every document routinely thru a "watermark remover"
utility, there is very little chance you'd be aware of the watermark and
thus attempt to remove it. At the same time, as mentioned above, it is
difficult to remove the watermark by accident or simple modifications.
Snowdrop, in its current version, supports three operating modes:
draft quality English language document, fine quality English language
document, and C source code. It is relatively simple to implement
new language modules (see section 6), and I encourage you to do so,
both for programming and spoken languages.
[2] What should I know about this approach?
-------------------------------------------
More detailed information about how documents are modified is
provided in the next two sections. First, I'd like to focus on some
basic mechanisms implemented in the language-independent kernel.
Embedded in the document are MD5 shortcuts of a specific 64-bit one-time
magic value and the name of a recipient. This makes it practically
impossible to guess who the recipient was, or to modify this information
and mislead you. You can, however, verify the MD5 watermark embedded in a
file - you will need the original file without a watermark, and a
database of magic values and recipient names. Having this information,
you can demonstrate to others what MD5 value was actually embedded in
the file (this cannot be determined without seeing the original), and
how this MD5 value was created (by disclosing recipient's name and
one-time magic). By doing that, you do not compromise the security of
other signatures, because magic values are different for each copy.
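As a rough illustration of why a per-copy magic value keeps disclosures
independent, here is a minimal sketch of deriving a short watermark from
a magic value and a recipient name. Everything in it is an assumption made
for the example: the function names, the order in which the inputs are
hashed, and above all the hash itself - 64-bit FNV-1a stands in for MD5 so
that the sketch has no library dependency, and it is NOT cryptographically
secure. Snowdrop's real derivation may well differ.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for MD5: 64-bit FNV-1a. NOT cryptographic; it is used here
   only to keep the sketch self-contained. Snowdrop itself uses MD5. */
static uint64_t fnv1a64(const void *data, size_t len, uint64_t h) {
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Hypothetical layout: hash the 64-bit one-time magic, then the recipient
   name, and keep the low `bits` bits (32 for the default, 64 for -6). */
static uint64_t shortcut(uint64_t magic, const char *recipient, int bits) {
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a 64-bit offset basis */
    h = fnv1a64(&magic, sizeof magic, h);
    h = fnv1a64(recipient, strlen(recipient), h);
    return bits >= 64 ? h : h & ((1ULL << bits) - 1);
}
```

Calling shortcut() twice with the same recipient but different magic
values gives unrelated results, which is the property the paragraph above
relies on: revealing the magic for one copy says nothing about the others.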
The advantages of using Snowdrop:
- "Source-level" watermarks ensure medium-independent protection.
You can store information into many files that otherwise couldn't be
controlled at all.
- Watermarks are relatively short and redundant, so even 5-10 lines
of text can be sufficient to recover the watermark, and even severe
modifications in a medium-sized document will not affect the ability
to recover the watermark.
- Several separate channels are used to carry the information. This
means that simple reformatting, edits, spell-checking or any other
operation alone will most likely not destroy the watermark.
- Watermarks can only be analyzed by you, unless you decide to publish
the information, in which case the watermark can be confirmed by
third parties. Otherwise, no third party can read, swap or spoof
watermarks in your name.
- Watermark presence is not evident. In a typical text document or
almost any C source, there is very little to indicate
steganography on any of the supported levels.
Note that a Snowdrop signature does not prove that you actually authored
the material - it only demonstrates that you had it, you passed it to
another party, and the party did something with it. Snowdrop also does
not prove you actually disclosed the code to the person you claim, as
the "recipient" field can be set arbitrarily. Your intent can be questioned.
NOTE: Snowdrop, by default, uses 32-bit watermarks. On its own, this has
relatively low value as a "public proof". Simply put, it is
feasible to use brute force to find values that would
give a watermark identical to the one found in a file watermarked
by somebody else, and then claim the file as your own. If you plan on using
watermarks to document leaks to the public, please use the -6 option (64-bit
watermarks) when possible. 64 bits is reasonably strong: on a box
that is around 50,000 times faster than an average PC, it would still
take approximately a year to crack. This is far beyond the limit
of computational feasibility for typical users, and I expect that it
won't be disputed whether you actually watermarked the file in question,
unless, of course, you happen to be an administrator of a cluster of
supercomputers ;-) 96-bit signatures will be coming in future versions,
but I don't think this is critical at this point.
That said, 32-bit signatures are still suitable for watermarking any code
or text for your own purposes - that is, just tracing leaks or other
information. Another advantage of sticking with 32-bit is that such
watermarks take less space, can be embedded more times, and thus, are
more reliable and easier to recover from even a short chunk of text.
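The 32- versus 64-bit trade-off can be put in numbers. The sketch below
simply takes the figure quoted above at face value (a box 50,000 times
faster than an average PC needs about a year for 64 bits) and derives the
per-PC trial rate it implies; that rate is an inference from this text,
not a measurement, and the helper names are made up for the example.

```c
/* Number of candidate values for a watermark of the given width. */
static double keyspace(int bits) {
    double k = 1.0;
    while (bits-- > 0)
        k *= 2.0;
    return k;
}

#define SECONDS_PER_YEAR (365.0 * 24.0 * 3600.0)

/* Trial rate of one "average PC" implied by the claim that 50,000 of
   them exhaust all 2^64 values in about one year. */
static double implied_rate_per_pc(void) {
    return keyspace(64) / (50000.0 * SECONDS_PER_YEAR);
}

/* Worst-case time, in seconds, for `pcs` such machines to try every
   value of a `bits`-wide watermark. */
static double seconds_to_exhaust(int bits, double pcs) {
    return keyspace(bits) / (implied_rate_per_pc() * pcs);
}
```

Under that assumption one PC tries roughly 1.17e7 candidates per second,
so seconds_to_exhaust(32, 1) comes out at about six minutes, while
seconds_to_exhaust(64, 50000) is a year by construction.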
[3] Text watermarking: what to expect
-------------------------------------
Text watermarking component for technical English: sd-eng
Higher quality output, lower capacity: sd-engf
There are two versions of the same module available. The first of them, sd-eng,
works fine to generate a readable equivalent of input text in technical
English. Use it with e-mails and other information that does not have
to look nice, can have typos, etc. It is NOT supposed to preserve ASCII
layout, equations, to preserve the meaning of poetry, and so on. It is
suitable for short drafts, memos and other documents that do not have
to deliver highest possible quality, where the amount of information
stored in each file is more important than the quality. This documentation
wouldn't look pretty after being run thru sd-eng.
The other version, sd-engf, generates higher quality output at the price
of having only 30% of the storage capacity of sd-eng. It should preserve
tables, equations and other ASCII artifacts, preserve indentation, introduce
far fewer typos, etc. This documentation should still look very good
after being processed with this tool.
Both modules use a synonym database to replace certain words with others.
If you feel that your wording is being hurt by the synonym substitution
used by the program, create a copy of /usr/share/snowdrop/synonyms
file and make necessary edits to remove the offending rules. NEVER
EDIT THE MASTER COPY. You will need the exact copy of 'synonyms'
file you used to watermark files in order to recover watermarks later.
If you want to use a new file, set the SD_SYNONYMS environment variable
to point to the new copy, and don't forget to keep the copy for
further reference.
Unfortunately, the current version of sd-eng does not support watermark
recovery using multiple 'synonyms' files at once, so please make a note
of the file you used and be sure to pass it in SD_SYNONYMS at run time.
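For module authors who want the same override behavior, the lookup could
be as small as the sketch below. The helper name synonyms_path() is
hypothetical and the real sd-eng code may resolve the file differently;
only the variable name and the default path come from this document.

```c
#include <stdlib.h>

#define DEFAULT_SYNONYMS "/usr/share/snowdrop/synonyms"

/* Hypothetical helper: prefer SD_SYNONYMS when it is set and non-empty,
   otherwise fall back to the master copy. */
static const char *synonyms_path(void) {
    const char *p = getenv("SD_SYNONYMS");
    return (p && *p) ? p : DEFAULT_SYNONYMS;
}
```

Whichever path this returns at watermarking time is the one that has to
be supplied again, unchanged, at recovery time.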
The minimum length of a document that can be successfully watermarked is
around 5-10 lines of normal, reasonably formatted English text. This
is also probably the shortest length from which the watermark can
be successfully recovered. Redundancy in all channels is achieved at
around 40-70 lines of text (multiply those figures by three to get
an estimate for sd-engf utility). This amount, even if modified, should
be sufficient to recover the watermark.
The four channels used:
- whitespaces
- typos (rarely used but high capacity)
- notation (various types of quotes and other punctuation marks)
- synonymous words
Unfortunately, unlike with C code, channel distribution is far from
proportional, with the weakest channel (whitespaces) dominant in
sd-eng. In sd-engf, distribution should be a bit more fair, but
only because the whitespace channel capacity is severely reduced.
Usage of the module is pretty straightforward - please run it with
no parameters to get help.
[4] C code watermarking: what to expect
---------------------------------------
C code watermarking component: sd-c
*********************************************************
* NOTE: This code still needs some work. Use with care. *
*********************************************************
C support is a bit rough in that it generates ugly code and, at least for
now, non-customizable variable names that can be a bit suspicious to
a paranoid person. Also, some features are still missing, making the
watermark easier to remove; your contributions, fixes, etc are welcome
- for now, consider this code pretty close to being only a proof of
concept.
But you can expect very good results. Even around three lines of
code can carry enough information to store a watermark, ten to fifteen lines
should provide very good redundancy in all channels. The code is optimized
for standalone executables. Because of that, function names and variable
names will be modified. If this file is supposed to provide exported
symbols (e.g. is a library), you might consider restoring symbols later
by hand. Future versions should include an option to limit this functionality
to static variables and functions and local variable names.
Currently, some specific constructions (for example, preprocessor macros
that refer to global or local variables that are not yet defined and not
passed as a parameter) can result in invalid code being generated.
This and other minor glitches can be corrected by hand without losing
the ability to retrieve data from the file.
The four channels are:
- whitespaces, line breaks and other non-essential data.
- code logic (e.g. 'if A then B else C' becomes 'if not A then C else B');
this channel is not used in this version, but reserved for the future.
- code notation (usage of ;, !sth versus sth==0, etc); not used in this
version.
- substituted variable names.
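To make the two reserved channels concrete, here are behavior-preserving
rewrites of the kind described in the list above. The functions are invented
for the example, and how a future sd-c would actually map such choices to
bits is not specified here; the point is only that each pair offers two
equivalent spellings, so each could in principle carry one bit.

```c
/* Logic channel, original form: 'if A then B else C'. */
static int clamp_a(int x, int limit) {
    if (x > limit)
        return limit;
    else
        return x;
}

/* Logic channel, rewritten form: 'if not A then C else B'. */
static int clamp_b(int x, int limit) {
    if (!(x > limit))
        return x;
    else
        return limit;
}

/* Notation channel: '!sth' versus 'sth == 0'. */
static int is_zero_a(int n) { return !n; }
static int is_zero_b(int n) { return n == 0; }
```

Both members of each pair return the same result for every input, so the
choice between them is invisible to anyone who only runs the program.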
Usage of the module is pretty straightforward - please run it with
no parameters to get help.
I took the GOBBLES Apache exploit, watermarked it, then removed all comments
and copyrights, ran it thru 'indent' and, well...
[+] This document matches entry 65 (channel offset 456):
Source file : /home/lcamtuf/apache-scalp.c
Time : Fri Aug 2 19:38:28 2002
Recipient : Evil Hacker
Comment : no comment
Source MD5 : 871921b3-9b9239f5
Magic value : 21903813-164c4f42
Watermark : 000000002767bdc1
[5] General usage rules / bugs
------------------------------
This tool has been tested on Linux and FreeBSD. It should be fairly
easy to port to any system; it might, however, depend on a little
endian architecture. That is to say, I have not tested it on big
endian. Feel free to try it, fix any problems you run into, and mail
me back. This tool requires either OpenSSL development libraries
or RSA MD5 libraries installed in the system.
Before starting, make sure to read the paragraph on 32- versus 64-bit
watermarks in section [2], and other parts of this documentation
in general. Make sure you use options suitable for your needs.
There is no detailed installation and usage tutorial, because the
author assumes that all potential users would have some basic
knowledge of Unix and C. Similarly, generated output should be
readable to everyone with basic understanding of this write-up.
There are some general considerations when using this program:
First of all, NEVER DELETE INPUT FILES OR SYNONYM DATABASES. Those files
are essential for watermark recovery later. This is what makes watermarks
impossible for others to spoof or read - consider those two files a part
of your private key. Snowdrop will keep MD5 checksums of both files for
your reference and will refuse to run if you don't have the exact same files
while trying to recover watermarks. If you have difficulties
managing this information, use the "comment" field provided by the tool
to store this data. Note that disclosing those files alone does not
compromise your watermark integrity - it would make it possible for
others to read the MD5 shortcut embedded in the file, but it would be
pretty useless without the database you have.
It would be possible for Snowdrop to archive those files for you
in a safe location, but since Snowdrop can be theoretically run against,
say, 10000 identical messages, it would be a waste of storage space
to copy the configuration and input every time. Besides, in some
cases, creating additional copies of the material would not be
desirable. If you are forgetful, don't mind wasting some space or don't
expect such massive fingerprinting, a very simple shell script would do.
The Snowdrop database is located in ~/.snowdrop/database. It stores MD5
sums of input files, unique per-file magic values, watermarks, recipient
names, comments and other information. You should protect this file
(default permissions are safe).
If a portion of your code or text is used in another program, Snowdrop
will try to automatically synchronize input and output files. If this
mechanism fails, please let me know - it is pretty experimental.
If you plan on sending a watermarked file via e-mail, it is best to do it
as a text attachment. Otherwise, try to insert the file in the mail,
do not copy-and-paste (as, in most clients, this would enable auto line
wrapping).
As to bugs...
Quite frankly, this code should be written in a functional / declarative
language. Simple as that: there are a lot of tasks that require some
recursion, backtracking from certain choices, language processing. Prolog
would be better. But, for obvious reasons, I decided to go with C. The
code wasn't carefully designed from the very beginning (or, more
precisely, it was, but many of my assumptions turned out to be incorrect).
As a result of that, this is pretty much a hack to achieve reasonable
results and test the concept, but the next version should be pretty much
reworked from scratch.
That said... There is very little atom length checking, and static buffers
are used most of the time. So run it only against normal text files
you've looked over, with no words or lines over 4 kB, etc. You probably
don't want to report this to BUGTRAQ ;-)
Another problem is that the code is awfully slow on resynchronization
(if two files have some sections grossly different). It typically takes
up to a few seconds per line. Your options are being patient when you
try to recover the watermark, or watermarking smaller portions of relevant
text instead of a huge document.
There are three main issues that cause this problem:
- the resynchronization algorithm is far from optimal because of
the irreversible nature of get_orig_atom(); that is, to recover from
five deleted words, the *whole* document is being processed five
times. Because of that, recovery from added words is much faster.
It is not a big deal to make get_orig_atom() reversible, just
add a stack, store all flags and pointers.
- string manipulation functions make way too many copies; neater
string management and comparison (for example, the code I
used in Catty 2) should be implemented at some point; the point is
that it'd have to be used everywhere, so this requires code review.
- get_value() essentially tries to brute force and compare instead
of reversing set_value(). This causes some performance loss, but
also saved me some serious debugging. get_value() has to be
modified eventually to deliver better reliability anyway.
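The "just add a stack" fix from the first item above might look roughly
like this. The struct layout, the names and the fixed depth are all
assumptions made for the sketch; the real get_orig_atom() keeps its own
set of flags and pointers, which would go where the two placeholder
fields are.

```c
#include <stddef.h>

/* Hypothetical parser state; stands in for whatever flags and pointers
   the real atom extractor tracks. */
struct atom_state {
    size_t pos;        /* offset of the next atom in the input buffer */
    unsigned flags;    /* per-atom parser flags */
};

#define STATE_STACK_MAX 1024

static struct atom_state cur;                      /* live parser state */
static struct atom_state saved[STATE_STACK_MAX];   /* history of states */
static size_t saved_top;

/* Save the current state before consuming an atom. Returns 0 if full. */
static int state_push(void) {
    if (saved_top == STATE_STACK_MAX)
        return 0;
    saved[saved_top++] = cur;
    return 1;
}

/* Step back to the most recently saved state. Returns 0 if none is left. */
static int state_pop(void) {
    if (saved_top == 0)
        return 0;
    cur = saved[--saved_top];
    return 1;
}
```

With something like this in place, recovering from five deleted words
would mean five calls to state_pop() rather than five passes over the
whole document.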
As of today, the watermark recovery algorithm is not really perfect. If streams
are non-continuous (frequent changes were introduced), long enough segments
of the watermark may not be present in the reconstructed stream. The ideal
reconstruction algorithm would keep track of 'gaps' in the stream and fill
them with newly acquired data instead of always simply appending it.
The current algorithm, however, requires continuous segments of at least
eight bits in any of the domains, and otherwise fails to find a matching
key.
Another watermark recovery problem is that domains don't have equal
priorities. While it is possible to recover synonym data from an atom if
there is a change in the number of whitespaces preceding it, it might
be impossible to recover whitespace data if the synonym was changed.
This is another weakness of the brute force get_value(). This shouldn't
be a real problem, as it shouldn't happen too often that the document
is completely reworded but whitespaces are preserved - but it is an issue.
[6] Developing your own modules
-------------------------------
This is a draft section; expect slight differences between this
description and the real behavior. I plan on making the API simpler
in the future.
The module API is pretty trivial. To add a new module, you have to include it
in the LANG= line in the Makefile, and create a lang-nnn.c file, where nnn
is your module name. The module should include the "language.h" file, and
have the following exports:
void set_original(const char* buf);
This should set a new input buffer, a continuous memory region starting
at 'buf' and NUL-terminated, as the input data. It should also
reinitialize the module to its original state (which is critical, since
every module is called twice during the watermarking process).
void set_watermarked(const char* buf);
This works like set_original(), minus the reinitialization part, and sets
the buffer used for watermark comparison. Essentially, the main code needs
language atom extraction done on both the original and the compared file
to compare elements.
char* get_orig_atom(void);
char* get_water_atom(void);
Those two guys return a language atom from two different sources. Those
functions should be identical, except that get_water_atom() should
not modify returned data or internal state of the parser. It is
guaranteed that get_orig_atom() will be called only once, when we really
mean to work on this particular atom, while get_water_atom() can be
called a zillion times.
int get_storage(const char* orig, const int domain);
This function, called on an atom from the original file (and only
from this file), is supposed to return its storage capacity, in
bits. Note that if you can store six values in an atom, you effectively
can store only two bits (four values) in there. Unless you want to
re-work the code, half-bits are not allowed ;-) Domain stands for
modification domain. This is because every atom can be modified
independently in four different ways. Domains are as follows:
#define DOMAIN_WHITE 0 // Whitespaces, comments
#define DOMAIN_GRAMMAR 1 // Grammar / logic changes
#define DOMAIN_FORMAT 2 // Formatting, notation
#define DOMAIN_SYNONYMS 3 // Synonyms / name substitution
You have to be sure there is no combination of modifications in every
domain that would be ambiguous. Also, capacity of each domain should
not depend on what happened in other domains. In other words, if for
term "foo" in this particular location in the file, DOMAIN_GRAMMAR
returns capacity 2, it should ALWAYS return this capacity, even if
the word has been replaced by DOMAIN_SYNONYMS or modified in other
way.
If multiple modifications are possible in a domain, get_storage() has to
report the best match (the one with the most storage capacity; if several
modifications share the same capacity, pick the one that is least vulnerable
to edits or least intrusive).
get_storage() should NOT change any parameters or internal state of the
module.
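The "six values means two bits" rule mentioned above is just a
round-down to the nearest power of two. A module could compute it with a
helper like the following; the name capacity_bits() is hypothetical and
not part of the Snowdrop API.

```c
/* Largest number of whole bits that `values` distinct states can encode,
   i.e. floor(log2(values)); 0 when fewer than two states are available. */
static int capacity_bits(unsigned values) {
    int bits = 0;
    while (values >= 2) {
        values >>= 1;
        bits++;
    }
    return bits;
}
```

With six possible modifications, capacity_bits(6) is 2, so only four of
the six variants are ever used, and get_storage() would report 2 for that
atom and domain.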
char* set_value(const char* orig,int value, const int domain);
This call should return a modified copy of the original atom 'orig'
set to the value 'value' in the domain 'domain', using the modification
previously agreed with get_storage(). It is called for domains 3 down to
0, and every subsequent call gets a copy of the previous result as its
input. This function should commit to all choices. In other words, if
we decided to store something in indentation, this call should modify
internal state so that indentation is copied to next lines.
int get_value(const char* orig,const char* water,int* sc,int* va,int test);
This function should retrieve a value stored in atom 'water' in comparison
to atom 'orig' in all four domains, 3 to 0. Storage capacities in each
domain should be stored in four subsequent integers pointed to by 'sc',
read values - in four subsequent integers of 'va'. Unless 'test' is
non-zero, this function should also advance all internal state counters
and such just the way set_value() would do it.
This function should try to retrieve as much information as possible
from domains. If nothing can be retrieved, return 0. Otherwise, return 1.
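To show how get_storage(), set_value() and get_value() fit together, here
is a toy round trip for a single domain. It is a sketch under invented
rules, not a working Snowdrop module: the toy_ prefix, the atom format
(a bare word) and the encoding (one trailing space for 0, two for 1) are
all assumptions, and the toy keeps no internal state, so 'test' is ignored.

```c
#include <stdlib.h>
#include <string.h>

#define DOMAIN_WHITE 0   /* value taken from the domain list above */

/* Toy rule: every atom can hold exactly one bit in the whitespace domain
   and nothing in the other three. */
static int toy_get_storage(const char *orig, const int domain) {
    (void)orig;
    return domain == DOMAIN_WHITE ? 1 : 0;
}

/* Returns a freshly allocated modified copy; the caller frees it. */
static char *toy_set_value(const char *orig, int value, const int domain) {
    size_t n = strlen(orig);
    char *out = malloc(n + 3);          /* word + up to 2 spaces + NUL */
    if (!out)
        return NULL;
    memcpy(out, orig, n);
    if (domain == DOMAIN_WHITE) {
        out[n++] = ' ';
        if (value & 1)
            out[n++] = ' ';
    }
    out[n] = '\0';
    return out;
}

/* Fills sc[0..3] and va[0..3]; returns 1 if any domain could be read. */
static int toy_get_value(const char *orig, const char *water,
                         int *sc, int *va, int test) {
    size_t n = strlen(orig), w = strlen(water), spaces;
    (void)test;                         /* the toy has no state to advance */
    for (int d = 0; d < 4; d++) {
        sc[d] = 0;
        va[d] = 0;
    }
    if (w < n || strncmp(orig, water, n) != 0)
        return 0;                       /* the word itself was changed */
    spaces = strspn(water + n, " ");
    if (n + spaces != w || spaces < 1 || spaces > 2)
        return 0;                       /* not something we produced */
    sc[DOMAIN_WHITE] = toy_get_storage(orig, DOMAIN_WHITE);
    va[DOMAIN_WHITE] = (int)(spaces - 1);
    return 1;
}
```

A value written with toy_set_value("foo", 1, DOMAIN_WHITE) is read back by
toy_get_value() as va[0] == 1 with sc[0] == 1, while an atom whose word
was replaced yields 0, i.e. nothing could be retrieved.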
Helper functions:
char* get_langdesc(void);
Return a module language description of up to approx. 25 characters.
unsigned int md5_importantstuff(void);
Generate an MD5 sum of all files this module depends on. This is
important for detecting and reporting changes in those files that could
affect functionality. Return 0 if this module does not rely on files
and other variables of this kind.
void module_help(void);
Write short module help.
void md5_wrong(void);
Write a warning message. This function is called if no matches were
found for signatures whose module MD5 shortcut matches
what md5_importantstuff() returned, but some signatures were found with
other MD5 values. This should essentially display "perhaps you used
different input data to watermark this file?"
Macros:
fatal(x...)
printf-alike macro that displays a message and terminates.
debug(x...)
printf-alike macro that should be used for all status output
(all kinds of non-fatal messages).