# Audio Watermarking
The **audiowmark** program can add watermarks to audio files and extract
previously embedded watermarks from audio material.
The usage is as follows:
```
usage: audiowmark <command> [ <args>... ]
Commands:
* create a watermarked wav file with a message
audiowmark add <input_wav> <watermarked_wav> <message_hex>
* retrieve message
audiowmark get <watermarked_wav>
* compare watermark message with expected message
audiowmark cmp <watermarked_wav> <message_hex>
* generate 128-bit watermarking key, to be used with --key option
audiowmark gen-key <key_file> [ --name <key_name> ]
Global options:
-q, --quiet disable information messages
--strict treat (minor) problems as errors
Options for get / cmp:
--detect-speed detect and correct replay speed difference
--detect-speed-patient slower, more accurate speed detection
--json <file> write JSON results into file
Options for add / get / cmp:
--key <file> load watermarking key from file
--short <bits> enable short payload mode
--strength <s> set watermark strength [10]
--input-format raw use raw stream as input
--output-format raw use raw stream as output
--format raw use raw stream as input and output
The options to set the raw stream parameters (such as --raw-rate
or --raw-channels) are documented in the README file.
HLS command help can be displayed using --help-hls
```
\pagebreak
# Audiowmark Architecture
<style> body { max-width: 50em; margin: auto; } </style>
The **audiowmark** program is used to embed (`add` command) and extract (`get` command) watermarks (messages of up to 128 bits) into/from audio files.
Internally, the program is organized as nested components; the outermost layer deals with file IO and command processing.
The commands are implemented via various components that process the watermark, the audio signal,
an optional encoding key and user-facing information.
## Adding Watermarks
The `audiowmark add <in> <out> <bits> [--key…]` command adds watermarks to audio files.
This command takes an audio file, a 128-bit hexadecimal watermark message and an optional key as input
and combines them into a newly generated WAV file. Using the same key, the watermark bits can later
be retrieved with the `audiowmark get` command without requiring access to the original
audio input (this is called blind decoding).
By using the encoding key as input, various AES-based random number streams are generated to
shuffle, interleave and mix the watermark information into the audio signal.
For robust extraction and forward error correction, the watermark is encoded via a convolutional code
with an order of `15` and a rate of `1/6` (similar to the code used for communication with the
[Mars Pathfinder](https://en.wikipedia.org/wiki/Convolutional_code#Popular_convolutional_codes)).
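The block sizes appearing in the charts below follow directly from these code parameters: with order 15, the 128 payload bits are padded with 15 termination zeros, and the rate-1/6 code emits six output bits per input bit, yielding (128 + 15) * 6 = 858 encoded bits per block. A minimal sketch of such an encoder (the six generator polynomials here are illustrative placeholders, not audiowmark's actual constants):

```python
# Sketch of a rate-1/6 convolutional encoder of order 15.
# NOTE: the generator polynomials below are illustrative placeholders,
# not the constants actually used by audiowmark.

ORDER = 15      # memory bits; requires 15 termination zeros
GENERATORS = [  # six illustrative 16-bit (order + 1 taps) polynomials
    0b1011011101111001, 0b1111001001101011, 0b1100110011010111,
    0b1010111010011111, 0b1110100110110101, 0b1001101111100011,
]

def conv_encode(payload_bits):
    """Encode payload bits, appending ORDER termination zeros."""
    bits = list(payload_bits) + [0] * ORDER
    state = 0
    out = []
    for bit in bits:
        state = ((state << 1) | bit) & ((1 << (ORDER + 1)) - 1)
        for g in GENERATORS:                     # rate 1/6: 6 outputs/bit
            out.append(bin(state & g).count('1') & 1)  # parity of taps
    return out

encoded = conv_encode([1, 0] * 64)               # a 128-bit example payload
print(len(encoded))                              # (128 + 15) * 6 = 858
```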
The expanded watermark bits are transformed into a delta spectrum at a sample rate of 44100Hz,
distributed across segments of the audio signal (each ca 23 milliseconds long) and spread
across spectral bands above 800Hz and below 5000Hz.
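These numbers can be verified with a little arithmetic: at 44100Hz, an FFT block of 1024 samples spans roughly 23 milliseconds, and each FFT bin covers 44100/1024 ≈ 43Hz. The band range of ca 861Hz…4307Hz mentioned in the charts below corresponds to FFT bins 20…100 (a rough sketch; the exact band selection is determined by the key):

```python
# Frame and band arithmetic for the 44100 Hz / 1024-sample FFT layout.
SAMPLE_RATE = 44100
FRAME_SIZE = 1024

frame_ms = FRAME_SIZE / SAMPLE_RATE * 1000
print(round(frame_ms, 1))         # 23.2 -> ca 23 ms per frame

bin_hz = SAMPLE_RATE / FRAME_SIZE # width of one FFT bin, ca 43.07 Hz

# The charts mention usable bands between ca 861 Hz and 4307 Hz,
# which correspond to FFT bins 20..100:
print(round(20 * bin_hz))         # 861
print(round(100 * bin_hz))        # 4307
```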
Based on the delta spectrum, the watermark signal is modulated and adapted to the current
segment of the input signal before the two are mixed together. To avoid clipping of the output
signal, the final output stage consists of a time-local limiter with a ca 1 second window.
<!-- TODO: describe the limiter in more detail -->
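The limiting idea can be sketched as a windowed gain reduction that keeps the mixed signal inside [-0.99…+0.99] (a minimal illustration only; audiowmark's actual limiter smooths the gain envelope over its ca 1 second detection window):

```python
# Sketch of a time-local limiter: scale each window down only where the
# mixed signal would clip. NOT audiowmark's actual implementation; the
# real limiter smooths the gain over a ca 1 second detection window.
CEILING = 0.99

def limit(samples, window=44100):
    out = []
    for start in range(0, len(samples), window):
        block = samples[start:start + window]
        peak = max(abs(s) for s in block)
        gain = CEILING / peak if peak > CEILING else 1.0
        out.extend(s * gain for s in block)
    return out

mixed = [1.2, -1.5, 0.5, 0.1]     # audio + watermark, may exceed +-1.0
limited = limit(mixed, window=4)
print(max(abs(s) for s in limited) <= CEILING)  # True
```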
\pagebreak
An outline of the component interactions to integrate the watermark information via delta
band spectrum into the audio signal is provided in the following chart.
~~~~{.graphviz prog=dot}
digraph "Audiowmark Watermark Embedding" {
graph[fontsize=13,_fontname="sans"];
node[fontsize=13,target="_top",_fontname="sans"];
edge[arrowhead=vee,arrowtail=vee,color="#00000080",_fontname="sans"];
compound=true;
concentrate=false;
rankdir="TB";
bitvec -> get_frame_mod [color=green4];
AudioInput -> Limiter [color=blue];
AudioInput -> in_resampler [color=blue];
AudioInput -> snr_signal_power [style=dashed,color=gold3];
Key -> Random [minlen=2,color=goldenrod];
subgraph cluster_wmadd {
label=< <b><font face="mono">audiowmark add <AudioInput> <AudioOutput> <Bits> [--key…] </font></b> >;
Random -> get_frame_mod [color=goldenrod];
in_resampler -> fft_analyzer [color=blue];
{ rank=same Random in_resampler }
subgraph cluster_WatermarkGen {
label=< <b>Watermark Generation</b> >; style=dashed; color=grey;
// fontsize=9; node[fontsize=5,margin=0]; edge[fontsize=5,margin=0];
fft_analyzer -> apply_frame_mod [color=blue4,xlabel=" 513 FFT Bands \r"];
get_frame_mod -> apply_frame_mod [color=red,xlabel="Up/Down-Band Modulators \r"];
apply_frame_mod -> wm_synth [color=fuchsia,xlabel="FFT Delta Bands \r"];
}
wm_synth -> out_resampler [color=fuchsia];
out_resampler -> Limiter [color=fuchsia];
out_resampler -> snr_signal_power [style=dashed,color=gold3];
}
Limiter -> AudioOutput [color=darkmagenta];
snr_signal_power -> SnrOutput [style=dashed,color=gold3];
bitvec [color=green4,margin=0,label=" Watermark Bits "];
Key [color=goldenrod,margin=0,label=" Key "];
AudioInput [color=blue,margin=0,label=" WAV/MP3 Audio Input File "];
AudioOutput [color=darkmagenta,margin=0,label=" WAV Audio Output File "];
{ rank=same bitvec Key AudioInput }
Random [color=goldenrod,shape=record,label="Random Stream \l AES128/CTR \l"];
in_resampler [color=blue,shape=record,label="Resample to 44.1kHz"];
out_resampler [color=fuchsia,shape=record,label="Resample from 44.1kHz"];
Limiter [color=darkmagenta,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Mixing ⊕ Limiting</b> <br/>
• Audio and watermark signals are added <br/>
• The result is scaled down to [-0.99…+0.99] <br/>
• Uses 1 second detection window <br/>
</TD></TR></TABLE> >];
snr_signal_power [style=filled,fillcolor="#ffffbb",color=gold3,shape=rect,label=<
<TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" ALIGN="LEFT"><TR><TD BALIGN="LEFT" CELLPADDING="5">
<b>Power Measurement</b> <br/>
• Signal Power <br/>
• Delta Power <br/>
• Ratio Calculation <br/>
</TD></TR></TABLE> >];
SnrOutput [color=gold3,margin=0,style=filled,fillcolor="#ffffbb",label=" Signal/Noise Ratio Info "];
fft_analyzer [color=blue4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Fourier Transform Analyzer</b> <br/>
• Input: Time domain samples <br/>
• FFT with block size 1024 <br/>
• Hann Window <br/>
</TD></TR></TABLE> >];
apply_frame_mod [color=fuchsia,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Band Modulation ⊗</b> <br/>
• Input bands are Up/Down modulated <br/>
• Factor amounts to ±Amplitude^1% <br/>
• Output: ± Delta bands <br/>
</TD></TR></TABLE> >];
get_frame_mod [color=red,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Modulation Frame Generator</b> <br/>
• Encodes Watermark Bits <br/>
• Pregenerate A/B-Blocks <br/>
• Yields A/B-Block frames <br/>
</TD></TR></TABLE> >];
wm_synth [color=fuchsia,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Watermark Signal Synthesis</b> <br/>
• Inverse FFT with block size 1024 <br/>
• Cosine window with overlap of 10% <br/>
• Output: Time domain samples <br/>
</TD></TR></TABLE> >];
}
~~~~
At a sample frequency of 44100Hz, the audio signal used for the watermark creation is split into
"Frames" of 1024 samples each, which corresponds to segments of ca 23 milliseconds length. These
frames are transformed from the time domain (samples) into the frequency domain (spectral bands)
and vice versa, to apply the watermark embedding in certain spectral bands.
Data and synchronization bits are encoded across several frames, with different levels of
redundancy. In the "Modulation Frame Generator", the frames that together carry all encoded
information needed to find and extract the watermark bits are combined into two types of "Blocks".
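The frame counts shown in the next chart add up as follows (a quick back-of-the-envelope check of the block sizes):

```python
# Per-block frame budget, derived from the component descriptions
# in the "Modulation Frame Generator" chart.
SAMPLE_RATE, FRAME_SIZE = 44100, 1024

sync_frames = 6 * 85            # 6 sync bits, 85 frames each
data_frames = 858 * 2           # 858 encoded bits, 2 frames each
frames_per_block = sync_frames + data_frames

print(sync_frames, data_frames, frames_per_block)    # 510 1716 2226

block_seconds = frames_per_block * FRAME_SIZE / SAMPLE_RATE
print(round(block_seconds, 1))                       # 51.7 s per block
print(round(2 * block_seconds, 1))                   # 103.4 s for A+B
```

These durations match the 52 and 104 second minimums mentioned for the "Best Block Decoder" in the extraction charts.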
\pagebreak
A detailed chart of the component interactions for the Frame and Block generation in the
"Modulation Frame Generator" is provided in the next chart.
~~~~{.graphviz prog=dot}
digraph "Modulation Frame Generator" {
graph[fontsize=13,_fontname="sans"];
node[fontsize=13,target="_top",_fontname="sans"];
edge[arrowhead=vee,arrowtail=vee,color="#00000080",_fontname="sans"];
compound=true;
concentrate=false;
rankdir="TB";
Random -> ab_generators [color=goldenrod,lhead=cluster_ModulationFrameGenerator];
//Random -> randomize_bit_order [color=goldenrod,xlabel="bit_order R5"];
//Random -> mark_sync [color=goldenrod,xlabel="sync_up_down R2"];
//Random -> mark_data [color=goldenrod,xlabel="data_up_down R1"];
//Random -> frame_pos [color=goldenrod,xlabel="frame_position R6"];
bitvec -> conv_encode [color=green4,minlen=2];
{ rank=same Random bitvec }
subgraph cluster_ModulationFrameGenerator {
label=< <b>Modulation Frame Generator</b> >; style=dashed; color=red;
ab_generators -> conv_encode [color=green4,minlen=2];
conv_encode -> randomize_bit_order [color=green4,minlen=2];
{ rank=same ab_generators conv_encode }
UpDownGen -> mark_sync [color=goldenrod];
UpDownGen -> gen_mix_entries [color=goldenrod];
frame_pos -> mark_sync [color=goldenrod,minlen=1];
// --linear: frame_pos -> mark_data;
frame_pos -> gen_mix_entries [color=goldenrod];
randomize_bit_order -> mark_data [color=goldenrod];
init_frame_mod_vec -> get_frame_mod [color=cyan4];
mark_sync -> init_frame_mod_vec [color=cyan3,minlen=1];
mark_data -> init_frame_mod_vec [color=teal];
gen_mix_entries -> mark_data [color=goldenrod];
{ rank=same mark_data mark_sync }
}
get_frame_mod -> apply_frame_mod [color=red,xlabel="Frame Up/Down-Band Modulators",minlen=2];
apply_frame_mod [shape=plain,label=" "];
bitvec [color=green4,margin=0,label=" Watermark Bits "];
Random [color=goldenrod,shape=record,label="
Random Stream \l
AES128/CTR \l
Streams R1…R6 \l
"];
conv_encode [color=green4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Convolutional Code Expansion</b> <br/>
• Pads watermark with termination zeros <br/>
• Combines bit stream with A/B constants <br/>
• Generates output stream of encoded bits <br/>
• Generates 858 encoded bits A-Block <br/>
• Generates 858 encoded bits B-Block <br/>
</TD></TR></TABLE>>];
ab_generators [color=green4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Convolutional Code Parameters</b> <br/>
• Convolutional code with rate 1/6 <br/>
• Order 15, needs 15 termination bits <br/>
• Six constants for A-Block and B-Block <br/>
• Forward correction of ca 20% bit errors <br/>
• Encodes 128 bits in 858 bit blocks <br/>
</TD></TR></TABLE>>];
UpDownGen [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Up/Down-Band Generator</b> <br/>
• Uses per-frame shuffling seed <br/>
• Picks random bands, 30 UP, 30 DOWN <br/>
• Bands are between ca 861Hz…4307Hz <br/>
</TD></TR></TABLE>>];
mark_sync [color=cyan3,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Synchronization Frame Generator</b> <br/>
• Encodes 6 sync bits in 6 * 85 frames <br/>
• A-Block bit pattern: 010101 <br/>
• B-Block bit pattern: 101010 <br/>
• Randomizes Up/Down-Band shifts [R2] <br/>
• Output: 510 Frames * 60 Up/Down-Bands <br/>
</TD></TR></TABLE>>];
gen_mix_entries [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Mix Entry Generator (skipped for --linear)</b> <br/>
• Generates list of data bit encoding bands <br/>
• Uses 30 up + 30 down bands in 2 frames per bit <br/>
• Randomizes Up/Down-Band shifts [R1] <br/>
• Shuffles data bit association of entries [R4] <br/>
• Output: 2 * 858 * 30 Up/Down band pairs <br/>
</TD></TR></TABLE>>];
mark_data [color=teal,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Data Frame Generator</b> <br/>
• Encodes 858 data bits in 858 * 2 frames <br/>
• Encodes A-Blocks, B-Blocks in turn <br/>
• Omits Mix Entry Generator with --linear <br/>
• Randomizes Up/Down-Band shifts [R1] <br/>
• Output: 1716 Frames * 60 Up/Down-Bands <br/>
</TD></TR></TABLE>>];
frame_pos [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Frame Position Randomization</b> <br/>
• Mixes sync + data frames <br/>
• Shuffles frame positions <br/>
• Uses random stream [R6] <br/>
</TD></TR></TABLE>>];
randomize_bit_order [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Randomize Bit Order for ENCODE</b> <br/>
• Reversible shuffle for encode/decode <br/>
• Shuffles/interleaves bit stream [R5] <br/>
• Interleaving improves robustness <br/>
• Reduces bit stream damage impact <br/>
</TD></TR></TABLE>>];
init_frame_mod_vec [color=cyan4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>A/B-Block Frame Modulator Composition</b> <br/>
• Interleaves synchronization and data frames <br/>
• Pulls and interleaves each block type separately <br/>
• Output: Up/down band modulators for 1 block <br/>
</TD></TR></TABLE>>];
get_frame_mod [color=red,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Modulation Frame Selector</b> <br/>
• Yields A-Block band modulators per frame <br/>
• Yields B-Block band modulators and starts over <br/>
• Output: Up/down band modulators for 1 frame <br/>
</TD></TR></TABLE>>];
}
~~~~
The watermark is encoded and embedded into the audio signal in two block types,
A-Blocks and B-Blocks. The information contained in each block alone is
usually sufficient to extract the watermark. However, in case of very
distorted and noisy transmissions where watermark extraction from either block
type fails, a combination of segments with A-Block and B-Block data may still
lead to successful recovery of the original watermark.
In order to support watermark extraction from clipped excerpts of the input
stream, a fixed pattern of synchronization bits is integrated into the data
blocks with much higher redundancy than the data bits. This fixed pattern
allows detecting the locations of A-Blocks and B-Blocks, which aids the
watermark extraction.
The user-provided encoding `Key` seeds an AES-based pseudo random number generator in
[Counter Mode](https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#CTR)
that is used to determine encoding places, randomize the noise introduced by
the watermark and interleave the encoding for robustness. Without the key, the
watermark information cannot be retrieved. Using a key is important because the
implementation itself is open source, and being able to read the watermark
message bits would allow an attacker to remove the watermark without degrading
the audio quality.
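The principle can be sketched with a key-seeded deterministic stream; here SHA-256 in counter mode serves as a simple stand-in for the AES-128/CTR stream that audiowmark actually uses:

```python
# Key-seeded deterministic random stream: same key -> same stream.
# SHA-256 in counter mode is used here only as a stand-in for the
# AES-128/CTR construction that audiowmark actually uses.
import hashlib

def key_stream(key: bytes, stream_id: int, nbytes: int) -> bytes:
    out = bytearray()
    counter = 0
    while len(out) < nbytes:
        block = hashlib.sha256(key + stream_id.to_bytes(1, 'big')
                               + counter.to_bytes(8, 'big')).digest()
        out.extend(block)
        counter += 1
    return bytes(out[:nbytes])

key = b'sixteen byte key'              # a 128-bit key, as with gen-key
r1 = key_stream(key, 1, 32)            # e.g. stream "R1"
assert r1 == key_stream(key, 1, 32)    # reproducible with the same key
assert r1 != key_stream(key, 2, 32)    # distinct per stream id
```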
The different types of random data streams used for the distribution of the embedded watermark information are as follows:
* R1 - Used to randomize Up/Down band shifts for watermark data bits.
* R2 - Used to randomize Up/Down band shifts for watermark synchronization bits.
* R3 - Currently unused.
* R4 - Used to mix (shuffle) data bit associations of Up/Down bands distributed across several frames.
* R5 - Used to shuffle (interleave) the bit stream. Due to redundancy in the generated bit stream, interleaving reduces the
number of adversely affected bits by bursts (holes) in transmission loss.
* R6 - Used to randomize and mix data frames with synchronization frames; this makes synchronization frames unlikely to be detectable without the encoding key.
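The reversible shuffle used for R5 can be sketched as a key-derived permutation that the decoder inverts (a minimal illustration; audiowmark derives the permutation from its AES-based stream, while Python's `random` module stands in here):

```python
# Reversible bit interleaving: a seeded permutation applied at encode
# time and inverted at decode time. Python's random module stands in
# for the AES-based stream R5 used by audiowmark.
import random

def make_permutation(seed, n):
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm

def interleave(bits, perm):
    return [bits[p] for p in perm]

def deinterleave(bits, perm):
    out = [0] * len(bits)
    for i, p in enumerate(perm):
        out[p] = bits[i]
    return out

bits = [1, 1, 0, 1, 0, 0, 1, 0] * 4
perm = make_permutation(seed=42, n=len(bits))
# A burst error in the interleaved stream lands on scattered positions
# of the original stream, which the convolutional code can correct.
assert deinterleave(interleave(bits, perm), perm) == bits
```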
## Extracting Watermarks
The `audiowmark get <watermarked_wav> [--key…]` command extracts a watermark from an audio file.
This command takes an audio file and an optional key as input.
With the same key that was used during watermark embedding, the synchronization bits are determined and searched
for in the audio content.
If synchronization bit matches are detected, the encoded watermark information can be located,
extracted and decoded with error correction.
The retrieval does not require access to the original audio input (this is called blind decoding).
The detection results are printed on *stdout*, with accompanying information about the location,
match quality and a measure of likely decoding errors.
An outline of the component interactions to locate and extract the watermark information from the
frequency spectrum in the audio signal is provided in the following charts.
~~~~{.graphviz prog=dot}
digraph "Audiowmark Watermark Extraction" {
newrank=true;
graph[fontsize=13,_fontname="sans"];
node[fontsize=13,target="_top",_fontname="sans"];
edge[arrowhead=vee,arrowtail=vee,color="#00000080",_fontname="sans"];
compound=true;
concentrate=false;
rankdir="TB";
WavData -> in_resampler [color=darkmagenta];
Key -> Random [minlen=2,color=goldenrod];
subgraph cluster_wmget {
label=< <b><font face="mono">audiowmark get <AudioFile> [--key…]
</font></b> >;
{ rank=same Random in_resampler }
in_resampler -> fft_range [color=darkmagenta];
Random -> conv_decode_soft [minlen=2,color=goldenrod,
lhead=cluster_WatermarkExtraction]; // fake arrow target for cluster alignment
subgraph cluster_WatermarkExtraction {
label=< <b>Watermark Extraction</b> >; style=dashed; color=green4;
BlockDecoder -> ClipDecoder [color=darkorchid3];
fft_range -> BlockDecoder [color=blue4,constraint=1];
{ rank=same fft_range conv_decode_soft }
conv_decode_soft -> BlockDecoder [color=green4,dir=both];
conv_decode_soft -> ClipDecoder [color=green4,dir=both];
}
BlockDecoder -> SyncFinder [color=cyan4,minlen=4,dir=both,xlabel="Mode::BLOCK"];
ClipDecoder -> SyncFinder [color=cyan4,minlen=1,dir=both,xlabel="Mode::CLIP",constraint=false];
subgraph cluster_SyncFinder {
label=""; style=dashed; color=cyan4; node[margin=0]; edge[margin=0]; margin=0;
SyncFinder [shape=plaintext,
label=< <table border="0"><tr><td BALIGN="LEFT">
<b>Synchronization Position Finder </b> <br/>
• Performs coarse search for synchronization <br/>
bit markers in the frequency spectrum <br/>
• Searches for A/B-Block synchronizations <br/>
• Calculates score for possible locations <br/>
and picks the 5 best matches <br/>
• Refines the exact block locations with a fine <br/>
grained search for synchronization markers <br/>
• Short audio segments are dealt with by <br/>
adding zero padding (in Mode::CLIP) <br/>
</td></tr></table> >];
}
ClipDecoder -> result_set_print [color=darkorchid4];
{ rank=same SyncFinder result_set_print }
}
result_set_print -> bitvec [color=green4];
Key [color=goldenrod,margin=0,label=" Key "];
WavData [color=darkmagenta,margin=0,label=" WAV/MP3 Audio Input File "];
{ rank=same Key WavData }
in_resampler [color=darkmagenta,shape=record,label="Resample to 44.1kHz"];
Random [color=goldenrod,shape=record,label="Random Stream \l AES128/CTR \l"];
fft_range [color=blue4,shape=rect,label=< <table border="0" align="left"><tr><td balign="left">
<b>Fourier Transform Analyzer</b> <br/>
• Input: Frames with time domain samples <br/>
• Short inputs are zero-padded as needed (Mode::CLIP) <br/>
• FFT with block size 1024 <br/>
• Hann Window <br/>
• Output: 513 FFT Bands <br/>
</td></tr></table> >];
BlockDecoder [color=darkorchid3,shape=rect,label=< <table border="0" align="left"><tr><td balign="left">
<b>Best Block Decoder</b> <br/>
• Detect A/B-Blocks via 'Synchronization Position Finder' in Mode::BLOCK <br/>
• Only works for audio clips with at least 52 seconds and proper block <br/>
alignment (or up from 104 seconds without alignment) <!-- 1024 * (510 + 1716) / 44100 --> <br/>
• Reconstruct Up/Down-Band associations (reverses 'Mix Entry Generator') [R4] <br/>
• Estimate bit vectors resulting from average deviations in <br/>
randomized Up/Down-Band shifts [R1] <!-- mix_decode --> <br/>
• Unshuffle bit order (reverses 'Randomize Bit Order for ENCODE') [R5]<br/>
• Normalize soft bits (normalization of estimated bit vectors) <br/>
• Reconstruct watermark bits using 'Soft-Decision Decoder' <br/>
• Decode individual A-Blocks, B-Blocks and if present AB-Blocks <br/>
• As last resort, attempt an AB-Block decode on accumulated bit vectors <br/>
averaged over all selected blocks
</td></tr></table> >];
conv_decode_soft [color=green4,shape=rect,label=< <table border="0" align="left"><tr><td balign="left">
<b>Soft-Decision Decoder</b> <br/>
• Utilize Viterbi algorithm <br/>
• Decode blocks of 858 encoded bits <br/>
• Reconstructs 128 payload bits <br/>
• Uses 'Convolutional Code Parameters' <br/>
• Decode A-, B- and AB-Blocks <br/>
</td></tr></table> >];
ClipDecoder [color=darkorchid4,shape=rect,label=< <table border="0" align="left"><tr><td balign="left">
<b>Short Audio Clip Decoder</b> <br/>
• Used only for small audio clips up to 160 seconds (3.1 blocks) <!-- 1024 * (510 + 1716) / 44100 * 3.1 --> <br/>
• Uses zero padding (silence) around short audio clips to construct 3 blocks <br/>
• Zero padded regions are ignored during synchronization detection and scoring <br/>
• Determine alignment with 'Synchronization Position Finder' in Mode::CLIP <br/>
• Select the five blocks with the best detected synchronization markers <br/>
• Reconstruct Up/Down-Band associations (reverses 'Mix Entry Generator') [R4] <br/>
• Estimate bit vectors resulting from average deviations in <br/>
randomized Up/Down-Band shifts [R1] <!-- mix_decode --> <br/>
• Unshuffle bit order (reverses 'Randomize Bit Order for ENCODE') [R5]<br/>
• Normalize soft bits (normalization of estimated bit vectors) <br/>
• Reconstruct watermark bits using 'Soft-Decision Decoder' <br/>
• Attempt AB-Block decoding at alignment position <br/>
• Attempt secondary AB-Block decoding in case of audio data surplus <br/>
• In practice, ca 10 seconds are needed for reliable detection, <br/>
in good scenarios up to 3 seconds may suffice <br/>
</td></tr></table> >];
result_set_print [style=filled,fillcolor="#eeffdd",color=green4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Result Set Printing</b> <br/>
• Sort results by block type and score <br/>
• Print potential watermarks on stdout <br/>
• Provide time offset, score quality, <br/>
block type and a measure for likely <br/>
decoding errors per watermark <br/>
</TD></TR></TABLE> >];
bitvec [style=filled,fillcolor="#eeffdd",color=green4,margin=0,label=" Watermark Bits "];
}
~~~~
At a sample frequency of 44100Hz, a spectral analysis is performed on the audio signal, and the spectrum is then searched for known synchronization markers.
Upon detection of A/B-Block synchronization positions, watermark bits are extracted from the known data bit locations, making use of the embedded redundancy to render the detection more robust.
Due to the high redundancy and wide spread of the watermark information, bits can often still be extracted from heavily shortened audio clips. To apply the full detection machinery to very short clips, symmetric zero padding is used to provide enough input samples (zero-padded regions are ignored during scoring, however).
Since detection success depends directly on precise bit stream synchronization, an iterative process first approximates the synchronization locations quickly and later refines them to yield precise results.
The purpose of the synchronization algorithm is to find the location of the
watermark A/B blocks in the input signal. This is important because the
signal may have been cropped, so that the location of the blocks is not
known. To make the blocks findable, sync bits are embedded into each block
with relatively high redundancy while the watermark is added.
The values of these sync bits are known: for an A block they are 010101, for a
B block they are 101010. The up- and down-bands used for the sync bits and the
offsets of all frames that belong to sync bits inside the A/B block are known
and determined by the key.
To perform the actual synchronization and locate the start of an A (or B) block,
two steps are performed.
* As a first step, the synchronization algorithm tests all possible positions
for the start of an A (or B) block using a step size of 256 (1/4 frame size)
and tries to decode the sync bits at the expected locations relative to the
start of the block. Since the values and locations of the bits are known, a sync
score can be computed that indicates how good the bits in the actual audio
input at this position match the expected bit sequence.
* For all start locations with a significantly high sync score, in a second
step the actual start position is searched for by trying all start
locations near the original match with a smaller, finer step size. Again,
a sync score can be computed and compared to a second threshold to decide
whether this location is really likely to contain a data block. If the match
is good enough, the start location is used to decode the data bits in the
block.
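The two steps above can be sketched as a coarse scan followed by a local refinement around each promising hit (a simplified model; `sync_score` is a hypothetical stand-in for actually decoding the sync bits at a candidate position):

```python
# Two-stage synchronization search: coarse scan with step 256 (1/4 of a
# 1024-sample frame), then refinement in steps of 8 within +-16 fine
# steps of each promising hit. sync_score() is a stand-in for decoding
# the sync bits at a candidate block start position.

def coarse_then_fine(sync_score, length, coarse=256, fine=8,
                     threshold=0.5):
    hits = [pos for pos in range(0, length, coarse)
            if sync_score(pos) > threshold]
    refined = []
    for pos in hits:
        span = 16 * fine              # +-16 fine steps around the hit
        best = max(range(max(0, pos - span), pos + span + 1, fine),
                   key=sync_score)
        refined.append(best)
    return refined

# Toy score function with a single true sync position at sample 3000:
true_pos = 3000
score = lambda p: max(0.0, 1.0 - abs(p - true_pos) / 512)
assert true_pos in coarse_then_fine(score, 10000)
```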
Besides using this strategy to find "whole" data blocks, there is also a
variant of the synchronization algorithm that is used if the audio signal
is very short. It can find the location of the watermark even if the length
of the input signal is too short to contain a complete data block. To be
able to do this, the input signal is zero padded before sync detection and
then the usual algorithm to find whole blocks is used.
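The padding step can be sketched as follows (a simplified illustration; the real `ClipDecoder` pads to roughly three block lengths and excludes the padded regions from synchronization scoring):

```python
# Symmetric zero padding of a short clip so that the whole-block search
# machinery can run. The numbers follow the charts: one block is
# 510 + 1716 = 2226 frames of 1024 samples; the clip decoder constructs
# roughly 3 blocks worth of input.
FRAME_SIZE, FRAMES_PER_BLOCK, BLOCKS = 1024, 510 + 1716, 3

def zero_pad_clip(samples):
    target = FRAME_SIZE * FRAMES_PER_BLOCK * BLOCKS
    pad = max(0, target - len(samples))
    left, right = pad // 2, pad - pad // 2
    padded = [0.0] * left + list(samples) + [0.0] * right
    # Remember the clip's position so that padded regions can be
    # ignored during synchronization detection and scoring.
    return padded, (left, left + len(samples))

clip = [0.1] * (10 * 44100)            # a 10 second clip
padded, (begin, end) = zero_pad_clip(clip)
assert len(padded) == FRAME_SIZE * FRAMES_PER_BLOCK * BLOCKS
assert padded[begin:end] == clip
```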
The following chart provides the detail of the steps involved in determining the synchronization locations.
~~~~{.graphviz prog=dot}
digraph "Synchronization Position Finder" {
graph[fontsize=13,_fontname="sans"];
node[fontsize=13,target="_top",_fontname="sans"];
edge[arrowhead=vee,arrowtail=vee,color="#00000080",_fontname="sans"];
compound=true;
concentrate=false;
rankdir="TB";
in_resampler;
Random;
{ rank=same in_resampler Random }
Random -> frame_pos [color=goldenrod,lhead=cluster_SyncFinder,minlen=2];
subgraph cluster_SyncFinder {
label=< <b>Synchronization Position Finder</b> >; style=dashed; color=cyan4;
in_resampler -> fft_analyzer [color=darkmagenta];
fft_analyzer -> sync_fft_256_8 [color=blue4,dir=both];
frame_pos -> init_up_down [color=goldenrod];
UpDownGen -> init_up_down [color=goldenrod];
init_up_down -> search_approx [color=cyan3];
search_approx -> sync_select_by_threshold [color=cyan3];
sync_select_by_threshold -> search_refine [color=cyan3];
sync_fft_256_8 -> search_refine [style=dashed,dir=back,color=darkcyan,xlabel=" Refining\l Feedback\l "];
sync_fft_256_8 -> sync_decode [color=darkcyan,style=dashed,label="Repeat\lRefined\l "];
sync_decode -> search_refine [color=darkcyan,style=dashed,label="Repeat\lRefined\l "];
sync_fft_256_8 -> sync_decode [color=cyan3];
sync_decode -> search_approx [color=cyan3];
{ rank=same fft_analyzer init_up_down }
}
search_refine -> sync_scores [color=cyan4,xlabel="Scoring and A/B-Type for potential blocks",minlen=2];
sync_scores [shape=plain,label=" "];
in_resampler [color=darkmagenta,shape=record,label="
WAV/MP3 Audio Input File \l
\l
Resampled to 44.1kHz \l
"];
Random [color=goldenrod,shape=record,label="
Random Stream \l
AES128/CTR \l
Streams R1…R6 \l
"];
{ rank=same in_resampler Random }
frame_pos [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Frame Position Randomization</b> <br/>
• Mixes sync + data frames <br/>
• Shuffles frame positions <br/>
• Uses random stream [R6] <br/>
</TD></TR></TABLE> >];
UpDownGen [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Up/Down-Band Generator</b> <br/>
• Uses per-frame shuffling seed <br/>
• Picks random bands, 30 UP, 30 DOWN <br/>
• Bands are between ca 861Hz…4307Hz <br/>
</TD></TR></TABLE> >];
init_up_down [color=cyan3,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Synchronization Bit Frame Generator</b> <br />
<b>(Mode::BLOCK & Mode::CLIP)</b> <br/>
• Generates 6 sync bits in 6 * 85 frames (* 2 for Mode::CLIP) <br/>
• Randomizes Up/Down-Band shifts [R2] <br/>
• Sorts synchronization bit frames by frame index <br/>
• Mode::BLOCK Output: 510 Bit Frames with 60 Up & 60 Down-Bands <br/>
• Mode::CLIP Output: 1020 Bit Frames with 60 Up & 60 Down-Bands <br/>
</TD></TR></TABLE> >];
fft_analyzer [color=blue4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Fourier Transform Analyzer</b> <br/>
• Input: Time domain samples <br/>
• FFT with block size 1024 <br/>
• Hann Window <br/>
• Output: 513 FFT Bands <br/>
</TD></TR></TABLE> >];
sync_fft_256_8 [color=blue4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Decibel Quantifier</b> <br/>
• Uses coarse stepping of 256 values for approximate search <br/>
• A stepping of 256 equates 1/4th FFT block <br/>
• Uses fine stepping of 8 values for refined search <br/>
• Pulls FFT Bands for all input blocks <br/>
• Computes dB for all bands of all blocks <br/>
</TD></TR></TABLE> >];
sync_decode [color=cyan3,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Synchronization Bit Matching</b> <br/>
• Collect Up/Down-Band magnitudes for sync bits <br/>
• Determine match quality for alternating bit patterns <br/>
• Apply watermark strength dependent thresholds <br/>
• Decide A/B-Block based on 010101 / 101010 detection <br/>
</TD></TR></TABLE> >];
search_approx [color=cyan3,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Approximate Synchronization Frame Search</b> <br/>
• Skips over zero-padded samples (Mode::CLIP) <br/>
• Computes multiple time-shifted FFT vectors <br/>
• Uses coarse subframe stepping of 256 values <br/>
• Overlaps frames for sync detection by 1/4th frame <br/>
• Scores positions for synchronization matches <br/>
</TD></TR></TABLE> >];
sync_select_by_threshold [color=cyan3,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Synchronization Frame Selection</b> <br/>
• Due to the subframe stepping, good and bad matches can be expected to alternate <br/>
• Identification of local match maxima <br/>
• Strength dependent threshold determines minimum match quality <br/>
• Selection of likely match positions via maxima and threshold <br/>
</TD></TR></TABLE> >];
search_refine [color=darkcyan,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Refined Synchronization Frame Search</b> <br/>
• Computes fine-stepped time-shifted FFT vectors around selected frames <br/>
• Searches ±16 subframes around previously detected good scores <br/>
• Keeps subframe if the score (synchronization frame detection quality) improves <br/>
</TD></TR></TABLE> >];
}
~~~~
The user-provided 128-bit AES key is essential to determine spectral bands, encoding patterns, and bit locations.
During decoding, the same pseudo random number generator sequences R1…R6 are used that facilitated watermark embedding.
Because the same AES key drives a cryptographically secure PRNG, the sequences are uniformly distributed and deterministically reproducible, but cannot be extrapolated.
This prevents watermark extraction or modification by anyone who does not possess the exact encoding key.

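The idea of deriving several independent, reproducible random streams from one secret key can be illustrated with a small sketch. This is not the audiowmark implementation: HMAC-SHA256 in counter mode stands in for the AES-based generator, and the stream indices and value ranges are hypothetical.

```python
import hashlib
import hmac
import struct

def prng_stream(key: bytes, stream_id: int):
    """Endless deterministic byte stream derived from the secret key.
    HMAC-SHA256 in counter mode stands in for the AES-based CSPRNG."""
    counter = 0
    while True:
        block = hmac.new(key, struct.pack(">II", stream_id, counter),
                         hashlib.sha256).digest()
        yield from block
        counter += 1

def rand_values(stream, count: int, upper: int):
    """Draw `count` values in [0, upper) (small modulo bias ignored here)."""
    return [int.from_bytes(bytes(next(stream) for _ in range(4)), "big") % upper
            for _ in range(count)]

key = bytes(16)              # placeholder for the 128-bit AES key
r2 = prng_stream(key, 2)     # hypothetical stand-in for the [R2] stream
band_picks = rand_values(r2, 60, 512)
```

The same key and stream index always reproduce the same values, while different stream indices yield unrelated sequences, which is the property the decoder relies on.
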
## The Patchwork Algorithm

To store a single bit inside a spectrum, **audiowmark** uses the patchwork
algorithm. From the frequency bands of the spectrum (obtained by computing the
FFT of one frame), two groups are chosen within the frequency range of the
watermark using the pseudo random number generator. These are called up- and
down-bands. In the example above, the up-bands are red and the down-bands are
green. Typically there are 30 up- and 30 down-bands; the other bands do not
carry information.

To embed a single bit, the following changes are made to the spectrum:

* to **store a 1 bit**, the magnitude of each up-band is increased by a small
amount, and the magnitude of each down-band is decreased by a small amount
(this is shown by the small arrows in the example image)
* to **store a 0 bit**, the magnitude of each up-band is decreased, and the
magnitude of each down-band is increased (the opposite of the small arrows in
the example image)

Since the up- and down-bands were chosen pseudo-randomly from the spectrum, we
can expect that summing all values of the up-bands and summing all values of
the down-bands **before** embedding the bit yields similar results (because
both groups sample the same distribution of spectrum bins). However, since all
elements of the up-bands were increased and all elements of the down-bands were
decreased **when embedding a 1 bit**, the sum of the up-bands should afterwards
be **greater than** the sum of the down-bands.

So to decode the bit from the spectrum, we can simply use the following rule:

* **decode as 1 bit** if the sum of the up-bands is greater than the sum of
the down-bands
* **decode as 0 bit** if the sum of the up-bands is smaller than the sum of
the down-bands

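The embed/decode rule can be sketched as follows. This is an illustrative toy, not the audiowmark code: the flat spectrum, the band range and the 1% delta are made up, and Python's `random` module stands in for the keyed PRNG.

```python
import random

def pick_bands(rng, n_bands=30, lo=20, hi=512):
    """Pseudo-randomly choose disjoint up- and down-bands."""
    picks = rng.sample(range(lo, hi), 2 * n_bands)
    return picks[:n_bands], picks[n_bands:]

def embed_bit(spectrum, up, down, bit, delta=0.01):
    """Raise up-bands and lower down-bands for a 1 bit; the opposite for 0."""
    sign = 1.0 if bit else -1.0
    out = list(spectrum)
    for b in up:
        out[b] *= 1.0 + sign * delta
    for b in down:
        out[b] *= 1.0 - sign * delta
    return out

def decode_bit(spectrum, up, down):
    """Decode 1 if the up-band sum exceeds the down-band sum, else 0."""
    up_sum = sum(spectrum[b] for b in up)
    down_sum = sum(spectrum[b] for b in down)
    return 1 if up_sum > down_sum else 0

rng = random.Random(1234)      # stand-in for the keyed PRNG
up, down = pick_bands(rng)
spectrum = [1.0] * 513         # toy flat magnitude spectrum
```

With a flat spectrum both group sums start out equal, so the decoded bit always matches the embedded one; with real audio the frame-level decision can be wrong, which is exactly why the error correction and redundancy described below are needed.
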
In the actual implementation, increasing/decreasing the magnitudes of the
up-/down-bands is done by generating a watermark signal that contains only the
changes, with the right magnitude and phase for each frame. So we compute a
delta spectrum, which is passed to the IFFT, windowed and then added to the
original audio, so that the sum has the desired modified spectrum magnitude.

The detection is performed on the dB values of the spectrum magnitudes
obtained from the FFT, so the sums of the dB values of the up-/down-bands are
computed and compared to decide whether a 0 bit or a 1 bit was received.

The patchwork algorithm does not guarantee that encoding/decoding will always
yield the right result at the lowest level of embedding/decoding a single bit
(the difference between the up- and down-band sums can already be too big
before embedding, due to the original signal). However, error correction and
redundancy achieved by embedding each bit in more than one frame make the
whole process reliable at a higher level.

There are three improvements over the basic patchwork algorithm described
above, which make the watermark detection more accurate:

* Soft decoding is used for the convolutional decoder: instead of deciding
whether a 0 or 1 bit was received by comparing the two sums directly before
decoding the convolutional code to obtain the message bits, the difference
between the two sums is normalized and used as a soft-bit input for the
Viterbi algorithm.
* Instead of storing one data bit in each frame spectrum, a data bit uses up-
and down-bands from different frames. This is called mix-encoding, and it
spreads the information of each data bit over many frames.
* As described above, the original signal can have a negative effect on the
performance of the decoder, since the sums of the up-bands and the down-bands
differ even before embedding the bits. To make detection more reliable, the
original signal level for each bin is estimated as the average of the previous
and next spectrum, and subtracted before computing the sums of the up- and
down-bands.

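The third improvement can be sketched as a soft-bit computation. The shapes are hypothetical: it assumes a list of per-frame magnitude spectra (in dB) and precomputed up-/down-band indices.

```python
def soft_bit(spectra, frame, up, down):
    """Soft-decision value for one bit: sum the up/down differences after
    subtracting, per bin, the original signal level estimated as the
    average of the previous and next frame's spectrum."""
    value = 0.0
    for b in up:
        estimate = 0.5 * (spectra[frame - 1][b] + spectra[frame + 1][b])
        value += spectra[frame][b] - estimate
    for b in down:
        estimate = 0.5 * (spectra[frame - 1][b] + spectra[frame + 1][b])
        value -= spectra[frame][b] - estimate
    return value  # normalized later and fed to the Viterbi decoder
```

A positive value suggests a 1 bit, a negative value a 0 bit, and values near zero carry low confidence, which is exactly the information a soft-decision Viterbi decoder can exploit.
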
## Mixing with Limiter

The input material for **audiowmark** is normalized (all samples are in the
range from -1 to 1). If the watermark were simply added to the input, the sum
could exceed the range from -1 to 1, which would result in clipping. To avoid
this, a limiter is used during mixing.

The limiter computes the highest peak for each one-second block. Then a linear
volume envelope is constructed connecting the blocks, such that the envelope
is greater than or equal to the peak height in each block. A typical value for
really high peaks is about 1.04 at the default watermarking strength of 10.

To avoid clipping, the signal is divided by the slowly changing volume
envelope. The result is somewhat similar to a lookahead peak limiter with an
attack of one second and a linear release of one second. To describe the
effect more directly: if a single peak of 1.04 was produced in the watermarked
signal, the limiter would slowly decrease the volume to 1/1.04 over the second
before the block that contains the peak, stay there for a while (due to the
use of blocks) and afterwards slowly increase the volume again over the
following second.
By using a limiter that works on one-second blocks like this, it is possible
to seek to any point in the watermark (which is required for streaming via
HLS) and get exactly the same output that watermarking all previous samples
would have produced, because the output of the limiter depends only on the
current, previous and next one-second block. So only a small context window
needs to be processed when seeking.

## Speed Detection

As one of the later developments, a dedicated speed detection facility has
been integrated, which makes it possible to extract watermarks from audio
segments played back at an unknown rate. In scenarios where audio has been
resampled and pitched at a constant rate, the synchronization markers may
still be detectable by searching the audio content at varying resampled
playback rates. The `audiowmark --detect-speed` command line option attempts
to detect playback rate changes in the range of 80% to 125% compared to the
original material used for embedding.

Detection of playback rate modifications is approached in several steps.
First, the detector picks two short audio clips (ca 25 seconds) with high
signal energy and performs multiple coarse scans, detecting rate modulations
of less than 0.2%. This rough speed estimate is then improved by secondary
scans in steps of about 1/20‰ on a 50-second audio clip. Executing coarse and
fine detection runs at varying resampled playback rates with multiple
refinement steps consumes a lot of processing resources. To speed up the
detection, resampling and scanning runs are carried out on a downsampled
version of the audio material (by a factor of 2), and detection runs are
parallelized across all available CPU cores. Finally, watermark extraction is
carried out on a version of the audio material resampled at the most likely
detected playback rate, in addition to regular watermark detection, because
the detected playback rate may have been guessed wrongly.
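The coarse-to-fine search can be sketched generically. Here `score` is a hypothetical callback that resamples the clip at the given rate and returns a synchronization match quality; the step sizes merely echo the figures above and are not taken from the implementation.

```python
def detect_speed(score, lo=0.80, hi=1.25,
                 coarse_step=0.002, fine_step=0.00005, fine_range=20):
    """Coarse scan over the whole rate range, then a fine scan
    around the best coarse candidate."""
    n = int(round((hi - lo) / coarse_step)) + 1
    coarse = [lo + i * coarse_step for i in range(n)]
    best = max(coarse, key=score)
    fine = [best + i * fine_step for i in range(-fine_range, fine_range + 1)]
    return max(fine, key=score)
```

Since `score` is by far the expensive part (each call implies a resampling pass plus a sync scan), the real detector evaluates candidates on downsampled audio and spreads the calls across CPU cores.
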
<!--
# https://graphviz.org/documentation/
BUILD:
apt install -y python3-pygraphviz python3-pandocfilters
pandoc -F graphviz.py audiowmark.md -o audiowmark.html
pandoc -F graphviz.py audiowmark.md -V papersize:a4 -V geometry:margin=2cm -o audiowmark.pdf
-->