File: audiowmark.md

# Audio Watermarking

The **audiowmark** program can add watermarks to audio files and extract
previously embedded watermarks from audio material.
The usage is as follows:


```
usage: audiowmark <command> [ <args>... ]

Commands:
  * create a watermarked wav file with a message
    audiowmark add <input_wav> <watermarked_wav> <message_hex>

  * retrieve message
    audiowmark get <watermarked_wav>

  * compare watermark message with expected message
    audiowmark cmp <watermarked_wav> <message_hex>

  * generate 128-bit watermarking key, to be used with --key option
    audiowmark gen-key <key_file> [ --name <key_name> ]

Global options:
  -q, --quiet             disable information messages
  --strict                treat (minor) problems as errors

Options for get / cmp:
  --detect-speed          detect and correct replay speed difference
  --detect-speed-patient  slower, more accurate speed detection
  --json <file>           write JSON results into file

Options for add / get / cmp:
  --key <file>            load watermarking key from file
  --short <bits>          enable short payload mode
  --strength <s>          set watermark strength              [10]

  --input-format raw      use raw stream as input
  --output-format raw     use raw stream as output
  --format raw            use raw stream as input and output

The options to set the raw stream parameters (such as --raw-rate
or --raw-channels) are documented in the README file.

HLS command help can be displayed using --help-hls
```

\pagebreak

# Audiowmark Architecture
<style>	body { max-width: 50em; margin: auto; }	</style>

The **audiowmark** program is used to integrate (`add` command) and extract (`get` command) watermarks (messages of up to 128 bits) into/from audio files.

Internally, the program is organized as nested components; the outermost deals with file IO and command processing.
The commands are implemented via various components that process the watermark, the audio signal,
an optional encoding key and user-facing information.

## Adding Watermarks

The `audiowmark add <in> <out> <bits> [--key…]` command adds a watermark to an audio file.
It takes an audio file, a 128-bit hexadecimal watermark and an optional key as input
and combines these into a newly generated WAV file. Using the same key, the watermark bits can later
be retrieved with the `audiowmark get` command without requiring access to the original
audio input (this is called blind decoding).

By using the encoding key as input, various AES based random number streams are generated to
shuffle, interleave and mix the watermark information into the audio signal.

For robust extraction and forward error correction, the watermark is encoded via convolutional codes
with an order of `15` and a rate of `1/6` (similar to the communication of the
[Mars Pathfinder](https://en.wikipedia.org/wiki/Convolutional_code#Popular_convolutional_codes)).
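
As a rough illustration of the arithmetic: a rate `1/6` code emits six output bits per input bit, and flushing a shift register of order `15` takes 15 termination zeros, so a 128-bit message expands to (128 + 15) * 6 = 858 encoded bits. The sketch below assumes a plain shift-register encoder; the generator polynomials are made up for illustration and are not the constants used by audiowmark, and all function names are hypothetical.

```python
ORDER = 15  # memory of the shift register; 15 termination zeros flush it

# Hypothetical 16-bit generator polynomials, NOT the constants used by audiowmark.
GENERATORS = [0o177777, 0o152525, 0o133333, 0o116347, 0o171717, 0o100001]


def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def conv_encode(bits):
    """Encode a list of 0/1 bits with a rate 1/len(GENERATORS) convolutional code."""
    out = []
    reg = 0
    for bit in list(bits) + [0] * ORDER:    # append termination zeros
        reg = (reg >> 1) | (bit << ORDER)   # shift the new bit into the register
        for poly in GENERATORS:
            out.append(parity(reg & poly))  # one output bit per generator
    return out


message = [1, 0] * 64                       # any 128-bit payload
encoded = conv_encode(message)
print(len(encoded))                         # (128 + 15) * 6 = 858
```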

The expanded watermark bits are transformed into a delta spectrum at a sample rate of 44100Hz and
distributed across various segments (of ca 23 millisecond lengths) of the audio signal and spread
across bands above 800Hz and below 5000Hz.
Based on the delta spectrum, the watermark signal can be modulated and adapted to the current
segment of the input signal before the two are mixed together. To avoid clipping of the output
signal, the final output stage consists of a time local limiter with ca 1 second window.
<!-- TODO: describe the limiter in more detail -->
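
The limiter itself is not specified in this document. As a minimal sketch of the general idea only, assuming a gain derived from the peak found in a window around each sample, the following scales loud regions down so that the output stays within ±0.99. The function name, the centred window and the step-wise gain are assumptions made for illustration.

```python
def limit(samples, window, ceiling=0.99):
    """Scale each sample by a gain derived from the peak in its neighbourhood.

    The gain is 1.0 where the surrounding window stays below the ceiling and
    ceiling / peak otherwise, so the output never exceeds +/- ceiling.
    This naive version costs O(len(samples) * window).
    """
    half = window // 2
    out = []
    for i, s in enumerate(samples):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        peak = max(abs(x) for x in samples[lo:hi])
        gain = min(1.0, ceiling / peak) if peak > 0 else 1.0
        out.append(s * gain)
    return out


# Tiny example; at 44100Hz a window of ca 1 second would be 44100 samples.
mixed = [0.2, 0.5, 1.4, -1.1, 0.3, 0.1, 0.05, 0.02]
limited = limit(mixed, window=3)
```

A practical limiter would additionally smooth the gain over time to avoid audible steps; that part is left out here.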

\pagebreak
An outline of the component interactions to integrate the watermark information via delta
band spectrum into the audio signal is provided in the following chart.

~~~~{.graphviz prog=dot}
digraph "Audiowmark Watermark Embedding" {
  graph[fontsize=13,_fontname="sans"];
  node[fontsize=13,target="_top",_fontname="sans"];
  edge[arrowhead=vee,arrowtail=vee,color="#00000080",_fontname="sans"];
  compound=true;
  concentrate=false;
  rankdir="TB";

bitvec -> get_frame_mod [color=green4];

AudioInput -> Limiter [color=blue];
AudioInput -> in_resampler [color=blue];
AudioInput -> snr_signal_power [style=dashed,color=gold3];

Key -> Random [minlen=2,color=goldenrod];

subgraph cluster_wmadd {
  label=< <b><font face="mono">audiowmark add &lt;AudioInput&gt; &lt;AudioOutput&gt; &lt;Bits&gt; [--key…]                         </font></b> >;

  Random -> get_frame_mod [color=goldenrod];
  in_resampler -> fft_analyzer [color=blue];
  { rank=same Random in_resampler }

  subgraph cluster_WatermarkGen {
    label=< <b>Watermark Generation</b> >; style=dashed; color=grey;
    // fontsize=9; node[fontsize=5,margin=0]; edge[fontsize=5,margin=0];

    fft_analyzer -> apply_frame_mod [color=blue4,xlabel="  513 FFT Bands \r"];
    get_frame_mod -> apply_frame_mod [color=red,xlabel="Up/Down-Band Modulators \r"];
    apply_frame_mod -> wm_synth [color=fuchsia,xlabel="FFT Delta Bands  \r"];

  }

  wm_synth -> out_resampler [color=fuchsia];
  out_resampler -> Limiter [color=fuchsia];
  out_resampler -> snr_signal_power [style=dashed,color=gold3];
}

Limiter -> AudioOutput [color=darkmagenta];
snr_signal_power -> SnrOutput [style=dashed,color=gold3];

bitvec [color=green4,margin=0,label=" Watermark Bits "];
Key [color=goldenrod,margin=0,label=" Key "];
AudioInput [color=blue,margin=0,label=" WAV/MP3 Audio Input File "];
AudioOutput [color=darkmagenta,margin=0,label=" WAV Audio Output File "];
{ rank=same bitvec Key AudioInput }

Random        [color=goldenrod,shape=record,label="Random Stream \l AES128/CTR \l"];
in_resampler  [color=blue,shape=record,label="Resample to 44.1kHz"];
out_resampler [color=fuchsia,shape=record,label="Resample from 44.1kHz"];

Limiter [color=darkmagenta,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Mixing ⊕ Limiting</b> <br/>
• Audio and watermark signals are added <br/>
• The result is scaled down to [-0.99…+0.99] <br/>
• Uses 1 second detection window <br/>
</TD></TR></TABLE> >];

snr_signal_power [style=filled,fillcolor="#ffffbb",color=gold3,shape=rect,label=<
<TABLE BORDER="0" CELLBORDER="0" CELLSPACING="1" ALIGN="LEFT"><TR><TD BALIGN="LEFT" CELLPADDING="5">
<b>Power Measurement</b> <br/>
• Signal Power <br/>
• Delta Power <br/>
• Ratio Calculation <br/>
</TD></TR></TABLE> >];
SnrOutput [color=gold3,margin=0,style=filled,fillcolor="#ffffbb",label=" Signal/Noise Ratio Info "];

fft_analyzer [color=blue4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Fourier Transform Analyzer</b> <br/>
• Input: Time domain samples <br/>
• FFT with block Size 1024 <br/>
• Hann Window <br/>
</TD></TR></TABLE> >];

apply_frame_mod [color=fuchsia,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Band Modulation ⊗</b> <br/>
• Input bands are Up/Down modulated <br/>
• Factor amounts to ±Amplitude^1% <br/>
• Output: ± Delta bands <br/>
</TD></TR></TABLE> >];

get_frame_mod [color=red,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Modulation Frame Generator</b> <br/>
• Encodes Watermark Bits <br/>
• Pregenerate A/B-Blocks <br/>
• Yields A/B-Block frames <br/>
</TD></TR></TABLE> >];

wm_synth [color=fuchsia,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Watermark Signal Synthesis</b> <br/>
• Inverse FFT with block size 1024 <br/>
• Cosine window with overlap of 10% <br/>
• Output: Time domain samples <br/>
</TD></TR></TABLE> >];

}
~~~~


At a sample frequency of 44100Hz, the audio signal used for the watermark creation is split into
"Frames" of 1024 samples each, which corresponds to segments of ca 23 millisecond length. These
frames are transformed from the time domain (samples) into the frequency domain (spectral bands)
and vice versa to apply the watermark embedding in certain spectral bands.
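
The figures quoted here and in the charts follow directly from the sample rate and the frame size. The short calculation below reproduces them; the mapping of the band limits 861Hz and 4307Hz to band indices is a derived observation, not something stated by the sources, and the helper name is made up for illustration.

```python
SAMPLE_RATE = 44100   # Hz, internal processing rate
FRAME_SIZE = 1024     # samples per frame / FFT block

frame_ms = 1000 * FRAME_SIZE / SAMPLE_RATE   # duration of one frame
n_bands = FRAME_SIZE // 2 + 1                # bands of a real-input FFT
bin_hz = SAMPLE_RATE / FRAME_SIZE            # spacing between adjacent bands


def band_index(freq_hz):
    """Index of the FFT band whose centre is closest to freq_hz."""
    return round(freq_hz / bin_hz)


print(f"{frame_ms:.1f} ms per frame")        # 23.2 ms per frame
print(n_bands, "bands")                      # 513 bands
print(f"{bin_hz:.2f} Hz per band")           # 43.07 Hz per band
print(band_index(861), band_index(4307))     # 20 100
```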

Data and synchronization bits are encoded across several frames, with different levels of
redundancy. In the "Modulation Frame Generator", the frames that together carry all encoded
information needed to find and extract the watermark bits are combined into two types of "Blocks".
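
Using the frame counts given in the charts (6 sync bits in 85 frames each, 858 encoded data bits in 2 frames each), the size and duration of one block can be worked out as follows. The constant names are chosen for illustration only.

```python
SAMPLE_RATE = 44100
FRAME_SIZE = 1024

SYNC_BITS = 6
SYNC_FRAMES_PER_BIT = 85
DATA_BITS = 858            # encoded bits per block
DATA_FRAMES_PER_BIT = 2

sync_frames = SYNC_BITS * SYNC_FRAMES_PER_BIT      # 510
data_frames = DATA_BITS * DATA_FRAMES_PER_BIT      # 1716
block_frames = sync_frames + data_frames           # 2226
block_seconds = block_frames * FRAME_SIZE / SAMPLE_RATE

print(block_frames)                # 2226
print(round(block_seconds, 2))     # 51.69
```

This is where the figure of roughly 52 seconds for one complete block comes from.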

\pagebreak
The component interactions for the Frame and Block generation in the
"Modulation Frame Generator" are detailed in the next chart.

~~~~{.graphviz prog=dot}
digraph "Modulation Frame Generator" {
  graph[fontsize=13,_fontname="sans"];
  node[fontsize=13,target="_top",_fontname="sans"];
  edge[arrowhead=vee,arrowtail=vee,color="#00000080",_fontname="sans"];
  compound=true;
  concentrate=false;
  rankdir="TB";

Random -> ab_generators [color=goldenrod,lhead=cluster_ModulationFrameGenerator];
//Random -> randomize_bit_order [color=goldenrod,xlabel="bit_order R5"];
//Random -> mark_sync [color=goldenrod,xlabel="sync_up_down R2"];
//Random -> mark_data [color=goldenrod,xlabel="data_up_down R1"];
//Random -> frame_pos [color=goldenrod,xlabel="frame_position R6"];

bitvec -> conv_encode [color=green4,minlen=2];
{ rank=same Random bitvec }

subgraph cluster_ModulationFrameGenerator {
  label=< <b>Modulation Frame Generator</b> >; style=dashed; color=red;

  ab_generators -> conv_encode [color=green4,minlen=2];
  conv_encode -> randomize_bit_order [color=green4,minlen=2];
  { rank=same ab_generators conv_encode }

  UpDownGen -> mark_sync [color=goldenrod];
  UpDownGen -> gen_mix_entries [color=goldenrod];
  frame_pos -> mark_sync [color=goldenrod,minlen=1];
  // --linear: frame_pos -> mark_data;
  frame_pos -> gen_mix_entries [color=goldenrod];

  randomize_bit_order -> mark_data [color=goldenrod];
  init_frame_mod_vec -> get_frame_mod [color=cyan4];
  mark_sync -> init_frame_mod_vec [color=cyan3,minlen=1];
  mark_data -> init_frame_mod_vec [color=teal];
  gen_mix_entries -> mark_data [color=goldenrod];
  { rank=same mark_data mark_sync }
}

get_frame_mod -> apply_frame_mod [color=red,xlabel="Frame Up/Down-Band Modulators",minlen=2];
apply_frame_mod [shape=plain,label=" "];

bitvec [color=green4,margin=0,label=" Watermark Bits "];
Random [color=goldenrod,shape=record,label="
Random Stream \l
AES128/CTR \l
Streams R1…R6 \l
"];

conv_encode [color=green4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Convolutional Code Expansion</b> <br/>
• Pads watermark with termination zeros <br/>
• Combines bit stream with A/B constants <br/>
• Generates output stream of encoded bits <br/>
• Generates 858 encoded bits A-Block <br/>
• Generates 858 encoded bits B-Block <br/>
</TD></TR></TABLE>>];

ab_generators [color=green4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Convolutional Code Parameters</b> <br/>
• Convolutional code with rate 1/6 <br/>
• Order 15, needs 15 termination bits <br/>
• Six constants for A-Block and B-Block <br/>
• Forward correction of ca 20% bit errors <br/>
• Encodes 128 bits in 858 bit blocks <br/>
</TD></TR></TABLE>>];

UpDownGen [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Up/Down-Band Generator</b> <br/>
• Uses per-frame shuffling seed <br/>
• Picks random bands, 30 UP, 30 DOWN <br/>
• Bands are between ca 861Hz…4307Hz <br/>
</TD></TR></TABLE>>];

mark_sync [color=cyan3,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Synchronization Frame Generator</b> <br/>
• Encodes 6 sync bits in 6 * 85 frames <br/>
• A-Block bit pattern: 010101 <br/>
• B-Block bit pattern: 101010 <br/>
• Randomizes Up/Down-Band shifts [R2] <br/>
• Output: 510 Frames * 60 Up/Down-Bands <br/>
</TD></TR></TABLE>>];

gen_mix_entries [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Mix Entry Generator (skipped for --linear)</b> <br/>
• Generates list of data bit encoding bands <br/>
• Uses 30 up + 30 down bands in 2 frames per bit <br/>
• Randomizes Up/Down-Band shifts [R1] <br/>
• Shuffles data bit association of entries [R4] <br/>
• Output: 2 * 858 * 30 Up/Down band pairs <br/>
</TD></TR></TABLE>>];

mark_data [color=teal,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Data Frame Generator</b> <br/>
• Encodes 858 data bits in 858 * 2 frames <br/>
• Encodes A-Blocks, B-Blocks in turn <br/>
• Omits Mix Entry Generator with --linear <br/>
• Randomizes Up/Down-Band shifts [R1] <br/>
• Output: 1716 Frames * 60 Up/Down-Bands <br/>
</TD></TR></TABLE>>];

frame_pos [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Frame Position Randomization</b> <br/>
• Mixes sync + data frames <br/>
• Shuffles frame positions <br/>
• Uses random stream [R6] <br/>
</TD></TR></TABLE>>];

randomize_bit_order [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Randomize Bit Order for ENCODE</b> <br/>
• Reversible shuffle for encode/decode <br/>
• Shuffles/interleaves bit stream [R5] <br/>
• Interleaving improves robustness <br/>
• Reduces bit stream damage impact <br/>
</TD></TR></TABLE>>];

init_frame_mod_vec [color=cyan4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>A/B-Block Frame Modulator Composition</b> <br/>
• Interleaves synchronization and data frames <br/>
• Pulls and interleaves each block type separately <br/>
• Output: Up/down band modulators for 1 block <br/>
</TD></TR></TABLE>>];

get_frame_mod [color=red,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Modulation Frame Selector</b> <br/>
• Yields A-Block band modulators per frame <br/>
• Yields B-Block band modulators and starts over <br/>
• Output: Up/down band modulators for 1 frame <br/>
</TD></TR></TABLE>>];

}
~~~~

The watermark is encoded and embedded into the audio signal in two block types,
A-Blocks and B-Blocks.  The information contained in each block alone is
usually sufficient to extract the watermark.  However, in case of very
distorted and noisy transmissions where watermark extraction from either block
type fails, a combination of segments with A-Block and B-Block data may still
lead to successful recovery of the original watermark.

In order to support watermark extraction from cropped excerpts of the input
stream, a fixed pattern of synchronization bits is integrated into the data
blocks with much higher redundancy than the data bits.  The fixed pattern
allows the locations of A-Blocks and B-Blocks to be detected, which aids the
watermark extraction.

The user provided encoding `Key` seeds an AES based pseudo random number generator in
[Counter Mode](https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#CTR)
that is used to determine encoding places, randomize the noise introduced by
the watermark and to interleave encoding for robustness. Without the key, the
watermark information cannot be retrieved. Using a key is important because the
implementation itself is open source, and being able to read the watermark
message bits would allow an attacker to remove the watermark without degrading
the audio quality.

The different types of random data streams used for the distribution of the embedded watermark information are as follows:

* R1 - Used to randomize Up/Down band shifts for watermark data bits.
* R2 - Used to randomize Up/Down band shifts for watermark synchronization bits.
* R3 - Currently unused.
* R4 - Used to mix (shuffle) data bit associations of Up/Down bands distributed across several frames.
* R5 - Used to shuffle (interleave) the bit stream. Because the generated bit stream is redundant, interleaving reduces the
  number of bits adversely affected by bursts (holes) of transmission loss.
* R6 - Used to randomize and mix data frames with synchronization frames; this makes synchronization frames unlikely to be detectable without the encoding key.
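
The idea of a keyed, reversible shuffle can be sketched as follows. This is not the actual implementation: the Python standard library has no AES, so SHA-256 over `key || stream id || counter` is used here as a stand-in for AES128/CTR, and the function names as well as the way the stream number selects an independent stream are assumptions made for illustration.

```python
import hashlib


def keyed_stream(key: bytes, stream_id: int):
    """Yield 64-bit integers from a counter-mode style construction.

    Stand-in only: SHA-256(key || stream_id || counter) replaces AES128/CTR.
    """
    counter = 0
    while True:
        data = key + bytes([stream_id]) + counter.to_bytes(8, "big")
        block = hashlib.sha256(data).digest()
        for i in range(0, 32, 8):
            yield int.from_bytes(block[i:i + 8], "big")
        counter += 1


def permutation(n: int, key: bytes, stream_id: int):
    """Fisher-Yates permutation of range(n), driven by the keyed stream."""
    rng = keyed_stream(key, stream_id)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = next(rng) % (i + 1)   # small modulo bias, acceptable for a sketch
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def shuffle(items, key: bytes, stream_id: int):
    perm = permutation(len(items), key, stream_id)
    return [items[p] for p in perm]


def unshuffle(shuffled, key: bytes, stream_id: int):
    """Invert shuffle(); only possible with the same key and stream id."""
    perm = permutation(len(shuffled), key, stream_id)
    out = [None] * len(shuffled)
    for pos, p in enumerate(perm):
        out[p] = shuffled[pos]
    return out


key = bytes(16)                      # placeholder 128-bit key (all zeros)
encoded = list(range(858))           # stand-in for one block of encoded bits
interleaved = shuffle(encoded, key, stream_id=5)
restored = unshuffle(interleaved, key, stream_id=5)
```

Because the permutation depends only on the key and the stream number, the decoder can rebuild it and undo the interleaving without any side information.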

## Extracting Watermarks

The `audiowmark get <watermarked_wav> [--key…]` command extracts a watermark from an audio file.
This command takes an audio file and an optional key as input.
With the same key used during watermark embedding, synchronization bits are determined and searched
for in the audio content.
If synchronization bit matches are detected, encoded watermark information can be located,
extracted and decoded with error correction.
The retrieval does not require access to the original audio input (this is called blind decoding).
The detection results are produced on *stdout* with accompanying information about the location,
match quality and a measure for likely decoding errors.

An outline of the component interactions to locate and extract the watermark information from the
frequency spectrum in the audio signal is provided in the following charts.

~~~~{.graphviz prog=dot}
digraph "Audiowmark Watermark Extraction" {
  newrank=true;
  graph[fontsize=13,_fontname="sans"];
  node[fontsize=13,target="_top",_fontname="sans"];
  edge[arrowhead=vee,arrowtail=vee,color="#00000080",_fontname="sans"];
  compound=true;
  concentrate=false;
  rankdir="TB";

WavData -> in_resampler [color=darkmagenta];

Key -> Random [minlen=2,color=goldenrod];

subgraph cluster_wmget {
  label=< <b><font face="mono">audiowmark get &lt;AudioFile&gt; [--key…]
                                                    </font></b> >;

  { rank=same Random in_resampler }
  in_resampler -> fft_range [color=darkmagenta];
  Random -> conv_decode_soft [minlen=2,color=goldenrod,
    lhead=cluster_WatermarkExtraction]; // fake arrow target for cluster alignment

  subgraph cluster_WatermarkExtraction {
    label=< <b>Watermark Extraction</b> >; style=dashed; color=green4;

    BlockDecoder -> ClipDecoder [color=darkorchid3];
    fft_range -> BlockDecoder [color=blue4,constraint=1];

    { rank=same fft_range conv_decode_soft }
    conv_decode_soft -> BlockDecoder [color=green4,dir=both];
    conv_decode_soft -> ClipDecoder [color=green4,dir=both];
  }
  BlockDecoder -> SyncFinder [color=cyan4,minlen=4,dir=both,xlabel="Mode::BLOCK"];
  ClipDecoder -> SyncFinder  [color=cyan4,minlen=1,dir=both,xlabel="Mode::CLIP",constraint=false];

  subgraph cluster_SyncFinder {
    label=""; style=dashed; color=cyan4; node[margin=0]; edge[margin=0]; margin=0;
    SyncFinder [shape=plaintext,
    label=< <table border="0"><tr><td BALIGN="LEFT">
<b>Synchronization Position Finder </b> <br/>
• Performs coarse search for synchronization <br/>
   bit markers in the frequency spectrum <br/>
• Searches for A/B-Block synchronizations <br/>
• Calculates score for possible locations <br/>
   and picks the 5 best matches <br/>
• Refines the exact block locations with a fine <br/>
   grained search for synchronization markers <br/>
• Short audio segments are dealt with by <br/>
   adding zero padding (in Mode::CLIP) <br/>
</td></tr></table> >];
  }


  ClipDecoder -> result_set_print [color=darkorchid4];

  { rank=same SyncFinder result_set_print }
}


result_set_print -> bitvec [color=green4];

Key [color=goldenrod,margin=0,label=" Key "];
WavData [color=darkmagenta,margin=0,label=" WAV/MP3 Audio Input File "];
{ rank=same Key WavData }
in_resampler  [color=darkmagenta,shape=record,label="Resample to 44.1kHz"];
Random        [color=goldenrod,shape=record,label="Random Stream \l AES128/CTR \l"];

fft_range [color=blue4,shape=rect,label=< <table border="0" align="left"><tr><td balign="left">
<b>Fourier Transform Analyzer</b> <br/>
• Input: Frames with time domain samples <br/>
• Short inputs are zero-padded as needed (Mode::CLIP) <br/>
• FFT with block Size 1024 <br/>
• Hann Window <br/>
• Output: 513 FFT Bands <br/>
</td></tr></table> >];

BlockDecoder [color=darkorchid3,shape=rect,label=< <table border="0" align="left"><tr><td balign="left">
<b>Best Block Decoder</b> <br/>
• Detect A/B-Blocks via 'Synchronization Position Finder' in Mode::BLOCK <br/>
• Only works for audio clips with at least 52 seconds and proper block <br/>
   alignment (or up from 104 seconds without alignment)  <!-- 1024 * (510 + 1716) / 44100 --> <br/>
• Reconstruct Up/Down-Band associations (reverses 'Mix Entry Generator') [R4] <br/>
• Estimate bit vectors resulting from average deviations in <br/>
   randomized Up/Down-Band shifts [R1] <!-- mix_decode --> <br/>
• Unshuffle bit order (reverses 'Randomize Bit Order for ENCODE') [R5]<br/>
• Normalize soft bits (normalization of estimated bit vectors) <br/>
• Reconstruct watermark bits using 'Soft-Decision Decoder' <br/>
• Decode individual A-Blocks, B-Blocks and if present AB-Blocks <br/>
• As last resort, attempt an AB-Block decode on accumulated bit vectors <br/>
   averaged over all selected blocks
</td></tr></table> >];

conv_decode_soft [color=green4,shape=rect,label=< <table border="0" align="left"><tr><td balign="left">
<b>Soft-Decision Decoder</b> <br/>
• Utilize Viterbi algorithm <br/>
• Decode blocks of 858 encoded bits <br/>
• Reconstructs 128 payload bits <br/>
• Uses 'Convolutional Code Parameters' <br/>
• Decode A-, B- and AB-Blocks <br/>
</td></tr></table> >];

ClipDecoder [color=darkorchid4,shape=rect,label=< <table border="0" align="left"><tr><td balign="left">
<b>Short Audio Clip Decoder</b> <br/>
• Used only for small audio clips up to 160 seconds (3.1 blocks) <!-- 1024 * (510 + 1716) / 44100 * 3.1 --> <br/>
• Uses zero padding (silence) around short audio clips to construct 3 blocks <br/>
• Zero padded regions are ignored during synchronization detection and scoring <br/>
• Determine alignment with 'Synchronization Position Finder' in Mode::CLIP <br/>
• Select the five blocks with the best detected synchronization markers <br/>
• Reconstruct Up/Down-Band associations (reverses 'Mix Entry Generator') [R4] <br/>
• Estimate bit vectors resulting from average deviations in <br/>
   randomized Up/Down-Band shifts [R1] <!-- mix_decode --> <br/>
• Unshuffle bit order (reverses 'Randomize Bit Order for ENCODE') [R5]<br/>
• Normalize soft bits (normalization of estimated bit vectors) <br/>
• Reconstruct watermark bits using 'Soft-Decision Decoder' <br/>
• Attempt AB-Block decoding at alignment position <br/>
• Attempt secondary AB-Block decoding in case of audio data surplus <br/>
• In practice, ca 10 seconds are needed for reliable detection, <br/>
   in good scenarios up to 3 seconds may suffice <br/>
</td></tr></table> >];


result_set_print [style=filled,fillcolor="#eeffdd",color=green4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Result Set Printing</b> <br/>
• Sort results by block type and score <br/>
• Print potential watermarks on stdout <br/>
• Provide time offset, score quality, <br/>
   block type and a measure for likely <br/>
   decoding errors per watermark <br/>
</TD></TR></TABLE> >];

bitvec [style=filled,fillcolor="#eeffdd",color=green4,margin=0,label=" Watermark Bits "];

}
~~~~

At a sample frequency of 44100Hz, a spectral analysis is performed on the audio signal and the spectrum is then searched for known synchronization markers.
Upon detection of A/B-Block synchronization positions, watermark bits are extracted from known data bit locations, while making use of the embedded redundancy to make the detection more robust.
Due to high redundancy and wide spread of watermark information, bits can often still be extracted from audio clips that are heavily shortened. To apply the full detection machinery to very short clips, symmetric zero padding is used to provide enough input samples (zero-padded regions are, however, ignored during scoring).

Since detection success is directly dependent on the precise bit stream synchronization, an iterative process is used for fast approximation of synchronization locations with later refinements to yield precise results.

The purpose of the synchronization algorithm is to find the location of the
watermark A/B blocks in the input signal. This is important because the
signal may have been cropped so that the location of the blocks is not
known. To make the blocks locatable, sync bits are added to each block with
relatively high redundancy when the watermark is embedded.
The values of these sync bits are known: for an A block they are 010101, for a
B block they are 101010. The up- and down-bands used for the sync bits and the
offsets of all frames that belong to sync bits inside the A / B block are known
and determined by the key.
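
Since the B pattern is the exact inverse of the A pattern, a single correlation is enough to tell the two block types apart. The following sketch assumes that each sync bit has already been estimated as a soft value between 0 and 1; the function names and the threshold of 0.5 are hypothetical and not taken from the audiowmark sources.

```python
A_PATTERN = [0, 1, 0, 1, 0, 1]   # sync bits of an A block; a B block is the inverse


def sync_score(soft_bits):
    """Correlate six soft sync bits in [0, 1] with the A-block pattern.

    Returns a value in [-1, 1]: close to +1 for an A block, close to -1 for a
    B block, and near 0 where no sync pattern is present.
    """
    total = 0.0
    for soft, expected in zip(soft_bits, A_PATTERN):
        sign = 1.0 if expected else -1.0
        total += sign * (2.0 * soft - 1.0)   # map [0, 1] to [-1, 1]
    return total / len(A_PATTERN)


def classify(soft_bits, threshold=0.5):
    """Return "A", "B" or None depending on the sign and size of the score."""
    score = sync_score(soft_bits)
    if score >= threshold:
        return "A"
    if score <= -threshold:
        return "B"
    return None


print(classify([0.1, 0.9, 0.2, 0.8, 0.0, 1.0]))   # A
print(classify([0.9, 0.1, 0.8, 0.2, 1.0, 0.0]))   # B
print(classify([0.5] * 6))                        # None
```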

To perform the actual synchronization and locate the start of an A (or B) block,
two steps are performed.

* As a first step, the synchronization algorithm tests all possible positions
for the start of an A (or B) block using a step size of 256 (1/4 frame size)
and tries to decode the sync bits at the expected locations relative to the
start of the block. Since the values and locations of the bits are known, a sync
score can be computed that indicates how well the bits in the actual audio
input at this position match the expected bit sequence.

* For all start locations with a sufficiently high sync score, in a second
step the actual start position is searched by trying start
locations near the original match with a finer step size. Again,
a sync score is computed and compared to a second threshold to decide
whether this location is really likely to contain a data block. If the match
is good enough, the start location is used to decode the data bits in the
block.
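
The two steps can be sketched as a coarse scan followed by a local refinement. In this sketch the scoring function is supplied by the caller and stands in for the actual sync bit decoding; the function name, the thresholds and the size of the refinement neighbourhood are assumptions, while the step sizes of 256 and 8 are the ones named in the charts.

```python
COARSE_STEP = 256   # 1/4 of a 1024-sample frame
FINE_STEP = 8       # finer stepping used for refinement


def find_sync(score_at, n_samples, coarse_threshold, fine_threshold):
    """Return (position, score) pairs for likely block starts.

    score_at(pos) is a caller-supplied function rating how well the sync bits
    decoded at sample offset pos match the expected pattern.
    """
    results = []
    for coarse in range(0, n_samples, COARSE_STEP):
        if score_at(coarse) < coarse_threshold:
            continue                                  # step 1: coarse filter
        lo = max(0, coarse - COARSE_STEP)
        hi = min(n_samples, coarse + COARSE_STEP + 1)
        best = max(range(lo, hi, FINE_STEP), key=score_at)   # step 2: refine
        if score_at(best) >= fine_threshold:
            results.append((best, score_at(best)))
    return results


def toy_score(pos):
    """Synthetic score with a single peak at sample 1000."""
    return max(0.0, 1.0 - abs(pos - 1000) / 300)


print(find_sync(toy_score, 4096, coarse_threshold=0.5, fine_threshold=0.9))
# [(1000, 1.0)]
```

With a lower coarse threshold, neighbouring coarse positions can refine to the same peak, so a real implementation also needs to keep only local maxima.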

Besides using this strategy to find "whole" data blocks, there is also a
variant of the synchronization algorithm that is used if the audio signal
is very short. It can find the location of the watermark even if the length
of the input signal is too short to contain a complete data block. To be
able to do this, the input signal is zero padded before sync detection and
then the usual algorithm to find whole blocks is used.
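
A minimal sketch of such padding is shown below, assuming the clip is centred in a zero-filled buffer and the range of real audio is remembered so that scoring can skip the padded parts. The function name and the return format are made up for illustration.

```python
def pad_clip(samples, target_len):
    """Centre a short clip in a zero-filled buffer of target_len samples.

    Returns the padded list and the (start, end) range holding real audio so
    that later scoring can skip the padded regions.
    """
    missing = max(0, target_len - len(samples))
    left = missing // 2
    right = missing - left
    padded = [0.0] * left + list(samples) + [0.0] * right
    return padded, (left, left + len(samples))


padded, (start, end) = pad_clip([1.0, 2.0, 3.0], 8)
print(padded)        # [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0]
print(start, end)    # 2 5
```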

The following chart details the steps involved in determining the synchronization locations.


~~~~{.graphviz prog=dot}
digraph "Synchronization Position Finder" {
  graph[fontsize=13,_fontname="sans"];
  node[fontsize=13,target="_top",_fontname="sans"];
  edge[arrowhead=vee,arrowtail=vee,color="#00000080",_fontname="sans"];
  compound=true;
  concentrate=false;
  rankdir="TB";

  in_resampler;
  Random;
  { rank=same in_resampler Random }

  Random -> frame_pos [color=goldenrod,lhead=cluster_SyncFinder,minlen=2];

  subgraph cluster_SyncFinder {
    label=< <b>Synchronization Position Finder</b> >; style=dashed; color=cyan4;

    in_resampler -> fft_analyzer [color=darkmagenta];
    fft_analyzer -> sync_fft_256_8 [color=blue4,dir=both];
    frame_pos -> init_up_down [color=goldenrod];
    UpDownGen -> init_up_down [color=goldenrod];
    init_up_down -> search_approx [color=cyan3];
    search_approx -> sync_select_by_threshold [color=cyan3];
    sync_select_by_threshold -> search_refine [color=cyan3];

    sync_fft_256_8 -> search_refine [style=dashed,dir=back,color=darkcyan,xlabel=" Refining\l Feedback\l                "];
    sync_fft_256_8 -> sync_decode [color=darkcyan,style=dashed,label="Repeat\lRefined\l                "];
    sync_decode -> search_refine [color=darkcyan,style=dashed,label="Repeat\lRefined\l                "];

    sync_fft_256_8 -> sync_decode [color=cyan3];
    sync_decode -> search_approx [color=cyan3];

    { rank=same fft_analyzer init_up_down }
  }

search_refine -> sync_scores [color=cyan4,xlabel="Scoring and A/B-Type for potential blocks",minlen=2];
sync_scores [shape=plain,label=" "];

in_resampler  [color=darkmagenta,shape=record,label="
WAV/MP3 Audio Input File \l
\l
Resampled to 44.1kHz \l
"];

Random [color=goldenrod,shape=record,label="
Random Stream \l
AES128/CTR \l
Streams R1…R6 \l
"];
{ rank=same in_resampler Random }

frame_pos [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Frame Position Randomization</b> <br/>
• Mixes sync + data frames <br/>
• Shuffles frame positions <br/>
• Uses random stream [R6] <br/>
</TD></TR></TABLE> >];

UpDownGen [color=goldenrod,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Up/Down-Band Generator</b> <br/>
• Uses per-frame shuffling seed <br/>
• Picks random bands, 30 UP, 30 DOWN <br/>
• Bands are between ca 861Hz…4307Hz <br/>
</TD></TR></TABLE> >];

init_up_down [color=cyan3,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Synchronization Bit Frame Generator</b> <br />
<b>(Mode::BLOCK &amp; Mode::CLIP)</b> <br/>
• Generates 6 sync bits in 6 * 85 frames (* 2 for Mode::CLIP) <br/>
• Randomizes Up/Down-Band shifts [R2] <br/>
• Sorts synchronization bit frames by frame index <br/>
• Mode::BLOCK Output: 510 Bit Frames with 60 Up &amp; 60 Down-Bands <br/>
• Mode::CLIP Output: 1020 Bit Frames with 60 Up &amp; 60 Down-Bands <br/>
</TD></TR></TABLE> >];

fft_analyzer [color=blue4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Fourier Transform Analyzer</b> <br/>
• Input: Time domain samples <br/>
• FFT with block Size 1024 <br/>
• Hann Window <br/>
• Output: 513 FFT Bands <br/>
</TD></TR></TABLE> >];

sync_fft_256_8 [color=blue4,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Decibel Quantifier</b> <br/>
• Uses coarse stepping of 256 values for approximate search <br/>
• A stepping of 256 equals 1/4th of an FFT block <br/>
• Uses fine stepping of 8 values for refined search <br/>
• Pulls FFT Bands for all input blocks <br/>
• Computes dB for all bands of all blocks <br/>
</TD></TR></TABLE> >];

sync_decode [color=cyan3,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Synchronization Bit Matching</b> <br/>
• Collect Up/Down-Band magnitudes for sync bits <br/>
• Determine match quality for alternating bit patterns <br/>
• Apply watermark strength dependent thresholds <br/>
• Decide A/B-Block based on 010101 / 101010 detection <br/>
</TD></TR></TABLE> >];

search_approx [color=cyan3,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Approximate Synchronization Frame Search</b> <br/>
• Skips over zero-padded samples (Mode::CLIP) <br/>
• Computes multiple time-shifted FFT vectors <br/>
• Uses coarse subframe stepping of 256 values <br/>
• Overlaps frames for sync detection by 1/4th frame <br/>
• Scores positions for synchronization matches <br/>
</TD></TR></TABLE> >];

sync_select_by_threshold [color=cyan3,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Synchronization Frame Selection</b> <br/>
• Due to the subframe stepping, good and bad matches can be expected to alternate <br/>
• Identification of local match maxima <br/>
• Strength dependent threshold determines minimum match quality <br/>
• Selection of likely match positions via maxima and threshold <br/>
</TD></TR></TABLE> >];

search_refine [color=darkcyan,shape=rect,label=<
<TABLE BORDER="0" ALIGN="LEFT"><TR><TD BALIGN="LEFT">
<b>Refined Synchronization Frame Search</b> <br/>
• Computes fine-stepped time-shifted FFT vectors around selected frames <br/>
• Searches ±16 subframes around previously detected good scores <br/>
• Keeps subframe if the score (synchronization frame detection quality) improves <br/>
</TD></TR></TABLE> >];

}
~~~~

The user-provided 128-bit AES key is essential to determine spectral bands, encoding patterns, and bit locations.
During decoding, the same pseudo-random number generator (PRNG) sequences R1…R6 are used that facilitated watermark embedding.
Because both sides use the same AES key with a cryptographically secure PRNG, the sequences are uniformly distributed and deterministically reproducible, but cannot be extrapolated.
This prevents watermark extraction or modification by anyone who does not possess the exact encoding key.
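
The property that matters here, identical key in, identical sequences out, can be illustrated with a minimal sketch. Note the assumptions: the Python standard library has no AES, so SHA-256 over a counter is used as a stand-in for AES128/CTR, and the function name `keyed_stream` and the example key are hypothetical.

```python
import hashlib

def keyed_stream(key: bytes, stream_id: int, count: int) -> list[int]:
    """Return `count` pseudo-random 64-bit integers for one stream.

    Stand-in only: SHA-256 over (key, stream id, counter) replaces the
    AES128/CTR generator, which is not available in the Python stdlib.
    """
    values = []
    for counter in range(count):
        block = hashlib.sha256(
            key + stream_id.to_bytes(1, "big") + counter.to_bytes(8, "big")
        ).digest()
        values.append(int.from_bytes(block[:8], "big"))
    return values

key = bytes.fromhex("00112233445566778899aabbccddeeff")  # hypothetical 128-bit key

embed_r1 = keyed_stream(key, 1, 4)         # sequence used while embedding
decode_r1 = keyed_stream(key, 1, 4)        # same key: same sequence while decoding
wrong_key = keyed_stream(bytes(16), 1, 4)  # different key: unrelated sequence
```

With the same key, `embed_r1` and `decode_r1` are identical, while a different key or a different stream id yields an unrelated sequence.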

## The Patchwork Algorithm

![Example Spectrum](example-spectrum.png)

To store one single bit inside a spectrum, **audiowmark** uses the patchwork
algorithm. From the frequency bands of the spectrum (generated by computing the
FFT of one frame), two groups are chosen in the frequency range of the watermark
using the pseudo random number generator. These are called up- and down-bands.
In the example above, the up-bands are red and the down-bands are green.
Typically there are 30 up- and 30 down-bands and the other bands do not carry
information.

To embed a single bit, the following changes are made to the spectrum:

 * to **store a 1 bit**, each magnitude of each up-band is increased by a small amount,
   and each magnitude of each down-band is decreased by a small amount (this is
   shown by the small arrows in the example image)

 * to **store a 0 bit**, each magnitude of each up-band is decreased, and each magnitude of
   each down-band is increased (the opposite of the small arrows in the example image)

Since we have pseudo-randomly chosen the up- and down-bands from the spectrum,
we can expect that if we sum up all values of the up-bands and sum up all
values of the down-bands **before** embedding the bit, we will get a similar
result (because both groups are random samples of the same spectrum and
therefore have a similar mean value).

However, since we increased all elements of the up-bands and decreased all
elements of the down-bands **after embedding a 1 bit**, the sum of the up-bands
should be **greater than** the sum of the down-bands.

So to decode the bit from the spectrum, we can simply use the rule

 * **decode as 1 bit**, if the sum of the up-bands is greater than the sum
   of the down-bands

 * **decode as 0 bit**, if the sum of the up-bands is smaller than the sum
   of the down-bands
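
The embedding and decoding rules above can be sketched as follows. This is a simplified illustration, not the actual implementation: it modifies dB values directly, uses Python's `random.Random` as a stand-in for the key-derived random stream, and the function names and the 1 dB step are assumptions. The band indices 20…100 correspond to ca 861Hz…4307Hz for a 1024-point FFT at 44.1kHz.

```python
import random

MIN_BAND, MAX_BAND = 20, 100  # ca 861 Hz ... 4307 Hz (1024-point FFT, 44.1 kHz)

def pick_bands(rng, count=30):
    """Pick disjoint pseudo-random up- and down-band index lists."""
    bands = list(range(MIN_BAND, MAX_BAND + 1))
    rng.shuffle(bands)
    return bands[:count], bands[count:2 * count]

def embed_bit(spectrum_db, up, down, bit, delta_db=1.0):
    """Raise up-bands and lower down-bands for a 1 bit, the opposite for a 0 bit."""
    sign = 1.0 if bit else -1.0
    out = list(spectrum_db)
    for band in up:
        out[band] += sign * delta_db
    for band in down:
        out[band] -= sign * delta_db
    return out

def decode_bit(spectrum_db, up, down):
    """Compare the sum of the up-bands with the sum of the down-bands."""
    up_sum = sum(spectrum_db[band] for band in up)
    down_sum = sum(spectrum_db[band] for band in down)
    return 1 if up_sum > down_sum else 0

rng = random.Random(42)  # stand-in for the key-derived random stream
up_bands, down_bands = pick_bands(rng)

# Toy spectrum: 513 bands near -40 dB, varying by less than +/-0.5 dB,
# so that the 1 dB change per band always dominates the original signal.
spectrum = [-40.0 + rng.uniform(-0.5, 0.5) for _ in range(513)]

decoded = [
    decode_bit(embed_bit(spectrum, up_bands, down_bands, bit), up_bands, down_bands)
    for bit in (0, 1)
]
```

The toy spectrum is deliberately flat enough that both bits decode correctly; with real audio the original signal can outweigh the embedded change in a single frame.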

In the actual implementation, increasing/decreasing the magnitude of the
up-/down-bands is done by generating a watermark signal with the right
magnitude/phase for each frame that only contains the changes. So we
compute a delta spectrum, which is then passed to the IFFT, windowed and then
added to the original audio, so that the sum has the desired modified spectrum
magnitude.

The detection is performed on dB values of the magnitudes of the spectrum
obtained from the FFT, so the sums of the dB values of up-/down-bands are
computed and compared to decide whether a 0 bit or 1 bit was received.

The patchwork algorithm does not guarantee that encoding/decoding will always
yield the right result at the lowest level of embedding/decoding one bit (as
the difference of the up-/down-bands can be too big before embedding due to
the original signal). However, error correction and the redundancy gained by
embedding a bit in more than one frame make the whole process reliable at a
higher level.

There are three improvements over the basic patchwork algorithm described
above, which make the watermark detection more accurate:

* The convolutional decoder uses soft-decoding. Instead of making a hard 0 or 1
decision by comparing the two sums before decoding the convolutional code to
obtain the message bits, the difference between the two sums is normalized and
used as a soft-bit input for the Viterbi algorithm.

* Instead of storing one data bit in each frame spectrum, a data bit uses up-
and down-bands from different frames. This is called mix-encoding, which
spreads the information of each data bit over many frames.

* As described above, the original signal can have some negative effect
on the performance of the decoder, since the sum of the up-bands and the
sum of the down-bands will be different even before embedding the bits.
To make detection more reliable, the original signal level for each bin is
estimated as the average of the previous and next spectrum, and this estimate
is subtracted before computing the sum of the up- and down-bands.
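
The effect of the last improvement can be seen in a small sketch. It uses a toy spectrum with only six bands, and the function name `raw_soft_bit` is hypothetical. The normalization applied before the Viterbi decoder is not specified in this section and is left out, so the function returns the unnormalized difference.

```python
def raw_soft_bit(prev_db, cur_db, next_db, up, down):
    """Difference of up- and down-band sums after subtracting the estimated
    original level (average of the previous and next frame) from each band."""
    def residual(band):
        return cur_db[band] - 0.5 * (prev_db[band] + next_db[band])
    return sum(residual(b) for b in up) - sum(residual(b) for b in down)

up, down = [0, 1], [2, 3]

# Stationary original signal whose down-bands are 5 dB louder than its up-bands.
original = [-35.0, -35.0, -30.0, -30.0, -40.0, -40.0]

# Current frame with a 1 bit embedded: up-bands +1 dB, down-bands -1 dB.
current = [-34.0, -34.0, -31.0, -31.0, -40.0, -40.0]

# Plain patchwork comparison: the original signal dominates and hides the bit.
naive = sum(current[b] for b in up) - sum(current[b] for b in down)

# With the neighbouring frames as estimate, the embedded change becomes visible.
soft = raw_soft_bit(original, current, original, up, down)
```

Here `naive` is negative, which would be decoded as a 0 bit, while `soft` is positive and points to the embedded 1 bit. The sign carries the decision and the magnitude indicates how reliable it is.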

## Mixing with Limiter

The input material for **audiowmark** is normalized (all samples are in the
range from -1 to 1). If we simply added the watermark to the input, it
could happen that this sum exceeds the range from -1 to 1 which would
result in clipping. To avoid this, a limiter is used during mixing.

The limiter computes the highest peak for each one-second block. Then a
linear volume envelope is constructed connecting the blocks, such that the
envelope is greater than or equal to the height of the peaks in each block. The
typical value for really high peaks is about 1.04 for the default watermarking
strength of 10.

To avoid clipping, the signal is divided by the slowly changing volume
envelope. The result is somewhat similar to a lookahead peak limiter with
attack of one second, and linear release of one second. Or to describe the
effect more directly, if a single peak of 1.04 was produced in the watermarked
signal, the limiter would slowly start decreasing the volume to 1/1.04 over the
time of one second before the block that contains the peak, stay there for a
while (due to the use of blocks) and afterwards slowly increase the volume
again over the time of one second.
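
A minimal sketch of such a block-based limiter is shown below. It is an illustration under assumptions rather than the actual implementation: the envelope value at each block boundary is taken as the larger peak of the two adjacent blocks (never below 1), the function name `limit` is hypothetical, and a block of four samples stands in for one second of audio.

```python
def limit(samples, block_size):
    """Divide the signal by a piecewise linear envelope built from block peaks."""
    # Peak per block, never below 1.0 so that quiet blocks stay unchanged.
    peaks = [
        max(1.0, max(abs(s) for s in samples[start:start + block_size]))
        for start in range(0, len(samples), block_size)
    ]
    padded = [1.0] + peaks + [1.0]  # neutral neighbours at both ends
    out = []
    for n in range(len(peaks)):
        prev_peak, peak, next_peak = padded[n], padded[n + 1], padded[n + 2]
        left = max(prev_peak, peak)   # envelope at the start of the block
        right = max(peak, next_peak)  # envelope at the end of the block
        block = samples[n * block_size:(n + 1) * block_size]
        for i, sample in enumerate(block):
            envelope = left + (right - left) * i / block_size
            out.append(sample / envelope)
    return out

BLOCK = 4  # stand-in for one second of samples
signal = [0.5] * (5 * BLOCK)
signal[2 * BLOCK + 1] = 1.04  # a single peak above 1.0 in the third block

limited = limit(signal, BLOCK)
```

Because both envelope ends of a block are at least as high as the peak of that block, the interpolated envelope never falls below the peak, so no output sample exceeds 1. Each output block depends only on the peaks of the previous, current and next block.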

By using a limiter that works on one-second blocks like this, it is possible to
seek to any point in the watermark (which is required for streaming via HLS)
and get the exact same output that watermarking all previous samples would
have produced, because the output of the limiter only depends on the current,
previous and next one-second block. So only a small context window needs to
be processed when seeking.

## Speed Detection

As one of the later developments, a dedicated speed detection facility has been integrated that explores the ability to extract watermarks from
audio segments played back at unknown rates.
In scenarios where audio has been resampled and pitched at a constant rate, synchronization markers may still be
detectable by searching the audio content at varying resampled playback rates.

The `audiowmark --detect-speed` command line option attempts to detect playback rate changes between 80% and 125% of the rate of the original material used for embedding.

Detection of playback rate modifications is approached in several steps.
First, the detector picks two short audio clips (ca 25 seconds) with high signal energy and performs multiple coarse scans while detecting <0.2% rate modulations.
This rough speed estimate is improved upon with secondary scans around 1/20‰ rate modulations on an audio clip of 50 seconds.

Executing coarse and fine detection runs at varying resampled playback rates with multiple refining steps consumes a lot of processing resources.
To speed up the detection, resampling and scanning runs are carried out on a downsampled version of the audio material (by a factor of 2) and detection runs are parallelized across all available CPU cores.

Finally, the watermark extraction is carried out on a resampled version of the audio material at the most likely detected playback rate in addition to regular watermark detection, because the detected playback rate may have been guessed wrongly.
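
The coarse-to-fine idea can be sketched as a two-stage grid search. This is only an illustration under assumptions: the 0.2% and 1/20‰ figures are interpreted here as grid resolutions, the function names are hypothetical, and the scoring function is a toy stand-in for the synchronization quality that a real scan would measure on resampled audio.

```python
def detect_speed(score, low=0.80, high=1.25, coarse_step=0.002, fine_step=0.00005):
    """Two-stage search for the speed factor with the best score."""
    # Coarse stage: multiplicative grid with 0.2% spacing over the whole range.
    candidates = []
    speed = low
    while speed <= high:
        candidates.append(speed)
        speed *= 1.0 + coarse_step
    best = max(candidates, key=score)

    # Fine stage: 1/20 permille steps within one coarse step around the best hit.
    steps = int(best * coarse_step / fine_step)
    fine = [best + k * fine_step for k in range(-steps, steps + 1)]
    return max(fine, key=score)

TRUE_SPEED = 1.0595  # toy example: material played back roughly one semitone faster

def toy_score(speed):
    """Stand-in score that peaks at the true speed."""
    return -abs(speed - TRUE_SPEED)

estimate = detect_speed(toy_score)
```

With this toy score, `estimate` ends up within one fine step of `TRUE_SPEED`. A real score is noisy, which is one reason why the extraction is also attempted at the unmodified playback rate.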


<!--
# https://graphviz.org/documentation/
BUILD:
apt install -y python3-pygraphviz python3-pandocfilters
pandoc -F graphviz.py audiowmark.md -o audiowmark.html
pandoc -F graphviz.py audiowmark.md -V papersize:a4 -V geometry:margin=2cm -o audiowmark.pdf
-->