File: overview.rst

package info (click to toggle)
fastchunking 0.0.3-4
  • links: PTS, VCS
  • area: main
  • in suites: bookworm, forky, sid, trixie
  • size: 300 kB
  • sloc: cpp: 490; python: 384; makefile: 197
file content (100 lines) | stat: -rw-r--r-- 3,851 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
==================
Usage and Overview
==================

`fastchunking` provides efficient implementations for different string chunking
algorithms, e.g., static chunking (SC) and content-defined chunking (CDC).

Static Chunking (SC)
--------------------

Static chunking splits a message into fixed-size chunks.

Let us consider a random example message that shall be chunked:
    >>> import os
    >>> message = os.urandom(1024*1024)

Static chunking is trivial when chunking a single message:
    >>> import fastchunking
    >>> sc = fastchunking.SC()
    >>> chunker = sc.create_chunker(chunk_size=4096)
    >>> chunker.next_chunk_boundaries(message)
    [4096, 8192, 12288, ...]

A large message can also be chunked in fragments, though:
    >>> chunker = sc.create_chunker(chunk_size=4096)
    >>> chunker.next_chunk_boundaries(message[:10240])
    [4096, 8192]
    >>> chunker.next_chunk_boundaries(message[10240:])
    [2048, 6144, 10240, ...]

Content-Defined Chunking (CDC)
------------------------------

`fastchunking` supports content-defined chunking, i.e., chunking of messages
into fragments of variable lengths.

Currently, a chunking strategy based on Rabin-Karp rolling hashes is supported.

As a rolling hash computation on plain-Python strings is incredibly slow with
any interpreter, most of the computation is performed by a C++ extension which
is based on the `ngramhashing` library by Daniel Lemire, see:
https://github.com/lemire/rollinghashcpp

Let us consider a random message that should be chunked:
    >>> import os
    >>> message = os.urandom(1024*1024)

When using static chunking, we have to specify a rolling hash window size (here:
48 bytes) and an optional seed value that affects the pseudo-random distribution
of the generated chunk boundaries.

Despite that, usage is similar to static chunking:
    >>> import fastchunking
    >>> cdc = fastchunking.RabinKarpCDC(window_size=48, seed=0)
    >>> chunker = cdc.create_chunker(chunk_size=4096)
    >>> chunker.next_chunk_boundaries(message)
    [7475L, 10451L, 12253L, 13880L, 15329L, 19808L, ...]
    
Chunking in fragments is straightforward:
    >>> chunker = cdc.create_chunker(chunk_size=4096)
    >>> chunker.next_chunk_boundaries(message[:10240])
    [7475L]
    >>> chunker.next_chunk_boundaries(message[10240:])
    [211L, 2013L, 3640L, 5089L, 9568L, ...]

Multi-Level Chunking (ML-\*)
----------------------------

Multiple chunkers of the same type (but with different chunk sizes) can be
efficiently used in parallel, e.g., to perform multi-level chunking [LS16]_.

Again, let us consider a random message that should be chunked:
    >>> import os
    >>> message = os.urandom(1024*1024)

Usage of multi-level-chunking, e.g., ML-CDC, is easy:
    >>> import fastchunking
    >>> cdc = fastchunking.RabinKarpCDC(window_size=48, seed=0)
    >>> chunk_sizes = [1024, 2048, 4096]
    >>> chunker = cdc.create_multilevel_chunker(chunk_sizes)
    >>> chunker.next_chunk_boundaries_with_levels(message)
    [(1049L, 2L), (1511L, 1L), (1893L, 2L), (2880L, 1L), (2886L, 0L),
    (3701L, 0L), (4617L, 0L), (5809L, 2L), (5843L, 0L), ...]

The second value in each tuple indicates the highest chunk size that leads to
a boundary. Here, the first boundary is a boundary created by the chunker with
index 2, i.e., the chunker with 4096 bytes target chunk size.

.. note::
   Only the highest index is output if multiple chunkers yield the same
   boundary.
    
.. warning::
   Chunk sizes have to be passed in correct order, i.e., from lowest to highest
   value.

References:
    .. [LS16] Dominik Leibenger and Christoph Sorge (2016). sec-cs: Getting the
       Most out of Untrusted Cloud Storage.
       `arXiv:1606.03368 <http://arxiv.org/abs/1606.03368>`_