Arithmetic coding is a general technique for coding the outcome of a stochastic process based on an adaptive model. The expected bit rate is the cross-entropy rate of the model versus the actual process. PPM (prediction by partial matching) is an adaptive statistical model of a symbol sequence that estimates the likelihood of the next byte from a (relatively short) suffix of the sequence of previous bytes.
Although arithmetic coding works with any sample space, taking the outcomes to be bytes allows the coding of text in arbitrary character sets as well as binary files. Encoding proceeds a byte at a time and produces output a bit at a time; decoding proceeds a bit at a time and produces output a byte at a time. End-of-file is coded as a separate outcome, mirroring the behavior of byte reads in both Java and C streams.
Suppose b[0],...,b[m-1] is a sequence of bytes of length m where b[m-1]=EOF. A suitable model provides an estimate P(b[n]|b[0],...,b[n-1]) of the probability of the nth byte in the stream (or end-of-file) given its observation of the previous n bytes. The number of bits required to encode b[n] is approximately -log_2 P(b[n]|b[0],...,b[n-1]). A slight penalty is incurred based on the precision of the underlying arithmetic.
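As a quick numeric check, this ideal cost follows directly from the model's probability estimate. A minimal Java sketch (the helper name log2Cost is illustrative, not part of the package):

    // Ideal cost in bits of coding an outcome with model probability p.
    // For example, log2Cost(0.5) is 1 bit, and log2Cost(1.0/257.0) --
    // the uniform model discussed below -- is roughly 8.006 bits.
    static double log2Cost(double p) {
        return -(Math.log(p) / Math.log(2.0));
    }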
In a sequence of bytes b[0], b[1], ..., b[m-1], the byte b[n] is coded relative to an ordering b_0, ..., b_256 of all 256 bytes plus the end-of-file symbol. Assuming b[n]=b_k for some k, the arithmetic encoder stores the sub-interval [ SUM_{j<k} P(b_j|b[0],...,b[n-1]), SUM_{j<=k} P(b_j|b[0],...,b[n-1]) ] of [0.0,1.0]. Independently of the indexing, the width of the interval is P(b[n]|b[0],...,b[n-1]). In practice, the interval is approximated using three integers low, high, and total, which define the interval [low/total,high/total]. The difference high-low must be non-zero for every outcome with non-zero probability. A slight loss in coding precision arises from this integer approximation and is bounded by the precision induced by total.
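For concreteness, the narrowing step an integer-based encoder performs for a symbol whose cumulative counts span [symLow,symHigh) out of total can be sketched as follows; the field and method names are illustrative, not the package's actual API:

    /** One narrowing step of the encoder. The fields low and high hold
     * the current integer interval bounds, kept in longs so that the
     * intermediate products below cannot overflow. */
    void narrow(long symLow, long symHigh, long total) {
        long range = high - low + 1;
        high = low + (range * symHigh) / total - 1;
        low  = low + (range * symLow) / total;
        // Bits on which low and high now agree are emitted and the
        // interval is rescaled (renormalization not shown).
    }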
For decoding, the total used to encode the next symbol's interval is provided by the model. This requires every outcome of P(.|b[0],...,b[n-1]) to be coded with the same total value. Given the total, the decoder returns a count k such that low<=k<high, and the model then determines the symbol from the count k. The coder encodes byte b[n] based on the previous symbols b[0],...,b[n-1], enabling it to update its model after every symbol; the decoder is able to reconstruct the same model after decoding each symbol. The only restriction on the models is that they are trained only on the previous n symbols; in particular, they cannot use information about the symbol being coded.
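The decoder's side of this exchange can be sketched the same way; given the model-supplied total, the standard computation of the count k from the decoder's state (names again illustrative) is:

    /** Returns the count k, with symLow <= k < symHigh for the coded
     * symbol's interval. The fields value, low, and high hold the
     * decoder's buffered input bits and current interval bounds. */
    long countFor(long total) {
        long range = high - low + 1;
        return ((value - low + 1) * total - 1) / range;
    }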
The arithmetic coder is distributed with several illustrative models, culminating in PPMC, prediction by partial matching with escape method C. When introduced in the 1980s, PPM provided the best known compression on general text and on binary data such as images. It has since been surpassed by embellishments: first PPMC, a few years later, which had better fallback strategies for unseen contexts; then PPM*, a decade after that, which kept track of unbounded-length contexts; and more recently PPMZ, which provides several improvements to both estimation and implementation.
Uniform distributions assign the same probability to every outcome. The uniform distribution codes a byte b as the interval [low,high,total] = [b,b+1,257] and end-of-file as [256,257,257]. The expected entropy rate is -log_2 (1/257), just over 8 bits per symbol; the loss relative to 8 bits is due to encoding end-of-file. For a regular file, the file system keeps track of the length, so all 256 byte values could be coded in exactly 8 bits each. Byte reads in languages like C and Java return values in the range 0-255 and return -1 once end-of-file is reached: essentially 257 outcomes. An alternative implementation of arithmetic coding would store the length n separately in log_2 n bits. Although this is typically more efficient than coding end-of-file in the model, it disrupts the streaming nature of coding/decoding.
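The uniform model is simple enough to sketch in full; the class and method names below are illustrative rather than the package's actual API:

    /** Uniform model over 256 bytes plus end-of-file: every outcome
     * has width 1 out of a fixed total of 257. */
    class UniformSketch {
        static final int EOF_SYMBOL = 256;

        /** Fills result with {low,high,total} for a byte 0-255 or
         * EOF_SYMBOL. */
        void interval(int symbol, int[] result) {
            result[0] = symbol;     // low
            result[1] = symbol + 1; // high
            result[2] = 257;        // total
        }
    }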
The adaptive unigram model keeps a count of previously seen bytes and estimates the likelihood of a byte b based on its frequency count(b), where each byte's count is initialized to 1. Adding 1 to unseen events in this way is known as Laplace smoothing. Whenever the total count exceeds the maximum, the counts are rescaled by dividing by 2 and rounding up to at least 1. The model is adaptive in that it updates after each byte; it is a unigram model because strings of length 1 are used. A binary search is used during decoding to determine the interval, but the totals are updated by a linear traversal of all bytes before the updated byte in the ordering.
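The unigram bookkeeping can be sketched as follows, with an illustrative rescaling threshold standing in for the tunable maximum:

    /** Adaptive unigram sketch: Laplace-smoothed counts over 256 bytes
     * plus end-of-file, halved (rounding up to at least 1) when the
     * total exceeds an illustrative maximum. */
    class UnigramSketch {
        static final int MAX_TOTAL = 1 << 14; // assumed threshold
        final int[] count = new int[257];
        int total = 257;

        UnigramSketch() { java.util.Arrays.fill(count, 1); } // Laplace smoothing

        /** Cumulative count of all symbols ordered before symbol; its
         * interval is [lowCount(s), lowCount(s)+count[s]) out of total.
         * This is the linear traversal noted above. */
        int lowCount(int symbol) {
            int sum = 0;
            for (int i = 0; i < symbol; ++i) sum += count[i];
            return sum;
        }

        /** Update after coding a symbol, rescaling when necessary. */
        void increment(int symbol) {
            ++count[symbol];
            if (++total > MAX_TOTAL) {
                total = 0;
                for (int i = 0; i < count.length; ++i) {
                    count[i] = (count[i] + 1) / 2; // halve, round up to >= 1
                    total += count[i];
                }
            }
        }
    }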
The PPM model (Cleary and Witten 1984) predicts the next byte based on a sequence of previous bytes. It generalizes the byte frequency model by keeping counts for sequences of bytes b_1,...,b_n for some value of n (to be precise, we write PPM(n) for the nth-order PPM model).
The PPM model illustrates the final aspect of the arithmetic coding model -- escapes. A model has the option of informing the coder that it doesn't currently know how to code the symbol, providing the likelihood interval for this "escape" outcome instead; the coder then asks again. With PPM, after each escape the context is narrowed to a shorter suffix of the bytes preceding the current byte, eventually grounding out in the uniform distribution. At each escaped context in the backoff, outcomes that have been seen in longer contexts are excluded from the counts.
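The control flow for coding one byte can be sketched as follows; Context, contextFor, and codeUniform are illustrative stand-ins for the model/coder interaction, not the package's API:

    interface Context {
        boolean hasSeen(int symbol, java.util.BitSet excluded);
        void codeSymbol(int symbol, java.util.BitSet excluded);
        void codeEscape(java.util.BitSet excluded);
        void addSeenTo(java.util.BitSet excluded);
    }

    /** Escape-driven backoff for encoding one byte with PPM(n). */
    void encode(int symbol, int maxContextLength) {
        java.util.BitSet excluded = new java.util.BitSet(256);
        for (int len = maxContextLength; len >= 0; --len) {
            Context ctx = contextFor(len);        // counts for the length-len context
            if (ctx.hasSeen(symbol, excluded)) {  // seen here and not excluded?
                ctx.codeSymbol(symbol, excluded); // code with remaining counts
                return;
            }
            ctx.codeEscape(excluded); // code the escape outcome instead
            ctx.addSeenTo(excluded);  // exclude its outcomes in shorter contexts
        }
        codeUniform(symbol); // ground out in the uniform distribution
    }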
The PPM model's maximum context length is configurable through its constructor; the rescaling and maximum count parameters can be tuned through constants in the code.
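As a usage sketch only -- the class names and constructor signature here are assumptions inferred from the description above, not verified API:

    // Hypothetical: an order-8 PPM model driving arithmetic-coded output.
    // Actual class names and signatures in the distribution may differ.
    ArithCodeModel model = new PPMModel(8); // maximum context length 8
    OutputStream out =
        new ArithCodeOutputStream(new FileOutputStream("out.ppm"), model);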
Arithmetic coding relies on processing files a single bit at a time. The classes BitInput and BitOutput wrap ordinary input and output streams to provide buffered access to individual bits.
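The output half of that buffering can be sketched as follows; this illustrates the technique only and is not the actual BitOutput source:

    import java.io.IOException;
    import java.io.OutputStream;

    /** Accumulates bits into a byte buffer, flushing each full byte. */
    class BitOutputSketch {
        private final OutputStream out;
        private int buffer = 0; // bits accumulated so far
        private int nBits = 0;  // number of bits in buffer

        BitOutputSketch(OutputStream out) { this.out = out; }

        /** Writes a single bit, flushing a full byte to the stream. */
        void writeBit(boolean bit) throws IOException {
            buffer = (buffer << 1) | (bit ? 1 : 0);
            if (++nBits == 8) {
                out.write(buffer);
                buffer = 0;
                nBits = 0;
            }
        }

        /** Pads the final partial byte with zeros and closes. */
        void close() throws IOException {
            while (nBits != 0) writeBit(false);
            out.close();
        }
    }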
In order to implement PPM with tight compression, it is necessary to keep track of which symbols have been seen in escaped contexts so that their counts may be subtracted from the narrower contexts. The representation of the bytes that have been seen is handled with a compact, generic implementation of byte sets: the members are stored in the bits of four long values and accessed through shifting and masking. (This was found to be faster and smaller than a boolean array.)
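Such a byte set can be sketched in a few lines; this version is illustrative, not the package's actual class:

    /** A set of bytes packed into the 256 bits of four longs;
     * membership uses only shifting and masking. */
    class ByteSetSketch {
        private final long[] bits = new long[4]; // 4 * 64 = 256 bits

        void add(int b) { // b in 0..255
            bits[b >>> 6] |= 1L << (b & 63);
        }

        boolean contains(int b) {
            return (bits[b >>> 6] & (1L << (b & 63))) != 0;
        }
    }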