Arithmetic coding is a general technique for coding the outcome of a stochastic process based on an adaptive model. The expected bit rate is the cross-entropy rate of the model versus the actual process. PPM (prediction by partial matching) is an adaptive statistical model of a symbol sequence that estimates the likelihood of the next byte from a (relatively short) suffix of the sequence of previous bytes.
Although arithmetic coding works with any sample space, taking the outcomes to be bytes allows the coding of text in arbitrary character sets as well as binary files. Encoding proceeds a byte at a time and produces output a bit at a time; decoding proceeds a bit at a time and produces output a byte at a time. End-of-file is coded as a separate outcome, mirroring the behavior of byte reads in both Java and C streams.
Suppose b[0],...,b[m-1] is a sequence of bytes of length m where b[m-1]=EOF. A suitable model provides an estimate P(b[n]|b[0],...,b[n-1]) of the probability of the nth byte in the stream (or end-of-file) given its observation of the previous n bytes. The number of bits required to encode b[n] is approximately -log_2 P(b[n]|b[0],...,b[n-1]). A slight penalty is incurred based on the precision of the underlying arithmetic.
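As a quick numeric check, this ideal cost follows directly from the model's probability estimate. A minimal Java sketch (the helper name log2Cost is illustrative, not part of the package):

    // Ideal cost in bits of coding an outcome with model probability p.
    // For example, log2Cost(0.5) is 1 bit, and log2Cost(1.0/257.0) --
    // the uniform model discussed below -- is roughly 8.006 bits.
    static double log2Cost(double p) {
        return -(Math.log(p) / Math.log(2.0));
    }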
In a sequence of bytes b[0], b[1], ..., b[m-1], the byte b[n] is coded relative to an ordering b_0, ..., b_256 of all 256 bytes plus the end-of-file symbol. Assuming b[n]=b_k for some k, the arithmetic encoder stores the sub-interval [ SUM_{j<k} P(b_j|b[0],...,b[n-1]), SUM_{j<=k} P(b_j|b[0],...,b[n-1]) ] of [0.0,1.0]. Independently of the indexing, the width of the interval is P(b[n]|b[0],...,b[n-1]). In practice, the interval is approximated using three integers low, high, and total, which define the interval [low/total,high/total]. The difference high-low must be non-zero for every outcome with non-zero probability. A slight loss in coding precision arises from this integer approximation and is bounded by the precision induced by total.
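For concreteness, the narrowing step an integer-based encoder performs for a symbol whose cumulative counts span [symLow,symHigh) out of total can be sketched as follows; the field and method names are illustrative, not the package's actual API:

    /** One narrowing step of the encoder. The fields low and high hold
     * the current integer interval bounds, kept in longs so that the
     * intermediate products below cannot overflow. */
    void narrow(long symLow, long symHigh, long total) {
        long range = high - low + 1;
        high = low + (range * symHigh) / total - 1;
        low  = low + (range * symLow) / total;
        // Bits on which low and high now agree are emitted and the
        // interval is rescaled (renormalization not shown).
    }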
For decoding, the total used to encode the next symbol's interval is provided by the model. This requires every outcome of P(.|b[0],...,b[n-1]) to be coded with the same total value. Given the total, the decoder returns a count k such that low<=k<high, and the model then determines the symbol from the count k. The coder encodes byte b[n] based on the previous symbols b[0],...,b[n-1], enabling it to update its model after every symbol; the decoder is able to reconstruct the same model after decoding each symbol. The only restriction on the models is that they are trained only on the previous n symbols; in particular, they cannot use information about the symbol being coded.
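The decoder's side of this exchange can be sketched the same way; given the model-supplied total, the standard computation of the count k from the decoder's state (names again illustrative) is:

    /** Returns the count k, with symLow <= k < symHigh for the coded
     * symbol's interval. The fields value, low, and high hold the
     * decoder's buffered input bits and current interval bounds. */
    long countFor(long total) {
        long range = high - low + 1;
        return ((value - low + 1) * total - 1) / range;
    }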
The arithmetic coder is distributed with several illustrative models, culminating in PPMC, prediction by partial matching with escape method C. When introduced in the 1980s, PPM provided the best known compression on general text and on binary data such as images. It has since been surpassed by embellishments: first PPMC, a few years later, which had better fallback strategies for unseen contexts; then PPM*, a decade after that, which kept track of unbounded-length contexts; and more recently PPMZ, which provides several improvements to both estimation and implementation.
Uniform distributions assign the same probability to every outcome. The uniform distribution codes a byte b as the interval [low,high,total] = [b,b+1,257] and end-of-file as [256,257,257]. The expected entropy rate is -log_2 (1/257), just over 8 bits per symbol; the loss relative to 8 bits is due to encoding end-of-file. For a regular file, the file system keeps track of the length, so all 256 byte values could be coded in exactly 8 bits each. Byte reads in languages like C and Java return values in the range 0-255 and return -1 once end-of-file is reached: essentially 257 outcomes. An alternative implementation of arithmetic coding would store the length n separately in log_2 n bits. Although this is typically more efficient than coding end-of-file in the model, it disrupts the streaming nature of coding/decoding.
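The uniform model is simple enough to sketch in full; the class and method names below are illustrative rather than the package's actual API:

    /** Uniform model over 256 bytes plus end-of-file: every outcome
     * has width 1 out of a fixed total of 257. */
    class UniformSketch {
        static final int EOF_SYMBOL = 256;

        /** Fills result with {low,high,total} for a byte 0-255 or
         * EOF_SYMBOL. */
        void interval(int symbol, int[] result) {
            result[0] = symbol;     // low
            result[1] = symbol + 1; // high
            result[2] = 257;        // total
        }
    }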
The adaptive unigram model keeps a count of previously seen bytes and estimates the likelihood of a byte b based on its frequency count(b), where each byte's count is initialized to 1. Adding 1 to unseen events in this way is known as Laplace smoothing. Whenever the total count exceeds the maximum, the counts are rescaled by dividing by 2 and rounding up to at least 1. The model is adaptive in that it updates after each byte; it is a unigram model because strings of length 1 are used. A binary search is used during decoding to determine the interval, but the totals are updated by a linear traversal of all bytes before the updated byte in the ordering.
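The unigram bookkeeping can be sketched as follows, with an illustrative rescaling threshold standing in for the tunable maximum:

    /** Adaptive unigram sketch: Laplace-smoothed counts over 256 bytes
     * plus end-of-file, halved (rounding up to at least 1) when the
     * total exceeds an illustrative maximum. */
    class UnigramSketch {
        static final int MAX_TOTAL = 1 << 14; // assumed threshold
        final int[] count = new int[257];
        int total = 257;

        UnigramSketch() { java.util.Arrays.fill(count, 1); } // Laplace smoothing

        /** Cumulative count of all symbols ordered before symbol; its
         * interval is [lowCount(s), lowCount(s)+count[s]) out of total.
         * This is the linear traversal noted above. */
        int lowCount(int symbol) {
            int sum = 0;
            for (int i = 0; i < symbol; ++i) sum += count[i];
            return sum;
        }

        /** Update after coding a symbol, rescaling when necessary. */
        void increment(int symbol) {
            ++count[symbol];
            if (++total > MAX_TOTAL) {
                total = 0;
                for (int i = 0; i < count.length; ++i) {
                    count[i] = (count[i] + 1) / 2; // halve, round up to >= 1
                    total += count[i];
                }
            }
        }
    }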
The PPM model (Cleary and Witten 1984) predicts the next byte based on a sequence of previous bytes. It generalizes the byte frequency model by keeping counts for sequences of bytes b_1,...,b_n for some value of n (to be precise, we write PPM(n) for the nth-order PPM model).
The PPM model illustrates the final aspect of the arithmetic coding model -- escapes. A model has the option of informing the coder that it doesn't currently know how to code the symbol, providing the likelihood interval for this "escape" outcome instead; the coder then asks again. With PPM, after each escape the context is narrowed to a shorter suffix of the bytes preceding the current byte, eventually grounding out in the uniform distribution. At each escaped context in the backoff, outcomes that have been seen in longer contexts are excluded from the counts.
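The control flow for coding one byte can be sketched as follows; Context, contextFor, and codeUniform are illustrative stand-ins for the model/coder interaction, not the package's API:

    interface Context {
        boolean hasSeen(int symbol, java.util.BitSet excluded);
        void codeSymbol(int symbol, java.util.BitSet excluded);
        void codeEscape(java.util.BitSet excluded);
        void addSeenTo(java.util.BitSet excluded);
    }

    /** Escape-driven backoff for encoding one byte with PPM(n). */
    void encode(int symbol, int maxContextLength) {
        java.util.BitSet excluded = new java.util.BitSet(256);
        for (int len = maxContextLength; len >= 0; --len) {
            Context ctx = contextFor(len);        // counts for the length-len context
            if (ctx.hasSeen(symbol, excluded)) {  // seen here and not excluded?
                ctx.codeSymbol(symbol, excluded); // code with remaining counts
                return;
            }
            ctx.codeEscape(excluded); // code the escape outcome instead
            ctx.addSeenTo(excluded);  // exclude its outcomes in shorter contexts
        }
        codeUniform(symbol); // ground out in the uniform distribution
    }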
The PPM model's maximum context length is configurable through its constructor; the rescaling and maximum count parameters can be tuned through constants in the code.
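As a usage sketch only -- the class names and constructor signature here are assumptions inferred from the description above, not verified API:

    // Hypothetical: an order-8 PPM model driving arithmetic-coded output.
    // Actual class names and signatures in the distribution may differ.
    ArithCodeModel model = new PPMModel(8); // maximum context length 8
    OutputStream out =
        new ArithCodeOutputStream(new FileOutputStream("out.ppm"), model);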
Arithmetic coding relies on processing files a single bit at a time. The classes BitInput and BitOutput wrap ordinary input and output streams to provide buffered access to individual bits.
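The output half of that buffering can be sketched as follows; this illustrates the technique only and is not the actual BitOutput source:

    import java.io.IOException;
    import java.io.OutputStream;

    /** Accumulates bits into a byte buffer, flushing each full byte. */
    class BitOutputSketch {
        private final OutputStream out;
        private int buffer = 0; // bits accumulated so far
        private int nBits = 0;  // number of bits in buffer

        BitOutputSketch(OutputStream out) { this.out = out; }

        /** Writes a single bit, flushing a full byte to the stream. */
        void writeBit(boolean bit) throws IOException {
            buffer = (buffer << 1) | (bit ? 1 : 0);
            if (++nBits == 8) {
                out.write(buffer);
                buffer = 0;
                nBits = 0;
            }
        }

        /** Pads the final partial byte with zeros and closes. */
        void close() throws IOException {
            while (nBits != 0) writeBit(false);
            out.close();
        }
    }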
In order to implement PPM with tight compression, it is necessary to keep track of which symbols have been seen in escaped contexts so that their counts may be subtracted from the narrower contexts. The representation of the bytes that have been seen is handled with a compact, generic implementation of byte sets: the members are stored in the bits of four long values and accessed through shifting and masking. (This was found to be faster and smaller than a boolean array.)
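Such a byte set can be sketched in a few lines; this version is illustrative, not the package's actual class:

    /** A set of bytes packed into the 256 bits of four longs;
     * membership uses only shifting and masking. */
    class ByteSetSketch {
        private final long[] bits = new long[4]; // 4 * 64 = 256 bits

        void add(int b) { // b in 0..255
            bits[b >>> 6] |= 1L << (b & 63);
        }

        boolean contains(int b) {
            return (bits[b >>> 6] & (1L << (b & 63))) != 0;
        }
    }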