Recently, while attending a talk at the Stanford Compression Forum, I heard about a new class of entropy coders called Asymmetric Numeral Systems (ANS) coders. Considering that Arithmetic coding and Huffman coding have essentially resolved the problem of entropy coding, it is surprising to see a new development in the field. Today, just a few years after the original paper came out, ANS has already been incorporated into numerous compressors, including Zstandard (Facebook), LZFSE (Apple), CRAM (genomic SAM file compression), etc. ANS is not a single compressor but a class of entropy coders for a given distribution. Their most important property is that they achieve very accurate compression at very high compression and decompression speeds (like Pied Piper, ANS significantly improves upon the Weissman score :) ).
In this series of posts, we will try to improve our understanding of the ANS family of coders. But before we do that, let's revisit two of the most important entropy coders we have: Huffman and Arithmetic coders.
Huffman and Arithmetic Coders
Huffman coding (and, for that matter, any prefix-free code) is the fastest kind of entropy coder, as all it does is a table lookup for the unique prefix-free code of the input symbol. The prefix-free nature of the codes also allows for efficient decoding using a binary tree. However, in some cases the Huffman coder can be significantly suboptimal in terms of compression, with a gap from the entropy as high as 1 bit/symbol. One can also view Huffman coding as a Finite State Entropy (FSE) coder with a single state: for every input symbol, the encoder outputs the corresponding prefix-free code (from the lookup table) and transitions back to the same state.
On the other extreme we have the family of Arithmetic coders. Arithmetic coding can be thought of as an encoder which implicitly represents the entire input as a single state of a huge finite-state automaton (whose size is asymptotically exponential in the length of the input). Arithmetic coding is optimal in terms of compression, but has slower compression/decompression speeds.
The ANS family of coders forms the bridge between these two extremes: we can design a k-state FSE coder (for a wide range of k values) which compresses closer to the entropy than Huffman coding, with only a slight decrease in speed. This ability to trade off compression against speed gracefully is crucial to what makes the ANS family so useful in practice. We will first look at one member of the ANS family of coders: rANS, an acronym for range-ANS due to its similarities to the arithmetic encoder.
rANS
Let's consider a simple example: say we have a sequence of digits (in the range [0,\ldots,9] ), S = \{ 3,2,0,8,9,1 \} , which we want to represent by a single integer state. How can we do that? We can simply form the single number X_6 = 320891 . What we are doing is essentially the following:
\begin{aligned}
X_0 &= 0 \\
X_1 &= X_0*10 + 3 \\
X_2 &= X_1*10 + 2 \\
&\ldots \\
X_6 &= X_5*10 + 1
\end{aligned}
The final state X_6 contains all the information necessary to recover the original string. We can encode and store this state by converting it into binary using roughly \log_2 ( X_6 ) bits: thus, the larger the state, the more expensive its bit representation. These bits can then be read to recover X_6 , from which we can recursively decode S in reverse order as follows:
\begin{aligned}
X_6 &= 320891 \\
X_5 &= \lfloor X_6/10 \rfloor, s_6 = mod(X_6,10) \\
X_4 &= \lfloor X_5/10 \rfloor, s_5 = mod(X_5,10) \\
&\ldots \\
X_0 &= 0, s_1 = 3 \\
\end{aligned}
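As a quick illustration, here is a minimal Python sketch of this base-10 encode/decode (our own illustrative snippet, not part of the construction that follows):

def encode_digits(digits):
    # fold the digits into a single integer state: X_t = X_{t-1}*10 + s_t
    state = 0
    for d in digits:
        state = state * 10 + d
    return state

def decode_digits(state, n):
    # peel off one digit at a time; the digits come out in reverse order
    digits = []
    for _ in range(n):
        state, d = state // 10, state % 10
        digits.append(d)
    return digits[::-1]

print(encode_digits([3, 2, 0, 8, 9, 1]))  # 320891
print(decode_digits(320891, 6))  # [3, 2, 0, 8, 9, 1]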
This looks simple. But the question to ask is whether this representation of S is optimal: are we using the minimum possible number of bits? Well, if we look at the encoding, at time step t we are scaling the state by approximately 10 times (since X_t \approx X_{t-1}*10 ). This is reasonable if all the symbols [0,\ldots,9] are equiprobable, as we are essentially spending \log_2 10 bits per encoded symbol. In the case of a non-uniform symbol distribution this would be suboptimal, as the entropy can be far less than \log_2 10 . Can we modify the structure to improve the performance for non-uniform distributions?
One idea is to scale more frequent symbols by a factor smaller than 10 , and less frequent ones by a larger factor. This is in fact the core idea behind the rANS construction, making our standard numeral system asymmetric to achieve a more efficient representation. The rANS encoding algorithm is a very ingenious adaptation of this simple idea!
Notation
Before we describe rANS, some notation:
Let S = (s_1,s_2,s_3,\ldots, s_n) be the input string of symbols from the alphabet set \mathcal{A} = \{a_1,a_2,\ldots,a_k \} of size k
We assume the data comes from a distribution characterized by the frequency counts \mathcal{F} = \{ F_{a_1}, F_{a_2},\ldots, F_{a_k} \} , which are integers proportional to the probability mass distribution \{p_1,p_2,\ldots,p_k\} of the symbols.
Let M = \sum_{i=1}^k F_{a_i} . Then, p_i = \frac{F_{a_i}}{M}
We also define the cumulative frequency counts C_{a_i} = \sum_{j=1}^{i-1} F_{a_j} , which correspond to the cumulative distribution of the symbols
Example: Let the alphabet be \mathcal{A} = \{ A,B,C\} . Then, we can consider input generated using frequency counts F = \{ 3, 3, 2\} , which corresponds to probabilities \{ \frac{3}{8}, \frac{3}{8}, \frac{2}{8} \} .
The cumulative frequency counts in this case corresponds to C = \{ 0, 3, 6 \}
rANS keeps track of the input using a single integer state. Let X_t represent the integer state of rANS after it has processed t input symbols. We initialize the rANS state to X_0 = 0 . rANS updates the state X_t based on the previous state X_{t-1} and the current symbol s_t .
X_t = C_{rANS}(X_{t-1},s_t)
The compression output is the final state X_n , represented using \lceil \log_2 X_n \rceil bits. X_n (along with the number of encoded symbols n ) is used to decode the entire input string S . The decoder works in exactly the reverse direction: recursively extracting the previous state and the last encoded symbol.
s_t, X_{t-1} = D_{rANS}(X_{t})
rANS Encoding
So here it goes:
X_t = \left\lfloor \frac{X_{t-1}}{F_{s_t}} \right\rfloor * M + C_{s_t} + mod(X_{t-1}, F_{s_t})
That's all! This is the encoding step of rANS. We also present a Python snippet for this encoding step:
import numpy as np

def C_rANS(s, state, symbol_counts):
    total_counts = np.sum(symbol_counts)  # Represents M
    cumul_counts = np.insert(np.cumsum(symbol_counts), 0, 0)  # the cumulative frequencies
    s_count = symbol_counts[s]  # current symbol count/frequency
    next_state = (state // s_count) * total_counts + cumul_counts[s] + (state % s_count)
    return next_state
rANS Encoding Example
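As a small worked illustration (our own choice of input string), here is the encoder applied to the example from the Notation section, with the alphabet (A, B, C) represented by the indices (0, 1, 2) and \mathcal{F} = \{3, 3, 2\} :

symbol_counts = [3, 3, 2]  # F for the alphabet (A, B, C), so M = 8 and C = [0, 3, 6]
state = 0
for s in [1, 0, 2]:  # encode the string "B A C", with symbols given by their indices
    state = C_rANS(s, state, symbol_counts)
    print(state)  # prints 3, then 8, then 38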
Let's also state the decoder for completeness.
rANS Decoding
Let C\_inv(y) be the inverse function of the cumulative frequency, where: C\_inv(y) = a_i , if C_{a_i} \leq y < C_{a_{i+1}} . Then, the decoder, D_{rANS}(X_{t}) is defined as follows:
\begin{aligned}
slot &= mod(X_t,M) \\
s_t &= C\_inv(slot) \\
X_{t-1} &= \left\lfloor \frac{X_t}{M} \right\rfloor * F_{s_t} + slot - C_{s_t}
\end{aligned}
A Python snippet for the decoding step:
def D_rANS(state, symbol_counts):
    total_counts = np.sum(symbol_counts)  # Represents M
    cumul_counts = np.insert(np.cumsum(symbol_counts), 0, 0)  # the cumulative frequencies

    # The cumulative frequency inverse function
    def cumul_inverse(y):
        for i, _s in enumerate(cumul_counts):
            if y < _s:
                return i - 1

    slot = state % total_counts  # compute the slot
    s = cumul_inverse(slot)  # decode the symbol
    prev_state = (state // total_counts) * symbol_counts[s] + slot - cumul_counts[s]  # update the state
    return s, prev_state
rANS Decoding Example
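Continuing the same illustration, decoding the final state 38 (knowing that 3 symbols were encoded) recovers the string in reverse order:

symbol_counts = [3, 3, 2]  # same example: (A, B, C) with M = 8
state, decoded = 38, []
for _ in range(3):  # the number of encoded symbols must be known
    s, state = D_rANS(state, symbol_counts)
    decoded.append(s)
print(decoded[::-1])  # [1, 0, 2], i.e. "B A C"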
One can understand rANS encoding as a 2-step process: in the first step, we choose an M -sized block. For a state X_{t-1} , we choose the block with blockId = \lfloor {X_{t-1}}/{F_{s_t}} \rfloor . Once the blockId is fixed, we choose a slot out of the M allowed integers in the block. For the next symbol s_t , the slot can be chosen in the range [C_{s_t},C_{s_t} + F_{s_t}-1] . The next state X_t is then composed as: X_t = blockId*M + slot
Having distinct slot ranges for every symbol makes it possible to decode s_t by only looking at slot = mod(X_t,M) during the decoding, using the C\_inv function. We can also retrieve the blockId from the state X_t using blockId = \lfloor X_t/M \rfloor . As blockId = \lfloor X_{t-1}/F_{s_t} \rfloor , the only missing piece of information required to retrieve X_{t-1} is mod(X_{t-1}, F_{s_t}) . Amazingly, we can accommodate this information during the encoding by choosing the slot to be C_{s_t} + mod(X_{t-1}, F_{s_t}) , which lies in the allowed range [C_{s_t},C_{s_t} + F_{s_t}-1] .
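For instance, in the running example ( \mathcal{F} = \{3,3,2\}, M = 8 ), encoding s_t = C from the state X_{t-1} = 8 gives:
\begin{aligned}
blockId &= \lfloor 8/F_{C} \rfloor = \lfloor 8/2 \rfloor = 4 \\
slot &= C_{C} + mod(8, F_{C}) = 6 + 0 = 6 \\
X_t &= blockId*M + slot = 4*8 + 6 = 38
\end{aligned}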
Optimality of rANS
Now that we have convinced ourselves that rANS is indeed a lossless codec, let's see why rANS achieves a near-optimal compression ratio. Notice that:
\begin{aligned}
\frac{X_t}{X_{t-1}} &\approx \frac{M}{F_{s_t}} = \frac{1}{p(s_t)} \\
{X_n} &\approx \prod_{t=1}^n \frac{1}{p(s_t)} \\
\log_2(X_n) &\approx \sum_{t=1}^{n} \log_2 \frac{1}{p(s_t)} \approx n H(P) \\
\end{aligned}
For every symbol s_t , we use approximately \log_2 \frac{1}{p(s_t)} bits for its representation. This is optimal, as its expected value is the entropy of the source. This gives a nice justification for the optimality of rANS, which can be made rigorous with some more math. We will leave it at that; those interested can look at the footnote for the complete proof.
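As a rough sanity check of this argument (our own illustrative snippet, using exact big-integer arithmetic instead of the numpy version above), we can encode a long i.i.d. sequence and compare \log_2(X_n)/n with the entropy:

import math, random

F = [3, 3, 2]  # the running example, so M = 8 and C = [0, 3, 6]
M, C = sum(F), [0, 3, 6]
probs = [f / M for f in F]

random.seed(0)
n = 10000
s_input = random.choices(range(len(F)), weights=F, k=n)

state = 0
for s in s_input:  # the plain rANS encoding step, with exact integer arithmetic
    state = (state // F[s]) * M + C[s] + (state % F[s])

entropy = -sum(p * math.log2(p) for p in probs)
print(state.bit_length() / n, entropy)  # bits/symbol vs. entropy; the two should be close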
Practical Aspects
Let's take a look at implementing rANS in practice!
Choice of M : Notice that the decoding can be made very fast by choosing M to be of the form 2^r . The modulus operation ( slot = mod(X_t, M) ) in the decoding then corresponds to picking the last r bits, while the division operation ( \lfloor X_t/M \rfloor ) becomes equivalent to a bit-shift, both of which are very fast.
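A minimal sketch with the numbers from the running example (taking r = 3 , so M = 8 , and the state 38 from before):

r = 3  # M = 2**r = 8, as in the running example
M = 1 << r
state = 38  # the state from the encoding example above
slot = state & (M - 1)  # slot = mod(X_t, M): keep the low r bits -> 6
block_id = state >> r  # floor(X_t / M): a right shift -> 4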
Choice of \mathcal{F} frequencies : Note that one operation we need to perform is choosing frequency counts F_a so that p_a = F_a/M approximates the true probability q_a . This is one crucial source of inaccuracy in the ANS design (unlike Arithmetic coding), as we know that designing a code for the distribution p_a when the true distribution is q_a always leads to a loss equal to the KL-divergence D(q_a||p_a) .
Reverse Encoding : Notice that, unlike the Arithmetic coder, rANS decodes the input in reverse order. One simple solution to this problem is to process the input in blocks and parse each block in the reverse direction during encoding. This adds to the encoding latency, but is practically alright as input is generally processed and transmitted in blocks anyway.
Adaptive context-based rANS: One of the most popular usages of Arithmetic coding is with context-based modeling. A lot of modeling techniques such as PPM, CTW, Markov modeling, etc. employ arithmetic coding over a distribution modeled using the past samples. This modeling becomes tricky for rANS due to its reverse-order encoding: rANS essentially needs to look ahead of the current reverse-order input during the encoding, so that the decoder has the necessary information to figure out the context model.
Alias mapping instead of cumulative inverse : Recall that during encoding, for the input symbol s_t , we map the slot into the range [C_{s_t}, C_{s_t} + F_{s_t} - 1] . During the decoding, in the C\_inv function, we need to perform a binary search to locate the appropriate range to decode s_t , which can be relatively slow compared with the other rANS operations. If we understand the encoding, then it is clear that we need to map the slot to a range of size F_{s_t} , but we do not need that range to be contiguous. This observation is used by Fabian Giesen to subtly improve the rANS decoding, using a technique known as Alias mapping. You can read more about it here:
Finite Precision Arithmetic: In practice, as we will be using finite-precision arithmetic, the state variable will be bounded by something like 2^{64} . Although this seems large, rANS encoding reaches this limit very rapidly (in fact the state grows exponentially in the number of symbols). Thus, we need to modify the rANS construction to keep the state bounded in some range. We will next look at this important variant, called streaming-rANS.
Streaming-rANS
To apply rANS to a streaming input (or even a moderately large input), we can come up with some very simple modifications. One simple idea is to set a maximum state size, say H = 2^{64} , and encode the symbols in chunks, each chunk containing as many symbols as possible without the state exceeding H . In this case, along with the final state, we also need to store the number of symbols in every chunk of the input. Another variation is to encode in blocks containing a fixed number of input symbols (say 20), while at the same time ensuring that the state does not overflow. One issue with both of these variants is that we lose some fractional bits every time we start a new chunk or block. Is there a way to utilize these partial bits?
One idea is to force the state to lie in a specific interval \mathcal{I} = [L,H] at every timestep t . Let the state X_{t-1} at timestep t-1 be in the range \mathcal{I} . Now, if on encoding the next symbol s_t the state X_t = C_{rANS}(X_{t-1},s_t) is going to be in the range \mathcal{I} , then well and good :), we don't have to do anything. However, if not, then we need to decrease the value of the state X_{t-1} appropriately. Let {I}_a = [L_a,H_a] be the interval corresponding to alphabet symbol a , so that for any state z \in {I}_a we have C_{rANS}(z,a) \in \mathcal{I} , and for z \notin {I}_a , C_{rANS}(z,a) \notin \mathcal{I} . Then, we essentially need to modify the state X_{t-1} so that the mapped value X'_{t-1} lies in the range {I}_{s_t} . For example:
\begin{aligned}
\mathcal{A} &= \{ A,B,C\}, \mathcal{F} = \{ 3, 3, 2\} \\
\mathcal{I} &= \{ 8,9,10,\ldots,14,15\} \\
{I}_A &= \{ 3,4,5\} \\
{I}_B &= \{ 3,4,5\} \\
{I}_C &= \{ 2,3\}
\end{aligned}
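For instance, for z \in {I}_A = \{3,4,5\} we get C_{rANS}(z,A) \in \{8,9,10\} \subset \mathcal{I} , whereas C_{rANS}(2,A) = 2 and C_{rANS}(6,A) = 16 both fall outside \mathcal{I} .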
Notice that, as the size of the interval |{I}_a| is in general smaller than |\mathcal{I}| , there is no one-to-one mapping between the mapped value X'_{t-1} and the original state X_{t-1} . Thus, during the decoding, although the decoder will be able to figure out s_t, X'_{t-1} = D_{rANS}(X_t) , it will need some additional information to correctly recover the state X_{t-1} from X'_{t-1} . Is there an elegant way to perform this mapping (and the reverse mapping at the decoder)? It seems there is! Jarek Duda, in his paper, gives a simple but very ingenious way to achieve this!
State Mapping
One simple idea of mapping the state X_{t-1} \in \mathcal{I} to the allowed range {I}_{s_t} is as follows: take out bits from X_{t-1} until it lands in the interval {I}_{s_t}. Thus, effectively the mapping can be described as:
X'_{t-1} = \lfloor X_{t-1}/{2^r} \rfloor
where r is the smallest number, such that X'_{t-1} \in {I}_{s_t}
Is this always possible? Are there cases where we will jump over the interval {I}_{s_t}? For example: Let {I}_A = \{ 5,6,7,8\}. If we start with a state of 17 , then on taking out bits we obtain the sequence: 17,8,4 , none of which lie in the {I}_A range. This problem can be solved by choosing the intervals {I}_{s_t} sufficiently large, so that we do not jump over them. So how large do we need to choose the intervals? One sufficient condition is the following:
\forall a \in \mathcal{A}: {I}_a = [L_a, H_a], H_a \geq 2L_a-1
Intuitively, every interval \mathcal{I}_a spans an octave: due to which it is not possible to jump over the intervals by taking out bits. Also note that we want this condition to be satisfied by all symbol ranges \mathcal{I}_a, \forall a \in \mathcal{A} , corresponding to a single range \mathcal{I} .
If we now think about how the decoding should proceed, then it is also clear from the mapping that the "additional information" required by the decoder to reverse-map X'_{t-1} to X_{t-1} is going to be the r bits, b_{t-1} = mod(X_{t-1},2^r) , taken out of X_{t-1} , since:
X_{t-1} = X'_{t-1}*2^{r} + b_{t-1}
These bits b_{t-1} are appended to a single BitStream at every timestep. Note that, in general, the decoder also needs to know the number of bits to pop out from this BitStream , since it can be different at different timesteps. One simple solution is to perform exactly the reverse of what the encoder does, i.e., read in bits from the BitStream and append them to the mapped state X'_{t-1} until it lands in the interval \mathcal{I} . Will this always work? One problem case is when more than one candidate for the state X_{t-1} lies in the valid interval \mathcal{I} , and the decoder stops early. For example:
\begin{aligned}
\mathcal{I} &= \{ 8,9,10,\ldots,16,17 \} \\
X'_{t-1} &= 4 \\
BitStream &= \{0100\ldots\}
\end{aligned}
Then, the reverse-mapped state X_{t-1} can be any of \{8,17,34,\ldots\} , out of which both 8 and 17 are valid states. Now, if the true state X_{t-1} = 17 , then we are going to have an error, as the decoder is going to stop after reading a single bit from the BitStream (as 8 \in \mathcal{I} ). This problem can be solved by choosing the interval \mathcal{I} small enough, so that there is only one possibility for the reverse mapping of X'_{t-1} to the state X_{t-1} . How small should \mathcal{I} be? One sufficient condition is:
\mathcal{I} = [L, H], H \leq 2L-1
Notice that this condition is the exact opposite of what we got during the encoding for the symbol ranges! The intuition here is also analogous: adding a bit at least doubles the state, so we choose the range \mathcal{I} such that no element is as large as twice another.
Streaming-rANS Final Form
Let us summarize all the conditions on the ranges we discussed in the previous section:
\begin{aligned}
&\forall a \in \mathcal{A}, \forall z \in \mathcal{I}_a : C_{rANS}(z,a) \in \mathcal{I} \\
&\forall a \in \mathcal{A} : \mathcal{I}_a = [L_a, H_a], H_a \geq 2L_a-1 \\
&\mathcal{I} = [L, H], H \leq 2L-1
\end{aligned}
Are there some simple parameters for which we achieve all these conditions? You bet, there are! Jarek Duda, again in his paper, gives a simple rule for which all these conditions are satisfied. For any positive integer l :
\begin{aligned}
\mathcal{I} &= [lM, 2lM-1] \\
\forall a \in \mathcal{A}: \mathcal{I}_{a} &= [lF_a, 2lF_a - 1]
\end{aligned}
It is easy to see why this satisfies the second and third conditions mentioned above. The first condition holds because C_{rANS}(\cdot,a) is increasing in the state, with C_{rANS}(lF_a,a) = lM + C_a \geq lM and C_{rANS}(2lF_a - 1,a) = (2l-1)M + C_a + F_a - 1 \leq 2lM - 1 , so every state in \mathcal{I}_a maps into \mathcal{I} . Phew! That was quite a bit of engineering to get rANS working in practice! For completeness, take a look at the Python code for Streaming-rANS!
Streaming-rANS Encoder Code
def Streaming_rANS_encoder(s_input, symbol_counts, range_factor):
    total_counts = np.sum(symbol_counts)  # Represents M
    bitstream = []  # initialize stream
    state = range_factor * total_counts  # state initialized to lM
    for s in s_input:  # iterate over the input
        # Output bits to the stream to bring the state into the range
        # I_s = [l*F_s, 2*l*F_s - 1] required for encoding the next symbol
        while state >= 2 * range_factor * symbol_counts[s]:
            bitstream.append(state % 2)
            state = state // 2
        state = C_rANS(s, state, symbol_counts)  # The rANS encoding step
    return state, bitstream
Streaming-rANS Decoder Code
def Streaming_rANS_decoder(state, bitstream, symbol_counts, range_factor):
    total_counts = np.sum(symbol_counts)  # Represents M
    # perform the rANS decoding
    s_decoded, state = D_rANS(state, symbol_counts)
    # remap the state into the acceptable range
    while state < range_factor * total_counts:
        bits = bitstream.pop()
        state = state * 2 + bits
    return s_decoded, state
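Here is a hypothetical round trip using the two functions above, again with the example counts \mathcal{F} = \{3,3,2\} and range factor l = 2 (remember that the decoder emits the symbols in reverse order):

symbol_counts = [3, 3, 2]  # (A, B, C) with M = 8
s_input = [1, 0, 2]  # "B A C"
state, bitstream = Streaming_rANS_encoder(s_input, symbol_counts, range_factor=2)
print(state, len(bitstream))  # final state 30, with 4 bits streamed out

decoded = []
for _ in range(len(s_input)):  # the number of symbols must be known
    s, state = Streaming_rANS_decoder(state, bitstream, symbol_counts, range_factor=2)
    decoded.append(s)
print(decoded[::-1])  # [1, 0, 2], i.e. "B A C"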
You can also try out the Streaming-rANS examples below.
Streaming-rANS Encoding Example
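As a worked illustration (our own choice of input), take \mathcal{F} = \{3,3,2\} with l = 2 , so that \mathcal{I} = [16,31] , {I}_A = {I}_B = [6,11] , {I}_C = [4,7] , and encode S = (B, A, C) starting from X_0 = lM = 16 :
\begin{aligned}
s_1 = B:&\quad 16 \rightarrow 8 \text{ (bit } 0 \text{ out)}, \quad X_1 = C_{rANS}(8,B) = 21 \\
s_2 = A:&\quad 21 \rightarrow 10 \text{ (bit } 1 \text{ out)}, \quad X_2 = C_{rANS}(10,A) = 25 \\
s_3 = C:&\quad 25 \rightarrow 12 \rightarrow 6 \text{ (bits } 1,0 \text{ out)}, \quad X_3 = C_{rANS}(6,C) = 30
\end{aligned}
The compressed output is the final state 30 together with BitStream = (0,1,1,0) .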
Streaming-rANS Decoding Example
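Decoding the above output ( X_3 = 30 , BitStream = (0,1,1,0) , 3 encoded symbols) proceeds in reverse, reading bits back from the end of the BitStream :
\begin{aligned}
X_3 = 30:&\quad slot = 6 \Rightarrow s_3 = C, \quad X'_2 = 6 \rightarrow 12 \rightarrow 25 \text{ (bits } 0,1 \text{ read)} \\
X_2 = 25:&\quad slot = 1 \Rightarrow s_2 = A, \quad X'_1 = 10 \rightarrow 21 \text{ (bit } 1 \text{ read)} \\
X_1 = 21:&\quad slot = 5 \Rightarrow s_1 = B, \quad X'_0 = 8 \rightarrow 16 \text{ (bit } 0 \text{ read)}
\end{aligned}
This recovers (C, A, B) , i.e. the input S = (B, A, C) in reverse order, and ends at the initial state X_0 = 16 .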
Optimality of Streaming-rANS
Restricting the state to the range \mathcal{I} no longer guarantees that the average codelength will be equal to the entropy. The optimality analysis also becomes tricky, as we need to account for the size of the BitStream , which also depends on the order in which the symbols appear in the input sequence. However, it can still be shown that as the range parameter l increases, the performance of Streaming-rANS asymptotically approaches the entropy. This was shown in the interesting work by Yokoo in [cite], which uses a clever way of approximating the distribution of the state over \mathcal{I} .
A theoretical result showing the rate of convergence towards the entropy as we increase l is still open. However, Jarek Duda, again in his work [cite], gave the following relation for the performance of Streaming-rANS for a given range parameter l :
L_{avg} = H + \mathcal{O}(1/{l^2})
He provides experimental evidence and some arguments for the relationship, but theoretically sound arguments are still elusive.
Practical Aspects
Based on the discussion presented, Streaming-rANS looks like a very good practical replacement for Arithmetic coding and range coding. However, there are a few more tweaks which can be used to further improve the rANS encoding:
Streaming out bytes: Our Streaming-rANS construction was based on the premise that we stream out one bit at a time during the encoding, and read in bits one at a time during the decoding. In practice, it is generally much more efficient to write/read larger chunks of bits at a time, such as a byte or a 64-bit word. We can in fact achieve this with a simple change to the Streaming-rANS construction: say we are streaming k bits at a time, then we choose the symbol ranges so that:
\begin{aligned}
\mathcal{I} &= [lM, 2^klM-1] \\
\forall a \in \mathcal{A}: \mathcal{I}_{a} &= [lF_a, 2^klF_a - 1]
\end{aligned}
This makes sure that the streaming works fine.
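A minimal sketch of the k -bit variant of the encoder loop, under the interval choice above (this mirrors the bit-wise Streaming_rANS_encoder from before; the decoder would analogously read k bits at a time while the state is below lM ):

def Streaming_rANS_encoder_k(s_input, symbol_counts, range_factor, k=8):
    total_counts = np.sum(symbol_counts)  # Represents M
    stream = []  # each entry is a k-bit chunk (a byte for k = 8)
    state = range_factor * total_counts  # state initialized to lM
    for s in s_input:
        # stream out k bits at a time until the state is in [l*F_s, 2^k*l*F_s - 1]
        while state >= (1 << k) * range_factor * symbol_counts[s]:
            stream.append(state % (1 << k))
            state = state // (1 << k)
        state = C_rANS(s, state, symbol_counts)
    return state, stream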
SIMD and Parallel Streams: Fabian Giesen, in the paper [cite], presents a nice way to utilize the SSE instruction set to parallelize the Streaming-rANS encoding and decoding. It's an interesting work which I am still in the process of completely digesting. Feel free to take a look at the paper if interested!
Caching of Encoder/Decoder computations: Note that, based on the newly observed symbol, the encoder maps the current state in the finite range \mathcal{I} = [lM, 2lM-1] to another state in the same range, while at the same time sending out a few bits to the BitStream . Thus, the encoder (as well as the decoder) can be viewed as a special type of finite-state machine: a Finite State Entropy (FSE) coder. To speed up the encoding/decoding, one can therefore store the Streaming-rANS computations in a cache table. Here is an example to try out!
Streaming-rANS as a FSE
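As an illustration of the caching idea (a sketch of our own, building on the functions above, not the author's implementation), one can precompute, for every state in \mathcal{I} = [lM, 2lM-1] and every symbol, the bits streamed out and the next state:

def build_encoding_table(symbol_counts, range_factor):
    M = int(np.sum(symbol_counts))
    l = range_factor
    table = {}
    for s in range(len(symbol_counts)):
        for x in range(l * M, 2 * l * M):  # every state in I = [lM, 2lM - 1]
            state, bits = x, []
            while state >= 2 * l * symbol_counts[s]:  # same renormalization as the encoder
                bits.append(state % 2)
                state = state // 2
            table[(x, s)] = (C_rANS(s, state, symbol_counts), bits)
    return table

table = build_encoding_table([3, 3, 2], range_factor=2)
print(table[(16, 1)])  # encoding B from state 16: next state is 21 and one bit (0) is streamed out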
This use of cached encoding and decoding tables can lead to massive speed-ups, as all we are doing is looking up values in a table: thus making Streaming-rANS encoding/decoding as fast as Huffman coding! Wow! This cached Streaming-rANS coder in fact belongs to a bigger class of ANS coders known as tANS coders , which is aptly an abbreviation for tabled-ANS. In the next section we will try to understand the tANS class of coders in more detail.
tANS [In Progress]
We took a look at the cached encoding tables for Streaming-rANS in the previous section. Let us also look at the decoding tables for the same.
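Analogously (again a sketch under the same assumptions as above), the decoding table stores, for every state in \mathcal{I} , the decoded symbol and the mapped-down state; the bits to be read back from the BitStream remain data-dependent:

def build_decoding_table(symbol_counts, range_factor):
    M = int(np.sum(symbol_counts))
    l = range_factor
    table = {}
    for x in range(l * M, 2 * l * M):  # every state in I = [lM, 2lM - 1]
        s, prev_state = D_rANS(x, symbol_counts)
        table[x] = (s, prev_state)
    return table

table = build_decoding_table([3, 3, 2], range_factor=2)
print(table[30])  # state 30 decodes to symbol C (index 2) with mapped-down state 6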
If you notice, the symbol decoding is still very simple: every state in the range \mathcal{I} = [lM, 2lM-1] is decoded into a unique symbol. In fact, because the table is formed using the rANS construct, there are distinct output state ranges for every symbol. But note that, as we are now storing the encoding and decoding tables, one can get very creative with the encoding/decoding: we can permute the state assignment corresponding to every symbol, and that will still give us a valid Finite-State Entropy coder. This is pretty amazing, as it gives us an exponentially large family of finite-state encoders for a given state size.