C K Fitch - 24th September 1991

 

An alternative to Huffman Code

 

To date, Huffman Code stands as the most efficient direct method of encoding an arbitrary number of tokens of varying frequency into variable length binary sequences. However, it is not particularly easy to implement. For those of you who would prefer something easier to digest, as well as to implement, while being of comparable efficiency, I'd like to introduce you to Unary Prefix Code (UP Code).

 

Just like sorting algorithms, where there is still a place for the bubble sort, I think it would be a pity if we forgot the less optimal coding methods. They can be quite educational, if not useful in some situations.

 

Like many programmers I've done my share of re-inventing the wheel. Indeed, most programmers usually recognise when a discovery is liable to have been made before and is thus a mere re-invention. Of course there are times when things go the other way; a real discovery is dismissed by its author until it's recognised to be original (nearly always by someone else). This was the case with Huffman code (see Scientific American, September 1991, p27). David A Huffman developed the method in response to a term paper problem set by his professor, who surprised him by revealing that, until then, its solution had not been perfected.

 

I discovered Unary Prefix Code in ignorance of Huffman Code or of the idea of using binary digits as directions for traversing a coding tree. For those of you who don't know it, a detailed description of how to implement Huffman Code is in the book "Algorithms in C" by Robert Sedgewick[1]. When I heard about Huffman Code I thought that it was what I'd re-invented. No such luck. I'd only discovered a less optimal coding scheme that I suspect was old hat to David Huffman and his fellow academics. Nevertheless, it does have its merits and, though not optimal, has the appeal of being simple to understand and implement.

 

The problem behind all such encoding is this: given N different items, each occurring Fn times in a sequence, how can you represent each item as a variable length binary code so as to use the fewest binary digits in the encoded sequence? Obviously the items that occur most are given the shorter codes, but the question is what codes should be used and precisely how they should be allocated. A typical application is that of compressing an 8 bit text file where only a few of the 256 possible characters are used and, furthermore, those that are have markedly different frequencies ('E' occurs far more often than 'Z').

 

Unary Prefix Code is just what its name suggests: a binary number given a unary most significant digit. Moreover, the binary number, or suffix, is of variable length.

 

Now, being unary, the prefix extends the range of code lengths available and so becomes ideal in situations where there may be a wide variation in frequency. It allows total flexibility of code length, whereas with a fixed length binary prefix you sacrifice the ability to have very short codes for the lesser benefit of keeping the long codes shorter.

 

One example of a useful binary prefix code is where the msb of a byte determines whether the following value is stored in one byte or two. This allows values of 0-127 in 8 bits and values of 128-32895 in 16 bits.
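
As a rough sketch of how such a scheme might be implemented (assuming <stdio.h>; the helper names putval() and getval() are mine, purely for illustration):

static int putval(unsigned v,FILE *f)   /* Write v in the one-or-two byte scheme */
{
 if (v<128)
  return(putc(v,f));        /* 0xxxxxxx: values 0-127 in one byte */
 v-=128;                    /* Re-base 128-32895 to 0-32767 */
 putc(0x80|(v>>8),f);       /* 1xxxxxxx: msb set, then top 7 bits */
 return(putc(v&0xFF,f));    /* Low 8 bits in the second byte */
}

static unsigned getval(FILE *f)          /* Read a value written by putval() */
{unsigned v=getc(f);

 if (v&0x80)                             /* msb set: two byte form */
  v=(((v&0x7F)<<8)|getc(f))+128;
 return(v);
}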

 

So, in summary, UP code uses the prefix to encode information about which group of items are represented by the suffix and how many items are in the group.

 

You're probably wondering by now how unary numbers can be implemented on a binary processor. Well, you can do it in the same way that you can embed a series of decimal digits in hexadecimal, where you consider every digit decimal until you find one greater than 9. Unary is quite similar: every '0' bit is unary until you find a '1' (reading from left to right). Thus 0,1,2,3 in unary is '','0','00','000', and '1','01','001','0001' in BCU (Binary Coded Unary). I won't go into philosophical discussions here about whether zero can be represented in unary!
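
To make the convention concrete, here is a tiny sketch (my own illustration, assuming <stdio.h>) that writes a value in BCU:

static void put_bcu(int n)   /* Write n in Binary Coded Unary, eg 2 -> "001" */
{
 while (n--)
  putchar('0');              /* n unary digits */
 putchar('1');               /* Terminator */
}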

 

                        Table 1: Binary and unary compared

Decimal   Binary (3 bit)   Unary     Binary Coded Unary   Unary Prefixed Binary
0         000              (empty)   1                    10
1         100              0         01                   11
2         010              00        001                  0100
3         110              000       0001                 0110
4         001              0000      00001                0101
5         101              00000     000001               0111
6         011              000000    0000001              001000

                        NB: Codes are written as bit streams, read left to right, with binary digits least significant first.

 

Let me demonstrate with a very simple example, the 11 letter word: "ABRACADABRA". 'A' occurs five times, 'B' & 'R' twice and 'C' & 'D' once.

 

In 8 bit ASCII this would occupy 88 bits (we'll presume the length does not need to be encoded).

 

We could encode the letters as a three bit character set and reduce the size to 33 bits.

 

In BCU we could code 'A'='1', 'B'='01', 'R'='001', 'C'='0001', and 'D'='0000'. This would reduce the size to (1x5+2x2+3x2+4x1+4x1)=(5+4+6+4+4)=23 bits. Notice that the code for 'D' doesn't need to be terminated, since the decoder will know that unary 4 is the largest code used.

 

In UP code we'd use 'A'=1, 'B','R','C'&'D'=0xx. This gives a size of (1x5+3x(2+2+1+1))=(5+3x6)=23 bits. Again, the last code's prefix doesn't need to be terminated.

 

In Huffman code we'd use 'A'=1, 'B'=00, 'R'=010, 'C'=0110, 'D'=0111. This gives a size of (1x5+2x2+3x2+4x1+4x1)=(5+4+6+4+4)=23 bits.


 

                        Table 2: Comparison of encoding ABRACADABRA

Item         ASCII   3 bits   BCU   UP Code   Huffman Code
A = 5 x      8       3        1     1         1
B = 2 x      8       3        2     3         2
R = 2 x      8       3        3     3         3
C = 1 x      8       3        4     3         4
D = 1 x      8       3        4     3         4
Total bits   88      33       23    23        23

 

UP code often gives performance very close to that of Huffman code; in fact, it takes a carefully contrived example to show a marked difference.

 

I've already referred you to Sedgewick's "Algorithms in C" for a description of Huffman Code, but it doesn't contain one of Unary Prefix Code. Although I won't go into details of how to write routines to compress and decompress files using UP Code, I will provide you with the critical routine that calculates the number and lengths of UP codes. Beyond that, I will presume you have a level of competence sufficient to pursue the idea further.

 

Listing 1: The C function getupcode()

/* ENTRY >> occ[]: Number of occurrences of n items, sorted in descending order */
/* EXIT  >>  ul[]: Number of bits in suffix of each of (returned) prefix codes  */

static int getupcode(int occ[],int n,int ul[])
{int total,    /* Total occurrence of current and as yet unallocated items */
     item,     /* Item currently being allocated */
     oc,       /* Occurrence of item(s) to be allocated to current unary prefix */
     u,        /* Index of current unary prefix */
     i;        /* General index */

 for (total=0,i=n; i--; total+=occ[i])    /* Compute total occurrence of all items */
  ul[i]=0;                                /* Reset unary prefix lengths */

 for (u=item=0; item<n; ++u,total-=oc)         /* For each prefix, until all items processed (maintaining total of unallocated occurrences) */
  for (oc=occ[item++]; oc*3<=total; ++ul[u])   /* Extend suffix until occurrences exceed half the remainder (oc>(total-oc)/2) */
   for (i=1<<ul[u]; i--; oc+=occ[item++]);     /* Accumulate occurrences of other half of items in extended suffix group */

 while (u>1 && ul[u-1]==ul[u-2])   /* Combine final suffixes of equal length, */
  ++ul[--u-1];                     /* decrementing number of prefixes and doubling items in penultimate suffix group */

 return(u);    /* Return number of unary prefixes */
}

 

 

Listing 1 is the C function getupcode(). It takes as input a sorted array of the occurrence counts of each of n items, fills an array with the suffix length of each unary prefix, and returns the number of prefixes.

 

Listing 2: A C program to print out the results of the function getupcode()

/* Program to demonstrate Unary Prefix Code */
/* By Crosbie Fitch 26/9/91 */

#include <stdio.h>

#define DIM(A) (sizeof(A)/sizeof(*(A)))

static int item_occ[]={11,6,5,5,4,4,3,3,3,3,2,2,2,2,2,1,1,1};

#define MAXCODEN DIM(item_occ)

static int getupcode(int [],int,int []);

int main(void)
{static int upcode[MAXCODEN],upcoden;
 int i,j;

 upcoden=getupcode(item_occ,MAXCODEN,upcode);

 for (i=0; i<upcoden; ++i)
 {for (j=i; j--; )
   putchar('0');             /* Unary prefix of i zeros */
  if (i<upcoden-1)
   putchar('1');             /* Terminator (omitted on the last prefix) */
  for (j=upcode[i]; j--; )
   putchar('x');             /* One 'x' per suffix bit */
  putchar('\n');
 }

 return(0);
}

 

 

Listing 2 is a demonstration program that outputs the list of UP codes resulting from a particular list of occurrence counts. Given the ABRACADABRA example {5,2,2,1,1}, it would produce:

 

            1

            0xx

 

The zeros and ones make up the unary prefix and the 'x's indicate the binary suffix bits. The occurrence counts in Listing 2 (item_occ[]) are those of the example string from Sedgewick; the program produces the output shown in the first column of Table 3.

 

                        Table 3: UP Code details of the string below

A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS

UP Code    Occurrences   Letters
1xx        11,6,5,5      _,I,E,N
01xx       4,4,3,3       M,S,A,B
001xx      3,3,2,2       O,T,D,G
0001x      2,2           L,R
00001      2             U
000001x    1,1           C,F
000000     1             P

 

Note that the codes are allocated in unary order to the items in sorted occurrence count order, with the suffix bits, like everything else, read least significant first. Thus the first item, occurring 11 times, is given the code 100; the 2nd, of 6 times, 110; the 3rd, of 5, 101; and so on. The encoded text would start off: 01011000110110...
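
To make the allocation explicit, here is a sketch (mine, not one of the original listings) that prints the complete code of every item in sorted order, given getupcode()'s results, with the suffix bits written least significant first as in Table 1:

/* ul[]: suffix lengths; un: number of prefixes, as returned by getupcode() */
static void putcodes(int ul[],int un)
{int u,k,b;

 for (u=0; u<un; ++u)                 /* For each unary prefix */
  for (k=0; k<(1<<ul[u]); ++k)        /* For each item in its suffix group */
  {for (b=u; b--; )
    putchar('0');                     /* Unary prefix */
   if (u<un-1)
    putchar('1');                     /* Terminator (omitted on last prefix) */
   for (b=0; b<ul[u]; ++b)
    putchar('0'+((k>>b)&1));          /* Suffix, least significant bit first */
   putchar('\n');
  }
}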

 

The reason UP code isn't as optimal as Huffman code is that UP code restricts the size of suffix groups to powers of two, whereas Huffman code can have totally irregular suffixing. Thus instead of having to have four 3 bit codes, as in 1xx, Huffman code is able to have the codes 100, 101xxx and 11x: three 3 bit codes, with the fourth used as a prefix for a longer group.

 

I'd recommend UP code for high speed situations where there is less need for optimal compression of large files. Decoding requires only simple bit shift operations that increment an index into the array of suffix lengths, with corresponding indices into the table of original items. Incidentally, if speed is paramount, it may be preferable to leave the terminating bit on the last prefix.
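
A minimal sketch of such a decoder follows; nextbit() is a hypothetical routine returning the next input bit, and first[] is assumed to hold the index of the first item in each prefix group:

extern int nextbit(void);   /* Hypothetical: returns the next input bit, 0 or 1 */

/* Decode one item: count the unary zeros, then read the suffix */
static int decode_item(int ul[],int un,int first[])
{int u=0,k=0,b;

 while (u<un-1 && nextbit()==0)   /* Count prefix zeros; the last prefix */
  ++u;                            /* has no terminating '1' to consume */
 for (b=0; b<ul[u]; ++b)
  k|=nextbit()<<b;                /* Suffix arrives least significant bit first */
 return(first[u]+k);              /* Index into the sorted item table */
}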

 

A trick that might reduce decoding times is to extract the unary prefix directly, using the following identity:

 

Prefix=UPCode & (~UPCode + 1)

 

Eg if UPCode is 001xx, stored least significant bit first as bbbxx100 (where 'b' is a bit of the following code and capital letters denote complemented bits), then we have:

 

Prefix=bbbxx100 & (BBBXX011+1)

            =bbbxx100 & BBBXX100

            =00000100

 

This is then easier to compare against, or to shift down to zero while counting. Eg while (x>>=1) ++count;
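
Wrapped up in C, the trick might look like this (a sketch assuming the code's bits are held least significant first in an integer, with a terminator present):

/* Return the length of the unary prefix of a code held lsb first */
/* (so 001xx is stored as ...xx100) */
static int prefix_length(unsigned x)
{unsigned p=x&(~x+1);   /* Two's complement trick: isolate lowest set bit */
 int count=0;

 while (p>>=1)
  ++count;              /* Shift the isolated bit down to zero */
 return(count);         /* Number of unary zeros below the terminator */
}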

 

Alternatively, if you have memory to spare, you can create a look-up table giving the prefix value for each possible prefix signature.
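
For instance (a sketch that assumes the prefix and its terminator fit within the low byte), a 256-entry table indexed by the low byte of the code yields the prefix length in a single step:

static unsigned char prefix_len[256];   /* Prefix length for each low-byte signature */

static void init_prefix_table(void)
{int b,n;

 for (b=1; b<256; ++b)
 {for (n=0; !((b>>n)&1); ++n)
   ;                                    /* Count zeros below the first set bit */
  prefix_len[b]=n;
 }
 prefix_len[0]=8;                       /* No terminator in this byte: prefix continues */
}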

 

One area in which UP Code is more efficient than Huffman Code is that of encoding information about the encoding key. Both methods also have to supply details of what each code represents, but since that is the same for both, I won't describe it here.

 

Huffman Code must supply details of the binary coding tree. Since it is a binary tree it can be represented in binary too, using a bit for each node. A zero bit indicates a leaf and a one bit indicates a branch. The coding tree needed to encode "A SIMPLE STRING..." requires 18 bits for the leaves and 17 bits for the branches; a total of 35 bits. Note that this tree will also reveal how many different codes there are.
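
For example, a pre-order walk of the tree writes one bit per node; this sketch assumes a node structure with left and right pointers (not part of the earlier listings):

struct node {struct node *left,*right;};   /* Assumed tree node structure */

/* Emit the shape of a Huffman tree in pre-order: '1' for a branch, '0' for a leaf */
static void put_tree(struct node *t)
{
 if (t->left)             /* A branch always has two children */
 {putchar('1');
  put_tree(t->left);
  put_tree(t->right);
 }
 else
  putchar('0');           /* Leaf */
}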

 

UP Code only needs to supply the suffix length of each unary prefix. An initial value declares the number of prefixes (less one); the suffix lengths then follow in BCU. I suggest the use of regular Unary Prefixed Binary (UPB) for this first value. UPB uses an equal number of bits for the unary prefix and the binary suffix, and so progresses: 0..1=1x, 2..5=01xx, 6..13=001xxx, etc. Table 4 shows how the UP Code list has been encoded.

 

                        Table 4: Encoding the UP Code list

UP Code prefixes   Encoded
6 (n-1)            001000
1xx                001
01xx               001
001xx              001
0001x              01
00001              1
000001x            01
000000             1

 

This gives a length of 21 bits.
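
Putting Table 4 into code, the header writer might be sketched as follows (put_bcu() as sketched earlier; put_upb() writes the regular UPB form; both are purely illustrative):

/* Write v in regular Unary Prefixed Binary: a prefix of k zeros */
/* introduces a (k+1)-bit suffix, group k starting at value 2^(k+1)-2 */
static void put_upb(unsigned v)
{int k=0,b;

 while (v>=(2u<<(k+1))-2)     /* Find the group containing v */
  ++k;
 v-=(2u<<k)-2;                /* Offset within the group */
 for (b=k; b--; )
  putchar('0');               /* Unary prefix */
 putchar('1');                /* Terminator */
 for (b=0; b<=k; ++b)
  putchar('0'+((v>>b)&1));    /* (k+1)-bit suffix, lsb first */
}

static void put_header(int ul[],int un)   /* Write the UP Code header of Table 4 */
{int i;

 put_upb(un-1);        /* Number of prefixes, less one, in UPB */
 for (i=0; i<un; ++i)
  put_bcu(ul[i]);      /* Each suffix length in BCU */
}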

 

If N is the number of different items encoded, then the encoding information size E is constant for Huffman Code at E=2N-1 bits. Unfortunately, there is no simple formula for that of UP code (using my method). Even so, I estimate that E varies between 3+Log2(N) bits and 3+N+2Log2(N) bits. With N=18, as in the preceding example, E could actually vary between 9 and 28 bits. Anyway, if we're dealing with small file sizes, the bits saved in the header of UP Code can often make up for the bits lost in the non-optimal coding. If we had the same frequency distribution as in the example, it would take a file size of about 840 items (as opposed to 60) before Huffman Code became more compact (its coding is only one bit shorter for every 60 items).

 

One final point to mention: you may decide you prefer BCU with 1s as digits and 0s as terminators. Thus instead of 1,01,001,0001 you have 0,10,110,1110. At least that way the sum of the digits accords with the value. I used it the other way round to continue the pattern of the highest valued digit being the base less one. Thus decimal's is 10-1=9, octal's is 8-1=7, binary's is 2-1=1, and so I thought unary's should be 1-1=0.

 

Well, I hope this has been of interest. If I haven't persuaded anyone of the merits of Unary Prefix Code, I'd be happy to think that at least some people have had the idea knocked out of them that Huffman Code is the only useful variable length binary encoding scheme.



[1] Sedgewick, Robert. Algorithms in C. Addison-Wesley. ISBN 0-201-51425-7. £21.95.