Probable and Improbable Characters

What happens if the relative frequencies or probabilities of the various characters you use are not equal? Let's consider an alphabet with only two characters, A and B (see reference 2). If A and B have equal probabilities, the information content of each is log{base2}(1/1/2), or log{base2}2, or 1 infobit. The average information content of this alphabet is also 1 infobit.

However, if A and B are not equally probable, and A occurs twice as often as B, their probabilities are 2/3 and 1/3, and their information contents, log{base2}(1/2/3) and log{base2}(1/1/3), respectively.

So, what's the average information content of this alphabet? Since A occurs 2/3 of the time, it should have more weight, and B should have less. Thus, the average information content is (2/3 x log{base2}(1/2/3)) + (1/3 x log{base2}(1/1/3)), or 0.918 infobits.

Wait a minute. If you have two characters, how can you possibly use less than 1 repbit to distinguish between them, even if you code them as blocks? For example, if you use blocks of two characters, you have four possible strings, AA, AB, BA, and BB; how can you code them in less than 2 repbits and get an average per character of less than 1 repbit?

This is where probability enters in. The probability of an AA is 2/3 x 2/3, or 4/9; a BB is only 1/3 x 1/3, or 1/9; and an AB or a BA is 2/3 x 1/3, or 2/9. If you use variable lengths of code, short ones for the more frequent, or probable, strings and longer ones for the rarer strings, then you can code the blocks, as in

table 1.

By using unequal code lengths, the weighted average of the number of repbits used for a string of two symbols becomes (4/9 x 1) + (2/9 x 3) + (1/9 x 3) or 17/9. Thus, the average number of repbits per symbol is 17/9/2 or 17/18, or 0.9444, which is less than 1. Hamming shows, similarly, that by using strings of three, four, or more symbols, you can move closer to the theoretical average of 0.918 repbits per symbol.

Incidentally, this highlights a very important point; Calculations of information content or entropy don't indicate how you should code to achieve the minimum number of repbits required; they only show that a possibility of more efficient coding exists. The "how" question is dealt with by coding theory the practical realisation of the theoretical possibilities established by information theory.

You can extend the idea of weighted average to character sets of any size. For instance, let's consider the set of 26 letters that is the English alphabet, plus a space, and ignore for the moment punctuation, signs, and so on. If the 27 characters are equally likely, then the probability for each is 1/27, and the information content for each is log{base2}(1/1/27), or log{base2}27, or 4.75 infobits. The average amount of information for this character set is also 4.75 infobits per character. This process actually attaches a weight of 1/27 to each character, so that the average is (1/27 x log{base2}27) + (1/27 x log{base2}27) ... repeated 27 times with a sum of log{base2}27.

However, the characters are not equally likely - for instance, E's are common, and Q's are rare. Estimates are that the probability of a space is 0.18, and E, 0.11, a C, 0.02, and so on for all the characters. Shannon and Weaver used such figures to calculate the weighted average amount of information:

(0.18 x log{base2}(1/0.18)) + (0.11 x log{base2} x 1/0.11) + ...

The estimated that this gives an average information content of 4.1bits per character. The fact that the average is now lower makes sense because uncertainty is largest when the characters are equally likely. The reverse is also true.

Shannon used the word redundancy to capture the concept that a character set may have less entropy than the maximum it is capable of. In our 27-character English alphabet, the redundancy is expressed as (4.7-4.1)/4.7, or almost 13 per cent. Actually, the redundance in the English language is even higher. Not only are the characters not equally likely, but their probabilities depend on what characters precede them. (Consider the probabilities of the different characters when the previous one is a Q!) Taking these factors into account, the redundancy in English text is estimated to be higher than 13 per cent.

+------------------------------------------------------------------------+

¦ Table 1:Using probabilities to lower the average number of repbits ¦

¦ required to encode blocks of characters. ¦

¦ AA with probability of 4/9 is coded as 1 ¦

¦ AB with probability of 2/9 is coded as 01 ¦

¦ BA with probability of 2/9 is coded as 000 ¦

¦ BB with probability of 1/9 is coded as 001 ¦

+------------------------------------------------------------------------+