What happens if the relative frequencies or probabilities of the
various characters you use are not equal? Let's consider an alphabet
with only two characters, A and B (see reference 2). If A and
B have equal probabilities, the information content of each is
log{base2}(1/1/2), or log{base2}2, or 1 infobit. The
average information content of this alphabet is also 1 infobit.
However, if A and B are not equally probable, and A occurs twice
as often as B, their probabilities are 2/3 and 1/3,
and their information contents, log{base2}(1/2/3) and
log{base2}(1/1/3), respectively.
So, what's the average information content of this alphabet? Since
A occurs 2/3 of the time, it should have more weight, and B should
have less. Thus, the average information content is (2/3 x log{base2}(1/2/3))
+ (1/3 x log{base2}(1/1/3)), or 0.918 infobits.
Wait a minute. If you have two characters, how can you possibly
use less than 1 repbit to distinguish between them, even if you
code them as blocks? For example, if you use blocks of two characters,
you have four possible strings, AA, AB, BA, and BB; how can you
code them in less than 2 repbits and get an average per character
of less than 1 repbit?
This is where probability enters in. The probability of an AA is 2/3 x 2/3, or 4/9; a BB is only 1/3 x 1/3, or 1/9; and an AB or a BA is 2/3 x 1/3, or 2/9. If you use variable lengths of code, short ones for the more frequent, or probable, strings and longer ones for the rarer strings, then you can code the blocks, as in
table 1.
By using unequal code lengths, the weighted average of the number
of repbits used for a string of two symbols becomes (4/9
x 1) + (2/9 x 3) + (1/9 x 3) or 17/9.
Thus, the average number of repbits per symbol is 17/9/2
or 17/18, or 0.9444, which is less than 1. Hamming
shows, similarly, that by using strings of three, four, or more
symbols, you can move closer to the theoretical average of 0.918
repbits per symbol.
Incidentally, this highlights a very important point; Calculations
of information content or entropy don't indicate how you should
code to achieve the minimum number of repbits required; they only
show that a possibility of more efficient coding exists. The "how"
question is dealt with by coding theory the practical realisation
of the theoretical possibilities established by information theory.
You can extend the idea of weighted average to character sets
of any size. For instance, let's consider the set of 26 letters
that is the English alphabet, plus a space, and ignore for the
moment punctuation, signs, and so on. If the 27 characters are
equally likely, then the probability for each is 1/27,
and the information content for each is log{base2}(1/1/27),
or log{base2}27, or 4.75 infobits. The average amount of information
for this character set is also 4.75 infobits per character. This
process actually attaches a weight of 1/27 to each
character, so that the average is (1/27 x log{base2}27)
+ (1/27 x log{base2}27) ... repeated 27 times with
a sum of log{base2}27.
However, the characters are not equally likely - for instance,
E's are common, and Q's are rare. Estimates are that the probability
of a space is 0.18, and E, 0.11, a C, 0.02, and so on for all
the characters. Shannon and Weaver used such figures to calculate
the weighted average amount of information:
(0.18 x log{base2}(1/0.18)) + (0.11 x log{base2} x 1/0.11)
+ ...
The estimated that this gives an average information content of
4.1bits per character. The fact that the average is now lower
makes sense because uncertainty is largest when the characters
are equally likely. The reverse is also true.
Shannon used the word redundancy to capture the concept that a
character set may have less entropy than the maximum it is capable
of. In our 27-character English alphabet, the redundancy is expressed
as (4.7-4.1)/4.7, or almost 13 per cent. Actually, the redundance
in the English language is even higher. Not only are the characters
not equally likely, but their probabilities depend on what characters
precede them. (Consider the probabilities of the different characters
when the previous one is a Q!) Taking these factors into account,
the redundancy in English text is estimated to be higher than
13 per cent.
+------------------------------------------------------------------------+
¦ Table 1:Using probabilities to lower the average number of repbits ¦
¦ required to encode blocks of characters. ¦
¦ AA with probability of 4/9 is coded as 1 ¦
¦ AB with probability of 2/9 is coded as 01 ¦
¦ BA with probability of 2/9 is coded as 000 ¦
¦ BB with probability of 1/9 is coded as 001 ¦
+------------------------------------------------------------------------+