Measuring Bits of Information

Information theory is based on certain key assumptions, or postulates, that are inherently plausible and reasonable. However, the ultimate justification is that logical conclusions drawn from these postulates have led to useful and effective solutions to real-life problems.

One assumption of information theory is that a message is not significant by itself; it is significant in the context of all the other possible messages that could have been sent. When a message tells you something that you already know, it's reasonable to say that the message conveys no information; there was no other possible message. For example, if you have a 10-year-old son, and someone tells you that you have a son, no information has been conveyed. On the other hand, under different circumstances (when more than one message is possible), the same message could convey some information. For example, if you are in the hospital delivery room, and someone tells you that you have a son, some information has been conveyed.

"The significant aspect is that the actual message is one selected from a set of possible messages" (Shannon and Weaver). The greater the number of possible messages, the greater the amount of information conveyed. In other words, how much information a message contains depends on the extent to which it resolves uncertainty.

You could also say that the more probable a message is, the less information it conveys. For instance, a message selected from a set of only one possible message has a probability of 100 per cent, or 1, and conveys no information. A message selected from a set of two equally probable messages, each with a probability of 1/2, conveys some information, while a message from a set of three (probability of 1/3) conveys even more, and so on.

The amount of information increases as the probability of the message decreases; they are inversely related, but in exactly what proportions? You could say that the information content of a message with a probability of p1 is 1/p1, but this doesn't give zero information content for a message with a probability of 1.

Shannon suggested a more definite form for relating information content and message probability. He argued that you can measure information so that the total amount conveyed by two messages is equal to the sum of the information conveyed by each of them; in other words, the information conveyed by a series of messages is additive.

If you have two messages, one with a probability of p1 and the other with a probability of p2, you could say that the quantity of information these messages convey is related to 1/p1 and 1/p2, respectively. However, if you think of the two as a compound message, the probability becomes p1 x p2. For example, if p1 is 1/3 and p2 is 1/5, there is a one-in-three chance of the first message being selected. If it is chosen, there is only a one-in-five chance that the second message will also be chosen. Thus, the chances of the compound message being sent are 1/3 x 1/5, or 1/15, and the information content of this compound message should be related to 1/(p1 x p2).

The concept of additivity requires that the information content associated with 1/(p1 x p2) be the sum of the information content associated with 1/p1 and that associated with 1/p2. Therefore,

I(1/(p1 x p2)) = I(1/p1) + I(1/p2)

where I denotes quantity of information. According to Shannon, the only mathematical relationship that satisfies this requirement is: The quantity of information associated with a probability of p1 is

I(1/p1) = log(1/p1)

This, then, is Shannon's fundamental equation for measuring quantity of information.
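The additivity requirement can be checked numerically. The following is a minimal sketch in Python, using the probabilities 1/3 and 1/5 from the compound-message example. The function name `information` is an illustrative choice, and the natural logarithm is used only because no base has been fixed at this point in the argument.

```python
import math

def information(p):
    """Quantity of information for a message of probability p: log(1/p).

    The natural logarithm is used here; any other base gives the same
    additive behaviour, differing only by a constant factor.
    """
    return math.log(1 / p)

p1, p2 = 1 / 3, 1 / 5

# A message that is certain (probability 1) conveys no information.
print(information(1.0))

# Additivity: the compound message carries the sum of the two parts.
compound = information(p1 * p2)
separate = information(p1) + information(p2)
print(math.isclose(compound, separate))

# The simpler candidate 1/p fails both tests: it gives 1, not 0, for a
# certain message, and 1/(p1 x p2) = 15 is not 1/p1 + 1/p2 = 3 + 5 = 8.
print(1 / (p1 * p2), 1 / p1 + 1 / p2)
```

The comparison uses `math.isclose` rather than exact equality because floating-point arithmetic can differ in the last digit.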

Briefly, the logarithm of any number to a particular base is defined as the power to which you must raise the base to get that number. For example, the log of 1000 to the base 10 is 3, since 10 x 10 x 10, or 10^3, is 1000. So what base should Shannon's equation use? Base 2 seems a natural choice because, in the simplest case where one of two equally probable messages is selected, each with a probability of 1/2, the quantity of information is log(1/(1/2)), or log{base2}2. The log of 2 to the base 2 is 1. Thus, the information contained in each of these two messages equals one unit. The average amount of information also equals one unit.

Shannon chose the name bit for this unit for measuring the amount of information. Let's call it an infobit, since it isn't quite the same as a bit in computer storage, which represents information (let's call that a repbit). Thus, if a message with a probability of 1/4 is chosen out of four equally likely messages, the amount of information would be log{base2}(1/(1/4)), or log{base2}4, or 2 infobits.
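To make the base-2 measure concrete, here is a short Python sketch that evaluates log{base2}(1/p) for one message drawn from a set of equally likely messages. The function name `infobits` is hypothetical, chosen to match the term used in this article.

```python
import math

def infobits(num_messages):
    """Infobits conveyed by one of num_messages equally likely messages."""
    p = 1 / num_messages       # probability of each message
    return math.log2(1 / p)    # log to the base 2 of 1/p

for n in (1, 2, 4, 8):
    print(n, infobits(n))
```

For 1, 2, 4, and 8 equally likely messages this gives 0, 1, 2, and 3 infobits respectively: each doubling of the number of possible messages adds one infobit.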

Figure 1a: The process that occurs at the transmission end of communicating a message. An information source feeds a transmitter, which codes the message into a signal sent over a channel; noise enters the channel.

Figure 1b: The corresponding process that occurs at the receiving end. The signal reaches a receiver, which decodes it.

To see the difference, as well as the connection between repbits and infobits, suppose you are expecting one of two messages, yes or no, in regard to some decision, and the two are equally probable. The message could be sent as yes or no, using 8 repbits for each character, 24 for yes or 16 for no, with an average of 20 repbits. However, in terms of information theory, for two equally probable messages, each with a probability of 1/2, each has an information content of log{base2}(1/(1/2)), or log{base2}2, or 1 infobit; and the average is also 1 infobit.
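The yes/no comparison can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the paragraph above; the variable names are illustrative, and the figure of 8 repbits per character is taken from the example.

```python
import math

BITS_PER_CHAR = 8  # repbits per character, as in the example

messages = {"yes": 0.5, "no": 0.5}  # message text -> probability

# Average repbits when each message is spelled out in 8-bit characters.
avg_repbits = sum(p * len(text) * BITS_PER_CHAR for text, p in messages.items())

# Average infobits: each message contributes log2(1/p), weighted by p.
avg_infobits = sum(p * math.log2(1 / p) for p in messages.values())

print(avg_repbits, avg_infobits)
```

The two averages come out as 20 repbits and 1 infobit, which is the gap the paragraph describes.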

Thus, the number of repbits is not necessarily equal to the number of infobits, but there is a connection. You could say that the number of infobits is the smallest number of repbits required. If there are only two possible messages and you use a code of 0 for no and 1 for yes, then a message of 1 repbit is enough. Similarly, if there are four possible messages, each with a probability of 1/4, the number of infobits needed is 2; the minimum number of repbits required is also 2.

What if you have three messages, each with a probability of 1/3? According to Shannon's equation, the number of infobits is log{base2}3, or about 1.58 infobits. But repbits can only be whole numbers, so how does this work?

You need at least 2 repbits to distinguish between the three alternatives. But with 2 repbits, you could actually handle four alternatives, so you're wasting some of the capacity of the 2 repbits for sending messages. You could reduce this waste if you code blocks of such messages, rather than sending each one individually.

If you code blocks of 10 such messages, the whole block could contain 3^10, or 59,049, alternative forms. If you use a string of 16 binary signals, you can have 2^16, or 65,536, alternative forms. Since a string of 16 binary signals is more than enough to handle 10 of these three-alternative messages, on the average you need only 16/10, or 1.6, repbits to represent the average three-alternative message.
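The block-of-10 arithmetic can be verified directly. The sketch below is one way to do it in Python; the function name `bits_for_block` is hypothetical, and it relies on the built-in `int.bit_length` method to find the shortest binary string that can label every possible block.

```python
def bits_for_block(block_len, alternatives=3):
    """Smallest number of binary signals that can label every possible block."""
    forms = alternatives ** block_len    # e.g. 3**10 = 59,049
    # (forms - 1).bit_length() is the smallest n with 2**n >= forms.
    return (forms - 1).bit_length()

block_len = 10
bits = bits_for_block(block_len)
print(3 ** block_len, 2 ** bits, bits, bits / block_len)
```

For a block of 10 this reports 59,049 alternative forms, a 16-signal string with 65,536 forms, and 1.6 repbits per message. Integer arithmetic is used throughout, so there is no rounding error in the count of binary signals.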

An alternative name that Shannon gave to the average amount of information is entropy, a term from thermodynamics. One interpretation of the amount of entropy in a physical system concerns the degree of uncertainty about which of many possible states of the system is actually realised at different stages. Shannon chose this name because of the analogy between realising one of many possible states and choosing one of many possible messages, and also because the mathematical equations for calculating thermodynamic entropy and average quantity of information were similar.
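As an illustration of entropy as an average, the sketch below weights each message's information content, log{base2}(1/p), by its probability p and sums the results. The explicit weighted-sum form and the function name `entropy` are not spelled out in this article; they are the standard way of writing the average and should be read as an assumption here.

```python
import math

def entropy(probabilities):
    """Average infobits per message: sum of p * log2(1/p) over all messages."""
    # Messages with probability 0 never occur and contribute nothing.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

print(entropy([1/2, 1/2]))         # two equally likely messages
print(entropy([1/3, 1/3, 1/3]))    # three equally likely messages
print(entropy([0.9, 0.1]))         # unequal probabilities: less than 1 infobit
```

When all messages are equally likely, the average equals the information content of any single message, so the first two lines agree with the 1 infobit and roughly 1.58 infobits computed earlier.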

Thus, the fractional entropy represents the average number of repbits required if you code the messages in sufficiently long blocks instead of one at a time. With longer blocks of messages, the average can be brought down from the 1.6 repbits calculated above toward the 1.58 that the entropy calculation gives.
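The approach toward the entropy figure can be observed by repeating the block calculation for longer blocks. This is a rough sketch under the same assumptions as the three-message example (three equally likely alternatives, each block labelled by a fixed-length binary string); the function name `repbits_per_message` is illustrative.

```python
import math

def repbits_per_message(block_len, alternatives=3):
    """Average binary signals per message when coding blocks of block_len."""
    forms = alternatives ** block_len
    # Smallest whole number of binary signals covering every block,
    # divided by the number of messages in the block.
    return (forms - 1).bit_length() / block_len

target = math.log2(3)   # about 1.58 infobits
for block_len in (1, 10, 100, 1000):
    print(block_len, repbits_per_message(block_len), target)
```

The per-message cost starts at 2 repbits for single messages and is 1.6 for blocks of 10. It never falls below log{base2}3, and the excess is always less than one binary signal divided by the block length.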

In reality, messages aren't usually a series of signals indicating which of different messages is being sent; they are a series of characters selected from a character set or alphabet. If you consider each choice of a character as a "minimessage", a selection from the set of all possible characters, the method still applies. You can think of an overall message as a long series of such minimessages.
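A simple way to picture the minimessage view is to add up the information in each character. The sketch below makes a strong simplifying assumption that is not claimed in the text: every character is an independent, equally likely choice from the alphabet. The function name `message_infobits` and the 26-letter alphabet are illustrative only.

```python
import math

def message_infobits(text, alphabet_size):
    """Infobits in text, assuming each character is an independent,
    equally likely choice from alphabet_size characters."""
    per_character = math.log2(alphabet_size)   # one minimessage
    return len(text) * per_character           # additivity over the series

print(message_infobits("yes", 26))
```

In real text, characters are neither equally likely nor independent, so this figure is best read as an upper bound on the average under the stated assumption.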