Mutual information

From Scholarpedia

This article is undergoing 1 initial review (2 completed); It may contain inaccuracies and unapproved changes made by anonymous reviewers.

Peer review status

You are not logged in.

Reviewer A: Awaits author's response (last modifications by author - 14 days ago)
Reviewer B: Modified and approved on 25 February 2008.
Reviewer C: Modified and approved on 22 September 2008.
This article is not accepted to Scholarpedia yet. It may contain inaccuracies and unapproved changes made by anonymous reviewers.
Author prefers old-fashioned peer-review process: all comments should be put into the 'reviews' part of the article.

Author: Dr. Peter E. Latham, Gatsby Computational Neuroscience Unit, Uinversity College London
Author: Dr. Yasser Roudi, Gatsby Computational Neuroscience Unit, University College London

Mutual information is one of many quantities that measures how much one random variables tells us about another. It is a dimensionless quantity with (generally) units of bits, and can be thought of as the reduction in uncertainty about one random variable given knowledge of another. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent.


Contents

Definition

For two discrete variables X and Y whose joint probability distribution is P_{XY}(x,y), the mutual information between them, denoted I(X;Y), is given by (Shannon and Weaver, 1949; Cover and Thomas, 1991)

(1)
I(X;Y) = \sum_{x,y} P_{XY}(x,y) \log {P_{XY}(x,y) \over P_X(x) P_Y(y)} = E_{P_{XY}} \log{P_{XY} \over P_X P_Y} \, .

Here P_X(x) and P_Y(y) are the marginals: P_X(x) = \sum_y P_{XY}(x,y) and P_Y(y) = \sum_x P_{XY}(x,y) and E_P is the expected value over the distribution P. The focus here is on discrete variables, but most results derived for discrete variables extend very naturally to continuous ones – one simply replaces sums by integrals. One should be aware, though, that the formal replacement of sums by integrals hides a great deal of subtlety, and, for distributions that are not sufficiently smooth, may not even work. See (Gray, 1990) for details.

The units of information depend on the base of the logarithm. If base 2 is used (the most common, and the one used here), information is measured in bits. For other bases, see Table I.

Table I
base units conversion
2 bits 1 bit = 1 bit
e nats 1 bit = \log_e 2 \ (\approx 0.693) nats
10 bans 1 bit = \log_{10} 2 \ (\approx 0.301) bans

Interpretation

To understand what I(X;Y) actually means, we first need to define entropy and conditional entropy.

Qualitatively, entropy is a measure of uncertainty – the higher the entropy, the more uncertain one is about a random variable. This statement was made quantitative by Shannon. He postulated that a measure of uncertainty of a random variable X should be a continuous function of its probability distribution P_{X}(x) and should satisfy the following conditions:

  • It should be maximal when P_{X}(x) is uniform, and in this case it should increase with the number of possible values X can take;
  • It should remain the same if we reorder the probabilities assigned to different values of X;
  • The uncertainty about two independent random variables should be the sum of the uncertainties about each of them.

He then showed that the only measure of uncertainty that satisfies all these conditions is the entropy, defined as

(2)
H(X)=-\sum_x P_X(x) \log P_X(x) = - E_{P_X} \log P_X

Although not particularly obvious from this equation, H(X) has a very concrete interpretation: Suppose x is chosen randomly from the distribution P_X(x), and someone who knows the distribution P_X(x) is asked to guess which x was chosen. If the guesser uses the optimal question-asking strategy, which is to divide the probability in half on each guess by asking questions like "is x greater than x_0?", then the average number of yes/no questions it takes to guess x lies between H(X) and H(X)+1 (Cover and Thomas, 1991). This gives quantitative meaning to "uncertainty": it is the number of yes/no questions it takes to guess a random variables, given knowledge of the underlying distribution and taking the optimal question-asking strategy. See figure 1 for examples.

Figure 1:  Guessing a random variable using an optimal strategy. In both panels, the random variable can take on the four values 1, 2, 3 or 4, but with different probabilities. The optimal strategy is for each guess to divide the probability in half. a. All variables have probability 1/4. The first question is "is it 1 or 2?" because that question divides the probabilities in half. (Note that this question is not unique; one could just as easily asked, for example, "is it 2 or 4?") This strategy continues until the variable is guessed, which always takes two guess. This is also the entropy of the distribution: H=-(1/4)log(1/4)-(1/4)log(1/4)-(1/4)log(1/4)-(1/4)log(1/4)=2. b. Variables 1-4 have probabilities 1/2, 1/4, 1/8, and 1/8, respectively. The first question, "is it 1?" divides the probability in half. (Again, this question is not uniqe; one could have asked "is it 2, 3 or 4?")  "Yes" indicates that the variable has been guessed; "no" requires a second question. That question is "is it 2?", which again divides the probability in half. If the answer is "no", then there are only two variables left, 3 and 4, each occurring with probability 1/2, and the last question is "is it 3". An answer of either yes or no will tell us what the variable was. On average, the number of questions is (1/2)1 + (1/4)2 + (1/8)3 + (1/8)3 = 1 3/4. This is equal to the entropy (using log to the base 2): H=-(1/2)log(1/2)-(1/4)log(1/4)-(1/8)log(1/8)-(1/8)log(1/8)= 1 3/4. Adopted from  Nirenberg and Latham (2003)
Enlarge
Figure 1: Guessing a random variable using an optimal strategy. In both panels, the random variable can take on the four values 1, 2, 3 or 4, but with different probabilities. The optimal strategy is for each guess to divide the probability in half. a. All variables have probability 1/4. The first question is "is it 1 or 2?" because that question divides the probabilities in half. (Note that this question is not unique; one could just as easily asked, for example, "is it 2 or 4?") This strategy continues until the variable is guessed, which always takes two guess. This is also the entropy of the distribution: H=-(1/4)log(1/4)-(1/4)log(1/4)-(1/4)log(1/4)-(1/4)log(1/4)=2. b. Variables 1-4 have probabilities 1/2, 1/4, 1/8, and 1/8, respectively. The first question, "is it 1?" divides the probability in half. (Again, this question is not uniqe; one could have asked "is it 2, 3 or 4?") "Yes" indicates that the variable has been guessed; "no" requires a second question. That question is "is it 2?", which again divides the probability in half. If the answer is "no", then there are only two variables left, 3 and 4, each occurring with probability 1/2, and the last question is "is it 3". An answer of either yes or no will tell us what the variable was. On average, the number of questions is (1/2)1 + (1/4)2 + (1/8)3 + (1/8)3 = 1 3/4. This is equal to the entropy (using log to the base 2): H=-(1/2)log(1/2)-(1/4)log(1/4)-(1/8)log(1/8)-(1/8)log(1/8)= 1 3/4. Adopted from Nirenberg and Latham (2003)


The conditional entropy is the average uncertainty about X after observing a second random variable Y, and is given by

(3)
H(X|Y)=\sum_y P_Y(y) \left[ - \sum_x P_{X|Y}(x|y) \log \left(P_{X|Y}(x|y)\right)\right] = E_{P_Y} \left[ - E_{P_{X|Y}} \log P_{X|Y} \right]

where P_{X|Y}(x|y) (\equiv P_{XY}(x,y)/P_Y(y)) is the conditional probability of x given y.

With the definitions of H(X) and H(X|Y), Eq. (1) can be written

(4)
I(X,Y)=H(X)-H(X|Y).

Mutual information is therefore the reduction in uncertainty about variable X, or the expected reduction in the number of yes/no questions needed to guess X after observing Y. Note that the yes/no question interpretation even applies to continuous variables: although it takes an infinite number of questions to guess a continuous variable, the difference in the number of yes/no questions it takes to guess X before versus after observing Y is finite, and is the mutual information. While problems can arise when going from discrete to continuous variables (subtracting infinities is always dangerous), they rarely do in practice; see (Gray, 1990).


Properties

  • Mutual information is intimately related to the Kullback-Leibler divergence (Cover and Thomas, 1991), a very natural measure of the "distance" between two distributions. It is defined as follows: for any two distributions P(z) and Q(z),
D_{KL} \Big( P(z)||Q(z) \Big) \equiv \sum_z P(z) \log \left[ {P(z) \over Q(z)} \right] \, .

Distance is in quotes because the Kullback-Leibler divergence is not a true distance: it is not symmetric, and it does not obey the triangle inequality (Cover and Thomas, 1991). It is not hard to show that D_{KL} \Big(P(z)||Q(z) \Big) is non-negative, and zero if and only if P(z) = Q(z)

From Eq. (1), it is easy to see that the mutual information is just the Kullback-Leibler distance between the joint distribution, P_{XY}(x,y), and the product of the independent ones, P_X(x)P_Y(y),

I(X;Y) = D_{KL} \Big( P_{XY}(x,y)||P_X(x)P_Y(y) \Big) \, .

Thus, another way to think about mutual information is that it is a measure of how close the true joint distribution of X and Y is to the independent joint distribution. Because the Kullback-Leibler distance, and thus the mutual information, is zero if and only if P_{XY}(x,y) = P_X(x)P_Y(y), it follows that the mutual information captures all dependencies between random variables, not just, say, second order ones, as captured by the covariance.

  • Mutual information is additive for independent variables. More quantitatively, if P_{XYWZ}(x,y,w,z) = P_{XY}(x,y) P_{WZ}(w,z), then
I(X,W; Y,Z) = I(X;Y) + I(W;Z) \, .

This follows easily from the definition of the mutual information. It's also a property that information should have. For example, suppose X and Y are the cost of a bottle of wine and how it tastes, and W and Z are the height and weight of tree squirrels. If cost provides 1 bit of information about taste and height provides 2 bits of information about weight, then it makes sense that cost+weight should provide 3 bits of information about taste+height. Assuming, of course, that squirrels (and their weights) do not influence wines (and their prices).


  • The Data Processing Inequality (DPI) states, loosely, that post-processing cannot increase information. More quantitatively, consider two random variables, X and Y, whose mutual information is I(X,Y). Now consider a third random variable, Z, that is a (probabilistic) function of Y only. (The only qualifier means P_{Z|XY}(z|x,y)=P_{Z|Y}(z|y), which in turn implies that P_{XYZ}(x,y,z)=P_{Z|Y}(z|y)P_{Y|X}(y|x)P_X(x)=P_{X|Y}(x|y)P_{Y|Z}(y|z)P_Z(z). This last relationship means that X, Y and Z form a Markov Chain.) The DPI states that Z can't have more information about X than Y has about X; that is,
I(X;Z) \le I(X;Y) \, .

This inequality, which again is a property information should have, is easy to prove. Combining Eqs. (3) and (4) and using \sum_z P_{XYZ}(x,y,z)=P_{XY}(x,y) and \sum_y P_{XYZ}(x,y,z)=P_{XZ}(x,z), we have

I(X;Y)-I(X;Z) = \sum_{x,y,z} P_{XYZ}(x,y,z) \log \left[ {P_{X|Y}(x|y) \over P_X(x)} \right] - \sum_{x,y,z} P_{XYZ}(x,y,z) \log \left[ {P_{X|Z}(x|z) \over P_X(x)} \right] = \sum_{x,y,z} P_{XYZ}(x,y,z) \log \left[ {P_{X|Y}(x|y) \over P_{X|Z}(x|z)} \right] .

Then, using the Markov property of X, Y and Z to replace P_{XYZ}(x,y,z) with P_{X|Y}(x|y)P_{Y|Z}(y|z)P_Z(z)=P_{X|Y}(x|y) P_{YZ}(y,z), the above expression becomes

I(X;Y)-I(X;Z) = \sum_{y,z} P_{YZ}(y,z) \sum_x P_{X|Y}(x|y) \log \left[ {P_{X|Y}(x|y) \over P_{X|Z}(x|z)} \right] = \sum_{y,z} P_{YZ}(y,z) D_{KL} \left(P_{X|Y}(x|y)| P_{X|Z}(x|z) \right) \ge 0 .

The last expression has a certain amount of intuitive appeal: if P_{X|Y}(x|y) = P_{X|Z}(x|z) for all x,y and z, then I(X;Y)=I(X;Z), and no information is lost in the transformation from Y to Z. If those two probabilities are not equal, then at least some information is lost.


The channel coding theorem

Mutual Information is just one way among many of measuring how related two random variables are. However, it is a measure ideally suited for analyzing communication channels.

Abstractly, a communication channel can be visualized as a transmission medium which receives an input x and produces and output y. If the channel is noiseless, the output will be equal to the input. However, in general, the transmission medium is noisy and an input x is converted to output y with probability P_{Y|X}(y|x).

Given a communication channel, we can transmit any message \mathbf{s} from a set of M possible messages by performing the following three steps:

  1. To each message \mathbf{s} we assign a string \mathbf{x}(\mathbf{s})=(x_1,\dots,x_n) of length n. Each \mathbf{x}(s) is called a codeword. The deterministic mapping from the set of messages to the set of codewords is called the encoding function.
  2. To transmit \mathbf{s}, we transmit the corresponding string \mathbf{x}(\mathbf{s}) over the channel. This yields an output string \mathbf{y}, also of length n. For the so called memoryless channles, each x_i in the input string is mapped to an output y_i with probability P_{Y|X}(y|x), independent of the other x_j.
  3. The output string \mathbf{y} is used to recover the transmitted message, using a deterministic decoding function. The decoding function maps each \mathbf{y} to one symbol \mathbf{s'}.

These three steps are illustrated as

\mathbf{s} \ \mathbf{\xrightarrow{Encoding}} \ x_1 x_2 ... x_n\ \rightarrow \Big[ \ {\rm noisy \ channel} \! : P_{Y|X}(y|x) \ \Big] \rightarrow y_1 y_2 ... y_n\ \mathbf\xrightarrow{Decoding} \ \mathbf{s'} \, .

One would like \mathbf{s'} to be the same as \mathbf{s}, but, because of noise, there is no way to absolutely guarantee this. However, the channel coding theorem (Shannon and Weaver, 1949) states that one can come close, so long as M is not too large. Moreover - and this is where the link between mutual information and communication channels arises - the maximum number of messages that can be transmitted almost error-free is a function of the mutual information between X and Y. Specifically, the channel coding theorem states the following:

First, define the channel capacity, C, as the maximum mutual information with respect to the input distribution, P_X,

C=\max_{p_X} I(X;Y) \, .


Then, assume that M = 2^{nR} messages are encoded in strings of length n. If R < C, the probability of an error (i.e., the probability that \mathbf{s} \ne \mathbf{s'}) is exponentially small in n. If, on the other hand, R \ge C, then errors are likely. More formally,

For R=\frac{\log(M)}{n}<C, it is possible to transmit messages with a maximum probability of error which is exponentially small in n. Conversely, if R>C, this is not possible.

In the above theorem, R is called the rate.


Although at first glance the ability to transmit information over a noisy channel almost error free looks mysterious, it's based on a very simple principle: send only a subset of the possible messages (that is, choose M such that M < 2^{nC}). The idea is illustrated in Fig. 2. This figures shows a set of input messages, represented by the dots in the left hand panels. Because of noise, a particular input message could map to any one of the messages inside the circles in the right hand panels. If one used all possible messages, as in Fig. 2a, and tried to decode - to map Y back to X - one would make errors. For example, the red message on the right hand side of Fig. 2a could be mapped back to any of three input messages. If, on the other hand, one used only a subset of the messages, as in Fig. 2b, then decoding could be done perfectly. Although this main idea is simple to grasp, the power of the channel coding theorem lies in the fact that increasing n exponentially increases the maximum number of messages that can be transmitted almost error free (M \sim 2^{nC}).

Figure 2: How to transmit information error free. a. A scheme that does not transmit information error free. The  dots in the left panel show (very schematically) all possible messages. The noisy channel maps each of them to any of the messages within the circles in the right panel. If one were to try to decode these messages, errors would be made on a regular basis. That's because multiple messages lie within each circle, so for each message that is received, more than one message could have been sent. For example, the red dot could be decoded as the red one (which is correct), but also as either of the cyan ones (both of which are wrong). b. A scheme that does transmit information error free. A sufficiently small subset of the messages are sent so that there is only one message inside each circles. Decoding can now be done perfectly - the red dot on the right can now be decoded only as the red dot on the left. This is, of course, an idealization; in real systems, there is always some probability of error. But, as shown in the main text, that probability can be made arbitrarily small.
Enlarge
Figure 2: How to transmit information error free. a. A scheme that does not transmit information error free. The dots in the left panel show (very schematically) all possible messages. The noisy channel maps each of them to any of the messages within the circles in the right panel. If one were to try to decode these messages, errors would be made on a regular basis. That's because multiple messages lie within each circle, so for each message that is received, more than one message could have been sent. For example, the red dot could be decoded as the red one (which is correct), but also as either of the cyan ones (both of which are wrong). b. A scheme that does transmit information error free. A sufficiently small subset of the messages are sent so that there is only one message inside each circles. Decoding can now be done perfectly - the red dot on the right can now be decoded only as the red dot on the left. This is, of course, an idealization; in real systems, there is always some probability of error. But, as shown in the main text, that probability can be made arbitrarily small.

A detailed proof of the channel coding theorem is beyond the scope of this article and we refer the readers to the relevant literature [Shannon and Weaver 1949, Cover and Thomas 1990]. It is, however, possible to see the relation between the rate of transmission and the probability of error quite intuitively, as shown below.

Suppose that we want to transmit M messages using codewords of length n. If we send a codeword \mathbf{x}^\mathrm{true} \equiv (x_1^\mathrm{true}, x_2^\mathrm{true}, ..., x_n^\mathrm{true}), and it is converted to \mathbf{y} \equiv (y_1, y_2, ..., y_n) on the output side, there is a very natural way to translate from \mathbf{y} back to \mathbf{x}: choose the message most likely to have caused it. That is, compute p(\mathbf{y}|\mathbf{x}) - the probability that \mathbf{y} was produced by codeword \mathbf{x} - for all codewords, and choose the one that maximizes this probability. We denote the probability of mistakenly decoding a randomly chosen codeword, \mathbf{x}, by P_n. This is defined as

P_n = \mathrm{Prob} \left[ p(\mathbf{y}|\mathbf{x}) > p(\mathbf{y}|\mathbf{x}^\mathrm{true}) \right] \, .

Since \mathbf{x} was chosen randomly in the above expression, it follows that the total probability of an error when decoding messages, denoted P_{tot}, is given by

P_{tot} = 1 - (1-P_n)^{M} \approx 1 - \exp[-M P_n] \, .

For P_{tot} to be small, M P_n must be much less than one. This provides a bound on the number of messages, M, that can be sent almost error free using codewords of length n

(5)
M \ll 1/P_n \, .

It is a straightforward exercise in large deviation theory (see Chapter 12 of Cover & Thomas, 1991) to show that P_n decreases exponentially with n. In fact, as motivated in the example below, in the large n limit, P_n is given by

P_n = 2^{-nI(X;Y)} \, .

If we now let

M = 2^{n[I(X;Y)-\epsilon]}

with \epsilon positive, we have automatically satisfied Eq. (5), in the large n limit. Thus, this analysis implies that 2^{nI(X;Y)} is an upper bound on the number of messages of length n can be sent with a probability of error equal to 2^{-n\epsilon}. By making n sufficiently large, the probability of error can be made arbitrarily small.

The connection between mutual information and the number of messages that can be sent is a deep one, but it turns out to be fairly easy to understand, as can be seen with the following example.

Consider a communication channel that transmits 0s and 1s, and transmits them correctly with probability q; that is,

\begin{array}{ll} P_{Y|X}(1,1) & = q \\ P_{Y|X}(1,1) & = q \, . \end{array}

For simplicity assume that P_X(0)=P_X(1)=1/2, which in turn implies that P_{X|Y}(1,1) = P_{X|Y}(0,0) = q. The mutual information for this distribution is given by

(6)
\begin{array}{ll} I(X;Y) & = H(X) - H(X|Y) \\ & = -(1/2) \log(1/2) -(1/2) \log(1/2) -(1/2)[-q \log q - (1-q) \log(1-q)] -(1/2)[-q \log q - (1-q) \log(1-q)] \\ & = -\log(1/2) + q \log q + (1-q)\log(1-q) \end{array}

As usual, one constructs a codebook containing only a subset of possible strings of 0s and 1s; that is, for strings of length n, the codebook contain far fewer than 2^n messages. Suppose a particular codeword, \mathbf{y} \equiv (y_1, y_2, ..., y_n), was received. How would one figure out which message was sent? The idea is to make use of the fact that if a 1 was received then there is a probability q that a 1 was sent, and similarly for 0. For definiteness, take q=0.9. Then, one simply looks for the codeword that has approximately 90% 1s at positions in the string where the received symbol was 1, and approximately 90% 0s at positions in the string where the received symbol was 0. For example, consider following received message and two candidate codewords,

{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1}{\color{red}0} \ \ \mathrm{received \ message}
{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{green}1}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{red}0}{\color{green}0}{\color{red}0}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{green}0}{\color{blue}1}{\color{red}0} \ \ \mathrm{likely \ codeword}
{\color{red}0}{\color{green}0}{\color{blue}1}{\color{red}0}{\color{green}1}{\color{green}0}{\color{green}1}{\color{green}0}{\color{blue}1}{\color{green}0} {\color{blue}1}{\color{blue}1}{\color{green}1}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{green}1}{\color{green}1}{\color{red}0}{\color{green}0} {\color{blue}1}{\color{green}1}{\color{blue}1}{\color{green}0}{\color{red}0}{\color{green}1}{\color{red}0}{\color{blue}1}{\color{green}0}{\color{green}0} {\color{red}0}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{green}1}{\color{blue}1} {\color{red}0}{\color{red}0}{\color{green}1}{\color{green}0}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{blue}1}{\color{green}0}{\color{red}0} \ \ \mathrm{unlikely \ codeword}

Incorrect matches are color-coded in green. For the likely codeword, only about 10% of the symbols are green, whereas for the unlikely codeword, about 50% are green.

There are two points to this example. One is that the probability of making an error - of incorrectly decoding a message - is about equal the probability that a randomly chosen message, one in which the probability of a 0 or 1 in each position is 1/2, will have 90% of its 1s and 90% of its zeros in just the right places. The other is that this probability is not hard to calculate: it is the same as the probability of flipping n coins, and having the first n/2 come up 90% heads and the second n/2 come up 90% tails. This probability (with 0.9 replaced by q) is given by

P_n = \left[(1/2)^{n/2} \frac{(n/2)!}{(qn/2)!((1-q)n/2)!} \right]^2 \, .

The first factor, (1/2)^{n/2} is the probability of a particular message occurring; the second enumerates all the ways to arrange the symbols. Using Stirling's formula and taking the large n limit, this expression can be written

P_n \approx 2^{-n[- \log_2 (1/2) + q \log_2 q + (1-q) \log_2 (1-q)]} = 2^{-nI(X;Y)} \, .

where the last equality follows from Eq. (6). This analysis, which required only the use of Stirling's formula (and can be easily extended to the general case), provides a link between information theory and communications channels.


While this fundamental theorem puts bounds on transmitted information, it does not tell us how to reach those bounds. An active area of research is designing codes that come as close as possible to the limit given by mutual information error correcting codes.

References

  • Cover, T.M. and Thomas, J.A. (1991). Elements of information theory. John Wiley & Sons, New York, NY.
  • Gray, R.M. (1990). Entropy and Information Theory. Springer-Verlag, New York, NY.
  • S. Nirenberg and P.E. Latham, Decoding neuronal spike trains: how important are correlations? Proc. Natl. Acad. Sci. 100:7348-7353 (2003).
  • Shannon, C.E. and Weaver, W. (1949). The mathematical theory of communication. University of Illinois Press, Urbana, Illinois.

Further reading

  • Cover, T.M. and Thomas, J.A. (1991). Elements of information theory. John Wiley & Sons, New York, NY.

External links

See also

Entropy

Action editor: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
Reviewer C: Dr. Simon R Schultz, Imperial College London, UK
For authors