Mutual information
From Scholarpedia
| This article is undergoing 1 initial review (2 completed); It may contain inaccuracies and unapproved changes made by anonymous reviewers. | ||||||||||||||||||||
Author: Dr. Peter E. Latham, Gatsby Computational Neuroscience Unit, Uinversity College London
Author: Dr. Yasser Roudi, Gatsby Computational Neuroscience Unit, University College London
Mutual information is one of many quantities that measures how much one random variables tells us about another. It is a dimensionless quantity with (generally) units of bits, and can be thought of as the reduction in uncertainty about one random variable given knowledge of another. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent.
Contents |
Definition
For two discrete variables
and
whose joint probability distribution
is
, the mutual information between them, denoted
, is given by (Shannon and Weaver, 1949;
Cover and Thomas, 1991)
- (1)
Here
and
are the marginals:
and
and
is the expected value over the distribution
. The focus here is on discrete
variables, but most results derived for discrete variables
extend very naturally to continuous ones – one simply replaces sums by integrals.
One should be aware, though, that the formal replacement of sums by integrals hides a great deal of subtlety, and, for distributions that are not sufficiently smooth, may not even work. See (Gray, 1990) for details.
The units of information depend on the base of the logarithm. If base 2 is used (the most common, and the one used here), information is measured in bits. For other bases, see Table I.
| base | units | conversion |
|---|---|---|
| 2 | bits | 1 bit = 1 bit |
| nats | 1 bit = nats
|
| 10 | bans | 1 bit = bans
|
Interpretation
To understand what
actually means, we first need to define
entropy and conditional entropy.
Qualitatively, entropy is a measure of uncertainty – the higher the entropy, the more uncertain one is about a random variable. This statement was made quantitative by Shannon. He postulated that a measure of uncertainty of a random variable
should be a continuous function of its probability distribution
and should satisfy the following conditions:
- It should be maximal when
is uniform, and in this case it should increase with the number of possible values
can take;
- It should remain the same if we reorder the probabilities assigned to different values of
;
- The uncertainty about two independent random variables should be the sum of the uncertainties about each of them.
He then showed that the only measure of uncertainty that satisfies all these conditions is the entropy, defined as
- (2)
Although not particularly obvious from this equation,
has a very concrete interpretation: Suppose
is chosen randomly from the distribution
, and someone who knows the distribution
is asked to guess which
was chosen. If the guesser uses the optimal question-asking strategy, which is to divide the probability in half on each guess by asking questions like "is
greater than
?", then the average number of yes/no questions it takes to guess
lies between
and
(Cover and Thomas, 1991). This gives quantitative meaning to "uncertainty": it is the number of yes/no questions it takes to guess a random variables, given knowledge of the underlying distribution and taking the optimal question-asking strategy. See figure 1 for examples.
The conditional entropy is the average uncertainty about
after observing a second random variable
, and is given by
- (3)
where
is the conditional probability of
given
.
With the definitions of
and
, Eq. (1)
can be written
- (4)
Mutual information is therefore the reduction in uncertainty about variable
, or the expected reduction in the number of yes/no questions needed to guess
after observing
. Note that the yes/no question interpretation even applies to continuous variables: although it takes an infinite number of questions to guess a continuous variable, the difference in the number of yes/no questions it takes to guess
before versus after observing
is finite, and is the mutual information. While problems can arise when going from discrete to continuous variables (subtracting infinities is always dangerous), they rarely do in practice; see (Gray, 1990).
Properties
- Mutual information is intimately related to the Kullback-Leibler divergence (Cover and Thomas, 1991), a very natural measure of the "distance" between two distributions. It is defined as follows: for any two distributions
and
,
Distance is in quotes because the Kullback-Leibler divergence is not a true distance: it is not symmetric, and it does not obey the triangle inequality (Cover and Thomas, 1991). It is not hard to show that
is non-negative, and zero if and only if
From Eq. (1), it is easy to see that the mutual information is just the Kullback-Leibler distance between the joint distribution,
, and the product of the independent ones,
,
Thus, another way to think about mutual information is that it is a measure of how close the true joint distribution of
and
is to the
independent joint distribution. Because the Kullback-Leibler distance, and thus the mutual information, is zero if and only if
, it follows that the mutual information captures all dependencies between random variables, not just, say, second order ones, as captured by the covariance.
- Mutual information is additive for independent variables. More quantitatively, if
, then
This follows easily from the definition of the mutual information. It's also a property that information should have. For example, suppose
and
are the cost of a bottle of wine and how it tastes, and
and
are the height and weight of tree squirrels. If cost provides 1 bit of information about taste and height provides 2 bits of information about weight, then it makes sense that cost+weight should provide 3 bits of information about taste+height. Assuming, of course, that squirrels (and their weights) do not influence wines (and their prices).
- The Data Processing Inequality (DPI) states, loosely, that post-processing cannot increase information. More quantitatively, consider two random variables,
and
, whose mutual information is
. Now consider a third random variable,
, that is a (probabilistic) function of
only. (The only qualifier means
, which in turn implies that
. This last relationship means that
and
form a Markov Chain.) The DPI states that
can't have more information about
than
has about
; that is,
This inequality, which again is a property information should have, is easy to prove. Combining Eqs. (3) and (4) and using
and
, we have
Then, using the Markov property of
and
to replace
with
, the above expression becomes
The last expression has a certain amount of intuitive appeal: if
for all
and
, then
, and no information is lost in the transformation from
to
. If those two probabilities are not equal, then at least some information is lost.
The channel coding theorem
Mutual Information is just one way among many of measuring how related two random variables are. However, it is a measure ideally suited for analyzing communication channels.
Abstractly, a communication channel can be visualized as a transmission medium which receives an input
and produces and output
. If the channel is noiseless, the output will be equal to the input. However, in general, the transmission medium is noisy and an input
is converted to output
with probability
.
Given a communication channel, we can transmit any message
from a set of
possible messages by performing the following three steps:
- To each message
we assign a string
of length
. Each
is called a codeword. The deterministic mapping from the set of messages to the set of codewords is called the encoding function.
- To transmit
, we transmit the corresponding string
over the channel. This yields an output string
, also of length
. For the so called memoryless channles, each
in the input string is mapped to an output
with probability
, independent of the other
.
- The output string
is used to recover the transmitted message, using a deterministic decoding function. The decoding function maps each
to one symbol
.
These three steps are illustrated as
One would like
to be the same as
, but, because of noise, there is no way to absolutely guarantee this. However, the channel coding theorem (Shannon and Weaver, 1949) states that one can come close, so long as
is not too large. Moreover - and this is where the link between mutual information and communication channels arises - the maximum number of messages that can be transmitted almost error-free is a function of the mutual information between
and
. Specifically, the channel coding theorem states the following:
First, define the channel capacity,
, as the maximum mutual information with respect to the input distribution,
,
Then, assume that
messages are encoded in strings of length
. If
, the probability of an error (i.e., the probability that
) is exponentially small in
. If, on the other hand,
, then errors are likely. More formally,
For
, it is possible to transmit messages with a maximum probability of error which is exponentially small in
. Conversely, if
, this is not possible.
In the above theorem,
is called the rate.
Although at first glance the ability to transmit information over a noisy channel almost error free looks mysterious, it's based on a very simple principle: send only a subset of the possible messages (that is, choose
such that
). The idea is illustrated in Fig. 2. This figures shows a set of input messages, represented by the dots in the left hand panels. Because of noise, a particular input message could map to any one of the messages inside the circles in the right hand panels. If one used all possible messages, as in Fig. 2a, and tried to decode - to map
back to
- one would make errors. For example, the red message on the right hand side of Fig. 2a could be mapped back to any of three input messages. If, on the other hand, one used only a subset of the messages, as in Fig. 2b, then decoding could be done perfectly. Although this main idea is simple to grasp, the power of the channel coding theorem lies in the fact that increasing
exponentially increases the maximum number of messages that can be transmitted almost error free (
).
A detailed proof of the channel coding theorem is beyond the scope of this article and we refer the readers to the relevant literature [Shannon and Weaver 1949, Cover and Thomas 1990]. It is, however, possible to see the relation between the rate of transmission and the probability of error quite intuitively, as shown below.
Suppose that we want to transmit
messages using codewords of length
.
If we send a codeword
, and it is
converted to
on the output side, there is a very natural way to translate from
back to
: choose the message most likely to have caused it. That is, compute
- the probability that
was produced by codeword
- for all codewords, and choose the one that maximizes this probability. We denote the probability of mistakenly decoding a randomly chosen codeword,
, by
. This is defined as
Since
was chosen randomly in the above expression, it follows that the total probability of an error when decoding messages, denoted
, is given by
For
to be small,
must be much less than one. This provides a bound
on the number of messages,
, that can be sent almost error free using codewords of length
- (5)
It is a straightforward exercise in large deviation theory (see Chapter 12 of Cover & Thomas, 1991) to show that
decreases exponentially with
. In fact, as motivated in the example below, in the large
limit,
is given by
If we now let
with
positive, we have automatically satisfied Eq. (5), in the large
limit. Thus, this analysis implies that
is an upper bound on the number of messages of length
can be sent with a probability of error equal to
. By making
sufficiently large, the probability of error can be made arbitrarily small.
The connection between mutual information and the number of messages that can be sent is a deep one, but it turns out to be fairly easy to understand, as can be seen with the following example.
Consider a communication channel that transmits 0s and 1s, and transmits them correctly with probability
; that is,
For simplicity assume that
, which in turn implies that
The mutual information for this distribution is given by
- (6)
As usual, one constructs a codebook containing only a subset of possible strings of 0s and 1s; that is, for strings of length
, the codebook contain far fewer than
messages. Suppose a particular codeword,
, was received. How would one figure out which message was sent? The idea is to make use of the fact that if a 1 was received then there is a probability
that a 1 was sent, and similarly for 0. For definiteness, take
. Then, one simply looks for the codeword that has approximately 90% 1s at positions in the string where the received symbol was 1, and approximately 90% 0s at positions in the string where the received symbol was 0. For example, consider following received message and two candidate codewords,
Incorrect matches are color-coded in green. For the likely codeword, only about 10% of the symbols are green, whereas for the unlikely codeword, about 50% are green.
There are two points to this example. One is that the probability of making an error - of incorrectly decoding a message - is about equal the probability that a randomly chosen message, one in which the probability of a 0 or 1 in each position is 1/2, will have 90% of its 1s and 90% of its zeros in just the right places. The other is that this probability is not hard to calculate: it is the same as the probability of flipping
coins, and having the first
come up 90% heads and the second
come up 90% tails. This probability (with 0.9 replaced by q) is given by
The first factor,
is the probability of a particular message occurring; the second enumerates all the ways to arrange the symbols. Using
Stirling's formula and taking the large
limit, this expression can be written
where the last equality follows from Eq. (6). This analysis, which required only the use of Stirling's formula (and can be easily extended to the general case), provides a link between information theory and communications channels.
While this fundamental theorem puts bounds on transmitted information, it does not tell us how to reach those bounds. An active area of research is designing codes that come as close as possible to the limit given by mutual information error correcting codes.
References
- Cover, T.M. and Thomas, J.A. (1991). Elements of information theory. John Wiley & Sons, New York, NY.
- Gray, R.M. (1990). Entropy and Information Theory. Springer-Verlag, New York, NY.
- S. Nirenberg and P.E. Latham, Decoding neuronal spike trains: how important are correlations? Proc. Natl. Acad. Sci. 100:7348-7353 (2003).
- Shannon, C.E. and Weaver, W. (1949). The mathematical theory of communication. University of Illinois Press, Urbana, Illinois.
Further reading
- Cover, T.M. and Thomas, J.A. (1991). Elements of information theory. John Wiley & Sons, New York, NY.
External links
- Peter E. Latham's website
- Yasser Roudi's website
- Mutual Information (Wikipedia)
- Error correcting codes (Wikipedia)
See also
| Action editor: | Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia |
| Reviewer C: | Dr. Simon R Schultz, Imperial College London, UK |
nats
bans


