5.
Coding for Efficiency Suppose we have a very long message which
we want to save or send to someone, and we would like to shorten
it to save ourselves space or sending time. This happens in
particular if our message contains some kind of audio or video
information, which can go on and on. There are all sorts of
procedures and protocols used to compactify information. We
will study only one simple problem. Suppose we know nothing
whatever about our message, and want to compress it using only
information we can glean from it by making a relatively small
amount of counting on its data. Suppose then that our message
is represented as a long string of 0's and 1's of length N.
One thing we can do is to pick a small number, k, divide the
message up into consecutive blocks each consisting of k bits
of information, (that is, k digits) and count how many blocks
there are in our message for each possible block pattern of
length k. There are, of course, 2k different possible patterns
of digits 0 and 1 of length k. If we were to do this, we would
find, say, f(q) patterns of form q for each such pattern q.
If k is 2, the possible patterns are 00, 01, 10 and 11, which
we can call 0,1,2 and 3 which corresponds to reading them in
binary notation. So we will get four frequencies, f(0), …
f(3). Their sum will be N/k or rather the smallest integer at
least as large as N/k, since that is how many blocks we will
have. So we could find 2k such pattern occurrence numbers, (let’s
call them pattern frequencies) and we want to use this information
to shorten our message. How can we do so? Well, we can assign
a “code word” to each message, and make the words
that correspond to our frequently occurring pattern short, and
those corresponding to rare patterns long. This is what Morse
code sort of does for English language text. And this brings
us to two specific questions: 1. How much can we shorten our
message by using this technique? 2. How can we find an efficient
shortening algorithm for this problem? 5.1 Shannon’s Theorem:
A lower bound to the length of our shortened message We now
look at the first of these questions: What can we say about
the length of the shortest possible message obtained by trying
this approach? We can employ our old friend the Pigeonhole Principle,
to get a bound. To do so we address two questions: If we shorten
our message so it ends up being M bits long, how many possible
messages can we distinguish? Given what we are attempting, how
many possible messages do we have to be able to handle in order
to be able to decode every possible coded message successfully?
The first question is straightforward: if our shortened message
is M bits long, there are 2M possible messages that can be created.
So let us look at the second question: how many possible messages
are there, given the information that the frequencies f(q) tell
us about our message N? We know from our frequency information
how often each block occurs; but we have no information about
the ordering of the blocks in our message. Thus, our message
could, as far as the frequency information is concerned be any
ordering of f(q) samples of block q for each q. So the question
we want to address here is: How many different ways are there
to order N/k blocks, given our frequency information? The answer
is what is known as a “Multinomial Coefficient”.
Let us first then remind ourselves about “Binomial Coefficients”.
Suppose we had chosen k to be 1, so that we had only two possible
values for q. Our current question would then be, how many different
arrangements of two symbols are there given that there are f(0)
occurrences of the symbol 0 and f(1) of 1? This is the number
of ways of placing f(0) 0’s among f(0)+f(1) symbols all
together. This is the same as the number of times you choose
1 in multiplying out (1+x)f(0)+f(1) using the distributive law
over and over again. It is called the Binomial Coefficient,
(C(f(0)+f(1),f(0)) and it is defined by the “Binomial
Theorem” (1 + x) n = SUM (over k of) C(n,k). We can write
a similar theorem, which we could call the multinomial theorem,
as follows: (1 + x1 +x2 + … xj)n = SUM(over k1…kj)
Cj(n, k1, …,kj)((x1)k1(x2)k2…(xj)kj), where once
again, the multinomial coefficients, the Cj(n, k1,…,kj)
can be obtained by use of the distributive law. (This is something
of a dopey notation, since it singles out the 0th symbol, and
does not list how often it is picked (it would be k0). We can
of course deduce that k0 in each term on the right here is n
minus the sum of all the other k’s.) In our case we can
identify f(q) with kq, recognizing that every ordering of our
symbols corresponds to a term obtained by applying the distributive
law to the function on the left here. (the symbol appearing
in position r in an arrangement is the index s corresponding
to the variable xs that was chosen from the r-th factor here
in the expanded term we are looking at.) So what then is a multinomial
coefficient? There is a standard way of calculating these things,
which goes as follows. If we specify how many copies of each
symbol we are to have, which we do, all arrangements are alike
except for order. We use the following fact The number of orderings
is the number of arrangements multiplied by the number of orders
which leave each such arrangement fixed. This is a basic fact,
and I usually argue for it by saying it again louder, which
unfortunately doesn’t help those who don’t see it.
Let me try it here. Take any arrangement of our blocks. Now
reorder the blocks according to some other order. Our initial
arrangement will get mapped into some arrangement. Each arrangement
A will be hit in this way the same number of times, say Q times
. We can deduce then that the number of arrangements A is the
number of possible orders R divided by Q. For example, if there
are 24 orders and among them they map our original order into
each arrangement 3 times, we can conclude that there must be
8 arrangements. This fact tells us what the binomial and multinomial
coefficients are. How many orders are there? With n blocks,
we have n! possible orderings of them? (for us, n is N/k). How
many such orders keep an arrangement of f(q) copies of q blocks
for each q fixed? Any reordering of blocks which merely permutes
the blocks corresponding to the same q value does not change
the arrangement at all. The totals number of orderings that
merely permute blocks having the same q values is the product
over all q values of f(q)!. We conclude that our multinomial
coefficient is in general n!/(Product(s=0 to j) of ((ks)!))
In our case we find we have (N/k)! /(f(0)!*f(1)!*…*f(j)!)
possible arrangements of our blocks that make distinct messages
of which our message is one. So our pigeonhole principle tells
us: 2M > (N/k)! /(f(0)!*f(1)!*…*f(j)!) taking logs
we get M > log2 (N/k)! /(f(0)!*f(1)!*…*f(j)!). We will
now massage the right hand side here to get the bound of Shannon’s
Theorem, which theorem is: Theorem: If we compress a message
of length N by dividing it into N/k blocks of length k each,
using only frequency information contained in the frequencies
f(s) of the various block patterns, we will need a number of
binary digits for our message that is at least (N/k)H({p(0),
p(1), …, p(N/k-1)}) and we can can find a way to compress
our message in this way that uses at most (N/k)(H(p(0), p(1),
…, p(N/k-1))+1) binary bits. (Usually we get much closer
to the former number.) Here we use the notation p(s)= kf(s)/N,
which is the “relative frequency” of the block pattern
s: it is the number of s blocks divided by the total number
of blocks; and H(…) is the “entropy of information”
of the sequence of p’s. What then is this “entropy
of information”, usually written as H({p(s)})? It should
come as no surprise here that it is essentially the log to the
base 2 of the right hand side of our pigeonhole principle bound,
which boils down to H({p(s)}) = SUM (over s=0 to N/k-1 of) (-p(s)
log2 p(s)). We will do the boiling down below. Here we notice
that the first part of this theorem is an immediate consequence
of the pigeonhole principle and this boiling. The second part
of this “Shannon’s Theorem” is the statement
that the bound here can be achieved, almost. We prove this soon.
The remainder of this Chapter consists of 3 parts: 1. A brief
description of what entropy of information means and how it
relates to the concept of entropy in physics. 2. Convenient
approximation for the function n! that allows us to reduce the
our bound to a statement about entropy, and a reduction to same.
3. Proof of the second part of the theorem. 5.2 What is Entropy?
What is Entropy of Information? The physical concept of entropy
was developed in the subjects of thermodynamics and statistical
mechanics. If we look at a small piece of a large interacting
system, we will find that it will not in general be in some
fixed state. Rather, it can be found in any one of a usually
large number of different states. If our small piece exchanges
energy with the rest of the system, we will find it in states
of differing energy at different times. Its entropy is a logarithmic
measure of the diversity of the states of the system that can
be observed in our small subsystem. These states generally appear
with a frequency proportional to a factor of exp(-E/kT) where
E represents the energy of the state, and T is the temperature
of the system. In physics the entropy is proportional to the
logarithm of the sum over all states of the exponential factor
just mentioned. At high temperatures the exponential factor
is close to 1 for many states and the entropy is high. At low
temperature, only states with close to the lowest possible energy
occur much at all, and the entropy is usually very low. There
are two laws of thermodynamics that involve entropy. The second
law states that entropy of a system left on its own never decreases.
That reflects the physical fact that states do not coalesce
spontaneously, and a diverse system will retain its diversity.
There is also a third law of thermodynamics which states that
the entropy of a physical system approaches 0 near 0 temperature.
This is of course only true of systems with very few low energy
states. At low temperature only these states predominate in
our subsystem and there is little diversity among the states.There
is an exception to this law, and it is the substance known as
water, but that is beyond our present horizon. Shannon invented
the concept of entropy of information, as a logarithmic measure
of the diversity among potential messages that is wiped out
upon receipt of a specific message. It thus provides a measure
of how much information, how many bits you actually learn when
you get a message. In the case of information, if we know the
frequency of our blocks, the diversity of potential messages
comes entirely from the different orders in which these blocks
can occur. In our case, our frequency information tells us that
the number of potential messages is a certain multinomial coefficient.
The log to the base 2 of this coefficient is number of bits
you must have in order to exceed the same number of potential
messages, and is the number of bits you need supply in order
to reduce the original diversity down to one particular message.
5.3 Stirling’s Formula and Approximations for Multinomial
Coefficients We have written a formula for a multinomial coefficient
in terms of factorials, so we can write it and its logarithm
conveniently if we can find a convenient approximation to n!.
There is such a formula and it can be written as follows: n!
= (n/e)n W, where W is a factor on the order of n1/2. In fact
we can write W= (2 pi n + pi/3)1/2 (1 + O(1/n)). When we take
logarithms of n! the factor W recedes into insignificance, and
contributes nothing to the leading terms here. It can therefore
be ignored in our present discussion. (It is curious that the
approximate formula implied above for n! is quite accurate even
for n=0, (if you interpret (0)0 as 1) and gets better as n increases
in that the ratio the formula to n! approaches 1 as n increases..)
An argument for this form of approximation comes from the power
series expression for the function exp(n). This series is: Exp(n)
= 1 + n + … nn-1/(n-1)! + nn/n! + nn+1/(n+1)! + ….
You can verify that the terms increase up to nn-1/(n-1)!, then
the next term is the same and then they start downward, at ever
increasing speeds since the ratio of the kth term to the (k-1)st
is n/k. We can thereforestimate represent the left-hand-side,
the exponential function, by the product of the largest term
on the right and a measure of how many terms on the right make
major contributions, in short by nn/n! *W where W is easily
seen to be less than n. And we can rearrange this relation to
read n! = W*nn/exp(n) as claimed, with 1< W < n. The notation
O(f(n)) means that as n goes to infinity is quantity behaves
like f(n), in that its ratio to f(n) approaches a constant,
while o(f(n)) means its ratio to f(n) goes to 0 as n increases.
What then does this tell us about the logarithm of a multinomial
coefficient? Consider first a binomial coefficient, C(n,np(1)).
This can be written as n!/(np(0))!(np(1))! Applying Stirling’s
formula this becomes (n/e)n((np(0))/e)-n(1-p(1))(np(1)/e)-np(1)
Wn /Wp(1)n/Wn(1-p(1) And the n and e factors cancel and we get
the following expression for the binomial coefficient: ((p(1))-p1(1-p(1)))-p1)n
Wn /Wp(1)n/Wn(1-p(1). IIf we take logs of both sides, we find
that the W factors are insignificant for large n and we get
H(p1) = - p1log2 p1 –(1-p1) log2(1-p1). = -p1 log2 p1
– p0log2 p0. And the log of the binomial coeffient is
close to nH(p1). This as a function of p1 is symmetric about
p1=1/2, is 0 at 0 and 1, and is 1 at ½. And what about
the multinomial coefficient? When one takes the logarithm of
it we get a term of –ps log2 ps f for each s, and again
the W terms can be ignored, and we get. H({ps}) = -p0 log p0
– p1 log p1 - … - pj log pj. The log of the multinomial
coefficient is roughly nH({ps}) 5.4 How can we get a code that
come close to the bound here? We can do this by defining a code,
which assigns a binary sequence to each block type. The length
l(q) of the sequence assigned to will be used a total of f(q)
times in the message, and the length of the message will be
SUM (over all q from 0 to j of) l(q)f(q) What code should we
use? We first notice that there is a simple code that can be
used to meet the lower bound just described exactly, when each
p(s) is the reciprocal of a power of 2. We assume now that the
p(s) add up to 1; this tells us that the rarest two messages
must have the same frequency, when they are all reciprocals
of powers of 2. We can then take the rarest two messages, assign
their last bits to be 0 and 1, and from now on treat them as
one message. It is clear that their frequency doubles when their
frequencies were the same and we combine them as one. We can
immediately deduce, by induction on the number of blocks that
by doing this we can assign each block q a code word whose length,
l(q) is -log2p(q). For, by the induction hypothesis (using induction
on the number of block patterns) we can assume this to be true
after combining the two rarest frequencies. That combination
was achieved at a cost of adding one extra bit to each of their
code words, but we also subtracted 1 from each of their –log2p(q)
values, which came from doubling their frequency. (doubling
p(q) lowers – log2p(q) by 1). The induction hypothesis
tells us we can assign each of these blocks a common prefix
which is one bit shorter than their log2p(q) values, plus one
extra bit to distinguish them, which gives exactly log2p(q)
bits in all, as desired for them. If we have a code like this,
then the total message length is exactly the lower bound we
have found from the first part of Shannon’s Theorem. We
can also observe that given any values for p(q), we can lower
them to make them reciprocals of powers of 2 at the cost at
worst of halving them. (In reality if we lower some p(q) we
would raise others to keep the sum at 1, so we would do lots
better than halving each p(q). But halving each p(q) adds only
1 to each negative log, and the that would affect the total
message length at worst as indicated in the statement of the
theorem. So we can actually find a reasonably good code, by
rounding our relative frequencies down to inverse powers of
2 and assigning code words to have length given by their -logf(q)
values But this code is not optimal, and we can do better by
a still easier procedure which we will describe in the next
section. Exercises: 1. Use a spreadsheet or other means to plot
the function H(p1) in the range 0 to 1 for p1. 2. On a spreadsheet
or otherwise compute n!/(n/e)n /n1/2 for n = 2,4,8,16,32,64,128
Use these results and extrapolation to estimate the limit of
this ratio. (to extrapolate look at s(k) = 2r(2k)-r(k) then
t(k) = 4s(2k)/3 –s(k)/3 then u(k) = 8t(2k)/7-t(k)/7 then
v(k) = 16u(2k)/15 – u(k)/15 Compare your answer to (2
pi)1/2. 3. Given the following frequencies f(q) for q=0 to 15,
compute the entropy of this list (compute their total, n, the
p(q) and then the entropy function of these) 1, 22,11,15,2,1,2,3,45,95,2,2,1,2,25,1
5. Coding for Efficiency Suppose we have a very long message
which we want to save or send to someone, and we would like
to shorten it to save ourselves space or sending time. This
happens in particular if our message contains some kind of audio
or video information, which can go on and on. There are all
sorts of procedures and protocols used to compactify information.
We will study only one simple problem. Suppose we know nothing
whatever about our message, and want to compress it using only
information we can glean from it by making a relatively small
amount of counting on its data. Suppose then that our message
is represented as a long string of 0's and 1's of length N.
One thing we can do is to pick a small number, k, divide the
message up into consecutive blocks each consisting of k bits
of information, (that is, k digits) and count how many blocks
there are in our message for each possible block pattern of
length k. There are, of course, 2k different possible patterns
of digits 0 and 1 of length k. If we were to do this, we would
find, say, f(q) patterns of form q for each such pattern q.
If k is 2, the possible patterns are 00, 01, 10 and 11, which
we can call 0,1,2 and 3 which corresponds to reading them in
binary notation. So we will get four frequencies, f(0), …
f(3). Their sum will be N/k or rather the smallest integer at
least as large as N/k, since that is how many blocks we will
have. So we could find 2k such pattern occurrence numbers, (let’s
call them pattern frequencies) and we want to use this information
to shorten our message. How can we do so? Well, we can assign
a “code word” to each message, and make the words
that correspond to our frequently occurring pattern short, and
those corresponding to rare patterns long. This is what Morse
code sort of does for English language text. And this brings
us to two specific questions: 1. How much can we shorten our
message by using this technique? 2. How can we find an efficient
shortening algorithm for this problem? 5.1 Shannon’s Theorem:
A lower bound to the length of our shortened message We now
look at the first of these questions: What can we say about
the length of the shortest possible message obtained by trying
this approach? We can employ our old friend the Pigeonhole Principle,
to get a bound. To do so we address two questions: If we shorten
our message so it ends up being M bits long, how many possible
messages can we distinguish? Given what we are attempting, how
many possible messages do we have to be able to handle in order
to be able to decode every possible coded message successfully?
The first question is straightforward: if our shortened message
is M bits long, there are 2M possible messages that can be created.
So let us look at the second question: how many possible messages
are there, given the information that the frequencies f(q) tell
us about our message N? We know from our frequency information
how often each block occurs; but we have no information about
the ordering of the blocks in our message. Thus, our message
could, as far as the frequency information is concerned be any
ordering of f(q) samples of block q for each q. So the question
we want to address here is: How many different ways are there
to order N/k blocks, given our frequency information? The answer
is what is known as a “Multinomial Coefficient”.
Let us first then remind ourselves about “Binomial Coefficients”.
Suppose we had chosen k to be 1, so that we had only two possible
values for q. Our current question would then be, how many different
arrangements of two symbols are there given that there are f(0)
occurrences of the symbol 0 and f(1) of 1? This is the number
of ways of placing f(0) 0’s among f(0)+f(1) symbols all
together. This is the same as the number of times you choose
1 in multiplying out (1+x)f(0)+f(1) using the distributive law
over and over again. It is called the Binomial Coefficient,
(C(f(0)+f(1),f(0)) and it is defined by the “Binomial
Theorem” (1 + x) n = SUM (over k of) C(n,k). We can write
a similar theorem, which we could call the multinomial theorem,
as follows: (1 + x1 +x2 + … xj)n = SUM(over k1…kj)
Cj(n, k1, …,kj)((x1)k1(x2)k2…(xj)kj), where once
again, the multinomial coefficients, the Cj(n, k1,…,kj)
can be obtained by use of the distributive law. (This is something
of a dopey notation, since it singles out the 0th symbol, and
does not list how often it is picked (it would be k0). We can
of course deduce that k0 in each term on the right here is n
minus the sum of all the other k’s.) In our case we can
identify f(q) with kq, recognizing that every ordering of our
symbols corresponds to a term obtained by applying the distributive
law to the function on the left here. (the symbol appearing
in position r in an arrangement is the index s corresponding
to the variable xs that was chosen from the r-th factor here
in the expanded term we are looking at.) So what then is a multinomial
coefficient? There is a standard way of calculating these things,
which goes as follows. If we specify how many copies of each
symbol we are to have, which we do, all arrangements are alike
except for order. We use the following fact The number of orderings
is the number of arrangements multiplied by the number of orders
which leave each such arrangement fixed. This is a basic fact,
and I usually argue for it by saying it again louder, which
unfortunately doesn’t help those who don’t see it.
Let me try it here. Take any arrangement of our blocks. Now
reorder the blocks according to some other order. Our initial
arrangement will get mapped into some arrangement. Each arrangement
A will be hit in this way the same number of times, say Q times
. We can deduce then that the number of arrangements A is the
number of possible orders R divided by Q. For example, if there
are 24 orders and among them they map our original order into
each arrangement 3 times, we can conclude that there must be
8 arrangements. This fact tells us what the binomial and multinomial
coefficients are. How many orders are there? With n blocks,
we have n! possible orderings of them? (for us, n is N/k). How
many such orders keep an arrangement of f(q) copies of q blocks
for each q fixed? Any reordering of blocks which merely permutes
the blocks corresponding to the same q value does not change
the arrangement at all. The totals number of orderings that
merely permute blocks having the same q values is the product
over all q values of f(q)!. We conclude that our multinomial
coefficient is in general n!/(Product(s=0 to j) of ((ks)!))
In our case we find we have (N/k)! /(f(0)!*f(1)!*…*f(j)!)
possible arrangements of our blocks that make distinct messages
of which our message is one. So our pigeonhole principle tells
us: 2M > (N/k)! /(f(0)!*f(1)!*…*f(j)!) taking logs
we get M > log2 (N/k)! /(f(0)!*f(1)!*…*f(j)!). We will
now massage the right hand side here to get the bound of Shannon’s
Theorem, which theorem is: Theorem: If we compress a message
of length N by dividing it into N/k blocks of length k each,
using only frequency information contained in the frequencies
f(s) of the various block patterns, we will need a number of
binary digits for our message that is at least (N/k)H({p(0),
p(1), …, p(N/k-1)}) and we can can find a way to compress
our message in this way that uses at most (N/k)(H(p(0), p(1),
…, p(N/k-1))+1) binary bits. (Usually we get much closer
to the former number.) Here we use the notation p(s)= kf(s)/N,
which is the “relative frequency” of the block pattern
s: it is the number of s blocks divided by the total number
of blocks; and H(…) is the “entropy of information”
of the sequence of p’s. What then is this “entropy
of information”, usually written as H({p(s)})? It should
come as no surprise here that it is essentially the log to the
base 2 of the right hand side of our pigeonhole principle bound,
which boils down to H({p(s)}) = SUM (over s=0 to N/k-1 of) (-p(s)
log2 p(s)). We will do the boiling down below. Here we notice
that the first part of this theorem is an immediate consequence
of the pigeonhole principle and this boiling. The second part
of this “Shannon’s Theorem” is the statement
that the bound here can be achieved, almost. We prove this soon.
The remainder of this Chapter consists of 3 parts: 1. A brief
description of what entropy of information means and how it
relates to the concept of entropy in physics. 2. Convenient
approximation for the function n! that allows us to reduce the
our bound to a statement about entropy, and a reduction to same.
3. Proof of the second part of the theorem. 5.2 What is Entropy?
What is Entropy of Information? The physical concept of entropy
was developed in the subjects of thermodynamics and statistical
mechanics. If we look at a small piece of a large interacting
system, we will find that it will not in general be in some
fixed state. Rather, it can be found in any one of a usually
large number of different states. If our small piece exchanges
energy with the rest of the system, we will find it in states
of differing energy at different times. Its entropy is a logarithmic
measure of the diversity of the states of the system that can
be observed in our small subsystem. These states generally appear
with a frequency proportional to a factor of exp(-E/kT) where
E represents the energy of the state, and T is the temperature
of the system. In physics the entropy is proportional to the
logarithm of the sum over all states of the exponential factor
just mentioned. At high temperatures the exponential factor
is close to 1 for many states and the entropy is high. At low
temperature, only states with close to the lowest possible energy
occur much at all, and the entropy is usually very low. There
are two laws of thermodynamics that involve entropy. The second
law states that entropy of a system left on its own never decreases.
That reflects the physical fact that states do not coalesce
spontaneously, and a diverse system will retain its diversity.
There is also a third law of thermodynamics which states that
the entropy of a physical system approaches 0 near 0 temperature.
This is of course only true of systems with very few low energy
states. At low temperature only these states predominate in
our subsystem and there is little diversity among the states.There
is an exception to this law, and it is the substance known as
water, but that is beyond our present horizon. Shannon invented
the concept of entropy of information, as a logarithmic measure
of the diversity among potential messages that is wiped out
upon receipt of a specific message. It thus provides a measure
of how much information, how many bits you actually learn when
you get a message. In the case of information, if we know the
frequency of our blocks, the diversity of potential messages
comes entirely from the different orders in which these blocks
can occur. In our case, our frequency information tells us that
the number of potential messages is a certain multinomial coefficient.
The log to the base 2 of this coefficient is number of bits
you must have in order to exceed the same number of potential
messages, and is the number of bits you need supply in order
to reduce the original diversity down to one particular message.
5.3 Stirling’s Formula and Approximations for Multinomial
Coefficients We have written a formula for a multinomial coefficient
in terms of factorials, so we can write it and its logarithm
conveniently if we can find a convenient approximation to n!.
There is such a formula and it can be written as follows: n!
= (n/e)n W, where W is a factor on the order of n1/2. In fact
we can write W= (2 pi n + pi/3)1/2 (1 + O(1/n)). When we take
logarithms of n! the factor W recedes into insignificance, and
contributes nothing to the leading terms here. It can therefore
be ignored in our present discussion. (It is curious that the
approximate formula implied above for n! is quite accurate even
for n=0, (if you interpret (0)0 as 1) and gets better as n increases
in that the ratio the formula to n! approaches 1 as n increases..)
An argument for this form of approximation comes from the power
series expression for the function exp(n). This series is: Exp(n)
= 1 + n + … nn-1/(n-1)! + nn/n! + nn+1/(n+1)! + ….
You can verify that the terms increase up to nn-1/(n-1)!, then
the next term is the same and then they start downward, at ever
increasing speeds since the ratio of the kth term to the (k-1)st
is n/k. We can thereforestimate represent the left-hand-side,
the exponential function, by the product of the largest term
on the right and a measure of how many terms on the right make
major contributions, in short by nn/n! *W where W is easily
seen to be less than n. And we can rearrange this relation to
read n! = W*nn/exp(n) as claimed, with 1< W < n. The notation
O(f(n)) means that as n goes to infinity is quantity behaves
like f(n), in that its ratio to f(n) approaches a constant,
while o(f(n)) means its ratio to f(n) goes to 0 as n increases.
What then does this tell us about the logarithm of a multinomial
coefficient? Consider first a binomial coefficient, C(n,np(1)).
This can be written as n!/(np(0))!(np(1))! Applying Stirling’s
formula this becomes (n/e)n((np(0))/e)-n(1-p(1))(np(1)/e)-np(1)
Wn /Wp(1)n/Wn(1-p(1) And the n and e factors cancel and we get
the following expression for the binomial coefficient: ((p(1))-p1(1-p(1)))-p1)n
Wn /Wp(1)n/Wn(1-p(1). IIf we take logs of both sides, we find
that the W factors are insignificant for large n and we get
H(p1) = - p1log2 p1 –(1-p1) log2(1-p1). = -p1 log2 p1
– p0log2 p0. And the log of the binomial coeffient is
close to nH(p1). This as a function of p1 is symmetric about
p1=1/2, is 0 at 0 and 1, and is 1 at ½. And what about
the multinomial coefficient? When one takes the logarithm of
it we get a term of –ps log2 ps f for each s, and again
the W terms can be ignored, and we get. H({ps}) = -p0 log p0
– p1 log p1 - … - pj log pj. The log of the multinomial
coefficient is roughly nH({ps}) 5.4 How can we get a code that
come close to the bound here? We can do this by defining a code,
which assigns a binary sequence to each block type. The length
l(q) of the sequence assigned to will be used a total of f(q)
times in the message, and the length of the message will be
SUM (over all q from 0 to j of) l(q)f(q) What code should we
use? We first notice that there is a simple code that can be
used to meet the lower bound just described exactly, when each
p(s) is the reciprocal of a power of 2. We assume now that the
p(s) add up to 1; this tells us that the rarest two messages
must have the same frequency, when they are all reciprocals
of powers of 2. We can then take the rarest two messages, assign
their last bits to be 0 and 1, and from now on treat them as
one message. It is clear that their frequency doubles when their
frequencies were the same and we combine them as one. We can
immediately deduce, by induction on the number of blocks that
by doing this we can assign each block q a code word whose length,
l(q) is -log2p(q). For, by the induction hypothesis (using induction
on the number of block patterns) we can assume this to be true
after combining the two rarest frequencies. That combination
was achieved at a cost of adding one extra bit to each of their
code words, but we also subtracted 1 from each of their –log2p(q)
values, which came from doubling their frequency. (doubling
p(q) lowers – log2p(q) by 1). The induction hypothesis
tells us we can assign each of these blocks a common prefix
which is one bit shorter than their log2p(q) values, plus one
extra bit to distinguish them, which gives exactly log2p(q)
bits in all, as desired for them. If we have a code like this,
then the total message length is exactly the lower bound we
have found from the first part of Shannon’s Theorem. We
can also observe that given any values for p(q), we can lower
them to make them reciprocals of powers of 2 at the cost at
worst of halving them. (In reality if we lower some p(q) we
would raise others to keep the sum at 1, so we would do lots
better than halving each p(q). But halving each p(q) adds only
1 to each negative log, and the that would affect the total
message length at worst as indicated in the statement of the
theorem. So we can actually find a reasonably good code, by
rounding our relative frequencies down to inverse powers of
2 and assigning code words to have length given by their -logf(q)
values But this code is not optimal, and we can do better by
a still easier procedure which we will describe in the next
section. Exercises: 1. Use a spreadsheet or other means to plot
the function H(p1) in the range 0 to 1 for p1. 2. On a spreadsheet
or otherwise compute n!/(n/e)n /n1/2 for n = 2,4,8,16,32,64,128
Use these results and extrapolation to estimate the limit of
this ratio. (to extrapolate look at s(k) = 2r(2k)-r(k) then
t(k) = 4s(2k)/3 –s(k)/3 then u(k) = 8t(2k)/7-t(k)/7 then
v(k) = 16u(2k)/15 – u(k)/15 Compare your answer to (2
pi)1/2. 3. Given the following frequencies f(q) for q=0 to 15,
compute the entropy of this list (compute their total, n, the
p(q) and then the entropy function of these) 1, 22,11,15,2,1,2,3,45,95,2,2,1,2,25,1