so we've been talking about information
how you measure information
and of course, you measure information in bits
how you can use information to label things
for instance, with bar codes
so you can use information
to label things
and then we talked about
probability and information
and if I have probabilities for events P_i
then it says that the amount of information
that's associated with this event occurring is
minus the sum over i
p_i log to the base 2
of p_i
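This formula is easy to play with numerically. Here's a minimal sketch (the function name and the example probabilities are mine, just for illustration) computing minus the sum of p_i log2 p_i:

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum_i p_i * log2(p_i), in bits.

    Terms with p_i = 0 contribute nothing (the limit p*log p -> 0).
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries exactly one bit of information.
print(entropy([0.5, 0.5]))    # 1.0
# A heavily biased coin carries much less than a bit.
print(entropy([0.9, 0.1]))
# A certain outcome carries no information at all.
print(entropy([1.0]))         # 0.0
```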
this beautiful formula that was developed
by Maxwell, Boltzmann, and Gibbs
back in the middle of the nineteenth century
to talk about the amount of entropy
in atoms and molecules
this is often called S
for entropy, as well
and then was rediscovered
by Claude Shannon
in the 1940's
to talk about information theory in the abstract
and the mathematical theory of communication
in fact, there is a funny story
about this that
Shannon, when he came up with this formula
minus sum over i p_i log to the base 2 of p_i
he went to John Von Neumann
the famous mathematician
and he said, "what should I call this quantity?"
and Von Neumann says
"you should call it H, because that's what Boltzmann called it "
but Von Neumann, who had a famously good memory
apparently forgot
that Boltzmann's H
from his famous "H theorem"
was the same thing but without the minus sign
so it's a negative quantity
and it gets more and more negative
as opposed to entropy
which is a positive quantity
and gets more and more positive
so, actually these fundamental formulas
about information theory
go back to the mid 19th century
a hundred and fifty years ago
so now we'd like to apply them
to ideas about communication
and to do that, I'd like to tell you
a little bit more about
probability
so, we talked about
probabilities for events
probability x
you know x equals
"it's sunny"
probability of y
y is "it's raining"
and we could look at the probability
of x
and I'm gonna use the notation
I introduced for Boolean logic before
is the probability of this thing right here
means "AND"
"probability of X AND Y"
or we can also just call
this the probability of X Y simultaneously
keep on streamlining our notation
this is the probability
that it's raining... it's sunny and it's raining
now, mostly in the world
this is a pretty small probability
but here in Santa Fe
it happens all the time
and as a result, you get rather beautiful
rainbows...single, double, triple
on a daily basis
so, we have...
this is what is called
the joint probability
the joint probability that it's sunny and it's raining
the joint probability of X and Y
and what do we expect of this
joint probability?
so, we have the probability of X and Y
and this tells you the probability
that it's sunny and it's raining
we can also look at the probability
X AND NOT Y
so X AND (NOT Y)
again using our notation
introduced to us by the famous husband
of the niece of the Surveyor General
of British India
George Boole, married to Mary Everest
and we have a relationship which says
that the probability of X on its own
should be equal to the probability
of X AND Y plus the
probability of X AND (NOT Y)
and the probability of X on its own
is called the "marginal probability"
so, it's just the probability that
it's sunny on its own
so the probability that it's sunny on its own
is the probability that it's sunny and it's raining
plus the probability that it's sunny and it's not raining
I think this makes some kind of sense
why is it called the "marginal probability"?
I have no idea
so let's not even worry about it
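This relationship is easy to check with numbers. A minimal sketch, where all four joint probabilities are made-up Santa Fe-ish values (hypothetical, not real weather data):

```python
# Hypothetical joint probabilities for (sunny, raining) and so on
p_sunny_raining = 0.05
p_sunny_not_raining = 0.80
p_not_sunny_raining = 0.10
p_not_sunny_not_raining = 0.05

# Marginal probability:
# P(X) = P(X AND Y) + P(X AND NOT Y)
p_sunny = p_sunny_raining + p_sunny_not_raining
print(p_sunny)   # 0.85, up to float rounding

# Sanity check: the four joint probabilities cover all cases
total = (p_sunny_raining + p_sunny_not_raining
         + p_not_sunny_raining + p_not_sunny_not_raining)
print(total)     # should be 1
```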
there's a very nice picture of probabilities
in terms of set theory
I don't know about you
but I grew up in the age of "new math"
where they tried to teach us
about set theory
and unions of sets
and intersections of sets and things like that
from starting at a very early age
which means people of my generation
are completely unable to do their tax returns
but for me, dealing a lot with math
it actually has been quite helpful
for my career to learn about set theory at the age of 3 or 4
or whatever it was
so, we have a picture like this
this is the space or the set of all events
here is the set X
which is the set of events X, where
it's sunny
here is the set Y, which is
the set of events where it's raining
this thing right here is called
"X intersection Y"
which is the set of events
where it's both sunny and it's raining
but in contrast, if I look at
this right here
this is "X union Y"
which is the set of events
where it's either sunny or raining
and now you can kind of see
where George Boole got his funny
"cap" and "cup" notation
we can pair this with X AND Y
X AND Y, from a logical standpoint
is essentially the same as this intersection
of these sets
and similarly, X union Y
is X OR Y
so when I take the logical statement
corresponding to the set of events
that I write as X AND Y
the set of events is the intersection
of it's sunny and it's raining
X OR Y is the union of events
where it's sunny or it's raining
and you can have all kinds of you know
nice pictures
here's Z where let's say it's snowy at the
same time it's sunny
which is something that I've seen happen
here in Santa Fe
this is not so strange here
where we have X intersection Y intersection Z
which is not the empty set in Santa Fe
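The set picture maps directly onto set operations in code. A small sketch with invented toy data (the day indices are hypothetical), where AND is intersection and OR is union:

```python
# Model events as sets of day indices (made-up toy data)
days = set(range(10))                 # the space of all events
sunny = {0, 1, 2, 3, 4, 5, 6, 7}      # X: days it's sunny
raining = {3, 7, 8}                   # Y: days it's raining
snowy = {7, 9}                        # Z: days it's snowy

# X AND Y is the intersection: sunny and raining (rainbow days)
print(sunny & raining)                # {3, 7}

# X OR Y is the union: sunny or raining
print(sunny | raining)

# The triple intersection need not be empty -- a Santa Fe specialty
print(sunny & raining & snowy)        # {7}
```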
ok, so now let's actually look
at the kinds of information that are
associated with this
suppose that I have a set of possible
events, I'll call one set labeled by i
the other set, labeled by j
and now I can look at p of i and j
so this is a case where the
first type of event
is i and the second type of event is j
and I can define
you know, I'm gonna do this
slightly differently
let's call this... we'll be slightly fancier
we'll call these event x_i and event y_j
so, i labels the different events of x
and j labels the different events of y
so, for instance x_i could be two events
either it's sunny or it's not sunny
so i could be zero, and it would be
'it's not sunny'
and 1 could be it's sunny
and j could be it's either raining
or it's not raining
so there are two possible values of Y
I'm just trying to make my life easier
so we have a joint probability
distribution x_i and y_j
this is our joint probability, as before
and now we have a joint information
which we shall call I of X and Y
this is the information
that's inherent in the joint set of events
X and Y
in our case, it being sunny and not sunny,
raining and not raining
and this just takes the same form as before
we sum over all different possibilities
sunny-raining, not sunny-raining,
sunny-not raining, not sunny-not raining
this is why one shouldn't try to enumerate these things
minus the sum of p of x_i y_j log to the base 2 of p of x_i y_j
so this is the amount of information that's
inherent with these two sets of events
together
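The joint information is just the same entropy formula applied to the joint distribution. A sketch with hypothetical numbers for the four sunny/raining combinations:

```python
import math

# Hypothetical joint distribution p(x_i, y_j) over
# (sunny?, raining?), indexed i, j in {0, 1}
p = {
    (1, 0): 0.80,  # sunny, not raining
    (1, 1): 0.05,  # sunny, raining
    (0, 1): 0.10,  # not sunny, raining
    (0, 0): 0.05,  # not sunny, not raining
}

# Joint information I(X, Y) = -sum_{i,j} p(x_i, y_j) log2 p(x_i, y_j)
joint_info = -sum(q * math.log2(q) for q in p.values() if q > 0)
print(joint_info)   # just over one bit for these numbers
```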
and of course, we still have this, if you like the
marginal information, the information
of X on its own
which is now just the sum over events x
on its own
of the marginal distribution
why it's called "marginal" I don't know
it's just the probability for X on its own
p of x_i log to the base 2 of p of x_i
and similarly we can talk about
I of Y is minus the sum over j
p of Y_j log to the base 2 of
p of Y_j
this is the amount of information
inherent in whether it's sunny or not sunny
it could be up to one bit of information:
if the probability of being sunny
or not sunny is one half
then there's a full bit of information
but let me tell you, in Santa Fe
there's far less than a bit of information
on whether it's sunny or not
because it's sunny most of the time
similarly, raining or not raining
could be up to a bit of information
if each of these probabilities is 1/2
again we're in the high desert here
it's normally not raining
so, you have far less than a bit of information
on the question whether it's raining or not raining
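The marginal informations I(X) and I(Y) come from summing the joint distribution over the other variable. A sketch, reusing the same kind of hypothetical Santa Fe numbers (the joint probabilities are made up):

```python
import math

def h(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

# Hypothetical joint distribution p(x_i, y_j) over (sunny?, raining?)
p = {(1, 0): 0.80, (1, 1): 0.05, (0, 1): 0.10, (0, 0): 0.05}

# Marginal distributions: sum the joint over the other index
p_x = {i: sum(q for (a, b), q in p.items() if a == i) for i in (0, 1)}
p_y = {j: sum(q for (a, b), q in p.items() if b == j) for j in (0, 1)}

# Marginal informations I(X) and I(Y)
i_x = h(p_x.values())
i_y = h(p_y.values())
print(i_x, i_y)   # each well under one bit: mostly sunny, mostly dry
```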
so, we have joint information
constructed out of joint probabilities
marginal information, or information on the original variables on their own,
constructed
out of marginal probabilities
and let me end this little section by defining
a very useful quantity which is called the
mutual information
the mutual information, which is defined to be
I of X... I normally define it with this little colon
right in the middle, because it looks nice
and symmetrical
and we'll see that this is indeed symmetrical
it's the information in X plus the information in Y
minus the information in X and Y taken together
it's possible to show that this is always greater or equal to zero
and this mutual information can be thought of as the amount of information
the variable X has about Y
if X and Y are completely uncorrelated, so it's completely
uncorrelated whether it's sunny
or not sunny or raining or not raining
then this will be zero
however, in the case of sunny and not sunny
raining and not raining, they are very correlated
in the sense that once you know that it's sunny
it's probably not raining, even though
sometimes that does happen here in Santa Fe
and so in that case, you'd expect
to find a large amount of mutual information
in most places in fact, you'll find that knowing
whether it's sunny or not sunny
gives you a very good prediction
about whether it's raining or it's not raining
mutual information measures the amount of information
that X can tell us about Y
it's symmetric, so it tells us the amount of information that
Y can tell us about X
and another way of thinking about it
is that it's the amount of information
that X and Y hold in common
which is why it's called "mutual information"
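Putting the pieces together, the mutual information I(X : Y) = I(X) + I(Y) - I(X, Y) can be computed from the same hypothetical joint distribution; the second check below sketches why it vanishes when X and Y are independent (p(x, y) = p(x)p(y)):

```python
import math

def h(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

# Hypothetical joint distribution over (sunny?, raining?) -- toy numbers
p = {(1, 0): 0.80, (1, 1): 0.05, (0, 1): 0.10, (0, 0): 0.05}

# Marginal distributions for X (sunny?) and Y (raining?)
p_x = [sum(q for (a, _), q in p.items() if a == i) for i in (0, 1)]
p_y = [sum(q for (_, b), q in p.items() if b == j) for j in (0, 1)]

# Mutual information I(X : Y) = I(X) + I(Y) - I(X, Y), always >= 0
mutual = h(p_x) + h(p_y) - h(p.values())
print(mutual)   # about 0.2 bits: sun tells you something about rain

# For independent variables, p(x, y) = p(x) * p(y), and it vanishes
independent = {(a, b): pa * pb
               for a, pa in zip((0, 1), p_x)
               for b, pb in zip((0, 1), p_y)}
print(h(p_x) + h(p_y) - h(independent.values()))  # essentially zero
```

Note that the formula is symmetric under swapping X and Y, matching the point above that X tells us as much about Y as Y tells us about X.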