1
00:00:09,629 --> 00:00:12,554
so we've been talking about information
2
00:00:12,608 --> 00:00:15,780
how you measure information
3
00:00:19,332 --> 00:00:20,936
and of course, you measure information in bits
4
00:00:21,749 --> 00:00:26,872
how you can use information to label things
5
00:00:27,345 --> 00:00:30,712
for instance, with bar codes
6
00:00:34,798 --> 00:00:37,028
so you can use information
7
00:00:37,063 --> 00:00:38,065
to label things
8
00:00:38,065 --> 00:00:40,148
and then we talked about
9
00:00:40,478 --> 00:00:45,490
probability and information
10
00:00:45,490 --> 00:00:47,249
and if I have probabilities for events P_i
11
00:00:47,249 --> 00:00:51,577
then it says that the amount of information
12
00:00:51,745 --> 00:00:53,530
that's associated with this event occurring is
13
00:00:54,006 --> 00:00:56,158
minus the sum over i
14
00:00:56,426 --> 00:00:58,259
p_i log to the base 2
15
00:00:58,332 --> 00:01:00,834
of p_i
16
00:01:01,161 --> 00:01:02,047
this beautiful formula that was developed
17
00:01:02,507 --> 00:01:04,031
by Maxwell, Boltzmann, and Gibbs
18
00:01:04,726 --> 00:01:07,707
back in the middle of the nineteenth century
19
00:01:11,235 --> 00:01:12,106
to talk about the amount of entropy
20
00:01:14,759 --> 00:01:15,026
in atoms or molecules
21
00:01:16,800 --> 00:01:17,050
this is often called S
22
00:01:19,611 --> 00:01:20,534
for entropy, as well
23
00:01:21,595 --> 00:01:22,760
and then was rediscovered
24
00:01:22,822 --> 00:01:23,666
by Claude Shannon
25
00:01:23,735 --> 00:01:24,624
in the 1940s
26
00:01:24,766 --> 00:01:26,587
to talk about information theory in the abstract
27
00:01:26,660 --> 00:01:27,815
and the mathematical theory of communication
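As a rough illustration of the formula just read out, minus the sum over i of p_i log to the base 2 of p_i, here is a minimal Python sketch. The function name `information_bits` and the example distributions are assumptions made for this illustration; they do not come from the lecture.

```python
import math

def information_bits(probs):
    """Return -sum_i p_i * log2(p_i), measured in bits.

    Terms with p_i == 0 are skipped, following the usual convention
    that 0 * log(0) counts as 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative distributions (made up for this sketch).
fair_coin = information_bits([0.5, 0.5])     # two equally likely outcomes
four_way = information_bits([0.25] * 4)      # four equally likely outcomes
biased = information_bits([0.9, 0.1])        # one outcome much more likely
```

Under these assumptions a fair coin carries one bit, four equally likely outcomes carry two bits, and the biased case carries less than one bit.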
28
00:01:28,048 --> 00:01:33,006
in fact, there is a funny story
29
00:01:33,101 --> 00:01:34,764
about this that
30
00:01:35,677 --> 00:01:38,377
Shannon, when he came up with this formula
31
00:01:38,390 --> 00:01:41,143
minus sum over i p_i log to the base 2 of p_i
32
00:01:41,231 --> 00:01:43,232
he went to John von Neumann
33
00:01:43,423 --> 00:01:45,007
the famous mathematician
34
00:01:45,188 --> 00:01:48,283
and he said, "what should I call this quantity?"
35
00:01:48,378 --> 00:01:49,821
and von Neumann says
36
00:01:49,993 --> 00:01:53,641
"you should call it H, because that's what Boltzmann called it"
37
00:01:53,906 --> 00:01:55,447
but von Neumann, who had a famous memory
38
00:01:57,527 --> 00:01:57,777
apparently forgot
39
00:01:57,777 --> 00:01:59,854
that Boltzmann's H
40
00:02:00,045 --> 00:02:02,080
of his famous "H theorem"
41
00:02:02,353 --> 00:02:03,976
was the same thing but without the minus sign
42
00:02:04,028 --> 00:02:05,212
so it's a negative quantity
43
00:02:05,912 --> 00:02:06,371
and it gets more and more negative
44
00:02:06,549 --> 00:02:07,553
as opposed to entropy
45
00:02:07,576 --> 00:02:08,752
which is a positive quantity
46
00:02:08,895 --> 00:02:10,552
and gets more and more positive
47
00:02:10,678 --> 00:02:13,173
so, actually these fundamental formulas
48
00:02:13,246 --> 00:02:14,782
about information theory
49
00:02:14,838 --> 00:02:17,366
go back to the mid 19th century
50
00:02:20,236 --> 00:02:20,841
a hundred and fifty years
51
00:02:21,298 --> 00:02:22,627
so now we'd like to apply them
52
00:02:22,937 --> 00:02:25,071
to ideas about communication
53
00:02:25,318 --> 00:02:26,384
and to do that, I'd like to tell you
54
00:02:26,577 --> 00:02:27,575
a little bit more about
55
00:02:27,600 --> 00:02:28,548
probability
56
00:02:32,186 --> 00:02:33,963
so, we talked about
57
00:02:34,432 --> 00:02:34,967
probabilities for events
58
00:02:35,447 --> 00:02:35,822
probability x
59
00:02:37,707 --> 00:02:39,140
you know x equals
60
00:02:42,846 --> 00:02:43,096
"it's sunny"
61
00:02:43,096 --> 00:02:44,610
probability of y
62
00:02:45,662 --> 00:02:49,085
y is "it's raining"
63
00:02:52,794 --> 00:02:53,067
and we could look at the probability
64
00:02:53,067 --> 00:02:54,694
of x
65
00:02:57,032 --> 00:02:57,282
and I'm gonna use the notation
66
00:02:57,282 --> 00:02:59,129
I introduced for Boolean logic before
67
00:03:02,553 --> 00:03:02,803
is the probability of this thing right here
68
00:03:04,835 --> 00:03:05,085
means "AND"
69
00:03:05,085 --> 00:03:07,739
"probability of X AND Y"
70
00:03:07,838 --> 00:03:09,480
or we can also just call
71
00:03:13,838 --> 00:03:14,269
this the probability of X Y simultaneously
72
00:03:14,378 --> 00:03:16,996
keep on streamlining our notation
73
00:03:22,515 --> 00:03:23,549
this is the probability
74
00:03:23,704 --> 00:03:32,035
that it's raining... it's sunny and it's raining
75
00:03:36,238 --> 00:03:36,488
now, mostly in the world
76
00:03:36,488 --> 00:03:36,877
this is a pretty small probability
77
00:03:38,873 --> 00:03:39,123
but here in Santa Fe
78
00:03:39,767 --> 00:03:40,389
it happens all the time
79
00:03:41,982 --> 00:03:42,232
and as a result, you get rather beautiful
80
00:03:42,232 --> 00:03:43,919
rainbows...single, double, triple
81
00:03:44,068 --> 00:03:45,850
on a daily basis
82
00:03:45,933 --> 00:03:46,794
so, we have...
83
00:03:47,324 --> 00:03:49,636
this is what is called
84
00:03:49,665 --> 00:03:54,263
the joint probability
85
00:03:58,698 --> 00:04:01,862
the joint probability that it's sunny and it's raining
86
00:04:01,939 --> 00:04:05,842
the joint probability of X and Y
87
00:04:08,716 --> 00:04:12,795
and what do we expect of this
joint probability?
88
00:04:13,432 --> 00:04:16,560
so, we have the probability of X and Y
89
00:04:17,213 --> 00:04:22,186
and this tells you the probability
that it's sunny and it's raining
90
00:04:22,941 --> 00:04:25,000
we can also look at the probability
of X AND NOT Y
91
00:04:34,664 --> 00:04:34,914
so X AND (NOT Y)
92
00:04:35,323 --> 00:04:36,184
again using our notation
93
00:04:39,564 --> 00:04:41,209
introduced to us by the famous husband
94
00:04:41,307 --> 00:04:41,882
of the daughter of the Surveyor General
of the British ___(?)
95
00:04:46,557 --> 00:04:46,867
George Boole, married to Mary Everest
96
00:04:46,867 --> 00:04:49,540
and we have a relationship which says
97
00:04:52,117 --> 00:04:52,515
that the probability of X on its own
98
00:04:56,203 --> 00:04:56,453
should be equal to the probability
of X AND Y plus the
99
00:04:56,453 --> 00:04:59,409
probability of X AND (NOT Y)
100
00:05:00,215 --> 00:05:02,449
and the probability of X on its own
101
00:05:03,082 --> 00:05:07,728
is called the "marginal probability"
102
00:05:10,860 --> 00:05:11,359
so, it's just the probability that
103
00:05:13,018 --> 00:05:13,268
it's sunny on its own
104
00:05:13,268 --> 00:05:15,306
so the probability that it's sunny on its own
105
00:05:15,574 --> 00:05:18,052
is the probability that it's sunny and it's raining
106
00:05:19,527 --> 00:05:19,942
plus the probability that it's sunny and it's not raining
107
00:05:22,199 --> 00:05:22,449
I think this makes some kind of sense
108
00:05:24,475 --> 00:05:24,725
why is it called the "marginal probability"?
109
00:05:25,383 --> 00:05:25,633
I have no idea
110
00:05:25,633 --> 00:05:28,432
so let's not even worry about it
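The relationship just stated, that the probability of X on its own is the probability of X AND Y plus the probability of X AND (NOT Y), can be checked with a small sketch. The joint probabilities below are invented numbers for illustration only, and `marginal_sunny` is a hypothetical helper name.

```python
# Hypothetical joint probabilities keyed by (sunny?, raining?).
# The values are made up for illustration and sum to 1.
joint = {
    (True, True): 0.05,    # X AND Y: sunny and raining
    (True, False): 0.75,   # X AND (NOT Y): sunny and not raining
    (False, True): 0.10,
    (False, False): 0.10,
}

def marginal_sunny(joint, sunny):
    """Sum the joint probability over both values of 'raining'."""
    return sum(p for (s, _), p in joint.items() if s == sunny)

p_sunny = marginal_sunny(joint, True)
p_not_sunny = marginal_sunny(joint, False)
```

With these assumed numbers the marginal probability of sunny is the sum of the two entries in which it is sunny, whether or not it is raining.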
111
00:05:28,786 --> 00:05:31,563
there's a very nice picture of probabilities
112
00:05:31,563 --> 00:05:38,981
in terms of set theory
113
00:05:38,981 --> 00:05:41,094
I don't know about you
114
00:05:41,094 --> 00:05:43,443
but I grew up in the age of "new math"
115
00:05:43,443 --> 00:05:44,805
where they tried to teach us
116
00:05:44,805 --> 00:05:46,580
about set theory
117
00:05:46,580 --> 00:05:48,024
and unions of sets
118
00:05:48,024 --> 00:05:50,452
and intersections of sets and things like that
119
00:05:50,452 --> 00:05:53,000
starting from a very early age
120
00:05:53,000 --> 00:05:54,897
which means people of my generation
121
00:05:54,897 --> 00:05:57,575
are completely unable to do their tax returns
122
00:05:57,575 --> 00:06:00,073
but for me, dealing a lot with math
123
00:06:00,073 --> 00:06:02,258
it actually has been quite helpful
124
00:06:02,258 --> 00:06:04,456
for my career to learn about set theory at the age of 3 or 4
125
00:06:04,456 --> 00:06:06,326
or whatever it was
126
00:06:06,326 --> 00:06:09,255
so, we have a picture like this
127
00:06:09,255 --> 00:06:17,869
this is the space or the set of all events
128
00:06:17,869 --> 00:06:20,570
here is the set X
129
00:06:20,570 --> 00:06:22,312
which is the set of events X, where
it's sunny
130
00:06:22,312 --> 00:06:28,348
here is the set Y, which is
the set of events where it's raining
131
00:06:30,532 --> 00:06:32,432
this thing right here is called
132
00:06:32,489 --> 00:06:34,084
"X intersection Y"
133
00:06:34,995 --> 00:06:37,302
which is the set of events
134
00:06:37,302 --> 00:06:40,769
where it's both sunny and it's raining
135
00:06:40,769 --> 00:06:43,165
but in contrast, if I look at
136
00:06:43,165 --> 00:06:44,867
this right here
137
00:06:44,867 --> 00:06:47,733
this is "X union Y"
138
00:06:47,733 --> 00:06:49,090
which is the set of events
139
00:06:49,090 --> 00:06:51,657
where it's either sunny or raining
140
00:06:51,657 --> 00:06:52,806
and now you can kind of see
141
00:06:52,806 --> 00:06:59,159
where George Boole got his funny
"cap" and "cup" notation
142
00:06:59,159 --> 00:07:02,202
we can pair this with X AND Y
143
00:07:02,202 --> 00:07:04,818
X AND Y, from a logical standpoint
144
00:07:04,818 --> 00:07:10,088
is essentially the same as this union
of these sets
145
00:07:10,088 --> 00:07:15,043
and similarly, X intersection Y
is X OR Y --translator's note: professor Lloyd meant "union" when referring to OR and "intersection" when referring to AND http://www.onlinemathlearning.com/intersection-of-two-sets.html--
146
00:07:15,550 --> 00:07:20,035
so when I take the logical statement
corresponding to the set of events
147
00:07:20,035 --> 00:07:22,202
that I write it as X AND Y
148
00:07:22,202 --> 00:07:26,743
the set of events is the intersection
of it's sunny and it's raining
149
00:07:26,743 --> 00:07:32,613
X OR Y is the intersection of events
where it's sunny or it's raining
--translator's note: professor Lloyd meant "union"
when referring to OR, "intersection" refers to AND--
150
00:07:33,124 --> 00:07:37,124
and you can have all kinds of, you know,
nice pictures
151
00:07:42,092 --> 00:07:46,784
here's Z where let's say it's snowy at the
same time it's sunny
152
00:07:46,784 --> 00:07:49,177
which is something that I've seen happen
here in Santa Fe
153
00:07:49,177 --> 00:07:50,069
this is not so strange in here
154
00:07:50,069 --> 00:07:54,813
where we have X intersection Y intersection Z
155
00:07:54,813 --> 00:07:58,900
which is not the empty set here in Santa Fe
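The set picture can be sketched with Python's built-in sets, where intersection plays the role of AND and union plays the role of OR. The days of the week and the weather assigned to them are invented for this illustration, and treating each day as equally likely is an assumption.

```python
# Hypothetical "set of all events": the days of one week (made up).
all_days = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}
sunny = {"Mon", "Tue", "Wed", "Fri", "Sat"}   # the set X
rainy = {"Wed", "Thu", "Sat"}                 # the set Y

sunny_and_rainy = sunny & rainy   # X intersection Y, i.e. X AND Y
sunny_or_rainy = sunny | rainy    # X union Y, i.e. X OR Y
sunny_not_rainy = sunny - rainy   # X AND (NOT Y)

def prob(events):
    """Probability of a set of days, assuming every day is equally likely."""
    return len(events) / len(all_days)
```

In this sketch `prob(sunny)` equals `prob(sunny_and_rainy) + prob(sunny_not_rainy)`, which is the marginal probability relationship expressed in terms of sets.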
156
00:07:58,900 --> 00:08:01,341
ok, so now let's actually look
157
00:08:01,341 --> 00:08:02,761
at the kinds of information that are
associated with this
158
00:08:02,761 --> 00:08:10,588
suppose that I have a set of possible
events, I'll call one set labeled by i
159
00:08:10,588 --> 00:08:16,182
the other set, labeled by j
160
00:08:16,182 --> 00:08:22,411
and now I can look at p of i and j
161
00:08:22,411 --> 00:08:25,981
so this is a case where the
first type of event
162
00:08:25,981 --> 00:08:29,517
is i and the second type of event is j
163
00:08:29,517 --> 00:08:31,726
and I can define
164
00:08:31,726 --> 00:08:33,586
you know, I'm gonna do this
slightly different
165
00:08:33,586 --> 00:08:37,179
let's call this... we'll be slightly fancier
166
00:08:37,179 --> 00:08:41,546
we'll call these event x_i and event y_j
167
00:08:41,546 --> 00:08:43,782
so, i labels the different events of x
168
00:08:43,782 --> 00:08:46,105
and j labels the different events of y
169
00:08:46,105 --> 00:08:51,491
so, for instance x_i could be two events
either it's sunny or it's not sunny
170
00:08:51,491 --> 00:08:55,639
so i could be zero, and it would be
'it's not sunny'
171
00:08:55,639 --> 00:08:56,540
and 1 could be it's sunny
172
00:08:56,540 --> 00:08:58,352
and j could be it's either raining
173
00:08:58,352 --> 00:08:59,563
or it's not raining
174
00:08:59,563 --> 00:09:01,771
so there are two possible values of y
175
00:09:01,771 --> 00:09:04,044
I'm just trying to make my life easier
176
00:09:04,044 --> 00:09:09,612
so we have a joint probability
distribution x_i and y_j
177
00:09:09,612 --> 00:09:12,045
this is our joint probability, as before
178
00:09:12,045 --> 00:09:14,883
and now we have a joint information
179
00:09:14,883 --> 00:09:18,376
which we shall call I of X and Y
180
00:09:18,376 --> 00:09:21,023
this is the information
181
00:09:21,023 --> 00:09:21,273
that's inherent in the joint set of events
182
00:09:21,273 --> 00:09:24,233
X and Y
183
00:09:24,233 --> 00:09:26,386
in our case, it being sunny and not sunny,
raining and not raining
184
00:09:26,386 --> 00:09:29,836
and this just takes the same form as before
185
00:09:29,836 --> 00:09:32,274
we sum over all different possibilities
186
00:09:32,274 --> 00:09:42,296
sunny-raining, not sunny-raining,
sunny-not raining, not sunny-not raining
187
00:09:42,296 --> 00:09:45,986
this is why one shouldn't try to enumerate these things
188
00:09:45,986 --> 00:09:53,138
p of x_i y_j logarithm of p of x_i y_j
189
00:09:53,138 --> 00:09:55,747
so this is the amount of information that's
190
00:09:55,747 --> 00:09:57,282
inherent with these two sets of events
together
191
00:09:57,282 --> 00:09:59,006
and of course, we still have this, if you like the
192
00:10:00,802 --> 00:10:03,887
marginal information, the information
of X on its own
193
00:10:10,813 --> 00:10:11,075
which is now just the sum over events x
on its own
194
00:10:13,468 --> 00:10:13,769
of the marginal distribution
195
00:10:13,769 --> 00:10:14,844
why it's called "marginal" I don't know
196
00:10:14,844 --> 00:10:17,196
it's just the probability for X on its own
197
00:10:22,078 --> 00:10:22,710
p of X_i log base two of p of X_i
198
00:10:22,710 --> 00:10:24,826
and similarly we can talk about
199
00:10:32,311 --> 00:10:33,431
I of Y is minus the sum over j
p of Y_j log to the base 2 of
200
00:10:35,829 --> 00:10:36,079
p of Y_j
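To make the joint information I of X and Y and the marginal informations I of X and I of Y concrete, here is a minimal sketch that computes all three from one joint table. The table values are invented, and the names `p_xy`, `info_bits`, `I_XY`, `I_X` and `I_Y` are assumptions for this example.

```python
import math

# Hypothetical joint distribution p(x_i, y_j), made up for illustration.
# Rows: i = 0 (not sunny), i = 1 (sunny).
# Columns: j = 0 (not raining), j = 1 (raining).
p_xy = [[0.10, 0.10],
        [0.75, 0.05]]

def info_bits(probs):
    """Return -sum p * log2(p) in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal distributions: sum the joint table over the other variable.
p_x = [sum(row) for row in p_xy]
p_y = [sum(col) for col in zip(*p_xy)]

I_XY = info_bits([p for row in p_xy for p in row])  # joint information
I_X = info_bits(p_x)                                # information of X on its own
I_Y = info_bits(p_y)                                # information of Y on its own
```

Because sunny and not raining dominate this assumed table, both marginal informations come out below one bit.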
201
00:10:38,369 --> 00:10:38,988
this is the amount of information
202
00:10:40,966 --> 00:10:41,216
inherent in whether it's sunny or not sunny
203
00:10:43,319 --> 00:10:43,630
it could be up to a bit of information
204
00:10:46,327 --> 00:10:47,131
if it's probability one half of being
sunny or not sunny
205
00:10:49,213 --> 00:10:49,678
then there's a bit of information; let me
tell you, in Santa Fe
206
00:10:51,676 --> 00:10:52,215
there's far less than a bit of information
207
00:10:52,215 --> 00:10:53,365
on whether it's sunny or not
208
00:10:53,365 --> 00:10:54,715
because it's sunny most of the time
209
00:10:54,715 --> 00:10:56,402
similarly, raining or not raining
210
00:10:56,530 --> 00:10:58,441
could be up to a bit of information
211
00:11:00,236 --> 00:11:00,486
if each of these probabilities is 1/2
212
00:11:02,861 --> 00:11:03,642
again we're in the high desert here
213
00:11:05,084 --> 00:11:05,471
it's normally not raining
214
00:11:05,471 --> 00:11:07,848
so, you've far less than a bit of information
215
00:11:09,831 --> 00:11:10,428
on the question whether it's raining or not raining
216
00:11:12,768 --> 00:11:13,018
so, we have joint information
217
00:11:14,891 --> 00:11:16,822
constructed out of joint probabilities
218
00:11:19,138 --> 00:11:19,388
marginal information, or information on the original variables on their own,
219
00:11:21,606 --> 00:11:22,942
constructed
out of marginal probabilities
220
00:11:26,134 --> 00:11:26,481
and let me end this little section by defining
221
00:11:28,635 --> 00:11:32,481
a very useful quantity which is called the
mutual information
222
00:11:38,809 --> 00:11:42,431
the mutual information, which is defined to be
223
00:11:42,431 --> 00:11:46,651
I( X ...I normally define it with this little colon
224
00:11:46,651 --> 00:11:49,776
right in the middle, because it looks nice
and symmetrical
225
00:11:49,776 --> 00:11:54,964
and we'll see that this isn't symmetrical
226
00:11:54,964 --> 00:11:58,093
it's the information in X plus the information in Y
227
00:11:58,093 --> 00:12:01,211
minus the information in X and Y taken together
228
00:12:01,211 --> 00:12:03,961
it's possible to show that this is always greater than or equal to zero
229
00:12:03,961 --> 00:12:09,112
and this mutual information can be thought of as the amount of information
230
00:12:09,112 --> 00:12:11,981
the variable X has about Y
231
00:12:11,981 --> 00:12:16,546
if X and Y are completely uncorrelated, so it's completely
uncorrelated whether it's sunny
232
00:12:16,546 --> 00:12:18,546
or not sunny or raining or not raining
233
00:12:18,546 --> 00:12:22,412
then this will be zero
234
00:12:22,412 --> 00:12:24,719
however, in the case of sunny and not sunny
235
00:12:24,719 --> 00:12:28,966
raning and not raining, they are very correlated
236
00:12:28,966 --> 00:12:32,171
in the sense that once you know that it's sunny
237
00:12:32,171 --> 00:12:33,574
it's probably not raining, even though
238
00:12:33,574 --> 00:12:35,371
sometimes that does happen here in Santa Fe
239
00:12:35,371 --> 00:12:36,387
and so in that case, you'd expect
240
00:12:36,387 --> 00:12:37,664
to find a large amount of mutual information
241
00:12:37,664 --> 00:12:39,804
in most places in fact, you'll find that knowing
242
00:12:39,804 --> 00:12:41,148
whether it's sunny or not sunny
243
00:12:41,148 --> 00:12:42,357
gives you a very good prediction
244
00:12:42,357 --> 00:12:45,128
about whether it's raining or it's not raining
245
00:12:45,128 --> 00:12:49,149
mutual information measures the amount of information
that X can tell us about Y
246
00:12:49,149 --> 00:12:53,404
it's symmetric, so it tells us the amount of information that
Y can tell us about X
247
00:12:53,404 --> 00:12:55,910
and another way of thinking about it
248
00:12:55,910 --> 00:12:58,359
is that it's the amount of information
249
00:12:58,359 --> 00:13:00,042
that X and Y hold in common
250
00:13:00,042 --> 00:13:05,956
which is why it's called "mutual information"
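The definition of mutual information, the information in X plus the information in Y minus the information in X and Y taken together, can be tried out on two extreme cases. This is only a sketch: the function names and the two joint tables are assumptions chosen for illustration.

```python
import math

def info_bits(probs):
    """Return -sum p * log2(p) in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_xy):
    """I(X:Y) = I(X) + I(Y) - I(X,Y) for a joint table p_xy[i][j]."""
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(col) for col in zip(*p_xy)]
    joint = [p for row in p_xy for p in row]
    return info_bits(p_x) + info_bits(p_y) - info_bits(joint)

# Made-up joint tables: rows index sunny or not, columns index raining or not.
independent = [[0.25, 0.25],
               [0.25, 0.25]]      # knowing X says nothing about Y
anticorrelated = [[0.0, 0.5],
                  [0.5, 0.0]]     # sunny exactly when it is not raining

mi_independent = mutual_information(independent)
mi_anticorrelated = mutual_information(anticorrelated)

# Swapping the roles of X and Y (transposing the table) leaves I(X:Y) unchanged.
skewed = [[0.10, 0.10],
          [0.75, 0.05]]
skewed_swapped = [list(col) for col in zip(*skewed)]
```

Under these assumptions the independent table gives zero mutual information, the perfectly anticorrelated one gives a full bit, and the transposed table gives the same value as the original, which matches the symmetry described above.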