1
00:00:02,120 --> 00:00:08,290
We've introduced this problem of producing
a parsimonious model of the data,
2
00:00:08,300 --> 00:00:12,640
meaning a description of the probabilities
of each of the possible configurations.
3
00:00:12,650 --> 00:00:16,000
Now what I'm going to do is I'm going
to show you the general method
4
00:00:16,010 --> 00:00:18,880
for enlarging a parsimonious model,
5
00:00:18,930 --> 00:00:22,830
or conversely a general method for
producing a model that's more parsimonious
6
00:00:22,840 --> 00:00:25,490
than an exact reproduction of the data,
7
00:00:25,500 --> 00:00:28,360
and that method is called
the method of maximum entropy
8
00:00:28,380 --> 00:00:30,590
or the MaxEnt principle.
9
00:00:30,600 --> 00:00:33,000
The example I am going to talk about
10
00:00:33,010 --> 00:00:36,780
is predicting when you are going
to get a cab in New York City.
11
00:00:36,800 --> 00:00:41,470
So the joke about New York
is that you can never get a cab,
12
00:00:41,480 --> 00:00:44,630
except when you don't need a cab,
and there's just cabs everywhere.
13
00:00:44,630 --> 00:00:47,880
And, of course, there's some queer reasons
why this might be the case,
14
00:00:47,890 --> 00:00:49,580
but if you've ever try to get a cab
15
00:00:49,600 --> 00:00:51,810
in the early morning going south
on Park Avenue,
16
00:00:51,840 --> 00:00:53,840
forget it, you will never get a cab.
17
00:00:53,920 --> 00:00:56,330
These are some New York City cabs here.
18
00:00:57,620 --> 00:01:00,310
Let's say you are scientist about this,
19
00:01:00,310 --> 00:01:03,000
and so you decide to gather data,
20
00:01:03,490 --> 00:01:06,890
and you're gathering data, you say,
I need a cab, I go out in the street,
21
00:01:06,890 --> 00:01:09,530
How long do I have to wait ...
22
00:01:12,150 --> 00:01:17,070
for me to finally get a cab
that I can get in,
23
00:01:17,090 --> 00:01:20,300
a cab that's free and on duty?
24
00:01:20,320 --> 00:01:24,140
Let's say I kept records
on that for a while,
25
00:01:24,160 --> 00:01:26,870
and so here are some data I've gathered,
26
00:01:26,900 --> 00:01:31,220
and this is the time it took me
to get a cab in minutes.
27
00:01:31,240 --> 00:01:33,720
So, one time it took me
6 minutes to get a cab,
28
00:01:33,770 --> 00:01:36,670
then it took me 3 minutes, 4 minutes,
29
00:01:36,690 --> 00:01:39,170
another time it took me 6 minutes again,
and so forth.
30
00:01:39,420 --> 00:01:43,550
So this here is a set of observations,
31
00:01:43,580 --> 00:01:48,650
about a really basic empirical question:
How long does it take to get a cab?
32
00:01:48,730 --> 00:01:55,750
And then the question is: What should
I believe about the waiting time
33
00:01:55,790 --> 00:01:58,260
for a New York City cab?
34
00:01:58,270 --> 00:02:00,730
So, you're actually pretty good
at this already.
35
00:02:00,750 --> 00:02:04,250
You know, for example,
that one way we can do it
36
00:02:04,260 --> 00:02:06,660
is to take this data here,
37
00:02:06,670 --> 00:02:09,240
I have 10 data points
on how long it takes to get a cab,
38
00:02:09,240 --> 00:02:13,360
and so the the probability of me
waiting 6 minutes to get a cab
39
00:02:13,410 --> 00:02:18,330
looks like it's about ... well, there's
one, two, three times out of ten
40
00:02:18,360 --> 00:02:21,860
that I saw a cab after six minutes,
41
00:02:21,860 --> 00:02:28,180
so that means it's about a 30% chance
that I'm going to have to wait 6 minutes.
42
00:02:29,820 --> 00:02:32,850
And, for example, the chance
that I'll have to wait 2 minutes
43
00:02:32,860 --> 00:02:34,650
looks like it's 20%.
44
00:02:35,180 --> 00:02:37,960
So, you can see right away
there's a huge problem here,
45
00:02:37,970 --> 00:02:43,710
because, for example, it turns out that
if I follow this naive model directly,
46
00:02:43,730 --> 00:02:46,330
it says the chance of me getting
a cab in 1 minute is zero.
47
00:02:46,340 --> 00:02:48,430
There's this zero chance of me
getting a cab in a minute.
48
00:02:48,460 --> 00:02:50,000
And not only that, it says, for example,
49
00:02:50,020 --> 00:02:55,180
the chance of me waiting for 7 minutes
to take a cab is also zero.
50
00:02:56,820 --> 00:02:58,190
This seems puzzling guys,
51
00:02:58,200 --> 00:03:00,900
it feels like what we're
what we call overfitting the data.
52
00:03:00,920 --> 00:03:06,130
We're describing the data in such a way
that we're putting too much structure in.
53
00:03:06,130 --> 00:03:10,400
The fact that I never waited
more than 6 minutes to take a cab,
54
00:03:10,410 --> 00:03:14,310
but I actually waited three times,
I waited 6 minutes three times,
55
00:03:14,320 --> 00:03:16,300
that seems to be an accident of the data.
56
00:03:16,330 --> 00:03:18,330
We don't want to put that into our model.
57
00:03:18,360 --> 00:03:21,600
So, instead of doing the naive method,
58
00:03:21,610 --> 00:03:28,070
what I'm going to do, and this is the core
of the maximum entropy method,
59
00:03:28,100 --> 00:03:31,260
is produce a probability distribution
that has two things.
60
00:03:31,300 --> 00:03:39,510
One my P_{MaxEnt} that I'm going
to try to produce,
61
00:03:39,530 --> 00:03:43,950
First of all P_{MaxEnt},
we'll call it P_{ME},
62
00:03:44,000 --> 00:03:54,870
satisfies a limited number of constraints,
63
00:03:54,920 --> 00:03:59,210
and I'll tell you what a constraint
is explicitly in a moment.
64
00:03:59,290 --> 00:04:03,780
And number two, the distribution that
satisfies those constraints
65
00:04:03,780 --> 00:04:16,620
has the maximum entropy
of all the distributions
66
00:04:20,070 --> 00:04:21,620
that satisfy those constraints.
67
00:04:21,640 --> 00:04:27,510
So what we'll find is that there are
potentially many probability distributions
68
00:04:27,530 --> 00:04:29,170
that satisfy the constraints,
69
00:04:29,200 --> 00:04:32,500
and we're going to pick the one,
and it turns out it's the unique one,
70
00:04:32,970 --> 00:04:35,960
that has the maximum entropy
71
00:04:35,990 --> 00:04:39,460
out of all those distributions
that satisfy those constraints.
72
00:04:39,480 --> 00:04:43,640
So the constraints will always be
in the form of expectation values.
73
00:04:43,640 --> 00:04:46,900
There will always be constraints
on the average
74
00:04:46,900 --> 00:04:49,720
of some quantity you measure on the data.
75
00:04:49,750 --> 00:04:52,090
So, for example, ...
76
00:04:57,060 --> 00:05:00,050
we could have a constraint
on the expectation value
77
00:05:00,110 --> 00:05:03,470
of the average waiting time.
78
00:05:03,530 --> 00:05:05,320
So we write this like this.
79
00:05:05,320 --> 00:05:09,280
These angle brackets
mean the expectation value of x,
80
00:05:09,310 --> 00:05:15,530
and so the way we do that is we integrate
the probability of waiting x time
81
00:05:15,560 --> 00:05:20,840
times x, dx, and integrate from 0 to ∞ .
82
00:05:24,280 --> 00:05:27,580
And if we're willing to just discretize
and talk about minutes,
83
00:05:27,610 --> 00:05:29,010
we round to the nearest minute,
84
00:05:29,030 --> 00:05:31,830
we can also write it like this.
85
00:05:38,880 --> 00:05:42,120
Where here instead of integrating
over a continuum of times
86
00:05:42,120 --> 00:05:45,260
from 0 to 0.01 and so forth minutes,
87
00:05:45,280 --> 00:05:47,710
here we just some 0 minutes, 1 minute,
2 minutes, 3 minutes.
88
00:05:47,730 --> 00:05:51,790
So 0 minutes, the cab is right there,
you open the door, it's a magical day.
89
00:05:51,850 --> 00:05:57,840
So this is an expectation value
on the average waiting time.
90
00:05:57,860 --> 00:05:59,260
And just to give you an example,
91
00:05:59,260 --> 00:06:03,080
here's another expectation value
you might you might measure.
92
00:06:03,100 --> 00:06:06,180
This is the average
of the square of the waiting time,
93
00:06:06,200 --> 00:06:09,770
and, of course, the way you do that,
94
00:06:09,830 --> 00:06:15,830
is you integrate x² dx, weighted
by the probability of that particular x,
95
00:06:15,850 --> 00:06:20,190
and in general the expectation value
of a function f(x)
96
00:06:24,560 --> 00:06:29,060
is weighting f(x)
by the probability of each x.
97
00:06:29,070 --> 00:06:32,010
So this notation here
should be something that,
98
00:06:32,010 --> 00:06:34,080
if you're not familiar
or not comfortable with it,
99
00:06:34,100 --> 00:06:36,290
you should take some time
and just figure out
100
00:06:36,300 --> 00:06:40,060
why this is the correct way to talk
about the average value of x.
101
00:06:41,380 --> 00:06:45,980
And if you like, this one here
might be more familiar to you
102
00:06:45,990 --> 00:06:49,120
if integrals are still a little scary,
which they shouldn't be.
103
00:06:49,180 --> 00:06:52,960
What we're going to do
in this particular application,
104
00:06:52,990 --> 00:06:54,490
the maximum entropy principle,
105
00:06:54,520 --> 00:07:03,190
is one, P_{ME} (x) will be constrained
106
00:07:06,230 --> 00:07:10,980
so that the average value of x,
the average waiting time,
107
00:07:11,000 --> 00:07:13,340
under the distribution P_{ME},
108
00:07:13,380 --> 00:07:15,600
is equal to that in the data.
109
00:07:15,640 --> 00:07:18,170
And, in fact, if you count here
110
00:07:18,170 --> 00:07:20,330
and measure the average
waiting time in the data,
111
00:07:20,360 --> 00:07:24,010
you discover,
and I'm quite happy about this,
112
00:07:24,040 --> 00:07:26,760
the average waiting time.
113
00:07:26,780 --> 00:07:29,610
In this dataset it's 4 minutes,
and so what we're going to say
114
00:07:29,640 --> 00:07:36,940
is give me probability distributions
whose average waiting time is 4 minutes.
115
00:07:37,000 --> 00:07:40,800
So that's step one,
this is the constraint step.
116
00:07:46,420 --> 00:07:49,050
And you can see right away
that there are many distributions
117
00:07:49,060 --> 00:07:51,280
that have an average waiting time
of 4 minutes.
118
00:07:51,310 --> 00:07:52,750
Here's one.
119
00:07:58,330 --> 00:08:04,710
The probability of waiting x minutes is 0,
except when x = 4.
120
00:08:04,710 --> 00:08:08,970
Just to be technical, this is a definition
that would only work in the discrete case.
121
00:08:09,000 --> 00:08:12,970
We'd have to use delta functions,
I will spare you the delta functions.
122
00:08:13,430 --> 00:08:16,090
Here's another example.
123
00:08:16,100 --> 00:08:28,360
P(x) = 0.5 if x =3, 0.5 if x =5,
and 0 otherwise.
124
00:08:31,260 --> 00:08:36,640
These are all potential models
of catching a cab in New York City
125
00:08:36,700 --> 00:08:40,289
that satisfy the constraint
that their average is four minutes.
126
00:08:40,360 --> 00:08:44,350
So, someone could say, "Hey, I know,
here's a good model of your data.
127
00:08:44,400 --> 00:08:47,980
Providing data is like cabs
either take 3 minute or 5 minutes
128
00:08:48,010 --> 00:08:50,860
and no other time whatsoever."
129
00:08:50,890 --> 00:08:53,510
And, of course, you can think
about mixing these two together.
130
00:08:53,530 --> 00:08:57,610
So, for example, you might mix this
and this so you'd have a distribution,
131
00:08:57,630 --> 00:09:00,780
and I'll just draw it graphically here.
132
00:09:07,190 --> 00:09:11,720
Where there's some spread over waiting
times between 3, 4 and 5 minutes.
133
00:09:11,860 --> 00:09:16,460
And, of course, this original
distribution here also satisfies it.
134
00:09:16,630 --> 00:09:20,410
By definition, if we have a distribution
that's non-zero only at these points,
135
00:09:20,430 --> 00:09:23,430
and is weighted by the number of times
we see it in the data,
136
00:09:23,510 --> 00:09:27,490
the expectation over that
will also be 4 minutes, by definition.
137
00:09:27,750 --> 00:09:32,630
So we have a plethora of candidate models.
138
00:09:32,690 --> 00:09:38,490
We've a plethora of models that satisfy
this one particular constraint.
139
00:09:39,040 --> 00:09:41,250
Pick the one ...
140
00:09:42,880 --> 00:09:47,850
that maximizes the entropy.
141
00:09:51,130 --> 00:09:54,700
So you should have remembered
the definition of entropy,
142
00:09:54,730 --> 00:09:59,740
if not this the perfect time
to pause the video, and review.
143
00:10:00,810 --> 00:10:06,050
But what we want is the distribution
whose entropy is maximized.
144
00:10:06,060 --> 00:10:07,830
Another way to say it is
we want the distribution
145
00:10:07,830 --> 00:10:12,710
that leaves us maximally uncertain about
how long the cab will take to arrive,
146
00:10:12,730 --> 00:10:15,940
except for the fact, of course,
that one thing is constrained.
147
00:10:16,330 --> 00:10:18,220
The one thing that we've constrained
148
00:10:18,220 --> 00:10:21,090
is that the cab takes four minutes
on average.
149
00:10:21,480 --> 00:10:23,820
But otherwise I want to be
maximally uncertain,
150
00:10:23,840 --> 00:10:26,820
I don't want to have, you know the way
we would say this philosophically,
151
00:10:26,820 --> 00:10:31,620
I don't have any prejudices
about what New York City cabs do,
152
00:10:32,650 --> 00:10:36,980
I want to be maximally uncertain
about their behavior
153
00:10:36,980 --> 00:10:39,510
subject to this one constraint.
154
00:10:39,540 --> 00:10:42,690
And you can see, for example,
here, intuitively,
155
00:10:42,690 --> 00:10:46,450
the idea that cabs always take 4 minutes
156
00:10:46,490 --> 00:10:49,960
is satisfying this average criteria
157
00:10:49,990 --> 00:10:52,900
but it's putting a huge amount
of additional structure in.
158
00:10:52,900 --> 00:10:54,520
It's saying for some reason
159
00:10:54,560 --> 00:10:58,240
all waiting times
except for 4 minutes are forbidden.
160
00:10:58,310 --> 00:11:01,970
And so intuitively somehow that seems like
it would require an extra justification.
161
00:11:02,020 --> 00:11:04,230
But we're trying to be minimally biased,
162
00:11:04,240 --> 00:11:07,410
we're trying to have
a maximally possible range,
163
00:11:07,410 --> 00:11:10,470
a maximal possible range
over all configurations of the system,
164
00:11:10,480 --> 00:11:14,420
subject to a constraint
on the average behavior that we observe.
165
00:11:16,440 --> 00:11:19,080
This one here is slightly better
because it allows for a wider range,
166
00:11:19,090 --> 00:11:22,160
and in fact the mixture of these
is even better still.
167
00:11:23,140 --> 00:11:25,920
And what we would like to do
is produce a distribution
168
00:11:25,950 --> 00:11:28,050
where you have to ask,
169
00:11:28,080 --> 00:11:31,080
– and so this is one way
to interpret entropy –
170
00:11:31,100 --> 00:11:34,300
you have to ask on average
the maximal number of questions
171
00:11:34,320 --> 00:11:37,610
in order to decide
how long the cab actually took.
172
00:11:37,640 --> 00:11:43,400
So, this step here allows you to select
from all of these models,
173
00:11:43,410 --> 00:11:45,880
from the plethora of popular models
that satisfy the constraints,
174
00:11:45,900 --> 00:11:47,960
one particular model.