We've boiled down our maximum entropy prescription
into two steps. The first is that we want the
probability distribution to satisfy this constraint
on the average value of the waiting time.
And the second thing is we want
that particular probability distribution to
have the maximum entropy - to maximize the
function p log p, and I always forget the negative
sign. So in fact, we're maximizing the function
negative sum over all states of the system
p log p. And remember, states of the system
here are how long you've waited for a particular
cab. Or rather, how long you've waited from
a particular time you decided to start waiting.
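As a concrete sketch of those two ingredients - the entropy we maximize and the mean-waiting-time constraint - here is what they might look like in Python. The one-minute discretization and the 50-minute cutoff are my assumptions for illustration, not something fixed by the lecture.

```python
import numpy as np

# Waiting times discretized in minutes and truncated at 50
# (both the grid and the cutoff are assumptions for illustration).
t = np.arange(50)

def entropy(p):
    """The quantity we maximize: negative sum over states of p log p,
    with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def mean_wait(p):
    """The constrained quantity: the average waiting time."""
    return np.sum(t * p)

# For example, a uniform distribution over the 50 bins:
p_uniform = np.full(len(t), 1.0 / len(t))
```

For the uniform distribution, entropy(p_uniform) comes out to log(50) and mean_wait(p_uniform) to 24.5 minutes - so a uniform guess badly violates a 4-minute constraint, which is exactly why we need constrained maximization.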
So, this turns out to be a hard problem, or
at least a non-trivial problem. If you have
a little calculus, you're really good at maximizing
functions, so let's start there. Let's imagine
in particular...and I'll use this very simple example
of maximizing a function over a 2-dimensional
space. Let's call these the x1 and x2 axes.
So I have some function. I'm just
going to draw the contours for you.
So here we go. And what I'm going to do is
make it such that there's a single maximum over
this space. And what we'll talk about in one of
the appendices, if we have time, is why we can
prove that the entropy function has a unique
maximum even when you subject it to these
constraints. For now, you can take it on faith
that, in fact, this problem has a unique solution.
So, here's a function. I've given it a unique
maximum, and so we'll call this function f.
So, using your amazing calculus skills, you know
that the maximum of this function is defined as
the point where the derivative of f with
respect to x is 0. And remember this is a vector
here so what this means is df/dx1 is 0 and
df/dx2 is 0. Now, yes, you might have
accidentally hit a minimum, so just check to
make sure it's actually a maximum. That's
what you would do. So now the problem is that
we're not allowed to range over this entire space.
We're restricted to some subspace. We're restricted
in particular to some constraint here. So how
can we find the maximum of the function, not
the global maximum, but the maximum that
is also satisfying a set of constraints and what I'll
do is I'll draw those constraints as a line
in this space. A point here is a valid argument
for the function f, but it doesn't satisfy this
constraint here and what I'll do is
define this constraint in the following way.
I'll say the constraint is that g(x) = c, where c
is some number. And just to be clear, for us,
it's better to write g(p). We set that equal to
4 minutes. That's just to remind you our particular
constraint is that the function g is equal to 4.
This is the general case here. So what we want
to do now is not find the maximum point, the top
of the mountain. We want to find the point
that is the maximum along this line - this
line defined by g(x) = c. So let me give you
a little intuition about how you might do this.
Imagine you're on a train going through this
mountainous landscape, and as you're going
through, down here, you'll see you're crossing
contours of the function f. In this case, you're
going uphill - the function is increasing - so you
know a point there is not a maximum of the function
along this line because if you wait a little bit
longer, you'll get to this point here and you've
already crossed the contour line. So here,
you're going up. Notice here, you're going down
the other side of the mountain - you're crossing
contour lines in the other way, so you know,
in fact, that the maximum can't be over here
because you were already higher over here.
So, somewhere between here and here is the
maximum - somewhere in the middle you reach the
peak. You reach the peak when the contours of the
function f are parallel to the track that you are taking -
where there is a tangent point between the contour
and the direction of your motion on this fictitious
train. So, we know how to get the perpendiculars to the
contours of the function f - those are, indeed, just the
gradient of the function...this is a vector, remember.
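That fact - that the gradient of a function is perpendicular to its contours - is easy to check numerically on a toy function of my own choosing (not one from the lecture): for f(x1, x2) = -(x1 - 1)^2 - (x2 - 2)^2, the contours are circles centered on (1, 2).

```python
import numpy as np

# Toy function (my example): f(x1, x2) = -(x1 - 1)^2 - (x2 - 2)^2.
# Its contours are circles centered on the point (1, 2).
x = np.array([3.0, 2.0])                            # a point on the contour f = -4

grad = np.array([-2 * (x[0] - 1), -2 * (x[1] - 2)])  # analytic gradient of f at x
tangent = np.array([-(x[1] - 2), x[0] - 1])          # direction along the circle at x

# The gradient is perpendicular to the contour: the dot product vanishes.
perp_check = np.dot(grad, tangent)
```

The dot product of the gradient with the contour direction is 0 at every point on the circle, so the gradient is exactly the "perpendicular" direction this argument needs.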
And we are going to say these are equal to the
perpendiculars to the train tracks. So if the
perpendicular to the contour is parallel to the
perpendicular to the train track, that means
that the direction of the contour is parallel
to the direction of the train track. If the two
perpendiculars are parallel, so are the two
original vectors. So then the next question is
how can I get the perpendicular to the train track.
What I want you to do is imagine this is the train
track for g(x) = c, and here is the train track
for g(x) = c' and so forth. So here is another set
of contours defined by the function g and we
want the perpendiculars to these contours to
be parallel to the contours for f - the
perpendiculars for the contours of f. So what
that means is that this gradient here - these arrows
here, and in particular, these arrows right here -
are equal to some real number lambda times
the gradient of the constraint. When this equation
is satisfied, that means that these contours here
are precisely parallel to these contours here.
So in order to maximize a function f subject
to a set of constraints, don't solve the unconstrained
problem, where the gradient of f is 0. Instead, solve the
problem where the gradient of f equals lambda times the
gradient of g, together with the constraint g(x) = c.
And now you notice you have this mysterious
value lambda. This is called the Lagrange multiplier.
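Here is a small symbolic sketch of both problems, using a toy f and constraint g of my own (the lecture's f is just a drawing on the board): first the unconstrained condition df/dx1 = df/dx2 = 0, then the Lagrange condition that the gradient of f equals lambda times the gradient of g, together with g = c.

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)

# Toy example (an assumption, not from the lecture):
# maximize f = -(x1 - 1)^2 - (x2 - 2)^2 subject to g = x1 + x2 = 1.
f = -(x1 - 1)**2 - (x2 - 2)**2
g = x1 + x2

# Unconstrained problem: set the gradient of f to zero.
top = sp.solve([sp.diff(f, x1), sp.diff(f, x2)], [x1, x2], dict=True)[0]
# -> the top of the mountain, (x1, x2) = (1, 2)

# Constrained problem: grad f = lam * grad g, plus the constraint g = 1.
eqs = [sp.diff(f, x1) - lam * sp.diff(g, x1),
       sp.diff(f, x2) - lam * sp.diff(g, x2),
       g - 1]
sol = sp.solve(eqs, [x1, x2, lam], dict=True)[0]
# -> the peak along the line, (x1, x2) = (0, 1), with multiplier lam = 2
```

At the constrained solution (0, 1), the gradient of f is (2, 2), which is exactly 2 times the gradient of g, (1, 1): the two perpendiculars are parallel, just as the train-track picture says.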
So what we're going to do is try to find a solution
where the gradients are parallel to each other.
In other words, that one can be transformed into
another by scaling by some constant factor
along all axes. So this is the intuitive motivation
for how to solve the problem of maximization
subject to constraints when you have only
a single constraint. What we do is find the point
at which these two gradients align. There's a wrinkle.
The wrinkle is the following. It looks like we have only
one constraint here and our constraint is just
that this function is equal to 4. But we actually have
two constraints. Our second constraint is overall
normalization and it says that we want this
function p here to normalize to 1. If you
sum the probabilities of all the arrival times, they
have to equal 1. Now, of course, p is a probability
so we know that has to be true. We didn't talk about
this explicitly, but when we start wandering over
function space - when these xs here become ps,
we start manipulating the probabilities - what we
want to do is be able to relax the constraint
that they sum to 1 when we consider maximizing
this function f. We want to be able to range over
the entire space including, for example, points
where all the probabilities - all the ps - are 0.
And then later what we'll do is we'll impose the
normalization constraint. So we actually have
two constraints, and not just one.
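Putting the pieces together for our actual problem, here is a numerical sketch; the discrete one-minute grid and the truncation at 400 minutes are my assumptions. The Lagrange-multiplier condition makes the solution proportional to exp(-lambda t); dividing by the sum imposes the normalization constraint, and we then tune lambda until the mean waiting time equals 4 minutes.

```python
import numpy as np
from scipy.optimize import brentq

t = np.arange(400)   # waiting times in minutes, truncated (an assumption)

def max_ent_dist(lam):
    # Lagrange-multiplier form of the solution: p_t proportional to exp(-lam * t).
    # Dividing by the sum imposes the normalization constraint sum(p) = 1.
    w = np.exp(-lam * t)
    return w / w.sum()

def mean_gap(lam):
    # How far the mean waiting time is from the 4-minute constraint.
    p = max_ent_dist(lam)
    return np.sum(t * p) - 4.0

# Tune lambda until the mean-waiting-time constraint is satisfied.
lam = brentq(mean_gap, 1e-6, 10.0)
p = max_ent_dist(lam)
```

On this grid the solver lands on lam = ln(5/4), giving p_t = (1/5)(4/5)^t - a geometric distribution, the discrete version of the exponential, which is the classic maximum entropy result for a mean-value constraint.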