In the previous section I walked you
through a whole set of mathematical
derivations designed to produce a maximum
entropy model of
essentially a toy problem:
how can we model
the distribution of arrival times of
New York City taxicabs based on a small
set of data? That is a problem that
may well be of interest
to you personally,
but it is certainly not of
profound scientific interest.
In this next section, what I am
going to do is present to you a reasonably
interesting scientific problem and show
how the maximum entropy approach can
illuminate some interesting features
of that system.
The particular system we have in mind is
what people have tended to call the open
source ecosystem: the large
community of people devoted to writing
code in such a way that it is open,
accessible, and debuggable,
and, in general,
produced not by an individual,
and not by a corporation with
copyright protections and control over the
source, but rather by a community of
people sharing and editing
each other's code.
The open source ecosystem
accounts for a large fraction
of the computer code running for us today,
including not only Mac OS X but, of course,
Linux. It is a great success story, and we
would like to study it scientifically.
I used the word ecosystem advisedly,
in part because a lot of what I am going
to tell you now is a set of tools and
derivations that I learned from John Harte,
and from people who have worked with John
Harte, on the maximum entropy approach
not to social systems but to biological
systems, and in particular to ecosystems.
John Harte's book "Maximum Entropy
and Ecology" - I recommend it to you as a
source of a lot more information on the
kinds of tools that I am going to show
you now. My goal here is really to show
you that even simple arguments based on
maximum entropy can provide some really
deep scientific insight.
What I am going to
take as my source of data, because I am
going to study the empirical world now,
is drawn from SourceForge. SourceForge
is no longer the most popular repository
of open source software - perhaps GitHub
has now eclipsed it - but over a long
period, roughly from 1999 up to 2011,
when we gathered this data, it built up an
enormous archive of projects that range
from computer games to
text editors to business and mathematical
software; some of the code that I have used
in my own research is up on
SourceForge.
It is a great place to study, in
particular, the use of computer languages.
Here, what I have plotted is the
distribution of languages used in the open
source community and found on SourceForge.
On the x-axis is the log of the
number of projects written in a particular
language. You can see that at log zero -
that is, one project - there are about
twelve languages in the database that have
only of order one project each. These languages are
extremely rare, in other words, in the
open source movement. Conversely,
at the other end of this logarithmic
scale, at four - ten to the four,
that is ten thousand - we see a
small number of extremely popular
languages. These are the most
common languages you will find on
SourceForge;
they have a runaway popularity.
And if you know anything about computer
programming, it will not surprise you to
learn that these are mostly languages
derived from C, such as C itself, C++, and Java.
Somewhere in the middle, between those
extremely rare birds and these incredibly
common ones - which are sort
of like the bacteria of the open source
movement - you have
a larger number of moderately popular
languages. So this distribution
of languages is what we are going to try to
explain using maximum entropy methods:
a small number of rare languages,
a larger number of moderately popular ones,
and then again a very small number of
wildly popular languages.
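To make that picture concrete, here is a minimal sketch of how one might bucket languages by order of magnitude of project count. The counts below are made up for illustration; they are not the actual SourceForge figures.

```python
import math
from collections import Counter

# Hypothetical project counts per language (illustrative only,
# not the real SourceForge data).
projects = {
    "Java": 20000, "C++": 15000, "C": 12000, "PHP": 9000,
    "Python": 8000, "Ruby": 900, "Lua": 300, "Forth": 40,
    "Eiffel": 12, "Turing": 2, "REXX": 1, "ABC": 1,
}

# Bucket languages by order of magnitude: floor(log10(n)) = 0 means
# "of order one project", 4 means "of order ten thousand projects".
buckets = Counter(int(math.floor(math.log10(n))) for n in projects.values())
for decade in sorted(buckets):
    print(f"10^{decade} projects: {buckets[decade]} language(s)")
```

The histogram in the lecture is exactly this kind of bucketing, carried out over the full archive of roughly a hundred languages.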
So I plotted that as a probability
distribution, P(n), where n
is the number of projects in the open
source community that use your language,
and P(n) is the probability that your
language has n projects in the open source
community. What we would like to do is
build a maximum entropy model of this
distribution. Here, I have represented the same
data in a slightly different way - this is
how people tend to represent it - as a
rank abundance distribution,
what ecologists call a species abundance
distribution. So the top-ranked language,
rank one, here, is the language with the
largest number of projects, and it will not
surprise you to learn that it
turns out to be Java: there are about
twenty thousand projects written in Java.
You can see the second-ranked language, C++,
then C, then PHP, and these far rarer
languages down here have much lower ranks -
higher numbers mean lower ranks, as in
third place, fourth place, hundredth place -
so down here in hundredth place is the
very unfortunate language called Turing,
which, as far as I am aware, has
only two projects written in it
in the archive. And you can see some of my
favorite languages, like Ruby, are somewhere
here in this kind of moderate popularity
zone.
So this is the
same data represented again; what I have plotted here is
log abundance on this axis, and here is the
language rank, but on a linear scale, as opposed
to before, where I showed you the log.
That was a log-log plot; this is now a
log-linear plot. So this is the
actual data.
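The rank abundance representation is simple to construct: sort the languages by abundance, most popular first, and plot log abundance against linear rank. A minimal sketch, again with made-up counts rather than the real archive data:

```python
import math

# Hypothetical counts (illustrative; the real archive has on the
# order of a hundred languages).
counts = {"Java": 20000, "C++": 15000, "C": 12000,
          "PHP": 9000, "Ruby": 900, "Turing": 2}

# Rank abundance: sort languages by abundance, most popular first,
# so rank 1 is the language with the largest number of projects.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
for rank, (lang, n) in enumerate(ranked, start=1):
    # Log abundance on the y-axis, linear rank on the x-axis.
    print(f"rank {rank}: {lang:7s} log10(n) = {math.log10(n):.2f}")
```

Ecologists plot species abundance distributions in exactly this way, which is why the tools from ecology carry over so directly.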
So, the first thing we will try
is a
maximum entropy distribution for language
abundance - in other words, for the
probability of finding a language with n
projects - and what we are going to do is
constrain only one thing:
the average popularity of a language.
This is going to be our one constraint for
the maximum entropy problem, and we are
going to pick the probability distribution
P(n) that maximizes the entropy:
the negative sum, from zero to infinity, of
p log p. We are going to maximize this
quantity subject to this constraint and,
of course, always subject to the
normalization constraint, that the sum of P(n) is
equal to unity. So that is my other
constraint. And of course, we know how to
do this problem already; we know the
functional form. It is exactly the same
problem as the one you learned to do when
you modeled the waiting time for a
New York City taxicab. You modeled
that problem exactly this way: there I
only constrained the average
waiting time; here we are only going to
constrain the average popularity of a
language, where popularity means the number of
projects written in that language in
the archive. And so we know what the
functional form will look like: it looks
like e to the negative
lambda n, all over Z, and then all we
have to do is fit lambda and Z so
that we reproduce the correct average abundance
that we see in the data. So this is the
maximum entropy distribution; it is also,
of course, an exponential model - it has an
exponential form - and if we actually find the
lambda and Z that best reproduce the data,
in other words, that best satisfy this
constraint and are otherwise
maximum entropy,
here is what we find.
This red band here shows the one- and two-sigma
contours for the rank abundance
distribution, and all I want you to see on
this graph is that it is an incredibly
bad fit.
The maximum entropy distribution
with this constraint
does not reproduce the data;
it is not capable of explaining,
or modeling, the data in any reasonable
way. It radically underpredicts
the extremely popular languages -
it is unable, in other words, to reproduce
the fact that there are languages like
C and Python that are extremely popular -
it overpredicts this kind of mesoscopic
regime, the moderately
popular languages, and it also
overpredicts those really rare
birds, those really low-rank languages
with very few examples in the archive.
So this is, in short, science fail.
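You can see why the fit must fail from the closed form alone. For the max-ent distribution p(n) = e^(-lambda n)/Z over n = 0, 1, 2, ... - a geometric distribution - the mean constraint fixes lambda and Z exactly, and the resulting exponential tail dies off far too fast to permit a twenty-thousand-project language. A sketch, using an illustrative mean abundance rather than the measured SourceForge value:

```python
import math

def maxent_geometric(mean_n):
    """Max-ent distribution over n = 0, 1, 2, ... with fixed mean <n>:
    p(n) = exp(-lambda * n) / Z, a geometric distribution.
    From <n> = e^-lambda / (1 - e^-lambda):
    lambda = log((1 + <n>) / <n>) and Z = 1/(1 - e^-lambda) = 1 + <n>."""
    lam = math.log((1.0 + mean_n) / mean_n)
    Z = 1.0 + mean_n
    return lam, Z

# Illustrative mean abundance (not the measured value from the archive).
lam, Z = maxent_geometric(250.0)
p = lambda n: math.exp(-lam * n) / Z

# The exponential tail makes a 20,000-project language astronomically
# unlikely when the mean abundance is a few hundred.
print(p(0), p(250), p(20000))
```

That astronomically small probability at n = 20,000 is the underprediction of the wildly popular languages that the red band makes visible on the rank abundance plot.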