In the previous unit, what I showed you
was a way to rescue MaxEnt models to
describe the language abundance P(n)
by reference to a hidden variable, epsilon,
which we referred to as "programmer time"
and what we assumed was that the system
was constrained in two ways:
it was not only constrained to have
a certain average number of projects per
language, where we fixed the average popularity
of languages, but we also fixed the average
programmer time devoted to projects
in a particular language. And so this
distribution, here, looks like this
in functional form. And when we integrate
out this variable, epsilon, we get
something that looks like this.
So we get a different prediction for
the language distribution. And we get
a prediction that, I argue, looks
reasonably good.
It certainly looks better than the
exponential distribution.
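For reference, here is a sketch of the two functional forms being pointed to. This is an assumption about what is on the slide, reconstructed from the two constraints just described (average n, and average programmer time taken as n times epsilon). The symbols lambda_1, lambda_2 and Z are labels introduced here for the two Lagrange multipliers and the normalization.

```latex
% Assumed two-constraint MaxEnt form (not confirmed by the transcript)
P(n, \epsilon) = \frac{1}{Z} \exp\left(-\lambda_1 n - \lambda_2 n \epsilon\right)

% Integrating out epsilon over [0, \infty)
P(n) = \int_0^\infty P(n, \epsilon)\, d\epsilon
     = \frac{1}{Z \lambda_2} \, \frac{e^{-\lambda_1 n}}{n}
```

Under that assumption, the extra factor of 1/n in front of the exponential is what separates this prediction from the plain exponential distribution.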
I feel honor-bound to tell you
about the controversy
that happens when we try to make
these models... and in particular
there is a very different mechanistic
model that looks quite similar.
So this, the MaxEnt result, is the Fisher log series.
And the argument behind the Fisher log
series, to explain this distribution,
involves the idea of a "hidden" additional
constraint.
In the open source question, what I've
done is describe that additional
constraint as "programmer time", just
because it seems like it might be a
constraint in the system. OK?
That the average programmer time
for languages is fixed, not just the
average number of projects.
And so that means that languages
can vary in their popularity, but also
in their efficiency. In the ecological
modelling, the analogue of languages
is species, and this is the
abundance of species.
So that means the number of particular
instances of the species in the wild.
And also metabolic rate: how much energy
a particular species consumes.
And so in that case, the system is
constrained to a certain average
species abundance, and a certain average
species energy consumption.
Here languages are constrained to a certain
abundance, and a certain consumption of
programmer energy. That's the analogy.
So, we can also build a mechanistic model
of programming language popularity.
And previously, when we studied the taxi-cab
problem, what we did was,
when we produced the mechanistic model,
we were able to find a very simple one
that had the same predictions as the
MaxEnt model did.
Here, by contrast, what we're going to find
is that the mechanistic model
is going to produce similar behavior, but
the functional form will actually be
slightly distinct. So, here's the mechanistic
model... we imagine that
languages all start out with a baseline
popularity. Whoever invents the language,
for example, has to write
at least one project. There is at least one
programmer at the beginning
of a language's invention, who knows how
to program in that language, somewhat
by definition. And so there are two ways
that that popularity can grow.
It can grow, for example, linearly.
So, on day 1 there's 1 programmer.
And on day 2, that one programmer is
joined by another programmer.
And on day 3, those two programmers
are joined by a third.
And so, over time, what you have is
a growth rate that's linear...
in time.
But, perhaps a more plausible model
for how languages accrue popularity
is multiplicative.
At time 1, there's 1 programmer, and he has
some efficiency of converting other
programmers to his cause. So maybe he's
able to double the number of programmers.
And he's able to double the number of
programmers, because his language
is particularly good, and perhaps
people who like to program in that language
happen to be particularly persuasive.
And so on the second day, those two
programmers each themselves
go out and convert one more person,
because they are the same as the
original programmer in their effectiveness
and the language itself is just as
convincing as it was before.
So each of those two programmers goes out
and gathers one more, and we go to 4.
And by a similar argument, we go to 8,
and so this would be the
exponential growth model...
where the number of programmers
as a function of time increases
multiplicatively, as opposed to additively.
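To make the contrast concrete, here is a minimal sketch of the two growth rules, assuming we start with one programmer on day 1. The function names are made up for this illustration.

```python
def linear_growth(days):
    """Programmer counts on days 1..days when one programmer joins per day."""
    return [day for day in range(1, days + 1)]


def doubling_growth(days, factor=2):
    """Programmer counts on days 1..days when the count is multiplied by `factor` daily."""
    counts = [1]  # day 1: the inventor of the language
    for _ in range(days - 1):
        counts.append(counts[-1] * factor)
    return counts


print(linear_growth(4))    # [1, 2, 3, 4]
print(doubling_growth(4))  # [1, 2, 4, 8]
```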
So, let's make this model a little more
realistic, and in particular, let's allow
the multiplicative factor,
which in this case we set to 2, we're
going to allow this multiplicative factor
to vary. And in fact, we're going to draw
this multiplicative factor, alpha, from some
distribution. And in fact, it doesn't really
matter what that distribution is
as long as alpha is always greater than 0,
so it's not possible for all programmers
to suddenly disappear,
and as long as it's bounded above,
so it's impossible for a language to become
infinitely popular after a finite number
of steps. So we're going to draw...
each day we're going to draw a number,
alpha, from this distribution, here.
So, after one day, there are alpha(1)
programmers. (This is the
draw on the first day.) On the second day
there are alpha(2) times alpha(1)
programmers, and so on:
alpha(3) times alpha(2) times alpha(1).
So this is now growth that occurs through
a random multiplicative process.
It's similar to growth that would happen
through a random additive process
except now, instead of adding a random
number of programmers each day
you multiply the total number of
programmers each day, by some factor alpha
drawn from this distribution.
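Here is a minimal sketch of that random multiplicative process. The uniform distribution on [0.5, 2.0] is purely an illustrative assumption; the argument only needs alpha to be positive and bounded. The function name is made up for this example.

```python
import random


def simulate_language(days, rng, low=0.5, high=2.0):
    """Return the popularity trajectory of one language.

    Each day the current popularity is multiplied by a factor alpha drawn
    uniformly from [low, high]. The uniform choice and its bounds are
    illustrative assumptions, not something fixed by the lecture.
    """
    popularity = 1.0  # baseline: the inventor of the language
    trajectory = [popularity]
    for _ in range(days):
        alpha = rng.uniform(low, high)  # today's multiplicative kick
        popularity *= alpha
        trajectory.append(popularity)
    return trajectory


rng = random.Random(0)  # fixed seed so the run is reproducible
print(simulate_language(5, rng))
```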
So, you can always convert this multiplicative
process into an additive process,
by a very simple
trick of taking the logarithm.
Over time, if we count programmer numbers
we're multiplying, but if we're working in
log space, we're just adding.
Each day we're adding a random number, log(alpha),
to the running total. As long as alpha is always
strictly greater than zero, these logarithms will
always be well defined.
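A quick numerical check of the logarithm trick: the log of a product of positive factors equals the sum of their logs. The uniform range for alpha is again just an illustrative assumption.

```python
import math
import random

rng = random.Random(42)
# Illustrative assumption: alpha is uniform on [0.5, 2.0], so log(alpha) is always defined.
alphas = [rng.uniform(0.5, 2.0) for _ in range(30)]

product = math.prod(alphas)                     # multiplicative view
sum_of_logs = sum(math.log(a) for a in alphas)  # additive view in log space

# The two numbers agree up to floating-point rounding.
print(math.log(product), sum_of_logs)
```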
Now, all of a sudden, it looks like the
additive model in log space.
And what we know from the central limit
theorem, is that if you add together
lots of random numbers, that distribution
tends towards a Gaussian distribution
with some particular mean (mu) and some
particular variance (sigma squared).
Let's not worry about what mu and sigma
are in particular, but rather note that
because growth happens additively in log space,
the distribution of these sums over long
time scales will end up looking like a
Gaussian distribution.
The average boost per day to a language
looks in log space like a Gaussian
distribution. What that means
is that the exponential growth model
with random kicks, random multiplicative
kicks, actually looks like
a Gaussian in log space, or what we call
a log-normal
in actual number space.
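You can see the central limit theorem at work with a small simulation. This is a sketch under the same illustrative assumption as before (alpha uniform on [0.5, 2.0]). It grows many independent languages and checks one signature of a Gaussian in log space: roughly 68% of the values fall within one standard deviation of the mean.

```python
import math
import random
import statistics


def final_log_popularity(days, rng, low=0.5, high=2.0):
    """Sum of log(alpha) over `days` daily draws, i.e. the log of the final popularity."""
    return sum(math.log(rng.uniform(low, high)) for _ in range(days))


rng = random.Random(0)
logs = [final_log_popularity(200, rng) for _ in range(2000)]

mu = statistics.fmean(logs)
sigma = statistics.stdev(logs)
within_one_sigma = sum(abs(x - mu) <= sigma for x in logs) / len(logs)

# For a Gaussian, about 68% of the mass lies within one standard deviation.
print(mu, sigma, within_one_sigma)
```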
So, instead of looking at the logarithm of
the popularity of the language,
just look at the total popularity of
the language, and what that means is
that it looks like the exponential of
minus (log(n) minus some mean) squared
over two sigma squared... and then you
just have to be careful to normalize
things properly here.
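Written out with the normalization in place, the standard log-normal density for a continuous, positive n is the following. The factor of 1/n comes from changing variables from log(n) back to n.

```latex
P(n) = \frac{1}{n \, \sigma \sqrt{2\pi}}
       \exp\left(-\frac{(\log n - \mu)^2}{2\sigma^2}\right),
       \qquad n > 0
```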
So, this is the log-normal distribution.
Take a mechanistic model where language
growth happens multiplicatively,
where a language gains new adherents
in proportion to the number of adherents
it already has, where a language gains
new projects in proportion to the number
of projects it already has,
dependent upon the environment.
That's where the multiplicative randomness
comes from. Alpha is a random number;
it's not a constant. It's not 2. It's not that
the language always necessarily doubles.
But the fact that it grows through
a multiplicative random process
as opposed to an additive process
means that you end up with a log-normal distribution.
And so now you can say, "OK, let's imagine
that languages grow through this
log-normal process.
And let's find the best fit parameters
for mu and sigma." And if you do that, you find
that the mechanistic log-normal model looks
pretty good as well.
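One simple way to get best-fit values, sketched here, is maximum likelihood: for a log-normal, mu and sigma are just the mean and standard deviation of the logged data. The data below are synthetic stand-ins, not the language counts from the lecture, and the sketch treats popularity as a continuous quantity even though real project counts are integers.

```python
import math
import random
import statistics


def fit_lognormal(counts):
    """Maximum-likelihood mu and sigma for a log-normal: mean and std of log(counts)."""
    logs = [math.log(c) for c in counts]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs, mu)  # MLE uses the population (1/N) form
    return mu, sigma


# Synthetic stand-in data, NOT the real language counts from the lecture.
rng = random.Random(1)
counts = [rng.lognormvariate(3.0, 1.5) for _ in range(5000)]
print(fit_lognormal(counts))
```

With enough samples the recovered values should sit close to the 3.0 and 1.5 used to generate the data.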
We were impressed by how well
the blue line fit this distribution
compared to the red exponential model,
the MaxEnt model constraining only N.
That was the red model. Here the blue
model does well... this is the Fisher log
series.
Unfortunately, a mechanistic model... and
I've given you a short account
of the mechanistic model, here,
where what's happening is you're adding
together lots of small multiplicative
random kicks...
the mechanistic model also works.
I will tell you that this fits better.
Both of these models have two parameters.
If you do a statistical analysis, the
Fisher log series actually fits better.
In particular, it's able to explain these
really high-popularity languages better.
These deviations here seem larger
than the deviations here, but you have
to remember that this is on a log scale.
So this gets much closer up here
than this does here.
So the mechanistic model, at least
visually, looks like it's extremely
competitive with the Fisher log
series model
derived from a MaxEnt argument.
Statistically speaking, if you look at
these two, this one
is actually slightly dispreferred.
But like many people, what you want is some
ironclad evidence for one versus the other.
And I think the best way to look for
that kind of evidence is to figure out
what, if anything, this epsilon really is
in the real world.
If we were able to build a solid theory
about what epsilon was, and how we could
measure it in the data, then we could see if
this here, this joint distribution, was
well reproduced.
We could look for evidence, for example,
that these two co-vary:
that here we have a term that boosts
the popularity of a language if it becomes
more efficient. So, if this goes down,
this can get higher, and the language
can still have the same probability
of being found with those properties.
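Here is a small numerical sketch of that trade-off. It assumes the joint distribution has the form exp(-lam1*n - lam2*n*eps), which is an assumption about the slide rather than something the transcript states. The multiplier values and function names are made up for illustration.

```python
import math


def log_joint_unnormalized(n, eps, lam1, lam2):
    """log of exp(-lam1*n - lam2*n*eps); assumed functional form, normalization omitted."""
    return -lam1 * n - lam2 * n * eps


def n_with_same_probability(n, eps, new_eps, lam1, lam2):
    """Popularity that keeps the joint probability fixed when eps changes to new_eps."""
    return n * (lam1 + lam2 * eps) / (lam1 + lam2 * new_eps)


lam1, lam2 = 0.01, 0.05  # made-up multipliers, for illustration only
n, eps = 100.0, 2.0
new_eps = 1.0            # the language becomes more efficient
new_n = n_with_same_probability(n, eps, new_eps, lam1, lam2)

# new_n is larger than n, yet both points have the same log-probability.
print(new_n)
print(log_joint_unnormalized(n, eps, lam1, lam2),
      log_joint_unnormalized(new_n, new_eps, lam1, lam2))
```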
And of course, the problem is that we don't
know how to measure this, sort of,
mysterious programmer time...
programmer efficiency.
The ecologists have a much better time
with this.
Because the ecologists, they know what
their epsilon is.
They know that their epsilon is metabolic
energy intake.
So this is "how much a particular instance
of this species consumes in energy
over the course of a day, or
over the course of its lifetime."
And they're able to measure that, and
in fact, they're able to measure this
joint distribution. When we come to study
the open source ecosystem, so far
we don't really have a way to measure this,
and so we're unable to measure the joint distribution.
And so now we're left with one model that's
mechanistic, right, a popularity accrual model,
and over here, this model that talks about
there being two constraints on the system:
average number and average
programmer time.