# Eigengoogle. How the Google PageRank Algorithm Works

While we’re on the subject of sorting things online, we might as well talk about Google: the 93-billion dollar company whose main export is taking all the things ever and putting them in the right order. If there’s one thing Google knows best, it’s sorting stuff.

I was curious how it all works, and it turned out really interesting, plus I got to learn a bit about Markov chains. It all starts with an algorithm called PageRank^{1}. According to Wikipedia,

> PageRank uses a model of a random surfer who gets bored after several clicks and switches to a random page. It can be understood as a Markov chain in which the states are pages, and the transitions are the links between pages. When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection (the random surfer chooses another page at random). The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix.

In this post I’ll try to break that down and provide some of the background necessary to understand Google PageRank.

## Graphs as Matrices

A graph is a collection of nodes joined by edges. If the edges are arrows that flow in one direction, we call that a *directed graph*. A graph whose edges have each been assigned a “weight” (usually some real number) is a *weighted graph*.

A graph of `n` nodes can be represented in the form of an `n x n` *adjacency matrix* $A$, such that $A_{ij}$ is equal to the weight of the edge going from node $j$ to node $i$:

$$A = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 2 & 0 \\ 2 & 1 & 0 & 1 \\ 0 & 0 & 4 & 0 \end{bmatrix}$$
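As a quick sketch (my own, not from the original post), the same matrix in numpy, using the column-to-row convention above:

```python
import numpy as np

# Adjacency matrix for the 4-node weighted digraph above:
# A[i][j] = weight of the edge from node j to node i.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 2, 0],
    [2, 1, 0, 1],
    [0, 0, 4, 0],
])

# All edges out of node 2 live in the third column:
print(A[:, 2])  # -> [0 2 0 4]
```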

## Stochastic Matrices

The term “stochastic” is used to describe systems whose state can only be described in probabilistic terms (i.e: the likelihood of some event happening at any given time).

**Scenario:** Consider two competing websites. Every month, the first website loses 30% of its audience to the second website, while the second website loses 60% of its audience to the first.

If the two websites start out with 50% of the global audience each, how many users will each website have after a month? After a year?

This scenario can be represented as the following system:

$$P = \begin{bmatrix} 0.7 & 0.6 \\ 0.3 & 0.4 \end{bmatrix}, \quad x_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$$

This is a *Markov chain* with *transition matrix* $P$ and *state vector* $x_0$.

The transition matrix is called a *stochastic matrix*; it represents the likelihood that some individual in a system will transition from one state to another. The columns of a stochastic matrix are always non-negative numbers that add up to 1 (i.e. the probability of *at least one* of the events occurring is always 1: the likelihood of a user either staying on the same website, or leaving, is always 100%. They must choose one of the two).

The state after the first month is

$$x_1 = Px_0 = \begin{bmatrix} 0.7 \cdot 0.5 + 0.6 \cdot 0.5 \\ 0.3 \cdot 0.5 + 0.4 \cdot 0.5 \end{bmatrix} = \begin{bmatrix} 0.65 \\ 0.35 \end{bmatrix}$$

So, after the first month, the second website will have only 35% of the global audience.
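We can check this multiplication with a couple of lines of numpy (a sketch; the original post uses Sage):

```python
import numpy as np

P = np.array([[0.7, 0.6],
              [0.3, 0.4]])   # columns sum to 1 (column-stochastic)
x0 = np.array([0.5, 0.5])

x1 = P @ x0
print(x1)  # -> [0.65 0.35]
```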

To get the state of the system after two months, we simply apply the transition matrix again, and so on. That is, the current state of a Markov chain depends only on its previous state. Thus, the state vector at month $n+1$ can be defined recursively:

$$x_{n+1} = Px_n$$

From which, through substitution, we can derive the following equation:

$$x_n = P^n x_0$$

Using this information, we can figure out the state of the system after a year, and then again after two years (using the Sage mathematical library for Python):

```python
P = Matrix([[0.70, 0.60],
            [0.30, 0.40]])
x = vector([0.5, 0.5])
P^12 * x  # -> (0.666666666666500, 0.333333333333500)
P^24 * x  # -> (0.666666666666666, 0.333333333333333)
```

So it seems like the state vector is “settling” around those values. It would appear that, as $n \to \infty$, $x_n$ is converging to some vector $q$ such that $Pq = q$. As we’ll see below, this is indeed the case.

We’ll call this the *steady state vector*.

## Eigenvectors!

Recall from linear algebra that an eigenvector of a matrix $A$ is a vector $x$ such that:

$$Ax = \lambda x$$

for some scalar $\lambda$ (the *eigenvalue*). A *leading eigenvalue* is an eigenvalue whose absolute value is greater than that of any other eigenvalue of the given matrix.

One method of finding the leading eigenvector of a matrix is through a *power iteration* sequence, defined recursively like so:

$$x_{n+1} = Ax_n$$

Again, by noting that $x_{n+1} = Ax_n = A(Ax_{n-1}) = A^2 x_{n-1}$, and substituting again and again, it follows that:

$$x_n = A^n x_0$$

This sequence (suitably normalized at each step, if necessary) converges to the leading eigenvector of $A$.

Thus we see that the steady state vector is just an eigenvector with the special case $\lambda = 1$.
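Here is a small power-iteration sketch in plain Python/numpy (my own addition, not from the post) that recovers the steady state of the two-website matrix. No normalization is needed because a stochastic matrix maps probability vectors to probability vectors:

```python
import numpy as np

def power_iterate(P, x0, tol=1e-12, max_iter=1000):
    """Repeatedly apply P until the state vector stops changing."""
    x = x0
    for _ in range(max_iter):
        x_next = P @ x
        if np.linalg.norm(x_next - x, 1) < tol:
            return x_next
        x = x_next
    return x

P = np.array([[0.7, 0.6],
              [0.3, 0.4]])
q = power_iterate(P, np.array([0.5, 0.5]))
print(q)          # -> approximately [0.6667 0.3333]
print(P @ q - q)  # -> approximately [0 0]  (q is a fixed point: Pq = q)
```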

## Stochastic Matrices that Don’t Play Nice

Before we can finally get to Google PageRank, we need to make a few more observations.

First, it should be noted that power iteration has its limitations: not all stochastic matrices converge. Take this matrix as an example:

```python
P = Matrix([[0, 1, 0],
            [1, 0, 0],
            [0, 0, 1]])
x = vector([0.2, 0.3, 0.5])
P * x    # -> (0.3, 0.2, 0.5)
P^2 * x  # -> (0.2, 0.3, 0.5)
P^3 * x  # -> (0.3, 0.2, 0.5)
```

The state vectors of this matrix will oscillate in such a way forever. This matrix can be thought of as the transformation matrix for a reflection about the line $x = y$ (in the first two coordinates), and this system will never converge. Indeed, it has no leading eigenvalue: $|\lambda_1| = |\lambda_2| = 1$.

Another way of looking at $P$ is by drawing its graph:

Using our example of competing websites, this matrix describes a system such that, every month, *all* of the first website’s users leave and join the second website, only to abandon the second website again a month later and return to the first, and so on, forever.

It would be absurd to hope for this system to converge to a steady state.

States 1 and 2 are examples of *recurrent states*. These are states that, once reached, there is a probability of 1 (absolute certainty) that the Markov chain will return to them infinitely many times.

A *transient state* is one that, once reached, has a probability of less than 1 of ever being reached again. (If the probability *is* 0, we call such a state *ephemeral*; in terms of Google PageRank, this would be a page that no other page links to):

There are two conditions a transition matrix must meet if we want to ensure that it converges to a steady state:

1. It must be *irreducible*: an irreducible transition matrix is one whose graph has no proper closed subsets. (A closed subset is one such that no state within it can reach a state outside of it. States 1, 2 and 3 above are closed off from 4 and 5.)

2. It must be *primitive*: a primitive matrix $P$ is such that, for some positive integer $k$, every entry of $P^k$ is a positive number.

More generally, it must be *positive recurrent* and *aperiodic*. Positive recurrence means that it takes, on average, a finite number of steps to return to any given state. Periodicity means the number of steps it takes to return to a particular state is always divisible by some natural number greater than 1 (its period). Since we’re dealing with finite Markov chains, irreducibility implies positive recurrence, and primitiveness ensures aperiodicity.
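As a quick sanity check (my own sketch, not from the post), primitivity can be tested numerically by raising the matrix to a sufficiently high power and checking for all-positive entries:

```python
import numpy as np

def is_primitive(P, k=None):
    """Check whether some power of P has all strictly positive entries.

    For an n x n non-negative matrix, Wielandt's bound says it is enough
    to check the single power k = n^2 - 2n + 2.
    """
    n = P.shape[0]
    k = k or n * n - 2 * n + 2
    return bool(np.all(np.linalg.matrix_power(P, k) > 0))

# The oscillating matrix from above is NOT primitive:
P_bad = np.array([[0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1]])
print(is_primitive(P_bad))   # -> False

# The two-website matrix IS primitive (all entries already positive):
P_good = np.array([[0.7, 0.6],
                   [0.3, 0.4]])
print(is_primitive(P_good))  # -> True
```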

## Google PageRank

We are now finally ready to understand how the PageRank algorithm works. Recall

from Wikipedia:

> The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages, and the transitions, which are all equally probable, are the links between pages.

So, for example, if we wanted to represent our graph above, we would start with the following adjacency matrix:

$$A = \begin{bmatrix} 0 & 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 0.5 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.5 & 0 \end{bmatrix}$$

For the algorithm to work, we must transform this original matrix in such a way that we end up with an irreducible, primitive matrix. First,

> If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again. When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection.

$$S = \begin{bmatrix} 0 & 0 & 0.5 & 0 & 0.2 \\ 0 & 0 & 0.5 & 0.5 & 0.2 \\ 1 & 1 & 0 & 0 & 0.2 \\ 0 & 0 & 0 & 0 & 0.2 \\ 0 & 0 & 0 & 0.5 & 0.2 \end{bmatrix}$$

We are now ready to produce $$G$$, the Google Matrix, which is both irreducible and primitive. Its steady state vector gives us the final PageRank score for each page.

## The Google Matrix

The Google Matrix for an $n \times n$ matrix $S$ is derived from the equation

$$G = \alpha S + (1 - \alpha)\frac{1}{n}J_n$$

where $J_n$ is an $n \times n$ matrix whose entries are all 1, and $0 \le \alpha \le 1$ is referred to as the *damping factor*.

If $\alpha = 1$, then $G = S$. Meanwhile, if $\alpha = 0$, all of the entries in $G$ are the same, $\frac{1}{n}$ (hence, the original structure of the network is “dampened” by $\alpha$, until we lose it altogether).

So the matrix $\frac{1}{n}J_n$ is a matrix that represents a “flat” network in which all pages link to all pages, and the user is equally likely to click any given link (with likelihood $\frac{1}{n}$), while $S$ is dampened by a factor of $\alpha$.

Google uses a damping factor of 0.85. For more on this, I found this paper. tl;dr: the second eigenvalue of a Google matrix is $\alpha$, and the rate of convergence of the power iteration is given by $\alpha$. So higher values of $\alpha$ imply better accuracy but worse performance.

With some moving stuff around, we can see that

$$G_{ij} = \alpha S_{ij} + \frac{1 - \alpha}{n} > 0$$

for all $i, j$ up to $n$, which means that $G$ is indeed stochastic, irreducible, and primitive. Cool.
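Putting it all together, here is a small end-to-end sketch (my own, in numpy rather than Sage) that patches the sink column of the example adjacency matrix to get $S$, forms the Google matrix with $\alpha = 0.85$, and power-iterates to the PageRank vector:

```python
import numpy as np

def pagerank(A, alpha=0.85, tol=1e-12):
    """PageRank scores from a column-oriented adjacency matrix A."""
    n = A.shape[0]
    # Patch sink columns (pages with no outbound links): assume they
    # link out to every page in the collection.
    S = A.astype(float).copy()
    sinks = S.sum(axis=0) == 0
    S[:, sinks] = 1.0 / n
    # Google matrix: damp S toward the uniform "flat" network.
    G = alpha * S + (1 - alpha) / n * np.ones((n, n))
    # Power iteration, starting from the uniform distribution.
    x = np.full(n, 1.0 / n)
    while True:
        x_next = G @ x
        if np.linalg.norm(x_next - x, 1) < tol:
            return x_next
        x = x_next

A = np.array([
    [0, 0, 0.5, 0,   0],
    [0, 0, 0.5, 0.5, 0],
    [1, 1, 0,   0,   0],
    [0, 0, 0,   0,   0],
    [0, 0, 0,   0.5, 0],
])
scores = pagerank(A)
print(scores.round(3))        # one PageRank score per page
print(round(scores.sum(), 6)) # -> 1.0 (the scores form a distribution)
```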

In conclusion: the PageRank scores of a collection of pages are the entries of the steady state vector of its Google Matrix $G$, which power iteration is guaranteed to find.

1. Actually, it all started with the HITS algorithm, which PageRank is based on. More details here.

# Sorting Posts by User Engagement Level

At Functional Imperative we’re building the new *CanLII Connects* website (a social portal for Canada’s largest database of legal cases), and this week I was given the task of figuring out a sensible way of sorting posts.

Figuring out how to sort user-generated content is a common problem that many social websites face.

Here’s Reddit’s scoring equation for ‘Best’ (source and explanation):

Not all scoring equations are that hairy, here are a few more.

Interestingly enough, Reddit’s ‘Hot’ scoring function (explained in the link above) is quite flawed.

Sidenote: One observation not mentioned in that first article is that, while all other equations use some form of `time_now - time_posted` to calculate how old a post is, the clever guys at Reddit use `time_posted - some_old_date`. The advantage of this is that the post’s score need only be calculated once, whereas the value of scores calculated with `time_now` will change on every request.

Anyway, while all those scoring functions work pretty well, they didn’t quite fit the requirements for *CanLII Connects*.

In this post, I’ll walk through the decision process of creating a scoring function. Hopefully this will be useful if you encounter a similar feature to implement.

## Requirements:

*CanLII Connects* links to a database of legal cases, and users can post opinions on those cases:

- A user can post.
- A user can upvote a post.
- A user can comment on a post.
- A user can comment on a comment on a post.

So what’s a sensible way of sorting posts?

Right away, we’re dealing with a different problem than Reddit or HN: while it makes sense to slowly degrade the score of a post on those sites over time, the same does not make sense for CanLII. Old cases might be cited at any time, no matter how old they are, so what matters is not how old a discussion is, but rather how actively engaged users are within a given discussion.

## Initial Score

Ok, so first we have to give each post an initial score. I like Reddit’s approach of taking the base-10 log of its upvotes. This makes sense because, the more popular a post already is, the more likely people are to see it, and therefore upvote it, which gives it an unfair advantage.

In our case, we’re not only trying to measure how much people “like” a post, but rather how engaged they are with it. It makes sense that, while an upvote is a fine indicator of “engagedness”, a user actually bothering to comment on a post is even more of an indicator. I’ll count that as equivalent to two upvotes, and a user commenting on a comment will count as three upvotes (the 2 is so we don’t take the log of 1 or 0):

$$\text{points} = \log_{10}(\text{upvotes} + 2 \cdot \text{comments} + 3 \cdot \text{subcomments} + 2)$$
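As a sketch in Python (reconstructing the formula from the description above; the exact weights are my reading of it):

```python
import math

def initial_score(upvotes, comments, subcomments):
    """Initial 'engagement' score: a comment counts as two upvotes,
    a comment-on-a-comment as three; the +2 avoids log10(0) and log10(1)."""
    return math.log10(upvotes + 2 * comments + 3 * subcomments + 2)

print(initial_score(0, 0, 0))   # a brand-new post: log10(2) ~= 0.301
print(initial_score(10, 3, 1))  # log10(21) ~= 1.322
```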

## Frequency

Next, we need the post’s position to degrade as it becomes less active. It makes sense to divide the initial score by some factor of time:

$$\text{score} = \frac{\text{points}}{f(t)}$$

Now we need a reasonable value for $t$. A good start is the average time, in seconds, between the three most recent user interactions with a post.

We define a user interaction to be: a user creates a post, a user comments on a post, or a user upvotes a post.

Also, we want the most recent interactions to weigh more than older interactions. So let’s say each `t` weighs twice as much as the previous:

$$t_{\text{ave}} = \frac{\sum_{n=1}^{3} 0.5^{\,n-1}\,(t_{n-1} - t_n)}{\sum_{n=1}^{3} 0.5^{\,n-1}}$$

Where:

- $t_0$ = UNIX timestamp, at now, in seconds.
- $t_n$ = UNIX timestamp of the $n^{th}$ most recent interaction.
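In Python, that weighted average might look like this (a sketch; timestamps are sorted newest-first, with “now” prepended as $t_0$):

```python
import time

def weighted_avg_gap(timestamps, now=None):
    """Weighted average gap between interactions, in seconds.
    Each gap weighs half as much as the more recent one before it."""
    if now is None:
        now = int(time.time())
    ts = [now] + sorted(timestamps, reverse=True)  # newest first
    num = sum(0.5 ** (i - 1) * (ts[i - 1] - ts[i]) for i in range(1, len(ts)))
    denom = sum(0.5 ** (i - 1) for i in range(1, len(ts)))
    return num / denom

# Three interactions: 1h, 2h and 5h ago.
now = 1_700_000_000
print(weighted_avg_gap([now - 3600, now - 7200, now - 18000], now=now))
# -> 4628.571... (gaps 3600, 3600, 10800 weighted 1, 0.5, 0.25)
```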

## One Final Detail

There is one last property we want this function to have, which is the following: if interactions are very frequent right now (within a timeframe of, say, 10 days), then clearly the post is “hot”, and its score should be boosted. But as time passes, it really doesn’t matter as much how much distance there is between interactions. If a post has already gone a full year without anyone commenting on it, does it really make that much difference if it goes another month without a comment?

To accomplish the first property, all we do is divide $t_{\text{ave}}$ by the number of seconds in 10 days: `60*60*24*10`.

To accomplish the second property, what we are looking for is some sort of always-increasing, concave function (positive derivative, negative second derivative). The first thing that comes to mind is the square-root function, which is good enough.

## Result

And thus we have our final scoring function:

$$\text{score} = \frac{\text{points}}{\sqrt{t_{\text{ave}} / (60 \cdot 60 \cdot 24 \cdot 10)}}$$

If we plot this equation for `x = number of points` and `y = time`, we can see the shape of the function and check whether different values make sense:

As expected, there is a steep 10-day “boost” period, followed by an increasingly slower decline in the value as more and more time passes.

The function is also heavily biased toward very new posts, which will always come out on top, giving them a chance. This might be a bad idea if posting becomes frequent, but user interaction is low (many summaries a day, few votes or comments), and might have to be changed.

There are many ways to tweak this equation (changing the boost period, for example) to make it more or less biased towards either time or user interaction.

## Bonus Round: Implementing in ElasticSearch

Implementing a custom scoring function in Elasticsearch, though easy once it’s all set up, was rather frustrating because of the poor documentation.

For our implementation, we’re using the `tire` gem (a wrapper around the Elasticsearch API). This is where we call the custom scoring script:

```ruby
query do
  # custom_score script: "parseInt(doc['upvote_count'].value)", lang: "javascript" do
  custom_score script: script, lang: 'javascript' do
    string query.join(" OR ")
  end
end
```

Where `script` is simply a variable holding the contents of a javascript file as a string. Note the option `lang: 'javascript'`. This lets us use javascript as our language of choice, as opposed to mvel, the most poorly documented scripting language on the face of the earth. To enable this option, we’ll also require the elasticsearch-lang-javascript plugin.

Here is our script:

Sidenote: Notice the logger function. This enables us to implement a sort of “console.log” which we can read using the following shell command: `tail -f /var/log/elasticsearch/elasticsearch.log`.

```javascript
// Logger function:
var logger = org.elasticsearch.common.logging.Loggers.getLogger("rails_logger");
// Example usage:
logger.info("========= NEW CALC ===========");

var points_log = parseFloat(doc.points_log.value);
var now = Math.round(new Date().getTime() / 1000);

/**
 * NOTE: doc.ts.values is not actually an array,
 * here I create an array out of it:
 **/
var ts = [];
for (var i = 0; i < doc.ts.values.length; i++) ts[i] = doc.ts.values[i];
ts.push(now);

// Newest first
ts.reverse();

/**
 * Boost period.
 **/
var ten_days = 60*60*24*10;

/**
 * The scoring function
 **/
function score() {
  /**
   * Weighed average numerator
   **/
  var times_num = (function() {
    var val = 0;
    for (var i = 1; i < ts.length; i++) {
      var exp = i - 1;
      // ts is newest-first, so subtract the older timestamp (ts[i])
      // from the newer one (ts[i - 1]) to get a positive gap:
      val += Math.pow(0.5, exp) * (parseFloat(ts[i - 1]) - parseFloat(ts[i]));
    }
    return val;
  })();

  /**
   * Weighed average denominator
   **/
  var times_denom = (function() {
    var val = 0;
    for (var i = 1; i < ts.length; i++) {
      var exp = i - 1;
      val += Math.pow(0.5, exp);
    }
    return val;
  })();

  var t_ave = times_num / times_denom;

  return points_log / Math.sqrt(t_ave / ten_days);
}

score();
```

# Amdahl’s Law

As multicore computing becomes the norm (even my phone is dual core!), it’s important to understand the benefits and also the limitations of concurrency. Amdahl’s Law addresses the latter.

Let’s imagine a simple program. It prints “Hello World” 100 times, then quits.

Our first version of the program is written as a single sequential task: it prints one “Hello World”, then another, then another, 100 times, then quits. This program takes some unit of time, $T$, to execute.

Now say we have a dual-core machine at hand. (My phone, perhaps).

Cool, now we can spawn *two* tasks that print “Hello World” 50 times each. And, because our magical imaginary computer experiences no overhead, it takes us exactly $T/2$ units of time to run our second program.

So we keep adding more and more processors, until we have 100 concurrent threads printing one “Hello World” each, and our program runs 100 times faster.

At this point we stop: “Ah, the trend is clear: more processors equals more speed! No point in continuing this experiment.”

**A naive (wrong) first guess:** Given $n$ processors executing a program, the maximum boost in speed is $n$. (That is, we can get our program to run $n$ times faster.)

Cool! This means that, given enough processors, we could make *any* program run almost instantly. Right?

Of course this is not the case! Enough daydreaming. Let’s figure out a more… *Continue reading*
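The rest of the post derives the actual bound; as a teaser, here is the standard form of Amdahl’s law in a few lines of Python (the formula is my addition, not stated in this excerpt):

```python
def amdahl_speedup(p, n):
    """Amdahl's law: overall speedup with n processors when a fraction p
    of the program (0 <= p <= 1) can be parallelized."""
    return 1.0 / ((1.0 - p) + p / n)

# A program that is 95% parallelizable never exceeds a 20x speedup,
# no matter how many processors we throw at it:
for n in (2, 10, 100, 10_000):
    print(n, round(amdahl_speedup(0.95, n), 2))
# The asymptotic limit is 1 / (1 - 0.95) = 20.
```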

# Managing Multiple Computers with One bashrc/zshrc

Here is a simple way to share the same `.bashrc` / `.zshrc` / `.bash_profile` across multiple computers, while still retaining unique settings on each computer.

Suppose you want some special setting to apply only to your laptop.

First, create an empty file called `.setup_00`:

```shell
$ touch ~/.setup_00
```

Next, in your `rc` file, add the following `if` statement. Anything inside that `if` statement will only apply to your laptop:

```shell
# Check in $HOME, since that's where we created the file:
if [ -f "$HOME/.setup_00" ]; then
  echo "This message only shows on my laptop!"
fi
```

You can use this method to run any shell script uniquely on computers that contain the `.setup_00` file.

That’s it. It’s not fancy, but it works.

# Rice’s Theorem

Rice’s theorem can be stated thus:

Every non-trivial semantic property of a program is undecidable.

Before we prove the theorem, let’s break down that statement: *Continue reading*

# Girl on a Train Asks About the Book in my Bag

“What’s that book you’re reading?” she asks.

“*Ernest Wattle:* the story of a man who falls in love with the statue of an ancient goddess.”

“Sounds strange.”

“Depends how you read it. It’s also the story of Almendra, an ancient goddess who creates a man just so he would love her. But then, through prayer, he brings her statue to life, but she falls in love with someone else.”

A quaint-looking pig farm, like something out of a picture book, whizzes across the window.

“Everybody is inventing everybody.”

# Little Room

A set of boots halt just behind Little Room door, mark a pensive hollow in the casted light on Little Room floor, peak an inch or two, bringing with them an ounce of sunshine more, and Little Room croaks.

The door shuts behind them; the sun is silenced again.

Little Room is home to feathered insects; bonsai orchards spring out of cracks on the floor, on which they perch and feed. A girl, sixteen maybe, lost her virginity there, by one of those cracks, to a crowd of cackling drunkards, tickled by the feathered pixies, and she thought it was a magical place, and she could never go back.

The Boots demand a double-shot of Jim from nobody in particular. Another girl, nineteen, ragged as a doll, but a doll, still, with hair frayed beautifully like a fractal tuft of burnt shrubbery, offers her body to The Boots, in her room upstairs, not for his money, just his attention. He takes the drink from her hand and tosses a coin and immediately forgets she exists.

# A Bird by the Window

acu riousblue li ttlewing

i sawta pingon mywindow

like asif onstrings sheelin gered

glo wingsof tlyin thesun

sowith myfin geri tappedtoo

what afun andglad somejoy

theli ttleblue birdsang asong

tobuoy myheart andbring mecheer

not thenlong shestar tedflying

there andhere andca lingmee

pla yingcir clesin thesky

andsaid sheedlike toplay withme:

oh penup andcome outside!

oh penup and chaseme!

idlike tojoin itlooks likefun

but alas ican notopen:

oh penup youwill enjoyit!

and thebreeze isso inviting!

and thesun doesshine sobright-ly

and thetrees arein fullbloom

and thefruits andbugs and juicy warms

and the earth smells fresh like after a storm

and freedom

and freedom is a

freedom is the

freedom is to

# Bursts

The first time he fell in love it was with a girl who always smelled like fresh laundry, and her name was Anna and it was love at first sight, or not quite so, because it wasn’t really all at once that he fell in love with her, but in little bursts rather, like the steps up a ladder, or grains of sand piling up at the bottom end of an hourglass, the first of which was him falling madly in love with her back, which was all he could see from where he sat, and all he wanted to see, entranced as he was at that moment by its smooth shape and tender flesh and by the contours of her spine and ribs, which traced sensuous grooves along the pink fabric of her tanktop whenever she took a deep breath, or spoke, or hunched over slightly in order to write something in her notebook, at which point long locks of black hair–black like tree branches after a forest fire–would fall and unravel, revealing for him, like curtains being fluttered by a morning breeze, which let slip a first ray of light into the room and wake the dreamer within from his slumber– yes, exactly like that, they would fall and unravel and reveal to him her neck, her perfect neck, he thought, which was the next thing he fell in love with, followed by her dainty hands, and then the freckles on her face, which she hated, which he loved, adored, and, this is true, that very same winter he even fell in love with a certain bubbly noise her nose was making because she was trying to be as quiet as possible when sniffling every so often because she had had a cold and a runny nose, so he offered her a tissue, but really it was just so he could lean close enough to smell her, and to look into her eyes, which, he thought, gazing, small and dark and rounded, were like an infinitely dark abyss that gazed also into him, which he knew, of course, because the only reason he had read that damn book was because he had seen it poking out of her bag that one time.