A second pair stacked on top. Dirty socks. Dirty shirts. Piles of them.
That chair I don’t even like. Too many cups. A clean bedroom, the bed made. Clean dishes, one of those pluggable scented candles. His records. My records. Too much space.
Books
I find there’s something about learning from a physical object (with volume, texture, scent) that makes things “click” better in my head; and I doubt I’m an oddity in this respect — the root of this effect may be more biological than preferential: For example, when recalling some piece of information, often I can remember where — within some textbook I read half a decade ago — the subject is explained. And I do mean, a literal, physical “where”: a page, (a thing,) which at all times can be found tucked somewhere within a physical stack of bound paper, which I can touch and feel; I might even remember where I was sitting (or standing, or lying), as I held that particular book, open at that particular page, when I studied the subject, and what it felt like to the touch, and how the lighting in the room fell on the paper; maybe I jotted some note on the margin, maybe I spilled tea on the pages before, giving that “where” its own unique irregular texture and color — all of which my brain will forever associate with that information. Digital can’t do any of that.
In this sense, there is nothing left to improve about the physical book.
On the other hand, if once I open this book I find myself confused, (maybe because the book was written by a guy named named Spivak, who does not like to explain what happens in between those little “=” symbols,) I can’t press on anything and expect it to expand magically and reveal more detail. It’s at this point that I will turn to my laptop, and search for more information, maybe even ask a stranger half a world away, or watch an interactive video. My copy of “Calculus on Manifolds” can’t do any of that.
So even though ebooks do offer some perks, in many ways the printed book remains superior. I think it’s a safe bet that books aren’t going anywhere. Radio did not replace print, TV did not replace radio. (Interactive did not replace passive; print did not replace the spoken word.) Each of these media is better suited for different tasks.
(I also still find that the best way to solve a problem (even a programming problem), is with pen and paper. Physically jotting down an idea beats pressing buttons while staring at an obnoxious, headacheinducing glowing screen any time (which I try to avoid as much as possible, even as a software developer). To me flipping pages still beats searching and clicking — what a dreadful user experience is a digital textbook!)
Sorting By Relative Popularity
Hey, looks like I’m sorting user content again! last time, I sorted user posts by “interestingness”; this time around I’ll be sorting players from a set of sports teams. Once again we’ll look at why sorting things based on popularity alone is a bad idea, we’ll get a primer on standard deviation, and finally a bit of scripting to put it all together.
The Task
To drive up user engagement at theScore, we introduced an onboarding screen that’s shown when you open the app for the first time.
First, you get a list of sports teams that are popular in your area, and the option to subscribe to some of them.
Now based on the teams you choose, it would be nice to also recommend some players for you to follow. So how would one go about choosing which players to recommend out of all those teams?
Let’s assume we have three sixplayer teams, where each player has the following number of subscribers:
big_team_1: 1. 100 000 2. 110 000 3. 90 000 4. 80 500 5. 140 000 6. 140 500 big_team_2: 1. 120 000 2. 250 000 3. 180 000 4. 135 000 5. 157 000 6. 202 000 small_team: 1. 3 000 2. 100 3. 234 4. 301 5. 250 6. 400
Now let’s consider some properties we want from our algorithm:
 A computeonce, static value: we don’t want to run our algorithm on every user. We want to give each player in our database a static
recommendation_score
; a numeric value that is cheap to index. 
Simple: the algorithm should be simple and use elementary methods. Recommending players is a small part of the app; there should be little code maintenance involved.

Variety: The purpose of the onboarding process is to get you engaged, so we want to recommend a variety of players from different sports and leagues.
The Naive Approach
NOTE: I will be using the Julia programming language for all my examples. You can find the complete implementation at the bottom of this post.
The first obvious solution is to simply sort players by their subscription_count
. The more popular the player, the more recommendable he is. Here is our naive sorting function:
function naive_sort(teams) Base.sort( [[team.players for team in teams]...], by=x > x.subscribers, rev=true )[1:5] end
Which yields:
5element Array{PlayerRecommender.Player,1}: Big Team 2 Player  subscribers: 250000 Big Team 2 Player  subscribers: 202000 Big Team 2 Player  subscribers: 180000 Big Team 2 Player  subscribers: 157000 Big Team 1 Player  subscribers: 140500
The problem with this approach is that we only get results from the most popular teams. So if you’re a fan of both a very popular NFL team and another team that is not as popular (your local basketball team, perhaps), even the least popular player from the NFL team will be recommended to you, whereas the most popular player from your favorite local basketball team will not show up in your list at all!
A Better Approach
What are we really looking for?
Well, I think the players we want to recommend are the not necessarily the ones who are most famous, but rather, the ones who are most popular compared to the rest of their team, regardless of how popular that teams is (we already know you’re a fan of the team or you wouldn’t have subscribed to it in the first place).
In other words, we want an algorithm that answers the following question:
Which players deviate the most in popularity from the rest of their team?
Luckily, there’s a mathematical tool for figuring out just this: standard deviation.
Standard Deviation tl;dr
Standard deviation is a pretty straightforward concept: Take a set of values, and figure out the average value for that set (also known as the arithmetic mean). The standard deviation simply tells us by how much the value of the typical element in our set deviates from the average.
The mathematical representation of this calculation is:
Where is the population size, and is the arithmetic mean, which itself is represented by:
For example, the following two sets have the same average value, but clearly the values are spread out differently:
close_to_average = [11,8,9,12] spread_out = [0,1,19,20] average(close_to_average) == 10 average(spread_out) == 10 standard_deviation(close_to_average) == 1.59 standard_deviation(spread_out) == 9.51
Thus we arrive at our simple scoring function. All we need to do is find players whose deviation from the average is substantially higher than that of their teammates:
function recommend(player, team) player_dev = player.subscribers  team.average if player_dev == 0 player.recommendation_score = 0 else player.recommendation_score = player_dev / team.std_dev end end
Using this scoring function on our original teams, we get the following results:
5element Array{PlayerRecommender.Player,1}: Small Team Player  subscribers: 3000  score: 2.2276261544470644 Big Team 2 Player  subscribers: 250000  score: 1.749555170961297 Big Team 1 Player  subscribers: 140500  score: 1.3134034577576315 Big Team 1 Player  subscribers: 140000  score: 1.291753950212176 Big Team 2 Player  subscribers: 202000  score: 0.6445729577225832
Much better. This time around, our top player actually has the least number of subscribers, but this makes sense, because even though he belongs to a team that is not very popular, his subscription count is tenfold that of his teammates; clearly someone to keep an eye on! (perhaps a rising star in a college league? Certainly wouldn’t want our recommendation script to ignore that one.)
Limitations
This algorithm has many limitations.
 New players won’t be very well represented (they will by nature have low subscription counts).

Allstar teams might result in nobody being particularly recommendable. Though this one might be less of a problem: thanks to our good ol’ friend the bell curve, even among rare anomalies, there are rare anomalies.
While there are ways to address these limitations and improve the accuracy of the algorithm (for example, taking into account the rate of change in subscription_count
), one has to remember the purpose of this feature: to drive up user engagement during onboarding. Is the added complexity of such changes worth the minimal improvement in the recommendations?
Point is, it’s Friday night and I should go out for a beer now. I’m also looking forward to testing out the enormous Chinese fermentation jug I bought yesterday. It looks something like this, but a LOT bigger:
And it was only $30. What a bargain.
Here is the code used in these examples (working as of Julia 0.4.1). Our actual code at theScore is in Ruby.
The Diminishing Returns of Focusing on the Right Thing
There was a recent thread on Hacker News about the search for the “cure for aging”:
The thought has recently occurred to me that we should all be working on the “aging problem”, until it’s solved, and then spending all our extra time on other pursuits. —
b_emery
I’m sure you’ve heard (or expressed) a similar sentiment about other pressing Big Problems.
After all, why should we spend millions in tax dollars funding “useless”, or at the very least, nonvital research, when there are real problems to be solved? Why do we pay pure mathematicians to dream up infinitedimensional spheres and Hilbert spaces, when that money could be better spent training engineers to build nextgeneration sustainable transportation?
This line of thinking may be noble in principle (“Let’s prioritize research that focuses on finding solutions to our most urgent realworld problems, instead of wasting resources on useless/unprofitable endeavours”), but it tends to lead to stagnation, not innovation.
Focusing on big problems is important, but there are diminishing returns to focusing on the right thing, possibly even negative returns. That is, focusing too much on solutions makes us less likely to find them.
While, improvements and advancements come from focused research, paradigmshifting discoveries tend to come from unexpected places; often based on information that is initially discarded as not useful, even by experts.
By taking away funding from “useless” pursuits (like pure maths, arts, theoretical research) and focussing solely on the bottom line (“realworld problems”, what is profitable, what is immediately useful), we make great advancements less likely, not more.
This is why we can’t simply pour infinite money into “Finding the cure for x” and expect results; we can’t pour infinite money on engineers, while defunding basic research, and expect innovation.
Reminds me of the Louie bit where David Lynch instructs Louie to “Be funny: 3… 2… 1… Go”. — Well, that’s not how funny work. Achieving that level of mastery of his craft is 90% nonfunny related activities (life experiences, personal growth, selfreflection etc.), and 10% actually focusing on “being funny” in itself.
What if the key to unlocking “disease x” comes from unrelated research in metabolism, what if the techniques needed to sort through the data come from an applied mathematician investigating economic trends? What if the technology required to model the problem in order to even ask the right question in the first place is developed by computer scientists modelling data for a social network? There’s just no way to know but to explore.
Make Your APIs Trivially Expandable
Stones
I was at the lake last Summer. I picked a mossy old bench to sit, when a group of adolescents gathered to throw stones into the water. For an hour they busied around, making loud gestures and arguing over whose stone made the bigger splash. They would try every angle, and then bicker about which one spattered out the widest, or which one sent the water flying the highest, and how could they agree on what they’d seen if in a moment it was gone? Finally they approached me hoping I would settle their dispute, but I’d not seen the water splash– they scoffed, frustrated, and went on their way, and I was free to continue watching the ripples that now adorned the surface of the lake.
A Simple Implementation of Trait Mixins for Javascript
Eigengoogle. How the Google PageRank Algorithm Works
While we’re on the subject of sorting things online, we might as well talk about Google: the 93billion dollar company whose main export is taking all the things ever and putting them in the right order. If there’s one thing Google knows best, it’s sorting stuff.
I was curious how it all works, and it turned out really interesting, plus I got to learn a bit about Markov chains. It all starts with an algorithm called PageRank^{1}. According to Wikipedia,
Pagerank uses a model of a random surfer who gets bored after several
clicks and switches to a random page. It can be understood as a Markov chain in
which the states are pages, and the transitions are the links between pages.
When calculating PageRank, pages with no outbound links are assumed to link
out to all other pages in the collection (the random surfer chooses another
page at random).The PageRank values are the entries of the dominant eigenvector of the
modified adjacency matrix.
In this post I’ll try to break that down and provide some of the background
necessary to understand Google PageRank.
Graphs as Matrices
A graph is a collection of nodes joined by edges. If the edges are arrows that flow in one direction, we call that a directed graph. A graph whose edges have each been assigned a “weight” (usually some real number) is a weighted graph.
A graph of n
nodes can be represented in the form of an n x n
adjacency matrix,
such that is equal to the weight of the edge going from node to node :
[0, 1, 0, 0] [1, 0, 2, 0] [2, 1, 0, 1] [0, 0, 4, 0]
Stochastic Matrices
The term “stochastic” is used to describe systems whose state can only be described in probabilistic terms (i.e: the likelihood of some event happening at any given time).
Scenario:
Consider two competing websites. Every month, the first website loses 30% of its audience to the second website, while the second website loses 60% of its audience to the first.
If the two websites start out with 50% of the global audience each, how many users will each website have after a month? After a year?
This scenario can be represented as the following system:
P = [0.7, 0.6], x_0 = [0.5, 0.5] [0.3, 0.4]
This is a Markov chain with transition matrix and a state vector .
The transition matrix is called a stochastic matrix; it represents the likelihood that some individual in a system will transition from one state to another. The columns on a stochastic matrix are always nonnegative numbers that add up to 1 (i.e: the probability of at least one of the events occurring is always 1 — the likelihood of a user either staying on the same website, or leaving, is always 100%. He must choose one of the two).
The state after the first month is
So, after the first month, the second website will have only 35% of the global audience.
To get the state of the system after two months, we simply apply the transition matrix again, and so on. That is, the current state of a Markov chain depends only on its previous state. Thus, the state vector at month can be defined recursively:
From which, through substitution, we can derive the following equation:
Using this information, we can figure out the state of the system after a year, and then again after two years (using the Sage mathematical library for python):
P = Matrix([[0.70, 0.60], [0.30, 0.40]]) x = vector([0.5,0.5]) P^12*x # &gt; (0.666666666666500, 0.333333333333500) P^24*x # &gt; (0.666666666666666, 0.333333333333333)
So it seems like the state vector is “settling” around those values. It would appear that, as , is converging to some such that . As we’ll see below, this is indeed the case.
We’ll call this the steady state vector.
Eigenvectors!
Recall from linear algebra that an eigenvector of a matrix is a vector such that:
for some scalar (the eigenvalue). A leading eigenvalue is an eigenvalue such that its absolute value is greater than any other eigenvalue for the given matrix.
One method of finding the leading eigenvector of a matrix is through a power iteration sequence, defined recursively like so:
Again, by noting that we can substitute , and so on, it follows that:
This sequence converges to the leading eigenvector of .
Thus we see that the steady state vector is just an eigenvector with the special case .
Stochastic Matrices that Don’t Play Nice
Before we can finally get to Google PageRank, we need to make a few more observations.
First, it should be noted that power iteration has its limitations: not all stochastic matrices converge. Take as an example:
P = Matrix([ [0, 1, 0], [1, 0, 0], [0, 0, 1]]) x = vector([0.2, 0.3, 0.5]) P * x # &gt; (0.3, 0.2, 0.5) P^2 * x # &gt; (0.2, 0.3, 0.5) P^3 * x # &gt; (0.3, 0.2, 0.5)
The state vectors of this matrix will oscillate in such a way forever. This matrix can be thought of as the transformation matrix for reflection about a line in the x,y axis… this system will never converge (indeed, it has no leading eigenvalue: ).
Another way of looking at is by drawing its graph:
Using our example of competing websites, this matrix describes a system such that, every month, all of the first website’s users leave and join the seconds website, only to abandon the second website again a month later and return to the first, and so on, forever.
It would be absurd to hope for this system to converge to a steady state.
States 1 and 2 are examples of recurrent states. These are states that, once reached, there is a probability of 1 (absolute certainty) that the Markov chain will return to them infinitely many times.
A transient state is such that the probability is that they will never be reached again. (If the probability is 0, we call such a state ephemeral — in terms of Google PageRank, this would be a page that no other page links to):
There are two conditions a transition matrix must meet if we want to ensure that it converges to a steady state:
It must be irreducible: an irreducible transition matrix is a matrix whose graph has no closed subsets. (A closed subset is such that no state within it can reach a state outside of it. 1, 2 and 3 above are closed from 4 and 5.)
It must be primitive: A primitive matrix is such that, for some positive integer , is such that for all (that is: all of its entries are positive numbers).
More generally, it must be positive recurrent and aperiodic.
Positive recurrence means that it takes, on average, a finite number of steps to return to any given state. Periodicity means the number of steps it takes to return to a particular state is always divisible by some natural number (its period).
Since we’re dealing
with finite Markov chains, irreducibility implies positive recurrence, and primitiveness ensures aperiodicity.
Google PageRank
We are now finally ready to understand how the PageRank algorithm works. Recall
from Wikipedia:
The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages, and the transitions, which are all equally probable, are the links between pages.
So, for example, if we wanted to represent our graph above, we would start with the following adjacency matrix:
[0, 0, 0.5, 0, 0], [0, 0, 0.5, 0.5, 0], [1, 1, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0.5, 0]
For the algorithm to work, we must transform this original matrix in such a way that we end up with an irreducible, primitive matrix. First,
If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again.
When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection.
[0, 0, 0.5, 0, 0.2], [0, 0, 0.5, 0.5, 0.2], S = [1, 1, 0, 0, 0.2], [0, 0, 0, 0, 0.2], [0, 0, 0, 0.5, 0.2]
We are now ready to produce $$G$$, the Google Matrix, which is both irreducible and primitive. Its steady state vector gives us the final PageRank score for each page.
The Google Matrix
The Google Matrix for an matrix is derived from the equation
Where is an matrix whose entries are all 1, and
is referred to as the damping factor.
If , then . Meanwhile, if all of the entries in are the same (hence, the original structure of the network is “dampened” by , until we lose it altogether).
So the matrix is a matrix that represents a “flat” network in which all pages link to all pages, and the user is equally likely to click any given link (with likelihood ), while is dampened by a factor of .
Google uses a damping factor of 0.85. For more on this, I
found this paper.tl;dr: the second eigenvalue of a Google matrix is , and the rate of convergence of the power iteration is given by . So higher values of imply better accuracy but worse performance.
With some moving stuff around, we can see that
For all up to , which means that is indeed stochastic, irreducible, and primitive. Cool.
In conclusion,
1. Actually, it all started with the HITS algorithm, which PageRank is based off of. More details here.
Sorting Posts by User Engagement Level
At Functional Imperative we’re building the new CanLII Connects website (a social portal for Canada’s largest database of legal cases), and this week I was given the task of figuring out a sensible way of sorting posts.
Figuring out how to sort usergenerated content is a common problem that many social websites face.
Here’s Reddit’s scoring equation for ‘Best’ (source and explanation):
Not all scoring equations are that hairy, here are a few more.
Interestingly enough, Reddit’s ‘Hot’ scoring function (explained in link above):
is quite flawed.
Sidenote: One observation not mentioned in that first article is that, while all other equations use some form of
time_now  time_posted
to calculate how old a post is, the clever guys at Reddit usetime_posted  some_old_date
.The advantage of this is that the post’s score need only be calculated once, whereas the value of scores calculated with
time_now
will change on every request.
Anyway, while all those scoring functions work pretty well, they didn’t quite fit the requirements for CanLII Connects.
In this post, I’ll walk through the decision process of creating a scoring function. Hopefully this will be useful if you encounter a similar feature to implement.
Requirements:
CanLII Connects links to a database of legal cases, and users can post opinions on those cases:
 A user can post.
 A user can upvote a post.
 A user can comment on a post.
 A user can comment on a comment on a post.
So what’s a sensible way of sorting posts?
Right away, we’re dealing with a different problem than Reddit or HN: while it makes sense to slowly degrade the score of a post on those sites over time, the same does not make sense for CanLII. Old cases might be cited at any time, no matter how old they are, so what matters is not how old a discussion is, but rather how actively engaged users are within a given discussion.
Initial Score
Ok, so first we have to give each post an initial score. I like Reddit’s approach of taking the base10 log of its upvotes. This makes sense because, the more popular a post already is, the more likely people are to see it, and therefore upvote it, which gives it an unfair advantage.
In our case, we’re not only trying to measure how much people “like” a post, but rather how engaged they are with it. It makes sense that, while an upvote is a fine indicator of “engagedness”, a user actually bothering to comment on a post is even more of an indicator. I’ll count that as equivalent to two upvotes, and a user commenting on a comment will count as three upvotes (the 2 is so we don’t take the log of 1 or 0):
Frequency
Next, we need the post’s position to degrade as it becomes less active. It makes sense to divide the intial score by some factor of time:
Now we need a reasonable value for . A good start is the average time, in seconds, between the three most recent user interactions with a post.
We define a user interaction to be: a user creates a post, a user comments on a post, or a user upvotes a post.
Also, we want the most recent interactions to weigh more than older interactions. So let’s say each t
weighs twice as much as the previous:
Where
= UNIX timestamp, at now, in seconds.
= UNIX timestamp of n^{th} interaction.
One Final Detail
There is one last property we want this function to have, which is the following: if interactions are very frequent right now (within a timeframe of, say, 10 days), then clearly the post is “hot”, and its score should be boosted. But as time passes, it really doesn’t matter as much how much distance there is between interactions. If a post has already gone a full year without anyone commenting on it, does it really make that much difference if it goes another month without a comment?
To accomplish the first property, all we do is divide by the number of seconds in 10 days: 60*60*24*10
.
To accomplish the second property, what we are looking for is some sort of alwaysincreasing, concave function (positive derivative, negative second derivative). The first thing that comes to mind is the squareroot function, which is good enough.
Result
And thus we have our final scoring function:
If we plot this equation for x = number of points
and y = time
, we can see the shape of this function and check for different values if they make sense:
As expected, there is a steep 10day “boost” period, followed by an increasingly slower decline in the value as more and more time passes.
The function is also heavily biased toward very new posts, which will always come out on top, giving them a chance. This might be a bad idea if posting becomes frequent, but user interaction is low (many summaries a day, few votes or comments), and might have to be changed.
There are many ways to tweak this equation (changing the boost period, for example) to make it more or less biased towards either time or user interaction.
Bonus Round: Implementing in ElasticSearch
Implementing a custom scoring function in Elasticsearch, though easy once it’s all set up, was rather frustrating because of the poor documentation.
For our implementation, we’re using the tire
gem (a wrapper around the Elasticsearch API). This is where we call the custom scoring script:
query do #custom_score script: "parseInt(doc['upvote_count'].value)", lang: "javascript" do custom_score script: script, lang: 'javascript' do string query.join(" OR ") end end
Where script
is simply a variable holding the contents of a javascript file as a string. Note the option lang: 'javascript'
. This lets us use javascript as our language of choice, as opposed to mvel, the most poorly documented scripting language on the face of the earth. To enable this option, we’ll also require the elasticsearchlangjavascript plugin.
Here is our script:
Sidenote: Notice the logger function. This enables us to implement a sort of “console.log” which we can read using the following shell command
tail f /var/log/elasticsearch/elasticsearch.log
.
// Logger function: var logger = org.elasticsearch.common.logging.Loggers.getLogger("rails_logger"); // Example usage: logger.info("========= NEW CALC ==========="); var points_log = parseFloat(doc.points_log.value); var now = Math.round(new Date().getTime() / 1000); /** * NOTE: doc.ts.values is not actually an array, * here I create an array out of it: **/ var ts = []; for (var i = 0; i < doc.ts.values.length; i++) ts[i] = doc.ts.values[i]; ts.push(now); // Newest first ts.reverse(); /** * Boost period. **/ var ten_days = 60*60*24*10; /** * The scoring function **/ function score() { /** * Weighed average numerator **/ var times_num = (function() { var val = 0; for (var i = 1; i < ts.length; i++) { var exp = i  1; val += Math.pow(0.5, exp) * (parseFloat(ts[i])  parseFloat(ts[i  1])); } return val; })(); /** * Weighed average denominator **/ var times_denom = (function() { var val = 0; for (var i = 1; i < ts.length; i++) { var exp = i  1; val += Math.pow(0.5, exp); } return val; })(); var t_ave = (times_num/times_denom); return points_log/Math.sqrt(t_ave/ten_days); }; score();
Amdahl’s Law
As multicore computing becomes the norm (even my phone is dual core!), it’s important to understand the benefits and also the limitations of concurrency. Amdahl’s Law addresses the latter.
Let’s imagine a simple program. It prints “Hello World” 100 times, then quits.
Our first version of the program is written as a single sequential task: it prints one “Hello World”, then another, then another, 100 times, then quits. This program takes some unit of time, to execute.
Now say we have a dualcore machine at hand. (My phone, perhaps).
Cool, now we can spawn two tasks that print “Hello World” 50 times each. And, because our magical imaginary computer experiences no overhead, it takes us exactly units of time to run our second program.
So we keep adding more and more processors, until we have 100 concurrent threads printing one “Hello World” each, and our program runs 100 times faster.
At this point we stop: “Ah, the trend is clear: more processors equals more speed! No point in continuing this experiment.”
A naive (wrong) first guess: Given processors executing a program, the maximum boost in speed is . (That is, we can get our program to run times faster).
Cool! This means that, given enough processors, we could make any program run almost instantly. Right?
Of course this is not the case! Enough daydreaming. Let’s figure out a more Continue reading