
PageRank with MapReduce

The PageRank algorithm and its MapReduce implementation.

First of all, let’s consider an interesting question. Everyone knows PageRank and anyone can crawl webpages, so why is there no second Google? In other words, what is the limitation on becoming the second Google?

Obviously, the answer is computing power. Google operates more than 2 million machines located all over the world. These very powerful machines can store massive amounts of data and support large-scale computation.

Although Google has a lot of machines, it still needs distributed solutions, because no single machine or server can store all of the webpages and compute their importance. So let us first introduce the very famous PageRank algorithm, and then discuss its MapReduce implementation.

Random Walk

A random walk is a mathematical object, known as a stochastic or random process, that describes a path consisting of a succession of random steps on some mathematical space, such as the integers. The idea is based on Markov chains. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

Let $ p_i(t) $ be the probability that the walk is at node $ i $ at time $ t $, let $ A $ be the adjacency matrix, and let $ k_j $ be the degree of node $ j $.

$$
p_i(t) = \sum_{j} \frac{A_{ij}}{k_j} p_j(t-1)
$$

So, the probability of being at node $ i $ depends on the degrees and probabilities of its neighbors.
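The update above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original post: the small undirected graph below is a hypothetical example, stored as an adjacency list.

```python
# Hypothetical example graph as an adjacency list: node -> neighbors.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def step(p, graph):
    """One random-walk step: p_i(t) = sum_j A_ij / k_j * p_j(t-1)."""
    new_p = {i: 0.0 for i in graph}
    for j, neighbors in graph.items():
        k_j = len(neighbors)          # degree of node j
        for i in neighbors:           # A_ij = 1 exactly for neighbors i of j
            new_p[i] += p[j] / k_j
    return new_p

# Start the walk at node 0 and take a few steps.
p = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}
for _ in range(10):
    p = step(p, graph)
print(p)
```

Note that each step just redistributes probability mass, so the values always sum to 1.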

PageRank

The PageRank algorithm is based on the idea of random walk. Let $ p_i $ be a page and $ R(p_i) $ the PageRank value, i.e. the importance, of page $ p_i $. Let $ d = 0.85 $ be the damping factor, and let $ N_{in}(p_i) $ and $ N_{out}(p_i) $ be the sets of in- and out-neighbors of $ p_i $.

$$
R(p_i) = (1 - d) + d \sum_{p_j \in N_{in}(p_i)} \frac{R(p_j)}{|N_{out}(p_j)|}
$$
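Before moving to MapReduce, the formula can be checked with a single-machine power iteration. The sketch below is an illustration assuming a small hypothetical directed graph, given as out-link lists; each page starts with rank 1 and the update above is applied repeatedly.

```python
# Hypothetical directed graph: page -> list of pages it links to.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def pagerank(out_links, d=0.85, iters=50):
    """Iterate R(p_i) = (1 - d) + d * sum over in-neighbors p_j of
    R(p_j) / |N_out(p_j)| until the values settle."""
    ranks = {p: 1.0 for p in out_links}
    for _ in range(iters):
        new_ranks = {}
        for p_i in out_links:
            # Sum contributions from every page p_j that links to p_i.
            s = sum(ranks[p_j] / len(outs)
                    for p_j, outs in out_links.items() if p_i in outs)
            new_ranks[p_i] = (1 - d) + d * s
        ranks = new_ranks
    return ranks

print(pagerank(out_links))
```

In this example page C, which every other page links to, ends up with the highest rank, while page D, which nothing links to, stays at the baseline $ 1 - d = 0.15 $.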

PageRank with MapReduce

Referring to the equation above, the algorithm contains two cycles. The first counts the total out-degree of each node $ p_j $. The second sums the contributions of the in-neighbors of each node $ p_i $.

Therefore, we can use three MapReduce rounds to obtain a distributed solution:
1) Calculate the first cycle.
2) Calculate the second cycle.
3) Check whether the algorithm has converged.
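The second round can be sketched in map/reduce style. The toy code below is an assumed, simplified in-memory version (function names `mapper`, `reducer`, and `one_round` are illustrative; a real job would run on a framework such as Hadoop): the mapper emits each page's rank divided among its out-neighbors, and the reducer sums the incoming shares and applies the damping factor.

```python
from collections import defaultdict

D = 0.85  # damping factor

def mapper(page, value):
    rank, out_links = value
    # Pass the link structure through so the reducer can rebuild the graph.
    yield page, ("links", out_links)
    # Emit each out-neighbor's share of this page's rank.
    for q in out_links:
        yield q, ("rank", rank / len(out_links))

def reducer(page, values):
    out_links, total = [], 0.0
    for tag, v in values:
        if tag == "links":
            out_links = v
        else:
            total += v
    return page, ((1 - D) + D * total, out_links)

def one_round(graph):
    """Simulate one MapReduce round: map, shuffle by key, then reduce."""
    grouped = defaultdict(list)
    for page, value in graph.items():
        for key, v in mapper(page, value):
            grouped[key].append(v)
    return dict(reducer(p, vs) for p, vs in grouped.items())

# Each entry is page -> (current rank, out-links).
graph = {"A": (1.0, ["B", "C"]), "B": (1.0, ["C"]), "C": (1.0, ["A"])}
print(one_round(graph))
```

The third round would compare the old and new ranks (e.g. their total absolute difference against a threshold) and rerun this round until convergence.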

Back soon!