The PageRank Citation Ranking: Bringing Order to the Web
Chapter 6
PageRank
6.1
The PageRank algorithm was first introduced by Sergey Brin and Larry Page at the Seventh International World Wide Web Conference (WWW7) in April 1998, with the aim of tackling some major difficulties with the content-based ranking algorithms of early search engines. These early search engines essentially retrieved relevant pages for the user based on content similarities between the user query and the indexed pages of the search engines. The retrieval and ranking algorithms were simply direct implementations of those from information retrieval. However, starting from 1996, it became clear that content similarity alone was no longer sufficient for search, for two main reasons. First, the number of Web pages grew rapidly during the middle to late 1990s. Given any query, the number of relevant pages can be huge. For example, given the search query “classification technique,” the Google search engine estimates that there are about 10 million relevant pages. This abundance of information causes a major problem for ranking, that is, how to choose only 10 to 30 pages and rank them suitably to present to the user. Second, content similarity methods are easily spammed. A page owner can repeat some important words and add many remotely related words in his/her pages to boost the rankings of the pages and/or to make the pages relevant to a large number of possible queries.
6.2
PageRank produces a static ranking of Web pages in the sense that a PageRank value is computed for each page off-line, and the value is not dependent on search queries. In other words, the PageRank computation is purely based on the existing links on the Web and has nothing to do with each query issued by users. Before introducing the PageRank formula, let us first state some main concepts.
In-links of page i: These are the hyperlinks that point to page i from other pages. Usually, hyperlinks from the same site are not considered.
1. A hyperlink from a page pointing to another page is an implicit conveyance of authority to the target page. Thus, the more in-links that a page i receives, the more prestige the page i has.
2. Pages that point to page i also have their own prestige scores. A page with a higher prestige score pointing to i is more important than a page with a lower prestige score pointing to i. In other words, a page is important if it is pointed to by other important pages.
\[ P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j} \tag{6.1} \]

where $O_j$ is the number of out-links of page $j$. Let $P$ be the $n$-dimensional column vector of PageRank values, that is,

\[ P = (P(1), P(2), \ldots, P(n))^T \]
Let A be the adjacency matrix of our graph with

\[ A_{ij} = \begin{cases} \dfrac{1}{O_i} & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases} \tag{6.2} \]

We can then write the system of $n$ equations (6.1) compactly in matrix form:

\[ P = A^T P \tag{6.3} \]
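As a concrete illustration (not from the chapter's text), the matrix A defined above can be built from a list of hyperlinks; the graph below is a made-up four-page example:

```python
# A minimal sketch: build the matrix A, where A[i][j] = 1/O_i if page i
# links to page j (O_i = number of out-links of i), and 0 otherwise.
# The edge list is a hypothetical example, not the graph in Figure 6.1.
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]

out_degree = [0] * n
for i, j in edges:
    out_degree[i] += 1

A = [[0.0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1.0 / out_degree[i]

print(A[0])  # node 0 has two out-links, so A[0][1] = A[0][2] = 1/2
```

Each row of A corresponding to a page with at least one out-link sums to 1, which is what makes A a candidate transition matrix.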
© 2009 by Taylor & Francis Group, LLC
Figure 6.1 An example of a hyperlink graph.
The conditions are that A is a stochastic matrix and that it is irreducible and aperiodic. However, the Web graph does not meet these conditions. In fact, Equation (6.3) can also be derived based on the Markov chain. Then some theoretical results from Markov chains can be applied [8], which is where the above three conditions come from.
In the Markov chain model, each Web page or node in the Web graph is regarded as a state. A hyperlink is a transition, which leads from one state to another state with a probability. Thus, this framework models Web surfing as a stochastic process. It models a Web surfer randomly surfing the Web as a state transition in the Markov chain. Now let us look at the Web graph and see why all three conditions are not satisfied. First of all, A is not a stochastic (transition) matrix. A stochastic matrix is the transition matrix for a finite Markov chain whose entries in each row are nonnegative real numbers and sum to 1. This requires that every Web page must have at least one out-link. This is not true on the Web because many pages have no out-links, which are reflected in transition matrix A by some rows of complete 0’s. Such pages are called dangling pages (nodes).
[Transition matrix A of the example graph in Figure 6.1; its fifth row is all 0’s.]

For example, A12 = A13 = 1/2 because node 1 has two out-links. We can see that A is not a stochastic matrix because the fifth row is all 0’s, that is, page 5 is a dangling page.
One way to deal with a dangling page is to replace its row of all 0’s in A with e/n, where e is a row vector of all 1’s, giving us the matrix Ā, which is identical to A except that its fifth row becomes (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). Every row of Ā now sums to 1, so Ā is a stochastic matrix.
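The dangling-page fix above can be sketched in a few lines; the three-page matrix here is a made-up example with the last page dangling:

```python
# Sketch of the dangling-page fix: replace each all-zero row of A with a
# uniform row e/n, yielding a stochastic matrix A_bar.
n = 3
A = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]  # the third page has no out-links (dangling)

A_bar = [row[:] if any(row) else [1.0 / n] * n for row in A]

print(A_bar[2])  # [1/3, 1/3, 1/3]; every row of A_bar now sums to 1
```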
Definition of strongly connected graphs: A directed graph G = (V , E) is
strongly connected if and only if, for each pair of nodes u, v ∈ V , there is a path from u to v.
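The definition can be checked mechanically: a directed graph is strongly connected exactly when every node is reachable from some fixed node s, and s is reachable from every node (i.e., every node is reachable from s in the reversed graph). A small sketch, with illustrative function names:

```python
# Check strong connectivity with two reachability searches: one on G and
# one on G with all edges reversed.
def reachable(adj, start):
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def strongly_connected(nodes, edges):
    fwd, rev = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        rev.setdefault(v, []).append(u)
    s = nodes[0]
    everything = set(nodes)
    return reachable(fwd, s) == everything and reachable(rev, s) == everything

print(strongly_connected([1, 2, 3], [(1, 2), (2, 3), (3, 1)]))  # True (a cycle)
print(strongly_connected([1, 2, 3], [(1, 2), (2, 3)]))          # False
```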
The Web graph is not strongly connected, so A is not irreducible. This problem can be dealt with using a single strategy (described below).
Finally, A is not aperiodic. A state i in a Markov chain is periodic with period k > 1 if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k. If a state is not periodic (i.e., k = 1), it is aperiodic. A Markov chain is aperiodic if all its states are aperiodic.
Example 6.2.2 Figure 6.2 shows a periodic Markov chain with k = 3. The transition matrix is

\[ A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \]
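The periodicity in Example 6.2.2 can be observed directly by iterating the chain: starting from state 1, the state distribution cycles with period 3 and never settles to a single vector.

```python
# The 3-state cycle of Example 6.2.2: state 1 -> 2 -> 3 -> 1.
A = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

def step(p, A):
    # One Markov transition: p_next[j] = sum_i p[i] * A[i][j]
    n = len(A)
    return [sum(p[i] * A[i][j] for i in range(n)) for j in range(n)]

p = [1.0, 0.0, 0.0]  # start in state 1
for k in range(1, 7):
    p = step(p, A)
    print(k, p)
# The distribution cycles: [0,1,0], [0,0,1], [1,0,0], [0,1,0], ...
```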
It is easy to deal with the above two problems with a single strategy.
• We add a link from each page to every page and give each link a small transition probability controlled by a parameter d.
of jumping to a random page. Note that Equation (6.6) assumes that A has already been made a stochastic matrix. After scaling, we obtain

\[ P = (1 - d)e + dA^T P \tag{6.7} \]

where e is a column vector of all 1’s. In scalar form, the PageRank value of each page i is

\[ P(i) = (1 - d) + d \sum_{j=1}^{n} A_{ji} P(j) \tag{6.8} \]

which is equivalent to

\[ P(i) = (1 - d) + d \sum_{(j,i) \in E} \frac{P(j)}{O_j} \tag{6.9} \]
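For a small graph, the scaled PageRank equation P = (1 − d)e + dA^T P can be solved directly as the linear system (I − dA^T)P = (1 − d)e. A sketch with a made-up three-page stochastic matrix and the commonly used damping factor d = 0.85:

```python
import numpy as np

# Solve (I - d A^T) P = (1 - d) e directly. The matrix A below is an
# illustrative example (already stochastic: no dangling rows).
d = 0.85
A = np.array([[0.0,   0.5,   0.5],
              [1.0,   0.0,   0.0],
              [1.0/3, 1.0/3, 1.0/3]])

n = A.shape[0]
e = np.ones(n)
P = np.linalg.solve(np.eye(n) - d * A.T, (1 - d) * e)
print(P)  # with this scaling the PageRank values sum to n, not to 1
```

Summing both sides of the equation shows why the values sum to n: since each row of A sums to 1, the sum S of the components satisfies S = (1 − d)n + dS, hence S = n.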
The power iteration method is given in Figure 6.3:

PageRank-Iterate(G)
    P0 ← e/n
    k ← 1
    repeat
        Pk ← (1 − d)e + dA^T Pk−1;
        k ← k + 1;
    until ‖Pk − Pk−1‖1 < ε
    return Pk

Figure 6.3 The power iteration method for PageRank.
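The power iteration of Figure 6.3 translates almost line for line into code. A sketch, assuming A has already been made stochastic (dangling rows fixed); the function name and the tolerance value are illustrative choices:

```python
import numpy as np

def pagerank_iterate(A, d=0.85, eps=1e-8, max_iter=1000):
    """Power iteration for PageRank, following Figure 6.3."""
    n = A.shape[0]
    e = np.ones(n)
    P = e / n                                 # P0 <- e/n
    for _ in range(max_iter):
        P_next = (1 - d) * e + d * (A.T @ P)  # Pk <- (1-d)e + d A^T Pk-1
        if np.abs(P_next - P).sum() < eps:    # L1 norm of the change
            return P_next
        P = P_next
    return P

# Illustrative three-page stochastic matrix (same example as above).
A = np.array([[0.0,   0.5,   0.5],
              [1.0,   0.0,   0.0],
              [1.0/3, 1.0/3, 1.0/3]])
print(pagerank_iterate(A))
```

The result agrees with solving the linear system directly; power iteration is preferred on the real Web graph because A is enormous and sparse, so matrix-vector products are cheap while direct solves are infeasible.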
6.3
The intuition here is that if the page was last updated (or created) a long time ago, the
pages that it points to are even older and are probably out of date. Then the 1 − f(t) value for such a page should be large, which means that the surfer will have a high probability of jumping to a random page. If a page is new, then its 1 − f(t) value should be small, which means that the surfer will have a high probability of following an out-link of the page and a small probability of jumping to a random page. For a completely new page in a Web site, which does not have any in-links at all, the method uses the average Timed-PageRank value of the past pages in the Web site. This is reasonable because a quality site in the past usually publishes quality new pages. The Timed-PageRank algorithm has been evaluated based on research publication search and has given promising results. Interested readers may refer to [6] for additional details.
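To make the intuition concrete, here is an illustrative sketch of the idea, not the chapter's actual algorithm: it assumes an exponential decay f(t) = exp(−a·t) for a page last updated t time units ago, where the decay form and the parameter a are stand-in assumptions.

```python
import numpy as np

def timed_pagerank(A, ages, a=0.5, eps=1e-10, max_iter=1000):
    """Sketch of a Timed-PageRank-style iteration (assumed decay f(t) = e^{-a t})."""
    n = A.shape[0]
    f = np.exp(-a * np.asarray(ages, dtype=float))  # new page: f near 1; old: near 0
    P = np.ones(n) / n
    for _ in range(max_iter):
        # With probability f(t_j) the surfer follows an out-link of page j;
        # with probability 1 - f(t_j) the surfer jumps to a random page.
        jump = ((1.0 - f) * P).sum() / n
        P_next = jump + A.T @ (f * P)
        if np.abs(P_next - P).sum() < eps:
            return P_next
        P = P_next
    return P

A = np.array([[0.0,   0.5,   0.5],
              [1.0,   0.0,   0.0],
              [1.0/3, 1.0/3, 1.0/3]])
print(timed_pagerank(A, ages=[0.1, 2.0, 5.0]))  # a probability vector over the pages
```

Because each column of the combined transition matrix still sums to 1, the result remains a probability distribution; the older a page, the more its prestige is redistributed uniformly rather than along its out-links.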
6.4
details can be found in [1, 4, 5, 7]. An extension to the PageRank algorithm is also
briefly discussed, which adds the temporal dimension to search. Finally, we should
6.5
1. Given A below, obtain P by solving Equation (6.7) directly.
[Matrix A for this exercise; its entries include the values 0, 1/4, 1/3, and 1/2.]
2. Given the same A, compute P using the power iteration method of Figure 6.3, showing the successive iterations of P.
3. Calculate the squared error on each iteration in problem 2, where the squared error is the sum of the squared differences between the corresponding components of Pk and Pk−1. After how many iterations does P stabilize?
[Graph G for problem 5, containing nodes 1, 2, 3, and 4.]
6. For the graph G given in problem 5, what is P after seven iterations based on the power iteration method?
7. Pick a URL, and construct a Web graph containing Web pages within three hops from the starting URL.
[8]