Language:EN
Pages: 9
Rating : ⭐⭐⭐⭐⭐
Price: $10.99
Page 1 Preview
the pagerank citation ranking bringing order the w

The pagerank citation ranking bringing order the web

Chapter 6

PageRank

6.3 An Extension: Timed-PageRank ....................................... 123

6.4 Summary .............................................................. 124

6.1

The PageRank algorithm was first introduced by Sergey Brin and Larry Page at the Seventh International World Wide Web Conference (WWW7) in April 1998, with the aim of tackling some major difficulties with the content-based ranking algorithms of early search engines. These early search engines essentially retrieved relevant pages for the user based on content similarities of the user query and the indexed pages of the search engines. The retrieval and ranking algorithms were simply direct implementation of those from information retrieval. However, starting from 1996, it became clear that the content similarity alone was no longer sufficient for search due to two main reasons. First, the number of Web pages grew rapidly during the middle to late 1990s. Given any query, the number of relevant pages can be huge. For example, given the search query “classification technique,” the Google search engine estimates that there are about 10 million relevant pages. This abundance of information causes a major problem for ranking, that is, how to choose only 10 to 30 pages and rank them suitably to present to the user. Second, content similarity methods are easily spammed. A page owner can repeat some important words and add many remotely related words in his/her pages to boost the rankings of the pages and/or to make the pages relevant to a large number of possible queries.

117

6.2

PageRank produces a static ranking of Web pages in the sense that a PageRank value is computed for each page off-line, and the value is not dependent on search queries. In other words, the PageRank computation is purely based on the existing links on the Web and has nothing to do with each query issued by users. Before introducing the PageRank formula, let us first state some main concepts.

In-links of page i: These are the hyperlinks that point to page i from other pages. Usually, hyperlinks from the same site are not considered.

1. A hyperlink from a page pointing to another page is an implicit conveyance of authority to the target page. Thus, the more in-links that a page ireceives, the more prestige the page i has.

2. Pages that point to page i also have their own prestige scores. A page with a higher prestige score pointing to i is more important than a page with a lower prestige score pointing to i. In other words, a page is important if it is pointed to by other important pages.

P(i) P( j) (6.1)
= ( j,i)∈E O j

P = (P(1), P(2), . . . , P(n))T

Let A be the adjacency matrix of our graph with

otherwise

We can write the system of n equations with⎩

© 2009 by Taylor & Francis Group, LLC

PageRank
1
Figure 6.1 3 4
2

The conditions are that A is a stochastic matrix and that it is irreducible and aperiodic. However, the Web graph does not meet these conditions. In fact, Equation (6.3) can also be derived based on the Markov chain. Then some theoret-ical results from Markov chains can be applied [8], which is where the above three conditions come from.

In the Markov chain model, each Web page or node in the Web graph is regarded as a state. A hyperlink is a transition, which leads from one state to another state with a probability. Thus, this framework models Web surfing as a stochastic process. It modelsaWebsurfer randomlysurfingtheWebasastatetransitionintheMarkovchain. Now let us look at the Web graph and see why all three conditions are not satisfied. First of all, A is not a stochastic (transition) matrix. A stochastic matrix is the transition matrix for a finite Markov chain whose entries in each row are nonnegative real numbers and sum to 1. This requires that every Web page must have at least one out-link. This is not true on the Web because many pages have no out-links, which are reflected in transition matrix A by some rows of complete 0’s. Such pages are called dangling pages (nodes).

0

0

0

0
1/2

0

0

0

0
1/3
0

is not a stochastic matrix because the fifth row is all 0’s, that is, page 5 is a dangling For example, A12 = A13 = 1/2 because node 1 has two out-links. We can see that A

page.

all 1’s, giving us the following matrix:

¯A =⎛⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎝1/2

1/6
0

1

1/6
0

0
1/6

1/6

1/2
0

0

0

Definition of strongly connected graphs: A directed graph G = (V , E) is

strongly connected if and only if, for each pair of nodes u, vV , there is a path from u to v.

can be dealt with using a single strategy (described below).

Finally, A is not aperiodic. A state i in a Markov chain being periodic means

A Markov chain is aperiodic if all states are aperiodic.

Example 6.2.2 Figure 6.2 shows a periodic Markov chain with k = 3. The transition

0 1 0 1 1

A =0

0

1

It is easy to deal with the above two problems with a single strategy.

r We add a link from each page to every page and give each link a small transition probability controlled by a parameter d.

of jumping to a random page. Note that Equation (6.6) assumes that A has already

been made a stochastic matrix. After scaling, we obtain

P(i) = (1 − d) + d n
(6.8)
P(i) (1 d) d P( j) (6.9)
= − + ( j,i)∈E O j

The parameter d, called the damping factor, can be set to a value between 0 and 1. d = 0.85 is used in [1, 7]. The computation of PageRank values of the Web pages can be done using the power iteration method [2], which produces the principal eigenvector with an eigenvalue of 1. The algorithm is quite simple (see Figure 6.3). One can start with any initial assignments of PageRank values. The iteration ends when the PageRank values do not change much or converge. In Figure 6.3, the iteration ends after the 1-norm of the residual vector is less than a prespecified threshold ε.

In Web search, we are only interested in the ranking of the pages. Thus, the actual convergence may not be necessary and fewer iterations are needed. In [1], it is reported that on a database of 322 million links the algorithm converges to an acceptable tolerance in roughly 52 iterations.

k ← 1

repeat

Figure 6.3

The power iteration method for PageRank.

6.3

An Extension: Timed-PageRank

The intuition here is that if the page was last updated (or created) a long time ago, the

pages that it points to are even older and are probably out of date. Then the 1 − f (t) value for such a page should be large, which means that the surfer will have a high

124 PageRank
probability of jumping to a random page. If a page is new, then its 1− f (t) value should be small, which means that the surfer will have a high probability to follow an out-link of the page and a small probability of jumping to a random page. For a complete new page in a Web site, which does not have any in-links at all, the method given uses the average Timed-PageRank value of the past pages in the Web site. This is reasonable because a quality site in the past usually publishes quality new pages. The Timed-PageRank algorithm has been evaluated based on research publication search and has given promising results. Interested readers, please refer to [6] for additional details.
6.4

details can be found in [1, 4, 5, 7]. An extension to the PageRank algorithm is also

briefly discussed, which adds the temporal dimension to search. Finally, we should

6.5

1. Given A below, obtain P by solving Equation (6.7) directly.

A =⎛⎜⎜⎜⎜⎜⎜⎜⎜⎜⎝1/2

0
1/3

1/4

1/4

1/2
0

0

1/4

1/4
0

0

iterations of P.

3. Calculate the squared error on each iteration in problem 2 where the squared

stabilize?

1 3 4
2

6. For the graph G given in problem 5, what is P after seven iterations based on the power iteration method?

7. Pick a URL, and construct a Web graph containing Web pages within three hops from the starting URL.

[8]
[9]
[10]

S. Brin, and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30, 1998.

X. Li, B. Liu, and P. S. Yu. Time Sensitive Ranking with Application to Publi-cation Search. Conference on Data Mining 2008.

L. Page, S. Brin, R. Motwami, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999–0120, Computer Science Department, Stanford University, 1999.

You are viewing 1/3rd of the document.Purchase the document to get full access instantly

Immediately available after payment
Both online and downloadable
No strings attached
How It Works
Login account
Login Your Account
Place in cart
Add to Cart
send in the money
Make payment
Document download
Download File
img

Uploaded by : Ahana Bhat

PageId: ELI6C603D3