March 05, 2004

Page rank puzzles

Prompted by Semantic Compositions' recent self-evaluation, I finally decided to explore the page rank numbers available on the Google toolbar for Internet Explorer. I don't usually use IE, but for the occasion I cranked it up and gave the thing a try.

And I'm puzzled. In the large, things sort of make sense. But the details are puzzling.

As you doubtless know, page rank is a method for using the eigenstructure of the web's link graph as a source of information about the relative importance or value of pages. Though there are various attempts to go beyond Google, this is still the method of choice for sorting web pages whose content looks relevant to a query (where this basic relevance is calculated as some sort of weighted sum of the words a given page shares with the query). As Google's site explains:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."
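To make the "votes" idea concrete, here is a minimal sketch of the textbook power-iteration version of page rank. It is not Google's production algorithm: the damping factor of 0.85, the handling of pages without outgoing links, and the toy graph with its page names (`hub`, `loner`, `x`, `y` and so on) are all assumptions made for illustration. Pages `x` and `y` each receive exactly one link, but `x` gets its vote from a page that is itself well linked.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: pages with no outgoing links spread their rank evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page splits its current rank equally among the pages it links to.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy web: three pages vote for "hub", which votes for "x";
# "loner" has no incoming links and votes for "y".
toy_web = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "hub": ["x"],
    "loner": ["y"],
}

ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page:6s} {ranks[page]:.3f}")
```

In this sketch `x` ends up ranked above `y` even though each has a single incoming link, because the page voting for `x` carries more weight of its own.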

It's talk like this that has Semantic Compositions calling his site "totally unimportant" just because Google (or rather the Google toolbar) assigned it a PageRank of 0 (on a crudely quantized and probably-logarithmic scale of 10). As SC is well aware, his negligibility is temporary, and will disappear with time. In fact, as of this writing, his Google toolbar page rank is already 2! But many puzzles of page rank remain unexplained by the lag of linking to new sites and/or sampling new links.

MIT is the only university that I've found at page rank 10 -- Penn, Stanford, Harvard, Princeton, Yale, Berkeley, UCLA, UCSC, UCSD, Michigan, Rutgers, Colorado, Florida and CMU are all 9; Swarthmore, UConn, Vermont and Georgia are 8, but OSU is only 7, down with Vassar, Haverford and Maine. What's up with that?

Apple and Intel are 10, but Microsoft, Sun, AOL and Oracle are 9; AMD, Hitachi and Sony are merely 8.

Science magazine is 10; The New York Times, CNN and Scientific American are 9; Atlantic magazine, Salon and The New Yorker are 8; Arts and Letters Daily, the Philadelphia Inquirer, Slate and Harper's are 7.

LinguistList is 8; the LSA is 7, as is the LDC. The linguistics departments at Penn, Stanford, UCSC and Berkeley are at 7; the linguistics departments at UMass, OSU and UCLA are at 6; but the MIT linguistics department is at 8. As far as I can tell, the differential ranking of the departments' pages doesn't match the size of the departments, the ranking of the departments, or the amount of interesting stuff directly reachable from their front pages.

As a final indication of how semi-random this can be, my home page is 7 while Geoff Nunberg's is 6. He's much more famous than I am, and also has much more interesting stuff on his home page. And it's hard to believe that my feeble little home page gets as many page rank votes as Slate, Harper's, Vassar College and OSU, and more than the whole UMass and UCLA linguistics departments.

This mathematical and practical discussion of page rank helps explain why results can sometimes be unintuitive. For example, if you get quite a few links from others but don't send any outside your site, that boosts your page rank. That might explain Science magazine at 10: the links come in, but they don't go out! This might also explain the oddly high rank of my graduate alma mater MIT, which is not only famous but also famously self-involved :-). But I'm still kind of puzzled about me and Geoff and OSU.
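The link-hoarding effect can be seen in a toy model. The sketch below uses the standard textbook iteration with an assumed damping factor of 0.85, and compares two hypothetical four-page webs that differ in one respect only: whether the site's inner page also links out to an external page. The page names are invented for illustration, and nothing here says how the toolbar actually quantizes its 0-to-10 score.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook power iteration; assumes every page is a key with >= 1 outlink."""
    n = len(links)
    rank = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in links}
        # Each page splits its current rank equally among the pages it links to.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical site with two pages ("home", "inner") and two outside pages.
# In both webs the outside pages link to the site's home page.
hoarding = {
    "ext1": ["home", "ext2"],
    "ext2": ["home", "ext1"],
    "home": ["inner"],
    "inner": ["home"],          # links come in, but they don't go out
}
sharing = dict(hoarding, inner=["home", "ext1"])  # one link now leaves the site

hoard = pagerank(hoarding)
share = pagerank(sharing)
hoard_total = hoard["home"] + hoard["inner"]
share_total = share["home"] + share["inner"]

print(f"hoarding site: home={hoard['home']:.3f} total={hoard_total:.3f}")
print(f"sharing site:  home={share['home']:.3f} total={share_total:.3f}")
```

In this model the total rank over all pages is fixed at 1, so whatever the hoarding site keeps comes at the expense of the outside pages; adding a single outbound link lowers both the home page's score and the site's total.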

An alternative is to try the Alexa toolbar, which uses a very different method of calculating importance. In Alexa rankings, the best possible number is 1, and higher numbers indicate less importance (like rank in a competition). For Alexa, the NYT is 66 while Science is 8,506; Harvard is 1,110 to MIT's 1,199; the LDC is 2,231 while LinguistList is 47,790 and the LSA is 577,603; and Geoff Nunberg's home page ranks 1,704,538, while mine has "no data", but OSU is 3,123! Go figure.

Posted by Mark Liberman at March 5, 2004 05:40 AM