Google’s search algorithm

There is an article in the New York Times on Google’s algorithm for returning pages after a search. It is remarkable how complex it has become. At its base is a concept called PageRank (named after Larry Page, one of the founders of Google), a very simple concept that every operations research student can understand (Wikipedia has a short outline of it). You can see PageRank as the stationary distribution of an ergodic Markov chain, or as an eigenvector of a particular matrix (as described by Kurt Bryan and Tanya Leise). But the system has gone far beyond that:

Mr. Singhal [the subject of the article;  the Google engineer responsible for the rankings] has developed a far more elaborate system for ranking pages, which involves more than 200 types of information, or what Google calls “signals.” PageRank is but one signal. Some signals are on Web pages — like words, links, images and so on. Some are drawn from the history of how pages have changed over time. Some signals are data patterns uncovered in the trillions of searches that Google has handled over the years.

“The data we have is pushing the state of the art,” Mr. Singhal says. “We see all the links going to a page, how the content is changing on the page over time.”

Increasingly, Google is using signals that come from its history of what individual users have searched for in the past, in order to offer results that reflect each person’s interests. For example, a search for “dolphins” will return different results for a user who is a Miami football fan than for a user who is a marine biologist. This works only for users who sign into one of Google’s services, like Gmail.

Lots of operations research projects end up looking like that: a sharp, clean model at the core, augmented over time with more and more additions to better fit reality.
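
To make the clean core concrete, here is a minimal sketch of PageRank computed as the stationary distribution of a "random surfer" Markov chain, using power iteration. The function name `pagerank`, the toy link graph, the damping factor of 0.85, and the uniform handling of pages without outgoing links are all assumptions chosen for illustration; this is not a description of what Google actually runs.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping is the (assumed) probability that the surfer follows a link
    rather than jumping to a page chosen uniformly at random.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution

    for _ in range(max_iter):
        # Rank sitting on pages with no outgoing links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Each page passes its rank on in equal shares along its links.
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)

        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:  # successive distributions agree, so stop iterating
            break
    return rank


# Hypothetical four-page web: A links to B and C, B to C, C to A, D to C.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, page C collects links from every other page and comes out on top, while D, which nobody links to, keeps only the baseline share from random jumps. The resulting vector is the eigenvector view in miniature: applying one more step of the chain leaves it essentially unchanged.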

{ 2 } Comments

  1. Hart Z. | June 4, 2007 at 1:02 am | Permalink

    “A sharp, clean model at the core” is actually what is driving every webmaster crazy. When they are supposed to be developing useful content that makes the WWW better, they are forced to build meaningless links instead. No one can ever estimate how much time the world has lost because of Google’s brilliant PageRank. Some may say you don’t have to; if you have unique and useful content, people will link to you naturally. Well, good luck with that one. While you are waiting for organic links, sites with 1/10 of the content but more backlinks (although useless to visitors) will be ranked much higher by Google, and thus gain more backlinks and be ranked even higher. You’ll never catch up.

    Don’t believe me? Just go to any SEO forum and compare the number of people discussing content vs. link building.

    PageRank did a great job in organizing the internet, but its emphasis on backlinks did more harm than good. Google keeps saying there are 200+ factors in ranking pages, but everyone knows they still rely on links so heavily that if you are a webmaster who cares about your site, you’d better spend most of your time looking for links.

  2. Roger D | June 20, 2007 at 9:40 am | Permalink

    Agree with Hart Z.’s post above.

    Automated page generation is all the rage as well.

    Worthless content on the Internet will continue to ruin SERPs due to link spamming and Google rewarding sites with thousands of pages of scraped content.