The Anatomy of a Large-Scale Hypertextual Web Search Engine
Google was conceived as an effort to better organize the rapidly growing amount of information on the web. Primitive methods of organization that involved a high level of human curation had already been employed, but it was clear that these methods were inadequate. The primary form of content organization at the time involved manual retrieval, identification, and classification of each web page.
The webpages were then categorized into large directories grouped by topic, similar to the way libraries are maintained. Some simplistic search engines existed at the time, but they relied too heavily on keyword matching: they retrieved a large number of related documents and were unable to organize the results any further. Put plainly, as the number of documents in existing indices increased, so did the number of junk results, while the user’s ability to parse those results did not. Google was designed to break these limits by using highly optimized data structures that scale well to extremely large data sets while removing the need for human curation of results.
Google’s advantage over existing search methods came from the authors’ awareness of then-recent research indicating that the hyperlink structure of the web is extremely useful in determining search engine query relevance. Most existing search engines performed information retrieval by simply searching all indexed webpages for the queried search term. Pages were then ordered in the search results by determining their relevance to the queried term through analysis of the words on each page.
Primitive forms of these methods included counting the number of times a query term appeared on a given page (term-frequency analysis), checking whether the query appeared in the title of a page, and applying slightly more advanced mathematical formulas for calculating relevance (Term Frequency-Inverse Document Frequency, or TF-IDF). These methods, which belong to the broader fields of information retrieval and natural language processing (NLP), have progressed greatly since that time and are still used today. We will discuss NLP in detail, including its use in modern search engines, in another post.
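As a rough illustration of the on-page scoring described above, the following Python sketch computes a TF-IDF score for a single query term. The function name `tf_idf_scores`, the sample documents, and the smoothed IDF formula are assumptions made for this example; they are not taken from the paper or from any particular search engine.

```python
import math
from collections import Counter


def tf_idf_scores(query_term, documents):
    """Score each document for one query term using a simple TF-IDF variant."""
    term = query_term.lower()
    # Naive whitespace tokenization (an assumption; real systems do much more).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: how many documents contain the term at least once.
    df = sum(1 for tokens in tokenized if term in tokens)

    # Smoothed inverse document frequency (assumed variant, avoids division by zero).
    idf = math.log((1 + n_docs) / (1 + df)) + 1

    scores = []
    for tokens in tokenized:
        # Term frequency: share of the document's tokens that match the term.
        tf = Counter(tokens)[term] / len(tokens) if tokens else 0.0
        scores.append(tf * idf)
    return scores


# Hypothetical toy corpus.
docs = [
    "cheap flights cheap hotels cheap deals",
    "a guide to booking flights",
    "history of the printing press",
]
scores = tf_idf_scores("cheap", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

In this toy corpus only the first document contains the term, so it is the only one with a non-zero score. Repeating the word raises that score further, which hints at why purely on-page measures are easy to game.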
Google incorporated these on-page methods of determining query relevance, but also strongly capitalized on the use of additional data taken from hyperlinks. By incorporating data regarding the link anchor text that points to a given page, Brin and Page were able to better refine search results and greatly increase the precision of results page rankings. The algorithm that the Google search engine used to rank websites in search engine results was named PageRank after Larry Page and was described in a paper called “The PageRank Citation Ranking: Bringing Order to the Web”.
The original paper can be found in its full form here: The Anatomy of a Large-Scale Hypertextual Web Search Engine
The PageRank Citation Ranking: Bringing Order to the Web
By: Lawrence Page, Sergey Brin, Rajeev Motwani & Terry Winograd
In its most basic form, PageRank works by factoring in the number and quality of inbound links to a page to roughly determine a website’s importance. The underlying assumption of PageRank is that more important websites are likely to be linked to more often from other websites. PageRank is still used as part of the Google ranking algorithm, but it is not the only algorithm used by Google and cannot be used by itself to explain Google’s search results rankings [21]. The PageRank research paper gives us a unique glimpse into the inner mechanics of Google’s core search algorithms. The paper describes the ranking methods in explicit detail, with a clarity and breadth that would never again be replicated by the future search giant. Following the subsequent incorporation, IPO, and rapid growth of Google Inc. (now operating under the parent company Alphabet Inc.), public knowledge of this core search algorithm would only become increasingly opaque.
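To make the idea of "number and quality of inbound links" more concrete, here is a minimal Python sketch of a simplified, iterative PageRank calculation. It is an illustration only, not Google's production algorithm: the function name `pagerank`, the toy link graph, the fixed iteration count, and the way pages without outbound links are handled are all assumptions made for this example. The damping factor of 0.85 is the value commonly cited for the original formulation.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its own rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outbound links spreads its rank to all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "a" and "e" each have exactly one inbound link.
graph = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "e"],
    "e": ["c"],
}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, pages "a" and "e" each have a single inbound link, but "a" is linked from the heavily linked page "c" while "e" is linked from the unlinked page "d". Page "a" therefore ends up with a much higher score, which is the "quality" half of the idea.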
It is important to note the careful wording in this paper, as well as in subsequent Google press releases and public comments. Academic papers, especially those produced during the early years of graduate school, are meticulously examined and torn apart as part of the scholastic process. Writing an academic paper for release to a conference or scientific journal is no joke. Although it might seem overbearing to analyze the wording of this paper in extreme detail, you will find that this type of examination can be quite revealing. The review process for these papers is extremely rigorous, and students are expected to be able to defend each and every fact and inference that they make. It is very clear that Brin and Page took this to heart, as even today the press releases, literature, and public comments made by Google and its employees are quite obviously methodically sculpted. You will find entire YouTube channels and media outlets devoted to the interpretation of Google press releases. With that being said, please note the following excerpt taken from the Abstract of the 1998 PageRank paper:
“This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them”
To the untrained eye, this may seem like a pretty bland summary sentence, but it’s more than that. This sentence has changed the lives of millions of people, caused hundreds of billions of dollars to change hands, made and broken countless businesses, and will possibly make or break your career in SEO or your e-commerce business. The Google search engine that harbored PageRank incorporated many state-of-the-art ideas, but without this summary sentence and its objective, PageRank would have been nothing more than a natural progression in the evolution of search engines. Brin and Page took the search engine industry a giant leap forward by redefining the objective of search engines. While existing search engines were designed to rank results by judging a page’s relevance to the search query, the term “relevance” isn’t even mentioned in this paper. Instead, PageRank was created in an attempt to measure “human interest” in web pages and judge the “attention devoted” to them.
This may seem subtle, but the distinction highlighted by Brin and Page is massive. End users of a search engine don’t care if the returned pages are relevant to a search query if those pages don’t contain the information they were looking for. While web page relevance to a search query is very useful for the initial information retrieval process, it is not nearly as useful when ranking the final results pages. Ranking by query relevance alone gives very high recall but highly variable precision. Even in 1998, there was a large number of useless “garbage” websites on the internet. Even websites that a human would judge to be useless or gibberish often contain exact text matches to the query term and would show up at the top of the search results of primitive search engines.
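The terms recall and precision used above have standard definitions in information retrieval, and the F-score is a common way of combining the two into a single number. The Python sketch below computes all three for a made-up result set; the function name `precision_recall_f1` and the page identifiers are hypothetical and exist only for this example.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for a set of retrieved results."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)

    # Precision: what fraction of the returned pages are actually useful?
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the useful pages were returned?
    recall = hits / len(relevant) if relevant else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical scenario: two pages truly answer the query.
relevant = {"p1", "p2"}

# Pure keyword matching returns both useful pages plus six junk pages.
keyword_matches = ["p1", "p2", "spam1", "spam2", "spam3", "spam4", "spam5", "spam6"]
p, r, f1 = precision_recall_f1(keyword_matches, relevant)
print(p, r, f1)

# A ranking step that keeps only the two best results.
top_ranked = ["p1", "p2"]
p2, r2, f1_ranked = precision_recall_f1(top_ranked, relevant)
print(p2, r2, f1_ranked)
```

The keyword-matching result set finds everything useful (perfect recall) but buries it in junk (low precision), which is the situation described above. A second step that filters the same candidates down to the best ones raises precision without giving up recall.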
PageRank builds upon existing information retrieval (IR) techniques: a relevant subset of all web pages is first retrieved using previously established high-recall methods, and these pages are then refined and ranked using a method known for its high precision. By applying a high-recall filter first and a high-precision filter second, the approach targets both measures at once, and with them the F-score that combines the two. So what are the high-precision features incorporated into PageRank? Hint: Think back to the text excerpt from the abstract…
- Human Interest
- Attention devoted to the website
If you only remember two things from this blog post, please remember these two features and their importance to websites. No matter how many times Google’s algorithms change, regardless of the number of features that they claim to be incorporating into their core ranking algorithm or how confusing and pedantic SEO tactics seem, just remember that it is all done in an effort to gauge human interest and measure the attention devoted to your website.
The original paper can be found in its full form here: The PageRank Citation Ranking: Bringing Order to the Web