Learn the three phases of how a search engine works: The Discovery Phase (Crawling), The Storing Phase (Indexing) and The Serving Phase (Ranking).
Every time you run a search, thousands if not millions of web pages full of helpful information become available to you at the click of a button, all within a matter of milliseconds.
Before we dive into the nitty-gritty details of how Google works, we’ll discuss how search engines discover, organise and rank these pages.
After all, in order to rank in the first position, you need to make sure that your website is visible to search engines!
The Three Phases of Search Engines
A search engine has three primary functions: crawling, indexing and ranking. We’ll go into much more detail as to how Google performs these tasks, but for now, here is a short summary of each one.
The Discovery Phase: Crawling
Search engines crawl the world wide web using software programs called spiders. They are also known as search engine bots or web crawlers. Their task is to discover new and updated content that has been made available.
Crawlers typically begin their visit to a website by downloading its robots.txt file, which contains rules about which resources search engines can and cannot crawl, along with information about sitemaps. Sitemaps contain a list of URLs that the site owner would like search engines to crawl.
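To make the robots.txt step concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs and the `ExampleBot` user agent are all made up for illustration, and the file is parsed from a string rather than downloaded, so this shows how the rules are read rather than how any particular search engine implements it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site (example.com).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is an invented crawler name; it falls under the "*" rules.
blog_allowed = rules.can_fetch("ExampleBot", "https://example.com/blog/post-1")
private_allowed = rules.can_fetch("ExampleBot", "https://example.com/private/report")

# site_maps() (Python 3.8+) lists the Sitemap URLs declared in the file.
sitemaps = rules.site_maps()

print(blog_allowed, private_allowed, sitemaps)
```

A crawler built along these lines would check `can_fetch` before requesting each URL and would read the sitemap URLs to find its first pages to visit.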
Whilst visiting these pages, the bot identifies hyperlinks in the pages and keeps track of these newly discovered URLs in what’s called the crawl frontier. These URLs are then recursively visited according to the directives defined in the robots.txt file.
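The crawl frontier described above can be sketched as a simple queue of URLs waiting to be visited. The example below is an illustration under stated assumptions: the `PAGES` dictionary stands in for the web (a real crawler would fetch pages over HTTP), `is_allowed` is a hypothetical stand-in for a robots.txt check, and the loop is iterative rather than literally recursive, which amounts to the same traversal.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML body.
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/blog/post-1">Post</a> <a href="/private/x">X</a>',
    "https://example.com/blog/post-1": '<a href="/about">About</a>',
    "https://example.com/private/x": "secret",
}


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def is_allowed(url):
    # Stand-in for a robots.txt check: pretend /private/ is disallowed.
    return "/private/" not in url


def crawl(seed_urls):
    frontier = deque(seed_urls)  # URLs discovered but not yet visited
    seen = set(seed_urls)        # every URL ever added to the frontier
    visited = []
    while frontier:
        url = frontier.popleft()
        html = PAGES.get(url)
        if html is None:
            continue  # unknown page; a real crawler would handle HTTP errors here
        visited.append(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen and is_allowed(link):
                seen.add(link)
                frontier.append(link)
    return visited


crawl_order = crawl(["https://example.com/"])
print(crawl_order)
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other, and the disallowed `/private/x` page is never added to the frontier.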
Algorithms determine how often a page is re-crawled and how many pages of a given website should be indexed.
The Storing Phase: Indexing
Once new content has been crawled, search engines store their own copy (or snapshot) of it, usually across thousands of machines, in their search index. The search index is a library or repository that holds the web pages the search engine has crawled and chosen to index.
The cached (copied) content is stored together with its meta information created during indexing. The meta information may include relevant key signals about the contents of each page, like:
- The keywords and topics associated with the piece of content
- The type of content being crawled – sometimes described using structured data markup based on the Schema.org vocabulary (more on this later in the course)
- The page’s freshness – when it was last updated and how it has changed compared to previous versions of the cached (indexed) content
- How users engage and interact with the page, where available.
This content also needs to be organised so that it can be retrieved rapidly whenever a relevant search is made.
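One common way to organise content for rapid retrieval is an inverted index, which maps each word to the set of documents that contain it. The sketch below is a toy illustration, not a description of any particular search engine's index: the documents, their IDs and the `tokenize`, `build_index` and `lookup` helpers are all invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical documents keyed by an invented numeric ID.
DOCS = {
    1: "How search engines crawl the web",
    2: "Search engines index pages for fast retrieval",
    3: "Baking bread at home",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return the IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


index = build_index(DOCS)
matches = lookup(index, "search engines")
print(matches)
```

Because the index is built ahead of time, answering a query only requires looking up a few terms and intersecting their document sets, rather than scanning every stored page.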
The Serving Phase: Ranking
When a user makes a search query, the search engine scours its index, identifies the content that it thinks is most relevant to the searcher’s query, and orders it using a search algorithm. The aim of the search engine algorithm is to display the most relevant set of high-quality search results that will address the user’s search query as quickly and directly as possible.
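As a rough illustration of ordering results by relevance, the sketch below scores documents with a simple TF-IDF-style formula (term frequency weighted by how rare the term is across documents). This is a textbook technique chosen purely for the example; real search engines combine many more signals, and the documents, IDs and the `rank` helper here are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus keyed by invented document IDs.
DOCS = {
    "crawling": "crawlers discover new pages by following links between pages",
    "indexing": "the index stores pages so a search can retrieve them quickly",
    "ranking": "ranking orders search results so the best search results come first",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Return (doc_id, score) pairs sorted from most to least relevant."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                # Smoothed inverse document frequency: rarer terms weigh more.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score  # documents with no matching term are dropped

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


results = rank("search results", DOCS)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In this toy corpus the "ranking" document comes first because it mentions both query terms several times, while the "crawling" document matches neither term and is left out entirely.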
Each search engine has its own algorithm, which means that a page that ranks highly in Google may not rank as well in Bing, or vice versa.