Advanced SEO Textbook
1

Mastering Google’s Crawl

In this chapter we take a thorough look at how to ensure that your website can be crawled by Google and is bot-friendly.

Topic Details
Time: 102
Difficulty Intermediate

One of the cornerstones of SEO, and often one of its most overlooked aspects, is site crawlability (and by extension indexability) – i.e. the art of sculpting your website in a way that Googlebot can crawl and index your web pages correctly.

After all, if Googlebot (and other search engine robots) can’t find your web pages, then they won’t be indexed, and if they aren’t indexed, then your web pages have no chance of ranking.

Essentially, a bot-friendly website drastically improves your chances of search engines discovering your content and making it available to users. Sometimes even the smallest of tweaks can result in large gains in the SERPs. However, a false step can be detrimental to your site’s crawlability as important pages may not be crawled.

Therefore, in this section, we’ll walk you through the best ways in which you can optimise your website’s crawlability and indexability.

Page Blocking

As we saw in the How Google Works module, Google assigns each website a crawl budget, i.e. the number of URLs that Googlebot can and wants to crawl.

Page blocking helps you to ensure that Google only discovers the most important pages that you want it to crawl, meaning that your precious crawl budget is not wasted.

Let’s see how you can go about blocking particular pages from being crawled/indexed by Google.

Robots.txt

One of the simplest and most common ways to block pages is with the robots.txt file.

What is a robots.txt file?

The robots.txt file, which should be stored in the root directory of your domain, tells search engine bots which pages or files they can and cannot access on your site. It was originally designed to keep crawlers from overloading a website with lots of requests, so the robots.txt file is not a mechanism for preventing Google from indexing a web page. In order to do that, you need to either protect the page with a password or use noindex directives.

What Does a robots.txt File Look Like?

Below is an example of a basic robots.txt file template.

User-agent: [bot identifier]
[directive 1]
[directive 2]
[directive ...]

User-agent: [another bot identifier]
[directive 1]
[directive 2]
[directive ...]

Sitemap: [URL of Sitemap]

A robots.txt file generally consists of one or more rules, where each rule blocks (or allows) access for a given crawler to a specified file path on that website.

Here’s an example of a simple (but common) robots.txt file:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml

Want to find your robots.txt file?

Simply navigate to the following URL in your browser to access your robots.txt: yourdomain.com/robots.txt.

It will look something like this:

User-Agents

This specifies which robots should follow the rules outlined in the robots.txt file. Each search engine has its own user-agent and you can set custom instructions for each of these in your robots.txt file.

Here are some user-agents that are useful for SEO:

  • Google: Googlebot
  • Google Images: Googlebot-Image
  • Bing: Bingbot
  • Yahoo: Slurp
  • Baidu: Baiduspider
  • Yandex: YandexBot

You can also use the asterisk (*) wildcard to assign directives to all user-agents.

So for example, if you wanted to block all user-agents except for Googlebot and Bingbot, here’s how you would do it:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
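One rough way to sanity-check rules like these before deploying them is to feed them to a parser and ask which crawlers may fetch a URL. The sketch below uses Python’s standard-library urllib.robotparser. The bot name SomeOtherBot and the URL are made up for illustration, and this parser’s matching is simpler than Googlebot’s (it does not understand wildcards, for instance), so treat the result as indicative only.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above: block everyone except Googlebot and Bingbot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
"""


def allowed_bots(robots_txt, url, bots):
    """Return a {bot name: may fetch url?} mapping for the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


# "SomeOtherBot" is a hypothetical crawler that only matches the * group.
result = allowed_bots(
    ROBOTS_TXT,
    "https://example.com/page-1/",
    ["Googlebot", "Bingbot", "SomeOtherBot"],
)
print(result)
```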

Directives

Directives outline the rules that you want the user-agents that you declared above to follow.

Below are the most common directives that Google currently supports.

Disallow

This directive instructs spiders not to access certain files/resources on your website. In our simple example from above, we are telling the search engine bots not to crawl the /wp-admin/ folders (because this is where a lot of important files are stored for WordPress websites). You can add as many of these directives as you please.

Allow

This directive should be used to allow search engine bots to access a page or subdirectory. In our example, we are telling the spiders that, despite the /wp-admin/ folder being disallowed, they are still allowed to crawl the /wp-admin/admin-ajax.php file. You can add as many of these directives as you please.

Sitemap

This tells the search engine bots where they can find your XML sitemap(s). It’s worth noting that this is one of the most commonly forgotten lines from the robots.txt file. If you’ve already submitted your sitemap(s) through Google Search Console, then including the location on the robots.txt is redundant, though it’s still important to use this directive as it tells other search engines like Bing where to find your sitemap.

You can find a full list of directives for robots.txt files supported by Google here.

How to Create a robots.txt File

To create a robots.txt file, simply open a blank .txt document (text file) and begin typing directives.

Remember to name your file robots.txt and save it in the root directory of the subdomain to which it applies – otherwise it will not be found.

You can also use a robots.txt generator to help eliminate any syntax errors which may prove to be catastrophic to your SEO. However, one downside is that there will be limitations to how customisable you can make it.

For more detailed instructions on how to create your robots.txt file, head here.

robots.txt Common Pitfalls and Best Practices

Here are some of the most common mistakes that we tend to see when it comes to incorrect robots.txt files:

1. Blocking Googlebot – ensure that your robots.txt does not have any of the following:

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /

This may seem trivial, but both of these are telling Googlebot not to access your entire website. To fix this, simply remove them.

2. Each directive should be on a new line – To avoid potentially confusing search engine bots, ensure that each directive is on its own line.

Good example:

User-agent: * 
Disallow: /directory-1/ 
Allow: /directory-2/

Bad example:

User-agent: * Disallow: /directory-1/ Allow: /directory-2/

3. Simplify instructions by using wildcards – wildcards can be used to apply directives to user-agents as well as match URL patterns when declaring these directives. For instance, if you want to prevent search engines from accessing product category pages that have parameters on your website, you could use something like this:

User-agent: * 
Disallow: /products/*?

This would block any URL under /products/ that contains a “?”.

4. Use the “$” wildcard to specify the end of a URL – If you wanted bots not to access PDF files on your website for example, this qualifier is extremely useful. For example:

User-agent: * 
Disallow: /*.pdf$

In the above example, search engines are blocked from accessing any URLs that end with .pdf. Note that URLs like /file.pdf?id=123 can still be accessed because they don’t end with “.pdf”.
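To see how the “*” and “$” wildcards behave, it can help to translate a pattern into a regular expression and try it against a few paths. The sketch below is a simplified illustration of that idea, not Google’s actual matcher: the function names and sample paths are made up, and it ignores details such as Allow/Disallow precedence.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    "*" matches any run of characters; a trailing "$" anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where a "*" stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    """True if a Disallow rule with this pattern would match the path."""
    return robots_pattern_to_regex(pattern).match(path) is not None


print(is_blocked("/*.pdf$", "/files/guide.pdf"))
print(is_blocked("/*.pdf$", "/files/guide.pdf?id=123"))
print(is_blocked("/products/*?", "/products/shoes?colour=red"))
print(is_blocked("/products/*?", "/products/shoes"))
```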

5. Make your robots.txt human-friendly – although robots.txt files are primarily for robots, it’s important to ensure that they’re also easy to understand for humans. Adding comments helps describe and explain your robots.txt file to developers. Start the line with a hash (#) to create a comment.

# This instructs all bots not to crawl the /wp-admin/ subdirectory.
User-agent: *
Disallow: /wp-admin/

6. Test your robots.txt file – in order to ensure that there aren’t any issues with your robots.txt file, use Google’s built-in tester. This allows you to submit URLs to the tool and see whether Googlebot would be blocked from or allowed to access them.

Meta Robots

What are Meta Robots?

An extension of the robots directives, the meta robots tags (sometimes referred to as just meta tags) are HTML snippets that can be used to specify your crawl preferences. Meta directives instruct crawlers on how to crawl and index information they find on a specific web page.

The meta robots tags should be placed into the <head> section of a web page.

Why Are Meta Robots Tags Important For SEO?

Meta robots tags help prevent web pages from showing up in the SERPs, for example:

  • Pages with thin content that add little to no value to the user
  • Pages that are in your development/staging environment.
  • Administrative pages such as login pages, thank you pages etc.
  • Search results from your internal search feature
  • Pages that contain duplicate content

As a result, meta robots tags enable Google (and other search engines) to efficiently crawl and index your pages whilst preserving precious crawl budget.

Meta Robots Syntax

Regardless of whether or not you specify a preference, by default all web pages are set to “index,follow” – this means that all pages will be crawled and indexed by Google unless told otherwise. Therefore, adding the following tag to your pages will have no effect:

<meta name="robots" content="index, follow">

However, if you want to stop a page from being indexed by Google (or other search engines), then use the following syntax.

<meta name="robots" content="noindex">

If you wanted to only prevent Googlebot from indexing a page, use:

<meta name="googlebot" content="noindex">

Meta Robots Parameters

Below are the parameters that you can use in your meta robots tag.

  • index: this instructs the bots to index the page so that it appears in the search results.
  • noindex: this instructs the bots not to index the page and prevents it from appearing in the search results.
  • follow: this instructs the bots to crawl the links on the page, and signals that you also vouch for them.
  • nofollow: this instructs the bots not to crawl the links on the page but note that this does not prevent those linked pages from being indexed, especially if they have other links pointing to them.

These parameters can be combined in the following ways:

<meta name="robots" content="noindex, nofollow">

– tells Googlebot not to index the page and not to follow the links on this page.

<meta name="robots" content="index, follow">

– tells Googlebot to index the page and to follow the links on this page.

<meta name="robots" content="noindex, follow">

– tells Googlebot not to index the page but to follow the links on this page.

<meta name="robots" content="index, nofollow">

– tells Googlebot to index the page but not to follow the links on this page.

Other common parameters include:

  • none: this behaves the same as combining noindex and nofollow, but should be avoided because other search engines like Bing don’t support this.
  • all: this is the equivalent to the default value of “index, follow”.
  • noimageindex: blocks Google from indexing any of the images that appear on the web page.
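If you want to audit which parameters a page actually declares, a small script can pull them out of the HTML. The following is a minimal sketch using Python’s standard-library html.parser. The class and function names are illustrative, and the is_indexable rule (treat a page as not indexable if it declares noindex or none) is a simplification of how a search engine would interpret the tag.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the parameters of any robots/googlebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in {"robots", "googlebot"}:
            content = attr.get("content") or ""
            # Parameters are comma-separated and case-insensitive.
            self.directives.update(
                token.strip().lower() for token in content.split(",") if token.strip()
            )


def robots_directives(html):
    """Return the set of meta robots parameters found in the HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


def is_indexable(directives):
    """Simplified rule: noindex (or none) means the page should not be indexed."""
    return not ({"noindex", "none"} & directives)


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
found = robots_directives(page)
print(sorted(found), is_indexable(found))
```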

Meta Robots Common Pitfalls and Best Practices

1. Use commas to separate parameters – you can combine any number of the parameters that you want by separating them with commas.

<meta name="robots" content="noindex, nofollow">

2. Conflicting Parameters – if the parameters conflict, then Google will simply use the most restrictive directive. In the example below, this would be the “noindex” parameter.

<meta name="robots" content="noindex, index">

3. Only use meta robots tags if you want to restrict how a page is indexed or how its links are followed – as mentioned previously, by default all pages are treated as “index, follow” unless specified otherwise, so you only need to add this tag to pages that you do not want Googlebot to index.

4. Why is my page not being indexed? – A common reason why a page may not be indexed by Google is because it may be marked as “noindex” in the meta robots tag.

5. Noindexed pages being blocked by robots.txt. – this prevents crawlers from seeing the noindex robots tag which means that the page might still be indexed.

X-Robots Tag

The robots meta tag is great for implementing noindex directives on HTML pages, but if you want to block search engines from indexing other resources such as images or PDFs, then the X-Robots tag is a powerful way to do so.

What is the X-Robots Tag?

The X-Robots tag can be included as an element of the HTTP header response for a particular URL to control the indexing of the page as a whole, as well as specific elements of the page like images and/or PDFs.

Here’s an example of what it could look like:

Note that the same directives for meta robots tags can be used for the x-robots-tag.

How to Use the X-Robots Tag

In order to use the X-Robots tag, you will need access to either your site’s header.php, .htaccess, or server configuration file.

Here’s an example of the code that you may use to block off a page with PHP:

header("X-Robots-Tag: noindex", true);

This method is recommended for blocking specific pages.

Here’s an example of the code that you may use for blocking off .doc and .pdf files from the SERPs without having to specify every PDF in your robots.txt file via your .htaccess (or httpd.conf file). This example is for websites that use Apache, the most widely used server-type.

<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

This method is recommended for blocking specific file types.
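The same file-type rule can be expressed in application code if you do not manage the server configuration directly. The sketch below is a Python stand-in for the Apache rule above, not a drop-in replacement: the function name x_robots_header is hypothetical, and how you attach the returned header to a response depends on your framework.

```python
import re

# Mirrors the FilesMatch idea: any path ending in .doc or .pdf.
BLOCKED_FILE_TYPES = re.compile(r"\.(doc|pdf)$")


def x_robots_header(path):
    """Return the (name, value) header to add for this path, or None."""
    if BLOCKED_FILE_TYPES.search(path):
        return ("X-Robots-Tag", "noindex, noarchive, nosnippet")
    return None


for path in ["/downloads/brochure.pdf", "/downloads/notes.doc", "/blog/post-1/"]:
    print(path, x_robots_header(path))
```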

When Should You Use the X-Robots-Tag?

Although meta robots tags are simple to add (it’s just a matter of adding a simple HTML snippet, after all), they aren’t always the best option.

Here are a few cases for why you might employ the x-robots-tag instead of the meta robots tag:

1. Non-HTML files – Using the X-Robots tag is the only way to control the indexation of non-HTML content like flash, videos, images and PDFs.

2. Blocking indexation of a particular subdomain or subdirectory at once – bulk editing the meta robots tag for each page on a subdomain or subdirectory is a laborious task and is much easier to do with the x-robots-tag. This is because the x-robots-tag allows for the use of regular expressions, which enables a higher level of flexibility as you can match HTTP header modifications to URLs and/or file names.

XML Sitemaps

You wouldn’t explore a new place without a map (or Google maps), right?

Well, just like that, it can sometimes be difficult for Google to find all of the pages (that you want to be discovered) on your website without the use of a sitemap.

An XML sitemap to be precise.

What is an XML Sitemap?

An XML sitemap (or just sitemap) is an XML file which maps out the most important content on your website. Search engines like Google use this file to crawl your website so it’s recommended that you include any page or file that you want to be found by their crawlers within your sitemap.

Here’s the XML sitemap for SUSO Digital.

As you can see, apart from detailing which resources we’d like Googlebot to discover and crawl, the sitemap also provides useful information about the files such as the last updated date of a web page. When this date changes, Google knows that new content is available and aims to crawl and index it.

You can also detail how often web pages have been changed as well as if there are any alternate versions of the page i.e. if your pages are in different languages.

It’s important to note that XML sitemaps can’t be larger than 50MB in size and that you can list up to, but no more than 50,000 URLs.

If you have a very large website, you will need to create more than one sitemap.

Types of Sitemaps

Apart from the general XML sitemap which we described above, there are three other types of sitemaps that you can upload to Google:

Video Sitemap

This sitemap is designed to specifically help Google understand the video content that is hosted on your web pages. Here you can add details about the video running time, category, and age-appropriateness rating for each video entry.

Here’s an example of what a Video sitemap looks like:

 

You can find Google’s guidelines on how to create and upload a video sitemap here.

Google News Sitemap

This sitemap is designed to help Google find and understand articles on websites for Google News. Google News crawls these sitemaps as often as it crawls the rest of your site which means that you can add new articles to this sitemap as they’re published.

Here’s an example of what a Google News sitemap looks like:

You can find Google’s guidelines on how to create and upload a Google News sitemap here.

Image Sitemap

The image sitemap helps Google find all of the images that you have hosted on your website. On top of this, you can provide additional information about the images such as the subject matter type and licence for each image entry.

Here’s an example of what an Image sitemap looks like:

You can find Google’s guidelines on how to create and upload an image sitemap here.

Why Are Sitemaps Important?

Well, we know sitemaps help Google access the most important (and up to date) content on your website, but they also help in several other ways.

For example, sitemaps become especially important if a particular page (that is of value to you) doesn’t have any internal links pointing towards it and may not otherwise be discovered by the search engine spiders. These are referred to as orphaned pages.

Likewise, if you have a large website (with 500 pages or more), sitemaps prove to be a great way for the search engine (and you) to understand the structure of your website.

It’s worth noting that Google says: “If your site’s pages are properly linked, our web crawlers can usually discover most of your site.”

This means that not every site will NEED a sitemap, but having one won’t hinder your SEO performance so it doesn’t hurt to use them.

Breaking Down An XML Sitemap

XML sitemaps are designed and formatted in a way that’s easy for computers (search engines) to understand. The language used is XML (which stands for Extensible Markup Language).

Here’s an example of a (simple) XML sitemap.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>https://example.com/</loc>
		<lastmod>2020-02-22T11:02:50+04:00</lastmod>
	</url>
	<url>
		<loc>https://example.com/page-1/</loc>
		<lastmod>2020-03-23T12:56:14+02:00</lastmod>
	</url>
</urlset>

Let’s break down the various elements and dive into the details!

XML Header Declaration

<?xml version="1.0" encoding="UTF-8"?>

The XML header defines how the contents of the XML file are structured as well as the character encoding. It basically tells the search engines that they’re looking at an XML file, and that the version of XML being used is 1.0 with the encoding UTF-8.

URL Set Definition

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

The URL set encapsulates all of the URLs that are outlined in the sitemap and informs the search engine crawlers which version of the XML Sitemap standard is being used. The Sitemap 0.90 standard is the most common specification and is supported by Google, Yahoo! and Microsoft.

Note that the URL Set definition should also be closed at the bottom of the sitemap as follows:

</urlset>

URL Definition

<url>
	<loc>https://example.com/</loc>
	<lastmod>2020-02-22T11:02:50+04:00</lastmod>
</url>

This will be the meat of your sitemap(s) – the important part!

The <url> identifier serves as the parent tag for each of the URLs that you list in your sitemap. Here, you have to specify the location of the URL within <loc> tags.

Note that these must be absolute, canonical URLs rather than relative ones.

The <loc> tag is compulsory, but there are a few additional optional properties that you may wish to include:

  • <lastmod> – this specifies the date that the file was last modified. Note that the date must be kept in the W3C Datetime format, i.e. if a page was updated on 22nd April 2020, the value should be: 2020-04-22. You may also wish to include the time, but this is optional. Google’s Gary Illyes states that this attribute is mostly ignored because “webmasters are doing a horrible job keeping it accurate”.
  • <priority> – we mentioned that sitemaps allow you to tell search engines which pages are the most important, you can do this by assigning a priority score between 0.0 and 1.0 (where 0.0 is low priority and 1.0 is very high priority) to each URL. This tells the crawlers how important the URL is in relation to others on your site.

We should note that Google describes this score as a “bag of noise” and actually ignores it, as stated by Gary Illyes in a tweet; so in practice, it isn’t very important for SEO.

  • <changefreq> – this tag specifies how often the contents of the URL is likely to change. The possible values are: always, hourly, daily, weekly, monthly, yearly, never. But again, Google states that this is no longer as important and “doesn’t play as much of a role” anymore.
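Putting the elements above together, the following sketch builds a small sitemap with Python’s standard-library xml.etree.ElementTree. It is one possible approach rather than a recommended tool: the build_sitemap name and the page list are made up for illustration, and it only emits <loc> and <lastmod> (as a date in W3C Datetime format), leaving out <priority> and <changefreq>.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (absolute URL, last modified date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # date.isoformat() gives YYYY-MM-DD, a valid W3C Datetime value.
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages for illustration.
sitemap_xml = build_sitemap(
    [
        ("https://example.com/", date(2020, 2, 22)),
        ("https://example.com/page-1/", date(2020, 4, 22)),
    ]
)
print(sitemap_xml)
```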

Creating Your XML Sitemap

There are several different ways that you can create your XML sitemap. There are lots of brilliant guides on how to create an XML sitemap already, so here, we’ll simply highlight the various methods and provide links to those great guides.

Manually

For small websites with very few web pages you can create your sitemap relatively easily using the format that we outlined above. All you need to do is open up a text editor and start typing away – just remember to follow the above guidelines and to save your document as an XML file (i.e. with the extension .xml) otherwise Google won’t be able to read it.

Automatically Generated Sitemaps

If you still want to automate the process (because let’s face it, why wouldn’t you?) or your website is simply too large to manually create a sitemap, then the following tools are great:

  • ScreamingFrog XML Sitemap Generator – you can still get a free version of ScreamingFrog (a fabulous website crawler and log analysis tool we’ve been using for years now) here. (SUSO approved)
  • WordPress – if your site is in WP, you can use the following plugins to automatically generate your sitemap.

     

  • Wix / Squarespace / Shopify – if your website is built on any of these platforms, your sitemap will automatically be generated for you – you can find your sitemap by accessing: yoursite.com/sitemap.xml
  • XML-Sitemaps.com – simply enter your website’s URL into this site and it’ll generate a sitemap for you!

Our recommended choices would be ScreamingFrog (or for WP sites, the Yoast Plugin). This is because some of the other plugins/tools include non-canonical URLs, noindexed pages, and redirects which is not good for SEO.

Regardless of which method you pick, it’s vital that you go through the sitemap yourself to ensure that there aren’t any glaringly obvious omissions or additions.

Submitting Your Sitemap to Google

Once you’ve created (and checked) your sitemap and have hosted it on your website (at the root folder), the next step is to make it accessible to Google and other search engines, but before you do that, you need to know where your sitemap is!

Your sitemap can be found at: yoursite.com/sitemap.xml, though this may vary depending on the CMS you’re using.

Now that you’ve got the URL for your sitemap, submitting it to Google couldn’t be easier!

Simply go to Google Search Console > Sitemaps > paste/type in your sitemap location (i.e. “sitemap.xml”) > click “Submit”

Tip: You should also specify the path to your sitemap in your robots.txt file by using the sitemap directive.

For example:

Sitemap: http://www.example.com/sitemap.xml

…and if you have more than one sitemap, simply add multiple lines:

Sitemap: http://www.example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap-2.xml
Sitemap: http://www.example.com/sitemap-3.xml

XML Sitemaps Best Practices: A Checklist

Here’s a quick checklist of best practices to ensure your sitemaps are the best they can possibly be:

1. Are all of your pages indexable? – Only pages that are indexable should be listed in your sitemap. This means that any URLs that redirect or are missing (return 404 errors) should be omitted from the sitemap. Also ensure that none of the listed pages are blocked from being indexed by meta robots tags or x-robots-tags, or canonicalised to a different URL.

2. Stick to the default location and filename – to make it as easy as possible for Google to find your sitemap(s), ensure that you stick to the default location (yoursite.com/sitemap.xml) and that the filename is sitemap.xml.

3. Have you referenced your sitemap in the robots.txt? – Especially if your sitemap doesn’t follow the default URL path or filename, the best way to ensure Google finds it is to reference it in your robots.txt file, but this should be done regardless.

4. Is your sitemap within limits? – Ensure your sitemap contains no more than 50,000 URLs and that the file is limited to 50MB. If you exceed either of these limitations, you will need to create another sitemap.
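For the URL limit in particular, splitting a long list of URLs into sitemap-sized batches is straightforward to script. The sketch below only handles the 50,000 URL limit. It does not check the 50MB file size limit, and the function name and generated URLs are purely illustrative.

```python
MAX_URLS_PER_SITEMAP = 50_000


def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into batches that each fit in one sitemap."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


# A made-up site with 120,000 pages needs three sitemaps.
urls = [f"https://example.com/page-{n}/" for n in range(120_000)]
batches = split_into_sitemaps(urls)
print([len(batch) for batch in batches])  # [50000, 50000, 20000]
```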

Canonicals

Canonical tags have been an essential part of SEO for over a decade now. They were created by Google, Microsoft and Yahoo in 2009 with the aim of giving webmasters an easy way to solve duplicate content issues.

What is a Canonical Tag?

A canonical tag is a snippet of HTML code that allows you to tell search engines which page (or URL) represents the master copy for duplicate, near-duplicate or similar web pages. So, if you have multiple pages that contain the same or very similar content, you can specify which of these Google should treat as the main version or most important and thus, should index.

Google will then understand that the duplicated pages refer to the canonicalised version. On top of this, additional URL properties like the PageRank as well as other related signals are transferred over to the canonicalised page too.

The canonical tag is represented by the following HTML syntax:

rel="canonical"

Here’s what a canonical tag looks like in practice.

<link rel="canonical" href="https://example.com/hellow-words-page/" />

Let’s break it down:

1. link rel="canonical" – this snippet of code tells the search engine that we want the link in this tag to be treated as the main (canonical) version.

2. href="https://example.com/hellow-words-page/" – this tells the search engine the location of the canonical version of the page.

Why are Canonicals Important for SEO?

The main reason why canonicals are important for SEO is that they solve the issue of duplicate content.

Google wants to provide users with the best experience possible, which means that diversity in the SERPs is important. With that in mind, duplicate content is something that Google (and other search engines) aren’t keen on at all.

This is because when search engines crawl multiple pages with the same or very similar content, it can cause the following SEO problems:

  • Google may not be able to identify which version of a page should be indexed and therefore should rank. Picking the wrong URL would hurt your ranking ability considerably.
  • Google may not be able to identify relevant queries that a page should rank for.
  • Google may be unsure as to whether the link authority should be split between the different versions of the page, or just one of the pages.
  • Having multiple pages wastes crawl budget which may prevent Google from crawling other important pages that actually contain unique content. Canonical pages are crawled more frequently whereas duplicates are crawled less frequently – so adding a canonical tag will reduce the crawling load on your website.

Using canonical tags will solve all these problems because it puts you in control of any duplicate content that may be present on your site – whether intentional or not.

What If You Don’t Specify A Canonical?

If you have lots of duplicate content and fail to specify a canonical URL, Google will decide for you:

“If you don’t explicitly tell Google which URL is canonical, Google will make the choice for you, or might consider them both of equal weight, which might lead to unwanted behavior”

During the indexing process, Googlebot tries to determine the primary content of each web page. But if the crawler finds multiple pages with the same content, it chooses the page that it feels is the most complete and automatically treats it as the canonical.

How Does Google Choose the Canonical Page?

Below are a handful of factors that Google takes into consideration when deciding which page should be treated as the canonical:

  • Whether the page is served via HTTP or HTTPS
  • The quality of the page
  • Whether the page is listed in a sitemap
  • Whether the canonical tag is present

Google explains how the above techniques can be used to show Google your preferred page, but that Google still “may choose a different page as canonical than you do”. This is because the above techniques are hints as opposed to strict directives.

Do I Have Duplicate Content?

Intentionally? Probably not.

Unintentionally? Maybe.

It’s very unlikely that you’ve been actively publishing the same content on multiple pages, but it’s important to remember that search engines crawl URLs, not web pages.

Let’s look at an example of how search crawlers may discover your homepage:

  • http://www.example.com
  • http://example.com
  • https://www.example.com
  • https://example.com
  • http://example.com/index.php
  • http://example.com/index.php?

… and so on.

Likewise, the URLs…

  • example.com/product
  • example.com/product?price=asc

… are treated as unique URLs despite the fact that the web page (and content) is the same or very similar.

These kinds of URLs are called parameterised URLs and are extremely common on eCommerce websites that have faceted or filtered navigation (i.e. filtering products by size, colour, availability, most popular, price etc).

For example, Flannels are an online clothing store.

Here’s the URL for their category page for men’s shirts: https://www.flannels.com/men/clothing/shirts

Now, let’s see how the URL changes if we filter for only green shirts:

https://www.flannels.com/men/clothing/shirts#dcp=1&dppp=100&OrderBy=rank&Filter=ACOL%5EGreen

And how about green shirts that are only available in size medium?

https://www.flannels.com/men/clothing/shirts#dcp=1&dppp=100&OrderBy=rank&Filter=ACOL%5EGreen%7CACSIZE%5EM

If we then also filter for a price range of £50-£100, yet another parameter is added to the URL:

https://www.flannels.com/men/clothing/shirts#dcp=1&dppp=100&OrderBy=rank&Filter=ACOL%5EGreen%7CACSIZE%5EM%7CAPRI%5E%C2%A350%20to%20%C2%A3100

As you can see, each of these pages contains very similar content, but Google treats them as individual pages.

Whilst this issue largely pertains to eCommerce websites, there are several other common cases of page duplication:

  • Dynamic URLs for search parameters (e.g. example.com?q=green-socks) or session IDs (e.g. https://example.com?sessionid=3)
  • Having unique URLs to support different device types (e.g. m.example.com for mobile)
  • If your CMS automatically creates unique URLs for posts under different categories within your blog (e.g. blog.example.com/dresses/green-dresses/ and blog.example.com/green-things/green-dresses/)
  • Configuring your server to serve the same content for www/non-www and http/https variants (e.g. http://example.com, https://example.com, http://www.example.com, https://www.example.com)
  • Serving the same content on pages with and without a trailing slash (e.g. example.com/dresses/ and example.com/dresses)
  • Serving the same content on pages with capitalised/non-capitalised URLs (e.g. example.com/Dresses and example.com/dresses)
  • Serving pages in multiple languages – Google treats pages in different languages as duplicates only if the main content is in the same language, i.e. “if only the header, footer, and other non-critical text is translated, but the body remains the same, then the pages are considered to be duplicates”
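
To make the duplication cases above more concrete, here is a minimal Python sketch that collapses URL variants into one comparable form and groups the ones that end up identical. The IGNORED_PARAMS set, the choice of https and non-www as the preferred version, and the lowercasing of paths are all assumptions made for illustration; the right rules depend on your own website.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters that only sort, filter or track and do not change the content.
IGNORED_PARAMS = {"price", "sessionid"}


def normalise(url: str) -> str:
    """Collapse common duplicate variants of a full URL into one comparable form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: the non-www version is preferred
    # Assumption: paths differing only by case or trailing slash serve the same page.
    path = parts.path.lower().rstrip("/") or "/"
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS)
    return urlunsplit(("https", host, path, urlencode(kept), ""))


def group_duplicates(urls):
    """Return {normalised URL: [original URLs]} for every group with more than one member."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Running group_duplicates over a crawl export gives a list of candidate groups. Each group still needs a human decision about which URL should be the canonical one.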

Google uses the canonical pages as the main sources to evaluate content and quality so it’s crucial that you use canonical tags if you want to solve any of the duplication problems listed above.

It’s also important to note that if for example you canonicalise the desktop version of a web page, Google may still rank the mobile page if the user is on a mobile device.

Canonicals Common Pitfalls and Best Practices

Here are a few top tips and important points to take note of when using canonical tags to solve duplicate content issues.

Google Recommends Using Absolute URLs

Google’s John Mueller advises you to use absolute URLs instead of relative URLs with the rel="canonical" link element.

For example, Google recommends that you use:

<link rel="canonical" href="https://example.com/hello-world/" />

Instead of:

<link rel="canonical" href="/hello-world/" />

Ensure The Correct Domain Version Is Used

Make sure that if your website is using SSL (i.e. HTTPS) any canonical tags you use do not point to non-SSL URLs as this could lead to further confusion on Google’s part.

Therefore, if your website is on a secure domain, use the following version of your URL:

<link rel="canonical" href="https://example.com/hello-world/" />

Instead of:

<link rel="canonical" href="http://example.com/hello-world/" />

Canonicalise Your Homepage

Considering that one of the most common cases of duplicate content is with the homepage, a quick (but sometimes overlooked) way to solve this issue is to proactively canonicalise your homepage.

Self-Referential Tags Are Recommended

Again, coming from John Mueller, self-referential canonical tags are recommended, though not mandatory. The reason for this is because it makes it clear to Google which page you want to be indexed.

For example, if we wanted to add a self-referential canonical tag to the page: https://example.com/hello-world, then we would simply add the following snippet of code:

<link rel="canonical" href="https://example.com/hello-world" />

If you’re using a custom CMS (content management system), then you may need to ask your web developer to hard code this into the respective pages. However, most CMS’s automatically do this for you.

Google Ignores Multiple Canonicals

This is quite an important one!

Only one canonical tag is allowed per page. If Google encounters multiple rel=canonical tags within your web page’s source code, it will ignore all of them.

Canonical Tags Should Only Appear in the <head>

A common mistake is to include the rel=canonical tag in the <body> section, when in fact, it should go within the <head> of the HTML document. Any canonical tags that Google finds within the <body> are disregarded.

On top of this, Google recommends adding your canonical tag as early as possible within the <head> to avoid any HTML parsing issues.

Don’t Use Noindex to Prevent Canonicalisation

Another common mistake is to use the noindex directive to prevent Google from selecting a canonical page.

Remember, this directive should be used if and only if you do not want Google to index your web page. It should not be used to manage which page Google chooses as the canonical one.

Canonicalised URLs Blocked via robots.txt

Remember that disallowing a URL in your robots.txt prevents Google from crawling it.

This means that any canonical tags used on that page will not be seen by the crawler which in turn may prevent link equity being passed from the non-canonical page(s) to the canonical version.

Methods of Implementing Canonical Tags

Google outlines four different ways that you can implement canonical tags.

rel=canonical HTML Tag

This is the simplest and most common way to specify which page you want Google to treat as the canonical.

All you need to do is add the following snippet to the <head> section of the HTML code:

<link rel="canonical" href="https://example.com/canonical-page/" />

rel=canonical HTTP header

You can also use rel="canonical" within your HTTP headers (as opposed to HTML tags) to indicate the canonical URL for any non-HTML resources such as images, PDF files, videos etc.

For example, if you wanted to add a canonical tag to a PDF file, here’s the line of code that you would need to add to the HTTP Header:

Link: <http://www.example.com/page/file.pdf>; rel="canonical"
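
If you want to check that a file is returning this header correctly, the value can be read with a few lines of Python. This is a simplified sketch: the function name is our own, and the pattern only handles the straightforward form shown above (a URL in angle brackets followed directly by rel="canonical"), not every variation the Link header syntax allows.

```python
import re
from typing import Optional

# Simplified pattern: <URL>; rel="canonical" (quotes around the rel value are optional).
LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="?canonical"?', re.IGNORECASE)


def canonical_from_link_header(value: str) -> Optional[str]:
    """Return the canonical URL found in a Link header value, or None if there is none."""
    match = LINK_PATTERN.search(value)
    return match.group(1) if match else None


header_value = '<http://www.example.com/page/file.pdf>; rel="canonical"'
canonical_url = canonical_from_link_header(header_value)
```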

Use a Sitemap

Google states that you should only include canonicalised pages in your sitemap. This is because all pages that are listed in your sitemap are seen as canonicals by Google.

However, sometimes Google may choose a different URL from the ones you’ve listed in the sitemap:

“We don’t guarantee that we’ll consider the sitemap URLs to be canonical, but it is a simple way of defining canonicals for a large site, and sitemaps are a useful way to tell Google which pages you consider most important on your site”.

Use 301 redirects

301 redirects can be used to divert traffic from duplicated URLs to the canonicalised URL.

For example, say you have the following duplicated versions of your homepage:

https://example.com/home

https://example.com

https://www.example.com/index.php

You should pick the one you want to be treated as the canonical and then use 301 redirects to divert users to your preferred page.

A 301 redirect indicates to users and search engines that the page has permanently moved to a new location.

In this case, we may want to add a redirect from https://example.com/home and https://www.example.com/index.php to the canonicalised page https://example.com.

How to Audit Your Canonical Tags

There are three main questions you need to ask yourself when auditing your canonical tags:

1. Does the page have a canonical tag?

In order to do this, simply open up the source code of your web page and look for rel="canonical".

2. Does the canonical point to the right page?

Make sure that the canonical is pointing to the right web page.

3. Are the pages crawlable and indexable?

Make sure that you haven’t blocked the web pages in your robots.txt file.
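
Parts of this audit can be automated. The sketch below uses Python’s built-in HTML parser to look for canonical tags in a page’s source code and to flag the pitfalls covered earlier (no tag, multiple tags, a tag outside the <head>, a relative URL, a non-HTTPS URL). The class name, function name and messages are our own, and the HTTPS check assumes that your website is served over HTTPS. It does not check whether the canonical points to the right page or whether the page is blocked in robots.txt; those still need to be reviewed separately.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalFinder(HTMLParser):
    """Collects the href of every rel=canonical link, split by where it appears."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.head_canonicals = []
        self.body_canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link":
            attr = dict(attrs)
            rels = (attr.get("rel") or "").lower().split()
            if "canonical" in rels:
                target = self.head_canonicals if self.in_head else self.body_canonicals
                target.append(attr.get("href") or "")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def audit_canonical(html: str) -> list:
    """Return a list of problems found; an empty list means none of these checks failed."""
    finder = CanonicalFinder()
    finder.feed(html)
    problems = []
    if finder.body_canonicals:
        problems.append("canonical tag outside <head>")
    if not finder.head_canonicals:
        problems.append("no canonical tag in <head>")
    elif len(finder.head_canonicals) > 1:
        problems.append("multiple canonical tags")
    else:
        parts = urlsplit(finder.head_canonicals[0])
        if not parts.scheme or not parts.netloc:
            problems.append("canonical URL is not absolute")
        elif parts.scheme != "https":
            problems.append("canonical URL is not HTTPS")  # assumes an HTTPS site
    return problems
```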

HTTP Status Codes

Although they may appear to be trivial to general visitors, HTTP status codes are actually incredibly important for your SEO and should be assessed when auditing your website.

What are HTTP Status Codes?

An HTTP (HyperText Transfer Protocol) status code is a three-digit response sent to the client (i.e. a web browser or search engine bot) from the web server when its request can or cannot be fulfilled.

When you visit a website, your browser starts a dialogue with the web server. This is referred to as a “handshake” within the computer science community.

Here’s how the dialogue goes:

1. The browser sends a request to the site’s web server to retrieve the web page.

For example:

GET /academy/hello-world/ HTTP/2

Let’s break that down:

  • GET: this is the HTTP method which is used to request data from a specified resource (i.e. the web server)
  • /academy/hello-world/: the path of the page that the browser has requested. The host name (example.com) is sent separately alongside the request.
  • HTTP/2: this defines what protocol the browser and server are communicating in.

2. The server responds with a status code, which is sent in the header of its HTTP response. This tells the browser the result of the request i.e. whether the request can be fulfilled or not.

For example, the server may send:

HTTP/2 200 OK

Where:

  • HTTP/2 – describes what protocol to communicate in.
  • 200 OK – the request was successful. This is what you want to see.

HTTP Status Code Classifications

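These status codes are separated into the following five classes based on the different aspects of the handshake between the client and server. The class is given by the first digit of the code:

  • 1xx Informational – the request was received and is still being processed.
  • 2xx Success – the request was received, understood and fulfilled.
  • 3xx Redirection – further action (usually requesting a different URL) is needed to complete the request.
  • 4xx Client Error – the request cannot be fulfilled because of a problem on the client’s side (e.g. the page doesn’t exist).
  • 5xx Server Error – the server failed to fulfil a valid request.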

Fun fact: If you ever try to brew coffee in a teapot, your teapot will probably send you the status code 418: I’m a teapot.
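
As a small illustration, here is a Python sketch that splits a status line such as “HTTP/2 200 OK” into its parts and reports which class the code belongs to, based on its first digit. The function names are our own and the class labels are short descriptions rather than official wording.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def parse_status_line(line: str):
    """Split a status line into (protocol, numeric code, reason phrase)."""
    protocol, code, *reason = line.split(" ", 2)
    return protocol, int(code), (reason[0] if reason else "")


def status_class(code: int) -> str:
    """Return the class of a status code, decided by its first digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


protocol, code, reason = parse_status_line("HTTP/2 200 OK")
summary = f"{code} {reason}: {status_class(code)}"
```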

Why Are HTTP Status Codes Important for SEO?

The main reason why HTTP status codes shouldn’t be overlooked when it comes to SEO, is because web crawlers like Googlebot use them to determine and evaluate a website’s health.

For instance, if your website is regularly sending 5xx error codes to a search engine that is trying to index your content, this may cause several issues that will likely prevent your site from ranking to its potential.

After all, if you want to drive organic traffic to your website (which we highlighted as being one of the main reasons why SEO is important), you need to ensure that search engines are able to crawl your content.

Therefore, in order to be an effective SEO, it’s crucial that you understand the language that is being used between your website and search engines.

Most Important HTTP Status Codes for SEO

There are dozens of HTTP status codes out there, most of which you probably will never have encountered and are outside the scope of SEO.

Therefore, we’ll only highlight the most important HTTP status codes that will have the largest impact on your SEO.

HTTP 200 OK

The ideal status code you want being returned for every web page. No action needs to be taken for pages that return the HTTP 200 OK code. In this scenario, all parties are happy – the server (for providing the requested web page), the browser/search engine (for receiving the requested web page), and of course the visitor! All messages in 2xx mean some sort of success.

HTTP 301 Moved Permanently

A HTTP 301 code is sent to the client when the requested URL has permanently moved to a new location.

This is a code that you will likely use whilst working on your website. For example, for any site migrations or other scenarios where you may need to permanently transfer SEO authority from one web page to another, this is the code to use.

If you do not add a 301 redirect (and a visitor lands on the old page), then the browser will display a 404 error message which is something you want to avoid as it spoils the user’s experience.

On top of this, using a 301 has the added benefit of ensuring that any link authority is passed from the old URL to the new URL.

HTTP 302 Found / Moved Temporarily

A HTTP 302 status code means that the URL that the client has requested has been found, but that it now resides under a different URL.

This is quite an ambiguous code as it doesn’t specify whether this change is temporary or permanent. Therefore, you should use a 302 redirect if and only if you want to temporarily redirect a URL to a different location.

Since 302 redirects are temporary (you’re effectively telling the search engine that you will revert back to the original URL at some point), no relevance or authority signals are passed between the old and new URL.

It’s also important to note that if a 302 redirect is left in place for a long time, then search engines will treat it as a 301 redirect (i.e. it is treated as a permanent redirect).

HTTP 303 See Other

A 303 redirect tells the browser or search engine crawler that the server is redirecting the requested URL to a different URL.

This is mainly useful for preventing users from accidentally re-submitting forms more than once (i.e. when they hit the “back” button in their web browser) because the 303 redirect tells the browser that a follow-up request should be made to the temporary URL.

Google’s Gary Illyes confirmed that 303 redirects do pass popularity signals but that they shouldn’t be used for anything except redirecting forms. This is because, although 303 redirects pass link equity, it takes much longer for this to happen than with a 301 redirect.

HTTP 307 Temporary Redirect / Internal Redirect

A 307 redirect is the equivalent of the 302 redirect for HTTP 1.1. A 307 redirect also lets the client (browser or search engine) know that it must NOT make any changes to the HTTP method of request if redirected to another URL.

HTTP 403 Forbidden

Simply put, a 403 code tells the browser that the user is forbidden from accessing the requested content because they do not have the correct credentials.

HTTP 404 Not Found

The 404 error code is probably the most common status code that you will have encountered when browsing the web; for that reason, it’s also one of the most important ones for SEO.

A 404 Not Found status code is sent by the server when the requested resource cannot be found and has most likely been deleted.

Avoiding 404 Not Found errors is crucial in ensuring that the user’s experience is as smooth as possible.

For example, if you delete a web page, there may still be other pages that link to it. Therefore, opting for a redirect is advised (in most cases) as this way, users who click on these links (or visit the removed page directly) will be redirected to the most relevant page instead of being presented with an error message.

On top of this, having lots of 404 pages on your website may be perceived as poor maintenance by Google, which by extension, may influence your rankings. In this case, a 410 (which we’ll explain next), would be more appropriate as it sends a clearer signal to Google that the page no longer exists.

That being said, in some cases – purposely presenting the user with a 404 page is valid because it will ensure that the page is not repeatedly crawled by the search engine. To create the best possible experience, you should create a custom 404 page, which is one of the suggestions outlined in this article from Google.

HTTP 410 Gone

A 410 status code is an extension of the 404 error in that it indicates that content that has been requested cannot be found. However, the distinction between the two, is that a 410 code is more permanent – you’re telling the client that the requested page has actually been deleted, cannot be found elsewhere and will not come back.

HTTP 500 Internal Server Error

A 500 Internal server error indicates that the server encountered an unexpected error whilst processing the request, but is unable to identify exactly what went wrong.

We mentioned that search engines want to see that websites are being maintained. This means that you should also ensure that HTTP 500 errors are kept to a minimum, because if a crawler or user is unable to access your web page, it won’t be crawled, indexed or ranked.

HTTP 503 Service Unavailable

A 503 error is sent by the server when it is temporarily unable to process the client’s request. This means that whoever is trying to access the server (whether it’s a user or search engine), is essentially told to come back later.

A server may be unavailable for a number of reasons, for example it may be down for maintenance or it may be overloaded because it is receiving more requests than it can handle.

If a 503 error is being sent by the server for a prolonged period of time, Google may choose to remove the content from their index.

Redirects

We briefly touched upon redirects in the HTTP Status Codes section above, but here we’ll explore the theory and practice behind redirects (specifically 301 redirects) in much more detail.

What are Redirects?

If you are changing the structure of your website, deleting pages or even moving from one domain to another, you will undoubtedly have to use redirects to do this. Redirects are a method used to divert visitors and search engines to a different URL to the one that they requested.

Handling redirects correctly is crucial in ensuring that your website doesn’t lose any rankings. Therefore, understanding what the different types of redirects are as well as knowing when to use them is incredibly important.

Types of Redirects

There are two main classifications of redirects: client-side and server-side. For the scope of this textbook, we will only focus on server-side redirects.

A server-side redirect is where the server sends a 3xx HTTP status code to the client (browser or search engine crawler) when a URL is requested.

The most common HTTP status codes that are relevant to SEO are:

  • 301 Moved Permanently (often best for SEO)
  • 302 Found / Moved Temporarily
  • 303 See Other
  • 307 Temporary Redirect

However, for the scope of this course, we’ll only look at 301 redirects because they’re the type of redirect that is widely recommended within the SEO community and by Google.

301 Redirects

What is a 301 Redirect?

A 301 redirect is sent to the client when the requested URL has permanently moved to a new location. The new location is what should be used for any future requests made by a client.

In most cases, the 301 redirect is the recommended method for implementing redirects on a website.

How 301 Redirects Impact SEO

The reason why 301 redirects are the preferred method for most redirection cases is because they are generally understood to pass most (a commonly cited figure is 95-99%) of the original URL’s equity (ranking power) to the new URL.

A user won’t be able to tell the difference between a 301 redirect and 302 redirect (after all both redirect the user to the new page), but to search engines, the two are completely different.

You should be careful when implementing 301 redirects because if you later decide to remove the 301 redirect, it may take weeks for Google to recrawl and reindex the URL, not to mention, the rankings for your old URL may have been lost too.

The bottom line: once you’ve implemented a 301 redirect, there’s no going back.

When To Use a 301 Redirect

The circumstances where 301 redirects are particularly useful and effective are when:

  • You want to change the URL of a page or subfolder e.g. https://example.com/home to https://example.com
  • You want to move a subdomain to a subfolder e.g. https://blog.example.com to https://example.com/blog/
  • You want to move your website to a new domain.
  • You want to merge two different websites and want to ensure that the links to any outdated or deleted URLs are redirected to the correct (or most relevant) pages.
  • You want to switch from HTTP to HTTPS and/or from www. to non-www. and vice versa.

How to Implement a 301 Redirect

There are several ways to implement a 301 redirect, but the most common method is to edit the .htaccess file which is located in the root folder of your website.

If you are unable to locate this file, you either simply don’t have it, or, your website is hosted on a different web server that isn’t Apache – i.e. it may be hosted on Windows/IIS or Nginx.

Also, if your website is on WordPress, we highly recommend installing this free Redirection plugin.

It makes adding 301 redirects super easy (and you won’t have to worry about editing the .htaccess file)!

Before we dive into how you can go about implementing these various redirects, it’s important we quickly highlight some regular expressions which we will use.

Redirect an Old Page to a New Page

The simplest way to redirect an old page to a new page is…

Redirect 301 /old-page/ /new-page/

… however, by omitting the regular expressions, any URLs that have a UTM query string for instance, would end up as a 404 error, which is something that we don’t want.

Therefore, we recommend the following:

# RedirectMatch is provided by mod_alias, so the RewriteEngine directive is not needed here
RedirectMatch 301 ^/old-page(/?|/.*)$ /new-page/

Here, the use of the regular expression “^” implies that the URL must start with “/old-page” while (/?|/.*)$ indicates that anything that follows “/old-page/” with or without a forward slash “/” must be redirected to /new-page/.

For example, all of the following URLs will be redirected to /new-page/

  • /old-page/
  • /old-page
  • /old-page/page-2
  • /old-page/?sessionid=3

Redirect an Old Directory to a New Directory

If you want to change the structure of an entire subfolder or directory, here’s how you should go about setting up the redirect:

RewriteEngine On
RewriteRule ^old-directory$ /new-directory/ [R=301,NC,L]
RewriteRule ^old-directory/(.*)$ /new-directory/$1 [R=301,NC,L]

The expression “$1” in the second line is a backreference to whatever the group “(.*)” captured, i.e. everything in the URL that follows “/old-directory/”. That captured text is appended to the destination folder, so /old-directory/page-1/ is redirected to /new-directory/page-1/, and /old-directory/subdirectory/ is redirected to /new-directory/subdirectory/.
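
The same capture-and-reuse idea can be sketched in Python to show what the two rules do to a path. The function name is our own. The sketch removes the leading slash before matching because, in an .htaccess file in the root folder, rewrite patterns are matched against the path without it, and it ignores case to mirror the NC flag.

```python
import re

# Mirrors the second rule: capture everything after "old-directory/".
DIRECTORY_RULE = re.compile(r"^old-directory/(.*)$", re.IGNORECASE)


def rewrite(path: str) -> str:
    """Return the redirect target for a path, or the path unchanged if no rule applies."""
    relative = path.lstrip("/")
    if relative.lower() == "old-directory":  # first rule: the bare directory name
        return "/new-directory/"
    match = DIRECTORY_RULE.match(relative)
    if match:
        # match.group(1) plays the role of $1 in the rewrite rule.
        return "/new-directory/" + match.group(1)
    return path
```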

Redirect an Old Domain to a New Domain

If you decide to shift domain names, here are the rules you should use to redirect all of the pages from your old domain, to your new domain.

RewriteEngine On
RewriteCond %{HTTP_HOST} ^old-domain\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.old-domain\.com$ [NC]
RewriteRule ^(.*)$ https://www.new-domain.com/$1 [R=301,L]

Here, we have accounted for both the “www” and “non-www” versions of the URLs. This is because we want to pass any precious link authority that may be coming from internal links pointing to any of these versions of the page.

Redirect From www to non-www Page

If you simply want to redirect all www pages to non-www, use the following rules:

RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://example.com/$1 [L,R=301,NC]

Redirect From non-www to www Page

Likewise, for non-www to www pages, use:

RewriteEngine on
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301,NC]

Redirect From HTTP to HTTPS

Google encourages webmasters to use SSL (for obvious reasons which we’ll dive into later), so migrating websites from HTTP to HTTPS is another extremely common reason for implementing a 301 redirect.

To force an HTTPS redirect, use the following rewrite rule:

RewriteEngine On
# Redirect unless the request is already HTTPS on the www host (this avoids a redirect loop)
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\.yourwebsite\.com$ [NC]
RewriteRule ^(.*)$ https://www.yourwebsite.com/$1 [L,R=301]

If you wanted to, you could simply use the above to combine www or non-www version redirects into a HTTPS redirect rule.
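
The decision that a combined HTTPS and www redirect has to make can be sketched in a few lines of Python. This is an illustration rather than something you would deploy: PREFERRED_HOST and the function name are our own, and we assume that https://www.yourwebsite.com is the version you want to keep. The important detail is that a URL which is already on HTTPS and on the preferred host must not be redirected again, otherwise the redirect would loop.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.yourwebsite.com"  # assumption: the www version is the preferred one


def redirect_target(url: str) -> Optional[str]:
    """Return the URL to 301 redirect to, or None if the URL is already the preferred version."""
    parts = urlsplit(url)
    if parts.scheme == "https" and parts.netloc.lower() == PREFERRED_HOST:
        return None  # already correct: redirecting again would cause a loop
    # Keep the path and query string so that visitors land on the equivalent page.
    return urlunsplit(("https", PREFERRED_HOST, parts.path, parts.query, ""))
```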

301 Redirects Common Pitfalls and Best Practices

Implementing redirects correctly is crucial to ensuring that your SEO performance is not hindered. Here are some common pitfalls to avoid and best practices to follow.

Don’t Redirect 404 Pages to the Homepage

If you have too many pages that return a 404 status code, you should avoid redirecting them to the homepage.

In fact, Google’s John Mueller confirmed that Google treats these as “soft 404’s” as it “confuses” users.

Instead, it is recommended that you create a “better” 404 page.

Alternatively (and this is preferred), you should aim to redirect pages that currently return 404 errors to the most relevant page possible (i.e. a page where the content is equivalent to the old page). Failing to do this not only spoils the user’s experience, but you will also lose the PageRank (authority) of that page if you don’t use a 301 redirect.

Avoid (and Fix) Any Redirect Chains

A redirect chain is where there is a series of redirects between the start URL and destination URL.

Google states the following: “While Googlebot and browsers can follow a “chain” of multiple redirects (e.g., Page 1 > Page 2 > Page 3), we advise redirecting to the final destination. If this is not possible, keep the number of redirects in the chain low, ideally no more than 3 and fewer than 5.”

That being said, we would strongly advise against any redirect chains as this seriously spoils the user’s experience.

To find pages that may have multiple redirects, we recommend using this HTTP status checker.

To fix redirect chains:

1. Implement a 301 redirect from the old URL to the destination URL.

2. Replace any internal links that may be pointing to the old URL with the destination URL.
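
If you keep your redirects in a list or spreadsheet, step 1 can be worked out automatically. The Python sketch below assumes a hypothetical dictionary that maps each old URL to the URL it currently redirects to, and returns a flattened version in which every old URL points straight to its final destination. The function names and the max_hops limit are our own.

```python
def final_destination(redirects: dict, url: str, max_hops: int = 10) -> str:
    """Follow the redirects starting at url and return the last URL in the chain."""
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if hops > max_hops:
            raise ValueError("too many redirects (possible loop)")
    return url


def flatten(redirects: dict) -> dict:
    """Return a copy of the mapping where every source redirects directly to its final destination."""
    return {source: final_destination(redirects, source) for source in redirects}


# Page 1 > Page 2 > Page 3 becomes Page 1 > Page 3 and Page 2 > Page 3.
chain = {"/page-1": "/page-2", "/page-2": "/page-3"}
flattened = flatten(chain)
```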

Too Many Redirects

In some cases, you may come across infinite redirects – this usually occurs when a regular expression is incorrect, sending the redirects into an infinite loop.

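When there are too many redirects, the browser stops following them and displays an error message instead of the page (in Chrome, for example, ERR_TOO_MANY_REDIRECTS).

A simple example of an infinite redirect loop is Page 1 > Page 2 > Page 1: each URL redirects to the other, so the destination is never reached.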

To find redirect loops, you can use the same tool.

To fix this issue:

1. Change the HTTP response code to 200 if the URL is not supposed to be redirected.

2. If the URL is supposed to redirect, then remove the loop by fixing the final destination URL (i.e. the one that is causing the loop). Likewise, remove or replace any internal links that may be pointing to the redirecting URL.
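
To see which URL is causing a loop, it helps to list the URLs that form it. Here is a minimal Python sketch, again assuming a hypothetical dictionary that maps each redirecting URL to its target. The function name is our own.

```python
from typing import Optional


def find_redirect_loop(redirects: dict, start: str) -> Optional[list]:
    """Return the URLs that form a loop reachable from start, or None if there is no loop."""
    seen = []
    url = start
    while url in redirects:
        if url in seen:
            # We have come back to a URL we already visited: report the loop.
            return seen[seen.index(url):] + [url]
        seen.append(url)
        url = redirects[url]
    return None


loop = find_redirect_loop({"/page-1": "/page-2", "/page-2": "/page-1"}, "/page-1")
```

Here loop lists the URLs in the order they are visited, so the entry that points back to an earlier URL is the redirect that needs fixing.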

Fix Any Broken Redirects

If your redirect points to a page that returns a 404 or 5xx error, then this is also an issue: it spoils the user experience and, from an SEO standpoint, it means that the page authority of the original page is being wasted on a dead page.

You can check for these kinds of redirects using the same HTTP status checker.

Don’t Use 302 Redirects or Meta Refresh for Permanent Redirects

302 redirects should only be used for temporary redirects (i.e. you’ll revert back to the original at some point). When it comes to meta refreshes (a client-side redirect), Google advises not to use them at all.

If you have any of these types of redirects on your website, you should replace them with 301 redirects.

Pages with a 301 Status Code Shouldn’t Appear in Your Sitemap

Remove any pages with a 301 redirect from your sitemap. Remember, your sitemap points Google in the right direction in regards to which pages it should crawl – therefore, including pages with 301 redirects has no value as they technically don’t exist, so you don’t want Google to crawl them. And of course, having pages like this isn’t going to help with your crawl budget either.

To solve this issue:

1. Use this tool to download the URLs from your sitemap (this can usually be found at: yourdomain.com/sitemap.xml)

2. Paste these URLs into this HTTP status checker (you’ll have to do this in batches of 100 URLs)

3. Filter out any that return a 301 redirect.
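
If you would rather script these steps, the outline below shows the idea in Python. It reads the URLs from the contents of a sitemap file and keeps the ones that return a 301. The function names are our own, and status_of is a hypothetical stand-in for whatever you use to look up each URL’s status code (for example, the export from an HTTP status checker), so no requests are made by the sketch itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list:
    """Return every URL listed in a <loc> element of the sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def redirected_urls(urls, status_of) -> list:
    """Return the URLs whose status code is 301; status_of maps a URL to its status code."""
    return [url for url in urls if status_of(url) == 301]
```

The URLs returned by redirected_urls are the ones to remove from the sitemap (or to replace with their destination URLs).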

HTTP Redirects to HTTPS

All websites should be using SSL – in fact, it’s one of the ranking signals that Google looks at. Therefore, not just for security reasons but for SEO reasons too, it’s important to ensure that all of your HTTP pages are redirected to HTTPS.

To ensure that you’re looking at the HTTPS version of a web page, simply look out for the “lock” symbol in the URL bar at the top of your web browser.

If you type in “http://yourdomain.com”, you should be redirected to “https://yourdomain.com”.

If not, then the redirect is not in place.

Merge Similar Pages

This one’s quite important (and powerful). If you have two pages that are topically related and neither is quite hitting the mark in terms of ranking, a powerful approach is to combine the contents of these pages into a single page, and then implement a 301 redirect from the page which isn’t performing as well, to the one that performs better.

Let’s illustrate this via an example:

We have two pages which highlight the best earphones, but one purely focuses on wireless earphones.

Page 1: https://example.com/best-earphones
Number of Organic Visits per Month: 2000

Page 2: https://example.com/best-wireless-earphones
Number of Organic Visits per Month: 175

Both pages have similar content and are topically related, but only Page 1 is receiving a decent amount of traffic per month.

Therefore, we would simply add the content from Page 2 to Page 1 under a heading like “Best Wireless Earphones”, and then implement a 301 redirect from Page 2 to Page 1.

This way, we aren’t losing any of Page 2’s ranking power and are boosting the potential of Page 1’s organic traffic at the same time.