Guide to Blocking Unwanted Resources on Your Website

Googlebot can index any resource on a site, including resources that offer little value to users and shouldn’t be indexed. An example is a “Find out about the product” form that is generated under a different URL for every product (e.g. one containing a product ID or a timestamp). The result can be as many indexed form-only pages as there are products in the store. Addressing this issue is therefore very important for SEO.

Which pages should be blocked then?

First and foremost, pages that do not contribute any additional value to the site, and pages that repeat content available elsewhere. E-commerce sites, due to their size, are the type of website most exposed to having this kind of unnecessary “rubbish” indexed.

Below is a list of the most common parts of a site that, in most cases, shouldn’t be indexed by Google.

Internal search engine result pages

This type of page is normally used only by visitors looking for a specific term or product, so it has no value for search engine crawlers. Internal search result pages should, in most cases, be excluded from Google’s index.

Session IDs

Although they are not common, you might sometimes notice a session ID (e.g. PHPSESSID=21232f297a57a5a743894a0e4a801fc3) appended to every URL you visit. Googlebot will most probably receive a new session ID every time it crawls the site, so every crawl produces a fresh set of URLs for the same pages. Considering how often Google visits an online store, this can become a major SEO problem if it is not dealt with correctly.
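One way to deal with session IDs is to strip them from URLs before they are linked or used as a canonical address. The sketch below is only an illustration in Python: the function name and the list of parameter names are assumptions, and your platform may use different ones.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session parameter names; adjust to whatever your platform appends.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Rebuild the URL from the remaining parameters only.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Called with https://example.com/category?PHPSESSID=21232f297a57a5a743894a0e4a801fc3&page=2, the function returns https://example.com/category?page=2.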

Empty categories

Empty categories are as dangerous for a site’s rankings as they are frustrating for customers who land on a category page that has no products.

Category filters

Filters can be a very useful feature that helps to target very specific keywords. They can, however, also harm rankings when they create shallow pages (containing only one or just a few products).

Sorting and views

In most online stores you can sort products by name, price, date or popularity. You can also select the sort direction: ascending or descending. Apart from the sort order, nothing changes: the category contains the same products, so crawling the sorted variants only uses up your crawl budget.

A similar situation occurs when you change the product view in a category, for example from a list to a grid.

Pagination

Very often the links to the first page of a paginated category contain an additional parameter (e.g. /category?page=1), creating an obvious duplicate of the main category page (/category). In such a case, pages with this exact parameter shouldn’t be indexed by Google. More precisely, they should indicate which category page they duplicate, so that Google indexes the main page instead of the one with the ‘page=1’ parameter.

Forms

As a rule, any type of form that shoppers have easy access to should be blocked from indexing. Examples include login and user registration, newsletter forms (if the form is on a separate page or its submission leads to the same URL with parameters, e.g. ‘?newsletter=submitted’), refer a friend, add a comment or opinion, etc.

Duplicated text-only pages

A common problem in most online stores is pages that contain the same or very similar text. Such duplicates may be hosted internally or externally. Examples include T&C pages, privacy and cookie policies, and shipping and payment information.

The Solutions

There are several ways Google can be stopped from indexing some parts of the site. Please read on for the list of possible solutions.

robots.txt file

This is the quickest solution: a simple text file, available at the root of your domain (e.g. http://www.google.com/robots.txt), that contains “Allow:” and “Disallow:” directives.

Below is sample code that blocks crawler access to search engine results pages:

User-agent: *
Allow: /
Disallow: /search

Robots.txt is convenient, but it is also a blunt instrument. It blocks crawling, not indexing, so resources that are already indexed might not disappear from the index after being blocked this way. Google is not allowed to even check up on them, so it may show the sentence “A description for this result is not available because of this site’s robots.txt” under that resource.
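When Allow and Disallow rules overlap, as in the sample above, Google documents that the most specific (longest) matching rule wins, and that Allow wins when two matching rules are equally long. The Python sketch below illustrates that precedence for plain path prefixes only. The function name and rule format are assumptions made for this example, and wildcards (* and $) are deliberately not handled.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide whether `path` may be crawled under simple prefix rules.

    `rules` holds (directive, prefix) pairs such as ("Disallow", "/search").
    The longest matching prefix wins; on a tie, Allow wins.
    """
    best_len = -1
    allowed = True  # no matching rule means crawling is allowed
    for directive, prefix in rules:
        # Skip empty rules and rules that do not match this path.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The rules from the sample robots.txt above.
RULES = [("Allow", "/"), ("Disallow", "/search")]
```

With these rules, is_allowed("/search?q=shoes", RULES) is False, while is_allowed("/category/shoes", RULES) is True. Some parsers apply the first matching rule instead, so it is worth checking rule order against the crawlers you care about.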

Robots meta tag: noindex

The noindex meta tag tells search engine bots that the page is not to be indexed, while still allowing the page to be crawled. The meta tag goes in the head section of the page that should not be indexed:

<meta name="robots" content="noindex">

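This solution is also easy to use, but it requires Google to re-crawl the page in order to see the ‘noindex’ directive. For the same reason, the page must not be blocked in robots.txt at the same time, or Google will never see the tag.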
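To check that a template really emits the tag, you can parse the rendered HTML and look for the directive. The following Python sketch uses only the standard library; the class and function names are made up for this illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

For the tag shown above, has_noindex returns True; for a page with content="index, follow" it returns False.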

Google Search Console

Google normally tries to group similar pages within a site into clusters and decide which representative URL should appear in the SERPs. If Google was unable to recognise the duplicated URLs correctly, you had the option of using the URL Parameters tool in Google Search Console to tell it how to handle addresses containing specific parameters. Note that Google retired this tool in 2022.

This option is very helpful if there is a specific parameter we don’t want Google to index. It is, however, quite dangerous if not used properly as you can easily de-index a significant chunk of your website.

301 redirect

Unwanted pages can be redirected to the right address, so (put simply) Google will replace the address in the index. The ‘301 Moved Permanently’ HTTP status is recommended as the most SEO-friendly option.

Remember to avoid unnecessary redirects! If the redirect is triggered by, for example, a wrong URL in the main navigation, you have to fix the link, too.

The redirect can be set in the server’s configuration file (e.g. .htaccess) or in PHP code.
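One common source of unnecessary redirects is a chain, where an old URL redirects to another URL that is itself redirected. If you keep your redirects in a simple mapping, a sketch like the one below can point every source straight at its final destination. This is Python for illustration only; the mapping format and the function name are assumptions.

```python
def flatten_redirects(redirects: dict[str, str]) -> dict[str, str]:
    """Point every source URL straight at its final destination."""
    flattened = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow the chain until we reach a URL that is not redirected itself.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flattened[source] = target
    return flattened
```

For the mapping {"/old": "/newer", "/newer": "/newest"}, both /old and /newer end up redirecting directly to /newest, and a loop such as /a to /b to /a raises an error instead of being written to the server configuration.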

.htaccess example: redirecting all traffic to google.com

RewriteEngine On
RewriteRule .* https://www.google.com/ [R=301,L]

PHP example: redirecting all traffic to google.com

<?php
header("Location: https://www.google.com/", true, 301);
exit;

Please note that the above examples might not work in your environment, as they depend on your server’s specific configuration.

404 header

Every SEO knows the 404 Not Found HTTP status! It is mainly used when a page or resource has been deleted. When Google encounters a 404, it will retry the URL a few times before deleting it from the index. This gives you a chance to fix the resource before it disappears from Google’s index.

Canonical tag

This link tag is used on known duplicated pages to show where the original content is. Here’s an example of a canonical tag placed in the source code of the page http://example.com/copied:

<link rel="canonical" href="https://example.com/original">

The href attribute should point to the page where the content originally lives (the representative URL). For example, if the default sorting direction is ascending and the URL can carry order=asc or order=desc query string parameters, then the https://example.com/category?order=asc URL should include a canonical tag pointing to the main category page, as shown below:

<link rel="canonical" href="https://example.com/category">
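In practice the tag is usually generated from the requested URL rather than written by hand. The Python sketch below drops parameters that only change presentation, and also treats page=1 as the main category page, matching the pagination case described earlier. The parameter names and the function name are assumptions chosen for this example.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed parameter names that change presentation only, not content.
PRESENTATION_PARAMS = {"order", "sort", "view"}


def canonical_tag(url: str) -> str:
    """Build a canonical link tag for a category URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        # Drop sort/view parameters and the redundant first-page marker.
        if key not in PRESENTATION_PARAMS and (key, value) != ("page", "1")
    ]
    canonical = urlunsplit(parts._replace(query=urlencode(kept), fragment=""))
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}">'
```

Both https://example.com/category?order=asc and https://example.com/category?page=1 produce the tag shown above, while a URL with page=2 keeps its page parameter and so points to itself.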

Summary

Leave it to your SEO company to pick the best solution in your particular case.

It’s worth mentioning that Google is not that bad at identifying internal duplication, so it might solve the issue on your behalf and not even bother indexing all this “mess”.

If you decide to take the initiative and deal with the issue on Google’s behalf (to save crawl budget, for example), it is crucial to take many factors into consideration before using any of the above solutions. A wrong implementation might result in inconsistent navigation, loss of link juice (the “power” that passes through links), de-indexing of a significant section of your site, and other problems. If in doubt, always discuss any action with your SEO consultant first.
