Advanced SEO Textbook
7

Server Log File Analysis

We look at one of the most overlooked aspects of SEO: server log file analysis. You will learn how log file analysis can offer valuable insights into how search engine bots experience your website over time.

Topic Details
Time: 20
Difficulty Hard

An overlooked aspect of technical SEO (and SEO in general) is server log file analysis (sometimes referred to as log analytics), which offers valuable insights into exactly how search engine spiders have experienced your website over time.

The Fundamentals

What is A Log File?

Every request made to the web server is being recorded (anonymously) inside the log file.

The file contains the following data about requests made to the server:

  • IP Address
  • Timestamp
  • Type of HTTP request (e.g. GET or POST)
  • Requested URL
  • HTTP Status Code of the URL
  • User Agent from the web browser

Log files are typically used for troubleshooting purposes or to audit the technical aspects of the website.

However, for SEO, we are interested in looking at the HTTP requests recorded on the web server by the user agents, specifically Googlebot.

How Does It Work for Internet Users?

Let’s take a look at how this process works for general Internet users.

1. The user visits a website by typing in the URL into their browser and hitting ENTER, i.e. https://www.example.com/page1.html.

2. The URL is broken down into three components:

a. Protocol – i.e. https://

b. Server name – i.e. www.example.com

c. Requested filename – i.e. page1.html

3. The server name is converted into an IP address via the Domain Name System (DNS). The browser then establishes a connection to the host web server where the requested file is located – this is usually via port 443 for HTTPS (or port 80 for plain HTTP).

4. An HTTP GET request is sent by the browser to the web server via the associated protocol, and the HTML (the contents of the requested file) is returned.

5. Each request is logged as an entry in the server log file.
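As a rough illustration of step 2, Python's standard urllib.parse module can split a URL into the same three components. The variable names are our own and are only used for this sketch.

```python
from urllib.parse import urlsplit

url = "https://www.example.com/page1.html"
parts = urlsplit(url)

protocol = parts.scheme                   # the protocol, without "://"
server_name = parts.hostname              # the host that DNS resolves to an IP
requested_file = parts.path.lstrip("/")   # the file requested from the server

print(protocol, server_name, requested_file)
```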

Below is an example of a log entry:

99.65.113.145 - - [30/Apr/2020:10:09:15 -0400] "GET /product/footbal/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

It’s worth highlighting that this example illustrates how the process works for internet users and not search engine crawlers. Remember, web crawlers like Googlebot largely visit and discover pages by following links.
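To make the fields listed earlier concrete, here is a minimal sketch of reading a single log entry with a regular expression. It assumes the common Apache/NGINX "combined" layout; the pattern, the parse_line helper and the optional bytes field are our own illustrative choices, and your server's log format may differ.

```python
import re
from datetime import datetime

# Assumed layout: ip, two identity fields, [timestamp], "request",
# status, optional response size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})'
    r'(?: (?P<bytes>\d+|-))?'
    r' "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["timestamp"] = datetime.strptime(
        entry["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return entry


SAMPLE = (
    '99.65.113.145 - - [30/Apr/2020:10:09:15 -0400] '
    '"GET /product/footbal/ HTTP/1.1" 200 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

entry = parse_line(SAMPLE)
print(entry["ip"], entry["method"], entry["url"], entry["status"])
```

Lines that do not fit the assumed layout return None, so they can be counted and inspected rather than silently misread.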

Why Is Log File Analysis Useful?

Because the log file is a direct record of the requests your server received, you are able to see exactly which resources search engine crawlers have requested on your website.

Log files also help provide deeper insights into the web server dynamic. They allow you to:

  • Inspect any accessibility errors; for example, some 4XX/5XX pages may not be reported in Google Search Console.
  • By extension, fix these accessibility issues quickly and thoroughly, knowing that the data recorded is as accurate as can be.
  • Understand which bots are crawling your website.
  • See how often bots crawl your website.
  • Highlight any pages/directories that perhaps should not be crawled.
  • Identify orphan pages which may still be crawled.
  • Identify any important pages that are not being crawled at all, or that aren’t being crawled as often as they perhaps should be.
  • Identify whether the crawl budget is being wasted and where.

Log File Analysis & Crawl Budget

We highly recommend that you read our chapter on crawl budget, where we go into much more detail about what it is and why it’s important for SEO, but for the scope of this section, here’s a short refresher.

Crawl budget is essentially how many web pages Google crawls when it visits your website and is based on Google’s perception of how authoritative and valuable your website is. Mindful of its processing resources, Google allocates a crawl budget to each site.

From an SEO’s standpoint, this can be problematic if Google is crawling too many pages of low importance (i.e. pages with thin content that offer little value to the user), hitting the “limit” of the crawl budget, and then not finding any new, updated or important content on your website. This is an issue because it means that this important content will not be indexed, which in turn means it will not rank.

Crawl budget analysis is especially important for sites that have many indexed URLs – typically e-commerce sites, directories or large/expansive blogs – and as a result, conserving crawl budget is key in ensuring that your website’s organic search performance continues to grow.

User Agents

Each web browser has its own distinctive user agent. When the browser performs a request and connects to the website’s web server, it sends its user agent to the server as part of that request.

You can see what your user agents are by visiting this website.

For example, the user agent may tell the webserver “Hi, I am Mozilla Firefox on Windows x64” or “Hi, I’m Safari on iPhone/iPad/Mac”.

This is important because web servers serve different web pages to different web browsers and different operating systems. In other words, the webserver sends mobile pages to mobile browsers, modern pages to modern browsers, simpler web pages to old browsers and “please upgrade your browser” to visitors who may be using an out of date browser.

Web crawling bots also have their own user agents – the user agent for Google’s web crawler is “Googlebot/2.1”.

Each bot that visits your website then reads the rules that are defined in your robots.txt file which tells it what pages it can and cannot access.
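As a small sketch of how robots.txt rules apply to a given user agent, Python's standard urllib.robotparser module can evaluate them. The robots.txt content and the URLs here are hypothetical examples, not rules taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /checkout/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent is allowed to request a given URL.
googlebot_checkout = parser.can_fetch("Googlebot", "https://www.example.com/checkout/basket")
googlebot_blog = parser.can_fetch("Googlebot", "https://www.example.com/blog/")
other_bot_admin = parser.can_fetch("SomeOtherBot", "https://www.example.com/admin/")

print(googlebot_checkout, googlebot_blog, other_bot_admin)
```

Comparing results like these with the URLs that actually appear in your logs is one way to spot sections that are being crawled when you did not intend them to be.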

The SUSO Method: Log File Analysis

Log files are generally daunting to look at and process manually, so a log file analysis tool should be used to process the data more efficiently and easily.

There are several tools available that can help you do this. The one we use at SUSO to get some insights fast is Screaming Frog’s Log File Analyser, which allows you to analyse up to 1,000 log entries at a time for free.

They also have a paid version for £99.00 a year, which allows you to upload more log events and create additional projects.

Another great tool that we highly recommend as one of the best (if not THE best) log analysers is JetOctopus Server Log Analyser. This tool has almost no limits and delivers amazing insights.

Additionally, JetOctopus can be integrated with your Google Search Console and Ahrefs data, so while it’s analysing your server logs, it will present to you all URLs that are in the other tools but are visited rarely.

So, if you really want to take your server log analysis game to the next level, JetOctopus is the tool you don’t want to miss!

Enabling Log Archives

Before looking at Screaming Frog, you need to ensure that you have enabled your log archives.

This is most commonly achieved via your web-based hosting control panel (cPanel).

Head over to your cPanel > Raw Access

From here, tick the “Archive logs in your home directory” option and hit Save.

Downloading Log Files

To access and download your log files you can use the following guides:

1. Accessing Apache log files (Linux)

2. Accessing NGINX log files (Linux)

3. Accessing IIS log files (Windows)

Alternatively, you can download them directly from the cPanel where it’s listed under Raw Access.

Please bear in mind that we’re showing you a manual way of approaching the analysis below. If you have a big site (more than 500k URLs) or want to get more insights from your server logs, JetOctopus log analyser is the tool we highly recommend!

Screaming Frog Log File Analyser

Once you have your log file, Screaming Frog’s Log File Analyser will do the rest – let’s take a look at some of the powerful things you can do with this tool.

Importing Your Log File

To import your log file, simply drag and drop it into the interface. Once you’ve done this, the analyser will automatically detect the file format (COM) and sort the log entries into its database. Log files are usually archived as a zip file each month (with a log file for each day), so if you have multiple log files saved in a zip folder, you can drag in the entire folder.

Whilst importing your log file(s), you can have the tool automatically verify the search engine bots it finds, so that requests which only claim to come from a search engine can be told apart from genuine ones. Tick the “Verify Bots” option under the “User Agents” tab when uploading your logs.
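To give a feel for what verifying a bot involves, here is a minimal sketch of the reverse-then-forward DNS check that Google describes for confirming Googlebot. This is our own illustration and not necessarily how the tool implements its check; the function name, the hostname suffixes and the injectable lookup parameters (there so the logic can be tried without network access) are assumptions.

```python
import socket

# Hostname suffixes we assume genuine Googlebot reverse DNS records end with.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Return True if the IP reverse-resolves to a Google host that resolves back to it."""
    try:
        hostname = reverse_lookup(ip)[0]
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must include the original IP, otherwise the
        # reverse DNS record could simply have been faked.
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # Failed lookups are treated as "not verified".
        return False


# Offline demonstration with stand-in resolvers (hypothetical values).
def fake_reverse(ip):
    return ("crawl-66-249-66-1.googlebot.com", [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["66.249.66.1"])


print(is_verified_googlebot("66.249.66.1", fake_reverse, fake_forward))
```

A user agent string alone is easy to spoof, which is why a check along these lines is worth running before drawing conclusions about Googlebot's behaviour.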

If you want to analyse particular domains or paths (e.g. /blog/ or /products/ pages), then you can specify this on the “Include” tab during the import. This is especially useful for large eCommerce or content-heavy websites with lots of products/pages.

Identify Crawled URLs

The tool allows you to view exactly which URLs have been crawled by the various search engine user agents.

Use the “Verification Status” filter so that only entries that are verified are displayed.

Use the “User Agents” filter tab to specify which bots’ entries you would like displayed, e.g. you could choose “all bots” or filter to view just “Googlebots”.

Identify and Analyse Crawl Frequency

By URL

You can identify and analyse the most frequently crawled pages or directories by sorting the “Num Events” field under the “URLs” tab.

This is the number of separate crawl requests made by the user agent in the log file for each URL.

This allows you to:

  • Highlight any pages/directories that offer little value and perhaps should not be crawled as frequently as they are.
  • Identify any important pages that are not being crawled at all.
  • Identify any important pages that aren’t being crawled as often as they perhaps should.
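If you would like to reproduce a count like “Num Events” yourself, the idea can be sketched in a few lines of Python. The events and the list of important pages are made-up sample data, and matching on the string "Googlebot" is a simplification.

```python
from collections import Counter

# Hypothetical parsed log events: (requested URL, user agent).
events = [
    ("/product/footbal/", "Googlebot/2.1"),
    ("/product/footbal/", "Googlebot/2.1"),
    ("/tag/misc/", "Googlebot/2.1"),
    ("/tag/misc/", "Googlebot/2.1"),
    ("/tag/misc/", "Googlebot/2.1"),
    ("/", "Googlebot/2.1"),
    ("/", "bingbot/2.0"),
]

# Hypothetical pages we consider important for the business.
important_pages = {"/", "/product/footbal/", "/blog/new-post/"}

# Number of separate Googlebot requests per URL.
num_events = Counter(url for url, agent in events if "Googlebot" in agent)

most_crawled = num_events.most_common()
never_crawled = sorted(important_pages - set(num_events))

print(most_crawled)
print(never_crawled)
```

In this sample, a low-value tag page attracts the most requests while one important page is never requested at all, which is exactly the kind of imbalance the list above is about.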

By Directory

Head over to the “Directories” tab and sort the “Num Events” to see which sections of your site are being crawled the most/least.
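The same grouping can be approximated by hand by reducing each URL to its top-level directory before counting. This is only a sketch: the top_directory helper and the sample paths are our own, and treating a single segment without a trailing slash as a file in the root is an assumption.

```python
from collections import Counter


def top_directory(path):
    """Reduce a URL path to its first-level directory, e.g. /blog/post/ -> /blog/."""
    path = path.split("?", 1)[0]          # ignore any query string
    parts = path.strip("/").split("/")
    # A single segment with no trailing slash is treated as a file in the root.
    if len(parts) == 1 and not path.endswith("/"):
        return "/"
    return f"/{parts[0]}/" if parts[0] else "/"


# Hypothetical URLs requested by a crawler.
crawled_paths = [
    "/blog/post-1/",
    "/blog/post-2/",
    "/products/shoes?colour=red",
    "/page1.html",
    "/",
]

events_by_directory = Counter(top_directory(p) for p in crawled_paths)
print(events_by_directory.most_common())
```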

By Content Type

Filter by content type in the “URLs” tab to see how frequently other types of content such as images, CSS and JavaScript files are being crawled.

By User-Agent

Filter by user-agent in the “User Agents” tab to compare how different user-agents are crawling your website.

The number of URLs crawled over a given timeframe will indicate how long each search engine takes to crawl the URLs on your website.

Estimate Your Crawl Budget

By looking at how many URLs have been crawled in total during a given period of time (e.g. by day, week or month), you can make an approximation of how long it may take Googlebot to crawl (and re-crawl) your entire website.

The “Overview” tab provides a summary of the total number of events (server requests) over a given period.

This data is incredibly useful as it gives you an insight into what your crawl budget might be.
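The arithmetic behind such an approximation is simple enough to sketch. The figures are invented, the function name is hypothetical, and the estimate assumes the crawler keeps a steady pace and does not spend requests re-crawling the same URLs, so treat the result as a rough order of magnitude only.

```python
import math
from datetime import date


def estimate_days_to_crawl(urls_by_day, site_size):
    """Estimate the days needed to crawl site_size URLs at the observed daily average."""
    daily_counts = [len(urls) for urls in urls_by_day.values()]
    average_per_day = sum(daily_counts) / len(daily_counts)
    return math.ceil(site_size / average_per_day)


# Hypothetical unique URLs requested by Googlebot on each day.
urls_by_day = {
    date(2020, 4, 28): {"/", "/blog/", "/product/footbal/"},
    date(2020, 4, 29): {"/", "/blog/"},
    date(2020, 4, 30): {"/", "/product/footbal/", "/contact/", "/about/"},
}

# Hypothetical total number of URLs on the site.
days = estimate_days_to_crawl(urls_by_day, site_size=120)
print(days)
```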

Find Client & Server Errors

To access URLs which may have returned 3XX, 4XX and 5XX status codes when the search engines accessed your website, head over to the “Response Codes” tab.

Here, you can specify whether you want to view pages that returned a 4XX error code (client errors) or a 5XX code (server error).

Apart from Google Search Console’s “Fetch & Render” tool, only your server log file can provide this level of accuracy when it comes to HTTP response codes encountered by Googlebot (and other user agents).
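Grouping requests by status code class is also easy to sketch outside the tool. The events here are sample data and the helper name is our own.

```python
from collections import defaultdict


def group_by_status_class(events):
    """Group requested URLs under 2XX, 3XX, 4XX or 5XX based on their status code."""
    groups = defaultdict(set)
    for url, status in events:
        groups[f"{status // 100}XX"].add(url)
    return groups


# Hypothetical (URL, HTTP status code) pairs from crawler requests.
events = [
    ("/", 200),
    ("/old-page/", 301),
    ("/missing/", 404),
    ("/missing/", 404),
    ("/api/report", 500),
]

groups = group_by_status_class(events)
print(sorted(groups["4XX"]), sorted(groups["5XX"]))
```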

Identify Large & Slow URLs

Log files also record information about the files being requested, such as their size. Using Screaming Frog, you can see which of these files were particularly large by sorting the URLs by “Average Bytes”. If your logs also record response times, sorting by the average response time shows how long the server took to respond to the search engine’s requests, which allows you to identify potential performance issues with pages that may have taken longer to load.
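As a sketch of what an average size per URL means, the calculation below averages the response sizes seen for each URL and sorts the largest first. The sizes are invented sample values and the function name is hypothetical.

```python
from collections import defaultdict


def average_bytes(events):
    """Return the mean response size in bytes for each requested URL."""
    totals = defaultdict(lambda: [0, 0])   # url -> [total bytes, request count]
    for url, size in events:
        totals[url][0] += size
        totals[url][1] += 1
    return {url: total / count for url, (total, count) in totals.items()}


# Hypothetical (URL, response size in bytes) pairs.
events = [
    ("/", 5000),
    ("/", 7000),
    ("/images/hero.jpg", 480000),
]

averages = average_bytes(events)
largest_first = sorted(averages.items(), key=lambda item: item[1], reverse=True)
print(largest_first)
```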

Find Orphan Pages

Orphan pages are those that have no internal links pointing towards them, but are still accessible by search engines.

By combining your log file(s) with crawl data, you can identify these pages with ease.

Import a crawl by dragging and dropping the file into the “Imported URL Data” tab – this will display the URLs that were found during the crawl.

Using the “View” filters in the “URLs” and “Response Codes” tabs, select the “Not In URL Data” filter to view all of the URLs that were found by the search crawlers in your log file but are not present in the crawl data that you imported.

Apart from highlighting orphaned URLs, this will also identify any old URLs that may have been redirected or URLs that have been linked to incorrectly from external sources, e.g. if an external link had a typo within the target URL.
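Underneath, this comparison is a set difference: URLs that appear in the logs but not in the crawl. A minimal sketch, using invented URL lists and a hypothetical helper name, might look like this:

```python
def not_in_url_data(log_urls, crawl_urls):
    """Return URLs requested in the logs that the site crawl did not find."""
    return sorted(set(log_urls) - set(crawl_urls))


# Hypothetical URLs requested by search engine bots (from the log file).
log_urls = ["/", "/blog/", "/old-landing-page/", "/product/footbal/"]

# Hypothetical URLs found by following internal links (from a site crawl).
crawl_urls = ["/", "/blog/", "/product/footbal/", "/contact/"]

candidates = not_in_url_data(log_urls, crawl_urls)
print(candidates)
```

Each URL in the result is only a candidate: it may be a true orphan page, an old redirected URL or a mistyped external link, so it still needs checking by hand.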

And Much, Much More…

Head here for a more comprehensive look at the many other useful features of Screaming Frog’s Log File Analysis tool.