How to read the web server log

Introduction

In this guide I would like to explain how to retrieve, read and interpret the web server log with Excel. If you prefer server-side software you can try GoAccess.

First of all, however, let’s see why we should dedicate time to this file. Analyzing the data in the web server log is essential to understand how search engine spiders navigate a website, i.e. the crawl frequency, the crawl volume and the crawl depth.

  • Crawl frequency: how many times a resource is requested by the spider.
  • Crawl volume: the number of website pages requested by the spider.
  • Crawl depth: the number of website navigation levels reached by the spider.

Logs are used for SEO analysis but also for other purposes, such as web analytics; this guide, however, focuses on the aspects useful for search engine optimization.

What the web server log is for

The web server log is a file where the web server records all requests made by external clients (users and bots) for hosted resources (web pages, images, JavaScript files, etc.). Reading the web server log is an activity snubbed and underestimated by many professionals. Personally, I believe it is one of the most important things to do when doing SEO.

We can make all the guesses in the world, but nothing will give us practical data like a log.

By observing the behavior of the spiders it is possible to understand:

  • whether the navigation of the website is optimal or there are pages that are difficult to reach,
  • how long the spider takes to detect newly published content,
  • how interested search engine spiders are in the pages of the website,
  • how much Crawl Budget Google assigns to our website.

In the log we find the accesses of all the User-Agents that have requested resources, but it is particularly interesting to filter only the Google spiders, known as Googlebot.

Googlebot may be more interested in certain sites and less in others, and it may check one page more often than another. Its behavior varies from site to site and from page to page. There are many factors that affect crawl frequency, volume and depth; let's see some of them:

  • Content originality: Googlebot tends to crawl sites with copied or duplicated content less often.
  • Publication frequency: how often new content is published on the site, or existing content is updated, affects the Googlebot crawl frequency.
  • 4xx and 5xx errors: when Googlebot detects navigation errors, it usually makes a series of follow-up requests to check whether the resource has recovered. If it has not, it reduces the crawl rate.
  • The speed of the web server: a fast site can serve more resources to Googlebot than a slow one. Googlebot never wastes time; if one server is not fast enough it moves on to the next.
  • The value of our pages: PageRank is the main factor that determines the depth of Googlebot's crawls, but it is not the only one.
  • On-page links: a page that contains many outbound, themed, quality links is checked by Googlebot quite often.
  • The size of the website: the number of pages published on the site is a factor that leads the spider to crawl the website more deeply.
  • Levels: the number of navigation levels on which the site is organized affects the crawl depth. Google generally doesn't like sites with too many levels.
  • Internal links: the structure of the internal links affects the distribution of PageRank and therefore Googlebot's crawling priorities.
  • The crawl budget: the maximum amount of bandwidth that Google has decided to dedicate to the site.

Where to retrieve the web server log

The logs of the various web servers can be found by default in these folders:

  • Apache: /var/log/apache2/access.log
  • Nginx: /var/log/nginx/access.log
  • Microsoft IIS: C:\Windows\System32\LogFiles

Each web server has its own options for configuring the log file, so it is possible that your logs are stored in different folders and have different names than access.log.

Low-cost or shared hosting services rarely allow access to server logs; Aruba, for example, sells its statistics service separately. With virtual, dedicated and cloud hosting, since the operating system is available to you, it is possible to access the log files.

Example of a line extracted from the log

The web server log file typically looks like a list of lines like this:
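For reference, here is what such a line looks like in the Apache combined log format, reconstructed from the field values discussed later in this guide (the last octet of the IP is anonymized):

    212.209.212.xx - - [29/Jul/2010:00:35:33 -0500] "GET /contacts/ HTTP/1.1" 200 11631 "https://www.evemilano.com/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"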

Note: different web servers produce logs which may vary slightly from each other.

How to open the log file

Although there are many free and paid tools, such as Apache Logs Viewer, Excel is enough to read the log file.

Do you know Screaming Frog? The same developers have created and released a tool to read web server log files.

How to open the web server log file with Excel and read the data

The elements of the web server log

IP address: “212.209.212.xx”. This is the IP address of the machine that contacted the web server. To resolve the IP address and get an idea of what type of machine it is, you can use a reverse DNS lookup or the TRACEROUTE command.
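For example, a reverse DNS lookup can be done from the shell with the host command; the IP below is just a hypothetical Googlebot address used for illustration:

    # Reverse DNS lookup: a genuine Googlebot IP should resolve to a
    # hostname ending in googlebot.com (or google.com)
    host 66.249.66.1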
User name: “- -”. The username is only relevant in cases where the queried resource is password protected.
Timestamp: “[29/Jul/2010:00:35:33 -0500]”. The Timestamp item represents the time when the resource was requested from the web server.
Access request: “GET /contacts/ HTTP/1.1”. This string represents the resource requested from the web server.

  • In this case the request is of type "GET" (meaning "show me the page") for the resource /contacts/ using the "HTTP/1.1" protocol.
  • A request of type "HEAD" reads only the HTTP header of the document and is equivalent to "pinging" the resource to verify that it is still there and has not changed.
  • A "POST" type request is used to send data to the web server.

Result status code: “200”. Status codes are defined by the IETF (RFC 2616), with some additional RFCs defining further, non-standard status codes. Microsoft IIS may also use non-standard decimal sub-codes to convey more specific details. In this case the status code 200 indicates that the requested resource exists.

The first digit of the status code specifies one of five response categories:

  • 1xx Informational: request received, continue processing,
  • 2xx Success: the action has been successfully received, understood and accepted,
  • 3xx Redirection: the client must perform further actions to fulfill the request,
  • 4xx Client Error: the request is syntactically incorrect or cannot be satisfied,
  • 5xx Server Error: The server failed to fulfill an apparently valid request.
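As a quick sanity check you can count how many responses fall into each status code directly from the shell. This is a minimal sketch that assumes the combined log format shown earlier, where the status code is the ninth field:

    # Count the occurrences of each status code in the log
    awk '{print $9}' access.log | sort | uniq -c | sort -rn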


Bytes transferred: “11631”. This is the number of bytes transferred. If you find values lower than the size of the requested files, it means the request was not completed and only partial data was sent to the client. Some User-Agents can download a file one piece at a time; each downloaded piece is recorded with a dedicated line in the web server log, so a series of "hits" whose total equals the size of the file means the download completed successfully. If the size of the file does not match the total downloaded over several requests by the same User-Agent, it may mean there are connection problems.

Referrer URL: “https://www.evemilano.com/”. This value represents the referrer page, i.e. the page the visitor was on before landing on your site. Not all user agents send this information. Usually it means that page contains a link to your site, or sometimes it is simply the page the user visited before reaching yours. This information is very useful for identifying which sites drive traffic to our pages (referrals).

User-Agent: “Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)”. Identification value of the "User-Agent". The User-Agent represents the software requesting the resource on your site. It is usually a browser, but it could also be a search engine bot, a link checker, an FTP client or an offline browser.

The User-Agent of the Google spider (Googlebot) is:
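If the string is not at hand, the desktop Googlebot identifies itself with a User-Agent along these lines (Google also uses variants, such as the smartphone Googlebot):

    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)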

How to merge many log files into one

I know a quick method on Bill's OS… that is, on Windows:

  • run the Command Prompt: Start > CMD,
  • navigate to the folder where you downloaded all the logs, for example by typing cd c:/temp/log at the prompt,
  • enter the for command (a sketch is shown after this list),
  • running the command from the folder that contains all the log files will generate a new file that contains them all. Analyzing the data will now be much easier.
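A minimal sketch of such a for loop in the Command Prompt, assuming the logs all end in .log and that merged.txt is the name you want for the combined file (both names are my assumptions, adjust them to your case):

    REM Append every .log file in the current folder into a single file.
    REM Inside a .bat script, write %%f instead of %f.
    for %f in (*.log) do type "%f" >> merged.txt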

Read the log in real time on the SHELL

If you have direct access to your web server you can read the log in real time on the SHELL. For quick checks, sometimes this method is useful for me, below I’ll explain which commands to use.

The tail command is used to monitor changes to a file in real time: it prints new lines to the screen as they are appended.

Try this command to see everything happening on the web server:
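A minimal sketch, assuming an Apache log in the default location mentioned above (adjust the path to your web server):

    # Follow the access log in real time
    tail -f /var/log/apache2/access.log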

Add the grep command to filter only Googlebot traffic:
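For example, again assuming the Apache path; piping tail into grep keeps only the lines that contain "Googlebot":

    # Follow the log and show only Googlebot requests
    tail -f /var/log/apache2/access.log | grep Googlebot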

With this command instead we simplify the display, showing only the URL requested by Googlebot and the status code.
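A sketch based on the fields mentioned in the PS below, with $7 for the URL and $9 for the status code (check the positions against your own log format):

    # Show only the requested URL and the status code for Googlebot hits
    tail -f /var/log/nginx/access.log | grep Googlebot | awk '{print $7, $9}'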

PS: play with the two values of print; for example, for the Nginx log I have to use {print $7, $9}, 7 for the URL and 9 for the status code (it depends on how your log file is formatted).


Considerations

Now that you are able to download the web server log, open it with Excel and understand the data, all that remains is to draw the appropriate conclusions. I'll leave you some ideas, along with a small shell sketch after the list.

  • What are the most visited pages by Googlebot?
  • Which pages are not visited by Googlebot?
  • What relationships are there among the most visited pages by Googlebot?
  • What relationships are there between the least visited pages by Googlebot?
  • How often does Googlebot return to the site?
  • How long does it take for Googlebot to receive pages from the web server?
  • How does Googlebot react to a major change on the website?
  • Are there any aggressive bots to block via Robots.txt?
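For instance, the first question can be answered directly from the shell; this is a sketch that assumes the combined log format, where the requested URL is the seventh field:

    # Top 20 URLs requested by Googlebot
    grep Googlebot access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20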
