One thing I’ve noticed working on the web is that a lot (really a lot) of webmasters completely ignore the existence and behavior of Google when developing a new website. In particular, I am referring to the mistake of leaving development sites open to search engine bots.
Letting a search engine scan and index the pages of a developing website can create quite a few problems:
- search engines record URLs that may be temporary, and these later turn into 404 errors
- search engines index work-in-progress pages, which may be incomplete, untranslated or even empty
- search engines don’t like to index duplicate or low-quality content
When the site goes live there will already be duplicate content in Google taken from the development site, there will be several 404 errors and the perceived quality of the website at SEO level will tend to decline.
For these reasons I always advise webmasters to close the development site to search engine bots. Let’s see some methods, sorted by effectiveness, to prevent crawling and indexing of pages in development.
- Set the login password via the Apache Virtual Host configuration file
- Set the login password via Apache .htaccess
- Set the login password via Nginx server block
- Noindex in HTTP header
- Meta robots noindex
- Disallow in Robots.txt
Set the login password via the Apache Virtual Host configuration file
The most efficient way to protect a directory on an Apache web server is through the Virtual Host configuration file.
Open your Virtual Host configuration file and insert a Directory block, with all its contents. In that block, replace:
- Directory "/var/www/html" with the directory you want to protect on your Apache web server
- /etc/apache2/.htpasswd with the full path to your .htpasswd file
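As a minimal sketch of what such a block can look like, the Python helper below renders a Directory section that enforces HTTP Basic authentication. The function name and the "Restricted Content" realm text are assumptions made for illustration; AuthType, AuthName, AuthUserFile and Require are standard Apache directives, and the two paths are the ones mentioned above.

```python
def apache_directory_auth_block(directory: str, htpasswd_path: str,
                                realm: str = "Restricted Content") -> str:
    """Render an Apache <Directory> block that asks for a username and password.

    The realm text is a hypothetical label shown in the browser's login prompt.
    """
    lines = [
        f'<Directory "{directory}">',
        "    AuthType Basic",                    # HTTP Basic authentication
        f'    AuthName "{realm}"',               # text shown in the login prompt
        f"    AuthUserFile {htpasswd_path}",     # file holding username:hash pairs
        "    Require valid-user",                # any user listed in that file may enter
        "</Directory>",
    ]
    return "\n".join(lines)


# Print the block for the paths used in this article.
print(apache_directory_auth_block("/var/www/html", "/etc/apache2/.htpasswd"))
```

Paste the printed block into the Virtual Host file and reload Apache for it to take effect.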
Set the login password via .htaccess
With the .htaccess file it is very easy to protect access to a folder with a password. The only drawback of this method is that the web server re-reads the .htaccess file every time it accesses its directory, so on very busy websites it could impact performance.
Make sure your web server is allowed to process the instructions contained in the .htaccess file. If the method does not work, check that the required directive is present in the file /etc/apache2/apache2.conf.
The method is known as htaccess password protection or htaccess authentication, and it works by uploading two files named .htaccess and .htpasswd to the directory you want to password protect. The .htaccess file holds the authentication directives; in it, replace /path/to/.htpasswd with the full path to your .htpasswd file.
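A hedged sketch of the .htaccess contents, written from Python so it can be generated per project. The helper name and the "Restricted Area" realm are assumptions. The four directives are standard Apache ones, but they are only honored if the server configuration allows authentication overrides for that directory (for example AllowOverride AuthConfig, which is my assumption about the setting involved, not something stated above).

```python
from pathlib import Path

# Template for the .htaccess file; the realm text is a hypothetical label.
HTACCESS_TEMPLATE = """AuthType Basic
AuthName "Restricted Area"
AuthUserFile {htpasswd_path}
Require valid-user
"""


def write_htaccess(protected_dir: str, htpasswd_path: str) -> Path:
    """Write a password-protecting .htaccess into protected_dir and return its path."""
    target = Path(protected_dir) / ".htaccess"
    target.write_text(HTACCESS_TEMPLATE.format(htpasswd_path=htpasswd_path),
                      encoding="utf-8")
    return target
```

For example, write_htaccess("/var/www/html", "/path/to/.htpasswd") would create /var/www/html/.htaccess pointing at the password file.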
Set the login password via Nginx server block
Nginx also allows you to password protect specific directories. To enable this function it is necessary to open and modify the configuration file of the site instance (Server Block).
- Add the two auth_basic lines inside the block
- Replace /etc/nginx/.htpasswd with the path to your .htpasswd file
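A rough sketch of the two lines in context, rendered with a small Python helper. auth_basic and auth_basic_user_file are the real Nginx directives; the helper name, the choice of a "location /" block and the realm text are assumptions for illustration, so adapt them to the block you actually want to protect.

```python
def nginx_basic_auth_location(htpasswd_path: str, location: str = "/",
                              realm: str = "Restricted Content") -> str:
    """Render an Nginx location block protected by HTTP Basic authentication."""
    lines = [
        f"location {location} {{",
        f'    auth_basic "{realm}";',                    # enables the login prompt
        f"    auth_basic_user_file {htpasswd_path};",    # file with username:hash pairs
        "}",
    ]
    return "\n".join(lines)


# Print the block for the path used in this article.
print(nginx_basic_auth_location("/etc/nginx/.htpasswd"))
```

After adding the directives to the server block, reload Nginx so the change is picked up.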
The .htpasswd file
At this point, whichever configuration you have chosen (Virtual Host, .htaccess or Nginx server block), you need to create and upload the .htpasswd file.
The .htpasswd file must contain the username and the password hash separated by a colon, for example:
test:$apr1$S8/.n6G5$r/3c81y3wR84GZ5EHDdKt1
This line allows the user “test” to access the protected area with the password “cicciopasticcio123”. The text “$apr1$S8/.n6G5$r/3c81y3wR84GZ5EHDdKt1” is the hashed version of the password “cicciopasticcio123”.
You will need to use an htpasswd generator to create a hash for a different password. Each line of the .htpasswd file holds one username and password pair, so feel free to add more lines.
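Before uploading the file it can help to sanity-check each line. The sketch below splits an entry into username and hash and names the hash scheme from its prefix. The function name and the scheme labels are my own; the prefixes ($apr1$, $2y$, {SHA}) are the ones Apache documents for its password files. It does not verify the password itself.

```python
# Hash prefixes used in Apache password files; the labels are descriptive only.
SCHEMES = {
    "$apr1$": "APR1-MD5",
    "$2y$": "bcrypt",
    "{SHA}": "SHA-1",
}


def parse_htpasswd_line(line: str) -> tuple[str, str, str]:
    """Return (username, hash, scheme) for one 'username:hash' entry.

    Raises ValueError when the line is not in that form.
    """
    username, separator, hashed = line.strip().partition(":")
    if not separator or not username or not hashed:
        raise ValueError(f"not a 'username:hash' entry: {line!r}")
    for prefix, scheme in SCHEMES.items():
        if hashed.startswith(prefix):
            return username, hashed, scheme
    return username, hashed, "unknown"


# The sample entry from this article.
print(parse_htpasswd_line("test:$apr1$S8/.n6G5$r/3c81y3wR84GZ5EHDdKt1"))
```

An "unknown" scheme is not necessarily an error, but it is worth a second look before you rely on the file.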
Upload the .htpasswd file via FTP to the directory you indicated in the configuration file or in the .htaccess file.
Now that you have finished the process, try to access the protected folder: the browser should show a popup asking for the username and password.
Noindex in HTTP header
Various directives can be passed to search engines via the HTTP header, just as with the meta robots tag; noindex is one example.
In the HTTP header it is possible to set different X-Robots-Tag directives, which I summarize below.
- all – there are no restrictions on indexing or serving. Note: this directive is the default and has no effect if it is explicitly listed
- noindex – do not show this page in search results and do not show a “cached” link in search results
- nofollow – do not follow the links on this page
- none – equivalent to noindex, nofollow
- noarchive – do not show the link to the cached version of the page in search results
- nosnippet – do not display a snippet in the search results for this page
- noodp – do not use metadata from the Open Directory project for titles or snippets shown for this page
- notranslate – do not offer translation of this page in search results
- noimageindex – do not index the images on this page
- unavailable_after – [RFC 850 date/time] do not show this page in search results after the specified date/time. The date/time must be specified in RFC 850 format
To check the HTTP header you can use an online header checker.
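The check can also be scripted. The sketch below, using only Python's standard library, reads the X-Robots-Tag header from a URL and reports whether its directives block indexing. The function names are hypothetical. For simplicity it assumes the header lists plain comma-separated directives, with no user-agent prefix such as "googlebot:" and no unavailable_after date.

```python
from urllib.request import Request, urlopen


def robots_directives(header_value: str) -> set[str]:
    """Lower-cased directives from one X-Robots-Tag value such as 'noindex, nofollow'."""
    return {part.strip().lower() for part in header_value.split(",") if part.strip()}


def blocks_indexing(header_value: str) -> bool:
    """True when the value contains noindex, or none (noindex plus nofollow)."""
    directives = robots_directives(header_value)
    return "noindex" in directives or "none" in directives


def fetch_x_robots_tag(url: str) -> list[str]:
    """Send a HEAD request and return every X-Robots-Tag value (empty list if absent)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return response.headers.get_all("X-Robots-Tag") or []


# Offline example on a literal header value.
print(blocks_indexing("noindex, nofollow"))
```

On a live site you would call fetch_x_robots_tag with the development URL and pass each returned value to blocks_indexing. Note that a password-protected site answers with a 401 error instead, which urlopen raises as an exception.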
Meta robots noindex
To prevent search engines from indexing the page and showing it in results, you can use the meta robots noindex tag.
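To confirm that a development page really carries the tag, a small parser built on Python's standard html.parser module can pull out the directives. The class and function names below are assumptions, and the sample markup is simply the usual form of a meta robots tag.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for part in attributes.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def meta_robots_directives(html: str) -> set[str]:
    """Return the lower-cased meta robots directives found in the HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


SAMPLE_PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'
print("noindex" in meta_robots_directives(SAMPLE_PAGE))
```

Running it against every template of the development site is a quick way to catch pages that were left indexable.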
Disallow in Robots.txt
The last method uses the robots.txt file. Remember that robots.txt does not prevent a page or folder from being indexed; it only prevents crawling by the bots it names.
A page blocked by robots.txt can still appear in the search results. If you have no way to implement the previous methods, however, this last method is all you are left with.
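As a sketch, a robots.txt that closes the whole development site to every bot needs only two lines. The Python snippet below feeds those lines to the standard urllib.robotparser module to confirm that a crawler respecting the file would not fetch a page. The dev.example.com URL is a placeholder, not a real site.

```python
from urllib.robotparser import RobotFileParser

# Rules that ask every bot to stay away from the entire site.
ROBOTS_TXT = """User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL: a compliant crawler is not allowed to fetch it.
print(parser.can_fetch("Googlebot", "https://dev.example.com/any-page"))  # prints False
```

This only tells you what a well-behaved crawler will do. As noted above, it says nothing about whether the URL can still end up in the index.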
Google shows pages blocked with Robots.txt
As you can see, even a Google page that is disallowed in robots.txt is still shown in the SERP. Take a good look at the description shown for the result:
A description for this result is not available because of this site’s robots.txt