Robots.txt

Robots.txt is a text file with instructions for search engine crawlers. It defines which areas of a website crawlers are allowed to search. However, these are not explicitly named by the robots.txt file. Rather, certain areas are not allowed to be searched. Using this simple text file, you can easily exclude entire domains, complete directories, one or more subdirectories or individual files from search engine crawling. However, this file does not protect against unauthorized access.


Robots.txt is stored in the root directory of a domain. Thus it is the first document that crawlers open when visiting your site. However, the file does not only control crawling. You can also integrate a link to your sitemap, which gives search engine crawlers an overview of all existing URLs of your domain.


You can take a peek behind the curtain of any website by typing in any URL and adding: /robots.txt at the end.

Robots.txt is not an essential element to a successful website; in fact, your site can still function correctly and rank well without one.


The instructions (entries) in robots.txt always consist of two parts. In the first part, you define which robots (user-agents) the following instruction apply for. The second part contains the instruction (disallow or allow). "user-agent: Google-Bot" and the instruction "disallow: /clients/" mean that Google bot is not allowed to search the directory /clients/. If the entire website is not to be crawled by a search bot, the entry is: "user-agent: *" with the instruction "disallow: /". You can use the dollar sign "$" to block web pages that have a certain extension. The statement "disallow: /* .doc$" blocks all URLs with a .doc extension. In the same way, you can block specific file formats n robots.txt: "disallow: /*.jpg$".


Here is a pretty simple bash script to grab the output of robots.txt from some website:


Using robots.txt configurations to prevent Google Dorking:


One of the best ways to prevent Google dorks is by using a robots.txt file. Let’s see some practical examples.


The following configuration will deny all crawling from any directory within your website, which is pretty useful for private access websites that don’t rely on publicly-indexable Internet content.

You can also block specific directories to be excepted from web crawling. If you have an /admin area and you need to protect it, just place this code inside:

Restrict access to specific files:

To restrict access to specific file extensions you can use:

In this case, all access to .php files will be denied.



Why Use Robots.txt?


Robots.txt is not an essential element to a successful website; in fact, your site can still function correctly and rank well without one.


However, there are several key benefits you must be aware of before you dismiss it:

  • Point Bots Away From Private Folders: Preventing bots from checking out your private folders will make them much harder to find and index.

  • Keep Resources Under Control: Each time a bot crawls through your site, it sucks up bandwidth and other server resources. For sites with tons of content and lots of pages, e-commerce sites, for example, can have thousands of pages, and these resources can be drained really quickly. You can use robots.txt to make it difficult for bots to access individual scripts and images; this will retain valuable resources for real visitors.

  • Specify Location Of Your Sitemap: It is quite an important point, you want to let crawlers know where your sitemap is located so they can scan it through.

  • Keep Duplicated Content Away From SERPs: By adding the rule to your robots, you can prevent crawlers from indexing pages which contain the duplicated content.

You will naturally want search engines to find their way to the most important pages on your website. By politely cordoning off specific pages, you can control which pages are put in front of searchers (be sure to never completely block search engines from seeing certain pages, though).


Creating your robots.txt file correctly, means you are improving your SEO and the user experience of your visitors.


By allowing bots to spend their days crawling the right things, they will be able to organize and show your content in the way you want it to be seen in the SERPs.

33 views0 comments

Recent Posts

See All

LibreNMS

LibreNMS is an open source, powerful and feature-rich auto-discovering PHP based network monitoring system which uses the SNMP protocol. It supports a broad range of operating systems including Linux,

Vulmap – Web Vulnerability

Vulmap is a vulnerability scanning tool that can scan for vulnerabilities in Web containers, Web servers, Web middleware, and CMS and other Web programs, and has vulnerability exploitation functions.

Vigenere Cipher

Vigenere Cipher is a method of encrypting alphabetic text. It uses a simple form of polyalphabetic substitution. A polyalphabetic cipher is any cipher based on substitution, using multiple substitutio