Interacting with ThecaBot Crawler

ThecaBot is a web crawler operated by Theca that indexes and analyzes content across the web. Allowing ThecaBot to crawl your site improves its visibility in search results driven by Theca, which in turn drives more traffic to your site.

ThecaBot User Agent

The full user-agent string for ThecaBot is currently "Mozilla/5.0 (X11; Linux x86_64) ThecaBot/1.3.5 (+https://docs.theca.com/crawler)", but it may change as the crawler is updated. It is therefore recommended to use the simplified token ThecaBot when specifying directives for the crawler. The wildcard user-agent token (*) also applies to ThecaBot.

Controlling ThecaBot with robots.txt

The robots.txt file is a fundamental part of a website's structure, serving as a guide for web crawlers like ThecaBot on how to navigate through the site. Located at the root directory of a website, this text file sets the ground rules for what parts of the site can be accessed and indexed by crawlers. For example, before accessing the site https://www.example.com/products/test-product.html the file https://www.example.com/robots.txt is checked. By specifying Allow and Disallow directives, webmasters can control the visibility of certain pages or directories, thereby managing how their content is represented in search engine results. The robots.txt file aids in optimizing the crawling process, ensuring that essential pages are indexed while keeping private or irrelevant sections off the index, thus contributing to an effective SEO strategy.

tip

Robots.txt is checked once for each subdomain in each crawling instance. This can be an issue if, for example, images are served from another subdomain, such as cdn.example.com instead of www.example.com. Hence, all subdomains need to include a robots.txt file.

ThecaBot follows the standard directives used by other web crawlers, such as Crawl-delay and Disallow.

Crawl-delay

The Crawl-delay directive manages the rate at which ThecaBot accesses your site. ThecaBot's defaults are tuned to keep server load low, but you can limit the crawl rate further if needed. For example, to allow at most one request per second:

User-agent: ThecaBot
Crawl-delay: 1

Settings above 60 are interpreted as milliseconds by ThecaBot.
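If you want to sanity-check how such a directive is read, Python's standard-library robots.txt parser can parse the same rules. This is only a sketch of a generic parser; ThecaBot's own handling (including the millisecond interpretation above 60) may differ.

```python
import urllib.robotparser

# The robots.txt content from the example above.
robots_txt = """\
User-agent: ThecaBot
Crawl-delay: 1
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# crawl_delay() returns the delay declared for the given user agent.
print(rp.crawl_delay("ThecaBot"))  # 1
```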

Disallow

Utilize the Disallow directive to prevent ThecaBot from accessing specific pages or directories. Example:

User-agent: ThecaBot
Disallow: /private/
Disallow: /thumbnails/

Allow

The Allow directive is used to enable ThecaBot to access specific pages or directories within a disallowed path. This directive can be particularly useful if you want to block access to a large section of your site, but still allow crawling of certain pages within that section. Example:

User-agent: ThecaBot
Disallow: /images/
Allow: /images/product-images/

In the above example, ThecaBot is disallowed from accessing any pages or directories under /images/, except for /images/product-images/, which is explicitly allowed. The Allow directive makes it possible to have more granular control over what ThecaBot can access, ensuring that important pages are crawled and indexed even within largely restricted areas of your site.
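This precedence can be illustrated with Python's standard-library robots.txt parser. Note one caveat: urllib.robotparser applies rules in file order (first match wins) rather than by longest path, so the Allow line is placed before the Disallow line in this sketch; ThecaBot's own matching behavior is as described above.

```python
import urllib.robotparser

# The example rules, with Allow listed first because Python's stdlib
# parser applies the first matching rule, not the most specific one.
robots_txt = """\
User-agent: ThecaBot
Allow: /images/product-images/
Disallow: /images/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

blocked = rp.can_fetch("ThecaBot", "https://www.example.com/images/banner.png")
allowed = rp.can_fetch("ThecaBot", "https://www.example.com/images/product-images/shoe.jpg")
print(blocked, allowed)  # False True
```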

Prioritizing ThecaBot crawling with sitemaps

Sitemap files are also instrumental in helping ThecaBot efficiently navigate and index important pages on your site. By listing your sitemap in the robots.txt file, you can indicate the order in which pages should be crawled. Additionally, if pages cannot be found through navigation, a sitemap is the only way to expose them to ThecaBot. Sitemaps should adhere to the Sitemaps protocol standard.

Here is an example of how to specify the sitemap.xml file in robots.txt:

Sitemap: https://www.example.com/sitemap.xml

In this case, the sitemap is an XML file containing URLs. Other formats are also supported, including sitemap index files and RSS feeds. Here is an example of its content:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2023-10-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/products/</loc>
    <lastmod>2023-10-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://www.example.com/contact/</loc>
    <lastmod>2023-10-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>

Robots meta tags

Robots meta tags are snippets of code placed within the HTML of individual web pages, providing instructions to web crawlers like ThecaBot on how to interact with that particular page. These tags can control whether a page is indexed, if links on the page are followed, and how the content is displayed in search engine results. By specifying directives such as noindex, nofollow, and others within a robots meta tag, webmasters can effectively manage the visibility and indexing of their content on a page-by-page basis. While the robots.txt file lays down the general rules for crawling across your site, robots meta tags offer page-specific instructions, ensuring precise control over indexing and displaying content.

Robots meta tags should be placed within the head element. Multiple directives can be comma-separated. The value of the name attribute can be either robots (which applies to all crawlers) or ThecaBot (which applies to ThecaBot only).

Here are the content values supported by ThecaBot:

Noindex

Prevents ThecaBot from indexing the page.

<meta name="robots" content="noindex" />

Nofollow

Tells ThecaBot not to follow the links on the page.

<meta name="robots" content="nofollow" />
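The two directives above can also be combined in a single tag, and targeted at ThecaBot specifically by using ThecaBot as the name value:

```html
<head>
  <!-- name="ThecaBot" applies only to ThecaBot; name="robots" would apply to all crawlers -->
  <meta name="ThecaBot" content="noindex, nofollow" />
</head>
```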

X-Robots-Tag

Robots directives can also be applied to non-HTML resources via the X-Robots-Tag HTTP response header. A comma-separated list of directives can be included, for example: X-Robots-Tag: noindex, nofollow.
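As a minimal sketch of serving a non-HTML resource with this header, the following uses Python's standard library; the handler, file name, and PDF stub are illustrative only.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

class NoIndexHandler(BaseHTTPRequestHandler):
    """Toy handler that serves a PDF stub with an X-Robots-Tag header."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        # Comma-separated directives, just like the robots meta tag.
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.end_headers()
        self.wfile.write(b"%PDF-1.4")

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), NoIndexHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/whitepaper.pdf"
with urllib.request.urlopen(url) as resp:
    header_value = resp.headers["X-Robots-Tag"]
print(header_value)  # noindex, nofollow

server.shutdown()
```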