Interacting with ThecaBot Crawler

ThecaBot is a web crawler operated by Theca that indexes and analyzes content across the web. Allowing ThecaBot to crawl your site improves its visibility in search results driven by Theca, which in turn drives more traffic to your site.

ThecaBot User Agent

The full user-agent string for ThecaBot is currently "Mozilla/5.0 (X11; Linux x86_64) ThecaBot/1.3.5 (+https://docs.theca.com/crawler)", but it may change as the crawler is updated. It is therefore recommended to use the simplified token ThecaBot when specifying directives for the crawler. The wildcard user-agent token (*) also applies to ThecaBot.

Controlling ThecaBot with robots.txt

The robots.txt file is a fundamental part of a website's structure, serving as a guide for web crawlers like ThecaBot on how to navigate through the site. Located at the root directory of a website, this text file sets the ground rules for what parts of the site can be accessed and indexed by crawlers. For example, before accessing the site https://www.example.com/products/test-product.html the file https://www.example.com/robots.txt is checked. By specifying Allow and Disallow directives, webmasters can control the visibility of certain pages or directories, thereby managing how their content is represented in search engine results. The robots.txt file aids in optimizing the crawling process, ensuring that essential pages are indexed while keeping private or irrelevant sections off the index, thus contributing to an effective SEO strategy.

tip

Robots.txt is checked once for each subdomain in each crawling instance. This can be an issue if, for example, images are served from another subdomain, such as cdn.example.com instead of www.example.com. Hence, all subdomains need to include a robots.txt file.

ThecaBot follows the standard directives used by other web crawlers, such as Crawl-delay and Disallow.

Crawl-delay

The Crawl-delay directive manages the rate at which ThecaBot accesses your site. ThecaBot's defaults are tuned to keep server load low, but you can limit the crawl rate further if needed. For example, to allow at most one request per second:

User-agent: ThecaBot
Crawl-delay: 1

Settings above 60 are interpreted as milliseconds by ThecaBot.
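If you want to sanity-check how such a directive is read, Python's standard-library robots.txt parser can parse the same rules. This is only a sketch of a generic parser; ThecaBot's own handling (including the millisecond interpretation above 60) may differ.

```python
import urllib.robotparser

# The robots.txt content from the example above.
robots_txt = """\
User-agent: ThecaBot
Crawl-delay: 1
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# crawl_delay() returns the delay declared for the given user agent.
print(rp.crawl_delay("ThecaBot"))  # 1
```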

Disallow

Utilize the Disallow directive to prevent ThecaBot from accessing specific pages or directories. Example:

User-agent: ThecaBot
Disallow: /private/
Disallow: /thumbnails/

Allow

The Allow directive is used to enable ThecaBot to access specific pages or directories within a disallowed path. This directive can be particularly useful if you want to block access to a large section of your site, but still allow crawling of certain pages within that section. Example:

User-agent: ThecaBot
Disallow: /images/
Allow: /images/product-images/

In the above example, ThecaBot is disallowed from accessing any pages or directories under /images/, except for /images/product-images/, which is explicitly allowed. The Allow directive makes it possible to have more granular control over what ThecaBot can access, ensuring that important pages are crawled and indexed even within largely restricted areas of your site.
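This precedence can be illustrated with Python's standard-library robots.txt parser. Note one caveat: urllib.robotparser applies rules in file order (first match wins) rather than by longest path, so the Allow line is placed before the Disallow line in this sketch; ThecaBot's own matching behavior is as described above.

```python
import urllib.robotparser

# The example rules, with Allow listed first because Python's stdlib
# parser applies the first matching rule, not the most specific one.
robots_txt = """\
User-agent: ThecaBot
Allow: /images/product-images/
Disallow: /images/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

blocked = rp.can_fetch("ThecaBot", "https://www.example.com/images/banner.png")
allowed = rp.can_fetch("ThecaBot", "https://www.example.com/images/product-images/shoe.jpg")
print(blocked, allowed)  # False True
```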

Prioritizing ThecaBot crawling with sitemaps

Sitemap files are also instrumental in helping ThecaBot efficiently navigate and index important pages on your site. By listing your sitemap in the robots.txt file, you can indicate the order in which pages should be crawled. Additionally, if pages cannot be found through navigation, a sitemap is the only way to expose them to ThecaBot. Sitemaps should adhere to the Sitemaps protocol standard.

Here is an example of how to specify the sitemap.xml file in robots.txt:

Sitemap: https://www.example.com/sitemap.xml

In this case, the sitemap is an XML file containing URLs. Other formats are also supported, including sitemap index files and RSS feeds. Here is an example of its content:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2023-10-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/products/</loc>
    <lastmod>2023-10-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://www.example.com/contact/</loc>
    <lastmod>2023-10-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>

Robots meta tags

Robots meta tags are snippets of code placed within the HTML of individual web pages, providing instructions to web crawlers like ThecaBot on how to interact with that particular page. These tags can control whether a page is indexed, if links on the page are followed, and how the content is displayed in search engine results. By specifying directives such as noindex, nofollow, and others within a robots meta tag, webmasters can effectively manage the visibility and indexing of their content on a page-by-page basis. While the robots.txt file lays down the general rules for crawling across your site, robots meta tags offer page-specific instructions, ensuring precise control over indexing and displaying content.

Robots meta tags should be placed within the head element. Multiple directives can be comma-separated. The value of the name attribute can be either robots (which applies to all crawlers) or ThecaBot (which applies to ThecaBot only).

Here are the content values supported by ThecaBot:

Noindex

Prevents ThecaBot from indexing the page.

<meta name="robots" content="noindex" />

Nofollow

Tells ThecaBot not to follow the links on the page.

<meta name="robots" content="nofollow" />
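The two directives above can also be combined in a single tag, and targeted at ThecaBot specifically by using ThecaBot as the name value:

```html
<head>
  <!-- name="ThecaBot" applies only to ThecaBot; name="robots" would apply to all crawlers -->
  <meta name="ThecaBot" content="noindex, nofollow" />
</head>
```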

X-Robots-Tag

Robots directives can also be applied to non-HTML resources via the X-Robots-Tag HTTP response header. A comma-separated list of directives can be included, for example: X-Robots-Tag: noindex, nofollow.
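As a minimal sketch of serving a non-HTML resource with this header, the following uses Python's standard library; the handler, file name, and PDF stub are illustrative only.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

class NoIndexHandler(BaseHTTPRequestHandler):
    """Toy handler that serves a PDF stub with an X-Robots-Tag header."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        # Comma-separated directives, just like the robots meta tag.
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.end_headers()
        self.wfile.write(b"%PDF-1.4")

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), NoIndexHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/whitepaper.pdf"
with urllib.request.urlopen(url) as resp:
    header_value = resp.headers["X-Robots-Tag"]
print(header_value)  # noindex, nofollow

server.shutdown()
```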