Interacting with ThecaBot Crawler
ThecaBot is a web crawler operated by Theca that indexes and aims to understand content across a wide range of sites. Allowing ThecaBot to crawl your site is beneficial: it leads to better visibility in search results driven by Theca, which in turn drives more traffic to your site.
ThecaBot User Agent
Although the full user-agent string for ThecaBot is "Mozilla/5.0 (X11; Linux x86_64) ThecaBot/1.3.5 (+https://docs.theca.com/crawler)", it may vary over time as the crawler is updated. Hence, it is recommended to use the simplified token ThecaBot when specifying directives for the crawler. The universal user-agent token (*) will also affect ThecaBot.
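For example, in a robots.txt file (covered in the next section), the token appears in the User-agent line of a directive group. The sketch below is illustrative only; the directives themselves are placeholders:

# Group addressed to ThecaBot by its simplified token
User-agent: ThecaBot
# ...ThecaBot-specific directives go here...

# Wildcard group; as noted above, this also affects ThecaBot
User-agent: *
# ...directives for all crawlers go here...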
Controlling ThecaBot with robots.txt
The robots.txt file is a fundamental part of a website's structure, serving as a guide for web crawlers like ThecaBot on how to navigate the site. Located at the root directory of a website, this text file sets the ground rules for which parts of the site can be accessed and indexed by crawlers. For example, before accessing the page https://www.example.com/products/test-product.html, the file https://www.example.com/robots.txt is checked. By specifying Allow and Disallow directives, webmasters can control the visibility of certain pages or directories, thereby managing how their content is represented in search engine results. The robots.txt file helps optimize the crawling process, ensuring that essential pages are indexed while private or irrelevant sections are kept out of the index, thus contributing to an effective SEO strategy.
Tip
Robots.txt is checked once for each subdomain in each crawling instance. This can be an issue if, for example, images are served from another subdomain, such as cdn.example.com instead of www.example.com. Hence, all subdomains need to include a robots.txt file.
ThecaBot follows the standard directives used by other web crawlers, such as Crawl-delay and Disallow.
Crawl-delay
The Crawl-delay directive manages the rate at which ThecaBot accesses your site. ThecaBot's default settings are designed to keep the load on your server minimal, but you can limit the crawl rate further if needed. For example, to limit crawling to once per second:
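User-agent: ThecaBot
# At most one request per second
Crawl-delay: 1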
Settings above 60 are interpreted as milliseconds by ThecaBot.
Disallow
Use the Disallow directive to prevent ThecaBot from accessing specific pages or directories. Example (the path shown here is illustrative):
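User-agent: ThecaBot
# Hypothetical path: block ThecaBot from everything under /private/
Disallow: /private/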
Allow
The Allow directive is used to enable ThecaBot to access specific pages or directories within a disallowed path. This directive can be particularly useful if you want to block access to a large section of your site, but still allow crawling of certain pages within that section. Example:
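User-agent: ThecaBot
Disallow: /images/
Allow: /images/product-images/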
In the above example, ThecaBot is disallowed from accessing any pages or directories under /images/, except for /images/product-images/, which is explicitly allowed. The Allow directive makes it possible to have more granular control over what ThecaBot can access, ensuring that important pages are crawled and indexed even within largely restricted areas of your site.
Whitelisting ThecaBot's IP Address
To ensure that ThecaBot can access your site without interruption, it's important to whitelist its IP address. You can find the current IP address of ThecaBot by performing a DNS lookup against crawler.theca.com (e.g., using nslookup crawler.theca.com or similar commands). Adding this IP to your firewall's whitelist will prevent any unintended blocking of the crawler, ensuring continuous and effective crawling of your site.
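For example, a lookup from a shell might look like the following sketch (output trimmed; 203.0.113.10 is a placeholder from the documentation address range, not a real crawler address):

$ nslookup crawler.theca.com
# Trimmed output; the address below is a placeholder
Name:    crawler.theca.com
Address: 203.0.113.10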
Prioritizing ThecaBot crawling with sitemaps
Sitemap files are also instrumental in helping ThecaBot efficiently navigate and index important pages on your site. By listing your sitemap in the robots.txt file, you can indicate in what order pages should be crawled. In addition, if pages cannot be found through navigation, sitemaps are the only way to expose them to ThecaBot. The sitemaps should adhere to the standard sitemap protocol.
Here is an example of how to specify the sitemap.xml file in robots.txt:
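Sitemap: https://www.example.com/sitemap.xml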
In this case, the sitemap is an XML file containing URLs. Other formats are also supported, including sitemap indices and RSS feeds. Here is the example content of sitemap.xml:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2023-10-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/products/</loc>
    <lastmod>2023-10-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://www.example.com/contact/</loc>
    <lastmod>2023-10-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
Robots meta tags
Robots meta tags are snippets of code placed within the HTML of individual web pages, providing instructions to web crawlers like ThecaBot on how to interact with that particular page. These tags can control whether a page is indexed, if links on the page are followed, and how the content is displayed in search engine results. By specifying directives such as noindex, nofollow, and others within a robots meta tag, webmasters can effectively manage the visibility and indexing of their content on a page-by-page basis. While the robots.txt file lays down the general rules for crawling across your site, robots meta tags offer page-specific instructions, ensuring precise control over indexing and displaying content.
Robots meta tags should be placed within the head element. Multiple directives can be comma-separated. The value of the name attribute can be either robots (which applies to all crawlers) or ThecaBot (which applies to ThecaBot only).
Here are the content values supported by ThecaBot:
Noindex
Prevents ThecaBot from indexing the page.
Nofollow
Tells ThecaBot not to follow the links on the page.
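Combining these values in a ThecaBot-specific meta tag could look like the following, placed inside the page's head element:

<meta name="ThecaBot" content="noindex, nofollow">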
X-robots-tag
Robots meta tag directives can also be applied to non-HTML pages with the HTTP header x-robots-tag. A comma-separated list of directives can be included, for example: x-robots-tag: noindex, nofollow.
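For example, a response serving a PDF document (a hypothetical non-HTML resource) could carry the header as follows:

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, nofollow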