I recently conducted an experiment on my test website. I applied the noindex meta robots tag to every page of the site. Then, I prompted several AI search engines to extract specific information from the website and observed the responses.
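For reference, the tag I applied site-wide is the standard noindex meta robots tag, placed in the head of every page:

```html
<!-- Tells compliant crawlers not to include this page in their index -->
<meta name="robots" content="noindex">
```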
For this scenario, let’s consider my test website to be domain.in.
The prompt I used was: “What services does https://www.domain.in/ offer as a business? Please check the website directly before answering.”
Key Findings
- ChatGPT: Accessed the site’s content, as indicated by the ChatGPT-User/1.0 agent in my server logs (+https://openai.com/bot), and accurately quoted the requested information.
- Perplexity: Did not retrieve any content from the website, suggesting it honors the noindex meta tag. Its response explicitly stated that direct website access is unavailable and that no content from the test site appears in search results.
- Claude AI: Successfully obtained the required answers. The server logs showed the user agent Claude-User/1.0; +Claude-User@anthropic.com.
- Google AI Mode: Generated fabricated information unrelated to the actual site content. This indicates it primarily relies on Google’s search index during its query process.
- Deepseek AI: Could not access any content from the test website. Its response specified that it doesn’t browse the web directly and depends entirely on search results.
- Qwen AI: Managed to retrieve content from the website, but my server logs revealed it used a HeadlessChrome browser rather than an identifiable crawler user agent.
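The findings above came from matching User-Agent strings in my server access logs. A minimal sketch of that check in Python, assuming a common combined log format (the log lines below are illustrative samples, not real log data):

```python
import re

# Illustrative access-log lines in combined log format (not real log data)
SAMPLE_LOG = [
    '20.15.240.1 - - [01/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "ChatGPT-User/1.0; +https://openai.com/bot"',
    '160.79.104.1 - - [01/Jan/2025:10:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Claude-User/1.0; +Claude-User@anthropic.com"',
    '66.249.66.1 - - [01/Jan/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

# User-Agent substrings observed for AI assistants in this experiment
AI_AGENTS = ("ChatGPT-User", "Claude-User", "HeadlessChrome")

def ai_hits(lines):
    """Return (ip, user_agent) pairs for requests from known AI agents."""
    hits = []
    for line in lines:
        ua_match = re.search(r'"([^"]*)"$', line)  # last quoted field is the UA
        if ua_match and any(a in ua_match.group(1) for a in AI_AGENTS):
            ip = line.split()[0]  # first field is the client IP
            hits.append((ip, ua_match.group(1)))
    return hits
```

Running `ai_hits(SAMPLE_LOG)` flags the ChatGPT-User and Claude-User requests while ignoring the ordinary Googlebot line.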
This shows me that adding a noindex meta robots tag is not a reliable way to block web pages from AI search engines.
Next, on another test website, I added the following rules to robots.txt, which disallow all bots from crawling the site:
User-agent: *
Disallow: /
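A compliant crawler parses these rules before fetching anything. You can see what they evaluate to with Python's urllib.robotparser; the user-agent names here are the ones from my server logs, and the URL is the placeholder test domain:

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt above: disallow everything for all agents
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# Any agent asking about any URL on the site should get "not allowed"
print(rp.can_fetch("ChatGPT-User", "https://www.domain.com/"))          # False
print(rp.can_fetch("Claude-User", "https://www.domain.com/services"))   # False
```

A bot that respects robots.txt performs exactly this check and stops; a bot that ignores it simply never asks.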
After adding these rules, I used the prompt below to ask a question about this new test website.
Prompt: “What services does https://www.domain.com/ as a business provide? Please check the website now and then only let me know, as they have updated their website. What makes them different and why should I choose them?”
ChatGPT (specifically the ChatGPT-User agent, as seen in the server logs) was able to access my website and quoted the information exactly as asked in the prompt.
I also tested the same prompt in the other AI answer engines, but none of the rest were able to fetch any content from the website. See the findings below for more details.
Claude AI: It specifically said that it is unable to access the website directly because it appears to be blocked by robots.txt rules. This shows that it abides by robots.txt.
Perplexity AI: It wasn't able to get any information from the website, which implies it respects the robots.txt rules.
Deepseek: It wasn't able to get any information either. It gave the following response, with some loosely related information:
Based on the search results provided, I do not have specific information about the services offered by https://www.domain.in/, as this particular website was not included in the search results. However, I can provide a general overview of what a typical AI Search Optimization Agency might offer based on the industry trends and common services described in the search results.
This just confirms that Deepseek relies on search results from external search engines rather than fetching pages itself.
Google AI Mode: Wasn’t able to provide any information from the website.
Qwen AI: Wasn’t able to provide any information from the website.
This implies that even when you use a robots.txt Disallow rule, it is up to each AI search engine bot whether to respect it. So if you're serious about blocking, you should use a service like Cloudflare, which actively blocks these bots by identifying them through their IP addresses and other criteria.
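If you run your own server, one extra layer on top of a service like Cloudflare is to deny requests by User-Agent. Below is a minimal sketch as a Python WSGI middleware; the agent list is just the strings observed in this experiment, and since the header is trivially spoofable, treat this as a speed bump rather than a guarantee:

```python
# User-Agent substrings seen in this experiment (assumption: list is not exhaustive)
BLOCKED_AGENTS = ("ChatGPT-User", "Claude-User", "HeadlessChrome")

def block_ai_bots(app):
    """Wrap a WSGI app and return 403 for known AI crawler User-Agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(agent in ua for agent in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Regular visitors with browser user agents pass through to the wrapped application unchanged, while requests carrying one of the listed crawler strings get a 403 before the app ever runs.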