Perplexity AI scraped data even from websites that explicitly forbade crawling

By: Viktor Tsyrfa | 04.08.2025, 20:17

Cloudflare has published a study showing that Perplexity AI crawled websites (downloading and analysing their data) even when their robots.txt files clearly prohibited automated access. Moreover, the system evaded blocks by changing its user agent (for example, impersonating Chrome on macOS) and routing traffic through different ASNs (autonomous system numbers), so-called "stealth scraping".

The activity was detected across tens of thousands of domains, with millions of requests per day, and Cloudflare was able to identify the bot using machine-learning models and network signals.

Perplexity is an AI-powered search engine positioned as a smarter alternative to Google, with a focus on conversational, dialogue-based search. It analyses the results it finds and immediately gives the user a summary, without the need to click on links. Google has since picked up on this trend and added its own Gemini to its search engine.

How Perplexity responded

The company's spokesperson, Jesse Dwyer, said that the allegation was a "hoax" and that the screenshots posted did not prove access to the content. Later, he even said that the bot in question did not belong to Perplexity.

History of suspicious behaviour

As early as 2024, Wired journalists and developer Robb Knight published findings that Perplexity ignored robots.txt by using undisclosed IP addresses and third-party crawlers. The company's CEO acknowledged the existence of such crawlers, but declined to say clearly whether the company would stop using them.

Is it legal?

The robots.txt file is a plain text file that lists the pages that search, advertising and other bots should not crawl. It has no mechanism to actually prevent those addresses from being accessed; it only provides recommendations. In this way, bots "understand" where personal or technical information that is not intended for analysis is located. Truly confidential information, however, cannot be hidden this way. Using different bots, IP addresses, redirects and user-agent substitution is also not prohibited. Perplexity's actions are completely legal, albeit unethical. Currently, there are no effective tools for making information public while preventing AI from accessing it. Either confidential information should be made available only after authentication, or it should be accepted that AI will learn from it and use it for its own purposes.
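To illustrate why robots.txt is only a recommendation, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name ExampleBot, the example.com address and the rules themselves are hypothetical and chosen purely for illustration; they are not taken from Cloudflare's study or from any real site. The sketch shows that the rules are matched against whatever user-agent string the client chooses to present, so a crawler that announces itself as a browser simply falls outside a rule written for its declared name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is banned everywhere,
# every other client is only asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical user-agent strings used for the comparison.
DECLARED_UA = "ExampleBot/1.0"
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
ARTICLE_URL = "https://example.com/articles/1"


def is_allowed(user_agent: str, url: str) -> bool:
    """Return what robots.txt *asks* of a client presenting this user agent."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The honestly declared bot is asked not to fetch the article.
    print("declared bot allowed:", is_allowed(DECLARED_UA, ARTICLE_URL))
    # The same client presenting a browser user agent matches only the "*" rules.
    print("browser UA allowed:  ", is_allowed(BROWSER_UA, ARTICLE_URL))
```

Note that the check happens entirely on the crawler's side: nothing on the server enforces the answer, which is why a site that wants real protection has to rely on authentication or network-level blocking instead.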

Reactions and consequences

The BBC is threatening legal action over the unauthorised scraping: it is demanding that the materials be removed, that compensation be paid and that access to its content be terminated. Amazon Web Services (AWS) has also launched an internal review of Perplexity for possible violations of the terms of use of its services.