Perplexity, OpenAI and Anthropic accused of illegally using articles from well-known publications to train their AI models
Reuters reports that some artificial intelligence (AI) companies are ignoring robots.txt, the file that tells web crawlers which parts of a website they may not access, in order to collect data from sites whose owners have opted out.
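For context, a robots.txt file lists crawler user-agent names alongside the paths they are barred from. A publisher trying to keep AI crawlers out might use something like the sketch below (the user-agent tokens shown are ones these companies have publicly associated with their crawlers; whether a given bot actually honors such rules is precisely what is in dispute):

```txt
# Block known AI crawlers from the entire site
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

# All other crawlers may index everything
User-agent: *
Allow: /
```

Compliance with robots.txt is voluntary: the file is a request, not a technical barrier, which is why ignoring it is possible in the first place.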
Here's What We Know
This has alarmed publishers, who claim that AI companies are taking their content without permission and using it to train their models.
One example is Perplexity, which describes itself as a "free AI search engine". It was accused of stealing Forbes articles and republishing them on its own platforms. Wired also reported that Perplexity ignores the robots.txt of Condé Nast and other publications to harvest their content.
According to Reuters, Perplexity is not the only such company. The agency received a letter from TollBit, a startup that helps publishers license their content to AI companies. The letter states that "AI agents from multiple sources have decided to bypass robots.txt to fetch content from websites."
TollBit does not name specific companies, but Business Insider reports that OpenAI and Anthropic, the developers of the ChatGPT and Claude chatbots respectively, also ignore robots.txt.
Publishers are concerned that AI companies are using their content without consent and without proper compensation. There is a quality concern as well: models trained on biased or inaccurate data can contribute to the spread of misinformation.
Some publications are already taking steps to protect their content. Forbes, for example, has blocked Perplexity from its website.
The robots.txt dispute highlights the growing tension between publishers and AI companies. A workable solution would have to give AI companies access to the data they need to train their models while still protecting publishers' interests.
Source: Reuters