Researchers find that Meta's Llama 3.1 AI model can recall 42 per cent of the first Harry Potter book, raising copyright concerns
A recent study by researchers at Stanford, Cornell, and West Virginia University showed that Meta's Llama 3.1 (70B), released in July 2024, has memorised 42% of the first Harry Potter book well enough to reproduce 50-token passages with a probability of over 50%. In other words, the model doesn't just remember the plot: given the opening of a passage, it can reproduce entire chunks of the text nearly verbatim. For comparison, the older Llama 1 model had memorised only 4.4% of the book.
What happened
A group of researchers tested how much of various books the new Llama had "absorbed". It turned out that if a text is very popular (such as Harry Potter or The Hobbit), the AI can repeat large chunks of it. Lesser-known books fared differently: of Sandman Slim, for example, the model memorised less than 1%.
The researchers used a method that measures how confident the model is in each subsequent word, and for these passages that confidence was so high it was clear the model had definitely seen them before.
How it was measured
They analysed the probabilities the model assigns to subsequent tokens: if, after being given the first 50 tokens of a passage, the model assigns more than a 50% probability to reproducing the following tokens exactly, the passage was counted as memorised.
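The article doesn't include the researchers' code, but the idea can be sketched roughly as follows. This is a minimal illustration, assuming the Hugging Face transformers API and a model identifier chosen for the example; the paper's exact tokenisation and scoring details may differ.

```python
# Sketch of the token-probability test: given a 50-token prefix from a book,
# multiply the model's probabilities for the next 50 tokens; if that product
# exceeds 0.5, treat the passage as memorised.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-70B"  # assumed identifier; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def continuation_logprob(prefix_ids, target_ids):
    """Sum of log-probabilities the model assigns to target_ids right after prefix_ids."""
    input_ids = torch.cat([prefix_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits[0]               # (seq_len, vocab)
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    offset = prefix_ids.shape[0]
    total = 0.0
    for i, token_id in enumerate(target_ids):
        # the logits at position p predict the token at position p + 1
        total += log_probs[offset + i - 1, token_id].item()
    return total

def is_memorised(passage, prefix_len=50, target_len=50):
    ids = tokenizer(passage, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_len]
    target = ids[prefix_len:prefix_len + target_len]
    # product of per-token probabilities > 0.5  <=>  sum of log-probs > log(0.5)
    return continuation_logprob(prefix, target) > math.log(0.5)
```

The advantage of scoring probabilities rather than sampling outputs is that a single forward pass reveals how likely the model is to reproduce the passage, without having to generate text thousands of times.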
Why this is important
- Because books are copyrighted, and AI shouldn't just reproduce them like a printer.
- If it can recite Harry Potter from memory, it will soon be easy to generate entire books that are effectively copies under a new name.
- This is no longer learning; it's Ctrl+C and Ctrl+V, and the lawyers immediately went into "aha, you've been caught!" mode.
Why did this happen?
Because Meta trained this model on 15 trillion tokens, which is an enormous amount of data. The training set likely included not just books but also forums, fan sites, and reviews containing quotes, and the more often a text appeared in the dataset, the more strongly it influenced the model.
And now what?
If lawyers prove that the AI reproduces texts almost verbatim, it could mean big problems for Meta, especially because the model is open and anyone can check it. Closed language models (from OpenAI, Anthropic, and Google) may have similar problems, but that is much harder to prove. This makes it harder for Meta to defend itself on fair-use grounds: a court may treat the memorisation as evidence that the model creates derivatives of a large portion of the texts, and it becomes difficult to argue that the model learned "nothing more than patterns".
Ironically, the transparency that Llama's creators pride themselves on can now play against them.
In other words, this language model could now bring Meta real court cases, because it has "absorbed" texts it should not have been able to reproduce so accurately. On the one hand, courts may tighten the requirements for open-weight models: release the weights and you hand over the evidence against yourself. On the other hand, institutions and courts that value transparency may take note, and Meta and others can argue that openness is a public service and a form of accountability.
Source: understandingai.org