OpenAI trained GPT-4 on transcribed YouTube videos - NYT

By: Bohdan Kaminskyi | 08.04.2024, 19:28


OpenAI used its Whisper speech recognition model to transcribe more than a million hours of YouTube videos to train its latest language model, GPT-4.

Here's What We Know

According to The New York Times, OpenAI had run out of quality training data as early as 2021. To solve this problem, the company developed the Whisper model specifically for transcribing videos, podcasts, and audiobooks.

The Times claims that OpenAI president Greg Brockman was personally involved in collecting clips from YouTube.

A spokesperson for the company said they use a variety of data sources, including publicly available data and data obtained through partnership agreements.

Google, the owner of YouTube, said the platform's terms of use prohibit unauthorised collection or uploading of content. The company is taking technical and legal measures to prevent such unauthorised use of data, a spokesperson for the tech giant said.

Meanwhile, Google has also used some YouTube content to train its own AI. However, the company emphasised that this was done under separate agreements with each content creator whose clips were involved.

The newspaper also reports that Meta has faced similar data shortages in training its AI systems, and allegedly considered using copyrighted material without permission.


Source: The New York Times, The Verge