YouTuber Sues OpenAI For Scraping Transcripts

A YouTuber is seeking to bring a class action lawsuit against OpenAI, alleging that the company used millions of transcripts from YouTube videos to train its generative AI models without notifying or compensating the videos’ owners.

Attorneys for Massachusetts-based YouTube user David Millette filed a complaint on Friday in the U.S. District Court for the Northern District of California, alleging that OpenAI secretly transcribed videos from Millette and other creators to train the models that power ChatGPT, the company’s AI-powered chatbot platform, as well as other generative AI tools and products. According to the complaint, by gathering this data OpenAI violated copyright law and YouTube’s terms of service, which prohibit using videos for applications independent of the platform, and “profited significantly” from the creators’ work.

“Potential and existing users, who purchase subscriptions to access [OpenAI’s] AI products, find [OpenAI’s] AI products more valuable as they become more sophisticated through the use of training data sets,” the complaint states. It adds, however, that a large portion of the content in OpenAI’s training data sets comes from works that OpenAI copied without permission, credit, or payment.

Millette is requesting a jury trial and more than $5 million in damages on behalf of all YouTube users and creators whose data may have been swept up in OpenAI’s training. He is represented by the law firm Bursor & Fisher.

Generative AI models, including OpenAI’s, lack true intelligence. Fed an immense number of examples (such as voice recordings, movies, essays, and so on), models “learn” how likely data is to occur based on patterns, including the context of any surrounding data.
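
To make the idea of “learning how likely data is to occur” concrete, here is a minimal sketch in Python of a toy word-pair counter. It is an illustration only, not how OpenAI’s models are built: production systems use neural networks trained on vastly more data, and the function names and sample sentence below are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) as a relative frequency from the counts."""
    total = sum(counts[prev].values())
    if total == 0:
        # The preceding word never appeared in the training text.
        return 0.0
    return counts[prev][nxt] / total


# Hypothetical training text, standing in for a real data set.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_model(corpus)

# In this text, "the" is followed by "cat" twice and "mat" once.
print(next_word_probability(model, "the", "cat"))
```

The more text such a counter sees, the better its estimates become, which is one simple way to see why developers seek out large volumes of data such as video transcripts.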

Most models are trained on data from public websites and online data sets. Companies argue that fair use shields their practice of indiscriminately scraping data and using it to train commercial models. Many copyright holders disagree, however, and are suing to stop the practice.

As other data sources dry up, video transcripts have become a critical source of training data.

According to data from Originality.AI, more than 35 percent of the world’s top 1,000 websites now block OpenAI’s web crawler. In addition, a study by MIT’s Data Provenance Initiative found that roughly 25 percent of data from “high-quality” sources has been removed from the main data sets used to train AI models. The research company Epoch AI projects that, should the current trend of access restrictions continue, developers will run out of data to train generative AI models between 2026 and 2032.

According to a story published in April by The New York Times, OpenAI developed Whisper, its first speech recognition model, to transcribe audio from videos in order to gather more training data. The Times reported that an OpenAI team that included the company’s president, Greg Brockman, used Whisper to transcribe more than a million hours of YouTube videos. The transcripts were then used to train GPT-4, OpenAI’s text-generating and -analyzing model.

Some OpenAI employees discussed how this would go against YouTube’s policies, according to The Times.

In July, Proof News reported that companies including Anthropic, Apple, Salesforce, and Nvidia had trained generative AI models on a data set called The Pile, which contains subtitles from hundreds of thousands of YouTube videos. Many YouTubers whose subtitles were included in The Pile were unaware of this and had not given their consent. Apple subsequently issued a statement clarifying that it had no intention of using those models to power any AI features in its products.

YouTube’s parent company, Google, has also used transcripts to train its models.

Google broadened its terms of service (ToS) last year, in part to allow the company to tap more user data for building generative AI models. Under the previous ToS, it was unclear whether Google could use YouTube data to develop products outside of the video platform. The new terms remove that ambiguity by significantly loosening the restrictions.

We’ve contacted Google and OpenAI regarding the class action lawsuit, and we’ll update this article if they respond.

For OpenAI, the month has gotten off to a difficult start.

Elon Musk, the CEO of Tesla and X, filed a new lawsuit on Monday against Sam Altman, the CEO of OpenAI, accusing the company of abandoning its original nonprofit mission in favor of reserving some of its most advanced technology for commercial customers. Musk filed a lawsuit against OpenAI in February making the same accusations, but the new suit also accuses OpenAI of racketeering.